DHCP pool issues

26 10 2010

Thought I’d write about a situation with DHCP that I came across recently.

Our satellite office was designed to have a maximum of 50 staff. It has one VLAN set up for Cisco phones (which uses one DHCP pool) and one for staff computers. We sub-let some seating in this office, so there are also VLANs for the tenants.

The staff DHCP pool contains 100 addresses available for lease, which has been sufficient for the three years that we’ve had the office.

A few days ago, however, I received a call saying that “some staff could not access the network”. I obtained clarification, as definitions of “the network” vary greatly depending on which end user you’re talking to.

The difference this time was that “the network” actually meant the network, rather than just a file server or SharePoint intranet – the file server, email server and internet connectivity were all unavailable to some users.

I knew that no configuration changes had been made to the network in the last few days, so this wasn’t the cause (when something breaks, always check the last thing that was changed). I also verified that the tenants weren’t affected – so this was specific to the staff address range.

I logged in to the AD server remotely (verifying the internet / VPN connectivity in the process) and had a look at the DHCP server.

Immediately I noticed some blue exclamation marks against the “Staff” DHCP pool.

Right-clicking on the Scope, and selecting “Show Statistics” showed that 100% of the addresses had been leased.

This meant that when new devices attempted to join the network, or machines whose previous lease had expired requested a fresh one, there were no addresses left to hand out. Those machines instead assigned themselves an APIPA address (in the 169.254.x.x range), and therefore couldn’t access any resources.
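A quick way to confirm this symptom on an affected machine is to check whether its address falls in the APIPA range, 169.254.0.0/16. The snippet below is a minimal sketch of that check using Python’s standard ipaddress module; the two sample addresses are made up for illustration and are not from the office in question.

```python
import ipaddress

# APIPA (IPv4 link-local) addresses are self-assigned from 169.254.0.0/16.
APIPA_NETWORK = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address: str) -> bool:
    """Return True if the IPv4 address is a self-assigned APIPA address."""
    return ipaddress.ip_address(address) in APIPA_NETWORK


# Hypothetical addresses, as they might be read from ipconfig on two machines.
for addr in ("169.254.17.42", "10.20.30.40"):
    if is_apipa(addr):
        print(f"{addr}: APIPA address, no DHCP lease was obtained")
    else:
        print(f"{addr}: not APIPA, looks like a normal lease or static address")
```

If a machine reports an address in that range, the problem is almost certainly between the client and the DHCP server rather than with the file or email servers themselves.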

But what was the actual cause of this? There were still far fewer than 50 end users on the staff VLAN, and thus on that DHCP pool, so what had changed?

The short answer is smart phones. The number of smart devices on the network has grown massively. Whenever WiFi is enabled on these devices, the owner connects to the corporate WLAN, using a DHCP lease in the process.

Another related change is the usage of the satellite office: more staff from HQ are having meetings in the city, and going to the office for the rest of the day. So, although the number of permanent users hasn’t increased, the number of users “passing through” has.

So how did I alleviate this issue? This network was originally configured by a consultant, so I took a look at the DHCP configuration, and found that the lease length was 8 days. So the computers of these brief-visit users, their smart phones, and the smart phones of the permanent staff were all obtaining DHCP leases, which weren’t expiring for over a week. This depleted the pool very quickly.
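To see why the lease length matters so much, here is a rough back-of-the-envelope model. It assumes every visiting device is a different one that shows up for a single day, and that its address stays reserved until the lease runs out. The device counts are invented for illustration (I didn’t record the real figures); only the pool size of 100 and the lease lengths come from the situation described here.

```python
def leases_in_use(resident_devices: int, visiting_devices_per_day: int, lease_days: int) -> int:
    """Estimate the steady-state number of leased addresses.

    Resident devices always hold one lease each. Each day's visitors keep
    their addresses reserved for the full lease, so lease_days worth of
    visitor cohorts are outstanding at any one time.
    """
    return resident_devices + visiting_devices_per_day * lease_days


POOL_SIZE = 100          # from the post
RESIDENT_DEVICES = 70    # assumption: staff PCs plus their smart phones
VISITORS_PER_DAY = 6     # assumption: laptops and phones of staff passing through

for lease_days in (8, 1):
    used = leases_in_use(RESIDENT_DEVICES, VISITORS_PER_DAY, lease_days)
    state = "exhausted" if used >= POOL_SIZE else "ok"
    print(f"{lease_days}-day lease: about {used} of {POOL_SIZE} addresses in use ({state})")
```

With these made-up numbers the 8-day lease needs about 118 addresses and overruns the pool, while a 1-day lease needs about 76. The real figures will differ, but the shape of the problem is the same: the visitor term is multiplied by the lease length.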

I made two changes to alleviate this issue. Firstly, I increased the size of the DHCP pool. Admittedly, I was lucky to be able to do this – it’s not always possible, as a large number of IPs in the subnet may already be statically assigned (for servers etc).

Secondly, I reduced the lease length to 1 day.

This will increase DHCP traffic on the network, as leases expire much sooner and hosts have to renew them more often.
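To put a rough number on that extra traffic: DHCP clients normally attempt their first renewal halfway through the lease (the T1 timer in RFC 2131). The sketch below estimates renewals per day on that basis. The figure of 100 clients is an assumption chosen to match the original pool size, and the model ignores retries and brand-new leases.

```python
def renewals_per_day(lease_days: float, clients: int, t1_fraction: float = 0.5) -> float:
    """Estimate how many lease renewals the DHCP server handles per day.

    Each client renews once every (lease_days * t1_fraction) days, where
    t1_fraction defaults to 0.5, the standard T1 renewal point.
    """
    return clients / (lease_days * t1_fraction)


CLIENTS = 100  # assumption: every address in the original pool is leased

for lease_days in (8, 1):
    print(f"{lease_days}-day lease: ~{renewals_per_day(lease_days, CLIENTS):.0f} renewals per day")
```

That works out at roughly 25 renewals a day with an 8-day lease against 200 with a 1-day lease. Each renewal is only a small request and acknowledgement, so even the higher figure is a tiny load for a LAN.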

However, it lends itself well to the temporary nature of the staff / devices using the office, and will have a negligible effect on the user experience for the staff based in the office full-time.





vRanger Update

7 10 2010

I realised the reason for the lack of external disk in the drive listings when setting up a repository for vRanger. External disks have to be shared (and the service account under which vRanger runs given full access to the share).

Once shared, the share appears in the list, and you’re good to go.

I now have all my jobs backing up directly to external disk, which is a massive time saving over the previous scenario. Thank you Vizioncore!





vRanger Pro DPP error “Version string portion too short or too long”

2 10 2010

Got the above message after running a job for the first time in a version of Vizioncore vRanger upgraded to 4.5.3 DPP.

Couldn’t find the same error *in relation to vRanger* anywhere on the web, so raised a call with support. Thought I’d detail my findings here.

The exact scenario in my case:

A mixture of ESX OS versions, including 3.5, 4.1 and 4.2. Running a version of vRanger Pro around version 4.1.1. This had been working successfully for three of my virtual servers for around 6 months. Ideally, I’d have been backing up straight to external disk, ready for off-siting. However, although this worked fine on a number of external disk models in our other installation with an older version of vRanger, jobs failed repeatedly if I tried this on the installation in question.

Therefore, I had to run jobs direct to local disk, and then move the backup files to external disk after. Needless to say, this added a lot of time to the process, handholding to ensure sufficient free space on the disk / make sure the move process hadn’t failed, etc etc.

So having renewed my maintenance and bought some new licenses, I decided to install the latest version, to see if I’d now be able to back up straight to external disk. My chosen route was to uninstall the previous version, rather than upgrade, but to keep the existing database. This was upgraded during the install process, and generated no error messages.

However, upon running a job, I repeatedly got the error “Version string portion too short or too long”. This was regardless of whether I backed up to external or internal disks.

Vizioncore support was very quick with the initial response, which was to remove all machines from the inventory (which moves jobs relating to them to “Disabled Jobs”), and re-add them. At this point, I did notice that the ESX host in question added itself to the inventory under its short host name rather than its fully qualified domain name, whereas the others were added as machinename.domain.extension.

I didn’t investigate this however, and attempted to re-run the job. This resulted in the same error message as before.

I emailed support again, and had to chase this time, as there was no response for a few hours, and it was getting close to the end of the week (hence carrying on and writing this up on a Saturday!). I suggested uninstalling the app, and choosing a different SQL instance for the DB location. Their response was that this should work, because the root cause was a difference between the actual host name of the ESX server, and the name of the server according to the inventory.

This reminded me of the fact that the machine had been pulled in to the inventory as machine name without FQDN, so I tried backing up another machine.

This worked perfectly, so the error is now narrowed down to the specific machine, which I’m about to remove / re-add to the inventory and re-test.

So, look out for any differences in server name in the inventory if you experience this error.
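If you have a lot of hosts, eyeballing the inventory for the odd one out gets tedious. The sketch below shows the kind of check I mean: given a list of host names copied out of the inventory by hand, it flags any that are not fully qualified. The host names are hypothetical, and this is plain string handling rather than anything provided by vRanger itself. Note that it is only meant for names, so an entry added by IP address would slip through.

```python
def is_fqdn(name: str) -> bool:
    """Rough check: a name with at least one domain label after the host part."""
    labels = name.strip().rstrip(".").split(".")
    return len(labels) > 1 and all(labels)


def find_short_names(inventory_names):
    """Return the inventory entries that lack a domain suffix."""
    return [name for name in inventory_names if not is_fqdn(name)]


# Hypothetical inventory entries, typed in for illustration.
inventory = ["esx01.example.local", "esx02.example.local", "esx03"]

for name in find_short_names(inventory):
    print(f"{name}: not fully qualified, compare with the host's real DNS name")
```

Any name this flags is worth comparing against what the host itself believes its name to be.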

UPDATE: Found the actual cause for the difference in DNS name / entry in Inventory – ensure that the “Domain Name” value in “Configuration” –> “DNS and Routing” is filled in on the ESX host. In my case, this was the reason for the machine name not including the FQDN in the entry.

On a related note, I mentioned above that the VM backups are moved to external disk for off-siting. I have a TrueCrypt encrypted volume on each of the external disks, so that the VM backups are encrypted. vRanger did not list the drive letter mapped to this volume as a drive when I tried backing up directly to the volume. It worked fine, however, when I backed up to the internal disk.

I noticed that there is an “Encryption” option in this version of DPP, which I’ve just tested on an ESX server hosting 2 VMs, directly after carrying out a standard backup. The time taken was 22 minutes instead of 18. This is a 22% difference, so a fair increase, but I’m going to experiment with implementing this for all of my servers, to see what the results are. If I can back up direct to disk, in an encrypted format, that’ll be a massive time saving overall.
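For anyone wanting to repeat the comparison on their own jobs, the percentage is simply the extra time divided by the unencrypted time. A small sketch using the two timings from my single test run (one run is not much of a sample, so treat the figure as indicative only):

```python
def percent_increase(baseline_minutes: float, measured_minutes: float) -> float:
    """Return how much longer the measured run took, as a percentage of the baseline."""
    return (measured_minutes - baseline_minutes) / baseline_minutes * 100


# Timings from the test above: 18 minutes unencrypted, 22 minutes encrypted.
overhead = percent_increase(18, 22)
print(f"Encryption overhead: {overhead:.1f}%")  # prints "Encryption overhead: 22.2%"
```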





Walking the Network – the Cisco Discovery Protocol (CDP)

11 09 2010

Thought I’d write about “walking the network”, using the Cisco Discovery Protocol. This is not exactly high-level Cisco stuff, but useful for those with little experience of using Cisco gear.

Walking the network is a useful way of verifying that network diagrams are up to date, and diagramming the network if no documentation exists.

CDP is a layer 2 protocol running by default on most Cisco gear. It can be disabled, because of potential security risks (I’ll expand on that later). It reports the directly connected Cisco devices, showing which ports those devices are connected to and their IP addresses.
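If you’d rather not copy this information out by hand, the detailed neighbour listing (the “show cdp neighbors detail” command covered in the steps further down) is regular enough to parse. The sketch below pulls out the device name, IP address and the two port names. The sample text is made up to resemble typical IOS output; exact formatting varies between platforms and software versions, so treat the patterns as a starting point rather than something guaranteed to match your kit.

```python
import re

# Illustrative text only: invented to resemble typical IOS output.
SAMPLE_OUTPUT = """\
-------------------------
Device ID: SW2.example.local
Entry address(es):
  IP address: 192.168.1.2
Platform: cisco WS-C2960-24TT-L,  Capabilities: Switch IGMP
Interface: GigabitEthernet0/1,  Port ID (outgoing port): GigabitEthernet0/2
-------------------------
Device ID: R1.example.local
Entry address(es):
  IP address: 192.168.1.254
Platform: Cisco 2811,  Capabilities: Router Switch IGMP
Interface: GigabitEthernet0/24,  Port ID (outgoing port): FastEthernet0/0
"""


def parse_cdp_detail(text):
    """Return one dict per neighbour found in the detailed CDP listing."""
    neighbors = []
    # Entries are separated by a line of dashes.
    for block in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        device = re.search(r"^Device ID:\s*(\S+)", block, re.MULTILINE)
        if not device:
            continue  # skip empty or unrecognised blocks
        ip = re.search(r"IP address:\s*(\S+)", block)
        ports = re.search(
            r"^Interface:\s*([^,]+),\s*Port ID \(outgoing port\):\s*(\S+)",
            block,
            re.MULTILINE,
        )
        neighbors.append({
            "device_id": device.group(1),
            "ip_address": ip.group(1) if ip else None,
            # "Interface" is the port on the device you are logged in to;
            # "Port ID" is the port on the neighbour.
            "local_port": ports.group(1).strip() if ports else None,
            "remote_port": ports.group(2) if ports else None,
        })
    return neighbors


for n in parse_cdp_detail(SAMPLE_OUTPUT):
    print(f'{n["local_port"]} -> {n["device_id"]} ({n["ip_address"]}) on {n["remote_port"]}')
```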

My methodology for walking the network is:

1. Connect to a Cisco device, using a rollover cable, or telnet / SSH if you have login details.

2. Use “show cdp neighbors” to show a concise list of Cisco devices connected, and which port they’re connected to. For the purposes of documenting the network, I normally sketch it out on paper first, then transfer to electronic format once I have all the information I need.

3. Once you have device and port information for connected devices, you can then use “show cdp neighbors detail”. This shows much more information about each connected device, including its IP address.

Now you can add the IP address information to the devices on your sketch.

4. Once you’ve gathered the information you need from the first device, connect to the first directly connected device via Telnet, using the IP address information gained from step 3. Repeat the “show cdp neighbors” command and update your network sketch accordingly.

5. Repeat this for each device you discover until all devices are accounted for. Provided CDP is enabled on all devices, you’ll have an accurate diagram of the Cisco connections on your network.
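The steps above amount to a breadth-first walk: visit a device, note its neighbours, then visit each neighbour you haven’t seen yet. The sketch below shows that logic on its own. Rather than logging in to real kit, it reads from a small hand-written table standing in for the CDP output of each device, so the device names, ports and the get_neighbors callback are all hypothetical.

```python
from collections import deque

# Hypothetical CDP data: for each device, a list of
# (local port, neighbour name, neighbour port) entries.
CDP_TABLE = {
    "core-sw": [("Gi0/1", "access-sw1", "Gi0/24"), ("Gi0/2", "edge-rtr", "Fa0/0")],
    "access-sw1": [("Gi0/24", "core-sw", "Gi0/1"), ("Gi0/23", "access-sw2", "Gi0/24")],
    "access-sw2": [("Gi0/24", "access-sw1", "Gi0/23")],
    "edge-rtr": [("Fa0/0", "core-sw", "Gi0/2")],
}


def walk_network(start, get_neighbors):
    """Breadth-first walk from start; returns (devices seen, links found)."""
    visited = {start}
    queue = deque([start])
    links = set()
    while queue:
        device = queue.popleft()
        for local_port, neighbor, remote_port in get_neighbors(device):
            # Sort the two ends so a link seen from both sides is stored once.
            links.add(tuple(sorted([(device, local_port), (neighbor, remote_port)])))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, links


devices, links = walk_network("core-sw", lambda name: CDP_TABLE.get(name, []))
print(f"Found {len(devices)} devices and {len(links)} links")
for end_a, end_b in sorted(links):
    print(f"{end_a[0]} {end_a[1]} <-> {end_b[0]} {end_b[1]}")
```

On a real network the callback would be whatever you use to collect the neighbour list from each device, whether that is typing the command yourself or something automated. The visited set is what stops you going round in circles when devices list each other.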

Back to those security issues – CDP is a potential risk in two circumstances: firstly if an edge device has been compromised, and secondly if someone has already gained unauthorised access to an item of kit on your network internally. All kit should be physically secured, and logically secured with well protected passwords to prevent this.

However, in the circumstance that unauthorised access has been gained, a malicious user can use the exact techniques described above to walk the network and potentially gain access to other devices.