Of major concern was the HA issue that caused DLR nodes to get stuck in ‘split-brain’ mode after 24 days of operation, and every 24 days after that! It also didn’t help that the previous version caused VMs to lose network connectivity if the pMAC of the DLR was being used as the MAC address for the default gateway.
Anyways, hopefully all the bugs have been ironed out and the new release is more stable!
So I read about this issue a week or so ago when this bug started doing the rounds in the VMware communities and The Register picked up on the issue…. I was planning to blog about it but it slipped my mind due to a busy end of month! >_<”
Anyways, VMware have sheepishly recognised the bug and produced a KB article about it: http://kb.vmware.com/kb/2090639
The bug affects VMs with Changed Block Tracking (CBT) turned on, specifically those VMs that have had their storage (i.e. a single vdisk) increased in size to more than 128GB.
The problem only presents itself when the QueryChangedDiskAreas() API call is executed. This call is commonly used by backup software to determine which parts of a VM’s vmdk file have changed since the last backup, in order to perform an incremental backup.
It seems that once a vmdk is increased to more than 128GB, the API call returns an inaccurate list of allocated VM disk sectors, so any sort of incremental backup could be erroneous and some changed blocks may not be captured during backup. Obviously this means that if you restore from an erroneous backup, you may experience data loss!
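To see why an inaccurate changed-sector list is so dangerous, here’s a toy sketch of how a backup tool consumes that kind of data. This is purely illustrative: the helper name apply_incremental and the tiny in-memory “disk” are my own inventions, not VMware APIs — but the failure mode is the same one described above.

```python
# Illustrative sketch only: a toy model of how a backup tool consumes the
# (offset, length) areas returned by a QueryChangedDiskAreas()-style call.
# apply_incremental is a hypothetical helper, not a VMware API.

def apply_incremental(backup, disk, changed_areas):
    """Copy only the reported (offset, length) areas into the backup image."""
    for offset, length in changed_areas:
        backup[offset:offset + length] = disk[offset:offset + length]
    return backup

# Full backup of a tiny 16-"sector" disk.
disk = list(b"AAAAAAAAAAAAAAAA")
full_backup = disk.copy()

# The VM then writes to sectors 4-7 and 12-13.
for i in [4, 5, 6, 7, 12, 13]:
    disk[i] = ord("B")

# Accurate CBT data captures every change -> restore matches the live disk.
good = apply_incremental(full_backup.copy(), disk, [(4, 4), (12, 2)])
assert good == disk

# The bug: an inaccurate list that omits the second changed area. The
# incremental "succeeds", but sectors 12-13 are silently missing.
bad = apply_incremental(full_backup.copy(), disk, [(4, 4)])
assert bad != disk  # restoring from this backup loses data
```

The nasty part is that nothing fails at backup time — the data loss only shows up when you try to restore.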
This is a known issue affecting VMware ESXi 4.x and ESXi 5.x and currently, there is NO resolution.
To work around this issue, VMware recommends that you disable and then re-enable CBT on the VM. The next backup after toggling CBT will be a full backup of the virtual machine.
The issue here is that in order to disable CBT, you need to power off your VM and ensure there are no snapshots attached to it…… quite a pain in the rear end!
Info on how to disable and enable CBT can be found here: http://kb.vmware.com/kb/1031873
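The preconditions are the annoying bit, so here’s a small sketch that models them. To be clear, this is not the vSphere API — just an illustrative check, assuming the constraints described above (VM powered off, no snapshots) are the only blockers:

```python
# Illustrative sketch only (not the real vSphere API): models the
# preconditions for toggling CBT described above and in KB 1031873.

def can_toggle_cbt(powered_on: bool, snapshot_count: int):
    """Return (ok, reason) for whether CBT can be reconfigured on a VM."""
    if powered_on:
        return (False, "power off the VM first")
    if snapshot_count > 0:
        return (False, "remove all snapshots first")
    # Remember: the next backup after toggling CBT will be a FULL backup.
    return (True, "OK to disable/re-enable CBT")

assert can_toggle_cbt(True, 0)[0] is False    # running VM: no go
assert can_toggle_cbt(False, 2)[0] is False   # snapshots present: no go
assert can_toggle_cbt(False, 0)[0] is True    # powered off, clean: go ahead
```

In practice a backup product may automate some of this for you, but the downtime requirement is what makes the workaround painful at scale.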
Also, I’m not too sure whether toggling CBT fixes the issue for good, or whether it will keep generating the same inaccurate info every time the vdisk blocks change and you try to run an incremental…. unfortunately there isn’t enough information out there yet!
I pity the admin who has to run daily fulls to combat this bug….. 128GB+ backups… ouch!
Fortunately none of my customers have a vdisk of that monstrous size so this shouldn’t affect many of them!
Those of you using NFS storage and planning to upgrade to the latest version of vSphere (5.5 U1), please hold off on your upgrades, as there is a bug in the code which is currently causing issues on paths to NFS volumes.
The bug causes intermittent loss of connectivity, which can lead to an “All Paths Down” error to your NFS storage! During the disconnects, VMs will appear frozen and the NFS datastores may be greyed out. This appears to impact all storage vendors and all environments on 5.5 U1 accessing NFS…..!!
Obviously the loss of a path will impact I/O from VMs to datastores…… and this can result in BSODs for Windows VMs and filesystems becoming read-only for Linux VMs (or even kernel panics)!
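If you suspect you’re being hit by this, the host’s vmkernel.log is the place to look. Here’s a quick illustrative sketch of scanning log lines for APD-style events — note that the sample log text is entirely made up by me, and the exact wording ESXi uses may differ from what this pattern matches:

```python
# Illustrative sketch only: scan vmkernel.log-style lines for APD events.
# The sample lines below are fabricated; real ESXi message wording may vary.
import re

APD_PATTERN = re.compile(r"All Paths Down|APD", re.IGNORECASE)

def find_apd_events(log_lines):
    """Return the log lines that look like APD (All Paths Down) events."""
    return [line for line in log_lines if APD_PATTERN.search(line)]

sample = [
    "2014-04-20T10:01:02Z vmkernel: NFS volume nfs-ds01 mounted",
    "2014-04-20T10:15:33Z vmkernel: nfs-ds01 has entered the All Paths Down state",
    "2014-04-20T10:15:40Z vmkernel: I/O failed on volume nfs-ds01",
]
hits = find_apd_events(sample)
assert len(hits) == 1
```

Repeated APD entries against your NFS datastores after moving to 5.5 U1 would line up with the symptoms described above.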
The recommendation at this point is not to upgrade to vSphere 5.5 U1 and to stay on vSphere 5.5 GA. If you have already upgraded to 5.5 U1, you may need to downgrade back to 5.5 GA.
Obviously the main reason for upgrading to 5.5 U1 was to patch the Heartbleed vulnerability within OpenSSL; VMware are informing customers not to upgrade, but instead to install security patches to address Heartbleed…. More info on this process can be found here: http://kb.vmware.com/kb/2076665