Saturday, June 11, 2011

Using Converter: Hot or cold clone?

Ulli Hankeln (aka continuum) recently started an interesting thread in the VMware communities asking what cloning method people prefer with VMware Converter.
Some time ago we started a fairly large server virtualization project (several hundred servers) and ask ourselves exactly this question. After some discussions we identified some evaluation criteria and judged both of two methods following these criteria:

1. Risk of data loss
This is clear point for cold clone, because it will make the machine absolutely unavailable and unchangeable during the conversion process. With a hot clone it is possible that data on the machine is changed during the conversion process that will then not be replicated to the virtual clone.

2. Operations complexity
This in contrast is a clear point for hot cloning. This process is fully automated through the Converter GUI. For cold cloning you need to boot the server from a CD. If you are lucky you can do that remotely be using HP iLO or similar remote control adapters. But you will always stumble over some servers that you will need to visit physically to mount the boot CD.

3. Operations success rate and possible errors
What can go wrong with a cold clone? You might not be able to even start it, because you cannot boot the server through iLO (or similar means) and do not have physical access to the system (at the time you need it) to mount the CD physically. If this is not a problem the second possible error is that the boot CD does not include the necessary drivers for the conversion: You need a working mass storage driver and a working network driver, but the CD only includes a limited set of drivers that might not be suitable for you if the server you are going to virtualize has quite modern hardware. It is possible though to rebuild the boot CD and add additional drivers to it, but this is an extra step and needs extra time and effort.
What can go wrong with a hot clone? You will always be able to start the hot cloning, but in some situations it will fail half way through or at the end. Possible errors are e.g. corrupt file systems or bugs in converter. In such a case troubleshooting is difficult, although you will find some hints in the VMware KB. Fortunately, this does not happen too often. In our experience in clearly less than 5% of all cases.
So, what's working better? It very much depends on your setup, the types of hardware you have, the OSs you are virtualizing and the applications you are running in them.
For us hot cloning wins this criteria, but we were not able to decide that until we already started our virtualization project and made good progress with it.

4. VMware support
In recent releases of Converter (including 5.0 being in beta right now) VMware has dropped support for cold cloning. I guess the reason for this is that they wanted to get rid of maintaining the boot CD and instead concentrate on improving the hot conversion process. And they really did that e.g. by adding incremental data synchronization after the first data copy run.
So, this is a win for hot cloning. Anyway, it is still possible to use cold cloning with the current releases of Converter, because it has support for importing the images of third party cloning tools (like Symantec Ghost and Acronis TrueImage). The reconfiguration process of Converter will make these images bootable with the virtual clone by injecting the necessary mass storage drivers into it.

Conclusions: In our project we have a "hot cloning first"-policy. To mitigate the risk of data loss we stop all unneeded services on the source machine (like application and database services) before starting the process. On machines that are used interactively (like Windows Terminal Servers) we disable logons.
Only if the hot clone fails we try a cold clone as the next step.

As stated earlier, your mileage may vary. What do you think? Like always comments are welcome.

Wednesday, June 8, 2011

VMware Converter 5.0 beta available

VMware has released a beta of the upcoming new version 5.0 of Converter. Check this community for details.
According to the preliminary release notes the new version will support the next version 5.0 of vSphere, but also earlier versions down to ESX 3.0 and vCenter 2.5.
But the most important and best news is that Converter 5 will properly handle disk alignments (see my previous post about the lack of this in earlier versions)!

Sunday, June 5, 2011

Two things to know about VMware Converter

VMware Converter is a tool to convert physical machines (either online or via an existing backup image) to VMware virtual machines. It is available as a free stand-alone version and in a vCenter-integrated version.
Like others we are using it a lot for virtualizing existing physical servers, just because it's free and/or comes pre-installed with vCenter. It also does a pretty good job and is well supported by VMware, but ...
you should be aware of two issues when using Converter more than occasionally.

1. Windows 2000 support
With the latest versions (Stand-alone converter 4.3 and vCenter 4.1-integrated) VMware dropped support for converting Windows 2000 machines (see the notes about supported guest OSs in the Release Notes). The really bad about this is that it does not just tell you this when you try to convert a Windows 2000 machine, but throws an error message about not being able to install the Converter agent on the target computer. It looks like it tries to install the Windows XP version of the agent which fails.
At first it looks like this is not a big problem, because older versions of the converter still support Windows 2000. If you run vSphere 4.1 you can use the stand-alone Converter 4.0.1 to convert Windows 2000 machines by connecting to the vCenter 4.1 server or directly to an ESX(i) 4.1 host. We have done this a lot and it always worked. However, if you carefully look at the Release Notes of Converter 4.0.1 you will notice that it only supports vSphere 4.0 as virtualization platform, but not vSphere 4.1.
We asked VMware support how we - as a vSphere 4.1 customer - are supposed to convert a Windows 2000 machine using Converter in a way that is fully supported by VMware. Here are the instructions (it's only one possible way, but you will get the idea):
a) Install an ESX(i) 4.0 host and add it to an existing vCenter 4.1 instance
b) Use the Stand-alone Converter 4.0.1 to connect to this ESX(i) 4.0 host and convert the Windows 2000 machine
c) Migrate the virtualized Windows 2000 machine to an ESX(i) 4.1 host (either cold or by VMotion)
That's a bit cumbersome, isn't it? Anyway, as stated above you can also use the stand-alone converter 4.0.1 to connect directly to vSphere 4.1. It is not officially supported, but seems to work quite well.

2. Disk alignment
If you care about storage performance then you want your VMFS volumes and your guest OS partitions to be aligned. There are a lot of good explanations about what disk alignment is and why it is important. My personal favorite is on Duncan Epping's blog.
Now, the big issue is that VMware Converter does not align the guest OS partitions in the target virtual machine. Although VMware is also pointing out the importance of disk alignment since a long time (see e.g. this ESX3 white paper) they have still - as of version 4.3 - not yet built this capability into their own Converter product.
So, if you are serious about disk performance and are planning for a large virtualization project you may want to consider alternatives to VMware Converter. There are other commercial products available that do proper disk alignment. One example is Quest vConverter.

Update (2011-09-01): Good news. Today VMware released Converter 5.0 which is now able to do disk alignments!

Sunday, May 29, 2011

Network troubleshooting, Part III: A real life example (Broadcom NICs dropping packets)

Recently we had a strange problem inside a Linux VM: a rsync-job that was used to copy data from a local disk to a NFS-mounted share reproducibly failed during data copy with a "broken pipe" error message.

Using the methods I wrote about in Part I and Part II of this little troubleshooting series (and some trial and error for sure) we found out that the issue would only occur if the VM was using a certain type of physical NIC, the HP NC371i (with a Broadcom BCM5709 chipset).
Later we also discovered corresponding VMKernel.log-messages like this one:

... vmkernel: 36:02:06:55.923 cpu5:6816883)WARNING: Tso: 545: TSO packet with one segment.>
... vmkernel: 36:02:06:56.325 cpu5:7129949)WARNING: Tso: 545: TSO packet with one segment.
... vmkernel: 36:02:06:57.128 cpu4:6816885)WARNING: Tso: 545: TSO packet with one segment.
... vmkernel: 36:02:06:57.128 cpu4:6816885)WARNING: LinNet: map_pkt_to_skb: This message has repeated 640 times: vmnic1: runt TSO packet (tsoMss=1448, frameLen=1514)

Enough evidence to open a support call with VMware... The outcome was that there is a known problem with the bnx2 driver (that is used for this type of NIC). It drops TSO packets that are below a certain minimum size it expects. The issue only occurs with some of the Broadcom chipsets that this driver can handle. The BCM5709 was not on the list before we opened our case, but it looks like it is also affected.

By the way, TSO stands for TCP segmentation offload and is used to offload the necessary segmentation of large TCP packets to the NIC's hardware. A good thing, if it works flawlessly.
The obvious workaround is to disable TSO by using the appropriate driver options. You could disable it on the host's physical Broadcom-NICs, but this would mean sacrificing the performance benefits of TSO for all VMs using these NICs.
We did not do that, because all other VMs did not have any problems with TSO. Instead we decided to disable TSO only inside the Linux VM that had this problem. This solved the issue for us.

Saturday, May 28, 2011

Network troubleshooting, Part II: What physical switch port does the pNIC connect to?

When you have found out what physical NIC (pNIC) a VM is actually using (see my previous post) you may want to check the external switch port that this pNIC connects to (Is it properly configured, what about the error counters?). Okay, what switch port do you need to check?
It is considered good data center practice to have every connection from every server's pNIC to the switch ports carefully documented. Do you? Do you trust the documentation? Is it up to date? If yes you are fine and can stop reading here...

If you want to be sure, and if you use Cisco switches in your data centers then there is a much more reliable way to track these connections: The Cisco Discovery Protocol (CDP). On Cisco devices this is enabled by default, and it periodically broadcasts interface information to the devices attached to its ports (like your ESX hosts).
By default ESX(i) (version 3.5 and above) will receive these broadcasts and display the information in it through the VI client. In the Hosts and Clusters-view select Networking in the Configuration tab of a host. This will display your virtual switches with their physical up-links (vmnic0, vmnic1, etc.). Now click on the little speech bubbles next to the Physical Adapters and a window like the following will pop up:

CDP information shown in the VI client

You can find a lot useful information here. The Device ID is the name of the Cisco switch. And Port ID shows the number/name of the switch module and the port number on that module. So you can tell your network admins exactly what switch port they need to check.

If CDP information is not available for a physical adapter the pop-up window will also tell you this. Possible reasons: You don't use Cisco switches or have CDP broadcasts disabled on them, or the ESX(i) host's interfaces are not in CDP listen mode.

For more detailed information on CDP and how to configure it in ESX see the VMware KB: Cisco Discovery Protocol (CDP) network information.

Thursday, May 26, 2011

Network troubleshooting, Part I: What physical NIC does the VM use?

If you encounter a network issue in a VM (like bad performance or packet drops) a good first question to ask yourself is: Is this issue limited to the VM or can it be pinned to one of the host's physical NICs?
So, you need to find out what physical NIC (pNIC) the VM is actually using. In most environment this is not obvious, because the virtual switch that the VM connects to typically has multiple physical up-links (for redundancy) that are all active (to maximize bandwidth).

Unfortunately, it is not possible to find this out by using the VI client. It does not reveal this information regardless whether you use standard or distributed virtual switches.
You need to log in to the host that runs the VM (see the HowTos section for instructions) and run esxtop.
Press n to switch to the network view, and you will see a picture like this one:

Network view of esxtop (click to enlarge)
Find the VM's display name in the USED-BY column and look to the corresponding TEAM-PNIC column then. In this example the VM FRASINT215 uses vmnic1.

Updated be2net driver fixes issues with G7 blades

When we started to deploy our HP ProLiant BL620c G7 blade servers we stumbled over some issues with the driver (be2net) for the built-in FlexNIC adapters. They are documented in the VMware KB:
We followed the recommendation in these articles and updated the be2net driver to version 2.102.554.0. However, we still experienced hangs of the ESXi host and network outages whenever the host was rebooted or had its dvS-connections reconfigured.
These hangs were accompanied by VMKernel.log-messages like this one:

... vmkernel: 10:06:11:06.193 cpu0:4153)WARNING: CpuSched: 939: world 4153(helper11-0) did not yield PCPU 0 for 2993 msec, refCharge=5975 msec, coreCharge=6374 msec,

After opening a support call with VMware we finally found out that these problems were caused by improper handling of VLAN hardware offloading by the be2net driver, and that they only occur when you are using distributed virtual switches (dvS) like we did.
So, after configuring the blade hosts with virtual standard switches (vSS) the problem went away.

Since then we were waiting for a fixed be2net-driver (from Emulex) to be able to return to dvS. We really did not want to abandon this option because it offers some benefits (load based teaming of the physical uplinks and Network I/O Control) over the standard switch.

Today, the waiting finally ended. Emulex has finished the fixed driver, it is available here:
VMware ESX/ESXi 4.x Driver CD for Emulex OneConnect 10Gb Ethernet Controller

Update (18. Jul 2011): In the meantime VMware made two new KB articles available that reference the problems described here and the new driver:
In the latter one it is also recommended to update the NIC's firmware. The current one (as of today) is available at HP as a bootable ISO file. Thanks to makö for pointing this out in this post's comments.