Wednesday, December 21, 2011

How to do an Online Virtual Connect firmware upgrade

Okay, this is a follow-up to my previous post ... I was finally able to find out on my own how to do this. The answer is in HP's white paper "HP Virtual Connect Firmware Upgrade Steps and Procedures". This is a must read for anyone being concerned with the VC firmware upgrade process, I will try to summarize the most important points here.

You must use the Virtual Connect Support Utility (VCSU). The current version is 1.60 and is available for download here.

It helps to understand how the VCSU does the upgrade: First it uploads the new firmware to all VC modules simultaneously. This phase is absolutely uncritical, because the VC modules continue working normally during the upload. If you use the default parameters it will then activate the new firmware by rebooting the VC modules one after the other in a controlled manner - and this is the process that really impacts the network availability of your hosts and VMs!
Why? The controlled reboot takes 20 or more seconds, and - of course - the VC module will not properly forward and receive network traffic during that time. However, the blade servers, resp. their NICs that are connected to this module are not properly disconnected during that time, i.e. they do not get a link down notification! If you use the default failover detection method for your virtual switches (Link state only) the hosts will continue using the up-links to the module that is just rebooting, and this results in a loss of network connectivity.

So, how do you cope with that? One possible work around is to use Beacon probing as the failover detection method for the virtual switches. But in my opinion this is not the best and easiest choice. No, the real answer is on page 13 of the white paper:
"For the customer environments where changing Network Failover Detection options or HA settings is not possible, utilizing VCSU manual firmware activation order (-of manual) is recommended. In this case, modules will be updated but not activated and the user will need to perform manual activation by resetting (rebooting) modules via OA GUI or CLI interface. This option will eliminate potential of up to 20 sec network outage that may occur on a graceful shutdown of VC Ethernet and FlexFabric modules."
Using the manual activation order (parameters "-oe manual" and "-of manual") ensures that the VCSU will not gracefully reboot the VC modules at all. You then need to do that on your own (just manual), by resetting the VC modules through the Onboard Administrator (OA). When you do a hard reset of a VC module the connected hosts will immediately get link down notifications, just as if the module suddenly fails or loses all its own up-links because the external switch failed. You should just wait about 5 minutes for the resetted module to get fully online before you reset the second one.

If your ESX(i) hosts are properly and redundantly configured you will notice only a minimal network interruption during this process. In my test it was just a single ping drop.

Yes, that's the whole secret of doing an online VC firmware upgrade! For me only one questions remains: Why is HP making it so hard to find this information? If you search hp.com for instructions on how to do this you will find tons of useless and contradicting information on this topic, and even their own Support engineers are not able to give a quick and right answer to the question. At least, one of them sent me a copy of the white paper (he could not just provide a link to it, because he was not able to find it on the HP pages...).

Thursday, December 15, 2011

HP Virtual Connect firmware update - can you do this online?

I don't know the answer to this question, but I'm trying to find this out ...

We have two HP c7000 enclosures with Virtual Connect FlexFabric modules to connect to external Cisco Ethernet switches and Brocade FC switches. Both enclosures are fully loaded with 8x BL620c G7 blade servers running ESXi 4.1 Update 2.
Right now we are still able to completely evacuate an enclosure if we want to do maintenance (mainly firmware upgrades) on it, because we have stretched two clusters over both enclosures that each have not more than 50% of their capacity used.

However, given our current VM growth rate we will soon reach a point where this will be no longer possible (without purchasing and deploying a third enclosure). So, I'm currently testing and looking for ways to do an online Virtual Connect firmware upgrade without interrupting network and SAN connectivity. With all the redundancy that is in the enclosure this should be possible, and an HP engineer I lately talked to confirmed that this is indeed possible using HP's Virtual Connect Support Utility (VCSU), and he pointed me to its manual for instructions.

I remember that I already tried this method a while ago. I don't know the firmware and tool versions anymore that I did this test with, but it was not very successful. Although I followed the instructions given I noticed ping timeouts for up to 15 seconds during the upgrade process (I was pinging the hosts VMkernel address).

I just started a thread in the VMTN forums to get some input from others. Has anyone done this successfully? Is there anything to check and configure that is not obvious before trying this? Please share your experience by posting to the VMTN thread or leaving a comment here. Thanks!

Once I have found a working method I will of course update this post!

Update (2011-12-21): I found it ... Read about it in my next post!

Thursday, November 17, 2011

ESXi-Customizer 2.6 and Tgz2Vib5 1.0

I just published the new version 2.6 of my ESXi-Customizer script.

What's new:
  • With this version you are able to optionally create an (U)EFI-bootable ISO file for the installation of ESXi 5.0. (U)EFI stands for (Universal) Extensible Firmware Interface. This is going to replace the current BIOS firmware interface on modern PCs. Please note that the original VMware ESXi 5.0 ISO is already UEFI-capable, the new version of my script is just able to keep this possibility in the customized ISO.
  • The new version includes an additional utility script called Tgz2Vib5. With this script you are able to convert an OEM.tgz-style driver package (for ESXi 5.0 only!) into a VIB file. That is the "official" VMware format for software packages - read more about it in this earlier post!
I'd like to encourage the developers of community supported ESXi 5.0 drivers to convert their packages into VIB format (using Tgz2Vib5) before publishing them! The VIB format has several advantages over the traditional OEM.tgz format:
  • You can add descriptive meta data (like vendor/author name, version and detailed description) to the driver package
  • Unlike an OEM.tgz file a VIB file can be easily installed into an already running ESXi 5.0 system by running the following commands inside a local or remote ESXi shell:
      esxcli software acceptance set --level=CommunitySupported
      esxcli software vib install -v VIB-URL

    (The host must not run any VMs at install time, because it needs to be rebooted after the installation.)
  • A VIB file can also be updated with a newer version without having to re-install the whole system. This can be achieved by running the following command inside a local or remote ESXi shell:
      esxcli software vib update -v VIB_URL


Sunday, October 30, 2011

vSphere 4.1 Update 2 released - What's in it for me (and you)

VMware released Update 2 for vSphere 4.1 on Oct 27th. It includes numerous bug fixes for the vCenter server and client (see VC resolved issues) and ESXi (see ESXi resolved issues).

 I will list some of the fixes here, because I personally welcome them very much, and I'm sure that others will feel the same:
  • The vSphere client performed badly with Windows 7, because of frequent screen-redraws when the Windows desktop composition feature was enabled. The only workaround was to disable desktop composition while running the vSphere client. This should be fixed now.
  • There are multiple fixes and enhancements to ESXi syslogging:
    • If ESXi fails to reach the syslog server while booting, it now keeps retrying it every 10 minutes.
    • Very long syslog messages (like these produced by the vpxagent ...) are no longer truncated or split into multiple lines. If you are using a third-party solution like Splunk for collecting syslog messages then you will certainly welcome this, because it is nearly impossible to handle split messages correctly with them.
  • But the most important issue that is resolved in Update 2 is this: "Virtual machine with large amounts of RAM (32GB and higher) loses pings during vMotion". Uh, what? VMs losing pings when being vMotioned? Yes, this can really happen (without Update 2), and I personally experienced this problem: When one of two clustered Microsoft Exchange 2010 VM with 48GB RAM was vMotioned it lost network connectivity for more than 15 seconds (between 20 and 30% of the vMotion progress) which triggered a cluster failover. We have not yet checked if this particular issue is really resolved now with Update 2, but VMware Support had put us off to it when we complained about that, so there is a good chance ...

Thursday, October 27, 2011

Update: ESXi 5.0 on HP G7 blades, now a Go!

About three weeks back I reported on Emulex firmware problems that prevented the use of ESXi 5.0 on HP G7 blade hardware. This was fixed now, somehow...

HP has now updated the advisory that describes the issue and published an updated firmware that fixes the VLAN handling problems with ESXi 5.0 if it is used together with the be2net driver 4.0.355.1.

Be sure that you read the release notes of the firmware! It looks like it is an emergency/workaround release that leaves many issues unresolved. A firmware version that you can really trust for production will probably be available mid-November.

Update (2012-12-09): HP and Emulex published the final version of the OneConnect firmware (4.0.360.15a) on Nov 19th. VMware's KB2007397 also lists the recommended drivers to use with this firmware for both ESXi 4.1 and 5.0.

Update (2012-03-09): HP has published yet another firmware update on March 5th. Download version 4.0.360.15b. The previous link has become invalid.

Update (2012-04-16): Please refer to my HP & VMware links page to find the download for the latest version of the firmware.

Wednesday, October 26, 2011

VMware finally released the Open Source Code of vSphere 5.0!

Great news! Today VMware finally made the vSphere v5.0 Open Source code archives available for download.

Why is that important?

Since the release of VMware's ESXi 5.0 (Aug 24, 2011) many people are asking for the development of drivers for hardware devices that are not supported by ESXi 5.0 out-of-the-box.

ESXi device drivers are based on Linux device drivers (which lead to the persistent misunderstanding that ESXi itself is based on Linux), but the stock Linux driver code must be modified in a specific way to be compatible with ESXi.

With past versions of ESXi (up to 4.1) it was possible to study and reproduce these required modifications, because VMware published the source code of the ESXi device drivers (the original Linux code plus their modifications). The reason for this is that most Linux drivers are licensed under the GPL (General Public License), and the GPL requires that derived works are also published under the GPL and their source code is made freely available (aka the "Copyleft" principle).

Now, that VMware also published the Open Source Code of ESXi 5.0 (including the device drivers that it contains) it will be possible (or at least much easier) to develop custom ESXi 5.0 drivers for devices that are not officially supported by VMware.


Wednesday, October 12, 2011

HP Virtual Connect profile not applied ...

When I recently rebooted one of our BL620c G7 blades with ESXi 4.1 installed on it I found that the server had suddenly lost network connectivity after the reboot.
A quick check on the console revealed that the Virtual Connect profile that was defined for that blade had not been applied to it. I realized that because the MAC addresses of the NICs had not been overwritten with the virtual addresses of the Virtual Connect profile.

I tried powering down and up the blade, re-assigning the Virtual Connect profile multiple times, all to no avail ... Then I had the idea that it might be related to the iLO-board of the blade, and - yes, indeed - after resetting the iLO3-board of the blade the Virtual Connect profile was properly applied and all was fine again.

While later looking on hp.com for some related information I stumbled over the Customer advisory c02820591 that described an issue with the Virtual Connect profile being lost upon an iLO3 reset. Not exactly the issue that I had, and the advisory also stated that this is fixed with iLO3 firmware version 1.20, and that is already installed on our iLOs. However, the advisory confirmed my assumption that the Virtual Connect profile is applied by the iLO-board.

So, if you have similar problems try resetting the iLO-board before you start pulling your hair out, or the blade out of the chassis ...