
Monday, September 10, 2012

The SBHVL project - Part 2: Backup and Disaster Recovery

This is part 2 of the SBHVL (Small Budget Hosted Virtual Lab) project post series (see also the Introduction and Part 1). This time it's about backing things up and recovering from a disaster.

Installing and maintaining a test lab takes significant effort and time. This is true for the initial installation and configuration of the lab infrastructure (our physical host), but also for every single test scenario, which usually consists of multiple VMs, software installed on them, configuration work, test data etc. All of this is definitely worth backing up, even though it's just a test lab...

Monday, September 3, 2012

The SBHVL project - Part 1: Basic setup, Networking and Security


This week I introduced the Small Budget Hosted Virtual Lab (SBHVL) project, and in this post I will describe the basic setup of the physical lab host, the network architecture and some related security aspects.

Wednesday, December 21, 2011

How to do an Online Virtual Connect firmware upgrade

Okay, this is a follow-up to my previous post ... I was finally able to find out on my own how to do this. The answer is in HP's white paper "HP Virtual Connect Firmware Upgrade Steps and Procedures". This is a must-read for anyone concerned with the VC firmware upgrade process; I will try to summarize the most important points here.

You must use the Virtual Connect Support Utility (VCSU). The current version is 1.60 and is available for download here.

It helps to understand how the VCSU does the upgrade: First it uploads the new firmware to all VC modules simultaneously. This phase is completely non-critical, because the VC modules continue working normally during the upload. With the default parameters it will then activate the new firmware by rebooting the VC modules one after the other in a controlled manner - and this is the process that really impacts the network availability of your hosts and VMs!
Why? The controlled reboot takes 20 or more seconds, and - of course - the VC module will not properly forward and receive network traffic during that time. However, the blade servers (or rather their NICs connected to this module) are not properly disconnected during that time, i.e. they do not get a link down notification! If you use the default failover detection method for your virtual switches (Link state only) the hosts will continue using the up-links to the module that is currently rebooting, and this results in a loss of network connectivity.

So, how do you cope with that? One possible workaround is to use Beacon probing as the failover detection method for the virtual switches. But in my opinion this is neither the best nor the easiest choice. No, the real answer is on page 13 of the white paper:
"For the customer environments where changing Network Failover Detection options or HA settings is not possible, utilizing VCSU manual firmware activation order (-of manual) is recommended. In this case, modules will be updated but not activated and the user will need to perform manual activation by resetting (rebooting) modules via OA GUI or CLI interface. This option will eliminate potential of up to 20 sec network outage that may occur on a graceful shutdown of VC Ethernet and FlexFabric modules."
Using the manual activation order (parameters "-oe manual" and "-of manual") ensures that the VCSU will not gracefully reboot the VC modules at all. You then need to do that on your own (hence "manual") by resetting the VC modules through the Onboard Administrator (OA). When you do a hard reset of a VC module the connected hosts immediately get link down notifications, just as if the module had suddenly failed or lost all of its own up-links because of an external switch failure. You should wait about 5 minutes for the reset module to come fully back online before you reset the second one.
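To illustrate, here is a minimal sketch of the VCSU call with manual activation. The "-oe manual" and "-of manual" parameters are the ones from the white paper; all other parameter names, the OA address, credentials and the firmware package file name are placeholders from my notes, so check "vcsu -h" and the white paper for the exact syntax of your VCSU version:

# Upload the firmware to all VC modules without activating it (sketch, syntax not guaranteed)
vcsu -a update -i <OA-IP-address> -u Administrator -p <OA-password> -l <vcfw-package>.bin -oe manual -of manual

After the upload has completed you activate the new firmware yourself by resetting the modules through the OA, as described above.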

If your ESX(i) hosts are properly and redundantly configured you will notice only a minimal network interruption during this process. In my test it was just a single ping drop.
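For reference, this is roughly how I verified the redundancy beforehand (a sketch assuming ESX(i) 4.x with access to the service console or Tech Support Mode; the VM address is a placeholder):

# On each host: every vSwitch should have at least two up-links (vmnics) that go
# through different VC modules, and all of them should show a link
esxcfg-vswitch -l
esxcfg-nics -l

# From a Windows workstation: keep a continuous ping running against a test VM
# while the modules are reset, and count the dropped replies
ping -t <IP-of-a-test-VM>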

Yes, that's the whole secret of doing an online VC firmware upgrade! For me only one question remains: Why is HP making it so hard to find this information? If you search hp.com for instructions on how to do this you will find tons of useless and contradictory information on this topic, and even their own support engineers are not able to give a quick and correct answer to the question. At least one of them sent me a copy of the white paper (he could not simply provide a link to it, because he was not able to find it on the HP pages...).

Friday, July 1, 2011

Mysterious port 903

I recently investigated what network ports are used by ESXi 4.1, because I had to compile the firewall requirements for a new deployment of ESXi hosts in a DMZ. There is a detailed source available for that in the VMware KB:
  • KB1012382: TCP and UDP Ports required to access vCenter Server, ESX hosts, and other network components
And there are numerous other sources available (even nice diagrams like this one). In most cases it is obvious that their authors referred to and relied on the above-mentioned official VMware KB source.

I'm usually not paranoid, but maybe I talked too much with the IT security guys (who tend to be extremely paranoid ;-)). Anyway, following the rule "Trust no one" I started looking at the network ports that are really used in our current production environment and compared them to the list in the KB article.

So I stumbled over port 903... According to the list both the vCenter server and any vSphere Client connect to an ESXi 4.1 host on that port for accessing the VM remote console. However, when I checked the network connections on the vCenter server and my Windows Desktop running the vSphere Client (with "netstat -an") I was not able to see any connection to an ESXi host's port 903, even when I opened multiple VM consoles. Instead it was obvious that port 902 is used for console connections.
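For what it's worth, this is roughly how I checked it on the Windows side, with one or more VM consoles open (a sketch):

rem On the vCenter server or on the desktop running the vSphere Client
netstat -an | findstr ":903"
rem -> no output at all
netstat -an | findstr ":902"
rem -> shows the established console connections to the ESXi hosts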

This made me really curious, so I logged on to an ESXi host (in Tech Support Mode) and checked the open network connections there. In ESXi you use the command "esxcli network connection list" for that, which produces output quite similar to that of netstat (with classic ESX the netstat command is still available in the service console).
This command also lists all ports that are opened in LISTEN mode, meaning there is some process waiting for connections on that port. But there was no listening process for port 903, which means that nothing would be able to connect to that port at all!
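As a sketch (assuming grep is available in the host's busybox shell, which it was in my case; note that ESXi 5.x and later use "esxcli network ip connection list" instead):

# On the ESXi 4.1 host in Tech Support Mode
esxcli network connection list | grep LISTEN
# -> port 902 shows up as listening, port 903 does not appear at all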

I opened a support request with VMware asking for clarification on the mysterious port 903 and was very curious about their answer. Of course, they quoted their own KB article first and insisted that the port was actually used for this and that, but finally - after escalating the issue to engineering - they admitted that "ESXi does not use port 903".
A request was also made to update the KB article accordingly. So, when you read this it might already have been corrected to no longer include port 903, but the numerous third party documents based on KB1012382 will take some more time to be updated ...

Bottom line: Information is good. Correct information is better. Try to verify it if it is really important to you.

Sunday, May 29, 2011

Network troubleshooting, Part III: A real life example (Broadcom NICs dropping packets)

Recently we had a strange problem inside a Linux VM: an rsync job that was used to copy data from a local disk to an NFS-mounted share reproducibly failed during the data copy with a "broken pipe" error message.

Using the methods I wrote about in Part I and Part II of this little troubleshooting series (and some trial and error for sure) we found out that the issue would only occur if the VM was using a certain type of physical NIC, the HP NC371i (with a Broadcom BCM5709 chipset).
Later we also discovered corresponding VMkernel.log messages like these:

... vmkernel: 36:02:06:55.923 cpu5:6816883)WARNING: Tso: 545: TSO packet with one segment.
... vmkernel: 36:02:06:56.325 cpu5:7129949)WARNING: Tso: 545: TSO packet with one segment.
... vmkernel: 36:02:06:57.128 cpu4:6816885)WARNING: Tso: 545: TSO packet with one segment.
... vmkernel: 36:02:06:57.128 cpu4:6816885)WARNING: LinNet: map_pkt_to_skb: This message has repeated 640 times: vmnic1: runt TSO packet (tsoMss=1448, frameLen=1514)

Enough evidence to open a support call with VMware... The outcome was that there is a known problem with the bnx2 driver (that is used for this type of NIC). It drops TSO packets that are below a certain minimum size it expects. The issue only occurs with some of the Broadcom chipsets that this driver can handle. The BCM5709 was not on the list before we opened our case, but it looks like it is also affected.

By the way, TSO stands for TCP segmentation offload and is used to offload the necessary segmentation of large TCP packets to the NIC's hardware. A good thing, if it works flawlessly.
The obvious workaround is to disable TSO by using the appropriate driver options. You could disable it on the host's physical Broadcom-NICs, but this would mean sacrificing the performance benefits of TSO for all VMs using these NICs.
We did not do that, because none of the other VMs had any problems with TSO. Instead we decided to disable TSO only inside the Linux VM that had this problem. This solved the issue for us.
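For the guest-side fix, this is essentially what it boils down to (a sketch for a typical Linux distribution; the interface name is an example, and how you make the setting persistent depends on your distribution):

# Inside the affected Linux VM
ethtool -k eth0             # show the current offload settings ("tcp segmentation offload: on")
ethtool -K eth0 tso off     # disable TSO on this interface (not persistent across reboots)
# Make it persistent via your distribution's network scripts, e.g. an ETHTOOL_OPTS entry or a post-up hook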

Saturday, May 28, 2011

Network troubleshooting, Part II: What physical switch port does the pNIC connect to?

When you have found out what physical NIC (pNIC) a VM is actually using (see my previous post) you may want to check the external switch port that this pNIC connects to (Is it properly configured, what about the error counters?). Okay, what switch port do you need to check?
It is considered good data center practice to have every connection from every server's pNIC to the switch ports carefully documented. Do you? Do you trust the documentation? Is it up to date? If so, you are fine and can stop reading here...

If you want to be sure, and if you use Cisco switches in your data centers then there is a much more reliable way to track these connections: The Cisco Discovery Protocol (CDP). On Cisco devices this is enabled by default, and it periodically broadcasts interface information to the devices attached to its ports (like your ESX hosts).
By default ESX(i) (version 3.5 and above) will receive these broadcasts and display the information contained in them through the VI client. In the Hosts and Clusters view, select Networking in the Configuration tab of a host. This will display your virtual switches with their physical up-links (vmnic0, vmnic1, etc.). Now click on the little speech bubbles next to the Physical Adapters and a window like the following will pop up:

CDP information shown in the VI client

You can find a lot of useful information here. The Device ID is the name of the Cisco switch. And Port ID shows the number/name of the switch module and the port number on that module. So you can tell your network admins exactly what switch port they need to check.

If CDP information is not available for a physical adapter the pop-up window will also tell you this. Possible reasons: You don't use Cisco switches or have CDP broadcasts disabled on them, or the ESX(i) host's interfaces are not in CDP listen mode.
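To check (or re-enable) the CDP mode of a vSwitch from the command line you can use esxcfg-vswitch, as described in the KB article referenced below (a sketch for ESX(i) 4.x from the service console or Tech Support Mode; vSwitch0 is an example):

# Show the current CDP mode of the vSwitch (down / listen / advertise / both)
esxcfg-vswitch -b vSwitch0
# Put the vSwitch (back) into listen mode if CDP has been turned off
esxcfg-vswitch -B listen vSwitch0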

For more detailed information on CDP and how to configure it in ESX see the VMware KB: Cisco Discovery Protocol (CDP) network information.

Thursday, May 26, 2011

Network troubleshooting, Part I: What physical NIC does the VM use?

If you encounter a network issue in a VM (like bad performance or packet drops) a good first question to ask yourself is: Is this issue limited to the VM or can it be pinned to one of the host's physical NICs?
So, you need to find out what physical NIC (pNIC) the VM is actually using. In most environments this is not obvious, because the virtual switch that the VM connects to typically has multiple physical up-links (for redundancy) that are all active (to maximize bandwidth).

Unfortunately, it is not possible to find this out by using the VI client. It does not reveal this information, regardless of whether you use standard or distributed virtual switches.
You need to log in to the host that runs the VM (see the HowTos section for instructions) and run esxtop.
Press n to switch to the network view, and you will see a picture like this one:

Network view of esxtop
Find the VM's display name in the USED-BY column and then look at the corresponding TEAM-PNIC column. In this example the VM FRASINT215 uses vmnic1.
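If you prefer to record this over time instead of watching it interactively, esxtop's batch mode can capture its counters to a CSV file for later analysis (a sketch; I have not checked exactly which of the network view's columns end up in the batch output):

# Capture 12 samples at 5-second intervals into a CSV file
esxtop -b -d 5 -n 12 > esxtop-capture.csv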

Updated be2net driver fixes issues with G7 blades

When we started to deploy our HP ProLiant BL620c G7 blade servers we stumbled over some issues with the driver (be2net) for the built-in FlexNIC adapters. They are documented in the VMware KB:
We followed the recommendations in these articles and updated the be2net driver to version 2.102.554.0. However, we still experienced ESXi host hangs and network outages whenever a host was rebooted or had its dvS connections reconfigured.
These hangs were accompanied by VMkernel.log messages like this one:

... vmkernel: 10:06:11:06.193 cpu0:4153)WARNING: CpuSched: 939: world 4153(helper11-0) did not yield PCPU 0 for 2993 msec, refCharge=5975 msec, coreCharge=6374 msec,

After opening a support call with VMware we finally found out that these problems were caused by improper handling of VLAN hardware offloading by the be2net driver, and that they only occur when you are using distributed virtual switches (dvS) like we did.
So, after configuring the blade hosts with virtual standard switches (vSS) the problem went away.

Since then we had been waiting for a fixed be2net driver (from Emulex) so that we could return to dvS. We really did not want to abandon this option, because it offers some benefits (load-based teaming of the physical uplinks and Network I/O Control) over the standard switch.

Today the waiting finally ended: Emulex has finished the fixed driver, and it is available here:
VMware ESX/ESXi 4.x Driver CD for Emulex OneConnect 10Gb Ethernet Controller
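For reference, this is roughly how you would check the currently loaded driver and apply the new one to an ESXi 4.x host (a sketch; the vmnic number, host name and bundle file name are placeholders, and the host should be in maintenance mode before patching):

# In Tech Support Mode: show driver name, driver version and firmware version of a FlexNIC
ethtool -i vmnic2

# From a machine with the vSphere CLI installed: install the offline bundle from the driver CD
vihostupdate.pl --server <esxi-host> --username root --install --bundle <be2net-offline-bundle>.zip
# Reboot the host afterwards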

Update (18. Jul 2011): In the meantime VMware made two new KB articles available that reference the problems described here and the new driver:
The latter also recommends updating the NIC's firmware. The current firmware (as of today) is available from HP as a bootable ISO file. Thanks to makö for pointing this out in this post's comments.