Thursday, January 26, 2012

Feature request: Add sudo to ESXi to make AD integration a success story

Recently I posted about (undocumented) improvements in the area of AD integration, but it looks like I missed a very important point:

You can log on to a local or remote console using an AD account that has administrative rights, but you won't have root privileges in this session, i.e. you cannot edit configuration files, restart services, etc. To gain root rights you need to use the su command, but that means that you still need to know and enter the password of the root user! From a compliance standpoint this is not acceptable, because the whole point of AD integration is that each VMware administrator uses a personal AD account for administration and does not even know the root password, so that each change to the system can easily be traced back to a personal account. (For emergencies, e.g. when AD authentication is not available, you still need someone who knows the root password or has it written down on a piece of paper in a sealed envelope.)

The easiest way to achieve this would be to use the sudo command in the ESXi shell to run commands in root context without the need to know root's password. This is common practice when managing Unix/Linux servers. Now the point is: sudo used to be available in ESX, but it is not available in ESXi.

So I have a simple feature request for VMware: Add sudo to ESXi! It is the missing piece that would make AD integration a success story, finally.

If you agree and are also bothered by this, please vote for this feature request in the VMware Community forums, where I opened a thread for it. Thank you all for voting and commenting, and special thanks to Masa, who brought this to my attention in the comments of my above-mentioned post!

Wednesday, January 25, 2012

How to use ESXi 5 as an NTP server - OR - How to permanently add custom firewall rules?

Recently my attention was caught by a question posted to the VMware Community forums that sounds odd at first sight: Is it possible to configure ESXi 5.0 to act as an NTP server?

I wondered: why would you want to do this? On the one hand it is not recommended to use ESXi for anything other than the task it was designed for: being a hypervisor. On the other hand it is not recommended to run a VM as an NTP server, because exact timekeeping can be quite a challenge in VMs as they do not have a real hardware clock. So, should you run a physical box just for NTP? Small shops that have reached 100% virtualization run only ESXi on their remaining physical servers. So I can understand people considering an exception and wanting to run an ESXi host as an NTP server. It is a very lightweight service anyway ...

Now back to the question ... and the answer is: Yes, it is. In fact it is very easy to do, because once you have configured ESXi 5.0 to act as an NTP client it will automatically act as an NTP server as well! The NTP daemon (/sbin/ntpd) does both at the same time, and its configuration file (/etc/ntp.conf) even allows any other machine to query it by default. There is only one hurdle: the ESXi 5.0 firewall.
By default it blocks the port for incoming NTP queries (UDP port 123). We need to create a custom firewall extension to open that port. KB2005304 explains how to do this. Basically you need to create a custom XML configuration file in the directory /etc/vmware/firewall, e.g. /etc/vmware/firewall/ntpd.xml with the following contents:

<!-- Firewall configuration information for NTP Daemon -->
<ConfigRoot>
  <service>
      <id>NTP Daemon</id>
      <rule id='0000'>
          <direction>inbound</direction>
          <protocol>udp</protocol>
          <porttype>dst</porttype>
          <port>123</port>
      </rule>
      <enabled>false</enabled>
      <required>false</required>
  </service>
</ConfigRoot>

(Take care when you copy or modify this: The XML tags are case sensitive!)

Then load the new configuration by running the following command inside an ESXi shell:
  esxcli network firewall refresh
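
If you prefer to stay in the shell, the refresh can be combined with enabling and checking the rule there. This is only a minimal sketch: it assumes that the ruleset ID is the <id> value from the XML file above ("NTP Daemon"), and the function name enable_ntp_ruleset is made up for illustration.

```shell
# Sketch: refresh the firewall, then enable and inspect the custom ruleset.
# Assumption: the ruleset ID is the <id> value from ntpd.xml ("NTP Daemon").
enable_ntp_ruleset() {
    ruleset="${1:-NTP Daemon}"
    # Re-read the XML files in /etc/vmware/firewall
    esxcli network firewall refresh || return 1
    # Switch the ruleset on (the XML file ships it with <enabled>false</enabled>)
    esxcli network firewall ruleset set --ruleset-id "$ruleset" --enabled true || return 1
    # Show the rules of this ruleset; inbound UDP port 123 is what we expect
    esxcli network firewall ruleset rule list --ruleset-id "$ruleset"
}
```

Paste the function into the ESXi shell and call enable_ntp_ruleset once the XML file is in place.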

After that you can see the custom firewall rule in the firewall properties dialog of the vSphere client:

Custom "NTP Daemon" firewall rule
Enable the rule, and you are done ...
... until the next reboot of the host, because, as another KB article puts it, "User defined xml firewall configurations are not persistent across ESXi host reboots". That KB article also includes a workaround: Put the XML file on a shared datastore and modify the /etc/rc.local boot script to copy the file to the correct location on every reboot.
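
A rough sketch of what that boot-time step could look like, written as a small function so it can be tried out by hand first. The datastore name "datastore1" and the function name restore_fw_rule are placeholders, not taken from the KB article.

```shell
# Sketch of the KB work-around: restore the rule file at boot time.
# Assumption: "datastore1" is a placeholder for your shared datastore.
restore_fw_rule() {
    src="${1:-/vmfs/volumes/datastore1/ntpd.xml}"
    dst_dir="${2:-/etc/vmware/firewall}"
    # Copy the saved XML file back into the firewall directory
    cp "$src" "$dst_dir/" || return 1
    # Make the firewall pick up the restored file
    esxcli network firewall refresh
}
```

In /etc/rc.local itself you would simply append the cp and esxcli lines with your own paths.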

This works, but I personally consider this an ugly hack, because the modification is not inherent in the system but introduces a dependency on an external resource (the datastore). So I created a VIB file that you can install on ESXi and that will permanently add the XML file to the system.
Run the following commands inside an ESXi shell to install the VIB file:

   esxcli software acceptance set --level CommunitySupported
   esxcli software vib install -v http://files.v-front.de/fwenable-ntpd-1.2.0.x86_64.vib

The first command is needed for ESXi to accept the custom VIB, because it does not include a trusted signature. The second command will download and install the VIB file. (Note: you can also download the file with a browser, store it on a local datastore and reference the local file with the install command.)
The installation will not require the host to be in maintenance mode and it will be immediately effective without the need to reboot the host! It will also automatically reload the firewall rules, so the only step left is to enable the rule in the vSphere client.
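
To double-check the result from the shell, something like the following should do. It is a sketch under the assumption that the installed VIB shows up under a name containing "fwenable-ntpd" (derived from the file name); check_ntp_vib is a made-up function name.

```shell
# Sketch: confirm the acceptance level and that the VIB is installed.
# Assumption: the installed VIB name contains "fwenable-ntpd" (from the file name).
check_ntp_vib() {
    # Should report CommunitySupported after the first command above
    esxcli software acceptance get || return 1
    # Prints the matching line if the VIB is installed; grep fails otherwise
    esxcli software vib list | grep -i "fwenable-ntpd"
}
```

If the second command prints nothing and returns a non-zero exit code, the VIB is not installed.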

By the way, I created this VIB file with a new and improved version of my TGZ2VIB5 script that I am currently working on. Once I have finished this new version and made it available here I will also post a detailed description of how I created the VIB file.

Tuesday, January 24, 2012

Top VMware and Virtualization Blog voting 2012 now open

Just a short note: Eric Siebert has just opened this year's voting for the Top VMware and Virtualization Blogs. This blog is listed in the categories "Independent Blogger" and "New Blog" (and among "all" of course). Make yourself heard and vote here!

Wednesday, January 18, 2012

Hard to find HP tools: The Offline Array Configuration Utility (ACU)

If you have ever looked for a solution to a specific problem or the download page for a certain tool on www.hp.com then you probably know: Searching (and finding) something is a pain on these pages, and the more desperately you need it the longer it will take you ...
So maybe I will even make a series of "Hard to find HP tools" posts. Anyway I will start with the Offline ACU tool today.

So, what do you need this tool for? I had this challenge before, and I was reminded of it when I came across this VMware Community forums post: Imagine you have an HP based ESXi host with VMs running on local disks attached to a Smart Array RAID Controller. You have run out of disk space and decide to add an additional hard disk to the server. Instead of creating a new (unprotected) RAID volume on this single disk you prefer to expand an existing RAID volume with it. This will give you more disk space and keep the current RAID protection level. How do you do that?
No problem, if you had Windows (or Linux) running directly on the box, because HP made available the Array Configuration Utility (ACU) for these operating systems. It will allow you to do the RAID expansion online while the OS is running. However, for ESXi this tool is not available as an online version.
This is why you need to use the Offline ACU tool. This is just a bootable CD with Linux and the Linux ACU tool on it. So, you need to schedule a downtime for the host (and the VMs running on it) and reboot with that CD to make the required changes to your RAID volumes. Not online, but better than nothing ...

You can find the download link to the current version of the HP ProLiant Offline Array Configuration Utility on my HP & VMware links page (in the General section).

Once you have successfully expanded your RAID volume (and booted into ESXi again) you just need to do the same with the VMFS datastore that resides on it. Please note that since vSphere 4.0 you can grow a VMFS datastore online, and you do not need to use VMFS extents. Choose "Increase..." from the datastore's properties menu:



Friday, January 13, 2012

Undocumented parameters for ESXi 5.0 Active Directory integration

Since vSphere version 4.1 it is possible to integrate an ESXi host into a Microsoft Active Directory (AD). After the host is joined to the domain you can assign permissions to AD groups and users by connecting directly to the host with the vSphere client.
Instructions on how to do this (with ESXi 5.0) are available, e.g., here in the VMware Online Documentation.

I first looked at AD integration when vSphere 4.1 was released and found one really annoying drawback that ruled it out for our environment: When an ESXi 4.1 host is joined to a domain it will automatically (and repeatedly!) look up an AD group called "ESX Admins", and as soon as it finds this group it will grant it Administrator permissions on the ESXi host. The real problem here is that the name ("ESX Admins") of this AD group is hard coded and cannot be configured.
This may be a nice feature for small environments: you just need to create this group, add the necessary people and you are done. But if you think about an enterprise environment of a large company with lots of different sites, IT teams and vSphere installations, but only one Active Directory, you cannot assume that all ESXi hosts in this company are managed by the same group of people.

When vSphere 5.0 was released I looked at the release notes and documentation to find out if this drawback was removed, but I did not find any positive information. Tests I did also showed that an ESXi 5.0 host behaves the same way, looks up the "ESX Admins" group and adds it with Administrator permissions.

However, recently I stumbled over the following when browsing the advanced configuration parameters of an ESXi 5.0 host:
Configuring the "ESX Admins" group
Yes, with ESXi 5.0 it is possible to change the name of the AD group that is automatically added by setting the advanced configuration option Config.HostAgent.plugins.hostsvc.esxAdminsGroup. You can even completely disable this functionality by setting the option Config.HostAgent.plugins.hostsvc.esxAdminsGroupAutoAdd to false.
I searched for this again in the VMware documentation and the Knowledge Base, but did not find it mentioned anywhere. So it looks like this is completely undocumented at this time, but it works as expected (I could not resist trying this out immediately)!
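
The same two options can probably also be set from the ESXi shell instead of the vSphere client. The following is an untested sketch: it assumes that vim-cmd's hostsvc/advopt commands accept these Config.HostAgent options on ESXi 5.0 (the syntax is known from other options of that family), and "Site1 ESX Admins" as well as the function names are made up for illustration.

```shell
# Sketch: set the two advanced options from the ESXi shell.
# Assumption: vim-cmd hostsvc/advopt accepts these hostd options on ESXi 5.0.
set_esx_admins_group() {
    group="$1"
    # Change the name of the AD group that is looked up automatically
    vim-cmd hostsvc/advopt/update Config.HostAgent.plugins.hostsvc.esxAdminsGroup string "$group" || return 1
    # Read the option back to confirm the new value
    vim-cmd hostsvc/advopt/view Config.HostAgent.plugins.hostsvc.esxAdminsGroup
}

disable_esx_admins_autoadd() {
    # Stop the host from adding the group automatically at all
    vim-cmd hostsvc/advopt/update Config.HostAgent.plugins.hostsvc.esxAdminsGroupAutoAdd bool false
}
```

For example, set_esx_admins_group "Site1 ESX Admins" would point the host at a site-specific group.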

Wednesday, December 21, 2011

How to do an Online Virtual Connect firmware upgrade

Okay, this is a follow-up to my previous post ... I was finally able to find out on my own how to do this. The answer is in HP's white paper "HP Virtual Connect Firmware Upgrade Steps and Procedures". This is a must-read for anyone concerned with the VC firmware upgrade process. I will try to summarize the most important points here.

You must use the Virtual Connect Support Utility (VCSU). The current version is 1.60 and is available for download here.

It helps to understand how the VCSU does the upgrade: First it uploads the new firmware to all VC modules simultaneously. This phase is not critical at all, because the VC modules continue working normally during the upload. If you use the default parameters it will then activate the new firmware by rebooting the VC modules one after the other in a controlled manner, and this is the step that really impacts the network availability of your hosts and VMs!
Why? The controlled reboot takes 20 or more seconds, and of course the VC module will not properly forward and receive network traffic during that time. However, the blade servers, or rather their NICs that are connected to this module, are not properly disconnected during that time, i.e. they do not get a link down notification! If you use the default failover detection method for your virtual switches (Link state only) the hosts will continue using the up-links to the module that is just rebooting, and this results in a loss of network connectivity.

So, how do you cope with that? One possible workaround is to use Beacon probing as the failover detection method for the virtual switches. But in my opinion this is not the best and easiest choice. No, the real answer is on page 13 of the white paper:
"For the customer environments where changing Network Failover Detection options or HA settings is not possible, utilizing VCSU manual firmware activation order (-of manual) is recommended. In this case, modules will be updated but not activated and the user will need to perform manual activation by resetting (rebooting) modules via OA GUI or CLI interface. This option will eliminate potential of up to 20 sec network outage that may occur on a graceful shutdown of VC Ethernet and FlexFabric modules."
Using the manual activation order (parameters "-oe manual" and "-of manual") ensures that the VCSU will not gracefully reboot the VC modules at all. You then need to do that on your own (hence "manual") by resetting the VC modules through the Onboard Administrator (OA). When you do a hard reset of a VC module the connected hosts will immediately get link down notifications, just as if the module had suddenly failed or lost all its own up-links because the external switch failed. Wait about 5 minutes for the reset module to come fully online before you reset the second one.
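
A VCSU call with manual activation might look roughly like the sketch below. Only "-oe manual" and "-of manual" are taken from the white paper quoted above; the remaining options (-a, -i, -u, -p, -l) are assumed from the VCSU user guide and should be checked against your VCSU version. The function name and all argument values are placeholders.

```shell
# Sketch: upload the firmware with VCSU but leave the activation to you.
# Assumption: -a/-i/-u/-p/-l follow the VCSU user guide; verify for your version.
vc_upload_without_activation() {
    oa_ip="$1"; oa_user="$2"; oa_pass="$3"; fw_file="$4"
    # -oe/-of manual: update the Ethernet and FC modules without rebooting them
    vcsu -a update -i "$oa_ip" -u "$oa_user" -p "$oa_pass" -l "$fw_file" \
         -oe manual -of manual
}
```

After the upload has finished, reset the modules one at a time through the OA as described above.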

If your ESX(i) hosts are properly and redundantly configured you will notice only a minimal network interruption during this process. In my test it was just a single ping drop.

Yes, that's the whole secret of doing an online VC firmware upgrade! For me only one question remains: Why is HP making it so hard to find this information? If you search hp.com for instructions on how to do this you will find tons of useless and contradictory information on this topic, and even their own support engineers are not able to give a quick and correct answer to the question. At least one of them sent me a copy of the white paper (he could not just provide a link to it, because he was not able to find it on the HP pages...).

Thursday, December 15, 2011

HP Virtual Connect firmware update - can you do this online?

I don't know the answer to this question, but I'm trying to find this out ...

We have two HP c7000 enclosures with Virtual Connect FlexFabric modules to connect to external Cisco Ethernet switches and Brocade FC switches. Both enclosures are fully loaded with 8x BL620c G7 blade servers running ESXi 4.1 Update 2.
Right now we are still able to completely evacuate an enclosure if we want to do maintenance (mainly firmware upgrades) on it, because we have stretched two clusters over both enclosures that each have not more than 50% of their capacity used.

However, given our current VM growth rate we will soon reach a point where this will no longer be possible (without purchasing and deploying a third enclosure). So, I'm currently testing and looking for ways to do an online Virtual Connect firmware upgrade without interrupting network and SAN connectivity. With all the redundancy that is in the enclosure this should be possible, and an HP engineer I recently talked to confirmed that this is indeed possible using HP's Virtual Connect Support Utility (VCSU), and he pointed me to its manual for instructions.

I remember that I already tried this method a while ago. I no longer know which firmware and tool versions I used for this test, but it was not very successful. Although I followed the instructions given, I noticed ping timeouts of up to 15 seconds during the upgrade process (I was pinging the hosts' VMkernel addresses).

I just started a thread in the VMTN forums to get some input from others. Has anyone done this successfully? Is there anything to check and configure that is not obvious before trying this? Please share your experience by posting to the VMTN thread or leaving a comment here. Thanks!

Once I have found a working method I will of course update this post!

Update (2011-12-21): I found it ... Read about it in my next post!