Virtual Infrastructure best practices
[Updated: 8-11-2009 10:00]
Lately I keep receiving questions from colleagues regarding virtual infrastructure design using VMware products. So I decided to sum up the best practices I use when designing a new virtual infrastructure. Some of these best practices are based on numbers and calculations, others are pretty obvious. Nevertheless you would be surprised how many environments I’ve encountered where even the most basic best practices have NOT been met.
So hereby my list of best practices on:
- ESX(i);
- vCenter;
- Licensing;
- Storage;
- Networking;
- Virtual machines.
If you have additions or new insights please reply.
ESX(i)
Hardware Compatibility List (HCL)
First of all, check all hardware against the VMware HCL (http://www.vmware.com/go/hcl). This is a very basic step which should be included in every implementation but is omitted on numerous occasions. You would be very surprised how many customers I encounter where the purchased hardware is not listed on the HCL. The most painful case was at a customer where we lost the deal because our price was €15.000,- too high. It turned out that the competition had offered two HP servers directly connected to NetApp storage, but a direct connection was not supported according to the VMware HCL. The competition had to make amends and offer a complete fibre network, worth €30.000,-, for free.
Hardware assistance
Before you start the ESX installation, remember to switch on Intel VT and the XD bit in the BIOS. If you switch Intel VT on after the VMware installation, you cannot run 64-bit operating systems unless you reinstall vSphere.
Storage
Another thing to remember before you start the VMware ESX installation is to disconnect the fibre channel, iSCSI or NFS storage to prevent the installation from reformatting the existing VMFS datastores and losing all virtual machines.
Partitioning
If you’re using ESX, ensure your disk partitioning is correct and ready for the future. We regularly get questions from colleagues on how to extend ESX partitions, like swap, because they hadn’t taken future needs into account. Therefore I recommend the following partitioning (updated for vSphere 4):
Mount point | Type | Size | Comments
/ | ext3 | 5120 MB | The root partition stores the ESX system. If this partition runs out of free space, the ESX host will most likely crash, so it is very important to prevent this.
swap | swap | 1600 MB | The swap partition is used to swap memory pages when there is no more physical memory for the service console. Keep in mind future needs like service console agents for backup or monitoring.
/var | ext3 | 4096 MB | The var partition stores most system logs. A separate var partition provides dedicated log storage space (/var/log) while protecting the root partition from being filled by log files. By default the var partition is part of the root partition.
/home | ext3 | 2048 MB | The home partition is created to prevent the root partition from filling up. By default the home partition is part of the root partition. Service console accounts (not vCenter accounts) each get a separate home folder.
/opt | ext3 | 2048 MB | The opt partition stores HA log files and is created to avoid filling the root partition. By default the opt partition is part of the root partition.
/tmp | ext3 | 2048 MB | The tmp partition is also created to prevent the root partition from filling up. Tmp can be used to extract and stage patches. By default the tmp partition is part of the root partition.
/boot | ext3 | 1100 MB | The boot partition stores the files necessary to boot the service console. *
vmkcore | vmkcore | 150 MB | The vmkcore partition temporarily stores log and error information in case of a VMkernel crash. *
* Automatically created by the installer but not displayed.
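Once the host is built you can quickly verify the resulting layout and keep an eye on free space from the service console; this is just a quick check, not a substitute for proper monitoring:
# Show the service console partitions with their sizes and free space
df -h
# vdf gives the same overview but also includes the VMFS datastores
vdf -h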
It is also recommended to rename the local datastore/VMFS partition during installation. By default the name is ‘Storage1’; to prevent mix-ups, rename it to something that includes the host name, for example ‘<hostname> - Local Storage’ or ‘Local Storage @ <hostname>’.
Root SSH access
This is a recurring item in many virtual infrastructure designs and a great way to p*ss off Anne Jan. If you enable SSH root access, every administrator who ever added a host to vCenter can access your ESX host using PuTTY or any other SSH utility and cause mayhem. Why would you do this? You wouldn’t give all of your employees a master key to all the doors in your building either. Another downside of SSH root access is that you cannot differentiate between people in the log files, so you can’t find out who did what on your ESX host. In a Windows environment Enterprise or Domain Administrator roles are handled with care, so why shouldn’t you act in a similar fashion with VMware ESX as your infrastructure foundation?
Instead of enabling SSH root access I recommend creating local ESX users for those who need SSH access and having them switch user (su) as soon as they’re logged on, or having them use sudo. To prevent having to remember multiple passwords I strongly advise using Active Directory integration.
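As a rough sketch of what this looks like on the service console (the user name jdoe and the exact sudo policy are examples, adjust them to your own standards):
# Create a personal service console account instead of sharing root
useradd jdoe
passwd jdoe
# Grant sudo rights by adding a line like the following through 'visudo'
#   jdoe ALL=(ALL) ALL
# Verify that root logins over SSH stay disabled
grep PermitRootLogin /etc/ssh/sshd_config
# Expected output: PermitRootLogin no; only restart sshd if you had to change it
service sshd restart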
Service console installs
Most of the time the reason to use ESX instead of ESXi is its ability to install all kinds of agents in the Service Console. VMware strongly recommends not installing any agents in the Service Console and nowadays there are other ways to handle this even when using ESXi. So as a best practice recommendation: Do not install software in the Service Console unless it is absolutely necessary.
A real life case: during a virtual infrastructure implementation last year I hadn’t taken the Navisphere agents into account during the design stage. When the system specialists started building the virtual infrastructure, they ran into problems connecting the EMC storage to the ESX hosts. The customer’s storage administrators quickly suggested installing the Navisphere agents in the service console. Luckily the system specialist reported this to the project leader, who blocked the install just in time. Because it was a deviation from the design, an impact analysis was performed. The analysis came back with a negative advice, because a quick search turned up tons of problems regarding service console agent installs and Navisphere in particular. Combined with the negative advice from VMware and the fact that the agent was only needed to register the LUNs once, the Navisphere agent was not installed and the storage administrator registered the LUNs manually.
After we delivered the virtual infrastructure, the customer installed the Navisphere agent anyway and that’s when the trouble started. Despite the claims from EMC that no problems had been reported to them and that the customer could install the agent without issues, the installation resulted in all kinds of issues.
Time sync
The correct time and time synchronization are essential in a virtual infrastructure, because virtual machines can be ‘paused’ when no CPU cycles are needed. So configure an NTP time source in your network and synchronize all ESX hosts with it. (Thanks Marius Redelinghuys for the input)
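On ESX this comes down to pointing the service console NTP daemon at your time sources and opening the firewall for the NTP client; the server names below are placeholders and ‘ntpClient’ is the firewall service name as I know it from ESX 3.x/4.0, so verify it with ‘esxcfg-firewall -s’ first:
# Add your own (internal) time sources to /etc/ntp.conf, for example:
#   server ntp1.example.local
#   server ntp2.example.local
esxcfg-firewall --enableService ntpClient
service ntpd restart
chkconfig ntpd on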
Licensing
Because VMware licenses VMware ESX per socket, it is recommended to use CPUs with a high core density. This way you get the best performance for the lowest price: two six-core CPUs, for example, give you twelve cores on two licenses, where two quad-core CPUs give you only eight.
Regarding the different vSphere flavors, Standard, Advanced, Enterprise and Enterprise Plus, pick the version that best suits your needs.
vCenter
Physical or virtual
If your virtual infrastructure is well designed and fully redundant, there is no reason why you shouldn’t run vCenter on a virtual server. It is fully supported, and by running vCenter on a virtual machine you can profit from all the benefits a virtual infrastructure delivers. The only limitation is the number of ESX hosts you have to manage: in a large environment a physical vCenter server is recommended, where high availability can be achieved by using vCenter Server Heartbeat. Use the sizing below to determine whether to go virtual or physical.
Sizing
- less than 10 ESX hosts:
  - virtual server;
  - 1 vCPU;
  - 3GB of memory;
  - Windows 32 or 64-bit operating system.
- between 10 and 50 ESX hosts:
  - virtual server;
  - 2 vCPUs;
  - 4GB of memory;
  - Windows 32 or 64-bit operating system (64-bit preferred).
- between 50 and 200 ESX hosts:
  - physical or virtual server (virtual preferred);
  - 4 vCPUs;
  - 4GB of memory;
  - Windows 32 or 64-bit operating system (64-bit preferred).
- more than 200 ESX hosts:
  - physical server;
  - 4 vCPUs;
  - 8GB of memory;
  - Windows 64-bit operating system.
DRS and HA
DRS and HA are two techniques which need to be addressed when running your vCenter Server on a virtual machine. First of all, it is recommended to exclude the vCenter Server from automatic DRS migrations. It’s not that DRS is unsupported or that it has a negative impact on performance, but this way you always know on which ESX host your vCenter Server is running.
HA is the technique which makes sure the vCenter Server virtual machine restarts in case of hardware failure. Because vCenter Server is the primary management interface for your virtual infrastructure it is important to get this up and running as soon as possible so configure your virtual vCenter Server with restart priority high. Because vCenter Server is dependent on several supporting services like Active Directory, DNS and SQL, make sure these services are online at the same time or before vCenter Server is.
Dependencies
vCenter Server is dependent on several services like Active Directory, DNS and SQL. It is required to have these services up and running together with vCenter Server or minimize the dependencies.
How do you ensure that SQL is online before the vCenter Server service is started? If vCenter and SQL are running on the same (virtual) server, configure a dependency on your SQL service in the vCenter Server service.
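On a Windows vCenter server this can be done with the built-in sc.exe tool; the service names below (vpxd for vCenter Server, MSSQLSERVER for a default SQL Server instance) are assumptions, so verify the exact names in services.msc before applying this:
rem Make the vCenter Server service depend on the local SQL Server service
sc config vpxd depend= MSSQLSERVER
rem Verify the configured dependency
sc qc vpxd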
How to minimize dependencies? If you’re running a fully virtual environment with no supporting physical servers and you need to boot ESX before you can start a DNS server, you can minimize the DNS dependency by configuring a hosts file on every ESX host. Because this is a manual action which requires additional maintenance, my advice is to only implement this in smaller environments where 100% virtualization can be achieved.
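A minimal sketch of such a hosts file (all names and addresses are made up); keep it identical on every ESX host:
# /etc/hosts
127.0.0.1      localhost
192.168.1.10   esx01.example.local    esx01
192.168.1.11   esx02.example.local    esx02
192.168.1.20   vcenter.example.local  vcenter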
Update Manager
For security reasons I prefer to install VMware Update Manager on a separate virtual machine. I simply do not like the primary management platform to have internet access.
Cluster design
As your cluster is the boundary for the DRS and HA configuration this is an important design decision. Why?
First of all, in vSphere 4 the size of a HA cluster is limited to a maximum of 32 hosts and 1280 virtual machines.
Second, depending on the number of hosts in a cluster there is a maximum number of supported virtual machines per ESX host:
- 100 virtual machines/host if there are eight or less ESX hosts in a cluster;
- 40 virtual machines/host if there are more than eight ESX hosts in a cluster.
Third, a VMware HA Cluster consists of primary and secondary nodes. Primary nodes hold cluster settings and all ‘node states’ which are synchronized between primaries. The first five hosts that join the VMware HA cluster are automatically selected as primary nodes, all the others are automatically selected as secondary nodes. The primary nodes are responsible for the HA failover process. Duncan Epping wrote a great section on his blog: Yellow-Bricks.com.
All three combined result in the following best practice: take your hardware into account when designing VMware clusters.
For instance, when using blades it is important not to place all primary hosts in the same blade enclosure, because if all primary hosts fail simultaneously no HA-initiated restart of the VMs will take place; HA needs at least one primary host to restart VMs. This results in a cluster of at most eight ESX hosts divided between two blade enclosures, four ESX hosts in each enclosure. With five primary nodes spread over two enclosures of four hosts, at least one primary always survives the loss of an enclosure. Such a cluster can hold up to 100 virtual machines per ESX host, or 700 virtual machines in total when you reserve failover capacity for one host (7 x 100).
Storage
Spindles and RAID levels
With regards to storage, spindles are key: more spindles equals more performance. The second item dictating performance is the RAID level, which is very important when designing a virtual infrastructure. Configuring storage is a compromise between capacity, performance and availability, and these choices can make or break storage performance. Slower SATA disks in RAID10 can outperform faster SAS disks in RAID5. So the bottom line is: make sure your VMFS storage gets the best performance and all other storage gets the performance, availability and capacity it needs. Know your I/O characteristics.
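A rough worked example of how much the RAID level matters, assuming a ballpark figure of 180 IOPS per 15k SAS spindle and a 70/30 read/write mix over eight spindles:
back-end IOPS: 8 x 180 = 1440
RAID5 (write penalty 4): 1440 / (0.7 + 0.3 x 4) ≈ 760 front-end IOPS
RAID10 (write penalty 2): 1440 / (0.7 + 0.3 x 2) ≈ 1100 front-end IOPS
Same spindles, very different results, so run these numbers with your own workload figures before picking a RAID level.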
Number of VMs/LUN
You’ll be surprised how many virtual infrastructures I encounter with only one extremely BIG LUN which contains all virtual machines. Most of the time, with this configuration, the end user is not satisfied with the performance. When I talk to them and propose to chop up their big LUN into several smaller ones to improve performance, the reaction is usually one of disbelief. When I give them one smaller LUN and let them put a poorly performing virtual machine on it, the discussion is over 9 out of 10 times.
This is the reason the VMware best practices advise not to put more than 16 to 20 server VMs or 30 to 40 desktop VMs on a LUN. Personally I like to keep to the lower values, so a maximum of 16 server VMs per LUN.
LUN size
When you limit your design to 16 server VMs per LUN and follow the other VMware best practices, like reserving space for snapshots and clones and keeping roughly 20% free space, the recommended LUN size is between 400 and 600 GB.
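A quick sanity check of that number, assuming an average server VM of roughly 25 GB: 16 x 25 GB = 400 GB, and adding room for snapshots, VM swap files and roughly 20% free space lands you comfortably in the 400 to 600 GB range.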
VMDK or RDM
When designing a virtual infrastructure and determining LUN size it’s a waste to fill a datastore with one virtual machine. In almost every design I keep to the following personal best practice: For every virtual machine disk larger than 20 to 50GB use a Raw Device Mapping (RDM).
In the past there have been discussions stating that RDMs have better performance but tests from VMware show that the performance difference is minimal and can be neglected.
Another reason to use RDMs over VMDK disks is the level of low level disk access/control and the need for SAN based features like snapshots, deduplication, etc. There are two compatibility modes, physical or virtual. The level of virtualization an application allows and the functional needs determine the compatibility mode. For instance in physical compatibility mode it’s not possible to use VMware snapshotting.
Block sizes
The VMFS block size is set when formatting the LUN. The block size determines the maximum size of the files which can be created on the VMFS storage.
Below the block sizes and the related maximum file sizes:
Block size | Maximum file size
1 MB | 256 GB
2 MB | 512 GB
4 MB | 1024 GB
8 MB | 2048 GB
When using 400 to 600 GB LUNs and assigning RDMs for virtual disks over 20 to 50 GB, you can suffice with a 1 MB block size because the virtual disk files will never exceed 256 GB.
Smaller block sizes also complement thin provisioned disks, because thin provisioned disks grow in block size increments. In contrast, a larger block size results in fewer SCSI locks. So set the block size based on the desired performance, maximum file size and disk strategy. I usually go with a 1 MB block size and have never experienced a negative performance impact due to excessive SCSI locking.
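Keep in mind that the block size is chosen when the datastore is formatted and cannot be changed afterwards without reformatting. From the service console it looks like the line below (the label and device path are placeholders); the same choice can be made in the vSphere client when creating the datastore:
# Create a VMFS3 datastore with a 1 MB block size
vmkfstools -C vmfs3 -b 1m -S VM_Datastore_01 /vmfs/devices/disks/naa.600xxxxxxxxx:1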
Thin-on-thin
A situation where I did experience a negative performance impact is when using thin-on-thin. So do not use thin provisioned virtual disks on thin provisioned LUNs.
Create an ISO store
For daily maintenance and use of a virtual infrastructure it is very convenient to create a central ISO store where you store images of all CDs and DVDs used. This way it’s very easy to mount an image in your virtual machine, and it reduces the risk of version sprawl in your virtual infrastructure.
Disk alignment
Disk misalignment is something which can have a substantial negative impact on performance. This goes for both VMFS partitions and vmdk disk files.
When creating a VMFS partition using the Virtual Infrastructure client the alignment is automatically set correctly.
Disk alignment in vmdk disk files is a bit more complex. If you want to perform a manual alignment of the file system in the vmdk disk file, check out the VMware document on partition alignment (http://www.vmware.com/pdf/esx3_partition_align.pdf), but I warn you, it’s a very lengthy process.
It is much easier to use a tool which does all this work for you. Vizioncore has a great freeware tool called vOptimizer WasteFinder which scans through VMware vCenter Servers to locate over allocated virtual storage and misaligned virtual machines. Improperly aligned VMs experience decreased I/O throughput and higher latency. Optionally, vOptimizer Pro from Vizioncore can be purchased to quickly and easily reclaim wasted virtual storage and to align VMs to proper 64K partition boundaries. The freeware version includes two free alignment tasks.
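If you just want to check whether an existing (Linux) partition is aligned, the starting sector tells you enough; with 512-byte sectors, a start that is a multiple of 128 sits on a 64 KB boundary. The device name below is an example:
# List partitions with their start and end expressed in sectors
fdisk -lu /dev/sdb
# A start sector of 128 (or another multiple of 128) means 64 KB alignment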
Networking
Separate management, storage and VM traffic
To secure your virtual environment it’s important to separate your virtual machine traffic from your management traffic. Besides that, you need to ensure the desired 1 Gb bandwidth for your VMotion traffic.
Regarding IP storage it is important to create a separate network, because IP storage traffic has very different characteristics and requires high-performance network hardware. IP storage switches should have very few ports per switching processor, preferably a one-to-one ratio, and a fast backplane.
How to realize this? Best practice is to create separate vSwitches for virtual machine, storage and management traffic. Typically I use vSwitch0 for management and VMotion, vSwitch1 for IP storage and vSwitch3 for virtual machine networks. When using VMware FT, a fourth vSwitch is required to create a dedicated FT network.
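Translated to the service console, that layout looks roughly like the sketch below; the vSwitch numbers, portgroup names, uplinks, VLAN ID and IP addresses are examples, so adapt them to your own design:
# vSwitch0: Service Console plus VMotion (second uplink added for redundancy)
esxcfg-vswitch -L vmnic1 vSwitch0
esxcfg-vswitch -A "VMotion" vSwitch0
esxcfg-vmknic -a -i 192.168.20.11 -n 255.255.255.0 "VMotion"
# (enable VMotion on this port group afterwards in the vSphere client)
# vSwitch1: IP storage
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic2 vSwitch1
esxcfg-vswitch -L vmnic3 vSwitch1
esxcfg-vswitch -A "IP Storage" vSwitch1
esxcfg-vmknic -a -i 192.168.30.11 -n 255.255.255.0 "IP Storage"
# vSwitch2: virtual machine networks, VLAN tagged
esxcfg-vswitch -a vSwitch2
esxcfg-vswitch -L vmnic4 vSwitch2
esxcfg-vswitch -L vmnic5 vSwitch2
esxcfg-vswitch -A "VM Network VLAN10" vSwitch2
esxcfg-vswitch -v 10 -p "VM Network VLAN10" vSwitch2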
I know there are people out there (and I even have colleagues who preach this) who put all traffic/portgroups on one vSwitch, combining management, storage, VMotion and virtual machine traffic, and claim they can present dedicated bandwidth and a secure connection to VMotion, FT, management and IP storage using QoS and VLAN tagging on the network layer. In my opinion this creates a chaotic situation where I do not have control over my network links and it’s not clear what they are used for. And if it’s not clear to me, how should a virtual infrastructure administrator be able to understand it? VLAN tagging on different portgroups, OK, but one vSwitch with 5-10 physical uplinks? Good luck troubleshooting that. Network designs with that many uplinks are complex enough as it is.
My best practice: Keep it simple and clear! Use separate vSwitches for different ‘roles’ and separate Management from IP storage and virtual machine traffic.
Fully redundant networking
To ensure a highly available infrastructure it is very important to design a rock solid network infrastructure. How? Every vSwitch gets a minimum of two physical network interfaces which are connected to separate switches. So, in case a network adapter or a switch fails, the virtual infrastructure keeps running, albeit with less capacity. But it’s always better to have a somewhat slower infrastructure than no infrastructure at all.
Keep in mind that a multi-port network interface presents itself as separate network adapters in VMware ESX. So a quad-port 1 Gb NIC shows up in VMware ESX as four separate network adapters. Divide your physical uplinks over separate multi-port network interfaces to ensure that not all connections are lost when a complete multi-port network adapter fails.
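You can check which vmnic sits on which physical card from the service console; vmnics that share the first part of the PCI address in the output belong to the same multi-port adapter:
# List all physical NICs with PCI address, driver, link state, speed and duplex
esxcfg-nics -l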
Avoid EtherChannels/Link Aggregation
The use of EtherChannels/link aggregation only makes sense when there are virtual machines which require more than 1 Gb of bandwidth, which is rarely the case. Besides that, it is really difficult to configure and it has many dependencies. Most of the time network links are divided between switches for redundancy, and it’s not possible or very difficult to configure link aggregation across multiple switches. So hands off EtherChannels/link aggregation; manage network load balancing using the standard VMware ESX policies (based on originating port ID, MAC address hash, etc.).
Link speed and duplex setting
Many inexplicable network problems are caused by wrong network settings. Collisions, retransmits, slow links: these can all be caused by mismatched settings between network adapter and network switch. This usually happens when using network adapters and switches from different brands. I’ve even come across slow performing network links which were caused by auto/auto settings on the network interface and switch, which is why I prefer to set link speed and duplex manually.
There is however a downside to changing this setting and it is related to Distributed Power Management (DPM). DPM depends on Wake on LAN (WoL) and some network adapters support wake-on-LAN only at 10 or 100 Mb, not at 1 Gb. If such a network adapter is connected to a switch that supports 1 Gb (or higher), it will attempt to negotiate down to 100 Mb when the machine powers off. If the switch and network adapter are manually set to 1000 Mb/Full duplex the network adapter loses its connection to the switch when the machine powers off and wake-on-LAN fails to bring the ESX host back online when needed. Nowadays vSphere supports more techniques to wake up the host, besides WoL, so this shouldn’t be a huge issue. (Thanks Marius Redelinghuys for the input)
Best practice is to always set the speed and duplex settings manually to 1000 Mb/full duplex on both ends, switch and network adapter, if the network adapter allows this, and then test whether DPM still functions correctly. If not, try another technique to get the ESX host out of standby mode. If that doesn’t work either, you will have to make do with the auto/auto setting on the network adapters involved.
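On the ESX side this is set per physical NIC from the service console (vmnic0 is just an example); remember to make the matching change on the switch port:
# Force 1 Gb / full duplex on this uplink
esxcfg-nics -s 1000 -d full vmnic0
# Or revert the uplink to auto-negotiation if DPM/WoL breaks
esxcfg-nics -a vmnic0
# Verify the configured and actual speed/duplex
esxcfg-nics -l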
DMZ networks
When combining LAN and DMZ on ESX hosts you’re heading for a conflict with the network administrators. I have found it very difficult to convince them that VMware ESX is very secure and that it’s no problem to combine LAN and DMZ.
A combination of VMware security whitepapers, including the famous NSA report, and the promise of a physically separated network does the trick 8 out of 10 times.
So, create a separate vSwitch for the DMZ network with a minimum of two physical network adapters, preferably connected to different switches.
The risk involved in combining LAN and DMZ is human error. You need to inform the virtual infrastructure admins about the risks and check DMZ connections frequently.
Virtual Machines
Remove unused hardware
When creating a virtual machine or template, make sure the virtual machine has no unused, obsolete hardware. So remove floppy drives and serial and parallel ports when you don’t need them. It’s just like with physical hardware: devices use IRQs and need to be polled, using resources in the process and slowing down the system. With virtual machines the principle is the same, but now there are a lot of virtual machines consolidated on the same hardware. Imagine the extra unnecessary load on the ESX host which could be used for more important processes. The performance gain won’t be huge, but every little bit helps and maybe you can squeeze in an extra virtual machine when tuning your virtual infrastructure to the max.
Disable services
Here again the principle is the same as with physical servers: disable unused services to achieve optimal system performance. In a virtual environment this is even more important, because you save resources on all virtual machines running on an ESX host.
The best way to achieve this in a Windows environment, is to create an OU structure in Active Directory based on server roles and create policies which disable unused services. This way you do not have to configure every server separately and the disabled services can be easily managed centrally. In my 12 year IT career I’ve come across one customer who had this in place and running perfectly.
Start with minimum resources
In physical environments we are used to sizing servers based on peak usage. With virtual machines it is recommended to start with a minimum amount of resources. Why?
First of all it is very easy to add resources at a later time if this turns out to be needed.
Second, assigned resources also imply reservations, and those reservations must be available for the virtual machine to start. So assigning too many resources results in higher reservations, which results in fewer virtual machines per ESX host.
Third, assigning more resources does not always mean that the virtual machine will perform faster. A real life scenario: at a customer site a colleague had installed an Exchange 2007 server on ESX 3.5 and assigned 4 vCPUs and a lot of memory, and despite the huge amount of resources the server didn’t perform well (understatement of the century). After removing two vCPUs the machine came to life, and after removing another vCPU the server was racing. So the Exchange server performed much, much better with fewer vCPUs; the VMware ESX CPU scheduling was holding the virtual machine down.
It’s difficult to determine what is just enough and what is way too much. Most of the time I base this decision on the Capacity Planner information I gather at the start of the project. (Thanks Marius Redelinghuys for the input)
Comments
Hi Eric,
Very good article, thank you very much. Some of the points that I have on the article.
If using DPM, VMware recommends that you do not use fixed network settings unless you have thoroughly tested it. Most servers switch to 10 Mbps in sleep mode and can therefore not be automatically revived when needed if the switch is trying to talk Gb.
Also recommended is to run with the minimum vCPU’s per server. Increase later if really needed.
What would your recommendations be regarding network teaming in VM’s? My understanding is that VMware still recommends NIC & HBA Teaming for Unplanned downtime.
Part of my best practice for installations is also to ensure a proper NTP server is available with updates from Internet and ensure that all the hosts uses the same time source.
Thanx
Marius
Thnx Marius for the input.
I indeed forgot NTP, good one.
Regarding DPM, I haven’t got much experience with DPM. I will look into that and add both items later this week.
I will also add minimum resources (vCPUS, etc) to VM sizing
Regarding the nic & hba teaming for unplanned downtime, definite yes for ESX hosts but for VMs? I can’t quite follow. Please explain.
Hi Eric,
This is the slide notes that I got from a VMware presentation a while back, I can mail you the PPT slide if you want to see in what context this was presented
Regards
Marius
—————————————————————————-
Today VMware provides a variety of solutions that shield applications from infrastructure downtime. VMotion protects applications from planned server downtime, HA provides the first line of defense against unplanned server downtime.
Storage VMotion protects applications against planned storage downtime, while Consolidated backup provides a framework to protect against data corruption or data loss
At the interconnect layer, NIC & HBA teaming provide resilience to unplanned component failures
At the virtual machine level, VM failure monitoring provides the automated restart in the case of virtual machine failures
Beyond individual sets of servers/storage, if the entire set goes down, Site Recovery Manager provides the orchestration of recovery from downtime and can be used for planned site downtime/migration as well.
—————————————————————————-
This describes redundancy at ESX interconnect level (hba/nic) not redundant nics at the VM level.
Hi Eric,
The point on DPM. See http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf on Page 12 & 28 as a starting reference
Also see VMware kb http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003373 on speed issues
Regards
Marius
Eric,
Started reading this through thinking it’d be another basics-of-VMware-defaults post. You caught me by surprise, I found it a good read and a very good explanation of basics some people do tend to forget.
I was wondering if you could write a bit on the ratio of vCPUs vs physical cores. I haven’t had the time or materials yet to really test this on vSphere 4, but from general practice I did notice on ESX 3.x with dual CPU (dual core) systems the ratio on average wasn’t more than 5-7 per core. If I exceeded that limit I’d run into extremely high CPU ready times and thus latency inside the virtual machines.
Good job so far!
If you’re going to mention a “famous” document then you should provide a link to said document! :)
@Mike: i would like to link the article but i’m still searching :-)
@rest: thnx for the compliments, with regards to the questions and additions, i will add it or post a response asap
I added most of Marius’ input to the article. Thanks for the input.
I will add a CPU-to-vCPU ratio section later. Have to sleep now :)
@Marius: Are you from South Africa? Are you in any way related to Dennis Harding?
Yes, I am in South Africa, based in Rivonia, Johannesburg. Unfortunately I have not met Dennis Harding yet.
@Marius, Dennis is from Johannesburg also, great guy and very good trainer.
@Edwin & Eric, now I remember, he is a trainer at Torque-IT, the local VMware training company?
Some more best practices:
Remember to switch VT & XD Bit in BIOS on before installation. If you switch VT on after VMware installation you can not run 64 bit OS’es unless you re-install vSphere again
Install VMware on its own physical disk; if the VMFS is on the same disk (e.g. a single server with DAS) you will lose the VMFS (and VMs) if you re-install vSphere (ESX 3.5 never had a problem with this). Or am I doing it wrong?
If you install VMware to local disks and the system is connected to the SAN, unplug the SAN during installation of vSphere
Best Practice: Always remember to document everything to the point that you can rebuild the entire environment based on your documentation alone
Eric,
another one to put into the best practice, if it’s a newly created environment with new virtual machines being put into it: esx partition alignment. http://www.vmware.com/pdf/esx3_partition_align.pdf
It holds up both in ESX3 and vSphere 4. I know it’s a shitty job, but it does boost performance.
Hi Eric,
For this section here:
* 100 virtual machines/cluster if there are eight or less ESX hosts in a cluster;
* 40 virtual machines/cluster if there are more than eight ESX hosts in a cluster.
I think you mean “machines/host”.
Cheers, Forbes.
@Forbes Guthrie: Oh damn, how could we have missed that during the review? You’re totally correct. I’ve changed it.
@J.R. Kalf: Thanks, I added a disk alignment section.
@Marius Redelinghuys: Thanks again. I added a hardware assistance and storage to the ESX(i) section.
With your additions this Best Pratices article is getting better by the day. Great community effort!
Erik, great initiative and amazing effort from the community.
1) the /boot and the vmkcore are a different size by default than you mention. It’s 1100 and 150Mb from the top of my head.
2) I would recommend to use SUDO instead of SU, as SUDO leaves the perfect audit trail
3) I never recommend a host file as it leads to inconsistency
4) Lun Size and Blocksizes, I wouldn’t recommend default based on those arguments.
5) The Networking Diagram doesn’t seem like a “regular” diagram with 3 nics for the management layer… why is that?
@Duncan:
1) You have a good head ;-) These were ESX3 numbers. Updated with vSphere now.
2) Added that. I prefer su because I’m not such a linux/SC wizkid.
3) I only use it in very small environments (2-3 ESX hosts) where it’s easy to keep it consistent. In larger environments there’s a physical DNS (and AD) most of the time.
4) I’m curious what you use as block size and on which arguments you base your decision. Block size is always a bit of a grey area. VMware states that block sizes do not have a performance penalty and VMFS uses sub-blocks, so small file sizes are no issue either.
5) This is indeed no ‘regular’ diagram. I used it to indicate that such setups are complex but it’s easy to see which connection is used for which function. In this case I did not use all NICs, but the customer wanted to use them all to improve on redundancy and capacity. Because of that I added the spare NICs to the management and VM network.
My justification would normally be “flexibility”. If you pick a larger blocksize now you have more flexibility to grow later. If you pick a smaller one you are more or less restricted because of the block size and would need to svmotion vms around.
Some best practices seem to be copied from other articles, it might be nice to add a link to the source.
These are best practices I collected over the last 6 to 7 years that I have been working with VMware ESX/VI. I can’t remember where I found each best practice or from which colleague I heard it.
Hi Erik,
Thanks for the article.
Two questions though:
1.) The part of “partitioning” only applies to the use of ESX and not ESXi?
2.) During my search on partitioning I stumbled on a (few) posts on aligning
partitions, because it has an impact on performance.
http://communities.vmware.com/message/1326824#1326824
Is this a ‘must do’ :) ?
Thanx Jaap
You’re welcome.
1) Indeed, partitioning only applies to ESX.
2) In my opinion this is certainly a ‘must’, although I have seen virtual infrastructures with misaligned disks which are not affected due to the low load on the environment. In stressed environments this will have more effect and you should align the partitions to get the best I/O performance.
Keep in mind that there are two locations/disk/volumes to align.
First of all you should align the datastore/VMFS. Just use the vSphere client to create your datastore, this automatically does it for you.
Second you should align the virtual machine partitions in the vmdk disk.
Anything newer??
No, not yet but I think the best practices don't change much with the arrival of vSphere 4.1.
I will review the best practices after my holiday.
Very good article and to the point, a real life saver.
Now that ESX is going to die, I would recheck the partitioning section and update it as needed.
Erik, thanks for this sharing.
Haim Chibotero
http://chibotero.blogspot.com –> iPhones in Israel & more
Hi all,
I am a newbie in the VMware application, and I need a tool to size the server and storage for VMware VDI to know the real cost of this solution. Please help.
With vCenter 5, is it still best practices not to put more than 16 to 20 server VMs on a LUN/datastore?
vCenter 5 makes no difference so I assume you mean vSphere 5.
From: http://www.vmguru.nl/wordpress/2010/12/choosing-vmfs-block-size-with-vsphere-4-1/
“In the past, one of the best practices was not to overload your datastores by limiting the maximum number of virtual machines per datastore because of the amount of SCSI reservations and the associated performance impact. Using VAAI will change how we do our datastore design. We can now simply base our datastore size on the maximum number of virtual machines per datastore, which is now limited by IOPS requirements. So using VAAI will result in having larger datastores which subsequently impacts the used VMFS block size.”
So from vSphere 4.1 onwards it got a little bit more complicated. You have to take into account whether you use VAAI-compatible storage. If so, the above metrics apply. If not, I would still keep the 16-20 VM maximum because of SCSI locking, or use NFS instead.