
Troubleshoot disk latency issues for VMware VMs using esxtop

You are getting reports that VMs are “slow”. Nothing specific, perhaps, just that things aren’t moving very fast. You can’t identify any specific cause inside the VM itself. If so, there is a good chance you have a latency issue with the underlying datastores.

Note: If a VM does not have enough RAM, the OS inside it will page heavily. That increases disk I/O and can send you chasing a storage problem that is really a memory problem. Be sure to check page file usage first!

So how do you find out if you have an I/O issue? Use esxtop!

First, log into your ESXi box via SSH. This will give you command-line access.

Now, run ‘esxtop’ on the command-line. The program will open with output like so:

3:55:12pm up 29 days 4:21, 199 worlds; CPU load average: 0.06, 0.05, 0.05
PCPU USED(%): 86 1.3 0.6 18 1.2 10 0.1 8.3 0.1 0.1 1.2 0.2 0.1 0.8 2.8 0.1 AVG: 8.2
PCPU UTIL(%): 100 5.2 1.7 24 3.0 19 0.7 9.6 1.0 1.2 2.7 1.3 1.2 2.6 3.9 1.3 AVG: 11
CORE UTIL(%): 100 25 21 9.9 1.2 3.1 2.7 4.1 AVG: 20

Great, it’s up. Next, let’s view the adapters. Hit ‘d’ to switch to the disk adapter view.

3:56:26pm up 29 days 4:23, 199 worlds; CPU load average: 0.06, 0.05, 0.05

ADAPTR   PATH  NPTH  CMDS/s  READS/s  WRITES/s  MBREAD/s  MBWRTN/s  DAVG/cmd  KAVG/cmd  GAVG/cmd  QAVG/cmd
vmhba0   -        2   14.25     8.03      6.22      0.24      0.09      0.21      0.07      0.28      0.01
vmhba1   -        1    0.00     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
vmhba32  -        2   12.65     0.40     12.25      0.00      0.05      0.72      0.12      0.84      0.01
vmhba33  -        0    0.00     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
vmhba34  -        2    0.00     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Now hit ‘f’ to choose the columns. Hit ‘G’ to remove the Overall latency stats and ‘E’ to remove the I/O stats, then hit ‘H’ and ‘I’ to add the Read and Write latency stats.

Current Field order: ABCdefgHIjkl

* A: ADAPTR = Adapter Name
* B: PATH = Path Name
* C: NPATHS = Num Paths
D: QSTATS = Queue Stats
E: IOSTATS = I/O Stats
F: RESVSTATS = Reserve Stats
G: LATSTATS/cmd = Overall Latency Stats (ms)
* H: LATSTATS/rd = Read Latency Stats (ms)
* I: LATSTATS/wr = Write Latency Stats (ms)
J: ERRSTATS/s = Error Stats
K: PAESTATS/s = PAE Stats
L: SPLTSTATS/s = SPLIT Stats

Toggle fields with a-l, any other key to return:
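
The field order line is easy to misread, so here is a minimal Python sketch of how it appears to work, judging from the screen above: an upper-case letter lines up with a starred (displayed) field, and a lower-case letter with a hidden one. The ADAPTER_FIELDS dictionary and the shown_fields helper are names made up for this illustration; they are not part of esxtop.

```python
# Field letters and names copied from the adapter field screen above.
ADAPTER_FIELDS = {
    "A": "ADAPTR", "B": "PATH", "C": "NPATHS", "D": "QSTATS",
    "E": "IOSTATS", "F": "RESVSTATS", "G": "LATSTATS/cmd",
    "H": "LATSTATS/rd", "I": "LATSTATS/wr", "J": "ERRSTATS/s",
    "K": "PAESTATS/s", "L": "SPLTSTATS/s",
}


def shown_fields(field_order, names):
    """Return the names of the fields whose letter is upper case.

    Assumption: upper case means "displayed", lower case means "hidden",
    which matches the asterisks on the screen above.
    """
    return [names[ch] for ch in field_order if ch.isupper()]


print(shown_fields("ABCdefgHIjkl", ADAPTER_FIELDS))
```

If the list does not contain LATSTATS/rd and LATSTATS/wr, the read and write latency columns will not show up in the adapter view.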

Remember, we are interested in the latency stats. (Refer to 4.2.2 in the guide.)

Now we have:

4:01:38pm up 29 days 4:28, 198 worlds; CPU load average: 0.04, 0.05, 0.05

ADAPTR   PATH  NPTH  DAVG/rd  KAVG/rd  GAVG/rd  QAVG/rd  DAVG/wr  KAVG/wr  GAVG/wr  QAVG/wr
vmhba0   -        2     0.22     0.04     0.27     0.01     0.20     0.10     0.30     0.02
vmhba1   -        1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba32  -        2     0.00     0.00     0.00     0.00     0.20     0.08     0.28     0.01
vmhba33  -        0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba34  -        2     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00

Right now you are seeing your adapters, path names, number of paths, and then latency stats.

If GAVG/rd or GAVG/wr is consistently over 30 ms, you have an issue. You may see a spike to 250 ms now and then, but it should not happen often.
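
What counts as “consistently” is a judgment call. As a rough sketch, assuming you have written down a handful of GAVG readings from successive esxtop refreshes, the rule could be expressed like this in Python. The 30 ms threshold comes from the text above; the 80% cut-off, the function name, and the sample numbers are all invented for the illustration.

```python
def is_sustained(samples_ms, threshold_ms=30.0, min_fraction=0.8):
    """Return True if at least min_fraction of the samples exceed threshold_ms.

    min_fraction=0.8 is an arbitrary choice for this sketch, not a VMware rule.
    """
    if not samples_ms:
        return False
    over = sum(1 for s in samples_ms if s > threshold_ms)
    return over / len(samples_ms) >= min_fraction


# Made-up readings: one isolated burst versus latency that stays high.
spiky = [0.3, 0.4, 250.0, 0.3, 0.5, 0.2]
bad = [45.0, 62.1, 38.7, 29.0, 51.3, 44.8]

print(is_sustained(spiky))  # False: only 1 of 6 samples is over 30 ms
print(is_sustained(bad))    # True: 5 of 6 samples are over 30 ms
```

A single 250 ms spike does not trip the check, while a run of readings in the 40s and 50s does.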

Now, let’s switch over to another view.

Hit ‘u’ to view the individual disks.

Hit ‘f’ to choose the columns, and set the following:

Current Field order: ABcdefghIjklmnop

* A: DEVICE = Device Name
* B: ID = Path/World/Partition Id
C: NUM = Num of Objects
D: SHARES = Shares
E: BLKSZ = Block Size (bytes)
F: QSTATS = Queue Stats
G: IOSTATS = I/O Stats
H: RESVSTATS = Reserve Stats
* I: LATSTATS/cmd = Overall Latency Stats (ms)
J: LATSTATS/rd = Read Latency Stats (ms)
K: LATSTATS/wr = Write Latency Stats (ms)
L: ERRSTATS/s = Error Stats
M: PAESTATS/s = PAE Stats
N: SPLTSTATS/s = SPLIT Stats
O: VAAISTATS = VAAI Stats
P: VAAILATSTATS/cmd = VAAI Latency Stats (ms)

Toggle fields with a-p, any other key to return:

So now we have:

4:09:46pm up 29 days 4:36, 198 worlds; CPU load average: 0.03, 0.04, 0.04

DEVICE           PATH/WORLD/PARTITION  DAVG/cmd  KAVG/cmd  GAVG/cmd  QAVG/cmd
mpx.vmhba1:C0:T  -                         0.00      0.00      0.00      0.00
naa.60014057053  -                         0.00      0.00      0.00      0.00
naa.600140596e4  -                         0.00      0.00      0.00      0.00
t10.ATA_____WDC  -                         0.22      0.22      0.44      0.20
t10.ATA_____WDC  -                         0.00      0.00      0.00      0.00
t10.ATA_____WDC  -                         0.15      0.09      0.24      0.07
t10.ATA_____WDC  -                         0.14      0.13      0.27      0.11

This lets you drill down from the adapter to the actual disks so you can see where your problem area is. We have local SATA disks here for the example. Normally this would be a SAN or NAS offering up iSCSI, but those LUNs show up in this view as well, since each LUN is mounted as a disk.

Now look at this nightmare:

4:10:37pm up 29 days 4:37, 198 worlds; CPU load average: 0.03, 0.04, 0.04

DEVICE           PATH/WORLD/PARTITION  DAVG/cmd  KAVG/cmd  GAVG/cmd  QAVG/cmd
mpx.vmhba1:C0:T  -                         0.00      0.00      0.00      0.00
naa.60014057053  -                         0.00      0.00      0.00      0.00
naa.600140596e4  -                         0.00      0.00      0.00      0.00
t10.ATA_____WDC  -                         9.97    588.70    598.67    588.67
t10.ATA_____WDC  -                         0.00      0.00      0.00      0.00
t10.ATA_____WDC  -                         0.15     26.65     26.80     12.19
t10.ATA_____WDC  -                         0.14      0.11      0.25      0.10

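We have latency of almost 600 ms on one SATA disk and about 27 ms on another. Nearly all of it is KAVG and QAVG, which means commands are sitting in the kernel queue waiting for the disk rather than being serviced by it. People are not going to be happy. At that level, the disk is saturated and the app is going to come to a grinding halt.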
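
To see where the time is going, it helps to remember that GAVG (what the guest sees) is DAVG (device) plus KAVG (kernel), and that QAVG (queue time) is counted inside KAVG. Here is a small Python sketch that checks this against the three WDC rows with traffic from the output above and prints how much of the latency is kernel time. The parse_row helper is a made-up name, and splitting on whitespace is an assumption that only holds for rows shaped like these.

```python
# Rows copied from the device view above (only the disks with traffic).
SNAPSHOT = """\
t10.ATA_____WDC - 9.97 588.70 598.67 588.67
t10.ATA_____WDC - 0.15 26.65 26.80 12.19
t10.ATA_____WDC - 0.14 0.11 0.25 0.10
"""


def parse_row(line):
    """Split one device row into named latency values (all in ms)."""
    device, _path, davg, kavg, gavg, qavg = line.split()
    return {
        "device": device,
        "davg": float(davg),
        "kavg": float(kavg),
        "gavg": float(gavg),
        "qavg": float(qavg),
    }


rows = [parse_row(line) for line in SNAPSHOT.splitlines()]

for row in rows:
    # Share of the guest-visible latency that is spent in the kernel.
    kernel_share = row["kavg"] / row["gavg"] if row["gavg"] else 0.0
    print(
        f'{row["device"]}: GAVG {row["gavg"]:.2f} ms = '
        f'DAVG {row["davg"]:.2f} + KAVG {row["kavg"]:.2f} '
        f'({kernel_share:.0%} kernel)'
    )
```

On the worst disk the device itself answers in about 10 ms; the other 589 ms is spent waiting in line.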

Let’s see who is suffering.

Hit ‘v’ for VMs, ‘f’ for fields, and then set this:

Current Field order: aBCDefGH

A: ID = Vscsi Id
* B: GID = Grp Id
* C: VMNAME = VM Name
* D: VSCSINAME = Vscsi Name
E: NUM = Num of Vscsis
F: IOSTATS = I/O Stats
* G: LATSTATS/rd = Read Latency Stats (ms)
* H: LATSTATS/wr = Write Latency Stats (ms)

Toggle fields with a-h, any other key to return:

And we have:

4:12:13pm up 29 days 4:38, 198 worlds; CPU load average: 0.03, 0.03, 0.04

GID   VMNAME      VSCSINAME  LAT/rd  LAT/wr
3797  ABCSQL1_VM  -          688.90    0.00
4059  EFGsql-01   -            0.00    0.00
4108  XYZ-01-res  -            0.00   80.77

ABCSQL1_VM, a SQL server judging by the name, is getting killed with read latency of nearly 690 ms. Whatever application is using that SQL server is going to be suffering greatly right now.
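
With three VMs you can spot the victim by eye. On a busier host, a rough Python sketch like the one below could sort the rows for you. It assumes you have pasted the VM view into a string and that VM names contain no spaces; parse_vm_row and worst_first are hypothetical helper names, not anything esxtop provides.

```python
# Rows copied from the VM view above.
VM_SNAPSHOT = """\
3797 ABCSQL1_VM - 688.90 0.00
4059 EFGsql-01 - 0.00 0.00
4108 XYZ-01-res - 0.00 80.77
"""


def parse_vm_row(line):
    """Split one VM row into GID, name, and read/write latency in ms.

    Assumes whitespace-separated columns and VM names without spaces.
    """
    gid, name, _vscsi, lat_rd, lat_wr = line.split()
    return {
        "gid": int(gid),
        "vm": name,
        "lat_rd": float(lat_rd),
        "lat_wr": float(lat_wr),
    }


def worst_first(rows):
    """Sort VMs by their larger latency figure, highest first."""
    return sorted(rows, key=lambda r: max(r["lat_rd"], r["lat_wr"]), reverse=True)


vms = worst_first(parse_vm_row(line) for line in VM_SNAPSHOT.splitlines())

for vm in vms:
    print(f'{vm["vm"]}: read {vm["lat_rd"]:.2f} ms, write {vm["lat_wr"]:.2f} ms')
```

ABCSQL1_VM lands at the top of the list, followed by XYZ-01-res with its 80.77 ms writes.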

So what do you do now?

You add more and faster disks. Right now this VM is running on a single SATA disk. Replace it with several disks in a RAID-10 configuration, either locally or on a NAS offering up iSCSI over GigE or 10-GigE.
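
To check whether the new storage actually helped, you may want to capture numbers before and after the change instead of watching the screen. esxtop has a batch mode for this: -b writes CSV to standard output, -d sets the delay between samples in seconds, and -n sets the number of samples. Here is a minimal Python sketch that runs it over SSH from your workstation. The host name, the output file names, and the two helper functions are placeholders made up for this example, and it assumes key-based SSH login to the ESXi host is already set up.

```python
import subprocess


def esxtop_batch_command(ssh_target, delay_s=5, iterations=12):
    """Build the argument list for a remote esxtop batch capture."""
    return [
        "ssh", ssh_target,
        "esxtop", "-b",          # batch mode, CSV on stdout
        "-d", str(delay_s),      # seconds between samples
        "-n", str(iterations),   # number of samples to take
    ]


def capture(ssh_target, out_path, delay_s=5, iterations=12):
    """Run the capture and save the CSV to out_path on the local machine."""
    cmd = esxtop_batch_command(ssh_target, delay_s, iterations)
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Hypothetical host name; replace with your own ESXi box.
print(" ".join(esxtop_batch_command("root@esxi01.example.com")))

# Example usage (not executed here because it needs a real host):
# capture("root@esxi01.example.com", "before.csv")
# ... replace the disks ...
# capture("root@esxi01.example.com", "after.csv")
```

The resulting CSV is wide, since batch mode exports every counter, so expect to filter it down to the latency columns you care about.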

 

 
