Virtual Control

VMware Cloud Foundation Solutions

Health Check Handbook

ESXi Hosts
Health Check Handbook

ESXi host health validation covering hardware sensors, storage, networking, services, patches, security configuration, and performance analysis.

ESXiHardwareStorageNetworkingSecurity

VCF 9.0

VMware Cloud Foundation

Proprietary & Confidential

ESXi Hosts Health Check Handbook

Comprehensive Health Verification for ESXi Hosts in VCF 9

Author: Virtual Control LLC Date: March 2026 Version: 1.0 Classification: Internal Use Platform: VMware Cloud Foundation 9.0 / ESXi 8.0 U3+

1. Overview & Purpose
2. Prerequisites
3. Quick Reference — All Checks Summary
4. Hardware Health
5. Storage Health
6. Networking Health
7. Services Health
- 7.1 Service Listing
- 7.2 Critical Services
8. NTP Configuration
9. Syslog Configuration
10. Scratch Partition
11. Core Dump Configuration
12. Security Health
13. Performance Health
14. Boot Configuration
15. Patch / VIB Level
16. Port Reference Table
17. Common Issues & Remediation
18. CLI Quick Reference Card

1. Overview & Purpose

This handbook provides a complete, step-by-step health check procedure for ESXi 8.0 hosts deployed within a VCF 9.0 environment. Use this during:

Routine maintenance — Weekly/monthly proactive host verification
Pre/post-upgrade — Before and after ESXi patches or VCF lifecycle operations
Incident response — When VMs are experiencing performance or connectivity issues
Capacity planning — Evaluating host resource utilization and VM density

Scope

Category	Checks Performed
Hardware	Sensors, IPMI, memory errors, CPU, PCI, firmware
Storage	VMFS, SCSI, HBA, multipathing, NFS, latency, SMART
Networking	VMkernel, vSwitches, uplinks, drivers, CDP/LLDP, connectivity
Services	All ESXi services, hostd, vpxa, NTP, syslog
Security	SSH, lockdown mode, firewall, certificates, account lockout
Performance	CPU ready, co-stop, memory balloon/swap, overcommit
Boot/Patch	Boot device, boot banks, image profile, VIB compliance

Notation: All commands in this document are run via SSH as root on the ESXi host unless otherwise specified. Replace <esxi-host> with your actual hostname or IP.

2. Prerequisites

Required Access

Access Type	Target	Credentials
SSH (22)	Each ESXi host	root / password
HTTPS (443)	Each ESXi host	root / password
vCenter API	vCenter Server	administrator@vsphere.local

Enable SSH on ESXi (if disabled)

# Via DCUI: F2 → Troubleshooting Options → Enable SSH
# Via PowerCLI:
Get-VMHost -Name <esxi-host> | Get-VMHostService | Where {$_.Key -eq "TSM-SSH"} | Start-VMHostService
# Via esxcli (from vCenter or DCUI):
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh

Environment Variables

# Set per-host or iterate across all hosts
export ESXI_HOSTS="192.168.1.201 192.168.1.202 192.168.1.203 192.168.1.204"
export ESXI_USER="root"
export ESXI_PASS="YourPassword123!"

# Loop template for multi-host checks
for HOST in $ESXI_HOSTS; do
  echo "======== $HOST ========"
  ssh root@$HOST '<command>'
done

3. Quick Reference — All Checks Summary

#	Check	Command	PASS	WARN	FAIL
4.1	Hardware Sensors	`esxcli hardware ipmi sdr list`	All `0x01` (Normal)	Any `0x02` (Warning)	Any `0x04` (Critical)
4.2	Memory Errors	`esxcli hardware memory get`	0 uncorrectable	Correctable ECC errors	Uncorrectable ECC errors
5.1	VMFS Datastores	`esxcli storage vmfs extent list`	All mounted	Snapshot consolidation needed	Datastore unmounted
5.3	Multipathing	`esxcli storage nmp path list`	All paths `active`	Any path `standby` unexpected	Any path `dead`
5.5	Disk Latency	`esxtop` (d for disk)	< 20ms avg	20-50ms avg	> 50ms avg
6.1	VMkernel NICs	`esxcli network ip interface list`	All enabled, IPs assigned	MTU mismatch	VMkernel missing/down
6.3	Uplinks	`esxcli network nic list`	All `Up`	Any NIC not at expected speed	Any NIC `Down`
7.2	Critical Services	`esxcli system process list`	hostd, vpxa running	Non-critical service stopped	hostd or vpxa not running
8	NTP	`esxcli system ntp get`	Synchronized, drift < 1s	Drift 1-5s	NTP not configured or drift > 5s
9	Syslog	`esxcli system syslog config get`	Remote target configured	Local-only logging	Syslog service not running
10	Scratch	`vim-cmd hostsvc/advopt/get ScratchConfig.ConfiguredScratchLocation`	Persistent storage	Ramdisk (non-persistent)	Not configured
11	Core Dump	`esxcli system coredump partition get`	Partition configured, active	Network dump only	No dump configured
12.1	SSH	`esxcli system process list \| grep SSH`	Disabled (production)	Enabled with timeout	Enabled, no timeout
12.2	Lockdown Mode	`vim-cmd hostsvc/hostsummary \| grep lockdownMode`	`lockdownNormal` or `lockdownStrict`	`lockdownDisabled` (lab OK)	`lockdownDisabled` (production)
13.1	CPU Ready	`esxtop` (c for CPU)	< 5% %RDY	5-10% %RDY	> 10% %RDY
13.2	Balloon/Swap	`esxcli hardware memory get`	0 balloon, 0 swap	Balloon active	Swap active
14	Boot Banks	`bootbank-util status`	Both banks healthy	Alt bank outdated	Primary bank corrupt
15	Patch Level	`esxcli software profile get`	Matches VCF BOM	1 patch behind	2+ patches behind

4. Hardware Health

4.1 Hardware Sensors (IPMI)

What: Read IPMI sensor data to verify temperatures, voltages, fan speeds, and power supply status.

Why: Hardware sensor warnings precede failures. Catching them early prevents unplanned outages.

CLI Method

esxcli hardware ipmi sdr list

Expected Output (Healthy):

Name                    Entity ID  Sensor Type     Reading  Units    Status
---------------------------------------------------------------------------
Inlet Temp              0x01.0x01  Temperature     22       C        0x01
Exhaust Temp            0x01.0x02  Temperature     35       C        0x01
CPU1 Temp               0x03.0x01  Temperature     48       C        0x01
CPU2 Temp               0x03.0x02  Temperature     45       C        0x01
FAN1                    0x1d.0x01  Fan             6200     RPM      0x01
FAN2                    0x1d.0x02  Fan             6100     RPM      0x01
P1-DIMMA1 Temp          0x20.0x01  Temperature     32       C        0x01
PSU1 Status             0x0a.0x01  Power Supply    —        —        0x01
PSU2 Status             0x0a.0x02  Power Supply    —        —        0x01

Status Codes

Code	Meaning	Action
`0x01`	Normal	No action
`0x02`	Warning	Monitor closely
`0x04`	Critical	Immediate attention
`0x08`	Non-recoverable	Hardware replacement

Pass / Warn / Fail

Result	Criteria	Indicator
PASS	All sensors `0x01` (Normal)	Hardware healthy
WARN	Any sensor `0x02` (Warning)	Monitor / schedule maintenance
FAIL	Any sensor `0x04` or `0x08`	Replace hardware component

Additional Sensor Commands

# Detailed hardware platform info
esxcli hardware platform get

# BMC/IPMI firmware info
esxcli hardware ipmi bmc get

Remediation:
1. High temps: Check datacenter cooling, airflow, dust
2. Fan failure: Replace fan module; most servers continue with N-1 fans
3. PSU warning: Check power cables; replace PSU under warranty
4. Always cross-reference with vendor BMC (iDRAC, iLO, IPMI web UI)

4.2 Memory Hardware Errors

What: Check for correctable (CE) and uncorrectable (UE) memory errors.

# Memory overview
esxcli hardware memory get

# Check DIMM status
esxcli hardware memory dimm list

Expected Output:

Physical Memory: 549,755,813,888 Bytes (512 GB)
Reliable Memory: 549,755,813,888 Bytes

DIMM Locator     Bank     Type   Speed   Size      Status
----------------------------------------------------------
P1-DIMMA1        Bank 0   DDR5   4800    32768 MB  ok
P1-DIMMB1        Bank 1   DDR5   4800    32768 MB  ok
P1-DIMMC1        Bank 2   DDR5   4800    32768 MB  ok
...

Check VMkernel Log for Memory Errors

grep -i "machine check\|memory error\|ECC\|CECC\|UECC" /var/log/vmkernel.log | tail -20

Result	Criteria	Indicator
PASS	All DIMMs `ok`, no errors in vmkernel log	Healthy
WARN	Correctable ECC errors (CE) present	Schedule DIMM replacement
FAIL	Uncorrectable errors (UE) or DIMM status not `ok`	Immediate replacement

4.3 CPU Health

# CPU info
esxcli hardware cpu list | head -30

# CPU global stats
esxcli hardware cpu global get

Expected Output:

CPU Packages: 2
CPU Cores: 32
CPU Threads: 64
Hyperthreading Active: true
Hyperthreading Supported: true

4.4 PCI Devices

# List all PCI devices
esxcli hardware pci list | grep -E "Device Name|Vendor Name|Address"

4.5 Firmware Versions

# BIOS info
esxcli hardware platform get

# Storage controller firmware
esxcli storage core adapter list

5. Storage Health

5.1 VMFS Datastores

What: Verify all VMFS datastores are mounted, accessible, and have adequate free space.

# List VMFS extents
esxcli storage vmfs extent list

Expected Output:

Volume Name  VMFS UUID                             Extent #  Device Name                           Partition
-----------------------------------------------------------------------------------------------------------------
datastore1   61234567-abcdef01-2345-001122334455   0         naa.600508b4001234567890abcdef012345   3
vsanDatastore 71234567-abcdef01-2345-001122334455  0         —                                     —

Check Datastore Free Space

# Datastore capacity (via df equivalent)
esxcli storage filesystem list

Expected Output:

Mount Point              Volume Name    UUID                                  Mounted  Type  Size            Free
------------------------------------------------------------------------------------------------------------------
/vmfs/volumes/ds1        datastore1     61234567-abcdef01...                  true     VMFS-6  1099511627776  549755813888
/vmfs/volumes/vsanDs     vsanDatastore  71234567-abcdef01...                  true     vsan    4398046511104  2199023255552

Result	Criteria	Indicator
PASS	All datastores mounted, > 20% free	Healthy
WARN	Any datastore 10-20% free	Plan capacity expansion
FAIL	Any datastore < 10% free or unmounted	Critical — VMs may not power on

5.2 SCSI Device Status

# List all SCSI devices
esxcli storage core device list

# Check for devices in APD/PDL state
esxcli storage core device list | grep -E "Display Name|Status|Is Perennially Reserved"

Key Indicators:

Status	Meaning
`on`	Device is online and accessible
`off`	Device is offline — APD or PDL
`APD`	All Paths Down — temporary, may recover
`PDL`	Permanent Device Loss — device is gone

APD / PDL Warning: If any device shows APD or PDL, VMs on that storage may be inaccessible. APD triggers automatic VM termination after 140 seconds by default. Check Misc.APDHandlingEnable and Misc.APDTimeout.

5.3 HBA Health & Multipathing

What: Verify all storage paths are active and HBAs are functioning.

# List HBAs
esxcli storage core adapter list

Expected Output:

HBA Name  Driver       Link State  UID                            Description
------------------------------------------------------------------------------
vmhba0    lsi_mr3      link-up     sas.5001636001234567           LSI Logic SAS3108
vmhba1    lsi_mr3      link-up     sas.5001636001234568           LSI Logic SAS3108
vmhba32   iscsi_vmk    online      iqn.1998-01.com.vmware:esxi01  VMware iSCSI

Multipathing Status

# List all paths and their state
esxcli storage nmp path list

Expected Output:

Runtime Name: vmhba0:C0:T0:L0
  Device: naa.600508b400123456...
  Adapter: vmhba0
  LUN: 0
  State: active
  Transport: sas

Runtime Name: vmhba1:C0:T0:L0
  Device: naa.600508b400123456...
  Adapter: vmhba1
  LUN: 0
  State: active
  Transport: sas

Result	Criteria	Indicator
PASS	All paths `active`, multiple paths per device	Fully redundant
WARN	Any path `standby` when should be `active`	Check path policy
FAIL	Any path `dead` or single path per device	Path failure — no redundancy

Remediation:
1. Dead path: Check physical cables, HBA, switch ports
2. Re-scan storage: esxcli storage core adapter rescan --all
3. Reclaim dead paths: esxcli storage core claiming reclaim -d naa.xxx
4. Check multipath policy: esxcli storage nmp device list

5.4 NFS Mounts

# List NFS datastores
esxcli storage nfs list

Expected Output:

Volume Name  Host            Share       Accessible  Mounted  Read-Only  Hardware Acceleration
----------------------------------------------------------------------------------------------
nfs-backup   192.168.1.100   /exports   true        true     false      Supported

Result	Criteria	Indicator
PASS	All NFS shares `Accessible: true`, `Mounted: true`	Healthy
FAIL	Any NFS share `Accessible: false`	NFS server unreachable

5.5 Disk Latency & SMART

What: Check real-time disk latency and SMART health data.

Check Latency via esxtop

# Interactive mode — press 'd' for disk view
esxtop

# Batch mode (3 samples, 5-second intervals)
esxtop -b -d 5 -n 3 | grep -E "DAVG|KAVG|GAVG"

Key Metrics:

Metric	Description	PASS	WARN	FAIL
DAVG/cmd	Device latency (physical disk)	< 20ms	20-50ms	> 50ms
KAVG/cmd	Kernel latency (VMkernel queue)	< 2ms	2-5ms	> 5ms
GAVG/cmd	Guest OS observed latency	< 25ms	25-50ms	> 50ms

SMART Data (Local Disks)

# Check SMART status for local disks
esxcli storage core device smart get -d <device-naa-id>

6. Networking Health

6.1 VMkernel Adapters

What: Verify all VMkernel adapters have correct IPs, MTU, and enabled services.

# List VMkernel interfaces
esxcli network ip interface list

Expected Output:

Name   MAC Address        Enabled  MTU   Portgroup/DVPort    Stack       IPv4 Address
---------------------------------------------------------------------------------------
vmk0   00:50:56:01:aa:01  true     1500  Management Network  defaultTcpipStack  192.168.1.201
vmk1   00:50:56:01:aa:02  true     9000  vMotion             vmotion     192.168.10.201
vmk2   00:50:56:01:aa:03  true     9000  vSAN                defaultTcpipStack  192.168.12.74
vmk10  00:50:56:01:aa:04  true     1600  nsx-overlay         defaultTcpipStack  192.168.14.201

Check VMkernel Tagged Services

esxcli network ip interface tag get -i vmk0
esxcli network ip interface tag get -i vmk1
esxcli network ip interface tag get -i vmk2

VMkernel	Expected Service Tags	MTU
vmk0	Management	1500
vmk1	vMotion	9000
vmk2	vSAN	9000
vmk10	NSX TEP (no tag — NSX managed)	1600+

Result	Criteria	Indicator
PASS	All vmknics enabled, correct IPs, correct MTU	Healthy
WARN	MTU mismatch or wrong service tag	Reconfigure
FAIL	VMkernel adapter missing or no IP	Connectivity loss

6.2 vSwitch / vDS Configuration

# List standard vSwitches
esxcli network vswitch standard list

# List Distributed vSwitches
esxcli network vswitch dvs vmware list

Expected Output (DVS):

Name: DSwitch-Management
  Configured Ports: 2048
  Max Ports: 2048
  MTU: 9000
  CDP Status: both
  Uplinks: vmnic0, vmnic1

Name: DSwitch-Compute
  Configured Ports: 2048
  Max Ports: 2048
  MTU: 9000
  Uplinks: vmnic2, vmnic3

6.3 Physical Uplink Status

What: Verify all physical NICs are connected, at expected speed, and link is up.

esxcli network nic list

Expected Output:

Name    PCI Device  Driver  Admin Status  Link Status  Speed  Duplex  MTU    MAC Address
-------------------------------------------------------------------------------------------
vmnic0  0000:3b:00  i40en   Up            Up           25000  Full    9000   00:50:56:01:aa:10
vmnic1  0000:3b:01  i40en   Up            Up           25000  Full    9000   00:50:56:01:aa:11
vmnic2  0000:5e:00  i40en   Up            Up           25000  Full    9000   00:50:56:01:aa:12
vmnic3  0000:5e:01  i40en   Up            Up           25000  Full    9000   00:50:56:01:aa:13

Result	Criteria	Indicator
PASS	All NICs `Link Status: Up`, expected speed	Healthy
WARN	NIC at lower speed than expected (e.g., 10G vs 25G)	Check cables/switch
FAIL	Any NIC `Link Status: Down`	Cable/switch/NIC failure

Remediation:
1. NIC down: Check cable, switch port, SFP/transceiver
2. Speed mismatch: Check auto-negotiation, cable category, switch config
3. Replace NIC driver: esxcli software vib install -v /path/to/driver.vib

6.4 NIC Driver & Firmware

# Driver details per NIC
esxcli network nic get -n vmnic0

Expected Output:

Advertised Auto Negotiation: true
Auto Negotiation: true
Driver Info:
  Bus Info: 0000:3b:00.0
  Driver: i40en
  Firmware Version: 9.20
  Version: 2.5.3.0
Link Detected: true
Link Status: Up

6.5 CDP / LLDP Neighbor Info

What: Discover the physical switch port each ESXi NIC is connected to.

# CDP (Cisco Discovery Protocol)
esxcli network vswitch dvs vmware lacp config get
# or via vim-cmd:
vim-cmd hostsvc/net/query_networkhint --pnic-name=vmnic0

Using Python for CDP/LLDP

# Python one-liner to get CDP info
python3 -c "
from pyVim.connect import SmartConnect
import ssl
# ... (connect to host and query PhysicalNicHintInfo)
"

6.6 vmkping Connectivity Tests

# Management network
vmkping -I vmk0 <vcenter-ip>

# vMotion network (jumbo frame test)
vmkping -I vmk1 -d -s 8972 <other-host-vmotion-ip>

# vSAN network (jumbo frame test)
vmkping -I vmk2 -d -s 8972 <other-host-vsan-ip>

# NSX TEP (MTU 1600)
vmkping -I vmk10 -d -s 1572 <other-host-tep-ip>

Test	PASS	FAIL
Management	0% loss	Any loss
vMotion jumbo	0% loss with -s 8972	Loss or "packet too big"
vSAN jumbo	0% loss with -s 8972	Loss or MTU error
NSX TEP	0% loss with -s 1572	Loss or MTU error

7. Services Health

7.1 Service Listing

# List all services and their status
esxcli system process list | head -60

# Alternative — list via chkconfig
chkconfig --list | sort

7.2 Critical Services

Service	Process	Function	Impact if Down
hostd	`hostd`	ESXi host agent	Host unreachable from vCenter
vpxa	`vpxa`	vCenter agent	Host disconnects from vCenter
vobd	`vobd`	VMware Observability	Events/alarms not generated
fdm	`fdm`	HA agent	HA not functional on host
ntpd	`ntpd`	Time sync	Time drift, cert issues
sfcbd	`sfcbd`	CIM broker	Hardware monitoring unavailable
lbtd	`lbtd`	Load-based teaming	NIC load balancing inactive
nsx-mpa	`nsx-mpa`	NSX management plane agent	NSX connectivity loss
nsx-proxy	`nsx-proxy`	NSX proxy	NSX data plane issues

Verify Critical Services

/etc/init.d/hostd status
/etc/init.d/vpxa status
/etc/init.d/fdm status
/etc/init.d/ntpd status
/etc/init.d/nsx-mpa status
/etc/init.d/nsx-proxy status

Expected Output for each:

hostd is running.

Remediation:
1. Restart hostd: /etc/init.d/hostd restart
2. Restart vpxa: /etc/init.d/vpxa restart
3. Restart all management agents: /sbin/services.sh restart
4. Check logs: /var/log/hostd.log, /var/log/vpxa.log

8. NTP Configuration

What: Verify NTP is configured, running, and the host clock is synchronized.

# Check NTP configuration
esxcli system ntp get

Expected Output:

Enabled: true
Loglevel: warning
Server:
  - 192.168.1.1
  - 192.168.1.2

# Check NTP service status
/etc/init.d/ntpd status

# Check time offset
esxcli system time get
ntpq -p

Expected ntpq Output:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.1.1     .GPS.            1 u   64  128  377    0.543    0.125   0.043
+192.168.1.2     .GPS.            1 u   32  128  377    0.621    0.213   0.051

Result	Criteria	Indicator
PASS	NTP enabled, offset < 1000ms, `reach` > 0	Synchronized
WARN	Offset 1-5 seconds	Drifting
FAIL	NTP not configured, offset > 5s, or `reach` = 0	Unsynchronized

Time Sync Critical: ESXi hosts with > 5 seconds clock drift can cause vSAN issues, certificate validation failures, and cluster partition events.

9. Syslog Configuration

What: Verify syslog is forwarding to a remote collector.

# Get syslog config
esxcli system syslog config get

Expected Output:

Default Network Retry Timeout: 180
Dropped Log File Rotation Size: 100
Dropped Log File Rotations: 10
Log Output: /scratch/log
Log To Unique Subdirectory: false
Remote Host: udp://loginsight.lab.local:514

# Check syslog service
esxcli system syslog mark --message="Health Check Test $(date)"
# Then verify it appears on remote collector

Result	Criteria	Indicator
PASS	Remote host configured, syslog service running	Healthy
WARN	Local logging only (no remote)	Logs may be lost on failure
FAIL	Syslog service not running	No logging

10. Scratch Partition

What: Verify the scratch partition points to persistent storage (not ramdisk).

vim-cmd hostsvc/advopt/get ScratchConfig.ConfiguredScratchLocation
vim-cmd hostsvc/advopt/get ScratchConfig.CurrentScratchLocation

Expected Output:

ConfiguredScratchLocation: /vmfs/volumes/datastore1/.locker-<hostname>
CurrentScratchLocation: /vmfs/volumes/datastore1/.locker-<hostname>

Warning: If the scratch location is /tmp/scratch or empty, it's using ramdisk. Logs and coredumps will be lost on reboot. Set it to persistent storage.

11. Core Dump Configuration

# Check core dump partition
esxcli system coredump partition get

# Check network core dump
esxcli system coredump network get

Expected Output:

Active: true
Configured: true
Partition: naa.xxx:7

# Network dump:
Enabled: true
Host VmkNic: vmk0
Network Server IP: 192.168.1.50
Network Server Port: 6500

Result	Criteria	Indicator
PASS	Partition configured and active, or network dump enabled	Can capture PSOD
WARN	Only network dump (no local partition)	Depends on network availability
FAIL	No dump target configured	PSOD data will be lost

12. Security Health

12.1 SSH Status

# Check if SSH is running
/etc/init.d/SSH status

# Check SSH timeout
esxcli system settings advanced list -o /UserVars/ESXiShellInteractiveTimeOut
esxcli system settings advanced list -o /UserVars/ESXiShellTimeOut

Environment	Expected SSH State	Timeout
Production	Disabled (enable only for maintenance)	300-900 seconds
Lab	Enabled acceptable	900 seconds

12.2 Lockdown Mode

vim-cmd hostsvc/hostsummary | grep lockdownMode

Expected Output:

lockdownMode = "lockdownNormal"

Mode	Description	Recommendation
`lockdownDisabled`	No lockdown	Lab only
`lockdownNormal`	Only vCenter can manage host	Production recommended
`lockdownStrict`	vCenter only, no DCUI	High-security environments

12.3 ESXi Firewall Rules

# List all firewall rulesets
esxcli network firewall ruleset list

# Check specific rules
esxcli network firewall ruleset rule list --ruleset-id=sshServer

Key Rulesets to Verify

Ruleset	Expected State	Purpose
`sshServer`	Enabled (maintenance) / Disabled (production)	SSH access
`webAccess`	Enabled	Host client UI
`vSphereClient`	Enabled	vCenter connectivity
`nsx`	Enabled	NSX communication
`ntpClient`	Enabled	NTP synchronization
`syslog`	Enabled	Log forwarding

12.4 Certificate Validity

# Check the ESXi host certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -noout -dates -subject

Expected Output:

notBefore=Jan 15 00:00:00 2026 GMT
notAfter=Jan 15 00:00:00 2028 GMT
subject=CN = esxi-01.lab.local, ...

Result	Criteria	Indicator
PASS	Certificate > 30 days from expiry	Healthy
WARN	Certificate 7-30 days from expiry	Plan renewal
FAIL	Certificate expired or < 7 days	Renew immediately

12.5 Account Lockout Policy

esxcli system account policy get

Expected Output:

Maximum Failed Login Attempts: 5
Unlock Time (seconds): 900

13. Performance Health

13.1 CPU Ready & Co-Stop

What: Check if VMs are experiencing CPU scheduling delays (CPU ready time) or NUMA/co-stop issues.

esxtop Method

# Interactive — press 'c' for CPU view
esxtop

# Batch mode
esxtop -b -d 5 -n 3 > /tmp/esxtop-cpu.csv

Key Columns in CPU View:

Column	Description	PASS	WARN	FAIL
%RDY	CPU Ready %	< 5%	5-10%	> 10%
%CSTP	Co-Stop %	< 3%	3-5%	> 5%
%USED	CPU Used %	< 80%	80-90%	> 90%

CPU Ready Explained: %RDY shows the percentage of time a vCPU wanted to run but had to wait for a physical CPU. High values indicate host CPU overcommitment. Reduce VM vCPU count or migrate VMs to balance load.

13.2 Memory Ballooning & Swap

# Host memory summary
esxcli hardware memory get

# Check ballooning and swap via esxtop ('m' for memory view)
esxtop

Key Memory Metrics:

Metric	Description	PASS	WARN	FAIL
MCTL (Balloon)	Memory reclaimed by balloon driver	0 MB	> 0 but < 10% of VM memory	> 10% of VM memory
SWCUR (Swap used)	Memory swapped to disk	0 MB	Any swap used	Swap actively in use (SWR/s > 0)
ZIP/s (Compression)	Memory compression rate	0	Low compression	High compression
CACHEUSD	Host cache used	—	—	> 80% of cache

Swap Warning: Active swapping (SWR/s > 0) causes severe VM performance degradation. Add physical memory or reduce VM count.

13.3 Host Overcommit Ratio

# Calculate overcommit ratio
# Total VM vCPUs / Physical threads
esxcli hardware cpu global get
vim-cmd vmsvc/getallvms | wc -l

# Total configured VM memory vs physical memory
esxcli hardware memory get

Recommended Ratios

Resource	PASS	WARN	FAIL
vCPU : pCPU	< 3:1	3:1 - 5:1	> 5:1
vRAM : pRAM	< 1.2:1	1.2:1 - 1.5:1	> 1.5:1

14. Boot Configuration

What: Verify the boot device, boot banks, and acceptance level.

# Boot device
esxcli system boot device get

# Boot bank status
/bin/bootbank-util status

# Acceptance level
esxcli software acceptance get

Expected Output:

Boot Device: mpx.vmhba0:C0:T0:L0
Boot Filesystem UUID: 61234567-abcdef01...
Boot bank: /bootbank (valid)
Alt boot bank: /altbootbank (valid)
Acceptance Level: VMwareCertified

Result	Criteria	Indicator
PASS	Both boot banks valid, acceptance `VMwareCertified` or `VMwareAccepted`	Healthy
WARN	Alt boot bank outdated	Update after next patch
FAIL	Primary boot bank corrupt or `CommunitySupported` acceptance	Remediation needed

15. Patch / VIB Level

What: Verify the installed ESXi image profile matches the VCF BOM.

# Installed image profile
esxcli software profile get

Expected Output:

Name: (Updated) ESXi-8.0U3-12345678-standard
Vendor: VMware, Inc.
Creation Time: 2026-01-15T00:00:00
Modification Time: 2026-01-15T00:00:00
Stateless Ready: True

# List all installed VIBs
esxcli software vib list | head -30

# Check for specific VIB
esxcli software vib list | grep -i nsx

VCF 9.0 Expected Versions (check BOM)

Component	Expected Version
ESXi	8.0 U3 (build 12345678+)
NSX VIBs	4.2.x
vSAN	Included in ESXi
Drivers	Per HCL

16. Port Reference Table

Inbound Ports (to ESXi)

Source	Port	Protocol	Purpose
vCenter	443	TCP	Host management
vCenter	902	TCP	VM console (MKS)
vCenter	8080	TCP	vSphere Update Manager
Admin	22	TCP	SSH (when enabled)
Admin	443	TCP	Host Client UI
Admin	5989	TCP	CIM/WBEM
NSX Manager	443	TCP	Host preparation
Syslog Collector	514	UDP/TCP	Syslog (if push)
SNMP Manager	161	UDP	SNMP queries

Outbound Ports (from ESXi)

Destination	Port	Protocol	Purpose
vCenter	443	TCP	vpxa → VPXD
vCenter	80	TCP	Reverse proxy
NSX Manager	1234	TCP	MPA
NSX Manager	1235	TCP	Central CLI
NSX Manager	5671	TCP	Message bus
ESXi (other)	8000	TCP	vMotion
ESXi (other)	902	TCP	NFC (provisioning)
ESXi (other)	2233	TCP	vSAN transport
ESXi (other)	4789	UDP	Geneve overlay
NTP Server	123	UDP	Time sync
DNS Server	53	TCP/UDP	Name resolution
Syslog	514/6514	UDP/TCP	Log forwarding
Dump Collector	6500	TCP	Network core dump

17. Common Issues & Remediation

17.1 PSOD (Purple Screen of Death)

Symptom	Likely Cause	Resolution
Purple screen on console	Kernel panic — driver/firmware bug, hardware failure, memory corruption	Collect core dump; check `/var/core/`; contact VMware support with dump
Recurring PSOD	Driver bug	Update driver/firmware per HCL; check VMware KB

Collect PSOD info:
1. Note the error message on purple screen
2. Core dump location: esxcli system coredump partition get
3. Extract: vm-support --performance --output /vmfs/volumes/datastore1/
4. File SR with VMware including the vm-support bundle

17.2 Storage APD / PDL

Symptom	Type	Resolution
VMs frozen, "APD timeout" in logs	All Paths Down	Check SAN fabric, zoning, HBA; paths auto-recover
VMs terminated, "PDL" in logs	Permanent Device Loss	Storage LUN is permanently gone; restore from backup

17.3 Network Partition

Symptom	Likely Cause	Resolution
Host shows disconnected in vCenter	Management network issue	Check vmk0 IP, switch port, VLAN; restart management agents
vSAN partition event	vSAN network failure	Check vmk2, switch MTU, VLAN trunking
vMotion failures	vMotion network	Check vmk1, MTU 9000, IP routing

17.4 Host Disconnection from vCenter

# On ESXi host — restart management agents
/sbin/services.sh restart

# If that doesn't work, restart only vpxa
/etc/init.d/vpxa restart

# Check hostd logs for errors
tail -100 /var/log/hostd.log | grep -i error
tail -100 /var/log/vpxa.log | grep -i error

17.5 Performance Degradation

Symptom	Check	Resolution
High CPU Ready	`esxtop` %RDY > 10%	Right-size VMs (reduce vCPUs), DRS rebalance
Memory ballooning	`esxtop` MCTLSZ > 0	Add RAM or migrate VMs
Storage latency	`esxtop` DAVG > 50ms	Check SAN, move VM to faster storage
Network drops	`esxcli network nic stats get -n vmnicX`	Check NIC errors, replace cable/NIC

18. CLI Quick Reference Card

System Information

Command	Purpose
`esxcli system version get`	ESXi version and build
`esxcli system hostname get`	Hostname and domain
`esxcli system uuid get`	System UUID
`esxcli system boot device get`	Boot device info
`esxcli system time get`	Current system time
`esxcli system stats uptime get`	Uptime in seconds
`esxcli hardware platform get`	Hardware platform info
`esxcli hardware cpu global get`	CPU count/threads
`esxcli hardware memory get`	Total physical memory

Storage Commands

Command	Purpose
`esxcli storage filesystem list`	Datastores and capacity
`esxcli storage vmfs extent list`	VMFS extents
`esxcli storage core device list`	SCSI devices
`esxcli storage core adapter list`	HBA adapters
`esxcli storage nmp path list`	Multipath status
`esxcli storage nmp device list`	Path policy per device
`esxcli storage core adapter rescan --all`	Rescan all storage
`esxcli storage nfs list`	NFS datastores

Networking Commands

Command	Purpose
`esxcli network nic list`	Physical NICs
`esxcli network nic get -n vmnicX`	NIC detail/driver
`esxcli network nic stats get -n vmnicX`	NIC statistics
`esxcli network ip interface list`	VMkernel adapters
`esxcli network ip interface ipv4 get`	VMkernel IPv4
`esxcli network ip route ipv4 list`	Routing table
`esxcli network ip dns server list`	DNS servers
`esxcli network vswitch standard list`	Standard vSwitches
`esxcli network vswitch dvs vmware list`	Distributed vSwitches
`esxcli network firewall ruleset list`	Firewall rules
`vmkping -I vmkX <target>`	Ping from VMkernel

Service Commands

Command	Purpose
`/etc/init.d/hostd status`	Host agent status
`/etc/init.d/vpxa status`	vCenter agent status
`/etc/init.d/ntpd status`	NTP daemon status
`/etc/init.d/SSH status`	SSH service status
`/sbin/services.sh restart`	Restart all mgmt agents
`esxcli system process list`	Running processes
`chkconfig --list`	Service startup config

Security Commands

Command	Purpose
`esxcli software acceptance get`	Acceptance level
`esxcli system account policy get`	Lockout policy
`vim-cmd hostsvc/hostsummary \| grep lockdown`	Lockdown mode
`esxcli network firewall get`	Firewall status
`openssl x509 -in /etc/vmware/ssl/rui.crt -noout -dates`	Cert expiry

Software / Patch Commands

Command	Purpose
`esxcli software profile get`	Current image profile
`esxcli software vib list`	All installed VIBs
`esxcli software vib install -v <path>`	Install VIB
`esxcli software sources profile list -d <depot>`	Available profiles

Performance / Diagnostic Commands

Command	Purpose
`esxtop`	Real-time performance
`esxtop -b -d 5 -n 3 > /tmp/out.csv`	Batch performance capture
`vm-support --performance`	Generate support bundle
`vmkbacktrace`	Stack trace
`vobd -e`	Event logging
`tail -f /var/log/vmkernel.log`	Kernel log (live)
`tail -f /var/log/hostd.log`	Host agent log
`tail -f /var/log/vpxa.log`	vCenter agent log

ESXi Hosts Health Check Handbook

Table of Contents

1. Overview & Purpose

Scope

2. Prerequisites

Required Access

Enable SSH on ESXi (if disabled)

Environment Variables

3. Quick Reference — All Checks Summary

4. Hardware Health

4.1 Hardware Sensors (IPMI)

CLI Method

Status Codes

Pass / Warn / Fail

Additional Sensor Commands

4.2 Memory Hardware Errors

Check VMkernel Log for Memory Errors

4.3 CPU Health

4.4 PCI Devices

4.5 Firmware Versions

5. Storage Health

5.1 VMFS Datastores

Check Datastore Free Space

5.2 SCSI Device Status

5.3 HBA Health & Multipathing

Multipathing Status

5.4 NFS Mounts

5.5 Disk Latency & SMART

Check Latency via esxtop

SMART Data (Local Disks)

6. Networking Health

6.1 VMkernel Adapters

Check VMkernel Tagged Services

6.2 vSwitch / vDS Configuration

6.3 Physical Uplink Status

6.4 NIC Driver & Firmware

6.5 CDP / LLDP Neighbor Info

Using Python for CDP/LLDP

6.6 vmkping Connectivity Tests

7. Services Health

7.1 Service Listing

7.2 Critical Services

Verify Critical Services

8. NTP Configuration

9. Syslog Configuration

10. Scratch Partition

11. Core Dump Configuration

12. Security Health

12.1 SSH Status

12.2 Lockdown Mode

12.3 ESXi Firewall Rules

Key Rulesets to Verify

12.4 Certificate Validity

12.5 Account Lockout Policy

13. Performance Health

13.1 CPU Ready & Co-Stop

esxtop Method

13.2 Memory Ballooning & Swap

13.3 Host Overcommit Ratio

Recommended Ratios

14. Boot Configuration

15. Patch / VIB Level

VCF 9.0 Expected Versions (check BOM)

16. Port Reference Table

Inbound Ports (to ESXi)

Outbound Ports (from ESXi)

17. Common Issues & Remediation

17.1 PSOD (Purple Screen of Death)

17.2 Storage APD / PDL

17.3 Network Partition

17.4 Host Disconnection from vCenter

17.5 Performance Degradation

18. CLI Quick Reference Card

System Information

Storage Commands

Networking Commands

Service Commands

Security Commands

Software / Patch Commands

Performance / Diagnostic Commands