This handbook provides a systematic, step-by-step approach to verifying the health and operational readiness of vSAN clusters running within VMware Cloud Foundation (VCF) 9 environments. It covers both vSAN Original Storage Architecture (OSA) and the vSAN Express Storage Architecture (ESA), which is the recommended configuration in VCF 9.
This document covers the following health check domains:
| Domain | Description |
|---|---|
| Cluster Health | Overall cluster status, membership, and health service results |
| Disk Group Status | Physical disk health, SMART data, cache and capacity tiers |
| Capacity Analysis | Space utilization, deduplication, compression, slack space |
| Resync Operations | Active resyncs, component movement, ETA, and impact |
| Network Health | VMkernel configuration, connectivity, jumbo frames, partitions |
| Performance | IOPS, latency, congestion, outstanding IO metrics |
| Object Health | Object compliance, accessibility, redundancy state |
| Stretched Cluster | Site configuration, witness host, inter-site latency |
| Fault Domains | Domain layout, host distribution, policy interaction |
| HCL Compliance | Controller, driver, and firmware compatibility verification |
Health checks should be executed at these critical intervals:
| Trigger | Frequency | Checks |
|---|---|---|
| Routine maintenance | Weekly | Full suite |
| Pre-upgrade (VCF lifecycle) | Before each LCM bundle | Full suite |
| Post-upgrade | Immediately after LCM completes | Full suite |
| Host addition/removal | After cluster change | Cluster, disk, network, capacity |
| Disk replacement | After replacement completes | Disk group, resync, object health |
| Network change | After vDS/vmkernel modification | Network health, connectivity |
| Performance complaint | On demand | Performance, congestion, resync |
| After power event | After datacenter power restoration | Full suite |
| Pre-expansion | Before adding workload domains | Capacity, performance baseline |
The following access and credentials are required to run the checks in this handbook:
| Requirement | Detail |
|---|---|
| vCenter SSO Admin | administrator@vsphere.local or equivalent role |
| ESXi Root Access | SSH enabled on target hosts (temporarily, disable after) |
| SDDC Manager Access | Admin-level access for LCM and inventory queries |
| vSAN Witness Host | Root access if stretched cluster is deployed |
| Network Access | Ability to reach vSAN VMkernel IPs on port 2233 |
| Tool | Version | Purpose |
|---|---|---|
| esxcli | Built into ESXi | Primary CLI for vSAN health checks |
| vSAN Health Service | Built into vCenter | Automated health test framework |
| PowerCLI | 13.3+ | Scripted health checks and reporting |
| RVC (Ruby vSphere Console) | Built into vCenter appliance | Deep vSAN diagnostics |
| vmkping | Built into ESXi | vSAN network validation |
| vsanDiskMgmt | Built into ESXi | Disk management and SMART queries |
| Python (pyVmomi) | 8.0+ | API-driven automation |
| Tool | Purpose |
|---|---|
| vSAN Observer | Real-time performance monitoring (HTML5 dashboard) |
| vRealize Operations / Aria Operations | Trending, capacity forecasting |
| VDT (VMware Diagnostic Tool) | Automated diagnostic collection |
| SOS Report | Support bundle generation |
The Ruby vSphere Console is accessed directly from the vCenter Server Appliance (VCSA).
# SSH to VCSA
ssh root@vcsa-01.vcf.local
# Launch RVC
rvc administrator@vsphere.local@localhost
# Navigate to the vSAN cluster
cd /localhost/SDDC-Datacenter/computers/SDDC-Cluster1
# Run the vSAN health check
vsan.health.health_summary .
# Full cluster health summary
vsan.health.health_summary /localhost/datacenter/computers/cluster
# Disk balance check
vsan.disks_stats /localhost/datacenter/computers/cluster
# Object placement info
vsan.object_info /localhost/datacenter/computers/cluster
# Network partition check
vsan.cluster_info /localhost/datacenter/computers/cluster
# Resync dashboard
vsan.resync_dashboard /localhost/datacenter/computers/cluster
# Performance diagnostics
vsan.perf.stats_object_list /localhost/datacenter/computers/cluster
# Install or update PowerCLI
Install-Module -Name VMware.PowerCLI -Scope CurrentUser -Force
# Configure certificate handling for lab/internal environments
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false
# Connect to vCenter
Connect-VIServer -Server vcsa-01.vcf.local -User administrator@vsphere.local -Password '<password>'
# Verify vSAN module is loaded
Get-Module VMware.VimAutomation.Storage -ListAvailable
The following table provides a consolidated view of every health check in this handbook with pass/warn/fail criteria.
| # | Check | CLI / Method | PASS | WARN | FAIL |
|---|---|---|---|---|---|
| 1 | Cluster Health Status | esxcli vsan health cluster list | All tests green | Any test yellow | Any test red |
| 2 | Cluster Membership | esxcli vsan cluster get | All hosts in cluster | Host count mismatch | Partitioned cluster |
| 3 | Disk Group Status | esxcli vsan storage list | All disks healthy | Disk degraded | Disk absent/failed |
| 4 | SMART Health | esxcli vsan debug disk smart get | All attributes OK | Wear leveling > 80% | Reallocated sectors > 0 |
| 5 | Capacity Used % | vSAN UI / PowerCLI | < 70% | 70-80% | > 80% |
| 6 | Slack Space | Calculated | >= 25% of raw | 15-25% of raw | < 15% of raw |
| 7 | Dedup/Compression Ratio | vSAN UI | > 1.5x | 1.0-1.5x | < 1.0x (overhead) |
| 8 | Active Resyncs | esxcli vsan debug resync summary | 0 active | < 100 components | > 100 components |
| 9 | Resync ETA | vSAN UI | < 1 hour | 1-8 hours | > 8 hours |
| 10 | vSAN VMkernel Config | esxcli network ip interface list | vSAN vmk on each host | MTU mismatch | vmk missing |
| 11 | Jumbo Frame Test | vmkping -s 8972 -d | 0% packet loss | Intermittent loss | Complete failure |
| 12 | Network Partition | Health service | No partition | N/A | Partition detected |
| 13 | Read Latency | vSAN perf service | < 1 ms | 1-5 ms | > 5 ms |
| 14 | Write Latency | vSAN perf service | < 2 ms | 2-10 ms | > 10 ms |
| 15 | Congestion | esxcli vsan debug controller list | 0 | 1-40 | > 40 |
| 16 | Outstanding IO | vsish | < 32 | 32-64 | > 64 |
| 17 | Object Health | esxcli vsan debug object health summary | All healthy | Reduced redundancy | Inaccessible objects |
| 18 | Policy Compliance | vSAN UI / PowerCLI | All compliant | Non-compliant (rebuilding) | Non-compliant (stuck) |
| 19 | Witness Host | esxcli vsan cluster get | Connected | High latency | Disconnected |
| 20 | Inter-Site Latency | vmkping | < 5 ms RTT | 5-100 ms | > 100 ms / timeout |
| 21 | Fault Domain Count | vSAN UI | >= 3 FDs | 2 FDs | 1 FD (no protection) |
| 22 | HCL Controller | Health service | Certified | DB outdated > 90 days | Not on HCL |
| 23 | HCL Driver/Firmware | Health service | Matched | Minor mismatch | Critical mismatch |
| 24 | Health Service Status | vCenter UI | Running, recent test | Last test > 24h ago | Service not running |
| 25 | Silenced Alarms | Health service | 0 silenced | 1-3 silenced | > 3 silenced |
The primary entry point for vSAN health is the esxcli vsan health cluster list command. This queries the vSAN health service and returns the state of every registered health test.
esxcli vsan health cluster list
Group: Cluster
Overall Health: green
Tests:
vSAN Health Service Up-To-Date: green
vSAN Build Recommendation Engine Health: green
vSAN CLOMD Liveness: green
vSAN Disk Balance: green
vSAN Object Health: green
vSAN Cluster Partition: green
Group: Network
Overall Health: green
Tests:
All Hosts Have a vSAN VMkernel Adapter: green
All Hosts Have Matching Subnets: green
vSAN: Basic (Unicast) Connectivity Check: green
vSAN: MTU Check (Ping with Large Packet Size): green
vMotion: Basic Connectivity Check: green
Group: Physical Disk
Overall Health: green
Tests:
vSAN Disk Health: green
Metadata Health: green
Component Metadata Health: green
Congestion: green
Disk Space Usage: green
Group: Data
Overall Health: green
Tests:
vSAN Object Health: green
vSAN VM Health: green
Group: Limits
Overall Health: green
Tests:
Current Cluster Situation: green
After 1 Additional Host Failure: green
Host Component Limit: green
| Result | Condition | Action |
|---|---|---|
| PASS | All groups show green | No action required |
| WARN | One or more tests show yellow | Investigate the specific test; see relevant section of this handbook |
| FAIL | Any test shows red | Immediate investigation required; do NOT proceed with maintenance |
Run esxcli vsan health cluster list -t "test name" to get details on a specific failing test. Cross-reference with the relevant section of this handbook for targeted remediation steps.
vSAN proactive tests simulate failure scenarios to predict cluster behavior under stress.
esxcli vsan health cluster list -t "vSAN Disk Balance"
Health Test: vSAN Disk Balance
Status: green
Description: Disks are well balanced. Max variance: 8%
# From RVC
vsan.proactive_rebalance /localhost/datacenter/computers/cluster --start
The vSAN Health Service runs within vCenter and executes periodic health tests.
# On any ESXi host in the cluster
esxcli vsan health cluster list -t "vSAN Health Service Up-To-Date"
# Get the vSAN cluster
$cluster = Get-Cluster -Name "SDDC-Cluster1"
# Get vSAN view
$vsanClusterHealthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
# Run health check
$vsanClusterHealthSystem.VsanQueryVcClusterHealthSummary(
$cluster.ExtensionData.MoRef,
$null, $null, $true, $null, $null, "defaultView"
)
| Attribute | Expected |
|---|---|
| Service Running | Yes |
| Last Test Time | Within 60 minutes |
| Test Result Format | Per-group green/yellow/red |
| Auto-Run Interval | Every 60 minutes (configurable) |
Every host in a vSAN cluster must be an active member. Use esxcli vsan cluster get to verify.
esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2026-03-26T14:30:00Z
Local Node UUID: 5f3e8c7a-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Local Node Type: NORMAL
Local Node State: MASTER
Member Count: 4
Sub-Cluster Member UUIDs: 5f3e8c7a-..., 6a4f9d8b-..., 7b5a0e9c-..., 8c6b1fad-...
Sub-Cluster Membership Entry Revision: 12
Sub-Cluster Member Count: 4
Maintenance Mode State: OFF
# Run on each ESXi host to ensure consistent membership
for host in esx01 esx02 esx03 esx04; do
echo "=== $host ==="
ssh root@$host 'esxcli vsan cluster get | grep "Member Count"'
done
| Result | Condition | Action |
|---|---|---|
| PASS | Member Count matches expected host count on ALL hosts | Healthy |
| WARN | A host shows BACKUP instead of MASTER/AGENT | Verify roles; may be transitional |
| FAIL | Member count differs between hosts (split-brain) | Network partition detected -- see Section 8.4 |
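The ssh loop shown earlier collects the Member Count from each host; the comparison can be automated with a small sketch. The input format (one hypothetical `<hostname> <member_count>` line per host, e.g. produced by piping the loop output through awk) is an assumption, not an esxcli output format.

```shell
# Flag inconsistent vSAN membership counts collected from each host.
# Input: one "<hostname> <member_count>" line per host (assumed format).
check_membership() {
  awk '!seen[$2]++ { distinct++ }          # count distinct member-count values
       END {
         if (distinct == 1) print "PASS: all hosts agree"
         else               print "FAIL: member counts differ (possible partition)"
       }'
}
```

Usage: `printf 'esx01 4\nesx02 4\n' | check_membership`.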
Remediation if membership is inconsistent:
1. Test vSAN network reachability from the affected host: vmkping -I vmk1 <target_ip>
2. Restart the CLOM daemon on the affected host: /etc/init.d/clomd restart

In vSAN OSA, storage is organized into disk groups (1 cache SSD + 1-7 capacity disks). In vSAN ESA (the VCF 9 default), all NVMe devices participate in a single storage pool without a separate cache tier.
esxcli vsan storage list
Device: naa.55cd2e414f5356c0
Display Name: naa.55cd2e414f5356c0
Is SSD: true
In CMMDS: true
On-disk Format Version: 15
Is Capacity Tier: false
Is Cache Tier: true
RAID Level: NA
vSAN UUID: 52e9a1f4-xxxx-xxxx-xxxx-xxxxxxxxxxxx
vSAN Disk Group UUID: 52e9a1f4-xxxx-xxxx-xxxx-xxxxxxxxxxxx
vSAN Disk Group Name: naa.55cd2e414f5356c0
Health Status: Healthy
Device: naa.55cd2e414f53789a
Display Name: naa.55cd2e414f53789a
Is SSD: true
In CMMDS: true
On-disk Format Version: 15
Is Capacity Tier: true
Is Cache Tier: false
RAID Level: NA
vSAN UUID: 63fa2b05-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Health Status: Healthy
esxcli vsan storage list
Device: t10.NVMe____Dell_Ent_NVMe_v2_AGN_RI_U.2_1.6TB
Display Name: Dell Ent NVMe AGN RI U.2 1.6TB
Is SSD: true
In CMMDS: true
On-disk Format Version: 19
Is Capacity Tier: true
Is Cache Tier: false
ESA Eligible: true
Storage Pool UUID: 74ab3c16-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Health Status: Healthy
# List all vSAN disks with health
esxcli vsan storage list | grep -E "Display Name|Health Status|Is Cache|Is Capacity"
| Result | Condition | Action |
|---|---|---|
| PASS | All disks report Health Status: Healthy | No action |
| WARN | Any disk reports Health Status: Degraded | Schedule replacement at next window |
| FAIL | Any disk reports Health Status: Failed or is missing | Immediate replacement required |
Self-Monitoring, Analysis, and Reporting Technology (SMART) provides early warning of disk failure.
esxcli vsan debug disk smart get -d naa.55cd2e414f5356c0
Parameter Value Threshold Worst Status
---------------------------- ----- --------- ----- ------
Health Status OK N/A N/A OK
Media Wearout Indicator 98 0 98 OK
Write Error Count 0 0 0 OK
Read Error Count 0 0 0 OK
Power-on Hours 14820 0 14820 OK
Power Cycle Count 12 0 12 OK
Reallocated Sector Count 0 0 0 OK
Uncorrectable Error Count 0 0 0 OK
Temperature Celsius 34 0 42 OK
| Attribute | PASS | WARN | FAIL |
|---|---|---|---|
| Media Wearout Indicator | > 20% remaining | 5-20% remaining | < 5% remaining |
| Reallocated Sector Count | 0 | 1-10 | > 10 |
| Uncorrectable Error Count | 0 | 1-5 | > 5 |
| Temperature Celsius | < 50C | 50-70C | > 70C |
| Write Error Count | 0 | 1-10 | > 10 |
| Read Error Count | 0 | 1-10 | > 10 |
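As a rough post-processing sketch, the two most predictive thresholds from the table can be applied to the esxcli SMART output with awk. The field positions assume the column layout shown above (attribute name in the first three fields, current value in the fourth); verify against your own output before relying on it.

```shell
# Grade a SMART report on Media Wearout Indicator and Reallocated Sector Count.
# Field positions assume the esxcli column layout shown above.
smart_grade() {
  awk '
    /Media Wearout Indicator/  { wear = $4 }     # % life remaining
    /Reallocated Sector Count/ { realloc = $4 }
    END {
      if      (realloc > 10 || wear < 5)  print "FAIL"
      else if (realloc > 0  || wear <= 20) print "WARN"
      else                                 print "PASS"
    }'
}
```

Usage: `esxcli vsan debug disk smart get -d naa.xxxx | smart_grade`.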
Disk replacement procedure:
1. Place the host in maintenance mode: esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
2. Remove the failed disk from the disk group: esxcli vsan storage remove -d naa.xxxx
3. Add the replacement disk: esxcli vsan storage add -d naa.xxxx -s naa.cache_disk
4. Exit maintenance mode: esxcli system maintenanceMode set -e false
# Comprehensive disk listing with all properties
esxcli vsan storage list --format=xml
# Get all vSAN disk information
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanDisks = Get-VsanDisk -Cluster $cluster
$vsanDisks | ForEach-Object {
    [PSCustomObject]@{
        Host          = $_.VsanDiskGroup.VMHost.Name
        DiskGroup     = $_.VsanDiskGroup.Name
        CanonicalName = $_.CanonicalName
        IsSSD         = $_.IsSsd
        IsCacheDisk   = $_.IsCacheDisk
        CapacityGB    = [math]::Round($_.CapacityGB, 2)
    }
} | Format-Table -AutoSize
# From any ESXi host in the cluster
esxcli vsan debug object health summary get
$cluster = Get-Cluster "SDDC-Cluster1"
$spaceReport = Get-VsanSpaceUsage -Cluster $cluster
# Display capacity summary
[PSCustomObject]@{
"Total Capacity (TB)" = [math]::Round($spaceReport.TotalCapacityGB / 1024, 2)
"Used Capacity (TB)" = [math]::Round($spaceReport.UsedCapacityGB / 1024, 2)
"Free Capacity (TB)" = [math]::Round($spaceReport.FreeCapacityGB / 1024, 2)
"Used %" = [math]::Round(($spaceReport.UsedCapacityGB / $spaceReport.TotalCapacityGB) * 100, 1)
}
Total Capacity (TB) : 23.64
Used Capacity (TB) : 9.82
Free Capacity (TB) : 13.82
Used % : 41.5
When deduplication and compression are enabled (common in vSAN ESA and optional in OSA all-flash), significant space savings are expected.
esxcli vsan debug space show
$cluster = Get-Cluster "SDDC-Cluster1"
$spaceReport = Get-VsanSpaceUsage -Cluster $cluster
[PSCustomObject]@{
"Before Dedup & Compression (TB)" = [math]::Round($spaceReport.PhysicalUsedCapacityGB / 1024, 2)
"After Dedup & Compression (TB)" = [math]::Round($spaceReport.UsedCapacityGB / 1024, 2)
"Dedup Ratio" = [math]::Round($spaceReport.DedupRatio, 2)
"Compression Ratio" = [math]::Round($spaceReport.CompressionRatio, 2)
"Overall Savings Ratio" = [math]::Round($spaceReport.DedupCompressionRatio, 2)
}
| Result | Condition | Action |
|---|---|---|
| PASS | Savings ratio > 1.5x | Good efficiency |
| WARN | Savings ratio 1.0-1.5x | Review workload data characteristics |
| FAIL | Savings ratio < 1.0x | Dedup/compression overhead exceeds savings; consider disabling |
vSAN uses thin provisioning by default for object storage. The logical provisioned space can far exceed physical capacity.
$cluster = Get-Cluster "SDDC-Cluster1"
$vms = Get-VM -Location $cluster
$report = foreach ($vm in $vms) {
    $disks = Get-HardDisk -VM $vm
    foreach ($disk in $disks) {
        [PSCustomObject]@{
            VM              = $vm.Name
            Disk            = $disk.Name
            ProvisionedGB   = [math]::Round($disk.CapacityGB, 2)
            ThinProvisioned = $disk.StorageFormat -eq "Thin"
        }
    }
}
$report | Format-Table -AutoSize
# Get-HardDisk exposes no per-disk used space; sum actual consumption at the VM level
Write-Host "Total Provisioned: $([math]::Round(($report | Measure-Object -Property ProvisionedGB -Sum).Sum, 2)) GB"
Write-Host "Total Used: $([math]::Round(($vms | Measure-Object -Property UsedSpaceGB -Sum).Sum, 2)) GB"
vSAN reserves slack space for resyncs, maintenance operations, and failure recovery. The formula depends on the cluster size and policy.
Slack Space = Max(HostCapacity, 25% of RawCapacity)
Where:
  HostCapacity = Total raw capacity / Number of hosts
                 (with equal-sized hosts, this equals the capacity of the
                  largest single host)
Cluster: 4 hosts x 10 TB raw each = 40 TB total raw
HostCapacity = 40 TB / 4 = 10 TB
25% of Raw = 40 TB x 0.25 = 10 TB
Slack Space = Max(10 TB, 10 TB) = 10 TB
Usable Capacity = 40 TB - 10 TB = 30 TB
(Before policy overhead)
With FTT=1, RAID-1 mirroring:
Effective Usable = 30 TB / 2 = 15 TB
| Used % | Status | Description | Action |
|---|---|---|---|
| 0-70% | PASS | Healthy capacity headroom | Normal operations |
| 70-75% | WARN | Approaching capacity limits | Plan expansion or cleanup |
| 75-80% | WARN | vSAN generates a warning alarm | Active capacity management needed |
| 80-90% | FAIL | vSAN throttles new writes | Immediate expansion or VM migration |
| 90-95% | FAIL | Severe performance impact | Emergency capacity action |
| >95% | FAIL | Risk of data inaccessibility | Emergency: free space immediately |
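A minimal classifier for the used-capacity thresholds above; note it collapses the two WARN bands (70-75% and 75-80%) into a single WARN result.

```shell
# Classify vSAN used-capacity percentage (collapses the two WARN bands).
capacity_grade() {
  awk -v pct="$1" 'BEGIN {
    if      (pct < 70) print "PASS"
    else if (pct < 80) print "WARN"
    else               print "FAIL"
  }'
}
```

Usage: `capacity_grade 41.5` for the sample cluster shown earlier.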
Get-VsanSpaceUsage -Cluster $cluster | Select -ExpandProperty SpaceDetail

Resyncs occur when vSAN needs to rebuild or move components. They can be triggered by host maintenance, disk failures, policy changes, or rebalancing.
esxcli vsan debug resync summary
Resync Summary:
Total Objects Resyncing: 0
Total Bytes To Resync: 0 B
Total Bytes Resynced: 0 B
Total Recoveries: 0
Total Rebalance: 0
Total Policy Change: 0
Total Evacuating: 0
Resync Summary:
Total Objects Resyncing: 42
Total Bytes To Resync: 287.35 GB
Total Bytes Resynced: 143.67 GB
Total Recoveries: 38
Total Rebalance: 4
Total Policy Change: 0
Total Evacuating: 0
| Result | Condition | Action |
|---|---|---|
| PASS | 0 objects resyncing | Cluster fully converged |
| WARN | < 100 objects, progress advancing | Monitor progress; expected after maintenance |
| FAIL | > 100 objects or resync stalled | Investigate root cause; check for disk or network issues |
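A back-of-the-envelope ETA sketch: divide the bytes still to resync by the sustained resync throughput. Both arguments are your own measurements (remaining GB from the summary output, MB/s observed over an interval), not values esxcli reports directly.

```shell
# Estimate resync completion time in hours.
# Args: <remaining GB> <sustained throughput MB/s>
resync_eta_hours() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN {
    printf "%.1f\n", (gb * 1024) / mbps / 3600   # GB -> MB, MB/s -> seconds -> hours
  }'
}
```

For the active-resync sample above (143.68 GB remaining) at a measured 200 MB/s, this yields roughly a quarter of an hour.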
# Real-time resync monitoring
watch -n 10 'esxcli vsan debug resync summary'
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanHealthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$resyncStatus = $vsanHealthSystem.VsanQuerySyncingVsanObjects(
$cluster.ExtensionData.MoRef
)
$resyncStatus | ForEach-Object {
    [PSCustomObject]@{
        UUID          = $_.Uuid
        BytesToSyncGB = [math]::Round($_.BytesToSync / 1GB, 2)
        RecoveryETA   = $_.RecoveryETA
        Reason        = $_.Reason
    }
} | Format-Table -AutoSize
# Provides a continuously updating view of resync progress
vsan.resync_dashboard /localhost/datacenter/computers/cluster
Active resyncs consume disk IO and network bandwidth. vSAN uses a throttling mechanism to limit impact on production workloads.
esxcli system settings advanced list -o /LSOM/lsomResyncThrottleEnabled
esxcli system settings advanced list -o /VSAN/ResyncThrottleAdaptive
| Parameter | Default | Impact |
|---|---|---|
| ResyncThrottleAdaptive | 1 (enabled) | vSAN automatically reduces resync bandwidth when VM IO is detected |
| ResyncBandwidthCap | 0 (unlimited) | Maximum MB/s for resync traffic per host |
| lsomResyncThrottleEnabled | 1 | Enables disk-level resync throttling |
Every host in the vSAN cluster must have a dedicated VMkernel adapter tagged for vSAN traffic.
esxcli network ip interface list | grep -A5 "vsan"
esxcli network ip interface list
vmk1
Name: vmk1
MAC Address: 00:50:56:6a:xx:xx
Enabled: true
Portset: DvsPortset-0
Portgroup: SDDC-DPortGroup-vSAN
VDS Name: SDDC-Dswitch-Private
MTU: 9000
TSO MSS: 65535
Port ID: 33554435
Netstack Instance: defaultTcpipStack
IPv4 Address: 172.16.10.101
IPv4 Netmask: 255.255.255.0
IPv4 Broadcast: 172.16.10.255
IPv6 Enabled: false
Tags: VSAN
esxcli vsan network list
Interface
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast Enabled: true
Traffic Type: vsan
| Result | Condition | Action |
|---|---|---|
| PASS | All hosts have a vmk with Tags: VSAN and MTU 9000 | Healthy |
| WARN | MTU mismatch across hosts | Standardize MTU to 9000 |
| FAIL | Host missing vSAN-tagged vmk adapter | Add vSAN VMkernel adapter immediately |
vSAN can operate in multicast mode (legacy) or unicast mode (default in vSAN 7+/VCF 5+). VCF 9 clusters should use unicast.
esxcli vsan network list | grep "Multicast Enabled"
| Mode | VCF 9 Status | Notes |
|---|---|---|
| Unicast (Multicast Enabled: false) | Recommended | Default for new VCF 9 clusters |
| Multicast (Multicast Enabled: true) | Legacy | Requires IGMP snooping on physical switches |
# From each host, test connectivity to every other host on vSAN network
vmkping -I vmk1 172.16.10.102
vmkping -I vmk1 172.16.10.103
vmkping -I vmk1 172.16.10.104
Jumbo frames (MTU 9000) are recommended for optimal vSAN performance. End-to-end validation is critical.
# From ESXi host, test jumbo frame path to each peer
# -s 8972 = 9000 - 20 (IP header) - 8 (ICMP header) = 8972
# -d = set DF (Don't Fragment) bit
vmkping -I vmk1 -s 8972 -d 172.16.10.102
vmkping -I vmk1 -s 8972 -d 172.16.10.103
vmkping -I vmk1 -s 8972 -d 172.16.10.104
PING 172.16.10.102 (172.16.10.102): 8972 data bytes
8980 bytes from 172.16.10.102: icmp_seq=0 ttl=64 time=0.254 ms
8980 bytes from 172.16.10.102: icmp_seq=1 ttl=64 time=0.198 ms
8980 bytes from 172.16.10.102: icmp_seq=2 ttl=64 time=0.211 ms
--- 172.16.10.102 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.198/0.221/0.254 ms
PING 172.16.10.102 (172.16.10.102): 8972 data bytes
sendto() failed (Message too long)
--- 172.16.10.102 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
| Result | Condition | Action |
|---|---|---|
| PASS | 0% packet loss on all hosts with 8972 byte payload | Jumbo frames working end-to-end |
| WARN | Intermittent packet loss | Check physical switch MTU, NIC firmware |
| FAIL | 100% loss or "Message too long" | MTU mismatch in path -- check vmk, vDS, physical switch |
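The vmkping summary line can be graded automatically. A sketch that parses the `X% packet loss` figure from the output format shown above:

```shell
# Grade a vmkping jumbo-frame test from its statistics summary line.
jumbo_grade() {
  awk '/packet loss/ {
    match($0, /[0-9.]+% packet loss/)
    loss = substr($0, RSTART, RLENGTH) + 0   # "+0" extracts the leading number
    if      (loss == 0)   print "PASS"
    else if (loss == 100) print "FAIL"
    else                  print "WARN"
  }'
}
```

Usage: `vmkping -I vmk1 -s 8972 -d 172.16.10.102 | jumbo_grade`.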
Troubleshooting MTU failures:
1. Check the vmk MTU on each host: esxcli network ip interface list | grep MTU
2. Verify the physical switch ports are configured for jumbo frames, e.g. mtu 9216 (allows for overhead)
3. Confirm the NICs support the configured MTU: esxcli network nic list
4. Retest the path after each change: vmkping -I vmk1 -s 8972 -d <target_ip>
A vSAN network partition occurs when hosts lose connectivity to each other, causing the cluster to split into sub-clusters.
esxcli vsan health cluster list -t "vSAN Cluster Partition"
# Run on EVERY host and compare Sub-Cluster Member UUIDs
esxcli vsan cluster get
# Check CMMDS master node
esxcli vsan cluster get | grep "Local Node State"
If multiple hosts report MASTER, a partition exists -- only one host should be MASTER.
| Result | Condition | Action |
|---|---|---|
| PASS | Single MASTER, all hosts in same sub-cluster | No partition |
| FAIL | Multiple MASTERs or mismatched sub-cluster membership | Active partition -- URGENT |
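A sketch for spotting the multiple-MASTER condition from `Local Node State` lines collected across hosts (e.g. with the ssh loop pattern used earlier for member counts):

```shell
# Count MASTER roles in collected "Local Node State" output.
# Exactly one MASTER is healthy; more than one indicates a partition.
master_check() {
  awk '/Local Node State: MASTER/ { masters++ }
       END {
         if      (masters == 1) print "PASS: single MASTER"
         else if (masters > 1)  print "FAIL: multiple MASTERs (partition)"
         else                   print "FAIL: no MASTER found"
       }'
}
```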
Remediation steps:
1. Check physical NIC statistics for errors/drops: esxcli network nic stats get -n vmnic0
2. Test vSAN connectivity from each host: vmkping -I vmk1 <peer_vsan_ip>
3. If needed, bounce the vSAN VMkernel interface: esxcli network ip interface set -i vmk1 -e false && esxcli network ip interface set -i vmk1 -e true
4. Watch for partition messages: tail -f /var/log/clomd.log | grep -i partition
If a stretched cluster is deployed, the witness host must be reachable from both sites.
# From preferred site host
vmkping -I vmk1 <witness_vsan_ip>
# From secondary site host
vmkping -I vmk1 <witness_vsan_ip>
esxcli vsan cluster get | grep -A2 "Witness"
| Result | Condition | Action |
|---|---|---|
| PASS | Witness reachable from both sites, < 200ms RTT | Healthy |
| WARN | Witness reachable but RTT > 100ms | Investigate WAN link quality |
| FAIL | Witness unreachable from either site | Immediate investigation -- quorum at risk |
Navigate to: Cluster > Monitor > vSAN > Performance > Virtual Machine Consumption
# Real-time IOPS and latency from ESXi
vsish -e get /vmkModules/lsom/disks/<disk_uuid>/info
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanPerfSystem = Get-VsanView -Id "VsanPerformanceManager-vsan-performance-manager"
# Define time range (last 1 hour)
$endTime = Get-Date
$startTime = $endTime.AddHours(-1)
# Query cluster performance
$spec = New-Object VMware.Vsan.Views.VsanPerfQuerySpec
$spec.EntityRefId = "cluster-domclient:*"
$spec.StartTime = $startTime
$spec.EndTime = $endTime
$perfData = $vsanPerfSystem.VsanPerfQueryPerf(@($spec), $cluster.ExtensionData.MoRef)
| Metric | PASS | WARN | FAIL |
|---|---|---|---|
| Read Latency (average) | < 1 ms | 1-5 ms | > 5 ms |
| Write Latency (average) | < 2 ms | 2-10 ms | > 10 ms |
| Read IOPS | Per baseline | > 20% below baseline | > 50% below baseline |
| Write IOPS | Per baseline | > 20% below baseline | > 50% below baseline |
| Read Cache Hit Ratio (OSA) | > 90% | 70-90% | < 70% |
vSAN congestion values indicate back-pressure in the IO stack. A non-zero congestion value means vSAN is throttling IO.
esxcli vsan debug controller list
Controller: naa.55cd2e414f5356c0
State: HEALTHY
Congestion Value: 0
Congestion Type: None
Outstanding IO: 0
# Get per-disk congestion
vsish -e get /vmkModules/lsom/disks/<disk_uuid>/info | grep -i congestion
| Congestion Value | Status | Description |
|---|---|---|
| 0 | PASS | No congestion |
| 1-20 | WARN | Mild congestion -- transient during bursts |
| 21-40 | WARN | Moderate congestion -- sustained IO pressure |
| 41-60 | FAIL | High congestion -- significant IO throttling |
| 61-100 | FAIL | Severe congestion -- critical performance impact |
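A minimal mapping of a congestion value onto the bands above (the two WARN and two FAIL sub-bands collapse to their overall status):

```shell
# Map a vSAN congestion value (0-100) to PASS/WARN/FAIL.
congestion_grade() {
  awk -v v="$1" 'BEGIN {
    if      (v == 0)  print "PASS"
    else if (v <= 40) print "WARN"
    else              print "FAIL"
  }'
}
```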
If congestion is elevated, check for active resyncs (esxcli vsan debug resync summary) and inspect per-device IO with esxtop (press u for the disk view).

Outstanding IO counts indicate the number of IO operations queued but not yet completed.
esxcli vsan debug controller list | grep "Outstanding IO"
# Per-device outstanding IO
vsish -e get /vmkModules/lsom/disks/<disk_uuid>/info | grep outstanding
| Outstanding IO | Status | Description |
|---|---|---|
| 0-16 | PASS | Normal queue depth |
| 17-32 | PASS | Moderate load, acceptable |
| 33-64 | WARN | Elevated queue depth |
| > 64 | FAIL | Queue saturation -- investigate |
vscsiStats provides detailed IO profiling for individual VMs and virtual disks.
# List all virtual SCSI handles
vscsiStats -l
# Start collection for a specific handle
vscsiStats -s -w <world_id> -i <handle_id>
# Wait for collection period (e.g., 60 seconds)
sleep 60
# Retrieve statistics
vscsiStats -p all -w <world_id> -i <handle_id>
# Stop collection
vscsiStats -x -w <world_id> -i <handle_id>
| Metric | Description |
|---|---|
| IO Size Histogram | Distribution of IO sizes (4K, 8K, 16K, etc.) |
| Seek Distance | Sequential vs. random IO pattern |
| Outstanding IO | Per-VMDK queue depth |
| Latency Histogram | Distribution of latency values |
| IO Type | Read/write ratio |
Use vscsiStats sparingly in production. It adds minor overhead during collection. Collect for 60-120 seconds to get a representative sample, then stop immediately.
The vSAN Performance Service must be enabled for historical performance data.
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanConfig = Get-VsanClusterConfiguration -Cluster $cluster
$vsanConfig.PerformanceServiceEnabled
Set-VsanClusterConfiguration -Cluster $cluster -PerformanceServiceEnabled $true
esxcli vsan health cluster list -t "Performance Service"
| Result | Condition | Action |
|---|---|---|
| PASS | Performance service enabled and collecting data | Healthy |
| WARN | Service enabled but stats database > 80% full | Archive or increase stats DB size |
| FAIL | Performance service disabled or not functioning | Enable via PowerCLI or vCenter UI |
esxcli vsan debug object health summary get
Object Health Summary:
Total Objects: 2847
Healthy: 2847
Objects with Reduced Redundancy: 0
Inaccessible Objects: 0
Non-Compliant Objects: 0
Quorum Not Satisfied: 0
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanHealth = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$objHealth = $vsanHealth.VsanQueryVcClusterHealthSummary(
$cluster.ExtensionData.MoRef, $null, $null, $true, $null, $null, "objectHealth"
)
$objHealth.ObjectHealth.ObjectHealthDetail | ForEach-Object {
    [PSCustomObject]@{
        Category    = $_.ObjHealthState
        ObjectCount = $_.NumObjects
    }
}
vSAN object compliance verifies that every object meets its assigned storage policy (FTT, stripe width, etc.).
$cluster = Get-Cluster "SDDC-Cluster1"
$vms = Get-VM -Location $cluster
foreach ($vm in $vms) {
$spPolicy = Get-SpbmEntityConfiguration -VM $vm
foreach ($policy in $spPolicy) {
if ($policy.ComplianceStatus -ne "compliant") {
[PSCustomObject]@{
VM = $vm.Name
Entity = $policy.Entity
Policy = $policy.StoragePolicy.Name
Status = $policy.ComplianceStatus
}
}
}
}
| Result | Condition | Action |
|---|---|---|
| PASS | All objects compliant | No action |
| WARN | Objects non-compliant but actively rebuilding | Monitor resync progress |
| FAIL | Objects persistently non-compliant | Investigate capacity or host availability |
Inaccessible objects have lost quorum -- they cannot be read or written. This is the most critical vSAN health state.
esxcli vsan debug object health summary get | grep "Inaccessible"
esxcli vsan debug object list --type=inaccessible
# Get the object UUID from the inaccessible list, then:
esxcli vsan debug object list -u <object_uuid>
Investigation steps:
1. Inspect the object's component layout: esxcli vsan debug object list -u <uuid>
2. Search the CLOM log for related events: grep -i "inaccessible" /var/log/clomd.log

Objects with reduced redundancy are accessible but have fewer copies than specified by their policy.
esxcli vsan debug object health summary get | grep "Reduced Redundancy"
esxcli vsan debug object list --type=reducedRedundancy
| Result | Condition | Action |
|---|---|---|
| PASS | 0 objects with reduced redundancy | Full policy compliance |
| WARN | Objects in reduced redundancy during resync | Expected after host/disk event; monitor resync |
| FAIL | Persistent reduced redundancy (no active resync) | Investigate CLOM; check capacity/host availability |
Remediation steps:
1. Confirm whether a resync is already repairing the objects: esxcli vsan debug resync summary
2. Check component limits: esxcli vsan health cluster list -t "Host Component Limit"
3. From RVC, repair renamed objects if present: vsan.fix_renamed_objects /path/to/cluster
4. If CLOM appears stuck, restart it on the affected host: /etc/init.d/clomd restart
In a vSAN stretched cluster, hosts are divided into two fault domains (sites) plus a witness host.
esxcli vsan cluster get
Look for:
Preferred Fault Domain: site-a
Secondary Fault Domain: site-b
$cluster = Get-Cluster "SDDC-Cluster1"
$vsanConfig = Get-VsanClusterConfiguration -Cluster $cluster
[PSCustomObject]@{
StretchedCluster = $vsanConfig.StretchedClusterEnabled
PreferredSite = $vsanConfig.PreferredFaultDomain.Name
SecondarySite = ($vsanConfig.FaultDomains | Where-Object { $_.Name -ne $vsanConfig.PreferredFaultDomain.Name }).Name
WitnessHost = $vsanConfig.WitnessHost.Name
}
| Result | Condition | Action |
|---|---|---|
| PASS | Both sites have equal host counts, preferred site set correctly | Healthy |
| WARN | Uneven host distribution between sites | Rebalance hosts if possible |
| FAIL | One site has no hosts or stretched cluster misconfigured | Reconfigure stretched cluster |
The witness host provides the tiebreaker vote in a stretched cluster. It must be in a third fault domain.
# From any cluster host
esxcli vsan cluster get | grep -i witness
# SSH to witness host
ssh root@witness-host.vcf.local
# Verify vSAN is running
esxcli vsan cluster get
# Check witness disk status
esxcli vsan storage list
# Verify network connectivity to both sites
vmkping -I vmk0 <site-a-host-vsan-ip>
vmkping -I vmk0 <site-b-host-vsan-ip>
| Resource | Minimum | Recommended |
|---|---|---|
| vCPUs | 2 | 2 |
| Memory | 16 GB (< 750 components) | 32 GB (> 750 components) |
| Witness disk cache | 5 GB SSD | 10 GB SSD |
| Witness disk capacity | 15 GB | 30 GB |
Site affinity rules ensure that specific VMs prefer to run at a particular site during normal operations.
$cluster = Get-Cluster "SDDC-Cluster1"
$rules = Get-DrsRule -Cluster $cluster | Where-Object { $_.Type -eq "VMAffinity" }
$rules | Format-Table Name, Type, Enabled, VMIds -AutoSize
# Check vSAN storage policies with site affinity
Get-SpbmStoragePolicy | Where-Object {
$_.AnyOfRuleSets.AnyOfRules.Capability.Name -match "locality"
} | ForEach-Object {
[PSCustomObject]@{
PolicyName = $_.Name
Locality = ($_.AnyOfRuleSets.AnyOfRules | Where-Object {
$_.Capability.Name -match "locality"
}).Value
}
}
# From a host at Site A to a host at Site B
vmkping -I vmk1 -c 100 <site-b-host-vsan-ip>
| Link | Requirement | PASS | WARN | FAIL |
|---|---|---|---|---|
| Site A to Site B (data) | RTT <= 5 ms | < 5 ms | At or near 5 ms | > 5 ms |
| Either site to witness | RTT <= 200 ms | < 100 ms | 100-200 ms | > 200 ms |
| Bandwidth (data sites) | >= 10 Gbps | >= 10 Gbps | 1-10 Gbps | < 1 Gbps |
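As an illustrative aid (not a VMware tool), the maximum RTT values above can be turned into a small classifier for measured `vmkping` results. The 5 ms data-site and 200 ms witness limits are the supported maxima from the table.

```python
# Illustrative classifier for stretched-cluster RTT measurements.
def classify_link(link: str, rtt_ms: float) -> str:
    """Grade a measured RTT. link is 'data' (site A <-> B) or 'witness'."""
    if link == "data":
        # Supported maximum between data sites is 5 ms RTT.
        return "PASS" if rtt_ms < 5 else "FAIL"
    if link == "witness":
        # Supported maximum to the witness is 200 ms RTT.
        if rtt_ms < 100:
            return "PASS"
        return "WARN" if rtt_ms <= 200 else "FAIL"
    raise ValueError(f"unknown link type: {link}")

print(classify_link("data", 1.8))      # healthy inter-site link
print(classify_link("witness", 150))   # within support, worth watching
```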
Fault domains define failure boundaries. vSAN places components across fault domains to ensure that a single domain failure does not cause data loss.
$cluster = Get-Cluster "SDDC-Cluster1"
$faultDomains = Get-VsanFaultDomain -Cluster $cluster
# Use ForEach-Object (not a foreach statement) so the results can be piped
$faultDomains | ForEach-Object {
    [PSCustomObject]@{
        Name      = $_.Name
        HostCount = ($_.VMHost | Measure-Object).Count
        Hosts     = ($_.VMHost.Name -join ", ")
    }
} | Format-Table -AutoSize
esxcli vsan cluster get | grep "Fault Domain"
For optimal fault tolerance, hosts should be evenly distributed across fault domains.
| Configuration | PASS | WARN | FAIL |
|---|---|---|---|
| Fault Domain Count | >= 3 FDs | 2 FDs | 1 FD or none configured |
| Hosts per FD | Equal distribution | +/- 1 host variance | Severe imbalance |
| FTT=1 compliance | >= 3 FDs | 2 FDs (works but no FD-level protection) | 1 FD |
| FTT=2 compliance | >= 5 FDs | 3-4 FDs | < 3 FDs |
Fault Domain: rack-01 -> esx-01.vcf.local
Fault Domain: rack-02 -> esx-02.vcf.local
Fault Domain: rack-03 -> esx-03.vcf.local
Fault Domain: rack-04 -> esx-04.vcf.local
When fault domains are configured, vSAN places mirrors/parity components in different fault domains. The policy must be compatible with the number of fault domains.
$cluster = Get-Cluster "SDDC-Cluster1"
$fds = Get-VsanFaultDomain -Cluster $cluster
$fdCount = ($fds | Measure-Object).Count
$policies = Get-SpbmStoragePolicy | Where-Object { $_.Name -like "*vSAN*" }
$policies | ForEach-Object {
    $pol = $_
    $ftt = ($pol.AnyOfRuleSets.AnyOfRules | Where-Object {
        $_.Capability.Name -eq "VSAN.hostFailuresToTolerate"
    }).Value
    $requiredFDs = (2 * $ftt) + 1 # For RAID-1
    [PSCustomObject]@{
        Policy       = $pol.Name
        FTT          = $ftt
        RequiredFDs  = $requiredFDs
        AvailableFDs = $fdCount
        Compliant    = $fdCount -ge $requiredFDs
    }
} | Format-Table -AutoSize
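The PowerShell check applies the RAID-1 rule (2 x FTT + 1). A hedged sketch extending it to the classic erasure-coded layouts: RAID-5 (3 data + 1 parity) needs 4 fault domains and RAID-6 (4 data + 2 parity) needs 6. ESA's adaptive RAID-5 widths are deliberately not modeled here.

```python
# Illustrative sketch (not a VMware API): minimum fault domains per policy.
def required_fault_domains(ftt: int, raid: str = "RAID-1") -> int:
    if raid == "RAID-1":
        return 2 * ftt + 1   # FTT+1 mirrors plus witness placement
    if raid == "RAID-5":     # FTT=1 erasure coding, 3 data + 1 parity
        return 4
    if raid == "RAID-6":     # FTT=2 erasure coding, 4 data + 2 parity
        return 6
    raise ValueError(f"unknown RAID level: {raid}")

assert required_fault_domains(1) == 3            # matches the FTT=1 row above
assert required_fault_domains(2) == 5            # matches the FTT=2 row above
assert required_fault_domains(1, "RAID-5") == 4
```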
New-VsanFaultDomain -Name "rack-05" -VMHost (Get-VMHost "esx-05.vcf.local")
esxcli vsan health cluster list -t "vSAN Health Service Up-To-Date"
Health Test: vSAN Health Service Up-To-Date
Status: green
Description: vSAN Health Service is up-to-date.
Last Run: 2026-03-26T14:00:00Z
# On VCSA, check health service status
vmon-cli --status vsanhealth
$cluster = Get-Cluster "SDDC-Cluster1"
$healthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$healthSystem.VsanQueryVcClusterHealthSummary(
$cluster.ExtensionData.MoRef,
$null, $null, $true, $null, $null, "defaultView"
)
| Result | Condition | Action |
|---|---|---|
| PASS | Service running, last test < 1 hour ago | Healthy |
| WARN | Service running but last test > 24 hours ago | Force a refresh |
| FAIL | Service not running | Restart: vmon-cli --restart vsanhealth on VCSA |
The vSAN Health Service organizes tests into the following categories:
| Category | Tests Included | Frequency |
|---|---|---|
| Cluster | Partition, CLOMD liveness, disk balance, member health | Every 60 min |
| Network | VMkernel config, connectivity, MTU, multicast | Every 60 min |
| Physical Disk | Disk health, metadata, congestion, capacity | Every 60 min |
| Data | Object health, VM health, compliance | Every 60 min |
| Limits | Component limits, host failure simulation | Every 60 min |
| HCL | Controller, driver, firmware, HCL DB age | Every 24 hours |
| Performance | Performance service status, stats integrity | Every 60 min |
| Stretched Cluster | Witness, site configuration, inter-site latency | Every 60 min |
| Encryption | KMS connectivity, key status, rekey status | Every 60 min |
Silenced alarms are health tests that have been muted by an administrator. Excessive silencing can mask real problems.
$cluster = Get-Cluster "SDDC-Cluster1"
$healthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$silenced = $healthSystem.VsanHealthGetVsanClusterSilentChecks($cluster.ExtensionData.MoRef)
Write-Host "Silenced checks count: $($silenced.Count)"
$silenced | ForEach-Object { Write-Host " - $_" }
$healthSystem.VsanHealthSetVsanClusterSilentChecks(
$cluster.ExtensionData.MoRef,
$null # Pass null to clear all silenced checks
)
| Result | Condition | Action |
|---|---|---|
| PASS | 0 silenced alarms | Full visibility into health |
| WARN | 1-3 silenced alarms | Review each; unsilence if no longer needed |
| FAIL | > 3 silenced alarms | Audit all silenced checks; likely masking real issues |
HCL (Hardware Compatibility List) compliance ensures that storage controllers, drivers, and firmware are certified for vSAN.
esxcli vsan health cluster list -t "vSAN HCL Health"
# Controller model
esxcli storage core adapter list
# Driver version
esxcli storage core adapter stats get -a vmhba0
# Firmware version
esxcli storage core adapter list | grep -i firmware
$cluster = Get-Cluster "SDDC-Cluster1"
$healthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$hclResult = $healthSystem.VsanQueryVcClusterHealthSummary(
$cluster.ExtensionData.MoRef, $null, $null, $true, $null, $null, "hclInfo"
)
$hclResult.HclInfo | ForEach-Object {
[PSCustomObject]@{
Host = $_.Hostname
Controller = $_.ControllerName
Driver = $_.DriverVersion
Firmware = $_.FirmwareVersion
HCLStatus = $_.HclStatus
}
} | Format-Table -AutoSize
| Result | Condition | Action |
|---|---|---|
| PASS | All controllers/drivers/firmware on HCL | Fully certified |
| WARN | HCL database outdated (> 90 days) | Update HCL DB |
| FAIL | Controller, driver, or firmware NOT on HCL | Update driver/firmware to certified version |
The HCL database is bundled with vCenter and should be updated regularly.
esxcli vsan health cluster list -t "vSAN HCL DB Up-To-Date"
$healthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
# Online: pull the latest HCL DB directly from VMware
$healthSystem.VsanVcUpdateHclDbFromWeb()
# Offline (air-gapped): upload a manually downloaded all.json
$jsonContent = Get-Content -Path "C:\path\to\all.json" -Raw
$healthSystem.VsanVcUploadHclDb($jsonContent)
The following ports must be open for vSAN communication between all participating hosts and vCenter.
| Port | Protocol | Direction | Service | Description |
|---|---|---|---|---|
| 2233 | TCP/UDP | Host <-> Host | vSAN Transport | Primary vSAN data transport (IO traffic) |
| 12321 | UDP | Host <-> Host | vSAN Clustering (Unicast) | Unicast agent-to-agent communication |
| 12345 | UDP | Host <-> Host | vSAN Clustering (Multicast) | Multicast master group (legacy) |
| 23451 | UDP | Host <-> Host | vSAN Clustering (Multicast) | Multicast agent group (legacy) |
| 8080 | TCP | Host -> vCenter | vSAN Health | Health check data upload |
| 6500 | TCP | Host -> vCenter | vSAN VASA | VASA provider for storage policies |
| 8006 | TCP | vCenter -> Host | vSAN VASA | VASA provider callback |
| 443 | TCP | Host <-> vCenter | HTTPS | vSphere API, management |
| 902 | TCP/UDP | Host <-> vCenter | NFC/Heartbeat | Network file copy, host heartbeat |
| 8010 | TCP | Host -> vCenter | vSAN Performance | Performance data upload |
| 2233 | TCP | Host <-> Witness | vSAN Transport | Witness traffic (stretched cluster) |
| 12321 | UDP | Host <-> Witness | vSAN Clustering | Witness cluster communication |
| 514 | UDP | Host -> Syslog | Syslog | vSAN log forwarding |
| 8100 | TCP | Host <-> Host | vSAN RDMA | RDMA transport (ESA with RDMA NICs) |
| 8200 | TCP | Host <-> Host | vSAN RDMA | RDMA transport secondary |
# Verify vSAN firewall rules on ESXi host
esxcli network firewall ruleset list | grep -i vsan
# Check if vSAN ports are open
esxcli network firewall ruleset rule list -r vsanvp
esxcli network firewall ruleset rule list -r vsanEncryption
esxcli network firewall ruleset rule list -r vsanhealth
# From each ESXi host, test TCP 2233 to peers
nc -z -w3 172.16.10.102 2233 && echo "OK" || echo "FAIL"
nc -z -w3 172.16.10.103 2233 && echo "OK" || echo "FAIL"
nc -z -w3 172.16.10.104 2233 && echo "OK" || echo "FAIL"
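The per-peer `nc` checks above can be generalized with a small script. This Python sketch is illustrative only; the peer IPs are the same placeholders used in the `nc` examples, and TCP 2233 is the vSAN transport port from the table.

```python
# Illustrative TCP reachability check for the vSAN transport port (2233).
import socket

def tcp_port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Example sweep (placeholder IPs from the nc examples above):
# for peer in ["172.16.10.102", "172.16.10.103", "172.16.10.104"]:
#     print(peer, "OK" if tcp_port_open(peer, 2233) else "FAIL")
```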
Symptom: esxcli vsan storage list shows Health Status: Failed, or a disk is missing. Correlate with disk errors in /var/log/vmkernel.log.
# Check disk status
esxcli vsan storage list | grep -E "Display Name|Health Status"
# Check SMART data
esxcli vsan debug disk smart get -d naa.<disk_id>
# Check kernel log for disk errors
grep -i "disk error\|I/O error\|medium error" /var/log/vmkernel.log | tail -20
# Check vSAN trace for disk events
grep -i "disk" /var/log/vsantraced.log | tail -20
# OSA: replace a failed cache disk by recreating the disk group
esxcli vsan storage add -s naa.new_cache -d naa.cap1 -d naa.cap2
# OSA: replace a failed capacity disk
esxcli vsan storage remove -d naa.failed_disk
esxcli vsan storage add -d naa.new_disk -s naa.cache_disk
esxcli vsan debug resync summary
# ESA: replace a failed NVMe device
esxcli vsan storage remove -d naa.failed_nvme
esxcli vsan storage add -d naa.new_nvme
esxcli vsan debug resync summary
Symptom: esxcli vsan cluster get shows different member counts on different hosts (cluster partition).
# Check cluster membership on each host
esxcli vsan cluster get
# Check network connectivity
vmkping -I vmk1 <peer_vsan_ip>
# Check physical NIC status
esxcli network nic stats get -n vmnic2
# Check for CRC errors, drops, overruns
esxcli network nic stats get -n vmnic2 | grep -i "error\|drop\|overrun"
# Check switch port channel status
esxcli network vswitch dvs vmware lacp status get
# Review the vDS configuration and LACP state
esxcli network vswitch dvs vmware list
esxcli network vswitch dvs vmware lacp status get
# Verify jumbo frames end to end (8972-byte payload, don't-fragment)
vmkping -I vmk1 -s 8972 -d <peer>
# If the vSAN VMkernel binding is suspect, remove and re-add it
esxcli vsan network remove -i vmk1
esxcli vsan network ip add -i vmk1
# Check resync volume
esxcli vsan debug resync summary
# Check network utilization
esxtop # Press 'n' for network view, look at vmk1 throughput
# Check throttle settings
esxcli system settings advanced list -o /VSAN/ResyncThrottleAdaptive
# Enable adaptive resync throttling
esxcli system settings advanced set -o /VSAN/ResyncThrottleAdaptive -i 1
# Cap resync bandwidth manually
esxcli system settings advanced set -o /VSAN/ResyncBandwidthCap -i 500
# Remove the cap (0 = no cap)
esxcli system settings advanced set -o /VSAN/ResyncBandwidthCap -i 0
# Check congestion
esxcli vsan debug controller list
# Check disk latency
esxcli vsan debug disk latency get
# Check for noisy neighbor VMs
esxtop # Press 'v' for VM disk view, sort by DAVG (device average latency)
# Check if resyncs are causing pressure
esxcli vsan debug resync summary
# Check cache tier utilization (OSA only)
vsish -e get /vmkModules/lsom/disks/<cache_uuid>/info | grep -i cache
CLOM (Cluster Level Object Manager) is the vSAN component responsible for object placement and repair. CLOM errors indicate placement failures.
# Check CLOM log for errors
grep -i "error\|fail\|cannot place" /var/log/clomd.log | tail -30
# Check component limits
esxcli vsan health cluster list -t "Host Component Limit"
# Check CLOM status
/etc/init.d/clomd status
# List objects with placement issues
esxcli vsan debug object list --type=nonCompliant
| Error | Cause | Fix |
|---|---|---|
| Not enough fault domains | FTT > available FDs | Add hosts/FDs or reduce FTT |
| Not enough disk space | Capacity > 80% | Free space or add capacity |
| Component limit reached | > 9000 components/host | Reduce FTT, consolidate VMs, or add hosts |
| Cannot place | Combination of above | Analyze specific constraint from log |
| Disk group offline | Cache disk failure (OSA) | Replace cache disk, recreate DG |
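The component-limit row (> 9000 components per host) can be reasoned about with a rough estimator. This is not an official formula: it uses the documented 255 GB maximum component size and RAID-1 mirroring, and ignores stripe width, snapshots, and any extra witnesses CLOM creates to break ties.

```python
# Rough, illustrative component-count estimator for a RAID-1 object.
import math

def estimate_components(vmdk_gb: float, ftt: int = 1,
                        max_component_gb: int = 255) -> int:
    """Estimate components: (FTT+1) mirrors, each split at 255 GB, + 1 witness."""
    per_replica = max(1, math.ceil(vmdk_gb / max_component_gb))
    replicas = ftt + 1
    witnesses = 1  # at least one; CLOM may add more
    return replicas * per_replica + witnesses

assert estimate_components(100) == 3          # 2 mirrors + 1 witness
assert estimate_components(500, ftt=1) == 5   # 2 x 2 splits + 1 witness
```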
/etc/init.d/clomd restart
# Get cluster status
esxcli vsan cluster get
# Join a vSAN cluster
esxcli vsan cluster join -c <cluster-uuid>
# Leave a vSAN cluster
esxcli vsan cluster leave
# Restore cluster from backup
esxcli vsan cluster restore -c <cluster-uuid>
# List all health checks
esxcli vsan health cluster list
# Run a specific health test
esxcli vsan health cluster list -t "<test name>"
# Get health summary
esxcli vsan health cluster get
# List all vSAN disks
esxcli vsan storage list
# Add a disk to vSAN (OSA - with cache disk)
esxcli vsan storage add -d naa.<capacity_disk> -s naa.<cache_disk>
# Add a disk to vSAN (ESA)
esxcli vsan storage add -d naa.<nvme_disk>
# Remove a disk from vSAN
esxcli vsan storage remove -d naa.<disk_id>
# Auto-claim disks
esxcli vsan storage automode set -e true
# List vSAN network interfaces
esxcli vsan network list
# Add a VMkernel interface to vSAN
esxcli vsan network ip add -i vmk1
# Remove a VMkernel interface from vSAN
esxcli vsan network remove -i vmk1
# Test connectivity with jumbo frames
vmkping -I vmk1 -s 8972 -d <target_ip>
# Test standard connectivity
vmkping -I vmk1 <target_ip>
# Resync summary
esxcli vsan debug resync summary
# Object health summary
esxcli vsan debug object health summary get
# List objects by type
esxcli vsan debug object list --type=inaccessible
esxcli vsan debug object list --type=reducedRedundancy
esxcli vsan debug object list --type=nonCompliant
# Disk SMART data
esxcli vsan debug disk smart get -d naa.<disk_id>
# Controller info (congestion, outstanding IO)
esxcli vsan debug controller list
# Space usage details
esxcli vsan debug space show
# Disk latency
esxcli vsan debug disk latency get
# Show the default vSAN storage policy
esxcli vsan policy getdefault
# Set the default vSAN policy for an object class (e.g. vdisk)
esxcli vsan policy setdefault -c vdisk -p "((\"hostFailuresToTolerate\" i1) (\"proportionalCapacity\" i0))"
# Enter maintenance mode (ensure accessibility)
esxcli system maintenanceMode set -e true -m ensureAccessibility
# Enter maintenance mode (full data migration)
esxcli system maintenanceMode set -e true -m evacuateAllData
# Enter maintenance mode (no data migration)
esxcli system maintenanceMode set -e true -m noAction
# Exit maintenance mode
esxcli system maintenanceMode set -e false
# List all vSAN advanced settings
esxcli system settings advanced list -o /VSAN
# Common performance-related settings
esxcli system settings advanced list -o /VSAN/ResyncThrottleAdaptive
esxcli system settings advanced list -o /VSAN/ResyncBandwidthCap
esxcli system settings advanced list -o /LSOM/lsomResyncThrottleEnabled
# Set a vSAN advanced parameter
esxcli system settings advanced set -o /VSAN/ResyncThrottleAdaptive -i 1
# vSAN trace log
/var/log/vsantraced.log
# CLOMD (object placement) log
/var/log/clomd.log
# vSAN management log (on VCSA)
/var/log/vmware/vpxd/vpxd.log # (vSAN operations logged here)
# vSAN health log (on VCSA)
/var/log/vmware/vsanHealth/vsanhealth.log
# VMkernel log (disk errors, IO errors)
/var/log/vmkernel.log
# Syslog (general ESXi system log)
/var/log/syslog.log
# vSAN observer data (if enabled)
/var/log/vsan/observer/
# Install PowerCLI
Install-Module -Name VMware.PowerCLI -Scope CurrentUser -Force
# Connect to vCenter
Connect-VIServer -Server vcsa-01.vcf.local -User administrator@vsphere.local
# Ignore certificate errors (lab only)
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false
# Get vSAN cluster configuration
$cluster = Get-Cluster "SDDC-Cluster1"
Get-VsanClusterConfiguration -Cluster $cluster
# Get cluster hosts
Get-VMHost -Location $cluster | Select Name, ConnectionState, PowerState
# Get vSAN datastore
Get-Datastore -RelatedObject $cluster | Where-Object { $_.Type -eq "vsan" }
# Get vSAN health summary
$cluster = Get-Cluster "SDDC-Cluster1"
$healthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$summary = $healthSystem.VsanQueryVcClusterHealthSummary(
$cluster.ExtensionData.MoRef,
$null, $null, $true, $null, $null, "defaultView"
)
# Display overall health
$summary.OverallHealth
$summary.OverallHealthDescription
# Display per-group health
$summary.Groups | ForEach-Object {
[PSCustomObject]@{
Group = $_.GroupName
Health = $_.GroupHealth
}
} | Format-Table -AutoSize
# Get vSAN space usage
$cluster = Get-Cluster "SDDC-Cluster1"
Get-VsanSpaceUsage -Cluster $cluster
# Detailed space breakdown
$space = Get-VsanSpaceUsage -Cluster $cluster
[PSCustomObject]@{
"Total (TB)" = [math]::Round($space.TotalCapacityGB / 1024, 2)
"Used (TB)" = [math]::Round($space.UsedCapacityGB / 1024, 2)
"Free (TB)" = [math]::Round($space.FreeCapacityGB / 1024, 2)
"Used %" = [math]::Round(($space.UsedCapacityGB / $space.TotalCapacityGB) * 100, 1)
"Dedup Ratio" = [math]::Round($space.DedupRatio, 2)
"Compression Ratio"= [math]::Round($space.CompressionRatio, 2)
}
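The space breakdown feeds a headroom decision: can the cluster absorb a rebuild? A hedged sketch follows; the 25% slack figure and the one-host rebuild share are rules of thumb from common vSAN sizing guidance, not VCF-enforced thresholds.

```python
# Illustrative headroom check (not a VMware API).
def capacity_headroom(total_gb: float, free_gb: float, n_hosts: int,
                      slack_fraction: float = 0.25) -> dict:
    """Check free space against slack guidance and a one-host rebuild."""
    slack_needed = total_gb * slack_fraction   # ~25% slack rule of thumb
    host_share = total_gb / n_hosts            # capacity to rebuild if one host fails
    return {
        "slack_ok": free_gb >= slack_needed,
        "rebuild_ok": free_gb >= host_share,
    }

result = capacity_headroom(total_gb=100_000, free_gb=30_000, n_hosts=8)
print(result)
```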
# List all vSAN disks
$cluster = Get-Cluster "SDDC-Cluster1"
Get-VsanDisk -Cluster $cluster | Select VsanDiskGroup, CanonicalName, IsCacheDisk, CapacityGB
# Get disk groups per host
$hosts = Get-VMHost -Location $cluster
$hosts | ForEach-Object {
    $vmHost = $_
    Get-VsanDiskGroup -VMHost $vmHost | ForEach-Object {
        [PSCustomObject]@{
            Host      = $vmHost.Name
            DiskGroup = $_.Name
            DiskCount = (Get-VsanDisk -VsanDiskGroup $_).Count
        }
    }
} | Format-Table -AutoSize
# List all vSAN storage policies
Get-SpbmStoragePolicy | Where-Object { $_.Name -like "*vSAN*" } |
Select Name, Description
# Check VM compliance
$vms = Get-VM -Location (Get-Cluster "SDDC-Cluster1")
$vms | ForEach-Object {
    $vm = $_
    Get-SpbmEntityConfiguration -VM $vm |
        Where-Object { $_.ComplianceStatus -ne "compliant" } |
        ForEach-Object {
            [PSCustomObject]@{
                VM     = $vm.Name
                Entity = $_.Entity
                Status = $_.ComplianceStatus
                Policy = $_.StoragePolicy.Name
            }
        }
} | Format-Table -AutoSize
# Create a new vSAN storage policy
New-SpbmStoragePolicy -Name "vSAN-FTT1-RAID1" -Description "FTT=1 RAID-1 Mirroring" -RuleSet (
    New-SpbmRuleSet -Name "vSAN" -AllOfRules @(
        (New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate") -Value 1),
        (New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.replicaPreference") -Value "RAID-1 (Mirroring) - Performance")
    )
)
# List fault domains
$cluster = Get-Cluster "SDDC-Cluster1"
Get-VsanFaultDomain -Cluster $cluster | ForEach-Object {
[PSCustomObject]@{
Name = $_.Name
Hosts = ($_.VMHost.Name -join ", ")
}
} | Format-Table -AutoSize
# Create a new fault domain
New-VsanFaultDomain -Name "rack-05" -VMHost (Get-VMHost "esx-05.vcf.local")
# Remove a fault domain
Remove-VsanFaultDomain -VsanFaultDomain (Get-VsanFaultDomain -Name "rack-05")
# Get stretched cluster configuration
$cluster = Get-Cluster "SDDC-Cluster1"
$config = Get-VsanClusterConfiguration -Cluster $cluster
[PSCustomObject]@{
StretchedCluster = $config.StretchedClusterEnabled
PreferredSite = $config.PreferredFaultDomain.Name
WitnessHost = $config.WitnessHost.Name
}
# Set preferred fault domain
Set-VsanClusterConfiguration -Cluster $cluster -PreferredFaultDomain (
Get-VsanFaultDomain -Name "site-a"
)
# Enable performance service
$cluster = Get-Cluster "SDDC-Cluster1"
Set-VsanClusterConfiguration -Cluster $cluster -PerformanceServiceEnabled $true
# Check performance service status
(Get-VsanClusterConfiguration -Cluster $cluster).PerformanceServiceEnabled
# Query performance data
$vsanPerfMgr = Get-VsanView -Id "VsanPerformanceManager-vsan-performance-manager"
$spec = New-Object VMware.Vsan.Views.VsanPerfQuerySpec
$spec.EntityRefId = "cluster-domclient:*"
$spec.StartTime = (Get-Date).AddHours(-1)
$spec.EndTime = Get-Date
$vsanPerfMgr.VsanPerfQueryPerf(@($spec), $cluster.ExtensionData.MoRef)
# Enter maintenance mode (ensure accessibility)
$vmHost = Get-VMHost "esx-01.vcf.local"
Set-VMHost -VMHost $vmHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility
# Enter maintenance mode (full evacuation)
Set-VMHost -VMHost $vmHost -State Maintenance -VsanDataMigrationMode Full
# Exit maintenance mode
Set-VMHost -VMHost $vmHost -State Connected
# Pre-check maintenance mode (dry run)
$vsanHealthSystem = Get-VsanView -Id "VsanVcClusterHealthSystem-vsan-cluster-health-system"
$vsanHealthSystem.VsanQueryVcClusterHealthSummary(
(Get-Cluster "SDDC-Cluster1").ExtensionData.MoRef,
$null, $null, $true, $null, $null, "maintenanceMode"
)
# Full vSAN Health Report
function Get-VsanHealthReport {
param(
[string]$ClusterName = "SDDC-Cluster1"
)
$cluster = Get-Cluster $ClusterName
$config = Get-VsanClusterConfiguration -Cluster $cluster
$space = Get-VsanSpaceUsage -Cluster $cluster
$hosts = Get-VMHost -Location $cluster
Write-Host "========================================" -ForegroundColor Cyan
Write-Host " vSAN Health Report: $ClusterName" -ForegroundColor Cyan
Write-Host " Generated: $(Get-Date)" -ForegroundColor Cyan
Write-Host "========================================" -ForegroundColor Cyan
# Cluster Config
Write-Host "`n--- Cluster Configuration ---" -ForegroundColor Yellow
Write-Host " Hosts: $($hosts.Count)"
Write-Host " vSAN Enabled: $($config.VsanEnabled)"
Write-Host " Stretched: $($config.StretchedClusterEnabled)"
Write-Host " Perf Service: $($config.PerformanceServiceEnabled)"
# Capacity
Write-Host "`n--- Capacity ---" -ForegroundColor Yellow
$usedPct = [math]::Round(($space.UsedCapacityGB / $space.TotalCapacityGB) * 100, 1)
Write-Host " Total: $([math]::Round($space.TotalCapacityGB / 1024, 2)) TB"
Write-Host " Used: $([math]::Round($space.UsedCapacityGB / 1024, 2)) TB ($usedPct%)"
Write-Host " Free: $([math]::Round($space.FreeCapacityGB / 1024, 2)) TB"
if ($usedPct -gt 80) {
Write-Host " STATUS: CRITICAL" -ForegroundColor Red
} elseif ($usedPct -gt 70) {
Write-Host " STATUS: WARNING" -ForegroundColor Yellow
} else {
Write-Host " STATUS: HEALTHY" -ForegroundColor Green
}
# Host Status
Write-Host "`n--- Host Status ---" -ForegroundColor Yellow
foreach ($h in $hosts) {
$state = $h.ConnectionState
$color = if ($state -eq "Connected") { "Green" } else { "Red" }
Write-Host " $($h.Name): $state" -ForegroundColor $color
}
# Disk Health
Write-Host "`n--- Disk Health ---" -ForegroundColor Yellow
$disks = Get-VsanDisk -Cluster $cluster
Write-Host " Total Disks: $($disks.Count)"
# Policy Compliance
Write-Host "`n--- Policy Compliance ---" -ForegroundColor Yellow
$vms = Get-VM -Location $cluster
$nonCompliant = 0
foreach ($vm in $vms) {
$compliance = Get-SpbmEntityConfiguration -VM $vm -ErrorAction SilentlyContinue
$nonCompliant += ($compliance | Where-Object { $_.ComplianceStatus -ne "compliant" }).Count
}
if ($nonCompliant -eq 0) {
Write-Host " All VMs compliant" -ForegroundColor Green
} else {
Write-Host " Non-compliant entities: $nonCompliant" -ForegroundColor Red
}
Write-Host "`n========================================" -ForegroundColor Cyan
Write-Host " Report Complete" -ForegroundColor Cyan
Write-Host "========================================" -ForegroundColor Cyan
}
# Execute the report
Get-VsanHealthReport -ClusterName "SDDC-Cluster1"
vSAN Health Check Handbook
Version 1.0 -- March 2026
Copyright 2026 Virtual Control LLC. All rights reserved.
This document is for internal use only and may not be distributed without written permission.
VMware, vSAN, vSphere, vCenter, ESXi, and VCF are registered trademarks of Broadcom Inc.