This handbook provides a complete, step-by-step health check procedure for VMware NSX 4.x deployed within a VCF 9.0 environment. It is designed for VMware administrators who need to verify NSX health across the following areas:
| Area | Components Checked |
|---|---|
| Management Plane | NSX Manager cluster, services, certificates, backups |
| Control Plane | Controller connectivity, transport node preparation |
| Data Plane | TEP connectivity, tunnels, BFD sessions |
| Edge Services | Edge clusters, HA state, routing, BGP |
| Security | Distributed firewall, gateway firewall, rule realization |
| Networking | Segments, logical switches, MAC/ARP/VTEP tables |
Each check in this handbook follows a consistent format:
- `$NSX_VIP` = NSX Manager VIP (e.g., nsx-vip.lab.local)
- `$NSX_NODE1/2/3` = Individual NSX Manager nodes
- `$NSX_USER` = admin
- `$NSX_PASS` = NSX Manager admin password
| Access Type | Target | Credentials Needed |
|---|---|---|
| HTTPS (443) | NSX Manager VIP | admin / password |
| SSH (22) | Each NSX Manager node | admin or root |
| SSH (22) | ESXi hosts (transport nodes) | root |
| SSH (22) | Edge nodes (bare metal) | admin or root |
| API | NSX Manager VIP:443 | admin / password |
| Tool | Purpose | Install Location |
|---|---|---|
| `curl` | API calls to NSX Manager | Jumpbox / workstation |
| `ssh` | CLI access to NSX / ESXi / Edge | Jumpbox / workstation |
| `jq` | JSON parsing for API output | `apt install jq` or `brew install jq` |
| `openssl` | Certificate verification | Pre-installed on Linux/Mac |
| Web Browser | NSX Manager UI verification | Workstation |
# NSX Manager
export NSX_VIP="nsx-vip.lab.local"
export NSX_NODE1="nsx-mgr-01.lab.local"
export NSX_NODE2="nsx-mgr-02.lab.local"
export NSX_NODE3="nsx-mgr-03.lab.local"
export NSX_USER="admin"
export NSX_PASS="YourPassword123!"
# Convenience function for NSX API calls
nsx_api() {
curl -sk -u "$NSX_USER:$NSX_PASS" \
-H "Content-Type: application/json" \
"https://$NSX_VIP$1" 2>/dev/null
}
# Example usage:
# nsx_api /api/v1/cluster/status | jq .
| # | Check | Method | Pass Criteria | Warn Criteria | Fail Criteria |
|---|---|---|---|---|---|
| 4.1 | Manager Cluster Status | API | STABLE | DEGRADED | UNSTABLE / unreachable |
| 4.2 | Node Health | API | All nodes CONNECTED | 1 node degraded | 2+ nodes down |
| 4.3 | Service Status | CLI | All services running | Non-critical stopped | Critical service stopped |
| 5.1 | Manager CPU | CLI/API | < 70% | 70-85% | > 85% |
| 5.2 | Manager Memory | CLI/API | < 75% | 75-90% | > 90% |
| 5.3 | Manager Disk | CLI | < 70% used | 70-85% used | > 85% used |
| 6.1 | Certificates | API | > 30 days to expiry | 7-30 days | < 7 days or expired |
| 7.1 | Backup Config | API | Configured & recent | > 24h since last | No backup configured |
| 8.1 | Transport Nodes | API | All SUCCESS | Any IN_PROGRESS | Any FAILED |
| 8.3 | NSX Agents | CLI | nsx-mpa + nsx-proxy running | Agent restarting | Agent not running |
| 9.1 | TEP Connectivity | CLI | vmkping MTU 1600 succeeds | Intermittent loss | vmkping fails |
| 9.2 | BFD Sessions | CLI | All UP | Any INIT | Any DOWN |
| 10.1 | Edge Cluster | API | All members UP | 1 member degraded | Multiple members down |
| 10.2 | Edge HA | API | ACTIVE/STANDBY correct | HA state mismatch | Both STANDBY |
| 11.1 | BGP Neighbors | CLI | All Established | Any OpenConfirm | Any Idle/Active |
| 11.3 | Gateway State | API | REALIZED | IN_PROGRESS | ERROR |
| 12.1 | DFW Realization | API | All rules REALIZED | Any IN_PROGRESS | Any ERROR |
| 14.1 | Segments | API | All SUCCESS | Any IN_PROGRESS | Any ERROR |
| 15 | Alarms | API | 0 critical alarms | Warning alarms present | Critical alarms present |
What: Verify the NSX Manager cluster is stable with all 3 nodes participating.
Why: A degraded management cluster means reduced redundancy. An unstable cluster can cause configuration failures and data loss.
# Get cluster status
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/cluster/status" | jq .
Expected Output (Healthy):
{
"detailed_cluster_status": {
"overall_status": "STABLE",
"groups": [
{
"group_type": "CONTROLLER",
"group_status": "STABLE",
"members": [
{
"member_uuid": "abc123...",
"member_status": "UP",
"member_fqdn": "nsx-mgr-01.lab.local"
},
{
"member_uuid": "def456...",
"member_status": "UP",
"member_fqdn": "nsx-mgr-02.lab.local"
},
{
"member_uuid": "ghi789...",
"member_status": "UP",
"member_fqdn": "nsx-mgr-03.lab.local"
}
]
},
{
"group_type": "MANAGER",
"group_status": "STABLE",
"members": [...]
},
{
"group_type": "POLICY",
"group_status": "STABLE",
"members": [...]
}
]
},
"mgmt_cluster_status": {
"status": "STABLE"
},
"control_cluster_status": {
"status": "STABLE"
}
}
ssh admin@$NSX_NODE1
# Then run:
get cluster status
Expected Output:
Cluster Id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Overall Status: STABLE
Group Type: CONTROLLER
Group Status: STABLE
Leader: nsx-mgr-01.lab.local
Group Type: MANAGER
Group Status: STABLE
Leader: nsx-mgr-01.lab.local
Group Type: POLICY
Group Status: STABLE
Leader: nsx-mgr-02.lab.local
| Result | Criteria | Indicator |
|---|---|---|
| PASS | overall_status = STABLE, all groups STABLE | All 3 nodes UP, all group leaders elected |
| WARN | overall_status = DEGRADED | 1 node down but quorum maintained |
| FAIL | overall_status = UNSTABLE or unreachable | 2+ nodes down, no quorum, API unresponsive |
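For scripted checks, the verdict mapping in this table can be captured in a small shell helper. This is a minimal sketch; the function name `cluster_verdict` is illustrative, and live use assumes the `nsx_api` helper from the setup section:

```shell
# Map the API's overall_status field to a health-check verdict,
# following the PASS/WARN/FAIL criteria in the table above.
cluster_verdict() {
  case "$1" in
    STABLE)   echo PASS ;;
    DEGRADED) echo WARN ;;
    *)        echo FAIL ;;   # UNSTABLE, unknown, or unreachable
  esac
}

# Intended live usage (requires network access to the manager):
#   STATUS=$(nsx_api /api/v1/cluster/status | jq -r '.detailed_cluster_status.overall_status')
#   cluster_verdict "$STATUS"
```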
If the cluster is not stable, useful remediation steps include:
- `get cluster status`
- `ping $NSX_NODE1`
- `restart cluster boot-manager`
- Review `/var/log/cluster-manager/cluster-manager.log`

What: Check the health status of each individual NSX Manager node.
Why: A node can be in the cluster but unhealthy (high load, service failures).
# List all cluster nodes with health
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/cluster/nodes" | jq '.results[] | {
fqdn: .fqdn,
ip: .ip_address,
status: .status,
version: .node_version
}'
# Per-node health via appliance API
for NODE in $NSX_NODE1 $NSX_NODE2 $NSX_NODE3; do
echo "=== $NODE ==="
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NODE/api/v1/node" | jq '{
hostname: .hostname,
kernel_version: .kernel_version,
product_version: .product_version,
uptime: .system_time
}'
done
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All 3 nodes report status CONNECTED and matching versions | Healthy cluster |
| WARN | Version mismatch between nodes | Possible mid-upgrade state |
| FAIL | Any node DISCONNECTED or unreachable | Node failure |
What: Verify all critical NSX Manager services are running on each node.
Why: Individual service failures can cause specific feature outages even when the cluster appears healthy.
ssh admin@$NSX_NODE1
get services
Expected Output (all running):
Service Name Service Status
-------------------------------------
async_replicator running
cluster-manager running
controller running
corfu running
corfu-nonconfig running
datastore running
http running
install-upgrade running
liagent running
manager running
migration-coordinator running
monitoring running
nsx-message-bus running
nsx-ui-plugin running
phonehome running
platform-client running
policy running
proton running
search running
syslog-server running
upgrade-coordinator running
# Get node services status
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/node/services" | jq '.results[] | {
service: .service_name,
status: .service_status.runtime_state
}'
| Service | Function | Impact if Down |
|---|---|---|
| `controller` | Control plane | Transport node connectivity loss |
| `corfu` | Distributed datastore | Configuration data unavailable |
| `manager` | Management plane | API / UI unavailable |
| `policy` | Policy engine | Security policy changes fail |
| `http` | Web server / API | All API calls fail |
| `proton` | Message bus | Cluster communication failure |
| `datastore` | Data persistence | Config persistence failure |
| `search` | Search engine | UI search / inventory failure |
If `controller`, `corfu`, or `manager` are not running, the NSX environment is in a critical state. Do not make configuration changes until these services are restored.
What: Check CPU and memory utilization on each NSX Manager node.
Why: NSX Manager nodes under resource pressure may respond slowly or fail to process API requests, causing cascading issues across the environment.
ssh admin@$NSX_NODE1
# CPU utilization
get node-stats
# Node status with resource utilization
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_NODE1/api/v1/node/status" | jq '{
cpu_cores: .cpu_cores,
load_average: .load_average,
mem_total: .mem_total,
mem_used: .mem_used,
mem_cache: .mem_cache,
swap_total: .swap_total,
swap_used: .swap_used,
uptime: .uptime
}'
Expected Output (Healthy):
{
"cpu_cores": 12,
"load_average": [1.2, 1.5, 1.8],
"mem_total": 49152,
"mem_used": 32768,
"mem_cache": 8192,
"swap_total": 8192,
"swap_used": 0,
"uptime": 8640000
}
| Metric | PASS | WARN | FAIL |
|---|---|---|---|
| CPU Load Average (per core) | < 0.7 | 0.7 - 0.85 | > 0.85 |
| Memory Used % | < 75% | 75% - 90% | > 90% |
| Swap Used | 0 MB | < 1 GB | > 1 GB |
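The thresholds above can be evaluated automatically from the `/api/v1/node/status` fields. A minimal sketch (the function name `node_resource_verdict` is my own; arguments are the 1-minute load average, CPU core count, and memory used/total in MB, as returned by the API):

```shell
# Classify a node's load-per-core and memory usage against the
# PASS/WARN/FAIL thresholds in the table above.
node_resource_verdict() {
  local load=$1 cores=$2 mem_used=$3 mem_total=$4
  local per_core mem_pct
  # Compute percentages with awk, truncated to integers for shell comparison
  per_core=$(awk -v l="$load" -v c="$cores" 'BEGIN { printf "%d", l / c * 100 }')
  mem_pct=$(awk -v u="$mem_used" -v t="$mem_total" 'BEGIN { printf "%d", u / t * 100 }')
  if [ "$per_core" -gt 85 ] || [ "$mem_pct" -gt 90 ]; then echo FAIL
  elif [ "$per_core" -ge 70 ] || [ "$mem_pct" -ge 75 ]; then echo WARN
  else echo PASS; fi
}

# Example with the healthy sample values shown earlier:
#   node_resource_verdict 1.2 12 32768 49152
```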
To investigate further:
- `get service http connection-limit`
- `top -b -n 1 | head -20` (as root)
What: Check disk space utilization on all NSX Manager filesystem partitions.
Why: Full disks cause service crashes, log loss, and database corruption.
ssh admin@$NSX_NODE1
# As admin:
get filesystem-info
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_NODE1/api/v1/node/status" | jq '.filesystem'
ssh root@$NSX_NODE1
df -h
Expected Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 40G 15G 23G 40% /
/dev/sda3 80G 28G 48G 37% /nonconfig-datastore
/dev/sda5 40G 12G 26G 32% /config-datastore
/dev/sda6 20G 4G 15G 21% /image
tmpfs 24G 12M 24G 1% /dev/shm
| Filesystem | PASS | WARN | FAIL |
|---|---|---|---|
| `/` (root) | < 70% | 70-85% | > 85% |
| `/nonconfig-datastore` | < 70% | 70-85% | > 85% |
| `/config-datastore` | < 70% | 70-85% | > 85% |
| `/image` | < 80% | 80-90% | > 90% |
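A quick way to apply the general 70/85% thresholds to every mount is to filter `df` output through awk. A minimal sketch (the helper name `disk_verdict` is illustrative; note `/image` uses the looser 80/90% thresholds per the table, which this simple version does not special-case):

```shell
# Parse `df`-style output on stdin and print "<mount> <use%> <verdict>"
# per filesystem, using the generic 70% WARN / 85% FAIL thresholds.
disk_verdict() {
  awk 'NR > 1 {
    pct = $5 + 0                 # numeric coercion strips the trailing %
    v = "PASS"
    if (pct > 85) v = "FAIL"; else if (pct >= 70) v = "WARN"
    print $6, pct "%", v
  }'
}

# Usage on an NSX Manager node (as root): df -h | disk_verdict
```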
Remediation steps:
- `ls -la /var/log/` — identify large files
- `ls -la /image/`
- `logrotate -f /etc/logrotate.conf`
- `get service corfu compaction-status`
What: Check if any resource-related alarms are active on NSX Manager nodes.
# Check system alarms related to resources
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/alarms?feature_name=system" | jq '.results[] | {
id: .id,
severity: .severity,
description: .description,
status: .status,
entity_id: .entity_id
}'
What: Retrieve all certificates managed by NSX and verify none are expired or expiring soon.
Why: Expired certificates cause communication failures between NSX components, transport nodes, and external integrations.
# List all certificates
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/trust-management/certificates" | jq '.results[] | {
id: .id,
display_name: .display_name,
not_before: .not_before,
  not_after: .not_after
}'
# Check certificate expiry with days remaining
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/trust-management/certificates" | jq -r '
.results[] |
"\(.display_name)\t\(.not_after)\t\(
((.not_after / 1000) - (now | floor)) / 86400 | floor
) days remaining"'
Expected Output:
nsx-mgr-01-cert 1774003200000 365 days remaining
nsx-mgr-02-cert 1774003200000 365 days remaining
nsx-mgr-03-cert 1774003200000 365 days remaining
nsx-api-cert 1774003200000 365 days remaining
mp-cluster-cert 1774003200000 365 days remaining
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All certificates > 30 days from expiry | Healthy |
| WARN | Any certificate 7-30 days from expiry | Plan replacement |
| FAIL | Any certificate < 7 days or expired | Immediate action required |
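The expiry thresholds can be scripted directly against the `not_after` field (epoch milliseconds) returned by the trust-management API. A minimal sketch; the function name `cert_verdict` is my own, and the second argument (current time, epoch seconds) defaults to now:

```shell
# Classify a certificate's not_after timestamp (epoch ms) against the
# 7-day / 30-day thresholds in the table above.
cert_verdict() {
  local not_after_ms=$1 now_s=${2:-$(date +%s)}
  local days=$(( (not_after_ms / 1000 - now_s) / 86400 ))
  if [ "$days" -lt 7 ]; then echo FAIL
  elif [ "$days" -le 30 ]; then echo WARN
  else echo PASS; fi
}

# Live usage sketch, assuming the nsx_api helper:
#   nsx_api /api/v1/trust-management/certificates | \
#     jq -r '.results[] | "\(.display_name) \(.not_after)"' | \
#     while read -r NAME MS; do echo "$NAME: $(cert_verdict "$MS")"; done
```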
# Verify the NSX Manager certificate externally
echo | openssl s_client -connect $NSX_VIP:443 2>/dev/null | \
openssl x509 -noout -dates -subject -issuer
Expected Output:
notBefore=Jan 15 00:00:00 2026 GMT
notAfter=Jan 15 00:00:00 2028 GMT
subject=CN = nsx-vip.lab.local
issuer=CN = NSX CA
What: Verify that NSX Manager backups are configured and running on schedule.
Why: Without valid backups, any cluster failure could result in complete NSX configuration loss.
# Get backup configuration
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/cluster/backups/config" | jq '{
backup_enabled: .backup_enabled,
backup_schedule: .backup_schedule,
remote_file_server: .remote_file_server.server,
remote_path: .remote_file_server.directory_path
}'
Expected Output (Healthy):
{
"backup_enabled": true,
"backup_schedule": {
"resource_type": "WeeklyBackupSchedule",
"days_of_week": ["MONDAY", "WEDNESDAY", "FRIDAY"],
"hour_of_day": 2,
"minute_of_day": 0
},
"remote_file_server": "backup-server.lab.local",
"remote_path": "/backups/nsx/"
}
# Check backup history
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/cluster/backups/history" | jq '.results[0:5] | .[] | {
start_time: .start_time,
end_time: .end_time,
backup_status: .success,
node: .node_id
}'
| Result | Criteria | Indicator |
|---|---|---|
| PASS | Backup configured, last successful < 24 hours ago | Healthy |
| WARN | Backup configured, last successful > 24 hours ago | Check schedule |
| FAIL | No backup configured, or all recent backups failed | Critical risk |
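Backup freshness can be checked from the `end_time` (epoch milliseconds) of the most recent successful entry in the backup history. A minimal sketch (the helper name `backup_verdict` is illustrative; an empty timestamp is treated as "no backup recorded"):

```shell
# Classify backup age per the table above: < 24h PASS, older WARN,
# missing FAIL. Args: end_time in epoch ms, current time in epoch s.
backup_verdict() {
  local end_ms=$1 now_s=${2:-$(date +%s)}
  if [ -z "$end_ms" ]; then echo FAIL; return; fi   # no backup recorded
  local age_h=$(( (now_s - end_ms / 1000) / 3600 ))
  if [ "$age_h" -lt 24 ]; then echo PASS; else echo WARN; fi
}
```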
To trigger a manual backup and verify reachability of the backup target:
- `POST /api/v1/cluster/backups?action=backup_to_remote`
- `ping backup-server.lab.local`

What: Verify all ESXi hosts prepared as NSX transport nodes are in a healthy state.
Why: Transport node failures mean VMs on that host lose NSX networking (overlay, DFW, load balancing).
# Get all transport node states
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/transport-nodes/state" | jq '.results[] | {
transport_node_id: .transport_node_id,
node_deployment_state: .node_deployment_state.state,
state: .state,
status: .status,
host_node_deployment_status: .host_node_deployment_status
}'
Expected Output (Healthy):
{
"transport_node_id": "abc123",
"node_deployment_state": "NODE_READY",
"state": "success",
"status": "UP"
}
# Compact summary of all transport node states
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/transport-nodes/state?status=UP" | jq '.result_count'
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/transport-nodes/state?status=DOWN" | jq '.result_count'
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/transport-nodes/state?status=DEGRADED" | jq '.result_count'
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All transport nodes SUCCESS / UP | Healthy |
| WARN | Any node IN_PROGRESS (e.g., during upgrade) | Monitor |
| FAIL | Any node FAILED or DOWN | Immediate investigation |
What: Verify NSX components on individual ESXi hosts via the nsxcli shell.
ssh root@<esxi-host>
nsxcli
> get managers
Expected Output:
Manager IP : 192.168.1.71
Manager IP : 192.168.1.72
Manager IP : 192.168.1.73
Connected to : 192.168.1.71
> get controllers
Expected Output:
Controller IP : 192.168.1.71
Controller IP : 192.168.1.72
Controller IP : 192.168.1.73
Connected to : 192.168.1.72
Status : Connected
What: Verify the NSX agent processes (nsx-mpa, nsx-proxy) are running on each ESXi host.
Why: These agents handle management plane communication and data plane programming.
# Check NSX services on ESXi
/etc/init.d/nsx-mpa status
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-context-mux status
Expected Output:
nsx-mpa is running
nsx-proxy is running
nsx-context-mux is running
esxcli software vib list | grep -i nsx
Expected Output (shows installed NSX VIBs):
nsx-esx-datapath 4.2.0.0.0-12345678 VMW VMwareCertified 2026-01-15
nsx-mpa 4.2.0.0.0-12345678 VMW VMwareCertified 2026-01-15
nsx-platform-client 4.2.0.0.0-12345678 VMW VMwareCertified 2026-01-15
nsx-proxy 4.2.0.0.0-12345678 VMW VMwareCertified 2026-01-15
Remediation:
- `/etc/init.d/nsx-mpa restart`
- `/etc/init.d/nsx-proxy restart`
- Review `/var/log/nsx-mpa.log` and `/var/log/nsx-syslog.log`

What: Verify ESXi transport nodes can reach NSX Managers on required ports.
| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| ESXi | NSX Manager | 1234 | TCP | NSX Manager → Host MPA |
| ESXi | NSX Manager | 1235 | TCP | NSX Central CLI |
| ESXi | NSX Manager | 5671 | TCP | Message bus (RabbitMQ) |
| ESXi | NSX Manager | 443 | TCP | HTTPS API |
| ESXi | ESXi (TEP) | 6081 | UDP | Geneve overlay |
| ESXi | ESXi (TEP) | 3784 | UDP | BFD between TEPs |
# Test connectivity to NSX Manager ports from ESXi
nc -zv $NSX_VIP 1234
nc -zv $NSX_VIP 1235
nc -zv $NSX_VIP 5671
nc -zv $NSX_VIP 443
Expected Output:
Connection to nsx-vip.lab.local 1234 port [tcp/*] succeeded!
Connection to nsx-vip.lab.local 1235 port [tcp/*] succeeded!
Connection to nsx-vip.lab.local 5671 port [tcp/*] succeeded!
Connection to nsx-vip.lab.local 443 port [tcp/https] succeeded!
What: Verify Tunnel Endpoint (TEP) connectivity between all ESXi hosts with proper MTU.
Why: TEP connectivity is the foundation of NSX overlay networking. Failed TEP = no overlay connectivity between VMs on different hosts.
# From ESXi host, ping another host's TEP IP
vmkping -I vmk10 -d -s 1572 <remote-TEP-IP>
The `-s 1572` flag sets the ICMP payload size. With IP (20 bytes) and ICMP (8 bytes) headers, the total packet size is 1600 bytes, the minimum MTU for Geneve overlay. The `-d` flag sets the Don't Fragment bit to ensure the full MTU path is tested.
# Script to test TEP connectivity from current host
REMOTE_TEPS="192.168.12.75 192.168.12.76 192.168.12.82"
for TEP in $REMOTE_TEPS; do
echo "Testing TEP: $TEP"
vmkping -I vmk10 -d -s 1572 -c 3 $TEP
echo "---"
done
Expected Output (Healthy):
PING 192.168.12.75 (192.168.12.75): 1572 data bytes
1580 bytes from 192.168.12.75: icmp_seq=0 ttl=64 time=0.543 ms
1580 bytes from 192.168.12.75: icmp_seq=1 ttl=64 time=0.421 ms
1580 bytes from 192.168.12.75: icmp_seq=2 ttl=64 time=0.389 ms
--- 192.168.12.75 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
| Result | Criteria | Indicator |
|---|---|---|
| PASS | 0% packet loss with MTU 1600 | TEP healthy |
| WARN | Intermittent loss (< 5%) or high latency (> 5ms) | Network issue |
| FAIL | 100% loss or MTU 1600 fails (but smaller succeeds) | TEP broken or MTU mismatch |
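When running the TEP test script across many hosts, the ping summary can be classified automatically. A minimal sketch (the helper name `tep_verdict` is my own; it reads `vmkping` output on stdin and applies the loss thresholds above, treating unparseable output as FAIL):

```shell
# Extract the packet-loss percentage from a ping/vmkping summary line
# and classify it: 0% PASS, < 5% WARN, otherwise FAIL.
tep_verdict() {
  local loss
  loss=$(grep -o '[0-9]*% packet loss' | head -n 1 | grep -o '^[0-9]*')
  if [ -z "$loss" ]; then echo FAIL; return; fi   # no summary line found
  if [ "$loss" -eq 0 ]; then echo PASS
  elif [ "$loss" -lt 5 ]; then echo WARN
  else echo FAIL; fi
}

# Usage: vmkping -I vmk10 -d -s 1572 -c 3 <remote-TEP> | tep_verdict
```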
If the MTU test fails:
- `esxcli network vswitch dvs vmware list` — MTU should be 9000 or at least 1600
- `vmkping -I vmk10 -d -s 1400 <TEP>` — if a smaller payload succeeds while 1572 fails, the path MTU is too small
What: Verify Bidirectional Forwarding Detection (BFD) sessions between transport nodes.
Why: BFD provides fast failure detection for TEP tunnels. Down BFD = NSX considers the tunnel failed.
ssh root@<esxi-host>
nsxcli
> get bfd-sessions
Expected Output:
BFD Session Remote IP State Diagnostic
-------------------------------------------------------
192.168.12.75 192.168.12.75 UP No Diagnostic
192.168.12.76 192.168.12.76 UP No Diagnostic
192.168.12.82 192.168.12.82 UP No Diagnostic
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All BFD sessions UP | Healthy |
| WARN | Any session in INIT state | Establishing |
| FAIL | Any session DOWN or missing | Tunnel failure |
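As with the other state checks, the BFD criteria reduce to a simple state map for scripting. A minimal sketch (the function name `bfd_verdict` is illustrative; feed it the State column from `get bfd-sessions`):

```shell
# Map a BFD session state to a verdict per the table above.
bfd_verdict() {
  case "$1" in
    UP)   echo PASS ;;
    INIT) echo WARN ;;
    *)    echo FAIL ;;   # DOWN, ADMIN_DOWN, or missing
  esac
}
```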
What: Verify overlay tunnel status between transport nodes.
> get logical-switches
Expected Output:
Logical Switch UUID VNI Port Count
---------------------------------------------------------------
abc123-def456-789012-abcdef-123456789012 71001 5
def456-789012-abcdef-123456-789012abcdef 71002 3
> get tunnel-ports
Expected Output:
Tunnel Port Remote IP Encap Status
--------------------------------------------
vxlan_12 192.168.12.75 Geneve UP
vxlan_13 192.168.12.76 Geneve UP
vxlan_14 192.168.12.82 Geneve UP
What: Verify the Edge cluster and all member Edge nodes are operational.
Why: Edge nodes provide north-south routing, NAT, load balancing, VPN, and gateway firewall services. Edge failure = services outage.
# List edge clusters
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/edge-clusters" | jq '.results[] | {
id: .id,
display_name: .display_name,
member_count: (.members | length),
deployment_type: .deployment_type
}'
# Get edge cluster status
EDGE_CLUSTER_ID="<edge-cluster-id>"
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/edge-clusters/$EDGE_CLUSTER_ID/status" | jq .
Expected Output:
{
"edge_cluster_id": "abc123...",
"member_status": [
{
"member_index": 0,
"transport_node_id": "tn-edge-01",
"status": "UP"
},
{
"member_index": 1,
"transport_node_id": "tn-edge-02",
"status": "UP"
}
]
}
What: Verify Edge High Availability is correctly configured with proper ACTIVE/STANDBY states.
# Get Tier-0 gateways
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-0s" | jq '.results[] | {
id: .id,
display_name: .display_name,
ha_mode: .ha_mode,
failover_mode: .failover_mode
}'
ssh admin@<edge-node>
get high-availability status
Expected Output:
Active-Standby Status
Service Router : ACTIVE
Distributed Router: ACTIVE
HA Channel State : UP
Peer Status : STANDBY (edge-02)
| Result | Criteria | Indicator |
|---|---|---|
| PASS | One edge ACTIVE, one STANDBY, HA channel UP | Healthy HA |
| WARN | HA channel DOWN but roles correct | HA monitoring degraded |
| FAIL | Both edges STANDBY or both ACTIVE (split-brain) | HA failure |
Remediation: `restart service high-availability` on the affected edge node.
What: Verify datapath and routing services on edge nodes.
ssh admin@<edge-node>
get service status
Expected Output:
Service Status
--------------------------
dataplane running
edge-health-agg running
nestdb running
nsx-edge-mpa running
nsx-proxy running
syslog running
get interfaces
Expected Output:
Interface IP Address MAC MTU Admin Status
--------------------------------------------------------------------
fp-eth0 192.168.10.1/24 00:50:56:xx:xx:01 1500 UP
fp-eth1 192.168.20.1/24 00:50:56:xx:xx:02 1500 UP
fp-eth2 169.254.0.1/28 02:50:56:xx:xx:03 1500 UP
What: Verify BGP sessions with upstream physical routers are established and routes are being exchanged.
Why: BGP peering is critical for north-south traffic flow. Failed BGP = no external connectivity.
ssh admin@<edge-node>
get bgp neighbor summary
Expected Output:
BGP Summary:
Router ID: 192.168.10.1
Local AS: 65001
Neighbor AS State Up/Down Prefixes Received
-----------------------------------------------------------------
192.168.10.254 65000 Established 5d12h 12
192.168.20.254 65000 Established 5d12h 12
# Get BGP neighbors via policy API
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-0s/<tier0-id>/locale-services/<locale-service-id>/bgp/neighbors/status" | jq .
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All neighbors Established, prefixes received > 0 | BGP healthy |
| WARN | Any neighbor OpenConfirm or OpenSent | BGP negotiating |
| FAIL | Any neighbor Idle or Active | BGP session down |
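The FSM-state criteria above map directly to a helper for scripted checks. A minimal sketch (the function name `bgp_verdict` is my own; pass it the State column from `get bgp neighbor summary`):

```shell
# Map a BGP neighbor FSM state to a verdict per the table above.
# Note: in BGP, "Active" means the session is NOT up.
bgp_verdict() {
  case "$1" in
    Established)           echo PASS ;;
    OpenConfirm|OpenSent)  echo WARN ;;
    *)                     echo FAIL ;;   # Idle, Active, Connect, unknown
  esac
}
```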
Additional BGP verification commands:
- `get bgp config`
- `get bgp neighbor <IP> advertised-routes`
- `get bgp neighbor <IP> received-routes`

What: Verify the routing table contains expected routes for all network segments.
get route-table
Expected Output (abbreviated):
Flags: c - connected, s - static, b - BGP, ns - nsx_static
> - selected route, * - FIB route
b>* 0.0.0.0/0 via 192.168.10.254 fp-eth0 [20/0]
c>* 192.168.10.0/24 directly connected fp-eth0
c>* 192.168.20.0/24 directly connected fp-eth1
b>* 10.0.0.0/8 via 192.168.10.254 fp-eth0 [20/0]
ns>* 172.16.0.0/16 via 169.254.0.2 linked [3/0]
ns>* 172.17.0.0/16 via 169.254.0.2 linked [3/0]
| Route | Source | Expected Next Hop |
|---|---|---|
| Default (0.0.0.0/0) | BGP | Physical router IP |
| Management network | Connected | Local interface |
| Overlay segments | NSX static | DR backplane |
| External networks | BGP | Physical router IP |
What: Verify Tier-0 and Tier-1 gateways are fully realized (configured intent matches actual state).
# Check Tier-0 realized state
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-0s/<tier0-id>/state" | jq '{
state: .state,
details: .details
}'
# Check Tier-1 realized state
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-1s" | jq '.results[] | {
display_name: .display_name,
id: .id
}'
# Then for each Tier-1:
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-1s/<tier1-id>/state" | jq '{
state: .state,
details: .details
}'
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All gateways state: REALIZED | Configuration applied |
| WARN | Any gateway state: IN_PROGRESS | Being configured |
| FAIL | Any gateway state: ERROR | Configuration failure |
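The same REALIZED / IN_PROGRESS / ERROR pattern recurs for gateways, segments, and security policies, so a single helper covers all of them. A minimal sketch (the function name `realization_verdict` is illustrative):

```shell
# Map a policy realization state string to a verdict. SUCCESS is
# accepted alongside REALIZED since some state endpoints return it.
realization_verdict() {
  case "$1" in
    REALIZED|SUCCESS) echo PASS ;;
    IN_PROGRESS)      echo WARN ;;
    *)                echo FAIL ;;   # ERROR or anything unexpected
  esac
}

# Live usage sketch, assuming the nsx_api helper and a known tier-0 ID:
#   STATE=$(nsx_api "/policy/api/v1/infra/tier-0s/<tier0-id>/state" | jq -r '.state')
#   realization_verdict "$STATE"
```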
What: Verify all DFW security policies and rules have been successfully realized (pushed) to transport nodes.
Why: Unrealized rules mean security policies are not being enforced, creating security gaps.
# Check DFW realization status
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/realized-state/status?intent_path=/infra/domains/default/security-policies" | jq '{
publish_status: .publish_status,
consolidated_status: .consolidated_status.consolidated_status
}'
# List all security policies and their realization state
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/domains/default/security-policies" | jq '.results[] | {
display_name: .display_name,
id: .id,
rule_count: (.rules | length)
}'
# Check realization for a specific policy
POLICY_ID="<policy-id>"
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/realized-state/status?intent_path=/infra/domains/default/security-policies/$POLICY_ID" | jq .
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All policies REALIZED | Rules enforced on all hosts |
| WARN | Any policy IN_PROGRESS | Rules being pushed |
| FAIL | Any policy ERROR | Rules NOT enforced — security gap |
What: Verify DFW filters are attached to VM vNICs on ESXi hosts.
Why: If dvfilter is not attached, DFW rules are not applied to that VM regardless of realization status.
# List all dvfilters
summarize-dvfilter
Expected Output (abbreviated):
world 12345678 vmm0:MyVM vcUuid:'50 12 ab cd ...'
++ port 50331651 myvm.eth0
vNic slot 2
name: nic-12345678-eth0-vmware-sfw.2
agentName: vmware-sfw
state: IOChain Attached
Key fields to confirm:
- `agentName: vmware-sfw` — confirms DFW is active
- `state: IOChain Attached` — confirms the filter is operational

What: Verify DFW rules are programmed on the ESXi datapath using vsipioctl.
vsipioctl getfilters
Expected Output:
Filter Name : nic-12345678-eth0-vmware-sfw.2
Filter Rules:
Rule Count : 47
Applied To : myvm.eth0
vsipioctl getrules -f nic-12345678-eth0-vmware-sfw.2
Expected Output (abbreviated):
ruleset domain-c8:1001 {
# DFW Section: Application Rules
rule 1001 at 1 inout protocol any from any to any accept with log;
rule 1002 at 2 inout protocol tcp from addrset src-001 to addrset dst-001 port 443 accept;
rule 1003 at 3 inout protocol any from any to any drop;
}
vsipioctl getaddrsets -f nic-12345678-eth0-vmware-sfw.2
What: Verify Gateway Firewall rules on Tier-0 and Tier-1 gateways are realized and synchronized between HA edge pairs.
Why: Gateway firewall protects north-south traffic. Desynchronized rules between active/standby edges create intermittent security gaps during failover.
# Tier-0 Gateway Firewall policies
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-0s/<tier0-id>/gateway-firewall" | jq .
# Tier-1 Gateway Firewall policies
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/tier-1s/<tier1-id>/gateway-firewall" | jq .
# SSH to edge node
ssh admin@<edge-node>
get firewall rules
Expected Output:
Rule ID Action Direction Source Destination Service Logged
---------------------------------------------------------------------------------
1001 allow in-out 10.0.0.0/8 any HTTPS yes
1002 allow in-out any 10.0.0.0/8 SSH yes
2001 drop in-out any any any yes
# On ACTIVE edge:
get firewall rules | wc -l
# On STANDBY edge:
get firewall rules | wc -l
The rule count should be identical on both edges.
| Result | Criteria | Indicator |
|---|---|---|
| PASS | Rules realized, rule count matches between HA pair | Healthy |
| WARN | Rule count differs by 1-2 (possible in-flight update) | Monitor |
| FAIL | Significant rule count mismatch or rules not realized | Security gap |
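The rule-count comparison between the HA pair can be scripted once the two counts are collected (e.g., via `get firewall rules | wc -l` on each edge). A minimal sketch (the function name `fw_sync_verdict` is my own, applying the drift tolerances from the table above):

```shell
# Compare active/standby gateway firewall rule counts:
# identical PASS, drift of 1-2 rules WARN, larger drift FAIL.
fw_sync_verdict() {
  local active=$1 standby=$2
  local diff=$(( active - standby ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi   # absolute value
  if [ "$diff" -eq 0 ]; then echo PASS
  elif [ "$diff" -le 2 ]; then echo WARN
  else echo FAIL; fi
}
```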
What: Verify all NSX segments are operational and correctly realized.
# List all segments with status
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/segments" | jq '.results[] | {
display_name: .display_name,
id: .id,
type: .type,
vlan_ids: .vlan_ids,
transport_zone_path: .transport_zone_path,
admin_state: .admin_state
}'
# For each segment, check realization
SEGMENT_ID="<segment-id>"
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/policy/api/v1/infra/segments/$SEGMENT_ID/state" | jq '{
state: .state,
details: .details
}'
| Result | Criteria | Indicator |
|---|---|---|
| PASS | All segments admin_state: UP, realized SUCCESS | Healthy |
| WARN | Any segment IN_PROGRESS | Being configured |
| FAIL | Any segment ERROR or admin_state: DOWN | Connectivity loss |
What: Verify logical switch forwarding tables are populated correctly.
nsxcli
> get logical-switch <VNI> mac-table
Expected Output:
Inner MAC Outer MAC Outer IP Flags
--------------------------------------------------------------
00:50:56:aa:bb:01 00:50:56:cc:dd:01 192.168.12.75 L
00:50:56:aa:bb:02 00:50:56:cc:dd:02 192.168.12.76 R
> get logical-switch <VNI> arp-table
Expected Output:
IP Address MAC Address Flags
-------------------------------------------
172.16.1.10 00:50:56:aa:bb:01 L
172.16.1.11 00:50:56:aa:bb:02 R
> get logical-switch <VNI> vtep-table
Expected Output:
VTEP IP VTEP MAC Segment ID Is Local
------------------------------------------------------------
192.168.12.74 00:50:56:66:xx:01 192.168.12.0 true
192.168.12.75 00:50:56:66:xx:02 192.168.12.0 false
192.168.12.76 00:50:56:66:xx:03 192.168.12.0 false
What: Check for active alarms on the NSX Manager that indicate current or impending issues.
Why: NSX generates alarms for certificate expiry, resource exhaustion, connectivity failures, and more. Unresolved alarms indicate health issues.
# Get all open alarms
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/alarms?status=OPEN,ACKNOWLEDGED&sort_by=severity" | jq '.results[] | {
alarm_id: .id,
severity: .severity,
feature: .feature_name,
event_type: .event_type,
description: .description,
entity: .entity_id,
last_reported: .last_reported_time,
status: .status
}'
curl -sk -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/alarms?status=OPEN&severity=CRITICAL" | jq '.result_count'
| Category | Feature Name | Example Alarm |
|---|---|---|
| Cluster | `clustering` | Manager node connectivity lost |
| Certificate | `trust_management` | Certificate expiring soon |
| Transport | `transport_node` | Transport node connection down |
| Edge | `edge` | Edge node unhealthy |
| Routing | `routing` | BGP neighbor down |
| Firewall | `firewall` | DFW realization failure |
| Backup | `backup` | Backup failure |
| Resource | `system` | Disk space critical |
| Result | Criteria | Indicator |
|---|---|---|
| PASS | 0 critical alarms, 0 high-severity alarms | Clean |
| WARN | Warning-severity alarms present | Review recommended |
| FAIL | Critical or high-severity alarms present | Immediate action |
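The two `result_count` queries above feed directly into a final verdict. A minimal sketch (the function name `alarm_verdict` is illustrative; arguments are the open critical/high count and the open warning count):

```shell
# Classify overall alarm state per the table above.
alarm_verdict() {
  local critical=$1 warning=$2
  if [ "$critical" -gt 0 ]; then echo FAIL
  elif [ "$warning" -gt 0 ]; then echo WARN
  else echo PASS; fi
}

# Live usage sketch, assuming the nsx_api helper:
#   CRIT=$(nsx_api "/api/v1/alarms?status=OPEN&severity=CRITICAL" | jq '.result_count')
#   WARN=$(nsx_api "/api/v1/alarms?status=OPEN&severity=WARNING" | jq '.result_count')
#   alarm_verdict "$CRIT" "$WARN"
```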
# Acknowledge an alarm
curl -sk -X POST -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/alarms/<alarm-id>?action=set_status&new_status=ACKNOWLEDGED"
# Resolve an alarm (after fixing the root cause)
curl -sk -X POST -u "$NSX_USER:$NSX_PASS" \
"https://$NSX_VIP/api/v1/alarms/<alarm-id>?action=set_status&new_status=RESOLVED"
| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| Admin Workstation | NSX Manager | 443 | TCP | Web UI / API |
| Admin Workstation | NSX Manager | 22 | TCP | SSH CLI |
| NSX Manager | NSX Manager | 443 | TCP | Inter-node API |
| NSX Manager | NSX Manager | 5671 | TCP | Messaging (AMQP) |
| NSX Manager | NSX Manager | 9000 | TCP | Cluster bootstrap |
| NSX Manager | NSX Manager | 9200 | TCP | Corfu datastore |
| NSX Manager | NSX Manager | 9300 | TCP | Search engine |
| NSX Manager | vCenter | 443 | TCP | Compute Manager |
| NSX Manager | vCenter | 902 | TCP | VM console proxy |
| NSX Manager | DNS | 53 | TCP/UDP | Name resolution |
| NSX Manager | NTP | 123 | UDP | Time sync |
| NSX Manager | Syslog | 514/6514 | UDP/TCP | Log forwarding |
| NSX Manager | SFTP Backup | 22 | TCP | Backup file transfer |
| NSX Manager | LDAP/AD | 389/636 | TCP | Authentication |
| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| ESXi | NSX Manager | 1234 | TCP | MPA communication |
| ESXi | NSX Manager | 1235 | TCP | Central control plane (CCP) |
| ESXi | NSX Manager | 5671 | TCP | Message bus |
| ESXi | NSX Manager | 443 | TCP | API / certificate |
| ESXi | ESXi (TEP) | 6081 | UDP | Geneve overlay |
| ESXi | ESXi (TEP) | 3784 | UDP | BFD between TEPs |
| NSX Manager | ESXi | 443 | TCP | Host preparation |
| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| Edge | NSX Manager | 1234 | TCP | MPA |
| Edge | NSX Manager | 1235 | TCP | Central control plane (CCP) |
| Edge | NSX Manager | 5671 | TCP | Message bus |
| Edge | NSX Manager | 443 | TCP | API |
| Edge | ESXi (TEP) | 6081 | UDP | Geneve overlay |
| Edge | Edge | 6081 | UDP | Inter-edge overlay |
| Edge | Physical Router | 179 | TCP | BGP peering |
| Edge | Physical Router | 3784/4784 | UDP | BFD (if configured) |
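The TCP requirements in the tables above can be spot-checked from a Linux jumpbox using bash's built-in `/dev/tcp` redirection, with no extra tooling. A sketch; the `check_port` helper is illustrative, only TCP ports can be probed this way, and `$NSX_NODE2` is just an example target:

```shell
# Report whether a TCP port answers within 3 seconds.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OPEN   $host:$port"
  else
    echo "CLOSED $host:$port"
  fi
}

# Example: inter-manager cluster ports from the table above
# for p in 443 5671 9000 9200 9300; do check_port "$NSX_NODE2" "$p"; done
```

A CLOSED result only proves the jumpbox cannot reach the port; intra-cluster paths still need to be verified from the nodes themselves.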
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Transport node shows DOWN | NSX Manager unreachable from host | Check port 1234/5671 connectivity; restart `nsx-mpa` |
| `get controllers` shows Disconnected | Controller service issue | Restart controller: `restart service controller` on NSX Manager |
| Multiple nodes disconnecting | Network partition | Check upstream switch/VLAN; verify management network |
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Host stuck in `IN_PROGRESS` | VIB installation hung | Reboot ESXi host; re-trigger preparation |
| `FAILED` with "VIB conflict" | Incompatible third-party VIB | Remove conflicting VIB; check compatibility matrix |
| `FAILED` with certificate error | Clock skew | Verify NTP on ESXi and NSX Manager |
| Certificate Type | Impact When Expired | Recovery |
|---|---|---|
| NSX Manager API | API calls fail, UI inaccessible | Replace via CLI: `set certificate api` |
| Cluster certificate | Inter-node communication fails | Replace per VMware KB 83054 |
| Transport node | Host loses control plane | Replace and re-prepare host |
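Expiry is better caught proactively than discovered as an outage. A sketch using `openssl`; the `days_until_expiry` helper and the `/tmp/nsx-api.pem` path are illustrative rather than NSX tooling, and `date -d` assumes GNU coreutils:

```shell
# Print the number of whole days until a PEM certificate expires.
days_until_expiry() {
  local end epoch_end epoch_now
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  epoch_end=$(date -d "$end" +%s)
  epoch_now=$(date +%s)
  echo $(( (epoch_end - epoch_now) / 86400 ))
}

# Live check: capture the Manager API certificate, then score it
# echo | openssl s_client -connect "$NSX_VIP:443" 2>/dev/null \
#   | openssl x509 -outform PEM > /tmp/nsx-api.pem
# days_until_expiry /tmp/nsx-api.pem
```

Anything under 30 days is a reasonable WARN threshold; under 7 days should be treated as FAIL given the recovery effort in the table above.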
Symptoms:
Resolution: `get cluster status`

| Command | Purpose |
|---|---|
| `get cluster status` | Cluster health overview |
| `get services` | All service status |
| `get node-stats` | CPU/memory/disk stats |
| `get filesystem-info` | Disk utilization |
| `get certificate api` | API certificate details |
| `get managers` | Connected manager nodes |
| `get controllers` | Connected controllers |
| `get interface eth0` | Network interface info |
| `get log-file syslog follow` | Tail syslog in real time |
| `restart cluster boot-manager` | Restart cluster membership |
| `restart service <name>` | Restart a specific service |
| `set logging-level <service> level debug` | Enable debug logging |
| `clear logging-level <service>` | Reset to default logging |
| Command | Purpose |
|---|---|
| `get managers` | NSX Manager connectivity |
| `get controllers` | Controller connectivity |
| `get logical-switches` | Overlay segments on host |
| `get logical-switch <VNI> mac-table` | MAC forwarding table |
| `get logical-switch <VNI> arp-table` | ARP table |
| `get logical-switch <VNI> vtep-table` | VTEP table |
| `get bfd-sessions` | BFD tunnel status |
| `get tunnel-ports` | Geneve tunnel endpoints |
| `get firewall rules` | DFW rules (on edge) |
| Command | Purpose |
|---|---|
| `/etc/init.d/nsx-mpa status` | MPA agent status |
| `/etc/init.d/nsx-proxy status` | Proxy agent status |
| `/etc/init.d/nsx-mpa restart` | Restart MPA |
| `summarize-dvfilter` | DFW filter summary |
| `vsipioctl getfilters` | DFW filter list |
| `vsipioctl getrules -f <filter>` | DFW rules for a filter |
| `vsipioctl getaddrsets -f <filter>` | Address sets |
| `esxcli software vib list \| grep nsx` | Installed NSX VIBs |
| `vmkping -I vmk10 -d -s 1572 <IP>` | TEP MTU test |
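The `-s 1572` payload in the MTU test above is not arbitrary: ICMP adds 8 bytes of header and IPv4 another 20, so a 1572-byte payload sent with `-d` (don't fragment) produces exactly a 1600-byte packet, the minimum MTU generally recommended for Geneve overlay traffic. A small helper (hypothetical, just for computing the right `-s` value) makes the arithmetic explicit:

```shell
# vmkping payload size that exercises a target MTU with -d set:
# payload = MTU - 20 (IPv4 header) - 8 (ICMP header)
mtu_payload() { echo $(( $1 - 28 )); }

mtu_payload 1600   # prints 1572, matching the table above
mtu_payload 9000   # prints 8972 for jumbo-frame TEP networks
```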
| Command | Purpose |
|---|---|
| `get high-availability status` | HA state |
| `get bgp neighbor summary` | BGP sessions |
| `get route-table` | Routing table |
| `get interfaces` | Network interfaces |
| `get service status` | Running services |
| `get firewall rules` | Gateway FW rules |
| `get arp-table` | ARP entries |
```shell
# Basic authentication (used throughout this document)
curl -sk -u "admin:password" "https://$NSX_VIP/api/v1/..."

# Session-based authentication (credentials are form-encoded, not JSON);
# -D captures the response headers, which include the X-XSRF-TOKEN
curl -sk -c cookies.txt -D headers.txt -X POST \
  -d "j_username=admin&j_password=password" \
  "https://$NSX_VIP/api/session/create"
XSRF_TOKEN=$(grep -i '^x-xsrf-token' headers.txt | awk '{print $2}' | tr -d '\r')

# Subsequent calls with session cookie and anti-CSRF token
curl -sk -b cookies.txt -H "X-XSRF-TOKEN: $XSRF_TOKEN" "https://$NSX_VIP/api/v1/..."
```
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/v1/cluster/status` | GET | Cluster health |
| `/api/v1/cluster/nodes` | GET | Node inventory |
| `/api/v1/node` | GET | Local node info |
| `/api/v1/node/status` | GET | Node resource usage |
| `/api/v1/node/services` | GET | Service status |
| `/api/v1/trust-management/certificates` | GET | All certificates |
| `/api/v1/cluster/backups/config` | GET | Backup configuration |
| `/api/v1/cluster/backups/history` | GET | Backup history |
| `/api/v1/transport-nodes/state` | GET | Transport node status |
| `/api/v1/edge-clusters` | GET | Edge clusters |
| `/api/v1/edge-clusters/<id>/status` | GET | Edge cluster status |
| `/api/v1/alarms` | GET | Active alarms |
| `/policy/api/v1/infra/tier-0s` | GET | Tier-0 gateways |
| `/policy/api/v1/infra/tier-1s` | GET | Tier-1 gateways |
| `/policy/api/v1/infra/segments` | GET | Network segments |
| `/policy/api/v1/infra/domains/default/security-policies` | GET | DFW policies |
| `/policy/api/v1/infra/realized-state/status` | GET | Realization status |
| Parameter | Example | Purpose |
|---|---|---|
| `cursor` | `?cursor=0` | Pagination start |
| `page_size` | `?page_size=100` | Results per page |
| `sort_by` | `?sort_by=display_name` | Sort field |
| `sort_ascending` | `?sort_ascending=true` | Sort order |
| `included_fields` | `?included_fields=display_name,id` | Limit response fields |
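For inventories larger than one page, the `cursor` value from each response feeds the next request until it comes back empty. A sketch built on the `nsx_api` helper from the Prerequisites section; the `nsx_list_all` function name is illustrative:

```shell
# Print display_name for every object behind a paged NSX list endpoint.
nsx_list_all() {
  local endpoint=$1 cursor="" page
  while :; do
    page=$(nsx_api "${endpoint}?page_size=100${cursor:+&cursor=$cursor}")
    echo "$page" | jq -r '.results[].display_name'
    # An absent or empty cursor means this was the last page
    cursor=$(echo "$page" | jq -r '.cursor // empty')
    [ -n "$cursor" ] || break
  done
}

# Example: all segments, however many pages they span
# nsx_list_all /policy/api/v1/infra/segments
```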
```shell
# Get NSX Manager version
curl -sk -u "$NSX_USER:$NSX_PASS" \
  "https://$NSX_VIP/api/v1/node/version" | jq .
```

Expected Output:

```json
{
  "node_version": "4.2.0.0.0.12345678",
  "product_version": "4.2.0.0.0.12345678"
}
```
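When comparing versions across nodes or against the VCF 9.0 bill of materials, the bare version string is easier to diff than the full JSON. A one-line sketch using the `nsx_api` helper (the `nsx_version` name is illustrative):

```shell
# Extract just the product version string from /api/v1/node/version
nsx_version() { nsx_api /api/v1/node/version | jq -r '.product_version'; }

# Live usage:
# nsx_version
```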
NSX 4.x Health Check Handbook Version 1.0 | March 2026 © 2026 Virtual Control LLC — All Rights Reserved