Prepared by: Virtual Control LLC
Date: March 26, 2026
Document Version: 1.0
Classification: Internal — Lab Environment
NSX Version: 9.0.1.0 (Build 24952114)
On March 26, 2026, a complete NSX Manager service failure was identified in the VCF 9.0.1 nested lab environment. The NSX Manager appliance (nsx-node1.lab.local, 192.168.1.71) was powered on and responding to ICMP, but all core platform services — including HTTP, Manager, Controller, Search, and Authentication — were in a stopped state. The NSX Manager UI was unreachable on both the node IP (192.168.1.71) and the cluster VIP (192.168.1.70). VCF Operations reported two integration adapters in Warning state due to the inability to connect to NSX.
The root cause was determined to be a cold start service initialization failure. After a lab shutdown and restart, the NSX Manager appliance booted successfully at the OS level, but the NSX application services did not auto-start in the correct dependency order. The Corfu datastore service was running, but the HTTP service — which all API and UI access depends on — remained stopped, preventing the rest of the service chain from initializing.
Recovery was achieved by manually starting services in the correct dependency order: datastore → http → manager → controller. This cascaded the startup of all remaining 30+ services. Total recovery time from initial diagnosis to full service restoration was approximately 40 minutes.
Root Cause: NSX Manager 9.0.1 cold start in a resource-constrained nested environment caused the HTTP service to fail during automatic initialization. The Corfu datastore started successfully, but the HTTP service did not start within the expected timeout window, breaking the service dependency chain and leaving all dependent services (Manager, Controller, Search, Auth, and 25+ policy services) in a stopped state.
| Parameter | Value |
|---|---|
| Physical Host | Dell Precision 7920 |
| Hypervisor | VMware Workstation 17.x (Nested) |
| VCF Version | 9.0.1 |
| Lab Domain | lab.local |
| Deployment Type | Single-node NSX Manager (nested) |
| Parameter | Value |
|---|---|
| Appliance Hostname | nsx-node1.lab.local |
| Node UUID | 95493642-ef4a-cb8e-ed7c-5bc20033f2c2 |
| Node IP Address | 192.168.1.71 |
| Cluster VIP | 192.168.1.70 |
| NSX Version | 9.0.1.0 |
| Build Number | 24952114 |
| Cluster ID | 3d5211c5-a4e1-4535-a803-f10726c26d59 |
| Deployment Type | Single-node cluster |
| Component | IP Address | Role |
|---|---|---|
| vCenter Server 9.0 | 192.168.1.69 | Infrastructure management |
| SDDC Manager 9.0 | 192.168.1.241 | Lifecycle management |
| VCF Operations 9.0.2 | — | Monitoring (VCF Operations Collector) |
| ESXi Hosts | 192.168.1.74–.76, .82 | Compute |
The issue was initially identified in VCF Operations (Administration → Integrations → Accounts). Two adapters displayed Warning status: the NSX adapter (nsx-vip.lab.local) and the VCF adapter (lab).
The NSX adapter reported: Error trying to establish connection
Additional symptoms:
- https://192.168.1.70 (VIP) — not reachable in browser
- https://192.168.1.71 (node) — not reachable in browser
- ping 192.168.1.70 — no response
- ping 192.168.1.71 — responding

| Component | Status | Impact |
|---|---|---|
| NSX Manager UI | Unreachable | No management access |
| NSX Manager API | Unreachable | No programmatic access |
| Cluster VIP (192.168.1.70) | Not responding | VIP not serving traffic |
| VCF Operations NSX Adapter | Warning | No NSX metric collection |
| VCF Operations VCF Adapter | Warning | Partial VCF data collection failure |
| NSX-backed overlay networking | Degraded | Existing workloads operational; no changes possible |
In a production environment, this failure would result in a complete loss of the NSX management and control planes: no UI or API access, no ability to create or modify segments, firewall rules, or groups, and no NSX metric collection or alerting until services were restored.
Note: Existing data plane connectivity (overlay networks, firewall rules already pushed to hosts) remains operational during an NSX Manager outage. The control plane is affected, not the data plane.
What happened: The NSX Manager appliance was powered off as part of a full lab shutdown. Upon restart, the appliance OS booted successfully and basic node services (SSH, NTP, node-mgmt) started. However, the core NSX platform services did not auto-start. The HTTP service failed to initialize within the expected timeout, which broke the service dependency chain.
Why it happened: In a nested virtualization environment running on VMware Workstation, the NSX Manager appliance competes for CPU and I/O resources during boot. The Corfu datastore service started successfully, but the HTTP service — which depends on datastore readiness and requires significant memory allocation — did not start before the system's service startup timeout expired. With HTTP stopped, no dependent services (Manager, Controller, Auth, Search) could start because they require the HTTP/API layer for inter-service communication and registration.
Why it was not detected immediately: The appliance was responsive to ICMP (ping) on the node IP (192.168.1.71), which could give a false impression of health. The VIP (192.168.1.70) did not respond because VIP assignment depends on the cluster manager service, which was also stopped. VCF Operations detected the failure via adapter warnings, which was the initial alert that triggered investigation.
The following diagram shows the NSX Manager service startup order. An arrow indicates a dependency — the service on the right requires the service on the left to be running.
Boot Sequence (Automatic):
OS → ssh, ntp, node-mgmt, node-stats, syslog, nsx-upgrade-agent
Service Dependency Chain (Must be started in order):
datastore (Corfu DB)
→ http (API/UI gateway)
→ auth (authentication)
→ manager (core management plane)
→ controller (control plane)
→ search (indexing)
→ monitoring
→ cm-inventory
→ messaging-manager
→ cluster_manager
→ site_manager
→ install-upgrade
→ async_replicator
→ idps-reporting
→ All POLICY_SVC_* services (30+)
Key insight: The http service is the single point of failure in the startup chain. If http does not start, no service above it in the chain can start, even if datastore is healthy.
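The head of this chain can be encoded directly in a recovery script. A minimal sketch (a bash helper, not part of the NSX product — the service names follow the chain above; everything after `controller` cascade-starts on its own, so only the head of the chain needs explicit handling):

```shell
#!/usr/bin/env bash
# Ordered start list for the core NSX Manager dependency chain.
# Only the head of the chain is listed; the remaining 30+ services
# cascade-start once these are running.
CORE_CHAIN=(datastore http auth manager controller)

# Print every service that must be running before the named one.
prereqs() {
  local target="$1" svc
  for svc in "${CORE_CHAIN[@]}"; do
    [ "$svc" = "$target" ] && return 0
    echo "$svc"
  done
  echo "unknown service: $target" >&2
  return 1
}

prereqs controller
```

Running `prereqs controller` prints `datastore`, `http`, `auth`, and `manager` — the services that must be up before the control plane can start.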
NSX Manager 9.0.1 uses a service orchestration framework that starts services in dependency order during boot. In resource-constrained environments (nested lab, shared CPU/RAM, slow storage), the following sequence occurs:
1. The datastore (Corfu) service starts successfully.
2. The http service attempts to start but, under CPU and I/O contention, does not come up within its startup timeout and is left stopped.
3. Every service that depends on http remains in a stopped state indefinitely.

This is a known behavior in nested environments and is not caused by data corruption, misconfiguration, or hardware failure.
The issue was first observed in VCF Operations under Administration → Integrations → Accounts:
| Adapter Name | Type | Status | Collector |
|---|---|---|---|
| lab | VMware Cloud Foundation Adapter | Warning | collector.lab.local |
| mgmt | VCF Operations Collector | Collecting | VCF Operations Collector-vcf-ops |
| mgmt - vSAN | VCF Operations Collector | Collecting | VCF Operations Collector-vcf-ops |
| nsx-vip.lab.local | NSX Adapter | Warning | collector.lab.local |
The NSX adapter (nsx-vip.lab.local) reported: "Error trying to establish connection"
From the management workstation:
C:\> ping 192.168.1.70
Request timed out.
Request timed out.
C:\> ping 192.168.1.71
Reply from 192.168.1.71: bytes=32 time<1ms TTL=64
Reply from 192.168.1.71: bytes=32 time<1ms TTL=64
Result: Node IP (.71) responds to ICMP. Cluster VIP (.70) does not respond.
Analysis: The NSX Manager OS is running, but the cluster VIP is not active. The VIP is managed by the cluster_manager service, which depends on the HTTP service. This confirms a service-level failure, not a network or VM-level failure.
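Because ICMP alone cannot distinguish "VM up" from "services up", a quick TCP probe of port 443 is a more useful liveness check. A minimal sketch using bash's `/dev/tcp` redirection (the IP address is the lab value from this document):

```shell
#!/usr/bin/env bash
# Probe a TCP port with a short timeout. Unlike an ICMP ping, this
# succeeds only if something is actually listening on the port.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

if port_open 192.168.1.71 443; then
  echo "NSX API/UI reachable"
else
  echo "host may be up, but nothing is listening on 443"
fi
```

Had this check been in place, the "ping replies but UI dead" state would have been flagged immediately.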
| URL | Result |
|---|---|
| https://192.168.1.70 | Connection refused — page did not load |
| https://192.168.1.71 | Connection refused — page did not load |
Analysis: Neither the VIP nor the direct node IP serves the NSX UI. The HTTP service (which serves the UI on port 443) is not running.
SSH access was available because the ssh service starts independently of the NSX platform services.
C:\> ssh root@192.168.1.71
root@nsx-node1:~#
Switched to the NSX CLI admin user:
root@nsx-node1:~# su - admin
NSX CLI (Manager, Policy, Controller 9.0.1.0.24952114). Press ? for command list or enter: help
nsx-node1>
The first diagnostic step was to check the state of all NSX services:
nsx-node1> get services
Output (abbreviated — full output in Appendix A):
| Service | State | Notes |
|---|---|---|
| applianceproxy | running | Basic proxy — starts independently |
| async_replicator | stopped | Depends on manager |
| auth | stopped | Depends on http |
| cluster_manager | stopped | Depends on http |
| cm-inventory | stopped | Depends on manager |
| controller | stopped | Depends on http |
| datastore | stopped | Corfu DB — checked separately |
| datastore_nonconfig | stopped | Corfu secondary — checked separately |
| http | stopped | ROOT CAUSE — gateway for all API/UI |
| manager | stopped | Core management plane |
| messaging-manager | stopped | Depends on http |
| monitoring | stopped | Depends on manager |
| node-mgmt | running | Basic node management — starts independently |
| node-stats | running | Basic node stats — starts independently |
| nsx-platform-client | running | Platform client — starts independently |
| nsx-upgrade-agent | running | Upgrade agent — starts independently |
| ntp | running | Time sync — starts independently |
| search | stopped | Depends on manager |
| sha | running | System health agent — starts independently |
| ssh | running | SSH daemon — starts independently |
| syslog | running | Log forwarding — starts independently |
| ui-service | running | UI static files — starts independently |
Analysis: Only 10 of the 29 listed services were running (see Appendix A). All 10 running services are basic OS-level or node-level services that start independently of the NSX platform. Every platform service (http, manager, controller, auth, search, and all policy services) was stopped.
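During triage it helps to reduce a saved `get services` transcript to a per-state summary. A sketch — it assumes the transcript is a sequence of `Service name:` / `Service state:` pairs, matching the single-service output shown elsewhere in this report; adjust the field parsing if your CLI version formats the listing differently:

```shell
#!/usr/bin/env bash
# Summarize a saved `get services` transcript by state.
# Assumes "Service name: X" / "Service state: Y" pairs per service.
summarize() {
  awk -F': *' '
    $1 == "Service name"  { name = $2 }
    $1 == "Service state" { state[$2] = state[$2] " " name }
    END { for (s in state) printf "%s:%s\n", s, state[s] }
  ' "$1"
}
```

Applied to the Appendix A snapshot, this would print one `running:` line and one `stopped:` line, making the "everything platform-level is down" pattern obvious at a glance.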
The get cluster config and get cluster status commands both returned errors at this point:
nsx-node1> get cluster config
% An error occurred while getting the cluster config
nsx-node1> get cluster status
% An internal error occurred, please retry execution again
nsx-node1> get cluster vip
% An error occurred while getting the cluster virtual ip
Analysis: These commands all require the HTTP/API service to be running. Their failure confirms the HTTP service is down.
Before attempting service recovery, disk space and memory were verified to rule out resource exhaustion:
nsx-node1> get filesystem-stats
| Filesystem | Size | Used | Avail | Use% | Mounted on |
|---|---|---|---|---|---|
| /dev/sda2 | 11G | 4.7G | 5.0G | 49% | / |
| /dev/mapper/nsx-var+log | 27G | 14G | 12G | 55% | /var/log |
| /dev/mapper/nsx-repository | 31G | 7.8G | 22G | 27% | /repository |
| /dev/mapper/nsx-config | 29G | 48M | 28G | 1% | /config |
| /dev/mapper/nsx-secondary | 98G | 230M | 93G | 1% | /nonconfig |
| tmpfs | 16G | 24K | 16G | 1% | /dev/shm |
Analysis: No filesystem is close to capacity. The root filesystem is at 49%, /var/log at 55%, and all other mounts are well below any warning threshold. Disk space is not a contributing factor.
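This check can be scripted for recurring use. A sketch against standard `df -P` output (this runs in the appliance's root shell, not the restricted NSX CLI; the 80% threshold matches the post-recovery checklist later in this document):

```shell
#!/usr/bin/env bash
# Flag any mounted filesystem above a usage threshold (default 80%).
# Exits nonzero if any filesystem is over the limit.
check_disk() {
  local threshold="${1:-80}"
  df -P | awk -v t="$threshold" '
    NR > 1 {
      use = $5; sub(/%/, "", use)
      if (use + 0 > t) { printf "WARN %s at %s%%\n", $6, use; bad = 1 }
    }
    END { exit bad }
  '
}
```

A nonzero exit makes the function easy to wire into cron or a monitoring hook.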
The Corfu datastore is the foundational database service. It must be running before any other platform service can start.
nsx-node1> get service datastore
Output:
Service name: datastore
Service state: running
Analysis: By the time of this check the datastore service was running (it was stopped in the initial get services snapshot — see Appendix A — so it evidently came up on its own in the interim). This means the database layer is healthy and no Corfu repair is needed. Proceed to starting the HTTP service.
Why check datastore first? The Corfu datastore is the NSX Manager's internal database. If it is stopped or corrupted, starting the HTTP service will fail because HTTP depends on database connectivity for configuration loading, session management, and API request processing. Always verify datastore health before starting upstream services.
The HTTP service is the API and UI gateway. It listens on port 443 and is required by every other platform service.
nsx-node1> start service http
Wait 30 seconds for the service to initialize, then verify:
nsx-node1> get service http
Output:
Service name: http
Service state: running
Logging level: info
Session timeout: 1800
Connection timeout: 30
Client API rate limit: 100 requests/sec
Client API concurrency limit: 40
Global API concurrency limit: 199
Redirect host: (not configured)
Basic authentication: enabled
Cookie-based authentication: enabled
Result: HTTP service started successfully. Session timeout is 1800 seconds (30 minutes), connection timeout is 30 seconds. API rate limiting is active at 100 requests/sec. The API gateway is now accepting connections on port 443.
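Rather than a fixed sleep, a small polling loop makes each "wait, then verify" step deterministic. A sketch — the probe is passed in as a command string so the same helper works over SSH, via the CLI, or with a test stub; the `nsxcli -c` invocation in the comment is illustrative, not verified:

```shell
#!/usr/bin/env bash
# Poll until a probe command's output contains the desired text,
# or give up after tries * delay seconds.
wait_for() {
  local probe="$1" want="$2" tries="${3:-12}" delay="${4:-5}"
  local i
  for ((i = 0; i < tries; i++)); do
    if eval "$probe" | grep -q "$want"; then
      return 0
    fi
    sleep "$delay"
  done
  echo "gave up after $((tries * delay))s waiting for: $want" >&2
  return 1
}

# Illustrative usage (invocation style is an assumption):
# wait_for 'nsxcli -c "get service http"' "Service state: running"
```

With the defaults (12 tries, 5 s apart), the helper waits up to a minute — comfortably covering the 30–60 s startup windows noted in these steps.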
The Manager service is the core management plane. It handles all CRUD operations for NSX objects (segments, firewall rules, groups, etc.).
nsx-node1> start service manager
Wait 60 seconds for the service to initialize. The manager service has a longer startup time because it must load its configuration state from the Corfu datastore and register with the HTTP/API layer before it reports as running.
nsx-node1> get service manager
Output:
Service name: manager
Service state: running
Logging level: info
The Controller service manages the control plane — it programs the data plane on ESXi hosts and Edge nodes.
nsx-node1> start service controller
Wait 30 seconds, then verify:
nsx-node1> get service controller
Output:
Service name: controller
Service state: running
After starting the three key services (http, manager, controller), all remaining services should cascade-start automatically. Wait 2–3 minutes, then check:
nsx-node1> get services
Output (abbreviated — full output in Appendix B):
| Service | State |
|---|---|
| applianceproxy | running |
| async_replicator | running |
| auth | running |
| cluster_manager | running |
| cm-inventory | running |
| controller | running |
| datastore | running |
| datastore_nonconfig | running |
| http | running |
| idps-reporting | running |
| install-upgrade | running |
| manager | running |
| messaging-manager | running |
| monitoring | running |
| node-mgmt | running |
| node-stats | running |
| nsx-platform-client | running |
| nsx-upgrade-agent | running |
| ntp | running |
| search | running |
| sha | running |
| site_manager | running |
| ssh | running |
| syslog | running |
| ui-service | running |
Stopped services (expected):
| Service | State | Reason |
|---|---|---|
| liagent | stopped | Log Insight agent — not configured |
| migration-coordinator | stopped | Only active during migrations |
| nsx-message-bus | stopped | Not used in single-node deployments |
| snmp | stopped | SNMP not enabled (Start on boot: False) |
Result: All 25+ platform services are now running. The 4 stopped services are expected to be stopped in this environment configuration.
With services running, verify the cluster VIP is assigned:
nsx-node1> get cluster vip
Output:
Virtual IPv4 address: 192.168.1.70
Virtual IPv6 address: not configured
Assigned to: ['192.168.1.71']
Analysis: The VIP (192.168.1.70) is active and correctly assigned to the single node (192.168.1.71).
nsx-node1> get cluster status verbose
Key results by group type:
| Group Type | Group Status | Node Status |
|---|---|---|
| DATASTORE | STABLE | UP |
| CLUSTER_BOOT_MANAGER | STABLE | UP |
| CONTROLLER | STABLE | UP |
| MANAGER | STABLE | UP |
| HTTPS | UNAVAILABLE | DOWN |
| SITE_MANAGER | STABLE | UP |
| MONITORING | STABLE | UP |
| IDPS_REPORTING | STABLE | UP |
| CM-INVENTORY | STABLE | UP |
| MESSAGING-MANAGER | STABLE | UP |
| CORFU_NONCONFIG | STABLE | UP |
Overall Cluster Status: DEGRADED
The cluster status showed the HTTPS group as UNAVAILABLE with the node status DOWN, even though the http service was running. This is a transient state — the HTTPS cluster group registration lags behind the actual service status by several minutes.
The UI banner displayed: Some appliance components are not functioning properly. Component health: MANAGER:DOWN, SEARCH:DOWN, UI:UP, NODE_MGMT:UP.
This message also reflects a stale health check that had not yet refreshed. After waiting 3–5 minutes and refreshing the browser, the health status updated to show all components as UP.
Important: Do not restart services based on the HTTPS cluster group status or the UI health banner immediately after a manual service recovery. These health indicators use cached state and periodic polling intervals. Wait at least 5 minutes before re-evaluating. If the get service http command shows running and the UI is accessible, the HTTPS group will transition to STABLE on the next health check cycle.
| URL | Result |
|---|---|
| https://192.168.1.71 | NSX Manager UI loaded — login page displayed |
| https://192.168.1.70 | NSX Manager UI loaded — login page displayed (via VIP) |
Both the node IP and VIP are now serving the NSX Manager UI.
After NSX Manager recovery, the VCF Operations integration adapters needed re-validation.
Pre-recovery status (Administration → Integrations → Accounts):
| Adapter | Status |
|---|---|
| lab (VCF Adapter) | Warning |
| nsx-vip.lab.local (NSX Adapter) | Warning |
Note: When re-validating the NSX adapter, a warning may appear: "Error trying to establish connection. Proceed anyway?" This can occur if the certificate has not been re-cached. Click Proceed — the adapter will reconnect successfully once the certificate is accepted.
After re-validation, wait one collection cycle (5–10 minutes) and verify:
| Adapter | Status |
|---|---|
| lab (VCF Adapter) | Collecting |
| nsx-vip.lab.local (NSX Adapter) | Collecting |
| mgmt | Collecting |
| mgmt - vSAN | Collecting |
| Application Monitoring Adapter | Collecting |
Result: All VCF Operations adapters are now in Collecting state. NSX metrics, alerts, and compliance data collection has resumed.
Use this checklist to confirm full recovery after an NSX Manager cold start service failure:
| # | Check | Command / Action | Expected Result |
|---|---|---|---|
| 1 | All services running | `get services` | All platform services show running |
| 2 | Cluster VIP active | `get cluster vip` | VIP assigned to node |
| 3 | Cluster status | `get cluster status` | Overall status: STABLE |
| 4 | UI accessible (node IP) | Browse to https://192.168.1.71 | Login page loads |
| 5 | UI accessible (VIP) | Browse to https://192.168.1.70 | Login page loads |
| 6 | Component health | Check UI banner | All components UP |
| 7 | VCF Ops NSX adapter | VCF Operations → Integrations | Status: Collecting |
| 8 | VCF Ops VCF adapter | VCF Operations → Integrations | Status: Collecting |
| 9 | Transport node connectivity | NSX UI → Fabric → Nodes | All nodes connected |
| 10 | Filesystem usage | `get filesystem-stats` | No filesystem above 80% |
ICMP response does not indicate service health. The NSX Manager VM responded to ping while all platform services were stopped. Always verify service state via SSH, not just network reachability.
The HTTP service is the critical dependency. If only one service is going to fail, it will be HTTP. All other services depend on it. When troubleshooting NSX Manager unresponsiveness, check get service http first.
Cluster status commands fail when HTTP is down. The get cluster config, get cluster status, and get cluster vip commands all require the HTTP/API layer. Their failure is a symptom, not a separate problem.
Service cascade startup is reliable. Once the three key services (datastore, http, manager) are running, all other services start automatically within 2–3 minutes. There is no need to manually start each of the 30+ services.
Health indicators lag behind actual state. The UI health banner and cluster group status use cached, periodically-refreshed data. Do not make decisions based on stale health status immediately after a manual recovery.
| # | Recommendation | Priority |
|---|---|---|
| 1 | Create a startup script that checks NSX service health after boot and restarts HTTP if stopped | High |
| 2 | Configure VCF Operations alerts for NSX Manager service state changes (not just adapter connectivity) | High |
| 3 | Document the service start order (datastore → http → manager → controller) in the lab runbook for quick reference | Medium |
| 4 | Increase NSX Manager VM resources in the nested environment (CPU: 4→6, RAM: 32→48GB) to reduce cold start failures | Medium |
| 5 | Implement startup order in VMware Workstation — ensure NSX Manager VM starts after vCenter and AD/DNS VMs are fully online | Medium |
| 6 | Monitor /var/log usage — at 55%, it is the highest-utilized filesystem and could cause issues if logs are not rotated | Low |
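Recommendation 1 could look like the following sketch: a post-boot check that starts http when the datastore is up but http is not. The `su - admin -c` wrapper and the output strings are assumptions based on the CLI transcripts in this report — validate on the appliance before relying on it:

```shell
#!/usr/bin/env bash
# Post-boot watchdog sketch (recommendation 1): if datastore is running
# but http never came up, start http so the rest of the dependency
# chain can cascade. NSX_GET is how CLI commands are invoked; the
# default ("su - admin -c") is an assumption -- adjust as needed.
: "${NSX_GET:=su - admin -c}"

svc_state() {
  # Extract the "Service state" value for one service.
  $NSX_GET "get service $1" | awk -F': *' '$1 == "Service state" { print $2 }'
}

ensure_http() {
  if [ "$(svc_state datastore)" = running ] && [ "$(svc_state http)" != running ]; then
    echo "datastore up but http stopped -- starting http"
    $NSX_GET "start service http"
  else
    echo "no action needed"
  fi
}
# Invoke ensure_http from a boot-time unit or cron entry.
```

Parameterizing the CLI wrapper (`NSX_GET`) keeps the decision logic testable without an appliance.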
Captured at: March 26, 2026, 15:01 UTC
| Service | State |
|---|---|
| applianceproxy | running |
| async_replicator | stopped |
| auth | stopped |
| cluster_manager | stopped |
| cm-inventory | stopped |
| controller | stopped |
| datastore | stopped |
| datastore_nonconfig | stopped |
| http | stopped |
| idps-reporting | stopped |
| install-upgrade | stopped |
| liagent | stopped |
| manager | stopped |
| messaging-manager | stopped |
| migration-coordinator | stopped |
| monitoring | stopped |
| node-mgmt | running |
| node-stats | running |
| nsx-message-bus | stopped |
| nsx-platform-client | running |
| nsx-upgrade-agent | running |
| ntp | running |
| search | stopped |
| sha | running |
| site_manager | stopped |
| snmp | stopped |
| ssh | running |
| syslog | running |
| ui-service | running |
Running: 10 of 29 | Stopped: 19 of 29
Captured at: March 26, 2026, 15:36 UTC
| Service | State |
|---|---|
| applianceproxy | running |
| async_replicator | running |
| auth | running |
| cluster_manager | running |
| cm-inventory | running |
| controller | running |
| datastore | running |
| datastore_nonconfig | running |
| http | running |
| idps-reporting | running |
| install-upgrade | running |
| liagent | stopped |
| manager | running |
| messaging-manager | running |
| migration-coordinator | stopped |
| monitoring | running |
| node-mgmt | running |
| node-stats | running |
| nsx-message-bus | stopped |
| nsx-platform-client | running |
| nsx-upgrade-agent | running |
| ntp | running |
| search | running |
| sha | running |
| site_manager | running |
| snmp | stopped |
| ssh | running |
| syslog | running |
| ui-service | running |
Running: 25 of 29 | Stopped (expected): 4 of 29
| Field | Value |
|---|---|
| Document Title | NSX Manager 9.0.1 Cold Start Service Failure — RCA & Recovery |
| Version | 1.0 |
| Author | Virtual Control LLC |
| Date | March 26, 2026 |
| Classification | Internal — Lab Environment |
| Environment | Dell Precision 7920, VMware Workstation 17.x Nested Lab |
| NSX Version | 9.0.1.0 (Build 24952114) |
| Related Documents | VCF Undocumented Issues Reference, VCF Troubleshooting Handbook |
(c) 2026 Virtual Control LLC. All rights reserved.