Virtual Control

VMware Cloud Foundation Solutions

Troubleshooting Report

NSX Manager 9.0.1
Cold Start Service Failure

Root cause analysis documenting NSX Manager service chain failure after cold start, manual service recovery, and VCF Operations re-integration.

NSX Manager | Cold Start | Service Recovery | Root Cause Analysis

VCF 9.0

VMware Cloud Foundation

Proprietary & Confidential

NSX Manager 9.0.1 Cold Start Service Failure — Root Cause Analysis & Recovery

Prepared by: Virtual Control LLC
Date: March 26, 2026
Document Version: 1.0
Classification: Internal — Lab Environment
NSX Version: 9.0.1.0 (Build 24952114)

1. Executive Summary

On March 26, 2026, a complete NSX Manager service failure was identified in the VCF 9.0.1 nested lab environment. The NSX Manager appliance (nsx-node1.lab.local, 192.168.1.71) was powered on and responding to ICMP, but all core platform services — including HTTP, Manager, Controller, Search, and Authentication — were in a stopped state. The NSX Manager UI was unreachable on both the node IP (192.168.1.71) and the cluster VIP (192.168.1.70). VCF Operations reported two integration adapters in Warning state due to the inability to connect to NSX.

The root cause was determined to be a cold start service initialization failure. After a lab shutdown and restart, the NSX Manager appliance booted successfully at the OS level, but the NSX application services did not auto-start in the correct dependency order. The Corfu datastore service was running, but the HTTP service — which all API and UI access depends on — remained stopped, preventing the rest of the service chain from initializing.

Recovery was achieved by manually starting services in the correct dependency order: datastore → http → manager → controller. This cascaded the startup of all remaining 30+ services. Total recovery time from initial diagnosis to full service restoration was approximately 40 minutes.

Root Cause: NSX Manager 9.0.1 cold start in a resource-constrained nested environment caused the HTTP service to fail during automatic initialization. The Corfu datastore started successfully, but the HTTP service did not start within the expected timeout window, breaking the service dependency chain and leaving all dependent services (Manager, Controller, Search, Auth, and 25+ policy services) in a stopped state.

2. Environment Reference

2.1 Infrastructure Details

Parameter	Value
Physical Host	Dell Precision 7920
Hypervisor	VMware Workstation 17.x (Nested)
VCF Version	9.0.1
Lab Domain	lab.local
Deployment Type	Single-node NSX Manager (nested)

2.2 NSX Manager Configuration

Parameter	Value
Appliance Hostname	nsx-node1.lab.local
Node UUID	95493642-ef4a-cb8e-ed7c-5bc20033f2c2
Node IP Address	192.168.1.71
Cluster VIP	192.168.1.70
NSX Version	9.0.1.0
Build Number	24952114
Cluster ID	3d5211c5-a4e1-4535-a803-f10726c26d59
Deployment Type	Single-node cluster

2.3 Upstream Dependencies

Component	IP Address	Role
vCenter Server 9.0	192.168.1.69	Infrastructure management
SDDC Manager 9.0	192.168.1.241	Lifecycle management
VCF Operations 9.0.2	—	Monitoring (VCF Operations Collector)
ESXi Hosts	192.168.1.74–.76, .82	Compute

3. Problem Statement

3.1 Symptom Description

The issue was initially identified in VCF Operations (Administration → Integrations → Accounts). Two adapters displayed Warning status:

3.2 Affected Components

Component	Status	Impact
NSX Manager UI	Unreachable	No management access
NSX Manager API	Unreachable	No programmatic access
Cluster VIP (192.168.1.70)	Not responding	VIP not serving traffic
VCF Operations NSX Adapter	Warning	No NSX metric collection
VCF Operations VCF Adapter	Warning	Partial VCF data collection failure
NSX-backed overlay networking	Degraded	Existing workloads operational; no changes possible

3.3 Business Impact

4. Root Cause Analysis

4.1 RCA Summary

What happened: The NSX Manager appliance was powered off as part of a full lab shutdown. Upon restart, the appliance OS booted successfully and basic node services (SSH, NTP, node-mgmt) started. However, the core NSX platform services did not auto-start. The HTTP service failed to initialize within the expected timeout, which broke the service dependency chain.

Why it happened: In a nested virtualization environment running on VMware Workstation, the NSX Manager appliance competes for CPU and I/O resources during boot. The Corfu datastore service started successfully, but the HTTP service — which depends on datastore readiness and requires significant memory allocation — did not start before the system's service startup timeout expired. With HTTP stopped, no dependent services (Manager, Controller, Auth, Search) could start because they require the HTTP/API layer for inter-service communication and registration.

Why it was not detected immediately: The appliance was responsive to ICMP (ping) on the node IP (192.168.1.71), which could give a false impression of health. The VIP (192.168.1.70) did not respond because VIP assignment depends on the cluster manager service, which was also stopped. VCF Operations detected the failure via adapter warnings, which was the initial alert that triggered investigation.

4.2 Service Dependency Chain

The following diagram shows the NSX Manager service startup order. An arrow indicates a dependency — the service on the right requires the service on the left to be running.

Key insight: The http service is the single point of failure in the startup chain. If http does not start, none of the services that depend on it, directly or indirectly, can start, even if datastore is healthy.
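The dependency relationships stated in this report can be written down as a small graph to make the single point of failure explicit. The following is a minimal sketch, not a complete NSX service graph: the `REQUIRES` map only contains the dependencies named in this document, and the function names `start_order` and `blocked_by` are illustrative.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it requires, as stated in this report.
# This is an illustrative subset, not the full NSX Manager service graph.
REQUIRES = {
    "datastore": set(),
    "http": {"datastore"},
    "manager": {"http"},
    "controller": {"http"},
    "auth": {"http"},
    "cluster_manager": {"http"},
    "messaging-manager": {"http"},
    "search": {"manager"},
    "monitoring": {"manager"},
    "cm-inventory": {"manager"},
    "async_replicator": {"manager"},
}


def start_order(requires):
    """Return one valid start order in which prerequisites come first."""
    return list(TopologicalSorter(requires).static_order())


def blocked_by(requires, failed):
    """Return every service that cannot start while `failed` is stopped."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for service, deps in requires.items():
            if service not in blocked and (failed in deps or deps & blocked):
                blocked.add(service)
                changed = True
    return blocked


if __name__ == "__main__":
    print(" -> ".join(start_order(REQUIRES)))
    print("Blocked while http is stopped:", sorted(blocked_by(REQUIRES, "http")))
```

With this map, a stopped http service blocks every entry except datastore and http itself, which matches the observed state after the cold start.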

4.3 Why Services Did Not Auto-Start

NSX Manager 9.0.1 uses a service orchestration framework that starts services in dependency order during boot. In resource-constrained environments (nested lab, shared CPU/RAM, slow storage), the following sequence occurs:

This is a known behavior in nested environments and is not caused by data corruption, misconfiguration, or hardware failure.

5. Phase 1 — Initial Assessment

5.1 VCF Operations Alerts

The issue was first observed in VCF Operations under Administration → Integrations → Accounts:

Adapter Name	Type	Status	Collector
lab	VMware Cloud Foundation Adapter	Warning	collector.lab.local
mgmt	VCF Operations Collector	Collecting	VCF Operations Collector-vcf-ops
mgmt	VCF Operations Collector	Collecting	VCF Operations Collector-vcf-ops
mgmt - vSAN	VCF Operations Collector	Collecting	VCF Operations Collector-vcf-ops
nsx-vip.lab.local	NSX Adapter	Warning	collector.lab.local

The NSX adapter (nsx-vip.lab.local) reported: "Error trying to establish connection"

5.2 Network Connectivity Test

Analysis: The NSX Manager OS is running, but the cluster VIP is not active. The VIP is managed by the cluster_manager service, which depends on the HTTP service. This confirms a service-level failure, not a network or VM-level failure.

5.3 Browser Access Test

URL	Result
`https://192.168.1.70`	Connection refused — page did not load
`https://192.168.1.71`	Connection refused — page did not load

Analysis: Neither the VIP nor the direct node IP serves the NSX UI. The HTTP service (which serves the UI on port 443) is not running.
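The distinction between "the appliance answers ping" and "the HTTP service is listening" can be checked without a browser. The sketch below is one possible way to do this from an admin workstation; the helper names `tcp_port_open` and `classify` are illustrative, and the ping result is passed in by hand because sending ICMP usually requires elevated privileges.

```python
import socket


def tcp_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(ping_ok, https_ok):
    """Classify the node IP observations using the terms of this report."""
    if ping_ok and https_ok:
        return "healthy"
    if ping_ok:
        return "service-level failure (OS up, HTTP service not listening)"
    return "network or VM-level failure"


if __name__ == "__main__":
    # Addresses are the node IP and cluster VIP from section 2.2.
    for name, ip in (("node", "192.168.1.71"), ("VIP", "192.168.1.70")):
        print(name, ip, "port 443 open:", tcp_port_open(ip))
```

In the incident described here, the node IP answered ping while port 443 refused connections, which `classify(True, False)` reports as a service-level failure.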

6. Phase 2 — SSH Diagnostics

6.1 Establish SSH Session

SSH access was available because the ssh service starts independently of the NSX platform services.

6.2 Check All Services

Service	State	Notes
applianceproxy	running	Basic proxy — starts independently
async_replicator	stopped	Depends on manager
auth	stopped	Depends on http
cluster_manager	stopped	Depends on http
cm-inventory	stopped	Depends on manager
controller	stopped	Depends on http
datastore	stopped	Corfu DB — checked separately
datastore_nonconfig	stopped	Corfu secondary — checked separately
http	stopped	ROOT CAUSE — gateway for all API/UI
manager	stopped	Core management plane
messaging-manager	stopped	Depends on http
monitoring	stopped	Depends on manager
node-mgmt	running	Basic node management — starts independently
node-stats	running	Basic node stats — starts independently
nsx-platform-client	running	Platform client — starts independently
nsx-upgrade-agent	running	Upgrade agent — starts independently
ntp	running	Time sync — starts independently
search	stopped	Depends on manager
sha	running	System health agent — starts independently
ssh	running	SSH daemon — starts independently
syslog	running	Log forwarding — starts independently
ui-service	running	UI static files — starts independently

Analysis: Only 10 of the 30+ services were running. All 10 running services are basic OS-level or node-level services that start independently of the NSX platform. Every platform service (http, manager, controller, auth, search, and all policy services) was stopped.
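The service table above can be tallied mechanically, which is useful when the same check is repeated after every cold start. This is a rough sketch that works on name and state pairs copied from the table; it does not parse real CLI output, whose exact layout is not reproduced in this report. The `required` tuple is an assumption and leaves out datastore, which is verified separately in section 7.1.

```python
from collections import Counter

# Name and state pairs copied from the service table above.
RAW = """
applianceproxy running
async_replicator stopped
auth stopped
cluster_manager stopped
cm-inventory stopped
controller stopped
datastore stopped
datastore_nonconfig stopped
http stopped
manager stopped
messaging-manager stopped
monitoring stopped
node-mgmt running
node-stats running
nsx-platform-client running
nsx-upgrade-agent running
ntp running
search stopped
sha running
ssh running
syslog running
ui-service running
"""


def parse_states(text):
    """Turn 'name state' lines into a {name: state} dictionary."""
    states = {}
    for line in text.strip().splitlines():
        name, state = line.split()[:2]
        states[name] = state
    return states


def summarize(states, required=("http", "manager", "controller")):
    """Count states and list required services that are not running."""
    counts = Counter(states.values())
    missing = [s for s in required if states.get(s) != "running"]
    return counts, missing


if __name__ == "__main__":
    counts, missing = summarize(parse_states(RAW))
    print(dict(counts), "not running:", missing)
```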

The get cluster config and get cluster status commands both returned errors at this point:

Analysis: These commands all require the HTTP/API service to be running. Their failure confirms the HTTP service is down.

6.3 Filesystem and Resource Check

Before attempting service recovery, disk space and memory were verified to rule out resource exhaustion:

Filesystem	Size	Used	Avail	Use%	Mounted on
/dev/sda2	11G	4.7G	5.0G	49%	/
/dev/mapper/nsx-var+log	27G	14G	12G	55%	/var/log
/dev/mapper/nsx-repository	31G	7.8G	22G	27%	/repository
/dev/mapper/nsx-config	29G	48M	28G	1%	/config
/dev/mapper/nsx-secondary	98G	230M	93G	1%	/nonconfig
tmpfs	16G	24K	16G	1%	/dev/shm

Analysis: No filesystem is close to capacity. The root filesystem is at 49%, /var/log at 55%, and all other mounts are well below any warning threshold. Disk space is not a contributing factor.
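The same conclusion can be reached programmatically by comparing each Use% value against a threshold. The sketch below uses the rows from the table above and the 80% limit from the post-recovery checklist; the function names are illustrative.

```python
# Rows copied from the filesystem table above:
# device, size, used, available, use%, mount point.
ROWS = """
/dev/sda2 11G 4.7G 5.0G 49% /
/dev/mapper/nsx-var+log 27G 14G 12G 55% /var/log
/dev/mapper/nsx-repository 31G 7.8G 22G 27% /repository
/dev/mapper/nsx-config 29G 48M 28G 1% /config
/dev/mapper/nsx-secondary 98G 230M 93G 1% /nonconfig
tmpfs 16G 24K 16G 1% /dev/shm
"""


def parse_usage(text):
    """Return {mount_point: percent_used} from df-style rows."""
    usage = {}
    for line in text.strip().splitlines():
        fields = line.split()
        usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def over_threshold(usage, limit=80):
    """Return the mounts whose usage exceeds the limit (percent)."""
    return {mount: pct for mount, pct in usage.items() if pct > limit}


if __name__ == "__main__":
    usage = parse_usage(ROWS)
    busiest = max(usage, key=usage.get)
    print("Over 80%:", over_threshold(usage))
    print("Highest usage:", busiest, f"{usage[busiest]}%")
```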

7. Phase 3 — Service Recovery

7.1 Verify Datastore (Corfu) Status

The Corfu datastore is the foundational database service. It must be running before any other platform service can start.

Analysis: The datastore service was already running. This means the database layer is healthy and no Corfu repair is needed. Proceed to starting the HTTP service.

Why check datastore first? The Corfu datastore is the NSX Manager's internal database. If it is stopped or corrupted, starting the HTTP service will fail because HTTP depends on database connectivity for configuration loading, session management, and API request processing. Always verify datastore health before starting the services that depend on it.

7.2 Start HTTP Service

The HTTP service is the API and UI gateway. It listens on port 443 and is required by every other platform service.

7.3 Start Manager Service

The Manager service is the core management plane. It handles all CRUD operations for NSX objects (segments, firewall rules, groups, etc.).

Wait 60 seconds for the service to initialize. The manager service has a longer startup time because it must:

7.4 Start Controller Service

The Controller service manages the control plane — it programs the data plane on ESXi hosts and Edge nodes.
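The start order used in sections 7.1 to 7.4 can be captured in a small helper so the sequence is not reconstructed from memory during the next cold start. This is a sketch under stated assumptions: `run_cli` is a hypothetical caller-supplied function that sends one NSX CLI command and returns its text output, the `start service <name>` command form and the presence of the word "running" in `get service <name>` output are assumptions about the CLI, and the 60-second settle time from section 7.3 is applied to every service for simplicity.

```python
import time

# Order taken from this report: datastore, then http, manager, controller.
START_ORDER = ("datastore", "http", "manager", "controller")


def recover(run_cli, order=START_ORDER, settle_seconds=60, sleep=time.sleep):
    """Start each stopped service in order, waiting after each start.

    run_cli is a hypothetical function: it takes one CLI command string
    and returns the command's text output.
    """
    started = []
    for service in order:
        if "running" in run_cli(f"get service {service}"):
            continue  # already up, as datastore was in this incident
        run_cli(f"start service {service}")  # assumed command form
        sleep(settle_seconds)
        if "running" not in run_cli(f"get service {service}"):
            raise RuntimeError(f"{service} did not reach the running state")
        started.append(service)
    return started
```

Stopping at the first service that fails to come up is deliberate: starting manager or controller while http is still stopped would only repeat the original failure.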

7.5 Verify All Services Running

After starting the three key services (http, manager, controller), all remaining services should cascade-start automatically. Wait 2–3 minutes, then check:

Service	State
applianceproxy	running
async_replicator	running
auth	running
cluster_manager	running
cm-inventory	running
controller	running
datastore	running
datastore_nonconfig	running
http	running
idps-reporting	running
install-upgrade	running
manager	running
messaging-manager	running
monitoring	running
node-mgmt	running
node-stats	running
nsx-platform-client	running
nsx-upgrade-agent	running
ntp	running
search	running
sha	running
site_manager	running
ssh	running
syslog	running
ui-service	running

Service	State	Reason
liagent	stopped	Log Insight agent (not configured)
migration-coordinator	stopped	Only active during migrations
nsx-message-bus	stopped	Not used in single-node deployments
snmp	stopped	SNMP not enabled (Start on boot: False)

8. Phase 4 — Cluster Validation

8.1 Cluster VIP Verification

Analysis: The VIP (192.168.1.70) is active and correctly assigned to the single node (192.168.1.71).

8.2 Cluster Status Verbose

Group Type	Group Status	Node Status
DATASTORE	STABLE	UP
CLUSTER_BOOT_MANAGER	STABLE	UP
CONTROLLER	STABLE	UP
MANAGER	STABLE	UP
HTTPS	UNAVAILABLE	DOWN
SITE_MANAGER	STABLE	UP
MONITORING	STABLE	UP
IDPS_REPORTING	STABLE	UP
CM-INVENTORY	STABLE	UP
MESSAGING-MANAGER	STABLE	UP
CORFU_NONCONFIG	STABLE	UP

8.3 HTTPS Group Status

The cluster status showed the HTTPS group as UNAVAILABLE with the node status DOWN, even though the http service was running. This is a transient state — the HTTPS cluster group registration lags behind the actual service status by several minutes.

The UI banner displayed:

Some appliance components are not functioning properly. Component health: MANAGER:DOWN, SEARCH:DOWN, UI:UP, NODE_MGMT:UP.

This message also reflects a stale health check that had not yet refreshed. After waiting 3–5 minutes and refreshing the browser, the health status updated to show all components as UP.

Important: Do not restart services based on the HTTPS cluster group status or the UI health banner immediately after a manual service recovery. These health indicators use cached state and periodic polling intervals. Wait at least 5 minutes before re-evaluating. If the get service http command shows running and the UI is accessible, the HTTPS group will transition to STABLE on the next health check cycle.
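The "wait before re-evaluating" guidance amounts to polling with a grace period instead of reacting to the first stale reading. A minimal sketch follows; `get_status` is a hypothetical caller-supplied function (for example, one that extracts the HTTPS group status from cluster status output), and the 300-second grace period mirrors the 5-minute guidance above while the 30-second interval is an arbitrary choice.

```python
import time


def wait_for_stable(get_status, grace_seconds=300, interval=30,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns "STABLE" or the grace period ends.

    Returns True if STABLE was observed, False if the grace period expired.
    get_status is a hypothetical function supplied by the caller.
    """
    deadline = clock() + grace_seconds
    while True:
        if get_status() == "STABLE":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only a False result, meaning the group is still not STABLE after the full grace period, would justify further action such as restarting a service.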

8.4 Browser Access Verification

URL	Result
`https://192.168.1.71`	NSX Manager UI loaded; login page displayed
`https://192.168.1.70`	NSX Manager UI loaded; login page displayed (via VIP)

9. Phase 5 — Upstream Integration Recovery

9.1 VCF Operations Adapter Status

After NSX Manager recovery, the VCF Operations integration adapters needed re-validation.

9.2 Re-validate NSX Adapter

9.3 Confirm Collection Status

Adapter	Status
lab (VCF Adapter)	Collecting
nsx-vip.lab.local (NSX Adapter)	Collecting
mgmt	Collecting
mgmt - vSAN	Collecting
Application Monitoring Adapter	Collecting

10. Post-Recovery Verification Checklist

Use this checklist to confirm full recovery after an NSX Manager cold start service failure:

#	Check	Command / Action	Expected Result
1	All services running	`get services`	All platform services show `running`
2	Cluster VIP active	`get cluster vip`	VIP assigned to node
3	Cluster status	`get cluster status`	Overall status: STABLE
4	UI accessible (node IP)	Browse to `https://192.168.1.71`	Login page loads
5	UI accessible (VIP)	Browse to `https://192.168.1.70`	Login page loads
6	Component health	Check UI banner	All components UP
7	VCF Ops NSX adapter	VCF Operations → Integrations	Status: Collecting
8	VCF Ops VCF adapter	VCF Operations → Integrations	Status: Collecting
9	Transport node connectivity	NSX UI → Fabric → Nodes	All nodes connected
10	Filesystem usage	`get filesystem-stats`	No filesystem above 80%

#	Recommendation	Priority
1	Create a startup script that checks NSX service health after boot and restarts HTTP if stopped	High
2	Configure VCF Operations alerts for NSX Manager service state changes (not just adapter connectivity)	High
3	Document the service start order (datastore → http → manager → controller) in the lab runbook for quick reference	Medium
4	Increase NSX Manager VM resources in the nested environment (CPU: 4→6, RAM: 32→48GB) to reduce cold start failures	Medium
5	Implement startup order in VMware Workstation: ensure the NSX Manager VM starts after the vCenter and AD/DNS VMs are fully online	Medium
6	Monitor /var/log usage: at 55%, it is the highest-utilized filesystem and could cause issues if logs are not rotated	Low

Service	State
applianceproxy	running
async_replicator	stopped
auth	stopped
cluster_manager	stopped
cm-inventory	stopped
controller	stopped
datastore	stopped
datastore_nonconfig	stopped
http	stopped
idps-reporting	stopped
install-upgrade	stopped
liagent	stopped
manager	stopped
messaging-manager	stopped
migration-coordinator	stopped
monitoring	stopped
node-mgmt	running
node-stats	running
nsx-message-bus	stopped
nsx-platform-client	running
nsx-upgrade-agent	running
ntp	running
search	stopped
sha	running
site_manager	stopped
snmp	stopped
ssh	running
syslog	running
ui-service	running

Field	Value
Document Title	NSX Manager 9.0.1 Cold Start Service Failure — RCA & Recovery
Version	1.0
Author	Virtual Control LLC
Date	March 26, 2026
Classification	Internal — Lab Environment
Environment	Dell Precision 7920, VMware Workstation 17.x Nested Lab
NSX Version	9.0.1.0 (Build 24952114)
Related Documents	VCF Undocumented Issues Reference, VCF Troubleshooting Handbook