Each Part of this bible uses a distinct color theme for quick visual identification:
| Part | Color | Topic |
|---|---|---|
| ■■■ Part I | Purple | Architecture & Fundamentals |
| ■■■ Part II | Blue | Deployment Guide |
| ■■■ Part III | Green | Day 2 Operations |
| ■■■ Part IV | Teal | NSX Networking & Security |
| ■■■ Part V | Orange | vSAN Storage |
| ■■■ Part VI | Gold | Security, Certificates & Compliance |
| ■■■ Part VII | Red | Troubleshooting & Recovery |
| ■■■ Part VIII | Slate | Complete Command Reference |
| ■■■ Part IX | Crimson | Disaster Recovery & Health Checks |
| ■■■ Appendices | Indigo | Quick Reference, Ports, Logs, Glossary |
| Topic | Section |
|---|---|
| Active Directory Identity Source | 3.5.5 |
| Air-Gapped License Activation | 1.5, 3.2 |
| Alerts & Notifications | 3.9 |
| API Authentication (Bearer Token) | Appendix I |
| API Endpoint Reference (SDDC Manager) | Appendix I |
| API Quick Reference | 8.7 |
| API Task Lifecycle | Appendix I |
| Aria Suite Lifecycle Deployment | 2.5.3 |
| Backup Configuration | 3.10, 7.7.6 |
| Bringup Process | 2.6 |
| Certificate Architecture | 6.1 |
| Certificate Authority (Microsoft CA) | 3.6.2, 6.4.1 |
| Certificate Authority (OpenSSL) | 3.6.3, 6.4.2 |
| Certificate Commands (keytool) | 6.7, 8.5.2 |
| Certificate Commands (openssl) | 6.2.2, 8.5.1 |
| Certificate Mismatch | 7.6.4 |
| Certificate Replacement (NSX) | 6.2, 4.5.4 |
| Certificate Troubleshooting Flowchart | 7.8.2 |
| Cloud Builder / VCF Installer | 1.2, 2.4 |
| Compliance Monitoring | 3.8, 6.6 |
| Component Architecture | 1.2 |
| Credential Cascade Failure | 7.2.6, Appendix G |
| Credentials Reference | A.1.3 |
| Custom Dashboards | 3.9.6 |
| Data Source Connections | 3.4 |
| Deployment Failure Flowchart | 7.8.1 |
| Diagnostic Scripts | Appendix F |
| Disk Management (vSAN) | 5.2 |
| Distributed Firewall (DFW) | 4.3.5, 4.3.6 |
| DNS Records | 1.3, 2.1, A.1.2 |
| Drift Detection | 3.8.4 |
| ESXi Certificate Regeneration | 6.3 |
| ESXi Commands | 8.1 |
| ESXi Host Recovery | 7.7.5 |
| esxcli Commands | 8.1.1 |
| esxtop | 5.4.4, 8.1.3 |
| EVC Compatibility | 7.4.3 |
| Fleet Management | 3.3 |
| Flowcharts (All) | 7.8 |
| Full Cleanup & Redeployment | 7.7.4 |
| Glossary | Appendix D |
| Hardware Requirements | 1.6 |
| Interview Cheat Sheet | Appendix H |
| IP Address Plan | 2.1, A.1.1 |
| Java Keystore | 6.7 |
| JSON Configuration File | 2.4 |
| keytool Commands | 6.7.2, 8.5.2 |
| License Registration | 3.2 |
| Licensing Model | 1.5 |
| Log File Matrix | Appendix C |
| Log Forwarding (SDDC Manager) | 3.11.4 |
| Management Domain | 1.1, 2.6 |
| Memory Convergence (vMotion) | 7.4.2 |
| Network Architecture | 1.3 |
| NFS Mount Issues | 7.2.4 |
| NSX API Commands | 8.4.2 |
| NSX CLI Commands | 8.4.1 |
| NSX Manager Recovery | 7.7.3 |
| NSX Manager Setup | 4.1 |
| NSX Monitoring | 4.4 |
| NSX OOM Issues | 4.5.1, 7.5.1 |
| NSX Port Requirements | 4.5.7, A.2.3 |
| NSX Troubleshooting | 4.5, 7.5 |
| Offline Depot Setup | 2.3 |
| Offline Depot Troubleshooting | 7.6 |
| OpenSSL Configuration | 6.2.1, 6.4.2 |
| Orphaned Object Cleanup | 5.2.4 |
| ovftool Deployment | 2.5 |
| OVA Property Names | 2.5.4 |
| Password Management | 3.7, 6.5 |
| Password Rotation | 3.7.5, 6.5.2 |
| Port Reference | Appendix B |
| PAGER=cat (psql) | 7.2.6, 8.3 |
| PostgreSQL (SDDC Manager) | 7.2.6, 8.3 |
| PostgreSQL Issues (vCenter) | 7.3.2 |
| Python HTTPS Server | 2.3.3, 7.6.5 |
| Recovery Procedures | 7.7 |
| SATP Claim Rules | 5.1.3 |
| SDDC Manager API Handbook | Appendix I |
| SDDC Manager Bootstrap (Local Storage) | 5.3.0 |
| SDDC Manager Commands | 8.3 |
| SDDC Manager Recovery | 7.7.1 |
| SDDC Manager SSH | 7.2.5 |
| SDDC Manager Troubleshooting | 7.2 |
| Segments (NSX) | 4.3 |
| Service Failure Flowchart | 7.8.4 |
| SoS Diagnostic Bundle | 7.2.8 |
| SSO Configuration | 3.5 |
| Storage Architecture | 1.4 |
| Storage Migration (Thick→Thin) | 5.3 |
| task_metadata (platform DB) | 7.2.6, Appendix F |
| Technical Accomplishments | Appendix G |
| TEP Configuration (vmk0) | 4.2.3 |
| Tier-0/Tier-1 Gateways | 4.3.4 |
| Timeout Loop Issues | 7.2.3 |
| TLS/FIPS Compatibility | 7.6.1 |
| Traceflow | 4.5.8, 7.5.6 |
| Transport Node Configuration | 4.2 |
| Transport Node Troubleshooting | 4.2.5, 7.5.2 |
| Trust Store Updates | 6.2.5 |
| Undocumented by Broadcom (35 Discoveries) | G.6 |
| vCenter Commands | 8.2 |
| vCenter Deployment Stuck | 7.3.1, 7.8.7 |
| vCenter Recovery | 7.7.2 |
| vCenter Troubleshooting | 7.3 |
| VCF Cloud Account | 3.4.1 |
| VCF Installer | 2.4 |
| VCF Operations First Login | 3.1 |
| VCF Operations for Logs | 3.11 |
| VDT (Deployment Toolkit) | 2.7, 7.1 |
| vhv.enable Ghost Setting | 7.4.1 |
| vLCM Host Seeding Failure | 7.8.8 |
| vMotion IP Assignments | 2.1 |
| vMotion Troubleshooting | 7.4 |
| vmkfstools Commands | 5.3, 8.1.2 |
| VMkernel Layout | 1.3, A.1.5 |
| VMware Workstation VMX Settings | 1.6, 2.2 |
| VMX Configuration | 2.2 |
| VPXD Issues | 7.3.4 |
| vSAN ESA Configuration | 5.1 |
| vSAN ESA vs OSA | 1.4, 5.1.1 |
| vSAN Health Check | 5.4.2 |
| vSAN Issue Flowchart | 7.8.5 |
| vSAN Monitoring | 5.4 |
| vSAN Observer | 5.4.5 |
| vSAN Troubleshooting | 5.5 |
| Windows / Depot Commands | 8.6 |
| Workload Domains | 1.1 |
VMware Cloud Foundation (VCF) is a unified software-defined data center (SDDC) platform that integrates compute virtualization (vSphere/ESXi), software-defined networking (NSX), software-defined storage (vSAN), and centralized lifecycle management (SDDC Manager) into a single, validated, and automated stack. VCF delivers a turnkey private cloud that can be deployed, operated, and upgraded as a cohesive unit rather than managing individual VMware products separately.
| Change | VCF 5.x | VCF 9.0 |
|---|---|---|
| Deployment tool | Cloud Builder | VCF Installer (same OVA as SDDC Manager) |
| Management UI | SDDC Manager UI (primary) | VCF Operations (SDDC Manager UI deprecated) |
| Operations suite | Aria Suite (optional) | VCF Operations (mandatory) |
| Licensing | 11 license keys, per-socket | 2 keys (per-core + per-TiB), 16-core minimum per CPU |
| FIPS mode | Optional | Enabled by default, cannot be disabled |
| NSX availability | Standalone or VCF | VCF only (no standalone NSX) |
| vSAN default | OSA or ESA | ESA recommended for new deployments |
| vLCM baselines | Supported | Removed -- must use vLCM Images (desired state) |
| IWA authentication | Supported | Removed -- use AD over LDAPS or Identity Federation |
| Host Profiles | Supported | Deprecated -- use vSphere Configuration Profiles |
| Post-deployment installer | Power off Cloud Builder | VCF Installer transforms into SDDC Manager |
Management Domain (Required)
VI Workload Domains (Optional)
| Architecture | Description | Minimum Hosts |
|---|---|---|
| Consolidated | Management + Edge services on same hosts | 4 |
| Standard | Separate management and edge clusters | 4 management + edge hosts |
+-------------------------------------------------------------------+
| VCF OPERATIONS (Mandatory) |
| Fleet Management | Monitoring | Diagnostics |
+-------------------------------------------------------------------+
| VCF AUTOMATION (Optional) |
| Self-Service | Blueprints | Service Broker | Orchestrator |
+-------------------------------------------------------------------+
| SDDC MANAGER |
| Lifecycle Management | Deployment | Orchestration |
+-----------------+-----------------+-----------------+--------------+
| vSphere | NSX | vSAN | vCenter |
| (Compute) | (Networking) | (Storage) | (Mgmt) |
+-----------------+-----------------+-----------------+--------------+
| ESXi HYPERVISOR |
| Type 1 Bare-Metal |
+-------------------------------------------------------------------+
SDDC Manager is the central lifecycle management and orchestration platform for VCF. In the lab, it runs at 192.168.1.241 (sddc-manager.lab.local).
| Attribute | Details |
|---|---|
| Purpose | Central lifecycle management, deployment, orchestration |
| Version | 9.0.1.0 build 24962180 |
| Key Services | domainmanager, lcm, operationsmanager, commonsvcs, nginx, postgresql |
| Log Location | /var/log/vmware/vcf/ |
| UI Port | 443 (HTTPS) |
| SSH Access | Only vcf user can SSH in; root access via su - from vcf session |
| REST API | https://sddc-manager.lab.local/v1/ |
Key Functions:
Lab lesson: SCP does not work to SDDC Manager due to its restricted shell. Use `ssh vcf@host "cat > file" < localfile` for file transfer instead.
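A minimal sketch of that workaround (file names and paths are illustrative):

# Push a local file up to SDDC Manager through the restricted shell
ssh vcf@sddc-manager.lab.local "cat > /home/vcf/diag.py" < ./diag.py
# Pull a file back out the same way
ssh vcf@sddc-manager.lab.local "cat /home/vcf/some-report.txt" > ./some-report.txt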
vCenter manages all ESXi hosts, VMs, clusters, DRS, HA, and vMotion. In the lab, it runs at 192.168.1.69 (vcenter.lab.local).
| Attribute | Details |
|---|---|
| Purpose | Compute virtualization management |
| Version | 9.0.1.0 build 24957454 |
| Key Services | vpxd, vsphere-ui, vmware-postgres, sso (sts), vlcm, eam |
| Log Location | /var/log/vmware/ |
| UI Port | 443 (vSphere Client), 5480 (VAMI) |
| Resources | 4 vCPU, 19GB RAM |
Key Functions:
NSX provides software-defined networking, overlay networks, micro-segmentation, and gateway firewalls. In the lab, a single-node NSX Manager runs at 192.168.1.71 (nsx-node1.lab.local) with VIP at 192.168.1.70 (nsx-vip.lab.local).
| Attribute | Details |
|---|---|
| Purpose | Software-defined networking and security |
| Version | 9.0.1.0 build 24952114 |
| Key Services | proton, corfu, nsx-proxy (on hosts) |
| Log Location | /var/log/proton/ |
| Cluster Ports | 1234 (agent), 1235 (cluster) |
| Resources | 6 vCPU, 32GB RAM (minimum for nested) |
Key Concepts:
VIP: a virtual IP (192.168.1.70) that provides the management endpoint for the NSX cluster. In HA mode (3 nodes), the VIP moves to the active manager.

Lab lesson: NSX Manager `small` deployment needs 32GB RAM and 6 vCPU minimum in nested environments. 16GB causes kernel OOM; 24GB runs but crashes under load (e.g., transport node deployment).
vSAN aggregates local disks across ESXi hosts into a shared datastore. In the lab, vSAN ESA runs across all 4 hosts as datastore vcenter-cl01-ds-vsan01.
| Attribute | Details |
|---|---|
| Purpose | Software-defined storage |
| Architecture | ESA (Express Storage Architecture) |
| Key Services | vsanmgmtd, clomd, vsan-health |
| Minimum Hosts | 3 for cluster, 4 for VCF management domain |
| Default Policy | RAID-1 (FTT=1) |
VCF Operations (formerly Aria Operations) provides monitoring, diagnostics, fleet management, and the primary management UI for VCF 9.0. In the lab, it runs at 192.168.1.77 (vcf-ops.lab.local).
| Attribute | Details |
|---|---|
| Purpose | Monitoring, diagnostics, fleet management, primary VCF UI |
| Version | 9.0.2.0 build 25137838 |
| Deployment Model | xsmall (Simple -- single node) |
| Resources | 2 vCPU, 8GB RAM |
Key Functions:
The VCF Installer is new in VCF 9.0 and replaces Cloud Builder from VCF 5.x. The VCF Installer OVA is the same OVA as SDDC Manager -- it serves dual purpose. When deployed on the management domain ESXi host, it runs as the installer; after bringup completes, it transforms into SDDC Manager.
| Aspect | Cloud Builder (5.x) | VCF Installer (9.0) |
|---|---|---|
| Purpose | Initial deployment only | Deployment + fleet management |
| Post-deployment | Power off and archive | Transforms into SDDC Manager |
| Integration | Standalone | Integrated with VCF Operations |
+-----------------------+
| VCF Operations |
| 192.168.1.77 |
+----------+------------+
|
+----------v------------+
| Fleet Mgmt (Proxy) |
| 192.168.1.78 |
+----------+------------+
|
+----------v------------+
| SDDC Manager |
| 192.168.1.241 |
+--+------+------+------+
| | |
+----------+ +---+---+ +----------+
| | | |
+------v------+ +----v----+ +------v------+
| vCenter | | NSX | | vSAN |
| .69 | | .70/.71| | (4 hosts) |
+------+------+ +----+----+ +------+------+
| | |
+------v--------------v------------v------+
| ESXi Hosts (Transport Nodes) |
| .74 (esxi01) .75 (esxi02) |
| .76 (esxi03) .82 (esxi04) |
+-----------------------------------------+
| Network | Purpose | Subnet | MTU | VMkernel |
|---|---|---|---|---|
| Management | ESXi mgmt, vCenter, SDDC Manager, NSX TEP (overlay) | 192.168.1.0/24 | 1500 | vmk0 |
| vMotion | Live VM migration | 192.168.11.0/24 | 9000 (recommended) | vmk1 |
| vSAN | Storage traffic | 192.168.12.0/24 | 9000 (recommended) | vmk2 |
| NSX Hyperbus | NSX internal | 169.254.0.0/16 | -- | vmk50 |
| VMkernel | TCP/IP Stack | Purpose |
|---|---|---|
| vmk0 | defaultTcpipStack | Management + NSX TEP (overlay) |
| vmk1 | vmotion | vMotion |
| vmk2 | defaultTcpipStack | vSAN |
| vmk50 | hyperbus | NSX Hyperbus (internal, auto-created) |
| Host | vmk0 (Mgmt/TEP) | vmk1 (vMotion) | vmk2 (vSAN) |
|---|---|---|---|
| esxi01.lab.local | 192.168.1.74 | 192.168.11.121 | 192.168.12.121 |
| esxi02.lab.local | 192.168.1.75 | 192.168.11.120 | 192.168.12.120 |
| esxi03.lab.local | 192.168.1.76 | 192.168.11.122 | 192.168.12.122 |
| esxi04.lab.local | 192.168.1.82 | 192.168.11.123 | 192.168.12.123 |
In the lab, all networking runs through a single VDS (vSphere Distributed Switch):
VDS: vcenter-cl01-vds01
├── Port Group: vcenter-cl01-vds01-pg-vm-mgmt (Management)
├── Port Group: vcenter-cl01-vds01-pg-vmotion (vMotion)
└── Port Group: vcenter-cl01-vds01-pg-vsan (vSAN)
Each ESXi VM in VMware Workstation has 4x vmxnet3 adapters in bridged mode. Promiscuous mode is enabled in the VMX file for all NICs (ethernet*.noPromisc = "FALSE") to allow nested VM traffic to flow.
NSX 9.0 introduces the "Use VMkernel Adapter" option for TEP assignment, which reuses vmk0 (the management VMkernel) as the tunnel endpoint. This eliminates the need for a dedicated TEP VLAN and IP pool -- ideal for nested lab environments.
The transport node profile configuration in the lab uses `tn-profile-mgmt`, `vcenter-cl01-vds01`, `nsx-overlay-transportzone`, and `nsx-default-uplink-hostswitch-profile`.

Both forward (A) and reverse (PTR) records are required for ALL VCF components. The DNS server in the lab is a Windows VM at 192.168.1.230, which also serves as the Active Directory domain controller for lab.local.
# Forward Records (A)
192.168.1.69 vcenter.lab.local
192.168.1.70 nsx-vip.lab.local
192.168.1.71 nsx-node1.lab.local
192.168.1.74 esxi01.lab.local
192.168.1.75 esxi02.lab.local
192.168.1.76 esxi03.lab.local
192.168.1.82 esxi04.lab.local
192.168.1.77 vcf-ops.lab.local
192.168.1.78 fleet.lab.local
192.168.1.79 collector.lab.local
192.168.1.90 automation.lab.local
192.168.1.94 aria-lifecycle.lab.local
192.168.1.241 sddc-manager.lab.local
Important: PTR records (reverse DNS) must also be created for every entry. VCF Installer validation and NSX both require working reverse DNS.
DNS entries NOT needed for Simple Mode deployment:
All VCF components must synchronize time from the same NTP source. In the lab, 192.168.1.230 serves as both DNS and NTP. NTP configuration on NSX Manager is done via the admin CLI, not the UI:
# SSH to NSX Manager as admin
set name-servers 192.168.1.230
set ntp-servers 192.168.1.230
| Feature | vSAN ESA | vSAN OSA |
|---|---|---|
| Architecture | Single storage tier (flat pool) | Disk groups (cache + capacity tiers) |
| Disk Type | NVMe SSDs only | SAS/SATA/NVMe (mixed) |
| Disk Groups | None | Up to 5 per host, 1 cache + 7 capacity each |
| Performance | Higher (optimized for flash) | Standard |
| Compression/Dedup | Higher efficiency | Standard |
| Minimum Devices | 4 NVMe per host | 1 cache SSD + 1 capacity per group |
| Nested Lab Support | Yes (with HCL bypass) | Yes |
| VCF 9.0 Default | Recommended for new deployments | Supported for existing infrastructure |
The lab uses vSAN ESA across 4 hosts. Because nested virtual disks are not on the VMware HCL, a bypass is required before running the VCF Installer:
# SSH to VCF Installer (192.168.1.240) as root
echo "vsan.esa.sddc.managed.disk.claim=true" >> /etc/vmware/vcf/domainmanager/application-prod.properties
systemctl restart domainmanager
Virtual SATA disks must be marked as SSD in the VMX file:
sata0:0.virtualSSD = "1"
sata0:2.virtualSSD = "1"
vSAN storage policies define data protection levels using FTT (Failures to Tolerate):
| FTT | Can Survive | RAID-1 Min Hosts | RAID-5/6 Min Hosts |
|---|---|---|---|
| 1 | 1 failure | 3 | 4 |
| 2 | 2 failures | 5 | 6 |
| 3 | 3 failures | 7 | N/A |
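As a quick sizing illustration (raw overhead before compression): a 100 GB object under RAID-1 with FTT=1 stores two full copies and consumes roughly 200 GB of raw capacity; RAID-1 with FTT=2 stores three copies (~300 GB); RAID-5/6 erasure coding brings the overhead down to roughly 1.25x-1.5x at the cost of the higher host minimums shown above.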
The lab vSAN datastore is named `vcenter-cl01-ds-vsan01`.

VCF 9.0 reduces licensing complexity from 11 license keys to just 2:
| License Key | Purpose | Model |
|---|---|---|
| VMware Cloud Foundation | Compute licensing | Per-core (16-core minimum per CPU) |
| VMware vSAN | Storage licensing | Per-terabyte (TiB) |
| Tier | Included Features |
|---|---|
| VCF Starter | Basic SDDC: vSphere, vSAN, NSX networking |
| VCF Standard | + NSX Advanced security (DFW, IDS/IPS), vSAN Enterprise, VCF Operations |
| VCF Enterprise | + VCF Automation, Kubernetes support, multi-cloud capabilities |
Note: VCF Operations is mandatory across all tiers in VCF 9.0.
| Requirement | Specification |
|---|---|
| Minimum hosts (mgmt domain) | 4 ESXi hosts |
| Minimum hosts (workload domain) | 3 ESXi hosts |
| CPU | Intel VT-x or AMD-V capable, on VMware HCL |
| RAM per host | Minimum 256GB (512GB+ recommended) |
| Storage (vSAN ESA) | 4+ NVMe SSDs per host |
| Storage (vSAN OSA) | 1 cache SSD + capacity disks per disk group |
| Network | 2x 25GbE minimum (10GbE supported, 100GbE recommended) |
| MTU | 1600+ for NSX TEP, 9000 for vSAN/vMotion |
| NIC | On VMware HCL |
| Component | Specification |
|---|---|
| Physical Host | Dell Precision 7920, 35-core CPU, 192GB RAM |
| Storage | D: 2TB SSD, E: 2TB SSD, 2x 4TB HDD |
| Hypervisor | VMware Workstation (latest) |
| Network Mode | Bridged (all ESXi VMs on same physical network) |
| Nested ESXi Hosts | 4 VMs |
| DNS/AD Server | Windows VM at 192.168.1.230 |
| Total RAM consumed | ~192GB (4x48GB ESXi + management VMs) |
| Setting | Value |
|---|---|
| vCPUs | 32 |
| Cores per Socket | 4 |
| RAM | 48GB (49,152 MB) |
| Network Adapters | 4x vmxnet3 (bridged) |
| Boot Disk | SCSI (pvscsi) |
| vSAN Disk 1 | SATA (sata0:0) -- marked as SSD |
| vSAN Disk 2 | SATA (sata0:2) -- marked as SSD |
| Guest OS | vmkernel9 |
| Firmware | EFI |
| Hardware Version | 21 (virtualHW.version = "21") |
The following settings must be added to each ESXi VM's .vmx file for nested virtualization to work:
# ===========================================
# NESTED VIRTUALIZATION SETTINGS
# ===========================================
vhv.enable = "TRUE"
vpmc.enable = "TRUE"
vvtd.enable = "TRUE"
# ===========================================
# PROMISCUOUS MODE FOR NESTED VM TRAFFIC
# ===========================================
ethernet0.noPromisc = "FALSE"
ethernet0.allowGuestConnectionControl = "TRUE"
ethernet1.noPromisc = "FALSE"
ethernet1.allowGuestConnectionControl = "TRUE"
ethernet2.noPromisc = "FALSE"
ethernet2.allowGuestConnectionControl = "TRUE"
ethernet3.noPromisc = "FALSE"
ethernet3.allowGuestConnectionControl = "TRUE"
# ===========================================
# MARK DISKS AS SSD FOR VSAN
# ===========================================
sata0:0.virtualSSD = "1"
sata0:2.virtualSSD = "1"
For esxi01 only (has extra disk for local storage): add
sata0:3.virtualSSD = "1"
VMX file locations:
D:\VMs\esxi01.lab.local\esxi01.lab.local.vmx (D: 2TB SSD)
E:\VMs\esxi02.lab.local\esxi02.lab.local.vmx (E: 2TB SSD)
E:\VMs\esxi03.lab.local\esxi03.lab.local.vmx (4TB HDD)
F:\VMs\esxi04.lab.local\esxi04.lab.local.vmx (F: 4TB HDD)
| Setting | Purpose |
|---|---|
| `vhv.enable = "TRUE"` | Passes VT-x/AMD-V to nested ESXi (required for nested VMs) |
| `vpmc.enable = "TRUE"` | Virtual Performance Counters for CPU monitoring |
| `vvtd.enable = "TRUE"` | Virtual Intel VT-d (IOMMU) for nested passthrough |
| `ethernet*.noPromisc = "FALSE"` | Allows nested VM traffic to flow through VMware Workstation vSwitch |
| `ethernet*.allowGuestConnectionControl` | Allows ESXi to control network connections |
| `sata*:*.virtualSSD = "1"` | Marks virtual SATA disks as SSD for vSAN detection |
| VM | vCPU | RAM | Storage | Deployed By |
|---|---|---|---|---|
| vCenter Server | 4 | 19GB | vSAN | VCF Installer |
| NSX Manager | 6 | 32GB | vSAN (thin) | Manual (ovftool) |
| SDDC Manager | 4 | 16GB | vSAN (thin, ~108GB used) | VCF Installer bringup |
| VCF Operations | 2 | 8GB | vSAN (thin) | Manual (ovftool) |
| Fleet (Cloud Proxy) | 2 | 4GB | vSAN (thin) | VCF Operations import |
Hyper-V, VBS, and related features must be disabled on the Windows host for nested virtualization to work:
# Run in PowerShell as Administrator
bcdedit /set hypervisorlaunchtype off
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux -NoRestart
Also disable Memory Integrity: Windows Security > Device Security > Core isolation details > Turn OFF "Memory integrity".
REBOOT REQUIRED after these changes.
Verify after reboot:
bcdedit /enum | findstr hypervisor
# Should return nothing or "hypervisorlaunchtype Off"
Get-CimInstance -ClassName Win32_DeviceGuard -Namespace root\Microsoft\Windows\DeviceGuard
# VirtualizationBasedSecurityStatus should be 0
| Component | IP Address | FQDN | Role |
|---|---|---|---|
| esxi01 | 192.168.1.74 | esxi01.lab.local | ESXi Host 1 |
| esxi02 | 192.168.1.75 | esxi02.lab.local | ESXi Host 2 |
| esxi03 | 192.168.1.76 | esxi03.lab.local | ESXi Host 3 |
| esxi04 | 192.168.1.82 | esxi04.lab.local | ESXi Host 4 |
| vCenter | 192.168.1.69 | vcenter.lab.local | vCenter Server |
| NSX VIP | 192.168.1.70 | nsx-vip.lab.local | NSX Manager Virtual IP |
| NSX Node 1 | 192.168.1.71 | nsx-node1.lab.local | NSX Manager Node |
| VCF Operations | 192.168.1.77 | vcf-ops.lab.local | VCF Operations |
| Fleet (Cloud Proxy) | 192.168.1.78 | fleet.lab.local | Fleet Management |
| Collector | 192.168.1.79 | collector.lab.local | Operations Collector |
| Automation | 192.168.1.90 | automation.lab.local | VCF Automation |
| Aria Lifecycle | 192.168.1.94 | aria-lifecycle.lab.local | Lifecycle Manager |
| SDDC Manager | 192.168.1.241 | sddc-manager.lab.local | SDDC Manager |
| NSX Manager (SDDC reg) | 192.168.1.70 | nsx-manager.lab.local | SDDC Manager's registered NSX FQDN |
| DNS / NTP / AD | 192.168.1.230 | dc.lab.local | DNS, NTP, Active Directory |
| Gateway | 192.168.1.1 | -- | Default gateway |
Critical: SDDC Manager registers NSX using the FQDN `nsx-manager.lab.local` (mapped to VIP .70). NSX certificates must include this name in the SAN field, not just `nsx-node1.lab.local`.
| Host | vMotion IP (vmk1) |
|---|---|
| esxi01 | 192.168.11.121 |
| esxi02 | 192.168.11.120 |
| esxi03 | 192.168.11.122 |
| esxi04 | 192.168.11.123 |
| Host | vSAN IP (vmk2) |
|---|---|
| esxi01 | 192.168.12.121 |
| esxi02 | 192.168.12.120 |
| esxi03 | 192.168.12.122 |
| esxi04 | 192.168.12.123 |
All of the following must have both forward (A) and reverse (PTR) records:
# ESXi hosts
192.168.1.74 esxi01.lab.local
192.168.1.75 esxi02.lab.local
192.168.1.76 esxi03.lab.local
192.168.1.82 esxi04.lab.local
# Core infrastructure
192.168.1.69 vcenter.lab.local
192.168.1.70 nsx-vip.lab.local
192.168.1.70 nsx-manager.lab.local
192.168.1.71 nsx-node1.lab.local
192.168.1.241 sddc-manager.lab.local
# VCF Operations ecosystem
192.168.1.77 vcf-ops.lab.local
192.168.1.78 fleet.lab.local
192.168.1.79 collector.lab.local
192.168.1.90 automation.lab.local
192.168.1.94 aria-lifecycle.lab.local
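# Quick pre-bringup check (a sketch): confirm forward and reverse lookups against the lab DNS server
nslookup vcenter.lab.local 192.168.1.230
nslookup 192.168.1.69 192.168.1.230
# Loop the reverse checks across every component IP listed above
for ip in 192.168.1.69 192.168.1.70 192.168.1.71 192.168.1.74 192.168.1.75 192.168.1.76 192.168.1.82 192.168.1.77 192.168.1.78 192.168.1.79 192.168.1.90 192.168.1.94 192.168.1.241; do nslookup $ip 192.168.1.230; done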
[ ] Physical host: Hyper-V disabled, Memory Integrity off, rebooted
[ ] VMware Workstation installed
[ ] 4 ESXi VMs created with correct specs (32 vCPU, 48GB RAM, 4x vmxnet3)
[ ] VMX files edited with nested virtualization + promiscuous mode + SSD marking
[ ] ESXi 9.0.1 installed on all 4 VMs from VMware ISO
[ ] DNS server running with all A and PTR records
[ ] NTP server accessible from all hosts
[ ] ESXi hosts have only vSwitch0 with vmk0 (clean state)
[ ] ESXi hosts not connected to any vCenter
[ ] SSH enabled on all ESXi hosts
[ ] Nested virtualization verified: cat /proc/cpuinfo | grep -E "vmx|svm"
[ ] SSD status verified: esxcli storage core device list | grep "Is SSD"
[ ] VCF Installer OVA downloaded from Broadcom Support Portal
[ ] Offline depot prepared (if not using online Broadcom depot)
[ ] Common password set on all ESXi hosts (used during VCF Installer wizard)
Each ESXi VM must have the following settings. These go at the END of the .vmx file (the VM must be powered off when editing):
# ===========================================
# NESTED VIRTUALIZATION SETTINGS
# ===========================================
# Hardware virtualization passthrough
vhv.enable = "TRUE"
# Virtual Performance Counters
vpmc.enable = "TRUE"
# Virtual VT-d / IOMMU
vvtd.enable = "TRUE"
# ===========================================
# PROMISCUOUS MODE FOR NESTED VM TRAFFIC
# ===========================================
ethernet0.noPromisc = "FALSE"
ethernet0.allowGuestConnectionControl = "TRUE"
ethernet1.noPromisc = "FALSE"
ethernet1.allowGuestConnectionControl = "TRUE"
ethernet2.noPromisc = "FALSE"
ethernet2.allowGuestConnectionControl = "TRUE"
ethernet3.noPromisc = "FALSE"
ethernet3.allowGuestConnectionControl = "TRUE"
# ===========================================
# MARK DISKS AS SSD FOR VSAN
# ===========================================
sata0:0.virtualSSD = "1"
sata0:2.virtualSSD = "1"
All 4 network adapters should be configured as vmxnet3 in Bridged mode, connected to the same physical NIC as the host's management network. This allows all nested VMs to communicate on the 192.168.1.0/24 subnet.
Each ESXi VM should have at minimum:
After powering on each ESXi VM, verify SSD detection:
# SSH to ESXi host
ssh root@192.168.1.74
# Verify nested virtualization is working
cat /proc/cpuinfo | grep -E "vmx|svm"
# Should output lines containing "vmx" or "svm"
# Verify disks detected as SSD
esxcli storage core device list | grep -E "Display Name|Is SSD"
# Each vSAN disk should show "Is SSD: true"
If disks show as HDD, verify the VMX file has sata0:0.virtualSSD = "1" entries and perform a full power cycle (shutdown + power on, not just reboot).
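If the virtualSSD flag still does not take effect, an approach that has historically worked in nested labs is an SATP claim rule that tags the device as SSD (see also the SATP Claim Rules coverage in 5.1.3); the device identifier below is an example -- take the real one from `esxcli storage core device list`:

# Tag a specific device as SSD via a claim rule (device name is an example)
esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=mpx.vmhba0:C0:T1:L0 --option="enable_ssd"
# Reclaim the device so the new rule is applied, then re-check
esxcli storage core claiming reclaim -d mpx.vmhba0:C0:T1:L0
esxcli storage core device list | grep -E "Display Name|Is SSD"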
For air-gapped or lab environments without direct internet access, an offline depot server provides VCF binaries to the SDDC Manager / VCF Installer over HTTPS.
Download the following from the Broadcom Support Portal:
Metadata (required):
`vcf-9.0.1.0-offline-depot-metadata.zip` -- Contains the PROD directory structure with manifest, product version catalog, vSAN HCL data, and compatibility data

Appliances and Binaries:
| File | Component |
|---|---|
| `VCF-SDDC-Manager-Appliance-9.0.1.0.24962180.ova` | SDDC Manager |
| `VMware-VCSA-all-9.0.1.0.24957454.iso` | vCenter Server |
| `nsx-unified-appliance-9.0.1.0.24952114.ova` | NSX Manager |
| `VCF-OPS-Lifecycle-Manager-Appliance-9.0.1.0.24960371.ova` | Aria Lifecycle |
| `Operations-Appliance-9.0.1.0.24960351.ova` | VCF Operations |
| `Operations-Cloud-Proxy-9.0.1.0.24960349.ova` | Operations Cloud Proxy |
| `O11N_VA-9.0.1.0.24923009.ova` | Orchestrator |
| `vmsp-vcfa-combined-9.0.1.0.24965341.tar` | VCF Automation |
| `VmwareCompatibilityData.json` | Compatibility data |
Generate a self-signed TLS certificate for the depot server. Run on the Windows depot server (requires OpenSSL -- included with Git for Windows):
openssl req -x509 -newkey rsa:2048 `
-keyout "C:\VCF-Depot\server.key" `
-out "C:\VCF-Depot\server.crt" `
-days 365 -nodes `
-subj "/CN=192.168.1.52" `
-addext "subjectAltName=IP:192.168.1.52"
Important: The SAN must include the IP address that SDDC Manager will use to connect. If using a hostname, add a DNS entry as well.
Save the following as C:\VCF-Depot\https_server.py:
#!/usr/bin/env python3
"""
HTTPS server for VCF Offline Depot
Serves files with TLS 1.2+ for SDDC Manager compatibility
"""
import http.server
import ssl
import os
import base64
import socketserver
from functools import partial
# Configuration
PORT = 8443
CERT_FILE = 'server.crt'
KEY_FILE = 'server.key'
USERNAME = 'admin'
PASSWORD = 'admin'
class AuthHandler(http.server.SimpleHTTPRequestHandler):
protocol_version = "HTTP/1.1"
def __init__(self, *args, directory=None, **kwargs):
super().__init__(*args, directory=directory, **kwargs)
def do_HEAD(self):
if not self.authenticate():
return
super().do_HEAD()
def do_GET(self):
if not self.authenticate():
return
super().do_GET()
def authenticate(self):
auth_header = self.headers.get('Authorization')
if auth_header is None:
self.send_auth_request()
return False
try:
auth_type, credentials = auth_header.split(' ', 1)
if auth_type.lower() != 'basic':
self.send_auth_request()
return False
decoded = base64.b64decode(credentials).decode('utf-8')
username, password = decoded.split(':', 1)
if username == USERNAME and password == PASSWORD:
return True
except Exception:
pass
self.send_auth_request()
return False
def send_auth_request(self):
self.send_response(401)
self.send_header('WWW-Authenticate', 'Basic realm="VCF Depot"')
self.send_header('Content-type', 'text/html')
self.send_header('Content-Length', '23')
self.send_header('Connection', 'close')
self.end_headers()
self.wfile.write(b'Authentication required')
def log_message(self, format, *args):
print(f"{self.client_address[0]} - {format % args}")
class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
daemon_threads = True
def run_server():
os.chdir(os.path.dirname(os.path.abspath(__file__)))
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.maximum_version = ssl.TLSVersion.TLSv1_3
if hasattr(context, 'post_handshake_auth'):
context.post_handshake_auth = False
context.options |= ssl.OP_NO_TICKET
context.options |= getattr(ssl, 'OP_NO_RENEGOTIATION', 0)
context.load_cert_chain(CERT_FILE, KEY_FILE)
try:
context.set_ciphers('DEFAULT:!aNULL:!MD5:!DSS')
except ssl.SSLError:
pass
handler = partial(AuthHandler, directory=os.getcwd())
server = ThreadedHTTPServer(('0.0.0.0', PORT), handler)
server.socket = context.wrap_socket(server.socket, server_side=True)
print(f"VCF Offline Depot Server")
print(f"========================")
print(f"Serving: {os.getcwd()}")
print(f"URL: https://192.168.1.52:{PORT}/")
print(f"Credentials: {USERNAME} / {PASSWORD}")
print(f"TLS: 1.2 - 1.3")
print(f"Press Ctrl+C to stop")
try:
server.serve_forever()
except KeyboardInterrupt:
print("\nStopped.")
server.shutdown()
if __name__ == '__main__':
run_server()
Key server design decisions:
- `OP_NO_RENEGOTIATION` prevents Java TLS renegotiation errors from SDDC Manager
- `HTTP/1.1` protocol version is required for Java clients
- `ThreadingMixIn` handles concurrent requests from SDDC Manager (which makes parallel downloads)
- `DEFAULT:!aNULL:!MD5:!DSS` cipher string provides FIPS-compatible TLS

Extract the official metadata zip and place binaries in the correct locations:
# Extract metadata
Expand-Archive -Path "vcf-9.0.1.0-offline-depot-metadata.zip" -DestinationPath "C:\VCF-Depot\metadata-extract" -Force
Copy-Item "C:\VCF-Depot\metadata-extract\PROD\*" "C:\VCF-Depot\PROD\" -Recurse -Force
# Create component directories
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\SDDC_MANAGER_VCF" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VCENTER" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\NSX_T_MANAGER" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VRSLCM" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VROPS" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VCF_OPS_CLOUD_PROXY" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VRA" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\VRO" -Force
New-Item -ItemType Directory -Path "C:\VCF-Depot\PROD\COMP\SDDC_MANAGER_VCF\lcm\productVersionCatalog" -Force
File placement map:
| File | Destination |
|---|---|
| `VCF-SDDC-Manager-Appliance-*.ova` | `PROD\COMP\SDDC_MANAGER_VCF\` |
| `VMware-VCSA-all-*.iso` | `PROD\COMP\VCENTER\` |
| `nsx-unified-appliance-*.ova` | `PROD\COMP\NSX_T_MANAGER\` |
| `VCF-OPS-Lifecycle-Manager-*.ova` | `PROD\COMP\VRSLCM\` |
| `Operations-Appliance-*.ova` | `PROD\COMP\VROPS\` |
| `Operations-Cloud-Proxy-*.ova` | `PROD\COMP\VCF_OPS_CLOUD_PROXY\` |
| `O11N_VA-*.ova` | `PROD\COMP\VRO\` |
| `vmsp-vcfa-combined-*.tar` | `PROD\COMP\VRA\` |
| `VmwareCompatibilityData.json` | `PROD\COMP\SDDC_MANAGER_VCF\Compatibility\` |
| `productVersionCatalog.json` | `PROD\COMP\SDDC_MANAGER_VCF\lcm\productVersionCatalog\` |
Final directory tree:
C:\VCF-Depot\
├── https_server.py
├── server.crt
├── server.key
└── PROD\
├── metadata\
│ ├── manifest\v1\
│ │ └── vcfManifest.json
│ └── productVersionCatalog\v1\
│ ├── productVersionCatalog.json
│ └── productVersionCatalog.sig
├── vsan\hcl\
│ ├── all.json
│ └── lastupdatedtime.json
└── COMP\
├── SDDC_MANAGER_VCF\
│ ├── VCF-SDDC-Manager-Appliance-9.0.1.0.24962180.ova
│ ├── Compatibility\
│ │ └── VmwareCompatibilityData.json
│ └── lcm\productVersionCatalog\
│ └── productVersionCatalog.json
├── VCENTER\
│ └── VMware-VCSA-all-9.0.1.0.24957454.iso
├── NSX_T_MANAGER\
│ └── nsx-unified-appliance-9.0.1.0.24952114.ova
├── VRSLCM\
│ └── VCF-OPS-Lifecycle-Manager-Appliance-9.0.1.0.24960371.ova
├── VROPS\
│ └── Operations-Appliance-9.0.1.0.24960351.ova
├── VCF_OPS_CLOUD_PROXY\
│ └── Operations-Cloud-Proxy-9.0.1.0.24960349.ova
├── VRA\
│ └── vmsp-vcfa-combined-9.0.1.0.24965341.tar
└── VRO\
└── O11N_VA-9.0.1.0.24923009.ova
Allow inbound traffic on port 8443 on the Windows depot server:
netsh advfirewall firewall add rule name="Allow 8443 Inbound" dir=in action=allow protocol=tcp localport=8443
Lab lesson: If the Windows network profile is set to "Public", the firewall blocks all inbound connections silently. Change the network profile to "Private" in Windows Settings > Network & Internet > Ethernet > Network profile type.
cd C:\VCF-Depot
python https_server.py
From SDDC Manager, verify connectivity:
curl -k -u admin:admin https://192.168.1.52:8443/PROD/metadata/productVersionCatalog/v1/productVersionCatalog.json
Import certificate into SDDC Manager trust store:
SSH into SDDC Manager as root:
# Pull the depot server certificate
openssl s_client -connect 192.168.1.52:8443 </dev/null 2>/dev/null | openssl x509 > /tmp/depot.crt
# Find Java cacerts path
CACERTS=$(find /usr -name cacerts 2>/dev/null | head -1)
echo "Truststore: $CACERTS"
# Import certificate
keytool -import -trustcacerts -alias vcf-depot -file /tmp/depot.crt -keystore $CACERTS -storepass changeit -noprompt
# Restart services to pick up new certificate
systemctl restart commonsvcs domainmanager lcm operationsmanager
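# Optional sanity check: confirm the depot alias is now present in the trust store
keytool -list -alias vcf-depot -keystore $CACERTS -storepass changeit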
Configure depot in VCF Installer UI:
| Field | Value |
|---|---|
| FQDN or IP Address | 192.168.1.52 |
| Port | 8443 |
| Username | admin |
| Password | admin |
Click Configure. On success, available VCF versions appear in the UI.
"Secure protocol communication error"
"Path not found - 404 File not found"
C:\VCF-Depot\"Product Version Catalog (PVC) does not exist"
productVersionCatalog.json not extracted from official metadata zip, or LCM copy missingvcf-9.0.1.0-offline-depot-metadata.zip; copy to PROD\COMP\SDDC_MANAGER_VCF\lcm\productVersionCatalog\TLS/FIPS connection issues
context.minimum_version = ssl.TLSVersion.TLSv1_2 and OP_NO_RENEGOTIATION are set in the server scripthttps://192.168.1.74/ui (esxi01 Host Client)VCF-SDDC-Manager-Appliance-9.0.1.0.24962180.ova192.168.1.240vcf-installer.lab.local192.168.1.1192.168.1.230VCF 9.0.1 has a built-in bypass. After the VCF Installer OVA is running:
# SSH to VCF Installer as root
ssh root@192.168.1.240
# Add the vSAN ESA HCL bypass
echo "vsan.esa.sddc.managed.disk.claim=true" >> /etc/vmware/vcf/domainmanager/application-prod.properties
# Restart the domain manager service
systemctl restart domainmanager
# Verify the property was added
cat /etc/vmware/vcf/domainmanager/application-prod.properties | grep vsan
- Open `https://vcf-installer.lab.local`
- Log in as `admin@local`
- Configure the offline depot: `https://192.168.1.52:8443` with its credentials

Lab note: The VCF Installer in Simple Mode deploys vCenter, configures vSAN ESA across all 4 hosts, and creates the VDS. After deployment, the installer OVA transforms into SDDC Manager.
The VCF Installer wizard generates a JSON configuration internally. The key structure contains:
{
"skipEsxThumbprintValidation": true,
"managementPoolName": "mgmt-pool",
"ceipEnabled": false,
"fipsModeEnabled": true,
"ntpServers": ["192.168.1.230"],
"dnsSpec": {
"nameserver": "192.168.1.230",
"domain": "lab.local"
},
"sddcManagerSpec": {
"hostname": "sddc-manager",
"ipAddress": "192.168.1.241"
},
"networkSpecs": [
{ "networkType": "MANAGEMENT", "subnet": "192.168.1.0/24", "gateway": "192.168.1.1" },
{ "networkType": "VMOTION", "subnet": "192.168.11.0/24" },
{ "networkType": "VSAN", "subnet": "192.168.12.0/24" }
],
"nsxtSpec": {
"nsxtManagerSize": "small",
"nsxtManagers": [
{ "hostname": "nsx-node1", "ip": "192.168.1.71" }
],
"vip": "192.168.1.70",
"vipFqdn": "nsx-vip.lab.local"
},
"vsanSpec": {
"vsanName": "vcenter-cl01-ds-vsan01",
"datastoreName": "vcenter-cl01-ds-vsan01",
"esaEnabled": true
},
"hostSpecs": [
{ "hostname": "esxi01.lab.local", "ipAddress": "192.168.1.74" },
{ "hostname": "esxi02.lab.local", "ipAddress": "192.168.1.75" },
{ "hostname": "esxi03.lab.local", "ipAddress": "192.168.1.76" },
{ "hostname": "esxi04.lab.local", "ipAddress": "192.168.1.82" }
]
}
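If you export or hand-edit a spec (file name below is an example), a quick syntax check before re-submitting avoids a failed validation round trip:

python -m json.tool vcf-bringup-spec.json > /dev/null && echo "JSON OK"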
In nested lab environments, SDDC Manager's automated deployment often times out. The workaround is to deploy components manually using ovftool directly on the VCF Installer/SDDC Manager CLI.
Key lesson: Always probe an OVA with `ovftool <ova>` first to discover the correct OVF property names. Property names vary between OVAs and are not always documented.

Key lesson: ovftool on VCF Installer/SDDC Manager requires SINGLE-LINE commands. Backslash line continuation breaks `--noSSLVerify` and other flags.
/usr/bin/ovftool --skipManifestCheck --powerOn --diskMode=thin --acceptAllEulas --allowExtraConfig --ipProtocol=IPv4 --ipAllocationPolicy=fixedPolicy --noSSLVerify --datastore=vcenter-cl01-ds-vsan01 --network=vcenter-cl01-vds01-pg-vm-mgmt --deploymentOption=xsmall --name=vcf-ops --prop:root_password='Success01!0909!!' --prop:ipv4_address.VMware_Aria_Operations=192.168.1.77 --prop:ipv4_type.VMware_Aria_Operations=Static --prop:domain.VMware_Aria_Operations=vcf-ops.lab.local --prop:ipv4_gateway.VMware_Aria_Operations=192.168.1.1 --prop:DNS.VMware_Aria_Operations=192.168.1.230 --prop:ipv4_netmask.VMware_Aria_Operations=255.255.255.0 --X:waitForIp --overwrite --X:logFile=/tmp/vcf-ops-manual.log --X:logLevel=verbose /nfs/vmware/vcf/nfs-mount/bundle/8a3336da-1b81-5144-b43e-d84eae7a8d8f/8a3336da-1b81-5144-b43e-d84eae7a8d8f/Operations-Appliance-9.0.2.0.25137838.ova "vi://administrator%40vsphere.local:Success01%210909%21%21@vcenter.lab.local/vcenter-dc01/host/vcenter-cl01"
Warning: SDDC Manager will delete manually deployed VMs it does not recognize if it is in an active deployment loop. Wait for any SDDC Manager deployment tasks to fail completely before deploying manually.
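One way to confirm nothing is still running before a manual deployment is to list SDDC Manager tasks over the API (a sketch using the token flow from Appendix I; substitute the real admin@local password):

# Get an API token, then scan the task list for anything still IN_PROGRESS
TOKEN=$(curl -sk -X POST https://localhost/v1/tokens -H "Content-Type: application/json" -d '{"username":"admin@local","password":"<password>"}' | python -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")
curl -sk https://localhost/v1/tasks -H "Authorization: Bearer $TOKEN" | python -m json.tool | grep -E '"name"|"status"'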
/usr/bin/ovftool --skipManifestCheck --powerOn --diskMode=thin --acceptAllEulas --allowExtraConfig --ipProtocol=IPv4 --noSSLVerify --datastore=vcenter-cl01-ds-vsan01 --network=vcenter-cl01-vds01-pg-vm-mgmt --deploymentOption=small --name=nsx-manager --prop:nsx_role='NSX Manager' --prop:nsx_passwd_0='Success01!0909!!' --prop:nsx_cli_passwd_0='Success01!0909!!' --prop:nsx_cli_audit_passwd_0='Success01!0909!!' --prop:nsx_hostname=nsx-node1.lab.local --prop:nsx_ip_0=192.168.1.71 --prop:nsx_netmask_0=255.255.255.0 --prop:nsx_gateway_0=192.168.1.1 --prop:nsx_dns1_0=192.168.1.230 --prop:nsx_domain_0=lab.local --prop:nsx_ntp_0=192.168.1.230 --prop:nsx_isSSHEnabled=True --prop:nsx_allowSSHRootLogin=True --X:waitForIp --X:logFile=/tmp/nsx-manager.log --X:logLevel=verbose /nfs/vmware/vcf/nfs-mount/bundle/028849ee-d3e7-5748-9b90-47d503c6dd3e/028849ee-d3e7-5748-9b90-47d503c6dd3e/nsx-unified-appliance-9.0.1.0.24952114.ova "vi://administrator%40vsphere.local:Success01%210909%21%21@vcenter.lab.local/vcenter-dc01/host/vcenter-cl01"
Post-deployment NSX configuration:
- Set the cluster virtual IP: `192.168.1.70`
- `set name-servers 192.168.1.230`
- `set ntp-servers 192.168.1.230`
- Add the compute manager: `vcenter.lab.local`

Deploy the Aria Lifecycle (VCF Operations Lifecycle) appliance with a single-line ovftool command:

/usr/bin/ovftool --skipManifestCheck --powerOn --diskMode=thin --acceptAllEulas --allowExtraConfig --ipProtocol=IPv4 --noSSLVerify --datastore=vcenter-cl01-ds-vsan01 --network=vcenter-cl01-vds01-pg-vm-mgmt --name=aria-lifecycle --prop:vami.hostname=automation.lab.local --prop:varoot-password='Success01!0909!!' --prop:admin-password='Success01!0909!!' --prop:va-ssh-enabled=True --prop:vami.ip0.VCF_OPS_Management_Appliance=192.168.1.90 --prop:vami.netmask0.VCF_OPS_Management_Appliance=255.255.255.0 --prop:vami.gateway.VCF_OPS_Management_Appliance=192.168.1.1 --prop:vami.DNS.VCF_OPS_Management_Appliance=192.168.1.230 --prop:vami.domain.VCF_OPS_Management_Appliance=lab.local --X:waitForIp --X:logFile=/tmp/aria-lifecycle.log --X:logLevel=verbose /nfs/vmware/vcf/nfs-mount/bundle/7301e3db-1ea7-5dd8-be67-c778becec936/7301e3db-1ea7-5dd8-be67-c778becec936/VCF-OPS-Lifecycle-Manager-Appliance-9.0.1.0.24960371.ova "vi://administrator%40vsphere.local:Success01%210909%21%21@vcenter.lab.local/vcenter-dc01/host/vcenter-cl01"
Important: The OVF property names for this appliance use `VCF_OPS_Management_Appliance` as the VM identifier (e.g., `vami.ip0.VCF_OPS_Management_Appliance`). These were discovered by probing the OVA with `ovftool <ova>`. The format is NOT `vami.ip0.VCF-OPS-Lifecycle-Manager` or any other variant.
Before deploying any OVA via ovftool, probe it to discover the correct property names:
/usr/bin/ovftool /path/to/component.ova
This outputs all available OVF properties including their correct keys, types, and default values. Use these exact property names in the --prop: arguments.
After the VCF Installer deploys vCenter and vSAN (Phase 1), or after manually deploying all components, the bringup process registers everything into a management domain.
- Open `https://vcf-installer.lab.local` (or the SDDC Manager IP)
- Log in as `admin@local`

The following validation errors were encountered and fixed during the lab bringup:
| Validation Error | Fix |
|---|---|
| NSX VIP not configured | NSX UI > System > Appliances > Set Virtual IP > 192.168.1.70 |
| Compute manager not found in NSX | NSX UI > System > Fabric > Compute Managers > Add vcenter.lab.local |
| DNS not configured in NSX | SSH admin@192.168.1.71 > set name-servers 192.168.1.230 |
| NTP not configured in NSX | SSH admin@192.168.1.71 > set ntp-servers 192.168.1.230 |
| DRS not fully automated | vCenter > vcenter-cl01 > Configure > DRS > Fully Automated |
| VM evacuation policy mismatch | vCenter > vcenter-cl01 > Configure > vSphere Lifecycle Manager > Enable "Migrate powered off and suspended VMs" |
| Aria Lifecycle IP in use (.94) | Deleted existing VM at .94, let installer redeploy fresh |
| NSX certificate (EC vs RSA) | Resolved after NSX health stabilized |
| NSX cluster not stable | Resolved after RAM increase to 32GB |
| NSX minimum version check | Resolved after NSX services came fully online (9.0.1 > 4.2.1 minimum) |
Key lesson: Many installer validation errors are cascading failures from an unhealthy NSX Manager. Fix NSX health first (ensure adequate RAM, wait for all services to start) and most other errors resolve automatically.
After passing all validations, bringup creates the management domain, named `mgmt`.

# SSH to SDDC Manager as vcf, then su - to root
ssh vcf@192.168.1.241
su -
# Check all SDDC Manager services
systemctl status vcf-services
# Check individual critical services
systemctl status domainmanager
systemctl status lcm
systemctl status operationsmanager
systemctl status nginx
systemctl status postgresql
# Verify management domain via API
curl -k -X POST https://localhost/v1/tokens -H "Content-Type: application/json" -d '{"username":"admin@local","password":"Success01!0909!!"}'
# Use the returned accessToken for subsequent API calls
curl -k -X GET https://localhost/v1/domains -H "Authorization: Bearer <token>"
# Should show domain "mgmt" with status "ACTIVE"
curl -k -X GET https://localhost/v1/hosts -H "Authorization: Bearer <token>"
# Should show all 4 ESXi hosts with status "ACTIVE"
In the lab, Fleet Management (Cloud Proxy) deployment failed during bringup with error "Upload binary content Operations-Cloud-Proxy-9.0.1.0.24960349.ova to VCF Operations fleet management failed."
Workaround -- Deploy via VCF Operations import:
- Log in to VCF Operations (`https://192.168.1.77`) > Fleet Management > Lifecycle
- Import the Cloud Proxy OVA and deploy it at `192.168.1.78`

The VCF Diagnostic Tool (VDT) is a read-only Python diagnostic tool that checks VCF environment health including certificates, services, inventory, disk, NFS, locks, credentials, NSX, and LCM configuration.
VDT is NOT pre-installed on SDDC Manager. It must be downloaded separately from Broadcom KB article 344917 and uploaded manually.
# On your workstation, download from Broadcom KB 344917
# File: vdt-2.2.7_02-05-2026.zip
# MD5: cc5780c93984fff13c91b8756d3b497d
# SHA256: 8801db4dfa3ed0ac19b8d33482d8dbff0634f0ac03f0d36926b438eab7cb43fc
# Upload to SDDC Manager (SCP works from external machine TO SDDC Manager)
scp vdt-2.2.7_02-05-2026.zip vcf@192.168.1.241:/home/vcf/
# SSH to SDDC Manager
ssh vcf@192.168.1.241
# Extract
unzip vdt-2.2.7_02-05-2026.zip
cd /home/vcf/vdt-2.2.7_02-05-2026
python vdt.py
VDT prompts for the SSO administrator password (administrator@vsphere.local). It then runs all health checks and produces both text and JSON reports.
Results location:
/var/log/vmware/vcf/vdt/vdt-<timestamp>.txt
/var/log/vmware/vcf/vdt/vdt-<timestamp>.json
| Category | What It Checks |
|---|---|
| SDDC Manager Info | Version, hostname, build |
| NTP Service & Server | NTP daemon running, server responding |
| /etc/hosts | Properly formatted |
| SDDC Manager Services | COMMON_SERVICES, LCM, DOMAIN_MANAGER, OPERATIONS_MANAGER, SDDC_MANAGER_UI |
| Disk Utilization | Filesystem space and inodes |
| Host/Domain/vCenter/NSX Status | All components ACTIVE in inventory |
| Certificate Trust/Expiry/SAN | Certs in trust stores, not expired, SAN contains hostname+IP |
| Deployment/Resource/Changelog Locks | No stuck locks |
| Credential Health | No invalid transactions, no stale credentials |
| NFS Mount Ownership | Correct owner (root:vcf) on /nfs/vmware/vcf/nfs-mount/ |
| Transport Node FQDNs | FQDN matches display name |
| LCM Manifest | Manifest file present in DB |
FAIL: NFS Mount Ownership
Symptom: /nfs/vmware/vcf/nfs-mount/ owned by nginx instead of root
# Fix
chown root:vcf /nfs/vmware/vcf/nfs-mount/
# Verify
ls -la /nfs/vmware/vcf/
# Should show: drwxrwxr-x root vcf nfs-mount/
Reference: https://knowledge.broadcom.com/external/article/392923
FAIL: NSX Certificate SAN Missing
Symptom: VDT reports "SAN contains neither hostname nor IP" for NSX VIP and NSX Manager. Default NSX self-signed cert has SAN=*.lab.local which VDT does not accept.
Fix: Generate a new self-signed certificate with explicit SAN entries and apply via NSX API. Full procedure:
# Step 1: Create OpenSSL config on NSX Manager (SSH as root)
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
# Step 2: Generate self-signed certificate
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout /tmp/nsx.key -out /tmp/nsx.crt -config /tmp/nsx-cert.conf -sha256
# Step 3: Create JSON payload (Python avoids PEM escaping issues)
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
# Step 4: Import certificate (single-line curl -- NSX shell has no backslash continuation)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Returns certificate ID, e.g.: 701d1416-5054-4038-8749-4ac495980ebd
# Step 5: Get node UUID
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
# Returns node UUID, e.g.: 95493642-ef4a-cb8e-ed7c-5bc20033f2c2
# Step 6: Apply to NSX Manager node
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=API&node_id=<node-uuid>"
# Step 7: Apply to cluster VIP
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=MGMT_CLUSTER"
Important: `DNS.3 = nsx-manager.lab.local` is required in the SAN because SDDC Manager registers NSX using this FQDN. Without it, VDT fails the SAN check.
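After Step 7, a quick way to confirm the new certificate (and its SAN entries) is actually being served by the node and the VIP:

# Check the SAN on the certificate now presented by the node and the VIP
openssl s_client -connect 192.168.1.71:443 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
openssl s_client -connect 192.168.1.70:443 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"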
FAIL: NSX Certificate Trust (after replacing cert)
After replacing the NSX self-signed certificate, import it into SDDC Manager's trust stores:
# On SDDC Manager as root:
# Pull the active NSX certificate
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Import into VCF trust store
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt -keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store -storepass "$KEY" -noprompt
# Import into Java cacerts
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit -noprompt
# Restart SDDC Manager services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
Services take approximately 5 minutes to restart. After restart, re-run VDT to confirm all NSX cert trust checks pass.
Reference: https://knowledge.broadcom.com/external/article/316056
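A simple way to watch for the services coming back before re-running VDT (service names from the SDDC Manager component table in Part I):

# Repeat until every service reports "active"
for s in commonsvcs domainmanager lcm operationsmanager nginx postgresql; do echo -n "$s: "; systemctl is-active $s; done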
WARN: vCenter Certificate SAN
Symptom: VDT reports "SAN contains hostname but not IP" for vCenter. This is cosmetic and acceptable for lab environments -- vCenter's default certificate includes the FQDN but not the IP address in the SAN.
| Check | Result |
|---|---|
| SDDC Manager Info | PASS -- Version 9.0.1.0.24962180 |
| NTP Service & Server | PASS -- 192.168.1.230 responding |
| /etc/hosts | PASS |
| SDDC Manager Services | PASS -- All 5 services ACTIVE |
| Commonservices API | PASS -- HTTP 200 |
| Disk Utilization (space + inodes) | PASS |
| Host/Domain/vCenter/PSC/Cluster/NSX Status | PASS -- All ACTIVE |
| SDDC Cert (Trust/Expiry/SAN) | PASS -- 717 days remaining |
| vCenter Cert Trust/Expiry | PASS |
| vCenter Cert SAN | WARN (hostname but not IP -- cosmetic) |
| NSX VIP Cert (Trust/Expiry/SAN) | PASS -- 825 days remaining |
| NSX Manager Cert (Trust/Expiry/SAN) | PASS |
| Deployment/Resource/Changelog Locks | PASS -- No locks |
| Service Account Auth | PASS |
| Credential Transactions | PASS |
| NFS Mount Ownership | PASS (after fix) |
| NFS Subdirectories | PASS |
| Transport Node FQDNs | PASS |
| LCM Manifest | PASS |
VCF Operations (formerly VMware Aria Operations) is the mandatory central management console for the entire VCF 9.0 platform. The SDDC Manager UI is deprecated and will be removed in a future release. VCF Operations is now the primary interface for fleet management, lifecycle management, licensing, monitoring, certificate management, password management, and all Day 2 operations.
| Component | Address |
|---|---|
| VCF Operations | 192.168.1.77 (vcf-ops.lab.local) |
| SDDC Manager | 192.168.1.241 (sddc-manager.lab.local) |
| vCenter Server | 192.168.1.69 (vcenter.lab.local) |
| Offline Depot Server | 192.168.1.52:8443 |
| ESXi Hosts | esxi01 (.74), esxi02 (.75), esxi03 (.76), esxi04 (.82) |
| NSX Manager | 192.168.1.71 (nsx-node1.lab.local) |
| NSX VIP | 192.168.1.70 (nsx-vip.lab.local) |
| Fleet Management (Cloud Proxy) | 192.168.1.78 (fleet.lab.local) |
| DNS Server | 192.168.1.230 (Windows AD DC for lab.local) |
| Mode | Air-gapped / Disconnected |
- URL: `https://192.168.1.77`
- Username: `admin`

The left navigation pane displays the main sections:
| Section | Purpose |
|---|---|
| Fleet Management | Lifecycle management, depot configuration, component health |
| Infrastructure Operations | Monitoring, dashboards, alerts, diagnostics |
| Security & Compliance | Compliance benchmarks, drift detection |
| License Management | Registration and license file management |
| Administration | Integrations, accounts, access control, system settings |
Note: If licensing has not been completed, some menu items may be grayed out. VCF Operations runs in evaluation mode for up to 90 days after deployment.
If VCF Operations was deployed manually via OVA rather than through the VCF Installer, the initial setup wizard appears automatically on first access:
- Set the password for the `admin` user (minimum 8 characters: upper, lower, number, special character)
Tip: In a disconnected environment, CEIP data cannot be sent anyway, but disabling prevents unnecessary connection attempts that clutter logs.
VCF 9.0 uses a unified subscription-based license file model. The old 25-character license keys are replaced by license files. There are only two license types: VMware Cloud Foundation (cores) and VMware vSAN (TiBs). All other components (NSX, vCenter, VCF Automation, etc.) are automatically licensed when a primary license is assigned.
Navigation: VCF Operations > License Management > Registration
- Download the `.jws` (JSON Web Signed) file to a local machine or USB drive

This step is performed on a machine with internet access:

- Copy the `.jws` file to a computer with internet access via USB drive or secure transfer
- Upload it at `https://vcf.broadcom.com`

Navigation: VCF Operations > License Management > Registration

- Return to VCF Operations at `https://192.168.1.77` and complete the registration

Since the environment is air-gapped, you must manually report usage at least every 180 days:

- Submit the usage data at `https://vcf.broadcom.com`

WARNING: If license usage data is not submitted within 180 days, licenses are treated as expired. Hosts are disconnected from vCenter and workload operations are blocked. In a lab environment, set a calendar reminder.
The Fleet Management appliance handles lifecycle management functions formerly in SDDC Manager. If deployed via the VCF Installer, this may already be connected. If not:
Navigation: https://192.168.1.77/admin/ (the Admin UI, not the main UI)
- Open `https://192.168.1.77/admin/`
- Log in as `admin` with your VCF Operations admin password

Lab Context: In the lab, Fleet Management was deployed at 192.168.1.78 via the VCF Operations Lifecycle import (not during bringup, which failed). The Cloud Proxy was deployed automatically during this process.
In VCF 9.0, depot functionality has moved from SDDC Manager to VCF Operations. You must configure the depot before you can download binaries for additional components. Only one depot connection (online OR offline) can be ACTIVE at a time.
Navigation: VCF Operations > Fleet Management > Lifecycle > VCF Management > Depot Configuration
- Depot URL: `https://192.168.1.52:8443`
- Username: `admin`
- Password: `admin`

Navigation: VCF Operations > Fleet Management > Lifecycle > VCF Instances > (select your instance) > Depot Settings

- Depot: `192.168.1.52:8443`

Note: Before configuring the SDDC Manager depot, you may need to trust the SSL certificate of your offline depot server. This was already done during the initial bringup (certificate imported into SDDC Manager's Java trust store).
After depot configuration, binaries become available for download and deployment:
Tip: Binary downloads from depot may intermittently fail. If a download disappears, retry it.
This is the critical step that connects VCF Operations to your SDDC Manager, enabling automatic monitoring of all VCF domains including vCenter, NSX, and vSAN.
Navigation: VCF Operations > Administration > Integrations > Accounts tab > Add
- Name: `Lab VCF Instance` (or any descriptive name)
- Description: `Management Domain - Lab Environment`
- FQDN: `sddc-manager.lab.local` (use FQDN rather than IP for VCF SSO to work properly)
- Credential: `SDDC Manager Admin`, username `administrator@vsphere.local`

After configuration, VCF Operations automatically:
Note: Initial collection takes multiple cycles (standard cycle = 5 minutes). Allow 15-30 minutes for full data population.
When you add a VCF account, vCenter accounts are normally auto-discovered. If you need to add one manually:
Navigation: VCF Operations > Administration > Integrations > Accounts tab > Add
- Name: `vcenter.lab.local - 192.168.1.69`
- Description: `Management Domain vCenter`
- vCenter Server: `vcenter.lab.local` or `192.168.1.69`
- Credentials: `administrator@vsphere.local` and password

Important: vCenter accounts do NOT start monitoring automatically. You must manually initiate data collection.
Navigation: VCF Operations > Administration > Integrations > Accounts
Key Timing Notes:
| Metric | Interval |
|---|---|
| Standard collection cycle | Every 5 minutes |
| Initial collection (full population) | 15-30 minutes |
| Property-based diagnostic scans | Every 4 hours |
| Telegraf agent data collection | Every 4 minutes |
| Cloud proxy registration (first boot) | Up to 20 minutes |
VCF 9.0 introduces the VCF Identity Broker (VIDB), which provides federated SSO across all VCF components.
Navigation: VCF Operations > Fleet Management > Identity & Access > VCF Management > Operations Appliance
Navigation: VCF Operations > Administration > Control Panel > Authentication Sources
Navigation: VCF Operations > Administration > Control Panel > Access Control
Import the required AD groups (e.g., vcf-admins, vcf-readonly, Domain Admins).
To add AD authentication to vCenter separately, log in to vCenter at https://192.168.1.69 and add an identity source with domain name lab.local, base DN for users DC=lab,DC=local, base DN for groups DC=lab,DC=local, and primary server URL ldap://192.168.1.230:389.
Lab Context: The lab has AD/LDAP configured via the embedded identity broker with lab.local domain at 192.168.1.230. Attribute mappings: userName=sAMAccountName, firstName=givenName, lastName=sn, email=mail. Domain Admins group synced with nested groups enabled.
VCF 9.0 introduces unified, non-disruptive TLS certificate management across all VCF components.
Navigation: VCF Operations > Fleet Management > Certificates
Navigation: VCF Operations > Fleet Management > Certificates > Configure CA
The CA server URL must start with https:// and end with certsrv (e.g., https://ca.lab.local/certsrv), and a service account with certificate enrollment rights is required (e.g., svc-vcf-ca).
Important: VCF management components only support Microsoft CA. VCF Instance components support both Microsoft CA and OpenSSL. You configure the CA separately for management components and instance components.
After configuring a CA, replace default self-signed certificates with enterprise CA-signed certificates. Certificates eligible for non-disruptive auto-renewal include: ESX SSL, vCenter machine SSL, NSX LM/VIP, SDDC Manager SSL, and VCF Operations certificates.
On the Certificates page, enable auto-renewal for supported certificates. This prevents unexpected certificate expiration.
Lab Note: In a lab with no Microsoft CA, you can continue using self-signed certificates. The certificate management UI will show certificate expiration warnings, which is normal.
VCF 9.0 provides unified password management centralized in VCF Operations, replacing the password management previously found in SDDC Manager.
Navigation: VCF Operations > Fleet Management > Passwords
VCF Management Components:
| Component |
|---|
| Fleet Management |
| VCF Automation |
| VCF Identity Broker |
| VCF Operations |
| VCF Operations for Logs |
| VCF Operations for Networks |
VCF Instance/Domain Components:
| Component |
|---|
| ESX hosts (esxi01-04) |
| NSX Manager |
| vCenter Server |
| SDDC Manager |
| Function | When to Use | What It Does |
|---|---|---|
| Update | You changed a password outside VCF | Updates VCF database to match the new password on the component |
| Rotate | Scheduled password change | Changes password on BOTH the component AND the VCF database |
| Remediate | A rotation failed mid-way | Re-syncs by accepting the current password on the component |
Password rotation generates a randomized password:
Note: Auto-rotate is automatically enabled for vCenter Server. It may take up to 24 hours to configure the auto-rotate policy for a newly deployed vCenter.
If a password gets out of sync between SDDC Manager and the actual component:
Prerequisites:
Steps:
Tip: Password rotation options from VCF 5.x are not fully available in VCF Operations yet. Use the SDDC Manager API as a workaround for some rotation tasks if needed.
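A minimal sketch of that workaround, assuming the SDDC Manager credentials API (POST /v1/tokens for a bearer token, then GET/PATCH /v1/credentials) still behaves as it did in earlier VCF releases; the hostname, usernames, and resource names are the lab values, and the request shape should be checked against the current API reference before use:
# Obtain a bearer token from SDDC Manager (replace the SSO password)
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
  -H "Content-Type: application/json" \
  -d '{"username":"administrator@vsphere.local","password":"<SSO-password>"}' \
  | python -c "import json,sys; print(json.load(sys.stdin)['accessToken'])")
# List the credentials SDDC Manager currently manages
curl -sk -H "Authorization: Bearer $TOKEN" https://sddc-manager.lab.local/v1/credentials
# Request rotation of the ESXi root credential for a single host
curl -sk -X PATCH https://sddc-manager.lab.local/v1/credentials \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"operationType":"ROTATE","elements":[{"resourceName":"esxi01.lab.local","resourceType":"ESXI","credentials":[{"credentialType":"SSH","username":"root"}]}]}'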
WARNING — Credential Rotation Cascade Failure: If a credential update or rotation fails mid-operation (commonly because NSX was temporarily unreachable during a boot storm or maintenance), the component resource can get stuck in the ACTIVATING state with stale exclusive locks blocking all future password operations. Error messages: "Resources [host] are not available/ready" or "Unable to acquire resource level lock(s)". This requires a database-level fix on SDDC Manager — see Section 7.2.6 for the complete repair procedure.
Navigation: VCF Operations > Security & Compliance > Compliance
Built-in standards (no additional download):
| Standard | Notes |
|---|---|
| DISA Security Standards | Defense Information Systems Agency STIGs |
| FISMA Security Standards | Federal Information Security Management Act |
| HIPAA | Health Insurance Portability and Accountability Act |
Standards requiring marketplace download (.PAK file):
| Standard | Notes |
|---|---|
| PCI DSS Compliance Standards | Payment Card Industry Data Security Standard |
| CIS Security Standards | Center for Internet Security Benchmarks |
| NIST SP 800-171 | Controlled Unclassified Information |
| NIST SP 800-53 R5 | Security and Privacy Controls |
For air-gapped environments, install marketplace packs manually:
Navigation: VCF Operations > Administration > Repository
Upload and install the downloaded .PAK file.
Navigation: VCF Operations > Fleet Management > Configuration Drifts > Schedule Drift Detection
Navigation: VCF Operations > Infrastructure Operations > Configurations > Outbound Settings
Example outbound instance: name Lab Email Notifications, sender address vcf-ops@lab.local, sender display name VCF Operations.
| Plug-In | Use Case |
|---|---|
| Standard Email Plugin | SMTP email notifications |
| SNMP Trap Plugin | SNMP v1/v2c/v3 traps to network management systems |
| Webhook Notification Plugin | REST webhooks (supports Basic Auth, Bearer Token, OAuth, X.509, API Key) |
| Log File | Write alerts to log files |
| ServiceNow | ITSM integration |
| Slack | Chat-based alerting |
| Network Share | Write to network file shares |
Navigation: VCF Operations > Infrastructure Operations > Configurations > Notifications
Step 1 - Basic Details:
Provide a notification name (e.g., Critical Host Alerts).
Step 2 - Define Filtering Criteria:
Step 3 - Select Outbound Method:
Step 4 - Payload Template:
Step 5 - Test:
Step 6 - Create:
Navigation: VCF Operations > Infrastructure Operations > Dashboards & Reports
| Dashboard Category | What It Shows |
|---|---|
| Overview | Geo-map view of VCF instances, inventory sections, diagnostic findings, security risk highlights |
| Cluster Configuration | vSphere cluster configuration requiring attention |
| ESXi Configuration | ESXi host configurations needing review |
| Network Configuration | vSphere distributed switch configurations |
| VM Configuration | Virtual machine configurations |
| vSAN Configuration | vSAN configuration details |
| vSAN OSA Performance | Read/write latency, contention, utilization |
| vSAN ESA Performance | ESA-specific metrics |
| Security Operations | User auth, encryption status, CVE advisories, certificate health |
| Skyline Operational | Proactive monitoring and recommendation dashboard |
| Energy Efficiency | Virtualization efficiency, idle VM impact |
A / in the dashboard name creates a folder hierarchy (e.g., Lab/Overview).
Navigation: VCF Operations > Fleet Management > Lifecycle > Settings > SFTP Settings
Navigation: VCF Operations > Inventory > VCF Instance > Actions > Manage VCF Instance Settings
VCF Operations for Logs is not deployed automatically during initial bringup. It must be deployed as a Day 2 operation. Status: Deployed.
| Setting | Value |
|---|---|
| FQDN | logs.lab.local |
| IP Address | 192.168.1.242 |
| VM Name | logs |
| Node Size | Small |
| Deployment Method | Fleet Management with custom cert |
Known Issue — Self-Signed Certificate SAN Mismatch: The Fleet Management deployment wizard's "Generate self-signed certificate" option may produce a certificate whose SAN entries do not match the node FQDN/IP, causing a precheck error: "Certificate validation for component vrli:vrli-master — The hosts in the certificate doesn't match with the provided/product hosts." The workaround is to generate a custom certificate with OpenSSL and import it. See Section 3.11.1a.
Navigation: VCF Operations > Fleet Management > Lifecycle > VCF Management > Components
Prerequisites: Depot must be configured (see Section 3.3) and the operations-logs binary must be downloaded via Binary Management > INSTALL BINARIES tab. The OVA and PAK files must be in the offline depot under PROD\COMP\VRLI\.
In the wizard, select the operations-logs binary, the target vCenter (vcenter.lab.local), cluster (vcenter-cl01), VM network, and datastore (vcenter-cl01-ds-vsan01), an admin password (allowed special characters: !@#$%^&*), VM name logs, FQDN logs.lab.local, and IP address 192.168.1.242.
If the wizard's self-signed certificate fails precheck validation, generate a proper certificate with OpenSSL on SDDC Manager (SSH as vcf, then su - to root):
Step 1 — Verify DNS resolution:
nslookup logs.lab.local 192.168.1.230
nslookup 192.168.1.242 192.168.1.230
ping -c 2 logs.lab.local
Step 2 — Create OpenSSL config and generate certificate:
cat > /tmp/vrli-cert.cnf << 'EOF'
[req]
default_bits = 4096
prompt = no
default_md = sha256
distinguished_name = dn
req_extensions = v3_req
x509_extensions = v3_req
[dn]
C = US
ST = California
L = Lab
O = Lab
OU = VCF
CN = logs.lab.local
[v3_req]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = logs.lab.local
DNS.2 = logs
IP.1 = 192.168.1.242
EOF
openssl req -x509 -nodes -days 730 -newkey rsa:4096 \
-keyout /tmp/vrli.key -out /tmp/vrli.crt \
-config /tmp/vrli-cert.cnf
Step 3 — Verify SANs are correct:
openssl x509 -in /tmp/vrli.crt -noout -text | grep -A5 "Subject Alternative Name"
# Expected: DNS:logs.lab.local, DNS:logs, IP Address:192.168.1.242
Step 4 — Transfer cert to workstation:
Display the certificate and key, then copy-paste into local files (vrli.crt and vrli.key):
cat /tmp/vrli.crt
cat /tmp/vrli.key
Step 5 — Import in Fleet Management wizard:
In the wizard, import vrli.crt (certificate) and vrli.key (private key) — both must be PEM format.
Step 6 — Verify deployment:
# Check appliance is reachable
curl -sk https://logs.lab.local:9543/api/v2/deployment/new -o /dev/null -w "%{http_code}"
# Check certificate on deployed appliance
openssl s_client -connect logs.lab.local:443 -servername logs.lab.local </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
Navigation: VCF Operations > Administration > Control Panel > Log Management
As of VCF 9.0, there is no automated way to configure the logs agent on SDDC Manager:
Obtain the deploy_vcf_ops_logs_agent.sh script and transfer it to SDDC Manager (ssh vcf@192.168.1.241 "cat > /home/vcf/deploy_vcf_ops_logs_agent.sh" < deploy_vcf_ops_logs_agent.sh), then run it there.
Note: The log collection configuration for vCenter adapter instances is NOT included in configuration export/import operations. SCP does not work with SDDC Manager's restricted shell -- use the ssh cat > method for file transfers.
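The same transfer pattern works in both directions; a short sketch (file names reuse the script above, and the chmod step is an assumption about how the script is invoked):
# Copy a file TO SDDC Manager (scp is blocked by the restricted shell)
ssh vcf@192.168.1.241 "cat > /home/vcf/deploy_vcf_ops_logs_agent.sh" < deploy_vcf_ops_logs_agent.sh
# Copy a file FROM SDDC Manager back to the workstation
ssh vcf@192.168.1.241 "cat /home/vcf/deploy_vcf_ops_logs_agent.sh" > deploy_vcf_ops_logs_agent.sh.bak
# Make the transferred script executable before running it
ssh vcf@192.168.1.241 "chmod +x /home/vcf/deploy_vcf_ops_logs_agent.sh"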
The following tasks have moved from SDDC Manager to VCF Operations in VCF 9.0:
| Task | VCF 9.0 Location in VCF Operations |
|---|---|
| DNS/NTP Configuration | Inventory > VCF Instance > Actions > Manage VCF Instance Settings > Network Settings |
| Workload Domain Creation | Inventory > VCF Instance > Add Workload Domain |
| Backup Configuration | Fleet Management > Lifecycle > Settings |
| Certificate Authority | Fleet Management > Certificates > Configure CA |
| Certificate Management | Fleet Management > Certificates |
| Password Management | Fleet Management > Passwords |
| Network Pools | vCenter: Global Inventory > Hosts > Network Pools |
| Host Commissioning | vCenter: Global Inventory > Unassigned Hosts |
| Cluster Creation | vCenter: New SDDC Cluster |
| Licensing | License Management (single file model) |
Critical Note: While the SDDC Manager UI is still present in VCF 9.0, performing tasks there does not immediately sync to VCF Operations. Changes depend on scheduled synchronization intervals. Use VCF Operations as the primary interface for all Day 2 operations.
| # | Issue | Impact |
|---|---|---|
| 1 | Relationships not updated after 2nd collection cycle in management packs built with the Management Pack Builder | Custom management packs may show stale data |
| 2 | Custom network adapters do not start after VCF Operations and VCF Operations for Networks are updated to VCF 9.0 | Workaround required |
| 3 | VCF Operations for Networks stops collecting metrics when NSX is upgraded from 4.2.1 to 9.0 | Re-configure after upgrade |
| 4 | Manually stopped adapter instances start collecting after a management pack upgrade | Monitor adapter states after upgrades |
| 5 | Binary downloads from depot may intermittently fail | Retry the download |
| 6 | Fleet Management appliance root password must be 15+ characters | Precheck will fail otherwise |
| 7 | Only one VCF Operations for Networks instance supported | Cannot add multiple |
| 8 | Log collection configuration for vCenter adapters not included in config export/import | Manually reconfigure after import |
| 9 | License expires if usage file not submitted within 180 days (disconnected mode) | Hosts disconnect, workloads blocked |
| 10 | Do not configure NTP during OVF deployment (KB 374792) | Configure it in the setup wizard instead |
| 11 | Password rotation options from VCF 5.x not fully available | Use SDDC Manager API as workaround |
| 12 | After workload domain redeployment, vCenter/vSAN adapter may enter Warning | Reconfigure adapter |
| 13 | Infrastructure Health Adapter "no data receiving" — stale SDDC Manager credential | Fix: Integrations → SDDC Mgr → ROTATE or set manually → VALIDATE → SAVE → reboot appliance |
| 14 | Adapter log paths changed in 9.x — /storage/log/vcops/log/adapters/<Name>/ | Legacy /var/log/vmware/vcops/adapters/ does not exist |
| 15 | NSX adapter warnings when NSX is powered off | Expected — clears when NSX is back online |
| 16 | NSX adapter PKIX cert trust failure — self-signed cert not trusted | Import NSX cert into /usr/java/jre-vmware-17/lib/security/cacerts (password changeit), reboot |
| 17 | NSX System Managed Credential ROTATE fails | Uncheck System Managed, set manually (admin/password), VALIDATE, SAVE |
| 18 | Two separate NSX adapters exist — VCF uses VIP, NSX "Aria Admin" uses node FQDN | Both need credentials configured separately |
| 19 | Credential Update/Rotate/Remediate cascade failure — stuck tasks and locks | Full PostgreSQL repair required — see Section 7.2.6 |
[ ] License Management -- license valid, not evaluation mode
[ ] Administration > Integrations > Accounts -- all adapters green "Collecting"
[ ] Fleet Management dashboard -- all components healthy, Connected
[ ] Depot configuration -- connected to offline depot, binaries available
[ ] Infrastructure Operations > VCF Instances -- shows VCF instance with all domains
[ ] All ESXi hosts (esxi01-04) visible in inventory
[ ] VCF Health -- certificates, NTP, DNS checks passing
[ ] Security & Compliance -- SDDC benchmarks activated
[ ] Fleet Management > Passwords -- all accounts valid
[ ] Fleet Management > Certificates -- all certificates visible with expiration dates
NSX 9.0 provides software-defined networking and security for VCF. In VCF 9.0, NSX is only available as part of the VCF stack -- there is no standalone NSX deployment option.
+-----------------------------------------------------------+
| NSX MANAGER CLUSTER |
| (3-node for HA, 1-node for lab) |
+-----------------------------------------------------------+
| TIER-0 GATEWAY |
| (Provider Router - North-South) |
| BGP/OSPF to Physical |
+-----------------------------------------------------------+
| TIER-1 GATEWAY |
| (Tenant Router - Internal) |
| NAT, Load Balancing |
+-----------------------------------------------------------+
| SEGMENTS |
| (Layer 2 - Overlay or VLAN) |
+-----------------------------------------------------------+
| RAM Allocation | Result in Nested Lab |
|---|---|
| 16GB | Kernel OOM, constant crashes, console shows sysrq: Show Memory |
| 24GB | Runs initially, but MANAGER/SEARCH services crash under load (e.g., transport node configuration) |
| 32GB (minimum) | Stable operation with 4-host cluster |
| Resource | Minimum for Nested | Production |
|---|---|---|
| RAM | 32GB | 48GB+ |
| vCPU | 6 | 8+ |
| Deployment Size | small | medium/large |
Critical Lesson: NSX Manager small deployment needs 32GB RAM and 6 vCPU minimum in nested environments. 16GB causes kernel OOM. 24GB runs but crashes under load. Many VCF Installer validation errors are cascading failures from an unhealthy NSX -- fix NSX health first.
In nested lab environments, SDDC Manager's automated deployment often times out. Deploy NSX Manager manually using ovftool from the VCF Installer CLI:
/usr/bin/ovftool --skipManifestCheck --powerOn --diskMode=thin --acceptAllEulas --allowExtraConfig --ipProtocol=IPv4 --noSSLVerify --datastore=vcenter-cl01-ds-vsan01 --network=vcenter-cl01-vds01-pg-vm-mgmt --deploymentOption=small --name=nsx-manager --prop:nsx_role='NSX Manager' --prop:nsx_passwd_0='Success01!0909!!' --prop:nsx_cli_passwd_0='Success01!0909!!' --prop:nsx_cli_audit_passwd_0='Success01!0909!!' --prop:nsx_hostname=nsx-node1.lab.local --prop:nsx_ip_0=192.168.1.71 --prop:nsx_netmask_0=255.255.255.0 --prop:nsx_gateway_0=192.168.1.1 --prop:nsx_dns1_0=192.168.1.230 --prop:nsx_domain_0=lab.local --prop:nsx_ntp_0=192.168.1.230 --prop:nsx_isSSHEnabled=True --prop:nsx_allowSSHRootLogin=True --X:waitForIp --X:logFile=/tmp/nsx-manager.log --X:logLevel=verbose /nfs/vmware/vcf/nfs-mount/bundle/028849ee-d3e7-5748-9b90-47d503c6dd3e/028849ee-d3e7-5748-9b90-47d503c6dd3e/nsx-unified-appliance-9.0.1.0.24952114.ova "vi://administrator%40vsphere.local:Success01%210909%21%21@vcenter.lab.local/vcenter-dc01/host/vcenter-cl01"
Important: Use single-line commands. Backslash continuation breaks --noSSLVerify and other flags with ovftool 5.0.
After NSX Manager boots (~15 minutes for all services to stabilize in nested environments):
Log in to the NSX Manager UI at https://192.168.1.71 as admin, then set the cluster virtual IP to 192.168.1.70.
DNS and NTP on NSX are configured via the admin CLI, NOT the UI:
# SSH to NSX Manager
ssh admin@192.168.1.71
# Configure DNS
set name-servers 192.168.1.230
# Configure NTP
set ntp-servers 192.168.1.230
# Verify DNS
get name-servers
# Verify NTP
get ntp-servers
Warning: Do NOT attempt to configure DNS/NTP via the NSX Manager web UI. Use the admin CLI commands above.
NSX must be connected to vCenter as a compute manager:
Add the compute manager with name vcenter.lab.local, FQDN vcenter.lab.local, and username administrator@vsphere.local.
After NSX Manager is deployed, it must be registered in SDDC Manager during the VCF Installer bringup process. The bringup wizard validates:
If any of these fail, the bringup will not proceed. Fix NSX health first -- many validation errors are cascading failures from an unhealthy NSX.
# SSH to NSX Manager as admin
ssh admin@192.168.1.71
# Check cluster status
get cluster status
# Check all service statuses
get cluster status verbose
# Verify via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
Key services that must be UP: MANAGER, SEARCH, UI, CONTROLLER, NODE_MGMT.
Tip: In nested environments, NSX services can take 10-15 minutes to stabilize after restart. If the API returns error 101 "Some appliance components are not functioning properly", wait and retry.
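A small sketch that automates this wait-and-retry, looping until the cluster status API stops returning the error 101 message quoted above (IP and credentials are the lab values):
# Poll the NSX cluster status API until the error 101 message disappears
while curl -sk -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status | grep -q "not functioning properly"; do
  echo "NSX services still stabilizing (error 101), retrying in 60s..."
  sleep 60
done
echo "Cluster status API is responding; confirm with: get cluster status verbose"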
NSX 9.0 creates default transport zones during deployment:
| Transport Zone | Type | Purpose |
|---|---|---|
| nsx-overlay-transportzone | Overlay | For GENEVE-encapsulated VM-to-VM traffic |
| nsx-vlan-transportzone-mgmt | VLAN | For direct VLAN connectivity to physical network |
Navigation: NSX Manager > System > Fabric > Profiles > Transport Node Profiles
Create the profile with name tn-profile-mgmt, VDS vcenter-cl01-vds01, transport zone nsx-overlay-transportzone, and uplink profile nsx-default-uplink-hostswitch-profile.
NSX 9.0 introduces the "Use VMkernel Adapter" option for TEP (Tunnel Endpoint) IP assignment. This allows vmk0 (the management VMkernel) to be reused as the TEP interface, eliminating the need for a dedicated TEP VLAN and IP pool. This is ideal for nested environments and simplified lab deployments.
How it works:
IPv4 Assignment options in Transport Node Profile:
| Option | Description | When to Use |
|---|---|---|
| Use IP Pool | Allocate TEP IPs from a pre-configured IP pool | Production with dedicated TEP VLAN |
| Use DHCP | Obtain TEP IPs via DHCP | Environments with DHCP on TEP VLAN |
| Use VMkernel Adapter | Reuse vmk0 management IP as TEP | Nested labs, simplified deployments |
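To confirm that vmk0 really is doubling as the TEP after the profile is applied, a quick check from any host (a sketch; compare the address shown here with the TEP IP reported in NSX Manager under System > Fabric > Nodes > Host Transport Nodes):
# vmk0's IPv4 address; with "Use VMkernel Adapter" this should match the host's TEP IP
esxcli network ip interface ipv4 get -i vmk0
# GENEVE needs MTU 1600+; -d sets don't-fragment, -s 1572 leaves room for overlay headers
vmkping -d -s 1572 192.168.1.75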
Navigation: NSX Manager > System > Fabric > Nodes > Host Transport Nodes
Select the cluster (vcenter-cl01), choose tn-profile-mgmt from the dropdown, and apply.
Expected result after successful application:
| Host | Status | TEP IP (vmk0) |
|---|---|---|
| esxi01.lab.local | Success / Up | 192.168.1.74 |
| esxi02.lab.local | Success / Up | 192.168.1.75 |
| esxi03.lab.local | Success / Up | 192.168.1.76 |
| esxi04.lab.local | Success / Up | 192.168.1.82 |
If transport node configuration fails in nested environments:
# SSH to ESXi host
ssh root@192.168.1.74
# Restart management network
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
Lab Lesson: The initial transport node application failed because NSX at 24GB RAM / 4 vCPU could not handle the deployment load. After increasing to 32GB / 6 vCPU and powering off SDDC Manager to free resources, re-applying the profile succeeded on all 4 hosts.
On each ESXi host, verify NSX transport node status:
# SSH to ESXi host
ssh root@192.168.1.74
# Check NSX proxy status
/etc/init.d/nsx-proxy status
# Check NSX datapath (DFW)
/etc/init.d/nsx-datapath status
# Check NSX operations agent
/etc/init.d/nsx-opsagent status
# List VMkernel interfaces (confirm vmk50 hyperbus exists)
esxcli network ip interface list
# Check TEP connectivity to another host
vmkping 192.168.1.75
# View NSX logs
tail -50 /var/log/nsx-syslog.log
# Check NSX agent communication (port 1234)
esxcli network ip connection list | grep 1234
VMkernel Network Layout (after transport node config):
| VMkernel | Subnet | TCP/IP Stack | Purpose |
|---|---|---|---|
| vmk0 | 192.168.1.0/24 | defaultTcpipStack | Management + NSX TEP (overlay) |
| vmk1 | 192.168.11.0/24 | vmotion | vMotion |
| vmk2 | 192.168.12.0/24 | defaultTcpipStack | vSAN |
| vmk50 | 169.254.0.0/16 | hyperbus | NSX Hyperbus (internal, auto-created) |
| Segment Type | Requires | Use Case |
|---|---|---|
| Overlay Segment | Overlay Transport Zone, Tier-1 Gateway, Subnet/Gateway | VM-to-VM east-west traffic across hosts |
| VLAN-Backed Segment | VLAN Transport Zone, VLAN ID | Direct VLAN connectivity to physical network |
Navigation: NSX Manager > Networking > Segments
Overlay segment example: name web-segment, transport zone nsx-overlay-transportzone, connected to a Tier-1 gateway with subnet 10.10.10.1/24. VLAN-backed segment example: name VLAN-100-Production, VLAN ID 100.
Note: VLAN-backed segments do NOT require a Tier-1 gateway connection, subnet gateway IP, or DHCP configuration.
Tier-0 Gateway (Provider Router):
Tier-1 Gateway (Tenant Router):
The DFW enforces micro-segmentation at the VM vNIC level for east-west traffic. Rules are processed in this order:
| Priority | Category | Purpose |
|---|---|---|
| 1 | Emergency | Critical security policies |
| 2 | Infrastructure | Protect infrastructure components |
| 3 | Environment | Zone-based policies |
| 4 | Application | App-specific micro-segmentation |
| 5 | Default | Catch-all rules |
Within each category: Rules process TOP to BOTTOM. First match wins.
Navigation: NSX Manager > Security > Distributed Firewall
Instead of using IP-based rules (which break when VMs move), use NSX tags:
For example, create a group Web-Servers whose membership criteria matches the tag web-tier.
Apply tags to VMs: assign the web-tier tag to each web VM so it is picked up by the group.
When you configure a VCF Cloud Account in VCF Operations (see Section 3.4), NSX adapters are automatically discovered and configured for all domains that have NSX deployed. No manual configuration is needed.
Navigation: VCF Operations > Administration > Integrations > Accounts
The NSX adapter retrieves alerts and findings from NSX into VCF Operations. VCF 9.0 includes enhanced NSX monitoring:
| Feature | Description |
|---|---|
| Enhanced Edge Node Monitoring | New edge node metrics sub-groups |
| Network Operations Overview | vSphere networking and NSX inventory summary |
| Network Alert Trends | Visibility into network alerts over time |
| Transport Node Status | Real-time health of all transport nodes |
| Segment Health | Overlay and VLAN segment connectivity status |
For deeper network monitoring capabilities:
Navigation: VCF Operations > Administration > Integrations > Repository
Important: Starting from VCF 9.0, only ONE VCF Operations for Networks instance integration is supported. During deployment, VCF Operations Fleet Management integrates VCF Operations and VCF Operations for Networks automatically.
| Metric Category | Key Indicators |
|---|---|
| Transport Node | Configuration state, connection status, TEP reachability |
| NSX Manager | Service health (MANAGER, SEARCH, UI, CONTROLLER, NODE_MGMT) |
| DFW | Rule hit counts, dropped packets, policy publish status |
| Segments | Port count, traffic throughput, MAC learning |
| Edge Nodes | CPU/memory utilization, throughput, session counts |
Symptom: NSX Manager console shows repeated sysrq: Show Memory messages, and all NSX-related validation checks fail.
Diagnosis:
# Check NSX Manager memory from vCenter
# VM > Monitor > Performance > Memory
# Check service health via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
Resolution: Increase NSX Manager RAM to 32GB and vCPU to 6, power the appliance back on, wait 10-15 minutes, and confirm all services return to RUNNING.
Step 1: Check status in NSX Manager
Navigate to System > Fabric > Nodes > Host Transport Nodes and review status (green/yellow/red).
Step 2: Test TEP connectivity from ESXi host
# SSH to ESXi host
ssh root@192.168.1.74
# Find TEP VMkernel
esxcfg-vmknic -l | grep -i tep
# For vmk0-as-TEP configuration, test management connectivity
vmkping 192.168.1.75
# Test with MTU 1600 (GENEVE overhead requires 1600+ bytes)
vmkping -d -s 1572 192.168.1.75
Step 3: Check NSX agent on host
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-datapath status
tail -50 /var/log/nsx-syslog.log
Step 4: Resync transport node
In NSX Manager > System > Fabric > Nodes, click problematic host > Actions > Redeploy Node.
NSX Manager CLI (SSH as admin):
# Overall cluster status
get cluster status
# Detailed service list
get cluster status verbose
# Get manager node list
get managers
# Get all transport nodes
get transport-nodes
NSX Manager API:
# Cluster status
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
# Transport node status
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/transport-nodes
# Transport node state
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/transport-nodes/state
# Compute managers
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/fabric/compute-managers
# List certificates
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/trust-management/certificates
# Node UUID (from cluster info)
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
Important: NSX shell does NOT support backslash line continuation. All curl commands must be single-line.
The default NSX self-signed certificate may not include proper SAN entries. VDT will report FAIL if the certificate SAN does not include the FQDN that SDDC Manager uses to register NSX.
Step 1: Create OpenSSL config on NSX Manager (SSH as root):
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
Critical: DNS.3 = nsx-manager.lab.local is required because SDDC Manager registers NSX using this FQDN. Without it, VDT reports SAN check failure.
Step 2: Generate certificate and key:
openssl req -x509 -nodes -days 825 -newkey rsa:2048 \
-keyout /tmp/nsx.key -out /tmp/nsx.crt \
-config /tmp/nsx-cert.conf -sha256
Step 3: Verify SAN entries:
openssl x509 -in /tmp/nsx.crt -text -noout | grep -A4 "Subject Alternative Name"
Step 4: Create JSON payload using Python (avoids shell PEM escaping issues):
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
Step 5: Import certificate into NSX:
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
Note the certificate ID from the response (e.g., 701d1416-5054-4038-8749-4ac495980ebd).
Step 6: Get node UUID:
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
Step 7: Apply to NSX Manager node:
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<CERT-ID>?action=apply_certificate&service_type=API&node_id=<NODE-UUID>"
Step 8: Apply to cluster VIP:
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<CERT-ID>?action=apply_certificate&service_type=MGMT_CLUSTER"
Step 9: Import into SDDC Manager trust stores (SSH to SDDC Manager as root):
# Pull active NSX certificate
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Import into VCF trust store
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
# Import into Java cacerts
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
# Restart SDDC Manager services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
Services take ~5 minutes to restart. After restart, re-run VDT to confirm NSX cert trust checks pass.
Reference: https://knowledge.broadcom.com/external/article/316056
| Task | Command |
|---|---|
| Cluster status | get cluster status |
| Manager list | get managers |
| Transport nodes | get transport-nodes |
| Logical switches (segments) | get logical-switches |
| Logical routers (gateways) | get logical-routers |
| VTEP information | get vtep |
| VTEP table | get vtep-table |
| Firewall rules | get firewall rules |
| Firewall status | get firewall status |
| Interfaces | get interfaces |
| Set DNS | set name-servers <ip> |
| Set NTP | set ntp-servers <ip> |
| Task | Command |
|---|---|
| NSX proxy status | /etc/init.d/nsx-proxy status |
| Restart NSX proxy | /etc/init.d/nsx-proxy restart |
| NSX datapath status | /etc/init.d/nsx-datapath status |
| NSX operations agent | /etc/init.d/nsx-opsagent status |
| View NSX logs | tail -50 /var/log/nsx-syslog.log |
| Check NSX port 1234 connections | esxcli network ip connection list | grep 1234 |
| List VMkernel interfaces | esxcli network ip interface list |
| List DVS info | esxcli network vswitch dvs vmware list |
| Port | Protocol | Purpose |
|---|---|---|
| 443 | TCP | NSX Manager UI and API |
| 1234 | TCP | NSX agent communication (host to manager) |
| 1235 | TCP | NSX cluster inter-node |
| 6081 | UDP | GENEVE overlay encapsulation |
| 8080 | TCP | NSX Manager internal API |
Navigation: NSX Manager > Plan & Troubleshoot > Traffic Analysis > Traceflow
Interpreting Results:
| Result | Action |
|---|---|
| Green line | Path working -- check application layer |
| Red X (DFW rule) | Check firewall rule ordering and policies |
| Red X (TEP unreachable) | Check physical network, MTU, VLAN configuration |
| Red X (No route) | Check Tier-0/Tier-1 routing configuration |
vSAN Express Storage Architecture (ESA) is the default storage architecture in VCF 9.0, replacing the older Original Storage Architecture (OSA). ESA eliminates the distinction between cache and capacity tiers, treating all devices as a single flat storage pool with software-managed caching.
| Feature | vSAN OSA | vSAN ESA |
|---|---|---|
| Disk Groups | Cache + Capacity tiers | Single flat pool (no disk groups) |
| Cache Devices | Dedicated SSD for cache | No dedicated cache — software-managed |
| Capacity Devices | SSD or HDD | NVMe SSDs only (production) |
| RAID Support | RAID-1/5/6 | RAID-1/5/6 with native snapshots |
| Compression | Dedup + Compression (capacity tier) | Always-on compression |
| Erasure Coding | Available | Improved efficiency |
| Performance | Depends on cache tier sizing | Consistent — all devices contribute |
| Minimum Disks per Host | 1 cache + 1 capacity | 1 storage device |
| Nested Lab Support | VMX virtualSSD flag | VMX virtualSSD flag + HCL bypass |
VCF 9.0.1 includes a built-in bypass for vSAN ESA HCL validation, eliminating the need for the mock VIB that was required in earlier versions. This bypass allows virtual SATA disks marked as SSD in the VMX file to be claimed by vSAN ESA.
Step 1: Mark virtual disks as SSD in VMX files
Edit each ESXi VM's .vmx file in VMware Workstation (VM must be powered off):
# Add to each ESXi VM's VMX file
sata0:0.virtualSSD = "1"
sata0:2.virtualSSD = "1"
For esxi01 only (has an extra disk):
sata0:3.virtualSSD = "1"
VMX file locations in this lab:
D:\VMs\esxi01.lab.local\esxi01.lab.local.vmx
E:\VMs\esxi02.lab.local\esxi02.lab.local.vmx
E:\VMs\esxi03.lab.local\esxi03.lab.local.vmx
F:\VMs\esxi04.lab.local\esxi04.lab.local.vmx
Step 2: Enable the vSAN ESA HCL bypass on the VCF Installer
SSH to the VCF Installer (192.168.1.240) as root:
# Add the vSAN ESA HCL bypass property
echo "vsan.esa.sddc.managed.disk.claim=true" >> /etc/vmware/vcf/domainmanager/application-prod.properties
# Restart the domain manager service to apply
systemctl restart domainmanager
# Verify the property was written
cat /etc/vmware/vcf/domainmanager/application-prod.properties | grep vsan
Important: This bypass must be applied BEFORE running the VCF Installer wizard. If the wizard has already been started, restart domainmanager and refresh the browser.
Step 3: Verify SSD detection on ESXi hosts after power-on
SSH to each ESXi host and confirm disks are recognized as SSD:
# Check SSD status for all storage devices
esxcli storage core device list | grep -E "Display Name|Is SSD"
# Expected output for each disk:
# Display Name: Local ATA Disk (t10.ATA...)
# Is SSD: true
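To check all four nested hosts in one pass, a small loop from the workstation (a sketch; host IPs are the lab values and SSH must be enabled on each host):
# Report the SSD flag for every storage device on every lab host
for host in 192.168.1.74 192.168.1.75 192.168.1.76 192.168.1.82; do
  echo "=== $host ==="
  ssh root@"$host" 'esxcli storage core device list | grep -E "Display Name|Is SSD"'
done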
If virtual disks are not detected as SSD even after setting virtualSSD in the VMX file, use SATP (Storage Array Type Plugin) claim rules to force SSD detection:
# List current SATP rules filtering for SSD
esxcli storage nmp satp rule list | grep enable_ssd
# Add a claim rule to mark a specific device as SSD
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL \
-d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 \
-o enable_ssd
# Reclaim the device to apply the new rule
esxcli storage core claiming reclaim \
-d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Verify the device is now marked as SSD
esxcli storage core device list -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 | grep "Is SSD"
Note: SATP claim rules persist across reboots. The VMX virtualSSD approach is preferred because it marks the disk at the hardware emulation level before ESXi boots.
VCF Installer automatically creates a default vSAN storage policy during deployment. For nested labs with only 4 hosts, the default policy uses:
| Policy Setting | Value |
|---|---|
| Failures to Tolerate (FTT) | 1 |
| Failure Tolerance Method | RAID-1 (Mirroring) |
| Object Space Reservation | Thin provisioning |
To create a custom storage policy in vCenter:
Navigate to https://vcenter.lab.local > Policies and Profiles > VM Storage Policies, create a policy named vSAN-thin-FTT1, and apply it to VMs on vcenter-cl01-ds-vsan01.
After VCF Installer completes deployment, verify the vSAN datastore:
# On any ESXi host, list vSAN storage
esxcli vsan storage list
# Check vSAN cluster membership
esxcli vsan cluster get
# List datastores visible to the host
esxcli storage filesystem list | grep -i vsan
# Verify datastore is accessible in vCenter
# Navigate to: vcenter.lab.local > vcenter-dc01 > vcenter-cl01 > Datastores
# Datastore name: vcenter-cl01-ds-vsan01
# Comprehensive disk query with vSAN eligibility
vdq -iH
# Quick eligibility check
vdq -q
# List all vSAN storage devices and their state
esxcli vsan storage list
# List all storage devices with full details
esxcli storage core device list
# Filter for device name and SSD status
esxcli storage core device list | grep -E "^t10|^naa|Display Name|Is SSD|Size"
# Check partition tables on a specific disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
Sample vdq -q output for an eligible disk:
{
"Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
"State": "Eligible for use by VSAN",
"Reason": "None",
"IsSSD": "1"
}
Sample output for an ineligible disk:
{
"Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
"State": "Ineligible for use by VSAN",
"Reason": "Has partitions",
"IsSSD": "1"
}
Removing a disk from vSAN:
# Remove a specific disk from vSAN storage
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Verify removal
esxcli vsan storage list
vdq -q
Cleaning up old vSAN partitions (required after failed deployments):
# Check existing partitions
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Delete partition 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
# Delete partition 2
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Verify disk is now eligible
vdq -q
Warning: Deleting partitions destroys all data on those partitions. Only use this procedure on disks that are being reclaimed for a fresh vSAN deployment.
vSAN ESA does not use disk groups. For environments still running vSAN OSA, disk groups consist of one cache device and one or more capacity devices:
# List current disk groups
esxcli vsan storage list
# Remove an entire disk group by specifying the cache disk
esxcli vsan storage remove -d <cache-disk-device-name>
Orphaned vSAN objects can occur after VM deletions or failed migrations:
From the command line:
# List vSAN objects on a host
esxcli vsan debug object list
# Check for inaccessible objects
esxcli vsan debug object health summary get
This is a chicken-and-egg problem inherent to every VCF deployment:
SDDC Manager is deployed during bringup, before the vSAN datastore exists, so it lands on a host's local datastore (esxi01-local in the lab).
This means SDDC Manager is always initially deployed to local storage and must be manually migrated to shared storage (vSAN) afterward. In the lab, this was done during Phase 7 (Feb 10–11) after the management domain bringup was complete.
Resource contention: In a nested lab, this is especially problematic because esxi01 ends up hosting both SDDC Manager and other large VMs (like NSX Manager at 32GB RAM) on its local datastore, with no ability to vMotion until the migration to shared storage is complete.
The vCenter migration wizard cannot thin-provision virtual disks when migrating to a vSAN datastore. When you attempt to migrate a thick-provisioned VM using the vCenter storage migration wizard and select "thin provisioning," the disks remain at their full allocated size on vSAN. This is particularly problematic for VMs like SDDC Manager that allocate far more disk space than they actually use.
In this lab, SDDC Manager had 6 disks totaling 914GB allocated but only ~108GB of actual data:
| Disk | Allocated | Actual Used |
|---|---|---|
| sddc-manager.vmdk | 32GB | 2.6GB |
| sddc-manager_1.vmdk | 16GB | 2.6GB |
| sddc-manager_2.vmdk | 240GB | 3.0GB |
| sddc-manager_3.vmdk | 512GB | 99.5GB |
| sddc-manager_4.vmdk | 26GB | 30MB |
| sddc-manager_5.vmdk | 88GB | 64MB |
| Total | 914GB | ~108GB |
The vmkfstools -i command with the -d thin flag creates a true thin-provisioned copy of each virtual disk. This must be done per-disk from the ESXi shell.
Prerequisites:
Step 1: Power off the VM in vCenter
Step 2: SSH to the ESXi host where the VM is registered
ssh root@192.168.1.74
Step 3: Create the destination directory on vSAN
mkdir -p /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
Step 4: Clone each disk as thin provisioned
# Disk 0 (32GB allocated, 2.6GB actual)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmdk -d thin
# Disk 1 (16GB allocated, 2.6GB actual)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_1.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_1.vmdk -d thin
# Disk 2 (240GB allocated, 3.0GB actual)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_2.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_2.vmdk -d thin
# Disk 3 (512GB allocated, 99.5GB actual) — LARGEST DISK, takes longest
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin
# Disk 4 (26GB allocated, 30MB actual)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_4.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_4.vmdk -d thin
# Disk 5 (88GB allocated, 64MB actual)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_5.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_5.vmdk -d thin
Warning: Disk 3 (512GB/99.5GB) failed on the first attempt due to a host disconnect during the clone. If a clone fails partway through, delete the partial copy before retrying: vmkfstools -U /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk, then retry the clone command.
Step 5: Copy configuration files
# Copy VMX, NVRAM, and VMSD files
cp /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmx /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
cp /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.nvram /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
cp /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmsd /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
# Verify thin provisioned disks on vSAN
ls -la /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
# Check actual disk usage (thin should show much less)
du -sh /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/
Step 1: Unregister the old VM from inventory
In vCenter, right-click the VM > Remove from Inventory (NOT "Delete from Disk" -- you want to keep the original files as a backup).
Step 2: Register the new VM from vSAN
In vCenter, navigate to Datastores > vcenter-cl01-ds-vsan01 > Browse Files > sddc-manager/ > right-click sddc-manager.vmx > Register VM.
Step 3: Power on and verify
Power on the VM from vCenter and verify it boots correctly. All services should start normally since the disk contents are identical -- only the provisioning format changed.
Step 4: Clean up the original files (optional, after confirming success)
# Only after confirming the migrated VM works correctly
rm -rf /vmfs/volumes/esxi01-local/sddc-manager/
When you configure a VCF Cloud Account or vCenter account in VCF Operations that points to a vSAN-enabled cluster, vSAN monitoring data is automatically collected. No separate configuration is required.
Access vSAN Storage Operations Dashboard:
Navigation: VCF Operations > Infrastructure Operations > Storage Operations
The centralized storage dashboard shows:
Predefined vSAN Dashboards:
Navigation: VCF Operations > Infrastructure Operations > Dashboards & Reports
Run vSAN Performance Diagnostics:
Select the cluster (vcenter-cl01) and run the diagnostics.
Note: Diagnostic reports are available for the past 7 days only. Diagnostics run on both vSAN OSA and ESA HCI architectures.
# Check vSAN cluster health summary
esxcli vsan health cluster list
# Run a specific health check
esxcli vsan health cluster get -t "Network health"
# Check vSAN cluster membership
esxcli vsan cluster get
# List all vSAN storage devices and their state
esxcli vsan storage list
# Check resync status
esxcli vsan debug resync summary get
# Check vSAN object health
esxcli vsan debug object health summary get
| Metric | Location | Threshold |
|---|---|---|
| Network Latency | vSAN Health > Network | < 5ms (will be yellow in nested labs) |
| Disk Latency | vSAN Health > Physical Disk | < 10ms read, < 20ms write |
| Congestion | vSAN Health > Performance | < 30 (0-255 scale) |
| Capacity Utilization | vSAN Capacity | < 80% (warning at 70%) |
| Component Health | vSAN Health > Data | All objects healthy |
| Resync Operations | Monitor > vSAN > Resyncing Objects | Should be 0 during steady state |
# Launch esxtop in disk (storage) mode
esxtop
# Press 'u' to switch to disk device view
# Press 'v' to switch to disk VM view
Key esxtop storage metrics:
| Column | Meaning |
|---|---|
| CMDS/s | Total commands per second |
| READS/s | Read operations per second |
| WRITES/s | Write operations per second |
| MBREAD/s | Read throughput in MB/s |
| MBWRTN/s | Write throughput in MB/s |
| LAT/rd | Average read latency (ms) |
| LAT/wr | Average write latency (ms) |
| KAVG/rd | Kernel average read latency |
| GAVG/rd | Guest average read latency |
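For capturing these metrics non-interactively, esxtop also supports batch mode; a short sketch (the interval and sample count are arbitrary choices):
# Capture 2 samples, 10 seconds apart, to a perfmon-style CSV for offline review
esxtop -b -d 10 -n 2 > /tmp/esxtop-capture.csv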
vSAN Observer provides real-time and historical performance data through a web-based interface. It is available through the Ruby vSphere Console (RVC):
# Connect to RVC from vCenter shell
rvc administrator@vsphere.local@localhost
# Navigate to cluster
cd /vcenter.lab.local/vcenter-dc01/computers/vcenter-cl01
# Start vSAN Observer
vsan.observer . --run-webserver --force
The observer starts a web server (typically on port 8010) that can be accessed from a browser.
Network Latency (Expected Yellow)
vSAN health check shows yellow on "Network latency check" -- this is normal and expected for nested ESXi in VMware Workstation. Typical latency values in this lab:
| From Host | To Host | Latency (ms) | Threshold (ms) |
|---|---|---|---|
| 192.168.12.122 | 192.168.12.123 | 6.81 | 5 |
| 192.168.12.123 | 192.168.12.122 | 6.32 | 5 |
| 192.168.12.123 | 192.168.12.120 | 6.61 | 5 |
| 192.168.12.123 | 192.168.12.121 | 6.15 | 5 |
Even "passing" pairs average 4.48ms latency, which is high for physical hosts but typical for virtualized NICs. This remains yellow and does not affect functionality.
Symptom: esxcli storage core device list shows Is SSD: false for virtual disks, or vdq -q shows "Ineligible for use by VSAN."
Diagnosis:
# Check if disk is seen at all
esxcli storage core device list | grep -E "^t10|Is SSD"
# Check vSAN eligibility
vdq -q
# Check for stale partitions
partedUtil getptbl /vmfs/devices/disks/<device-name>
Resolution (in order of preference):
1. Power off the VM, add sata0:X.virtualSSD = "1" to the VMX file, and power on.
2. Run esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d <device> -o enable_ssd, then esxcli storage core claiming reclaim -d <device>.
3. If stale partitions remain from a prior deployment, use partedUtil delete to remove old partitions.
Symptom: vSAN health shows "Network partition" or hosts appear to be in different sub-clusters.
Diagnosis:
# Check vSAN cluster membership
esxcli vsan cluster get
# Check network connectivity between vSAN VMkernel ports
vmkping -I vmk2 192.168.12.120
vmkping -I vmk2 192.168.12.121
vmkping -I vmk2 192.168.12.122
vmkping -I vmk2 192.168.12.123
# Check VMkernel adapter status
esxcli network ip interface list
Resolution:
For nested labs in VMware Workstation, confirm the ESXi VMs' virtual NICs permit promiscuous traffic (ethernet*.noPromisc = "FALSE" in the VMX files), and verify vmk2 connectivity between all hosts.
Symptom: vSAN objects show as "Degraded," "Reduced Availability," or "Inaccessible."
# Check object health summary
esxcli vsan debug object health summary get
# List objects with issues
esxcli vsan debug object list
# In vCenter: Cluster > Monitor > vSAN > Virtual Objects
# Filter for non-healthy objects
Resolution:
After host maintenance, disk replacement, or policy changes, vSAN resyncs objects:
# Check resync summary
esxcli vsan debug resync summary get
# In vCenter: Cluster > Monitor > vSAN > Resyncing Objects
# Shows: Objects resyncing, bytes remaining, ETA
Tip: Do not put another host into maintenance mode while resync is in progress. Wait for resync to complete (0 resyncing objects) before proceeding.
# vSAN trace files location
ls /var/log/vmkernel.log | head
ls /var/log/vobd.log | head
# Search for vSAN-related errors in vmkernel log
grep -i "vsan\|cmmds\|clom\|dom\|lsom" /var/log/vmkernel.log | tail -50
# vSAN specific logs
cat /var/log/vsanmgmt.log | tail -50
cat /var/log/vsantraced.log | tail -50
# Check vSAN daemon status
/etc/init.d/vsanmgmtd status
/etc/init.d/vsand status
Key vSAN log abbreviations:
| Abbreviation | Full Name | Purpose |
|---|---|---|
| CMMDS | Cluster Monitoring, Membership, and Directory Service | Cluster membership |
| CLOM | Cluster Level Object Manager | Object placement |
| DOM | Distributed Object Manager | Object I/O |
| LSOM | Local Log-Structured Object Manager | Local disk I/O |
| RDT | Reliable Datagram Transport | vSAN network transport |
VCF uses TLS certificates for secure communication between all platform components. In VCF 9.0, certificate management is centralized through VCF Operations (Fleet Management > Certificates), replacing the certificate management previously found in SDDC Manager.
The certificate trust chain works as follows:
Components and their certificate locations:
| Component | Certificate Location | Type |
|---|---|---|
| ESXi Hosts | /etc/vmware/ssl/rui.crt and rui.key | Self-signed (auto-generated) |
| vCenter Server | VMCA-managed (internal) | VMCA-signed |
| NSX Manager | Internal keystore, managed via API | Self-signed or CA-signed |
| SDDC Manager | /etc/vmware/vcf/commonsvcs/ | Self-signed or CA-signed |
| VCF Operations | Internal keystore | Self-signed or CA-signed |
| Aspect | Self-Signed | CA-Signed |
|---|---|---|
| Trust | Must be manually imported into trust stores | Automatically trusted if CA root is in trust stores |
| Complexity | Low — generated locally | Higher — requires CA infrastructure |
| VDT Validation | Passes if SAN/trust store entries are correct | Passes inherently |
| Renewal | Manual | Can be automated via VCF Operations |
| Production Use | Not recommended | Required |
| Lab Use | Acceptable | Optional |
Certificates in VCF have the following lifecycle stages:
VCF Operations supports auto-renewal for: ESX SSL, vCenter machine SSL, NSX LM/VIP, SDDC Manager SSL, and VCF Operations certificates.
| Communication Path | Certificate Used | Trust Required By |
|---|---|---|
| Browser to vCenter | vCenter machine SSL | Browser |
| Browser to NSX Manager | NSX API certificate | Browser |
| SDDC Manager to vCenter | vCenter machine SSL | SDDC Manager trust stores |
| SDDC Manager to NSX | NSX API/VIP certificate | SDDC Manager trust stores |
| vCenter to ESXi | ESXi rui.crt | vCenter VMCA trust |
| NSX to ESXi (transport nodes) | ESXi rui.crt + NSX node cert | Mutual trust |
| VCF Operations to SDDC Manager | SDDC Manager SSL cert | VCF Operations |
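A quick way to spot-check the certificates on these paths is to read the expiry date each endpoint presents; a sketch using the lab hostnames (adjust the endpoint list to your environment):
# Print the certificate expiry presented by each management endpoint
for ep in vcenter.lab.local:443 nsx-manager.lab.local:443 sddc-manager.lab.local:443 192.168.1.77:443; do
  echo -n "$ep  "
  echo | openssl s_client -connect "$ep" -servername "${ep%%:*}" 2>/dev/null | openssl x509 -noout -enddate
done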
This is the most complex certificate operation in VCF. The default NSX self-signed certificate generated during ovftool deployment uses a wildcard SAN (*.lab.local) without specific hostnames or IPs, causing VDT to report failures. This section documents the complete, lab-tested procedure for replacing the NSX certificate.
Critical: The SAN must include nsx-manager.lab.local (the FQDN that SDDC Manager uses to register NSX), not just nsx-node1.lab.local. Without it, VDT reports "SAN contains IP but not hostname" because it looks up the registered FQDN and does not find it in the certificate SAN.
SSH to the NSX Manager as root and create the OpenSSL configuration file:
ssh root@192.168.1.71
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
Explanation of each SAN entry:
| Entry | Purpose |
|---|---|
| DNS.1 = nsx-vip.lab.local | NSX Virtual IP FQDN (cluster access point) |
| DNS.2 = nsx-node1.lab.local | NSX Manager node FQDN (direct node access) |
| DNS.3 = nsx-manager.lab.local | SDDC Manager's registered FQDN for NSX -- REQUIRED |
| IP.1 = 192.168.1.70 | NSX VIP IP address |
| IP.2 = 192.168.1.71 | NSX Manager node IP address |
Important: If you have multiple NSX Manager nodes (HA deployment), add DNS and IP entries for each node (DNS.4, DNS.5, IP.3, IP.4, etc.).
Generate a new self-signed certificate and private key:
openssl req -x509 -nodes -days 825 -newkey rsa:2048 \
-keyout /tmp/nsx.key -out /tmp/nsx.crt \
-config /tmp/nsx-cert.conf -sha256
Verify the certificate SAN entries:
openssl x509 -in /tmp/nsx.crt -text -noout | grep -A4 "Subject Alternative Name"
Expected output:
X509v3 Subject Alternative Name:
DNS:nsx-vip.lab.local, DNS:nsx-node1.lab.local, DNS:nsx-manager.lab.local, IP Address:192.168.1.70, IP Address:192.168.1.71
Verify the certificate details:
# Check subject, issuer, validity period
openssl x509 -in /tmp/nsx.crt -text -noout | head -20
# Check key type and size
openssl x509 -in /tmp/nsx.crt -text -noout | grep "Public-Key"
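It is also worth confirming that the private key actually pairs with the certificate before building the import payload; a small sketch (the two digests must be identical):
# Hash the public key from the certificate and from the private key; they must match
openssl x509 -in /tmp/nsx.crt -noout -pubkey | openssl sha256
openssl pkey -in /tmp/nsx.key -pubout | openssl sha256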
The NSX API requires the certificate and private key as a JSON payload with PEM-encoded strings. Shell escaping of PEM data (which contains newlines) is error-prone, so a Python script is used to build the JSON correctly.
Build the JSON payload using Python:
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
Why Python? NSX shell does NOT support backslash line continuation. All curl commands must be single-line. Python avoids the shell escaping issues with \n characters embedded in PEM data that would break a curl -d '...' payload.
Verify the JSON was built correctly:
python -c "import json; d=json.load(open('/tmp/nsx-import.json')); print('cert lines:', d['pem_encoded'].count('\n'), 'key lines:', d['private_key'].count('\n'))"
Import the certificate into NSX (single-line curl -- mandatory):
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
The response includes a certificate ID. Example:
{
"results": [
{
"id": "701d1416-5054-4038-8749-4ac495980ebd",
...
}
]
}
Record the certificate ID (701d1416-5054-4038-8749-4ac495980ebd in this lab) -- it is needed for the apply step.
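If you save the import response to a file (for example by adding -o /tmp/nsx-import-response.json to the curl above, an assumption about how you capture it), the ID can be extracted without copy-paste:
# Print the certificate ID from the saved import response
python -c "import json; print(json.load(open('/tmp/nsx-import-response.json'))['results'][0]['id'])"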
Prerequisite: All NSX services must be healthy (MANAGER, SEARCH, UI, NODE_MGMT all UP). If services are DOWN, the API returns error 101: "Some appliance components are not functioning properly." Check service status with curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status. Services can take 10-15 minutes to stabilize after NSX restart in nested environments.
The certificate must be applied in two steps: first to the NSX Manager node (API service), then to the cluster VIP (MGMT_CLUSTER).
Step 1: Get the node UUID from cluster info:
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
From the response, extract the node UUID. In this lab: 95493642-ef4a-cb8e-ed7c-5bc20033f2c2
Step 2: Apply certificate to NSX Manager node (API service):
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/701d1416-5054-4038-8749-4ac495980ebd?action=apply_certificate&service_type=API&node_id=95493642-ef4a-cb8e-ed7c-5bc20033f2c2"
Expected response: empty body with HTTP 200 -- this means success.
Important: Apply to the node FIRST, then to the VIP. Applying in the wrong order can cause connectivity issues.
Step 3: Apply certificate to the cluster VIP (MGMT_CLUSTER):
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/701d1416-5054-4038-8749-4ac495980ebd?action=apply_certificate&service_type=MGMT_CLUSTER"
Expected response: empty body with HTTP 200 -- success.
Step 4: Verify the new certificate is active on both endpoints:
# Verify node certificate (.71)
openssl s_client -connect 192.168.1.71:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A2 "Subject Alternative Name"
# Verify VIP certificate (.70)
openssl s_client -connect 192.168.1.70:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A2 "Subject Alternative Name"
Both should show:
X509v3 Subject Alternative Name:
DNS:nsx-vip.lab.local, DNS:nsx-node1.lab.local, DNS:nsx-manager.lab.local, IP Address:192.168.1.70, IP Address:192.168.1.71
After replacing the NSX self-signed certificate, the new certificate's root is NOT in SDDC Manager's trust stores. The old NSX cert was pre-trusted during bringup; the new self-signed cert must be explicitly imported into both SDDC Manager keystores.
SSH to SDDC Manager:
# Only the vcf user can SSH in (root and admin are rejected)
ssh vcf@192.168.1.241
# Switch to root
su -
Note on file transfers: SCP does not work with SDDC Manager due to the restricted shell. Use ssh vcf@host "cat > file" < localfile for file transfers.
Step 1: Pull the active NSX certificate:
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
Step 2: Verify the certificate is correct:
openssl x509 -in /tmp/nsx-root.crt -noout -text | grep -A2 "Subject Alternative Name"
# Should show: DNS:nsx-vip.lab.local, DNS:nsx-node1.lab.local, DNS:nsx-manager.lab.local, IP Address:192.168.1.70, IP Address:192.168.1.71
Step 3: Import into the VCF trust store:
The VCF trust store password is stored in a .key file alongside the store:
# Read the trust store password
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
# Import the NSX certificate
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Step 4: Import into the Java cacerts keystore:
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
Step 5: Restart SDDC Manager services:
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
Services take approximately 5 minutes to restart. After restart, re-run VDT to confirm NSX cert trust checks pass.
Trust store paths and passwords reference:
| Item | Path / Value |
|---|---|
| VCF trust store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store |
| VCF trust store password | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key |
| Java cacerts | /etc/alternatives/jre/lib/security/cacerts |
| Java cacerts password | changeit |
| Service restart script | /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh |
Reference: KB 316056 - Trusting Custom Certificates in SDDC Manager
ESXi hosts auto-generate self-signed SSL certificates at first boot. Regeneration is required when the hostname was changed after first boot, because the certificate still reflects the original name (typically localhost.localdomain).
Symptom in VCF Installer/SDDC Manager logs:
javax.net.ssl.SSLPeerUnverifiedException: Certificate for <esxi01.lab.local> doesn't match any of the subject alternative names: [localhost.localdomain]
SSH to the ESXi host and check:
# Check current hostname
esxcli system hostname get
# View current certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# View full certificate details (subject, issuer, validity)
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout
Run on each ESXi host that needs certificate regeneration:
esxi01.lab.local (192.168.1.74):
# Step 1: Ensure hostname is correct
esxcli system hostname set --fqdn=esxi01.lab.local
# Step 2: Verify hostname
esxcli system hostname get
# Step 3: Backup existing certificates
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
# Step 4: Generate new certificates
/sbin/generate-certificates
# Step 5: Restart all services to apply new certificate
services.sh restart
# Step 6: Verify new certificate has correct SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
esxi02.lab.local (192.168.1.75):
esxcli system hostname set --fqdn=esxi02.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi03.lab.local (192.168.1.76):
esxcli system hostname set --fqdn=esxi03.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi04.lab.local (192.168.1.82):
esxcli system hostname set --fqdn=esxi04.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
After regenerating ESXi certificates, you must update the thumbprints in VCF. Get the new thumbprints from the VCF Installer or SDDC Manager:
# Get SHA-256 thumbprint for each host
echo | openssl s_client -connect 192.168.1.74:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.75:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.76:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.82:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
Then re-validate the hosts in the VCF Installer UI to update the stored thumbprints.
VCF Operations supports configuring a Microsoft Certificate Authority for automated certificate issuance and renewal.
Navigation: VCF Operations > Fleet Management > Certificates > Configure CA
Configuration Steps:
- CA server URL: must start with https:// and end with certsrv (e.g., https://ca.lab.local/certsrv)
- CA service account credentials (e.g., svc-vcf-ca)
Important: VCF management components (VCF Operations, Fleet Management, VCF Automation) only support Microsoft CA. VCF Instance components (vCenter, NSX, ESXi, SDDC Manager) support both Microsoft CA and OpenSSL.
Microsoft CA Template Requirements:
The certificate template used for VCF must support:
For environments without Microsoft CA infrastructure, VCF supports OpenSSL as an alternative CA for VCF Instance components.
Configuration Steps:
- Common name (e.g., sddc-manager.lab.local)
- Country code (e.g., US)
- Organization (e.g., lab.local)
When using Microsoft CA, create a dedicated certificate template:
- Template display name: VCF Web Server
- Template name: VCF Web Server
Lab Note: In this lab environment, no Microsoft CA is deployed. All certificates are self-signed. The certificate management UI in VCF Operations shows certificate expiration warnings, which is expected and acceptable for lab use.
VCF 9.0 centralizes password management in VCF Operations, replacing the password management previously in SDDC Manager.
Navigation: VCF Operations > Fleet Management > Passwords
The password dashboard shows:
Managed VCF Management Components:
| Component | Managed Accounts |
|---|---|
| Fleet Management | root, admin |
| VCF Automation | root, admin |
| VCF Identity Broker | root, admin |
| VCF Operations | root, admin |
| VCF Operations for Logs | root, admin |
| VCF Operations for Networks | root, admin |
Managed VCF Instance/Domain Components:
| Component | Managed Accounts |
|---|---|
| ESXi Hosts | root |
| NSX Manager | root, admin, audit |
| vCenter Server | root, administrator@vsphere.local |
| SDDC Manager | root, vcf, admin@local |
Manual password update (specify exact password):
This changes the password on both the server side (where the account resides) and the client side (where credentials are stored in SDDC Manager).
Automated password rotation (system-generated random password):
Note: Auto-rotate is automatically enabled for vCenter Server. It may take up to 24 hours to configure the auto-rotate policy for a newly deployed vCenter.
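The same operations can also be driven through the SDDC Manager credentials API. The sketch below is hedged: the /v1/credentials endpoint and bearer-token pattern match Appendix I, but the payload field values (operationType, resourceType, credentialType) are assumptions and should be confirmed against the API reference before use.
# List stored credentials (uses the $TOKEN obtained from POST /v1/tokens as shown in Section 7.2.6 / Appendix I)
curl -sk "https://sddc-manager.lab.local/v1/credentials" -H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# Hedged sketch of a rotate request -- field values below are assumptions, verify in Appendix I
curl -sk -X PATCH "https://sddc-manager.lab.local/v1/credentials" -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{"operationType":"ROTATE","elements":[{"resourceName":"nsx-vip.lab.local","resourceType":"NSXT_MANAGER","credentials":[{"credentialType":"API","username":"admin"}]}]}'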
WARNING — Credential Rotation Cascade Failure: If a rotation or update fails mid-operation (e.g., NSX unreachable during boot storm), the resource gets stuck in
ACTIVATING or ERROR state in platform.nsxt, stale locks fill platform.lock, and unresolved tasks pile up in platform.task_metadata (resolved=false). Each UI retry adds more stuck tasks. The API cannot cancel these tasks (TA_TASK_CAN_NOT_BE_RETRIED). Fix requires direct PostgreSQL repair: fix nsxt status, clear locks, mark task_metadata resolved, clear task_lock, then restart operationsmanager. See Section 7.2.6 for the full 6-step database repair procedure.
Password remediation (when out of sync):
If a password gets out of sync between SDDC Manager's stored credential and the actual component password:
Prerequisites:
| Component | Account | Default Password | Notes |
|---|---|---|---|
| ESXi Hosts | root | Set during install | Same across all hosts in lab |
| vCenter Server | administrator@vsphere.local | Set during VCF Installer | SSO administrator |
| vCenter Server | root | Set during VCF Installer | Appliance shell |
| NSX Manager | admin | Set during OVF deployment | Web UI + CLI |
| NSX Manager | root | Set during OVF deployment | Appliance shell |
| NSX Manager | audit | Set during OVF deployment | Read-only CLI |
| SDDC Manager | vcf | Set during deployment | SSH login user |
| SDDC Manager | root | Set during deployment | Via su - from vcf |
| SDDC Manager | admin@local | Set during deployment | Web UI |
| VCF Operations | admin | Set during deployment | Web UI |
| VCF Operations | root | Set during OVF deployment | Appliance shell |
| Lab password pattern | all | Success01!0909!! | Used across this lab |
VCF enforces password complexity requirements, including special characters drawn from the allowed set !@#$%^&*.
VCF Operations provides built-in and downloadable compliance frameworks:
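Purely as an illustration (not an official tool), a throwaway password covering several character classes can be minted like this; adjust to whatever the actual policy requires:
# Illustrative only: 16 random base64 characters plus one character from each common class
PW="$(openssl rand -base64 12 | tr -d '=+/')Aa1!"
echo "$PW"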
Built-in (available immediately):
| Framework | Coverage |
|---|---|
| vSphere Security Configuration Guide | ESXi hosts, VMs, vCenter |
| vSAN Security Configuration Guide | vSAN clusters and configurations |
| NSX Security Configuration Guide | NSX Manager, transport nodes |
| DISA Security Standards | Defense Information Systems Agency STIG |
| FISMA Security Standards | Federal Information Security Management Act |
| HIPAA | Health Insurance Portability and Accountability Act |
Downloadable (requires .PAK file from VMware Marketplace):
| Framework | Coverage |
|---|---|
| PCI DSS Compliance Standards | Payment Card Industry Data Security Standard |
| CIS Security Standards | Center for Internet Security benchmarks |
| NIST SP 800-171 | Protecting Controlled Unclassified Information |
| NIST SP 800-53 R5 | Security and Privacy Controls |
Navigation: VCF Operations > Security & Compliance > Compliance
Activate VMware SDDC Benchmarks:
Install Marketplace Compliance Packs (for air-gapped environments):
1. Download the .PAK file from the VMware Marketplace on an internet-connected machine
2. Transfer and install the .PAK file in VCF Operations
After activation, the Compliance dashboard shows:
Security Operations Dashboard:
Navigation: VCF Operations > Infrastructure Operations > Dashboards & Reports > Security Operations
This dashboard provides:
When compliance checks identify violations:
| Keystore | Path | Password | Used By |
|---|---|---|---|
| VCF trust store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key | SDDC Manager VCF services |
| Java cacerts | /etc/alternatives/jre/lib/security/cacerts | changeit | Java-based SDDC Manager services |
| VCF Installer Java | $JAVA_HOME/lib/security/cacerts | changeit | VCF Installer LCM service |
Note: When replacing any component certificate with a new self-signed cert, the new cert must be imported into BOTH the VCF trust store AND the Java cacerts keystore. Missing either one causes VDT trust check failures.
List all certificates in a keystore:
keytool -list -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit
List certificates with details (verbose):
keytool -list -v -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit
List a specific certificate by alias:
keytool -list -alias nsx-selfsigned -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit -v
Import a certificate:
# Import into Java cacerts
keytool -importcert -alias <alias-name> -file /tmp/cert.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
# Import into VCF trust store
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias <alias-name> -file /tmp/cert.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Delete a certificate:
# Delete from Java cacerts
keytool -delete -alias <alias-name> \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit
# Delete from VCF trust store
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -delete -alias <alias-name> \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY"
Export a certificate from a keystore:
keytool -exportcert -alias <alias-name> \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit \
-file /tmp/exported-cert.crt -rfc
Check if a specific alias exists:
keytool -list -alias nsx-selfsigned \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit 2>&1 | head -1
# Returns "nsx-selfsigned, ..." if found, or error if not found
Change keystore password:
keytool -storepasswd \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit \
-new <new-password>
Download a remote certificate and import in one step:
# Pull certificate from a remote server
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/remote-cert.crt
# Verify it is the correct certificate
openssl x509 -in /tmp/remote-cert.crt -noout -subject -issuer -dates
# Import into both keystores
keytool -importcert -alias remote-server -file /tmp/remote-cert.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias remote-server -file /tmp/remote-cert.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Find all Java cacerts files on the system:
find / -name "cacerts" -type f 2>/dev/null
Restart services after keystore changes:
# On SDDC Manager
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
# On VCF Installer
systemctl restart lcm
systemctl restart domainmanager
Tip: Always verify changes with VDT after modifying trust stores. Run VDT from SDDC Manager:
cd /home/vcf/vdt-2.2.7_02-05-2026 && python vdt.py
VDT report location:
/var/log/vmware/vcf/vdt/vdt-<timestamp>.txt
The VCF Diagnostic Tool (VDT) is a standalone Python utility that validates the health and configuration of your VCF environment. It is NOT pre-installed on SDDC Manager and must be downloaded separately from Broadcom.
VDT is distributed via Broadcom Knowledge Base article 344917. Navigate to:
https://knowledge.broadcom.com/external/article/344917
Download the latest version. In this lab, the version used is vdt-2.2.7_02-05-2026.zip.
Warning: SCP does not work with SDDC Manager due to the restricted shell on the vcf user. Only the vcf user can SSH in (root and admin are rejected). Use the ssh cat redirect method for file transfer.
Method 1: SSH cat redirect (recommended)
# From your Windows workstation (PowerShell)
ssh vcf@192.168.1.241 "cat > /home/vcf/vdt-2.2.7_02-05-2026.zip" < C:\VCF-Depot\vdt-2.2.7_02-05-2026.zip
Method 2: SCP (if it works in your environment)
scp C:\VCF-Depot\vdt-2.2.7_02-05-2026.zip vcf@192.168.1.241:/home/vcf/
SSH to SDDC Manager as vcf, then extract:
ssh vcf@192.168.1.241
cd /home/vcf
unzip vdt-2.2.7_02-05-2026.zip
ls -la vdt-2.2.7_02-05-2026/
No additional installation is required. VDT is a Python script that runs directly.
cd /home/vcf/vdt-2.2.7_02-05-2026
python vdt.py
VDT will prompt for administrator@vsphere.local password. It then performs a comprehensive validation of the entire VCF stack.
VDT produces a text report and JSON output at:
/var/log/vmware/vcf/vdt/vdt-<timestamp>.txt
/var/log/vmware/vcf/vdt/vdt-<timestamp>.json
Lab VDT Results Summary (vcf-lab, Feb 12 2026):
| Category | Status | Details |
|---|---|---|
| SDDC Manager Info | PASS | Version 9.0.1.0.24962180, hostname sddc-manager.lab.local |
| NTP Service & Server | PASS | 192.168.1.230 responding |
| /etc/hosts | PASS | Properly formatted |
| SDDC Manager Services | PASS | COMMON_SERVICES, LCM, DOMAIN_MANAGER, OPERATIONS_MANAGER, SDDC_MANAGER_UI -- all ACTIVE |
| Commonservices API | PASS | HTTP 200 on localhost |
| Disk Utilization | PASS | Filesystem healthy (space and inodes) |
| Host/Domain/Cluster Status | PASS | All ACTIVE |
| vCenter/PSC/NSX Status | PASS | All ACTIVE |
| SDDC Cert Trust/Expiry/SAN | PASS | 717 days remaining |
| vCenter Cert Trust/Expiry | PASS | 725 days remaining |
| vCenter Cert SAN | WARN | Hostname but not IP in SAN (cosmetic, acceptable for lab) |
| NSX VIP Cert Trust/Expiry/SAN | PASS | Fixed after cert replacement and trust store import |
| NSX Manager Cert Trust/Expiry/SAN | PASS | Fixed after cert replacement and trust store import |
| Deployment/Resource Locks | PASS | No locks detected |
| Changelog Locks | PASS | All 4 DBs (domainmanager, operationsmanager, lcm, platform) |
| Service Account Auth | PASS | No authentication issues |
| NFS Mount Ownership | PASS | Fixed: chown root:vcf /nfs/vmware/vcf/nfs-mount/ |
| Depot Config | PASS | Checks skipped for 9.x+ |
Note: VDT showed "not found" for Aria Lifecycle, Automation, Operations, Logs, and Workspace One. This is expected when these products were deployed manually outside SDDC Manager's Aria inventory.
NFS Mount Ownership: FAIL
# Before: owner was nginx instead of root
ls -la /nfs/vmware/vcf/
# drwxrwxr-x nginx vcf nfs-mount/
# Fix:
chown root:vcf /nfs/vmware/vcf/nfs-mount/
# After: owner is root, group is vcf
# Reference: https://knowledge.broadcom.com/external/article/392923
NSX Certificate SAN: FAIL
The default NSX self-signed certificate uses a wildcard SAN (*.lab.local) without specific hostnames or IPs. VDT reports "SAN contains neither hostname nor IP." See Section 7.5 for the complete NSX certificate replacement procedure.
NSX Certificate Trust: FAIL
After replacing the NSX self-signed certificate, the new root is not in SDDC Manager's keystores. See Section 7.5 for the trust store import procedure.
Service Properties Ownership: FAIL
# Check ownership of service property files
ls -la /opt/vmware/vcf/domainmanager/conf/
ls -la /opt/vmware/vcf/operationsmanager/conf/
# Fix: ensure correct ownership
chown vcf:vcf /opt/vmware/vcf/domainmanager/conf/application-prod.properties
chown vcf:vcf /opt/vmware/vcf/operationsmanager/conf/application-prod.properties
SDDC Manager runs multiple services managed via systemd. Here are the key services and their management commands:
| Service | Purpose | Command |
|---|---|---|
| domainmanager | Domain lifecycle operations | systemctl status domainmanager |
| lcm | Lifecycle management | systemctl status lcm |
| operationsmanager | Operations and monitoring | systemctl status operationsmanager |
| commonsvcs | Shared platform services | systemctl status commonsvcs |
| postgresql | Internal database | systemctl status postgresql |
| nginx | Web server / reverse proxy | systemctl status nginx |
| vcf-services | All VCF services (target) | systemctl status vcf-services |
Check all service statuses:
systemctl status domainmanager
systemctl status lcm
systemctl status operationsmanager
systemctl status commonsvcs
systemctl status postgresql
systemctl status nginx
Restart all VCF services:
systemctl restart vcf-services
# Wait 3-5 minutes for all services to start
systemctl status vcf-services
Restart individual service:
systemctl restart domainmanager
systemctl restart lcm
systemctl restart operationsmanager
Full service restart script (recommended for major changes):
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
# Takes approximately 5 minutes
/var/log/vmware/vcf/
├── domainmanager/
│ ├── domainmanager.log # Main domain manager log
│ └── domainmanager-gc.log # Garbage collection log
├── lcm/
│ ├── lcm.log # Lifecycle management log
│ ├── lcm-debug.log # LCM debug (TLS errors show here)
│ └── upgrade/ # Upgrade-specific logs
├── operationsmanager/
│ ├── operationsmanager.log # Operations manager log
│ └── operationsmanager-gc.log # Garbage collection log
├── sos/
│ └── sos.log # SoS utility log
├── commonsvcs/
│ └── commonsvcs.log # Common services log
├── vdt/
│ └── vdt-<timestamp>.txt # VDT report files
└── sddc-support/
└── sddc-support.log # Support bundle log
Log analysis commands:
# View last 100 lines of domain manager log
tail -100 /var/log/vmware/vcf/domainmanager/domainmanager.log
# Follow log in real-time
tail -f /var/log/vmware/vcf/domainmanager/domainmanager.log
# Search for errors across all VCF logs
grep -ri "error\|exception\|failed" /var/log/vmware/vcf/domainmanager/domainmanager.log | tail -50
# Search for specific time period
grep "2026-02-12 14:" /var/log/vmware/vcf/domainmanager/domainmanager.log
# Count error occurrences
grep -c "error" /var/log/vmware/vcf/domainmanager/domainmanager.log
# Search for LCM TLS errors
grep -i "tlsfatal\|ssl\|certificate" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20
Problem: SDDC Manager deployment via VCF Installer enters a timeout loop in nested VMware Workstation environments. The installer waits for SDDC Manager to become responsive, but the appliance takes too long to boot and initialize services on resource-constrained nested hosts.
Symptoms:
Solution: Bypass the VCF Installer for SDDC Manager deployment. Deploy SDDC Manager manually using ovftool with a single-line command (backslash continuation breaks --noSSLVerify).
# Single-line ovftool command (do NOT use backslash line continuation)
ovftool --acceptAllEulas --noSSLVerify --allowExtraConfig --diskMode=thin --powerOn --name=sddc-manager --ipProtocol=IPv4 --ipAllocationPolicy=fixedPolicy --prop:BACKUP_PASSWORD=Success01!0909!! --prop:ROOT_PASSWORD=Success01!0909!! --prop:VCF_PASSWORD=Success01!0909!! --prop:BASIC_AUTH_PASSWORD=Success01!0909!! --prop:vami.hostname=sddc-manager.lab.local --prop:vami.ip0.SDDC-Manager-Appliance=192.168.1.241 --prop:vami.netmask0.SDDC-Manager-Appliance=255.255.255.0 --prop:vami.gateway.SDDC-Manager-Appliance=192.168.1.1 --prop:vami.DNS.SDDC-Manager-Appliance=192.168.1.230 --prop:vami.domain.SDDC-Manager-Appliance=lab.local --prop:vami.searchpath.SDDC-Manager-Appliance=lab.local --prop:vami.NTP.SDDC-Manager-Appliance=192.168.1.230 --datastore=esxi01-local --network="VM Network" vi://root:Success01!0909!!@192.168.1.74 /path/to/sddc-manager.ova
Key lesson: ovftool on the VCF Installer must use single-line commands. Backslash continuation breaks --noSSLVerify and other flags.
VDT may report NFS mount ownership failures when the mount point owner is incorrect.
# Check NFS mount ownership
ls -la /nfs/vmware/vcf/
# Expected: root:vcf ownership on nfs-mount/
# If showing nginx:vcf, fix with:
chown root:vcf /nfs/vmware/vcf/nfs-mount/
# Verify NFS subdirectories exist
ls -la /nfs/vmware/vcf/nfs-mount/
# Should contain: bundle/, depot/, depot/local/
Critical: Only the vcf user can SSH to SDDC Manager. The root and admin users are rejected at the SSH level.
# SSH to SDDC Manager
ssh vcf@192.168.1.241
# Get root access from vcf session
su -
# File transfer workaround (SCP does not work due to restricted shell)
ssh vcf@192.168.1.241 "cat > /home/vcf/myfile.zip" < localfile.zip
# Transfer file FROM SDDC Manager
ssh vcf@192.168.1.241 "cat /path/to/file" > local_copy
Account lockout (faillock):
SDDC Manager uses faillock (not pam_tally2) to lock accounts after failed SSH attempts. Automated scripts with wrong passwords can quickly lock the vcf account.
# From SDDC Manager console as root:
# Check lockout status
faillock --user vcf
# Unlock the vcf account
faillock --user vcf --reset
# Unlock root (if also locked)
faillock --user root --reset
If locked out of ALL accounts (root, vcf, admin): Boot into single-user mode via GRUB — reboot the VM, press e at the GRUB menu, append init=/bin/bash to the linux line, press Ctrl+X. Then: mount -o remount,rw / → faillock --user root --reset → faillock --user vcf --reset → reboot -f
PostgreSQL overview:
SDDC Manager uses PostgreSQL 15 with data directory /data/pgdata. It listens on TCP 127.0.0.1 only (not Unix sockets — you'll get "No such file or directory" without -h 127.0.0.1). Authentication uses scram-sha-256.
psql pager trap: When running psql queries via Paramiko or remote shell, the default pager (less/more) captures output and waits for interactive input, corrupting the session. Always set PAGER=cat before running psql commands, or pass it inline: PAGER=cat psql -h 127.0.0.1 .... For Paramiko invoke_shell(), also set height=1000 to prevent terminal-based paging.
# Check PostgreSQL status
systemctl status postgresql
# List all databases
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -l"
# Check database disk usage
df -h /
Key databases and tables:
| Database | Key Tables | Key Columns | Purpose |
|---|---|---|---|
| platform | nsxt | id, status | NSX cluster resource status (ACTIVE/ACTIVATING/ERROR) |
| platform | lock | resource/lock columns | Exclusive operation locks |
| platform | task_metadata | resolved (boolean) | Task resolution tracking |
| platform | task_lock | task-to-lock associations | Task-lock relationships |
| operationsmanager | task | state (NOT status) | Operation tasks |
| operationsmanager | execution | execution_status (NOT status) | Execution tracking |
| operationsmanager | processing_task | status | Active processing queue |
| operationsmanager | execution_to_task | mapping columns | Execution-task relationships |
| domainmanager | domain-related tables | — | Domain lifecycle state |
Key discovery: The API cannot cancel stuck tasks — PATCH returns TA_TASK_CAN_NOT_BE_RETRIED and DELETE returns HTTP 500. Database repair is the only option for cascade failures.
Accessing PostgreSQL (trust auth workaround):
The PostgreSQL password is not easily discoverable in configuration files. The workaround is to temporarily set trust authentication:
# SSH as vcf, then su - to root
# Back up pg_hba.conf (CRITICAL)
cp /data/pgdata/pg_hba.conf /data/pgdata/pg_hba.conf.bak
# Temporarily allow passwordless local connections
sed -i 's/scram-sha-256/trust/g' /data/pgdata/pg_hba.conf
# Reload postgres (no restart needed)
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Disable psql pager (CRITICAL for scripted/remote sessions)
export PAGER=cat
export PGPAGER=cat
# Now you can connect without a password
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform"
CRITICAL: Always restore pg_hba.conf immediately after making changes:
cp /data/pgdata/pg_hba.conf.bak /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
Credential Cascade Failure — Full Diagnosis & 6-Step Repair
Symptoms:
"Resources [nsx-vip.lab.local] are not available/ready" or "not in ACTIVE state""Unable to acquire resource level lock(s)""[2] account(s) has been disconnected"/v1/nsxt-clusters shows empty or non-ACTIVE status/v1/tasks?status=IN_PROGRESS)Root Cause Chain: A failed credential operation (often due to NSX being temporarily unreachable during a boot storm) triggers a cascade:
ACTIVATING or ERROR state in platform.nsxt tableplatform.lock table, blocking all new operationsIN_PROGRESS in platform.task_metadata (resolved=false), piling upDiagnosis:
# 1. Get auth token from SDDC Manager
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")
# 2. Check NSX cluster resource state (look for status field)
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# If status is "ACTIVATING" or "ERROR" instead of "ACTIVE" → this is the problem
# 3. Check for stale resource locks
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# 4. Check for stuck IN_PROGRESS tasks
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; d=json.load(sys.stdin); print(f'Stuck tasks: {len(d.get(\"elements\",[]))}')"
# 5. Verify NSX is actually healthy (from SDDC Manager)
curl -sk -u admin:'Success01!0909!!' --connect-timeout 10 \
https://nsx-vip.lab.local/api/v1/cluster/status
# overall_status should be "STABLE"
Fix — Full 6-Step Database Repair:
WARNING: Direct database manipulation is unsupported and should only be done in lab environments. Always back up before modifying.
Step 1: Access PostgreSQL on SDDC Manager
SSH as vcf, then su - to root. Enable trust auth (see above), then set pager:
cp /data/pgdata/pg_hba.conf /data/pgdata/pg_hba.conf.bak
sed -i 's/scram-sha-256/trust/g' /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
export PAGER=cat
Step 2: Fix the stuck resource status
The nsxt table status can be ACTIVATING, ERROR, or other non-ACTIVE values:
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -t -c \"SELECT id, status FROM nsxt;\""
# Fix ANY non-ACTIVE status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';\""
Step 3: Clear stale resource locks
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT count(*) FROM lock;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM lock;\""
Step 4: Mark stuck tasks as resolved
The task_metadata table in the platform DB tracks task resolution state. Unresolved tasks (resolved=false) from failed operations accumulate and can interfere with new operations:
# Check unresolved task count
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT resolved, count(*) FROM task_metadata GROUP BY resolved;\""
# Mark all unresolved tasks as resolved
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE task_metadata SET resolved = true WHERE resolved = false;\""
# Clear task_lock table if any entries exist
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM task_lock;\""
Step 5: Restore pg_hba.conf (CRITICAL — do not skip)
cp /data/pgdata/pg_hba.conf.bak /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Verify it's back to scram-sha-256
grep -c 'scram-sha-256' /data/pgdata/pg_hba.conf
# Should return 4 or more
Step 6: Restart operationsmanager service
systemctl restart operationsmanager
# Wait 2-3 minutes for it to fully start
systemctl is-active operationsmanager
Verification:
# NSX cluster should now show ACTIVE
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; [print(f'{c[\"id\"]}: {c[\"status\"]}') for c in json.load(sys.stdin).get('elements',[])]"
# Resource locks should be empty
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN"
# IN_PROGRESS tasks should be zero or minimal
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; print(f'IN_PROGRESS: {len(json.load(sys.stdin).get(\"elements\",[]))}')"
# Credential remediate should now succeed via VCF Operations Fleet Management UI
Credential Cascade Failure Flowchart:
┌──────────────────────────────────────────────┐
│ Credential Update/Rotate/Remediate fails │
│ in SDDC Manager or VCF Operations UI │
└──────────────────┬───────────────────────────┘
│
┌────────▼────────┐
│ Check task error │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
"not in "Unable to "503 Service
ACTIVE state" acquire lock" Unavailable"
│ │ │
▼ ▼ ▼
Fix nsxt Delete from NSX still
table status lock table booting/
(ACTIVATING/ in platform unstable
ERROR→ACTIVE) DB │
│ │ ▼
│ │ Wait for
│ │ NSX load
│ │ to settle
│ │ (< 20)
└──────┬───────┘ │
▼ │
Mark task_metadata │
resolved = true ◄──────┘
│
▼
Clear task_lock
│
▼
Restore pg_hba.conf
│
▼
Restart
operationsmanager
│
▼
Retry credential
operation
Key insight: Three tables in the platform database must be cleaned: (1) nsxt — resource status, (2) lock — operation locks, (3) task_metadata — task resolution tracking (+ task_lock). The operationsmanager database has separate task and execution tables (columns: task.state, execution.execution_status — NOT status). The API won't let you cancel or delete stuck tasks — database repair is required.
General database troubleshooting:
# If database connection fails:
# 1. Check PostgreSQL logs
tail -100 /var/log/postgresql/postgresql-*.log
# 2. Restart PostgreSQL
systemctl restart postgresql
# 3. Wait 2 minutes, then restart VCF services
sleep 120
systemctl restart vcf-services
Quick SQL reference (for experienced users):
-- Connect: su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform"
-- Fix NSX status (covers ACTIVATING and ERROR)
UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';
-- Clear stale locks
DELETE FROM lock;
-- Resolve stuck tasks
UPDATE task_metadata SET resolved = true WHERE resolved = false;
DELETE FROM task_lock;
Why each repair step is needed:
| Step | Table | Action | Why |
|---|---|---|---|
| 2 | nsxt | Set status to ACTIVE | Stuck ACTIVATING/ERROR makes every new operation fail at prevalidation |
| 3 | lock | Delete all rows | Stale exclusive locks block all new operations ("Unable to acquire resource level lock(s)") |
| 4 | task_metadata | Set resolved=true | Unresolved tasks accumulate with each UI retry (47 found during initial diagnosis) |
| 4 | task_lock | Delete all rows | Orphaned task-lock relationships must be cleared |
| 5 | pg_hba.conf | Restore backup | Trust auth is a security risk — restore immediately |
| 6 | operationsmanager | Restart service | Service caches DB state in memory — restart forces re-read of cleaned tables |
Steps 2-4 must all be done in one session — fixing just the status without clearing locks still fails, and vice versa. All three tables participate in the prevalidation check. The trust auth window should be as short as possible.
Schema discovery notes: None of this is documented by Broadcom. The schema was mapped by exploring databases with \l, listing tables with \dt, and querying information_schema.columns. Key discoveries: task_metadata uses resolved boolean (not a status field), operationsmanager.task uses column state (not status), and execution uses execution_status (not status). Early script versions failed because of these naming differences. The API's PATCH /v1/tasks/{id} returns TA_TASK_CAN_NOT_BE_RETRIED and DELETE returns HTTP 500 — database repair is the only option.
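For reference, the same schema-discovery queries can be reproduced from the SDDC Manager shell (table name task is just the example called out above; requires the trust-auth window or valid credentials):
# List tables in the platform DB
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c '\dt'"
# Discover column names of operationsmanager.task via information_schema
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d operationsmanager -c \"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'task';\""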
# Get authentication token
curl -k -X POST https://localhost/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"admin@local","password":"Success01!0909!!"}'
# Check task status via API
curl -k -H "Authorization: Bearer <access-token>" \
https://localhost/v1/tasks/<task-id>
# Cancel a stuck task via API (cascade-failure tasks return TA_TASK_CAN_NOT_BE_RETRIED -- see Section 7.2.6)
curl -k -X PATCH https://localhost/v1/tasks/<task-id> \
-H "Authorization: Bearer <access-token>" \
-H "Content-Type: application/json" \
-d '{"status":"CANCELLED"}'
# Check VCF health via API
curl -k -H "Authorization: Bearer <access-token>" \
https://localhost/v1/system/health
SDDC Manager includes the SoS (Supportability and Serviceability) utility for comprehensive log collection:
# SSH to SDDC Manager as vcf, then su - to root
ssh vcf@192.168.1.241
su -
# Navigate to SoS directory
cd /opt/vmware/sddc-support/
# Generate log bundle for the management domain
./sos --domain-name mgmt --log-bundle
# Generate with health check included
./sos --domain-name mgmt --log-bundle --health-check
# Include free (unassigned) hosts
./sos --domain-name mgmt --log-bundle --include-free-hosts
# Bundle output location:
# /var/log/vmware/vcf/sddc-support/sos-<timestamp>.tar.gz
# Transfer logs to Broadcom support (VCF 9)
./sos --log-assist --sr-number <support-request-number>
Symptoms:
Diagnostic commands (SSH to vCenter VM):
# Check current deployment status
cat /var/log/firstboot/firstbootStatus.json
# Check for running processes
ps aux | grep -E "install|firstboot|postgres|vpxd"
# Check disk I/O (should show activity)
vmstat 1 5
# Check memory usage
free -h
# Check for error logs
tail -50 /var/log/vmware/firstboot/installer.log
grep -i "error\|fail\|exception" /var/log/vmware/firstboot/*.log
Monitoring deployment progress from VCF Installer:
# Find the latest ci-installer log directory
ls -lt /var/log/vmware/vcf/domainmanager/ | head -5
# Watch the installation log
tail -f /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
# Search for errors
grep -i "error\|failed\|exception" /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
Expected deployment stages:
If deployment is stuck at "Installing Containers" (60%), check PostgreSQL:
# Check if postgres service exists
ls -la /storage/db/vpostgres/
# Check for postgres config file
ls -la /storage/db/vpostgres/postgresql.conf
# Check postgres user/group
grep postgres /etc/passwd
grep postgres /etc/group
# Check postgres logs
tail -50 /var/log/vmware/vpostgres/*.log
Warning: If PostgreSQL never initialized (missing postgresql.conf and missing postgres user), the database initialization failed. This is typically unrecoverable and requires full redeployment.
Post-deployment PostgreSQL health check:
# Check database service
service-control --status vmware-vpostgres
# Check database connections
/opt/vmware/vpostgres/current/bin/psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# If database is unhealthy:
service-control --restart vmware-vpostgres
# Wait 5 minutes, then restart vpxd:
service-control --restart vpxd
Check all vCenter services:
# List all services with status
service-control --status --all
# Alternative: use vmon-cli
vmon-cli --list
# Check specific service
vmon-cli --status vpxd
service-control --status vpxd
Expected healthy services (all should show RUNNING):
| Service | Purpose |
|---|---|
| vpxd | Core vCenter daemon |
| vsphere-ui | vSphere Client web interface |
| vmware-vpostgres | Embedded PostgreSQL database |
| rhttpproxy | Reverse proxy |
| lookupsvc | Lookup service (SSO) |
| sts | Security Token Service |
| vlcm | vSphere Lifecycle Manager |
| content-library | Content Library |
| eam | ESX Agent Manager |
Restart a specific service:
service-control --restart vpxd
# Wait 2-3 minutes for service to start
service-control --status vpxd
Restart all services (causes brief outage):
service-control --restart --all
# Wait 10-15 minutes for all services to start
service-control --status --all
# Check vpxd status
service-control --status vpxd
# Review vpxd logs
tail -100 /var/log/vmware/vpxd/vpxd.log
# Search for vpxd errors
grep -i "error\|exception\|failed" /var/log/vmware/vpxd/vpxd.log | tail -50
# Check vSphere Client logs
tail -100 /var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log
# Restart vpxd
service-control --restart vpxd
When vCenter deployment fails, VCF provides a reference token. To find detailed errors:
# Search for reference token in logs (example token: 3OHCKD)
grep -r "3OHCKD" /var/log/vmware/vcf/
grep -B20 -A20 "3OHCKD" /var/log/vmware/vcf/domainmanager/*.log
See Section 7.7 for the complete failed deployment recovery procedure.
Problem: The vhv.enable setting can persist in a VM's runtime DICT (vmware.log) even when it is not present in the VMX file. This causes vMotion to fail with:
Migration failed after VM memory precopy. Configuration mismatch:
The virtual machine cannot be restored because the snapshot was taken with VHV enabled.
Root cause (lab-tested): The vCenter UI showed "Expose hardware assisted virtualization" unchecked, and the VMX file had no vhv.enable entry. However, the VM runtime logs revealed vhv.enable = "TRUE" inherited from the original deployment environment.
Diagnostic steps:
# SSH to the ESXi host running the VM
ssh root@192.168.1.74
# Search VM logs for vhv references
grep -i vhv /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/*
# Check the VMX file directly
grep -i vhv /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmx
Fix: Add an explicit vhv.enable = "FALSE" to the VMX file, even if the setting does not currently appear:
# Power off the VM first, then:
echo 'vhv.enable = "FALSE"' >> /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmx
# Power the VM back on
Key lesson: The absence of vhv.enable in the VMX file does NOT mean it is disabled. The setting can persist in runtime/logs from a previous environment. Always add an explicit vhv.enable = "FALSE" to fix vMotion failures related to VHV mismatch.
Problem: Hot vMotion fails in nested VMware Workstation environments because memory convergence cannot complete within the timeout window.
Error message:
Migration was canceled because the amount of changing memory was greater
than the available network bandwidth
Root cause: Nested environments have limited network throughput and higher memory change rates, making it difficult for vMotion to converge the memory state between source and destination hosts.
Workarounds:
# Cold migration procedure:
1. Power off the VM (graceful shutdown)
2. Right-click VM in vCenter -> Migrate
3. Select "Change both compute resource and storage"
4. Select destination host and datastore
5. Complete the migration
6. Power the VM back on
In the lab, SDDC Manager was successfully relocated from esxi01 to esxi03 using cold migration after hot vMotion failed.
Problem: DRS cannot migrate VMs between hosts with different CPU generations.
Diagnostic steps:
# Check CPU model on each host (from vCenter or ESXi SSH)
esxcli hardware cpu global get
# Check EVC status on cluster
# In vSphere Client: Cluster -> Configure -> VMware EVC
EVC mode hierarchy (Intel):
Newest -> Intel "Cascade Lake" Generation
Intel "Skylake" Generation
Intel "Broadwell" Generation
Intel "Haswell" Generation
Intel "Ivy Bridge" Generation
Oldest -> Intel "Sandy Bridge" Generation
EVC mode must be set to the lowest CPU generation in the cluster. All VMs may need to be powered off before changing EVC mode.
# Check vMotion VMkernel adapter exists
esxcfg-vmknic -l | grep -i vmotion
# Test vMotion network connectivity between hosts
vmkping -I vmk1 192.168.100.11
# Check vMotion is enabled on the VMkernel adapter
esxcli network ip interface tag get -i vmk1
# Verify MTU settings (1500 for nested, do NOT use 9000)
esxcfg-vmknic -l
# Check vMotion port (TCP 8000) connectivity
nc -z 192.168.100.11 8000
| Network | VLAN | Subnet | Gateway | MTU |
|---|---|---|---|---|
| vMotion | 100 | 192.168.100.0/24 | 192.168.100.1 | 1500 |
Warning: Do NOT use jumbo frames (MTU 9000) in nested VMware Workstation environments. Use MTU 1500 for all networks.
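To prove the effective MTU on the vMotion path, send a don't-fragment vmkping sized for a 1500-byte MTU (1472 bytes of ICMP payload plus 28 bytes of headers); the interface and target IP below follow the lab addressing used above:
# -d = don't fragment, -s = payload size; 1472 + 28 header bytes = 1500
vmkping -I vmk1 -d -s 1472 192.168.100.11
# If this fails while a plain vmkping succeeds, there is an MTU mismatch somewhere on the path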
Problem: NSX Manager deployed with the small option (16GB RAM) crashes with kernel OOM (Out of Memory) in nested environments. Console shows repeated sysrq: Show Memory messages.
Impact: All NSX-related validation checks in VCF Installer fail, and services cannot stabilize.
Sizing requirements for nested environments:
| RAM | vCPU | Result |
|---|---|---|
| 16GB | 4 | Kernel OOM, constant crashes |
| 24GB | 4 | Runs, but MANAGER/SEARCH services crash under load (transport node config) |
| 32GB | 6 | Required for stable operation with 4-host cluster |
Resolution:
# Power off NSX Manager VM
# In vCenter: right-click NSX Manager VM -> Power -> Shut Down Guest OS
# Edit VM settings:
# - Memory: 32 GB
# - CPU: 6 vCPU
# Power on NSX Manager VM
# Wait 10-15 minutes for all services to stabilize
Key lesson: Many VCF Installer validation errors are cascading failures from an unhealthy NSX. Fix NSX health first before troubleshooting other validation failures.
Symptoms:
Diagnostic commands on ESXi host:
# Check NSX proxy agent status
/etc/init.d/nsx-proxy status
# Start NSX proxy if not running
/etc/init.d/nsx-proxy start
# Check NSX datapath status
/etc/init.d/nsx-datapath status
# Check connectivity to NSX Manager (port 1234)
esxcli network ip connection list | grep 1234
# Review NSX agent logs
tail -50 /var/log/nsx-syslog.log
# Find TEP VMkernel adapter
esxcfg-vmknic -l | grep -i tep
# Test TEP-to-TEP connectivity
vmkping <other-host-tep-ip>
Transport node recovery steps:
In the lab, transport node configuration initially failed when NSX had only 24GB RAM. After increasing to 32GB/6vCPU:
1. Removed failed profile from cluster
2. Restarted management network on all hosts
3. Re-applied tn-profile-mgmt
4. All 4 hosts configured successfully -- vmk0 used as TEP
Force resync from NSX Manager UI:
1. Navigate to System -> Fabric -> Nodes -> Host Transport Nodes
2. Click on the problematic host
3. Click Actions -> Redeploy Node
4. Wait 5-10 minutes for resync
NSX certificate issues are the most common VDT failures. Two types of problems occur:
Problem 1: SAN Missing Hostnames/IPs
The default NSX self-signed certificate uses a wildcard SAN (*.lab.local) without specific hostnames or IPs. VDT reports "SAN contains neither hostname nor IP."
Step 1: Create OpenSSL config on NSX Manager (SSH as root):
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
Critical: DNS.3 = nsx-manager.lab.local is required because SDDC Manager registers NSX using this FQDN. Without it, VDT reports "SAN contains IP but not hostname."
Step 2: Generate certificate and build JSON payload:
# Generate cert (single-line, no backslash continuation)
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout /tmp/nsx.key -out /tmp/nsx.crt -config /tmp/nsx-cert.conf -sha256
# Verify SAN entries
openssl x509 -in /tmp/nsx.crt -text -noout | grep -A4 "Subject Alternative Name"
# Build JSON payload using Python (avoids shell PEM escaping issues)
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
Warning: NSX shell does NOT support backslash line continuation. All curl commands must be single-line. Use Python to build JSON payloads containing PEM data.
Step 3: Import and apply certificate via NSX API:
# Import cert (single-line)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Note the certificate ID from response (e.g., 701d1416-5054-4038-8749-4ac495980ebd)
# Get node UUID
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
# Note the node UUID (e.g., 95493642-ef4a-cb8e-ed7c-5bc20033f2c2)
# Apply to node (API service)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/701d1416-5054-4038-8749-4ac495980ebd?action=apply_certificate&service_type=API&node_id=95493642-ef4a-cb8e-ed7c-5bc20033f2c2"
# Apply to VIP (MGMT_CLUSTER)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/701d1416-5054-4038-8749-4ac495980ebd?action=apply_certificate&service_type=MGMT_CLUSTER"
# Verify on both endpoints
openssl s_client -connect 192.168.1.71:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A2 "Subject Alternative Name"
openssl s_client -connect 192.168.1.70:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A2 "Subject Alternative Name"
Prerequisite: All NSX services must be healthy (MANAGER, SEARCH, UI, NODE_MGMT all UP). If services are DOWN, the API returns error 101. Wait 10-15 minutes after NSX restart in nested environments.
Problem 2: Certificate Trust Failure
After replacing the NSX certificate, VDT reports "NSX VIP Cert Trust: FAIL" because the new self-signed cert root is not in SDDC Manager's keystores.
Step 1: Pull the NSX certificate (SSH to SDDC Manager as root):
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Verify it is the correct cert
openssl x509 -in /tmp/nsx-root.crt -noout -text | grep -A2 "Subject Alternative Name"
Step 2: Import into VCF trust store:
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Step 3: Import into Java cacerts:
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
Step 4: Restart SDDC Manager services:
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
# Wait ~5 minutes, then re-run VDT
Key trust store paths:
| Item | Path/Value |
|---|---|
| VCF trust store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store |
| VCF trust store password | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key |
| Java cacerts | /etc/alternatives/jre/lib/security/cacerts |
| Java cacerts password | changeit |
| Service restart script | /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh |
Reference: https://knowledge.broadcom.com/external/article/316056
# SSH to NSX Manager as admin
ssh admin@192.168.1.71
# Check cluster status
get cluster status
# Check all service status (from root shell)
/etc/init.d/proton-manager status
/etc/init.d/corfu_server status
# Check NSX API health
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
NSX Manager critical services:
| Service | Purpose |
|---|---|
| MANAGER | NSX Management plane |
| SEARCH | Search/indexing service |
| UI | NSX Manager web interface |
| NODE_MGMT | Node management |
| proton | Core NSX engine |
| corfu | Distributed datastore |
For single-node NSX deployments (common in nested labs):
# Check cluster health via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
# DNS/NTP configured via admin CLI (NOT the UI)
ssh admin@192.168.1.71
set name-servers 192.168.1.230
set ntp-servers 192.168.1.230
get name-servers
get ntp-servers
1. Log in to NSX Manager: https://nsx-vip.lab.local
2. Navigate to Plan & Troubleshoot -> Traffic Analysis -> Traceflow
3. Configure source VM and destination VM/IP
4. Select protocol (ICMP, TCP, UDP)
5. Click "Trace" and review results:
- Green line = packet delivered successfully
- Red X = packet dropped (shows WHERE and by which rule)
- Yellow triangle = packet received but not forwarded
Problem: VCF 9.0.1 uses BouncyCastle FIPS TLS implementation which has strict certificate validation. Connection to offline depot with self-signed certificate fails.
Symptoms:
Secure protocol communication error, check logs for more details
LCM debug logs show:
org.bouncycastle.tls.TlsFatalAlert caught when processing request to {s}->https://192.168.1.160:8443
Diagnostic commands on VCF Installer / SDDC Manager:
# Test SSL connectivity
openssl s_client -connect 192.168.1.160:8443
# Test with TLS 1.2 specifically
openssl s_client -connect 192.168.1.160:8443 -tls1_2
# Check cipher negotiation
openssl s_client -connect 192.168.1.160:8443 -tls1_2 </dev/null 2>&1 | grep -E "Cipher|Protocol|Verify"
# View certificate details
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -text -noout
# Get certificate fingerprint
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
Fix: Import the depot certificate into the Java truststore:
# Download certificate from depot server
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/depot.crt
# Verify certificate was downloaded
cat /tmp/depot.crt
# Find Java truststore
echo $JAVA_HOME
# Output: /usr/lib/jvm/openjdk-java17-headless.x86_64
# Delete old certificate if exists
keytool -delete -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Import new certificate
keytool -import -trustcacerts -alias offline-depot -file /tmp/depot.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
# Verify import
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Restart LCM service
systemctl restart lcm
# Wait 2 minutes, verify LCM is ready
systemctl status lcm
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log | grep -i "started\|ready"
Problem: SDDC Manager requests files that do not exist in the depot structure.
Symptoms in HTTPS server log:
192.168.1.125 - "HEAD /PROD/COMP/VCENTER/VMware-VCSA-all-9.0.1.0.24957454.iso HTTP/1.1" 404 -
Fix: Check the HTTPS server logs to identify the exact path requested. Place the file at the correct location:
C:\VCF-Depot\PROD\COMP\<COMPONENT>\<filename>
Reference: Broadcom KB 413848
Problem: "Product Version Catalog (PVC) does not exist"
Cause: The productVersionCatalog.json was not extracted from the official vcf-9.0.1.0-offline-depot-metadata.zip, or the LCM-specific copy is missing.
Fix:
1. Extract metadata from the official zip file
2. Copy productVersionCatalog.json to:
PROD\COMP\SDDC_MANAGER_VCF\lcm\productVersionCatalog\
# Verify the depot server certificate matches what is in the truststore
# Get server certificate fingerprint
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
# Get truststore certificate fingerprint
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# If fingerprints don't match, re-import the correct certificate
keytool -delete -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
keytool -import -trustcacerts -alias offline-depot -file /tmp/depot.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
systemctl restart lcm
The offline depot uses a Python HTTPS server on the Windows host at 192.168.1.160:8443.
Starting the server:
cd C:\VCF-DEPOT
python https_server.py
Generating certificates (if needed):
cd C:\VCF-DEPOT
python generate_cert.py
# Then start the server
python https_server.py
Certificate requirements for FIPS compliance:
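As a hedged sketch (parameters mirror the certificate style used elsewhere in this guide: RSA 2048, SHA-256 signature, SAN containing the depot IP), a depot certificate can also be generated directly with openssl instead of generate_cert.py:
# Sketch only: self-signed depot certificate with SAN=IP of the depot host
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout depot.key -out depot.crt -sha256 -subj "/CN=192.168.1.160" -addext "subjectAltName=IP:192.168.1.160"
# Verify the SAN before pointing VCF at the depot
openssl x509 -in depot.crt -noout -text | grep -A1 "Subject Alternative Name"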
Monitoring depot requests:
Watch the HTTPS server console window during depot operations. Successful requests show 200 status codes. Any 404 indicates a file SDDC Manager expects but cannot find.
If SDDC Manager reports "Product Version Catalog does not exist":
1. Verify productVersionCatalog.json exists at: C:\VCF-Depot\PROD\COMP\SDDC_MANAGER_VCF\lcm\productVersionCatalog\productVersionCatalog.json
2. Confirm the file is reachable through the depot server:
curl -k -u admin:admin https://192.168.1.160:8443/PROD/COMP/SDDC_MANAGER_VCF/lcm/productVersionCatalog/productVersionCatalog.json
Database Corruption:
# 1. Stop VCF services
systemctl stop vcf-services
# 2. Check disk space
df -h
# 3. Check memory
free -m
# 4. Restore PostgreSQL from backup (backup location varies)
# Consult your backup documentation for restore procedure
# 5. Restart services
systemctl start vcf-services
# 6. Verify services are running
systemctl status vcf-services
Service Won't Start:
# 1. Check specific service logs
tail -100 /var/log/vmware/vcf/<service>/<service>.log
# 2. Check disk space (services fail if disk is full)
df -h
# 3. Check memory
free -m
# 4. Restart individual service
systemctl restart <service-name>
# 5. If still failing, restart all services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
SDDC Manager UI Inaccessible:
# 1. Verify VM is powered on (check via vCenter or ESXi)
# 2. Verify network connectivity
ping 192.168.1.241
# 3. SSH as vcf user
ssh vcf@192.168.1.241
su -
# 4. Check Nginx
systemctl status nginx
nginx -t
systemctl restart nginx
# 5. Check all VCF services
systemctl status vcf-services
# 6. Restart all services if needed
systemctl restart vcf-services
# Wait 3-5 minutes
From VAMI Backup:
After the restore completes, verify all services with service-control --status --all.
Service Recovery (no backup needed):
# SSH to vCenter
ssh root@vcenter.lab.local
# Check all services
service-control --status --all
# Restart a single failed service
service-control --restart <service-name>
# Or restart all services (causes outage)
service-control --restart --all
# Wait 10-15 minutes
Single Node Failure (3-node cluster):
Single Node Recovery (lab with 1 node):
# Check NSX services
ssh admin@192.168.1.71
get cluster status
# If services are unhealthy, restart NSX Manager VM
# Power off, wait 30 seconds, power on
# Wait 10-15 minutes for all services to stabilize
# Verify via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
Complete Cluster Recovery:
VCF does NOT provide a rollback mechanism for failed management domain deployments. A failed deployment requires manual cleanup:
Step 1: Delete Failed vCenter VM
# From the ESXi host running the vCenter VM
vim-cmd vmsvc/getallvms
# Find the vCenter VM ID (look for vcenter.lab.local)
# Power off if running
vim-cmd vmsvc/power.off <vmid>
# Unregister the VM
vim-cmd vmsvc/unregister <vmid>
# Delete VM files from datastore (if needed)
rm -rf /vmfs/volumes/<datastore>/vcenter.lab.local/
Step 2: Clean Up VDS (Distributed Switch)
# List current distributed switches
esxcli network vswitch dvs vmware list
# Remove VMkernel ports from VDS
esxcli network ip interface remove -i vmk1 # vMotion
esxcli network ip interface remove -i vmk2 # vSAN
Step 3: Clean Up vSAN Configuration (run on EACH ESXi host)
# List current vSAN storage
esxcli vsan storage list
# Remove vSAN disk groups
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
# Delete partitions from cache disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Delete partitions from capacity disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2
# Verify disks are now eligible
vdq -q
Common error: If you see "cache disk/s are in an invalid state...available size is 0.0 GB", the disks still have partitions. Use partedUtil to delete them.
Step 4: Verify Hosts Are Ready
# On each ESXi host, verify:
esxcli system hostname get
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
vim-cmd hostsvc/runtimeinfo | grep ssh
vdq -q
esxcli network vswitch dvs vmware list
Step 5: Remove Depot Connection in VCF UI
Step 6: Restart VCF Services
systemctl restart lcm
systemctl restart domainmanager
sleep 120
systemctl status lcm
systemctl status domainmanager
Step 7: Retry Deployment
Disconnected from vCenter:
# SSH to the host
ssh root@<esxi-host-ip>
# Check vpxa agent (vCenter agent)
/etc/init.d/vpxa status
# Restart vpxa
/etc/init.d/vpxa restart
# Restart all management agents
services.sh restart
# If still disconnected, force reconnect from vCenter UI:
# Right-click host -> Connection -> Disconnect
# Wait 30 seconds
# Right-click host -> Connection -> Connect
Rebuilding Host:
vim-cmd hostsvc/enable_ssh && vim-cmd hostsvc/start_ssh
| Component | Backup Method | Frequency |
|---|---|---|
| SDDC Manager | VM snapshot + PostgreSQL dump | Before any upgrade |
| vCenter | VAMI file-based backup (NFS/SFTP) | Daily |
| NSX Manager | NSX built-in backup to remote store | Daily |
| ESXi Configuration | Host profile / auto-backup.sh | After changes |
ESXi auto-backup:
/sbin/auto-backup.sh
vCenter backup configuration:
1. Open VAMI: https://vcenter.lab.local:5480
2. Navigate to Backup
3. Configure backup schedule (protocol, location, credentials)
4. Schedule: Daily recommended
START: VCF Deployment Failed
|
+---> Note reference token from error message
| +---> Search logs: grep -r "TOKEN" /var/log/vmware/vcf/
|
+---> Delete failed vCenter VM
| +---> vim-cmd vmsvc/getallvms
| +---> vim-cmd vmsvc/power.off <vmid>
| +---> vim-cmd vmsvc/unregister <vmid>
|
+---> Clean up vSAN on EACH host
| +---> esxcli vsan storage remove -d <device>
| +---> partedUtil delete ... (both partitions)
| +---> vdq -q (verify eligible)
|
+---> Clean up VDS (if configured)
| +---> esxcli network ip interface remove ...
|
+---> Remove depot connection in VCF UI
| +---> Re-add with certificate
|
+---> Verify SSH enabled on all hosts
| +---> vim-cmd hostsvc/enable_ssh
|
+---> Retry deployment
START: VDT reports NSX cert FAIL (Trust or SAN)
|
+---> Check which check failed
| +---> SAN FAIL: Certificate missing hostnames/IPs
| +---> Trust FAIL: Certificate root not in SDDC Manager keystores
|
+---> If SAN FAIL:
| +---> SSH to NSX Manager as root
| +---> Create OpenSSL config with all SANs:
| | DNS.1 = nsx-vip.lab.local
| | DNS.2 = nsx-node1.lab.local
| | DNS.3 = nsx-manager.lab.local <-- SDDC Manager registered FQDN
| | IP.1 = 192.168.1.70 (VIP)
| | IP.2 = 192.168.1.71 (node)
| +---> Generate cert: openssl req -x509 ...
| +---> Build JSON: python (avoid shell PEM escaping)
| +---> Import via API: POST /api/v1/trust-management/certificates?action=import
| +---> Apply to node: ?action=apply_certificate&service_type=API&node_id=<uuid>
| +---> Apply to VIP: ?action=apply_certificate&service_type=MGMT_CLUSTER
|
+---> If Trust FAIL (after cert replacement):
| +---> SSH to SDDC Manager as vcf, then su - to root
| +---> Pull cert: openssl s_client ... > /tmp/nsx-root.crt
| +---> Import to VCF store: keytool -importcert ... trusted_certificates.store
| +---> Import to Java cacerts: keytool -importcert ... cacerts
| +---> Restart services: sddcmanager_restart_services.sh
|
+---> Re-run VDT after ~5 minutes
+---> Expected: NSX cert checks all PASS
START: "Secure protocol communication error"
|
+---> Test connectivity: ping 192.168.1.160
| +---> FAIL: Check network/firewall
|
+---> Test SSL: openssl s_client -connect 192.168.1.160:8443
| +---> FAIL: Check depot server is running (python https_server.py)
|
+---> Check certificate: View cert details
| +---> Wrong hostname/IP: Regenerate certificate (python generate_cert.py)
|
+---> Import certificate to Java truststore
| +---> keytool -import -trustcacerts -alias offline-depot ...
|
+---> Verify fingerprints match
| +---> MISMATCH: Re-import correct certificate
|
+---> Restart LCM service
+---> systemctl restart lcm
+---> Wait 2 minutes, retry connection
START: VCF Component Service Not Responding
|
+---> Identify which component is affected
| +---> SDDC Manager: https://sddc-manager.lab.local
| +---> vCenter: https://vcenter.lab.local
| +---> NSX: https://nsx-vip.lab.local
|
+---> Verify VM is powered on (check via vCenter or ESXi)
| +---> Powered Off: Power on, wait 5-10 min
|
+---> SSH to the appliance
| +---> SDDC Manager: ssh vcf@192.168.1.241 -> su -
| +---> vCenter: ssh root@192.168.1.69
| +---> NSX: ssh admin@192.168.1.71
|
+---> Check services
| +---> SDDC Manager: systemctl status vcf-services
| +---> vCenter: service-control --status --all
| +---> NSX: get cluster status
|
+---> Restart failed services
| +---> SDDC Manager: systemctl restart <service>
| +---> vCenter: service-control --restart <service>
| +---> NSX: Power cycle VM (wait 10-15 min in nested env)
|
+---> Check logs for errors
| +---> SDDC Manager: /var/log/vmware/vcf/<service>/<service>.log
| +---> vCenter: /var/log/vmware/vpxd/vpxd.log
| +---> NSX: /var/log/proton/nsxapi.log
|
+---> Check database health
| +---> SDDC Manager: systemctl status postgresql
| +---> vCenter: service-control --status vmware-vpostgres
|
+---> If still not resolved:
+---> Collect SoS bundle: /opt/vmware/sddc-support/sos --log-bundle
+---> Open Broadcom support case
START: vSAN Health Warning or Error
|
+---> Check vSAN Skyline Health
| +---> vSphere Client -> Cluster -> Monitor -> vSAN -> Skyline Health
|
+---> Identify failure category
| +---> Cluster health
| +---> Network connectivity
| +---> Data / object health
| +---> Disk health
| +---> Capacity limits
|
+---> If SSD Detection Failure (nested env):
| +---> esxcli storage core device list | grep "Is SSD"
| +---> If "Is SSD: false":
| | +---> Shut down ESXi VM in Workstation
| | +---> Edit VMX: sata0:X.virtualSSD = 1
| | +---> Power on, verify: esxcli storage core device list
| +---> If "Has partitions":
| +---> esxcli vsan storage remove -d <device>
| +---> partedUtil delete ... (all partitions)
| +---> vdq -q (verify eligible)
|
+---> If Object Degraded:
| +---> Monitor -> vSAN -> Resyncing Components
| +---> Allow rebuild to complete (ensure 30% free capacity)
| +---> Do NOT make changes during rebuild
|
+---> If Disk Failed:
| +---> Identify disk (serial number, slot)
| +---> Remove from disk group
| +---> Replace physically (hot-swap if supported)
| +---> Add new disk to vSAN
| +---> Monitor rebuild
|
+---> If Network Health Warning (nested env):
+---> Latency warnings are expected in nested environments
+---> Verify MTU is 1500 (NOT 9000)
+---> Test vSAN network: vmkping -I vmk2 <other-host-vsan-ip>
START: "Certificate doesn't match subject alternative names"
|
+---> Check current cert SAN
| +---> openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
|
+---> Set correct hostname
| +---> esxcli system hostname set --fqdn=esxi01.lab.local
|
+---> Backup old certificates
| +---> mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
| +---> mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
|
+---> Generate new certificates
| +---> /sbin/generate-certificates
|
+---> Restart services
| +---> services.sh restart
|
+---> Update thumbprints in VCF
+---> Re-validate hosts in UI
+---> Get new thumbprints:
echo | openssl s_client -connect 192.168.1.74:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
START: vCenter deployment stuck at percentage
|
+---> Wait 30 minutes (large downloads may be slow)
|
+---> SSH to vCenter VM (ssh root@vcenter.lab.local, password: vmware)
|
+---> Check firstboot status
| +---> cat /var/log/firstboot/firstbootStatus.json
|
+---> Check for activity
| +---> vmstat 1 5 (disk I/O)
| +---> tail -f /var/log/vmware/firstboot/installer.log
|
+---> If stuck at 60% "Installing Containers":
| +---> Check postgres: ls /storage/db/vpostgres/
| +---> Missing postgresql.conf: Database failed to init
| +---> UNRECOVERABLE: Must redeploy
|
+---> Check services: vmon-cli --list
| +---> Services not started: Check individual logs
|
+---> If unrecoverable:
+---> Delete vCenter VM (vim-cmd vmsvc/unregister)
+---> Clean up vSAN on all hosts
+---> Reset depot connection
+---> Retry deployment
START: "Extraction of image from host failed"
|
+---> Check SSH status on ESXi host
| +---> vim-cmd hostsvc/runtimeinfo | grep ssh
|
+---> SSH Disabled?
| +---> vim-cmd hostsvc/enable_ssh
| +---> vim-cmd hostsvc/start_ssh
|
+---> Verify SSH on ALL hosts (esxi01-04)
| +---> esxcli system ssh set --enable=true
| +---> esxcli system ssh get
|
+---> Retry vCenter deployment
+----------------------------------------------------------------------+
| PROBLEM IDENTIFIED |
| | |
| v |
| +---------------------------+ |
| | Check VCF Health in | |
| | VCF Operations | |
| +---------------------------+ |
| | |
| +------------+------------+ |
| v v |
| +---------------+ +---------------+ |
| | All Green | | Red/Yellow | |
| +---------------+ +---------------+ |
| | | |
| v v |
| +------------------+ +------------------+ |
| | Check component | | Click on issue | |
| | logs directly | | for details | |
| +------------------+ +------------------+ |
| | | |
| v v |
| +------------------+ +------------------+ |
| | Use Diagnostics | | Follow | |
| | for known issues | | remediation | |
| +------------------+ +------------------+ |
| | | |
| v v |
| +------------------+ +------------------+ |
| | Still not | | Issue resolved? | |
| | resolved? | +------------------+ |
| +------------------+ | |
| | Yes ------+------ No |
| v | | |
| +------------------+ +----v------+ +----v-----------------+ |
| | Collect SoS | | Document | | Try alternative | |
| | logs | | resolution| | resolution | |
| +------------------+ +-----------+ +----------------------+ |
| | | |
| v | |
| +------------------+ | |
| | Open Support |<--------------------------+ |
| | Case | |
| +------------------+ |
+----------------------------------------------------------------------+
| Error | Cause | Resolution |
|---|---|---|
| "Secure protocol communication error" | Self-signed cert not trusted | Import cert to Java truststore, restart LCM |
| "Certificate doesn't match subject alternative names" | ESXi cert has wrong hostname | Regenerate cert: /sbin/generate-certificates |
| "Found zero SSD devices" | VMX missing virtualSSD flag | Edit VMX: sata0:X.virtualSSD = 1 |
| "Migration failed...VHV enabled" | Ghost vhv.enable in runtime | Add explicit vhv.enable = "FALSE" to VMX |
| "Memory convergence timeout" | Nested env bandwidth limit | Use cold migration as fallback |
| "Password out of sync" | Password changed outside VCF | Use Update Password in SDDC Manager |
| "Transport node disconnected" | TEP connectivity issue | Check VTEP, MTU, NSX proxy on host |
| "vSAN degraded" | Disk or host failure | Allow rebuild, replace failed components |
| "Task failed - prerequisite not met" | Missing dependency | Complete prerequisite first, retry |
| "503 Service Unavailable" (vCenter) | vCenter services down | service-control --restart --all |
| "NSX Manager unavailable" | NSX OOM or service crash | Check RAM (need 32GB nested), restart |
| "SAN contains neither hostname nor IP" (VDT) | NSX cert uses wildcard SAN | Replace cert with explicit SANs |
| "Product Version Catalog does not exist" | PVC file missing in depot | Extract metadata, copy to correct path |
| "Extraction of image from host failed" | SSH disabled on ESXi | Enable SSH: vim-cmd hostsvc/enable_ssh |
| Component | Log Path |
|---|---|
| SDDC Manager (all) | /var/log/vmware/vcf/ |
| SDDC Manager Domain Manager | /var/log/vmware/vcf/domainmanager/domainmanager.log |
| SDDC Manager LCM | /var/log/vmware/vcf/lcm/lcm.log |
| SDDC Manager LCM Debug | /var/log/vmware/vcf/lcm/lcm-debug.log |
| SDDC Manager Ops Manager | /var/log/vmware/vcf/operationsmanager/operationsmanager.log |
| VDT Reports | /var/log/vmware/vcf/vdt/vdt-<timestamp>.txt |
| SoS Bundles | /var/log/vmware/vcf/sddc-support/sos-<timestamp>.tar.gz |
| vCenter vpxd | /var/log/vmware/vpxd/vpxd.log |
| vCenter vSphere UI | /var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log |
| vCenter PostgreSQL | /var/log/vmware/vpostgres/postgresql-*.log |
| vCenter firstboot | /var/log/firstboot/firstbootStatus.json |
| NSX Manager | /var/log/proton/nsxapi.log |
| NSX Syslog (on ESXi) | /var/log/nsx-syslog.log |
| ESXi hostd | /var/log/hostd.log |
| ESXi vpxa | /var/log/vpxa.log |
| ESXi vmkernel | /var/log/vmkernel.log |
| vSAN health | /var/log/vmware/vsan-health/ |
| Service | Port | Protocol |
|---|---|---|
| SDDC Manager UI | 443 | HTTPS |
| vCenter Server | 443 | HTTPS |
| NSX Manager | 443 | HTTPS |
| ESXi Management | 443, 902 | HTTPS, VMware |
| SSH | 22 | TCP |
| vSAN | 2233 | TCP |
| vMotion | 8000 | TCP |
| NSX Manager Cluster | 1234 | TCP |
| Offline Depot | 8443 | HTTPS |
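As a quick sanity check, the ports in this table can be probed from the SDDC Manager shell using bash's built-in /dev/tcp redirection. This is a hedged sketch using example endpoints from this guide (vCenter 192.168.1.69, NSX 192.168.1.71, offline depot 192.168.1.160); substitute your own targets and ports:
# Probe a few well-known endpoints; OPEN means a TCP connection succeeded
for target in 192.168.1.69:443 192.168.1.71:443 192.168.1.160:8443; do
  host=${target%:*}; port=${target#*:}
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OPEN   $target"
  else
    echo "CLOSED $target"
  fi
done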
esxcli system -- System administration and configuration.
# Display hostname, FQDN, and domain
esxcli system hostname get
# Set fully qualified domain name
esxcli system hostname set --fqdn=esxi01.lab.local
# Set short hostname only
esxcli system hostname set --host=esxi01
# Set domain only
esxcli system hostname set --domain=lab.local
# Get ESXi version and build number
esxcli system version get
# Enter maintenance mode (no vSAN data evacuation)
esxcli system maintenanceMode set -e true -m noAction
# Enter maintenance mode (evacuate all vSAN data)
esxcli system maintenanceMode set -e true -m evacuateAllData
# Exit maintenance mode
esxcli system maintenanceMode set -e false
# Check maintenance mode status
esxcli system maintenanceMode get
# Get system time
esxcli system time get
esxcli network -- VMkernel, vSwitch, IP, and firewall management.
# List all VMkernel interfaces
esxcli network ip interface list
# Get IPv4 configuration for a specific VMkernel interface
esxcli network ip interface ipv4 get -i vmk0
# Set IPv4 address on VMkernel interface (static)
esxcli network ip interface ipv4 set -i vmk2 -I 192.168.12.74 -N 255.255.255.0 -t static
# Add a new VMkernel interface
esxcli network ip interface add -i vmk1 -p "vMotion"
# List all standard vSwitches with uplinks and portgroups
esxcli network vswitch standard list
# Add uplink NIC to vSwitch
esxcli network vswitch standard uplink add -u vmnic3 -v vSwitch0
# Remove uplink NIC from vSwitch
esxcli network vswitch standard uplink remove -u vmnic3 -v vSwitch0
# Get failover policy (active, standby, unused adapters)
esxcli network vswitch standard policy failover get -v vSwitch0
# Set adapter as active in failover policy
esxcli network vswitch standard policy failover set -v vSwitch0 -a vmnic3
# Get security policy for a vSwitch
esxcli network vswitch standard policy security get -v vSwitch0
# Get security policy for a specific portgroup
esxcli network vswitch standard portgroup policy security get -p "VM Network"
# List distributed virtual switches
esxcli network vswitch dvs vmware list
# List all physical NICs with link status and speed
esxcli network nic list
# Get detailed NIC information
esxcli network nic get -n vmnic0
# Get NIC traffic statistics
esxcli network nic stats get -n vmnic0
# Filter NIC stats for packet and byte counts
esxcli network nic stats get -n vmnic0 | grep -E "Packets|Bytes"
# Show ARP table entries
esxcli network ip neighbor list
# Filter ARP for specific subnet
esxcli network ip neighbor list | grep 192.168.12
# Show IPv4 routing table
esxcli network ip route ipv4 list
# List active network connections
esxcli network ip connection list
# Filter connections for NSX Manager communication (port 1234)
esxcli network ip connection list | grep 1234
# List firewall rulesets and their enabled/disabled status
esxcli network firewall ruleset list
# Filter firewall for SSH rules
esxcli network firewall ruleset list | grep -i ssh
esxcli storage -- Device, adapter, and filesystem management.
# List all storage devices with capacity, vendor, model, SSD status
esxcli storage core device list
# Filter for SSD detection status
esxcli storage core device list | grep -E "Display Name|Is SSD"
# Rescan all storage adapters for new devices
esxcli storage core adapter rescan --all
# Rescan a specific adapter
esxcli storage core adapter rescan --adapter=vmhba0
# List all storage adapters
esxcli storage core adapter list
# List all mounted filesystems and VMFS datastores
esxcli storage filesystem list
# List VMFS extents
esxcli storage vmfs extent list
# Rescan VMFS filesystems
esxcli storage filesystem rescan
esxcli vsan -- vSAN cluster, storage, health, and network operations.
# Get vSAN cluster status (member count, node state, health)
esxcli vsan cluster get
# Force host to leave vSAN cluster (CAUTION)
esxcli vsan cluster leave
# List unicast agents (all cluster members)
esxcli vsan cluster unicastagent list
# List vSAN storage devices and disk groups
esxcli vsan storage list
# Disable automatic disk claiming
esxcli vsan storage automode set --enabled=false
# Enable automatic disk claiming
esxcli vsan storage automode set --enabled=true
# Add storage to vSAN (cache + capacity tier)
esxcli vsan storage add -s <cache-device> -d <capacity-device>
# Remove device from vSAN
esxcli vsan storage remove -s <device>
# List vSAN health checks and their status
esxcli vsan health cluster list
# Get specific health test results
esxcli vsan health cluster get -t "vSAN Health"
# List vSAN network adapters
esxcli vsan network list
# Add VMkernel interface to vSAN traffic
esxcli vsan network ip add -i vmk1
# Remove VMkernel interface from vSAN traffic
esxcli vsan network ip remove -i vmk1
# Show vSAN resync status and progress
esxcli vsan debug resync summary get
# List vSAN objects for debugging
esxcli vsan debug object list
esxcli software -- VIB and software depot management.
# List installed VIBs
esxcli software vib list
# Install a VIB from a local path
esxcli software vib install -v /path/to/vib.vib
# Remove a VIB
esxcli software vib remove -n <vib-name>
# Show installed software profile
esxcli software profile get
# List image profiles available in a software depot
esxcli software sources profile list -d /path/to/depot.zip
# Display VMDK metadata and lock information
vmkfstools -D "/vmfs/volumes/vsan:XXXX/vcenter/vcenter.vmdk"
# Clone VMDK from one datastore to another (thick to thin conversion)
# Lab-tested: Used to migrate SDDC Manager from local to vSAN (914GB thick -> 108GB thin)
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmdk /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmdk -d thin
# Clone as thin provisioned (per-disk for large VMs)
vmkfstools -i <source-vmdk> <destination-vmdk> -d thin
# Clone as thick lazy zeroed
vmkfstools -i <source-vmdk> <destination-vmdk> -d zeroedthick
# Clone as thick eager zeroed
vmkfstools -i <source-vmdk> <destination-vmdk> -d eagerzeroedthick
# Delete a VMDK file (use when cleaning failed clones)
vmkfstools -U /vmfs/volumes/<datastore>/<vm>/<disk>.vmdk
# Create a new VMDK (50GB thin)
vmkfstools -c 50G -d thin /vmfs/volumes/<datastore>/<vm>/newdisk.vmdk
# Extend an existing VMDK to 100GB
vmkfstools -X 100G /vmfs/volumes/<datastore>/<vm>/disk.vmdk
# Get disk geometry information
vmkfstools -g /vmfs/volumes/<datastore>/<vm>/disk.vmdk
Disk format types:
| Flag | Format | Description |
|---|---|---|
| -d thin | Thin provisioned | Allocates space on demand (saves storage) |
| -d zeroedthick | Thick lazy zeroed | Allocates full space, zeros on first write |
| -d eagerzeroedthick | Thick eager zeroed | Allocates and zeros all space immediately |
vdq -- Disk qualification for vSAN:
# List all eligible disks for vSAN
vdq -qH
# Detailed disk qualification query
vdq -q -d <device-name>
esxtop -- Real-time performance monitoring:
# Launch interactive performance monitor
esxtop
# Batch mode: capture to CSV (5-second intervals, 10 samples)
esxtop -b -d 5 -n 10 > /tmp/esxtop.csv
Interactive view keys:
| Key | View | Key Columns |
|---|---|---|
| c | CPU | %USED, %RDY, %CSTP, %MLMTD |
| m | Memory | MCTLSZ (balloon), SWCUR (swap), CACHEUSD |
| n | Network | MbTX/s, MbRX/s, %DRPTX, %DRPRX |
| d | Disk/Storage | DAVG (device latency), KAVG (kernel latency), GAVG (guest latency) |
| v | VM view | Per-VM resource utilization |
| u | Disk device | Per-device I/O statistics |
vim-cmd -- VM management from ESXi shell:
# List all registered VMs with VMIDs
vim-cmd vmsvc/getallvms
# Get power state of a VM
vim-cmd vmsvc/power.getstate <vmid>
# Power on a VM
vim-cmd vmsvc/power.on <vmid>
# Power off a VM (hard power off)
vim-cmd vmsvc/power.off <vmid>
# Graceful shutdown (requires VMware Tools)
vim-cmd vmsvc/power.shutdown <vmid>
# Reset (hard reboot) a VM
vim-cmd vmsvc/power.reset <vmid>
# Register a VM from its VMX file
vim-cmd solo/registervm "/vmfs/volumes/vsan:XXXX/vcenter/vcenter.vmx"
# Unregister a VM (does not delete files)
vim-cmd vmsvc/unregister <vmid>
# List all devices attached to a VM
vim-cmd vmsvc/device.getdevices <vmid>
# Force VM into BIOS/EFI on next boot
vim-cmd vmsvc/setboot.options <vmid> enterBIOSSetup=true
# Enter maintenance mode
vim-cmd hostsvc/maintenance_mode_enter
# Exit maintenance mode
vim-cmd hostsvc/maintenance_mode_exit
localcli -- Bypass hostd for direct VMkernel operations:
# Useful when hostd is unresponsive
localcli network ip interface list
localcli storage core device list
localcli system hostname get
dcli -- vCenter REST API client on ESXi:
# List VMs via vCenter API from ESXi shell
dcli +server vcenter.lab.local +username administrator@vsphere.local com vmware vcenter vm list
esxcfg-* -- Legacy network configuration commands:
# List all VMkernel interfaces with IP, MTU, and enabled services
esxcfg-vmknic -l
# List all virtual switches with portgroups and uplinks
esxcfg-vswitch -l
# List physical NICs with driver, link state, speed, duplex
esxcfg-nics -l
vmkping -- VMkernel stack ping utility:
# Basic ping
vmkping 192.168.12.75
# Ping from specific VMkernel interface
vmkping -I vmk2 192.168.12.75
# MTU test with Don't Fragment flag (1572-byte payload + 28 bytes of ICMP/IP headers = 1600 bytes total, for overlay networks)
vmkping -d -s 1572 192.168.12.75
# Ping with count
vmkping -c 10 192.168.12.75
vscsiStats -- Storage I/O statistics:
# List VMs available for storage statistics
vscsiStats -l
# Start collecting stats for a VM
vscsiStats -s -w <world-id>
# Print storage statistics
vscsiStats -p all -w <world-id>
vsish -- VMkernel System Information Shell:
# List vsish nodes
vsish -e ls /
# Get memory statistics
vsish -e get /memory/comprehensive
# Get network portset info
vsish -e get /net/portsets/
Partition utilities:
# Display partition table of a disk
partedUtil getptbl /dev/disks/<device-name>
# Create fresh GPT label (DESTROYS ALL DATA)
partedUtil mklabel /dev/disks/<device-name> gpt
ESXi service control scripts:
# Restart ALL management services (causes brief outage)
services.sh restart
# Host daemon (hostd) control
/etc/init.d/hostd restart
/etc/init.d/hostd status
# vCenter agent (vpxa) control
/etc/init.d/vpxa restart
/etc/init.d/vpxa status
# SSH service control
/etc/init.d/SSH status
/etc/init.d/SSH start
/etc/init.d/SSH stop
# NSX proxy agent on ESXi
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-proxy restart
# NSX operations agent
/etc/init.d/nsx-opsagent status
# NSX datapath (distributed firewall)
/etc/init.d/nsx-datapath status
# Regenerate ESXi SSL certificates (run after FQDN change)
/sbin/generate-certificates
# Persist configuration changes across reboots
/sbin/auto-backup.sh
# Check status of ALL vCenter services
service-control --status --all
# Check status of a specific service
service-control --status vpxd
# Start all services
service-control --start --all
# Stop all services (causes vCenter outage)
service-control --stop --all
# Restart a specific service
service-control --restart vpxd
service-control --restart vsphere-client
service-control --restart vmware-vpostgres
service-control --restart vsphere-ui
# Restart all services (causes brief outage)
service-control --restart --all
Critical vCenter services:
| Service | Purpose |
|---|---|
| vpxd | Core vCenter Server daemon |
| vsphere-ui | vSphere Client web interface |
| vmware-vpostgres | Embedded PostgreSQL database |
| vmcad | Certificate Authority daemon |
| vmdird | Directory Service (vmdir) |
| vmafdd | Authentication Framework daemon |
| vmware-sps | Profile-Driven Storage |
| vlcm | vSphere Lifecycle Manager |
| eam | ESX Agent Manager |
| lookupsvc | Lookup Service |
| applmgmt | Appliance Management |
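To walk the table above in one pass, a hedged loop like this checks each core service individually with service-control (service names taken from the table; adjust the list as needed):
# Check each critical vCenter service one at a time
for svc in vpxd vsphere-ui vmware-vpostgres vmcad vmdird vmafdd vmware-sps eam lookupsvc applmgmt; do
  echo "=== $svc ==="
  service-control --status $svc
done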
vCenter database operations:
# Connect to vCenter PostgreSQL database
/opt/vmware/vpostgres/current/bin/psql -U postgres
# Test database connection
/opt/vmware/vpostgres/current/bin/psql -U postgres -c "SELECT 1;"
# Check active database connections
/opt/vmware/vpostgres/current/bin/psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
vCenter certificate management:
# Launch certificate manager wizard
/usr/lib/vmware-vmca/bin/certificate-manager
# List certificates in VECS stores
for store in MACHINE_SSL_CERT TRUSTED_ROOTS; do
/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store
done
SSO management (cmsso-util):
# Repoint vCenter to external Platform Services Controller (legacy)
cmsso-util repoint --repoint-partner <psc-fqdn>
# List SSO domain information
/opt/vmware/bin/dir-cli service list --login administrator@vsphere.local
Appliance management:
# Get appliance version
vamicli version --appliance
# Check for available updates
vamicli update --check
# VAMI login shell
/opt/vmware/share/vami/vami_login
VCF service management (systemctl):
# Check all VCF services status
systemctl status vcf-services
# Restart all VCF services
systemctl restart vcf-services
# Start all VCF services
systemctl start vcf-services
# Stop all VCF services
systemctl stop vcf-services
Individual SDDC Manager services:
| Service Name | systemctl Command |
|---|---|
| Domain Manager | systemctl status domainmanager / systemctl restart domainmanager |
| Lifecycle Manager | systemctl status lcm / systemctl restart lcm |
| Operations Manager | systemctl status operationsmanager / systemctl restart operationsmanager |
| NGINX (reverse proxy) | systemctl status nginx / systemctl restart nginx |
| PostgreSQL (database) | systemctl status postgresql / systemctl restart postgresql |
| SDDC Manager UI | systemctl restart sddc-manager-ui-app.service |
| Common Services | systemctl status commonsvcs |
Service discovery:
# List all VCF-related systemd service units
systemctl list-units --type=service | grep vcf
SOS utility (Supportability and Serviceability):
Path: /opt/vmware/sddc-support/sos
# Collect comprehensive log bundle for VMware support
/opt/vmware/sddc-support/sos --log-bundle
# Run health check on SDDC Manager and all components
/opt/vmware/sddc-support/sos --health-check
# Collect logs for a specific workload domain
/opt/vmware/sddc-support/sos --domain-name mgmt
# Get inventory of all VCF components
/opt/vmware/sddc-support/sos --get-inventory
# Clean up old log bundles to free disk space
/opt/vmware/sddc-support/sos --cleanup-logs
# Retrieve current passwords (requires authentication)
/opt/vmware/sddc-support/sos --get-passwords
# Backup SDDC Manager configuration
/opt/vmware/sddc-support/sos --backup-config
SDDC Manager database (PostgreSQL):
Always use PAGER=cat when running psql on SDDC Manager to prevent pager traps in remote/scripted sessions.
# Connect to SDDC Manager database (use -h 127.0.0.1, NOT localhost or Unix sockets)
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform"
# Test database connection
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -c 'SELECT 1;'"
# List all databases
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -l"
# Backup SDDC Manager database
su - postgres -c "pg_dump -h 127.0.0.1 platform > /tmp/platform_backup.sql"
# Full cascade repair (quick reference)
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM lock;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE task_metadata SET resolved = true WHERE resolved = false;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM task_lock;\""
# See Section 7.2.6 for full procedure with diagnosis and verification
psql internal commands:
| Command | Description |
|---|---|
| \dt | List all tables |
| \l | List databases |
| \d <table> | Describe table columns |
| \q | Exit psql |
| \? | Help |
Configuration file locations on SDDC Manager:
| File | Purpose |
|---|---|
| /etc/vmware/vcf/domainmanager/application-prod.properties | Domain Manager configuration |
| /etc/vmware/vcf/commonsvcs/trusted_certificates.store | VCF trust store (password in .key file) |
| /etc/vmware/vcf/commonsvcs/trusted_certificates.key | VCF trust store password |
| /etc/alternatives/jre/lib/security/cacerts | Java cacerts trust store (password: changeit) |
| /etc/resolv.conf | DNS configuration |
| /nfs/vmware/vcf/nfs-mount/bundle/ | VCF bundle depot directory |
File transfer workaround (SCP does not work with restricted shell):
# SDDC Manager only allows SSH as 'vcf' user (root/admin rejected for SSH)
# SCP fails due to restricted shell; use ssh cat method instead:
ssh vcf@192.168.1.241 "cat > /home/vcf/file.zip" < localfile.zip
# Root access: su - from vcf session
ssh vcf@192.168.1.241
su -
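The same cat-over-SSH trick works in the other direction when you need to pull a file off the appliance (a hedged sketch; the file paths are only examples):
# Copy a file FROM SDDC Manager to the local machine without SCP
ssh vcf@192.168.1.241 "cat /home/vcf/file.zip" > localfile.zip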
SDDC Manager service restart script (alternative):
# Full service restart with proper sequencing
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
All NSX CLI commands are run from the NSX Manager console or SSH session as admin. NSX shell does NOT support backslash line continuation -- all commands must be single-line.
# Get cluster status (controller cluster health)
get cluster status
# List NSX Manager nodes
get managers
# Get cluster node details
get cluster nodes
# Get certificate information
get certificate api
# List all transport nodes
get transport-nodes
# Get transport node status by UUID
get transport-node <uuid> status
# List all logical switches (segments)
get logical-switches
# List all logical routers (gateways)
get logical-routers
# List all interfaces
get interfaces
# Show VTEP (Tunnel Endpoint) information
get vtep
# Display VTEP table entries
get vtep-table
# List all distributed firewall rules
get firewall rules
# Check DFW status
get firewall status
# Get details of a specific firewall rule
get firewall rule <rule-id>
# Start a traceflow for network debugging
start traceflow --src-port <port-id> --dst-ip <ip>
# Get traceflow results
get traceflow <traceflow-id>
# Set DNS servers (admin CLI, NOT the UI)
set name-servers 192.168.1.230
# Set NTP servers (admin CLI)
set ntp-servers 192.168.1.230
# Restart a specific NSX service
restart service <service-name>
# Check NSX service status
get service <service-name>
All curl commands to NSX must be single-line. No backslash continuation in NSX shell.
# Check NSX cluster status
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
# Get full cluster information (includes node UUIDs)
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster
# List all certificates
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/trust-management/certificates
# Import a certificate (use Python to build JSON payload for PEM data)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Apply certificate to NSX Manager node (API service)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=API&node_id=<node-uuid>"
# Apply certificate to cluster VIP (management cluster)
curl -k -u admin:'Success01!0909!!' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=MGMT_CLUSTER"
# List transport nodes via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/transport-nodes
# List segments via API
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/policy/api/v1/infra/segments
# Get transport zone list
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/transport-zones
# List compute managers
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/fabric/compute-managers
Building JSON payload for certificate import (Python method):
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
This avoids shell escaping issues with \n characters in PEM data.
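As an optional extra check (not part of the original procedure), the payload can be validated as JSON before POSTing it to NSX:
# Confirm the generated payload parses as valid JSON
python -c "import json; json.load(open('/tmp/nsx-import.json')); print('payload OK')"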
# Generate a self-signed certificate and private key (basic)
openssl req -x509 -newkey rsa:2048 -keyout server.key -out server.crt -days 365 -nodes -subj '/CN=hostname'
# Generate with Subject Alternative Names (SANs)
openssl req -x509 -newkey rsa:2048 -keyout server.key -out server.crt -days 365 -nodes \
-subj "/CN=192.168.1.52/O=VCF-Depot/C=US" \
-addext "subjectAltName=IP:192.168.1.52,DNS:localhost" \
-addext "keyUsage=digitalSignature,keyEncipherment" \
-addext "extendedKeyUsage=serverAuth"
# Generate private key separately
openssl genrsa -out server.key 2048
# Generate CSR (Certificate Signing Request)
openssl req -new -key server.key -out server.csr -subj "/CN=hostname/O=Org/C=US"
# Sign CSR with CA certificate
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 365
# Generate certificate using config file (lab-tested for NSX)
openssl req -x509 -nodes -days 825 -newkey rsa:2048 \
-keyout /tmp/nsx.key -out /tmp/nsx.crt \
-config /tmp/nsx-cert.conf -sha256
# View full certificate details
openssl x509 -in cert.crt -text -noout
# View Subject Alternative Names only
openssl x509 -in cert.crt -text -noout | grep -A1 'Subject Alternative Name'
# View certificate validity dates
openssl x509 -in cert.crt -noout -dates
# View expiration date only
openssl x509 -in cert.crt -noout -enddate
# View certificate subject
openssl x509 -in cert.crt -noout -subject
# View certificate issuer
openssl x509 -in cert.crt -noout -issuer
# Verify certificate against CA
openssl verify -CAfile ca.crt server.crt
# View remote server certificate (connect and display chain)
openssl s_client -connect vcenter.lab.local:443 -showcerts
# Pull remote certificate and save to file
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Check certificate fingerprint (SHA-256)
openssl x509 -in cert.crt -noout -fingerprint -sha256
# Convert PEM to DER format
openssl x509 -in cert.pem -outform der -out cert.der
# Convert DER to PEM format
openssl x509 -in cert.der -inform der -outform pem -out cert.pem
# Import certificate into a Java truststore
keytool -import -trustcacerts -alias <name> -file <cert> -keystore <cacerts> -storepass changeit -noprompt
# Example: import into Cloud Builder / SDDC Manager Java cacerts
keytool -import -trustcacerts -alias vcf-depot \
-file /tmp/depot.crt \
-keystore /usr/lib/jvm/openjdk-java17-headless.x86_64/lib/security/cacerts \
-storepass changeit -noprompt
# List all certificates in a keystore (summary)
keytool -list -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit
# List certificates with full details (verbose)
keytool -list -v -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit
# Delete a certificate from keystore
keytool -delete -alias <name> -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit
# Export a certificate from keystore
keytool -export -alias <name> -keystore <cacerts> -storepass changeit -file exported.crt
Common VCF keystores:
| Keystore Path | Password | Purpose |
|---|---|---|
| /etc/alternatives/jre/lib/security/cacerts | changeit | Java default trust store |
| /etc/vmware/vcf/commonsvcs/trusted_certificates.store | Contents of .key file | VCF common services trust store |
| /usr/lib/jvm/openjdk-java17-headless.x86_64/lib/security/cacerts | changeit | Java 17 trust store |
Lab-tested: Import NSX self-signed cert into SDDC Manager trust stores:
# Step 1: Pull the active NSX certificate
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Step 2: Import into VCF trust store
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
# Step 3: Import into Java cacerts
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
# Step 4: Restart SDDC Manager services
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
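To confirm the import took effect (ideally before the restart), the alias can be checked in both stores; a hedged verification step reusing the list commands shown earlier:
# Confirm the alias is now present in both trust stores
keytool -list -keystore /etc/alternatives/jre/lib/security/cacerts -storepass changeit | grep -i nsx-selfsigned
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -list -keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store -storepass "$KEY" | grep -i nsx-selfsigned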
PowerShell commands for depot and certificate management:
# Disable Hyper-V (required for nested virtualization in VMware Workstation)
bcdedit /set hypervisorlaunchtype off
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux -NoRestart
# Verify hypervisor is off after reboot
bcdedit /enum | findstr hypervisor
# Check Device Guard / VBS status (VirtualizationBasedSecurityStatus should be 0)
Get-CimInstance -ClassName Win32_DeviceGuard -Namespace root\Microsoft\Windows\DeviceGuard
# Check VMX file settings from Windows
type "D:\VMs\esxi01.lab.local\esxi01.lab.local.vmx" | findstr /i "vhv vpmc vvtd"
certutil commands (Windows certificate management):
# View certificate details
certutil -dump cert.crt
# Verify certificate chain
certutil -verify cert.crt
# Import certificate into Windows trust store
certutil -addstore Root cert.crt
# Export certificate from Windows store
certutil -exportPFX -p "password" Root cert.pfx
# Hash a file (verify download integrity)
certutil -hashfile file.zip SHA256
DNS management (Windows Server):
# Add forward DNS record (A record)
Add-DnsServerResourceRecordA -Name "vcenter" -ZoneName "lab.local" -IPv4Address "192.168.1.69"
# Add reverse DNS record (PTR record)
Add-DnsServerResourceRecordPtr -Name "69" -ZoneName "1.168.192.in-addr.arpa" -PtrDomainName "vcenter.lab.local"
# Verify DNS resolution
nslookup vcenter.lab.local
# Verify reverse DNS
nslookup 192.168.1.69
# List all DNS records in a zone
Get-DnsServerResourceRecord -ZoneName "lab.local"
SDDC Manager API endpoints:
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/tokens | Get authentication bearer token |
| GET | /v1/system | System information |
| GET | /v1/hosts | List all commissioned hosts |
| GET | /v1/domains | List all workload domains |
| GET | /v1/tasks | List all tasks |
| PATCH | /v1/tasks/<id> | Cancel a stuck task |
| GET | /v1/clusters | List all clusters |
| GET | /v1/nsxt-clusters | List NSX clusters |
| GET | /v1/vcenters | List all vCenter instances |
| GET | /v1/credentials | List all managed credentials |
| GET | /v1/bundles | List available bundles |
| POST | /v1/bundles | Upload a bundle |
# Authenticate and get bearer token
curl -k -X POST https://sddc-manager.lab.local/v1/tokens -H "Content-Type: application/json" -d '{"username":"admin@local","password":"Success01!0909!!"}'
# Get system information
curl -k -X GET https://sddc-manager.lab.local/v1/system -H "Authorization: Bearer <token>"
# List all hosts
curl -k -X GET https://sddc-manager.lab.local/v1/hosts -H "Authorization: Bearer <token>"
# List all domains
curl -k -X GET https://sddc-manager.lab.local/v1/domains -H "Authorization: Bearer <token>"
# List all tasks
curl -k -X GET https://sddc-manager.lab.local/v1/tasks -H "Authorization: Bearer <token>"
# Cancel a stuck task
curl -k -X PATCH https://sddc-manager.lab.local/v1/tasks/<task-id> -H "Authorization: Bearer <token>" -H "Content-Type: application/json" -d '{"status":"CANCELLED"}'
NSX API endpoints:
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/v1/cluster/status | Cluster health status |
| GET | /api/v1/cluster | Cluster info with node UUIDs |
| GET | /api/v1/transport-nodes | List transport nodes |
| GET | /api/v1/transport-zones | List transport zones |
| GET | /api/v1/trust-management/certificates | List all certificates |
| POST | /api/v1/trust-management/certificates?action=import | Import certificate |
| POST | /api/v1/trust-management/certificates/<id>?action=apply_certificate | Apply certificate |
| GET | /api/v1/fabric/compute-managers | List compute managers |
| GET | /policy/api/v1/infra/segments | List segments (Policy API) |
| GET | /policy/api/v1/infra/tier-0s | List Tier-0 gateways |
| GET | /policy/api/v1/infra/tier-1s | List Tier-1 gateways |
vCenter API endpoints:
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/session | Create session (Basic auth) |
| GET | /api/vcenter/vm | List all VMs |
| GET | /api/vcenter/host | List all hosts |
| GET | /api/vcenter/cluster | List all clusters |
| GET | /api/vcenter/datastore | List all datastores |
| GET | /api/vcenter/network | List all networks |
Authentication patterns:
# SDDC Manager: Bearer token authentication
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens -H "Content-Type: application/json" -d '{"username":"admin@local","password":"Success01!0909!!"}' | python -c "import sys,json;print(json.load(sys.stdin)['accessToken'])")
curl -k -H "Authorization: Bearer $TOKEN" https://sddc-manager.lab.local/v1/system
# NSX Manager: Basic authentication
curl -k -u admin:'Success01!0909!!' https://192.168.1.71/api/v1/cluster/status
# vCenter: Session-based authentication
SESSION=$(curl -sk -X POST https://vcenter.lab.local/api/session -u 'administrator@vsphere.local:Success01!0909!!' | tr -d '"')
curl -sk -H "vmware-api-session-id: $SESSION" https://vcenter.lab.local/api/vcenter/vm
API status codes:
| Code | Meaning |
|---|---|
| 200 | Success |
| 201 | Created |
| 202 | Accepted (async operation started) |
| 400 | Bad Request (malformed JSON or invalid parameters) |
| 401 | Unauthorized (bad credentials or expired token) |
| 403 | Forbidden (insufficient permissions) |
| 404 | Not Found |
| 409 | Conflict (resource already exists) |
| 500 | Internal Server Error |
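Tying the token flow and these status codes together, here is a minimal polling sketch for an asynchronous SDDC Manager task. It assumes a per-task GET /v1/tasks/<task-id> endpoint (the table above only lists the collection GET and the PATCH), and the 30-second interval is arbitrary:
# Get a bearer token (same call as the authentication example above)
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens -H "Content-Type: application/json" -d '{"username":"admin@local","password":"Success01!0909!!"}' | python -c "import sys,json;print(json.load(sys.stdin)['accessToken'])")
# Poll one task until it leaves IN_PROGRESS (task ID is a placeholder)
TASK_ID=<task-id>
while true; do
  STATUS=$(curl -sk -H "Authorization: Bearer $TOKEN" https://sddc-manager.lab.local/v1/tasks/$TASK_ID | python -c "import sys,json;print(json.load(sys.stdin).get('status',''))")
  echo "$(date) task $TASK_ID status: $STATUS"
  [ "$STATUS" = "IN_PROGRESS" ] || break
  sleep 30
done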
On March 13, 2026, the Windows host running the nested VCF 9.0 lab environment was force-rebooted by Windows Updates. This caused an unclean shutdown of all nested VMs simultaneously, including all four ESXi hosts, vCenter, SDDC Manager, NSX Manager, and the VCF management components that were in the process of being deployed.
| Impact Area | Description |
|---|---|
| All VMs | Powered off ungracefully |
| vSAN Cluster | Entered partitioned state — datastore inaccessible |
| NSX Manager | Services became unstable, crash loop |
| SDDC Manager | CPU soft lockups from resource contention |
| VCF Management Components | Deployment task interrupted mid-deploy at step 25 of 28 |
| Fleet (vRSLCM) | CPU soft lockups |
| VCF Operations | Cluster stuck in INITIALIZATION_FAILED state |
Recovery Duration: Approximately 48 hours across multiple troubleshooting sessions
Outcome: Full recovery achieved — all VCF components operational
| Component | Hostname | IP Address | VM ID | vCPU | RAM |
|---|---|---|---|---|---|
| ESXi Host 1 | esxi01.lab.local | 192.168.1.201 | — | 8 | 48 GB |
| ESXi Host 2 | esxi02.lab.local | 192.168.1.202 | — | 8 | 48 GB |
| ESXi Host 3 | esxi03.lab.local | 192.168.1.203 | — | 8 | 48 GB |
| ESXi Host 4 | esxi04.lab.local | 192.168.1.204 | — | 8 | 48 GB |
| vCenter Server | vcenter.lab.local | 192.168.1.69 | vm-18 | 2 | 16 GB |
| SDDC Manager | sddc-manager.lab.local | 192.168.1.241 | vm-68 | 4 | 16 GB |
| NSX Manager | nsx-manager.lab.local | 192.168.1.71 | vm-58 | 6 | 30 GB |
| NSX VIP | nsx-vip.lab.local | 192.168.1.70 | — | — | — |
| Fleet (vRSLCM) | fleet.lab.local | 192.168.1.78 | vm-4014 | 4 | 12 GB |
| VCF Operations | vcf-ops.lab.local | 192.168.1.77 | vm-4015 | 8 | 32 GB |
| Collector | collector.lab.local | 192.168.1.79 | vm-4016 | 4 | 16 GB |
| Logs | — | — | vm-69 | 4 | 8 GB |
Total nested VM resources: 32 vCPU, 130 GB RAM (management VMs only, excluding ESXi hosts)
| Item | ID |
|---|---|
| SDDC Manager UUID | 90ffb005-52c9-4d35-b254-0217f5305b59 |
| Fleet Environment ID | df6d02bb-692a-4c44-a0d3-99e29c672bd0 |
| Fleet Request ID | be0221fd-e620-48f3-8543-eb67b26616b0 |
| Deployment Task ID | a48065d5-1ead-48ea-9d1e-113ae80732d2 |
| VCF Ops Admin User ID | 6df57f67-9573-47a8-a9d4-e9efa841a2ba |
| vCenter GUID | 92109cf0-ad3b-4ffa-8972-a77bb7fadacf |
| NSX Cluster ID | 6c55d856-ab96-4190-8495-3cc8cb23450c |
After the Windows host rebooted, the vSAN datastore was inaccessible and the vSAN cluster showed a partitioned state:
esxcli vsan cluster get showed hosts in separate sub-clusters.
Root Cause: The ungraceful shutdown caused the vSwitch failover policies for the vSAN portgroup to revert to an incorrect NIC teaming configuration, preventing vSAN traffic between hosts.
Diagnosis steps on each host:
# Check vSAN cluster membership
esxcli vsan cluster get
# Test vSAN VMkernel connectivity from esxi01
vmkping -I vmk2 192.168.12.75
vmkping -I vmk2 192.168.12.76
vmkping -I vmk2 192.168.12.82
# Check vSwitch NIC teaming — look for "Unused Adapters"
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
The vSAN portgroup failover policy needed to be corrected on all four ESXi hosts:
# Fix the failover policy for the vSAN VMkernel portgroup
esxcfg-vswitch -p "vSAN" -N vmnic0 vSwitch0
# Verify fix
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
# Should show: Active Adapters: vmnic3 (or appropriate NIC)
# Should show: Unused Adapters: (empty)
After correcting the failover policy on all hosts, vSAN traffic resumed and the cluster reformed.
# Monitor vSAN resync progress
esxcli vsan debug resync summary get
# Verify cluster health
esxcli vsan cluster get
esxcli vsan health cluster list
Note: vSAN object resync took approximately 30-45 minutes after the cluster reformed. All objects returned to a compliant state.
After vSAN recovery, NSX Manager was reachable but unstable — UI intermittently available, SDDC Manager reported NSX as "UNSTABLE", and services were in a crash loop.
Root Cause: The ungraceful shutdown corrupted some NSX service state. Services needed a clean restart.
# SSH to NSX Manager
ssh admin@192.168.1.71
# Check service status
get service
# Restart critical services
restart service manager
restart service proton
restart service corfu
# Wait 5-10 minutes, then verify
get cluster status
# Verify via SDDC Manager API
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['accessToken'])")
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/nsxt-clusters | python3 -m json.tool
Expected: "status": "ACTIVE"
Symptom on VM console:
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [java:12345]
The SDDC Manager VM console showed a CPU soft lockup — the Java-based Spring Boot services consumed all available CPU, preventing the Linux kernel scheduler from running other processes.
Root Cause: Resource contention — with all management VMs running simultaneously (32 vCPU, 130 GB RAM in nested VMs), the physical host couldn't provide enough CPU time.
SSH was unresponsive due to the soft lockup. The VM had to be hard-reset through the vCenter REST API:
# Get vCenter API session
SESSION=$(curl -sk -X POST https://vcenter.lab.local/api/session \
-H "Authorization: Basic $(echo -n 'administrator@vsphere.local:Success01!0909!!' | base64)")
# Hard reset the SDDC Manager VM
curl -sk -X POST "https://vcenter.lab.local/api/vcenter/vm/vm-68/power?action=reset" \
-H "vmware-api-session-id: $SESSION"
Warning: Hard reset is destructive and should only be used when SSH and console are completely unresponsive due to soft lockups. Always prefer graceful restart first.
After hard reset, SDDC Manager takes significantly longer to start under resource contention:
| Service | Port | Normal Startup | Under Load (Nested) |
|---|---|---|---|
| domainmanager | 7200 (HTTP) | 2-3 min | ~37 min |
| operationsmanager | 7300 | 2-3 min | ~30 min |
| lcm | 7400 | 2-3 min | ~25 min |
# SSH to SDDC Manager (once responsive)
ssh vcf@192.168.1.241
# Check if domainmanager port is bound
ss -tlnp | grep 7200
# Check service status
systemctl status domainmanager
systemctl status operationsmanager
# Watch SDDC Manager API health
curl -sk https://localhost/v1/system/health
Critical Note: The domainmanager service uses HTTP on port 7200 (not HTTPS). Running curl -sk https://localhost:7200 will fail with "wrong version number". Always use http://localhost:7200 for direct service health checks.
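A quick hedged check that the port is answering plain HTTP (the exact health path is not assumed here; any HTTP status code shows the listener is up):
# Any HTTP status code (even 404) means the domainmanager listener is up; 000 means nothing is answering
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7200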
# Verify all SDDC Manager services are running
systemctl list-units --type=service --state=running | grep -E 'domain|operations|lcm|common'
# Verify API is responsive
curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json; t=json.load(sys.stdin); print('Token:', t['accessToken'][:20]+'...')"
The VCF Management Components deployment (Fleet, VCF Operations, Collector) was interrupted at step 25 of 28 when the Windows crash occurred.
# Check management components status
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/vcf-management-components | python3 -m json.tool
Task status showed:
Fleet (vm-4014) also experienced a CPU soft lockup and required a hard reset:
curl -sk -X POST "https://vcenter.lab.local/api/vcenter/vm/vm-4014/power?action=reset" \
-H "vmware-api-session-id: $SESSION"
Fleet startup time: Port 8080 took approximately 48 minutes to become available after hard reset.
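Rather than checking by hand, a hedged wait loop such as the one below polls the port until it accepts a TCP connection (IP from the inventory table above; the one-minute interval is arbitrary):
# Poll Fleet's port 8080 once a minute until it opens, timestamping each attempt
while ! timeout 3 bash -c 'exec 3<>/dev/tcp/192.168.1.78/8080' 2>/dev/null; do
  date
  sleep 60
done
echo "Fleet port 8080 is now accepting connections"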
With all management VMs running, the total resource demand caused severe contention. The solution was to temporarily power off non-essential VMs:
# Power off Collector VM (already crashed)
curl -sk -X POST "https://vcenter.lab.local/api/vcenter/vm/vm-4016/power?action=stop" \
-H "vmware-api-session-id: $SESSION"
# Power off Logs VM (not needed for recovery)
curl -sk -X POST "https://vcenter.lab.local/api/vcenter/vm/vm-69/power?action=stop" \
-H "vmware-api-session-id: $SESSION"
Resources freed: 8 vCPU + 24 GB RAM
Lesson Learned: In nested environments with limited resources, prioritize which VMs need to run simultaneously. Power off non-essential VMs during recovery to prevent CPU soft lockups.
After Fleet came back online, its API returned HTTP 500 errors for the deployment request. PostgreSQL investigation revealed the request had already completed:
ssh root@192.168.1.78
sudo -u postgres psql -d vrlcm
# Check the request status
SELECT id, state, requesttype, created, completed
FROM vm_rs_request
WHERE id = 'be0221fd-e620-48f3-8543-eb67b26616b0';
Result: The request was already in COMPLETED state — Fleet's crash recovery had processed it during its long startup.
With Fleet reporting the request as completed, the SDDC Manager deployment task was retried:
curl -sk -X PATCH "https://sddc-manager.lab.local/v1/tasks/a48065d5-1ead-48ea-9d1e-113ae80732d2" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"status":"IN_PROGRESS"}'
After approximately 60 seconds, the task progressed through steps 26, 27, and 28 — all successful. Final status: 28/28 subtasks completed successfully.
{
"vcfOperationsFleetManagement": "SUCCEEDED",
"vcfOperations": "SUCCEEDED",
"vcfOperationsCollector": "SUCCEEDED"
}
VCF Operations (vcf-ops.lab.local) was stuck in INITIALIZATION_FAILED state. The CASA API confirmed:
curl -sk https://192.168.1.77/casa/cluster/status
# Showed: "state": "INITIALIZATION_FAILED"
Root Cause: The unclean shutdown left the Gemfire distributed cache and HSQLDB in an inconsistent state.
Reset procedure:
# SSH to VCF Operations node
ssh root@192.168.1.77
# Stop services
systemctl stop vmware-casa
systemctl stop vmware-vcops-watchdog
# Backup HSQLDB
cp /storage/db/casa/webapp/hsqldb/casa.db.script \
/storage/db/casa/webapp/hsqldb/casa.db.script.bak
# Edit HSQLDB — change initialization state
vi /storage/db/casa/webapp/hsqldb/casa.db.script
# Find: "initialization_state":"FAILED"
# Replace with: "initialization_state":"NONE"
# Clear HSQLDB log file
> /storage/db/casa/webapp/hsqldb/casa.db.log
The admin password hash may have become invalid after the crash:
cat > /storage/vcops/user/conf/adminuser.properties << 'EOF'
#Properties for vCOps user 'admin'
username=admin
hashed_password=
EOF
After cluster initialization, the system regenerates the password hash from the password configured during initial setup.
# Get the SHA1 thumbprint of the local certificate
THUMBPRINT=$(openssl x509 -in /storage/vcops/user/conf/ssl/cert.pem -noout -fingerprint -sha1 \
| sed 's/SHA1 Fingerprint=//')
# Restart services
systemctl start vmware-casa
systemctl start vmware-vcops-watchdog
# Wait for CASA to start, then trigger initialization
curl -sk -X POST https://localhost/casa/cluster/init \
-H "Content-Type: application/json"
# Verify cluster status
curl -sk https://localhost/casa/cluster/status
# Expected: "cluster_state": "INITIALIZED"
# Verify slice is online
curl -sk https://localhost/casa/sysadmin/slice/online_state
# Expected: "onlineState":"ONLINE"
After cluster initialization, both users showed empty roles. Investigation revealed:
The admin user is a super admin with implicit full access; roleNames: [] is by design.
The administrator@vsphere.local user needed the Administrator role explicitly assigned.
# Get authentication token
TOKEN=$(curl -sk -X POST https://192.168.1.77/suite-api/api/auth/token/acquire \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"Success01!0909!!","authSource":"local"}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['token'])")
# Assign Administrator role (CRITICAL: single object, NOT array)
curl -sk -X PUT \
"https://192.168.1.77/suite-api/api/auth/users/<user-id>/permissions" \
-H "Authorization: vRealizeOpsToken $TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{
"roleName": "Administrator",
"allowAllObjects": true,
"traversal-spec-instances": []
}'
Critical: The request body must be a single JSON object with
roleName. Using{"permissions":[{"roleName":"Administrator"}]}will fail with "Role with name: null cannot be found".
Critical: VCF Operations Suite-API uses the auth header format
vRealizeOpsToken <token>— NOTBearer.
| User | roleNames | Actual Access | Notes |
|---|---|---|---|
| admin | [] (empty) | Full admin | Built-in super admin — implicit access by design |
| administrator@vsphere.local | ["Administrator"] | Full admin | Explicitly assigned via permissions API |
The Collector VM (vm-4016) was powered off during resource contention mitigation. After other components stabilized:
# Power on collector
curl -sk -X POST "https://vcenter.lab.local/api/vcenter/vm/vm-4016/power?action=start" \
-H "vmware-api-session-id: $SESSION"
Collector startup observations:
| Phase | Duration |
|---|---|
| Boot to SSH responsive | ~4 minutes |
| Load average during startup | 15.14 on 4 vCPUs |
| Load stabilization | ~30 minutes |
| CASA service fully initialized | ~30 minutes |
After the collector came online, adapters showed COLLECTOR_DOWN status. They needed stop/start cycles:
# For each adapter assigned to the collector (collectorId=2):
# Stop the adapter
curl -sk -X PUT \
"https://192.168.1.77/suite-api/api/adapters/<adapter-id>/monitoringstate/stop" \
-H "Authorization: vRealizeOpsToken $TOKEN"
# Start the adapter
curl -sk -X PUT \
"https://192.168.1.77/suite-api/api/adapters/<adapter-id>/monitoringstate/start" \
-H "Authorization: vRealizeOpsToken $TOKEN"
Important: After stopping and starting an adapter, wait for the collector to actually be responsive. Starting adapters while the collector JVM is still initializing will leave them in a STOPPED state.
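A minimal wait loop (a sketch; the collector id and the exact field names in the /api/collectors response are assumptions to verify against your environment) avoids starting adapters too early:
# Poll the collectors endpoint until the target collector looks up, then cycle adapters
for i in $(seq 1 30); do
  OUT=$(curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" \
    "https://192.168.1.77/suite-api/api/collectors")
  echo "Check $i/30:"
  echo "$OUT" | python3 -m json.tool | grep -iE '"(name|state)"'
  # Assumed field name "state" — adjust to what your response actually contains
  echo "$OUT" | grep -qi '"state" *: *"RUNNING"' && break
  sleep 60
done
# Only then run the adapter stop/start cycles shown above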
The NSX adapter was never auto-created because the VCF adapter's initial auto-discovery had already run before the crash. Manual creation was required.
Step 1: Create NSX credential:
curl -sk -X POST "https://192.168.1.77/suite-api/api/credentials" \
-H "Authorization: vRealizeOpsToken $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "nsx-vip.lab.local",
"adapterKindKey": "NSXTAdapter",
"credentialKindKey": "NSXTCREDENTIAL",
"fields": [
{"name": "USERNAME", "value": "admin"},
{"name": "PASSWORD", "value": "Success01!0909!!"}
]
}'
Note: The credential field names are USERNAME and PASSWORD (uppercase). Using USER will fail with "USERNAME is mandatory".
Step 2: Create NSX adapter instance:
curl -sk -X POST "https://192.168.1.77/suite-api/api/adapters" \
-H "Authorization: vRealizeOpsToken $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "nsx-vip.lab.local",
"description": "NSX Manager",
"adapterKindKey": "NSXTAdapter",
"resourceIdentifiers": [
{"name": "NSXTHOST", "value": "nsx-vip.lab.local"},
{"name": "AUTO_DISCOVERY", "value": "true"},
{"name": "ENABLE_ALERTS_FROM_NSX", "value": "false"},
{"name": "VCURL", "value": "vcenter.lab.local"},
{"name": "VMEntityVCID", "value": "<vcenter-guid>"},
{"name": "NSX_CLUSTER_ID", "value": "<nsx-cluster-id>"}
],
"credential": {"id": "<credential-id>"},
"collectorId": 2
}'
Step 3: Start the adapter and verify (within 60 seconds):
curl -sk -X PUT \
"https://192.168.1.77/suite-api/api/adapters/<adapter-id>/monitoringstate/start" \
-H "Authorization: vRealizeOpsToken $TOKEN"
# Verify
curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" \
"https://192.168.1.77/suite-api/api/adapters/<adapter-id>"
# Expected: numberOfResourcesCollected > 0
| Adapter | Status | Health | Resources |
|---|---|---|---|
| vcenter (VMWARE) | DATA_RECEIVING | GREEN | 33 |
| nsx-vip.lab.local (NSXTAdapter) | DATA_RECEIVING | GREEN | 1+ |
| lab (VcfAdapter) | DATA_RECEIVING | ORANGE | 2 |
| Container | DATA_RECEIVING | GREEN | 43 |
| VCF Operations API (vcf-ops) | DATA_RECEIVING | GREEN | 1 |
| VCF Operations Adapter (vcf-ops) | DATA_RECEIVING | GREEN | 13 |
| VCF Operations Adapter (collector) | DATA_RECEIVING | GREEN | 7 |
| Infrastructure Health (vcf-ops) | DATA_RECEIVING | GREEN | 59 |
| Infrastructure Health (collector) | DATA_RECEIVING | GREEN | 3 |
| Infrastructure Management (vcf-ops) | DATA_RECEIVING | GREEN | 5 |
| Infrastructure Management (collector) | DATA_RECEIVING | GREEN | 7 |
| Configuration Management (collector) | DATA_RECEIVING | GREEN | 0 |
| Diagnostics (vcf-ops) | DATA_RECEIVING | GREEN | 7 |
| Diagnostics (collector) | DATA_RECEIVING | GREEN | 2 |
| Application Monitoring (collector) | DATA_RECEIVING | GREEN | 1 |
| Log Assist (collector) | ERROR | ORANGE | 1 |
Note: Log Assist adapter shows ERROR because the Logs VM was powered off. This resolves when the Logs VM is powered back on.
This section provides a comprehensive, reusable health check procedure that can be applied to any VCF environment. Each subsection covers a specific component with the exact commands and expected outputs.
See also: The standalone document VCF-Environment-Health-Check.md provides this same procedure as a portable runbook.
Before checking VCF components, verify the underlying platform:
# For VMware Workstation nested labs — check host resources
# (run on the Windows host)
systeminfo | findstr /C:"Total Physical Memory" /C:"Available Physical Memory"
wmic cpu get NumberOfCores,NumberOfLogicalProcessors
# For bare metal — check IPMI/iLO/iDRAC for hardware alerts
# For ESXi standalone — check hardware status
esxcli hardware platform get
esxcli system version get
Run on each ESXi host via SSH:
# 1. Basic host info
esxcli system version get
esxcli system hostname get
# 2. Uptime and boot time
esxcli system stats uptime get
# 3. CPU and memory
esxcli hardware cpu global get
esxcli hardware memory get
# 4. NIC status — all NICs should show "Link Status: Up"
esxcli network nic list
# 5. VMkernel interfaces — verify IPs on management, vMotion, vSAN
esxcli network ip interface ipv4 list
# 6. vSwitch health — verify uplinks are assigned
esxcli network vswitch standard list
# 7. Failover policy — ensure no "Unused Adapters"
esxcli network vswitch standard policy failover get -v vSwitch0
# 8. Routing table — verify routes for all subnets
esxcli network ip route ipv4 list
# 9. Services
esxcli system settings advanced list -o /UserVars/SuppressShellWarning
Expected healthy state:
Run on any ESXi host in the cluster:
# 1. Cluster membership — all hosts should be in one sub-cluster
esxcli vsan cluster get
# Key: Sub-Cluster Member Count should equal total host count
# Key: Local Node Health State should be HEALTHY
# 2. Cluster health
esxcli vsan health cluster list
# 3. Unicast agents — should list all cluster members
esxcli vsan cluster unicastagent list
# 4. Disk status
esxcli vsan storage list
# 5. vSAN network connectivity — ping other hosts from vmk2
vmkping -I vmk2 192.168.12.75 -c 3
vmkping -I vmk2 192.168.12.76 -c 3
vmkping -I vmk2 192.168.12.82 -c 3
# 6. Resync status (should show 0 resyncing objects)
esxcli vsan debug resync summary get
# 7. Object health
esxcli vsan debug object health summary get
Expected healthy state:
Via REST API from any machine with network access:
# 1. Get API session
SESSION=$(curl -sk -X POST https://vcenter.lab.local/api/session \
-H "Authorization: Basic $(echo -n 'administrator@vsphere.local:Success01!0909!!' | base64)")
# 2. Check vCenter health status
curl -sk -H "vmware-api-session-id: $SESSION" \
https://vcenter.lab.local/api/appliance/health/system
# 3. Check individual health components
for component in applmgmt database load mem softwarepackages storage swap; do
echo -n "$component: "
curl -sk -H "vmware-api-session-id: $SESSION" \
"https://vcenter.lab.local/api/appliance/health/$component"
echo
done
# 4. List all VMs and their power states
curl -sk -H "vmware-api-session-id: $SESSION" \
https://vcenter.lab.local/api/vcenter/vm | python3 -m json.tool
# 5. Check services (SSH to vCenter appliance)
ssh root@vcenter.lab.local
vmon-cli --list
Expected healthy state:
Via NSX CLI (SSH):
ssh admin@nsx-vip.lab.local
# 1. Cluster status
get cluster status
# 2. Service status
get service
# 3. Interface status
get interface
# 4. Certificate status
get certificate api
Via NSX API:
# 1. Cluster status
curl -sk -u admin:'Success01!0909!!' https://nsx-vip.lab.local/api/v1/cluster/status
# 2. Transport node status
curl -sk -u admin:'Success01!0909!!' https://nsx-vip.lab.local/api/v1/transport-nodes/state
# 3. Alarms
curl -sk -u admin:'Success01!0909!!' https://nsx-vip.lab.local/api/v1/alarms
Expected healthy state:
Via REST API:
# 1. Get auth token
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['accessToken'])")
# 2. System health
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/system | python3 -m json.tool
# 3. Component status
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/nsxt-clusters | python3 -m json.tool
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/vcenters | python3 -m json.tool
# 4. Check for stuck tasks
curl -sk -H "Authorization: Bearer $TOKEN" \
"https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" | python3 -m json.tool
# 5. Check for resource locks
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/resource-locks | python3 -m json.tool
# 6. VCF Management Components
curl -sk -H "Authorization: Bearer $TOKEN" \
https://sddc-manager.lab.local/v1/vcf-management-components | python3 -m json.tool
Via SSH:
ssh vcf@sddc-manager.lab.local
# Service status
systemctl list-units --type=service --state=running | grep -E 'domain|operations|lcm|common'
# Check ports
ss -tlnp | grep -E '7200|7300|7400|443'
# Check disk space
df -h
Expected healthy state:
Via Suite-API:
# 1. Get token
TOKEN=$(curl -sk -X POST https://vcf-ops.lab.local/suite-api/api/auth/token/acquire \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"Success01!0909!!","authSource":"local"}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['token'])")
# 2. Node status
curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" \
https://vcf-ops.lab.local/suite-api/api/deployment/node/status | python3 -m json.tool
# 3. Collector status
curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" \
https://vcf-ops.lab.local/suite-api/api/collectors | python3 -m json.tool
# 4. Adapter status — check all adapters for health
curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" \
https://vcf-ops.lab.local/suite-api/api/adapters | python3 -m json.tool
# 5. Cluster status (CASA API, from localhost)
ssh root@vcf-ops.lab.local
curl -sk https://localhost/casa/cluster/status
curl -sk https://localhost/casa/sysadmin/slice/online_state
Expected healthy state:
Via API:
# 1. Authentication
FLEET_TOKEN=$(curl -sk -X POST https://fleet.lab.local:8080/lcm/authzn/api/login \
-H "Content-Type: application/json" \
-d '{"username":"admin@local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['token'])")
# 2. Environment status
curl -sk -H "Authorization: Bearer $FLEET_TOKEN" \
https://fleet.lab.local:8080/lcm/lcops/api/v2/environments | python3 -m json.tool
# 3. Health check
curl -sk -H "Authorization: Bearer $FLEET_TOKEN" \
https://fleet.lab.local:8080/lcm/health | python3 -m json.tool
Via SSH:
ssh root@fleet.lab.local
# Service status
systemctl status nginx
systemctl status vmware-lcm
# Database status
sudo -u postgres pg_isready
# Port check
ss -tlnp | grep 8080
Expected healthy state:
A ready-to-use bash script that checks all components in one pass (python3 is used inline for JSON parsing):
#!/bin/bash
# VCF Environment Health Check Script
# Usage: bash vcf-health-check.sh
# Prerequisites: curl, python3, SSH access to all components
VCENTER="vcenter.lab.local"
SDDC="sddc-manager.lab.local"
NSX_VIP="nsx-vip.lab.local"
VCF_OPS="vcf-ops.lab.local"
FLEET="fleet.lab.local"
USER="administrator@vsphere.local"
PASS="Success01!0909!!"
ADMIN_PASS="Success01!0909!!" # VCF Ops admin password
echo "=========================================="
echo "VCF Environment Health Check"
echo "Date: $(date)"
echo "=========================================="
# 1. vCenter Health
echo -e "\n--- vCenter Health ---"
SESSION=$(curl -sk -X POST "https://$VCENTER/api/session" \
-H "Authorization: Basic $(echo -n "$USER:$PASS" | base64)" 2>/dev/null | tr -d '"')
if [ -n "$SESSION" ] && [ "$SESSION" != "null" ]; then
HEALTH=$(curl -sk -H "vmware-api-session-id: $SESSION" \
"https://$VCENTER/api/appliance/health/system" 2>/dev/null | tr -d '"')
echo "vCenter System Health: $HEALTH"
else
echo "vCenter: UNREACHABLE"
fi
# 2. SDDC Manager Health
echo -e "\n--- SDDC Manager Health ---"
TOKEN=$(curl -sk -X POST "https://$SDDC/v1/tokens" \
-H "Content-Type: application/json" \
-d "{\"username\":\"$USER\",\"password\":\"$PASS\"}" 2>/dev/null \
| python3 -c "import sys,json;print(json.load(sys.stdin).get('accessToken','FAILED'))" 2>/dev/null)
if [ "$TOKEN" != "FAILED" ] && [ -n "$TOKEN" ]; then
echo "SDDC Manager API: HEALTHY (token acquired)"
# Check components
curl -sk -H "Authorization: Bearer $TOKEN" \
"https://$SDDC/v1/vcf-management-components" 2>/dev/null \
| python3 -c "
import sys,json
d=json.load(sys.stdin)
for k,v in d.items():
if isinstance(v,str): print(f' {k}: {v}')
" 2>/dev/null
else
echo "SDDC Manager: UNREACHABLE"
fi
# 3. NSX Health
echo -e "\n--- NSX Manager Health ---"
NSX_STATUS=$(curl -sk -u "admin:$PASS" \
"https://$NSX_VIP/api/v1/cluster/status" 2>/dev/null \
| python3 -c "import sys,json;d=json.load(sys.stdin);print(d.get('control_cluster_status',{}).get('status','UNKNOWN'))" 2>/dev/null)
echo "NSX Cluster Status: $NSX_STATUS"
# 4. VCF Operations Health
echo -e "\n--- VCF Operations Health ---"
OPS_TOKEN=$(curl -sk -X POST "https://$VCF_OPS/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-d "{\"username\":\"admin\",\"password\":\"$ADMIN_PASS\",\"authSource\":\"local\"}" 2>/dev/null \
| python3 -c "import sys,json;print(json.load(sys.stdin).get('token','FAILED'))" 2>/dev/null)
if [ "$OPS_TOKEN" != "FAILED" ] && [ -n "$OPS_TOKEN" ]; then
echo "VCF Operations API: HEALTHY (token acquired)"
# Check adapters
curl -sk -H "Authorization: vRealizeOpsToken $OPS_TOKEN" \
"https://$VCF_OPS/suite-api/api/adapters" 2>/dev/null \
| python3 -c "
import sys,json
d=json.load(sys.stdin)
adapters=d.get('adapterInstancesInfoDto',[])
print(f' Total Adapters: {len(adapters)}')
for a in adapters:
name=a.get('resourceKey',{}).get('name','?')
cs=a.get('adapter-status',{}).get('adapterStatus','?')
print(f' {name}: {cs}')
" 2>/dev/null
else
echo "VCF Operations: UNREACHABLE"
fi
# 5. Fleet Health
echo -e "\n--- Fleet (vRSLCM) Health ---"
FLEET_TOKEN=$(curl -sk -X POST "https://$FLEET:8080/lcm/authzn/api/login" \
-H "Content-Type: application/json" \
-d "{\"username\":\"admin@local\",\"password\":\"$PASS\"}" 2>/dev/null \
| python3 -c "import sys,json;print(json.load(sys.stdin).get('token','FAILED'))" 2>/dev/null)
if [ "$FLEET_TOKEN" != "FAILED" ] && [ -n "$FLEET_TOKEN" ]; then
echo "Fleet API: HEALTHY (token acquired)"
else
echo "Fleet: UNREACHABLE"
fi
echo -e "\n=========================================="
echo "Health Check Complete"
echo "=========================================="
Customization: Replace the hostname/IP variables at the top of the script with values for your environment.
In a nested lab environment with resource contention, Java-based services take significantly longer to start:
| Service | Normal Startup | Under Load (Nested) | Port |
|---|---|---|---|
| SDDC Manager domainmanager | 2-3 min | 37 min | 7200 (HTTP) |
| SDDC Manager operationsmanager | 2-3 min | 30 min | 7300 |
| Fleet LCM backend | 3-5 min | 48 min | 8080 |
| VCF Operations CASA | 2-3 min | 10-15 min | 443 |
| VCF Operations Collector CASA | 2-3 min | 5-10 min | 443 |
| NSX Manager services | 3-5 min | 10-15 min | 443 |
Rule of Thumb: In nested environments, expect startup times to be 5-10x longer than normal. Do not assume a service has failed — check CPU load and be patient.
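One way to apply that rule (a sketch, not part of the original runbook): poll a health endpoint and the load average instead of restarting prematurely. Here the target is the SDDC Manager domainmanager health URL on port 7200 (plain HTTP, per the pitfalls table below):
# Run on the SDDC Manager appliance; retries for up to ~60 minutes
URL="http://localhost:7200/health"
for i in $(seq 1 60); do
  LOAD=$(cut -d' ' -f1 /proc/loadavg)
  CODE=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
  echo "Attempt $i/60: HTTP $CODE (load: $LOAD)"
  [ "$CODE" = "200" ] && { echo "domainmanager is answering"; break; }
  sleep 60
done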
| Pitfall | Wrong | Correct |
|---|---|---|
| VCF Ops auth header | Authorization: Bearer <token> | Authorization: vRealizeOpsToken <token> |
| SDDC Manager internal port | https://localhost:7200 | http://localhost:7200 |
| VCF Ops permissions body | {"permissions":[{"roleName":"Admin"}]} | {"roleName":"Administrator","allowAllObjects":true} |
| NSX credential field | {"name":"USER","value":"admin"} | {"name":"USERNAME","value":"admin"} |
| Bash ! in passwords | password="Success01!" | Use heredoc or single quotes |
| Gemfire cache after init | Querying roles immediately | Wait 5-10 minutes for cache to populate |
| Component | IP Address | FQDN | Role |
|---|---|---|---|
| DNS / AD Server | 192.168.1.230 | dc.lab.local | DNS, NTP, Active Directory (lab.local) |
| vCenter Server | 192.168.1.69 | vcenter.lab.local | vSphere management |
| NSX VIP | 192.168.1.70 | nsx-vip.lab.local | NSX Manager cluster VIP |
| NSX Node 1 | 192.168.1.71 | nsx-node1.lab.local | NSX Manager node |
| ESXi Host 1 | 192.168.1.74 | esxi01.lab.local | Compute host |
| ESXi Host 2 | 192.168.1.75 | esxi02.lab.local | Compute host |
| ESXi Host 3 | 192.168.1.76 | esxi03.lab.local | Compute host |
| VCF Operations | 192.168.1.77 | vcf-ops.lab.local | Monitoring / Fleet Management UI |
| Fleet (Cloud Proxy) | 192.168.1.78 | fleet.lab.local | VCF Operations data collector |
| Collector | 192.168.1.79 | collector.lab.local | Operations Collector |
| ESXi Host 4 | 192.168.1.82 | esxi04.lab.local | Compute host |
| Automation | 192.168.1.90 | automation.lab.local | VCF Automation (if deployed) |
| Aria Lifecycle | 192.168.1.94 | aria-lifecycle.lab.local | Lifecycle Manager |
| SDDC Manager | 192.168.1.241 | sddc-manager.lab.local | VCF orchestration and lifecycle |
| NSX Manager (SDDC registered) | 192.168.1.70 | nsx-manager.lab.local | FQDN used by SDDC Manager for NSX |
Forward (A) records required in lab.local zone:
vcenter A 192.168.1.69
nsx-vip A 192.168.1.70
nsx-node1 A 192.168.1.71
nsx-manager A 192.168.1.70
esxi01 A 192.168.1.74
esxi02 A 192.168.1.75
esxi03 A 192.168.1.76
vcf-ops A 192.168.1.77
fleet A 192.168.1.78
collector A 192.168.1.79
esxi04 A 192.168.1.82
automation A 192.168.1.90
aria-lifecycle A 192.168.1.94
sddc-manager A 192.168.1.241
Reverse (PTR) records required in 1.168.192.in-addr.arpa zone:
69 PTR vcenter.lab.local.
70 PTR nsx-vip.lab.local.
71 PTR nsx-node1.lab.local.
74 PTR esxi01.lab.local.
75 PTR esxi02.lab.local.
76 PTR esxi03.lab.local.
77 PTR vcf-ops.lab.local.
78 PTR fleet.lab.local.
82 PTR esxi04.lab.local.
241 PTR sddc-manager.lab.local.
Entries NOT needed for Simple Mode: nsx-node2, nsx-node3, vcf-ops-rep, vcf-ops-data, vcf-ops-lb, automation-node1/2/3/4, automation-upgrade.
| Component | Username | Password / Notes |
|---|---|---|
| ESXi Hosts | root | Set during installation |
| vCenter SSO | administrator@vsphere.local | Set during deployment |
| SDDC Manager UI | admin@local | Set during deployment |
| SDDC Manager SSH | vcf | Only user that can SSH; root via su - |
| NSX Manager admin | admin | Set during OVA deployment |
| NSX Manager audit | audit | Set during OVA deployment |
| NSX Manager root | root | Set during OVA deployment |
| VCF Operations | admin | Set during OVA deployment |
| Java Keystore | N/A | changeit |
| VCF Trust Store | N/A | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key |
| Cloud Builder SSH | root | vmware (default) |
| VM | vCPU | RAM | Storage (Actual) | Deployed By |
|---|---|---|---|---|
| ESXi Host (x4) | 32 | 48 GB | ~400 GB each (local) | VMware Workstation |
| NSX Manager | 6 | 32 GB | vSAN (thin) | Manual (ovftool) |
| vCenter Server | 4 | 19 GB | vSAN | VCF Installer |
| SDDC Manager | 4 | 16 GB | vSAN (thin, ~108 GB used) | VCF Installer bringup |
| VCF Operations | 2 | 8 GB | vSAN (thin) | Manual (ovftool) |
| Fleet (Cloud Proxy) | 2 | 4 GB | vSAN (thin) | VCF Operations Lifecycle |
Physical host: Dell Precision 7920, 35-core CPU, 192 GB RAM, D: 2TB SSD, E: 2TB SSD, 2x 4TB HDD.
| VMkernel | Subnet | TCP/IP Stack | Purpose |
|---|---|---|---|
| vmk0 | 192.168.1.0/24 | defaultTcpipStack | Management + NSX TEP (overlay) |
| vmk1 | 192.168.11.0/24 | vmotion | vMotion |
| vmk2 | 192.168.12.0/24 | defaultTcpipStack | vSAN |
| vmk50 | 169.254.0.0/16 | hyperbus | NSX Hyperbus (internal, auto-created) |
Per-host VMkernel IP addresses:
| Host | vmk0 (Mgmt/TEP) | vmk1 (vMotion) | vmk2 (vSAN) |
|---|---|---|---|
| esxi01 | 192.168.1.74 | 192.168.11.121 | 192.168.12.121 |
| esxi02 | 192.168.1.75 | 192.168.11.120 | 192.168.12.120 |
| esxi03 | 192.168.1.76 | 192.168.11.122 | 192.168.12.122 |
| esxi04 | 192.168.1.82 | 192.168.11.123 | 192.168.12.123 |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 22 | TCP | Admin workstation | ESXi, vCenter, SDDC Mgr, NSX | SSH access |
| 53 | TCP/UDP | All components | DNS server | DNS resolution |
| 80 | TCP | Browsers | vCenter | HTTP redirect to HTTPS |
| 123 | UDP | All components | NTP server | Time synchronization |
| 443 | TCP | Browsers, SDDC Mgr | vCenter, NSX, ESXi, SDDC Mgr | HTTPS management UI and API |
| 902 | TCP | vCenter | ESXi hosts | VMware authentication / NFC |
| 5480 | TCP | Admin workstation | vCenter | VAMI (appliance management) |
| 5432 | TCP | SDDC Mgr (internal) | PostgreSQL | Database connectivity |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 2233 | TCP | ESXi hosts | ESXi hosts | vSAN transport |
| 12345-23451 | UDP | ESXi hosts | ESXi hosts | vSAN cluster service (CMMDS, RDT) |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 443 | TCP | Admin, SDDC Mgr | NSX Manager | NSX UI and API |
| 1234 | TCP | ESXi hosts | NSX Manager | NSX agent to manager communication |
| 1235 | TCP | NSX Manager | NSX Manager | NSX cluster inter-node |
| 6081 | UDP | ESXi hosts | ESXi hosts | GENEVE overlay encapsulation |
| 8080 | TCP | NSX Manager | NSX Manager | Internal cluster HTTP |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 443 | TCP | Browsers | VCF Operations | Operations UI and API |
| 443 | TCP | Cloud Proxy | VCF Operations | Fleet management data |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 443 | TCP | Browsers, VCF Ops | SDDC Manager | SDDC Manager UI and API |
| 22 | TCP | Admin workstation | SDDC Manager | SSH (vcf user only) |
| 5432 | TCP | Internal | SDDC Manager | PostgreSQL database |
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 8000 | TCP | ESXi hosts | ESXi hosts | vMotion traffic |
| 8443 | TCP | SDDC Manager | Offline depot | Custom HTTPS offline depot |
| 111 | TCP | ESXi hosts | NFS server | NFS portmapper |
| 2049 | TCP | ESXi hosts | NFS server | NFS file system |
SDDC Manager logs:
| Log Path | Description |
|---|---|
| /var/log/vmware/vcf/domainmanager/domainmanager.log | Domain Manager main log (deployments, tasks, domain operations) |
| /var/log/vmware/vcf/domainmanager/domainmanager-gc.log | Domain Manager garbage collection log |
| /var/log/vmware/vcf/lcm/lcm.log | Lifecycle Manager log (upgrades, patching, bundles) |
| /var/log/vmware/vcf/lcm/upgrade/ | Upgrade-specific logs directory |
| /var/log/vmware/vcf/operationsmanager/operationsmanager.log | Operations Manager log |
| /var/log/vmware/vcf/operationsmanager/operationsmanager-gc.log | Operations Manager GC log |
| /var/log/vmware/vcf/sos/sos.log | SoS utility log |
| /var/log/vmware/vcf/commonsvcs/commonsvcs.log | Common services log (certificates, trust store) |
| /var/log/vmware/vcf/sddc-support/sddc-support.log | Support bundle collection log |
| /var/log/vmware/vcf/vdt/vdt-<timestamp>.txt | VCF Diagnostic Tool results |
| /var/log/nginx/error.log | NGINX reverse proxy error log |
| /var/log/nginx/access.log | NGINX access log |
| /var/log/postgresql/postgresql-*.log | PostgreSQL database logs |
vCenter Server logs:
| Log Path | Description |
|---|---|
| /var/log/vmware/vpxd/vpxd.log | Main vCenter Server daemon log |
| /var/log/vmware/vsphere-client/logs/vsphere_client_virgo.log | vSphere Client (legacy) log |
| /var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log | vSphere UI log |
| /var/log/vmware/vpostgres/postgresql*.log | vCenter PostgreSQL database logs |
| /var/log/vmware/sso/vmware-sts-idmd.log | SSO / Lookup service log |
| /var/log/vmware/eam/eam.log | ESX Agent Manager log |
| /var/log/vmware/content-library/cls.log | Content Library service log |
| /var/log/vmware/vlcm/vlcm.log | vSphere Lifecycle Manager log |
ESXi host logs:
| Log Path | Description |
|---|---|
| /var/log/vmkernel.log | VMkernel log (storage, network, hardware events) |
| /var/log/hostd.log | Host daemon log (management operations, VM power) |
| /var/log/vpxa.log | vCenter agent log (host-to-vCenter communication) |
| /var/log/nsx-syslog.log | NSX agent log on ESXi hosts |
| /var/log/fdm.log | Fault Domain Manager (HA) log |
| /var/log/vobd.log | VMkernel Observation log (events, alarms) |
| /var/log/esxupdate.log | ESXi patching and update log |
| /var/log/vmkwarning.log | VMkernel warning messages |
| /var/log/shell.log | ESXi shell command history |
| /var/log/auth.log | Authentication and SSH log |
NSX Manager logs:
| Log Path | Description |
|---|---|
| /var/log/proton/nsxapi.log | NSX API service log |
| /var/log/proton/nsx-management-plane.log | NSX management plane log |
| /var/log/corfu/corfu.log | Corfu distributed database log |
| /var/log/syslog | General system log |
| /config/cluster-manager/ | Cluster manager configuration and certificates |
VCF Operations logs:
| Log Path | Description |
|---|---|
| /storage/log/vcops/ | VCF Operations main log directory |
| /storage/log/vcops/web/ | Web UI logs |
| /storage/log/vcops/analytics/ | Analytics engine logs |
| Issue Category | Primary Logs to Check | Secondary Logs |
|---|---|---|
| VCF Task Failures | domainmanager.log, lcm.log | operationsmanager.log |
| Deployment Issues | domainmanager.log, lcm.log | commonsvcs.log |
| vCenter Connectivity | vpxd.log, vpxa.log | hostd.log |
| VM Power Issues | hostd.log | vpxd.log, vmkernel.log |
| Network / Connectivity | vmkernel.log, nsx-syslog.log | vpxa.log |
| vSAN Storage | vmkernel.log (grep vsan) | hostd.log |
| Certificate Errors | commonsvcs.log | vpxd.log, domainmanager.log |
| Authentication / SSO | vmware-sts-idmd.log | vpxd.log |
| NSX Transport Nodes | nsx-syslog.log, nsxapi.log | vmkernel.log |
| Bundle Download / LCM | lcm.log | nginx/error.log |
| Database Issues | postgresql-*.log | domainmanager.log |
| VCF Diagnostic Tool | /var/log/vmware/vcf/vdt/vdt-<timestamp>.txt | N/A |
Log analysis commands:
# Real-time log monitoring
tail -f /var/log/vmware/vcf/domainmanager/domainmanager.log
# Search for errors in a log file
grep -i error /var/log/vmware/vcf/domainmanager/domainmanager.log | tail -50
# Search for exceptions
grep -i exception /var/log/vmware/vcf/lcm/lcm.log | tail -20
# Filter by date
grep "2026-02-12" /var/log/vmware/vcf/domainmanager/domainmanager.log
# Search compressed/rotated logs
zgrep -i error /var/log/vmware/vcf/domainmanager/domainmanager.log.gz
# Search for specific task ID
grep "<task-id>" /var/log/vmware/vcf/lcm/lcm.log
# View systemd journal for a service
journalctl -u vcf-services -f
# View journal errors from last hour
journalctl -u vcf-services --since "1 hour ago" -p err
| Term | Definition |
|---|---|
| ABX | Action-Based Extensibility -- custom actions triggered by events in VCF Automation |
| BOM | Bill of Materials -- component version and build number list for a VCF release |
| CMMDS | Cluster Monitoring, Membership, and Directory Service (vSAN internal) |
| CNI | Container Network Interface -- Kubernetes networking plugin (Antrea is default for VKS) |
| CSI | Container Storage Interface -- allows storage providers to expose persistent volumes to Kubernetes |
| DFW | Distributed Firewall -- NSX micro-segmentation applied at the VM vNIC level |
| DRS | Distributed Resource Scheduler -- automatic VM placement and load balancing across hosts |
| ESA | Express Storage Architecture -- vSAN single-tier NVMe-only storage (VCF 9.0+) |
| EVC | Enhanced vMotion Compatibility -- CPU feature masking for mixed-generation clusters |
| FIPS | Federal Information Processing Standards -- cryptographic compliance mode (mandatory in VCF 9.0) |
| FTT | Failures to Tolerate -- vSAN data protection level (1, 2, or 3 failures) |
| GENEVE | Generic Network Virtualization Encapsulation -- NSX overlay tunnel protocol (~54 bytes overhead) |
| HA | High Availability -- automatic VM restart on host failure |
| HCL | Hardware Compatibility List -- VMware-certified hardware for vSAN and ESXi |
| LCM | Lifecycle Management -- patching, upgrading, and maintaining VCF components |
| NSX | VMware's software-defined networking and security platform |
| NTP | Network Time Protocol -- time synchronization (critical for VCF certificate and cluster operations) |
| OSA | Original Storage Architecture -- vSAN with cache+capacity disk groups |
| OVA | Open Virtual Appliance -- packaged VM template for deployment |
| PEM | Privacy Enhanced Mail -- Base64-encoded certificate format used by all VCF components |
| PSC | Platform Services Controller -- SSO and certificate authority (embedded in vCenter 9.0) |
| SAN | Subject Alternative Name -- certificate field listing valid hostnames and IPs |
| SDDC | Software-Defined Data Center -- the complete VCF infrastructure stack |
| SOS | Supportability and Serviceability -- SDDC Manager diagnostic and log bundle utility |
| TEP | Tunnel Endpoint -- overlay network encapsulation point on each ESXi host (uses GENEVE) |
| TKG | Tanzu Kubernetes Grid -- VMware's Kubernetes distribution for vSphere |
| VCF | VMware Cloud Foundation -- unified private cloud platform |
| VDS | vSphere Distributed Switch -- centrally managed virtual switch across multiple hosts |
| VDT | VCF Diagnostic Tool -- read-only Python health check tool (download from Broadcom KB 344917) |
| VIB | vSphere Installation Bundle -- ESXi software package format |
| vLCM | vSphere Lifecycle Manager -- ESXi image-based lifecycle management (replaces baselines in 9.0) |
| VPC | Virtual Private Cloud -- isolated network environment in VCF Automation |
| VTEP | Virtual Tunnel Endpoint -- same as TEP; virtual interface for overlay encapsulation |
| VKS | VMware Kubernetes Service -- managed Kubernetes clusters on VCF |
NSX Manager sizing for nested environments: 32 GB RAM and 6 vCPU is the working minimum in this lab. 16 GB (the documented minimum) results in OOM kills and 24 GB still crashes under load; see Issue #8 in the undocumented issues list.
SDDC Manager deployment timeout loop: Manual ovftool deployment bypasses SDDC Manager's timeout thresholds that are not suited for nested environments. SDDC Manager will delete and retry timed-out deployments in an infinite loop.
vhv.enable ghost setting:
The vhv.enable setting can persist in a VM's runtime state (vmware.log DICT) even when it is not present in the VMX file. This causes vMotion to fail with "Configuration mismatch: snapshot was taken with VHV enabled." Fix by explicitly adding vhv.enable = "FALSE" to the VMX file.
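To confirm whether the ghost value is present before fixing it, a quick check (sketch; the datastore path and VMX filename are examples) is to compare the runtime DICT in vmware.log with the on-disk VMX:
# On the ESXi host that owns the VM (example path and filename)
VMDIR="/vmfs/volumes/vsanDatastore/sddc-manager"
grep -i "vhv.enable" "$VMDIR/vmware.log"        # runtime DICT may still show TRUE
grep -i "vhv.enable" "$VMDIR"/*.vmx || echo "not set in VMX"
# With the VM powered off, add the explicit override
echo 'vhv.enable = "FALSE"' >> "$VMDIR/sddc-manager.vmx"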
Hot vMotion memory convergence: In nested environments, hot vMotion frequently fails because memory convergence cannot complete within the timeout. Use cold migration (power off, relocate, power on) as a reliable fallback.
NSX nested boot storm: After power-on, NSX Manager runs 12+ Java processes on 6 vCPUs, causing load averages of 30-100+ for 30-60 minutes. The VIP won't come online until load settles below ~20. Do NOT add more vCPUs — co-scheduling overhead makes it worse. Credential operations attempted during this window will fail and can trigger the cascade failure described in Section 7.2.6.
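A minimal way to track boot-storm recovery without the Python toolkit (sketch; assumes SSH access to the node at the Appendix A address, ideally with keys so you are not prompted every minute):
# Watch NSX Manager load and cluster state once a minute
while true; do
  echo "=== $(date) ==="
  ssh root@192.168.1.71 "uptime"                          # wait for load to drop below ~20
  ssh admin@192.168.1.71 "get cluster status" | head -5   # wait for STABLE
  sleep 60
done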
vSAN network latency: Nested vSAN will always show yellow on network latency health checks (typically 5-7ms vs 5ms threshold). This is normal for virtualized NICs in VMware Workstation and does not affect functionality.
VMware Workstation VMX settings required:
vhv.enable = "TRUE" # Nested virtualization
vpmc.enable = "TRUE" # Virtual Performance Counters
vvtd.enable = "TRUE" # Virtual Intel VT-d
ethernet0.noPromisc = "FALSE" # Allow nested VM traffic
sata0:0.virtualSSD = "1" # Mark disks as SSD for vSAN
Windows host prerequisite:
Must disable Hyper-V (bcdedit /set hypervisorlaunchtype off) and reboot before VMware Workstation can pass VT-x to nested ESXi VMs.
SDDC Manager SSH access:
Only the vcf user can SSH in (root and admin are rejected). Root access is via su - from a vcf session. SCP does not work due to the restricted shell; use ssh vcf@host "cat > file" < localfile for file transfers.
SDDC Manager vcf account lockout:
Failed SSH attempts (including from automated scripts) lock the vcf account quickly. SDDC Manager uses faillock (not pam_tally2). Unlock from console as root: faillock --user vcf --reset. If ALL accounts are locked, boot into GRUB single-user mode with init=/bin/bash.
SDDC Manager PostgreSQL access:
PostgreSQL uses TCP on 127.0.0.1 (not Unix sockets — always use -h 127.0.0.1 with psql). Data directory is /data/pgdata. Password is not easily discoverable — use the temporary trust auth workaround in pg_hba.conf (always restore immediately after). Always use PAGER=cat to prevent pager traps in remote sessions. Key databases: platform (nsxt, lock, task_metadata, task_lock tables), operationsmanager (task, execution, processing_task tables).
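A typical read-only inspection session following those rules (sketch; enable the temporary trust line first if psql still prompts for a password):
# On SDDC Manager (ssh as vcf, then su -)
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT table_name FROM information_schema.tables WHERE table_schema='public';\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT count(*) FROM task_metadata WHERE resolved = false;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c 'SELECT * FROM lock;'"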
SDDC Manager credential rotation cascade failure:
A failed credential rotation (e.g., NSX unreachable during boot storm) leaves the resource stuck in ACTIVATING or ERROR state in the platform.nsxt table, stale exclusive locks in platform.lock, and unresolved tasks piling up in platform.task_metadata (resolved=false). All future credential operations are blocked even after the target component recovers. The API cannot cancel stuck tasks (TA_TASK_CAN_NOT_BE_RETRIED). Fix: 6-step database repair — (1) enable trust auth, (2) fix nsxt status to ACTIVE, (3) delete stale locks, (4) mark task_metadata resolved=true + clear task_lock, (5) restore pg_hba.conf, (6) restart operationsmanager. See Section 7.2.6.
NSX admin CLI:
DNS and NTP are configured via set name-servers / set ntp-servers commands in the admin CLI, NOT through the NSX UI.
NSX shell limitations: No backslash line continuation is supported. All curl commands and other multi-argument commands must be written on a single line.
NSX certificate SAN requirements:
The SAN must include nsx-manager.lab.local (the FQDN registered in SDDC Manager for NSX), not just nsx-node1.lab.local. Without it, VDT reports a SAN check failure.
NSX certificate trust stores: After replacing NSX self-signed certificates, import the new cert into both:
- /etc/vmware/vcf/commonsvcs/trusted_certificates.store (password in .key file)
- /etc/alternatives/jre/lib/security/cacerts (password: changeit)
Then restart SDDC Manager services. Reference: Broadcom KB 316056.
vSAN thick-to-thin migration:
vCenter's migration wizard cannot thin-provision to vSAN. Use vmkfstools -i <src> <dst> -d thin per disk to convert thick-provisioned VMDKs to thin.
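Applied to a whole VM, a per-disk loop (sketch; source and destination paths are examples) looks like:
# On the ESXi host: clone each descriptor VMDK to thin, skipping the -flat/-delta extents
SRC="/vmfs/volumes/local-ds/sddc-manager"
DST="/vmfs/volumes/vsanDatastore/sddc-manager"
for disk in "$SRC"/*.vmdk; do
  case "$disk" in *-flat.vmdk|*-delta.vmdk) continue;; esac
  vmkfstools -i "$disk" -d thin "$DST/$(basename "$disk")"
done
# Afterwards: edit the .vmx to reference the new paths and re-register the VM in vCenter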
VDT is not pre-installed:
Must be downloaded from Broadcom KB 344917 and uploaded to SDDC Manager manually via the ssh cat method.
Aria Lifecycle OVF properties:
Use ovftool <ova> to probe the OVA and discover correct property names. The property format is NOT always vami.ip0.VCF_OPS_Management_Appliance -- it varies by appliance version.
ovftool single-line commands:
On VCF Installer / SDDC Manager, use single-line ovftool commands. Backslash continuation and --noSSLVerify can break depending on how commands are pasted.
NSX 9.0 TEP on vmk0: Use the "Use VMkernel Adapter" option in the Transport Node Profile IPv4 Assignment to reuse vmk0 for overlay traffic. This eliminates the need for a dedicated TEP VLAN in lab environments.
VCF 9.0.1 vSAN ESA HCL bypass:
Add vsan.esa.sddc.managed.disk.claim=true to /etc/vmware/vcf/domainmanager/application-prod.properties and restart domainmanager before running the VCF Installer wizard.
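On the appliance this amounts to the following (sketch, run as root on the VCF Installer / SDDC Manager):
echo 'vsan.esa.sddc.managed.disk.claim=true' >> /etc/vmware/vcf/domainmanager/application-prod.properties
grep 'vsan.esa' /etc/vmware/vcf/domainmanager/application-prod.properties   # confirm the property landed
systemctl restart domainmanager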
NFS mount ownership:
If VDT reports NFS mount ownership failure, fix with chown root:vcf /nfs/vmware/vcf/nfs-mount/. Reference: Broadcom KB 392923.
VCF Upgrade order (always follow this sequence):
| Component | Version | Build Number |
|---|---|---|
| vCenter Server | 9.0.1.0 | 24957454 |
| ESXi | 9.0.1.0 | 24957456 |
| NSX Manager | 9.0.1.0 | 24952111 |
| SDDC Manager | 9.0.1.0 | 24962180 |
| VCF Operations | 9.0.1.0 | 24960351 |
| Fleet Management | 9.0.1.0 | 24960371 |
| Automation | 9.0.1.0 | 24965341 |
| Operations Collector | 9.0.1.0 | 24960349 |
| Component | Certificate Path | Key Path |
|---|---|---|
| ESXi SSL | /etc/vmware/ssl/rui.crt | /etc/vmware/ssl/rui.key |
| vCenter | /etc/vmware-vpx/ssl/ | /etc/vmware-vpx/ssl/ |
| SDDC Manager | /etc/vmware/vcf/commonsvcs/ | /etc/vmware/vcf/commonsvcs/ |
| NSX Manager | /config/cluster-manager/ | /config/cluster-manager/ |
| VCF Trust Store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store | Password in trusted_certificates.key |
| Java Cacerts | /etc/alternatives/jre/lib/security/cacerts | Password: changeit |
20 Python diagnostic scripts for VCF 9.0.1 nested lab troubleshooting. All use Paramiko for SSH and run from a Windows workstation (pip install paramiko).
| Target | IP | User | Purpose |
|---|---|---|---|
| SDDC Manager | 192.168.1.241 | vcf | API gateway, database access (su to root) |
| NSX Node | 192.168.1.71 | root | Direct NSX service management |
| NSX VIP | 192.168.1.70 | admin | NSX cluster API (via curl from SDDC Mgr) |
| Scenario | Script |
|---|---|
| Is everything healthy? | python quick_status.py |
| NSX slow after boot? | python nsx_monitor.py |
| Credential operation failed? | python check_remediate_error.py |
| Need to update NSX password? | python nsx_cred_update.py |
| NSX CPU overloaded? | python nsx_slim.py |
| Put NSX services back? | python nsx_restart_all.py |
| Clear stale DB locks? | python clear_locks.py |
| Fix stuck tasks in DB? | python fix_stuck_tasks.py |
| Full cascade fix? | python full_remediate_fix.py |
| System clean after fix? | python final_check.py |
Status & Health Checks (Read-Only):
| Script | Connects To | What It Does |
|---|---|---|
| quick_status.py | SDDC Manager | Start here. NSX status, VIP health, resource locks, notifications, credentials |
| final_check.py | SDDC Manager | Lightweight: notifications and resource locks only |
| diag.py | localhost | DNS resolution, TCP 443 connectivity, ARP/routing from Windows host |
| nsx_monitor.py | NSX Node | Polls cluster status + load avg every 60s for 10 iterations |
NSX Diagnostics (Read-Only):
| Script | Connects To | What It Does |
|---|---|---|
| nsx_check.py | SDDC Manager | Tests both NSX VIP and direct node connectivity — diagnoses VIP failover issues |
| nsx_diag.py | NSX Node | Top CPU consumers, disk space, service health via API, catalina errors |
| nsx_resource_check.py | SDDC Manager | NSX clusters, credentials, warnings, DB resource state |
| sddc_nsx_status.py | SDDC Manager | Compares SDDC Manager's NSX status vs actual NSX VIP cluster status |
Credential Operations:
| Script | Modifies | What It Does |
|---|---|---|
| nsx_cred_update.py | Yes | Full workflow: health checks, lists credentials, updates admin API, monitors 200s |
| nsx_retry_when_ready.py | Yes | Waits up to 15 min for NSX API, then submits update with 450s monitoring |
| check_disconnected.py | No | Inspects all credential objects for connection status fields |
| check_remediate_error.py | No | Failed task details with full error messages, NSX connectivity test, log search |
NSX Service Management:
| Script | Action | What It Does |
|---|---|---|
| nsx_slim.py | Stops | Stops 5 non-essential services to free CPU during boot storm |
| nsx_restart_all.py | Starts | Restarts all services stopped by nsx_slim.py |
| nsx_fix_svc.py | Restarts | Restarts search, nsx-sha, nsx-appl-proxy, validates health |
Database Fixes (Modify SDDC Manager PostgreSQL):
| Script | What It Does |
|---|---|
| clear_locks.py | Fixes NSX status (ACTIVATING/ERROR → ACTIVE), clears lock table, restarts operationsmanager |
| fix_stuck_tasks.py | Marks stuck task_metadata as resolved, clears task_lock, fixes execution_to_task orphans |
| full_remediate_fix.py | Complete cascade fix: NSX health check + DB fix (status + locks + tasks) + service restart |
| find_pg_pass.py | Searches for PostgreSQL password in config files (read-only) |
| get_task.py | Retrieves task details by ID with subtask errors (edit task_id before running) |
WARNING: Do not run credential update scripts if NSX status is not ACTIVE in SDDC Manager or STABLE at the VIP. A failed update creates stale locks and stuck tasks, requiring database repair.
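A quick manual pre-check before any credential update (sketch, reusing endpoints shown earlier in the health check section; inspect the JSON rather than trusting a grep):
# 1. SDDC Manager's view of NSX: look for an ACTIVE/healthy state
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
  -H "Content-Type: application/json" \
  -d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['accessToken'])")
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://sddc-manager.lab.local/v1/nsxt-clusters | python3 -m json.tool
# 2. NSX's own view: overall cluster status should be STABLE
curl -sk -u admin:'Success01!0909!!' https://nsx-vip.lab.local/api/v1/cluster/status \
  | python3 -m json.tool | grep -i status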
python quick_status.py # 1. Overall health
python nsx_check.py # 2. VIP + node connectivity
python nsx_diag.py # 3. Performance & services
python sddc_nsx_status.py # 4. SDDC Manager vs NSX sync
python nsx_slim.py # Free CPU (if load > 30)
# Wait for load to drop below 15
python nsx_restart_all.py # Bring services back
python nsx_check.py # Verify cluster health
Problem: Credential operation failed
|
+-- python quick_status.py
| |
| +-- NSX Status = ACTIVATING or ERROR?
| | +-- python clear_locks.py (fix DB status + locks)
| | +-- python fix_stuck_tasks.py (resolve stuck tasks)
| | +-- OR: python full_remediate_fix.py (all-in-one)
| |
| +-- NSX VIP returning 503?
| | +-- python nsx_diag.py (check load)
| | +-- Load > 30? -> python nsx_slim.py (free CPU)
| | +-- Wait -> python nsx_monitor.py (track recovery)
| |
| +-- All green?
| +-- python nsx_cred_update.py (retry update)
-- Connect: su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform"
-- Fix NSX status (covers ACTIVATING and ERROR)
UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';
-- Clear stale locks
DELETE FROM lock;
-- Resolve stuck tasks
UPDATE task_metadata SET resolved = true WHERE resolved = false;
DELETE FROM task_lock;
Environment: Dell Precision 7920 (dual Intel Xeon Gold 6140, 192GB RAM, 2x 1TB SSD + 2x 4TB HDD)
Platform: VMware Cloud Foundation 9.0.1 — fully nested in VMware Workstation
Period: January–February 2026
| Component | Details |
|---|---|
| ESXi Hosts | 4 nested ESXi 9.0 hosts (44GB RAM, 8 vCPU each) with nested virtualization |
| vCenter Server | 9.0.1 — deployed via Cloud Builder, embedded PSC |
| SDDC Manager | 9.0.1 — orchestrates full VCF lifecycle |
| NSX Manager | 9.0 — single-node cluster with VIP, 32GB RAM / 6 vCPU |
| vSAN | OSA (Original Storage Architecture) — 4-node cluster with disk groups |
| VCF Operations | Aria Operations 9.0.2 — monitoring and alerting |
| VCF Ops for Logs | 9.0.1 — centralized log collection (vCenter, ESXi, NSX, SDDC Manager) |
| Fleet Management | Cloud proxy for password management and lifecycle |
| Aria Lifecycle | Component deployment orchestration |
| DNS / AD | Windows Server (192.168.1.230) — 14+ forward/reverse records |
| Offline Depot | Python HTTPS server with TLS 1.2+ for air-gapped bundle management |
Full VCF Day 0 → Day 2 lifecycle completed — from bare metal ESXi preparation through Cloud Builder bringup, workload domain configuration, certificate management, monitoring deployment, and ongoing operations.
1. NSX Certificate Chain Failure
2. SDDC Manager Credential Cascade Failure
   - NSX resource stuck in the platform.nsxt table
   - Stale locks in platform.lock blocking all operations; 47 unresolved tasks in platform.task_metadata
3. SDDC Manager Storage Migration (914GB → Thin)
   - vmkfstools -i -d thin (vCenter wizard can't thin-provision to vSAN)
4. vMotion Ghost Setting Failure
   - vhv.enable = "FALSE" in VMX
5. NSX Boot Storm Resource Management
6. VCF Operations for Logs Certificate Mismatch
7. SDDC Manager Deployment Loop
8. VCF Account Lockout Recovery
   - faillock (not pam_tally2)
9. NSX Manager Memory Escalation
10. VDT Compliance Remediation
20 Python Diagnostic Scripts — Remote SSH-based diagnostic toolkit using Paramiko:
| Category | Scripts | Purpose |
|---|---|---|
| Health monitoring | quick_status.py, final_check.py, nsx_monitor.py | Real-time environment health |
| NSX diagnostics | nsx_check.py, nsx_diag.py, nsx_resource_check.py | NSX cluster, services, performance |
| Credential operations | nsx_cred_update.py, nsx_retry_when_ready.py | Automated credential update with health checks |
| Database repair | clear_locks.py, fix_stuck_tasks.py, full_remediate_fix.py | PostgreSQL cascade failure repair |
| Failure analysis | check_remediate_error.py, sddc_nsx_status.py | Deep error diagnosis |
| Service management | nsx_slim.py, nsx_restart_all.py, nsx_fix_svc.py | NSX service load management |
Offline Depot Infrastructure — Python HTTPS server with TLS 1.2+ for air-gapped bundle delivery.
| Document | Pages | Content |
|---|---|---|
| VCF9 Lab Setup Guide | ~45 | Complete 9-phase deployment guide with troubleshooting |
| Troubleshooting Handbook | ~65 | 10 sections covering every failure mode encountered |
| Operations Configuration Handbook | ~55 | 16-phase post-deployment config guide, 19 known issues |
| Command Reference | ~25 | 28-section quick reference organized by topic |
| Interview Cheat Sheet | ~10 | 8-section printable interview prep |
| Offline Depot Handbook | ~15 | Air-gapped depot setup and management |
| Master Bible | ~100 | Consolidated reference across all topics |
| Diagnostic Scripts Cheatsheet | ~5 | Quick reference for all 20 scripts |
| SDDC Manager API Handbook | ~25 | 18-section REST API reference with authentication, endpoints, Python scripts |
All 10 documents available in Markdown, PDF, and HTML formats.
Mapped SDDC Manager's internal PostgreSQL schema (undocumented by Broadcom):
| Database | Key Tables | Purpose |
|---|---|---|
| platform | nsxt | NSX cluster resource status (ACTIVE/ACTIVATING/ERROR) |
| platform | lock | Exclusive operation locks |
| platform | task_metadata | Task resolution tracking (resolved boolean) |
| platform | task_lock | Task-to-lock associations |
| operationsmanager | task (column: state) | Operation tasks |
| operationsmanager | execution (column: execution_status) | Execution tracking |
| operationsmanager | processing_task | Active processing queue |
The following issues have no official Broadcom KB articles, documentation, or known workarounds. All were discovered through independent lab investigation. Each entry includes the exact resolution — no guessing required.
Full Reference: See VCF-Undocumented-Issues-Reference.pdf for the complete copy-paste-ready resolution steps, OpenSSL configs, SQL queries, and API commands for all 35 issues.
Database & Credential Operations (7)
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 1 | SDDC Manager PostgreSQL schema — table names, column names, relationships all unmapped | Cannot troubleshoot credential failures without schema knowledge | ssh vcf@sddc-manager.lab.local → su - → sudo -u postgres psql -h 127.0.0.1 -d platform → SELECT table_name FROM information_schema.tables WHERE table_schema='public'; Key tables: nsxt, lock, task_metadata, task_lock |
| 2 | Credential cascade failure mechanism — failed rotation leaves NSX stuck in ACTIVATING, stale locks, unresolved tasks | All future credential ops blocked; no Broadcom procedure exists | Must fix all 3 tables in sequence: nsxt status → lock table → task_metadata resolved flag (see Issue #4) |
| 3 | API cannot cancel stuck tasks — returns TA_TASK_CAN_NOT_BE_RETRIED, DELETE returns HTTP 500 | Database repair is the only fix path | Direct PostgreSQL repair required — API has no mechanism to fix stuck tasks. See Issue #4 for full procedure |
| 4 | 6-step PostgreSQL repair procedure — must fix nsxt status + locks + tasks together in sequence | Partial fix still fails; all three tables participate in prevalidation | Step 1: Edit /opt/vmware/vcf/commonsvcs/conf/pg_hba.conf — add host all all 127.0.0.1/32 trust above existing lines → systemctl restart postgres. Step 2: sudo -u postgres psql -h 127.0.0.1 -d platform → UPDATE nsxt SET state='ACTIVE' WHERE state='ACTIVATING'; Step 3: DELETE FROM lock; Step 4: UPDATE task_metadata SET resolved=true WHERE resolved=false; Step 5: DELETE FROM task_lock; Step 6: systemctl restart operationsmanager → Revert pg_hba.conf trust line → systemctl restart postgres |
| 5 | PostgreSQL access requires TCP — must use -h 127.0.0.1 (Unix sockets don't work) | psql without -h flag silently fails | Always use: sudo -u postgres psql -h 127.0.0.1 -d platform |
| 6 | Database column naming inconsistencies — state not status, resolved boolean not status enum | Wrong column names = wrong queries = no fix | Use SELECT column_name, data_type FROM information_schema.columns WHERE table_name='nsxt'; to discover correct column names before writing queries |
| 7 | Password not discoverable — must use trust auth workaround in pg_hba.conf | No documented method to obtain PostgreSQL password | Edit /opt/vmware/vcf/commonsvcs/conf/pg_hba.conf → add host all all 127.0.0.1/32 trust as first host line → systemctl restart postgres → connect without password → revert after use |
NSX in Nested/Resource-Constrained Environments (6)
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 8 | 32GB RAM / 6 vCPU minimum — Broadcom docs say 16GB; actual: 16GB=OOM, 24GB=crashes, 32GB=stable | Under-provisioned NSX cascades into all VCF operations | Power off NSX VM → Edit Settings → set RAM to 30-32GB, vCPU to 6 → Power on. In VMware Workstation: edit .vmx file |
| 9 | Boot storm load >100 on 6 cores for 30-60 min is normal; VIP offline until settled | Credential ops during boot storm trigger cascade failure | Wait 30-60 minutes after all VMs power on. Monitor: ssh admin@192.168.1.71 → get cluster status. Do NOT attempt credential operations until cluster status = STABLE |
| 10 | Adding more vCPU is counterproductive — co-scheduling overhead increases load | Intuitive fix actually makes it worse | Keep NSX at 6 vCPU. Reduce contention by staggering VM startups and powering off non-essential VMs during boot |
| 11 | Services take 10-15 min to stabilize after restart; API returns error 101 during stabilization | Premature API calls fail and can trigger retries | After restart service manager / restart service proton, wait 15 minutes before any API calls. Verify: get cluster status → wait for STABLE |
| 12 | NSX admin CLI for DNS/NTP — set name-servers/set ntp-servers, NOT the UI | UI settings don't persist in some nested configs | ssh admin@192.168.1.71 → set name-servers 192.168.1.5 → set ntp-servers 192.168.1.5 → get name-servers / get ntp-servers to verify |
| 13 | TEP on vmk0 — NSX 9.0 "Use VMkernel Adapter" reuses vmk0 as TEP (new in 9.0) | Eliminates need for dedicated TEP VLAN in nested environments | During host transport node config in NSX, select "Use VMkernel Adapter" → choose vmk0. No additional VLAN or vmk needed |
Certificate Management (5)
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 14 | NSX cert SAN must include SDDC Manager's registered FQDN (nsx-manager.lab.local) | VDT fails SAN check; SDDC Manager loses trust in NSX | Create OpenSSL config with [alt_names] section: DNS.1=nsx-manager.lab.local, DNS.2=nsx-vip.lab.local, IP.1=192.168.1.71, IP.2=192.168.1.70 → openssl req -new -nodes -keyout nsx.key -out nsx.csr -config nsx-cert.cnf → openssl x509 -req -days 825 -in nsx.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out nsx.crt -extensions v3_req -extfile nsx-cert.cnf → Import via NSX API using Python for PEM escaping |
| 15 | Two separate trust stores — VCF common services + Java cacerts must both be updated | KB 316056 is incomplete; missing either import = VDT failure | Trust store 1: ssh vcf@sddc-manager.lab.local → /opt/vmware/vcf/commonsvcs/utility/bin/certool --importcert --cert=ca.crt Trust store 2: /usr/java/jre-vmware-17/bin/keytool -importcert -alias nsx-ca -file ca.crt -keystore /usr/java/jre-vmware-17/lib/security/cacerts -storepass changeit -noprompt → systemctl restart domainmanager operationsmanager |
| 16 | Fleet Management cert generator produces wrong SANs | Precheck fails: "hosts in the certificate doesn't match" | Generate cert manually: create OpenSSL config with DNS.1=fleet.lab.local, IP.1=192.168.1.78 → openssl req -new -nodes -keyout fleet.key -out fleet.csr -config fleet-cert.cnf → openssl x509 -req -days 825 -in fleet.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out fleet.crt -extensions v3_req -extfile fleet-cert.cnf → Upload via Fleet UI: Settings → Certificate → Import |
| 17 | VCF Ops for Logs cert generator — same SAN mismatch | Identical pattern to Fleet Management; same OpenSSL workaround | Same procedure as #16 but with Logs hostnames in the OpenSSL config SANs. Upload via Logs appliance UI |
| 18 | Shell can't handle PEM escaping — must use Python for JSON cert payload to NSX API | curl with inline PEM breaks on newlines; no documented alternative | Use Python: cert_pem = open('nsx.crt').read() → key_pem = open('nsx.key').read() → payload = json.dumps({"pem_encoded": cert_pem + key_pem}) → requests.post(url, headers=headers, data=payload, verify=False) |
VCF Operations 9.x Changes (6)
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 19 | Adapter log paths changed — /storage/log/vcops/log/adapters/ (legacy path doesn't exist) | Cannot find logs for adapter troubleshooting | Use: ls /storage/log/vcops/log/adapters/ then tail -f /storage/log/vcops/log/adapters/<adapter-name>/adapter.log |
| 20 | JRE path changed — /usr/java/jre-vmware-17/ (legacy jre-vmware doesn't exist) | Cannot import certs into correct truststore | Use: /usr/java/jre-vmware-17/bin/keytool -importcert -alias <alias> -file cert.crt -keystore /usr/java/jre-vmware-17/lib/security/cacerts -storepass changeit -noprompt |
| 21 | Two separate NSX adapters — VCF section uses VIP, "Aria Admin" uses node FQDN | Both need credentials; Aria Admin works when VIP is down | Update both adapters in VCF Operations UI: Administration → Solutions → NSX → Edit credential for both instances. Aria Admin adapter uses nsx-manager.lab.local, VCF adapter uses nsx-vip.lab.local |
| 22 | System Managed Credential ROTATE doesn't work for NSX | Must uncheck and set manually | In Fleet UI: Settings → Password Management → find NSX entries → uncheck "System Managed" → manually set the password → Save |
| 23 | SSH enable via Admin UI only — console/systemctl won't work | Cannot SSH for troubleshooting without Admin UI access | Navigate to https://vcf-ops.lab.local/admin → login as admin → Administration → SSH → Enable. Cannot be done from console or systemctl |
| 24 | Health adapter silently fails on stale SDDC Manager credential; reboot required | UI stop/start insufficient; must full reboot appliance | Update the credential in VCF Operations UI first, then: ssh root@192.168.1.77 → reboot. Wait 10-15 minutes for full restart. UI adapter stop/start is NOT sufficient |
Infrastructure & Platform (4)
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 25 | vCenter can't thin-provision to vSAN — migration wizard keeps thick provisioning | Must use vmkfstools -i -d thin per disk (914GB → 108GB) | SSH to ESXi host → vmkfstools -i "/vmfs/volumes/source/vm/disk.vmdk" -d thin "/vmfs/volumes/vsanDatastore/vm/disk.vmdk" per disk. Update .vmx to point to new paths. Register new VM in vCenter |
| 26 | vhv.enable ghost setting persists in VM runtime even when absent from VMX file | vMotion fails; must explicitly set FALSE (removing line is not enough) | Power off VM → Edit Settings → VM Options → Advanced → Configuration Parameters → Add vhv.enable = FALSE. Or edit .vmx: add vhv.enable = "FALSE" explicitly |
| 27 | Hot vMotion fails in nested environments — memory convergence timeout | Must use cold migration as fallback | Power off VM → right-click → Migrate → select destination host → complete wizard. Hot migration will time out in nested environments due to memory convergence issues |
| 28 | VDT not pre-installed on SDDC Manager — must download from KB 344917 | Cannot run health checks without manual download | ssh vcf@sddc-manager.lab.local → download VDT from Broadcom KB 344917 → chmod +x vdt-* → ./vdt --domain MANAGEMENT |
Crash Recovery & VCF Operations Suite-API (7) — Discovered March 2026
| # | Discovery | Impact | Resolution |
|---|---|---|---|
| 29 | Suite-API uses vRealizeOpsToken auth header — not Bearer or VMware like every other VMware API | All API calls fail 401 if using standard Bearer format | Always use: Authorization: vRealizeOpsToken <token>. Get token: curl -sk -X POST https://192.168.1.77/suite-api/api/auth/token/acquire -H "Content-Type: application/json" -d '{"username":"admin","password":"Success01!0909!!","authSource":"local"}' |
| 30 | Permissions API requires single JSON object — not wrapped in array or permissions key | Returns "Role with name: null" with no useful error | Use: curl -sk -X PUT "https://192.168.1.77/suite-api/api/auth/users/<user-id>/permissions" -H "Authorization: vRealizeOpsToken $TOKEN" -H "Content-Type: application/json" -d '{"roleName":"Administrator","allowAllObjects":true,"traversal-spec-instances":[]}' |
| 31 | Super admin admin user always shows roleNames: [] — this is by design, not a bug | Wastes time trying to "fix" role assignment | No fix needed — this is by design. The admin user has implicit full access. Do NOT try to assign roles to this account |
| 32 | SDDC Manager domainmanager port 7200 is HTTP (not HTTPS) | curl https://localhost:7200 fails with confusing "wrong version number" | Use HTTP: curl http://localhost:7200/health — NOT https. The external SDDC Manager API on port 443 is HTTPS |
| 33 | NSX adapter credential fields must be uppercase — USERNAME not USER | Fails with "USERNAME is mandatory"; no docs specify field names | Use exact field names: {"name": "USERNAME", "value": "admin"} and {"name": "PASSWORD", "value": "Success01!0909!!"} |
| 34 | Gemfire cache takes 5-10 min after cluster init — roles/users don't appear immediately | Admins conclude data is missing and take unnecessary action | Wait 5-10 minutes after cluster initialization completes. Verify: curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" https://192.168.1.77/suite-api/api/auth/roles — roles will appear once Gemfire cache loads |
| 35 | HSQLDB reset required after unclean shutdown — no automatic recovery for INITIALIZATION_FAILED | VCF Operations completely non-functional; manual fix only | ssh root@192.168.1.77 → systemctl stop vmware-casa vmware-vcops-watchdog → cp /storage/db/casa/webapp/hsqldb/casa.db.script{,.bak} → edit casa.db.script: change "initialization_state":"FAILED" to "initialization_state":"NONE" → > /storage/db/casa/webapp/hsqldb/casa.db.log → clear adminuser.properties hashed_password → systemctl start vmware-casa vmware-vcops-watchdog → curl -sk -X POST https://localhost/casa/cluster/init |
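Issue 35's arrow notation condenses a multi-step recovery; the same sequence as a small shell sketch, assuming the lab's VCF Operations appliance and the CaSA file locations shown in the table:

```bash
#!/bin/bash
# Hedged sketch of the issue 35 HSQLDB reset; run as root on the VCF Operations appliance.
systemctl stop vmware-casa vmware-vcops-watchdog

# Back up the CaSA database script, then flip the failed initialization marker back to NONE
cp /storage/db/casa/webapp/hsqldb/casa.db.script{,.bak}
sed -i 's/"initialization_state":"FAILED"/"initialization_state":"NONE"/' \
  /storage/db/casa/webapp/hsqldb/casa.db.script

# Truncate the HSQLDB transaction log so the edited script file is authoritative
> /storage/db/casa/webapp/hsqldb/casa.db.log

# Per the table above, also clear the hashed_password entry in adminuser.properties
# (path not reproduced here) before restarting services.

systemctl start vmware-casa vmware-vcops-watchdog

# Re-run cluster initialization once CaSA is back up
curl -sk -X POST https://localhost/casa/cluster/init
```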
- **VMware Stack:** VCF 9.0.1, SDDC Manager, NSX 9.0, vSAN OSA, vCenter 9, ESXi 9, VCF Operations, Aria Lifecycle
- **Infrastructure:** Nested virtualization architecture, vSAN disk groups, NSX overlay networking (GENEVE, TEP, transport zones), certificate lifecycle, offline depot management
- **Troubleshooting:** Root cause analysis through cascading failures, SDDC Manager API diagnostics, PostgreSQL database-level repair, log analysis across 6+ component log paths, VDT compliance remediation
- **Automation:** Python/Paramiko remote diagnostics, ovftool CLI deployments, OpenSSL certificate generation, REST API scripting (NSX, SDDC Manager, vCenter)
- **Linux/DB:** PostgreSQL administration (pg_hba.conf, trust auth, SQL repair), systemctl service management, SSH access patterns, keystore management (keytool), faillock account recovery
- **Documentation:** 13 comprehensive technical documents (~430 pages total), all in Markdown/PDF/HTML with professional styling
Use this as your opening when asked "Tell me about your VCF experience":
"Over the past two months, I built a full VMware Cloud Foundation 9.0.1 environment from scratch — four nested ESXi hosts, vCenter, SDDC Manager, NSX, vSAN, and the full VCF Operations stack — all running nested inside VMware Workstation on a single Dell Precision workstation.
What made this valuable wasn't just the deployment — it was the troubleshooting. Nested virtualization amplifies every failure mode you'd see in production, and I hit them all. I diagnosed and resolved over ten major platform issues, including an NSX certificate chain failure where the SAN didn't include SDDC Manager's registered FQDN, a credential cascade failure that required direct PostgreSQL database repair because the API literally cannot cancel stuck tasks, and NSX resource management where I had to figure out that 32GB RAM is the minimum viable config through three rounds of OOM crashes.
In total, I cataloged 35 issues that have no official Broadcom documentation — spanning database administration, NSX sizing, certificate management, VCF Operations 9.x changes, platform constraints, and crash recovery. I mapped the SDDC Manager PostgreSQL database schema independently to understand how the platform, lock, and task tables interact during credential operations. I also performed a full disaster recovery after an unplanned Windows Update crash wiped out the entire environment — recovering vSAN, NSX, SDDC Manager, and VCF Operations from scratch. I built 20 Python diagnostic scripts for remote SSH-based troubleshooting and wrote over 430 pages of technical documentation across 13 documents covering deployment, troubleshooting, operations, API reference, disaster recovery, and health checks — all version-controlled and available in multiple formats."
Then let them ask follow-up questions — each of the 10 problems above is a ready-made STAR story, and the 35 undocumented discoveries are grouped by category if they want to drill into specifics.
Target Role: VMware Cloud Foundation Professional Services Consultant
Q: "Tell me about your experience with VMware Cloud Foundation."
"I've built and managed a complete VCF 9.0.1 environment from the ground up — not just clicking through wizards, but handling the full stack end-to-end. That includes the Cloud Builder deployment, SDDC Manager commissioning, ESXi host preparation, vCenter, vSAN OSA configuration, NSX 9.0 overlay networking, and VCF Operations. The entire environment runs nested in VMware Workstation on a Dell Precision 7920 — dual Xeon Gold 6140, 192GB RAM. I've worked through the entire Day 0 through Day 2 lifecycle — initial bring-up, workload domain creation, certificate management, and ongoing operations. In fact, I cataloged 35 separate issues that have no official Broadcom documentation — spanning database internals, NSX sizing, certificate management, and VCF Operations 9.x changes."
Q: "Walk me through a VCF deployment."
"Starting from Day 0: prepare ESXi hosts with proper networking, DNS, NTP, and AD. Deploy the Cloud Builder appliance, fill out the deployment parameter workbook — the Excel sheet that defines every IP, FQDN, password, VLAN. Cloud Builder orchestrates the bring-up: deploys vCenter, configures vSAN, deploys SDDC Manager, and stands up NSX Manager. Post bring-up: certificate replacement, VCF Operations deployment, compliance checks with VDT, and coordinated upgrade sequences."
NSX Certificate Story:
"After deploying NSX 9.0, I needed to replace self-signed certs. The SANs had to include not only the NSX node FQDN but also the VIP FQDN that SDDC Manager uses. After generating the cert and applying via NSX API — node first, then VIP — SDDC Manager still couldn't communicate. The issue: two separate trust stores (VCF common services and Java cacerts) both needed the CA cert imported. I documented the entire process as a repeatable procedure."
Credential Cascade Story:
"SDDC Manager's credential rotation for NSX failed and left the entire password management system broken. Every subsequent attempt failed with 'not in ACTIVE state' and 'Unable to acquire resource level lock.' VCF Operations showed two accounts disconnected.
The root cause was a cascading failure: the rotation failed because NSX was unreachable during a boot storm — load average over 100 on 6 cores. That left the NSX cluster stuck in ACTIVATING state in PostgreSQL, plus 47 unresolved tasks and stale locks piling up with each UI retry. I tried the API first — PATCH returned 'TA_TASK_CAN_NOT_BE_RETRIED', DELETE returned HTTP 500. The API has no mechanism to fix this.
So I went to PostgreSQL directly. Mapped the database schema myself — none of this is documented by Broadcom. Discovered the key tables:
`nsxt` for resource status, `lock` for exclusive locks, and `task_metadata` with a `resolved` boolean for task tracking. The column names aren't what you'd expect — I found them through information_schema queries after early scripts failed. The fix was 6 steps: trust auth workaround, fix nsxt status to ACTIVE, clear lock table, mark task_metadata resolved, clear task_lock, restart operationsmanager. All three tables must be fixed together — they all participate in prevalidation. I built three Python scripts to automate it and documented the full procedure as a repeatable runbook."
vMotion Ghost Setting:
"vMotion was failing with a 'snapshot taken with VHV enabled' error. The setting was invisible in vCenter UI and VMX file — only found in VM runtime logs. Fix: explicitly set
vhv.enable = FALSErather than just removing the line."
Q: "How do you approach troubleshooting?"
"Structured approach: first check relevant logs (SDDC Manager domainmanager/operationsmanager logs, NSX syslog, vSAN health). If logs don't point to the issue, isolate the problem — can SDDC Manager reach NSX? Are certs trusted? Is DNS correct? 80% of VCF issues come down to: certificate trust, DNS resolution, service timing, or stale internal state in SDDC Manager's database. I use the SDDC Manager API for detailed task status and error payloads the UI hides. When the API isn't enough, I go to PostgreSQL. I've also built 20 Python diagnostic scripts for remote troubleshooting."
VCF Day 0 bring-up sequence:
https://sddc-manager.lab.local/
VCF upgrade order: SDDC Manager → vCenter → NSX Manager → ESXi (rolling) → vSAN → VCF Operations
SDDC Manager API:
# Get auth token
curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}'
# Key endpoints
/v1/credentials /v1/nsxt-clusters /v1/tasks/{id} /v1/resource-locks
| Question | Key Points |
|---|---|
| What is VCF? | Software-defined DC platform. Integrates vSphere, vSAN, NSX, SDDC Manager. |
| Mgmt vs workload domain? | Management = infrastructure services. Workload = customer apps. |
| What does SDDC Manager do? | Orchestrates Day 0/1/2. Single pane for full stack. API for automation. |
| vSAN ESA vs OSA? | ESA = single pool, NVMe native, no disk groups. OSA = disk groups, SAS/SATA. |
| NSX transport zones? | Overlay = GENEVE tunneling. VLAN = traditional. VCF creates both during bring-up. |
| Cert management? | SDDC Mgr generates CSRs. Replacement requires updating trust stores (VCF + Java cacerts). |
| Password mgmt in VCF 9? | Centralized in VCF Ops Fleet Mgmt. Failed ops can leave stale locks → DB repair. |
Q: "What issues did you find that weren't in the documentation?"
"Across the full deployment lifecycle, I cataloged 35 separate issues with no official Broadcom documentation. These fall into six categories. I've documented the exact resolution for every single one — complete with copy-paste-ready commands, SQL queries, and OpenSSL configs."
Full reference with exact commands:
VCF-Undocumented-Issues-Reference.pdf
Database & Credential Operations (7)
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 1 | SDDC Manager's PostgreSQL schema — table names, column names, relationships all unmapped | Mapped schema using information_schema.tables and information_schema.columns queries via psql -h 127.0.0.1 -d platform |
| 2 | Credential cascade failure — failed rotation leaves NSX stuck in ACTIVATING, stale locks, unresolved tasks | Direct PostgreSQL repair across 3 tables — must fix all together (Issue #4) |
| 3 | API cannot cancel stuck tasks — returns TA_TASK_CAN_NOT_BE_RETRIED; database repair is the only fix | PostgreSQL: UPDATE task_metadata SET resolved=true + DELETE FROM lock + DELETE FROM task_lock |
| 4 | 6-step repair procedure — must fix nsxt status + locks + tasks together; partial fix still fails | pg_hba.conf trust auth → fix nsxt state → clear locks → mark tasks resolved → clear task_lock → restart operationsmanager |
| 5 | PostgreSQL requires -h 127.0.0.1 — Unix sockets don't work | Always: sudo -u postgres psql -h 127.0.0.1 -d platform |
| 6 | Column naming inconsistencies — state not status, resolved boolean not status enum | Query information_schema.columns first to discover correct column names |
| 7 | Password not discoverable — must use trust auth workaround in pg_hba.conf | Add host all all 127.0.0.1/32 trust to pg_hba.conf → restart postgres → revert after use |
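The six steps from issue 4 are scriptable; a minimal sketch, assuming trust auth is already in place (issue 7) and that the table and column names below match your build (per issue 6, confirm via information_schema before running any UPDATE):

```bash
#!/bin/bash
# Hedged sketch of the 6-step credential-cascade repair (issues 2-7). Run on SDDC Manager
# after adding the pg_hba.conf trust entry and restarting postgres (issue 7).
PSQL="sudo -u postgres psql -h 127.0.0.1 -d platform"   # -h 127.0.0.1 is mandatory (issue 5)

# Issue 6: column names vary (state vs. status); confirm before updating anything
$PSQL -c "SELECT column_name FROM information_schema.columns WHERE table_name = 'nsxt';"

# Return the stuck NSX resource to ACTIVE, clear stale locks, resolve orphaned tasks
$PSQL -c "UPDATE nsxt SET status = 'ACTIVE' WHERE status = 'ACTIVATING';"
$PSQL -c "DELETE FROM lock;"
$PSQL -c "UPDATE task_metadata SET resolved = true;"
$PSQL -c "DELETE FROM task_lock;"

# Restart the service that runs credential prevalidation, then revert the trust entry
systemctl restart operationsmanager
```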
NSX in Nested/Resource-Constrained Environments (6)
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 8 | 32GB RAM / 6 vCPU minimum — Broadcom docs say 16GB; actual: 16GB=OOM, 24GB=crashes, 32GB=stable | Set NSX VM to 30-32GB RAM, 6 vCPU in VMware Workstation .vmx |
| 9 | Boot storm load >100 for 30-60 min is normal; VIP offline until settled | Wait 30-60 min after power-on; verify with get cluster status → STABLE |
| 10 | Adding more vCPU is counterproductive — co-scheduling overhead | Keep at 6 vCPU; stagger VM startups instead |
| 11 | Services take 10-15 min to stabilize; API returns error 101 during stabilization | Wait 15 min after service restart before any API calls |
| 12 | DNS/NTP via admin CLI (set name-servers), NOT the UI | ssh admin@nsx → set name-servers 192.168.1.5 → set ntp-servers 192.168.1.5 |
| 13 | TEP on vmk0 — NSX 9.0 "Use VMkernel Adapter" eliminates dedicated TEP VLAN | Select "Use VMkernel Adapter" → vmk0 during transport node config |
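The waits in issues 9-12 are easy to automate; a small sketch, assuming the node FQDN used in this lab and that the NSX admin account accepts nsxcli commands passed as an SSH command string:

```bash
#!/bin/bash
# Hedged sketch for issues 9-12: wait out the boot storm, confirm the cluster is STABLE,
# then set DNS/NTP through the NSX admin CLI (the UI path does not work, per issue 12).
NSX=admin@nsx-manager.lab.local

# Issues 9 and 11: load stays >100 for 30-60 minutes after power-on; poll rather than hammer the API
until ssh "$NSX" "get cluster status" | grep -q STABLE; do
  echo "NSX cluster not stable yet; sleeping 5 minutes"
  sleep 300
done

# Issue 12: DNS and NTP must be set from the admin CLI, not the UI
ssh "$NSX" "set name-servers 192.168.1.5"
ssh "$NSX" "set ntp-servers 192.168.1.5"
```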
Certificate Management (5)
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 14 | NSX cert SAN must include SDDC Manager's registered FQDN (nsx-manager.lab.local) | OpenSSL config with DNS.1=nsx-manager, DNS.2=nsx-vip, IP.1/IP.2 → generate CSR → sign → import via NSX API with Python PEM escaping |
| 15 | Two separate trust stores — VCF common services + Java cacerts; KB 316056 is incomplete | Import CA into both: certool --importcert + keytool -importcert into /usr/java/jre-vmware-17/lib/security/cacerts |
| 16 | Fleet Management cert generator produces wrong SANs | Generate manually with OpenSSL using correct SANs → upload via Fleet UI |
| 17 | VCF Ops for Logs cert generator — same SAN mismatch pattern | Same OpenSSL manual generation with Logs hostnames → upload via Logs UI |
| 18 | Shell can't handle PEM escaping — must use Python for JSON cert payload | Python script: read PEM files → json.dumps({"pem_encoded": cert+key}) → POST to NSX API |
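Issue 18 is the one place the shell genuinely cannot help; a minimal sketch of the Python-for-JSON step, with hypothetical file names (nsx.crt, nsx.key) standing in for the signed certificate and key produced in issue 14:

```bash
#!/bin/bash
# Hedged sketch for issue 18: let python3 JSON-escape the PEM material, since shell quoting
# mangles the newlines. File names are placeholders; the pem_encoded field follows the table.
python3 - <<'EOF' > cert-payload.json
import json

with open("nsx.crt") as f:
    cert = f.read()
with open("nsx.key") as f:
    key = f.read()

# The table above concatenates certificate and key into a single pem_encoded value
print(json.dumps({"pem_encoded": cert + key}))
EOF

# POST cert-payload.json to the NSX certificate import API, node first and then VIP,
# following the procedure in Section 6.2 / issue 14 (endpoint path not repeated here).
```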
VCF Operations 9.x Changes (6)
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 19 | Adapter log paths changed to /storage/log/vcops/log/adapters/ — legacy path gone | Use new path: tail -f /storage/log/vcops/log/adapters/<name>/adapter.log |
| 20 | JRE path changed to /usr/java/jre-vmware-17/ — legacy jre-vmware gone | Use new path for keytool: /usr/java/jre-vmware-17/bin/keytool |
| 21 | Two separate NSX adapters — VCF section uses VIP, "Aria Admin" uses node FQDN | Update credentials on both adapters — VIP adapter and node FQDN adapter |
| 22 | System Managed Credential ROTATE doesn't work for NSX — must set manually | Fleet UI → uncheck System Managed → set password manually |
| 23 | SSH enable via Admin UI only — console/systemctl won't work | https://vcf-ops.lab.local/admin → Administration → SSH → Enable |
| 24 | Health adapter silently fails on stale credential; full reboot required | Update credential in UI → ssh root@vcf-ops → reboot (stop/start insufficient) |
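The two relocated paths in issues 19-20 come up constantly during certificate and adapter troubleshooting; two one-liners using the new locations, with the adapter name left as a placeholder:

```bash
# Issue 19: adapter logs moved; substitute the adapter directory name for <name>
tail -f /storage/log/vcops/log/adapters/<name>/adapter.log

# Issue 20: keytool now lives under the JRE 17 path; list trusted CAs (prompts for keystore password)
/usr/java/jre-vmware-17/bin/keytool -list \
  -keystore /usr/java/jre-vmware-17/lib/security/cacerts
```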
Infrastructure & Platform (4)
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 25 | vCenter can't thin-provision to vSAN — must use vmkfstools -i -d thin per disk | SSH to ESXi → vmkfstools -i source.vmdk -d thin dest.vmdk per disk → update .vmx |
| 26 | vhv.enable ghost setting persists in VM runtime — must explicitly set FALSE | Add vhv.enable = "FALSE" to .vmx explicitly — removing the line is NOT enough |
| 27 | Hot vMotion fails in nested environments — use cold migration | Power off VM → Migrate → select destination host (hot migration times out) |
| 28 | VDT not pre-installed — must download from KB 344917 | Download from Broadcom KB 344917 → chmod +x vdt-* → ./vdt --domain MANAGEMENT |
Crash Recovery & Suite-API (7) — Discovered March 2026
| # | What Broadcom Doesn't Tell You | How I Fixed It |
|---|---|---|
| 29 | Suite-API uses vRealizeOpsToken auth header — not Bearer | Authorization: vRealizeOpsToken <token> for all Suite-API calls |
| 30 | Permissions API requires single JSON object — not array | {"roleName":"Administrator","allowAllObjects":true} — no wrapper |
| 31 | Super admin admin shows roleNames: [] — by design | No fix needed — implicit full access by design |
| 32 | SDDC Manager domainmanager port 7200 is HTTP not HTTPS | curl http://localhost:7200/health — NOT https |
| 33 | NSX adapter credential fields must be uppercase | {"name":"USERNAME","value":"admin"} and {"name":"PASSWORD","value":"..."} |
| 34 | Gemfire cache takes 5-10 min after cluster init | Wait 5-10 min; roles/users populate after Gemfire loads |
| 35 | HSQLDB reset required after unclean shutdown | Stop services → edit casa.db.script (FAILED→NONE) → clear log → restart → curl -X POST .../casa/cluster/init |
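A minimal sketch of the Suite-API auth pattern from issues 29 and 34, using the lab appliance address and local admin credentials shown in the detailed table earlier; the "token" response field name is an assumption worth verifying on your build:

```bash
#!/bin/bash
# Hedged sketch: acquire a Suite-API token, then call an endpoint with the vRealizeOpsToken
# header (issue 29). Appliance IP and credentials are this lab's values from the tables above.
OPS=https://192.168.1.77

TOKEN=$(curl -sk -X POST "$OPS/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"username":"admin","password":"Success01!0909!!","authSource":"local"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")

# Issue 29: every Suite-API call needs the vRealizeOpsToken header, not Bearer.
# Issue 34: roles/users can take 5-10 minutes to appear after cluster init while the
#           Gemfire cache loads, so an empty list right after initialization is not an error.
curl -sk -H "Authorization: vRealizeOpsToken $TOKEN" -H "Accept: application/json" \
  "$OPS/suite-api/api/auth/roles"
```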
Condensed reference from the full SDDC Manager REST API Handbook. For complete endpoint details, Python scripts, Postman collection, and lab-tested workflows, see VCF-SDDC-Manager-API-Handbook.md.
Base Path: https://sddc-manager.lab.local/v1/
Authentication: Bearer Token (JWT) via POST /v1/tokens
Pattern: Async — mutating operations return task IDs, poll /v1/tasks/{id} for status
+------------------------------+
| SDDC Manager API |
| Base Path: /v1/ |
+---------------+--------------+
|
+-----------+-----------+-----------+-----------+
| | | | |
v v v v v
+--------+ +----------+ +--------+ +----------+ +--------+
| Auth | | Infra | | Tasks | | Locks | | Creds |
| tokens | | hosts | | tasks | | resource | | creds |
| | | domains | | {id} | | -locks | | |
+--------+ | clusters | +--------+ +----------+ +--------+
| nsxt- |
| clusters |
+----------+
Token Extraction:
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"Success01!0909!!"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")
| Property | Value |
|---|---|
| Access token lifetime | 60 minutes |
| Refresh token lifetime | 24 hours |
| Token type | JWT (JSON Web Token) |
| Required header | Authorization: Bearer <accessToken> |
| Token refresh | PATCH /v1/tokens with refreshToken.id |
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/tokens | Authenticate and get Bearer token |
| PATCH | /v1/tokens | Refresh an expired access token |
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/system | System information and version |
| GET | /v1/system/health | Overall platform health (GREEN/YELLOW/RED) |
| GET | /v1/system/notifications | Active notifications |
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/hosts | List all ESXi hosts |
| POST | /v1/hosts | Commission new host(s) |
| DELETE | /v1/hosts/{id} | Decommission a host |
| GET | /v1/domains | List all workload domains |
| POST | /v1/domains | Create a new workload domain |
| GET | /v1/clusters | List all clusters |
| POST | /v1/clusters | Create a new cluster |
| PATCH | /v1/clusters/{id} | Expand/shrink a cluster |
| GET | /v1/vcenters | List all vCenter instances |
| GET | /v1/nsxt-clusters | List all NSX-T clusters |
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/bundles | List available update bundles |
| POST | /v1/bundles | Download a bundle |
| GET | /v1/upgradables | List upgradable components |
| POST | /v1/upgrades | Start an upgrade operation |
Most mutating operations (credential rotations, upgrades, deployments) return a task ID. Poll until SUCCESSFUL, FAILED, or CANCELLED.
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/tasks | List all tasks (filter: ?status=IN_PROGRESS) |
| GET | /v1/tasks/{id} | Get task details with sub-tasks and errors |
| PATCH | /v1/tasks/{id} | Attempt to cancel a task |
Task polling pattern:
TASK_ID="<task-id>"
while true; do
  STATUS=$(curl -sk -H "Authorization: Bearer $TOKEN" \
    https://sddc-manager.lab.local/v1/tasks/$TASK_ID \
    | python3 -c "import sys,json; print(json.load(sys.stdin)['status'])")
  echo "$(date +%H:%M:%S) - $STATUS"
  [ "$STATUS" = "SUCCESSFUL" ] || [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "CANCELLED" ] && break
  sleep 30
done
Key Discovery: The API returns `TA_TASK_CAN_NOT_BE_RETRIED` for stuck tasks. `DELETE /v1/tasks/{id}` returns HTTP 500. When the API cannot cancel stuck tasks, direct PostgreSQL database repair is the only option — see Section 7.2.6.
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/credentials | List all stored credentials |
| PUT | /v1/credentials | Update, rotate, or remediate credentials |
Credential operation types:
| operationType | Effect |
|---|---|
| UPDATE | Sync SDDC Manager's stored password with current password on target |
| ROTATE | Generate new password and push to both SDDC Manager and target |
| REMEDIATE | Re-attempt a failed credential operation |
WARNING: If a credential operation fails mid-flight (e.g., NSX unreachable during boot storm), it triggers a cascade failure. See Section 7.2.6 for the 6-step PostgreSQL repair procedure.
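Before any UPDATE or ROTATE it is worth a read-only pass over the stored credentials, both to capture current status and to confirm nothing is already failed; a minimal check using the endpoints above and the TOKEN variable from the extraction snippet earlier:

```bash
# Read-only: list stored credentials with their resource, type, and status before rotating anything
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://sddc-manager.lab.local/v1/credentials | python3 -m json.tool
```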
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/resource-locks | List active resource locks |
Stale locks from failed operations block all subsequent operations. The API provides no way to force-release locks. Fix requires direct database cleanup: DELETE FROM lock in the platform database.
Three ready-to-use Python scripts are provided in the full API Handbook:
| Script | Purpose |
|---|---|
| Full API Client | Queries all key endpoints in one pass (system, health, credentials, NSX, tasks, locks, hosts, domains) |
| Credential Status Checker | Tabular display of all credentials with type, resource, and status |
| Task Monitor | Polls a specific task ID every 30 seconds until completion, displays errors on failure |
All scripts use requests, urllib3.disable_warnings(), and verify=False for self-signed certs.
| Error | Root Cause | Fix |
|---|---|---|
| TA_TASK_CAN_NOT_BE_RETRIED | Stuck task | DB: UPDATE task_metadata SET resolved = true |
| Unable to acquire resource level lock(s) | Stale locks | DB: DELETE FROM lock in platform DB |
| Resources [X] are not in ACTIVE state | NSX stuck | DB: UPDATE nsxt SET status = 'ACTIVE' |
| HTTP 401 | Token expired | Re-authenticate via POST /v1/tokens |
| HTTP 409 | Resource locked | Check /v1/resource-locks, wait or clear DB locks |
| Connection refused | Services down | SSH: systemctl restart vcf-services |
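The HTTP 409 and lock errors above can be confirmed without touching the database; a small read-only triage using endpoints already covered in this appendix:

```bash
# Read-only triage before any database surgery: stale resource locks and in-flight tasks
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://sddc-manager.lab.local/v1/resource-locks | python3 -m json.tool

curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" | python3 -m json.tool
```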
/var/log/vmware/vcf/operationsmanager/operationsmanager.log # Credential ops
/var/log/vmware/vcf/domainmanager/domainmanager.log # Domain/cluster ops
/var/log/vmware/vcf/lcm/lcm.log # Lifecycle/upgrade ops
/var/log/vmware/vcf/commonsvcs/commonsvcs.log # Auth/token issues
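When a task fails, grepping all four service logs for its task ID is usually the fastest way to find the owning component; a one-line sketch with the task ID left as a placeholder:

```bash
# Search every SDDC Manager service log for a failed task's ID (substitute the UUID from /v1/tasks)
grep "<task-id>" /var/log/vmware/vcf/{operationsmanager,domainmanager,lcm,commonsvcs}/*.log
```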
| Order | Component | Why |
|---|---|---|
| 1 | SDDC Manager | Orchestrates all other upgrades |
| 2 | vCenter Server | Required before ESXi upgrades |
| 3 | NSX Manager | Required before host network changes |
| 4 | ESXi Hosts | Rolling upgrade, one host at a time |
| 5 | vSAN | After all hosts are upgraded |
| 6 | VCF Operations | Last — depends on all infrastructure |
Full reference: See VCF-SDDC-Manager-API-Handbook.md for complete endpoint documentation, Postman collection (12 pre-built requests), and 3 lab-tested workflows (Full Health Check, Credential Update for NSX, Diagnose Credential Cascade Failure).
(c) 2026 Virtual Control LLC. All rights reserved.