Version: 1.5
Date: February 2026
Environment: VCF 9.0.1 Nested in VMware Workstation
Author: Virtual Control LLC
This handbook documents the complete troubleshooting process for deploying VMware Cloud Foundation 9.0.1 in a nested VMware Workstation environment. The deployment encountered several challenges, including TLS certificate validation failures against the offline depot, ESXi host certificates with incorrect hostnames, virtual disks not being detected as SSDs for vSAN, duplicated subnet/gateway validation errors, and vCenter deployment failures requiring full cleanup and redeployment.
This document provides step-by-step remediation procedures and recovery processes for nested VCF deployments.
| Component | FQDN | IP Address |
|---|---|---|
| VCF Installer/SDDC Manager | vcf-installer.lab.local | 192.168.1.240 |
| vCenter Server | vcenter.lab.local | 192.168.1.69 |
| NSX Manager VIP | nsx-vip.lab.local | 192.168.1.70 |
| NSX Manager Node 1 | nsx-node1.lab.local | 192.168.1.71 |
| ESXi Host 1 | esxi01.lab.local | 192.168.1.74 |
| ESXi Host 2 | esxi02.lab.local | 192.168.1.75 |
| ESXi Host 3 | esxi03.lab.local | 192.168.1.76 |
| ESXi Host 4 | esxi04.lab.local | 192.168.1.82 |
| VCF Operations | vcf-ops.lab.local | 192.168.1.77 |
| Fleet Management | fleet.lab.local | 192.168.1.78 |
| Collector | collector.lab.local | 192.168.1.79 |
| VCF Automation | automation.lab.local | 192.168.1.90 |
| Automation Node 1 | automation-node1.lab.local | 192.168.1.91 |
| Offline Depot Server | (IP only - no DNS) | 192.168.1.160 |
| DNS/NTP Server | (Windows Server) | 192.168.1.230 |
| Network | VLAN ID | Subnet | Gateway | IP Range |
|---|---|---|---|---|
| ESX Management | 0 | 192.168.1.0/24 | 192.168.1.1 | DHCP/Static |
| VM Management | 0 | 192.168.1.0/24 | 192.168.1.1 | Same as ESX Mgmt |
| vMotion | 100 | 192.168.100.0/24 | 192.168.100.1 | 192.168.100.10-20 |
| vSAN | 200 | 192.168.200.0/24 | 192.168.200.1 | 192.168.200.206-216 |
| NSX TEP | 300 | 192.168.250.0/24 | 192.168.250.1 | 192.168.250.10-25 |
| Resource | Value |
|---|---|
| vCPUs | 8 (4 cores x 2 sockets) |
| Memory | 48 GB |
| OS Disk | 32 GB |
| vSAN Cache Disk | 100 GB (SSD) |
| vSAN Capacity Disk | 800 GB (SSD) |
| Network Adapters | 4x vmxnet3 |
VCF 9.0.1 uses the BouncyCastle FIPS TLS implementation, which has strict certificate validation requirements. When connecting to an offline depot that presents a self-signed certificate, the connection fails with:
Secure protocol communication error, check logs for more details
TlsFatalAlert errors:org.bouncycastle.tls.TlsFatalAlert caught when processing request to {s}->https://192.168.1.160:8443
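The strictness can be illustrated with Python's stdlib ssl module, whose default client context enforces the same two checks (chain of trust plus hostname match) that the BouncyCastle client applies. This is an analogy for explanation only, not SDDC Manager's actual code path:

```python
import ssl

def strict_depot_context(ca_file=None):
    """A strict client context: the peer certificate must chain to a trusted
    root AND match the hostname, just like the BouncyCastle FIPS defaults.
    Passing ca_file plays the role of the Java cacerts import on SDDC Manager."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # depot must speak TLS 1.2+
    return ctx

ctx = strict_depot_context()
# Both checks are on by default; a self-signed depot cert fails the trust
# check unless its certificate is added to the trust store.
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_REQUIRED)
```

A handshake against the depot with this context fails exactly where the LCM service does; adding the depot certificate as `ca_file` makes it pass, which is what the keytool import in the remediation steps accomplishes on the Java side.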
# Test SSL connectivity
openssl s_client -connect 192.168.1.160:8443
# Test with TLS 1.2 specifically
openssl s_client -connect 192.168.1.160:8443 -tls1_2
# Check cipher negotiation
openssl s_client -connect 192.168.1.160:8443 -tls1_2 </dev/null 2>&1 | grep -E "Cipher|Protocol|Verify"
# View certificate details
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -text -noout
# Get certificate fingerprint
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
# Check LCM logs for TLS errors
grep -i "tlsfatal\|ssl\|certificate" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20
# Check LCM service status
systemctl status lcm
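When comparing the `openssl` fingerprint against what `keytool -list` reports, a small stdlib helper can normalize any PEM certificate to the same colon-separated SHA-256 format. This is a sketch; `ssl.get_server_certificate` (shown commented out, since it needs the depot reachable) can fetch the PEM:

```python
import hashlib
import ssl

def sha256_fingerprint(pem_cert: str) -> str:
    """Return the colon-separated SHA-256 fingerprint of a PEM certificate,
    in the same format printed by `openssl x509 -fingerprint -sha256`."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)  # strip headers, base64-decode
    digest = hashlib.sha256(der).hexdigest().upper()
    return ':'.join(digest[i:i + 2] for i in range(0, len(digest), 2))

# Example against the lab depot (requires network access to 192.168.1.160):
# pem = ssl.get_server_certificate(('192.168.1.160', 8443))
# print(sha256_fingerprint(pem))
```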
#!/usr/bin/env python3
"""
HTTPS server for VCF Offline Depot
Serves files with TLS 1.2+ for SDDC Manager compatibility
"""
import http.server
import ssl
import os
import base64
import socketserver
from functools import partial

# Configuration
PORT = 8443
CERT_FILE = 'server.crt'
KEY_FILE = 'server.key'
USERNAME = 'admin'
PASSWORD = 'admin'


class AuthHandler(http.server.SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_HEAD(self):
        if not self.authenticate():
            return
        super().do_HEAD()

    def do_GET(self):
        if not self.authenticate():
            return
        super().do_GET()

    def do_POST(self):
        if not self.authenticate():
            return
        content_length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(content_length)
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

    def authenticate(self):
        auth_header = self.headers.get('Authorization')
        if auth_header is None:
            self.send_auth_request()
            return False
        try:
            auth_type, credentials = auth_header.split(' ', 1)
            if auth_type.lower() != 'basic':
                self.send_auth_request()
                return False
            decoded = base64.b64decode(credentials).decode('utf-8')
            username, password = decoded.split(':', 1)
            if username == USERNAME and password == PASSWORD:
                return True
        except Exception:
            pass
        self.send_auth_request()
        return False

    def send_auth_request(self):
        self.send_response(401)
        self.send_header('WWW-Authenticate', 'Basic realm="VCF Depot"')
        self.send_header('Content-type', 'text/html')
        self.send_header('Content-Length', '23')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'Authentication required')

    def log_message(self, format, *args):
        print(f"{self.client_address[0]} - {format % args}")


class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    daemon_threads = True


def run_server():
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_3
    if hasattr(context, 'post_handshake_auth'):
        context.post_handshake_auth = False
    context.options |= ssl.OP_NO_TICKET
    context.options |= getattr(ssl, 'OP_NO_RENEGOTIATION', 0)
    context.load_cert_chain(CERT_FILE, KEY_FILE)
    try:
        context.set_ciphers('DEFAULT:!aNULL:!MD5:!DSS')
    except ssl.SSLError:
        pass
    handler = partial(AuthHandler, directory=os.getcwd())
    server = ThreadedHTTPServer(('0.0.0.0', PORT), handler)
    server.socket = context.wrap_socket(server.socket, server_side=True)
    print("VCF Offline Depot Server")
    print("========================")
    print(f"Serving: {os.getcwd()}")
    print(f"URL: https://192.168.1.160:{PORT}/")
    print(f"Credentials: {USERNAME} / {PASSWORD}")
    print("TLS: 1.2 - 1.3")
    print("Press Ctrl+C to stop")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nStopped.")
        server.shutdown()


if __name__ == '__main__':
    run_server()
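The depot can be sanity-checked from any machine with Python by building the Basic auth header the server's authenticate() method expects. This is a sketch; the commented-out request uses this lab's depot URL:

```python
import base64
import urllib.request

def basic_auth_header(username: str, password: str) -> str:
    """Build an RFC 7617 Basic Authorization header value, i.e. the same
    format the depot server's authenticate() method decodes."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

# Example request (self-signed cert, so an unverified context or the
# depot CA would be needed; URL reflects this lab's addressing):
# req = urllib.request.Request(
#     "https://192.168.1.160:8443/",
#     headers={"Authorization": basic_auth_header("admin", "admin")})

print(basic_auth_header("admin", "admin"))  # Basic YWRtaW46YWRtaW4=
```

A request without this header should come back 401 with the `WWW-Authenticate: Basic realm="VCF Depot"` challenge, confirming the auth path works before pointing SDDC Manager at the depot.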
"""
Generate simple self-signed certificate for VCF Offline Depot
"""
import subprocess
import sys
import os
def generate_cert():
try:
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization
from datetime import datetime, timedelta, timezone
import ipaddress
print("Generating RSA 2048-bit private key...")
key = rsa.generate_private_key(
public_exponent=65537,
key_size=2048,
)
print("Creating self-signed certificate...")
subject = issuer = x509.Name([
x509.NameAttribute(NameOID.COMMON_NAME, "192.168.1.160"),
])
now = datetime.now(timezone.utc)
# Simple certificate like the original
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(now)
.not_valid_after(now + timedelta(days=365))
.add_extension(
x509.SubjectAlternativeName([
x509.IPAddress(ipaddress.IPv4Address("192.168.1.160")),
]),
critical=False,
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
critical=True,
)
.sign(key, hashes.SHA256())
)
with open("server.key", "wb") as f:
f.write(key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption()
))
print("Created: server.key")
with open("server.crt", "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
print("Created: server.crt")
fingerprint = cert.fingerprint(hashes.SHA256()).hex()
formatted = ':'.join(fingerprint[i:i+2].upper() for i in range(0, len(fingerprint), 2))
print(f"SHA256: {formatted}")
return True
except ImportError:
print("Installing cryptography...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "cryptography"])
return False
def main():
os.chdir(r"C:\VCF-DEPOT")
print("Generating certificate...")
if not generate_cert():
generate_cert()
print("\nDone. Run: python https_server.py")
if __name__ == "__main__":
main()
# Navigate to depot directory
cd C:\VCF-DEPOT
# Generate certificate
python generate_cert.py
# Start HTTPS server
python https_server.py
# Step 1: Download certificate from depot server
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/depot.crt
# Step 2: Verify certificate was downloaded
cat /tmp/depot.crt
# Step 3: Get certificate fingerprint
openssl x509 -in /tmp/depot.crt -noout -fingerprint -sha256
# Step 4: Find Java truststore location
echo $JAVA_HOME
# Output: /usr/lib/jvm/openjdk-java17-headless.x86_64
# Step 5: Delete old certificate if exists
keytool -delete -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Step 6: Import new certificate
keytool -import -trustcacerts -alias offline-depot -file /tmp/depot.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
# Step 7: Verify import
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Step 8: Restart LCM service
systemctl restart lcm
# Step 9: Wait for LCM to start (2 minutes)
systemctl status lcm
# Step 10: Verify LCM is ready
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log | grep -i "started\|ready"
# Check all cacerts files on system
find / -name "cacerts" -type f 2>/dev/null
# List all certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# View certificate details in truststore
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v
# Check LCM logs for certificate errors
grep -B5 -A10 "TlsFatalAlert" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -40
ESXi hosts may have certificates with incorrect hostnames (e.g., "localhost.localdomain" instead of the actual FQDN), causing VCF validation to fail.
javax.net.ssl.SSLPeerUnverifiedException: Certificate for <esxi01.lab.local> doesn't match any of the subject alternative names: [localhost.localdomain]
# Check current hostname
esxcli system hostname get
# View current certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# View full certificate details
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout
# Step 1: Set correct hostname
esxcli system hostname set --fqdn=esxi01.lab.local
# Step 2: Verify hostname
esxcli system hostname get
# Step 3: Backup existing certificate
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
# Step 4: Generate new certificate
/sbin/generate-certificates
# Step 5: Restart services
services.sh restart
# Step 6: Verify new certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
esxi01.lab.local (192.168.1.74):
esxcli system hostname set --fqdn=esxi01.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi02.lab.local (192.168.1.75):
esxcli system hostname set --fqdn=esxi02.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi03.lab.local (192.168.1.76):
esxcli system hostname set --fqdn=esxi03.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi04.lab.local (192.168.1.82):
esxcli system hostname set --fqdn=esxi04.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
# Get thumbprint for each host
echo | openssl s_client -connect 192.168.1.74:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.75:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.76:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.82:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
After regenerating certificates, update the thumbprints in VCF Installer UI by re-validating the hosts.
In nested VMware Workstation environments, virtual disks are not automatically detected as SSDs, causing vSAN cache tier configuration to fail.
ESX Host esxi01.lab.local found zero SSD devices for SSD cache tier
# List all storage devices with SSD status
esxcli storage core device list | grep -E "^t10|Is SSD"
# Check vSAN eligible disks
vdq -q
# List vSAN storage
esxcli vsan storage list
# Check disk partitions
partedUtil getptbl /vmfs/devices/disks/<device-name>
Virtual disks in VMware Workstation need the virtualSSD flag set in the VMX file to be recognized as SSDs by nested ESXi.
Location: Edit each ESXi VM's .vmx file in VMware Workstation
Required Lines to Add:
For esxi01.vmx:
sata0:0.virtualSSD = 1
sata0:2.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1
For esxi02.vmx, esxi03.vmx, esxi04.vmx:
sata0:0.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1
esxcli storage core device list | grep -E "^t10|Is SSD"
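With the VMs powered off, the flag lines above can also be appended programmatically rather than by hand. This sketch assumes the device IDs used for esxi01 in this lab and a hypothetical helper name; adjust the device list per VM:

```python
from pathlib import Path

# Boot disk plus vSAN cache/capacity disks for esxi01 in this lab;
# adjust per VM (esxi02-04 use sata0:0, sata0:3, sata0:4).
SSD_DEVICES = ["sata0:0", "sata0:2", "sata0:3", "sata0:4"]

def add_virtual_ssd_flags(vmx_path, devices=SSD_DEVICES):
    """Append '<device>.virtualSSD = 1' entries that are not already present.
    Returns the lines added. Edit .vmx files only while the VM is powered off."""
    path = Path(vmx_path)
    lines = path.read_text().splitlines()
    added = []
    for dev in devices:
        key = f"{dev}.virtualSSD"
        if not any(line.strip().startswith(key) for line in lines):
            entry = f"{key} = 1"
            lines.append(entry)
            added.append(entry)
    path.write_text("\n".join(lines) + "\n")
    return added
```

Running it a second time is a no-op, so it is safe to re-run across all four VMX files before powering the hosts back on and re-checking "Is SSD".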
Disks with existing vSAN partitions from previous deployments are marked as "Ineligible for use by VSAN" with reason "Has partitions" or "Disk in use by disk group".
# Check vSAN eligibility
vdq -q
# Check existing vSAN storage
esxcli vsan storage list
# Check partition table
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Step 1: Remove existing vSAN disk group (if exists)
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Step 2: Delete partitions from vSAN cache disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Step 3: Delete partitions from vSAN capacity disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2
# Step 4: Verify disks are now eligible
vdq -q
{
"Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
"State": "Eligible for use by VSAN",
"Reason": "None",
"IsSSD": "1"
}
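The eligibility check can be scripted against `vdq -q` output across hosts. A sketch, assuming the output parses as JSON (real output formatting can vary by ESXi build):

```python
import json

def ineligible_disks(vdq_output: str):
    """Return (name, reason) for disks that vdq -q reports as not eligible
    for vSAN. Accepts a JSON list of entries or a single entry."""
    data = json.loads(vdq_output)
    if isinstance(data, dict):
        data = [data]
    return [(d["Name"], d["Reason"]) for d in data
            if d.get("State") != "Eligible for use by VSAN"]

# The expected post-cleanup entry from this lab:
sample = '''{
  "Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
  "State": "Eligible for use by VSAN",
  "Reason": "None",
  "IsSSD": "1"
}'''
print(ineligible_disks(sample))  # []
```

Any entry returned with reason "Has partitions" or "Disk in use by disk group" points back at the partedUtil/esxcli cleanup steps above.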
VCF requires separate subnets for different network types. Using the same subnet/gateway for multiple networks causes validation errors:
Gateway 192.168.1.1 is duplicated across networks
Subnet 192.168.1.0/24 is duplicated across networks
| Network | VLAN ID | Subnet | Gateway |
|---|---|---|---|
| Management | 0 | 192.168.1.0/24 | 192.168.1.1 |
| vMotion | 100 | 192.168.100.0/24 | 192.168.100.1 |
| vSAN | 200 | 192.168.200.0/24 | 192.168.200.1 |
| NSX TEP | 300 | 192.168.250.0/24 | 192.168.250.1 |
Note: For the vMotion, vSAN, and TEP networks in nested environments, the gateway IPs do not need to exist, since these networks are isolated; VCF only requires the gateway field to be populated.
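The duplicate-subnet/gateway validation is easy to reproduce ahead of time with the stdlib ipaddress module. A sketch of the check, not VCF's actual implementation:

```python
import ipaddress
from collections import Counter

def check_network_separation(networks):
    """Return errors of the kind VCF raises for duplicated subnets/gateways.
    `networks` maps name -> (subnet_cidr, gateway)."""
    errors = []
    subnets = Counter(ipaddress.ip_network(s) for s, _ in networks.values())
    gateways = Counter(ipaddress.ip_address(g) for _, g in networks.values())
    for subnet, n in subnets.items():
        if n > 1:
            errors.append(f"Subnet {subnet} is duplicated across networks")
    for gw, n in gateways.items():
        if n > 1:
            errors.append(f"Gateway {gw} is duplicated across networks")
    return errors

# The working configuration from this lab passes cleanly:
working = {
    "Management": ("192.168.1.0/24", "192.168.1.1"),
    "vMotion":    ("192.168.100.0/24", "192.168.100.1"),
    "vSAN":       ("192.168.200.0/24", "192.168.200.1"),
    "NSX TEP":    ("192.168.250.0/24", "192.168.250.1"),
}
print(check_network_separation(working))  # []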
Do NOT use jumbo frames (MTU 9000) in nested VMware Workstation environments.
| Component | Recommended MTU |
|---|---|
| Distributed Switch | 1500-1600 |
| ESX Management | 1500 |
| vMotion | 1500 |
| vSAN | 1500 |
| NSX TEP | 1500 |
| Network | IP Range | Purpose |
|---|---|---|
| Management | 192.168.1.x | Static assignments per hosts file |
| vMotion | 192.168.100.10-20 | Automatic assignment by VCF |
| vSAN | 192.168.200.206-216 | Automatic assignment by VCF |
| NSX TEP | 192.168.250.10-25 | TEP IP Pool |
Problem: VCF detects duplicate IP when cluster FQDN resolves to same IP as Node IP.
Error:
IP address 192.168.1.90 for product VCF Automation is already resolved from an FQDN in the input specification
Solution: Use different IPs for cluster FQDN and node IP:
| Field | Value |
|---|---|
| Cluster hostname/FQDN | automation.lab.local (192.168.1.90) |
| Node IP 1 | 192.168.1.91 (automation-node1.lab.local) |
| Additional IP for upgrades | 192.168.1.81 |
| Node name prefix | automation |
| Internal Cluster CIDR | 198.18.0.0/15 |
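The Automation duplicate-IP validation can be mimicked with a trivial check on already-resolved addresses (a sketch, not VCF's code; it takes resolved IPs so no DNS lookup is needed):

```python
def check_automation_ips(cluster_fqdn_ip, node_ips):
    """The cluster FQDN must not resolve to any node IP; otherwise VCF
    reports the address as already resolved from an FQDN in the spec."""
    if cluster_fqdn_ip in node_ips:
        return [f"IP address {cluster_fqdn_ip} for product VCF Automation "
                "is already resolved from an FQDN in the input specification"]
    return []

# Lab values: cluster FQDN 192.168.1.90, node 192.168.1.91
print(check_automation_ips("192.168.1.90", ["192.168.1.91"]))  # []
```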
| Field | Value |
|---|---|
| Appliance Size | Medium |
| Appliance FQDN | nsx-node1.lab.local |
| Virtual IP (VIP) FQDN | nsx-vip.lab.local |
Problem: VDT reports "SAN contains neither hostname nor IP" for NSX VIP and NSX Manager certificates. The default NSX self-signed certificate uses a wildcard SAN (*.lab.local) without specific hostnames or IPs.
Solution: Generate a new self-signed certificate with explicit SAN entries and apply via NSX API.
Step 1: Create OpenSSL config on NSX Manager (SSH as root):
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
Important: DNS.3 = nsx-manager.lab.local is required because SDDC Manager registers NSX using this FQDN. Without it, VDT reports "SAN contains IP but not hostname".
Step 2: Generate cert, build JSON, import, and apply:
# Generate cert
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout /tmp/nsx.key -out /tmp/nsx.crt -config /tmp/nsx-cert.conf -sha256
# Build JSON payload (avoids shell escaping issues with PEM newlines)
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
# Import cert (single-line — NSX shell doesn't support backslash continuation)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Note the certificate ID from the response
# Get node UUID
curl -k -u admin:'<password>' https://192.168.1.71/api/v1/cluster
# Note the node UUID
# Apply to node (API service)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=API&node_id=<node-uuid>"
# Apply to VIP (MGMT_CLUSTER)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=MGMT_CLUSTER"
Prerequisite: All NSX services must be healthy (MANAGER, SEARCH, UI, NODE_MGMT all UP). If services are DOWN, the API returns error 101. Wait 10-15 minutes after NSX restart in nested environments.
Problem: After replacing the NSX self-signed certificate, VDT reports "NSX VIP Cert Trust: FAIL" and "NSX Manager Cert Trust: FAIL". The new self-signed cert's root is not in SDDC Manager's keystores (the original cert was pre-trusted during bringup).
Solution: Import the NSX certificate into both SDDC Manager trust stores.
Step 1: Pull the NSX certificate (SSH to SDDC Manager as root):
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Verify it's the correct cert
openssl x509 -in /tmp/nsx-root.crt -noout -text | grep -A2 "Subject Alternative Name"
Step 2: Import into VCF trust store:
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Step 3: Import into Java cacerts:
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
Step 4: Restart SDDC Manager services (~5 minutes):
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
Key paths:
| Item | Path/Value |
|---|---|
| VCF trust store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store |
| VCF trust store password | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key |
| Java cacerts | /etc/alternatives/jre/lib/security/cacerts |
| Java cacerts password | changeit |
| Service restart script | /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh |
Reference: https://knowledge.broadcom.com/external/article/316056
Note on SDDC Manager SSH: Only the vcf user can SSH in (root and admin are rejected). Use su - from the vcf session to get root access. SCP does not work due to the restricted shell; use ssh vcf@host "cat > file" < localfile for file transfers.
| Field | Value |
|---|---|
| Appliance FQDN | vcenter.lab.local |
| Appliance Size | Tiny |
| Datacenter Name | mgmt-dc01 |
| Cluster Name | mgmt-cl01 |
| SSO Domain Name | vsphere.local |
Problem: VCF vLCM (vSphere Lifecycle Manager) requires SSH access to ESXi hosts during vCenter deployment for host seeding. If SSH is disabled, the deployment fails with:
vCenter installation failed. Check logs under /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX for more details
Symptoms in Logs
Extraction of image from host esxi01.lab.local failed
Root Cause: SSH service is stopped or disabled on the ESXi hosts. VCF needs SSH to extract ESXi image metadata for vLCM host seeding.
Solution: Enable SSH on All ESXi Hosts
Run on each ESXi host BEFORE starting VCF deployment:
# Enable SSH service
vim-cmd hostsvc/enable_ssh
# Start SSH service
vim-cmd hostsvc/start_ssh
# Verify SSH is running
vim-cmd hostsvc/runtimeinfo | grep ssh
Alternative Method (from ESXi Shell)
# Enable and start SSH
esxcli system ssh set --enable=true
# Verify SSH status
esxcli system ssh get
Note: SSH can be disabled after successful VCF deployment for security.
VCF Installer Log Monitoring
Watch deployment progress from VCF Installer:
# Find the latest ci-installer log directory
ls -lt /var/log/vmware/vcf/domainmanager/ | head -5
# Watch the installation log
tail -f /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
# Search for errors
grep -i "error\|failed\|exception" /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
vCenter VM Direct Monitoring
SSH to the vCenter VM during deployment (default password: vmware):
# Watch firstboot progress
tail -f /var/log/firstboot/firstbootStatus.json
# Watch detailed installation
tail -f /var/log/vmware/firstboot/installer.log
# Check VMware services
vmon-cli --list
# Check specific service status
vmon-cli --status <service-name>
Expected Deployment Stages
Symptoms
Diagnostic Commands (SSH to vCenter VM)
# Check current deployment status
cat /var/log/firstboot/firstbootStatus.json
# Check for running processes
ps aux | grep -E "install|firstboot|postgres|vpxd"
# Check disk I/O (should show activity)
vmstat 1 5
# Check memory usage
free -h
# Check for error logs
tail -50 /var/log/vmware/firstboot/installer.log
grep -i "error\|fail\|exception" /var/log/vmware/firstboot/*.log
PostgreSQL Database Issues
If deployment is stuck at "Installing Containers" (60%), check postgres:
# Check if postgres service exists
ls -la /storage/db/vpostgres/
# Check for postgres config file
ls -la /storage/db/vpostgres/postgresql.conf
# Check postgres user/group
grep postgres /etc/passwd
grep postgres /etc/group
# Check postgres logs
tail -50 /var/log/vmware/vpostgres/*.log
If PostgreSQL Never Initialized
A missing /storage/db/vpostgres/postgresql.conf together with a missing postgres user indicates that database initialization failed. This is typically unrecoverable and requires full redeployment.
Service Startup Issues
# List all VMware services and status
vmon-cli --list
# Check rhttpproxy (reverse proxy)
systemctl status rhttpproxy
tail -50 /var/log/vmware/rhttpproxy/rhttpproxy.log
# Check vpostgres
systemctl status vmware-vpostgres
tail -50 /var/log/vmware/vpostgres/postgresql*.log
When vCenter deployment fails, VCF provides a reference token. To find detailed errors:
# Search for reference token in logs
grep -r "REFERENCE_TOKEN" /var/log/vmware/vcf/
# Example: Reference Token 3OHCKD
grep -r "3OHCKD" /var/log/vmware/vcf/
grep -B20 -A20 "3OHCKD" /var/log/vmware/vcf/domainmanager/*.log
VCF does not provide a rollback mechanism for failed management domain deployments. A failed deployment requires manual cleanup of the failed vCenter VM, the distributed switch configuration, the vSAN disk groups and partitions, and the depot connection:
Step 1: Delete Failed vCenter VM
From any ESXi host (or the one hosting the failed vCenter):
# List all VMs
vim-cmd vmsvc/getallvms
# Find the vCenter VM ID (look for vcenter.lab.local or similar)
# Power off the VM if running
vim-cmd vmsvc/power.off <vmid>
# Unregister the VM
vim-cmd vmsvc/unregister <vmid>
# Delete the VM files from datastore (if needed)
rm -rf /vmfs/volumes/<datastore>/vcenter.lab.local/
Step 2: Clean Up VDS (Distributed Switch)
From ESXi hosts, remove VDS configuration:
# List current virtual switches
esxcli network vswitch dvs vmware list
# Remove host from VDS (if configured)
# This is typically done from vCenter, but if vCenter is gone:
# Remove vmkernel ports from VDS
esxcli network ip interface remove -i vmk1 # vMotion
esxcli network ip interface remove -i vmk2 # vSAN
# Remove VDS uplink
esxcli network vswitch dvs vmware list
Step 3: Clean Up vSAN Configuration
Run on EACH ESXi host:
# List current vSAN storage
esxcli vsan storage list
# Remove vSAN disk groups
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
# If remove fails, check disk state
vdq -q
# Delete partitions from cache disk (example device name)
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Delete partitions from capacity disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2
# Verify disks are now eligible
vdq -q
Common vSAN Cleanup Error
If you see: cache disk/s are in an invalid state...available size is 0.0 GB
This means disks still have partitions. Use partedUtil to delete them.
Step 4: Remove Depot Connection (VCF UI)
Step 5: Verify Hosts Are Ready
On each ESXi host, verify:
# Check hostname is correct
esxcli system hostname get
# Check certificate is valid
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# Check SSH is enabled
vim-cmd hostsvc/runtimeinfo | grep ssh
# Check vSAN disks are eligible
vdq -q
# Check no VDS remnants
esxcli network vswitch dvs vmware list
Step 6: Restart VCF Services (Optional)
On VCF Installer, restart services for clean state:
systemctl restart lcm
systemctl restart domainmanager
# Wait for services to start
sleep 120
# Verify services are running
systemctl status lcm
systemctl status domainmanager
START: VCF Deployment Failed
│
├─→ Note reference token from error message
│ └─→ Search logs: grep -r "TOKEN" /var/log/vmware/vcf/
│
├─→ Delete failed vCenter VM
│ ├─→ vim-cmd vmsvc/getallvms
│ ├─→ vim-cmd vmsvc/power.off <vmid>
│ └─→ vim-cmd vmsvc/unregister <vmid>
│
├─→ Clean up vSAN on EACH host
│ ├─→ esxcli vsan storage remove -d <device>
│ ├─→ partedUtil delete ... (both partitions)
│ └─→ vdq -q (verify eligible)
│
├─→ Clean up VDS (if configured)
│ └─→ esxcli network ip interface remove ...
│
├─→ Remove depot connection in VCF UI
│ └─→ Re-add with certificate
│
├─→ Verify SSH enabled on all hosts
│ └─→ vim-cmd hostsvc/enable_ssh
│
└─→ Retry deployment
# Check LCM service status
systemctl status lcm
# Restart LCM service
systemctl restart lcm
# Check Domain Manager status
systemctl status domainmanager
# Restart Domain Manager
systemctl restart domainmanager
# View LCM logs
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log
# Search LCM logs for errors
grep -i "error\|fatal\|exception" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20
# Download certificate from remote server
openssl s_client -connect <IP>:<PORT> </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/cert.crt
# View certificate details
openssl x509 -in /tmp/cert.crt -text -noout
# Get certificate fingerprint
openssl x509 -in /tmp/cert.crt -noout -fingerprint -sha256
# Import certificate to Java truststore
keytool -import -trustcacerts -alias <alias> -file /tmp/cert.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
# Delete certificate from truststore
keytool -delete -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# List certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# View specific certificate in truststore
keytool -list -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v
# Test connectivity
ping <IP>
# Test SSL connection
openssl s_client -connect <IP>:<PORT>
# Test with specific TLS version
openssl s_client -connect <IP>:<PORT> -tls1_2
# Test HTTP endpoint
curl -v -k -u admin:admin https://<IP>:<PORT>/path
# Get hostname
esxcli system hostname get
# Set hostname
esxcli system hostname set --fqdn=<FQDN>
# Get system version
esxcli system version get
# View current certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout
# View certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# Backup certificates
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
# Regenerate certificates
/sbin/generate-certificates
# Restart services after certificate change
services.sh restart
# List all storage devices
esxcli storage core device list
# List devices with SSD status
esxcli storage core device list | grep -E "^t10|^naa|Is SSD"
# Check vSAN eligible disks
vdq -q
# List vSAN storage
esxcli vsan storage list
# Get partition table
partedUtil getptbl /vmfs/devices/disks/<device>
# Delete partition
partedUtil delete /vmfs/devices/disks/<device> <partition-number>
# Remove vSAN disk group
esxcli vsan storage remove -d <device>
# Add SATP rule for SSD
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d <device> -o enable_ssd
# Reclaim device after SATP rule
esxcli storage core claiming reclaim -d <device>
# List SATP rules
esxcli storage nmp satp rule list | grep enable_ssd
# Enter maintenance mode
esxcli system maintenanceMode set -e true -m noAction
# Exit maintenance mode
esxcli system maintenanceMode set -e false
# Check maintenance mode status
esxcli system maintenanceMode get
# Navigate to depot directory
cd C:\VCF-DEPOT
# Generate certificate
python generate_cert.py
# Start HTTPS server
python https_server.py
# Install Python cryptography library
pip install cryptography
START: "Secure protocol communication error"
│
├─→ Test connectivity: ping <depot-ip>
│ └─→ FAIL: Check network/firewall
│
├─→ Test SSL: openssl s_client -connect <ip>:8443
│ └─→ FAIL: Check depot server is running
│
├─→ Check certificate: View cert details
│ └─→ Wrong hostname: Regenerate certificate
│
├─→ Import certificate to Java truststore
│ └─→ keytool -import ...
│
├─→ Verify fingerprints match
│ └─→ MISMATCH: Re-import correct certificate
│
└─→ Restart LCM service
└─→ Wait 2 minutes, retry connection
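The "verify fingerprints match" step above can be done without parsing `keytool` output. Below is a stdlib-only sketch that computes a certificate's SHA-256 fingerprint from its PEM text; compare the result against what `keytool -list` reports for the imported alias.

```python
# Sketch: SHA-256 fingerprint of a PEM certificate, keytool-style.
import base64
import hashlib
import re


def pem_to_der(pem_text):
    """Extract and decode the base64 body of the first cert in a PEM blob."""
    match = re.search(
        r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
        pem_text, re.DOTALL,
    )
    if not match:
        raise ValueError("no certificate found in PEM input")
    return base64.b64decode("".join(match.group(1).split()))


def sha256_fingerprint(pem_text):
    """Colon-separated SHA-256 fingerprint over the DER bytes."""
    digest = hashlib.sha256(pem_to_der(pem_text)).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Feed it the output of `openssl s_client -connect <ip>:8443 -showcerts`; a mismatch against the truststore entry means the wrong certificate was imported.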
START: "Certificate doesn't match subject alternative names"
│
├─→ Check current cert SAN
│ └─→ openssl x509 -in /etc/vmware/ssl/rui.crt ...
│
├─→ Set correct hostname
│ └─→ esxcli system hostname set --fqdn=<FQDN>
│
├─→ Backup old certificates
│ └─→ mv /etc/vmware/ssl/rui.* /etc/vmware/ssl/rui.*.bak
│
├─→ Generate new certificates
│ └─→ /sbin/generate-certificates
│
├─→ Restart services
│ └─→ services.sh restart
│
└─→ Update thumbprints in VCF
└─→ Re-validate hosts in UI
START: "Found zero SSD devices for SSD cache tier"
│
├─→ Check SSD status: esxcli storage core device list
│ └─→ "Is SSD: false" → Continue
│
├─→ Shut down ESXi VM in Workstation
│
├─→ Edit VMX file
│ └─→ Add: sata0:X.virtualSSD = 1
│
├─→ Power on ESXi VM
│
├─→ Verify SSD status
│ └─→ Still false: Check VMX syntax
│
├─→ Check vSAN eligibility: vdq -q
│ └─→ "Has partitions" → Clean up partitions
│
└─→ Clean up old vSAN config
├─→ esxcli vsan storage remove -d <device>
└─→ partedUtil delete ...
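The "Edit VMX file" step above is simple enough to script. This sketch appends the `virtualSSD` flag for a given device, replacing any existing value; the device name `sata0:1` is an illustration only — match it to the disk entries actually present in your VMX.

```python
# Sketch: mark a disk as SSD in a Workstation VMX file.
# Device name "sata0:1" is an assumption; check your VMX for the real one.
def mark_virtual_ssd(vmx_text, device="sata0:1"):
    key = f"{device}.virtualSSD"
    # Drop any existing virtualSSD line for this device, then re-add it as 1.
    lines = [l for l in vmx_text.splitlines() if not l.startswith(key)]
    lines.append(f'{key} = "1"')
    return "\n".join(lines) + "\n"
```

Apply it with the VM powered off, then power on and re-check `Is SSD` via `esxcli storage core device list`.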
START: vCenter deployment stuck at percentage
│
├─→ Wait 30 minutes (large downloads may be slow)
│
├─→ SSH to vCenter VM (ssh root@<vcenter-ip>)
│ └─→ Default password: vmware
│
├─→ Check firstboot status
│ └─→ cat /var/log/firstboot/firstbootStatus.json
│
├─→ Check for activity
│ ├─→ vmstat 1 5 (disk I/O)
│ └─→ tail -f /var/log/vmware/firstboot/installer.log
│
├─→ If stuck at 60% "Installing Containers"
│ ├─→ Check postgres: ls /storage/db/vpostgres/
│ ├─→ Missing postgresql.conf → Database failed to init
│ └─→ UNRECOVERABLE: Must redeploy
│
├─→ Check services: vmon-cli --list
│ └─→ Services not started → Check individual logs
│
└─→ If unrecoverable:
├─→ Delete vCenter VM
├─→ Clean up vSAN on all hosts
├─→ Reset depot connection
└─→ Retry deployment
START: "Extraction of image from host failed"
│
├─→ Check SSH status on ESXi host
│ └─→ vim-cmd hostsvc/runtimeinfo | grep ssh
│
├─→ SSH Disabled?
│ ├─→ vim-cmd hostsvc/enable_ssh
│ └─→ vim-cmd hostsvc/start_ssh
│
├─→ Verify SSH on ALL hosts
│ └─→ Repeat for esxi01, esxi02, esxi03, esxi04
│
└─→ Retry vCenter deployment
START: VDT reports NSX cert FAIL (Trust or SAN)
│
├─→ Check which check failed
│ ├─→ SAN FAIL: Certificate missing hostnames/IPs
│ └─→ Trust FAIL: Certificate root not in SDDC Manager keystores
│
├─→ If SAN FAIL:
│ ├─→ SSH to NSX Manager as root
│ ├─→ Create OpenSSL config with all SANs:
│ │ DNS.1 = nsx-vip.lab.local
│ │ DNS.2 = nsx-node1.lab.local
│ │ DNS.3 = nsx-manager.lab.local ← SDDC Manager's registered FQDN
│ │ IP.1 = 192.168.1.70 (VIP)
│ │ IP.2 = 192.168.1.71 (node)
│ ├─→ Generate cert: openssl req -x509 ...
│ ├─→ Build JSON: python (avoid shell PEM escaping)
│ ├─→ Import via API: POST /api/v1/trust-management/certificates?action=import
│ ├─→ Apply to node: ?action=apply_certificate&service_type=API&node_id=<uuid>
│ └─→ Apply to VIP: ?action=apply_certificate&service_type=MGMT_CLUSTER
│
├─→ If Trust FAIL (after cert replacement):
│ ├─→ SSH to SDDC Manager as vcf, then su - to root
│ ├─→ Pull cert: openssl s_client ... > /tmp/nsx-root.crt
│ ├─→ Import to VCF store: keytool -importcert ... trusted_certificates.store
│ ├─→ Import to Java cacerts: keytool -importcert ... cacerts
│ └─→ Restart services: sddcmanager_restart_services.sh
│
└─→ Re-run VDT after ~5 minutes
└─→ Expected: NSX cert checks all PASS
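The SAN-fix branch above ("Create OpenSSL config with all SANs" and "Build JSON: python") can be sketched as two small helpers. The SAN config layout is a generic OpenSSL `req` config; the JSON field names (`display_name`, `pem_encoded`, `private_key`) follow the NSX trust-management import API, but verify them against your NSX version's API reference before use.

```python
# Sketch: build the OpenSSL SAN config and the NSX cert-import JSON body
# in Python, avoiding shell PEM-escaping problems.
import json


def san_config(dns_names, ips):
    """OpenSSL req config with every SAN the VDT check expects."""
    lines = ["[req]", "distinguished_name = dn", "x509_extensions = v3",
             "prompt = no", "[dn]", f"CN = {dns_names[0]}",
             "[v3]", "subjectAltName = @alt", "[alt]"]
    lines += [f"DNS.{i} = {name}" for i, name in enumerate(dns_names, 1)]
    lines += [f"IP.{i} = {ip}" for i, ip in enumerate(ips, 1)]
    return "\n".join(lines) + "\n"


def import_payload(pem_cert, pem_key, name="nsx-san-cert"):
    """Body for POST /api/v1/trust-management/certificates?action=import.
    Field names are per the NSX API docs -- confirm for your version."""
    return json.dumps({
        "display_name": name,
        "pem_encoded": pem_cert,
        "private_key": pem_key,
    })
```

Write `san_config(...)` to a file for `openssl req -x509 -config`, then POST `import_payload(...)` with the admin credentials.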
START: Infrastructure Health Adapter shows "no data receiving"
│
├─→ Check adapter log (VCF Ops 9.x path):
│ tail -100 /storage/log/vcops/log/adapters/
│ VMwareInfraHealthAdapter/VMwareInfraHealthAdapter_55.log
│
├─→ If "Unable to fetch access token for the SDDC manager":
│ ├─→ Test SDDC Manager auth from VCF Ops node:
│ │ curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
│ │ -H "Content-Type: application/json" \
│ │ -d '{"username":"administrator@vsphere.local","password":"..."}'
│ ├─→ If token returned: credential issue in adapter
│ │ ├─→ UI → Administration → Integrations → SDDC Manager
│ │ ├─→ If System Managed Credential enabled → click ROTATE
│ │ ├─→ If Credential dropdown empty → uncheck System Managed,
│ │ │ click +, create credential, select it
│ │ ├─→ Click VALIDATE CONNECTION → confirm "valid"
│ │ ├─→ Click SAVE
│ │ └─→ Reboot VCF Ops appliance (adapter may not pick up
│ │ new credential without full restart)
│ └─→ If connection refused: check DNS/network/cert trust
│
├─→ If "PKIX path building failed" for NSX:
│ ├─→ If NSX is powered off → expected, ignore
│ └─→ If NSX is running → See flowchart 9.8 below
│
├─→ If "vROPs is not configured with NTP server":
│ └─→ Configure NTP on VCF Ops appliance (cosmetic warning,
│ does not block data collection)
│
└─→ After fix: wait 10 min for 2 collection cycles
└─→ Verify: adapter status changes to "Collecting" (green)
| Item | Path |
|---|---|
| Adapter logs | /storage/log/vcops/log/adapters/<AdapterName>/ |
| Main vcops logs | /storage/log/vcops/log/ |
| Collector GC log | /storage/log/vcops/log/collector-gc-*.log |
| VCF Adapter log | /storage/log/vcops/log/adapters/VcfAdapter/VcfAdapter_254.log |
| VMware Adapter log | /storage/log/vcops/log/adapters/VMwareAdapter/VMwareAdapter_63.log |
| vSAN Adapter log | /storage/log/vcops/log/adapters/VsanStorageAdapter/VsanStorageAdapter_257.log |
Note: VCF Operations 9.x does NOT use `/var/log/vmware/vcops/adapters/` — that path from older Aria Operations versions no longer exists. All adapter logs are under `/storage/log/vcops/log/adapters/`.
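The SDDC Manager token test from the flowchart above can also be run from Python instead of curl. This is a sketch: the hostname and username are the lab values from this handbook, and certificate verification is disabled on the assumption of lab self-signed certs (do not do this in production).

```python
# Sketch: POST /v1/tokens against SDDC Manager and extract the access token.
import json
import ssl
import urllib.request


def token_request(host, username, password):
    """Build the POST /v1/tokens request object."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"https://{host}/v1/tokens", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )


def extract_access_token(response_text):
    """Pull accessToken out of the /v1/tokens response body."""
    return json.loads(response_text)["accessToken"]


def fetch_token(host, username, password):
    ctx = ssl._create_unverified_context()  # lab only: self-signed certs
    req = token_request(host, username, password)
    with urllib.request.urlopen(req, context=ctx, timeout=15) as resp:
        return extract_access_token(resp.read().decode())
```

A returned token confirms SDDC Manager auth works and points the blame at the adapter credential, matching the flowchart's branch logic.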
START: NSX adapter shows Warning, logs show "PKIX path building failed"
│
├─→ Verify NSX is actually reachable from VCF Ops node:
│ curl -sk https://nsx-vip.lab.local/api/v1/node/status | head -5
│ ├─→ "No route to host" → NSX not ready (check load avg below)
│ └─→ Returns JSON → NSX is up, proceed to cert fix
│
├─→ If VIP (.70) unreachable but node (.71) responds:
│ ├─→ NSX cluster VIP not online yet
│ ├─→ Check load: curl -sk -u admin:'<pass>'
│ │ https://<node>:443/api/v1/node/status | grep load_average
│ ├─→ Load > 20 on 6 cores = still booting (normal in nested,
│ │ can take 30-60 min after power-on)
│ └─→ Wait for load < 20, VIP will come online automatically
│
├─→ Import NSX cert into VCF Ops Java truststore:
│ ├─→ openssl s_client -connect nsx-vip.lab.local:443 \
│ │ -showcerts </dev/null 2>/dev/null \
│ │ | openssl x509 -outform PEM > /tmp/nsx-cert.pem
│ ├─→ Find truststore: java -XshowSettings:properties 2>&1
│ │ | grep java.home
│ │ → /usr/java/jre-vmware-17
│ ├─→ keytool -importcert -alias nsx-vip \
│ │ -file /tmp/nsx-cert.pem \
│ │ -keystore /usr/java/jre-vmware-17/lib/security/cacerts \
│ │ -storepass changeit -noprompt
│ └─→ Reboot VCF Ops appliance
│
├─→ Fix NSX credential (two adapters to check):
│ │
│ ├─→ VCF section → nsx-vip.lab.local adapter:
│ │ ├─→ System Managed Credential ROTATE rarely works for NSX
│ │ ├─→ Uncheck System Managed Credential
│ │ ├─→ Click + → create credential (admin / password)
│ │ ├─→ Select credential, VALIDATE CONNECTION, SAVE
│ │ └─→ If VIP still unreachable, wait for NSX load to settle
│ │
│ └─→ NSX section → Aria Admin adapter:
│ ├─→ Points to nsx-manager.lab.local (node FQDN)
│ ├─→ May connect even when VIP is down
│ ├─→ Set credential (admin / password)
│ ├─→ VALIDATE CONNECTION → SAVE
│ └─→ This can start collecting before VIP comes online
│
└─→ Verify: All adapters show "Collecting" (green) in
Administration → Integrations
Key insight: VCF Operations has TWO separate NSX adapters — one under the VCF Cloud Foundation account (uses VIP) and one under the standalone NSX section called "Aria Admin" (uses node FQDN). Both need valid credentials. The Aria Admin adapter can connect via the node FQDN even when the VIP is still offline after a fresh NSX boot.
Java truststore path on VCF Ops 9.x: `/usr/java/jre-vmware-17/lib/security/cacerts` (password: `changeit`). The legacy `/usr/java/jre-vmware/` path does not exist.
Symptoms:

- "Resources [nsx-vip.lab.local] are not available/ready" or "not in ACTIVE state"
- "Unable to acquire resource level lock(s)"
- "[2] account(s) has been disconnected"
- `/v1/nsxt-clusters` shows empty or non-ACTIVE status
- Tasks pile up as IN_PROGRESS (`/v1/tasks?status=IN_PROGRESS`)

Root Cause Chain: A failed credential operation (often due to NSX being temporarily unreachable during a boot storm or maintenance) triggers a cascade:

- The NSX resource is left in an `ACTIVATING` or `ERROR` state in the `platform.nsxt` table
- Stale locks remain in the `platform.lock` table, blocking all new operations
- Tasks stay `IN_PROGRESS` in `platform.task_metadata` (`resolved=false`), piling up

Diagnosis:
# 1. Get auth token from SDDC Manager
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"<password>"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")
# 2. Check NSX cluster resource state (look for status field)
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# If status is "ACTIVATING" or "ERROR" instead of "ACTIVE" → this is the problem
# 3. Check for stale resource locks
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# Stale locks from failed operations will block all new operations
# 4. Check for stuck IN_PROGRESS tasks
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; d=json.load(sys.stdin); print(f'Stuck tasks: {len(d.get(\"elements\",[]))}')"
# 5. Verify NSX is actually healthy (from SDDC Manager)
curl -sk -u admin:'<password>' --connect-timeout 10 \
https://nsx-vip.lab.local/api/v1/cluster/status
# overall_status should be "STABLE"
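The diagnosis responses above are easier to eyeball with a couple of parsing helpers. These are pure functions over the JSON bodies the curl commands return (field names per the responses described in this section); pipe the curl output into a script that calls them.

```python
# Sketch: interpret the SDDC Manager API responses from the diagnosis steps.
def non_active_nsx_clusters(payload):
    """IDs of NSX clusters whose status is not ACTIVE (the failure signature)."""
    return [c["id"] for c in payload.get("elements", [])
            if c.get("status") != "ACTIVE"]


def stuck_task_count(payload):
    """Number of IN_PROGRESS tasks returned by /v1/tasks?status=IN_PROGRESS."""
    return len(payload.get("elements", []))
```

A non-empty `non_active_nsx_clusters()` result plus a growing `stuck_task_count()` is the cascade signature that calls for the database repair below.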
Fix — Full Database Repair:
WARNING: Direct database manipulation is unsupported and should only be done in lab environments. Always back up before modifying.
Step 1: Access PostgreSQL on SDDC Manager
SSH to SDDC Manager as vcf, then su - to root. PostgreSQL uses TCP on 127.0.0.1 (not Unix sockets), and the password may not be easily discoverable. Disable the psql pager to prevent --More-- prompts from corrupting interactive shell sessions:
# Back up pg_hba.conf
cp /data/pgdata/pg_hba.conf /data/pgdata/pg_hba.conf.bak
# Temporarily allow passwordless local connections
sed -i 's/scram-sha-256/trust/g' /data/pgdata/pg_hba.conf
# Reload postgres (no restart needed)
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Disable psql pager (CRITICAL for scripted/remote sessions)
export PAGER=cat
export PGPAGER=cat
Step 2: Fix the stuck resource status
The nsxt table status can be ACTIVATING, ERROR, or other non-ACTIVE values:
# Check current NSX resource status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -t -c \"SELECT id, status FROM nsxt;\""
# Fix ANY non-ACTIVE status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';\""
Step 3: Clear stale resource locks
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT count(*) FROM lock;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM lock;\""
Step 4: Mark stuck tasks as resolved
The task_metadata table in the platform DB tracks task resolution state. Unresolved tasks (resolved=false) from failed operations accumulate and can interfere with new operations:
# Check unresolved task count
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT resolved, count(*) FROM task_metadata GROUP BY resolved;\""
# Mark all unresolved tasks as resolved
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE task_metadata SET resolved = true WHERE resolved = false;\""
# Clear task_lock table if any entries exist
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM task_lock;\""
Step 5: Restore pg_hba.conf (CRITICAL — do not skip)
cp /data/pgdata/pg_hba.conf.bak /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Verify it's back to scram-sha-256
grep -c 'scram-sha-256' /data/pgdata/pg_hba.conf
# Should return 4 or more
Step 6: Restart operationsmanager service
systemctl restart operationsmanager
# Wait 2-3 minutes for it to fully start
systemctl is-active operationsmanager
Verification:
# NSX cluster should now show ACTIVE
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; [print(f'{c[\"id\"]}: {c[\"status\"]}') for c in json.load(sys.stdin).get('elements',[])]"
# Resource locks should be empty
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN"
# IN_PROGRESS tasks should be zero or minimal
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; print(f'IN_PROGRESS: {len(json.load(sys.stdin).get(\"elements\",[]))}')"
# Credential remediate should now succeed via VCF Operations Fleet Management UI
Credential Cascade Failure Flowchart:
┌──────────────────────────────────────────────┐
│ Credential Update/Rotate/Remediate fails │
│ in SDDC Manager or VCF Operations UI │
└──────────────────┬───────────────────────────┘
│
┌────────▼────────┐
│ Check task error │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
"not in "Unable to "503 Service
ACTIVE state" acquire lock" Unavailable"
│ │ │
▼ ▼ ▼
Fix nsxt Delete from NSX still
table status lock table booting/
(ACTIVATING/ in platform unstable
ERROR→ACTIVE) DB │
│ │ ▼
│ │ Wait for
│ │ NSX load
│ │ to settle
│ │ (< 20)
└──────┬───────┘ │
▼ │
Mark task_metadata │
resolved = true ◄──────┘
│
▼
Clear task_lock
│
▼
Restart
operationsmanager
│
▼
Retry credential
operation
Key insight: Three tables in the `platform` database must be cleaned: (1) `nsxt` — resource status, (2) `lock` — operation locks, (3) `task_metadata` — task resolution tracking. The `operationsmanager` database has separate `task` and `execution` tables (columns: `task.state`, `execution.execution_status` — not `status`). The API won't let you cancel or delete stuck tasks — database repair is required.
psql pager trap: When running psql queries via Paramiko or a remote shell, the default pager (`less`/`more`) captures output and waits for interactive input, corrupting the session. Always set `PAGER=cat` before running psql commands, or pass it inline: `PAGER=cat psql -h 127.0.0.1 -d platform -c "..."`. For Paramiko `invoke_shell()`, also set `height=1000` to prevent terminal-based paging.
PostgreSQL on SDDC Manager: Uses TCP on `127.0.0.1` (not Unix sockets — you'll get "No such file or directory" without `-h 127.0.0.1`). Data directory is `/data/pgdata`. Key databases: `platform` (nsxt, lock, task_metadata tables), `operationsmanager` (task, execution, processing_task tables). The `pg_hba.conf` trust workaround is a last resort — always restore the original immediately after.
vcf account lockout: Failed SSH attempts (including from automated scripts) can lock the `vcf` account. SDDC Manager uses `faillock` (not `pam_tally2`). Unlock from the console as root: `faillock --user vcf --reset`
None of the database repair procedure is documented by Broadcom. The schema was mapped through the following investigation:
Why the API wasn't enough:
- `PATCH /v1/tasks/{id}` with `{"status":"CANCELLED"}` returned `TA_TASK_CAN_NOT_BE_RETRIED` for every stuck task
- `DELETE /v1/tasks/{id}` returned HTTP 500

How the database was explored:
- Listed databases with `\l`: platform, operationsmanager, domainmanager, lcm, sddc_manager_ui, postgres
- Listed tables in the `platform` DB with `\dt`: found nsxt, lock, task_metadata, task_lock (plus vcenter, host, etc.)
- Inspected columns with `SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '<table>'`
- Found that `task_metadata` uses a `resolved` boolean (not a `status` field like you'd expect)
- Found that `operationsmanager.task` uses column `state` (not `status`) and `execution` uses `execution_status` (not `status`)
- Early attempts targeted `task` in the platform DB (wrong — it's `task_metadata`) and used a `status` column on the operationsmanager tables (wrong — it's `state` and `execution_status`)

Why each repair step is needed:
| Step | Table | Action | Why |
|---|---|---|---|
| 2 | `nsxt` | Set status to ACTIVE | The stuck ACTIVATING/ERROR status makes every new credential operation fail at prevalidation — SDDC Manager checks this before even attempting the operation |
| 3 | `lock` | Delete all rows | Stale exclusive locks from the failed operation block all new operations from acquiring their own locks — they'll fail with "Unable to acquire resource level lock(s)" |
| 4 | `task_metadata` | Set resolved=true | Unresolved tasks (resolved=false) accumulate with each UI retry. 47 were found during the initial diagnosis. These can interfere with new task scheduling |
| 4 | `task_lock` | Delete all rows | Links tasks to locks — clearing this ensures no orphaned task-lock relationships remain |
| 5 | `pg_hba.conf` | Restore backup | Leaving trust auth enabled means any local process can access PostgreSQL without a password — security risk |
| 6 | `operationsmanager` | Restart service | The service caches database state in memory. A restart forces it to re-read the cleaned tables and reset its internal state machine |
Automated scripts built from this knowledge:

- `clear_locks.py` — fixes nsxt status + clears lock table (quick fix for simple cases)
- `fix_stuck_tasks.py` — marks task_metadata resolved + clears task_lock (for accumulated stuck tasks)
- `full_remediate_fix.py` — combines NSX health check + all DB fixes + service restart (all-in-one cascade repair)

All scripts automate the pg_hba.conf backup/trust/restore cycle and use `PAGER=cat` to prevent pager traps in remote sessions.
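The pager-safe psql wrapping that those scripts rely on can be sketched as a command builder. This is not the scripts' actual code — just an illustration of the pattern: every statement from Steps 2–4 wrapped in `su - postgres` with `PAGER=cat`, ready to hand to an SSH/Paramiko executor.

```python
# Sketch: build pager-safe psql invocations for the Step 2-4 repair SQL.
# Command *builder* only -- execution (ssh/Paramiko) is left to the caller.
REPAIR_SQL = [
    "UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';",
    "DELETE FROM lock;",
    "UPDATE task_metadata SET resolved = true WHERE resolved = false;",
    "DELETE FROM task_lock;",
]


def psql_cmd(sql, db="platform"):
    """Wrap one SQL statement in a pager-safe su/psql invocation."""
    escaped = sql.replace('"', '\\"')
    return ('su - postgres -c "PAGER=cat psql -h 127.0.0.1 '
            f'-d {db} -c \\"{escaped}\\""')


def repair_commands():
    return [psql_cmd(sql) for sql in REPAIR_SQL]
```

Each generated command matches the shape used throughout Steps 2–4 above, so the output can be pasted into a root shell on SDDC Manager as-is (after the pg_hba.conf trust step, and before restoring it).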
Phase 7 — Feb 10–11, 2026
This is a chicken-and-egg bootstrap constraint in every VCF deployment:

- The vSAN datastore does not exist until bring-up completes — and bring-up is orchestrated by SDDC Manager itself
- So the installer has to place SDDC Manager on a local datastore first (`esxi01-local` in the lab)

This means SDDC Manager is always initially deployed to local storage and must be manually migrated to shared storage (vSAN) afterward.
SDDC Manager therefore landed on `esxi01-local` with 914GB of thick-provisioned disks (only ~108GB actually used).

Disk analysis:
| Disk | Allocated | Actual Used |
|---|---|---|
| sddc-manager.vmdk | 32GB | 2.6GB |
| sddc-manager_1.vmdk | 16GB | 2.6GB |
| sddc-manager_2.vmdk | 240GB | 3.0GB |
| sddc-manager_3.vmdk | 512GB | 99.5GB |
| sddc-manager_4.vmdk | 26GB | 30MB |
| sddc-manager_5.vmdk | 88GB | 64MB |
| Total | 914GB | ~108GB |
The vCenter migration wizard cannot thin-provision to vSAN. The workaround is to clone each disk individually as thin using vmkfstools directly on the ESXi host:
# SSH to esxi01 as root
# Clone each disk from local to vSAN as thin-provisioned
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_1.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_1.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_2.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_2.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_4.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_4.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_5.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_5.vmdk -d thin
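The six per-disk clones above follow one pattern, so they can be generated rather than typed. A small sketch (paths are the lab datastores from this section; adjust for your environment):

```python
# Sketch: generate the per-disk vmkfstools thin-clone commands.
# Source/destination paths match the lab datastores in this section.
SRC = "/vmfs/volumes/esxi01-local/sddc-manager"
DST = "/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager"


def clone_commands(vm="sddc-manager", disk_count=6):
    """One vmkfstools -i ... -d thin command per VMDK (base + _1.._N)."""
    disks = [f"{vm}.vmdk"] + [f"{vm}_{i}.vmdk" for i in range(1, disk_count)]
    return [f"vmkfstools -i {SRC}/{d} {DST}/{d} -d thin" for d in disks]
```

Print the list and paste it into the ESXi SSH session, or run one command at a time so a nested-storage timeout (like the 512GB disk failure below) only affects a single clone.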
After cloning:

- Browse the vSAN datastore → `sddc-manager` folder → right-click the `.vmx` → Register VM
- Power on and verify services (`systemctl status vcf-services`)

The 512GB disk clone failed partway through (ESXi connection timeout on nested storage). Fix:
# Delete the partial clone
vmkfstools -U /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk
# Retry the clone
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin
Key lesson: vCenter's migration wizard cannot thin-provision to vSAN. Always use `vmkfstools -i <src> <dst> -d thin` per disk. This is also the only way to reclaim wasted space from thick-provisioned VCF appliances — SDDC Manager went from 914GB allocated to ~108GB actual on vSAN.
Document Information
| Field | Value |
|---|---|
| Document Title | VCF 9.0.1 Nested Deployment Troubleshooting Handbook |
| Version | 1.5 |
| Last Updated | February 2026 |
| Environment | VMware Workstation 17.x Nested Lab |
| VCF Version | 9.0.1 |
This handbook is intended for lab and educational purposes. Always consult official VMware documentation for production deployments.
(c) 2026 Virtual Control LLC. All rights reserved.