VMware Cloud Foundation Solutions
Troubleshooting Handbook

End-to-end troubleshooting procedures for VCF components including SDDC Manager, vCenter, ESXi, and NSX.

VCF 9.0
Virtual Control
Proprietary & Confidential

VMware Cloud Foundation 9.0 Nested Deployment Troubleshooting Handbook

Version: 1.5
Date: February 2026
Environment: VCF 9.0.1 Nested in VMware Workstation
Author: Virtual Control LLC


Table of Contents

  1. Executive Summary
  2. Environment Overview
  3. Offline Depot Configuration
  4. ESXi Host Certificate Issues
  5. vSAN Storage Configuration
  6. Network Configuration for Nested VCF
  7. VCF Component Configuration
  8. Command Reference
  9. Troubleshooting Flowcharts
  10. SDDC Manager Credential Cascade Failure
  11. SDDC Manager Storage Migration (Local → vSAN)

1. Executive Summary

This handbook documents the complete troubleshooting process for deploying VMware Cloud Foundation 9.0.1 in a nested VMware Workstation environment. The deployment encountered several challenges, including offline depot TLS/certificate failures, ESXi host certificate hostname mismatches, vSAN SSD detection and disk eligibility issues, and network configuration validation errors.

This document provides step-by-step remediation procedures and recovery processes for nested VCF deployments.


2. Environment Overview

2.1 Infrastructure Components

Component FQDN IP Address
VCF Installer/SDDC Manager vcf-installer.lab.local 192.168.1.240
vCenter Server vcenter.lab.local 192.168.1.69
NSX Manager VIP nsx-vip.lab.local 192.168.1.70
NSX Manager Node 1 nsx-node1.lab.local 192.168.1.71
ESXi Host 1 esxi01.lab.local 192.168.1.74
ESXi Host 2 esxi02.lab.local 192.168.1.75
ESXi Host 3 esxi03.lab.local 192.168.1.76
ESXi Host 4 esxi04.lab.local 192.168.1.82
VCF Operations vcf-ops.lab.local 192.168.1.77
Fleet Management fleet.lab.local 192.168.1.78
Collector collector.lab.local 192.168.1.79
VCF Automation automation.lab.local 192.168.1.90
Automation Node 1 automation-node1.lab.local 192.168.1.91
Offline Depot Server (IP only - no DNS) 192.168.1.160
DNS/NTP Server (Windows Server) 192.168.1.230

2.2 Network Configuration

Network VLAN ID Subnet Gateway IP Range
ESX Management 0 192.168.1.0/24 192.168.1.1 DHCP/Static
VM Management 0 192.168.1.0/24 192.168.1.1 Same as ESX Mgmt
vMotion 100 192.168.100.0/24 192.168.100.1 192.168.100.10-20
vSAN 200 192.168.200.0/24 192.168.200.1 192.168.200.206-216
NSX TEP 300 192.168.250.0/24 192.168.250.1 192.168.250.10-25

2.3 ESXi VM Specifications (Nested)

Resource Value
vCPUs 8 (4 cores x 2 sockets)
Memory 48 GB
OS Disk 32 GB
vSAN Cache Disk 100 GB (SSD)
vSAN Capacity Disk 800 GB (SSD)
Network Adapters 4x vmxnet3

3. Offline Depot Configuration

3.1 TLS/SSL Certificate Issues

Problem Description

VCF 9.0.1 uses the BouncyCastle FIPS TLS implementation, which has strict certificate validation requirements. When connecting to an offline depot with a self-signed certificate, the connection fails with:

Secure protocol communication error, check logs for more details

Symptoms

  1. Depot connection fails in the VCF Installer UI with "Secure protocol communication error"
  2. TlsFatalAlert entries appear in /var/log/vmware/vcf/lcm/lcm-debug.log

Root Cause

  1. Self-signed certificate not trusted by Java keystore
  2. Certificate may lack proper extensions for FIPS compliance
  3. Certificate fingerprint mismatch between server and truststore

Diagnostic Commands (VCF Installer)

# Test SSL connectivity
openssl s_client -connect 192.168.1.160:8443

# Test with TLS 1.2 specifically
openssl s_client -connect 192.168.1.160:8443 -tls1_2

# Check cipher negotiation
openssl s_client -connect 192.168.1.160:8443 -tls1_2 </dev/null 2>&1 | grep -E "Cipher|Protocol|Verify"

# View certificate details
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -text -noout

# Get certificate fingerprint
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256

# Check LCM logs for TLS errors
grep -i "tlsfatal\|ssl\|certificate" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20

# Check LCM service status
systemctl status lcm
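When comparing the fingerprint reported by openssl against the truststore entry, it helps to compute the SHA-256 fingerprint directly from a saved PEM file. A minimal stdlib sketch (the input path is an example; pass it any PEM-encoded certificate):

```python
import base64
import hashlib
import re

def pem_fingerprint(pem_text):
    """SHA-256 fingerprint of the first PEM certificate, in openssl/keytool colon style."""
    match = re.search(
        r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
        pem_text, re.DOTALL)
    if not match:
        raise ValueError("no PEM certificate found")
    # b64decode ignores the newlines inside the PEM body and yields the DER bytes
    der = base64.b64decode(match.group(1))
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

For example, pem_fingerprint(open("/tmp/depot.crt").read()) should match the output of the openssl -fingerprint command above; a mismatch points at cause 3.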

3.2 Python HTTPS Server Setup

Server Script (C:\VCF-DEPOT\https_server.py)

#!/usr/bin/env python3
"""
HTTPS server for VCF Offline Depot
Serves files with TLS 1.2+ for SDDC Manager compatibility
"""
import http.server
import ssl
import os
import base64
import socketserver
from functools import partial

# Configuration
PORT = 8443
CERT_FILE = 'server.crt'
KEY_FILE = 'server.key'
USERNAME = 'admin'
PASSWORD = 'admin'

class AuthHandler(http.server.SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def __init__(self, *args, directory=None, **kwargs):
        super().__init__(*args, directory=directory, **kwargs)

    def do_HEAD(self):
        if not self.authenticate():
            return
        super().do_HEAD()

    def do_GET(self):
        if not self.authenticate():
            return
        super().do_GET()

    def do_POST(self):
        if not self.authenticate():
            return
        content_length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(content_length)
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

    def authenticate(self):
        auth_header = self.headers.get('Authorization')
        if auth_header is None:
            self.send_auth_request()
            return False

        try:
            auth_type, credentials = auth_header.split(' ', 1)
            if auth_type.lower() != 'basic':
                self.send_auth_request()
                return False

            decoded = base64.b64decode(credentials).decode('utf-8')
            username, password = decoded.split(':', 1)

            if username == USERNAME and password == PASSWORD:
                return True
        except Exception:
            pass

        self.send_auth_request()
        return False

    def send_auth_request(self):
        self.send_response(401)
        self.send_header('WWW-Authenticate', 'Basic realm="VCF Depot"')
        self.send_header('Content-type', 'text/html')
        self.send_header('Content-Length', '23')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'Authentication required')

    def log_message(self, format, *args):
        print(f"{self.client_address[0]} - {format % args}")

class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    daemon_threads = True

def run_server():
    os.chdir(os.path.dirname(os.path.abspath(__file__)))

    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_3

    if hasattr(context, 'post_handshake_auth'):
        context.post_handshake_auth = False

    context.options |= ssl.OP_NO_TICKET
    context.options |= getattr(ssl, 'OP_NO_RENEGOTIATION', 0)

    context.load_cert_chain(CERT_FILE, KEY_FILE)

    try:
        context.set_ciphers('DEFAULT:!aNULL:!MD5:!DSS')
    except ssl.SSLError:
        pass

    handler = partial(AuthHandler, directory=os.getcwd())
    server = ThreadedHTTPServer(('0.0.0.0', PORT), handler)
    server.socket = context.wrap_socket(server.socket, server_side=True)

    print(f"VCF Offline Depot Server")
    print(f"========================")
    print(f"Serving: {os.getcwd()}")
    print(f"URL: https://192.168.1.160:{PORT}/")
    print(f"Credentials: {USERNAME} / {PASSWORD}")
    print(f"TLS: 1.2 - 1.3")
    print(f"Press Ctrl+C to stop")

    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nStopped.")
        server.shutdown()

if __name__ == '__main__':
    run_server()
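Before pointing VCF at the depot, a quick client check confirms that TLS negotiation and Basic auth both work. A sketch runnable from any Python 3 machine (URL and credentials match the configuration above; certificate verification is disabled because the depot cert is self-signed):

```python
import base64
import ssl
import urllib.request

def basic_auth_header(username, password):
    """Build the Authorization header the server's authenticate() expects."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def check_depot(url="https://192.168.1.160:8443/", user="admin", pw="admin"):
    """Return the HTTP status; 200 means TLS handshake and auth both succeeded."""
    req = urllib.request.Request(url, headers=basic_auth_header(user, pw))
    ctx = ssl._create_unverified_context()  # self-signed depot certificate
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return resp.status
```

Call check_depot() with no arguments from the lab network; anything other than 200 (or a TLS exception) reproduces the problem VCF sees.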

3.3 Certificate Generation

Certificate Generation Script (C:\VCF-DEPOT\generate_cert.py)

"""
Generate simple self-signed certificate for VCF Offline Depot
"""
import subprocess
import sys
import os

def generate_cert():
    try:
        from cryptography import x509
        from cryptography.x509.oid import NameOID
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa
        from cryptography.hazmat.primitives import serialization
        from datetime import datetime, timedelta, timezone
        import ipaddress

        print("Generating RSA 2048-bit private key...")
        key = rsa.generate_private_key(
            public_exponent=65537,
            key_size=2048,
        )

        print("Creating self-signed certificate...")
        subject = issuer = x509.Name([
            x509.NameAttribute(NameOID.COMMON_NAME, "192.168.1.160"),
        ])

        now = datetime.now(timezone.utc)

        # Minimal self-signed certificate with an IP SAN entry
        cert = (
            x509.CertificateBuilder()
            .subject_name(subject)
            .issuer_name(issuer)
            .public_key(key.public_key())
            .serial_number(x509.random_serial_number())
            .not_valid_before(now)
            .not_valid_after(now + timedelta(days=365))
            .add_extension(
                x509.SubjectAlternativeName([
                    x509.IPAddress(ipaddress.IPv4Address("192.168.1.160")),
                ]),
                critical=False,
            )
            .add_extension(
                x509.BasicConstraints(ca=True, path_length=None),
                critical=True,
            )
            .sign(key, hashes.SHA256())
        )

        with open("server.key", "wb") as f:
            f.write(key.private_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PrivateFormat.TraditionalOpenSSL,
                encryption_algorithm=serialization.NoEncryption()
            ))
        print("Created: server.key")

        with open("server.crt", "wb") as f:
            f.write(cert.public_bytes(serialization.Encoding.PEM))
        print("Created: server.crt")

        fingerprint = cert.fingerprint(hashes.SHA256()).hex()
        formatted = ':'.join(fingerprint[i:i+2].upper() for i in range(0, len(fingerprint), 2))
        print(f"SHA256: {formatted}")

        return True
    except ImportError:
        print("Installing cryptography...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "cryptography"])
        return False

def main():
    os.chdir(r"C:\VCF-DEPOT")
    print("Generating certificate...")
    if not generate_cert():
        generate_cert()
    print("\nDone. Run: python https_server.py")

if __name__ == "__main__":
    main()

Windows PowerShell Commands

# Navigate to depot directory
cd C:\VCF-DEPOT

# Generate certificate
python generate_cert.py

# Start HTTPS server
python https_server.py

3.4 Certificate Import to VCF

Complete Certificate Import Procedure (VCF Installer)

# Step 1: Download certificate from depot server
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/depot.crt

# Step 2: Verify certificate was downloaded
cat /tmp/depot.crt

# Step 3: Get certificate fingerprint
openssl x509 -in /tmp/depot.crt -noout -fingerprint -sha256

# Step 4: Find Java truststore location
echo $JAVA_HOME
# Output: /usr/lib/jvm/openjdk-java17-headless.x86_64

# Step 5: Delete old certificate if exists
keytool -delete -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# Step 6: Import new certificate
keytool -import -trustcacerts -alias offline-depot -file /tmp/depot.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt

# Step 7: Verify import
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# Step 8: Restart LCM service
systemctl restart lcm

# Step 9: Wait for LCM to start (2 minutes)
systemctl status lcm

# Step 10: Verify LCM is ready
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log | grep -i "started\|ready"

Troubleshooting Certificate Issues

# Check all cacerts files on system
find / -name "cacerts" -type f 2>/dev/null

# List all certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# View certificate details in truststore
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v

# Check LCM logs for certificate errors
grep -B5 -A10 "TlsFatalAlert" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -40

4. ESXi Host Certificate Issues

4.1 Certificate Hostname Mismatch

Problem Description

ESXi hosts may have certificates with incorrect hostnames (e.g., "localhost.localdomain" instead of the actual FQDN), causing VCF validation to fail.

Symptoms

javax.net.ssl.SSLPeerUnverifiedException: Certificate for <esxi01.lab.local> doesn't match any of the subject alternative names: [localhost.localdomain]

Diagnostic Commands (ESXi Host)

# Check current hostname
esxcli system hostname get

# View current certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"

# View full certificate details
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout

4.2 Certificate Regeneration

Procedure (Run on Each ESXi Host)

# Step 1: Set correct hostname
esxcli system hostname set --fqdn=esxi01.lab.local

# Step 2: Verify hostname
esxcli system hostname get

# Step 3: Backup existing certificate
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak

# Step 4: Generate new certificate
/sbin/generate-certificates

# Step 5: Restart services
services.sh restart

# Step 6: Verify new certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"

Host-Specific Commands

esxi01.lab.local (192.168.1.74):

esxcli system hostname set --fqdn=esxi01.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart

esxi02.lab.local (192.168.1.75):

esxcli system hostname set --fqdn=esxi02.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart

esxi03.lab.local (192.168.1.76):

esxcli system hostname set --fqdn=esxi03.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart

esxi04.lab.local (192.168.1.82):

esxcli system hostname set --fqdn=esxi04.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
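The four blocks above differ only in the FQDN, so the sequence can be generated per host instead of copied by hand (a sketch; the host list mirrors the inventory in section 2.1):

```python
# ESXi hosts from section 2.1
HOSTS = [
    "esxi01.lab.local",
    "esxi02.lab.local",
    "esxi03.lab.local",
    "esxi04.lab.local",
]

def regen_commands(fqdn):
    """Certificate regeneration sequence for one ESXi host."""
    return [
        f"esxcli system hostname set --fqdn={fqdn}",
        "mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak",
        "mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak",
        "/sbin/generate-certificates",
        "services.sh restart",
    ]

if __name__ == "__main__":
    for host in HOSTS:
        print(f"# {host}")
        print("\n".join(regen_commands(host)))
```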

4.3 Thumbprint Updates

Get New Thumbprints (VCF Installer)

# Get thumbprint for each host
echo | openssl s_client -connect 192.168.1.74:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.75:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.76:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.82:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

After regenerating certificates, update the thumbprints in VCF Installer UI by re-validating the hosts.
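The four openssl calls can also be scripted so all thumbprints are collected in one pass. A sketch (the __main__ loop needs network access to the hosts; sha256_thumbprint itself is pure):

```python
import hashlib
import socket
import ssl

# ESXi management IPs from section 2.1
ESXI_HOSTS = ["192.168.1.74", "192.168.1.75", "192.168.1.76", "192.168.1.82"]

def sha256_thumbprint(der_bytes):
    """Format a DER certificate's SHA-256 digest as AA:BB:...:FF."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def host_thumbprint(host, port=443, timeout=5):
    """Fetch the host's certificate over TLS and return its SHA-256 thumbprint."""
    ctx = ssl._create_unverified_context()  # ESXi certs are self-signed here
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return sha256_thumbprint(tls.getpeercert(binary_form=True))
```

Run `for h in ESXI_HOSTS: print(h, host_thumbprint(h))` from the VCF Installer and compare the output against the thumbprints shown during host validation.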


5. vSAN Storage Configuration

5.1 SSD Detection Issues

Problem Description

In nested VMware Workstation environments, virtual disks are not automatically detected as SSDs, causing vSAN cache tier configuration to fail.

Symptoms

ESX Host esxi01.lab.local found zero SSD devices for SSD cache tier

Diagnostic Commands (ESXi Host)

# List all storage devices with SSD status
esxcli storage core device list | grep -E "^t10|Is SSD"

# Check vSAN eligible disks
vdq -q

# List vSAN storage
esxcli vsan storage list

# Check disk partitions
partedUtil getptbl /vmfs/devices/disks/<device-name>

5.2 VMX File Configuration

Problem Description

Virtual disks in VMware Workstation need the virtualSSD flag set in the VMX file to be recognized as SSDs by nested ESXi.

Solution: Edit VMX Files

Location: Edit each ESXi VM's .vmx file in VMware Workstation

Required Lines to Add:

For esxi01.vmx:

sata0:0.virtualSSD = 1
sata0:2.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1

For esxi02.vmx, esxi03.vmx, esxi04.vmx:

sata0:0.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1

Procedure

  1. Shut down all ESXi VMs in VMware Workstation
  2. Navigate to each VM's folder
  3. Open the .vmx file in a text editor
  4. Add the virtualSSD lines (location doesn't matter, bottom is fine)
  5. Save the file
  6. Power on the ESXi VMs
  7. Verify SSD detection:
    esxcli storage core device list | grep -E "^t10|Is SSD"
    
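Steps 3-4 can be scripted so the flags are appended only when missing (a hypothetical helper; the device IDs follow the tables above and the flag syntax matches the lines shown):

```python
from pathlib import Path

def add_virtual_ssd_flags(vmx_path, devices):
    """Append '<device>.virtualSSD = 1' for each device not already flagged."""
    path = Path(vmx_path)
    lines = path.read_text().splitlines()
    # Collect existing keys so a second run is a no-op
    existing = {line.split("=", 1)[0].strip() for line in lines if "=" in line}
    added = []
    for dev in devices:
        key = f"{dev}.virtualSSD"
        if key not in existing:
            lines.append(f"{key} = 1")
            added.append(key)
    path.write_text("\n".join(lines) + "\n")
    return added
```

Run it once per powered-off VM, e.g. add_virtual_ssd_flags("esxi01.vmx", ["sata0:0", "sata0:2", "sata0:3", "sata0:4"]); a repeat run returns an empty list because the keys already exist.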

5.3 Cleaning Up Previous vSAN Configuration

Problem Description

Disks with existing vSAN partitions from previous deployments are marked as "Ineligible for use by VSAN" with reason "Has partitions" or "Disk in use by disk group".

Diagnostic Commands

# Check vSAN eligibility
vdq -q

# Check existing vSAN storage
esxcli vsan storage list

# Check partition table
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001

Cleanup Procedure (Run on Each ESXi Host)

# Step 1: Remove existing vSAN disk group (if exists)
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001

# Step 2: Delete partitions from vSAN cache disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2

# Step 3: Delete partitions from vSAN capacity disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2

# Step 4: Verify disks are now eligible
vdq -q

Expected Output After Cleanup

{
    "Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
    "State": "Eligible for use by VSAN",
    "Reason": "None",
    "IsSSD": "1"
}
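The vdq -q output can be filtered programmatically to list only the disks that still need cleanup. A sketch, assuming the output parses as JSON (real vdq output may need minor cleanup such as trailing commas; the device names below are illustrative):

```python
import json

def ineligible_disks(vdq_output):
    """Return (name, reason) for every disk not eligible for vSAN."""
    disks = json.loads(vdq_output)
    if isinstance(disks, dict):  # single-object output
        disks = [disks]
    return [(d["Name"], d["Reason"]) for d in disks
            if d.get("State") != "Eligible for use by VSAN"]

# Illustrative sample in the shape shown above
SAMPLE = """[
  {"Name": "disk-cache-01", "State": "Eligible for use by VSAN",
   "Reason": "None", "IsSSD": "1"},
  {"Name": "disk-cap-01", "State": "Ineligible for use by VSAN",
   "Reason": "Has partitions", "IsSSD": "1"}
]"""
```

ineligible_disks(SAMPLE) returns only the disk with remaining partitions, i.e. the one to target with partedUtil.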

6. Network Configuration for Nested VCF

6.1 VLAN Configuration

Problem Description

VCF requires separate subnets for different network types. Using the same subnet/gateway for multiple networks causes validation errors:

Gateway 192.168.1.1 is duplicated across networks
Subnet 192.168.1.0/24 is duplicated across networks

Solution: Use VLANs with Separate Subnets

Network VLAN ID Subnet Gateway
Management 0 192.168.1.0/24 192.168.1.1
vMotion 100 192.168.100.0/24 192.168.100.1
vSAN 200 192.168.200.0/24 192.168.200.1
NSX TEP 300 192.168.250.0/24 192.168.250.1

Note: For the vMotion, vSAN, and NSX TEP networks in nested environments, the gateway IPs do not need to exist, because these networks are isolated. VCF only requires the gateway field to be populated.
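The duplicate subnet/gateway rule can be pre-checked locally before submitting the spec (a sketch using the table above and the stdlib ipaddress module):

```python
import ipaddress

# Networks from the table above: name -> (subnet, gateway)
NETWORKS = {
    "management": ("192.168.1.0/24", "192.168.1.1"),
    "vmotion": ("192.168.100.0/24", "192.168.100.1"),
    "vsan": ("192.168.200.0/24", "192.168.200.1"),
    "nsx_tep": ("192.168.250.0/24", "192.168.250.1"),
}

def find_duplicates(networks):
    """Flag duplicated subnets/gateways and gateways outside their own subnet."""
    errors = []
    seen_subnets, seen_gateways = set(), set()
    for name, (subnet, gateway) in networks.items():
        net = ipaddress.ip_network(subnet)
        gw = ipaddress.ip_address(gateway)
        if gw not in net:
            errors.append(f"Gateway {gateway} is outside subnet {subnet} ({name})")
        if net in seen_subnets:
            errors.append(f"Subnet {subnet} is duplicated across networks")
        if gw in seen_gateways:
            errors.append(f"Gateway {gateway} is duplicated across networks")
        seen_subnets.add(net)
        seen_gateways.add(gw)
    return errors
```

find_duplicates(NETWORKS) returns an empty list for the VLAN layout above; reusing a subnet or gateway reproduces the validation errors quoted earlier.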

6.2 MTU Settings

Critical for Nested Environments

Do NOT use jumbo frames (MTU 9000) in nested VMware Workstation environments.

Component Recommended MTU
Distributed Switch 1500-1600
ESX Management 1500
vMotion 1500
vSAN 1500
NSX TEP 1500

6.3 IP Addressing

VCF Network IP Ranges

Network IP Range Purpose
Management 192.168.1.x Static assignments per hosts file
vMotion 192.168.100.10-20 Automatic assignment by VCF
vSAN 192.168.200.206-216 Automatic assignment by VCF
NSX TEP 192.168.250.10-25 TEP IP Pool

7. VCF Component Configuration

7.1 VCF Automation

IP Address Configuration Issue

Problem: VCF detects a duplicate IP when the cluster FQDN resolves to the same IP as the node IP.

Error:

IP address 192.168.1.90 for product VCF Automation is already resolved from an FQDN in the input specification

Solution: Use different IPs for cluster FQDN and node IP:

Field Value
Cluster hostname/FQDN automation.lab.local (192.168.1.90)
Node IP 1 192.168.1.91 (automation-node1.lab.local)
Additional IP for upgrades 192.168.1.81
Node name prefix automation
Internal Cluster CIDR 198.18.0.0/15

7.2 NSX Manager

Configuration

Field Value
Appliance Size Medium
Appliance FQDN nsx-node1.lab.local
Virtual IP (VIP) FQDN nsx-vip.lab.local

7.2.1 NSX Certificate SAN Failure (VDT)

Problem: VDT reports "SAN contains neither hostname nor IP" for NSX VIP and NSX Manager certificates. The default NSX self-signed certificate uses a wildcard SAN (*.lab.local) without specific hostnames or IPs.

Solution: Generate a new self-signed certificate with explicit SAN entries and apply via NSX API.

Step 1: Create OpenSSL config on NSX Manager (SSH as root):

cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no

[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local

[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names

[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF

Important: DNS.3 = nsx-manager.lab.local is required because SDDC Manager registers NSX using this FQDN. Without it, VDT reports "SAN contains IP but not hostname".

Step 2: Generate cert, build JSON, import, and apply:

# Generate cert
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout /tmp/nsx.key -out /tmp/nsx.crt -config /tmp/nsx-cert.conf -sha256

# Build JSON payload (avoids shell escaping issues with PEM newlines)
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json

# Import cert (single-line — NSX shell doesn't support backslash continuation)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Note the certificate ID from the response

# Get node UUID
curl -k -u admin:'<password>' https://192.168.1.71/api/v1/cluster
# Note the node UUID

# Apply to node (API service)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=API&node_id=<node-uuid>"

# Apply to VIP (MGMT_CLUSTER)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=MGMT_CLUSTER"

Prerequisite: All NSX services must be healthy (MANAGER, SEARCH, UI, NODE_MGMT all UP). If services are DOWN, the API returns error 101. Wait 10-15 minutes after NSX restart in nested environments.

7.2.2 NSX Certificate Trust Failure (VDT)

Problem: After replacing the NSX self-signed certificate, VDT reports "NSX VIP Cert Trust: FAIL" and "NSX Manager Cert Trust: FAIL". The new self-signed cert's root is not in SDDC Manager's keystores (the original cert was pre-trusted during bringup).

Solution: Import the NSX certificate into both SDDC Manager trust stores.

Step 1: Pull the NSX certificate (SSH to SDDC Manager as root):

openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt

# Verify it's the correct cert
openssl x509 -in /tmp/nsx-root.crt -noout -text | grep -A2 "Subject Alternative Name"

Step 2: Import into VCF trust store:

KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
  -keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
  -storepass "$KEY" -noprompt

Step 3: Import into Java cacerts:

keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
  -keystore /etc/alternatives/jre/lib/security/cacerts \
  -storepass changeit -noprompt

Step 4: Restart SDDC Manager services (~5 minutes):

/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh

Key paths:

Item Path/Value
VCF trust store /etc/vmware/vcf/commonsvcs/trusted_certificates.store
VCF trust store password Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key
Java cacerts /etc/alternatives/jre/lib/security/cacerts
Java cacerts password changeit
Service restart script /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh

Reference: https://knowledge.broadcom.com/external/article/316056

Note on SDDC Manager SSH: Only the vcf user can SSH in (root and admin are rejected). Use su - from the vcf session to get root access. SCP does not work due to the restricted shell; use ssh vcf@host "cat > file" < localfile for file transfers.

7.3 vCenter Server

Configuration

Field Value
Appliance FQDN vcenter.lab.local
Appliance Size Tiny
Datacenter Name mgmt-dc01
Cluster Name mgmt-cl01
SSO Domain Name vsphere.local

7.4 vCenter Deployment Issues

7.4.1 SSH Requirement for vLCM Host Seeding

Problem Description

VCF vLCM (vSphere Lifecycle Manager) requires SSH access to ESXi hosts during vCenter deployment for host seeding. If SSH is disabled, the deployment fails with:

vCenter installation failed. Check logs under /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX for more details

Symptoms in Logs

Extraction of image from host esxi01.lab.local failed

Root Cause

SSH service is stopped or disabled on ESXi hosts. VCF needs SSH to extract ESXi image metadata for vLCM host seeding.

Solution: Enable SSH on All ESXi Hosts

Run on each ESXi host BEFORE starting VCF deployment:

# Enable SSH service
vim-cmd hostsvc/enable_ssh

# Start SSH service
vim-cmd hostsvc/start_ssh

# Verify SSH is running
vim-cmd hostsvc/runtimeinfo | grep ssh

Alternative Method (from ESXi Shell)

# Enable and start SSH
esxcli system ssh set --enable=true

# Verify SSH status
esxcli system ssh get

Note: SSH can be disabled after successful VCF deployment for security.

7.4.2 Monitoring vCenter Deployment Progress

VCF Installer Log Monitoring

Watch deployment progress from VCF Installer:

# Find the latest ci-installer log directory
ls -lt /var/log/vmware/vcf/domainmanager/ | head -5

# Watch the installation log
tail -f /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log

# Search for errors
grep -i "error\|failed\|exception" /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
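The same keyword search can be wrapped in a small script that records line numbers for easier cross-referencing (a sketch; the pattern mirrors the grep above, and the log path is passed in by the caller):

```python
import re
from pathlib import Path

# Same failure keywords as the grep command above
ERROR_RE = re.compile(r"error|failed|exception", re.IGNORECASE)

def scan_log(path):
    """Return (line_number, text) pairs for lines matching failure keywords."""
    hits = []
    text = Path(path).read_text(errors="replace")
    for n, line in enumerate(text.splitlines(), 1):
        if ERROR_RE.search(line):
            hits.append((n, line.strip()))
    return hits
```

Point scan_log() at the latest ci-installer log directory found with the ls command above.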

vCenter VM Direct Monitoring

SSH to the vCenter VM during deployment (default password: vmware):

# Watch firstboot progress
tail -f /var/log/firstboot/firstbootStatus.json

# Watch detailed installation
tail -f /var/log/vmware/firstboot/installer.log

# Check VMware services
vmon-cli --list

# Check specific service status
vmon-cli --status <service-name>

Expected Deployment Stages

  1. vCenter VM deployment (OVA extraction)
  2. First boot - basic configuration
  3. Installing containers (60% mark)
  4. Database initialization
  5. Service startup
  6. vCenter registration with VCF

7.4.3 Diagnosing Stuck Deployments

Symptoms

  1. Deployment progress stalls for an extended period (commonly at "Installing Containers", 60%)
  2. Little or no disk I/O activity on the vCenter VM

Diagnostic Commands (SSH to vCenter VM)

# Check current deployment status
cat /var/log/firstboot/firstbootStatus.json

# Check for running processes
ps aux | grep -E "install|firstboot|postgres|vpxd"

# Check disk I/O (should show activity)
vmstat 1 5

# Check memory usage
free -h

# Check for error logs
tail -50 /var/log/vmware/firstboot/installer.log
grep -i "error\|fail\|exception" /var/log/vmware/firstboot/*.log

PostgreSQL Database Issues

If deployment is stuck at "Installing Containers" (60%), check postgres:

# Check if postgres service exists
ls -la /storage/db/vpostgres/

# Check for postgres config file
ls -la /storage/db/vpostgres/postgresql.conf

# Check postgres user/group
grep postgres /etc/passwd
grep postgres /etc/group

# Check postgres logs
tail -50 /var/log/vmware/vpostgres/*.log

If PostgreSQL Never Initialized

A missing /storage/db/vpostgres/postgresql.conf combined with a missing postgres user indicates that database initialization failed. This is typically unrecoverable and requires a full redeployment.

Service Startup Issues

# List all VMware services and status
vmon-cli --list

# Check rhttpproxy (reverse proxy)
systemctl status rhttpproxy
tail -50 /var/log/vmware/rhttpproxy/rhttpproxy.log

# Check vpostgres
systemctl status vmware-vpostgres
tail -50 /var/log/vmware/vpostgres/postgresql*.log

7.4.4 vCenter Deployment Failure Reference Tokens

When vCenter deployment fails, VCF provides a reference token. To find detailed errors:

# Search for reference token in logs
grep -r "REFERENCE_TOKEN" /var/log/vmware/vcf/

# Example: Reference Token 3OHCKD
grep -r "3OHCKD" /var/log/vmware/vcf/
grep -B20 -A20 "3OHCKD" /var/log/vmware/vcf/domainmanager/*.log

7.5 Failed Deployment Recovery

7.5.1 Overview

VCF does not provide a rollback mechanism for failed management domain deployments. A failed deployment requires manual cleanup of:

  1. vCenter VM (delete from ESXi)
  2. vSAN disk groups and partitions
  3. VDS (Virtual Distributed Switch) configuration
  4. Depot connection in VCF UI

7.5.2 Complete Cleanup Procedure

Step 1: Delete Failed vCenter VM

From any ESXi host (or the one hosting the failed vCenter):

# List all VMs
vim-cmd vmsvc/getallvms

# Find the vCenter VM ID (look for vcenter.lab.local or similar)
# Power off the VM if running
vim-cmd vmsvc/power.off <vmid>

# Unregister the VM
vim-cmd vmsvc/unregister <vmid>

# Delete the VM files from datastore (if needed)
rm -rf /vmfs/volumes/<datastore>/vcenter.lab.local/

Step 2: Clean Up VDS (Distributed Switch)

From ESXi hosts, remove VDS configuration:

# List current virtual switches
esxcli network vswitch dvs vmware list

# Remove host from VDS (if configured)
# This is typically done from vCenter, but if vCenter is gone:
# Remove vmkernel ports from VDS
esxcli network ip interface remove -i vmk1  # vMotion
esxcli network ip interface remove -i vmk2  # vSAN

# Remove VDS uplink
esxcli network vswitch dvs vmware list

Step 3: Clean Up vSAN Configuration

Run on EACH ESXi host:

# List current vSAN storage
esxcli vsan storage list

# Remove vSAN disk groups
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001

# If remove fails, check disk state
vdq -q

# Delete partitions from cache disk (example device name)
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2

# Delete partitions from capacity disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2

# Verify disks are now eligible
vdq -q

Common vSAN Cleanup Error

If you see: cache disk/s are in an invalid state...available size is 0.0 GB

This means disks still have partitions. Use partedUtil to delete them.

Step 4: Remove Depot Connection (VCF UI)

  1. Log into VCF Installer UI (https://192.168.1.240:8443)
  2. Navigate to Settings or Configuration
  3. Remove the existing offline depot connection
  4. Re-add the depot connection with certificate

Step 5: Verify Hosts Are Ready

On each ESXi host, verify:

# Check hostname is correct
esxcli system hostname get

# Check certificate is valid
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"

# Check SSH is enabled
vim-cmd hostsvc/runtimeinfo | grep ssh

# Check vSAN disks are eligible
vdq -q

# Check no VDS remnants
esxcli network vswitch dvs vmware list

Step 6: Restart VCF Services (Optional)

On the VCF Installer, restart services for a clean state:

systemctl restart lcm
systemctl restart domainmanager

# Wait for services to start
sleep 120

# Verify services are running
systemctl status lcm
systemctl status domainmanager
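
Rather than a fixed sleep 120, a small polling helper waits only as long as needed. This is a sketch; the unit names in the usage comment are the ones this handbook uses on the VCF Installer:

```shell
#!/bin/sh
# Poll a command until it succeeds, up to tries * delay seconds.
# Example on the VCF Installer:
#   wait_for "systemctl is-active --quiet lcm" 24 5
#   wait_for "systemctl is-active --quiet domainmanager" 24 5
wait_for() {
  cmd="$1"; tries="${2:-24}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if eval "$cmd"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```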

7.5.3 Troubleshooting Flowchart: Failed Deployment Recovery

START: VCF Deployment Failed
  │
  ├─→ Note reference token from error message
  │     └─→ Search logs: grep -r "TOKEN" /var/log/vmware/vcf/
  │
  ├─→ Delete failed vCenter VM
  │     ├─→ vim-cmd vmsvc/getallvms
  │     ├─→ vim-cmd vmsvc/power.off <vmid>
  │     └─→ vim-cmd vmsvc/unregister <vmid>
  │
  ├─→ Clean up vSAN on EACH host
  │     ├─→ esxcli vsan storage remove -d <device>
  │     ├─→ partedUtil delete ... (both partitions)
  │     └─→ vdq -q (verify eligible)
  │
  ├─→ Clean up VDS (if configured)
  │     └─→ esxcli network ip interface remove ...
  │
  ├─→ Remove depot connection in VCF UI
  │     └─→ Re-add with certificate
  │
  ├─→ Verify SSH enabled on all hosts
  │     └─→ vim-cmd hostsvc/enable_ssh
  │
  └─→ Retry deployment

8. Command Reference

8.1 VCF Installer Commands

Service Management

# Check LCM service status
systemctl status lcm

# Restart LCM service
systemctl restart lcm

# Check Domain Manager status
systemctl status domainmanager

# Restart Domain Manager
systemctl restart domainmanager

# View LCM logs
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log

# Search LCM logs for errors
grep -i "error\|fatal\|exception" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20

Certificate Management

# Download certificate from remote server
openssl s_client -connect <IP>:<PORT> </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/cert.crt

# View certificate details
openssl x509 -in /tmp/cert.crt -text -noout

# Get certificate fingerprint
openssl x509 -in /tmp/cert.crt -noout -fingerprint -sha256

# Import certificate to Java truststore
keytool -import -trustcacerts -alias <alias> -file /tmp/cert.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt

# Delete certificate from truststore
keytool -delete -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# List certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# View specific certificate in truststore
keytool -list -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v
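
When verifying fingerprints, openssl prints SHA256 as colon-separated uppercase hex while other tools may format it differently. A small helper that normalizes case and colons makes the comparison reliable; this is a sketch assuming only that openssl is on the PATH:

```shell
#!/bin/sh
# fp: print a normalized (lowercase, no colons) SHA-256 fingerprint of a PEM cert.
fp() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 \
    | cut -d= -f2 | tr -d ':' | tr 'A-F' 'a-f'
}

# check_fp: compare a cert file against an expected fingerprint in any format.
check_fp() {
  want=$(echo "$2" | tr -d ':' | tr 'A-F' 'a-f')
  [ "$(fp "$1")" = "$want" ]
}
```

Example: check_fp /tmp/cert.crt "AB:CD:..." exits 0 on a match, 1 on a mismatch.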

Network Diagnostics

# Test connectivity
ping <IP>

# Test SSL connection
openssl s_client -connect <IP>:<PORT>

# Test with specific TLS version
openssl s_client -connect <IP>:<PORT> -tls1_2

# Test HTTP endpoint
curl -v -k -u admin:admin https://<IP>:<PORT>/path

8.2 ESXi Host Commands

System Information

# Get hostname
esxcli system hostname get

# Set hostname
esxcli system hostname set --fqdn=<FQDN>

# Get system version
esxcli system version get

Certificate Management

# View current certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout

# View certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"

# Backup certificates
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak

# Regenerate certificates
/sbin/generate-certificates

# Restart services after certificate change
services.sh restart

Storage Management

# List all storage devices
esxcli storage core device list

# List devices with SSD status
esxcli storage core device list | grep -E "^t10|^naa|Is SSD"

# Check vSAN eligible disks
vdq -q

# List vSAN storage
esxcli vsan storage list

# Get partition table
partedUtil getptbl /vmfs/devices/disks/<device>

# Delete partition
partedUtil delete /vmfs/devices/disks/<device> <partition-number>

# Remove vSAN disk group
esxcli vsan storage remove -d <device>

# Add SATP rule for SSD
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d <device> -o enable_ssd

# Reclaim device after SATP rule
esxcli storage core claiming reclaim -d <device>

# List SATP rules
esxcli storage nmp satp rule list | grep enable_ssd

Maintenance Mode

# Enter maintenance mode
esxcli system maintenanceMode set -e true -m noAction

# Exit maintenance mode
esxcli system maintenanceMode set -e false

# Check maintenance mode status
esxcli system maintenanceMode get

8.3 Windows Depot Server Commands

PowerShell Commands

# Navigate to depot directory
cd C:\VCF-DEPOT

# Generate certificate
python generate_cert.py

# Start HTTPS server
python https_server.py

# Install Python cryptography library
pip install cryptography
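
If you prefer not to use the generate_cert.py script, the same certificate can be produced with openssl alone. The key requirement is that the SAN carries the depot's IP, since the lab depot has no DNS name. A sketch (requires OpenSSL 1.1.1+ for -addext; the IP and /tmp paths are this lab's values):

```shell
#!/bin/sh
# Self-signed depot certificate with an IP SAN (no DNS name available).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout /tmp/depot.key -out /tmp/depot.crt \
  -subj "/CN=192.168.1.160" \
  -addext "subjectAltName=IP:192.168.1.160"

# Confirm the SAN made it into the certificate
openssl x509 -in /tmp/depot.crt -noout -text | grep -A1 "Subject Alternative Name"
```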

9. Troubleshooting Flowcharts

9.1 Offline Depot Connection Failure

START: "Secure protocol communication error"
  │
  ├─→ Test connectivity: ping <depot-ip>
  │     └─→ FAIL: Check network/firewall
  │
  ├─→ Test SSL: openssl s_client -connect <ip>:8443
  │     └─→ FAIL: Check depot server is running
  │
  ├─→ Check certificate: View cert details
  │     └─→ Wrong hostname: Regenerate certificate
  │
  ├─→ Import certificate to Java truststore
  │     └─→ keytool -import ...
  │
  ├─→ Verify fingerprints match
  │     └─→ MISMATCH: Re-import correct certificate
  │
  └─→ Restart LCM service
        └─→ Wait 2 minutes, retry connection

9.2 ESXi Certificate Mismatch

START: "Certificate doesn't match subject alternative names"
  │
  ├─→ Check current cert SAN
  │     └─→ openssl x509 -in /etc/vmware/ssl/rui.crt ...
  │
  ├─→ Set correct hostname
  │     └─→ esxcli system hostname set --fqdn=<FQDN>
  │
  ├─→ Backup old certificates
  │     └─→ mv rui.crt rui.crt.bak; mv rui.key rui.key.bak (in /etc/vmware/ssl/)
  │
  ├─→ Generate new certificates
  │     └─→ /sbin/generate-certificates
  │
  ├─→ Restart services
  │     └─→ services.sh restart
  │
  └─→ Update thumbprints in VCF
        └─→ Re-validate hosts in UI

9.3 vSAN SSD Detection Failure

START: "Found zero SSD devices for SSD cache tier"
  │
  ├─→ Check SSD status: esxcli storage core device list
  │     └─→ "Is SSD: false" → Continue
  │
  ├─→ Shut down ESXi VM in Workstation
  │
  ├─→ Edit VMX file
  │     └─→ Add: sata0:X.virtualSSD = 1
  │
  ├─→ Power on ESXi VM
  │
  ├─→ Verify SSD status
  │     └─→ Still false: Check VMX syntax
  │
  ├─→ Check vSAN eligibility: vdq -q
  │     └─→ "Has partitions" → Clean up partitions
  │
  └─→ Clean up old vSAN config
        ├─→ esxcli vsan storage remove -d <device>
        └─→ partedUtil delete ...

9.4 vCenter Deployment Stuck

START: vCenter deployment stuck at percentage
  │
  ├─→ Wait 30 minutes (large downloads may be slow)
  │
  ├─→ SSH to vCenter VM (ssh root@<vcenter-ip>)
  │     └─→ Default password: vmware
  │
  ├─→ Check firstboot status
  │     └─→ cat /var/log/firstboot/firstbootStatus.json
  │
  ├─→ Check for activity
  │     ├─→ vmstat 1 5 (disk I/O)
  │     └─→ tail -f /var/log/vmware/firstboot/installer.log
  │
  ├─→ If stuck at 60% "Installing Containers"
  │     ├─→ Check postgres: ls /storage/db/vpostgres/
  │     ├─→ Missing postgresql.conf → Database failed to init
  │     └─→ UNRECOVERABLE: Must redeploy
  │
  ├─→ Check services: vmon-cli --list
  │     └─→ Services not started → Check individual logs
  │
  └─→ If unrecoverable:
        ├─→ Delete vCenter VM
        ├─→ Clean up vSAN on all hosts
        ├─→ Reset depot connection
        └─→ Retry deployment

9.5 vLCM Host Seeding Failure

START: "Extraction of image from host failed"
  │
  ├─→ Check SSH status on ESXi host
  │     └─→ vim-cmd hostsvc/runtimeinfo | grep ssh
  │
  ├─→ SSH Disabled?
  │     ├─→ vim-cmd hostsvc/enable_ssh
  │     └─→ vim-cmd hostsvc/start_ssh
  │
  ├─→ Verify SSH on ALL hosts
  │     └─→ Repeat for esxi01, esxi02, esxi03, esxi04
  │
  └─→ Retry vCenter deployment

9.6 NSX Certificate Trust/SAN Failure (VDT)

START: VDT reports NSX cert FAIL (Trust or SAN)
  │
  ├─→ Check which check failed
  │     ├─→ SAN FAIL: Certificate missing hostnames/IPs
  │     └─→ Trust FAIL: Certificate root not in SDDC Manager keystores
  │
  ├─→ If SAN FAIL:
  │     ├─→ SSH to NSX Manager as root
  │     ├─→ Create OpenSSL config with all SANs:
  │     │     DNS.1 = nsx-vip.lab.local
  │     │     DNS.2 = nsx-node1.lab.local
  │     │     DNS.3 = nsx-manager.lab.local  ← SDDC Manager's registered FQDN
  │     │     IP.1 = 192.168.1.70 (VIP)
  │     │     IP.2 = 192.168.1.71 (node)
  │     ├─→ Generate cert: openssl req -x509 ...
  │     ├─→ Build JSON: python (avoid shell PEM escaping)
  │     ├─→ Import via API: POST /api/v1/trust-management/certificates?action=import
  │     ├─→ Apply to node: ?action=apply_certificate&service_type=API&node_id=<uuid>
  │     └─→ Apply to VIP: ?action=apply_certificate&service_type=MGMT_CLUSTER
  │
  ├─→ If Trust FAIL (after cert replacement):
  │     ├─→ SSH to SDDC Manager as vcf, then su - to root
  │     ├─→ Pull cert: openssl s_client ... > /tmp/nsx-root.crt
  │     ├─→ Import to VCF store: keytool -importcert ... trusted_certificates.store
  │     ├─→ Import to Java cacerts: keytool -importcert ... cacerts
  │     └─→ Restart services: sddcmanager_restart_services.sh
  │
  └─→ Re-run VDT after ~5 minutes
        └─→ Expected: NSX cert checks all PASS

9.7 VCF Operations Health Adapter "No Data Receiving" Status

START: Infrastructure Health Adapter shows "no data receiving"
  │
  ├─→ Check adapter log (VCF Ops 9.x path):
  │     tail -100 /storage/log/vcops/log/adapters/
  │       VMwareInfraHealthAdapter/VMwareInfraHealthAdapter_55.log
  │
  ├─→ If "Unable to fetch access token for the SDDC manager":
  │     ├─→ Test SDDC Manager auth from VCF Ops node:
  │     │     curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
  │     │       -H "Content-Type: application/json" \
  │     │       -d '{"username":"administrator@vsphere.local","password":"..."}'
  │     ├─→ If token returned: credential issue in adapter
  │     │     ├─→ UI → Administration → Integrations → SDDC Manager
  │     │     ├─→ If System Managed Credential enabled → click ROTATE
  │     │     ├─→ If Credential dropdown empty → uncheck System Managed,
  │     │     │     click +, create credential, select it
  │     │     ├─→ Click VALIDATE CONNECTION → confirm "valid"
  │     │     ├─→ Click SAVE
  │     │     └─→ Reboot VCF Ops appliance (adapter may not pick up
  │     │           new credential without full restart)
  │     └─→ If connection refused: check DNS/network/cert trust
  │
  ├─→ If "PKIX path building failed" for NSX:
  │     ├─→ If NSX is powered off → expected, ignore
  │     └─→ If NSX is running → See flowchart 9.8 below
  │
  ├─→ If "vROPs is not configured with NTP server":
  │     └─→ Configure NTP on VCF Ops appliance (cosmetic warning,
  │           does not block data collection)
  │
  └─→ After fix: wait 10 min for 2 collection cycles
        └─→ Verify: adapter status changes to "Collecting" (green)

Key paths on VCF Operations 9.x

Item Path
Adapter logs /storage/log/vcops/log/adapters/<AdapterName>/
Main vcops logs /storage/log/vcops/log/
Collector GC log /storage/log/vcops/log/collector-gc-*.log
VCF Adapter log /storage/log/vcops/log/adapters/VcfAdapter/VcfAdapter_254.log
VMware Adapter log /storage/log/vcops/log/adapters/VMwareAdapter/VMwareAdapter_63.log
vSAN Adapter log /storage/log/vcops/log/adapters/VsanStorageAdapter/VsanStorageAdapter_257.log

Note: VCF Operations 9.x does NOT use /var/log/vmware/vcops/adapters/ — that path from older Aria Operations versions no longer exists. All adapter logs are under /storage/log/vcops/log/adapters/.

9.8 VCF Operations NSX Integration — PKIX / Connection Failure

START: NSX adapter shows Warning, logs show "PKIX path building failed"
  │
  ├─→ Verify NSX is actually reachable from VCF Ops node:
  │     curl -sk https://nsx-vip.lab.local/api/v1/node/status | head -5
  │     ├─→ "No route to host" → NSX not ready (check load avg below)
  │     └─→ Returns JSON → NSX is up, proceed to cert fix
  │
  ├─→ If VIP (.70) unreachable but node (.71) responds:
  │     ├─→ NSX cluster VIP not online yet
  │     ├─→ Check load: curl -sk -u admin:'<pass>'
  │     │     https://<node>:443/api/v1/node/status | grep load_average
  │     ├─→ Load > 20 on 6 cores = still booting (normal in nested,
  │     │     can take 30-60 min after power-on)
  │     └─→ Wait for load < 20, VIP will come online automatically
  │
  ├─→ Import NSX cert into VCF Ops Java truststore:
  │     ├─→ openssl s_client -connect nsx-vip.lab.local:443 \
  │     │     -showcerts </dev/null 2>/dev/null \
  │     │     | openssl x509 -outform PEM > /tmp/nsx-cert.pem
  │     ├─→ Find truststore: java -XshowSettings:properties 2>&1
  │     │     | grep java.home
  │     │     → /usr/java/jre-vmware-17
  │     ├─→ keytool -importcert -alias nsx-vip \
  │     │     -file /tmp/nsx-cert.pem \
  │     │     -keystore /usr/java/jre-vmware-17/lib/security/cacerts \
  │     │     -storepass changeit -noprompt
  │     └─→ Reboot VCF Ops appliance
  │
  ├─→ Fix NSX credential (two adapters to check):
  │     │
  │     ├─→ VCF section → nsx-vip.lab.local adapter:
  │     │     ├─→ System Managed Credential ROTATE rarely works for NSX
  │     │     ├─→ Uncheck System Managed Credential
  │     │     ├─→ Click + → create credential (admin / password)
  │     │     ├─→ Select credential, VALIDATE CONNECTION, SAVE
  │     │     └─→ If VIP still unreachable, wait for NSX load to settle
  │     │
  │     └─→ NSX section → Aria Admin adapter:
  │           ├─→ Points to nsx-manager.lab.local (node FQDN)
  │           ├─→ May connect even when VIP is down
  │           ├─→ Set credential (admin / password)
  │           ├─→ VALIDATE CONNECTION → SAVE
  │           └─→ This can start collecting before VIP comes online
  │
  └─→ Verify: All adapters show "Collecting" (green) in
        Administration → Integrations

Key insight: VCF Operations has TWO separate NSX adapters — one under the VCF Cloud Foundation account (uses VIP) and one under the standalone NSX section called "Aria Admin" (uses node FQDN). Both need valid credentials. The Aria Admin adapter can connect via the node FQDN even when the VIP is still offline after a fresh NSX boot.

Java truststore path on VCF Ops 9.x: /usr/java/jre-vmware-17/lib/security/cacerts (password: changeit). The legacy /usr/java/jre-vmware/ path does not exist.


10. SDDC Manager Credential Cascade Failure

Symptoms:

  - Credential update, rotate, or remediate operations fail from the SDDC Manager or VCF Operations UI
  - Task errors include "not in ACTIVE state", "Unable to acquire resource level lock(s)", or "503 Service Unavailable"
  - Each retry leaves another task stuck IN_PROGRESS

Root Cause Chain: A failed credential operation (often due to NSX being temporarily unreachable during a boot storm or maintenance) triggers a cascade:

  1. NSX cluster resource gets stuck in ACTIVATING or ERROR state in platform.nsxt table
  2. Stale exclusive locks remain in platform.lock table, blocking all new operations
  3. Failed tasks remain as IN_PROGRESS in platform.task_metadata (resolved=false), piling up
  4. Each retry from the UI creates more stuck tasks and locks
  5. Even after NSX recovers, SDDC Manager won't attempt the operation because the status check fails prevalidation

Diagnosis:

# 1. Get auth token from SDDC Manager
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
  -H "Content-Type: application/json" \
  -d '{"username":"administrator@vsphere.local","password":"<password>"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")

# 2. Check NSX cluster resource state (look for status field)
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# If status is "ACTIVATING" or "ERROR" instead of "ACTIVE" → this is the problem

# 3. Check for stale resource locks
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# Stale locks from failed operations will block all new operations

# 4. Check for stuck IN_PROGRESS tasks
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
  -H "Authorization: Bearer $TOKEN" | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(f'Stuck tasks: {len(d.get(\"elements\",[]))}')"

# 5. Verify NSX is actually healthy (from SDDC Manager)
curl -sk -u admin:'<password>' --connect-timeout 10 \
  https://nsx-vip.lab.local/api/v1/cluster/status
# overall_status should be "STABLE"
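
The API responses above can also be parsed offline from saved output. A sketch of a small helper that counts stuck tasks in a saved /v1/tasks?status=IN_PROGRESS response, using python3 (present on SDDC Manager):

```shell
#!/bin/sh
# count_stuck: print the number of tasks in a saved /v1/tasks JSON response.
count_stuck() {
  python3 -c 'import sys,json; print(len(json.load(sys.stdin).get("elements",[])))' < "$1"
}
```

Usage: curl the endpoint to a file once, then run count_stuck /tmp/tasks.json before and after the repair to confirm the count drops to zero.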

Fix — Full Database Repair:

WARNING: Direct database manipulation is unsupported and should only be done in lab environments. Always back up before modifying.

Step 1: Access PostgreSQL on SDDC Manager

SSH to SDDC Manager as vcf, then su - to root. PostgreSQL uses TCP on 127.0.0.1 (not Unix sockets), and the password may not be easily discoverable. Disable the psql pager to prevent --More-- prompts from corrupting interactive shell sessions:

# Back up pg_hba.conf
cp /data/pgdata/pg_hba.conf /data/pgdata/pg_hba.conf.bak

# Temporarily allow passwordless local connections
sed -i 's/scram-sha-256/trust/g' /data/pgdata/pg_hba.conf

# Reload postgres (no restart needed)
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"

# Disable psql pager (CRITICAL for scripted/remote sessions)
export PAGER=cat
export PGPAGER=cat

Step 2: Fix the stuck resource status

The nsxt table status can be ACTIVATING, ERROR, or other non-ACTIVE values:

# Check current NSX resource status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -t -c \"SELECT id, status FROM nsxt;\""

# Fix ANY non-ACTIVE status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';\""

Step 3: Clear stale resource locks

su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT count(*) FROM lock;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM lock;\""

Step 4: Mark stuck tasks as resolved

The task_metadata table in the platform DB tracks task resolution state. Unresolved tasks (resolved=false) from failed operations accumulate and can interfere with new operations:

# Check unresolved task count
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT resolved, count(*) FROM task_metadata GROUP BY resolved;\""

# Mark all unresolved tasks as resolved
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE task_metadata SET resolved = true WHERE resolved = false;\""

# Clear task_lock table if any entries exist
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM task_lock;\""

Step 5: Restore pg_hba.conf (CRITICAL — do not skip)

cp /data/pgdata/pg_hba.conf.bak /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"

# Verify it's back to scram-sha-256
grep -c 'scram-sha-256' /data/pgdata/pg_hba.conf
# Should return 4 or more

Step 6: Restart operationsmanager service

systemctl restart operationsmanager
# Wait 2-3 minutes for it to fully start
systemctl is-active operationsmanager

Verification:

# NSX cluster should now show ACTIVE
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
  -H "Authorization: Bearer $TOKEN" | python3 -c \
  "import sys,json; [print(f'{c[\"id\"]}: {c[\"status\"]}') for c in json.load(sys.stdin).get('elements',[])]"

# Resource locks should be empty
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
  -H "Authorization: Bearer $TOKEN"

# IN_PROGRESS tasks should be zero or minimal
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
  -H "Authorization: Bearer $TOKEN" | python3 -c \
  "import sys,json; print(f'IN_PROGRESS: {len(json.load(sys.stdin).get(\"elements\",[]))}')"

# Credential remediate should now succeed via VCF Operations Fleet Management UI

Credential Cascade Failure Flowchart:
┌──────────────────────────────────────────────┐
│ Credential Update/Rotate/Remediate fails     │
│ in SDDC Manager or VCF Operations UI         │
└──────────────────┬───────────────────────────┘
                   │
          ┌────────▼────────┐
          │ Check task error │
          └────────┬────────┘
                   │
    ┌──────────────┼──────────────┐
    │              │              │
    ▼              ▼              ▼
"not in        "Unable to     "503 Service
ACTIVE state"  acquire lock"  Unavailable"
    │              │              │
    ▼              ▼              ▼
Fix nsxt       Delete from    NSX still
table status   lock table     booting/
(ACTIVATING/   in platform    unstable
ERROR→ACTIVE)  DB             │
    │              │           ▼
    │              │        Wait for
    │              │        NSX load
    │              │        to settle
    │              │        (< 20)
    └──────┬───────┘          │
           ▼                  │
    Mark task_metadata        │
    resolved = true    ◄──────┘
           │
           ▼
    Clear task_lock
           │
           ▼
    Restart
    operationsmanager
           │
           ▼
    Retry credential
    operation

Key insight: Three tables in the platform database must be cleaned: (1) nsxt — resource status, (2) lock — operation locks, (3) task_metadata — task resolution tracking. The operationsmanager database has separate task and execution tables (columns: task.state, execution.execution_status — not status). The API won't let you cancel or delete stuck tasks — database repair is required.

psql pager trap: When running psql queries via Paramiko or remote shell, the default pager (less/more) captures output and waits for interactive input, corrupting the session. Always set PAGER=cat before running psql commands, or pass it inline: PAGER=cat psql -h 127.0.0.1 -d platform -c "...". For Paramiko invoke_shell(), also set height=1000 to prevent terminal-based paging.

PostgreSQL on SDDC Manager: Uses TCP on 127.0.0.1 (not Unix sockets — you'll get "No such file or directory" without -h 127.0.0.1). Data directory is /data/pgdata. Key databases: platform (nsxt, lock, task_metadata tables), operationsmanager (task, execution, processing_task tables). The pg_hba.conf trust workaround is a last resort — always restore the original immediately after.

vcf account lockout: Failed SSH attempts (including from automated scripts) can lock the vcf account. SDDC Manager uses faillock (not pam_tally2). Unlock from console as root: faillock --user vcf --reset

Discovery Process — How the Schema Was Mapped

None of this database repair procedure is documented by Broadcom. The schema was mapped through the following investigation:

Why the API wasn't enough:

The VCF public API can list the stuck tasks and resource locks, but it offers no call to cancel or delete them, so the stuck state had to be repaired directly in PostgreSQL.

How the database was explored:

  1. Accessed PostgreSQL using the trust auth workaround (password not discoverable in config files)
  2. Listed all databases with \l: platform, operationsmanager, domainmanager, lcm, sddc_manager_ui, postgres
  3. Listed tables in platform DB with \dt: found nsxt, lock, task_metadata, task_lock (plus vcenter, host, etc.)
  4. Queried column definitions via SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '<table>'
  5. Discovered that task_metadata uses a resolved boolean (not a status field like you'd expect)
  6. Discovered that operationsmanager.task uses column state (not status) and execution uses execution_status (not status)
  7. Early script versions failed because they referenced task in the platform DB (wrong — it's task_metadata) and used status column on operationsmanager tables (wrong — it's state and execution_status)

Why each repair step is needed:

Step Table Action Why
2 nsxt Set status to ACTIVE The stuck ACTIVATING/ERROR status makes every new credential operation fail at prevalidation — SDDC Manager checks this before even attempting the operation
3 lock Delete all rows Stale exclusive locks from the failed operation block all new operations from acquiring their own locks — they'll fail with "Unable to acquire resource level lock(s)"
4 task_metadata Set resolved=true Unresolved tasks (resolved=false) accumulate with each UI retry. 47 were found during the initial diagnosis. These can interfere with new task scheduling
4 task_lock Delete all rows Links tasks to locks — clearing this ensures no orphaned task-lock relationships remain
5 pg_hba.conf Restore backup Leaving trust auth enabled means any local process can access PostgreSQL without a password — security risk
6 operationsmanager Restart service The service caches database state in memory. A restart forces it to re-read the cleaned tables and reset its internal state machine

Why each step matters in sequence:

Fix the nsxt status first so prevalidation passes; then release the locks so new operations can acquire their own; then mark the stuck tasks resolved so retries are not blocked; restore pg_hba.conf immediately to close the trust-auth window; and restart operationsmanager last so it re-reads the cleaned tables instead of its cached state.

Automated scripts built from this knowledge:

All scripts automate the pg_hba.conf backup/trust/restore cycle and use PAGER=cat to prevent pager traps in remote sessions.


11. SDDC Manager Storage Migration (Local → vSAN)

Phase 7 — Feb 10–11, 2026

Why SDDC Manager Starts on Local Storage

This is a chicken-and-egg bootstrap constraint in every VCF deployment:

  1. The VCF Installer OVA (which is the SDDC Manager appliance — same OVA, dual purpose) must be deployed before the bringup process runs
  2. vSAN does not exist yet at this point — vSAN is created during the bringup process when the VCF Installer orchestrates the deployment of vCenter, vSAN, and VDS across the ESXi hosts
  3. The only storage available before bringup is the local datastore on the ESXi host where you deploy the installer (esxi01-local in the lab)
  4. After bringup completes, the VCF Installer transforms into SDDC Manager — still sitting on local storage where it was originally deployed

This means SDDC Manager is always initially deployed to local storage and must be manually migrated to shared storage (vSAN) afterward.

The Problem

All six SDDC Manager disks were thick-provisioned on esxi01's local datastore: 914GB allocated for roughly 108GB of actual data, and vCenter's migration wizard cannot convert them to thin during a move to vSAN.

Disk analysis:

Disk Allocated Actual Used
sddc-manager.vmdk 32GB 2.6GB
sddc-manager_1.vmdk 16GB 2.6GB
sddc-manager_2.vmdk 240GB 3.0GB
sddc-manager_3.vmdk 512GB 99.5GB
sddc-manager_4.vmdk 26GB 30MB
sddc-manager_5.vmdk 88GB 64MB
Total 914GB ~108GB

Solution: vmkfstools Thin Clone

The vCenter migration wizard cannot thin-provision to vSAN. The workaround is to clone each disk individually as thin using vmkfstools directly on the ESXi host:

# SSH to esxi01 as root
# Clone each disk from local to vSAN as thin-provisioned
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmdk -d thin

vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_1.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_1.vmdk -d thin

vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_2.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_2.vmdk -d thin

vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin

vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_4.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_4.vmdk -d thin

vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_5.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_5.vmdk -d thin
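
The six clone commands above follow one pattern and can be expressed as a loop. This sketch prints the commands (dry run) so they can be reviewed before executing on esxi01; drop the echo to run them directly:

```shell
#!/bin/sh
# Dry-run: print the vmkfstools thin-clone command for each SDDC Manager disk.
SRC=/vmfs/volumes/esxi01-local/sddc-manager
DST=/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager
for suffix in "" _1 _2 _3 _4 _5; do
  echo vmkfstools -i "$SRC/sddc-manager$suffix.vmdk" \
    "$DST/sddc-manager$suffix.vmdk" -d thin
done
```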

After cloning:

  1. Power off SDDC Manager
  2. Remove from vCenter inventory
  3. Browse the vSAN datastore → navigate to the sddc-manager folder → right-click the .vmx file → Register VM
  4. Power on and verify all services start (systemctl status vcf-services)

Recovery from Mid-Clone Failure

The 512GB disk clone failed partway through (ESXi connection timeout on nested storage). Fix:

# Delete the partial clone
vmkfstools -U /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk

# Retry the clone
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
  /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin

Key lesson: vCenter's migration wizard cannot thin-provision to vSAN. Always use vmkfstools -i <src> <dst> -d thin per disk. This is also the only way to reclaim wasted space from thick-provisioned VCF appliances — SDDC Manager went from 914GB allocated to ~108GB actual on vSAN.


Document Information

Field Value
Document Title VCF 9.0.1 Nested Deployment Troubleshooting Handbook
Version 1.5
Last Updated February 2026
Environment VMware Workstation 17.x Nested Lab
VCF Version 9.0.1

This handbook is intended for lab and educational purposes. Always consult official VMware documentation for production deployments.

(c) 2026 Virtual Control LLC. All rights reserved.