Version: 1.5
Date: February 2026
Environment: VCF 9.0.1 Nested in VMware Workstation
Author: Virtual Control LLC
This handbook documents the complete troubleshooting process for deploying VMware Cloud Foundation 9.0.1 in a nested VMware Workstation environment. The deployment encountered several challenges, including TLS certificate validation failures against the offline depot, ESXi host certificates with incorrect hostnames, virtual disks not being detected as SSDs for vSAN, duplicated subnet/gateway validation errors, and vCenter deployment failures requiring full cleanup and redeployment.
This document provides step-by-step remediation procedures and recovery processes for nested VCF deployments.
| Component | FQDN | IP Address |
|---|---|---|
| VCF Installer/SDDC Manager | vcf-installer.lab.local | 192.168.1.240 |
| vCenter Server | vcenter.lab.local | 192.168.1.69 |
| NSX Manager VIP | nsx-vip.lab.local | 192.168.1.70 |
| NSX Manager Node 1 | nsx-node1.lab.local | 192.168.1.71 |
| ESXi Host 1 | esxi01.lab.local | 192.168.1.74 |
| ESXi Host 2 | esxi02.lab.local | 192.168.1.75 |
| ESXi Host 3 | esxi03.lab.local | 192.168.1.76 |
| ESXi Host 4 | esxi04.lab.local | 192.168.1.82 |
| VCF Operations | vcf-ops.lab.local | 192.168.1.77 |
| Fleet Management | fleet.lab.local | 192.168.1.78 |
| Collector | collector.lab.local | 192.168.1.79 |
| VCF Automation | automation.lab.local | 192.168.1.90 |
| Automation Node 1 | automation-node1.lab.local | 192.168.1.91 |
| Offline Depot Server | (IP only - no DNS) | 192.168.1.160 |
| DNS/NTP Server | (Windows Server) | 192.168.1.230 |
| Network | VLAN ID | Subnet | Gateway | IP Range |
|---|---|---|---|---|
| ESX Management | 0 | 192.168.1.0/24 | 192.168.1.1 | DHCP/Static |
| VM Management | 0 | 192.168.1.0/24 | 192.168.1.1 | Same as ESX Mgmt |
| vMotion | 100 | 192.168.100.0/24 | 192.168.100.1 | 192.168.100.10-20 |
| vSAN | 200 | 192.168.200.0/24 | 192.168.200.1 | 192.168.200.206-216 |
| NSX TEP | 300 | 192.168.250.0/24 | 192.168.250.1 | 192.168.250.10-25 |
| Resource | Value |
|---|---|
| vCPUs | 8 (4 cores x 2 sockets) |
| Memory | 48 GB |
| OS Disk | 32 GB |
| vSAN Cache Disk | 100 GB (SSD) |
| vSAN Capacity Disk | 800 GB (SSD) |
| Network Adapters | 4x vmxnet3 |
VCF 9.0.1 uses the BouncyCastle FIPS TLS implementation, which has strict certificate validation requirements. When connecting to an offline depot that presents a self-signed certificate, the connection fails with:
Secure protocol communication error, check logs for more details
TlsFatalAlert errors:org.bouncycastle.tls.TlsFatalAlert caught when processing request to {s}->https://192.168.1.160:8443
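The strictness can be illustrated with Python's stdlib ssl module, whose default client context enforces the same two checks (chain of trust plus hostname match) that the BouncyCastle client applies. This is an analogy for explanation only, not SDDC Manager's actual code path:

```python
import ssl

def strict_depot_context(ca_file=None):
    """A strict client context: the peer certificate must chain to a trusted
    root AND match the hostname, just like the BouncyCastle FIPS defaults.
    Passing ca_file plays the role of the Java cacerts import on SDDC Manager."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # depot must speak TLS 1.2+
    return ctx

ctx = strict_depot_context()
# Both checks are on by default; a self-signed depot cert fails the trust
# check unless its certificate is added to the trust store.
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_REQUIRED)
```

A handshake against the depot with this context fails exactly where the LCM service does; adding the depot certificate as `ca_file` makes it pass, which is what the keytool import in the remediation steps accomplishes on the Java side.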
# Test SSL connectivity
openssl s_client -connect 192.168.1.160:8443
# Test with TLS 1.2 specifically
openssl s_client -connect 192.168.1.160:8443 -tls1_2
# Check cipher negotiation
openssl s_client -connect 192.168.1.160:8443 -tls1_2 </dev/null 2>&1 | grep -E "Cipher|Protocol|Verify"
# View certificate details
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -text -noout
# Get certificate fingerprint
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
# Check LCM logs for TLS errors
grep -i "tlsfatal\|ssl\|certificate" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20
# Check LCM service status
systemctl status lcm
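When comparing the `openssl` fingerprint against what `keytool -list` reports, a small stdlib helper can normalize any PEM certificate to the same colon-separated SHA-256 format. This is a sketch; `ssl.get_server_certificate` (shown commented out, since it needs the depot reachable) can fetch the PEM:

```python
import hashlib
import ssl

def sha256_fingerprint(pem_cert: str) -> str:
    """Return the colon-separated SHA-256 fingerprint of a PEM certificate,
    in the same format printed by `openssl x509 -fingerprint -sha256`."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)  # strip headers, base64-decode
    digest = hashlib.sha256(der).hexdigest().upper()
    return ':'.join(digest[i:i + 2] for i in range(0, len(digest), 2))

# Example against the lab depot (requires network access to 192.168.1.160):
# pem = ssl.get_server_certificate(('192.168.1.160', 8443))
# print(sha256_fingerprint(pem))
```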
#!/usr/bin/env python3
"""
HTTPS server for VCF Offline Depot
Serves files with TLS 1.2+ for SDDC Manager compatibility
"""
import http.server
import ssl
import os
import base64
import socketserver
from functools import partial

# Configuration
PORT = 8443
CERT_FILE = 'server.crt'
KEY_FILE = 'server.key'
USERNAME = 'admin'
PASSWORD = 'admin'


class AuthHandler(http.server.SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_HEAD(self):
        if not self.authenticate():
            return
        super().do_HEAD()

    def do_GET(self):
        if not self.authenticate():
            return
        super().do_GET()

    def do_POST(self):
        if not self.authenticate():
            return
        content_length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(content_length)
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

    def authenticate(self):
        auth_header = self.headers.get('Authorization')
        if auth_header is None:
            self.send_auth_request()
            return False
        try:
            auth_type, credentials = auth_header.split(' ', 1)
            if auth_type.lower() != 'basic':
                self.send_auth_request()
                return False
            decoded = base64.b64decode(credentials).decode('utf-8')
            username, password = decoded.split(':', 1)
            if username == USERNAME and password == PASSWORD:
                return True
        except Exception:
            pass
        self.send_auth_request()
        return False

    def send_auth_request(self):
        self.send_response(401)
        self.send_header('WWW-Authenticate', 'Basic realm="VCF Depot"')
        self.send_header('Content-type', 'text/html')
        self.send_header('Content-Length', '23')
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(b'Authentication required')

    def log_message(self, format, *args):
        print(f"{self.client_address[0]} - {format % args}")


class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    daemon_threads = True


def run_server():
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_3
    if hasattr(context, 'post_handshake_auth'):
        context.post_handshake_auth = False
    context.options |= ssl.OP_NO_TICKET
    context.options |= getattr(ssl, 'OP_NO_RENEGOTIATION', 0)
    context.load_cert_chain(CERT_FILE, KEY_FILE)
    try:
        context.set_ciphers('DEFAULT:!aNULL:!MD5:!DSS')
    except ssl.SSLError:
        pass
    handler = partial(AuthHandler, directory=os.getcwd())
    server = ThreadedHTTPServer(('0.0.0.0', PORT), handler)
    server.socket = context.wrap_socket(server.socket, server_side=True)
    print("VCF Offline Depot Server")
    print("========================")
    print(f"Serving: {os.getcwd()}")
    print(f"URL: https://192.168.1.160:{PORT}/")
    print(f"Credentials: {USERNAME} / {PASSWORD}")
    print("TLS: 1.2 - 1.3")
    print("Press Ctrl+C to stop")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nStopped.")
        server.shutdown()


if __name__ == '__main__':
    run_server()
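The depot can be sanity-checked from any machine with Python by building the Basic auth header the server's authenticate() method expects. This is a sketch; the commented-out request uses this lab's depot URL:

```python
import base64
import urllib.request

def basic_auth_header(username: str, password: str) -> str:
    """Build an RFC 7617 Basic Authorization header value, i.e. the same
    format the depot server's authenticate() method decodes."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

# Example request (self-signed cert, so an unverified context or the
# depot CA would be needed; URL reflects this lab's addressing):
# req = urllib.request.Request(
#     "https://192.168.1.160:8443/",
#     headers={"Authorization": basic_auth_header("admin", "admin")})

print(basic_auth_header("admin", "admin"))  # Basic YWRtaW46YWRtaW4=
```

A request without this header should come back 401 with the `WWW-Authenticate: Basic realm="VCF Depot"` challenge, confirming the auth path works before pointing SDDC Manager at the depot.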
"""
Generate simple self-signed certificate for VCF Offline Depot
"""
import subprocess
import sys
import os
def generate_cert():
try:
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization
from datetime import datetime, timedelta, timezone
import ipaddress
print("Generating RSA 2048-bit private key...")
key = rsa.generate_private_key(
public_exponent=65537,
key_size=2048,
)
print("Creating self-signed certificate...")
subject = issuer = x509.Name([
x509.NameAttribute(NameOID.COMMON_NAME, "192.168.1.160"),
])
now = datetime.now(timezone.utc)
# Simple certificate like the original
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(now)
.not_valid_after(now + timedelta(days=365))
.add_extension(
x509.SubjectAlternativeName([
x509.IPAddress(ipaddress.IPv4Address("192.168.1.160")),
]),
critical=False,
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
critical=True,
)
.sign(key, hashes.SHA256())
)
with open("server.key", "wb") as f:
f.write(key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption()
))
print("Created: server.key")
with open("server.crt", "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
print("Created: server.crt")
fingerprint = cert.fingerprint(hashes.SHA256()).hex()
formatted = ':'.join(fingerprint[i:i+2].upper() for i in range(0, len(fingerprint), 2))
print(f"SHA256: {formatted}")
return True
except ImportError:
print("Installing cryptography...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "cryptography"])
return False
def main():
os.chdir(r"C:\VCF-DEPOT")
print("Generating certificate...")
if not generate_cert():
generate_cert()
print("\nDone. Run: python https_server.py")
if __name__ == "__main__":
main()
# Navigate to depot directory
cd C:\VCF-DEPOT
# Generate certificate
python generate_cert.py
# Start HTTPS server
python https_server.py
# Step 1: Download certificate from depot server
openssl s_client -connect 192.168.1.160:8443 </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/depot.crt
# Step 2: Verify certificate was downloaded
cat /tmp/depot.crt
# Step 3: Get certificate fingerprint
openssl x509 -in /tmp/depot.crt -noout -fingerprint -sha256
# Step 4: Find Java truststore location
echo $JAVA_HOME
# Output: /usr/lib/jvm/openjdk-java17-headless.x86_64
# Step 5: Delete old certificate if exists
keytool -delete -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Step 6: Import new certificate
keytool -import -trustcacerts -alias offline-depot -file /tmp/depot.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
# Step 7: Verify import
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# Step 8: Restart LCM service
systemctl restart lcm
# Step 9: Wait for LCM to start (2 minutes)
systemctl status lcm
# Step 10: Verify LCM is ready
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log | grep -i "started\|ready"
# Check all cacerts files on system
find / -name "cacerts" -type f 2>/dev/null
# List all certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# View certificate details in truststore
keytool -list -alias offline-depot -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v
# Check LCM logs for certificate errors
grep -B5 -A10 "TlsFatalAlert" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -40
ESXi hosts may have certificates with incorrect hostnames (e.g., "localhost.localdomain" instead of the actual FQDN), causing VCF validation to fail.
javax.net.ssl.SSLPeerUnverifiedException: Certificate for <esxi01.lab.local> doesn't match any of the subject alternative names: [localhost.localdomain]
# Check current hostname
esxcli system hostname get
# View current certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# View full certificate details
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout
# Step 1: Set correct hostname
esxcli system hostname set --fqdn=esxi01.lab.local
# Step 2: Verify hostname
esxcli system hostname get
# Step 3: Backup existing certificate
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
# Step 4: Generate new certificate
/sbin/generate-certificates
# Step 5: Restart services
services.sh restart
# Step 6: Verify new certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
esxi01.lab.local (192.168.1.74):
esxcli system hostname set --fqdn=esxi01.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi02.lab.local (192.168.1.75):
esxcli system hostname set --fqdn=esxi02.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi03.lab.local (192.168.1.76):
esxcli system hostname set --fqdn=esxi03.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
esxi04.lab.local (192.168.1.82):
esxcli system hostname set --fqdn=esxi04.lab.local
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
/sbin/generate-certificates
services.sh restart
# Get thumbprint for each host
echo | openssl s_client -connect 192.168.1.74:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.75:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.76:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
echo | openssl s_client -connect 192.168.1.82:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256
After regenerating certificates, update the thumbprints in VCF Installer UI by re-validating the hosts.
In nested VMware Workstation environments, virtual disks are not automatically detected as SSDs, causing vSAN cache tier configuration to fail.
ESX Host esxi01.lab.local found zero SSD devices for SSD cache tier
# List all storage devices with SSD status
esxcli storage core device list | grep -E "^t10|Is SSD"
# Check vSAN eligible disks
vdq -q
# List vSAN storage
esxcli vsan storage list
# Check disk partitions
partedUtil getptbl /vmfs/devices/disks/<device-name>
Virtual disks in VMware Workstation need the virtualSSD flag set in the VMX file to be recognized as SSDs by nested ESXi.
Location: Edit each ESXi VM's .vmx file in VMware Workstation
Required Lines to Add:
For esxi01.vmx:
sata0:0.virtualSSD = 1
sata0:2.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1
For esxi02.vmx, esxi03.vmx, esxi04.vmx:
sata0:0.virtualSSD = 1
sata0:3.virtualSSD = 1
sata0:4.virtualSSD = 1
esxcli storage core device list | grep -E "^t10|Is SSD"
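With the VMs powered off, the flag lines above can also be appended programmatically rather than by hand. This sketch assumes the device IDs used for esxi01 in this lab and a hypothetical helper name; adjust the device list per VM:

```python
from pathlib import Path

# Boot disk plus vSAN cache/capacity disks for esxi01 in this lab;
# adjust per VM (esxi02-04 use sata0:0, sata0:3, sata0:4).
SSD_DEVICES = ["sata0:0", "sata0:2", "sata0:3", "sata0:4"]

def add_virtual_ssd_flags(vmx_path, devices=SSD_DEVICES):
    """Append '<device>.virtualSSD = 1' entries that are not already present.
    Returns the lines added. Edit .vmx files only while the VM is powered off."""
    path = Path(vmx_path)
    lines = path.read_text().splitlines()
    added = []
    for dev in devices:
        key = f"{dev}.virtualSSD"
        if not any(line.strip().startswith(key) for line in lines):
            entry = f"{key} = 1"
            lines.append(entry)
            added.append(entry)
    path.write_text("\n".join(lines) + "\n")
    return added
```

Running it a second time is a no-op, so it is safe to re-run across all four VMX files before powering the hosts back on and re-checking "Is SSD".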
Disks with existing vSAN partitions from previous deployments are marked as "Ineligible for use by VSAN" with reason "Has partitions" or "Disk in use by disk group".
# Check vSAN eligibility
vdq -q
# Check existing vSAN storage
esxcli vsan storage list
# Check partition table
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Step 1: Remove existing vSAN disk group (if exists)
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
# Step 2: Delete partitions from vSAN cache disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Step 3: Delete partitions from vSAN capacity disk
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2
# Step 4: Verify disks are now eligible
vdq -q
{
"Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
"State": "Eligible for use by VSAN",
"Reason": "None",
"IsSSD": "1"
}
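The eligibility check can be scripted against `vdq -q` output across hosts. A sketch, assuming the output parses as JSON (real output formatting can vary by ESXi build):

```python
import json

def ineligible_disks(vdq_output: str):
    """Return (name, reason) for disks that vdq -q reports as not eligible
    for vSAN. Accepts a JSON list of entries or a single entry."""
    data = json.loads(vdq_output)
    if isinstance(data, dict):
        data = [data]
    return [(d["Name"], d["Reason"]) for d in data
            if d.get("State") != "Eligible for use by VSAN"]

# The expected post-cleanup entry from this lab:
sample = '''{
  "Name": "t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001",
  "State": "Eligible for use by VSAN",
  "Reason": "None",
  "IsSSD": "1"
}'''
print(ineligible_disks(sample))  # []
```

Any entry returned with reason "Has partitions" or "Disk in use by disk group" points back at the partedUtil/esxcli cleanup steps above.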
VCF requires separate subnets for different network types. Using the same subnet/gateway for multiple networks causes validation errors:
Gateway 192.168.1.1 is duplicated across networks
Subnet 192.168.1.0/24 is duplicated across networks
| Network | VLAN ID | Subnet | Gateway |
|---|---|---|---|
| Management | 0 | 192.168.1.0/24 | 192.168.1.1 |
| vMotion | 100 | 192.168.100.0/24 | 192.168.100.1 |
| vSAN | 200 | 192.168.200.0/24 | 192.168.200.1 |
| NSX TEP | 300 | 192.168.250.0/24 | 192.168.250.1 |
Note: For the vMotion, vSAN, and TEP networks in nested environments, the gateway IPs do not need to exist, since these networks are isolated; VCF only requires the gateway field to be populated.
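The duplicate-subnet/gateway validation is easy to reproduce ahead of time with the stdlib ipaddress module. A sketch of the check, not VCF's actual implementation:

```python
import ipaddress
from collections import Counter

def check_network_separation(networks):
    """Return errors of the kind VCF raises for duplicated subnets/gateways.
    `networks` maps name -> (subnet_cidr, gateway)."""
    errors = []
    subnets = Counter(ipaddress.ip_network(s) for s, _ in networks.values())
    gateways = Counter(ipaddress.ip_address(g) for _, g in networks.values())
    for subnet, n in subnets.items():
        if n > 1:
            errors.append(f"Subnet {subnet} is duplicated across networks")
    for gw, n in gateways.items():
        if n > 1:
            errors.append(f"Gateway {gw} is duplicated across networks")
    return errors

# The working configuration from this lab passes cleanly:
working = {
    "Management": ("192.168.1.0/24", "192.168.1.1"),
    "vMotion":    ("192.168.100.0/24", "192.168.100.1"),
    "vSAN":       ("192.168.200.0/24", "192.168.200.1"),
    "NSX TEP":    ("192.168.250.0/24", "192.168.250.1"),
}
print(check_network_separation(working))  # []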
Do NOT use jumbo frames (MTU 9000) in nested VMware Workstation environments.
| Component | Recommended MTU |
|---|---|
| Distributed Switch | 1500-1600 |
| ESX Management | 1500 |
| vMotion | 1500 |
| vSAN | 1500 |
| NSX TEP | 1500 |
| Network | IP Range | Purpose |
|---|---|---|
| Management | 192.168.1.x | Static assignments per hosts file |
| vMotion | 192.168.100.10-20 | Automatic assignment by VCF |
| vSAN | 192.168.200.206-216 | Automatic assignment by VCF |
| NSX TEP | 192.168.250.10-25 | TEP IP Pool |
Problem: VCF detects duplicate IP when cluster FQDN resolves to same IP as Node IP.
Error:
IP address 192.168.1.90 for product VCF Automation is already resolved from an FQDN in the input specification
Solution: Use different IPs for cluster FQDN and node IP:
| Field | Value |
|---|---|
| Cluster hostname/FQDN | automation.lab.local (192.168.1.90) |
| Node IP 1 | 192.168.1.91 (automation-node1.lab.local) |
| Additional IP for upgrades | 192.168.1.81 |
| Node name prefix | automation |
| Internal Cluster CIDR | 198.18.0.0/15 |
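The Automation duplicate-IP validation can be mimicked with a trivial check on already-resolved addresses (a sketch, not VCF's code; it takes resolved IPs so no DNS lookup is needed):

```python
def check_automation_ips(cluster_fqdn_ip, node_ips):
    """The cluster FQDN must not resolve to any node IP; otherwise VCF
    reports the address as already resolved from an FQDN in the spec."""
    if cluster_fqdn_ip in node_ips:
        return [f"IP address {cluster_fqdn_ip} for product VCF Automation "
                "is already resolved from an FQDN in the input specification"]
    return []

# Lab values: cluster FQDN 192.168.1.90, node 192.168.1.91
print(check_automation_ips("192.168.1.90", ["192.168.1.91"]))  # []
```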
| Field | Value |
|---|---|
| Appliance Size | Medium |
| Appliance FQDN | nsx-node1.lab.local |
| Virtual IP (VIP) FQDN | nsx-vip.lab.local |
Problem: VDT reports "SAN contains neither hostname nor IP" for NSX VIP and NSX Manager certificates. The default NSX self-signed certificate uses a wildcard SAN (*.lab.local) without specific hostnames or IPs.
Solution: Generate a new self-signed certificate with explicit SAN entries and apply via NSX API.
Step 1: Create OpenSSL config on NSX Manager (SSH as root):
cat > /tmp/nsx-cert.conf << 'EOF'
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
x509_extensions = req_ext
prompt = no
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = Lab
localityName = Lab
organizationName = lab.local
commonName = nsx-vip.lab.local
[ req_ext ]
basicConstraints = CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = nsx-vip.lab.local
DNS.2 = nsx-node1.lab.local
DNS.3 = nsx-manager.lab.local
IP.1 = 192.168.1.70
IP.2 = 192.168.1.71
EOF
Important: DNS.3 = nsx-manager.lab.local is required because SDDC Manager registers NSX using this FQDN. Without it, VDT reports "SAN contains IP but not hostname".
Step 2: Generate cert, build JSON, import, and apply:
# Generate cert
openssl req -x509 -nodes -days 825 -newkey rsa:2048 -keyout /tmp/nsx.key -out /tmp/nsx.crt -config /tmp/nsx-cert.conf -sha256
# Build JSON payload (avoids shell escaping issues with PEM newlines)
python -c "
import json
cert = open('/tmp/nsx.crt').read()
key = open('/tmp/nsx.key').read()
print(json.dumps({'pem_encoded': cert, 'private_key': key}))
" > /tmp/nsx-import.json
# Import cert (single-line — NSX shell doesn't support backslash continuation)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates?action=import" -H "Content-Type: application/json" -d @/tmp/nsx-import.json
# Note the certificate ID from the response
# Get node UUID
curl -k -u admin:'<password>' https://192.168.1.71/api/v1/cluster
# Note the node UUID
# Apply to node (API service)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=API&node_id=<node-uuid>"
# Apply to VIP (MGMT_CLUSTER)
curl -k -u admin:'<password>' -X POST "https://192.168.1.71/api/v1/trust-management/certificates/<cert-id>?action=apply_certificate&service_type=MGMT_CLUSTER"
Prerequisite: All NSX services must be healthy (MANAGER, SEARCH, UI, NODE_MGMT all UP). If services are DOWN, the API returns error 101. Wait 10-15 minutes after NSX restart in nested environments.
Problem: After replacing the NSX self-signed certificate, VDT reports "NSX VIP Cert Trust: FAIL" and "NSX Manager Cert Trust: FAIL". The new self-signed cert's root is not in SDDC Manager's keystores (the original cert was pre-trusted during bringup).
Solution: Import the NSX certificate into both SDDC Manager trust stores.
Step 1: Pull the NSX certificate (SSH to SDDC Manager as root):
openssl s_client -showcerts -connect 192.168.1.71:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/nsx-root.crt
# Verify it's the correct cert
openssl x509 -in /tmp/nsx-root.crt -noout -text | grep -A2 "Subject Alternative Name"
Step 2: Import into VCF trust store:
KEY=$(cat /etc/vmware/vcf/commonsvcs/trusted_certificates.key)
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/vmware/vcf/commonsvcs/trusted_certificates.store \
-storepass "$KEY" -noprompt
Step 3: Import into Java cacerts:
keytool -importcert -alias nsx-selfsigned -file /tmp/nsx-root.crt \
-keystore /etc/alternatives/jre/lib/security/cacerts \
-storepass changeit -noprompt
Step 4: Restart SDDC Manager services (~5 minutes):
/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh
Key paths:
| Item | Path/Value |
|---|---|
| VCF trust store | /etc/vmware/vcf/commonsvcs/trusted_certificates.store |
| VCF trust store password | Contents of /etc/vmware/vcf/commonsvcs/trusted_certificates.key |
| Java cacerts | /etc/alternatives/jre/lib/security/cacerts |
| Java cacerts password | changeit |
| Service restart script | /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh |
Reference: https://knowledge.broadcom.com/external/article/316056
Note on SDDC Manager SSH: Only the vcf user can SSH in (root and admin are rejected). Use su - from the vcf session to get root access. SCP does not work due to the restricted shell; use ssh vcf@host "cat > file" < localfile for file transfers.
| Field | Value |
|---|---|
| Appliance FQDN | vcenter.lab.local |
| Appliance Size | Tiny |
| Datacenter Name | mgmt-dc01 |
| Cluster Name | mgmt-cl01 |
| SSO Domain Name | vsphere.local |
Problem: VCF vLCM (vSphere Lifecycle Manager) requires SSH access to ESXi hosts during vCenter deployment for host seeding. If SSH is disabled, the deployment fails with:
vCenter installation failed. Check logs under /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX for more details
Symptoms in Logs
Extraction of image from host esxi01.lab.local failed
Root Cause: SSH service is stopped or disabled on the ESXi hosts. VCF needs SSH to extract ESXi image metadata for vLCM host seeding.
Solution: Enable SSH on All ESXi Hosts
Run on each ESXi host BEFORE starting VCF deployment:
# Enable SSH service
vim-cmd hostsvc/enable_ssh
# Start SSH service
vim-cmd hostsvc/start_ssh
# Verify SSH is running
vim-cmd hostsvc/runtimeinfo | grep ssh
Alternative Method (from ESXi Shell)
# Enable and start SSH
esxcli system ssh set --enable=true
# Verify SSH status
esxcli system ssh get
Note: SSH can be disabled after successful VCF deployment for security.
VCF Installer Log Monitoring
Watch deployment progress from VCF Installer:
# Find the latest ci-installer log directory
ls -lt /var/log/vmware/vcf/domainmanager/ | head -5
# Watch the installation log
tail -f /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
# Search for errors
grep -i "error\|failed\|exception" /var/log/vmware/vcf/domainmanager/ci-installer-XX-XX-XX-XX-XX-XXX/ci-installer.log
vCenter VM Direct Monitoring
SSH to the vCenter VM during deployment (default password: vmware):
# Watch firstboot progress
tail -f /var/log/firstboot/firstbootStatus.json
# Watch detailed installation
tail -f /var/log/vmware/firstboot/installer.log
# Check VMware services
vmon-cli --list
# Check specific service status
vmon-cli --status <service-name>
Expected Deployment Stages
Symptoms
Diagnostic Commands (SSH to vCenter VM)
# Check current deployment status
cat /var/log/firstboot/firstbootStatus.json
# Check for running processes
ps aux | grep -E "install|firstboot|postgres|vpxd"
# Check disk I/O (should show activity)
vmstat 1 5
# Check memory usage
free -h
# Check for error logs
tail -50 /var/log/vmware/firstboot/installer.log
grep -i "error\|fail\|exception" /var/log/vmware/firstboot/*.log
PostgreSQL Database Issues
If deployment is stuck at "Installing Containers" (60%), check postgres:
# Check if postgres service exists
ls -la /storage/db/vpostgres/
# Check for postgres config file
ls -la /storage/db/vpostgres/postgresql.conf
# Check postgres user/group
grep postgres /etc/passwd
grep postgres /etc/group
# Check postgres logs
tail -50 /var/log/vmware/vpostgres/*.log
If PostgreSQL Never Initialized
A missing /storage/db/vpostgres/postgresql.conf together with a missing postgres user indicates that database initialization failed. This is typically unrecoverable and requires full redeployment.
Service Startup Issues
# List all VMware services and status
vmon-cli --list
# Check rhttpproxy (reverse proxy)
systemctl status rhttpproxy
tail -50 /var/log/vmware/rhttpproxy/rhttpproxy.log
# Check vpostgres
systemctl status vmware-vpostgres
tail -50 /var/log/vmware/vpostgres/postgresql*.log
When vCenter deployment fails, VCF provides a reference token. To find detailed errors:
# Search for reference token in logs
grep -r "REFERENCE_TOKEN" /var/log/vmware/vcf/
# Example: Reference Token 3OHCKD
grep -r "3OHCKD" /var/log/vmware/vcf/
grep -B20 -A20 "3OHCKD" /var/log/vmware/vcf/domainmanager/*.log
VCF does not provide a rollback mechanism for failed management domain deployments. A failed deployment requires manual cleanup of the failed vCenter VM, the distributed switch configuration, the vSAN disk groups and partitions, and the depot connection:
Step 1: Delete Failed vCenter VM
From any ESXi host (or the one hosting the failed vCenter):
# List all VMs
vim-cmd vmsvc/getallvms
# Find the vCenter VM ID (look for vcenter.lab.local or similar)
# Power off the VM if running
vim-cmd vmsvc/power.off <vmid>
# Unregister the VM
vim-cmd vmsvc/unregister <vmid>
# Delete the VM files from datastore (if needed)
rm -rf /vmfs/volumes/<datastore>/vcenter.lab.local/
Step 2: Clean Up VDS (Distributed Switch)
From ESXi hosts, remove VDS configuration:
# List current virtual switches
esxcli network vswitch dvs vmware list
# Remove host from VDS (if configured)
# This is typically done from vCenter, but if vCenter is gone:
# Remove vmkernel ports from VDS
esxcli network ip interface remove -i vmk1 # vMotion
esxcli network ip interface remove -i vmk2 # vSAN
# Remove VDS uplink
esxcli network vswitch dvs vmware list
Step 3: Clean Up vSAN Configuration
Run on EACH ESXi host:
# List current vSAN storage
esxcli vsan storage list
# Remove vSAN disk groups
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
esxcli vsan storage remove -d t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
# If remove fails, check disk state
vdq -q
# Delete partitions from cache disk (example device name)
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________03000000000000000001 2
# Delete partitions from capacity disk
partedUtil getptbl /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 1
partedUtil delete /vmfs/devices/disks/t10.ATA_____VMware_Virtual_SATA_Hard_Drive__________04000000000000000001 2
# Verify disks are now eligible
vdq -q
Common vSAN Cleanup Error
If you see: cache disk/s are in an invalid state...available size is 0.0 GB
This means disks still have partitions. Use partedUtil to delete them.
Step 4: Remove Depot Connection (VCF UI)
Step 5: Verify Hosts Are Ready
On each ESXi host, verify:
# Check hostname is correct
esxcli system hostname get
# Check certificate is valid
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# Check SSH is enabled
vim-cmd hostsvc/runtimeinfo | grep ssh
# Check vSAN disks are eligible
vdq -q
# Check no VDS remnants
esxcli network vswitch dvs vmware list
Step 6: Restart VCF Services (Optional)
On VCF Installer, restart services for clean state:
systemctl restart lcm
systemctl restart domainmanager
# Wait for services to start
sleep 120
# Verify services are running
systemctl status lcm
systemctl status domainmanager
START: VCF Deployment Failed
│
├─→ Note reference token from error message
│ └─→ Search logs: grep -r "TOKEN" /var/log/vmware/vcf/
│
├─→ Delete failed vCenter VM
│ ├─→ vim-cmd vmsvc/getallvms
│ ├─→ vim-cmd vmsvc/power.off <vmid>
│ └─→ vim-cmd vmsvc/unregister <vmid>
│
├─→ Clean up vSAN on EACH host
│ ├─→ esxcli vsan storage remove -d <device>
│ ├─→ partedUtil delete ... (both partitions)
│ └─→ vdq -q (verify eligible)
│
├─→ Clean up VDS (if configured)
│ └─→ esxcli network ip interface remove ...
│
├─→ Remove depot connection in VCF UI
│ └─→ Re-add with certificate
│
├─→ Verify SSH enabled on all hosts
│ └─→ vim-cmd hostsvc/enable_ssh
│
└─→ Retry deployment
# Check LCM service status
systemctl status lcm
# Restart LCM service
systemctl restart lcm
# Check Domain Manager status
systemctl status domainmanager
# Restart Domain Manager
systemctl restart domainmanager
# View LCM logs
tail -f /var/log/vmware/vcf/lcm/lcm-debug.log
# Search LCM logs for errors
grep -i "error\|fatal\|exception" /var/log/vmware/vcf/lcm/lcm-debug.log | tail -20
# Download certificate from remote server
openssl s_client -connect <IP>:<PORT> </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/cert.crt
# View certificate details
openssl x509 -in /tmp/cert.crt -text -noout
# Get certificate fingerprint
openssl x509 -in /tmp/cert.crt -noout -fingerprint -sha256
# Import certificate to Java truststore
keytool -import -trustcacerts -alias <alias> -file /tmp/cert.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt
# Delete certificate from truststore
keytool -delete -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# List certificates in truststore
keytool -list -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
# View specific certificate in truststore
keytool -list -alias <alias> -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -v
# Test connectivity
ping <IP>
# Test SSL connection
openssl s_client -connect <IP>:<PORT>
# Test with specific TLS version
openssl s_client -connect <IP>:<PORT> -tls1_2
# Test HTTP endpoint
curl -v -k -u admin:admin https://<IP>:<PORT>/path
# Get hostname
esxcli system hostname get
# Set hostname
esxcli system hostname set --fqdn=<FQDN>
# Get system version
esxcli system version get
# View current certificate
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout
# View certificate SAN
openssl x509 -in /etc/vmware/ssl/rui.crt -text -noout | grep -A1 "Subject Alternative Name"
# Backup certificates
mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.bak
mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.bak
# Regenerate certificates
/sbin/generate-certificates
# Restart services after certificate change
services.sh restart
# List all storage devices
esxcli storage core device list
# List devices with SSD status
esxcli storage core device list | grep -E "^t10|^naa|Is SSD"
# Check vSAN eligible disks
vdq -q
# List vSAN storage
esxcli vsan storage list
# Get partition table
partedUtil getptbl /vmfs/devices/disks/<device>
# Delete partition
partedUtil delete /vmfs/devices/disks/<device> <partition-number>
# Remove vSAN disk group
esxcli vsan storage remove -d <device>
# Add SATP rule for SSD
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d <device> -o enable_ssd
# Reclaim device after SATP rule
esxcli storage core claiming reclaim -d <device>
# List SATP rules
esxcli storage nmp satp rule list | grep enable_ssd
# Enter maintenance mode
esxcli system maintenanceMode set -e true -m noAction
# Exit maintenance mode
esxcli system maintenanceMode set -e false
# Check maintenance mode status
esxcli system maintenanceMode get
# Navigate to depot directory
cd C:\VCF-DEPOT
# Generate certificate
python generate_cert.py
# Start HTTPS server
python https_server.py
# Install Python cryptography library
pip install cryptography
START: "Secure protocol communication error"
│
├─→ Test connectivity: ping <depot-ip>
│ └─→ FAIL: Check network/firewall
│
├─→ Test SSL: openssl s_client -connect <ip>:8443
│ └─→ FAIL: Check depot server is running
│
├─→ Check certificate: View cert details
│ └─→ Wrong hostname: Regenerate certificate
│
├─→ Import certificate to Java truststore
│ └─→ keytool -import ...
│
├─→ Verify fingerprints match
│ └─→ MISMATCH: Re-import correct certificate
│
└─→ Restart LCM service
└─→ Wait 2 minutes, retry connection
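The "verify fingerprints match" step above can be done without parsing `keytool` output. Below is a stdlib-only sketch that computes a certificate's SHA-256 fingerprint from its PEM text; compare the result against what `keytool -list` reports for the imported alias.

```python
# Sketch: SHA-256 fingerprint of a PEM certificate, keytool-style.
import base64
import hashlib
import re


def pem_to_der(pem_text):
    """Extract and decode the base64 body of the first cert in a PEM blob."""
    match = re.search(
        r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
        pem_text, re.DOTALL,
    )
    if not match:
        raise ValueError("no certificate found in PEM input")
    return base64.b64decode("".join(match.group(1).split()))


def sha256_fingerprint(pem_text):
    """Colon-separated SHA-256 fingerprint over the DER bytes."""
    digest = hashlib.sha256(pem_to_der(pem_text)).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Feed it the output of `openssl s_client -connect <ip>:8443 -showcerts`; a mismatch against the truststore entry means the wrong certificate was imported.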
START: "Certificate doesn't match subject alternative names"
│
├─→ Check current cert SAN
│ └─→ openssl x509 -in /etc/vmware/ssl/rui.crt ...
│
├─→ Set correct hostname
│ └─→ esxcli system hostname set --fqdn=<FQDN>
│
├─→ Backup old certificates
│ └─→ mv /etc/vmware/ssl/rui.* /etc/vmware/ssl/rui.*.bak
│
├─→ Generate new certificates
│ └─→ /sbin/generate-certificates
│
├─→ Restart services
│ └─→ services.sh restart
│
└─→ Update thumbprints in VCF
└─→ Re-validate hosts in UI
START: "Found zero SSD devices for SSD cache tier"
│
├─→ Check SSD status: esxcli storage core device list
│ └─→ "Is SSD: false" → Continue
│
├─→ Shut down ESXi VM in Workstation
│
├─→ Edit VMX file
│ └─→ Add: sata0:X.virtualSSD = 1
│
├─→ Power on ESXi VM
│
├─→ Verify SSD status
│ └─→ Still false: Check VMX syntax
│
├─→ Check vSAN eligibility: vdq -q
│ └─→ "Has partitions" → Clean up partitions
│
└─→ Clean up old vSAN config
├─→ esxcli vsan storage remove -d <device>
└─→ partedUtil delete ...
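The "Edit VMX file" step above is simple enough to script. This sketch appends the `virtualSSD` flag for a given device, replacing any existing value; the device name `sata0:1` is an illustration only — match it to the disk entries actually present in your VMX.

```python
# Sketch: mark a disk as SSD in a Workstation VMX file.
# Device name "sata0:1" is an assumption; check your VMX for the real one.
def mark_virtual_ssd(vmx_text, device="sata0:1"):
    key = f"{device}.virtualSSD"
    # Drop any existing virtualSSD line for this device, then re-add it as 1.
    lines = [l for l in vmx_text.splitlines() if not l.startswith(key)]
    lines.append(f'{key} = "1"')
    return "\n".join(lines) + "\n"
```

Apply it with the VM powered off, then power on and re-check `Is SSD` via `esxcli storage core device list`.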
START: vCenter deployment stuck at percentage
│
├─→ Wait 30 minutes (large downloads may be slow)
│
├─→ SSH to vCenter VM (ssh root@<vcenter-ip>)
│ └─→ Default password: vmware
│
├─→ Check firstboot status
│ └─→ cat /var/log/firstboot/firstbootStatus.json
│
├─→ Check for activity
│ ├─→ vmstat 1 5 (disk I/O)
│ └─→ tail -f /var/log/vmware/firstboot/installer.log
│
├─→ If stuck at 60% "Installing Containers"
│ ├─→ Check postgres: ls /storage/db/vpostgres/
│ ├─→ Missing postgresql.conf → Database failed to init
│ └─→ UNRECOVERABLE: Must redeploy
│
├─→ Check services: vmon-cli --list
│ └─→ Services not started → Check individual logs
│
└─→ If unrecoverable:
├─→ Delete vCenter VM
├─→ Clean up vSAN on all hosts
├─→ Reset depot connection
└─→ Retry deployment
START: "Extraction of image from host failed"
│
├─→ Check SSH status on ESXi host
│ └─→ vim-cmd hostsvc/runtimeinfo | grep ssh
│
├─→ SSH Disabled?
│ ├─→ vim-cmd hostsvc/enable_ssh
│ └─→ vim-cmd hostsvc/start_ssh
│
├─→ Verify SSH on ALL hosts
│ └─→ Repeat for esxi01, esxi02, esxi03, esxi04
│
└─→ Retry vCenter deployment
START: VDT reports NSX cert FAIL (Trust or SAN)
│
├─→ Check which check failed
│ ├─→ SAN FAIL: Certificate missing hostnames/IPs
│ └─→ Trust FAIL: Certificate root not in SDDC Manager keystores
│
├─→ If SAN FAIL:
│ ├─→ SSH to NSX Manager as root
│ ├─→ Create OpenSSL config with all SANs:
│ │ DNS.1 = nsx-vip.lab.local
│ │ DNS.2 = nsx-node1.lab.local
│ │ DNS.3 = nsx-manager.lab.local ← SDDC Manager's registered FQDN
│ │ IP.1 = 192.168.1.70 (VIP)
│ │ IP.2 = 192.168.1.71 (node)
│ ├─→ Generate cert: openssl req -x509 ...
│ ├─→ Build JSON: python (avoid shell PEM escaping)
│ ├─→ Import via API: POST /api/v1/trust-management/certificates?action=import
│ ├─→ Apply to node: ?action=apply_certificate&service_type=API&node_id=<uuid>
│ └─→ Apply to VIP: ?action=apply_certificate&service_type=MGMT_CLUSTER
│
├─→ If Trust FAIL (after cert replacement):
│ ├─→ SSH to SDDC Manager as vcf, then su - to root
│ ├─→ Pull cert: openssl s_client ... > /tmp/nsx-root.crt
│ ├─→ Import to VCF store: keytool -importcert ... trusted_certificates.store
│ ├─→ Import to Java cacerts: keytool -importcert ... cacerts
│ └─→ Restart services: sddcmanager_restart_services.sh
│
└─→ Re-run VDT after ~5 minutes
└─→ Expected: NSX cert checks all PASS
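The SAN-fix branch above ("Create OpenSSL config with all SANs" and "Build JSON: python") can be sketched as two small helpers. The SAN config layout is a generic OpenSSL `req` config; the JSON field names (`display_name`, `pem_encoded`, `private_key`) follow the NSX trust-management import API, but verify them against your NSX version's API reference before use.

```python
# Sketch: build the OpenSSL SAN config and the NSX cert-import JSON body
# in Python, avoiding shell PEM-escaping problems.
import json


def san_config(dns_names, ips):
    """OpenSSL req config with every SAN the VDT check expects."""
    lines = ["[req]", "distinguished_name = dn", "x509_extensions = v3",
             "prompt = no", "[dn]", f"CN = {dns_names[0]}",
             "[v3]", "subjectAltName = @alt", "[alt]"]
    lines += [f"DNS.{i} = {name}" for i, name in enumerate(dns_names, 1)]
    lines += [f"IP.{i} = {ip}" for i, ip in enumerate(ips, 1)]
    return "\n".join(lines) + "\n"


def import_payload(pem_cert, pem_key, name="nsx-san-cert"):
    """Body for POST /api/v1/trust-management/certificates?action=import.
    Field names are per the NSX API docs -- confirm for your version."""
    return json.dumps({
        "display_name": name,
        "pem_encoded": pem_cert,
        "private_key": pem_key,
    })
```

Write `san_config(...)` to a file for `openssl req -x509 -config`, then POST `import_payload(...)` with the admin credentials.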
START: Infrastructure Health Adapter shows "no data receiving"
│
├─→ Check adapter log (VCF Ops 9.x path):
│ tail -100 /storage/log/vcops/log/adapters/
│ VMwareInfraHealthAdapter/VMwareInfraHealthAdapter_55.log
│
├─→ If "Unable to fetch access token for the SDDC manager":
│ ├─→ Test SDDC Manager auth from VCF Ops node:
│ │ curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
│ │ -H "Content-Type: application/json" \
│ │ -d '{"username":"administrator@vsphere.local","password":"..."}'
│ ├─→ If token returned: credential issue in adapter
│ │ ├─→ UI → Administration → Integrations → SDDC Manager
│ │ ├─→ If System Managed Credential enabled → click ROTATE
│ │ ├─→ If Credential dropdown empty → uncheck System Managed,
│ │ │ click +, create credential, select it
│ │ ├─→ Click VALIDATE CONNECTION → confirm "valid"
│ │ ├─→ Click SAVE
│ │ └─→ Reboot VCF Ops appliance (adapter may not pick up
│ │ new credential without full restart)
│ └─→ If connection refused: check DNS/network/cert trust
│
├─→ If "PKIX path building failed" for NSX:
│ ├─→ If NSX is powered off → expected, ignore
│ └─→ If NSX is running → See flowchart 9.8 below
│
├─→ If "vROPs is not configured with NTP server":
│ └─→ Configure NTP on VCF Ops appliance (cosmetic warning,
│ does not block data collection)
│
└─→ After fix: wait 10 min for 2 collection cycles
└─→ Verify: adapter status changes to "Collecting" (green)
| Item | Path |
|---|---|
| Adapter logs | /storage/log/vcops/log/adapters/<AdapterName>/ |
| Main vcops logs | /storage/log/vcops/log/ |
| Collector GC log | /storage/log/vcops/log/collector-gc-*.log |
| VCF Adapter log | /storage/log/vcops/log/adapters/VcfAdapter/VcfAdapter_254.log |
| VMware Adapter log | /storage/log/vcops/log/adapters/VMwareAdapter/VMwareAdapter_63.log |
| vSAN Adapter log | /storage/log/vcops/log/adapters/VsanStorageAdapter/VsanStorageAdapter_257.log |
Note: VCF Operations 9.x does NOT use `/var/log/vmware/vcops/adapters/` — that path from older Aria Operations versions no longer exists. All adapter logs are under `/storage/log/vcops/log/adapters/`.
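The SDDC Manager token test from the flowchart above can also be run from Python instead of curl. This is a sketch: the hostname and username are the lab values from this handbook, and certificate verification is disabled on the assumption of lab self-signed certs (do not do this in production).

```python
# Sketch: POST /v1/tokens against SDDC Manager and extract the access token.
import json
import ssl
import urllib.request


def token_request(host, username, password):
    """Build the POST /v1/tokens request object."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"https://{host}/v1/tokens", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )


def extract_access_token(response_text):
    """Pull accessToken out of the /v1/tokens response body."""
    return json.loads(response_text)["accessToken"]


def fetch_token(host, username, password):
    ctx = ssl._create_unverified_context()  # lab only: self-signed certs
    req = token_request(host, username, password)
    with urllib.request.urlopen(req, context=ctx, timeout=15) as resp:
        return extract_access_token(resp.read().decode())
```

A returned token confirms SDDC Manager auth works and points the blame at the adapter credential, matching the flowchart's branch logic.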
START: NSX adapter shows Warning, logs show "PKIX path building failed"
│
├─→ Verify NSX is actually reachable from VCF Ops node:
│ curl -sk https://nsx-vip.lab.local/api/v1/node/status | head -5
│ ├─→ "No route to host" → NSX not ready (check load avg below)
│ └─→ Returns JSON → NSX is up, proceed to cert fix
│
├─→ If VIP (.70) unreachable but node (.71) responds:
│ ├─→ NSX cluster VIP not online yet
│ ├─→ Check load: curl -sk -u admin:'<pass>'
│ │ https://<node>:443/api/v1/node/status | grep load_average
│ ├─→ Load > 20 on 6 cores = still booting (normal in nested,
│ │ can take 30-60 min after power-on)
│ └─→ Wait for load < 20, VIP will come online automatically
│
├─→ Import NSX cert into VCF Ops Java truststore:
│ ├─→ openssl s_client -connect nsx-vip.lab.local:443 \
│ │ -showcerts </dev/null 2>/dev/null \
│ │ | openssl x509 -outform PEM > /tmp/nsx-cert.pem
│ ├─→ Find truststore: java -XshowSettings:properties 2>&1
│ │ | grep java.home
│ │ → /usr/java/jre-vmware-17
│ ├─→ keytool -importcert -alias nsx-vip \
│ │ -file /tmp/nsx-cert.pem \
│ │ -keystore /usr/java/jre-vmware-17/lib/security/cacerts \
│ │ -storepass changeit -noprompt
│ └─→ Reboot VCF Ops appliance
│
├─→ Fix NSX credential (two adapters to check):
│ │
│ ├─→ VCF section → nsx-vip.lab.local adapter:
│ │ ├─→ System Managed Credential ROTATE rarely works for NSX
│ │ ├─→ Uncheck System Managed Credential
│ │ ├─→ Click + → create credential (admin / password)
│ │ ├─→ Select credential, VALIDATE CONNECTION, SAVE
│ │ └─→ If VIP still unreachable, wait for NSX load to settle
│ │
│ └─→ NSX section → Aria Admin adapter:
│ ├─→ Points to nsx-manager.lab.local (node FQDN)
│ ├─→ May connect even when VIP is down
│ ├─→ Set credential (admin / password)
│ ├─→ VALIDATE CONNECTION → SAVE
│ └─→ This can start collecting before VIP comes online
│
└─→ Verify: All adapters show "Collecting" (green) in
Administration → Integrations
Key insight: VCF Operations has TWO separate NSX adapters — one under the VCF Cloud Foundation account (uses VIP) and one under the standalone NSX section called "Aria Admin" (uses node FQDN). Both need valid credentials. The Aria Admin adapter can connect via the node FQDN even when the VIP is still offline after a fresh NSX boot.
Java truststore path on VCF Ops 9.x: `/usr/java/jre-vmware-17/lib/security/cacerts` (password: `changeit`). The legacy `/usr/java/jre-vmware/` path does not exist.
Symptoms:

- "Resources [nsx-vip.lab.local] are not available/ready" or "not in ACTIVE state"
- "Unable to acquire resource level lock(s)"
- "[2] account(s) has been disconnected"
- `/v1/nsxt-clusters` shows empty or non-ACTIVE status
- Tasks pile up as IN_PROGRESS (`/v1/tasks?status=IN_PROGRESS`)

Root Cause Chain: A failed credential operation (often due to NSX being temporarily unreachable during a boot storm or maintenance) triggers a cascade:

- The NSX resource is left in an `ACTIVATING` or `ERROR` state in the `platform.nsxt` table
- Stale locks remain in the `platform.lock` table, blocking all new operations
- Tasks stay `IN_PROGRESS` in `platform.task_metadata` (`resolved=false`), piling up

Diagnosis:
# 1. Get auth token from SDDC Manager
TOKEN=$(curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
-H "Content-Type: application/json" \
-d '{"username":"administrator@vsphere.local","password":"<password>"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['accessToken'])")
# 2. Check NSX cluster resource state (look for status field)
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# If status is "ACTIVATING" or "ERROR" instead of "ACTIVE" → this is the problem
# 3. Check for stale resource locks
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# Stale locks from failed operations will block all new operations
# 4. Check for stuck IN_PROGRESS tasks
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; d=json.load(sys.stdin); print(f'Stuck tasks: {len(d.get(\"elements\",[]))}')"
# 5. Verify NSX is actually healthy (from SDDC Manager)
curl -sk -u admin:'<password>' --connect-timeout 10 \
https://nsx-vip.lab.local/api/v1/cluster/status
# overall_status should be "STABLE"
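The diagnosis responses above are easier to eyeball with a couple of parsing helpers. These are pure functions over the JSON bodies the curl commands return (field names per the responses described in this section); pipe the curl output into a script that calls them.

```python
# Sketch: interpret the SDDC Manager API responses from the diagnosis steps.
def non_active_nsx_clusters(payload):
    """IDs of NSX clusters whose status is not ACTIVE (the failure signature)."""
    return [c["id"] for c in payload.get("elements", [])
            if c.get("status") != "ACTIVE"]


def stuck_task_count(payload):
    """Number of IN_PROGRESS tasks returned by /v1/tasks?status=IN_PROGRESS."""
    return len(payload.get("elements", []))
```

A non-empty `non_active_nsx_clusters()` result plus a growing `stuck_task_count()` is the cascade signature that calls for the database repair below.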
Fix — Full Database Repair:
WARNING: Direct database manipulation is unsupported and should only be done in lab environments. Always back up before modifying.
Step 1: Access PostgreSQL on SDDC Manager
SSH to SDDC Manager as vcf, then su - to root. PostgreSQL uses TCP on 127.0.0.1 (not Unix sockets), and the password may not be easily discoverable. Disable the psql pager to prevent --More-- prompts from corrupting interactive shell sessions:
# Back up pg_hba.conf
cp /data/pgdata/pg_hba.conf /data/pgdata/pg_hba.conf.bak
# Temporarily allow passwordless local connections
sed -i 's/scram-sha-256/trust/g' /data/pgdata/pg_hba.conf
# Reload postgres (no restart needed)
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Disable psql pager (CRITICAL for scripted/remote sessions)
export PAGER=cat
export PGPAGER=cat
Step 2: Fix the stuck resource status
The nsxt table status can be ACTIVATING, ERROR, or other non-ACTIVE values:
# Check current NSX resource status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -t -c \"SELECT id, status FROM nsxt;\""
# Fix ANY non-ACTIVE status
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';\""
Step 3: Clear stale resource locks
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT count(*) FROM lock;\""
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM lock;\""
Step 4: Mark stuck tasks as resolved
The task_metadata table in the platform DB tracks task resolution state. Unresolved tasks (resolved=false) from failed operations accumulate and can interfere with new operations:
# Check unresolved task count
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"SELECT resolved, count(*) FROM task_metadata GROUP BY resolved;\""
# Mark all unresolved tasks as resolved
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"UPDATE task_metadata SET resolved = true WHERE resolved = false;\""
# Clear task_lock table if any entries exist
su - postgres -c "PAGER=cat psql -h 127.0.0.1 -d platform -c \"DELETE FROM task_lock;\""
Step 5: Restore pg_hba.conf (CRITICAL — do not skip)
cp /data/pgdata/pg_hba.conf.bak /data/pgdata/pg_hba.conf
su - postgres -c "/usr/pgsql/15/bin/pg_ctl reload -D /data/pgdata"
# Verify it's back to scram-sha-256
grep -c 'scram-sha-256' /data/pgdata/pg_hba.conf
# Should return 4 or more
Step 6: Restart operationsmanager service
systemctl restart operationsmanager
# Wait 2-3 minutes for it to fully start
systemctl is-active operationsmanager
Verification:
# NSX cluster should now show ACTIVE
curl -sk "https://sddc-manager.lab.local/v1/nsxt-clusters" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; [print(f'{c[\"id\"]}: {c[\"status\"]}') for c in json.load(sys.stdin).get('elements',[])]"
# Resource locks should be empty
curl -sk "https://sddc-manager.lab.local/v1/resource-locks" \
-H "Authorization: Bearer $TOKEN"
# IN_PROGRESS tasks should be zero or minimal
curl -sk "https://sddc-manager.lab.local/v1/tasks?status=IN_PROGRESS" \
-H "Authorization: Bearer $TOKEN" | python3 -c \
"import sys,json; print(f'IN_PROGRESS: {len(json.load(sys.stdin).get(\"elements\",[]))}')"
# Credential remediate should now succeed via VCF Operations Fleet Management UI
Credential Cascade Failure Flowchart:
┌──────────────────────────────────────────────┐
│ Credential Update/Rotate/Remediate fails │
│ in SDDC Manager or VCF Operations UI │
└──────────────────┬───────────────────────────┘
│
┌────────▼────────┐
│ Check task error │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
"not in "Unable to "503 Service
ACTIVE state" acquire lock" Unavailable"
│ │ │
▼ ▼ ▼
Fix nsxt Delete from NSX still
table status lock table booting/
(ACTIVATING/ in platform unstable
ERROR→ACTIVE) DB │
│ │ ▼
│ │ Wait for
│ │ NSX load
│ │ to settle
│ │ (< 20)
└──────┬───────┘ │
▼ │
Mark task_metadata │
resolved = true ◄──────┘
│
▼
Clear task_lock
│
▼
Restart
operationsmanager
│
▼
Retry credential
operation
Key insight: Three tables in the `platform` database must be cleaned: (1) `nsxt` — resource status, (2) `lock` — operation locks, (3) `task_metadata` — task resolution tracking. The `operationsmanager` database has separate `task` and `execution` tables (columns: `task.state`, `execution.execution_status` — not `status`). The API won't let you cancel or delete stuck tasks — database repair is required.
psql pager trap: When running psql queries via Paramiko or a remote shell, the default pager (`less`/`more`) captures output and waits for interactive input, corrupting the session. Always set `PAGER=cat` before running psql commands, or pass it inline: `PAGER=cat psql -h 127.0.0.1 -d platform -c "..."`. For Paramiko `invoke_shell()`, also set `height=1000` to prevent terminal-based paging.
PostgreSQL on SDDC Manager: Uses TCP on `127.0.0.1` (not Unix sockets — you'll get "No such file or directory" without `-h 127.0.0.1`). Data directory is `/data/pgdata`. Key databases: `platform` (nsxt, lock, task_metadata tables), `operationsmanager` (task, execution, processing_task tables). The `pg_hba.conf` trust workaround is a last resort — always restore the original immediately after.
vcf account lockout: Failed SSH attempts (including from automated scripts) can lock the `vcf` account. SDDC Manager uses `faillock` (not `pam_tally2`). Unlock from the console as root: `faillock --user vcf --reset`
None of the database repair procedure is documented by Broadcom. The schema was mapped through the following investigation:
Why the API wasn't enough:
- `PATCH /v1/tasks/{id}` with `{"status":"CANCELLED"}` returned `TA_TASK_CAN_NOT_BE_RETRIED` for every stuck task
- `DELETE /v1/tasks/{id}` returned HTTP 500

How the database was explored:
- Listed databases with `\l`: platform, operationsmanager, domainmanager, lcm, sddc_manager_ui, postgres
- Listed tables in the `platform` DB with `\dt`: found nsxt, lock, task_metadata, task_lock (plus vcenter, host, etc.)
- Inspected columns with `SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '<table>'`
- Found that `task_metadata` uses a `resolved` boolean (not a `status` field like you'd expect)
- Found that `operationsmanager.task` uses column `state` (not `status`) and `execution` uses `execution_status` (not `status`)
- Early attempts targeted `task` in the platform DB (wrong — it's `task_metadata`) and used a `status` column on the operationsmanager tables (wrong — it's `state` and `execution_status`)

Why each repair step is needed:
| Step | Table | Action | Why |
|---|---|---|---|
| 2 | `nsxt` | Set status to ACTIVE | The stuck ACTIVATING/ERROR status makes every new credential operation fail at prevalidation — SDDC Manager checks this before even attempting the operation |
| 3 | `lock` | Delete all rows | Stale exclusive locks from the failed operation block all new operations from acquiring their own locks — they'll fail with "Unable to acquire resource level lock(s)" |
| 4 | `task_metadata` | Set resolved=true | Unresolved tasks (resolved=false) accumulate with each UI retry. 47 were found during the initial diagnosis. These can interfere with new task scheduling |
| 4 | `task_lock` | Delete all rows | Links tasks to locks — clearing this ensures no orphaned task-lock relationships remain |
| 5 | `pg_hba.conf` | Restore backup | Leaving trust auth enabled means any local process can access PostgreSQL without a password — security risk |
| 6 | `operationsmanager` | Restart service | The service caches database state in memory. A restart forces it to re-read the cleaned tables and reset its internal state machine |
Automated scripts built from this knowledge:

- `clear_locks.py` — fixes nsxt status + clears lock table (quick fix for simple cases)
- `fix_stuck_tasks.py` — marks task_metadata resolved + clears task_lock (for accumulated stuck tasks)
- `full_remediate_fix.py` — combines NSX health check + all DB fixes + service restart (all-in-one cascade repair)

All scripts automate the pg_hba.conf backup/trust/restore cycle and use `PAGER=cat` to prevent pager traps in remote sessions.
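The pager-safe psql wrapping that those scripts rely on can be sketched as a command builder. This is not the scripts' actual code — just an illustration of the pattern: every statement from Steps 2–4 wrapped in `su - postgres` with `PAGER=cat`, ready to hand to an SSH/Paramiko executor.

```python
# Sketch: build pager-safe psql invocations for the Step 2-4 repair SQL.
# Command *builder* only -- execution (ssh/Paramiko) is left to the caller.
REPAIR_SQL = [
    "UPDATE nsxt SET status = 'ACTIVE' WHERE status != 'ACTIVE';",
    "DELETE FROM lock;",
    "UPDATE task_metadata SET resolved = true WHERE resolved = false;",
    "DELETE FROM task_lock;",
]


def psql_cmd(sql, db="platform"):
    """Wrap one SQL statement in a pager-safe su/psql invocation."""
    escaped = sql.replace('"', '\\"')
    return ('su - postgres -c "PAGER=cat psql -h 127.0.0.1 '
            f'-d {db} -c \\"{escaped}\\""')


def repair_commands():
    return [psql_cmd(sql) for sql in REPAIR_SQL]
```

Each generated command matches the shape used throughout Steps 2–4 above, so the output can be pasted into a root shell on SDDC Manager as-is (after the pg_hba.conf trust step, and before restoring it).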
Phase 7 — Feb 10–11, 2026
This is a chicken-and-egg bootstrap constraint in every VCF deployment:

- The vSAN datastore does not exist until bring-up completes — and bring-up is orchestrated by SDDC Manager itself
- So the installer has to place SDDC Manager on a local datastore first (`esxi01-local` in the lab)

This means SDDC Manager is always initially deployed to local storage and must be manually migrated to shared storage (vSAN) afterward.
SDDC Manager therefore landed on `esxi01-local` with 914GB of thick-provisioned disks (only ~108GB actually used).

Disk analysis:
| Disk | Allocated | Actual Used |
|---|---|---|
| sddc-manager.vmdk | 32GB | 2.6GB |
| sddc-manager_1.vmdk | 16GB | 2.6GB |
| sddc-manager_2.vmdk | 240GB | 3.0GB |
| sddc-manager_3.vmdk | 512GB | 99.5GB |
| sddc-manager_4.vmdk | 26GB | 30MB |
| sddc-manager_5.vmdk | 88GB | 64MB |
| Total | 914GB | ~108GB |
The vCenter migration wizard cannot thin-provision to vSAN. The workaround is to clone each disk individually as thin using vmkfstools directly on the ESXi host:
# SSH to esxi01 as root
# Clone each disk from local to vSAN as thin-provisioned
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_1.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_1.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_2.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_2.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_4.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_4.vmdk -d thin
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_5.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_5.vmdk -d thin
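The six per-disk clones above follow one pattern, so they can be generated rather than typed. A small sketch (paths are the lab datastores from this section; adjust for your environment):

```python
# Sketch: generate the per-disk vmkfstools thin-clone commands.
# Source/destination paths match the lab datastores in this section.
SRC = "/vmfs/volumes/esxi01-local/sddc-manager"
DST = "/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager"


def clone_commands(vm="sddc-manager", disk_count=6):
    """One vmkfstools -i ... -d thin command per VMDK (base + _1.._N)."""
    disks = [f"{vm}.vmdk"] + [f"{vm}_{i}.vmdk" for i in range(1, disk_count)]
    return [f"vmkfstools -i {SRC}/{d} {DST}/{d} -d thin" for d in disks]
```

Print the list and paste it into the ESXi SSH session, or run one command at a time so a nested-storage timeout (like the 512GB disk failure below) only affects a single clone.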
After cloning:

- Browse the vSAN datastore → `sddc-manager` folder → right-click the `.vmx` → Register VM
- Power on and verify services (`systemctl status vcf-services`)

The 512GB disk clone failed partway through (ESXi connection timeout on nested storage). Fix:
# Delete the partial clone
vmkfstools -U /vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk
# Retry the clone
vmkfstools -i /vmfs/volumes/esxi01-local/sddc-manager/sddc-manager_3.vmdk \
/vmfs/volumes/vcenter-cl01-ds-vsan01/sddc-manager/sddc-manager_3.vmdk -d thin
Key lesson: vCenter's migration wizard cannot thin-provision to vSAN. Always use `vmkfstools -i <src> <dst> -d thin` per disk. This is also the only way to reclaim wasted space from thick-provisioned VCF appliances — SDDC Manager went from 914GB allocated to ~108GB actual on vSAN.
Document Information
| Field | Value |
|---|---|
| Document Title | VCF 9.0.1 Nested Deployment Troubleshooting Handbook |
| Version | 1.5 |
| Last Updated | February 2026 |
| Environment | VMware Workstation 17.x Nested Lab |
| VCF Version | 9.0.1 |
This handbook is intended for lab and educational purposes. Always consult official VMware documentation for production deployments.
(c) 2026 Virtual Control LLC. All rights reserved.