Analytics cluster health validation covering node status, adapters, collection, certificates, Cassandra, alert engine, and remote collectors.
Proprietary & Confidential

VCF Operations Health Check Handbook

Comprehensive Health Verification for VCF Operations in VCF 9

Author: Virtual Control LLC
Date: March 2026
Version: 1.0
Classification: Internal Use
Platform: VMware Cloud Foundation 9.0 / VCF Operations 8.18+

1. Overview & Purpose

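This handbook provides a complete health check procedure for VCF Operations (formerly VMware Aria Operations / vRealize Operations) deployed within a VCF 9.0 environment. The checks cover analytics cluster and node status, adapters, data collection, certificates, Cassandra, the alert engine, and remote collectors.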

When to Run

| Trigger | Priority |
|---|---|
| After deployment / node addition | Critical |
| Before/after VCF upgrades | Critical |
| Weekly routine health check | Recommended |
| When dashboards show stale data | Troubleshooting |
| When alerts are not firing | Troubleshooting |

Environment Variables:
$OPS = VCF Operations FQDN (e.g., vcf-ops.lab.local)
$OPS_USER = admin
$OPS_PASS = VCF Operations admin password
$OPS_TOKEN = Suite API auth token

2. Prerequisites

Required Access

| Access Type | Target | Credentials |
|---|---|---|
| HTTPS (443) | VCF Operations VIP | admin / password |
| SSH (22) | Each VCF Operations node | root / password |
| Suite API (443) | VCF Operations VIP | admin / auth token |
| CASA Admin | VCF Operations master node | root / admin |

Token Acquisition (Suite API)

export OPS="vcf-ops.lab.local"
export OPS_USER="admin"
export OPS_PASS="YourPassword123!"

# Acquire auth token
OPS_TOKEN=$(curl -sk -X POST \
  "https://$OPS/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -d "{
    \"username\":\"$OPS_USER\",
    \"password\":\"$OPS_PASS\",
    \"authSource\":\"local\"
  }" | jq -r '.token')

echo "Token: ${OPS_TOKEN:0:20}..."

# Convenience function
ops_api() {
  curl -sk -H "Authorization: vRealizeOpsToken $OPS_TOKEN" \
    -H "Content-Type: application/json" \
    "https://$OPS/suite-api$1" 2>/dev/null
}
Token Expiry: Suite API tokens expire after 6 hours by default. Re-acquire if you receive 401 responses.
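
To avoid discovering an expired token through a 401 in the middle of a run, the remaining lifetime can be computed up front. The sketch below assumes the acquire response also carries a `validity` field holding the expiry as epoch milliseconds; check your own response before relying on that. `token_minutes_left` is a hypothetical helper name, not part of the product.

```shell
# Minutes until a token expires.
# $1 = expiry in epoch milliseconds (ASSUMPTION: taken from a "validity" field
#      in the token acquire response)
# $2 = current time in epoch seconds (optional, defaults to now)
token_minutes_left() {
  now_s=${2:-$(date +%s)}
  echo $(( ($1 / 1000 - now_s) / 60 ))
}

# Example with made-up numbers: the expiry is one hour after "now"
token_minutes_left 1767225600000 1767222000
```

A negative result means the token has already expired and should be re-acquired before continuing.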

3. Quick Reference — All Checks Summary

| # | Check | Method | PASS | WARN | FAIL |
|---|---|---|---|---|---|
| 4.1 | Cluster State | CASA/SSH | RUNNING / INITIALIZED | STARTING | OFFLINE / ERROR |
| 4.2 | Slice Status | CASA | All ONLINE | Any STARTING | Any OFFLINE |
| 5.1 | Node Status | Suite API | All nodes ONLINE | Any STARTING | Any OFFLINE |
| 5.2 | Node CPU | SSH | < 70% | 70-85% | > 85% |
| 5.2 | Node Memory | SSH | < 75% | 75-90% | > 90% |
| 6.1 | Adapters | Suite API | All COLLECTING | Any NOT_COLLECTING (non-critical) | vCenter or NSX adapter not collecting |
| 6.2 | Collection | Suite API | Last collection < 10 min | 10-30 min gap | > 30 min gap |
| 8 | Certificates | SSH/openssl | > 30 days to expiry | 7-30 days | < 7 days / expired |
| 9 | License | Suite API | Valid, usage < 80% of capacity | > 80% capacity | Expired or over capacity |
| 10 | Disk | SSH | < 70% all partitions | 70-85% | > 85% |
| 11 | Active Alerts | Suite API | 0 critical | Warning alerts | Critical alerts |
| 12 | Collectors | Suite API | All ONLINE | Any UNKNOWN | Any OFFLINE |
| 15 | Suite API | curl | Response < 2s | 2-5s | > 5s or error |
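
Several rows in the summary share the same shape: a percentage with a WARN band between two limits. A small helper can apply those limits consistently in scripts. This is a minimal sketch; `grade` is a hypothetical name, and treating both band edges as WARN is an assumption, since the table does not say which side the boundaries belong to.

```shell
# Classify an integer measurement against a WARN band.
# $1 = measured value, $2 = lower edge of the WARN band, $3 = upper edge
# Values below $2 are PASS, values above $3 are FAIL, everything between is WARN.
grade() {
  if [ "$1" -lt "$2" ]; then
    echo PASS
  elif [ "$1" -gt "$3" ]; then
    echo FAIL
  else
    echo WARN
  fi
}

grade 65 70 85   # CPU at 65%
grade 92 75 90   # memory at 92%
```

With the CPU limits (70/85) the first call prints PASS; with the memory limits (75/90) the second prints FAIL.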

4. Analytics Cluster Status

4.1 Cluster State via CASA

What: Verify the VCF Operations analytics cluster is fully initialized and running.

Why: A cluster not in RUNNING state means data collection, alerting, and dashboards may be stale or non-functional.

SSH Method (on master node)

ssh root@$OPS

# Check cluster status via CASA admin
$VMWARE_PYTHON_PATH/bin/python \
  /usr/lib/vmware-vcops/tools/opscli/admin-cli.py \
  getClusterStatus

Expected Output (Healthy):

Cluster Status: RUNNING
Cluster Uptime: 15 days 8 hours
Master Node: vcf-ops-01.lab.local (ONLINE)
Data Node: vcf-ops-02.lab.local (ONLINE)
Data Node: vcf-ops-03.lab.local (ONLINE)
Remote Collector: rc-01.lab.local (ONLINE)

Alternative — CASA API

curl -sk "https://$OPS/casa/cluster/status" \
  -u "admin:$OPS_PASS" | jq .

Expected Output:

{
  "cluster_status": "RUNNING",
  "slice_status": "ONLINE",
  "node_statuses": [
    {"node_name": "vcf-ops-01", "status": "ONLINE", "role": "MASTER"},
    {"node_name": "vcf-ops-02", "status": "ONLINE", "role": "DATA"},
    {"node_name": "vcf-ops-03", "status": "ONLINE", "role": "DATA"}
  ]
}

Pass / Warn / Fail

| Result | Criteria | Indicator |
|---|---|---|
| PASS | Cluster RUNNING, all nodes ONLINE | Fully operational |
| WARN | Cluster STARTING or any node STARTING | Coming online |
| FAIL | Cluster OFFLINE or ERROR | Data collection stopped |

Remediation:
1. Bring cluster online: Use CASA admin UI (https://<master>/casa) → Cluster Operations → Start
2. Via CLI: $VMWARE_PYTHON_PATH/bin/python /usr/lib/vmware-vcops/tools/opscli/admin-cli.py bringClusterOnline
3. Check cluster logs: /storage/log/vcops/casa/casa.log
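
The pass/warn/fail criteria for cluster state can be applied mechanically once the cluster and node states have been extracted. The sketch below is one way to do that; `grade_cluster` is a hypothetical helper, and treating an OFFLINE node as FAIL is an assumption borrowed from the node status row of the summary table, since the criteria here only name the cluster-level states.

```shell
# Map a cluster state plus per-node states to PASS / WARN / FAIL.
# $1 = cluster state, remaining arguments = one state per node
grade_cluster() {
  cluster=$1
  shift
  case "$cluster" in
    OFFLINE|ERROR) echo FAIL; return ;;
    RUNNING|INITIALIZED) result=PASS ;;
    *) result=WARN ;;                      # STARTING or any unrecognized state
  esac
  for node in "$@"; do
    case "$node" in
      ONLINE) ;;
      OFFLINE) echo FAIL; return ;;        # ASSUMPTION: an offline node fails the check
      *) result=WARN ;;
    esac
  done
  echo "$result"
}

grade_cluster RUNNING ONLINE ONLINE ONLINE
grade_cluster RUNNING ONLINE STARTING ONLINE
```

If the CASA response has been saved to a file, the arguments can be pulled with jq using the field names from the expected output, for example `grade_cluster "$(jq -r .cluster_status status.json)" $(jq -r '.node_statuses[].status' status.json)`.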

4.2 Slice Status

What: Verify all analytics slices are online.

# Via CASA API
curl -sk "https://$OPS/casa/slice/status" \
  -u "admin:$OPS_PASS" | jq .

Expected Output:

{
  "slices": [
    {"slice_id": 0, "status": "ONLINE", "node": "vcf-ops-01"},
    {"slice_id": 1, "status": "ONLINE", "node": "vcf-ops-02"},
    {"slice_id": 2, "status": "ONLINE", "node": "vcf-ops-03"}
  ]
}

4.3 Node Roles

| Role | Description | Count |
|---|---|---|
| MASTER | Primary analytics node, cluster coordinator | 1 |
| MASTER_REPLICA | Failover for master | 1 (if HA) |
| DATA | Analytics processing and storage | 1+ |
| REMOTE_COLLECTOR | Remote data collection proxy | 0+ |

5. Node Health

5.1 Individual Node Status

# List all nodes via Suite API
ops_api "/api/deployment/node" | jq '.nodeList[] | {
  name: .name,
  ip: .ip,
  role: .role,
  status: .status,
  version: .version
}'

5.2 Resource Utilization per Node

ssh root@$OPS

# CPU
top -b -n 1 | head -5

# Memory
free -m

# Disk (critical partitions)
df -h / /storage /storage/db /storage/log /storage/core

Critical Partitions

| Partition | Purpose | PASS | WARN | FAIL |
|---|---|---|---|---|
| /storage | Analytics data | < 70% | 70-85% | > 85% |
| /storage/db | Cassandra / xDB | < 70% | 70-85% | > 85% |
| /storage/log | Log files | < 70% | 70-85% | > 85% |
| / (root) | OS | < 70% | 70-85% | > 85% |
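
Reading `df` output by eye across several nodes is error-prone, so the partition thresholds can be applied with a short awk filter. This is a sketch that assumes POSIX-format output (`df -P`), where the use percentage is the fifth column and the mount point the sixth; `grade_df` is a hypothetical helper name.

```shell
# Grade each filesystem in `df -P` output read from stdin:
# below 70% is PASS, above 85% is FAIL, anything between is WARN.
grade_df() {
  awk 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    use += 0
    if (use < 70) result = "PASS"
    else if (use > 85) result = "FAIL"
    else result = "WARN"
    printf "%s %s %d%%\n", result, $6, use
  }'
}

# Sample input with made-up numbers; on a node, run: df -P / /storage /storage/db /storage/log | grade_df
grade_df <<'EOF'
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/sda4         10190100   4076040   5573108      43% /
/dev/sdb1        257897904 198581386  46199254      82% /storage/db
EOF
```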

5.3 Heartbeat Verification

# Check last heartbeat per node
ops_api "/api/deployment/node" | jq '.nodeList[] | {
  name: .name,
  lastHeartbeat: .lastHeartbeat,
  heartbeatStatus: .heartbeatStatus
}'

6. Adapter Health

6.1 Adapter Instances

What: Verify all configured adapter instances are collecting data.

# List all adapters
ops_api "/api/adapters" | jq '.adapterInstancesInfoDto[] | {
  id: .id,
  adapterKind: .resourceKey.adapterKindKey,
  name: .resourceKey.name,
  collectorId: .collectorId,
  collectionState: .collectionState,
  collectionStatus: .collectionStatus
}'

Expected Output:

{
  "id": "abc123",
  "adapterKind": "VMWARE",
  "name": "vCenter - vcenter.lab.local",
  "collectorId": "1",
  "collectionState": "COLLECTING",
  "collectionStatus": "DATA_RECEIVING"
}

Key Adapters to Verify

| Adapter Kind | Name Pattern | Critical |
|---|---|---|
| VMWARE | vCenter adapter | Yes |
| NSXTAdapter | NSX-T adapter | Yes |
| VsanAdapter | vSAN adapter | Yes |
| SDDCHealthAdapter | SDDC Health | Yes |
| PythonRemediationVcenterAdapter | Automation | No |
| LogInsightAdapter | Log Insight integration | No |

| Result | Criteria | Indicator |
|---|---|---|
| PASS | All critical adapters COLLECTING | Data flowing |
| WARN | Non-critical adapter not collecting | Limited functionality |
| FAIL | vCenter or NSX adapter not collecting | Stale data / no monitoring |
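
The adapter criteria can be reduced to a single verdict by feeding adapter kind and collection state pairs through awk. The sketch below treats every adapter kind marked Critical in the table as a FAIL when it is not collecting; that is an assumption, because the result table names only vCenter and NSX. `grade_adapters` is a hypothetical helper name.

```shell
# Read "adapterKind<TAB>collectionState" lines from stdin and print one verdict.
grade_adapters() {
  awk -F '\t' '
    BEGIN {
      # ASSUMPTION: all kinds marked Critical in the table count as FAIL
      critical["VMWARE"] = 1
      critical["NSXTAdapter"] = 1
      critical["VsanAdapter"] = 1
      critical["SDDCHealthAdapter"] = 1
      result = "PASS"
    }
    $2 != "COLLECTING" {
      if ($1 in critical) result = "FAIL"
      else if (result == "PASS") result = "WARN"
    }
    END { print result }'
}

# Sample input: one non-critical adapter is not collecting
printf 'VMWARE\tCOLLECTING\nLogInsightAdapter\tNOT_COLLECTING\n' | grade_adapters
```

Against a live system the input could come from the adapter listing, for example `ops_api "/api/adapters" | jq -r '.adapterInstancesInfoDto[] | [.resourceKey.adapterKindKey, .collectionState] | @tsv' | grade_adapters`.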

6.2 Collection Status

# Check last collection time for a specific adapter
ADAPTER_ID="<adapter-id>"
ops_api "/api/adapters/$ADAPTER_ID" | jq '{
  name: .resourceKey.name,
  collectionState: .collectionState,
  lastCollected: .lastCollected,
  numberOfMetricsCollected: .numberOfMetricsCollected,
  numberOfResourcesCollected: .numberOfResourcesCollected
}'

6.3 Credential Validation

# List credentials
ops_api "/api/credentials" | jq '.credentialInstances[] | {
  id: .id,
  name: .name,
  adapterKind: .adapterKindKey
}'

# Start collection on the adapter instance to exercise its credential
# (this call starts monitoring; it is not a dedicated credential test)
curl -sk -X POST \
  -H "Authorization: vRealizeOpsToken $OPS_TOKEN" \
  -H "Content-Type: application/json" \
  "https://$OPS/suite-api/api/adapters/$ADAPTER_ID/monitoringstate/start"

Remediation for adapter not collecting:
1. Verify credential: Update password if changed on target
2. Test connectivity: curl -sk https://<target>:443 from OPS node
3. Restart adapter: Suite API → POST /api/adapters/<id>/monitoringstate/stop then /start
4. Check adapter logs: /storage/log/vcops/adapterkind/<adapter-kind>/

7. Collection Status

What: Verify data collection is current and no gaps exist.

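# List a sample of collected VMs and the total object count
# (paged responses report the count under pageInfo; the fallback covers a top-level field)
ops_api "/api/resources?adapterKind=VMWARE&resourceKind=VirtualMachine&pageSize=5" | jq '{
  totalCount: (.pageInfo.totalCount // .totalCount),
  resources: [.resourceList[].resourceKey.name]
}'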

Check for Collection Gaps

# Recent collection cycles on the node
ssh root@$OPS
grep "Collection completed" /storage/log/vcops/analytics/analytics.log | tail -10

| Result | Criteria | Indicator |
|---|---|---|
| PASS | Last collection < 10 minutes ago | Current data |
| WARN | Last collection 10-30 minutes ago | Slight delay |
| FAIL | Last collection > 30 minutes ago | Stale data |
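
The age thresholds are easy to apply once the last collection time is available as a number. The sketch below assumes the `lastCollected` value is an epoch timestamp in milliseconds, which should be confirmed against a real response; `grade_collection_age` is a hypothetical helper name.

```shell
# Grade how stale the last collection is.
# $1 = last collection time in epoch milliseconds (ASSUMPTION about the lastCollected format)
# $2 = current time in epoch seconds (optional, defaults to now)
grade_collection_age() {
  now_s=${2:-$(date +%s)}
  age_min=$(( (now_s - $1 / 1000) / 60 ))
  if [ "$age_min" -lt 10 ]; then
    echo "PASS ${age_min}m"
  elif [ "$age_min" -gt 30 ]; then
    echo "FAIL ${age_min}m"
  else
    echo "WARN ${age_min}m"
  fi
}

# Example with made-up numbers: the last collection was 20 minutes before "now"
grade_collection_age 998800000 1000000
```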

8. Certificate Health

# Check web certificate
echo | openssl s_client -connect "$OPS:443" -servername "$OPS" 2>/dev/null | \
  openssl x509 -noout -dates -subject

# Check all certificates on the node
ssh root@$OPS
find /storage/vcops/user/conf/ssl -name "*.pem" -exec \
  sh -c 'echo "=== $1 ===" && openssl x509 -in "$1" -noout -enddate' _ {} \;

| Result | Criteria | Indicator |
|---|---|---|
| PASS | All certificates > 30 days from expiry | Healthy |
| WARN | Any certificate 7-30 days from expiry | Plan renewal |
| FAIL | Any certificate < 7 days or expired | Immediate action |
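
The `-enddate` output still has to be converted into days remaining before the thresholds can be applied. The sketch below splits that into two hypothetical helpers: `cert_days_left` extracts the figure from a PEM file and assumes GNU `date` (for the `-d` option), while `grade_cert_days` applies the thresholds and has no external dependencies.

```shell
# Days until the certificate in a PEM file expires (negative if already expired).
# ASSUMPTION: GNU date is available to parse the openssl notAfter string.
cert_days_left() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Apply the handbook thresholds to a number of days.
grade_cert_days() {
  if [ "$1" -gt 30 ]; then
    echo PASS
  elif [ "$1" -lt 7 ]; then
    echo FAIL
  else
    echo WARN
  fi
}

grade_cert_days 45
grade_cert_days 12
```

On a node the two combine as `grade_cert_days "$(cert_days_left /path/to/cert.pem)"`, where the path is whichever file the `find` command above reported.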

9. Capacity & Licensing

# Check license status
ops_api "/api/deployment/licenses" | jq '.licenseDetails[] | {
  licenseKey: .licenseKey[0:8],
  edition: .edition,
  capacity: .capacity,
  usage: .usage,
  expirationDate: .expirationDate
}'

Expected Output:

{
  "licenseKey": "XXXXX-XX",
  "edition": "Enterprise",
  "capacity": 500,
  "usage": 320,
  "expirationDate": "2027-03-01"
}

| Result | Criteria | Indicator |
|---|---|---|
| PASS | License valid, usage < 80% of capacity | Healthy |
| WARN | Usage > 80% capacity or < 60 days to expiry | Plan expansion |
| FAIL | License expired or usage > capacity | Functionality limited |
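
The license criteria combine a utilization ratio with an expiry window, which is easy to get wrong by hand. A minimal sketch follows; `grade_license` is a hypothetical helper, the days-to-expiry figure is assumed to be computed separately from `expirationDate`, and usage of exactly 80% is treated as WARN because the table leaves that boundary open.

```shell
# Grade license health.
# $1 = licensed capacity, $2 = current usage, $3 = days until expiry (negative = expired)
# Capacity must be greater than zero.
grade_license() {
  capacity=$1
  usage=$2
  days_left=$3
  pct=$(( usage * 100 / capacity ))
  if [ "$days_left" -lt 0 ] || [ "$usage" -gt "$capacity" ]; then
    echo "FAIL ${pct}%"
  elif [ "$pct" -ge 80 ] || [ "$days_left" -lt 60 ]; then
    echo "WARN ${pct}%"
  else
    echo "PASS ${pct}%"
  fi
}

# Capacity and usage from the expected output above; 365 days to expiry is a made-up figure
grade_license 500 320 365
```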

10. Disk & Database Health

Disk Usage

ssh root@$OPS
df -h /storage /storage/db /storage/log

Cassandra / xDB Health

# Check Cassandra status
ssh root@$OPS
/opt/vmware/vcops/cassandra/apache-cassandra/bin/nodetool status

Expected Output:

Datacenter: vrops
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID
UN  192.168.1.77    15.2 GiB   256     100.0%  abc123...
UN  192.168.1.78    14.8 GiB   256     100.0%  def456...
UN  192.168.1.79    15.0 GiB   256     100.0%  ghi789...

| Status | Meaning |
|---|---|
| UN | Up, Normal (healthy) |
| DN | Down, Normal (node offline) |
| UL | Up, Leaving (decommissioning) |
| UJ | Up, Joining (bootstrapping) |
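
On larger clusters it helps to summarize the `nodetool status` output rather than scan it. The sketch below counts node rows by their two-letter status code and lists any node that is not UN; it assumes the row layout shown in the expected output (status code first, address second). `cassandra_summary` is a hypothetical helper name.

```shell
# Summarize `nodetool status` output read from stdin.
cassandra_summary() {
  awk '
    # Node rows start with a two-letter code: U/D followed by N/L/J/M
    /^[UD][NLJM][ \t]/ {
      total++
      if ($1 == "UN") ok++
      else bad = bad " " $2 "(" $1 ")"
    }
    END {
      printf "nodes=%d up_normal=%d other:%s\n", total, ok, (bad == "" ? " none" : bad)
    }'
}

# Sample input adapted from the expected output, with two nodes made unhealthy
cassandra_summary <<'EOF'
Datacenter: vrops
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID
UN  192.168.1.77    15.2 GiB   256     100.0%  abc123...
DN  192.168.1.78    14.8 GiB   256     100.0%  def456...
UJ  192.168.1.79    15.0 GiB   256     100.0%  ghi789...
EOF
```

On a node, pipe the real command into it: `/opt/vmware/vcops/cassandra/apache-cassandra/bin/nodetool status | cassandra_summary`.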

Data Retention

# Check retention settings
ops_api "/api/deployment/retention" | jq .
Remediation for disk full:
1. Reduce retention: Lower data retention period via Administration → Global Settings
2. Clean old logs: find /storage/log -name "*.gz" -mtime +14 -delete
3. Cassandra compaction: /opt/vmware/vcops/cassandra/apache-cassandra/bin/nodetool compact
4. Expand disk: Power off node → expand VMDK → extend partition

11. Alert Engine Health

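# Count active alerts by criticality
# (paged responses report the count under pageInfo; the fallback covers a top-level field)
ops_api "/api/alerts?status=ACTIVE&criticality=CRITICAL" | jq '.pageInfo.totalCount // .totalCount'
ops_api "/api/alerts?status=ACTIVE&criticality=IMMEDIATE" | jq '.pageInfo.totalCount // .totalCount'
ops_api "/api/alerts?status=ACTIVE&criticality=WARNING" | jq '.pageInfo.totalCount // .totalCount'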

Check Alert Plugins (Notifications)

ops_api "/api/alertplugins" | jq '.notificationPluginInstances[] | {
  id: .id,
  name: .name,
  pluginType: .pluginTypeId,
  enabled: .enabled
}'

Test SMTP Notification

# Verify SMTP relay
ssh root@$OPS
echo "Test" | mail -s "VCF Ops Health Check Test" admin@lab.local

| Result | Criteria | Indicator |
|---|---|---|
| PASS | 0 critical alerts, notifications working | Healthy |
| WARN | Warning alerts present | Review and tune |
| FAIL | Critical alerts or notifications broken | Immediate review |

12. Remote Collectors

# List remote collectors
ops_api "/api/collectors" | jq '.collector[] | {
  id: .id,
  name: .name,
  ip: .ip,
  status: .state,
  version: .version,
  usingVRealize: .usingVRealize
}'

Expected Output:

{
  "id": "1",
  "name": "vcf-ops-rc-01",
  "ip": "192.168.1.80",
  "status": "ONLINE",
  "version": "8.18.0.12345678"
}

| Result | Criteria | Indicator |
|---|---|---|
| PASS | All collectors ONLINE | Healthy |
| WARN | Any collector UNKNOWN | Communication issue |
| FAIL | Any collector OFFLINE | Data collection impacted |

13. Management Packs

# List installed management packs (solutions)
ops_api "/api/solutions" | jq '.solution[] | {
  id: .id,
  name: .name,
  version: .version,
  adapterKind: .adapterKindKeys
}'

14. Integration Health

vCenter Adapter

# Check vCenter adapter specifically
ops_api "/api/adapters?adapterKindKey=VMWARE" | jq '.adapterInstancesInfoDto[] | {
  name: .resourceKey.name,
  collectionState: .collectionState,
  lastCollected: .lastCollected
}'

NSX Adapter

ops_api "/api/adapters?adapterKindKey=NSXTAdapter" | jq '.adapterInstancesInfoDto[] | {
  name: .resourceKey.name,
  collectionState: .collectionState
}'

vSAN Adapter

ops_api "/api/adapters?adapterKindKey=VsanAdapter" | jq '.adapterInstancesInfoDto[] | {
  name: .resourceKey.name,
  collectionState: .collectionState
}'

SDDC Health Adapter

ops_api "/api/adapters?adapterKindKey=SDDCHealthAdapter" | jq '.adapterInstancesInfoDto[] | {
  name: .resourceKey.name,
  collectionState: .collectionState
}'

15. API Health (Suite API)

Token Acquisition Test

time curl -sk -X POST \
  "https://$OPS/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -d "{\"username\":\"$OPS_USER\",\"password\":\"$OPS_PASS\",\"authSource\":\"local\"}" \
  | jq -r '.token' > /dev/null

| Result | Criteria | Indicator |
|---|---|---|
| PASS | Token acquired in < 2 seconds | API responsive |
| WARN | 2-5 seconds | API slow |
| FAIL | > 5 seconds or failed | API issue |

Endpoint Responsiveness

ENDPOINTS="/api/deployment/node /api/adapters /api/resources?pageSize=1 /api/alerts?pageSize=1"
for EP in $ENDPOINTS; do
  START=$(date +%s%N)
  HTTP=$(curl -sk -o /dev/null -w "%{http_code}" \
    -H "Authorization: vRealizeOpsToken $OPS_TOKEN" \
    "https://$OPS/suite-api$EP")
  END=$(date +%s%N)
  MS=$(( (END - START) / 1000000 ))
  echo "$EP: HTTP $HTTP (${MS}ms)"
done
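
The loop above prints the status code and latency but leaves the verdict to the reader. A small helper can apply the 2 second and 5 second limits from the token acquisition test to each endpoint; reusing those limits for every endpoint is an assumption, as is treating any non-2xx status as a failure. `grade_api` is a hypothetical name.

```shell
# Grade one API call.
# $1 = HTTP status code, $2 = elapsed time in milliseconds
grade_api() {
  case "$1" in
    2??) ;;                         # any 2xx status counts as success
    *) echo FAIL; return ;;
  esac
  if [ "$2" -lt 2000 ]; then
    echo PASS
  elif [ "$2" -gt 5000 ]; then
    echo FAIL
  else
    echo WARN
  fi
}

grade_api 200 850
grade_api 401 120
```

Inside the loop, the final line could become `echo "$EP: HTTP $HTTP (${MS}ms) $(grade_api "$HTTP" "$MS")"`.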

16. NTP & DNS

ssh root@$OPS
# NTP
timedatectl status
chronyc tracking

# DNS
nslookup vcenter.lab.local
nslookup nsx-vip.lab.local
cat /etc/resolv.conf

17. Backup Configuration

# Check via CASA
curl -sk "https://$OPS/casa/deployment/backup/schedule" \
  -u "admin:$OPS_PASS" | jq .

| Result | Criteria | Indicator |
|---|---|---|
| PASS | Backup configured, recent success | Protected |
| WARN | > 24h since last backup | Check schedule |
| FAIL | No backup configured | Data at risk |

18. Resource Utilization

ssh root@$OPS
# CPU and Load
uptime
top -b -n 1 | head -5

# Memory
free -m

# Disk
df -h

# Analytics process resident memory (RSS). This approximates the JVM footprint;
# it is not the Java heap usage itself.
ps aux | grep '[a]nalytics' | awk '{print $6/1024 " MB"}'

| Resource | PASS | WARN | FAIL |
|---|---|---|---|
| CPU | < 70% | 70-85% | > 85% |
| Memory | < 75% | 75-90% | > 90% |
| Disk (any partition) | < 70% | 70-85% | > 85% |
| Java Heap | < 80% allocated | 80-90% | > 90% (OOM risk) |
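
The memory row needs a percentage, which `free -m` does not print directly. The sketch below derives it from the `Mem:` line as total minus available, so that page cache is not counted as used memory. It assumes a `free` version that prints an `available` column as the seventh field; `mem_used_pct` is a hypothetical helper name.

```shell
# Print memory utilization as a whole percentage from `free -m` output on stdin.
# ASSUMPTION: column 2 is total and column 7 is available.
mem_used_pct() {
  awk '$1 == "Mem:" { printf "%d\n", ($2 - $7) * 100 / $2 }'
}

# Sample input with made-up numbers; on a node, run: free -m | mem_used_pct
mem_used_pct <<'EOF'
               total        used        free      shared  buff/cache   available
Mem:           32000       20000        2000         500       10000        8000
Swap:           8191           0        8191
EOF
```

The resulting figure is then compared against the 75% and 90% memory limits in the table.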

19. Port Reference Table

Inbound Ports

| Source | Port | Protocol | Purpose |
|---|---|---|---|
| Admin Browser | 443 | TCP | Web UI / Suite API |
| Admin | 22 | TCP | SSH |
| Admin | 443 | TCP | CASA admin interface |
| Remote Collector | 443 | TCP | Collector → cluster |
| vCenter | 443 | TCP | Webhook notifications |

Outbound Ports

| Destination | Port | Protocol | Purpose |
|---|---|---|---|
| vCenter | 443 | TCP | Data collection (vSphere API) |
| NSX Manager | 443 | TCP | NSX data collection |
| ESXi Hosts | 443 | TCP | Host metrics |
| SDDC Manager | 443 | TCP | SDDC Health data |
| VCF Ops Logs | 443/9543 | TCP | Log integration |
| SMTP Server | 25/587 | TCP | Email notifications |
| DNS Server | 53 | TCP/UDP | Name resolution |
| NTP Server | 123 | UDP | Time synchronization |

Inter-Node Ports

| Port | Protocol | Purpose |
|---|---|---|
| 443 | TCP | HTTPS / Suite API |
| 3091 | TCP | Cluster communication |
| 3092 | TCP | Cluster communication |
| 7000 | TCP | Cassandra inter-node |
| 7001 | TCP | Cassandra SSL inter-node |
| 9042 | TCP | Cassandra native transport |
| 9160 | TCP | Cassandra Thrift |

20. Common Issues & Remediation

20.1 Cluster Offline

| Symptom | Likely Cause | Resolution |
|---|---|---|
| CASA shows cluster OFFLINE | Node crash or network partition | Bring cluster online via CASA admin |
| Cluster won't start | Disk full on master node | Free disk space, then start cluster |
| Split-brain between nodes | Network connectivity loss | Restore network, restart cluster |

Bring cluster online:
1. CASA UI: https://<master>/casa → Cluster → Bring Online
2. CLI: $VMWARE_PYTHON_PATH/bin/python /usr/lib/vmware-vcops/tools/opscli/admin-cli.py bringClusterOnline
3. Force start (last resort): admin-cli.py forceClusterOnline

20.2 Slice Degraded

| Symptom | Likely Cause | Resolution |
|---|---|---|
| Slice OFFLINE on one node | Node resource exhaustion | Check disk/memory, restart node slice |
| Multiple slices offline | Cluster issue | Restart entire cluster |

20.3 Adapter Failures

| Symptom | Likely Cause | Resolution |
|---|---|---|
| NOT_COLLECTING | Credential change | Update credential in VCF Ops |
| COLLECTING but stale data | Target unreachable | Check network connectivity |
| Adapter crash | Memory issue | Increase adapter memory, restart |

20.4 Disk Full

# Quick disk cleanup
ssh root@$OPS
# 1. Clean old logs
find /storage/log -name "*.gz" -mtime +14 -delete
# 2. Check large files
du -sh /storage/* | sort -rh | head -10
# 3. Compact Cassandra
/opt/vmware/vcops/cassandra/apache-cassandra/bin/nodetool compact

20.5 Collection Gaps

| Symptom | Likely Cause | Resolution |
|---|---|---|
| Dashboards show gaps | Collection cycle missed | Restart adapter |
| Historical data missing | Retention policy deleted it | Adjust retention |
| New objects not appearing | Discovery cycle pending | Wait for next cycle or force discovery |

20.6 Certificate Expiry

| Impact | Resolution |
|---|---|
| Suite API returns TLS errors | Replace certificate via CASA admin |
| Remote collectors disconnect | Replace collector certificate, re-register |
| Browser security warnings | Install custom CA certificate |

21. CLI Quick Reference Card

CASA Admin CLI

| Command | Purpose |
|---|---|
| admin-cli.py getClusterStatus | Cluster status |
| admin-cli.py bringClusterOnline | Start cluster |
| admin-cli.py takeClusterOffline | Stop cluster |
| admin-cli.py forceClusterOnline | Force start cluster |
| admin-cli.py getNodeStatus | Node status |
| admin-cli.py getSliceStatus | Slice status |

CLI Path: $VMWARE_PYTHON_PATH/bin/python /usr/lib/vmware-vcops/tools/opscli/admin-cli.py <command>

System Commands

| Command | Purpose |
|---|---|
| df -h /storage /storage/db /storage/log | Disk usage |
| free -m | Memory usage |
| top -b -n 1 \| head -5 | CPU / load |
| timedatectl | Time sync |
| chronyc tracking | NTP details |
| systemctl status vmware-vcops-analytics | Analytics service |
| systemctl status vmware-vcops-collector | Collector service |
| systemctl status vmware-vcops-web | Web service |
| systemctl status vmware-vcops-casa | CASA service |

Cassandra Commands

| Command | Purpose |
|---|---|
| nodetool status | Cassandra cluster status |
| nodetool info | Local node info |
| nodetool compactionstats | Active compactions |
| nodetool compact | Trigger compaction |
| nodetool repair | Repair data |
| nodetool describecluster | Cluster schema |

Log Locations

| Log | Path |
|---|---|
| Analytics | /storage/log/vcops/analytics/analytics.log |
| Collector | /storage/log/vcops/collector/collector.log |
| CASA | /storage/log/vcops/casa/casa.log |
| Web | /storage/log/vcops/web/web.log |
| Adapter (per-type) | /storage/log/vcops/adapterkind/<kind>/ |
| Cassandra | /storage/log/vcops/cassandra/system.log |

22. API Quick Reference (Suite API)

Authentication

# Acquire token
curl -sk -X POST "https://$OPS/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"password","authSource":"local"}'

# Release token
curl -sk -X POST "https://$OPS/suite-api/api/auth/token/release" \
  -H "Authorization: vRealizeOpsToken $OPS_TOKEN" \
  -H "Content-Type: application/json"

Key Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| /api/auth/token/acquire | POST | Get auth token |
| /api/auth/token/release | POST | Release auth token |
| /api/deployment/node | GET | List cluster nodes |
| /api/deployment/licenses | GET | License info |
| /api/deployment/retention | GET | Data retention config |
| /api/adapters | GET | List all adapters |
| /api/adapters/<id> | GET | Adapter details |
| /api/adapters?adapterKindKey=VMWARE | GET | Filter by adapter kind |
| /api/adapters/<id>/monitoringstate/start | POST | Start adapter |
| /api/adapters/<id>/monitoringstate/stop | POST | Stop adapter |
| /api/credentials | GET | List credentials |
| /api/resources | GET | List resources |
| /api/resources/<id>/stats/latest | GET | Latest metrics |
| /api/alerts | GET | List alerts |
| /api/alerts?status=ACTIVE | GET | Active alerts only |
| /api/alerts?criticality=CRITICAL | GET | Critical alerts only |
| /api/alertplugins | GET | Notification plugins |
| /api/collectors | GET | Remote collectors |
| /api/solutions | GET | Management packs |
| /api/reports | GET | Report definitions |

Common Query Parameters

| Parameter | Example | Purpose |
|---|---|---|
| pageSize | ?pageSize=100 | Results per page |
| page | ?page=0 | Page number |
| adapterKind | ?adapterKind=VMWARE | Filter by adapter |
| resourceKind | ?resourceKind=VirtualMachine | Filter by resource type |
| status | ?status=ACTIVE | Alert status filter |
| criticality | ?criticality=CRITICAL | Alert criticality |

VCF Operations Health Check Handbook Version 1.0 | March 2026 © 2026 Virtual Control LLC — All Rights Reserved