This handbook provides a comprehensive, repeatable methodology for verifying the health of VCF Operations for Logs (formerly VMware Aria Operations for Logs / vRealize Log Insight) within a VMware Cloud Foundation 9 environment. It is designed for infrastructure engineers, VCF administrators, and operations teams who need to validate that the centralized logging platform is functioning correctly, ingesting events at expected rates, and maintaining cluster integrity.
Perform this health check according to the following triggers and cadence:
| Trigger | Frequency | Priority |
|---|---|---|
| Scheduled proactive review | Monthly | Standard |
| Pre-upgrade validation (VCF lifecycle) | Before each upgrade cycle | High |
| Post-upgrade verification | Immediately after upgrade | Critical |
| After cluster node addition or removal | As needed | High |
| After certificate renewal | As needed | High |
| Performance degradation reported | Reactive | Critical |
| Ingestion rate anomalies detected | Reactive | Critical |
| After datacenter-level maintenance window | As needed | Standard |
| Disaster recovery rehearsal | Quarterly | High |
VCF Operations for Logs in VCF 9 consists of the following architectural components:
| Component | Description | Default Port(s) |
|---|---|---|
| Log Insight Daemon | Core ingestion and query engine | 9000, 9543 |
| Apache HTTPD | Reverse proxy for the web UI and API | 443 (HTTPS), 80 (redirect) |
| Cassandra | Embedded data store for log metadata and indexes | 9042, 7000, 7199 |
| Fluentd | Log collection agent framework (embedded) | Various |
| ILB (Integrated Load Balancer) | Virtual IP distribution across cluster nodes | Same as service ports |
| REST API | Programmatic access for queries, config, and management | 443, 9543 |
| Agents (li-agent) | Remote log collection agents on ESXi and VMs | 1514, 514, 6514 |
SSH access to each Ops for Logs node is required for service-level and OS-level checks. The default administrative user is root or a configured admin account.
# Test SSH connectivity to each node
ssh root@ops-for-logs-node1.vcf.local "hostname && uptime"
ssh root@ops-for-logs-node2.vcf.local "hostname && uptime"
ssh root@ops-for-logs-node3.vcf.local "hostname && uptime"
Expected output:
ops-for-logs-node1
10:23:45 up 45 days, 3:12, 1 user, load average: 0.42, 0.38, 0.35
All API calls in this handbook target the Ops for Logs REST API at https://<ops-for-logs-vip>/api/v1/ or https://<ops-for-logs-vip>/api/v2/. An authentication token is required for most endpoints.
# Authenticate and retrieve bearer token
curl -sk -X POST "https://ops-for-logs.vcf.local/api/v1/sessions" \
-H "Content-Type: application/json" \
-d '{
"username": "admin",
"password": "<ADMIN_PASSWORD>",
"provider": "Local"
}'
Expected response:
{
"userId": "012345ab-cdef-6789-abcd-ef0123456789",
"sessionId": "aBcDeFgHiJkLmNoPqRsTuVwXyZ123456",
"ttl": 1800
}
Store the sessionId for subsequent API calls:
export TOKEN="aBcDeFgHiJkLmNoPqRsTuVwXyZ123456"
Set these variables at the start of your health check session for convenience:
# Ops for Logs VIP or FQDN
export OFL_HOST="ops-for-logs.vcf.local"
# Individual node FQDNs
export OFL_NODE1="ops-for-logs-node1.vcf.local"
export OFL_NODE2="ops-for-logs-node2.vcf.local"
export OFL_NODE3="ops-for-logs-node3.vcf.local"
# API base URL
export OFL_API="https://${OFL_HOST}/api/v1"
# Admin password for the API session (read without echoing)
read -rs -p "Admin password: " OFL_PASS && echo
# Authenticate and store token
export TOKEN=$(curl -sk -X POST "${OFL_API}/sessions" \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"'"${OFL_PASS}"'","provider":"Local"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['sessionId'])")
echo "Token acquired: ${TOKEN:0:8}..."
| Tool | Purpose | Install Check |
|---|---|---|
| `curl` | REST API calls | `curl --version` |
| `jq` | JSON parsing | `jq --version` |
| `openssl` | Certificate inspection | `openssl version` |
| `ssh` | Remote node access | `ssh -V` |
| `python3` | Scripting and JSON parsing | `python3 --version` |
| `ntpq` / `chronyc` | NTP verification | `ntpq -V` or `chronyc --version` |
| `dig` / `nslookup` | DNS resolution testing | `dig -v` |
This table provides a single-glance view of every health check in this handbook, with pass/warn/fail criteria.
| # | Check | Command / Method | PASS | WARN | FAIL |
|---|---|---|---|---|---|
| 4.1 | Log Insight Daemon | `systemctl status loginsight` | active (running) | Restarting frequently | inactive / failed |
| 4.2 | Cassandra Service | `systemctl status cassandra` | active (running) | High compaction pending | inactive / failed |
| 4.3 | Apache HTTPD | `systemctl status httpd` | active (running) | High connection count | inactive / failed |
| 4.4 | Fluentd | `systemctl status fluentd` | active (running) | Buffer warnings | inactive / failed |
| 5.1 | Node Roles | `GET /api/v1/cluster` | All nodes present | Node degraded | Node missing |
| 5.2 | Cluster Status | `GET /api/v1/cluster/status` | All nodes RUNNING | Node in JOINING | Node OFFLINE |
| 5.3 | ILB VIP | `curl -sk https://<VIP>/` | HTTP 200/302 | High latency (> 2s) | Connection refused |
| 6.1 | `/storage/var` Usage | `df -h /storage/var` | < 70% | 70-85% | > 85% |
| 6.2 | Cassandra Data Size | `du -sh /storage/var/cassandra` | < 60% of disk | 60-80% | > 80% |
| 7.1 | Ingestion Rate | `GET /api/v1/stats` | Stable EPS | > 20% deviation | Ingestion stopped |
| 7.2 | Dropped Events | Log analysis | 0 dropped | < 0.1% dropped | > 0.1% dropped |
| 8.1 | Forwarding Status | `GET /api/v1/forwarding` | All destinations up | Intermittent failures | Destination unreachable |
| 9.1 | Content Packs | `GET /api/v1/content/contentpack/list` | All current version | Updates available | Pack errors |
| 10.1 | Ops Integration | Launch-in-context test | Works correctly | Partial function | Not configured |
| 11.1 | Agent Count | `GET /api/v1/agent/groups` | All agents connected | > 5% stale | > 20% stale |
| 12.1 | API Auth | `POST /api/v1/sessions` | Token returned < 2s | Token returned 2-5s | Auth failure |
| 13.1 | SSL Certificate | `openssl s_client` | Valid > 30 days | Valid 7-30 days | Expired / < 7 days |
| 14.1 | NTP Sync | `chronyc tracking` | Offset < 100ms | Offset 100ms-500ms | Offset > 500ms / unsync |
| 14.2 | DNS Resolution | `dig <FQDN>` | Resolves correctly | Slow resolution (> 1s) | Resolution fails |
| 15.1 | Backup Status | Backup config check | Recent backup exists | Backup > 7 days old | No backup configured |
| 16.1 | CPU Utilization | `top` / `mpstat` | < 70% sustained | 70-90% sustained | > 90% sustained |
| 16.2 | Memory Usage | `free -m` | < 80% used | 80-90% used | > 90% used |
| 16.3 | JVM Heap | JMX / log analysis | < 75% heap | 75-90% heap | > 90% heap / OOM |
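Most numeric checks in this table follow the same three-band pattern, so they script easily. A minimal bash sketch (the `classify_usage` helper name is illustrative; thresholds below are taken from check 6.1):

```shell
#!/usr/bin/env bash
# classify_usage: map a percentage onto PASS/WARN/FAIL bands.
# Usage: classify_usage <value> <warn_threshold> <fail_threshold>
classify_usage() {
  local value=$1 warn=$2 fail=$3
  if [ "$value" -gt "$fail" ]; then
    echo "FAIL"
  elif [ "$value" -ge "$warn" ]; then
    echo "WARN"
  else
    echo "PASS"
  fi
}

# /storage/var thresholds from check 6.1: < 70 PASS, 70-85 WARN, > 85 FAIL
classify_usage 45 70 85   # PASS
classify_usage 78 70 85   # WARN
classify_usage 91 70 85   # FAIL
```

The same helper covers the CPU, memory, and inode checks by passing the corresponding thresholds from the table.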
All Ops for Logs nodes run a set of critical services. Each must be verified on every node in the cluster. Execute the following checks via SSH to each node.
The loginsight daemon is the core process responsible for log ingestion, indexing, querying, and the web UI.
# Check loginsight service status on each node
ssh root@${OFL_NODE1} "systemctl status loginsight"
Expected output (healthy):
● loginsight.service - VMware Aria Operations for Logs
Loaded: loaded (/etc/systemd/system/loginsight.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2026-03-20 08:15:22 UTC; 6 days ago
Main PID: 1842 (loginsight)
Tasks: 187 (limit: 37253)
Memory: 4.2G
CPU: 2d 5h 32min 14.221s
CGroup: /system.slice/loginsight.service
└─1842 /usr/lib/loginsight/application/sbin/loginsight ...
# Check for recent restarts (indicates instability)
ssh root@${OFL_NODE1} "journalctl -u loginsight --since '7 days ago' | grep -c 'Started VMware'"
Expected: 1 (single start in the past 7 days). Values greater than 2 indicate restarts that should be investigated.
# Verify the process is running and check resource consumption
ssh root@${OFL_NODE1} "ps aux | grep loginsight | grep -v grep"
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Service state | active (running) | Restarted > 2 times in 7 days | inactive, failed, or not found |
| Memory usage | < 80% of allocated | 80-90% of allocated | > 90% or OOM killed |
| Process PID | Stable (same PID for days) | Changed in last 24h | Process not found |
If the daemon is unhealthy:
- Review recent service logs: `journalctl -u loginsight --no-pager -n 100`
- Inspect the runtime log: `tail -200 /storage/var/loginsight/runtime.log`
- Restart the service: `systemctl restart loginsight`
- If restarts recur, check free space on `/storage/var` and Cassandra health.
Cassandra is the embedded database that stores log metadata, indexes, and cluster state. Its health is critical to overall Ops for Logs function.
# Check Cassandra service status
ssh root@${OFL_NODE1} "systemctl status cassandra"
Expected output (healthy):
● cassandra.service - VMware Ops for Logs Cassandra
Loaded: loaded (/etc/systemd/system/cassandra.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2026-03-20 08:14:55 UTC; 6 days ago
Main PID: 1523 (java)
Tasks: 94 (limit: 37253)
Memory: 2.8G
CPU: 1d 12h 45min 33.109s
CGroup: /system.slice/cassandra.service
└─1523 /usr/bin/java -Xms2048m -Xmx2048m ...
# Check Cassandra ring status
ssh root@${OFL_NODE1} "nodetool status"
Expected output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.101 12.45 GiB 256 33.3% a1b2c3d4-e5f6-7890-abcd-ef0123456789 rack1
UN 192.168.1.102 11.82 GiB 256 33.3% b2c3d4e5-f6a7-8901-bcde-f01234567890 rack1
UN 192.168.1.103 12.01 GiB 256 33.4% c3d4e5f6-a7b8-9012-cdef-012345678901 rack1
The UN prefix means Up and Normal. Any other state requires investigation.
# Check pending compactions
ssh root@${OFL_NODE1} "nodetool compactionstats"
Expected: pending tasks: 0 or a small number (< 10). High pending compactions (> 50) indicate storage I/O pressure.
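To script this check, the pending-task count can be pulled out of the `nodetool` output and compared against the same thresholds. A minimal sketch (the `compaction_status` helper name is illustrative):

```shell
# Parse the "pending tasks" count from nodetool compactionstats output (read
# on stdin) and classify it against the 10/50 thresholds used in this handbook.
compaction_status() {
  local pending
  pending=$(grep -i 'pending tasks' | head -1 | tr -dc '0-9')
  pending=${pending:-0}
  if [ "$pending" -gt 50 ]; then echo "FAIL ($pending pending)"
  elif [ "$pending" -gt 10 ]; then echo "WARN ($pending pending)"
  else echo "PASS ($pending pending)"
  fi
}

# Example against a captured sample of the output shown above
printf 'pending tasks: 3\n- loginsight.events: 3\n' | compaction_status   # PASS (3 pending)
```

In practice the input would come from `ssh root@${OFL_NODE1} "nodetool compactionstats" | compaction_status`.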
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Service state | active (running) | Frequent GC pauses | inactive / failed |
| nodetool status | All nodes `UN` | Node in `UJ` (joining) | Node `DN` (down) |
| Pending compactions | 0 - 10 | 10 - 50 | > 50 |
| Data load balance | Within 10% across nodes | 10-25% variance | > 25% variance |
If Cassandra is unhealthy:
- Review the system log: `tail -200 /storage/var/cassandra/logs/system.log`
- Check for heap exhaustion: `grep -i "OutOfMemoryError" /storage/var/cassandra/logs/system.log`
- Restart the service: `systemctl restart cassandra`
- If a node shows `DN`, check network connectivity between nodes and verify `/storage/var` has free space.
- If compactions are backing up, raise the compaction throughput: `nodetool setcompactionthroughput 128`
Apache serves as the reverse proxy for the Ops for Logs web UI and REST API over HTTPS (port 443).
# Check Apache HTTPD status
ssh root@${OFL_NODE1} "systemctl status httpd"
Expected output (healthy):
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2026-03-20 08:15:30 UTC; 6 days ago
Docs: man:httpd.service(8)
Main PID: 2103 (httpd)
Status: "Total requests: 48231; Idle/Busy workers 8/2"
Tasks: 213 (limit: 37253)
Memory: 345.2M
# Check active connections to port 443
ssh root@${OFL_NODE1} "ss -tuln | grep ':443' && ss -s"
# Check for recent errors
ssh root@${OFL_NODE1} "tail -50 /var/log/httpd/error_log | grep -i 'error\|warn'"
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Service state | active (running) | High worker utilization (> 80%) | inactive / failed |
| Port 443 listening | Yes | -- | Not listening |
| Error log | No critical errors | Occasional warnings | Persistent errors |
If Apache is unhealthy:
- Validate the configuration: `httpd -t`
- Review recent errors: `tail -100 /var/log/httpd/error_log`
- Restart the service: `systemctl restart httpd`
Fluentd handles local log collection and forwarding on each node.
# Check Fluentd service status
ssh root@${OFL_NODE1} "systemctl status fluentd"
Expected output (healthy):
● fluentd.service - Fluentd Log Collector
Loaded: loaded (/etc/systemd/system/fluentd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2026-03-20 08:15:25 UTC; 6 days ago
Main PID: 1955 (ruby)
Tasks: 18 (limit: 37253)
Memory: 128.5M
# Check Fluentd buffer directory size
ssh root@${OFL_NODE1} "du -sh /storage/var/fluentd/buffer/ 2>/dev/null || echo 'No buffer directory'"
# Check for buffer overflow warnings in Fluentd logs
ssh root@${OFL_NODE1} "grep -c 'buffer is full' /var/log/fluentd/fluentd.log 2>/dev/null || echo '0'"
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Service state | active (running) | Buffer warnings present | inactive / failed |
| Buffer size | < 100 MB | 100 MB - 500 MB | > 500 MB (backlog) |
| Buffer overflow events | 0 | 1-5 in past 24h | > 5 in past 24h |
If Fluentd is unhealthy:
- Review the Fluentd log: `tail -100 /var/log/fluentd/fluentd.log`
- Clear the buffer backlog (after confirming events are no longer being flushed): `rm -f /storage/var/fluentd/buffer/*.log`
- Restart the service: `systemctl restart fluentd`

Run this consolidated command on each node to verify all critical services in a single pass:
# Quick service health summary for a single node
ssh root@${OFL_NODE1} 'echo "=== Service Status Summary ===" && \
for svc in loginsight cassandra httpd fluentd; do \
STATUS=$(systemctl is-active $svc 2>/dev/null); \
ENABLED=$(systemctl is-enabled $svc 2>/dev/null); \
printf "%-15s Active: %-12s Enabled: %s\n" "$svc" "$STATUS" "$ENABLED"; \
done'
Expected output:
=== Service Status Summary ===
loginsight Active: active Enabled: enabled
cassandra Active: active Enabled: enabled
httpd Active: active Enabled: enabled
fluentd Active: active Enabled: enabled
# Loop across all cluster nodes
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
ssh root@${NODE} 'for svc in loginsight cassandra httpd fluentd; do \
printf "%-15s %s\n" "$svc" "$(systemctl is-active $svc)"; done'
echo ""
done
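To turn the loop output into an actionable result, the status column can be scanned for anything other than `active`. A minimal parsing sketch (the `check_summary` helper name is illustrative; input format matches the loop above):

```shell
# Read "<service> <status>" lines on stdin and flag any non-active service.
# Returns non-zero if at least one service needs attention.
check_summary() {
  local bad=0 svc status
  while read -r svc status; do
    [ -z "$svc" ] && continue
    if [ "$status" != "active" ]; then
      echo "ATTENTION: $svc is $status"
      bad=1
    fi
  done
  return $bad
}

# Example with one failed service
printf 'loginsight active\ncassandra active\nhttpd failed\nfluentd active\n' \
  | check_summary || echo "at least one service needs attention"
```

In practice, pipe each node's summary through the helper and record any non-zero exit as a FAIL for that node.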
VCF Operations for Logs operates as a clustered appliance with a minimum of three nodes for high availability. Cluster health verification ensures that all nodes are online, roles are correctly assigned, and the integrated load balancer is distributing traffic.
Each Ops for Logs cluster has exactly one master node and one or more worker nodes. The master manages cluster coordination, schema, and configuration replication.
# Retrieve cluster node roles
curl -sk -X GET "${OFL_API}/cluster" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"clusterSize": 3,
"nodes": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef0123456789",
"hostname": "ops-for-logs-node1.vcf.local",
"ipAddress": "192.168.1.101",
"role": "MASTER",
"status": "RUNNING",
"version": "9.0.0-12345678"
},
{
"id": "b2c3d4e5-f6a7-8901-bcde-f01234567890",
"hostname": "ops-for-logs-node2.vcf.local",
"ipAddress": "192.168.1.102",
"role": "WORKER",
"status": "RUNNING",
"version": "9.0.0-12345678"
},
{
"id": "c3d4e5f6-a7b8-9012-cdef-012345678901",
"hostname": "ops-for-logs-node3.vcf.local",
"ipAddress": "192.168.1.103",
"role": "WORKER",
"status": "RUNNING",
"version": "9.0.0-12345678"
}
]
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Master node present | Exactly 1 master | -- | 0 or > 1 master |
| All nodes reporting | Count matches `clusterSize` | -- | Missing node(s) |
| Version consistency | All nodes same version | -- | Version mismatch |
| All nodes RUNNING | All status = `RUNNING` | Node in `JOINING`/`LEAVING` | Node `OFFLINE`/`ERROR` |
If a node is missing or degraded:
- Check the loginsight service on the affected node: `systemctl status loginsight`
- Test reachability: `ping ${OFL_NODE1}` from the other nodes
- Probe the ingestion port: `curl -sk https://${OFL_NODE1}:9000`
- Search for cluster errors: `tail -200 /storage/var/loginsight/runtime.log | grep -i cluster`

# Get detailed cluster health
curl -sk -X GET "${OFL_API}/cluster/status" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"clusterStatus": "RUNNING",
"masterNodeId": "a1b2c3d4-e5f6-7890-abcd-ef0123456789",
"nodesHealth": [
{
"nodeId": "a1b2c3d4-e5f6-7890-abcd-ef0123456789",
"hostname": "ops-for-logs-node1.vcf.local",
"state": "RUNNING",
"diskUsagePercent": 42.5,
"cpuUsagePercent": 23.1,
"memoryUsagePercent": 65.8,
"eventsPerSecond": 3245
},
{
"nodeId": "b2c3d4e5-f6a7-8901-bcde-f01234567890",
"hostname": "ops-for-logs-node2.vcf.local",
"state": "RUNNING",
"diskUsagePercent": 41.2,
"cpuUsagePercent": 21.8,
"memoryUsagePercent": 63.4,
"eventsPerSecond": 3198
},
{
"nodeId": "c3d4e5f6-a7b8-9012-cdef-012345678901",
"hostname": "ops-for-logs-node3.vcf.local",
"state": "RUNNING",
"diskUsagePercent": 43.1,
"cpuUsagePercent": 22.5,
"memoryUsagePercent": 64.2,
"eventsPerSecond": 3210
}
]
}
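The per-node metrics in this response can be screened automatically. A sketch that flags any node not in `RUNNING` state or with disk usage above the 85% threshold (field names taken from the sample response above; the `summarize_nodes` helper name is illustrative):

```shell
# Read the /cluster/status JSON on stdin and print one line per node:
# "<hostname> OK" or "<hostname> FAIL <reasons>".
summarize_nodes() {
  python3 -c '
import sys, json
doc = json.load(sys.stdin)
for n in doc["nodesHealth"]:
    issues = []
    if n["state"] != "RUNNING":
        issues.append("state=" + n["state"])
    if n["diskUsagePercent"] > 85:
        issues.append("disk=%g%%" % n["diskUsagePercent"])
    print(n["hostname"], "FAIL" if issues else "OK", ",".join(issues))
'
}

# Typical use:
#   curl -sk -H "Authorization: Bearer ${TOKEN}" "${OFL_API}/cluster/status" | summarize_nodes
```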
The ILB provides a single virtual IP (VIP) that distributes incoming log traffic and API requests across all cluster nodes.
# Test VIP is responding
curl -sk -o /dev/null -w "HTTP_CODE: %{http_code}\nTIME_TOTAL: %{time_total}s\n" \
"https://${OFL_HOST}/"
Expected output:
HTTP_CODE: 302
TIME_TOTAL: 0.234s
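The HTTP code and timing reported by `curl -w` map directly onto the pass/warn/fail bands for this check. A small sketch (the `vip_status` helper name is illustrative):

```shell
# Classify a VIP probe: $1 = HTTP code, $2 = total time in seconds,
# both as reported by curl -w "%{http_code} %{time_total}".
vip_status() {
  awk -v code="$1" -v t="$2" 'BEGIN {
    if (code != 200 && code != 302) print "FAIL (HTTP " code ")"
    else if (t + 0 > 2.0)           print "WARN (slow: " t "s)"
    else                            print "PASS"
  }'
}

vip_status 302 0.234   # PASS
```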
# Check ILB configuration
curl -sk -X GET "${OFL_API}/ilb" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"enabled": true,
"virtualIp": "192.168.1.100",
"heartbeatInterval": 3,
"failoverTimeout": 15
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| VIP responds | HTTP 200 or 302 | Response time > 2s | Connection refused / timeout |
| ILB enabled | `true` | -- | `false` |
| All nodes behind ILB | All nodes included | -- | Node excluded |
If the VIP is unresponsive:
- Identify which node currently holds the VIP: `ssh root@${OFL_NODE1} "ip addr show | grep 192.168.1.100"`
- Check for a duplicate IP address on the network: `arping -D -I eth0 192.168.1.100`

All cluster nodes must be able to communicate with each other on required ports.
# Test connectivity from node1 to node2 and node3 on key ports
ssh root@${OFL_NODE1} "
echo '--- Port 9000 (loginsight) ---'
nc -zv ${OFL_NODE2} 9000 2>&1
nc -zv ${OFL_NODE3} 9000 2>&1
echo '--- Port 9042 (Cassandra CQL) ---'
nc -zv ${OFL_NODE2} 9042 2>&1
nc -zv ${OFL_NODE3} 9042 2>&1
echo '--- Port 7000 (Cassandra inter-node) ---'
nc -zv ${OFL_NODE2} 7000 2>&1
nc -zv ${OFL_NODE3} 7000 2>&1
"
Expected output:
--- Port 9000 (loginsight) ---
Connection to ops-for-logs-node2.vcf.local 9000 port [tcp/*] succeeded!
Connection to ops-for-logs-node3.vcf.local 9000 port [tcp/*] succeeded!
--- Port 9042 (Cassandra CQL) ---
Connection to ops-for-logs-node2.vcf.local 9042 port [tcp/*] succeeded!
Connection to ops-for-logs-node3.vcf.local 9042 port [tcp/*] succeeded!
--- Port 7000 (Cassandra inter-node) ---
Connection to ops-for-logs-node2.vcf.local 7000 port [tcp/*] succeeded!
Connection to ops-for-logs-node3.vcf.local 7000 port [tcp/*] succeeded!
Storage is the most common source of Ops for Logs issues. The appliance stores all ingested log data, Cassandra metadata, and indexes on the /storage/var partition.
# Show all mounted partitions and usage
ssh root@${OFL_NODE1} "df -hT"
Expected output:
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext4 10G 3.2G 6.3G 34% /
/dev/sda1 vfat 512M 12M 500M 3% /boot/efi
/dev/sdb1 ext4 500G 210G 266G 45% /storage/var
tmpfs tmpfs 7.8G 0 7.8G 0% /dev/shm
The critical partitions are:
| Partition | Purpose | Minimum Size | Alert Threshold |
|---|---|---|---|
| `/` | OS root filesystem | 10 GB | > 80% used |
| `/storage/var` | Log data, Cassandra, indexes | 500 GB+ | > 70% used |
| `/boot/efi` | EFI boot partition | 512 MB | > 90% used |
# Check /storage/var utilization with breakdown
ssh root@${OFL_NODE1} "
echo '=== Overall /storage/var ==='
df -h /storage/var
echo ''
echo '=== Top-level directories by size ==='
du -sh /storage/var/*/ 2>/dev/null | sort -rh | head -20
"
Expected output:
=== Overall /storage/var ===
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 500G 210G 266G 45% /storage/var
=== Top-level directories by size ===
185G /storage/var/loginsight/
18G /storage/var/cassandra/
3.2G /storage/var/fluentd/
1.1G /storage/var/apache/
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| `/storage/var` usage | < 70% | 70-85% | > 85% |
| Root `/` usage | < 80% | 80-90% | > 90% |
| Inode usage | < 70% | 70-85% | > 85% |
# Check inode usage (often overlooked)
ssh root@${OFL_NODE1} "df -i /storage/var"
Note: if `/storage/var` usage exceeds 85%, Ops for Logs will begin aggressively purging old data. At 95%, ingestion may halt entirely. Proactive monitoring is essential.
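Block usage and inode usage can be checked together, since either one can exhaust the partition. A sketch assuming GNU coreutils `df` (the `usage_status` helper name is illustrative):

```shell
# Report the worse of block and inode usage for a mount point against the
# 70/85 thresholds. Requires GNU df for the --output flag.
usage_status() {
  local mnt=$1 blocks inodes worst
  blocks=$(df --output=pcent "$mnt" | tail -1 | tr -dc '0-9')
  inodes=$(df --output=ipcent "$mnt" | tail -1 | tr -dc '0-9')
  inodes=${inodes:-0}   # some filesystems report "-" for inode usage
  worst=$(( blocks > inodes ? blocks : inodes ))
  if [ "$worst" -gt 85 ]; then echo "FAIL (${worst}%)"
  elif [ "$worst" -ge 70 ]; then echo "WARN (${worst}%)"
  else echo "PASS (${worst}%)"
  fi
}

usage_status /
```

On the appliance, run `usage_status /storage/var` over SSH on each node.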
# Check Cassandra data footprint
ssh root@${OFL_NODE1} "
echo '=== Cassandra Data Directory ==='
du -sh /storage/var/cassandra/data/ 2>/dev/null
echo ''
echo '=== Cassandra Commit Logs ==='
du -sh /storage/var/cassandra/commitlog/ 2>/dev/null
echo ''
echo '=== Cassandra Saved Caches ==='
du -sh /storage/var/cassandra/saved_caches/ 2>/dev/null
"
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Data directory | < 60% of `/storage/var` | 60-80% | > 80% |
| Commit log size | < 2 GB | 2-5 GB | > 5 GB (indicates write issues) |
# Get retention settings
curl -sk -X GET "${OFL_API}/time/config" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"retentionPeriod": 30,
"archiveEnabled": true,
"archiveRetentionPeriod": 365
}
# Check the loginsight configuration file for retention settings
ssh root@${OFL_NODE1} "grep -i 'retention' /storage/var/loginsight/config/loginsight-config.xml 2>/dev/null"
# Check archive/NFS configuration
curl -sk -X GET "${OFL_API}/archive" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response (when configured):
{
"enabled": true,
"archiveType": "NFS",
"nfsServer": "nfs-server.vcf.local",
"nfsPath": "/exports/loginsight-archive",
"archiveFrequency": "DAILY",
"compressionEnabled": true
}
# Check if NFS archive is mounted
ssh root@${OFL_NODE1} "mount | grep nfs && df -h /storage/var/loginsight/archive/"
If storage is filling faster than expected:
- Reduce the `retentionPeriod` value
- Expand the `/storage/var` virtual disk in vSphere and grow the filesystem
- Clear old Cassandra snapshots: `nodetool clearsnapshot`
The ingestion rate (events per second, or EPS) is a key performance indicator for Ops for Logs. Monitoring this metric ensures that the platform is receiving logs at expected volumes and not silently dropping events.
# Get current ingestion statistics
curl -sk -X GET "${OFL_API}/stats" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"totalEventsIngested": 285432109,
"currentEventsPerSecond": 9653,
"averageEventsPerSecond": 9480,
"peakEventsPerSecond": 18234,
"totalBytesIngested": 412983726501,
"droppedEvents": 0,
"queueDepth": 12
}
# Monitor real-time ingestion rate from node logs
ssh root@${OFL_NODE1} "tail -100 /storage/var/loginsight/runtime.log | grep -i 'ingestion\|eps\|events.*second'"
# Query ingestion rate over the past 24 hours
curl -sk -X POST "${OFL_API}/events/stats" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": "ingestion_rate",
"startTimeMillis": '$(date -d "24 hours ago" +%s%3N)',
"endTimeMillis": '$(date +%s%3N)',
"bucketDurationMinutes": 60
}' | jq '.buckets[] | {time: .startTime, eps: .eventsPerSecond}'
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Current EPS | Within 20% of baseline | 20-50% deviation from baseline | > 50% deviation or 0 EPS |
| Dropped events | 0 | < 0.1% of total ingested | > 0.1% of total |
| Queue depth | < 100 | 100-1000 | > 1000 |
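Deviation from the EPS baseline can be computed directly from the stats response. A sketch (the `eps_status` helper name is illustrative; the baseline must come from your own environment, and 9480 below is just the sample average above):

```shell
# Classify current EPS against a recorded baseline using the check 7.1 bands:
# within 20% PASS, 20-50% WARN, > 50% deviation or 0 EPS FAIL.
eps_status() {
  awk -v cur="$1" -v base="$2" 'BEGIN {
    dev = (cur - base) / base * 100
    if (dev < 0) dev = -dev
    if (cur == 0 || dev > 50) verdict = "FAIL"
    else if (dev > 20)        verdict = "WARN"
    else                      verdict = "PASS"
    printf "%s (%.1f%% deviation)\n", verdict, dev
  }'
}

eps_status 9653 9480   # PASS (1.8% deviation)
```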
# Check ingestion pipeline components
ssh root@${OFL_NODE1} "
echo '=== Listening Ports for Ingestion ==='
ss -tuln | grep -E ':(514|1514|6514|9000|9543) '
echo ''
echo '=== Active Syslog Connections ==='
ss -tn | grep -E ':(514|1514|6514) ' | wc -l
echo ''
echo '=== Active CFAPI Connections ==='
ss -tn | grep -E ':(9000|9543) ' | wc -l
"
Expected output:
=== Listening Ports for Ingestion ===
tcp LISTEN 0 128 *:514 *:*
tcp LISTEN 0 128 *:1514 *:*
tcp LISTEN 0 128 *:6514 *:*
tcp LISTEN 0 128 *:9000 *:*
tcp LISTEN 0 128 *:9543 *:*
=== Active Syslog Connections ===
42
=== Active CFAPI Connections ===
18
# Check for dropped events in the runtime log
ssh root@${OFL_NODE1} "grep -c 'dropped\|overflow\|backpressure' \
/storage/var/loginsight/runtime.log 2>/dev/null || echo '0'"
# Check ingestion queue depth
ssh root@${OFL_NODE1} "grep -i 'queue.*depth\|pending.*events' \
/storage/var/loginsight/runtime.log | tail -5"
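The dropped-event ratio from `GET /api/v1/stats` can be evaluated against the 0.1% threshold with a one-line calculation. A sketch (the `drop_status` helper name is illustrative):

```shell
# Classify dropped events (check 7.2): $1 = droppedEvents, $2 = totalEventsIngested.
drop_status() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    pct = (t > 0) ? d / t * 100 : 0
    if (d == 0)         print "PASS"
    else if (pct < 0.1) print "WARN"
    else                print "FAIL"
  }'
}

drop_status 0 285432109   # PASS (values from the sample stats above)
```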
Ops for Logs can forward ingested logs to external destinations via syslog (UDP/TCP), syslog over TLS, or the CFAPI protocol. This section verifies that all forwarding destinations are configured correctly and operating.
# List all configured forwarding destinations
curl -sk -X GET "${OFL_API}/forwarding" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"destinations": [
{
"id": "dest-001",
"name": "SIEM-Primary",
"host": "siem.vcf.local",
"port": 6514,
"protocol": "SYSLOG",
"transport": "TCP-TLS",
"enabled": true,
"status": "CONNECTED",
"filter": "*",
"lastEventForwarded": "2026-03-26T09:45:12Z"
},
{
"id": "dest-002",
"name": "Archive-Collector",
"host": "log-archive.vcf.local",
"port": 9543,
"protocol": "CFAPI",
"transport": "HTTPS",
"enabled": true,
"status": "CONNECTED",
"filter": "vmw_vc_*",
"lastEventForwarded": "2026-03-26T09:45:10Z"
}
]
}
# Check TLS certificate used for syslog forwarding
ssh root@${OFL_NODE1} "
echo '=== Forwarding TLS Certificates ==='
ls -la /storage/var/loginsight/certs/forwarding/ 2>/dev/null || echo 'No forwarding certs directory'
echo ''
echo '=== Forwarding Configuration ==='
grep -A 10 'forwarding' /storage/var/loginsight/config/loginsight-config.xml 2>/dev/null | head -30
"
# Verify TLS handshake to syslog destination
openssl s_client -connect siem.vcf.local:6514 -servername siem.vcf.local </dev/null 2>/dev/null | \
openssl x509 -noout -subject -dates -issuer
Expected output:
subject=CN = siem.vcf.local
notBefore=Jan 15 00:00:00 2026 GMT
notAfter=Jan 15 23:59:59 2027 GMT
issuer=CN = VCF Internal CA, O = Virtual Control LLC
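The `notAfter` date can be converted into days remaining and scored against check 13.1. A sketch assuming GNU `date` (the `cert_days_left` helper name is illustrative):

```shell
# Days remaining on a certificate, from the notAfter date printed by
# openssl x509 -noout -dates. GNU date parses this format directly.
cert_days_left() {
  local exp now
  exp=$(date -u -d "$1" +%s) || return 1
  now=$(date -u +%s)
  echo $(( (exp - now) / 86400 ))
}

DAYS=$(cert_days_left "Jan 15 23:59:59 2027 GMT")
if [ "$DAYS" -lt 7 ]; then echo "FAIL (${DAYS} days left)"
elif [ "$DAYS" -le 30 ]; then echo "WARN (${DAYS} days left)"
else echo "PASS (${DAYS} days left)"
fi
```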
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| TLS handshake | Succeeds | Certificate nearing expiry | Handshake fails |
| Protocol match | Matches destination config | -- | Mismatch |
| Certificate trust | CA chain trusted | Self-signed (intentional) | Untrusted / expired |
# Check forwarding statistics per destination
curl -sk -X GET "${OFL_API}/forwarding/stats" \
-H "Authorization: Bearer ${TOKEN}" | jq '.destinations[] | {name, eventsForwarded, eventsFailed, lastSuccess}'
Expected output:
{
"name": "SIEM-Primary",
"eventsForwarded": 48293012,
"eventsFailed": 0,
"lastSuccess": "2026-03-26T09:45:12Z"
}
{
"name": "Archive-Collector",
"eventsForwarded": 12045231,
"eventsFailed": 0,
"lastSuccess": "2026-03-26T09:45:10Z"
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Events forwarded | Increasing steadily | Intermittent pauses | Not increasing / 0 |
| Events failed | 0 | < 0.01% of forwarded | > 0.01% or increasing |
| Last success | Within 5 minutes | 5-60 minutes ago | > 60 minutes ago |
| Destination status | `CONNECTED` | `RECONNECTING` | `DISCONNECTED` / `ERROR` |
# Send a test event via the API to verify end-to-end forwarding
curl -sk -X POST "${OFL_API}/events/ingest/0" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"events": [
{
"text": "HEALTH_CHECK_TEST: Forwarding validation event from Ops for Logs health check - '"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",
"source": "health-check-script",
"fields": [
{"name": "test_id", "content": "hc-fwd-'"$(date +%s)"'"}
]
}
]
}'
Then verify the test event arrived at the forwarding destination by searching for HEALTH_CHECK_TEST in the target SIEM or log collector.
If forwarding is failing:
- Test network reachability to the destination: `nc -zv siem.vcf.local 6514`

Content packs provide pre-built dashboards, alerts, extracted fields, and queries for specific products (vSphere, NSX, SDDC Manager, vSAN, etc.). Keeping content packs current ensures full observability.
# List all installed content packs
curl -sk -X GET "${OFL_API}/content/contentpack/list" \
-H "Authorization: Bearer ${TOKEN}" | jq '.contentPacks[] | {name, namespace, version, installedDate}'
Expected output:
{
"name": "VMware vSphere",
"namespace": "com.vmware.vsphere",
"version": "9.0.1",
"installedDate": "2026-02-15T10:30:00Z"
}
{
"name": "VMware NSX",
"namespace": "com.vmware.nsx",
"version": "9.0.0",
"installedDate": "2026-02-15T10:30:05Z"
}
{
"name": "VMware SDDC Manager",
"namespace": "com.vmware.sddc",
"version": "9.0.0",
"installedDate": "2026-02-15T10:30:10Z"
}
{
"name": "VMware vSAN",
"namespace": "com.vmware.vsan",
"version": "9.0.0",
"installedDate": "2026-02-15T10:30:15Z"
}
{
"name": "VMware Aria Operations",
"namespace": "com.vmware.vrops",
"version": "9.0.0",
"installedDate": "2026-02-15T10:30:20Z"
}
| Content Pack | Namespace | Minimum Version | Purpose |
|---|---|---|---|
| VMware vSphere | `com.vmware.vsphere` | 9.0.0 | ESXi and vCenter log parsing |
| VMware NSX | `com.vmware.nsx` | 9.0.0 | NSX manager and edge log parsing |
| VMware SDDC Manager | `com.vmware.sddc` | 9.0.0 | SDDC Manager lifecycle events |
| VMware vSAN | `com.vmware.vsan` | 9.0.0 | vSAN health and performance logs |
| VMware Aria Operations | `com.vmware.vrops` | 9.0.0 | Ops manager integration logs |
| Linux | `com.vmware.linux` | 9.0.0 | General Linux syslog parsing |
| General | `com.vmware.general` | 9.0.0 | Generic field extraction |
# Check marketplace for content pack updates
curl -sk -X GET "${OFL_API}/content/contentpack/marketplace" \
-H "Authorization: Bearer ${TOKEN}" | jq '.contentPacks[] | select(.updateAvailable == true) | {name, currentVersion, availableVersion}'
Expected output (no updates needed):
(empty output -- no updates available)
Output when updates are available:
{
"name": "VMware vSphere",
"currentVersion": "9.0.0",
"availableVersion": "9.0.1"
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| All VCF packs installed | All 7+ packs present | Missing non-critical pack | Missing vSphere or SDDC pack |
| Pack versions | All at latest | Minor update available | Major version behind |
| Pack status | No errors | Warning on extraction | Pack failed to load |
# Check auto-update settings for content packs
curl -sk -X GET "${OFL_API}/content/contentpack/autoupdate" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"autoUpdateEnabled": true,
"checkIntervalHours": 24,
"lastCheckTime": "2026-03-25T02:00:00Z",
"proxyEnabled": false
}
If content packs are missing or outdated:
- Install a pack: `POST /api/v1/content/contentpack/install` with the pack namespace
- Enable auto-update: `PUT /api/v1/content/contentpack/autoupdate` with `{"autoUpdateEnabled": true}`
VCF Operations for Logs integrates with VCF Operations (formerly Aria Operations / vRealize Operations) to provide launch-in-context capabilities, shared authentication, and correlated alerting.
Launch-in-context enables users to jump directly from VCF Operations alerts and dashboards into relevant log queries in Ops for Logs.
# Check VCF Operations integration settings
curl -sk -X GET "${OFL_API}/integration/vrops" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"enabled": true,
"vropsHost": "ops.vcf.local",
"vropsPort": 443,
"connectionStatus": "CONNECTED",
"lastSyncTime": "2026-03-26T08:00:00Z",
"ssoIntegrated": true,
"launchInContextEnabled": true
}
# Verify launch-in-context URL format
curl -sk -X GET "${OFL_API}/integration/vrops/launch-url?resourceId=vm-123&timeRange=3600" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Integration enabled | `true` | -- | `false` or not configured |
| Connection status | `CONNECTED` | `DEGRADED` | `DISCONNECTED` |
| Last sync time | Within 24 hours | 1-7 days ago | > 7 days or never |
| Launch-in-context | URL generated correctly | Partial functionality | Errors on generation |
# Verify SSO / shared authentication with VCF Operations
curl -sk -X GET "${OFL_API}/auth/providers" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"providers": [
{
"name": "Local",
"type": "LOCAL",
"enabled": true
},
{
"name": "vcf-sso.vcf.local",
"type": "ACTIVE_DIRECTORY",
"enabled": true,
"connectionStatus": "CONNECTED"
},
{
"name": "VMware Identity Manager",
"type": "VIDM",
"enabled": true,
"connectionStatus": "CONNECTED"
}
]
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| SSO provider configured | Yes, `CONNECTED` | Configured but `DEGRADED` | Not configured |
| AD integration | `CONNECTED` | Intermittent failures | `DISCONNECTED` |
| Local auth backup | Enabled as fallback | -- | Disabled (no fallback) |
Verify that VCF Operations is sending notification events and that Ops for Logs is receiving them.
# Search for VCF Operations events in Ops for Logs
curl -sk -X POST "${OFL_API}/events" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": "vmw_product=vrops",
"startTimeMillis": '$(date -d "24 hours ago" +%s%3N)',
"endTimeMillis": '$(date +%s%3N)',
"limit": 5
}' | jq '.results | length'
Expected: A positive number indicating events are flowing from VCF Operations to Ops for Logs.
If the integration is unhealthy:
- Verify connectivity: `curl -sk https://ops.vcf.local:443` from the Ops for Logs nodes
- Restart the integration: `systemctl restart loginsight` (integration is part of the main daemon)
Ops for Logs agents (li-agent) run on ESXi hosts, VMs, and other endpoints to collect and forward logs to the cluster. Agent health monitoring ensures complete log coverage.
# Get agent summary statistics
curl -sk -X GET "${OFL_API}/agent/stats" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"totalAgents": 48,
"connectedAgents": 47,
"disconnectedAgents": 1,
"activeAgentGroups": 5,
"averageEventsPerAgent": 201
}
# List agents with their connection status
curl -sk -X GET "${OFL_API}/agent/agents" \
-H "Authorization: Bearer ${TOKEN}" | jq '.agents[] | {hostname, ipAddress, version, status, lastHeartbeat}' | head -60
Sample output:
{
"hostname": "esxi-host-01.vcf.local",
"ipAddress": "192.168.10.101",
"version": "9.0.0-12345",
"status": "CONNECTED",
"lastHeartbeat": "2026-03-26T09:44:55Z"
}
{
"hostname": "esxi-host-02.vcf.local",
"ipAddress": "192.168.10.102",
"version": "9.0.0-12345",
"status": "CONNECTED",
"lastHeartbeat": "2026-03-26T09:44:52Z"
}
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Connected agents | 100% connected | 95-99% connected | < 95% connected |
| Agent version | All same version as cluster | Minor version mismatch | Major version mismatch |
| Heartbeat age | < 5 minutes | 5-30 minutes | > 30 minutes |
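As an illustrative helper (not part of the product tooling), the connected-agent thresholds in the table above can be encoded in a small shell function and fed with the `connectedAgents`/`totalAgents` values from `/agent/stats`:

```shell
# Classify agent connectivity against the PASS/WARN/FAIL bands above.
# Usage: classify_agents <connected> <total>
classify_agents() {
  local connected=$1 total=$2
  # Integer percentage of connected agents
  local pct=$(( connected * 100 / total ))
  if   [ "$pct" -ge 100 ]; then echo "PASS"
  elif [ "$pct" -ge 95 ];  then echo "WARN"
  else                          echo "FAIL"
  fi
}

# Using the sample /agent/stats response above (47 of 48 connected):
classify_agents 47 48   # -> WARN (97%)
```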
Agent groups organize agents for targeted log collection and configuration distribution.
# List all agent groups
curl -sk -X GET "${OFL_API}/agent/groups" \
-H "Authorization: Bearer ${TOKEN}" | jq '.groups[] | {id, name, agentCount, filter}'
Expected output:
{
"id": "group-001",
"name": "ESXi-Hosts",
"agentCount": 32,
"filter": "hostname MATCHES esxi-*"
}
{
"id": "group-002",
"name": "VCF-Management-VMs",
"agentCount": 12,
"filter": "hostname MATCHES vcf-mgmt-*"
}
{
"id": "group-003",
"name": "Windows-Servers",
"agentCount": 4,
"filter": "os MATCHES Windows*"
}
# Get detailed agent group configuration including collection targets
curl -sk -X GET "${OFL_API}/agent/groups/group-001" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"id": "group-001",
"name": "ESXi-Hosts",
"agentCount": 32,
"config": {
"fileLogs": [
{
"directory": "/var/log",
"include": "*.log",
"parser": "AUTO"
},
{
"directory": "/var/run/log",
"include": "vmkernel*",
"parser": "VMW_ESXI"
}
],
"eventLogs": [],
"destination": {
"host": "ops-for-logs.vcf.local",
"port": 9543,
"protocol": "CFAPI",
"ssl": true
}
}
}
Stale agents are agents that have not sent a heartbeat within the expected interval (typically 5 minutes). They may indicate agent crashes, network issues, or decommissioned hosts.
# Find agents with no heartbeat in the last 30 minutes
curl -sk -X GET "${OFL_API}/agent/agents?status=DISCONNECTED" \
-H "Authorization: Bearer ${TOKEN}" | jq '.agents[] | {hostname, lastHeartbeat, status}'
Expected output (ideally empty):
{
"hostname": "old-vm-decommissioned.vcf.local",
"lastHeartbeat": "2026-03-10T14:22:00Z",
"status": "DISCONNECTED"
}
Remediation for stale agents:
- Verify the host is reachable: `ping old-vm-decommissioned.vcf.local`
- Check the agent service on the host: `systemctl status liagentd`
- Restart the agent: `systemctl restart liagentd`
- Review the agent log: `tail -100 /var/log/liagent/liagent.log`
- Remove decommissioned agents via the API: `DELETE /api/v1/agent/agents/{agentId}`
- Test connectivity to the ingestion port: `nc -zv ops-for-logs.vcf.local 9543`
The Ops for Logs REST API is the primary interface for programmatic queries, configuration management, and integration with external tools. Verifying API health ensures automation and integrations function correctly.
# Measure authentication response time
time curl -sk -X POST "${OFL_API}/sessions" \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"'"${OFL_PASS}"'","provider":"Local"}' \
-o /dev/null -w "HTTP_CODE: %{http_code}\nTIME_TOTAL: %{time_total}s\nTIME_CONNECT: %{time_connect}s\n"
Expected output:
HTTP_CODE: 200
TIME_TOTAL: 0.345s
TIME_CONNECT: 0.012s
real 0m0.362s
user 0m0.024s
sys 0m0.012s
# Verify a token works for an authenticated endpoint
curl -sk -X GET "${OFL_API}/version" \
-H "Authorization: Bearer ${TOKEN}" \
-w "\nHTTP_CODE: %{http_code}\n" | jq '.'
Expected response:
{
"version": "9.0.0",
"build": "12345678",
"releaseName": "VCF Operations for Logs 9.0"
}
HTTP_CODE: 200
# Verify that invalid tokens are properly rejected
curl -sk -X GET "${OFL_API}/cluster" \
-H "Authorization: Bearer INVALID_TOKEN_12345" \
-w "\nHTTP_CODE: %{http_code}\n"
Expected: HTTP_CODE: 401 (Unauthorized).
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Auth response time | < 2 seconds | 2-5 seconds | > 5 seconds or timeout |
| HTTP status | 200 | -- | 401, 403, 500, or connection error |
| Token validity | Token works on subsequent calls | TTL shorter than expected | Token immediately invalid |
| Invalid token rejection | Returns 401 | -- | Returns 200 (security issue) |
Test several key API endpoints for response time under normal load.
# Benchmark multiple API endpoints
echo "=== API Endpoint Response Times ==="
for ENDPOINT in "version" "cluster" "cluster/status" "stats" "agent/stats" "forwarding"; do
RESP=$(curl -sk -X GET "${OFL_API}/${ENDPOINT}" \
-H "Authorization: Bearer ${TOKEN}" \
-o /dev/null -w "%{http_code} %{time_total}s")
printf "%-25s %s\n" "${ENDPOINT}" "${RESP}"
done
Expected output:
=== API Endpoint Response Times ===
version 200 0.089s
cluster 200 0.156s
cluster/status 200 0.234s
stats 200 0.312s
agent/stats 200 0.198s
forwarding 200 0.145s
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Average response time | < 1 second | 1-3 seconds | > 3 seconds |
| All endpoints reachable | All return 200 | Some return 503 | Critical endpoints fail |
| Error rate | 0% | < 1% | > 1% |
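A small grading helper (an illustrative sketch, not a product command) can turn each `time_total` value from the benchmark loop into the bands above:

```shell
# Map a curl time_total value (seconds, decimal) to the response-time bands above
classify_latency() {
  awk -v t="$1" 'BEGIN {
    if (t + 0 < 1)       print "PASS"   # under 1 second
    else if (t + 0 <= 3) print "WARN"   # 1-3 seconds
    else                 print "FAIL"   # over 3 seconds
  }'
}

# Sample value from the benchmark output above:
classify_latency 0.312   # -> PASS
```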
# Test rate limiting by sending rapid requests
echo "=== Rate Limit Test (20 rapid requests) ==="
for i in $(seq 1 20); do
curl -sk -X GET "${OFL_API}/version" \
-H "Authorization: Bearer ${TOKEN}" \
-o /dev/null -w "HTTP %{http_code}\n"
done | sort | uniq -c | sort -rn
Note: print only the status code inside the loop; if each line includes the request number, every line is unique and `uniq -c` cannot aggregate the counts.
Expected output:
20 HTTP 200
If rate limiting is active, you may see HTTP 429 (Too Many Requests) after a threshold.
Remediation:
- Check `/storage/var/loginsight/runtime.log` for API error messages
- Restart the web proxy: `systemctl restart httpd`
- Restart the core service: `systemctl restart loginsight`
SSL/TLS certificates are critical for securing the Ops for Logs web UI, API, agent communication, and log forwarding. Expired or misconfigured certificates cause connection failures across the environment.
# Inspect the SSL certificate served by Ops for Logs
echo | openssl s_client -connect ${OFL_HOST}:443 -servername ${OFL_HOST} 2>/dev/null | \
openssl x509 -noout -subject -issuer -dates -serial -fingerprint -ext subjectAltName
Expected output:
subject=CN = ops-for-logs.vcf.local
issuer=CN = VCF Internal CA, O = Virtual Control LLC, L = Managed
notBefore=Feb 1 00:00:00 2026 GMT
notAfter=Feb 1 23:59:59 2028 GMT
serial=4A3B2C1D0E9F8A7B
SHA256 Fingerprint=AB:CD:EF:12:34:56:78:9A:BC:DE:F0:12:34:56:78:9A:BC:DE:F0:12:34:56:78:9A:BC:DE:F0:12:34:56:78:9A
X509v3 Subject Alternative Name:
DNS:ops-for-logs.vcf.local, DNS:ops-for-logs-node1.vcf.local, DNS:ops-for-logs-node2.vcf.local, DNS:ops-for-logs-node3.vcf.local, IP Address:192.168.1.100, IP Address:192.168.1.101, IP Address:192.168.1.102, IP Address:192.168.1.103
# Verify certificate consistency across all nodes
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
echo | openssl s_client -connect ${NODE}:443 -servername ${NODE} 2>/dev/null | \
openssl x509 -noout -subject -dates -fingerprint
echo ""
done
# Verify the CFAPI ingestion port certificate
echo | openssl s_client -connect ${OFL_HOST}:9543 -servername ${OFL_HOST} 2>/dev/null | \
openssl x509 -noout -subject -dates
# Check if a custom CA certificate is installed
ssh root@${OFL_NODE1} "
echo '=== Custom CA Certificates ==='
ls -la /storage/var/loginsight/certs/ 2>/dev/null
echo ''
echo '=== Trust Store Contents ==='
keytool -list -keystore /storage/var/loginsight/certs/truststore.jks \
-storepass changeit 2>/dev/null | head -20
"
# Download and verify the full certificate chain
echo | openssl s_client -connect ${OFL_HOST}:443 -showcerts 2>/dev/null | \
awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/{ print }' > /tmp/ofl_chain.pem
# Verify the chain; pass the downloaded intermediates as the untrusted chain,
# since a bare `openssl verify <file>` checks only the first certificate
# against the system trust store
openssl verify -verbose -untrusted /tmp/ofl_chain.pem /tmp/ofl_chain.pem
# Calculate days until certificate expiry
EXPIRY_DATE=$(echo | openssl s_client -connect ${OFL_HOST}:443 -servername ${OFL_HOST} 2>/dev/null | \
openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "${EXPIRY_DATE}" +%s)
NOW_EPOCH=$(date +%s)
DAYS_REMAINING=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
echo "Certificate expires: ${EXPIRY_DATE}"
echo "Days remaining: ${DAYS_REMAINING}"
Expected output:
Certificate expires: Feb 1 23:59:59 2028 GMT
Days remaining: 677
# Check certificate expiry on all service ports
echo "=== Certificate Expiry by Port ==="
for PORT in 443 9000 9543; do
EXPIRY=$(echo | openssl s_client -connect ${OFL_HOST}:${PORT} 2>/dev/null | \
openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
printf "Port %-6s Expires: %s\n" "${PORT}" "${EXPIRY:-N/A}"
done
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Days until expiry | > 30 days | 7-30 days | < 7 days or expired |
| SAN entries | Include VIP + all nodes | Missing some entries | Missing VIP or critical node |
| Certificate chain | Full chain valid | Intermediate missing (works) | Chain broken / untrusted |
| Consistency across nodes | Same cert on all nodes | -- | Different certs on nodes |
| Ingestion port cert | Valid | Nearing expiry | Expired |
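The `DAYS_REMAINING` value computed above can be graded against the expiry bands in the table with a short helper (an illustrative sketch):

```shell
# Map days-until-expiry to the certificate expiry criteria above.
# A negative value means the certificate has already expired.
classify_cert_days() {
  local days=$1
  if   [ "$days" -gt 30 ]; then echo "PASS"
  elif [ "$days" -ge 7 ];  then echo "WARN"
  else                          echo "FAIL"   # under 7 days or expired
  fi
}

# Sample value from the expiry check above:
classify_cert_days 677   # -> PASS
```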
Remediation:
- Replace the server certificate via `PUT /api/v1/ssl`
- Install a custom CA certificate via `POST /api/v1/ssl/ca`
- Restart the web proxy after certificate changes: `systemctl restart httpd`

Accurate time synchronization and reliable DNS resolution are foundational requirements for Ops for Logs. Time skew causes log correlation issues, and DNS failures prevent cluster communication.
# Check chrony synchronization status on each node
ssh root@${OFL_NODE1} "chronyc tracking"
Expected output:
Reference ID : C0A80001 (ntp-server.vcf.local)
Stratum : 3
Ref time (UTC) : Wed Mar 26 09:30:22 2026
System time : 0.000023455 seconds fast of NTP time
Last offset : +0.000012332 seconds
RMS offset : 0.000034521 seconds
Frequency : 2.345 ppm slow
Residual freq : +0.001 ppm
Skew : 0.023 ppm
Root delay : 0.001234 seconds
Root dispersion : 0.000456 seconds
Update interval : 1024.0 seconds
Leap status : Normal
# List NTP sources and their status
ssh root@${OFL_NODE1} "chronyc sources -v"
Expected output:
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current best, '+' = combined, '-' = not combined,
| / 'x' = may be in error, '~' = too variable, '?' = unusable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp-server.vcf.local 2 10 377 234 +0.012ms[ +0.015ms] +/- 1.23ms
^+ ntp-backup.vcf.local 2 10 377 512 -0.034ms[ -0.031ms] +/- 2.45ms
# Check time offset between all nodes
echo "=== Time on each node ==="
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
TIME=$(ssh root@${NODE} "date -u '+%Y-%m-%d %H:%M:%S.%N UTC'")
echo "${NODE}: ${TIME}"
done
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| NTP offset | < 100ms | 100ms - 500ms | > 500ms |
| NTP source reachable | At least 1 source with * | Sources showing ? | No reachable source |
| Inter-node time drift | < 200ms between nodes | 200ms - 1s | > 1s between nodes |
| Leap status | Normal | -- | Not synchronised |
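To compute inter-node drift numerically, capture `date -u +%s.%N` from each node instead of the formatted timestamp and compare the values with a helper like this (an illustrative sketch):

```shell
# Absolute drift in milliseconds between two fractional epoch timestamps,
# e.g. values captured with `date -u +%s.%N` on two nodes.
drift_ms() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = (a - b) * 1000          # difference in milliseconds
    if (d < 0) d = -d           # absolute value
    printf "%.0f\n", d
  }'
}

drift_ms 12.500 12.300   # -> 200
```

Grade the result against the inter-node drift bands above (< 200ms PASS, 200ms-1s WARN, > 1s FAIL).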
Remediation:
- Force an immediate time sync: `chronyc makestep`
- Verify the NTP server is reachable: `ping ntp-server.vcf.local`
- Review the chrony configuration: `cat /etc/chrony.conf`
- Restart chrony: `systemctl restart chronyd`
- If using ntpd instead: `systemctl restart ntpd && ntpq -p`
# Verify DNS resolution for all Ops for Logs FQDNs
echo "=== Forward DNS Lookups ==="
for FQDN in ${OFL_HOST} ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
IP=$(dig +short ${FQDN} 2>/dev/null)
printf "%-45s -> %s\n" "${FQDN}" "${IP:-FAILED}"
done
Expected output:
=== Forward DNS Lookups ===
ops-for-logs.vcf.local -> 192.168.1.100
ops-for-logs-node1.vcf.local -> 192.168.1.101
ops-for-logs-node2.vcf.local -> 192.168.1.102
ops-for-logs-node3.vcf.local -> 192.168.1.103
# Verify reverse DNS for all node IPs
echo "=== Reverse DNS Lookups ==="
for IP in 192.168.1.100 192.168.1.101 192.168.1.102 192.168.1.103; do
HOSTNAME=$(dig +short -x ${IP} 2>/dev/null)
printf "%-18s -> %s\n" "${IP}" "${HOSTNAME:-FAILED}"
done
# Measure DNS resolution time
echo "=== DNS Response Time ==="
for FQDN in ${OFL_HOST} ${OFL_NODE1}; do
TIME=$(dig ${FQDN} | grep "Query time" | awk '{print $4, $5}')
printf "%-45s %s\n" "${FQDN}" "${TIME}"
done
# Check DNS configuration on each node
ssh root@${OFL_NODE1} "cat /etc/resolv.conf"
Expected output:
search vcf.local
nameserver 192.168.1.10
nameserver 192.168.1.11
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Forward DNS | All FQDNs resolve | Slow resolution (> 1s) | Any FQDN fails to resolve |
| Reverse DNS | All IPs resolve to correct FQDN | Missing reverse for VIP | Missing reverse for node |
| DNS response time | < 100ms | 100ms - 1s | > 1s |
| DNS servers configured | 2+ nameservers | 1 nameserver | 0 nameservers |
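The "Query time" line printed by `dig` can be parsed and graded in one step with a helper like this (an illustrative sketch):

```shell
# Extract the millisecond value from a dig "Query time" line and grade it
# against the DNS response-time bands above.
dns_grade() {
  local ms
  ms=$(printf '%s\n' "$1" | awk -F': ' '/Query time/ { print $2 + 0 }')
  if   [ "$ms" -lt 100 ];  then echo "PASS"
  elif [ "$ms" -le 1000 ]; then echo "WARN"
  else                          echo "FAIL"
  fi
}

dns_grade ";; Query time: 12 msec"   # -> PASS
```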
Remediation:
- Verify the DNS server is reachable: `ping 192.168.1.10`
- Check `/etc/resolv.conf` for correct nameserver entries
- Query the DNS server directly: `dig @192.168.1.10 ops-for-logs.vcf.local`
- Flush the local resolver cache: `systemd-resolve --flush-caches`
Regular backups of Ops for Logs configuration and data are essential for disaster recovery. This section verifies backup configuration and recency.
# Check backup schedule and recent backup status
ssh root@${OFL_NODE1} "
echo '=== Backup Configuration ==='
grep -A 20 'backup' /storage/var/loginsight/config/loginsight-config.xml 2>/dev/null | head -25
echo ''
echo '=== Recent Backup Files ==='
ls -lhrt /storage/var/loginsight/backups/ 2>/dev/null | tail -10
"
# Get backup configuration and status
curl -sk -X GET "${OFL_API}/backup" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
Expected response:
{
"enabled": true,
"schedule": "DAILY",
"lastBackupTime": "2026-03-25T02:00:00Z",
"lastBackupStatus": "SUCCESS",
"lastBackupSizeBytes": 245678901,
"backupDestination": "/storage/var/loginsight/backups",
"retentionCount": 7
}
# Verify backup destination is accessible and has space
ssh root@${OFL_NODE1} "
echo '=== Backup Directory ==='
ls -lh /storage/var/loginsight/backups/ 2>/dev/null
echo ''
echo '=== Total Backup Size ==='
du -sh /storage/var/loginsight/backups/ 2>/dev/null
echo ''
echo '=== Backup Count ==='
ls -1 /storage/var/loginsight/backups/*.tar.gz 2>/dev/null | wc -l
"
Expected output:
=== Backup Directory ===
-rw-r--r-- 1 root root 234M Mar 25 02:01 backup-2026-03-25.tar.gz
-rw-r--r-- 1 root root 231M Mar 24 02:01 backup-2026-03-24.tar.gz
-rw-r--r-- 1 root root 228M Mar 23 02:01 backup-2026-03-23.tar.gz
=== Total Backup Size ===
1.6G /storage/var/loginsight/backups/
=== Backup Count ===
7
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Backup configured | Enabled with schedule | -- | Not configured |
| Last backup status | SUCCESS | -- | FAILED |
| Last backup age | < 24 hours | 1-7 days | > 7 days |
| Backup retention | >= 3 copies | 1-2 copies | 0 copies |
| Backup destination space | > 20% free | 10-20% free | < 10% free |
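The backup-age band can be checked by converting the `lastBackupTime` timestamp to hours elapsed. A sketch, assuming GNU `date` (available on the appliance); the second argument defaults to the current time and exists so the calculation is deterministic when testing:

```shell
# Hours elapsed between an ISO-8601 backup timestamp and a reference time.
# Usage: backup_age_hours <lastBackupTime> [reference-time]
backup_age_hours() {
  local last_epoch now_epoch
  last_epoch=$(date -u -d "$1" +%s)
  now_epoch=$(date -u -d "${2:-now}" +%s)
  echo $(( (now_epoch - last_epoch) / 3600 ))
}

# Sample lastBackupTime from the /backup response above, checked at 10:00 next day:
backup_age_hours "2026-03-25T02:00:00Z" "2026-03-26T10:00:00Z"   # -> 32
```

Anything under 24 hours is PASS, 1-7 days WARN, older than 7 days FAIL per the table above.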
Remediation:
- Configure backups via `PUT /api/v1/backup` with a schedule and destination
- Trigger an on-demand backup via `POST /api/v1/backup/trigger`

Monitoring CPU, memory, disk I/O, and JVM heap usage per node ensures Ops for Logs has adequate resources and is not approaching capacity limits.
# Check CPU utilization on each node
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
ssh root@${NODE} "
echo '--- CPU Summary (mpstat) ---'
mpstat 1 3 | tail -1
echo ''
echo '--- Load Average ---'
uptime
echo ''
echo '--- Top CPU Processes ---'
ps aux --sort=-%cpu | head -6
"
echo ""
done
Expected output (per node):
===== ops-for-logs-node1.vcf.local =====
--- CPU Summary (mpstat) ---
Average: all 22.15 0.00 3.45 0.12 0.00 0.00 0.00 0.00 74.28
--- Load Average ---
09:45:12 up 45 days, 3:12, 1 user, load average: 1.42, 1.38, 1.35
--- Top CPU Processes ---
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1842 18.2 52.3 8234560 4312340 ? Sl Mar20 3214:23 /usr/lib/loginsight/application/sbin/loginsight
root 1523 12.5 35.2 5234560 2903450 ? Sl Mar20 1823:45 /usr/bin/java -Xms2048m -Xmx2048m (cassandra)
root 2103 2.1 4.3 234560 354340 ? Ss Mar20 302:12 /usr/sbin/httpd
root 1955 1.3 1.6 198450 132340 ? Sl Mar20 189:34 /usr/bin/ruby (fluentd)
# Check memory utilization on each node
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
ssh root@${NODE} "free -m"
echo ""
done
Expected output (per node):
===== ops-for-logs-node1.vcf.local =====
total used free shared buff/cache available
Mem: 16016 10452 1234 128 4330 5184
Swap: 2048 0 2048
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| CPU utilization | < 70% sustained | 70-90% sustained | > 90% sustained |
| Load average | < (CPU count * 0.7) | < (CPU count * 1.0) | > (CPU count * 1.5) |
| Memory used | < 80% of total | 80-90% of total | > 90% of total |
| Swap usage | 0 MB | < 500 MB | > 500 MB (indicates memory pressure) |
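The memory-used percentage in the table can be derived directly from the `Mem:` line of `free -m` (fields: total, used, free, shared, buff/cache, available). A sketch using the sample output above:

```shell
# Derive used-memory percentage from a `free -m` "Mem:" line
mem_line="Mem: 16016 10452 1234 128 4330 5184"
used_pct=$(printf '%s\n' "$mem_line" | awk '/^Mem:/ { printf "%d", $3 * 100 / $2 }')
echo "Memory used: ${used_pct}%"   # -> Memory used: 65%
```

65% is below the 80% WARN threshold, so this node passes. In practice, pipe the live `ssh root@${NODE} "free -m"` output into the same awk expression.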
Cassandra runs on the JVM and is sensitive to heap exhaustion. Log Insight also uses Java components.
# Check Cassandra JVM heap usage via nodetool
ssh root@${OFL_NODE1} "nodetool info | grep -E 'Heap|Off'"
Expected output:
Heap Memory (MB) : 1433.60 / 2048.00
Off Heap Memory (MB): 123.45
# Check Cassandra GC log for long pauses
ssh root@${OFL_NODE1} "grep -c 'GC pause.*[0-9]\{4,\}ms' /storage/var/cassandra/logs/gc.log 2>/dev/null || echo '0'"
# Check for OutOfMemoryError
ssh root@${OFL_NODE1} "grep -c 'OutOfMemoryError' /storage/var/cassandra/logs/system.log 2>/dev/null || echo '0'"
# Check Log Insight JVM heap from runtime log
ssh root@${OFL_NODE1} "grep -i 'heap\|memory' /storage/var/loginsight/runtime.log | tail -10"
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Cassandra heap usage | < 75% of max | 75-90% | > 90% or OOM errors |
| GC pause duration | < 500ms | 500ms - 2s | > 2s (application stalls) |
| GC pause frequency | < 1 per minute | 1-5 per minute | > 5 per minute |
| OOM errors | 0 | -- | Any OOM errors |
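Heap utilisation as a percentage of max can be computed from the `nodetool info` heap line (a sketch; the sample value below represents a healthy node at roughly 70% of a 2 GB heap):

```shell
# Compute heap utilisation from a `nodetool info` heap line.
# Format: "Heap Memory (MB) : <used> / <max>"
heap_line="Heap Memory (MB) : 1433.60 / 2048.00"
heap_pct=$(printf '%s\n' "$heap_line" | \
  awk -F' : ' '{ split($2, a, " / "); printf "%.0f", a[1] * 100 / a[2] }')
echo "Cassandra heap: ${heap_pct}% of max"   # -> Cassandra heap: 70% of max
```

Compare the result to the bands above: under 75% PASS, 75-90% WARN, above 90% (or any OOM errors) FAIL.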
# Check disk I/O statistics
ssh root@${OFL_NODE1} "
echo '=== Disk I/O Stats (iostat) ==='
iostat -xz 1 3 | tail -10
echo ''
echo '=== Disk Queue Depth ==='
iostat -x | grep -E 'sdb|nvme' | awk '{print \$1, \"await:\" \$10 \"ms\", \"util:\" \$NF \"%\"}'
"
Expected output:
=== Disk I/O Stats (iostat) ===
Device r/s w/s rkB/s wkB/s await %util
sdb 45.23 128.67 2345.00 8765.00 2.34 18.56
=== Disk Queue Depth ===
sdb await: 2.34ms util: 18.56%
| Criteria | PASS | WARN | FAIL |
|---|---|---|---|
| Disk utilization (%util) | < 60% | 60-85% | > 85% |
| Average wait (await) | < 10ms | 10-50ms | > 50ms |
| I/O queue depth | < 4 | 4-16 | > 16 |
Remediation: Tune JVM heap settings in `/storage/var/cassandra/conf/cassandra-env.sh`. Restart Cassandra after changes.
The following table documents all network ports used by VCF Operations for Logs. Ensure firewall rules permit these ports between the listed source and destination components.
| Port | Protocol | Direction | Source | Destination | Purpose |
|---|---|---|---|---|---|
| 443 | TCP (HTTPS) | Inbound | Browsers, API clients | Ops for Logs VIP/Nodes | Web UI and REST API access |
| 80 | TCP (HTTP) | Inbound | Browsers | Ops for Logs VIP/Nodes | HTTP redirect to HTTPS |
| 9000 | TCP | Inbound | Ops for Logs agents | Ops for Logs VIP/Nodes | CFAPI log ingestion (non-TLS) |
| 9543 | TCP (TLS) | Inbound | Ops for Logs agents | Ops for Logs VIP/Nodes | CFAPI log ingestion (TLS) |
| 514 | TCP/UDP | Inbound | Syslog sources | Ops for Logs VIP/Nodes | Syslog ingestion (non-TLS) |
| 1514 | TCP | Inbound | Syslog sources | Ops for Logs VIP/Nodes | Syslog ingestion (alternate port) |
| 6514 | TCP (TLS) | Inbound | Syslog sources | Ops for Logs VIP/Nodes | Syslog ingestion (TLS) |
| 7000 | TCP | Inter-node | Ops for Logs Node | Ops for Logs Node | Cassandra inter-node gossip |
| 7001 | TCP (TLS) | Inter-node | Ops for Logs Node | Ops for Logs Node | Cassandra inter-node TLS gossip |
| 7199 | TCP | Inter-node | Ops for Logs Node | Ops for Logs Node | Cassandra JMX monitoring |
| 9042 | TCP | Inter-node | Ops for Logs Node | Ops for Logs Node | Cassandra CQL native transport |
| 9160 | TCP | Inter-node | Ops for Logs Node | Ops for Logs Node | Cassandra Thrift client (legacy) |
| 16520 | TCP | Inter-node | Ops for Logs Node | Ops for Logs Node | Cluster replication and sync |
| 16521 | TCP (TLS) | Inter-node | Ops for Logs Node | Ops for Logs Node | Cluster replication (TLS) |
| 123 | UDP | Outbound | Ops for Logs Nodes | NTP Server | Time synchronization |
| 53 | TCP/UDP | Outbound | Ops for Logs Nodes | DNS Server | DNS resolution |
| 389 | TCP | Outbound | Ops for Logs Nodes | LDAP/AD Server | LDAP authentication |
| 636 | TCP (TLS) | Outbound | Ops for Logs Nodes | LDAP/AD Server | LDAPS authentication |
| 25 | TCP | Outbound | Ops for Logs Nodes | SMTP Server | Email notifications/alerts |
| 587 | TCP (TLS) | Outbound | Ops for Logs Nodes | SMTP Server | Email (TLS STARTTLS) |
| 514/6514 | TCP | Outbound | Ops for Logs Nodes | Forwarding destination | Log forwarding (syslog) |
| 9543 | TCP (TLS) | Outbound | Ops for Logs Nodes | Forwarding destination | Log forwarding (CFAPI) |
| 443 | TCP (HTTPS) | Outbound | Ops for Logs Nodes | VCF Operations | Integration with Ops Manager |
| 443 | TCP (HTTPS) | Outbound | Ops for Logs Nodes | vCenter Server | vSphere integration |
| 443 | TCP (HTTPS) | Outbound | Ops for Logs Nodes | SDDC Manager | VCF lifecycle management |
| 443 | TCP (HTTPS) | Outbound | Ops for Logs Nodes | Workspace ONE Access | VIDM SSO authentication |
| 2049 | TCP | Outbound | Ops for Logs Nodes | NFS Server | Archive storage (NFS) |
# Verify all critical ports are listening on a node
ssh root@${OFL_NODE1} "
echo '=== Listening Ports ==='
ss -tuln | grep -E ':(443|80|9000|9543|514|1514|6514|7000|7199|9042|16520) ' | sort -t: -k2 -n
"
Expected output:
tcp LISTEN 0 128 *:80 *:*
tcp LISTEN 0 128 *:443 *:*
tcp LISTEN 0 128 *:514 *:*
tcp LISTEN 0 128 *:1514 *:*
tcp LISTEN 0 128 *:6514 *:*
tcp LISTEN 0 128 *:7000 *:*
tcp LISTEN 0 128 *:7199 *:*
tcp LISTEN 0 128 *:9000 *:*
tcp LISTEN 0 128 *:9042 *:*
tcp LISTEN 0 128 *:9543 *:*
tcp LISTEN 0 128 *:16520 *:*
# Check iptables rules (if applicable)
ssh root@${OFL_NODE1} "iptables -L -n --line-numbers 2>/dev/null | head -40 || echo 'iptables not active'"
# Test external connectivity to key ports
for PORT in 443 9000 9543 514 6514; do
nc -zv ${OFL_HOST} ${PORT} 2>&1 | grep -E 'succeeded|refused|timed'
done
This section provides detailed troubleshooting guidance for the most frequently encountered Ops for Logs problems.
Symptoms: systemctl status cassandra shows failed. Log queries return errors. Web UI shows "Service Unavailable".
Diagnosis:
# Check Cassandra system log for startup errors
ssh root@${OFL_NODE1} "tail -100 /storage/var/cassandra/logs/system.log | grep -i 'error\|exception\|fatal'"
# Check for commit log corruption
ssh root@${OFL_NODE1} "ls -la /storage/var/cassandra/commitlog/"
# Check disk space
ssh root@${OFL_NODE1} "df -h /storage/var"
Remediation:
- Move corrupt commit logs aside: `mkdir /tmp/corrupt-cl && mv /storage/var/cassandra/commitlog/CommitLog-*.log /tmp/corrupt-cl/`
- Review JVM settings in `/storage/var/cassandra/conf/cassandra-env.sh`
- Restart the service: `systemctl restart cassandra`
- Verify with `nodetool status` -- ensure all nodes rejoin
Symptoms: Slow queries, high disk I/O, increasing disk usage despite stable ingestion.
# Check compaction backlog
ssh root@${OFL_NODE1} "nodetool compactionstats"
# Check compaction throughput
ssh root@${OFL_NODE1} "nodetool getcompactionthroughput"
Remediation:
- Increase compaction throughput: `nodetool setcompactionthroughput 256` (default is 64 MB/s)
- Monitor progress: `watch -n 10 'nodetool compactionstats'`

Symptoms: `nodetool status` shows a node as DN. Cluster is degraded.
# Check connectivity to the down node
ping ${OFL_NODE2}
nc -zv ${OFL_NODE2} 7000
nc -zv ${OFL_NODE2} 9042
# Check logs on the down node
ssh root@${OFL_NODE2} "systemctl status cassandra && tail -50 /storage/var/cassandra/logs/system.log"
Remediation:
- Restart Cassandra on the down node: `systemctl restart cassandra`
- Watch the node rejoin with `nodetool status` (wait for UJ, then UN)

Symptoms: Missing logs in queries, ingestion EPS drops to zero or significantly below baseline, monitoring alerts on dropped events.
Diagnosis:
# Check for ingestion errors in runtime log
ssh root@${OFL_NODE1} "grep -i 'drop\|overflow\|backpressure\|reject' \
/storage/var/loginsight/runtime.log | tail -20"
# Check ingestion pipeline ports
ssh root@${OFL_NODE1} "ss -tuln | grep -E ':(514|1514|6514|9000|9543)'"
# Check stats API for dropped events
curl -sk -X GET "${OFL_API}/stats" \
-H "Authorization: Bearer ${TOKEN}" | jq '{droppedEvents, currentEventsPerSecond, queueDepth}'
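The `droppedEvents` counter is cumulative, so it is most useful as a rate relative to total ingestion. A small helper (an illustrative sketch) expresses dropped events as a percentage; feed it `droppedEvents` and `totalEventsIngested` from the `/stats` response:

```shell
# Dropped-event rate as a percentage of total events, to two decimal places.
# Usage: drop_rate_pct <droppedEvents> <totalEventsIngested>
drop_rate_pct() {
  awk -v d="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * d / t }'
}

# Hypothetical sample counters:
drop_rate_pct 1200 4500000   # -> 0.03
```

Any sustained non-zero rate warrants investigation; a rising rate between two samples indicates active backpressure.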
Symptoms: Ingestion halts, web UI errors, Cassandra write failures, df -h /storage/var shows > 95%.
Emergency Diagnosis:
# Identify what is consuming space
ssh root@${OFL_NODE1} "
df -h /storage/var
echo ''
du -sh /storage/var/*/ 2>/dev/null | sort -rh
echo ''
echo '=== Largest files ==='
find /storage/var -type f -size +1G -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head -10
"
Emergency remediation:
- Clear Cassandra snapshots: `nodetool clearsnapshot` -- can free significant space
- Delete core dumps: `find /storage/var -name "core.*" -delete`
- Clear stale Fluentd buffers: `rm -f /storage/var/fluentd/buffer/*.log`
- Expand the disk if more space was allocated: `growpart /dev/sdb 1 && resize2fs /dev/sdb1`

Symptoms: Two nodes claim to be master, data inconsistency between nodes, cluster API shows conflicting information.
Diagnosis:
# Check cluster state from each node
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
ssh root@${NODE} "curl -sk https://localhost/api/v1/cluster 2>/dev/null | python3 -m json.tool | grep -E 'role|status'"
echo ""
done
# Check Cassandra ring consistency
for NODE in ${OFL_NODE1} ${OFL_NODE2} ${OFL_NODE3}; do
echo "===== ${NODE} ====="
ssh root@${NODE} "nodetool describecluster | head -10"
echo ""
done
Remediation:
- Run `systemctl stop loginsight` on the node incorrectly claiming master
- Run `nodetool repair` on the remaining nodes

Symptoms: Browser SSL warnings, agent connection failures, API calls return TLS errors, forwarding breaks.
Diagnosis:
# Check certificate details
echo | openssl s_client -connect ${OFL_HOST}:443 2>&1 | grep -E 'Verify|depth|error|subject'
# Check certificate expiry
echo | openssl s_client -connect ${OFL_HOST}:443 2>/dev/null | openssl x509 -noout -dates
# Check if agents can connect (from an agent host)
openssl s_client -connect ${OFL_HOST}:9543 </dev/null 2>&1 | grep "Verify return code"
Remediation:
- Install the missing CA certificate via `POST /api/v1/ssl/ca`
- Review the certificates under `/etc/vmware/ssl/`
- Restart the web proxy (`systemctl restart httpd`) and verify agents reconnect
Symptoms: Agents showing DISCONNECTED status, gaps in log data from specific hosts, agent heartbeat timeouts.
Diagnosis (from the agent host):
# Check agent status on the remote host
ssh root@<agent-host> "systemctl status liagentd"
# Check agent log
ssh root@<agent-host> "tail -50 /var/log/liagent/liagent.log"
# Test connectivity to Ops for Logs
ssh root@<agent-host> "nc -zv ${OFL_HOST} 9543 && nc -zv ${OFL_HOST} 443"
# Check agent configuration
ssh root@<agent-host> "cat /var/lib/liagent/liagent.ini | grep -v '^;' | grep -v '^$'"
On ESXi hosts:
# Check ESXi syslog configuration
ssh root@<esxi-host> "esxcli system syslog config get"
# Check ESXi Log Insight agent
ssh root@<esxi-host> "esxcli software vib list | grep -i loginsight"
# Test connectivity from ESXi
ssh root@<esxi-host> "nc -zv ${OFL_HOST} 9543"
Remediation:
- Restart the agent: `systemctl restart liagentd`
- Update `liagent.ini` to point to the VIP: `hostname=ops-for-logs.vcf.local`
- On ESXi, update the agent VIB: `esxcli software vib update -d /path/to/VMware-loginsight-agent.zip`

This section provides a consolidated list of all CLI commands used throughout this handbook for quick reference.
| Command | Purpose |
|---|---|
| `systemctl status loginsight` | Check Log Insight daemon status |
| `systemctl status cassandra` | Check Cassandra service status |
| `systemctl status httpd` | Check Apache HTTPD status |
| `systemctl status fluentd` | Check Fluentd status |
| `systemctl restart loginsight` | Restart the Log Insight daemon |
| `systemctl restart cassandra` | Restart Cassandra |
| `systemctl restart httpd` | Restart Apache |
| `systemctl restart fluentd` | Restart Fluentd |
| `systemctl restart chronyd` | Restart NTP (chrony) |
| `journalctl -u loginsight --no-pager -n 100` | View recent Log Insight journal entries |
| `journalctl -u cassandra --no-pager -n 100` | View recent Cassandra journal entries |
| Command | Purpose |
|---|---|
| `nodetool status` | Show Cassandra ring status and node states |
| `nodetool info` | Show node info including heap memory |
| `nodetool compactionstats` | Show pending and active compactions |
| `nodetool getcompactionthroughput` | Show current compaction throughput limit |
| `nodetool setcompactionthroughput <MB/s>` | Set compaction throughput (e.g., 128 or 256) |
| `nodetool describecluster` | Show cluster name, snitch, and schema versions |
| `nodetool repair` | Run a repair on the local node |
| `nodetool clearsnapshot` | Clear all saved snapshots to free disk space |
| `nodetool tpstats` | Show thread pool statistics |
| `nodetool cfstats` | Show column family (table) statistics |
| `nodetool gcstats` | Show garbage collection statistics |
| Command | Purpose |
|---|---|
| `df -hT` | Show all filesystem usage with type |
| `df -h /storage/var` | Show /storage/var usage |
| `df -i /storage/var` | Show inode usage |
| `du -sh /storage/var/*/` | Show top-level directory sizes |
| `du -sh /storage/var/cassandra/data/` | Show Cassandra data size |
| `du -sh /storage/var/loginsight/` | Show Log Insight data size |
| `du -sh /storage/var/fluentd/buffer/` | Show Fluentd buffer size |
| `iostat -xz 1 3` | Show disk I/O statistics (3 samples) |
| Command | Purpose |
|---|---|
| `ss -tuln` | Show all listening TCP/UDP ports |
| `ss -tn` | Show all active TCP connections |
| `ss -s` | Show socket statistics summary |
| `nc -zv <host> <port>` | Test TCP connectivity to a specific port |
| `ping <host>` | Test ICMP reachability |
| `dig <fqdn>` | Forward DNS lookup |
| `dig +short <fqdn>` | Forward DNS lookup (short output) |
| `dig +short -x <ip>` | Reverse DNS lookup |
| `ip addr show` | Show network interface addresses |
| `arping -D -I eth0 <ip>` | Check for IP address conflicts |
| Command | Purpose |
|---|---|
| `openssl s_client -connect <host>:443` | Inspect the SSL certificate on port 443 |
| `openssl x509 -noout -subject -dates -issuer` | Parse certificate details (piped from s_client) |
| `openssl x509 -noout -enddate` | Show only the expiry date |
| `openssl s_client -connect <host>:443 -showcerts` | Show the full certificate chain |
| `openssl verify <cert.pem>` | Verify a certificate chain |
| `keytool -list -keystore <path> -storepass changeit` | List Java trust store contents |
| Command | Purpose |
|---|---|
| `chronyc tracking` | Show NTP tracking status |
| `chronyc sources -v` | Show NTP sources with details |
| `chronyc makestep` | Force an immediate time sync |
| `ntpq -p` | Show NTP peers (if using ntpd) |
| `date -u` | Show current UTC time |
| `timedatectl status` | Show time/date configuration |
| Command | Purpose |
|---|---|
| `ps aux --sort=-%cpu \| head -10` | Top 10 processes by CPU |
| `ps aux --sort=-%mem \| head -10` | Top 10 processes by memory |
| `free -m` | Show memory usage in MB |
| `uptime` | Show uptime and load average |
| `mpstat 1 3` | Show CPU statistics (3 samples) |
| `top -bn1 \| head -20` | One-shot top output |
| Log File | Purpose |
|---|---|
| `/storage/var/loginsight/runtime.log` | Main Ops for Logs application log |
| `/storage/var/cassandra/logs/system.log` | Cassandra system log |
| `/storage/var/cassandra/logs/gc.log` | Cassandra garbage collection log |
| `/var/log/httpd/error_log` | Apache error log |
| `/var/log/httpd/access_log` | Apache access log |
| `/var/log/fluentd/fluentd.log` | Fluentd log |
| `/var/log/liagent/liagent.log` | Log Insight agent log (on agent hosts) |
All API endpoints use the base URL https://<ops-for-logs-vip>/api/v1/. Authentication is required for most endpoints via the Authorization: Bearer <token> header.
# POST /api/v1/sessions -- Authenticate and obtain a session token
curl -sk -X POST "https://${OFL_HOST}/api/v1/sessions" \
-H "Content-Type: application/json" \
-d '{
"username": "admin",
"password": "<PASSWORD>",
"provider": "Local"
}'
# Response: { "sessionId": "<TOKEN>", "userId": "<UUID>", "ttl": 1800 }
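For scripting, the `sessionId` can be extracted into the `${TOKEN}` variable used throughout this handbook. A sed-based sketch for hosts without `jq`, assuming the response shape shown above (the sample values here are hypothetical):

```shell
# Extract sessionId from a saved login response without jq
resp='{ "sessionId": "abc123token", "userId": "b1946ac9", "ttl": 1800 }'
TOKEN=$(printf '%s' "$resp" | sed -n 's/.*"sessionId": *"\([^"]*\)".*/\1/p')
echo "TOKEN=${TOKEN}"   # -> TOKEN=abc123token
```

In practice, replace `resp` with the output of the `POST /api/v1/sessions` call above; with `jq` available, `jq -r '.sessionId'` is the simpler equivalent.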
# DELETE /api/v1/sessions/current -- Invalidate the current session
curl -sk -X DELETE "https://${OFL_HOST}/api/v1/sessions/current" \
-H "Authorization: Bearer ${TOKEN}"
# GET /api/v1/version -- Get product version info
curl -sk -X GET "https://${OFL_HOST}/api/v1/version" \
-H "Authorization: Bearer ${TOKEN}" | jq '.'
# Response: { "version": "9.0.0", "build": "12345678", "releaseName": "..." }
```shell
# GET /api/v1/cluster -- Get cluster configuration and node list
curl -sk -X GET "https://${OFL_HOST}/api/v1/cluster" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/cluster/status -- Get detailed cluster health status
curl -sk -X GET "https://${OFL_HOST}/api/v1/cluster/status" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/ilb -- Get ILB configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/ilb" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
```
```shell
# GET /api/v1/stats -- Get ingestion statistics
curl -sk -X GET "https://${OFL_HOST}/api/v1/stats" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
# Response: { "totalEventsIngested": N, "currentEventsPerSecond": N, "droppedEvents": N, ... }
```
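A useful derived metric from this response is the drop percentage. The sketch below computes it from stubbed stats JSON using the field names shown above; substitute the live curl response in practice.

```shell
# Derive a drop percentage from /api/v1/stats fields. The JSON is a
# stub; in practice pipe in the live response.
STATS='{"totalEventsIngested":5000000,"currentEventsPerSecond":1200,"droppedEvents":250}'
total=$(echo "$STATS" | sed -n 's/.*"totalEventsIngested":\([0-9]*\).*/\1/p')
dropped=$(echo "$STATS" | sed -n 's/.*"droppedEvents":\([0-9]*\).*/\1/p')
awk -v t="$total" -v d="$dropped" 'BEGIN { printf "drop rate: %.4f%%\n", (d / t) * 100 }'
```

Any non-zero drop rate during normal operation warrants investigation of ingestion capacity before it compounds.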
```shell
# POST /api/v1/events/stats -- Query historical ingestion statistics
curl -sk -X POST "https://${OFL_HOST}/api/v1/events/stats" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "ingestion_rate",
    "startTimeMillis": 1711411200000,
    "endTimeMillis": 1711497600000,
    "bucketDurationMinutes": 60
  }' | jq '.'

# POST /api/v1/events -- Search for events
curl -sk -X POST "https://${OFL_HOST}/api/v1/events" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "vmw_vc_*",
    "startTimeMillis": 1711411200000,
    "endTimeMillis": 1711497600000,
    "limit": 100
  }' | jq '.'
```
```shell
# POST /api/v1/events/ingest/0 -- Ingest events via API
curl -sk -X POST "https://${OFL_HOST}/api/v1/events/ingest/0" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "events": [
      {
        "text": "Test event from API",
        "source": "api-test",
        "fields": [{"name": "env", "content": "production"}]
      }
    ]
  }'
```
```shell
# GET /api/v1/forwarding -- List all forwarding destinations
curl -sk -X GET "https://${OFL_HOST}/api/v1/forwarding" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/forwarding/stats -- Get forwarding statistics
curl -sk -X GET "https://${OFL_HOST}/api/v1/forwarding/stats" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# POST /api/v1/forwarding -- Create a new forwarding destination
curl -sk -X POST "https://${OFL_HOST}/api/v1/forwarding" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "New-SIEM",
    "host": "siem.vcf.local",
    "port": 6514,
    "protocol": "SYSLOG",
    "transport": "TCP-TLS",
    "enabled": true,
    "filter": "*"
  }' | jq '.'
```
```shell
# GET /api/v1/content/contentpack/list -- List installed content packs
curl -sk -X GET "https://${OFL_HOST}/api/v1/content/contentpack/list" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/content/contentpack/marketplace -- Check marketplace for updates
curl -sk -X GET "https://${OFL_HOST}/api/v1/content/contentpack/marketplace" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/content/contentpack/autoupdate -- Check auto-update configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/content/contentpack/autoupdate" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# PUT /api/v1/content/contentpack/autoupdate -- Enable/disable auto-update
curl -sk -X PUT "https://${OFL_HOST}/api/v1/content/contentpack/autoupdate" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"autoUpdateEnabled": true, "checkIntervalHours": 24}' | jq '.'
```
```shell
# GET /api/v1/agent/stats -- Get agent summary statistics
curl -sk -X GET "https://${OFL_HOST}/api/v1/agent/stats" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/agent/agents -- List all agents
curl -sk -X GET "https://${OFL_HOST}/api/v1/agent/agents" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/agent/agents?status=DISCONNECTED -- List disconnected agents
curl -sk -X GET "https://${OFL_HOST}/api/v1/agent/agents?status=DISCONNECTED" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
```
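The disconnected-agent list can feed a simple count for alerting. The sketch below works from stubbed JSON; the schema (an `agents` array with a `status` field per agent) is an assumption, so adjust the match to the actual response shape.

```shell
# Count DISCONNECTED agents from a stubbed agent-list response.
# The "agents"/"status" schema here is assumed for illustration.
AGENTS='{"agents":[{"agentId":"a1","status":"DISCONNECTED"},{"agentId":"a2","status":"ACTIVE"}]}'
down=$(echo "$AGENTS" | grep -o '"status":"DISCONNECTED"' | wc -l | tr -d ' ')
echo "disconnected agents: ${down}"
```

A non-zero count should be cross-checked against planned decommissions before removing stale agents with the DELETE call below.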
```shell
# GET /api/v1/agent/groups -- List all agent groups
curl -sk -X GET "https://${OFL_HOST}/api/v1/agent/groups" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/agent/groups/<groupId> -- Get specific agent group configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/agent/groups/group-001" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# DELETE /api/v1/agent/agents/<agentId> -- Remove a stale agent
curl -sk -X DELETE "https://${OFL_HOST}/api/v1/agent/agents/<agentId>" \
  -H "Authorization: Bearer ${TOKEN}"
```
```shell
# GET /api/v1/integration/vrops -- Check VCF Operations integration status
curl -sk -X GET "https://${OFL_HOST}/api/v1/integration/vrops" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# GET /api/v1/auth/providers -- List authentication providers
curl -sk -X GET "https://${OFL_HOST}/api/v1/auth/providers" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
```
```shell
# GET /api/v1/ssl -- Get current SSL certificate information
curl -sk -X GET "https://${OFL_HOST}/api/v1/ssl" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
```
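Independently of the API, the certificate served on the VIP can be inspected with `openssl`. The sketch below demonstrates the expiry extraction against a throwaway self-signed certificate so it is self-contained; against the live VIP, feed in the server certificate instead (e.g. `echo | openssl s_client -connect <vip>:443 2>/dev/null | openssl x509 -noout -enddate`).

```shell
# Extract a certificate's expiry date. A throwaway self-signed cert is
# generated here only so the sketch runs standalone; on a live check,
# use the certificate presented by the VIP instead.
CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -keyout /dev/null -out "$CERT" \
  -days 30 -nodes -subj "/CN=expiry-test" 2>/dev/null
expiry=$(openssl x509 -noout -enddate -in "$CERT" | cut -d= -f2)
echo "certificate expires: ${expiry}"
rm -f "$CERT"
```

Comparing this against the API's `/api/v1/ssl` view confirms the renewed certificate is actually the one being served.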
```shell
# POST /api/v1/ssl/ca -- Upload a custom CA certificate
curl -sk -X POST "https://${OFL_HOST}/api/v1/ssl/ca" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"certificate": "<PEM-encoded-CA-cert>"}' | jq '.'

# PUT /api/v1/ssl -- Replace the server certificate
curl -sk -X PUT "https://${OFL_HOST}/api/v1/ssl" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "certificate": "<PEM-encoded-cert>",
    "privateKey": "<PEM-encoded-key>",
    "certificateChain": "<PEM-encoded-chain>"
  }' | jq '.'
```
```shell
# GET /api/v1/backup -- Get backup configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/backup" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# POST /api/v1/backup/trigger -- Trigger an immediate backup
curl -sk -X POST "https://${OFL_HOST}/api/v1/backup/trigger" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# PUT /api/v1/backup -- Configure backup settings
curl -sk -X PUT "https://${OFL_HOST}/api/v1/backup" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "schedule": "DAILY",
    "retentionCount": 7,
    "backupDestination": "/storage/var/loginsight/backups"
  }' | jq '.'
```
```shell
# GET /api/v1/time/config -- Get retention configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/time/config" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'

# PUT /api/v1/time/config -- Update retention settings
curl -sk -X PUT "https://${OFL_HOST}/api/v1/time/config" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"retentionPeriod": 30}' | jq '.'

# GET /api/v1/archive -- Get archive configuration
curl -sk -X GET "https://${OFL_HOST}/api/v1/archive" \
  -H "Authorization: Bearer ${TOKEN}" | jq '.'
```
VCF Operations for Logs Health Check Handbook
Version 1.0 -- March 2026
Copyright 2026 Virtual Control LLC. All rights reserved.
This document is intended for internal use by authorized personnel only.
For questions, updates, or feedback regarding this handbook, contact the VCF operations team.