VMware Cloud Foundation 9.0 | Broadcom
Author: Virtual Control LLC Copyright: © 2026 Virtual Control LLC. All rights reserved. Version: 1.0 — March 2026 Classification: Internal / Professional Reference
VCF Operations is the unified monitoring, capacity planning, and optimization platform for VMware Cloud Foundation environments. It provides real-time analytics across compute, storage, networking, and application layers, delivering intelligent workload placement, proactive alerting, and capacity forecasting to ensure the health and efficiency of your private cloud infrastructure.
The product now known as VCF Operations has undergone several name changes as VMware's portfolio evolved and Broadcom completed its acquisition of VMware. Understanding this lineage is critical when referencing older documentation, KB articles, and community resources.
| Year | Product Name | Context |
|---|---|---|
| 2012 | vCenter Operations Manager | Initial release as a standalone vCenter companion for performance analytics. |
| 2015 | vRealize Operations Manager (vROps) | Rebranded under the vRealize Suite umbrella. Versions 6.x through 8.x carried this name. This is the name most widely recognized in the VMware community. |
| 2022 | VMware Aria Operations | VMware unified its cloud management portfolio under the "Aria" brand. vRealize Operations Manager 8.10+ became Aria Operations. |
| 2024 | VCF Operations | Following the Broadcom acquisition of VMware, the Aria brand was retired. All products were realigned under the VMware Cloud Foundation (VCF) umbrella. VCF Operations is the current and official name. |
Important: When searching VMware Knowledge Base articles, use all three historical names —
vRealize Operations, Aria Operations, and VCF Operations — to ensure complete coverage of relevant results. Many KB articles have not yet been updated to reflect the latest naming.
The underlying technology, architecture, and API surface remain consistent across these name changes. A deployment upgraded from vRealize Operations 8.6 through Aria Operations 8.14 to VCF Operations 8.18.2 retains its full configuration, dashboards, super metrics, and alert definitions without requiring re-creation.
VCF Operations is not a single appliance — it is the anchor product within a broader operations suite. The following table lists all products in the VCF Operations family as of VCF 9.0:
| Product Name | Description | VCF 9.0 Version | Deployment Model |
|---|---|---|---|
| VCF Operations | Performance monitoring, capacity planning, optimization, and workload balancing for the entire VCF stack. | 8.18.2 | OVA appliance (analytics cluster) |
| VCF Operations for Logs | Centralized log management and analytics. Collects, indexes, and analyzes syslog and log data from all VCF components. | 8.18.2 | OVA appliance (standalone or cluster) |
| VCF Operations for Networks | Deep network visibility, micro-segmentation analytics, traffic flow analysis, and network topology mapping. Integrates with NSX. | 6.14 | OVA appliance |
| VCF Suite Lifecycle Manager | Lifecycle management for the VCF Operations suite. Handles deployment, upgrades, patching, and configuration drift for all suite products. | 8.18 | Embedded in SDDC Manager or standalone OVA |
| Cloud Proxies | Lightweight collectors deployed in remote sites or workload domains to collect data and forward it to the central analytics cluster. | Bundled with VCF Operations | OVA appliance (minimal footprint) |
All products in the suite share a common authentication framework, can be cross-launched from one another, and are managed through a unified lifecycle workflow in SDDC Manager.
VCF Operations employs a distributed analytics architecture designed for horizontal scalability and high availability. The architecture consists of the following tiers:
The analytics cluster is the core of VCF Operations. It processes incoming metrics, computes dynamic thresholds, evaluates alert conditions, and serves the user interface. A production analytics cluster consists of:
Collectors are responsible for gathering metrics from monitored endpoints and delivering them to the analytics cluster:
Adapters are modular plugins that define how VCF Operations connects to and collects data from specific endpoint types. Each adapter type understands the API, object model, and metrics of its target system. Key built-in adapters include:
Management packs extend VCF Operations with adapters, dashboards, alert definitions, and reports for third-party and additional VMware products. They are installed through the product UI or via the VCF Suite Lifecycle Manager.
In VCF 9.0, VCF Operations is a mandatory first-class component of the Cloud Foundation stack, not an optional add-on. Its integration with the broader VCF platform is deep and bidirectional:
VCF 9.0 introduces Fleet Manager as the next-generation deployment and lifecycle orchestrator. Fleet Manager automates the full VCF Operations deployment workflow:
SDDC Manager provides ongoing lifecycle management for VCF Operations:
In VCF 9.0, VCF Operations monitoring is enabled by default for every workload domain. When a new workload domain is created, SDDC Manager automatically:
This ensures complete observability from the moment a workload domain becomes operational, with no manual adapter configuration required.
Proper sizing of VCF Operations is critical to achieving reliable performance, accurate analytics, and timely alerting. Under-sizing leads to collection lag, delayed alerts, and UI timeouts. Over-sizing wastes management cluster resources. This chapter provides the authoritative sizing tables and prerequisite requirements.
VCF Operations uses five distinct node types. Each serves a specific role in the analytics architecture:
| Node Type | Role | Required? | Quantity |
|---|---|---|---|
| Primary | Master controller, xDB partition owner, API gateway, UI host. First node deployed. | Yes (exactly 1) | 1 |
| Replica | Hot standby for the primary node. Maintains a synchronized copy of all primary data and configuration. Automatically promoted if the primary fails. | No (required for HA) | 0 or 1 |
| Data | Expands analytics processing capacity and xDB storage. Added in pairs for balanced data distribution. | No (for scale-out) | 0, 2, 4, 6, or 8 |
| Remote Collector | Lightweight forwarder deployed near monitored endpoints. Collects metrics and sends them to the analytics cluster. Stores no data locally. | No (for remote sites) | 0+ |
| Cloud Proxy | Specialized remote collector for cloud-connected services and SaaS integrations. | No (for cloud use cases) | 0+ |
The VCF Operations OVA is deployed with one of five predefined size profiles. The size is selected during OVA deployment and cannot be changed after deployment without redeploying the node.
| Size | vCPUs | Memory (GB) | Disk (GB) | Maximum Objects | Use Case |
|---|---|---|---|---|---|
| Extra Small | 4 | 16 | 282 | 1,200 | Lab and proof-of-concept environments. Not recommended for production. |
| Small | 8 | 32 | 474 | 4,000 | Small production environments. Single workload domain with limited VM count. |
| Medium | 16 | 48 | 898 | 16,000 | Mid-size production environments. Multiple workload domains. Most common production size. |
| Large | 32 | 128 | 2,026 | 50,000 | Large enterprise environments. Many workload domains, multiple vCenter instances. |
| Extra Large | 48 | 512 | 4,014 | 100,000 | Very large enterprise or service-provider environments. Requires data nodes for scale. |
Note: The "Maximum Objects" column refers to the total count of monitored objects across all adapters — VMs, hosts, datastores, clusters, port groups, NSX objects, vSAN objects, and any objects from management packs. Use the formula:
Total Objects ≈ (VMs × 1.0) + (Hosts × 3.5) + (Clusters × 2.0) + (Datastores × 1.0) as a rough estimation starting point.
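The estimation formula and the size-profile ceilings above can be combined into a small sizing helper. This is an illustrative sketch: the weights and ceilings come from this chapter, while the 30% growth headroom factor is an assumption you should adjust to your own planning policy.

```python
# Rough object-count estimator for choosing a VCF Operations size profile.
# Weights follow the estimation formula above; the profile ceilings come
# from the "Maximum Objects" column of the size-profile table.

PROFILE_CEILINGS = [
    ("Extra Small", 1_200),
    ("Small", 4_000),
    ("Medium", 16_000),
    ("Large", 50_000),
    ("Extra Large", 100_000),
]

def estimate_objects(vms: int, hosts: int, clusters: int, datastores: int) -> float:
    """Apply the rough weighting: VMs x1.0, hosts x3.5, clusters x2.0, datastores x1.0."""
    return vms * 1.0 + hosts * 3.5 + clusters * 2.0 + datastores * 1.0

def recommend_profile(total_objects: float, headroom: float = 0.3) -> str:
    """Pick the smallest profile whose ceiling covers the estimate plus growth headroom.

    The 30% headroom is an illustrative assumption, not a product requirement.
    """
    needed = total_objects * (1 + headroom)
    for name, ceiling in PROFILE_CEILINGS:
        if needed <= ceiling:
            return name
    return "Extra Large + data nodes"

if __name__ == "__main__":
    total = estimate_objects(vms=3_000, hosts=120, clusters=12, datastores=80)
    print(f"Estimated objects: {total:.0f} -> {recommend_profile(total)}")
```

For example, an environment with 3,000 VMs, 120 hosts, 12 clusters, and 80 datastores estimates to roughly 3,500 objects, which lands in the Medium profile once headroom is applied.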
VCF Operations supports three cluster deployment models:
Critical Limitation: You cannot convert a Simple (single-node) deployment to an HA or CA deployment in place. The replica and data nodes must be deployed as fresh OVAs and joined to the primary, and a node that was initialized as a standalone instance cannot retroactively gain HA without redeploying the primary. Plan your cluster model before initial deployment.
Remote Collectors are deployed as separate OVAs with their own sizing profiles, independent of the analytics cluster node sizing:
| Size | vCPUs | Memory (GB) | Disk (GB) | Max Adapters | Max Objects | Use Case |
|---|---|---|---|---|---|---|
| Standard | 2 | 4 | 20 | 5 | 1,500 | Small remote sites, single vCenter endpoint. |
| Large | 4 | 16 | 20 | 15 | 10,000 | Large remote sites, multiple endpoints, or high-frequency collection. |
The VCF Operations appliance uses multiple disk partitions to separate data by function. Understanding these partitions is essential for troubleshooting disk-space alerts and planning NFS extension:
| Mount Point | Purpose | Grows With |
|---|---|---|
| `/` | Root filesystem. Operating system, appliance binaries, configuration files. | Static — does not grow significantly. |
| `/storage/db` | xDB distributed datastore. Primary storage for all collected metrics, properties, relationships, and computed analytics. | Object count and retention period. This is the largest and fastest-growing partition. |
| `/storage/log` | Application log files for all VCF Operations services. | Activity level and log verbosity settings. |
| `/storage/core` | Core dump files generated during application crashes. | Only grows when crashes occur. |
| `/storage/nfs` | Optional NFS mount point for offloading historical data or report storage. | Configured capacity of the NFS share. |
| `/storage/vcops/backup` | Local backup storage. Used by the built-in backup mechanism for configuration and data snapshots. | Backup frequency and retention count. |
Best Practice: Monitor disk usage on `/storage/db` closely. When this partition reaches 90% utilization, VCF Operations triggers a critical alert and may begin dropping the oldest data to prevent total disk exhaustion. Extend this partition by adding an NFS datastore or by deploying additional data nodes to distribute the storage load.
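A minimal sketch of the utilization check described above, for use in an external monitoring script. The 90% critical threshold comes from the text; the 80% early-warning level is an illustrative assumption.

```python
# Check /storage/db utilization against the 90% critical threshold
# described above. Run on (or against) the appliance filesystem.

import shutil

CRITICAL_PCT = 90.0
WARNING_PCT = 80.0   # assumed early-warning level, not a documented product value

def usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Percentage of the partition in use."""
    return 100.0 * used_bytes / total_bytes

def classify(path: str = "/storage/db") -> str:
    """Return 'critical', 'warning', or 'ok' for the partition holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: (total, used, free)
    pct = usage_percent(usage.used, usage.total)
    if pct >= CRITICAL_PCT:
        return "critical"  # VCF Operations may start dropping the oldest data
    if pct >= WARNING_PCT:
        return "warning"
    return "ok"
```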
The VCF Operations HTML5 UI is supported on the following browsers:
| Browser | Minimum Version | Notes |
|---|---|---|
| Google Chrome | 100+ | Recommended browser. Best performance and rendering. |
| Mozilla Firefox | 100+ | Fully supported. |
| Microsoft Edge (Chromium) | 100+ | Fully supported. Legacy Edge (EdgeHTML) is not supported. |
| Component | Supported Versions |
|---|---|
| ESXi | 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3 |
| vCenter Server | 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3 |
| Virtual Hardware Version | 19 (ESXi 7.0 U3) or 20/21 (ESXi 8.0+) |
VCF Operations requires specific network ports to be open between its nodes, monitored endpoints, and consuming services. Failure to open the correct ports results in collection failures, cluster communication breakdowns, or inaccessible UI. This chapter provides the complete port reference.
These ports must be open to the VCF Operations analytics cluster nodes from clients and external systems:
| Port | Protocol | Source | Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | Admin workstations, API clients, SDDC Manager, VCF Operations for Logs | VCF Operations cluster nodes | Primary UI access, REST API, Suite API, adapter data reception from Remote Collectors. This is the single most critical port. |
| 8543 | TCP (HTTPS) | Legacy API clients | VCF Operations cluster nodes | Legacy vRealize Operations API endpoint. Maintained for backward compatibility with older integrations and scripts. Deprecated — migrate to port 443. |
| 443 | TCP (HTTPS) | Remote Collectors, Cloud Proxies | VCF Operations cluster nodes | Data forwarding from remote collectors to the analytics cluster. Remote collectors push collected metrics to the cluster over this port. |
These ports must be open from the VCF Operations analytics cluster nodes (and Remote Collectors) to external endpoints:
| Port | Protocol | Source | Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | vCenter Server | vCenter adapter data collection. Retrieves VM, host, cluster, datastore, and resource pool metrics via the vSphere API. |
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | NSX Manager | NSX adapter data collection. Retrieves transport node, logical switch, edge, and firewall metrics. |
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | SDDC Manager | SDDC Manager adapter. Retrieves workload domain configuration, lifecycle events, and compliance status. |
| 443 | TCP (HTTPS) | VCF Operations nodes | Broadcom repository (online) | Downloading upgrade bundles, management packs, and content updates when connected to the internet. |
| 514 | TCP/UDP | VCF Operations nodes | Syslog server / VCF Operations for Logs | Forwarding VCF Operations application logs to a centralized syslog collector. |
| 25 | TCP (SMTP) | VCF Operations nodes | Mail server | Sending email notifications for alert triggers. Unencrypted SMTP. |
| 587 | TCP (SMTP/TLS) | VCF Operations nodes | Mail server | Sending email notifications for alert triggers over TLS-encrypted SMTP. Preferred over port 25. |
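The notification path in the table above (TLS-encrypted SMTP on port 587) can be sketched as follows. Host names, sender/recipient addresses, and credentials are placeholders, not values from this document.

```python
# Sketch: sending an alert notification over TLS-encrypted SMTP (port 587),
# the preferred option per the outbound-port table. All endpoints/credentials
# below are placeholders.

import smtplib
from email.message import EmailMessage

def build_alert_email(alert_name: str, severity: str, resource: str) -> EmailMessage:
    """Assemble a simple alert-notification message."""
    msg = EmailMessage()
    msg["From"] = "vcf-ops@lab.local"
    msg["To"] = "vmware-admins@lab.local"
    msg["Subject"] = f"[{severity}] {alert_name} on {resource}"
    msg.set_content(f"VCF Operations raised '{alert_name}' ({severity}) on {resource}.")
    return msg

def send_alert(msg: EmailMessage) -> None:
    # STARTTLS on port 587 upgrades the connection before credentials are sent,
    # avoiding the unencrypted port-25 path.
    with smtplib.SMTP("mail.lab.local", 587, timeout=10) as smtp:
        smtp.starttls()
        smtp.login("notifier", "placeholder-password")
        smtp.send_message(msg)
```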
These ports are used for communication between VCF Operations cluster nodes (primary, replica, and data nodes). They must be open bidirectionally between all cluster members:
| Port | Protocol | Purpose |
|---|---|---|
| 7001 | TCP | Cassandra (xDB) inter-node communication. Handles data replication, partition synchronization, and consistency management between cluster nodes. |
| 1300–1399 | TCP | GemFire distributed cache. Used for in-memory data grid communication, cache replication, and cluster state synchronization. The exact port within this range is assigned dynamically. |
| 10002 | TCP | Cluster controller RPC. The primary node uses this port to coordinate cluster operations — node joins, failover decisions, and configuration propagation. |
| 20002 | TCP | Analytics data synchronization. Distributes computed analytics results (dynamic thresholds, scores, capacity projections) across all cluster nodes. |
| 20003 | TCP | Cluster heartbeat. Used by the cluster health monitor to detect node failures. A missed heartbeat sequence triggers failover procedures. |
| 4369 | TCP | Erlang Port Mapper Daemon (epmd). Used by the RabbitMQ message broker embedded in each node for inter-node message routing. |
Important: All cluster-internal ports must have low latency (< 1 ms round-trip) and high bandwidth (1 Gbps minimum). Cluster nodes should not be separated by WAN links, firewalls with deep packet inspection, or load balancers. Place all cluster nodes on the same VLAN or Layer 2 segment.
These ports are bound to 127.0.0.1 (localhost) on each VCF Operations node. They do not require firewall rules because they are not accessible from the network. They are documented here for troubleshooting and security audit purposes:
| Port | Protocol | Purpose |
|---|---|---|
| 5433 | TCP | vPostgres embedded database. Stores appliance configuration, user accounts, roles, policies, and alert definitions. Not used for metric storage (that is xDB). |
| 8080 | TCP (HTTP) | Internal CaSA (Collector and Storage Aggregator) service. Handles internal metric routing between collector threads and the storage layer. |
| 9090 | TCP (HTTP) | Internal admin/health-check endpoint. Used by the appliance self-monitoring watchdog to verify service health. |
Remote Collectors have a simplified port profile because they do not run analytics or store data:
| Port | Protocol | Direction | Source → Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | Outbound | Remote Collector → VCF Operations cluster | Forwarding collected metrics, properties, and relationship data to the analytics cluster. |
| 443 | TCP (HTTPS) | Outbound | Remote Collector → Monitored endpoints (vCenter, NSX, etc.) | Collecting data from monitored endpoints. The Remote Collector initiates all connections — endpoints never connect inbound to the collector. |
Remote Collectors do not expose any inbound ports. All communication is initiated outbound by the collector. This makes Remote Collectors ideal for deployment in DMZ or restricted network zones where inbound connections are prohibited.
When creating firewall rules for VCF Operations, follow these best practices:
Use FQDNs, not IP addresses, in firewall rules where possible. VCF Operations nodes may change IP addresses during disaster recovery or migration. FQDN-based rules are more resilient.
Restrict source addresses. Do not use any as the source for inbound port 443. Limit access to known admin workstation subnets, SDDC Manager, and Remote Collector IP ranges.
Enable stateful inspection. All VCF Operations connections are TCP-based and work correctly with stateful firewalls. Stateful inspection ensures return traffic is automatically permitted.
Do not use SSL decryption/inspection on traffic between VCF Operations cluster nodes. SSL interception between cluster members causes certificate validation failures and breaks cluster communication.
Test connectivity before deployment. Use curl -v https://<target>:443 from the VCF Operations node to verify that each required port is reachable before configuring adapters. Connection failures after adapter configuration are difficult to distinguish from credential or API errors.
Document all rules. Maintain a port matrix document that maps each firewall rule to its VCF Operations purpose. This accelerates troubleshooting when connectivity issues arise during maintenance windows or network changes.
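The pre-deployment connectivity test recommended above can also be scripted. This sketch checks plain TCP reachability only (it does not validate certificates or credentials); the host names in the list are illustrative examples drawn from this chapter's tables.

```python
# Pre-deployment connectivity check: verify each required TCP port is
# reachable before configuring adapters. Hosts below are placeholders.

import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

REQUIRED = [
    ("vcenter.lab.local", 443),  # vCenter adapter collection
    ("nsx.lab.local", 443),      # NSX adapter collection
    ("mail.lab.local", 587),     # TLS SMTP notifications
]

if __name__ == "__main__":
    for host, port in REQUIRED:
        state = "open" if is_port_open(host, port) else "BLOCKED or unreachable"
        print(f"{host}:{port} -> {state}")
```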
This chapter covers all deployment methods for VCF Operations — from the fully automated VCF 9.0 workflow to manual OVA deployment via the vSphere Client and command-line tools. Regardless of the deployment method, the end result is the same: a running VCF Operations appliance ready for initial configuration.
In a VCF 9.0 environment, VCF Operations deployment is orchestrated by SDDC Manager and Fleet Manager. This is the recommended deployment method for production VCF environments because it ensures consistency with the VCF Bill of Materials and integrates VCF Operations into the overall lifecycle management framework.
Prerequisite Validation — SDDC Manager validates that the management cluster has sufficient capacity (CPU, memory, storage) to host the VCF Operations appliance at the specified size.
OVA Acquisition — Fleet Manager retrieves the VCF Operations OVA from the configured software depot. In connected environments, this is the Broadcom online repository. In air-gapped environments, the OVA must be pre-staged in the local SDDC Manager depot.
OVA Deployment — Fleet Manager deploys the OVA to the management cluster's designated resource pool and datastore. Network configuration (IP, subnet, gateway, DNS, NTP) is injected via OVF properties derived from the deployment specification.
Appliance Boot and Self-Configuration — The appliance boots, applies the network configuration, generates initial self-signed certificates, and starts all core services. This takes approximately 10–15 minutes.
Registration — Fleet Manager registers the VCF Operations instance with SDDC Manager. This enables ongoing lifecycle management (upgrades, certificate rotation, health monitoring).
Initial Adapter Configuration — SDDC Manager automatically configures vCenter and NSX adapter instances for the management domain. If additional workload domains exist, adapters are configured for those as well.
Validation — Fleet Manager runs a post-deployment health check to confirm that all services are running, the UI is accessible, and initial data collection has started.
Note: The automated deployment flow always deploys a Medium-sized OVA by default. To override the size, edit the deployment specification JSON before initiating the workflow. Consult the SDDC Manager API documentation for the exact parameter path.
For environments not using VCF 9.0 automation, or when deploying additional nodes (replica, data), the OVA can be deployed manually through the vSphere Client.
Step 1 — Download the OVA
Download the VCF Operations OVA file from the Broadcom Support Portal (support.broadcom.com). Navigate to VMware Cloud Foundation → VCF Operations → Downloads. Select the version matching your VCF Bill of Materials.
Step 2 — Launch the Deploy OVF Template Wizard
- Log in to the vSphere Client (https://<vcenter-fqdn>/ui).
- Right-click the target cluster or host and select Deploy OVF Template.

Step 3 — Select the OVA Source
- Choose Local file and browse to the downloaded .ova file.

Step 4 — Name and Location
- Enter a VM name, e.g. vcf-operations-primary-01, and select the target folder.

Step 5 — Select a Compute Resource
- Select the management cluster or a specific host.

Step 6 — Review Details
- Verify the publisher, product, version, and size information before proceeding.

Step 7 — Configuration (Size Selection)
- Select the deployment size: Extra Small, Small, Medium, Large, or Extra Large.

Step 8 — Select Storage
- Select the target datastore and storage policy (e.g., vSAN Default Storage Policy or a custom policy).

Step 9 — Select Networks
- Map Network 1 to the target port group on the management VLAN.

Step 10 — Customize Template (OVF Properties)

This is the most critical page. Enter the following values:
| Property | Value | Notes |
|---|---|---|
| Hostname | `vcf-ops-01.lab.local` | Must match the DNS A record. FQDN is permanent. |
| IP Address | `10.0.10.50` | Static IP on the management VLAN. |
| Subnet Mask | `255.255.255.0` | Matches the management VLAN subnet. |
| Default Gateway | `10.0.10.1` | Management VLAN gateway. |
| DNS Server(s) | `10.0.10.10` | Comma-separated if multiple. |
| Domain Name | `lab.local` | DNS search domain. |
| NTP Server(s) | `10.0.10.10` | Must match the NTP source used by vCenter and ESXi. |
| Admin Password | (strong password) | Password for the admin user account. |
| Root Password | (strong password) | Password for the Linux root user on the appliance. |
Step 11 — Ready to Complete
- Review the summary of all settings and click Finish to start the deployment.

Step 12 — Power On
- When the deployment task completes, power on the VM and wait for first-boot initialization to finish (approximately 10–15 minutes).
For automated or scripted deployments, use the VMware ovftool command-line utility. This is useful for deploying multiple nodes in a cluster or for integrating VCF Operations deployment into infrastructure-as-code pipelines.
```shell
ovftool \
--name="vcf-operations-primary-01" \
--deploymentOption="medium" \
--diskMode="thin" \
--datastore="vsanDatastore" \
--network="Management-PG" \
--acceptAllEulas \
--allowExtraConfig \
--powerOn \
--prop:vami.DNS.VMware_Aria_Operations="10.0.10.10" \
--prop:vami.gateway.VMware_Aria_Operations="10.0.10.1" \
--prop:vami.ip0.VMware_Aria_Operations="10.0.10.50" \
--prop:vami.netmask0.VMware_Aria_Operations="255.255.255.0" \
--prop:vami.hostname="vcf-ops-01.lab.local" \
--prop:vami.NTP.VMware_Aria_Operations="10.0.10.10" \
--prop:vami.domain.VMware_Aria_Operations="lab.local" \
--prop:guestinfo.cis.appliance.root.password="VMware123!" \
--prop:guestinfo.cis.appliance.ssh.enabled="True" \
/path/to/vcf-operations-8.18.2.ova \
"vi://administrator@vsphere.local:password@vcenter.lab.local/Datacenter/host/Management-Cluster"
```
| Parameter | Description |
|---|---|
| `--deploymentOption` | OVA size profile: xsmall, small, medium, large, xlarge. |
| `--diskMode` | Disk provisioning: thin (recommended) or thick. |
| `--datastore` | Target datastore name on the destination host/cluster. |
| `--network` | Port group name to map Network 1 to. |
| `--prop:vami.DNS.*` | DNS server IP(s). |
| `--prop:vami.gateway.*` | Default gateway IP. |
| `--prop:vami.ip0.*` | Static IP address for the appliance. |
| `--prop:vami.netmask0.*` | Subnet mask. |
| `--prop:vami.hostname` | FQDN for the appliance. Must have a matching DNS record. |
| `--prop:vami.NTP.*` | NTP server IP(s). |
| `--powerOn` | Automatically power on the VM after deployment. |
Note: The OVF property names reference `VMware_Aria_Operations` because the OVA internal metadata still uses the Aria Operations naming convention. This does not affect functionality — it is simply the OVF property namespace.
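When deploying several nodes (primary, replica, data) from a pipeline, it can help to assemble the ovftool invocation programmatically rather than hand-editing shell scripts. The sketch below builds the same argument list as the command above; all values in the example dict are lab placeholders.

```python
# Build an ovftool argument list for a VCF Operations node from a config dict.
# Property names keep the VMware_Aria_Operations namespace used by the OVA.

NS = "VMware_Aria_Operations"

def build_ovftool_args(cfg: dict) -> list[str]:
    """Return the ovftool command as an argv list suitable for subprocess.run."""
    return [
        "ovftool",
        f"--name={cfg['name']}",
        f"--deploymentOption={cfg['size']}",
        "--diskMode=thin",
        f"--datastore={cfg['datastore']}",
        f"--network={cfg['network']}",
        "--acceptAllEulas",
        "--allowExtraConfig",
        "--powerOn",
        f"--prop:vami.ip0.{NS}={cfg['ip']}",
        f"--prop:vami.netmask0.{NS}={cfg['netmask']}",
        f"--prop:vami.gateway.{NS}={cfg['gateway']}",
        f"--prop:vami.DNS.{NS}={cfg['dns']}",
        f"--prop:vami.NTP.{NS}={cfg['ntp']}",
        f"--prop:vami.hostname={cfg['fqdn']}",
        cfg["ova_path"],
        cfg["target_uri"],
    ]

# Example: a replica node that reuses the primary's network settings.
replica = {
    "name": "vcf-operations-replica-01", "size": "medium",
    "datastore": "vsanDatastore", "network": "Management-PG",
    "ip": "10.0.10.51", "netmask": "255.255.255.0", "gateway": "10.0.10.1",
    "dns": "10.0.10.10", "ntp": "10.0.10.10", "fqdn": "vcf-ops-02.lab.local",
    "ova_path": "/path/to/vcf-operations-8.18.2.ova",
    "target_uri": "vi://administrator@vsphere.local:password@vcenter.lab.local"
                  "/Datacenter/host/Management-Cluster",
}
# import subprocess; subprocess.run(build_ovftool_args(replica), check=True)
```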
After the appliance boots and completes its initial self-configuration, the Virtual Appliance Management Interface (VAMI) is available for administrative tasks.
Open a browser and navigate to:
https://<node-fqdn>/admin
Log in with the admin user account and the password specified during OVA deployment.
Historical Note: In older versions (vRealize Operations 6.x–8.x), the VAMI was accessed on port 5480 (https://<node-fqdn>:5480). In current versions, the VAMI is integrated into the main web interface at the /admin path on port 443.
The VAMI provides appliance-level administrative functions such as network reconfiguration, certificate management, service control, and update staging.
On the first login to the VCF Operations UI (https://<node-fqdn>/ui), the Initial Setup Wizard guides you through the essential configuration. The wizard consists of seven steps, including setting the password for the admin user account (Step 4). After the wizard completes, the VCF Operations login page is displayed. Log in with admin and the password set in Step 4. The system is now ready for adapter configuration and monitoring setup (covered in subsequent chapters).
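Once the wizard completes, the appliance can also be driven programmatically through the Suite API on port 443. The sketch below builds the token-acquisition request used by the long-standing vRealize/Aria Operations Suite API; the FQDN and credentials are placeholders, and a live call requires trusting the appliance's (initially self-signed) certificate.

```python
# Build (and optionally send) a Suite API token-acquisition request.
# FQDN and credentials below are lab placeholders.

import json
import urllib.request

def build_token_request(fqdn: str, username: str, password: str) -> urllib.request.Request:
    """Prepare a POST to the Suite API token-acquire endpoint."""
    url = f"https://{fqdn}/suite-api/api/auth/token/acquire"
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )

req = build_token_request("vcf-ops-01.lab.local", "admin", "placeholder-password")
# On a live system (with the certificate trusted):
#   token = json.load(urllib.request.urlopen(req))["token"]
# Subsequent calls pass the token in the Authorization header; the scheme
# name varies by release ("OpsToken" vs. "vRealizeOpsToken" on older builds).
```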
A single-node VCF Operations deployment provides no fault tolerance — if the appliance fails, all monitoring, alerting, and capacity analytics are lost until the node is restored. For production environments, deploying a high availability (HA) cluster is strongly recommended. This chapter provides a detailed walkthrough of HA cluster configuration.
The primary node is deployed using the procedures described in Chapter 4 (Section 4.2 for vSphere Client or Section 4.3 for ovftool). Complete the Initial Setup Wizard (Section 4.5) with the New Installation option.
Before proceeding to replica deployment, verify the primary node is fully operational: confirm that you can log in to https://<primary-fqdn>/ui with the admin account.

Before deploying the replica node, ensure DNS records, NTP, and management VLAN connectivity are in place for the new node.
Deploy a second VCF Operations OVA using the same size profile as the primary node. Use the same deployment method (vSphere Client or ovftool) described in Chapter 4.
Key settings for the replica OVA:
| Setting | Value |
|---|---|
| VM Name | vcf-operations-replica-01 |
| Size | Must match the primary node (e.g., Medium) |
| IP Address | Different IP, same VLAN as primary (e.g., 10.0.10.51) |
| FQDN | Unique FQDN with matching DNS records (e.g., vcf-ops-02.lab.local) |
| Gateway, DNS, NTP | Identical to the primary node |
| Admin/Root Passwords | May differ from the primary, but using the same passwords simplifies administration |
Power on the replica OVA and wait for it to complete first-boot initialization (10–15 minutes).
Open a browser and navigate to https://<replica-fqdn>/ui.
The Initial Setup Wizard appears. On the Deployment Type page, select Expand an Existing Cluster.
Enter the primary node's FQDN or IP address: vcf-ops-01.lab.local

Click Validate. The wizard connects to the primary node and retrieves its certificate.
Accept the Certificate — Review the certificate thumbprint displayed. Verify it matches the primary node's certificate thumbprint (you can find this on the primary node under Administration → Cluster Management → Certificate). Click Accept.
Authenticate — Enter the admin credentials for the primary node.
Node Role Selection — Select Replica as the role for this node.
Click Next and then Finish to initiate the join process.
The join process takes approximately 15–25 minutes. During this time:
On the primary node, navigate to Administration → Cluster Management. The cluster status panel shows:
- Replica node state: Joining → Synchronizing → Online.
- Cluster state: Single Node → Preparing HA → Online (HA Enabled).

Do not modify any configuration or restart any services during the join process.
After the replica node joins successfully, the cluster requires activation to enable HA functionality:
On the primary node, navigate to Administration → Cluster Management.
The cluster status panel shows both nodes: the primary and the replica. Both should show status Online.
Click the Enable HA button (if it has not been automatically enabled during the join process).
A confirmation dialog appears: "Enabling High Availability will synchronize all data between the primary and replica nodes. This may temporarily impact performance during the initial synchronization. Do you want to continue?"
Click Yes to confirm.
The cluster enters the Synchronizing state. During synchronization:
When synchronization completes, the cluster status changes to Online (HA Enabled). This indicates:
To verify that HA is working correctly:
Navigate to Administration → Cluster Management → Status.
Confirm:
- Cluster mode: High Availability
- Primary node status: Online
- Replica node status: Online
- Data synchronization: Synchronized (no pending sync operations)

Check the Cluster Health dashboard (Dashboards → VCF Operations Self-Monitoring → Cluster Health). All health indicators should be green.
To prevent both cluster nodes from running on the same ESXi host (which would defeat the purpose of HA), create a DRS anti-affinity rule:
- Rule name: VCF-Operations-Anti-Affinity
- Rule type: Separate Virtual Machines
- Members: vcf-operations-primary-01 and vcf-operations-replica-01

DRS will automatically vMotion the VMs to separate hosts if they are currently co-located.
VCF Operations does not support converting a single-node (Simple) deployment to HA by adding a replica after the fact if the original deployment was initialized as a standalone instance. The supported path is to deploy the replica immediately after the primary, before significant historical data accumulates.
If you attempt to add a replica to a long-running standalone deployment, the join may succeed, but synchronization of historical data can take an extremely long time and may fail for very large datasets. Best practice: decide on your cluster model before initial deployment and deploy the replica immediately after the primary.
Once deployed, the IP address and FQDN of each node are embedded in the cluster configuration, certificates, and inter-node trust relationships. Changing the IP or FQDN of a cluster member requires:
This is disruptive and should be avoided. Plan IP addressing and DNS naming carefully before deployment.
All nodes in a cluster must use the same OVA size profile. You cannot mix a Medium primary with a Small replica or add Large data nodes to a Medium cluster. If you need to change the cluster size, you must redeploy all nodes.
Cluster-internal communication (xDB replication, heartbeat, GemFire cache synchronization) is latency-sensitive. All cluster nodes must be on the same Layer 2 network segment with:
- Round-trip latency under 1 ms
- At least 1 Gbps of bandwidth
- No intervening WAN links, load balancers, or firewalls performing deep packet inspection
Violating these requirements leads to split-brain scenarios, data inconsistency, and false failover events.
When the primary node fails, the cluster health monitor detects the missed heartbeat sequence and automatically promotes the replica to become the new primary.
During the failover window (approximately 5 minutes), the UI is unavailable and no new alerts are generated. Metric collection continues in the collector buffer and is flushed to the cluster once the new primary is operational.
For environments requiring even higher availability, deploy a witness node in addition to the primary and replica. The witness participates in quorum voting to prevent split-brain scenarios but does not store analytics data or serve the UI. The witness OVA is a separate, much smaller appliance; this primary-replica-witness topology is the basis of Continuous Availability (CA) mode.
CA mode ensures that the cluster continues to operate with zero data loss even if one node fails completely, by maintaining a quorum and synchronous data replication across all nodes.
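The quorum logic behind CA mode can be illustrated with a toy majority check: with three voters (primary, replica, witness), the cluster side that can still reach a strict majority of voters keeps operating, while an isolated node loses quorum and stops, preventing split-brain. This is a conceptual sketch, not product code.

```python
# Toy illustration of quorum voting in a primary/replica/witness topology.
# A partition is only allowed to continue if it holds a strict majority.

def has_quorum(reachable_voters: int, total_voters: int = 3) -> bool:
    """Strict majority: more than half of all voters must be reachable."""
    return reachable_voters > total_voters // 2

# All three up -> quorum; one node lost -> the surviving pair keeps quorum;
# an isolated single node -> no quorum, so it cannot act as primary.
assert has_quorum(3) and has_quorum(2) and not has_quorum(1)
```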
This chapter provides a comprehensive reference for the filesystem layout, service architecture, and operational commands used to manage VCF Operations (Aria Operations) appliances. Understanding these paths and services is essential for troubleshooting, backup planning, and day-to-day administration.
The VCF Operations appliance is built on Photon OS and follows a structured directory layout. The two primary mount points are the root filesystem (/) and the data partition (/storage/), which is sized according to the deployment profile selected during installation.
| Path | Purpose |
|---|---|
| `/usr/lib/vmware-vcops/` | Main application directory; contains binaries, libraries, and runtime components for all VCF Operations services. |
| `/usr/lib/vmware-vcops/user/conf/` | Application configuration files including `analytics.properties`, `collector.properties`, `gemfire.properties`, and adapter configuration XML files. |
| `/usr/lib/vmware-vcops/user/plugins/` | Management pack plugin directories. Each installed management pack places its adapter JAR files and descriptors here in a versioned subdirectory. |
| `/usr/lib/vmware-vcops/user/plugins/inbound/` | Inbound (data collection) adapter plugins. Contains subdirectories for each installed adapter such as `VMware_adapter3`, `PythonRemediationVcenterAdapter`, and third-party packs. |
| `/usr/lib/vmware-vcops/user/conf/ssl/` | SSL/TLS certificates and keystores used by the application, including the web server certificate (`cert.pem`), private key (`key.pem`), and trust stores. |
| `/usr/lib/vmware-vcops/user/conf/cassandra/` | Cassandra configuration directory containing `cassandra.yaml`, `cassandra-env.sh`, and related tuning files for the metrics datastore. |
| `/usr/lib/vmware-vcops/tomcat-enterprise/` | Apache Tomcat instance serving the REST API (`/suite-api`) and the administrative UI. Contains `conf/server.xml`, `webapps/`, and log directories. |
| `/usr/lib/vmware-vcops/tools/opscli/` | Operations CLI tooling. The primary entry point is `ops-cli.py`, used for adapter management, slice configuration queries, and cluster diagnostics. |
| `/usr/lib/vmware-vcops/support/` | Support and diagnostic scripts including `sliceConfiguration.sh`, `cleanupOps.sh`, and the support bundle generator `supportbundle.py`. |
| `/storage/db/` | Primary analytics database directory housing the FSDB (File System Database) and HIS (Historical) data stores. This is where time-series metric data resides. |
| `/storage/db/casa/` | CASA (Cluster Automated Services Architecture) database. Manages cluster membership, node roles, replication state, and slice ownership metadata. |
| `/storage/db/cassandra/` | Cassandra data directory for persisted metrics. Contains SSTables, commit logs, and saved caches. |
| `/storage/db/vcops/` | Core analytics working data, including dynamic threshold calculations, symptom state, and alert evaluation results. |
| `/storage/log/` | Application-level log files for all VCF Operations services. Primary troubleshooting location. Key files include `analytics.log`, `collector.log`, `api.log`, and `casa.log`. |
| `/storage/core/` | Core dump files generated during application crashes. Monitor disk usage here; large core dumps can fill the partition. |
| `/storage/nfs/` | Default NFS mount point for scheduled backup destinations. Must be pre-configured with appropriate NFS export permissions. |
| `/var/log/` | Operating system and VMware infrastructure service logs, including syslog, `messages`, the `vmware/` subdirectory, and Photon OS package manager logs. |
| `/var/vmware/` | VMware infrastructure service runtime data, including STS token caches and VMware Identity Manager working files. |
| `/opt/vmware/etc/` | vPostgres (VMware-bundled PostgreSQL) configuration files. Contains `postgresql.conf`, `pg_hba.conf`, and recovery configuration. |
| `/opt/vmware/vpostgres/` | vPostgres binary and library directory. The PostgreSQL instance used for alert definitions, user data, and report storage. |
Note: The `/storage/` partition is critical. If it reaches capacity, analytics processing halts and data collection stops. Monitor the partition with `df -h /storage/` and configure alerts for filesystem utilization exceeding 85%.
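The utilization check above can be scripted. The following is a minimal monitoring sketch, assuming POSIX `df` output; the `/storage` default and the 85% threshold follow the guidance in the note, and both are parameters you can tune:

```shell
# check_fs_usage: print a status line and return non-zero when a
# filesystem exceeds the utilization threshold.
check_fs_usage() {
    mount="${1:-/storage}"
    threshold="${2:-85}"
    # Take the Use% column from the data line and strip the % sign.
    used=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
    if [ "$used" -gt "$threshold" ]; then
        echo "WARNING: $mount at ${used}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $mount at ${used}%"
}
```

A typical use is a cron job that runs the function every few minutes and routes any WARNING line to your alerting channel.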
VCF Operations runs as a collection of interdependent services managed by systemd. The following table lists every core service, its function, and its expected default state on a healthy primary node.
| Service Name | Description | Default State |
|---|---|---|
| `vmware-vcops-analytics` | Core analytics engine responsible for dynamic threshold computation, symptom evaluation, alert generation, capacity modeling, and workload optimization calculations. | Running |
| `vmware-vcops-collector` | Data collection service that executes adapter instances, gathers metrics from monitored systems, and feeds raw data into the analytics pipeline. | Running |
| `vmware-vcops-api` | REST API and administrative UI service hosted on Tomcat. Serves the `/suite-api` endpoint and the HTML5 management interface on port 443. | Running |
| `vmware-casa` | Cluster Automated Services Architecture. Manages multi-node cluster topology, node membership, slice assignment, replication orchestration, and failover coordination. | Running |
| `vmware-vcops-gemfire` | Apache Geode (GemFire) distributed in-memory cache. Provides inter-node data sharing, real-time metric buffering, and distributed lock management across cluster nodes. | Running |
| `vmware-vcops-vpostgres` | VMware-packaged PostgreSQL database instance. Stores alert definitions, custom dashboards, super metrics, user accounts, report templates, and compliance data. | Running |
| `vmware-vcops-cassandra` | Apache Cassandra metrics storage engine. Provides the persistent time-series datastore for all collected metrics and properties. | Running |
| `vmware-vcops-watchdog` | Service watchdog daemon. Monitors the health of all other VCF Operations services and automatically restarts any service that becomes unresponsive or crashes. | Running |
| `vmware-vcops-web` | Front-end web server (httpd/nginx reverse proxy). Handles TLS termination, static content serving, and request routing to the Tomcat API backend. | Running |
| `vmware-stsd` | VMware Security Token Service daemon. Provides authentication token issuance and validation for inter-service communication. | Running |
| `vmware-vcops-rhino` | Rhino script engine for custom automation actions and notification plugins. | Running |
All service operations must be performed as the root user via SSH or console access. The appliance supports both `systemctl` and the legacy `service` command syntax.
Checking service status:
```shell
# Preferred — systemctl
systemctl status vmware-vcops-analytics

# Legacy — service wrapper
service vmware-vcops-analytics status
```
Starting, stopping, and restarting individual services:
```shell
# Start a service
systemctl start vmware-vcops-collector

# Stop a service
systemctl stop vmware-vcops-collector

# Restart a service (stop then start)
systemctl restart vmware-vcops-api
```
Querying overall cluster slice status:
```shell
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status
```
This command returns the role of the current node (primary, replica, data), cluster membership, and the online/offline state of each slice.
Using the Operations CLI:
```shell
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py --help
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py node list
```
Checking all VCF Operations services at once:
```shell
# CASA runs as vmware-casa (no vcops prefix); all other services follow
# the vmware-vcops-<name> pattern, so iterate over full unit names.
for svc in vmware-vcops-analytics vmware-vcops-collector vmware-vcops-api \
           vmware-casa vmware-vcops-gemfire vmware-vcops-vpostgres \
           vmware-vcops-cassandra vmware-vcops-watchdog vmware-vcops-web; do
  echo "=== ${svc} ==="
  systemctl is-active "${svc}"
done
```
Correct shutdown and startup ordering is critical to avoid data corruption, split-brain scenarios, and prolonged recovery times. The required sequence varies depending on your deployment topology.
Shutdown:

```shell
# Stop all VCF Operations services first, then the token service
service vmware-vcops stop
service vmware-stsd stop
```

Startup:

```shell
# Start the token service first, then all VCF Operations services
service vmware-stsd start
service vmware-vcops start
```
Warning: Always stop `vmware-stsd` after `vmware-vcops` during shutdown, and start it before `vmware-vcops` during startup. Reversing this order can leave authentication tokens in an inconsistent state.
Shutdown sequence (order matters):
```shell
# On each data node (if applicable):
service vmware-vcops stop && shutdown -h now

# On the replica node:
service vmware-vcops stop && shutdown -h now

# On the primary node (last):
service vmware-vcops stop && shutdown -h now
```
Startup sequence (reverse order):
```shell
# On the primary node (first):
service vmware-vcops start

# Verify primary is healthy before proceeding:
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status

# On the replica node (second):
service vmware-vcops start

# On each data node (last):
service vmware-vcops start
```
Warning: Starting the replica or data nodes before the primary is fully online will cause CASA cluster formation failures. The primary node must be the first to come online and the last to go offline.
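The "verify before proceeding" step can be automated with a small polling helper. This is a sequencing sketch only; the `ONLINE` token and 30-second interval are assumptions, so match the grep pattern to what `sliceConfiguration.sh --status` actually prints on your version:

```shell
# wait_for_state: poll a status command until its output contains an
# expected token, then return.
wait_for_state() {
    status_cmd="$1"
    token="${2:-ONLINE}"
    interval="${3:-30}"
    until eval "$status_cmd" | grep -q "$token"; do
        echo "Waiting for '$token'..."
        sleep "$interval"
    done
    echo "State '$token' reached."
}

# Example (run before starting the replica and data nodes):
# wait_for_state "/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status" ONLINE
```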
Continuous Availability deployments include a witness node in addition to primary and replica nodes. The witness participates in quorum decisions but does not store data.
Shutdown sequence:
Startup sequence:
Warning: In a CA deployment, losing both the witness and one of the primary/replica nodes simultaneously causes a loss of quorum. Never perform maintenance on the witness and a data-bearing node at the same time. Always verify quorum status via the admin UI or `sliceConfiguration.sh --status` before proceeding to the next node.
The VMware vSphere adapter is the foundational integration for VCF Operations. It collects performance metrics, configuration properties, change events, and relationship data from vCenter Server and all managed objects including ESXi hosts, virtual machines, datastores, clusters, distributed switches, and resource pools. This chapter provides a complete walkthrough of credential creation, adapter instance configuration, collection tuning, and health monitoring.
Before creating an adapter instance, you must configure a credential that VCF Operations will use to authenticate against the target vCenter Server.
Step-by-step procedure:
- … (e.g., `vcsa-mgmt-01.corp.local`).
- … (e.g., `svc-vrops-mgmt-01.vcsa-mgmt-01.corp.local`). Do not use an IP address; certificate validation requires FQDN.
- … (e.g., `svc-vrops@vsphere.local`). Never use `administrator@vsphere.local` in production.

Required vCenter Permissions:
The service account must be assigned a custom role at the vCenter root level with the following minimum privileges:
| Privilege Category | Specific Privilege | Access Level |
|---|---|---|
| Global | Licenses | Read only |
| Global | Settings | Read only |
| Global | Health | Read only |
| Host | Configuration (all sub-items) | Read only |
| Host | CIM → CIM Interaction | Read only |
| Host | Storage operations | Read only |
| Virtual Machine | Interaction → Console interaction | Read only |
| Virtual Machine | State → Create snapshot, Remove snapshot | Read/Write |
| Virtual Machine | Configuration (all sub-items) | Read only |
| Datastore | Browse datastore | Read only |
| Datastore | Low-level file operations | Read only |
| Performance | Modify intervals | Read/Write |
| vSAN | Cluster → ReadOnly | Read only |
| Sessions | Validate session | Read only |
| Extension | Register extension | Read/Write (optional — only for remediation actions) |
| Alarm | Acknowledge alarm, Set alarm status | Read/Write (optional — only for alert sync) |
Best Practice: Create a dedicated vSphere role named `VCF-Operations-ReadOnly` with these privileges. Assign it to the service account at the vCenter root object and select Propagate to children. This ensures the adapter can discover and monitor all objects in the inventory hierarchy.
With the credential in place, create the adapter instance that will perform data collection.
| Field | Description | Example Value |
|---|---|---|
| Adapter Type | Pre-selected as VMware vSphere. | VMware vSphere |
| Display Name | Unique name identifying this adapter instance in dashboards and alerts. | vcsa-mgmt-01 |
| Description | Free-text description. | Management domain vCenter |
| Credential | Select the credential created in Section 7.1. | svc-vrops-mgmt-01 |
| vCenter Server | FQDN of the target vCenter Server. Must match the credential's vCenter Server field. | vcsa-mgmt-01.corp.local |
| Collector / Collector Group | Select the collector node or group responsible for data collection. In multi-site deployments, choose a collector closest to the target vCenter. | Default collector group |
| Auto Discovery | When enabled, newly added hosts and VMs are automatically discovered and monitored. | Enabled (recommended) |
| Setting | Default | Description |
|---|---|---|
| `COLLECT_VSAN_PERF_METRICS` | `true` | Enables collection of vSAN performance counters from the vSAN Performance Service. |
| `COLLECT_VSAN_ADVANCED_METRICS` | `false` | Enables collection of extended vSAN metrics (DOM, LSOM, CMMDS). Increases load on vCenter. |
| `PROCESS_CHANGE_EVENTS` | `true` | Enables ingestion of vCenter events and tasks for change-driven analytics and audit trails. |
| `DISABLE_COMM_WITH_VCENTER` | `false` | Emergency toggle to stop all communication with vCenter without deleting the adapter. Useful during planned vCenter maintenance. |
| `CONNECT_TIMEOUT` | `60000` | Connection timeout in milliseconds for vCenter API calls. Increase for high-latency WAN connections. |
| `ENABLE_DIFFMERGE` | `true` | Enables differential collection (only changed properties are sent), reducing processing overhead. |
| `COLLECTOR_INSTANCE_COUNT` | `1` | Number of parallel collection threads. Increase for very large vCenter inventories (>5,000 VMs). |
VCF Operations collects different categories of data at different frequencies. These intervals can be modified per adapter instance, but the defaults are optimized for most environments.
| Collection Type | Default Interval | Configurable Range | Notes |
|---|---|---|---|
| Performance Metrics | 5 minutes | 1–60 minutes | Aligns with vCenter's default real-time statistics interval (20-second samples aggregated to 5 minutes). Reducing below 5 minutes does not yield higher granularity from vCenter. |
| Configuration Properties | 30 minutes | 5–1440 minutes | Collects object configuration attributes (CPU count, memory size, disk layout, network assignments). |
| Change Events | 5 minutes | 1–60 minutes | Polls vCenter's EventManager for tasks and events since the last poll. |
| Inventory Discovery | 6 hours | 1–24 hours | Full inventory traversal to discover new objects and remove stale ones. |
| vSAN Performance | 5 minutes | 5–60 minutes | vSAN performance counters collected via the vSAN Performance Service API. Must be ≥5 minutes. |
| Relationship Mapping | 30 minutes | 5–1440 minutes | Updates parent-child and peer relationships between objects. |
Tip: In very large environments (>10,000 VMs), increasing the configuration collection interval to 60 minutes and inventory discovery to 12 hours significantly reduces API load on vCenter with minimal impact on monitoring fidelity.
After initial deployment, the adapter follows a well-defined lifecycle before full analytics capability is reached:
Initial Discovery (0–30 minutes): The adapter performs a complete inventory traversal, creating resource objects for every discovered entity (hosts, VMs, clusters, datastores, etc.). The Object Count in the adapter status begins to populate.
First Collection Cycle (5–10 minutes after discovery): Performance metrics and configuration properties are collected for the first time. Metrics begin appearing in dashboards, but values are raw with no baseline context.
Statistics Build-Up (24–72 hours): The analytics engine begins calculating rolling averages, standard deviations, and trend lines. Capacity projections begin to appear, but with low confidence.
Dynamic Thresholds (1–2 weeks): After accumulating approximately one to two weeks of continuous data, the analytics engine generates dynamic thresholds (DT). These adaptive baselines learn normal behavior patterns for each metric on each object, including daily and weekly seasonality. Alerts based on dynamic thresholds become meaningful only after this maturation period.
Steady State (2+ weeks): Dynamic thresholds are fully established. Anomaly detection, predictive alerts, and capacity forecasts operate at full accuracy. The system continues to refine thresholds as it accumulates more historical data.
Important: Do not create custom alert definitions based on dynamic thresholds during the first two weeks. The immature thresholds will generate excessive false positives. Use static thresholds for immediate alerting needs during the burn-in period.
After saving the adapter instance, perform the following validation steps:
Test Connection: On the adapter configuration page, click Test Connection. A successful test confirms:
Monitor Discovery Progress:
Verify Object Counts:
Check Adapter Logs:
```shell
tail -100 /storage/log/collector/collector.log | grep -i "VMware_adapter3"
```
Look for `Collection completed successfully` messages and verify that there are no authentication errors or timeout exceptions.
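That log review can be condensed into a quick triage helper. A sketch only: the log path comes from the filesystem table earlier, but the exact message strings are assumptions you should align with what your `collector.log` actually emits:

```shell
# check_collector_log: summarize success and suspect-error counts
# in the collector log.
check_collector_log() {
    log="${1:-/storage/log/collector/collector.log}"
    ok=$(grep -c "Collection completed successfully" "$log" 2>/dev/null)
    err=$(grep -ciE "authentication|timeout" "$log" 2>/dev/null)
    echo "successes=${ok:-0} errors=${err:-0}"
}
```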
Ongoing adapter health monitoring ensures continuous data collection and early detection of integration failures.
Via REST API:
```shell
# Get adapter instance status
curl -sk -X GET \
  "https://<vrops-fqdn>/suite-api/api/adapters/{adapterId}" \
  -H "Authorization: vRealizeOpsToken <token>" \
  -H "Accept: application/json"
```
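The `<token>` value in the Authorization header is a bearer token obtained from the authentication endpoint. A hedged sketch of acquiring and extracting it without extra tooling, assuming the standard `/suite-api/api/auth/token/acquire` endpoint (host and credentials are placeholders; `jq` would be cleaner if available):

```shell
# extract_token: pull the "token" field out of the token-acquire
# response JSON on stdin.
extract_token() {
    sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage against a live instance (commented out here):
# TOKEN=$(curl -sk -X POST "https://<vrops-fqdn>/suite-api/api/auth/token/acquire" \
#   -H "Content-Type: application/json" -H "Accept: application/json" \
#   -d '{"username":"admin","password":"<password>"}' | extract_token)
```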
The adapter status response includes `resourceStatusAndReason`, where:

- `resourceStatus: DATA_RECEIVING` — Healthy.
- `resourceStatus: NO_DATA_RECEIVING` — Collection failures are occurring.
- `resourceStatus: NO_PARENT_MONITORING` — The collector node is offline.
- `resourceStatus: UNKNOWN` — The adapter has not completed its first collection cycle.

Via CLI:
```shell
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list
```
This outputs all configured adapter instances, their types, collection states, and associated collector nodes.
Via UI:
Common adapter health issues and resolutions:
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Status is red; `SSLHandshakeException` in logs | vCenter certificate changed or renewed | Re-trust the vCenter certificate in VCF Operations: Administration → Certificates → Certificate Management |
| Status is red; `InvalidLogin` in logs | Service account password expired or changed | Update the credential in Administration → Integrations → Accounts |
| Status is yellow; collection duration exceeds interval | Oversized vCenter inventory or resource contention | Increase `COLLECTOR_INSTANCE_COUNT`, add a remote collector, or increase collection intervals |
| Object count is zero | Insufficient vCenter permissions | Verify the service account role assignment per Section 7.1 |
| Status is green but metrics are stale | Collector node clock drift | Verify NTP synchronization on both the collector appliance and vCenter |
SDDC Manager is the lifecycle management control plane for VMware Cloud Foundation. Integrating VCF Operations with SDDC Manager provides domain-level topology awareness, lifecycle status visibility, and operational context that enriches the analytics engine's understanding of the VCF stack.
Step-by-step procedure:
- … (e.g., `sddc-mgr-01.corp.local`).
- … (e.g., `sddc-mgr-01.corp.local`). Use the FQDN, not the IP address.
- … `ADMIN` or `OPERATOR` role (e.g., `svc-vrops@corp.local`).

Note: The SDDC Manager API uses port 443. Ensure that firewall rules allow HTTPS traffic from the VCF Operations collector node to the SDDC Manager appliance.
Once configured, the SDDC Manager adapter automatically discovers and monitors the following data:
The adapter communicates with the SDDC Manager via its published REST API. The following table lists the key endpoints queried during each collection cycle:
| Endpoint | Data Collected |
|---|---|
| `GET /v1/system` | Overall system information: SDDC Manager version, system status, NTP configuration, DNS settings, and deployment type. |
| `GET /v1/domains` | All workload domains including name, ID, type (management/VI), status, and associated cluster references. |
| `GET /v1/clusters` | Cluster details within each domain: cluster name, host count, vSAN enabled status, stretch cluster configuration, and image profile. |
| `GET /v1/hosts` | Host inventory: hardware model, ESXi version, commission status (ASSIGNED, UNASSIGNED_USEABLE, DECOMMISSIONED), and associated domain/cluster. |
| `GET /v1/tasks` | Recent task history: upgrade workflows, host operations, certificate rotations, and their completion status (SUCCESSFUL, FAILED, IN_PROGRESS). |
| `GET /v1/upgrades` | Available upgrade bundles and their applicability to each domain, including pre-check results and compatibility matrices. |
| `GET /v1/certificates` | Certificate inventory: issuing CA, subject, expiration date, and associated component (vCenter, NSX, ESXi). |
| `GET /v1/network-pools` | Network pool definitions, VLAN ranges, and IP address block utilization. |
| `GET /v1/sddc-managers` | SDDC Manager cluster node information (in multi-instance deployments). |
Tip: If the SDDC Manager adapter reports errors for specific endpoints, verify that the service account has sufficient privileges. The `ADMIN` role provides access to all endpoints; the `OPERATOR` role may restrict access to certain lifecycle operations.
NSX provides the network virtualization and security layer in VMware Cloud Foundation. Integrating VCF Operations with NSX delivers visibility into logical networking constructs, transport infrastructure, distributed firewall activity, and load balancer performance — all correlated with the compute and storage metrics collected by the vSphere adapter.
Step-by-step procedure:
- … (e.g., `nsx-mgmt-01.corp.local`).
- … (e.g., `nsx-vip-mgmt.corp.local`). This is critical — see the note below.
- … the `audit` role (e.g., `svc-vrops-nsx@corp.local`) or the `enterprise_admin` role for full visibility. The `audit` role is recommended for least-privilege compliance.

Warning: Always use the NSX Manager VIP (Virtual IP), not the FQDN of an individual NSX Manager node. The NSX Manager cluster operates as a three-node Raft consensus group. During maintenance, node upgrades, or node failures, individual manager nodes become temporarily unavailable. The VIP automatically directs traffic to a healthy node, ensuring uninterrupted data collection. Configuring the adapter with an individual node address will result in collection outages during any node maintenance event.
The NSX-T adapter collects a comprehensive set of networking and security data:
In addition to the built-in NSX adapter, VMware offers VCF Operations for Networks (formerly known as Aria Operations for Networks, or vRealize Network Insight) as a complementary product for deep network visibility. While the NSX adapter focuses on management-plane metrics, VCF Operations for Networks provides data-plane flow analysis.
Deployment model:
Key capabilities:
Note: VCF Operations for Networks is licensed separately from VCF Operations. In VCF 5.x environments, it is included with the VCF Operations Advanced and Enterprise editions.
VMware vSAN is the hyper-converged storage platform embedded in VCF. VCF Operations provides native vSAN monitoring through the vSphere adapter, delivering capacity analytics, performance trending, health correlation, and policy compliance tracking without requiring a separate adapter installation.
vSAN monitoring is automatically activated when the vCenter adapter discovers one or more vSAN-enabled clusters. No additional adapter installation, configuration, or licensing is required for core vSAN metrics.
Prerequisites for automatic vSAN data collection:
- … the `vSAN → Cluster → ReadOnly` privilege (included in the role defined in Section 7.1).

Once these prerequisites are met, VCF Operations automatically creates resource objects for:
For environments requiring deeper vSAN observability, additional collection parameters can be enabled in the vCenter adapter instance's advanced settings.
Navigate to Administration → Integrations → Accounts → select the vCenter adapter → Edit → expand Advanced Settings:
| Setting | Default | Description |
|---|---|---|
| `COLLECT_VSAN_PERF_METRICS` | `true` | Collects vSAN performance counters (IOPS, throughput, latency) from the vSAN Performance Service API. Disabling this removes all vSAN performance data while retaining capacity and health metrics. |
| `COLLECT_VSAN_ADVANCED_METRICS` | `false` | Enables collection of extended vSAN metrics from the DOM (Distributed Object Manager), LSOM (Local Log-Structured Object Manager), and CMMDS (Cluster Monitoring Membership and Directory Services) layers. Provides deep diagnostic visibility but increases collection load on vCenter and the ESXi hosts. |
| `VSAN_PERF_DIAG_MODE` | `false` | Enables vSAN performance diagnostics mode, which collects additional latency breakdown metrics (e.g., guest-to-kernel, kernel-to-disk) for troubleshooting storage performance issues. |
Warning: Enabling `COLLECT_VSAN_ADVANCED_METRICS` on clusters with more than 32 hosts or heavy I/O workloads can significantly increase vCenter API response times and VCF Operations collection duration. Enable this setting selectively and monitor the adapter collection duration (see Section 7.6) after activation.
Additional vSAN Performance Service requirements:
VCF Operations collects hundreds of vSAN metrics. The following table summarizes the most operationally significant metric groups:
| Metric Group | Key Metrics | Description |
|---|---|---|
| Capacity | `vsanDatastore\|capacity_usedSpace`, `vsanDatastore\|capacity_freeSpace`, `vsanDatastore\|capacity_dedupRatio`, `vsanDatastore\|capacity_compressionRatio`, `vsanDatastore\|capacity_savingsRatio` | Overall vSAN datastore capacity utilization, deduplication effectiveness, compression ratios, and combined space savings. Used for capacity planning and trending. |
| Performance — IOPS | `vsanDatastore\|performance_readIops`, `vsanDatastore\|performance_writeIops`, `vsanDatastore\|performance_totalIops` | Read, write, and total I/O operations per second at the cluster, host, and disk group levels. |
| Performance — Throughput | `vsanDatastore\|performance_readThroughput`, `vsanDatastore\|performance_writeThroughput` | Data throughput in KBps for read and write operations. Useful for identifying bandwidth bottlenecks. |
| Performance — Latency | `vsanDatastore\|performance_readLatency`, `vsanDatastore\|performance_writeLatency`, `vsanDatastore\|performance_totalLatency` | Average latency in milliseconds for read, write, and combined operations. VCF Operations applies dynamic thresholds to these metrics after the burn-in period. |
| Resync | `vsanDatastore\|resync_bytesRemaining`, `vsanDatastore\|resync_objectsResyncing`, `vsanDatastore\|resync_etr` | Bytes remaining to resynchronize after a host failure or maintenance event, count of objects actively resyncing, and estimated time to completion (ETR). Critical for monitoring recovery progress. |
| Health | `vsanDatastore\|health_diskHealth`, `vsanDatastore\|health_networkHealth`, `vsanDatastore\|health_dataIntegrity`, `vsanDatastore\|health_overallHealth` | Health check results for disk subsystem, vSAN network (VMkernel connectivity, multicast), data integrity (object checksum verification), and overall cluster health. |
| Policy Compliance | `vsanDatastore\|policy_complianceStatus`, `vsanDatastore\|policy_objectsByPolicy` | Reports whether all VM storage objects comply with their assigned vSAN storage policy (e.g., FTT=1, stripe width). Identifies VMs at risk due to policy violations. |
| Congestion | `vsanDatastore\|performance_congestion` | vSAN congestion value (0–255). Values above 0 indicate back-pressure in the I/O path. Sustained values above 30 warrant investigation. |
| Disk Group | `vsanDiskGroup\|iopsRead`, `vsanDiskGroup\|iopsWrite`, `vsanDiskGroup\|latencyRead`, `vsanDiskGroup\|latencyWrite`, `vsanDiskGroup\|cacheHitRate` | Per-disk-group performance counters including cache tier hit rate. Low cache hit rates may indicate a need for larger cache disks or workload redistribution. |
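The "sustained values above 30" congestion guidance can be expressed as a small filter. This is an illustrative sketch only: it reads one sampled congestion value per line (however you export them, e.g. from the REST API) and alerts when every sample in the window exceeds the threshold:

```shell
# sustained_congestion: alert only when ALL samples exceed the threshold
# (default 30, per the guidance above); a single spike reads as OK.
sustained_congestion() {
    threshold="${1:-30}"
    awk -v t="$threshold" '
        { n++; if ($1 > t) high++ }
        END {
            if (n > 0 && high == n) print "ALERT: sustained congestion"
            else print "OK"
        }
    '
}
```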
VCF Operations ships with a comprehensive set of predefined vSAN dashboards that provide immediate operational visibility without custom configuration. These dashboards cover:
For a complete listing of all predefined vSAN dashboards, their widget configurations, and customization guidance, refer to Chapter 14: Predefined Dashboards and Views.
Tip: Pin the vSAN Cluster Overview and vSAN Capacity Planning dashboards to your home page for daily operational monitoring. Configure email-based scheduled reports from the vSAN Capacity Planning dashboard to automatically distribute weekly capacity status to infrastructure leads.
Policies in VCF Operations govern how the platform analyzes, alerts on, and reports capacity for your monitored objects. Every object in the inventory is subject to exactly one policy at any given time, and understanding how policies layer and override one another is essential for accurate monitoring at scale.
VCF Operations ships with a single Default Policy that is automatically applied to every monitored object in the inventory. This policy contains Broadcom's recommended thresholds, alert definitions, symptom definitions, and capacity settings for all supported object types. It cannot be deleted, and it serves as the fallback for any object not explicitly covered by a custom policy.
Custom policies allow administrators to override specific settings from the Default Policy for targeted groups of objects. A custom policy does not need to redefine every setting — it inherits any setting left unconfigured from the Default Policy and only overrides the values explicitly changed.
To manage policies, navigate to:
Configure → Policies
The Policies page displays all active policies in a table with columns for Name, Description, Priority, and the number of object groups assigned. From this page you can:
Note: The Default Policy itself can be edited, but exercise caution — changes to the Default Policy affect every object that is not covered by a higher-priority custom policy.
When multiple custom policies exist, VCF Operations uses a numeric priority system to determine which policy governs a given object. Each policy is assigned a priority number, and lower numbers indicate higher priority.
Policy resolution follows this logic:
| Priority | Policy Name | Assigned Groups | Matched Object | Result |
|---|---|---|---|---|
| 1 | Critical Production | Production-Tier1 | VM in Production-Tier1 | Governed by Critical Production |
| 2 | Standard Production | Production-All | VM in Production-All | Governed by Standard Production |
| 3 | Development | Dev-Test | VM in Dev-Test | Governed by Development |
| — | Default Policy | (All objects) | VM in no group | Governed by Default Policy |
If an object belongs to groups assigned to multiple policies, only the highest-priority policy (lowest number) applies. There is no merging of settings across policies — the winning policy's settings apply in full, with any unconfigured settings inherited from the Default Policy.
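The resolution rule reduces to "take the lowest priority number among all matching policies." A sketch of that selection, with policy names purely illustrative:

```shell
# resolve_policy: stdin carries one "priority<TAB>policy-name" line per
# policy whose assigned groups contain the object; the lowest number wins
# outright, with no merging of settings.
resolve_policy() {
    sort -n | head -n 1 | cut -f2
}
```

For example, feeding the two matching policies `2 Standard Production` and `1 Critical Production` returns `Critical Production`, mirroring the table above.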
To change priority, navigate to Configure → Policies, select a policy, and click Edit Priority. Enter the desired numeric value and save.
Each policy exposes five major configuration areas. The following sections detail every configurable element, its UI navigation path, and key settings.
Navigation: Configure → Policies → [Policy Name] → Edit → Workload Automation
Workload Automation enables DRS-like optimization recommendations (or automated actions, if configured) driven by VCF Operations analytics rather than vCenter DRS alone.
| Setting | Description | Default |
|---|---|---|
| Enable Workload Automation | Turns on optimization analysis for the policy scope | Disabled |
| Automation Mode | Manual (recommendations only), Semi-Automatic, or Fully Automatic | Manual |
| Aggressiveness | Conservative, Moderate, or Aggressive balancing | Moderate |
| Excluded Object Types | Object types to exclude from automation | None |
Navigation: Configure → Policies → [Policy Name] → Edit → Capacity
Capacity settings control how VCF Operations calculates remaining capacity and time-to-exhaustion.
| Setting | Description | Default |
|---|---|---|
| Allocation Model / Demand Model | Method for computing capacity (see Section 11.4) | Allocation Model |
| Time Remaining Threshold (days) | Alert fires when projected exhaustion is within this window | 90 days |
| Capacity Remaining Threshold (%) | Alert fires when remaining capacity drops below this value | 20% |
| CPU Overcommit Ratio | Virtual-to-physical CPU ratio ceiling | 4:1 |
| Memory Overcommit Ratio | Virtual-to-physical memory ratio ceiling | 1.25:1 |
| Storage Overcommit Ratio | Virtual-to-physical storage ratio ceiling | 1:1 |
| High Availability Buffer (%) | Capacity reserved for HA failover | Based on cluster HA settings |
| Maintenance Buffer (%) | Capacity reserved for host maintenance | 0% |
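To see how the buffer settings interact, here is an illustrative arithmetic sketch. The exact formula the product uses may differ; this simply subtracts the HA and maintenance buffer percentages from total capacity, and all values are hypothetical:

```shell
# Hypothetical cluster: 100 GHz total, 10% HA buffer, 5% maintenance buffer.
total_ghz=100
ha_buffer_pct=10
maint_buffer_pct=5
# Usable capacity after reserving both buffers.
usable_ghz=$((total_ghz * (100 - ha_buffer_pct - maint_buffer_pct) / 100))
echo "Usable: ${usable_ghz} GHz of ${total_ghz} GHz"
```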
Navigation: Configure → Policies → [Policy Name] → Edit → Attributes/Metrics
This section allows enabling or disabling the collection of specific metric groups per object type. Disabling unused metric groups reduces storage consumption and processing overhead.
Categories include CPU, Memory, Disk, Network, Datastore, Virtual Disk, GPU, vSAN, and System metrics. Each category can be individually toggled.
Navigation: Configure → Policies → [Policy Name] → Edit → Alerts/Symptoms
Administrators can enable or disable individual alert definitions and symptom definitions within the scope of the policy. This is useful for suppressing alerts that are not relevant to a particular workload tier — for example, disabling memory overcommit alerts for development clusters where overcommit is expected.
Navigation: Configure → Policies → [Policy Name] → Edit → Compliance
Activate or deactivate compliance benchmarks on a per-policy basis. Available benchmarks include VMware Security Hardening Guide, CIS Benchmarks, DISA STIGs, and any custom benchmarks that have been imported.
The capacity model determines how VCF Operations calculates how much capacity a cluster or datastore has remaining.
| Aspect | Allocation Model | Demand Model |
|---|---|---|
| Calculation Basis | Provisioned (allocated) resources | Actual measured utilization |
| Philosophy | Conservative — assumes all provisioned resources may be consumed | Optimistic — assumes current usage patterns continue |
| CPU Capacity Used | Sum of all vCPUs allocated × overcommit ratio | Peak or 95th-percentile CPU demand |
| Memory Capacity Used | Sum of all configured VM memory | Active + consumed memory demand |
| Example (8-core host) | 10 VMs × 4 vCPU = 40 vCPU allocated → 40/32 = 125% used (at 4:1 ratio) | Actual demand is 12 GHz of 64 GHz → 18.75% used |
| Best For | Production environments with strict SLAs | Development environments or well-understood workloads |
| Risk | May show capacity exhaustion prematurely | May underestimate future demand if workloads spike |
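The 8-core host example from the table can be reproduced with a couple of lines (illustrative helper functions, not product code):

```python
def allocation_model_usage(vcpus_allocated: int, physical_cores: int,
                           overcommit: float = 4.0) -> float:
    """Percent of capacity used under the allocation model."""
    return vcpus_allocated / (physical_cores * overcommit) * 100

def demand_model_usage(demand_mhz: float, capacity_mhz: float) -> float:
    """Percent of capacity used under the demand model."""
    return demand_mhz / capacity_mhz * 100

# 10 VMs x 4 vCPU on an 8-core host at a 4:1 ratio:
print(allocation_model_usage(40, 8))       # 125.0 -> "over capacity"
# The same host measured by demand: 12 GHz used of 64 GHz available:
print(demand_model_usage(12_000, 64_000))  # 18.75 -> ample headroom
```

The gap between 125% and 18.75% for the same host is exactly why model choice matters: the allocation model protects SLAs, while the demand model maximizes consolidation.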
Alerts and symptoms form the proactive monitoring backbone of VCF Operations. Symptoms detect individual conditions; alerts correlate one or more symptoms into actionable notifications that drive operational response.
Every alert in VCF Operations is classified along three dimensions: type, criticality, and control state.
Alert Types:
| Type | Badge Icon | Purpose | Example |
|---|---|---|---|
| Health | Red/Orange/Yellow cross | Indicates a current, active problem requiring immediate attention | Host memory usage critical |
| Risk | Red/Orange/Yellow diamond | Predicts a future problem based on trend analysis | Datastore will run out of space in 30 days |
| Efficiency | Red/Orange/Yellow arrow | Identifies optimization opportunities to reclaim waste | VM is oversized — using 5% of allocated CPU |
Badge Colors and Criticality Levels:
| Color | Criticality | Description |
|---|---|---|
| Red | Critical | Immediate action required; service impact is occurring or imminent |
| Orange | Immediate | Urgent attention needed; potential for service impact |
| Yellow | Warning | Attention recommended; condition is outside normal bounds |
| Green | Information / Clear | Informational or no active alerts |
Control States:
| State | Description |
|---|---|
| Open | Alert is active and unacknowledged |
| Assigned | An administrator has taken ownership |
| Suspended | Alert is temporarily suppressed (with optional expiration) |
| Cancelled | Alert has been manually dismissed by an administrator |
When all triggering symptoms clear, the alert automatically transitions to a cancelled state. Manually cancelled alerts will not re-fire until the symptoms clear and then trigger again.
The alert lifecycle follows a deterministic sequence: a symptom condition triggers, the alert opens, an administrator may acknowledge or assign it, and the alert cancels automatically once all triggering symptoms clear.
Alerts can also be suspended for a configurable duration (e.g., during a maintenance window), after which they automatically resume evaluation.
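The control states and their permitted transitions can be sketched as a small state machine; the transition set below is illustrative, inferred from the state table above rather than taken from product documentation:

```python
# Permitted control-state transitions (illustrative, based on the table above).
TRANSITIONS = {
    "Open":      {"Assigned", "Suspended", "Cancelled"},
    "Assigned":  {"Suspended", "Cancelled"},
    "Suspended": {"Open", "Cancelled"},   # resumes evaluation after the window
    "Cancelled": {"Open"},                # re-fires only when symptoms trigger again
}

def transition(state: str, new_state: str) -> str:
    """Move an alert to a new control state, rejecting illegal transitions."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = transition("Open", "Assigned")    # operator takes ownership
state = transition(state, "Cancelled")    # operator dismisses the alert
print(state)  # Cancelled
```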
To create a custom alert definition, follow these steps:
Step 1. Navigate to Configure → Alerts → Alert Definitions.
Step 2. Click the Add button in the toolbar.
Step 3. On the Name and Description tab, enter a descriptive name and an optional description for the alert.
Step 4. On the Alert Impact tab, set the alert type (Health, Risk, or Efficiency) and its criticality.
Step 5. On the Add Symptom Definitions tab, select existing symptom definitions (or create new ones) to include in the alert.
Step 6. In the Configure Symptom Conditions section, arrange the selected symptoms and combine them with AND/OR logic to define the trigger condition.
Step 7. Click Save. The alert definition is now created but will only evaluate against objects governed by a policy where the alert is enabled.
Symptoms are the atomic conditions that feed into alert definitions. VCF Operations supports five distinct symptom types.
Triggers when a monitored metric or property meets a defined condition.
Static Threshold Configuration:
Supported comparison operators: `>`, `<`, `>=`, `<=`, `=`, `!=`.

Dynamic Threshold Configuration:
Triggers when a log message matches a defined pattern. This symptom type requires Operations for Logs integration.
Triggers on fault events published by vCenter Server or other adapter sources.
Triggers on metric events published by external systems through the VCF Operations REST API.
Predictive symptom that uses machine-learning trend analysis to forecast when a metric will cross a threshold.
| Aspect | Static Threshold | Dynamic Threshold |
|---|---|---|
| Definition | Fixed numeric value set by the administrator | Machine-learned baseline derived from historical patterns |
| Trigger Condition | Fires when metric crosses the fixed value | Fires when metric deviates from the learned normal pattern |
| Setup Effort | Immediate — define value and save | Requires 1–2 weeks of data collection for baseline |
| Adaptability | Does not adapt; same value applies 24/7 | Adapts to daily/weekly patterns (e.g., business hours vs off-hours) |
| False Positive Risk | Higher — a single threshold cannot account for variable workloads | Lower — learned baselines reflect actual usage patterns |
| Best For | Hard limits (e.g., disk full > 95%), SLA thresholds | Anomaly detection, workloads with variable patterns |
| Configuration | Operator + fixed value | Direction (Above/Below) + Sensitivity level (Normal, 1-3 sigma) |
Dynamic Threshold Sensitivity Levels:
| Level | Interpretation | Use When |
|---|---|---|
| Normal Range | Any deviation outside the learned band | You want maximum sensitivity to deviations |
| 1 Standard Deviation | Moderate deviation from normal | General-purpose anomaly detection |
| 2 Standard Deviations | Significant deviation from normal | Reducing noise while catching meaningful anomalies |
| 3 Standard Deviations | Extreme deviation from normal | Only alerting on severe outliers |
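A simplified stand-in for dynamic thresholding can be sketched with a mean and standard-deviation band; the product's real baselines are time-aware and considerably more sophisticated, so treat this only as an intuition aid:

```python
import statistics

def dynamic_band(samples, sigmas: float = 2.0):
    """Learned 'normal' band: mean +/- sigmas * stdev of historical samples."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean - sigmas * stdev, mean + sigmas * stdev

def is_anomalous(value, samples, sigmas: float = 2.0) -> bool:
    low, high = dynamic_band(samples, sigmas)
    return value < low or value > high

# Hourly CPU-usage samples hovering around 40%:
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
print(is_anomalous(72, history))  # True  -> far outside the learned band
print(is_anomalous(41, history))  # False -> within the normal range
```

Raising `sigmas` from 1 to 3 widens the band, which is exactly the noise-reduction trade-off captured in the sensitivity table above.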
Alert definitions can include symptoms that evaluate conditions not only on the alerting object itself but also on related objects in the inventory hierarchy.
| Relationship | Description | Example |
|---|---|---|
| Self | Symptom evaluates on the object that will trigger the alert | VM CPU Usage > 90% on the VM itself |
| Parent | Symptom evaluates on the immediate parent object | Host memory pressure on the host running the VM |
| Child | Symptom evaluates on an immediate child object | A VM on a host has high disk latency |
| Peer | Symptom evaluates on an object at the same level sharing a parent | Another VM on the same host is consuming excessive CPU |
| Ancestor | Symptom evaluates on any object above in the hierarchy (parent, grandparent, etc.) | Cluster-level capacity warning affecting a VM two levels down |
| Descendant | Symptom evaluates on any object below in the hierarchy (child, grandchild, etc.) | Any VM in a cluster experiencing memory contention |
Relationship-based symptoms enable compound alerts that correlate conditions across infrastructure layers — for example, an alert that fires only when a VM has high CPU ready AND its parent host has high CPU utilization, confirming the contention is host-driven rather than guest-driven.
Notification rules bridge alerts to human attention by defining what gets communicated, to whom, and through which channel.
Step 1. Navigate to Configure → Alerts → Notification Settings.
Step 2. Click Add to create a new notification rule.
Step 3. Enter a Name for the rule (e.g., "Critical Production Alerts to On-Call Team").
Step 4. Set Filter Criteria to control which alerts trigger this notification, such as criticality, alert type, alert definition, or the object type the alert fires on.
Step 5. Select the Notification Method — choose from the configured outbound plug-ins (see Section 12.9).
Step 6. Set the Notification Frequency, which controls whether a single notification is sent when the alert triggers or reminders are re-sent while the alert remains active.
Step 7. Click Save. The notification rule takes effect immediately.
Tip: Create separate notification rules for different criticality levels. Route Critical alerts to PagerDuty or SMS-capable channels for immediate response, while routing Warning alerts to email or Slack for informational awareness.
Outbound plug-ins define the communication channels available for notification rules. Configure them at Administration → Outbound Settings → Add.
| # | Plug-in Type | Key Configuration Fields | Notes |
|---|---|---|---|
| 1 | Standard Email (SMTP) | SMTP Host, Port (25/465/587), Secure Connection (TLS/SSL), From Address, Authentication (username/password) | Most common. Supports HTML formatting. Test with the Test button before saving. |
| 2 | Log File | File path on the VCF Operations analytics node (e.g., `/var/log/vmware/vcops/alerts.log`) | Useful for SIEM ingestion from the local filesystem. |
| 3 | Network Share (CIFS/NFS) | Share Path (e.g., `\\server\share\alerts`), Domain, Username, Password | Writes alert data as files to a network share. |
| 4 | SNMP Trap | Target Host (IP/FQDN), Port (default 162), Community String, SNMP Version (v1/v2c/v3), Security Level (v3: AuthPriv/AuthNoPriv/NoAuthNoPriv), Engine ID | For integration with enterprise SNMP managers (e.g., HP OpenView, IBM Tivoli). |
| 5 | ServiceNow | Instance URL (e.g., `https://instance.service-now.com`), Username, Password, REST Endpoint, Incident Table, Assignment Group | Creates ServiceNow incidents automatically. Requires the VCF Operations ServiceNow app or direct REST configuration. |
| 6 | Slack | Webhook URL (from the Slack Incoming Webhooks app), Channel (override), Username (override) | Posts formatted alert messages to a Slack channel. |
| 7 | Webhook (REST) | URL, HTTP Method (POST/PUT/PATCH), Content Type (JSON/XML), Headers (key-value pairs), Body Template (with alert field placeholders), Authentication (None/Basic/Bearer Token/OAuth) | Most flexible — integrates with any REST-capable system (PagerDuty, Teams, OpsGenie, custom APIs). |
The configuration procedure is the same for every plug-in type: go to Administration → Outbound Settings, click Add, select the plug-in type, complete its fields, test the connection, and save.
Each plug-in type can have multiple instances configured (e.g., separate SMTP servers for different environments, multiple Slack channels). Notification rules reference specific plug-in instances when defining the delivery channel.
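As an illustration of the Webhook (REST) body template, a minimal Python sketch that assembles an alert payload might look like the following. The field names are hypothetical placeholders, not the product's fixed schema; in practice the Webhook plug-in substitutes alert field placeholders into the body template you define:

```python
import json

def build_alert_payload(alert_name: str, criticality: str,
                        object_name: str, alert_type: str = "Health") -> str:
    """Assemble a JSON body for a generic REST webhook (illustrative schema)."""
    return json.dumps({
        "alertName": alert_name,      # hypothetical field names
        "criticality": criticality,
        "type": alert_type,
        "object": object_name,
    })

body = build_alert_payload("Host memory usage critical", "Critical",
                           "esxi-01.lab.local")
print(json.loads(body)["criticality"])  # Critical
```

A receiving system such as PagerDuty or a custom API would parse this body and map the fields to its own incident model.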
Super metrics extend the analytic capabilities of VCF Operations by enabling administrators to define custom calculated metrics that combine, aggregate, or transform multiple standard metrics into a single derived value. They fill gaps where the built-in metric catalog does not provide the exact calculation your organization needs.
A super metric is a user-defined formula that VCF Operations evaluates on every collection cycle, producing a new metric value that can be used in dashboards, views, reports, alert symptom definitions, and capacity calculations — just like any native metric.
Common use cases include cluster-wide averages limited to a filtered set of VMs (for example, powered-on only), counts of objects exceeding a performance threshold, and derived ratios such as vCPU-to-pCPU overcommit.
To access super metrics, navigate to: Configure → Super Metrics.
Follow this ten-step procedure to create a super metric:
Step 1. Navigate to Configure → Super Metrics.
Step 2. Click the Add button in the toolbar.
Step 3. Enter a Name for the super metric (e.g., "Cluster - Avg VM CPU Usage (Powered-On Only)"). Enter an optional Description explaining the formula's purpose and intended consumers.
Step 4. Select the Object Type that this super metric will be associated with. The super metric will appear as a metric on objects of this type. For example, selecting "Cluster Compute Resource" means the super metric will be calculated and displayed for each cluster.
Step 5. Build the formula in the Formula Editor. The editor provides a text area where you type or construct the formula using metric references, operators, and functions.
Step 6. Use the Metric Picker (right panel) to browse or search the available metric catalog. Double-click a metric to insert its reference into the formula. The metric reference is inserted in the syntax ${this, metric=<metric_key>}.
Step 7. Apply looping functions to iterate over child objects. For example, wrap a metric reference in avg() to compute the average value of that metric across all child objects at a specified depth. See Section 13.3 for the complete list of looping functions.
Step 8. Click the Preview button to validate the formula syntax and see sample results. The preview evaluates the formula against a few sample objects and displays the computed values. Fix any syntax errors before proceeding.
Step 9. Assign the super metric to a policy. A super metric only collects data when it is activated in at least one policy. Navigate to the Policies tab within the super metric editor, or go to Configure → Policies → [Policy Name] → Edit → Attributes/Metrics and enable the super metric under the appropriate object type.
Step 10. Click Save. The super metric begins collecting data on the next collection cycle for all objects governed by the policy where it is activated.
Important: Super metrics do not retroactively calculate historical data. Data collection begins from the moment the super metric is activated in a policy.
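As a concrete illustration, the super metric named in Step 3 could be written roughly as follows. This is a sketch using the looping functions and where-clause filtering covered later in this chapter; validate the exact expression with Preview before saving:

avg(${this, metric=cpu|usage_average, depth=2, where=Summary|Runtime|PowerState=Powered On})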
Looping functions iterate over child objects (or related objects at a specified depth) and aggregate a metric across them.
| Function | Description | Syntax Example |
|---|---|---|
| `avg()` | Calculates the arithmetic mean of a metric across child objects | `avg(${this, metric=cpu\|usage_average, depth=1})` |
| `combine()` | Combines individual time series from child objects into a unified series | `combine(${this, metric=cpu\|usage_average, depth=1})` |
| `count()` | Returns the number of child objects that report the specified metric | `count(${this, metric=cpu\|usage_average, depth=1})` |
| `max()` | Returns the maximum value of the metric across all child objects | `max(${this, metric=cpu\|usage_average, depth=1})` |
| `min()` | Returns the minimum value of the metric across all child objects | `min(${this, metric=cpu\|usage_average, depth=1})` |
| `sum()` | Returns the sum of the metric values across all child objects | `sum(${this, metric=mem\|consumed_average, depth=1})` |
The `depth` parameter controls how many levels down the hierarchy to traverse:

- `depth=1` — direct children only (e.g., VMs directly under a host).
- `depth=2` — children and grandchildren (e.g., VMs under hosts under a cluster).
- If omitted, `depth` defaults to `depth=1`.

Single functions operate on individual numeric values within the formula.
| Function | Description |
|---|---|
| `abs(x)` | Returns the absolute value of x |
| `acos(x)` | Returns the arc cosine of x (in radians) |
| `ceil(x)` | Returns the smallest integer greater than or equal to x |
| `cos(x)` | Returns the cosine of x (x in radians) |
| `exp(x)` | Returns Euler's number raised to the power of x |
| `floor(x)` | Returns the largest integer less than or equal to x |
| `log(x)` | Returns the natural logarithm (base e) of x |
| `log10(x)` | Returns the base-10 logarithm of x |
| `pow(x, y)` | Returns x raised to the power of y |
| `round(x)` | Returns x rounded to the nearest integer |
| `sqrt(x)` | Returns the square root of x |
| `sin(x)` | Returns the sine of x (x in radians) |
| `tan(x)` | Returns the tangent of x (x in radians) |
Super metric formulas support the following operators:
Numeric Operators:
| Operator | Description | Example |
|---|---|---|
| `+` | Addition | `metricA + metricB` |
| `-` | Subtraction | `metricA - metricB` |
| `*` | Multiplication | `metricA * 1024` |
| `/` | Division | `metricA / metricB` |
| `%` | Modulo (remainder) | `metricA % 60` |

Comparison Operators:

| Operator | Description | Example |
|---|---|---|
| `>` | Greater than | `metricA > 90` |
| `<` | Less than | `metricA < 10` |
| `>=` | Greater than or equal to | `metricA >= 50` |
| `<=` | Less than or equal to | `metricA <= 100` |
| `==` | Equal to | `metricA == 0` |
| `!=` | Not equal to | `metricA != -1` |

Logical Operators:

| Operator | Description | Example |
|---|---|---|
| `&&` | Logical AND | `(metricA > 90) && (metricB > 80)` |
| `\|\|` | Logical OR | `(metricA > 95) \|\| (metricB > 95)` |
| `!` | Logical NOT | `!(metricA == 0)` |

String Operators:

| Operator | Description | Example |
|---|---|---|
| `.contains()` | Checks if a string property contains a substring | `${this, property=config\|guestFullName}.contains("Windows")` |
| `.length()` | Returns the length of a string property | `${this, property=config\|name}.length()` |
**`depth` Parameter**

The `depth` parameter specifies how many levels of the object hierarchy to traverse when using looping functions:

- `depth=0` — the current object itself (no looping).
- `depth=1` — direct children (e.g., for a cluster: hosts; for a host: VMs).
- `depth=2` — grandchildren (e.g., for a cluster: VMs through hosts).

**`where` Clause**

The `where` clause filters child objects by a property value before aggregation:
avg(${this, metric=cpu|usage_average, depth=1, where=Summary|Guest Operating System=.*Linux.*})
This calculates the average CPU usage only for child VMs whose guest OS name matches the regex .*Linux.*.
The where clause supports:
- Exact matches: `where=Summary|Runtime|PowerState=Powered On`
- Regular expressions: `where=config|name=.*prod.*`
- Metric conditions: `where=cpu|usage_average > 50`

**`isFresh()` Function**

`isFresh()` checks whether a metric has received data within the most recent collection cycle. It returns 1 if fresh data exists, 0 otherwise. This is useful for conditionally including only actively-reporting objects:
sum(${this, metric=mem|consumed_average, depth=1, where=isFresh(mem|consumed_average)})
Intermediate calculations can be assigned to aliases for readability:
alias cpuTotal = sum(${this, metric=cpu|usagemhz_average, depth=1})
alias cpuCapacity = ${this, metric=cpu|capacity_usagemhz}
cpuTotal / cpuCapacity * 100
Use ternary syntax for conditional logic:
${this, metric=cpu|usage_average} > 80 ? 1 : 0
This returns 1 if CPU usage exceeds 80%, otherwise returns 0 — useful for creating "count of objects exceeding threshold" super metrics when combined with sum().
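For instance, a per-host count of VMs running hot might be sketched as follows (illustrative; confirm the expression with Preview before saving):

sum(${this, metric=cpu|usage_average, depth=1} > 80 ? 1 : 0)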
The following real-world examples demonstrate practical super metric formulas.
Example 1: Average VM CPU Usage Across a Cluster (Windows VMs Only)
Object Type: Cluster Compute Resource
avg(${this, metric=cpu|usage_average, depth=2, where=Summary|Guest Operating System=.*Windows.*})
This formula traverses two levels deep from the cluster (cluster → host → VM), filters to only Windows VMs, and calculates the average CPU usage across all matching VMs in the cluster.
Example 2: Total Memory Consumed by Powered-On VMs
Object Type: Cluster Compute Resource
sum(${this, metric=mem|consumed_average, depth=2, where=Summary|Runtime|PowerState=Powered On})
This formula sums the consumed memory metric across all VMs in the cluster that are currently powered on, giving an accurate picture of active memory demand.
Example 3: Count of VMs with CPU Ready Exceeding Threshold
Object Type: Host System
count(${this, metric=cpu|readyPct, depth=1, where=cpu|readyPct > 2.5})
This formula returns the number of VMs on a host where the CPU Ready percentage exceeds 2.5%, providing a single metric that indicates how many VMs on the host are experiencing CPU scheduling contention.
Example 4: Cluster CPU Overcommit Ratio
Object Type: Cluster Compute Resource
sum(${this, metric=cpu|num_vcpus_latest, depth=2}) / sum(${this, metric=cpu|corecount_provisioned, depth=0})
This formula divides the total number of vCPUs allocated across all VMs in the cluster (depth=2 to traverse through hosts to VMs) by the total physical core count of the cluster itself (depth=0 for the cluster's own metric), producing the vCPU-to-pCPU overcommit ratio.
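To make the traversal semantics of `depth` and `where` concrete, here is a small Python emulation of how a looping function walks the hierarchy, filters, and aggregates. The data model is hypothetical and not the product's internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str
    metrics: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def descendants(obj: Obj, depth: int) -> list:
    """Objects exactly `depth` levels below obj (depth=0 is obj itself)."""
    level = [obj]
    for _ in range(depth):
        level = [c for o in level for c in o.children]
    return level

def looping_avg(obj: Obj, metric: str, depth: int = 1, where=None):
    """Emulates avg(${this, metric=..., depth=..., where=...})."""
    values = [o.metrics[metric] for o in descendants(obj, depth)
              if metric in o.metrics and (where is None or where(o))]
    return sum(values) / len(values) if values else None

# Cluster -> hosts -> VMs, mirroring Example 2's powered-on filter:
vm1 = Obj("vm1", metrics={"cpu|usage_average": 30}, properties={"PowerState": "Powered On"})
vm2 = Obj("vm2", metrics={"cpu|usage_average": 90}, properties={"PowerState": "Powered Off"})
vm3 = Obj("vm3", metrics={"cpu|usage_average": 50}, properties={"PowerState": "Powered On"})
cluster = Obj("cluster", children=[Obj("host1", children=[vm1, vm2]),
                                   Obj("host2", children=[vm3])])

powered_on = lambda o: o.properties.get("PowerState") == "Powered On"
print(looping_avg(cluster, "cpu|usage_average", depth=2, where=powered_on))  # 40.0
```

The powered-off VM (vm2) is filtered out before aggregation, so only vm1 and vm3 contribute to the average, just as the `where` clause does in a real super metric.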
VCF Operations ships with an extensive library of predefined dashboards that provide immediate visibility into the health, performance, capacity, and efficiency of your virtual infrastructure. These dashboards represent Broadcom's best-practice views and serve as both operational tools and templates for custom dashboard development.
To access predefined dashboards, navigate to Visualize → Dashboards and browse the categorized list in the left panel.
Predefined dashboards are read-only — they cannot be modified directly. To customize a predefined dashboard, clone it and edit the clone; the original remains intact as a template.
Dashboards can be marked as Favorites (star icon) for quick access from the Favorites section of the left panel. The Home Dashboard can be set by navigating to Visualize → Dashboards → Actions → Set as Home Dashboard.
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| VM Performance | Identifies top CPU, memory, disk, and network consumers among virtual machines | Top-N CPU Usage, Top-N Memory Usage, Top-N Disk Latency, Top-N Network Throughput, Metric Chart |
| Cluster Performance | Displays cluster-level utilization trends for compute and storage | Cluster CPU/Memory Utilization Heatmap, Utilization Trend Charts, DRS Balance Scoreboard |
| ESXi Host Performance | Shows per-host utilization, contention, and hardware health | Host CPU/Memory Utilization, Host Contention Metrics, NIC Throughput, HBA Throughput |
| Datastore Performance | Monitors storage latency, IOPS, and throughput per datastore | Datastore Latency Trend, IOPS Distribution, Throughput Top-N, Outstanding IO |
| Network Performance | Tracks packet loss, throughput, errors, and dropped packets across network paths | Packet Loss Heatmap, Throughput Trends, Error Rate Scoreboard, Dropped Packets Top-N |
| vSAN Performance | Provides vSAN-specific IOPS, latency, throughput, and congestion metrics | vSAN IOPS Trend, Backend Latency, Congestion Scoreboard, Disk Group Performance |
| VM Contention | Surfaces per-VM contention indicators including CPU Ready, Co-Stop, and Memory Contention | CPU Ready % Top-N, Co-Stop Top-N, Memory Contention % Top-N, Disk Latency Top-N |
| Cluster Contention | Aggregates contention metrics at the cluster level for rapid triage | Cluster CPU Contention Heatmap, Memory Pressure Trend, Cluster Disk Latency Summary |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Cluster Capacity | Shows Time Remaining and Capacity Remaining per cluster with trend projections | Capacity Remaining Scoreboard, Time Remaining Scoreboard, Capacity Trend Chart, What-If Scenario |
| Datastore Capacity | Monitors storage utilization, provisioned vs used space, and forecast | Datastore Usage Heatmap, Capacity Trend, Thin Provisioning Overcommit, Forecast Chart |
| ESXi Host Capacity | Displays per-host capacity metrics including headroom for additional workloads | Host CPU/Memory Remaining, VM Density, Headroom Scoreboard |
| VM Capacity | Provides rightsizing recommendations for oversized and undersized VMs | Oversized VMs List, Undersized VMs List, Reclaimable CPU/Memory Scoreboard, Idle VMs |
| vSAN Capacity | Shows vSAN capacity utilization including deduplication and compression savings | vSAN Used vs Free, Dedup/Compression Ratio, Slack Space, Capacity Trend |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Cost Overview | Provides total and monthly cost breakdown across the environment | Total Cost Scoreboard, Monthly Trend Chart, Cost by Object Type, Cost by Datacenter |
| Optimization | Quantifies potential cost savings from rightsizing and reclamation | Reclaimable Cost Scoreboard, Powered-Off VM Cost, Idle VM Cost, Snapshot Cost |
| Showback | Displays cost allocation by business unit, department, or custom grouping | Cost by Department Chart, Cost by Application, Cost by Environment Tier |
| Chargeback | Supports billing integration with per-consumer cost detail | Chargeable Cost per Consumer, Rate Card Summary, Invoice Detail |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Availability Overview | Summarizes uptime, active alerts, and overall environment health | Uptime Scoreboard, Alert Count by Severity, Health Badge Summary, Outage Timeline |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Carbon Footprint | Estimates carbon emissions based on compute power consumption and regional emission factors | Total Carbon Emissions Scoreboard, Emissions Trend, Emissions by Cluster, PUE Factor |
| Green Scorecard | Tracks energy efficiency metrics and sustainability KPIs | Energy Efficiency Score, Power Consumption Trend, Idle Resource Waste, Green Improvement Recommendations |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| NSX-T Overview | High-level summary of NSX-T environment health, alert count, and component status | NSX Manager Health, Transport Node Status, Edge Cluster Status, Alert Summary |
| NSX Security Overview | Security posture summary including firewall rule counts, policy compliance, and threat indicators | DFW Rule Count, Security Policy Status, Applied Profiles, Threat Activity |
| NSX Logical Switching | Monitors logical switch health, port utilization, and segment configuration | Logical Switch List, Port Count Summary, Segment Health, VLAN/VXLAN Mapping |
| NSX Edge Performance | Tracks NSX Edge node CPU, memory, throughput, and session count | Edge CPU/Memory Utilization, Throughput per Edge, NAT Session Count, IPSec Tunnel Status |
| NSX Distributed Firewall | Monitors DFW rule evaluation rates, connection counts, and CPU overhead on hosts | DFW Rule Hit Count, Connection Rate, CPU Overhead Trend, Rule Table Size |
| NSX Load Balancer | Displays load balancer pool health, session distribution, and throughput | Pool Health Status, Active Sessions, Request Rate, Server Health Checks |
| NSX Network Topology | Visual topology map showing the relationships between logical routers, switches, and edge nodes | Interactive Topology Graph, Component Status Overlay, Alert Badge Overlay |
| NSX Troubleshooting | Diagnostic dashboard for identifying NSX control/data plane issues | Traceflow Results, Controller Cluster Health, Transport Zone Status, BFD Session Status |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Application Monitoring | Tracks application-level metrics from integrated APM sources | Application Health Summary, Response Time Trend, Error Rate, Dependency Map |
| Workload Management | Monitors Tanzu Kubernetes clusters and workload placement | TKG Cluster Status, Pod Count, Namespace Utilization, Supervisor Cluster Health |
| Migration Planning | Assesses VM migration readiness and provides cloud cost comparison | Migration Readiness List, Cloud Cost Estimate, Dependency Analysis, Compatibility Check |
| Service Discovery | Maps discovered application services and their infrastructure dependencies | Service Map, Dependency Graph, Communication Flow, Infrastructure Mapping |
The following table provides industry-standard threshold guidance for key performance indicators. These values are used by many of the predefined dashboards and alert definitions.
| KPI | Good (Green) | Warning (Yellow) | Critical (Red) | Notes |
|---|---|---|---|---|
| CPU Ready % | < 2.5% | 2.5% – 5.0% | > 5.0% | Measured on a per-vCPU basis. Values above 5% indicate the VM is waiting for physical CPU scheduling and will experience application-visible latency. |
| CPU Co-Stop % | < 2.0% | 2.0% – 4.0% | > 4.0% | Relevant for SMP (multi-vCPU) VMs. Indicates vCPUs being halted to synchronize scheduling. Reduce vCPU count if consistently high. |
| Memory Contention % | < 1.0% | 1.0% – 3.0% | > 3.0% | Includes ballooning, swapping, and compression. Values above 3% indicate the host is under memory pressure and VMs are experiencing degraded performance. |
| Disk Latency (ms) | < 10 | 10 – 20 | > 20 | Combined read + write latency at the virtual disk (VMDK) level. Values above 20 ms are perceptible to most applications. |
| Disk Command Aborts | 0 | 1 – 5 | > 5 | Per collection interval (5 minutes). Any aborted commands indicate storage path issues and warrant investigation. |
| Network TX Drops | 0 | 1 – 100 | > 100 | Transmitted packet drops per interval. Indicates transmit queue saturation, typically caused by network bandwidth exhaustion or vSwitch misconfiguration. |
| Packet Loss % | 0% | 0% – 0.1% | > 0.1% | End-to-end packet loss. Even 0.1% loss is significant for latency-sensitive applications (VoIP, RDP, database replication). |
| vSAN Latency (ms) | < 5 | 5 – 10 | > 10 | vSAN backend (device-level) latency. Frontend (VM-visible) latency may be higher. Values above 10 ms indicate disk group saturation or network congestion. |
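To apply these bands programmatically, for example in a script consuming metric values from the REST API, a small classifier following the table's band edges might look like this (illustrative only; the inclusive/exclusive boundary handling mirrors the table, where Good is strictly below the warning edge and Critical is strictly above the critical edge):

```python
def classify_kpi(value: float, warn: float, crit: float) -> str:
    """Map a KPI reading to a badge color using the table's band edges."""
    if value > crit:
        return "Red"
    if value >= warn:
        return "Yellow"
    return "Green"

# CPU Ready %: warning band starts at 2.5, critical above 5.0
print(classify_kpi(1.8, 2.5, 5.0))  # Green
print(classify_kpi(3.2, 2.5, 5.0))  # Yellow
print(classify_kpi(6.0, 2.5, 5.0))  # Red
```

The same helper works for any row of the table by substituting that KPI's warning and critical edges.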
All dashboards support configurable time ranges and refresh intervals that control the data window displayed by widgets.
Time Range Options:
| Setting | Duration | Best For |
|---|---|---|
| Last Hour | 1 hour | Real-time troubleshooting, active incident investigation |
| Last 6 Hours | 6 hours | Default view — covers a typical shift or business window |
| Last 24 Hours | 24 hours | Daily review, identifying overnight patterns |
| Last 7 Days | 7 days | Weekly trend analysis, capacity planning reviews |
| Last 30 Days | 30 days | Monthly reporting, long-term trend identification |
| Custom | User-defined start and end | Post-incident analysis, compliance audits, specific maintenance windows |
The time range selector is located in the top-right toolbar of every dashboard. Changing the time range affects all time-aware widgets on the dashboard simultaneously.
Auto-Refresh Intervals:
| Setting | Behavior |
|---|---|
| Off | Dashboard displays static data from the last load; manual refresh required |
| 5 Minutes | Dashboard automatically refreshes every 5 minutes (aligns with default collection interval) |
| 10 Minutes | Dashboard automatically refreshes every 10 minutes |
| 15 Minutes | Dashboard automatically refreshes every 15 minutes |
The auto-refresh toggle is located next to the time range selector. For dashboards displayed on NOC wall screens, set auto-refresh to 5 minutes to maintain near-real-time visibility.
Note: Setting aggressive auto-refresh intervals on dashboards with many widgets or large object scopes may increase load on the VCF Operations analytics cluster. For environments with more than 10,000 objects, consider using 10- or 15-minute refresh intervals for complex dashboards.
While the predefined dashboards cover a broad range of operational scenarios, custom dashboards enable you to build views tailored to your organization's specific monitoring requirements, operational workflows, and reporting needs.
Follow these steps to create a new custom dashboard:
Step 1. Navigate to Visualize → Dashboards.
Step 2. Click the Create button in the top toolbar (or use the + icon).
Step 3. Enter a Dashboard Name (e.g., "Production Cluster Health — Tier 1").
Step 4. Optionally select a Dashboard Template from the dropdown. Templates provide pre-arranged widget layouts that you can populate with your own data sources. Available templates include Blank Canvas, Two-Column, Three-Column, Executive Summary, and Troubleshooting.
Step 5. Set the Default Time Range for the dashboard (e.g., Last 6 Hours). Individual widgets can override this if needed.
Step 6. Click Save. The empty dashboard canvas appears in edit mode, ready for widgets to be added.
To add widgets, drag them from the widget catalog onto the dashboard canvas, then configure each widget's data source, object scope, and display options.
VCF Operations provides a comprehensive widget catalog organized by functional category.
| Widget Name | Description |
|---|---|
| Metric Chart | Time-series visualization supporting line, area, and stacked area chart types. Displays one or more metrics for one or more objects over the selected time range. Supports trend lines, dynamic thresholds overlay, and data table toggle. |
| Scoreboard | Displays a single KPI value with configurable color-coded status bands (green/yellow/orange/red). Ideal for executive-level dashboards showing current state at a glance. Supports sparkline overlay and multi-metric mode. |
| Heatmap | Color-coded grid where each cell represents an object, colored by a selected metric value, and optionally sized by a second metric. Enables rapid visual identification of outliers across large object populations. |
| Top-N | Horizontal or vertical bar chart ranking objects by a selected metric. Configurable for top or bottom N values. Useful for identifying the highest consumers or worst performers. |
| Topology Graph | Interactive relationship map showing objects and their connections. Displays health badges, metric overlays, and alert status on each node. Supports configurable relationship depth. |
| Distribution Chart | Histogram or pie chart showing the distribution of objects across value ranges for a selected metric. Useful for understanding workload profiles and identifying clusters of similar behavior. |
| Sparkline | Compact, minimal trend line designed for embedding in dense dashboards. Shows directional trend without axis labels or detailed data points. |
Inventory and alert widgets:

| Widget Name | Description |
|---|---|
| Object List | Filterable, sortable table of inventory objects with configurable columns. Supports inline metric values, health badges, and property display. Can serve as a provider widget to drive other widgets on the dashboard. |
| Object Relationship | Hierarchical navigation widget showing parent, child, and peer relationships for a selected object. Enables drill-down through the inventory tree. |
| Alert List | Filtered table of active alerts with columns for severity, alert name, object name, time triggered, and control state. Supports filtering by alert type, criticality, object type, and time range. |
| Symptom List | Filtered table of active symptoms with details on the triggering condition, current value, and threshold. |
| Property List | Displays configuration properties and attributes for a selected object (CPU count, memory size, guest OS, tools version, etc.). |
Layout, utility, and navigation widgets:

| Widget Name | Description |
|---|---|
| Text Widget | Displays static text content. Supports HTML and Markdown formatting for embedding instructions, notes, team contact information, or operational procedures directly in the dashboard. |
| Image Widget | Embeds a static image (PNG, JPG, SVG) in the dashboard. Used for logos, architecture diagrams, or visual context. Images can be uploaded or referenced by URL. |
| Rolling View | Automatically cycles through a configured list of dashboards at a set interval. Designed for NOC wall displays that need to rotate between multiple views. |
| Container Widget | Groups multiple widgets into a tabbed container, conserving dashboard real estate. Each tab contains a separate widget, and users click tabs to switch between them. |
| Navigation Widget | Displays clickable links or buttons that navigate to other dashboards, external URLs, or specific objects in the inventory. Used for building multi-level dashboard hierarchies. |
| Geo Map | Plots objects on a geographic map based on configured location coordinates. Each marker shows health status and can be clicked for detail. Useful for multi-site or distributed infrastructure monitoring. |
The Scoreboard widget is the most commonly used widget for executive dashboards and NOC displays.
Configuration steps:
The Heatmap widget provides instant visual identification of outliers across hundreds or thousands of objects.
Configuration steps:
The Metric Chart widget is the primary tool for time-series analysis and trend investigation.
Configuration steps:
The Top-N widget ranks objects by a selected metric to quickly surface the highest or lowest performers.
Configuration steps:
The Topology Graph widget visualizes the relationships between infrastructure objects as an interactive network diagram.
Configuration steps:
Widget interactions enable a powerful provider/receiver paradigm where selecting an object in one widget automatically updates the data displayed in other widgets on the same dashboard. This creates interactive, drill-down capable dashboards.
Key concepts:
Configuring widget interactions:
Performance considerations:
Example interaction configuration:
A common pattern is the "list-and-detail" layout:
| Widget | Role | Purpose |
|---|---|---|
| Object List (Virtual Machines) | Provider | Displays a filterable list of VMs. User clicks a row to select a VM. |
| Metric Chart (CPU) | Receiver | Shows CPU usage trend for the selected VM. |
| Metric Chart (Memory) | Receiver | Shows memory usage trend for the selected VM. |
| Alert List | Receiver | Shows active alerts for the selected VM. |
| Property List | Receiver | Shows configuration properties of the selected VM. |
When the operator clicks a VM in the Object List, all four receiver widgets update simultaneously to show data for that specific VM, creating a cohesive investigation experience.
Dashboard navigation enables you to link multiple dashboards together, creating hierarchical drill-down paths that guide operators from high-level overviews to detailed investigation views.
Method 1: Navigation Widget
The Navigation Widget provides explicit, clickable links to other dashboards or external URLs.
Method 2: Object Click Actions
Configure what happens when a user clicks an object in a widget:
Method 3: Dashboard Linking via URL Parameters
Dashboards can be directly linked using URL parameters that pre-select objects and time ranges:
- Append `?objectId=<resource_id>` to the dashboard URL to pre-select an object.
- Append `&timeRange=<start>-<end>` in epoch milliseconds to pre-set the time window.

Best practices for dashboard navigation:
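These URL parameters can be assembled programmatically, for example when generating deep links from a ticketing system. This is a minimal sketch: only the `objectId` and `timeRange` parameter names come from the text above, and the base URL and object ID used in the example are placeholders.

```python
from datetime import datetime

def dashboard_url(base_url: str, object_id: str, start: datetime, end: datetime) -> str:
    """Build a dashboard deep link with objectId and a timeRange in epoch milliseconds."""
    to_ms = lambda dt: int(dt.timestamp() * 1000)  # epoch milliseconds, as specified above
    return f"{base_url}?objectId={object_id}&timeRange={to_ms(start)}-{to_ms(end)}"
```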
Dashboards are the primary interface through which operators, engineers, and executives consume data from VCF Operations. A poorly designed dashboard buries critical information; a well-designed dashboard surfaces the right data to the right audience at the right time. This chapter provides six ready-to-implement dashboard blueprints and a set of universal design principles.
This dashboard is the first screen an operations engineer should open each morning. It answers one question: "Is anything broken or about to break?"
Row 1 — Scoreboards (4 widgets, equal width)
| Widget | Type | Metric / Property | Color Coding |
|---|---|---|---|
| Overall Cluster Health | Scoreboard | Worst badge color across all clusters | Green / Yellow / Orange / Red |
| Total Critical Alerts | Scoreboard | Count of alerts where Criticality = Critical | Red if > 0, Green if 0 |
| Total Warning Alerts | Scoreboard | Count of alerts where Criticality = Warning | Yellow if > 5, Green if ≤ 5 |
| VM Count / Host Count | Scoreboard | Total VMs (powered on) and total ESXi hosts | Informational — no threshold |
Configuration Tip: Set the Scoreboard refresh interval to 5 minutes. Use the "Sparkline" option to show a 24-hour mini-trend directly inside the scoreboard tile.
Row 2 — Top-N Performance Offenders (3 widgets, equal width)
| Widget | Type | Object Type | Metric | Sort | Count |
|---|---|---|---|---|---|
| Top-N CPU Ready VMs | Top-N | Virtual Machine | cpu|readyPct | Descending | 10 |
| Top-N Memory Contention VMs | Top-N | Virtual Machine | mem|contention_average | Descending | 10 |
| Top-N Disk Latency VMs | Top-N | Virtual Machine | virtualDisk|totalLatency | Descending | 10 |
Row 3 — Trends and Heatmaps (2 widgets, 60/40 split)
| Widget | Type | Configuration |
|---|---|---|
| Cluster Capacity Heatmap | Heatmap | Object: Cluster Compute Resource; Color by: cpu|capacityRemaining_percentage; Size by: summary|total_number_vms |
| Alert Trend (7-day) | Metric Chart | Scope: all clusters; Metric: count of alerts by day; Mode: stacked bar by criticality |
Note: Why a 7-day alert trend? A 7-day window reveals patterns tied to weekly batch jobs, backup windows, or recurring misconfigurations. A single day's snapshot hides these cycles.
This dashboard is reviewed weekly by capacity and infrastructure teams. It answers: "When will we run out of resources, and what can we reclaim?"
Row 1 — Scoreboards
| Widget | Metric | Threshold |
|---|---|---|
| Clusters at Risk | Count of clusters where Time Remaining < 90 days | Red if > 0 |
| Total Reclaimable vCPU | Sum of reclaimable CPU across all VMs (from rightsizing engine) | Informational |
| Total Reclaimable Memory (GB) | Sum of reclaimable RAM | Informational |
| Average Cluster Utilization % | Avg of cpu|demandPct across clusters | Yellow > 70%, Red > 85% |
Row 2 — Bar Charts (2 widgets, equal width)
| Widget | Type | Details |
|---|---|---|
| Cluster Capacity Time Remaining | Top-N (horizontal bar) | Metric: capacityRemainingUsingConsumers_timeRemaining; Sort: Ascending (worst first); Top 10 |
| Datastore Capacity Remaining | Top-N (horizontal bar) | Metric: diskspace|capacityRemaining_percentage; Sort: Ascending; Top 10 |
Row 3 — Lists and Actions (2 widgets, equal width)
| Widget | Type | Details |
|---|---|---|
| VM Rightsizing Candidates | Object List | Filter: oversized = true; Columns: VM Name, Provisioned vCPU, Recommended vCPU, Provisioned RAM, Recommended RAM |
| What-If Scenario Launcher | Text Widget | Hyperlink to Optimize → What-If Analysis with instructions |
Capacity Threshold Recommendations:
| Resource | Conservative | Moderate | Aggressive |
|---|---|---|---|
| CPU Demand % | 60% | 70% | 80% |
| Memory Demand % | 70% | 80% | 90% |
| Datastore Used % | 70% | 80% | 85% |
| Time Remaining (days) | 180 | 90 | 60 |
This dashboard is used during active troubleshooting or continuous performance reviews. It answers: "How are my workloads performing right now and over time?"
Row 1 — Scoreboards (3 widgets)
| Widget | Metric | Threshold |
|---|---|---|
| Average CPU Usage % | Avg cpu|usage_average across all clusters | Yellow > 70%, Red > 85% |
| Average Memory Usage % | Avg mem|usage_average across all clusters | Yellow > 75%, Red > 90% |
| Average Disk Latency (ms) | Avg virtualDisk|totalLatency across all VMs | Yellow > 15 ms, Red > 25 ms |
Row 2 — Metric Charts (2 widgets, equal width)
| Widget | Type | Configuration |
|---|---|---|
| Cluster CPU/Memory Trend (30-day) | Metric Chart (line) | Scope: select clusters; Metrics: cpu|demandPct, mem|demandPct; Date range: Last 30 Days; Show dynamic thresholds |
| vSAN Latency Trend | Metric Chart (line) | Scope: vSAN clusters; Metrics: vSAN|readLatency, vSAN|writeLatency; Date range: Last 30 Days |
Row 3 — Heatmap and Top-N (2 widgets, 60/40 split)
| Widget | Type | Configuration |
|---|---|---|
| All VMs by CPU Ready % | Heatmap | Object: Virtual Machine; Group by: Parent Cluster; Color by: cpu|readyPct; Size by: config|hardware|numCpu |
| Top-N Network Drops | Top-N | Object: Host System; Metric: net|droppedPct; Sort: Descending; Count: 10 |
This dashboard serves finance teams and infrastructure managers tracking cloud and on-premises spending. It answers: "Where is the money going, and where can we save?"
Row 1 — Scoreboards (3 widgets)
| Widget | Metric | Notes |
|---|---|---|
| Total Monthly Cost | costop|totalCost | Requires cost drivers to be configured under Optimize → Cost Drivers |
| Cost per VM | costop|costPerVM | Derived from total cost ÷ powered-on VM count |
| Cost Trend | Metric Chart (sparkline) | 6-month trend of totalCost |
Row 2 — Distribution and Savings (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Cost by Department | Distribution (pie chart) | Group by: Custom Property "Department"; Metric: costop|totalCost |
| Optimization Savings Potential | Scoreboard | Metric: sum of potential savings from rightsizing + reclamation recommendations |
Row 3 — Actionable Lists (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Idle / Powered-Off VM List | Object List | Filter: powerState = poweredOff OR idleVM = true; Columns: VM Name, Power State, Days Since Last I/O, Monthly Cost |
| Snapshot Age Violations | Object List | Filter: snapshot|age > 72 hours; Columns: VM Name, Snapshot Name, Age (hours), Size (GB) |
This dashboard is essential for security and audit teams. It answers: "Are we compliant, and where have we drifted?"
Row 1 — Scoreboards (2 widgets)
| Widget | Metric | Threshold |
|---|---|---|
| Overall Compliance Score | Percentage of objects passing all benchmark tests | Green ≥ 95%, Yellow ≥ 80%, Red < 80% |
| Non-Compliant Objects Count | Count of objects with at least one failure | Red if > 0 |
Row 2 — Compliance by Benchmark (3 widgets)
| Widget | Type | Configuration |
|---|---|---|
| DISA STIG Compliance | Scoreboard + bar | Pass/Fail count for DISA STIG benchmark rules |
| CIS Benchmark Compliance | Scoreboard + bar | Pass/Fail count for CIS benchmark rules |
| PCI-DSS Compliance | Scoreboard + bar | Pass/Fail count for PCI-DSS benchmark rules |
Row 3 — Drift and Changes (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Drift Detection Alerts | Alert List | Filter: Alert Type = Compliance, Sub-type = Drift; Sort by: time (newest first) |
| Configuration Change Timeline | Metric Chart (event overlay) | Show configuration change events overlaid on compliance score trend |
This dashboard is designed for C-level and director-level audiences. It prioritizes clarity over detail and should be presentable on a projector or shared screen without explanation.
Design Principles for Executive Dashboards:
Row 1 — Environment Scorecard (3 large scoreboards)
| Widget | Label | Source |
|---|---|---|
| Health | "Infrastructure Health" | Worst health badge across all clusters |
| Risk | "Risk Score" | Highest risk badge across all clusters |
| Efficiency | "Resource Efficiency" | Average efficiency badge across all clusters |
Row 2 — 30-Day Trends (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| 30-Day Alert Trend | Metric Chart (area) | Stacked area by criticality (Critical, Warning, Info); Date range: 30 days |
| Capacity Runway Summary | Scoreboard list | Show Time Remaining (days) for each cluster, color-coded |
Row 3 — Cost and Sustainability (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Cost Summary | Scoreboard | Total monthly cost with month-over-month delta percentage |
| Sustainability Metrics | Scoreboard | Power consumption (kWh), Carbon estimate (if available via management pack) |
Limit widget count. Keep dashboards to 15–20 widgets maximum. Each additional widget increases render time and cognitive load. If you need more, create a second dashboard and link them.
Use widget interactions for drill-down. Configure widget interactions so that clicking an object in a Top-N chart drives the selection in a Metric Chart or Object List widget on the same dashboard. This eliminates the need to duplicate data.
Group related metrics logically. Place CPU metrics adjacent to CPU-related alerts. Place capacity widgets together. The user's eye should flow naturally from overview to detail, left to right, top to bottom.
Use consistent time ranges. If one chart shows 30 days, all charts on that dashboard should show 30 days unless there is a specific analytical reason to differ. Inconsistent ranges confuse viewers.
Place critical KPIs in the top-left quadrant. Eye-tracking studies confirm that users scan dashboards starting from the top-left. Place the most urgent or important information there.
Use Text Widgets for section headers. A simple text widget with a bold label like "Performance Indicators" or "Capacity Metrics" helps organize the dashboard visually and aids comprehension.
Clone predefined dashboards as starting points. VCF Operations ships with dozens of out-of-the-box dashboards. Clone one that is close to your goal, then modify it. This saves time and ensures you start with proven widget configurations.
Test with real data at scale. A dashboard that loads quickly in a lab with 50 VMs may be unusably slow in production with 10,000 VMs. Test with production scope before publishing.
Set appropriate default scopes. Avoid dashboards scoped to "All Objects" when a narrower scope (specific cluster, resource pool, or custom group) would be more relevant.
Document your dashboards. Add a Text Widget at the top of each dashboard with a one-sentence purpose statement and the intended audience. This prevents dashboard sprawl and confusion.
Views and Reports are the primary mechanism for extracting structured, repeatable, and shareable data from VCF Operations. While dashboards are interactive and real-time, reports are static snapshots designed for distribution, archival, and audit compliance.
Views are the building blocks of reports. Each view type presents data in a specific visual format optimized for a particular analytical need.
| View Type | Description | Best Use Case | Output Format |
|---|---|---|---|
| List View | Tabular list of objects with selected metrics and properties displayed as columns | Inventory reports, VM configuration audits, host hardware lists | Table |
| Trend View | Time-series line or area graph plotting one or more metrics over a defined date range | Performance analysis, capacity trending, SLA compliance over time | Line/Area Chart |
| Distribution View | Pie chart or histogram showing how a metric's values are distributed across objects | Resource allocation analysis, workload distribution, cost breakdown by department | Pie/Histogram |
| Image View | Custom uploaded image (PNG, JPG, SVG) with data overlays positioned at specific coordinates | Network topology diagrams, data center floor plans with live metrics, rack diagrams | Annotated Image |
| Summary View | Aggregated statistics (average, minimum, maximum, sum, count) for selected metrics across a group of objects | Executive summaries, SLA reports, aggregate capacity statements | Summary Table |
Note: Image Views require you to upload a base image first, then map data points to specific pixel coordinates on the image. This is most commonly used for physical data center visualizations.
Follow these steps to create a custom view.
Step 1. Navigate to Visualize → Views in the left navigation menu.
Step 2. Click the Create button (plus icon) in the toolbar.
Step 3. In the Presentation section, enter:
Step 4. Select the View Type from the dropdown: List, Trend, Distribution, Image, or Summary.
Step 5. In the Subjects section, select the Object Type that this view will report on. Common selections include:
Step 6. Switch to the Data tab. Here you select the metrics and properties to display:
Step 7. Switch to the Filter tab (optional). Apply conditions to limit which objects appear in the view. Filters use property or metric-based conditions such as:
- `powerState = poweredOn`
- `cpu|usage_average > 80`
- `summary|config|numCpu >= 8`

Multiple filter conditions can be combined with AND/OR logic.
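The AND/OR combination logic described above can be illustrated with a small evaluator. This is an illustrative sketch, not the product's filter engine; the metric keys mirror the filter examples in this step.

```python
import operator

# Comparison operators used in view filter conditions
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(obj: dict, conditions: list, mode: str = "AND") -> bool:
    """Evaluate conditions like ("cpu|usage_average", ">", 80) against an object's data."""
    results = [OPS[op](obj.get(key), value) for key, op, value in conditions]
    return all(results) if mode == "AND" else any(results)
```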
Step 8. Click Preview to verify the output displays the expected data with the correct format and filtering.
Step 9. Click Save. The view is now available for use in dashboards or report templates.
Tip: When creating List Views, limit the number of columns to 10–12 for readability. If you need more data points, create a second view rather than cramming everything into one table.
Report templates combine one or more views into a formatted document suitable for distribution. Follow this procedure.
Step 1. Navigate to Visualize → Reports in the left navigation menu.
Step 2. Click Create Template in the toolbar.
Step 3. Enter the Report Name (e.g., "Weekly Infrastructure Health Report") and an optional Description.
Step 4. In the report canvas, add views by dragging them from the left panel into the report body. You can include multiple views of different types. Arrange them in the desired order — each view will render as a separate section in the final report.
Step 5. Optionally configure presentation elements:
Step 6. Click Save. The template is now available for on-demand generation or scheduled execution.
Important: Report templates are separate from the data they display. A template defines the structure; the data is populated at generation time based on the scope you select.
On-Demand Generation:
Scheduled Generation:
| Parameter | Options | Recommendation |
|---|---|---|
| Frequency | Daily, Weekly, Monthly | Weekly for operational reports, Monthly for executive reports |
| Day of Week | Monday–Sunday (for weekly) | Monday morning for "last week" review |
| Time of Day | HH:MM (24-hour format) | 06:00 — before the operations team arrives |
| Scope | Object, Group, or Tag-based | Use Custom Groups for consistent scoping |
Available placeholders: `${ReportName}`, `${Date}`.

Warning: Scheduled reports consume analytics engine resources during generation. Avoid scheduling more than 10 reports in the same time window. Stagger schedules by 15–30 minutes.
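When registering many schedules, the recommended stagger can be generated mechanically. A sketch; schedule registration itself happens in the UI or API and is not shown here.

```python
from datetime import datetime, timedelta

def staggered_start_times(first_run: datetime, report_count: int, gap_minutes: int = 15) -> list:
    """Spread scheduled report start times so generations do not pile up in one window."""
    return [first_run + timedelta(minutes=gap_minutes * i) for i in range(report_count)]
```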
| Format | Content | Use Case | Limitations |
|---|---|---|---|
| PDF | Fully formatted report with charts, tables, headers, footers, and cover page | Distribution to stakeholders, audit documentation, archival | Charts are rendered as static images; no interactivity |
| CSV | Raw tabular data export; one CSV file per List or Summary view in the report | Spreadsheet analysis, data import into third-party tools, custom charting | No charts or formatting; Trend and Distribution views export as data tables |
Both formats are available from the Generated Reports tab. Click the download icon next to a completed report and select the desired format.
Tip: For automated downstream processing, use the Suite API endpoint `POST /suite-api/api/reports/{reportId}/download` with the `format` query parameter set to `csv`. This enables integration with ticketing systems, SharePoint libraries, or custom portals.
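A downstream integration can construct the download request as follows. This sketch only builds the URL; authentication (for example, a bearer token header) and the actual HTTP call are omitted, the endpoint path is taken verbatim from the tip above, and the host name is a placeholder.

```python
from urllib.parse import urlencode

def report_download_url(base: str, report_id: str, fmt: str = "csv") -> str:
    """Build the Suite API download URL for a generated report."""
    return f"{base}/suite-api/api/reports/{report_id}/download?{urlencode({'format': fmt})}"
```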
Capacity planning in VCF Operations moves beyond simple threshold monitoring into predictive analytics. The platform's capacity engine continuously analyzes historical consumption patterns, applies multiple forecasting algorithms, and produces actionable recommendations for rightsizing, reclamation, and future procurement.
The capacity engine evaluates every cluster, datastore, and resource pool across three dimensions.
| Metric | Definition | Where to Find | Action Trigger |
|---|---|---|---|
| Time Remaining | Projected number of days until a resource (CPU, Memory, Disk) reaches its usable capacity limit | Optimize → Capacity → select cluster | < 90 days: plan procurement or migration |
| Capacity Remaining (%) | Percentage of total usable capacity that is still available after accounting for HA reserves, buffers, and current demand | Optimize → Capacity → select cluster | < 20%: immediate attention required |
| Recommended Size | The optimal allocation of vCPU, memory, or disk for a given VM based on actual usage patterns | Optimize → Rightsizing → select VM | Delta > 25% from current: rightsizing candidate |
The capacity engine runs on a continuous cycle, recalculating projections every collection interval (default: 5 minutes for real-time, daily for long-term forecasts).
Important: Capacity calculations honor the policy settings applied to each object. If your policy sets a CPU utilization cap of 70% (meaning 70% is considered "full"), Time Remaining reflects when demand will reach 70%, not 100%.
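The interaction between Time Remaining and the policy cap can be sketched with a simple linear projection (the real engine blends several algorithms, as described in the next section). Everything here is illustrative: demand is one sample per day, and the 70% cap mirrors the example in the note above.

```python
def time_remaining_days(daily_demand_pct: list, usable_cap_pct: float = 70.0):
    """Project days until demand reaches the policy cap, using a least-squares linear fit."""
    n = len(daily_demand_pct)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(daily_demand_pct) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_demand_pct)) \
            / sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # flat or shrinking demand: no exhaustion projected
    return max(0.0, (usable_cap_pct - daily_demand_pct[-1]) / slope)
```

With demand growing 1% per day from 50% to 54%, the cluster hits a 70% cap in 16 days, not the ~46 days a naive projection to 100% would suggest.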
VCF Operations does not rely on a single forecasting model. Instead, it runs multiple algorithms in parallel and selects the best fit for each metric on each object.
| Algorithm | How It Works | Best Suited For | Weakness |
|---|---|---|---|
| Change-Point Detection | Identifies sudden, sustained shifts in the data (step changes) and adjusts the baseline accordingly | Environments with frequent application deployments or workload migrations | May over-react to one-time events when insufficient history is available |
| Linear Regression | Fits a straight line through historical data points and projects the trend forward | Steady, predictable growth patterns (e.g., data stores growing at constant rate) | Cannot model cyclical or seasonal patterns |
| Cyclical Analysis | Detects repeating patterns on daily, weekly, or monthly cycles and factors them into the projection | Workloads with known cycles — month-end batch processing, weekly reporting jobs | Requires 2+ full cycles of history to detect patterns |
| Exponential Smoothing | Applies exponentially decreasing weights to older data, giving recent observations more influence | Environments where recent behavior is more indicative of future behavior than distant history | Can be thrown off by recent anomalies |
The analytics engine scores each algorithm's fit against actual historical data using a mean-absolute-percentage-error (MAPE) calculation. The algorithm with the lowest MAPE for a given metric is selected for that metric's forecast.
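The selection rule can be expressed directly: compute each algorithm's MAPE over a backtest window and keep the lowest. A minimal sketch; the algorithm names and backtest values in the example are placeholders.

```python
def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error between observed and back-tested values."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def select_algorithm(history: list, backtests: dict) -> str:
    """Return the algorithm whose backtest fits the history best (lowest MAPE)."""
    return min(backtests, key=lambda name: mape(history, backtests[name]))
```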
Tip: To see which algorithm was selected for a specific metric, navigate to the cluster's Capacity tab and hover over the forecast line. The tooltip displays the algorithm name and confidence interval.
Not all spikes in resource consumption are equal. The capacity engine classifies peaks to prevent false alarms and ensure accurate forecasting.
| Peak Type | Duration | Impact on Capacity Calculation | Example |
|---|---|---|---|
| Momentary | Less than 5 minutes | Ignored — treated as noise | CPU spike during VM snapshot creation, brief network burst |
| Sustained | 5 minutes to 4 hours | Included in analysis with standard weight | Application batch job, database index rebuild, backup window |
| Periodic | Recurring at regular intervals | Weighted appropriately based on recurrence frequency | End-of-month financial close processing, weekly ETL jobs, nightly backups |
Peak classification thresholds can be adjusted in the active policy under Configure → Policies → Edit Policy → Capacity and Allocation → Peak Classification.
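The classification rules above reduce to a short decision function. A sketch using the default thresholds from the table; recurrence detection (which the engine derives from cyclical analysis) is simplified to a boolean flag here.

```python
def classify_peak(duration_minutes: float, recurs_regularly: bool = False) -> str:
    """Classify a consumption peak using the default thresholds from the table above."""
    if recurs_regularly:
        return "periodic"    # weighted by recurrence frequency
    if duration_minutes < 5:
        return "momentary"   # ignored as noise
    return "sustained"       # 5 minutes to 4 hours: standard weight
```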
Navigate to: Optimize → Rightsizing
Rightsizing identifies VMs whose allocated resources are significantly mismatched to their actual consumption patterns.
Oversized VM Detection Criteria:
| Resource | Condition | Default Threshold |
|---|---|---|
| CPU | Provisioned vCPUs exceed peak demand by a factor of 2 or more | Provisioned vCPU > 2x 95th-percentile CPU demand |
| Memory | Provisioned RAM exceeds peak demand by a factor of 1.5 or more | Provisioned RAM > 1.5x 95th-percentile active memory |
Undersized VM Detection Criteria:
| Resource | Condition | Default Threshold |
|---|---|---|
| CPU | CPU Ready percentage consistently elevated | cpu|readyPct > 2.5% over 7-day average |
| Memory | Memory ballooning or swapping is active | mem|balloonPct > 0% or mem|swapused_average > 0 |
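The CPU detection criteria above reduce to a comparison against a 95th-percentile demand figure. A sketch under stated assumptions: demand samples are expressed in vCPU-equivalents, and only the CPU rules (one oversized, one undersized) are shown.

```python
def percentile(values: list, pct: float) -> float:
    """Linear-interpolated percentile (sufficient for this sketch)."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def classify_vm_cpu(provisioned_vcpu: int, demand_samples: list, ready_pct_7day_avg: float) -> str:
    """Default CPU thresholds: oversized if provisioned > 2x p95 demand,
    undersized if the 7-day average CPU Ready exceeds 2.5%."""
    if provisioned_vcpu > 2 * percentile(demand_samples, 95):
        return "oversized"
    if ready_pct_7day_avg > 2.5:
        return "undersized"
    return "right-sized"
```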
Rightsizing Report Columns:
| Column | Description |
|---|---|
| VM Name | Virtual machine display name |
| Current vCPU | Currently provisioned vCPU count |
| Recommended vCPU | Analytics-recommended vCPU count |
| Current Memory (GB) | Currently provisioned RAM |
| Recommended Memory (GB) | Analytics-recommended RAM |
| Potential Savings | Estimated cost reduction if rightsized (requires cost drivers) |
Taking Action on Rightsizing Recommendations:
Warning: Always validate rightsizing recommendations against application-level requirements. A VM may appear oversized from an infrastructure perspective but require the allocated resources for licensing compliance (e.g., Oracle per-core licensing) or application-mandated minimums.
Navigate to: Optimize → Reclaim
The reclamation engine identifies waste — resources that are allocated but delivering no value.
| Category | Detection Criteria | Default Threshold | Typical Savings |
|---|---|---|---|
| Powered-Off VMs | VM in poweredOff state for extended period | Idle > 30 days | Full VM cost recovery |
| Orphaned VMDKs | VMDK files on datastores not attached to any registered VM | Any orphaned VMDK | Storage reclamation |
| Old Snapshots | VM snapshots exceeding age threshold | Age > 72 hours (3 days) | Storage reclamation; performance improvement |
| Idle VMs | Powered-on VMs with negligible CPU, memory, network, and disk I/O | CPU < 100 MHz, Network < 1 KBps, Disk I/O < 1 IOPS for 7+ days | Full VM cost recovery |
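The idle-VM row above combines four thresholds that must all hold over the observation window. A sketch using the default values from the table:

```python
def is_idle_vm(cpu_mhz: float, net_kbps: float, disk_iops: float, days_observed: int) -> bool:
    """Apply the default idle-VM thresholds from the reclamation table above."""
    return (days_observed >= 7
            and cpu_mhz < 100
            and net_kbps < 1
            and disk_iops < 1)
```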
Best Practice: Schedule a weekly reclamation review meeting. Export the reclamation report and distribute it to application owners with a 14-day response window. VMs and VMDKs not claimed within the window are candidates for decommissioning.
Navigate to: Optimize → Workload Optimization
Workload Optimization provides DRS-like placement recommendations, but operates at the VCF Operations level rather than within a single vCenter. This enables cross-cluster and even cross-vCenter balancing recommendations.
Considerations evaluated by the engine:
Output: The engine generates a prioritized list of migration recommendations. Each recommendation includes:
| Field | Description |
|---|---|
| VM Name | The virtual machine to migrate |
| Source Host / Cluster | Current placement |
| Destination Host / Cluster | Recommended placement |
| Improvement | Projected reduction in contention or improvement in balance score |
| Risk | Assessment of migration risk (Low / Medium / High) |
Note: Workload Optimization recommendations are advisory. VCF Operations does not execute migrations autonomously unless integrated with an automation platform and explicitly configured to do so.
Navigate to: Optimize → What-If Analysis
What-If Analysis lets you model hypothetical changes to your environment and see projected capacity impacts before committing resources or budget.
Scenario Types:
| Scenario Type | Question It Answers | Required Inputs |
|---|---|---|
| Add Workload | "What if I deploy 50 new VMs?" | VM profile (vCPU, RAM, Disk per VM), quantity, target cluster |
| Remove Workload | "What if I decommission this cluster's VMs?" | Select VMs or clusters to remove |
| Add Infrastructure | "What if I add 3 hosts to this cluster?" | Host profile (CPU cores, RAM, local storage), quantity, target cluster |
| Change Allocation | "What if I change the overcommit ratio?" | New CPU or memory overcommit ratio, target cluster |
Step-by-Step Procedure (applicable to all scenario types):
Step 1. Click Create Scenario and provide a scenario name (e.g., "Q3 ERP Migration Impact").
Step 2. Select the scenario type from the four options above.
Step 3. Enter parameters specific to the scenario type. For "Add Workload," define the VM profile:
Step 4. Select the target cluster(s) where the workload will be placed or infrastructure will be added.
Step 5. Click Run Analysis. The engine calculates the impact using the same forecasting algorithms described in Section 18.2.
Step 6. Review the results:
| Result Field | Description |
|---|---|
| Time Remaining (Before) | Projected days before the scenario |
| Time Remaining (After) | Projected days after applying the scenario |
| Capacity Remaining % (Before/After) | Side-by-side capacity comparison |
| Risk Level Change | Whether the cluster moves from Green to Yellow/Red |
| Alerts Generated | Any new capacity alerts that would trigger |
Step 7. Save the scenario for future reference or discard it. Saved scenarios can be revisited, modified, and re-run as conditions change.
Tip: Combine scenario types for complex planning. First run "Add Workload" to see the impact of a new project, then run "Add Infrastructure" to determine how many hosts are needed to absorb it. Compare the two scenarios side by side.
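The before/after comparison in Step 6 can be approximated for the "Add Workload" case. A deliberately simplified sketch: capacity and demand are in vCPUs, organic growth is assumed linear, and HA reserves and policy buffers are ignored (the real engine honors all of these).

```python
def add_workload_time_remaining(capacity_vcpu: float, demand_vcpu: float,
                                vm_vcpu: int, vm_count: int,
                                growth_vcpu_per_day: float):
    """Return (before, after) Time Remaining in days for an Add Workload scenario."""
    def days(current_demand: float):
        if growth_vcpu_per_day <= 0:
            return None  # no organic growth: no exhaustion date
        return max(0.0, (capacity_vcpu - current_demand) / growth_vcpu_per_day)
    return days(demand_vcpu), days(demand_vcpu + vm_vcpu * vm_count)
```

For example, dropping 10 two-vCPU VMs onto a 100-vCPU cluster at 60 vCPU demand halves the runway, which is exactly the kind of Green-to-Yellow shift Step 6 surfaces.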
Management Packs extend VCF Operations beyond vSphere, enabling unified monitoring across heterogeneous infrastructure, cloud platforms, applications, and hardware.
A Management Pack is a pluggable adapter module that teaches VCF Operations how to collect, interpret, and act on data from a specific technology. Each management pack is a self-contained package that includes:
| Component | Purpose |
|---|---|
| Adapter Code | The collection engine that connects to the target system via API, SNMP, WMI, SSH, or other protocol |
| Object Model | Defines the object types (e.g., "AWS EC2 Instance," "NetApp Volume") and their relationships |
| Metric Definitions | The specific metrics to collect, their units, and collection intervals |
| Dashboards | Pre-built dashboards tailored to the monitored technology |
| Alert Definitions | Symptoms and alert rules specific to the technology |
| Views and Reports | Pre-built views and report templates |
Management packs are distributed as PAK files (Platform Archive Kit) — a signed archive format used by the VCF Operations platform for all extensions and updates.
Step 1. Obtain the management pack PAK file. Sources include:
Step 2. In VCF Operations, navigate to Administration → Integrations → Repository.
Step 3. Click Add (or Upload PAK File, depending on UI version).
Step 4. Browse to the downloaded PAK file and click Upload. The system validates the file signature and compatibility.
Step 5. Review and accept the End User License Agreement (EULA).
Step 6. Monitor the installation progress bar. Installation typically takes 2–5 minutes. The cluster will distribute the adapter code to all nodes automatically.
Step 7. After installation completes, configure the adapter instance:
Warning: After installing a management pack, allow 2–3 collection cycles (typically 10–15 minutes) before expecting data to appear in dashboards. The initial collection cycle populates the object inventory; subsequent cycles populate metrics.
The following table lists the management packs available from Broadcom, including those built into VCF Operations and those available as separate downloads.
| # | Management Pack | Version | Monitored Technology | Key Metrics | Built-In |
|---|---|---|---|---|---|
| 1 | VMware vSphere | 8.18.2 | vCenter, ESXi Hosts, VMs, Resource Pools | CPU, Memory, Disk, Network for all vSphere objects | Yes |
| 2 | VMware NSX-T | 8.18.2 | NSX Manager, Transport Nodes, Logical Switches, DFW | Transport node health, DFW rule hit counts, tunnel status | Yes |
| 3 | VMware SDDC Manager | 8.18.2 | SDDC Manager, Workload Domains, VCF Lifecycle | Domain health, lifecycle operation status | Yes |
| 4 | VMware vSAN | 8.18.2 | vSAN Clusters, Disk Groups, Capacity Devices | Resync status, cache hit ratio, congestion, latency | Yes |
| 5 | VMware Cloud Director | 5.x | VCD Cells, Organizations, vApps, Org VDCs | Cell health, Org resource consumption | No |
| 6 | VMware Horizon | 4.x | Connection Servers, Desktop Pools, Sessions | Session latency, pool utilization, protocol performance | No |
| 7 | VMware Tanzu | 2.x | TKG Clusters, Supervisor Namespaces, Pods, Nodes | Pod restart count, node resource usage, cluster health | No |
| 8 | VCF Automation | 4.x | Blueprints, Deployments, Catalog Items | Deployment success rate, provisioning time | No |
| 9 | AWS | 4.x | EC2, S3, RDS, Lambda, ELB, CloudWatch | Instance utilization, S3 bucket size, RDS connections | No |
| 10 | Azure | 4.x | VMs, Storage Accounts, SQL Database, App Services | VM performance, storage transactions, DTU usage | No |
| 11 | Google Cloud | 2.x | GCE Instances, GCS Buckets, BigQuery, Cloud SQL | Instance CPU, bucket object count, query slot utilization | No |
| 12 | Dell EMC | Varies | PowerStore, PowerScale, Unity, VMAX/PowerMax | Array latency, capacity, IOPS, throughput | No |
| 13 | NetApp ONTAP | 3.x | Clusters, SVMs, Volumes, Aggregates, LUNs | Volume latency, aggregate capacity, snapshot reserve | No |
| 14 | Pure Storage | 2.x | FlashArray, FlashBlade, Volumes | Array latency, capacity, data reduction ratio | No |
| 15 | HPE | 2.x | 3PAR/Primera, Nimble, Synergy, ProLiant | Array performance, blade health, enclosure power | No |
| 16 | Cisco UCS | 3.x | Fabric Interconnects, Blades, Rack Units, Service Profiles | Fabric uplink utilization, blade faults, power draw | No |
| 17 | OS: Windows | 8.x | Windows Servers (WMI-based) | CPU, Memory, Disk, Network, Services, Processes | No |
| 18 | OS: Linux | 8.x | Linux Servers (SSH-based) | CPU, Memory, Disk, Network, top processes | No |
| 19 | SNMP | 5.x | Generic SNMP-enabled devices (switches, routers, UPS) | Interface traffic, device uptime, OID-based custom metrics | No |
| 20 | Active Directory | 3.x | Domain Controllers, Sites, Replication | Replication latency, LDAP response time, DC availability | No |
| 21 | SQL Server | 4.x | SQL Instances, Databases, Always On Availability Groups | Query latency, buffer cache hit ratio, log growth | No |
| 22 | Oracle Database | 3.x | Oracle Instances, Tablespaces, ASM Disk Groups | Tablespace usage, session counts, wait events | No |
| 23 | Ping | 8.x | Any IP-reachable device | ICMP availability, round-trip latency, packet loss | No |
| 24 | Log Insight | 8.x | Operations for Logs integration | Log event counts, ingestion rate | Yes |
| 25 | Telegraf Agent | 8.x | Any system running Telegraf (push-based) | Custom metrics via Telegraf input plugins | No |
| 26 | Kubernetes | 2.x | Kubernetes Clusters, Namespaces, Nodes, Pods, Containers | Pod status, container resource usage, node conditions | No |
| 27 | Service Discovery | 8.x | Application dependency mapping | Service relationships, communication flows, port mappings | Yes |
For technologies not covered by existing management packs, VCF Operations includes a no-code development environment for building custom adapters.
Navigate to: Administration → Integrations → Management Pack Builder
Supported Input Methods:
| Input Type | Description | Use Case |
|---|---|---|
| REST API | Define endpoints, authentication, JSON path mappings | Custom web applications, SaaS platforms, IoT APIs |
| SNMP MIB | Import MIB files and map OIDs to metrics | Legacy network devices, industrial equipment |
| Script-Based | Python or PowerShell scripts that output metrics in a defined format | Internal tools, proprietary systems, complex collection logic |
Development Workflow:
Tip: Start with the REST API input type for most modern applications. Define a health-check endpoint first to validate connectivity, then expand to detailed metrics. Use the built-in Test button at each stage to validate collection before exporting.
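For the Script-Based input type, your script prints metrics in a structured form that you then map inside the builder. The following Python sketch is purely illustrative: the JSON shape (the `object` and `metrics` keys) and the object name `internal-app-01` are assumptions for the example, not a format mandated by Management Pack Builder.

```python
import json

def collect():
    """Gather metrics from an internal tool (stubbed here with static values)."""
    # In a real script this would query your proprietary system.
    return {
        "object": "internal-app-01",      # object the metrics belong to
        "metrics": {
            "queue_depth": 12,            # current work-queue length
            "worker_count": 4,            # active worker processes
            "error_rate_percent": 0.5,    # errors per 100 requests
        },
    }

if __name__ == "__main__":
    # Emit one JSON document per collection cycle for the adapter to parse.
    print(json.dumps(collect()))
```

Keeping the script's output machine-parseable (one JSON document per run) makes the field mapping in the builder straightforward.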
In addition to Broadcom-published management packs, several vendors produce and support their own packs for VCF Operations.
| Vendor | Management Pack | Monitored Technology | Key Capabilities |
|---|---|---|---|
| Dell Technologies | OpenManage for VCF Operations | PowerEdge server hardware via iDRAC | Hardware health (fans, PSUs, RAID), firmware inventory, warranty status, thermal monitoring |
| NVIDIA | vGPU Management Pack | NVIDIA vGPU-enabled hosts and VMs | GPU utilization %, GPU memory usage, temperature, encoder/decoder sessions, frame buffer |
| Rubrik | Rubrik Management Pack | Rubrik CDM and Polaris | Backup job success/failure rates, SLA compliance percentage, storage consumption trends, archive status |
| Zerto | Zerto Management Pack | Zerto Virtual Replication | VPG replication health, RPO status, journal size, failover test history, bandwidth consumption |
Note: Third-party management packs follow their own release cadence independent of VCF Operations versions. Always verify compatibility with your VCF Operations version before installing. Check the vendor's compatibility matrix or release notes.
Once VCF Operations is deployed and configured, ongoing maintenance ensures the platform remains healthy, performant, and current. This chapter covers the operational tasks that every VCF Operations administrator must master.
All VCF Operations appliance logs reside on the appliance filesystem. The following table identifies the critical log files, their paths, and their purposes.
| Log File | Path | Purpose |
|---|---|---|
| Analytics | `/storage/log/vcops/analytics.log` | Analytics engine processing — capacity calculations, forecasting, anomaly detection |
| Collector | `/storage/log/vcops/collector.log` | Data collection framework — adapter scheduling, metric ingestion |
| API / UI | `/storage/log/vcops/web/catalina.out` | Tomcat application server — REST API requests, UI errors |
| CASA | `/storage/log/vmware/casa/casa.log` | Cluster management — node join/leave, role assignment, slice configuration |
| GemFire | `/storage/log/vcops/gemfire/gemfire.log` | Distributed cache — inter-node data replication, partition management |
| vPostgres | `/storage/log/vmware/vpostgres/postgresql.log` | PostgreSQL database — query errors, connection issues, replication |
| Adapter (per-adapter) | `/storage/log/vcops/adapters/<adapter-name>/` | Individual adapter logs — collection errors, connectivity issues |
| VAMI | `/var/log/vmware/` | VMware Appliance Management Interface — appliance configuration changes |
| PAK Manager | `/storage/log/vcops/pakManager.log` | PAK file installation, upgrade, and management pack deployment |
| Suiteapi | `/storage/log/vcops/web/suiteapi.log` | Suite API specific request/response logging |
Tip: When troubleshooting, start with the most specific log. If a particular adapter is failing, check its log under `/storage/log/vcops/adapters/` first. Escalate to `collector.log` only if the adapter log does not reveal the issue.
Log files can grow substantially in active environments, particularly when debug logging is enabled. Use the following procedures to reclaim disk space safely.
Check current disk usage:
df -h /storage/log
du -sh /storage/log/vcops/*
Truncate an active log file (preserves file handle):
truncate -s 0 /storage/log/vcops/analytics.log
Remove old rotated log archives:
find /storage/log -name "*.gz" -mtime +30 -delete
find /storage/log -name "*.log.*" -mtime +30 -delete
Check for core dumps consuming space:
du -sh /storage/core/
# If core dumps are present and no longer needed:
rm -f /storage/core/core.*
Warning: Never use `rm` on active log files (e.g., `rm analytics.log`). The process holding the file descriptor will continue writing to the deleted inode, consuming disk space invisibly. Always use `truncate` to safely zero out an active log file while preserving the file handle.
Warning: If log growth is persistent, investigate the root cause (e.g., a failing adapter retrying every 5 seconds, debug logging left enabled). Truncating logs without addressing the cause is a temporary fix.
Backup Configuration:
Step 1. Navigate to Administration → Backup/Restore.
Step 2. Configure the backup destination:
- Destination path (e.g., nfs://fileserver.corp.local/backups/vcfops)

Step 3. Set the backup schedule:
| Setting | Recommendation |
|---|---|
| Frequency | Daily |
| Time | 02:00 (during low-activity window) |
| Retention | 7 backups (1 week of daily backups) |
Step 4. Select backup content:
| Option | Includes | Size Impact |
|---|---|---|
| All (Configuration + Data) | Cluster config, policies, dashboards, alerts, views, reports, custom groups, supermetrics, AND historical metric data | Large (potentially hundreds of GB) |
| Configuration Only | Everything except historical metric data | Small (typically < 5 GB) |
Step 5. Click Save to activate the schedule. For an immediate backup, click Backup Now.
Restore Procedure:
Important: You cannot restore a backup from a newer version to an older version. The target appliance must be the same version or newer than the backup source.
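The version rule can be checked mechanically before attempting a restore. A minimal sketch with hypothetical helper names (`parse_version`, `can_restore`); it assumes dotted numeric versions such as 8.18.2.

```python
def parse_version(v):
    """'8.18.2' -> (8, 18, 2) so versions compare correctly as tuples."""
    return tuple(int(part) for part in v.split("."))

def can_restore(backup_version, target_version):
    """Restore is allowed only onto the same version or a newer one."""
    return parse_version(target_version) >= parse_version(backup_version)
```

Tuple comparison avoids the classic string-comparison trap where "8.9" would sort after "8.18".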
VCF Operations generates self-signed internal certificates during deployment. For production environments, replace these with certificates signed by your enterprise Certificate Authority (CA).
Current Certificate Status: Navigate to Administration → Certificates to view the current certificate details, including issuer, subject, expiration date, and thumbprint.
Supported Formats:
| Format | Description |
|---|---|
| PEM | Base64-encoded certificate and private key in separate files (.pem, .crt, .key) |
| PFX / PKCS12 | Binary format containing certificate chain and private key in a single file (.pfx, .p12) |
Steps to Replace the Certificate:
Step 1. Generate a Certificate Signing Request (CSR) from VCF Operations, or prepare a PEM certificate chain externally.
Step 2. Upload the signed certificate and private key:
Step 3. Click Apply. VCF Operations validates the certificate chain, verifies the private key matches, and restarts services automatically. Expect 5–10 minutes of downtime during the service restart.
Warning: Ensure the certificate's Subject Alternative Names (SANs) include the FQDN of every node in the cluster and the cluster VIP (if using HA/CA). Missing SANs will cause inter-node communication failures.
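Before applying a new certificate, you can verify its SAN list against the cluster inventory offline. A minimal sketch, assuming you have already extracted the SAN entries (for example with `openssl x509 -noout -text`); wildcard entries are matched one label deep only, mirroring standard TLS matching.

```python
def covers(san_entries, fqdn):
    """True if fqdn matches a SAN entry exactly or via a one-label wildcard."""
    fqdn = fqdn.lower()
    for entry in (e.lower() for e in san_entries):
        if entry == fqdn:
            return True
        # '*.corp.local' matches 'node1.corp.local' but not 'a.b.corp.local'
        if (entry.startswith("*.")
                and fqdn.split(".", 1)[-1] == entry[2:]
                and fqdn.count(".") == entry.count(".")):
            return True
    return False

def missing_sans(san_entries, cluster_fqdns):
    """Return every cluster FQDN (nodes plus VIP) the certificate fails to cover."""
    return [f for f in cluster_fqdns if not covers(san_entries, f)]
```

Feed `missing_sans` the FQDN of every node plus the cluster VIP; an empty result means the certificate is safe to apply.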
Regular password rotation is a security best practice and may be required by organizational policy.
Via CLI (SSH to the VCF Operations appliance):
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
password change --user admin
You will be prompted for the current password and the new password.
Via VCF Fleet Manager (SDDC Manager integration):
Rotation Schedule Recommendations:
| Account | Recommended Interval | Notes |
|---|---|---|
| admin (UI) | Every 90 days | Primary administrative account |
| root (SSH) | Every 90 days | Appliance OS-level access |
| maintenanceAdmin | Every 90 days | Used for cluster maintenance operations |
| Adapter credentials | Every 90 days or per policy | Service accounts connecting to vCenter, NSX, etc. |
Important: After rotating adapter credentials (e.g., the vCenter service account password), update the corresponding credential in Administration → Integrations → Accounts → Edit Credential. Failure to do so will cause data collection to stop.
VCF Operations upgrades are delivered as PAK files and follow a rolling upgrade process that minimizes downtime in HA and CA deployments.
Phase 1 — Pre-Upgrade Checklist:
| Task | Command / Location | Purpose |
|---|---|---|
| Verify current version | Administration → Cluster Management | Confirm starting version |
| Take full backup | Administration → Backup/Restore → Backup Now | Rollback safety net |
| Check compatibility matrix | Broadcom compatibility guide | Ensure management packs are compatible with target version |
| Download upgrade PAK | Broadcom support portal | Obtain the upgrade binary |
| Snapshot all nodes | vCenter → Right-click VM → Snapshots → Take Snapshot | Quick rollback mechanism |
| Verify NTP sync | `ntpq -p` on each node | Prevent time-skew issues during upgrade |
| Check disk space | `df -h` on each node | Ensure `/storage` has > 20% free |
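The disk-space row of the checklist can be automated. An illustrative standard-library helper, not a Broadcom tool; run it on each node and adjust the path and threshold to your environment.

```python
import shutil

def free_fraction(path):
    """Fraction of the filesystem at `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def preflight_disk(path="/storage", minimum=0.20):
    """True if the partition keeps at least `minimum` free, per the checklist."""
    return free_fraction(path) >= minimum
```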
Phase 2 — Upgrade Execution:
Phase 3 — Post-Upgrade Validation:
| Task | How to Verify |
|---|---|
| Cluster status is "Online" | Administration → Cluster Management |
| All nodes show new version | Administration → Cluster Management → Node details |
| Data collection is active | Environment → select any object → verify recent metrics |
| Management packs are functional | Administration → Integrations → Accounts → check status icons |
| Dashboards load correctly | Navigate to several dashboards and verify data |
| Remove VM snapshots | vCenter → Right-click VM → Snapshots → Delete All |
Warning: Do not delete VM snapshots until you have fully validated the upgrade. Snapshots provide the fastest rollback path if issues are discovered. However, do not keep snapshots longer than 72 hours, as they degrade VM performance.
As monitored environments grow, VCF Operations may require additional resources.
Vertical Scaling (Scale Up):
Increase the vCPU and memory allocated to existing nodes.
| OVA Size | vCPU | Memory (GB) | Objects Supported |
|---|---|---|---|
| Small | 4 | 16 | Up to 1,500 |
| Medium | 8 | 32 | Up to 5,000 |
| Large | 16 | 48 | Up to 15,000 |
| Extra Large | 24 | 128 | Up to 30,000 |
To change the size: power off the node, adjust CPU/RAM in vCenter, power on. The analytics engine automatically detects the new resources.
Horizontal Scaling (Scale Out):
Add Data Nodes to distribute the analytics workload across more compute.
Guideline: Add one Data Node for every 10,000 additional objects beyond the primary node's capacity. For environments exceeding 50,000 objects, engage Broadcom Professional Services for architecture review.
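The guideline above reduces to a ceiling calculation. An illustrative helper; the 15,000-object default for the primary node is taken from the Large OVA row in the sizing table and should be adjusted to your actual node size.

```python
import math

def data_nodes_needed(total_objects, primary_capacity=15000, per_node=10000):
    """Data nodes required beyond the primary, per the 1-per-10,000 guideline.

    primary_capacity defaults to the Large OVA's 15,000-object rating;
    change it to match the node size you actually deployed.
    """
    extra = max(0, total_objects - primary_capacity)
    return math.ceil(extra / per_node)
```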
When engaging Broadcom Global Support Services (GSS), a support bundle is typically required.
Via the UI:
Via the CLI (SSH):
/usr/lib/vmware-vcops/support/vrops-support.sh
The script collects logs, configuration files, cluster state, and diagnostic information into a ZIP file located at /storage/log/vcops/support/.
Support Bundle Contents:
| Category | Included Items |
|---|---|
| Logs | All log files from Section 20.1 |
| Configuration | Cluster config, slice configuration, property files |
| Cluster State | Node roles, service status, GemFire partition info |
| System Info | OS version, disk usage, memory usage, process list |
| Thread Dumps | Java thread dumps for analytics and collector services |
Tip: For targeted troubleshooting, you can generate a "lightweight" bundle by specifying only the relevant log categories. This reduces generation time and file size, which speeds up upload to the support ticket.
The following sections document the most frequently encountered issues, their root causes, and step-by-step resolutions.
Symptom: The cluster status on the Administration → Cluster Management page shows "Going Online" for more than 30 minutes without progressing.
Root Cause: The analytics service is failing to start, typically due to a GemFire distributed cache partition conflict or corrupted analytics state.
Resolution:
# Step 1: Check current service status
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status
# Step 2: Restart the analytics service
service vmware-vcops-analytics restart
# Step 3: Monitor the analytics log for errors
tail -f /storage/log/vcops/analytics.log
If the restart does not resolve the issue, check for GemFire partition conflicts:
grep -i "partition" /storage/log/vcops/gemfire/gemfire.log | tail -20
If partition errors are present, a full cluster restart may be required:
service vmware-vcops stop
# Wait 5 minutes for all services to fully terminate
service vmware-vcops start
Symptom: After requesting the cluster to go offline, the Admin UI becomes unresponsive and the cluster never reaches "Offline" state.
Root Cause: A hung analytics or vPostgres process is preventing graceful shutdown.
Resolution:
# Step 1: Force stop all services
service vmware-vcops stop
# Step 2: Verify all Java processes have terminated
ps aux | grep java
# Step 3: If processes remain, wait 5 minutes then check again
# Do NOT use kill -9 unless absolutely necessary
# Step 4: Start services
service vmware-vcops start
Symptom: Dashboard widgets display "Waiting for Analytics" instead of data. The message persists beyond the normal startup window (15 minutes).
Root Cause: The analytics engine has either crashed, is processing a large backlog, or has encountered an out-of-memory condition.
Resolution:
Check the analytics service status:
service vmware-vcops-analytics status
If the service is stopped, check the log for the cause:
tail -100 /storage/log/vcops/analytics.log
Look for OutOfMemoryError or StackOverflowError in the log. If found, the node likely needs more memory (see Section 20.7 on vertical scaling).
Restart the analytics service:
service vmware-vcops-analytics restart
Symptom: An alert fires indicating that the FSDB (File System Database) partition is running low on disk space. The /storage/db partition is at or above 85% utilization.
Root Cause: Historical metric data has filled the /storage/db partition. This occurs when retention is set too high for the available disk, or when a large number of new objects were added without corresponding disk expansion.
Resolution (in order of preference):
Reduce data retention:
| Data Type | Default | Minimum Recommended |
|---|---|---|
| Real-time (5-min) | 1 day | 1 day |
| Hourly rollup | 30 days | 15 days |
| Daily rollup | 6 months | 3 months |
| Monthly rollup | 13 months | 6 months |
Expand the /storage/db disk:
- In vCenter, add a new virtual disk to the node VM
- Log in to the node's VAMI (https://<node-fqdn>:5480)
- Expand the /storage/db partition to include the new disk

Remove unused management packs:
Symptom: Metric charts show gaps, dashboards display stale data, or the "Last Collection" timestamp for adapters is more than 10 minutes old.
Root Cause: Multiple potential causes — adapter overload, network latency to the target system, expired or invalid credentials, or insufficient collector resources.
Resolution:
Check adapter status:
Check adapter logs:
tail -200 /storage/log/vcops/adapters/<adapter-name>/<adapter-name>.log
Verify credentials: Edit the adapter account and click Validate Connection
Check collector resource usage:
top -bn1 | head -20
free -h
For geographically distant targets: Deploy a Remote Collector at the remote site to reduce collection latency
Symptom: Services fail to start. SSH access still works but commands may produce "No space left on device" errors.
Root Cause: Core dumps, temporary files, or unexpected log files have filled the root (/) partition.
Resolution:
# Step 1: Identify the largest consumers
du -sh /* | sort -rh | head
# Step 2: Common culprits — check and clean
du -sh /storage/core/
rm -f /storage/core/core.*
du -sh /tmp/
# Remove old temp files (careful — do not remove active temp files)
find /tmp -type f -mtime +7 -delete
# Step 3: Check for unexpected log files in /var/log
du -sh /var/log/*
Warning: If the root partition is completely full (100%), services cannot write PID files or temp files and will refuse to start. In extreme cases, you may need to boot from a rescue ISO to clear space.
VCF Operations provides several command-line tools for administration and troubleshooting.
| Tool | Command | Purpose | Common Usage |
|---|---|---|---|
| vrops-status | `vrops-status` | Quick cluster health check | Verify all services are running, check node roles |
| OPS-CLI | `$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py` | Full CLI management | Adapter management, metric queries, object searches, password changes |
| Slice Configuration | `/usr/lib/vmware-vcops/support/sliceConfiguration.sh` | Cluster slice management | Check slice status, force slice rebalancing |
| Support Script | `/usr/lib/vmware-vcops/support/vrops-support.sh` | Support bundle generation | Generate log bundles for Broadcom support |
| Service Control | `service vmware-vcops {start\|stop\|restart\|status}` | Service management | Start, stop, or restart the entire VCF Operations stack |
| Platform CLI | `vcops-cli` | Platform-level operations | License management, node management |
OPS-CLI Examples:
# List all adapter instances
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
adapter list
# Search for an object by name
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
object search --name "web-server-01"
# Query a metric for a specific object
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
metric query --objectId <uuid> --metricKey "cpu|usage_average"
Warning: Direct Cassandra access via `cqlsh localhost 9042` is available for advanced troubleshooting but is unsupported by Broadcom. Modifying data in Cassandra directly can corrupt the FSDB and render the cluster inoperable. Use only under explicit guidance from Broadcom support.
The Suite API is the RESTful interface for programmatic access to all VCF Operations functionality. It enables integration with ITSM tools, custom portals, automation pipelines, and third-party systems.
Authentication:
# Acquire a token
curl -k -X POST \
"https://<vrops-fqdn>/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-d '{"username":"admin","authSource":"local","password":"<password>"}'
The response returns a JSON object containing the token field. Use this token in subsequent requests.
Token Usage:
Include the token in the Authorization header for all API calls:
Authorization: vRealizeOpsToken <token>
Tokens expire after 6 hours by default. Acquire a new token when the current one expires.
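For long-running automation, track token age and re-acquire shortly before the six-hour expiry. A minimal sketch; `TokenCache` and its injected `acquire_fn` (whatever function performs the POST to `/auth/token/acquire` in your tooling) are hypothetical names, not part of the Suite API itself.

```python
import time

TOKEN_LIFETIME_S = 6 * 3600  # default Suite API token lifetime

class TokenCache:
    """Caches a Suite API token and refreshes it shortly before expiry."""

    def __init__(self, acquire_fn, margin_s=300, clock=time.time):
        self._acquire = acquire_fn   # callable returning a fresh token string
        self._margin = margin_s      # refresh this many seconds early
        self._clock = clock          # injectable for testing
        self._token = None
        self._issued_at = 0.0

    def token(self):
        age = self._clock() - self._issued_at
        if self._token is None or age >= TOKEN_LIFETIME_S - self._margin:
            self._token = self._acquire()
            self._issued_at = self._clock()
        return self._token

    def auth_header(self):
        """Header dict for subsequent Suite API calls."""
        return {"Authorization": "vRealizeOpsToken " + self.token()}
```

The injected clock makes the expiry logic testable without waiting six hours.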
Base URL:
https://<vrops-fqdn>/suite-api/api/
Key Endpoint Categories:
| Category | Base Path | Operations |
|---|---|---|
| Resources | `/resources` | List, search, create, delete objects; query relationships |
| Alerts | `/alerts` | List, query, update, cancel alerts |
| Symptoms | `/symptoms` | List, create, delete symptom definitions |
| Supermetrics | `/supermetrics` | List, create, update, delete supermetric formulas |
| Policies | `/policies` | List, create, apply, export, import policies |
| Adapters | `/adapters` | List adapter kinds, instances; start/stop collection |
| Credentials | `/credentials` | List, create, update, delete credential instances |
| Reports | `/reports` | List templates, generate reports, download results |
| Dashboards | `/dashboards` | List, import, export, share dashboards |
| Auth | `/auth` | Token acquisition, token release, user management |
| Collector Groups | `/collectorgroups` | List, create, assign collectors |
| Custom Groups | `/customgroups` | List, create, update, delete custom groups |
| Metric Keys | `/resources/{id}/stats` | Query metric data for specific resources |
Interactive API Documentation:
VCF Operations ships with embedded Swagger UI documentation:
https://<vrops-fqdn>/suite-api/doc/swagger-ui.html
The Swagger UI provides a complete, interactive reference for all API endpoints, including request/response schemas, parameter descriptions, and the ability to execute API calls directly from the browser.
Common API Workflow Example — Export All Critical Alerts:
# Step 1: Acquire token
TOKEN=$(curl -sk -X POST \
"https://vrops.corp.local/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-d '{"username":"admin","authSource":"local","password":"P@ssw0rd"}' \
| python -c "import sys,json; print(json.load(sys.stdin)['token'])")
# Step 2: Query critical alerts
curl -sk \
"https://vrops.corp.local/suite-api/api/alerts?status=ACTIVE&criticality=CRITICAL" \
-H "Authorization: vRealizeOpsToken $TOKEN" \
-H "Accept: application/json" | python -m json.tool
# Step 3: Release token when done
curl -sk -X POST \
"https://vrops.corp.local/suite-api/api/auth/token/release" \
-H "Authorization: vRealizeOpsToken $TOKEN"
Best Practice: Always release tokens when your automation workflow completes. Each VCF Operations instance supports a limited number of concurrent API sessions. Unreleased tokens count against this limit until they expire naturally.
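The acquire/use/release pattern maps naturally onto a context manager, which guarantees the release call even if the workflow fails midway. A sketch with the HTTP transport injected as a callable so it can be exercised offline; `SuiteApiSession` is a hypothetical wrapper, not a Broadcom SDK class, and a real transport would issue the HTTPS requests shown in the curl example above.

```python
class SuiteApiSession:
    """Acquire a Suite API token on entry, release it on exit — even on error."""

    def __init__(self, transport, base_url, username, password):
        self._transport = transport  # callable(method, url, body, headers) -> dict
        self._base = base_url.rstrip("/")
        self._creds = {"username": username, "authSource": "local",
                       "password": password}
        self.token = None

    def __enter__(self):
        resp = self._transport("POST", self._base + "/auth/token/acquire",
                               self._creds, {})
        self.token = resp["token"]
        return self

    def __exit__(self, exc_type, exc, tb):
        self._transport("POST", self._base + "/auth/token/release", None,
                        {"Authorization": "vRealizeOpsToken " + self.token})
        return False  # propagate any exception from the with-block

    def get(self, path):
        return self._transport("GET", self._base + path, None,
                               {"Authorization": "vRealizeOpsToken " + self.token})
```

Because release happens in `__exit__`, abandoned sessions no longer count against the concurrent-session limit.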
VCF Operations for Logs has undergone several name changes since its inception, reflecting broader shifts in VMware's product portfolio and, ultimately, the Broadcom acquisition. Understanding the naming timeline is essential when referencing older documentation, knowledge-base articles, and community posts.
| Year | Product Name | Context |
|---|---|---|
| 2013 | VMware vCenter Log Insight | Original release; tightly associated with vCenter Server |
| 2016 | vRealize Log Insight | Rebranded under the vRealize management suite umbrella |
| 2022 | VMware Aria Operations for Logs | Part of the VMware Aria brand unification across all management products |
| 2024 | VCF Operations for Logs | Broadcom acquisition; product folded into the VCF (VMware Cloud Foundation) brand |
Note: Many CLI tools, OVA filenames, internal service names, and API endpoints still reference `loginsight` or `vrli`. Do not be alarmed when you encounter these legacy identifiers — they are functionally equivalent to the current product.
Throughout this handbook, the terms Operations for Logs, OpsForLogs, and the abbreviation vRLI may be used interchangeably where historical context or brevity demands it.
VCF Operations (metrics) and VCF Operations for Logs (logs) are complementary products. They are deployed separately, serve distinct analytical purposes, and store fundamentally different data types. The following table summarizes the key differences.
| Aspect | VCF Operations | VCF Operations for Logs |
|---|---|---|
| Data Type | Metrics, properties, super metrics, events | Log messages (syslog, agent-collected file logs) |
| Analysis Model | Time-series statistical analysis, machine-learning anomaly detection, capacity modeling | Full-text search, pattern matching, ML-based intelligent grouping |
| Alerting | Threshold-based — triggers when a metric value crosses a defined boundary | Pattern-based — triggers when a log message matches a content rule or frequency condition |
| Storage Engine | FSDB (proprietary time-series database) | Apache Cassandra + proprietary full-text index |
| Primary Use Cases | Performance monitoring, capacity planning, cost analysis, what-if modeling | Troubleshooting, root-cause analysis, audit trail, compliance reporting |
| Retention Model | Configurable retention policies (weeks to months of metric data) | Index partitions managed by time-based buckets (days to months of log data) |
| Integration Direction | Launches-in-context to Operations for Logs for correlated log investigation | Launches-in-context to Operations for metric correlation |
Best Practice: Deploy both products and configure the bidirectional integration between them. When Operations detects an anomaly on an object, the administrator can pivot directly into Operations for Logs to examine the logs from that object during the anomaly window — dramatically reducing mean time to resolution.
Operations for Logs follows a scale-out clustered architecture built on the following components:
Cluster Specifications:
Data Flow:
Log Sources (vCenter, ESXi, NSX, Agents, Syslog Devices)
                         │
                         ▼
                 VIP Address (ILB)
                         │
                         ▼
                   ┌─────┴─────┐
                   │  Primary  │──── Worker 1 ──── Worker 2 ──── Worker N
                   │   Node    │
                   └───────────┘
                         │
                         ▼
Ingestion Pipeline → Parsing → Field Extraction → Indexing
                         │
                         ▼
Cassandra Index + Full-Text Index (per-node storage)
                         │
                         ▼
Query Engine (distributed, merges results from all nodes)
Select the cluster size based on expected daily ingestion volume and query concurrency requirements.
| Cluster Size | Nodes | Estimated Ingestion Rate | Typical Use Case |
|---|---|---|---|
| Small | 1 (standalone) | ~15 GB/day | Lab, proof-of-concept, developer environments |
| Medium | 3 | ~45 GB/day | Small production (single VCF instance, <200 VMs) |
| Large | 6 | ~90 GB/day | Medium production (multi-cluster, 200–1,000 VMs) |
| Extra Large | 12+ | ~180+ GB/day | Large enterprise (multi-site, 1,000+ VMs, compliance-heavy) |
Warning: These figures assume default field extraction and content packs. Heavy use of custom regex extraction, large numbers of active alerts, or complex dashboards with many concurrent users will reduce effective ingestion capacity. Always monitor the Ingestion Rate and Query Latency dashboards after deployment and add worker nodes proactively if ingestion approaches 80% of rated capacity.
Tip: For VCF environments, a 3-node medium cluster is the recommended starting point for production. This provides both high availability (the cluster tolerates the loss of one node) and sufficient headroom for growth.
The Operations for Logs OVA offers three deployment sizes. Select the size at deployment time — it cannot be changed later without redeployment.
| Size | vCPUs | Memory (GB) | Disk (GB) | Estimated Ingestion Rate |
|---|---|---|---|---|
| Small | 4 | 8 | 530 | ~15 GB/day |
| Medium | 8 | 16 | 1,060 | ~30 GB/day |
| Large | 16 | 32 | 2,080 | ~45 GB/day |
Important: Disk sizes listed are total, including OS, application, and log index storage. The index partition consumes the majority of disk space. When planning retention, remember that longer retention windows require proportionally more disk. If the built-in disk is insufficient, you can attach additional VMDK volumes post-deployment and configure them as additional storage partitions.
Recommendation: For production deployments, always select Medium or Large. The Small size is appropriate only for labs and proof-of-concept environments.
Deploy the Operations for Logs OVA through the vSphere Client using the following procedure.
Prerequisites:
- The Operations for Logs OVA file downloaded from the Broadcom support portal (typically named VMware-vRealize-Log-Insight-*.ova or the renamed VCF equivalent)

Procedure:
Step 1. In the vSphere Client, right-click the target cluster and select Deploy OVF Template, then browse to the downloaded OVA.
Step 2. Enter a VM name (e.g., vrli-primary-01) and select the target inventory folder. Click Next.
Step 3. Select the compute resource and datastore, and choose the deployment size.
Step 4. Select the destination port group for the management network.
Step 5. On the Customize Template page, enter the hostname (e.g., vrli-primary-01.lab.local), IP address, netmask (e.g., 255.255.255.0), gateway, DNS servers, and DNS search domain (e.g., lab.local).
Step 6. Review the summary and click Finish. Power on the VM when deployment completes.

Warning: Do not snapshot the VM during initial boot. Allow all services to fully start before taking the first snapshot.
After the first boot completes, access the web-based configuration wizard to finalize setup.
https://<node-fqdn> (or https://<ip-address>).Step 1 — Admin Password:
Set the password for the built-in admin user account.

Step 2 — License Key:
Step 3 — General Configuration:
Step 4 — CEIP:
Step 5 — NTP Configuration:
Critical: Accurate time synchronization is essential for log correlation. Ensure all Operations for Logs nodes, vCenter servers, and ESXi hosts share the same NTP source.
Step 6 — SMTP Configuration:
Configure the SMTP server for outbound email alerts and set the sender address (e.g., vrli-alerts@lab.local).

Step 7 — SSL Certificate:
Step 8 — Finish:
Click Finish to complete the wizard, then log in with the username admin and the password set in Step 1.

A single standalone node is suitable for labs, but production environments require a cluster of at least three nodes for high availability, ingestion scaling, and query performance.
Deploy each worker node from the same OVA, then browse to https://<worker-fqdn>. In the setup wizard, choose Join Existing Deployment and enter the primary node's FQDN, then approve the pending join request on the primary node. Repeat for each additional worker.

Note: Worker nodes do not require independent license keys. The license is managed centrally on the primary node and applies cluster-wide.
After adding worker nodes, configure the Integrated Load Balancer and Virtual IP to provide a single entry point for all clients.
Step 1. Log in to the primary node at https://<primary-fqdn>.
Step 2. Open the cluster configuration page and add a new Virtual IP, providing its FQDN and IP address.
Step 3. Verify by browsing to https://<vip-fqdn> — the Operations for Logs UI should load.
Step 4. Update all log sources (syslog destinations and agent liagent.ini files) to point to the VIP address instead of the primary node address.

Warning: If you do not configure a VIP, log sources pointing to the primary node will not benefit from load balancing, and the primary node becomes a single point of failure for ingestion.
In VCF 9.0 and later, Operations for Logs can be deployed through SDDC Manager's Fleet Management capability, which automates the entire lifecycle.
Tip: Fleet Manager also handles future upgrades, certificate rotation, and backup scheduling for Operations for Logs, reducing ongoing administrative overhead.
Operations for Logs listens on several ports for log ingestion. The following table summarizes the default ports, protocols, and their intended use.
| Port | Protocol | Transport | Use Case | Notes |
|---|---|---|---|---|
| 514 | Syslog | TCP | General syslog ingestion | Unencrypted; most common for internal networks |
| 514 | Syslog | UDP | General syslog ingestion | Unencrypted; no delivery guarantee; not recommended for production |
| 6514 | Syslog | TCP + TLS | Secure syslog ingestion | Requires TLS certificate configuration on both sender and receiver |
| 1514 | Syslog | TCP + SSL | ESXi host log forwarding | Automatically configured when vSphere integration is enabled |
| 9000 | CFAPI | HTTP | Agent-based ingestion | VMware Log Insight agent protocol; unencrypted |
| 9543 | CFAPI | HTTPS | Secure agent-based ingestion | VMware Log Insight agent protocol; certificate-secured |
Best Practice: Use TCP-based protocols (514/TCP, 6514/TCP, 9543/TCP) for all production log sources. UDP-based syslog (514/UDP) does not guarantee delivery and can silently drop messages under load. For compliance-sensitive environments, use TLS-encrypted ports (6514/TCP for syslog, 9543/TCP for agents).
You can verify which ports are actively listening by navigating to Administration → Configuration → Ports in the Operations for Logs UI, or by running the following on the appliance:
netstat -tlnp | grep -E '514|9000|9543'
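From a log source's side, the same ports can be probed before cutting traffic over. The following is a small sketch using bash's /dev/tcp; the VIP hostname is a placeholder for your environment.

```shell
# Probe whether a TCP ingestion port answers on the cluster VIP.
# vrli-vip.lab.local is a placeholder; substitute your own VIP FQDN.
port_open() {
  # /dev/tcp is a bash feature, so invoke bash explicitly.
  timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null && echo open || echo closed
}
for p in 514 6514 9543; do
  echo "$p: $(port_open vrli-vip.lab.local "$p")"
done
```

A "closed" result for a port that netstat shows as listening on the appliance usually points at a firewall between the source and the cluster.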
vCenter Server generates critical logs including vpxd, vpxd-svcs, vmware-sps, vmafdd, and many others. Forwarding these to Operations for Logs provides centralized visibility into vCenter operations.
1. Open the vCenter Server Management Interface (VAMI) at https://<vcenter-fqdn>:5480.
2. Log in with the root account.
3. Add a syslog forwarding target of <vrli-vip-fqdn> (the VIP address of your Operations for Logs cluster) on port 514.
4. In Operations for Logs, verify ingestion with the query source = <vcenter-fqdn>. Logs should appear within 1–2 minutes.

If VAMI access is unavailable, configure syslog forwarding from the vCenter shell:
# SSH to vCenter as root
# List current syslog configuration
/usr/lib/vmware-syslog/bin/get-rsyslog-config.sh
# After setting the remote syslog target, restart the rsyslog service
/usr/lib/vmware-vmon/vmon-cli --restart rsyslog
Note: In vCenter 8.x and later, syslog configuration is managed through the VAMI. CLI-based configuration methods vary between versions. Always consult the release-specific documentation.
ESXi hosts produce some of the most valuable logs in a VMware environment — vmkernel, hostd, vpxa, fdm, and vobd among others. There are three methods to configure ESXi syslog forwarding.
This is the simplest method and ensures all hosts managed by a vCenter are automatically configured.
Tip: The vSphere integration also pulls ESXi events and tasks, enabling richer correlation between log messages and vCenter-reported events.
Use this method when vSphere integration is not desired or when configuring individual hosts outside of vCenter management.
# SSH to the ESXi host
esxcli system syslog config set --loghost=tcp://<vrli-vip>:514
esxcli system syslog reload
# Verify the configuration
esxcli system syslog config get
The --loghost parameter supports multiple targets separated by commas:
esxcli system syslog config set --loghost=tcp://vrli-vip.lab.local:514,tcp://backup-syslog.lab.local:514
Important: If the ESXi firewall is enabled, ensure the syslog firewall rule is open:

esxcli network firewall ruleset set -r syslog -e true
esxcli network firewall refresh
For large environments, use PowerCLI to configure all hosts at once:
# Connect to vCenter
Connect-VIServer -Server vcenter.lab.local
# Set syslog target for all hosts
$logHost = "tcp://vrli-vip.lab.local:514"
Get-VMHost | ForEach-Object {
Write-Host "Configuring syslog on $($_.Name)..."
Set-VMHostSysLogServer -VMHost $_ -SysLogServer $logHost
$esxcli = Get-EsxCli -VMHost $_ -V2
$esxcli.system.syslog.reload.Invoke()
}
# Verify configuration
Get-VMHost | ForEach-Object {
$esxcli = Get-EsxCli -VMHost $_ -V2
$config = $esxcli.system.syslog.config.get.Invoke()
Write-Host "$($_.Name): $($config.RemoteHost)"
}
Warning: When using both the vSphere integration (Method 1) and manual configuration (Method 2 or 3) simultaneously, you may receive duplicate log entries. Choose one method and apply it consistently.
NSX Manager and NSX Edge nodes generate logs critical for network troubleshooting, security event analysis, and compliance auditing. Configure log forwarding from the NSX Manager UI.
Procedure:
1. Log in to the NSX Manager UI (https://<nsx-manager-fqdn>).
2. Set the syslog server to <vrli-vip-fqdn>.
3. Set the port to 514 (for unencrypted TCP) or 6514 (for TLS).
4. Set the protocol to TCP or LI-TLS (Log Insight TLS).
5. Set the log level to INFO (captures Info, Warning, Error, Critical, and Emergency).

NSX Edge Nodes:
In some NSX deployments, Edge transport nodes may require separate syslog configuration:
Note: NSX Distributed Firewall (DFW) logs are generated on the ESXi hosts where the DFW rules are enforced. These logs are forwarded via the ESXi syslog configuration (Section 23.3), not via the NSX Manager syslog configuration.
Verification:
In Operations for Logs, search for:
appname = "nsxmanager" OR appname = "nsx-edge"
NSX logs should appear within 1–2 minutes of configuration.
The Operations for Logs agent (also known as the Log Insight agent or liagent) is a lightweight process that collects log files from Windows and Linux operating systems and forwards them to Operations for Logs via the CFAPI protocol.
1. Run the Windows installer (VMware-Log-Insight-Agent-*.msi).
2. When prompted, enter the target server: <vrli-vip-fqdn>.
3. Choose the ingestion port: 9543 (HTTPS) or 9000 (HTTP).
4. Complete the wizard. The service (VMware Log Insight Agent) starts automatically and begins forwarding Windows Event Logs.

For automated deployments via SCCM, GPO, or scripting:
msiexec /i VMware-Log-Insight-Agent-x64.msi /qn ^
SERVERHOST=vrli-vip.lab.local ^
SERVERPROTOCOL=cfapi ^
SERVERPORT=9543 ^
/l*v C:\temp\liagent-install.log
Tip: Add SERVICEACCOUNT=domain\svcaccount and SERVICEPASSWORD=P@ssw0rd parameters if the agent service needs to run under a domain account to access specific log file paths.
# Copy the RPM to the target server
sudo rpm -i VMware-Log-Insight-Agent-*.rpm
# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN
# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent
# Verify the agent is running
sudo systemctl status liagent
# Copy the DEB package to the target server
sudo dpkg -i VMware-Log-Insight-Agent-*.deb
# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN
# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent
# Verify the agent is running
sudo systemctl status liagent
Important: The agent collects /var/log/messages and /var/log/syslog by default. Additional log directories must be configured explicitly in liagent.ini (see Section 23.6).
liagent.ini)The agent configuration file liagent.ini controls all aspects of agent behavior — server connectivity, log file collection, field tagging, and debug settings. The file is located at:
- Linux: /var/lib/loginsight-agent/liagent.ini
- Windows: C:\ProgramData\VMware\Log Insight Agent\liagent.ini

; ─── Server Connection ───
[server]
hostname=vrli-vip.lab.local
port=9543
proto=cfapi
ssl=yes
ssl_accept_any=yes ; Lab only; in production set to no and configure ssl_ca_path
; ssl_ca_path=/etc/pki/tls/certs/ca-bundle.crt ; Use for strict CA validation
; ─── Default Syslog Collection (Linux) ───
[filelog|syslog]
directory=/var/log
include=*.log;messages;syslog
; ─── Custom Application Logs ───
[filelog|custom_app]
directory=/opt/myapp/logs
include=*.log
exclude=debug-*.log
parser=auto
tags={"appname":"myapp","env":"production","tier":"web"}
; ─── Apache Access Logs ───
[filelog|apache_access]
directory=/var/log/httpd
include=access_log*
parser=clf
; ─── Windows Event Log (Windows only) ───
[winlog|application]
channel=Application
[winlog|system]
channel=System
[winlog|security]
channel=Security
; ─── Agent Logging ───
[logging]
debug_level=0
; 0=Off, 1=Error, 2=Warning, 3=Info, 4=Debug
; Set to 4 only for troubleshooting; generates significant local log volume
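When auditing a deployed agent, the effective [server] settings can be read back from the file. A small read-only helper sketch (the path in the usage comment is the default Linux location):

```shell
# Print the connection-related keys from the [server] section of a
# liagent.ini file. Pass the file path as the first argument.
liagent_server_config() {
  # sed limits output to the [server] section; grep keeps connection keys.
  sed -n '/^\[server\]/,/^\[/p' "$1" | \
    grep -E '^(hostname|port|proto|ssl|ssl_accept_any)='
}
# Usage (Linux default path):
#   liagent_server_config /var/lib/loginsight-agent/liagent.ini
```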
| Parameter | Description | Default |
|---|---|---|
| hostname | Operations for Logs VIP FQDN or IP | (required) |
| port | Ingestion port | 9543 |
| proto | Protocol (cfapi or syslog) | cfapi |
| ssl | Enable SSL/TLS | yes |
| ssl_accept_any | Accept any server certificate (lab only) | no |
| directory | Log file directory to monitor | (per section) |
| include | Semicolon-separated file patterns to collect | *.log |
| exclude | Semicolon-separated file patterns to skip | (none) |
| tags | JSON key-value pairs attached to every log entry from this section | {} |
| parser | Log parsing mode (auto, clf, csv, or custom regex) | auto |
Instead of editing liagent.ini on every machine, you can push agent configurations centrally from the Operations for Logs UI:
Add the desired [filelog|...] and [winlog|...] sections to the group configuration.

Best Practice: Use Agent Groups for all production agent configuration. This ensures consistency, simplifies changes, and provides a single pane of glass for agent management.
Operations for Logs ships with two content packs installed by default:
- General Syslog: base field extraction (source, appname, facility, severity, text) and a default overview dashboard. This pack handles standard RFC 3164 and RFC 5424 syslog messages.
- VMware vSphere: vSphere-specific extracted fields (vmw_host, vmw_vc_vm_name, vmw_esxi_service, vmw_vc_event_type, etc.)

Note: The vSphere content pack is automatically activated when the vSphere integration is configured (Section 23.3, Method 1). Its extracted fields enable rich, structured queries against ESXi and vCenter logs.
Additional content packs are available from the in-product Marketplace and from the Broadcom download portal. The following table lists commonly used packs.
| Content Pack | Source | Key Features |
|---|---|---|
| VMware NSX | VMware/Broadcom | NSX Manager and Edge log parsing; security event dashboards; DFW rule hit analysis |
| VMware vSAN | VMware/Broadcom | vSAN trace and CMMDS log parsing; health event extraction; rebalance tracking |
| VMware VCF | VMware/Broadcom | SDDC Manager log parsing; lifecycle operation dashboards; compliance event tracking |
| Active Directory | Community/VMware | Windows AD log parsing; authentication success/failure dashboards; account lockout tracking |
| Linux | Community/VMware | /var/log/* parsing; SSH login analysis; cron job tracking; common Linux event fields |
| Palo Alto Networks | Palo Alto/Community | PAN-OS syslog parsing; firewall allow/deny dashboards; threat event correlation |
| F5 BIG-IP | Community | LTM and ASM log parsing; virtual server health dashboards; WAF event analysis |
| Dell EMC | Dell/Community | PowerStore, Unity, VNX storage array log parsing; hardware fault dashboards |
| Cisco | Community | IOS and NX-OS syslog parsing; interface state change tracking; routing event analysis |
Every content pack — whether built-in, marketplace, or custom — is composed of the following components:
Tip: When evaluating a content pack, review the extracted fields first. Fields are the foundation — dashboards and alerts depend on them. If the fields do not match your log format (e.g., because of a firmware version difference), the dashboards will show no data.
From the Marketplace (In-Product):
From a Downloaded File:
1. Download the content pack file (.vlcp extension) from the Broadcom support portal or a community repository.
2. In the UI, use the import option, browse to the .vlcp file and select it.

Warning: Installing a content pack that defines fields with the same names as existing fields will overwrite the existing field definitions. Review field conflicts before installing, especially when mixing marketplace packs with custom-defined fields.
Organizations can bundle their custom fields, dashboards, alerts, and queries into a reusable content pack for distribution across environments or teams.
Procedure:
1. Provide a Name (e.g., Custom - Payment Gateway Logs), a Namespace (unique identifier, e.g., com.mycompany.paymentgw), and a Description.
2. Select the components to include and export them. The result is a .vlcp file that can be imported into other Operations for Logs instances.

Tip: Use a consistent namespace convention (e.g., com.<company>.<application>) to avoid conflicts with VMware or community content packs. Version your content packs semantically (1.0, 1.1, 2.0) to track changes.
Content pack operations are governed by the role-based access control system in Operations for Logs.
| Role | Install / Uninstall | Create / Export | Use Dashboards | Use Queries | Modify Components |
|---|---|---|---|---|---|
| Super Admin | Yes | Yes | Yes | Yes | Yes |
| Admin | Yes | Yes | Yes | Yes | Yes |
| User | No | No | Yes | Yes | No (can create personal copies) |
| View Only | No | No | Yes (read-only) | Yes (read-only) | No |
Best Practice: Assign the User role to operations teams who need to search logs and view dashboards. Reserve Admin for the team responsible for content pack management and platform administration.
The Explore Logs interface is the primary workspace for interactive log investigation in Operations for Logs. Access it by clicking Explore Logs in the main navigation bar at the top of the UI.
The interface consists of:
Operations for Logs supports three query modes, each suited to different analytical needs.
The simplest query mode. Enter keywords or phrases in the query bar, and Operations for Logs searches the full text of all log messages within the selected time range.
error
"connection refused"
authentication failed
Use extracted fields and operators to create precise, structured queries. This mode is more efficient than free-text search because it operates on indexed field values rather than raw text.
vmw_host = esxi01.lab.local
appname = "vpxd" AND severity = "error"
vmw_vc_vm_name = web-server-* AND text CONTAINS "snapshot"
Apply statistical functions to log data to identify trends, volumes, and outliers. Aggregation queries produce charts rather than individual log entries.
# Count events per source over time
COUNT by source
# Average response time by application
AVERAGE(response_time) by appname
# Top 10 sources by error count
COUNT WHERE severity = "error" GROUP BY source ORDER BY COUNT DESC LIMIT 10
| Syntax | Description | Example |
|---|---|---|
| Single keyword | Finds logs containing the word anywhere in the message | error |
| Phrase (quoted) | Finds logs containing the exact phrase | "connection refused" |
| Boolean AND | Both terms must appear | error AND vcenter |
| Boolean OR | Either term must appear | warning OR error |
| Boolean NOT | Excludes logs containing the term | error NOT test |
| Parentheses | Group boolean expressions | (error OR warning) AND vcenter |
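These operators behave like familiar text-filtering pipelines. As a rough offline analogy only (not the product's query engine), the query (error OR warning) AND vcenter maps to chained greps over a sample log:

```shell
# Build a three-line sample log, then express
# (error OR warning) AND vcenter as two chained greps.
cat > /tmp/sample.log <<'EOF'
2026-03-20 vcenter vpxd error: task failed
2026-03-20 esxi01 vmkernel warning: high latency
2026-03-20 vcenter vpxd info: heartbeat ok
EOF
# Only the first line satisfies both conditions.
grep -E 'error|warning' /tmp/sample.log | grep 'vcenter'
```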
| Pattern | Description | Example |
|---|---|---|
| * | Matches zero or more characters | *error* matches "timeout error occurred" |
| ? | Matches exactly one character | host-??.lab.local matches "host-01.lab.local" |
| [...] | Matches any character in the set | [Ee]rror matches "Error" and "error" |
| [0-9] | Character range | vm-[0-9][0-9][0-9] matches "vm-001" through "vm-999" |
Field-based filters are the most powerful and efficient search mechanism. They use extracted fields (from content packs or custom extraction) to narrow results precisely.
| Operator | Description | Example |
|---|---|---|
| = | Exact match | vmw_host = esxi01.lab.local |
| != | Not equal | vmw_esxi_vpxa_status != running |
| CONTAINS | Substring match | text CONTAINS "certificate expired" |
| NOT CONTAINS | Excludes substring | text NOT CONTAINS "debug" |
| MATCHES | Regex match | text MATCHES "err(or\|no)\s\d+" |
| >, <, >=, <= | Numeric comparison | response_time > 5000 |
| EXISTS | Field has a value | vmw_vc_vm_name EXISTS |
| NOT EXISTS | Field is absent | custom_field NOT EXISTS |
Tip: Combine multiple field filters with AND/OR for complex investigations:

vmw_host = esxi01.lab.local AND appname = "vmkernel" AND text CONTAINS "NMP" AND severity = "warning"
Every log message ingested by Operations for Logs automatically receives the following static fields, regardless of content packs:
| Field | Description | Example Value |
|---|---|---|
| timestamp | Time the event was generated (from syslog header or agent) | 2026-03-20T14:32:01.000Z |
| source | Hostname or IP of the sending device | esxi01.lab.local |
| appname | Application name (from syslog header APP-NAME field) | vpxd, hostd, vmkernel |
| facility | Syslog facility code | local0, daemon, kern |
| severity | Syslog severity level | info, warning, error, critical |
| text | Full message body (everything after the syslog header) | (variable) |
Content packs define additional fields that are extracted at query time (or at ingest time, depending on configuration). For example, the vSphere content pack extracts:
- vmw_host — ESXi hostname
- vmw_vc_vm_name — Virtual machine name
- vmw_esxi_service — ESXi service name (hostd, vpxa, etc.)
- vmw_vc_event_type — vCenter event type (VmPoweredOnEvent, DrsVmMigratedEvent, etc.)
- vmw_vc_user — User who initiated the action

For logs not covered by existing content packs, you can create custom extracted fields interactively.
Procedure:
1. Name the field descriptively (e.g., http_response_code, db_query_time).
2. Choose the data type: String, Integer, or Float.

Warning: Custom extracted fields consume CPU during query execution. Avoid creating overly broad regex patterns that match unintended log messages. Test thoroughly using the Preview function before saving.
Operations for Logs uses different regex engines depending on the context:
| Context | Regex Engine | Notes |
|---|---|---|
| UI queries and field extraction | Java regex (java.util.regex) | Double-escape backslashes in the UI: \\d+ |
| Agent file parsing (liagent.ini) | C++ Boost regex | Standard PCRE-like syntax |
| API queries | Java regex | Same as UI |
| Pattern | Purpose | Regex |
|---|---|---|
| IPv4 address | Match IP addresses in log text | \\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b |
| MAC address | Match MAC addresses | [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} |
| ISO timestamp | Match ISO 8601 timestamps | \\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2} |
| HTTP status code | Match 3-digit HTTP codes | HTTP/\\d\\.\\d\\s+(\\d{3}) |
| Email address | Match email addresses | [\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,} |
| Windows SID | Match Windows Security Identifiers | S-\\d-\\d+-[\\d-]+ |
| UUID / GUID | Match UUIDs | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} |
Tip: When building regex patterns in the UI, use the Preview function to validate against live data. Start with a broad pattern and refine it iteratively. Named capture groups
(?<fieldname>...)are supported for multi-field extraction from a single pattern.
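Patterns from the table can be smoke-tested outside the UI before double-escaping them. For example, the IPv4 pattern, rewritten with [0-9] classes because grep -E does not support \d:

```shell
# Test the IPv4 pattern against a sample vmkernel-style line.
# In the UI the same pattern would be written with \\d{1,3} instead.
echo "vmk0: address 192.168.10.25 acquired" | \
  grep -oE '\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
```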
Intelligent Grouping is a machine-learning feature that automatically clusters structurally similar log messages, ignoring variable components like IP addresses, timestamps, UUIDs, and numeric values.
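The clustering idea can be crudely imitated in shell: mask the variable tokens and structurally similar messages collapse to the same template. This toy sketch masks only IPs and numbers and is purely an illustration, not the product's algorithm:

```shell
# Replace variable tokens (IP addresses first, then bare numbers) with a
# wildcard so similar messages reduce to one pattern.
mask() {
  sed -E \
    -e 's/[0-9]+(\.[0-9]+){3}/<*>/g' \
    -e 's/[0-9]+/<*>/g'
}
# Both lines collapse to: Session <*> opened from <*>
echo "Session 42 opened from 10.1.1.9" | mask
echo "Session 77 opened from 10.1.1.30" | mask
```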
Accessing Intelligent Grouping:
Apply an optional filter (e.g., source = esxi01.lab.local) and set the time range.

How It Works:
Variable tokens are replaced with wildcards, so structurally similar messages collapse into a single pattern (e.g., User <*> logged in from <*>).

Use Cases:
Interacting with Groups:
Saved queries preserve search criteria — keywords, field filters, time range preferences, and selected fields — for reuse without re-entering the query each time.
Saving a Query:
Give the query a descriptive name (e.g., ESXi PSOD Events - All Hosts).

Using Saved Queries:
Managing Saved Queries:
Best Practice: Establish a naming convention for shared saved queries (e.g.,
[Team] - [Description]) to keep the query library organized as it grows. Periodically review and prune unused saved queries to maintain clarity.
Operations for Logs provides two methods to create dashboards: promoting a query result directly from Explore Logs, or building a dashboard from scratch in the Dashboards section.
Method 1 — Promote from Explore Logs:
Method 2 — Build from Scratch:
Tip: Dashboards auto-refresh at configurable intervals (30 seconds, 1 minute, 5 minutes, 15 minutes, or manual). Set the refresh interval using the clock icon in the dashboard toolbar.
Operations for Logs supports a variety of widget types, each optimized for different analytical use cases.
| Widget Type | Description | Best Use Case |
|---|---|---|
| Chart (Pie) | Proportional breakdown of values as a circular chart | Distribution of log sources, error types by category |
| Chart (Bar) | Horizontal category comparison bars | Top 10 error-generating hosts, busiest log sources |
| Chart (Line) | Time-series trend line with data points | Log volume over time, error rate trends, ingestion throughput |
| Chart (Column) | Vertical bars for period-based comparison | Hourly event counts, daily log volume comparison |
| Chart (Gauge) | Single metric displayed as a gauge dial | Current ingestion rate, active alert count |
| Field Table | Tabular data view with sortable columns | Detailed event listing with extracted fields |
| Query List | List of saved queries displayed as clickable links | Quick-access navigation panel for analysts |
| Event Types | Breakdown of machine-learning-grouped event categories | ML-classified event distribution |
| Event Trends | Sparkline trend charts for each event type | At-a-glance trend overview per event category |
Widget Configuration Options:
hostname, appname, vmw_cluster).To create an alert, navigate to Alerts → Alert Definitions → New Alert. Operations for Logs provides four trigger condition types, each suited to a different monitoring pattern.
Trigger Condition Type 1 — On Every Match
Example: text CONTAINS "CRITICAL" AND appname CONTAINS "sshd" AND text CONTAINS "root"

Trigger Condition Type 2 — Total Count
Example: text CONTAINS "error", Threshold = 100, Window = 5 minutes.

Trigger Condition Type 3 — Unique Count
Parameters: Field to count unique values of (e.g., source), Count threshold (integer), Time window (minutes).

Example: text CONTAINS "authentication failure", Field = source, Threshold = 10, Window = 15 minutes.

Trigger Condition Type 4 — Aggregation
Parameters: Aggregation function (avg, min, max, sum, count), Field name, Threshold (numeric), Time window (minutes).

Example: appname CONTAINS "nginx", Function = avg, Field = response_time, Threshold = 5000, Window = 10 minutes.

Each alert definition includes the following configuration fields:
| Field | Description | Required |
|---|---|---|
| Name | Descriptive name for the alert (e.g., "ESXi PSOD Detection") | Yes |
| Description | Detailed description of the alert purpose and expected response | No |
| Query | The log query that defines which events are evaluated | Yes |
| Trigger Condition | One of the four types described in Section 26.3 | Yes |
| Frequency | How often the alert query is evaluated: 1, 5, 15, 30, or 60 minutes | Yes |
| Raise an Alert | When to generate the alert: First occurrence only, Every time the condition is met, or Once per time window | Yes |
| Notification | Select one or more notification channels (email or webhook) | No |
| Enable/Disable | Toggle to activate or deactivate the alert without deleting it | Yes |
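To make the aggregation trigger (Type 4) concrete, here is a toy offline evaluation of the avg-over-window comparison from the response_time example; the sample values are made up:

```shell
# Average four sample response times over a window and compare to the
# 5000 ms threshold, as a Type 4 aggregation condition would.
printf '%s\n' 4200 6100 7300 5900 | awk -v th=5000 '
  { sum += $1; n++ }
  END { avg = sum / n; printf "avg=%.0f %s\n", avg, (avg > th ? "ALERT" : "ok") }'
```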
Best Practice: Set the alert frequency to be shorter than or equal to the trigger time window. For example, if the time window is 5 minutes, set the frequency to 5 minutes or less. This ensures no events are missed between evaluation cycles.
When an alert becomes temporarily noisy — for example, during a planned maintenance window — you can snooze it rather than disabling it entirely.
Snoozed alerts display a clock icon and remaining snooze time in the Triggered Alerts list. You can cancel a snooze early by clicking Unsnooze on the alert.
Operations for Logs supports two primary notification channel types: Email (SMTP) and Webhooks.
| Field | Example Value |
|---|---|
| SMTP Server | smtp.lab.local |
| Port | 587 (TLS) or 25 (unencrypted) |
| Use TLS | Enabled |
| From Address | vrli-alerts@lab.local |
| Username | vrli-smtp-user |
| Password | (SMTP authentication password) |
| Field | Description |
|---|---|
| Name | Descriptive name (e.g., "Slack-Ops-Channel") |
| URL | Target endpoint URL |
| Content Type | application/json (default) |
| Payload Template | JSON body with placeholder variables |
Common Webhook Targets:
| Target | URL Format | Notes |
|---|---|---|
| Slack | https://hooks.slack.com/services/T.../B.../xxx | Use Slack Incoming Webhook URL |
| PagerDuty | https://events.pagerduty.com/v2/enqueue | Use PagerDuty Events API v2 integration key |
| Aria Automation | https://<vra-fqdn>/csp/gateway/am/api/... | Trigger workflow via REST webhook |
| ServiceNow | https://<instance>.service-now.com/api/now/table/incident | Create incident via REST API |
| Custom | Any https:// endpoint | Configurable HTTP method, headers, body template |
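Webhook payload templates are JSON bodies with placeholder variables that are substituted at send time. A minimal simulation of the substitution step, using made-up alert values:

```shell
# Simulate the variable substitution performed when a webhook payload
# template is rendered. The alert name and count here are illustrative.
ALERT_NAME="ESXi PSOD Detection"
MATCH_COUNT=3
payload="{\"text\": \"Alert: ${ALERT_NAME} (${MATCH_COUNT} matching events)\"}"
echo "$payload"
```

The rendered body is what a receiver such as a Slack incoming webhook would accept as its POST payload.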
Available Payload Variables:
- ${AlertName} — Name of the triggered alert
- ${AlertDescription} — Alert description text
- ${AlertQuery} — The query that triggered the alert
- ${MatchCount} — Number of matching events
- ${Url} — Direct URL to the alert in Operations for Logs
- ${Timestamp} — Time the alert was triggered
- ${Messages} — Sample of matching log messages

VCF Operations and VCF Operations for Logs are designed to work together as a unified observability platform. Two complementary integration methods connect the products:
Notification Events — VCF Operations for Logs sends alert notifications to VCF Operations, creating corresponding alert objects that appear alongside metric-based alerts. This enables a single-pane-of-glass view of both metric and log-based anomalies.
Launch in Context — From VCF Operations, operators can click on any monitored object and open its associated logs directly in Operations for Logs. The log view is automatically pre-filtered to show only events from the selected object and time range, eliminating the need to manually construct queries.
This integration pushes alert data from Operations for Logs into VCF Operations.
Step-by-step on the Operations for Logs side:
Enter the VCF Operations FQDN (e.g., vrops-vip.lab.local).

Step-by-step on the VCF Operations side:
Note: Alerts forwarded from Operations for Logs appear under the Log Analytics alert type in VCF Operations. They can be viewed, acknowledged, and cancelled using the same alert management workflows as native VCF Operations alerts.
This integration allows operators to open contextual log data from within the VCF Operations interface.
Step-by-step on the VCF Operations side:
Enter the Operations for Logs URL: https://<vrli-vip-fqdn> (e.g., https://vrli-vip.lab.local).

Verification:
A dedicated content pack enables Operations for Logs to parse, extract, and visualize logs generated by VCF Operations itself.
Installation:
Included Content:
| Content Type | Count | Examples |
|---|---|---|
| Extracted Fields | 15+ | vrops_component, vrops_alert_name, vrops_adapter_kind |
| Saved Queries | 10+ | "VCF Operations Errors — Last 24h", "Analytics Engine Warnings" |
| Dashboards | 3 | "VCF Operations Health", "Adapter Collection Status", "Audit Trail" |
| Alerts | 5 | "VCF Operations Service Crash", "Collector Disconnected" |
Once both integration methods are configured, the following capabilities become available in VCF Operations:
Operations for Logs can forward received logs to other systems for compliance archival, SIEM integration, or multi-site aggregation. Forwarding is asynchronous and adds no significant overhead to the cluster. Three forwarding protocols are supported:
| Protocol | Description | Use Case |
|---|---|---|
| Ingestion API (CFAPI) | Forward using the native Operations for Logs ingestion API format | Forward to another Operations for Logs instance for multi-site aggregation |
| Syslog | Forward as standard syslog messages over TCP, UDP, or TLS | Forward to SIEM platforms (Splunk, QRadar, ArcSight), syslog servers |
| RAW | Forward the original raw log data without transformation | Preserve exact original format for compliance or forensic archives |
Step-by-step:
| Field | Description | Example |
|---|---|---|
| Name | Descriptive name for the destination | SIEM-Splunk-Prod |
| Destination Host | FQDN or IP address of the target system | splunk-hec.lab.local |
| Protocol | Syslog (TCP/UDP/TLS), CFAPI, or RAW | Syslog (TLS) |
| Port | Port number appropriate for the selected protocol | 6514 |
Filter (optional): Configure filters to forward only specific log data:
Tags (optional): Add or modify tags on events before forwarding. This allows the receiving system to identify forwarded events.
Click Test to verify connectivity to the destination, then click Save.
Note: Each cluster supports up to 10 forwarding destinations. Forwarding operates asynchronously from the ingestion pipeline — destination outages do not affect log ingestion or indexing. Events are buffered and retried if the destination is temporarily unreachable.
Operations for Logs can archive log data to an NFS share for long-term retention beyond the active index capacity.
Step-by-step:
Specify the archive location in the format nfs://<server>/<share> (e.g., nfs://nfs-server.lab.local/vrli-archive).

Archive Behavior:
| Aspect | Detail |
|---|---|
| When data is archived | After it ages out of the active index (based on partition retention) |
| Archive format | Compressed JSON files organized by date |
| Searchability | Archived data is not searchable directly — must be re-ingested to query |
| NFS version requirement | NFSv3 |
| Permissions | Read/write access required from all cluster nodes |
| Mount validation | All nodes must successfully mount the NFS share |
Important: Ensure the NFS share has sufficient capacity for long-term storage. A cluster ingesting 50 GB/day will generate approximately 15–20 GB/day of compressed archive data.
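As a back-of-envelope check of that guidance, one year of archives at the upper bound of the quoted range works out to:

```shell
# 20 GB/day of compressed archive (upper bound above) for a full year.
awk 'BEGIN { printf "%d GB per year\n", 20 * 365 }'
```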
Index partitions allow you to apply different retention periods to different categories of log data. This enables longer retention for compliance-critical logs (e.g., security audit events) while using shorter retention for high-volume operational logs.
Configuration:
| Field | Description | Example |
|---|---|---|
| Name | Partition identifier | Security-Logs |
| Retention Period | Number of days to retain indexed data | 90 |
| Filter | Criteria determining which logs are routed to this partition | appname CONTAINS "sshd" OR appname CONTAINS "audit" |
Important: Longer retention periods require proportionally more disk space. Plan the /storage/var disk on each node to accommodate the total data volume across all partitions. Use the formula: Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3 (index overhead).
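Worked through for a concrete case (50 GB/day with 90-day retention, values chosen for illustration), the formula gives:

```shell
# Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3
daily_gb=50
retention_days=90
# POSIX shell arithmetic lacks floats, so awk does the multiplication.
required_gb=$(awk -v d="$daily_gb" -v r="$retention_days" \
  'BEGIN { printf "%.0f", d * r * 1.3 }')
echo "Required disk: ${required_gb} GB"
```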
The Operations for Logs appliance stores its own operational logs in well-defined paths. Familiarity with these files is essential for troubleshooting appliance issues.
| Log File | Path | Purpose | Rotation |
|---|---|---|---|
| Core Application | /var/log/loginsight/runtime.log | Main application log — startup, shutdown, errors | Daily |
| API / UI | /var/log/loginsight/api_runtime.log | API request logs, UI backend errors | Daily |
| Ingestion | /var/log/loginsight/ingestion.log | Syslog and agent ingestion pipeline | Daily |
| Cassandra | /var/log/loginsight/cassandra.log | Index database operations and errors | Daily |
| Audit | /var/log/loginsight/audit.log | User actions, login events, configuration changes | Daily |
| Watchdog | /var/log/loginsight/watchdog.log | Service health monitoring and auto-restart events | Daily |
| System | /var/log/messages | OS-level syslog messages | Weekly |
| Apache Reverse Proxy | /var/log/loginsight/apache/ | Reverse proxy access and error logs | Daily |
| Upgrade | /var/log/loginsight/upgrade.log | Upgrade process log with step-by-step progress | Per upgrade |
The Operations for Logs appliance runs on a SUSE Linux Enterprise Server (SLES) base operating system. Services are managed via systemctl.
# Check overall service status
systemctl status loginsight
# Restart the main Operations for Logs service
systemctl restart loginsight
# Check Cassandra index database status
systemctl status loginsight-cassandra
# Check watchdog service (monitors and auto-restarts crashed services)
systemctl status loginsight-watchdog
# View real-time service logs
journalctl -u loginsight -f
# Check disk usage on storage partition
df -h /storage/var
# Check cluster node connectivity
curl -k https://localhost:9543/api/v2/version
Warning: Restarting the loginsight service causes a brief ingestion interruption on that node. In a cluster, agents and syslog sources connected to the restarted node temporarily buffer events and reconnect to another node via the ILB VIP.
If the admin password is lost and UI access is not possible, reset it from the appliance command line:
```shell
# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Navigate to the application sbin directory
cd /usr/lib/loginsight/application/sbin

# Execute the password reset script
./li-reset-admin-password.sh

# Follow the interactive prompts to set a new admin password
# Services restart automatically after the password is reset
```
Note: This procedure resets the local `admin` account password only. It does not affect Active Directory or VIDM-integrated accounts. The password reset requires SSH access to the primary node as root.
Adjusting the internal logging level can help diagnose appliance issues.
| Level | Volume | Use Case |
|---|---|---|
| Error | Minimal | Production — only critical failures |
| Warning | Low | Production — failures and potential issues |
| Info | Moderate (default) | Normal operations — recommended for production |
| Debug | High | Active troubleshooting — detailed diagnostic output |
| Trace | Very High | Deep troubleshooting — full method-level tracing |
Important: Set the logging level to Debug or Trace only temporarily during active troubleshooting. These levels significantly increase log volume and can fill the `/storage/var` partition if left enabled. Always return to Info after troubleshooting is complete.
A support bundle collects diagnostic information required by Broadcom support for troubleshooting appliance issues.
UI Method:
CLI Method:
```shell
# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Generate the support bundle
/usr/lib/loginsight/application/sbin/li-support-bundle.sh

# Output location:
# /tmp/li-support-bundle-<timestamp>.tar.gz

# Transfer the bundle to your workstation
scp root@vrli-primary.lab.local:/tmp/li-support-bundle-*.tar.gz .
```
Bundle Contents:
All Operations for Logs API calls use HTTPS on port 9543 (or HTTP on port 9000 for non-production environments). The base URL format is:
```
https://<vrli-vip-fqdn>:9543/api/v2/
```
Replace `<vrli-vip-fqdn>` with the cluster VIP FQDN or individual node FQDN. All examples in this chapter use `vrli-vip.lab.local` as the target.
All API calls (except `/api/v2/sessions`) require a valid session token. Obtain a token by authenticating against the sessions endpoint:
```shell
# Obtain a session token
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/sessions" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "YourPassword123!",
    "provider": "Local"
  }'

# Response:
# {
#   "userId": "a1b2c3d4-...",
#   "sessionId": "abc123def456...",
#   "ttl": 1800
# }
```
Use the returned `sessionId` value as a Bearer token in subsequent requests:

```
Authorization: Bearer abc123def456...
```
| Field | Description |
|---|---|
| `userId` | Unique identifier of the authenticated user |
| `sessionId` | Session token — valid for `ttl` seconds |
| `ttl` | Time-to-live in seconds (default 1800 = 30 minutes) |
| `provider` | Authentication provider: `Local`, `ActiveDirectory`, or `vidm` |
Note: Tokens expire after the TTL period. For long-running automation scripts, implement token refresh logic that re-authenticates before the TTL expires.
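The refresh pattern that note describes can be sketched as a small stateful helper. This is a minimal illustration built on the `/api/v2/sessions` request and response shown above; the `LogsSession` class name, the 60-second refresh margin, and the injectable clock are assumptions made for the example, and TLS verification is disabled only to mirror `curl -k` in a lab.

```python
# Minimal sketch (assumptions noted above): re-authenticate against
# /api/v2/sessions shortly before the returned TTL elapses.
import json
import ssl
import time
import urllib.request

class LogsSession:
    """Caches a sessionId and re-authenticates before its TTL lapses."""

    def __init__(self, base_url, username, password, provider="Local",
                 refresh_margin=60, clock=time.time):
        self.base_url = base_url.rstrip("/")
        self.credentials = {"username": username, "password": password,
                            "provider": provider}
        self.refresh_margin = refresh_margin  # seconds of headroom before expiry
        self.clock = clock                    # injectable for testing
        self.session_id = None
        self.expires_at = 0.0

    def needs_refresh(self):
        """True when no token is cached or expiry is within the margin."""
        return (self.session_id is None
                or self.clock() >= self.expires_at - self.refresh_margin)

    def token(self):
        if self.needs_refresh():
            self._authenticate()
        return self.session_id

    def _authenticate(self):
        # POST /api/v2/sessions returns {"sessionId": ..., "ttl": ...}.
        request = urllib.request.Request(
            f"{self.base_url}/api/v2/sessions",
            data=json.dumps(self.credentials).encode(),
            headers={"Content-Type": "application/json"},
            method="POST")
        context = ssl._create_unverified_context()  # lab only, like curl -k
        with urllib.request.urlopen(request, context=context) as response:
            body = json.load(response)
        self.session_id = body["sessionId"]
        self.expires_at = self.clock() + body["ttl"]

    def headers(self):
        """Headers for authenticated calls, refreshing the token if needed."""
        return {"Authorization": f"Bearer {self.token()}",
                "Content-Type": "application/json"}
```

Long-running scripts then call `session.headers()` before each request instead of caching the Bearer header once.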
Send log events programmatically using the ingestion endpoint. This is useful for forwarding application logs, CI/CD pipeline events, or custom monitoring data.
```shell
# Ingest a single event
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/events/ingest/0" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer abc123def456..." \
  -d '{
    "events": [
      {
        "text": "Application deployment completed successfully",
        "timestamp": 1711000000000,
        "fields": [
          {"name": "appname", "content": "deploy-pipeline"},
          {"name": "environment", "content": "production"},
          {"name": "build_number", "content": "1842"},
          {"name": "deploy_status", "content": "success"}
        ]
      }
    ]
  }'
```
Ingestion Payload Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | String | Yes | The log message body |
| `timestamp` | Long | No | Event timestamp in epoch milliseconds (defaults to server receipt time) |
| `fields` | Array | No | Array of key-value pairs for structured field extraction |
| `fields[].name` | String | Yes (if `fields` used) | Field name |
| `fields[].content` | String | Yes (if `fields` used) | Field value |
Tip: The `/ingest/0` endpoint path suffix (`0`) specifies the shard hint. For most use cases, use `0` to let the cluster auto-distribute. For high-throughput ingestion, distribute across shard hints `0` through `n-1`, where `n` is the number of cluster nodes.
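The round-robin distribution described in the tip can be sketched as follows. The payload shape mirrors the ingestion example above; `make_event` and `shard_urls` are illustrative helper names, not part of the product API.

```python
# Hedged sketch: build CFAPI ingestion payloads and cycle shard hints 0..n-1.
import itertools

def make_event(text, timestamp_ms=None, **fields):
    """Build one event dict in the /api/v2/events/ingest payload shape."""
    event = {"text": text}
    if timestamp_ms is not None:
        event["timestamp"] = timestamp_ms  # epoch milliseconds
    if fields:
        event["fields"] = [{"name": name, "content": str(value)}
                           for name, value in sorted(fields.items())]
    return event

def shard_urls(base_url, node_count):
    """Yield ingest URLs cycling shard hints, one per cluster node."""
    for hint in itertools.cycle(range(node_count)):
        yield f"{base_url.rstrip('/')}/api/v2/events/ingest/{hint}"
```

Each batch is then POSTed to `next(urls)` with the usual `Authorization: Bearer` header and a `{"events": [...]}` body.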
Retrieve log events and aggregated statistics programmatically.
Search Events:
```shell
# Simple keyword search — last 100 matching events
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=error&limit=100" \
  -H "Authorization: Bearer abc123def456..."

# Field-based query with time range
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=vmw_host%3Desxi01*&limit=50&start-time-ms=1711000000000&end-time-ms=1711086400000" \
  -H "Authorization: Bearer abc123def456..."
```
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| `q` | String | Query string (URL-encoded) |
| `limit` | Integer | Maximum number of events to return (default 100, max 20000) |
| `start-time-ms` | Long | Start of time range in epoch milliseconds |
| `end-time-ms` | Long | End of time range in epoch milliseconds |
| `order-by-direction` | String | `ASC` or `DESC` (default `DESC`) |
| `content-pack-fields` | String | Include content pack extracted fields |
Aggregated Events:
```shell
# Count events by source over the last hour, divided into 12 bins
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/aggregated-events/timestamp/LAST_HOUR?q=error&num-bins=12" \
  -H "Authorization: Bearer abc123def456..."
```
Aggregation Time Ranges:
| Value | Description |
|---|---|
| `LAST_5_MINUTES` | Last 5 minutes |
| `LAST_15_MINUTES` | Last 15 minutes |
| `LAST_HOUR` | Last 60 minutes |
| `LAST_6_HOURS` | Last 6 hours |
| `LAST_24_HOURS` | Last 24 hours |
| `LAST_3_DAYS` | Last 3 days |
| `LAST_7_DAYS` | Last 7 days |
| `LAST_30_DAYS` | Last 30 days |
| `CUSTOM` | Use `start-time-ms` and `end-time-ms` |
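Because the `q` parameter must be URL-encoded (note the `%3D` in the field-based curl example above), building query URLs with a library encoder is less error-prone than encoding by hand. A minimal sketch; `events_query_url` is a hypothetical helper name:

```python
# Minimal sketch: assemble /api/v2/events query URLs with proper encoding.
from urllib.parse import urlencode

def events_query_url(base_url, query, limit=100, start_ms=None, end_ms=None,
                     direction="DESC"):
    """Build an events query URL; urlencode handles '=', '*', and spaces."""
    params = {"q": query, "limit": limit, "order-by-direction": direction}
    if start_ms is not None:
        params["start-time-ms"] = start_ms
    if end_ms is not None:
        params["end-time-ms"] = end_ms
    return f"{base_url.rstrip('/')}/api/v2/events?{urlencode(params)}"
```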
The following table lists the endpoint categories available in the Operations for Logs v2 API.
| # | Category | Endpoint | Description |
|---|---|---|---|
| 1 | Sessions | `/api/v2/sessions` | Authentication — acquire and release tokens |
| 2 | Events | `/api/v2/events` | Query log events with filters and time ranges |
| 3 | Aggregated Events | `/api/v2/aggregated-events` | Statistical queries with time-bucketed aggregation |
| 4 | Ingest | `/api/v2/events/ingest` | Send log events via CFAPI |
| 5 | Alerts | `/api/v2/alerts` | Manage alert definitions (CRUD) |
| 6 | Content Packs | `/api/v2/content-packs` | Install, list, and manage content packs |
| 7 | Dashboards | `/api/v2/dashboards` | Create, update, delete dashboards and widgets |
| 8 | Groups | `/api/v2/groups` | Manage agent groups and group filters |
| 9 | Notifications | `/api/v2/notifications` | Manage notification channels |
| 10 | Users | `/api/v2/users` | User management (create, list, update, delete) |
| 11 | Roles | `/api/v2/roles` | Role management and permission assignment |
| 12 | Datasets | `/api/v2/datasets` | Index partition management |
| 13 | Cluster | `/api/v2/cluster` | Cluster topology, node status, and management |
| 14 | License Keys | `/api/v2/licensekeys` | License key management and status |
| 15 | SMTP | `/api/v2/notification/smtp` | Email notification server configuration |
| 16 | Webhooks | `/api/v2/notification/webhook` | Webhook endpoint configuration |
| 17 | Archiving | `/api/v2/archiving` | NFS archive configuration and status |
| 18 | Forwarding | `/api/v2/forwarding` | Log forwarding destination management |
| 19 | Agents | `/api/v2/agents` | Agent registration, status, and management |
| 20 | vSphere | `/api/v2/vsphere` | vSphere integration configuration |
| 21 | Spaces | `/api/v2/spaces` | Multi-tenancy space management |
| 22 | Certificates | `/api/v2/certificates` | TLS certificate management |
| 23 | Upgrades | `/api/v2/upgrades` | Appliance upgrade management |
| 24 | Support | `/api/v2/support` | Support bundle generation and download |
| 25 | Version | `/api/v2/version` | Product version and build information |
Operations for Logs provides interactive API documentation accessible directly from the appliance:
```
https://<vrli-vip-fqdn>:9543/api/v2/docs
```
The interactive documentation lets you browse every endpoint, inspect request and response schemas, and issue test calls directly from the browser. The machine-readable OpenAPI specification is available at:

```
https://<vrli-vip-fqdn>:9543/api/v2/docs/openapi.json
```
Tip: Use the OpenAPI specification to generate client libraries in Python, Go, Java, or PowerShell for automating Operations for Logs administration tasks.
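As one small example of consuming the specification, the sketch below downloads `openapi.json` and flattens it into (method, path) pairs — a quick way to audit endpoint coverage before generating a full client. It assumes the spec follows the standard OpenAPI `paths` layout; TLS verification is disabled only to match the lab-style `curl -k` usage elsewhere in this chapter.

```python
# Hedged sketch: enumerate (method, path) pairs from the appliance's
# published OpenAPI document.
import json
import ssl
import urllib.request

def fetch_openapi(base_url):
    """Download the OpenAPI document from /api/v2/docs/openapi.json."""
    context = ssl._create_unverified_context()  # lab only
    url = f"{base_url.rstrip('/')}/api/v2/docs/openapi.json"
    with urllib.request.urlopen(url, context=context) as response:
        return json.load(response)

def list_operations(spec):
    """Flatten a parsed OpenAPI dict into sorted (METHOD, path) pairs."""
    operations = []
    for path, methods in spec.get("paths", {}).items():
        for method in methods:
            operations.append((method.upper(), path))
    return sorted(operations)
```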
The following table lists all network ports required for VCF Operations deployment and operation. Firewall rules must permit traffic on these ports between the listed source and destination components.
| Port | Protocol | Direction | Source | Destination | Purpose |
|---|---|---|---|---|---|
| 443 | TCP | Inbound | Browser / API Client | VCF Operations Cluster VIP | Web UI and REST API (HTTPS) |
| 443 | TCP | Outbound | VCF Operations Node | vCenter Server | vCenter adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | NSX Manager | NSX adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | SDDC Manager | SDDC Manager adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | ESXi Hosts | Direct ESXi metric collection |
| 443 | TCP | Outbound | Remote Collector | vCenter / NSX / Targets | Remote adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | Broadcom Marketplace | Management pack downloads |
| 8543 | TCP | Inbound | Remote Collector / Agents | VCF Operations Cluster VIP | Collector-to-cluster communication |
| 7001 | TCP | Internal | VCF Operations Node | VCF Operations Node | GemFire cache replication |
| 1300–1399 | TCP | Internal | VCF Operations Node | VCF Operations Node | Distributed cache range ports |
| 10002 | TCP | Internal | VCF Operations Node | VCF Operations Node | GemFire locator port |
| 20002 | TCP | Internal | VCF Operations Node | VCF Operations Node | xDB replication primary port |
| 20003 | TCP | Internal | VCF Operations Node | VCF Operations Node | xDB replication secondary port |
| 4369 | TCP | Internal | VCF Operations Node | VCF Operations Node | Erlang Port Mapper Daemon (epmd) |
| 5433 | TCP | Internal | VCF Operations Node | VCF Operations Node | PostgreSQL database replication |
| 8080 | TCP | Localhost | VCF Operations Node | Localhost | Internal application HTTP |
| 9090 | TCP | Localhost | VCF Operations Node | Localhost | Internal admin service |
| 514 | UDP | Inbound | Network Devices | VCF Operations Node | Syslog reception (optional) |
| 162 | UDP | Inbound | Network Devices | VCF Operations Node | SNMP trap reception |
| 25 | TCP | Outbound | VCF Operations Node | SMTP Server | Email notification delivery |
| 587 | TCP | Outbound | VCF Operations Node | SMTP Server | Email notification delivery (TLS) |
| 123 | UDP | Outbound | VCF Operations Node | NTP Server | Time synchronization |
Note: For the complete and most current port requirements, consult the Broadcom Ports and Protocols tool at `https://ports.broadcom.com/`.
The following table lists all network ports required for VCF Operations for Logs deployment and operation.
| Port | Protocol | Direction | Source | Destination | Purpose |
|---|---|---|---|---|---|
| 443 | TCP | Inbound | Browser / API Client | Operations for Logs VIP | Web UI access (HTTPS) |
| 514 | TCP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (TCP) |
| 514 | UDP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (UDP) |
| 6514 | TCP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (TLS-encrypted) |
| 1514 | TCP | Inbound | ESXi Hosts | Operations for Logs VIP | ESXi SSL syslog forwarding |
| 9000 | TCP | Inbound | Log Insight Agents | Operations for Logs VIP | CFAPI ingestion (HTTP) |
| 9543 | TCP | Inbound | Log Insight Agents / API Clients | Operations for Logs VIP | CFAPI ingestion (HTTPS) + REST API |
| 16520–16580 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cluster inter-node communication |
| 59778 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Thrift RPC inter-node calls |
| 12543 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra database communication |
| 9200 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Node indexing service |
| 7000 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra gossip protocol |
| 7001 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra SSL gossip protocol |
| 123 | UDP | Outbound | Operations for Logs Node | NTP Server | Time synchronization |
| 25 | TCP | Outbound | Operations for Logs Node | SMTP Server | Email notification delivery |
| 587 | TCP | Outbound | Operations for Logs Node | SMTP Server | Email notification delivery (TLS) |
| 514 | TCP | Outbound | Operations for Logs Node | Syslog Destination | Log forwarding (syslog) |
| 443 | TCP | Outbound | Operations for Logs Node | VCF Operations VIP | Alert notification integration |
| 2049 | TCP/UDP | Outbound | Operations for Logs Node | NFS Server | NFS archive mount |
Note: Syslog ingestion on port 514 (both TCP and UDP) is enabled by default. Port 6514 (TLS) and port 1514 (ESXi SSL) require additional configuration in the appliance admin UI.
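When validating firewall rules against the tables above, a simple TCP reachability probe run from the source side can confirm that a listener is open before deeper troubleshooting. A minimal sketch; the port list below is illustrative, and plain connect tests cannot validate UDP ports such as 514/UDP or 123/UDP:

```python
# Minimal sketch: probe TCP listeners from a prospective log source.
import socket

def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Key Operations for Logs ingestion listeners from the table above.
INGEST_PORTS = {514: "syslog (TCP)", 1514: "ESXi SSL syslog",
                6514: "syslog (TLS)", 9543: "CFAPI / REST API (HTTPS)"}

def sweep(host, ports=INGEST_PORTS):
    """Map each port to its reachability from this machine."""
    return {port: check_tcp_port(host, port) for port in ports}
```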
The VCF Operations Suite API provides programmatic access to all platform capabilities. The base path for all endpoints is:
```
https://<vrops-vip-fqdn>/suite-api/api/
```
| # | Category | Base Path | Key Operations |
|---|---|---|---|
| 1 | Authentication | `/suite-api/api/auth` | Acquire and release authentication tokens |
| 2 | Resources | `/suite-api/api/resources` | CRUD operations on monitored objects |
| 3 | Resource Kinds | `/suite-api/api/resourcekinds` | List and describe resource types |
| 4 | Adapter Kinds | `/suite-api/api/adapterkinds` | List and describe adapter types |
| 5 | Adapters | `/suite-api/api/adapters` | Manage adapter instances and credentials |
| 6 | Credentials | `/suite-api/api/credentials` | Create, update, and delete stored credentials |
| 7 | Alerts | `/suite-api/api/alerts` | Query, acknowledge, and cancel alerts |
| 8 | Alert Definitions | `/suite-api/api/alertdefinitions` | Create and manage alert definitions |
| 9 | Symptoms | `/suite-api/api/symptoms` | Query active symptom instances |
| 10 | Symptom Definitions | `/suite-api/api/symptomdefinitions` | Create and manage symptom definitions |
| 11 | Notifications | `/suite-api/api/notifications` | Manage notification rules and channels |
| 12 | Super Metrics | `/suite-api/api/supermetrics` | Create and manage super metric formulas |
| 13 | Policies | `/suite-api/api/policies` | Manage operational policies and assignments |
| 14 | Dashboards | `/suite-api/api/dashboards` | Create, clone, share, and delete dashboards |
| 15 | Reports | `/suite-api/api/reports` | Generate, schedule, and download reports |
| 16 | Report Definitions | `/suite-api/api/reportdefinitions` | Define report templates and layouts |
| 17 | Views | `/suite-api/api/views` | Create and manage data views |
| 18 | Tasks | `/suite-api/api/tasks` | Query and manage background tasks |
| 19 | Collector Groups | `/suite-api/api/collectorgroups` | Manage collector group assignments |
| 20 | Collectors | `/suite-api/api/collectors` | List and manage collector nodes |
| 21 | Audit | `/suite-api/api/audit` | Query audit log entries |
| 22 | Applications | `/suite-api/api/applications` | Application monitoring configuration |
| 23 | Deployment | `/suite-api/api/deployment` | Cluster deployment and scaling operations |
| 24 | Certificate | `/suite-api/api/certificate` | TLS certificate management |
| 25 | Cluster | `/suite-api/api/cluster` | Cluster topology and health |
| 26 | Versions | `/suite-api/api/versions` | Product version and build information |
| 27 | Content | `/suite-api/api/content` | Import and export content bundles |
| 28 | Events | `/suite-api/api/events` | Query and manage event timeline |
| 29 | Maintenance Schedules | `/suite-api/api/maintenanceschedules` | Schedule maintenance windows |
| 30 | Object Groups | `/suite-api/api/groups` | Manage built-in object groups |
| 31 | Custom Groups | `/suite-api/api/customgroups` | Create and manage custom object groups |
| 32 | Traversal Specs | `/suite-api/api/traversalspecs` | Define object relationship traversals |
| 33 | Relationships | `/suite-api/api/resources/{id}/relationships` | Query parent/child object relationships |
| 34 | Statistics | `/suite-api/api/resources/{id}/stats` | Retrieve metric data for a resource |
| 35 | Properties | `/suite-api/api/resources/{id}/properties` | Retrieve property values for a resource |
| 36 | Latest Statistics | `/suite-api/api/resources/{id}/stats/latest` | Retrieve the most recent metric values |
| 37 | Recommendations | `/suite-api/api/recommendations` | Query optimization recommendations |
| 38 | Cost | `/suite-api/api/costconfig` | Cost model and rate card configuration |
| 39 | Pricing | `/suite-api/api/pricing` | Pricing policy management |
| 40 | Capacity | `/suite-api/api/capacity` | Capacity analytics and projections |
| 41 | Reclamation | `/suite-api/api/reclamation` | Resource reclamation recommendations |
| 42 | Compliance | `/suite-api/api/compliance` | Compliance benchmark scoring |
| 43 | SDDC Health | `/suite-api/api/sddc` | SDDC-level health and status |
| 44 | vSAN | `/suite-api/api/vsan` | vSAN-specific health and capacity |
| 45 | Token | `/suite-api/api/auth/token` | Token-based authentication (acquire/validate) |
| 46 | Users | `/suite-api/api/auth/users` | User account management |
| 47 | Roles | `/suite-api/api/auth/roles` | Role and permission management |
Note: All Suite API endpoints support JSON request and response bodies. Use `Content-Type: application/json` and `Accept: application/json` headers. Full Swagger documentation is available at `https://<vrops-vip-fqdn>/suite-api/doc/swagger-ui.html`.
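Authentication against the Suite API differs from the Operations for Logs API: tokens are acquired from the `auth/token` category above and passed with the `vRealizeOpsToken` scheme rather than `Bearer`. The sketch below assumes the commonly documented `/suite-api/api/auth/token/acquire` request shape and the `authSource` field; verify both against your appliance's Swagger UI before relying on them.

```python
# Hedged sketch of Suite API token acquisition (assumed request shape,
# see lead-in). TLS verification disabled only for lab use.
import json
import ssl
import urllib.request

def acquire_suite_token(base_url, username, password, auth_source="LOCAL"):
    """POST auth/token/acquire and return the token string."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/suite-api/api/auth/token/acquire",
        data=json.dumps({"username": username, "password": password,
                         "authSource": auth_source}).encode(),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST")
    context = ssl._create_unverified_context()  # lab only
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)["token"]

def suite_headers(token):
    """The Suite API uses the vRealizeOpsToken scheme, not Bearer."""
    return {"Authorization": f"vRealizeOpsToken {token}",
            "Content-Type": "application/json",
            "Accept": "application/json"}
```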
The following table lists the OVA appliance files used to deploy VCF Operations and related products. File sizes are approximate and vary by specific release version.
| Product | OVA Filename | Approx. Size | Notes |
|---|---|---|---|
| VCF Operations (Analytics Node) | `vRealize-Operations-Manager-Appliance-8.18.2.*.ova` | ~3.2 GB | Primary, replica, and data node appliance |
| VCF Operations (Remote Collector) | `vRealize-Operations-Manager-Remote-Collector-*.ova` | ~1.8 GB | Lightweight collection-only appliance |
| VCF Operations for Logs | `VMware-vRealize-Log-Insight-8.18.2.*.ova` | ~2.5 GB | Log analytics node (primary and worker) |
| VCF Suite Lifecycle Manager | `VMware-vRealize-Suite-Lifecycle-Manager-*.ova` | ~4.5 GB | Lifecycle management for the full VCF Operations suite |
| VCF Operations for Networks (Platform) | `VMware-vRealize-Network-Insight-*.ova` | ~3.0 GB | Network analytics platform node |
| VCF Operations for Networks (Collector) | `VMware-vRealize-Network-Insight-Collector-*.ova` | ~2.0 GB | Network flow and configuration collector |
Checksum Verification:
Always verify the SHA256 checksum of downloaded OVA files against the values published on the Broadcom download portal before deployment.
```shell
# Linux / macOS
sha256sum vRealize-Operations-Manager-Appliance-8.18.2.*.ova
```

```powershell
# Windows (PowerShell)
Get-FileHash -Algorithm SHA256 .\vRealize-Operations-Manager-Appliance-8.18.2.*.ova
```
Important: Deploying an OVA with a mismatched checksum may indicate a corrupted download or a tampered file. Re-download the OVA from the Broadcom support portal if the checksum does not match.
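For scripted downloads, the manual commands above can be wrapped in a small verifier that streams the file (OVAs are multi-gigabyte) and compares the result against the digest published on the download portal. A minimal sketch; `verify_sha256` is a hypothetical helper name:

```python
# Minimal sketch: stream a large OVA and compare its SHA-256 digest
# to the value published on the Broadcom download portal.
import hashlib

def verify_sha256(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file's SHA-256 digest matches expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so multi-GB OVAs never load fully into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().lower() == expected_hex.strip().lower()
```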
The following table provides direct links to key documentation and resources for VCF Operations and related products.
| Resource | URL |
|---|---|
| VCF Operations Documentation | https://docs.vmware.com/en/VMware-Aria-Operations/index.html |
| VCF Operations for Logs Documentation | https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/index.html |
| VCF Operations API Reference (Suite API) | https://docs.vmware.com/en/VMware-Aria-Operations/8.18/aria-operations-api-guide/GUID-intro.html |
| VCF Operations Sizing Guide | https://kb.vmware.com/s/article/2093783 |
| VCF Operations Port Requirements | https://ports.broadcom.com/ |
| VCF 9.0 Release Notes | https://docs.vmware.com/en/VMware-Cloud-Foundation/9.0/rn/vmware-cloud-foundation-90-release-notes/index.html |
| Broadcom Support Portal | https://support.broadcom.com/ |
| VCF Compatibility Matrix | https://interopmatrix.vmware.com/ |
| Broadcom Marketplace (Management Packs) | https://marketplace.cloud.vmware.com/ |
| VMware Knowledge Base | https://kb.vmware.com/ |
| VCF Operations for Logs API Reference | https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/8.18/aria-operations-for-logs-api-guide/GUID-intro.html |
| VCF Operations Community Forums | https://community.broadcom.com/vmware-tanzu/home |
End of Document
VCF Operations & Operations for Logs — Complete Handbook v1.0 © 2026 Virtual Control LLC. All rights reserved.