Virtual Control — VMware Cloud Foundation Solutions
Comprehensive 180-page handbook covering VCF Operations and Operations for Logs deployment, configuration, and management.
VCF 9.0 | VMware Cloud Foundation | Proprietary & Confidential

VCF Operations & Operations for Logs — Complete Handbook

VMware Cloud Foundation 9.0 | Broadcom

Author: Virtual Control LLC
Copyright: © 2026 Virtual Control LLC. All rights reserved.
Version: 1.0 — March 2026
Classification: Internal / Professional Reference


Table of Contents


PART I: VCF Operations


Chapter 1 — Product Overview

VCF Operations is the unified monitoring, capacity planning, and optimization platform for VMware Cloud Foundation environments. It provides real-time analytics across compute, storage, networking, and application layers, delivering intelligent workload placement, proactive alerting, and capacity forecasting to ensure the health and efficiency of your private cloud infrastructure.

1.1 Naming History

The product now known as VCF Operations has undergone several name changes as VMware's portfolio evolved and Broadcom completed its acquisition of VMware. Understanding this lineage is critical when referencing older documentation, KB articles, and community resources.

Year Product Name Context
2012 vCenter Operations Manager Initial release as a standalone vCenter companion for performance analytics.
2015 vRealize Operations Manager (vROps) Rebranded under the vRealize Suite umbrella. Versions 6.x through 8.x carried this name. This is the name most widely recognized in the VMware community.
2022 VMware Aria Operations VMware unified its cloud management portfolio under the "Aria" brand. vRealize Operations Manager 8.10+ became Aria Operations.
2024 VCF Operations Following the Broadcom acquisition of VMware, the Aria brand was retired. All products were realigned under the VMware Cloud Foundation (VCF) umbrella. VCF Operations is the current and official name.

Important: When searching VMware Knowledge Base articles, search under all three recent names — vRealize Operations, Aria Operations, and VCF Operations — to ensure complete coverage of relevant results. Many KB articles have not yet been updated to reflect the latest naming.

The underlying technology, architecture, and API surface remain consistent across these name changes. A deployment upgraded from vRealize Operations 8.6 through Aria Operations 8.14 to VCF Operations 8.18.2 retains its full configuration, dashboards, super metrics, and alert definitions without requiring re-creation.

1.2 Product Family

VCF Operations is not a single appliance — it is the anchor product within a broader operations suite. The following table lists all products in the VCF Operations family as of VCF 9.0:

Product Name Description VCF 9.0 Version Deployment Model
VCF Operations Performance monitoring, capacity planning, optimization, and workload balancing for the entire VCF stack. 8.18.2 OVA appliance (analytics cluster)
VCF Operations for Logs Centralized log management and analytics. Collects, indexes, and analyzes syslog and log data from all VCF components. 8.18.2 OVA appliance (standalone or cluster)
VCF Operations for Networks Deep network visibility, micro-segmentation analytics, traffic flow analysis, and network topology mapping. Integrates with NSX. 6.14 OVA appliance
VCF Suite Lifecycle Manager Lifecycle management for the VCF Operations suite. Handles deployment, upgrades, patching, and configuration drift for all suite products. 8.18 Embedded in SDDC Manager or standalone OVA
Cloud Proxies Lightweight collectors deployed in remote sites or workload domains to collect data and forward it to the central analytics cluster. Bundled with VCF Operations OVA appliance (minimal footprint)

All products in the suite share a common authentication framework, can be cross-launched from one another, and are managed through a unified lifecycle workflow in SDDC Manager.

1.3 Architecture

VCF Operations employs a distributed analytics architecture designed for horizontal scalability and high availability. The architecture consists of the following tiers:

Analytics Cluster

The analytics cluster is the core of VCF Operations. It processes incoming metrics, computes dynamic thresholds, evaluates alert conditions, and serves the user interface. A production analytics cluster consists of:

Collectors

Collectors are responsible for gathering metrics from monitored endpoints and delivering them to the analytics cluster:

Adapters

Adapters are modular plugins that define how VCF Operations connects to and collects data from specific endpoint types. Each adapter type understands the API, object model, and metrics of its target system. Key built-in adapters include:

Management Packs

Management packs extend VCF Operations with adapters, dashboards, alert definitions, and reports for third-party and additional VMware products. They are installed through the product UI or via the VCF Suite Lifecycle Manager.

Data Flow Summary

  1. Collection — Adapters (running on embedded or remote collectors) poll endpoints at configured intervals (default: 5 minutes) and retrieve raw metrics, properties, and relationships.
  2. Ingestion — Raw data is transmitted to the analytics cluster over HTTPS and written to the xDB distributed datastore.
  3. Analytics — The analytics engine computes dynamic thresholds (based on machine-learning models), evaluates symptom and alert definitions, runs capacity models, and calculates health/risk/efficiency scores.
  4. Presentation — Processed data is served through the HTML5 UI, REST API, dashboards, views, and reports.

1.4 VCF 9.0 Integration Model

In VCF 9.0, VCF Operations is a mandatory first-class component of the Cloud Foundation stack, not an optional add-on. Its integration with the broader VCF platform is deep and bidirectional:

Fleet Manager Deployment

VCF 9.0 introduces Fleet Manager as the next-generation deployment and lifecycle orchestrator. Fleet Manager automates the full VCF Operations deployment workflow:

  1. Fleet Manager references the Bill of Materials (BoM) to determine the correct VCF Operations OVA version.
  2. The OVA is downloaded from the Broadcom repository or a local depot.
  3. Fleet Manager deploys the OVA to the management cluster, applying network and sizing parameters from the deployment specification.
  4. Post-deployment configuration (NTP, DNS, admin credentials) is applied automatically.
  5. The VCF Operations instance is registered with SDDC Manager for ongoing lifecycle management.

SDDC Manager Orchestration

SDDC Manager provides ongoing lifecycle management for VCF Operations:

Mandatory Monitoring

In VCF 9.0, VCF Operations monitoring is enabled by default for every workload domain. When a new workload domain is created, SDDC Manager automatically:

This ensures complete observability from the moment a workload domain becomes operational, with no manual adapter configuration required.


Chapter 2 — Sizing and Prerequisites

Proper sizing of VCF Operations is critical to achieving reliable performance, accurate analytics, and timely alerting. Under-sizing leads to collection lag, delayed alerts, and UI timeouts. Over-sizing wastes management cluster resources. This chapter provides the authoritative sizing tables and prerequisite requirements.

2.1 Node Types

VCF Operations uses five distinct node types. Each serves a specific role in the analytics architecture:

Node Type Role Required? Quantity
Primary Master controller, xDB partition owner, API gateway, UI host. First node deployed. Yes (exactly 1) 1
Replica Hot standby for the primary node. Maintains a synchronized copy of all primary data and configuration. Automatically promoted if the primary fails. No (required for HA) 0 or 1
Data Expands analytics processing capacity and xDB storage. Added in pairs for balanced data distribution. No (for scale-out) 0, 2, 4, 6, or 8
Remote Collector Lightweight forwarder deployed near monitored endpoints. Collects metrics and sends them to the analytics cluster. Stores no data locally. No (for remote sites) 0+
Cloud Proxy Specialized remote collector for cloud-connected services and SaaS integrations. No (for cloud use cases) 0+

Node Selection Guidance

2.2 OVA Sizing

The VCF Operations OVA is deployed with one of five predefined size profiles. The size is selected during OVA deployment and cannot be changed after deployment without redeploying the node.

Size vCPUs Memory (GB) Disk (GB) Maximum Objects Use Case
Extra Small 4 16 282 1,200 Lab and proof-of-concept environments. Not recommended for production.
Small 8 32 474 4,000 Small production environments. Single workload domain with limited VM count.
Medium 16 48 898 16,000 Mid-size production environments. Multiple workload domains. Most common production size.
Large 32 128 2,026 50,000 Large enterprise environments. Many workload domains, multiple vCenter instances.
Extra Large 48 512 4,014 100,000 Very large enterprise or service-provider environments. Requires data nodes for scale.

Note: The "Maximum Objects" column refers to the total count of monitored objects across all adapters — VMs, hosts, datastores, clusters, port groups, NSX objects, vSAN objects, and any objects from management packs. Use the formula: Total Objects ≈ (VMs × 1.0) + (Hosts × 3.5) + (Clusters × 2.0) + (Datastores × 1.0) as a rough estimation starting point.
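As an illustration, the rough-estimate formula above can be scripted so the result is easy to compare against the "Maximum Objects" column. This is a sketch; the inventory counts in the example call are hypothetical values, not recommendations:

```shell
#!/bin/sh
# Rough object-count estimate for VCF Operations sizing, per the formula:
#   Total ≈ (VMs × 1.0) + (Hosts × 3.5) + (Clusters × 2.0) + (Datastores × 1.0)
estimate_objects() {
  vms=$1; hosts=$2; clusters=$3; datastores=$4
  awk -v v="$vms" -v h="$hosts" -v c="$clusters" -v d="$datastores" \
      'BEGIN { printf "%.0f\n", v*1.0 + h*3.5 + c*2.0 + d*1.0 }'
}

# Hypothetical example inventory: 2500 VMs, 64 hosts, 8 clusters, 40 datastores
estimate_objects 2500 64 8 40   # prints 2780
```

An estimate of 2,780 objects would fit a Small profile (4,000 objects), but remember to leave headroom for growth and for objects added by management packs.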

Sizing Recommendations

2.3 Cluster Models

VCF Operations supports three cluster deployment models:

Simple (Single Node)

High Availability (HA)

Continuous Availability (CA)

Critical Limitation: A Simple (single-node) deployment cannot be converted to an HA or CA deployment in place. Replica and data nodes must be deployed as fresh OVAs and joined to the primary, and a node that was initialized as a standalone instance cannot retroactively gain HA without redeploying the primary. Plan your cluster model before initial deployment.

2.4 Remote Collector Sizing

Remote Collectors are deployed as separate OVAs with their own sizing profiles, independent of the analytics cluster node sizing:

Size vCPUs Memory (GB) Disk (GB) Max Adapters Max Objects Use Case
Standard 2 4 20 5 1,500 Small remote sites, single vCenter endpoint.
Large 4 16 20 15 10,000 Large remote sites, multiple endpoints, or high-frequency collection.

Remote Collector Placement Guidelines

2.5 Disk Partitions

The VCF Operations appliance uses multiple disk partitions to separate data by function. Understanding these partitions is essential for troubleshooting disk-space alerts and planning NFS extension:

Mount Point Purpose Grows With
/ Root filesystem. Operating system, appliance binaries, configuration files. Static — does not grow significantly.
/storage/db xDB distributed datastore. Primary storage for all collected metrics, properties, relationships, and computed analytics. Object count and retention period. This is the largest and fastest-growing partition.
/storage/log Application log files for all VCF Operations services. Activity level and log verbosity settings.
/storage/core Core dump files generated during application crashes. Only grows when crashes occur.
/storage/nfs Optional NFS mount point for offloading historical data or report storage. Configured capacity of the NFS share.
/storage/vcops/backup Local backup storage. Used by the built-in backup mechanism for configuration and data snapshots. Backup frequency and retention count.

Best Practice: Monitor disk usage on /storage/db closely. When this partition reaches 90% utilization, VCF Operations triggers a critical alert and may begin dropping the oldest data to prevent total disk exhaustion. Extend this partition by adding an NFS datastore or by deploying additional data nodes to distribute the storage load.
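A minimal watchdog for the 90% threshold described above can be sketched as follows. The check function takes a df-style percent-used value so the logic can be exercised without a live appliance; the df invocation in the comment is the assumed way to feed it real data:

```shell
#!/bin/sh
# Sketch: warn when /storage/db crosses the 90% threshold at which
# VCF Operations raises a critical alert and may age out old data.
THRESHOLD=90

check_usage() {
  pct=$1                      # e.g. "93%" or "93"
  pct=${pct%\%}               # strip a trailing % sign if present
  if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "CRITICAL: /storage/db at ${pct}% (>= ${THRESHOLD}%)"
  else
    echo "OK: /storage/db at ${pct}%"
  fi
}

# On a live node you would feed it real df output, e.g.:
#   check_usage "$(df --output=pcent /storage/db | tail -1 | tr -d ' ')"
check_usage 93%
check_usage 71
```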

2.6 Browser and Hypervisor Requirements

Supported Browsers

The VCF Operations HTML5 UI is supported on the following browsers:

Browser Minimum Version Notes
Google Chrome 100+ Recommended browser. Best performance and rendering.
Mozilla Firefox 100+ Fully supported.
Microsoft Edge (Chromium) 100+ Fully supported. Legacy Edge (EdgeHTML) is not supported.

Supported Hypervisor Versions

Component Supported Versions
ESXi 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3
vCenter Server 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3
Virtual Hardware Version 19 (ESXi 7.0 U3) or 20/21 (ESXi 8.0+)

Additional Prerequisites


Chapter 3 — Network Port Requirements

VCF Operations requires specific network ports to be open between its nodes, monitored endpoints, and consuming services. Failure to open the correct ports results in collection failures, cluster communication breakdowns, or inaccessible UI. This chapter provides the complete port reference.

3.1 Inbound Ports

These ports must be open to the VCF Operations analytics cluster nodes from clients and external systems:

Port Protocol Source Destination Purpose
443 TCP (HTTPS) Admin workstations, API clients, SDDC Manager, VCF Operations for Logs VCF Operations cluster nodes Primary UI access, REST API, Suite API, adapter data reception from Remote Collectors. This is the single most critical port.
8543 TCP (HTTPS) Legacy API clients VCF Operations cluster nodes Legacy vRealize Operations API endpoint. Maintained for backward compatibility with older integrations and scripts. Deprecated — migrate to port 443.
443 TCP (HTTPS) Remote Collectors, Cloud Proxies VCF Operations cluster nodes Data forwarding from remote collectors to the analytics cluster. Remote collectors push collected metrics to the cluster over this port.

3.2 Outbound Ports

These ports must be open from the VCF Operations analytics cluster nodes (and Remote Collectors) to external endpoints:

Port Protocol Source Destination Purpose
443 TCP (HTTPS) VCF Operations nodes / Remote Collectors vCenter Server vCenter adapter data collection. Retrieves VM, host, cluster, datastore, and resource pool metrics via the vSphere API.
443 TCP (HTTPS) VCF Operations nodes / Remote Collectors NSX Manager NSX adapter data collection. Retrieves transport node, logical switch, edge, and firewall metrics.
443 TCP (HTTPS) VCF Operations nodes / Remote Collectors SDDC Manager SDDC Manager adapter. Retrieves workload domain configuration, lifecycle events, and compliance status.
443 TCP (HTTPS) VCF Operations nodes Broadcom repository (online) Downloading upgrade bundles, management packs, and content updates when connected to the internet.
514 TCP/UDP VCF Operations nodes Syslog server / VCF Operations for Logs Forwarding VCF Operations application logs to a centralized syslog collector.
25 TCP (SMTP) VCF Operations nodes Mail server Sending email notifications for alert triggers. Unencrypted SMTP.
587 TCP (SMTP/TLS) VCF Operations nodes Mail server Sending email notifications for alert triggers over TLS-encrypted SMTP. Preferred over port 25.

3.3 Cluster-Internal Ports

These ports are used for communication between VCF Operations cluster nodes (primary, replica, and data nodes). They must be open bidirectionally between all cluster members:

Port Protocol Purpose
7001 TCP Cassandra (xDB) inter-node communication. Handles data replication, partition synchronization, and consistency management between cluster nodes.
1300–1399 TCP GemFire distributed cache. Used for in-memory data grid communication, cache replication, and cluster state synchronization. The exact port within this range is assigned dynamically.
10002 TCP Cluster controller RPC. The primary node uses this port to coordinate cluster operations — node joins, failover decisions, and configuration propagation.
20002 TCP Analytics data synchronization. Distributes computed analytics results (dynamic thresholds, scores, capacity projections) across all cluster nodes.
20003 TCP Cluster heartbeat. Used by the cluster health monitor to detect node failures. A missed heartbeat sequence triggers failover procedures.
4369 TCP Erlang Port Mapper Daemon (epmd). Used by the RabbitMQ message broker embedded in each node for inter-node message routing.

Important: All cluster-internal ports must have low latency (< 1 ms round-trip) and high bandwidth (1 Gbps minimum). Cluster nodes should not be separated by WAN links, firewalls with deep packet inspection, or load balancers. Place all cluster nodes on the same VLAN or Layer 2 segment.
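The sub-millisecond RTT requirement can be spot-checked from any cluster node with standard ping. The sketch below separates the parsing (testable offline against a canned summary line) from the live check; the peer FQDN in the comment is a hypothetical placeholder:

```shell
#!/bin/sh
# Sketch: verify the < 1 ms inter-node RTT requirement between
# cluster members by parsing a Linux "ping -c" summary line.
avg_rtt_ms() {
  # Expects the final line of ping output, e.g.:
  #   rtt min/avg/max/mdev = 0.211/0.284/0.391/0.044 ms
  echo "$1" | awk -F'/' '{print $5}'
}

check_peer_latency() {
  summary=$(ping -c 5 -q "$1" | tail -1)
  avg=$(avg_rtt_ms "$summary")
  awk -v a="$avg" 'BEGIN { exit !(a < 1.0) }' \
    && echo "$1: avg ${avg} ms — OK" \
    || echo "$1: avg ${avg} ms — exceeds 1 ms cluster requirement"
}

# Example (replace with your replica node's FQDN):
# check_peer_latency vcf-ops-02.lab.local
```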

3.4 Localhost-Only Ports

These ports are bound to 127.0.0.1 (localhost) on each VCF Operations node. They do not require firewall rules because they are not accessible from the network. They are documented here for troubleshooting and security audit purposes:

Port Protocol Purpose
5433 TCP vPostgres embedded database. Stores appliance configuration, user accounts, roles, policies, and alert definitions. Not used for metric storage (that is xDB).
8080 TCP (HTTP) Internal CaSA (Collector and Storage Aggregator) service. Handles internal metric routing between collector threads and the storage layer.
9090 TCP (HTTP) Internal admin/health-check endpoint. Used by the appliance self-monitoring watchdog to verify service health.
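For a security audit, you can confirm that these services really are bound to 127.0.0.1 and not exposed on all interfaces. The sketch below parses `ss -tln`-style output so the logic can be tested offline; on an appliance you would pipe in live ss output as shown in the comment:

```shell
#!/bin/sh
# Sketch: audit that the documented localhost-only services (5433,
# 8080, 9090) are bound to 127.0.0.1 rather than 0.0.0.0.
# Live usage on an appliance:  ss -tln | audit_local_binds
audit_local_binds() {
  awk '
    $4 ~ /:(5433|8080|9090)$/ {
      n = split($4, a, ":")
      port = a[n]
      if ($4 ~ /^127\.0\.0\.1:/) print port " OK (localhost only)"
      else                       print port " EXPOSED (" $4 ")"
    }'
}

# Canned example input in ss -tln format:
printf '%s\n' \
  'LISTEN 0 128 127.0.0.1:5433 0.0.0.0:*' \
  'LISTEN 0 128 0.0.0.0:8080 0.0.0.0:*' | audit_local_binds
```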

3.5 Remote Collector Ports

Remote Collectors have a simplified port profile because they do not run analytics or store data:

Port Protocol Direction Source → Destination Purpose
443 TCP (HTTPS) Outbound Remote Collector → VCF Operations cluster Forwarding collected metrics, properties, and relationship data to the analytics cluster.
443 TCP (HTTPS) Outbound Remote Collector → Monitored endpoints (vCenter, NSX, etc.) Collecting data from monitored endpoints. The Remote Collector initiates all connections — endpoints never connect inbound to the collector.

Remote Collectors do not expose any inbound ports. All communication is initiated outbound by the collector. This makes Remote Collectors ideal for deployment in DMZ or restricted network zones where inbound connections are prohibited.

3.6 Firewall Rule Guidance

When creating firewall rules for VCF Operations, follow these best practices:

  1. Use FQDNs, not IP addresses, in firewall rules where possible. VCF Operations nodes may change IP addresses during disaster recovery or migration. FQDN-based rules are more resilient.

  2. Restrict source addresses. Do not use any as the source for inbound port 443. Limit access to known admin workstation subnets, SDDC Manager, and Remote Collector IP ranges.

  3. Enable stateful inspection. All VCF Operations connections are TCP-based and work correctly with stateful firewalls. Stateful inspection ensures return traffic is automatically permitted.

  4. Do not use SSL decryption/inspection on traffic between VCF Operations cluster nodes. SSL interception between cluster members causes certificate validation failures and breaks cluster communication.

  5. Test connectivity before deployment. Use curl -v https://<target>:443 from the VCF Operations node to verify that each required port is reachable before configuring adapters. Connection failures after adapter configuration are difficult to distinguish from credential or API errors.

  6. Document all rules. Maintain a port matrix document that maps each firewall rule to its VCF Operations purpose. This accelerates troubleshooting when connectivity issues arise during maintenance windows or network changes.
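The pre-deployment connectivity test from point 5 can be scripted against the whole port matrix at once. This is a sketch using bash's /dev/tcp redirection with a timeout; the endpoint names in the comments are hypothetical placeholders for your environment:

```shell
#!/bin/bash
# Sketch: pre-deployment reachability check for required outbound
# ports (Section 3.2). Prints OK/FAIL per host:port pair.
check_port() {
  host=$1; port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
  fi
}

# Example targets (replace with your endpoints):
# check_port vcenter.lab.local 443    # vCenter adapter collection
# check_port nsx.lab.local 443        # NSX adapter collection
# check_port smtp.lab.local 587       # alert e-mail over TLS
check_port 127.0.0.1 1   # almost certainly closed: demonstrates FAIL output
```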


Chapter 4 — Deployment

This chapter covers all deployment methods for VCF Operations — from the fully automated VCF 9.0 workflow to manual OVA deployment via the vSphere Client and command-line tools. Regardless of the deployment method, the end result is the same: a running VCF Operations appliance ready for initial configuration.

4.1 VCF Automated Deployment Flow

In a VCF 9.0 environment, VCF Operations deployment is orchestrated by SDDC Manager and Fleet Manager. This is the recommended deployment method for production VCF environments because it ensures consistency with the VCF Bill of Materials and integrates VCF Operations into the overall lifecycle management framework.

Automated Deployment Sequence

  1. Prerequisite Validation — SDDC Manager validates that the management cluster has sufficient capacity (CPU, memory, storage) to host the VCF Operations appliance at the specified size.

  2. OVA Acquisition — Fleet Manager retrieves the VCF Operations OVA from the configured software depot. In connected environments, this is the Broadcom online repository. In air-gapped environments, the OVA must be pre-staged in the local SDDC Manager depot.

  3. OVA Deployment — Fleet Manager deploys the OVA to the management cluster's designated resource pool and datastore. Network configuration (IP, subnet, gateway, DNS, NTP) is injected via OVF properties derived from the deployment specification.

  4. Appliance Boot and Self-Configuration — The appliance boots, applies the network configuration, generates initial self-signed certificates, and starts all core services. This takes approximately 10–15 minutes.

  5. Registration — Fleet Manager registers the VCF Operations instance with SDDC Manager. This enables ongoing lifecycle management (upgrades, certificate rotation, health monitoring).

  6. Initial Adapter Configuration — SDDC Manager automatically configures vCenter and NSX adapter instances for the management domain. If additional workload domains exist, adapters are configured for those as well.

  7. Validation — Fleet Manager runs a post-deployment health check to confirm that all services are running, the UI is accessible, and initial data collection has started.

Note: The automated deployment flow deploys a Medium-sized OVA by default. To override the size, edit the deployment specification JSON before initiating the workflow. Consult the SDDC Manager API documentation for the exact parameter path.

4.2 OVA Deployment via vSphere Client

For environments not using VCF 9.0 automation, or when deploying additional nodes (replica, data), the OVA can be deployed manually through the vSphere Client.

Step-by-Step Procedure

Step 1 — Download the OVA

Download the VCF Operations OVA file from the Broadcom Support Portal (support.broadcom.com). Navigate to VMware Cloud Foundation → VCF Operations → Downloads. Select the version matching your VCF Bill of Materials.

Step 2 — Launch the Deploy OVF Template Wizard

  1. Log in to the vSphere Client (https://<vcenter-fqdn>/ui).
  2. Right-click the target cluster or resource pool in the inventory tree.
  3. Select Deploy OVF Template.

Step 3 — Select the OVA Source

Step 4 — Name and Location

Step 5 — Select a Compute Resource

Step 6 — Review Details

Step 7 — Configuration (Size Selection)

Step 8 — Select Storage

Step 9 — Select Networks

Step 10 — Customize Template (OVF Properties)

This is the most critical page. Enter the following values:

Property Value Notes
Hostname vcf-ops-01.lab.local Must match the DNS A record. FQDN is permanent.
IP Address 10.0.10.50 Static IP on the management VLAN.
Subnet Mask 255.255.255.0 Matches the management VLAN subnet.
Default Gateway 10.0.10.1 Management VLAN gateway.
DNS Server(s) 10.0.10.10 Comma-separated if multiple.
Domain Name lab.local DNS search domain.
NTP Server(s) 10.0.10.10 Must match the NTP source used by vCenter and ESXi.
Admin Password (strong password) Password for the admin user account.
Root Password (strong password) Password for the Linux root user on the appliance.

Step 11 — Ready to Complete

Step 12 — Power On

4.3 OVA Deployment via ovftool CLI

For automated or scripted deployments, use the VMware ovftool command-line utility. This is useful for deploying multiple nodes in a cluster or for integrating VCF Operations deployment into infrastructure-as-code pipelines.

Full ovftool Command

ovftool \
  --name="vcf-operations-primary-01" \
  --deploymentOption="medium" \
  --diskMode="thin" \
  --datastore="vsanDatastore" \
  --network="Management-PG" \
  --acceptAllEulas \
  --allowExtraConfig \
  --powerOn \
  --prop:vami.DNS.VMware_Aria_Operations="10.0.10.10" \
  --prop:vami.gateway.VMware_Aria_Operations="10.0.10.1" \
  --prop:vami.ip0.VMware_Aria_Operations="10.0.10.50" \
  --prop:vami.netmask0.VMware_Aria_Operations="255.255.255.0" \
  --prop:vami.hostname="vcf-ops-01.lab.local" \
  --prop:vami.NTP.VMware_Aria_Operations="10.0.10.10" \
  --prop:vami.domain.VMware_Aria_Operations="lab.local" \
  --prop:guestinfo.cis.appliance.root.password="VMware123!" \
  --prop:guestinfo.cis.appliance.ssh.enabled="True" \
  /path/to/vcf-operations-8.18.2.ova \
  "vi://administrator@vsphere.local:password@vcenter.lab.local/Datacenter/host/Management-Cluster"

Key Parameters Explained

Parameter Description
--deploymentOption OVA size profile: xsmall, small, medium, large, xlarge.
--diskMode Disk provisioning: thin (recommended) or thick.
--datastore Target datastore name on the destination host/cluster.
--network Port group name to map Network 1 to.
--prop:vami.DNS.* DNS server IP(s).
--prop:vami.gateway.* Default gateway IP.
--prop:vami.ip0.* Static IP address for the appliance.
--prop:vami.netmask0.* Subnet mask.
--prop:vami.hostname FQDN for the appliance. Must have a matching DNS record.
--prop:vami.NTP.* NTP server IP(s).
--powerOn Automatically power on the VM after deployment.

Note: The OVF property names reference VMware_Aria_Operations because the OVA internal metadata still uses the Aria Operations naming convention. This does not affect functionality — it is simply the OVF property namespace.
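When deploying several nodes (primary, replica, data) with consistent settings, the ovftool invocation above is easier to manage if it is generated per node from one template. The sketch below is a dry run — it only prints the commands for review; names, IPs, and paths are hypothetical placeholders, and only a subset of the properties from the full command above is shown:

```shell
#!/bin/bash
# Sketch: generate per-node ovftool invocations from one template.
# Dry run only — prints the commands so they can be reviewed before use.
OVA=/path/to/vcf-operations-8.18.2.ova
TARGET='vi://administrator@vsphere.local:password@vcenter.lab.local/Datacenter/host/Management-Cluster'

build_ovftool_cmd() {
  name=$1; ip=$2; fqdn=$3
  printf 'ovftool --name="%s" --deploymentOption="medium" --diskMode="thin" ' "$name"
  printf -- '--datastore="vsanDatastore" --network="Management-PG" --acceptAllEulas --powerOn '
  printf -- '--prop:vami.ip0.VMware_Aria_Operations="%s" ' "$ip"
  printf -- '--prop:vami.hostname="%s" ' "$fqdn"
  printf '%s "%s"\n' "$OVA" "$TARGET"
}

build_ovftool_cmd vcf-operations-primary-01 10.0.10.50 vcf-ops-01.lab.local
build_ovftool_cmd vcf-operations-replica-01 10.0.10.51 vcf-ops-02.lab.local
```

Review the printed commands, add the remaining properties (DNS, gateway, netmask, NTP, passwords) for your environment, and then execute them one node at a time.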

4.4 VAMI Configuration

After the appliance boots and completes its initial self-configuration, the Virtual Appliance Management Interface (VAMI) is available for administrative tasks.

Accessing VAMI

Open a browser and navigate to:

https://<node-fqdn>/admin

Log in with the admin user account and the password specified during OVA deployment.

Historical Note: In older versions (vRealize Operations 6.x–8.x), the VAMI was accessed on port 5480 (https://<node-fqdn>:5480). In current versions, the VAMI is integrated into the main web interface at the /admin path on port 443.

VAMI Administrative Functions

The VAMI provides the following administrative functions:

4.5 Initial Setup Wizard

On the first login to the VCF Operations UI (https://<node-fqdn>/ui), the Initial Setup Wizard guides you through the essential configuration steps. The wizard consists of seven steps:

Step 1 — Getting Started

Step 2 — Accept the EULA

Step 3 — Choose the Deployment Type

Step 4 — Set the Admin Password

Step 5 — Choose the Certificate Option

Step 6 — Configure NTP

Step 7 — Ready to Complete

After the wizard completes, the VCF Operations login page is displayed. Log in with admin and the password set in Step 4. The system is now ready for adapter configuration and monitoring setup (covered in subsequent chapters).


Chapter 5 — High Availability Cluster Setup

A single-node VCF Operations deployment provides no fault tolerance — if the appliance fails, all monitoring, alerting, and capacity analytics are lost until the node is restored. For production environments, deploying a high availability (HA) cluster is strongly recommended. This chapter provides a detailed walkthrough of HA cluster configuration.

5.1 Deploy Primary Node

The primary node is deployed using the procedures described in Chapter 4 (Section 4.2 for vSphere Client or Section 4.3 for ovftool). Complete the Initial Setup Wizard (Section 4.5) with the New Installation option.

Before proceeding to replica deployment, verify the primary node is fully operational:

  1. Log in to https://<primary-fqdn>/ui with the admin account.
  2. Navigate to Administration → Cluster Management.
  3. Confirm the node status shows Online with a green indicator.
  4. Verify that all services are running: navigate to Administration → Cluster Management → Services and confirm that every listed service shows a status of Running.

Pre-Requisites for Cluster Expansion

Before deploying the replica node, ensure:

5.2 Deploy and Join Replica Nodes

Deploy the Replica OVA

Deploy a second VCF Operations OVA using the same size profile as the primary node. Use the same deployment method (vSphere Client or ovftool) described in Chapter 4.

Key settings for the replica OVA:

Setting Value
VM Name vcf-operations-replica-01
Size Must match the primary node (e.g., Medium)
IP Address Different IP, same VLAN as primary (e.g., 10.0.10.51)
FQDN Unique FQDN with matching DNS records (e.g., vcf-ops-02.lab.local)
Gateway, DNS, NTP Identical to the primary node
Admin/Root Passwords May differ from the primary, but using the same passwords simplifies administration

Power on the replica OVA and wait for it to complete first-boot initialization (10–15 minutes).

Join the Replica to the Primary

  1. Open a browser and navigate to https://<replica-fqdn>/ui.

  2. The Initial Setup Wizard appears. On the Deployment Type page, select Expand an Existing Cluster.

  3. Enter the primary node's FQDN or IP address:

  4. Click Validate. The wizard connects to the primary node and retrieves its certificate.

  5. Accept the Certificate — Review the certificate thumbprint displayed. Verify it matches the primary node's certificate thumbprint (you can find this on the primary node under Administration → Cluster Management → Certificate). Click Accept.

  6. Authenticate — Enter the admin credentials for the primary node.

  7. Node Role Selection — Select Replica as the role for this node.

  8. Click Next and then Finish to initiate the join process.

The join process takes approximately 15–25 minutes. During this time:

Monitoring the Join Progress

On the primary node, navigate to Administration → Cluster Management. The cluster status panel shows:

Do not modify any configuration or restart any services during the join process.

5.3 Cluster Initialization

After the replica node joins successfully, the cluster requires activation to enable HA functionality:

  1. On the primary node, navigate to Administration → Cluster Management.

  2. The cluster status panel shows both nodes: the primary and the replica. Both should show status Online.

  3. Click the Enable HA button (if it has not been automatically enabled during the join process).

  4. A confirmation dialog appears: "Enabling High Availability will synchronize all data between the primary and replica nodes. This may temporarily impact performance during the initial synchronization. Do you want to continue?"

  5. Click Yes to confirm.

  6. The cluster enters the Synchronizing state. During synchronization:

  7. When synchronization completes, the cluster status changes to Online (HA Enabled). This indicates:

Verifying HA Functionality

To verify that HA is working correctly:

  1. Navigate to Administration → Cluster Management → Status.

  2. Confirm:

  3. Check the Cluster Health dashboard (Dashboards → VCF Operations Self-Monitoring → Cluster Health). All health indicators should be green.

Creating Anti-Affinity Rules

To prevent both cluster nodes from running on the same ESXi host (which would defeat the purpose of HA), create a DRS anti-affinity rule:

  1. In the vSphere Client, navigate to the management cluster.
  2. Go to Configure → VM/Host Rules.
  3. Click Add.
  4. Name: VCF-Operations-Anti-Affinity
  5. Type: Separate Virtual Machines
  6. Members: Add vcf-operations-primary-01 and vcf-operations-replica-01.
  7. Click OK.

DRS will automatically vMotion the VMs to separate hosts if they are currently co-located.

5.4 Limitations and Considerations

Critical: No In-Place Conversion from Simple to HA

VCF Operations does not reliably support converting a single-node (Simple) deployment to HA by adding a replica after the node has been running as a standalone instance for an extended period. The supported path is:

  1. Deploy the primary node using the New Installation wizard.
  2. Deploy the replica OVA.
  3. Join the replica to the primary before or shortly after production data collection begins.

If you attempt to add a replica to a long-running standalone deployment, the join may succeed, but synchronization of historical data can take an extremely long time and may fail for very large datasets. Best practice: decide on your cluster model before initial deployment and deploy the replica immediately after the primary.

IP Address and FQDN Immutability

Once deployed, the IP address and FQDN of each node are embedded in the cluster configuration, certificates, and inter-node trust relationships. Changing the IP or FQDN of a cluster member requires:

  1. Removing the node from the cluster.
  2. Redeploying the OVA with the new IP/FQDN.
  3. Re-joining the node to the cluster.

This is disruptive and should be avoided. Plan IP addressing and DNS naming carefully before deployment.

Cluster Node Sizing Consistency

All nodes in a cluster must use the same OVA size profile. You cannot mix a Medium primary with a Small replica or add Large data nodes to a Medium cluster. If you need to change the cluster size, you must redeploy all nodes.

Network Latency Requirements

Cluster-internal communication (xDB replication, heartbeat, GemFire cache synchronization) is latency-sensitive. All cluster nodes must be on the same Layer 2 network segment with:

Violating these requirements leads to split-brain scenarios, data inconsistency, and false failover events.
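The latency between cluster members can be spot-checked from any node before enabling HA. A quick sketch using ping — the peer hostname is a placeholder for your replica or data node FQDN:

```shell
# Measure round-trip latency from this node to a cluster peer.
# PEER is a placeholder; replace with your replica/data node FQDN.
PEER="vcf-operations-replica-01.corp.local"

# 20 samples; extract the average RTT (ms) from the ping summary line.
ping -c 20 -q "$PEER" | awk -F'/' '/^rtt|^round-trip/ {print "avg RTT (ms):", $5}'
```

The awk pattern matches both the Linux ("rtt min/avg/max/mdev") and BSD ("round-trip") summary formats.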

Failover Behavior

When the primary node fails:

  1. The replica detects the failure via missed heartbeats (default timeout: 5 minutes).
  2. The replica promotes itself to primary and begins serving the UI and API.
  3. All adapter collection continues without interruption (collectors reconnect to the new primary automatically).
  4. When the original primary node is restored, it re-joins the cluster as the replica.

During the failover window (approximately 5 minutes), the UI is unavailable and no new alerts are generated. Metric collection continues in the collector buffer and is flushed to the cluster once the new primary is operational.

Witness Node for Continuous Availability

For environments requiring even higher availability, deploy a witness node in addition to the primary and replica. The witness participates in quorum voting to prevent split-brain scenarios but does not store analytics data or serve the UI. The witness OVA is a separate, much smaller appliance. Continuous Availability (CA) mode requires:

CA mode ensures that the cluster continues to operate with zero data loss even if one node fails completely, by maintaining a quorum and synchronous data replication across all nodes.

Chapter 6: Key Filesystem Paths and Services

This chapter provides a comprehensive reference for the filesystem layout, service architecture, and operational commands used to manage VCF Operations (Aria Operations) appliances. Understanding these paths and services is essential for troubleshooting, backup planning, and day-to-day administration.

6.1 Filesystem Paths

The VCF Operations appliance is built on Photon OS and follows a structured directory layout. The two primary mount points are the root filesystem (/) and the data partition (/storage/), which is sized according to the deployment profile selected during installation.

Path Purpose
/usr/lib/vmware-vcops/ Main application directory; contains binaries, libraries, and runtime components for all VCF Operations services.
/usr/lib/vmware-vcops/user/conf/ Application configuration files including analytics.properties, collector.properties, gemfire.properties, and adapter configuration XML files.
/usr/lib/vmware-vcops/user/plugins/ Management pack plugin directories. Each installed management pack places its adapter JAR files and descriptors here in a versioned subdirectory.
/usr/lib/vmware-vcops/user/plugins/inbound/ Inbound (data collection) adapter plugins. Contains subdirectories for each installed adapter such as VMware_adapter3, PythonRemediationVcenterAdapter, and third-party packs.
/usr/lib/vmware-vcops/user/conf/ssl/ SSL/TLS certificates and keystores used by the application, including the web server certificate (cert.pem), private key (key.pem), and trust stores.
/usr/lib/vmware-vcops/user/conf/cassandra/ Cassandra configuration directory containing cassandra.yaml, cassandra-env.sh, and related tuning files for the metrics datastore.
/usr/lib/vmware-vcops/tomcat-enterprise/ Apache Tomcat instance serving the REST API (/suite-api) and the administrative UI. Contains conf/server.xml, webapps/, and log directories.
/usr/lib/vmware-vcops/tools/opscli/ Operations CLI tooling. The primary entry point is ops-cli.py, used for adapter management, slice configuration queries, and cluster diagnostics.
/usr/lib/vmware-vcops/support/ Support and diagnostic scripts including sliceConfiguration.sh, cleanupOps.sh, and the support bundle generator supportbundle.py.
/storage/db/ Primary analytics database directory housing the FSDB (File System Database) and HIS (Historical) data stores. This is where time-series metric data resides.
/storage/db/casa/ CaSA (Cluster and Slice Administration) database. Manages cluster membership, node roles, replication state, and slice ownership metadata.
/storage/db/cassandra/ Cassandra data directory for persisted metrics. Contains SSTables, commit logs, and saved caches.
/storage/db/vcops/ Core analytics working data, including dynamic threshold calculations, symptom state, and alert evaluation results.
/storage/log/ Application-level log files for all VCF Operations services. Primary troubleshooting location. Key files include analytics.log, collector.log, api.log, and casa.log.
/storage/core/ Core dump files generated during application crashes. Monitor disk usage here; large core dumps can fill the partition.
/storage/nfs/ Default NFS mount point for scheduled backup destinations. Must be pre-configured with appropriate NFS export permissions.
/var/log/ Operating system and VMware infrastructure service logs, including syslog, messages, vmware/ subdirectory, and Photon OS package manager logs.
/var/vmware/ VMware infrastructure service runtime data, including STS token caches and VMware Identity Manager working files.
/opt/vmware/etc/ vPostgres (VMware-bundled PostgreSQL) configuration files. Contains postgresql.conf, pg_hba.conf, and recovery configuration.
/opt/vmware/vpostgres/ vPostgres binary and library directory. The PostgreSQL instance used for alert definitions, user data, and report storage.

Note: The /storage/ partition is critical. If it reaches capacity, analytics processing halts and data collection stops. Monitor the partition with df -h /storage/ and configure alerts for filesystem utilization exceeding 85%.
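The note above can be automated with a small cron-friendly check. The 85% threshold mirrors the recommendation; the warn/ok logic itself is our own sketch, not a built-in tool:

```shell
#!/bin/sh
# Warn when a filesystem crosses the recommended 85% utilization mark.
THRESHOLD=85

check_usage() {
  # $1 = used percentage as a bare integer (no % sign)
  if [ "$1" -ge "$THRESHOLD" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Extract the used% for /storage (strip everything but the digits).
used=$(df --output=pcent /storage 2>/dev/null | tail -1 | tr -dc '0-9')

if [ -n "$used" ]; then
  echo "/storage at ${used}% -- $(check_usage "$used")"
else
  echo "/storage not mounted on this system"
fi
```

Scheduled via cron (or wired into your monitoring agent), this gives an early warning well before analytics processing halts.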

6.2 Services

VCF Operations runs as a collection of interdependent services managed by systemd. The following table lists every core service, its function, and its expected default state on a healthy primary node.

Service Name Description Default State
vmware-vcops-analytics Core analytics engine responsible for dynamic threshold computation, symptom evaluation, alert generation, capacity modeling, and workload optimization calculations. Running
vmware-vcops-collector Data collection service that executes adapter instances, gathers metrics from monitored systems, and feeds raw data into the analytics pipeline. Running
vmware-vcops-api REST API and administrative UI service hosted on Tomcat. Serves the /suite-api endpoint and the HTML5 management interface on port 443. Running
vmware-casa Cluster and Slice Administration (CaSA) service. Manages multi-node cluster topology, node membership, slice assignment, replication orchestration, and failover coordination. Running
vmware-vcops-gemfire Apache Geode (GemFire) distributed in-memory cache. Provides inter-node data sharing, real-time metric buffering, and distributed lock management across cluster nodes. Running
vmware-vcops-vpostgres VMware-packaged PostgreSQL database instance. Stores alert definitions, custom dashboards, super metrics, user accounts, report templates, and compliance data. Running
vmware-vcops-cassandra Apache Cassandra metrics storage engine. Provides the persistent time-series datastore for all collected metrics and properties. Running
vmware-vcops-watchdog Service watchdog daemon. Monitors the health of all other VCF Operations services and automatically restarts any service that becomes unresponsive or crashes. Running
vmware-vcops-web Front-end web server (httpd/nginx reverse proxy). Handles TLS termination, static content serving, and request routing to the Tomcat API backend. Running
vmware-stsd VMware Security Token Service daemon. Provides authentication token issuance and validation for inter-service communication. Running
vmware-vcops-rhino Rhino script engine for custom automation actions and notification plugins. Running

6.3 Service Management Commands

All service operations must be performed as the root user via SSH or console access. The appliance supports both systemctl and legacy service command syntax.

Checking service status:

# Preferred — systemctl
systemctl status vmware-vcops-analytics

# Legacy — service wrapper
service vmware-vcops-analytics status

Starting, stopping, and restarting individual services:

# Start a service
systemctl start vmware-vcops-collector

# Stop a service
systemctl stop vmware-vcops-collector

# Restart a service (stop then start)
systemctl restart vmware-vcops-api

Querying overall cluster slice status:

/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status

This command returns the role of the current node (primary, replica, data), cluster membership, and the online/offline state of each slice.

Using the Operations CLI:

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py --help
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py node list

Checking all VCF Operations services at once:

for svc in analytics collector api casa gemfire vpostgres cassandra watchdog web; do
  echo "=== vmware-vcops-${svc} ==="
  systemctl is-active vmware-vcops-${svc}
done

6.4 Shutdown and Startup Sequences

Correct shutdown and startup ordering is critical to avoid data corruption, split-brain scenarios, and prolonged recovery times. The required sequence varies depending on your deployment topology.

Non-HA (Single Node)

Shutdown:

# Stop all VCF Operations services
service vmware-stsd stop
service vmware-vcops stop

Startup:

# Start all VCF Operations services
service vmware-vcops start
service vmware-stsd start

Warning: During shutdown, stop vmware-stsd before vmware-vcops; during startup, start vmware-vcops before vmware-stsd, as shown above. Reversing this order can leave authentication tokens in an inconsistent state.

HA (High Availability — Primary + Replica)

Shutdown sequence (order matters):

  1. If data nodes exist, shut down all data nodes first (in any order among themselves).
  2. Shut down the replica node.
  3. Shut down the primary node last.
# On each data node (if applicable):
service vmware-vcops stop && shutdown -h now

# On the replica node:
service vmware-vcops stop && shutdown -h now

# On the primary node (last):
service vmware-vcops stop && shutdown -h now

Startup sequence (reverse order):

  1. Power on and start services on the primary node first.
  2. Wait until the primary node UI is accessible and all services report healthy.
  3. Power on and start services on the replica node.
  4. Power on and start services on all data nodes.
# On the primary node (first):
service vmware-vcops start

# Verify primary is healthy before proceeding:
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status

# On the replica node (second):
service vmware-vcops start

# On each data node (last):
service vmware-vcops start

Warning: Starting the replica or data nodes before the primary is fully online will cause CASA cluster formation failures. The primary node must be the first to come online and the last to go offline.

Continuous Availability (CA)

Continuous Availability deployments include a witness node in addition to primary and replica nodes. The witness participates in quorum decisions but does not store data.

Shutdown sequence:

  1. Shut down all data nodes.
  2. Shut down the witness node.
  3. Shut down the replica node.
  4. Shut down the primary node last.

Startup sequence:

  1. Start the primary node first.
  2. Start the replica node.
  3. Start the witness node.
  4. Start all data nodes.

Warning: In a CA deployment, losing both the witness and one of the primary/replica nodes simultaneously causes a loss of quorum. Never perform maintenance on the witness and a data-bearing node at the same time. Always verify quorum status via the admin UI or sliceConfiguration.sh --status before proceeding to the next node.


Chapter 7: vCenter Adapter Configuration

The VMware vSphere adapter is the foundational integration for VCF Operations. It collects performance metrics, configuration properties, change events, and relationship data from vCenter Server and all managed objects including ESXi hosts, virtual machines, datastores, clusters, distributed switches, and resource pools. This chapter provides a complete walkthrough of credential creation, adapter instance configuration, collection tuning, and health monitoring.

7.1 Create vCenter Credentials

Before creating an adapter instance, you must configure a credential that VCF Operations will use to authenticate against the target vCenter Server.

Step-by-step procedure:

  1. Log in to the VCF Operations UI as an administrator.
  2. Navigate to Administration → Integrations → Accounts.
  3. Click Add Account.
  4. Select vCenter as the account type.
  5. Complete the following fields:
  6. In the Credential section, click Add Credential (or select an existing one):
  7. Click Validate to test the credential before saving.
  8. Click Save.

Required vCenter Permissions:

The service account must be assigned a custom role at the vCenter root level with the following minimum privileges:

Privilege Category Specific Privilege Access Level
Global Licenses Read only
Global Settings Read only
Global Health Read only
Host Configuration (all sub-items) Read only
Host CIM → CIM Interaction Read only
Host Storage operations Read only
Virtual Machine Interaction → Console interaction Read only
Virtual Machine State → Create snapshot, Remove snapshot Read/Write
Virtual Machine Configuration (all sub-items) Read only
Datastore Browse datastore Read only
Datastore Low-level file operations Read only
Performance Modify intervals Read/Write
vSAN Cluster → ReadOnly Read only
Sessions Validate session Read only
Extension Register extension Read/Write (optional — only for remediation actions)
Alarm Acknowledge alarm, Set alarm status Read/Write (optional — only for alert sync)

Best Practice: Create a dedicated vSphere role named VCF-Operations-ReadOnly with these privileges. Assign it to the service account at the vCenter root object and select Propagate to children. This ensures the adapter can discover and monitor all objects in the inventory hierarchy.

7.2 Adapter Instance Configuration

With the credential in place, create the adapter instance that will perform data collection.

  1. Navigate to Administration → Integrations → Accounts.
  2. Click Add Account and select vCenter.
  3. Complete the following fields:
Field Description Example Value
Adapter Type Pre-selected as VMware vSphere. VMware vSphere
Display Name Unique name identifying this adapter instance in dashboards and alerts. vcsa-mgmt-01
Description Free-text description. Management domain vCenter
Credential Select the credential created in Section 7.1. svc-vrops-mgmt-01
vCenter Server FQDN of the target vCenter Server. Must match the credential's vCenter Server field. vcsa-mgmt-01.corp.local
Collector / Collector Group Select the collector node or group responsible for data collection. In multi-site deployments, choose a collector closest to the target vCenter. Default collector group
Auto Discovery When enabled, newly added hosts and VMs are automatically discovered and monitored. Enabled (recommended)
  4. Expand Advanced Settings to configure optional parameters:
Setting Default Description
COLLECT_VSAN_PERF_METRICS true Enables collection of vSAN performance counters from the vSAN Performance Service.
COLLECT_VSAN_ADVANCED_METRICS false Enables collection of extended vSAN metrics (DOM, LSOM, CMMDS). Increases load on vCenter.
PROCESS_CHANGE_EVENTS true Enables ingestion of vCenter events and tasks for change-driven analytics and audit trails.
DISABLE_COMM_WITH_VCENTER false Emergency toggle to stop all communication with vCenter without deleting the adapter. Useful during planned vCenter maintenance.
CONNECT_TIMEOUT 60000 Connection timeout in milliseconds for vCenter API calls. Increase for high-latency WAN connections.
ENABLE_DIFFMERGE true Enables differential collection (only changed properties are sent), reducing processing overhead.
COLLECTOR_INSTANCE_COUNT 1 Number of parallel collection threads. Increase for very large vCenter inventories (>5,000 VMs).
  5. Click Validate Connection to verify connectivity.
  6. Click Save to create the adapter instance.

7.3 Collection Intervals

VCF Operations collects different categories of data at different frequencies. These intervals can be modified per adapter instance, but the defaults are optimized for most environments.

Collection Type Default Interval Configurable Range Notes
Performance Metrics 5 minutes 1–60 minutes Aligns with vCenter's default real-time statistics interval (20-second samples aggregated to 5 minutes). Reducing below 5 minutes does not yield higher granularity from vCenter.
Configuration Properties 30 minutes 5–1440 minutes Collects object configuration attributes (CPU count, memory size, disk layout, network assignments).
Change Events 5 minutes 1–60 minutes Polls vCenter's EventManager for tasks and events since the last poll.
Inventory Discovery 6 hours 1–24 hours Full inventory traversal to discover new objects and remove stale ones.
vSAN Performance 5 minutes 5–60 minutes vSAN performance counters collected via the vSAN Performance Service API. Must be ≥5 minutes.
Relationship Mapping 30 minutes 5–1440 minutes Updates parent-child and peer relationships between objects.

Tip: In very large environments (>10,000 VMs), increasing the configuration collection interval to 60 minutes and inventory discovery to 12 hours significantly reduces API load on vCenter with minimal impact on monitoring fidelity.

7.4 Wait/Cancel Cycles and Data Maturation

After initial deployment, the adapter follows a well-defined lifecycle before full analytics capability is reached:

  1. Initial Discovery (0–30 minutes): The adapter performs a complete inventory traversal, creating resource objects for every discovered entity (hosts, VMs, clusters, datastores, etc.). The Object Count in the adapter status begins to populate.

  2. First Collection Cycle (5–10 minutes after discovery): Performance metrics and configuration properties are collected for the first time. Metrics begin appearing in dashboards, but values are raw with no baseline context.

  3. Statistics Build-Up (24–72 hours): The analytics engine begins calculating rolling averages, standard deviations, and trend lines. Capacity projections begin to appear, but with low confidence.

  4. Dynamic Thresholds (1–2 weeks): After accumulating approximately one to two weeks of continuous data, the analytics engine generates dynamic thresholds (DT). These adaptive baselines learn normal behavior patterns for each metric on each object, including daily and weekly seasonality. Alerts based on dynamic thresholds become meaningful only after this maturation period.

  5. Steady State (2+ weeks): Dynamic thresholds are fully established. Anomaly detection, predictive alerts, and capacity forecasts operate at full accuracy. The system continues to refine thresholds as it accumulates more historical data.

Important: Do not create custom alert definitions based on dynamic thresholds during the first two weeks. The immature thresholds will generate excessive false positives. Use static thresholds for immediate alerting needs during the burn-in period.

7.5 Test Connection and Initial Discovery

After saving the adapter instance, perform the following validation steps:

  1. Test Connection: On the adapter configuration page, click Test Connection. A successful test confirms:

  2. Monitor Discovery Progress:

  3. Verify Object Counts:

  4. Check Adapter Logs:

    tail -100 /storage/log/collector/collector.log | grep -i "VMware_adapter3"
    

    Look for Collection completed successfully messages and verify there are no authentication errors or timeout exceptions.

7.6 Monitoring Adapter Health

Ongoing adapter health monitoring ensures continuous data collection and early detection of integration failures.

Via REST API:

# Get adapter instance status
curl -sk -X GET \
  "https://<vrops-fqdn>/suite-api/api/adapters/{adapterId}" \
  -H "Authorization: vRealizeOpsToken <token>" \
  -H "Accept: application/json"

The response includes resourceStatusAndReason, where:

Via CLI:

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list

This outputs all configured adapter instances, their types, collection states, and associated collector nodes.

Via UI:

  1. Navigate to Administration → Integrations → Repository.
  2. Each adapter is displayed as a card with a color-coded status indicator.
  3. Click an adapter card to view:

Common adapter health issues and resolutions:

Symptom Likely Cause Resolution
Status is red; SSLHandshakeException in logs vCenter certificate changed or renewed Re-trust the vCenter certificate in VCF Operations: Administration → Certificates → Certificate Management
Status is red; InvalidLogin in logs Service account password expired or changed Update the credential in Administration → Integrations → Accounts
Status is yellow; collection duration exceeds interval Oversized vCenter inventory or resource contention Increase COLLECTOR_INSTANCE_COUNT, add a remote collector, or increase collection intervals
Object count is zero Insufficient vCenter permissions Verify the service account role assignment per Section 7.1
Status is green but metrics are stale Collector node clock drift Verify NTP synchronization on both the collector appliance and vCenter
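The clock-drift check in the last row can be scripted on the collector appliance. A minimal sketch using standard systemd tooling (assumes timedatectl is available, which it is on Photon OS):

```shell
# Check NTP synchronization state on the collector appliance.
# timedatectl is standard on systemd-based appliances such as Photon OS.
if timedatectl show -p NTPSynchronized 2>/dev/null | grep -q -- "=yes"; then
  echo "NTP: synchronized"
else
  echo "NTP: NOT synchronized -- investigate time-sync configuration"
fi
```

Run the same check on the vCenter side (or compare against an authoritative NTP source) to confirm both ends agree.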

Chapter 8: SDDC Manager Integration

SDDC Manager is the lifecycle management control plane for VMware Cloud Foundation. Integrating VCF Operations with SDDC Manager provides domain-level topology awareness, lifecycle status visibility, and operational context that enriches the analytics engine's understanding of the VCF stack.

8.1 Configuration

Step-by-step procedure:

  1. Log in to the VCF Operations UI as an administrator.
  2. Navigate to Administration → Integrations → Accounts.
  3. Click Add Account.
  4. Select SDDC Manager from the account type list.
  5. Complete the following fields:
  6. Click Test Connection to validate connectivity and credentials.
  7. Click Save.

Note: The SDDC Manager API uses port 443. Ensure that firewall rules allow HTTPS traffic from the VCF Operations collector node to the SDDC Manager appliance.

8.2 What the Adapter Collects

Once configured, the SDDC Manager adapter automatically discovers and monitors the following data:

8.3 SDDC Manager REST API Endpoints Used

The adapter communicates with the SDDC Manager via its published REST API. The following table lists the key endpoints queried during each collection cycle:

Endpoint Data Collected
GET /v1/system Overall system information: SDDC Manager version, system status, NTP configuration, DNS settings, and deployment type.
GET /v1/domains All workload domains including name, ID, type (management/VI), status, and associated cluster references.
GET /v1/clusters Cluster details within each domain: cluster name, host count, vSAN enabled status, stretch cluster configuration, and image profile.
GET /v1/hosts Host inventory: hardware model, ESXi version, commission status (ASSIGNED, UNASSIGNED_USEABLE, DECOMMISSIONED), and associated domain/cluster.
GET /v1/tasks Recent task history: upgrade workflows, host operations, certificate rotations, and their completion status (SUCCESSFUL, FAILED, IN_PROGRESS).
GET /v1/upgrades Available upgrade bundles and their applicability to each domain, including pre-check results and compatibility matrices.
GET /v1/certificates Certificate inventory: issuing CA, subject, expiration date, and associated component (vCenter, NSX, ESXi).
GET /v1/network-pools Network pool definitions, VLAN ranges, and IP address block utilization.
GET /v1/sddc-managers SDDC Manager cluster node information (in multi-instance deployments).

Tip: If the SDDC Manager adapter reports errors for specific endpoints, verify that the service account has sufficient privileges. The ADMIN role provides access to all endpoints; the OPERATOR role may restrict access to certain lifecycle operations.
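The endpoints in the table can also be exercised manually when troubleshooting the adapter. SDDC Manager issues bearer tokens from its /v1/tokens endpoint; a sketch with placeholder appliance and account names, guarded so it only runs when a password is exported:

```shell
# Query SDDC Manager's public API directly (host and account are placeholders).
SDDC_FQDN="sddc-manager-01.corp.local"
SDDC_USER="administrator@vsphere.local"

if [ -n "${SDDC_PASSWORD:-}" ]; then
  # 1. Acquire a bearer token.
  token=$(curl -sk -X POST "https://${SDDC_FQDN}/v1/tokens" \
    -H "Content-Type: application/json" \
    -d "{\"username\":\"${SDDC_USER}\",\"password\":\"${SDDC_PASSWORD}\"}" \
    | python3 -c 'import sys, json; print(json.load(sys.stdin)["accessToken"])')

  # 2. List workload domains -- the same call the adapter makes each cycle.
  curl -sk "https://${SDDC_FQDN}/v1/domains" \
    -H "Authorization: Bearer ${token}" \
    -H "Accept: application/json"
else
  echo "Export SDDC_PASSWORD to query the API."
fi
```

If a manual call to an endpoint fails with 403 while others succeed, the service account's role is the first thing to check, per the tip above.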


Chapter 9: NSX Integration

NSX provides the network virtualization and security layer in VMware Cloud Foundation. Integrating VCF Operations with NSX delivers visibility into logical networking constructs, transport infrastructure, distributed firewall activity, and load balancer performance — all correlated with the compute and storage metrics collected by the vSphere adapter.

9.1 Configuration

Step-by-step procedure:

  1. Log in to the VCF Operations UI as an administrator.
  2. Navigate to Administration → Integrations → Accounts.
  3. Click Add Account and select NSX-T from the account type list.
  4. Complete the following fields:
  5. Click Test Connection to verify connectivity and authentication.
  6. Click Save.

Warning: Always use the NSX Manager VIP (Virtual IP), not the FQDN of an individual NSX Manager node. The NSX Manager cluster operates as a three-node Raft consensus group. During maintenance, node upgrades, or node failures, individual manager nodes become temporarily unavailable. The VIP automatically directs traffic to a healthy node, ensuring uninterrupted data collection. Configuring the adapter with an individual node address will result in collection outages during any node maintenance event.
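Before saving the adapter, it is worth confirming that the VIP answers and that the manager cluster is stable. A sketch against the NSX API — the VIP and account are placeholders, and the call only fires when a password is exported:

```shell
# Check NSX Manager cluster stability through the VIP (placeholder names).
NSX_VIP="nsx-mgmt-vip.corp.local"
NSX_USER="admin"

if [ -n "${NSX_PASSWORD:-}" ]; then
  # Overall cluster status; look for STABLE in the management/control
  # cluster status fields before pointing the adapter at the VIP.
  curl -sk -u "${NSX_USER}:${NSX_PASSWORD}" \
    "https://${NSX_VIP}/api/v1/cluster/status" \
    -H "Accept: application/json"
else
  echo "Export NSX_PASSWORD to check cluster status."
fi
```

A response from the VIP also proves that the collector-to-NSX firewall path is open on 443.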

9.2 What the Adapter Collects

The NSX-T adapter collects a comprehensive set of networking and security data:

9.3 VCF Operations for Networks (Overview)

In addition to the built-in NSX adapter, VMware offers VCF Operations for Networks (formerly known as Aria Operations for Networks, or vRealize Network Insight) as a complementary product for deep network visibility. While the NSX adapter focuses on management-plane metrics, VCF Operations for Networks provides data-plane flow analysis.

Deployment model:

Key capabilities:

Note: VCF Operations for Networks is licensed separately from VCF Operations. In VCF 5.x environments, it is included with the VCF Operations Advanced and Enterprise editions.


Chapter 10: vSAN Integration

VMware vSAN is the hyper-converged storage platform embedded in VCF. VCF Operations provides native vSAN monitoring through the vSphere adapter, delivering capacity analytics, performance trending, health correlation, and policy compliance tracking without requiring a separate adapter installation.

10.1 Automatic via vSphere Adapter

vSAN monitoring is automatically activated when the vCenter adapter discovers one or more vSAN-enabled clusters. No additional adapter installation, configuration, or licensing is required for core vSAN metrics.

Prerequisites for automatic vSAN data collection:

Once these prerequisites are met, VCF Operations automatically creates resource objects for:

10.2 Advanced vSAN Configuration

For environments requiring deeper vSAN observability, additional collection parameters can be enabled in the vCenter adapter instance's advanced settings.

Navigate to Administration → Integrations → Accounts → select the vCenter adapter → Edit → expand Advanced Settings:

Setting Default Description
COLLECT_VSAN_PERF_METRICS true Collects vSAN performance counters (IOPS, throughput, latency) from the vSAN Performance Service API. Disabling this removes all vSAN performance data while retaining capacity and health metrics.
COLLECT_VSAN_ADVANCED_METRICS false Enables collection of extended vSAN metrics from the DOM (Distributed Object Manager), LSOM (Local Log-Structured Object Manager), and CMMDS (Cluster Monitoring Membership and Directory Services) layers. Provides deep diagnostic visibility but increases collection load on vCenter and the ESXi hosts.
VSAN_PERF_DIAG_MODE false Enables vSAN performance diagnostics mode, which collects additional latency breakdown metrics (e.g., guest-to-kernel, kernel-to-disk) for troubleshooting storage performance issues.

Warning: Enabling COLLECT_VSAN_ADVANCED_METRICS on clusters with more than 32 hosts or heavy I/O workloads can significantly increase vCenter API response times and VCF Operations collection duration. Enable this setting selectively and monitor the adapter collection duration (see Section 7.6) after activation.
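One lightweight way to watch collection duration after enabling the setting is to grep the collector log on the node running the adapter. The log path comes from Section 6.1, but the exact message format varies by release, so treat the grep pattern as an assumption to adapt:

```shell
# Surface recent collection-cycle messages for the vCenter adapter.
# The pattern below is a guess at the message wording and may need
# adjusting for your release.
LOG=/storage/log/collector/collector.log

if [ -r "$LOG" ]; then
  grep -iE "collection (completed|elapsed|duration)" "$LOG" | tail -20
else
  echo "$LOG not readable -- run this on the collector node as root"
fi
```

If the reported cycle time approaches the 5-minute collection interval after enabling the advanced metrics, revert the setting or scale out collection as described in Section 7.6.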

Additional vSAN Performance Service requirements:

10.3 Key vSAN Metrics Collected

VCF Operations collects hundreds of vSAN metrics. The following table summarizes the most operationally significant metric groups:

Metric Group Key Metrics Description
Capacity vsanDatastore|capacity_usedSpace, vsanDatastore|capacity_freeSpace, vsanDatastore|capacity_dedupRatio, vsanDatastore|capacity_compressionRatio, vsanDatastore|capacity_savingsRatio Overall vSAN datastore capacity utilization, deduplication effectiveness, compression ratios, and combined space savings. Used for capacity planning and trending.
Performance — IOPS vsanDatastore|performance_readIops, vsanDatastore|performance_writeIops, vsanDatastore|performance_totalIops Read, write, and total I/O operations per second at the cluster, host, and disk group levels.
Performance — Throughput vsanDatastore|performance_readThroughput, vsanDatastore|performance_writeThroughput Data throughput in KBps for read and write operations. Useful for identifying bandwidth bottlenecks.
Performance — Latency vsanDatastore|performance_readLatency, vsanDatastore|performance_writeLatency, vsanDatastore|performance_totalLatency Average latency in milliseconds for read, write, and combined operations. VCF Operations applies dynamic thresholds to these metrics after the burn-in period.
Resync vsanDatastore|resync_bytesRemaining, vsanDatastore|resync_objectsResyncing, vsanDatastore|resync_etr Bytes remaining to resynchronize after a host failure or maintenance event, count of objects actively resyncing, and estimated time to completion (ETR). Critical for monitoring recovery progress.
Health vsanDatastore|health_diskHealth, vsanDatastore|health_networkHealth, vsanDatastore|health_dataIntegrity, vsanDatastore|health_overallHealth Health check results for disk subsystem, vSAN network (VMkernel connectivity, multicast), data integrity (object checksum verification), and overall cluster health.
Policy Compliance vsanDatastore|policy_complianceStatus, vsanDatastore|policy_objectsByPolicy Reports whether all VM storage objects comply with their assigned vSAN storage policy (e.g., FTT=1, stripe width). Identifies VMs at risk due to policy violations.
Congestion vsanDatastore|performance_congestion vSAN congestion value (0–255). Values above 0 indicate back-pressure in the I/O path. Sustained values above 30 warrant investigation.
Disk Group vsanDiskGroup|iopsRead, vsanDiskGroup|iopsWrite, vsanDiskGroup|latencyRead, vsanDiskGroup|latencyWrite, vsanDiskGroup|cacheHitRate Per-disk-group performance counters including cache tier hit rate. Low cache hit rates may indicate a need for larger cache disks or workload redistribution.

10.4 vSAN Dashboards

VCF Operations ships with a comprehensive set of predefined vSAN dashboards that provide immediate operational visibility without custom configuration. These dashboards cover:

For a complete listing of all predefined vSAN dashboards, their widget configurations, and customization guidance, refer to Chapter 14: Dashboards — Built-in (Predefined).

Tip: Pin the vSAN Cluster Overview and vSAN Capacity Planning dashboards to your home page for daily operational monitoring. Configure email-based scheduled reports from the vSAN Capacity Planning dashboard to automatically distribute weekly capacity status to infrastructure leads.

Chapter 11: Policies

Policies in VCF Operations govern how the platform analyzes, alerts on, and reports capacity for your monitored objects. Every object in the inventory is subject to exactly one policy at any given time, and understanding how policies layer and override one another is essential for accurate monitoring at scale.

11.1 Default Policy vs Custom Policies

VCF Operations ships with a single Default Policy that is automatically applied to every monitored object in the inventory. This policy contains Broadcom's recommended thresholds, alert definitions, symptom definitions, and capacity settings for all supported object types. It cannot be deleted, and it serves as the fallback for any object not explicitly covered by a custom policy.

Custom policies allow administrators to override specific settings from the Default Policy for targeted groups of objects. A custom policy does not need to redefine every setting — it inherits any setting left unconfigured from the Default Policy and only overrides the values explicitly changed.

To manage policies, navigate to:

Configure → Policies

The Policies page displays all active policies in a table with columns for Name, Description, Priority, and the number of object groups assigned. From this page you can create new policies, edit existing ones, and change their priority.

Note: The Default Policy itself can be edited, but exercise caution — changes to the Default Policy affect every object that is not covered by a higher-priority custom policy.

11.2 Policy Priority and Inheritance

When multiple custom policies exist, VCF Operations uses a numeric priority system to determine which policy governs a given object. Each policy is assigned a priority number, and lower numbers indicate higher priority.

Policy resolution follows this logic:

  1. VCF Operations evaluates all custom policies in ascending priority order (lowest number first).
  2. For each policy, it checks whether the object belongs to any of the policy's assigned object groups.
  3. The first matching policy (lowest priority number) wins and governs the object.
  4. If no custom policy matches, the Default Policy applies.
Priority Policy Name Assigned Groups Matched Object Result
1 Critical Production Production-Tier1 VM in Production-Tier1 Governed by Critical Production
2 Standard Production Production-All VM in Production-All Governed by Standard Production
3 Development Dev-Test VM in Dev-Test Governed by Development
Default Policy (All objects) VM in no group Governed by Default Policy

If an object belongs to groups assigned to multiple policies, only the highest-priority policy (lowest number) applies. There is no merging of settings across policies — the winning policy's settings apply in full, with any unconfigured settings inherited from the Default Policy.
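The first-match resolution described above can be sketched as a short search in ascending priority order. This is an illustration only; the group memberships and policy names mirror the hypothetical table above, not a real inventory.

```python
# Minimal sketch of policy resolution: lowest priority number wins,
# and the Default Policy is the fallback for unmatched objects.
def resolve_policy(obj_groups, policies, default="Default Policy"):
    """policies: list of (priority, name, assigned_groups) tuples."""
    for _, name, groups in sorted(policies):  # ascending priority order
        if obj_groups & set(groups):          # object in any assigned group?
            return name                       # first match wins -- no merging
    return default

policies = [
    (1, "Critical Production", ["Production-Tier1"]),
    (2, "Standard Production", ["Production-All"]),
    (3, "Development", ["Dev-Test"]),
]

# A Tier-1 VM belongs to both production groups; priority 1 wins outright.
print(resolve_policy({"Production-Tier1", "Production-All"}, policies))
# A VM in no assigned group falls through to the Default Policy.
print(resolve_policy(set(), policies))
```

Note that the winning policy applies in full: there is no blending of settings between the two production policies even though the VM matches both groups.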

To change priority, navigate to Configure → Policies, select a policy, and click Edit Priority. Enter the desired numeric value and save.

11.3 Configurable Elements

Each policy exposes five major configuration areas. The following sections detail every configurable element, its UI navigation path, and key settings.

11.3.1 Workload Automation

Navigation: Configure → Policies → [Policy Name] → Edit → Workload Automation

Workload Automation enables DRS-like optimization recommendations (or automated actions, if configured) driven by VCF Operations analytics rather than vCenter DRS alone.

Setting Description Default
Enable Workload Automation Turns on optimization analysis for the policy scope Disabled
Automation Mode Manual (recommendations only), Semi-Automatic, or Fully Automatic Manual
Aggressiveness Conservative, Moderate, or Aggressive balancing Moderate
Excluded Object Types Object types to exclude from automation None

11.3.2 Capacity Settings

Navigation: Configure → Policies → [Policy Name] → Edit → Capacity

Capacity settings control how VCF Operations calculates remaining capacity and time-to-exhaustion.

Setting Description Default
Allocation Model / Demand Model Method for computing capacity (see Section 11.4) Allocation Model
Time Remaining Threshold (days) Alert fires when projected exhaustion is within this window 90 days
Capacity Remaining Threshold (%) Alert fires when remaining capacity drops below this value 20%
CPU Overcommit Ratio Virtual-to-physical CPU ratio ceiling 4:1
Memory Overcommit Ratio Virtual-to-physical memory ratio ceiling 1.25:1
Storage Overcommit Ratio Virtual-to-physical storage ratio ceiling 1:1
High Availability Buffer (%) Capacity reserved for HA failover Based on cluster HA settings
Maintenance Buffer (%) Capacity reserved for host maintenance 0%
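The buffer settings reduce the capacity that VCF Operations treats as usable before applying the Capacity Remaining threshold. The arithmetic can be sketched as follows; the cluster sizes and buffer percentages here are illustrative, not defaults pulled from a real environment.

```python
# Sketch of how HA and maintenance buffers shrink usable capacity.
total_ghz = 128.0        # raw cluster CPU capacity (hypothetical)
ha_buffer_pct = 25.0     # e.g., reserving 1 host of a 4-host cluster for HA
maint_buffer_pct = 0.0   # default maintenance buffer
demand_ghz = 60.0        # current CPU demand (hypothetical)

usable_ghz = total_ghz * (1 - (ha_buffer_pct + maint_buffer_pct) / 100)
remaining_pct = (usable_ghz - demand_ghz) / usable_ghz * 100

print(f"usable: {usable_ghz:.0f} GHz, remaining: {remaining_pct:.1f}%")
# 37.5% remaining is still above the default 20% Capacity Remaining
# threshold, so no capacity alert would fire with these settings.
```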

11.3.3 Attributes and Metrics

Navigation: Configure → Policies → [Policy Name] → Edit → Attributes/Metrics

This section allows enabling or disabling the collection of specific metric groups per object type. Disabling unused metric groups reduces storage consumption and processing overhead.

Categories include CPU, Memory, Disk, Network, Datastore, Virtual Disk, GPU, vSAN, and System metrics. Each category can be individually toggled.

11.3.4 Alerts and Symptoms

Navigation: Configure → Policies → [Policy Name] → Edit → Alerts/Symptoms

Administrators can enable or disable individual alert definitions and symptom definitions within the scope of the policy. This is useful for suppressing alerts that are not relevant to a particular workload tier — for example, disabling memory overcommit alerts for development clusters where overcommit is expected.

11.3.5 Compliance

Navigation: Configure → Policies → [Policy Name] → Edit → Compliance

Activate or deactivate compliance benchmarks on a per-policy basis. Available benchmarks include VMware Security Hardening Guide, CIS Benchmarks, DISA STIGs, and any custom benchmarks that have been imported.

11.4 Allocation vs Demand Model

The capacity model determines how VCF Operations calculates how much capacity a cluster or datastore has remaining.

Aspect Allocation Model Demand Model
Calculation Basis Provisioned (allocated) resources Actual measured utilization
Philosophy Conservative — assumes all provisioned resources may be consumed Optimistic — assumes current usage patterns continue
CPU Capacity Used Sum of all vCPUs allocated × overcommit ratio Peak or 95th-percentile CPU demand
Memory Capacity Used Sum of all configured VM memory Active + consumed memory demand
Example (8-core host) 10 VMs × 4 vCPU = 40 vCPU allocated → 40/32 = 125% used (at 4:1 ratio) Actual demand is 12 GHz of 64 GHz → 18.75% used
Best For Production environments with strict SLAs Development environments or well-understood workloads
Risk May show capacity exhaustion prematurely May underestimate future demand if workloads spike
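The 8-core host example from the table can be worked through both models side by side, showing how the same host reads as exhausted under allocation but nearly idle under demand:

```python
# Allocation model: provisioned vCPUs measured against the overcommit ceiling.
cores = 8
overcommit = 4                          # 4:1 CPU overcommit ratio
vms, vcpus_per_vm = 10, 4

ceiling_vcpus = cores * overcommit      # 32 usable vCPUs at 4:1
allocated = vms * vcpus_per_vm          # 40 vCPUs provisioned
alloc_used_pct = allocated / ceiling_vcpus * 100
print(f"allocation model: {alloc_used_pct:.0f}% used")   # over the ceiling

# Demand model: measured demand against raw capacity (the table's figures).
demand_ghz, capacity_ghz = 12.0, 64.0
demand_used_pct = demand_ghz / capacity_ghz * 100
print(f"demand model: {demand_used_pct:.2f}% used")      # ample headroom
```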

11.5 Best Practices


Chapter 12: Alerts and Symptoms

Alerts and symptoms form the proactive monitoring backbone of VCF Operations. Symptoms detect individual conditions; alerts correlate one or more symptoms into actionable notifications that drive operational response.

12.1 Understanding Alerts

Every alert in VCF Operations is classified along three dimensions: type, criticality, and control state.

Alert Types:

Type Badge Icon Purpose Example
Health Red/Orange/Yellow cross Indicates a current, active problem requiring immediate attention Host memory usage critical
Risk Red/Orange/Yellow diamond Predicts a future problem based on trend analysis Datastore will run out of space in 30 days
Efficiency Red/Orange/Yellow arrow Identifies optimization opportunities to reclaim waste VM is oversized — using 5% of allocated CPU

Badge Colors and Criticality Levels:

Color Criticality Description
Red Critical Immediate action required; service impact is occurring or imminent
Orange Immediate Urgent attention needed; potential for service impact
Yellow Warning Attention recommended; condition is outside normal bounds
Green Information / Clear Informational or no active alerts

Control States:

State Description
Open Alert is active and unacknowledged
Assigned An administrator has taken ownership
Suspended Alert is temporarily suppressed (with optional expiration)
Cancelled Alert has been manually dismissed by an administrator

When all triggering symptoms clear, the alert automatically transitions to a cancelled state. Manually cancelled alerts will not re-fire until the symptoms clear and then trigger again.

12.2 Alert Lifecycle

The alert lifecycle follows a deterministic sequence:

  1. Symptom Detection — A metric crosses a threshold, a log message matches a pattern, or a fault event arrives. The symptom condition evaluates to TRUE.
  2. Wait Cycles — If configured, the symptom must remain TRUE for the specified number of collection cycles (each cycle is typically 5 minutes) before it activates.
  3. Symptom Activation — The symptom transitions to an active state.
  4. Alert Evaluation — The alert definition checks whether its symptom combination logic (ALL or ANY) is satisfied.
  5. Alert Fires — The alert appears on the object's alert list with the configured criticality and type.
  6. Notification — If a notification rule matches the alert's attributes (type, criticality, object type), the configured notification method is triggered (email, webhook, SNMP trap, etc.).
  7. Admin Review — An administrator views the alert in Alerts → Triggered Alerts or receives the notification.
  8. Assignment — The administrator changes the control state to Assigned, taking ownership.
  9. Resolution — The administrator resolves the underlying issue.
  10. Auto-Cancellation — When all triggering symptoms clear, VCF Operations automatically cancels the alert. Alternatively, the administrator can manually cancel the alert.

Alerts can also be suspended for a configurable duration (e.g., during a maintenance window), after which they automatically resume evaluation.
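The wait-cycle behavior in step 2 of the lifecycle is effectively a debounce: the symptom condition must hold for N consecutive collection cycles before activating. A minimal sketch, with hypothetical per-cycle results:

```python
# Sketch of wait-cycle debouncing: a symptom activates only after its
# condition evaluates TRUE for N consecutive collection cycles.
def symptom_active(history, wait_cycles):
    """history: per-cycle condition results, oldest first."""
    return len(history) >= wait_cycles and all(history[-wait_cycles:])

# Condition flaps once, then holds for three cycles (~5 minutes each).
samples = [True, False, True, True, True]
print(symptom_active(samples, wait_cycles=3))        # activates
print(symptom_active([True, False, True], 3))        # still waiting
```

This is why transient metric spikes shorter than the configured wait window never surface as alerts.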

12.3 Creating Alert Definitions

To create a custom alert definition, follow these steps:

Step 1. Navigate to Configure → Alerts → Alert Definitions.

Step 2. Click the Add button in the toolbar.

Step 3. On the Name and Description tab, enter a Name for the alert and an optional Description explaining what the alert detects and how responders should react.

Step 4. On the Alert Impact tab, set the alert Type (Health, Risk, or Efficiency) and the Criticality (Critical, Immediate, or Warning).

Step 5. On the Add Symptom Definitions tab, add one or more symptom definitions to the alert (see Section 12.4 for the available symptom types).

Step 6. In the Configure Symptom Conditions section, specify whether ALL or ANY of the added symptoms must be active for the alert to fire.

Step 7. Click Save. The alert definition is now created but will only evaluate against objects governed by a policy where the alert is enabled.

12.4 Symptom Definitions

Symptoms are the atomic conditions that feed into alert definitions. VCF Operations supports five distinct symptom types.

12.4.1 Metric/Property Symptom

Triggers when a monitored metric or property meets a defined condition.

Static Threshold Configuration: select the metric, choose a comparison operator (>, <, >=, <=, ==, !=), enter a fixed threshold value, and assign a criticality.

Dynamic Threshold Configuration: select the metric, choose the deviation direction (Above or Below the learned baseline), and set the sensitivity level (Normal range, or 1, 2, or 3 standard deviations).

12.4.2 Message Event Symptom

Triggers when a log message matches a defined pattern. This symptom type requires Operations for Logs integration.

12.4.3 Fault Event Symptom

Triggers on fault events published by vCenter Server or other adapter sources.

12.4.4 Metric Event Symptom

Triggers on metric events published by external systems through the VCF Operations REST API.

12.4.5 Smart Early Warning

Predictive symptom that uses machine-learning trend analysis to forecast when a metric will cross a threshold.

12.5 Static vs Dynamic Thresholds

Aspect Static Threshold Dynamic Threshold
Definition Fixed numeric value set by the administrator Machine-learned baseline derived from historical patterns
Trigger Condition Fires when metric crosses the fixed value Fires when metric deviates from the learned normal pattern
Setup Effort Immediate — define value and save Requires 1–2 weeks of data collection for baseline
Adaptability Does not adapt; same value applies 24/7 Adapts to daily/weekly patterns (e.g., business hours vs off-hours)
False Positive Risk Higher — a single threshold cannot account for variable workloads Lower — learned baselines reflect actual usage patterns
Best For Hard limits (e.g., disk full > 95%), SLA thresholds Anomaly detection, workloads with variable patterns
Configuration Operator + fixed value Direction (Above/Below) + Sensitivity level (Normal, 1-3 sigma)

Dynamic Threshold Sensitivity Levels:

Level Interpretation Use When
Normal Range Any deviation outside the learned band You want maximum sensitivity to deviations
1 Standard Deviation Moderate deviation from normal General-purpose anomaly detection
2 Standard Deviations Significant deviation from normal Reducing noise while catching meaningful anomalies
3 Standard Deviations Extreme deviation from normal Only alerting on severe outliers
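The sigma-based sensitivity levels can be illustrated with a small sketch: a sample is flagged when it falls outside the baseline mean plus or minus k standard deviations. The baseline values are hypothetical, and real VCF Operations baselines also account for daily and weekly cycles, which this simplification omits.

```python
import statistics

# Simplified sigma-band check: deviation beyond k standard deviations
# from the learned baseline mean counts as an anomaly.
def deviates(sample, baseline, k=2):
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return abs(sample - mu) > k * sigma

baseline = [40, 42, 38, 41, 39, 40, 43, 37, 41, 40]  # hypothetical CPU % history
print(deviates(75, baseline, k=2))   # far outside the 2-sigma band
print(deviates(42, baseline, k=2))   # within the normal range
```

Raising k from 1 to 3 widens the band, which is exactly the noise-reduction trade-off the sensitivity table describes.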

12.6 Relationship Types for Symptom Evaluation

Alert definitions can include symptoms that evaluate conditions not only on the alerting object itself but also on related objects in the inventory hierarchy.

Relationship Description Example
Self Symptom evaluates on the object that will trigger the alert VM CPU Usage > 90% on the VM itself
Parent Symptom evaluates on the immediate parent object Host memory pressure on the host running the VM
Child Symptom evaluates on an immediate child object A VM on a host has high disk latency
Peer Symptom evaluates on an object at the same level sharing a parent Another VM on the same host is consuming excessive CPU
Ancestor Symptom evaluates on any object above in the hierarchy (parent, grandparent, etc.) Cluster-level capacity warning affecting a VM two levels down
Descendant Symptom evaluates on any object below in the hierarchy (child, grandchild, etc.) Any VM in a cluster experiencing memory contention

Relationship-based symptoms enable compound alerts that correlate conditions across infrastructure layers — for example, an alert that fires only when a VM has high CPU ready AND its parent host has high CPU utilization, confirming the contention is host-driven rather than guest-driven.

12.7 Alert Definition Best Practices

12.8 Notification Rules

Notification rules bridge alerts to human attention by defining what gets communicated, to whom, and through which channel.

Step 1. Navigate to Configure → Alerts → Notification Settings.

Step 2. Click Add to create a new notification rule.

Step 3. Enter a Name for the rule (e.g., "Critical Production Alerts to On-Call Team").

Step 4. Set Filter Criteria to control which alerts trigger this notification, filtering by alert type, criticality, object type, or specific alert definitions.

Step 5. Select the Notification Method — choose from the configured outbound plug-ins (see Section 12.9).

Step 6. Set Notification Frequency: choose whether to send a single notification when the alert fires or to resend at a defined interval until the alert is cancelled.

Step 7. Click Save. The notification rule takes effect immediately.

Tip: Create separate notification rules for different criticality levels. Route Critical alerts to PagerDuty or SMS-capable channels for immediate response, while routing Warning alerts to email or Slack for informational awareness.

12.9 Outbound Notification Plug-ins

Outbound plug-ins define the communication channels available for notification rules. Configure them at Administration → Outbound Settings → Add.

# Plug-in Type Key Configuration Fields Notes
1 Standard Email (SMTP) SMTP Host, Port (25/465/587), Secure Connection (TLS/SSL), From Address, Authentication (username/password) Most common. Supports HTML formatting. Test with Test button before saving.
2 Log File File path on the VCF Operations analytics node (e.g., /var/log/vmware/vcops/alerts.log) Useful for SIEM ingestion from local filesystem.
3 Network Share (CIFS/NFS) Share Path (e.g., \\server\share\alerts), Domain, Username, Password Writes alert data as files to a network share.
4 SNMP Trap Target Host (IP/FQDN), Port (default 162), Community String, SNMP Version (v1/v2c/v3), Security Level (v3: AuthPriv/AuthNoPriv/NoAuthNoPriv), Engine ID For integration with enterprise SNMP managers (e.g., HP OpenView, IBM Tivoli).
5 ServiceNow Instance URL (e.g., https://instance.service-now.com), Username, Password, REST Endpoint, Incident Table, Assignment Group Creates ServiceNow incidents automatically. Requires the VCF Operations ServiceNow app or direct REST configuration.
6 Slack Webhook URL (from Slack Incoming Webhooks app), Channel (override), Username (override) Posts formatted alert messages to a Slack channel.
7 Webhook (REST) URL, HTTP Method (POST/PUT/PATCH), Content Type (JSON/XML), Headers (key-value pairs), Body Template (with alert field placeholders), Authentication (None/Basic/Bearer Token/OAuth) Most flexible — integrates with any REST-capable system (PagerDuty, Teams, OpsGenie, custom APIs).

Configuration procedure for each plug-in type:

  1. Navigate to Administration → Outbound Settings.
  2. Click Add.
  3. Select the Plug-in Type from the dropdown.
  4. Enter an Instance Name (e.g., "Production SMTP Server").
  5. Fill in the required configuration fields (per the table above).
  6. Click Test to send a test notification and verify connectivity.
  7. Click Save.

Each plug-in type can have multiple instances configured (e.g., separate SMTP servers for different environments, multiple Slack channels). Notification rules reference specific plug-in instances when defining the delivery channel.
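For the Webhook (REST) plug-in, the Body Template maps alert fields into the payload your target system expects. The sketch below builds such a payload; the field names are illustrative, not the plug-in's exact placeholder schema.

```python
import json

# Illustrative webhook payload for an alert notification; the key names
# here are hypothetical and would follow your Body Template configuration.
alert = {
    "alertName": "Host memory usage critical",
    "alertType": "Health",
    "criticality": "critical",
    "objectName": "esxi-prod-01",
    "controlState": "Open",
}

payload = json.dumps(alert)

# A notification rule routing only Critical alerts to this channel
# effectively applies a filter like this before delivery:
print(alert["criticality"] == "critical")
```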


Chapter 13: Super Metrics

Super metrics extend the analytic capabilities of VCF Operations by enabling administrators to define custom calculated metrics that combine, aggregate, or transform multiple standard metrics into a single derived value. They fill gaps where the built-in metric catalog does not provide the exact calculation your organization needs.

13.1 What Are Super Metrics

A super metric is a user-defined formula that VCF Operations evaluates on every collection cycle, producing a new metric value that can be used in dashboards, views, reports, alert symptom definitions, and capacity calculations — just like any native metric.

Common use cases include cluster-wide aggregations of per-VM metrics, ratios such as vCPU-to-pCPU overcommit, filtered aggregates (for example, powered-on VMs only), and counts of objects exceeding a threshold.

To access super metrics, navigate to: Configure → Super Metrics.

13.2 Creating a Super Metric

Follow this ten-step procedure to create a super metric:

Step 1. Navigate to Configure → Super Metrics.

Step 2. Click the Add button in the toolbar.

Step 3. Enter a Name for the super metric (e.g., "Cluster - Avg VM CPU Usage (Powered-On Only)"). Enter an optional Description explaining the formula's purpose and intended consumers.

Step 4. Select the Object Type that this super metric will be associated with. The super metric will appear as a metric on objects of this type. For example, selecting "Cluster Compute Resource" means the super metric will be calculated and displayed for each cluster.

Step 5. Build the formula in the Formula Editor. The editor provides a text area where you type or construct the formula using metric references, operators, and functions.

Step 6. Use the Metric Picker (right panel) to browse or search the available metric catalog. Double-click a metric to insert its reference into the formula. The metric reference is inserted in the syntax ${this, metric=<metric_key>}.

Step 7. Apply looping functions to iterate over child objects. For example, wrap a metric reference in avg() to compute the average value of that metric across all child objects at a specified depth. See Section 13.3 for the complete list of looping functions.

Step 8. Click the Preview button to validate the formula syntax and see sample results. The preview evaluates the formula against a few sample objects and displays the computed values. Fix any syntax errors before proceeding.

Step 9. Assign the super metric to a policy. A super metric only collects data when it is activated in at least one policy. Navigate to the Policies tab within the super metric editor, or go to Configure → Policies → [Policy Name] → Edit → Attributes/Metrics and enable the super metric under the appropriate object type.

Step 10. Click Save. The super metric begins collecting data on the next collection cycle for all objects governed by the policy where it is activated.

Important: Super metrics do not retroactively calculate historical data. Data collection begins from the moment the super metric is activated in a policy.

13.3 Looping Functions

Looping functions iterate over child objects (or related objects at a specified depth) and aggregate a metric across them.

Function Description Syntax Example
avg() Calculates the arithmetic mean of a metric across child objects avg(${this, metric=cpu|usage_average, depth=1})
combine() Combines individual time series from child objects into a unified series combine(${this, metric=cpu|usage_average, depth=1})
count() Returns the number of child objects that report the specified metric count(${this, metric=cpu|usage_average, depth=1})
max() Returns the maximum value of the metric across all child objects max(${this, metric=cpu|usage_average, depth=1})
min() Returns the minimum value of the metric across all child objects min(${this, metric=cpu|usage_average, depth=1})
sum() Returns the sum of the metric values across all child objects sum(${this, metric=mem|consumed_average, depth=1})

The depth parameter controls how many levels down the hierarchy to traverse: depth=0 refers to the object itself, depth=1 to its immediate children, and depth=2 to grandchildren (for example, cluster → host → VM).

13.4 Single Functions

Single functions operate on individual numeric values within the formula.

Function Description
abs(x) Returns the absolute value of x
acos(x) Returns the arc cosine of x (in radians)
ceil(x) Returns the smallest integer greater than or equal to x
cos(x) Returns the cosine of x (x in radians)
exp(x) Returns Euler's number raised to the power of x
floor(x) Returns the largest integer less than or equal to x
log(x) Returns the natural logarithm (base e) of x
log10(x) Returns the base-10 logarithm of x
pow(x, y) Returns x raised to the power of y
round(x) Returns x rounded to the nearest integer
sqrt(x) Returns the square root of x
sin(x) Returns the sine of x (x in radians)
tan(x) Returns the tangent of x (x in radians)

13.5 Operators

Super metric formulas support the following operators:

Numeric Operators:

Operator Description Example
+ Addition metricA + metricB
- Subtraction metricA - metricB
* Multiplication metricA * 1024
/ Division metricA / metricB
% Modulo (remainder) metricA % 60

Comparison Operators:

Operator Description Example
> Greater than metricA > 90
< Less than metricA < 10
>= Greater than or equal to metricA >= 50
<= Less than or equal to metricA <= 100
== Equal to metricA == 0
!= Not equal to metricA != -1

Logical Operators:

Operator Description Example
&& Logical AND (metricA > 90) && (metricB > 80)
|| Logical OR (metricA > 95) || (metricB > 95)
! Logical NOT !(metricA == 0)

String Operators:

Operator Description Example
.contains() Checks if a string property contains a substring ${this, property=config|guestFullName}.contains("Windows")
.length() Returns the length of a string property ${this, property=config|name}.length()

13.6 Formula Syntax Deep Dive

The depth Parameter

The depth parameter specifies how many levels of the object hierarchy to traverse when using looping functions: depth=0 operates on the object's own metric, depth=1 iterates over immediate children, and depth=2 reaches two levels down (cluster → host → VM).

The where Clause

The where clause filters child objects by a property value before aggregation:

avg(${this, metric=cpu|usage_average, depth=1, where=Summary|Guest Operating System=.*Linux.*})

This calculates the average CPU usage only for child VMs whose guest OS name matches the regex .*Linux.*.

The where clause supports property matching with regular expressions (as in the example above), numeric comparisons against metric values (for example, cpu|readyPct > 2.5), and the isFresh() freshness check.

The isFresh() Function

isFresh() checks whether a metric has received data within the most recent collection cycle. It returns 1 if fresh data exists, 0 otherwise. This is useful for conditionally including only actively-reporting objects:

sum(${this, metric=mem|consumed_average, depth=1, where=isFresh(mem|consumed_average)})

Aliases (Variable Assignment)

Intermediate calculations can be assigned to aliases for readability:

alias cpuTotal = sum(${this, metric=cpu|usagemhz_average, depth=1})
alias cpuCapacity = ${this, metric=cpu|capacity_usagemhz}
cpuTotal / cpuCapacity * 100

Ternary Expressions

Use ternary syntax for conditional logic:

${this, metric=cpu|usage_average} > 80 ? 1 : 0

This returns 1 if CPU usage exceeds 80%, otherwise returns 0 — useful for creating "count of objects exceeding threshold" super metrics when combined with sum().

13.7 Use Cases and Examples

The following real-world examples demonstrate practical super metric formulas.

Example 1: Average VM CPU Usage Across a Cluster (Windows VMs Only)

Object Type: Cluster Compute Resource

avg(${this, metric=cpu|usage_average, depth=2, where=Summary|Guest Operating System=.*Windows.*})

This formula traverses two levels deep from the cluster (cluster → host → VM), filters to only Windows VMs, and calculates the average CPU usage across all matching VMs in the cluster.

Example 2: Total Memory Consumed by Powered-On VMs

Object Type: Cluster Compute Resource

sum(${this, metric=mem|consumed_average, depth=2, where=Summary|Runtime|PowerState=Powered On})

This formula sums the consumed memory metric across all VMs in the cluster that are currently powered on, giving an accurate picture of active memory demand.

Example 3: Count of VMs with CPU Ready Exceeding Threshold

Object Type: Host System

count(${this, metric=cpu|readyPct, depth=1, where=cpu|readyPct > 2.5})

This formula returns the number of VMs on a host where the CPU Ready percentage exceeds 2.5%, providing a single metric that indicates how many VMs on the host are experiencing CPU scheduling contention.

Example 4: Cluster CPU Overcommit Ratio

Object Type: Cluster Compute Resource

sum(${this, metric=cpu|num_vcpus_latest, depth=2}) / sum(${this, metric=cpu|corecount_provisioned, depth=0})

This formula divides the total number of vCPUs allocated across all VMs in the cluster (depth=2 to traverse through hosts to VMs) by the total physical core count of the cluster itself (depth=0 for the cluster's own metric), producing the vCPU-to-pCPU overcommit ratio.
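The depth-2 and depth-0 traversal in Example 4 can be emulated over a toy inventory to make the arithmetic concrete. The cluster layout below is hypothetical, and this plain-Python walk stands in for what the super metric engine does internally.

```python
# Toy emulation of Example 4: sum vCPUs at depth=2 (cluster -> host -> VM),
# divide by the cluster's own core count at depth=0.
cluster = {
    "corecount_provisioned": 32,
    "hosts": [
        {"vms": [{"num_vcpus": 4}, {"num_vcpus": 8}]},
        {"vms": [{"num_vcpus": 4}, {"num_vcpus": 16}]},
    ],
}

# depth=2: iterate hosts (children), then VMs (grandchildren).
total_vcpus = sum(vm["num_vcpus"]
                  for host in cluster["hosts"]
                  for vm in host["vms"])

# depth=0: the cluster's own provisioned core count.
overcommit = total_vcpus / cluster["corecount_provisioned"]
print(f"{overcommit:.2f}:1 vCPU-to-pCPU overcommit")
```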


Chapter 14: Dashboards — Built-in (Predefined)

VCF Operations ships with an extensive library of predefined dashboards that provide immediate visibility into the health, performance, capacity, and efficiency of your virtual infrastructure. These dashboards represent Broadcom's best-practice views and serve as both operational tools and templates for custom dashboard development.

14.1 Accessing Predefined Dashboards

To access predefined dashboards:

  1. Navigate to Visualize → Dashboards.
  2. The left navigation panel displays dashboard categories as expandable folders.
  3. Click a category to expand it and reveal the dashboards within.
  4. Click a dashboard name to load it in the main content area.

Predefined dashboards are read-only — they cannot be modified directly. To customize a predefined dashboard:

  1. Open the dashboard you wish to modify.
  2. Click the Actions menu (three dots or gear icon) in the top-right corner.
  3. Select Clone.
  4. Enter a name for the cloned copy.
  5. The cloned dashboard opens in edit mode, where all widgets and configurations can be freely modified.

Dashboards can be marked as Favorites (star icon) for quick access from the Favorites section of the left panel. The Home Dashboard can be set by navigating to Visualize → Dashboards → Actions → Set as Home Dashboard.

14.2 Complete List of Predefined Dashboards

Performance Category

Dashboard Name Purpose Key Widgets
VM Performance Identifies top CPU, memory, disk, and network consumers among virtual machines Top-N CPU Usage, Top-N Memory Usage, Top-N Disk Latency, Top-N Network Throughput, Metric Chart
Cluster Performance Displays cluster-level utilization trends for compute and storage Cluster CPU/Memory Utilization Heatmap, Utilization Trend Charts, DRS Balance Scoreboard
ESXi Host Performance Shows per-host utilization, contention, and hardware health Host CPU/Memory Utilization, Host Contention Metrics, NIC Throughput, HBA Throughput
Datastore Performance Monitors storage latency, IOPS, and throughput per datastore Datastore Latency Trend, IOPS Distribution, Throughput Top-N, Outstanding IO
Network Performance Tracks packet loss, throughput, errors, and dropped packets across network paths Packet Loss Heatmap, Throughput Trends, Error Rate Scoreboard, Dropped Packets Top-N
vSAN Performance Provides vSAN-specific IOPS, latency, throughput, and congestion metrics vSAN IOPS Trend, Backend Latency, Congestion Scoreboard, Disk Group Performance
VM Contention Surfaces per-VM contention indicators including CPU Ready, Co-Stop, and Memory Contention CPU Ready % Top-N, Co-Stop Top-N, Memory Contention % Top-N, Disk Latency Top-N
Cluster Contention Aggregates contention metrics at the cluster level for rapid triage Cluster CPU Contention Heatmap, Memory Pressure Trend, Cluster Disk Latency Summary

Capacity Category

Dashboard Name Purpose Key Widgets
Cluster Capacity Shows Time Remaining and Capacity Remaining per cluster with trend projections Capacity Remaining Scoreboard, Time Remaining Scoreboard, Capacity Trend Chart, What-If Scenario
Datastore Capacity Monitors storage utilization, provisioned vs used space, and forecast Datastore Usage Heatmap, Capacity Trend, Thin Provisioning Overcommit, Forecast Chart
ESXi Host Capacity Displays per-host capacity metrics including headroom for additional workloads Host CPU/Memory Remaining, VM Density, Headroom Scoreboard
VM Capacity Provides rightsizing recommendations for oversized and undersized VMs Oversized VMs List, Undersized VMs List, Reclaimable CPU/Memory Scoreboard, Idle VMs
vSAN Capacity Shows vSAN capacity utilization including deduplication and compression savings vSAN Used vs Free, Dedup/Compression Ratio, Slack Space, Capacity Trend

Cost Category

Dashboard Name Purpose Key Widgets
Cost Overview Provides total and monthly cost breakdown across the environment Total Cost Scoreboard, Monthly Trend Chart, Cost by Object Type, Cost by Datacenter
Optimization Quantifies potential cost savings from rightsizing and reclamation Reclaimable Cost Scoreboard, Powered-Off VM Cost, Idle VM Cost, Snapshot Cost
Showback Displays cost allocation by business unit, department, or custom grouping Cost by Department Chart, Cost by Application, Cost by Environment Tier
Chargeback Supports billing integration with per-consumer cost detail Chargeable Cost per Consumer, Rate Card Summary, Invoice Detail

Availability Category

Dashboard Name Purpose Key Widgets
Availability Overview Summarizes uptime, active alerts, and overall environment health Uptime Scoreboard, Alert Count by Severity, Health Badge Summary, Outage Timeline

Sustainability Category

Dashboard Name Purpose Key Widgets
Carbon Footprint Estimates carbon emissions based on compute power consumption and regional emission factors Total Carbon Emissions Scoreboard, Emissions Trend, Emissions by Cluster, PUE Factor
Green Scorecard Tracks energy efficiency metrics and sustainability KPIs Energy Efficiency Score, Power Consumption Trend, Idle Resource Waste, Green Improvement Recommendations

NSX Category

Dashboard Name Purpose Key Widgets
NSX-T Overview High-level summary of NSX-T environment health, alert count, and component status NSX Manager Health, Transport Node Status, Edge Cluster Status, Alert Summary
NSX Security Overview Security posture summary including firewall rule counts, policy compliance, and threat indicators DFW Rule Count, Security Policy Status, Applied Profiles, Threat Activity
NSX Logical Switching Monitors logical switch health, port utilization, and segment configuration Logical Switch List, Port Count Summary, Segment Health, VLAN/VXLAN Mapping
NSX Edge Performance Tracks NSX Edge node CPU, memory, throughput, and session count Edge CPU/Memory Utilization, Throughput per Edge, NAT Session Count, IPSec Tunnel Status
NSX Distributed Firewall Monitors DFW rule evaluation rates, connection counts, and CPU overhead on hosts DFW Rule Hit Count, Connection Rate, CPU Overhead Trend, Rule Table Size
NSX Load Balancer Displays load balancer pool health, session distribution, and throughput Pool Health Status, Active Sessions, Request Rate, Server Health Checks
NSX Network Topology Visual topology map showing the relationships between logical routers, switches, and edge nodes Interactive Topology Graph, Component Status Overlay, Alert Badge Overlay
NSX Troubleshooting Diagnostic dashboard for identifying NSX control/data plane issues Traceflow Results, Controller Cluster Health, Transport Zone Status, BFD Session Status

Other Categories

Dashboard Name Purpose Key Widgets
Application Monitoring Tracks application-level metrics from integrated APM sources Application Health Summary, Response Time Trend, Error Rate, Dependency Map
Workload Management Monitors Tanzu Kubernetes clusters and workload placement TKG Cluster Status, Pod Count, Namespace Utilization, Supervisor Cluster Health
Migration Planning Assesses VM migration readiness and provides cloud cost comparison Migration Readiness List, Cloud Cost Estimate, Dependency Analysis, Compatibility Check
Service Discovery Maps discovered application services and their infrastructure dependencies Service Map, Dependency Graph, Communication Flow, Infrastructure Mapping

14.3 KPI Thresholds

The following table provides industry-standard threshold guidance for key performance indicators. These values are used by many of the predefined dashboards and alert definitions.

KPI Good (Green) Warning (Yellow) Critical (Red) Notes
CPU Ready % < 2.5% 2.5% – 5.0% > 5.0% Measured on a per-vCPU basis. Values above 5% indicate the VM is waiting for physical CPU scheduling and will experience application-visible latency.
CPU Co-Stop % < 2.0% 2.0% – 4.0% > 4.0% Relevant for SMP (multi-vCPU) VMs. Indicates vCPUs being halted to synchronize scheduling. Reduce vCPU count if consistently high.
Memory Contention % < 1.0% 1.0% – 3.0% > 3.0% Includes ballooning, swapping, and compression. Values above 3% indicate the host is under memory pressure and VMs are experiencing degraded performance.
Disk Latency (ms) < 10 10 – 20 > 20 Combined read + write latency at the virtual disk (VMDK) level. Values above 20 ms are perceptible to most applications.
Disk Command Aborts 0 1 – 5 > 5 Per collection interval (5 minutes). Any aborted commands indicate storage path issues and warrant investigation.
Network TX Drops 0 1 – 100 > 100 Transmitted packet drops per interval. Indicates transmit queue saturation, typically caused by network bandwidth exhaustion or vSwitch misconfiguration.
Packet Loss % 0% 0% – 0.1% > 0.1% End-to-end packet loss. Even 0.1% loss is significant for latency-sensitive applications (VoIP, RDP, database replication).
vSAN Latency (ms) < 5 5 – 10 > 10 vSAN backend (device-level) latency. Frontend (VM-visible) latency may be higher. Values above 10 ms indicate disk group saturation or network congestion.
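As a concrete illustration of how these bands are applied, the classification logic can be sketched in Python. The band boundaries below are copied from the table; the metric keys and function name are illustrative, not product identifiers:

```python
# Sketch: classify a KPI sample against the green/yellow/red bands above.
# (warning_floor, critical_floor): below warning_floor is green,
# between the two is yellow, above critical_floor is red.
KPI_BANDS = {
    "cpu_ready_pct": (2.5, 5.0),
    "cpu_costop_pct": (2.0, 4.0),
    "mem_contention_pct": (1.0, 3.0),
    "disk_latency_ms": (10, 20),
}

def classify(kpi: str, value: float) -> str:
    """Return 'green', 'yellow', or 'red' for a KPI sample."""
    warn, crit = KPI_BANDS[kpi]
    if value > crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"
```

For example, a VM reporting 3% CPU Ready lands in the warning band, while 25 ms of disk latency is critical.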

14.4 Dashboard Time Settings

All dashboards support configurable time ranges and refresh intervals that control the data window displayed by widgets.

Time Range Options:

Setting Duration Best For
Last Hour 1 hour Real-time troubleshooting, active incident investigation
Last 6 Hours 6 hours Default view — covers a typical shift or business window
Last 24 Hours 24 hours Daily review, identifying overnight patterns
Last 7 Days 7 days Weekly trend analysis, capacity planning reviews
Last 30 Days 30 days Monthly reporting, long-term trend identification
Custom User-defined start and end Post-incident analysis, compliance audits, specific maintenance windows

The time range selector is located in the top-right toolbar of every dashboard. Changing the time range affects all time-aware widgets on the dashboard simultaneously.

Auto-Refresh Intervals:

Setting Behavior
Off Dashboard displays static data from the last load; manual refresh required
5 Minutes Dashboard automatically refreshes every 5 minutes (aligns with default collection interval)
10 Minutes Dashboard automatically refreshes every 10 minutes
15 Minutes Dashboard automatically refreshes every 15 minutes

The auto-refresh toggle is located next to the time range selector. For dashboards displayed on NOC wall screens, set auto-refresh to 5 minutes to maintain near-real-time visibility.

Note: Setting aggressive auto-refresh intervals on dashboards with many widgets or large object scopes may increase load on the VCF Operations analytics cluster. For environments with more than 10,000 objects, consider using 10- or 15-minute refresh intervals for complex dashboards.


Chapter 15: Dashboards — Custom Creation

While the predefined dashboards cover a broad range of operational scenarios, custom dashboards enable you to build views tailored to your organization's specific monitoring requirements, operational workflows, and reporting needs.

15.1 Creating a Dashboard

Follow these steps to create a new custom dashboard:

Step 1. Navigate to Visualize → Dashboards.

Step 2. Click the Create button in the top toolbar (or use the + icon).

Step 3. Enter a Dashboard Name (e.g., "Production Cluster Health — Tier 1").

Step 4. Optionally select a Dashboard Template from the dropdown. Templates provide pre-arranged widget layouts that you can populate with your own data sources. Available templates include Blank Canvas, Two-Column, Three-Column, Executive Summary, and Troubleshooting.

Step 5. Set the Default Time Range for the dashboard (e.g., Last 6 Hours). Individual widgets can override this if needed.

Step 6. Click Save. The empty dashboard canvas appears in edit mode, ready for widgets to be added.

To add widgets:

  1. Click the Add Widget button (or drag from the widget panel on the left).
  2. Select the desired widget type from the catalog (see Section 15.2).
  3. Configure the widget (see Section 15.3).
  4. Position and resize the widget on the canvas by dragging.
  5. Repeat for additional widgets.
  6. Click Save when the layout is complete.

15.2 Complete Widget Catalog

VCF Operations provides a comprehensive widget catalog organized by functional category.

Data Visualization Widgets

Widget Name Description
Metric Chart Time-series visualization supporting line, area, and stacked area chart types. Displays one or more metrics for one or more objects over the selected time range. Supports trend lines, dynamic thresholds overlay, and data table toggle.
Scoreboard Displays a single KPI value with configurable color-coded status bands (green/yellow/orange/red). Ideal for executive-level dashboards showing current state at a glance. Supports sparkline overlay and multi-metric mode.
Heatmap Color-coded grid where each cell represents an object, colored by a selected metric value, and optionally sized by a second metric. Enables rapid visual identification of outliers across large object populations.
Top-N Horizontal or vertical bar chart ranking objects by a selected metric. Configurable for top or bottom N values. Useful for identifying the highest consumers or worst performers.
Topology Graph Interactive relationship map showing objects and their connections. Displays health badges, metric overlays, and alert status on each node. Supports configurable relationship depth.
Distribution Chart Histogram or pie chart showing the distribution of objects across value ranges for a selected metric. Useful for understanding workload profiles and identifying clusters of similar behavior.
Sparkline Compact, minimal trend line designed for embedding in dense dashboards. Shows directional trend without axis labels or detailed data points.

Object List Widgets

Widget Name Description
Object List Filterable, sortable table of inventory objects with configurable columns. Supports inline metric values, health badges, and property display. Can serve as a provider widget to drive other widgets on the dashboard.
Object Relationship Hierarchical navigation widget showing parent, child, and peer relationships for a selected object. Enables drill-down through the inventory tree.
Alert List Filtered table of active alerts with columns for severity, alert name, object name, time triggered, and control state. Supports filtering by alert type, criticality, object type, and time range.
Symptom List Filtered table of active symptoms with details on the triggering condition, current value, and threshold.
Property List Displays configuration properties and attributes for a selected object (CPU count, memory size, guest OS, tools version, etc.).

Utility Widgets

Widget Name Description
Text Widget Displays static text content. Supports HTML and Markdown formatting for embedding instructions, notes, team contact information, or operational procedures directly in the dashboard.
Image Widget Embeds a static image (PNG, JPG, SVG) in the dashboard. Used for logos, architecture diagrams, or visual context. Images can be uploaded or referenced by URL.
Rolling View Automatically cycles through a configured list of dashboards at a set interval. Designed for NOC wall displays that need to rotate between multiple views.
Container Widget Groups multiple widgets into a tabbed container, conserving dashboard real estate. Each tab contains a separate widget, and users click tabs to switch between them.
Navigation Widget Displays clickable links or buttons that navigate to other dashboards, external URLs, or specific objects in the inventory. Used for building multi-level dashboard hierarchies.
Geo Map Plots objects on a geographic map based on configured location coordinates. Each marker shows health status and can be clicked for detail. Useful for multi-site or distributed infrastructure monitoring.

15.3 Widget Configuration Deep Dive

Scoreboard Widget

The Scoreboard widget is the most commonly used widget for executive dashboards and NOC displays.

Configuration steps:

  1. Click Add Widget → Scoreboard.
  2. In the Data tab, click Add Metric.
  3. Select the Object Type and browse or search for the desired metric.
  4. Select specific object(s) or use an object group to scope the data.
  5. In the Thresholds tab, configure color bands:
  6. In the Display tab, choose the display mode:
  7. Configure Label (custom display name), Unit (override the default unit), and Decimal Places (0–4).
  8. Click Save.

Heatmap Widget

The Heatmap widget provides instant visual identification of outliers across hundreds or thousands of objects.

Configuration steps:

  1. Click Add Widget → Heatmap.
  2. In the Data tab, select the Object Type (e.g., Virtual Machine).
  3. Set Group By to organize cells by a parent attribute (e.g., Cluster, Host, Datacenter). Objects are visually grouped under their parent's label.
  4. Set Color By to the metric that determines cell color (e.g., CPU Usage %). Configure the color gradient with minimum (green) and maximum (red) values.
  5. Set Size By to the metric that determines cell size (e.g., Configured Memory MB). Larger cells represent objects with more of the sized metric.
  6. In the Thresholds tab, define the color bands:
  7. Optionally filter the object scope using an Object Group or tag-based filter.
  8. Click Save.
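The Color By gradient in step 4 amounts to a linear interpolation between green (the configured minimum) and red (the configured maximum), clamped at both ends. A minimal sketch, assuming a hex-color output purely for illustration:

```python
def heat_color(value: float, vmin: float, vmax: float) -> str:
    """Interpolate linearly from green (at vmin) to red (at vmax), clamped."""
    t = max(0.0, min(1.0, (value - vmin) / (vmax - vmin)))
    r = round(255 * t)        # red channel grows with the metric
    g = round(255 * (1 - t))  # green channel fades as the metric grows
    return f"#{r:02x}{g:02x}00"
```

A VM at the configured minimum renders pure green (`#00ff00`), one at or beyond the maximum renders pure red (`#ff0000`), and everything in between shades proportionally.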

Metric Chart Widget

The Metric Chart widget is the primary tool for time-series analysis and trend investigation.

Configuration steps:

  1. Click Add Widget → Metric Chart.
  2. In the Data tab, add one or more object-metric combinations:
  3. In the Chart Options tab:
  4. Configure the Time Range override if the widget should use a different range than the dashboard default.
  5. Click Save.

Top-N Widget

The Top-N widget ranks objects by a selected metric to quickly surface the highest or lowest performers.

Configuration steps:

  1. Click Add Widget → Top-N.
  2. In the Data tab, select the Metric to rank by (e.g., Memory Usage %).
  3. Set the N Value — the number of objects to display: 5, 10, 20, or 50.
  4. Set Sort Order: Highest (top consumers) or Lowest (least utilized).
  5. Set Scope: All objects of the selected type, or filter to a specific Object Group or parent object.
  6. In the Display tab:
  7. Click Save.
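Conceptually, the Top-N ranking reduces to a sort-and-slice over the scoped objects. A sketch using hypothetical VM records:

```python
def top_n(objects, metric, n=10, highest=True):
    """Rank objects by a metric value and keep the first n."""
    ranked = sorted(objects, key=lambda o: o[metric], reverse=highest)
    return ranked[:n]

# Hypothetical inventory records for illustration.
vms = [
    {"name": "vm-a", "mem_usage_pct": 91.0},
    {"name": "vm-b", "mem_usage_pct": 42.5},
    {"name": "vm-c", "mem_usage_pct": 77.3},
]
worst = top_n(vms, "mem_usage_pct", n=2)                  # highest consumers
least = top_n(vms, "mem_usage_pct", n=2, highest=False)   # least utilized
```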

Topology Graph Widget

The Topology Graph widget visualizes the relationships between infrastructure objects as an interactive network diagram.

Configuration steps:

  1. Click Add Widget → Topology Graph.
  2. In the Data tab, select the Root Object — the starting point for the topology visualization (e.g., a specific cluster or datacenter).
  3. Set Relationship Depth (1–5) — how many levels of parent/child/peer relationships to display from the root object.
  4. In the Display tab:
  5. Configure which Relationship Types to include: Parent, Child, Peer, or All.
  6. Click Save.

15.4 Widget Interactions

Widget interactions enable a powerful provider/receiver paradigm where selecting an object in one widget automatically updates the data displayed in other widgets on the same dashboard. This creates interactive, drill-down capable dashboards.

Key concepts:

Provider widget: a widget whose object selection drives the content of other widgets. Object List and Top-N widgets are the most common providers.
Receiver widget: a widget that refreshes its displayed data whenever its provider's selection changes. Metric Chart, Alert List, and Property List widgets are the most common receivers.

Configuring widget interactions:

  1. Open the dashboard in Edit Mode (click the pencil icon or Edit button).
  2. Click the gear icon on the provider widget to open its configuration.
  3. Navigate to the Widget Interactions tab (also labeled Output in some widget types).
  4. In the Receiving Widgets section, check the boxes next to the widgets that should receive selections from this provider.
  5. Click Save on the widget configuration.
  6. Repeat for additional provider widgets as needed.
  7. Save the dashboard.

Performance considerations:

Each selection in a provider widget causes every configured receiver to re-query the analytics engine. Limit the number of receivers driven by a single provider and keep receiver widget scopes narrow. On large inventories (10,000+ objects), interaction-heavy dashboards can add noticeable load to the VCF Operations analytics cluster.

Example interaction configuration:

A common pattern is the "list-and-detail" layout:

Widget Role Purpose
Object List (Virtual Machines) Provider Displays a filterable list of VMs. User clicks a row to select a VM.
Metric Chart (CPU) Receiver Shows CPU usage trend for the selected VM.
Metric Chart (Memory) Receiver Shows memory usage trend for the selected VM.
Alert List Receiver Shows active alerts for the selected VM.
Property List Receiver Shows configuration properties of the selected VM.

When the operator clicks a VM in the Object List, all four receiver widgets update simultaneously to show data for that specific VM, creating a cohesive investigation experience.
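The provider/receiver flow is essentially an observer pattern: the provider pushes the selected object to every registered receiver, and each receiver then re-scopes its own query. A conceptual sketch, not the product's internal implementation:

```python
class ReceiverWidget:
    """A widget that re-scopes its data to whatever object it receives."""
    def __init__(self, name):
        self.name = name
        self.focus = None

    def update(self, obj):
        self.focus = obj  # in the product, this triggers a fresh data query

class ProviderWidget:
    """Pushes the operator's selection to every registered receiver."""
    def __init__(self):
        self.receivers = []

    def register(self, widget):
        self.receivers.append(widget)

    def select(self, obj):
        for w in self.receivers:
            w.update(obj)

# The "list-and-detail" layout from the table above:
vm_list = ProviderWidget()
cpu_chart = ReceiverWidget("Metric Chart (CPU)")
alert_list = ReceiverWidget("Alert List")
vm_list.register(cpu_chart)
vm_list.register(alert_list)
vm_list.select("web-vm-01")  # both receivers now focus on web-vm-01
```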

15.5 Dashboard Navigation

Dashboard navigation enables you to link multiple dashboards together, creating hierarchical drill-down paths that guide operators from high-level overviews to detailed investigation views.

Method 1: Navigation Widget

The Navigation Widget provides explicit, clickable links to other dashboards or external URLs.

  1. Add a Navigation Widget to the dashboard.
  2. In the configuration panel, add one or more Links:
  3. Configure the Display Style: Button, Text Link, or Icon.
  4. Position the Navigation Widget at the top or side of the dashboard for visibility.

Method 2: Object Click Actions

Configure what happens when a user clicks an object in a widget:

  1. Open the dashboard in Edit Mode.
  2. Click the gear icon on the widget where click navigation should be enabled.
  3. In the Output section (or Interactions tab), find the On Click action setting.
  4. Select Navigate to Dashboard and choose the target dashboard.
  5. Optionally enable Pass Object Context — the clicked object is automatically set as the focus in the target dashboard.
  6. Save the widget and dashboard.

Method 3: Dashboard Linking via URL Parameters

Dashboards can be directly linked using URL parameters that pre-select objects and time ranges:

Best practices for dashboard navigation:

Chapter 16: Best Practice Dashboard Designs

Dashboards are the primary interface through which operators, engineers, and executives consume data from VCF Operations. A poorly designed dashboard buries critical information; a well-designed dashboard surfaces the right data to the right audience at the right time. This chapter provides six ready-to-implement dashboard blueprints and a set of universal design principles.

16.1 Daily Operations Check Dashboard

This dashboard is the first screen an operations engineer should open each morning. It answers one question: "Is anything broken or about to break?"

Row 1 — Scoreboards (4 widgets, equal width)

Widget Type Metric / Property Color Coding
Overall Cluster Health Scoreboard Worst badge color across all clusters Green / Yellow / Orange / Red
Total Critical Alerts Scoreboard Count of alerts where Criticality = Critical Red if > 0, Green if 0
Total Warning Alerts Scoreboard Count of alerts where Criticality = Warning Yellow if > 5, Green if ≤ 5
VM Count / Host Count Scoreboard Total VMs (powered on) and total ESXi hosts Informational — no threshold

Configuration Tip: Set the Scoreboard refresh interval to 5 minutes. Use the "Sparkline" option to show a 24-hour mini-trend directly inside the scoreboard tile.

Row 2 — Top-N Performance Offenders (3 widgets, equal width)

Widget Type Object Type Metric Sort Count
Top-N CPU Ready VMs Top-N Virtual Machine cpu|readyPct Descending 10
Top-N Memory Contention VMs Top-N Virtual Machine mem|contention_average Descending 10
Top-N Disk Latency VMs Top-N Virtual Machine virtualDisk|totalLatency Descending 10

Row 3 — Trends and Heatmaps (2 widgets, 60/40 split)

Widget Type Configuration
Cluster Capacity Heatmap Heatmap Object: Cluster Compute Resource; Color by: cpu|capacityRemaining_percentage; Size by: summary|total_number_vms
Alert Trend (7-day) Metric Chart Scope: all clusters; Metric: count of alerts by day; Mode: stacked bar by criticality

Note: Why a 7-day alert trend? A 7-day window reveals patterns tied to weekly batch jobs, backup windows, or recurring misconfigurations. A single day's snapshot hides these cycles.


16.2 Capacity Planning Dashboard

This dashboard is reviewed weekly by capacity and infrastructure teams. It answers: "When will we run out of resources, and what can we reclaim?"

Row 1 — Scoreboards

Widget Metric Threshold
Clusters at Risk Count of clusters where Time Remaining < 90 days Red if > 0
Total Reclaimable vCPU Sum of reclaimable CPU across all VMs (from rightsizing engine) Informational
Total Reclaimable Memory (GB) Sum of reclaimable RAM Informational
Average Cluster Utilization % Avg of cpu|demandPct across clusters Yellow > 70%, Red > 85%

Row 2 — Bar Charts (2 widgets, equal width)

Widget Type Details
Cluster Capacity Time Remaining Top-N (horizontal bar) Metric: capacityRemainingUsingConsumers_timeRemaining; Sort: Ascending (worst first); Top 10
Datastore Capacity Remaining Top-N (horizontal bar) Metric: diskspace|capacityRemaining_percentage; Sort: Ascending; Top 10

Row 3 — Lists and Actions (2 widgets, equal width)

Widget Type Details
VM Rightsizing Candidates Object List Filter: oversized = true; Columns: VM Name, Provisioned vCPU, Recommended vCPU, Provisioned RAM, Recommended RAM
What-If Scenario Launcher Text Widget Hyperlink to Optimize → What-If Analysis with instructions

Capacity Threshold Recommendations:

Resource Conservative Moderate Aggressive
CPU Demand % 60% 70% 80%
Memory Demand % 70% 80% 90%
Datastore Used % 70% 80% 85%
Time Remaining (days) 180 90 60

16.3 Performance Monitoring Dashboard

This dashboard is used during active troubleshooting or continuous performance reviews. It answers: "How are my workloads performing right now and over time?"

Row 1 — Scoreboards (3 widgets)

Widget Metric Threshold
Average CPU Usage % Avg cpu|usage_average across all clusters Yellow > 70%, Red > 85%
Average Memory Usage % Avg mem|usage_average across all clusters Yellow > 75%, Red > 90%
Average Disk Latency (ms) Avg virtualDisk|totalLatency across all VMs Yellow > 15 ms, Red > 25 ms

Row 2 — Metric Charts (2 widgets, equal width)

Widget Type Configuration
Cluster CPU/Memory Trend (30-day) Metric Chart (line) Scope: select clusters; Metrics: cpu|demandPct, mem|demandPct; Date range: Last 30 Days; Show dynamic thresholds
vSAN Latency Trend Metric Chart (line) Scope: vSAN clusters; Metrics: vSAN|readLatency, vSAN|writeLatency; Date range: Last 30 Days

Row 3 — Heatmap and Top-N (2 widgets, 60/40 split)

Widget Type Configuration
All VMs by CPU Ready % Heatmap Object: Virtual Machine; Group by: Parent Cluster; Color by: cpu|readyPct; Size by: config|hardware|numCpu
Top-N Network Drops Top-N Object: Host System; Metric: net|droppedPct; Sort: Descending; Count: 10

16.4 Cost Analysis Dashboard

This dashboard serves finance teams and infrastructure managers tracking cloud and on-premises spending. It answers: "Where is the money going, and where can we save?"

Row 1 — Scoreboards (3 widgets)

Widget Metric Notes
Total Monthly Cost costop|totalCost Requires cost drivers to be configured under Optimize → Cost Drivers
Cost per VM costop|costPerVM Derived from total cost ÷ powered-on VM count
Cost Trend Metric Chart (sparkline) 6-month trend of totalCost

Row 2 — Distribution and Savings (2 widgets)

Widget Type Configuration
Cost by Department Distribution (pie chart) Group by: Custom Property "Department"; Metric: costop|totalCost
Optimization Savings Potential Scoreboard Metric: sum of potential savings from rightsizing + reclamation recommendations

Row 3 — Actionable Lists (2 widgets)

Widget Type Configuration
Idle / Powered-Off VM List Object List Filter: powerState = poweredOff OR idleVM = true; Columns: VM Name, Power State, Days Since Last I/O, Monthly Cost
Snapshot Age Violations Object List Filter: snapshot|age > 72 hours; Columns: VM Name, Snapshot Name, Age (hours), Size (GB)

16.5 Compliance Dashboard

This dashboard is essential for security and audit teams. It answers: "Are we compliant, and where have we drifted?"

Row 1 — Scoreboards (2 widgets)

Widget Metric Threshold
Overall Compliance Score Percentage of objects passing all benchmark tests Green ≥ 95%, Yellow ≥ 80%, Red < 80%
Non-Compliant Objects Count Count of objects with at least one failure Red if > 0

Row 2 — Compliance by Benchmark (3 widgets)

Widget Type Configuration
DISA STIG Compliance Scoreboard + bar Pass/Fail count for DISA STIG benchmark rules
CIS Benchmark Compliance Scoreboard + bar Pass/Fail count for CIS benchmark rules
PCI-DSS Compliance Scoreboard + bar Pass/Fail count for PCI-DSS benchmark rules

Row 3 — Drift and Changes (2 widgets)

Widget Type Configuration
Drift Detection Alerts Alert List Filter: Alert Type = Compliance, Sub-type = Drift; Sort by: time (newest first)
Configuration Change Timeline Metric Chart (event overlay) Show configuration change events overlaid on compliance score trend

16.6 Executive Summary Dashboard

This dashboard is designed for C-level and director-level audiences. It prioritizes clarity over detail and should be presentable on a projector or shared screen without explanation.

Design Principles for Executive Dashboards:

Row 1 — Environment Scorecard (3 large scoreboards)

Widget Label Source
Health "Infrastructure Health" Worst health badge across all clusters
Risk "Risk Score" Highest risk badge across all clusters
Efficiency "Resource Efficiency" Average efficiency badge across all clusters

Row 2 — 30-Day Trends (2 widgets)

Widget Type Configuration
30-Day Alert Trend Metric Chart (area) Stacked area by criticality (Critical, Warning, Info); Date range: 30 days
Capacity Runway Summary Scoreboard list Show Time Remaining (days) for each cluster, color-coded

Row 3 — Cost and Sustainability (2 widgets)

Widget Type Configuration
Cost Summary Scoreboard Total monthly cost with month-over-month delta percentage
Sustainability Metrics Scoreboard Power consumption (kWh), Carbon estimate (if available via management pack)

16.7 Dashboard Design Best Practices

  1. Limit widget count. Keep dashboards to 15–20 widgets maximum. Each additional widget increases render time and cognitive load. If you need more, create a second dashboard and link them.

  2. Use widget interactions for drill-down. Configure widget interactions so that clicking an object in a Top-N chart drives the selection in a Metric Chart or Object List widget on the same dashboard. This eliminates the need to duplicate data.

  3. Group related metrics logically. Place CPU metrics adjacent to CPU-related alerts. Place capacity widgets together. The user's eye should flow naturally from overview to detail, left to right, top to bottom.

  4. Use consistent time ranges. If one chart shows 30 days, all charts on that dashboard should show 30 days unless there is a specific analytical reason to differ. Inconsistent ranges confuse viewers.

  5. Place critical KPIs in the top-left quadrant. Eye-tracking studies confirm that users scan dashboards starting from the top-left. Place the most urgent or important information there.

  6. Use Text Widgets for section headers. A simple text widget with a bold label like "Performance Indicators" or "Capacity Metrics" helps organize the dashboard visually and aids comprehension.

  7. Clone predefined dashboards as starting points. VCF Operations ships with dozens of out-of-the-box dashboards. Clone one that is close to your goal, then modify it. This saves time and ensures you start with proven widget configurations.

  8. Test with real data at scale. A dashboard that loads quickly in a lab with 50 VMs may be unusably slow in production with 10,000 VMs. Test with production scope before publishing.

  9. Set appropriate default scopes. Avoid dashboards scoped to "All Objects" when a narrower scope (specific cluster, resource pool, or custom group) would be more relevant.

  10. Document your dashboards. Add a Text Widget at the top of each dashboard with a one-sentence purpose statement and the intended audience. This prevents dashboard sprawl and confusion.


Chapter 17: Views and Reports

Views and Reports are the primary mechanism for extracting structured, repeatable, and shareable data from VCF Operations. While dashboards are interactive and real-time, reports are static snapshots designed for distribution, archival, and audit compliance.

17.1 View Types

Views are the building blocks of reports. Each view type presents data in a specific visual format optimized for a particular analytical need.

View Type Description Best Use Case Output Format
List View Tabular list of objects with selected metrics and properties displayed as columns Inventory reports, VM configuration audits, host hardware lists Table
Trend View Time-series line or area graph plotting one or more metrics over a defined date range Performance analysis, capacity trending, SLA compliance over time Line/Area Chart
Distribution View Pie chart or histogram showing how a metric's values are distributed across objects Resource allocation analysis, workload distribution, cost breakdown by department Pie/Histogram
Image View Custom uploaded image (PNG, JPG, SVG) with data overlays positioned at specific coordinates Network topology diagrams, data center floor plans with live metrics, rack diagrams Annotated Image
Summary View Aggregated statistics (average, minimum, maximum, sum, count) for selected metrics across a group of objects Executive summaries, SLA reports, aggregate capacity statements Summary Table

Note: Image Views require you to upload a base image first, then map data points to specific pixel coordinates on the image. This is most commonly used for physical data center visualizations.

17.2 Creating Views

Follow these steps to create a custom view.

Step 1. Navigate to Visualize → Views in the left navigation menu.

Step 2. Click the Create button (plus icon) in the toolbar.

Step 3. In the Presentation section, enter:

Step 4. Select the View Type from the dropdown: List, Trend, Distribution, Image, or Summary.

Step 5. In the Subjects section, select the Object Type that this view will report on. Common selections include:

Step 6. Switch to the Data tab. Here you select the metrics and properties to display:

Step 7. Switch to the Filter tab (optional). Apply conditions to limit which objects appear in the view. Filters use property or metric-based conditions such as:

Multiple filter conditions can be combined with AND/OR logic.

Step 8. Click Preview to verify the output displays the expected data with the correct format and filtering.

Step 9. Click Save. The view is now available for use in dashboards or report templates.

Tip: When creating List Views, limit the number of columns to 10–12 for readability. If you need more data points, create a second view rather than cramming everything into one table.
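The AND/OR filter logic from Step 7 can be sketched as a set of predicates combined under a single logic mode. The property names below are hypothetical:

```python
def matches(obj, conditions, logic="AND"):
    """Evaluate filter conditions against an object; combine with AND/OR."""
    results = [cond(obj) for cond in conditions]
    return all(results) if logic == "AND" else any(results)

# Hypothetical filter: powered-on VMs with at least 4 vCPUs.
conditions = [
    lambda vm: vm["power_state"] == "poweredOn",
    lambda vm: vm["cpu_count"] >= 4,
]
vm = {"name": "db-01", "power_state": "poweredOn", "cpu_count": 8}
matches(vm, conditions, "AND")  # True: both conditions hold
```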

17.3 Creating Report Templates

Report templates combine one or more views into a formatted document suitable for distribution. Follow this procedure.

Step 1. Navigate to Visualize → Reports in the left navigation menu.

Step 2. Click Create Template in the toolbar.

Step 3. Enter the Report Name (e.g., "Weekly Infrastructure Health Report") and an optional Description.

Step 4. In the report canvas, add views by dragging them from the left panel into the report body. You can include multiple views of different types. Arrange them in the desired order — each view will render as a separate section in the final report.

Step 5. Optionally configure presentation elements:

Step 6. Click Save. The template is now available for on-demand generation or scheduled execution.

Important: Report templates are separate from the data they display. A template defines the structure; the data is populated at generation time based on the scope you select.

17.4 Generating and Scheduling Reports

On-Demand Generation:

  1. Navigate to Visualize → Reports.
  2. Select the desired report template.
  3. Click Run (play icon).
  4. In the dialog, select the Scope — choose one or more objects or groups that the report will cover (e.g., a specific cluster, a custom group of VMs, or "All Objects").
  5. Click Generate.
  6. The report enters a processing queue. Depending on complexity and scope, generation may take seconds to several minutes.
  7. Once complete, the report appears in the Generated Reports tab, available for download.

Scheduled Generation:

  1. Select the report template and click Schedule (calendar icon).
  2. Configure the schedule parameters:
Parameter Options Recommendation
Frequency Daily, Weekly, Monthly Weekly for operational reports, Monthly for executive reports
Day of Week Monday–Sunday (for weekly) Monday morning for "last week" review
Time of Day HH:MM (24-hour format) 06:00 — before the operations team arrives
Scope Object, Group, or Tag-based Use Custom Groups for consistent scoping
  3. Configure Email Delivery:
  4. Click Save Schedule.

Warning: Scheduled reports consume analytics engine resources during generation. Avoid scheduling more than 10 reports at the same time window. Stagger schedules by 15–30 minutes.
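A staggering plan of the kind the warning recommends can be generated programmatically. The 06:00 start and 15-minute step mirror the guidance above; the report names are placeholders:

```python
from datetime import datetime, timedelta

def staggered_schedule(report_names, start="06:00", step_minutes=15):
    """Assign each report a start time offset to avoid concurrent generation."""
    t = datetime.strptime(start, "%H:%M")
    plan = {}
    for name in report_names:
        plan[name] = t.strftime("%H:%M")
        t += timedelta(minutes=step_minutes)
    return plan

staggered_schedule(["Health", "Capacity", "Cost"])
# {'Health': '06:00', 'Capacity': '06:15', 'Cost': '06:30'}
```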

17.5 Export Formats

Format Content Use Case Limitations
PDF Fully formatted report with charts, tables, headers, footers, and cover page Distribution to stakeholders, audit documentation, archival Charts are rendered as static images; no interactivity
CSV Raw tabular data export; one CSV file per List or Summary view in the report Spreadsheet analysis, data import into third-party tools, custom charting No charts or formatting; Trend and Distribution views export as data tables

Both formats are available from the Generated Reports tab. Click the download icon next to a completed report and select the desired format.

Tip: For automated downstream processing, use the Suite API endpoint POST /suite-api/api/reports/{reportId}/download with the format query parameter set to pdf or csv. This enables integration with ticketing systems, SharePoint libraries, or custom portals.
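A minimal download client built around the quoted endpoint might look like the following sketch. The appliance hostname and the authorization header scheme are assumptions; verify the exact token format and HTTP method against the Suite API documentation for your release:

```python
import urllib.request

BASE = "https://vcf-ops.example.com"  # hypothetical appliance FQDN

def report_download_url(report_id: str, fmt: str = "pdf") -> str:
    """Build the Suite API download URL quoted in the tip above."""
    return f"{BASE}/suite-api/api/reports/{report_id}/download?format={fmt}"

def download_report(report_id: str, token: str, fmt: str = "pdf",
                    out_path: str = "report.pdf") -> None:
    """Fetch a generated report and write it to disk.

    The 'OpsToken' header scheme is an assumption for illustration;
    some releases document this download as GET rather than POST.
    """
    req = urllib.request.Request(
        report_download_url(report_id, fmt),
        method="POST",  # as quoted in the tip; confirm for your release
        headers={"Authorization": f"OpsToken {token}",
                 "Accept": "application/octet-stream"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp, \
            open(out_path, "wb") as f:
        f.write(resp.read())
```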


Chapter 18: Capacity Planning and Optimization

Capacity planning in VCF Operations moves beyond simple threshold monitoring into predictive analytics. The platform's capacity engine continuously analyzes historical consumption patterns, applies multiple forecasting algorithms, and produces actionable recommendations for rightsizing, reclamation, and future procurement.

18.1 Capacity Engine Overview

The capacity engine evaluates every cluster, datastore, and resource pool across three dimensions.

Metric Definition Where to Find Action Trigger
Time Remaining Projected number of days until a resource (CPU, Memory, Disk) reaches its usable capacity limit Optimize → Capacity → select cluster < 90 days: plan procurement or migration
Capacity Remaining (%) Percentage of total usable capacity that is still available after accounting for HA reserves, buffers, and current demand Optimize → Capacity → select cluster < 20%: immediate attention required
Recommended Size The optimal allocation of vCPU, memory, or disk for a given VM based on actual usage patterns Optimize → Rightsizing → select VM Delta > 25% from current: rightsizing candidate

The capacity engine runs on a continuous cycle, recalculating projections every collection interval (default: 5 minutes for real-time, daily for long-term forecasts).

Important: Capacity calculations honor the policy settings applied to each object. If your policy sets a CPU utilization cap of 70% (meaning 70% is considered "full"), Time Remaining reflects when demand will reach 70%, not 100%.
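
A quick worked example of that policy interaction, with made-up numbers: a cluster has 100 units of usable capacity, 55 in use, growing 0.5 per day. Under a 70% cap the limit is 70, so Time Remaining is (70 - 55) / 0.5 = 30 days; without the cap it would be 90 days. The same arithmetic as a sketch:

```shell
# Time Remaining under a policy utilization cap -- all inputs are illustrative.
time_remaining_days() {
  # $1 usable capacity, $2 current demand (same unit), $3 growth per day, $4 cap %
  awk -v u="$1" -v d="$2" -v g="$3" -v c="$4" 'BEGIN {
    limit = u * c / 100              # the policy cap defines "full"
    if (d >= limit || g <= 0) print 0
    else print int((limit - d) / g)
  }'
}
time_remaining_days 100 55 0.5 70    # 70% cap  -> 30
time_remaining_days 100 55 0.5 100   # no cap   -> 90
```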

18.2 Forecasting Algorithms

VCF Operations does not rely on a single forecasting model. Instead, it runs multiple algorithms in parallel and selects the best fit for each metric on each object.

Algorithm How It Works Best Suited For Weakness
Change-Point Detection Identifies sudden, sustained shifts in the data (step changes) and adjusts the baseline accordingly Environments with frequent application deployments or workload migrations May overreact to one-time events when history is limited
Linear Regression Fits a straight line through historical data points and projects the trend forward Steady, predictable growth patterns (e.g., datastores growing at a constant rate) Cannot model cyclical or seasonal patterns
Cyclical Analysis Detects repeating patterns on daily, weekly, or monthly cycles and factors them into the projection Workloads with known cycles — month-end batch processing, weekly reporting jobs Requires 2+ full cycles of history to detect patterns
Exponential Smoothing Applies exponentially decreasing weights to older data, giving recent observations more influence Environments where recent behavior is more indicative of future behavior than distant history Can be thrown off by recent anomalies

The analytics engine scores each algorithm's fit against actual historical data using a mean-absolute-percentage-error (MAPE) calculation. The algorithm with the lowest MAPE for a given metric is selected for that metric's forecast.
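
The scoring step can be illustrated with a small sketch: compute MAPE for two hypothetical forecast series against the same actuals and keep the lower scorer. The series here are invented for illustration; the platform's internal scoring is not exposed in this form.

```shell
# MAPE scoring sketch -- both forecast series are made-up, not platform data.
mape() {
  # $1 = space-separated actuals, $2 = space-separated forecasts (same length)
  echo "$1 $2" | awk '{
    n = NF / 2
    for (i = 1; i <= n; i++) {
      a = $i; f = $(i + n)
      e = a - f; if (e < 0) e = -e
      sum += e / a                 # absolute percentage error per point
    }
    printf "%.2f\n", sum / n * 100
  }'
}
actuals="100 110 120 130"
mape "$actuals" "102 108 125 128"   # e.g., a linear-regression fit:   2.38
mape "$actuals" "90 120 110 145"    # e.g., a smoothing fit:           9.74
```

The first series has the lower MAPE, so its algorithm would be selected for that metric's forecast.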

Tip: To see which algorithm was selected for a specific metric, navigate to the cluster's Capacity tab and hover over the forecast line. The tooltip displays the algorithm name and confidence interval.

18.3 Peak Classification

Not all spikes in resource consumption are equal. The capacity engine classifies peaks to prevent false alarms and ensure accurate forecasting.

Peak Type Duration Impact on Capacity Calculation Example
Momentary Less than 5 minutes Ignored — treated as noise CPU spike during VM snapshot creation, brief network burst
Sustained 5 minutes to 4 hours Included in analysis with standard weight Application batch job, database index rebuild, backup window
Periodic Recurring at regular intervals Weighted appropriately based on recurrence frequency End-of-month financial close processing, weekly ETL jobs, nightly backups

Peak classification thresholds can be adjusted in the active policy under Configure → Policies → Edit Policy → Capacity and Allocation → Peak Classification.

18.4 Rightsizing

Navigate to: Optimize → Rightsizing

Rightsizing identifies VMs whose allocated resources are significantly mismatched to their actual consumption patterns.

Oversized VM Detection Criteria:

Resource Condition Default Threshold
CPU Provisioned vCPUs exceed peak demand by a factor of 2 or more Provisioned vCPU > 2x 95th-percentile CPU demand
Memory Provisioned RAM exceeds peak demand by a factor of 1.5 or more Provisioned RAM > 1.5x 95th-percentile active memory

Undersized VM Detection Criteria:

Resource Condition Default Threshold
CPU CPU Ready percentage consistently elevated cpu|readyPct > 2.5% over 7-day average
Memory Memory ballooning or swapping is active mem|balloonPct > 0% or mem|swapused_average > 0
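
The default oversized thresholds in the tables above reduce to simple comparisons. A sketch with invented inputs (the 95th-percentile figures are hypothetical):

```shell
# The default oversized thresholds as plain comparisons -- inputs are illustrative.
flag_cpu() {
  # $1 = provisioned vCPU, $2 = 95th-percentile CPU demand (vCPU-equivalents)
  awk -v p="$1" -v d="$2" 'BEGIN { if (p > 2 * d) print "oversized"; else print "ok" }'
}
flag_mem() {
  # $1 = provisioned RAM (GB), $2 = 95th-percentile active memory (GB)
  awk -v p="$1" -v d="$2" 'BEGIN { if (p > 1.5 * d) print "oversized"; else print "ok" }'
}
flag_cpu 8 3.2     # 8 > 2 x 3.2    -> oversized
flag_mem 32 24     # 32 <= 1.5 x 24 -> ok
```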

Rightsizing Report Columns:

Column Description
VM Name Virtual machine display name
Current vCPU Currently provisioned vCPU count
Recommended vCPU Analytics-recommended vCPU count
Current Memory (GB) Currently provisioned RAM
Recommended Memory (GB) Analytics-recommended RAM
Potential Savings Estimated cost reduction if rightsized (requires cost drivers)

Taking Action on Rightsizing Recommendations:

  1. Generate Change Request — exports a formatted change request document for ITSM integration
  2. Export CSV — downloads all recommendations for offline review
  3. Apply via Automation — if VCF Automation (Aria Automation) is integrated, trigger a rightsizing workflow directly

Warning: Always validate rightsizing recommendations against application-level requirements. A VM may appear oversized from an infrastructure perspective but require the allocated resources for licensing compliance (e.g., Oracle per-core licensing) or application-mandated minimums.

18.5 Reclaimable Resources

Navigate to: Optimize → Reclaim

The reclamation engine identifies waste — resources that are allocated but delivering no value.

Category Detection Criteria Default Threshold Typical Savings
Powered-Off VMs VM in poweredOff state for an extended period Idle > 30 days Full VM cost recovery
Orphaned VMDKs VMDK files on datastores not attached to any registered VM Any orphaned VMDK Storage reclamation
Old Snapshots VM snapshots exceeding age threshold Age > 72 hours (3 days) Storage reclamation; performance improvement
Idle VMs Powered-on VMs with negligible CPU, memory, network, and disk I/O CPU < 100 MHz, Network < 1 KBps, Disk I/O < 1 IOPS for 7+ days Full VM cost recovery

Best Practice: Schedule a weekly reclamation review meeting. Export the reclamation report and distribute it to application owners with a 14-day response window. VMs and VMDKs not claimed within the window are candidates for decommissioning.

18.6 Workload Optimization

Navigate to: Optimize → Workload Optimization

Workload Optimization provides DRS-like placement recommendations, but operates at the VCF Operations level rather than within a single vCenter. This enables cross-cluster and even cross-vCenter balancing recommendations.

Considerations evaluated by the engine:

Output: The engine generates a prioritized list of migration recommendations. Each recommendation includes:

Field Description
VM Name The virtual machine to migrate
Source Host / Cluster Current placement
Destination Host / Cluster Recommended placement
Improvement Projected reduction in contention or improvement in balance score
Risk Assessment of migration risk (Low / Medium / High)

Note: Workload Optimization recommendations are advisory. VCF Operations does not execute migrations autonomously unless integrated with an automation platform and explicitly configured to do so.

18.7 What-If Analysis

Navigate to: Optimize → What-If Analysis

What-If Analysis lets you model hypothetical changes to your environment and see projected capacity impacts before committing resources or budget.

Scenario Types:

Scenario Type Question It Answers Required Inputs
Add Workload "What if I deploy 50 new VMs?" VM profile (vCPU, RAM, Disk per VM), quantity, target cluster
Remove Workload "What if I decommission this cluster's VMs?" Select VMs or clusters to remove
Add Infrastructure "What if I add 3 hosts to this cluster?" Host profile (CPU cores, RAM, local storage), quantity, target cluster
Change Allocation "What if I change the overcommit ratio?" New CPU or memory overcommit ratio, target cluster

Step-by-Step Procedure (applicable to all scenario types):

Step 1. Click Create Scenario and provide a scenario name (e.g., "Q3 ERP Migration Impact").

Step 2. Select the scenario type from the four options above.

Step 3. Enter parameters specific to the scenario type. For "Add Workload," define the VM profile:

Step 4. Select the target cluster(s) where the workload will be placed or infrastructure will be added.

Step 5. Click Run Analysis. The engine calculates the impact using the same forecasting algorithms described in Section 18.2.

Step 6. Review the results:

Result Field Description
Time Remaining (Before) Projected days before the scenario
Time Remaining (After) Projected days after applying the scenario
Capacity Remaining % (Before/After) Side-by-side capacity comparison
Risk Level Change Whether the cluster moves from Green to Yellow/Red
Alerts Generated Any new capacity alerts that would trigger

Step 7. Save the scenario for future reference or discard it. Saved scenarios can be revisited, modified, and re-run as conditions change.

Tip: Combine scenario types for complex planning. First run "Add Workload" to see the impact of a new project, then run "Add Infrastructure" to determine how many hosts are needed to absorb it. Compare the two scenarios side by side.
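
The before/after framing of an "Add Workload" run can be sketched as back-of-envelope arithmetic. The figures below are illustrative, and the real engine layers the Section 18.2 forecasts on top of this simple subtraction:

```shell
# Back-of-envelope "Add Workload" framing -- numbers are illustrative only.
usable_gb=512          # cluster usable memory
demand_gb=300          # current demand
vm_count=50
per_vm_gb=4            # active memory per new VM

after_gb=$((demand_gb + vm_count * per_vm_gb))
before_pct=$(awk -v u="$usable_gb" -v d="$demand_gb" 'BEGIN{printf "%.1f", (u-d)/u*100}')
after_pct=$(awk -v u="$usable_gb" -v d="$after_gb" 'BEGIN{printf "%.1f", (u-d)/u*100}')
echo "Capacity Remaining before: ${before_pct}%"   # 41.4%
echo "Capacity Remaining after:  ${after_pct}%"    # 2.3%
```

A drop like this would move the cluster's risk level from Green to Red in the scenario results.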


Chapter 19: Management Packs

Management Packs extend VCF Operations beyond vSphere, enabling unified monitoring across heterogeneous infrastructure, cloud platforms, applications, and hardware.

19.1 What Are Management Packs

A Management Pack is a pluggable adapter module that teaches VCF Operations how to collect, interpret, and act on data from a specific technology. Each management pack is a self-contained package that includes:

Component Purpose
Adapter Code The collection engine that connects to the target system via API, SNMP, WMI, SSH, or other protocol
Object Model Defines the object types (e.g., "AWS EC2 Instance," "NetApp Volume") and their relationships
Metric Definitions The specific metrics to collect, their units, and collection intervals
Dashboards Pre-built dashboards tailored to the monitored technology
Alert Definitions Symptoms and alert rules specific to the technology
Views and Reports Pre-built views and report templates

Management packs are distributed as PAK files (Platform Archive Kit) — a signed archive format used by the VCF Operations platform for all extensions and updates.

19.2 Installation Steps

Step 1. Obtain the management pack PAK file. Sources include:

Step 2. In VCF Operations, navigate to Administration → Integrations → Repository.

Step 3. Click Add (or Upload PAK File, depending on UI version).

Step 4. Browse to the downloaded PAK file and click Upload. The system validates the file signature and compatibility.

Step 5. Review and accept the End User License Agreement (EULA).

Step 6. Monitor the installation progress bar. Installation typically takes 2–5 minutes. The cluster will distribute the adapter code to all nodes automatically.

Step 7. After installation completes, configure the adapter instance:

  1. Navigate to Administration → Integrations → Accounts
  2. Click Add Account
  3. Select the newly installed adapter type from the dropdown
  4. Provide connection details:
  5. Click Validate Connection to test connectivity
  6. Click Save

Warning: After installing a management pack, allow 2–3 collection cycles (typically 10–15 minutes) before expecting data to appear in dashboards. The initial collection cycle populates the object inventory; subsequent cycles populate metrics.

19.3 Official Management Packs (Broadcom)

The following table lists the management packs available from Broadcom, including those built into VCF Operations and those available as separate downloads.

# Management Pack Version Monitored Technology Key Metrics Built-In
1 VMware vSphere 8.18.2 vCenter, ESXi Hosts, VMs, Resource Pools CPU, Memory, Disk, Network for all vSphere objects Yes
2 VMware NSX-T 8.18.2 NSX Manager, Transport Nodes, Logical Switches, DFW Transport node health, DFW rule hit counts, tunnel status Yes
3 VMware SDDC Manager 8.18.2 SDDC Manager, Workload Domains, VCF Lifecycle Domain health, lifecycle operation status Yes
4 VMware vSAN 8.18.2 vSAN Clusters, Disk Groups, Capacity Devices Resync status, cache hit ratio, congestion, latency Yes
5 VMware Cloud Director 5.x VCD Cells, Organizations, vApps, Org VDCs Cell health, Org resource consumption No
6 VMware Horizon 4.x Connection Servers, Desktop Pools, Sessions Session latency, pool utilization, protocol performance No
7 VMware Tanzu 2.x TKG Clusters, Supervisor Namespaces, Pods, Nodes Pod restart count, node resource usage, cluster health No
8 VCF Automation 4.x Blueprints, Deployments, Catalog Items Deployment success rate, provisioning time No
9 AWS 4.x EC2, S3, RDS, Lambda, ELB, CloudWatch Instance utilization, S3 bucket size, RDS connections No
10 Azure 4.x VMs, Storage Accounts, SQL Database, App Services VM performance, storage transactions, DTU usage No
11 Google Cloud 2.x GCE Instances, GCS Buckets, BigQuery, Cloud SQL Instance CPU, bucket object count, query slot utilization No
12 Dell EMC Varies PowerStore, PowerScale, Unity, VMAX/PowerMax Array latency, capacity, IOPS, throughput No
13 NetApp ONTAP 3.x Clusters, SVMs, Volumes, Aggregates, LUNs Volume latency, aggregate capacity, snapshot reserve No
14 Pure Storage 2.x FlashArray, FlashBlade, Volumes Array latency, capacity, data reduction ratio No
15 HPE 2.x 3PAR/Primera, Nimble, Synergy, ProLiant Array performance, blade health, enclosure power No
16 Cisco UCS 3.x Fabric Interconnects, Blades, Rack Units, Service Profiles Fabric uplink utilization, blade faults, power draw No
17 OS: Windows 8.x Windows Servers (WMI-based) CPU, Memory, Disk, Network, Services, Processes No
18 OS: Linux 8.x Linux Servers (SSH-based) CPU, Memory, Disk, Network, top processes No
19 SNMP 5.x Generic SNMP-enabled devices (switches, routers, UPS) Interface traffic, device uptime, OID-based custom metrics No
20 Active Directory 3.x Domain Controllers, Sites, Replication Replication latency, LDAP response time, DC availability No
21 SQL Server 4.x SQL Instances, Databases, Always On Availability Groups Query latency, buffer cache hit ratio, log growth No
22 Oracle Database 3.x Oracle Instances, Tablespaces, ASM Disk Groups Tablespace usage, session counts, wait events No
23 Ping 8.x Any IP-reachable device ICMP availability, round-trip latency, packet loss No
24 Log Insight 8.x Operations for Logs integration Log event counts, ingestion rate Yes
25 Telegraf Agent 8.x Any system running Telegraf (push-based) Custom metrics via Telegraf input plugins No
26 Kubernetes 2.x Kubernetes Clusters, Namespaces, Nodes, Pods, Containers Pod status, container resource usage, node conditions No
27 Service Discovery 8.x Application dependency mapping Service relationships, communication flows, port mappings Yes

19.4 Management Pack Builder

For technologies not covered by existing management packs, VCF Operations includes a no-code development environment for building custom adapters.

Navigate to: Administration → Integrations → Management Pack Builder

Supported Input Methods:

Input Type Description Use Case
REST API Define endpoints, authentication, JSON path mappings Custom web applications, SaaS platforms, IoT APIs
SNMP MIB Import MIB files and map OIDs to metrics Legacy network devices, industrial equipment
Script-Based Python or PowerShell scripts that output metrics in a defined format Internal tools, proprietary systems, complex collection logic

Development Workflow:

  1. Create a new project in Management Pack Builder
  2. Define the object model — what object types exist (e.g., "Custom App Server," "Custom Database")
  3. Define relationships between object types (e.g., "App Server runs on Host")
  4. Map metrics — connect API responses, SNMP OIDs, or script outputs to named metrics
  5. Define collection intervals for each metric group
  6. Create alert definitions (optional) — symptoms and recommendations for the custom technology
  7. Build dashboards (optional) — pre-built dashboards included in the pack
  8. Export the project as a PAK file
  9. Install the PAK file using the standard process (Section 19.2)

Tip: Start with the REST API input type for most modern applications. Define a health-check endpoint first to validate connectivity, then expand to detailed metrics. Use the built-in Test button at each stage to validate collection before exporting.

19.5 Third-Party Packs

In addition to Broadcom-published management packs, several vendors produce and support their own packs for VCF Operations.

Vendor Management Pack Monitored Technology Key Capabilities
Dell Technologies OpenManage for VCF Operations PowerEdge server hardware via iDRAC Hardware health (fans, PSUs, RAID), firmware inventory, warranty status, thermal monitoring
NVIDIA vGPU Management Pack NVIDIA vGPU-enabled hosts and VMs GPU utilization %, GPU memory usage, temperature, encoder/decoder sessions, frame buffer
Rubrik Rubrik Management Pack Rubrik CDM and Polaris Backup job success/failure rates, SLA compliance percentage, storage consumption trends, archive status
Zerto Zerto Management Pack Zerto Virtual Replication VPG replication health, RPO status, journal size, failover test history, bandwidth consumption

Note: Third-party management packs follow their own release cadence independent of VCF Operations versions. Always verify compatibility with your VCF Operations version before installing. Check the vendor's compatibility matrix or release notes.


Chapter 20: Day-2 Operations and Maintenance

Once VCF Operations is deployed and configured, ongoing maintenance ensures the platform remains healthy, performant, and current. This chapter covers the operational tasks that every VCF Operations administrator must master.

20.1 Log File Locations

All VCF Operations appliance logs reside on the appliance filesystem. The following table identifies the critical log files, their paths, and their purposes.

Log File Path Purpose
Analytics /storage/log/vcops/analytics.log Analytics engine processing — capacity calculations, forecasting, anomaly detection
Collector /storage/log/vcops/collector.log Data collection framework — adapter scheduling, metric ingestion
API / UI /storage/log/vcops/web/catalina.out Tomcat application server — REST API requests, UI errors
CASA /storage/log/vmware/casa/casa.log Cluster management — node join/leave, role assignment, slice configuration
GemFire /storage/log/vcops/gemfire/gemfire.log Distributed cache — inter-node data replication, partition management
vPostgres /storage/log/vmware/vpostgres/postgresql.log PostgreSQL database — query errors, connection issues, replication
Adapter (per-adapter) /storage/log/vcops/adapters/<adapter-name>/ Individual adapter logs — collection errors, connectivity issues
VAMI /var/log/vmware/ VMware Appliance Management Interface — appliance configuration changes
PAK Manager /storage/log/vcops/pakManager.log PAK file installation, upgrade, and management pack deployment
Suite API /storage/log/vcops/web/suiteapi.log Suite API-specific request/response logging

Tip: When troubleshooting, start with the most specific log. If a particular adapter is failing, check its log in /storage/log/vcops/adapters/ first. Escalate to collector.log only if the adapter log does not reveal the issue.

20.2 Safely Cleaning Logs

Log files can grow substantially in active environments, particularly when debug logging is enabled. Use the following procedures to reclaim disk space safely.

Check current disk usage:

df -h /storage/log
du -sh /storage/log/vcops/*

Truncate an active log file (preserves file handle):

truncate -s 0 /storage/log/vcops/analytics.log

Remove old rotated log archives:

find /storage/log -name "*.gz" -mtime +30 -delete
find /storage/log -name "*.log.*" -mtime +30 -delete

Check for core dumps consuming space:

du -sh /storage/core/
# If core dumps are present and no longer needed:
rm -f /storage/core/core.*

Warning: Never use rm on active log files (e.g., rm analytics.log). The process holding the file descriptor will continue writing to the deleted inode, consuming disk space invisibly. Always use truncate to safely zero out an active log file while preserving the file handle.

Warning: If log growth is persistent, investigate the root cause (e.g., a failing adapter retrying every 5 seconds, debug logging left enabled). Truncating logs without addressing the cause is a temporary fix.
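
The commands above can be wrapped in a small guarded sweep. The 1 GB threshold and the directory are assumptions to tune for your site; the function uses truncate for the reason given in the warning:

```shell
# Guarded sweep: zero out any oversized active log with truncate so open file
# handles stay valid. The 1 GB threshold is an assumption -- tune to your site.
clean_logs() {
  # $1 = log directory, $2 = find(1) size threshold (e.g. +1G)
  [ -d "$1" ] || return 0   # nothing to do if the path is absent
  find "$1" -maxdepth 1 -name "*.log" -size "$2" -exec truncate -s 0 {} \;
}
clean_logs /storage/log/vcops +1G
```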

20.3 Backup and Restore

Backup Configuration:

Step 1. Navigate to Administration → Backup/Restore.

Step 2. Configure the backup destination:

Step 3. Set the backup schedule:

Setting Recommendation
Frequency Daily
Time 02:00 (during low-activity window)
Retention 7 backups (1 week of daily backups)

Step 4. Select backup content:

Option Includes Size Impact
All (Configuration + Data) Cluster config, policies, dashboards, alerts, views, reports, custom groups, supermetrics, AND historical metric data Large (potentially hundreds of GB)
Configuration Only Everything except historical metric data Small (typically < 5 GB)

Step 5. Click Save to activate the schedule. For an immediate backup, click Backup Now.

Restore Procedure:

  1. Deploy a fresh VCF Operations OVA with the same version as the backup
  2. During the Initial Setup Wizard, select Restore from Backup instead of "New Installation"
  3. Provide the backup location path (NFS or local)
  4. Select the specific backup file to restore from (listed by date/time)
  5. The restore process rebuilds the cluster configuration, policies, dashboards, and (if included) historical data
  6. After restore completes, the cluster comes online and resumes data collection

Important: You cannot restore a backup from a newer version to an older version. The target appliance must be the same version or newer than the backup source.

20.4 Certificate Management

VCF Operations generates self-signed internal certificates during deployment. For production environments, replace these with certificates signed by your enterprise Certificate Authority (CA).

Current Certificate Status: Navigate to Administration → Certificates to view the current certificate details, including issuer, subject, expiration date, and thumbprint.

Supported Formats:

Format Description
PEM Base64-encoded certificate and private key in separate files (.pem, .crt, .key)
PFX / PKCS12 Binary format containing certificate chain and private key in a single file (.pfx, .p12)

Steps to Replace the Certificate:

Step 1. Generate a Certificate Signing Request (CSR) from VCF Operations, or prepare a PEM certificate chain externally.

Step 2. Upload the signed certificate and private key:

Step 3. Click Apply. VCF Operations validates the certificate chain, verifies the private key matches, and restarts services automatically. Expect 5–10 minutes of downtime during the service restart.

Warning: Ensure the certificate's Subject Alternative Names (SANs) include the FQDN of every node in the cluster and the cluster VIP (if using HA/CA). Missing SANs will cause inter-node communication failures.
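
Before uploading, the SAN list on a PEM certificate can be checked with openssl (the -ext option needs OpenSSL 1.1.1 or later; the filename here is a placeholder):

```shell
# Pre-flight SAN check before uploading a certificate.
# The filename is a placeholder for your actual PEM chain.
CERT="${CERT:-vcfops-chain.pem}"
if [ -f "$CERT" ]; then
  openssl x509 -in "$CERT" -noout -ext subjectAltName
else
  echo "certificate not found: $CERT"
fi
```

Confirm that every cluster node FQDN and the VIP appear in the output before clicking Apply.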

20.5 Password Rotation

Regular password rotation is a security best practice and may be required by organizational policy.

Via CLI (SSH to the VCF Operations appliance):

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
  password change --user admin

You will be prompted for the current password and the new password.

Via VCF Fleet Manager (SDDC Manager integration):

  1. Log in to SDDC Manager
  2. Navigate to Lifecycle → Password Management
  3. Select the VCF Operations instance
  4. Click Rotate
  5. Choose to auto-generate or manually specify the new password
  6. Confirm rotation

Rotation Schedule Recommendations:

Account Recommended Interval Notes
admin (UI) Every 90 days Primary administrative account
root (SSH) Every 90 days Appliance OS-level access
maintenanceAdmin Every 90 days Used for cluster maintenance operations
Adapter credentials Every 90 days or per policy Service accounts connecting to vCenter, NSX, etc.

Important: After rotating adapter credentials (e.g., the vCenter service account password), update the corresponding credential in Administration → Integrations → Accounts → Edit Credential. Failure to do so will cause data collection to stop.

20.6 Upgrading

VCF Operations upgrades are delivered as PAK files and follow a rolling upgrade process that minimizes downtime in HA and CA deployments.

Phase 1 — Pre-Upgrade Checklist:

Task Command / Location Purpose
Verify current version Administration → Cluster Management Confirm starting version
Take full backup Administration → Backup/Restore → Backup Now Rollback safety net
Check compatibility matrix Broadcom compatibility guide Ensure management packs are compatible with target version
Download upgrade PAK Broadcom support portal Obtain the upgrade binary
Snapshot all nodes vCenter → Right-click VM → Snapshots → Take Snapshot Quick rollback mechanism
Verify NTP sync ntpq -p on each node Prevent time-skew issues during upgrade
Check disk space df -h on each node Ensure /storage has > 20% free
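
The disk-space row of the checklist can be scripted. A sketch, assuming the /storage mount point and 20% threshold from the table; on a machine without that mount it reports SKIP rather than failing:

```shell
# Script the disk-space row of the checklist: warn when a mount has less free
# space than required (20% for /storage, per the table above).
check_free_pct() {
  # $1 = mount point, $2 = minimum free percent
  used=$(df -P "$1" 2>/dev/null | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  [ -n "$used" ] || { echo "SKIP: mount not found: $1"; return 0; }
  free=$((100 - used))
  if [ "$free" -ge "$2" ]; then
    echo "OK: ${free}% free on $1"
  else
    echo "WARN: only ${free}% free on $1 (need ${2}%)"
  fi
}
check_free_pct /storage 20
```

Run it on each node alongside the NTP check before uploading the PAK.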

Phase 2 — Upgrade Execution:

  1. Navigate to Administration → Software Update
  2. Click Upload PAK File and browse to the upgrade PAK
  3. Once uploaded, the system validates the PAK and displays the target version
  4. Click Install
  5. The upgrade proceeds in rolling fashion:
  6. During the upgrade, the cluster remains partially available (data collection continues on non-upgrading nodes)
  7. Total upgrade time: 30–90 minutes depending on cluster size

Phase 3 — Post-Upgrade Validation:

Task How to Verify
Cluster status is "Online" Administration → Cluster Management
All nodes show new version Administration → Cluster Management → Node details
Data collection is active Environment → select any object → verify recent metrics
Management packs are functional Administration → Integrations → Accounts → check status icons
Dashboards load correctly Navigate to several dashboards and verify data
Remove VM snapshots vCenter → Right-click VM → Snapshots → Delete All

Warning: Do not delete VM snapshots until you have fully validated the upgrade. Snapshots provide the fastest rollback path if issues are discovered. However, do not keep snapshots longer than 72 hours, as they degrade VM performance.

20.7 Scaling

As monitored environments grow, VCF Operations may require additional resources.

Vertical Scaling (Scale Up):

Increase the vCPU and memory allocated to existing nodes.

OVA Size vCPU Memory (GB) Objects Supported
Small 4 16 Up to 1,500
Medium 8 32 Up to 5,000
Large 16 48 Up to 15,000
Extra Large 24 128 Up to 30,000

To change the size: power off the node, adjust CPU/RAM in vCenter, power on. The analytics engine automatically detects the new resources.

Horizontal Scaling (Scale Out):

Add Data Nodes to distribute the analytics workload across more compute.

  1. Deploy a new VCF Operations OVA
  2. During setup, select Expand an existing cluster
  3. Provide the primary node's FQDN or IP
  4. The new node joins the cluster with the Data role
  5. Navigate to Administration → Cluster Management to verify the new node
  6. The cluster automatically rebalances object assignments across all data nodes

Guideline: Add one Data Node for every 10,000 additional objects beyond the primary node's capacity. For environments exceeding 50,000 objects, engage Broadcom Professional Services for architecture review.

20.8 Support Bundle Generation

When engaging Broadcom Global Support Services (GSS), a support bundle is typically required.

Via the UI:

  1. Navigate to Administration → Support → Generate Support Bundle
  2. Select the nodes to include (all nodes recommended)
  3. Optionally select specific log categories to include
  4. Click Generate
  5. Download the resulting ZIP file when generation completes

Via the CLI (SSH):

/usr/lib/vmware-vcops/support/vrops-support.sh

The script collects logs, configuration files, cluster state, and diagnostic information into a ZIP file located at /storage/log/vcops/support/.

Support Bundle Contents:

Category Included Items
Logs All log files from Section 20.1
Configuration Cluster config, slice configuration, property files
Cluster State Node roles, service status, GemFire partition info
System Info OS version, disk usage, memory usage, process list
Thread Dumps Java thread dumps for analytics and collector services

Tip: For targeted troubleshooting, you can generate a "lightweight" bundle by specifying only the relevant log categories. This reduces generation time and file size, which speeds up upload to the support ticket.

20.9 Troubleshooting Common Issues

The following sections document the most frequently encountered issues, their root causes, and step-by-step resolutions.


Cluster Stuck at "Going Online"

Symptom: The cluster status on the Administration → Cluster Management page shows "Going Online" for more than 30 minutes without progressing.

Root Cause: The analytics service is failing to start, typically due to a GemFire distributed cache partition conflict or corrupted analytics state.

Resolution:

# Step 1: Check current service status
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status

# Step 2: Restart the analytics service
service vmware-vcops-analytics restart

# Step 3: Monitor the analytics log for errors
tail -f /storage/log/vcops/analytics.log

If the restart does not resolve the issue, check for GemFire partition conflicts:

grep -i "partition" /storage/log/vcops/gemfire/gemfire.log | tail -20

If partition errors are present, a full cluster restart may be required:

service vmware-vcops stop
# Wait 5 minutes for all services to fully terminate
service vmware-vcops start

Cluster Stuck at "Going Offline"

Symptom: After requesting the cluster to go offline, the Admin UI becomes unresponsive and the cluster never reaches "Offline" state.

Root Cause: A hung analytics or vPostgres process is preventing graceful shutdown.

Resolution:

# Step 1: Force stop all services
service vmware-vcops stop

# Step 2: Verify all Java processes have terminated
ps aux | grep java

# Step 3: If processes remain, wait 5 minutes then check again
# Do NOT use kill -9 unless absolutely necessary

# Step 4: Start services
service vmware-vcops start

"Waiting for Analytics" Message

Symptom: Dashboard widgets display "Waiting for Analytics" instead of data. The message persists beyond the normal startup window (15 minutes).

Root Cause: The analytics engine has either crashed, is processing a large backlog, or has encountered an out-of-memory condition.

Resolution:

  1. Check the analytics service status:

    service vmware-vcops-analytics status
    
  2. If the service is stopped, check the log for the cause:

    tail -100 /storage/log/vcops/analytics.log
    
  3. Look for OutOfMemoryError or StackOverflowError in the log. If found, the node likely needs more memory (see Section 20.7 on vertical scaling).

  4. Restart the analytics service:

    service vmware-vcops-analytics restart
    

"FSDB Running Low on Disk Space"

Symptom: An alert fires indicating that the FSDB (File System Database) partition is running low on disk space. The /storage/db partition is at or above 85% utilization.

Root Cause: Historical metric data has filled the /storage/db partition. This occurs when retention is set too high for the available disk, or when a large number of new objects were added without corresponding disk expansion.

Resolution (in order of preference):

  1. Reduce data retention:

    Data Type Default Minimum Recommended
    Real-time (5-min) 1 day 1 day
    Hourly rollup 30 days 15 days
    Daily rollup 6 months 3 months
    Monthly rollup 13 months 6 months
  2. Expand the /storage/db disk:

  3. Remove unused management packs:


Slow Data Collection

Symptom: Metric charts show gaps, dashboards display stale data, or the "Last Collection" timestamp for adapters is more than 10 minutes old.

Root Cause: Multiple potential causes — adapter overload, network latency to the target system, expired or invalid credentials, or insufficient collector resources.

Resolution:

  1. Check adapter status:

  2. Check adapter logs:

    tail -200 /storage/log/vcops/adapters/<adapter-name>/<adapter-name>.log
    
  3. Verify credentials: Edit the adapter account and click Validate Connection

  4. Check collector resource usage:

    top -bn1 | head -20
    free -h
    
  5. For geographically distant targets: Deploy a Remote Collector at the remote site to reduce collection latency


Root Partition Full

Symptom: Services fail to start. SSH access still works but commands may produce "No space left on device" errors.

Root Cause: Core dumps, temporary files, or unexpected log files have filled the root (/) partition.

Resolution:

# Step 1: Identify the largest consumers
du -sh /* | sort -rh | head

# Step 2: Common culprits — check and clean
du -sh /storage/core/
rm -f /storage/core/core.*

du -sh /tmp/
# Remove old temp files (careful — do not remove active temp files)
find /tmp -type f -mtime +7 -delete

# Step 3: Check for unexpected log files in /var/log
du -sh /var/log/*

Warning: If the root partition is completely full (100%), services cannot write PID files or temp files and will refuse to start. In extreme cases, you may need to boot from a rescue ISO to clear space.

20.10 CLI Tools

VCF Operations provides several command-line tools for administration and troubleshooting.

Tool | Command | Purpose | Common Usage
vrops-status | vrops-status | Quick cluster health check | Verify all services are running, check node roles
OPS-CLI | $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py | Full CLI management | Adapter management, metric queries, object searches, password changes
Slice Configuration | /usr/lib/vmware-vcops/support/sliceConfiguration.sh | Cluster slice management | Check slice status, force slice rebalancing
Support Script | /usr/lib/vmware-vcops/support/vrops-support.sh | Support bundle generation | Generate log bundles for Broadcom support
Service Control | service vmware-vcops {start|stop|restart|status} | Service management | Start, stop, or restart the entire VCF Operations stack
Platform CLI | vcops-cli | Platform-level operations | License management, node management

OPS-CLI Examples:

# List all adapter instances
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
  adapter list

# Search for an object by name
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
  object search --name "web-server-01"

# Query a metric for a specific object
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
  metric query --objectId <uuid> --metricKey "cpu|usage_average"

Warning: Direct Cassandra access via cqlsh localhost 9042 is available for advanced troubleshooting but is unsupported by Broadcom. Modifying data in Cassandra directly can corrupt the FSDB and render the cluster inoperable. Use only under explicit guidance from Broadcom support.

20.11 Suite API Reference

The Suite API is the RESTful interface for programmatic access to all VCF Operations functionality. It enables integration with ITSM tools, custom portals, automation pipelines, and third-party systems.

Authentication:

# Acquire a token
curl -k -X POST \
  "https://<vrops-fqdn>/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","authSource":"local","password":"<password>"}'

The response returns a JSON object containing the token field. Use this token in subsequent requests.

Token Usage:

Include the token in the Authorization header for all API calls:

Authorization: vRealizeOpsToken <token>

Tokens expire after 6 hours by default. Acquire a new token when the current one expires.

Base URL:

https://<vrops-fqdn>/suite-api/api/

Key Endpoint Categories:

Category | Base Path | Operations
Resources | /resources | List, search, create, delete objects; query relationships
Alerts | /alerts | List, query, update, cancel alerts
Symptoms | /symptoms | List, create, delete symptom definitions
Supermetrics | /supermetrics | List, create, update, delete supermetric formulas
Policies | /policies | List, create, apply, export, import policies
Adapters | /adapters | List adapter kinds, instances; start/stop collection
Credentials | /credentials | List, create, update, delete credential instances
Reports | /reports | List templates, generate reports, download results
Dashboards | /dashboards | List, import, export, share dashboards
Auth | /auth | Token acquisition, token release, user management
Collector Groups | /collectorgroups | List, create, assign collectors
Custom Groups | /customgroups | List, create, update, delete custom groups
Metric Keys | /resources/{id}/stats | Query metric data for specific resources
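As a worked example of addressing these endpoints, the following Python sketch assembles a stats query URL for the /resources/{id}/stats path, percent-encoding the pipe character that appears in metric keys. The statKey, begin, and end parameter names follow the Suite API convention; confirm the exact parameter set against the Swagger UI for your build.

```python
from urllib.parse import urlencode, quote

def stats_url(base, resource_id, stat_keys, begin_ms=None, end_ms=None):
    # statKey may repeat; begin/end are epoch-millisecond bounds
    # (Suite API convention; verify against the Swagger UI).
    params = [("statKey", k) for k in stat_keys]
    if begin_ms is not None:
        params.append(("begin", str(begin_ms)))
    if end_ms is not None:
        params.append(("end", str(end_ms)))
    return "%s/resources/%s/stats?%s" % (
        base.rstrip("/"), quote(resource_id), urlencode(params))

# "abc-123" is a made-up resource id for illustration.
print(stats_url("https://vrops.corp.local/suite-api/api/",
                "abc-123", ["cpu|usage_average"]))
# → https://vrops.corp.local/suite-api/api/resources/abc-123/stats?statKey=cpu%7Cusage_average
```

The metric key syntax (cpu|usage_average) matches the OPS-CLI example in Section 20.10.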

Interactive API Documentation:

VCF Operations ships with embedded Swagger UI documentation:

https://<vrops-fqdn>/suite-api/doc/swagger-ui.html

The Swagger UI provides a complete, interactive reference for all API endpoints, including request/response schemas, parameter descriptions, and the ability to execute API calls directly from the browser.

Common API Workflow Example — Export All Critical Alerts:

# Step 1: Acquire token
TOKEN=$(curl -sk -X POST \
  "https://vrops.corp.local/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","authSource":"local","password":"P@ssw0rd"}' \
  | python -c "import sys,json; print(json.load(sys.stdin)['token'])")

# Step 2: Query critical alerts
curl -sk \
  "https://vrops.corp.local/suite-api/api/alerts?status=ACTIVE&criticality=CRITICAL" \
  -H "Authorization: vRealizeOpsToken $TOKEN" \
  -H "Accept: application/json" | python -m json.tool

# Step 3: Release token when done
curl -sk -X POST \
  "https://vrops.corp.local/suite-api/api/auth/token/release" \
  -H "Authorization: vRealizeOpsToken $TOKEN"

Best Practice: Always release tokens when your automation workflow completes. Each VCF Operations instance supports a limited number of concurrent API sessions. Unreleased tokens count against this limit until they expire naturally.
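In scripted workflows, the release step is easy to skip when an error occurs mid-run. The Python sketch below wraps the acquire/use/release pattern in a context manager; the HTTP calls are left as caller-supplied functions (so nothing here is a real Suite API client), and try/finally guarantees the release runs even on failure.

```python
import contextlib

@contextlib.contextmanager
def vrops_token(acquire, release):
    # acquire/release are caller-supplied callables wrapping the real
    # HTTP requests (Steps 1 and 3 above); nothing here touches the network.
    token = acquire()
    try:
        yield token
    finally:
        # Runs even if the workflow raises, so the session slot is freed.
        release(token)

# Usage with stub callables standing in for the real HTTP calls:
released = []
with vrops_token(lambda: "tok-123", released.append) as tok:
    pass  # ... query alerts with `tok` here ...
print(released)  # → ['tok-123']
```

Because the release happens in `finally`, a failed query in the body still returns the session slot to the instance immediately instead of waiting for the 6-hour expiry.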


PART II: VCF Operations for Logs


Chapter 21: Product Overview — Operations for Logs

21.1 Naming History

VCF Operations for Logs has undergone several name changes since its inception, reflecting broader shifts in VMware's product portfolio and, ultimately, the Broadcom acquisition. Understanding the naming timeline is essential when referencing older documentation, knowledge-base articles, and community posts.

Year | Product Name | Context
2013 | VMware vCenter Log Insight | Original release; tightly associated with vCenter Server
2016 | vRealize Log Insight | Rebranded under the vRealize management suite umbrella
2022 | VMware Aria Operations for Logs | Part of the VMware Aria brand unification across all management products
2024 | VCF Operations for Logs | Broadcom acquisition; product folded into the VCF (VMware Cloud Foundation) brand

Note: Many CLI tools, OVA filenames, internal service names, and API endpoints still reference loginsight or vrli. Do not be alarmed when you encounter these legacy identifiers — they are functionally equivalent to the current product.

Throughout this handbook, the terms Operations for Logs, OpsForLogs, and the abbreviation vRLI may be used interchangeably where historical context or brevity demands it.

21.2 How It Differs from VCF Operations

VCF Operations (metrics) and VCF Operations for Logs (logs) are complementary products. They are deployed separately, serve distinct analytical purposes, and store fundamentally different data types. The following table summarizes the key differences.

Aspect | VCF Operations | VCF Operations for Logs
Data Type | Metrics, properties, supermetrics, events | Log messages (syslog, agent-collected file logs)
Analysis Model | Time-series statistical analysis, machine-learning anomaly detection, capacity modeling | Full-text search, pattern matching, ML-based intelligent grouping
Alerting | Threshold-based — triggers when a metric value crosses a defined boundary | Pattern-based — triggers when a log message matches a content rule or frequency condition
Storage Engine | FSDB (proprietary time-series database) | Apache Cassandra + proprietary full-text index
Primary Use Cases | Performance monitoring, capacity planning, cost analysis, what-if modeling | Troubleshooting, root-cause analysis, audit trail, compliance reporting
Retention Model | Configurable retention policies (weeks to months of metric data) | Index partitions managed by time-based buckets (days to months of log data)
Integration Direction | Launches-in-context to Operations for Logs for correlated log investigation | Launches-in-context to Operations for metric correlation

Best Practice: Deploy both products and configure the bidirectional integration between them. When Operations detects an anomaly on an object, the administrator can pivot directly into Operations for Logs to examine the logs from that object during the anomaly window — dramatically reducing mean time to resolution.

21.3 Architecture

Operations for Logs follows a scale-out clustered architecture built on the following components:

Cluster Specifications:

Data Flow:

Log Sources (vCenter, ESXi, NSX, Agents, Syslog Devices)
        │
        ▼
   VIP Address (ILB)
        │
        ▼
  ┌─────┴──────┐
  │  Primary   │──── Worker 1 ──── Worker 2 ──── Worker N
  │   Node     │
  └────────────┘
        │
        ▼
  Ingestion Pipeline → Parsing → Field Extraction → Indexing
        │
        ▼
  Cassandra Index + Full-Text Index (per-node storage)
        │
        ▼
  Query Engine (distributed, merges results from all nodes)

21.4 Cluster Sizing Guidance

Select the cluster size based on expected daily ingestion volume and query concurrency requirements.

Cluster Size | Nodes | Estimated Ingestion Rate | Typical Use Case
Small | 1 (standalone) | ~15 GB/day | Lab, proof-of-concept, developer environments
Medium | 3 | ~45 GB/day | Small production (single VCF instance, <200 VMs)
Large | 6 | ~90 GB/day | Medium production (multi-cluster, 200–1,000 VMs)
Extra Large | 12+ | ~180+ GB/day | Large enterprise (multi-site, 1,000+ VMs, compliance-heavy)

Warning: These figures assume default field extraction and content packs. Heavy use of custom regex extraction, large numbers of active alerts, or complex dashboards with many concurrent users will reduce effective ingestion capacity. Always monitor the Ingestion Rate and Query Latency dashboards after deployment and add worker nodes proactively if ingestion approaches 80% of rated capacity.

Tip: For VCF environments, a 3-node medium cluster is the recommended starting point for production. This provides both high availability (the cluster tolerates the loss of one node) and sufficient headroom for growth.
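The sizing table and the 80% headroom guidance above combine into a simple selection rule. The figures below are copied from the table; the function is only a planning sketch, not a product sizing tool.

```python
# Rated ingestion per cluster size, copied from the table above (GB/day).
SIZES = [("Small", 1, 15), ("Medium", 3, 45), ("Large", 6, 90), ("Extra Large", 12, 180)]

def pick_cluster_size(daily_gb, headroom=0.80):
    # Smallest size that keeps expected ingestion at or below 80% of
    # rated capacity, per the warning above.
    for name, nodes, rated in SIZES:
        if daily_gb <= rated * headroom:
            return name, nodes
    # Beyond the rated figures: start Extra Large and add workers as needed.
    return "Extra Large", 12

print(pick_cluster_size(30))  # → ('Medium', 3): 30 GB/day exceeds 80% of Small's 15 GB/day
```

Note that 30 GB/day lands on Medium rather than Small precisely because of the headroom rule, even though Small's rated capacity is only half the requirement away.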


Chapter 22: Deployment

22.1 OVA Sizing

The Operations for Logs OVA offers three deployment sizes. Select the size at deployment time — it cannot be changed later without redeployment.

Size | vCPUs | Memory (GB) | Disk (GB) | Estimated Ingestion Rate
Small | 4 | 8 | 530 | ~15 GB/day
Medium | 8 | 16 | 1,060 | ~30 GB/day
Large | 16 | 32 | 2,080 | ~45 GB/day

Important: Disk sizes listed are total, including OS, application, and log index storage. The index partition consumes the majority of disk space. When planning retention, remember that longer retention windows require proportionally more disk. If the built-in disk is insufficient, you can attach additional VMDK volumes post-deployment and configure them as additional storage partitions.
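For rough planning, disk size, ingestion rate, and retention relate by simple arithmetic. The sketch below assumes, as a placeholder, that about 75% of the OVA disk is usable for index data and that indexed logs occupy roughly their ingested size; both assumptions vary by version and compression, so treat the result as an order-of-magnitude estimate only.

```python
def rough_retention_days(total_disk_gb, daily_ingest_gb, index_fraction=0.75):
    # index_fraction is an assumed placeholder for the share of disk
    # available to the log index; the real split varies by version
    # and partition layout.
    return int(total_disk_gb * index_fraction / daily_ingest_gb)

# Medium OVA from the table: 1,060 GB disk at its rated ~30 GB/day
print(rough_retention_days(1060, 30))  # → 26 (days, under these assumptions)
```

If the estimate falls short of your retention requirement, plan additional VMDK storage partitions up front rather than after the index fills.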

Recommendation: For production deployments, always select Medium or Large. The Small size is appropriate only for labs and proof-of-concept environments.

22.2 OVA Deployment (Step-by-Step)

Deploy the Operations for Logs OVA through the vSphere Client using the following procedure.

Prerequisites:

Procedure:

  1. Log in to the vSphere Client and navigate to the target cluster or resource pool.
  2. Right-click the cluster or resource pool and select Deploy OVF Template.
  3. On the Select an OVF template page, choose Local file and browse to the downloaded OVA file. Click Next.
  4. On the Select a name and folder page, enter a descriptive VM name (e.g., vrli-primary-01) and select the target inventory folder. Click Next.
  5. On the Select a compute resource page, choose the target cluster, host, or resource pool. Click Next.
  6. On the Review details page, verify the OVA details (publisher, download size, disk size). Click Next.
  7. On the License agreements page, read and accept the EULA. Click Next.
  8. On the Deployment configuration page, select the appropriate size from the dropdown:
  9. On the Select storage page:
  10. On the Select networks page, map the OVA network to the appropriate port group. Click Next.
  11. On the Customize template page, fill in the following fields:
  12. On the Ready to complete page, review all settings and click Finish.
  13. Wait for the deployment task to complete in the vSphere Recent Tasks pane.
  14. Power on the virtual machine.
  15. Wait 3–5 minutes for all services to initialize. Monitor the VM console for boot progress.

Warning: Do not snapshot the VM during initial boot. Allow all services to fully start before taking the first snapshot.

22.3 Initial Configuration Wizard

After the first boot completes, access the web-based configuration wizard to finalize setup.

  1. Open a browser and navigate to https://<node-fqdn> (or https://<ip-address>).
  2. Accept the self-signed SSL certificate warning.
  3. The Initial Configuration Wizard launches automatically.

Step 1 — Admin Password:

Step 2 — License Key:

Step 3 — General Configuration:

Step 4 — CEIP:

Step 5 — NTP Configuration:

Critical: Accurate time synchronization is essential for log correlation. Ensure all Operations for Logs nodes, vCenter servers, and ESXi hosts share the same NTP source.

Step 6 — SMTP Configuration:

Step 7 — SSL Certificate:

Step 8 — Finish:

22.4 Cluster Setup

A single standalone node is suitable for labs, but production environments require a cluster of at least three nodes for high availability, ingestion scaling, and query performance.

22.4.1 Adding Worker Nodes

  1. Deploy additional OVAs following the same procedure described in Section 22.2. Each worker node requires its own unique FQDN and IP address.
  2. Power on the worker node and wait for services to initialize (3–5 minutes).
  3. Open a browser and navigate to https://<worker-fqdn>.
  4. On the initial setup page, select Join Existing Deployment (do not select "Start New Deployment").
  5. Enter the Primary Node FQDN or IP address.
  6. Review and accept the primary node's SSL certificate fingerprint.
  7. Click Join. The worker node contacts the primary, receives the cluster configuration, and begins participating in ingestion and query processing.
  8. Repeat for each additional worker node.

Note: Worker nodes do not require independent license keys. The license is managed centrally on the primary node and applies cluster-wide.

22.4.2 ILB and VIP Configuration

After adding worker nodes, configure the Integrated Load Balancer and Virtual IP to provide a single entry point for all clients.

  1. Log in to the primary node UI: https://<primary-fqdn>.
  2. Navigate to Administration → Cluster.
  3. Verify all worker nodes appear with a status of Connected.
  4. Navigate to Administration → Cluster → VIP (or the Virtual IP tab).
  5. Enter the desired Virtual IP address. This IP must be on the same subnet as the cluster nodes and must not be assigned to any other device.
  6. Enter the FQDN for the VIP (register this in DNS beforehand with both A and PTR records).
  7. Click Save.
  8. The ILB activates across all cluster nodes. Within 30 seconds, the VIP becomes responsive.
  9. Verify by navigating to https://<vip-fqdn> — the Operations for Logs UI should load.
  10. Update all log sources (syslog configurations, agent liagent.ini files) to point to the VIP address instead of the primary node address.

Warning: If you do not configure a VIP, log sources pointing to the primary node will not benefit from load balancing, and the primary node becomes a single point of failure for ingestion.

22.5 VCF 9.0 Deployment via Fleet Manager

In VCF 9.0 and later, Operations for Logs can be deployed through SDDC Manager's Fleet Management capability, which automates the entire lifecycle.

  1. Log in to SDDC Manager and navigate to Lifecycle → Fleet Management.
  2. Under Available Products, select VMware Aria Operations for Logs (some builds may still display the legacy name).
  3. Click Deploy and provide the required parameters:
  4. Click Submit. Fleet Manager performs the following actions automatically:
  5. Monitor deployment progress in the SDDC Manager → Tasks panel. A typical 3-node deployment completes in 30–45 minutes.
  6. Once complete, the Operations for Logs instance appears under Fleet Management → Deployed Products with a status of Active.

Tip: Fleet Manager also handles future upgrades, certificate rotation, and backup scheduling for Operations for Logs, reducing ongoing administrative overhead.


Chapter 23: Log Source Configuration

23.1 Syslog Ingestion Ports

Operations for Logs listens on several ports for log ingestion. The following table summarizes the default ports, protocols, and their intended use.

Port | Protocol | Transport | Use Case | Notes
514 | Syslog | TCP | General syslog ingestion | Unencrypted; most common for internal networks
514 | Syslog | UDP | General syslog ingestion | Unencrypted; no delivery guarantee; not recommended for production
6514 | Syslog | TCP + TLS | Secure syslog ingestion | Requires TLS certificate configuration on both sender and receiver
1514 | Syslog | TCP + SSL | ESXi host log forwarding | Automatically configured when vSphere integration is enabled
9000 | CFAPI | HTTP | Agent-based ingestion | VMware Log Insight agent protocol; unencrypted
9543 | CFAPI | HTTPS | Secure agent-based ingestion | VMware Log Insight agent protocol; certificate-secured

Best Practice: Use TCP-based protocols (514/TCP, 6514/TCP, 9543/TCP) for all production log sources. UDP-based syslog (514/UDP) does not guarantee delivery and can silently drop messages under load. For compliance-sensitive environments, use TLS-encrypted ports (6514/TCP for syslog, 9543/TCP for agents).

You can verify which ports are actively listening by navigating to Administration → Configuration → Ports in the Operations for Logs UI, or by running the following on the appliance:

netstat -tulnp | grep -E '514|9000|9543'

23.2 vCenter Log Forwarding

vCenter Server generates critical logs including vpxd, vpxd-svcs, vmware-sps, vmafdd, and many others. Forwarding these to Operations for Logs provides centralized visibility into vCenter operations.

  1. Open a browser and navigate to the vCenter VAMI: https://<vcenter-fqdn>:5480.
  2. Log in with the root account.
  3. Navigate to Syslog in the left navigation pane (location varies by vCenter version — check under Networking or Syslog Configuration).
  4. Click Edit or Configure.
  5. Add a remote syslog destination with the following settings:
  6. Click Save.
  7. Verify log delivery: in Operations for Logs, navigate to Explore Logs and search for source = <vcenter-fqdn>. Logs should appear within 1–2 minutes.

Via the CLI (Alternative)

If VAMI access is unavailable, configure syslog forwarding from the vCenter shell:

# SSH to vCenter as root
# List current syslog configuration
/usr/lib/vmware-syslog/bin/get-rsyslog-config.sh

# Restart the rsyslog service to apply configuration changes
/usr/lib/vmware-vmon/vmon-cli --restart rsyslog

Note: In vCenter 8.x and later, syslog configuration is managed through the VAMI. CLI-based configuration methods vary between versions. Always consult the release-specific documentation.

23.3 ESXi Syslog Configuration

ESXi hosts produce some of the most valuable logs in a VMware environment — vmkernel, hostd, vpxa, fdm, and vobd among others. There are three methods to configure ESXi syslog forwarding.

Method 1: Automatic via vSphere Integration (Recommended)

This is the simplest method and ensures that all hosts managed by a vCenter are automatically configured.

  1. In Operations for Logs, navigate to Administration → Integrations → vSphere.
  2. Click Add vCenter.
  3. Enter the vCenter FQDN and credentials (a service account with read access is sufficient).
  4. Click Test Connection to verify connectivity.
  5. Click Save.
  6. Operations for Logs connects to vCenter, discovers all managed ESXi hosts, and automatically configures each host to forward logs via TCP on port 1514 (SSL-secured).
  7. Verify: within 2–3 minutes, ESXi host logs appear in Explore Logs with source names matching ESXi hostnames.

Tip: The vSphere integration also pulls ESXi events and tasks, enabling richer correlation between log messages and vCenter-reported events.

Method 2: Manual per-Host via esxcli

Use this method when vSphere integration is not desired or when configuring individual hosts outside of vCenter management.

# SSH to the ESXi host
esxcli system syslog config set --loghost=tcp://<vrli-vip>:514
esxcli system syslog reload
# Verify the configuration
esxcli system syslog config get

The --loghost parameter supports multiple targets separated by commas:

esxcli system syslog config set --loghost=tcp://vrli-vip.lab.local:514,tcp://backup-syslog.lab.local:514

Important: If the ESXi firewall is enabled, ensure the syslog firewall rule is open:

esxcli network firewall ruleset set -r syslog -e true
esxcli network firewall refresh

Method 3: Bulk Configuration via PowerCLI

For large environments, use PowerCLI to configure all hosts at once:

# Connect to vCenter
Connect-VIServer -Server vcenter.lab.local

# Set syslog target for all hosts
$logHost = "tcp://vrli-vip.lab.local:514"
Get-VMHost | ForEach-Object {
    Write-Host "Configuring syslog on $($_.Name)..."
    Set-VMHostSysLogServer -VMHost $_ -SysLogServer $logHost
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $esxcli.system.syslog.reload.Invoke()
}

# Verify configuration
Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $config = $esxcli.system.syslog.config.get.Invoke()
    Write-Host "$($_.Name): $($config.RemoteHost)"
}

Warning: When using both the vSphere integration (Method 1) and manual configuration (Method 2 or 3) simultaneously, you may receive duplicate log entries. Choose one method and apply it consistently.

23.4 NSX Log Forwarding

NSX Manager and NSX Edge nodes generate logs critical for network troubleshooting, security event analysis, and compliance auditing. Configure log forwarding from the NSX Manager UI.

Procedure:

  1. Log in to the NSX Manager UI (https://<nsx-manager-fqdn>).
  2. Navigate to System → Fabric → Profiles → Node Profiles.
  3. Select the node profile applied to your NSX Manager appliances (typically All NSX Nodes or a custom profile).
  4. Scroll to the Syslog Servers section and click Add.
  5. Enter the following:
  6. Click Save.
  7. The syslog configuration propagates to all nodes assigned to the profile.

NSX Edge Nodes:

In some NSX deployments, Edge transport nodes may require separate syslog configuration:

  1. Navigate to System → Fabric → Nodes → Edge Transport Nodes.
  2. Select each Edge node.
  3. Under Syslog, click Add and enter the same server details.
  4. Click Save.

Note: NSX Distributed Firewall (DFW) logs are generated on the ESXi hosts where the DFW rules are enforced. These logs are forwarded via the ESXi syslog configuration (Section 23.3), not via the NSX Manager syslog configuration.

Verification:

In Operations for Logs, search for:

appname = "nsxmanager" OR appname = "nsx-edge"

NSX logs should appear within 1–2 minutes of configuration.

23.5 Agent Installation

The Operations for Logs agent (also known as the Log Insight agent or liagent) is a lightweight process that collects log files from Windows and Linux operating systems and forwards them to Operations for Logs via the CFAPI protocol.

Windows Agent — GUI Installation

  1. In Operations for Logs, navigate to Administration → Agents → Downloads.
  2. Download the Windows agent installer (VMware-Log-Insight-Agent-*.msi).
  3. Run the MSI installer on the target Windows machine.
  4. Click Next through the welcome screen.
  5. Accept the EULA and click Next.
  6. Enter the server hostname: <vrli-vip-fqdn>.
  7. Set the port to 9543 (HTTPS) or 9000 (HTTP).
  8. Set the protocol to CFAPI.
  9. Check SSL and accept the server certificate if using port 9543.
  10. Click Install and then Finish.
  11. The agent service (VMware Log Insight Agent) starts automatically and begins forwarding Windows Event Logs.

Windows Agent — Silent Installation

For automated deployments via SCCM, GPO, or scripting:

msiexec /i VMware-Log-Insight-Agent-x64.msi /qn ^
  SERVERHOST=vrli-vip.lab.local ^
  SERVERPROTOCOL=cfapi ^
  SERVERPORT=9543 ^
  /l*v C:\temp\liagent-install.log

Tip: Add SERVICEACCOUNT=domain\svcaccount SERVICEPASSWORD=P@ssw0rd parameters if the agent service needs to run under a domain account to access specific log file paths.

Linux Agent — RPM-based Systems (RHEL, CentOS, SLES)

# Copy the RPM to the target server
sudo rpm -i VMware-Log-Insight-Agent-*.rpm

# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN

# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent

# Verify the agent is running
sudo systemctl status liagent

Linux Agent — DEB-based Systems (Ubuntu, Debian)

# Copy the DEB package to the target server
sudo dpkg -i VMware-Log-Insight-Agent-*.deb

# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN

# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent

# Verify the agent is running
sudo systemctl status liagent

Important: The agent collects /var/log/messages and /var/log/syslog by default. Additional log directories must be configured explicitly in liagent.ini (see Section 23.6).

23.6 Agent Configuration (liagent.ini)

The agent configuration file liagent.ini controls all aspects of agent behavior — server connectivity, log file collection, field tagging, and debug settings. The file is located at:

Complete Configuration Reference

; ─── Server Connection ───
[server]
hostname=vrli-vip.lab.local
port=9543
proto=cfapi
ssl=yes
ssl_accept_any=yes                               ; Lab/testing only; in production use ssl_ca_path instead
; ssl_ca_path=/etc/pki/tls/certs/ca-bundle.crt   ; Use for strict CA validation

; ─── Default Syslog Collection (Linux) ───
[filelog|syslog]
directory=/var/log
include=*.log;messages;syslog

; ─── Custom Application Logs ───
[filelog|custom_app]
directory=/opt/myapp/logs
include=*.log
exclude=debug-*.log
parser=auto
tags={"appname":"myapp","env":"production","tier":"web"}

; ─── Apache Access Logs ───
[filelog|apache_access]
directory=/var/log/httpd
include=access_log*
parser=clf

; ─── Windows Event Log (Windows only) ───
[winlog|application]
channel=Application

[winlog|system]
channel=System

[winlog|security]
channel=Security

; ─── Agent Logging ───
[logging]
debug_level=0
; 0=Off, 1=Error, 2=Warning, 3=Info, 4=Debug
; Set to 4 only for troubleshooting; generates significant local log volume

Key Configuration Parameters

Parameter | Description | Default
hostname | Operations for Logs VIP FQDN or IP | (required)
port | Ingestion port | 9543
proto | Protocol (cfapi or syslog) | cfapi
ssl | Enable SSL/TLS | yes
ssl_accept_any | Accept any server certificate (lab only) | no
directory | Log file directory to monitor | (per section)
include | Semicolon-separated file patterns to collect | *.log
exclude | Semicolon-separated file patterns to skip | (none)
tags | JSON key-value pairs attached to every log entry from this section | {}
parser | Log parsing mode (auto, clf, csv, or custom regex) | auto
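When templating liagent.ini for many machines, the file's INI structure (including the filelog|&lt;name&gt; section naming) can be generated with Python's standard configparser. This is an illustrative generator, not a tool shipped with the agent; the section and key names mirror the reference above.

```python
import configparser, io, json

# Illustrative config generator; section and key names mirror the
# liagent.ini reference above, but this script is not part of the agent.
cfg = configparser.ConfigParser()
cfg["server"] = {"hostname": "vrli-vip.lab.local", "port": "9543",
                 "proto": "cfapi", "ssl": "yes"}
cfg["filelog|custom_app"] = {
    "directory": "/opt/myapp/logs",
    "include": "*.log",
    "exclude": "debug-*.log",
    "tags": json.dumps({"appname": "myapp", "env": "production"}),
}

buf = io.StringIO()
cfg.write(buf, space_around_delimiters=False)  # liagent.ini uses key=value
rendered = buf.getvalue()
print(rendered)
```

Generating the fragment programmatically keeps the tags JSON well-formed, which is a common source of silent tagging failures when the file is hand-edited.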

Central Agent Configuration (Agent Groups)

Instead of editing liagent.ini on every machine, you can push agent configurations centrally from the Operations for Logs UI:

  1. Navigate to Administration → Agents → Agent Groups.
  2. Click New Group.
  3. Define a filter to match agents (e.g., by IP range, hostname pattern, or OS type).
  4. Add [filelog|...] and [winlog|...] sections to the group configuration.
  5. Click Save. Matching agents pull the new configuration on their next check-in cycle (every 5 minutes by default).

Best Practice: Use Agent Groups for all production agent configuration. This ensures consistency, simplifies changes, and provides a single pane of glass for agent management.


Chapter 24: Content Packs

24.1 Built-in Content Packs

Operations for Logs ships with two content packs installed by default:

Note: The vSphere content pack is automatically activated when the vSphere integration is configured (Section 23.3, Method 1). Its extracted fields enable rich, structured queries against ESXi and vCenter logs.

24.2 Marketplace Content Packs

Additional content packs are available from the in-product Marketplace and from the Broadcom download portal. The following table lists commonly used packs.

Content Pack | Source | Key Features
VMware NSX | VMware/Broadcom | NSX Manager and Edge log parsing; security event dashboards; DFW rule hit analysis
VMware vSAN | VMware/Broadcom | vSAN trace and CMMDS log parsing; health event extraction; rebalance tracking
VMware VCF | VMware/Broadcom | SDDC Manager log parsing; lifecycle operation dashboards; compliance event tracking
Active Directory | Community/VMware | Windows AD log parsing; authentication success/failure dashboards; account lockout tracking
Linux | Community/VMware | /var/log/* parsing; SSH login analysis; cron job tracking; common Linux event fields
Palo Alto Networks | Palo Alto/Community | PAN-OS syslog parsing; firewall allow/deny dashboards; threat event correlation
F5 BIG-IP | Community | LTM and ASM log parsing; virtual server health dashboards; WAF event analysis
Dell EMC | Dell/Community | PowerStore, Unity, VNX storage array log parsing; hardware fault dashboards
Cisco | Community | IOS and NX-OS syslog parsing; interface state change tracking; routing event analysis

24.3 Content Pack Structure

Every content pack — whether built-in, marketplace, or custom — is composed of the following components:

Tip: When evaluating a content pack, review the extracted fields first. Fields are the foundation — dashboards and alerts depend on them. If the fields do not match your log format (e.g., because of a firmware version difference), the dashboards will show no data.

24.4 Installing Content Packs

From the Marketplace (In-Product):

  1. Navigate to Content Packs → Marketplace in the Operations for Logs UI.
  2. Browse or search for the desired content pack.
  3. Click the content pack name to view its description, components, and version history.
  4. Click Install.
  5. A summary dialog shows all components that will be installed (extracted fields, dashboards, alerts, queries). Review and click Confirm.
  6. Installation completes in seconds. Verify by navigating to Content Packs → Installed Content Packs and confirming the pack appears with a green status.

From a Downloaded File:

  1. Download the content pack file (.vlcp extension) from the Broadcom support portal or a community repository.
  2. In Operations for Logs, navigate to Content Packs → Installed Content Packs.
  3. Click Import Content Pack (or the upload icon).
  4. Browse to the .vlcp file and select it.
  5. Review the components and click Install.

Warning: Installing a content pack that defines fields with the same names as existing fields will overwrite the existing field definitions. Review field conflicts before installing, especially when mixing marketplace packs with custom-defined fields.

24.5 Creating Custom Content Packs

Organizations can bundle their custom fields, dashboards, alerts, and queries into a reusable content pack for distribution across environments or teams.

Procedure:

  1. Create the components you want to include:
  2. Navigate to Content Packs → My Content.
  3. Click Create Content Pack.
  4. Enter a Name (e.g., Custom - Payment Gateway Logs), Namespace (unique identifier, e.g., com.mycompany.paymentgw), and Description.
  5. In the component selection pane, check the boxes for each field, dashboard, alert, and query to include.
  6. Click Save to create the content pack.
  7. To share the pack, click Export — this generates a .vlcp file that can be imported into other Operations for Logs instances.

Tip: Use a consistent namespace convention (e.g., com.<company>.<application>) to avoid conflicts with VMware or community content packs. Version your content packs semantically (1.0, 1.1, 2.0) to track changes.

24.6 Permissions

Content pack operations are governed by the role-based access control system in Operations for Logs.

Role Install / Uninstall Create / Export Use Dashboards Use Queries Modify Components
Super Admin Yes Yes Yes Yes Yes
Admin Yes Yes Yes Yes Yes
User No No Yes Yes No (can create personal copies)
View Only No No Yes (read-only) Yes (read-only) No

Best Practice: Assign the User role to operations teams who need to search logs and view dashboards. Reserve Admin for the team responsible for content pack management and platform administration.


Chapter 25 — Log Analysis (Explore Logs)

25.1 Accessing Explore Logs

The Explore Logs interface is the primary workspace for interactive log investigation in Operations for Logs. Access it by clicking Explore Logs in the main navigation bar at the top of the UI.

The interface consists of:

25.2 Query Types

Operations for Logs supports three query modes, each suited to different analytical needs.

1. Free-Text Query (Keyword Search)

The simplest query mode. Enter keywords or phrases in the query bar, and Operations for Logs searches the full text of all log messages within the selected time range.

error
"connection refused"
authentication failed

2. Field Query (Structured Search)

Use extracted fields and operators to create precise, structured queries. This mode is more efficient than free-text search because it operates on indexed field values rather than raw text.

vmw_host = esxi01.lab.local
appname = "vpxd" AND severity = "error"
vmw_vc_vm_name = web-server-* AND text CONTAINS "snapshot"

3. Aggregation Query (Statistical Analysis)

Apply statistical functions to log data to identify trends, volumes, and outliers. Aggregation queries produce charts rather than individual log entries.

# Count events per source over time
COUNT by source

# Average response time by application
AVERAGE(response_time) by appname

# Top 10 sources by error count
COUNT WHERE severity = "error" GROUP BY source ORDER BY COUNT DESC LIMIT 10

25.3 Search Capabilities

Keyword and Boolean Search

Syntax Description Example
Single keyword Finds logs containing the word anywhere in the message error
Phrase (quoted) Finds logs containing the exact phrase "connection refused"
Boolean AND Both terms must appear error AND vcenter
Boolean OR Either term must appear warning OR error
Boolean NOT Excludes logs containing the term error NOT test
Parentheses Group boolean expressions (error OR warning) AND vcenter

Glob Patterns

Pattern Description Example
* Matches zero or more characters *error* matches "timeout error occurred"
? Matches exactly one character host-??.lab.local matches "host-01.lab.local"
[...] Matches any character in the set [Ee]rror matches "Error" and "error"
[0-9] Character range vm-[0-9][0-9][0-9] matches "vm-001" through "vm-999"
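Because POSIX shell pattern matching supports the same `*`, `?`, and `[...]` wildcards, a glob can be sanity-checked locally before it goes into a query. A minimal sketch (note the product applies globs to log text and field values, and its case-sensitivity rules may differ from the shell's; `glob_match` is an illustrative helper):

```shell
# Test a glob pattern against a candidate string using the shell's own
# case-pattern matching, which implements the *, ?, and [...] wildcards.
glob_match() {
  # $1 = glob pattern, $2 = candidate string
  case "$2" in
    $1) return 0 ;;   # pattern matched
    *)  return 1 ;;   # no match
  esac
}

glob_match 'host-??.lab.local' 'host-01.lab.local' && echo "match"
glob_match 'vm-[0-9][0-9][0-9]' 'vm-042'           && echo "match"
glob_match '[Ee]rror'           'Error'            && echo "match"
```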

Field-Based Filtering

Field-based filters are the most powerful and efficient search mechanism. They use extracted fields (from content packs or custom extraction) to narrow results precisely.

Operator Description Example
= Exact match vmw_host = esxi01.lab.local
!= Not equal vmw_esxi_vpxa_status != running
CONTAINS Substring match text CONTAINS "certificate expired"
NOT CONTAINS Excludes substring text NOT CONTAINS "debug"
MATCHES Regex match text MATCHES "err(or|no)\s\d+"
>, <, >=, <= Numeric comparison response_time > 5000
EXISTS Field has a value vmw_vc_vm_name EXISTS
NOT EXISTS Field is absent custom_field NOT EXISTS

Tip: Combine multiple field filters with AND / OR for complex investigations:

vmw_host = esxi01.lab.local AND appname = "vmkernel" AND text CONTAINS "NMP" AND severity = "warning"

25.4 Field Extraction

Built-in Static Fields

Every log message ingested by Operations for Logs automatically receives the following static fields, regardless of content packs:

Field Description Example Value
timestamp Time the event was generated (from syslog header or agent) 2026-03-20T14:32:01.000Z
source Hostname or IP of the sending device esxi01.lab.local
appname Application name (from syslog header APP-NAME field) vpxd, hostd, vmkernel
facility Syslog facility code local0, daemon, kern
severity Syslog severity level info, warning, error, critical
text Full message body (everything after the syslog header) (variable)

Dynamic Extracted Fields

Content packs define additional fields that are extracted at query time (or at ingest time, depending on configuration). For example, the vSphere content pack extracts:

One-Click Field Extraction

For logs not covered by existing content packs, you can create custom extracted fields interactively.

Procedure:

  1. In Explore Logs, locate a log entry that contains the data you want to extract.
  2. In the log message text, highlight the specific value you want to capture (e.g., an error code, a username, or a response time).
  3. A tooltip appears — click Extract Field.
  4. The Field Extraction dialog opens with the following settings:
  5. Click Preview to test the extraction against a sample of recent logs. Review the extracted values for accuracy.
  6. Adjust the regex or context patterns if the preview shows incorrect extractions.
  7. Click Save.
  8. The new field is immediately available for use in queries, dashboards, and alerts.

Warning: Custom extracted fields consume CPU during query execution. Avoid creating overly broad regex patterns that match unintended log messages. Test thoroughly using the Preview function before saving.

25.5 Regex Syntax

Operations for Logs uses different regex engines depending on the context:

Context Regex Engine Notes
UI queries and field extraction Java regex (java.util.regex) Double-escape backslashes in the UI: \\d+
Agent file parsing (liagent.ini) C++ Boost regex Standard PCRE-like syntax
API queries Java regex Same as UI

Common Regex Patterns

Pattern Purpose Regex
IPv4 address Match IP addresses in log text \\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b
MAC address Match MAC addresses [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
ISO timestamp Match ISO 8601 timestamps \\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}
HTTP status code Match 3-digit HTTP codes HTTP/\\d\\.\\d\\s+(\\d{3})
Email address Match email addresses [\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,}
Windows SID Match Windows Security Identifiers S-\\d-\\d+-[\\d-]+
UUID / GUID Match UUIDs [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}

Tip: When building regex patterns in the UI, use the Preview function to validate against live data. Start with a broad pattern and refine it iteratively. Named capture groups (?<fieldname>...) are supported for multi-field extraction from a single pattern.
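A pattern can also be dry-run locally with GNU grep before it is pasted into the UI. Note the translation: grep's ERE syntax takes single backslashes and has no \d shorthand, so [0-9] stands in for the doubled \\d shown in the table above:

```shell
# Validate the table's IPv4 pattern against sample log lines with GNU grep -E.
# UI form:   \\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b
# grep form: \b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b
printf '%s\n' \
  'vmkernel: device at 10.20.30.40 not responding' \
  'vpxd: task completed for vm web-01' |
  grep -E '\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
# prints only the line containing 10.20.30.40
```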

25.6 Intelligent Grouping (ML)

Intelligent Grouping is a machine-learning feature that automatically clusters structurally similar log messages, ignoring variable components like IP addresses, timestamps, UUIDs, and numeric values.

Accessing Intelligent Grouping:

  1. Navigate to Explore Logs.
  2. Optionally enter a filter (e.g., source = esxi01.lab.local) and set the time range.
  3. Click the Event Types tab (in some versions, labeled Intelligent Grouping).

How It Works:

Use Cases:

Interacting with Groups:

25.7 Saved Queries

Saved queries preserve search criteria — keywords, field filters, time range preferences, and selected fields — for reuse without re-entering the query each time.

Saving a Query:

  1. In Explore Logs, construct and execute the desired query.
  2. Click the Save icon (disk icon) or select Save Query from the actions menu.
  3. Enter a Query Name (e.g., ESXi PSOD Events - All Hosts).
  4. Optionally enter a Description explaining the query's purpose.
  5. Set Visibility:
  6. Click Save.

Using Saved Queries:

  1. Navigate to Explore Logs.
  2. Click the Saved Queries dropdown (or the folder icon in the query bar).
  3. Select the desired query. The query criteria are loaded into the search bar and executed automatically.

Managing Saved Queries:

Best Practice: Establish a naming convention for shared saved queries (e.g., [Team] - [Description]) to keep the query library organized as it grows. Periodically review and prune unused saved queries to maintain clarity.

Chapter 26 — Dashboards and Alerts (Operations for Logs)

26.1 Creating Dashboards from Queries

Operations for Logs provides two methods to create dashboards: promoting a query result directly from Explore Logs, or building a dashboard from scratch in the Dashboards section.

Method 1 — Promote from Explore Logs:

  1. Navigate to Explore Logs and execute a query that returns the data you wish to visualize.
  2. Configure the visualization by selecting the chart type (pie, bar, line, column), grouping field, and time range from the toolbar above the results pane.
  3. Click Add to Dashboard in the upper-right corner of the visualization pane.
  4. In the dialog, select an existing dashboard from the dropdown or click Create New Dashboard and provide a name.
  5. Enter a descriptive widget name that clearly identifies the data being displayed (e.g., "ESXi Error Rate — Last 24 Hours").
  6. Click Save. The widget is added to the selected dashboard and begins refreshing at the dashboard's configured interval.

Method 2 — Build from Scratch:

  1. Navigate to Dashboards from the main navigation menu.
  2. Click New Dashboard.
  3. Enter a dashboard name and optional description. Choose a sharing scope (Private, Shared with specific users, or Public).
  4. Click Add Widget on the empty dashboard canvas.
  5. Select the widget type from the widget picker (see Section 26.2 for available types).
  6. Configure the widget by entering a query, selecting fields, setting the time range, and choosing display options such as chart colors and axis labels.
  7. Click Save Widget, then click Save Dashboard to persist the layout.

Tip: Dashboards auto-refresh at configurable intervals (30 seconds, 1 minute, 5 minutes, 15 minutes, or manual). Set the refresh interval using the clock icon in the dashboard toolbar.

26.2 Widget Types

Operations for Logs supports a variety of widget types, each optimized for different analytical use cases.

Widget Type Description Best Use Case
Chart (Pie) Proportional breakdown of values as a circular chart Distribution of log sources, error types by category
Chart (Bar) Horizontal category comparison bars Top 10 error-generating hosts, busiest log sources
Chart (Line) Time-series trend line with data points Log volume over time, error rate trends, ingestion throughput
Chart (Column) Vertical bars for period-based comparison Hourly event counts, daily log volume comparison
Chart (Gauge) Single metric displayed as a gauge dial Current ingestion rate, active alert count
Field Table Tabular data view with sortable columns Detailed event listing with extracted fields
Query List List of saved queries displayed as clickable links Quick-access navigation panel for analysts
Event Types Breakdown of machine-learning-grouped event categories ML-classified event distribution
Event Trends Sparkline trend charts for each event type At-a-glance trend overview per event category

Widget Configuration Options:

26.3 Alert Definitions

To create an alert, navigate to Alerts → Alert Definitions → New Alert. Operations for Logs provides four trigger condition types, each suited to a different monitoring pattern.

Trigger Condition Type 1 — On Every Match

Trigger Condition Type 2 — Total Count

Trigger Condition Type 3 — Unique Count

Trigger Condition Type 4 — Aggregation

26.4 Alert Configuration

Each alert definition includes the following configuration fields:

Field Description Required
Name Descriptive name for the alert (e.g., "ESXi PSOD Detection") Yes
Description Detailed description of the alert purpose and expected response No
Query The log query that defines which events are evaluated Yes
Trigger Condition One of the four types described in Section 26.3 Yes
Frequency How often the alert query is evaluated: 1, 5, 15, 30, or 60 minutes Yes
Raise an Alert When to generate the alert: First occurrence only, Every time the condition is met, or Once per time window Yes
Notification Select one or more notification channels (email or webhook) No
Enable/Disable Toggle to activate or deactivate the alert without deleting it Yes

Best Practice: Set the alert frequency to be shorter than or equal to the trigger time window. For example, if the time window is 5 minutes, set the frequency to 5 minutes or less. This ensures no events are missed between evaluation cycles.
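When alert definitions are managed by script (for example, via the API described in Chapter 30), the frequency-versus-window rule can be encoded as a one-line guard; freq_ok is an illustrative helper, not a product command:

```shell
# Guardrail: the evaluation frequency (minutes) must not exceed the trigger
# time window (minutes), or events between evaluation cycles can be missed.
freq_ok() {
  # $1 = evaluation frequency in minutes, $2 = trigger window in minutes
  [ "$1" -le "$2" ]
}

freq_ok 5 5   && echo "ok: 5-minute frequency covers a 5-minute window"
freq_ok 15 5  || echo "gap: 15-minute frequency misses events in a 5-minute window"
```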

26.5 Snoozing Alerts

When an alert becomes temporarily noisy — for example, during a planned maintenance window — you can snooze it rather than disabling it entirely.

  1. Navigate to Alerts → Triggered Alerts.
  2. Locate the noisy alert and click the Snooze button (clock icon).
  3. Select a snooze duration from the dropdown:
  4. Click Confirm. The alert stops generating notifications for the selected duration.
  5. After the snooze period expires, the alert automatically resumes evaluation and notification.

Snoozed alerts display a clock icon and remaining snooze time in the Triggered Alerts list. You can cancel a snooze early by clicking Unsnooze on the alert.

26.6 Notification Channels

Operations for Logs supports two primary notification channel types: Email (SMTP) and Webhooks.

Email Notification Configuration

  1. Navigate to Administration → SMTP Configuration.
  2. Configure the following fields:
Field Example Value
SMTP Server smtp.lab.local
Port 587 (TLS) or 25 (unencrypted)
Use TLS Enabled
From Address vrli-alerts@lab.local
Username vrli-smtp-user
Password (SMTP authentication password)
  3. Click Test to send a test email, then click Save.
  4. In each alert definition, add one or more email recipients in the Notification section.

Webhook Notification Configuration

  1. Navigate to Administration → Webhooks.
  2. Click New Webhook.
  3. Configure the webhook:
Field Description
Name Descriptive name (e.g., "Slack-Ops-Channel")
URL Target endpoint URL
Content Type application/json (default)
Payload Template JSON body with placeholder variables
  4. Click Test to verify connectivity, then click Save.

Common Webhook Targets:

Target URL Format Notes
Slack https://hooks.slack.com/services/T.../B.../xxx Use Slack Incoming Webhook URL
PagerDuty https://events.pagerduty.com/v2/enqueue Use PagerDuty Events API v2 integration key
Aria Automation https://<vra-fqdn>/csp/gateway/am/api/... Trigger workflow via REST webhook
ServiceNow https://<instance>.service-now.com/api/now/table/incident Create incident via REST API
Custom Any https:// endpoint Configurable HTTP method, headers, body template

Available Payload Variables:


Chapter 27 — Integration with VCF Operations

27.1 Two Integration Methods

VCF Operations and VCF Operations for Logs are designed to work together as a unified observability platform. Two complementary integration methods connect the products:

  1. Notification Events — VCF Operations for Logs sends alert notifications to VCF Operations, creating corresponding alert objects that appear alongside metric-based alerts. This enables a single-pane-of-glass view of both metric and log-based anomalies.

  2. Launch in Context — From VCF Operations, operators can click on any monitored object and open its associated logs directly in Operations for Logs. The log view is automatically pre-filtered to show only events from the selected object and time range, eliminating the need to manually construct queries.

27.2 Configure Notification Integration (Operations for Logs to VCF Operations)

This integration pushes alert data from Operations for Logs into VCF Operations.

Step-by-step on the Operations for Logs side:

  1. Navigate to Administration → Integrations → VMware Aria Operations.
  2. Enter the VCF Operations cluster VIP hostname or FQDN (e.g., vrops-vip.lab.local).
  3. Enter credentials for a VCF Operations user with administrative privileges.
  4. Click Test Connection to verify network connectivity and authentication.
  5. Click Save to persist the configuration.
  6. Enable the Send alerts to VMware Aria Operations toggle. When enabled, all triggered alerts in Operations for Logs are forwarded as alert objects to VCF Operations.

Step-by-step on the VCF Operations side:

  1. Navigate to Administration → Integrations → Accounts.
  2. Verify that the VMware Aria Operations for Logs adapter instance appears in the adapter list.
  3. Check the adapter status indicator:

Note: Alerts forwarded from Operations for Logs appear under the Log Analytics alert type in VCF Operations. They can be viewed, acknowledged, and cancelled using the same alert management workflows as native VCF Operations alerts.

27.3 Configure Launch in Context (VCF Operations to Operations for Logs)

This integration allows operators to open contextual log data from within the VCF Operations interface.

Step-by-step on the VCF Operations side:

  1. Navigate to Administration → Integrations → Accounts.
  2. Add a new account or edit the existing VMware Aria Operations for Logs account.
  3. Enter the Operations for Logs VIP URL: https://<vrli-vip-fqdn> (e.g., https://vrli-vip.lab.local).
  4. Enter service account credentials for Operations for Logs.
  5. Click Test Connection to verify HTTPS connectivity on port 443.
  6. Click Save.

Verification:

  1. Navigate to any monitored object in VCF Operations (e.g., an ESXi host).
  2. Open the object detail page.
  3. Click the Logs tab.
  4. The Logs tab should display recent log events from Operations for Logs, pre-filtered to the selected object.
  5. Click any log entry or click Launch in Context to open a full Operations for Logs session with the query pre-populated.

27.4 VCF Operations Content Pack for Logs

A dedicated content pack enables Operations for Logs to parse, extract, and visualize logs generated by VCF Operations itself.

Installation:

  1. In Operations for Logs, navigate to Content Packs → Marketplace.
  2. Search for Aria Operations or vRealize Operations.
  3. Click Install on the VMware Aria Operations content pack.
  4. The content pack installs extracted fields, saved queries, and dashboards specific to VCF Operations log data.

Included Content:

Content Type Count Examples
Extracted Fields 15+ vrops_component, vrops_alert_name, vrops_adapter_kind
Saved Queries 10+ "VCF Operations Errors — Last 24h", "Analytics Engine Warnings"
Dashboards 3 "VCF Operations Health", "Adapter Collection Status", "Audit Trail"
Alerts 5 "VCF Operations Service Crash", "Collector Disconnected"

27.5 Log Analysis in VCF Operations

Once both integration methods are configured, the following capabilities become available in VCF Operations:


Chapter 28 — Log Forwarding and Archiving

28.1 Log Forwarding to External Systems

Operations for Logs can forward received logs to other systems for compliance archival, SIEM integration, or multi-site aggregation. Forwarding is asynchronous and adds no significant overhead to the cluster. Three forwarding protocols are supported:

Protocol Description Use Case
Ingestion API (CFAPI) Forward using the native Operations for Logs ingestion API format Forward to another Operations for Logs instance for multi-site aggregation
Syslog Forward as standard syslog messages over TCP, UDP, or TLS Forward to SIEM platforms (Splunk, QRadar, ArcSight), syslog servers
RAW Forward the original raw log data without transformation Preserve exact original format for compliance or forensic archives

28.2 Forwarding Configuration

Step-by-step:

  1. Navigate to Administration → Log Forwarding.
  2. Click New Destination.
  3. Configure the forwarding destination:
Field Description Example
Name Descriptive name for the destination SIEM-Splunk-Prod
Destination Host FQDN or IP address of the target system splunk-hec.lab.local
Protocol Syslog (TCP/UDP/TLS), CFAPI, or RAW Syslog (TLS)
Port Port number appropriate for the selected protocol 6514
  4. Filter (optional): Configure filters to forward only specific log data:

  5. Tags (optional): Add or modify tags on events before forwarding. This allows the receiving system to identify forwarded events.

  6. Click Test to verify connectivity to the destination, then click Save.

Note: Each cluster supports up to 10 forwarding destinations. Forwarding operates asynchronously from the ingestion pipeline — destination outages do not affect log ingestion or indexing. Events are buffered and retried if the destination is temporarily unreachable.

28.3 NFS Archive Setup

Operations for Logs can archive log data to an NFS share for long-term retention beyond the active index capacity.

Step-by-step:

  1. Navigate to Administration → Archiving.
  2. Click Enable Archiving.
  3. Enter the NFS mount path using the format: nfs://<server>/<share>
  4. Set the archive frequency (default: daily).
  5. Click Test Mount to verify NFS connectivity and write permissions.
  6. Click Save.

Archive Behavior:

Aspect Detail
When data is archived After it ages out of the active index (based on partition retention)
Archive format Compressed JSON files organized by date
Searchability Archived data is not searchable directly — must be re-ingested to query
NFS version requirement NFSv3
Permissions Read/write access required from all cluster nodes
Mount validation All nodes must successfully mount the NFS share

Important: Ensure the NFS share has sufficient capacity for long-term storage. A cluster ingesting 50 GB/day will generate approximately 15–20 GB/day of compressed archive data.

28.4 Retention Policies (Index Partitions)

Index partitions allow you to apply different retention periods to different categories of log data. This enables longer retention for compliance-critical logs (e.g., security audit events) while using shorter retention for high-volume operational logs.

Configuration:

  1. Navigate to Administration → Index Partitions.
  2. Click New Partition.
  3. Configure the partition:
Field Description Example
Name Partition identifier Security-Logs
Retention Period Number of days to retain indexed data 90
Filter Criteria determining which logs are routed to this partition appname CONTAINS "sshd" OR appname CONTAINS "audit"
  4. Click Save. New log events matching the filter criteria are routed to this partition and retained for the specified period.

Important: Longer retention periods require proportionally more disk space. Plan the /storage/var disk on each node to accommodate the total data volume across all partitions. Use the formula: Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3 (index overhead).
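A quick sizing helper implementing the formula above (the function name is illustrative):

```shell
# Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3
partition_disk_gb() {
  # $1 = daily ingestion in GB, $2 = retention in days
  awk -v ingest="$1" -v days="$2" 'BEGIN { printf "%.0f\n", ingest * days * 1.3 }'
}

partition_disk_gb 50 90   # 50 GB/day, 90-day retention  -> 5850
partition_disk_gb 10 365  # compliance partition example -> 4745
```

Run this once per planned index partition and sum the results to size the /storage/var disk on each node.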


Chapter 29 — Appliance Administration

29.1 Key Log Files on the Appliance

The Operations for Logs appliance stores its own operational logs in well-defined paths. Familiarity with these files is essential for troubleshooting appliance issues.

Log File Path Purpose Rotation
Core Application /var/log/loginsight/runtime.log Main application log — startup, shutdown, errors Daily
API / UI /var/log/loginsight/api_runtime.log API request logs, UI backend errors Daily
Ingestion /var/log/loginsight/ingestion.log Syslog and agent ingestion pipeline Daily
Cassandra /var/log/loginsight/cassandra.log Index database operations and errors Daily
Audit /var/log/loginsight/audit.log User actions, login events, configuration changes Daily
Watchdog /var/log/loginsight/watchdog.log Service health monitoring and auto-restart events Daily
System /var/log/messages OS-level syslog messages Weekly
Apache Reverse Proxy /var/log/loginsight/apache/ Reverse proxy access and error logs Daily
Upgrade /var/log/loginsight/upgrade.log Upgrade process log with step-by-step progress Per upgrade

29.2 Service Management Commands

The Operations for Logs appliance runs on VMware Photon OS (earlier vRealize Log Insight releases shipped on a SUSE Linux Enterprise Server base). Services are managed via systemctl.

# Check overall service status
systemctl status loginsight

# Restart the main Operations for Logs service
systemctl restart loginsight

# Check Cassandra index database status
systemctl status loginsight-cassandra

# Check watchdog service (monitors and auto-restarts crashed services)
systemctl status loginsight-watchdog

# View real-time service logs
journalctl -u loginsight -f

# Check disk usage on storage partition
df -h /storage/var

# Check cluster node connectivity
curl -k https://localhost:9543/api/v2/version

Warning: Restarting the loginsight service causes a brief ingestion interruption on that node. In a cluster, agents and syslog sources connected to the restarted node temporarily buffer events and reconnect to another node via the ILB VIP.

29.3 Admin Password Reset (CLI)

If the admin password is lost and UI access is not possible, reset it from the appliance command line:

# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Navigate to the application sbin directory
cd /usr/lib/loginsight/application/sbin

# Execute the password reset script
./li-reset-admin-passwd.sh

# Follow the interactive prompts to set a new admin password
# Services restart automatically after the password is reset

Note: This procedure resets the local admin account password only. It does not affect Active Directory or VIDM-integrated accounts. The password reset requires SSH access to the primary node as root.

29.4 Logging Level Configuration

Adjusting the internal logging level can help diagnose appliance issues.

  1. Navigate to Administration → General → Logging Level.
  2. Select the desired level from the dropdown:
Level Volume Use Case
Error Minimal Production — only critical failures
Warning Low Production — failures and potential issues
Info Moderate (default) Normal operations — recommended for production
Debug High Active troubleshooting — detailed diagnostic output
Trace Very High Deep troubleshooting — full method-level tracing
  3. Click Save. The change takes effect immediately with no service restart required.

Important: Set the logging level to Debug or Trace only temporarily during active troubleshooting. These levels significantly increase log volume and can fill the /storage/var partition if left enabled. Always return to Info after troubleshooting is complete.

29.5 Support Bundle Generation

A support bundle collects diagnostic information required by Broadcom support for troubleshooting appliance issues.

UI Method:

  1. Navigate to Administration → Support Bundle.
  2. Click Generate Bundle.
  3. Select the scope:
  4. Wait for bundle generation to complete (typically 2–10 minutes depending on log volume).
  5. Click Download to save the bundle as a ZIP file.
  6. Attach the ZIP file to the Broadcom support ticket (SR).

CLI Method:

# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Generate the support bundle
/usr/lib/loginsight/application/sbin/li-support-bundle.sh

# Output location:
# /tmp/li-support-bundle-<timestamp>.tar.gz

# Transfer the bundle to your workstation
scp root@vrli-primary.lab.local:/tmp/li-support-bundle-*.tar.gz .

Bundle Contents:


Chapter 30 — Operations for Logs API

30.1 Base URL

All Operations for Logs API calls use HTTPS on port 9543 (or HTTP on port 9000 for non-production environments). The base URL format is:

https://<vrli-vip-fqdn>:9543/api/v2/

Replace <vrli-vip-fqdn> with the cluster VIP FQDN or individual node FQDN. All examples in this chapter use vrli-vip.lab.local as the target.

30.2 Authentication

All API calls (except /api/v2/sessions) require a valid session token. Obtain a token by authenticating against the sessions endpoint:

# Obtain a session token
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/sessions" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "YourPassword123!",
    "provider": "Local"
  }'

# Response:
# {
#   "userId": "a1b2c3d4-...",
#   "sessionId": "abc123def456...",
#   "ttl": 1800
# }

Use the returned sessionId value as a Bearer token in subsequent requests:

Authorization: Bearer abc123def456...
Field Description
userId Unique identifier of the authenticated user
sessionId Session token — valid for ttl seconds
ttl Time-to-live in seconds (default 1800 = 30 minutes)
provider Authentication provider: Local, ActiveDirectory, or vidm

Note: Tokens expire after the TTL period. For long-running automation scripts, implement token refresh logic that re-authenticates before the TTL expires.
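A sketch of that refresh logic in shell, reusing the endpoint and credentials from the example above; needs_refresh is an illustrative helper that renews the token 60 seconds before the TTL expires:

```shell
# Decide whether a session token should be renewed before the next request.
needs_refresh() {
  # $1 = current epoch seconds, $2 = epoch seconds when token was acquired,
  # $3 = ttl in seconds. Refresh 60 s early so a token never expires mid-request.
  [ $(( $1 - $2 )) -ge $(( $3 - 60 )) ]
}

# Re-authentication sketch (commented out — requires a live cluster):
# TOKEN=$(curl -sk -X POST "https://vrli-vip.lab.local:9543/api/v2/sessions" \
#   -H "Content-Type: application/json" \
#   -d '{"username":"admin","password":"...","provider":"Local"}' |
#   python3 -c 'import sys,json; print(json.load(sys.stdin)["sessionId"])')
# TOKEN_ACQUIRED=$(date +%s)

needs_refresh 1711001750 1711000000 1800 && echo "refresh now"
# prints: refresh now (1750 s elapsed >= 1800 - 60)
```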

30.3 Ingestion API

Send log events programmatically using the ingestion endpoint. This is useful for forwarding application logs, CI/CD pipeline events, or custom monitoring data.

# Ingest a single event
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/events/ingest/0" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer abc123def456..." \
  -d '{
    "events": [
      {
        "text": "Application deployment completed successfully",
        "timestamp": 1711000000000,
        "fields": [
          {"name": "appname", "content": "deploy-pipeline"},
          {"name": "environment", "content": "production"},
          {"name": "build_number", "content": "1842"},
          {"name": "deploy_status", "content": "success"}
        ]
      }
    ]
  }'

Ingestion Payload Fields:

Field Type Required Description
text String Yes The log message body
timestamp Long No Event timestamp in epoch milliseconds (defaults to server receipt time)
fields Array No Array of key-value pairs for structured field extraction
fields[].name String Yes (if fields used) Field name
fields[].content String Yes (if fields used) Field value

Tip: The trailing path segment of /ingest/0 is an agent identifier that tags the sender of the events. Ad-hoc API clients can use 0; deployed integrations should send a stable, unique identifier (such as a UUID) per log source so the cluster can attribute events to their origin.
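The payload from the table above can be assembled by script before posting. A minimal sketch (build_event is an illustrative helper; it does not escape quotes or backslashes in its arguments, so a production script should use a proper JSON encoder):

```shell
# Build a one-event CFAPI ingestion payload: message text plus one
# name/content field pair, matching the schema documented above.
build_event() {
  # $1 = message text, $2 = appname field value
  printf '{"events":[{"text":"%s","fields":[{"name":"appname","content":"%s"}]}]}' "$1" "$2"
}

PAYLOAD=$(build_event "Deployment finished" "deploy-pipeline")
echo "$PAYLOAD"
# POST it with: curl -k -X POST .../api/v2/events/ingest/0 -d "$PAYLOAD" ...
```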

30.4 Query API

Retrieve log events and aggregated statistics programmatically.

Search Events:

# Simple keyword search — last 100 matching events
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=error&limit=100" \
  -H "Authorization: Bearer abc123def456..."

# Field-based query with time range
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=vmw_host%3Desxi01*&limit=50&start-time-ms=1711000000000&end-time-ms=1711086400000" \
  -H "Authorization: Bearer abc123def456..."

Query Parameters:

Parameter Type Description
q String Query string (URL-encoded)
limit Integer Maximum number of events to return (default 100, max 20000)
start-time-ms Long Start of time range in epoch milliseconds
end-time-ms Long End of time range in epoch milliseconds
order-by-direction String ASC or DESC (default DESC)
content-pack-fields String Include content pack extracted fields
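Queries containing spaces, quotes, or operators must be percent-encoded before being placed in the q parameter, as in the field-based example above. A minimal shell helper covering the characters used in this chapter's examples (a production script should use a full URL encoder):

```shell
# Percent-encode the characters that commonly appear in field queries.
# '%' is encoded first so already-encoded sequences are not double-mangled.
urlencode_query() {
  printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/=/%3D/g' -e 's/"/%22/g'
}

urlencode_query 'vmw_host = esxi01*'
# -> vmw_host%20%3D%20esxi01*
```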

Aggregated Events:

# Count events by source over the last hour, divided into 12 bins
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/aggregated-events/timestamp/LAST_HOUR?q=error&num-bins=12" \
  -H "Authorization: Bearer abc123def456..."

Aggregation Time Ranges:

Value Description
LAST_5_MINUTES Last 5 minutes
LAST_15_MINUTES Last 15 minutes
LAST_HOUR Last 60 minutes
LAST_6_HOURS Last 6 hours
LAST_24_HOURS Last 24 hours
LAST_3_DAYS Last 3 days
LAST_7_DAYS Last 7 days
LAST_30_DAYS Last 30 days
CUSTOM Use start-time-ms and end-time-ms

30.5 Full API Categories

The following table lists all available API endpoint categories in Operations for Logs v2 API.

# Category Endpoint Description
1 Sessions /api/v2/sessions Authentication — acquire and release tokens
2 Events /api/v2/events Query log events with filters and time ranges
3 Aggregated Events /api/v2/aggregated-events Statistical queries with time-bucketed aggregation
4 Ingest /api/v2/events/ingest Send log events via CFAPI
5 Alerts /api/v2/alerts Manage alert definitions (CRUD)
6 Content Packs /api/v2/content-packs Install, list, and manage content packs
7 Dashboards /api/v2/dashboards Create, update, delete dashboards and widgets
8 Groups /api/v2/groups Manage agent groups and group filters
9 Notifications /api/v2/notifications Manage notification channels
10 Users /api/v2/users User management (create, list, update, delete)
11 Roles /api/v2/roles Role management and permission assignment
12 Datasets /api/v2/datasets Index partition management
13 Cluster /api/v2/cluster Cluster topology, node status, and management
14 License Keys /api/v2/licensekeys License key management and status
15 SMTP /api/v2/notification/smtp Email notification server configuration
16 Webhooks /api/v2/notification/webhook Webhook endpoint configuration
17 Archiving /api/v2/archiving NFS archive configuration and status
18 Forwarding /api/v2/forwarding Log forwarding destination management
19 Agents /api/v2/agents Agent registration, status, and management
20 vSphere /api/v2/vsphere vSphere integration configuration
21 Spaces /api/v2/spaces Multi-tenancy space management
22 Certificates /api/v2/certificates TLS certificate management
23 Upgrades /api/v2/upgrades Appliance upgrade management
24 Support /api/v2/support Support bundle generation and download
25 Version /api/v2/version Product version and build information
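Category 1 (Sessions) underpins every other endpoint in the table: a POST to /api/v2/sessions exchanges credentials for the bearer token used on subsequent calls. A minimal Python sketch of constructing that request; the hostname is illustrative, and the exact body fields (notably the provider value) should be verified against the interactive docs for your release:

```python
import json
import urllib.request

def session_request(base: str, username: str, password: str,
                    provider: str = "Local") -> urllib.request.Request:
    """Build the POST that acquires a bearer token from /api/v2/sessions."""
    payload = json.dumps({
        "username": username,
        "password": password,
        "provider": provider,  # "Local" or "ActiveDirectory" (assumption)
    }).encode()
    return urllib.request.Request(
        f"{base}/api/v2/sessions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = session_request("https://vrli-vip.lab.local:9543", "admin", "secret")
# urllib.request.urlopen(req) would return the session/token response.
```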

30.6 Swagger/OpenAPI Documentation

Operations for Logs provides interactive API documentation accessible directly from the appliance:

https://<vrli-vip-fqdn>:9543/api/v2/docs

The interactive documentation lets you browse every endpoint category, inspect request and response schemas, and execute test calls directly against the appliance from the browser. The raw OpenAPI specification is available at:

https://<vrli-vip-fqdn>:9543/api/v2/docs/openapi.json

Tip: Use the OpenAPI specification to generate client libraries in Python, Go, Java, or PowerShell for automating Operations for Logs administration tasks.
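Before generating a full client, it can be useful to flatten the downloaded specification into a quick inventory of operations. A short Python sketch; the two-entry spec excerpt below stands in for the real openapi.json and is illustrative only:

```python
# Minimal excerpt standing in for the downloaded openapi.json (illustrative).
spec = {
    "openapi": "3.0.1",
    "paths": {
        "/events": {"get": {"summary": "Query events"}},
        "/sessions": {"post": {"summary": "Acquire token"}},
    },
}

def list_operations(spec: dict) -> list:
    """Flatten an OpenAPI spec into sorted (METHOD, path) pairs."""
    return sorted(
        (method.upper(), path)
        for path, ops in spec.get("paths", {}).items()
        for method in ops
    )
```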


Appendix A — VCF Operations Port Reference

The following table lists all network ports required for VCF Operations deployment and operation. Firewall rules must permit traffic on these ports between the listed source and destination components.

Port Protocol Direction Source Destination Purpose
443 TCP Inbound Browser / API Client VCF Operations Cluster VIP Web UI and REST API (HTTPS)
443 TCP Outbound VCF Operations Node vCenter Server vCenter adapter data collection
443 TCP Outbound VCF Operations Node NSX Manager NSX adapter data collection
443 TCP Outbound VCF Operations Node SDDC Manager SDDC Manager adapter data collection
443 TCP Outbound VCF Operations Node ESXi Hosts Direct ESXi metric collection
443 TCP Outbound Remote Collector vCenter / NSX / Targets Remote adapter data collection
443 TCP Outbound VCF Operations Node Broadcom Marketplace Management pack downloads
8543 TCP Inbound Remote Collector / Agents VCF Operations Cluster VIP Collector-to-cluster communication
7001 TCP Internal VCF Operations Node VCF Operations Node GemFire cache replication
1300–1399 TCP Internal VCF Operations Node VCF Operations Node Distributed cache range ports
10002 TCP Internal VCF Operations Node VCF Operations Node GemFire locator port
20002 TCP Internal VCF Operations Node VCF Operations Node xDB replication primary port
20003 TCP Internal VCF Operations Node VCF Operations Node xDB replication secondary port
4369 TCP Internal VCF Operations Node VCF Operations Node Erlang Port Mapper Daemon (epmd)
5433 TCP Internal VCF Operations Node VCF Operations Node PostgreSQL database replication
8080 TCP Localhost VCF Operations Node Localhost Internal application HTTP
9090 TCP Localhost VCF Operations Node Localhost Internal admin service
514 UDP Inbound Network Devices VCF Operations Node Syslog reception (optional)
162 UDP Inbound Network Devices VCF Operations Node SNMP trap reception
25 TCP Outbound VCF Operations Node SMTP Server Email notification delivery
587 TCP Outbound VCF Operations Node SMTP Server Email notification delivery (TLS)
123 UDP Outbound VCF Operations Node NTP Server Time synchronization

Note: For the complete and most current port requirements, consult the Broadcom Ports and Protocols tool at https://ports.broadcom.com/.
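When validating firewall rules against the table above, a quick TCP reachability check from the relevant source host can confirm each inbound or outbound path. A minimal Python sketch using only the standard library; the commented hostname is illustrative:

```python
import socket

def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Spot-check the inbound UI/API port on the cluster VIP (hostname illustrative):
# tcp_open("vcf-ops-vip.lab.local", 443)
```

Note that this only verifies TCP ports; the UDP entries in the table (514, 162, 123) cannot be confirmed with a connect test.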


Appendix B — Operations for Logs Port Reference

The following table lists all network ports required for VCF Operations for Logs deployment and operation.

Port Protocol Direction Source Destination Purpose
443 TCP Inbound Browser / API Client Operations for Logs VIP Web UI access (HTTPS)
514 TCP Inbound Syslog Sources Operations for Logs VIP Syslog ingestion (TCP)
514 UDP Inbound Syslog Sources Operations for Logs VIP Syslog ingestion (UDP)
6514 TCP Inbound Syslog Sources Operations for Logs VIP Syslog ingestion (TLS-encrypted)
1514 TCP Inbound ESXi Hosts Operations for Logs VIP ESXi SSL syslog forwarding
9000 TCP Inbound Log Insight Agents Operations for Logs VIP CFAPI ingestion (HTTP)
9543 TCP Inbound Log Insight Agents / API Clients Operations for Logs VIP CFAPI ingestion (HTTPS) + REST API
16520–16580 TCP Internal Operations for Logs Node Operations for Logs Node Cluster inter-node communication
59778 TCP Internal Operations for Logs Node Operations for Logs Node Thrift RPC inter-node calls
12543 TCP Internal Operations for Logs Node Operations for Logs Node Cassandra database communication
9200 TCP Internal Operations for Logs Node Operations for Logs Node Node indexing service
7000 TCP Internal Operations for Logs Node Operations for Logs Node Cassandra gossip protocol
7001 TCP Internal Operations for Logs Node Operations for Logs Node Cassandra SSL gossip protocol
123 UDP Outbound Operations for Logs Node NTP Server Time synchronization
25 TCP Outbound Operations for Logs Node SMTP Server Email notification delivery
587 TCP Outbound Operations for Logs Node SMTP Server Email notification delivery (TLS)
514 TCP Outbound Operations for Logs Node Syslog Destination Log forwarding (syslog)
443 TCP Outbound Operations for Logs Node VCF Operations VIP Alert notification integration
2049 TCP/UDP Outbound Operations for Logs Node NFS Server NFS archive mount

Note: Syslog ingestion on port 514 (both TCP and UDP) is enabled by default. Port 6514 (TLS) and port 1514 (ESXi SSL) require additional configuration in the appliance admin UI.


Appendix C — Complete Suite API Endpoint Reference

The VCF Operations Suite API provides programmatic access to all platform capabilities. The base path for all endpoints is:

https://<vrops-vip-fqdn>/suite-api/api/

# Category Base Path Key Operations
1 Authentication /suite-api/api/auth Acquire and release authentication tokens
2 Resources /suite-api/api/resources CRUD operations on monitored objects
3 Resource Kinds /suite-api/api/resourcekinds List and describe resource types
4 Adapter Kinds /suite-api/api/adapterkinds List and describe adapter types
5 Adapters /suite-api/api/adapters Manage adapter instances and credentials
6 Credentials /suite-api/api/credentials Create, update, and delete stored credentials
7 Alerts /suite-api/api/alerts Query, acknowledge, and cancel alerts
8 Alert Definitions /suite-api/api/alertdefinitions Create and manage alert definitions
9 Symptoms /suite-api/api/symptoms Query active symptom instances
10 Symptom Definitions /suite-api/api/symptomdefinitions Create and manage symptom definitions
11 Notifications /suite-api/api/notifications Manage notification rules and channels
12 Super Metrics /suite-api/api/supermetrics Create and manage super metric formulas
13 Policies /suite-api/api/policies Manage operational policies and assignments
14 Dashboards /suite-api/api/dashboards Create, clone, share, and delete dashboards
15 Reports /suite-api/api/reports Generate, schedule, and download reports
16 Report Definitions /suite-api/api/reportdefinitions Define report templates and layouts
17 Views /suite-api/api/views Create and manage data views
18 Tasks /suite-api/api/tasks Query and manage background tasks
19 Collector Groups /suite-api/api/collectorgroups Manage collector group assignments
20 Collectors /suite-api/api/collectors List and manage collector nodes
21 Audit /suite-api/api/audit Query audit log entries
22 Applications /suite-api/api/applications Application monitoring configuration
23 Deployment /suite-api/api/deployment Cluster deployment and scaling operations
24 Certificate /suite-api/api/certificate TLS certificate management
25 Cluster /suite-api/api/cluster Cluster topology and health
26 Versions /suite-api/api/versions Product version and build information
27 Content /suite-api/api/content Import and export content bundles
28 Events /suite-api/api/events Query and manage event timeline
29 Maintenance Schedules /suite-api/api/maintenanceschedules Schedule maintenance windows
30 Object Groups /suite-api/api/groups Manage built-in object groups
31 Custom Groups /suite-api/api/customgroups Create and manage custom object groups
32 Traversal Specs /suite-api/api/traversalspecs Define object relationship traversals
33 Relationships /suite-api/api/resources/{id}/relationships Query parent/child object relationships
34 Statistics /suite-api/api/resources/{id}/stats Retrieve metric data for a resource
35 Properties /suite-api/api/resources/{id}/properties Retrieve property values for a resource
36 Latest Statistics /suite-api/api/resources/{id}/stats/latest Retrieve the most recent metric values
37 Recommendations /suite-api/api/recommendations Query optimization recommendations
38 Cost /suite-api/api/costconfig Cost model and rate card configuration
39 Pricing /suite-api/api/pricing Pricing policy management
40 Capacity /suite-api/api/capacity Capacity analytics and projections
41 Reclamation /suite-api/api/reclamation Resource reclamation recommendations
42 Compliance /suite-api/api/compliance Compliance benchmark scoring
43 SDDC Health /suite-api/api/sddc SDDC-level health and status
44 vSAN /suite-api/api/vsan vSAN-specific health and capacity
45 Token /suite-api/api/auth/token Token-based authentication (acquire/validate)
46 Users /suite-api/api/auth/users User account management
47 Roles /suite-api/api/auth/roles Role and permission management

Note: All Suite API endpoints support JSON request and response bodies. Use Content-Type: application/json and Accept: application/json headers. Full Swagger documentation is available at https://<vrops-vip-fqdn>/suite-api/doc/swagger-ui.html.
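Authentication (categories 1 and 45) is the entry point for every Suite API call. A minimal Python sketch of building the token-acquisition request; the /acquire suffix and the Authorization header scheme noted in the comment reflect common vRealize Operations usage and should be confirmed against the Swagger documentation for your build:

```python
import json
import urllib.request

def acquire_token_request(base: str, username: str, password: str):
    """Build the POST for the Suite API token-acquire endpoint."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{base}/suite-api/api/auth/token/acquire",
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )

req = acquire_token_request("https://vrops-vip.lab.local", "admin", "secret")
# The returned token is then sent on subsequent calls, e.g.:
#   Authorization: OpsToken <token>
# (older builds use the vRealizeOpsToken scheme -- check your release's docs)
```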


Appendix D — OVA File Sizes and SHA256 Checksums

The following table lists the OVA appliance files used to deploy VCF Operations and related products. File sizes are approximate and vary by specific release version.

Product OVA Filename Approx. Size Notes
VCF Operations (Analytics Node) vRealize-Operations-Manager-Appliance-8.18.2.*.ova ~3.2 GB Primary, replica, and data node appliance
VCF Operations (Remote Collector) vRealize-Operations-Manager-Remote-Collector-*.ova ~1.8 GB Lightweight collection-only appliance
VCF Operations for Logs VMware-vRealize-Log-Insight-8.18.2.*.ova ~2.5 GB Log analytics node (primary and worker)
VCF Suite Lifecycle Manager VMware-vRealize-Suite-Lifecycle-Manager-*.ova ~4.5 GB Lifecycle management for the full VCF Operations suite
VCF Operations for Networks (Platform) VMware-vRealize-Network-Insight-*.ova ~3.0 GB Network analytics platform node
VCF Operations for Networks (Collector) VMware-vRealize-Network-Insight-Collector-*.ova ~2.0 GB Network flow and configuration collector

Checksum Verification:

Always verify the SHA256 checksum of downloaded OVA files against the values published on the Broadcom download portal before deployment.

# Linux / macOS
sha256sum vRealize-Operations-Manager-Appliance-8.18.2.*.ova

# Windows (PowerShell)
Get-FileHash -Algorithm SHA256 .\vRealize-Operations-Manager-Appliance-8.18.2.*.ova

Important: Deploying an OVA with a mismatched checksum may indicate a corrupted download or a tampered file. Re-download the OVA from the Broadcom support portal if the checksum does not match.
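For scripted deployments, the manual comparison above can be automated. A short Python sketch that streams the file (so multi-gigabyte OVAs never need to fit in memory) and compares the digest against the portal-published value:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Compute the SHA256 digest of a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    """Case-insensitive compare against the published checksum."""
    return sha256_of(path) == expected.strip().lower()
```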


Appendix E — Broadcom TechDocs Reference URLs

The following table provides direct links to key documentation and resources for VCF Operations and related products.

Resource URL
VCF Operations Documentation https://docs.vmware.com/en/VMware-Aria-Operations/index.html
VCF Operations for Logs Documentation https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/index.html
VCF Operations API Reference (Suite API) https://docs.vmware.com/en/VMware-Aria-Operations/8.18/aria-operations-api-guide/GUID-intro.html
VCF Operations Sizing Guide https://kb.vmware.com/s/article/2093783
VCF Operations Port Requirements https://ports.broadcom.com/
VCF 9.0 Release Notes https://docs.vmware.com/en/VMware-Cloud-Foundation/9.0/rn/vmware-cloud-foundation-90-release-notes/index.html
Broadcom Support Portal https://support.broadcom.com/
VCF Compatibility Matrix https://interopmatrix.vmware.com/
Broadcom Marketplace (Management Packs) https://marketplace.cloud.vmware.com/
VMware Knowledge Base https://kb.vmware.com/
VCF Operations for Logs API Reference https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/8.18/aria-operations-for-logs-api-guide/GUID-intro.html
VCF Operations Community Forums https://community.broadcom.com/vmware-tanzu/home

End of Document

VCF Operations & Operations for Logs — Complete Handbook v1.0 © 2026 Virtual Control LLC. All rights reserved.