VMware Cloud Foundation 9.0 | Broadcom
Author: Virtual Control LLC Copyright: © 2026 Virtual Control LLC. All rights reserved. Version: 1.0 — March 2026 Classification: Internal / Professional Reference
VCF Operations is the unified monitoring, capacity planning, and optimization platform for VMware Cloud Foundation environments. It provides real-time analytics across compute, storage, networking, and application layers, delivering intelligent workload placement, proactive alerting, and capacity forecasting to ensure the health and efficiency of your private cloud infrastructure.
The product now known as VCF Operations has undergone several name changes as VMware's portfolio evolved and Broadcom completed its acquisition of VMware. Understanding this lineage is critical when referencing older documentation, KB articles, and community resources.
| Year | Product Name | Context |
|---|---|---|
| 2012 | vCenter Operations Manager | Initial release as a standalone vCenter companion for performance analytics. |
| 2015 | vRealize Operations Manager (vROps) | Rebranded under the vRealize Suite umbrella. Versions 6.x through 8.x carried this name. This is the name most widely recognized in the VMware community. |
| 2022 | VMware Aria Operations | VMware unified its cloud management portfolio under the "Aria" brand. vRealize Operations Manager 8.10+ became Aria Operations. |
| 2024 | VCF Operations | Following the Broadcom acquisition of VMware, the Aria brand was retired. All products were realigned under the VMware Cloud Foundation (VCF) umbrella. VCF Operations is the current and official name. |
Important: When searching VMware Knowledge Base articles, use all three historical names —
vRealize Operations, Aria Operations, and VCF Operations — to ensure complete coverage of relevant results. Many KB articles have not yet been updated to reflect the latest naming.
The underlying technology, architecture, and API surface remain consistent across these name changes. A deployment upgraded from vRealize Operations 8.6 through Aria Operations 8.14 to VCF Operations 8.18.2 retains its full configuration, dashboards, super metrics, and alert definitions without requiring re-creation.
VCF Operations is not a single appliance — it is the anchor product within a broader operations suite. The following table lists all products in the VCF Operations family as of VCF 9.0:
| Product Name | Description | VCF 9.0 Version | Deployment Model |
|---|---|---|---|
| VCF Operations | Performance monitoring, capacity planning, optimization, and workload balancing for the entire VCF stack. | 8.18.2 | OVA appliance (analytics cluster) |
| VCF Operations for Logs | Centralized log management and analytics. Collects, indexes, and analyzes syslog and log data from all VCF components. | 8.18.2 | OVA appliance (standalone or cluster) |
| VCF Operations for Networks | Deep network visibility, micro-segmentation analytics, traffic flow analysis, and network topology mapping. Integrates with NSX. | 6.14 | OVA appliance |
| VCF Suite Lifecycle Manager | Lifecycle management for the VCF Operations suite. Handles deployment, upgrades, patching, and configuration drift for all suite products. | 8.18 | Embedded in SDDC Manager or standalone OVA |
| Cloud Proxies | Lightweight collectors deployed in remote sites or workload domains to collect data and forward it to the central analytics cluster. | Bundled with VCF Operations | OVA appliance (minimal footprint) |
All products in the suite share a common authentication framework, can be cross-launched from one another, and are managed through a unified lifecycle workflow in SDDC Manager.
VCF Operations employs a distributed analytics architecture designed for horizontal scalability and high availability. The architecture consists of the following tiers:
The analytics cluster is the core of VCF Operations. It processes incoming metrics, computes dynamic thresholds, evaluates alert conditions, and serves the user interface. A production analytics cluster consists of:
Collectors are responsible for gathering metrics from monitored endpoints and delivering them to the analytics cluster:
Adapters are modular plugins that define how VCF Operations connects to and collects data from specific endpoint types. Each adapter type understands the API, object model, and metrics of its target system. Key built-in adapters include:
Management packs extend VCF Operations with adapters, dashboards, alert definitions, and reports for third-party and additional VMware products. They are installed through the product UI or via the VCF Suite Lifecycle Manager.
In VCF 9.0, VCF Operations is a mandatory first-class component of the Cloud Foundation stack, not an optional add-on. Its integration with the broader VCF platform is deep and bidirectional:
VCF 9.0 introduces Fleet Manager as the next-generation deployment and lifecycle orchestrator. Fleet Manager automates the full VCF Operations deployment workflow:
SDDC Manager provides ongoing lifecycle management for VCF Operations:
In VCF 9.0, VCF Operations monitoring is enabled by default for every workload domain. When a new workload domain is created, SDDC Manager automatically:
This ensures complete observability from the moment a workload domain becomes operational, with no manual adapter configuration required.
Proper sizing of VCF Operations is critical to achieving reliable performance, accurate analytics, and timely alerting. Under-sizing leads to collection lag, delayed alerts, and UI timeouts. Over-sizing wastes management cluster resources. This chapter provides the authoritative sizing tables and prerequisite requirements.
VCF Operations uses five distinct node types. Each serves a specific role in the analytics architecture:
| Node Type | Role | Required? | Quantity |
|---|---|---|---|
| Primary | Master controller, xDB partition owner, API gateway, UI host. First node deployed. | Yes (exactly 1) | 1 |
| Replica | Hot standby for the primary node. Maintains a synchronized copy of all primary data and configuration. Automatically promoted if the primary fails. | No (required for HA) | 0 or 1 |
| Data | Expands analytics processing capacity and xDB storage. Added in pairs for balanced data distribution. | No (for scale-out) | 0, 2, 4, 6, or 8 |
| Remote Collector | Lightweight forwarder deployed near monitored endpoints. Collects metrics and sends them to the analytics cluster. Stores no data locally. | No (for remote sites) | 0+ |
| Cloud Proxy | Specialized remote collector for cloud-connected services and SaaS integrations. | No (for cloud use cases) | 0+ |
The VCF Operations OVA is deployed with one of five predefined size profiles. The size is selected during OVA deployment and cannot be changed after deployment without redeploying the node.
| Size | vCPUs | Memory (GB) | Disk (GB) | Maximum Objects | Use Case |
|---|---|---|---|---|---|
| Extra Small | 4 | 16 | 282 | 1,200 | Lab and proof-of-concept environments. Not recommended for production. |
| Small | 8 | 32 | 474 | 4,000 | Small production environments. Single workload domain with limited VM count. |
| Medium | 16 | 48 | 898 | 16,000 | Mid-size production environments. Multiple workload domains. Most common production size. |
| Large | 32 | 128 | 2,026 | 50,000 | Large enterprise environments. Many workload domains, multiple vCenter instances. |
| Extra Large | 48 | 512 | 4,014 | 100,000 | Very large enterprise or service-provider environments. Requires data nodes for scale. |
Note: The "Maximum Objects" column refers to the total count of monitored objects across all adapters — VMs, hosts, datastores, clusters, port groups, NSX objects, vSAN objects, and any objects from management packs. Use the formula:
Total Objects ≈ (VMs × 1.0) + (Hosts × 3.5) + (Clusters × 2.0) + (Datastores × 1.0) as a rough estimation starting point.
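The estimation formula and the size-profile ceilings above can be combined into a small sizing helper. This is an illustrative sketch: the weights and ceilings come from this chapter, while the 30% growth headroom factor is an assumption you should adjust to your own planning policy.

```python
# Rough object-count estimator for choosing a VCF Operations size profile.
# Weights follow the estimation formula above; the profile ceilings come
# from the "Maximum Objects" column of the size-profile table.

PROFILE_CEILINGS = [
    ("Extra Small", 1_200),
    ("Small", 4_000),
    ("Medium", 16_000),
    ("Large", 50_000),
    ("Extra Large", 100_000),
]

def estimate_objects(vms: int, hosts: int, clusters: int, datastores: int) -> float:
    """Apply the rough weighting: VMs x1.0, hosts x3.5, clusters x2.0, datastores x1.0."""
    return vms * 1.0 + hosts * 3.5 + clusters * 2.0 + datastores * 1.0

def recommend_profile(total_objects: float, headroom: float = 0.3) -> str:
    """Pick the smallest profile whose ceiling covers the estimate plus growth headroom.

    The 30% headroom is an illustrative assumption, not a product requirement.
    """
    needed = total_objects * (1 + headroom)
    for name, ceiling in PROFILE_CEILINGS:
        if needed <= ceiling:
            return name
    return "Extra Large + data nodes"

if __name__ == "__main__":
    total = estimate_objects(vms=3_000, hosts=120, clusters=12, datastores=80)
    print(f"Estimated objects: {total:.0f} -> {recommend_profile(total)}")
```

For example, an environment with 3,000 VMs, 120 hosts, 12 clusters, and 80 datastores estimates to roughly 3,500 objects, which lands in the Medium profile once headroom is applied.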
VCF Operations supports three cluster deployment models:
Critical Limitation: You cannot convert a Simple (single-node) deployment to an HA or CA deployment in place. The replica and data nodes must be deployed as fresh OVAs and joined to the primary, and a node that was initialized as a standalone instance cannot retroactively gain HA without redeploying the primary. Plan your cluster model before initial deployment.
Remote Collectors are deployed as separate OVAs with their own sizing profiles, independent of the analytics cluster node sizing:
| Size | vCPUs | Memory (GB) | Disk (GB) | Max Adapters | Max Objects | Use Case |
|---|---|---|---|---|---|---|
| Standard | 2 | 4 | 20 | 5 | 1,500 | Small remote sites, single vCenter endpoint. |
| Large | 4 | 16 | 20 | 15 | 10,000 | Large remote sites, multiple endpoints, or high-frequency collection. |
The VCF Operations appliance uses multiple disk partitions to separate data by function. Understanding these partitions is essential for troubleshooting disk-space alerts and planning NFS extension:
| Mount Point | Purpose | Grows With |
|---|---|---|
| `/` | Root filesystem. Operating system, appliance binaries, configuration files. | Static — does not grow significantly. |
| `/storage/db` | xDB distributed datastore. Primary storage for all collected metrics, properties, relationships, and computed analytics. | Object count and retention period. This is the largest and fastest-growing partition. |
| `/storage/log` | Application log files for all VCF Operations services. | Activity level and log verbosity settings. |
| `/storage/core` | Core dump files generated during application crashes. | Only grows when crashes occur. |
| `/storage/nfs` | Optional NFS mount point for offloading historical data or report storage. | Configured capacity of the NFS share. |
| `/storage/vcops/backup` | Local backup storage. Used by the built-in backup mechanism for configuration and data snapshots. | Backup frequency and retention count. |
Best Practice: Monitor disk usage on `/storage/db` closely. When this partition reaches 90% utilization, VCF Operations triggers a critical alert and may begin dropping the oldest data to prevent total disk exhaustion. Extend this partition by adding an NFS datastore or by deploying additional data nodes to distribute the storage load.
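A minimal sketch of the utilization check described above, for use in an external monitoring script. The 90% critical threshold comes from the text; the 80% early-warning level is an illustrative assumption.

```python
# Check /storage/db utilization against the 90% critical threshold
# described above. Run on (or against) the appliance filesystem.

import shutil

CRITICAL_PCT = 90.0
WARNING_PCT = 80.0   # assumed early-warning level, not a documented product value

def usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Percentage of the partition in use."""
    return 100.0 * used_bytes / total_bytes

def classify(path: str = "/storage/db") -> str:
    """Return 'critical', 'warning', or 'ok' for the partition holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: (total, used, free)
    pct = usage_percent(usage.used, usage.total)
    if pct >= CRITICAL_PCT:
        return "critical"  # VCF Operations may start dropping the oldest data
    if pct >= WARNING_PCT:
        return "warning"
    return "ok"
```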
The VCF Operations HTML5 UI is supported on the following browsers:
| Browser | Minimum Version | Notes |
|---|---|---|
| Google Chrome | 100+ | Recommended browser. Best performance and rendering. |
| Mozilla Firefox | 100+ | Fully supported. |
| Microsoft Edge (Chromium) | 100+ | Fully supported. Legacy Edge (EdgeHTML) is not supported. |
| Component | Supported Versions |
|---|---|
| ESXi | 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3 |
| vCenter Server | 7.0 U3+, 8.0, 8.0 U1, 8.0 U2, 8.0 U3 |
| Virtual Hardware Version | 19 (ESXi 7.0 U3) or 20/21 (ESXi 8.0+) |
VCF Operations requires specific network ports to be open between its nodes, monitored endpoints, and consuming services. Failure to open the correct ports results in collection failures, cluster communication breakdowns, or inaccessible UI. This chapter provides the complete port reference.
These ports must be open to the VCF Operations analytics cluster nodes from clients and external systems:
| Port | Protocol | Source | Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | Admin workstations, API clients, SDDC Manager, VCF Operations for Logs | VCF Operations cluster nodes | Primary UI access, REST API, Suite API, adapter data reception from Remote Collectors. This is the single most critical port. |
| 8543 | TCP (HTTPS) | Legacy API clients | VCF Operations cluster nodes | Legacy vRealize Operations API endpoint. Maintained for backward compatibility with older integrations and scripts. Deprecated — migrate to port 443. |
| 443 | TCP (HTTPS) | Remote Collectors, Cloud Proxies | VCF Operations cluster nodes | Data forwarding from remote collectors to the analytics cluster. Remote collectors push collected metrics to the cluster over this port. |
These ports must be open from the VCF Operations analytics cluster nodes (and Remote Collectors) to external endpoints:
| Port | Protocol | Source | Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | vCenter Server | vCenter adapter data collection. Retrieves VM, host, cluster, datastore, and resource pool metrics via the vSphere API. |
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | NSX Manager | NSX adapter data collection. Retrieves transport node, logical switch, edge, and firewall metrics. |
| 443 | TCP (HTTPS) | VCF Operations nodes / Remote Collectors | SDDC Manager | SDDC Manager adapter. Retrieves workload domain configuration, lifecycle events, and compliance status. |
| 443 | TCP (HTTPS) | VCF Operations nodes | Broadcom repository (online) | Downloading upgrade bundles, management packs, and content updates when connected to the internet. |
| 514 | TCP/UDP | VCF Operations nodes | Syslog server / VCF Operations for Logs | Forwarding VCF Operations application logs to a centralized syslog collector. |
| 25 | TCP (SMTP) | VCF Operations nodes | Mail server | Sending email notifications for alert triggers. Unencrypted SMTP. |
| 587 | TCP (SMTP/TLS) | VCF Operations nodes | Mail server | Sending email notifications for alert triggers over TLS-encrypted SMTP. Preferred over port 25. |
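The notification path in the table above (TLS-encrypted SMTP on port 587) can be sketched as follows. Host names, sender/recipient addresses, and credentials are placeholders, not values from this document.

```python
# Sketch: sending an alert notification over TLS-encrypted SMTP (port 587),
# the preferred option per the outbound-port table. All endpoints/credentials
# below are placeholders.

import smtplib
from email.message import EmailMessage

def build_alert_email(alert_name: str, severity: str, resource: str) -> EmailMessage:
    """Assemble a simple alert-notification message."""
    msg = EmailMessage()
    msg["From"] = "vcf-ops@lab.local"
    msg["To"] = "vmware-admins@lab.local"
    msg["Subject"] = f"[{severity}] {alert_name} on {resource}"
    msg.set_content(f"VCF Operations raised '{alert_name}' ({severity}) on {resource}.")
    return msg

def send_alert(msg: EmailMessage) -> None:
    # STARTTLS on port 587 upgrades the connection before credentials are sent,
    # avoiding the unencrypted port-25 path.
    with smtplib.SMTP("mail.lab.local", 587, timeout=10) as smtp:
        smtp.starttls()
        smtp.login("notifier", "placeholder-password")
        smtp.send_message(msg)
```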
These ports are used for communication between VCF Operations cluster nodes (primary, replica, and data nodes). They must be open bidirectionally between all cluster members:
| Port | Protocol | Purpose |
|---|---|---|
| 7001 | TCP | Cassandra (xDB) inter-node communication. Handles data replication, partition synchronization, and consistency management between cluster nodes. |
| 1300–1399 | TCP | GemFire distributed cache. Used for in-memory data grid communication, cache replication, and cluster state synchronization. The exact port within this range is assigned dynamically. |
| 10002 | TCP | Cluster controller RPC. The primary node uses this port to coordinate cluster operations — node joins, failover decisions, and configuration propagation. |
| 20002 | TCP | Analytics data synchronization. Distributes computed analytics results (dynamic thresholds, scores, capacity projections) across all cluster nodes. |
| 20003 | TCP | Cluster heartbeat. Used by the cluster health monitor to detect node failures. A missed heartbeat sequence triggers failover procedures. |
| 4369 | TCP | Erlang Port Mapper Daemon (epmd). Used by the RabbitMQ message broker embedded in each node for inter-node message routing. |
Important: All cluster-internal ports must have low latency (< 1 ms round-trip) and high bandwidth (1 Gbps minimum). Cluster nodes should not be separated by WAN links, firewalls with deep packet inspection, or load balancers. Place all cluster nodes on the same VLAN or Layer 2 segment.
These ports are bound to 127.0.0.1 (localhost) on each VCF Operations node. They do not require firewall rules because they are not accessible from the network. They are documented here for troubleshooting and security audit purposes:
| Port | Protocol | Purpose |
|---|---|---|
| 5433 | TCP | vPostgres embedded database. Stores appliance configuration, user accounts, roles, policies, and alert definitions. Not used for metric storage (that is xDB). |
| 8080 | TCP (HTTP) | Internal CaSA (Collector and Storage Aggregator) service. Handles internal metric routing between collector threads and the storage layer. |
| 9090 | TCP (HTTP) | Internal admin/health-check endpoint. Used by the appliance self-monitoring watchdog to verify service health. |
Remote Collectors have a simplified port profile because they do not run analytics or store data:
| Port | Protocol | Direction | Source → Destination | Purpose |
|---|---|---|---|---|
| 443 | TCP (HTTPS) | Outbound | Remote Collector → VCF Operations cluster | Forwarding collected metrics, properties, and relationship data to the analytics cluster. |
| 443 | TCP (HTTPS) | Outbound | Remote Collector → Monitored endpoints (vCenter, NSX, etc.) | Collecting data from monitored endpoints. The Remote Collector initiates all connections — endpoints never connect inbound to the collector. |
Remote Collectors do not expose any inbound ports. All communication is initiated outbound by the collector. This makes Remote Collectors ideal for deployment in DMZ or restricted network zones where inbound connections are prohibited.
When creating firewall rules for VCF Operations, follow these best practices:
Use FQDNs, not IP addresses, in firewall rules where possible. VCF Operations nodes may change IP addresses during disaster recovery or migration. FQDN-based rules are more resilient.
Restrict source addresses. Do not use any as the source for inbound port 443. Limit access to known admin workstation subnets, SDDC Manager, and Remote Collector IP ranges.
Enable stateful inspection. All VCF Operations connections are TCP-based and work correctly with stateful firewalls. Stateful inspection ensures return traffic is automatically permitted.
Do not use SSL decryption/inspection on traffic between VCF Operations cluster nodes. SSL interception between cluster members causes certificate validation failures and breaks cluster communication.
Test connectivity before deployment. Use curl -v https://<target>:443 from the VCF Operations node to verify that each required port is reachable before configuring adapters. Connection failures after adapter configuration are difficult to distinguish from credential or API errors.
Document all rules. Maintain a port matrix document that maps each firewall rule to its VCF Operations purpose. This accelerates troubleshooting when connectivity issues arise during maintenance windows or network changes.
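The pre-deployment connectivity test recommended above can also be scripted. This sketch checks plain TCP reachability only (it does not validate certificates or credentials); the host names in the list are illustrative examples drawn from this chapter's tables.

```python
# Pre-deployment connectivity check: verify each required TCP port is
# reachable before configuring adapters. Hosts below are placeholders.

import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

REQUIRED = [
    ("vcenter.lab.local", 443),  # vCenter adapter collection
    ("nsx.lab.local", 443),      # NSX adapter collection
    ("mail.lab.local", 587),     # TLS SMTP notifications
]

if __name__ == "__main__":
    for host, port in REQUIRED:
        state = "open" if is_port_open(host, port) else "BLOCKED or unreachable"
        print(f"{host}:{port} -> {state}")
```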
This chapter covers all deployment methods for VCF Operations — from the fully automated VCF 9.0 workflow to manual OVA deployment via the vSphere Client and command-line tools. Regardless of the deployment method, the end result is the same: a running VCF Operations appliance ready for initial configuration.
In a VCF 9.0 environment, VCF Operations deployment is orchestrated by SDDC Manager and Fleet Manager. This is the recommended deployment method for production VCF environments because it ensures consistency with the VCF Bill of Materials and integrates VCF Operations into the overall lifecycle management framework.
Prerequisite Validation — SDDC Manager validates that the management cluster has sufficient capacity (CPU, memory, storage) to host the VCF Operations appliance at the specified size.
OVA Acquisition — Fleet Manager retrieves the VCF Operations OVA from the configured software depot. In connected environments, this is the Broadcom online repository. In air-gapped environments, the OVA must be pre-staged in the local SDDC Manager depot.
OVA Deployment — Fleet Manager deploys the OVA to the management cluster's designated resource pool and datastore. Network configuration (IP, subnet, gateway, DNS, NTP) is injected via OVF properties derived from the deployment specification.
Appliance Boot and Self-Configuration — The appliance boots, applies the network configuration, generates initial self-signed certificates, and starts all core services. This takes approximately 10–15 minutes.
Registration — Fleet Manager registers the VCF Operations instance with SDDC Manager. This enables ongoing lifecycle management (upgrades, certificate rotation, health monitoring).
Initial Adapter Configuration — SDDC Manager automatically configures vCenter and NSX adapter instances for the management domain. If additional workload domains exist, adapters are configured for those as well.
Validation — Fleet Manager runs a post-deployment health check to confirm that all services are running, the UI is accessible, and initial data collection has started.
Note: The automated deployment flow always deploys a Medium-sized OVA by default. To override the size, edit the deployment specification JSON before initiating the workflow. Consult the SDDC Manager API documentation for the exact parameter path.
For environments not using VCF 9.0 automation, or when deploying additional nodes (replica, data), the OVA can be deployed manually through the vSphere Client.
Step 1 — Download the OVA
Download the VCF Operations OVA file from the Broadcom Support Portal (support.broadcom.com). Navigate to VMware Cloud Foundation → VCF Operations → Downloads. Select the version matching your VCF Bill of Materials.
Step 2 — Launch the Deploy OVF Template Wizard
- Log in to the vSphere Client (https://<vcenter-fqdn>/ui).
- Right-click the target cluster or host and select Deploy OVF Template.

Step 3 — Select the OVA Source
- Choose Local file and browse to the downloaded .ova file.

Step 4 — Name and Location
- Enter a VM name, e.g. vcf-operations-primary-01, and select the target folder.

Step 5 — Select a Compute Resource
- Select the management cluster or a specific host.

Step 6 — Review Details
- Verify the publisher, product, version, and size information before proceeding.

Step 7 — Configuration (Size Selection)
- Select the deployment size: Extra Small, Small, Medium, Large, or Extra Large.

Step 8 — Select Storage
- Select the target datastore and storage policy (e.g., vSAN Default Storage Policy or a custom policy).

Step 9 — Select Networks
- Map Network 1 to the target port group on the management VLAN.

Step 10 — Customize Template (OVF Properties)

This is the most critical page. Enter the following values:
| Property | Value | Notes |
|---|---|---|
| Hostname | `vcf-ops-01.lab.local` | Must match the DNS A record. FQDN is permanent. |
| IP Address | `10.0.10.50` | Static IP on the management VLAN. |
| Subnet Mask | `255.255.255.0` | Matches the management VLAN subnet. |
| Default Gateway | `10.0.10.1` | Management VLAN gateway. |
| DNS Server(s) | `10.0.10.10` | Comma-separated if multiple. |
| Domain Name | `lab.local` | DNS search domain. |
| NTP Server(s) | `10.0.10.10` | Must match the NTP source used by vCenter and ESXi. |
| Admin Password | (strong password) | Password for the admin user account. |
| Root Password | (strong password) | Password for the Linux root user on the appliance. |
Step 11 — Ready to Complete
- Review the summary of all settings and click Finish to start the deployment.

Step 12 — Power On
- When the deployment task completes, power on the VM and wait for first-boot initialization to finish (approximately 10–15 minutes).
For automated or scripted deployments, use the VMware ovftool command-line utility. This is useful for deploying multiple nodes in a cluster or for integrating VCF Operations deployment into infrastructure-as-code pipelines.
```shell
ovftool \
--name="vcf-operations-primary-01" \
--deploymentOption="medium" \
--diskMode="thin" \
--datastore="vsanDatastore" \
--network="Management-PG" \
--acceptAllEulas \
--allowExtraConfig \
--powerOn \
--prop:vami.DNS.VMware_Aria_Operations="10.0.10.10" \
--prop:vami.gateway.VMware_Aria_Operations="10.0.10.1" \
--prop:vami.ip0.VMware_Aria_Operations="10.0.10.50" \
--prop:vami.netmask0.VMware_Aria_Operations="255.255.255.0" \
--prop:vami.hostname="vcf-ops-01.lab.local" \
--prop:vami.NTP.VMware_Aria_Operations="10.0.10.10" \
--prop:vami.domain.VMware_Aria_Operations="lab.local" \
--prop:guestinfo.cis.appliance.root.password="VMware123!" \
--prop:guestinfo.cis.appliance.ssh.enabled="True" \
/path/to/vcf-operations-8.18.2.ova \
"vi://administrator@vsphere.local:password@vcenter.lab.local/Datacenter/host/Management-Cluster"
```
| Parameter | Description |
|---|---|
| `--deploymentOption` | OVA size profile: xsmall, small, medium, large, xlarge. |
| `--diskMode` | Disk provisioning: thin (recommended) or thick. |
| `--datastore` | Target datastore name on the destination host/cluster. |
| `--network` | Port group name to map Network 1 to. |
| `--prop:vami.DNS.*` | DNS server IP(s). |
| `--prop:vami.gateway.*` | Default gateway IP. |
| `--prop:vami.ip0.*` | Static IP address for the appliance. |
| `--prop:vami.netmask0.*` | Subnet mask. |
| `--prop:vami.hostname` | FQDN for the appliance. Must have a matching DNS record. |
| `--prop:vami.NTP.*` | NTP server IP(s). |
| `--powerOn` | Automatically power on the VM after deployment. |
Note: The OVF property names reference `VMware_Aria_Operations` because the OVA internal metadata still uses the Aria Operations naming convention. This does not affect functionality — it is simply the OVF property namespace.
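When deploying several nodes (primary, replica, data) from a pipeline, it can help to assemble the ovftool invocation programmatically rather than hand-editing shell scripts. The sketch below builds the same argument list as the command above; all values in the example dict are lab placeholders.

```python
# Build an ovftool argument list for a VCF Operations node from a config dict.
# Property names keep the VMware_Aria_Operations namespace used by the OVA.

NS = "VMware_Aria_Operations"

def build_ovftool_args(cfg: dict) -> list[str]:
    """Return the ovftool command as an argv list suitable for subprocess.run."""
    return [
        "ovftool",
        f"--name={cfg['name']}",
        f"--deploymentOption={cfg['size']}",
        "--diskMode=thin",
        f"--datastore={cfg['datastore']}",
        f"--network={cfg['network']}",
        "--acceptAllEulas",
        "--allowExtraConfig",
        "--powerOn",
        f"--prop:vami.ip0.{NS}={cfg['ip']}",
        f"--prop:vami.netmask0.{NS}={cfg['netmask']}",
        f"--prop:vami.gateway.{NS}={cfg['gateway']}",
        f"--prop:vami.DNS.{NS}={cfg['dns']}",
        f"--prop:vami.NTP.{NS}={cfg['ntp']}",
        f"--prop:vami.hostname={cfg['fqdn']}",
        cfg["ova_path"],
        cfg["target_uri"],
    ]

# Example: a replica node that reuses the primary's network settings.
replica = {
    "name": "vcf-operations-replica-01", "size": "medium",
    "datastore": "vsanDatastore", "network": "Management-PG",
    "ip": "10.0.10.51", "netmask": "255.255.255.0", "gateway": "10.0.10.1",
    "dns": "10.0.10.10", "ntp": "10.0.10.10", "fqdn": "vcf-ops-02.lab.local",
    "ova_path": "/path/to/vcf-operations-8.18.2.ova",
    "target_uri": "vi://administrator@vsphere.local:password@vcenter.lab.local"
                  "/Datacenter/host/Management-Cluster",
}
# import subprocess; subprocess.run(build_ovftool_args(replica), check=True)
```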
After the appliance boots and completes its initial self-configuration, the Virtual Appliance Management Interface (VAMI) is available for administrative tasks.
Open a browser and navigate to:
https://<node-fqdn>/admin
Log in with the admin user account and the password specified during OVA deployment.
Historical Note: In older versions (vRealize Operations 6.x–8.x), the VAMI was accessed on port 5480 (https://<node-fqdn>:5480). In current versions, the VAMI is integrated into the main web interface at the /admin path on port 443.
The VAMI provides appliance-level administrative functions such as network reconfiguration, certificate management, service control, and update staging.
On the first login to the VCF Operations UI (https://<node-fqdn>/ui), the Initial Setup Wizard guides you through the essential configuration. The wizard consists of seven steps, including setting the password for the admin user account (Step 4). After the wizard completes, the VCF Operations login page is displayed. Log in with admin and the password set in Step 4. The system is now ready for adapter configuration and monitoring setup (covered in subsequent chapters).
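Once the wizard completes, the appliance can also be driven programmatically through the Suite API on port 443. The sketch below builds the token-acquisition request used by the long-standing vRealize/Aria Operations Suite API; the FQDN and credentials are placeholders, and a live call requires trusting the appliance's (initially self-signed) certificate.

```python
# Build (and optionally send) a Suite API token-acquisition request.
# FQDN and credentials below are lab placeholders.

import json
import urllib.request

def build_token_request(fqdn: str, username: str, password: str) -> urllib.request.Request:
    """Prepare a POST to the Suite API token-acquire endpoint."""
    url = f"https://{fqdn}/suite-api/api/auth/token/acquire"
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )

req = build_token_request("vcf-ops-01.lab.local", "admin", "placeholder-password")
# On a live system (with the certificate trusted):
#   token = json.load(urllib.request.urlopen(req))["token"]
# Subsequent calls pass the token in the Authorization header; the scheme
# name varies by release ("OpsToken" vs. "vRealizeOpsToken" on older builds).
```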
A single-node VCF Operations deployment provides no fault tolerance — if the appliance fails, all monitoring, alerting, and capacity analytics are lost until the node is restored. For production environments, deploying a high availability (HA) cluster is strongly recommended. This chapter provides a detailed walkthrough of HA cluster configuration.
The primary node is deployed using the procedures described in Chapter 4 (Section 4.2 for vSphere Client or Section 4.3 for ovftool). Complete the Initial Setup Wizard (Section 4.5) with the New Installation option.
Before proceeding to replica deployment, verify the primary node is fully operational: confirm that you can log in to https://<primary-fqdn>/ui with the admin account.

Before deploying the replica node, ensure DNS records, NTP, and management VLAN connectivity are in place for the new node.
Deploy a second VCF Operations OVA using the same size profile as the primary node. Use the same deployment method (vSphere Client or ovftool) described in Chapter 4.
Key settings for the replica OVA:
| Setting | Value |
|---|---|
| VM Name | vcf-operations-replica-01 |
| Size | Must match the primary node (e.g., Medium) |
| IP Address | Different IP, same VLAN as primary (e.g., 10.0.10.51) |
| FQDN | Unique FQDN with matching DNS records (e.g., vcf-ops-02.lab.local) |
| Gateway, DNS, NTP | Identical to the primary node |
| Admin/Root Passwords | May differ from the primary, but using the same passwords simplifies administration |
Power on the replica OVA and wait for it to complete first-boot initialization (10–15 minutes).
Open a browser and navigate to https://<replica-fqdn>/ui.
The Initial Setup Wizard appears. On the Deployment Type page, select Expand an Existing Cluster.
Enter the primary node's FQDN or IP address: vcf-ops-01.lab.local

Click Validate. The wizard connects to the primary node and retrieves its certificate.
Accept the Certificate — Review the certificate thumbprint displayed. Verify it matches the primary node's certificate thumbprint (you can find this on the primary node under Administration → Cluster Management → Certificate). Click Accept.
Authenticate — Enter the admin credentials for the primary node.
Node Role Selection — Select Replica as the role for this node.
Click Next and then Finish to initiate the join process.
The join process takes approximately 15–25 minutes. During this time:
On the primary node, navigate to Administration → Cluster Management. The cluster status panel shows:
- Replica node state: Joining → Synchronizing → Online.
- Cluster state: Single Node → Preparing HA → Online (HA Enabled).

Do not modify any configuration or restart any services during the join process.
After the replica node joins successfully, the cluster requires activation to enable HA functionality:
On the primary node, navigate to Administration → Cluster Management.
The cluster status panel shows both nodes: the primary and the replica. Both should show status Online.
Click the Enable HA button (if it has not been automatically enabled during the join process).
A confirmation dialog appears: "Enabling High Availability will synchronize all data between the primary and replica nodes. This may temporarily impact performance during the initial synchronization. Do you want to continue?"
Click Yes to confirm.
The cluster enters the Synchronizing state. During synchronization:
When synchronization completes, the cluster status changes to Online (HA Enabled). This indicates:
To verify that HA is working correctly:
Navigate to Administration → Cluster Management → Status.
Confirm:
- Cluster mode: High Availability
- Primary node status: Online
- Replica node status: Online
- Data synchronization: Synchronized (no pending sync operations)

Check the Cluster Health dashboard (Dashboards → VCF Operations Self-Monitoring → Cluster Health). All health indicators should be green.
To prevent both cluster nodes from running on the same ESXi host (which would defeat the purpose of HA), create a DRS anti-affinity rule:
- Rule name: VCF-Operations-Anti-Affinity
- Rule type: Separate Virtual Machines
- Members: vcf-operations-primary-01 and vcf-operations-replica-01

DRS will automatically vMotion the VMs to separate hosts if they are currently co-located.
VCF Operations does not support converting a single-node (Simple) deployment to HA by adding a replica after the fact if the original deployment was initialized as a standalone instance. The supported path is to deploy the replica immediately after the primary, before significant historical data accumulates.
If you attempt to add a replica to a long-running standalone deployment, the join may succeed, but synchronization of historical data can take an extremely long time and may fail for very large datasets. Best practice: decide on your cluster model before initial deployment and deploy the replica immediately after the primary.
Once deployed, the IP address and FQDN of each node are embedded in the cluster configuration, certificates, and inter-node trust relationships. Changing the IP or FQDN of a cluster member requires:
This is disruptive and should be avoided. Plan IP addressing and DNS naming carefully before deployment.
All nodes in a cluster must use the same OVA size profile. You cannot mix a Medium primary with a Small replica or add Large data nodes to a Medium cluster. If you need to change the cluster size, you must redeploy all nodes.
Cluster-internal communication (xDB replication, heartbeat, GemFire cache synchronization) is latency-sensitive. All cluster nodes must be on the same Layer 2 network segment with:
- Round-trip latency under 1 ms
- At least 1 Gbps of bandwidth
- No intervening WAN links, load balancers, or firewalls performing deep packet inspection
Violating these requirements leads to split-brain scenarios, data inconsistency, and false failover events.
When the primary node fails, the cluster health monitor detects the missed heartbeat sequence and automatically promotes the replica to become the new primary.
During the failover window (approximately 5 minutes), the UI is unavailable and no new alerts are generated. Metric collection continues in the collector buffer and is flushed to the cluster once the new primary is operational.
For environments requiring even higher availability, deploy a witness node in addition to the primary and replica. The witness participates in quorum voting to prevent split-brain scenarios but does not store analytics data or serve the UI. The witness OVA is a separate, much smaller appliance; this primary-replica-witness topology is the basis of Continuous Availability (CA) mode.
CA mode ensures that the cluster continues to operate with zero data loss even if one node fails completely, by maintaining a quorum and synchronous data replication across all nodes.
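The quorum logic behind CA mode can be illustrated with a toy majority check: with three voters (primary, replica, witness), the cluster side that can still reach a strict majority of voters keeps operating, while an isolated node loses quorum and stops, preventing split-brain. This is a conceptual sketch, not product code.

```python
# Toy illustration of quorum voting in a primary/replica/witness topology.
# A partition is only allowed to continue if it holds a strict majority.

def has_quorum(reachable_voters: int, total_voters: int = 3) -> bool:
    """Strict majority: more than half of all voters must be reachable."""
    return reachable_voters > total_voters // 2

# All three up -> quorum; one node lost -> the surviving pair keeps quorum;
# an isolated single node -> no quorum, so it cannot act as primary.
assert has_quorum(3) and has_quorum(2) and not has_quorum(1)
```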
This chapter provides a comprehensive reference for the filesystem layout, service architecture, and operational commands used to manage VCF Operations (Aria Operations) appliances. Understanding these paths and services is essential for troubleshooting, backup planning, and day-to-day administration.
The VCF Operations appliance is built on Photon OS and follows a structured directory layout. The two primary mount points are the root filesystem (/) and the data partition (/storage/), which is sized according to the deployment profile selected during installation.
| Path | Purpose |
|---|---|
| `/usr/lib/vmware-vcops/` | Main application directory; contains binaries, libraries, and runtime components for all VCF Operations services. |
| `/usr/lib/vmware-vcops/user/conf/` | Application configuration files including `analytics.properties`, `collector.properties`, `gemfire.properties`, and adapter configuration XML files. |
| `/usr/lib/vmware-vcops/user/plugins/` | Management pack plugin directories. Each installed management pack places its adapter JAR files and descriptors here in a versioned subdirectory. |
| `/usr/lib/vmware-vcops/user/plugins/inbound/` | Inbound (data collection) adapter plugins. Contains subdirectories for each installed adapter such as `VMware_adapter3`, `PythonRemediationVcenterAdapter`, and third-party packs. |
| `/usr/lib/vmware-vcops/user/conf/ssl/` | SSL/TLS certificates and keystores used by the application, including the web server certificate (`cert.pem`), private key (`key.pem`), and trust stores. |
| `/usr/lib/vmware-vcops/user/conf/cassandra/` | Cassandra configuration directory containing `cassandra.yaml`, `cassandra-env.sh`, and related tuning files for the metrics datastore. |
| `/usr/lib/vmware-vcops/tomcat-enterprise/` | Apache Tomcat instance serving the REST API (`/suite-api`) and the administrative UI. Contains `conf/server.xml`, `webapps/`, and log directories. |
| `/usr/lib/vmware-vcops/tools/opscli/` | Operations CLI tooling. The primary entry point is `ops-cli.py`, used for adapter management, slice configuration queries, and cluster diagnostics. |
| `/usr/lib/vmware-vcops/support/` | Support and diagnostic scripts including `sliceConfiguration.sh`, `cleanupOps.sh`, and the support bundle generator `supportbundle.py`. |
| `/storage/db/` | Primary analytics database directory housing the FSDB (File System Database) and HIS (Historical) data stores. This is where time-series metric data resides. |
| `/storage/db/casa/` | CASA (Cluster Automated Services Architecture) database. Manages cluster membership, node roles, replication state, and slice ownership metadata. |
| `/storage/db/cassandra/` | Cassandra data directory for persisted metrics. Contains SSTables, commit logs, and saved caches. |
| `/storage/db/vcops/` | Core analytics working data, including dynamic threshold calculations, symptom state, and alert evaluation results. |
| `/storage/log/` | Application-level log files for all VCF Operations services. Primary troubleshooting location. Key files include `analytics.log`, `collector.log`, `api.log`, and `casa.log`. |
| `/storage/core/` | Core dump files generated during application crashes. Monitor disk usage here; large core dumps can fill the partition. |
| `/storage/nfs/` | Default NFS mount point for scheduled backup destinations. Must be pre-configured with appropriate NFS export permissions. |
| `/var/log/` | Operating system and VMware infrastructure service logs, including syslog, `messages`, the `vmware/` subdirectory, and Photon OS package manager logs. |
| `/var/vmware/` | VMware infrastructure service runtime data, including STS token caches and VMware Identity Manager working files. |
| `/opt/vmware/etc/` | vPostgres (VMware-bundled PostgreSQL) configuration files. Contains `postgresql.conf`, `pg_hba.conf`, and recovery configuration. |
| `/opt/vmware/vpostgres/` | vPostgres binary and library directory. The PostgreSQL instance used for alert definitions, user data, and report storage. |
Note: The `/storage/` partition is critical. If it reaches capacity, analytics processing halts and data collection stops. Monitor the partition with `df -h /storage/` and configure alerts for filesystem utilization exceeding 85%.
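The utilization check above can be scripted. The following is a minimal monitoring sketch, assuming POSIX `df` output; the `/storage` default and the 85% threshold follow the guidance in the note, and both are parameters you can tune:

```shell
# check_fs_usage: print a status line and return non-zero when a
# filesystem exceeds the utilization threshold.
check_fs_usage() {
    mount="${1:-/storage}"
    threshold="${2:-85}"
    # Take the Use% column from the data line and strip the % sign.
    used=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
    if [ "$used" -gt "$threshold" ]; then
        echo "WARNING: $mount at ${used}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $mount at ${used}%"
}
```

A typical use is a cron job that runs the function every few minutes and routes any WARNING line to your alerting channel.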
VCF Operations runs as a collection of interdependent services managed by systemd. The following table lists every core service, its function, and its expected default state on a healthy primary node.
| Service Name | Description | Default State |
|---|---|---|
| `vmware-vcops-analytics` | Core analytics engine responsible for dynamic threshold computation, symptom evaluation, alert generation, capacity modeling, and workload optimization calculations. | Running |
| `vmware-vcops-collector` | Data collection service that executes adapter instances, gathers metrics from monitored systems, and feeds raw data into the analytics pipeline. | Running |
| `vmware-vcops-api` | REST API and administrative UI service hosted on Tomcat. Serves the `/suite-api` endpoint and the HTML5 management interface on port 443. | Running |
| `vmware-casa` | Cluster Automated Services Architecture. Manages multi-node cluster topology, node membership, slice assignment, replication orchestration, and failover coordination. | Running |
| `vmware-vcops-gemfire` | Apache Geode (GemFire) distributed in-memory cache. Provides inter-node data sharing, real-time metric buffering, and distributed lock management across cluster nodes. | Running |
| `vmware-vcops-vpostgres` | VMware-packaged PostgreSQL database instance. Stores alert definitions, custom dashboards, super metrics, user accounts, report templates, and compliance data. | Running |
| `vmware-vcops-cassandra` | Apache Cassandra metrics storage engine. Provides the persistent time-series datastore for all collected metrics and properties. | Running |
| `vmware-vcops-watchdog` | Service watchdog daemon. Monitors the health of all other VCF Operations services and automatically restarts any service that becomes unresponsive or crashes. | Running |
| `vmware-vcops-web` | Front-end web server (httpd/nginx reverse proxy). Handles TLS termination, static content serving, and request routing to the Tomcat API backend. | Running |
| `vmware-stsd` | VMware Security Token Service daemon. Provides authentication token issuance and validation for inter-service communication. | Running |
| `vmware-vcops-rhino` | Rhino script engine for custom automation actions and notification plugins. | Running |
All service operations must be performed as the root user via SSH or console access. The appliance supports both `systemctl` and the legacy `service` command syntax.
Checking service status:
```shell
# Preferred — systemctl
systemctl status vmware-vcops-analytics

# Legacy — service wrapper
service vmware-vcops-analytics status
```
Starting, stopping, and restarting individual services:
```shell
# Start a service
systemctl start vmware-vcops-collector

# Stop a service
systemctl stop vmware-vcops-collector

# Restart a service (stop then start)
systemctl restart vmware-vcops-api
```
Querying overall cluster slice status:
```shell
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status
```
This command returns the role of the current node (primary, replica, data), cluster membership, and the online/offline state of each slice.
Using the Operations CLI:
```shell
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py --help
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py node list
```
Checking all VCF Operations services at once:
```shell
# CASA runs as vmware-casa (no vcops prefix); all other services follow
# the vmware-vcops-<name> pattern, so iterate over full unit names.
for svc in vmware-vcops-analytics vmware-vcops-collector vmware-vcops-api \
           vmware-casa vmware-vcops-gemfire vmware-vcops-vpostgres \
           vmware-vcops-cassandra vmware-vcops-watchdog vmware-vcops-web; do
  echo "=== ${svc} ==="
  systemctl is-active "${svc}"
done
```
Correct shutdown and startup ordering is critical to avoid data corruption, split-brain scenarios, and prolonged recovery times. The required sequence varies depending on your deployment topology.
Shutdown:

```shell
# Stop all VCF Operations services first, then the token service
service vmware-vcops stop
service vmware-stsd stop
```

Startup:

```shell
# Start the token service first, then all VCF Operations services
service vmware-stsd start
service vmware-vcops start
```
Warning: Always stop `vmware-stsd` after `vmware-vcops` during shutdown, and start it before `vmware-vcops` during startup. Reversing this order can leave authentication tokens in an inconsistent state.
Shutdown sequence (order matters):
```shell
# On each data node (if applicable):
service vmware-vcops stop && shutdown -h now

# On the replica node:
service vmware-vcops stop && shutdown -h now

# On the primary node (last):
service vmware-vcops stop && shutdown -h now
```
Startup sequence (reverse order):
```shell
# On the primary node (first):
service vmware-vcops start

# Verify primary is healthy before proceeding:
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status

# On the replica node (second):
service vmware-vcops start

# On each data node (last):
service vmware-vcops start
```
Warning: Starting the replica or data nodes before the primary is fully online will cause CASA cluster formation failures. The primary node must be the first to come online and the last to go offline.
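The "verify before proceeding" step can be automated with a small polling helper. This is a sequencing sketch only; the `ONLINE` token and 30-second interval are assumptions, so match the grep pattern to what `sliceConfiguration.sh --status` actually prints on your version:

```shell
# wait_for_state: poll a status command until its output contains an
# expected token, then return.
wait_for_state() {
    status_cmd="$1"
    token="${2:-ONLINE}"
    interval="${3:-30}"
    until eval "$status_cmd" | grep -q "$token"; do
        echo "Waiting for '$token'..."
        sleep "$interval"
    done
    echo "State '$token' reached."
}

# Example (run before starting the replica and data nodes):
# wait_for_state "/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status" ONLINE
```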
Continuous Availability deployments include a witness node in addition to primary and replica nodes. The witness participates in quorum decisions but does not store data.
Shutdown sequence:
Startup sequence:
Warning: In a CA deployment, losing both the witness and one of the primary/replica nodes simultaneously causes a loss of quorum. Never perform maintenance on the witness and a data-bearing node at the same time. Always verify quorum status via the admin UI or `sliceConfiguration.sh --status` before proceeding to the next node.
The VMware vSphere adapter is the foundational integration for VCF Operations. It collects performance metrics, configuration properties, change events, and relationship data from vCenter Server and all managed objects including ESXi hosts, virtual machines, datastores, clusters, distributed switches, and resource pools. This chapter provides a complete walkthrough of credential creation, adapter instance configuration, collection tuning, and health monitoring.
Before creating an adapter instance, you must configure a credential that VCF Operations will use to authenticate against the target vCenter Server.
Step-by-step procedure:
- … (e.g., `vcsa-mgmt-01.corp.local`).
- … (e.g., `svc-vrops-mgmt-01.vcsa-mgmt-01.corp.local`). Do not use an IP address; certificate validation requires FQDN.
- … (e.g., `svc-vrops@vsphere.local`). Never use `administrator@vsphere.local` in production.

Required vCenter Permissions:
The service account must be assigned a custom role at the vCenter root level with the following minimum privileges:
| Privilege Category | Specific Privilege | Access Level |
|---|---|---|
| Global | Licenses | Read only |
| Global | Settings | Read only |
| Global | Health | Read only |
| Host | Configuration (all sub-items) | Read only |
| Host | CIM → CIM Interaction | Read only |
| Host | Storage operations | Read only |
| Virtual Machine | Interaction → Console interaction | Read only |
| Virtual Machine | State → Create snapshot, Remove snapshot | Read/Write |
| Virtual Machine | Configuration (all sub-items) | Read only |
| Datastore | Browse datastore | Read only |
| Datastore | Low-level file operations | Read only |
| Performance | Modify intervals | Read/Write |
| vSAN | Cluster → ReadOnly | Read only |
| Sessions | Validate session | Read only |
| Extension | Register extension | Read/Write (optional — only for remediation actions) |
| Alarm | Acknowledge alarm, Set alarm status | Read/Write (optional — only for alert sync) |
Best Practice: Create a dedicated vSphere role named `VCF-Operations-ReadOnly` with these privileges. Assign it to the service account at the vCenter root object and select Propagate to children. This ensures the adapter can discover and monitor all objects in the inventory hierarchy.
With the credential in place, create the adapter instance that will perform data collection.
| Field | Description | Example Value |
|---|---|---|
| Adapter Type | Pre-selected as VMware vSphere. | VMware vSphere |
| Display Name | Unique name identifying this adapter instance in dashboards and alerts. | vcsa-mgmt-01 |
| Description | Free-text description. | Management domain vCenter |
| Credential | Select the credential created in Section 7.1. | svc-vrops-mgmt-01 |
| vCenter Server | FQDN of the target vCenter Server. Must match the credential's vCenter Server field. | vcsa-mgmt-01.corp.local |
| Collector / Collector Group | Select the collector node or group responsible for data collection. In multi-site deployments, choose a collector closest to the target vCenter. | Default collector group |
| Auto Discovery | When enabled, newly added hosts and VMs are automatically discovered and monitored. | Enabled (recommended) |
| Setting | Default | Description |
|---|---|---|
| `COLLECT_VSAN_PERF_METRICS` | `true` | Enables collection of vSAN performance counters from the vSAN Performance Service. |
| `COLLECT_VSAN_ADVANCED_METRICS` | `false` | Enables collection of extended vSAN metrics (DOM, LSOM, CMMDS). Increases load on vCenter. |
| `PROCESS_CHANGE_EVENTS` | `true` | Enables ingestion of vCenter events and tasks for change-driven analytics and audit trails. |
| `DISABLE_COMM_WITH_VCENTER` | `false` | Emergency toggle to stop all communication with vCenter without deleting the adapter. Useful during planned vCenter maintenance. |
| `CONNECT_TIMEOUT` | `60000` | Connection timeout in milliseconds for vCenter API calls. Increase for high-latency WAN connections. |
| `ENABLE_DIFFMERGE` | `true` | Enables differential collection (only changed properties are sent), reducing processing overhead. |
| `COLLECTOR_INSTANCE_COUNT` | `1` | Number of parallel collection threads. Increase for very large vCenter inventories (>5,000 VMs). |
VCF Operations collects different categories of data at different frequencies. These intervals can be modified per adapter instance, but the defaults are optimized for most environments.
| Collection Type | Default Interval | Configurable Range | Notes |
|---|---|---|---|
| Performance Metrics | 5 minutes | 1–60 minutes | Aligns with vCenter's default real-time statistics interval (20-second samples aggregated to 5 minutes). Reducing below 5 minutes does not yield higher granularity from vCenter. |
| Configuration Properties | 30 minutes | 5–1440 minutes | Collects object configuration attributes (CPU count, memory size, disk layout, network assignments). |
| Change Events | 5 minutes | 1–60 minutes | Polls vCenter's EventManager for tasks and events since the last poll. |
| Inventory Discovery | 6 hours | 1–24 hours | Full inventory traversal to discover new objects and remove stale ones. |
| vSAN Performance | 5 minutes | 5–60 minutes | vSAN performance counters collected via the vSAN Performance Service API. Must be ≥5 minutes. |
| Relationship Mapping | 30 minutes | 5–1440 minutes | Updates parent-child and peer relationships between objects. |
Tip: In very large environments (>10,000 VMs), increasing the configuration collection interval to 60 minutes and inventory discovery to 12 hours significantly reduces API load on vCenter with minimal impact on monitoring fidelity.
After initial deployment, the adapter follows a well-defined lifecycle before full analytics capability is reached:
Initial Discovery (0–30 minutes): The adapter performs a complete inventory traversal, creating resource objects for every discovered entity (hosts, VMs, clusters, datastores, etc.). The Object Count in the adapter status begins to populate.
First Collection Cycle (5–10 minutes after discovery): Performance metrics and configuration properties are collected for the first time. Metrics begin appearing in dashboards, but values are raw with no baseline context.
Statistics Build-Up (24–72 hours): The analytics engine begins calculating rolling averages, standard deviations, and trend lines. Capacity projections begin to appear, but with low confidence.
Dynamic Thresholds (1–2 weeks): After accumulating approximately one to two weeks of continuous data, the analytics engine generates dynamic thresholds (DT). These adaptive baselines learn normal behavior patterns for each metric on each object, including daily and weekly seasonality. Alerts based on dynamic thresholds become meaningful only after this maturation period.
Steady State (2+ weeks): Dynamic thresholds are fully established. Anomaly detection, predictive alerts, and capacity forecasts operate at full accuracy. The system continues to refine thresholds as it accumulates more historical data.
Important: Do not create custom alert definitions based on dynamic thresholds during the first two weeks. The immature thresholds will generate excessive false positives. Use static thresholds for immediate alerting needs during the burn-in period.
After saving the adapter instance, perform the following validation steps:
Test Connection: On the adapter configuration page, click Test Connection. A successful test confirms:
Monitor Discovery Progress:
Verify Object Counts:
Check Adapter Logs:
```shell
tail -100 /storage/log/collector/collector.log | grep -i "VMware_adapter3"
```
Look for `Collection completed successfully` messages and verify that there are no authentication errors or timeout exceptions.
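That log review can be condensed into a quick triage helper. A sketch only: the log path comes from the filesystem table earlier, but the exact message strings are assumptions you should align with what your `collector.log` actually emits:

```shell
# check_collector_log: summarize success and suspect-error counts
# in the collector log.
check_collector_log() {
    log="${1:-/storage/log/collector/collector.log}"
    ok=$(grep -c "Collection completed successfully" "$log" 2>/dev/null)
    err=$(grep -ciE "authentication|timeout" "$log" 2>/dev/null)
    echo "successes=${ok:-0} errors=${err:-0}"
}
```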
Ongoing adapter health monitoring ensures continuous data collection and early detection of integration failures.
Via REST API:
```shell
# Get adapter instance status
curl -sk -X GET \
  "https://<vrops-fqdn>/suite-api/api/adapters/{adapterId}" \
  -H "Authorization: vRealizeOpsToken <token>" \
  -H "Accept: application/json"
```
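The `<token>` value in the Authorization header is a bearer token obtained from the authentication endpoint. A hedged sketch of acquiring and extracting it without extra tooling, assuming the standard `/suite-api/api/auth/token/acquire` endpoint (host and credentials are placeholders; `jq` would be cleaner if available):

```shell
# extract_token: pull the "token" field out of the token-acquire
# response JSON on stdin.
extract_token() {
    sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage against a live instance (commented out here):
# TOKEN=$(curl -sk -X POST "https://<vrops-fqdn>/suite-api/api/auth/token/acquire" \
#   -H "Content-Type: application/json" -H "Accept: application/json" \
#   -d '{"username":"admin","password":"<password>"}' | extract_token)
```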
The adapter status response includes `resourceStatusAndReason`, where:

- `resourceStatus: DATA_RECEIVING` — Healthy.
- `resourceStatus: NO_DATA_RECEIVING` — Collection failures are occurring.
- `resourceStatus: NO_PARENT_MONITORING` — The collector node is offline.
- `resourceStatus: UNKNOWN` — The adapter has not completed its first collection cycle.

Via CLI:
```shell
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py adapter list
```
This outputs all configured adapter instances, their types, collection states, and associated collector nodes.
Via UI:
Common adapter health issues and resolutions:
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Status is red; `SSLHandshakeException` in logs | vCenter certificate changed or renewed | Re-trust the vCenter certificate in VCF Operations: Administration → Certificates → Certificate Management |
| Status is red; `InvalidLogin` in logs | Service account password expired or changed | Update the credential in Administration → Integrations → Accounts |
| Status is yellow; collection duration exceeds interval | Oversized vCenter inventory or resource contention | Increase `COLLECTOR_INSTANCE_COUNT`, add a remote collector, or increase collection intervals |
| Object count is zero | Insufficient vCenter permissions | Verify the service account role assignment per Section 7.1 |
| Status is green but metrics are stale | Collector node clock drift | Verify NTP synchronization on both the collector appliance and vCenter |
SDDC Manager is the lifecycle management control plane for VMware Cloud Foundation. Integrating VCF Operations with SDDC Manager provides domain-level topology awareness, lifecycle status visibility, and operational context that enriches the analytics engine's understanding of the VCF stack.
Step-by-step procedure:
- … (e.g., `sddc-mgr-01.corp.local`).
- … (e.g., `sddc-mgr-01.corp.local`). Use the FQDN, not the IP address.
- … `ADMIN` or `OPERATOR` role (e.g., `svc-vrops@corp.local`).

Note: The SDDC Manager API uses port 443. Ensure that firewall rules allow HTTPS traffic from the VCF Operations collector node to the SDDC Manager appliance.
Once configured, the SDDC Manager adapter automatically discovers and monitors the following data:
The adapter communicates with the SDDC Manager via its published REST API. The following table lists the key endpoints queried during each collection cycle:
| Endpoint | Data Collected |
|---|---|
| `GET /v1/system` | Overall system information: SDDC Manager version, system status, NTP configuration, DNS settings, and deployment type. |
| `GET /v1/domains` | All workload domains including name, ID, type (management/VI), status, and associated cluster references. |
| `GET /v1/clusters` | Cluster details within each domain: cluster name, host count, vSAN enabled status, stretch cluster configuration, and image profile. |
| `GET /v1/hosts` | Host inventory: hardware model, ESXi version, commission status (ASSIGNED, UNASSIGNED_USEABLE, DECOMMISSIONED), and associated domain/cluster. |
| `GET /v1/tasks` | Recent task history: upgrade workflows, host operations, certificate rotations, and their completion status (SUCCESSFUL, FAILED, IN_PROGRESS). |
| `GET /v1/upgrades` | Available upgrade bundles and their applicability to each domain, including pre-check results and compatibility matrices. |
| `GET /v1/certificates` | Certificate inventory: issuing CA, subject, expiration date, and associated component (vCenter, NSX, ESXi). |
| `GET /v1/network-pools` | Network pool definitions, VLAN ranges, and IP address block utilization. |
| `GET /v1/sddc-managers` | SDDC Manager cluster node information (in multi-instance deployments). |
Tip: If the SDDC Manager adapter reports errors for specific endpoints, verify that the service account has sufficient privileges. The `ADMIN` role provides access to all endpoints; the `OPERATOR` role may restrict access to certain lifecycle operations.
NSX provides the network virtualization and security layer in VMware Cloud Foundation. Integrating VCF Operations with NSX delivers visibility into logical networking constructs, transport infrastructure, distributed firewall activity, and load balancer performance — all correlated with the compute and storage metrics collected by the vSphere adapter.
Step-by-step procedure:
- … (e.g., `nsx-mgmt-01.corp.local`).
- … (e.g., `nsx-vip-mgmt.corp.local`). This is critical — see the note below.
- … the `audit` role (e.g., `svc-vrops-nsx@corp.local`) or the `enterprise_admin` role for full visibility. The `audit` role is recommended for least-privilege compliance.

Warning: Always use the NSX Manager VIP (Virtual IP), not the FQDN of an individual NSX Manager node. The NSX Manager cluster operates as a three-node Raft consensus group. During maintenance, node upgrades, or node failures, individual manager nodes become temporarily unavailable. The VIP automatically directs traffic to a healthy node, ensuring uninterrupted data collection. Configuring the adapter with an individual node address will result in collection outages during any node maintenance event.
The NSX-T adapter collects a comprehensive set of networking and security data:
In addition to the built-in NSX adapter, VMware offers VCF Operations for Networks (formerly known as Aria Operations for Networks, or vRealize Network Insight) as a complementary product for deep network visibility. While the NSX adapter focuses on management-plane metrics, VCF Operations for Networks provides data-plane flow analysis.
Deployment model:
Key capabilities:
Note: VCF Operations for Networks is licensed separately from VCF Operations. In VCF 5.x environments, it is included with the VCF Operations Advanced and Enterprise editions.
VMware vSAN is the hyper-converged storage platform embedded in VCF. VCF Operations provides native vSAN monitoring through the vSphere adapter, delivering capacity analytics, performance trending, health correlation, and policy compliance tracking without requiring a separate adapter installation.
vSAN monitoring is automatically activated when the vCenter adapter discovers one or more vSAN-enabled clusters. No additional adapter installation, configuration, or licensing is required for core vSAN metrics.
Prerequisites for automatic vSAN data collection:
- … the `vSAN → Cluster → ReadOnly` privilege (included in the role defined in Section 7.1).

Once these prerequisites are met, VCF Operations automatically creates resource objects for:
For environments requiring deeper vSAN observability, additional collection parameters can be enabled in the vCenter adapter instance's advanced settings.
Navigate to Administration → Integrations → Accounts → select the vCenter adapter → Edit → expand Advanced Settings:
| Setting | Default | Description |
|---|---|---|
| `COLLECT_VSAN_PERF_METRICS` | `true` | Collects vSAN performance counters (IOPS, throughput, latency) from the vSAN Performance Service API. Disabling this removes all vSAN performance data while retaining capacity and health metrics. |
| `COLLECT_VSAN_ADVANCED_METRICS` | `false` | Enables collection of extended vSAN metrics from the DOM (Distributed Object Manager), LSOM (Local Log-Structured Object Manager), and CMMDS (Cluster Monitoring Membership and Directory Services) layers. Provides deep diagnostic visibility but increases collection load on vCenter and the ESXi hosts. |
| `VSAN_PERF_DIAG_MODE` | `false` | Enables vSAN performance diagnostics mode, which collects additional latency breakdown metrics (e.g., guest-to-kernel, kernel-to-disk) for troubleshooting storage performance issues. |
Warning: Enabling `COLLECT_VSAN_ADVANCED_METRICS` on clusters with more than 32 hosts or heavy I/O workloads can significantly increase vCenter API response times and VCF Operations collection duration. Enable this setting selectively and monitor the adapter collection duration (see Section 7.6) after activation.
Additional vSAN Performance Service requirements:
VCF Operations collects hundreds of vSAN metrics. The following table summarizes the most operationally significant metric groups:
| Metric Group | Key Metrics | Description |
|---|---|---|
| Capacity | `vsanDatastore\|capacity_usedSpace`, `vsanDatastore\|capacity_freeSpace`, `vsanDatastore\|capacity_dedupRatio`, `vsanDatastore\|capacity_compressionRatio`, `vsanDatastore\|capacity_savingsRatio` | Overall vSAN datastore capacity utilization, deduplication effectiveness, compression ratios, and combined space savings. Used for capacity planning and trending. |
| Performance — IOPS | `vsanDatastore\|performance_readIops`, `vsanDatastore\|performance_writeIops`, `vsanDatastore\|performance_totalIops` | Read, write, and total I/O operations per second at the cluster, host, and disk group levels. |
| Performance — Throughput | `vsanDatastore\|performance_readThroughput`, `vsanDatastore\|performance_writeThroughput` | Data throughput in KBps for read and write operations. Useful for identifying bandwidth bottlenecks. |
| Performance — Latency | `vsanDatastore\|performance_readLatency`, `vsanDatastore\|performance_writeLatency`, `vsanDatastore\|performance_totalLatency` | Average latency in milliseconds for read, write, and combined operations. VCF Operations applies dynamic thresholds to these metrics after the burn-in period. |
| Resync | `vsanDatastore\|resync_bytesRemaining`, `vsanDatastore\|resync_objectsResyncing`, `vsanDatastore\|resync_etr` | Bytes remaining to resynchronize after a host failure or maintenance event, count of objects actively resyncing, and estimated time to completion (ETR). Critical for monitoring recovery progress. |
| Health | `vsanDatastore\|health_diskHealth`, `vsanDatastore\|health_networkHealth`, `vsanDatastore\|health_dataIntegrity`, `vsanDatastore\|health_overallHealth` | Health check results for disk subsystem, vSAN network (VMkernel connectivity, multicast), data integrity (object checksum verification), and overall cluster health. |
| Policy Compliance | `vsanDatastore\|policy_complianceStatus`, `vsanDatastore\|policy_objectsByPolicy` | Reports whether all VM storage objects comply with their assigned vSAN storage policy (e.g., FTT=1, stripe width). Identifies VMs at risk due to policy violations. |
| Congestion | `vsanDatastore\|performance_congestion` | vSAN congestion value (0–255). Values above 0 indicate back-pressure in the I/O path. Sustained values above 30 warrant investigation. |
| Disk Group | `vsanDiskGroup\|iopsRead`, `vsanDiskGroup\|iopsWrite`, `vsanDiskGroup\|latencyRead`, `vsanDiskGroup\|latencyWrite`, `vsanDiskGroup\|cacheHitRate` | Per-disk-group performance counters including cache tier hit rate. Low cache hit rates may indicate a need for larger cache disks or workload redistribution. |
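The "sustained values above 30" congestion guidance can be expressed as a small filter. This is an illustrative sketch only: it reads one sampled congestion value per line (however you export them, e.g. from the REST API) and alerts when every sample in the window exceeds the threshold:

```shell
# sustained_congestion: alert only when ALL samples exceed the threshold
# (default 30, per the guidance above); a single spike reads as OK.
sustained_congestion() {
    threshold="${1:-30}"
    awk -v t="$threshold" '
        { n++; if ($1 > t) high++ }
        END {
            if (n > 0 && high == n) print "ALERT: sustained congestion"
            else print "OK"
        }
    '
}
```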
VCF Operations ships with a comprehensive set of predefined vSAN dashboards that provide immediate operational visibility without custom configuration. These dashboards cover:
For a complete listing of all predefined vSAN dashboards, their widget configurations, and customization guidance, refer to Chapter 14: Predefined Dashboards and Views.
Tip: Pin the vSAN Cluster Overview and vSAN Capacity Planning dashboards to your home page for daily operational monitoring. Configure email-based scheduled reports from the vSAN Capacity Planning dashboard to automatically distribute weekly capacity status to infrastructure leads.
Policies in VCF Operations govern how the platform analyzes, alerts on, and reports capacity for your monitored objects. Every object in the inventory is subject to exactly one policy at any given time, and understanding how policies layer and override one another is essential for accurate monitoring at scale.
VCF Operations ships with a single Default Policy that is automatically applied to every monitored object in the inventory. This policy contains Broadcom's recommended thresholds, alert definitions, symptom definitions, and capacity settings for all supported object types. It cannot be deleted, and it serves as the fallback for any object not explicitly covered by a custom policy.
Custom policies allow administrators to override specific settings from the Default Policy for targeted groups of objects. A custom policy does not need to redefine every setting — it inherits any setting left unconfigured from the Default Policy and only overrides the values explicitly changed.
To manage policies, navigate to:
Configure → Policies
The Policies page displays all active policies in a table with columns for Name, Description, Priority, and the number of object groups assigned. From this page you can:
Note: The Default Policy itself can be edited, but exercise caution — changes to the Default Policy affect every object that is not covered by a higher-priority custom policy.
When multiple custom policies exist, VCF Operations uses a numeric priority system to determine which policy governs a given object. Each policy is assigned a priority number, and lower numbers indicate higher priority.
Policy resolution follows this logic:
| Priority | Policy Name | Assigned Groups | Matched Object | Result |
|---|---|---|---|---|
| 1 | Critical Production | Production-Tier1 | VM in Production-Tier1 | Governed by Critical Production |
| 2 | Standard Production | Production-All | VM in Production-All | Governed by Standard Production |
| 3 | Development | Dev-Test | VM in Dev-Test | Governed by Development |
| — | Default Policy | (All objects) | VM in no group | Governed by Default Policy |
If an object belongs to groups assigned to multiple policies, only the highest-priority policy (lowest number) applies. There is no merging of settings across policies — the winning policy's settings apply in full, with any unconfigured settings inherited from the Default Policy.
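The resolution rule reduces to "take the lowest priority number among all matching policies." A sketch of that selection, with policy names purely illustrative:

```shell
# resolve_policy: stdin carries one "priority<TAB>policy-name" line per
# policy whose assigned groups contain the object; the lowest number wins
# outright, with no merging of settings.
resolve_policy() {
    sort -n | head -n 1 | cut -f2
}
```

For example, feeding the two matching policies `2 Standard Production` and `1 Critical Production` returns `Critical Production`, mirroring the table above.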
To change priority, navigate to Configure → Policies, select a policy, and click Edit Priority. Enter the desired numeric value and save.
Each policy exposes five major configuration areas. The following sections detail every configurable element, its UI navigation path, and key settings.
Navigation: Configure → Policies → [Policy Name] → Edit → Workload Automation
Workload Automation enables DRS-like optimization recommendations (or automated actions, if configured) driven by VCF Operations analytics rather than vCenter DRS alone.
| Setting | Description | Default |
|---|---|---|
| Enable Workload Automation | Turns on optimization analysis for the policy scope | Disabled |
| Automation Mode | Manual (recommendations only), Semi-Automatic, or Fully Automatic | Manual |
| Aggressiveness | Conservative, Moderate, or Aggressive balancing | Moderate |
| Excluded Object Types | Object types to exclude from automation | None |
Navigation: Configure → Policies → [Policy Name] → Edit → Capacity
Capacity settings control how VCF Operations calculates remaining capacity and time-to-exhaustion.
| Setting | Description | Default |
|---|---|---|
| Allocation Model / Demand Model | Method for computing capacity (see Section 11.4) | Allocation Model |
| Time Remaining Threshold (days) | Alert fires when projected exhaustion is within this window | 90 days |
| Capacity Remaining Threshold (%) | Alert fires when remaining capacity drops below this value | 20% |
| CPU Overcommit Ratio | Virtual-to-physical CPU ratio ceiling | 4:1 |
| Memory Overcommit Ratio | Virtual-to-physical memory ratio ceiling | 1.25:1 |
| Storage Overcommit Ratio | Virtual-to-physical storage ratio ceiling | 1:1 |
| High Availability Buffer (%) | Capacity reserved for HA failover | Based on cluster HA settings |
| Maintenance Buffer (%) | Capacity reserved for host maintenance | 0% |
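To see how the buffer settings interact, here is an illustrative arithmetic sketch. The exact formula the product uses may differ; this simply subtracts the HA and maintenance buffer percentages from total capacity, and all values are hypothetical:

```shell
# Hypothetical cluster: 100 GHz total, 10% HA buffer, 5% maintenance buffer.
total_ghz=100
ha_buffer_pct=10
maint_buffer_pct=5
# Usable capacity after reserving both buffers.
usable_ghz=$((total_ghz * (100 - ha_buffer_pct - maint_buffer_pct) / 100))
echo "Usable: ${usable_ghz} GHz of ${total_ghz} GHz"
```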
Navigation: Configure → Policies → [Policy Name] → Edit → Attributes/Metrics
This section allows enabling or disabling the collection of specific metric groups per object type. Disabling unused metric groups reduces storage consumption and processing overhead.
Categories include CPU, Memory, Disk, Network, Datastore, Virtual Disk, GPU, vSAN, and System metrics. Each category can be individually toggled.
Navigation: Configure → Policies → [Policy Name] → Edit → Alerts/Symptoms
Administrators can enable or disable individual alert definitions and symptom definitions within the scope of the policy. This is useful for suppressing alerts that are not relevant to a particular workload tier — for example, disabling memory overcommit alerts for development clusters where overcommit is expected.
Navigation: Configure → Policies → [Policy Name] → Edit → Compliance
Activate or deactivate compliance benchmarks on a per-policy basis. Available benchmarks include VMware Security Hardening Guide, CIS Benchmarks, DISA STIGs, and any custom benchmarks that have been imported.
The capacity model determines how VCF Operations calculates how much capacity a cluster or datastore has remaining.
| Aspect | Allocation Model | Demand Model |
|---|---|---|
| Calculation Basis | Provisioned (allocated) resources | Actual measured utilization |
| Philosophy | Conservative — assumes all provisioned resources may be consumed | Optimistic — assumes current usage patterns continue |
| CPU Capacity Used | Sum of all vCPUs allocated × overcommit ratio | Peak or 95th-percentile CPU demand |
| Memory Capacity Used | Sum of all configured VM memory | Active + consumed memory demand |
| Example (8-core host) | 10 VMs × 4 vCPU = 40 vCPU allocated → 40/32 = 125% used (at 4:1 ratio) | Actual demand is 12 GHz of 64 GHz → 18.75% used |
| Best For | Production environments with strict SLAs | Development environments or well-understood workloads |
| Risk | May show capacity exhaustion prematurely | May underestimate future demand if workloads spike |
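The 8-core host example from the table can be reproduced with a couple of lines (illustrative helper functions, not product code):

```python
def allocation_model_usage(vcpus_allocated: int, physical_cores: int,
                           overcommit: float = 4.0) -> float:
    """Percent of capacity used under the allocation model."""
    return vcpus_allocated / (physical_cores * overcommit) * 100

def demand_model_usage(demand_mhz: float, capacity_mhz: float) -> float:
    """Percent of capacity used under the demand model."""
    return demand_mhz / capacity_mhz * 100

# 10 VMs x 4 vCPU on an 8-core host at a 4:1 ratio:
print(allocation_model_usage(40, 8))       # 125.0 -> "over capacity"
# The same host measured by demand: 12 GHz used of 64 GHz available:
print(demand_model_usage(12_000, 64_000))  # 18.75 -> ample headroom
```

The gap between 125% and 18.75% for the same host is exactly why model choice matters: the allocation model protects SLAs, while the demand model maximizes consolidation.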
Alerts and symptoms form the proactive monitoring backbone of VCF Operations. Symptoms detect individual conditions; alerts correlate one or more symptoms into actionable notifications that drive operational response.
Every alert in VCF Operations is classified along three dimensions: type, criticality, and control state.
Alert Types:
| Type | Badge Icon | Purpose | Example |
|---|---|---|---|
| Health | Red/Orange/Yellow cross | Indicates a current, active problem requiring immediate attention | Host memory usage critical |
| Risk | Red/Orange/Yellow diamond | Predicts a future problem based on trend analysis | Datastore will run out of space in 30 days |
| Efficiency | Red/Orange/Yellow arrow | Identifies optimization opportunities to reclaim waste | VM is oversized — using 5% of allocated CPU |
Badge Colors and Criticality Levels:
| Color | Criticality | Description |
|---|---|---|
| Red | Critical | Immediate action required; service impact is occurring or imminent |
| Orange | Immediate | Urgent attention needed; potential for service impact |
| Yellow | Warning | Attention recommended; condition is outside normal bounds |
| Green | Information / Clear | Informational or no active alerts |
Control States:
| State | Description |
|---|---|
| Open | Alert is active and unacknowledged |
| Assigned | An administrator has taken ownership |
| Suspended | Alert is temporarily suppressed (with optional expiration) |
| Cancelled | Alert has been manually dismissed by an administrator |
When all triggering symptoms clear, the alert automatically transitions to a cancelled state. Manually cancelled alerts will not re-fire until the symptoms clear and then trigger again.
The alert lifecycle follows a deterministic sequence: a symptom condition triggers, the alert opens, an administrator may acknowledge or assign it, and the alert cancels automatically once all triggering symptoms clear.
Alerts can also be suspended for a configurable duration (e.g., during a maintenance window), after which they automatically resume evaluation.
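The control states and their permitted transitions can be sketched as a small state machine; the transition set below is illustrative, inferred from the state table above rather than taken from product documentation:

```python
# Permitted control-state transitions (illustrative, based on the table above).
TRANSITIONS = {
    "Open":      {"Assigned", "Suspended", "Cancelled"},
    "Assigned":  {"Suspended", "Cancelled"},
    "Suspended": {"Open", "Cancelled"},   # resumes evaluation after the window
    "Cancelled": {"Open"},                # re-fires only when symptoms trigger again
}

def transition(state: str, new_state: str) -> str:
    """Move an alert to a new control state, rejecting illegal transitions."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = transition("Open", "Assigned")    # operator takes ownership
state = transition(state, "Cancelled")    # operator dismisses the alert
print(state)  # Cancelled
```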
To create a custom alert definition, follow these steps:
Step 1. Navigate to Configure → Alerts → Alert Definitions.
Step 2. Click the Add button in the toolbar.
Step 3. On the Name and Description tab, enter a descriptive name and an optional description for the alert.
Step 4. On the Alert Impact tab, set the alert type (Health, Risk, or Efficiency) and its criticality.
Step 5. On the Add Symptom Definitions tab, select existing symptom definitions (or create new ones) to include in the alert.
Step 6. In the Configure Symptom Conditions section, arrange the selected symptoms and combine them with AND/OR logic to define the trigger condition.
Step 7. Click Save. The alert definition is now created but will only evaluate against objects governed by a policy where the alert is enabled.
Symptoms are the atomic conditions that feed into alert definitions. VCF Operations supports five distinct symptom types.
Triggers when a monitored metric or property meets a defined condition.
Static Threshold Configuration:
Supported comparison operators: `>`, `<`, `>=`, `<=`, `=`, `!=`.

Dynamic Threshold Configuration:
Triggers when a log message matches a defined pattern. This symptom type requires Operations for Logs integration.
Triggers on fault events published by vCenter Server or other adapter sources.
Triggers on metric events published by external systems through the VCF Operations REST API.
Predictive symptom that uses machine-learning trend analysis to forecast when a metric will cross a threshold.
| Aspect | Static Threshold | Dynamic Threshold |
|---|---|---|
| Definition | Fixed numeric value set by the administrator | Machine-learned baseline derived from historical patterns |
| Trigger Condition | Fires when metric crosses the fixed value | Fires when metric deviates from the learned normal pattern |
| Setup Effort | Immediate — define value and save | Requires 1–2 weeks of data collection for baseline |
| Adaptability | Does not adapt; same value applies 24/7 | Adapts to daily/weekly patterns (e.g., business hours vs off-hours) |
| False Positive Risk | Higher — a single threshold cannot account for variable workloads | Lower — learned baselines reflect actual usage patterns |
| Best For | Hard limits (e.g., disk full > 95%), SLA thresholds | Anomaly detection, workloads with variable patterns |
| Configuration | Operator + fixed value | Direction (Above/Below) + Sensitivity level (Normal, 1-3 sigma) |
Dynamic Threshold Sensitivity Levels:
| Level | Interpretation | Use When |
|---|---|---|
| Normal Range | Any deviation outside the learned band | You want maximum sensitivity to deviations |
| 1 Standard Deviation | Moderate deviation from normal | General-purpose anomaly detection |
| 2 Standard Deviations | Significant deviation from normal | Reducing noise while catching meaningful anomalies |
| 3 Standard Deviations | Extreme deviation from normal | Only alerting on severe outliers |
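A simplified stand-in for dynamic thresholding can be sketched with a mean and standard-deviation band; the product's real baselines are time-aware and considerably more sophisticated, so treat this only as an intuition aid:

```python
import statistics

def dynamic_band(samples, sigmas: float = 2.0):
    """Learned 'normal' band: mean +/- sigmas * stdev of historical samples."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return mean - sigmas * stdev, mean + sigmas * stdev

def is_anomalous(value, samples, sigmas: float = 2.0) -> bool:
    low, high = dynamic_band(samples, sigmas)
    return value < low or value > high

# Hourly CPU-usage samples hovering around 40%:
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
print(is_anomalous(72, history))  # True  -> far outside the learned band
print(is_anomalous(41, history))  # False -> within the normal range
```

Raising `sigmas` from 1 to 3 widens the band, which is exactly the noise-reduction trade-off captured in the sensitivity table above.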
Alert definitions can include symptoms that evaluate conditions not only on the alerting object itself but also on related objects in the inventory hierarchy.
| Relationship | Description | Example |
|---|---|---|
| Self | Symptom evaluates on the object that will trigger the alert | VM CPU Usage > 90% on the VM itself |
| Parent | Symptom evaluates on the immediate parent object | Host memory pressure on the host running the VM |
| Child | Symptom evaluates on an immediate child object | A VM on a host has high disk latency |
| Peer | Symptom evaluates on an object at the same level sharing a parent | Another VM on the same host is consuming excessive CPU |
| Ancestor | Symptom evaluates on any object above in the hierarchy (parent, grandparent, etc.) | Cluster-level capacity warning affecting a VM two levels down |
| Descendant | Symptom evaluates on any object below in the hierarchy (child, grandchild, etc.) | Any VM in a cluster experiencing memory contention |
Relationship-based symptoms enable compound alerts that correlate conditions across infrastructure layers — for example, an alert that fires only when a VM has high CPU ready AND its parent host has high CPU utilization, confirming the contention is host-driven rather than guest-driven.
Notification rules bridge alerts to human attention by defining what gets communicated, to whom, and through which channel.
Step 1. Navigate to Configure → Alerts → Notification Settings.
Step 2. Click Add to create a new notification rule.
Step 3. Enter a Name for the rule (e.g., "Critical Production Alerts to On-Call Team").
Step 4. Set Filter Criteria to control which alerts trigger this notification, such as criticality, alert type, alert definition, or the object type the alert fires on.
Step 5. Select the Notification Method — choose from the configured outbound plug-ins (see Section 12.9).
Step 6. Set the Notification Frequency, which controls whether a single notification is sent when the alert triggers or reminders are re-sent while the alert remains active.
Step 7. Click Save. The notification rule takes effect immediately.
Tip: Create separate notification rules for different criticality levels. Route Critical alerts to PagerDuty or SMS-capable channels for immediate response, while routing Warning alerts to email or Slack for informational awareness.
Outbound plug-ins define the communication channels available for notification rules. Configure them at Administration → Outbound Settings → Add.
| # | Plug-in Type | Key Configuration Fields | Notes |
|---|---|---|---|
| 1 | Standard Email (SMTP) | SMTP Host, Port (25/465/587), Secure Connection (TLS/SSL), From Address, Authentication (username/password) | Most common. Supports HTML formatting. Test with the Test button before saving. |
| 2 | Log File | File path on the VCF Operations analytics node (e.g., `/var/log/vmware/vcops/alerts.log`) | Useful for SIEM ingestion from the local filesystem. |
| 3 | Network Share (CIFS/NFS) | Share Path (e.g., `\\server\share\alerts`), Domain, Username, Password | Writes alert data as files to a network share. |
| 4 | SNMP Trap | Target Host (IP/FQDN), Port (default 162), Community String, SNMP Version (v1/v2c/v3), Security Level (v3: AuthPriv/AuthNoPriv/NoAuthNoPriv), Engine ID | For integration with enterprise SNMP managers (e.g., HP OpenView, IBM Tivoli). |
| 5 | ServiceNow | Instance URL (e.g., `https://instance.service-now.com`), Username, Password, REST Endpoint, Incident Table, Assignment Group | Creates ServiceNow incidents automatically. Requires the VCF Operations ServiceNow app or direct REST configuration. |
| 6 | Slack | Webhook URL (from the Slack Incoming Webhooks app), Channel (override), Username (override) | Posts formatted alert messages to a Slack channel. |
| 7 | Webhook (REST) | URL, HTTP Method (POST/PUT/PATCH), Content Type (JSON/XML), Headers (key-value pairs), Body Template (with alert field placeholders), Authentication (None/Basic/Bearer Token/OAuth) | Most flexible — integrates with any REST-capable system (PagerDuty, Teams, OpsGenie, custom APIs). |
The configuration procedure is the same for every plug-in type: go to Administration → Outbound Settings, click Add, select the plug-in type, complete its fields, test the connection, and save.
Each plug-in type can have multiple instances configured (e.g., separate SMTP servers for different environments, multiple Slack channels). Notification rules reference specific plug-in instances when defining the delivery channel.
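As an illustration of the Webhook (REST) body template, a minimal Python sketch that assembles an alert payload might look like the following. The field names are hypothetical placeholders, not the product's fixed schema; in practice the Webhook plug-in substitutes alert field placeholders into the body template you define:

```python
import json

def build_alert_payload(alert_name: str, criticality: str,
                        object_name: str, alert_type: str = "Health") -> str:
    """Assemble a JSON body for a generic REST webhook (illustrative schema)."""
    return json.dumps({
        "alertName": alert_name,      # hypothetical field names
        "criticality": criticality,
        "type": alert_type,
        "object": object_name,
    })

body = build_alert_payload("Host memory usage critical", "Critical",
                           "esxi-01.lab.local")
print(json.loads(body)["criticality"])  # Critical
```

A receiving system such as PagerDuty or a custom API would parse this body and map the fields to its own incident model.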
Super metrics extend the analytic capabilities of VCF Operations by enabling administrators to define custom calculated metrics that combine, aggregate, or transform multiple standard metrics into a single derived value. They fill gaps where the built-in metric catalog does not provide the exact calculation your organization needs.
A super metric is a user-defined formula that VCF Operations evaluates on every collection cycle, producing a new metric value that can be used in dashboards, views, reports, alert symptom definitions, and capacity calculations — just like any native metric.
Common use cases include cluster-wide averages limited to a filtered set of VMs (for example, powered-on only), counts of objects exceeding a performance threshold, and derived ratios such as vCPU-to-pCPU overcommit.
To access super metrics, navigate to: Configure → Super Metrics.
Follow this ten-step procedure to create a super metric:
Step 1. Navigate to Configure → Super Metrics.
Step 2. Click the Add button in the toolbar.
Step 3. Enter a Name for the super metric (e.g., "Cluster - Avg VM CPU Usage (Powered-On Only)"). Enter an optional Description explaining the formula's purpose and intended consumers.
Step 4. Select the Object Type that this super metric will be associated with. The super metric will appear as a metric on objects of this type. For example, selecting "Cluster Compute Resource" means the super metric will be calculated and displayed for each cluster.
Step 5. Build the formula in the Formula Editor. The editor provides a text area where you type or construct the formula using metric references, operators, and functions.
Step 6. Use the Metric Picker (right panel) to browse or search the available metric catalog. Double-click a metric to insert its reference into the formula. The metric reference is inserted in the syntax ${this, metric=<metric_key>}.
Step 7. Apply looping functions to iterate over child objects. For example, wrap a metric reference in avg() to compute the average value of that metric across all child objects at a specified depth. See Section 13.3 for the complete list of looping functions.
Step 8. Click the Preview button to validate the formula syntax and see sample results. The preview evaluates the formula against a few sample objects and displays the computed values. Fix any syntax errors before proceeding.
Step 9. Assign the super metric to a policy. A super metric only collects data when it is activated in at least one policy. Navigate to the Policies tab within the super metric editor, or go to Configure → Policies → [Policy Name] → Edit → Attributes/Metrics and enable the super metric under the appropriate object type.
Step 10. Click Save. The super metric begins collecting data on the next collection cycle for all objects governed by the policy where it is activated.
Important: Super metrics do not retroactively calculate historical data. Data collection begins from the moment the super metric is activated in a policy.
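As a concrete illustration, the super metric named in Step 3 could be written roughly as follows. This is a sketch using the looping functions and where-clause filtering covered later in this chapter; validate the exact expression with Preview before saving:

avg(${this, metric=cpu|usage_average, depth=2, where=Summary|Runtime|PowerState=Powered On})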
Looping functions iterate over child objects (or related objects at a specified depth) and aggregate a metric across them.
| Function | Description | Syntax Example |
|---|---|---|
| `avg()` | Calculates the arithmetic mean of a metric across child objects | `avg(${this, metric=cpu\|usage_average, depth=1})` |
| `combine()` | Combines individual time series from child objects into a unified series | `combine(${this, metric=cpu\|usage_average, depth=1})` |
| `count()` | Returns the number of child objects that report the specified metric | `count(${this, metric=cpu\|usage_average, depth=1})` |
| `max()` | Returns the maximum value of the metric across all child objects | `max(${this, metric=cpu\|usage_average, depth=1})` |
| `min()` | Returns the minimum value of the metric across all child objects | `min(${this, metric=cpu\|usage_average, depth=1})` |
| `sum()` | Returns the sum of the metric values across all child objects | `sum(${this, metric=mem\|consumed_average, depth=1})` |
The `depth` parameter controls how many levels down the hierarchy to traverse:

- `depth=1` — direct children only (e.g., VMs directly under a host).
- `depth=2` — children and grandchildren (e.g., VMs under hosts under a cluster).
- If omitted, `depth` defaults to `depth=1`.

Single functions operate on individual numeric values within the formula.
| Function | Description |
|---|---|
| `abs(x)` | Returns the absolute value of x |
| `acos(x)` | Returns the arc cosine of x (in radians) |
| `ceil(x)` | Returns the smallest integer greater than or equal to x |
| `cos(x)` | Returns the cosine of x (x in radians) |
| `exp(x)` | Returns Euler's number raised to the power of x |
| `floor(x)` | Returns the largest integer less than or equal to x |
| `log(x)` | Returns the natural logarithm (base e) of x |
| `log10(x)` | Returns the base-10 logarithm of x |
| `pow(x, y)` | Returns x raised to the power of y |
| `round(x)` | Returns x rounded to the nearest integer |
| `sqrt(x)` | Returns the square root of x |
| `sin(x)` | Returns the sine of x (x in radians) |
| `tan(x)` | Returns the tangent of x (x in radians) |
Super metric formulas support the following operators:
Numeric Operators:
| Operator | Description | Example |
|---|---|---|
| `+` | Addition | `metricA + metricB` |
| `-` | Subtraction | `metricA - metricB` |
| `*` | Multiplication | `metricA * 1024` |
| `/` | Division | `metricA / metricB` |
| `%` | Modulo (remainder) | `metricA % 60` |

Comparison Operators:

| Operator | Description | Example |
|---|---|---|
| `>` | Greater than | `metricA > 90` |
| `<` | Less than | `metricA < 10` |
| `>=` | Greater than or equal to | `metricA >= 50` |
| `<=` | Less than or equal to | `metricA <= 100` |
| `==` | Equal to | `metricA == 0` |
| `!=` | Not equal to | `metricA != -1` |

Logical Operators:

| Operator | Description | Example |
|---|---|---|
| `&&` | Logical AND | `(metricA > 90) && (metricB > 80)` |
| `\|\|` | Logical OR | `(metricA > 95) \|\| (metricB > 95)` |
| `!` | Logical NOT | `!(metricA == 0)` |

String Operators:

| Operator | Description | Example |
|---|---|---|
| `.contains()` | Checks if a string property contains a substring | `${this, property=config\|guestFullName}.contains("Windows")` |
| `.length()` | Returns the length of a string property | `${this, property=config\|name}.length()` |
**`depth` Parameter**

The `depth` parameter specifies how many levels of the object hierarchy to traverse when using looping functions:

- `depth=0` — the current object itself (no looping).
- `depth=1` — direct children (e.g., for a cluster: hosts; for a host: VMs).
- `depth=2` — grandchildren (e.g., for a cluster: VMs through hosts).

**`where` Clause**

The `where` clause filters child objects by a property value before aggregation:
avg(${this, metric=cpu|usage_average, depth=1, where=Summary|Guest Operating System=.*Linux.*})
This calculates the average CPU usage only for child VMs whose guest OS name matches the regex .*Linux.*.
The where clause supports:
- Exact matches: `where=Summary|Runtime|PowerState=Powered On`
- Regular expressions: `where=config|name=.*prod.*`
- Metric conditions: `where=cpu|usage_average > 50`

**`isFresh()` Function**

`isFresh()` checks whether a metric has received data within the most recent collection cycle. It returns 1 if fresh data exists, 0 otherwise. This is useful for conditionally including only actively-reporting objects:
sum(${this, metric=mem|consumed_average, depth=1, where=isFresh(mem|consumed_average)})
Intermediate calculations can be assigned to aliases for readability:
alias cpuTotal = sum(${this, metric=cpu|usagemhz_average, depth=1})
alias cpuCapacity = ${this, metric=cpu|capacity_usagemhz}
cpuTotal / cpuCapacity * 100
Use ternary syntax for conditional logic:
${this, metric=cpu|usage_average} > 80 ? 1 : 0
This returns 1 if CPU usage exceeds 80%, otherwise returns 0 — useful for creating "count of objects exceeding threshold" super metrics when combined with sum().
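For instance, a per-host count of VMs running hot might be sketched as follows (illustrative; confirm the expression with Preview before saving):

sum(${this, metric=cpu|usage_average, depth=1} > 80 ? 1 : 0)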
The following real-world examples demonstrate practical super metric formulas.
Example 1: Average VM CPU Usage Across a Cluster (Windows VMs Only)
Object Type: Cluster Compute Resource
avg(${this, metric=cpu|usage_average, depth=2, where=Summary|Guest Operating System=.*Windows.*})
This formula traverses two levels deep from the cluster (cluster → host → VM), filters to only Windows VMs, and calculates the average CPU usage across all matching VMs in the cluster.
Example 2: Total Memory Consumed by Powered-On VMs
Object Type: Cluster Compute Resource
sum(${this, metric=mem|consumed_average, depth=2, where=Summary|Runtime|PowerState=Powered On})
This formula sums the consumed memory metric across all VMs in the cluster that are currently powered on, giving an accurate picture of active memory demand.
Example 3: Count of VMs with CPU Ready Exceeding Threshold
Object Type: Host System
count(${this, metric=cpu|readyPct, depth=1, where=cpu|readyPct > 2.5})
This formula returns the number of VMs on a host where the CPU Ready percentage exceeds 2.5%, providing a single metric that indicates how many VMs on the host are experiencing CPU scheduling contention.
Example 4: Cluster CPU Overcommit Ratio
Object Type: Cluster Compute Resource
sum(${this, metric=cpu|num_vcpus_latest, depth=2}) / sum(${this, metric=cpu|corecount_provisioned, depth=0})
This formula divides the total number of vCPUs allocated across all VMs in the cluster (depth=2 to traverse through hosts to VMs) by the total physical core count of the cluster itself (depth=0 for the cluster's own metric), producing the vCPU-to-pCPU overcommit ratio.
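To make the traversal semantics of `depth` and `where` concrete, here is a small Python emulation of how a looping function walks the hierarchy, filters, and aggregates. The data model is hypothetical and not the product's internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str
    metrics: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def descendants(obj: Obj, depth: int) -> list:
    """Objects exactly `depth` levels below obj (depth=0 is obj itself)."""
    level = [obj]
    for _ in range(depth):
        level = [c for o in level for c in o.children]
    return level

def looping_avg(obj: Obj, metric: str, depth: int = 1, where=None):
    """Emulates avg(${this, metric=..., depth=..., where=...})."""
    values = [o.metrics[metric] for o in descendants(obj, depth)
              if metric in o.metrics and (where is None or where(o))]
    return sum(values) / len(values) if values else None

# Cluster -> hosts -> VMs, mirroring Example 2's powered-on filter:
vm1 = Obj("vm1", metrics={"cpu|usage_average": 30}, properties={"PowerState": "Powered On"})
vm2 = Obj("vm2", metrics={"cpu|usage_average": 90}, properties={"PowerState": "Powered Off"})
vm3 = Obj("vm3", metrics={"cpu|usage_average": 50}, properties={"PowerState": "Powered On"})
cluster = Obj("cluster", children=[Obj("host1", children=[vm1, vm2]),
                                   Obj("host2", children=[vm3])])

powered_on = lambda o: o.properties.get("PowerState") == "Powered On"
print(looping_avg(cluster, "cpu|usage_average", depth=2, where=powered_on))  # 40.0
```

The powered-off VM (vm2) is filtered out before aggregation, so only vm1 and vm3 contribute to the average, just as the `where` clause does in a real super metric.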
VCF Operations ships with an extensive library of predefined dashboards that provide immediate visibility into the health, performance, capacity, and efficiency of your virtual infrastructure. These dashboards represent Broadcom's best-practice views and serve as both operational tools and templates for custom dashboard development.
To access predefined dashboards, navigate to Visualize → Dashboards and browse the categorized list in the left panel.
Predefined dashboards are read-only — they cannot be modified directly. To customize a predefined dashboard, clone it and edit the clone; the original remains intact as a template.
Dashboards can be marked as Favorites (star icon) for quick access from the Favorites section of the left panel. The Home Dashboard can be set by navigating to Visualize → Dashboards → Actions → Set as Home Dashboard.
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| VM Performance | Identifies top CPU, memory, disk, and network consumers among virtual machines | Top-N CPU Usage, Top-N Memory Usage, Top-N Disk Latency, Top-N Network Throughput, Metric Chart |
| Cluster Performance | Displays cluster-level utilization trends for compute and storage | Cluster CPU/Memory Utilization Heatmap, Utilization Trend Charts, DRS Balance Scoreboard |
| ESXi Host Performance | Shows per-host utilization, contention, and hardware health | Host CPU/Memory Utilization, Host Contention Metrics, NIC Throughput, HBA Throughput |
| Datastore Performance | Monitors storage latency, IOPS, and throughput per datastore | Datastore Latency Trend, IOPS Distribution, Throughput Top-N, Outstanding IO |
| Network Performance | Tracks packet loss, throughput, errors, and dropped packets across network paths | Packet Loss Heatmap, Throughput Trends, Error Rate Scoreboard, Dropped Packets Top-N |
| vSAN Performance | Provides vSAN-specific IOPS, latency, throughput, and congestion metrics | vSAN IOPS Trend, Backend Latency, Congestion Scoreboard, Disk Group Performance |
| VM Contention | Surfaces per-VM contention indicators including CPU Ready, Co-Stop, and Memory Contention | CPU Ready % Top-N, Co-Stop Top-N, Memory Contention % Top-N, Disk Latency Top-N |
| Cluster Contention | Aggregates contention metrics at the cluster level for rapid triage | Cluster CPU Contention Heatmap, Memory Pressure Trend, Cluster Disk Latency Summary |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Cluster Capacity | Shows Time Remaining and Capacity Remaining per cluster with trend projections | Capacity Remaining Scoreboard, Time Remaining Scoreboard, Capacity Trend Chart, What-If Scenario |
| Datastore Capacity | Monitors storage utilization, provisioned vs used space, and forecast | Datastore Usage Heatmap, Capacity Trend, Thin Provisioning Overcommit, Forecast Chart |
| ESXi Host Capacity | Displays per-host capacity metrics including headroom for additional workloads | Host CPU/Memory Remaining, VM Density, Headroom Scoreboard |
| VM Capacity | Provides rightsizing recommendations for oversized and undersized VMs | Oversized VMs List, Undersized VMs List, Reclaimable CPU/Memory Scoreboard, Idle VMs |
| vSAN Capacity | Shows vSAN capacity utilization including deduplication and compression savings | vSAN Used vs Free, Dedup/Compression Ratio, Slack Space, Capacity Trend |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Cost Overview | Provides total and monthly cost breakdown across the environment | Total Cost Scoreboard, Monthly Trend Chart, Cost by Object Type, Cost by Datacenter |
| Optimization | Quantifies potential cost savings from rightsizing and reclamation | Reclaimable Cost Scoreboard, Powered-Off VM Cost, Idle VM Cost, Snapshot Cost |
| Showback | Displays cost allocation by business unit, department, or custom grouping | Cost by Department Chart, Cost by Application, Cost by Environment Tier |
| Chargeback | Supports billing integration with per-consumer cost detail | Chargeable Cost per Consumer, Rate Card Summary, Invoice Detail |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Availability Overview | Summarizes uptime, active alerts, and overall environment health | Uptime Scoreboard, Alert Count by Severity, Health Badge Summary, Outage Timeline |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Carbon Footprint | Estimates carbon emissions based on compute power consumption and regional emission factors | Total Carbon Emissions Scoreboard, Emissions Trend, Emissions by Cluster, PUE Factor |
| Green Scorecard | Tracks energy efficiency metrics and sustainability KPIs | Energy Efficiency Score, Power Consumption Trend, Idle Resource Waste, Green Improvement Recommendations |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| NSX-T Overview | High-level summary of NSX-T environment health, alert count, and component status | NSX Manager Health, Transport Node Status, Edge Cluster Status, Alert Summary |
| NSX Security Overview | Security posture summary including firewall rule counts, policy compliance, and threat indicators | DFW Rule Count, Security Policy Status, Applied Profiles, Threat Activity |
| NSX Logical Switching | Monitors logical switch health, port utilization, and segment configuration | Logical Switch List, Port Count Summary, Segment Health, VLAN/VXLAN Mapping |
| NSX Edge Performance | Tracks NSX Edge node CPU, memory, throughput, and session count | Edge CPU/Memory Utilization, Throughput per Edge, NAT Session Count, IPSec Tunnel Status |
| NSX Distributed Firewall | Monitors DFW rule evaluation rates, connection counts, and CPU overhead on hosts | DFW Rule Hit Count, Connection Rate, CPU Overhead Trend, Rule Table Size |
| NSX Load Balancer | Displays load balancer pool health, session distribution, and throughput | Pool Health Status, Active Sessions, Request Rate, Server Health Checks |
| NSX Network Topology | Visual topology map showing the relationships between logical routers, switches, and edge nodes | Interactive Topology Graph, Component Status Overlay, Alert Badge Overlay |
| NSX Troubleshooting | Diagnostic dashboard for identifying NSX control/data plane issues | Traceflow Results, Controller Cluster Health, Transport Zone Status, BFD Session Status |
| Dashboard Name | Purpose | Key Widgets |
|---|---|---|
| Application Monitoring | Tracks application-level metrics from integrated APM sources | Application Health Summary, Response Time Trend, Error Rate, Dependency Map |
| Workload Management | Monitors Tanzu Kubernetes clusters and workload placement | TKG Cluster Status, Pod Count, Namespace Utilization, Supervisor Cluster Health |
| Migration Planning | Assesses VM migration readiness and provides cloud cost comparison | Migration Readiness List, Cloud Cost Estimate, Dependency Analysis, Compatibility Check |
| Service Discovery | Maps discovered application services and their infrastructure dependencies | Service Map, Dependency Graph, Communication Flow, Infrastructure Mapping |
The following table provides industry-standard threshold guidance for key performance indicators. These values are used by many of the predefined dashboards and alert definitions.
| KPI | Good (Green) | Warning (Yellow) | Critical (Red) | Notes |
|---|---|---|---|---|
| CPU Ready % | < 2.5% | 2.5% – 5.0% | > 5.0% | Measured on a per-vCPU basis. Values above 5% indicate the VM is waiting for physical CPU scheduling and will experience application-visible latency. |
| CPU Co-Stop % | < 2.0% | 2.0% – 4.0% | > 4.0% | Relevant for SMP (multi-vCPU) VMs. Indicates vCPUs being halted to synchronize scheduling. Reduce vCPU count if consistently high. |
| Memory Contention % | < 1.0% | 1.0% – 3.0% | > 3.0% | Includes ballooning, swapping, and compression. Values above 3% indicate the host is under memory pressure and VMs are experiencing degraded performance. |
| Disk Latency (ms) | < 10 | 10 – 20 | > 20 | Combined read + write latency at the virtual disk (VMDK) level. Values above 20 ms are perceptible to most applications. |
| Disk Command Aborts | 0 | 1 – 5 | > 5 | Per collection interval (5 minutes). Any aborted commands indicate storage path issues and warrant investigation. |
| Network TX Drops | 0 | 1 – 100 | > 100 | Transmitted packet drops per interval. Indicates transmit queue saturation, typically caused by network bandwidth exhaustion or vSwitch misconfiguration. |
| Packet Loss % | 0% | 0% – 0.1% | > 0.1% | End-to-end packet loss. Even 0.1% loss is significant for latency-sensitive applications (VoIP, RDP, database replication). |
| vSAN Latency (ms) | < 5 | 5 – 10 | > 10 | vSAN backend (device-level) latency. Frontend (VM-visible) latency may be higher. Values above 10 ms indicate disk group saturation or network congestion. |
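To apply these bands programmatically, for example in a script consuming metric values from the REST API, a small classifier following the table's band edges might look like this (illustrative only; the inclusive/exclusive boundary handling mirrors the table, where Good is strictly below the warning edge and Critical is strictly above the critical edge):

```python
def classify_kpi(value: float, warn: float, crit: float) -> str:
    """Map a KPI reading to a badge color using the table's band edges."""
    if value > crit:
        return "Red"
    if value >= warn:
        return "Yellow"
    return "Green"

# CPU Ready %: warning band starts at 2.5, critical above 5.0
print(classify_kpi(1.8, 2.5, 5.0))  # Green
print(classify_kpi(3.2, 2.5, 5.0))  # Yellow
print(classify_kpi(6.0, 2.5, 5.0))  # Red
```

The same helper works for any row of the table by substituting that KPI's warning and critical edges.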
All dashboards support configurable time ranges and refresh intervals that control the data window displayed by widgets.
Time Range Options:
| Setting | Duration | Best For |
|---|---|---|
| Last Hour | 1 hour | Real-time troubleshooting, active incident investigation |
| Last 6 Hours | 6 hours | Default view — covers a typical shift or business window |
| Last 24 Hours | 24 hours | Daily review, identifying overnight patterns |
| Last 7 Days | 7 days | Weekly trend analysis, capacity planning reviews |
| Last 30 Days | 30 days | Monthly reporting, long-term trend identification |
| Custom | User-defined start and end | Post-incident analysis, compliance audits, specific maintenance windows |
The time range selector is located in the top-right toolbar of every dashboard. Changing the time range affects all time-aware widgets on the dashboard simultaneously.
Auto-Refresh Intervals:
| Setting | Behavior |
|---|---|
| Off | Dashboard displays static data from the last load; manual refresh required |
| 5 Minutes | Dashboard automatically refreshes every 5 minutes (aligns with default collection interval) |
| 10 Minutes | Dashboard automatically refreshes every 10 minutes |
| 15 Minutes | Dashboard automatically refreshes every 15 minutes |
The auto-refresh toggle is located next to the time range selector. For dashboards displayed on NOC wall screens, set auto-refresh to 5 minutes to maintain near-real-time visibility.
Note: Setting aggressive auto-refresh intervals on dashboards with many widgets or large object scopes may increase load on the VCF Operations analytics cluster. For environments with more than 10,000 objects, consider using 10- or 15-minute refresh intervals for complex dashboards.
While the predefined dashboards cover a broad range of operational scenarios, custom dashboards enable you to build views tailored to your organization's specific monitoring requirements, operational workflows, and reporting needs.
Follow these steps to create a new custom dashboard:
Step 1. Navigate to Visualize → Dashboards.
Step 2. Click the Create button in the top toolbar (or use the + icon).
Step 3. Enter a Dashboard Name (e.g., "Production Cluster Health — Tier 1").
Step 4. Optionally select a Dashboard Template from the dropdown. Templates provide pre-arranged widget layouts that you can populate with your own data sources. Available templates include Blank Canvas, Two-Column, Three-Column, Executive Summary, and Troubleshooting.
Step 5. Set the Default Time Range for the dashboard (e.g., Last 6 Hours). Individual widgets can override this if needed.
Step 6. Click Save. The empty dashboard canvas appears in edit mode, ready for widgets to be added.
To add widgets, drag them from the widget catalog onto the dashboard canvas, then configure each widget's data source, object scope, and display options.
VCF Operations provides a comprehensive widget catalog organized by functional category.
| Widget Name | Description |
|---|---|
| Metric Chart | Time-series visualization supporting line, area, and stacked area chart types. Displays one or more metrics for one or more objects over the selected time range. Supports trend lines, dynamic thresholds overlay, and data table toggle. |
| Scoreboard | Displays a single KPI value with configurable color-coded status bands (green/yellow/orange/red). Ideal for executive-level dashboards showing current state at a glance. Supports sparkline overlay and multi-metric mode. |
| Heatmap | Color-coded grid where each cell represents an object, colored by a selected metric value, and optionally sized by a second metric. Enables rapid visual identification of outliers across large object populations. |
| Top-N | Horizontal or vertical bar chart ranking objects by a selected metric. Configurable for top or bottom N values. Useful for identifying the highest consumers or worst performers. |
| Topology Graph | Interactive relationship map showing objects and their connections. Displays health badges, metric overlays, and alert status on each node. Supports configurable relationship depth. |
| Distribution Chart | Histogram or pie chart showing the distribution of objects across value ranges for a selected metric. Useful for understanding workload profiles and identifying clusters of similar behavior. |
| Sparkline | Compact, minimal trend line designed for embedding in dense dashboards. Shows directional trend without axis labels or detailed data points. |
Inventory and alert widgets:

| Widget Name | Description |
|---|---|
| Object List | Filterable, sortable table of inventory objects with configurable columns. Supports inline metric values, health badges, and property display. Can serve as a provider widget to drive other widgets on the dashboard. |
| Object Relationship | Hierarchical navigation widget showing parent, child, and peer relationships for a selected object. Enables drill-down through the inventory tree. |
| Alert List | Filtered table of active alerts with columns for severity, alert name, object name, time triggered, and control state. Supports filtering by alert type, criticality, object type, and time range. |
| Symptom List | Filtered table of active symptoms with details on the triggering condition, current value, and threshold. |
| Property List | Displays configuration properties and attributes for a selected object (CPU count, memory size, guest OS, tools version, etc.). |
Layout, utility, and navigation widgets:

| Widget Name | Description |
|---|---|
| Text Widget | Displays static text content. Supports HTML and Markdown formatting for embedding instructions, notes, team contact information, or operational procedures directly in the dashboard. |
| Image Widget | Embeds a static image (PNG, JPG, SVG) in the dashboard. Used for logos, architecture diagrams, or visual context. Images can be uploaded or referenced by URL. |
| Rolling View | Automatically cycles through a configured list of dashboards at a set interval. Designed for NOC wall displays that need to rotate between multiple views. |
| Container Widget | Groups multiple widgets into a tabbed container, conserving dashboard real estate. Each tab contains a separate widget, and users click tabs to switch between them. |
| Navigation Widget | Displays clickable links or buttons that navigate to other dashboards, external URLs, or specific objects in the inventory. Used for building multi-level dashboard hierarchies. |
| Geo Map | Plots objects on a geographic map based on configured location coordinates. Each marker shows health status and can be clicked for detail. Useful for multi-site or distributed infrastructure monitoring. |
The Scoreboard widget is the most commonly used widget for executive dashboards and NOC displays.
Configuration steps:
The Heatmap widget provides instant visual identification of outliers across hundreds or thousands of objects.
Configuration steps:
The Metric Chart widget is the primary tool for time-series analysis and trend investigation.
Configuration steps:
The Top-N widget ranks objects by a selected metric to quickly surface the highest or lowest performers.
Configuration steps:
The Topology Graph widget visualizes the relationships between infrastructure objects as an interactive network diagram.
Configuration steps:
Widget interactions enable a powerful provider/receiver paradigm where selecting an object in one widget automatically updates the data displayed in other widgets on the same dashboard. This creates interactive, drill-down capable dashboards.
Key concepts:
Configuring widget interactions:
Performance considerations:
Example interaction configuration:
A common pattern is the "list-and-detail" layout:
| Widget | Role | Purpose |
|---|---|---|
| Object List (Virtual Machines) | Provider | Displays a filterable list of VMs. User clicks a row to select a VM. |
| Metric Chart (CPU) | Receiver | Shows CPU usage trend for the selected VM. |
| Metric Chart (Memory) | Receiver | Shows memory usage trend for the selected VM. |
| Alert List | Receiver | Shows active alerts for the selected VM. |
| Property List | Receiver | Shows configuration properties of the selected VM. |
When the operator clicks a VM in the Object List, all four receiver widgets update simultaneously to show data for that specific VM, creating a cohesive investigation experience.
Dashboard navigation enables you to link multiple dashboards together, creating hierarchical drill-down paths that guide operators from high-level overviews to detailed investigation views.
Method 1: Navigation Widget
The Navigation Widget provides explicit, clickable links to other dashboards or external URLs.
Method 2: Object Click Actions
Configure what happens when a user clicks an object in a widget:
Method 3: Dashboard Linking via URL Parameters
Dashboards can be directly linked using URL parameters that pre-select objects and time ranges:
- Append `?objectId=<resource_id>` to the dashboard URL to pre-select an object.
- Append `&timeRange=<start>-<end>` in epoch milliseconds to pre-set the time window.

Best practices for dashboard navigation:
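These URL parameters can be assembled programmatically, for example when generating deep links from a ticketing system. This is a minimal sketch: only the `objectId` and `timeRange` parameter names come from the text above, and the base URL and object ID used in the example are placeholders.

```python
from datetime import datetime

def dashboard_url(base_url: str, object_id: str, start: datetime, end: datetime) -> str:
    """Build a dashboard deep link with objectId and a timeRange in epoch milliseconds."""
    to_ms = lambda dt: int(dt.timestamp() * 1000)  # epoch milliseconds, as specified above
    return f"{base_url}?objectId={object_id}&timeRange={to_ms(start)}-{to_ms(end)}"
```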
Dashboards are the primary interface through which operators, engineers, and executives consume data from VCF Operations. A poorly designed dashboard buries critical information; a well-designed dashboard surfaces the right data to the right audience at the right time. This chapter provides six ready-to-implement dashboard blueprints and a set of universal design principles.
This dashboard is the first screen an operations engineer should open each morning. It answers one question: "Is anything broken or about to break?"
Row 1 — Scoreboards (4 widgets, equal width)
| Widget | Type | Metric / Property | Color Coding |
|---|---|---|---|
| Overall Cluster Health | Scoreboard | Worst badge color across all clusters | Green / Yellow / Orange / Red |
| Total Critical Alerts | Scoreboard | Count of alerts where Criticality = Critical | Red if > 0, Green if 0 |
| Total Warning Alerts | Scoreboard | Count of alerts where Criticality = Warning | Yellow if > 5, Green if ≤ 5 |
| VM Count / Host Count | Scoreboard | Total VMs (powered on) and total ESXi hosts | Informational — no threshold |
Configuration Tip: Set the Scoreboard refresh interval to 5 minutes. Use the "Sparkline" option to show a 24-hour mini-trend directly inside the scoreboard tile.
Row 2 — Top-N Performance Offenders (3 widgets, equal width)
| Widget | Type | Object Type | Metric | Sort | Count |
|---|---|---|---|---|---|
| Top-N CPU Ready VMs | Top-N | Virtual Machine | cpu|readyPct | Descending | 10 |
| Top-N Memory Contention VMs | Top-N | Virtual Machine | mem|contention_average | Descending | 10 |
| Top-N Disk Latency VMs | Top-N | Virtual Machine | virtualDisk|totalLatency | Descending | 10 |
Row 3 — Trends and Heatmaps (2 widgets, 60/40 split)
| Widget | Type | Configuration |
|---|---|---|
| Cluster Capacity Heatmap | Heatmap | Object: Cluster Compute Resource; Color by: cpu|capacityRemaining_percentage; Size by: summary|total_number_vms |
| Alert Trend (7-day) | Metric Chart | Scope: all clusters; Metric: count of alerts by day; Mode: stacked bar by criticality |
Note: Why a 7-day alert trend? A 7-day window reveals patterns tied to weekly batch jobs, backup windows, or recurring misconfigurations. A single day's snapshot hides these cycles.
This dashboard is reviewed weekly by capacity and infrastructure teams. It answers: "When will we run out of resources, and what can we reclaim?"
Row 1 — Scoreboards
| Widget | Metric | Threshold |
|---|---|---|
| Clusters at Risk | Count of clusters where Time Remaining < 90 days | Red if > 0 |
| Total Reclaimable vCPU | Sum of reclaimable CPU across all VMs (from rightsizing engine) | Informational |
| Total Reclaimable Memory (GB) | Sum of reclaimable RAM | Informational |
| Average Cluster Utilization % | Avg of cpu|demandPct across clusters | Yellow > 70%, Red > 85% |
Row 2 — Bar Charts (2 widgets, equal width)
| Widget | Type | Details |
|---|---|---|
| Cluster Capacity Time Remaining | Top-N (horizontal bar) | Metric: capacityRemainingUsingConsumers_timeRemaining; Sort: Ascending (worst first); Top 10 |
| Datastore Capacity Remaining | Top-N (horizontal bar) | Metric: diskspace|capacityRemaining_percentage; Sort: Ascending; Top 10 |
Row 3 — Lists and Actions (2 widgets, equal width)
| Widget | Type | Details |
|---|---|---|
| VM Rightsizing Candidates | Object List | Filter: oversized = true; Columns: VM Name, Provisioned vCPU, Recommended vCPU, Provisioned RAM, Recommended RAM |
| What-If Scenario Launcher | Text Widget | Hyperlink to Optimize → What-If Analysis with instructions |
Capacity Threshold Recommendations:
| Resource | Conservative | Moderate | Aggressive |
|---|---|---|---|
| CPU Demand % | 60% | 70% | 80% |
| Memory Demand % | 70% | 80% | 90% |
| Datastore Used % | 70% | 80% | 85% |
| Time Remaining (days) | 180 | 90 | 60 |
This dashboard is used during active troubleshooting or continuous performance reviews. It answers: "How are my workloads performing right now and over time?"
Row 1 — Scoreboards (3 widgets)
| Widget | Metric | Threshold |
|---|---|---|
| Average CPU Usage % | Avg cpu|usage_average across all clusters | Yellow > 70%, Red > 85% |
| Average Memory Usage % | Avg mem|usage_average across all clusters | Yellow > 75%, Red > 90% |
| Average Disk Latency (ms) | Avg virtualDisk|totalLatency across all VMs | Yellow > 15 ms, Red > 25 ms |
Row 2 — Metric Charts (2 widgets, equal width)
| Widget | Type | Configuration |
|---|---|---|
| Cluster CPU/Memory Trend (30-day) | Metric Chart (line) | Scope: select clusters; Metrics: cpu|demandPct, mem|demandPct; Date range: Last 30 Days; Show dynamic thresholds |
| vSAN Latency Trend | Metric Chart (line) | Scope: vSAN clusters; Metrics: vSAN|readLatency, vSAN|writeLatency; Date range: Last 30 Days |
Row 3 — Heatmap and Top-N (2 widgets, 60/40 split)
| Widget | Type | Configuration |
|---|---|---|
| All VMs by CPU Ready % | Heatmap | Object: Virtual Machine; Group by: Parent Cluster; Color by: cpu|readyPct; Size by: config|hardware|numCpu |
| Top-N Network Drops | Top-N | Object: Host System; Metric: net|droppedPct; Sort: Descending; Count: 10 |
This dashboard serves finance teams and infrastructure managers tracking cloud and on-premises spending. It answers: "Where is the money going, and where can we save?"
Row 1 — Scoreboards (3 widgets)
| Widget | Metric | Notes |
|---|---|---|
| Total Monthly Cost | costop|totalCost | Requires cost drivers to be configured under Optimize → Cost Drivers |
| Cost per VM | costop|costPerVM | Derived from total cost ÷ powered-on VM count |
| Cost Trend | Metric Chart (sparkline) | 6-month trend of totalCost |
Row 2 — Distribution and Savings (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Cost by Department | Distribution (pie chart) | Group by: Custom Property "Department"; Metric: costop|totalCost |
| Optimization Savings Potential | Scoreboard | Metric: sum of potential savings from rightsizing + reclamation recommendations |
Row 3 — Actionable Lists (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Idle / Powered-Off VM List | Object List | Filter: powerState = poweredOff OR idleVM = true; Columns: VM Name, Power State, Days Since Last I/O, Monthly Cost |
| Snapshot Age Violations | Object List | Filter: snapshot|age > 72 hours; Columns: VM Name, Snapshot Name, Age (hours), Size (GB) |
This dashboard is essential for security and audit teams. It answers: "Are we compliant, and where have we drifted?"
Row 1 — Scoreboards (2 widgets)
| Widget | Metric | Threshold |
|---|---|---|
| Overall Compliance Score | Percentage of objects passing all benchmark tests | Green ≥ 95%, Yellow ≥ 80%, Red < 80% |
| Non-Compliant Objects Count | Count of objects with at least one failure | Red if > 0 |
Row 2 — Compliance by Benchmark (3 widgets)
| Widget | Type | Configuration |
|---|---|---|
| DISA STIG Compliance | Scoreboard + bar | Pass/Fail count for DISA STIG benchmark rules |
| CIS Benchmark Compliance | Scoreboard + bar | Pass/Fail count for CIS benchmark rules |
| PCI-DSS Compliance | Scoreboard + bar | Pass/Fail count for PCI-DSS benchmark rules |
Row 3 — Drift and Changes (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Drift Detection Alerts | Alert List | Filter: Alert Type = Compliance, Sub-type = Drift; Sort by: time (newest first) |
| Configuration Change Timeline | Metric Chart (event overlay) | Show configuration change events overlaid on compliance score trend |
This dashboard is designed for C-level and director-level audiences. It prioritizes clarity over detail and should be presentable on a projector or shared screen without explanation.
Design Principles for Executive Dashboards:
Row 1 — Environment Scorecard (3 large scoreboards)
| Widget | Label | Source |
|---|---|---|
| Health | "Infrastructure Health" | Worst health badge across all clusters |
| Risk | "Risk Score" | Highest risk badge across all clusters |
| Efficiency | "Resource Efficiency" | Average efficiency badge across all clusters |
Row 2 — 30-Day Trends (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| 30-Day Alert Trend | Metric Chart (area) | Stacked area by criticality (Critical, Warning, Info); Date range: 30 days |
| Capacity Runway Summary | Scoreboard list | Show Time Remaining (days) for each cluster, color-coded |
Row 3 — Cost and Sustainability (2 widgets)
| Widget | Type | Configuration |
|---|---|---|
| Cost Summary | Scoreboard | Total monthly cost with month-over-month delta percentage |
| Sustainability Metrics | Scoreboard | Power consumption (kWh), Carbon estimate (if available via management pack) |
Limit widget count. Keep dashboards to 15–20 widgets maximum. Each additional widget increases render time and cognitive load. If you need more, create a second dashboard and link them.
Use widget interactions for drill-down. Configure widget interactions so that clicking an object in a Top-N chart drives the selection in a Metric Chart or Object List widget on the same dashboard. This eliminates the need to duplicate data.
Group related metrics logically. Place CPU metrics adjacent to CPU-related alerts. Place capacity widgets together. The user's eye should flow naturally from overview to detail, left to right, top to bottom.
Use consistent time ranges. If one chart shows 30 days, all charts on that dashboard should show 30 days unless there is a specific analytical reason to differ. Inconsistent ranges confuse viewers.
Place critical KPIs in the top-left quadrant. Eye-tracking studies confirm that users scan dashboards starting from the top-left. Place the most urgent or important information there.
Use Text Widgets for section headers. A simple text widget with a bold label like "Performance Indicators" or "Capacity Metrics" helps organize the dashboard visually and aids comprehension.
Clone predefined dashboards as starting points. VCF Operations ships with dozens of out-of-the-box dashboards. Clone one that is close to your goal, then modify it. This saves time and ensures you start with proven widget configurations.
Test with real data at scale. A dashboard that loads quickly in a lab with 50 VMs may be unusably slow in production with 10,000 VMs. Test with production scope before publishing.
Set appropriate default scopes. Avoid dashboards scoped to "All Objects" when a narrower scope (specific cluster, resource pool, or custom group) would be more relevant.
Document your dashboards. Add a Text Widget at the top of each dashboard with a one-sentence purpose statement and the intended audience. This prevents dashboard sprawl and confusion.
Views and Reports are the primary mechanism for extracting structured, repeatable, and shareable data from VCF Operations. While dashboards are interactive and real-time, reports are static snapshots designed for distribution, archival, and audit compliance.
Views are the building blocks of reports. Each view type presents data in a specific visual format optimized for a particular analytical need.
| View Type | Description | Best Use Case | Output Format |
|---|---|---|---|
| List View | Tabular list of objects with selected metrics and properties displayed as columns | Inventory reports, VM configuration audits, host hardware lists | Table |
| Trend View | Time-series line or area graph plotting one or more metrics over a defined date range | Performance analysis, capacity trending, SLA compliance over time | Line/Area Chart |
| Distribution View | Pie chart or histogram showing how a metric's values are distributed across objects | Resource allocation analysis, workload distribution, cost breakdown by department | Pie/Histogram |
| Image View | Custom uploaded image (PNG, JPG, SVG) with data overlays positioned at specific coordinates | Network topology diagrams, data center floor plans with live metrics, rack diagrams | Annotated Image |
| Summary View | Aggregated statistics (average, minimum, maximum, sum, count) for selected metrics across a group of objects | Executive summaries, SLA reports, aggregate capacity statements | Summary Table |
Note: Image Views require you to upload a base image first, then map data points to specific pixel coordinates on the image. This is most commonly used for physical data center visualizations.
Follow these steps to create a custom view.
Step 1. Navigate to Visualize → Views in the left navigation menu.
Step 2. Click the Create button (plus icon) in the toolbar.
Step 3. In the Presentation section, enter:
Step 4. Select the View Type from the dropdown: List, Trend, Distribution, Image, or Summary.
Step 5. In the Subjects section, select the Object Type that this view will report on. Common selections include:
Step 6. Switch to the Data tab. Here you select the metrics and properties to display:
Step 7. Switch to the Filter tab (optional). Apply conditions to limit which objects appear in the view. Filters use property or metric-based conditions such as:
- `powerState = poweredOn`
- `cpu|usage_average > 80`
- `summary|config|numCpu >= 8`

Multiple filter conditions can be combined with AND/OR logic.
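The AND/OR combination logic described above can be illustrated with a small evaluator. This is an illustrative sketch, not the product's filter engine; the metric keys mirror the filter examples in this step.

```python
import operator

# Comparison operators used in view filter conditions
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(obj: dict, conditions: list, mode: str = "AND") -> bool:
    """Evaluate conditions like ("cpu|usage_average", ">", 80) against an object's data."""
    results = [OPS[op](obj.get(key), value) for key, op, value in conditions]
    return all(results) if mode == "AND" else any(results)
```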
Step 8. Click Preview to verify the output displays the expected data with the correct format and filtering.
Step 9. Click Save. The view is now available for use in dashboards or report templates.
Tip: When creating List Views, limit the number of columns to 10–12 for readability. If you need more data points, create a second view rather than cramming everything into one table.
Report templates combine one or more views into a formatted document suitable for distribution. Follow this procedure.
Step 1. Navigate to Visualize → Reports in the left navigation menu.
Step 2. Click Create Template in the toolbar.
Step 3. Enter the Report Name (e.g., "Weekly Infrastructure Health Report") and an optional Description.
Step 4. In the report canvas, add views by dragging them from the left panel into the report body. You can include multiple views of different types. Arrange them in the desired order — each view will render as a separate section in the final report.
Step 5. Optionally configure presentation elements:
Step 6. Click Save. The template is now available for on-demand generation or scheduled execution.
Important: Report templates are separate from the data they display. A template defines the structure; the data is populated at generation time based on the scope you select.
On-Demand Generation:
Scheduled Generation:
| Parameter | Options | Recommendation |
|---|---|---|
| Frequency | Daily, Weekly, Monthly | Weekly for operational reports, Monthly for executive reports |
| Day of Week | Monday–Sunday (for weekly) | Monday morning for "last week" review |
| Time of Day | HH:MM (24-hour format) | 06:00 — before the operations team arrives |
| Scope | Object, Group, or Tag-based | Use Custom Groups for consistent scoping |
Available placeholders: `${ReportName}`, `${Date}`.

Warning: Scheduled reports consume analytics engine resources during generation. Avoid scheduling more than 10 reports in the same time window. Stagger schedules by 15–30 minutes.
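When registering many schedules, the recommended stagger can be generated mechanically. A sketch; schedule registration itself happens in the UI or API and is not shown here.

```python
from datetime import datetime, timedelta

def staggered_start_times(first_run: datetime, report_count: int, gap_minutes: int = 15) -> list:
    """Spread scheduled report start times so generations do not pile up in one window."""
    return [first_run + timedelta(minutes=gap_minutes * i) for i in range(report_count)]
```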
| Format | Content | Use Case | Limitations |
|---|---|---|---|
| PDF | Fully formatted report with charts, tables, headers, footers, and cover page | Distribution to stakeholders, audit documentation, archival | Charts are rendered as static images; no interactivity |
| CSV | Raw tabular data export; one CSV file per List or Summary view in the report | Spreadsheet analysis, data import into third-party tools, custom charting | No charts or formatting; Trend and Distribution views export as data tables |
Both formats are available from the Generated Reports tab. Click the download icon next to a completed report and select the desired format.
Tip: For automated downstream processing, use the Suite API endpoint `POST /suite-api/api/reports/{reportId}/download` with the `format` query parameter set to `csv`. This enables integration with ticketing systems, SharePoint libraries, or custom portals.
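A downstream integration can construct the download request as follows. This sketch only builds the URL; authentication (for example, a bearer token header) and the actual HTTP call are omitted, the endpoint path is taken verbatim from the tip above, and the host name is a placeholder.

```python
from urllib.parse import urlencode

def report_download_url(base: str, report_id: str, fmt: str = "csv") -> str:
    """Build the Suite API download URL for a generated report."""
    return f"{base}/suite-api/api/reports/{report_id}/download?{urlencode({'format': fmt})}"
```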
Capacity planning in VCF Operations moves beyond simple threshold monitoring into predictive analytics. The platform's capacity engine continuously analyzes historical consumption patterns, applies multiple forecasting algorithms, and produces actionable recommendations for rightsizing, reclamation, and future procurement.
The capacity engine evaluates every cluster, datastore, and resource pool across three dimensions.
| Metric | Definition | Where to Find | Action Trigger |
|---|---|---|---|
| Time Remaining | Projected number of days until a resource (CPU, Memory, Disk) reaches its usable capacity limit | Optimize → Capacity → select cluster | < 90 days: plan procurement or migration |
| Capacity Remaining (%) | Percentage of total usable capacity that is still available after accounting for HA reserves, buffers, and current demand | Optimize → Capacity → select cluster | < 20%: immediate attention required |
| Recommended Size | The optimal allocation of vCPU, memory, or disk for a given VM based on actual usage patterns | Optimize → Rightsizing → select VM | Delta > 25% from current: rightsizing candidate |
The capacity engine runs on a continuous cycle, recalculating projections every collection interval (default: 5 minutes for real-time, daily for long-term forecasts).
Important: Capacity calculations honor the policy settings applied to each object. If your policy sets a CPU utilization cap of 70% (meaning 70% is considered "full"), Time Remaining reflects when demand will reach 70%, not 100%.
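The interaction between Time Remaining and the policy cap can be sketched with a simple linear projection (the real engine blends several algorithms, as described in the next section). Everything here is illustrative: demand is one sample per day, and the 70% cap mirrors the example in the note above.

```python
def time_remaining_days(daily_demand_pct: list, usable_cap_pct: float = 70.0):
    """Project days until demand reaches the policy cap, using a least-squares linear fit."""
    n = len(daily_demand_pct)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(daily_demand_pct) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_demand_pct)) \
            / sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # flat or shrinking demand: no exhaustion projected
    return max(0.0, (usable_cap_pct - daily_demand_pct[-1]) / slope)
```

With demand growing 1% per day from 50% to 54%, the cluster hits a 70% cap in 16 days, not the ~46 days a naive projection to 100% would suggest.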
VCF Operations does not rely on a single forecasting model. Instead, it runs multiple algorithms in parallel and selects the best fit for each metric on each object.
| Algorithm | How It Works | Best Suited For | Weakness |
|---|---|---|---|
| Change-Point Detection | Identifies sudden, sustained shifts in the data (step changes) and adjusts the baseline accordingly | Environments with frequent application deployments or workload migrations | May over-react to one-time events when insufficient history is available |
| Linear Regression | Fits a straight line through historical data points and projects the trend forward | Steady, predictable growth patterns (e.g., data stores growing at constant rate) | Cannot model cyclical or seasonal patterns |
| Cyclical Analysis | Detects repeating patterns on daily, weekly, or monthly cycles and factors them into the projection | Workloads with known cycles — month-end batch processing, weekly reporting jobs | Requires 2+ full cycles of history to detect patterns |
| Exponential Smoothing | Applies exponentially decreasing weights to older data, giving recent observations more influence | Environments where recent behavior is more indicative of future behavior than distant history | Can be thrown off by recent anomalies |
The analytics engine scores each algorithm's fit against actual historical data using a mean-absolute-percentage-error (MAPE) calculation. The algorithm with the lowest MAPE for a given metric is selected for that metric's forecast.
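The selection rule can be expressed directly: compute each algorithm's MAPE over a backtest window and keep the lowest. A minimal sketch; the algorithm names and backtest values in the example are placeholders.

```python
def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error between observed and back-tested values."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def select_algorithm(history: list, backtests: dict) -> str:
    """Return the algorithm whose backtest fits the history best (lowest MAPE)."""
    return min(backtests, key=lambda name: mape(history, backtests[name]))
```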
Tip: To see which algorithm was selected for a specific metric, navigate to the cluster's Capacity tab and hover over the forecast line. The tooltip displays the algorithm name and confidence interval.
Not all spikes in resource consumption are equal. The capacity engine classifies peaks to prevent false alarms and ensure accurate forecasting.
| Peak Type | Duration | Impact on Capacity Calculation | Example |
|---|---|---|---|
| Momentary | Less than 5 minutes | Ignored — treated as noise | CPU spike during VM snapshot creation, brief network burst |
| Sustained | 5 minutes to 4 hours | Included in analysis with standard weight | Application batch job, database index rebuild, backup window |
| Periodic | Recurring at regular intervals | Weighted appropriately based on recurrence frequency | End-of-month financial close processing, weekly ETL jobs, nightly backups |
Peak classification thresholds can be adjusted in the active policy under Configure → Policies → Edit Policy → Capacity and Allocation → Peak Classification.
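The classification rules above reduce to a short decision function. A sketch using the default thresholds from the table; recurrence detection (which the engine derives from cyclical analysis) is simplified to a boolean flag here.

```python
def classify_peak(duration_minutes: float, recurs_regularly: bool = False) -> str:
    """Classify a consumption peak using the default thresholds from the table above."""
    if recurs_regularly:
        return "periodic"    # weighted by recurrence frequency
    if duration_minutes < 5:
        return "momentary"   # ignored as noise
    return "sustained"       # 5 minutes to 4 hours: standard weight
```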
Navigate to: Optimize → Rightsizing
Rightsizing identifies VMs whose allocated resources are significantly mismatched to their actual consumption patterns.
Oversized VM Detection Criteria:
| Resource | Condition | Default Threshold |
|---|---|---|
| CPU | Provisioned vCPUs exceed peak demand by a factor of 2 or more | Provisioned vCPU > 2x 95th-percentile CPU demand |
| Memory | Provisioned RAM exceeds peak demand by a factor of 1.5 or more | Provisioned RAM > 1.5x 95th-percentile active memory |
Undersized VM Detection Criteria:
| Resource | Condition | Default Threshold |
|---|---|---|
| CPU | CPU Ready percentage consistently elevated | cpu|readyPct > 2.5% over 7-day average |
| Memory | Memory ballooning or swapping is active | mem|balloonPct > 0% or mem|swapused_average > 0 |
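The CPU detection criteria above reduce to a comparison against a 95th-percentile demand figure. A sketch under stated assumptions: demand samples are expressed in vCPU-equivalents, and only the CPU rules (one oversized, one undersized) are shown.

```python
def percentile(values: list, pct: float) -> float:
    """Linear-interpolated percentile (sufficient for this sketch)."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def classify_vm_cpu(provisioned_vcpu: int, demand_samples: list, ready_pct_7day_avg: float) -> str:
    """Default CPU thresholds: oversized if provisioned > 2x p95 demand,
    undersized if the 7-day average CPU Ready exceeds 2.5%."""
    if provisioned_vcpu > 2 * percentile(demand_samples, 95):
        return "oversized"
    if ready_pct_7day_avg > 2.5:
        return "undersized"
    return "right-sized"
```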
Rightsizing Report Columns:
| Column | Description |
|---|---|
| VM Name | Virtual machine display name |
| Current vCPU | Currently provisioned vCPU count |
| Recommended vCPU | Analytics-recommended vCPU count |
| Current Memory (GB) | Currently provisioned RAM |
| Recommended Memory (GB) | Analytics-recommended RAM |
| Potential Savings | Estimated cost reduction if rightsized (requires cost drivers) |
Taking Action on Rightsizing Recommendations:
Warning: Always validate rightsizing recommendations against application-level requirements. A VM may appear oversized from an infrastructure perspective but require the allocated resources for licensing compliance (e.g., Oracle per-core licensing) or application-mandated minimums.
Navigate to: Optimize → Reclaim
The reclamation engine identifies waste — resources that are allocated but delivering no value.
| Category | Detection Criteria | Default Threshold | Typical Savings |
|---|---|---|---|
| Powered-Off VMs | VM in poweredOff state for extended period | Idle > 30 days | Full VM cost recovery |
| Orphaned VMDKs | VMDK files on datastores not attached to any registered VM | Any orphaned VMDK | Storage reclamation |
| Old Snapshots | VM snapshots exceeding age threshold | Age > 72 hours (3 days) | Storage reclamation; performance improvement |
| Idle VMs | Powered-on VMs with negligible CPU, memory, network, and disk I/O | CPU < 100 MHz, Network < 1 KBps, Disk I/O < 1 IOPS for 7+ days | Full VM cost recovery |
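The idle-VM row above combines four thresholds that must all hold over the observation window. A sketch using the default values from the table:

```python
def is_idle_vm(cpu_mhz: float, net_kbps: float, disk_iops: float, days_observed: int) -> bool:
    """Apply the default idle-VM thresholds from the reclamation table above."""
    return (days_observed >= 7
            and cpu_mhz < 100
            and net_kbps < 1
            and disk_iops < 1)
```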
Best Practice: Schedule a weekly reclamation review meeting. Export the reclamation report and distribute it to application owners with a 14-day response window. VMs and VMDKs not claimed within the window are candidates for decommissioning.
Navigate to: Optimize → Workload Optimization
Workload Optimization provides DRS-like placement recommendations, but operates at the VCF Operations level rather than within a single vCenter. This enables cross-cluster and even cross-vCenter balancing recommendations.
Considerations evaluated by the engine:
Output: The engine generates a prioritized list of migration recommendations. Each recommendation includes:
| Field | Description |
|---|---|
| VM Name | The virtual machine to migrate |
| Source Host / Cluster | Current placement |
| Destination Host / Cluster | Recommended placement |
| Improvement | Projected reduction in contention or improvement in balance score |
| Risk | Assessment of migration risk (Low / Medium / High) |
Note: Workload Optimization recommendations are advisory. VCF Operations does not execute migrations autonomously unless integrated with an automation platform and explicitly configured to do so.
Navigate to: Optimize → What-If Analysis
What-If Analysis lets you model hypothetical changes to your environment and see projected capacity impacts before committing resources or budget.
Scenario Types:
| Scenario Type | Question It Answers | Required Inputs |
|---|---|---|
| Add Workload | "What if I deploy 50 new VMs?" | VM profile (vCPU, RAM, Disk per VM), quantity, target cluster |
| Remove Workload | "What if I decommission this cluster's VMs?" | Select VMs or clusters to remove |
| Add Infrastructure | "What if I add 3 hosts to this cluster?" | Host profile (CPU cores, RAM, local storage), quantity, target cluster |
| Change Allocation | "What if I change the overcommit ratio?" | New CPU or memory overcommit ratio, target cluster |
Step-by-Step Procedure (applicable to all scenario types):
Step 1. Click Create Scenario and provide a scenario name (e.g., "Q3 ERP Migration Impact").
Step 2. Select the scenario type from the four options above.
Step 3. Enter parameters specific to the scenario type. For "Add Workload," define the VM profile:
Step 4. Select the target cluster(s) where the workload will be placed or infrastructure will be added.
Step 5. Click Run Analysis. The engine calculates the impact using the same forecasting algorithms described in Section 18.2.
Step 6. Review the results:
| Result Field | Description |
|---|---|
| Time Remaining (Before) | Projected days before the scenario |
| Time Remaining (After) | Projected days after applying the scenario |
| Capacity Remaining % (Before/After) | Side-by-side capacity comparison |
| Risk Level Change | Whether the cluster moves from Green to Yellow/Red |
| Alerts Generated | Any new capacity alerts that would trigger |
Step 7. Save the scenario for future reference or discard it. Saved scenarios can be revisited, modified, and re-run as conditions change.
Tip: Combine scenario types for complex planning. First run "Add Workload" to see the impact of a new project, then run "Add Infrastructure" to determine how many hosts are needed to absorb it. Compare the two scenarios side by side.
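The before/after comparison in Step 6 can be approximated for the "Add Workload" case. A deliberately simplified sketch: capacity and demand are in vCPUs, organic growth is assumed linear, and HA reserves and policy buffers are ignored (the real engine honors all of these).

```python
def add_workload_time_remaining(capacity_vcpu: float, demand_vcpu: float,
                                vm_vcpu: int, vm_count: int,
                                growth_vcpu_per_day: float):
    """Return (before, after) Time Remaining in days for an Add Workload scenario."""
    def days(current_demand: float):
        if growth_vcpu_per_day <= 0:
            return None  # no organic growth: no exhaustion date
        return max(0.0, (capacity_vcpu - current_demand) / growth_vcpu_per_day)
    return days(demand_vcpu), days(demand_vcpu + vm_vcpu * vm_count)
```

For example, dropping 10 two-vCPU VMs onto a 100-vCPU cluster at 60 vCPU demand halves the runway, which is exactly the kind of Green-to-Yellow shift Step 6 surfaces.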
Management Packs extend VCF Operations beyond vSphere, enabling unified monitoring across heterogeneous infrastructure, cloud platforms, applications, and hardware.
A Management Pack is a pluggable adapter module that teaches VCF Operations how to collect, interpret, and act on data from a specific technology. Each management pack is a self-contained package that includes:
| Component | Purpose |
|---|---|
| Adapter Code | The collection engine that connects to the target system via API, SNMP, WMI, SSH, or other protocol |
| Object Model | Defines the object types (e.g., "AWS EC2 Instance," "NetApp Volume") and their relationships |
| Metric Definitions | The specific metrics to collect, their units, and collection intervals |
| Dashboards | Pre-built dashboards tailored to the monitored technology |
| Alert Definitions | Symptoms and alert rules specific to the technology |
| Views and Reports | Pre-built views and report templates |
Management packs are distributed as PAK files (Platform Archive Kit) — a signed archive format used by the VCF Operations platform for all extensions and updates.
Step 1. Obtain the management pack PAK file. Sources include:
Step 2. In VCF Operations, navigate to Administration → Integrations → Repository.
Step 3. Click Add (or Upload PAK File, depending on UI version).
Step 4. Browse to the downloaded PAK file and click Upload. The system validates the file signature and compatibility.
Step 5. Review and accept the End User License Agreement (EULA).
Step 6. Monitor the installation progress bar. Installation typically takes 2–5 minutes. The cluster will distribute the adapter code to all nodes automatically.
Step 7. After installation completes, configure the adapter instance:
Warning: After installing a management pack, allow 2–3 collection cycles (typically 10–15 minutes) before expecting data to appear in dashboards. The initial collection cycle populates the object inventory; subsequent cycles populate metrics.
The following table lists the management packs available from Broadcom, including those built into VCF Operations and those available as separate downloads.
| # | Management Pack | Version | Monitored Technology | Key Metrics | Built-In |
|---|---|---|---|---|---|
| 1 | VMware vSphere | 8.18.2 | vCenter, ESXi Hosts, VMs, Resource Pools | CPU, Memory, Disk, Network for all vSphere objects | Yes |
| 2 | VMware NSX-T | 8.18.2 | NSX Manager, Transport Nodes, Logical Switches, DFW | Transport node health, DFW rule hit counts, tunnel status | Yes |
| 3 | VMware SDDC Manager | 8.18.2 | SDDC Manager, Workload Domains, VCF Lifecycle | Domain health, lifecycle operation status | Yes |
| 4 | VMware vSAN | 8.18.2 | vSAN Clusters, Disk Groups, Capacity Devices | Resync status, cache hit ratio, congestion, latency | Yes |
| 5 | VMware Cloud Director | 5.x | VCD Cells, Organizations, vApps, Org VDCs | Cell health, Org resource consumption | No |
| 6 | VMware Horizon | 4.x | Connection Servers, Desktop Pools, Sessions | Session latency, pool utilization, protocol performance | No |
| 7 | VMware Tanzu | 2.x | TKG Clusters, Supervisor Namespaces, Pods, Nodes | Pod restart count, node resource usage, cluster health | No |
| 8 | VCF Automation | 4.x | Blueprints, Deployments, Catalog Items | Deployment success rate, provisioning time | No |
| 9 | AWS | 4.x | EC2, S3, RDS, Lambda, ELB, CloudWatch | Instance utilization, S3 bucket size, RDS connections | No |
| 10 | Azure | 4.x | VMs, Storage Accounts, SQL Database, App Services | VM performance, storage transactions, DTU usage | No |
| 11 | Google Cloud | 2.x | GCE Instances, GCS Buckets, BigQuery, Cloud SQL | Instance CPU, bucket object count, query slot utilization | No |
| 12 | Dell EMC | Varies | PowerStore, PowerScale, Unity, VMAX/PowerMax | Array latency, capacity, IOPS, throughput | No |
| 13 | NetApp ONTAP | 3.x | Clusters, SVMs, Volumes, Aggregates, LUNs | Volume latency, aggregate capacity, snapshot reserve | No |
| 14 | Pure Storage | 2.x | FlashArray, FlashBlade, Volumes | Array latency, capacity, data reduction ratio | No |
| 15 | HPE | 2.x | 3PAR/Primera, Nimble, Synergy, ProLiant | Array performance, blade health, enclosure power | No |
| 16 | Cisco UCS | 3.x | Fabric Interconnects, Blades, Rack Units, Service Profiles | Fabric uplink utilization, blade faults, power draw | No |
| 17 | OS: Windows | 8.x | Windows Servers (WMI-based) | CPU, Memory, Disk, Network, Services, Processes | No |
| 18 | OS: Linux | 8.x | Linux Servers (SSH-based) | CPU, Memory, Disk, Network, top processes | No |
| 19 | SNMP | 5.x | Generic SNMP-enabled devices (switches, routers, UPS) | Interface traffic, device uptime, OID-based custom metrics | No |
| 20 | Active Directory | 3.x | Domain Controllers, Sites, Replication | Replication latency, LDAP response time, DC availability | No |
| 21 | SQL Server | 4.x | SQL Instances, Databases, Always On Availability Groups | Query latency, buffer cache hit ratio, log growth | No |
| 22 | Oracle Database | 3.x | Oracle Instances, Tablespaces, ASM Disk Groups | Tablespace usage, session counts, wait events | No |
| 23 | Ping | 8.x | Any IP-reachable device | ICMP availability, round-trip latency, packet loss | No |
| 24 | Log Insight | 8.x | Operations for Logs integration | Log event counts, ingestion rate | Yes |
| 25 | Telegraf Agent | 8.x | Any system running Telegraf (push-based) | Custom metrics via Telegraf input plugins | No |
| 26 | Kubernetes | 2.x | Kubernetes Clusters, Namespaces, Nodes, Pods, Containers | Pod status, container resource usage, node conditions | No |
| 27 | Service Discovery | 8.x | Application dependency mapping | Service relationships, communication flows, port mappings | Yes |
For technologies not covered by existing management packs, VCF Operations includes a no-code development environment for building custom adapters.
Navigate to: Administration → Integrations → Management Pack Builder
Supported Input Methods:
| Input Type | Description | Use Case |
|---|---|---|
| REST API | Define endpoints, authentication, JSON path mappings | Custom web applications, SaaS platforms, IoT APIs |
| SNMP MIB | Import MIB files and map OIDs to metrics | Legacy network devices, industrial equipment |
| Script-Based | Python or PowerShell scripts that output metrics in a defined format | Internal tools, proprietary systems, complex collection logic |
Development Workflow:
Tip: Start with the REST API input type for most modern applications. Define a health-check endpoint first to validate connectivity, then expand to detailed metrics. Use the built-in Test button at each stage to validate collection before exporting.
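For the Script-Based input type, your script prints metrics in a structured form that you then map inside the builder. The following Python sketch is purely illustrative: the JSON shape (the `object` and `metrics` keys) and the object name `internal-app-01` are assumptions for the example, not a format mandated by Management Pack Builder.

```python
import json

def collect():
    """Gather metrics from an internal tool (stubbed here with static values)."""
    # In a real script this would query your proprietary system.
    return {
        "object": "internal-app-01",      # object the metrics belong to
        "metrics": {
            "queue_depth": 12,            # current work-queue length
            "worker_count": 4,            # active worker processes
            "error_rate_percent": 0.5,    # errors per 100 requests
        },
    }

if __name__ == "__main__":
    # Emit one JSON document per collection cycle for the adapter to parse.
    print(json.dumps(collect()))
```

Keeping the script's output machine-parseable (one JSON document per run) makes the field mapping in the builder straightforward.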
In addition to Broadcom-published management packs, several vendors produce and support their own packs for VCF Operations.
| Vendor | Management Pack | Monitored Technology | Key Capabilities |
|---|---|---|---|
| Dell Technologies | OpenManage for VCF Operations | PowerEdge server hardware via iDRAC | Hardware health (fans, PSUs, RAID), firmware inventory, warranty status, thermal monitoring |
| NVIDIA | vGPU Management Pack | NVIDIA vGPU-enabled hosts and VMs | GPU utilization %, GPU memory usage, temperature, encoder/decoder sessions, frame buffer |
| Rubrik | Rubrik Management Pack | Rubrik CDM and Polaris | Backup job success/failure rates, SLA compliance percentage, storage consumption trends, archive status |
| Zerto | Zerto Management Pack | Zerto Virtual Replication | VPG replication health, RPO status, journal size, failover test history, bandwidth consumption |
Note: Third-party management packs follow their own release cadence independent of VCF Operations versions. Always verify compatibility with your VCF Operations version before installing. Check the vendor's compatibility matrix or release notes.
Once VCF Operations is deployed and configured, ongoing maintenance ensures the platform remains healthy, performant, and current. This chapter covers the operational tasks that every VCF Operations administrator must master.
All VCF Operations appliance logs reside on the appliance filesystem. The following table identifies the critical log files, their paths, and their purposes.
| Log File | Path | Purpose |
|---|---|---|
| Analytics | `/storage/log/vcops/analytics.log` | Analytics engine processing — capacity calculations, forecasting, anomaly detection |
| Collector | `/storage/log/vcops/collector.log` | Data collection framework — adapter scheduling, metric ingestion |
| API / UI | `/storage/log/vcops/web/catalina.out` | Tomcat application server — REST API requests, UI errors |
| CASA | `/storage/log/vmware/casa/casa.log` | Cluster management — node join/leave, role assignment, slice configuration |
| GemFire | `/storage/log/vcops/gemfire/gemfire.log` | Distributed cache — inter-node data replication, partition management |
| vPostgres | `/storage/log/vmware/vpostgres/postgresql.log` | PostgreSQL database — query errors, connection issues, replication |
| Adapter (per-adapter) | `/storage/log/vcops/adapters/<adapter-name>/` | Individual adapter logs — collection errors, connectivity issues |
| VAMI | `/var/log/vmware/` | VMware Appliance Management Interface — appliance configuration changes |
| PAK Manager | `/storage/log/vcops/pakManager.log` | PAK file installation, upgrade, and management pack deployment |
| Suiteapi | `/storage/log/vcops/web/suiteapi.log` | Suite API specific request/response logging |
Tip: When troubleshooting, start with the most specific log. If a particular adapter is failing, check its log under `/storage/log/vcops/adapters/` first. Escalate to `collector.log` only if the adapter log does not reveal the issue.
Log files can grow substantially in active environments, particularly when debug logging is enabled. Use the following procedures to reclaim disk space safely.
Check current disk usage:
df -h /storage/log
du -sh /storage/log/vcops/*
Truncate an active log file (preserves file handle):
truncate -s 0 /storage/log/vcops/analytics.log
Remove old rotated log archives:
find /storage/log -name "*.gz" -mtime +30 -delete
find /storage/log -name "*.log.*" -mtime +30 -delete
Check for core dumps consuming space:
du -sh /storage/core/
# If core dumps are present and no longer needed:
rm -f /storage/core/core.*
Warning: Never use `rm` on active log files (e.g., `rm analytics.log`). The process holding the file descriptor will continue writing to the deleted inode, consuming disk space invisibly. Always use `truncate` to safely zero out an active log file while preserving the file handle.
Warning: If log growth is persistent, investigate the root cause (e.g., a failing adapter retrying every 5 seconds, debug logging left enabled). Truncating logs without addressing the cause is a temporary fix.
Backup Configuration:
Step 1. Navigate to Administration → Backup/Restore.
Step 2. Configure the backup destination:
- Destination path (e.g., nfs://fileserver.corp.local/backups/vcfops)

Step 3. Set the backup schedule:
| Setting | Recommendation |
|---|---|
| Frequency | Daily |
| Time | 02:00 (during low-activity window) |
| Retention | 7 backups (1 week of daily backups) |
Step 4. Select backup content:
| Option | Includes | Size Impact |
|---|---|---|
| All (Configuration + Data) | Cluster config, policies, dashboards, alerts, views, reports, custom groups, supermetrics, AND historical metric data | Large (potentially hundreds of GB) |
| Configuration Only | Everything except historical metric data | Small (typically < 5 GB) |
Step 5. Click Save to activate the schedule. For an immediate backup, click Backup Now.
Restore Procedure:
Important: You cannot restore a backup from a newer version to an older version. The target appliance must be the same version or newer than the backup source.
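The version rule can be checked mechanically before attempting a restore. A minimal sketch with hypothetical helper names (`parse_version`, `can_restore`); it assumes dotted numeric versions such as 8.18.2.

```python
def parse_version(v):
    """'8.18.2' -> (8, 18, 2) so versions compare correctly as tuples."""
    return tuple(int(part) for part in v.split("."))

def can_restore(backup_version, target_version):
    """Restore is allowed only onto the same version or a newer one."""
    return parse_version(target_version) >= parse_version(backup_version)
```

Tuple comparison avoids the classic string-comparison trap where "8.9" would sort after "8.18".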
VCF Operations generates self-signed internal certificates during deployment. For production environments, replace these with certificates signed by your enterprise Certificate Authority (CA).
Current Certificate Status: Navigate to Administration → Certificates to view the current certificate details, including issuer, subject, expiration date, and thumbprint.
Supported Formats:
| Format | Description |
|---|---|
| PEM | Base64-encoded certificate and private key in separate files (.pem, .crt, .key) |
| PFX / PKCS12 | Binary format containing certificate chain and private key in a single file (.pfx, .p12) |
Steps to Replace the Certificate:
Step 1. Generate a Certificate Signing Request (CSR) from VCF Operations, or prepare a PEM certificate chain externally.
Step 2. Upload the signed certificate and private key:
Step 3. Click Apply. VCF Operations validates the certificate chain, verifies the private key matches, and restarts services automatically. Expect 5–10 minutes of downtime during the service restart.
Warning: Ensure the certificate's Subject Alternative Names (SANs) include the FQDN of every node in the cluster and the cluster VIP (if using HA/CA). Missing SANs will cause inter-node communication failures.
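Before applying a new certificate, you can verify its SAN list against the cluster inventory offline. A minimal sketch, assuming you have already extracted the SAN entries (for example with `openssl x509 -noout -text`); wildcard entries are matched one label deep only, mirroring standard TLS matching.

```python
def covers(san_entries, fqdn):
    """True if fqdn matches a SAN entry exactly or via a one-label wildcard."""
    fqdn = fqdn.lower()
    for entry in (e.lower() for e in san_entries):
        if entry == fqdn:
            return True
        # '*.corp.local' matches 'node1.corp.local' but not 'a.b.corp.local'
        if (entry.startswith("*.")
                and fqdn.split(".", 1)[-1] == entry[2:]
                and fqdn.count(".") == entry.count(".")):
            return True
    return False

def missing_sans(san_entries, cluster_fqdns):
    """Return every cluster FQDN (nodes plus VIP) the certificate fails to cover."""
    return [f for f in cluster_fqdns if not covers(san_entries, f)]
```

Feed `missing_sans` the FQDN of every node plus the cluster VIP; an empty result means the certificate is safe to apply.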
Regular password rotation is a security best practice and may be required by organizational policy.
Via CLI (SSH to the VCF Operations appliance):
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
password change --user admin
You will be prompted for the current password and the new password.
Via VCF Fleet Manager (SDDC Manager integration):
Rotation Schedule Recommendations:
| Account | Recommended Interval | Notes |
|---|---|---|
| admin (UI) | Every 90 days | Primary administrative account |
| root (SSH) | Every 90 days | Appliance OS-level access |
| maintenanceAdmin | Every 90 days | Used for cluster maintenance operations |
| Adapter credentials | Every 90 days or per policy | Service accounts connecting to vCenter, NSX, etc. |
Important: After rotating adapter credentials (e.g., the vCenter service account password), update the corresponding credential in Administration → Integrations → Accounts → Edit Credential. Failure to do so will cause data collection to stop.
VCF Operations upgrades are delivered as PAK files and follow a rolling upgrade process that minimizes downtime in HA and CA deployments.
Phase 1 — Pre-Upgrade Checklist:
| Task | Command / Location | Purpose |
|---|---|---|
| Verify current version | Administration → Cluster Management | Confirm starting version |
| Take full backup | Administration → Backup/Restore → Backup Now | Rollback safety net |
| Check compatibility matrix | Broadcom compatibility guide | Ensure management packs are compatible with target version |
| Download upgrade PAK | Broadcom support portal | Obtain the upgrade binary |
| Snapshot all nodes | vCenter → Right-click VM → Snapshots → Take Snapshot | Quick rollback mechanism |
| Verify NTP sync | `ntpq -p` on each node | Prevent time-skew issues during upgrade |
| Check disk space | `df -h` on each node | Ensure `/storage` has > 20% free |
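The disk-space row of the checklist can be automated. An illustrative standard-library helper, not a Broadcom tool; run it on each node and adjust the path and threshold to your environment.

```python
import shutil

def free_fraction(path):
    """Fraction of the filesystem at `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def preflight_disk(path="/storage", minimum=0.20):
    """True if the partition keeps at least `minimum` free, per the checklist."""
    return free_fraction(path) >= minimum
```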
Phase 2 — Upgrade Execution:
Phase 3 — Post-Upgrade Validation:
| Task | How to Verify |
|---|---|
| Cluster status is "Online" | Administration → Cluster Management |
| All nodes show new version | Administration → Cluster Management → Node details |
| Data collection is active | Environment → select any object → verify recent metrics |
| Management packs are functional | Administration → Integrations → Accounts → check status icons |
| Dashboards load correctly | Navigate to several dashboards and verify data |
| Remove VM snapshots | vCenter → Right-click VM → Snapshots → Delete All |
Warning: Do not delete VM snapshots until you have fully validated the upgrade. Snapshots provide the fastest rollback path if issues are discovered. However, do not keep snapshots longer than 72 hours, as they degrade VM performance.
As monitored environments grow, VCF Operations may require additional resources.
Vertical Scaling (Scale Up):
Increase the vCPU and memory allocated to existing nodes.
| OVA Size | vCPU | Memory (GB) | Objects Supported |
|---|---|---|---|
| Small | 4 | 16 | Up to 1,500 |
| Medium | 8 | 32 | Up to 5,000 |
| Large | 16 | 48 | Up to 15,000 |
| Extra Large | 24 | 128 | Up to 30,000 |
To change the size: power off the node, adjust CPU/RAM in vCenter, power on. The analytics engine automatically detects the new resources.
Horizontal Scaling (Scale Out):
Add Data Nodes to distribute the analytics workload across more compute.
Guideline: Add one Data Node for every 10,000 additional objects beyond the primary node's capacity. For environments exceeding 50,000 objects, engage Broadcom Professional Services for architecture review.
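The guideline above reduces to a ceiling calculation. An illustrative helper; the 15,000-object default for the primary node is taken from the Large OVA row in the sizing table and should be adjusted to your actual node size.

```python
import math

def data_nodes_needed(total_objects, primary_capacity=15000, per_node=10000):
    """Data nodes required beyond the primary, per the 1-per-10,000 guideline.

    primary_capacity defaults to the Large OVA's 15,000-object rating;
    change it to match the node size you actually deployed.
    """
    extra = max(0, total_objects - primary_capacity)
    return math.ceil(extra / per_node)
```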
When engaging Broadcom Global Support Services (GSS), a support bundle is typically required.
Via the UI:
Via the CLI (SSH):
/usr/lib/vmware-vcops/support/vrops-support.sh
The script collects logs, configuration files, cluster state, and diagnostic information into a ZIP file located at /storage/log/vcops/support/.
Support Bundle Contents:
| Category | Included Items |
|---|---|
| Logs | All log files from Section 20.1 |
| Configuration | Cluster config, slice configuration, property files |
| Cluster State | Node roles, service status, GemFire partition info |
| System Info | OS version, disk usage, memory usage, process list |
| Thread Dumps | Java thread dumps for analytics and collector services |
Tip: For targeted troubleshooting, you can generate a "lightweight" bundle by specifying only the relevant log categories. This reduces generation time and file size, which speeds up upload to the support ticket.
The following sections document the most frequently encountered issues, their root causes, and step-by-step resolutions.
Symptom: The cluster status on the Administration → Cluster Management page shows "Going Online" for more than 30 minutes without progressing.
Root Cause: The analytics service is failing to start, typically due to a GemFire distributed cache partition conflict or corrupted analytics state.
Resolution:
# Step 1: Check current service status
/usr/lib/vmware-vcops/support/sliceConfiguration.sh --status
# Step 2: Restart the analytics service
service vmware-vcops-analytics restart
# Step 3: Monitor the analytics log for errors
tail -f /storage/log/vcops/analytics.log
If the restart does not resolve the issue, check for GemFire partition conflicts:
grep -i "partition" /storage/log/vcops/gemfire/gemfire.log | tail -20
If partition errors are present, a full cluster restart may be required:
service vmware-vcops stop
# Wait 5 minutes for all services to fully terminate
service vmware-vcops start
Symptom: After requesting the cluster to go offline, the Admin UI becomes unresponsive and the cluster never reaches "Offline" state.
Root Cause: A hung analytics or vPostgres process is preventing graceful shutdown.
Resolution:
# Step 1: Force stop all services
service vmware-vcops stop
# Step 2: Verify all Java processes have terminated
ps aux | grep java
# Step 3: If processes remain, wait 5 minutes then check again
# Do NOT use kill -9 unless absolutely necessary
# Step 4: Start services
service vmware-vcops start
Symptom: Dashboard widgets display "Waiting for Analytics" instead of data. The message persists beyond the normal startup window (15 minutes).
Root Cause: The analytics engine has either crashed, is processing a large backlog, or has encountered an out-of-memory condition.
Resolution:
Check the analytics service status:
service vmware-vcops-analytics status
If the service is stopped, check the log for the cause:
tail -100 /storage/log/vcops/analytics.log
Look for OutOfMemoryError or StackOverflowError in the log. If found, the node likely needs more memory (see Section 20.7 on vertical scaling).
Restart the analytics service:
service vmware-vcops-analytics restart
Symptom: An alert fires indicating that the FSDB (File System Database) partition is running low on disk space. The /storage/db partition is at or above 85% utilization.
Root Cause: Historical metric data has filled the /storage/db partition. This occurs when retention is set too high for the available disk, or when a large number of new objects were added without corresponding disk expansion.
Resolution (in order of preference):
Reduce data retention:
| Data Type | Default | Minimum Recommended |
|---|---|---|
| Real-time (5-min) | 1 day | 1 day |
| Hourly rollup | 30 days | 15 days |
| Daily rollup | 6 months | 3 months |
| Monthly rollup | 13 months | 6 months |
Expand the /storage/db disk:
- In vCenter, add a new virtual disk to the node VM
- Log in to the node's VAMI (https://<node-fqdn>:5480)
- Expand the /storage/db partition to include the new disk

Remove unused management packs:
Symptom: Metric charts show gaps, dashboards display stale data, or the "Last Collection" timestamp for adapters is more than 10 minutes old.
Root Cause: Multiple potential causes — adapter overload, network latency to the target system, expired or invalid credentials, or insufficient collector resources.
Resolution:
Check adapter status:
Check adapter logs:
tail -200 /storage/log/vcops/adapters/<adapter-name>/<adapter-name>.log
Verify credentials: Edit the adapter account and click Validate Connection
Check collector resource usage:
top -bn1 | head -20
free -h
For geographically distant targets: Deploy a Remote Collector at the remote site to reduce collection latency
Symptom: Services fail to start. SSH access still works but commands may produce "No space left on device" errors.
Root Cause: Core dumps, temporary files, or unexpected log files have filled the root (/) partition.
Resolution:
# Step 1: Identify the largest consumers
du -sh /* | sort -rh | head
# Step 2: Common culprits — check and clean
du -sh /storage/core/
rm -f /storage/core/core.*
du -sh /tmp/
# Remove old temp files (careful — do not remove active temp files)
find /tmp -type f -mtime +7 -delete
# Step 3: Check for unexpected log files in /var/log
du -sh /var/log/*
Warning: If the root partition is completely full (100%), services cannot write PID files or temp files and will refuse to start. In extreme cases, you may need to boot from a rescue ISO to clear space.
VCF Operations provides several command-line tools for administration and troubleshooting.
| Tool | Command | Purpose | Common Usage |
|---|---|---|---|
| vrops-status | `vrops-status` | Quick cluster health check | Verify all services are running, check node roles |
| OPS-CLI | `$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py` | Full CLI management | Adapter management, metric queries, object searches, password changes |
| Slice Configuration | `/usr/lib/vmware-vcops/support/sliceConfiguration.sh` | Cluster slice management | Check slice status, force slice rebalancing |
| Support Script | `/usr/lib/vmware-vcops/support/vrops-support.sh` | Support bundle generation | Generate log bundles for Broadcom support |
| Service Control | `service vmware-vcops {start\|stop\|restart\|status}` | Service management | Start, stop, or restart the entire VCF Operations stack |
| Platform CLI | `vcops-cli` | Platform-level operations | License management, node management |
OPS-CLI Examples:
# List all adapter instances
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
adapter list
# Search for an object by name
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
object search --name "web-server-01"
# Query a metric for a specific object
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/opscli/ops-cli.py \
metric query --objectId <uuid> --metricKey "cpu|usage_average"
Warning: Direct Cassandra access via `cqlsh localhost 9042` is available for advanced troubleshooting but is unsupported by Broadcom. Modifying data in Cassandra directly can corrupt the FSDB and render the cluster inoperable. Use only under explicit guidance from Broadcom support.
The Suite API is the RESTful interface for programmatic access to all VCF Operations functionality. It enables integration with ITSM tools, custom portals, automation pipelines, and third-party systems.
Authentication:
# Acquire a token
curl -k -X POST \
"https://<vrops-fqdn>/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-d '{"username":"admin","authSource":"local","password":"<password>"}'
The response returns a JSON object containing the token field. Use this token in subsequent requests.
Token Usage:
Include the token in the Authorization header for all API calls:
Authorization: vRealizeOpsToken <token>
Tokens expire after 6 hours by default. Acquire a new token when the current one expires.
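For long-running automation, track token age and re-acquire shortly before the six-hour expiry. A minimal sketch; `TokenCache` and its injected `acquire_fn` (whatever function performs the POST to `/auth/token/acquire` in your tooling) are hypothetical names, not part of the Suite API itself.

```python
import time

TOKEN_LIFETIME_S = 6 * 3600  # default Suite API token lifetime

class TokenCache:
    """Caches a Suite API token and refreshes it shortly before expiry."""

    def __init__(self, acquire_fn, margin_s=300, clock=time.time):
        self._acquire = acquire_fn   # callable returning a fresh token string
        self._margin = margin_s      # refresh this many seconds early
        self._clock = clock          # injectable for testing
        self._token = None
        self._issued_at = 0.0

    def token(self):
        age = self._clock() - self._issued_at
        if self._token is None or age >= TOKEN_LIFETIME_S - self._margin:
            self._token = self._acquire()
            self._issued_at = self._clock()
        return self._token

    def auth_header(self):
        """Header dict for subsequent Suite API calls."""
        return {"Authorization": "vRealizeOpsToken " + self.token()}
```

The injected clock makes the expiry logic testable without waiting six hours.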
Base URL:
https://<vrops-fqdn>/suite-api/api/
Key Endpoint Categories:
| Category | Base Path | Operations |
|---|---|---|
| Resources | `/resources` | List, search, create, delete objects; query relationships |
| Alerts | `/alerts` | List, query, update, cancel alerts |
| Symptoms | `/symptoms` | List, create, delete symptom definitions |
| Supermetrics | `/supermetrics` | List, create, update, delete supermetric formulas |
| Policies | `/policies` | List, create, apply, export, import policies |
| Adapters | `/adapters` | List adapter kinds, instances; start/stop collection |
| Credentials | `/credentials` | List, create, update, delete credential instances |
| Reports | `/reports` | List templates, generate reports, download results |
| Dashboards | `/dashboards` | List, import, export, share dashboards |
| Auth | `/auth` | Token acquisition, token release, user management |
| Collector Groups | `/collectorgroups` | List, create, assign collectors |
| Custom Groups | `/customgroups` | List, create, update, delete custom groups |
| Metric Keys | `/resources/{id}/stats` | Query metric data for specific resources |
Interactive API Documentation:
VCF Operations ships with embedded Swagger UI documentation:
https://<vrops-fqdn>/suite-api/doc/swagger-ui.html
The Swagger UI provides a complete, interactive reference for all API endpoints, including request/response schemas, parameter descriptions, and the ability to execute API calls directly from the browser.
Common API Workflow Example — Export All Critical Alerts:
# Step 1: Acquire token
TOKEN=$(curl -sk -X POST \
"https://vrops.corp.local/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-d '{"username":"admin","authSource":"local","password":"P@ssw0rd"}' \
| python -c "import sys,json; print(json.load(sys.stdin)['token'])")
# Step 2: Query critical alerts
curl -sk \
"https://vrops.corp.local/suite-api/api/alerts?status=ACTIVE&criticality=CRITICAL" \
-H "Authorization: vRealizeOpsToken $TOKEN" \
-H "Accept: application/json" | python -m json.tool
# Step 3: Release token when done
curl -sk -X POST \
"https://vrops.corp.local/suite-api/api/auth/token/release" \
-H "Authorization: vRealizeOpsToken $TOKEN"
Best Practice: Always release tokens when your automation workflow completes. Each VCF Operations instance supports a limited number of concurrent API sessions. Unreleased tokens count against this limit until they expire naturally.
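The acquire/use/release pattern maps naturally onto a context manager, which guarantees the release call even if the workflow fails midway. A sketch with the HTTP transport injected as a callable so it can be exercised offline; `SuiteApiSession` is a hypothetical wrapper, not a Broadcom SDK class, and a real transport would issue the HTTPS requests shown in the curl example above.

```python
class SuiteApiSession:
    """Acquire a Suite API token on entry, release it on exit — even on error."""

    def __init__(self, transport, base_url, username, password):
        self._transport = transport  # callable(method, url, body, headers) -> dict
        self._base = base_url.rstrip("/")
        self._creds = {"username": username, "authSource": "local",
                       "password": password}
        self.token = None

    def __enter__(self):
        resp = self._transport("POST", self._base + "/auth/token/acquire",
                               self._creds, {})
        self.token = resp["token"]
        return self

    def __exit__(self, exc_type, exc, tb):
        self._transport("POST", self._base + "/auth/token/release", None,
                        {"Authorization": "vRealizeOpsToken " + self.token})
        return False  # propagate any exception from the with-block

    def get(self, path):
        return self._transport("GET", self._base + path, None,
                               {"Authorization": "vRealizeOpsToken " + self.token})
```

Because release happens in `__exit__`, abandoned sessions no longer count against the concurrent-session limit.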
VCF Operations for Logs has undergone several name changes since its inception, reflecting broader shifts in VMware's product portfolio and, ultimately, the Broadcom acquisition. Understanding the naming timeline is essential when referencing older documentation, knowledge-base articles, and community posts.
| Year | Product Name | Context |
|---|---|---|
| 2013 | VMware vCenter Log Insight | Original release; tightly associated with vCenter Server |
| 2016 | vRealize Log Insight | Rebranded under the vRealize management suite umbrella |
| 2022 | VMware Aria Operations for Logs | Part of the VMware Aria brand unification across all management products |
| 2024 | VCF Operations for Logs | Broadcom acquisition; product folded into the VCF (VMware Cloud Foundation) brand |
Note: Many CLI tools, OVA filenames, internal service names, and API endpoints still reference `loginsight` or `vrli`. Do not be alarmed when you encounter these legacy identifiers — they are functionally equivalent to the current product.
Throughout this handbook, the terms Operations for Logs, OpsForLogs, and the abbreviation vRLI may be used interchangeably where historical context or brevity demands it.
VCF Operations (metrics) and VCF Operations for Logs (logs) are complementary products. They are deployed separately, serve distinct analytical purposes, and store fundamentally different data types. The following table summarizes the key differences.
| Aspect | VCF Operations | VCF Operations for Logs |
|---|---|---|
| Data Type | Metrics, properties, super metrics, events | Log messages (syslog, agent-collected file logs) |
| Analysis Model | Time-series statistical analysis, machine-learning anomaly detection, capacity modeling | Full-text search, pattern matching, ML-based intelligent grouping |
| Alerting | Threshold-based — triggers when a metric value crosses a defined boundary | Pattern-based — triggers when a log message matches a content rule or frequency condition |
| Storage Engine | FSDB (proprietary time-series database) | Apache Cassandra + proprietary full-text index |
| Primary Use Cases | Performance monitoring, capacity planning, cost analysis, what-if modeling | Troubleshooting, root-cause analysis, audit trail, compliance reporting |
| Retention Model | Configurable retention policies (weeks to months of metric data) | Index partitions managed by time-based buckets (days to months of log data) |
| Integration Direction | Launches-in-context to Operations for Logs for correlated log investigation | Launches-in-context to Operations for metric correlation |
Best Practice: Deploy both products and configure the bidirectional integration between them. When Operations detects an anomaly on an object, the administrator can pivot directly into Operations for Logs to examine the logs from that object during the anomaly window — dramatically reducing mean time to resolution.
Operations for Logs follows a scale-out clustered architecture built on the following components:
Cluster Specifications:
Data Flow:
Log Sources (vCenter, ESXi, NSX, Agents, Syslog Devices)
                         │
                         ▼
                 VIP Address (ILB)
                         │
                         ▼
                   ┌─────┴─────┐
                   │  Primary  │──── Worker 1 ──── Worker 2 ──── Worker N
                   │   Node    │
                   └───────────┘
                         │
                         ▼
Ingestion Pipeline → Parsing → Field Extraction → Indexing
                         │
                         ▼
Cassandra Index + Full-Text Index (per-node storage)
                         │
                         ▼
Query Engine (distributed, merges results from all nodes)
Select the cluster size based on expected daily ingestion volume and query concurrency requirements.
| Cluster Size | Nodes | Estimated Ingestion Rate | Typical Use Case |
|---|---|---|---|
| Small | 1 (standalone) | ~15 GB/day | Lab, proof-of-concept, developer environments |
| Medium | 3 | ~45 GB/day | Small production (single VCF instance, <200 VMs) |
| Large | 6 | ~90 GB/day | Medium production (multi-cluster, 200–1,000 VMs) |
| Extra Large | 12+ | ~180+ GB/day | Large enterprise (multi-site, 1,000+ VMs, compliance-heavy) |
Warning: These figures assume default field extraction and content packs. Heavy use of custom regex extraction, large numbers of active alerts, or complex dashboards with many concurrent users will reduce effective ingestion capacity. Always monitor the Ingestion Rate and Query Latency dashboards after deployment and add worker nodes proactively if ingestion approaches 80% of rated capacity.
Tip: For VCF environments, a 3-node medium cluster is the recommended starting point for production. This provides both high availability (the cluster tolerates the loss of one node) and sufficient headroom for growth.
The Operations for Logs OVA offers three deployment sizes. Select the size at deployment time — it cannot be changed later without redeployment.
| Size | vCPUs | Memory (GB) | Disk (GB) | Estimated Ingestion Rate |
|---|---|---|---|---|
| Small | 4 | 8 | 530 | ~15 GB/day |
| Medium | 8 | 16 | 1,060 | ~30 GB/day |
| Large | 16 | 32 | 2,080 | ~45 GB/day |
Important: Disk sizes listed are total, including OS, application, and log index storage. The index partition consumes the majority of disk space. When planning retention, remember that longer retention windows require proportionally more disk. If the built-in disk is insufficient, you can attach additional VMDK volumes post-deployment and configure them as additional storage partitions.
Recommendation: For production deployments, always select Medium or Large. The Small size is appropriate only for labs and proof-of-concept environments.
Deploy the Operations for Logs OVA through the vSphere Client using the following procedure.
Prerequisites:
- The Operations for Logs OVA file downloaded from the Broadcom support portal (typically named VMware-vRealize-Log-Insight-*.ova or the renamed VCF equivalent)

Procedure:
Step 1. In the vSphere Client, right-click the target cluster and select Deploy OVF Template, then browse to the downloaded OVA.
Step 2. Enter a VM name (e.g., vrli-primary-01) and select the target inventory folder. Click Next.
Step 3. Select the compute resource and datastore, and choose the deployment size.
Step 4. Select the destination port group for the management network.
Step 5. On the Customize Template page, enter the hostname (e.g., vrli-primary-01.lab.local), IP address, netmask (e.g., 255.255.255.0), gateway, DNS servers, and DNS search domain (e.g., lab.local).
Step 6. Review the summary and click Finish. Power on the VM when deployment completes.

Warning: Do not snapshot the VM during initial boot. Allow all services to fully start before taking the first snapshot.
After the first boot completes, access the web-based configuration wizard to finalize setup.
https://<node-fqdn> (or https://<ip-address>).Step 1 — Admin Password:
Set the password for the built-in admin user account.

Step 2 — License Key:
Step 3 — General Configuration:
Step 4 — CEIP:
Step 5 — NTP Configuration:
Critical: Accurate time synchronization is essential for log correlation. Ensure all Operations for Logs nodes, vCenter servers, and ESXi hosts share the same NTP source.
Step 6 — SMTP Configuration:
Configure the SMTP server for outbound email alerts and set the sender address (e.g., vrli-alerts@lab.local).

Step 7 — SSL Certificate:
Step 8 — Finish:
Click Finish to complete the wizard, then log in with the username admin and the password set in Step 1.

A single standalone node is suitable for labs, but production environments require a cluster of at least three nodes for high availability, ingestion scaling, and query performance.
Deploy each worker node from the same OVA, then browse to https://<worker-fqdn>. In the setup wizard, choose Join Existing Deployment and enter the primary node's FQDN, then approve the pending join request on the primary node. Repeat for each additional worker.

Note: Worker nodes do not require independent license keys. The license is managed centrally on the primary node and applies cluster-wide.
After adding worker nodes, configure the Integrated Load Balancer and Virtual IP to provide a single entry point for all clients.
Step 1. Log in to the primary node at https://<primary-fqdn>.
Step 2. Open the cluster configuration page and add a new Virtual IP, providing its FQDN and IP address.
Step 3. Verify by browsing to https://<vip-fqdn> — the Operations for Logs UI should load.
Step 4. Update all log sources (syslog destinations and agent liagent.ini files) to point to the VIP address instead of the primary node address.

Warning: If you do not configure a VIP, log sources pointing to the primary node will not benefit from load balancing, and the primary node becomes a single point of failure for ingestion.
In VCF 9.0 and later, Operations for Logs can be deployed through SDDC Manager's Fleet Management capability, which automates the entire lifecycle.
Tip: Fleet Manager also handles future upgrades, certificate rotation, and backup scheduling for Operations for Logs, reducing ongoing administrative overhead.
Operations for Logs listens on several ports for log ingestion. The following table summarizes the default ports, protocols, and their intended use.
| Port | Protocol | Transport | Use Case | Notes |
|---|---|---|---|---|
| 514 | Syslog | TCP | General syslog ingestion | Unencrypted; most common for internal networks |
| 514 | Syslog | UDP | General syslog ingestion | Unencrypted; no delivery guarantee; not recommended for production |
| 6514 | Syslog | TCP + TLS | Secure syslog ingestion | Requires TLS certificate configuration on both sender and receiver |
| 1514 | Syslog | TCP + SSL | ESXi host log forwarding | Automatically configured when vSphere integration is enabled |
| 9000 | CFAPI | HTTP | Agent-based ingestion | VMware Log Insight agent protocol; unencrypted |
| 9543 | CFAPI | HTTPS | Secure agent-based ingestion | VMware Log Insight agent protocol; certificate-secured |
Best Practice: Use TCP-based protocols (514/TCP, 6514/TCP, 9543/TCP) for all production log sources. UDP-based syslog (514/UDP) does not guarantee delivery and can silently drop messages under load. For compliance-sensitive environments, use TLS-encrypted ports (6514/TCP for syslog, 9543/TCP for agents).
You can verify which ports are actively listening by navigating to Administration → Configuration → Ports in the Operations for Logs UI, or by running the following on the appliance:
netstat -tlnp | grep -E '514|9000|9543'
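From a log source's side, the same ports can be probed before cutting traffic over. The following is a small sketch using bash's /dev/tcp; the VIP hostname is a placeholder for your environment.

```shell
# Probe whether a TCP ingestion port answers on the cluster VIP.
# vrli-vip.lab.local is a placeholder; substitute your own VIP FQDN.
port_open() {
  # /dev/tcp is a bash feature, so invoke bash explicitly.
  timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null && echo open || echo closed
}
for p in 514 6514 9543; do
  echo "$p: $(port_open vrli-vip.lab.local "$p")"
done
```

A "closed" result for a port that netstat shows as listening on the appliance usually points at a firewall between the source and the cluster.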
vCenter Server generates critical logs including vpxd, vpxd-svcs, vmware-sps, vmafdd, and many others. Forwarding these to Operations for Logs provides centralized visibility into vCenter operations.
1. Open the vCenter Server Management Interface (VAMI) at https://<vcenter-fqdn>:5480.
2. Log in with the root account.
3. Add a syslog forwarding target of <vrli-vip-fqdn> (the VIP address of your Operations for Logs cluster) on port 514.
4. In Operations for Logs, verify ingestion with the query source = <vcenter-fqdn>. Logs should appear within 1–2 minutes.

If VAMI access is unavailable, configure syslog forwarding from the vCenter shell:
# SSH to vCenter as root
# List current syslog configuration
/usr/lib/vmware-syslog/bin/get-rsyslog-config.sh
# After setting the remote syslog target, restart the rsyslog service
/usr/lib/vmware-vmon/vmon-cli --restart rsyslog
Note: In vCenter 8.x and later, syslog configuration is managed through the VAMI. CLI-based configuration methods vary between versions. Always consult the release-specific documentation.
ESXi hosts produce some of the most valuable logs in a VMware environment — vmkernel, hostd, vpxa, fdm, and vobd among others. There are three methods to configure ESXi syslog forwarding.
This is the simplest method and ensures all hosts managed by a vCenter are automatically configured.
Tip: The vSphere integration also pulls ESXi events and tasks, enabling richer correlation between log messages and vCenter-reported events.
Use this method when vSphere integration is not desired or when configuring individual hosts outside of vCenter management.
# SSH to the ESXi host
esxcli system syslog config set --loghost=tcp://<vrli-vip>:514
esxcli system syslog reload
# Verify the configuration
esxcli system syslog config get
The --loghost parameter supports multiple targets separated by commas:
esxcli system syslog config set --loghost=tcp://vrli-vip.lab.local:514,tcp://backup-syslog.lab.local:514
Important: If the ESXi firewall is enabled, ensure the syslog firewall rule is open:

esxcli network firewall ruleset set -r syslog -e true
esxcli network firewall refresh
For large environments, use PowerCLI to configure all hosts at once:
# Connect to vCenter
Connect-VIServer -Server vcenter.lab.local
# Set syslog target for all hosts
$logHost = "tcp://vrli-vip.lab.local:514"
Get-VMHost | ForEach-Object {
Write-Host "Configuring syslog on $($_.Name)..."
Set-VMHostSysLogServer -VMHost $_ -SysLogServer $logHost
$esxcli = Get-EsxCli -VMHost $_ -V2
$esxcli.system.syslog.reload.Invoke()
}
# Verify configuration
Get-VMHost | ForEach-Object {
$esxcli = Get-EsxCli -VMHost $_ -V2
$config = $esxcli.system.syslog.config.get.Invoke()
Write-Host "$($_.Name): $($config.RemoteHost)"
}
Warning: When using both the vSphere integration (Method 1) and manual configuration (Method 2 or 3) simultaneously, you may receive duplicate log entries. Choose one method and apply it consistently.
NSX Manager and NSX Edge nodes generate logs critical for network troubleshooting, security event analysis, and compliance auditing. Configure log forwarding from the NSX Manager UI.
Procedure:
1. Log in to the NSX Manager UI (https://<nsx-manager-fqdn>).
2. Set the syslog server to <vrli-vip-fqdn>.
3. Set the port to 514 (for unencrypted TCP) or 6514 (for TLS).
4. Set the protocol to TCP or LI-TLS (Log Insight TLS).
5. Set the log level to INFO (captures Info, Warning, Error, Critical, and Emergency).

NSX Edge Nodes:
In some NSX deployments, Edge transport nodes may require separate syslog configuration:
Note: NSX Distributed Firewall (DFW) logs are generated on the ESXi hosts where the DFW rules are enforced. These logs are forwarded via the ESXi syslog configuration (Section 23.3), not via the NSX Manager syslog configuration.
Verification:
In Operations for Logs, search for:
appname = "nsxmanager" OR appname = "nsx-edge"
NSX logs should appear within 1–2 minutes of configuration.
The Operations for Logs agent (also known as the Log Insight agent or liagent) is a lightweight process that collects log files from Windows and Linux operating systems and forwards them to Operations for Logs via the CFAPI protocol.
1. Run the Windows installer (VMware-Log-Insight-Agent-*.msi).
2. When prompted, enter the target server: <vrli-vip-fqdn>.
3. Choose the ingestion port: 9543 (HTTPS) or 9000 (HTTP).
4. Complete the wizard. The service (VMware Log Insight Agent) starts automatically and begins forwarding Windows Event Logs.

For automated deployments via SCCM, GPO, or scripting:
msiexec /i VMware-Log-Insight-Agent-x64.msi /qn ^
SERVERHOST=vrli-vip.lab.local ^
SERVERPROTOCOL=cfapi ^
SERVERPORT=9543 ^
/l*v C:\temp\liagent-install.log
Tip: Add SERVICEACCOUNT=domain\svcaccount and SERVICEPASSWORD=P@ssw0rd parameters if the agent service needs to run under a domain account to access specific log file paths.
# Copy the RPM to the target server
sudo rpm -i VMware-Log-Insight-Agent-*.rpm
# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN
# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent
# Verify the agent is running
sudo systemctl status liagent
# Copy the DEB package to the target server
sudo dpkg -i VMware-Log-Insight-Agent-*.deb
# Edit the agent configuration
sudo vi /var/lib/loginsight-agent/liagent.ini
# Set the [server] section hostname to the VIP FQDN
# Start and enable the agent service
sudo systemctl start liagent
sudo systemctl enable liagent
# Verify the agent is running
sudo systemctl status liagent
Important: The agent collects /var/log/messages and /var/log/syslog by default. Additional log directories must be configured explicitly in liagent.ini (see Section 23.6).
liagent.ini)The agent configuration file liagent.ini controls all aspects of agent behavior — server connectivity, log file collection, field tagging, and debug settings. The file is located at:
- Linux: /var/lib/loginsight-agent/liagent.ini
- Windows: C:\ProgramData\VMware\Log Insight Agent\liagent.ini

; ─── Server Connection ───
[server]
hostname=vrli-vip.lab.local
port=9543
proto=cfapi
ssl=yes
ssl_accept_any=yes ; Lab only; in production set to no and configure ssl_ca_path
; ssl_ca_path=/etc/pki/tls/certs/ca-bundle.crt ; Use for strict CA validation
; ─── Default Syslog Collection (Linux) ───
[filelog|syslog]
directory=/var/log
include=*.log;messages;syslog
; ─── Custom Application Logs ───
[filelog|custom_app]
directory=/opt/myapp/logs
include=*.log
exclude=debug-*.log
parser=auto
tags={"appname":"myapp","env":"production","tier":"web"}
; ─── Apache Access Logs ───
[filelog|apache_access]
directory=/var/log/httpd
include=access_log*
parser=clf
; ─── Windows Event Log (Windows only) ───
[winlog|application]
channel=Application
[winlog|system]
channel=System
[winlog|security]
channel=Security
; ─── Agent Logging ───
[logging]
debug_level=0
; 0=Off, 1=Error, 2=Warning, 3=Info, 4=Debug
; Set to 4 only for troubleshooting; generates significant local log volume
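When auditing a deployed agent, the effective [server] settings can be read back from the file. A small read-only helper sketch (the path in the usage comment is the default Linux location):

```shell
# Print the connection-related keys from the [server] section of a
# liagent.ini file. Pass the file path as the first argument.
liagent_server_config() {
  # sed limits output to the [server] section; grep keeps connection keys.
  sed -n '/^\[server\]/,/^\[/p' "$1" | \
    grep -E '^(hostname|port|proto|ssl|ssl_accept_any)='
}
# Usage (Linux default path):
#   liagent_server_config /var/lib/loginsight-agent/liagent.ini
```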
| Parameter | Description | Default |
|---|---|---|
| hostname | Operations for Logs VIP FQDN or IP | (required) |
| port | Ingestion port | 9543 |
| proto | Protocol (cfapi or syslog) | cfapi |
| ssl | Enable SSL/TLS | yes |
| ssl_accept_any | Accept any server certificate (lab only) | no |
| directory | Log file directory to monitor | (per section) |
| include | Semicolon-separated file patterns to collect | *.log |
| exclude | Semicolon-separated file patterns to skip | (none) |
| tags | JSON key-value pairs attached to every log entry from this section | {} |
| parser | Log parsing mode (auto, clf, csv, or custom regex) | auto |
Instead of editing liagent.ini on every machine, you can push agent configurations centrally from the Operations for Logs UI:
Add the desired [filelog|...] and [winlog|...] sections to the group configuration.

Best Practice: Use Agent Groups for all production agent configuration. This ensures consistency, simplifies changes, and provides a single pane of glass for agent management.
Operations for Logs ships with two content packs installed by default:
- General Syslog: base field extraction (source, appname, facility, severity, text) and a default overview dashboard. This pack handles standard RFC 3164 and RFC 5424 syslog messages.
- VMware vSphere: vSphere-specific extracted fields (vmw_host, vmw_vc_vm_name, vmw_esxi_service, vmw_vc_event_type, etc.)

Note: The vSphere content pack is automatically activated when the vSphere integration is configured (Section 23.3, Method 1). Its extracted fields enable rich, structured queries against ESXi and vCenter logs.
Additional content packs are available from the in-product Marketplace and from the Broadcom download portal. The following table lists commonly used packs.
| Content Pack | Source | Key Features |
|---|---|---|
| VMware NSX | VMware/Broadcom | NSX Manager and Edge log parsing; security event dashboards; DFW rule hit analysis |
| VMware vSAN | VMware/Broadcom | vSAN trace and CMMDS log parsing; health event extraction; rebalance tracking |
| VMware VCF | VMware/Broadcom | SDDC Manager log parsing; lifecycle operation dashboards; compliance event tracking |
| Active Directory | Community/VMware | Windows AD log parsing; authentication success/failure dashboards; account lockout tracking |
| Linux | Community/VMware | /var/log/* parsing; SSH login analysis; cron job tracking; common Linux event fields |
| Palo Alto Networks | Palo Alto/Community | PAN-OS syslog parsing; firewall allow/deny dashboards; threat event correlation |
| F5 BIG-IP | Community | LTM and ASM log parsing; virtual server health dashboards; WAF event analysis |
| Dell EMC | Dell/Community | PowerStore, Unity, VNX storage array log parsing; hardware fault dashboards |
| Cisco | Community | IOS and NX-OS syslog parsing; interface state change tracking; routing event analysis |
Every content pack — whether built-in, marketplace, or custom — is composed of the following components:
Tip: When evaluating a content pack, review the extracted fields first. Fields are the foundation — dashboards and alerts depend on them. If the fields do not match your log format (e.g., because of a firmware version difference), the dashboards will show no data.
From the Marketplace (In-Product):
From a Downloaded File:
1. Download the content pack file (.vlcp extension) from the Broadcom support portal or a community repository.
2. In the UI, use the import option, browse to the .vlcp file and select it.

Warning: Installing a content pack that defines fields with the same names as existing fields will overwrite the existing field definitions. Review field conflicts before installing, especially when mixing marketplace packs with custom-defined fields.
Organizations can bundle their custom fields, dashboards, alerts, and queries into a reusable content pack for distribution across environments or teams.
Procedure:
1. Provide a Name (e.g., Custom - Payment Gateway Logs), a Namespace (unique identifier, e.g., com.mycompany.paymentgw), and a Description.
2. Select the components to include and export them. The result is a .vlcp file that can be imported into other Operations for Logs instances.

Tip: Use a consistent namespace convention (e.g., com.<company>.<application>) to avoid conflicts with VMware or community content packs. Version your content packs semantically (1.0, 1.1, 2.0) to track changes.
Content pack operations are governed by the role-based access control system in Operations for Logs.
| Role | Install / Uninstall | Create / Export | Use Dashboards | Use Queries | Modify Components |
|---|---|---|---|---|---|
| Super Admin | Yes | Yes | Yes | Yes | Yes |
| Admin | Yes | Yes | Yes | Yes | Yes |
| User | No | No | Yes | Yes | No (can create personal copies) |
| View Only | No | No | Yes (read-only) | Yes (read-only) | No |
Best Practice: Assign the User role to operations teams who need to search logs and view dashboards. Reserve Admin for the team responsible for content pack management and platform administration.
The Explore Logs interface is the primary workspace for interactive log investigation in Operations for Logs. Access it by clicking Explore Logs in the main navigation bar at the top of the UI.
The interface consists of:
Operations for Logs supports three query modes, each suited to different analytical needs.
The simplest query mode. Enter keywords or phrases in the query bar, and Operations for Logs searches the full text of all log messages within the selected time range.
error
"connection refused"
authentication failed
Use extracted fields and operators to create precise, structured queries. This mode is more efficient than free-text search because it operates on indexed field values rather than raw text.
vmw_host = esxi01.lab.local
appname = "vpxd" AND severity = "error"
vmw_vc_vm_name = web-server-* AND text CONTAINS "snapshot"
Apply statistical functions to log data to identify trends, volumes, and outliers. Aggregation queries produce charts rather than individual log entries.
# Count events per source over time
COUNT by source
# Average response time by application
AVERAGE(response_time) by appname
# Top 10 sources by error count
COUNT WHERE severity = "error" GROUP BY source ORDER BY COUNT DESC LIMIT 10
| Syntax | Description | Example |
|---|---|---|
| Single keyword | Finds logs containing the word anywhere in the message | error |
| Phrase (quoted) | Finds logs containing the exact phrase | "connection refused" |
| Boolean AND | Both terms must appear | error AND vcenter |
| Boolean OR | Either term must appear | warning OR error |
| Boolean NOT | Excludes logs containing the term | error NOT test |
| Parentheses | Group boolean expressions | (error OR warning) AND vcenter |
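These operators behave like familiar text-filtering pipelines. As a rough offline analogy only (not the product's query engine), the query (error OR warning) AND vcenter maps to chained greps over a sample log:

```shell
# Build a three-line sample log, then express
# (error OR warning) AND vcenter as two chained greps.
cat > /tmp/sample.log <<'EOF'
2026-03-20 vcenter vpxd error: task failed
2026-03-20 esxi01 vmkernel warning: high latency
2026-03-20 vcenter vpxd info: heartbeat ok
EOF
# Only the first line satisfies both conditions.
grep -E 'error|warning' /tmp/sample.log | grep 'vcenter'
```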
| Pattern | Description | Example |
|---|---|---|
| * | Matches zero or more characters | *error* matches "timeout error occurred" |
| ? | Matches exactly one character | host-??.lab.local matches "host-01.lab.local" |
| [...] | Matches any character in the set | [Ee]rror matches "Error" and "error" |
| [0-9] | Character range | vm-[0-9][0-9][0-9] matches "vm-001" through "vm-999" |
Field-based filters are the most powerful and efficient search mechanism. They use extracted fields (from content packs or custom extraction) to narrow results precisely.
| Operator | Description | Example |
|---|---|---|
| = | Exact match | vmw_host = esxi01.lab.local |
| != | Not equal | vmw_esxi_vpxa_status != running |
| CONTAINS | Substring match | text CONTAINS "certificate expired" |
| NOT CONTAINS | Excludes substring | text NOT CONTAINS "debug" |
| MATCHES | Regex match | text MATCHES "err(or\|no)\s\d+" |
| >, <, >=, <= | Numeric comparison | response_time > 5000 |
| EXISTS | Field has a value | vmw_vc_vm_name EXISTS |
| NOT EXISTS | Field is absent | custom_field NOT EXISTS |
Tip: Combine multiple field filters with AND/OR for complex investigations:

vmw_host = esxi01.lab.local AND appname = "vmkernel" AND text CONTAINS "NMP" AND severity = "warning"
Every log message ingested by Operations for Logs automatically receives the following static fields, regardless of content packs:
| Field | Description | Example Value |
|---|---|---|
| timestamp | Time the event was generated (from syslog header or agent) | 2026-03-20T14:32:01.000Z |
| source | Hostname or IP of the sending device | esxi01.lab.local |
| appname | Application name (from syslog header APP-NAME field) | vpxd, hostd, vmkernel |
| facility | Syslog facility code | local0, daemon, kern |
| severity | Syslog severity level | info, warning, error, critical |
| text | Full message body (everything after the syslog header) | (variable) |
Content packs define additional fields that are extracted at query time (or at ingest time, depending on configuration). For example, the vSphere content pack extracts:
- vmw_host — ESXi hostname
- vmw_vc_vm_name — Virtual machine name
- vmw_esxi_service — ESXi service name (hostd, vpxa, etc.)
- vmw_vc_event_type — vCenter event type (VmPoweredOnEvent, DrsVmMigratedEvent, etc.)
- vmw_vc_user — User who initiated the action

For logs not covered by existing content packs, you can create custom extracted fields interactively.
Procedure:
1. Name the field descriptively (e.g., http_response_code, db_query_time).
2. Choose the data type: String, Integer, or Float.

Warning: Custom extracted fields consume CPU during query execution. Avoid creating overly broad regex patterns that match unintended log messages. Test thoroughly using the Preview function before saving.
Operations for Logs uses different regex engines depending on the context:
| Context | Regex Engine | Notes |
|---|---|---|
| UI queries and field extraction | Java regex (java.util.regex) | Double-escape backslashes in the UI: \\d+ |
| Agent file parsing (liagent.ini) | C++ Boost regex | Standard PCRE-like syntax |
| API queries | Java regex | Same as UI |
| Pattern | Purpose | Regex |
|---|---|---|
| IPv4 address | Match IP addresses in log text | \\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b |
| MAC address | Match MAC addresses | [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} |
| ISO timestamp | Match ISO 8601 timestamps | \\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2} |
| HTTP status code | Match 3-digit HTTP codes | HTTP/\\d\\.\\d\\s+(\\d{3}) |
| Email address | Match email addresses | [\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,} |
| Windows SID | Match Windows Security Identifiers | S-\\d-\\d+-[\\d-]+ |
| UUID / GUID | Match UUIDs | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} |
Tip: When building regex patterns in the UI, use the Preview function to validate against live data. Start with a broad pattern and refine it iteratively. Named capture groups
(?<fieldname>...)are supported for multi-field extraction from a single pattern.
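Patterns from the table can be smoke-tested outside the UI before double-escaping them. For example, the IPv4 pattern, rewritten with [0-9] classes because grep -E does not support \d:

```shell
# Test the IPv4 pattern against a sample vmkernel-style line.
# In the UI the same pattern would be written with \\d{1,3} instead.
echo "vmk0: address 192.168.10.25 acquired" | \
  grep -oE '\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'
```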
Intelligent Grouping is a machine-learning feature that automatically clusters structurally similar log messages, ignoring variable components like IP addresses, timestamps, UUIDs, and numeric values.
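The clustering idea can be crudely imitated in shell: mask the variable tokens and structurally similar messages collapse to the same template. This toy sketch masks only IPs and numbers and is purely an illustration, not the product's algorithm:

```shell
# Replace variable tokens (IP addresses first, then bare numbers) with a
# wildcard so similar messages reduce to one pattern.
mask() {
  sed -E \
    -e 's/[0-9]+(\.[0-9]+){3}/<*>/g' \
    -e 's/[0-9]+/<*>/g'
}
# Both lines collapse to: Session <*> opened from <*>
echo "Session 42 opened from 10.1.1.9" | mask
echo "Session 77 opened from 10.1.1.30" | mask
```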
Accessing Intelligent Grouping:
Apply an optional filter (e.g., source = esxi01.lab.local) and set the time range.

How It Works:
Variable tokens are replaced with wildcards, so structurally similar messages collapse into a single pattern (e.g., User <*> logged in from <*>).

Use Cases:
Interacting with Groups:
Saved queries preserve search criteria — keywords, field filters, time range preferences, and selected fields — for reuse without re-entering the query each time.
Saving a Query:
Give the query a descriptive name (e.g., ESXi PSOD Events - All Hosts).

Using Saved Queries:
Managing Saved Queries:
Best Practice: Establish a naming convention for shared saved queries (e.g.,
[Team] - [Description]) to keep the query library organized as it grows. Periodically review and prune unused saved queries to maintain clarity.
Operations for Logs provides two methods to create dashboards: promoting a query result directly from Explore Logs, or building a dashboard from scratch in the Dashboards section.
Method 1 — Promote from Explore Logs:
Method 2 — Build from Scratch:
Tip: Dashboards auto-refresh at configurable intervals (30 seconds, 1 minute, 5 minutes, 15 minutes, or manual). Set the refresh interval using the clock icon in the dashboard toolbar.
Operations for Logs supports a variety of widget types, each optimized for different analytical use cases.
| Widget Type | Description | Best Use Case |
|---|---|---|
| Chart (Pie) | Proportional breakdown of values as a circular chart | Distribution of log sources, error types by category |
| Chart (Bar) | Horizontal category comparison bars | Top 10 error-generating hosts, busiest log sources |
| Chart (Line) | Time-series trend line with data points | Log volume over time, error rate trends, ingestion throughput |
| Chart (Column) | Vertical bars for period-based comparison | Hourly event counts, daily log volume comparison |
| Chart (Gauge) | Single metric displayed as a gauge dial | Current ingestion rate, active alert count |
| Field Table | Tabular data view with sortable columns | Detailed event listing with extracted fields |
| Query List | List of saved queries displayed as clickable links | Quick-access navigation panel for analysts |
| Event Types | Breakdown of machine-learning-grouped event categories | ML-classified event distribution |
| Event Trends | Sparkline trend charts for each event type | At-a-glance trend overview per event category |
Widget Configuration Options:
hostname, appname, vmw_cluster).To create an alert, navigate to Alerts → Alert Definitions → New Alert. Operations for Logs provides four trigger condition types, each suited to a different monitoring pattern.
Trigger Condition Type 1 — On Every Match
Example: text CONTAINS "CRITICAL" AND appname CONTAINS "sshd" AND text CONTAINS "root"

Trigger Condition Type 2 — Total Count
Example: text CONTAINS "error", Threshold = 100, Window = 5 minutes.

Trigger Condition Type 3 — Unique Count
Parameters: Field to count unique values of (e.g., source), Count threshold (integer), Time window (minutes).

Example: text CONTAINS "authentication failure", Field = source, Threshold = 10, Window = 15 minutes.

Trigger Condition Type 4 — Aggregation
Parameters: Aggregation function (avg, min, max, sum, count), Field name, Threshold (numeric), Time window (minutes).

Example: appname CONTAINS "nginx", Function = avg, Field = response_time, Threshold = 5000, Window = 10 minutes.

Each alert definition includes the following configuration fields:
| Field | Description | Required |
|---|---|---|
| Name | Descriptive name for the alert (e.g., "ESXi PSOD Detection") | Yes |
| Description | Detailed description of the alert purpose and expected response | No |
| Query | The log query that defines which events are evaluated | Yes |
| Trigger Condition | One of the four types described in Section 26.3 | Yes |
| Frequency | How often the alert query is evaluated: 1, 5, 15, 30, or 60 minutes | Yes |
| Raise an Alert | When to generate the alert: First occurrence only, Every time the condition is met, or Once per time window | Yes |
| Notification | Select one or more notification channels (email or webhook) | No |
| Enable/Disable | Toggle to activate or deactivate the alert without deleting it | Yes |
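To make the aggregation trigger (Type 4) concrete, here is a toy offline evaluation of the avg-over-window comparison from the response_time example; the sample values are made up:

```shell
# Average four sample response times over a window and compare to the
# 5000 ms threshold, as a Type 4 aggregation condition would.
printf '%s\n' 4200 6100 7300 5900 | awk -v th=5000 '
  { sum += $1; n++ }
  END { avg = sum / n; printf "avg=%.0f %s\n", avg, (avg > th ? "ALERT" : "ok") }'
```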
Best Practice: Set the alert frequency to be shorter than or equal to the trigger time window. For example, if the time window is 5 minutes, set the frequency to 5 minutes or less. This ensures no events are missed between evaluation cycles.
When an alert becomes temporarily noisy — for example, during a planned maintenance window — you can snooze it rather than disabling it entirely.
Snoozed alerts display a clock icon and remaining snooze time in the Triggered Alerts list. You can cancel a snooze early by clicking Unsnooze on the alert.
Operations for Logs supports two primary notification channel types: Email (SMTP) and Webhooks.
| Field | Example Value |
|---|---|
| SMTP Server | smtp.lab.local |
| Port | 587 (TLS) or 25 (unencrypted) |
| Use TLS | Enabled |
| From Address | vrli-alerts@lab.local |
| Username | vrli-smtp-user |
| Password | (SMTP authentication password) |
| Field | Description |
|---|---|
| Name | Descriptive name (e.g., "Slack-Ops-Channel") |
| URL | Target endpoint URL |
| Content Type | application/json (default) |
| Payload Template | JSON body with placeholder variables |
Common Webhook Targets:
| Target | URL Format | Notes |
|---|---|---|
| Slack | https://hooks.slack.com/services/T.../B.../xxx | Use Slack Incoming Webhook URL |
| PagerDuty | https://events.pagerduty.com/v2/enqueue | Use PagerDuty Events API v2 integration key |
| Aria Automation | https://<vra-fqdn>/csp/gateway/am/api/... | Trigger workflow via REST webhook |
| ServiceNow | https://<instance>.service-now.com/api/now/table/incident | Create incident via REST API |
| Custom | Any https:// endpoint | Configurable HTTP method, headers, body template |
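Webhook payload templates are JSON bodies with placeholder variables that are substituted at send time. A minimal simulation of the substitution step, using made-up alert values:

```shell
# Simulate the variable substitution performed when a webhook payload
# template is rendered. The alert name and count here are illustrative.
ALERT_NAME="ESXi PSOD Detection"
MATCH_COUNT=3
payload="{\"text\": \"Alert: ${ALERT_NAME} (${MATCH_COUNT} matching events)\"}"
echo "$payload"
```

The rendered body is what a receiver such as a Slack incoming webhook would accept as its POST payload.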
Available Payload Variables:
- ${AlertName} — Name of the triggered alert
- ${AlertDescription} — Alert description text
- ${AlertQuery} — The query that triggered the alert
- ${MatchCount} — Number of matching events
- ${Url} — Direct URL to the alert in Operations for Logs
- ${Timestamp} — Time the alert was triggered
- ${Messages} — Sample of matching log messages

VCF Operations and VCF Operations for Logs are designed to work together as a unified observability platform. Two complementary integration methods connect the products:
Notification Events — VCF Operations for Logs sends alert notifications to VCF Operations, creating corresponding alert objects that appear alongside metric-based alerts. This enables a single-pane-of-glass view of both metric and log-based anomalies.
Launch in Context — From VCF Operations, operators can click on any monitored object and open its associated logs directly in Operations for Logs. The log view is automatically pre-filtered to show only events from the selected object and time range, eliminating the need to manually construct queries.
This integration pushes alert data from Operations for Logs into VCF Operations.
Step-by-step on the Operations for Logs side:
Enter the VCF Operations FQDN (e.g., vrops-vip.lab.local).

Step-by-step on the VCF Operations side:
Note: Alerts forwarded from Operations for Logs appear under the Log Analytics alert type in VCF Operations. They can be viewed, acknowledged, and cancelled using the same alert management workflows as native VCF Operations alerts.
This integration allows operators to open contextual log data from within the VCF Operations interface.
Step-by-step on the VCF Operations side:
Enter the Operations for Logs URL: https://<vrli-vip-fqdn> (e.g., https://vrli-vip.lab.local).

Verification:
A dedicated content pack enables Operations for Logs to parse, extract, and visualize logs generated by VCF Operations itself.
Installation:
Included Content:
| Content Type | Count | Examples |
|---|---|---|
| Extracted Fields | 15+ | vrops_component, vrops_alert_name, vrops_adapter_kind |
| Saved Queries | 10+ | "VCF Operations Errors — Last 24h", "Analytics Engine Warnings" |
| Dashboards | 3 | "VCF Operations Health", "Adapter Collection Status", "Audit Trail" |
| Alerts | 5 | "VCF Operations Service Crash", "Collector Disconnected" |
Once both integration methods are configured, the following capabilities become available in VCF Operations:
Operations for Logs can forward received logs to other systems for compliance archival, SIEM integration, or multi-site aggregation. Forwarding is asynchronous and adds no significant overhead to the cluster. Three forwarding protocols are supported:
| Protocol | Description | Use Case |
|---|---|---|
| Ingestion API (CFAPI) | Forward using the native Operations for Logs ingestion API format | Forward to another Operations for Logs instance for multi-site aggregation |
| Syslog | Forward as standard syslog messages over TCP, UDP, or TLS | Forward to SIEM platforms (Splunk, QRadar, ArcSight), syslog servers |
| RAW | Forward the original raw log data without transformation | Preserve exact original format for compliance or forensic archives |
Step-by-step:
| Field | Description | Example |
|---|---|---|
| Name | Descriptive name for the destination | SIEM-Splunk-Prod |
| Destination Host | FQDN or IP address of the target system | splunk-hec.lab.local |
| Protocol | Syslog (TCP/UDP/TLS), CFAPI, or RAW | Syslog (TLS) |
| Port | Port number appropriate for the selected protocol | 6514 |
Filter (optional): Configure filters to forward only specific log data:
Tags (optional): Add or modify tags on events before forwarding. This allows the receiving system to identify forwarded events.
Click Test to verify connectivity to the destination, then click Save.
Note: Each cluster supports up to 10 forwarding destinations. Forwarding operates asynchronously from the ingestion pipeline — destination outages do not affect log ingestion or indexing. Events are buffered and retried if the destination is temporarily unreachable.
Operations for Logs can archive log data to an NFS share for long-term retention beyond the active index capacity.
Step-by-step:
Specify the archive location in the format nfs://<server>/<share> (e.g., nfs://nfs-server.lab.local/vrli-archive).

Archive Behavior:
| Aspect | Detail |
|---|---|
| When data is archived | After it ages out of the active index (based on partition retention) |
| Archive format | Compressed JSON files organized by date |
| Searchability | Archived data is not searchable directly — must be re-ingested to query |
| NFS version requirement | NFSv3 |
| Permissions | Read/write access required from all cluster nodes |
| Mount validation | All nodes must successfully mount the NFS share |
Important: Ensure the NFS share has sufficient capacity for long-term storage. A cluster ingesting 50 GB/day will generate approximately 15–20 GB/day of compressed archive data.
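As a back-of-envelope check of that guidance, one year of archives at the upper bound of the quoted range works out to:

```shell
# 20 GB/day of compressed archive (upper bound above) for a full year.
awk 'BEGIN { printf "%d GB per year\n", 20 * 365 }'
```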
Index partitions allow you to apply different retention periods to different categories of log data. This enables longer retention for compliance-critical logs (e.g., security audit events) while using shorter retention for high-volume operational logs.
Configuration:
| Field | Description | Example |
|---|---|---|
| Name | Partition identifier | Security-Logs |
| Retention Period | Number of days to retain indexed data | 90 |
| Filter | Criteria determining which logs are routed to this partition | appname CONTAINS "sshd" OR appname CONTAINS "audit" |
Important: Longer retention periods require proportionally more disk space. Plan the /storage/var disk on each node to accommodate the total data volume across all partitions. Use the formula: Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3 (index overhead).
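Worked through for a concrete case (50 GB/day with 90-day retention, values chosen for illustration), the formula gives:

```shell
# Required Disk (GB) = Daily Ingestion (GB) x Retention (days) x 1.3
daily_gb=50
retention_days=90
# POSIX shell arithmetic lacks floats, so awk does the multiplication.
required_gb=$(awk -v d="$daily_gb" -v r="$retention_days" \
  'BEGIN { printf "%.0f", d * r * 1.3 }')
echo "Required disk: ${required_gb} GB"
```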
The Operations for Logs appliance stores its own operational logs in well-defined paths. Familiarity with these files is essential for troubleshooting appliance issues.
| Log File | Path | Purpose | Rotation |
|---|---|---|---|
| Core Application | /var/log/loginsight/runtime.log | Main application log — startup, shutdown, errors | Daily |
| API / UI | /var/log/loginsight/api_runtime.log | API request logs, UI backend errors | Daily |
| Ingestion | /var/log/loginsight/ingestion.log | Syslog and agent ingestion pipeline | Daily |
| Cassandra | /var/log/loginsight/cassandra.log | Index database operations and errors | Daily |
| Audit | /var/log/loginsight/audit.log | User actions, login events, configuration changes | Daily |
| Watchdog | /var/log/loginsight/watchdog.log | Service health monitoring and auto-restart events | Daily |
| System | /var/log/messages | OS-level syslog messages | Weekly |
| Apache Reverse Proxy | /var/log/loginsight/apache/ | Reverse proxy access and error logs | Daily |
| Upgrade | /var/log/loginsight/upgrade.log | Upgrade process log with step-by-step progress | Per upgrade |
The Operations for Logs appliance runs on a SUSE Linux Enterprise Server (SLES) base operating system. Services are managed via systemctl.
# Check overall service status
systemctl status loginsight
# Restart the main Operations for Logs service
systemctl restart loginsight
# Check Cassandra index database status
systemctl status loginsight-cassandra
# Check watchdog service (monitors and auto-restarts crashed services)
systemctl status loginsight-watchdog
# View real-time service logs
journalctl -u loginsight -f
# Check disk usage on storage partition
df -h /storage/var
# Check cluster node connectivity
curl -k https://localhost:9543/api/v2/version
Warning: Restarting the loginsight service causes a brief ingestion interruption on that node. In a cluster, agents and syslog sources connected to the restarted node temporarily buffer events and reconnect to another node via the ILB VIP.
If the admin password is lost and UI access is not possible, reset it from the appliance command line:
```shell
# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Navigate to the application sbin directory
cd /usr/lib/loginsight/application/sbin

# Execute the password reset script
./li-reset-admin-password.sh

# Follow the interactive prompts to set a new admin password
# Services restart automatically after the password is reset
```
Note: This procedure resets the local `admin` account password only. It does not affect Active Directory or VIDM-integrated accounts. The password reset requires SSH access to the primary node as root.
Adjusting the internal logging level can help diagnose appliance issues.
| Level | Volume | Use Case |
|---|---|---|
| Error | Minimal | Production — only critical failures |
| Warning | Low | Production — failures and potential issues |
| Info | Moderate (default) | Normal operations — recommended for production |
| Debug | High | Active troubleshooting — detailed diagnostic output |
| Trace | Very High | Deep troubleshooting — full method-level tracing |
Important: Set the logging level to Debug or Trace only temporarily during active troubleshooting. These levels significantly increase log volume and can fill the `/storage/var` partition if left enabled. Always return to Info after troubleshooting is complete.
A support bundle collects diagnostic information required by Broadcom support for troubleshooting appliance issues.
UI Method:
CLI Method:
```shell
# SSH to the primary node as root
ssh root@vrli-primary.lab.local

# Generate the support bundle
/usr/lib/loginsight/application/sbin/li-support-bundle.sh

# Output location:
# /tmp/li-support-bundle-<timestamp>.tar.gz

# Transfer the bundle to your workstation
scp root@vrli-primary.lab.local:/tmp/li-support-bundle-*.tar.gz .
```
Bundle Contents:
All Operations for Logs API calls use HTTPS on port 9543 (or HTTP on port 9000 for non-production environments). The base URL format is:
```
https://<vrli-vip-fqdn>:9543/api/v2/
```
Replace `<vrli-vip-fqdn>` with the cluster VIP FQDN or individual node FQDN. All examples in this chapter use `vrli-vip.lab.local` as the target.
All API calls (except `/api/v2/sessions`) require a valid session token. Obtain a token by authenticating against the sessions endpoint:
```shell
# Obtain a session token
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/sessions" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "YourPassword123!",
    "provider": "Local"
  }'

# Response:
# {
#   "userId": "a1b2c3d4-...",
#   "sessionId": "abc123def456...",
#   "ttl": 1800
# }
```
Use the returned `sessionId` value as a Bearer token in subsequent requests:

```
Authorization: Bearer abc123def456...
```
| Field | Description |
|---|---|
| `userId` | Unique identifier of the authenticated user |
| `sessionId` | Session token — valid for `ttl` seconds |
| `ttl` | Time-to-live in seconds (default 1800 = 30 minutes) |
| `provider` | Authentication provider: `Local`, `ActiveDirectory`, or `vidm` |
Note: Tokens expire after the TTL period. For long-running automation scripts, implement token refresh logic that re-authenticates before the TTL expires.
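The refresh pattern that note describes can be sketched as a small stateful helper. This is a minimal illustration built on the `/api/v2/sessions` request and response shown above; the `LogsSession` class name, the 60-second refresh margin, and the injectable clock are assumptions made for the example, and TLS verification is disabled only to mirror `curl -k` in a lab.

```python
# Minimal sketch (assumptions noted above): re-authenticate against
# /api/v2/sessions shortly before the returned TTL elapses.
import json
import ssl
import time
import urllib.request

class LogsSession:
    """Caches a sessionId and re-authenticates before its TTL lapses."""

    def __init__(self, base_url, username, password, provider="Local",
                 refresh_margin=60, clock=time.time):
        self.base_url = base_url.rstrip("/")
        self.credentials = {"username": username, "password": password,
                            "provider": provider}
        self.refresh_margin = refresh_margin  # seconds of headroom before expiry
        self.clock = clock                    # injectable for testing
        self.session_id = None
        self.expires_at = 0.0

    def needs_refresh(self):
        """True when no token is cached or expiry is within the margin."""
        return (self.session_id is None
                or self.clock() >= self.expires_at - self.refresh_margin)

    def token(self):
        if self.needs_refresh():
            self._authenticate()
        return self.session_id

    def _authenticate(self):
        # POST /api/v2/sessions returns {"sessionId": ..., "ttl": ...}.
        request = urllib.request.Request(
            f"{self.base_url}/api/v2/sessions",
            data=json.dumps(self.credentials).encode(),
            headers={"Content-Type": "application/json"},
            method="POST")
        context = ssl._create_unverified_context()  # lab only, like curl -k
        with urllib.request.urlopen(request, context=context) as response:
            body = json.load(response)
        self.session_id = body["sessionId"]
        self.expires_at = self.clock() + body["ttl"]

    def headers(self):
        """Headers for authenticated calls, refreshing the token if needed."""
        return {"Authorization": f"Bearer {self.token()}",
                "Content-Type": "application/json"}
```

Long-running scripts then call `session.headers()` before each request instead of caching the Bearer header once.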
Send log events programmatically using the ingestion endpoint. This is useful for forwarding application logs, CI/CD pipeline events, or custom monitoring data.
```shell
# Ingest a single event
curl -k -X POST "https://vrli-vip.lab.local:9543/api/v2/events/ingest/0" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer abc123def456..." \
  -d '{
    "events": [
      {
        "text": "Application deployment completed successfully",
        "timestamp": 1711000000000,
        "fields": [
          {"name": "appname", "content": "deploy-pipeline"},
          {"name": "environment", "content": "production"},
          {"name": "build_number", "content": "1842"},
          {"name": "deploy_status", "content": "success"}
        ]
      }
    ]
  }'
```
Ingestion Payload Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | String | Yes | The log message body |
| `timestamp` | Long | No | Event timestamp in epoch milliseconds (defaults to server receipt time) |
| `fields` | Array | No | Array of key-value pairs for structured field extraction |
| `fields[].name` | String | Yes (if `fields` used) | Field name |
| `fields[].content` | String | Yes (if `fields` used) | Field value |
Tip: The `/ingest/0` endpoint path suffix (`0`) specifies the shard hint. For most use cases, use `0` to let the cluster auto-distribute. For high-throughput ingestion, distribute across shard hints `0` through `n-1`, where `n` is the number of cluster nodes.
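The round-robin distribution described in the tip can be sketched as follows. The payload shape mirrors the ingestion example above; `make_event` and `shard_urls` are illustrative helper names, not part of the product API.

```python
# Hedged sketch: build CFAPI ingestion payloads and cycle shard hints 0..n-1.
import itertools

def make_event(text, timestamp_ms=None, **fields):
    """Build one event dict in the /api/v2/events/ingest payload shape."""
    event = {"text": text}
    if timestamp_ms is not None:
        event["timestamp"] = timestamp_ms  # epoch milliseconds
    if fields:
        event["fields"] = [{"name": name, "content": str(value)}
                           for name, value in sorted(fields.items())]
    return event

def shard_urls(base_url, node_count):
    """Yield ingest URLs cycling shard hints, one per cluster node."""
    for hint in itertools.cycle(range(node_count)):
        yield f"{base_url.rstrip('/')}/api/v2/events/ingest/{hint}"
```

Each batch is then POSTed to `next(urls)` with the usual `Authorization: Bearer` header and a `{"events": [...]}` body.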
Retrieve log events and aggregated statistics programmatically.
Search Events:
```shell
# Simple keyword search — last 100 matching events
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=error&limit=100" \
  -H "Authorization: Bearer abc123def456..."

# Field-based query with time range
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/events?q=vmw_host%3Desxi01*&limit=50&start-time-ms=1711000000000&end-time-ms=1711086400000" \
  -H "Authorization: Bearer abc123def456..."
```
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| `q` | String | Query string (URL-encoded) |
| `limit` | Integer | Maximum number of events to return (default 100, max 20000) |
| `start-time-ms` | Long | Start of time range in epoch milliseconds |
| `end-time-ms` | Long | End of time range in epoch milliseconds |
| `order-by-direction` | String | `ASC` or `DESC` (default `DESC`) |
| `content-pack-fields` | String | Include content pack extracted fields |
Aggregated Events:
```shell
# Count events by source over the last hour, divided into 12 bins
curl -k -X GET \
  "https://vrli-vip.lab.local:9543/api/v2/aggregated-events/timestamp/LAST_HOUR?q=error&num-bins=12" \
  -H "Authorization: Bearer abc123def456..."
```
Aggregation Time Ranges:
| Value | Description |
|---|---|
| `LAST_5_MINUTES` | Last 5 minutes |
| `LAST_15_MINUTES` | Last 15 minutes |
| `LAST_HOUR` | Last 60 minutes |
| `LAST_6_HOURS` | Last 6 hours |
| `LAST_24_HOURS` | Last 24 hours |
| `LAST_3_DAYS` | Last 3 days |
| `LAST_7_DAYS` | Last 7 days |
| `LAST_30_DAYS` | Last 30 days |
| `CUSTOM` | Use `start-time-ms` and `end-time-ms` |
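Because the `q` parameter must be URL-encoded (note the `%3D` in the field-based curl example above), building query URLs with a library encoder is less error-prone than encoding by hand. A minimal sketch; `events_query_url` is a hypothetical helper name:

```python
# Minimal sketch: assemble /api/v2/events query URLs with proper encoding.
from urllib.parse import urlencode

def events_query_url(base_url, query, limit=100, start_ms=None, end_ms=None,
                     direction="DESC"):
    """Build an events query URL; urlencode handles '=', '*', and spaces."""
    params = {"q": query, "limit": limit, "order-by-direction": direction}
    if start_ms is not None:
        params["start-time-ms"] = start_ms
    if end_ms is not None:
        params["end-time-ms"] = end_ms
    return f"{base_url.rstrip('/')}/api/v2/events?{urlencode(params)}"
```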
The following table lists the endpoint categories available in the Operations for Logs v2 API.
| # | Category | Endpoint | Description |
|---|---|---|---|
| 1 | Sessions | `/api/v2/sessions` | Authentication — acquire and release tokens |
| 2 | Events | `/api/v2/events` | Query log events with filters and time ranges |
| 3 | Aggregated Events | `/api/v2/aggregated-events` | Statistical queries with time-bucketed aggregation |
| 4 | Ingest | `/api/v2/events/ingest` | Send log events via CFAPI |
| 5 | Alerts | `/api/v2/alerts` | Manage alert definitions (CRUD) |
| 6 | Content Packs | `/api/v2/content-packs` | Install, list, and manage content packs |
| 7 | Dashboards | `/api/v2/dashboards` | Create, update, delete dashboards and widgets |
| 8 | Groups | `/api/v2/groups` | Manage agent groups and group filters |
| 9 | Notifications | `/api/v2/notifications` | Manage notification channels |
| 10 | Users | `/api/v2/users` | User management (create, list, update, delete) |
| 11 | Roles | `/api/v2/roles` | Role management and permission assignment |
| 12 | Datasets | `/api/v2/datasets` | Index partition management |
| 13 | Cluster | `/api/v2/cluster` | Cluster topology, node status, and management |
| 14 | License Keys | `/api/v2/licensekeys` | License key management and status |
| 15 | SMTP | `/api/v2/notification/smtp` | Email notification server configuration |
| 16 | Webhooks | `/api/v2/notification/webhook` | Webhook endpoint configuration |
| 17 | Archiving | `/api/v2/archiving` | NFS archive configuration and status |
| 18 | Forwarding | `/api/v2/forwarding` | Log forwarding destination management |
| 19 | Agents | `/api/v2/agents` | Agent registration, status, and management |
| 20 | vSphere | `/api/v2/vsphere` | vSphere integration configuration |
| 21 | Spaces | `/api/v2/spaces` | Multi-tenancy space management |
| 22 | Certificates | `/api/v2/certificates` | TLS certificate management |
| 23 | Upgrades | `/api/v2/upgrades` | Appliance upgrade management |
| 24 | Support | `/api/v2/support` | Support bundle generation and download |
| 25 | Version | `/api/v2/version` | Product version and build information |
Operations for Logs provides interactive API documentation accessible directly from the appliance:
```
https://<vrli-vip-fqdn>:9543/api/v2/docs
```
The interactive documentation lets you browse every endpoint, inspect request and response schemas, and issue test calls directly from the browser. The machine-readable OpenAPI specification is available at:

```
https://<vrli-vip-fqdn>:9543/api/v2/docs/openapi.json
```
Tip: Use the OpenAPI specification to generate client libraries in Python, Go, Java, or PowerShell for automating Operations for Logs administration tasks.
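As one small example of consuming the specification, the sketch below downloads `openapi.json` and flattens it into (method, path) pairs — a quick way to audit endpoint coverage before generating a full client. It assumes the spec follows the standard OpenAPI `paths` layout; TLS verification is disabled only to match the lab-style `curl -k` usage elsewhere in this chapter.

```python
# Hedged sketch: enumerate (method, path) pairs from the appliance's
# published OpenAPI document.
import json
import ssl
import urllib.request

def fetch_openapi(base_url):
    """Download the OpenAPI document from /api/v2/docs/openapi.json."""
    context = ssl._create_unverified_context()  # lab only
    url = f"{base_url.rstrip('/')}/api/v2/docs/openapi.json"
    with urllib.request.urlopen(url, context=context) as response:
        return json.load(response)

def list_operations(spec):
    """Flatten a parsed OpenAPI dict into sorted (METHOD, path) pairs."""
    operations = []
    for path, methods in spec.get("paths", {}).items():
        for method in methods:
            operations.append((method.upper(), path))
    return sorted(operations)
```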
The following table lists all network ports required for VCF Operations deployment and operation. Firewall rules must permit traffic on these ports between the listed source and destination components.
| Port | Protocol | Direction | Source | Destination | Purpose |
|---|---|---|---|---|---|
| 443 | TCP | Inbound | Browser / API Client | VCF Operations Cluster VIP | Web UI and REST API (HTTPS) |
| 443 | TCP | Outbound | VCF Operations Node | vCenter Server | vCenter adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | NSX Manager | NSX adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | SDDC Manager | SDDC Manager adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | ESXi Hosts | Direct ESXi metric collection |
| 443 | TCP | Outbound | Remote Collector | vCenter / NSX / Targets | Remote adapter data collection |
| 443 | TCP | Outbound | VCF Operations Node | Broadcom Marketplace | Management pack downloads |
| 8543 | TCP | Inbound | Remote Collector / Agents | VCF Operations Cluster VIP | Collector-to-cluster communication |
| 7001 | TCP | Internal | VCF Operations Node | VCF Operations Node | GemFire cache replication |
| 1300–1399 | TCP | Internal | VCF Operations Node | VCF Operations Node | Distributed cache range ports |
| 10002 | TCP | Internal | VCF Operations Node | VCF Operations Node | GemFire locator port |
| 20002 | TCP | Internal | VCF Operations Node | VCF Operations Node | xDB replication primary port |
| 20003 | TCP | Internal | VCF Operations Node | VCF Operations Node | xDB replication secondary port |
| 4369 | TCP | Internal | VCF Operations Node | VCF Operations Node | Erlang Port Mapper Daemon (epmd) |
| 5433 | TCP | Internal | VCF Operations Node | VCF Operations Node | PostgreSQL database replication |
| 8080 | TCP | Localhost | VCF Operations Node | Localhost | Internal application HTTP |
| 9090 | TCP | Localhost | VCF Operations Node | Localhost | Internal admin service |
| 514 | UDP | Inbound | Network Devices | VCF Operations Node | Syslog reception (optional) |
| 162 | UDP | Inbound | Network Devices | VCF Operations Node | SNMP trap reception |
| 25 | TCP | Outbound | VCF Operations Node | SMTP Server | Email notification delivery |
| 587 | TCP | Outbound | VCF Operations Node | SMTP Server | Email notification delivery (TLS) |
| 123 | UDP | Outbound | VCF Operations Node | NTP Server | Time synchronization |
Note: For the complete and most current port requirements, consult the Broadcom Ports and Protocols tool at `https://ports.broadcom.com/`.
The following table lists all network ports required for VCF Operations for Logs deployment and operation.
| Port | Protocol | Direction | Source | Destination | Purpose |
|---|---|---|---|---|---|
| 443 | TCP | Inbound | Browser / API Client | Operations for Logs VIP | Web UI access (HTTPS) |
| 514 | TCP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (TCP) |
| 514 | UDP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (UDP) |
| 6514 | TCP | Inbound | Syslog Sources | Operations for Logs VIP | Syslog ingestion (TLS-encrypted) |
| 1514 | TCP | Inbound | ESXi Hosts | Operations for Logs VIP | ESXi SSL syslog forwarding |
| 9000 | TCP | Inbound | Log Insight Agents | Operations for Logs VIP | CFAPI ingestion (HTTP) |
| 9543 | TCP | Inbound | Log Insight Agents / API Clients | Operations for Logs VIP | CFAPI ingestion (HTTPS) + REST API |
| 16520–16580 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cluster inter-node communication |
| 59778 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Thrift RPC inter-node calls |
| 12543 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra database communication |
| 9200 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Node indexing service |
| 7000 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra gossip protocol |
| 7001 | TCP | Internal | Operations for Logs Node | Operations for Logs Node | Cassandra SSL gossip protocol |
| 123 | UDP | Outbound | Operations for Logs Node | NTP Server | Time synchronization |
| 25 | TCP | Outbound | Operations for Logs Node | SMTP Server | Email notification delivery |
| 587 | TCP | Outbound | Operations for Logs Node | SMTP Server | Email notification delivery (TLS) |
| 514 | TCP | Outbound | Operations for Logs Node | Syslog Destination | Log forwarding (syslog) |
| 443 | TCP | Outbound | Operations for Logs Node | VCF Operations VIP | Alert notification integration |
| 2049 | TCP/UDP | Outbound | Operations for Logs Node | NFS Server | NFS archive mount |
Note: Syslog ingestion on port 514 (both TCP and UDP) is enabled by default. Port 6514 (TLS) and port 1514 (ESXi SSL) require additional configuration in the appliance admin UI.
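When validating firewall rules against the tables above, a simple TCP reachability probe run from the source side can confirm that a listener is open before deeper troubleshooting. A minimal sketch; the port list below is illustrative, and plain connect tests cannot validate UDP ports such as 514/UDP or 123/UDP:

```python
# Minimal sketch: probe TCP listeners from a prospective log source.
import socket

def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Key Operations for Logs ingestion listeners from the table above.
INGEST_PORTS = {514: "syslog (TCP)", 1514: "ESXi SSL syslog",
                6514: "syslog (TLS)", 9543: "CFAPI / REST API (HTTPS)"}

def sweep(host, ports=INGEST_PORTS):
    """Map each port to its reachability from this machine."""
    return {port: check_tcp_port(host, port) for port in ports}
```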
The VCF Operations Suite API provides programmatic access to all platform capabilities. The base path for all endpoints is:
```
https://<vrops-vip-fqdn>/suite-api/api/
```
| # | Category | Base Path | Key Operations |
|---|---|---|---|
| 1 | Authentication | `/suite-api/api/auth` | Acquire and release authentication tokens |
| 2 | Resources | `/suite-api/api/resources` | CRUD operations on monitored objects |
| 3 | Resource Kinds | `/suite-api/api/resourcekinds` | List and describe resource types |
| 4 | Adapter Kinds | `/suite-api/api/adapterkinds` | List and describe adapter types |
| 5 | Adapters | `/suite-api/api/adapters` | Manage adapter instances and credentials |
| 6 | Credentials | `/suite-api/api/credentials` | Create, update, and delete stored credentials |
| 7 | Alerts | `/suite-api/api/alerts` | Query, acknowledge, and cancel alerts |
| 8 | Alert Definitions | `/suite-api/api/alertdefinitions` | Create and manage alert definitions |
| 9 | Symptoms | `/suite-api/api/symptoms` | Query active symptom instances |
| 10 | Symptom Definitions | `/suite-api/api/symptomdefinitions` | Create and manage symptom definitions |
| 11 | Notifications | `/suite-api/api/notifications` | Manage notification rules and channels |
| 12 | Super Metrics | `/suite-api/api/supermetrics` | Create and manage super metric formulas |
| 13 | Policies | `/suite-api/api/policies` | Manage operational policies and assignments |
| 14 | Dashboards | `/suite-api/api/dashboards` | Create, clone, share, and delete dashboards |
| 15 | Reports | `/suite-api/api/reports` | Generate, schedule, and download reports |
| 16 | Report Definitions | `/suite-api/api/reportdefinitions` | Define report templates and layouts |
| 17 | Views | `/suite-api/api/views` | Create and manage data views |
| 18 | Tasks | `/suite-api/api/tasks` | Query and manage background tasks |
| 19 | Collector Groups | `/suite-api/api/collectorgroups` | Manage collector group assignments |
| 20 | Collectors | `/suite-api/api/collectors` | List and manage collector nodes |
| 21 | Audit | `/suite-api/api/audit` | Query audit log entries |
| 22 | Applications | `/suite-api/api/applications` | Application monitoring configuration |
| 23 | Deployment | `/suite-api/api/deployment` | Cluster deployment and scaling operations |
| 24 | Certificate | `/suite-api/api/certificate` | TLS certificate management |
| 25 | Cluster | `/suite-api/api/cluster` | Cluster topology and health |
| 26 | Versions | `/suite-api/api/versions` | Product version and build information |
| 27 | Content | `/suite-api/api/content` | Import and export content bundles |
| 28 | Events | `/suite-api/api/events` | Query and manage event timeline |
| 29 | Maintenance Schedules | `/suite-api/api/maintenanceschedules` | Schedule maintenance windows |
| 30 | Object Groups | `/suite-api/api/groups` | Manage built-in object groups |
| 31 | Custom Groups | `/suite-api/api/customgroups` | Create and manage custom object groups |
| 32 | Traversal Specs | `/suite-api/api/traversalspecs` | Define object relationship traversals |
| 33 | Relationships | `/suite-api/api/resources/{id}/relationships` | Query parent/child object relationships |
| 34 | Statistics | `/suite-api/api/resources/{id}/stats` | Retrieve metric data for a resource |
| 35 | Properties | `/suite-api/api/resources/{id}/properties` | Retrieve property values for a resource |
| 36 | Latest Statistics | `/suite-api/api/resources/{id}/stats/latest` | Retrieve the most recent metric values |
| 37 | Recommendations | `/suite-api/api/recommendations` | Query optimization recommendations |
| 38 | Cost | `/suite-api/api/costconfig` | Cost model and rate card configuration |
| 39 | Pricing | `/suite-api/api/pricing` | Pricing policy management |
| 40 | Capacity | `/suite-api/api/capacity` | Capacity analytics and projections |
| 41 | Reclamation | `/suite-api/api/reclamation` | Resource reclamation recommendations |
| 42 | Compliance | `/suite-api/api/compliance` | Compliance benchmark scoring |
| 43 | SDDC Health | `/suite-api/api/sddc` | SDDC-level health and status |
| 44 | vSAN | `/suite-api/api/vsan` | vSAN-specific health and capacity |
| 45 | Token | `/suite-api/api/auth/token` | Token-based authentication (acquire/validate) |
| 46 | Users | `/suite-api/api/auth/users` | User account management |
| 47 | Roles | `/suite-api/api/auth/roles` | Role and permission management |
Note: All Suite API endpoints support JSON request and response bodies. Use `Content-Type: application/json` and `Accept: application/json` headers. Full Swagger documentation is available at `https://<vrops-vip-fqdn>/suite-api/doc/swagger-ui.html`.
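Authentication against the Suite API differs from the Operations for Logs API: tokens are acquired from the `auth/token` category above and passed with the `vRealizeOpsToken` scheme rather than `Bearer`. The sketch below assumes the commonly documented `/suite-api/api/auth/token/acquire` request shape and the `authSource` field; verify both against your appliance's Swagger UI before relying on them.

```python
# Hedged sketch of Suite API token acquisition (assumed request shape,
# see lead-in). TLS verification disabled only for lab use.
import json
import ssl
import urllib.request

def acquire_suite_token(base_url, username, password, auth_source="LOCAL"):
    """POST auth/token/acquire and return the token string."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/suite-api/api/auth/token/acquire",
        data=json.dumps({"username": username, "password": password,
                         "authSource": auth_source}).encode(),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST")
    context = ssl._create_unverified_context()  # lab only
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)["token"]

def suite_headers(token):
    """The Suite API uses the vRealizeOpsToken scheme, not Bearer."""
    return {"Authorization": f"vRealizeOpsToken {token}",
            "Content-Type": "application/json",
            "Accept": "application/json"}
```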
The following table lists the OVA appliance files used to deploy VCF Operations and related products. File sizes are approximate and vary by specific release version.
| Product | OVA Filename | Approx. Size | Notes |
|---|---|---|---|
| VCF Operations (Analytics Node) | `vRealize-Operations-Manager-Appliance-8.18.2.*.ova` | ~3.2 GB | Primary, replica, and data node appliance |
| VCF Operations (Remote Collector) | `vRealize-Operations-Manager-Remote-Collector-*.ova` | ~1.8 GB | Lightweight collection-only appliance |
| VCF Operations for Logs | `VMware-vRealize-Log-Insight-8.18.2.*.ova` | ~2.5 GB | Log analytics node (primary and worker) |
| VCF Suite Lifecycle Manager | `VMware-vRealize-Suite-Lifecycle-Manager-*.ova` | ~4.5 GB | Lifecycle management for the full VCF Operations suite |
| VCF Operations for Networks (Platform) | `VMware-vRealize-Network-Insight-*.ova` | ~3.0 GB | Network analytics platform node |
| VCF Operations for Networks (Collector) | `VMware-vRealize-Network-Insight-Collector-*.ova` | ~2.0 GB | Network flow and configuration collector |
Checksum Verification:
Always verify the SHA256 checksum of downloaded OVA files against the values published on the Broadcom download portal before deployment.
```shell
# Linux / macOS
sha256sum vRealize-Operations-Manager-Appliance-8.18.2.*.ova
```

```powershell
# Windows (PowerShell)
Get-FileHash -Algorithm SHA256 .\vRealize-Operations-Manager-Appliance-8.18.2.*.ova
```
Important: Deploying an OVA with a mismatched checksum may indicate a corrupted download or a tampered file. Re-download the OVA from the Broadcom support portal if the checksum does not match.
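For scripted downloads, the manual commands above can be wrapped in a small verifier that streams the file (OVAs are multi-gigabyte) and compares the result against the digest published on the download portal. A minimal sketch; `verify_sha256` is a hypothetical helper name:

```python
# Minimal sketch: stream a large OVA and compare its SHA-256 digest
# to the value published on the Broadcom download portal.
import hashlib

def verify_sha256(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file's SHA-256 digest matches expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so multi-GB OVAs never load fully into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().lower() == expected_hex.strip().lower()
```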
The following table provides direct links to key documentation and resources for VCF Operations and related products.
| Resource | URL |
|---|---|
| VCF Operations Documentation | https://docs.vmware.com/en/VMware-Aria-Operations/index.html |
| VCF Operations for Logs Documentation | https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/index.html |
| VCF Operations API Reference (Suite API) | https://docs.vmware.com/en/VMware-Aria-Operations/8.18/aria-operations-api-guide/GUID-intro.html |
| VCF Operations Sizing Guide | https://kb.vmware.com/s/article/2093783 |
| VCF Operations Port Requirements | https://ports.broadcom.com/ |
| VCF 9.0 Release Notes | https://docs.vmware.com/en/VMware-Cloud-Foundation/9.0/rn/vmware-cloud-foundation-90-release-notes/index.html |
| Broadcom Support Portal | https://support.broadcom.com/ |
| VCF Compatibility Matrix | https://interopmatrix.vmware.com/ |
| Broadcom Marketplace (Management Packs) | https://marketplace.cloud.vmware.com/ |
| VMware Knowledge Base | https://kb.vmware.com/ |
| VCF Operations for Logs API Reference | https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/8.18/aria-operations-for-logs-api-guide/GUID-intro.html |
| VCF Operations Community Forums | https://community.broadcom.com/vmware-tanzu/home |
End of Document
VCF Operations & Operations for Logs — Complete Handbook v1.0 © 2026 Virtual Control LLC. All rights reserved.