VCF OPERATIONS 30-DAY MASTERY PLAN - EXTREMELY DETAILED
VCF 9.x Aligned | Every Action Spelled Out | Zero Assumptions
WEEKLY STRUCTURE (NON-NEGOTIABLE)
WEEK 1: DAYS 1-7 (Foundation: Navigate, Inventory, Health, Dashboards Basics)
DAY 1 - VCF OPS MENTAL MODEL & UI MAP
Time required: 2-3 hours
OBJECTIVE: Be able to explain VCF constructs (fleet > instance > domains) and where Ops fits in the stack. Produce a 1-page navigation map.
STEP-BY-STEP ACTIONS:
Log in at https://<ops-fqdn>/ui (lab example: https://vrops.lab.local/ui).
Walk the left nav top to bottom: INFRASTRUCTURE OPERATIONS (top section - direct sub-items, not expandable), then the EXPANDABLE SECTIONS (click the arrow to see sub-items).
Open a new document (Word, Google Doc, or plain text). Create this exact structure:
VCF OPERATIONS NAVIGATION MAP (VCF 9.x - Verified from Broadcom TechDocs)
===========================================================================
LOGIN URL: https://<ops-fqdn>/ui
HOME / LAUNCHPAD:
Tabs: [All VMware Clouds] [VMware Cloud Foundation] [vCenter]
Right panel: Monitoring Accounts (Private Clouds, vCenters, Hosts, VMs)
Bottom: Appliances Health & Management
Feature: Pin up to 5 dashboards to Product Home
LEFT NAV STRUCTURE:
INFRASTRUCTURE OPERATIONS (direct items - no expand arrow):
├── Diagnostic Findings (107+ known-issue signatures, CSV export)
├── VCF Health (full stack: ESXi, vCenter, vSAN, NSX health)
├── Dashboards & Reports (Favorites/Recents/All + Manage + scheduled reports)
├── Alerts (definitions, symptoms, recommendations, notifications, outbound)
├── Troubleshooting Workbench (sessions, relationship graph/tree, run scripts)
├── Analyze (log search, saved queries, extracted fields, event types, trends)
├── Storage Operations (vSAN + non-vSAN datastores, dedup/compression, perf diag)
├── Network Operations (NSX instances, Edge clusters, transport nodes, VTEP state)
├── Data Protection & Recovery (VMware Live Recovery integration)
├── Automation Central (schedule automated actions for maintenance windows)
└── Configurations (policies, settings, apply to object groups)
WORKLOAD OPERATIONS > (expandable)
├── Business Applications (interconnected app/service health)
├── Product-Managed Telegraf (app service monitoring on VMs)
├── Open Source Telegraf (deploy Telegraf via cloud proxy)
├── Service Discovery (auto-discover services; network flow-based in 9.0)
├── vGPU Monitoring (VM GPU metrics from ESXi)
└── Platform Monitoring (Supervisor + VKS cluster auto-discovery)
FLEET MANAGEMENT > (expandable)
├── Lifecycle Management (bundles, prechecks, upgrades, combined ESX+NSX)
├── Identity & Access Management (SSO, OIDC, SAML, LDAP, SCIM)
├── Certificate Management (view, alert, auto-renew, VMCA/MSCA/OpenSSL CA)
├── Password Management (account status, expiry, fleet-wide remediation)
├── Configuration Management (drift detection, Git integration, PDF reports)
└── Tag Management (import/create/push tags to multiple vCenters, JSON export)
CAPACITY > (expandable)
├── Cost Home (centralized cost visualization)
├── Configuring Cost (SDDC costing, currency, comparisons)
├── Capacity Optimization (utilization, what-if exports, storage exclusion)
├── Chargeback (rate cards for supervisor architecture)
└── Showback / Billing (tenant/org cost transparency)
SECURITY > (expandable)
├── Security Operations Dashboard (security posture, user tracking)
├── Compliance (packs, benchmarks, rules, remediation)
└── Audit Events (cross-vCenter activity search, vSphere/vSAN/NSX events)
LICENSE MANAGEMENT > (expandable)
└── Dynamic licensing (connected + disconnected/air-gapped modes)
ADMINISTRATION > (expandable)
├── User Management / Access Control (roles, auth sources)
├── Global Settings (environment-wide config)
├── Maintenance Schedules (planned maintenance windows)
├── Outbound Settings (external notification channels)
├── Policies (operational policy management)
└── Metrics and Properties (metric/property definitions)
DEVELOPER CENTER > (expandable)
└── REST APIs, SDK docs, programmatic automation access
VCF HIERARCHY:
Fleet → Instance → Workload Domain → Cluster → Host → VM
WHERE OPS FITS (VCF 9.0 - CRITICAL CHANGE):
VCF Operations = THE unified management console (monitoring + lifecycle + fleet)
SDDC Manager UI = DEPRECATED in VCF 9.0 (workflows moved to VCF Operations)
vCenter / vSphere Client = compute management + some migrated SDDC Manager tasks
NSX Manager = network management plane
DEPLOYMENT MODELS:
Simple = single node + vSphere HA
High Availability = 3-node analytics cluster
Continuous Availability = dual fault domain
DELIVERABLE: Save the navigation map as VCF_Ops_Navigation_Map.md or .docx.
VALIDATION CHECK: Can you answer these without looking?
DAY 2 - INVENTORY & OBJECT MODEL
Time required: 2-3 hours
OBJECTIVE: Confidently trace object relationships (Cluster > Host > VM, etc.) using Inventory. Produce an object-relationship cheat sheet.
STEP-BY-STEP ACTIONS:
Log in at https://<ops-fqdn>/ui.
VCF OPERATIONS OBJECT RELATIONSHIP CHEAT SHEET
================================================
COMPUTE PATH (most common trace):
vCenter Server
└── Datacenter
└── Cluster
└── ESXi Host
└── Virtual Machine
└── Virtual Disk / vNIC
STORAGE PATH:
Datastore Cluster
└── Datastore (VMFS / vSAN / NFS)
└── VM Files (.vmx, .vmdk)
NETWORK PATH:
Distributed Virtual Switch (DVS)
└── Distributed Port Group
└── VM vNIC
RELATIONSHIP TYPES:
- Parent: the object this object belongs to (Host's parent = Cluster)
- Child: objects contained within (Cluster's children = Hosts)
- Related: associated objects (Datastore relates to Hosts that mount it)
HOW TO TRACE IN THE UI:
1. Use the search bar at top of VCF Operations to find any object by name
2. Or navigate via Launchpad > Monitoring Accounts > drill into vCenters/clusters
3. Click any object to open its detail view
4. Click "Relationships" tab in right panel
5. Click parent/child links to navigate up/down the tree
KEY COUNTS TO KNOW FOR YOUR ENVIRONMENT:
- vCenter servers: ___
- Datacenters: ___
- Clusters: ___
- Hosts: ___
- VMs: ___
- Datastores: ___
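The same relationships are exposed through the REST API you will meet on Day 22. A minimal Python sketch of the trace, assuming the suite-api /resources and /relationships endpoints and the resourceList / resourceKey response fields (verify the names in your Swagger UI); the host and object name are placeholders:
#!/usr/bin/env python3
"""Sketch: trace an object's relationships via the VCF Operations suite-api."""
import os
import requests
OPS_HOST = os.environ.get("VCF_OPS_HOST", "your-ops.example.com")  # placeholder
# Token acquisition is covered on Day 23; here it is read from the environment.
HEADERS = {"Authorization": f"OpsToken {os.environ.get('VCF_OPS_TOKEN', '')}",
           "Accept": "application/json"}
def find_resource(name):
    """Look up an inventory object by name; return the first match."""
    resp = requests.get(f"https://{OPS_HOST}/suite-api/api/resources",
                        headers=HEADERS, params={"name": name}, verify=False)
    resp.raise_for_status()
    return resp.json()["resourceList"][0]
def list_relationships(resource_id):
    """Return objects related to the given one (parents, children, peers)."""
    resp = requests.get(f"https://{OPS_HOST}/suite-api/api/resources/{resource_id}/relationships",
                        headers=HEADERS, verify=False)
    resp.raise_for_status()
    return resp.json().get("resourceList", [])
host = find_resource("esxi-01.lab.local")  # hypothetical object name
for rel in list_relationships(host["identifier"]):
    key = rel["resourceKey"]
    print(f'{key["resourceKindKey"]:30} {key["name"]}')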
VALIDATION CHECK:
DAY 3 - HEALTH & DIAGNOSTICS WORKFLOW
Time required: 2-3 hours
OBJECTIVE: Run a repeatable "first 15 minutes of triage" using health/diagnostic findings. Produce a triage runbook v1.
STEP-BY-STEP ACTIONS:
VCF OPERATIONS TRIAGE RUNBOOK - FIRST 15 MINUTES
==================================================
MINUTE 0-2: ASSESS SCOPE
Action: Go to Home page
Look at: Summary cards - which domains show non-green?
Question to answer: "Is this isolated to one object or widespread?"
What to write down: Number of critical/immediate alerts
MINUTE 2-5: IDENTIFY TOP ALERTS
Action: Go to Infrastructure Operations > Alerts
Filter: Status=Active, Severity=Critical, then Immediate
Sort by: Time (newest first)
For each critical alert, record:
- Alert name
- Affected object (name + type)
- Time triggered
- Recommendation text (copy/paste it)
MINUTE 5-8: TRACE THE AFFECTED OBJECT
Action: Click the affected object in the alert
Go to: Relationships tab
Question: "What else depends on this object?"
- If it is a Host: What VMs run on it? What cluster is it in?
- If it is a Datastore: What VMs use it? What hosts mount it?
- If it is a VM: What host is it on? Is the host healthy?
MINUTE 8-12: CHECK METRICS
Action: On the affected object, click Metrics tab
Look at: CPU usage, Memory usage, Disk latency, Network throughput
Compare to: Normal baseline (is this a spike or sustained?)
Time range: Set to Last 24 hours, then Last 7 days for trend
MINUTE 12-15: DOCUMENT & ESCALATE
Write in your incident notes:
- What: [alert name]
- Where: [object name and path]
- Impact: [what is affected downstream - VMs, users]
- Probable cause: [from Recommendations + your trace]
- Next step: [fix it yourself / escalate / monitor]
ESCALATION CRITERIA:
- Critical alert on Management Domain cluster/host = ESCALATE IMMEDIATELY
- Critical alert affecting >10 VMs = ESCALATE
- Capacity below 10% on any cluster = ESCALATE
- Alert present >1 hour with no auto-resolution = INVESTIGATE
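These criteria are easy to automate once you know the API (Day 22). A minimal sketch, assuming the /api/alerts endpoint accepts an activeOnly query parameter and returns alertLevel and startTimeUTC (epoch milliseconds) fields - verify all of these in Swagger before relying on it:
#!/usr/bin/env python3
"""Sketch: flag active alerts that meet the escalation criteria above."""
import os
import time
import requests
OPS_HOST = os.environ.get("VCF_OPS_HOST", "your-ops.example.com")  # placeholder
HEADERS = {"Authorization": f"OpsToken {os.environ.get('VCF_OPS_TOKEN', '')}",
           "Accept": "application/json"}
resp = requests.get(f"https://{OPS_HOST}/suite-api/api/alerts",
                    headers=HEADERS, params={"activeOnly": "true"}, verify=False)
resp.raise_for_status()
now_ms = time.time() * 1000
for alert in resp.json().get("alerts", []):
    name = alert.get("alertDefinitionName", "unknown alert")
    age_min = (now_ms - alert.get("startTimeUTC", now_ms)) / 60000
    if alert.get("alertLevel") == "CRITICAL":
        print(f"ESCALATE     {name} (critical)")
    elif age_min > 60:
        print(f"INVESTIGATE  {name} (open {age_min:.0f} min, no auto-resolution)")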
VALIDATION CHECK:
DAY 4 - DASHBOARDS 101 (CREATE, WIDGETS/VIEWS, SAVE)
Time required: 2-3 hours
OBJECTIVE: Create a dashboard from scratch, add widgets, save it. Produce "My First Ops Dashboard" with screenshot evidence.
STEP-BY-STEP ACTIONS:
Step 4a - Name your dashboard:
Name: Daily Health Overview - [Your Name] (example: Daily Health Overview - Michael Hayes)
Step 4b - Add your first widget (Health Status):
Title: Cluster Health Status | Object type: Cluster Compute Resource | Metric: Badge | Health
Step 4c - Add a second widget (Top VMs by CPU):
Title: Top 10 VMs by CPU Usage | Object type: Virtual Machine | Metric: CPU | Usage (%) | Top-N: 10
Step 4d - Add a third widget (Alert List):
Title: Active Critical Alerts | Severity filter: Critical and Immediate
Step 4e - Add a fourth widget (Capacity Overview):
Title: Cluster Capacity Remaining | Object type: Cluster Compute Resource | Metric: Badge | Capacity Remaining
Save the dashboard, then capture it with Windows + Shift + S (Windows Snipping Tool) and save the screenshot as My_First_Ops_Dashboard.png.
DELIVERABLE: Your saved dashboard named "Daily Health Overview - [Your Name]" + the screenshot file.
VALIDATION CHECK:
DAY 5 - WIDGET INTERACTIONS (MAKE DASHBOARDS DYNAMIC)
Time required: 2-3 hours
OBJECTIVE: Configure widget interactions so clicking one widget updates another. Produce a dashboard that filters dynamically.
STEP-BY-STEP ACTIONS:
In edit mode, open the interactions configuration for each target widget:
- Target: Top 10 VMs by CPU Usage → Source input: Selected Object (meaning: pass the selected cluster to the target widget)
- Target: Active Critical Alerts → Source input: Selected Object
DELIVERABLE: Your "Daily Health Overview" dashboard now updates dynamically when you click a cluster. Screenshot the before/after (click a cluster, show how other widgets change).
VALIDATION CHECK:
DAY 6 - DASHBOARD OPERATIONS (CLONE, EDIT, MANAGE, FAVORITES/RECENTS)
Time required: 2 hours
OBJECTIVE: Manage dashboards at scale. Clone, edit, organize with favorites/recents. Produce a "Daily Driver" dashboard pinned.
STEP-BY-STEP ACTIONS:
Clone your dashboard and rename the clone NOC - VM Performance Overview; edit it to add a widget such as CPU Trend - Last 7 Days.
DELIVERABLE: "Daily Driver" dashboard is favorited, shows in Favorites list, and you have a cloned "NOC" dashboard.
VALIDATION CHECK:
DAY 7 - CHECKPOINT #1 (SKILLS TEST + MINI INTERVIEW)
Time required: 2-3 hours
WHAT YOU MUST DO:
Part 1 - Skills Demo (45 minutes):
Save your demo evidence (screenshots or recording) as Checkpoint 1 - [Your Name].
Part 2 - Mini Interview (30 minutes):
Answer these questions out loud (record yourself or have someone ask you):
Q1: "What is VCF Operations and where does it fit in the VCF stack?"
EXACT ANSWER: "VCF Operations is the unified management console in VMware Cloud Foundation 9. In VCF 9.0, the SDDC Manager UI was deprecated, and its workflows for lifecycle management, fleet management, certificates, passwords, and configuration management all moved into VCF Operations. So it is now the single pane of glass for everything: health monitoring, performance analytics, capacity planning, compliance, dashboards, log analysis, and lifecycle operations. The left nav has Infrastructure Operations with items like Diagnostic Findings, VCF Health, Dashboards & Reports, Alerts, Troubleshooting Workbench, Analyze for logs, Storage Operations, and Network Operations. Then expandable sections for Workload Operations, Fleet Management, Capacity, Security, License Management, Administration, and Developer Center. The object hierarchy goes Fleet, Instance, Workload Domain, Cluster, Host, VM."
Q2: "Walk me through a dashboard you built."
EXACT ANSWER: "I built a Daily Health Overview dashboard for our operations team. It has four widgets: a cluster health scoreboard that shows green/yellow/red status for each cluster, a Top-10 VMs by CPU chart to spot resource hogs, an active critical alerts list filtered to only Critical and Immediate severity, and a cluster capacity remaining scoreboard. The key feature is widget interactions - when I click a specific cluster in the health scoreboard, the VM list and alert list automatically filter to show only data for that cluster. This lets me go from 'something is wrong in cluster X' to 'here are the specific VMs and alerts' in one click."
Q3: "How do you triage an issue in VCF Operations?"
EXACT ANSWER: "I follow a 15-minute triage runbook. Minutes 0-2: I check VCF Health under Infrastructure Operations and the Launchpad summary to assess scope - is this one object or widespread? I also check Diagnostic Findings which automatically correlates issues against 107 known-issue signatures from Broadcom Support. Minutes 2-5: I go to Infrastructure Operations > Alerts, filter to Active Critical, sort by newest, and record the alert name, affected object, and recommendations for each. Minutes 5-8: I click into the affected object, go to Relationships to understand blast radius - what depends on this object? If a host is down, I need to know which VMs are on it. Minutes 8-12: I check the Metrics tab to see if this is a spike or sustained issue, looking at CPU, memory, disk latency, and network. If log evidence is needed, I use Infrastructure Operations > Analyze to search logs. Minutes 12-15: I document findings and either fix it or escalate based on severity and impact."
Q4: "What are widget interactions and why do they matter?"
EXACT ANSWER: "Widget interactions connect widgets on a dashboard so selecting an object in one widget filters or updates other widgets. For example, clicking a cluster in a health scoreboard automatically shows only that cluster's VMs in the Top-N widget and only that cluster's alerts in the alert list. They matter because they turn static wallpaper dashboards into dynamic investigation tools. Instead of maintaining 20 single-purpose dashboards, I build one interactive dashboard that lets me drill down on the fly during triage."
SCORING (be honest with yourself):
WEEK 2: DAYS 8-14 (Operations Depth: Alerts, Capacity, Compliance, Drift, Exec Views)
DAY 8 - ALERTS: WHAT MATTERS VS NOISE
Time required: 2-3 hours
OBJECTIVE: Define alert hygiene. Build an alert taxonomy. Know what to action vs ignore.
STEP-BY-STEP ACTIONS:
ALERT TAXONOMY TEMPLATE
========================
P1 - CRITICAL (respond in <15 minutes)
Definition: Service-impacting, production down, data at risk
Examples:
- Host not responding
- Datastore out of space (<5% remaining)
- vCenter service down
- Management domain cluster health = Red
Action: Page on-call, begin triage immediately, update status page
Owner: On-call engineer
MTTR target: <1 hour
P2 - IMMEDIATE (respond in <1 hour)
Definition: Significant degradation, not yet service-impacting
Examples:
- Cluster CPU usage >90% sustained 30 min
- Memory contention on >3 hosts in cluster
- Certificate expiring in <7 days
- Capacity remaining <20% on any cluster
Action: Assign to team lead, begin investigation, schedule fix
Owner: Ops team lead
MTTR target: <4 hours
P3 - WARNING (respond within business day)
Definition: Non-urgent, could become P2 if ignored
Examples:
- Single VM high CPU (not business-critical)
- Capacity remaining <40% (trending toward limit)
- Configuration drift detected on non-production cluster
- NTP skew >1 second
Action: Add to daily standup queue, investigate during business hours
Owner: Assigned ops engineer
MTTR target: <24 hours
P4 - INFORMATIONAL (review weekly)
Definition: FYI, no action required unless pattern emerges
Examples:
- VM powered off/on
- Snapshot older than 7 days
- Informational health status changes
Action: Review in weekly ops meeting, batch-resolve
Owner: Anyone
MTTR target: N/A (review cadence)
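When you automate alert handling in Week 4, this taxonomy becomes a lookup table. A minimal Python sketch, assuming the criticality names match what the Alerts UI displays (Critical, Immediate, Warning, Information):
SEVERITY_MAP = {
    "CRITICAL": "P1",
    "IMMEDIATE": "P2",
    "WARNING": "P3",
    "INFORMATION": "P4",
}
def triage_priority(alert_level: str) -> str:
    """Translate an alert's criticality into a P1-P4 queue; default to P4."""
    return SEVERITY_MAP.get(alert_level.upper(), "P4")
assert triage_priority("critical") == "P1"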
DELIVERABLE: Your completed P1-P4 Alert Taxonomy Template document.
VALIDATION CHECK:
DAY 9 - CAPACITY & PERFORMANCE STORY
Time required: 2-3 hours
OBJECTIVE: Explain "capacity risk vs performance risk" and build a weekly capacity report outline.
STEP-BY-STEP ACTIONS:
WEEKLY CAPACITY REPORT - [DATE RANGE]
=======================================
EXECUTIVE SUMMARY:
[1-2 sentences: Are we healthy? Any clusters at risk? Key action needed?]
Example: "All production clusters are healthy with >30% capacity remaining.
Dev cluster 2 will exhaust CPU in approximately 45 days at current growth
rate. Recommend right-sizing 12 oversized VMs to reclaim capacity."
SECTION 1: CAPACITY STATUS BY CLUSTER
| Cluster Name | CPU Remaining % | Memory Remaining % | Storage Remaining % | Days to Exhaustion (CPU) |
|---|---|---|---|---|
| Prod-Cluster-01 | 45% | 52% | 61% | 120 days |
| Prod-Cluster-02 | 38% | 41% | 55% | 90 days |
| Dev-Cluster-01 | 22% | 30% | 40% | 45 days |
SECTION 2: TOP CAPACITY RISKS
1. [Cluster name] - [Which resource] - [Days remaining] - [Recommended action]
2. ...
SECTION 3: RECLAIMABLE CAPACITY
- Oversized VMs (CPU allocated > used): [count] VMs, [amount] reclaimable
- Idle VMs (powered on, no activity >7 days): [count] VMs
- Old snapshots (>7 days): [count] snapshots, [size] GB
SECTION 4: PERFORMANCE HIGHLIGHTS (LAST 7 DAYS)
- Peak CPU usage event: [cluster] hit [%] on [date/time]
- Peak memory usage event: [cluster] hit [%] on [date/time]
- Any disk latency spikes: [yes/no, details]
SECTION 5: RECOMMENDATIONS
1. [Right-size these VMs]
2. [Delete these snapshots]
3. [Plan hardware procurement for cluster X by date Y]
NEXT REVIEW: [Date]
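The "Days to Exhaustion" column assumes a linear growth model. A worked sketch of the arithmetic (the numbers are illustrative, not from a real environment):
def days_to_exhaustion(used_pct_then, used_pct_now, days_between, limit_pct=100.0):
    """Extrapolate linearly; return None if usage is flat or shrinking."""
    growth_per_day = (used_pct_now - used_pct_then) / days_between
    if growth_per_day <= 0:
        return None  # no exhaustion on the current trend
    return (limit_pct - used_pct_now) / growth_per_day
# Dev cluster: 70% used 30 days ago, 78% used today -> roughly 82 days left
print(days_to_exhaustion(70.0, 78.0, 30))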
VALIDATION CHECK:
DAY 10 - COMPLIANCE & CONFIGURATION DRIFT
Time required: 2-3 hours
OBJECTIVE: Talk through drift, why it matters, and what should be monitored. Build a drift watchlist.
STEP-BY-STEP ACTIONS:
CONFIGURATION DRIFT WATCHLIST
================================
These are the 10 settings I monitor for unauthorized changes.
Any change to these without a change ticket = investigate immediately.
1. ESXi SSH Service Status
Baseline: DISABLED on all hosts
Why: SSH enabled = potential attack vector + audit finding
Check: Security > Compliance or Fleet Management > Configuration Management
2. ESXi NTP Configuration
Baseline: NTP configured to [your NTP servers], service running
Why: Time skew breaks log correlation, certificate validation, and vSAN
Check: Security > Compliance or esxcli system ntp get
3. ESXi Syslog Forwarding
Baseline: Forwarding to [your syslog server/Operations-Logs]
Why: Without log forwarding, you lose forensic evidence
Check: esxcli system syslog config get
4. ESXi Lockdown Mode
Baseline: NORMAL lockdown mode enabled
Why: Prevents direct host access bypassing vCenter
Check: Security > Compliance
5. vCenter SSO Password Policy
Baseline: Minimum 12 chars, complexity enabled, lockout after 5 attempts
Why: Weak passwords = brute force risk
Check: vCenter Administration > SSO > Configuration
6. ESXi Firewall Rules
Baseline: Only required services allowed (no wildcard "all" rules)
Why: Open firewalls = lateral movement risk
Check: esxcli network firewall ruleset list
7. VM Hardware Version
Baseline: Minimum version 19 (vSphere 7+) or 21 (vSphere 8+)
Why: Old hardware versions miss security features
Check: VM properties or Security > Compliance
8. Datastore Access Permissions
Baseline: Only authorized hosts mount each datastore
Why: Unauthorized access = data exposure risk
Check: Datastore > Hosts tab
9. Distributed Switch Configuration
Baseline: Promiscuous mode DISABLED, MAC changes DISABLED, Forged transmits DISABLED
Why: These settings enabled = network sniffing possible
Check: DVS > Security settings
10. Certificate Validity
Baseline: All certs valid, expiry >30 days
Why: Expired certs = service outages + security warnings
Check: VCF Operations Alerts or Fleet Management > Certificate Management
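However you collect the current values (esxcli, PowerCLI, or the Ops API), the drift comparison itself is trivial. A minimal Python sketch with a hypothetical baseline dict:
BASELINE = {
    "ssh_service": "disabled",
    "lockdown_mode": "normal",
    "ntp_running": True,
}
def find_drift(current: dict) -> list:
    """Return (setting, expected, actual) for anything off-baseline."""
    return [(k, v, current.get(k)) for k, v in BASELINE.items() if current.get(k) != v]
# Example: a host where SSH was left enabled
print(find_drift({"ssh_service": "enabled", "lockdown_mode": "normal", "ntp_running": True}))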
VALIDATION CHECK:
DAY 11 - ROLE-BASED DASHBOARDS
Time required: 3 hours
OBJECTIVE: Build one dashboard per audience (NOC / VMware Ops / Exec). Produce 3 dashboard wireframes.
STEP-BY-STEP ACTIONS:
NOC (Network Operations Center):
VMware Ops (your team):
Executive:
Name them: NOC - Real-Time Status, OPS - Daily Operations View, and EXEC - Weekly Infrastructure Summary.
DELIVERABLE: 3 dashboard wireframes (NOC, Ops, Exec) either built in the UI or documented as design specs.
VALIDATION CHECK:
DAY 12 - INCIDENT DAY #1 (TIMED)
Time required: 2 hours
OBJECTIVE: Simulate a real incident, triage it under time pressure, document findings.
SIMULATION SCENARIO: "Multiple users report application slowness at 9:15 AM."
SET A TIMER FOR 30 MINUTES. START NOW.
STEP-BY-STEP TRIAGE:
Minutes 0-2: Scope Assessment
Minutes 2-5: Alert Investigation
Minutes 5-10: Object Investigation
Minutes 10-15: Performance Check
Minutes 15-20: Root Cause Hypothesis
"I believe the slowness is caused by [specific finding] on [specific object], which affects [N VMs / services]."
Minutes 20-25: Remediation Plan
Minutes 25-30: Documentation
INCIDENT NOTES - [DATE]
========================
Reported: 9:15 AM - Users report application slowness
Triage started: [time]
Triage completed: [time]
FINDINGS:
- Alert(s): [list]
- Affected object(s): [list]
- Root cause hypothesis: [statement]
- Impact: [N VMs, N users, N services]
REMEDIATION:
- Immediate: [what you did/would do]
- Short-term: [plan]
- Long-term: [automation opportunity]
WHAT I WOULD AUTOMATE:
- Auto-detect this condition via alert threshold
- Auto-notify via webhook/email (see the sketch after this list)
- Auto-document in ticket system
- Script to [specific action]
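The webhook idea is a few lines of Python. A minimal sketch - the URL and payload shape are placeholders to adapt to your chat or ticketing tool:
#!/usr/bin/env python3
"""Sketch: post a one-line incident summary to a team webhook."""
import os
import requests
WEBHOOK_URL = os.environ.get("OPS_WEBHOOK_URL", "https://chat.example.com/hooks/ops")  # placeholder
def notify(alert_name, obj, impact):
    """Send the triage summary to the team channel."""
    payload = {"text": f"[TRIAGE] {alert_name} on {obj} - impact: {impact}"}
    requests.post(WEBHOOK_URL, json=payload, timeout=10).raise_for_status()
notify("CPU Contention", "Cluster-Prod-01", "~200 users, app latency")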
STOP TIMER.
DELIVERABLE: Completed incident notes document.
DAY 13 - OPS STORYTELLING
Time required: 2 hours
OBJECTIVE: Convert raw metrics into narrative. Build an RCA template.
STEP-BY-STEP ACTIONS:
Every ops narrative has four parts:
Raw data: "Cluster-Prod-01 CPU hit 95% at 14:32, alert fired at 14:37, resolved at 15:15 after vMotion of 8 VMs."
Narrative: "On Tuesday at 2:32 PM, production cluster 01 experienced CPU saturation reaching 95%, triggered by a batch job on 3 oversized VMs. The automated alert fired at 2:37 PM and was acknowledged by the on-call engineer at 2:40 PM. The engineer identified the root cause as 3 VMs running unscheduled batch jobs that consumed 45% of cluster CPU. Eight VMs were migrated to cluster 02 via vMotion to relieve pressure, resolving the issue at 3:15 PM. Total user impact: approximately 200 users experienced application latency for 43 minutes. Prevention: the batch VMs have been moved to a dedicated resource pool with CPU limits, and a capacity alert has been set at 85% to provide earlier warning."
ROOT CAUSE ANALYSIS (RCA) TEMPLATE
====================================
INCIDENT ID: [INC-YYYY-NNNN]
DATE: [date]
SEVERITY: [P1/P2/P3]
DURATION: [start time] to [end time] ([total minutes])
AUTHOR: [your name]
1. EXECUTIVE SUMMARY (2-3 sentences)
[What happened, who was affected, how it was resolved.]
Example: "Production cluster CPU saturation caused application latency
for approximately 200 users over 43 minutes. Root cause was unscheduled
batch jobs on oversized VMs. Resolved by VM migration and resource pool
isolation."
2. TIMELINE
| Time | Event |
|---|---|
| 14:32 | CPU usage on Cluster-Prod-01 exceeds 90% |
| 14:37 | VCF Operations alert fires (Critical: CPU Contention) |
| 14:40 | On-call engineer acknowledges alert |
| 14:45 | Engineer identifies 3 VMs running batch jobs |
| 14:50 | Decision: vMotion 8 VMs to Cluster-Prod-02 |
| 15:10 | vMotion complete, CPU drops to 62% |
| 15:15 | Monitoring confirms normal performance, incident closed |
3. ROOT CAUSE
[Specific technical cause]
Example: "Three VMs (batch-proc-01, batch-proc-02, batch-proc-03) were
configured with 16 vCPU each and ran unscheduled data processing jobs
simultaneously, consuming 45% of cluster CPU capacity."
4. IMPACT
- Users affected: [number]
- Services affected: [list]
- Duration of impact: [minutes]
- Business impact: [description - revenue loss, SLA breach, etc.]
5. RESOLUTION
- Immediate: [what was done to stop the bleeding]
- Technical: [specific commands, actions, changes made]
6. PREVENTION (5 WHYS)
Why did CPU saturate? → 3 VMs ran batch jobs
Why were batch jobs running? → No scheduling control
Why no scheduling control? → VMs not in a resource pool with limits
Why no resource pool? → Batch workloads not identified in capacity planning
Why not identified? → No workload classification process
CORRECTIVE ACTIONS:
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Move batch VMs to dedicated resource pool | [name] | [date] | |
| Set CPU limit on batch resource pool | [name] | [date] | |
| Add 85% CPU alert threshold to all prod clusters | [name] | [date] | |
| Create workload classification process | [name] | [date] | |
| Schedule batch jobs for off-peak hours | [name] | [date] | |
7. LESSONS LEARNED
- [What worked well in the response]
- [What could be improved]
- [What we will automate]
VALIDATION CHECK:
DAY 14 - CHECKPOINT #2
Time required: 2 hours
YOU MUST PRESENT: A weekly ops briefing using your dashboards (5 minutes).
EXACT SCRIPT FOR YOUR 5-MINUTE BRIEFING:
Stand up (even if alone). Open your dashboards. Speak this out loud:
[Minute 0:00-1:00] - Opening + Overall Health
"Good morning. This is the weekly infrastructure operations briefing for the week of [date range]. Overall health status: [green/yellow/red]. We have [N] active critical alerts and [N] immediate alerts. No P1 incidents this week. [or: We had one P1 incident on [day], which I will cover in a moment.]"
[Minute 1:00-2:00] - Capacity Status
"Switching to capacity. All production clusters are above 30% remaining, which is our comfort threshold. Dev cluster 2 is trending toward exhaustion in [N] days. I have submitted a right-sizing request for 12 oversized VMs that would reclaim approximately [N] GB of memory. Reclaimable storage from old snapshots: [N] GB."
[Minute 2:00-3:00] - Performance + Incidents
"Performance highlights: We saw a CPU spike on Prod Cluster 01 on Tuesday at 2:32 PM. Root cause was unscheduled batch jobs. Resolution took 43 minutes. Full RCA is documented and corrective actions are in progress. No other performance events exceeded thresholds this week."
[Minute 3:00-4:00] - Compliance + Risks
"Compliance status: [N] hosts are fully compliant. [N] hosts have findings: [list top findings, e.g., SSH enabled on 2 hosts, NTP drift on 1 host]. Remediation is scheduled for [date]. Top risk this week: Dev cluster 2 capacity. Second risk: certificate expiring on [component] in [N] days."
[Minute 4:00-5:00] - Actions + Questions
"Action items for this week: One, right-size 12 VMs on Dev cluster 2. Two, remediate SSH findings on 2 hosts. Three, renew certificate for [component]. That concludes the briefing. Questions?"
WEEK 3: DAYS 15-21 (Logs Mastery: VCF Operations for Logs + Analyze + Dashboards/Alerts/Queries)
IMPORTANT VCF 9.x CONTEXT: In VCF 9.0, log analysis is integrated into VCF Operations under Infrastructure Operations > Analyze. The Analyze section provides log search, saved queries, extracted fields, event types, trends, and side-by-side query comparison. This requires the VCF Operations for Logs component to be deployed. Log data is standardized to RFC 5424 format. You can migrate data from legacy Aria Operations for Logs 8.x (up to 90 days).
DAY 15 - WHAT LOGS ARE FOR (TROUBLESHOOTING VS AUDITING)
Time required: 2 hours
OBJECTIVE: Define log use cases and required log sources.
STEP-BY-STEP ACTIONS:
Troubleshooting logs (real-time, reactive):
Audit logs (historical, proactive):
LOG SOURCE INVENTORY CHECKLIST
================================
SOURCE 1: ESXi Hosts
Log files:
- /var/log/vmkernel.log (kernel messages, storage, network)
- /var/log/hostd.log (host agent, VM operations)
- /var/log/vpxa.log (vCenter agent on host)
- /var/log/vobd.log (VMware Observation Broker - events)
- /var/log/auth.log (authentication, SSH access)
- /var/log/shell.log (ESXi Shell commands executed)
Forwarding to Operations-Logs: YES/NO
Forwarding method: syslog (UDP 514 or TCP 514 or TCP 1514)
SOURCE 2: vCenter Server
Log files:
- /var/log/vmware/vpxd/vpxd.log (core vCenter service)
- /var/log/vmware/vpxd/vpxd-alert.log (critical alerts)
- /var/log/vmware/sso/ssoAdminServer.log (SSO/authentication)
- /var/log/vmware/content-library/ (content library)
Forwarding to Operations-Logs: YES/NO
Events database: Task & Event Manager (retained in DB)
SOURCE 3: NSX Manager
Log files:
- /var/log/syslog (general NSX operations)
- /var/log/proton/nsxapi.log (API calls)
- /var/log/corfu/corfu.log (distributed datastore)
Audit events: NSX audit log (rule changes, policy updates)
Forwarding to Operations-Logs: YES/NO
SOURCE 4: SDDC Manager
Log files:
- /var/log/vmware/vcf/sddc-manager/ (lifecycle operations)
- /var/log/vmware/vcf/domainmanager/ (domain operations)
- /var/log/vmware/vcf/operationsmanager/ (ops)
Forwarding to Operations-Logs: YES/NO
SOURCE 5: VCF Operations (itself)
Log files:
- /var/log/vmware/vcops/ (analytics engine)
- Collector logs
Forwarding: Self-ingestion or separate log target
FORWARDING PROTOCOL SETTINGS:
Protocol: TCP (recommended) or UDP
Port: 514 (standard) or 1514 (non-standard)
Format: RFC 5424 (preferred) or RFC 3164
TLS: YES (if supported)
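To prove the ingestion path end-to-end, you can hand-build one RFC 5424 message and send it over TCP. A minimal Python sketch; the collector host is a placeholder:
#!/usr/bin/env python3
"""Sketch: send one RFC 5424 syslog message to verify ingestion."""
import socket
from datetime import datetime, timezone
LOG_HOST = "ops-logs.yourdomain.com"  # placeholder
LOG_PORT = 514
ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# <134> = facility local0 (16) * 8 + severity info (6)
msg = f"<134>1 {ts} test-host vcf-lab - - - ingestion smoke test\n"
with socket.create_connection((LOG_HOST, LOG_PORT), timeout=5) as sock:
    sock.sendall(msg.encode("utf-8"))
print("Sent - search for 'ingestion smoke test' in the real-time view.")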
DELIVERABLE: Completed log source inventory checklist for your environment.
DAY 16 - DEPLOY VCF OPERATIONS FOR LOGS (OPERATIONS-LOGS)
Time required: 3-4 hours
OBJECTIVE: Follow fleet-managed deployment workflow. Understand all required inputs.
STEP-BY-STEP ACTIONS:
Log in at https://<ops-fqdn>/ui and launch the fleet-managed Operations for Logs deployment. Required inputs:
- FQDN: ops-logs.yourdomain.com (must match DNS record you created)
- IP address: 10.0.0.50 (your allocated static IP)
- Netmask: 255.255.255.0
- Gateway: 10.0.0.1
- DNS server: 10.0.0.10 (your DNS server)
- NTP server: 10.0.0.11 (your NTP server)
After deployment, verify the UI at https://<ops-logs-fqdn>, then point ESXi hosts at the appliance via the advanced setting Syslog.global.logHost = tcp://<ops-logs-fqdn>:514
DELIVERABLE: Deployment validation checklist (all items checked):
DEPLOYMENT VALIDATION CHECKLIST
================================
[ ] DNS forward record resolves: nslookup ops-logs.yourdomain.com → correct IP
[ ] DNS reverse record resolves: nslookup <IP> → ops-logs.yourdomain.com
[ ] NTP synchronized: appliance time matches reference
[ ] UI accessible: https://<ops-logs-fqdn> loads login page
[ ] Admin login works: can log in with configured credentials
[ ] Log sources visible: at least one source showing in configuration
[ ] Logs flowing: new log entries appearing in real-time view
[ ] Certificate valid: browser shows valid certificate (no warnings)
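The DNS and certificate items lend themselves to a quick script. A minimal Python sketch of those checks; the FQDN and IP are the placeholders from above:
#!/usr/bin/env python3
"""Sketch: automate the DNS and TLS items of the validation checklist."""
import socket
import ssl
FQDN = "ops-logs.yourdomain.com"  # placeholder
EXPECTED_IP = "10.0.0.50"         # placeholder
ip = socket.gethostbyname(FQDN)
print(f"forward DNS: {FQDN} -> {ip}  {'OK' if ip == EXPECTED_IP else 'MISMATCH'}")
rname, _, _ = socket.gethostbyaddr(EXPECTED_IP)  # raises socket.herror if no PTR record
print(f"reverse DNS: {EXPECTED_IP} -> {rname}")
ctx = ssl.create_default_context()
try:
    with ctx.wrap_socket(socket.create_connection((FQDN, 443), timeout=5),
                         server_hostname=FQDN) as s:
        print(f"TLS: valid certificate ({s.version()})")
except ssl.SSLCertVerificationError as e:
    print(f"TLS: certificate problem - {e.verify_message}")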
DAY 17 - CONTENT PACKS & PREBUILT DASHBOARDS/ALERTS/QUERIES
Time required: 2 hours
OBJECTIVE: Use prebuilt content to accelerate operations. Identify top 10 dashboards to pin.
STEP-BY-STEP ACTIONS:
TOP 10 LOG DASHBOARDS TO PIN
==============================
1. ESXi Overview
What it shows: All ESXi host log events, error rates, warning trends
Why pin it: First stop for host troubleshooting
2. vCenter Events
What it shows: vCenter tasks and events, who did what
Why pin it: Audit trail and change tracking
3. Authentication Failures
What it shows: Failed login attempts across all components
Why pin it: Security monitoring, brute force detection
4. ESXi Error Trends
What it shows: Error log volume over time, top error types
Why pin it: Spot increasing failure patterns before outages
5. NSX Firewall Rule Changes
What it shows: When firewall rules were added/modified/deleted
Why pin it: Security audit, unauthorized change detection
6. Storage Errors
What it shows: SCSI errors, path failures, vSAN issues
Why pin it: Early warning for storage failures
7. VM Power Events
What it shows: VM power on/off/reset/vMotion events
Why pin it: Track unexpected VM restarts
8. Certificate Warnings
What it shows: Certificate expiration warnings from all components
Why pin it: Prevent outages from expired certificates
9. SDDC Manager Operations
What it shows: Lifecycle operations, deployment tasks, update progress
Why pin it: Track VCF lifecycle health
10. Syslog Ingestion Health
What it shows: Log ingestion rate, dropped logs, source connectivity
Why pin it: Ensure your logging infrastructure itself is healthy
DAY 18 - CREATE LOG DASHBOARDS (OPERATIONAL VIEWS)
Time required: 3 hours
OBJECTIVE: Build custom log dashboards for operational use.
STEP-BY-STEP ACTIONS:
Build three dashboards: NOC-Logs - Error Trends, SEC-Logs - Auth Failures, and OPS-Logs - vCenter Alarms. Specify the main board as follows:
LOGS NOC BOARD DASHBOARD SPECIFICATION
========================================
Dashboard Name: NOC-Logs - Operations Overview
Purpose: Single pane for NOC operator to monitor log health
Layout:
Row 1: Error Trend (time series, 24h) | Log Ingestion Rate (time series, 24h)
Row 2: Auth Failures (last 4h, table) | Top Error Messages (last 4h, table)
Row 3: Recent Critical Events (table, last 1h, newest first) — full width
Filters available: Time range selector, Source filter (ESXi/vCenter/NSX/SDDC)
Auto-refresh: Every 60 seconds
DAY 19 - LOG-BASED ALERT PHILOSOPHY
Time required: 2 hours
OBJECTIVE: Define "alert on symptom, not every event." Build alert playbook entries.
STEP-BY-STEP ACTIONS:
LOG-BASED ALERT PLAYBOOK
==========================
ALERT 1: Authentication Brute Force Suspected
Trigger: 5+ failed login attempts for the same username within 10 minutes (see the detection sketch after this playbook)
Severity: P2 - Immediate
Verify:
1. Open Auth Failures dashboard
2. Filter by the username from the alert
3. Check source IPs - is it one IP or many?
4. Check if the account eventually succeeded (compromised?)
Remediate:
- If one source IP: Block IP at NSX firewall, disable account temporarily
- If many source IPs: Disable account, reset password, notify user
- If account succeeded after failures: Treat as compromised, force password reset, review activity
Escalate to: Security team if account was compromised
ALERT 2: ESXi Host Error Rate Spike
Trigger: More than 100 ERROR-level log entries from a single ESXi host in 1 hour (baseline is <10/hour)
Severity: P3 - Warning (escalate to P2 if sustained >2 hours)
Verify:
1. Open Error Trends dashboard
2. Filter to the specific host
3. Read the actual error messages - are they storage, network, or hardware?
4. Check VCF Operations for correlated health alerts on this host
Remediate:
- Storage errors: Check HBA paths, datastore connectivity
- Network errors: Check vmnic status, DVS configuration
- Hardware errors: Check hardware health via iLO/iDRAC, open vendor case
Escalate to: Hardware team if physical component failure suspected
ALERT 3: vCenter Service Restart Detected
Trigger: Log entry matching "vpxd started" or "service restarted" outside maintenance window
Severity: P2 - Immediate
Verify:
1. Open vCenter Events dashboard
2. Check if a planned maintenance was scheduled (check change calendar)
3. Check if vpxd crashed (look for crash dump logs before the restart)
4. Check VCF Operations for vCenter health alerts
Remediate:
- If planned: Verify service is healthy post-restart, close alert
- If unplanned crash: Collect logs, check for known issues, open support case if recurring
Escalate to: VMware support if crash is recurring
ALERT 4: Log Ingestion Stopped
Trigger: Zero logs received from any configured source for >15 minutes
Severity: P2 - Immediate
Verify:
1. Open Ingestion Health dashboard
2. Identify which source(s) stopped sending
3. Ping the source - is it reachable?
4. SSH to source - is syslog service running?
Remediate:
- Source unreachable: Network issue or host down → check VCF Operations
- Syslog service stopped: Restart syslog service
- Operations-Logs appliance issue: Check appliance health, restart ingestion service
Escalate to: Infrastructure team if source is down
ALERT 5: NSX Firewall Rule Modified Outside Change Window
Trigger: Log entry matching firewall rule create/modify/delete AND current time is outside approved change window (e.g., outside Tue/Thu 6PM-10PM)
Severity: P1 - Critical
Verify:
1. Open NSX Firewall Rule Changes dashboard
2. Identify who made the change (username)
3. Identify what rule was changed
4. Check change management system for approved change ticket
Remediate:
- If authorized (emergency change with verbal approval): Document retroactively
- If unauthorized: Revert the change immediately, disable the user account, notify security team
Escalate to: Security team and change management immediately
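The Alert 1 trigger is a classic sliding-window count. A minimal Python sketch over parsed auth-failure events; the event shape (timestamp, username) is an assumption - map it to your extracted fields in Analyze:
from collections import defaultdict
WINDOW_SEC = 600   # 10 minutes
THRESHOLD = 5      # failures
def brute_force_suspects(events):
    """events: iterable of (epoch_seconds, username) auth failures, any order."""
    by_user = defaultdict(list)
    for ts, user in events:
        by_user[user].append(ts)
    suspects = []
    for user, times in by_user.items():
        times.sort()
        left = 0
        for right in range(len(times)):
            while times[right] - times[left] > WINDOW_SEC:
                left += 1
            if right - left + 1 >= THRESHOLD:
                suspects.append(user)
                break
    return suspects
# Five failures for one account inside five minutes -> flagged
print(brute_force_suspects([(t, "svc-backup") for t in range(0, 300, 60)]))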
DAY 20 - INCIDENT DAY #2 (TIMED)
Time required: 2 hours
SIMULATION: "It is 8:45 AM Monday. A user reports: 'I cannot log into vCenter. The page loads but after entering credentials, I get an error.'"
SET A 30-MINUTE TIMER.
Triage steps:
Minutes 0-3: Reproduce and Scope
Try to reproduce: log in at https://<vcenter-fqdn>/ui and note the exact error; check whether https://<vcenter-fqdn>/websso loads.
Minutes 3-8: Check VCF Operations
Minutes 8-15: Check Logs
Minutes 15-22: Root Cause Determination
Minutes 22-28: Resolution
On the vCenter appliance, check services with service-control --status --all; if SSO is the culprit, restart it with service-control --restart vmware-stsd
Minutes 28-30: Document
STOP TIMER.
DAY 21 - CHECKPOINT #3
YOU MUST DEMO: A log dashboard + show a query path that proves root cause.
EXACT DEMO SCRIPT (5 minutes):
"I am going to demonstrate how I use log dashboards to prove root cause for a vCenter login failure.
[Open Operations-Logs UI]
Step 1: I open my Auth Failures dashboard. Here I can see all failed authentication attempts across the environment. I filter to the last 2 hours and to vCenter as the source.
[Point to the data on screen]
Step 2: I see a cluster of failed attempts starting at 8:43 AM. The error message is [read the actual error]. This tells me the SSO service was returning token validation errors.
Step 3: I pivot to a broader search. I search for 'vmware-stsd' in the last 2 hours. I find a log entry at 8:41 AM showing the STS service restarted unexpectedly. This is 2 minutes before the user reported the issue.
Step 4: I search for what happened before the restart. I filter to 8:30-8:41 AM on the vCenter source. I find out-of-memory errors in the STS process at 8:39 AM.
Step 5: Root cause proven. The STS service ran out of memory, crashed, restarted, and during the restart window users could not authenticate. The fix is to increase the STS service memory allocation and monitor for memory growth.
That is the query path: Symptom (login failures) → Authentication logs → Service crash → Memory exhaustion. Each step is provable with log evidence."
WEEK 4: DAYS 22-30 (API + Automation + Capstone)
DAY 22 - VCF OPERATIONS API: SWAGGER & TOOLS
Time required: 2-3 hours
OBJECTIVE: Find Swagger UI and understand how to use it. Create an API quickstart page.
STEP-BY-STEP ACTIONS:
Open Swagger UI at https://<ops-fqdn>/suite-api/doc/swagger-ui.html (lab example: https://vrops.lab.local/suite-api/doc/swagger-ui.html). Find POST /api/auth/token/acquire and try it with this body:
{
"username": "admin",
"password": "your-password",
"authSource": "local"
}
Then call GET /api/alerts, passing the returned token in the Authorization: OpsToken header. Capture what you learned in a quickstart page:
VCF OPERATIONS API QUICKSTART
================================
SWAGGER UI URL:
https://<ops-fqdn>/suite-api/doc/swagger-ui.html
BASE API URL:
https://<ops-fqdn>/suite-api/api
AUTHENTICATION:
Endpoint: POST /api/auth/token/acquire
Body: {"username": "...", "password": "...", "authSource": "local"}
Response: {"token": "abc123...", "validity": 21600, ...}
Use token: Authorization: OpsToken abc123...
COMMON ENDPOINTS:
GET /api/alerts → List all alerts
GET /api/alerts/{alertId} → Get specific alert
GET /api/resources → List all resources (objects)
GET /api/resources/{resourceId} → Get specific resource
GET /api/resources/{id}/stats → Get metrics for a resource
GET /api/dashboards → List all dashboards
POST /api/reports → Generate a report
RESPONSE FORMAT: JSON (set Accept: application/json header)
TOOLS TO USE:
Browser: Swagger UI (learning + quick tests)
Postman: Collection-based testing + environment variables
cURL: Command-line scripting
Python: requests library for automation scripts
PowerShell: Invoke-RestMethod for operational scripts
RATE LIMITS / BEST PRACTICES:
- Tokens are valid for ~6 hours by default
- Always release tokens when done: POST /api/auth/token/release
- Use pagination for large result sets (pageSize, page parameters)
- Use filters to narrow results (avoid pulling all objects every time)
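The pagination advice in practice: walk /api/resources one page at a time instead of pulling everything at once. A minimal sketch; parameter and field names follow the quickstart above - confirm them in Swagger:
#!/usr/bin/env python3
"""Sketch: page through /api/resources using page/pageSize."""
import os
import requests
OPS_HOST = os.environ.get("VCF_OPS_HOST", "your-ops.example.com")  # placeholder
HEADERS = {"Authorization": f"OpsToken {os.environ.get('VCF_OPS_TOKEN', '')}",
           "Accept": "application/json"}
def iter_resources(page_size=100):
    """Yield resources one page at a time."""
    page = 0
    while True:
        resp = requests.get(f"https://{OPS_HOST}/suite-api/api/resources",
                            headers=HEADERS,
                            params={"page": page, "pageSize": page_size},
                            verify=False)
        resp.raise_for_status()
        batch = resp.json().get("resourceList", [])
        if not batch:
            break
        yield from batch
        page += 1
print(sum(1 for _ in iter_resources()), "objects total")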
DAY 23 - AUTH PATTERNS (OPSTOKEN / SSO TOKEN)
Time required: 2 hours
OBJECTIVE: Master token acquisition and header formats.
STEP-BY-STEP ACTIONS:
POST to https://<ops-fqdn>/suite-api/api/auth/token/acquire with headers Content-Type: application/json and Accept: application/json and this body:
{
"username": "admin",
"password": "YourPassword123!",
"authSource": "local"
}
{
"token": "e1c2f5a8-...-long-token-string",
"validity": 21600,
"expiresAt": "Thursday, March 20, 2026 6:00:00 PM UTC",
"roles": []
}
Copy the token value and use it in the header: Authorization: OpsToken e1c2f5a8-...-long-token-string
Test it: GET https://<ops-fqdn>/suite-api/api/alerts with headers Authorization: OpsToken e1c2f5a8-...-long-token-string and Accept: application/json
When finished: POST https://<ops-fqdn>/suite-api/api/auth/token/release with header Authorization: OpsToken e1c2f5a8-...-long-token-string
For SSO users, only the authSource value changes to your identity source name (e.g., "vsphere.local"); the header format stays Authorization: OpsToken <token>.
VCF OPERATIONS API AUTH CHEAT SHEET
=====================================
--- LOCAL AUTH ---
POST https://<ops>/suite-api/api/auth/token/acquire
Content-Type: application/json
Accept: application/json
Body:
{
"username": "admin",
"password": "P@ssw0rd",
"authSource": "local"
}
Response: {"token": "<TOKEN>", "validity": 21600, ...}
--- USE TOKEN ---
GET https://<ops>/suite-api/api/alerts
Authorization: OpsToken <TOKEN>
Accept: application/json
--- RELEASE TOKEN ---
POST https://<ops>/suite-api/api/auth/token/release
Authorization: OpsToken <TOKEN>
--- SSO AUTH ---
Same as above but authSource = "vsphere.local" (or your identity source name)
--- CURL EXAMPLES ---
# Acquire token:
curl -k -X POST "https://<ops>/suite-api/api/auth/token/acquire" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{"username":"admin","password":"P@ssw0rd","authSource":"local"}'
# Use token (get alerts):
curl -k -X GET "https://<ops>/suite-api/api/alerts" \
-H "Authorization: OpsToken <TOKEN>" \
-H "Accept: application/json"
# Release token:
curl -k -X POST "https://<ops>/suite-api/api/auth/token/release" \
-H "Authorization: OpsToken <TOKEN>"
--- COMMON MISTAKES ---
1. Forgetting "OpsToken " prefix (with the space) before the token
2. Token expired (default 6 hours) - acquire a new one
3. Wrong authSource value - check your identity source name
4. Not setting Accept: application/json - may get XML instead
5. Using HTTP instead of HTTPS - API requires TLS
DAY 24 - BUILD YOUR POSTMAN COLLECTION
Time required: 2-3 hours
OBJECTIVE: Import OpenAPI specs and build a working Postman collection.
STEP-BY-STEP ACTIONS:
Create an environment named VCF Ops - Lab with variables:
- ops_host = your-ops-fqdn (e.g., vrops.lab.local)
- ops_token = (leave blank, will be populated by auth request)
- ops_user = admin
- ops_pass = YourPassword (mark as "secret" type)
Create a collection named VCF Ops - Core with folders: Auth, Alerts, Resources, Stats, Dashboards.
In the Auth folder, click Add Request, name it Acquire Token, and set it to POST https://{{ops_host}}/suite-api/api/auth/token/acquire with headers Content-Type: application/json and Accept: application/json and this body:
{
"username": "{{ops_user}}",
"password": "{{ops_pass}}",
"authSource": "local"
}
In the request's Tests tab, add this script to save the token into the environment:
var jsonData = pm.response.json();
pm.environment.set("ops_token", jsonData.token);
In the Alerts folder, add:
- List All Alerts: GET https://{{ops_host}}/suite-api/api/alerts with header Authorization: OpsToken {{ops_token}}
- Get Alert by ID: GET https://{{ops_host}}/suite-api/api/alerts/{{alert_id}} with header Authorization: OpsToken {{ops_token}}
In the Resources folder, add:
- List Resources: GET https://{{ops_host}}/suite-api/api/resources with header Authorization: OpsToken {{ops_token}}
- Get Resource Stats: GET https://{{ops_host}}/suite-api/api/resources/{{resource_id}}/stats with header Authorization: OpsToken {{ops_token}}
Test the flow: Acquire Token → verify 200 response and token saved; List All Alerts → verify 200 response with alert JSON; List Resources → verify 200 response with resource JSON.
DELIVERABLE: Working Postman collection with Auth, Alerts, Resources folders.
DAY 25 - VCF SDK AWARENESS (PYTHON/JAVA) + SAMPLES
Time required: 2 hours
OBJECTIVE: Know what SDKs exist, where to find them, and when to use each tool.
STEP-BY-STEP ACTIONS:
Install the Python SDK: pip install vcf-sdk (or download it from the Broadcom developer portal). Then decide when to use each tool:
AUTOMATION PATH CHOOSER
=========================
QUESTION: "Which tool should I use?"
USE VCF POWERCLI WHEN:
- You are a PowerShell admin (most VMware admins are)
- You need quick operational scripts (get health, list alerts, tag objects)
- You are doing one-off or scheduled tasks
- You want tab-completion and discoverability
- Example: "Get all hosts in maintenance mode and export to CSV"
USE VCF PYTHON SDK WHEN:
- You are building a reusable automation tool or integration
- You need complex logic (conditionals, error handling, retries)
- You want to integrate with other Python tools (Ansible, Flask, Django)
- You are building a custom dashboard or reporting tool
- Example: "Build a daily health report that emails the team"
USE OPENAPI + POSTMAN WHEN:
- You are learning the API (fastest way to explore endpoints)
- You need to test a specific API call quickly
- You want to generate client libraries in any language
- You are documenting API workflows for your team
- Example: "Figure out what parameters the alert endpoint accepts"
USE CURL WHEN:
- You need a quick one-liner from the command line
- You are scripting in bash
- You are troubleshooting API issues (raw HTTP visibility)
- Example: "Quick check if the API is responding"
USE RAW REST (requests library) WHEN:
- The SDK does not cover a specific endpoint
- You need maximum control over the HTTP request
- You are working with a newer API version not yet in the SDK
- Example: "Call a brand-new endpoint that was just released"
DECISION FLOWCHART:
Am I exploring/learning? → Postman
Am I an admin doing operational tasks? → PowerCLI
Am I building a reusable tool? → Python SDK
Am I doing a quick CLI check? → cURL
Does the SDK not support what I need? → Raw REST
DAY 26 - VCF INSTALLER AUTOMATION LANDSCAPE
Time required: 2 hours
OBJECTIVE: Understand the VCF Installer's automation capabilities.
STEP-BY-STEP ACTIONS:
TOP 10 VCF AUTOMATION OPPORTUNITIES
=====================================
1. Automated Health Check Export
What: Script that pulls health status for all clusters and exports to CSV/JSON daily
Tool: Python SDK or PowerCLI
Value: Daily health snapshot without manual UI clicks
2. Alert Report Generation
What: Script that pulls active P1/P2 alerts and emails the ops team
Tool: Python SDK + email library
Value: No one misses a critical alert
3. Capacity Report Automation
What: Script that pulls capacity remaining for all clusters, calculates trend, generates report
Tool: Python SDK
Value: Weekly capacity report generated automatically
4. Host Commissioning Automation
What: Script that commissions new ESXi hosts into SDDC Manager
Tool: SDDC Manager API / PowerCLI
Value: Consistent, repeatable host onboarding
5. Certificate Monitoring
What: Script that checks all certificate expiration dates, alerts if <30 days (see the sketch after this list)
Tool: Python + VCF Operations API
Value: Never be surprised by an expired cert
6. Configuration Drift Check
What: Script that compares current host configurations against baseline
Tool: PowerCLI
Value: Automated compliance checking
7. Snapshot Cleanup
What: Script that finds and reports/deletes snapshots older than N days
Tool: PowerCLI
Value: Reclaim storage, prevent performance issues
8. VM Inventory Export
What: Script that exports complete VM inventory with resource allocations
Tool: PowerCLI or Python SDK
Value: Asset management, chargeback data
9. Workload Domain Lifecycle
What: Script that validates prerequisites before creating a workload domain
Tool: SDDC Manager API
Value: Fewer failed deployments
10. Password Rotation
What: Script that rotates service account passwords per policy
Tool: SDDC Manager API (password management endpoints)
Value: Security compliance, audit readiness
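Opportunity #5 needs no VMware API at all - a TLS handshake per endpoint is enough. A minimal Python sketch using the third-party cryptography package (pip install cryptography; version 42+ for not_valid_after_utc); the endpoint list is a placeholder:
#!/usr/bin/env python3
"""Sketch: warn when any endpoint's certificate expires within 30 days."""
import socket
import ssl
from datetime import datetime, timezone
from cryptography import x509  # third-party: pip install cryptography
ENDPOINTS = ["vcenter.yourdomain.com", "ops.yourdomain.com"]  # placeholders
WARN_DAYS = 30
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # fetch the cert even if self-signed
for host in ENDPOINTS:
    with ctx.wrap_socket(socket.create_connection((host, 443), timeout=5),
                         server_hostname=host) as s:
        der = s.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der)
    days_left = (cert.not_valid_after_utc - datetime.now(timezone.utc)).days
    flag = "RENEW SOON" if days_left < WARN_DAYS else "ok"
    print(f"{host:35} expires in {days_left:4} days  {flag}")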
DAY 27 - VCF POWERCLI FUNDAMENTALS
Time required: 3 hours
OBJECTIVE: Install VCF PowerCLI and write script skeletons.
STEP-BY-STEP ACTIONS:
Install-Module -Name VMware.PowerCLI -Scope CurrentUser -Force
Type Y and press Enter when prompted to trust the PSGallery repository.
Get-Module -Name VMware.PowerCLI -ListAvailable
# Ignore invalid certificates in lab environments:
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false
# Set default server mode:
Set-PowerCLIConfiguration -DefaultVIServerMode Multiple -Confirm:$false
Connect-VIServer -Server "vcenter.yourdomain.com" -User "administrator@vsphere.local" -Password "YourPassword"
# List all clusters:
Get-Cluster
# List all hosts:
Get-VMHost
# List all VMs:
Get-VM
# Get host health info:
Get-VMHost | Select-Object Name, ConnectionState, PowerState, NumCpu, MemoryTotalGB, MemoryUsageGB
# Get VMs with CPU/Memory allocation:
Get-VM | Select-Object Name, PowerState, NumCpu, MemoryGB, UsedSpaceGB
# Find VMs with snapshots:
Get-VM | Get-Snapshot | Select-Object VM, Name, Created, SizeGB
# =============================================================
# SCRIPT: Get-VcfOpsHealth.ps1
# PURPOSE: Export cluster health summary to CSV
# =============================================================
param(
[Parameter(Mandatory=$true)]
[string]$VCenterServer,
[Parameter(Mandatory=$true)]
[string]$Username,
[string]$OutputPath = ".\cluster_health_$(Get-Date -Format 'yyyyMMdd').csv"
)
# Connect (Get-Credential prompts securely; -Password expects plain text, so avoid passing a SecureString to it)
Connect-VIServer -Server $VCenterServer -Credential (Get-Credential -UserName $Username -Message "vCenter password")
# Gather data
$clusters = Get-Cluster | ForEach-Object {
$cluster = $_
$hosts = Get-VMHost -Location $cluster
$vms = Get-VM -Location $cluster
[PSCustomObject]@{
ClusterName = $cluster.Name
HostCount = $hosts.Count
VMCount = $vms.Count
TotalCPUGHz = [math]::Round(($hosts | Measure-Object -Property CpuTotalMhz -Sum).Sum / 1000, 2)
UsedCPUGHz = [math]::Round(($hosts | Measure-Object -Property CpuUsageMhz -Sum).Sum / 1000, 2)
TotalMemoryGB = [math]::Round(($hosts | Measure-Object -Property MemoryTotalGB -Sum).Sum, 2)
UsedMemoryGB = [math]::Round(($hosts | Measure-Object -Property MemoryUsageGB -Sum).Sum, 2)
CPUUsagePct = 0 # calculated below
MemUsagePct = 0 # calculated below
}
}
# Calculate percentages
$clusters | ForEach-Object {
if ($_.TotalCPUGHz -gt 0) { $_.CPUUsagePct = [math]::Round(($_.UsedCPUGHz / $_.TotalCPUGHz) * 100, 1) }
if ($_.TotalMemoryGB -gt 0) { $_.MemUsagePct = [math]::Round(($_.UsedMemoryGB / $_.TotalMemoryGB) * 100, 1) }
}
# Export
$clusters | Export-Csv -Path $OutputPath -NoTypeInformation
Write-Host "Health report exported to $OutputPath"
# Disconnect
Disconnect-VIServer -Confirm:$false
# =============================================================
# SCRIPT: Get-VcfOpsAlerts.ps1
# PURPOSE: Pull alerts from VCF Operations API
# =============================================================
param(
[Parameter(Mandatory=$true)]
[string]$OpsServer,
[Parameter(Mandatory=$true)]
[string]$Username,
[string]$AuthSource = "local"
)
# Ignore self-signed certs in lab
if ($PSVersionTable.PSVersion.Major -ge 6) {
$PSDefaultParameterValues['Invoke-RestMethod:SkipCertificateCheck'] = $true
} else {
Add-Type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAll : ICertificatePolicy {
public bool CheckValidationResult(ServicePoint sp, X509Certificate cert, WebRequest req, int problem) { return true; }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAll
}
# Acquire token (prompt securely, then convert for the JSON body)
$securePass = Read-Host "Password" -AsSecureString
$plainPass = [System.Net.NetworkCredential]::new('', $securePass).Password  # works on PS 5.1 and 7+
$authBody = @{
    username = $Username
    password = $plainPass
    authSource = $AuthSource
} | ConvertTo-Json
$tokenResponse = Invoke-RestMethod -Uri "https://$OpsServer/suite-api/api/auth/token/acquire" `
-Method POST -Body $authBody -ContentType "application/json"
$token = $tokenResponse.token
$headers = @{
"Authorization" = "OpsToken $token"
"Accept" = "application/json"
}
# Get alerts
$alerts = Invoke-RestMethod -Uri "https://$OpsServer/suite-api/api/alerts" `
-Method GET -Headers $headers
# Display
$alerts.alerts | Select-Object alertId, alertLevel, status, startTimeUTC, resourceId | Format-Table -AutoSize
# Release token
Invoke-RestMethod -Uri "https://$OpsServer/suite-api/api/auth/token/release" `
-Method POST -Headers $headers
Write-Host "Token released. Done."
DAY 28 - BUILD A "DAY-2 AUTOMATION" PACK
Time required: 3-4 hours
OBJECTIVE: Build scripts for: export health summary, list alerts, tag objects, pull inventory.
STEP-BY-STEP ACTIONS:
Create a project folder vcf-ops-automation/ with subfolders: python/, powershell/, docs/. Save the following as python/export_health_summary.py:
#!/usr/bin/env python3
"""Export VCF Operations health summary to JSON."""
import requests
import json
import sys
from datetime import datetime
# Configuration - replace with your values or use environment variables
import os
OPS_HOST = os.environ.get("VCF_OPS_HOST", "your-ops.example.com")
OPS_USER = os.environ.get("VCF_OPS_USER", "admin")
OPS_PASS = os.environ.get("VCF_OPS_PASS", "") # never hardcode passwords
AUTH_SOURCE = "local"
def acquire_token(host, user, password, auth_source):
"""Acquire OpsToken from VCF Operations API."""
url = f"https://{host}/suite-api/api/auth/token/acquire"
payload = {"username": user, "password": password, "authSource": auth_source}
resp = requests.post(url, json=payload, verify=False)
resp.raise_for_status()
return resp.json()["token"]
def release_token(host, token):
"""Release OpsToken."""
url = f"https://{host}/suite-api/api/auth/token/release"
headers = {"Authorization": f"OpsToken {token}"}
requests.post(url, headers=headers, verify=False)
def get_resources(host, token, resource_kind="ClusterComputeResource"):
"""Get resources of a specific kind."""
url = f"https://{host}/suite-api/api/resources"
headers = {"Authorization": f"OpsToken {token}", "Accept": "application/json"}
params = {"resourceKind": resource_kind}
resp = requests.get(url, headers=headers, params=params, verify=False)
resp.raise_for_status()
return resp.json()
def main():
if not OPS_PASS:
print("ERROR: Set VCF_OPS_PASS environment variable.")
sys.exit(1)
token = acquire_token(OPS_HOST, OPS_USER, OPS_PASS, AUTH_SOURCE)
try:
resources = get_resources(OPS_HOST, token)
output = {
"generated_at": datetime.utcnow().isoformat(),
"host": OPS_HOST,
"clusters": resources.get("resourceList", [])
}
filename = f"health_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(filename, "w") as f:
json.dump(output, f, indent=2)
print(f"Health summary exported to {filename}")
finally:
release_token(OPS_HOST, token)
if __name__ == "__main__":
main()
Save the following as python/list_alerts.py:
#!/usr/bin/env python3
"""List active alerts from VCF Operations API."""
import requests
import json
import os
import sys
OPS_HOST = os.environ.get("VCF_OPS_HOST", "your-ops.example.com")
OPS_USER = os.environ.get("VCF_OPS_USER", "admin")
OPS_PASS = os.environ.get("VCF_OPS_PASS", "")
AUTH_SOURCE = "local"
def acquire_token(host, user, password, auth_source):
url = f"https://{host}/suite-api/api/auth/token/acquire"
payload = {"username": user, "password": password, "authSource": auth_source}
resp = requests.post(url, json=payload, verify=False)
resp.raise_for_status()
return resp.json()["token"]
def release_token(host, token):
url = f"https://{host}/suite-api/api/auth/token/release"
headers = {"Authorization": f"OpsToken {token}"}
requests.post(url, headers=headers, verify=False)
def get_alerts(host, token, status="ACTIVE"):
url = f"https://{host}/suite-api/api/alerts"
headers = {"Authorization": f"OpsToken {token}", "Accept": "application/json"}
params = {"status": status}
resp = requests.get(url, headers=headers, params=params, verify=False)
resp.raise_for_status()
return resp.json()
def main():
if not OPS_PASS:
print("ERROR: Set VCF_OPS_PASS environment variable.")
sys.exit(1)
token = acquire_token(OPS_HOST, OPS_USER, OPS_PASS, AUTH_SOURCE)
try:
alerts = get_alerts(OPS_HOST, token)
alert_list = alerts.get("alerts", [])
print(f"\n{'='*80}")
print(f"ACTIVE ALERTS: {len(alert_list)}")
print(f"{'='*80}")
for alert in alert_list:
print(f"\n Severity: {alert.get('alertLevel', 'UNKNOWN')}")
print(f" Alert: {alert.get('alertDefinitionName', 'N/A')}")
print(f" Object: {alert.get('resourceId', 'N/A')}")
print(f" Status: {alert.get('status', 'N/A')}")
print(f" Started: {alert.get('startTimeUTC', 'N/A')}")
print(f" ---")
finally:
release_token(OPS_HOST, token)
if __name__ == "__main__":
main()
Run the pack:
export VCF_OPS_HOST="your-ops.example.com"
export VCF_OPS_USER="admin"
export VCF_OPS_PASS="your-password"
python3 python/export_health_summary.py
python3 python/list_alerts.py
DELIVERABLE: Working automation pack with at least 4 scripts (2 Python, 2 PowerShell from Day 27).
DAY 29 - CAPSTONE: "OPS COMMAND CENTER"
Time required: 4-5 hours
OBJECTIVE: Combine everything into a cohesive operations package.
STEP-BY-STEP ACTIONS:
YOUR OPS COMMAND CENTER - FINAL PACKAGE
=========================================
DASHBOARDS (in VCF Operations):
[x] NOC - Real-Time Status (Day 11)
[x] OPS - Daily Operations View (Day 11)
[x] EXEC - Weekly Infrastructure Summary (Day 11)
[x] NOC-Logs - Error Trends (Day 18)
[x] SEC-Logs - Auth Failures (Day 18)
RUNBOOKS (documents):
[x] Triage Runbook v1 (Day 3)
[x] Alert Taxonomy P1-P4 (Day 8)
[x] Log-Based Alert Playbook (Day 19)
[x] RCA Template (Day 13)
[x] Weekly Capacity Report Template (Day 9)
AUTOMATION (scripts):
[x] Get-VcfOpsHealth.ps1 (Day 27)
[x] Get-VcfOpsAlerts.ps1 (Day 27)
[x] export_health_summary.py (Day 28)
[x] list_alerts.py (Day 28)
REFERENCE DOCS:
[x] VCF Ops Navigation Map (Day 1)
[x] Object Relationship Cheat Sheet (Day 2)
[x] Drift Watchlist (Day 10)
[x] API Quickstart (Day 22)
[x] Auth Cheat Sheet (Day 23)
[x] Postman Collection (Day 24)
[x] Automation Path Chooser (Day 25)
[x] Log Source Inventory Checklist (Day 15)
[x] Top 10 Log Dashboards to Pin (Day 17)
Create a master dashboard named OPS COMMAND CENTER - HOME that pulls these pieces together.
DAY 30 - FINAL BOSS: MOCK INTERVIEW + LIVE DEMO
Time required: 3-4 hours (45 min rapid fire + 15 min dashboard walkthrough + 10 min automation demo + prep)
PREPARATION (1 hour before):
PART 1: RAPID FIRE QUESTIONS (45 minutes)
Have someone ask you these (or record yourself). Speak your answers out loud.
Q1: "What is VCF Operations?"
SAY THIS: "VCF Operations is the unified management console for VMware Cloud Foundation 9. In VCF 9.0, the SDDC Manager UI was deprecated and its workflows moved into VCF Operations, making it the true single pane of glass. It collects metrics, logs, and configuration data from all VCF components - vCenter, ESXi hosts, NSX, vSAN - and provides health monitoring through VCF Health and Diagnostic Findings, performance analytics, capacity planning with cost visualization, compliance checking under Security, log analysis through Analyze, fleet management including lifecycle, certificates, and passwords, and custom dashboards and reports. The left nav has Infrastructure Operations with direct items like Diagnostic Findings, VCF Health, Dashboards & Reports, Alerts, Troubleshooting Workbench, Analyze, Storage Operations, and Network Operations, plus expandable sections for Workload Operations, Fleet Management, Capacity, Security, Administration, and Developer Center."
Q2: "Walk me through your first 15 minutes when you get paged for a production issue."
SAY THIS: "I follow a structured triage runbook. Minutes 0-2: I check VCF Health and Diagnostic Findings under Infrastructure Operations to assess scope - Diagnostic Findings automatically correlates issues against 107 known-issue signatures from Broadcom Support, so it often identifies the problem before I start digging. Minutes 2-5: I go to Infrastructure Operations > Alerts, filter to Active Critical and Immediate, sort newest first. I record each alert's name, affected object, and recommendation. Minutes 5-8: I click into the most critical affected object and go to the Relationships tab to determine blast radius - how many VMs, users, or services depend on this object. If I need deeper investigation, I open a Troubleshooting Workbench session to get the Object Relationship Graph. Minutes 8-12: I check the Metrics tab for the anomalous metric and if I need log evidence, I use Infrastructure Operations > Analyze to search log data. Minutes 12-15: I document my findings, determine next action - fix it, escalate, or monitor - and update the incident ticket."
Q3: "Walk me through a dashboard you built that reduced MTTR."
SAY THIS: "I built a Daily Health Overview dashboard with four widgets: cluster health scoreboard, top 10 VMs by CPU usage, active critical alerts list, and cluster capacity remaining. The key feature is widget interactions - when I click a cluster in the health scoreboard, all other widgets filter to that cluster. Before this dashboard, triage meant clicking through multiple screens to correlate alerts with affected objects. Now I can go from 'cluster X is unhealthy' to 'these specific VMs are causing the problem' in one click. This cut our initial triage time from about 15 minutes to under 5 minutes."
Q4: "How do you decide what goes on a NOC dashboard versus an executive dashboard?"
SAY THIS: "They serve different audiences with different questions. The NOC dashboard answers 'what is broken right now' - it shows real-time health status, active critical alerts, service availability, and auto-refreshes every 60 seconds. The NOC operator needs to see problems instantly and begin triage. The executive dashboard answers 'are we healthy and do we need to spend money' - it shows aggregate health by workload domain as green/yellow/red, capacity runway in weeks, top risks, and efficiency metrics. No individual VM data, no raw alert counts - just strategic indicators. I also build a VMware Ops team dashboard that sits between the two - it shows daily health trends, capacity planning data, compliance scores, and recent configuration changes."
Q5: "How do you handle alert noise?"
SAY THIS: "I take a systematic approach. First, I audit the alert landscape by counting active alerts by severity over the last 7 days. If Critical plus Immediate exceeds 50, there is a noise problem. I sort alerts by frequency to find the top offenders - alerts that fire and clear repeatedly. For each noisy alert, I ask: does this require a human to take action? If not, I either raise the threshold, extend the trigger duration, or disable it. I classify all alerts into P1 through P4 with clear definitions. P1 is service-impacting with a 15-minute response target. P2 is significant degradation with a 1-hour target. P3 is non-urgent with a business-day target. P4 is informational reviewed weekly. Each level has a defined owner and escalation path. The goal is: every alert that fires should result in a human taking a specific action."
Q6: "How does API authentication work in VCF Operations?"
SAY THIS: "VCF Operations uses token-based authentication. You POST credentials to the token acquire endpoint at /suite-api/api/auth/token/acquire with a JSON body containing username, password, and authSource. For local accounts, authSource is 'local'. For SSO accounts, it is the identity source name like 'vsphere.local'. The response contains a token that is valid for approximately 6 hours. For all subsequent API calls, you include the header 'Authorization: OpsToken' followed by a space and the token value. When you are done, you should release the token by POSTing to the token release endpoint. The Swagger UI is at /suite-api/doc/swagger-ui.html where you can explore and test all available endpoints."
Q7: "What is configuration drift and why do you care?"
SAY THIS: "Configuration drift is when a system's actual configuration deviates from its approved baseline without going through change management. I care about it for three reasons. Security: an enabled SSH service or a weakened password policy is an attack surface. Stability: NTP skew can break vSAN, certificate validation, and log correlation. Compliance: auditors require proof that configurations match approved baselines. I maintain a drift watchlist of the top 10 settings I track: SSH service status, NTP configuration, syslog forwarding, lockdown mode, SSO password policy, firewall rules, VM hardware version, datastore permissions, distributed switch security settings, and certificate validity. Any change to these without a change ticket triggers an investigation."
Q8: "How do you use logs differently for troubleshooting versus auditing?"
SAY THIS: "Troubleshooting logs are reactive and real-time. I use them to find root cause of current issues - searching vmkernel.log for storage errors, vpxd.log for vCenter service failures, NSX syslog for network issues. I need recent detailed data, typically the last 7 to 30 days. The question I am answering is 'why did this break?' Audit logs are proactive and historical. I use them to track who did what, when - vCenter event logs for VM operations, ESXi shell access logs, NSX audit logs for firewall rule changes, SSO login logs. I need longer retention, 90 days to a year. The question I am answering is 'who changed this and were they authorized?' In Operations-Logs, I build separate dashboards for each: a troubleshooting board with error trends and service health, and a security board with authentication failures, unauthorized access, and configuration changes."
Q9: "How do content packs help in Operations-Logs?"
SAY THIS: "Content packs are pre-built bundles of dashboards, alerts, and saved queries designed for specific log sources. For example, the vSphere content pack includes dashboards for ESXi host errors, vCenter events, and authentication tracking, plus alert definitions for common failure patterns, plus saved queries for frequent troubleshooting scenarios. They accelerate time-to-value because instead of building every dashboard and alert from scratch, I install the content pack and immediately have a functional monitoring baseline. I then customize on top of that baseline for my specific environment. It is the same principle as using a template instead of starting from a blank page."
Q10: "When do you automate and when do you not?"
SAY THIS: "I automate when three conditions are met: the task is repeatable, meaning I do it more than once a week. The task is deterministic, meaning the same inputs always produce the same correct outputs. And the blast radius of a mistake is manageable, meaning if the script fails, it does not take down production. For example, exporting a daily health report is a perfect automation candidate - repeatable, deterministic, read-only. Provisioning a new workload domain is automatable but requires guardrails - validation checks, dry-run mode, and human approval before execution. I do not automate tasks that require judgment calls, like deciding whether to evacuate a host during a performance event. I also follow the principle of idempotency - my scripts should be safe to run twice without causing damage."
PART 2: DASHBOARD WALKTHROUGH (15 minutes)
Open your Ops Command Center - HOME dashboard and walk through it:
SAY THIS: "This is my Ops Command Center. It is the single entry point I use every morning.
[Point to Row 1] At the top I have three key indicators: overall health across all workload domains, the count of active alerts broken down by severity, and the capacity runway showing the most constrained cluster and how many days until it reaches capacity.
[Point to Row 2] In the middle I have my navigation hub linking to three audience-specific dashboards: the NOC real-time status board for incident response, the Ops daily operations view for my team's day-to-day work, and the Executive weekly summary for leadership briefings.
Let me drill into the NOC dashboard. [Click into NOC dashboard] This auto-refreshes every 60 seconds. The heat map shows cluster health at a glance - all green means I can move on. If I see yellow or red, I click it and the alert list below filters to show only alerts for that cluster. Let me demonstrate. [Click a cluster, show the filtering]
Now let me show the log side. [Open the Auth Failures dashboard] This shows failed authentication attempts across all components over the last 7 days. I can see patterns - if one username has 50 failures in 10 minutes, that is a brute force attempt. [Point to data] This dashboard helped me identify an unauthorized access attempt last week and respond within minutes instead of finding out during an audit."
PART 3: AUTOMATION DEMO (10 minutes)
Open your terminal and run your scripts:
SAY THIS: "I have built a day-2 automation pack with Python and PowerShell scripts. Let me demonstrate.
[Open terminal] First, I will run the health summary export. This script authenticates to the VCF Operations API using token-based auth, pulls cluster resource data, and exports to JSON.
[Run the script] python3 export_health_summary.py
[Show the output file] Here is the exported JSON with cluster data. This runs as a scheduled task every morning at 6 AM and feeds into our daily standup.
[Run the alerts script] python3 list_alerts.py
[Show the output] This pulls all active alerts and displays them in a readable format. In production, I have this sending to a Slack channel so the team sees new critical alerts immediately.
The key principles in my automation: credentials are never hardcoded - they come from environment variables. Tokens are always released after use. All API calls use error handling. Scripts are idempotent - safe to run multiple times. And I start with read-only operations before automating any write operations."
SCORING RUBRIC (Final Assessment):
SIGNAL OVER NOISE: Do you choose the right metrics and alerts? /10
ORDER OF OPERATIONS: Do you triage in a sane sequence? /10
STORYTELLING: Do you communicate impact and prevention? /10
AUTOMATION MATURITY: Safe, tested, minimal blast radius? /10
DASHBOARD DESIGN: Decision-making dashboards, not wallpaper? /10
API COMFORT: Can you explain and demonstrate token auth? /10
LOG MASTERY: Can you pivot from symptoms to log proof? /10
OVERALL CONFIDENCE: Can you present without hesitation? /10
TOTAL: /80
70+: Ready for interviews
55-69: Need more reps on weak areas
Below 55: Repeat Week 3-4 before interviewing
VCF OPERATIONS INTERVIEW PREP PACK - COMPLETE
Exact Answers | Scoring Rubrics | Whiteboard Prompts
A. YOUR 60-SECOND PITCH
When the interviewer says "Tell me about yourself" or "Walk me through your background":
SAY THIS (practice until it is natural, under 60 seconds):
"I am an infrastructure operations engineer focused on VMware Cloud Foundation environments. My core strengths are in three areas. First, monitoring and operations: I build decision-making dashboards in VCF Operations that give NOC teams, ops teams, and executives the right view for their role. I have reduced triage time by implementing widget interactions and structured runbooks. Second, performance and capacity management: I run weekly capacity reviews, track configuration drift, and proactively identify risks before they become outages. Third, automation: I use the VCF Operations API, Python, and PowerCLI to automate repeatable tasks like health exports, alert reporting, and compliance checking. I follow a principle of automating what is repeatable, deterministic, and safe before touching anything that requires judgment. I am looking for a role where I can bring this operational discipline to a VCF environment at scale."
B. COMPLETE QUESTION BANK WITH EXACT ANSWERS
DASHBOARDS & OPERATIONS (20 Questions)
Q1: "Walk me through a dashboard you built that reduced MTTR."
SAY THIS: "I built a Daily Health Overview dashboard with four widgets: a cluster health scoreboard showing green, yellow, or red status for each cluster, a Top-10 VMs by CPU chart to spot resource consumers, an active critical alerts list filtered to Critical and Immediate severity only, and a cluster capacity remaining scoreboard. The key design decision was widget interactions. When I click a specific cluster in the health scoreboard, the VM list and alert list automatically filter to show only data for that cluster. Before this dashboard, triage meant navigating through Inventory, then Alerts, then Metrics across multiple screens. After, we could go from 'cluster X is degraded' to 'these are the specific VMs and alerts' in a single click. This cut initial triage from about 15 minutes to under 5, which directly reduced our MTTR on cluster-level issues."
Q2: "How do you decide what goes on a NOC dashboard vs an exec dashboard?"
SAY THIS: "They answer different questions for different audiences. The NOC dashboard answers 'what is broken right now.' It needs real-time health status with auto-refresh every 60 seconds, active Critical and Immediate alerts sorted newest first, service availability indicators, and host status. The NOC operator needs to see a problem and start triage in seconds, not minutes. The executive dashboard answers 'are we healthy and do we need to invest.' It shows aggregate health per workload domain as green, yellow, or red badges, capacity runway in weeks or months, a top-3 risk summary, and resource efficiency percentages. No individual VM details. No raw alert lists. No technical metric names. I also maintain an Ops Team dashboard in between that covers daily health trends, capacity planning detail, compliance scores, and configuration change tracking."
Q3: "How do you avoid dashboard sprawl?"
SAY THIS: "Five practices. First, naming convention: every dashboard starts with a prefix indicating its audience - NOC, OPS, EXEC, SEC - so they are easy to find and categorize. Second, I use Favorites to pin the dashboards I actually use daily, and review Recents weekly to see if I am ignoring any. Third, I clone prebuilt dashboards and customize them instead of building from scratch, which prevents duplicate effort. Fourth, I run a monthly review using the Manage tab under Dashboards & Reports to identify and delete unused dashboards - if no one opened it in 30 days, it gets archived or deleted. Fifth, I design dashboards with widget interactions so one dashboard can serve multiple drill-down paths instead of needing five separate static dashboards."
Q4: "What is your process for turning noisy alerts into a signal?"
SAY THIS: "I start with an audit. I count active alerts by severity over the last 7 days. If Critical plus Immediate exceeds 50 for our environment size, we have a noise problem. I sort by frequency to find the top offenders - alerts that fire and auto-cancel repeatedly. For each noisy alert, I ask three questions: Does this require a human to take a specific action? If the answer is always 'ignore it,' the alert needs to change. Can I raise the threshold? For example, a CPU alert at 80% might fire constantly but never be actionable, while 95% sustained for 10 minutes actually means something. Can I extend the trigger duration? A 1-minute spike is noise; a 15-minute sustained condition is signal. Then I classify all alerts into P1 through P4 with clear ownership and MTTR targets. P1 pages the on-call engineer with a 15-minute response. P2 notifies the team lead with a 1-hour response. P3 goes into the daily standup queue. P4 is reviewed weekly. Every alert that fires should have a corresponding action."
Q5: "What are your top 5 daily checks in VCF Operations?"
SAY THIS: "Every morning I follow the same sequence. First, I check Infrastructure Operations > VCF Health for full-stack health across all workload domains. Green means move on, anything else means investigate. I also check Diagnostic Findings which automatically scans for 107 known-issue signatures from Broadcom Support. Second, I check Infrastructure Operations > Alerts for new Critical or Immediate alerts that fired overnight. If so, I check if they were already acknowledged or need attention. Third, capacity check: I go to Capacity > Capacity Optimization and glance at capacity remaining percentages for production clusters. Anything below 30% gets flagged for the weekly capacity review. Fourth, I check Security > Compliance for new drift findings and Fleet Management > Configuration Management for any scheduled drift detection results. Fifth, I check Infrastructure Operations > Analyze to verify log ingestion is flowing from all sources. If a source stopped sending logs overnight, I need to investigate before I lose forensic visibility."
Q6: "Describe widget interactions in VCF Operations dashboards."
SAY THIS: "Widget interactions allow one widget to drive the content of another widget on the same dashboard. When I configure an interaction, I define a source widget and a target widget. When I select an object in the source widget, the target widget updates to show data only for that selected object. For example, I have a cluster health scoreboard as my source and a VM list and alert list as targets. When I click Cluster-Prod-01 in the scoreboard, the VM list shows only VMs in that cluster and the alert list shows only alerts for that cluster. I can chain multiple targets from one source, so one click can update three or four other widgets simultaneously. This is what makes dashboards into investigation tools instead of static wall displays."
Q7: "How do you manage dashboards across teams?"
SAY THIS: "I use the Manage tab under Infrastructure Operations > Dashboards & Reports to control sharing and permissions. Each dashboard has a share setting that determines who can see it. I share NOC dashboards with the NOC operator group, Ops dashboards with my operations team, and Exec dashboards with the management group. I also use the clone feature when teams need similar dashboards with different scope - I clone the template and modify the object scope for each team. For governance, I own all dashboards in our environment and run a monthly review to clean up unused ones, update out-of-date ones, and add new ones based on team feedback."
Q8: "What makes a good operations dashboard?"
SAY THIS: "A good operations dashboard answers a specific question for a specific audience and drives a specific action. It is not decoration. Four criteria. First, it has a clear purpose stated in the title - 'NOC Real-Time Status' not 'Dashboard 7.' Second, it uses the right widgets for the data type - scoreboards for status, Top-N for comparisons, time-series charts for trends, tables for detail. Third, it has widget interactions so the operator can drill down without leaving the page. Fourth, it is actionable - every piece of data on the dashboard should either confirm 'everything is fine' or trigger a specific next step. If a widget is on the dashboard and no one ever looks at it or acts on it, it should be removed."
Q9: "How do you validate that a dashboard is useful?"
SAY THIS: "I watch how people actually use it. After deploying a new dashboard, I check three things over the first two weeks. First, is it being opened? If it has zero views, it is not meeting a need. Second, are people interacting with it - clicking widgets, using filters - or just glancing at it? Interaction means they are using it for investigation, not just decoration. Third, I ask the users directly: 'When you opened this dashboard today, did it answer your question? What was missing?' I have removed widgets that seemed important to me but users told me they never looked at. I have added widgets that users requested because their actual workflow needed information I did not anticipate."
Q10: "How do you build a capacity story for leadership?"
SAY THIS: "I present capacity in business terms, not technical metrics. Instead of saying 'Cluster 3 is at 78% CPU,' I say 'Cluster 3 can absorb approximately 40 more VMs at current average VM size before we need additional hardware, and at current growth rate that is about 90 days.' My weekly capacity report has five sections: executive summary with the headline and recommended action, capacity status by cluster in a table with percentages and days to exhaustion, top capacity risks ranked by urgency, reclaimable capacity from oversized VMs and old snapshots that could defer hardware purchases, and specific recommendations with owners and dates. I always include a trend - is capacity usage growing faster, slower, or steady compared to last month."
Q11: "Tell me about a time you identified and resolved a performance issue."
SAY THIS: "On a Tuesday afternoon, production users reported application slowness. I opened my NOC dashboard and saw Cluster-Prod-01 health was Yellow with CPU contention. I clicked the cluster in the health scoreboard - the widget interaction filtered my alert list to show a Critical alert for sustained CPU usage above 92%. I went to the Metrics tab and confirmed CPU had spiked at 2:32 PM and remained elevated. In the Top-N widget, I identified three VMs consuming 45% of cluster CPU. These were batch processing VMs that had started unscheduled data jobs. I vMotioned eight workload VMs to Cluster-Prod-02 to relieve immediate pressure, then worked with the application team to move the batch VMs to a dedicated resource pool with CPU limits. I documented the RCA, created an alert threshold at 85% for earlier warning, and set the batch jobs to run only during off-peak hours. Total impact was 43 minutes of application latency affecting approximately 200 users."
Q12: "How do you write a root cause analysis?"
SAY THIS: "My RCA follows a standard template with seven sections. Executive summary: two to three sentences covering what happened, who was affected, and how it was resolved. Timeline: a minute-by-minute table from first symptom to resolution. Root cause: the specific technical cause, stated precisely. Impact: number of users, services, and duration. Resolution: what was done immediately and what technical changes were made. Prevention using the 5 Whys method: I trace the chain of causation back to the systemic issue and define corrective actions with owners and due dates. Lessons learned: what worked well in the response, what could be improved, and what should be automated. The most important part is the prevention section. If the RCA does not result in changes that prevent recurrence, it was just documentation, not improvement."
Q13: "What is configuration drift and how do you handle it?"
SAY THIS: "Configuration drift is when a system's actual configuration deviates from its approved baseline without going through change management. In VCF 9.0, I handle it with two tools in two places. First, Security > Compliance checks against industry benchmarks and hardening guides like CIS and the vSphere Security Configuration Guide. It discovers non-compliant objects and provides remediation guidance. Second, Fleet Management > Configuration Management does scheduled drift detection against my own baselines for vCenter and cluster objects. It integrates with Git for template versioning so I can track changes over time, and it generates PDF drift reports. I maintain a top-10 drift watchlist of the settings that cause the most damage when they change - SSH status, NTP, syslog forwarding, lockdown mode, password policies, firewall rules, VM hardware version, datastore permissions, DVS security settings, and certificate validity. I also use Security > Audit Events to investigate who changed what and when. If the same drift keeps recurring, I automate the remediation or fix the root cause in the provisioning process."
Q14: "How do you present a weekly ops briefing?"
SAY THIS: "My weekly briefing is exactly five minutes, structured into five one-minute sections. Minute one: overall health status across all domains, active alert counts, any P1 incidents that occurred. Minute two: capacity status by cluster, trending data, any clusters approaching thresholds. Minute three: performance highlights and any incidents with brief RCA summaries. Minute four: compliance status, drift findings, certificate expirations, security observations. Minute five: action items with owners and dates, and a call for questions. I present from my dashboards, not from slides. The dashboards are live data, which builds credibility and allows me to drill into any question on the spot."
Q15: "What is the difference between VCF Operations and SDDC Manager?"
SAY THIS: "In VCF 9.0, they converged. The SDDC Manager UI is deprecated - its workflows for lifecycle management, fleet management, certificate management, password management, and configuration management all moved into VCF Operations. VCF Operations is now the single unified console for both operations and lifecycle. Under Fleet Management in the left nav, I find Lifecycle Management for upgrades and patching, Certificate Management for cert renewal and replacement, Password Management for credential rotation, and Configuration Management for drift detection with Git integration. Under Infrastructure Operations, I find the monitoring side - VCF Health, Diagnostic Findings with 107 known-issue signatures, Alerts, Dashboards & Reports, Troubleshooting Workbench with Object Relationship Graphs, and log analysis via Analyze. Some SDDC Manager deployment tasks also moved to the vSphere Client. The bottom line: VCF Operations IS the operational control plane now, not just the monitoring layer."
Q16: "How do you handle an alert that keeps firing and clearing repeatedly?"
SAY THIS: "A flapping alert is a sign that the threshold is too close to normal operating range. I take four steps. First, I look at the metric history for the affected object to understand its normal range and variance. Second, I check whether the alert threshold makes sense for this specific object - a host running a database may normally sit at 85% memory, which is fine. Third, I either raise the threshold to above the normal peak, extend the trigger duration so short spikes do not fire the alert, or create a custom alert definition for objects with known higher baselines. Fourth, if the flapping indicates an actual intermittent problem, like a host dropping off the network and reconnecting, I investigate the underlying issue instead of just silencing the alert."
Q17: "What metrics tell you a cluster is in trouble?"
SAY THIS: "I look at four metrics in order. First, CPU Ready time, not just CPU usage. High CPU ready means VMs are waiting for CPU cycles and experiencing contention - this is the metric users feel. Second, memory ballooning and swapping. If the VMkernel is reclaiming memory via balloon driver or swapping to disk, performance is already degraded. Third, disk latency. Read and write latency above 20 milliseconds sustained indicates storage contention or a failing storage path. Fourth, network dropped packets. Any non-zero drop count means something is saturated or misconfigured. I weight these differently: CPU ready and memory swapping are immediate user-impacting, disk latency could be a hardware issue developing, and network drops could indicate a configuration problem."
Q18: "How do you identify oversized VMs?"
SAY THIS: "In VCF Operations, I look at the capacity analytics for reclaimable resources. Specifically, I compare allocated vCPU versus actual CPU usage and allocated memory versus actual memory active. A VM with 8 vCPU allocated but averaging 1.5 vCPU used is oversized. I also look at the CPU ready time - an oversized VM with too many vCPUs can actually hurt performance because the scheduler has to find that many free physical cores simultaneously. My right-sizing recommendation follows a rule: set allocation to the 95th percentile of actual usage plus a 20% buffer. I present right-sizing as a capacity reclamation opportunity, not a punishment. I show leadership how much capacity we can free up across the fleet by right-sizing the top 20 most oversized VMs."
Q19: "What would you automate first in a new VCF Operations environment?"
SAY THIS: "I follow a safe-to-impactful progression. First, I automate read-only health exports - a daily script that pulls cluster health, alert counts, and capacity metrics and saves them to a file or sends them to a channel. Zero risk, immediate value. Second, I automate the weekly capacity report so it is generated automatically instead of manually pulled from the UI. Third, I automate snapshot cleanup reporting - a script that finds all snapshots older than 7 days and generates a report for review, but does not delete anything without approval. Fourth, I automate certificate expiration monitoring - a script that checks all certificate dates and alerts if anything expires in under 30 days. These four give me daily health visibility, weekly capacity planning, storage hygiene, and security compliance, all without any write operations that could cause damage."
Q20: "Describe your ideal VCF monitoring setup."
SAY THIS: "In VCF 9.0, it is all one unified console. Under Infrastructure Operations, I use VCF Health and Diagnostic Findings for stack-wide health monitoring with 107 known-issue signatures. Dashboards & Reports gives me customizable dashboards with widget interactions for drill-down. Alerts handles metric-based alerting classified P1-P4. Storage Operations and Network Operations give me specialized views for vSAN and NSX. Analyze gives me integrated log analysis with saved queries, extracted fields, and log-based alerts - this requires the VCF Operations for Logs component. Security > Compliance checks against CIS benchmarks, and Fleet Management > Configuration Management handles drift detection with Git-backed template versioning. Under Capacity, I get cost visualization, capacity optimization, and what-if scenarios. Then layer three is automation via the Developer Center REST APIs and PowerCLI: scripts that export health data daily, generate capacity reports weekly, and monitor certificates via Fleet Management > Certificate Management. Everything is in one console now, not three separate tools."
API & AUTOMATION (20 Questions)
Q21: "Show me how you would find the API documentation for VCF Operations."
SAY THIS: "The VCF Operations API documentation is hosted as Swagger UI directly on the appliance. I navigate to https://
Q22: "How does token auth work in VCF Operations API?"
SAY THIS: "I POST a JSON body with username, password, and authSource to /suite-api/api/auth/token/acquire. For local accounts, authSource is 'local.' For SSO accounts, it is the identity source name like 'vsphere.local.' The API returns a JSON response containing a token string and its validity period, typically 6 hours. For every subsequent API call, I include the header Authorization: OpsToken followed by a space and the token value. When I am done with my session, I POST to /suite-api/api/auth/token/release with the same Authorization header to invalidate the token. Common mistakes are forgetting the 'OpsToken' prefix before the token, letting the token expire without acquiring a new one, using the wrong authSource value, and not setting the Accept header to application/json which can result in XML responses instead of JSON."
Q23: "Why does OpenAPI matter for VCF automation?"
SAY THIS: "OpenAPI specifications are machine-readable descriptions of the entire API surface. Three practical benefits. First, I can import the spec into Postman and it automatically generates a complete collection with every endpoint, parameter, and example body pre-configured. This skips hours of manual collection building. Second, I can use code generation tools like openapi-generator to produce client libraries in any language - Python, Java, Go, whatever the team uses. The generated code handles serialization, error types, and parameter validation. Third, the spec serves as always-accurate documentation because it is generated from the actual API code. If an endpoint changes in a new version, the spec reflects it, so my generated clients stay in sync."
Q24: "Where do SDKs fit in VCF automation?"
SAY THIS: "Broadcom publishes a Unified VCF SDK for Python and Java. The SDK provides structured, object-oriented wrappers around the REST APIs, which is easier to work with than raw HTTP calls. I use the SDK when building reusable automation tools or integrations where I need error handling, type safety, and maintainability. For quick one-off API calls, I use Postman or curl. For operational scripting by VMware admins who know PowerShell, I use VCF PowerCLI. For generating client code in languages the SDK does not cover, I use the OpenAPI specs with code generators. The decision comes down to: who will maintain this code and what language are they comfortable with."
Q25: "Explain VCF PowerCLI in VCF 9."
SAY THIS: "VCF PowerCLI is the renamed and updated PowerShell module for managing VCF services. In VCF 9, it is the operational scripting tool for administrators who work in PowerShell. I install it with Install-Module VMware.PowerCLI, connect to vCenter with Connect-VIServer, and then have access to cmdlets for managing VMs, hosts, clusters, networking, and storage. For VCF-specific operations like managing workload domains, host commissioning, and lifecycle tasks, I use the SDDC Manager API endpoints. PowerCLI is my choice for operational tasks that admins run interactively or on a schedule - things like health checks, inventory exports, snapshot cleanup, and configuration audits."
Q26: "How do you keep credentials out of automation scripts?"
SAY THIS: "Three methods depending on the context. For scripts running on my workstation, I use environment variables. The script reads the password from an environment variable that I set in my shell session, never in the script file. For scheduled tasks, I use a credential store or secrets manager. PowerCLI has a credential store feature, and for Python I use either a secrets manager API or encrypted credential files. For CI/CD pipelines, I use pipeline secrets or vault integration - the secret is injected at runtime and never appears in the code repository. The principle is: credentials should never exist in source code, configuration files checked into version control, or command-line arguments visible in process listings."
Q27: "What is idempotency and why does it matter for automation?"
SAY THIS: "Idempotency means running the same script twice produces the same result as running it once. It matters because in operations, scripts fail midway, get re-run, get scheduled to overlap, or get triggered by multiple events. If my script creates a resource without checking if it already exists, running it twice creates duplicates. If my script deletes and recreates, running it during a partial failure could destroy data. An idempotent script checks state before acting. For example, before creating an alert definition, it checks if one with that name already exists. Before setting a configuration, it reads the current value and only changes it if different. Before adding a host to inventory, it checks if the host is already present. This is what makes automation safe to run in real environments."
Q28: "Walk me through building a Postman collection for VCF Operations."
SAY THIS: "I start by creating a Postman environment with variables: ops_host for the server FQDN, ops_token which starts empty, ops_user and ops_pass for credentials. Then I create the collection with folders matching the API categories: Auth, Alerts, Resources, Stats, Dashboards. The first request is Acquire Token - a POST to the token endpoint using environment variables for credentials. In the Tests tab, I add JavaScript that extracts the token from the response and saves it to the ops_token environment variable. Every subsequent request uses the Authorization header with value 'OpsToken' followed by the ops_token variable. This way, I run the auth request once and all other requests automatically use the valid token. If I have the OpenAPI spec file, I can skip manual creation entirely by importing the spec, which auto-generates the entire collection."
Q29: "How do you test automation scripts safely before running in production?"
SAY THIS: "Four-stage approach. Stage one: I test in a lab or dev environment that mirrors production but has no production workloads. Stage two: I run the script in read-only mode first. My scripts have a dry-run flag that logs what they would do without actually doing it. Stage three: I run against a single object in production - one host, one cluster, one VM - and verify the result before expanding scope. Stage four: I run against the full target set with monitoring. For write operations, I always include a confirmation prompt unless the script is running unattended on a schedule, in which case the validation is built into the pre-checks. I also keep logs of every action the script takes so I can audit and roll back if needed."
Q30: "What REST API calls would you make to pull a health summary?"
SAY THIS: "Three calls. First, POST to /suite-api/api/auth/token/acquire to get my authentication token. Second, GET to /suite-api/api/resources with a resourceKind parameter set to ClusterComputeResource to get all cluster objects and their current health status. I can add a pageSize parameter to handle pagination in large environments. Third, for each cluster, I can optionally GET /suite-api/api/resources/{resourceId}/stats with specific metric keys to pull CPU usage, memory usage, and capacity remaining. Finally, I POST to /suite-api/api/auth/token/release to clean up my token. The result is a complete health picture that I can serialize to JSON, format into a report, or push to a monitoring channel."
LOGS (10 Questions)
Q31: "How do you use logs differently for auditing versus troubleshooting?"
SAY THIS: "Troubleshooting is reactive and time-bounded. I search for specific error messages, stack traces, or service status changes around the time an issue was reported. I need recent, detailed data, typically 7 to 30 days. The key log sources are vmkernel.log for host issues, vpxd.log for vCenter problems, and NSX syslog for network events. Auditing is proactive and broad. I look for patterns over longer time periods - who logged in, what changed, when, and was it authorized. I need 90 days to a year of retention. The key sources are vCenter events and tasks, ESXi shell.log and auth.log, NSX audit logs, and SSO authentication logs. In Operations-Logs, I build separate dashboards for each purpose: a troubleshooting board focused on error rates and service health, and a security audit board focused on authentication events, privilege escalation, and configuration changes."
Q32: "How do content packs help in Operations-Logs?"
SAY THIS: "Content packs are pre-built bundles of dashboards, alert definitions, and saved queries for specific log sources. For example, the vSphere content pack includes dashboards showing ESXi host error trends, vCenter service health, and authentication tracking, plus alert definitions for patterns like repeated login failures or service crashes, plus saved queries for common troubleshooting scenarios. The value is speed to operational readiness. Instead of spending days building dashboards and writing queries from scratch, I install the relevant content packs and immediately have a functional monitoring baseline. Then I customize on top - adding environment-specific queries, adjusting alert thresholds, and pinning the dashboards most relevant to my team."
Q33: "Walk me through deploying Operations-Logs."
SAY THIS: "Before deployment, I verify prerequisites: DNS forward and reverse records for the appliance FQDN, NTP server accessibility, a static IP allocation, network connectivity to all log sources, sufficient storage for my retention requirements, and certificate decisions. I deploy through SDDC Manager as a fleet-managed deployment. I provide the FQDN, IP, subnet, gateway, DNS servers, NTP servers, admin password, and deployment size based on environment scale. Deployment takes 30 to 60 minutes. After deployment, I validate: DNS resolution works both directions, the UI loads, admin login works, at least one log source is connected, logs are actively flowing, and the certificate is valid. Then I configure syslog forwarding on ESXi hosts by setting Syslog.global.logHost to the Operations-Logs address, install relevant content packs, and pin the dashboards I need."
Q34: "How do you build a query path that proves root cause using logs?"
SAY THIS: "I follow a three-step query path: symptom, correlation, and proof. Step one, symptom: I search for log entries matching the reported problem. For example, if users cannot log into vCenter, I search for 'authentication failure' or 'login failed' in the last 2 hours filtered to vCenter source. Step two, correlation: I look for related events around the same time. If I find authentication failures starting at 8:43 AM, I search for vCenter service events between 8:30 and 8:45 AM. I might find that the STS service restarted at 8:41 AM. Step three, proof: I dig into what caused the correlated event. I search for errors in the STS service before 8:41 AM and find out-of-memory errors at 8:39 AM. Now I have a provable chain: STS ran out of memory at 8:39, crashed and restarted at 8:41, and users could not authenticate during the restart from 8:41 to 8:45. Each step is backed by timestamped log evidence."
Q35: "What log sources are critical in a VCF environment?"
SAY THIS: "Five critical sources. ESXi hosts: vmkernel.log for hardware, storage, and network events; hostd.log for VM operations; auth.log and shell.log for security auditing. vCenter Server: vpxd.log for core service operations; ssoAdminServer.log for authentication; vpxd-alert.log for critical events. NSX Manager: syslog for network operations; nsxapi.log for API calls; audit logs for security rule changes. SDDC Manager: lifecycle operation logs for domain management, host commissioning, and update operations. VCF Operations for Logs itself: ingestion health logs to ensure the logging infrastructure is working. I prioritize forwarding in this order: ESXi and vCenter first because they cover the most common troubleshooting and audit scenarios, NSX second for security visibility, SDDC Manager third for lifecycle operations."
Q36: "How do you alert on logs without creating noise?"
SAY THIS: "The principle is alert on symptoms, not on every event. Instead of alerting on every ERROR log line, which generates thousands of alerts, I alert on patterns that indicate a real problem. Three techniques. First, rate-based alerts: alert when the error rate from a single host exceeds 100 errors per hour, which is significantly above the normal baseline of under 10. Second, pattern-based alerts: alert when a specific dangerous pattern appears, like 5 or more authentication failures for the same username within 10 minutes, which suggests brute force. Third, absence-based alerts: alert when a log source stops sending data for more than 15 minutes, which means I have lost visibility. Each alert has a documented playbook entry that tells the responder: what triggered it, how to verify if it is real, and what to do about it."
Q37: "What is your log retention strategy?"
SAY THIS: "I use tiered retention based on log type and compliance requirements. Operational logs like vmkernel and vpxd get 30 days of full detail. These are high-volume but rarely needed beyond the last month for troubleshooting. Security and audit logs like authentication events, shell access, and configuration changes get 90 days to 1 year depending on compliance requirements. HIPAA and SOX environments often require 1 year. Compliance scan results get 1 year minimum for audit evidence. I also differentiate between online and archive retention: the last 30 days are online in Operations-Logs for fast searching, and older data is archived to cheaper storage for compliance retrieval. Storage planning formula: approximately 1 GB per day per 10 hosts for 30-day online retention as a starting estimate, then adjust based on actual volume."
Q38: "How would you investigate a suspected unauthorized access using logs?"
SAY THIS: "Four-step investigation. Step one, identify the scope: search for all authentication events for the suspected user or IP address across all log sources for the last 30 days. Note which systems were accessed, from which source IPs, at what times. Step two, identify anomalies: compare the activity pattern to the user's normal behavior. Unusual source IPs, unusual times like 3 AM access, unusual systems that the user does not normally touch, or privilege escalation events are all flags. Step three, trace actions: for each session identified, search for what the user did after logging in. In vCenter, check task and event logs. In ESXi, check shell.log for commands executed. In NSX, check for firewall rule changes. Step four, document evidence: compile a timeline with log evidence for each finding, including source log name, timestamp, and exact log entry. Preserve the logs by exporting the relevant entries before any retention policy removes them."
Q39: "How do you validate that Operations-Logs is working correctly?"
SAY THIS: "Five validation checks I run weekly. First, ingestion rate: I check the log ingestion dashboard to verify the rate is consistent with the baseline. A sudden drop means a source stopped sending. A sudden spike could mean a log storm from a misbehaving component. Second, source inventory: I verify all expected sources are actively sending. I maintain a source checklist and compare against the connected sources list. Third, latency: I check the time between when a log is generated and when it appears in Operations-Logs. More than a few minutes of delay impacts real-time troubleshooting. Fourth, query performance: I run my standard saved queries and verify they return results in a reasonable time. Slow queries may indicate storage or indexing issues. Fifth, alert function: I verify that log-based alerts are firing correctly by checking the alert history against known events."
Q40: "What would you put on a security-focused log dashboard?"
SAY THIS: "Five widget groups. First row: authentication failure heatmap showing failed logins by time and source, with a count of unique usernames and source IPs. This is my brute force detection surface. Second row: privileged access tracking showing SSH sessions to ESXi hosts, direct console logins, and service account usage. Third row: configuration change audit showing firewall rule modifications, permission changes, and role assignments in vCenter and NSX. Fourth row: certificate and password events showing expiration warnings, rotation events, and failed certificate validations. Fifth row: log ingestion health for security sources specifically, because if an attacker can disable logging on a system, they can operate without visibility. Each widget has alerting behind it so I am notified of high-risk patterns in real time, not just when I look at the dashboard."
BEHAVIORAL / "TELL ME ABOUT A TIME" (10 Questions)
Q41: "Tell me about a time you had to explain a complex technical issue to a non-technical audience."
SAY THIS: "Our CIO asked why we needed to purchase additional hosts for the production cluster. Instead of showing CPU and memory charts, I built an executive dashboard in VCF Operations showing two things: a capacity runway indicator showing we had approximately 60 days before CPU exhaustion at current growth, and a business impact statement: 'After this date, we cannot onboard the 3 new applications on the project roadmap.' I also showed that right-sizing oversized VMs would buy us 30 additional days but would not eliminate the need for new hardware. By presenting it as a business timeline with a clear decision point rather than a technical metrics discussion, the CIO approved the purchase in the same meeting. The lesson I took away is that executives do not need to understand CPU percentages; they need to understand business risk and timelines."
Q42: "Tell me about a time you reduced alert noise in a monitoring environment."
SAY THIS: "When I took over the VCF Operations environment, there were over 300 active alerts at any given time. The NOC team had developed alert fatigue and was ignoring everything, which meant real issues were being missed. I conducted a two-week audit. I categorized every alert definition into four buckets: actionable as-is, actionable with threshold adjustment, informational only, and not useful. I found that 60% of the noise came from 12 alert definitions with thresholds that were too sensitive for our environment. I adjusted those thresholds based on 30 days of metric history, raising them to above the 95th percentile of normal operation. I disabled 15 alert definitions that had never resulted in human action. After the cleanup, active alert count dropped from over 300 to under 40, and every alert that fired was associated with a runbook entry. Within a month, the NOC team was responding to alerts again because they trusted that each one was meaningful."
Q43: "Tell me about a time you automated a manual process."
SAY THIS: "Every Monday morning, an engineer spent 45 minutes manually pulling capacity data from VCF Operations and formatting it into a weekly report for the ops meeting. I wrote a Python script that authenticates to the VCF Operations API, pulls cluster resource data, calculates capacity remaining and days to exhaustion, identifies oversized VMs and old snapshots, formats everything into a report, and emails it to the team at 6 AM Monday. The script took me about 4 hours to build and test. It has run every week without intervention since deployment. The engineer who used to create the report manually now spends that time on actual operations work. The key design decisions were: credentials stored in environment variables, all operations are read-only, the script is idempotent, and it includes error handling that sends a different email to me if the script fails, so I know to investigate."
Q44: "Tell me about a time you prevented an outage."
SAY THIS: "During my daily checks, I noticed a capacity remaining alert on one of our production datastores. It was at 12% free space and trending down. I investigated and found that a developer had taken a VM snapshot two weeks ago and forgotten about it. The snapshot was growing by approximately 8 GB per day. At that rate, the datastore would have run out of space in about 5 days, which would have caused every VM on that datastore to pause, affecting around 40 production VMs. I contacted the developer, confirmed the snapshot was no longer needed, consolidated it during a maintenance window, and reclaimed 95 GB. I then automated a weekly snapshot report that identifies all snapshots older than 7 days and notifies the VM owners. Since implementing that automation, we have not had another runaway snapshot situation."
Q45: "Tell me about a time you disagreed with a team member's approach."
SAY THIS: "A colleague wanted to set up monitoring alerts that paged the on-call engineer for every vMotion event, arguing that unexpected vMotion could indicate DRS misconfiguration. I disagreed because vMotion is a normal, expected operation in a DRS-enabled cluster, and alerting on every occurrence would create massive noise. Instead, I proposed alerting on the symptom of a DRS problem: if a VM migrates more than 3 times in 1 hour, that could indicate a DRS thrashing condition, which is actually worth investigating. We tested both approaches in a dev environment. My colleague's approach generated over 200 alerts per day. My approach generated zero alerts during normal operation and correctly fired when we simulated a DRS misconfiguration. We went with the symptom-based approach. The lesson was that monitoring should alert on conditions that require human intervention, not on normal operations."
Q46: "Tell me about a time you had to learn a new technology quickly."
SAY THIS: "When our team adopted VCF 9, I needed to get up to speed on VCF Operations quickly because I was the designated monitoring engineer. I built a structured 30-day learning plan. Week 1, I focused on the UI, object model, and building dashboards. Week 2, I went deeper into alerts, capacity, compliance, and role-based dashboards. Week 3, I deployed Operations-Logs and built log dashboards. Week 4, I learned the API, built automation scripts, and conducted mock presentations. Each week had daily objectives with deliverables, and I ran timed incident simulations on Fridays to build muscle memory under pressure. By day 30, I could build and present dashboards, triage issues using a structured runbook, query logs to prove root cause, and automate health reporting via the API. The key was treating learning as a project with milestones, not just reading documentation."
Q47: "Tell me about a time you improved a process."
SAY THIS: "Our incident response process had no standard triage sequence. Different engineers started in different places, and incidents took varying amounts of time based on who was on call. I created a triage runbook with a fixed 15-minute structure: scope assessment in minutes 0-2, alert investigation in minutes 2-5, object trace in minutes 5-8, metric analysis in minutes 8-12, and documentation plus escalation decision in minutes 12-15. I built a VCF Operations dashboard specifically designed to support this workflow, with widgets ordered to match the runbook steps. After training the team and running practice drills, our average initial triage time dropped from 25 minutes to under 15 minutes, and the quality of our incident notes improved because everyone was documenting the same information in the same format."
Q48: "Tell me about a time you handled a high-pressure situation."
SAY THIS: "At 2 AM, I was paged for a P1 alert: a production ESXi host had become unresponsive, and 12 VMs serving customer-facing applications were down. I followed my triage runbook. First, I confirmed the scope in VCF Operations: one host down, 12 VMs affected, no other hosts in the cluster impacted. Second, I verified HA was attempting to restart the VMs on other hosts but some were failing due to insufficient capacity. Third, I manually powered on the highest priority VMs by temporarily suspending two non-critical VMs to free capacity. Fourth, I contacted the hardware vendor because the host failure appeared to be a hardware issue. Within 20 minutes of being paged, 10 of 12 VMs were back online on other hosts. The remaining 2 came back when I freed additional capacity. I documented the full timeline in my RCA and added a capacity buffer recommendation to prevent the HA capacity constraint from recurring."
Q49: "What is your biggest weakness in operations?"
SAY THIS: "Earlier in my career, I would spend too long on root cause analysis during an active incident instead of focusing on service restoration first. I learned that the priority during an incident is restoring service, not understanding why it broke. Now I follow a strict separation: during the incident, I focus exclusively on getting users back to working state. After the incident is resolved, I do the thorough root cause analysis. My triage runbook enforces this by having a decision point at minute 15: 'Can I fix this in the next 15 minutes? If not, what is the fastest path to service restoration while I continue investigating?' This has made me faster at resolving impact even when the root cause takes longer to identify."
Q50: "Why do you want this role?"
SAY THIS: "I want to work in a VCF environment at scale because it combines the three things I am most effective at: building monitoring that tells a story instead of just showing charts, automating operational tasks so the team can focus on improvement instead of repetition, and bringing discipline to operations through structured runbooks, alert taxonomy, and capacity planning. I have built this skill set specifically for VCF Operations - dashboards, API automation, log analysis, and operational storytelling. I want to bring that capability to a team that values operational excellence and gives me the opportunity to make the infrastructure more observable, more automated, and more resilient."
D. WHITEBOARD PROMPTS (WITH COMPLETE SOLUTIONS)
Whiteboard 1: "Design an Ops dashboard for VM performance triage"
DRAW THIS ON THE WHITEBOARD:
+-----------------------------+-----------------------------+
| VM SELECTOR | VM HEALTH BADGE |
| (Object List - VMs) | (Scoreboard - Health) |
| [Click a VM to update | Shows: Green/Yellow/Red |
| all widgets below] | |
+-----------------------------+-----------------------------+
| CPU USAGE (Line Chart) | MEMORY USAGE (Line Chart) |
| Last 24h, CPU Usage % | Last 24h, Memory Active % |
| + CPU Ready Time overlay | + Balloon/Swap overlay |
+-----------------------------+-----------------------------+
| DISK LATENCY (Line Chart) | NETWORK THROUGHPUT |
| Last 24h, Read/Write ms | Last 24h, TX/RX KBps |
| Red line at 20ms | + Dropped packets overlay |
+-----------------------------+-----------------------------+
| RELATED ALERTS (Table) FULL WIDTH |
| Alerts for selected VM, sorted by time, newest first |
+-----------------------------------------------------------+
INTERACTIONS:
VM Selector → drives all other widgets
Click any VM → Health, CPU, Memory, Disk, Network, Alerts all update
WHY THIS DESIGN:
- Operator selects a VM at the top (one click)
- Immediately sees health + 4 key metrics + related alerts
- Red lines on charts show thresholds (e.g., 20ms disk latency)
- No page navigation needed - everything is on one dashboard
SAY THIS WHILE DRAWING: "I design the dashboard around the triage workflow. The operator has a VM name from a user complaint. They select it in the top-left selector. Every widget below updates via interactions. They immediately see health, CPU with ready time overlay, memory with balloon/swap overlay, disk latency, network throughput, and related alerts. The CPU ready time and memory balloon metrics are the ones users actually feel, so I overlay them on the usage charts. I put a red threshold line at 20ms on disk latency because that is where users start noticing storage slowness. Alerts at the bottom provide context for what VCF Operations already knows about this VM."
Whiteboard 2: "Design a log dashboard for auth failures and suspicious activity"
DRAW THIS:
+-----------------------------+-----------------------------+
| AUTH FAILURE HEATMAP | FAILURE COUNT BY USER |
| (X: Time, Y: Source) | (Bar Chart - Top 10) |
| Color = failure count | Sorted by count |
+-----------------------------+-----------------------------+
| FAILURE DETAIL TABLE FULL WIDTH |
| Columns: Timestamp | Username | Source IP | Target | |
| Message | Sorted: newest first |
+-----------------------------------------------------------+
| BRUTE FORCE INDICATOR | SOURCE IP GEO |
| (Scoreboard) | (Table - unique IPs) |
| "5+ failures same user | IP | Country | Failures |
| in 10 min = RED" | |
+-----------------------------+-----------------------------+
| PRIVILEGED ACCESS LOG FULL WIDTH |
| SSH sessions, root logins, service account usage |
+-----------------------------------------------------------+
ALERTS BEHIND THIS DASHBOARD:
- 5+ failures same user in 10 min → P2 alert
- SSH to ESXi host outside maintenance window → P1 alert
- New source IP for service account → P2 alert
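If asked how the brute-force rule actually works, a small offline sketch is a useful talking point. The following is a hypothetical PowerShell check of the "5+ failures, same user, in 10 minutes" logic against an exported CSV; the file name and its Timestamp/Username columns are illustrative, not a product export format:
# Hypothetical offline check of the "5+ failures, same user, 10 min" rule.
# Assumes failures were exported to auth_failures.csv with Timestamp and
# Username columns (both are illustrative).
$events = Import-Csv "auth_failures.csv"
$events | Group-Object Username | ForEach-Object {
    $times = $_.Group | ForEach-Object { [datetime]$_.Timestamp } | Sort-Object
    # Slide a 5-event window over each user's sorted failure times
    for ($i = 0; $i + 4 -lt $times.Count; $i++) {
        if (($times[$i + 4] - $times[$i]).TotalMinutes -le 10) {
            [pscustomobject]@{ User = $_.Name; WindowStart = $times[$i] }
            break   # one flag per user is enough
        }
    }
}
The alert definitions behind the dashboard express the same sliding-window idea.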
Whiteboard 3: "Explain VCF Operations API auth and where Swagger lives"
DRAW THIS:
CLIENT (Postman/Python/cURL)
|
| 1. POST /suite-api/api/auth/token/acquire
| Body: {"username":"admin","password":"***","authSource":"local"}
v
VCF OPERATIONS API
|
| 2. Returns: {"token": "abc123...", "validity": 21600}
v
CLIENT stores token
|
| 3. GET /suite-api/api/alerts
| Header: Authorization: OpsToken abc123...
| Header: Accept: application/json
v
VCF OPERATIONS API
|
| 4. Returns: JSON alert data
v
CLIENT processes response
|
| 5. POST /suite-api/api/auth/token/release
| Header: Authorization: OpsToken abc123...
v
TOKEN INVALIDATED
SWAGGER UI LOCATION:
https://<ops-fqdn>/suite-api/doc/swagger-ui.html
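To back the diagram with something concrete, here is a minimal PowerShell sketch of the same five steps. It assumes PowerShell 7+ (for -SkipCertificateCheck, appropriate only for lab self-signed certificates) and a placeholder host name:
# Minimal sketch of the five steps above (PowerShell 7+; lab use only)
$opsHost = "vrops.lab.local"    # placeholder
$cred = Get-Credential          # prompt instead of hardcoding credentials
# Steps 1-2: acquire a token
$body = @{
    username   = $cred.UserName
    password   = $cred.GetNetworkCredential().Password
    authSource = "local"
} | ConvertTo-Json
$auth = Invoke-RestMethod -Method Post -ContentType "application/json" `
    -Uri "https://$opsHost/suite-api/api/auth/token/acquire" `
    -Body $body -SkipCertificateCheck
# Steps 3-4: authenticated call with the OpsToken header
$headers = @{ Authorization = "OpsToken $($auth.token)"; Accept = "application/json" }
$alerts = Invoke-RestMethod -Uri "https://$opsHost/suite-api/api/alerts" `
    -Headers $headers -SkipCertificateCheck
"Alerts returned: $($alerts.alerts.Count)"
# Step 5: release the token
Invoke-RestMethod -Method Post -Uri "https://$opsHost/suite-api/api/auth/token/release" `
    -Headers $headers -SkipCertificateCheck | Out-Null
Prompting with Get-Credential keeps credentials out of the script, which is also what the automation-maturity rubric below looks for.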
E. MOCK INTERVIEW SCORING RUBRIC
SCORING RUBRIC (rate each 1-10)
=================================
1. SIGNAL OVER NOISE /10
- Does the candidate choose the right metrics to monitor?
- Can they explain WHY a metric matters, not just name it?
- Do they filter alerts effectively (P1-P4 taxonomy)?
Score 8+: Explains metric selection with business impact reasoning
Score 5-7: Names correct metrics but weak on the "why"
Score <5: Lists metrics randomly without prioritization
2. ORDER OF OPERATIONS /10
- Does the candidate triage in a structured sequence?
- Do they assess scope before diving into detail?
- Do they check blast radius (Relationships) before acting?
Score 8+: Has a repeatable triage framework, explains each step
Score 5-7: Reasonable sequence but some steps out of order
Score <5: Jumps to conclusions without systematic investigation
3. STORYTELLING /10
- Can they convert raw metrics into executive narrative?
- Do they include impact + prevention, not just "what happened"?
- Can they present a 5-minute briefing coherently?
Score 8+: Clear timeline, business impact, prevention actions
Score 5-7: Tells the story but misses impact or prevention
Score <5: Just describes technical facts without narrative
4. AUTOMATION MATURITY /10
- Do they automate safely (read-only first, idempotent)?
- Can they explain when NOT to automate?
- Do they keep credentials out of code?
Score 8+: Demonstrates safe automation principles with examples
Score 5-7: Automates but weak on safety considerations
Score <5: Scripts without testing or safety considerations
5. DASHBOARD DESIGN /10
- Are dashboards audience-appropriate (NOC vs Exec)?
- Do they use widget interactions for drill-down?
- Is every widget actionable (no decoration)?
Score 8+: Designs decision dashboards with clear audience targeting
Score 5-7: Builds functional dashboards but not audience-tuned
Score <5: Dashboards are just metric displays
6. API COMFORT /10
- Can they explain token auth flow without hesitation?
- Do they know where Swagger UI lives?
- Can they describe a practical API workflow?
Score 8+: Explains auth + demonstrates API workflow confidently
Score 5-7: Knows the basics but hesitates on details
Score <5: Cannot describe the auth flow
7. LOG MASTERY /10
- Can they pivot from symptoms to log proof?
- Do they know the critical log sources?
- Can they build a query path with correlated timestamps?
Score 8+: Demonstrates a clear symptom → correlation → proof path
Score 5-7: Can search logs but weak on correlation
Score <5: Cannot describe a structured log investigation
8. OVERALL CONFIDENCE /10
- Do they present without hesitation?
- Are answers structured (not rambling)?
- Do they admit knowledge gaps honestly?
Score 8+: Presents fluently, structured answers, honest about gaps
Score 5-7: Some hesitation but generally competent
Score <5: Reads from notes or rambles without structure
TOTAL: /80
70+: Ready for interviews - schedule them
55-69: Need focused practice on weak areas (1-2 more weeks)
Below 55: Continue the 30-day plan, repeat weeks 3-4
VCF OPERATIONS LAB WORKBOOK - COMPLETE HANDS-ON GUIDE
Step-by-Step | Validation Checks | What Could Go Wrong
LAB TRACK A - DASHBOARDS (VCF Operations)
LAB A1: CREATE & CONFIGURE A DASHBOARD
Objective: Create a dashboard from scratch with multiple widgets for cluster health visibility.
Prerequisites: VCF Operations accessible in a browser; a user account with dashboard create permissions.
Estimated Time: 45 minutes
STEP-BY-STEP:
Step 1: Navigate to Dashboard Creation
Log in at https://<ops-fqdn>/ui as admin (or your admin account) with [your password], then open Infrastructure Operations > Dashboards & Reports and create a new dashboard.
Step 2: Name the Dashboard
Name: Lab A1 - Cluster Health Overview
Step 3: Add Widget 1 - Health Scoreboard
Title: Cluster Health (shows health status of all clusters)
Object Type: Cluster Compute Resource
Metric: Badge|Health (or Badge > Health); use the Health display mode
Step 4: Add Widget 2 - Top VMs by CPU
Title: Top 10 VMs - CPU Usage
Object Type: Virtual Machine
Metric: CPU|Usage (%) (or CPU > Usage Percent); Top-N count: 10
Step 5: Add Widget 3 - Active Alerts (Alert List widget)
Title: Active Critical & Immediate Alerts
Status: Active; Severity: Critical and Immediate (uncheck Warning, Information)
Step 6: Add Widget 4 - Capacity Scoreboard
Title: Cluster Capacity Remaining
Object Type: Cluster Compute Resource
Metric: Badge|Capacity Remaining (or Summary > Capacity Remaining Percent)
Step 7: Arrange Widgets
Step 8: Save the Dashboard
VALIDATION CHECKLIST:
[ ] Dashboard appears in Dashboards & Reports list with name "Lab A1 - Cluster Health Overview"
[ ] Cluster Health widget displays colored badges (green/yellow/orange/red) for clusters
[ ] Top 10 VMs widget shows VM names with CPU usage percentages
[ ] Alert List widget shows active alerts (or "No alerts" if none exist)
[ ] Capacity Remaining widget shows percentage values for clusters
[ ] All 4 widgets are visible without scrolling (arranged properly)
[ ] Dashboard loads within 10 seconds
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| No data in widgets | Widgets show "No data available" or are empty | Check that vCenter adapter is connected and collecting data. Go to Administration > check adapter/solution status > verify the vCenter adapter shows "Collecting" status. Data may take up to 10 minutes after adapter configuration. |
| Cannot find widget types | Widget panel does not show expected widget types | You may have a different version. Try searching in the widget panel. Some versions use "Views" instead of "Widgets." Check if you are in edit mode (not view mode). |
| Permission denied on Create | Create button is grayed out or missing | Your user role does not have dashboard create permissions. Contact your admin to assign the appropriate role (at minimum: Content Admin or PowerUser). |
| Dashboard does not save | Error message on save | Check if the dashboard name contains special characters. Try a simpler name. Also check if you have exceeded the dashboard limit (if one exists in your environment). |
| Widgets show wrong data | Data appears but does not match expected objects | Check the Object Type / Resource Kind setting in each widget. If you selected "Host System" instead of "Cluster Compute Resource," you will see host data instead of cluster data. Edit the widget and correct the object type. |
| Slow dashboard loading | Dashboard takes >30 seconds to render | Reduce the time range (try Last 1 Hour instead of Last 24 Hours). Reduce Top-N count from 10 to 5. Check if VCF Operations appliance is under resource pressure. |
JOURNAL PROMPT:
Write down: What was the hardest part of creating this dashboard? What would you add if you had more time? What question does this dashboard answer?
LAB A2: MANAGE DASHBOARDS (CLONE, EDIT, FAVORITES)
Objective: Clone a prebuilt dashboard, edit it, pin favorites, use recents.
Prerequisites: Lab A1 completed. At least one prebuilt dashboard exists in the system.
Estimated Time: 30 minutes
STEP-BY-STEP:
Step 1: Find and Clone a Prebuilt Dashboard
Open Dashboards & Reports, select any prebuilt dashboard, and choose Clone (or Save As) from its menu.
Step 2: Rename and Edit the Cloned Dashboard
New name: NOC - Cluster Performance. Add or remove at least one widget, then save.
Step 3: Set Favorites
Star the Lab A1 - Cluster Health Overview dashboard in the list.
Star the NOC - Cluster Performance dashboard.
Step 4: Verify Recents
Step 5: Manage Dashboard (Share/Export)
Locate the Share and Export options for the Lab A1 - Cluster Health Overview dashboard.
VALIDATION CHECKLIST:
[ ] Cloned dashboard exists with name "NOC - Cluster Performance"
[ ] Cloned dashboard has been edited (at least one widget added or removed)
[ ] Lab A1 dashboard shows a filled/solid star (favorited)
[ ] NOC dashboard shows a filled/solid star (favorited)
[ ] Favorites filter shows exactly the 2 favorited dashboards
[ ] Recents section shows the dashboards you recently opened
[ ] You found the Share and Export options (even if you did not use them)
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| Clone option missing | No Clone in the menu | Some prebuilt/system dashboards may not be clonable. Try a different dashboard. Or look for "Save As" instead of "Clone." |
| Cannot favorite | Star icon not visible | Check your permissions. Some read-only roles may not allow favorites. Also, the star may be very small - look carefully next to the dashboard name. |
| Manage Dashboards not found | No Manage option in the nav | In some versions, Manage is under Infrastructure Operations > Dashboards & Reports > Manage. Or it may be a sub-menu under the Dashboards top-level nav. |
LAB A3: CONFIGURE WIDGET INTERACTIONS
Objective: Make a dashboard dynamic by connecting widgets with interactions.
Prerequisites: Lab A1 completed (dashboard with 4 widgets).
Estimated Time: 30 minutes
STEP-BY-STEP:
Step 1: Open Your Dashboard for Editing
Open Lab A1 - Cluster Health Overview in edit mode.
Step 2: Configure Interaction from Health Scoreboard to VM List
In the dashboard's Widget Interactions settings, set the Cluster Health scoreboard as the selected-object source that drives the Top 10 VMs widget (repeat for the Alert List widget).
Step 3: Test the Interactions
VALIDATION CHECKLIST:
[ ] Clicking a cluster in Health widget updates the VM list to show only that cluster's VMs
[ ] Clicking a cluster updates the Alert List to show only that cluster's alerts
[ ] Clicking a different cluster changes all widgets to the new selection
[ ] Selecting "all" or clearing selection restores all widgets to full data view
[ ] Interactions persist after saving and reopening the dashboard
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| No interaction option | Cannot find Widget Interactions in the menu | Make sure you are in edit mode. Some widget types may not support being an interaction source. The Scoreboard widget should support it. Check documentation for your specific version. |
| Interaction not working | Click a cluster but other widgets do not change | Verify the interaction type is set to "Selected Object" not "Selected Resource" or another type. Also verify the target widgets are configured to accept the object type (Cluster Compute Resource). |
| Target widget shows "No data" after selection | Widget goes blank when filtering | The target widget may be configured for a different object type that does not relate to clusters. For example, if the alert list is filtered to only show host alerts, selecting a cluster may not return matching results. Remove the severity filter temporarily to test. |
LAB TRACK B - API (VCF Operations)
LAB B1: FIND SWAGGER UI & EXPLORE ENDPOINTS
Objective: Locate API documentation and execute a simple call via Swagger UI.
Prerequisites: VCF Operations instance accessible via browser.
Estimated Time: 30 minutes
STEP-BY-STEP:
Step 1: Open Swagger UI
Open https://<ops-fqdn>/suite-api/doc/swagger-ui.html (e.g., https://vrops.lab.local/suite-api/doc/swagger-ui.html).
Step 2: Explore the API Categories
Locate these endpoints:
POST /api/auth/token/acquire - Get a token
POST /api/auth/token/release - Release a token
Step 3: Authenticate via Swagger UI
Expand the endpoint below, click Try it out, and submit this body:
POST /api/auth/token/acquire
{
"username": "admin",
"password": "YourActualPassword",
"authSource": "local"
}
200"token" fieldStep 4: Make an Authenticated API Call
GET /api/alertsOpsToken (with a space between OpsToken and the token)200"alerts" arrayStep 5: Try Another Endpoint
GET /api/resourcesOpsToken resourceKind = ClusterComputeResourceVALIDATION CHECKLIST:
[ ] Swagger UI page loads at /suite-api/doc/swagger-ui.html
[ ] Token acquired successfully (200 response with token value)
[ ] GET /api/alerts returns 200 with alert data
[ ] GET /api/resources returns 200 with resource data
[ ] You can identify at least 5 API endpoint categories
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| Swagger UI page not found | 404 error | The URL path may differ in your version. Try: `/suite-api/doc/swagger-ui/` (with trailing slash) or `/suite-api/docs/`. Check the VCF Operations documentation for the exact path for your version. |
| Authentication fails | 401 or 403 response | Check username/password spelling. Verify authSource is correct ("local" for local accounts). Check that the account is not locked. Try logging into the UI with the same credentials to verify they work. |
| Token not accepted | 401 on subsequent calls | Make sure you include "OpsToken " (with the space) before the token value. Make sure the token has not expired. Copy the full token string without any extra whitespace or line breaks. |
| SSL/TLS error | Browser warns about certificate | If using a self-signed certificate, accept the browser warning and proceed. For curl, add the -k flag. |
LAB B2: TOKEN AUTH WORKFLOW (POSTMAN)
Objective: Acquire a token and make authenticated calls in Postman.
Prerequisites: Postman installed. VCF Operations instance accessible.
Estimated Time: 45 minutes
STEP-BY-STEP:
Step 1: Install Postman (if not installed)
Download from https://www.postman.com/downloads/ and install.
Step 2: Create an Environment
Name the environment VCF Ops Lab and add these variables:
| Variable | Initial Value | Current Value | Type |
|---|---|---|---|
| ops_host | vrops.lab.local | vrops.lab.local | default |
| ops_user | admin | admin | default |
| ops_pass | YourPassword | YourPassword | secret |
| ops_token | (leave empty) | (leave empty) | default |
Select VCF Ops Lab from the environment dropdown.
Step 3: Create a Collection
Name: VCF Ops - Core, with three folders: Auth, Alerts, Resources.
Step 4: Create the Token Acquire Request
In the Auth folder > Add Request; name it Acquire Token.
Method and URL: POST https://{{ops_host}}/suite-api/api/auth/token/acquire
Headers: Content-Type = application/json; Accept = application/json
Body (raw JSON):
{
"username": "{{ops_user}}",
"password": "{{ops_pass}}",
"authSource": "local"
}
Tests tab script (saves the token to the environment for later requests):
// Postman test script: on a successful auth response, persist the token
if (pm.response.code === 200) {
    var jsonData = pm.response.json();
    pm.environment.set("ops_token", jsonData.token);
    console.log("Token saved successfully");
}
Step 5: Send the Auth Request
Expect 200 OK; the ops_token variable should now have a value.
Step 6: Create and Send an Alerts Request
In the Alerts folder > Add Request; name it List Active Alerts.
Method and URL: GET https://{{ops_host}}/suite-api/api/alerts
Headers: Authorization = OpsToken {{ops_token}}; Accept = application/json
Expect 200 OK with JSON alert data.
Step 7: Create and Send a Resources Request
In the Resources folder > Add Request; name it List Clusters.
Method and URL: GET https://{{ops_host}}/suite-api/api/resources?resourceKind=ClusterComputeResource
Headers: Authorization = OpsToken {{ops_token}}; Accept = application/json
Expect 200 OK with cluster resource data.
Step 8: Create Token Release Request
In the Auth folder > Add Request; name it Release Token.
Method and URL: POST https://{{ops_host}}/suite-api/api/auth/token/release
Header: Authorization = OpsToken {{ops_token}}
VALIDATION CHECKLIST:
[ ] Postman environment "VCF Ops Lab" created with 4 variables
[ ] Collection "VCF Ops - Core" created with 3 folders
[ ] Acquire Token request returns 200 and saves token to environment
[ ] List Active Alerts request returns 200 with alert JSON
[ ] List Clusters request returns 200 with resource JSON
[ ] Release Token request is saved and ready to use
[ ] Token auto-populates in all requests via {{ops_token}} variable
LAB TRACK C - LOGS (VCF Operations for Logs + Analyze)
VCF 9.x CONTEXT: In VCF 9.0, log analysis is integrated into VCF Operations under Infrastructure Operations > Analyze. This section provides log search, saved queries, extracted fields, event types, trend analysis, and side-by-side query comparison. It requires the VCF Operations for Logs component to be deployed. Log data is standardized to RFC 5424 format.
LAB C1: DEPLOY VCF OPERATIONS FOR LOGS
Objective: Deploy the VCF Operations for Logs component. Understand all required inputs.
Prerequisites: an allocated static IP and FQDN with DNS A and PTR records created; a reachable NTP server; admin access to VCF Operations.
Estimated Time: 2 hours (including deployment wait time)
STEP-BY-STEP:
Step 1: Verify Prerequisites
nslookup ops-logs.yourdomain.com
Expected result: resolves to your allocated IP (e.g., 10.0.0.50)
If it fails: create the A record in your DNS server before proceeding
nslookup 10.0.0.50
Expected result: resolves to ops-logs.yourdomain.com
If it fails: create the PTR record in your DNS server before proceeding
ping ntp.yourdomain.com
Expected result: successful ping replies
ping 10.0.0.50
Expected result: "Request timed out" (no response = IP is available)
DEPLOYMENT INPUTS:
FQDN: ops-logs.yourdomain.com
IP Address: 10.0.0.50
Subnet Mask: 255.255.255.0
Gateway: 10.0.0.1
DNS Server(s): 10.0.0.10
NTP Server(s): 10.0.0.11
Admin Password: [your chosen password - minimum 8 chars, complexity required]
Deployment Size: [Small/Medium/Large based on your host count]
Step 2: Log into VCF Operations
https://<ops-fqdn>/ui (Note: In VCF 9.0, SDDC Manager UI is deprecated. Fleet and lifecycle operations are now in VCF Operations.)
Step 3: Navigate to Operations for Logs Deployment
(Navigation may vary by version - in some deployments, you may still use a separate deployment workflow)
Step 4: Fill in Deployment Form
FQDN: ops-logs.yourdomain.com
IP Address: 10.0.0.50
Subnet Mask: 255.255.255.0
Gateway: 10.0.0.1
DNS Server: 10.0.0.10
NTP Server: 10.0.0.11
Admin Password: [your password] (enter twice to confirm)
Step 5: Review and Submit
Step 6: Monitor Deployment
Step 7: Post-Deployment Validation
Open https://ops-logs.yourdomain.com and confirm the UI loads and you can log in.
Step 8: Configure First Log Source (ESXi)
Set the ESXi advanced setting Syslog.global.logHost to tcp://ops-logs.yourdomain.com:514 (a PowerCLI sketch for doing this across all hosts follows the troubleshooting table below).
VALIDATION CHECKLIST:
[ ] DNS forward record resolves correctly (nslookup FQDN returns correct IP)
[ ] DNS reverse record resolves correctly (nslookup IP returns correct FQDN)
[ ] VCF Operations UI (Infrastructure Operations > Analyze for log analysis) loads at https://<ops-logs-fqdn>
[ ] Admin login works with the configured password
[ ] Main dashboard renders without errors
[ ] At least one log source is configured and sending data
[ ] New log entries appear in the real-time/recent log view
[ ] No certificate warnings in the browser (or expected self-signed warning)
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| Deployment fails - DNS error | Error mentions "cannot resolve hostname" | Verify DNS records are correct. From SDDC Manager, run nslookup to test. Both forward and reverse must work. |
| Deployment fails - IP conflict | Error mentions "IP already in use" | Ping the IP to verify. Check ARP tables. If another device is using it, allocate a different IP. |
| Deployment fails - resources | Error mentions "insufficient resources" | Check the target cluster has enough CPU (4+ vCPU), Memory (16+ GB), and Storage (500+ GB) for the deployment size. |
| UI does not load after deploy | Browser timeout or connection refused | Wait 10 more minutes - services may still be starting. Check if the VM is powered on in vCenter. Check if the IP is reachable (ping). |
| No logs flowing | Ingestion shows 0 events | Verify syslog forwarding is configured on at least one host. Verify the Operations-Logs appliance firewall allows port 514. Check that the syslog protocol matches (TCP vs UDP). |
| Authentication fails at UI | "Invalid credentials" error | You may have mistyped the password during deployment. Check if there is a default admin password in the documentation. Worst case: redeploy with correct password. |
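For Step 8 above, applying the syslog setting host by host does not scale. A hedged PowerCLI sketch, assuming the PowerCLI setup from Lab D3 and an active vCenter connection (test on one host before running fleet-wide):
# Point every host's syslog at the Operations for Logs appliance
$logHost = "tcp://ops-logs.yourdomain.com:514"
Get-VMHost | ForEach-Object {
    Get-AdvancedSetting -Entity $_ -Name "Syslog.global.logHost" |
        Set-AdvancedSetting -Value $logHost -Confirm:$false
    # Reload syslog and allow outbound syslog through the ESXi firewall
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $esxcli.system.syslog.reload.Invoke()
    $esxcli.network.firewall.ruleset.set.Invoke(@{ rulesetid = "syslog"; enabled = $true })
}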
LAB C2: USE PREBUILT CONTENT PACKS
Objective: Install content packs and pin the most useful dashboards.
Prerequisites: Lab C1 completed. Operations-Logs deployed and receiving log data.
Estimated Time: 30 minutes
STEP-BY-STEP:
Step 1: Navigate to Content Packs
Step 2: Install Relevant Content Packs
Step 3: Explore Installed Dashboards
Step 4: Pin Your Top 10
VALIDATION CHECKLIST:
[ ] At least 2 content packs installed
[ ] New dashboards visible in the dashboard list
[ ] At least 5 dashboards have data (logs flowing for those sources)
[ ] Favorite/pinned dashboards are saved and accessible via Favorites filter
[ ] You can describe what each pinned dashboard is for
LAB TRACK D - AUTOMATION (OpenAPI / SDK / PowerCLI)
LAB D1: GENERATE A POSTMAN COLLECTION FROM OPENAPI
Objective: Import an OpenAPI spec to generate a Postman collection automatically.
Prerequisites: Postman installed. OpenAPI spec file available (JSON or YAML).
Estimated Time: 30 minutes
STEP-BY-STEP:
Step 1: Obtain the OpenAPI Spec
Download the OpenAPI spec (JSON or YAML) from your VCF Operations instance; the exact URL varies by version, so check the Swagger UI page or the product documentation for the spec link.
Step 2: Import into Postman
Step 3: Explore the Generated Collection
Step 4: Configure Auth for the Generated Collection
At the collection level, set an Authorization header with value OpsToken {{ops_token}} so every generated request inherits it.
VALIDATION CHECKLIST:
[ ] OpenAPI spec file obtained (JSON or YAML)
[ ] File imported into Postman
[ ] Collection created with multiple folders and endpoints
[ ] Authorization configured at the collection level
[ ] At least one endpoint tested successfully (after running token acquire)
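Optionally, sanity-check the spec from the command line before importing it. A PowerShell 7+ sketch, assuming a JSON spec; the spec path is a placeholder, since the exact URL varies by version:
# Hypothetical sketch - substitute the spec URL you found for your version
$opsHost = "vrops.lab.local"    # placeholder
$spec = Invoke-RestMethod -Uri "https://$opsHost/<path-to-openapi-spec>" -SkipCertificateCheck
# Count and preview the operations the generated collection will contain
$paths = $spec.paths.PSObject.Properties.Name
"Spec defines $($paths.Count) paths"
$paths | Select-Object -First 10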
LAB D2: EXPLORE VCF SDK SAMPLES (PYTHON)
Objective: Locate SDK samples and run a basic example.
Prerequisites: Python 3.8+ installed. pip available.
Estimated Time: 45 minutes
STEP-BY-STEP:
Step 1: Set Up Python Environment
mkdir vcf-ops-automation && cd vcf-ops-automation
python -m venv venv
venv\Scripts\activate          (Windows)
source venv/bin/activate       (Linux/macOS)
pip install requests
pip install vcf-sdk
(If this package does not exist or is named differently, proceed with raw requests)
Step 2: Create a Basic Auth Script
hello_vcf_api.py
#!/usr/bin/env python3
"""Basic VCF Operations API authentication test."""
import requests
import os
import sys
import urllib3
# Suppress SSL warnings for lab (remove in production)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Read configuration from environment variables
OPS_HOST = os.environ.get("VCF_OPS_HOST")
OPS_USER = os.environ.get("VCF_OPS_USER")
OPS_PASS = os.environ.get("VCF_OPS_PASS")
if not all([OPS_HOST, OPS_USER, OPS_PASS]):
print("ERROR: Set these environment variables:")
print(" VCF_OPS_HOST = your VCF Operations FQDN")
print(" VCF_OPS_USER = your username")
print(" VCF_OPS_PASS = your password")
sys.exit(1)
# Step 1: Acquire token
print(f"Connecting to {OPS_HOST}...")
auth_url = f"https://{OPS_HOST}/suite-api/api/auth/token/acquire"
auth_body = {
"username": OPS_USER,
"password": OPS_PASS,
"authSource": "local"
}
try:
resp = requests.post(auth_url, json=auth_body, verify=False)
resp.raise_for_status()
except requests.exceptions.RequestException as e:
print(f"ERROR: Authentication failed: {e}")
sys.exit(1)
token = resp.json()["token"]
print("Token acquired successfully.")
# Step 2: Make an API call (list resources)
headers = {
"Authorization": f"OpsToken {token}",
"Accept": "application/json"
}
resources_url = f"https://{OPS_HOST}/suite-api/api/resources"
params = {"resourceKind": "ClusterComputeResource", "pageSize": "10"}
try:
resp = requests.get(resources_url, headers=headers, params=params, verify=False)
resp.raise_for_status()
except requests.exceptions.RequestException as e:
print(f"ERROR: API call failed: {e}")
sys.exit(1)
data = resp.json()
resources = data.get("resourceList", [])
print(f"\nFound {len(resources)} cluster(s):")
for r in resources:
name = r.get("resourceKey", {}).get("name", "Unknown")
kind = r.get("resourceKey", {}).get("resourceKindKey", {}).get("resourceKind", "Unknown")
print(f" - {name} ({kind})")
# Step 3: Release token
release_url = f"https://{OPS_HOST}/suite-api/api/auth/token/release"
requests.post(release_url, headers=headers, verify=False)
print("\nToken released. Done.")
Step 3: Run the Script
Windows (cmd):
set VCF_OPS_HOST=vrops.lab.local
set VCF_OPS_USER=admin
set VCF_OPS_PASS=YourPassword
Linux/macOS (bash):
export VCF_OPS_HOST=vrops.lab.local
export VCF_OPS_USER=admin
export VCF_OPS_PASS=YourPassword
python hello_vcf_api.py
Connecting to vrops.lab.local...
Token acquired successfully.
Found 3 cluster(s):
- Cluster-Prod-01 (ClusterComputeResource)
- Cluster-Prod-02 (ClusterComputeResource)
- Cluster-Dev-01 (ClusterComputeResource)
Token released. Done.
VALIDATION CHECKLIST:
[ ] Python virtual environment created and activated
[ ] requests library installed
[ ] Script authenticates successfully (token acquired)
[ ] Script lists cluster resources from your environment
[ ] Token is released at the end
[ ] No hardcoded credentials in the script
LAB D3: INSTALL VCF POWERCLI & RUN A SMALL SCRIPT
Objective: Install PowerCLI and execute basic inventory commands.
Prerequisites: PowerShell 5.1+ installed (Windows) or PowerShell 7+ (cross-platform).
Estimated Time: 45 minutes
STEP-BY-STEP:
Step 1: Install PowerCLI
Install-Module -Name VMware.PowerCLI -Scope CurrentUser -Force -AllowClobber
If prompted about an untrusted repository, type Y and press Enter.
Get-Module -Name VMware.PowerCLI -ListAvailable
Expected: Shows the VMware.PowerCLI module with version number
Step 2: Configure PowerCLI
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false
Set-PowerCLIConfiguration -DefaultVIServerMode Multiple -Confirm:$false
Set-PowerCLIConfiguration -ParticipateInCeip $false -Confirm:$false
Step 3: Connect to vCenter
Connect-VIServer -Server "vcenter.yourdomain.com" -User "administrator@vsphere.local" -Password "YourPassword"
Expected: Connection established message with server name and user
Step 4: Run Basic Commands
Get-Cluster | Format-Table Name, HAEnabled, DrsEnabled, @{N='HostCount';E={($_ | Get-VMHost).Count}}
Get-VMHost | Format-Table Name, ConnectionState, PowerState, NumCpu, @{N='MemGB';E={[math]::Round($_.MemoryTotalGB,1)}}, @{N='MemUsed%';E={[math]::Round($_.MemoryUsageGB/$_.MemoryTotalGB*100,1)}}
Get-VM | Get-Snapshot | Format-Table VM, Name, Created, @{N='SizeGB';E={[math]::Round($_.SizeGB,2)}}
(Get-VM).Count
Step 5: Disconnect
Disconnect-VIServer -Confirm:$false
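Optional extension: consolidate Steps 3-5 into a single re-runnable script that writes CSV reports. A sketch; Get-Credential keeps the password out of the script, and the CSV file names are illustrative:
# Connect, export host and snapshot reports, disconnect
Connect-VIServer -Server "vcenter.yourdomain.com" -Credential (Get-Credential)
Get-VMHost |
    Select-Object Name, ConnectionState, PowerState, NumCpu,
        @{N='MemGB';   E={[math]::Round($_.MemoryTotalGB, 1)}},
        @{N='MemUsed%';E={[math]::Round($_.MemoryUsageGB / $_.MemoryTotalGB * 100, 1)}} |
    Export-Csv -Path "host-inventory.csv" -NoTypeInformation
Get-VM | Get-Snapshot |
    Select-Object VM, Name, Created, @{N='SizeGB';E={[math]::Round($_.SizeGB, 2)}} |
    Export-Csv -Path "snapshot-report.csv" -NoTypeInformation
Disconnect-VIServer -Confirm:$false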
VALIDATION CHECKLIST:
[ ] PowerCLI module installed and verified
[ ] Configuration set (cert ignore, CEIP off)
[ ] Successfully connected to vCenter
[ ] Get-Cluster returns cluster data
[ ] Get-VMHost returns host data with status
[ ] Get-VM | Get-Snapshot returns snapshot data (or empty if no snapshots exist)
[ ] Disconnected cleanly
WHAT COULD GO WRONG:
| Problem | Symptom | Solution |
|---|---|---|
| Install fails - access denied | "Access to the path is denied" | Run PowerShell as Administrator. Or use -Scope CurrentUser flag. |
| Install fails - gallery not trusted | Warning about untrusted repository | Type Y to proceed. Or run: Set-PSRepository -Name PSGallery -InstallationPolicy Trusted |
| Connect fails - cert error | "The SSL connection could not be established" | Run Set-PowerCLIConfiguration -InvalidCertificateAction Ignore first |
| Connect fails - auth error | "Cannot complete login due to an incorrect user name or password" | Verify credentials. Check if the account is administrator@vsphere.local (not just admin). Try logging into vCenter web UI with same credentials. |
| Commands return empty | Get-Cluster returns nothing | Verify you are connected (run $global:DefaultVIServers to check). The connection may have timed out. Reconnect. |