8. Operational Playbooks¶
This section provides detailed, actionable checklists for common operational scenarios. Each playbook is designed to be printed or bookmarked as a standalone reference.
8.1 New Team Onboarding Checklist¶
Phase 1: Pre-Onboarding (1 week before kickoff)
- Identify the SLO Champion for the team
- Schedule a 1-hour kickoff meeting with the full team
- Share pre-reading materials: Nobl9 documentation, SLODLC handbook, this guide
- Create the team's project in Nobl9 with proper naming convention
- Configure RBAC: assign Project Owner to SLO Champion, Project Viewer to all team members
- Verify that the team's monitoring tools are supported by Nobl9
- Pre-create the data source connection and verify it returns data
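The last step above assumes the connection exists as a Nobl9 resource. A minimal Agent definition in the n9/v1alpha format looks roughly like the sketch below; the Prometheus backend, the names, and the URL are illustrative placeholders, not prescribed values:

```yaml
# Sketch of a Nobl9 Agent data source (n9/v1alpha); all names and the URL
# are placeholders to replace with your own values.
apiVersion: n9/v1alpha
kind: Agent
metadata:
  name: prometheus-main        # placeholder data source name
  project: team-payments       # placeholder project
spec:
  description: Prometheus agent for the payments team
  sourceOf:
    - Metrics
  prometheus:
    url: http://prometheus.monitoring.svc:9090  # placeholder URL
```

Apply it with `sloctl apply -f`, then confirm in the Nobl9 UI that the agent connects and queries return data before the kickoff meeting.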
Phase 2: Kickoff Meeting
- Present the SLO program overview and business case
- Walk through the tiering model and where the team's services fit
- Discuss the team's services and identify 2-3 initial SLO candidates
- Review the label taxonomy and annotation requirements
- Agree on review cadence (weekly operational, monthly target review)
- Assign action items: who will draft the first SLO definitions
Phase 3: First SLOs (Week 1)
- Use AI-assisted discovery (Section 4.2.1) to generate initial SLO YAML from codebase
- Review and refine AI-generated YAML with the team
- Use the SLI Analyzer to validate that queries return expected data
- Confirm initial targets are reasonable based on historical data
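After refinement, a first SLO definition might look like this minimal sketch in the n9/v1alpha format. The project, service, labels, metric source, and PromQL queries are illustrative placeholders for a latency SLO backed by the data source created in Phase 1:

```yaml
# Sketch of a first SLO (n9/v1alpha); names, labels, and queries are placeholders.
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: checkout-latency         # placeholder SLO name
  project: team-payments         # placeholder project
  labels:
    team: ["payments"]
    tier: ["1"]
spec:
  description: 99.5% of checkout requests complete within 250 ms
  service: checkout              # must already exist in the project
  indicator:
    metricSource:
      name: prometheus-main      # the data source from Phase 1
  budgetingMethod: Occurrences
  objectives:
    - displayName: fast
      value: 1
      target: 0.995
      countMetrics:
        incremental: false
        good:
          prometheus:
            promql: sum(rate(http_request_duration_seconds_bucket{le="0.25",job="checkout"}[5m]))
        total:
          prometheus:
            promql: sum(rate(http_request_duration_seconds_count{job="checkout"}[5m]))
  timeWindows:
    - unit: Day
      count: 28
      isRolling: true
```

Run the good and total queries through the SLI Analyzer first so the 0.995 target is grounded in the historical ratio rather than a guess.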
Phase 4: Alert Configuration (Week 2)
- Create fast-burn alert policy (20x / 5 min) using Nobl9 preset
- Create slow-burn alert policy (2x / 6 hr) using Nobl9 preset
- Create budget threshold alerts (25% and 10%)
- Configure no-data anomaly alerts based on service tier
- Connect alert methods: the team's Slack channel for all alerts, PagerDuty for critical-severity alerts

- If using ServiceNow, configure the ServiceNow alert method
- Test all alert methods using the Nobl9 built-in test feature
- Document alert routing in the team's runbook
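The fast-burn preset in the first step corresponds to an AlertPolicy like the sketch below; the names, project, and alert method are placeholders, and the slow-burn policy is the same shape with `value: 2` and `lastsFor: 6h`:

```yaml
# Sketch of a fast-burn AlertPolicy (n9/v1alpha); names are placeholders.
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: fast-burn-20x-5m
  project: team-payments          # placeholder project
spec:
  description: Error budget burning 20x faster than sustainable for 5 minutes
  severity: High
  coolDown: 5m
  conditions:
    - measurement: averageBurnRate
      value: 20
      lastsFor: 5m
  alertMethods:
    - metadata:
        name: team-slack          # placeholder alert method, created beforehand
        project: team-payments
```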
Phase 5: CI/CD Setup (Week 2-3)
- Create the team's directory in the SLO definitions repository
- Move all YAML definitions into the repository structure
- Add the label linting script and Conftest policies if applicable
- Configure the CI/CD pipeline (GitHub Actions or equivalent)
- Verify dry-run validation passes in CI for a test PR
- Merge the first PR and verify definitions are applied successfully
- Add deployment annotation automation to the team's deploy pipeline
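A dry-run validation job for the pipeline steps above might look like the following GitHub Actions sketch. The download URL, secret names, environment variables, and repository layout are assumptions to adapt to your own setup:

```yaml
# Illustrative GitHub Actions job; the release asset name, secret names,
# and the slo/ directory layout are assumptions, not prescribed values.
name: slo-validate
on:
  pull_request:
    paths: ["slo/**"]
jobs:
  dry-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install sloctl
        run: |
          curl -fsSL https://github.com/nobl9/sloctl/releases/latest/download/sloctl-linux-amd64 -o sloctl
          chmod +x sloctl && sudo mv sloctl /usr/local/bin/
      - name: Validate definitions without applying them
        env:
          SLOCTL_CLIENT_ID: ${{ secrets.NOBL9_CLIENT_ID }}      # assumed env var names;
          SLOCTL_CLIENT_SECRET: ${{ secrets.NOBL9_CLIENT_SECRET }}  # configure per Nobl9 docs
        run: sloctl apply -f ./slo/ --dry-run
```

Gate merges on this job so a definition that fails server-side validation never reaches the apply step.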
Phase 6: Operational Readiness (Week 3-4)
- Conduct the first weekly SLO review with the team
- Verify error budget tracking is working as expected
- Set up the Nobl9 SLO Oversight review schedule for the team's SLOs
- Document the team's error budget policy
- Ensure all metadata annotations are populated (runbook, oncall, repo, owner)
- Add the team's SLOs to the cross-team review dashboard
- Conduct a 30-day check-in to review initial targets and adjust
- Schedule the handoff: team is now self-sufficient, platform team provides support as needed
Phase 7: Handoff Validation
- SLO Champion can independently create and modify SLO definitions via code
- Team understands and follows the review cadence
- Alert routing is tested and confirmed working
- Error budget policy is documented and agreed upon
- Team has access to and understands the Service Health Dashboard
- Onboarding retrospective completed: lessons learned documented for next team
8.2 Error Budget Exhaustion Checklist¶
Step 1: Acknowledge and Verify (within 15 minutes)
- Confirm the budget exhaustion alert is not a false positive or data quality issue
- Check the Nobl9 SLO detail page to understand the consumption pattern
- Determine if this is caused by a single incident or gradual degradation
- Check for no-data anomalies that might be masking the real data
Step 2: Communicate (within 30 minutes)
- Notify the team via the designated Slack channel
- Post the current error budget percentage and burn rate
- If caused by a single incident, link to the incident channel or postmortem
- Brief the engineering manager
- If ITSM integration is configured (e.g., ServiceNow), verify an incident has been created
Step 3: Triage (within 2 hours)
- Review the last 7 days of error budget consumption on the SLO timeline
- Check SLO annotations for recent deployments, rollbacks, or config changes
- Identify the top contributors to budget consumption
- Check platform and infrastructure layer SLOs for cascading issues (Section 3.5)
- Check dependency layer SLOs for external causes
- Review the Nobl9 Alert Center for related alerts across services
Step 4: Remediate
- Implement the error budget policy: freeze non-critical deployments
- If a recent deployment caused the issue, evaluate whether to roll back
- Create action items for the top error contributors
- Assign owners to each action item with deadlines
- If cross-team dependencies are involved, engage the SLO Process Owner
Step 5: Monitor Recovery
- Track burn rate daily to confirm it is below 1x (budget is recovering)
- Hold daily standups focused on reliability until the budget is above 10%
- Create SLO annotations marking remediation actions taken
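The 1x threshold in the first step follows from simple arithmetic: in a rolling window of N days, roughly 1/N of the budget refills each day as old data ages out, so any burn rate below 1x means net recovery. A rough sketch, assuming consumption was spread evenly across the window (the function is illustrative, not a Nobl9 API):

```python
def budget_trend(remaining_pct: float, burn_rate: float, window_days: int = 28) -> float:
    """Approximate the error-budget percentage one day from now.

    Each day roughly 1/window_days of the total budget is refilled as the
    rolling window moves, while burn_rate/window_days is consumed, so a
    burn rate below 1x means the budget recovers. This assumes historical
    consumption was spread evenly across the window.
    """
    daily_fraction = 100.0 / window_days  # percent of total budget per window-day
    return remaining_pct + daily_fraction * (1.0 - burn_rate)
```

For example, at a 0.5x burn rate with 5% budget remaining, the budget grows by about 1.8 points per day on a 28-day window; at 2x it shrinks by about 3.6 points per day.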
Step 6: Review (within 1 week)
- Conduct a formal review of the incident and budget exhaustion
- Evaluate whether the SLO target was appropriate
- Update the SLO target if the review reveals it was unrealistic
- Update alert thresholds if early warning was insufficient
- Document lessons learned in the team's knowledge base
- Update the runbook with any new troubleshooting steps
- Close the ITSM incident with resolution details
8.3 Quarterly SLO Program Review Checklist¶
Step 1: Preparation (1 week before review)
- Gather cross-service SLO reports from Nobl9 for the quarter
- Compile composite SLO performance for key user journeys
- Count error budget violations and their business impact
- Assess adoption metrics: number of teams, services, SLOs, and active review cycles
- Identify any Overdue SLO reviews in Nobl9 Oversight
- Pull label compliance metrics from CI/CD pipeline logs
- Prepare a summary of data anomalies (no-data, constant-burn, no-burn)
Step 2: Maturity Assessment
- Compare current state to the maturity model (Section 4.1.3)
- Identify areas where the program is ahead of or behind expectations
- Evaluate RBAC configuration: are roles still appropriate?
- Review the label taxonomy: are new labels needed? Are any unused?
- Assess CI/CD pipeline health: are all teams using automated validation?
Step 3: Target Evaluation
- Identify SLOs consistently over-achieving (error budget barely consumed)
- Identify SLOs consistently under-achieving (budget frequently exhausted)
- For over-achievers: recommend tightening targets so the SLO provides a useful signal
- For under-achievers: determine if targets are unrealistic or if systemic issues exist
- Review composite SLO weights: do they still reflect actual user impact?
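When sweeping many SLOs, the over/under-achiever split in the steps above can be reduced to a simple rule on the fraction of error budget consumed during the quarter. The 25% and 100% thresholds below are illustrative judgment calls for this review, not Nobl9 defaults:

```python
def classify_slo(budget_consumed: float) -> str:
    """Triage a quarter's SLO by fraction of error budget consumed.

    budget_consumed is 0.0-1.0+ (1.2 means 120% of budget burned).
    The 0.25 / 1.0 thresholds are illustrative, not Nobl9 defaults.
    """
    if budget_consumed > 1.0:
        return "under-achieving"   # budget exhausted; review target or reliability
    if budget_consumed < 0.25:
        return "over-achieving"    # target may be too loose to be a useful signal
    return "healthy"
```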
Step 4: Plan Next Quarter
- Define adoption goals: how many new teams and services to onboard
- Identify platform improvements needed (new integrations, better automation)
- Allocate resources for team onboarding and training
- Set improvement targets for alert precision and review compliance
- Schedule next quarter's strategic review date
- Publish the quarterly report to all stakeholders
8.4 New SLO Creation Checklist¶
Prerequisites
- Service exists in Nobl9 with proper labels (team, tier, layer, env)
- Metadata annotations populated (owner, runbook, oncall, repo)
- Data source connection is active and tested
- User journey mapped and SLI candidates identified
Design
- SLI type selected (availability, latency, throughput, correctness)
- Budgeting method selected (occurrences or time slices)
- Target set based on historical baseline data (use SLI Analyzer)
- Time window configured (rolling for alerting, calendar-aligned for reviews)
- SLO specification template filled out (Appendix A)
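Setting the target from a historical baseline (third item above) amounts to taking the observed good/total ratio and leaving headroom so normal variation does not breach the new SLO immediately. A sketch of that arithmetic; the headroom factor and rounding are my assumptions, not the SLI Analyzer's formula:

```python
import math

def baseline_target(good_events: int, total_events: int, headroom: float = 0.25) -> float:
    """Suggest an initial SLO target slightly below the historical ratio.

    headroom is extra failure budget as a fraction of the observed failure
    rate, so the new SLO would not have been breached by the period just
    measured. The 25% default and 4-decimal rounding are illustrative.
    """
    observed = good_events / total_events
    failure_budget = (1.0 - observed) * (1.0 + headroom)
    # Round down to 4 decimal places so the suggested target stays conservative.
    return math.floor((1.0 - failure_budget) * 10_000) / 10_000
```

For example, 997,000 good out of 1,000,000 total events (99.7% observed) suggests a 99.62% target, leaving room for modest variation before the budget is threatened.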
Implementation
- YAML definition created with all required labels and annotations
- Label linting passes
- `sloctl apply --dry-run` passes
- Code review approved by at least one team member
- Applied to Nobl9 via CI/CD pipeline
- Verified in the Service Health Dashboard that data is flowing
Alert Configuration
- Fast-burn alert policy attached (20x / 5 min)
- Slow-burn alert policy attached (2x / 6 hr)
- Budget threshold alerts configured (25% and 10%)
- No-data anomaly alert configured based on service tier
- All alert methods tested
Governance
- Review schedule configured in Nobl9 SLO Oversight
- SLO added to the team's review cadence
- Error budget policy documented and communicated
- SLO Champion confirms ownership