8. Operational Playbooks

This section provides detailed, actionable checklists for common operational scenarios. Each playbook is designed to be printed or bookmarked as a standalone reference.

8.1 New Team Onboarding Checklist

Phase 1: Pre-Onboarding (1 week before kickoff)

  • Identify the SLO Champion for the team
  • Schedule a 1-hour kickoff meeting with the full team
  • Share pre-reading materials: Nobl9 documentation, SLODLC handbook, this guide
  • Create the team's project in Nobl9, following the project naming convention
  • Configure RBAC: assign Project Owner to SLO Champion, Project Viewer to all team members
  • Verify that the team's monitoring tools are supported by Nobl9
  • Pre-create the data source connection and verify it returns data

Phase 2: Kickoff Meeting

  • Present the SLO program overview and business case
  • Walk through the tiering model and where the team's services fit
  • Discuss the team's services and identify 2-3 initial SLO candidates
  • Review the label taxonomy and annotation requirements
  • Agree on review cadence (weekly operational, monthly target review)
  • Assign action items: who will draft the first SLO definitions

Phase 3: First SLOs (Week 1)

  • Use AI-assisted discovery (Section 4.2.1) to generate initial SLO YAML from the codebase
  • Review and refine AI-generated YAML with the team
  • Use the SLI Analyzer to validate that queries return expected data
  • Confirm initial targets are reasonable based on historical data

Phase 4: Alert Configuration (Week 2)

  • Create fast-burn alert policy (20x / 5 min) using Nobl9 preset
  • Create slow-burn alert policy (2x / 6 hr) using Nobl9 preset
  • Create budget threshold alerts (25% and 10%)
  • Configure no-data anomaly alerts based on service tier
  • Connect alert methods: Slack channel for the team, PagerDuty for critical alerts
  • If using ServiceNow, configure the ServiceNow alert method
  • Test all alert methods using the Nobl9 built-in test feature
  • Document alert routing in the team's runbook
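
The burn-rate thresholds in this phase are easier to reason about with the arithmetic written out. The sketch below assumes burn rate means the observed error rate divided by the error rate the target allows. The 28-day window and 99.9% target are illustrative values only and do not come from this guide; the function names are hypothetical.

```python
from datetime import timedelta


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the target allows."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed_error_rate


def time_to_exhaustion(rate: float, window: timedelta,
                       budget_remaining: float = 1.0) -> timedelta:
    """Time until the remaining budget is gone if the burn rate holds steady."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window * (budget_remaining / rate)


window = timedelta(days=28)  # assumed rolling window, not taken from this guide
for label, rate in (("fast burn (20x)", 20.0), ("slow burn (2x)", 2.0)):
    print(label, "exhausts a full budget in", time_to_exhaustion(rate, window))
```

Under these assumptions, a sustained 20x burn uses up a full 28-day budget in about 1.4 days, while a sustained 2x burn takes 14 days. That gap is why the fast-burn policy uses a short evaluation period and the slow-burn policy a long one.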

Phase 5: CI/CD Setup (Week 2-3)

  • Create the team's directory in the SLO definitions repository
  • Move all YAML definitions into the repository structure
  • Add the label linting script and Conftest policies if applicable
  • Configure the CI/CD pipeline (GitHub Actions or equivalent)
  • Verify dry-run validation passes in CI for a test PR
  • Merge the first PR and verify definitions are applied successfully
  • Add deployment annotation automation to the team's deploy pipeline
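
As a rough illustration of what a label linting script could check, the sketch below validates one definition that has already been parsed from YAML into a dictionary. The required keys are taken from Section 8.4 (labels: team, tier, layer, env; annotations: owner, runbook, oncall, repo). The dictionary layout, the function name, and the example values are assumptions and should be checked against the team's actual definitions.

```python
REQUIRED_LABELS = ("team", "tier", "layer", "env")
REQUIRED_ANNOTATIONS = ("owner", "runbook", "oncall", "repo")


def lint_metadata(definition: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    metadata = definition.get("metadata") or {}
    name = metadata.get("name", "<unnamed>")
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}

    problems = []
    for key in REQUIRED_LABELS:
        # Catches a missing key as well as an empty list or empty string.
        if not labels.get(key):
            problems.append(f"{name}: missing label '{key}'")
    for key in REQUIRED_ANNOTATIONS:
        if not str(annotations.get(key) or "").strip():
            problems.append(f"{name}: missing annotation '{key}'")
    return problems


# Hypothetical definition, assumed to be parsed from YAML elsewhere.
example = {
    "kind": "SLO",
    "metadata": {
        "name": "checkout-availability",
        "labels": {"team": ["payments"], "tier": ["1"], "env": ["prod"]},
        "annotations": {"owner": "payments-team", "runbook": "",
                        "oncall": "payments-oncall"},
    },
}
for problem in lint_metadata(example):
    print(problem)
```

In a pipeline, a non-empty result would typically fail the job before the dry-run validation step runs.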

Phase 6: Operational Readiness (Week 3-4)

  • Conduct the first weekly SLO review with the team
  • Verify error budget tracking is working as expected
  • Set up the Nobl9 SLO Oversight review schedule for the team's SLOs
  • Document the team's error budget policy
  • Ensure all metadata annotations are populated (runbook, oncall, repo, owner)
  • Add the team's SLOs to the cross-team review dashboard
  • Conduct a 30-day check-in to review the initial targets and adjust them as needed
  • Schedule the handoff: the team becomes self-sufficient, with the platform team providing support as needed

Phase 7: Handoff Validation

  • SLO Champion can independently create and modify SLO definitions via code
  • Team understands and follows the review cadence
  • Alert routing is tested and confirmed working
  • Error budget policy is documented and agreed upon
  • Team has access to and understands the Service Health Dashboard
  • Onboarding retrospective completed: lessons learned documented for next team

8.2 Error Budget Exhaustion Checklist

Step 1: Acknowledge and Verify (within 15 minutes)

  • Confirm the budget exhaustion alert is not a false positive or data quality issue
  • Check the Nobl9 SLO detail page to understand the consumption pattern
  • Determine if this is caused by a single incident or gradual degradation
  • Check for no-data anomalies that might be distorting the reported budget consumption
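
One rough way to separate a single incident from gradual degradation is to ask how much of the recent consumption falls on the single worst day. The sketch below is a heuristic only, not a Nobl9 feature: the 50% cutoff, the function name, and the sample figures are all assumptions chosen for illustration.

```python
def consumption_pattern(daily_consumption, spike_share=0.5):
    """Classify budget consumption from per-day fractions of budget consumed.

    spike_share is an assumed cutoff: if one day accounts for at least this
    share of the total, the pattern is treated as a single incident.
    """
    total = sum(daily_consumption)
    if total <= 0:
        return "no consumption"
    worst_share = max(daily_consumption) / total
    if worst_share >= spike_share:
        return "single incident"
    return "gradual degradation"


# Hypothetical 7-day histories (fraction of the total budget consumed per day).
spiky = [0.02, 0.01, 0.03, 0.70, 0.02, 0.01, 0.02]
steady = [0.14, 0.15, 0.13, 0.16, 0.14, 0.15, 0.13]
print(consumption_pattern(spiky))
print(consumption_pattern(steady))
```

The result is a starting hypothesis for triage; the SLO detail page remains the source of truth for the actual consumption pattern.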

Step 2: Communicate (within 30 minutes)

  • Notify the team via the designated Slack channel
  • Post the current error budget percentage and burn rate
  • If caused by a single incident, link to the incident channel or postmortem
  • Brief the engineering manager
  • If ITSM integration is configured (e.g., ServiceNow), verify an incident has been created

Step 3: Triage (within 2 hours)

  • Review the last 7 days of error budget consumption on the SLO timeline
  • Check SLO annotations for recent deployments, rollbacks, or config changes
  • Identify the top contributors to budget consumption
  • Check platform and infrastructure layer SLOs for cascading issues (Section 3.5)
  • Check dependency layer SLOs for external causes
  • Review the Nobl9 Alert Center for related alerts across services

Step 4: Remediate

  • Implement the error budget policy: freeze non-critical deployments
  • If a recent deployment caused the issue, evaluate whether to roll back
  • Create action items for the top error contributors
  • Assign owners to each action item with deadlines
  • If cross-team dependencies are involved, engage the SLO Process Owner

Step 5: Monitor Recovery

  • Track burn rate daily to confirm it stays below 1x, the level at which the budget can recover
  • Hold daily standups focused on reliability until the budget is above 10%
  • Create SLO annotations marking remediation actions taken
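
The 10% exit criterion above can be checked with simple arithmetic. The sketch below assumes occurrences-based budgeting, where the budget is the number of bad events the target allows over the window. The function names and sample counts are hypothetical.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of error budget left; negative when the budget is overspent."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def standups_needed(remaining: float, threshold: float = 0.10) -> bool:
    """Daily standups continue until the remaining budget is above 10%."""
    return remaining <= threshold


# Hypothetical window: 1,000,000 requests against a 99.9% target.
early = budget_remaining(good=999_050, total=1_000_000, slo_target=0.999)
later = budget_remaining(good=999_400, total=1_000_000, slo_target=0.999)
print(f"{early:.0%} remaining, standups needed: {standups_needed(early)}")
print(f"{later:.0%} remaining, standups needed: {standups_needed(later)}")
```

With 950 bad requests against roughly 1,000 allowed, about 5% of the budget remains and the standups continue; at 600 bad requests about 40% remains and they can stop.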

Step 6: Review (within 1 week)

  • Conduct a formal review of the incident and budget exhaustion
  • Evaluate whether the SLO target was appropriate
  • Update the SLO target if the review reveals it was unrealistic
  • Update alert thresholds if early warning was insufficient
  • Document lessons learned in the team's knowledge base
  • Update the runbook with any new troubleshooting steps
  • Close the ITSM incident with resolution details

8.3 Quarterly SLO Program Review Checklist

Step 1: Preparation (1 week before review)

  • Gather cross-service SLO reports from Nobl9 for the quarter
  • Compile composite SLO performance for key user journeys
  • Count error budget violations and their business impact
  • Assess adoption metrics: number of teams, services, SLOs, and active review cycles
  • Identify any Overdue SLO reviews in Nobl9 Oversight
  • Pull label compliance metrics from CI/CD pipeline logs
  • Prepare a summary of data anomalies (no-data, constant-burn, no-burn)

Step 2: Maturity Assessment

  • Compare current state to the maturity model (Section 4.1.3)
  • Identify areas where the program is ahead of or behind expectations
  • Evaluate RBAC configuration: are roles still appropriate?
  • Review the label taxonomy: are new labels needed? Are any unused?
  • Assess CI/CD pipeline health: are all teams using automated validation?

Step 3: Target Evaluation

  • Identify SLOs consistently over-achieving (error budget barely consumed)
  • Identify SLOs consistently under-achieving (budget frequently exhausted)
  • For over-achievers: recommend tightening targets to provide useful signal
  • For under-achievers: determine if targets are unrealistic or if systemic issues exist
  • Review composite SLO weights: do they still reflect actual user impact?
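
The over-achiever and under-achiever criteria above can be made concrete once thresholds are agreed. The sketch below assumes the review works from the fraction of budget consumed in each month of the quarter. The 10% "barely consumed" cutoff, the "exhausted in at least two months" rule, and all names are placeholders, not values from this guide.

```python
from dataclasses import dataclass, field


@dataclass
class QuarterSummary:
    name: str
    # Fraction of error budget consumed in each month; 1.0 means exhausted.
    monthly_budget_consumed: list = field(default_factory=list)


def classify(summary: QuarterSummary, low: float = 0.10,
             exhausted_months: int = 2) -> str:
    """Label an SLO using assumed thresholds for the quarterly review."""
    consumed = summary.monthly_budget_consumed
    if consumed and all(c < low for c in consumed):
        return "over-achieving"
    if sum(1 for c in consumed if c >= 1.0) >= exhausted_months:
        return "under-achieving"
    return "on track"


# Hypothetical quarter for three SLOs.
quarter = [
    QuarterSummary("search-latency", [0.03, 0.05, 0.02]),
    QuarterSummary("checkout-availability", [1.20, 0.80, 1.05]),
    QuarterSummary("login-availability", [0.40, 0.65, 0.30]),
]
for slo in quarter:
    print(slo.name, "->", classify(slo))
```

The labels only shortlist candidates. Whether to tighten a target or investigate systemic issues still depends on the discussion described in the steps above.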

Step 4: Plan Next Quarter

  • Define adoption goals: how many new teams and services to onboard
  • Identify platform improvements needed (new integrations, better automation)
  • Allocate resources for team onboarding and training
  • Set improvement targets for alert precision and review compliance
  • Schedule next quarter's strategic review date
  • Publish the quarterly report to all stakeholders

8.4 New SLO Creation Checklist

Prerequisites

  • Service exists in Nobl9 with proper labels (team, tier, layer, env)
  • Metadata annotations populated (owner, runbook, oncall, repo)
  • Data source connection is active and tested
  • User journey mapped and SLI candidates identified

Design

  • SLI type selected (availability, latency, throughput, correctness)
  • Budgeting method selected (occurrences or time slices)
  • Target set based on historical baseline data (use SLI Analyzer)
  • Time window configured (rolling for alerting, calendar-aligned for reviews)
  • SLO specification template filled out (Appendix A)
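
The choice of budgeting method changes the reported number for the same traffic. The sketch below compares the two on per-minute counts. It assumes occurrences means good events divided by total events, and time slices means the share of minutes whose own good ratio meets a per-slice target. Skipping minutes with no events is an assumption, and the function names and sample data are hypothetical.

```python
def occurrences_sli(minutes):
    """Good events divided by total events across the whole window."""
    good = sum(g for g, _ in minutes)
    total = sum(t for _, t in minutes)
    if total == 0:
        raise ValueError("no events in window")
    return good / total


def time_slices_sli(minutes, slice_target):
    """Share of one-minute slices whose good ratio meets slice_target."""
    evaluated = [(g, t) for g, t in minutes if t > 0]  # assumption: skip empty minutes
    if not evaluated:
        raise ValueError("no evaluated slices in window")
    good_slices = sum(1 for g, t in evaluated if g / t >= slice_target)
    return good_slices / len(evaluated)


# Hypothetical (good, total) counts for four consecutive minutes.
minutes = [(100, 100), (100, 100), (50, 100), (1000, 1000)]
print(f"occurrences: {occurrences_sli(minutes):.4f}")
print(f"time slices: {time_slices_sli(minutes, slice_target=0.99):.4f}")
```

In this data one bad minute costs a quarter of the time-slice result, while the occurrences result is dominated by the high-traffic minute. Occurrences weights every request equally; time slices weights every minute equally.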

Implementation

  • YAML definition created with all required labels and annotations
  • Label linting passes
  • sloctl apply --dry-run passes
  • Code review approved by at least one team member
  • Applied to Nobl9 via CI/CD pipeline
  • Verified in the Service Health Dashboard that data is flowing

Alert Configuration

  • Fast-burn alert policy attached (20x / 5 min)
  • Slow-burn alert policy attached (2x / 6 hr)
  • Budget threshold alerts configured (25% and 10%)
  • No-data anomaly alert configured based on service tier
  • All alert methods tested
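
Budget threshold alerts are most useful when they fire once, at the moment the remaining budget drops through a threshold, rather than on every evaluation below it. The sketch below illustrates that edge-triggered behavior for the 25% and 10% thresholds. It is not a description of how Nobl9 evaluates alert policies, and the function name is hypothetical.

```python
THRESHOLDS = (0.25, 0.10)  # remaining-budget thresholds from the checklist


def crossed_thresholds(previous: float, current: float,
                       thresholds=THRESHOLDS) -> list:
    """Thresholds passed on the way down between two budget readings."""
    return [t for t in sorted(thresholds, reverse=True)
            if previous > t >= current]


print(crossed_thresholds(0.30, 0.22))  # drops through 25% only
print(crossed_thresholds(0.30, 0.08))  # drops through both thresholds
print(crossed_thresholds(0.22, 0.20))  # already below 25%, nothing new
```

When testing alert methods, a reading that falls through both thresholds in one step is worth simulating, since it should notify for each threshold rather than only the lower one.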

Governance

  • Review schedule configured in Nobl9 SLO Oversight
  • SLO added to the team's review cadence
  • Error budget policy documented and communicated
  • SLO Champion confirms ownership