4. SLO Lifecycle Management (SLODLC)¶

This section maps the five SLODLC phases (Initiate, Discover, Design, Implement, Operate) to Nobl9 capabilities and organizational practices.
Docs: SLODLC Handbook
Docs: Nobl9 SLO Best Practices
4.1 Phase 0: Initiate¶
The Initiate phase establishes the organizational foundation for SLO adoption. Before defining a single SLO, you need executive sponsorship, a clear understanding of why SLOs matter to your organization, and a plan for scaling the practice.
4.1.1 Secure Executive Sponsorship¶
SLO adoption requires executive sponsorship to sustain investment through competing priorities. Present the business case in terms of reduced downtime costs, faster incident response, and data-driven reliability prioritization.
4.1.2 Identify Pilot Teams¶
Start with one or two teams that are already reliability-minded and have well-instrumented services. Ideal pilot teams have existing monitoring in a Nobl9-supported data source, a track record of incident response, clear service ownership, and willingness to experiment with new processes.
4.1.3 Establish Maturity Expectations¶
Set realistic expectations for the maturity journey:
| Stage | Duration | Characteristics |
|---|---|---|
| Foundation | Months 1-3 | Pilot teams define first SLOs. Platform configuration established. Basic alerting configured. AI-assisted discovery for initial services. |
| Adoption | Months 3-6 | Additional teams onboard. Label taxonomy refined. Error budget policies drafted. SLOs-as-code introduced. |
| Standardization | Months 6-12 | CI/CD pipelines with label linting enforced. Review cycles established. Cross-team reporting via composite SLOs begins. |
| Optimization | Months 12+ | Full tiering model deployed. SLO Oversight governance active. Error budget policies drive prioritization. SLOs inform capacity planning. |
4.2 Phase 1: Discover¶
The Discover phase maps user journeys and identifies the critical service interactions that SLOs should protect. The goal is to ensure that your SLOs measure what actually matters to users, not just what is easy to measure.
4.2.1 Using AI to Accelerate Discovery¶
Modern coding agents such as Claude Code and Cursor can dramatically accelerate the discovery phase. Rather than manually mapping every user journey and identifying SLI candidates, you can use AI agents to analyze your codebase, infrastructure configuration, and existing monitoring to generate initial SLO recommendations.
Example Prompts for AI-Assisted Discovery:
Prompt 1: User Journey Mapping from Code
```text
Analyze the codebase in this repository and map the key user journeys
for our checkout service. For each journey, identify:
- The entry point (API endpoint or UI action)
- The sequence of backend service calls
- External dependencies (payment processors, email services, etc.)
- Expected latency thresholds based on timeout configurations
- Error handling paths and failure modes
Output as a table mapping each journey to its critical path services.
```
Prompt 2: SLI Candidate Identification from Monitoring Config
```text
Review the Datadog monitor definitions and Prometheus recording rules
in our monitoring/ directory. For each service with existing metrics:
- Identify availability SLI candidates (success rate metrics)
- Identify latency SLI candidates (histogram metrics with thresholds)
- Map each metric to the Nobl9 SLI query format
- Flag any gaps where services lack appropriate metrics
- Suggest the Nobl9 budgeting method (occurrences vs time-slices)
  based on the metric type
Output as a Nobl9 SLO YAML file ready for sloctl apply --dry-run.
```
Prompt 3: Generate Initial SLO YAML from Architecture
```text
Read the architecture documentation in docs/architecture/ and the
Kubernetes manifests in k8s/. Generate a complete set of Nobl9 SLO
YAML definitions for the payments-api service, including:
- An availability SLO using the success rate from our Datadog metrics
- A latency SLO using the p99 latency from Prometheus
- Appropriate labels following our taxonomy: team, tier, layer, env
- Metadata annotations for runbook, oncall, and repo links
- Alert policies using multi-window burn rate (fast: 20x/5min,
  slow: 2x/6hr)
- No-data anomaly alert configured for 5-minute threshold
Set an initial availability target of 99.9% based on the SLA
documented in docs/slas/payments.md.
```
Prompt 4: Dependency Analysis for Tiering
```text
Analyze the service dependency graph from our Kubernetes service mesh
configuration and Terraform infrastructure definitions. For each
service, classify it into our tiering model:
- User Journey layer: end-to-end workflows spanning multiple services
- Application / Service layer: user-facing services
- Platform layer: shared internal services
- Infrastructure layer: databases, caches, queues
- Dependency layer: external third-party services
Generate a tiering report with recommended SLO targets for each
layer, accounting for cascading failure patterns.
```
Why Start with Code Instead of the UI:
With coding agents now widely available, we recommend starting with SLOs-as-code from day one rather than beginning in the Nobl9 UI. The code-first approach provides several advantages:
- AI agents can generate initial YAML definitions from your existing codebase and monitoring configuration.
- All definitions are version-controlled from the start.
- Code review processes catch mistakes before they reach production.
- The transition from prototype to production is seamless, since the same YAML files serve both purposes.
Use the Nobl9 UI as a visualization and validation tool rather than a creation tool. After applying your AI-generated SLO definitions with sloctl apply, use the Service Health Dashboard and SLO detail pages to verify that the queries return expected data and that targets are reasonable.
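For concreteness, the kind of definition Prompt 3 asks for might look like the following sketch. All names, queries, and label values here are illustrative assumptions, not output from any real run; validate against your own account with `sloctl apply --dry-run` before relying on it.

```yaml
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: payments-api-availability
  project: checkout-team
  labels:
    team: ["checkout"]
    tier: ["1"]
    layer: ["application"]
    env: ["prod"]
  annotations:
    runbook: "https://runbooks.example.com/payments-api"
spec:
  description: Availability of the payments API
  service: payments-api
  budgetingMethod: Occurrences
  indicator:
    metricSource:
      name: datadog-prod
  timeWindows:
    - unit: Day
      count: 30
      isRolling: true
  objectives:
    - displayName: Successful requests
      target: 0.999
      countMetrics:
        incremental: false
        good:
          datadog:
            query: "sum:payments.requests{status:ok}.as_count()"
        total:
          datadog:
            query: "sum:payments.requests{*}.as_count()"
```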
4.2.2 Manual Discovery Process¶
For services where AI-assisted discovery is not practical (e.g., legacy systems without infrastructure-as-code), follow the traditional SLODLC discovery process:
- For each service in scope, identify the key user journeys it supports.
- Document the user action that initiates each journey, the backend services involved, the expected latency, and the impact of failure.
- Identify SLI candidates across the four standard categories: availability, latency, throughput, and quality/correctness.
- Assess data source readiness: verify that metrics are available in your monitoring tools and that Nobl9 can query them.
4.2.3 SLI Categories¶
The four standard SLI categories are availability, latency, throughput, and quality/correctness. Start with availability and latency for most services, and add throughput or correctness SLIs only when they provide additional actionable signal. For detailed definitions, see the SLODLC Handbook.
4.3 Phase 2: Design¶
The Design phase translates discovered SLI candidates into concrete SLO specifications.
4.3.1 Choose a Budgeting Method¶
| Method | How It Works | Best For |
|---|---|---|
| Occurrences | Counts good events against total events. Automatically adjusts for traffic volume. A 99.9% target means 99.9% of all requests must succeed. | Request-based services where each request is roughly equal. Most APIs and web services. |
| Time Slices | Divides the time window into one-minute slices and classifies each as good or bad. A 99.5% target means 99.5% of minutes must be good. | Services needing consistent performance regardless of traffic. Good for SLA compliance monitoring. |
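In Nobl9 YAML the method is a single field on the SLO spec. A minimal sketch (field names follow the Nobl9 SLO schema; the `timeSliceTarget` value is an illustrative assumption):

```yaml
# Occurrences: the target applies to events across the whole window
spec:
  budgetingMethod: Occurrences
  objectives:
    - displayName: Good requests
      target: 0.999
---
# Timeslices: the target applies to one-minute slices; each objective
# also declares how good a single minute must be to count as good
spec:
  budgetingMethod: Timeslices
  objectives:
    - displayName: Good minutes
      target: 0.995
      timeSliceTarget: 0.95
```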
4.3.2 Set Appropriate Targets¶
- Never set a 100% target. Every system fails eventually, and a 100% target creates zero error budget.
- Start with your current baseline. Measure actual reliability over the past 30 to 90 days and set your initial SLO slightly below that level.
- Align targets with your tiering model: infrastructure targets should be more stringent than application / service targets.
- Align with business requirements and any existing SLAs. Your SLO should be more stringent than your external SLA.
- Use the SLODLC Review phase to adjust targets iteratively. Your first SLO target will almost certainly need refinement.
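When weighing candidate targets, it helps to translate each one into error-budget time. A quick back-of-the-envelope sketch (plain arithmetic, not a Nobl9 API):

```python
def error_budget_minutes(target: float, window_days: float = 30) -> float:
    """Minutes of full downtime allowed by `target` over the window."""
    return (1 - target) * window_days * 24 * 60

for target in (0.999, 0.995, 0.99):
    print(f"{target:.3%} over 30 days -> {error_budget_minutes(target):.1f} min of budget")
```

At 99.9% over 30 days the budget is about 43 minutes of full downtime; tightening to 99.99% shrinks it to roughly 4.3 minutes, which few teams can realistically operate against.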
4.3.3 Define Time Windows¶
- Rolling windows (e.g., 30-day rolling) provide a continuous view of reliability. Ideal for operational monitoring.
- Calendar-aligned windows (weekly, monthly, quarterly) align with business reporting cycles and Nobl9 review schedules. For most organizations, use rolling windows for operational alerting and calendar-aligned windows for Nobl9 SLO Oversight review cycles.
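The two styles differ only in the SLO's timeWindows block. A sketch (the calendar start time and timezone are illustrative assumptions):

```yaml
# Rolling 30-day window: continuous view, suited to operational alerting
timeWindows:
  - unit: Day
    count: 30
    isRolling: true
---
# Calendar-aligned monthly window: suited to business reporting and review cycles
timeWindows:
  - unit: Month
    count: 1
    isRolling: false
    calendar:
      startTime: 2024-01-01 00:00:00
      timeZone: America/New_York
```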
4.3.4 Threshold vs. Ratio Metrics¶
In Nobl9, SLIs can be defined using threshold metrics (a single time series evaluated against a threshold value) or ratio metrics (a count of good or bad events divided by total events). Ratio metrics are generally preferred because they naturally express the proportion of successful interactions.
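A sketch of the two shapes inside an SLO's objectives (the Prometheus queries and the 300 ms threshold are illustrative assumptions):

```yaml
# Threshold (raw) metric: one time series compared against a value
objectives:
  - displayName: p99 latency under 300 ms
    target: 0.999
    op: lte
    value: 300
    rawMetric:
      query:
        prometheus:
          promql: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000
---
# Ratio metric: good events divided by total events
objectives:
  - displayName: Successful requests
    target: 0.999
    countMetrics:
      incremental: false
      good:
        prometheus:
          promql: sum(rate(http_requests_total{code!~"5.."}[5m]))
      total:
        prometheus:
          promql: sum(rate(http_requests_total[5m]))
```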
4.4 Phase 3: Implement¶
The Implement phase deploys your designed SLOs into Nobl9 and configures the associated alerting.
4.4.1 Use the SLI Analyzer¶
Nobl9 provides an SLI Analyzer that helps teams evaluate potential SLIs before committing to an SLO. Use it to validate that your metric query returns expected data, understand historical reliability against different thresholds, and identify the right target level based on actual performance.
4.4.2 Configure Alert Policies Using Nobl9 Templates¶
If you are unsure where to start with alert policies, Nobl9 offers alert presets (templates) for fast-burn and slow-burn policies. These presets follow SRE best practices for multi-window, multi-burn-rate alerting.
Docs: Alert Presets
Docs: Alert Policies
Docs: Burn Rate Calculations
Recommended Alert Policy Configuration:
Multi-window, multi-burn-rate alerting is the gold standard from Google's SRE Workbook. The principle is to pair a short window with a long window for each severity level. A good guideline is to make the short window 1/12th the duration of the long window.
| Alert Severity | Short Window (Burn Rate) | Long Window (Burn Rate) | Notification | Expected Response |
|---|---|---|---|---|
| Critical (Page) | 5 min at 20x burn | 1 hour at 5x burn | PagerDuty (high urgency) | Investigate immediately. Likely active incident. |
| Warning | 30 min at 5x burn | 6 hours at 2x burn | Slack team channel | Investigate within 1 hour. Check recent deployments. |
| Low / Slow Burn | 2 hours at 2x burn | 1 day at 1.5x burn | Slack + Jira ticket | Add to next standup. Review over next 24 hours. |
| Budget Threshold (25%) | N/A | N/A | Slack + email to manager | Conduct team review. Consider deployment freeze. |
| Budget Threshold (10%) | N/A | N/A | PagerDuty + Slack + ServiceNow | Escalate to leadership. Freeze non-critical deploys. |
For detailed error budget policy thresholds and required actions at each budget level, see Section 5.5.3.
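To sanity-check the burn rates above: at a constant burn rate B, the entire error budget of an N-day window is consumed in N/B days, regardless of the SLO target. A quick sketch (plain arithmetic, not a Nobl9 API):

```python
def hours_to_exhaustion(window_days: float, burn_rate: float) -> float:
    """Hours until the full error budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate

for label, rate in [("critical 20x", 20), ("warning 5x", 5), ("slow 2x", 2)]:
    print(f"{label}: 30-day budget gone in {hours_to_exhaustion(30, rate):.0f} h")
```

At 20x, a 30-day budget disappears in 36 hours, which is why a sustained 20x burn warrants an immediate page, while a 2x burn leaves roughly two weeks to respond.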
Example: Fast-Burn Alert Policy YAML (Using Nobl9 Presets)
The following examples use PagerDuty and Slack as illustrative alert methods. Substitute the alert methods that match your organization's tooling.
```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: payments-api-fast-burn
  project: checkout-team
spec:
  description: Critical fast-burn alert for Payments API
  severity: High
  conditions:
    - measurement: averageBurnRate
      value: 20
      alertingWindow: 5m
      op: gte
  alertMethods:
    - metadata:
        name: pagerduty-checkout
        project: checkout-team
    - metadata:
        name: slack-slo-critical
        project: shared-alerts
```
Example: Slow-Burn Alert Policy YAML
```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: payments-api-slow-burn
  project: checkout-team
spec:
  description: Warning slow-burn alert for Payments API
  severity: Medium
  conditions:
    - measurement: averageBurnRate
      value: 2
      alertingWindow: 6h
      op: gte
  alertMethods:
    - metadata:
        name: slack-checkout-team
        project: checkout-team
```
SRE Alerting Best Practices:
- Precision over recall: it is better to miss a minor event than to page an engineer for a false alarm. Tune thresholds to reduce false positives.
- Every page must be actionable: if the on-call engineer cannot take meaningful action when paged, the alert should be a warning in Slack, not a PagerDuty incident.
- Evaluate alerting quality using four criteria: precision (every alert corresponds to a real event), recall (real events are not missed), detection time (how quickly alerts fire), and reset time (how quickly alerts clear after resolution).
- Apply multiple time windows to each SLO. Nobl9's own SRE team uses four alert policies per SLO: 15-minute, 1-hour, 6-hour, and 12-hour windows to distinguish between early indicators and severe incidents.
- Route by severity: fast burns to PagerDuty, slow burns to Slack, budget thresholds to Jira or ServiceNow.
Once alert policies are in place, ongoing SLO management—review cycles, error budget policies, and continuous improvement—is covered in Section 5.