
Skill Entry

Example SLO document authoring

Operationalizes Appendix A of Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can imitate: articulate the user-facing workload; nominate rolling measurement windows (the appendix uses four weeks); pair each subsystem with tightly defined SLIs (availability from load-balancer metrics excluding 5xx responses, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines); spell out explicit numerator/denominator definitions; justify rounding policies; quantify per-objective error budgets; and cite the sibling error budget policy for enforcement.

Category Operations
Platform Google SRE Workbook / Codex
Published 2026-05-14
reliability · sli · slo

Use cases

  • A new customer-facing microservice launches and leadership requests an auditable reliability contract before GA
  • Observability data exists but no document ties metrics to UX outcomes or quantified targets
  • Multiple surfaces (REST API vs static HTTP vs batch pipeline) share infrastructure and need partitioned SLO clauses
  • Compliance asks how synthetic correctness coverage maps to allowable defect rates documented numerically
  • Post-incident retros determine that vague “four nines somewhere” wording prevented consistent freeze decisions

Key features

  • Summarize architectural context and customer-visible interfaces so reviewers know what system is in scope
  • Choose a canonical evaluation window—the workbook example adopts a four-week rolling period—and state it verbatim in the preamble
  • For each subsystem, enumerate SLIs with plain-language numerator/denominator definitions (availability counting non-5xx at the LB, latency thresholds referencing concrete milliseconds, freshness windows for derived reads, completeness for batch jobs)
  • Set SLO percentages per SLI plus rationale paragraphs explaining historical measurement windows, rounding heuristics, and explicit caveats about evidence quality
  • Derive discrete error budgets (100% minus target) independently per objective so finance and product leaders can negotiate trade-offs granularly (the sketch after this list shows the arithmetic)
  • Cross-reference the enacted error budget policy so readers know freeze behavior when any budget drains, and annotate clarifications (LB blind spots, prober workload assumptions, etc.)
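
To make the numerator/denominator and budget arithmetic in the features above concrete, here is a minimal Python sketch. The request counts, the 99.95% target, and all class and function names are illustrative assumptions, not values or interfaces from Appendix A.

```python
from dataclasses import dataclass

# Four-week rolling window, matching the workbook example's period.
WINDOW_MINUTES = 4 * 7 * 24 * 60  # 40,320 minutes

@dataclass
class AvailabilitySLI:
    """Availability at the load balancer: non-5xx responses / all responses."""
    total_requests: int
    server_errors: int  # responses the LB counted as 5xx

    @property
    def measured(self) -> float:
        # Numerator excludes 5xx; denominator is every request seen.
        return (self.total_requests - self.server_errors) / self.total_requests

def error_budget(target: float) -> float:
    """An objective's error budget is simply 100% minus its target."""
    return 1.0 - target

def budget_consumed(sli: AvailabilitySLI, target: float) -> float:
    """Fraction of this objective's budget already spent in the window."""
    return (1.0 - sli.measured) / error_budget(target)

# Illustrative numbers only, not from Appendix A: a 99.95% target
# leaves a 0.05% budget per four-week window.
sli = AvailabilitySLI(total_requests=10_000_000, server_errors=2_500)
print(f"measured:  {sli.measured:.4%}")          # 99.9750%
print(f"budget:    {error_budget(0.9995):.3%}")  # 0.050%
print(f"consumed:  {budget_consumed(sli, 0.9995):.0%}")  # 50%
print(f"downtime allowance: {error_budget(0.9995) * WINDOW_MINUTES:.1f} min")  # 20.2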

When to Use This Skill

  • When onboarding SRE practices modeled after Google workbook examples rather than reinventing unstructured uptime promises
  • When pairing newly defined SLIs from monitoring design reviews with stakeholder sign-off artifacts
  • When refactoring legacy SLAs into modern SLI/SLO narratives that align with iterative delivery

Expected Output

A concise SLO document mirroring the Appendix A sections: overview, SLI/SLO table, rationale, discrete error budgets, clarifications/caveats, and a link to the enforcing error budget policy.
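
Drafts can be checked against that outline mechanically. A minimal Python sketch, where the heading strings are assumptions mirroring the structure above, not canonical Appendix A headings:

```python
# Hypothetical section checklist; heading names are assumptions.
REQUIRED_SECTIONS = [
    "Overview",
    "SLIs and SLOs",
    "Rationale",
    "Error Budgets",
    "Clarifications and Caveats",
    "Error Budget Policy",  # the enforcing sibling document
]

def missing_sections(draft: str) -> list[str]:
    """Return the required headings that a draft does not yet contain."""
    lowered = draft.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]

# A draft missing its caveats and policy link fails loudly in review:
print(missing_sections("Overview\nSLIs and SLOs\nRationale\nError Budgets"))
# -> ['Clarifications and Caveats', 'Error Budget Policy']
```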

Frequently Asked Questions

Do we copy the exact gaming example numbers?
Use the appendix as scaffolding; replace illustrative percentages, latency breakpoints, pipeline cadence, and prober sizing with telemetry grounded in your service—while preserving the explanatory structure Google uses.
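
One hedged way to do that replacement is to derive the target from measured history and round down to a standard tier. In this Python sketch, the tier list, the worst-window heuristic, and the suggest_target helper are assumptions for illustration, not workbook guidance:

```python
# Ground the target in telemetry instead of copying illustrative numbers.
STANDARD_TARGETS = [0.999, 0.9995, 0.9999]  # common "nines" tiers

def suggest_target(window_availability: list[float]) -> float:
    """Highest standard tier the service met in every observed window."""
    worst = min(window_availability)  # conservative: the worst window
    supportable = [t for t in STANDARD_TARGETS if t <= worst]
    if not supportable:
        raise ValueError("history supports no standard tier")
    return max(supportable)

# Four recent four-week windows of measured availability:
print(suggest_target([0.9993, 0.9996, 0.9991, 0.9997]))  # -> 0.999
```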
How granular should SLIs become?
As granular as independent pacing requires: Appendix A gives the API, the static web tier, and the freshness pipelines each their own row, so each objective earns and spends its own budget.
What if instrumentation cannot yet prove user journeys?
The workbook urges documenting evidence gaps plainly so future reviewers can prioritize investment in metrics that track user experience more directly.
