Operationalizes Appendix A of Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability measured at the load balancers with 5xx responses counted as failures, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), spell out explicit numerator/denominator language, justify rounding policies, quantify per-objective error budgets, and cite the sibling error budget policy for enforcement.
Use cases
- A new customer-facing microservice launches and leadership requests an auditable reliability contract before GA
- Observability data exists but no document ties metrics to UX outcomes or quantified targets
- Multiple surfaces (REST API vs static HTTP vs batch pipeline) share infrastructure and need partitioned SLO clauses
- Compliance asks how synthetic correctness coverage maps to allowable defect rates documented numerically
- Post-incident retros determine that vague “four nines somewhere” wording prevented consistent freeze decisions
Key features
- Summarize architectural context and customer-visible interfaces so reviewers know what system is in scope
- Choose a canonical evaluation window—the workbook example adopts a four-week rolling period—and state it verbatim in the preamble
- For each subsystem, enumerate SLIs with plain-language numerator/denominator definitions (availability counting non-5xx at the LB, latency thresholds referencing concrete milliseconds, freshness windows for derived reads, completeness for batch jobs)
- Set SLO percentages per SLI plus rationale paragraphs explaining historical measurement windows, rounding heuristics, and explicit caveats about evidence quality
- Derive discrete error budgets (100% minus target) independently per objective so finance and product leaders can negotiate trade-offs granularly
- Cross-reference the enacted error budget policy so readers know freeze behavior when any budget drains, and annotate clarifications (LB blind spots, prober workload assumptions, etc.)
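The SLI and error-budget arithmetic above is simple enough to sketch directly; a minimal example follows, where the function names, the 99.95% target, and the request counts are illustrative assumptions, not values taken from Appendix A:

```python
# Sketch: compute an availability SLI and its discrete error budget.
# All names, counts, and targets here are illustrative placeholders.

def availability_sli(non_5xx_responses: int, total_responses: int) -> float:
    """Availability = non-5xx responses / total responses, measured at the LB."""
    return non_5xx_responses / total_responses

def error_budget(slo_target: float) -> float:
    """Discrete error budget per objective: 100% minus the SLO target."""
    return 1.0 - slo_target

# Example: a 99.95% availability objective over a four-week rolling window.
target = 0.9995
budget_fraction = error_budget(target)   # fraction of events allowed to fail
window_minutes = 28 * 24 * 60            # four-week rolling window, in minutes
allowed_bad_minutes = budget_fraction * window_minutes

print(f"budget fraction: {budget_fraction:.4%}")
print(f"allowed bad minutes per window: {allowed_bad_minutes:.1f}")
```

Computing the budget independently per objective, as the appendix does, is what lets product and finance stakeholders trade off each surface separately.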
When to Use This Skill
- When onboarding SRE practices modeled after Google workbook examples rather than reinventing unstructured uptime promises
- When pairing newly defined SLIs from monitoring design reviews with stakeholder sign-off artifacts
- When refactoring legacy SLAs into modern SLI/SLO narratives that align with iterative delivery
Expected Output
A concise SLO document mirroring Appendix A sections: overview, SLI/SLO table, rationale, discrete error budgets, clarifications/caveats, and links to enforcing policy.
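One way to keep the SLI/SLO table honest is to treat each row as structured data with the plain-language numerator and denominator attached; the sketch below assumes a hypothetical `SloRow` type and illustrative targets, not the appendix’s actual numbers:

```python
# Sketch of SLI/SLO table rows mirroring the Appendix A layout.
# The class, field values, and targets are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class SloRow:
    sli: str          # plain-language SLI name
    numerator: str    # what counts as a good event
    denominator: str  # what counts as a valid event
    target: float     # SLO target as a fraction

    @property
    def error_budget(self) -> float:
        # Discrete budget per objective: 100% minus the target.
        return 1.0 - self.target

rows = [
    SloRow("API availability", "non-5xx responses at the LB",
           "all LB requests", 0.9995),
    SloRow("API latency", "requests served under 400 ms",
           "all successful requests", 0.99),
]

for r in rows:
    print(f"{r.sli}: target {r.target:.2%}, budget {r.error_budget:.4%}")
```

Rendering the document from data like this keeps the numerator/denominator wording, the target, and the derived budget from drifting apart across sections.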
Frequently Asked Questions
- Do we copy the exact game-service example numbers?
- Use the appendix as scaffolding; replace illustrative percentages, latency breakpoints, pipeline cadence, and prober sizing with telemetry grounded in your service—while preserving the explanatory structure Google uses.
- How granular should SLIs become?
- As granular as needed to pace each budget independently: distinct API tiers, static web serving, and freshness pipelines each get separate rows, as shown in Appendix A.
- What if instrumentation cannot yet prove user journeys?
- The workbook urges documenting evidence gaps plainly so future reviewers can prioritize investment in better UX-linked metrics.
Related
Error budget policy drafting
Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.
Postmortem writing
Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.
Incident response
Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.