Operationalizes Appendix A of Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability measured at the load balancers with 5xx responses counted as failures, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), spell out explicit numerator/denominator language, justify rounding policies, quantify per-objective error budgets, and cite the sibling error budget policy for enforcement.
Use cases
- A new customer-facing microservice launches and leadership requests an auditable reliability contract before GA
- Observability data exists but no document ties metrics to UX outcomes or quantified targets
- Multiple surfaces (REST API vs static HTTP vs batch pipeline) share infrastructure and need partitioned SLO clauses
- Compliance asks how synthetic correctness coverage maps to allowable defect rates documented numerically
- Post-incident retros determine that vague “four nines somewhere” wording prevented consistent freeze decisions
Key features
- Summarize architectural context and customer-visible interfaces so reviewers know what system is in scope
- Choose a canonical evaluation window—the workbook example adopts a four-week rolling period—and state it verbatim in the preamble
- For each subsystem, enumerate SLIs with plain-language numerator/denominator definitions (availability counting non-5xx at the LB, latency thresholds referencing concrete milliseconds, freshness windows for derived reads, completeness for batch jobs)
- Set SLO percentages per SLI plus rationale paragraphs explaining historical measurement windows, rounding heuristics, and explicit caveats about evidence quality
- Derive discrete error budgets (100% minus target) independently per objective so finance and product leaders can negotiate trade-offs granularly
- Cross-reference the enacted error budget policy so readers know freeze behavior when any budget drains, and annotate clarifications (LB blind spots, prober workload assumptions, etc.)
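The SLI and error-budget arithmetic above is simple enough to sketch directly; a minimal example follows, where the function names, the 99.95% target, and the request counts are illustrative assumptions, not values taken from Appendix A:

```python
# Sketch: compute an availability SLI and its discrete error budget.
# All names, counts, and targets here are illustrative placeholders.

def availability_sli(non_5xx_responses: int, total_responses: int) -> float:
    """Availability = non-5xx responses / total responses, measured at the LB."""
    return non_5xx_responses / total_responses

def error_budget(slo_target: float) -> float:
    """Discrete error budget per objective: 100% minus the SLO target."""
    return 1.0 - slo_target

# Example: a 99.95% availability objective over a four-week rolling window.
target = 0.9995
budget_fraction = error_budget(target)   # fraction of events allowed to fail
window_minutes = 28 * 24 * 60            # four-week rolling window, in minutes
allowed_bad_minutes = budget_fraction * window_minutes

print(f"budget fraction: {budget_fraction:.4%}")
print(f"allowed bad minutes per window: {allowed_bad_minutes:.1f}")
```

Computing the budget independently per objective, as the appendix does, is what lets product and finance stakeholders trade off each surface separately.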
When to Use This Skill
- When onboarding SRE practices modeled after Google workbook examples rather than reinventing unstructured uptime promises
- When pairing newly defined SLIs from monitoring design reviews with stakeholder sign-off artifacts
- When refactoring legacy SLAs into modern SLI/SLO narratives that align with iterative delivery
Expected Output
A concise SLO document mirroring Appendix A sections: overview, SLI/SLO table, rationale, discrete error budgets, clarifications/caveats, and links to enforcing policy.
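One way to keep the SLI/SLO table honest is to treat each row as structured data with the plain-language numerator and denominator attached; the sketch below assumes a hypothetical `SloRow` type and illustrative targets, not the appendix’s actual numbers:

```python
# Sketch of SLI/SLO table rows mirroring the Appendix A layout.
# The class, field values, and targets are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class SloRow:
    sli: str          # plain-language SLI name
    numerator: str    # what counts as a good event
    denominator: str  # what counts as a valid event
    target: float     # SLO target as a fraction

    @property
    def error_budget(self) -> float:
        # Discrete budget per objective: 100% minus the target.
        return 1.0 - self.target

rows = [
    SloRow("API availability", "non-5xx responses at the LB",
           "all LB requests", 0.9995),
    SloRow("API latency", "requests served under 400 ms",
           "all successful requests", 0.99),
]

for r in rows:
    print(f"{r.sli}: target {r.target:.2%}, budget {r.error_budget:.4%}")
```

Rendering the document from data like this keeps the numerator/denominator wording, the target, and the derived budget from drifting apart across sections.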
Frequently Asked Questions
- Do we copy the exact game-service example numbers?
- Use the appendix as scaffolding; replace illustrative percentages, latency breakpoints, pipeline cadence, and prober sizing with telemetry grounded in your service—while preserving the explanatory structure Google uses.
- How granular should SLIs become?
- As granular as needed to pace each budget independently: distinct API tiers, static web serving, and freshness pipelines each get separate rows, as shown in Appendix A.
- What if instrumentation cannot yet prove user journeys?
- The workbook urges documenting evidence gaps plainly so future reviewers can prioritize investment in better UX-linked metrics.
Related
Error budget policy drafting
Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.
Postmortem writing
Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.
Incident response
Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.