<h1 id="awesome-site-reliability-engineering-awesome">Awesome Site
Reliability Engineering <a
href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></h1>
<p><a
href="https://dastergon.gr/awesome-sre"><img src="awesome-sre-logo.svg" align="right" width="100"></a></p>
<p>A curated list of awesome <a
href="https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre">Site
Reliability</a> and <a
href="https://www.usenix.org/conference/srecon15/program/presentation/canahuati">Production</a>
Engineering resources.</p>
<h4 id="what-is-site-reliability-engineering">What is Site Reliability
Engineering?</h4>
<blockquote>
<p>“Fundamentally, it’s what happens when you ask a software engineer to
design an operations function.” - Ben Treynor Sloss, VP Google
Engineering, founder of Google SRE</p>
</blockquote>
<h2 id="contributing">Contributing</h2>
<p>Please take a look at the <a href="CONTRIBUTING.md">contribution
guidelines</a> first. Contributions are always welcome!</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#culture">Culture</a></li>
<li><a href="#education">Education</a></li>
<li><a href="#books">Books</a></li>
<li><a href="#hiring">Hiring</a></li>
<li><a href="#reliability">Reliability</a></li>
<li><a href="#monitoring--observability--alerting">Monitoring &amp;
Observability &amp; Alerting</a></li>
<li><a href="#on-call">On-Call</a></li>
<li><a href="#post-mortem">Post-Mortem</a></li>
<li><a href="#capacity-planning">Capacity Planning</a></li>
<li><a href="#service-level-agreement">Service Level Agreement</a></li>
<li><a href="#performance">Performance</a></li>
<li><a href="#programming">Programming</a></li>
<li><a href="#misc-articles">Misc Articles</a></li>
<li><a href="#real-time-messaging">Real-time Messaging</a></li>
<li><a href="#blogs">Blogs</a></li>
<li><a href="#newsletters">Newsletters</a></li>
<li><a href="#conferences-meetups">Conferences &amp; Meetups</a></li>
<li><a href="#twitter">Twitter</a></li>
<li><a href="#sre-tools">SRE Tools</a></li>
<li><a href="#podcasts">SRE Podcasts</a></li>
</ul>
<h2 id="culture">Culture</h2>
<ul>
<li><a
href="https://landing.google.com/sre/interview/ben-treynor.html">What is
Site Reliability Engineering?</a></li>
<li><a
href="https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre">Keys
To SRE by Ben Treynor</a></li>
<li><a href="https://landing.google.com/sre/resources.html">Google SRE
Resources</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/canahuati">Notes
from Production Engineering by Pedro Canahuati</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15europe/program/presentation/underwood">PostOps:
Recovery from Operations</a></li>
<li><a
href="https://www.atlassian.com/it-service/site-reliability-engineering-sre">Love
DevOps? Wait till you meet SRE</a> <a
href="https://youtu.be/fsTpRx8Pt-k">[video]</a></li>
<li><a href="https://www.youtube.com/watch?v=H4vMcD7zKM0">How Google
Does Planet-Scale Engineering for Planet-Scale Infra</a></li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/site-reliability-engineering-at-facebook/291616313919/">Site
Reliability Engineering at Facebook</a></li>
<li><a
href="https://www.youtube.com/watch?v=qJnS-EfIIIE&amp;nohtml5=False">A
History of Site Reliability Engineering at Uber</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/limoncelli">Case
Study: Adopting SRE Principles at StackOverflow</a></li>
<li><a href="https://www.youtube.com/watch?v=ggizCjUCCqE">Site
Reliability Engineering at Dropbox</a></li>
<li><a href="https://www.youtube.com/watch?v=yXI7r0_J29M">Site
Reliability Engineers — Keeping Google up and running 24/7</a></li>
<li><a href="https://www.salesforce.com/video/193050/">Site Reliability
Engineering at Salesforce</a></li>
<li>From Sys Admin to Netflix SRE - <a
href="https://www.youtube.com/watch?v=lZI51YzIgVE">video</a> and <a
href="https://www.socallinuxexpo.org/sites/default/files/presentations/Scale%20x14%20Slides.pdf">slides</a></li>
<li><a href="https://www.youtube.com/watch?v=iIuTnhdTzK0">SRE@Google:
Thousands of DevOps Since 2004</a></li>
<li><a
href="https://www.usenix.org/conference/lisa15/conference-program/presentation/limoncelli">Transactional
System Administration Is Killing Us and Must be Stopped</a></li>
<li><a
href="https://web.archive.org/web/20190401220948/https://plus.google.com/+lizthegrey/posts/MLAJFVyEb2f">A
hierarchy of SRE needs</a></li>
<li><a
href="https://www.usenix.org/conference/lisa13/technical-sessions/plenary/underwood">PostOps:
A Non-Surgical Tale of Software, Fragility, and Reliability</a></li>
<li><a
href="https://web.archive.org/web/20180820235243/http://anthonycaiafa.com/2016/04/10/sre-cultural-narnia/">SRE:
An incomplete guide to cultural Narnia</a> - <a
href="https://www.youtube.com/watch?v=__wypEhdcrQ&amp;t=0s">[Video]</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/krishnan">Putting
Together Great SRE Teams</a></li>
<li><a href="https://www.youtube.com/watch?v=bwt6TZjefGM">Work at
Google: Meet our Production Engineers for Site Reliability Hangout on
Air</a></li>
<li><a
href="https://sharpend.io/toil-a-word-every-engineer-should-know/">Toil:
A Word Every Engineer Should Know</a></li>
<li><a href="https://research.google.com/pubs/pub32583.html">Engineering
Reliability into Web Sites: Google SRE</a></li>
<li><a href="https://vimeo.com/179914447">DEVOPS &amp; SRE AMA -
Building High Performance Organizations</a></li>
<li><a
href="https://community.atlassian.com/t5/Jira-Ops-questions/I-m-John-Allspaw-Ask-Me-Anything-about-incident-analysis-and/qaq-p/957084">John
Allspaw’s AMA on Incident Analysis and Postmortems</a></li>
<li>Site Reliability Engineering with Paul Newson - <a
href="https://www.gcppodcast.com/post/episode-38-site-reliability-engineering-with-paul-newson/">Part
1</a> &amp; <a
href="https://gcppodcast.com/post/episode-59-sre-ii-with-paul-newson/">Part
2</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2891413">How SysAdmins
Devalue Themselves</a></li>
<li><a href="https://www.youtube.com/watch?v=ry51Llzil1I">The Softer
Side of DevOps</a></li>
<li><a
href="https://medium.com/@kobolog/sre-noun-see-also-confidence-trust-e7e33e19efc1">SRE,
noun. See also: confidence, trust.</a></li>
<li><a href="https://youtu.be/24xb7oZgu-I?t=29m24s">Site Reliability
Engineering with Stephen Weinberg</a></li>
<li><a
href="https://www.reddit.com/r/IAmA/comments/177267/we_are_the_google_site_reliability_team_we_make">We
are the Google Site Reliability team. We make Google’s websites work.
Ask us Anything!</a></li>
<li><a
href="https://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_google_site_reliability_engineering/">We
are the Google Site Reliability Engineering team. Ask us
Anything!</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/10/13/the-ops-identity-crisis">The
Ops Identity Crisis</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/11/2/the-irreproducibility-of-bugs-in-large-scale-production-systems">The
Irreproducibility Of Bugs In Large-Scale Production Systems</a></li>
<li><a
href="http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-rabenstein-on-site-reliability-engineering/">SE-Radio
Episode 276: Björn Rabenstein on Site Reliability Engineering</a></li>
<li><a
href="https://blog.netsil.com/microservices-devops-and-operational-complexity-be98cb01b660">Microservices,
DevOps and Production Complexity</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/10/introducing-a-new-era-of-customer-support-Google-Customer-Reliability-Engineering.html">Introducing
Google Customer Reliability Engineering</a></li>
<li><a
href="https://robhirschfeld.com/2016/12/29/evolution-or-rebellion-the-rise-of-site-reliability-engineers-sre/">Evolution
or Rebellion? The rise of Site Reliability Engineers (SRE)</a></li>
<li><a
href="https://standalone-sysadmin.com/the-difference-between-site-reliability-engineering-system-administration-and-devops-d05031495499">The
difference between Site Reliability Engineering, System Administration,
and DevOps</a></li>
<li><a
href="https://www.usenix.org/conference/lisa16/conference-program/presentation/closing-plenary">SRE
in the Small and in the Large</a></li>
<li><a href="https://www.youtube.com/watch?v=zLXf0cKDOv0">SBSRE Meetup:
Different SRE roles and challenges (Netflix)</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/definition-of-sre-panel">Panel:
Who/What Is SRE?</a></li>
<li><a
href="https://medium.com/@jerub/hope-is-not-a-strategy-6a7d0a3b1c08">Hope
Is Not a Strategy</a></li>
<li><a
href="https://medium.com/@jerub/tenets-of-sre-8af6238ae8a8">Tenets of
SRE</a></li>
<li><a
href="https://medium.com/@venkatachalamrangasamy/site-reliability-engineering-demystified-ed676e0a7d56">Site
Reliability Engineering Demystified</a></li>
<li><a
href="https://devops.com/site-reliability-engineering-sre-true-ops-devops/">Is
Site Reliability Engineering the True Ops in DevOps?</a></li>
<li><a
href="https://devops.com/sre-devops-cloud-native-server-cage-match/">SRE
vs. DevOps vs. Cloud Native: The Server Cage Match</a></li>
<li><a href="https://youtu.be/8dfYLRAWn_c">SRE: What’s The Big
Idea?</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin">Building
the SRE Culture at LinkedIn</a></li>
<li><a
href="https://stackoverflow.blog/2017/06/12/podcast-111-sre-occasionally-maintaining-infrastructure-hate/">Podcast
#111 SRE: Occasionally Maintaining Infrastructure That You
Hate</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16europe/program/presentation/splicing-sre-dna-sequences-biggest-software-company">Splicing
SRE DNA Sequences in the Biggest Software Company on the Planet</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/06/why-should-your-app-get-SRE-support-CRE-life-lessons.html">Why
should your app get SRE support? - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/06/how-SREs-find-the-landmines-in-a-service-CRE-life-lessons.html">How
SREs find the landmines in a service - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/07/making-the-most-of-an-SRE-service-takeover-CRE-life-lessons.html">Making
the most of an SRE service takeover - CRE life lessons</a></li>
<li><a
href="https://dzone.com/articles/the-cloudcast-301-sre-and-infrastructure-operation">The
Cloudcast #301: SRE and Infrastructure Operations (Podcast)</a></li>
<li><a href="https://medium.com/@rakyll/the-sre-model-6e19376ef986">The
SRE model</a></li>
<li><a
href="https://circleci.com/blog/onboarding-new-site-reliability-engineers/">Onboarding
New Site Reliability Engineers</a></li>
<li><a href="https://www.youtube.com/watch?v=nQv9ySa8MTU">Building
Blocks for Site Reliability At Google</a></li>
<li><a
href="https://blog.netsil.com/beyond-google-sre-what-is-site-reliability-engineering-like-at-medium-71c65bd35f4e">Beyond
Google SRE: What is Site Reliability Engineering like at
Medium?</a></li>
<li><a
href="http://blog.adnanmasood.com/2016/05/19/intelligent-site-reliability-engineering-a-machine-learning-perspective/">Intelligent
Site Reliability Engineering - A Machine Learning Perspective</a></li>
<li><a
href="https://engineering.linkedin.com/day-life/crash-course-linkedins-global-site-operations">A
crash course in LinkedIn’s global site operations</a></li>
<li><a
href="https://softwareengineeringdaily.com/2016/06/14/googles-site-reliability-engineering-todd-underwood/">Google’s
Site Reliability Engineering with Todd Underwood</a></li>
<li><a
href="https://blogs.vmware.com/services-education-insights/2018/02/site-reliability-engineering.html">What
is Site Reliability Engineering? (VMware)</a></li>
<li><a href="http://geekologist.co/introduction-to-sre/">A Gentle
Introduction to SRE</a></li>
<li><a
href="http://engineering.medallia.com/blog/posts/understanding-site-reliability-engineering-through-movies-and-books/">Understanding
Site Reliability Engineering through Movies and Books</a></li>
<li><a href="https://www.youtube.com/watch?v=Cxb7a8lTv8A">GOTO 2017 •
Site Reliability Engineering at Google • Christof Leng</a></li>
<li>The Makeup of Successful Geographically-Distributed SRE Teams - <a
href="https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p">Part
1</a> &amp; <a
href="https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0">Part
2</a></li>
<li><a href="https://www.youtube.com/watch?v=6G2V1xPIM64">Tech
Leadership in SRE</a></li>
<li><a
href="http://azpodcast.azurewebsites.net/post/Episode-227-Azure-SRE1">The
Azure Podcast: Episode 227 - Azure SRE</a></li>
<li><a
href="https://medium.com/@mattklein123/the-human-scalability-of-devops-e36c37d3db6a">The
human scalability of “DevOps”</a></li>
<li><a
href="https://softwareengineeringdaily.com/2018/04/09/site-reliability-management-with-mike-hiraga/">Podcast:
Site Reliability Management with Mike Hiraga</a></li>
<li><a
href="https://medium.com/@Knowlarity_Engineering/how-a-cat-inspired-system-reliability-at-knowlarity-ad73c24f29a7">How
a cat inspired system reliability at Knowlarity</a></li>
<li><a
href="https://github.com/devopsenterprise/2018-London/blob/master/Tuesday/Breakout%20Sessions/Throne%2C%20Stephen%2C%20Getting%20Started%20with%20Site%20Reliability%20Engineering.pdf">Getting
Started with Site Reliability Engineering</a></li>
<li><a href="https://www.youtube.com/watch?v=xWAfTAu0Mww">“Practical
Applications of the Dickerson Pyramid” by Nat Welch</a></li>
<li><a
href="https://blameless.com/blog/sre-implementations-blindspots/">LinkedIn’s
Kurt Andersen Uncovers Blindspots in SRE Implementations</a></li>
<li><a
href="https://driftboatdave.com/2018/10/09/interview-with-betsy-beyer-stephen-thorne-of-google/">Interview
with Betsy Beyer, Stephen Thorne of Google</a></li>
<li><a href="https://www.youtube.com/watch?v=0zqBlRW_6jA">Less Risk
Through Greater Humanity - Dave Rensin</a></li>
<li><a href="https://www.youtube.com/watch?v=c-w_GYvi0eA">Getting
Started with SRE - Stephen Thorne, Google</a></li>
<li><a
href="https://drive.google.com/file/d/1FXwHm6mpmRA9NaIJEu4cB1s6ffbyGBfl/view">Building
Successful SRE in Large Enterprises</a></li>
<li><a href="https://www.youtube.com/watch?v=ZcZtU_TiFEM">Solving
Reliability Fears with Site Reliability Engineering</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends">SRE
vs. DevOps: competing standards or close friends?</a></li>
<li><a
href="https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/">How
to Avoid the 5 SRE Implementation Traps that Catch Even the Best
Teams</a></li>
<li><a href="https://vimeo.com/344515149">Reliability Engineering - The
Essential Discipline for Complex Systems</a></li>
<li><a href="https://www.youtube.com/watch?v=bC5dIPzNH24">The Modern
Site Reliability Workbench on Top of OCI</a></li>
<li><a
href="https://www.usenix.org/conference/srecon19emea/presentation/rabenstein">SRE
in the Third Age</a></li>
<li><a href="https://www.youtube.com/watch?v=vF6ajM3P_wM">About SRE and
how (not) to apply it</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/transitioning-a-typical-engineering-ops-team-into-an-sre-powerhouse">Transitioning
a typical engineering ops team into an SRE powerhouse</a></li>
<li><a
href="https://www.infoq.com/presentations/ing-sre-teams-practices/">Making
a Lion Bulletproof: SRE in Banking</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles">Identifying
and tracking toil using SRE principles</a></li>
<li><a
href="https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team">From
Ops to SRE: Evolution of the OpenShift Dedicated Team</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles">Meeting
reliability challenges with SRE principles</a></li>
<li><a href="https://github.com/fhivemind/sre-playground">A quick
introduction to SRE principles</a></li>
<li><a href="https://www.youtube.com/watch?v=KnC2eRUZMKY">The SRE I
Aspire to Be</a></li>
<li><a
href="https://tanzu.vmware.com/content/blog/taming-operational-load-vmware-cre">Taming
Operational Load with VMware CRE</a></li>
<li><a
href="https://dubrie.medium.com/sre-cultural-values-a0073b475183">SRE
Cultural Values</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum">Are
we there yet? Thoughts on assessing an SRE team’s maturity</a></li>
<li><a
href="https://www.linkedin.com/pulse/what-sres-have-do-project-based-services-rod-anami/">What
SREs have to do with project-based services?</a></li>
<li><a href="https://github.com/readme/guides/ops-work-visible">Making
operational work more visible</a></li>
<li><a href="https://spacelift.io/blog/sre-vs-devops">SRE vs. DevOps:
What’s the Difference Between Them?</a></li>
</ul>
<h2 id="education">Education</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/sebenik">Panel:
Educating SRE</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/widdowson">From
Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE
Teams</a></li>
<li><a
href="https://www.linkedin.com/pulse/new-sre-team-anthony-caiafa/">New
to an SRE team?</a></li>
<li><a
href="https://www.usenix.org/publications/login/june15/hixson">The
Systems Engineering Side of Site Reliability Engineering</a></li>
<li><a
href="https://medium.com/@tammybutow/graduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b">Graduating
from Bootcamp and interested in becoming a Site Reliability
Engineer?</a></li>
<li><a
href="https://www.loomsystems.com/single-post/2016/03/23/So-you-want-to-be-a-Site-Reliability-Engineer">So
you want to be a Site Reliability Engineer?</a></li>
<li><a
href="https://www.loomsystems.com/blog/2017/02/06/spiraling-ops-debt-the-sre-coding-imperative">Spiraling
Ops Debt &amp; the SRE Coding Imperative</a></li>
<li><a
href="https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c">So
you want to be an SRE?</a></li>
<li><a
href="https://www.khanacademy.org/college-careers-more/career-content/career-profile-videos/site-reliability-engineer/v/ruth-grace-site-reliability-engineer-what-i-do-and-how-much-i-make">Career
Profiles/Site Reliability Engineer</a></li>
<li><a
href="https://cloudacademy.com/blog/what-is-the-role-of-a-site-reliability-engineer/">What
is the role of a Site Reliability Engineer?</a></li>
<li><a
href="https://www.lynda.com/Software-Development-tutorials/DevOps-Foundations-Site-Reliability-Engineering/669542-2.html">Lynda.com:
DevOps Foundations: Site Reliability Engineering</a></li>
<li><a href="https://dastergon.gr/wheel-of-misfortune/">Incident
Management Training: Wheel of Misfortune</a></li>
<li><a href="https://www.youtube.com/watch?v=rmY8_PHanuI">Site
Un-Reliability Engineering [Video Series]</a></li>
<li><a
href="https://medium.com/swlh/the-ultimate-guide-to-structuring-a-90-day-onboarding-plan-c91af947376">The
Ultimate Guide to Structuring a 90-Day Onboarding Plan</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos">SRE
fundamentals: SLIs, SLAs and SLOs</a></li>
<li><a href="https://blog.alicegoldfuss.com/how-to-get-into-sre/">How to
Get Into SRE</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey">Do
you have an SRE team yet? How to start and assess your journey</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started">How
SRE teams are organized, and how to get started</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3283589">Why SRE
Documents Matter</a></li>
<li><a
href="https://www.oreilly.com/ideas/how-to-get-started-with-site-reliability-engineering-sre">How
to get started with site reliability engineering (SRE)</a></li>
<li><a
href="https://victorops.com/blog/duties-of-a-site-reliability-engineering-manager">Duties
of a Site Reliability Engineering Manager</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-principles-and-flashcards-to-design-nalsd">Designing
distributed systems using NALSD flashcards</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/training-site-reliability-engineers">Training
Site Reliability Engineers: What Your Organization Needs to Create a
Learning Program</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/sre-classroom/">SRE
Classroom: Distributed PubSub workshop</a></li>
<li><a href="https://linkedin.github.io/school-of-sre/">School of SRE:
Curriculum for onboarding non-traditional hires and new grads</a></li>
</ul>
<h2 id="books">Books</h2>
<ul>
<li><a
href="https://link.springer.com/book/10.1007/978-1-4842-0511-2">Practical
Linux Infrastructure</a></li>
<li><a href="https://landing.google.com/sre/book.html">Site Reliability
Engineering: How Google Runs Production Systems</a></li>
<li><a href="https://landing.google.com/sre/book.html">The Site
Reliability Workbook: Practical Ways to Implement SRE</a></li>
<li><a
href="https://info.honeycomb.io/observability-engineering-oreilly-book-2022">Observability
Engineering: Achieving Production Excellence</a></li>
<li><a href="http://the-cloud-book.com/">The Practice Of Cloud System
Administration: Designing and Operating Large Distributed
Systems</a></li>
<li><a href="http://shop.oreilly.com/product/0636920000136.do">Web
Operations - Keeping the Data On Time</a></li>
<li><a href="http://atulgawande.com/book/the-checklist-manifesto/">The
Checklist Manifesto: How to Get Things Right</a></li>
<li><a
href="http://www.oreilly.com/programming/free/microservices-in-production.csp">Microservices
in Production - Standard Principles and Requirements</a></li>
<li><a
href="http://shop.oreilly.com/product/0636920053675.do">Production-Ready
Microservices - Building Standardized Systems Across an Engineering
Organization</a></li>
<li><a
href="https://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098/">Systems
Performance: Enterprise and the Cloud</a> [Sample chapter titled <a
href="http://ptgmedia.pearsoncmg.com/images/9780133390094/samplepages/0133390098.pdf">CPUs</a>]</li>
<li><a
href="http://www.oreilly.com/webops-perf/free/monitoring-distributed-systems.csp">Monitoring
Distributed Systems: Case Studies from Google’s SRE Teams</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp">The
Human Side of Postmortems: Managing Stress and Cognitive Biases</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/chaos-engineering.csp">Chaos
Engineering: Building Confidence in System Behavior through
Experiment</a></li>
<li><a
href="https://victorops.com/oreilly-post-incident-review/">Post-Incident
Reviews: Learning from Failure for Improved Incident Responses</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp">Antifragile
Systems and Teams</a></li>
<li><a
href="https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook/">How
to Monitoring the SRE Golden Signals (E-Book)</a></li>
<li><a href="http://shop.oreilly.com/product/0636920036159.do">Incident
Management for Operations</a></li>
<li><a
href="https://www.packtpub.com/web-development/real-world-sre">Real-World
SRE</a></li>
<li><a href="http://shop.oreilly.com/product/0636920063964.do">Seeking
SRE</a></li>
<li><a
href="https://www.verizondigitalmedia.com/e-book/oreilly-what-is-sre/">What
is SRE?</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/engineering-reliable-mobile-applications/">Engineering
Reliable Mobile Applications: Strategies for Developing Resilient Native
Mobile Applications</a></li>
<li><a href="https://landing.google.com/sre/book.html">Building Secure
and Reliable Systems</a></li>
<li><a href="https://www.manning.com/books/chaos-engineering/">Chaos
Engineering: Crash test your applications</a></li>
<li><a
href="https://www.oreilly.com/library/view/97-things-every/9781492081487/">97
Things Every SRE Should Know</a></li>
<li><a
href="https://shopify.engineering/four-steps-creating-effective-game-day-tests">Four
Steps to Creating Effective Game Day Tests</a></li>
<li><a href="https://nostarch.com/tlpi">The Linux Programming
Interface</a></li>
</ul>
<h2 id="hiring">Hiring</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/fong">SRE
Hiring</a></li>
<li><a
href="https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin">Hiring
SREs at LinkedIn</a></li>
<li><a
href="https://www.usenix.org/publications/login/june15/hiring-site-reliability-engineers">Hiring
Site Reliability Engineers</a></li>
<li><a
href="https://sreally.com/hiring-your-first-sre-bdda38ee175d#.2m3sqyuw9">Hiring
your first SRE</a></li>
<li><a href="https://www.youtube.com/watch?v=ZemNg9GYvOA">Growing the
Site Reliability Team at LinkedIn: Hiring is Hard</a></li>
<li><a href="https://danrl.com/blog/srm">Engineering Manager - Site
Reliability Engineering Interview Preparation</a></li>
</ul>
<h2 id="reliability">Reliability</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/kroll">The
Realities of the Job of Delivering Reliability</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2839461">Fail at Scale
by Ben Maurer</a></li>
<li><a href="https://www.youtube.com/watch?v=wrY7XoOnysg">Embracing
Failure: Fault-Injection and Service Reliability</a></li>
<li><a
href="https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan">10
Years of Crashing Google</a></li>
<li><a
href="https://blog.twitter.com/2015/how-we-break-things-at-twitter-failure-testing">How
we break things at Twitter: failure testing</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2745840">Reliable Cron
across the Planet</a></li>
<li><a
href="https://blog.twitter.com/2014/push-our-limits-reliability-testing-at-twitter">Push
our limits - reliability testing at Twitter</a></li>
<li><a href="http://queue.acm.org/detail.cfm?ref=rss&amp;id=2889274">The
Verification of a Distributed System by Caitie McCaffrey</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2371516">Weathering the
Unexpected</a></li>
<li><a href="https://www.youtube.com/watch?v=YFDwdRVTg4g">SRE Hour: Tech
Talks by Box &amp; Yelp</a></li>
<li><a
href="https://sharpend.io/simplicity-a-prerequisite-for-reliability/">Simplicity:
A Prerequisite for Reliability</a></li>
<li><a
href="https://speakerdeck.com/garethr/the-two-sides-to-google-infrastructure-for-everyone-else">The
Two Sides to Google Infrastructure for Everyone Else</a></li>
<li><a
href="https://www.usenix.org/conference/ures14west/summit-program/presentation/dickson">How
Embracing Continuous Release Reduced Change Complexity</a></li>
<li><a
href="https://www.usenix.org/publications/login/october-2014-vol-39-no-5/making-push-green-reality">Making
“Push On Green” a Reality</a></li>
<li><a
href="https://www.usenix.org/publications/login/dec14/ward">BeyondCorp:
A New Approach to Enterprise Security</a></li>
<li><a href="https://www.youtube.com/watch?v=dKe9S8u44Yk">Brainstorming
Failure by Jeff Smith</a></li>
<li><a href="http://cloudtweaks.com/2016/04/outages-and-downtime/">The
Ripple Effect Of Outages And Downtime Cannot Be Underestimated</a></li>
<li><a
href="https://blog.twitter.com/2016/the-infrastructure-behind-twitter-efficiency-and-optimization">The
infrastructure behind Twitter: efficiency and optimization</a></li>
<li><a
href="https://docs.google.com/drawings/d/1kshrK2RLkW-XV8enmWZxeRFRgADj6d4Ru_w5txz_k9I/edit">Dickerson’s
Hierarchy of Reliability</a></li>
<li><a
href="https://blog.acolyer.org/2016/09/21/the-morning-paper-on-operability/">The
Morning Paper on Operability</a></li>
<li><a
href="http://naildrivin5.com/blog/2013/06/16/production-is-all-that-matters.html">Production
is all that matters</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/12/using-load-shedding-to-survive-a-success-disaster-CRE-life-lessons.html">Using
load shedding to survive a success disaster - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/11/how-to-avoid-a-self-inflicted-DDoS-Attack-CRE-life-lessons.html">How
to avoid a self-inflicted DDoS Attack - CRE life lessons</a></li>
<li><a
href="https://www.oreilly.com/ideas/dont-gamble-when-it-comes-to-reliability">Don’t
gamble when it comes to reliability</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2371297">Resilience
Engineering: Learning to Embrace Failure</a></li>
<li><a
href="https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale">The
Infrastructure Behind Twitter: Scale</a></li>
<li><a href="https://www.youtube.com/watch?v=hYu13kBenjE">Scaling
Reliability at Twitter: So You Want to Add a 9</a></li>
<li><a href="http://principlesofchaos.org/">Principles Of Chaos
Engineering</a></li>
<li><a href="https://www.infoq.com/articles/chaos-engineering">Chaos
Engineering</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/01/available-or-not-that-is-the-question-CRE-life-lessons.html">Available…or
not? That is the question - CRE life lessons</a></li>
<li><a
href="http://highscalability.com/blog/2014/2/3/how-google-backs-up-the-internet-along-with-exabytes-of-othe.html">How
Google Backs Up The Internet Along With Exabytes Of Other Data</a></li>
<li><a
href="http://highscalability.com/blog/2017/2/2/performance-scalability-and-high-availability-3-key-infrastr.html">Performance,
Scalability, And High Availability: 3 Key Infrastructure Adaptability
Requirements</a></li>
<li>The Production Environment at Google - <a
href="https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767">Part
1</a> &amp; <a
href="https://medium.com/@jerub/the-production-environment-at-google-part-2-610884268aaa">Part
2</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html">Reliable
releases and rollbacks - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/03/how-release-canaries-can-save-your-bacon-CRE-life-lessons.html">How
release canaries can save your bacon - CRE life lessons</a></li>
<li><a
href="https://zwischenzugs.wordpress.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/">Things
I Learned Managing Site Reliability for Some of the World’s Busiest
Gambling Sites</a></li>
<li><a
href="https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason">Every
Day Is Monday in Operations</a></li>
<li><a
href="https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability">Under
the Hood: Ensuring Site Reliability</a></li>
<li><a href="https://www.youtube.com/watch?v=7Hy_6SMn8pY">Designing
reliable systems with cloud infrastructure (Google Cloud Next
17)</a></li>
<li><a
href="https://cloud.google.com/blog/big-data/2016/10/a-google-sre-explores-github-reliability-with-bigquery">A
Google SRE explores GitHub reliability with BigQuery</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/05/know-thy-enemy-how-to-prioritize-and-communicate-risks-CRE-life-lessons.html">Know
thy enemy: how to prioritize and communicate risks - CRE life
lessons</a></li>
<li><a
href="https://github.com/dastergon/awesome-chaos-engineering">Chaos
Engineering resources</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/08/CRE-life-lessons-what-is-a-dark-launch-and-what-does-it-do-for-me.html">CRE
life lessons: What is a dark launch, and what does it do for
me?</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/why-you-should-pick-strong-consistency-whenever-possible.html">Why
you should pick strong consistency, whenever possible</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2655736">The Network is
Reliable</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3028689">Are You Load
Balancing Wrong?</a></li>
<li><a
href="https://code.facebook.com/posts/166966743929963/how-production-engineers-support-global-events-on-facebook/">How
production engineers support global events on Facebook</a></li>
<li><a
href="http://highscalability.com/blog/2018/4/16/google-a-collection-of-best-practices-for-production-service.html">Google:
A Collection Of Best Practices For Production Services</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3194655">Canary
Analysis Service</a></li>
<li><a
href="https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c">Tips
for High Availability</a></li>
<li><a
href="https://auth0.com/blog/progressive-service-architecture-at-auth0/">Progressive
Service Architecture At Auth0</a></li>
<li><a
href="https://medium.com/google-cloud/production-guideline-9d5d10c8f1e">Google
Cloud Production Guideline</a></li>
<li><a href="https://jbd.dev/prod-readiness/">production
readiness</a></li>
<li><a href="https://www.youtube.com/watch?v=Vvd3uvNvMns">Trust By
Design: The Fusion of Operational Maturity and Risk Modeling</a></li>
<li><a
href="https://www.verica.io/top-seven-myths-of-robust-systems/">Top
Seven Myths of Robust Systems</a></li>
<li><a
href="https://www.oreilly.com/ideas/taming-chaos-preparing-for-your-next-incident">Taming
chaos: Preparing for your next incident</a></li>
<li><a href="https://www.youtube.com/watch?v=3AxSwCC7I4s">PID Loops and
the Art of Keeping Systems Stable</a></li>
<li><a href="https://www.youtube.com/watch?v=YptJ2rrGAYY">Are you ready
for production?</a> - <a
href="https://speakerdeck.com/rakyll/are-you-ready-for-production">Slides</a></li>
<li><a
href="https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html">Production
Checklist for Web Apps on Kubernetes</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-keeps-digging-to-prevent-problems">Finding
a problem at the bottom of the Google stack</a></li>
<li><a
href="https://www.oreilly.com/content/rethinking-task-size-in-sre/">Rethinking
Task Size in SRE</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows">How
maintenance windows affect your error budget</a></li>
<li><a
href="https://dastergon.gr/posts/2020/09/the-production-readiness-spectrum/">The
Production Readiness Spectrum</a></li>
<li><a
href="https://www.oreilly.com/content/generic-mitigations/">Generic
mitigations</a></li>
<li><a
href="https://grafana.com/blog/2021/10/13/how-were-building-a-production-readiness-review-process-at-grafana-labs/">How
we’re building a production readiness review process at Grafana
Labs</a></li>
<li><a
href="https://shopify.engineering/resiliency-planning-for-high-traffic-events">Resiliency
Planning for High-Traffic Events</a></li>
<li><a
href="https://doordash.engineering/2022/04/25/using-fault-injection-testing-to-improve-doordash-reliability/">Using
Fault Injection Testing to Improve DoorDash Reliability</a></li>
</ul>
<h2 id="monitoring-observability-alerting">Monitoring &amp;
Observability &amp; Alerting</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/lisa13/working-theory-monitoring">A
Working Theory-of-Monitoring</a></li>
<li><a href="https://vimeo.com/131484321">The Evolution of Monitoring
Systems at Google - Tony Rippy</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/serebryany">Monitoring
without Infrastructure @ Airbnb</a></li>
<li><a
href="https://www.oreilly.com/ideas/monitoring-distributed-systems">Monitoring
distributed systems</a></li>
<li><a href="https://www.youtube.com/watch?v=2JAnmzVwgP8">Observability
at Uber Engineering: Past, Present, Future</a></li>
<li><a
href="https://blog.netsil.com/the-4-golden-signals-of-api-health-and-performance-in-cloud-native-applications-a6e87526e74">The
4 Golden Signals of API Health and Performance in Cloud-Native
Applications</a></li>
<li><a
href="https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview#">My
Philosophy on Alerting by Rob Ewaschuk</a></li>
<li><a href="https://www.youtube.com/watch?v=wsgpV67MLFo">Time To Detect
- Netflix</a></li>
<li><a
href="https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think">Why
Percentiles Don’t Work the Way you Think</a></li>
<li><a href="https://www.youtube.com/watch?v=jQggG0qIjTM">Building
Twitter’s Next-Gen Alerting System</a></li>
<li><a
href="https://honeycomb.io/blog/2017/01/instrumentation-worst-case-performance-matters/">Instrumentation:
Worst case performance matters</a></li>
<li><a
href="https://honeycomb.io/blog/2017/01/instrumentation-what-does-uptime-mean/">Instrumentation:
What does uptime mean?</a></li>
<li><a
href="https://circleci.com/blog/incidents-outages-at-circleci-our-playbook-and-what-we-ve-learned/">Incidents
+ Outages at CircleCI: Our Playbook and What We’ve Learned</a></li>
<li><a href="https://www.youtube.com/watch?v=gNmWzkGViAY">An
introduction to monitoring and alerting with timeseries at scale, with
Prometheus</a></li>
<li><a href="https://www.youtube.com/watch?v=mG4ZpEhRKHA">Detecting
outliers and anomalies in realtime at Datadog</a></li>
<li><a
href="https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524">How
to Monitor the SRE Golden Signals</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3178371">Monitoring in
a DevOps World</a></li>
<li><a
href="https://medium.com/@jerub/monitoring-your-monitorings-monitoring-51d479100f4c">Monitoring
Your Monitoring’s Monitoring</a></li>
<li><a
href="https://medium.com/@dlite/observability-the-new-wave-or-buzzword-fc23a68abf72">Observability:
the new wave or buzzword?</a></li>
<li><a
href="https://www.vividcortex.com/blog/monitoring-isnt-observability">Monitoring
Isn’t Observability</a></li>
<li><a
href="https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e">Monitoring
in the time of Cloud Native</a></li>
<li><a href="https://www.youtube.com/watch?v=2LNHv0JyBUk">Principles of
Monitoring Microservices</a></li>
<li><a href="https://www.usenix.org/node/197446">The Many Ways Your
Monitoring Is Lying to You</a></li>
<li><a
href="https://www.weave.works/blog/gitops-part-3-observability">GitOps
Part 3 - Observability</a></li>
<li><a
href="https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7">Want
to Debug Latency?</a></li>
<li><a
href="https://medium.com/observability/debugging-latency-in-go-1-11-9f97a7910d68">Debugging
Latency in Go 1.11</a></li>
<li><a
href="https://developers.soundcloud.com/blog/alerting-on-slos">Alerting
on SLOs like Pros</a></li>
<li><a href="https://www.youtube.com/watch?v=JhxfZ0VIPP0">Applied
Alerting Philosophy</a></li>
<li><a
href="https://blog.colinbreck.com/observations-on-observability/">Observations
on Observability</a></li>
<li><a
href="https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/">Deploys:
It’s Not Actually About Fridays</a></li>
<li><a
href="https://medium.com/better-programming/site-reliability-engineering-best-practices-for-data-pipelines-44a78e91f6f0">Site
Reliability Engineering Best Practices for Data Pipelines</a></li>
<li><a
href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">Elastic
Observability in SRE and Incident Response</a></li>
<li><a
href="https://medium.com/expedia-group-tech/error-budget-policy-adoption-at-expedia-group-7d80d41c4a8b">Error
Budget Policy - Part 1 - Adoption at Expedia Group</a></li>
<li><a
href="https://medium.com/expedia-group-tech/error-budget-policies-in-practice-4c98f56a28c1">Error
Budget Policy - Part 2 - Practices at Expedia Group</a></li>
</ul>
<h2 id="on-call">On-Call</h2>
<ul>
<li><a href="http://research.google.com/pubs/pub44813.html">Being an
On-Call Engineer: A Google SRE Perspective</a></li>
<li><a
href="https://www.atlassian.com/blog/it-teams/inside-atlassian-site-reliability-engineers-incident-management">Inside
Atlassian: how our site reliability engineers do incident
management</a></li>
<li><a
href="https://www.atlassian.com/blog/2016/02/inside-atlassian-sre-use-chatops-run-incident-management">Inside
Atlassian: how IT &amp; SRE use ChatOps to run incident
management</a></li>
<li><a
href="https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku">Incident
Response at Heroku</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/9/6/whos-on-call">Who’s On
Call?</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-6-no-more-on-call-martyrs.html">SysAdvent
- Day 6 - No More On-Call Martyrs</a></li>
<li><a href="http://naildrivin5.com/blog/2016/12/07/on-call.html">On
Being On Call</a></li>
<li><a href="https://github.com/alicegoldfuss/oncall-handbook">The
On-Call Handbook</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/02/Incident-management-at-Google-adventures-in-SRE-land.html">Incident
management at Google — adventures in SRE-land</a></li>
<li><a href="https://github.com/SkeltonThatcher/run-book-template">Run
Book / Operations Manual template</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch">Automating
Your Oncall: Open Sourcing Fossor and Ascii Etch</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process">Project
STAR*: Streamlining Our On-Call Process</a></li>
<li><a
href="https://devblog.xero.com/sre-xero-managing-incidents-part-i-7d02d650a71c">SRE@Xero:
Managing Incidents Part I</a></li>
<li><a
href="https://devblog.xero.com/sre-xero-managing-incidents-part-ii-224a6e06f426">SRE@Xero:
Managing Incidents Part II</a></li>
<li><a
href="https://www.gremlin.com/how-to-establish-a-high-severity-incident-management-program/">How
To Establish a High Severity Incident Management Program</a></li>
<li><a href="https://www.youtube.com/watch?v=xA5U85LSk0M">How Your
Systems Keep Running Day After Day - John Allspaw</a></li>
<li><a
href="https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0">On-call
doesn’t have to suck</a></li>
<li><a
href="https://medium.com/@awspyker/why-as-a-netflix-infrastructure-manager-am-i-on-call-bdc551ac01fe">Why,
as a Netflix infrastructure manager, am I on call?</a></li>
<li><a
href="https://honeycomb.io/blog/2018/02/oncall-and-sustainable-software-development/">Oncall
and Sustainable Software Development</a></li>
<li><a
href="https://thenewstack.io/call-rotations-best-wake-devs-middle-night/">On
Call Rotations: How Best to Wake Devs Up in the Middle of the
Night</a></li>
<li><a
href="https://www.gremlin.com/community/tutorials/understanding-the-role-of-the-incident-manager-on-call-imoc/">Understanding
The Role Of The Incident Manager On-Call (IMOC)</a></li>
<li><a
href="https://devops.com/three-ways-to-minimize-the-impact-of-high-severity-incidents/">3
Ways to Minimize the Impact of High Severity Incidents</a></li>
<li><a
href="https://thenewstack.io/advice-management-teams-enrolling-changes-on-call-systems/">Advice
to Management Teams While Enrolling Changes to On-Call Systems</a></li>
<li><a
href="http://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/">Moving
Past Shallow Incident Data</a></li>
<li><a
href="https://codywilbourn.com/2018/03/22/sustainable-on-call/">Sustainable
On-Call</a></li>
<li><a href="https://youtu.be/8pPrtf1J1Z8">dotScale 2017 - Aish Raj
Dahal - Chaos management during a major incident</a></li>
<li><a
href="https://www.infoq.com/presentations/netflix-incident-management">Incident
Management at Netflix Velocity</a></li>
<li><a
href="https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3">Incidents,
fixes, and the day after</a></li>
<li><a
href="https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c">10
Steps to Develop an Incident Response Plan You’ll ACTUALLY Use</a></li>
<li><a
href="https://tech.buzzfeed.com/checklists-an-operational-gift-aaf42cf0be12">Checklists:
a stupidly simple but valuable operational gift</a></li>
<li><a
href="https://blog.hostedgraphite.com/2018/09/13/how-to-write-a-status-page-update/">How
to write a status page update</a></li>
<li><a
href="https://www.atlassian.com/software/jira/ops/handbook">Atlassian
Incident Handbook</a></li>
<li><a href="https://response.pagerduty.com/">PagerDuty Incident
Response Handbook</a></li>
<li><a
href="https://blog.zenduty.com/blog/2019/05/02/Avoiding-SRE-Burnout">Avoiding
Burnout for SREs</a></li>
<li><a href="https://vimeo.com/344516642">Better On-Call the SRE
way</a></li>
<li><a href="https://www.youtube.com/watch?v=ZqwVlsIonIw">Managing
Incidents at Monzo</a></li>
<li><a
href="https://dev.to/molly_struve/making-on-call-not-suck-490">Making
On-Call Not Suck</a></li>
<li><a
href="https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents">How
we (Monzo) respond to incidents</a></li>
<li><a
href="https://monzo.com/blog/how-weve-evolved-on-call-at-monzo">How
we’ve evolved on-call at Monzo</a></li>
<li><a
href="https://devops.com/code-yellow-when-operations-isnt-perfect/">Code
Yellow: When Operations Isn’t Perfect</a></li>
<li><a
href="https://opensource.com/article/19/7/measure-operational-performance">MTTR
is dead, long live CIRT</a></li>
<li><a href="https://github.com/preed/incident-lifecycle-model">Extended
Dreyfus Model for Incident Lifecycles</a></li>
<li><a
href="https://www.verica.io/inhumanity-of-root-cause-analysis/">Inhumanity
of Root Cause Analysis</a></li>
<li><a href="https://www.youtube.com/watch?v=ODYO2MPymJ4">Incident
insights from NASA, NTSB, and the CDC</a></li>
<li><a
href="https://www.squadcast.com/blog/how-to-avoid-on-call-burnout">How
to avoid On-Call Burnout the SRE Way</a></li>
<li><a href="https://about.gitlab.com/blog/2019/12/16/sre-shadow/">My
week shadowing a GitLab Site Reliability Engineer</a></li>
<li><a
href="https://about.gitlab.com/blog/2018/03/14/the-on-call-handover-at-gitlab/">How
our production team runs the weekly on-call handover</a></li>
<li><a
href="https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/">Writing
Runbook Documentation When You’re An SRE</a></li>
<li><a
href="https://lethain.com/incident-response-programs-and-your-startup/">Incident
response, programs and you(r startup)</a></li>
<li><a
href="https://blog.danslimmon.com/2019/06/24/an-incident-command-training-handbook/">An
Incident Command Training Handbook</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents">Shrinking
the time to mitigate production incidents</a></li>
<li><a
href="https://surfingcomplexity.blog/2021/06/11/incident-writeup-as-sociological-storytelling/">Incident
writeup as sociological storytelling</a></li>
<li><a
href="https://www.blameless.com/incident-response/elephant-in-the-blameless-war-room-accountability">Elephant
in the Blameless War Room: Accountability</a></li>
<li><a
href="https://surfingcomplexity.blog/2021/05/22/naming-names-in-incident-writeups/">Naming
names in incident writeups</a></li>
<li><a
href="https://github.blog/2021-01-06-building-on-call-culture-at-github/">Building
On-Call Culture at GitHub</a></li>
</ul>
<h2 id="post-mortem">Post-Mortem</h2>
<ul>
<li><a href="https://github.com/danluu/post-mortems">A collection of
post-mortems</a></li>
<li><a
href="https://github.com/hjacobs/kubernetes-failure-stories">Collection
of Kubernetes Failure Stories</a></li>
<li><a
href="https://codeascraft.com/2012/05/22/blameless-postmortems/">Blameless
PostMortems and a Just Culture</a></li>
<li><a href="https://blog.box.com/blog/a-tale-of-postmortems/">A Tale of
Postmortems</a></li>
<li><a href="http://runasradio.com/Shows/Show/486">Building a Blameless
Post-Mortem Culture with Jason Hand</a></li>
<li><a href="https://www.oreilly.com/ideas/the-infinite-hows">The
infinite hows</a></li>
<li><a href="https://victorops.com/blog/blameless-culture/">Failure is
Always An Option: How a Blameless Culture Leads to Better
Results</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-1-why-you-need-postmortem-process.html">SysAdvent
- Day 1 - Why You Need a Postmortem Process</a></li>
<li><a
href="https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/">Etsy’s
Debriefing Facilitation Guide for Blameless Postmortems</a></li>
<li><a href="https://sharpend.io/writing-your-first-postmortem/">Writing
Your First Postmortem</a></li>
<li><a
href="https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/">How
to Write Great Outage Post-Mortems</a></li>
<li><a href="https://github.com/dastergon/postmortem-templates">A
collection of postmortem templates</a></li>
<li><a
href="https://blog.heptio.com/embracing-feedback-2fd703da714f">Embracing
Feedback</a></li>
<li><a
href="https://www.usenix.org/conference/srecon17americas/program/presentation/lueder">Postmortem
Action Items: Plan the Work and Work the Plan</a></li>
<li><a
href="https://medium.com/@allspaw/social-issues-in-postmortems-d48dde624d18">Social
Issues In Postmortems</a></li>
<li><a
href="https://www.inc.com/justin-bariso/meet-postmortem-googles-brilliant-process-tool-for-learning-from-failure.html">Google
Has an Official Process in Place for Learning From Failure, and It’s
Absolutely Brilliant</a></li>
<li><a
href="https://rework.withgoogle.com/blog/postmortem-culture-how-you-can-learn-from-failure/">Postmortem
culture: how you can learn from failure</a></li>
<li><a
href="https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit">re:Work
- Postmortem discussion template</a></li>
<li><a
href="https://increment.com/documentation/post-mortems-to-the-rescue/">Post-mortems
to the rescue</a></li>
<li><a href="https://ai.google/research/pubs/pub45906">Postmortem Action
Items: Plan the Work and Work the Plan</a></li>
<li><a
href="https://www.blameless.com/why-companies-can-benefit-from-blameless-culture/">Why
Every Company Can Benefit from a Blameless Culture</a></li>
<li><a
href="https://www.hostedgraphite.com/blog/its-dead-jim-how-we-write-an-incident-postmortem">“It’s
dead, Jim”: How we write an incident postmortem</a></li>
<li><a
href="https://www.hostedgraphite.com/blog/incident-postmortem-template">Our
incident postmortem template</a></li>
<li><a
href="https://fernandocejas.com/2020/03/21/learn-out-of-mistakes-postmortems/">Learn
out of mistakes. Postmortems to the rescue.</a></li>
<li><a
href="https://www.blameless.com/improve-postmortem-with-sre-steve-mcghee/">Improving
Postmortem Practices with Veteran Google SRE, Steve McGhee</a></li>
<li><a
href="https://www.verica.io/blog/inhumanity-of-root-cause-analysis/">Inhumanity
of Root Cause Analysis</a></li>
</ul>
<h2 id="capacity-planning">Capacity Planning</h2>
<ul>
<li><a
href="https://www.usenix.org/system/files/login/articles/login_feb15_07_hixson.pdf">Capacity
Planning</a></li>
<li><a href="https://www.youtube.com/watch?v=MDQ0uEUmLOo">SouthBay SRE:
Cloud Capacity Planning</a></li>
<li><a
href="https://www.squadcast.com/blog/intent-based-capacity-planning-and-autoscaling-with-kubernetes">Intent-based
Capacity Planning and Autoscaling with Kubernetes</a></li>
<li><a
href="https://jvns.ca/blog/2016/03/20/how-do-you-do-capacity-planning/">How
do you do Capacity Planning</a></li>
<li><a
href="https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408">How
Back Market SREs prepared for Black Friday</a></li>
</ul>
<h2 id="service-level-agreement">Service Level Agreement</h2>
<ul>
<li><a
href="http://er.educause.edu/articles/2010/6/if-its-in-the-cloud-get-it-on-paper-cloud-computing-contract-issues">If
It’s in the Cloud, Get It on Paper: Cloud Computing Contract
Issues</a></li>
<li><a
href="http://www.wired.com/insights/2011/12/service-level-agreements-in-the-cloud-who-cares/">Service
Level Agreements in the Cloud: Who cares?</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-20-how-to-set-and-monitor-slas.html">SysAdvent
- Day 20 - How to set and monitor SLAs</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html">SLOs,
SLIs, SLAs, oh my - CRE life lessons</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/jones">Service
Levels and Error Budgets</a></li>
<li><a
href="https://www.usenix.org/system/files/login/articles/login_aug15_06_roth.pdf">(Un)Reliability
Budgets - Finding Balance between Innovation and Reliability</a></li>
<li><a
href="https://queue.acm.org/detail.cfm?id=3096459&amp;__s=dnkxuaws9pogqdnxmx8i">The
Calculus of Service Availability</a></li>
<li><a
href="https://dastergon.github.io/availability-calculator/">Availability
Calculator: Calculate how much downtime should be permitted in your
SLA</a></li>
<li><a
href="https://www.ibm.com/developerworks/cloud/library/cl-SLAloadbalance-numanalysis/">Standardize
cloud SLA availability with numerical performance data</a></li>
<li><a
href="https://www.ibm.com/developerworks/cloud/library/cl-slastandards/">Best
practices to develop SLAs for cloud computing</a></li>
<li><a href="https://www.catchpoint.com/blog/sla-management-guide/">A
Practical Guide to SLAs</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/10/building-good-SLOs-CRE-life-lessons.html">Building
good SLOs - CRE life lessons</a></li>
<li><a
href="https://thenewstack.io/sre-lessons-google-no-grumpy-humans/">No
Grumpy Humans and Other Site Reliability Engineering Lessons from
Google</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/consequences-of-SLO-violations-CRE-life-lessons.html">Consequences
of SLO violations — CRE life lessons</a></li>
<li><a
href="https://medium.com/@jerub/service-level-objectives-in-practice-ed1200502d5">Service
Level Objectives in Practice</a></li>
<li><a
href="https://medium.com/@jerub/sre-consensus-building-36ad5d2e470b">SRE
Consensus Building</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/an-example-escalation-policy-CRE-life-lessons.html">An
example escalation policy — CRE life lessons</a></li>
<li><a href="https://dastergon.gr/error-budget-calculator/">Error Budget
Calculator</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/06/understanding-error-budget-overspend-cre-life-lessons.html">Understanding
error budget overspend - part one - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/06/cre-life-lessons-good-housekeeping-for-error-budgets.html">Good
housekeeping for error budgets - part two - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/07/sre-fundamentals-slis-slas-and-slos.html">SRE
fundamentals: SLIs, SLAs and SLOs</a></li>
<li><a
href="https://www.circonus.com/2018/07/a-guide-to-service-level-objectives/">SLOs
&amp; You: A Guide To Service Level Objectives</a></li>
<li><a
href="https://medium.com/concourse-ci/earning-our-wings-a0c307fa73e6">Earning
Our Wings: Stories and Findings From Operating a Large-scale Concourse
Deployment</a></li>
<li><a href="https://ai.google/research/pubs/pub48033">Nines are Not
Enough: Meaningful Metrics for Clouds</a></li>
<li><a
href="https://medium.com/@jamesacowling/how-many-nines-is-my-storage-system-7d16e852d56d">How
many nines is my storage system?</a></li>
<li><a href="https://lethain.com/dont-follow-the-sun/">Don’t follow the
sun.</a></li>
<li><a href="https://www.youtube.com/watch?v=4cPqLuIXBnw">The Tyranny of
the SLA</a></li>
<li><a
href="https://www.backblaze.com/blog/cloud-storage-durability/">Backblaze
Durability is 99.999999999% — And Why It Doesn’t Matter</a></li>
<li><a href="https://youtu.be/Dfnbw5dJQ5I">DevOpsDays Chicago 2019 - The
Art of SLOs</a></li>
<li><a href="https://cre.page.link/art-of-slos">The Art of SLOs Workshop
Materials</a></li>
<li><a
href="https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/">How
to Include Latency in SLO-Based Alerting</a></li>
<li><a
href="https://www.squadcast.com/blog/succeeding-with-service-level-objectives">Succeeding
With Service Level Objectives</a></li>
<li><a
href="https://medium.com/the-telegraph-engineering/putting-customers-first-with-slis-and-slos-15352f9b6cbc">Putting
customers first with SLIs and SLOs</a></li>
<li><a
href="https://medium.com/site-reliability-engineering-leadership/sre-tip-have-tiered-slas-2c432ffe46a">SRE
Leadership: Have Tiered SLAs</a></li>
<li><a
href="https://www.blameless.com/blog/how-slos-enable-fast-reliable-application-delivery">How
SLOs Enable Fast, Reliable Application Delivery</a></li>
<li><a href="https://billduncan.org/the-tail-at-scale/">The Tail at
Scale</a></li>
<li><a href="https://billduncan.org/the-tail-at-scale-revisited/">The
Tail at Scale Revisited</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/defining-slos-for-services-with-dependencies-cre-life-lessons">Defining
SLOs for services with dependencies</a></li>
<li><a
href="https://blog.b3k.us/2009/07/15/service-level-disagreements.html">Service
Level Disagreements</a></li>
<li><a
href="https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/">How
We Use Sloth to do SLO Monitoring and Alerting with Prometheus</a></li>
<li><a
href="https://medium.com/site-reliability-engineering-leadership/sli-deep-dive-cae92bd90a79">SLI
Deep Dive</a></li>
<li><a
href="https://medium.com/google-cloud/measuring-reliability-in-gcp-step-by-step-slo-creation-guide-using-cloud-operation-sandbox-99043bd0e70f">Measuring
Reliability in GCP: Step By Step SLO creation guide using Cloud
Operation Sandbox</a></li>
<li><a href="https://slotracker.com/">SLO tracker</a></li>
<li><a
href="https://ervinbarta.com/2021/10/19/slo-alerting-for-mortals/">SLO
Alerting for Mortals</a></li>
<li><a
href="https://bpetit.nce.re/2021/03/sre-methods-and-climate-change/">SRE
methods and climate change</a></li>
<li><a
href="https://medium.com/lightstephq/what-made-slos-so-messy-and-what-we-can-do-about-it-89be415a80b3">What
made SLOs so messy (and what we can do about it)</a></li>
<li><a
href="https://engineering.fb.com/2021/12/13/production-engineering/slick/">SLICK:
Adopting SLOs for improved reliability</a></li>
<li><a
href="https://alexewerlof.medium.com/calculating-composite-sla-d855eaf2c655">Calculating
composite SLA</a></li>
<li><a
href="https://newrelic.com/blog/best-practices/best-practices-for-setting-slos-and-slis-for-modern-complex-systems">Best
practices for setting SLOs and SLIs for modern, complex systems</a></li>
</ul>
<h2 id="performance">Performance</h2>
<ul>
<li><a
href="https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html">Performance
Checklists for SREs</a></li>
<li><a href="https://youtu.be/uQ0flQOtQEA">South Bay SRE Meetup -
Netflix Cloud Performance Team</a></li>
<li><a
href="https://medium.com/dm03514-tech-blog/sre-performance-analysis-tuning-methodology-using-a-simple-http-webserver-in-go-d475460f27ca">Software
Performance Analysis Guided By SLOs</a></li>
<li><a
href="https://mterwill.com/posts/framework-for-performance-engineering/">A
framework for pragmatic performance engineering</a></li>
</ul>
<h2 id="programming">Programming</h2>
<ul>
<li><a href="http://www.oreilly.com/pub/e/2712">Go Language for Ops and
Site Reliability Engineering</a></li>
<li><a
href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_hamilton.pdf">Go
for SREs using Python</a></li>
<li><a
href="https://speakerdeck.com/ianschenck/operability-in-go">Operability
in Go</a></li>
<li><a href="https://www.youtube.com/watch?v=5doOcaMXx08">Go Reliability
and Durability at Dropbox</a></li>
</ul>
<h2 id="misc-articles">Misc Articles</h2>
<ul>
<li><a
href="https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering">What
is SRE (Site Reliability Engineering)?</a></li>
<li><a
href="http://www.wired.com/2016/04/google-ensures-services-almost-never-go/">Here’s
How Google Makes Sure It (Almost) Never Goes Down</a></li>
<li><a
href="http://techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/">Are
site reliability engineers the next data scientists?</a></li>
<li><a
href="http://googleresearch.blogspot.gr/2012/07/site-reliability-engineers-solving-most.html">Site
Reliability Engineers: “solving the most interesting problems”</a></li>
<li><a
href="http://googleforstudents.blogspot.gr/2012/06/site-reliability-engineers-worlds-most.html">Site
Reliability Engineers: the “world’s most intense pit crew”</a></li>
<li><a
href="http://searchitoperations.techtarget.com/feature/Site-reliability-engineering-kicks-rote-tasks-out-of-IT-ops">Site
reliability engineering kicks rote tasks out of IT ops</a></li>
<li><a href="http://danluu.com/google-sre-book/">Notes on Site
Reliability Engineering</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/07/adventures-in-SRE-land-welcome-to-Google-Mission-Control.html">Adventures
in SRE-land: Welcome to Google Mission Control</a></li>
<li><a
href="https://www.infoq.com/articles/site-reliability-engineering">Book
Review: Site Reliability Engineering - How Google Runs Production
Systems</a></li>
<li><a
href="https://www.google.com/about/careers/stories/site-reliability-engineering-profile-google/">Site
Reliability Engineers: “We solve cooler problems”</a></li>
<li><a
href="http://www.networkworld.com/article/3182827/cloud-computing/srecon17-brave-new-world-of-site-reliability-engineering.html">SREcon17:
Brave new world of site reliability engineering</a></li>
<li><a href="https://github.com/open-guides/og-aws">Open AWS
guide</a></li>
<li><a
href="https://medium.com/@jerub/commentary-on-site-reliability-engineering-9ba9e1be2a8c">Commentary
on Site Reliability Engineering</a></li>
<li><a
href="https://www.networkcomputing.com/data-centers/site-reliability-engineering-4-things-know/888724300">Site
Reliability Engineering: 4 Things to Know</a></li>
<li><a
href="https://www.linkedin.com/pulse/looking-sre-success-find-intrapreneurs-josh-gilliland/">Looking
for SRE Success? Then Find the Intrapreneurs!</a></li>
<li><a href="http://web.devopstopologies.com/">What Team Structure is
Right for DevOps to Flourish?</a></li>
<li><a
href="https://www.sidewalksafari.com/2018/12/sre-in-a-travel-emergency.html">Injured
on Vacation? Applying Principles from Site Reliability Engineering to a
Travel Emergency</a></li>
<li><a href="https://sobolevn.me/2018/12/blameless-environment">Building
blameless working environment</a></li>
<li><a
href="https://techbeacon.com/devops/how-accenture-retrofitted-site-reliability-engineering">SRE
Adoption Report</a></li>
<li><a
href="https://devops.com/sres-the-happiest-and-highest-paid-in-the-industry/">SREs:
The Happiest and Highest Paid in the Industry</a></li>
<li><a
href="https://thenewstack.io/the-role-of-site-reliability-engineering-today-and-tomorrow/">The
Role of Site Reliability Engineering, Today and Tomorrow</a></li>
<li><a
href="https://medium.com/@bellmar/sre-as-a-lifestyle-choice-de9f5a82d73d">SRE
as a Lifestyle Choice</a></li>
<li><a
href="https://speakerdeck.com/dastergon/srecon-emea-2019-recap-sre-muc-meetup">SRECon
EMEA 2019 Recap</a></li>
<li><a href="https://www.youtube.com/watch?v=7Oe8mYPBZmw">Life of an SRE
at Google - JC van Winkel</a></li>
<li><a
href="https://www.infoq.com/articles/site-reliability-engineering-mobile-apps/">Site
Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa</a>
- Case study: Halodoc’s adaptation of SRE principles for Native Mobile
Apps</li>
<li><a href="https://www.infracloud.io/blogs/sre-best-practices/">SRE
Best Practices by InfraCloud</a></li>
</ul>
<h2 id="real-time-messaging">Real-time Messaging</h2>
<ul>
<li><a href="https://hangops.slack.com/">#sre channel at Hangops
Slack</a> - Discussion of Site Reliability Engineering generally.</li>
<li><a href="https://hangops.slack.com/">#incident_response channel at
Hangops Slack</a> - Discussion about Incident Response.</li>
<li><a href="https://usenix-srecon.slack.com">USENIX SREcon
Slack</a></li>
</ul>
<h2 id="blogs">Blogs</h2>
<ul>
<li><a href="http://www.brendangregg.com/blog/index.html">Brendan
Greggs Blog</a> - Highly Technical Blog Posts About Systems Internals,
Performance and SRE.</li>
<li><a href="http://everythingsysadmin.com/">Everything Sysadmin</a> -
Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.</li>
<li><a href="http://highscalability.com/">High Scalability</a> -
Technical Blog Posts About Systems Architecture.</li>
<li><a href="https://rachelbythebay.com/w/">rachelbythebay</a> -
Technical Blog Posts.</li>
<li><a href="http://www.susanjfowler.com/blog/">Susan J. Fowler</a> -
Various blog posts about SRE, Software Engineering and
Microservices.</li>
<li><a href="https://sysadvent.blogspot.com">SysAdvent</a> - One article
for each day of December, ending on the 25th article.</li>
<li><a href="https://medium.com/@jerub">Stephen Thornes Blog</a> - Blog
Posts About SRE.</li>
<li><a href="https://increment.com/">Increment</a> - A digital magazine
about how teams build and operate software systems at scale.</li>
<li><a href="http://www.gophersre.com/">GopherSRE</a> - Blog Posts about
Go and SRE.</li>
<li><a href="https://medium.com/@copyconstruct">Cindy Sridharan</a> -
Blog posts about distributed systems and their management.</li>
<li><a href="https://www.blameless.com/blog/">Blameless Blog</a> - Blog
posts about SRE culture and practices.</li>
<li><a href="https://ResilienceRoundup.com">Resilience Roundup</a> -
Weekly analysis of Resilience Engineering and Human Factors research
designed for software systems.</li>
<li><a href="https://www.squadcast.com/blog">Squadcast Blog</a> - Blog
posts about SRE best practices, reliability, on-call and incident
management.</li>
<li><a href="https://www.firehydrant.io/blog">FireHydrant Blog</a> -
Posts about complex systems, incident response, and SRE best
practices.</li>
<li><a href="https://www.rootly.io/blog">Rootly Blog</a> - Incident
management best practices and guides.</li>
<li><a href="https://www.incident.io/blog">incident.io Blog</a> -
Guides, advice and resources on incident management and response.</li>
<li><a href="https://logit.io/blog">Logit.io Blog</a> - Resources on log
management, SRE and DevOps.</li>
</ul>
<h2 id="newsletters">Newsletters</h2>
<ul>
<li><a href="https://faun.dev">DevOpsLinks</a> - A weekly newsletter
about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.</li>
<li><a href="https://kubeweekly.io/">KubeWeekly</a> - The weekly
newsletter for all things Kubernetes. KubeWeekly is curated by Bob
Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.</li>
<li><a href="https://sreweekly.com/">SRE Weekly</a> - Weekly Site
Reliability Newsletter.</li>
<li><a
href="http://www.oreilly.com/webops-perf/newsletter.html">OReilly
Systems Engineering and Operations Newsletter</a> - Weekly systems
engineering and operations news and insights from industry
insiders.</li>
<li><a href="https://chaosengineering.news/">ChaosEngineering.news</a> -
Chaos Engineering newsletter. All things Chaos Engineering, directly to
your inbox!</li>
<li><a href="https://monitoring.love/">Monitoring Weekly</a> - Whats
new in monitoring? Curated monitoring articles to your inbox each
week.</li>
<li><a href="https://o11y.news/">Observability news</a> - Updates around
observability (o11y) with a special focus on open source.</li>
</ul>
<h2 id="conferences-meetups">Conferences &amp; Meetups</h2>
<ul>
<li><a href="https://www.usenix.org/conferences/byname/925">SREcon
Conferences</a> - The Official SRE Conference.</li>
<li><a href="https://www.usenix.org/conferences/byname/5">LISA
Conferences</a> - Prominent Conference About SysAdmin/DevOps/SRE.</li>
<li><a href="https://developers.google.com/events/sre/">SRE Tech
Talks</a> - SRE Talks Hosted by Google.</li>
<li><a
href="https://www.meetup.com/South-Bay-Site-Reliability-Engineering/">South
Bay Site Reliability Engineering (Sunnyvale, CA) Meetup</a> - A Group
For Individuals Who Tackle Reliability Challenges For Web-Scale
Systems.</li>
<li><a
href="https://www.meetup.com/San-Francisco-Reliability-Engineering/">San
Francisco Reliability Engineering</a> - A Group Of People Who Are
Passionate About Reliable, Performant Software Systems.</li>
<li><a
href="https://www.meetup.com/Site-Reliability-Engineering-Munich/">Site
Reliability Engineering Munich, Germany</a> - SRE Meetup in the greater
area of Oktoberfest city.</li>
<li><a href="https://www.alldaydevops.com/">ADDO - All Day DevOps</a> -
A 24-hour conference that is completely online and free.</li>
<li><a
href="https://www.meetup.com/Site-Reliability-Engineering-Paris/">Site
Reliability Engineering Paris, France</a> - SRE Meetup in the city of
light.</li>
<li><a href="https://www.meetup.com/site-reliability-enggineering/">Site
Reliability Engineering India</a> - SRE Meetup in India.</li>
</ul>
<h2 id="twitter">Twitter</h2>
<ul>
<li><a href="https://twitter.com/googlesre">Google SRE Twitter
Account</a> - Googles SRE Twitter Account.</li>
<li><a href="https://twitter.com/SREBook">SREBook</a> - The Official
Twitter Account of the Site Reliability Engineering Book.</li>
<li><a href="https://twitter.com/SREcon">SREcon</a> - SREcon’s Official
Twitter Account.</li>
<li><a href="https://twitter.com/SREWorkbook">SREWorkbook</a> - The
Official Twitter Account of the Site Reliability Workbook.</li>
<li><a href="https://twitter.com/The_SRE_Dev">The SRE Dev</a> -
SRE-related Posts from <a href="https://dev.to">dev.to</a>.</li>
<li><a href="https://twitter.com/TwitterSRE">Twitter SRE</a> - The
Official Twitter Account of Twitters SRE team.</li>
<li><a href="https://twitter.com/SREWeekly">Twitter SRE Weekly</a> - The
Official Twitter Account of the SRE Weekly Newsletter.</li>
<li><a href="https://twitter.com/usenix">USENIX Association</a> - The
Official USENIX Twitter Account.</li>
</ul>
<h2 id="sre-tools">SRE Tools</h2>
<ul>
<li><a href="https://github.com/SquadcastHub/awesome-sre-tools">Awesome
SRE Tools</a> - A curated list of Site Reliability and Production
Engineering tools.</li>
<li><a href="https://github.com/ligurio/awesome-ci">List of Continuous
Integration services</a></li>
<li><a href="https://github.com/shibumi/SRE-cheat-sheet">SRE cheat
sheet</a> - A cheat sheet for Site Reliability Engineering principles
and numbers.</li>
</ul>
<h2 id="podcasts">Podcasts</h2>
<ul>
<li><a
href="https://podcasts.apple.com/us/podcast/resilience-in-action/id1506828506">Blameless
/ Resilience in Action</a></li>
<li><a href="https://sre.google/prodcast">Google SRE Prodcast</a></li>
<li><a href="https://www.honeycomb.io/usecase/o11ycast/">o11y
Observability Podcast</a></li>
<li><a
href="https://podcasts.apple.com/us/podcast/on-call-nightmares-podcast/id1447430839">On
Call Nightmares (retired)</a></li>
<li><a
href="https://open.spotify.com/show/1KxLVUduNdDRAiOw8BB32J">Making of
the SRE Omelette</a></li>
</ul>
<p><a href="https://github.com/dastergon/awesome-sre">sre.md
GitHub</a></p>