<h1 id="awesome-site-reliability-engineering-awesome">Awesome Site
Reliability Engineering <a
href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></h1>
<p><a
href="https://dastergon.gr/awesome-sre"><img src="awesome-sre-logo.svg" align="right" width="100"></a></p>
<p>A curated list of awesome <a
href="https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre">Site
Reliability</a> and <a
href="https://www.usenix.org/conference/srecon15/program/presentation/canahuati">Production</a>
Engineering resources.</p>
<h4 id="what-is-site-reliability-engineering">What is Site Reliability
Engineering?</h4>
<blockquote>
<p>“Fundamentally, it’s what happens when you ask a software engineer to
design an operations function.” - Ben Treynor Sloss, VP Google
Engineering, founder of Google SRE</p>
</blockquote>
<h2 id="contributing">Contributing</h2>
<p>Please take a look at the <a href="CONTRIBUTING.md">contribution
guidelines</a> first. Contributions are always welcome!</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#culture">Culture</a></li>
<li><a href="#education">Education</a></li>
<li><a href="#books">Books</a></li>
<li><a href="#hiring">Hiring</a></li>
<li><a href="#reliability">Reliability</a></li>
<li><a href="#monitoring--observability--alerting">Monitoring &
Observability & Alerting</a></li>
<li><a href="#on-call">On-Call</a></li>
<li><a href="#post-mortem">Post-Mortem</a></li>
<li><a href="#capacity-planning">Capacity Planning</a></li>
<li><a href="#service-level-agreement">Service Level Agreement</a></li>
<li><a href="#performance">Performance</a></li>
<li><a href="#programming">Programming</a></li>
<li><a href="#misc-articles">Misc Articles</a></li>
<li><a href="#real-time-messaging">Real-time Messaging</a></li>
<li><a href="#blogs">Blogs</a></li>
<li><a href="#newsletters">Newsletters</a></li>
<li><a href="#conferences-meetups">Conferences & Meetups</a></li>
<li><a href="#twitter">Twitter</a></li>
<li><a href="#sre-tools">SRE Tools</a></li>
<li><a href="#podcasts">SRE Podcasts</a></li>
</ul>
<h2 id="culture">Culture</h2>
<ul>
<li><a
href="https://landing.google.com/sre/interview/ben-treynor.html">What is
Site Reliability Engineering?</a></li>
<li><a
href="https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre">Keys
To SRE by Ben Treynor</a></li>
<li><a href="https://landing.google.com/sre/resources.html">Google SRE
Resources</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/canahuati">Notes
from Production Engineering by Pedro Canahuati</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15europe/program/presentation/underwood">PostOps:
Recovery from Operations</a></li>
<li><a
href="https://www.atlassian.com/it-service/site-reliability-engineering-sre">Love
DevOps? Wait ’till you meet SRE</a> <a
href="https://youtu.be/fsTpRx8Pt-k">[video]</a></li>
<li><a href="https://www.youtube.com/watch?v=H4vMcD7zKM0">How Google
Does Planet-Scale Engineering for Planet-Scale Infra</a></li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/site-reliability-engineering-at-facebook/291616313919/">Site
Reliability Engineering at Facebook</a></li>
<li><a
href="https://www.youtube.com/watch?v=qJnS-EfIIIE&nohtml5=False">A
History of Site Reliability Engineering at Uber</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/limoncelli">Case
Study: Adopting SRE Principles at StackOverflow</a></li>
<li><a href="https://www.youtube.com/watch?v=ggizCjUCCqE">Site
Reliability Engineering at Dropbox</a></li>
<li><a href="https://www.youtube.com/watch?v=yXI7r0_J29M">Site
Reliability Engineers - Keeping Google up and running 24/7</a></li>
<li><a href="https://www.salesforce.com/video/193050/">Site Reliability
Engineering at Salesforce</a></li>
<li>From Sys Admin to Netflix SRE - <a
href="https://www.youtube.com/watch?v=lZI51YzIgVE">video</a> and <a
href="https://www.socallinuxexpo.org/sites/default/files/presentations/Scale%20x14%20Slides.pdf">slides</a></li>
<li><a href="https://www.youtube.com/watch?v=iIuTnhdTzK0">SRE@Google:
Thousands of DevOps Since 2004</a></li>
<li><a
href="https://www.usenix.org/conference/lisa15/conference-program/presentation/limoncelli">Transactional
System Administration Is Killing Us and Must be Stopped</a></li>
<li><a
href="https://web.archive.org/web/20190401220948/https://plus.google.com/+lizthegrey/posts/MLAJFVyEb2f">A
hierarchy of SRE needs</a></li>
<li><a
href="https://www.usenix.org/conference/lisa13/technical-sessions/plenary/underwood">PostOps:
A Non-Surgical Tale of Software, Fragility, and Reliability</a></li>
<li><a
href="https://web.archive.org/web/20180820235243/http://anthonycaiafa.com/2016/04/10/sre-cultural-narnia/">SRE:
An incomplete guide to cultural Narnia</a> - <a
href="https://www.youtube.com/watch?v=__wypEhdcrQ&t=0s">[Video]</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/krishnan">Putting
Together Great SRE Teams</a></li>
<li><a href="https://www.youtube.com/watch?v=bwt6TZjefGM">Work at
Google: Meet our Production Engineers for Site Reliability Hangout on
Air</a></li>
<li><a
href="https://sharpend.io/toil-a-word-every-engineer-should-know/">Toil:
A Word Every Engineer Should Know</a></li>
<li><a href="https://research.google.com/pubs/pub32583.html">Engineering
Reliability into Web Sites: Google SRE</a></li>
<li><a href="https://vimeo.com/179914447">DEVOPS & SRE AMA -
Building High Performance Organizations</a></li>
<li><a
href="https://community.atlassian.com/t5/Jira-Ops-questions/I-m-John-Allspaw-Ask-Me-Anything-about-incident-analysis-and/qaq-p/957084">John
Allspaw’s AMA on Incident Analysis and Postmortems</a></li>
<li>Site Reliability Engineering with Paul Newson - <a
href="https://www.gcppodcast.com/post/episode-38-site-reliability-engineering-with-paul-newson/">Part
1</a> & <a
href="https://gcppodcast.com/post/episode-59-sre-ii-with-paul-newson/">Part
2</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2891413">How SysAdmins
Devalue Themselves</a></li>
<li><a href="https://www.youtube.com/watch?v=ry51Llzil1I">The Softer
Side of DevOps</a></li>
<li><a
href="https://medium.com/@kobolog/sre-noun-see-also-confidence-trust-e7e33e19efc1">SRE,
noun. See also: confidence, trust.</a></li>
<li><a href="https://youtu.be/24xb7oZgu-I?t=29m24s">Site Reliability
Engineering with Stephen Weinberg</a></li>
<li><a
href="https://www.reddit.com/r/IAmA/comments/177267/we_are_the_google_site_reliability_team_we_make">We
are the Google Site Reliability team. We make Google’s websites work.
Ask us Anything!</a></li>
<li><a
href="https://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_google_site_reliability_engineering/">We
are the Google Site Reliability Engineering team. Ask us
Anything!</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/10/13/the-ops-identity-crisis">The
Ops Identity Crisis</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/11/2/the-irreproducibility-of-bugs-in-large-scale-production-systems">The
Irreproducibility Of Bugs In Large-Scale Production Systems</a></li>
<li><a
href="http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-rabenstein-on-site-reliability-engineering/">SE-Radio
Episode 276: Björn Rabenstein on Site Reliability Engineering</a></li>
<li><a
href="https://blog.netsil.com/microservices-devops-and-operational-complexity-be98cb01b660">Microservices,
DevOps and Production Complexity</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/10/introducing-a-new-era-of-customer-support-Google-Customer-Reliability-Engineering.html">Introducing
Google Customer Reliability Engineering</a></li>
<li><a
href="https://robhirschfeld.com/2016/12/29/evolution-or-rebellion-the-rise-of-site-reliability-engineers-sre/">Evolution
or Rebellion? The rise of Site Reliability Engineers (SRE)</a></li>
<li><a
href="https://standalone-sysadmin.com/the-difference-between-site-reliability-engineering-system-administration-and-devops-d05031495499">The
difference between Site Reliability Engineering, System Administration,
and DevOps</a></li>
<li><a
href="https://www.usenix.org/conference/lisa16/conference-program/presentation/closing-plenary">SRE
in the Small and in the Large</a></li>
<li><a href="https://www.youtube.com/watch?v=zLXf0cKDOv0">SBSRE Meetup:
Different SRE roles and challenges (Netflix)</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/definition-of-sre-panel">Panel:
Who/What Is SRE?</a></li>
<li><a
href="https://medium.com/@jerub/hope-is-not-a-strategy-6a7d0a3b1c08">Hope
Is Not a Strategy</a></li>
<li><a
href="https://medium.com/@jerub/tenets-of-sre-8af6238ae8a8">Tenets of
SRE</a></li>
<li><a
href="https://medium.com/@venkatachalamrangasamy/site-reliability-engineering-demystified-ed676e0a7d56">Site
Reliability Engineering Demystified</a></li>
<li><a
href="https://devops.com/site-reliability-engineering-sre-true-ops-devops/">Is
Site Reliability Engineering the True ‘Ops’ in DevOps?</a></li>
<li><a
href="https://devops.com/sre-devops-cloud-native-server-cage-match/">SRE
vs. DevOps vs. Cloud Native: The Server Cage Match</a></li>
<li><a href="https://youtu.be/8dfYLRAWn_c">SRE: What’s The Big
Idea?</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin">Building
the SRE Culture at LinkedIn</a></li>
<li><a
href="https://stackoverflow.blog/2017/06/12/podcast-111-sre-occasionally-maintaining-infrastructure-hate/">Podcast
#111 – SRE: Occasionally Maintaining Infrastructure That You
Hate</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16europe/program/presentation/splicing-sre-dna-sequences-biggest-software-company">Splicing
SRE DNA Sequences in the Biggest Software Company on the Planet</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/06/why-should-your-app-get-SRE-support-CRE-life-lessons.html">Why
should your app get SRE support? - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/06/how-SREs-find-the-landmines-in-a-service-CRE-life-lessons.html">How
SREs find the landmines in a service - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/07/making-the-most-of-an-SRE-service-takeover-CRE-life-lessons.html">Making
the most of an SRE service takeover - CRE life lessons</a></li>
<li><a
href="https://dzone.com/articles/the-cloudcast-301-sre-and-infrastructure-operation">The
Cloudcast #301: SRE and Infrastructure Operations (Podcast)</a></li>
<li><a href="https://medium.com/@rakyll/the-sre-model-6e19376ef986">The
SRE model</a></li>
<li><a
href="https://circleci.com/blog/onboarding-new-site-reliability-engineers/">Onboarding
New Site Reliability Engineers</a></li>
<li><a href="https://www.youtube.com/watch?v=nQv9ySa8MTU">Building
Blocks for Site Reliability At Google</a></li>
<li><a
href="https://blog.netsil.com/beyond-google-sre-what-is-site-reliability-engineering-like-at-medium-71c65bd35f4e">Beyond
Google SRE: What is Site Reliability Engineering like at
Medium?</a></li>
<li><a
href="http://blog.adnanmasood.com/2016/05/19/intelligent-site-reliability-engineering-a-machine-learning-perspective/">Intelligent
Site Reliability Engineering – A Machine Learning Perspective</a></li>
<li><a
href="https://engineering.linkedin.com/day-life/crash-course-linkedins-global-site-operations">A
crash course in LinkedIn’s global site operations</a></li>
<li><a
href="https://softwareengineeringdaily.com/2016/06/14/googles-site-reliability-engineering-todd-underwood/">Google’s
Site Reliability Engineering with Todd Underwood</a></li>
<li><a
href="https://blogs.vmware.com/services-education-insights/2018/02/site-reliability-engineering.html">What
is Site Reliability Engineering? (VMware)</a></li>
<li><a href="http://geekologist.co/introduction-to-sre/">A Gentle
Introduction to SRE</a></li>
<li><a
href="http://engineering.medallia.com/blog/posts/understanding-site-reliability-engineering-through-movies-and-books/">Understanding
Site Reliability Engineering through Movies and Books</a></li>
<li><a href="https://www.youtube.com/watch?v=Cxb7a8lTv8A">GOTO 2017 •
Site Reliability Engineering at Google • Christof Leng</a></li>
<li>The Makeup of Successful Geographically-Distributed SRE Teams - <a
href="https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p">Part
1</a> & <a
href="https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0">Part
2</a></li>
<li><a href="https://www.youtube.com/watch?v=6G2V1xPIM64">Tech
Leadership in SRE</a></li>
<li><a
href="http://azpodcast.azurewebsites.net/post/Episode-227-Azure-SRE1">The
Azure Podcast: Episode 227 - Azure SRE</a></li>
<li><a
href="https://medium.com/@mattklein123/the-human-scalability-of-devops-e36c37d3db6a">The
human scalability of “DevOps”</a></li>
<li><a
href="https://softwareengineeringdaily.com/2018/04/09/site-reliability-management-with-mike-hiraga/">Podcast:
Site Reliability Management with Mike Hiraga</a></li>
<li><a
href="https://medium.com/@Knowlarity_Engineering/how-a-cat-inspired-system-reliability-at-knowlarity-ad73c24f29a7">How
a cat inspired system reliability at Knowlarity</a></li>
<li><a
href="https://github.com/devopsenterprise/2018-London/blob/master/Tuesday/Breakout%20Sessions/Throne%2C%20Stephen%2C%20Getting%20Started%20with%20Site%20Reliability%20Engineering.pdf">Getting
Started with Site Reliability Engineering</a></li>
<li><a href="https://www.youtube.com/watch?v=xWAfTAu0Mww">“Practical
Applications of the Dickerson Pyramid” by Nat Welch</a></li>
<li><a
href="https://blameless.com/blog/sre-implementations-blindspots/">LinkedIn’s
Kurt Andersen Uncovers Blindspots in SRE Implementations</a></li>
<li><a
href="https://driftboatdave.com/2018/10/09/interview-with-betsy-beyer-stephen-thorne-of-google/">Interview
with Betsy Beyer, Stephen Thorne of Google</a></li>
<li><a href="https://www.youtube.com/watch?v=0zqBlRW_6jA">Less Risk
Through Greater Humanity - Dave Rensin</a></li>
<li><a href="https://www.youtube.com/watch?v=c-w_GYvi0eA">Getting
Started with SRE - Stephen Thorne, Google</a></li>
<li><a
href="https://drive.google.com/file/d/1FXwHm6mpmRA9NaIJEu4cB1s6ffbyGBfl/view">Building
Successful SRE in Large Enterprises</a></li>
<li><a href="https://www.youtube.com/watch?v=ZcZtU_TiFEM">Solving
Reliability Fears with Site Reliability Engineering</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends">SRE
vs. DevOps: competing standards or close friends?</a></li>
<li><a
href="https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/">How
to Avoid the 5 SRE Implementation Traps that Catch Even the Best
Teams</a></li>
<li><a href="https://vimeo.com/344515149">Reliability Engineering – The
Essential Discipline for Complex Systems</a></li>
<li><a href="https://www.youtube.com/watch?v=bC5dIPzNH24">The Modern
Site Reliability Workbench on Top of OCI</a></li>
<li><a
href="https://www.usenix.org/conference/srecon19emea/presentation/rabenstein">SRE
in the Third Age</a></li>
<li><a href="https://www.youtube.com/watch?v=vF6ajM3P_wM">About SRE and
how (not) to apply it</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/transitioning-a-typical-engineering-ops-team-into-an-sre-powerhouse">Transitioning
a typical engineering ops team into an SRE powerhouse</a></li>
<li><a
href="https://www.infoq.com/presentations/ing-sre-teams-practices/">Making
a Lion Bulletproof: SRE in Banking</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles">Identifying
and tracking toil using SRE principles</a></li>
<li><a
href="https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team">From
Ops to SRE: Evolution of the OpenShift Dedicated Team</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles">Meeting
reliability challenges with SRE principles</a></li>
<li><a href="https://github.com/fhivemind/sre-playground">A quick
introduction to SRE principles</a></li>
<li><a href="https://www.youtube.com/watch?v=KnC2eRUZMKY">The SRE I
Aspire to Be</a></li>
<li><a
href="https://tanzu.vmware.com/content/blog/taming-operational-load-vmware-cre">Taming
Operational Load with VMware CRE</a></li>
<li><a
href="https://dubrie.medium.com/sre-cultural-values-a0073b475183">SRE
Cultural Values</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum">Are
we there yet? Thoughts on assessing an SRE team’s maturity</a></li>
<li><a
href="https://www.linkedin.com/pulse/what-sres-have-do-project-based-services-rod-anami/">What
SREs have to do with project-based services?</a></li>
<li><a href="https://github.com/readme/guides/ops-work-visible">Making
operational work more visible</a></li>
<li><a href="https://spacelift.io/blog/sre-vs-devops">SRE vs. DevOps:
What’s the Difference Between Them?</a></li>
</ul>
<h2 id="education">Education</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/sebenik">Panel:
Educating SRE</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/widdowson">From
Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE
Teams</a></li>
<li><a
href="https://www.linkedin.com/pulse/new-sre-team-anthony-caiafa/">New
to an SRE team?</a></li>
<li><a
href="https://www.usenix.org/publications/login/june15/hixson">The
Systems Engineering Side of Site Reliability Engineering</a></li>
<li><a
href="https://medium.com/@tammybutow/graduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b">Graduating
from Bootcamp and interested in becoming a Site Reliability
Engineer?</a></li>
<li><a
href="https://www.loomsystems.com/single-post/2016/03/23/So-you-want-to-be-a-Site-Reliability-Engineer">So
you want to be a Site Reliability Engineer?</a></li>
<li><a
href="https://www.loomsystems.com/blog/2017/02/06/spiraling-ops-debt-the-sre-coding-imperative">Spiraling
Ops Debt & the SRE Coding Imperative</a></li>
<li><a
href="https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c">So
you want to be an SRE?</a></li>
<li><a
href="https://www.khanacademy.org/college-careers-more/career-content/career-profile-videos/site-reliability-engineer/v/ruth-grace-site-reliability-engineer-what-i-do-and-how-much-i-make">Career
Profiles/Site Reliability Engineer</a></li>
<li><a
href="https://cloudacademy.com/blog/what-is-the-role-of-a-site-reliability-engineer/">What
is the role of a Site Reliability Engineer?</a></li>
<li><a
href="https://www.lynda.com/Software-Development-tutorials/DevOps-Foundations-Site-Reliability-Engineering/669542-2.html">Lynda.com:
DevOps Foundations: Site Reliability Engineering</a></li>
<li><a href="https://dastergon.gr/wheel-of-misfortune/">Incident
Management Training: Wheel of Misfortune</a></li>
<li><a href="https://www.youtube.com/watch?v=rmY8_PHanuI">Site
Un-Reliability Engineering [Video Series]</a></li>
<li><a
href="https://medium.com/swlh/the-ultimate-guide-to-structuring-a-90-day-onboarding-plan-c91af947376">The
Ultimate Guide to Structuring a 90-Day Onboarding Plan</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos">SRE
fundamentals: SLIs, SLAs and SLOs</a></li>
<li><a href="https://blog.alicegoldfuss.com/how-to-get-into-sre/">How to
Get Into SRE</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey">Do
you have an SRE team yet? How to start and assess your journey</a></li>
<li><a
href="https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started">How
SRE teams are organized, and how to get started</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3283589">Why SRE
Documents Matter</a></li>
<li><a
href="https://www.oreilly.com/ideas/how-to-get-started-with-site-reliability-engineering-sre">How
to get started with site reliability engineering (SRE)</a></li>
<li><a
href="https://victorops.com/blog/duties-of-a-site-reliability-engineering-manager">Duties
of a Site Reliability Engineering Manager</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-principles-and-flashcards-to-design-nalsd">Designing
distributed systems using NALSD flashcards</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/training-site-reliability-engineers">Training
Site Reliability Engineers: What Your Organization Needs to Create a
Learning Program</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/sre-classroom/">SRE
Classroom: Distributed PubSub workshop</a></li>
<li><a href="https://linkedin.github.io/school-of-sre/">School of SRE:
Curriculum for onboarding non-traditional hires and new grads</a></li>
</ul>
<h2 id="books">Books</h2>
<ul>
<li><a
href="https://link.springer.com/book/10.1007/978-1-4842-0511-2">Practical
Linux Infrastructure</a></li>
<li><a href="https://landing.google.com/sre/book.html">Site Reliability
Engineering: How Google Runs Production Systems</a></li>
<li><a href="https://landing.google.com/sre/book.html">The Site
Reliability Workbook: Practical Ways to Implement SRE</a></li>
<li><a
href="https://info.honeycomb.io/observability-engineering-oreilly-book-2022">Observability
Engineering: Achieving Production Excellence</a></li>
<li><a href="http://the-cloud-book.com/">The Practice Of Cloud System
Administration: Designing and Operating Large Distributed
Systems</a></li>
<li><a href="http://shop.oreilly.com/product/0636920000136.do">Web
Operations - Keeping the Data On Time</a></li>
<li><a href="http://atulgawande.com/book/the-checklist-manifesto/">The
Checklist Manifesto: How to Get Things Right</a></li>
<li><a
href="http://www.oreilly.com/programming/free/microservices-in-production.csp">Microservices
in Production - Standard Principles and Requirements</a></li>
<li><a
href="http://shop.oreilly.com/product/0636920053675.do">Production-Ready
Microservices - Building Standardized Systems Across an Engineering
Organization</a></li>
<li><a
href="https://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098/">Systems
Performance: Enterprise and the Cloud</a> [Sample chapter titled <a
href="http://ptgmedia.pearsoncmg.com/images/9780133390094/samplepages/0133390098.pdf">CPUs</a>]</li>
<li><a
href="http://www.oreilly.com/webops-perf/free/monitoring-distributed-systems.csp">Monitoring
Distributed Systems: Case Studies from Google’s SRE Teams</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp">The
Human Side of Postmortems: Managing Stress and Cognitive Biases</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/chaos-engineering.csp">Chaos
Engineering: Building Confidence in System Behavior through
Experiment</a></li>
<li><a
href="https://victorops.com/oreilly-post-incident-review/">Post-Incident
Reviews: Learning from Failure for Improved Incident Responses</a></li>
<li><a
href="http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp">Antifragile
Systems and Teams</a></li>
<li><a
href="https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook/">How
to Monitoring the SRE Golden Signals (E-Book)</a></li>
<li><a href="http://shop.oreilly.com/product/0636920036159.do">Incident
Management for Operations</a></li>
<li><a
href="https://www.packtpub.com/web-development/real-world-sre">Real-World
SRE</a></li>
<li><a href="http://shop.oreilly.com/product/0636920063964.do">Seeking
SRE</a></li>
<li><a
href="https://www.verizondigitalmedia.com/e-book/oreilly-what-is-sre/">What
is SRE?</a></li>
<li><a
href="https://landing.google.com/sre/resources/practicesandprocesses/engineering-reliable-mobile-applications/">Engineering
Reliable Mobile Applications: Strategies for Developing Resilient Native
Mobile Applications</a></li>
<li><a href="https://landing.google.com/sre/book.html">Building Secure
and Reliable Systems</a></li>
<li><a href="https://www.manning.com/books/chaos-engineering/">Chaos
Engineering: Crash test your applications</a></li>
<li><a
href="https://www.oreilly.com/library/view/97-things-every/9781492081487/">97
Things Every SRE Should Know</a></li>
<li><a
href="https://shopify.engineering/four-steps-creating-effective-game-day-tests">Four
Steps to Creating Effective Game Day Tests</a></li>
<li><a href="https://nostarch.com/tlpi">The Linux Programming
Interface</a></li>
</ul>
<h2 id="hiring">Hiring</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/fong">SRE
Hiring</a></li>
<li><a
href="https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin">Hiring
SREs at LinkedIn</a></li>
<li><a
href="https://www.usenix.org/publications/login/june15/hiring-site-reliability-engineers">Hiring
Site Reliability Engineers</a></li>
<li><a
href="https://sreally.com/hiring-your-first-sre-bdda38ee175d#.2m3sqyuw9">Hiring
your first SRE</a></li>
<li><a href="https://www.youtube.com/watch?v=ZemNg9GYvOA">Growing the
Site Reliability Team at LinkedIn: Hiring is Hard</a></li>
<li><a href="https://danrl.com/blog/srm">Engineering Manager - Site
Reliability Engineering Interview Preparation</a></li>
</ul>
<h2 id="reliability">Reliability</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/kroll">The
Realities of the Job of Delivering Reliability</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2839461">Fail at Scale
by Ben Maurer</a></li>
<li><a href="https://www.youtube.com/watch?v=wrY7XoOnysg">Embracing
Failure: Fault-Injection and Service Reliability</a></li>
<li><a
href="https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan">10
Years of Crashing Google</a></li>
<li><a
href="https://blog.twitter.com/2015/how-we-break-things-at-twitter-failure-testing">How
we break things at Twitter: failure testing</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2745840">Reliable Cron
across the Planet</a></li>
<li><a
href="https://blog.twitter.com/2014/push-our-limits-reliability-testing-at-twitter">Push
our limits - reliability testing at Twitter</a></li>
<li><a href="http://queue.acm.org/detail.cfm?ref=rss&id=2889274">The
Verification of a Distributed System by Caitie McCaffrey</a></li>
<li><a href="http://queue.acm.org/detail.cfm?id=2371516">Weathering the
Unexpected</a></li>
<li><a href="https://www.youtube.com/watch?v=YFDwdRVTg4g">SRE Hour: Tech
Talks by Box & Yelp</a></li>
<li><a
href="https://sharpend.io/simplicity-a-prerequisite-for-reliability/">Simplicity:
A Prerequisite for Reliability</a></li>
<li><a
href="https://speakerdeck.com/garethr/the-two-sides-to-google-infrastructure-for-everyone-else">The
Two Sides to Google Infrastructure for Everyone Else</a></li>
<li><a
href="https://www.usenix.org/conference/ures14west/summit-program/presentation/dickson">How
Embracing Continuous Release Reduced Change Complexity</a></li>
<li><a
|
||
href="https://www.usenix.org/publications/login/october-2014-vol-39-no-5/making-push-green-reality">Making
|
||
“Push On Green” a Reality</a></li>
|
||
<li><a
|
||
href="https://www.usenix.org/publications/login/dec14/ward">BeyondCorp:
|
||
A New Approach to Enterprise Security</a></li>
|
||
<li><a href="https://www.youtube.com/watch?v=dKe9S8u44Yk">Brainstorming
|
||
Failure by Jeff Smith</a></li>
|
||
<li><a href="http://cloudtweaks.com/2016/04/outages-and-downtime/">The
|
||
Ripple Effect Of Outages And Downtime Cannot Be Underestimated</a></li>
|
||
<li><a
|
||
href="https://blog.twitter.com/2016/the-infrastructure-behind-twitter-efficiency-and-optimization">The
|
||
infrastructure behind Twitter: efficiency and optimization</a></li>
|
||
<li><a
|
||
href="https://docs.google.com/drawings/d/1kshrK2RLkW-XV8enmWZxeRFRgADj6d4Ru_w5txz_k9I/edit">Dickerson’s
|
||
Hierarchy of Reliability</a></li>
|
||
<li><a
|
||
href="https://blog.acolyer.org/2016/09/21/the-morning-paper-on-operability/">The
|
||
Morning Paper on Operability</a></li>
|
||
<li><a
|
||
href="http://naildrivin5.com/blog/2013/06/16/production-is-all-that-matters.html">Production
|
||
is all that matters</a></li>
|
||
<li><a
|
||
href="https://cloudplatform.googleblog.com/2016/12/using-load-shedding-to-survive-a-success-disaster-CRE-life-lessons.html">Using
|
||
load shedding to survive a success disaster - CRE life lessons</a></li>
|
||
<li><a
|
||
href="https://cloudplatform.googleblog.com/2016/11/how-to-avoid-a-self-inflicted-DDoS-Attack-CRE-life-lessons.html">How
|
||
to avoid a self-inflicted DDoS Attack - CRE life lessons</a></li>
|
||
<li><a
|
||
href="https://www.oreilly.com/ideas/dont-gamble-when-it-comes-to-reliability">Don’t
|
||
gamble when it comes to reliability</a></li>
|
||
<li><a href="https://queue.acm.org/detail.cfm?id=2371297">Resilience
|
||
Engineering: Learning to Embrace Failure</a></li>
<li><a
href="https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale">The
Infrastructure Behind Twitter: Scale</a></li>
<li><a href="https://www.youtube.com/watch?v=hYu13kBenjE">Scaling
Reliability at Twitter: So You Want to Add a 9</a></li>
<li><a href="http://principlesofchaos.org/">Principles Of Chaos
Engineering</a></li>
<li><a href="https://www.infoq.com/articles/chaos-engineering">Chaos
Engineering</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/01/available-or-not-that-is-the-question-CRE-life-lessons.html">Available…or
not? That is the question - CRE life lessons</a></li>
<li><a
href="http://highscalability.com/blog/2014/2/3/how-google-backs-up-the-internet-along-with-exabytes-of-othe.html">How
Google Backs Up The Internet Along With Exabytes Of Other Data</a></li>
<li><a
href="http://highscalability.com/blog/2017/2/2/performance-scalability-and-high-availability-3-key-infrastr.html">Performance,
Scalability, And High Availability: 3 Key Infrastructure Adaptability
Requirements</a></li>
<li>The Production Environment at Google - <a
href="https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767">Part
1</a> & <a
href="https://medium.com/@jerub/the-production-environment-at-google-part-2-610884268aaa">Part
2</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html">Reliable
releases and rollbacks - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/03/how-release-canaries-can-save-your-bacon-CRE-life-lessons.html">How
release canaries can save your bacon - CRE life lessons</a></li>
<li><a
href="https://zwischenzugs.wordpress.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/">Things
I Learned Managing Site Reliability for Some of the World’s Busiest
Gambling Sites</a></li>
<li><a
href="https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason">Every
Day Is Monday in Operations</a></li>
<li><a
href="https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability">Under
the Hood: Ensuring Site Reliability</a></li>
<li><a href="https://www.youtube.com/watch?v=7Hy_6SMn8pY">Designing
reliable systems with cloud infrastructure (Google Cloud Next
’17)</a></li>
<li><a
href="https://cloud.google.com/blog/big-data/2016/10/a-google-sre-explores-github-reliability-with-bigquery">A
Google SRE explores GitHub reliability with BigQuery</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/05/know-thy-enemy-how-to-prioritize-and-communicate-risks-CRE-life-lessons.html">Know
thy enemy: how to prioritize and communicate risks - CRE life
lessons</a></li>
<li><a
href="https://github.com/dastergon/awesome-chaos-engineering">Chaos
Engineering resources</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/08/CRE-life-lessons-what-is-a-dark-launch-and-what-does-it-do-for-me.html">CRE
life lessons: What is a dark launch, and what does it do for
me?</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/why-you-should-pick-strong-consistency-whenever-possible.html">Why
you should pick strong consistency, whenever possible</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=2655736">The Network is
Reliable</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3028689">Are You Load
Balancing Wrong?</a></li>
<li><a
href="https://code.facebook.com/posts/166966743929963/how-production-engineers-support-global-events-on-facebook/">How
production engineers support global events on Facebook</a></li>
<li><a
href="http://highscalability.com/blog/2018/4/16/google-a-collection-of-best-practices-for-production-service.html">Google:
A Collection Of Best Practices For Production Services</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3194655">Canary
Analysis Service</a></li>
<li><a
href="https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c">Tips
for High Availability</a></li>
<li><a
href="https://auth0.com/blog/progressive-service-architecture-at-auth0/">Progressive
Service Architecture At Auth0</a></li>
<li><a
href="https://medium.com/google-cloud/production-guideline-9d5d10c8f1e">Google
Cloud Production Guideline</a></li>
<li><a href="https://jbd.dev/prod-readiness/">production
readiness</a></li>
<li><a href="https://www.youtube.com/watch?v=Vvd3uvNvMns">Trust By
Design: The Fusion of Operational Maturity and Risk Modeling</a></li>
<li><a
href="https://www.verica.io/top-seven-myths-of-robust-systems/">Top
Seven Myths of Robust Systems</a></li>
<li><a
href="https://www.oreilly.com/ideas/taming-chaos-preparing-for-your-next-incident">Taming
chaos: Preparing for your next incident</a></li>
<li><a href="https://www.youtube.com/watch?v=3AxSwCC7I4s">PID Loops and
the Art of Keeping Systems Stable</a></li>
<li><a href="https://www.youtube.com/watch?v=YptJ2rrGAYY">Are you ready
for production?</a> - <a
href="https://speakerdeck.com/rakyll/are-you-ready-for-production">Slides</a></li>
<li><a
href="https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html">Production
Checklist for Web Apps on Kubernetes</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-keeps-digging-to-prevent-problems">Finding
a problem at the bottom of the Google stack</a></li>
<li><a
href="https://www.oreilly.com/content/rethinking-task-size-in-sre/">Rethinking
Task Size in SRE</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows">How
maintenance windows affect your error budget</a></li>
<li><a
href="https://dastergon.gr/posts/2020/09/the-production-readiness-spectrum/">The
Production Readiness Spectrum</a></li>
<li><a
href="https://www.oreilly.com/content/generic-mitigations/">Generic
mitigations</a></li>
<li><a
href="https://grafana.com/blog/2021/10/13/how-were-building-a-production-readiness-review-process-at-grafana-labs/">How
we’re building a production readiness review process at Grafana
Labs</a></li>
<li><a
href="https://shopify.engineering/resiliency-planning-for-high-traffic-events">Resiliency
Planning for High-Traffic Events</a></li>
<li><a
href="https://doordash.engineering/2022/04/25/using-fault-injection-testing-to-improve-doordash-reliability/">Using
Fault Injection Testing to Improve DoorDash Reliability</a></li>
</ul>
<h2 id="monitoring-observability-alerting">Monitoring &
Observability & Alerting</h2>
<ul>
<li><a
href="https://www.usenix.org/conference/lisa13/working-theory-monitoring">A
Working Theory-of-Monitoring</a></li>
<li><a href="https://vimeo.com/131484321">The Evolution of Monitoring
Systems at Google - Tony Rippy</a></li>
<li><a
href="https://www.usenix.org/conference/srecon15/program/presentation/serebryany">Monitoring
without Infrastructure @ Airbnb</a></li>
<li><a
href="https://www.oreilly.com/ideas/monitoring-distributed-systems">Monitoring
distributed systems</a></li>
<li><a href="https://www.youtube.com/watch?v=2JAnmzVwgP8">Observability
at Uber Engineering: Past, Present, Future</a></li>
<li><a
href="https://blog.netsil.com/the-4-golden-signals-of-api-health-and-performance-in-cloud-native-applications-a6e87526e74">The
4 Golden Signals of API Health and Performance in Cloud-Native
Applications</a></li>
<li><a
href="https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview#">My
Philosophy on Alerting by Rob Ewaschuk</a></li>
<li><a href="https://www.youtube.com/watch?v=wsgpV67MLFo">Time To Detect
- Netflix</a></li>
<li><a
href="https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think">Why
Percentiles Don’t Work the Way you Think</a></li>
<li><a href="https://www.youtube.com/watch?v=jQggG0qIjTM">Building
Twitter’s Next-Gen Alerting System</a></li>
<li><a
href="https://honeycomb.io/blog/2017/01/instrumentation-worst-case-performance-matters/">Instrumentation:
Worst case performance matters</a></li>
<li><a
href="https://honeycomb.io/blog/2017/01/instrumentation-what-does-uptime-mean/">Instrumentation:
What does ‘uptime’ mean?</a></li>
<li><a
href="https://circleci.com/blog/incidents-outages-at-circleci-our-playbook-and-what-we-ve-learned/">Incidents
+ Outages at CircleCI: Our Playbook and What We’ve Learned</a></li>
<li><a href="https://www.youtube.com/watch?v=gNmWzkGViAY">An
introduction to monitoring and alerting with timeseries at scale, with
Prometheus</a></li>
<li><a href="https://www.youtube.com/watch?v=mG4ZpEhRKHA">Detecting
outliers and anomalies in realtime at Datadog</a></li>
<li><a
href="https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524">How
to Monitor the SRE Golden Signals</a></li>
<li><a href="https://queue.acm.org/detail.cfm?id=3178371">Monitoring in
a DevOps World</a></li>
<li><a
href="https://medium.com/@jerub/monitoring-your-monitorings-monitoring-51d479100f4c">Monitoring
Your Monitoring’s Monitoring</a></li>
<li><a
href="https://medium.com/@dlite/observability-the-new-wave-or-buzzword-fc23a68abf72">Observability:
the new wave or buzzword?</a></li>
<li><a
href="https://www.vividcortex.com/blog/monitoring-isnt-observability">Monitoring
Isn’t Observability</a></li>
<li><a
href="https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e">Monitoring
in the time of Cloud Native</a></li>
<li><a href="https://www.youtube.com/watch?v=2LNHv0JyBUk">Principles of
Monitoring Microservices</a></li>
<li><a href="https://www.usenix.org/node/197446">The Many Ways Your
Monitoring Is Lying to You</a></li>
<li><a
href="https://www.weave.works/blog/gitops-part-3-observability">GitOps
Part 3 - Observability</a></li>
<li><a
href="https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7">Want
to Debug Latency?</a></li>
<li><a
href="https://medium.com/observability/debugging-latency-in-go-1-11-9f97a7910d68">Debugging
Latency in Go 1.11</a></li>
<li><a
href="https://developers.soundcloud.com/blog/alerting-on-slos">Alerting
on SLOs like Pros</a></li>
<li><a href="https://www.youtube.com/watch?v=JhxfZ0VIPP0">Applied
Alerting Philosophy</a></li>
<li><a
href="https://blog.colinbreck.com/observations-on-observability/">Observations
on Observability</a></li>
<li><a
href="https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/">Deploys:
It’s Not Actually About Fridays</a></li>
<li><a
href="https://medium.com/better-programming/site-reliability-engineering-best-practices-for-data-pipelines-44a78e91f6f0">Site
Reliability Engineering Best Practices for Data Pipelines</a></li>
<li><a
href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">Elastic
Observability in SRE and Incident Response</a></li>
<li><a
href="https://medium.com/expedia-group-tech/error-budget-policy-adoption-at-expedia-group-7d80d41c4a8b">Error
Budget Policy - Part 1 - Adoption at Expedia Group</a></li>
<li><a
href="https://medium.com/expedia-group-tech/error-budget-policies-in-practice-4c98f56a28c1">Error
Budget Policy - Part 2 - Practices at Expedia Group</a></li>
</ul>
<h2 id="on-call">On-Call</h2>
<ul>
<li><a href="http://research.google.com/pubs/pub44813.html">Being an
On-Call Engineer: A Google SRE Perspective</a></li>
<li><a
href="https://www.atlassian.com/blog/it-teams/inside-atlassian-site-reliability-engineers-incident-management">Inside
Atlassian: how our site reliability engineers do incident
management</a></li>
<li><a
href="https://www.atlassian.com/blog/2016/02/inside-atlassian-sre-use-chatops-run-incident-management">Inside
Atlassian: how IT & SRE use ChatOps to run incident
management</a></li>
<li><a
href="https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku">Incident
Response at Heroku</a></li>
<li><a
href="http://www.susanjfowler.com/blog/2016/9/6/whos-on-call">Who’s On
Call?</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-6-no-more-on-call-martyrs.html">SysAdvent
- Day 6 - No More On-Call Martyrs</a></li>
<li><a href="http://naildrivin5.com/blog/2016/12/07/on-call.html">On
Being On Call</a></li>
<li><a href="https://github.com/alicegoldfuss/oncall-handbook">The
On-Call Handbook</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/02/Incident-management-at-Google-adventures-in-SRE-land.html">Incident
management at Google — adventures in SRE-land</a></li>
<li><a href="https://github.com/SkeltonThatcher/run-book-template">Run
Book / Operations Manual template</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch">Automating
Your Oncall: Open Sourcing Fossor and Ascii Etch</a></li>
<li><a
href="https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process">Project
STAR*: Streamlining Our On-Call Process</a></li>
<li><a
href="https://devblog.xero.com/sre-xero-managing-incidents-part-i-7d02d650a71c">SRE@Xero:
Managing Incidents Part I</a></li>
<li><a
href="https://devblog.xero.com/sre-xero-managing-incidents-part-ii-224a6e06f426">SRE@Xero:
Managing Incidents Part II</a></li>
<li><a
href="https://www.gremlin.com/how-to-establish-a-high-severity-incident-management-program/">How
To Establish a High Severity Incident Management Program</a></li>
<li><a href="https://www.youtube.com/watch?v=xA5U85LSk0M">How Your
Systems Keep Running Day After Day - John Allspaw</a></li>
<li><a
href="https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0">On-call
doesn’t have to suck</a></li>
<li><a
href="https://medium.com/@awspyker/why-as-a-netflix-infrastructure-manager-am-i-on-call-bdc551ac01fe">Why,
as a Netflix infrastructure manager, am I on call?</a></li>
<li><a
href="https://honeycomb.io/blog/2018/02/oncall-and-sustainable-software-development/">Oncall
and Sustainable Software Development</a></li>
<li><a
href="https://thenewstack.io/call-rotations-best-wake-devs-middle-night/">On
Call Rotations: How Best to Wake Devs Up in the Middle of the
Night</a></li>
<li><a
href="https://www.gremlin.com/community/tutorials/understanding-the-role-of-the-incident-manager-on-call-imoc/">Understanding
The Role Of The Incident Manager On-Call (IMOC)</a></li>
<li><a
href="https://devops.com/three-ways-to-minimize-the-impact-of-high-severity-incidents/">3
Ways to Minimize the Impact of High Severity Incidents</a></li>
<li><a
href="https://thenewstack.io/advice-management-teams-enrolling-changes-on-call-systems/">Advice
to Management Teams While Enrolling Changes to On-Call Systems</a></li>
<li><a
href="http://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/">Moving
Past Shallow Incident Data</a></li>
<li><a
href="https://codywilbourn.com/2018/03/22/sustainable-on-call/">Sustainable
On-Call</a></li>
<li><a href="https://youtu.be/8pPrtf1J1Z8">dotScale 2017 - Aish Raj
Dahal - Chaos management during a major incident</a></li>
<li><a
href="https://www.infoq.com/presentations/netflix-incident-management">Incident
Management at Netflix Velocity</a></li>
<li><a
href="https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3">Incidents,
fixes, and the day after</a></li>
<li><a
href="https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c">10
Steps to Develop an Incident Response Plan You’ll ACTUALLY Use</a></li>
<li><a
href="https://tech.buzzfeed.com/checklists-an-operational-gift-aaf42cf0be12">Checklists:
a stupidly simple but valuable operational gift</a></li>
<li><a
href="https://blog.hostedgraphite.com/2018/09/13/how-to-write-a-status-page-update/">How
to write a status page update</a></li>
<li><a
href="https://www.atlassian.com/software/jira/ops/handbook">Atlassian
Incident Handbook</a></li>
<li><a href="https://response.pagerduty.com/">PagerDuty Incident
Response Handbook</a></li>
<li><a
href="https://blog.zenduty.com/blog/2019/05/02/Avoiding-SRE-Burnout">Avoiding
Burnout for SREs</a></li>
<li><a href="https://vimeo.com/344516642">Better On-Call the SRE
way</a></li>
<li><a href="https://www.youtube.com/watch?v=ZqwVlsIonIw">Managing
Incidents at Monzo</a></li>
<li><a
href="https://dev.to/molly_struve/making-on-call-not-suck-490">Making
On-Call Not Suck</a></li>
<li><a
href="https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents">How
we (Monzo) respond to incidents</a></li>
<li><a
href="https://monzo.com/blog/how-weve-evolved-on-call-at-monzo">How
we’ve evolved on-call at Monzo</a></li>
<li><a
href="https://devops.com/code-yellow-when-operations-isnt-perfect/">Code
Yellow: When Operations Isn’t Perfect</a></li>
<li><a
href="https://opensource.com/article/19/7/measure-operational-performance">MTTR
is dead, long live CIRT</a></li>
<li><a href="https://github.com/preed/incident-lifecycle-model">Extended
Dreyfus Model for Incident Lifecycles</a></li>
<li><a
href="https://www.verica.io/inhumanity-of-root-cause-analysis/">Inhumanity
of Root Cause Analysis</a></li>
<li><a href="https://www.youtube.com/watch?v=ODYO2MPymJ4">Incident
insights from NASA, NTSB, and the CDC</a></li>
<li><a
href="https://www.squadcast.com/blog/how-to-avoid-on-call-burnout">How
to avoid On-Call Burnout the SRE Way</a></li>
<li><a href="https://about.gitlab.com/blog/2019/12/16/sre-shadow/">My
week shadowing a GitLab Site Reliability Engineer</a></li>
<li><a
href="https://about.gitlab.com/blog/2018/03/14/the-on-call-handover-at-gitlab/">How
our production team runs the weekly on-call handover</a></li>
<li><a
href="https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/">Writing
Runbook Documentation When You’re An SRE</a></li>
<li><a
href="https://lethain.com/incident-response-programs-and-your-startup/">Incident
response, programs and you(r startup)</a></li>
<li><a
href="https://blog.danslimmon.com/2019/06/24/an-incident-command-training-handbook/">An
Incident Command Training Handbook</a></li>
<li><a
href="https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents">Shrinking
the time to mitigate production incidents</a></li>
<li><a
href="https://surfingcomplexity.blog/2021/06/11/incident-writeup-as-sociological-storytelling/">Incident
writeup as sociological storytelling</a></li>
<li><a
href="https://www.blameless.com/incident-response/elephant-in-the-blameless-war-room-accountability">Elephant
in the Blameless War Room: Accountability</a></li>
<li><a
href="https://surfingcomplexity.blog/2021/05/22/naming-names-in-incident-writeups/">Naming
names in incident writeups</a></li>
<li><a
href="https://github.blog/2021-01-06-building-on-call-culture-at-github/">Building
On-Call Culture at GitHub</a></li>
</ul>
<h2 id="post-mortem">Post-Mortem</h2>
<ul>
<li><a href="https://github.com/danluu/post-mortems">A collection of
post-mortems</a></li>
<li><a
href="https://github.com/hjacobs/kubernetes-failure-stories">Collection
of Kubernetes Failure Stories</a></li>
<li><a
href="https://codeascraft.com/2012/05/22/blameless-postmortems/">Blameless
PostMortems and a Just Culture</a></li>
<li><a href="https://blog.box.com/blog/a-tale-of-postmortems/">A Tale of
Postmortems</a></li>
<li><a href="http://runasradio.com/Shows/Show/486">Building a Blameless
Post-Mortem Culture with Jason Hand</a></li>
<li><a href="https://www.oreilly.com/ideas/the-infinite-hows">The
infinite hows</a></li>
<li><a href="https://victorops.com/blog/blameless-culture/">Failure is
Always An Option: How a Blameless Culture Leads to Better
Results</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-1-why-you-need-postmortem-process.html">SysAdvent
- Day 1 - Why You Need a Postmortem Process</a></li>
<li><a
href="https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/">Etsy’s
Debriefing Facilitation Guide for Blameless Postmortems</a></li>
<li><a href="https://sharpend.io/writing-your-first-postmortem/">Writing
Your First Postmortem</a></li>
<li><a
href="https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/">How
to Write Great Outage Post-Mortems</a></li>
<li><a href="https://github.com/dastergon/postmortem-templates">A
collection of postmortem templates</a></li>
<li><a
href="https://blog.heptio.com/embracing-feedback-2fd703da714f">Embracing
Feedback</a></li>
<li><a
href="https://www.usenix.org/conference/srecon17americas/program/presentation/lueder">Postmortem
Action Items: Plan the Work and Work the Plan</a></li>
<li><a
href="https://medium.com/@allspaw/social-issues-in-postmortems-d48dde624d18">Social
Issues In Postmortems</a></li>
<li><a
href="https://www.inc.com/justin-bariso/meet-postmortem-googles-brilliant-process-tool-for-learning-from-failure.html">Google
Has an Official Process in Place for Learning From Failure–and It’s
Absolutely Brilliant</a></li>
<li><a
href="https://rework.withgoogle.com/blog/postmortem-culture-how-you-can-learn-from-failure/">Postmortem
culture: how you can learn from failure</a></li>
<li><a
href="https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit">re:Work
- Postmortem discussion template</a></li>
<li><a
href="https://increment.com/documentation/post-mortems-to-the-rescue/">Post-mortems
to the rescue</a></li>
<li><a href="https://ai.google/research/pubs/pub45906">Postmortem Action
Items: Plan the Work and Work the Plan</a></li>
<li><a
href="https://www.blameless.com/why-companies-can-benefit-from-blameless-culture/">Why
Every Company Can Benefit from a Blameless Culture</a></li>
<li><a
href="https://www.hostedgraphite.com/blog/its-dead-jim-how-we-write-an-incident-postmortem">“It’s
dead, Jim”: How we write an incident postmortem</a></li>
<li><a
href="https://www.hostedgraphite.com/blog/incident-postmortem-template">Our
incident postmortem template</a></li>
<li><a
href="https://fernandocejas.com/2020/03/21/learn-out-of-mistakes-postmortems/">Learn
out of mistakes. Postmortems to the rescue.</a></li>
<li><a
href="https://www.blameless.com/improve-postmortem-with-sre-steve-mcghee/">Improving
Postmortem Practices with Veteran Google SRE, Steve McGhee</a></li>
<li><a
href="https://www.verica.io/blog/inhumanity-of-root-cause-analysis/">Inhumanity
of Root Cause Analysis</a></li>
</ul>
<h2 id="capacity-planning">Capacity Planning</h2>
<ul>
<li><a
href="https://www.usenix.org/system/files/login/articles/login_feb15_07_hixson.pdf">Capacity
Planning</a></li>
<li><a href="https://www.youtube.com/watch?v=MDQ0uEUmLOo">SouthBay SRE:
Cloud Capacity Planning</a></li>
<li><a
href="https://www.squadcast.com/blog/intent-based-capacity-planning-and-autoscaling-with-kubernetes">Intent-based
Capacity Planning and Autoscaling with Kubernetes</a></li>
<li><a
href="https://jvns.ca/blog/2016/03/20/how-do-you-do-capacity-planning/">How
do you do Capacity Planning</a></li>
<li><a
href="https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408">How
Back Market SREs prepared for Black Friday</a></li>
</ul>
<h2 id="service-level-agreement">Service Level Agreement</h2>
<ul>
<li><a
href="http://er.educause.edu/articles/2010/6/if-its-in-the-cloud-get-it-on-paper-cloud-computing-contract-issues">If
It’s in the Cloud, Get It on Paper: Cloud Computing Contract
Issues</a></li>
<li><a
href="http://www.wired.com/insights/2011/12/service-level-agreements-in-the-cloud-who-cares/">Service
Level Agreements in the Cloud: Who cares?</a></li>
<li><a
href="https://sysadvent.blogspot.com/2016/12/day-20-how-to-set-and-monitor-slas.html">SysAdvent
- Day 20 - How to set and monitor SLAs</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html">SLOs,
SLIs, SLAs, oh my - CRE life lessons</a></li>
<li><a
href="https://www.usenix.org/conference/srecon16/program/presentation/jones">Service
Levels and Error Budgets</a></li>
<li><a
href="https://www.usenix.org/system/files/login/articles/login_aug15_06_roth.pdf">(Un)Reliability
Budgets - Finding Balance between Innovation and Reliability</a></li>
<li><a
href="https://queue.acm.org/detail.cfm?id=3096459&__s=dnkxuaws9pogqdnxmx8i">The
Calculus of Service Availability</a></li>
<li><a
href="https://dastergon.github.io/availability-calculator/">Availability
Calculator: Calculate how much downtime should be permitted in your
SLA</a></li>
<li><a
href="https://www.ibm.com/developerworks/cloud/library/cl-SLAloadbalance-numanalysis/">Standardize
cloud SLA availability with numerical performance data</a></li>
<li><a
href="https://www.ibm.com/developerworks/cloud/library/cl-slastandards/">Best
practices to develop SLAs for cloud computing</a></li>
<li><a href="https://www.catchpoint.com/blog/sla-management-guide/">A
Practical Guide to SLAs</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2017/10/building-good-SLOs-CRE-life-lessons.html">Building
good SLOs - CRE life lessons</a></li>
<li><a
href="https://thenewstack.io/sre-lessons-google-no-grumpy-humans/">No
Grumpy Humans and Other Site Reliability Engineering Lessons from
Google</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/consequences-of-SLO-violations-CRE-life-lessons.html">Consequences
of SLO violations — CRE life lessons</a></li>
<li><a
href="https://medium.com/@jerub/service-level-objectives-in-practice-ed1200502d5">Service
Level Objectives in Practice</a></li>
<li><a
href="https://medium.com/@jerub/sre-consensus-building-36ad5d2e470b">SRE
Consensus Building</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/01/an-example-escalation-policy-CRE-life-lessons.html">An
example escalation policy — CRE life lessons</a></li>
<li><a href="https://dastergon.gr/error-budget-calculator/">Error Budget
Calculator</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/06/understanding-error-budget-overspend-cre-life-lessons.html">Understanding
error budget overspend - part one - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/06/cre-life-lessons-good-housekeeping-for-error-budgets.html">Good
housekeeping for error budgets - part two - CRE life lessons</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2018/07/sre-fundamentals-slis-slas-and-slos.html">SRE
fundamentals: SLIs, SLAs and SLOs</a></li>
<li><a
href="https://www.circonus.com/2018/07/a-guide-to-service-level-objectives/">SLOs
& You: A Guide To Service Level Objectives</a></li>
<li><a
href="https://medium.com/concourse-ci/earning-our-wings-a0c307fa73e6">Earning
Our Wings: Stories and Findings From Operating a Large-scale Concourse
Deployment</a></li>
<li><a href="https://ai.google/research/pubs/pub48033">Nines are Not
Enough: Meaningful Metrics for Clouds</a></li>
<li><a
href="https://medium.com/@jamesacowling/how-many-nines-is-my-storage-system-7d16e852d56d">How
many nines is my storage system?</a></li>
<li><a href="https://lethain.com/dont-follow-the-sun/">Don’t follow the
sun.</a></li>
<li><a href="https://www.youtube.com/watch?v=4cPqLuIXBnw">The Tyranny of
the SLA</a></li>
<li><a
href="https://www.backblaze.com/blog/cloud-storage-durability/">Backblaze
Durability is 99.999999999% — And Why It Doesn’t Matter</a></li>
<li><a href="https://youtu.be/Dfnbw5dJQ5I">DevOpsDays Chicago 2019 - The
Art of SLOs</a></li>
<li><a href="https://cre.page.link/art-of-slos">The Art of SLOs Workshop
Materials</a></li>
<li><a
href="https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/">How
to Include Latency in SLO-Based Alerting</a></li>
<li><a
href="https://www.squadcast.com/blog/succeeding-with-service-level-objectives">Succeeding
With Service Level Objectives</a></li>
<li><a
href="https://medium.com/the-telegraph-engineering/putting-customers-first-with-slis-and-slos-15352f9b6cbc">Putting
customers first with SLIs and SLOs</a></li>
<li><a
href="https://medium.com/site-reliability-engineering-leadership/sre-tip-have-tiered-slas-2c432ffe46a">SRE
Leadership: Have Tiered SLAs</a></li>
<li><a
href="https://www.blameless.com/blog/how-slos-enable-fast-reliable-application-delivery">How
SLOs Enable Fast, Reliable Application Delivery</a></li>
<li><a href="https://billduncan.org/the-tail-at-scale/">The Tail at
Scale</a></li>
<li><a href="https://billduncan.org/the-tail-at-scale-revisited/">The
Tail at Scale Revisited</a></li>
<li><a
href="https://cloud.google.com/blog/products/gcp/defining-slos-for-services-with-dependencies-cre-life-lessons">Defining
SLOs for services with dependencies</a></li>
<li><a
href="https://blog.b3k.us/2009/07/15/service-level-disagreements.html">Service
Level Disagreements</a></li>
<li><a
href="https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/">How
We Use Sloth to do SLO Monitoring and Alerting with Prometheus</a></li>
<li><a
href="https://medium.com/site-reliability-engineering-leadership/sli-deep-dive-cae92bd90a79">SLI
Deep Dive</a></li>
<li><a
href="https://medium.com/google-cloud/measuring-reliability-in-gcp-step-by-step-slo-creation-guide-using-cloud-operation-sandbox-99043bd0e70f">Measuring
Reliability in GCP: Step By Step SLO creation guide using Cloud
|
||
Operation Sandbox</a></li>
|
||
<li><a href="https://slotracker.com/">SLO tracker</a></li>
|
||
<li><a
|
||
href="https://ervinbarta.com/2021/10/19/slo-alerting-for-mortals/">SLO
|
||
Alerting for Mortals</a></li>
|
||
<li><a
|
||
href="https://bpetit.nce.re/2021/03/sre-methods-and-climate-change/">SRE
|
||
methods and climate change</a></li>
|
||
<li><a
|
||
href="https://medium.com/lightstephq/what-made-slos-so-messy-and-what-we-can-do-about-it-89be415a80b3">What
|
||
made SLOs so messy (and what we can do about it)</a></li>
|
||
<li><a
|
||
href="https://engineering.fb.com/2021/12/13/production-engineering/slick/">SLICK:
|
||
Adopting SLOs for improved reliability</a></li>
|
||
<li><a
|
||
href="https://alexewerlof.medium.com/calculating-composite-sla-d855eaf2c655">Calculating
|
||
composite SLA</a></li>
|
||
<li><a
|
||
href="https://newrelic.com/blog/best-practices/best-practices-for-setting-slos-and-slis-for-modern-complex-systems">Best
|
||
practices for setting SLOs and SLIs for modern, complex systems</a></li>
|
||
</ul>
|
||
<h2 id="performance">Performance</h2>
<ul>
<li><a
href="https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html">Performance
Checklists for SREs</a></li>
<li><a href="https://youtu.be/uQ0flQOtQEA">South Bay SRE Meetup -
Netflix Cloud Performance Team</a></li>
<li><a
href="https://medium.com/dm03514-tech-blog/sre-performance-analysis-tuning-methodology-using-a-simple-http-webserver-in-go-d475460f27ca">Software
Performance Analysis Guided By SLOs</a></li>
<li><a
href="https://mterwill.com/posts/framework-for-performance-engineering/">A
framework for pragmatic performance engineering</a></li>
</ul>
<h2 id="programming">Programming</h2>
<ul>
<li><a href="http://www.oreilly.com/pub/e/2712">Go Language for Ops and
Site Reliability Engineering</a></li>
<li><a
href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_hamilton.pdf">Go
for SREs using Python</a></li>
<li><a
href="https://speakerdeck.com/ianschenck/operability-in-go">Operability
in Go</a></li>
<li><a href="https://www.youtube.com/watch?v=5doOcaMXx08">Go Reliability
and Durability at Dropbox</a></li>
</ul>
<h2 id="misc-articles">Misc Articles</h2>
<ul>
<li><a
href="https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering">What
is SRE (Site Reliability Engineering)?</a></li>
<li><a
href="http://www.wired.com/2016/04/google-ensures-services-almost-never-go/">Here’s
How Google Makes Sure It (Almost) Never Goes Down</a></li>
<li><a
href="http://techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/">Are
site reliability engineers the next data scientists?</a></li>
<li><a
href="http://googleresearch.blogspot.gr/2012/07/site-reliability-engineers-solving-most.html">Site
Reliability Engineers: “solving the most interesting problems”</a></li>
<li><a
href="http://googleforstudents.blogspot.gr/2012/06/site-reliability-engineers-worlds-most.html">Site
Reliability Engineers: the “world’s most intense pit crew”</a></li>
<li><a
href="http://searchitoperations.techtarget.com/feature/Site-reliability-engineering-kicks-rote-tasks-out-of-IT-ops">Site
reliability engineering kicks rote tasks out of IT ops</a></li>
<li><a href="http://danluu.com/google-sre-book/">Notes on Site
Reliability Engineering</a></li>
<li><a
href="https://cloudplatform.googleblog.com/2016/07/adventures-in-SRE-land-welcome-to-Google-Mission-Control.html">Adventures
in SRE-land: Welcome to Google Mission Control</a></li>
<li><a
href="https://www.infoq.com/articles/site-reliability-engineering">Book
Review: Site Reliability Engineering - How Google Runs Production
Systems</a></li>
<li><a
href="https://www.google.com/about/careers/stories/site-reliability-engineering-profile-google/">Site
Reliability Engineers: “We solve cooler problems”</a></li>
<li><a
href="http://www.networkworld.com/article/3182827/cloud-computing/srecon17-brave-new-world-of-site-reliability-engineering.html">SREcon17:
Brave new world of site reliability engineering</a></li>
<li><a href="https://github.com/open-guides/og-aws">Open AWS
guide</a></li>
<li><a
href="https://medium.com/@jerub/commentary-on-site-reliability-engineering-9ba9e1be2a8c">Commentary
on Site Reliability Engineering</a></li>
<li><a
href="https://www.networkcomputing.com/data-centers/site-reliability-engineering-4-things-know/888724300">Site
Reliability Engineering: 4 Things to Know</a></li>
<li><a
href="https://www.linkedin.com/pulse/looking-sre-success-find-intrapreneurs-josh-gilliland/">Looking
for SRE Success? Then Find the Intrapreneurs!</a></li>
<li><a href="http://web.devopstopologies.com/">What Team Structure is
Right for DevOps to Flourish?</a></li>
<li><a
href="https://www.sidewalksafari.com/2018/12/sre-in-a-travel-emergency.html">Injured
on Vacation? Applying Principles from Site Reliability Engineering to a
Travel Emergency</a></li>
<li><a href="https://sobolevn.me/2018/12/blameless-environment">Building
blameless working environment</a></li>
<li><a
href="https://techbeacon.com/devops/how-accenture-retrofitted-site-reliability-engineering">SRE
Adoption Report</a></li>
<li><a
href="https://devops.com/sres-the-happiest-and-highest-paid-in-the-industry/">SREs:
The Happiest – and Highest Paid – in the Industry</a></li>
<li><a
href="https://thenewstack.io/the-role-of-site-reliability-engineering-today-and-tomorrow/">The
Role of Site Reliability Engineering, Today and Tomorrow</a></li>
<li><a
href="https://medium.com/@bellmar/sre-as-a-lifestyle-choice-de9f5a82d73d">SRE
as a Lifestyle Choice</a></li>
<li><a
href="https://speakerdeck.com/dastergon/srecon-emea-2019-recap-sre-muc-meetup">SREcon
EMEA 2019 Recap</a></li>
<li><a href="https://www.youtube.com/watch?v=7Oe8mYPBZmw">Life of an SRE
at Google - JC van Winkel</a></li>
<li><a
href="https://www.infoq.com/articles/site-reliability-engineering-mobile-apps/">Site
Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa</a>
- Case study: Halodoc's adaptation of SRE principles for native mobile
apps.</li>
<li><a href="https://www.infracloud.io/blogs/sre-best-practices/">SRE
Best Practices by InfraCloud</a></li>
</ul>
<h2 id="real-time-messaging">Real-time Messaging</h2>
<ul>
<li><a href="https://hangops.slack.com/">#sre channel at Hangops
Slack</a> - Discussion of Site Reliability Engineering generally.</li>
<li><a href="https://hangops.slack.com/">#incident_response channel at
Hangops Slack</a> - Discussion about Incident Response.</li>
<li><a href="https://usenix-srecon.slack.com">USENIX SREcon
Slack</a></li>
</ul>
<h2 id="blogs">Blogs</h2>
<ul>
<li><a href="http://www.brendangregg.com/blog/index.html">Brendan
Gregg’s Blog</a> - Highly Technical Blog Posts About Systems Internals,
Performance and SRE.</li>
<li><a href="http://everythingsysadmin.com/">Everything Sysadmin</a> -
Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.</li>
<li><a href="http://highscalability.com/">High Scalability</a> -
Technical Blog Posts About Systems Architecture.</li>
<li><a href="https://rachelbythebay.com/w/">rachelbythebay</a> -
Technical Blog Posts.</li>
<li><a href="http://www.susanjfowler.com/blog/">Susan J. Fowler</a> -
Various blog posts about SRE, Software Engineering and
Microservices.</li>
<li><a href="https://sysadvent.blogspot.com">SysAdvent</a> - One article
for each day of December, ending on the 25th.</li>
<li><a href="https://medium.com/@jerub">Stephen Thorne’s Blog</a> - Blog
Posts About SRE.</li>
<li><a href="https://increment.com/">Increment</a> - A digital magazine
about how teams build and operate software systems at scale.</li>
<li><a href="http://www.gophersre.com/">GopherSRE</a> - Blog Posts about
Go and SRE.</li>
<li><a href="https://medium.com/@copyconstruct">Cindy Sridharan</a> -
Blog posts about distributed systems and their management.</li>
<li><a href="https://www.blameless.com/blog/">Blameless Blog</a> - Blog
posts about SRE culture and practices.</li>
<li><a href="https://ResilienceRoundup.com">Resilience Roundup</a> -
Weekly analysis of Resilience Engineering and Human Factors research
designed for software systems.</li>
<li><a href="https://www.squadcast.com/blog">Squadcast Blog</a> - Blog
posts about SRE best practices, reliability, on-call and incident
management.</li>
<li><a href="https://www.firehydrant.io/blog">FireHydrant Blog</a> -
Posts about complex systems, incident response, and SRE best
practices.</li>
<li><a href="https://www.rootly.io/blog">Rootly Blog</a> - Incident
management best practices and guides.</li>
<li><a href="https://www.incident.io/blog">incident.io Blog</a> -
Guides, advice and resources on incident management and response.</li>
<li><a href="https://logit.io/blog">Logit.io Blog</a> - Resources on log
management, SRE and DevOps.</li>
</ul>
<h2 id="newsletters">Newsletters</h2>
<ul>
<li><a href="https://faun.dev">DevOpsLinks</a> - A weekly newsletter
about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.</li>
<li><a href="https://kubeweekly.io/">KubeWeekly</a> - The weekly
newsletter for all things Kubernetes. KubeWeekly is curated by Bob
Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.</li>
<li><a href="https://sreweekly.com/">SRE Weekly</a> - Weekly Site
Reliability Newsletter.</li>
<li><a
href="http://www.oreilly.com/webops-perf/newsletter.html">O’Reilly
Systems Engineering and Operations Newsletter</a> - Weekly systems
engineering and operations news and insights from industry
insiders.</li>
<li><a href="https://chaosengineering.news/">ChaosEngineering.news</a> -
Chaos Engineering newsletter. All things Chaos Engineering, directly to
your inbox!</li>
<li><a href="https://monitoring.love/">Monitoring Weekly</a> - What’s
new in monitoring? Curated monitoring articles to your inbox each
week.</li>
<li><a href="https://o11y.news/">Observability news</a> - Updates around
observability (o11y) with a special focus on open source.</li>
</ul>
<h2 id="conferences-meetups">Conferences & Meetups</h2>
<ul>
<li><a href="https://www.usenix.org/conferences/byname/925">SREcon
Conferences</a> - The Official SRE Conference.</li>
<li><a href="https://www.usenix.org/conferences/byname/5">LISA
Conferences</a> - Prominent Conference About SysAdmin/DevOps/SRE.</li>
<li><a href="https://developers.google.com/events/sre/">SRE Tech
Talks</a> - SRE Talks Hosted by Google.</li>
<li><a
href="https://www.meetup.com/South-Bay-Site-Reliability-Engineering/">South
Bay Site Reliability Engineering (Sunnyvale, CA) Meetup</a> - A Group
For Individuals Who Tackle Reliability Challenges For Web-Scale
Systems.</li>
<li><a
href="https://www.meetup.com/San-Francisco-Reliability-Engineering/">San
Francisco Reliability Engineering</a> - A Group Of People Who Are
Passionate About Reliable, Performant Software Systems.</li>
<li><a
href="https://www.meetup.com/Site-Reliability-Engineering-Munich/">Site
Reliability Engineering Munich, Germany</a> - SRE Meetup in the greater
area of Oktoberfest city.</li>
<li><a href="https://www.alldaydevops.com/">ADDO - All Day DevOps</a> -
A 24-hour conference that is completely online and free.</li>
<li><a
href="https://www.meetup.com/Site-Reliability-Engineering-Paris/">Site
Reliability Engineering Paris, France</a> - SRE Meetup in the city of
light.</li>
<li><a href="https://www.meetup.com/site-reliability-enggineering/">Site
Reliability Engineering India</a> - SRE Meetup in India.</li>
</ul>
<h2 id="twitter">Twitter</h2>
<ul>
<li><a href="https://twitter.com/googlesre">Google SRE Twitter
Account</a> - Google’s SRE Twitter Account.</li>
<li><a href="https://twitter.com/SREBook">SREBook</a> - The Official
Twitter Account of Site Reliability Engineering Book.</li>
<li><a href="https://twitter.com/SREcon">SREcon</a> - SREcon’s Official
Twitter Account.</li>
<li><a href="https://twitter.com/SREWorkbook">SREWorkbook</a> - The
Official Twitter Account of Site Reliability Workbook.</li>
<li><a href="https://twitter.com/The_SRE_Dev">The SRE Dev</a> -
SRE-related Posts from <a href="https://dev.to">dev.to</a>.</li>
<li><a href="https://twitter.com/TwitterSRE">Twitter SRE</a> - The
Official Twitter Account of Twitter’s SRE team.</li>
<li><a href="https://twitter.com/SREWeekly">Twitter SRE Weekly</a> - The
Official Twitter Account of SRE Weekly Newsletter.</li>
<li><a href="https://twitter.com/usenix">USENIX Association</a> - The
Official USENIX Twitter Account.</li>
</ul>
<h2 id="sre-tools">SRE Tools</h2>
<ul>
<li><a href="https://github.com/SquadcastHub/awesome-sre-tools">Awesome
SRE Tools</a> - A curated list of Site Reliability and Production
Engineering tools</li>
<li><a href="https://github.com/ligurio/awesome-ci">List of Continuous
Integration services</a></li>
<li><a href="https://github.com/shibumi/SRE-cheat-sheet">SRE cheat
sheet</a> - A cheat sheet for Site Reliability Engineering principles
and numbers</li>
</ul>
<h2 id="podcasts">Podcasts</h2>
<ul>
<li><a
href="https://podcasts.apple.com/us/podcast/resilience-in-action/id1506828506">Blameless
/ Resilience in Action</a></li>
<li><a href="https://sre.google/prodcast">Google SRE Prodcast</a></li>
<li><a href="https://www.honeycomb.io/usecase/o11ycast/">o11y
Observability Podcast</a></li>
<li><a
href="https://podcasts.apple.com/us/podcast/on-call-nightmares-podcast/id1447430839">On
Call Nightmares (retired)</a></li>
<li><a
href="https://open.spotify.com/show/1KxLVUduNdDRAiOw8BB32J">Making of
the SRE Omelette</a></li>
</ul>
<p><a href="https://github.com/dastergon/awesome-sre">sre.md
GitHub</a></p>