SRE

How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)
Introduction As part of Uber engineering’s wide efforts to reach profitability, recently our team was focused on reducing cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog we want to share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism. Uber’s tech stack is composed of thousands of microservices, backed by a cloud-native, scheduler-based infrastructure. Most of these services are written in Go. Our team, Maps Production Engineering, has previously played an instrumental role in significantly improving the efficiency of multiple Java services by tuning
·eng.uber.com·
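For context on what GOGC controls, here is a minimal sketch that is not taken from the Uber article. It assumes the simplified rule that the Go runtime starts the next collection once the heap has grown by GOGC percent over the live heap, and it ignores GC roots such as stacks and globals. The helper name approxHeapTarget and the 512 MiB live heap are hypothetical; debug.SetGCPercent is the standard library call for changing the setting at runtime.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// approxHeapTarget estimates the heap size at which the next GC cycle starts,
// using the simplified rule liveHeap * (1 + gogc/100).
// Assumptions: gogc >= 0, and GC roots (stacks, globals) are ignored.
func approxHeapTarget(liveHeapBytes uint64, gogc int) uint64 {
	return liveHeapBytes + liveHeapBytes*uint64(gogc)/100
}

func main() {
	const liveHeap = 512 << 20 // hypothetical 512 MiB live heap

	// A higher GOGC lets the heap grow further between collections,
	// trading memory for fewer GC cycles (and less GC CPU).
	for _, gogc := range []int{50, 100, 200} {
		target := approxHeapTarget(liveHeap, gogc)
		fmt.Printf("GOGC=%d -> next GC near %d MiB\n", gogc, target>>20)
	}

	// SetGCPercent changes the setting for the running process and
	// returns the previous value.
	prev := debug.SetGCPercent(200)
	fmt.Println("previous GOGC setting:", prev)
}
```

The same setting can also be supplied through the GOGC environment variable when the process starts.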
Day 23 - What is eBPF?
By: Ania Kapuścińska (@lambdanis) Edited by: Shaun Mouton (@sdmouton) Like many engineers, for a long time I’ve thought ...
·sysadvent.blogspot.com·
Kit “SLOconf is May 9-12 2022" Merker on Twitter
I've had a repeated conversation recently about SLO Adoption. The question I get is "Which services should I start with?" And there is a counterintuitive idea I want to share. 🧵 (Kit “SLOconf is May 9-12 2022" Merker, @KitMerker, March 17, 2022)
·twitter.com·
James Eastham on Twitter
Finally. Trace of a request into my #serverless event driven system, API Gateway - Dynamo - Dynamo Streams - Lambda - SQS x 2 - Event Bridge. One consistent trace through the entire flow. Written in .NET, traced with @opentelemetry, observed in @honeycombio #dotnet #o11y pic.twitter.com/YfvpAYPiTD (James Eastham, @plantpowerjames, October 8, 2022)
·twitter.com·
How HashiCorp Does Site Reliability Engineering - The New Stack
The company's SRE journey started three years ago, and it now has reliability teams focused on infrastructure, products and developer productivity. #SRE #reliability
·thenewstack.io·
Felix Geisendörfer on Twitter
🎉 Announcing fgtrace, a new profiler/tracer for #golang. It captures wallclock timeline views for each goroutine and it's really simple to use: defer fgtrace.Config{}.Start().Stop() Check it out & let me know what you think https://t.co/Ttdm5hl0Vi pic.twitter.com/4iP9SNVypD (Felix Geisendörfer, @felixge, September 19, 2022)
·twitter.com·
What is eBPF? | An Introduction and Practical Tips
Addr: https://ebpf.xyz/post/an_introduction_and_practical_tips (March 23, 2022) This article introduces developers to eBPF and explains how it can be used to add security, networking, and other capabilities in the Linux kernel space. In Linux architecture, memory is separated into kernel space and user space. The kernel space is used to run the core kernel code and the device drivers. Processes running in kernel space have unrestricted access to all hardware, including CPU, memory, and disks.
·ebpf.xyz·
Resilience Engineering and Strange Loops
My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt. This paper is one of t…
·sensible.blog·
Who Destroyed Three Mile Island? - Nickolas Means | #LeadDevLondon 2018
Check out the latest from The Lead Developer at theleaddeveloper.com. On March 28, 1979, at exactly 4 o’clock in the morning, control rods slammed into the reactor core of Three Mile Island Unit #2, halting the nuclear reaction because of a fault in the reactor cooling system. At 4:02, the automated emergency cooling system activated as the reactor core temperature continued to rise. At 4:04, one of the plant operators made the befuddling decision to switch off the emergency cooling system, dooming the reactor to partial meltdown. Why? When something bad happens, it’s easy to just blame someone and move on. Taking the time to find the systemic causes, though, will not only help keep the problem from repeating, it will enable you to build the psychological safety necessary for your team to truly collaborate. Let’s let the story of Three Mile Island teach us how to make our teams stronger through systems thinking and just culture.
·youtu.be·
SLI, SLO, SLA explained in a way your kids will understand… maybe
Imagine you are in a remote meeting using terms like SLI, SLO, or SLA, and your kid asks you what they mean. How would you explain them? Or maybe you need to explain them to your boss or a colleague. In this article, I will try to put SLI, SLO, and SLA in a way even your kids would understand… maybe.
·thrownewexception.com·
[PUBLIC] The Art of SLOs – Slides
Self link: https://cre.page.link/art-of-slos-slides Participant Handbook: https://cre.page.link/art-of-slos-handbook Facilitator Handbook: https://cre.page.link/art-of-slos-howto SLO Worksheet: https://cre.page.link/art-of-slos-worksheet Errors in the content? https://cre.page.link/art-of-slos-bu...
·docs.google.com·
Improving Incident Management through Role Assignments and Game Days
John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.
·infoq.com·
SRE book list
Increase your knowledge of site reliability engineering with these books
·docs.microsoft.com·