Yamldocs

SRE
Appendix A: Designed-For Availability for Select AWS Services - Reliability Pillar
The focus of this paper is the reliability pillar of the AWS Well-Architected Framework. It provides guidance to help customers apply best practices in the design, delivery, and maintenance of AWS environments.
What happens when you press a key in your terminal?
Wassim Chegham 🇲🇦 on Twitter
“Ever wondered what happens when you type in a URL in an address bar in a browser? Here is a brief overview...
#programming #web #sketchnotes”
Visual Patterns to Improve Monitoring Dashboards
Monitoring is essential to ensuring a system is running well in production. Without monitoring, you are driving a car with a blindfold on…
Observability Anti-Patterns | Lightstep Blog
Avoid committing "crimes against Observability" and get your Observability practice off the ground the right way, by avoiding these common Observability pitfalls!
Three Terraform Mistakes, and How to Avoid Them
Learn about Terraform gotchas, and how to solve them, so that you will hopefully be spared utter despair and panic
indent-rainbow - Visual Studio Marketplace
Extension for Visual Studio Code - Makes indentation easier to read
Visualizing Multi Cloud IAM Concepts
AWS, Azure, and GCP IAM visualized, with key concepts
emblem/docs/decisions at main · GoogleCloudPlatform/emblem · GitHub
Emblem Giving is a sample application that demonstrates a serverless architecture with continuous delivery, and trouble recovery. - emblem/docs/decisions at main ·...
How Complex Systems Fail
Supporting Data Driven Change With SLOs
In Support of Change
The Incident Retrospective Ground Rules | Honeycomb
Join Lex, SRE at Honeycomb, as he describes the incident retrospective process we abide by, and see why he was pleasantly surprised.
Nóva (@nova@hachyderm.io)
SLA: We promise
SLO: We want
SLI: We have
Dear Console,… - a collection of code snippets to use in the browser console
mikaelvesavuori/dorametrix: Dorametrix is a serverless web service that helps you calculate your DORA metrics, by inferring your metrics from events you create with webhooks (or manually!).
Automate end to end processes and quickly respond to events with Datadog Workflows
Learn how to combine monitoring and workflow automation into a single, streamlined solution with Datadog Workflows.
Gain visibility and control of your cloud spend with Datadog Cloud Cost Management | Datadog
Unlock visibility into the cloud costs of your teams. Empower engineers to optimize the cost of their services and adopt a culture of cost awareness.
What are reasonable SLOs for Kafka? - Ops - Confluent Community
Opinions are my own… These depend on the SLAs you are supporting with your SLIs. But here are a couple of core ones. Controller count: must equal 1, else something is wrong. Under-replicated partitions: under-replicated partitions greater than one is normally an early warning that something is about to go pear-shaped. Depending on your setting for publish acks, this might mean that some publishers might also stop, if min ISR is less than required. Leader elections: These might happen due to...
Seeing Like an SRE: Site Reliability Engineering as High Modernism
Best Practices for Local File Parameters | Amazon Web Services
If you have ever passed the contents of a file to a parameter of the AWS CLI, you most likely did so using the file:// notation. By setting a parameter’s value as the file’s path prepended by file://, you can explicitly pass in the contents of a local file as input to a command: aws […]
OpenSLO/OpenSLO: Open specification for defining and expressing service level objectives (SLO)
Critical User Journeys | Payments Reseller Subscription API | Google Developers
Google - Site Reliability Engineering
Network Monitoring Software by ManageEngine OpManager
ManageEngine OpManager provides easy-to-use Network Monitoring Software that offers advanced Network & Server Performance Management. Download free trial now!
What Does It Mean To Be In The 95Th Percentile? – Problem Solver X
What is The 95th Percentile, And Why Does It Matter? – FirstWave
The 95th percentile is the value that 95% of your requests stay at or below; the other 5% are the requests that exceed it.
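As a rough illustration of the note above, here is a minimal Python sketch that computes a 95th percentile with the nearest-rank method. The function name `percentile_nearest_rank` and the sample latencies are made up for this example; monitoring tools often use interpolating methods that can return slightly different values.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest position such that at least pct% of the
    # samples are at or below the value at that position (1-indexed).
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (illustrative data only).
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 17,
                13, 14, 15, 12, 19, 11, 13, 16, 14, 900]

p95 = percentile_nearest_rank(latencies_ms, 95)
slower = [v for v in latencies_ms if v > p95]
print(f"p95 = {p95} ms; {len(slower)} of {len(latencies_ms)} requests exceed it")
```

With these 20 samples, 19 requests are at or below the p95 value and one outlier exceeds it, which is the "other 5%" the note refers to.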
Avoiding the 'SLOs as Reliability Theater' trap
Site Reliability Engineering: SLI Implementation Example
The Service Level Indicator is the ongoing measurement of your system that tells you whether you’re meeting your objective
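One common way to express such a measurement is as a ratio of good events to total events, compared against the objective. The sketch below assumes that framing; the function names, the request counts, and the 99.9% target are hypothetical and are not taken from the linked article.

```python
def availability_sli(good_events, total_events):
    """Return the SLI as the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Return the fraction of the error budget still unspent.

    The result is negative when the budget has been exceeded.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1 - slo_target
    actual_failure = 1 - sli
    return 1 - actual_failure / allowed_failure


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_events=99_950, total_events=100_000)
slo_target = 0.999  # assumed objective: 99.9% of requests succeed

meets_objective = sli >= slo_target
budget_left = error_budget_remaining(sli, slo_target)
print(f"SLI = {sli:.4%}, meets SLO: {meets_objective}, budget left: {budget_left:.0%}")
```

Here the SLI is what the system actually delivered in the window, and the comparison against `slo_target` answers whether the objective is being met.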
Message Queueing vs. Event Stream Processing in Azure