Datadog region migration at Wolt

SRE
You're Logging Wrong: What One-Per-Service ("Phat Event") Logs Are and Why You Need Them. » Open Up The Cloud
Having difficulties understanding what and when to log? The pattern of one-per-service logging is worth investigating. In this article we cover what they are and how to use them.
Performance testing GOV.UK Pay’s webhooks mechanism
We used performance testing to check our new webhook functionality worked at scale for our services. Find out what we learned.
Why Benchmarks Miss The Mark for AWS Spend
Benchmarks ignore product design / Benchmarks ignore competitive decisions / Compete against yourself, not others
SumUp Uses Honeycomb to Improve Service Quality and Strengthen Customer Loyalty
SLOs become a negotiating and prioritization tool for your engineering teams… It forces discussions that, without the SLO, you wouldn’t have until a customer complained,
How Thoughtworks uses Cloud Carbon Footprint for sustainability | Google Cloud Blog
Read how Thoughtworks uses a combination of Google Cloud products to measure carbon emissions and track sustainability goals.
2023 03 08 Incident: Infrastructure connectivity issue affecting multiple regions | Datadog
TL;DR: A legacy k8s update channel triggered an unplanned OS update. An unforeseen systemd test scenario came into effect and wiped the routing tables.
PagerDuty Incident Response Documentation
A collection of information about the PagerDuty incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work.
Monitoring in the Kubernetes era
Learn about the key components in a Kubernetes architecture and how container orchestration changes your approach to monitoring.
Advanced Log Collection Configurations
Use the Datadog Agent to collect your logs and send them to Datadog
Monitor and query for unparsed logs
Datadog, the leading service for cloud-scale monitoring.
Datadog Release Notes | New Integration with Argo CD
Does OpenTelemetry in .NET Cause Performance Degradation?
There’s an impact to performance when instrumenting using Activity & also processing using OpenTelemetry.
Let's find out how much.
hot-shots
Node.js client for StatsD, DogStatsD, and Telegraf. Latest version: 10.0.0, last published: 2 months ago. Start using hot-shots in your project by running `npm i hot-shots`. There are 502 other projects in the npm registry using hot-shots.
Release Notes: Monitors - Understand alerting trends with an out-of-the-box dashboard for monitor notifications
Datadog Release Notes: [RUM] Configure IP and Geolocation data capture from the Datadog UI
Datadog Release Notes: Shadow DOM support is available for Session Replay
Universal Service Monitoring
Datadog, the leading service for cloud-scale monitoring.
9 insights on real world container use
Our latest report examines more than 1.5 billion containers run by tens of thousands of Datadog customers to understand the state of the container ecosystem.
Datadog On Reliability Engineering
There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritizat...
Container Lifecycle Hooks
This page describes how kubelet managed Containers can use the Container lifecycle hook framework to run code triggered by events during their management lifecycle.
Analogous to many programming language frameworks that have component lifecycle hooks, such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable Containers to be aware of events in their management lifecycle and run code implemented in a handler when the corresponding lifecycle hook is executed.
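As a rough illustration of the lifecycle hooks described above, the sketch below builds a Pod manifest with `postStart` and `preStop` exec handlers and prints it as JSON. The `lifecycle`, `postStart`, `preStop`, `exec`, and `command` fields are part of the Kubernetes Pod spec; the Pod name, image, and shell commands are hypothetical placeholders chosen for the example.

```python
import json


def pod_with_lifecycle_hooks(name: str, image: str) -> dict:
    """Build a Pod manifest whose single container defines both lifecycle hooks."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "lifecycle": {
                        # Runs right after the container is created (hypothetical command).
                        "postStart": {
                            "exec": {"command": ["/bin/sh", "-c", "echo started > /tmp/started"]}
                        },
                        # Runs before the container is terminated (hypothetical command).
                        "preStop": {
                            "exec": {"command": ["/bin/sh", "-c", "sleep 5"]}
                        },
                    },
                }
            ]
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = pod_with_lifecycle_hooks("lifecycle-demo", "nginx")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied like any other manifest; a `preStop` handler of this kind is often used to give in-flight requests time to drain before shutdown.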
Keptn - Cloud-native application life-cycle orchestration.
Keptn automates observability, SLO-driven multi-stage delivery, and operations
slo-generator/datadog.md at master · google/slo-generator
SLO Generator is a tool to compute SLIs, SLOs, Error Budgets and Burn rate and export an SLO report to supported exporters. - slo-generator/datadog.md at master · google/slo-generator
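To make the terms in that description concrete, here is a minimal sketch of how an SLI, an error budget, and a burn rate relate to each other. It uses the common definitions (SLI as good events over valid events, error budget as one minus the SLO target, burn rate as observed error rate over error budget) and is not the slo-generator API; the function names and sample numbers are assumptions made for illustration.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Service level indicator: fraction of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events


def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail under the SLO target."""
    return 1.0 - slo_target


def burn_rate(sli_value: float, slo_target: float) -> float:
    """How fast the budget is consumed; 1.0 means exactly on budget."""
    return (1.0 - sli_value) / error_budget(slo_target)


# Hypothetical window: 10,000 valid requests, 9,980 good, 99.9% target.
observed = sli(9_980, 10_000)
rate = burn_rate(observed, 0.999)
print(f"SLI={observed:.4f} budget={error_budget(0.999):.4f} burn_rate={rate:.2f}")
```

A burn rate above 1.0 means the error budget would run out before the end of the SLO window if the current error rate persisted.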
Artillery.io | Load & Smoke Testing
Keep production reliable, customers happy, and pagers silent.
On Rake Collections and Software Engineering
Illustration by Furryviza. Matthew posted on Twitter a metaphor about rakes and software engineering – well, software development but at this point I would argue anyone arguing over these distinctio…
Prometheus Alternatives
What are the alternatives to Prometheus? A guide to comparing different Prometheus Alternatives.
SLOconf 2022: Leo Vasiliou - Perform How many Nines Depends on Accumulation
Meet the powerful analytic for performance-based SLOs. This talk starts from the fact that most introductory SLO discussions use an internal, non-cumulative endpoint (e.g. how many successful GET requests to /API) to illustrate SLO concepts. It arrives at the fact that when setting an SLO for a cumulative endpoint (e.g. an app or page consisting of many distributed requests), the number of nines for the objective must be adjusted accordingly. In other words, three or four nines may be acceptable for /API, but three or four nines for an experience-based (cumulative) endpoint is not practical. This session discusses the adjustments needed for experience-based (cumulative) endpoints through both an availability and a performance lens. It then expands on the performance lens and discusses semi-advanced distribution functions for analyzing them, with the ultimate goal being reliable, resilient experiences to better serve self, team, and business.
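The point about cumulative endpoints can be illustrated with some simple arithmetic. The sketch below assumes that an experience succeeds only if every one of its requests succeeds and that request failures are independent; both are simplifying assumptions for this example and not necessarily the adjustments presented in the talk. The function names are hypothetical.

```python
import math


def cumulative_availability(per_request: float, n_requests: int) -> float:
    """Availability of an experience that needs all n independent requests to succeed."""
    return per_request ** n_requests


def required_per_request(target: float, n_requests: int) -> float:
    """Per-request availability needed for the whole experience to hit the target."""
    return target ** (1.0 / n_requests)


def nines(availability: float) -> float:
    """Express an availability as a number of nines (0.999 is about 3)."""
    return -math.log10(1.0 - availability)


# Hypothetical page made of 10 requests, each at three nines.
page = cumulative_availability(0.999, 10)
needed = required_per_request(0.999, 10)
print(f"page availability: {page:.5f} ({nines(page):.2f} nines)")
print(f"per-request availability needed for a three-nines page: {needed:.6f}")
```

Under these assumptions, ten three-nines requests yield a page at roughly two nines, which is why a target that is reasonable for a single endpoint can be impractical for the experience as a whole.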
SLOconf 2022: Stephen Townshend & Gwen De Leon - Defining SLOs When You Don't Know Anything About SLOs
In this talk we walk through our SLO definition workshop, a facilitated session that we used at IAG as an experiment to help teams embed customer focus. We talk openly about what did and did not work, and the experimentation and adjustments we made along the way.
SRE Pyramid: Dickerson's Hierarchy Of Service Reliability
The SRE pyramid was created by Mikey Dickerson to represent the hierarchy of service reliability. We explore the 7 principles of the SRE pyramid.
What does an SRE do?
Are you a software engineering director in charge of some Site Reliability Engineers (SRE) and wondering what they’re doing - or should do? Then read on!