SRE

243 bookmarks

Datadog region migration at Wolt

#datadog #migration #networking

·careers.wolt.com·Jul 19, 2023

You're Logging Wrong: What One-Per-Service ("Phat Event") Logs Are and Why You Need Them. » Open Up The Cloud

Having difficulty understanding what and when to log? The one-per-service logging pattern is worth investigating. In this article we cover what these logs are and how to use them.

·openupthecloud.com·Jul 18, 2023

Performance testing GOV.UK Pay’s webhooks mechanism

We used performance testing to check our new webhook functionality worked at scale for our services. Find out what we learned.

·gds.blog.gov.uk·Jul 11, 2023

Why Benchmarks Miss The Mark for AWS Spend

Benchmarks ignore product design / Benchmarks ignore competitive decisions / Compete against yourself, not others

#benchmark #performance

·duckbillgroup.com·Jun 30, 2023

SumUp Uses Honeycomb to Improve Service Quality and Strengthen Customer Loyalty

SLOs become a negotiating and prioritization tool for your engineering teams… It forces discussions that, without the SLO, you wouldn’t have until a customer complained.

#slo #case_studies

·honeycomb.io·Jun 30, 2023

How Thoughtworks uses Cloud Carbon Footprint for sustainability | Google Cloud Blog

Read how Thoughtworks uses a combination of Google Cloud products to measure carbon emissions and track sustainability goals.

#carbon_footprint #sustainability #green_computing

·cloud.google.com·Jun 12, 2023

2023 03 08 Incident: Infrastructure connectivity issue affecting multiple regions | Datadog

TL;DR: a legacy k8s update channel triggered an unplanned OS update. An unforeseen systemd test scenario came into effect and wiped the routing tables.

#incident_management #datadog #k8s

·datadoghq.com·Jun 5, 2023

PagerDuty Incident Response Documentation

A collection of information about the PagerDuty incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work.

#incident_management

·response.pagerduty.com·May 10, 2023

Monitoring in the Kubernetes era

Learn about the key components in a Kubernetes architecture and how container orchestration changes your approach to monitoring.

#datadog #k8s

·datadoghq.com·May 3, 2023

Advanced Log Collection Configurations

Use the Datadog Agent to collect your logs and send them to Datadog

#datadog

·docs.datadoghq.com·May 3, 2023

Monitor and query for unparsed logs

#datadog

·docs.datadoghq.com·May 3, 2023

Datadog Release Notes | New Integration with Argo CD

#datadog #argocd

·app.datadoghq.com·Apr 25, 2023

Does OpenTelemetry in .NET Cause Performance Degradation?

There’s an impact to performance when instrumenting using Activity & also processing using OpenTelemetry. Let's find out how much.

#open_telemetry #dotnet

·honeycomb.io·Apr 19, 2023

hot-shots

Node.js client for StatsD, DogStatsD, and Telegraf. Latest version: 10.0.0, last published: 2 months ago. Start using hot-shots in your project by running `npm i hot-shots`. There are 502 other projects in the npm registry using hot-shots.

#statsd #datadog #telegraf #nodejs #node

·npmjs.com·Apr 18, 2023

Release Notes: Monitors - Understand alerting trends with an out-of-the-box dashboard for monitor notifications

#datadog

·app.datadoghq.com·Apr 17, 2023

Datadog Release Notes: [RUM] Configure IP and Geolocation data capture from the Datadog UI

#datadog

·app.datadoghq.com·Apr 5, 2023

Datadog Release Notes: Shadow DOM support is available for Session Replay

#datadog

·app.datadoghq.com·Apr 5, 2023

Universal Service Monitoring

#datadog

·docs.datadoghq.com·Apr 5, 2023

9 insights on real world container use

Our latest report examines more than 1.5 billion containers run by tens of thousands of Datadog customers to understand the state of the container ecosystem.

#datadog #report

·datadoghq.com·Mar 22, 2023

Datadog On Reliability Engineering

There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritizat...

#datadog #pov

·youtube.com·Mar 9, 2023

Container Lifecycle Hooks

This page describes how kubelet managed Containers can use the Container lifecycle hook framework to run code triggered by events during their management lifecycle. Analogous to many programming language frameworks that have component lifecycle hooks, such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable Containers to be aware of events in their management lifecycle and run code implemented in a handler when the corresponding lifecycle hook is executed.

#k8s

·kubernetes.io·Mar 7, 2023

Keptn - Cloud-native application life-cycle orchestration.

Keptn automates observability, SLO-driven multi-stage delivery, and operations

#slo #sdlc

·keptn.sh·Feb 27, 2023

slo-generator/datadog.md at master · google/slo-generator

SLO Generator is a tool to compute SLIs, SLOs, Error Budgets and Burn rate and export an SLO report to supported exporters. - slo-generator/datadog.md at master · google/slo-generator

#slo #datadog

·github.com·Feb 27, 2023

Artillery.io | Load & Smoke Testing

Keep production reliable, customers happy, and pagers silent.

#tools #performance #testing #workload

·artillery.io·Feb 20, 2023

On Rake Collections and Software Engineering

Illustration by Furryviza. Matthew posted on twitter a metaphor about rakes and software engineering – well, software development but at this point I would argue anyone arguing over these distinctio…

·flameeyes.blog·Feb 13, 2023

Prometheus Alternatives

What are the alternatives to Prometheus? A guide to comparing different Prometheus Alternatives.

#prometheus #datadog #influxdb #graphite

·last9.io·Feb 7, 2023

SLOconf 2022: Leo Vasiliou- Perform How many Nines Depends on Accumulation

Meet the powerful analytic for performance-based SLOs. This talk starts from the fact that most SLO teaching discussions use an internal, non-cumulative endpoint (e.g. how many successful GET requests to /API) to illustrate SLO concepts. It arrives at the fact that when setting an SLO for cumulative endpoints (e.g. an app or page consisting of many distributed requests), the number of nines for the objective must be adjusted accordingly. In other words, three or four nines may be acceptable for /API, but three or four nines for an experience-based (cumulative) endpoint is not practical. This session discusses the various adjustments needed for experience-based (cumulative) endpoints through both an availability and a performance lens. It then expands on the performance lens and discusses semi-advanced distribution functions for analyzing them, with the ultimate goal being reliable, resilient experiences to better serve self, team, and business.

#hockey_stick #cdf #performance #traffic_analysis #comparison #sre

·youtu.be·Dec 23, 2022

SLOconf 2022: Stephen Townshend & Gwen De Leon- Defining SLOs When You Dont Know Anything About SLOs

In this talk we walk through our SLO definition workshop, a facilitated session that we used at IAG as an experiment to help teams embed customer focus. We talk openly about what did and did not work, and the experimentation and adjustments we made along the way.

#sre #workshop #slo #slo_conf

·youtu.be·Dec 23, 2022

SRE Pyramid: Dickerson's Hierarchy Of Service Reliability

The SRE pyramid was created by Mikey Dickerson to represent the hierarchy of service reliability. We explore the 7 principles of the SRE pyramid.

#sre

·reliably.com·Dec 21, 2022

What does an SRE do?

Are you a software engineering director in charge of some Site Reliability Engineers (SRE) and wondering what they’re doing - or should do? Then read on!

#sre

·stanza.systems·Dec 20, 2022
