Building Cost Efficient Data Pipelines with Python & DuckDB
Imagine working for a company that processes a few GBs of data every day but spends hours configuring and debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from a big-tech blog post or conference talk.
Now, the responsibility of managing the data team's expenses has fallen on your shoulders. You're under pressure to scrutinize every system expense, no matter how small, in an effort to save some money for the organization.
It's frustrating when data vendors charge you a lot, and they will gladly charge you even more if you are not careful with usage.
Imagine if your data processing costs were dirt cheap! Imagine being able to replicate and debug issues quickly on your laptop!
In this post, we will discuss how to combine recent advances in data processing systems with cheap hardware to keep data processing costs low. We will use DuckDB and Python to demonstrate how to process data quickly while improving developer ergonomics.
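To give a taste of what's ahead, here is a minimal sketch of the kind of pipeline we'll be building: DuckDB running in-process on your laptop, querying a file with plain SQL, with no cluster or vendor service involved. The file name (`orders.parquet`) and its columns are hypothetical placeholders for this illustration.

```python
import duckdb

# An in-memory DuckDB connection; runs in-process, nothing to deploy or configure.
con = duckdb.connect()

# Query a (hypothetical) Parquet file directly from disk with plain SQL.
# Column names order_date and order_amount are assumed for illustration.
daily_totals = con.sql("""
    SELECT
        order_date,
        SUM(order_amount) AS total_amount
    FROM read_parquet('orders.parquet')
    GROUP BY order_date
    ORDER BY order_date
""").df()  # materialize the result as a pandas DataFrame

print(daily_totals.head())
```

For a few GBs of data, a query like this typically finishes in seconds on a laptop, and because everything runs locally, you can reproduce and debug issues without touching production infrastructure.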