I got sucked into a data mesh Twitter thread this weekend (it’s worth a read if you haven’t seen it). Data meshes have clearly struck a nerve. Some don’t understand them, while others believe they’r
System Design Interview Tutorial – The Beginner's Guide to System Design
System Design is an important topic to understand if you want to advance further in your career as a software engineer. Even if you are just beginning your coding journey, it's a good idea to get a head start on learning about system design. Early in your career you will
A common way to make your data pipeline idempotent is to use the delete-write pattern.
“Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application”
running a data pipeline multiple times with the same input will always produce the same output.
A common way to make your data pipeline idempotent is to use the delete-write pattern.
OCTO Project: Flat explores how to make it easy to work with data in git and GitHub. It builds on the “[git scraping” approach pioneered by Simon Willison](https://simonwillison.net/2020/Oct/9/git-scraping/) to offer a simple pattern for bringing working datasets into your repositories and versioning them, because developing against local datasets is faster and easier than working with data over the wire.
This post goes over what the term data warehousing means. This post provides a simple e-commerce relational data model and how it has to be changed to fit analytical queries. It also covers the reasoning behind wanting to use a data warehouse and how to choose an appropriate database for your project.
The rise of tools and processes to manage and control data. By Assaf Araki and Ben Lorica. Data has emerged as an imperative foundational asset for all organizations. Data fuels significant initiatives such as digital transformation and the adoption of analytics, machine learning, and AI. Organizations that are able to tame, manage, and unlock theirContinue reading "What is DataOps?"
Wondering how to execute a spark job on an AWS EMR cluster, based on a file upload event on S3? Then this post if for you. In this post we go over how to trigger spark jobs on an AWS EMR cluster, using AWS Lambda. The lambda function will execute in response to an S3 upload event. We will go over this event driven pattern with code snippets and set up a fully functioning pipeline.
Data Engineering Project: Stream Edition · Start Data Engineering
Data engineering project for beginners, stream edition. In this post we design and build a simple data streaming pipeline using Apache Kafka, Apache Flink and PostgreSQL DB. We will also review the design and understand some common issues to avoid while building distributed stream processing systems.
Become a Data Engineer with this Complete List of Resources
Want to know how to become a data engineer? Here is a list of resources, certifications and other important links that will help you to get started with it.
Uber's Journey Toward Better Data Culture From First Principles
Data powers Uber Uber has revolutionized how the world moves by powering billions of rides and deliveries connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform is Big Data and Data Science that powers everything that Uber does, such as better pricing and matching, fraud detection, lowering ETAs, and experimentation. Petabytes of data are collected and processed per day and thousands of users derive insights and make decisions from this data to build/improve these products. Problems beyond scale While we are able to scale our data systems, we previously didn’t focus enough