Are you disappointed with online SQL tutorials that aren't deep enough? Are you frustrated knowing that you are missing SQL skills, but can't quite put your finger on what they are? This post is for you. In this post, we go over a few topics that can take your SQL skills to the next level and help you become a better data engineer.
In this post, we go over 6 key concepts to help you master window functions. Window functions are one of the most powerful features of SQL: they are very useful in analytics, enabling operations that cannot be done easily with standard GROUP BY clauses, subqueries, and filters. Despite this, window functions are not used frequently. If you have ever thought 'window functions are confusing', then this post is for you.
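As a minimal sketch of the idea (the `orders` table and its columns below are made up for illustration), a window function can compute a per-customer running total while keeping every individual row, something a plain GROUP BY would collapse away:

```sql
-- Hypothetical orders table: running total of each customer's spend,
-- ordered by order date, without collapsing the individual rows.
SELECT
    customer_id,
    order_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS running_total
FROM orders;
```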
What are Common Table Expressions (CTEs) and when to use them?
You may have heard of Common Table Expressions (CTEs), but are not sure what they are or when to use them. What if you knew exactly what CTEs were and when to use them? In this post, we go over what CTEs are and compare their performance against subqueries, derived tables, and temp tables to help you decide when to use them.
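As a quick illustration (the `orders` and `customers` tables are hypothetical, and `DATE_TRUNC` is the Postgres-style function), a CTE lets you name an intermediate result set and reference it like a regular table in the main query:

```sql
-- Hypothetical example: name an intermediate result with a CTE,
-- then join against it as if it were a regular table.
WITH monthly_sales AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', order_date) AS order_month,
        SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT c.customer_name,
       m.order_month,
       m.total_amount
FROM monthly_sales m
JOIN customers c ON c.customer_id = m.customer_id;
```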
Designing a Data Project to Impress Hiring Managers
Frustrated that hiring managers are not reading your GitHub projects? Then this post is for you. In this post, we discuss a way to impress hiring managers by hosting a live dashboard with near real-time data. We will also go over coding best practices such as project structure, automated formatting, and testing to make your code professional. By the end of this post, you will have deployed a live dashboard that you can link to from your resume and LinkedIn profile.
Connect to Azure SQL in Python with MFA Active Directory Interactive Authentication without using Microsoft.IdentityModel.Clients.ActiveDirectory dll
To connect to Azure SQL Database using MFA (shown in SSMS as "Active Directory - Universal"), Microsoft recommends, and currently only provides a tutorial for, connecting with C# using Microsoft.Identity...
“Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application”
In data engineering terms, this means that running a data pipeline multiple times with the same input will always produce the same output.
A common way to make your data pipeline idempotent is to use the delete-write pattern.
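As a rough sketch of the delete-write pattern (the table names and run date below are hypothetical), the pipeline first deletes any rows it may have written for the partition being processed and then writes the fresh results:

```sql
-- Delete-write pattern sketch (hypothetical tables, run date 2021-01-01).
-- In practice, wrap both statements in a single transaction.

-- 1. Remove any rows previously written for this run's partition.
DELETE FROM daily_user_metrics
WHERE metric_date = '2021-01-01';

-- 2. Re-insert the freshly computed rows for the same partition.
INSERT INTO daily_user_metrics (metric_date, user_id, order_count)
SELECT CAST(order_ts AS DATE) AS metric_date,
       user_id,
       COUNT(*) AS order_count
FROM orders
WHERE CAST(order_ts AS DATE) = '2021-01-01'
GROUP BY CAST(order_ts AS DATE), user_id;
```

Because the delete and the insert cover exactly the same partition, rerunning the pipeline for 2021-01-01 leaves the table in the same final state, which is what makes the run idempotent.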
OCTO Project: Flat explores how to make it easy to work with data in git and GitHub. It builds on the ["git scraping" approach pioneered by Simon Willison](https://simonwillison.net/2020/Oct/9/git-scraping/) to offer a simple pattern for bringing working datasets into your repositories and versioning them, because developing against local datasets is faster and easier than working with data over the wire.
Data Engineering Project: Stream Edition
Data engineering project for beginners, stream edition. In this post, we design and build a simple streaming data pipeline using Apache Kafka, Apache Flink, and a PostgreSQL database. We will also review the design and cover some common issues to avoid when building distributed stream processing systems.
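As a rough sketch of the kind of pipeline the post describes (the topic name, fields, and connection settings below are made up for illustration), Flink SQL can read events from a Kafka topic and continuously upsert aggregates into PostgreSQL through the JDBC connector:

```sql
-- Source: read click events from a Kafka topic (hypothetical topic and fields).
CREATE TABLE clicks (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'clicks',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);

-- Sink: upsert per-user click counts into a PostgreSQL table.
CREATE TABLE user_click_counts (
    user_id STRING,
    click_count BIGINT,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://localhost:5432/analytics',
    'table-name' = 'user_click_counts',
    'username' = 'flink',
    'password' = 'flink'
);

-- Continuous query: aggregate the stream and write results to the sink.
INSERT INTO user_click_counts
SELECT user_id, COUNT(*) AS click_count
FROM clicks
GROUP BY user_id;
```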
Become a Data Engineer with this Complete List of Resources
Want to know how to become a data engineer? Here is a list of resources, certifications, and other important links to help you get started.
Uber's Journey Toward Better Data Culture From First Principles
Data powers Uber: Uber has revolutionized how the world moves by powering billions of rides and deliveries, connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform are the Big Data and Data Science that power everything Uber does, such as better pricing and matching, fraud detection, lowering ETAs, and experimentation. Petabytes of data are collected and processed per day, and thousands of users derive insights and make decisions from this data to build and improve these products. Problems beyond scale: while we are able to scale our data systems, we previously didn't focus enough…