Incremental Processing using Netflix Maestro and Apache Iceberg
Data Engineering
Exploring Spark Catalog — Mastering Pyspark
Data cataloguing in Spark | by Petrica Leuca | Medium
Streaming from Apache Iceberg: Building Low-Latency and Cost-Effective Data Pipelines, Steven Wu (Apple), QCon NY 2023
red-data-tools/YouPlot: A command line tool that draws plots on the terminal.
Data processing with Spark: data catalog – own your data
Delivering High Quality Analytics at Netflix
Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot
Unit Testing for Data Engineers.
r/dataengineering - What did ETL look like before the "modern data stack" was a thing?
Resolving Late Arriving Dimensions
r/dataengineering - Which lakehouse table format do you expect your organization will be using by the end of 2023?
🫡🐳 pedramdb🫡🐳 on Twitter
Data Systems Tend Towards Production
Airbyte Monitoring with dbt and Metabase - Part I | Airbyte
Building a Data Engineering Project in 20 Minutes
r/dataengineering - Has anyone built a data warehouse primarily using Databricks?
The Contract-Powered Data Platform | Buz
The Breakdown: Databricks, Snowflake, and Open Source Positioning in the Data World
Yet another post on Data Contracts - Part 1
The missing piece of the modern data stack
Kicking the tires on dbt Metrics
The modern data experience (w/ Benn Stancil)
Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department | Stitch Fix Technology – Multithreaded
There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL.
Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
most technologies have evolved to a point where they can trivially scale to your needs.
Viewpoint | dbt Docs
Upgrading Data Warehouse Infrastructure at Airbnb
We the purple people
The end of Big Data
Take real-time products, for example. Most businesses have little use for true real-time experiences. But, all else being equal, real-time data is better than latent data. We all have dashboards that update a little too slowly, or marketing emails we wish we could send a little sooner. While these annoyances don’t justify the effort currently required to build real-time pipelines, they do cause small headaches. But if someone came along and offered me a streaming Fivetran, or a reactive version of dbt, I’d take it. If the cost of a real-time architecture was low enough, regardless of the shoehorned use-cases, there’d be no reason to turn it down. And just as we came to rely on Snowflake after we chose it as a better Postgres, I’m certain we’d come to rely on streaming pipelines if they replaced our current batch ones. We’d start doing more real-time marketing outreach, or build customer success workflows around live customer behavior. Over the next five years, I’d guess that real-time data tools follow this exact path: They’ll finally go mainstream, not because we all discover we need them, but because there will be no reason not to have them. And once we do, we’ll find ways to push them to their limits, just as we did with fast internet connections and powerful browsers.
Ep 30: The Personal Data Warehouse (w/ Jordan Tigani of MotherDuck)
Microsoft, Google, and the original purple people