Exploring Spark Catalog — Mastering Pyspark
Data cataloguing in Spark | by Petrica Leuca | Medium
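The catalog links above cover the same PySpark surface area. As a quick illustration (not taken from either post), here is a minimal sketch of poking at the catalog from a local session; the table name is made up:

```python
from pyspark.sql import SparkSession

# Local session with the default in-memory catalog; a real deployment
# would typically point at a Hive metastore or an external catalog.
spark = SparkSession.builder.appName("catalog-tour").getOrCreate()

# Register a throwaway temp view so the catalog has something to list.
spark.range(3).createOrReplaceTempView("demo")

print(spark.catalog.currentDatabase())    # 'default'
print(spark.catalog.listDatabases())      # [Database(name='default', ...)]
print(spark.catalog.listTables())         # includes the 'demo' temp view
print(spark.catalog.listColumns("demo"))  # [Column(name='id', dataType='bigint', ...)]

spark.stop()
```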
Streaming from Apache Iceberg - QCon NY 2023
Building Low-Latency and Cost Effective Data Pipelines
Steven Wu @ Apple
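For context, the talk is about consuming Iceberg tables incrementally rather than via full batch scans. A hedged sketch of what that looks like with Spark Structured Streaming (table identifier, timestamp, and checkpoint path are illustrative; assumes the Iceberg Spark runtime is on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-stream").getOrCreate()

# Incrementally read newly committed snapshots from an Iceberg table.
# "stream-from-timestamp" (epoch millis) picks the starting snapshot.
events = (
    spark.readStream
    .format("iceberg")
    .option("stream-from-timestamp", "1672531200000")  # illustrative
    .load("db.events")                                 # illustrative table
)

query = (
    events.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/ckpt/iceberg-demo")
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```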
red-data-tools/YouPlot: A command line tool that draws plots on the terminal.
Data processing with Spark: data catalog – own your data
Delivering High Quality Analytics at Netflix
Netflix is a data-driven entertainment company, where analytics are extensively used to make informed decisions on every aspect of the business.
Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot
Alice Leach, Lalita Yang, Stephen Bailey | Data Engineering
Unit Testing for Data Engineers.
I know you don't want to, but if you don't I will call your grandma.
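In that spirit, a minimal example of the kind of test the post is asking for: a pure transformation function exercised with pytest. The function and column names are invented for illustration:

```python
import pandas as pd


def dedupe_latest(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent row per user_id (illustrative transform)."""
    return (
        df.sort_values("updated_at")
          .drop_duplicates("user_id", keep="last")
          .reset_index(drop=True)
    )


def test_dedupe_latest_keeps_newest_row():
    df = pd.DataFrame({
        "user_id": [1, 1, 2],
        "updated_at": ["2023-01-01", "2023-02-01", "2023-01-15"],
    })
    out = dedupe_latest(df)
    assert len(out) == 2
    assert out.loc[out["user_id"] == 1, "updated_at"].item() == "2023-02-01"
```

Run with `pytest`; no grandma required.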
r/dataengineering - What did ETL look like before the "modern data stack" was a thing?
Resolving Late Arriving Dimensions
How to handle late-arriving dimensions.
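A common remedy is the Kimball-style "inferred member": when a fact references a dimension key that has not arrived yet, insert a placeholder row so the fact still joins, then overwrite it when the real attributes show up. A small pandas sketch (all names illustrative):

```python
import pandas as pd

# Dimension and fact tables; is_inferred marks placeholder members.
dim_customer = pd.DataFrame(
    {"customer_id": [1, 2], "name": ["Ada", "Grace"], "is_inferred": [False, False]}
)
facts = pd.DataFrame({"customer_id": [1, 3], "amount": [10.0, 25.0]})

# 1. Find fact keys with no matching dimension row (customer 3 is late).
missing = set(facts["customer_id"]) - set(dim_customer["customer_id"])

# 2. Insert inferred members with unknown attributes so facts still join.
inferred = pd.DataFrame(
    {"customer_id": sorted(missing), "name": "UNKNOWN", "is_inferred": True}
)
dim_customer = pd.concat([dim_customer, inferred], ignore_index=True)

# 3. When the real dimension row finally lands, overwrite the placeholder
#    (a type-1 update) and clear the inferred flag.
late = {"customer_id": 3, "name": "Edsger"}
mask = dim_customer["customer_id"] == late["customer_id"]
dim_customer.loc[mask, ["name", "is_inferred"]] = [late["name"], False]
```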
r/dataengineering - Which lakehouse table format do you expect your organization will be using by the end of 2023?
🫡🐳 pedramdb🫡🐳 on Twitter
“Does anyone here (not vendors) work with CDPs, either traditional or unbundled as part of a data team? What’s the experience been like? How much input did you have in the process?”
Data Systems Tend Towards Production
Data teams have substantially larger influence than a decade ago. The surface area of what can go wrong has grown just as fast.
Airbyte Monitoring with dbt and Metabase - Part I | Airbyte
How to implement an Airbyte Monitoring Dashboard with dbt and Metabase on a locally deployed instance to get an operational view and high-level overview.
Building a Data Engineering Project in 20 Minutes
You'll learn web-scraping real-estate listings, uploading them to S3, processing with Spark and Delta Lake, adding data science with Jupyter, ingesting into Druid, visualising with Superset, and orchestrating everything with Dagster.
r/dataengineering - Has anyone built a data warehouse primarily using Databricks?
The Contract-Powered Data Platform | Buz
The contract-powered data platform is a step towards improving data quality, reducing organizational friction, and automating the toil data teams face. Here's what it looks like and how it works.
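As a concrete (if simplified) illustration of what "contract-powered" can mean at the edge: validate every event against an agreed schema before it lands anywhere downstream. This sketch uses jsonschema; the event shape and field names are invented, not from Buz:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# A hypothetical v1 contract for an order_created event.
ORDER_CREATED_V1 = {
    "type": "object",
    "required": ["order_id", "user_id", "amount_cents"],
    "properties": {
        "order_id": {"type": "string"},
        "user_id": {"type": "string"},
        "amount_cents": {"type": "integer", "minimum": 0},
    },
    "additionalProperties": False,
}


def enforce_contract(event: dict) -> bool:
    """Gate events at the boundary: reject anything violating the contract."""
    try:
        validate(event, ORDER_CREATED_V1)
        return True
    except ValidationError:
        return False


assert enforce_contract({"order_id": "o1", "user_id": "u1", "amount_cents": 999})
assert not enforce_contract({"order_id": "o1", "amount_cents": -5})  # missing user_id
```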
The Breakdown: Databricks, Snowflake, and Open Source Positioning in the Data World
This post will explore how Databricks and Snowflake are positioning against one another with a particular focus on using open source as a strategic tool.
Yet another post on Data Contracts - Part 1
Starting with some history
The missing piece of the modern data stack
Our cool new house needs one more plank in its foundation.
Kicking the tires on dbt Metrics
Daily Active YAML is up and to the right
The modern data experience (w/ Benn Stancil)
For most people, the modern data stack isn’t a collection of architectural diagrams; it’s an experience.
Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department | Stitch Fix Technology – Multithreaded
“What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked.
There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL.
Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
most technologies have evolved to a point where they can trivially scale to your needs.
Viewpoint | dbt Docs
In 2015-2016, a team of folks at RJMetrics had the opportunity to observe, and participate in, a significant evolution of the analytics ecosystem. The seeds of dbt were conceived in this environment, and the viewpoint below was written to reflect what we had learned and how we believed the world should be different. dbt is our attempt to address the workflow challenges we observed, and as such, this viewpoint is the most foundational statement of the dbt project's goals.
Upgrading Data Warehouse Infrastructure at Airbnb
This post describes Airbnb’s experience upgrading its data warehouse infrastructure to Spark and Iceberg.
We the purple people
The data world needs more purple people — generalists who can navigate both the business context and the modern data stack. Let's put aside skillset dichotomies, and learn to feel comfortable in the space between.
The end of Big Data
Databricks, Snowflake, and the end of an overhyped era.
Take real-time products, for example. Most businesses have little use for true real-time experiences. But, all else being equal, real-time data is better than latent data. We all have dashboards that update a little too slowly, or marketing emails we wish we could send a little sooner. While these annoyances don’t justify the effort currently required to build real-time pipelines, they do cause small headaches. But if someone came along and offered me a streaming Fivetran, or a reactive version of dbt, I’d take it. If the cost of a real-time architecture was low enough, regardless of the shoehorned use-cases, there’d be no reason to turn it down. And just as we came to rely on Snowflake after we chose it as a better Postgres, I’m certain we’d come to rely on streaming pipelines if they replaced our current batch ones. We’d start doing more real-time marketing outreach, or build customer success workflows around live customer behavior.

Over the next five years, I’d guess that real-time data tools follow this exact path: They’ll finally go mainstream, not because we all discover we need them, but because there will be no reason not to have them. And once we do, we’ll find ways to push them to their limits, just as we did with fast internet connections and powerful browsers.
Ep 30: The Personal Data Warehouse (w/ Jordan Tigani of MotherDuck)
Flipping the vision for "data apps" on its head: what if, instead of having data make round trips to a cloud data warehouse, we just bring the user's data to their machine?
Microsoft, Google, and the original purple people
And, of course, Pokémon.
A Thought of Stream...
...and a three-horse race