Data Engineering

26 bookmarks
Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department | Stitch Fix Technology – Multithreaded
“What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked when conducting i...
There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume. Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL.
·multithreaded.stitchfix.com·
Viewpoint | dbt Docs
In 2015-2016, a team of folks at RJMetrics had the opportunity to observe, and participate in, a significant evolution of the analytics ecosystem. The seeds of dbt were conceived in this environment, and the viewpoint below was written to reflect what we had learned and how we believed the world should be different. dbt is our attempt to address the workflow challenges we observed, and as such, this viewpoint is the most foundational statement of the dbt project's goals.
·docs.getdbt.com·
We the purple people
The data world needs more purple people — generalists who can navigate both the business context and the modern data stack. Let's put aside skillset dichotomies, and learn to feel comfortable in the space between.
·getdbt.com·
The end of Big Data
Databricks, Snowflake, and the end of an overhyped era.
Take real-time products, for example. Most businesses have little use for true real-time experiences. But, all else being equal, real-time data is better than latent data. We all have dashboards that update a little too slowly, or marketing emails we wish we could send a little sooner. While these annoyances don’t justify the effort currently required to build real-time pipelines, they do cause small headaches. But if someone came along and offered me a streaming Fivetran, or a reactive version of dbt, I’d take it. If the cost of a real-time architecture was low enough, regardless of the shoehorned use-cases, there’d be no reason to turn it down. And just as we came to rely on Snowflake after we chose it as a better Postgres, I’m certain we’d come to rely on streaming pipelines if they replaced our current batch ones. We’d start doing more real-time marketing outreach, or build customer success workflows around live customer behavior. Over the next five years, I’d guess that real-time data tools follow this exact path: They’ll finally go mainstream, not because we all discover we need them, but because there will be no reason not to have them. And once we do, we’ll find ways to push them to their limits, just as we did with fast internet connections and powerful browsers.
·benn.substack.com·
HTAP Databases
Do we actually need so many different databases? Or can we shove them all into a single cloud infrastructure and behind the same SQL API?
·roundup.getdbt.com·
Iceberg Case Studies
This talk will introduce the use cases for Apache Iceberg tables that we didn’t expect when we created Iceberg and will explain the details so you can use Iceberg for similar cases.
·youtube.com·
Why You Shouldn’t Care About Iceberg | Tabular
Slides: https://www.datacouncil.ai/talks/why-you-shouldnt-care-about-iceberg
About the talk: Ryan Blue, co-creator of the Apache Iceberg project, will try to convince you not to care about Iceberg: if you’re thinking about your table format, then it isn’t doing a good enough job. This session will show how Iceberg solves real-world problems that used to take hours or days of time from data engineers and analysts:
Safe schema changes: no more zombie data columns
Layout evolution: update table partitioning without rewriting any queries
Hidden partitioning: safe and fast queries without being a DBA
Future work: current frustrations and how we’re making them disappear
About the speaker: Ryan is the co-creator of Apache Iceberg and spent the last decade working on big data formats and infrastructure at Netflix, Cloudera, and now Tabular. He is an ASF member and a committer in the Apache Parquet, Avro, and Spark communities.
About Data Council: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.
·youtube.com·
Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape
It’s been a hot, hot year in the world of data, machine learning and AI. Just when you thought it couldn’t grow any more explosively, the data/AI landscape just did: rapid pace of company creation, exciting new product and project launch
·mattturck.com·
The Baseline Data Stack - Going Beyond The Modern Data Stack - Part 1
Billions of dollars have been invested in companies that fall under the concept of the “Modern Data Stack.” Fivetran has nearly one billion dollars in funding, dbt has 150 million (and is looking to raise more), Starburst has 100 million (not considered part of the MDS
·seattledataguy.substack.com·
Rethinking the Modern Data Stack
A version of this post was originally published as a byline for DEVOPSdigest on November 8,…
·starburst.io·