Data Pipeline Design Patterns - #2. Coding patterns in Python
As data engineers, you might have heard the terms functional data pipeline, factory pattern, singleton pattern, etc. One can quickly look up the implementation, but it can be tricky to understand what they are precisely and when to (& when not to) use them. Blindly following a pattern can help in some cases, but not knowing the caveats of a design will lead to hard-to-maintain and brittle code! While writing clean and easy-to-read code takes years of experience, you can accelerate that by understanding the nuances and reasoning behind each pattern. Imagine being able to design an implementation that provides the best extensibility and maintainability! Your colleagues (& future self) will be extremely grateful, your feature delivery speed will increase, and your boss will highly value your opinion. In this post, we will go over the specific code design patterns used for data pipelines, when and why to use them, and when not to use them, and we will also go over a few python specific techniques to help you write better pipelines. By the end of this post, you will be able to identify patterns in your data pipelines and apply the appropriate code design patterns. You will also be able to take advantage of pythonic features to write bug-free, maintainable code that is a joy to work on!
Azure Data Studio for Data Engineers – Robert J Blackburn
This is the first of a series of posts about Azure Data Studio. If you have not used Data Studio or only tried it when it first came out a few years ago, time to take another look. Especially if yo…
Wondering how to execute a spark job on an AWS EMR cluster, based on a file upload event on S3? Then this post if for you. In this post we go over how to trigger spark jobs on an AWS EMR cluster, using AWS Lambda. The lambda function will execute in response to an S3 upload event. We will go over this event driven pattern with code snippets and set up a fully functioning pipeline.
Course Information - Big Data Platforms, Autumn 2021
Helsingin yliopiston kaikille avoin ja ilmainen ohjelmoinnin perusteet opettava verkkokurssi. Kurssilla perehdytään nykyaikaisen ohjelmoinnin perusideoihin sekä ohjelmoinnissa käytettävien työvälineiden lisäksi algoritmien laatimiseen. Kurssille osallistuminen ei vaadi ennakkotietoja ohjelmoinnista.
There are lots of data-storage options available today. You have to choose between managed or unmanaged, relational or NoSQL, write- or read-optimized, proprietary or open-source — and it doesn't end there. Once you begin your search, you will end up in the universe that is database marketing. All of the vendors
Data Architecture Revisited: The Platform Hypothesis
Software systems are increasingly based on data, rather than code. A new class of tools and technologies have emerged to process data for both analytics and ML.
This post goes over what the term data warehousing means. This post provides a simple e-commerce relational data model and how it has to be changed to fit analytical queries. It also covers the reasoning behind wanting to use a data warehouse and how to choose an appropriate database for your project.
What is the difference between a data lake and a data warehouse?
Confused by all the "data lake vs data warehouse" articles? Struggling to understand what the differences between data lakes and warehouses are? Then this post is for you. We go over what data lakes and warehouses are. We also cover the key points to consider when choosing your lake and warehouse tools.
Building an End-To-End Analytic solution in Power BI: Part 3 – Level Up with Data Modeling! | LinkedIn
When I talk to people who are not deep into the Power BI world, I often get the impression that they think of Power BI as a visualization tool exclusively. While that is true to a certain extent, it seems to me that they are not seeing the bigger picture – or maybe it’s better to say – they see just