02 - Design and Develop Data Processing

Spark Streaming - Different Output modes explained - Spark by {Examples}
This article describes the usage of and differences between the complete, append, and update output modes in Apache Spark Structured Streaming. outputMode determines what data is written to the data sink (console, Kafka, etc.) when new data becomes available in the streaming input (Kafka, socket, etc.).
Use complete output mode, outputMode("complete"), when you want to aggregate the data and write the entire result set to the sink every time.
Update output mode, outputMode("update"), is similar to complete with one exception: it writes only the aggregated results that have changed since the last trigger to the data sink when new data arrives.
Use append output mode, outputMode("append"), when you want to write only new rows to the output sink.
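As a rough sketch of what these modes look like in code, here is a minimal PySpark Structured Streaming word count; the socket host/port, column names, and sink are illustrative assumptions, not part of the article.

```python
# Minimal PySpark Structured Streaming sketch illustrating outputMode.
# The socket host/port and column names here are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("output-modes-demo").getOrCreate()

# Unbounded input: lines of text arriving on a socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Running word count - an aggregation, so "complete" and "update" apply.
word_counts = (lines
               .select(explode(split(lines.value, " ")).alias("word"))
               .groupBy("word")
               .count())

# complete: rewrite the full aggregated result to the sink on every trigger.
# update:   write only the rows whose aggregate changed since the last trigger.
# append:   only valid for queries without aggregation (or with watermarked
#           aggregations), emitting just the new rows.
query = (word_counts.writeStream
         .outputMode("complete")   # or "update"
         .format("console")
         .start())

query.awaitTermination()
```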
·sparkbyexamples.com·
Configure clusters - Azure Databricks
Learn how to configure Azure Databricks clusters, including cluster mode, runtime, instance types, size, pools, autoscaling preferences, termination schedule, Apache Spark options, custom tags, log delivery, and more.
·docs.microsoft.com·
Continuous integration and delivery - Azure Data Factory
Learn how to use continuous integration and delivery to move Azure Data Factory pipelines from one environment (development, test, production) to another.
·docs.microsoft.com·
Common query patterns in Azure Stream Analytics
This article describes several common query patterns and designs that are useful in Azure Stream Analytics jobs.
·docs.microsoft.com·
Replicated Tables now generally available in Azure SQL Data Warehouse
We are excited to announce that replicated tables are generally available in Azure SQL Data Warehouse. A key to performance for large-scale data warehouses is how data is distributed across the system…
·azure.microsoft.com·
Incrementally copy data using Change Tracking using Azure portal - Azure Data Factory
In this tutorial, you create an Azure Data Factory with a pipeline that loads delta data from a source database in Azure SQL Database to Azure Blob storage, based on change tracking information.
In some cases, the data changed within a period in your source data store can easily be sliced (for example, by LastModifyTime or CreationTime). In other cases, there is no explicit way to identify the delta data since the last time you processed it. The Change Tracking technology supported by data stores such as Azure SQL Database and SQL Server can be used to identify the delta data.
Change Tracking assigns each change a SYS_CHANGE_VERSION value, which the pipeline uses to determine which rows changed since the last run.
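The tutorial drives this pattern from a Data Factory pipeline; the sketch below only illustrates the underlying Change Tracking query from Python with pyodbc. The connection string, table name, columns, and the stored "last synced" version are assumptions for illustration.

```python
# Sketch of the Change Tracking pattern the tutorial relies on, run from Python
# with pyodbc rather than from a Data Factory pipeline. Connection string,
# table name, and the stored "last synced" version are illustrative.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=...;Encrypt=yes;"
)
cursor = conn.cursor()

# Version recorded after the previous incremental load (illustrative).
last_synced_version = 42

# Rows changed since that version, with the change type and change version.
cursor.execute(
    """
    SELECT c.SYS_CHANGE_VERSION, c.SYS_CHANGE_OPERATION, c.PersonID
    FROM CHANGETABLE(CHANGES dbo.data_source_table, ?) AS c
    """,
    last_synced_version,
)
for row in cursor.fetchall():
    print(row)

# Record the current version so the next run picks up where this one left off.
current_version = cursor.execute(
    "SELECT CHANGE_TRACKING_CURRENT_VERSION()"
).fetchval()
```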
·docs.microsoft.com·
Create tumbling window trigger dependencies - Azure Data Factory & Azure Synapse
Learn how to create dependency on a tumbling window trigger in Azure Data Factory and Synapse Analytics.
Provide a value in timespan format; both negative and positive offsets are allowed. This property is mandatory if the trigger depends on itself; in all other cases it is optional. A self-dependency should always use a negative offset. If no value is specified, the window is the same as the trigger itself.
·docs.microsoft.com·
Surrogate key transformation in mapping data flow - Azure Data Factory & Azure Synapse
Learn how to use the mapping data flow Surrogate Key Transformation to generate sequential key values in Azure Data Factory and Synapse Analytics.
Use the surrogate key transformation to add an incrementing key value to each row of data. This is useful when designing dimension tables in a star schema analytical data model. In a star schema, each member in your dimension tables requires a unique key that is a non-business key.
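The surrogate key transformation itself is configured in the ADF/Synapse data flow UI; as a rough sketch of the same idea in PySpark (not the Data Factory feature), you can assign each dimension row an incrementing, non-business key. The table, columns, and starting value below are illustrative.

```python
# Sketch: assign an incrementing surrogate key to dimension rows in PySpark.
# Table and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("surrogate-key-sketch").getOrCreate()

dim_customers = spark.createDataFrame(
    [("C001", "Alice"), ("C002", "Bob"), ("C003", "Carol")],
    ["customer_code", "customer_name"],   # business key + attributes
)

# Start numbering from 1 (or from max(existing key) + 1 for incremental loads).
start_value = 1
w = Window.orderBy("customer_code")
with_key = dim_customers.withColumn(
    "customer_sk", row_number().over(w) + lit(start_value - 1)
)

with_key.show()
```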
·docs.microsoft.com·
Sink transformation in mapping data flow - Azure Data Factory & Azure Synapse
Learn how to configure a sink transformation in mapping data flow.
A cache sink writes data from a data flow into the Spark cache instead of a data store. In mapping data flows, you can reference this data within the same flow many times using a cache lookup. This is useful when you want to reference data as part of an expression but don't want to explicitly join to it. Common examples where a cache sink helps are looking up a maximum value in a data store and matching error codes to an error-message database.
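The cache sink and cache lookup are mapping data flow features; as a rough PySpark analogue of the "look up a max value" example (an assumption for illustration, not the data flow syntax), you can compute a small value once and reuse it in expressions without a join:

```python
# Rough PySpark analogue of the cache sink/lookup idea: compute a small lookup
# value once and reference it in expressions without joining.
# DataFrame contents and names here are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("cache-lookup-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, 120.0), (2, 340.0), (3, 90.0)], ["order_id", "amount"]
)

# "Cache sink" analogue: compute the max once and pull it to the driver.
max_amount = orders.agg({"amount": "max"}).collect()[0][0]

# "Cache lookup" analogue: reference the cached value in an expression,
# no join required.
flagged = orders.withColumn(
    "is_largest_order", col("amount") == lit(max_amount)
)
flagged.show()
```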
·docs.microsoft.com·
Assert data transformation in mapping data flow - Azure Data Factory
Set assertions for mapping data flows
The assert transformation enables you to build custom rules inside your mapping data flows for data quality and data validation. You can build rules that determine whether values fall within an expected domain and rules that check for row uniqueness. The assert transformation helps determine whether each row in your data meets a set of criteria, and it lets you set custom error messages when data validation rules are not met.
·docs.microsoft.com·
Alter row transformation in mapping data flow - Azure Data Factory & Azure Synapse
How to update a database target using the alter row transformation in mapping data flows in Azure Data Factory and Azure Synapse Analytics pipelines.
Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows. You can add one or more conditions as expressions. These conditions should be specified in order of priority, as each row is marked with the policy corresponding to the first matching expression. Each of those conditions can result in a row (or rows) being inserted, updated, deleted, or upserted. Alter Row can produce both DDL and DML actions against your database.
·docs.microsoft.com·
Real-time data visualization of data from Azure IoT Hub – Power BI
Use Power BI to visualize temperature and humidity data that is collected from the sensor and sent to your Azure IoT hub.
Create a consumer group on your IoT hub. Create and configure an Azure Stream Analytics job to read temperature telemetry from your consumer group and send it to Power BI. Create a report of the temperature data in Power BI and share it to the web.
·docs.microsoft.com·
Incrementally copy data using Change Tracking using PowerShell - Azure Data Factory
In this tutorial, you create an Azure Data Factory pipeline that copies delta data incrementally from multiple tables in a SQL Server database to Azure SQL Database.
You perform the following steps in this tutorial: prepare the source data store; create a data factory; create linked services; create source, sink, and change tracking datasets; create, run, and monitor the full copy pipeline; add or update data in the source table; and create, run, and monitor the incremental copy pipeline.
·docs.microsoft.com·
Azure Stream Analytics on IoT Edge
Create edge jobs in Azure Stream Analytics and deploy them to devices running Azure IoT Edge.
An edge job is composed of two parts: a cloud part responsible for the job definition, where users define inputs, outputs, the query, and other settings (such as out-of-order event handling); and a module running on your IoT devices, which contains the Stream Analytics engine and receives the job definition from the cloud.
Supported stream input types are Edge Hub, Event Hub, and IoT Hub. Supported stream output types are Edge Hub, SQL Database, Event Hub, and Blob Storage/ADLS Gen2.
For both inputs and outputs, CSV and JSON formats are supported.
Manufacturing safety systems must respond to operational data with ultra-low latency. With Stream Analytics on IoT Edge, you can analyze sensor data in near real time and issue commands to stop a machine or trigger alerts when you detect anomalies.
Mission critical systems, such as remote mining equipment, connected vessels, or offshore drilling, need to analyze and react to data even when cloud connectivity is intermittent.
·docs.microsoft.com·
What is Azure IoT Edge
Overview of the Azure IoT Edge service
Azure IoT Edge moves cloud analytics and custom business logic to devices so that your organization can focus on business insights instead of data management.
Azure IoT Edge allows you to deploy complex event processing, machine learning, image recognition, and other high-value AI without writing it in-house.
The IoT Edge runtime installs and updates workloads on the device, maintains Azure IoT Edge security standards on the device, ensures that IoT Edge modules are always running, reports module health to the cloud for remote monitoring, and manages communication between downstream leaf devices and an IoT Edge device, between modules on an IoT Edge device, and between an IoT Edge device and the cloud.
·docs.microsoft.com·
Tutorial - Stream Analytics at the edge using Azure IoT Edge
In this tutorial, you deploy Azure Stream Analytics as a module to an IoT Edge device
Create an Azure Stream Analytics job to process data on the edge. Connect the new Azure Stream Analytics job with other IoT Edge modules. Deploy the Azure Stream Analytics job to an IoT Edge device from the Azure portal.
When you create an Azure Stream Analytics job to run on an IoT Edge device, it needs to be stored in a way that can be called from the device. You can use an existing Azure Storage account, or create a new one now.
·docs.microsoft.com·
Secrets | Databricks on AWS
Learn how to create and manage secrets, which are key-value pairs that store secret material.
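A minimal sketch of consuming a secret from a Databricks notebook, where dbutils and spark are predefined; the scope and key names ("jdbc", "password") and the JDBC connection details are illustrative assumptions, and the scope itself would be created beforehand with the Databricks CLI or REST API.

```python
# Assumes a Databricks notebook, where dbutils and spark already exist.
# Scope/key names and connection details are illustrative.

# List the secrets available in a scope (values are never shown).
for s in dbutils.secrets.list("jdbc"):
    print(s.key)

# Read a secret value. If you print it, Databricks redacts it as [REDACTED].
password = dbutils.secrets.get(scope="jdbc", key="password")

# Typical use: pass the secret into a connection instead of hard-coding it.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
      .option("user", "myuser")
      .option("password", password)
      .option("dbtable", "dbo.mytable")
      .load())
```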
·docs.databricks.com·
Secret scopes - Azure Databricks
Learn how to create and manage both types of secret scope for Azure Databricks, Azure Key Vault-backed and Databricks-backed, and use best practices for secret scopes.
·docs.microsoft.com·
Secret scopes | Databricks on AWS
Learn how to create and manage both types of secret scope for Databricks, Azure Key Vault-backed and Databricks-backed, and use best practices for secret scopes.
·docs.databricks.com·
Append Variable Activity - Azure Data Factory & Azure Synapse
Learn how to set the Append Variable activity to add a value to an existing array variable defined in a Data Factory or Synapse Analytics pipeline.
Use the Append Variable activity to add a value to an existing array variable defined in a Data Factory or Synapse Analytics pipeline.
·docs.microsoft.com·
Create tumbling window triggers - Azure Data Factory & Azure Synapse
Learn how to create a trigger in Azure Data Factory or Azure Synapse Analytics that runs a pipeline on a tumbling window.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time, while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time intervals. A tumbling window trigger has a one-to-one relationship with a pipeline and can only reference a singular pipeline.
·docs.microsoft.com·
Create event-based triggers - Azure Data Factory & Azure Synapse
Learn how to create a trigger in an Azure Data Factory or Azure Synapse Analytics that runs a pipeline in response to an event.
Data integration scenarios often require customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account. Data Factory and Synapse pipelines natively integrate with Azure Event Grid, which lets you trigger pipelines on such events.
·docs.microsoft.com·
Understand inputs for Azure Stream Analytics
This article describes the concept of inputs in an Azure Stream Analytics job, comparing streaming input to reference data input.
Azure Blob storage, Azure Data Lake Storage Gen2, and Azure SQL Database are currently supported as input sources for reference data.
Event Hubs, IoT Hub, Azure Data Lake Storage Gen2 and Blob storage are supported as data stream input sources.
A data stream is an unbounded sequence of events over time. Stream Analytics jobs must include at least one data stream input.
Reference data is either completely static or changes slowly. It is typically used to perform correlation and lookups.
·docs.microsoft.com·