01 - Design and Implement Storage

UNION (Azure Stream Analytics) - Stream Analytics Query
Combines the results of two or more queries into a single result set that includes all the rows that belong to all queries in the union.
The following are basic rules for combining the result sets of two queries by using UNION: the number and the order of the columns must be the same in all queries; the data types must be compatible; and the streams must have the same partition key and partition count.
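As a sketch only, a Stream Analytics UNION might look like the following, assuming two hypothetical toll-booth inputs (tollentry1, tollentry2) with identical column lists:

```sql
-- Both SELECT lists must have the same number and order of columns,
-- with compatible types (input names here are hypothetical).
SELECT TollId, EntryTime, LicensePlate
FROM tollentry1 TIMESTAMP BY EntryTime
UNION
SELECT TollId, EntryTime, LicensePlate
FROM tollentry2 TIMESTAMP BY EntryTime
```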
·docs.microsoft.com·
Maximize throughput with repartitioning in Azure Stream Analytics |...
Customers love Azure Stream Analytics for its ease of analyzing streams of data in motion, with the ability to set up a running pipeline within five minutes. Optimizing throughput has always been...
When joining two streams of data explicitly repartitioned, these streams must have the same partition key and partition count. The outcome is a stream that has the same partition scheme.
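A minimal sketch of this pattern, assuming hypothetical inputs input1 and input2 that are explicitly repartitioned on the same key and count before the join:

```sql
-- Repartition both streams on DeviceId into 10 partitions, then join;
-- the join requires a matching partition key and partition count.
WITH step1 AS (
    SELECT * FROM input1 PARTITION BY DeviceId INTO 10
),
step2 AS (
    SELECT * FROM input2 PARTITION BY DeviceId INTO 10
)
SELECT step1.DeviceId, step1.Reading AS Reading1, step2.Reading AS Reading2
INTO output
FROM step1
JOIN step2
  ON step1.DeviceId = step2.DeviceId
 AND DATEDIFF(minute, step1, step2) BETWEEN 0 AND 2
```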
·azure.microsoft.com·
Materialized View pattern - Azure Architecture Center
Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.
Creating this materialized view requires complex queries. However, by exposing the query result as a materialized view, users can easily obtain the results and use them directly or incorporate them in another query. The view is likely to be used in a reporting system or dashboard, and can be updated on a scheduled basis such as weekly.
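In a Synapse dedicated SQL pool, for example, such a view could be defined roughly like this (table and column names are hypothetical):

```sql
-- The complex aggregation is computed once and kept up to date by the
-- engine; consumers simply select from the view.
CREATE MATERIALIZED VIEW dbo.SalesByRegion
WITH (DISTRIBUTION = HASH(Region))
AS
SELECT Region,
       COUNT_BIG(*)           AS OrderCount,
       SUM(ISNULL(Amount, 0)) AS TotalAmount
FROM dbo.FactSales
GROUP BY Region;
```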
·docs.microsoft.com·
Designing tables - Azure Synapse Analytics
Introduction to designing tables using dedicated SQL pool.
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don't require data movement. Replication requires extra storage, though, and isn't practical for large tables.
A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Staging: use round-robin for the staging table.
Dimension: use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Fact: use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
By default, dedicated SQL pool stores a table as a clustered columnstore index. This form of data storage achieves high data compression and query performance on large tables.
A heap table can be especially useful for loading transient data, such as a staging table which is transformed into a final table.
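A hedged sketch of these choices in dedicated SQL pool DDL (all table and column names are hypothetical):

```sql
-- Staging: round-robin heap for fast loading.
CREATE TABLE dbo.StageSales
(
    SaleKey    INT NOT NULL,
    ProductKey INT NOT NULL,
    Amount     DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension: replicated, so joins need no data movement.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Large fact: hash-distributed on a common join column, columnstore.
CREATE TABLE dbo.FactSales
(
    SaleKey    INT NOT NULL,
    ProductKey INT NOT NULL,
    Amount     DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);
```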
An external table points to data located in Azure Storage blob or Azure Data Lake Store. When used with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into dedicated SQL pool.
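For instance, a CTAS import from a hypothetical external table might look like:

```sql
-- Selecting from an external table with CREATE TABLE AS SELECT imports
-- the data into the dedicated SQL pool (ext.SalesExternal is hypothetical).
CREATE TABLE dbo.SalesImported
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM ext.SalesExternal;
```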
A temporary table only exists for the duration of the session. You can use a temporary table to prevent other users from seeing temporary results and also to reduce the need for cleanup.
·docs.microsoft.com·
6 Practical Data Protection Features in SQL Server (Pros & Cons)
Here's how we use the data protection features within SQL Server to protect confidential data and make it available to only those authorized to see it.
Adding a dynamic data mask to a column in SQL Server masks out part of the information in that column. This is useful if an employee needs to see only part of an ID number, or part of a phone number, for verification purposes.
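A minimal sketch, with a hypothetical table and a partial() mask that leaves only the last four digits visible:

```sql
-- Non-privileged users see 'XXX-XXX-1234' instead of the full number.
ALTER TABLE dbo.Employee
ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(0, "XXX-XXX-", 4)');

-- Principals granted UNMASK still see the unmasked data.
GRANT UNMASK TO VerificationRole;
```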
Transparent data encryption is a feature that encrypts any data that is being saved to the hard drive/disk. So, if any data is updated in a table, that data is transparently encrypted immediately upon save. When the data is read back, SQL Server decrypts it for you.
This satisfies the regulatory requirement that any data “at rest” be encrypted.   All data within the database is encrypted.
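Enabling TDE in SQL Server takes roughly these steps (the database and certificate names are hypothetical; in Azure SQL Database, TDE is on by default):

```sql
USE master;
-- Protect the certificate with the database master key.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE protector certificate';

USE SalesDb;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TdeCert;

ALTER DATABASE SalesDb SET ENCRYPTION ON;
```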
Encrypted Columns in SQL Server are columns within a table that have been encrypted to hide whatever sensitive data the column contains. This is a good way to both hide sensitive data like a social security number or a date of birth and have the data encrypted “at rest”. To read the data, special permissions are needed to access the necessary keys.
The data is encrypted so this satisfies any sort of regulatory requirement of “encrypting data at rest”.
Always Encrypted is useful if the people working in the database are not always authorized to see the data inside the database (dates of birth, SSNs, salaries).
The Always Encrypted feature even prevents those who manage the database from accessing or decrypting sensitive data, while still allowing end users to read and interact with the same data. In this case, a web or desktop application is set up to encrypt or decrypt the data without the SQL Server being able to read the data.
·corebts.com·
Introducing data virtualization with PolyBase - SQL Server
PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources such as Hadoop and Azure blob storage.
PolyBase enables your SQL Server instance to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB, Hadoop clusters, and Cosmos DB without separately installing client connection software. You can also use the generic ODBC connector to connect to additional providers using third-party ODBC drivers. PolyBase allows T-SQL queries to join data from external sources to relational tables in an instance of SQL Server.
A key use case for data virtualization with the PolyBase feature is to allow the data to stay in its original location and format. You can virtualize the external data through the SQL Server instance, so that it can be queried in place like any other table in SQL Server. This process minimizes the need for ETL processes for data movement (see the sketch after the capability list below). With PolyBase you can:
Query data stored in Hadoop from a SQL Server instance or PDW.
Query data stored in Azure blob storage.
Import data from Hadoop, Azure blob storage, or Azure Data Lake Store.
Export data to Hadoop, Azure blob storage, or Azure Data Lake Store.
Integrate with BI tools.
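A hedged sketch of virtualizing a CSV file in Azure Blob Storage (every object name, URL, and path below is hypothetical, and the exact options vary by SQL Server version):

```sql
-- One-time setup: credential, data source, and file format.
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'storage_user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://container@account.blob.core.windows.net',
      CREDENTIAL = BlobCredential);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The external table queries the files in place; no data is moved.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (DATA_SOURCE = AzureBlob, LOCATION = '/sales/', FILE_FORMAT = CsvFormat);

-- Join external data with a local relational table.
SELECT c.CustomerName, SUM(s.Amount) AS Total
FROM ext.Sales AS s
JOIN dbo.Customer AS c ON c.CustomerId = s.CustomerId
GROUP BY c.CustomerName;
```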
·docs.microsoft.com·
Parent-Child Dimensions
Learn about parent-child hierarchies, which are hierarchies in a standard dimension that contain a parent attribute.
In this dimension table, the ParentOrganizationKey column has a foreign key relationship with the OrganizationKey primary key column. In other words, each record in this table can be related through a parent-child relationship with another record in the table. This kind of self-join is generally used to represent organization entity data, such as the management structure of employees in a department.
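A minimal sketch of such a self-referencing table, plus a recursive CTE to walk the hierarchy (the names follow the DimOrganization example; the query shape is an assumption):

```sql
CREATE TABLE dbo.DimOrganization
(
    OrganizationKey       INT NOT NULL PRIMARY KEY,
    ParentOrganizationKey INT NULL
        REFERENCES dbo.DimOrganization (OrganizationKey),
    OrganizationName      NVARCHAR(100) NOT NULL
);

-- Walk the parent-child hierarchy from the root downward.
WITH OrgTree AS
(
    SELECT OrganizationKey, ParentOrganizationKey, OrganizationName, 0 AS Depth
    FROM dbo.DimOrganization
    WHERE ParentOrganizationKey IS NULL
    UNION ALL
    SELECT c.OrganizationKey, c.ParentOrganizationKey, c.OrganizationName, p.Depth + 1
    FROM dbo.DimOrganization AS c
    JOIN OrgTree AS p ON c.ParentOrganizationKey = p.OrganizationKey
)
SELECT * FROM OrgTree;
```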
·docs.microsoft.com·
Best practices for using Azure Data Lake Storage Gen2
Learn how to optimize performance, reduce costs, and secure your Data Lake Storage Gen2 enabled Azure Storage account.
The network connectivity between your source data and your storage account can sometimes be a bottleneck. When your source data is on-premises, consider using a dedicated link with Azure ExpressRoute. If your source data is in Azure, the performance is best when the data is in the same Azure region as your Data Lake Storage Gen2 enabled account.
·docs.microsoft.com·
Difference between Clustered and Non-clustered index - GeeksforGeeks
A clustered index is a type of index in which the table records are physically reordered to match the index. A non-clustered index is a type of index in which the logical order of the index does not match the physical order of the rows on disk. Lookups through a clustered index are generally faster because its leaf level is the data rows themselves, while a non-clustered index adds an extra lookup from the index leaf to the underlying row. A clustered index needs no storage beyond the table itself, whereas a non-clustered index is a separate structure that requires additional storage.
·geeksforgeeks.org·
Clustered and nonclustered indexes described - SQL Server
An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
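A minimal sketch on a hypothetical table: one clustered index (the row order itself) and one nonclustered index (a separate structure with row locators):

```sql
CREATE TABLE dbo.Orders
(
    OrderId    INT NOT NULL,
    CustomerId INT NOT NULL,
    OrderDate  DATE NOT NULL
);

-- Only one clustered index per table: it defines the physical row order.
CREATE CLUSTERED INDEX CIX_Orders_OrderId ON dbo.Orders (OrderId);

-- A nonclustered index stores its keys separately from the data rows.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);
```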
·docs.microsoft.com·
Columnstore indexes: Overview - SQL Server
Columnstore indexes are the standard for storing and querying large data warehousing fact tables.
A nonclustered columnstore index and a clustered columnstore index function the same. The difference is that a nonclustered index is a secondary index that's created on a rowstore table, but a clustered columnstore index is the primary storage for the entire table.
A clustered columnstore index is the physical storage for the entire table.
Use a clustered columnstore index to store fact tables and large dimension tables for data warehousing workloads. This method improves query performance and data compression by up to 10 times.
Use a nonclustered columnstore index to perform analysis in real time on an OLTP workload.
Rowstore indexes perform best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values. Use rowstore indexes with transactional workloads because they tend to require mostly table seeks instead of table scans.
Columnstore indexes give high performance gains for analytic queries that scan large amounts of data, especially on large tables.
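Hedged sketches of both variants (table names are hypothetical):

```sql
-- Clustered columnstore: becomes the primary storage for the fact table.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Nonclustered columnstore: a secondary index on a rowstore OLTP table,
-- enabling real-time analytics alongside transactional work.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
ON dbo.Orders (OrderDate, CustomerId);
```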
·docs.microsoft.com·
Auto-failover groups overview & best practices - Azure SQL Database
The auto-failover groups feature lets you manage geo-replication and automatic or coordinated failover of a group of databases on a logical server (for both single and pooled databases in Azure SQL Database), or of all user databases in a managed instance, to another Azure region.
By default, a failover group is configured with an automatic failover policy. The system triggers a geo-failover after the failure is detected and the grace period has expired.
Planned failover performs full data synchronization between primary and secondary databases before the secondary switches to the primary role.
You can initiate a geo-failover manually at any time regardless of the automatic failover configuration. During an outage that impacts the primary, if automatic failover policy is not configured, a manual failover is required to promote the secondary to the primary role.
Unplanned or forced failover immediately switches the secondary to the primary role without waiting for recent changes to propagate from the primary. This operation may result in data loss.
By default, the failover of the read-only listener is disabled. This ensures that the performance of the primary is not impacted when the secondary is offline. However, it also means that read-only sessions will not be able to connect until the secondary is recovered.
Because the data is replicated to the secondary database using asynchronous replication, an automatic geo-failover may result in data loss. You can customize the automatic failover policy to reflect your application’s tolerance to data loss. By configuring GracePeriodWithDataLossHours, you can control how long the system waits before initiating a forced failover, which may result in data loss.
·docs.microsoft.com·
Active geo-replication - Azure SQL Database
Use active geo-replication to create readable secondary databases of individual databases in Azure SQL Database in the same or different regions.
Active geo-replication is a feature that lets you create a continuously synchronized readable secondary database for a primary database. The readable secondary database may be in the same Azure region as the primary or, more commonly, in a different region. These readable secondary databases are also known as geo-secondaries or geo-replicas.
Unplanned geo-failover Unplanned, or forced, geo-failover immediately switches the geo-secondary to the primary role without any synchronization with the primary. Any transactions committed on the primary but not yet replicated to the secondary are lost. This operation is designed as a recovery method during outages when the primary is not accessible, but database availability must be quickly restored. When the original primary is back online, it will be automatically re-connected, reseeded using the current primary data, and become a new geo-secondary.
Planned geo-failover Planned geo-failover switches the roles of primary and geo-secondary databases after completing full data synchronization. A planned failover does not result in data loss.
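These operations can be driven from T-SQL; a hedged sketch (server and database names are hypothetical):

```sql
-- On the master database of the primary server: create a readable
-- geo-secondary on the partner server.
ALTER DATABASE SalesDb
ADD SECONDARY ON SERVER [partner-server]
WITH (ALLOW_CONNECTIONS = ALL);

-- On the master database of the secondary server:
-- planned failover after full synchronization (no data loss),
ALTER DATABASE SalesDb FAILOVER;
-- or forced failover during an outage (possible data loss).
ALTER DATABASE SalesDb FORCE_FAILOVER_ALLOW_DATA_LOSS;
```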
·docs.microsoft.com·
Introduction to Azure Storage - Cloud storage on Azure
The Azure Storage platform is Microsoft's cloud storage solution. Azure Storage provides highly available, secure, durable, massively scalable, and redundant storage for data objects in the cloud. Learn about the services available in Azure Storage and how you can use them in your applications, services, or enterprise solutions.
·docs.microsoft.com·
Choose a batch processing technology - Azure Architecture Center
Compare technology choices for big data batch processing in Azure, including key selection criteria and a capability matrix.
Azure Synapse is a distributed system designed to perform analytics on large volumes of data. It supports massively parallel processing (MPP), which makes it suitable for running high-performance analytics. Consider Azure Synapse when you have large amounts of data (more than 1 TB) and are running an analytics workload that will benefit from parallelism.
Azure Databricks is an Apache Spark-based analytics platform. You can think of it as "Spark as a service." It's the easiest way to use Spark on the Azure platform. Languages: R, Python, Java, Scala, and Spark SQL. It offers fast cluster start times, autotermination, and autoscaling; manages the Spark cluster for you; provides built-in integration with Azure Blob Storage, Azure Data Lake Storage (ADLS), Azure Synapse, and other services; supports user authentication with Azure Active Directory; and includes web-based notebooks for collaboration and data exploration.
·docs.microsoft.com·
Temporal Tables - SQL Server
System-versioned temporal tables bring built-in support for providing information about data stored in the table at any point in time
A system-versioned temporal table is a type of user table designed to keep a full history of data changes, allowing easy point-in-time analysis.
Every temporal table has two explicitly defined columns, each with a datetime2 data type. These columns are referred to as period columns.
In addition to these period columns, a temporal table also contains a reference to another table with a mirrored schema, called the history table. The system uses the history table to automatically store the previous version of the row each time a row in the temporal table gets updated or deleted.
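A minimal sketch of a system-versioned table and a point-in-time query (names are hypothetical):

```sql
CREATE TABLE dbo.Employee
(
    EmployeeId INT NOT NULL PRIMARY KEY CLUSTERED,
    Salary     DECIMAL(18, 2) NOT NULL,
    -- Period columns, populated by the system on every change.
    ValidFrom  DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo    DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

-- Point-in-time analysis across current and history rows.
SELECT * FROM dbo.Employee FOR SYSTEM_TIME AS OF '2024-01-01';
```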
·docs.microsoft.com·
Choose between slowly changing dimension types - Learn
Type 1: when a customer email address or phone number changes, the dimension table updates the customer row with the new values, overwriting the old ones.
Type 2: the table keeps a row per version of a member and includes columns that define the date range validity of each version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members. A Type 2 change is applied as shown in the sketch below.
Type 3: the table includes a column for the current value of a member plus either the original or previous value of the member.
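For instance, a Type 2 change might be applied like this (the table, columns, and @-variables are hypothetical placeholders):

```sql
-- Expire the current version of the member (CustomerId is the business key)...
UPDATE dbo.DimCustomer
SET EndDate = SYSUTCDATETIME(), IsCurrent = 0
WHERE CustomerId = @CustomerId AND IsCurrent = 1;

-- ...then insert the new version as the current row.
INSERT INTO dbo.DimCustomer (CustomerId, Email, Phone, StartDate, EndDate, IsCurrent)
VALUES (@CustomerId, @NewEmail, @NewPhone, SYSUTCDATETIME(), '9999-12-31', 1);
```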
·docs.microsoft.com·
Purchasing models - Azure SQL Database
Learn about the purchasing models that are available for Azure SQL Database: the vCore purchasing model and the DTU purchasing model.
The DTU-based purchasing model uses a database transaction unit (DTU) to calculate and bundle compute costs. A DTU represents a blended measure of CPU, memory, reads, and writes.
A virtual core (vCore) represents a logical CPU and offers you the option to choose between generations of hardware and the physical characteristics of the hardware (for example, the number of cores, the memory, and the storage size).
·docs.microsoft.com·
Always Encrypted - SQL Server
Overview of Always Encrypted that supports transparent client-side encryption and confidential computing in SQL Server and Azure SQL Database
Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine (SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data and can view it, and those who manage the data but should have no access.
This allows organizations to store their data in Azure, and enable delegation of on-premises database administration to third parties, or to reduce security clearance requirements for their own DBA staff.
Deterministic encryption always generates the same encrypted value for any given plain text value. It allows point lookups, equality joins, grouping, and indexing on encrypted columns.
Randomized encryption uses a method that encrypts data in a less predictable manner. It is more secure, but prevents searching, grouping, indexing, and joining on encrypted columns.
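A hedged sketch of column definitions (the key CEK1 and the table are hypothetical; the column master key and column encryption key must be provisioned first):

```sql
CREATE TABLE dbo.Staff
(
    StaffId INT IDENTITY(1, 1) PRIMARY KEY,
    -- Deterministic: equality lookups, joins, and grouping still work.
    -- Character columns need a BIN2 collation for deterministic encryption.
    SSN CHAR(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL,
    -- Randomized: stronger protection, but no server-side operations.
    Salary DECIMAL(18, 2)
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NULL
);
```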
·docs.microsoft.com·
Distributed tables design guidance - Azure Synapse Analytics
Recommendations for designing hash-distributed and round-robin distributed tables using dedicated SQL pool.
Consider using a hash-distributed table when: The table size on disk is more than 2 GB. The table has frequent insert, update, and delete operations.
Consider using the round-robin distribution for your table in the following scenarios: when getting started, since it is the default and a simple starting point; if there is no obvious joining key; if there is no good candidate column for hash-distributing the table; if the table does not share a common join key with other tables; if the join is less significant than other joins in the query; or when the table is a temporary staging table.
Hash-distributed tables work well for large fact tables in a star schema.
A good distribution column is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses; is not used in WHERE clauses; and is not a date column.
·docs.microsoft.com·