Setting up Azure Disk Encryption for a Virtual Machine with PowerShell

As I discussed in my previous blog post, I opted to use Azure Disk Encryption for my virtual machines in Azure, rather than Storage Service Encryption. Azure Disk Encryption utilizes Bitlocker inside of the VM. Enabling Azure Disk Encryption involves these Azure services:

  • Azure Active Directory for a service principal
  • Azure Key Vault for a KEK (key encryption key) which wraps around the BEK (bitlocker encryption key)
  • Azure Virtual Machine (IaaS)

Following are 4 scripts which configures encryption for an existing VM. I initially had it all as one single script, but I purposely separated them. Now that they are modular, if you already have a Service Principal and/or a Key Vault, you can skip those steps. I have my 'real' version of these scripts stored in an ARM Visual Studio project (same logic, just with actual names for the Azure services). These PowerShell templates go along with other ARM templates to serve as source control for our Azure infrastructure.

As any expert will immediately know by looking at my scripts below, I'm pretty much a PowerShell novice. So, be kind dear reader. My purpose is to document the steps, the flow,add some commentary, and to pull together a couple pieces I found on different documentation pages. 


Step 1: Set up Service Principal in AAD

<#
.SYNOPSIS
Creates Service Principal in Azure Active Directory
.DESCRIPTION
This script creates a service principal in Azure Active Directory. 
A service principal is required to enable disk encryption for VM.
.NOTES
File Name: CreateAADSvcPrinForDiskEncryption.ps1
Author : Melissa Coates
Notes: Be sure the variables in the input area are completed, following all standard naming conventions.
The $aadSvcPrinAppPassword needs to be removed before saving this script in source control.
.LINK
Supporting information: 
https://blogs.msdn.microsoft.com/azuresecurity/2015/11/16/explore-azure-disk-encryption-with-azure-powershell/
https://docs.microsoft.com/en-us/azure/security/azure-security-disk-encryption 
#>

#-----------------------------------------

#Input Area
$subscriptionName = 'MyAzureSubscriptionDev'
$aadSvcPrinAppDisplayName = 'VMEncryptionSvcPrinDev'
$aadSvcPrinAppHomePage = 'http://FakeURLBecauseItsNotReallyNeededForThisPurpose'
$aadSvcPrinAppIdentifierUri = 'https://DomainName.com/VMEncryptionSvcPrinDev'
$aadSvcPrinAppPassword = 'SuperStrongPassword'

#-----------------------------------------

#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Create Service Principal App to Use For Encryption of VMs
$aadSvcPrinApplication = New-AzureRmADApplication -DisplayName $aadSvcPrinAppDisplayName -HomePage $aadSvcPrinAppHomePage -IdentifierUris $aadSvcPrinAppIdentifierUri -Password $aadSvcPrinAppPassword
New-AzureRmADServicePrincipal -ApplicationId $aadSvcPrinApplication.ApplicationId

Step 2: Create Azure Key Vault

<#
.SYNOPSIS
Creates Azure Key Vault.
.DESCRIPTION
This script does the following:
1 - Creates a key vault in Azure.
2 - Allows the Azure Backup Service permission to the key vault.
This is required if Recovery Vault will be used for backups.
A key vault is required to enable disk encryption for VM.
.NOTES
File Name: ProvisionAzureKeyVault.ps1
Author : Melissa Coates
Notes: Be sure the variables in the input area are completed, following all standard naming conventions.
The key vault must reside in the same region as the VM which will be encrypted.
A Premium key vault is being provisioned so that an HSM key can be created for the KEK.
The 262044b1-e2ce-469f-a196-69ab7ada62d3 ID refers to the Azure Key Vault (which is why it is not a variable).
.LINK
Supporting information: 
https://blogs.msdn.microsoft.com/azuresecurity/2015/11/16/explore-azure-disk-encryption-with-azure-powershell/
https://docs.microsoft.com/en-us/azure/security/azure-security-disk-encryption 

#>

#-----------------------------------------

#Input Area
$subscriptionName = 'MyAzureSubscriptionDev'
$resourceGroupName = 'MyDevRG'
$keyVaultName = 'KeyVault-Dev'
$keyVaultLocation = 'East US 2'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Create Azure Key Vault
New-AzureRmKeyVault -VaultName $keyVaultName -ResourceGroupName $resourceGroupName -Location $keyVaultLocation -Sku 'Premium'

#-----------------------------------------

#Permit the Azure Backup service to access the key vault
Set-AzureRmKeyVaultAccessPolicy -VaultName $keyVaultName -ResourceGroupName $resourceGroupName -PermissionsToKeys backup,get,list -PermissionsToSecrets get,list -ServicePrincipalName 262044b1-e2ce-469f-a196-69ab7ada62d3

Step 3: Connect Service Principal with Key Vault

<#
.SYNOPSIS
Enables the service principal for VM disk encryption to communicate with Key Vault.
.DESCRIPTION
This script does the following:
A - Allows service principal the selective permissions to the key vault so
that disk encryption functionality works.
B - Creates a KEK (Key Encryption Key). For Disk Encryption, a KEK is required 
in addition to the BEK (BitLocker Encryption Key).
Prerequisite 1: Service Principal name (see CreateAADSvcPrinForVMEncryption.ps1)
Prerequisite 2: Azure Key Vault (see ProvisionAzureKeyVault.ps1)
.NOTES
File Name: EnableSvcPrinWithKeyVaultForDiskEncryption.ps1
Author : Melissa Coates
Notes: Be sure the variables in the input area are completed, following all standard naming conventions.
The key vault must reside in the same region as the VM being encrypted.
The key type can be either HSM or Software (HSM offers additional security but does require a Premium key vault). 
.LINK
Supporting information: 
https://blogs.msdn.microsoft.com/azuresecurity/2015/11/16/explore-azure-disk-encryption-with-azure-powershell/
https://docs.microsoft.com/en-us/azure/security/azure-security-disk-encryption 
#>

#Input Area
$subscriptionName = 'MyAzureSubscriptionDev'
$resourceGroupName = 'MyDevRG'
$aadSvcPrinAppDisplayName = 'VMEncryptionSvcPrinDev'
$keyVaultName = 'KeyVault-Dev'
$keyName = 'VMEncryption-KEK'
$keyType = 'HSM'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Allow the Service Principal Permissions to the Key Vault
$aadSvcPrinApplication = Get-AzureRmADApplication -DisplayName $aadSvcPrinAppDisplayName
Set-AzureRmKeyVaultAccessPolicy -VaultName $keyVaultName -ServicePrincipalName $aadSvcPrinApplication.ApplicationId -PermissionsToKeys 'WrapKey' -PermissionsToSecrets 'Set' -ResourceGroupName $resourceGroupName

#-----------------------------------------

#Create KEK in the Key Vault
Add-AzureKeyVaultKey -VaultName $keyVaultName -Name $keyName -Destination $keyType

#-----------------------------------------

#Allow Azure platform access to the KEK
Set-AzureRmKeyVaultAccessPolicy -VaultName $keyVaultName -ResourceGroupName $resourceGroupName -EnabledForDiskEncryption

Step 4: Enable Disk Encryption

<#
.SYNOPSIS
Enables disk encryption for a VM.
.DESCRIPTION
This script enables disk encryption for an Azure virtual machine.
Prerequisite 1: Service Principal name (see CreateAADSvcPrinForDiskEncryption.ps1)
Prerequisite 2: Azure Key Vault (see ProvisionAzureKeyVault.ps1)
Prerequisite 3: Permissions to Key Vault for Service Principal (see EnableSvcPrinWithKeyVaultForDiskEncryption.ps1)
.NOTES
File Name: EnableAzureDiskEncryption.ps1
Author : Melissa Coates
Notes: Be sure the variables in the input area are completed, following all standard naming conventions.
Azure Disk Encryption (ADE) 
The Azure VMs must already exist and be running.
To verify when completed: Get-AzureRmVmDiskEncryptionStatus -ResourceGroupName $resourceGroupName -VMName $vmName
.LINK
Supporting information: 
https://blogs.msdn.microsoft.com/azuresecurity/2015/11/16/explore-azure-disk-encryption-with-azure-powershell/
https://docs.microsoft.com/en-us/azure/security/azure-security-disk-encryption 
#>

#-----------------------------------------

#Input Area
$subscriptionName = 'MyAzureSubscriptionDev'
$resourceGroupName = 'MyDevRG'
$keyVaultName = 'KeyVault-Dev'
$keyName = 'VMEncryption-KEK'
$aadSvcPrinAppDisplayName = 'VMEncryptionSvcPrinDev'
$aadSvcPrinAppPassword = 'SuperStrongPassword'
$vmName = 'VMName-Dev'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Enable Encryption on Virtual Machine
$keyVault = Get-AzureRmKeyVault -VaultName $keyVaultName -ResourceGroupName $resourceGroupName
$diskEncryptionKeyVaultUrl = $KeyVault.VaultUri
$keyVaultResourceId = $KeyVault.ResourceId
$keyEncryptionKeyUri = Get-AzureKeyVaultKey -VaultName $keyVaultName -KeyName $keyName 
$aadSvcPrinApplication = Get-AzureRmADApplication -DisplayName $aadSvcPrinAppDisplayName 
Set-AzureRmVMDiskEncryptionExtension -ResourceGroupName $resourceGroupName -VMName $vmName -AadClientID $aadSvcPrinApplication.ApplicationId -AadClientSecret $aadSvcPrinAppPassword -DiskEncryptionKeyVaultUrl $diskEncryptionKeyVaultUrl -DiskEncryptionKeyVaultId $KeyVaultResourceId -KeyEncryptionKeyUrl $keyEncryptionKeyUri.Id -KeyEncryptionKeyVaultId $keyVaultResourceId

Step 4 takes around 10 minutes to run; it will prompt you with the following dialog box before it executes:

You'll see this message when step 4 has completed:

And in the portal, the disks associated with the VM will also indicate that encryption is now enabled:


Troubleshooting

One error I had issues with was "Azure Backup Service does not have sufficient permissions to Key Vault for Backup of Encrypted Virtual Machines."  The last cmdlet in Step 2 above resolves this issue. So, watch out for that if you are using a key vault that already exists.

You Might Also Like...

Deciding on Encryption at Rest for an Azure Virtual Machine

Setting Up a PC for Cortana Intelligence Suite Development

Deciding on Encryption at Rest for an Azure Virtual Machine

I have been getting more familiar with the encryption-at-rest capabilities in Azure for virtual machines (VMs). This post covers two options: Storage Service Encryption (SSE) and Disk Encryption.

Initially I opted for Storage Service Encryption due to its sheer simplicity. This is done by enabling encryption when you initially provision the storage account. After having set it up, I had proceeded onto other configuration items, one of which is setting up backups via the Azure Recovery Vault. Turns out that encrypted backups in the Recovery Vault are not (yet?) supported for VMs encrypted with Storage Service Encryption (as of Feb 2017).

Next I decided to investigate Disk Encryption because it supports encrypted backups in the Recovery Vault. It's more complex to set up because you need a Service Principal in AAD, as well as Azure Key Vault integration. (More details on that in my next post.)

Here is a quick summary of the differentiators between Storage Service Encryption and Disk Encryption that I am aware of:

Storage Service Encryption

  • Controlled by storage account administrator
  • Encryption is enabled or not for all files within a storage account
  • Encrypts only newly written data after point in time that encryption is set (i.e., doesn't encrypt existing data)
  • Backups in Recovery Vault are not encrypted

Disk Encryption

  • Controlled by virtual machine owner
  • Encrypts operating system as well as the volumes (which are actually in Azure storage)
  • Offers more control - for example, if two VMs share one storage account, one VM could have disk encryption and the other not
  • Is a prerequisite for VMs backed up to the Recovery Vault to be encrypted <--the meaningful one for me

The documentation does state that both types of encryption can work together. However, if you enable Disk Encryption and Storage Service Encryption is already in place, then data will be encrypted twice. I opted not to do this.

What you want to see for your VM after disk enryption is enabled (you can click the Columns button to add the Disk Encryption column):

What you want to see in Azure Recovery Vault after disk encryption is enabled:

In my next post I'll share the PowerShell scripts I used for enabling Azure Disk Encryption + setting up the prerequisites for the Service Principal in AAD and Azure Key Vault integration.

You Might Also Like...

Setting up Azure Disk Encryption for a Virtual Machine with PowerShell

Setting Up a PC for Cortana Intelligence Suite Development

How to create a Demo/Test Environment for Azure Data Catalog

Defining the Components of a Modern Data Warehouse

As I put together a new presentation on my current favorite topic (modern data warehousing), it occurred to me that others might feel like there's some confusion and/or overlap with terminology. Some terms are somewhat fuzzy and mean different things within different organizations, so here's my best effort at a glossary of the components within a Modern Data Warehouse.

In alphabetical order:

Advanced Analytics

The term advanced analytics is a broad phrase for sophisticated statistical techniques to find patterns in the data for the purpose of predictions, recommendations, optimizations, and descriptions of information. It can include subcategories such as predictive analytics, prescriptive analytics, operational analytics, descriptive analytics, and so forth. Advanced analytics can be employed for use cases such as fraud detection, customer segmentation, assessing credit risk, or predictions such as student dropouts, customer churn, or hospital readmissions.

Analytics

In general, use of the term analytics has evolved over time. It used to imply the usage of statistical methods, such as machine learning and data mining. More recently the term analytics has evolved to be commonly used to describe finding meaningful patterns in data.

Analytics Sandbox

An analytics sandbox is an exploratory environment which a knowledgeable analyst or data scientist controls. In this ungoverned (or less governed) personal environment, an analyst can move very quickly with usage of preferred tools and techniques. The advantage of an analytics sandbox is agility for purposes of prototyping and/or data exploration. Sandbox solutions may be productionized and moved into the governed DW/BI/analytics environment.

Big Data

The term 'big data' is terribly overused, and used in different ways. One meaning refers to sheer size of data volumes. Another meaning is associated with multi-structured data (the combination of unstructured, semi-structured, and structured) of varying data types. Further, others mean it to imply analysis of data in new and interesting ways. Low latency data (the velocity portion of the 3 V's of volume, variety, and velocity) is also attributed to big data.

Bimodal BI

Originally coined by Gartner, Bimodal BI refers to two modes for development and delivery of information. One mode is focused on agility: business-driven, rapid delivery which values the freedom of exploration and speed. The second mode focuses on traditional IT-driven processes which value reliability, governance, standardization, and security. Having groups of knowledgeable staff each devoted to their own mode, while working collaboratively with each other, is optimal for achieving a good balance around delivery of BI. 

Data Catalog

An enterprise data catalog is a repository which describes corporate data assets. One aspect of the data catalog is documentation which is often referred to as a data dictionary, but can be much more than that depending on the software vendor's implementation. The ability to search for data based on keywords is another major benefit. In a search scenario, the underlying data security is still present; what is exposed is metadata such as columns, owner, and meaning of the data. A data catalog should also assist with data provisioning (i.e., where to request access to a particular data source if a user does not currently have access).

Data Governance

Data governance processes oversee corporate data assets so they are properly utilized with integrity and the data is understood as intended. Implementation of data governance varies wildly from organization to organization.

Data Federation

Frequently data federation is used synonymously with data virtualization, and the concepts are certainly related. Technically speaking, however, a federated query returns data from multiple data stores - federating, or combining, the query results. A federated query works using data virtualization techniques.

Data Integration

The term data integration can be used in a couple of ways. Conceptually, it refers to the integration, or consolidation, of data from multiple sources together. However, usually when we refer to data integration for a data warehouse, we're referring to data integration being performed physically, via an ETL (Extract>Transform>Load) process. Oftentimes data virtualization (in which the consolidated data is not materialized) is contrasted with data integration (in which the consolidated data is materialized). Data integration processes typically involve significant development effort, which is one reason why data virtualization has become more prevalent.

Data Lake

A data lake is one piece of an overall data management strategy. Conceptually, a data lake is nothing more than a data repository. The data lake can store any type of data, so it's well-suited to ingestion of multi-structured data such as logs and machinery output. Cost and effort associated with data ingestion are reduced because the data is stored in its original native format with no structure (schema) required of it initially. Data lakes usually align with an "ELT" strategy which means we can Extract and Load into the data lake in its original format, then Transform later *if* a need presents itself. A data lake is commonly implemented with HDFS (Hadoop Distributed File System), but could also easy involve other technologies such as NoSQL.

Data Mart

A data mart is usually focused on reporting and analysis for a particular department or a particular subject area. Data marts typically are considered to be a subset of what is contained in an enterprise data warehouse, though that is not always consistent for all implementations. Like a data warehouse, data marts are usually populated from source systems with an ETL process (data integration), so some latency is expected for the data to become available for reporting and analysis in the marts. It's also possible for data marts to employ some level of data virtualization.

Data Virtualization

A data virtualization layer is one aspect of a logical data warehouse implementation. It can be thought of as a method to access one or more underlying data sources, regardless of where the data resides, without requiring the data to be physically materialized in another data structure. Depending on the data virtualization platform, data cleansing and transformations can be performed. On a small scale, data virtualization can be used very successfully during pilot/exploratory/prototyping phases to improve BI agility, shorten the learning curve, and influence what should be built in the data warehouse. It can also be used successfully to access data in small data volumes, and/or for infrequently needed data. Using a data virtualization approach has numerous challenges, most notably user query performance and reporting load on the source systems -- therefore, a full-fledged logical data warehouse is likely necessary if data virtualization is utilized significantly across an enterprise.

Data Warehouse

Traditionally a data warehouse is a repository of enterprise-wide data which has been consolidated from multiple source systems, thus increasing the value of the data after it's been correlated. The source data is cleansed, transformed, standardized, enriched with calculations, and stored historically to facilitate time-oriented analysis. The traditional data warehouse is a centralized database, separate and distinct from the source systems, which usually translates to some level of delay in the data being available for reporting and analysis. The level of effort in developing an end-to-end data warehouse can involve long development cycles, which has opened up opportunities for alternative methods for handling data integration and data access.

Data Warehouse Appliance

A DW appliance is a combination of both hardware and software which is optimized specifically for data warehousing and analytics at high scale. Most DW appliances are based on MPP (massively parallel processing) architecture for high performance operations.

Distributed Query

A query which returns results from multiple, disparate data sources. A federated query and a distributed query are synonymous.

Distributed Processing

Distributed processing is one aspect of a logical data warehouse implementation. It involves pushing down the processing effort to each distributed source system whenever possible in order to maximize performance via parallelism. There are numerous techniques utilized by vendors to improve the performance of distributed processing.

Hadoop

Hadoop is a broad-ranging set of open source Apache projects which form an ecosystem for distributed storage as well as distributed processing/computations for large amounts of data (i.e., Big Data). Hadoop has continued to expand its presence due to its ability to scale both storage and processing at lower cost. A Hadoop environment complements a data warehouse by offering an environment which can handle data exploration, processing for a variety of data types, advanced analytic computations, as well as a capable ETL alternative. Because the Hadoop ecosystem is extremely broad, there's a myriad of options available for using Hadoop.

Lambda Architecture

A Lambda architecture is more about data processing than data storage. It's an architectural pattern designed to process large data volumes using both batch and streaming methods. Batch processing is typically pull-oriented, whereas streaming data is push-oriented. A Lambda architecture implementation involves a batch layer, a speed layer, and a serving layer. The serving layer, which handles data access/consumption, may serve consolidated data which was ingested through the different batch and speed layers.

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.

Massively Parallel Processing (MPP)

An MPP system operates on high volumes of data in a parallel manner across distributed nodes. This "shared-nothing architecture" differs from an SMP system because each separate physical node has its own disk storage, memory, and CPU. Though an MPP platform has a lot of functionality in common with SMP, there are also many differences with respect to table design, handling distributed data, and alternative data loading patterns in order to take full advantage of parallelism. Most MPP systems are oriented towards DW/analytics/big data workloads, as opposed to transactional system workloads.

Master Data Management (MDM)

A master data management system is a set of tools and process for the purpose of managing, standardizing, and augmenting an organization’s key reference and descriptive information for subject areas such as customers, products, accounts, or stores.  Master data management is often thought of in two forms: analytical MDM, and operational MDM. The value of a data warehouse can be exponentially increased with skillful master data management.

Modern Data Warehouse

Characteristics of a modern data warehouse frequently include (in no particular order):

  • Capability of handling a variety of subject areas and diverse data sources
  • Ability to handle large volumes of data
  • Expansion beyond a single DW/relational data mart structure (to include structures such as Hadoop, data lake, and/or NoSQL databases)
  • Multi-platform architecture which balances scalibility and performance
  • Data virtualization in addition to data integration
  • Ability to facilitate near real-time analysis on high velocity data (potentially via Lambda architecture)
  • A flexible deployment model which is decoupled from the tool used for development
  • Built with agile, modular approach with fast delivery cycles
  • Hybrid integration with cloud services
  • Some DW automation to improve speed, consistency, and flexibly adapt to change
  • Data cataloging to facilitate data search and to document business terminology
  • Governance model to support trust and security
  • Master data management for curation of reference data
  • Support for all types of users, and all levels of users
  • Availability of an analytics sandbox or workbench area to facilitate agility within a bimodal BI environment
  • Support for self-service BI to augment corporate BI
  • Delivery of data discovery and data exploration, in addition to reports and dashboards
  • Ability to certify and promote self-service solutions to the corporate BI/analytics environment

NoSQL Database

Like a data lake, a NoSQL database is a schema-agnostic data storage option. NoSQL databases are well-suited to hierarchical/nested data with simple data structures (ex: property and its value). There are various types of NoSQL databases which suit different use cases, including: key-value DBs, document DBs, column family stores, and graph DBs.

Polygot Persistence

These days, we anticipate our BI and analytics environment to be comprised of multiple platforms (which is ideally transparent to users). Polygot persistence (aka "best fit engineering") refers to the approach of using the most suitable data storage technology based on the data itself. For instance, multi-structured data is usually well-suited to a data lake or HDFS storage solution; log data in XML or JSON format may be suitable for a NoSQL solution; structured data aligns with a relational database. Using multiple technologies in an information management platform is a form of scaling out, and although it simplifies needs related to certain data types, it does add complexity to the overall data management solution. The more that polygot persistence methodology is followed, the additional likelihood data virtualization is of benefit due to varied infrastructure.

Schema on Read

Querying techniques such as "Schema on Read" have become prevalent because it applies structure at the time the data is queried, rather than at the time the data is written or initially stored (i.e., "Schema on Write"). The value of the Schema on Read approach is to be able to store the data relatively easily with less up-front time investment, then query the data "where it lives" later once a use case is defined. Schema on Read does offer significant flexibility which can offer agility in a modern data warehousing environment and allows us to learn and deliver value more quickly. Schema on Read is often associated with an ELT (Extract>Load>Transform) approach, which is common for big data projects. The implementation of Schema on Read is usually only a portion of the overall information management platform.

Schema on Write

Schema on Write refers to how a traditional data warehouse is designed. The dimensional data model is defined up front, its schema is created, then the dimension and fact tables are loaded with data. Once the data has been loaded, users can begin using it for reporting. This approach requires up-front data analysis, data modeling, and creating of data load processes, all of which can involve long development cycles which challenge business agility. Schema on Write is associated with the traditional ETL approach: Extract>Transform>Load.

Semantic Model

A semantic model extends a data warehouse with calculations, relationships, formatting, and friendly names which make sense to functional users. The semantic model is the primary interface for users, and can significantly improve the user experience for reporting, particularly for less technical users who are responsible for some level of self-service reporting and analysis. A semantic model can be a pass-through only (with some similar characteristics to data virtualization), or could be combined with cached data storage mechanisms (such as OLAP or in-memory analytical models).

Operational Data Store (ODS)

An ODS is a data storage repository intended for near real-time operational reporting and analysis. Whereas a data warehouse and data marts are usually considered nonvolatile, an ODS is considered volatile because the data changes so frequently. To reduce data latency, data transformations and cleansing are usually minimized during the ETL process to populate the ODS. An ODS typically contains minimal history.

Symmetrical Multiprocessing (SMP)

An SMP system is a traditional relational database platform in which resources such as disk storage and memory are shared. For very large-scale databases, when an SMP database begins to experience limits related to performance and scalability, an MPP (massively parallel processing) database may be an alternative.

Where Azure Analysis Services Fits Into BI & Analytics Architecture (Part 3)

This is part 3 of a discussion about a new service in Microsoft Azure: Azure Analysis Services. Azure AS is a Platform-as-a-Service (PaaS) offering which is in public preview mode (as of December 2016).

Part 1: Why a Semantic Layer Like Azure Analysis Services is Relevant

Part 2: Use Cases for Azure Analysis Services

Part 3: Where Azure Analysis Services Fits Into BI & Analytics Architecture {you are here}

From an infrastructure perspective, there are a variety of ways you can use Azure Analysis Services. Basically, you can think of it as a "mix & match" each of the 3 following options:

(1) Location of Assets

  • Hybrid. This is most common for companies which are extending their existing infrastructure.
  • All cloud-based. Utilization of cloud infrastructure, such as Azure services, is most common for brand new companies.
  • All on-premises. N/A for Azure AS - you'll want to use SQL Server Analysis Services instead of Azure Analysis Services for a fully on-premises implementation.

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.
  • Directly from various systems.  This can be done, but works well only in specific cases - it definitely won't work well if there are a lot of highly normalized tables, or if there's not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the "integration" burden on the data source view in Analysis Services, so plan for plenty of time testing if you're going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won't work if a user tries to combine data sources because the data is not inherently related).

There is one twist to the data source options, and that is the use of federated queries. See the last example at the end of this post for more on that.

(3) Data Storage

  • Stored (cached) in the AS data model. Data is stored in the in-memory model. In this case, we're using Analysis Services for its in-memory database as well as a semantic layer. This requires a scheduled refresh and provides the best performance.
  • DirectQuery. Data is *not* stored in Analysis Services; rather, AS is basically a semantic layer only wherein all queries are actually sent to underlying data source(s). This is useful when near real-time data is desired. However, unless your source system is tuned for it, performance may not be acceptable.

Note that the property to specify if the AS model is DirectQuery or not is associated with the .bim file. This means that the entire model is either DirectQuery or it's not (can't choose on a data source-by-data source basis or a table-by-table basis). 

 

Below are some common scenarios (note not all possible combinations are actually depicted below, but enough to give you ideas).


In this first scenario, we have a traditional data warehouse which has integrated data from four different source systems. The data warehouse resides in an on-premises server, and the Analysis Services semantic layer resides in Azure. Data in Analysis Services is refreshed on a schedule. Reporting is primarily handled through the semantic layer to improve the user experience.

The data sources & Visual Studio pieces are removed from the rest of the examples to simplify (though they certainly still pertain).


The following depicts using Azure AS in DirectQuery mode back to the data warehouse. In this case, the DAX or MDX (whichever is passed from the client tool) is converted to SQL, sent to the data warehouse through the gateway. Data is then retrieved and sent back securely to the client tool.


Alternatively, the DirectQuery mode could be directed at a cloud-based data warehouse. Note in this scenario that a gateway is not necessary (because AS and the data warehouse both reside in Azure).


The next scenario varies only in that the data is stored in Analysis Services, and refreshed on a schedule.


Lastly, we have the concept of federated queries. PolyBase allows us to define an "external table" in SQL Server or Azure SQL Data Warehouse and reach into data stored in Azure Blob Storage (Azure Data Lake Store support is coming soon). These external tables are known as "schema on read" because the data isn't physically stored in the data warehouse.

In the following example, we are using Analysis Services in DirectQuery mode directed to the data warehouse. Under the covers, since an external table is involved, the user queries will actually reach back farther to get the Equipment Health Data as well, though the users don't have to know that's actually happening - though they do need to be willing to accept slower performance. Federated queries like this offer great flexibility to avoid or delay data integration for data analysis which is infrequent (if the data is frequently accessed, or has progressed beyond proof of concept, you likely want to implement full data integration).

Though I didn't depict every possible combination, hopefully this gives you a good idea of ways to use Analysis Services, and where it can fit into your existing architecture. As you can tell from Part 1 of this series, I'm a big fan of using a semantic layer for a consistent and friendly end-user experience.

You Might Also Like...

Overview of SQL Server Analysis Services Tabular in DirectQuery Mode for SQL Server 2016

Building Blocks of Cortana Intelligence Suite in Azure

Power BI Features End-to-End