When Should We Load Relational Data to a Data Lake?

This is a question I get fairly regularly these days: Should we extract relational data and load it to a data lake?

Architecture diagrams, such as the one displayed here from Microsoft, frequently depict all types of data sources going through the lake:

[Image: Microsoft architecture diagram depicting data sources flowing through a data lake]

For certain types of data, writing it to the data lake frequently is the best choice. This is often true for low-latency IoT data, semi-structured data like logs, and data with varying structures such as social media data. However, the right approach for structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (such as CSV, JSON, or Parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database: data types, constraints, foreign keys, and so on. I tend to say that we "de-relationalize" data when we write it to a file in the data lake. If we're going to turn right around and load that data into a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
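To make that concrete, below is a minimal sketch of extracting a table to a CSV file with PowerShell; the server, database, and file path are hypothetical placeholders. Notice that only the raw values land in the file -- the data types, constraints, and foreign keys stay behind in the database:

# Requires the SqlServer module: Install-Module SqlServer
Import-Module SqlServer

# Pull the rows out of the relational database.
$rows = Invoke-Sqlcmd -ServerInstance "MySqlServer" -Database "SalesDb" `
    -Query "SELECT * FROM dbo.Customer"

# Write them to a file destined for the data lake; everything becomes plain text.
$rows | Export-Csv -Path "C:\LakeStaging\customer.csv" -NoTypeInformation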

My current thinking is that this intermediary step is justified primarily in these situations:

(1) Do you need to keep the history? If you are retaining snapshots as of periodic points in time, having those snapshots accessible in a data lake can be useful.

(2) Will multiple downstream teams, systems, or processes access the data? If there are multiple consumers of the data (and presumably we don't want them accessing the original source directly), then it's possible that providing data access from a data lake might be beneficial.

(3) Do you have a strategy of storing *all* of the organizational data in your data lake? If you've standardized on this type of approach, and (1) or (2) above applies, then it makes sense.

(4) Do you need to provide a subset of data for a specialized use case? If you can't provide access to the source database, but a full replica or another relational DB is overkill, then delivery of data via the data lake can sometimes make sense.


The above list isn’t exhaustive - it always depends. My rule of thumb: avoid writing relational data to a data lake when it doesn't have value being there. This is an opinion, and not everyone agrees with it. Even if we are using our data lake as a staging area for a data warehouse, my opinion is that relational data doesn't necessarily have to make a pit stop in the data lake unless it's justified to do so.

You Might Also Like…

Zones in a Data Lake

Data Lake Use Cases and Planning Considerations

James Serra’s blog - Should I load Structured Data Into My Data Lake?

Controlling Data Access in Azure for Administrators and Owners


Recently a customer expressed concern that an owner of an Azure resource group automatically gains access to the data within the services contained in the resource group. In this case, the customer was specifically referring to data in Azure Data Lake Storage Gen 1, but this concept applies to Azure Storage and some of the other data-oriented services in Azure as well. The customer’s comment prompted me to look into available alternatives. This is by no means a detailed security post; rather, I’m trying to share a few nuggets of what I learned.

Key points to take away from this post:

  • The built-in Owner role is, by design, unrestricted for both management operations and data operations. *See option 3 below, where it looks like this behavior is changing.*

  • You cannot 'break' the inheritance model of resources in Azure.

  • If resource group owners should not be able to see data, these are the choices I’m aware of currently:

  1. Remove owner permissions from the resource group level, or

  2. Isolate the resource into its own resource group so it can be managed separately, or

  3. Investigate using the new "DataActions" and "NotDataActions" properties within the RBAC role structure to separate management operations from data operations (in preview at the time of this writing)

Default Behavior of an Owner in a Resource Group Allows Access to Data

Let's first cover what happens by default. Say I have a resource group with a way cool name of "SecurityTestRG" and I've granted "Wayne Writer" ownership permissions to the resource group. And within the resource group I have a storage account called "strgtest96." As we'd expect, the owner permissions are inherited by the storage account:

[Image: Security_ResourceInherited.jpg]
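For reference, that role assignment could be created in PowerShell with a one-liner like this sketch (the sign-in name is a placeholder):

# Grant the built-in Owner role at the resource group level.
New-AzureRmRoleAssignment -SignInName "wayne.writer@example.com" `
    -RoleDefinitionName "Owner" -ResourceGroupName "SecurityTestRG"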

And with that, Wayne Writer is able to view, edit, and delete any of the data that has been uploaded to that storage account by virtue of his owner permissions at the resource group level:

[Image: Security_OwnerOfResourceGroup.jpg]

Allowing owners unrestricted access to both the management plane and the data plane is by design. However, if you have a group of people who should be able to administer (own) everything contained in the resource group, but you don’t want them to automatically see all the data, what are the options?

Cannot Break the RBAC Inheritance Model in Azure

The first thing you might be inclined to try is simply removing the owner permission for the one resource within the resource group that contains data, so that you can specify its security differently. However, that doesn’t work:

[Image: Security_RoleAssignmentsError.jpg]

When you try to delete a role assignment that’s been inherited, as shown above, you get an error message: “Inherited role assignments cannot be removed. Please open up the associated resource and remove the role assignments from there.”

In Azure’s RBAC model, we can add additional permissions at lower levels (e.g., for a single resource within the resource group), but we cannot remove an assignment that’s been inherited.
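You can see this for yourself by listing the role assignments on a resource -- the scope of each assignment reveals where it was granted. A sketch using the example names from above:

# List role assignments visible on the storage account; the Scope column
# shows which ones are inherited from the resource group or subscription.
Get-AzureRmRoleAssignment -ResourceGroupName "SecurityTestRG" `
    -ResourceName "strgtest96" `
    -ResourceType "Microsoft.Storage/storageAccounts" |
    Select-Object DisplayName, RoleDefinitionName, Scope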

An additional note about subscription-level owners & administrators: any owner you specify at the subscription level will be inherited by all resource groups and resources across the subscription—it’s a very high-privilege role. The same inheritance behavior also applies to the Azure service administrator (but not to the other co-administrators—they would have to be specified as subscription owners).

Options for Handling Security Differently

Option 1. You could remove the owner permission from the resource group (that’s the suggestion presented within the error message shown above). However, let’s say there are 10 resources in this resource group and the other 9 of them can inherit the owner permission without issue. In that case, do we really want to manage security for every single resource separately? That might introduce risk and inconsistency. Though if there aren’t very many resources within a resource group, that might work ok.
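If you do go that route, the removal itself is a one-liner; a sketch using the example names from earlier (the sign-in name is a placeholder). You would then grant permissions resource by resource instead:

# Remove the Owner assignment at the resource group level.
Remove-AzureRmRoleAssignment -SignInName "wayne.writer@example.com" `
    -RoleDefinitionName "Owner" -ResourceGroupName "SecurityTestRG"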

Option 2. You could isolate the resource into its own ‘sensitive’ resource group so it can be managed separately. This allows the other 9 resources to stay in the main resource group with normal RBAC, and segregates just this one resource. Depending on how you handle automation and deployments, having separate resource groups could add complications (the general rule for resource groups is to group together resources that are related and share the same lifecycle). Also, it still doesn’t let someone own the resource without seeing its data—this segregation works only if contributor or a custom role meets your needs, and only if you don’t have subscription-level owners specified, since those are inherited no matter what.

Option 3. There is promising new functionality in preview which will allow us to handle security for management operations and data operations separately, as shown here:

[Image: scenario from Microsoft’s documentation illustrating management operations separated from data operations]

The above scenario indicates that Alice, a high-level owner, doesn’t actually see the data. This is still in preview and I haven’t gotten it to work correctly yet, but I’m keeping an eye on it. You can find more info here: https://docs.microsoft.com/en-us/azure/role-based-access-control/role-definitions. I’m guessing this security enhancement is inspired by GDPR and/or customer needs for more granular permissions, so I’m excited to see where this is headed.
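To illustrate the idea, here’s a sketch of a hypothetical custom role definition which permits management of storage accounts while granting no data operations. This reflects only my current understanding of the preview; the role name, description, and subscription ID are placeholders:

# customRole.json: management of storage accounts, but no data operations.
# Note listKeys is blocked too, since possessing the account keys would
# allow reading the data anyway, bypassing RBAC.
$roleDefinition = @"
{
  "Name": "Storage Operator (No Data Access)",
  "Description": "Can manage storage accounts but cannot read the data within them.",
  "Actions": [ "Microsoft.Storage/storageAccounts/*" ],
  "NotActions": [ "Microsoft.Storage/storageAccounts/listKeys/action" ],
  "DataActions": [],
  "NotDataActions": [],
  "AssignableScopes": [ "/subscriptions/00000000-0000-0000-0000-000000000000" ]
}
"@
$roleDefinition | Out-File -FilePath .\customRole.json
New-AzureRmRoleDefinition -InputFile .\customRole.json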

Finding More Information

Azure Storage Security Guide (a good read regarding the management plane vs. the data plane)

What is Role-Based Access Control (RBAC)?

Understand Role Definitions

Terminology Check - What are Data Flows?

It's another terminology post! Earlier this week I was having a delightful lunch with Angela Henry, Kevin Feasel, Javier Guillen, and Jason Thomas. We were chatting about various new things. Partway through our conversation, Jason stopped me because he thought I was talking about Power BI Dataflows when I was really talking about Azure Data Factory Data Flows. It was kind of a funny moment, actually, but it did illustrate that we have some overlapping terminology coming into our world.

So, with that inspiration, let's have a chat about some of the new data flow capabilities in the Microsoft world, shall we?

Azure Data Factory Data Flow

The new Azure Data Factory (ADF) Data Flow capability is analogous to SSIS data flows: a data flow allows you to build data transformation logic using a graphical interface. A really interesting aspect of ADF Data Flows is that they use Azure Databricks as the runtime engine underneath -- however, you don't actually have to know Spark or Databricks in order to use ADF Data Flows. The goal is for it to be a low-code/no-code way to transform data at scale.

Follow Mark Kromer and the ADF team on Twitter to stay up to date on the rollout of the preview.


More info on ADF Data Flow can be found here: https://aka.ms/adfdataflowdocs

Power BI Dataflows

Power BI Dataflows (yes, this one is branded as one word) are a new type of object in a Power BI workspace which allows you to load data into the Common Data Model. Data is loaded via a web-based version of Power Query, which is why this capability is referred to as self-service data prep. The resulting data is stored in Azure Data Lake Storage Gen 2. Once in the Common Data Model in the data lake, it can be reused across various Power BI datasets -- allowing the data load, transformations, and cleansing to be done once rather than repeated in numerous PBIX files. This capability was known for a little while during the private preview as Power BI Datapools or as 'Common Data Service for Analytics' (CDS-A) -- but the final name looks like it's going to be Power BI Dataflows.

It's still early, so there's not a lot of info available online yet. James Serra wrote up a nice summary and has a few links on his blog. Also, here's a diagram that Chris and I included in the recently updated whitepaper Planning a Power BI Enterprise Deployment which shows our initial understanding of the Power BI Dataflows capability:

[Image: diagram of the Power BI Dataflows capability from the whitepaper]

Note that Pro users can use Power BI Dataflows without requiring Premium. However, my hunch is that this capability will be most appealing for data at scale - i.e., the features that Premium offers with respect to Power BI Dataflows will be pretty compelling, which is why Premium is depicted in the diagram above.

SSIS Data Flow

Data flows have long been a key part of SQL Server Integration Services (SSIS) for data transformations, just like the new capability being added to ADF discussed above. With Azure Data Factory V2, we can also host and execute SSIS packages in Azure.

Microsoft Flow

Just for completeness, I'll cover one more similarly named product. Flow is an Office 365 service for workflow automation between services. It can be used in conjunction with PowerApps and Power BI for different types of workflow automation. Flow lets you do things like route approval requests, send an e-mail alert, or create a task in a project management system.

Now you know there are multiple types of data flows being launched into the world of Microsoft BI (in addition to the good old SSIS data flows we've had forever), and you can cleverly watch out for which one is being bantered about in your techie conversations.

You Might Also Like...

Three Ways to Use Power BI Dataflows

Terminology Check - What is a Power BI App?

 

Lesson Learned - Keep PowerShell Modules Consistent and Up To Date


This is a quick post to share something that happened on a project recently. We began to experience some intermittent issues with Azure Data Factory (V1), and it was ultimately related to an out-of-date AzureRM PowerShell module. What does ADF have to do with the AzureRM PowerShell module, you ask? For this project, the nightly loads are triggered by a signal file which indicates the source data is ready. Therefore, we have a PowerShell script that controls the whole end-to-end process (which overrides the built-in ADF V1 scheduling). The PowerShell script looks for the signal file and then proceeds to resume the Azure Data Factory pipelines (via the Resume-AzureRmDataFactoryPipeline cmdlet). The resume causes each pipeline to execute immediately. When all pipelines are finished, they all get suspended again (via the Suspend-AzureRmDataFactoryPipeline cmdlet) until the next execution of the data load process. This PowerShell process runs on a virtual machine.
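For context, the core of that control flow looks roughly like the sketch below. The resource group, data factory, signal file path, and pipeline names are hypothetical placeholders, and the monitoring of pipeline completion is elided:

# Hypothetical names -- substitute your own.
$resourceGroupName = "MyResourceGroup"
$dataFactoryName   = "MyDataFactoryV1"
$signalFile        = "\\fileshare\loads\source_ready.txt"
$pipelineNames     = @("LoadCustomers", "LoadOrders")

# Wait for the source system to drop the signal file.
while (-not (Test-Path $signalFile)) {
    Start-Sleep -Seconds 300
}

# Resume the suspended ADF V1 pipelines so they execute immediately.
foreach ($pipeline in $pipelineNames) {
    Resume-AzureRmDataFactoryPipeline -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName -Name $pipeline
}

# ...monitor the pipeline runs here; once all have finished...

# Suspend the pipelines again until the next data load.
foreach ($pipeline in $pipelineNames) {
    Suspend-AzureRmDataFactoryPipeline -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName -Name $pipeline
}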

In production recently, we started seeing cases of ADF pipelines that wouldn't resume properly, which resulted in some data not getting loaded. It was inconsistent and intermittent. With the exact same PowerShell script, we couldn't reproduce the issue in UAT. The issue occurred only in production, and not with predictable regularity.

My colleague on the project, Terry Crist, did some investigation and found that the AzureRM module installed on UAT was much newer than what was installed on production. Sure enough, once the AzureRM module was updated in production everything began to run reliably again.

So, this served as a good reminder that (a) UAT and production environments should run the same PowerShell module versions, and (b) preferably the latest version should be installed when possible. In environments that don't have a full-time DBA looking after this sort of thing, it's good for developers to know to watch out for these types of issues too.
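Checking for this kind of version drift only takes a moment. A sketch of the two commands involved (Update-Module needs an elevated session if the module was installed for all users):

# See which versions of the AzureRM module are installed on this machine.
Get-Module -ListAvailable -Name AzureRM | Select-Object Name, Version

# Update to the latest version published in the PowerShell Gallery.
Update-Module -Name AzureRM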

You Might Also Like...

Find Pipelines Currently Running in Azure Data Factory with PowerShell

PowerShell for Assigning and Querying Tags in Azure