Three Ways to Use Power BI Dataflows

Dataflows in Power BI are a really interesting capability for centralizing and reusing Power Query logic across many different PBIX files. If you’ve found yourself copying M scripts (Power Query formula language) between files, or maintaining a library of M scripts, then you should look into dataflows.

I won’t rehash the basic overview info here because it has been well covered by other bloggers. In particular, Matthew Roche has been writing a lot of good material on this topic.

This post focuses on what I’m referring to as ‘the three ways to use Power BI dataflows.’ I talked about this on the March BlueGranite Office Hours session on Power BI Dataflows (video recording | slides).

Dataflows1.jpg

Dataflows Option 1: Fully Managed by Power BI

In this first option, Power BI handles everything. We use the web-based Power Query Online tool to structure the data, and Power BI schedules the data refresh.

The underlying data behind the dataflow is stored in a data lake. However, since it’s fully managed, this data lake is not directly accessible or visible to the customer. As with most cloud-based implementations, the infrastructure is hidden under the covers. This is what is happening if your users are utilizing dataflows currently but you haven’t specified a data lake account in the Power BI admin center.

Dataflows_Option1.jpg

Option 1 above is useful when all of your data preparation is done by analysts and Power BI is the only tool which accesses that data.

Dataflows Option 2: Managed by Power BI with an Explicit Data Lake

In this second option, the diagram looks exactly the same as Option 1. The key difference is that we have associated our own data lake account with Power BI. This is the “bring your own data lake” scenario, which means we can actually view the data behind the scenes with a tool such as Azure Storage Explorer (dataflow data consists of CSV files plus a model.json file that contains the schema and other metadata).
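As a quick illustration of what you would see in the lake, here is a minimal sketch of browsing a dataflow’s output with the azure-storage-file-datalake Python package (you could do the same interactively with Azure Storage Explorer). The account URL, key, and folder path are placeholders, and the exact folder layout depends on your workspace and dataflow names.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account details -- substitute your own ADLS Gen2 account and key.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<storage-account-key>",
)

# Dataflow output lands in the "powerbi" filesystem (as of March 2019).
file_system = service.get_file_system_client("powerbi")

# List the CSV files and model.json beneath a workspace/dataflow folder.
# The folder path below is illustrative only.
for path in file_system.get_paths(path="MyWorkspace/MyDataflow"):
    print(path.name)

# Download the model.json metadata file to inspect the schema.
model_file = file_system.get_file_client("MyWorkspace/MyDataflow/model.json")
print(model_file.download_file().readall().decode("utf-8"))
```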

With this option we are still using Power Query Online and Power BI for the data refresh, so it’s still very business user-friendly in terms of maintaining the data.

Dataflows_Option2.jpg

Option 2 above is useful when your data analysts like using Power Query Online for data prep, but you still want the ability to access the data independently of Power BI.

Dataflows Option 3: External Dataflows (Managed Outside of Power BI with an Explicit Data Lake)

With the third option, we are maintaining the data with another tool such as Azure Databricks or Azure Data Factory. You could even use something as simple as a text editor if you’d like - Chris Webb talks about that in this blog post. Power BI’s role here is reduced to being a consumer of the data.

Although Power BI doesn’t take responsibility for updating external dataflows, the dataflow can be consumed by PBIX files like any other dataflow. This is great because the user experience for using the dataflow output doesn’t change.

Dataflows_Option3.jpg

Option 3 above is useful when your data lake is part of your strategic data architecture and is integrated with other Azure technologies**. It is more complicated, though, because you’re also responsible for managing the metadata in the model.json file that accompanies the actual data files (a sketch follows after the footnote below). When you have a balance of corporate BI and managed self-service BI needs, a combination of options 2 and 3 is useful.

**Use of dataflows does currently require a filesystem in ADLS Gen2 called “powerbi”, so at this point in time (March 2019) we cannot point the dataflow at just any folder in the data lake. I hope this changes because segregating this data doesn’t lend itself well to a corporate data lake strategy. Also, naming convention junkie that I am…using the name “powerbi” for the filesystem doesn’t make 100% sense when other tools are involved. I’d rather rely on my normal process of organizing the data lake by subject area and security boundaries.
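To make the model.json responsibility mentioned above more concrete, below is a rough sketch of generating a minimal metadata file from Python. The dataflow name, entity, attributes, and partition location are all invented for illustration, and this only approximates the Common Data Model folder format; check the current model.json documentation for the full set of required properties before relying on it.

```python
import json

# Illustrative only: a minimal model.json for an externally managed dataflow.
# Names and the partition location are placeholders.
model = {
    "name": "SalesDataflow",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "Sales",
            "attributes": [
                {"name": "OrderID", "dataType": "int64"},
                {"name": "OrderDate", "dataType": "dateTime"},
                {"name": "Amount", "dataType": "double"},
            ],
            "partitions": [
                {
                    "name": "Sales-2019-03",
                    "location": "https://mydatalake.dfs.core.windows.net/powerbi/ExternalSales/Sales.csv",
                }
            ],
        }
    ],
}

with open("model.json", "w") as f:
    json.dump(model, f, indent=2)
```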

Summary of the 3 Options

It’s important to be aware that all 3 options can be used simultaneously in the same Power BI tenant.

This table summarizes the above options and (very) loosely correlates them to the types of BI implementations they support:

Dataflows_3Options_Table.jpg

Consider this expanded scenario:

DataflowsInPowerBI.jpg

Workspace A is managed by Power BI.

  • If Workspace A is *not* associated to the dataflows storage, the dataflow functionality works normally but data in the lake is not accessible other than through Power Query Online. This is Option 1.

  • If Workspace A *is* associated to the dataflows storage, the dataflow data will be visible in ADLS Gen2. This is Option 2.

Workspace B is associated to dataflows storage, and is managed outside of Power BI. This is Option 3.

Important Settings in the Power BI Admin Center

To allow dataflows to be utilized by your users (i.e., to use any of the 3 options above), dataflows need to be enabled in the tenant settings:

Dataflows_TenantSettings.jpg
 

To be able to “bring your own data lake” (i.e., to use options 2 or 3), you need to do two things. The first task is to associate your Azure Data Lake Storage Gen2 account to the Power BI tenant:

DataflowsSettings.jpg
 

Note that there are currently (as of March 2019) some pretty big limitations with the above setting:

  1. You can only associate one ADLS Gen2 account for your entire Power BI tenant.

  2. It requires an ADLS Gen2 filesystem called “powerbi”, which means we cannot point to just any folder in the data lake (a sketch of creating this filesystem follows after this list).

  3. Once the ADLS Gen2 account is associated to Power BI, it cannot be changed. So be very careful with POCs and such until this becomes more flexible.
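Regarding limitation 2, if you are provisioning the “powerbi” filesystem yourself, a minimal sketch with the azure-storage-file-datalake Python package might look like the following. The account URL and key are placeholders; depending on how you attach the account, Power BI may also handle this step, so check the current setup documentation.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account details -- substitute your own ADLS Gen2 account and key.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<storage-account-key>",
)

# Dataflows currently (March 2019) require a filesystem named exactly "powerbi".
file_system = service.create_file_system(file_system="powerbi")
print("Created filesystem:", file_system.file_system_name)
```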

The second required task to make options 2 or 3 work is a workspace-specific storage setting for every workspace where dataflows might exist:

Dataflows_WorkspaceStorageSetting.jpg

If the workspace storage setting for dataflows is set to No, then the dataflows in that workspace are using option 1 (which means they still work just fine; you just won’t see the data in your data lake even if you’ve set the data lake account in your tenant settings). If the workspace storage setting is set to Yes, then you are free to use options 2 or 3.

To facilitate the above workspace storage setting, there is one more property to be aware of. In the same pane where we specify the ADLS Gen2 account, there is a Yes/No toggle to allow workspace admins to assign workspaces to the storage account:

DataflowSettings_WorkspaceOwners.jpg
 

At this time the workspace default for dataflows storage is set to No, so it’s easy to forget to change. I’m hoping that default changes to Yes if there’s a data lake specified in the tenant settings.

Requirements for ‘Bring Your Own Data Lake’

Here’s a final summary of the requirements to be able to implement option 2 or 3:

Dataflows_RequirementsForBringYourOwnDataLake.jpg

Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Now that Azure Data Lake Storage Gen2 is based on Azure Storage as its foundation, we have a new level to incorporate into our planning process: the file system itself. The file system contains the files and folders, and is equivalent to a container in Azure Blob Storage, which contains blobs. In ADLS Gen1, we didn't have that intermediary level. I talked about this just a bit in #7 of my recent blog entry called 10 Things to Know About Azure Data Lake Storage Gen2, but I'd like to elaborate a bit more in this post about when you might need multiple storage accounts, multiple containers, or multiple file systems to support your data lake.

One caveat: As I’m writing this (March 2019), ADLS Gen2 is young and still evolving in its feature support. This means that some of the blob storage properties mentioned below don’t apply to ADLS Gen2 — yet. According to what we've heard from the ADLS Gen2 team, we can expect that all Azure Storage features will be supported on ADLS Gen2 as it evolves. So, here's the perspective I'm taking in this post:

  • From the Azure Blob Storage perspective (so that it's less confusing during this transition period of ADLS Gen2)

  • All properties for all 3 levels are included (even if not yet supported by ADLS Gen2)

  • Files, Tables, and Queues are disregarded for this discussion (though many of the properties we discuss in this post, like the account-level properties, would apply)

The 3 Levels to Plan for in Azure Storage

The 3 levels within Azure Storage that we’re talking about in this post are (1) the account level, (2) the container or file system level, and (3) the blob or file level:

3LevelsOfAzureBlobStorage.jpg
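One concrete way to see the three levels is in the URIs used to address the data. The snippet below uses placeholder names and the common endpoint patterns (blob, DFS, and the ABFS driver URI); it is a sketch for orientation, not an exhaustive reference.

```python
# The three levels, illustrated with placeholder names.
account = "mydatalake"                    # level 1: storage account
file_system = "rawdata"                   # level 2: container / file system
file_path = "sales/2019/03/orders.csv"    # level 3: blob / file (plus folders)

# Blob endpoint (the Azure Blob Storage perspective):
blob_url = f"https://{account}.blob.core.windows.net/{file_system}/{file_path}"

# DFS endpoint (the ADLS Gen2 perspective):
dfs_url = f"https://{account}.dfs.core.windows.net/{file_system}/{file_path}"

# ABFS URI used by Hadoop/Spark through the ABFS driver:
abfs_uri = f"abfss://{file_system}@{account}.dfs.core.windows.net/{file_path}"

print(blob_url, dfs_url, abfs_uri, sep="\n")
```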
 

Azure Storage Account Properties

The storage account has quite a few properties and settings associated with it. Here are the main ones:

AzureStorageAccountProperties.jpg

A few thoughts regarding the account-level properties:

  • You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level, but ACL-type permissions only apply to ADLS Gen2 and not to blob storage.)

  • If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same, but transaction costs are higher when HNS is enabled (discussed in item #8 of this post). A provisioning sketch follows after this list.

  • If your data residency requirements differ for certain types of data (ex: one type of data that must reside within Canada, while another must remain in Europe), that will definitely require separate storage accounts.

  • Settings such as replication (whether it’s locally redundant or geo-redundant) are specified at the storage account level. This affects not only your disaster recovery planning but also the cost for the entire storage account.

  • Your two account keys are at the account level, so be ultra-cautious in sharing those out.
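To tie several of the account-level settings together, here is a hedged sketch of provisioning a storage account with the azure-mgmt-storage Python package, showing where the region (data residency), replication (SKU), and hierarchical namespace are chosen. The resource group, account name, and region are placeholders, and the management SDK surface varies by version, so verify the parameters against the version you install.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

# Placeholder identifiers -- replace with your own subscription and names.
subscription_id = "<subscription-id>"
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

params = StorageAccountCreateParameters(
    location="canadacentral",        # region drives data residency
    kind="StorageV2",
    sku=Sku(name="Standard_GRS"),    # replication: LRS, GRS, etc. (affects cost)
    is_hns_enabled=True,             # hierarchical namespace = ADLS Gen2
)

poller = client.storage_accounts.begin_create("rg-datalake", "mydatalake", params)
account = poller.result()
print(account.name, account.sku.name)
```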

Azure Blob Storage Container Properties

The container within the storage account has properties associated with it as well:

AzureBlobStorageContainerProperties.jpg
 

A few considerations related to the container-level properties:

  • Role-based access control (RBAC/IAM) can be set at the account level or the container level. The container level is the narrowest RBAC scope that can be specified. And don’t forget that RBAC always inherits and can’t be broken: a container inherits from the account, which inherits from the resource group, which inherits from the subscription.

  • You can set up stored access policies, which your SAS tokens at the blob/file level can then reference for things like an expiration date for access (see the sketch after this list).

  • My favorite container-level property is the immutable policy. An immutable policy can prevent data from being edited or deleted (i.e., it allows appends only once the policy is enabled). If you have very firm requirements for data protection, this might justify separate containers with different policies in place.

  • If you have some publicly available data, that access is specified at the container level.

  • If this is an ADLS Gen2 file system (rather than a blob container), note that Power BI dataflows will reside in one or more file systems.
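As a sketch of the stored access policy idea above, here is what defining a container-level policy might look like with the azure-storage-blob Python package; the account, container name, and policy identifier are placeholders.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import AccessPolicy, BlobServiceClient, ContainerSasPermissions

# Placeholder account details.
service = BlobServiceClient(
    account_url="https://mydatalake.blob.core.windows.net",
    credential="<storage-account-key>",
)
container = service.get_container_client("rawdata")

# A stored access policy: SAS tokens that reference "read-only-90d" inherit its
# permissions and expiry, so access can be revoked centrally by changing or
# deleting the policy on the container.
policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=90),
)
container.set_container_access_policy(signed_identifiers={"read-only-90d": policy})
```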

Azure Storage Blob (File) Properties

And finally, the files within the container have properties associated with them as well:

AzureStorageFileProperties.jpg
 

A few things to be aware of regarding the file-level properties:

  • You can set up a SAS (shared access signature) token if you need to make just one specific file available for access.

  • If we’re talking about directories and files within ADLS Gen2 instead of blobs within a container, then you would also specify data-level security via ACLs (access control lists) at this level. ACLs apply to directories and files. From a security planning perspective, it’s really important to plan both RBAC and ACLs (a sketch of a single-file SAS and a directory ACL follows after this list).
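Here is a hedged sketch of both file-level ideas, using the azure-storage-blob and azure-storage-file-datalake Python packages; the account, container, paths, and Azure AD object ID are placeholders.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT = "mydatalake"
KEY = "<storage-account-key>"

# 1) A SAS token scoped to one specific file (blob), read-only for 7 days.
sas = generate_blob_sas(
    account_name=ACCOUNT,
    container_name="rawdata",
    blob_name="sales/2019/03/orders.csv",
    account_key=KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)
print(f"https://{ACCOUNT}.blob.core.windows.net/rawdata/sales/2019/03/orders.csv?{sas}")

# 2) A POSIX-style ACL on an ADLS Gen2 directory, granting a (made-up) Azure AD
#    object ID read/execute. Note this call replaces the directory's ACL.
service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT}.dfs.core.windows.net", credential=KEY
)
directory = service.get_file_system_client("rawdata").get_directory_client("sales")
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:00000000-0000-0000-0000-000000000000:r-x"
)
```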

Final Thoughts

My rule of thumb is to start with a consolidated data lake. Split into separate storage accounts or containers/file systems only when it’s justified to do so based on your requirements. The more separation that exists, the harder it is for users to find data, so take that into careful consideration. However…it’s ok to be liberal with the separation of your directory structure within the file system itself.

Also, keep in mind that a lot of the RBAC roles are evolving right now with regard to flexibility & granularity of managing the control plane vs. the data plane. Make sure to look into the preview capabilities so you make the best long-term decision.

I hope this is helpful for planning out your data lake / data storage needs.

You Might Also Like…

Granting Permissions in Azure Data Lake Storage

Resources for Learning About Azure Data Lake Storage Gen2

FAQs About Organizing a Data Lake

Zones In A Data Lake

Data Lake Use Cases and Planning Considerations

Resources for Learning About Azure Data Lake Storage Gen2

A couple of people have asked me recently about how to 'bone up' on the new data lake service in Azure. The way I see it, there are two aspects: (a) the technology itself, and (b) data lake principles and architectural best practices. Below are some links to resources that you should find helpful.

Learning about ADLS Gen2 Technology

Azure Data Lake Storage Gen2 is new, so information specific to it is limited. However, since it's built upon the foundation of Azure Storage, there is quite a lot of relevant information available (though in all fairness ADLS Gen2 hasn't yet reached feature parity with blob storage). Here are some resources about the technology:

10 Things to Know about Azure Data Lake Storage Gen2

Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Best Practices for Using Azure Data Lake Storage Gen2

The Azure Blob Filesystem Driver (ABFS): A Dedicated Storage Driver for Hadoop

Azure Data Lake Storage Gen2 Hierarchical Namespace

Use the Azure Data Lake Storage Gen2 URI

Overview of Azure Data Lake Storage Gen2 [video]

Pluralsight Course: Implementing Azure Data Lake Storage Gen2 by Xavier Morera [video—requires subscription]

Learning about Data Lake Principles and Architectural Best Practices

Just like when designing a database, there are some important aspects to designing a data lake that improve usability, security, performance, and governance. This is your enterprise data we're talking about, right? Some up-front planning for how this data is structured is warranted (because yes, a data lake is more agile...but not so agile that it becomes the dreaded swamp). Here are a few resources for learning about principles and best practices:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake

FAQs About Organizing a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lakes in a Modern Data Architecture [ebook]

The Data Lake Manifesto: 10 Best Practices

A Smarter Way to Jump Into Data Lakes

Big Data by Nathan Marz [book] <— This book is getting older now, but the conceptual chapters are excellent. Skip the technology chapters & focus on the concepts & it's a worthwhile read.

There are also several books on data lakes. I don’t have a favorite. Just keep in mind that some things are opinions and personal preferences. Though data lakes are maturing, best practices are still emerging. Many articles and book intros overstate the benefits and under-emphasize the challenges, so watch out for that.

Following the Maturity of ADLS Gen2

These are two important URLs for tracking what is and isn't supported in ADLS Gen2:

Known Issues 

Upgrade Your Big Data Analytics Solutions from ADLS Gen1 to ADLS Gen2