Why You Should Create Reports in Power BI Desktop Instead of the Power BI Service

This post is to make you aware that there are some serious downsides to creating reports in the Power BI Service rather than in Power BI Desktop. If you agree with me on this one, please vote for this idea here on the Power BI Ideas site.

The Short Version of the Story

I always recommend that Power BI authors create and edit reports in Power BI Desktop and simply ignore the editing capability in the Power BI Service. My reasons usually come down to:

  1. Knowing which version is the latest (when both Desktop and the web are being used by one or more people).

  2. Having version history available (not available for web-created reports, but possible if you store the PBIX file on OneDrive for Business, SharePoint, or in a source control repository).

  3. Reducing the risk of someone overwriting someone else’s work because two different report editing/publishing options are in use.

These are very valid concerns, but I’ve discovered one additional problem that could easily confuse some people: the side effects you experience if you download and then re-upload a web-created report. Let me explain…

If you create a report directly in the Power BI Service, *theoretically* this is like using a Power BI Dataset Live Connection, right? Because we are prompted to choose a dataset, we are reusing an existing dataset for our new report. Conceptually this is a great thing, but in reality the hub-and-spoke model (with the dataset being the hub and the various reports being the spokes) only works well if you generate all reports from Power BI Desktop.

Here’s the issue: when you download a web-created report (i.e., one created in the Power BI Service), the dataset actually comes along with it in the downloaded PBIX file. Unfortunately, it does not download as the equivalent of a Power BI Dataset Live Connection. Then, if you turn around and re-upload the PBIX to the Power BI Service, you end up with a duplicate report and (even worse) a duplicate dataset. I understand the product team wants everything to work seamlessly for busy users; however, I’d much prefer that the downloaded file behave as a Power BI Dataset Live Connection.
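If you suspect this back-and-forth has already happened in a workspace, one quick way to check is the Power BI REST API. Here’s a minimal sketch (my own illustration, not part of the product) that lists the reports and datasets in a workspace and flags any duplicate names; the access token and workspace ID are placeholders you would supply yourself.

```python
# Minimal sketch: list reports and datasets in a workspace via the Power BI REST API
# and flag duplicate names. The access token and workspace ID are placeholders
# (e.g., a token acquired via Azure AD for the Power BI API).
from collections import Counter

import requests

ACCESS_TOKEN = "<aad-access-token>"   # placeholder
WORKSPACE_ID = "<workspace-guid>"     # placeholder
BASE_URL = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def duplicate_names(endpoint):
    """Return item names that appear more than once for the given endpoint."""
    response = requests.get(f"{BASE_URL}/{endpoint}", headers=HEADERS)
    response.raise_for_status()
    names = [item["name"] for item in response.json()["value"]]
    return [name for name, count in Counter(names).items() if count > 1]

print("Duplicate reports: ", duplicate_names("reports"))
print("Duplicate datasets:", duplicate_names("datasets"))
```

Run against the workspace in the walkthrough below, this would surface both the duplicated report name and the duplicated dataset name.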

The moral of this story: Always create all content in Power BI Desktop if you possibly can. Don’t rely on the ability to download from the Power BI Service. I elaborate on this at the bottom of this post. But first, let me walk you through the process.

The Long Version of the Story

I have a dataset published to the Power BI Service which is called Sales Data Model:

PBI_SalesDataModel.jpg

I’m going to create a report directly in the web so that we can illustrate the issue I’m concerned about:

PBI_CreateReport.jpg

I’ll select the existing dataset for my report being created in the web:

PBI_ChooseADataset.jpg

Next I’ll create my really fancy report and save it in the Power BI Service:

PBI_SaveReport.jpg

Other than the name I chose, of course we cannot tell from the report list whether the report was created in the web or in Power BI Desktop. (And to be fair, if we stopped here we wouldn’t actually have any issues with it being used in the Power BI Service.) At this point, I see my report displayed in the report list:

PBI_Report.jpg

Now let’s download the report we just generated:

PBI_DownloadReport.jpg

After choosing a location to save the download locally, we get our first clue that something is wrong:

PBI_DownloadedReport.jpg

In the above image, see how it’s 1.2 MB? That’s pretty big for my fancy report that has one whole chart. This is our first hint that the dataset has just come along during the download process.

Let’s pretend that we made some edits in Power BI Desktop and that we want to publish them back to the Power BI Service. We can publish the PBIX from Desktop, or upload it from within the Service:

PBI_UploadReport.jpg

Now that we’ve re-uploaded the edited PBIX file, we see the issues. The first issue is that we now have two reports with the same name:

PBI_DuplicateReports.jpg

And the second issue, even worse, is that we now have duplicate datasets (this occurred because the dataset came along during the download):

PBI_DuplicateDatasets.jpg

Summary of Recommendations

Make it a standard practice to always create reports in Power BI Desktop rather than the Power BI Service. This has the following advantages:

  • You always know that the Power BI Desktop file saved to your standard location is the latest version because you trust your process.

  • You always have version history available (assuming that you’re using a standard location such as OneDrive for Business, SharePoint Online, or similar).

  • You have explicit control over how the dataset handles data: import, live connection, or DirectQuery. Use a Power BI Dataset Live Connection whenever an existing dataset already published to the Power BI Service has the data you need (to reduce the number of times the data is stored, reduce duplicated modeling work across datasets, and reduce the number of refresh operations). By doing this from Power BI Desktop, we accomplish exactly what we want in that regard.

If you do want to create reports in the web (or someone with workspace permissions edited a report in the web):

  • Don’t plan to re-upload the downloaded PBIX file. I recently heard a webinar presenter recommend creating reports in the web because you always have the downloaded PBIX to fall back on. That makes me nervous because busy business users won’t all understand what is happening and could easily end up with duplicated reports & datasets if they go back and forth as described above.

  • If you do need to sync up a report in Power BI Desktop, copy and paste the edited visuals from the downloaded PBIX to the actual PBIX. Be sure to discard the downloaded PBIX file since it contains the duplicated dataset. This lets you recover the work that someone did in the web without running the risk of duplicating reports and datasets as described above.

Please Vote to Change the Functionality

If you agree with me on this one, please vote for this idea here on the Power BI Ideas site.

You Might Also Like…

Three Ways to Use Power BI Dataflows

Terminology Check - What are Data Flows?

Terminology Check - What is a Power BI App?

Three Ways to Use Power BI Dataflows

Dataflows in Power BI are a really interesting capability for centralizing and reusing Power Query logic among many different PBIX files. If you’ve found yourself copying M scripts (Power Query formula language) between files, or creating a library of M scripts then you should look into dataflows.

I won’t rehash the basic overview info here because it has been well covered by other bloggers; in particular, Matthew Roche has been writing a lot of good material on this topic.

This post focuses on what I’m referring to as ‘the three ways to use Power BI dataflows.’ I talked about this on the March BlueGranite Office Hours session on Power BI Dataflows (video recording | slides).

Dataflows1.jpg

Dataflows Option 1: Fully Managed by Power BI

In this first option, Power BI handles everything: we use the web-based Power Query Online tool for structuring the data, and Power BI handles scheduling the data refresh.

The underlying data behind the dataflow is stored in a data lake. However, since it’s fully managed, this data lake is not directly accessible or visible to the customer. As with most cloud-based implementations, the infrastructure is hidden under the covers. This is what is happening if your users are utilizing dataflows currently but you haven’t specified a data lake account in the Power BI admin center.

Dataflows_Option1.jpg

Option 1 above is useful when all of your data preparation is done by analysts and Power BI is the only tool which accesses that data.

Dataflows Option 2: Managed by Power BI with an Explicit Data Lake

In this second option, the diagram looks exactly the same as option 1. The key difference is that we have associated our own data lake account with Power BI. This is the “bring your own data lake” scenario, which means we can actually view the data behind the scenes with a tool such as Azure Storage Explorer (dataflow data consists of CSV files plus a model.json file that contains the schema and other metadata).
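To make the “behind the scenes” part concrete, here’s a minimal sketch that lists a dataflow’s files in the associated ADLS Gen2 account using the azure-storage-file-datalake Python SDK. The account name, key, and workspace/dataflow folder names are placeholders; Azure Storage Explorer shows the same content interactively.

```python
# Minimal sketch: browse the files a dataflow writes to the "powerbi" filesystem
# in ADLS Gen2. Account name/key and the workspace/dataflow folder names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",
    credential="<account-key>",
)

# Dataflow content lands in the "powerbi" filesystem, organized by workspace and dataflow.
file_system = service.get_file_system_client("powerbi")

for path in file_system.get_paths(path="<WorkspaceName>/<DataflowName>", recursive=True):
    # Expect model.json (schema + metadata) plus one or more CSV data files.
    print(path.name, "(directory)" if path.is_directory else "(file)")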

With this option we are still using Power Query Online and Power BI for the data refresh, so it’s still very business user-friendly in terms of maintaining the data.

Dataflows_Option2.jpg

Option 2 above is useful when your data analysts like using Power Query Online for data prep, but you still want the ability to access the data independently of Power BI.

Dataflows Option 3: External Dataflows (Managed Outside of Power BI with an Explicit Data Lake)

With the third option, we are maintaining the data with another tool such as Azure Databricks or Azure Data Factory. You could even use something as simple as a text editor if you’d like; Chris Webb talks about that in this blog post. Power BI’s role here is reduced to being a consumer of the data.

Although Power BI doesn’t take responsibility for updating external dataflows, the dataflow can be consumed by PBIX files like any other dataflow. This is great because the user experience for using the dataflow output doesn’t change.

Dataflows_Option3.jpg

Option 3 above is useful when your data lake is part of your strategic data architecture and is integrated with other Azure technologies**. It is more complicated, though, because you’re also responsible for managing the metadata in the model.json file that accompanies the actual data files (a small example follows the footnote below). When you have a balance of corporate BI and managed self-service BI needs, a combination of options 2 and 3 is useful.

**Use of dataflows does currently require a filesystem in ADLS Gen2 called “powerbi”, so at this point in time (March 2019) we cannot point the dataflow to just any folder in the data lake. I hope this changes because segregating this data doesn’t lend itself well to a corporate data lake strategy. Also, naming convention junkie that I am…the name “powerbi” for the filesystem doesn’t make 100% sense when other tools are involved. I’d rather rely on my normal process of organizing the data lake by subject area and security boundaries.
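To give a sense of the metadata you take on with option 3, here’s an illustrative sketch of a bare-bones model.json in the CDM folder format. The entity, attributes, and partition location are made up for the example, and the real format has additional properties, so treat this as a sketch rather than a complete specification.

```python
# Illustrative sketch of a minimal model.json for an externally managed dataflow
# (CDM folder format). Entity names, attributes, and the partition URL are made up;
# consult the CDM folder documentation for the full set of properties.
import json

model = {
    "name": "SalesDataflow",
    "version": "1.0",
    "entities": [
        {
            "$type": "LocalEntity",
            "name": "Customers",
            "attributes": [
                {"name": "CustomerId", "dataType": "int64"},
                {"name": "CustomerName", "dataType": "string"},
            ],
            "partitions": [
                {
                    "name": "Customers-Part1",
                    "location": "https://<storageaccount>.dfs.core.windows.net/powerbi/<Workspace>/<Dataflow>/Customers/Customers.csv",
                }
            ],
        }
    ],
}

# Write the metadata file that sits alongside the CSV data files.
with open("model.json", "w") as f:
    json.dump(model, f, indent=2)
```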

Summary of the 3 Options

It’s important to be aware that all 3 options can be used simultaneously in the same Power BI tenant.

This table summarizes the above options and (very) loosely correlates them to the types of BI implementations they support:

Dataflows_3Options_Table.jpg

Consider this expanded scenario:

DataflowsInPowerBI.jpg

Workspace A is managed by Power BI.

  • If Workspace A is *not* associated to the dataflows storage, the dataflow functionality works normally but data in the lake is not accessible other than through Power Query Online. This is Option 1.

  • If Workspace A *is* associated to the dataflows storage, it’ll be visible in ADLS Gen2.  This is Option 2.

Workspace B is associated to dataflows storage, and is managed outside of Power BI. This is option 3.

Important Settings in the Power BI Admin Center

To allow dataflows to be utilized by your users (i.e., to use any of the 3 options above), dataflows need to be enabled in the tenant settings:

Dataflows_TenantSettings.jpg
 

To be able to “bring your own data lake” (i.e., to use options 2 or 3), you need to do two things. The first task is to associate your Azure Data Lake Storage Gen2 account to the Power BI tenant:

DataflowsSettings.jpg
 

Note that there are currently (as of March 2019) some pretty big limitations with the above setting:

  1. You can only associate one ADLS Gen2 account for your entire Power BI tenant.

  2. It requires an ADLS Gen2 filesystem called “powerbi” which means we cannot point to just any folder in the data lake.

  3. Once the ADLS Gen2 account is associated to Power BI, it cannot be changed. So be very careful with POCs and such until this becomes more flexible.

The second required task to make options 2 or 3 work is a workspace-specific storage setting for every workspace where dataflows might exist:

Dataflows_WorkspaceStorageSetting.jpg

If the workspace storage setting for dataflows is set to No, then the dataflows in that workspace are using option 1 (which means they still work just fine; you just won’t see the data in your data lake even if you’ve set the data lake account in your tenant settings). If the workspace storage setting is set to Yes, then you are free to use options 2 or 3.

To facilitate the above workspace storage setting, there is one more property to be aware of. In the same pane as where we specify the ADLS Gen2 account, there is a Yes/No toggle to allow workspace admins to assign workspaces to the storage account:

DataflowSettings_WorkspaceOwners.jpg
 

At this time the workspace default for dataflows storage is set to No, so it’s easy to forget. I’m hoping that default changes to Yes when a data lake is specified in the tenant settings.

Requirements for ‘Bring Your Own Data Lake’

Here’s a final summary of the requirements to be able to implement option 2 or 3:

Dataflows_RequirementsForBringYourOwnDataLake.jpg

Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Now that Azure Data Lake Storage Gen2 is based on Azure Storage as its foundation, we have a new level to incorporate into our planning process: the file system itself. The file system contains the files and folders, and is equivalent to a container in Azure Blob Storage, which contains blobs. In ADLS Gen1, we didn't have that intermediary level. I talked about this just a bit in #7 of my recent blog entry called 10 Things to Know About Azure Data Lake Storage Gen2, but I'd like to elaborate in this post a bit more on when you might need multiple storage accounts, multiple containers, or multiple file systems to support your data lake.

One caveat: As I’m writing this (March 2019), ADLS Gen2 is young and still evolving in its feature support. This means that some of the blob storage properties mentioned below don’t apply to ADLS Gen2 yet. According to what we've heard from the ADLS Gen2 team, we can expect that all Azure Storage features will be supported on ADLS Gen2 as it evolves. So, here's the perspective I'm taking in this post:

  • From the Azure Blob Storage perspective (so that it's less confusing during this transition period of ADLS Gen2)

  • All properties for all 3 levels are included (even if not yet supported by ADLS Gen2)

  • Files, Tables, and Queues are disregarded for this discussion (though many of the properties we discuss in this post, like the account-level properties, would apply)

The 3 Levels to Plan for in Azure Storage

The 3 levels within Azure Storage that we’re talking about in this post are (1) the account level, (2) the container or file system level, and (3) the blob or file level:

3LevelsOfAzureBlobStorage.jpg
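To relate these three levels to code, here’s a minimal sketch using the azure-storage-file-datalake Python SDK: one service client per storage account, a file system (the container equivalent) within the account, and directories/files within the file system. The account URL, key, and names are placeholders.

```python
# Minimal sketch of the three levels: storage account -> file system (container) -> file.
# The account URL, key, and names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

# Level 1: the storage account
service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",
    credential="<account-key>",
)

# Level 2: a file system (the equivalent of a blob container)
file_system = service.create_file_system("raw")

# Level 3: directories and files within the file system
directory = file_system.create_directory("sales/2019/03")
file_client = directory.create_file("sales.csv")
file_client.upload_data(b"OrderId,Amount\n1,100\n", overwrite=True)
```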
 

Azure Storage Account Properties

The storage account has quite a few properties and settings associated with it. Here are the main ones:

AzureStorageAccountProperties.jpg

A few thoughts regarding the account-level properties (a short provisioning sketch follows this list):

  • You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level, but ACL-type permissions only apply to ADLS Gen2 and not to blob storage.)

  • If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same but transaction costs are higher when the HNS is enabled (discussed in item #8 of this post).

  • If your data residency requirements differ for certain types of data (ex: one type of data that must reside within Canada, while another must remain in Europe), that will definitely require separate storage accounts.

  • Settings such as replication (whether it’s locally redundant or geo-redundant) are specified at the storage account level. This impacts not only your disaster recovery planning, but it also impacts cost for the entire storage account.

  • Your two account keys are at the account level, so be ultra cautious in sharing those out.
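To tie the account-level decisions together, here’s a hedged sketch of provisioning an ADLS Gen2 account with the azure-mgmt-storage Python SDK. The resource group, account name, and region are placeholders, and method names vary a bit across SDK versions, so treat this as illustrative only.

```python
# Hedged sketch: provision a storage account with the account-level choices discussed
# above (replication, region/data residency, hierarchical namespace). Names are
# placeholders; method names differ slightly across azure-mgmt-storage versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="analytics-rg",
    account_name="corpdatalake01",
    parameters=StorageAccountCreateParameters(
        sku=Sku(name="Standard_GRS"),   # replication applies to the whole account
        kind="StorageV2",
        location="canadacentral",       # data residency is fixed per account
        is_hns_enabled=True,            # hierarchical namespace = ADLS Gen2
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)
```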

Azure Blob Storage Container Properties

The container within the storage account has properties associated with it as well:

AzureBlobStorageContainerProperties.jpg
 

A few considerations related to the container-level properties:

  • Role-based access control (RBAC/IAM) can be set at the account level or the container level. The container level is the narrowest RBAC scope that can be specified. And don’t forget that RBAC always inherits and can’t be broken: a container inherits from the account, which inherits from the resource group, which inherits from the subscription.

  • You can set up stored access policies at the container level so that SAS tokens issued at the blob/file level inherit the policy’s constraints (such as an expiration date for access); see the sketch after this list.

  • My favorite container-level property is the immutable policy. An immutable policy can prevent data from being edited or deleted (i.e., the container becomes append-only once the policy is enabled). If you have very firm requirements for data protection, this might justify separate containers which have different policies in place.

  • If you have some publicly available data, that access is specified at the container level.

  • If this is an ADLS Gen2 file system (rather than a blob container), note that Power BI dataflows will reside in one or more file systems.
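Here’s a minimal sketch of the stored access policy idea mentioned above, using the azure-storage-blob Python SDK; the account, key, container name, and policy name are placeholders. SAS tokens generated later that reference the policy by name pick up its permissions and expiry, and revoking the policy revokes those tokens.

```python
# Minimal sketch: define a stored access policy on a container so SAS tokens that
# reference it inherit its permissions and expiry. Account/key/container are placeholders.
from datetime import datetime, timedelta

from azure.storage.blob import AccessPolicy, BlobServiceClient, ContainerSasPermissions

service = BlobServiceClient(
    account_url="https://<storageaccount>.blob.core.windows.net",
    credential="<account-key>",
)
container = service.get_container_client("reference-data")

read_only_policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True),
    start=datetime.utcnow(),
    expiry=datetime.utcnow() + timedelta(days=30),
)

# The policy is stored on the container; SAS tokens can reference it by name later.
container.set_container_access_policy(signed_identifiers={"read-only-30d": read_only_policy})
```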

Azure Storage Blob (File) Properties

And finally, the files within the container have properties associated with them as well:

AzureStorageFileProperties.jpg
 

A few things to be aware of regarding the file-level properties:

  • You can set up a SAS (shared access signature) token if you need to make just one specific file available for access; see the sketch after this list.

  • If we’re talking about directories and files within ADLS Gen2 (instead of blobs within a container), then you would also specify data-level security ACLs (access control lists) at this level. ACLs apply to directories and files. From a security planning perspective, it’s really important to plan both RBAC and ACLs.
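As a concrete example of the SAS scenario mentioned above, here’s a hedged sketch that generates a short-lived, read-only SAS URL for a single blob using the azure-storage-blob Python SDK; the account, key, container, and blob path are placeholders.

```python
# Minimal sketch: generate a short-lived, read-only SAS URL for one specific file (blob).
# Account name/key, container, and blob path are placeholders.
from datetime import datetime, timedelta

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "<storageaccount>"
container_name = "reference-data"
blob_name = "sales/2019/03/sales.csv"

sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key="<account-key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"
print(sas_url)
```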

Final Thoughts

My rule of thumb is to start with a consolidated data lake. Split into separate storage accounts or containers/file systems only when your requirements justify it. The more separation that exists, the harder it is for users to find data, so take that into careful consideration. However…it’s ok to be liberal with the separation of your directory structure within the file system itself.

Also, keep in mind that a lot of the RBAC roles are evolving right now with regard to flexibility & granularity of managing the control plane vs. the data plane. Make sure to look into the preview capabilities so you make the best long-term decision.

I hope this is helpful for planning out your data lake / data storage needs.

You Might Also Like…

Granting Permissions in Azure Data Lake Storage

Resources for Learning About Azure Data Lake Storage Gen2

FAQs About Organizing a Data Lake

Zones In A Data Lake

Data Lake Use Cases and Planning Considerations

Resources for Learning About Azure Data Lake Storage Gen2

A couple of people have asked me recently about how to 'bone up' on the new data lake service in Azure. The way I see it, there are two aspects: (a) the technology itself, and (b) data lake principles and architectural best practices. Below are some links to resources that you should find helpful.

Learning about ADLS Gen2 Technology

Azure Data Lake Storage Gen2 is new, so information specific to it is still limited. However, since it's built upon the foundation of Azure Storage, there is quite a lot of relevant information available (though in all fairness, ADLS Gen2 hasn't yet reached feature parity with blob storage). Here are some resources about the technology:

10 Things to Know about Azure Data Lake Storage Gen2

Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Best Practices for Using Azure Data Lake Storage Gen2

The Azure Blob Filesystem Driver (ABFS): A Dedicated Storage Driver for Hadoop

Azure Data Lake Storage Gen2 Hierarchical Namespace

Use the Azure Data Lake Storage Gen2 URI

Overview of Azure Data Lake Storage Gen2 [video]

Pluralsight Course: Implementing Azure Data Lake Storage Gen2 by Xavier Morera [video—requires subscription]

Learning about Data Lake Principles and Architectural Best Practices

Just like when designing a database, there are some important aspects to designing a data lake that improve usability, security, performance, and governance. This is your enterprise data we're talking about, right? Some up-front planning for how this data is structured is warranted (because yes, a data lake is more agile...but not so agile that it becomes the dreaded swamp). Here are a few resources for learning about principles and best practices:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake

FAQs About Organizing a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lakes in a Modern Data Architecture [ebook]

The Data Lake Manifesto: 10 Best Practices

A Smarter Way to Jump Into Data Lakes

Big Data by Nathan Marz [book] <— This book is getting older now, but the conceptual chapters are excellent. Skip the technology chapters & focus on the concepts & it's a worthwhile read.

There are also several books on data lakes. I don’t have a favorite. Just keep in mind that some things are opinions and personal preferences. Though data lakes are maturing, best practices are still emerging. Many articles and book intros overstate the benefits and under-emphasize the challenges, so watch out for that.

Following the Maturity of ADLS Gen2

These are two important URLs for tracking what is and isn't supported in ADLS Gen2:

Known Issues 

Upgrade Your Big Data Analytics Solutions from ADLS Gen1 to ADLS Gen2