Querying Data in Azure Data Lake Store with Power BI

UPDATE March 10, 2019: This post applies only to Azure Data Lake Storage Gen1. Azure Data Lake Storage Gen2 is not yet directly supported by Power BI (or Azure Analysis Services).

Let's say you have data in Azure Data Lake Store (ADLS) that you want to report on directly in Power BI. You might be doing this for early exploratory data efforts, or you might have some curated data which has been prepared in the data lake for analysis by tools such as Power BI.

The screenshots & functionality discussed for ADLS and Power BI are current as of early May 2018.

In the example I'm using, my Data Lake Store service is called "bankingadls." The file structure I'm working with looks like this:

ADLS_FileStructure.jpg

We're going to be pulling the data files from the RawData subfolders into Power BI.

What do you need to be able to retrieve ADLS data into Power BI?

  • Read and execute permissions to the data stored in ADLS. I talked about permissions in this blog post.

  • The source file(s) need a format with a reliable structure that Power BI can understand.

  • If you're combining multiple files (such as all 3 shown in the screenshot above), each file needs to have the same structure/layout. Power BI is fairly forgiving here: from what I've seen, if one file is missing a column that's present in the other files, it can cope with something that simple. The rule of thumb when designing a data lake is to use the same data structure within each folder, so hopefully varying data formats aren't a problem--otherwise you'll have to cleanse & standardize the data before it can be analyzed.

What can you query from ADLS?

You can connect to the data stored in Azure Data Lake Store. What you *cannot* currently connect to is the data stored in the Catalog tables/views/stored procedures within Azure Data Lake Analytics (hopefully connectivity to the ADLA Catalog objects from tools other than U-SQL will be available soon--you can vote for Power BI connectivity to the Catalog on this UserVoice suggestion).

You're not sending a U-SQL query here. Rather, Power BI sends a web API request to an endpoint.

With an ADLS data source, you have to import the data into Power BI Desktop. There is no option for DirectQuery.

Should you use https:// or adl:// to connect to ADLS?

In the portal you may notice that you have two ways to connect:

ADLS_URLandURI.jpg

https:// is a WebHDFS-compatible endpoint that allows integration of the ADL Store with open source projects and other languages.

adl:// is the “AzureDataLakeFilesystem” endpoint, which is also secure and provides additional performance enhancements on top of WebHDFS.
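For the "bankingadls" account in this example, the two forms look like this (these follow the standard ADLS Gen1 endpoint pattern, so treat the exact strings as a sketch rather than a copy/paste source):

```
URL:  https://bankingadls.azuredatalakestore.net
URI:  adl://bankingadls.azuredatalakestore.net
```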

I typically use the adl:// endpoint in case it gives me a performance boost. From what I can tell in Fiddler, both methods send the same GET request from Power BI, which looks like this:

ADLS_fromPowerBI_GetRequest.jpg
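As a rough sketch (not a capture--the real request is in the screenshot above), a WebHDFS-style read against ADLS Gen1 looks something like this, where the file path is hypothetical:

```
GET https://bankingadls.azuredatalakestore.net/webhdfs/v1/RawData/2017/01/SomeFile.csv?op=OPEN&read=true HTTP/1.1
Authorization: Bearer <OAuth2 access token>
```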

Connecting to one file in ADLS from Power BI

Connecting to one individual file in ADLS is pretty straightforward. In the Get Data area within Power BI Desktop, we'll choose the Azure Data Lake Store connector:

PowerBI_ADLS.jpg
 

When prompted for the URL, you can put in either the URI or the URL; both work. For one file, use the full path to that file:

URL_ADLS_OneFile.jpg

Respond to the security dialog box when prompted (if you're accessing this ADL Store for the first time). Next, choose Edit to go into the Query Editor:

PowerBI_ADLS2.jpg

Here's where it might look funny at first. What it's actually showing you is the file's metadata. To get to the data, click on the hyperlink called Binary:

PowerBI_ADLSData.jpg

And now we've got a preview of the actual file data. Power BI auto-created a few steps to tidy up the data:

PowerBI_ADLSData2.jpg
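If you open the Advanced Editor at this point, the auto-generated M script looks roughly like the sketch below. The file name and CSV options here are assumptions for illustration; your auto-generated steps will reflect your actual file:

```
let
    // Connect to a single file (the adl:// and https:// forms both work)
    Source = DataLake.Contents("adl://bankingadls.azuredatalakestore.net/RawData/2017/01/SomeFile.csv"),
    // Navigate from the metadata row to the file contents (the Binary)
    FileContent = Source{0}[Content],
    // Interpret the binary as a CSV document
    ImportedCsv = Csv.Document(FileContent, [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]),
    // Promote the first row to column headers
    PromotedHeaders = Table.PromoteHeaders(ImportedCsv, [PromoteAllScalars = true])
in
    PromotedHeaders
```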

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Connecting to a folder containing multiple files from Power BI

Connecting to a folder is helpful when you have numerous files you want to consolidate. In a data lake, it's common to have data partitioned into subfolders by date, or organized in other logical ways that allow the data to be stored incrementally.

There are a couple of differences when retrieving data at the folder level instead. The first difference is that we specify the folder path in the URL:

URL_ADLS_Folder.jpg
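In M terms, the folder-level source is just the folder path passed to the same connector, along these lines:

```
let
    // Point at the folder; the result is a table listing its contents
    // (one row per file or subfolder, with a nested Content column)
    Source = DataLake.Contents("adl://bankingadls.azuredatalakestore.net/RawData")
in
    Source
```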

The next difference is how the metadata is initially presented. The metadata for this "table" is actually the subfolder - for us, that's the 2017 subfolder under RawData. If we click on the Table hyperlink, we are taken to the metadata for the monthly folders:

PowerBI_ADLS3.jpg

At this point, we don't want to navigate down any further, because we want the files present in each of those monthly folders. Instead, what we want to do is click the "Expand" button:

PowerBI_ADLS_CombineFiles.jpg
 

Your steps above might differ a bit depending on your folder structure. 
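In M, the expand step looks roughly like this sketch; the column list is an assumption based on my folder structure, so adjust it to match yours:

```
let
    Source = DataLake.Contents("adl://bankingadls.azuredatalakestore.net/RawData"),
    // Expand the nested folder tables down to the individual file level
    Expanded = Table.ExpandTableColumn(Source, "Content",
        {"Name", "Extension", "Content"},
        {"Content.Name", "Content.Extension", "Content.Content"})
in
    Expanded
```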

Power BI has now detected the presence of the individual files across the multiple folders. Next, click on the "Combine Files" button:

PowerBI_ADLS_CombineFiles4.jpg

The Query Editor uses the first file to determine what the column structure of the data will be:

PowerBI_ADLS_CombineFiles3.jpg

When combining files, be sure to choose "Skip files with errors." I had some null rows mixed in with the binary file list, which Power BI saw as file errors, so a full data refresh failed with: "Failed to save modifications to the server. The parameter is expected to be of type Text.Type or Binary.Type." I believe the sample query & parameter the Query Editor uses to figure out the metadata were choking on those null values. I'm not certain where the null rows came from, but choosing "Skip files with errors" solved the issue (or you could filter out null values in the Content.Content column shown above).
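If you'd rather filter than skip, here's a minimal sketch, assuming the expanded binary column is named Content.Content as in the screenshot above (continuing from the hypothetical Expanded step shown earlier):

```
let
    Source = DataLake.Contents("adl://bankingadls.azuredatalakestore.net/RawData"),
    Expanded = Table.ExpandTableColumn(Source, "Content",
        {"Name", "Content"}, {"Content.Name", "Content.Content"}),
    // Remove rows without an actual file binary so the Combine Files
    // helper queries don't fail on null values
    FilteredFiles = Table.SelectRows(Expanded, each [Content.Content] <> null)
in
    FilteredFiles
```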

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Refreshing the data in the Power BI Service

In order to schedule a data refresh, we'll want to publish the PBIX to the Power BI Service. (Reminder: if you plan to share this with others, lean towards using an app workspace. Sharing out of your personal workspace should be done only in a limited way.)

A gateway is not needed in the Power BI Service since it's pulling from an Azure service.

The data source credentials for the refresh use the OAuth2 protocol which is common for web requests:

PowerBI_ADLS_DataRefresh.jpg
 

Now for the bad news. The OAuth2 protocol relies on access tokens, and those tokens expire at regular intervals. (I'm currently testing this with a new refresh schedule created yesterday; I'll update this post as soon as I have verified what the current expiration interval is--thankfully it's longer than it used to be.) Once the refresh fails due to token expiration, you'll have to edit the credentials in the dataset properties. Usually our way around this issue (for instance, when Azure Data Factory needs to access ADLS) is to use an Azure application (service principal) for authentication, but that's not currently supported here either.

Depending on the quantity and/or size of files in the data lake, the data refresh may take a while. Power BI sends one HTTPS GET request per file to retrieve the data.

Like This Content?

If you are integrating data between Azure services, you might be interested in an all-day session Meagan Longoria and I are presenting at PASS Summit in November. It's called "Designing Modern Data and Analytics Solutions in Azure." Check out info here: http://www.pass.org/summit/2018/Sessions/Details.aspx?sid=78885 

You Might Also Like...

Why the Default Summarization Property in Power BI is So Important

Checklist for Finalizing a Data Model in Power BI Desktop

Data Refresh Issues in the Power BI Service Due to Invalid Dates