Querying Data in Azure Data Lake Store with Power BI

Let's say you have data in Azure Data Lake Store (ADLS) that you want to report on directly in Power BI. You might be doing this for early exploratory data efforts, or you might have curated data that has been prepared in the data lake for analysis tools such as Power BI. 

The screenshots & functionality discussed for ADLS and Power BI are current as of early May 2018.

In the example I'm using, my Data Lake Store service is called "bankingadls." The file structure I'm working with looks like this:

ADLS_FileStructure.jpg

We're going to be pulling the data files from the RawData subfolders into Power BI.

What do you need to be able to retrieve ADLS data into Power BI?

  • Read and execute permissions to the data stored in ADLS. I talked about permissions in this blog post.
  • The source file(s) need a format with a reliable structure that Power BI can understand.
  • If you're combining multiple files (such as all 3 shown in the screenshot above), each file needs to have the same structure/file layout. Power BI is pretty smart about this, so if one file is missing a column that's present in the other files, it can cope with something simple like that from what I've seen. The rule of thumb when designing a data lake is to use the same data structure within each folder, so hopefully varying data formats aren't a problem--otherwise you'll have to cleanse & standardize the data before it can be analyzed. (A quick way to sanity-check this with PowerShell is sketched below.)
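
If you want to sanity-check that the files in a folder share the same header row before pointing Power BI at the folder, here's a minimal PowerShell sketch. It's illustrative only: the account name, folder path, and 512-byte preview size are assumptions, and it relies on the AzureRM.DataLakeStore cmdlets and an existing Azure login.

#Sketch only: preview the header row of each file in a folder to confirm the layouts match
$accountName = 'bankingadls'
$folderPath = '/RawData/2017/Jan'     #hypothetical subfolder
foreach ($file in (Get-AzureRmDataLakeStoreChildItem -AccountName $accountName -Path $folderPath))
    {
     #Read the first 512 bytes of the file and show its first line (the header row)
     $preview = Get-AzureRmDataLakeStoreItemContent -AccountName $accountName `
        -Path "$folderPath/$($file.Name)" -Length 512
     $firstLine = ($preview -split "`n")[0].TrimEnd("`r")
     Write-Host "$($file.Name): $firstLine"
    }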

What can you query from ADLS?

You can connect to the data stored in Azure Data Lake Store. What you *cannot* connect to currently is the data stored in the Catalog tables/views/stored procedures within Azure Data Lake Analytics (hopefully connectivity to the ADLA Catalog objects from tools other than U-SQL is available soon--you can vote for Power BI connectivity to the Catalog on this UserVoice suggestion).

You're not sending a U-SQL query here. Rather, Power BI sends a web API request to an endpoint.

With an ADLS data source, you have to import the data into Power BI Desktop. There is no option for DirectQuery.

Should you use https:// or adl:// to connect to ADLS?

In the portal you may notice that you have two ways to connect:

ADLS_URLandURI.jpg

https:// is a WebHDFS-compatible endpoint which allows integration of the ADL Store with open source projects and other languages.

adl:// is the “AzureDataLakeFilesystem” endpoint, which is also secure and provides additional performance enhancements on top of WebHDFS. 

I typically use the adl:// endpoint in case it gives me a performance boost. From what I can tell in Fiddler, both methods send the same GET request from Power BI, which looks like this:

ADLS_fromPowerBI_GetRequest.jpg
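
For illustration, here's a hedged PowerShell sketch of that style of request against the WebHDFS-compatible REST endpoint. The folder, file name, and bearer token are placeholders rather than values from my Fiddler trace:

#Sketch only: the same kind of GET requests, issued manually with Invoke-RestMethod
$accountName = 'bankingadls'
$folderPath = 'RawData/2017'                          #hypothetical folder
$accessToken = '<OAuth2 bearer token from Azure AD>'  #placeholder

$baseUri = "https://$accountName.azuredatalakestore.net/webhdfs/v1"
$headers = @{ Authorization = "Bearer $accessToken" }

#List the items in a folder
Invoke-RestMethod -Method Get -Headers $headers -Uri "$baseUri/$($folderPath)?op=LISTSTATUS"

#Read the contents of a single (hypothetical) file
Invoke-RestMethod -Method Get -Headers $headers -Uri "$baseUri/$folderPath/Jan/File1.csv?op=OPEN&read=true"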

Connecting to one file in ADLS from Power BI

Connecting to one individual file in ADLS is pretty straightforward. In the Get Data area within Power BI Desktop, we'll choose the Azure Data Lake Store connector:

PowerBI_ADLS.jpg
 

When prompted for the URL, you can put in either the URL (https://) or the URI (adl://) - both will work. For one file, you'll use the full path to that file:

URL_ADLS_OneFile.jpg

Go ahead and respond to the security dialog box when prompted (if you're accessing this ADL Store for the first time). Next let's choose Edit to go into the Query Editor:

PowerBI_ADLS2.jpg

Here's where it might look funny at first. What it's actually showing you is the metadata. To get to the data, click on the hyperlink called Binary:

PowerBI_ADLSData.jpg

And now we've got a preview of the actual file data. Power BI auto-created a few steps to tidy up the data:

PowerBI_ADLSData2.jpg

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Connecting to a folder containing multiple files from Power BI

Connecting to a folder is helpful when you have numerous files you want to consolidate. In a data lake, it's common to have data partitioned (into subfolders) by date or other logical ways to store the data incrementally. 
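
As a side note, if you're building that kind of date-partitioned folder structure with a script, New-AzureRmDataLakeStoreItem with the -Folder switch creates the directories. A minimal sketch with illustrative names:

#Sketch only: create month-level subfolders under a yearly folder
$accountName = 'bankingadls'
foreach ($month in 'Jan','Feb','Mar')
    {
     New-AzureRmDataLakeStoreItem -AccountName $accountName -Path "/RawData/2017/$month" -Folder
    }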

There are a couple of differences when retrieving data at the folder level instead. The first difference is that we specify the folder path in the URL:

URL_ADLS_Folder.jpg

The next difference is how the metadata is initially presented. The metadata for this "table" is actually the subfolder - for us, that's the 2017 subfolder under RawData. If we click on the Table hyperlink, we are taken to the metadata for the monthly folders:

PowerBI_ADLS3.jpg

At this point, we don't want to navigate down any further because we want the files that are present in each of those monthly folders. Instead, click on the "Expand" button:

PowerBI_ADLS_CombineFiles.jpg
 

Your steps above might differ a bit depending on your folder structure. 

Now we see that it has detected the individual files across the multiple folders. At this point it's time to click on the "Combine Files" button:

PowerBI_ADLS_CombineFiles4.jpg

The Query Editor uses the first file to determine what the column structure of the data will be:

PowerBI_ADLS_CombineFiles3.jpg

When combining files, be sure to choose "Skip files with errors." I had some null rows alongside the binary file list, which Power BI treated as file errors, so the full data refresh failed. The error was "Failed to save modifications to the server. The parameter is expected to be of type Text.Type or Binary.Type." I believe the sample query & parameter that Power BI created in the Query Editor to figure out the metadata was choking on those null values. I'm not certain where the null rows came from, but choosing "Skip files with errors" solved the issue (or you could filter out null values in the Content.Content column shown above).

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Refreshing the data in the Power BI Service

In order to schedule a data refresh, we'll want to publish the PBIX to the Power BI Service. (Reminder: if you plan to share this with others, lean towards using an app workspace. Sharing out of your personal workspace should be used only in a limited way.)

A gateway is not needed in the Power BI Service since it's pulling from an Azure service.

The data source credentials for the refresh use the OAuth2 protocol which is common for web requests:

PowerBI_ADLS_DataRefresh.jpg
 

Now for the bad news. The OAuth2 protocol relies on access tokens, and those tokens expire at regular intervals. (I'm currently testing this with a new refresh schedule created yesterday; I'll update this post as soon as I have verified what the current expiration interval is--thankfully it is longer than it used to be.) Once the refresh fails due to token expiration, you'll have to edit the credentials in the dataset properties. The usual way around this issue--for instance, when Azure Data Factory needs to access ADLS--is to use an Azure application (service principal) for authentication, but that's not currently supported for Power BI either. 

Depending on the quantity and/or size of the files in the data lake, the data refresh may take a bit of time. Power BI sends one HTTPS GET request per file to retrieve the data. 

Like This Content?

If you are integrating data between Azure services, you might be interested in an all-day session Meagan Longoria and I are presenting at PASS Summit in November. It's called "Designing Modern Data and Analytics Solutions in Azure." Check out info here: http://www.pass.org/summit/2018/Sessions/Details.aspx?sid=78885 

You Might Also Like...

Why the Default Summarization Property in Power BI is So Important

Checklist for Finalizing a Data Model in Power BI Desktop

Data Refresh Issues in the Power BI Service Due to Invalid Dates

New All-Day Session: Designing Modern Data and Analytics Solutions in Azure

At the fall 2018 PASS Summit in Seattle, I'm excited to be co-presenting a full-day pre-conference session with my good friend & fellow BlueGranite colleague Meagan Longoria.

Why Do a Pre-Con?

MelissaCoates.JPG

I have a natural inclination to share information that I have learned. Being a hands-on techie is something I absolutely love, but I have a bit of educator in my blood as well. And, continually learning new skills is at the core of what makes me happy. All of which means that I aim to teach others in a way that I would want to learn. 

What Will You Learn? 

This session will very much be about planning the architecture and factors around decision-making, presented in a very practical and realistic way (full abstract can be found here). We will build the components for one reference architecture, using scripts that we will provide you. 

PreconTime.jpg
 

The full abstract can be found on the PASS Summit site. To highlight just a few of the topics that you'll hear about:

  • Going to the cloud - What's easier? What's harder? What trade-offs can you expect to make with respect to cost, control, complexity, performance, and security?
  • Cloud design patterns - In what ways are cloud design patterns different from traditional on-premises solutions? How does that change the typical roles for developers and administrators?
  • Schema-on-read - In what scenarios does schema-on-read work extremely well? In which situations is it not ideal?
  • Patterns-based development - What automation techniques can save you time, improve efficiency, and reduce the chance for error? 
  • Architecture - What does a BI/analytics/DW architecture look like if we value the concept of polyglot persistence vs. architectural simplicity? What kind of differences should we be aware of if we are using a distributed architecture? What are the Azure options for supporting data science and self-service BI?
  • Data storage - When do we want to analyze data in place vs. move it to another data store? What technology options do we have in Azure, and what factors do we want to consider for deciding between data virtualization and data integration? In what cases can you take advantage of a data lake in your architecture? 

Who is the Target Audience?

The ideal audience member has some experience as a data engineer, BI professional, or database developer, and is in the early stages of migrating or building solutions in Azure.

This session is broad because the data platform offerings in Azure are broad with many choices and considerations. Our day job *is* planning and building data solutions in Azure. Meagan and I are very excited to help you get started with building a solid data architecture in Azure.

More details and to register: Designing Modern Data and Analytics Solutions in Azure

Assigning Data Permissions for Azure Data Lake Store (Part 3)

This is part 3 in a short series on Azure Data Lake permissions. 

Part 1 - Granting Permissions in Azure Data Lake
Part 2 - Assigning Resource Management Permissions for Azure Data Lake Store
Part 3 - Assigning Data Permissions for Azure Data Lake Store {you are here}
Part 4 - Using a Service Principal for Azure Data Lake Store <Coming Soon>
Part 5 - Assigning Permissions for Azure Data Lake Analytics <Coming Soon>

In this section, we're covering the "data permissions" for Azure Data Lake Store (ADLS). The ACL (access control list) grants permissions to create, read, and/or modify files and folders stored in the ADLS service. Uploading and downloading data falls in this category of ACLs. If you come from the Unix or Linux world, the POSIX-style ACLs will be a familiar concept. 

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined -- so that means the ADLS security model is not an 'inheritance' model. That is an important concept to remember.

A Default ACL is like a 'template' setting at a folder level (the concept of a default doesn't apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren't an inheritance model. You want to avoid a situation where a user has permission to read a folder but is unable to see any of the files within it -- that will happen if a new file is added to a folder which has an access ACL set at the folder level, but no default ACL to apply to new child objects.
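
If you want to see which entries on a folder are access entries and which are default entries, you can read the ACL back with PowerShell. A minimal sketch, assuming the Get-AzureRmDataLakeStoreItemAclEntry cmdlet from the AzureRM.DataLakeStore module (the account and path are illustrative):

#Sketch only: list the ACL on a folder; default entries are reported separately from access entries
Get-AzureRmDataLakeStoreItemAclEntry -AccountName 'bankingadls' -Path '/StandardizedData'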

Tips for Assigning Data Permissions for ADLS

Organize your data lake folders and files so that folder-level security is one of the main considerations, and so that security is easier to manage.

Access to raw data is typically highly limited. This is partly because raw data isn't very usable yet, and partly because the priority is to ingest data as quickly as possible. Since every single file in ADLS has its security properties specified individually, this is one of several reasons why a large number of very tiny files in ADLS is generally discouraged.

Typically in a data lake, the majority of users need only read+execute rights to consume the data. However, you may also have an area like a 'manual drop zone' or an 'analytics sandbox' where select users have write permissions to create, modify & delete folders and files. Generally speaking, write permissions in a data lake are minimal.

Be selective with granting permissions at the root level ("/"). Defining an access ACL + default ACL at the root level does minimize maintenance, but only do so if you feel confident that level of security is adequate.

Try to use Azure Active Directory (AAD) groups whenever you can to grant access, rather than individual user accounts. This is a consistent best practice for managing security across many types of systems. This will reduce maintenance, and reduce the risk of inaccurate or out of date user permissions.

Currently the maximum # of ACL entries that can be assigned to a file or folder is 32. This is another big reason to use AAD groups for managing access, rather than individual users.

Try to assign access at the folder level whenever you can. Although ADLS doesn't have a true inheritance model, you can set a 'default' entry which will handle new child items. 

Be aware that changing a default ACL on a folder doesn't automatically propagate to change the default ACL on any existing child folders. So, managing changes to existing data needs to be done carefully. Although it can seem like the default ACLs act like an inheritance model in some respects, it definitely is not.

Grant access to an AAD application (aka service principal identity or SPI) for automated operations, such as data loads. For service principals, you often can assign just the data permissions (the ACL) and not any permissions to the ADLS service (the RBAC). Check Part 4 for more discussion about service principals.

You almost always want to assign read + execute permissions together. The execute permissions allow a user to traverse the folder structure to where a file resides, which is needed in conjunction with the read (or write) permissions for the actual file.
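
For example, here's a hedged sketch of granting a group access to data in /RawData/2017: execute-only entries on the ancestor folders allow traversal, and read + execute (plus a matching default entry) on the target folder covers the data itself. The paths and group name are illustrative:

#Sketch only: execute on ancestors for traversal, read + execute on the folder holding the data
$accountName = 'bankingadls'
$groupId = Get-AzureRmADGroup -SearchString 'YourAADGroupName'

#Execute on the parent folders lets the group traverse down without reading their contents
foreach ($parentPath in '/','/RawData')
    {
     Set-AdlStoreItemAclEntry -AccountName $accountName -Path $parentPath `
        -AceType Group -Permissions Execute -Id $groupId.Id
    }

#Read + execute on the target folder, as both the access entry and the default entry
Set-AdlStoreItemAclEntry -AccountName $accountName -Path '/RawData/2017' `
    -AceType Group -Permissions ReadExecute -Id $groupId.Id
Set-AdlStoreItemAclEntry -AccountName $accountName -Path '/RawData/2017' `
    -AceType Group -Permissions ReadExecute -Id $groupId.Id -Default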

The portal interface makes it easy to apply permissions to existing child-level folders and files. It's a little harder to specify via scripting methods, since your script will need to be explicitly set up to handle recursive operations. Therefore, try to assign relevant permissions as early as possible in your design/development/operationalization phase. 

When you set permissions on existing data, it can take a little while if you are asking it to recursively traverse the folders and files to set permissions for every object. This is another reason to try to set permissions at the AAD group level, rather than via individual users. 

The PowerShell cmdlets to manage ADLS changed in January 2018. See this post: Breaking changes to Azure Data Lake Store cmdlets

Defining ADLS Data Permissions in the Azure Portal

In my example, I want to assign read + execute permissions for the StandardizedData folder, but not for the RawData folder. In the portal, I open Data Explorer, navigate to the applicable folder (which sets the "scope" for the permissions), then click the Access button:

ADLS_ACL_Portal_1.jpg

Click the Add button to select a user or a group. Notice the permissions are read/write/execute. You can have the new permission entry added to all existing child folders & files (which you typically want to do). The last radio button is really important - this lets you set it as both an access entry *and* the default entry.

ADLS_ACL_Portal_2.jpg

Important! When using the web interface as shown above, you need to leave the blade open while it assigns permissions. If you navigate away and close it before it completes, the process will get interrupted.

Defining ADLS Data Permissions via PowerShell Script

The technique shown above in the portal is convenient for quick changes, for learning, or for "one-off" scenarios. However, in an enterprise solution, or a production environment, it's a better practice to handle permissions via a script so you can do things such as:

  • Promote changes through different environments
  • Pass off scripts to an administrator to run in production
  • Include permission settings in source control

Group Permissions

In the following script, we are assigning read+execute permissions to a group:

  • Step 1 defines the access ACL.
  • Step 2 defines the default ACL. (Thanks to Saveen Reddy from the ADL team who very kindly clued me into needing to set the default in a separate step. I was stumped on that one for a bit.)
  • Step 3 applies the folder-level access to the child objects. Note that it only goes one level deep in the folder structure. Therefore, you'll want to construct the script to be 'smarter' about recursion if you have a large number of folders which already exist (a hedged recursive sketch follows the script below), or just pass in an array list of the folders it should apply to - see Shannon Lowder's blog for an example of this technique. If you have any files, the default ACL in the foreach loop will fail (but it works fine for folders).
ADLS_ACL_PowerShell.jpg
 

Here's the copy/paste friendly script for the above screenshot - for a group:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$adlsPath = '/Folder/Subfolder'
$groupName = 'YourAADGroupName'
$permissionType = 'ReadExecute'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Step 1: Set the access permissions at the folder level
$groupId = Get-AzureRmADGroup -SearchString $groupName  
Set-AdlStoreItemAclEntry  `
    -AccountName $resourceName `
    -Path $adlsPath `
    -AceType Group `
    -Permissions $permissionType `
    -Id $groupId.Id 

#Step 2: Set the default at the folder level
Set-AdlStoreItemAclEntry  `
    -AccountName $resourceName `
    -Path $adlsPath `
    -AceType Group `
    -Permissions $permissionType `
    -Id $groupId.Id `
    -Default 

#Step 3: Set existing child objects to be the same as the folder level
$childObjects = Get-AzureRmDataLakeStoreChildItem `
    -AccountName $resourceName `
    -Path $adlsPath
$arrayOfObjectNames = @($childObjects.Name)
foreach ($objectName in $arrayOfObjectNames) 
    {
     Write-Host "Setting ACL for $adlsPath/$objectName"
     #Set the access
     Set-AdlStoreItemAclEntry  `
        -AccountName $resourceName `
        -Path "$adlsPath/$objectName" `
        -AceType Group `
        -Permissions $permissionType `
        -Id $groupId.Id 
     #Set the default
     Set-AdlStoreItemAclEntry `
        -AccountName $resourceName `
        -Path "$adlsPath/$objectName" `
        -AceType Group `
        -Permissions $permissionType `
        -Id $groupId.Id `
        -Default 
    }
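
If your folder tree goes deeper than one level, here's a hedged sketch of one way to extend Step 3 into a recursive function. It's illustrative only (the function name is made up, and it assumes the Group ACE type), and it applies the default entry only to folders since there's no default at the file level:

#Sketch only: recursively apply the same access + default entries through a folder tree
function Set-AdlAclRecursive
    {
     param (
        [string]$AccountName,
        [string]$Path,
        [string]$ObjectId,                 #AAD group object ID
        [string]$Permissions = 'ReadExecute'
     )

     #Access entry and default entry at the current folder level
     Set-AdlStoreItemAclEntry -AccountName $AccountName -Path $Path `
        -AceType Group -Permissions $Permissions -Id $ObjectId
     Set-AdlStoreItemAclEntry -AccountName $AccountName -Path $Path `
        -AceType Group -Permissions $Permissions -Id $ObjectId -Default

     #Recurse into child folders; files only get the access entry (no default at the file level)
     foreach ($child in (Get-AzureRmDataLakeStoreChildItem -AccountName $AccountName -Path $Path))
        {
         $childPath = "$Path/$($child.Name)"
         if ($child.Type -eq 'DIRECTORY')
            {
             Set-AdlAclRecursive -AccountName $AccountName -Path $childPath `
                -ObjectId $ObjectId -Permissions $Permissions
            }
         else
            {
             Set-AdlStoreItemAclEntry -AccountName $AccountName -Path $childPath `
                -AceType Group -Permissions $Permissions -Id $ObjectId
            }
        }
    }

#Example call (hypothetical values)
#Set-AdlAclRecursive -AccountName $resourceName -Path $adlsPath -ObjectId $groupId.Id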

User Permissions

This next script is nearly the same, but this time we are assigning read+execute permissions to a user instead of a group (which should be the exception not the rule):

ADLS_ACL_PowerShell_User.jpg
 

And, the copy/paste friendly script that goes with the above screenshot - for a user:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$adlsPath = '/Folder/Subfolder'
$userName = 'UserNameInEmailFormat'
$userPermission = 'ReadExecute'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Step 1: Set the access permissions at the folder level
$userId = Get-AzureRmADUser -UPN $userName 
Set-AdlStoreItemAclEntry  `
     -AccountName $resourceName `
     -Path $adlsPath `
     -AceType User `
     -Permissions $userPermission `
     -Id $userId.Id 

#Step 2: Set the default at the folder level
Set-AdlStoreItemAclEntry  `
     -AccountName $resourceName `
     -Path $adlsPath `
     -AceType User `
     -Permissions $userPermission `
     -Id $userId.Id -Default 

#Step 3: Set existing child objects to be the same as the folder level
$childObjects = Get-AzureRmDataLakeStoreChildItem `
     -AccountName $resourceName `
     -Path $adlsPath
$arrayOfObjectNames = @($childObjects.Name)
foreach ($objectName in $arrayOfObjectNames) 
    {
     Write-Host "Setting ACL for $adlsPath/$objectName"
     #Set the access
     Set-AdlStoreItemAclEntry  `
          -AccountName $resourceName `
          -Path "$adlsPath/$objectName" `
          -AceType User `
          -Permissions $userPermission `
          -Id $userId.Id 
     #Set the default
     Set-AdlStoreItemAclEntry  `
          -AccountName $resourceName `
          -Path "$adlsPath/$objectName" `
          -AceType User `
          -Permissions $userPermission `
          -Id $userId.Id `
          -Default 
    }

In both of the above scripts, if you are setting permissions for only child files (rather than folders), you can remove the second Set-AdlStoreItemAclEntry line which sets the default. There's no concept of a default at the file level.

Finding More Information

PowerShell Cmdlets for Azure Data Lake Store

Breaking Changes to Data Lake Store Cmdlets

Access Control in Azure Data Lake Store <--Definitely take time to read this

Secure Data in Azure Data Lake Store

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018

Assigning Resource Management Permissions for Azure Data Lake Store (Part 2)

This is part 2 in a short series on Azure Data Lake permissions. 

Part 1 - Granting Permissions in Azure Data Lake
Part 2 - Assigning Resource Management Permissions for Azure Data Lake Store {you are here}
Part 3 - Assigning Data Permissions for Azure Data Lake Store
Part 4 - Using a Service Principal for Azure Data Lake Store <Coming Soon>
Part 5 - Assigning Permissions for Azure Data Lake Analytics <Coming Soon>

In this section, we're covering the "service permissions" for the purpose of managing Azure Data Lake Store (ADLS). Granting a role on the resource allows someone to view or manage the configuration and settings for that particular Azure service (i.e., although we're talking about ADLS, this post is applicable to Azure services in general). RBAC, or role-based access control, includes the familiar built-in Azure roles such as reader, contributor, or owner (you can create custom roles as well).

Tips for Assigning Roles for the ADLS Service

Setting permissions for the service + the data stored in ADLS is always two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted 'superuser' (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any other RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

If you are using resource groups in Azure the way they are intended to be used, you may be able to define service permissions at the resource group level rather than at the individual resource level (although the example shown here is setting RBAC for ADLS specifically). Managing permissions at the resource group level reduces maintenance, assuming your resource group isn't too broadly defined.
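
For example, a hedged sketch of assigning a role at the resource group scope instead of the individual ADLS account (the group, role, and resource group names are illustrative):

#Sketch only: assign a role at the resource group scope rather than the resource scope
$groupId = Get-AzureRmADGroup -SearchString 'YourAADGroupName'
New-AzureRmRoleAssignment `
    -ObjectId $groupId.Id `
    -RoleDefinitionName 'Reader' `
    -ResourceGroupName 'YourResourceGroupName'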

If you need to allow a user or group to be able to peruse data using the Data Explorer, then they'll need reader permissions to the ADLS Azure service. Basically, if any access to ADLS in the Azure portal is needed, or if managing the ADLS service (such as through ARM or PowerShell or the portal) is needed, then the appropriate RBAC entry is necessary. 

Typically, automated processes which do need access to the data (discussed in Part 3) don't need any access to the ADLS resource itself. In Part 4 we'll talk a bit about using service principals. I've found that frequently a service principal needs data access (ACLs), but not any RBAC access to the service. 

The RBAC functionality is consistent across Azure services. When roles are updated for an Azure resource, it is recorded in the Activity Log:

ADLS_RBAC_ActivityLog.jpg
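
If you'd rather check that from a script than from the portal, a minimal sketch using Get-AzureRmLog (from the AzureRM.Insights module) might look like this; the resource names are illustrative:

#Sketch only: review recent Activity Log events for the ADLS account
$resource = Get-AzureRmResource `
    -ResourceGroupName 'YourResourceGroupName' `
    -ResourceName 'YourResourceName'
Get-AzureRmLog -ResourceId $resource.ResourceId -StartTime (Get-Date).AddDays(-7)
#Role changes appear with operation names like Microsoft.Authorization/roleAssignments/write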

Defining RBAC Permissions in the Azure Portal

Setting up permissions can be done in the portal in the Access control (IAM) pane. (By the way, the IAM acronym stands for Identity and Access Management.)

ADLS_RBAC_Portal.jpg

Defining RBAC Permissions via PowerShell Script

The technique shown above in the portal is convenient for quick changes, for learning, or for "one-off" scenarios. However, in an enterprise solution, and for production environments, it's a better practice to handle permissions via a script so you can do things such as:

  • Promote changes through different environments
  • Pass off scripts to an administrator to run in production
  • Include permission settings in source control

Setting Permissions for a Group

In the following PowerShell script, we are assigning contributor permissions to an AAD group:

ADLS_RBAC_PowerShell.jpg

Here's a copy/paste friendly script from the above screenshot - for a group:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$groupName = 'YourAADGroupName'
$userRole = 'Contributor'

#-----------------------------------------

#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Look up the resource ID for the ADLS account and the object ID for the AAD group
$resourceId = Get-AzureRmResource `
    -ResourceGroupName $resourceGroupName `
    -ResourceName $resourceName 
$groupId = Get-AzureRmADGroup `
    -SearchString $groupName

#Assign the role to the group, scoped to the ADLS resource
New-AzureRmRoleAssignment `
    -ObjectId $groupId.Id `
    -RoleDefinitionName $userRole `
    -Scope $resourceId.ResourceId 
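
After it runs, you can double-check the assignments at that scope. A quick hedged sketch:

#Sketch only: verify the role assignments scoped to the ADLS resource
Get-AzureRmRoleAssignment -Scope $resourceId.ResourceId |
    Select-Object DisplayName, RoleDefinitionName, ObjectType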

Setting Permissions for a User

This next script is nearly the same, but this time we are assigning the contributor role to a user instead of a group (which should be the exception, not the rule):

ADLS_RBAC_PowerShell_User.jpg
 

And, here's the copy/paste friendly version of the above screenshot - for a user:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$userName = 'YourUserNameInEmailFormat'
$userRole = 'Contributor'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Look up the resource ID for the ADLS account and the object ID for the user
$resourceId = Get-AzureRmResource `
    -ResourceGroupName $resourceGroupName `
    -ResourceName $resourceName 
$userId = Get-AzureRmADUser `
    -UPN $userName 

#Assign the role to the user, scoped to the ADLS resource
New-AzureRmRoleAssignment `
    -ObjectId $userId.Id `
    -RoleDefinitionName $userRole `
    -Scope $resourceId.ResourceId 

The online examples for the New-AzureRmRoleAssignment cmdlet enumerate the IDs or GUIDs, which makes things clear for learning but isn't ideal for operationalized scripts. Therefore, the purpose of $resourceId and $groupId above is to do the work of looking up the GUIDs so you don't have to do that manually.

Personally, I like using PowerShell instead of ARM (Azure Resource Manager) templates for certain things, such as permissions, but you do have additional options beyond what I've discussed here based on what you're most comfortable with.

Finding More Information

Get Started with Role-Based Access Control in Azure

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018