Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Now that Azure Data Lake Storage Gen2 is built on Azure Storage as its foundation, we have a new level to incorporate into our planning process: the file system itself. The file system contains the files and folders, and is equivalent to a container in Azure Blob Storage, which contains blobs. In ADLS Gen1, we didn't have that intermediary level. I talked about this just a bit in #7 of my recent blog entry called 10 Things to Know About Azure Data Lake Storage Gen2, but I'd like to elaborate in this post a bit more about when you might need multiple storage accounts, multiple containers, or multiple file systems to support your data lake.

One caveat: As I’m writing this (March 2019), ADLS Gen2 is young and still evolving in its feature support. This means that some of the blob storage properties mentioned below don’t apply to ADLS Gen2 — yet. According to what we've heard from the ADLS Gen2 team, we can expect that all Azure Storage features will be supported on ADLS Gen2 as it evolves. So, here's the perspective I'm taking in this post:

  • From the Azure Blob Storage perspective (so that it's less confusing during this transition period of ADLS Gen2)

  • All properties for all 3 levels are included (even if not yet supported by ADLS Gen2)

  • Files, Tables, and Queues are disregarded for this discussion (though many of the properties we discuss in this post, like the account-level properties, would apply)

The 3 Levels to Plan for in Azure Storage

The 3 levels within Azure Storage that we’re talking about in this post are (1) the account level, (2) the container or file system level, and (3) the blob or file level:

3LevelsOfAzureBlobStorage.jpg
 

Azure Storage Account Properties

The storage account has quite a few properties and settings associated with it. Here are the main ones:

AzureStorageAccountProperties.jpg

A few thoughts regarding the account-level properties:

  • You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level, but ACL-type permissions only apply to ADLS Gen2 and not to blob storage.)

  • If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same but transaction costs are higher when the HNS is enabled (discussed in item #8 of this post).

  • If your data residency requirements differ for certain types of data (e.g., one type of data must reside within Canada, while another must remain in Europe), that will definitely require separate storage accounts.

  • Settings such as replication (whether it’s locally redundant or geo-redundant) are specified at the storage account level. This impacts not only your disaster recovery planning, but it also impacts cost for the entire storage account.

  • Your two account keys are at the account level, so be ultra cautious in sharing those out.
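To make the account-level reasoning above a bit more concrete, here is a minimal Python sketch. It assumes you can describe each dataset by a handful of account-level requirements; the `Dataset` fields, the dataset names, and the `plan_storage_accounts` helper are all hypothetical and only illustrate the idea that data can share a storage account when every account-level setting matches.

```python
from collections import defaultdict
from typing import NamedTuple


class Dataset(NamedTuple):
    """Hypothetical description of one dataset's account-level needs."""
    name: str
    region: str         # data residency requirement, e.g. "canadacentral"
    replication: str    # e.g. "LRS" or "GRS"
    needs_hns: bool     # is the hierarchical namespace required?
    network_zone: str   # label for the network / access-key boundary


def plan_storage_accounts(datasets):
    """Group datasets that could share one storage account.

    Two datasets land in the same group only when all of their
    account-level requirements are identical.
    """
    groups = defaultdict(list)
    for ds in datasets:
        key = (ds.region, ds.replication, ds.needs_hns, ds.network_zone)
        groups[key].append(ds.name)
    return dict(groups)


# Made-up example data
datasets = [
    Dataset("sales_raw", "canadacentral", "GRS", True, "analytics"),
    Dataset("sales_curated", "canadacentral", "GRS", True, "analytics"),
    Dataset("eu_customers", "westeurope", "GRS", True, "analytics"),
    Dataset("app_images", "canadacentral", "LRS", False, "web"),
]

plan = plan_storage_accounts(datasets)
for requirements, names in plan.items():
    print(requirements, "->", names)
```

Each group in the result is a candidate storage account. In practice you would still weigh this against discoverability, since fewer accounts are easier for users to navigate.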

Azure Blob Storage Container Properties

The container within the storage account has properties associated with it as well:

AzureBlobStorageContainerProperties.jpg
 

A few considerations related to the container-level properties:

  • Role-based access control (RBAC/IAM) can be set at the account level or the container level. The container level is the narrowest RBAC scope that can be specified. And don’t forget that RBAC always inherits and can’t be broken: a container inherits from the account, which inherits from the resource group, which inherits from the subscription.

  • You can set up stored access policies at the container level. A SAS token issued at the blob/file level can then reference the policy and pick up its settings (such as an expiration date for access).

  • My favorite container-level property is the immutable policy. An immutable policy can prevent data from being edited or deleted (i.e., once the policy is enabled, only appends are allowed). If you have very firm requirements for data protection, this might justify separate containers which have different policies in place.

  • If you have some publicly available data, that access is specified at the container level.

  • If this is an ADLS Gen2 file system (rather than blob container): Power BI Dataflows will reside in one or more file systems.
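The RBAC inheritance chain mentioned above (container, account, resource group, subscription) is visible in the resource ID itself. Here is a small Python sketch that lists every scope a container inherits from. The subscription ID, resource group, account, and container names are placeholders, and `ancestor_scopes` is a hypothetical helper, not part of any Azure SDK.

```python
# Number of path segments that make up each scope level in a resource ID:
# subscription (2), resource group (4), storage account (8), container (12).
SCOPE_DEPTHS = (2, 4, 8, 12)


def ancestor_scopes(scope):
    """Return the given scope plus every scope it inherits from, narrowest first."""
    parts = scope.strip("/").split("/")
    chain = [
        "/" + "/".join(parts[:depth])
        for depth in SCOPE_DEPTHS
        if depth <= len(parts)
    ]
    return list(reversed(chain))


# Placeholder names, for illustration only
container_scope = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/ExampleRG"
    "/providers/Microsoft.Storage/storageAccounts/examplestorage"
    "/blobServices/default/containers/rawdata"
)

for level in ancestor_scopes(container_scope):
    print(level)
```

A role assignment made at any of the printed scopes applies to the container, which is why it pays to check all of the levels when reviewing who has access.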

Azure Storage Blob (File) Properties

And finally, the files within the container have properties associated with them as well:

AzureStorageFileProperties.jpg
 

A few things to be aware of regarding the file-level properties:

  • You can set up a SAS (shared access signature) token if you need to make just one specific file available for access.

  • If we’re talking about directories and files within ADLS Gen2 instead of blobs within a container, then you would also specify data-level security ACLs (access control lists) at this level. ACLs apply to directories and files. From a security planning perspective, it’s really important to plan both RBAC and ACLs.

Final Thoughts

My rule of thumb is to start with a consolidated data lake. Split out into separate storage accounts or containers/file systems only when your requirements justify it. The more separation that exists, the harder it is for users to find data, so take that into careful consideration. However, it’s ok to be liberal with the separation of your directory structure within the file system itself.

Also, keep in mind that a lot of the RBAC roles are evolving right now with regard to flexibility & granularity of managing the control plane vs. the data plane. Make sure to look into the preview capabilities so you make the best long-term decision.

I hope this is helpful for planning out your data lake / data storage needs.

Keeping Up with Azure Changes

Since I started focusing primarily on Microsoft BI/DW/analytics around 2005, I've always been happy being a generalist within that space. Fast forward to around 2017 when I started focusing predominantly on Azure. Being a generalist (still within BI/DW/analytics) in Azure is really tough — as I'm sure you well know. Not only is the variety of technologies very wide, the pace of change is dizzying. Handling that pace of change is the subject of this post.

My two main ways of keeping up with Azure changes focus on:

(1) Being aware of new announcements, new features, general evolution and maturity of services.

(2) Getting hands-on with services and features to build out skills.

Related to (1) above, I've been trying to do a better job of keeping up in a timely way. Since January 1, I've been putting out a tweet with the #AzureDidYouKnow hashtag each day (well…most days anyway). It's a way to keep me motivated to continually keep up with announcements, and since I'm reading them anyway, I may as well share them with you, right? I do love learning all the time, so this extra bit of motivation helps me keep at it. These daily updates usually relate to something new; sometimes it might be something I just learned, or maybe something I find interesting.

I also want to share the resources I use most often to keep up with changes in Azure:

Websites with Update Roundups

Azure Updates** - https://azure.microsoft.com/en-us/updates/

Azure Blog Updates - https://azure.microsoft.com/en-us/blog/topics/updates/

Last Week in Azure - https://azure.microsoft.com/en-us/blog/topics/last-week-in-azure/

Build Azure - https://buildazure.com/category/azure-weekly/

**This site is great, but it does not contain every update from every Azure service. If there’s a service you are particularly invested in, be sure to follow the main product team blog as well.

Podcasts & Videos

The Azure Podcast - http://azpodcast.azurewebsites.net/

Azure Friday - https://azure.microsoft.com/en-us/resources/videos/azure-friday/

Azure Flash Friday - http://www.azureflashfriday.com/

Microsoft Cloud Show - http://www.microsoftcloudshow.com/#

Azure This Week - A Cloud Guru - https://www.youtube.com/playlist?list=PLI1_CQcV71RmnrRBgJNlI1yY_WiOWIXov

Did I miss anything important in this list related to keeping up with Azure changes? (I did skip general Azure training videos purposely.) If I missed a good one, please shoot me a tweet and let me know.

How to Reference Azure Storage Files from Cloud Shell

Recently I used Azure Cloud Shell for the first time. This is a quick post to show how I referenced the file share in Azure Storage to communicate with Cloud Shell.

What is Cloud Shell?

Cloud Shell is a lightweight way to run scripts using either Bash or PowerShell. You can run scripts in a browser using the Azure portal or shell.azure.com, with the Azure mobile app, or using the VS Code Azure Account extension. If you have seen the "Try it now" links in Azure documentation pages, those links open Cloud Shell for you.

CloudShell_AzurePortal.jpg

The rest of this post focuses on using PowerShell with Cloud Shell.

Finding the File Service Info to Use with Cloud Shell

When you create a Cloud Shell account, you are prompted to also create an Azure Storage account. This gives you a mounted file share in the Azure Files service which is available for all of your Cloud Shell sessions.

You can use Get-CloudDrive to find the information related to the drive for your Cloud Shell account:

Get-CloudDrive.jpg
 

Example of Exporting a File to Azure Files Using Cloud Shell

Using the mount point information from Get-CloudDrive shown above, we now know how to reference the Azure file share that is associated with my Cloud Shell account. As an example, I’m going to export a PBIX file from the Power BI Service (discussed in my previous blog post).

First we need to log in to the Power BI Service. In Cloud Shell, authentication to another service like Power BI is a little different than how we see it within a local client tool - it utilizes a login page with a code provided by Cloud Shell:

Connect-PowerBIServiceAccount.jpg

Now that we are authenticated, we can execute our Export-PowerBIReport cmdlet:

Export-PowerBIReport_CloudShell.jpg

Note in the above PowerShell cmdlet, I referenced the mount point for the Azure file share associated with my Cloud Shell account as:

'/home/melissa/clouddrive/PowerBIExportFiles/ExportTest_V2workspace.pbix'
AzureFiles.jpg

Notice in the above example I used a folder called ‘PowerBIExportFiles’. Any folder in the path needs to exist before you can export a file to it. It won’t be auto-created for you; if the folder doesn’t exist you’ll get an error that says “Could not find a part of the path.” Using a folder is optional though.
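If you'd rather guard against the missing-folder error up front, the folder can be created before the export runs. The post itself uses PowerShell; the sketch below shows the same precaution in Python, purely as an illustration. `ensure_export_folder` is a hypothetical helper, and the demo uses a temporary directory as a stand-in for the clouddrive mount point.

```python
import tempfile
from pathlib import Path


def ensure_export_folder(clouddrive, folder_name):
    """Create the target folder (and any missing parents) if it is absent.

    Returns the folder path so it can be combined with a file name.
    """
    folder = Path(clouddrive) / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demo against a temporary directory standing in for the clouddrive mount.
with tempfile.TemporaryDirectory() as fake_clouddrive:
    target = ensure_export_folder(fake_clouddrive, "PowerBIExportFiles")
    print(target / "ExportTest_V2workspace.pbix")
```

In a real session you would pass the mount point reported by Get-CloudDrive instead of the temporary directory. Calling the helper a second time is harmless because an existing folder is left as-is.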

Controlling Data Access in Azure for Administrators and Owners

ResourceGroup.jpg

Recently a customer expressed concern that an owner of an Azure resource group automatically gains access to the data within the services contained in the resource group. In this case, the customer was specifically referring to data in Azure Data Lake Storage Gen1, but this concept applies to Azure Storage and some of the other data-oriented services in Azure as well. The customer’s comment prompted me to look into available alternatives. This is by no means a detailed security post; rather, I’m trying to share a few nuggets of what I learned.

Key points to take away from this post:

  • The built-in Owner role is, by design, unrestricted for both management operations and data operations. *See option 3 below where it looks like this behavior is changing.*

  • You cannot 'break' the inheritance model of resources in Azure.

  • If resource group owners are not allowed to see data, these are the choices I’m currently aware of:

  1. Remove owner permissions from the resource group level, or

  2. Isolate the resource into its own resource group so it can be managed separately, or

  3. Investigate using the new "DataActions" and "NotDataActions" properties within RBAC role structure to separate management operations from data operations (currently in preview at the time of this writing)

Default Behavior of an Owner in a Resource Group Allows Access to Data

Let's first cover what happens by default. Say I have a resource group with a way cool name of "SecurityTestRG" and I've granted "Wayne Writer" ownership permissions to the resource group. And within the resource group I have a storage account called "strgtest96." As we'd expect, the owner permissions are inherited by the storage account:

Security_ResourceInherited.jpg

And with that, Wayne Writer is able to view, edit, and delete any of the data that has been uploaded to that storage account by virtue of his owner permissions at the resource group level:

Security_OwnerOfResourceGroup.jpg

Allowing owners unrestricted access to both the management plane and the data plane is by design. However, if you have a group of people who should be able to administer (own) everything contained in the resource group, but you don’t want them to automatically see all the data, what are the options?

Cannot Break the RBAC Inheritance Model in Azure

The first thing you might be inclined to try is to just remove the owner permission for the one resource within the resource group that has data, so that you can specify security for it differently. However, that doesn’t work:

Security_RoleAssignmentsError.jpg

When you try to delete a role assignment that’s been inherited, as shown above, you get an error message: “Inherited role assignments cannot be removed. Please open up the associated resource and remove the role assignments from there.”

In Azure’s RBAC model, we can add additional permissions at lower levels (e.g., for a resource itself within the resource group), but we cannot remove an assignment that’s been inherited.

An additional note about subscription level owners & administrators: Any owner you specify at the subscription level will be inherited by all resource groups and resources across the subscription—it’s a very high privilege role. The same inheritance behavior also applies to the Azure service administrator (but not the other co-administrators—they would have to be specified as subscription owners).
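Here is a simplified Python model of the additive inheritance behavior described above. It is only a sketch: the `sub1` subscription ID is a placeholder, "Dana Reader" is a made-up second user, and both helper functions are hypothetical rather than anything from an Azure SDK.

```python
RG = "/subscriptions/sub1/resourceGroups/SecurityTestRG"
STORAGE = RG + "/providers/Microsoft.Storage/storageAccounts/strgtest96"

# Role assignments keyed by the scope where they were created (illustrative data).
assignments = {
    "/subscriptions/sub1": [],
    RG: [("Wayne Writer", "Owner")],
    STORAGE: [("Dana Reader", "Reader")],
}


def effective_assignments(scope, assignments):
    """Collect assignments made at this scope or at any scope above it."""
    result = []
    for assigned_at, entries in assignments.items():
        if scope == assigned_at or scope.startswith(assigned_at + "/"):
            origin = "direct" if assigned_at == scope else "inherited"
            for principal, role in entries:
                result.append((principal, role, origin))
    return result


def remove_assignment(scope, principal, role, assignments):
    """Remove an assignment only if it was created at this exact scope."""
    entry = (principal, role)
    if entry in assignments.get(scope, []):
        assignments[scope].remove(entry)
        return True
    # Inherited: it has to be removed at the scope where it was assigned.
    return False


print(effective_assignments(STORAGE, assignments))
print(remove_assignment(STORAGE, "Wayne Writer", "Owner", assignments))
```

In this model the storage account can gain assignments of its own, but the inherited Owner assignment can only be removed at the resource group where it was created, which mirrors the error message shown earlier.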

Options for Handling Security Differently

Option 1. You could remove the owner permission from the resource group (that is the suggestion presented within the error message shown above). However, let’s say there are 10 resources in this resource group and the other 9 of them can inherit the owner permission without issue. In that case, do we really want to have to manage security for every single resource separately? That might introduce risk and inconsistency. Though if there aren’t very many resources within a resource group, that might work ok.

Option 2. You could isolate the resource into its own ‘sensitive’ resource group so it can be managed separately. This allows the other 9 resources to stay in the main resource group with normal RBAC, and just segregates this one resource. Depending on how you handle automation and deployments, having separate resource groups could add complications (because the general rule for resource groups is to group resources together that are related & have the same lifecycle). Also, it still doesn’t let someone own the resource without seeing its data. This segregation only works if contributor or a custom role meets your needs, and if you don’t have subscription-level owners specified, since those are inherited no matter what.

Option 3. There is new functionality in preview that looks promising which will allow us to handle security for management operations and data operations separately, as shown here:

 

The above scenario indicates that Alice, a high level owner, doesn’t actually see the data. This is still in preview and I haven’t gotten it to work correctly yet, but I’m keeping an eye on this. You can find more info here: https://docs.microsoft.com/en-us/azure/role-based-access-control/role-definitions. I’m guessing this security enhancement is inspired by GDPR and/or customer needs for more granularity of permissions, so I’m excited to see where this is headed.
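To illustrate how the separation is meant to work, here is a rough Python sketch of evaluating a role definition. The role definitions page linked above describes four properties: Actions, NotActions, DataActions, and NotDataActions. The `is_allowed` helper and the `BLOB_READER_SKETCH` role are hypothetical simplifications, and the wildcard matching via `fnmatchcase` is my own approximation rather than Azure's actual evaluation logic.

```python
from fnmatch import fnmatchcase


def _matches(operation, patterns):
    """True if the operation matches any pattern ('*' acts as a wildcard)."""
    return any(fnmatchcase(operation.lower(), p.lower()) for p in patterns)


def is_allowed(role, operation, data_plane):
    """Allowed = listed in (Data)Actions and not excluded by Not(Data)Actions."""
    if data_plane:
        allow, deny = role["DataActions"], role["NotDataActions"]
    else:
        allow, deny = role["Actions"], role["NotActions"]
    return _matches(operation, allow) and not _matches(operation, deny)


# Shape of the built-in Owner role: everything on the management side,
# nothing listed on the data side.
OWNER = {"Actions": ["*"], "NotActions": [], "DataActions": [], "NotDataActions": []}

# Hypothetical, simplified data-only role for illustration.
BLOB_READER_SKETCH = {
    "Actions": [],
    "NotActions": [],
    "DataActions": ["Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"],
    "NotDataActions": [],
}

BLOB_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"

print(is_allowed(OWNER, "Microsoft.Storage/storageAccounts/delete", data_plane=False))
print(is_allowed(OWNER, BLOB_READ, data_plane=True))
print(is_allowed(BLOB_READER_SKETCH, BLOB_READ, data_plane=True))
```

Under this model, a wildcard in Actions covers management operations only, so a role needs explicit DataActions entries before it grants access to the data itself.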

Finding More Information

Azure Storage Security Guide <—This is a good read re: management plane and data plane

What is Role-Based Access Control (RBAC)?

Understand Role Definitions