Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Now that Azure Data Lake Storage Gen2 is based on Azure Storage as its foundation, we have a new level to incorporate into our planning process: the file system itself. The file system contains the files and folders, and is equivalent to a container in Azure Blob Storage, which contains blobs. In ADLS Gen1, we didn't have that intermediary level. I talked about this just a bit in #7 of my recent blog entry called 10 Things to Know About Azure Data Lake Storage Gen2, but I'd like to elaborate in this post about when you might need multiple storage accounts, multiple containers, or multiple file systems to support your data lake.

One caveat: As I’m writing this (March 2019), ADLS Gen2 is young and still evolving in its feature support. This means that some of the blob storage properties mentioned below don’t apply to ADLS Gen2 — yet. According to what we've heard from the ADLS Gen2 team, we can expect that all Azure Storage features will be supported on ADLS Gen2 as it evolves. So, here's the perspective I'm taking in this post:

  • From the Azure Blob Storage perspective (so that it's less confusing during this transition period of ADLS Gen2)

  • All properties for all 3 levels are included (even if not yet supported by ADLS Gen2)

  • Files, Tables, and Queues are disregarded for this discussion (though many of the properties we discuss in this post, like the account-level properties, would apply)

The 3 Levels to Plan for in Azure Storage

The 3 levels within Azure Storage that we’re talking about in this post are (1) the account level, (2) the container or file system level, and (3) the blob or file level:

3LevelsOfAzureBlobStorage.jpg
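One place where these three levels show up together is the ADLS Gen2 URI. Here's a minimal sketch of how the account, the file system, and the file path combine into an abfss:// address. The account name, file system name, and path are hypothetical, and the helper function is just for illustration:

```python
def abfss_uri(account: str, file_system: str, path: str = "") -> str:
    """Combine the three levels (account, file system, path) into one URI."""
    clean_path = path.strip("/")  # tolerate leading/trailing slashes
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{clean_path}"


# Hypothetical names: account "contosolake", file system "raw"
print(abfss_uri("contosolake", "raw", "Sales/Orders/2019/03/15/Orders_2019_03_15.csv"))
```

Notice that the account and the file system are both baked into the address, which is part of why changing either one later tends to ripple through every script that reads from the lake.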
 

Azure Storage Account Properties

The storage account has quite a few properties and settings associated with it. Here are the main ones:

AzureStorageAccountProperties.jpg

A few thoughts regarding the account-level properties:

  • You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level, but ACL-type permissions only apply to ADLS Gen2 and not to blob storage.)

  • If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same but transaction costs are higher when the HNS is enabled (discussed in item #8 of this post).

  • If your data residency requirements differ for certain types of data (ex: one type of data that must reside within Canada, while another must remain in Europe), that will definitely require separate storage accounts.

  • Settings such as replication (whether it’s locally redundant or geo-redundant) are specified at the storage account level. This impacts not only your disaster recovery planning, but it also impacts cost for the entire storage account.

  • Your two account keys are at the account level, so be ultra cautious in sharing those out.
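To make the account-level reasoning a bit more concrete, here's a rough sketch that groups datasets by the settings that can only be chosen per storage account. The dataset names and the three properties I picked (region, replication, hierarchical namespace) are assumptions for illustration; a real plan would also weigh access control, networking, and keys:

```python
from collections import defaultdict
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str
    region: str        # data residency requirement
    replication: str   # e.g. "LRS" (locally redundant) or "GRS" (geo-redundant)
    hns_enabled: bool  # is the hierarchical namespace needed?


def plan_accounts(datasets):
    """Group datasets whose account-level settings match, so they could share an account."""
    accounts = defaultdict(list)
    for ds in datasets:
        accounts[(ds.region, ds.replication, ds.hns_enabled)].append(ds.name)
    return dict(accounts)


# Hypothetical datasets
datasets = [
    Dataset("SalesCanada", "canadacentral", "GRS", True),
    Dataset("HRCanada", "canadacentral", "GRS", True),
    Dataset("SalesEurope", "westeurope", "GRS", True),
]
plan = plan_accounts(datasets)
print(len(plan), "storage accounts needed")
```

Here the two Canadian datasets can share one account, while the European data forces a second account because of its residency requirement.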

Azure Blob Storage Container Properties

The container within the storage account has properties associated with it as well:

AzureBlobStorageContainerProperties.jpg
 

A few considerations related to the container-level properties:

  • Role-based access control (RBAC/IAM) can be set at the account level or the container level. The container level is the narrowest RBAC scope that can be specified. And don’t forget that RBAC always inherits and can’t be broken: a container inherits from the account, which inherits from the resource group, which inherits from the subscription.

  • You can set up stored access policies so that your SAS tokens at the blob/file level pick up their settings (such as an expiration date for access) from the policy.

  • My favorite container-level property is the immutable policy. An immutable policy can prevent data from being edited or deleted (i.e., once the policy is enabled, only appends are allowed). If you have very firm requirements for data protection, this might justify separate containers which have different policies in place.

  • If you have some publicly available data, that access is specified at the container level.

  • If this is an ADLS Gen2 file system (rather than blob container): Power BI Dataflows will reside in one or more file systems.
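The RBAC inheritance chain mentioned above (subscription, resource group, account, container) can be sketched as a simple scope-prefix check. This is only a conceptual model in plain Python, not a call to any Azure API; the subscription ID, resource names, and principals are all hypothetical:

```python
# Hypothetical scopes, written in the Azure resource ID style
SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
RG = f"{SUB}/resourceGroups/rg-datalake"
ACCOUNT = f"{RG}/providers/Microsoft.Storage/storageAccounts/contosolake"
CONTAINER = f"{ACCOUNT}/blobServices/default/containers/raw"

# Hypothetical role assignments: (scope, principal, role)
ASSIGNMENTS = [
    (SUB, "mary", "Reader"),
    (ACCOUNT, "mary", "Storage Blob Data Reader"),
    (CONTAINER, "etl-app", "Storage Blob Data Contributor"),
]


def effective_roles(principal, scope, assignments=ASSIGNMENTS):
    """Collect roles assigned at the scope itself or at any parent scope."""
    scope = scope.lower().rstrip("/")
    roles = set()
    for assigned_scope, assigned_to, role in assignments:
        parent = assigned_scope.lower().rstrip("/")
        is_same_or_parent = scope == parent or scope.startswith(parent + "/")
        if assigned_to == principal and is_same_or_parent:
            roles.add(role)
    return roles


print(sorted(effective_roles("mary", CONTAINER)))
print(sorted(effective_roles("etl-app", ACCOUNT)))
```

Mary's roles flow down to the container from the subscription and the account, whereas the ETL application's container-level role does not flow up to the account.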

Azure Storage Blob (File) Properties

And finally, the files within the container have properties associated with them as well:

AzureStorageFileProperties.jpg
 

A few things to be aware of regarding the file-level properties:

  • You can set up a SAS (shared access signature) token if you need to make just one specific file available for access.

  • If we’re talking about directories and files within ADLS Gen2 instead of blobs within a container, then you would also specify data-level security ACLs (access control lists) at this level. ACLs apply to directories and files. From a security planning perspective, it’s really important to plan both RBAC and ACLs.
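ACLs in ADLS Gen2 follow a POSIX-style layout, commonly written as a comma-separated string of type:id:permissions entries. As a small sketch for planning purposes, the helper below (a hypothetical function, with a made-up object ID) splits such a string into entries so you can review who gets what:

```python
def parse_acl(acl: str):
    """Split a POSIX-style ACL string into a list of entry dictionaries."""
    entries = []
    for item in acl.split(","):
        parts = item.strip().split(":")
        is_default = parts[0] == "default"  # default ACLs apply to new child items
        if is_default:
            parts = parts[1:]
        entry_type, qualifier, permissions = parts
        entries.append({
            "default": is_default,
            "type": entry_type,
            "id": qualifier or None,  # empty id means the owning user/group
            "permissions": permissions,
        })
    return entries


# Hypothetical ACL with one named user (the GUID is made up)
acl = "user::rwx,user:11111111-2222-3333-4444-555555555555:r-x,group::r-x,mask::r-x,other::---"
entries = parse_acl(acl)
named_readers = [e["id"] for e in entries if e["id"] and "r" in e["permissions"]]
print(named_readers)
```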

Final Thoughts

My rule of thumb is to start with a consolidated data lake. Split into separate storage accounts or containers/file systems only when it’s justified to do so based on your requirements. The more separation that exists, the harder it is for users to find data, so take that into careful consideration. However, it’s ok to be liberal with the separation of your directory structure within the file system itself.

Also, keep in mind that a lot of the RBAC roles are evolving right now with regard to flexibility & granularity of managing the control plane vs. the data plane. Make sure to look into the preview capabilities so you make the best long-term decision.

I hope this is helpful for planning out your data lake / data storage needs.

You Might Also Like…

Granting Permissions in Azure Data Lake Storage

Resources for Learning About Azure Data Lake Storage Gen2

FAQs About Organizing a Data Lake

Zones In A Data Lake

Data Lake Use Cases and Planning Considerations

Resources for Learning About Azure Data Lake Storage Gen2

A couple of people have asked me recently about how to 'bone up' on the new data lake service in Azure. The way I see it, there are two aspects: A, the technology itself and B, data lake principles and architectural best practices. Below are some links to resources that you should find helpful.

Learning about ADLS Gen2 Technology

Azure Data Lake Storage Gen2 is new, so there is limited info available about it specifically. However, since it's built upon the foundation of Azure Storage, quite a lot of the existing Azure Storage information applies (though in all fairness ADLS Gen2 hasn't reached feature parity yet with blob storage). Here are some resources about the technology:

10 Things to Know about Azure Data Lake Storage Gen2

Best Practices for Using Azure Data Lake Storage Gen2

The Azure Blob Filesystem Driver (ABFS): A Dedicated Storage Driver for Hadoop

Azure Data Lake Storage Gen2 Hierarchical Namespace

Use the Azure Data Lake Storage Gen2 URI

Overview of Azure Data Lake Storage Gen2 [video]

Pluralsight Course: Implementing Azure Data Lake Storage Gen2 by Xavier Morera [video—requires subscription]

Learning about Data Lake Principles and Architectural Best Practices

Just like when designing a database, there are some important aspects to designing a data lake that improve usability, security, performance, and governance. This is your enterprise data we're talking about, right? Some up-front planning for how this data is structured is warranted (because yes, a data lake is more agile...but not so agile that it becomes the dreaded swamp). Here are a few resources for learning about principles and best practices:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake

FAQs About Organizing a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lakes in a Modern Data Architecture [ebook]

The Data Lake Manifesto: 10 Best Practices

A Smarter Way to Jump Into Data Lakes

Big Data by Nathan Marz [book]: This book is getting older now, but the conceptual chapters are excellent. Skip the technology chapters & focus on the concepts & it's a worthwhile read.

There are also several books on data lakes. I don’t have a favorite. Just keep in mind that some things are opinions and personal preferences. Though data lakes are maturing, best practices are still emerging. Many articles and book intros overstate the benefits and under-emphasize the challenges, so watch out for that.

Following the Maturity of ADLS Gen2

These are two important URLs for tracking what is and isn't supported in ADLS Gen2:

Known Issues 

Upgrade Your Big Data Analytics Solutions from ADLS Gen1 to ADLS Gen2

FAQs About Organizing a Data Lake

This post covers several things I've heard or been asked recently about organizing data in a data lake.

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv

Here’s an example of what the raw data zone might look like with the date partitioning at the end:

DataLakeOrganization1.jpg
 

Notice in the above example how the date element is repeated in *both* the folder structure and the file name. Being very clear in the naming of folders and files helps a lot with usability.
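As a quick sketch of this convention, the function below builds a path with the date at the end of the folder hierarchy and repeated in the file name. The function name and the sample values are hypothetical, and I've used forward slashes since that's what most tools expect:

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(subject_area: str, data_source: str, file_stem: str,
                   file_date: date, extension: str = "csv") -> str:
    """Build SubjectArea/DataSource/YYYY/MM/DD/FileData_YYYY_MM_DD.ext."""
    folder = PurePosixPath(
        subject_area, data_source,
        f"{file_date:%Y}", f"{file_date:%m}", f"{file_date:%d}",
    )
    # The date is repeated in the file name so a file is still identifiable
    # after it has been copied out of its folder.
    return str(folder / f"{file_stem}_{file_date:%Y_%m_%d}.{extension}")


print(partition_path("Sales", "Orders", "Orders", date(2019, 3, 15)))
```

Zero-padding the month and day keeps the folders sorting correctly in file explorers and listings.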

Also, keep in mind that related compute engines or data processing tools might have a firm expectation as to what the folder structure is (or what is contained in the file name). For instance, the year and month folders might translate directly to a column within the file. Or, if you’re using a tool like Azure Stream Analytics to push data to the lake, you’ll be defining in ASA what the date partitioning schema looks like in the data lake (because ASA takes care of creating the folders as data arrives).
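To illustrate how folder names can turn into a column value, here's a minimal sketch that recovers the partition date from a path. It assumes the YYYY/MM/DD folders sit directly above the file, as in the examples in this post; a real engine would have its own rules:

```python
import re
from datetime import date

# Assumes the last three folders before the file name are YYYY/MM/DD
_DATE_FOLDERS = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/[^/]+$")


def partition_date(path: str):
    """Return the date encoded in the folder structure, or None if absent."""
    match = _DATE_FOLDERS.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(partition_date("Sales/Orders/2019/03/15/Orders_2019_03_15.csv"))
print(partition_date("Sales/Orders/Orders.csv"))
```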

Q: How do data lake zones translate to a folder structure?

The zones that I talked about previously are a conceptual idea. Most commonly I’ve seen zones translate to a top-level folder (as shown in the image above). However, it's also possible that the zones would reside within, say, a subject area as shown in this next image:

DataLakeOrganization2.jpg
 

Generally speaking, business users only get access to the prepared data in the curated data zone (with some exceptions of course). Zones like Raw Data and Staged Data are frequently ‘kitchen areas’ that have little to no user access. That’s why putting the zones at the top-most level is very common. However, if your objective is to make all of the data available in an easier way, then putting zones underneath a subject area might make sense—this is less common from what I’ve seen though because exposing too much data to business users can be confusing.

On the flip side, another less common option would be to further separate zones beyond just top-level folders. For instance, in Azure Data Lake Storage Gen2, we have the structure of Account > File System > Folders > Files to work with (terminology-wise, a File System in ADLS Gen2 is equivalent to a Container in Azure Blob Storage). Depending on what you are trying to accomplish, you might decide that separate file systems are appropriate for areas of your data lake:

ADLSGen2.jpg

If your objective is to have an enterprise-wide data lake, then more separation is less appealing.

Q: Should the date reflected in the folder structure be the ingestion date or the date associated with the source data?

It could be either one. I tend to think this is dependent on whether you're dealing with data that's being pushed or pulled into the data lake, and if it’s transactional or snapshot data.

Push system: Let’s say you have machine telemetry or IoT data that is being loaded to the data lake. In this case, the dates in the folder structure would typically be based on ingestion date.

Pull system: If you have a scheduled process that loads data into the lake, then it's up to the architect of the process to determine what the date means:

  • Transactional data: If sales data is being loaded, the dates could easily relate to the sale transaction date (even if we pulled the data out three days later). Typically transactional data is append-only.

  • Snapshot data: Let’s say we want to organize the data by its "as of" date. If you look back at the very first image shown above, the CustomerContacts folder is intended to show a snapshot of what that data looked like as of a point in time. Typically this would be for reference data, and is stored in full every time it’s extracted into the data lake.
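The push/pull distinction above could be captured in a small helper like the one below. This is just a sketch of the decision as I've described it; the mode names are made up for the example:

```python
from datetime import date
from typing import Optional


def folder_date(mode: str, ingested_on: date, source_date: Optional[date] = None) -> date:
    """Pick which date drives the folder structure.

    mode is one of "push", "pull-transactional", or "pull-snapshot"
    (hypothetical labels for the cases discussed in this post).
    """
    if mode == "push":
        # Telemetry/IoT style data: folders typically follow the ingestion date
        return ingested_on
    if mode in ("pull-transactional", "pull-snapshot"):
        # Transaction date or "as of" date, as decided by the process architect
        if source_date is None:
            raise ValueError("a source date is required for pulled data")
        return source_date
    raise ValueError(f"unknown mode: {mode}")


# A sale from March 12 that was extracted three days later
print(folder_date("pull-transactional", date(2019, 3, 15), date(2019, 3, 12)))
```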

Q: Is it a good idea to create folders which nest multiple data elements?

This type of structure which nests 3 data elements into 1 folder is typically *not* recommended:

DataLakeOrganization3.jpg
 

There are two potential issues with ‘nesting’ elements like Company-Division-Project as shown above:

  1. Security: Setting up security is probably harder. For instance, if Mary should see everything in Division 2, that just got harder because now Division is associated with the granularity of Projects. Not impossible to manage, just likely to be more work.

  2. Performance: The query performance can suffer. Some compute engines & query tools can understand the structure of the data lake and do ‘data pruning’ (like predicate pushdown). Let’s say you send a query asking for all data for Project A240. If Project were its own folder, the likelihood of the compute engine needing to scan only that one folder is much higher and would improve performance considerably. (This performance optimization is applicable to a hierarchical file system like Azure Data Lake Storage (Gen1 or Gen2), but not applicable to an object store like Azure Blob Storage.)
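Here's a rough sketch of the security side of this, using made-up folder names. With nested names, giving Mary everything in Division 2 means finding every matching folder; with one element per folder level, there's a single parent folder to work with. How permissions actually propagate to child items depends on the service, so treat this as a conceptual comparison only:

```python
# Hypothetical folders: Company-Division-Project nested into one name...
nested = ["Co1-Div1-A100", "Co1-Div2-A240", "Co1-Div2-B310", "Co2-Div2-C400"]
# ...versus one data element per folder level
separate = ["Co1/Div1/A100", "Co1/Div2/A240", "Co1/Div2/B310", "Co2/Div2/C400"]


def folders_to_secure_nested(folders, company, division):
    """Each matching folder name has to be parsed and secured individually."""
    return [f for f in folders if f.split("-")[:2] == [company, division]]


def folder_to_secure_separate(company, division):
    """A single parent folder represents the whole division."""
    return f"{company}/{division}"


nested_hits = folders_to_secure_nested(nested, "Co1", "Div2")
parent = folder_to_secure_separate("Co1", "Div2")
print(len(nested_hits), "folders to manage vs. 1:", parent)
```

The nested list grows with every new project, while the separate structure stays at one folder per division. The same prefix idea is what lets a compute engine skip folders it doesn't need to scan.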

Q: If I need a separate dev, test, prod environment, how would this usually be handled?

Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

We wouldn’t usually separate out dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with a different schema for dev/test/prod) but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must get by with it being within one data lake (one service), then the environment should be the top level node.
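One benefit of keeping the same folder structure across environments is that scripts only need the environment as a parameter. A minimal sketch, assuming three hypothetical storage account names and a hypothetical file system called "datalake":

```python
# Hypothetical account names, one per environment
ACCOUNTS = {
    "dev": "contosolakedev",
    "test": "contosolaketest",
    "prod": "contosolakeprod",
}


def dataset_uri(environment: str, relative_path: str, file_system: str = "datalake") -> str:
    """Same relative path in every environment; only the account changes."""
    account = ACCOUNTS[environment]
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


for env in ("dev", "test", "prod"):
    print(dataset_uri(env, "RawData/Sales/Orders/2019/03/15/Orders_2019_03_15.csv"))
```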

Q: How much do I need to be concerned with the similarity of file contents within a folder?

The general rule is for all files to have the same format underneath a folder node. This is because scripts often traverse all files in a folder. If you have something like JSON it's fine if the schema differs per file, but for fixed data structures such as CSV, a script will error out if some of the files have a different format. (This does mean sometimes you need to refresh historical files to align with a format change that occurred along the way.)
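A simple way to catch format drift before a script errors out is to compare the header row of every CSV in a folder. Here's a small sketch using only the standard library; the file names and columns are invented, and it treats the first file (in sorted order) as the expected format:

```python
import csv
import tempfile
from pathlib import Path


def mismatched_headers(folder: Path) -> list:
    """Return names of CSV files whose header differs from the first file's header."""
    headers = {}
    for csv_file in sorted(folder.glob("*.csv")):
        with csv_file.open(newline="") as handle:
            headers[csv_file.name] = next(csv.reader(handle), [])
    if not headers:
        return []
    expected = next(iter(headers.values()))
    return [name for name, header in headers.items() if header != expected]


# Demo with two throwaway files; the second one gained a column along the way
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "Orders_2019_03_14.csv").write_text("order_id,amount\n1,10.00\n")
    (folder / "Orders_2019_03_15.csv").write_text("order_id,amount,currency\n2,12.50,USD\n")
    odd_files = mismatched_headers(folder)

print(odd_files)
```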

Q: When should we load data from a relational data source into a data lake?

I devoted a blog post to this because it comes up a lot—check here.

Q: Data lakes are supposed to be agile. So I don’t need to worry about naming conventions, right?

Try your best to not neglect naming conventions. You might use camel case, or you might just go with all lower case – either is ok, as long as you’re consistent. There are two big reasons for this: First, some languages are case-sensitive so consistent naming structures end up being less frustrating. Second, the bigger your data lake gets the more likely you are to have scripts that manage the data and/or the metadata, and they are more easily maintained and parameterized if consistent.
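If you want to enforce a convention rather than just hope for it, a check like the following can run as part of your data lake management scripts. The two patterns are my own simplified assumptions of what "camel case" and "all lower case" mean (for example, the camel case pattern won't accept acronyms like "ID"), so adjust them to your own standard:

```python
import re

# Simplified conventions (assumptions): UpperCamelCase like "SubjectArea",
# or lower case with digits and underscores like "subject_area"
UPPER_CAMEL = re.compile(r"(?:[A-Z][a-z0-9]+)+")
LOWER_CASE = re.compile(r"[a-z0-9_]+")


def violations(names, convention):
    """Return the names that don't fully match the chosen convention."""
    return [name for name in names if not convention.fullmatch(name)]


folders = ["SubjectArea", "DataSource", "customer_contacts", "salesData"]
print(violations(folders, UPPER_CAMEL))
```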

You Might Also Like…

Zones in a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lake Use Cases and Planning Considerations

BlueGranite eBook - Data Lakes in a Modern Data Architecture

When Should We Load Relational Data to a Data Lake?

This is a question I get fairly regularly these days: Should we extract relational data and load it to a data lake?

Architecture diagrams, such as the one displayed here from Microsoft, frequently depict all types of data sources going through the lake:

For certain types of data, writing it to the data lake is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we "de-relationalize" data when we write it to a file in the data lake. If we're going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
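To see what "de-relationalizing" looks like in practice, here's a tiny sketch that writes one typed row to CSV and reads it back using Python's standard csv module. The column names and values are made up:

```python
import csv
import io
from datetime import date
from decimal import Decimal

# A row as it might come from a relational source: int, date, and decimal types
row = {"order_id": 1001, "order_date": date(2019, 3, 15), "amount": Decimal("19.99")}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read the same data back from the "file"
buffer.seek(0)
round_tripped = next(csv.DictReader(buffer))
print({column: type(value).__name__ for column, value in round_tripped.items()})
```

Every value comes back as a string, so whatever loads the file next has to re-infer or re-declare the data types. Formats like parquet do retain column types, but constraints and foreign keys are still left behind in the source database.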

My current thinking is that this is justified primarily by these situations:

(1) Do you need to keep the history? If you are retaining snapshots as of periodic points in time, having those snapshots accessible in a data lake can be useful.

(2) Will multiple downstream teams, systems, or processes access the data? If there are multiple consumers of the data (and presumably we don't want them accessing the original source directly), then it's possible that providing data access from a data lake might be beneficial.

(3) Do you have a strategy of storing *all* of the organizational data in your data lake? If you've standardized on this type of approach, and (1) or (2) above applies, then it makes sense.

(4) Do you need to provide a subset of data for a specialized use case? If you can't provide access to the source database, but a full replica or another relational DB is overkill, then delivery of data via the data lake can sometimes make sense.

DataLakeQuote.jpg
 

The above list isn’t exhaustive; it always depends. My rule of thumb: Avoid writing relational data to a data lake when it doesn't have value being there. This is an opinion and not everyone agrees with this. Even if we are using our data lake as a staging area for a data warehouse, my opinion is that relational data doesn't necessarily have to make a pit stop in the data lake unless it's justified to do so.
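For what it's worth, the four questions above can be written down as a simple checklist function. This is only a sketch of my rule of thumb (the function and parameter names are invented), not a substitute for judgment:

```python
def reasons_to_land_in_lake(keep_history: bool = False,
                            multiple_consumers: bool = False,
                            all_data_in_lake_strategy: bool = False,
                            specialized_subset: bool = False) -> list:
    """Return the reasons that justify writing relational data to the lake."""
    reasons = []
    if keep_history:
        reasons.append("(1) periodic snapshots need to be retained")
    if multiple_consumers:
        reasons.append("(2) multiple downstream consumers need access")
    # (3) only counts when (1) or (2) also applies
    if all_data_in_lake_strategy and (keep_history or multiple_consumers):
        reasons.append("(3) the organization stores all data in the lake")
    if specialized_subset:
        reasons.append("(4) a subset is needed for a specialized use case")
    return reasons


# An empty list suggests skipping the pit stop in the data lake
print(reasons_to_land_in_lake(all_data_in_lake_strategy=True))
print(reasons_to_land_in_lake(keep_history=True, all_data_in_lake_strategy=True))
```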

You Might Also Like…

Zones in a Data Lake

Data Lake Use Cases and Planning Considerations

James Serra’s blog - Should I load Structured Data Into My Data Lake?