Resources for Learning About Azure Data Lake Storage Gen2

A couple of people have asked me recently how to 'bone up' on the new data lake service in Azure. The way I see it, there are two aspects: (a) the technology itself, and (b) data lake principles and architectural best practices. Below are some links to resources that you should find helpful.

Learning about ADLS Gen2 Technology

Azure Data Lake Storage Gen2 is new, so information specific to it is still limited. However, since it's built upon the foundation of Azure Storage, quite a lot of existing information applies as well (though, in all fairness, ADLS Gen2 hasn't yet reached feature parity with Blob storage). Here are some resources about the technology:

10 Things to Know about Azure Data Lake Storage Gen2

Planning for Accounts, Containers, and File Systems for Your Data Lake in Azure Storage

Best Practices for Using Azure Data Lake Storage Gen2

The Azure Blob Filesystem Driver (ABFS): A Dedicated Storage Driver for Hadoop

Azure Data Lake Storage Gen2 Hierarchical Namespace

Use the Azure Data Lake Storage Gen2 URI

Overview of Azure Data Lake Storage Gen2 [video]

Pluralsight Course: Implementing Azure Data Lake Storage Gen2 by Xavier Morera [video—requires subscription]

Learning about Data Lake Principles and Architectural Best Practices

Just like when designing a database, there are some important aspects to designing a data lake that improve usability, security, performance, and governance. This is your enterprise data we're talking about, right? Some up-front planning for how this data is structured is warranted (because yes, a data lake is more agile...but not so agile that it becomes the dreaded swamp). Here are a few resources for learning about principles and best practices:

Data Lake Use Cases and Planning Considerations

Zones in a Data Lake

FAQs About Organizing a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lakes in a Modern Data Architecture [ebook]

The Data Lake Manifesto: 10 Best Practices

A Smarter Way to Jump Into Data Lakes

Big Data by Nathan Marz [book] <— This book is getting older now, but the conceptual chapters are excellent. Skip the technology chapters, focus on the concepts, and it's a worthwhile read.

There are also several books on data lakes. I don’t have a favorite. Just keep in mind that some things are opinions and personal preferences. Though data lakes are maturing, best practices are still emerging. Many articles and book intros overstate the benefits and under-emphasize the challenges, so watch out for that.

Following the Maturity of ADLS Gen2

These are two important URLs for tracking what is and isn't supported in ADLS Gen2:

Known Issues 

Upgrade Your Big Data Analytics Solutions from ADLS Gen1 to ADLS Gen2

FAQs About Organizing a Data Lake

This post covers several things I've heard or been asked recently about organizing data in a data lake.

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv
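
As a minimal sketch of how this convention might be applied in a script (the subject area, data source, and file names here are hypothetical), the security-friendly path could be built like this in Python:

    from datetime import date

    def raw_path(subject_area: str, data_source: str, load_date: date) -> str:
        # Dates go at the end, so security can be granted at the
        # subject area or data source level without touching date folders.
        d = load_date
        return (f"{subject_area}/{data_source}/{d:%Y}/{d:%m}/{d:%d}/"
                f"FileData_{d:%Y}_{d:%m}_{d:%d}.csv")

    print(raw_path("SubjectArea", "DataSource", date(2019, 3, 15)))
    # SubjectArea/DataSource/2019/03/15/FileData_2019_03_15.csv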

Here’s an example of what the raw data zone might look like with the date partitioning at the end:

[Image: DataLakeOrganization1.jpg, an example raw data zone with date partitioning at the end of the folder path]

Notice in the above example how the date element is repeated in *both* the folder structure and the file name. Being very clear in the naming of folders and files helps a lot with usability.

Also, keep in mind that related compute engines or data processing tools might have firm expectations about the folder structure (or about what's contained in the file name). For instance, the year and month folders might translate directly to a column within the file. Or, if you're using a tool like Azure Stream Analytics to push data to the lake, you'll define in ASA what the date partitioning scheme looks like in the data lake (because ASA takes care of creating the folders as data arrives).
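
For illustration, here's a small sketch (using a hypothetical path that follows the convention shown earlier) of how a processing script might derive date columns from the folder structure:

    import re

    # Hypothetical file path following the convention shown earlier.
    path = "SubjectArea/DataSource/2019/03/15/FileData_2019_03_15.csv"

    # A tool reading this file could derive year/month/day columns
    # directly from the folder names.
    match = re.match(r".+/(\d{4})/(\d{2})/(\d{2})/[^/]+$", path)
    if match:
        year, month, day = match.groups()
        print(year, month, day)  # 2019 03 15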

Q: How do data lake zones translate to a folder structure?

The zones that I talked about previously are a conceptual idea. Most commonly I've seen zones translate to a top-level folder (as shown in the image above). However, it's also possible for the zones to reside within, say, a subject area, as shown in this next image:

[Image: DataLakeOrganization2.jpg, an example with zones residing within a subject area]

Generally speaking, business users only get access to the prepared data in the curated data zone (with some exceptions of course). Zones like Raw Data and Staged Data are frequently ‘kitchen areas’ that have little to no user access. That’s why putting the zones at the top-most level is very common. However, if your objective is to make all of the data available in an easier way, then putting zones underneath a subject area might make sense—this is less common from what I’ve seen though because exposing too much data to business users can be confusing.

On the flip side, another less common option would be to further separate zones beyond just top-level folders. For instance, in Azure Data Lake Storage Gen2 we have the structure of Account > File System > Folders > Files to work with (terminology-wise, a file system in ADLS Gen2 is equivalent to a container in Azure Blob Storage). Depending on what you are trying to accomplish, you might decide that separate file systems are appropriate for areas of your data lake:

[Image: ADLSGen2.jpg, an example with separate file systems for areas of the data lake]
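
If you went this route, creating the file systems with a consistent folder structure beneath them could look something like this minimal sketch using the azure-storage-file-datalake Python package (the account URL, credential, zone names, and folder path are all hypothetical):

    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder account URL and credential.
    service = DataLakeServiceClient(
        account_url="https://myaccount.dfs.core.windows.net",
        credential="<account-key>")

    # Each zone gets its own file system (container) in this approach,
    # with an identical folder structure beneath each one.
    for zone in ["raw", "staged", "curated"]:
        fs = service.create_file_system(file_system=zone)
        fs.create_directory("SalesOrders/2019/03/15")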

If your objective is to have an enterprise-wide data lake, then more separation is less appealing.

Q: Should the date reflected in the folder structure be the ingestion date or the date associated with the source data?

It could be either one. I tend to think this depends on whether you're dealing with data that's being pushed or pulled into the data lake, and whether it's transactional or snapshot data.

Push system: Let’s say you have machine telemetry or IoT data that is being loaded to the data lake. In this case, the dates in the folder structure would typically be based on ingestion date.

Pull system: If you have a scheduled process that loads data into the lake, then it's up to the architect of the process to determine what the date means:

  • Transactional data: If sales data is being loaded, the dates could easily relate to the sale transaction date (even if we pulled the data out three days later). Typically transactional data is append-only.

  • Snapshot data: Let’s say we want to organize the data by its "as of" date. If you look back at the very first image shown above, the CustomerContacts folder is intended to show a snapshot of what that data looked like as of a point in time. Typically this would be for reference data, and is stored in full every time it’s extracted into the data lake.
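
To make the contrast concrete, here's a minimal sketch of the two date choices (the folder and function names are hypothetical):

    from datetime import date

    def ingestion_path(subject: str, source: str, arrival: date) -> str:
        # Push system: the folder date is simply when the data arrived.
        return f"{subject}/{source}/{arrival:%Y}/{arrival:%m}/{arrival:%d}/"

    def transaction_path(subject: str, source: str, txn_date: date) -> str:
        # Pull system, transactional data: the folder date reflects the
        # business event, even if the extract ran three days later.
        return f"{subject}/{source}/{txn_date:%Y}/{txn_date:%m}/{txn_date:%d}/"

    print(ingestion_path("Telemetry", "Sensor01", date(2019, 3, 18)))
    print(transaction_path("Sales", "OrderSystem", date(2019, 3, 15)))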

Q: Is it a good idea to create folders that nest multiple data elements?

This type of structure, which nests three data elements into one folder level, is typically *not* recommended:

[Image: DataLakeOrganization3.jpg, an example structure that nests Company-Division-Project into a single folder level]

There are two potential issues with ‘nesting’ elements like Company-Division-Project as shown above:

  1. Security: Setting up security is probably harder. For instance, if Mary should see everything in Division 2, that just got harder because now Division is associated with the granularity of Projects. Not impossible to manage, just likely to be more work.

  2. Performance: Query performance can suffer. Some compute engines and query tools understand the structure of the data lake and can do 'data pruning' (like predicate pushdown). Let's say you send a query asking for all data for Project A240. If Project were its own folder, the compute engine could likely scan just that one folder, which improves performance considerably. (This optimization applies to a hierarchical file system like Azure Data Lake Storage (Gen1 or Gen2), but not to an object store like Azure Blob Storage.)
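
As a hedged illustration of the pruning point (the account, file system, and folder names are hypothetical), a PySpark query against a Project-level folder only has to read that one folder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With Project as its own folder level, the engine scans one folder:
    df = spark.read.csv(
        "abfss://raw@myaccount.dfs.core.windows.net/Sales/ProjectA240/",
        header=True)

    # With nested Company-Division-Project folders, finding everything
    # for Project A240 means listing and scanning many more folders.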

Q: If I need separate dev, test, and prod environments, how would this usually be handled?

Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

We wouldn’t usually separate dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with different schemas for dev/test/prod), but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must make do with one data lake (one service), then the environment should be the top-level node.
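
One minimal way to keep the structure identical while parameterizing the environment (the account names here are hypothetical):

    # One ADLS Gen2 resource per environment; hypothetical account names.
    ACCOUNTS = {
        "dev": "https://mydatalakedev.dfs.core.windows.net",
        "test": "https://mydatalaketest.dfs.core.windows.net",
        "prod": "https://mydatalakeprod.dfs.core.windows.net",
    }

    def lake_url(env: str, path: str) -> str:
        # The folder structure is exactly the same in all three lakes;
        # only the account (the service) differs per environment.
        return f"{ACCOUNTS[env]}/{path}"

    print(lake_url("dev", "raw/SalesOrders/2019/03/15/"))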

Q: How much do I need to be concerned with the similarity of file contents within a folder?

The general rule is for all files underneath a folder node to have the same format. This is because scripts often traverse all files in a folder. With a flexible format like JSON, it's fine if the schema differs per file; but with a fixed structure such as CSV, a script will error out if some of the files have a different format. (This does mean you sometimes need to refresh historical files to align with a format change that occurred along the way.)
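
Here's a small sketch of the kind of sanity check a script might run before processing a folder of CSVs (the folder name is hypothetical):

    import csv
    from pathlib import Path

    def check_csv_headers(folder: str) -> None:
        # Scripts often traverse every file in a folder, so verify that
        # all CSVs share the same header before processing them.
        headers = set()
        for f in Path(folder).glob("*.csv"):
            with open(f, newline="") as fh:
                headers.add(tuple(next(csv.reader(fh))))
        if len(headers) > 1:
            raise ValueError(f"Inconsistent headers in {folder}: {headers}")

    check_csv_headers("SubjectArea/DataSource/2019/03/15")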

Q: When should we load data from a relational data source into a data lake?

I devoted a blog post to this because it comes up a lot—check here.

Q: Data lakes are supposed to be agile. So I don’t need to worry about naming conventions, right?

Try your best to not neglect naming conventions. You might use camel case, or you might just go with all lower case – either is ok, as long as you’re consistent. There are two big reasons for this: First, some languages are case-sensitive so consistent naming structures end up being less frustrating. Second, the bigger your data lake gets the more likely you are to have scripts that manage the data and/or the metadata, and they are more easily maintained and parameterized if consistent.
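
For example, once naming is consistent, even a tiny validation script becomes practical (the convention here is a hypothetical all-lower-case one):

    import re

    # Hypothetical convention: all lower case, digits and underscores only.
    FOLDER_PATTERN = re.compile(r"^[a-z0-9_]+$")

    def is_valid_folder_name(name: str) -> bool:
        return bool(FOLDER_PATTERN.match(name))

    print(is_valid_folder_name("salesorders"))  # True
    print(is_valid_folder_name("SalesOrders"))  # False: mixed case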

You Might Also Like…

Zones in a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lake Use Cases and Planning Considerations

BlueGranite eBook - Data Lakes in a Modern Data Architecture

When Should We Load Relational Data to a Data Lake?

This is a question I get fairly regularly these days: Should we extract relational data and load it to a data lake?

Architecture diagrams, such as the one displayed here from Microsoft, frequently depict all types of data sources flowing through the lake.

For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low-latency IoT data, semi-structured data like logs, and data with varying structures such as social media data. However, the handling of structured data that originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we "de-relationalize" data when we write it to a file in the data lake. If we're going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
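
To make "de-relationalize" concrete, here's a minimal sketch of a typical extract (the connection string and table name are placeholders), where rows and basic types survive but keys and constraints do not:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string and table name.
    engine = create_engine("mssql+pyodbc://<server>/<database>?driver=...")
    df = pd.read_sql_table("SalesOrder", engine)

    # The file keeps the column names and basic data types, but the
    # primary keys, foreign keys, and constraints don't come along.
    df.to_parquet("SalesOrder_2019_03_15.parquet")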

My current thinking is that this is justified primarily by these situations:

(1) Do you need to keep the history? If you are retaining snapshots as of periodic points in time, having those snapshots accessible in a data lake can be useful.

(2) Will multiple downstream teams, systems, or processes access the data? If there are multiple consumers of the data (and presumably we don't want them accessing the original source directly), then it's possible that providing data access from a data lake might be beneficial.

(3) Do you have a strategy of storing *all* of the organizational data in your data lake? If you've standardized on this type of approach, and (1) or (2) above applies, then it makes sense.

(4) Do you need to provide a subset of data for a specialized use case? If you can't provide access to the source database, but a full replica or another relational DB is overkill, then delivery of data via the data lake can sometimes make sense.


The above list isn’t exhaustive - it always depends. My rule of thumb: avoid writing relational data to a data lake when it doesn't have value being there. This is an opinion, and not everyone agrees with it. Even if we are using our data lake as a staging area for a data warehouse, my opinion is that not all relational data has to make a pit stop in the data lake; only the data where it's justified to do so.

You Might Also Like…

Zones in a Data Lake

Data Lake Use Cases and Planning Considerations

James Serra’s blog - Should I load Structured Data Into My Data Lake?

New eBook - Data Lakes in a Modern Data Architecture

This is a quick announcement of a new resource available on the BlueGranite site.

If you're interested in data lakes, you might want to check out an updated ebook just published to the BlueGranite site. It's called "Data Lakes in a Modern Data Architecture." This ebook was originally published about 3 years ago by Chris Campbell. If you saw the original, you'll note we retained the same 'look and feel' but about 90% of the content has been updated.

[Image: DataLakeVsDataWarehouse.jpg]

I wrote the updated content from a practical point of view, totally hype-free. The table of contents:

  • Modern Data Architecture
  • Business Needs Driving Data Architectures to Evolve and Adapt
  • Principles of a Modern Data Architecture
  • Data Lake + Data Warehouse: Complementary Solutions
  • Tips for Designing a Data Lake
  • Azure Technologies for Implementing a Data Lake
  • Considerations for a Successful Data Lake in the Cloud
  • Getting Started with a Data Lake

To download the ebook, BlueGranite will ask you to register your information. That's common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you'll receive notifications of new content that you may find interesting.

I'll be updating the ebook from time to time. For example, it already needs updating to reflect the new changes in Azure Data Lake Storage Gen2.

At 23 pages, the ebook just begins to explore a lot of considerations. To see if we can help you dive in deeper (pun intended), please contact us. I hope you enjoy reading the ebook as much as I enjoyed writing it.