FAQs About Organizing a Data Lake

This post covers several things I've heard or been asked recently about organizing data in a data lake.

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv

Here’s an example of what the raw data zone might look like with the date partitioning at the end:

DataLakeOrganization1.jpg
 

Notice in the above example how the date element is repeated in *both* the folder structure and the file name. Being very clear in the naming of folders and files helps a lot with usability.

Also, keep in mind that related compute engines or data processing tools might have a firm expectation as to what the folder structure is (or potentially is contained in the file name). For instance, the year and month folders might translate directly to a column within the file. Or, if you’re using a tool like Azure Stream Analytics to push data to the lake, you’ll be defining in ASA what the date partitioning schema looks like in the data lake (because ASA takes care of creating the folders as data arrives).

Q: How do data lake zones translate to a folder structure?

The zones that I talked about previously are a conceptual idea. Most commonly I’ve seen zones translate to a top level folder (like shown in the image above). However, it's also possible that the zones would reside within, say, a subject area as shown in this next image:

DataLakeOrganization2.jpg
 

Generally speaking, business users only get access to the prepared data in the curated data zone (with some exceptions of course). Zones like Raw Data and Staged Data are frequently ‘kitchen areas’ that have little to no user access. That’s why putting the zones at the top-most level is very common. However, if your objective is to make all of the data available in an easier way, then putting zones underneath a subject area might make sense—this is less common from what I’ve seen though because exposing too much data to business users can be confusing.

On the flip side, another less common option would be to further separate zones beyond just top-level folders. For instance, in Azure Data Lake Storage Gen 2, we have the structure of Account > File System > Folders > Files to work with (terminology-wise, a File System in ADLS Gen 2 is equivalent to a Container in Azure Blob Storage). Depending on what you are trying to accomplish, you might decide that separate file systems are appropriate for areas of your data lake:

ADLSGen2.jpg

If your objective is to have an enterprise-wise data lake, then more separation is less appealing.

Q: Should the date reflected in the folder structure be the ingestion date or the date associated with the source data?

It could be either one. I tend to think this is dependent on whether you're dealing with data that's being pushed or pulled into the data lake, and if it’s transactional or snapshot data.

Push system: Let’s say you have machine telemetry or IoT data that is being loaded to the data lake. In this case, the dates in the folder structure would typically be based on ingestion date.

Pull system: If you have a scheduled process that loads data into the lake, then it's up to the architect of the process to determine what the date means:

  • Transactional data: If sales data is being loaded, the dates could easily relate to the sale transaction date (even if we pulled the data out three days later). Typically transactional data is append-only.

  • Snapshot data: Let’s say we want to organize the data by its "as of" date. If you look back at the very first image shown above, the CustomerContacts folder is intended to show a snapshot of what that data looked like as of a point in time. Typically this would be for reference data, and is stored in full every time it’s extracted into the data lake.

Q: Is it a good idea to created folders which nest multiple data elements?

This type of structure which nests 3 data elements into 1 folder is typically *not* recommended:

DataLakeOrganization3.jpg
 

There are two potential issues with ‘nesting’ elements like Company-Division-Project as shown above:

  1. Security: Setting up security is probably harder. For instance, if Mary should see everything in Division 2, that just got harder because now Division is associated with the granularity of Projects. Not impossible to manage, just likely to be more work.

  2. Performance: The query performance can suffer. Some compute engines & query tools can understand the structure of the data lake and do ‘data pruning’ (like predicate pushdown). Let’s say you send a query asking for all data for Project A240. If Project were its own folder, the likelihood of the compute engine needing to scan only that one folder is much higher and would improve performance considerably. (This performance optimization is applicable to a hierarchical file system like Azure Data Lake Storage (Gen1 or Gen2), but not applicable to an object store like Azure Blob Storage.)

Q: If I need a separate dev, test, prod environment, how would this usually be handled?

Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

We wouldn’t usually separate out dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with a different schema for dev/test/prod) but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must get by with it being within one data lake (one service), then the environment should be the top level node.

Q: How much do I need to be concerned with the similarity of file contents within a folder?

The general rule is for all files to have the same format underneath a folder node. This is because scripts often traverse all files in a folder. If you have something like JSON it's fine if the schema differs per file, but for fixed data structures such as CSV, a script will error out if some of the files have a different format. (This does mean sometimes you need to refresh historical files to align with a format change that occurred along the way.)

Q: When should we load data from a relational data source into a data lake?

I devoted a blog post to this because it comes up a lot—check here.

Q: Data lakes are supposed to be agile. So I don’t need to worry about about naming conventions, right?

Try your best to not neglect naming conventions. You might use camel case, or you might just go with all lower case – either is ok, as long as you’re consistent. There are two big reasons for this: First, some languages are case-sensitive so consistent naming structures end up being less frustrating. Second, the bigger your data lake gets the more likely you are to have scripts that manage the data and/or the metadata, and they are more easily maintained and parameterized if consistent.

You Might Also Like…

Zones in a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lake Use Cases and Planning Considerations

BlueGranite eBook - Data Lakes in a Modern Data Architecture

How to Reference Azure Storage Files from Cloud Shell

Recently I used Azure Cloud Shell for the first time. This is a quick post to show how I referenced the file share in Azure Storage to communicate with Cloud Shell.

What is Cloud Shell?

Cloud Shell is a lightweight way to run scripts using either Bash or PowerShell. You can run scripts in a browser using the Azure portal or shell.azure.com, with the Azure mobile app, or using the VS Code Azure Account extension. If you have seen the "Try it now" links in Azure documentation pages, that will direct you to use Cloud Shell.

CloudShell_AzurePortal.jpg

The rest of this post focuses on using PowerShell with Cloud Shell.

Finding the File Service Info to Use with Cloud Shell

When you create a Cloud Shell account, you are prompted to also create an Azure Storage account. This gives you a mounted file share in the Azure Files service which is available for all of your Cloud Shell sessions.

You can use Get-CloudDrive to find the information related to the drive for your Cloud Shell account:

Get-CloudDrive.jpg
 

Example of Exporting a File to Azure Files Using Cloud Shell

Using the mount point information we got from Get-CloudDrive shown above, we now have what we need for how to reference the Azure file share that is associated to my Cloud Shell account. For an example, I’m going to export a PBIX file from the Power BI Service (discussed in my previous blog post).

First we need to log in to the Power BI Service. In Cloud Shell, authentication to another service like Power BI is a little different than how we see it within a local client tool - it utilizes a login page with a code provided by Cloud Shell:

Connect-PowerBIServiceAccount.jpg

Now that we are authenticated, we can execute our Export-PowerBIReport cmdlet:

Export-PowerBIReport_CloudShell.jpg

Note in the above PowerShell cmdlet, I referenced the mount point for the Azure file share associated with my Cloud Shell account as:

'/home/melissa/clouddrive/PowerBIExportFiles/ExportTest_V2workspace.pbix'
AzureFiles.jpg

Notice in the above example I used a folder called ‘PowerBIExportFiles’ — note that any folder(s) need to already exist before you can export a file to it. It won’t auto-create for you; if the folder doesn’t exist you’ll get an error that says “Could not find a part of the path.” Using a folder is optional though.

You Might Also Like…

Controlling Data Access in Azure for Administrators and Owners

Getting Started with Azure

PowerShell for Assigning and Querying Tags in Azure

How Permissions Work for a Power BI Service Administrator

A Power BI administrator is a role for managing various aspects of the Power BI Service. This role can be assigned in Office 365. Anyone with Office 365 global admin privileges is also a Power BI administrator by default.

Based on the tests I've been doing, I've observed that users with membership to the Power BI administrator role have two sets of permissions apply:

  • Activities which are scoped across the entire organization

  • Activities for which normal user permissions apply

Within the above 2 categories, I’m thinking there are 4 main types of activities:

  1. Manage tenant settings (always scoped to the organization)

  2. Compile inventory and metadata (can be scoped to the organization)

  3. Manage workspace users (can be scoped to the organization)

  4. Export content from a workspace (relies on user permissions)

PowerBIAdministrator_CategoriesOfResponsibility.jpg

My motivation for laying this out is because initially I expected that all of the PowerShell scripts (including the Export-PowerBIReport cmdlet discussed below) would apply organization-wide. This made me think that the Power BI administrator role had become a highly elevated role which could access all content in the Power BI tenant. However, my initial expectation was wrong in that respect — it turns out that a Power BI administrator can access all metadata but not all of the actual data.

Next let’s briefly review each of the 4 types of activities.

Manage Power BI Tenant Settings

The ability to manage tenant settings in the Power BI Admin Portal, has been in place for some time now. It includes managing things such as:

PowerBIAdminPortalTenantSettings.jpg
  • Tenant settings (for the most part this includes enabling/disabling certain features to influence the user experience and/or govern the system)

  • Capacity

  • Embed codes

  • Organizational custom visuals

The Power BI administrator role cannot be delegated to individual subsets of the organization — it applies to the entire tenant.

The role also cannot be granted in a read-only way. This can make it challenging in a very large organization. For example, let's say you're a large worldwide organization with five main divisions. One of the key Power BI people from division A requests access to the Power BI Admin Portal because they want to be able to view what the settings are. An example I've seen of this is someone thinking that the 'push apps to end users' doesn't work, when really the issue is that it's disabled by default in the tenant settings.

Compile Power BI Inventory and Metadata

With the introduction of the Power BI Management Module, we can more easily run scripts to perform certain activities such as accessing metadata. There are several cmdlets available, for instance: Get-PowerBIDashboard, Get-PowerBIReport, and Get-PowerBIWorkspace.

Here’s an example of a script which looks across the entire tenant (i.e., the organization scope) to find all instances of a report named Product Sales Analysis:

Get-PowerBIReport.jpg

The key point above is that the Power BI administrator can retrieve all metadata like this, including My Workspace for other users. This is actually great because it means a Power BI administrator can put together an inventory of the content in the tenant. If you compare this to usage data from the Office 365 audit log, you could determine what isn’t being used.

Manage Workspace Users

There are PowerShell cmdlets such as Add-PowerBIWorkspaceUser and Remove-PowerBIWorkspaceUser to manage the new type of workspaces (i.e., the V2 ‘new workspace experience’ that is in preview as of this writing in late 2018).

Here is an example of my Power BI administrator account providing member permissions to a colleague:

Add-PowerBIWorkspaceUser.jpg

The interesting part of the above example is that my Power BI administrator account does *not* have any direct permissions to the workspace. However, the organization scope allows it to be done.

Export Power BI Content from Workspaces

There is a PowerShell cmdlet called Export-PowerBIReport which, as the name implies, exports a PBIX from the Power BI Service. This includes the report and the underlying data.

Initially I thought this would also be able to be done with an organization-wide scope, but that’s not true. (Though if that were true, that would open up some interesting scenarios like exporting files to make backups, and/or exporting files to minimize risk that critical data is being delivered from a user's individual workspace…but that’s not possible right now. And if it were, I would want that highly privileged role to be separate somehow from the existing Power BI administrator role.)

Here is an example of exporting one file:

Export-PowerBIReport.jpg

Unlike the previous two examples, Export-PowerBIReport does require the Power BI administrator to have rights to the workspace in order to access the content. This is what I was observing, so I was happy to have it confirmed by Chaz Beck (CodeCyclone) from the Power BI product team when he replied to this GitHub issue. This is a legitimate unauthorized (401) message when a Power BI administrator tries to export a PBIX that resides somewhere that the administrator doesn’t have access to - this includes My Workspace for all other users.

Also, this cmdlet appears to only work with the new workspace experience that is currently in public preview. I get an unauthorized (401) message, even if I’m an admin on the workspace itself, if I try this on a V1 workspace. This is confirmed as a bug on this GitHub issue.

Summary

Hopefully this post saves you some time in determining how permissions apply to the different types of activities that a Power BI administrator can do.

 
PowerBIAdministrator.jpg
 

The scripts shown above are super simplified, not really ready for actual production use, but I kept them simple since the focus of this post is on permissions.

Also, keep in mind that you do *not* have to be a designated Power BI administrator to use the Power BI PowerShell cmdlets — any user can run them related to their own content. However, you *do* need to be a Power BI administrator in order to set the scope to organization (for those cmdlets which support it).

To find additional information:

You Might Also Like

Terminology Check - What is a Power BI App?

Lesson Learned - Keep PowerShell Modules Consistent and Up To Date

Checklist for Finalizing a Model in Power BI Desktop

When Should We Load Relational Data to a Data Lake?

This is a question I get fairly regularly these days: Should we extract relational data and load it to a data lake?

Architecture diagrams, such as the one displayed here from Microsoft, frequently depict all types of data sources going thru the lake:

For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we "de-relationalize" data when we write it to a file in the data lake. If we're going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?

My current thinking is that this is justified primarily by these situations:

(1) Do you need to keep the history? If you are retaining snapshots as of periodic points in time, having those snapshots accessible in a data lake can be useful.

(2) Will multiple downstream teams, systems, or processes access the data? If there are multiple consumers of the data (and presumably we don't want them accessing the original source directly), then it's possible that providing data access from a data lake might be beneficial.

(3) Do you have a strategy of storing *all* of the organizational data in your data lake? If you've standardized on this type of approach, and (1) or (2) above applies, then it makes sense.

(4) Do you need to provide a subset of data for a specialized use case? If you can't provide access to the source database, but a full replica or another relational DB is overkill, then delivery of data via the data lake can sometimes make sense.

DataLakeQuote.jpg
 

The above list isn’t exhaustive - it always depends. My rule of thumb: Avoid writing relational data to a data lake when it doesn't have value being there. This is an opinion and not everyone agrees with this. Even if we are using our data lake as a staging area for a data warehouse, my opinion is that all relational data doesn't necessarily have to make a pit stop in the data lake except when it's justified to do so.

You Might Also Like…

Zones in a Data Lake

Data Lake Use Cases and Planning Considerations

James Serra’s blog - Should I load Structured Data Into My Data Lake?