Keeping Up with Azure Changes

Since I started focusing primarily on Microsoft BI/DW/analytics around 2005, I've always been happy being a generalist within that space. Fast forward to around 2017 when I started focusing predominantly on Azure. Being a generalist (still within BI/DW/analytics) in Azure is really tough — as I'm sure you well know. Not only is the variety of technologies very wide, the pace of change is dizzying. Handling that pace of change is the subject of this post.

My two main ways of keeping up with Azure changes focus on:

(1) Being aware of new announcements, new features, general evolution and maturity of services.

(2) Getting hands-on with services and features to build out skills.

Related to (1) above, I've been trying to do a better job of keeping up in a timely way. Since January 1, I've been putting out a tweet with the #AzureDidYouKnow hashtag each day (well…most days anyway). It's a way to keep me motivated to continually keep up with announcements, and since I'm doing that anyway, I may as well share them with you too, right? I do love learning all the time, so this extra bit of motivation helps me keep at it. These daily updates usually relate to something new; sometimes it's something I just learned, or maybe something I find interesting.

I also want to share the resources I use most often to keep up with changes in Azure:

Websites with Update Roundups

Azure Updates** - https://azure.microsoft.com/en-us/updates/

Azure Blog Updates - https://azure.microsoft.com/en-us/blog/topics/updates/

Last Week in Azure - https://azure.microsoft.com/en-us/blog/topics/last-week-in-azure/

Build Azure - https://buildazure.com/category/azure-weekly/

**This site is great, but it does not contain every update from every Azure service. If there’s a service you are particularly vested in, be sure to follow that product team’s blog as well.

Podcasts & Videos

The Azure Podcast - http://azpodcast.azurewebsites.net/

Azure Friday - https://azure.microsoft.com/en-us/resources/videos/azure-friday/

Azure Flash Friday - http://www.azureflashfriday.com/

Microsoft Cloud Show - http://www.microsoftcloudshow.com/#

Azure This Week - A Cloud Guru - https://www.youtube.com/playlist?list=PLI1_CQcV71RmnrRBgJNlI1yY_WiOWIXov

Did I miss anything important in this list related to keeping up with Azure changes? (I did skip general Azure training videos purposely.) If I missed a good one, please shoot me a tweet and let me know.

You Might Also Like…

Getting Started with Azure

FAQs About Organizing a Data Lake

This post covers several things I've heard or been asked recently about organizing data in a data lake.

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv
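To make the security point concrete, here is a minimal sketch (assuming ADLS Gen2 and the Az.Storage module; the account, file system, folder, and group ID are all hypothetical) of granting a group read and execute access at a SubjectArea-level folder:

# Get a context for the storage account (hypothetical account name)
$ctx = New-AzStorageContext -StorageAccountName 'mydatalake' -UseConnectedAccount

# Read the folder's existing ACL and append an entry for an AAD group
$folder = Get-AzDataLakeGen2Item -Context $ctx -FileSystem 'raw' -Path 'Sales'
$acl = Set-AzDataLakeGen2ItemAclObject -AccessControlType Group -EntityId '<aad-group-object-id>' -Permission 'r-x' -InputObject $folder.ACL

# Apply the updated ACL back to the SubjectArea folder
Update-AzDataLakeGen2Item -Context $ctx -FileSystem 'raw' -Path 'Sales' -Acl $acl

In practice you would also set default ACLs so new child folders and files inherit the entry; the key point is that securing by subject area gives you one place to manage, whereas a date-first structure would need the same entry repeated under every year/month/day branch.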

Here’s an example of what the raw data zone might look like with the date partitioning at the end:

DataLakeOrganization1.jpg

Notice in the above example how the date element is repeated in *both* the folder structure and the file name. Being very clear in the naming of folders and files helps a lot with usability.
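As a small illustration of that naming convention, here is a sketch (the subject area and data source names are made up) that builds the folder path and file name with the date repeated in both:

# Build a date-partitioned folder path and matching file name (hypothetical names)
$subjectArea = 'Sales'
$dataSource  = 'CRM'
$runDate     = Get-Date

$folderPath = '{0}/{1}/{2:yyyy}/{2:MM}/{2:dd}' -f $subjectArea, $dataSource, $runDate
$fileName   = 'FileData_{0:yyyy}_{0:MM}_{0:dd}.csv' -f $runDate

"$folderPath/$fileName"    # e.g. Sales/CRM/2019/11/14/FileData_2019_11_14.csv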

Also, keep in mind that related compute engines or data processing tools might have a firm expectation as to what the folder structure is (or what is contained in the file name). For instance, the year and month folders might translate directly to a column within the file. Or, if you’re using a tool like Azure Stream Analytics to push data to the lake, you’ll be defining in ASA what the date partitioning schema looks like in the data lake (because ASA takes care of creating the folders as data arrives).

Q: How do data lake zones translate to a folder structure?

The zones that I talked about previously are a conceptual idea. Most commonly I’ve seen zones translate to a top-level folder (as shown in the image above). However, it's also possible that the zones would reside within, say, a subject area as shown in this next image:

DataLakeOrganization2.jpg

Generally speaking, business users only get access to the prepared data in the curated data zone (with some exceptions of course). Zones like Raw Data and Staged Data are frequently ‘kitchen areas’ that have little to no user access. That’s why putting the zones at the top-most level is very common. However, if your objective is to make all of the data available in an easier way, then putting zones underneath a subject area might make sense—this is less common from what I’ve seen though because exposing too much data to business users can be confusing.

On the flip side, another less common option would be to further separate zones beyond just top-level folders. For instance, in Azure Data Lake Storage Gen 2, we have the structure of Account > File System > Folders > Files to work with (terminology-wise, a File System in ADLS Gen 2 is equivalent to a Container in Azure Blob Storage). Depending on what you are trying to accomplish, you might decide that separate file systems are appropriate for areas of your data lake:

ADLSGen2.jpg

If your objective is to have an enterprise-wide data lake, then more separation is less appealing.

Q: Should the date reflected in the folder structure be the ingestion date or the date associated with the source data?

It could be either one. I tend to think this depends on whether the data is being pushed or pulled into the data lake, and whether it’s transactional or snapshot data.

Push system: Let’s say you have machine telemetry or IoT data that is being loaded to the data lake. In this case, the dates in the folder structure would typically be based on ingestion date.

Pull system: If you have a scheduled process that loads data into the lake, then it's up to the architect of the process to determine what the date means:

  • Transactional data: If sales data is being loaded, the dates could easily relate to the sale transaction date (even if we pulled the data out three days later). Typically transactional data is append-only.

  • Snapshot data: Let’s say we want to organize the data by its "as of" date. If you look back at the very first image shown above, the CustomerContacts folder is intended to show a snapshot of what that data looked like as of a point in time. Typically this would be for reference data, and is stored in full every time it’s extracted into the data lake.

Q: Is it a good idea to create folders which nest multiple data elements?

This type of structure which nests 3 data elements into 1 folder is typically *not* recommended:

DataLakeOrganization3.jpg

There are two potential issues with ‘nesting’ elements like Company-Division-Project as shown above:

  1. Security: Setting up security is probably harder. For instance, if Mary should see everything in Division 2, that just got harder because now Division is associated with the granularity of Projects. Not impossible to manage, just likely to be more work.

  2. Performance: The query performance can suffer. Some compute engines & query tools can understand the structure of the data lake and do ‘data pruning’ (like predicate pushdown). Let’s say you send a query asking for all data for Project A240. If Project were its own folder, the likelihood of the compute engine needing to scan only that one folder is much higher and would improve performance considerably. (This performance optimization is applicable to a hierarchical file system like Azure Data Lake Storage (Gen1 or Gen2), but not applicable to an object store like Azure Blob Storage.)

Q: If I need a separate dev, test, prod environment, how would this usually be handled?

Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

We wouldn’t usually separate out dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with different schemas for dev/test/prod), but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must make do with one data lake (one service), then the environment should be the top-level node.
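As a sketch of what that looks like (assuming ADLS Gen2 and the Az.Storage module; the account names and folders are hypothetical), the same folder skeleton can be deployed to each environment from one script so they never drift apart:

# Create an identical folder skeleton in the dev, test, and prod storage accounts
$folders = @('RawData/Sales/CRM', 'StagedData/Sales/CRM', 'CuratedData/Sales')

foreach ($account in 'datalakedev', 'datalaketest', 'datalakeprod') {
    $ctx = New-AzStorageContext -StorageAccountName $account -UseConnectedAccount
    foreach ($folder in $folders) {
        # Creates the full path; add existence checks for real production use
        New-AzDataLakeGen2Item -Context $ctx -FileSystem 'datalake' -Path $folder -Directory
    }
}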

Q: How much do I need to be concerned with the similarity of file contents within a folder?

The general rule is for all files underneath a folder node to have the same format. This is because scripts often traverse all files in a folder. With a flexible format like JSON, it's fine if the schema differs per file; but for fixed data structures such as CSV, a script will error out if some of the files have a different format. (This does mean sometimes you need to refresh historical files to align with a format change that occurred along the way.)
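Here is a simple sketch (the folder path is hypothetical) of the kind of sanity check a load script might run first, comparing the header row of every CSV under a folder node:

# Compare the header row of every CSV underneath a folder
$files   = Get-ChildItem -Path './RawData/Sales/CRM' -Filter '*.csv' -Recurse -File
$headers = $files | ForEach-Object { Get-Content -Path $_.FullName -TotalCount 1 }

$distinct = @($headers | Sort-Object -Unique)
if ($distinct.Count -gt 1) {
    Write-Warning "Found $($distinct.Count) different header formats under this folder."
}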

Q: When should we load data from a relational data source into a data lake?

I devoted a blog post to this because it comes up a lot—check here.

Q: Data lakes are supposed to be agile. So I don’t need to worry about naming conventions, right?

Try your best not to neglect naming conventions. You might use camel case, or you might just go with all lower case – either is fine, as long as you’re consistent. There are two big reasons for this: First, some languages are case-sensitive, so consistent naming ends up being less frustrating. Second, the bigger your data lake gets, the more likely you are to have scripts that manage the data and/or the metadata, and those scripts are more easily maintained and parameterized if the naming is consistent.
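As a tiny example of that second point, a shared helper can enforce whatever convention you pick (all lower case with no spaces is just the assumption here) so every script builds paths the same way:

# Normalize path segments to one naming convention (illustrative only)
function Get-LakePath {
    param([string[]] $Segments)
    ($Segments | ForEach-Object { $_.ToLower().Replace(' ', '') }) -join '/'
}

Get-LakePath -Segments 'RawData', 'Customer Contacts', 'CRM'    # rawdata/customercontacts/crm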

You Might Also Like…

Zones in a Data Lake

When Should We Load Relational Data to a Data Lake?

Data Lake Use Cases and Planning Considerations

BlueGranite eBook - Data Lakes in a Modern Data Architecture

How to Reference Azure Storage Files from Cloud Shell

Recently I used Azure Cloud Shell for the first time. This is a quick post to show how I referenced the file share in Azure Storage to communicate with Cloud Shell.

What is Cloud Shell?

Cloud Shell is a lightweight way to run scripts using either Bash or PowerShell. You can run scripts in a browser using the Azure portal or shell.azure.com, with the Azure mobile app, or using the VS Code Azure Account extension. The "Try it now" links you see in Azure documentation pages also direct you to Cloud Shell.

CloudShell_AzurePortal.jpg

The rest of this post focuses on using PowerShell with Cloud Shell.

Finding the File Service Info to Use with Cloud Shell

When you create a Cloud Shell account, you are prompted to also create an Azure Storage account. This gives you a mounted file share in the Azure Files service which is available for all of your Cloud Shell sessions.

You can use Get-CloudDrive to find the information related to the drive for your Cloud Shell account:

Get-CloudDrive.jpg

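In text form, the commands look roughly like this (the clouddrive folder is mounted under your home directory in a Cloud Shell session):

# Show the storage account and file share that back the clouddrive mount
Get-CloudDrive

# The same file share is mounted into the Cloud Shell file system
Get-ChildItem ~/clouddrive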
Example of Exporting a File to Azure Files Using Cloud Shell

Using the mount point information returned by Get-CloudDrive above, we now have what we need to reference the Azure file share associated with my Cloud Shell account. As an example, I’m going to export a PBIX file from the Power BI Service (discussed in my previous blog post).

First we need to log in to the Power BI Service. In Cloud Shell, authentication to another service like Power BI is a little different from what we see in a local client tool - it uses a login page along with a code provided by Cloud Shell:

Connect-PowerBIServiceAccount.jpg

Now that we are authenticated, we can execute our Export-PowerBIReport cmdlet:

Export-PowerBIReport_CloudShell.jpg

Note in the above PowerShell cmdlet, I referenced the mount point for the Azure file share associated with my Cloud Shell account as:

'/home/melissa/clouddrive/PowerBIExportFiles/ExportTest_V2workspace.pbix'
AzureFiles.jpg

Notice in the above example I used a folder called ‘PowerBIExportFiles’. Using a folder is optional, but any folder(s) referenced need to already exist before you can export a file to them - the export won’t auto-create folders for you. If the folder doesn’t exist, you’ll get an error that says “Could not find a part of the path.”
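Putting the steps together, here is a minimal sketch of the whole flow (the workspace and report IDs are placeholders, and the -WorkspaceId parameter assumes a recent version of the MicrosoftPowerBIMgmt module):

# Log in to the Power BI Service (Cloud Shell displays a login URL and device code)
Connect-PowerBIServiceAccount

# Create the target folder under the clouddrive mount first - it is not auto-created
New-Item -ItemType Directory -Path ~/clouddrive/PowerBIExportFiles -Force

# Export the PBIX (report plus underlying data) to the Azure file share
Export-PowerBIReport -WorkspaceId '<workspace-guid>' -Id '<report-guid>' `
    -OutFile ~/clouddrive/PowerBIExportFiles/ExportTest_V2workspace.pbix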

You Might Also Like…

Controlling Data Access in Azure for Administrators and Owners

Getting Started with Azure

PowerShell for Assigning and Querying Tags in Azure

How Permissions Work for a Power BI Service Administrator

This post was updated on Nov 14, 2019.

The Power BI administrator is a specific role for managing various aspects of the Power BI Service; the role is assigned in Office 365. Anyone with Office 365 or Azure global admin privileges is also a Power BI administrator by default.

The Power BI administrator role is a very high privilege role, as discussed below.

Based on the tests I've been doing, I've observed that users with membership in the Power BI administrator role have two sets of permissions that apply:

  • Activities which are scoped across the entire organization

  • Activities for which normal user permissions apply (with a loophole - see below)

Within the above 2 categories, there are the following types of activities:

  1. Manage tenant settings (always scoped to the organization)

  2. Compile inventory and metadata (can be scoped to the organization)

  3. Manage workspace users (can be scoped to the organization)

  4. Export content from a workspace (relies on user permissions)

PowerBIAdministrator_CategoriesOfResponsibility.jpg

Next let’s briefly review each of the above types of activities.

Manage Power BI Tenant Settings

The ability to manage tenant settings in the Power BI Admin Portal has been in place for some time now. It includes managing settings such as:

PowerBIAdminPortalTenantSettings.jpg

  • Tenant settings (for the most part this includes enabling/disabling certain features to influence the user experience and/or govern the system)

  • Capacity

  • Embed codes

  • Organizational custom visuals

  • …plus lots more

The Power BI administrator role cannot be delegated to individual subsets of the organization — it applies to the entire tenant.

The role also cannot be granted in a read-only way, which can make it challenging in a very large organization. For example, let's say you're a large worldwide organization with five main divisions. One of the key Power BI people from division A requests access to the Power BI Admin Portal because they want to be able to view what the settings are (an example I've seen of this is someone thinking that the 'push apps to end users' feature doesn't work, when really the issue is that it's disabled by default in the tenant settings). Since there's no read-only option, the only way to grant that visibility is to make that person a full Power BI administrator for the entire tenant.

Compile Power BI Inventory and Metadata

With the introduction of the Power BI Management Module, we can more easily run scripts to perform certain activities such as accessing metadata. There are several cmdlets available, for instance: Get-PowerBIDashboard, Get-PowerBIReport, and Get-PowerBIWorkspace.

Here’s an example of a script which looks across the entire tenant (i.e., the organization scope) to find all instances of a report named Product Sales Analysis:

Get-PowerBIReport.jpg
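In text form, that sort of script is roughly the following (a minimal sketch; the report name is just the example used here):

# Organization scope requires the Power BI administrator role
Connect-PowerBIServiceAccount
Get-PowerBIReport -Scope Organization -Name 'Product Sales Analysis' |
    Select-Object Id, Name, WebUrl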

The key point above is that the Power BI administrator can retrieve all metadata like this, including My Workspace for other users.

This is actually great because it means a Power BI administrator can put together an inventory of the content in the tenant. If you compare this to usage data from the Office 365 unified audit log, you can do some interesting analysis and oversight of the Power BI tenant.

Manage Workspace Users

There are PowerShell cmdlets such as Add-PowerBIWorkspaceUser and Remove-PowerBIWorkspaceUser to manage the new type of workspaces (i.e., the V2 ‘new workspace experience’).

Here is an example of my Power BI administrator account providing member permissions to a colleague:

Add-PowerBIWorkspaceUser.jpg
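In text form, the call looks roughly like this (the workspace ID and e-mail address are placeholders):

# Organization scope lets an admin manage users on a workspace they have no direct permissions to
Add-PowerBIWorkspaceUser -Scope Organization -Id '<workspace-guid>' `
    -UserEmailAddress 'colleague@contoso.com' -AccessRight Member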

The interesting part of the above example is that my Power BI administrator account does *not* have any direct permissions to the workspace. However, the organization scope allows it to be done.

You know what else is interesting? That same Power BI administrator could assign permission to themselves in order to access the app workspace content. This is very important to realize because it essentially makes all data throughout the organization available to the administrator should they deem it necessary (or if they wish to do something nefarious).

 

Export Power BI Content from Workspaces

There is a PowerShell cmdlet called Export-PowerBIReport which, as the name implies, exports a PBIX from the Power BI Service. This includes the report and the underlying data. Here is an example of exporting one file:

Export-PowerBIReport.jpg

Unlike the previous two examples, Export-PowerBIReport is an activity which requires the Power BI administrator to have rights to the app workspace in order to access the content. An unauthorized (401) message is returned when a Power BI administrator tries to export a PBIX that resides within an app workspace which the administrator doesn’t have access to - this includes My Workspace for all other users.

Reminder: a Power BI administrator can grant themselves permission to this content though. That’s the loophole.
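To make the loophole concrete, the sequence would look something like this sketch (placeholders again; activities like these are exactly the kind of thing worth watching for in the audit log):

# 1. The admin grants themselves rights to the workspace using the organization scope
Add-PowerBIWorkspaceUser -Scope Organization -Id '<workspace-guid>' `
    -UserEmailAddress 'admin@contoso.com' -AccessRight Admin

# 2. The export now succeeds instead of returning a 401
Export-PowerBIReport -WorkspaceId '<workspace-guid>' -Id '<report-guid>' `
    -OutFile './ExportedReport.pbix'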

Summary

Hopefully this post saves you some time in determining how permissions apply to the different types of activities that a Power BI administrator can do. In summary:

A Power BI administrator can see all metadata throughout the tenant, regardless of their workspace permissions. This includes My Workspace metadata. Some activities do require direct app workspace permissions. However, a Power BI administrator can give themselves permission to any app workspace -- so that should be tracked with logging.

 
 

The scripts shown above are very simplified and not ready for actual production use; I kept them simple since the focus of this post is on permissions.

Also, keep in mind that you do *not* have to be a designated Power BI administrator to use the Power BI PowerShell cmdlets — any user can run them against their own content. However, you *do* need to be a Power BI administrator in order to set the scope to organization (for those cmdlets which support it).
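For example, the same cmdlet behaves differently depending on the scope (a quick sketch):

# Any user: returns only the workspaces the signed-in user belongs to
Get-PowerBIWorkspace

# Power BI administrator only: returns every workspace in the tenant
Get-PowerBIWorkspace -Scope Organization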

To find additional information:

You Might Also Like

Terminology Check - What is a Power BI App?

Lesson Learned - Keep PowerShell Modules Consistent and Up To Date

Checklist for Finalizing a Model in Power BI Desktop