Tips for Using Azure Data Catalog

It seems there's a lot of interest in Azure Data Catalog from the customers that I work with. Since I've been discussing it a lot recently during projects, I thought I'd share a few thoughts and suggestions.

Register only Production data sources. Typically you won't want to register Development or UAT data sources. That could lead to users seeing what appear to be duplicates, and it could also lead to inconsistent metadata between sources if someone adds an annotation to, say, a table in UAT but not in Production.

 
 

Register only data sources that users interact with. Usually the first priority is to register data sources that the users see: for instance, the reporting database or DW that you want users to go to rather than the original source data. Depending on how you want to use the data catalog, you might also want to register the original source. In that case you probably want to hide it from business users so it's not confusing. Which leads me to the next tip...

Use security capabilities to hide unnecessary sources. The Standard (paid) version will allow you to have some sources registered but only discoverable by certain users & hidden from other users (i.e., asset level authorization). This is great for sensitive data like HR. It's also useful for situations when, say, IT wants to document certain data sources that business users don't access directly.

Use the business glossary to ensure consistency of tags. The business glossary is a capability of the Standard (paid) version and is absolutely worth looking into. By creating standardized tags in the business glossary, you'll minimize issues with tag inconsistency that would be annoying. For example, the business glossary would contain just one of these variations: "Sales & Marketing", "Sales and Marketing", or "Sales + Marketing".
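To make the tag-consistency point concrete, here is a minimal Python sketch of the idea behind a glossary: several free-form spellings collapse to one approved term. This is only an illustration and does not use Azure Data Catalog itself; the glossary contents and the names GLOSSARY_TERMS and to_glossary_term are hypothetical.

```python
import re

# Hypothetical glossary: one approved spelling per concept.
GLOSSARY_TERMS = ["Sales & Marketing", "Human Resources", "Finance"]


def _key(tag):
    """Reduce a tag to a comparison key: lowercase, '&' and '+' read as 'and'."""
    text = tag.lower().replace("&", " and ").replace("+", " and ")
    return " ".join(re.findall(r"[a-z0-9]+", text))


# Lookup table from comparison key to the approved glossary term.
_CANONICAL = {_key(term): term for term in GLOSSARY_TERMS}


def to_glossary_term(tag):
    """Return the approved glossary term for a free-form tag, or None if unknown."""
    return _CANONICAL.get(_key(tag))
```

With this in place, "Sales and Marketing" and "Sales + Marketing" both map to the single approved term "Sales & Marketing", while an unknown tag returns None and can be flagged for review.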

Check the sources in the "Create Manual Entry" area of Publish if you're not seeing what you're looking for. There are a few more options available in Manual Entry than in the click-once application.

Use the pinning & save search functionality to save time. For data sources you refer to often, you can pin the asset or save search criteria. This will display them on the home page at AzureDataCatalog.com so they're quicker to access the next time.

Use the Preview & Profile when registering data when possible. The preview of data (i.e., first X rows) and profile of data (ex: a particular column has 720 unique values that range from A110 to M270) are both extremely useful when evaluating if a data source contains what the user really wants. So, unless the data is highly sensitive, go ahead and use this functionality whenever you can.

 
 

Be a little careful with friendly names for tables. When you set a friendly name, that becomes the primary thing a user sees. If it's very different from the original name, it could be more confusing than helpful because users will be working with the primary name over in tools such as Power BI.

 
 

Define use of "expert" so expectations are clear. A subject matter expert can be assigned to data sources and/or individual objects. In some organizations it might imply owner of the data; however, in the Standard (paid) version there is a separate option to take over ownership. Usually the expert(s) assigned indicates who knows the most about the data & who should be contacted with questions. 

Be prepared for this to potentially be a culture change. First, it's a culture change for DBAs who are responsible for securing data. The data catalog may absolutely expose the existence of a data source that a user didn't know about; however, remember that it only exposes metadata and the original data security is still in force. The other culture change affects the subject matter experts who know the data inside and out. These folks may not be used to documenting and sharing what they know about the data.

You Might Also Like...

Overview of Azure Data Catalog in the Cortana Intelligence Suite (check the "Things to Know about Azure Data Catalog" section towards the bottom of that post for more tips)

How to Create a Demo/Test Environment for Azure Data Catalog 

How to Create a Demo/Test Environment for Azure Data Catalog

Azure Data Catalog is a Software as a Service (SaaS) offering in Azure, part of the Cortana Intelligence Suite, for registering metadata about data sources. Check this post for an overview of Azure Data Catalog key features. (I'm a big fan of what Azure Data Catalog is trying to accomplish.)

There are a couple of particulars about Azure Data Catalog which make it a bit more difficult to set up a Demo/Test/Learning type of environment, including:

  • You are required to sign into Azure Data Catalog with an organizational account. Signing in with a Microsoft account (formerly known as a Live account) won't work for Azure Data Catalog authentication, even if that's what you normally use for Azure.
  • Only one Azure Data Catalog may be created per organization. Note this is *not* per Azure subscription - if your account has access to multiple subscriptions, it's still one catalog per organization.

These restrictions exist because Azure Data Catalog is intended to be the single, enterprise-wide system of registry for enterprise data sources.

Summary: Creating a Demo/Test Environment for Azure Data Catalog

Because of the above two restrictions, we need to create a Demo/Test/Learning sort of environment in a particular way. For the remainder of this post, the objective is to create a Data Catalog outside of your normal organizational Azure environment - i.e., associated to an MSDN account for instance. 

With some very helpful advice from Matthew Roche (from the Azure Data Catalog product team at Microsoft), the best method currently to create a Data Catalog test environment is as follows:

  1. Sign into the Azure portal with a Microsoft account (not with your organizational account). You should be the administrator of this subscription. For instance, my subscription is associated with my MSDN account.
  2. In your Azure Active Directory (AAD), create a new AAD account. This cannot be associated to a Microsoft account; it needs to be a native AAD account. A native AAD account is recognized by the Data Catalog service as an organizational account. 
  3. Allow this new AAD account to be co-administrator of the subscription. This will permit the AAD account to provision the new Azure Data Catalog service.
  4. Go to the Azure Data Catalog portal at www.azuredatacatalog.com and sign in with the new AAD account. Provision a new Azure Data Catalog from here. You will continue to do all of the work in Azure Data Catalog with this separate AAD ID (and additional AAD IDs if desired).

The objective of this is to leave the Azure Data Catalog in your 'real' organizational Azure subscription free of test or temporary use data sources - i.e., you wouldn't want users in your environment to discover something like an AdventureWorks sample database in the catalog (loophole: if you are paying for the standard version of Azure Data Catalog, rather than the free version, you do get security capabilities and could restrict a data source to just yourself so others can't find it).

Sidenote: One additional important thing to be aware of with Azure Data Catalog is that a data source may be registered in the catalog only once. This is to prevent duplicates which could be really confusing to users of the system.
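As a rough illustration of the "registered only once" rule, the following Python sketch keeps a single entry per data source in a toy in-memory registry. It is not the Azure Data Catalog API. The class name, the server and database names, and the choice to identify a source by a case-insensitive server/database/object triple are all assumptions made for the example, and the post does not say whether the real service rejects or merges a repeat registration; this sketch simply reuses the existing entry.

```python
class CatalogRegistry:
    """Toy in-memory registry that keeps one entry per data source."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _identity(server, database, obj):
        # Assumption: a source is identified by server, database, and object
        # name, compared case-insensitively.
        return tuple(part.strip().lower() for part in (server, database, obj))

    def register(self, server, database, obj):
        """Return (entry, created). A repeat registration reuses the entry."""
        key = self._identity(server, database, obj)
        created = key not in self._entries
        entry = self._entries.setdefault(key, {"name": obj, "tags": set()})
        return entry, created

    def __len__(self):
        return len(self._entries)


# Hypothetical source names, registered twice with different casing.
registry = CatalogRegistry()
first, created_first = registry.register("sqlprod01", "SalesDW", "dbo.FactSales")
again, created_again = registry.register("SQLPROD01", "salesdw", "dbo.factsales")
```

Because both calls resolve to the same identity, the registry still holds one entry afterwards, so annotations added by different people land on the same asset instead of being split across duplicates.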

Below are further details about how to make this approach work.

Details: Creating a Demo/Test Environment for Azure Data Catalog

Step 1: Sign into Azure portal for which you are an administrator.

First, sign into the Azure portal with your Microsoft account (such as user@outlook.com). As of the time of this writing (April 2016), Azure Active Directory is still managed in the old portal and is not yet available in the new portal. The old portal can be found at https://manage.windowsazure.com/

For our demo/test purposes, this should not be your organizational account (such as user@companyname.com). And of course, you need administrative privileges for the Azure subscription.

Step 2: Create a Native Azure Active Directory Account.

Go to the Active Directory menu, then select your default directory:

On the Users page, select Add User at the bottom:

Create a new user with the name you prefer:

 
 
 

Be sure to jot down the temporary password assigned by Azure.

At this point, you should see your new user on the AAD Users page. The key to making this all work is that the account is sourced from 'Microsoft Azure Active Directory' and is *not* a Microsoft Account. (An account sourced from your organization's Active Directory works too...but we're trying to create a demo outside of the organizational Azure tenant.)

Next let's reset that temporary password now. 

Open up an InPrivate or Incognito browser window and go to https://login.microsoftonline.com/. By using InPrivate or Incognito, the login screen will reliably accept any type of account (otherwise it makes assumptions based on the type of account you're logged onto your machine with currently). You'll want to either use a different browser, or close the Azure portal, before this step so that it doesn't sign you in with the account you're logged into Azure with.

Sign in with the new AAD account we just created. When prompted, put in the current temporary password and reset to a new password. Close this browser window when finished resetting the password.


Step 3: Provide Co-Administrator Permissions to the New AAD Account.

Go back to the Azure portal (we're still using the old portal at https://manage.windowsazure.com/ since this functionality isn't yet exposed in the new portal) and sign in with your Microsoft account again (just like step 1). Go to the Settings menu, then the Administrators page, then click Add:


Input the e-mail address of your new AAD user. You'll see a green check when it's validated.


At this point you should see your native AAD account listed on the Administrators page. Now we know that account has permission to create the Azure Data Catalog service. (I can be more liberal with this sort of setting because the Azure tenant I'm working in only contains demo data, not any real data.)


Go ahead and close the Azure browser window as we are finished with the Azure portal.

Step 4: Provision a New Azure Data Catalog from the Data Catalog Portal.

Now it's time to provision the Azure Data Catalog using our AAD account.

Launch a new browser window using InPrivate or Incognito (this will ensure you'll be able to seamlessly sign in with your AAD account) and go to the Azure Data Catalog portal at https://azuredatacatalog.com. When prompted, sign in with your AAD account. 

If everything with the AAD account is set up correctly, you should see a page which prompts you to create a new Azure Data Catalog:


Tip: Remember you can only have one catalog per organization, so be sure to give it a pretty broad name.

You can go ahead and add any other users and administrators for the Catalog as appropriate, provided they are not personal Microsoft accounts. 

Troubleshooting Access to Azure Data Catalog

This account does not have permission to access Azure Data Catalog

When signing into the Azure Data Catalog portal, the message "This account does not have permission to access Azure Data Catalog" is generated when you have signed in with an appropriate kind of account, and a catalog already exists somewhere, but your account doesn't have permission to access it. 


To find out more about the existing data catalog, first check whether you see any catalogs in the new Azure portal at https://portal.azure.com/. If no catalogs are listed, the catalog resides in a subscription which you don't have permission to see in the Azure portal. Since there's one catalog per organization, a subscription you cannot see may well be where the catalog resides.


To get more information, try to create a new catalog. You'll see the message "Only one Azure Data Catalog is supported per organization. A Catalog has already been created for your organization. You cannot add additional catalogs." Click the link under that message to "Access existing Azure Data Catalog."


Under User, you should be able to see the name of the catalog, which may give you a hint as to which subscription it resides in. In any case, if this happens, talk to your Azure service administrator to determine whether the existing catalog is really what and where you want it to be.


You've logged in using a personal Microsoft account

When signing into the Azure Data Catalog portal, the message "You've logged in using a personal Microsoft account" is generated when you're not using an organizational account recognized by the Data Catalog service. Here's where you want to refer to the instructions above in this post to create a native Azure Active Directory (AAD) account to use for logging into Azure Data Catalog.
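The sign-in situations covered in this post can be summarized as a small decision function. This Python sketch only restates the behavior described above; it does not call any Azure service. The function name, the parameter names, and the "Catalog home page" label are my own wording, not anything the portal exposes.

```python
def sign_in_outcome(account_kind, catalog_exists, has_catalog_access):
    """Map a sign-in situation to what the portal shows, per this post.

    account_kind is "personal" for a Microsoft account, anything else for an
    organizational (AAD) account. All three parameters are hypothetical names.
    """
    if account_kind == "personal":
        return "You've logged in using a personal Microsoft account"
    if not catalog_exists:
        return "Prompt to create a new Azure Data Catalog"
    if not has_catalog_access:
        return "This account does not have permission to access Azure Data Catalog"
    return "Catalog home page"
```

Reading it top to bottom gives the troubleshooting order: first confirm the account type, then whether a catalog already exists for the organization, and only then whether your account has been granted access to it.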


Overview of Azure Data Catalog in the Cortana Analytics Suite

Azure Data Catalog is one of the components of the Cortana Analytics Suite (now renamed to Cortana Intelligence Suite).  This post is as of September 2015; at this time the Azure Data Catalog is still in public preview so we can expect many changes coming soon.

Check here for a brief video tour: Tour of Azure Data Catalog.

If you saw the data catalog that was part of V1 Power BI (for Office 365), then you are familiar with the first iteration of this tool. Customer feedback was good, but customers didn't want to go through the trouble of registering data sources for use with just one application. That's the motivation for pulling it out of being a Power BI feature and into being a full-fledged element of the Cortana Analytics Suite.

The Azure Data Catalog is two things:

  • An enterprise-wide catalog in Azure that enables self-service discovery of data from any source (on-premises or cloud, Microsoft or non-Microsoft, structured or unstructured)
  • A metadata repository that allows users to register, annotate, discover, understand, and consume data sources

I'm very excited to have a metadata repository like this which can save people time, help find the info they need, share what the data means as well as issues and advice, and potentially decrease duplication of effort for things which already exist. Check out this Azure Documentation page for some very useful scenarios and use cases for Azure Data Catalog:  https://azure.microsoft.com/en-us/documentation/articles/data-catalog-common-scenarios/.
    
The primary activities in the Azure Data Catalog are Publish (aka Register), Discover, and Annotate. The publishing process currently uses a click-once app in a web browser, and the discovery and annotation process is done via a web page (unless you prefer to use the open APIs).

Publishing / Registering Data Sources in Azure Data Catalog

When a user registers a data source, the catalog extracts the connection string and the metadata for column names and data types. It will also extract descriptions / extended properties if present in the source. Optionally, the person handling the data source registration can choose to show a preview of the data (up to 20 records), and/or a profile of the data. Other than the optional 20-record preview, none of the actual data contents are moved to Azure - it's metadata only.
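To give a feel for what "metadata only, plus an optional preview" means, here is a small Python sketch that collects column names and declared types from a table and, if asked, the first 20 rows. It uses an in-memory SQLite database from the standard library purely as a stand-in source; it is not how the Azure Data Catalog registration tool is implemented, and the table, the function name describe_table, and the sample data are hypothetical.

```python
import sqlite3

PREVIEW_ROWS = 20  # the post mentions a preview of up to 20 records


def describe_table(conn, table, include_preview=False):
    """Collect column names and declared types, plus an optional row preview."""
    # Quote the identifier so unusual table names are handled safely.
    quoted = '"' + table.replace('"', '""') + '"'
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [
        {"name": row[1], "type": row[2]}
        for row in conn.execute(f"PRAGMA table_info({quoted})")
    ]
    metadata = {"table": table, "columns": columns}
    if include_preview:
        cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (PREVIEW_ROWS,))
        metadata["preview"] = cursor.fetchall()
    return metadata


# Stand-in data source with 30 rows of made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(30)],
)
meta = describe_table(conn, "customer", include_preview=True)
```

Without include_preview, the result carries only names and types, which mirrors the point that no data contents leave the source unless the preview is explicitly requested.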

The above screen shot shows the data sources supported currently in the public preview. Due to customer feedback, the development team started with on-premises SQL Server (relational) and Analysis Services (both multidimensional and tabular). It's also very interesting that Reporting Services reports can be cataloged here as well.

Lots more data sources will be coming soon - their aim is to be able to register all enterprise data sources after all. The list of supported sources can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-frequently-asked-questions/.

Discovering Data Sources in Azure Data Catalog

When looking for a data source that has been registered, users can search by term, tag, object type, source type, and/or an expert assigned as having knowledge of the source. (This expert can be a person or perhaps a support group.)
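The search facets listed above (term, tag, object type, source type, expert) can be pictured as filters that are combined with AND. The Python sketch below applies them to an in-memory list; it is a conceptual illustration, not the catalog's search API, and the Asset class, field names, and sample assets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    object_type: str
    source_type: str
    tags: set = field(default_factory=set)
    experts: set = field(default_factory=set)
    description: str = ""


def search(assets, term=None, tag=None, object_type=None, source_type=None, expert=None):
    """Return the assets matching every facet that was supplied."""
    def matches(asset):
        text = (asset.name + " " + asset.description).lower()
        if term and term.lower() not in text:
            return False
        if tag and tag not in asset.tags:
            return False
        if object_type and asset.object_type != object_type:
            return False
        if source_type and asset.source_type != source_type:
            return False
        if expert and expert not in asset.experts:
            return False
        return True

    return [asset for asset in assets if matches(asset)]


# Hypothetical catalog contents.
assets = [
    Asset("FactSales", "Table", "SQL Server", {"Sales"}, {"BI Support"}, "Daily sales facts"),
    Asset("DimCustomer", "Table", "SQL Server", {"Sales", "PII"}, {"Dana"}),
    Asset("Sales Cube", "Model", "Analysis Services", {"Sales"}, {"BI Support"}),
]
```

For example, search(assets, tag="Sales", expert="BI Support") narrows the list to the assets that carry both the tag and the expert, which is the kind of combined filtering the search pane offers.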

The web interface includes nice functionality to select multiple items on a page and assign tags, for instance, to them all at once.

If the "Include Preview" checkbox was selected when the data source was initially registered, this is what the Preview pane looks like:

Note that individual columns can possess their own tags and descriptions for search ability (in addition to the tags and descriptions at the database & table levels). This is what the Columns pane looks like:

If the "Include Data Profile" checkbox was selected when the data source was initially registered, then table and column profiling is done with respect to number of rows, number of distinct values, min & max values, number of nulls, etc. Following is what the Data Profile pane looks like:
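A data profile of this kind boils down to a handful of summary statistics per column. Here is a minimal Python sketch of that calculation over a plain list of values; it is only meant to show what the numbers mean, not how Azure Data Catalog computes them. The function name profile_column and the sample values are hypothetical.

```python
def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, min and max."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Made-up column of product codes, including one null and one repeat.
codes = ["A110", "M270", "B205", None, "A110"]
summary = profile_column(codes)
```

For the sample column, the summary reports 5 rows, 1 null, 3 distinct values, and a range from "A110" to "M270", which is the sort of at-a-glance information that helps a user judge whether a source holds what they need.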

Annotating Data Sources in Azure Data Catalog

Users are encouraged to make annotations about usefulness, column meanings, friendly names, etc. The development team refers to this as a crowdsourcing approach because anyone can contribute useful information that may be of great assistance & time savings to colleagues.

Tags can also be used very effectively. For example, I saw a demo recently where an e-mail address column was annotated with a PII tag to alert users to use caution when distributing personally identifiable information.
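If you wanted a head start on PII tagging, one simple approach is to flag columns whose names hint at personal data and then have a person confirm each suggestion. The Python sketch below shows that idea; it is not a feature of Azure Data Catalog, and the name fragments in PII_HINTS and the sample column names are assumptions chosen for illustration.

```python
import re

# Hypothetical name fragments that often indicate personal data.
PII_HINTS = re.compile(r"e[-_]?mail|phone|ssn|birth", re.IGNORECASE)


def suggest_pii_columns(column_names):
    """Return the column names that look like candidates for a PII tag."""
    return [name for name in column_names if PII_HINTS.search(name)]


candidates = suggest_pii_columns(["CustomerId", "EmailAddress", "E_Mail", "OrderTotal"])
```

Name matching like this will miss some columns and flag others wrongly, so it works best as a prompt for the experts who know the data rather than as an automatic rule.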

If users of the Azure Data Catalog make the time investment to add rich information related to data sources, then this type of metadata tool can be extremely helpful to self-service users who are searching for the correct data to use. 


Things to Know about Azure Data Catalog

The web portal interface to Azure Data Catalog is located at http://azuredatacatalog.com. However, there are also open APIs if you would rather integrate the publishing, discovering, and annotation activities with a custom application.

The Azure Data Catalog permits a data source to be registered only once. This was a purposeful design decision to avoid duplicates. Visibility to a select number of objects (ex: views for particular sets of users) can be set with security (in the Standard version only, not the free version).

Currently the system allows only a single Data Catalog per organization; having access to multiple Azure subscriptions does not change that. The design team has envisioned the Azure Data Catalog as being enterprise-level, so permitting departmental use would diminish the value. It'll be interesting to see over time how the subscription model tends to align within decentralized customer organizations.

Azure Active Directory integration is required. You *cannot* use a Microsoft account (ex: user@outlook.com) with Azure Data Catalog. 

The default for a new data source is for its metadata (and data preview, if selected) to be available to everyone. Visibility can be set to specific users and groups (Standard version only, not the free version).
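The visibility rule described here can be sketched as a simple filter: an asset with no restriction is discoverable by everyone, and a restricted asset is discoverable only by the listed users or groups. This Python example is a conceptual model, not the Standard edition's actual authorization mechanism; the "visible_to" field, the function name, and the sample users and groups are hypothetical.

```python
def visible_assets(assets, user, groups):
    """Return names of assets the user may discover.

    An asset with no 'visible_to' entry is treated as visible to everyone,
    mirroring the default described above.
    """
    principals = {user, *groups}
    return [
        asset["name"]
        for asset in assets
        if not asset.get("visible_to") or principals & set(asset["visible_to"])
    ]


# Hypothetical catalog contents.
catalog = [
    {"name": "FactSales"},
    {"name": "HR.Salaries", "visible_to": ["HR Analysts"]},
    {"name": "Staging.Orders", "visible_to": ["it.admin@example.com"]},
]
```

A business user outside the listed groups would see only FactSales, while a member of "HR Analysts" would also see HR.Salaries, which matches the HR and IT-only scenarios mentioned in the tips.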

There is a free version, and a paid version that is referred to as the Standard version. The free version shows all registered data sources to all users - if you need to restrict visibility by users & groups, that requires the Standard version. The free version allows up to a max of 50 users, whereas the Standard version is unlimited and is priced (as of Sept 2015) at $50 per month per 100 users. Pricing details are here: https://azure.microsoft.com/en-us/pricing/details/data-catalog/.
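For a quick budget estimate, the Sept 2015 figures above translate into simple arithmetic. The Python sketch below assumes that a partial block of users is billed as a whole block of 100, which the post does not state; treat that rounding rule, along with the constant and function names, as assumptions and check the pricing page for current terms.

```python
import math

FREE_USER_LIMIT = 50           # free version: up to 50 users
STANDARD_PRICE_PER_BLOCK = 50  # USD per month, per the Sept 2015 figure
STANDARD_BLOCK_SIZE = 100      # users covered by each block


def fits_free_version(users):
    """True if the user count is within the free version's limit."""
    return users <= FREE_USER_LIMIT


def standard_monthly_cost(users):
    """Estimated monthly cost in USD for the Standard version."""
    # Assumption: partial blocks are billed as whole blocks of 100 users.
    blocks = math.ceil(users / STANDARD_BLOCK_SIZE)
    return blocks * STANDARD_PRICE_PER_BLOCK
```

Under that assumption, 250 users would need three blocks, for an estimate of $150 per month, while a team of 40 could stay on the free version if it doesn't need restricted visibility.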

If a user doesn't have permission to access a data source, the Standard version (not the free version) allows that user to submit a request to gain access to that particular data.

Anyone can try to register a data source. However, for it to be successful, the person registering needs to be able to read the schema for the underlying data source (i.e., read definition permission). If the checkbox to show a preview is selected, the person registering also needs select permissions on the underlying data source.
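The permission requirements above can be written down as a small pre-flight check to run before attempting a registration. This Python sketch only encodes the two rules stated in this post (read definition always, select when a preview is requested); the permission labels are informal names rather than exact database permission keywords, and the function names are hypothetical.

```python
def required_permissions(include_preview=False):
    """Permissions the registering person needs on the underlying source."""
    needed = ["read definition"]
    if include_preview:
        needed.append("select")
    return needed


def missing_permissions(granted, include_preview=False):
    """Return the required permissions that are not in the granted set."""
    return [p for p in required_permissions(include_preview) if p not in granted]
```

For instance, someone who can read the schema but not query the data could still register a source without a preview, and missing_permissions({"read definition"}, include_preview=True) would point out that select is the gap.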

You Might Also Like...

What is the Cortana Analytics Suite?