Overview of Azure Data Catalog in the Cortana Analytics Suite

Azure Data Catalog is one of the components of the Cortana Analytics Suite (now renamed to Cortana Intelligence Suite).  This post is as of September 2015; at this time the Azure Data Catalog is still in public preview so we can expect many changes coming soon.

Check here for a brief video tour: Tour of Azure Data Catalog.

If you saw the data catalog that was part of V1 Power BI (for Office 365), then you are familiar with the first iteration of this tool. Customer feedback was good, but that they didn't want to go through the trouble of registering data sources for use with just one application. So that's the motivation for pulling it out of being a Power BI feature and into being a full-fledged element of the Cortana Analytics Suite.

The Azure Data Catalog is two things:

  • Enterprise-wide catalog in Azure that enables self-service discovery of data from any source (on prem or cloud, Microsoft or non-Microsoft, structured or non-structured)
  • A metadata repository that allows users to register, annotate, discover, understand, and consume data sources

I'm very excited to have a metadata repository like this which can save people time, help find the info they need, share what the data means as well as issues and advice, and potentially decrease duplication of effort for things which already exist. Check out this Azure Documentation page for some very useful scenarios and use cases for Azure Data Catalog:  https://azure.microsoft.com/en-us/documentation/articles/data-catalog-common-scenarios/.
    
The primary activities in the Azure Data Catalog: Publish (aka Register), Discover, and Annotate. The publishing process currently uses a click-once app in a web browser, and the discovery and annotation process is done via a web page (unless you prefer to use the open APIs).

Publishing / Registering Data Sources in Azure Data Catalog

When a user registers a data source, the catalog extracts out the connection string and metadata for column names and data types.  It also will extract descriptions / extended properties if present in the source. Optionally, the person handling the data source registration can choose to show a preview of the data (up to 20 records), and/or a profile of the data. Other than the optional 20-record preview, none of the actual data contents are moved to Azure - it's metadata only.

The above screen shot shows the data sources supported currently in the public preview. Due to customer feedback, the development team started with on-premises SQL Server (relational) and Analysis Services (both multidimensional and tabular). It's also very interesting that Reporting Services reports can be cataloged here as well.

Lots more data sources will be coming soon - their aim is to be able to register all enterprise data sources after all. The list of supported sources can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-frequently-asked-questions/.

Discovering Data Sources in Azure Data Catalog

When looking for a data source that has been registered, users can search by term, tag, object type, source type, and/or an expert assigned as having knowledge of the source. (This expert can be a person or perhaps a support group.)

The web interface includes nice functionality to select multiple items on a page and assign tags, for instance, to them all at once.

If the "Include Preview" checkbox was selected when the data source was initially registered, this is what the Preview pane looks like:

Note that individual columns can possess their own tags and descriptions for search ability (in addition to the tags and descriptions at the database & table levels). This is what the Columns pane looks like:

If the "Include Data Profile" checkbox was selected when the data source was initially registered, then table and column profiling is done with respect to number of rows, number of distinct values, min & max values, number of nulls, etc. Following is what the Data Profile pane looks like:

Annotating Data Sources in Azure Data Catalog

Users are encouraged to make annotations about usefulness, column meanings, friendly names, etc. The development team refers to this as a crowdsourcing approach because anyone can contribute useful information that may be of great assistance & time savings to colleagues.

Tags can also be used very effectively. For example, I saw a demo recently where an e-mail address column was annotated with a PII tag to alert users to use caution when distributing personally identifiable information.

If users of the Azure Data Catalog make the time investment to add rich information related to data sources, then this type of metadata tool can be extremely helpful to self-service users who are searching for the correct data to use. 


Things to Know about Azure Data Catalog

There is a web portal interface to Azure Data Catalog is located at http://azuredatacatalog.com. However, there are also open APIs as well if you would rather integrate the publishing, discovering, and annotation activities with a custom application. 

The Azure Data Catalog permits a data source to be registered only once. This was a purposeful design decision to avoid duplicates. Visibility to a select number of objects (ex: views for particular sets of users) can be set with security (in the Standard version only, not the free version).

Currently the system allows only a single Data Catalog per Azure subscription. The design team has envisioned the Azure Data Catalog as being enterprise-level, so permitting departmental use would diminish the value. It'll be interesting to see over time how the subscription model tends to align within decentralized customer organizations.

Azure Active Directory integration is required. You *cannot* use a Microsoft account (ex: user@outlook.com) with Azure Data Catalog. 

The default for a new data source is for its metadata (and data preview, if selected) to be available to everyone. Visibility can be set to specific users and groups (Standard version only, not the free version).

There is a free version, and a paid version that is referred to as the Standard version. The free version shows all registered data sources to all users - if you need to restrict visibility by users & groups, that requires the Standard version. The free version allows up to a max of 50 users, whereas the Standard version is unlimited and is priced (as of Sept 2015) at $50/per month/per 100 users. Pricing details are here:  https://azure.microsoft.com/en-us/pricing/details/data-catalog/.

If a user doesn't have permission to access a data source, the Standard version (not free version) allows you to submit a request to gain access to that particular data.

Anyone can try to register a data source. However, for it to be successful, the person registering needs to be able to read the schema for the underlying data source (i.e., read definition permission). If the checkbox to show a preview is selected, the person registering also needs select permissions on the underlying data source.

You MIght Also Like...

What is the Cortana Analytics Suite?