Overview of Azure Data Catalog in the Cortana Analytics Suite

Azure Data Catalog is one of the components of the Cortana Analytics Suite (now renamed to Cortana Intelligence Suite).  This post is as of September 2015; at this time the Azure Data Catalog is still in public preview so we can expect many changes coming soon.

Check here for a brief video tour: Tour of Azure Data Catalog.

If you saw the data catalog that was part of V1 Power BI (for Office 365), then you are familiar with the first iteration of this tool. Customer feedback was good, but customers didn't want to go through the trouble of registering data sources for use with just one application. That's the motivation for pulling it out of Power BI and making it a full-fledged element of the Cortana Analytics Suite.

The Azure Data Catalog is two things:

  • An enterprise-wide catalog in Azure that enables self-service discovery of data from any source (on-premises or cloud, Microsoft or non-Microsoft, structured or unstructured)
  • A metadata repository that allows users to register, annotate, discover, understand, and consume data sources

I'm very excited to have a metadata repository like this which can save people time, help find the info they need, share what the data means as well as issues and advice, and potentially decrease duplication of effort for things which already exist. Check out this Azure Documentation page for some very useful scenarios and use cases for Azure Data Catalog:  https://azure.microsoft.com/en-us/documentation/articles/data-catalog-common-scenarios/.
    
The primary activities in the Azure Data Catalog are Publish (aka Register), Discover, and Annotate. The publishing process currently uses a ClickOnce app launched from a web browser, and discovery and annotation are done via a web page (unless you prefer to use the open APIs).

Publishing / Registering Data Sources in Azure Data Catalog

When a user registers a data source, the catalog extracts the connection string and the metadata for column names and data types. It also extracts descriptions / extended properties if present in the source. Optionally, the person handling the data source registration can choose to show a preview of the data (up to 20 records), and/or a profile of the data. Other than the optional 20-record preview, none of the actual data contents are moved to Azure - it's metadata only.
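To make the "metadata only" point concrete, here is a minimal Python sketch of what a registration record could hold. The class and function names (ColumnInfo, RegisteredSource, register) are my own illustration and not the actual Azure Data Catalog schema or API; only the 20-record preview cap comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

PREVIEW_LIMIT = 20  # the optional preview holds up to 20 records


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    description: str = ""  # from descriptions / extended properties, if present


@dataclass
class RegisteredSource:
    # Hypothetical record shape, for illustration only.
    connection_string: str
    columns: List[ColumnInfo]
    preview: List[Dict[str, Any]] = field(default_factory=list)


def register(
    connection_string: str,
    columns: List[ColumnInfo],
    rows: Optional[List[Dict[str, Any]]] = None,
    include_preview: bool = False,
) -> RegisteredSource:
    """Build a metadata-only record; rows are copied only for the preview."""
    preview = list(rows[:PREVIEW_LIMIT]) if include_preview and rows else []
    return RegisteredSource(connection_string, columns, preview)
```

With include_preview left at its default, no rows are stored at all, which mirrors the idea that only metadata leaves the source system.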

The above screen shot shows the data sources supported currently in the public preview. Due to customer feedback, the development team started with on-premises SQL Server (relational) and Analysis Services (both multidimensional and tabular). It's also very interesting that Reporting Services reports can be cataloged here as well.

Many more data sources will be coming soon; the team's aim, after all, is to be able to register all enterprise data sources. The list of supported sources can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-frequently-asked-questions/.

Discovering Data Sources in Azure Data Catalog

When looking for a data source that has been registered, users can search by term, tag, object type, source type, and/or an expert assigned as having knowledge of the source. (This expert can be a person or perhaps a support group.)
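The search facets listed above can be pictured as simple filters that are combined with AND. The following Python sketch is only an in-memory illustration under that assumption; the Asset class and the search function are hypothetical names, and the real portal's matching rules may well differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Asset:
    # Hypothetical shape of a registered asset, for illustration only.
    name: str
    object_type: str   # e.g. "Table" or "Report"
    source_type: str   # e.g. "SQL Server"
    tags: Set[str] = field(default_factory=set)
    experts: Set[str] = field(default_factory=set)  # a person or a support group


def search(
    assets: List[Asset],
    term: Optional[str] = None,
    tag: Optional[str] = None,
    object_type: Optional[str] = None,
    source_type: Optional[str] = None,
    expert: Optional[str] = None,
) -> List[Asset]:
    """Return assets that match every supplied filter (None means 'ignore')."""
    results = []
    for asset in assets:
        if term is not None and term.lower() not in asset.name.lower():
            continue
        if tag is not None and tag not in asset.tags:
            continue
        if object_type is not None and asset.object_type != object_type:
            continue
        if source_type is not None and asset.source_type != source_type:
            continue
        if expert is not None and expert not in asset.experts:
            continue
        results.append(asset)
    return results
```

For example, search(assets, tag="PII", source_type="SQL Server") would return only the SQL Server assets that carry a PII tag.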

The web interface includes nice functionality to select multiple items on a page and assign tags, for instance, to them all at once.

If the "Include Preview" checkbox was selected when the data source was initially registered, this is what the Preview pane looks like:

Note that individual columns can possess their own tags and descriptions for searchability (in addition to the tags and descriptions at the database & table levels). This is what the Columns pane looks like:

If the "Include Data Profile" checkbox was selected when the data source was initially registered, then table and column profiling is done with respect to number of rows, number of distinct values, min & max values, number of nulls, etc. Following is what the Data Profile pane looks like:
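For a sense of what such a profile contains, here is a small Python sketch that computes the same kinds of statistics for a single column. It is my own illustration of the idea, not how Azure Data Catalog computes its profile; the function name profile_column is hypothetical.

```python
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Summarize one column: row count, null count, distinct count, min, max."""
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# A column with four rows, one of them null.
print(profile_column([3, 1, None, 3]))
# {'rows': 4, 'nulls': 1, 'distinct': 2, 'min': 1, 'max': 3}
```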

Annotating Data Sources in Azure Data Catalog

Users are encouraged to make annotations about usefulness, column meanings, friendly names, etc. The development team refers to this as a crowdsourcing approach because anyone can contribute useful information that may be of great assistance & time savings to colleagues.

Tags can also be used very effectively. For example, I saw a demo recently where an e-mail address column was annotated with a PII tag to alert users to use caution when distributing personally identifiable information.

If users of the Azure Data Catalog make the time investment to add rich information related to data sources, then this type of metadata tool can be extremely helpful to self-service users who are searching for the correct data to use. 


Things to Know about Azure Data Catalog

The web portal interface to Azure Data Catalog is located at http://azuredatacatalog.com. There are also open APIs if you would rather integrate the publishing, discovery, and annotation activities with a custom application.

The Azure Data Catalog permits a data source to be registered only once. This was a purposeful design decision to avoid duplicates. Visibility to a select number of objects (ex: views for particular sets of users) can be set with security (in the Standard version only, not the free version).

Currently the system allows only a single Data Catalog per Azure subscription. The design team has envisioned the Azure Data Catalog as being enterprise-level, so permitting departmental use would diminish the value. It'll be interesting to see over time how the subscription model tends to align within decentralized customer organizations.

Azure Active Directory integration is required. You *cannot* use a Microsoft account (ex: user@outlook.com) with Azure Data Catalog. 

The default for a new data source is for its metadata (and data preview, if selected) to be available to everyone. Visibility can be set to specific users and groups (Standard version only, not the free version).

There is a free version, and a paid version that is referred to as the Standard version. The free version shows all registered data sources to all users; restricting visibility by users and groups requires the Standard version. The free version allows a maximum of 50 users, whereas the Standard version is unlimited and is priced (as of Sept 2015) at $50 per month per 100 users. Pricing details are here: https://azure.microsoft.com/en-us/pricing/details/data-catalog/.
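As a quick worked example of that pricing, the Python sketch below estimates the monthly Standard cost from a user count. It uses the Sept 2015 figures quoted above, and it assumes that users are billed in whole blocks of 100 (rounded up); that rounding rule is my assumption, so check the pricing page for the actual billing behavior.

```python
import math

FREE_USER_LIMIT = 50      # free version: up to 50 users
PRICE_PER_BLOCK_USD = 50  # Standard version: $50 per month ...
USERS_PER_BLOCK = 100     # ... per 100 users (as of Sept 2015)


def fits_free_version(users: int) -> bool:
    """True if the user count is within the free version's limit."""
    return users <= FREE_USER_LIMIT


def standard_monthly_cost(users: int) -> int:
    """Monthly Standard cost in USD, assuming whole 100-user blocks rounded up."""
    if users <= 0:
        return 0
    return math.ceil(users / USERS_PER_BLOCK) * PRICE_PER_BLOCK_USD


print(standard_monthly_cost(250))  # 150 (three blocks of 100 users)
```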

If a user doesn't have permission to access a data source, the Standard version (not the free version) allows that user to submit a request to gain access to that particular data.

Anyone can try to register a data source. However, for it to be successful, the person registering needs to be able to read the schema for the underlying data source (i.e., read definition permission). If the checkbox to show a preview is selected, the person registering also needs select permissions on the underlying data source.

You Might Also Like...

What is the Cortana Analytics Suite?

What is the Cortana Intelligence Suite?

Since I’m a data nut, I’m intrigued with Microsoft’s new offering referred to as the Cortana Analytics Suite.  

Update April 2016: The suite has been renamed to be Cortana Intelligence Suite.

Update Dec. 2017: The gallery of pre-configured solutions has been renamed to the Azure AI Gallery.

First things first, the suite is not a product in and of itself, though it will have its own pricing. The suite can be thought of as a bundle of integrated products and services. It’s somewhat similar to the idea of the Office suite or the SQL Server suite, both of which contain various components that are interoperable (at least to a certain extent). I get the feeling with the Cortana Intelligence Suite that interoperability/integration will be a huge emphasis. Another big emphasis will be on the availability of templates and preconfigured solutions which should accelerate and simplify development for particular scenarios.

Since the suite isn’t officially available yet, most of what can be found right now are marketing materials – though most of the components are available individually now and have varying levels of technical documentation available. I’m excited to be attending the CAS Workshop in September in Seattle, where I’m hoping to learn a lot more about the integration points, interoperability, accelerators, and overall capabilities.

What are the Components of Cortana Intelligence Suite?

Knowing this is a bundle of tools with an emphasis on integration and automation, for the purpose of advanced analytics, what are the components of the suite? 

The documentation lists the following as elements of Cortana Intelligence Suite:

  • Azure Machine Learning
  • Azure HDInsight
  • Azure Stream Analytics
  • Azure Data Lake
  • Azure SQL Data Warehouse
  • Azure Data Catalog
  • Azure Data Factory
  • Azure Event Hub
  • Power BI
  • Cortana
  • Cognitive Services
  • Bot Framework

There are numerous other Azure components that will play an important part in data-oriented solutions as well; I’m showing some of these key components at the bottom of the image above even though they aren’t “officially” part of Cortana Intelligence Suite.

Why is Cortana in the Name of Cortana Intelligence Suite?

One of my first questions when this was announced: Why is Cortana in the name? Because really, Cortana is one small piece of a much bigger platform. It's because Cortana (originally a "smart" AI character in the Halo novels and video games) represents a high level of intelligence and the ability to learn and adapt.

Initially, the idea here is that the personal assistant, Cortana, will be able to provide information upon request or proactively. Something such as:  “Hey Cortana, what is the total of yesterday’s sales?” appears to be the next evolutionary step of the Q&A natural language capabilities first seen in Power BI. A public demo indicated that Power BI will be just one way to expose data to Cortana.

Source for image: July 2015 Webinar by Joseph Sirosh

Here’s a very interesting quote from a TechCrunch article:

“As for Cortana, which is the Microsoft voice-driven personal assistant tool in Windows 10, it’s a small part of the solution, but Sirosh says Microsoft named the suite after it because it symbolizes the contextualized intelligence that the company hopes to deliver across the entire suite.”

So, we have an extremely broad platform with Cortana Intelligence Suite. Stay tuned for my follow-up posts where we start looking at the individual components.

You Might Also Like...

Building Blocks of Cortana Intelligence Suite in Azure

Overview of Azure Data Catalog in the Cortana Analytics Suite

Deploying Solutions from the Azure AI Gallery