Deploying Solutions from the Azure AI Gallery

The Azure AI Gallery (formerly known as the Cortana Intelligence Gallery) is a site where you can search for solutions, tutorials, experiments, and training for learning the data & analytics tools within Azure. The Gallery can be located at:


The Azure AI Gallery is a community site. Many of the contributions are from Microsoft directly. Individual community members can make contributions to the Gallery as well.

The "Solutions" are particularly interesting. Let's say you've searched and found Data Warehousing and Modern BI on Azure:


Deploying a Solution from the Azure AI Gallery

What makes these solutions pretty appealing is the "Deploy" button. They're packaged up to deploy all (or most) of the components into your Azure environment. I admit I'd like to see some fine-tuning of this deployment process as it progresses through the public preview. Here's a quick rundown what to expect.

1|Create new deployment:


The most important thing in step 1 above is that your deployment name ends up being your resource group. The resource group is created as soon as you click the Create button (so if you change your mind on naming, you'll have to go manually delete the RG). Also note that you're only allowed 9 characters, which makes it hard to implement a good naming convention. (Have I ever mentioned how fond I am of naming conventions?!?)

Resource groups are an incredibly important concept in Azure. They are a way to logically organize related resources which (usually) have the same lifecycle and are managed together. All items within a single resource group are included in an ARM template. Resource groups can serve as a boundary for security/permissions at the RG level, and can be used to track the cost of a solution. So, it's extremely important to plan out resource group structure in your real environment. In our situation here, having all of these related resources for testing/learning purposes is perfect.

2|Provide configuration parameters:


In step 2 above, the only thing we need to specify is a user and password. This will be the server admin for both Azure SQL Database and Azure SQL Data Warehouse which are provisioned. It will use SQL authentication.

As soon as you hit the Next button, the solution is provisioning.

3|Resource provisioning (automated):

In step 3 above we see the progress. Depending on the type of resource, it may take a little while.



When provisioning is complete, as shown in step 4 above (partial screenshot), you get a list of what was created and instructions for follow-up steps. For instance, in this solution our next steps are to go and create an Azure Service Principal and then create the Azure Analysis Services model (via PowerShell script saved in an Azure runbook provided by the solution).

They also send an e-mail to confirm the deployment:


If we pop over to the Azure portal and review what was provisioned so far, we see the following:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we're starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn't prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it's not just luck.

We aren't prompted to select scale levels during deployment. From what I can tell, it chooses the lowest possible scale (I noted that the SQL Data Warehouse was provisioned with 100 DWUs, and the SQLDB had 5 DTUs).

To minimize cost, don't forget to pause what you can (such as the SQL Data Warehouse) when you're not using it. The HDInsight piece of this will be the most expensive, and it cannot be paused, so you might want to learn & experiment with that first then de-provision HDInsight in order to save on cost. If you're done with the whole solution, you can just delete the resource group (in which case all resources within it will be deleted permanently).

Referring to Documentation for Deployed Solutions

You can find each of your deployed solutions here:

From this view, you can refer back to the documentation for a solution deployment (which is the same info presented in Step 4 when it was finished provisioning).

You can also 'Clean up Deployments' which is a nice feature. The clean up operation first deletes each individual resource, then it deletes the resource group:

Data Lake Use Cases and Planning Considerations

Updated as of 6/10/2018

I've spent a lot of time this past year learning about and thinking about data lakes. Mostly I've been interested in how to integrate a data lake alongside existing investments without making a mess of things.

A data lake is one piece of an overall data management strategy. Conceptually, a data lake is nothing more than a data repository. The data lake can store any type of data. Cost and effort are reduced because the data is stored in its original native format with no structure (schema) required of it initially. 

Data lakes usually align with an "ELT" strategy which means we can Extract and Load into the data lake in its original format, then Transform later *if* a need presents itself. An "ELT" strategy goes along with the "Schema on Read" technique that is prevalent in the big data world. It's become prevalent because you're able to store the data relatively easily with less up-front time investment, then query the data "where it lives" without being required to relocate the data first (though you may want to later after a solution is productionized).

Conversely, traditional data warehousing typically follows at "ETL" strategy in which the Transformation occurs before the Load to the data warehouse - this is referred to a "Schema on Write" because the schema, i.e., structure of the data, must be defined before the data is loaded to the data warehouse. This takes time and effort to do correctly. It also means we typically need to define use cases ahead of time in order to validate the schema is correct. We can also expect the use cases and needs to evolve over time - this is a good thing, but can be challenging to implement sometimes.

The way I see it, the term data lake is mostly conceptual since a data lake can be comprised of HDFS (Hadoop Distributed File System), other various Hadoop projects, NoSQL DBs, in-memory DBs, which can coexist with relational DBs.

Use Cases For a Data Lake

There are a variety of ways you can use a data lake:

  • Ingestion of semi-structured and unstructured data sources (aka big data) such as equipment readings, telemetry data, logs, streaming data, and so forth. A data lake is a great solution for storing IoT (Internet of Things) type of data which has traditionally been more difficult to store, and can support near real-time analysis. Optionally, you can also add structured data (i.e., extracted from a relational data source) to a data lake if your objective is a single repository of all data to be available via the lake.
  • Experimental analysis of data before its value or purpose has been fully defined. Agility is important for every business these days, so a data lake can play an important role in "proof of value" type of situations because of the "ELT" approach discussed above.
  • Advanced analytics support. A data lake is useful for data scientists and analysts to provision and experiment with data.
  • Archival and historical data storage. Sometimes data is used infrequently, but does need to be available for analysis. A data lake strategy can be very valuable to support an active archive strategy.
  • Support for Lambda architecture which includes a speed layer, batch layer, and serving layer.
  • Preparation for data warehousing. Using a data lake as a staging area of a data warehouse is one way to utilize the lake, particularly if you are getting started.
  • Augment a data warehouse. A data lake may contain data that isn't easily stored in a data warehouse, or isn't queried frequently. The data lake might be accessed via federated queries which make its separation from the DW transparent to end users via a data virtualization layer.
  • Distributed processing capabilities associated with a logical data warehouse.
  • Storage of all organizational data to support downstream reporting & analysis activities. Some organizations wish to achieve a single storage repository for all types of data. Frequently, the goal is to store as much data as possible to support any type of analysis that might yield valuable findings.
  • Application support. In addition to analysis by people, a data lake can be a data source for a front-end application. The data lake might also act as a publisher for a downstream application (though ingestion of data into the data lake for purposes of analytics remains the most frequently cited use).

Planning a Data Lake

As we discussed above, a data lake reduces the up-front effort of getting data stored because we aren't absolutely required to structure it first. However, that doesn't mean there's no planning at all. There's various things you want to consider as you're planning a data lake -- to prevent it from becoming the dreaded data swamp -- including things such as:

  • Ingestion needs (push / pull via streaming or batch)
  • Security around data access
  • Data retention and archival policies
  • Encryption requirements
  • Governance
  • Data quality
  • Master data management
  • Validity checks necessary
  • Metadata management
  • Organization of data for optimal data retrieval
  • Scheduling and job management
  • Logging and auditing
  • To what extent data federation will be utilized
  • Enrichment, standardization, cleansing, and curation needs
  • Technology choices comprising the overall data lake architecture (HDFS, Hadoop components, NoSQL DBs, relational DBs, etc.)
  • Modular approach to the overall design

You definitely want to spend some time planning out how the data will be organized so that finding the data is as straightforward as possible. Just like with planning anything where data is stored (for example, a regular file share, or a SharePoint document library, etc.), you usually want to consider subject areas along with user groups and security boundaries. 


There's many different ways to organize a data lake. For batch data loads, here's one example:

Raw Data

    Organizational Unit

          Subject Area

               Original Data Source


                         Date Loaded


This structure could translate into something like:

    East Division








The nice thing about the organizational unit being at the top is there is a clear security boundary. However, that might lead to siloed data and/or duplicated data. 

Key points about the "Raw Data" layer:

  • The raw data is usually considered immutable data (i.e., unchangeable). This allows you to go back to a point in time if necessary. It should be the lowest granularity, most atomic, data you are able to obtain.
  • It's possible that bad data could be deleted from the raw data layer, but generally deletes rare.
  • The speed at which data arrives should be a non-issue for the data lake. Therefore, you want your data lake to be able to accommodate a batch layer (data loaded from batch processes) and a speed layer (data loaded from streaming data or near real-time applications).
  • The choice of data format is particularly crucial in the raw data layer since the performance of writes for low-latency data is critical. Although something like CSV is easy to use and human-readable, it doesn't perform well at larger scale. Other formats like Parquet or ORC perform better. To decide on format, think about file sizes, if a self-describing schema is desired (like JSON or XML), data type support, if schema changes over time are expected, performance you need for write and read, and the integration with other systems. 
  • The raw data layer is often partitioned by date. If there's an insert/update date in the source that can be relied upon, you can organize incremental data loads by that date. If there's not a reliable date to detect changes, then you need to decide if the source is copied over in its entirety every time the data load executes. This correlates to what extent you need to detect history of changes over time--you probably want to be liberal here since all requirements aren't identified upfront.
  • When partitioning by date, generally the dates should go at the end of the path rather than the beginning of the path. This makes it much easier to organize and secure content by department, subject area, etc. 
  • The rule of thumb is to not mix data schemas in a single folder so that all files in a folder can be traversed with the same script. If a script is looping through all the files in a folder--depending on the technique you are using, a script might fail if it finds a different format. 
  • The raw data should be stored the same as what's contained in the source. This is much like how we would stage data for a data warehouse, without any transformations, data cleansing, or standardization as of yet. The only exception to this would be to add some metadata such as a timestamp, or a source system reference, directly within the data itself (having the timestamp can be very helpful if your system needs to tolerate duplicates, for instance from retry logic).
  • Very, very few people have access to the raw data layer. Just like staging for a data warehouse is considered the back-end "kitchen" area, the raw data layer is usually hands-off for most users except for highly skilled analysts or data scientists.
  • The above sample raw data structure emphasizes security by organizational unit. The tradeoff here is a data source table may be duplicated if used by different organizational units. Therefore, a different type of structure might make more sense depending on your needs. 
  • Some architectural designs call for a transient/temporary landing zone prior to the raw data zone. An intermediate zone prior to the raw data zone is appropriate if validity checks need to be performed before the data is allowed to hit the raw data zone. It's also helpful if you need to separate "new data" for a period of time (for instance, to ensure that jobs pulling data from the raw data zone always pull consistent data). Alternatively, some folks refer to a temp/transient zone as the 'speed' layer in a Lambda architecture

For purposes of permitting data retrieval directly from the data lake, usually a different layer is recommended for the curated data, or specialized data. It could be organized more simply such as:

Curated Data



                Snapshot Date (if applicable)


This structure could translate into something like:

    Sales Trending Analysis




Key points about the "Curated Data" layer:

  • User queries are from the curated data layer (not usually the raw data layer). This area where consumption is allowed is also referred to as a data hub sometimes.
  • The curated data layer contains data for specific, known, purposes. This means that the curated data layer is considered "Schema on Write" because its structure is predefined.
  • Some data integration and cleansing can, and should occur, in the curated data layer.
  • It's not uncommon to restructure data to simplify reporting (ex: standardize, denormalize, or consolidate several related data elements).
  • You want to make sure you can always regenerate the curated data on-demand. Regenerating history could be needed for a number of reasons. It helps to recover from an error, should a problem occur. Depending on how schema changes over time are handled, regenerating curated data can help. Perhaps your machine learning algorithms are better now. Or perhaps you've had to delete some customer data from a GDPR request, all the way back to the original raw data.
  • To the extent that friendly names can be assigned, rather than cryptic names from a source system, that should be done to improve usability. Anything you can do to make it simpler to use is usually a good time investment in my mind.
  • Depending on how far you take it, the curated data layer could be set up similarly to a data warehouse, particularly if you take advantage of some of the Hadoop technologies available to augment the data lake infrastructure.
  • You might find that multiple logical data layers to deliver curated data makes sense for your purposes (similar to the concept of different data marts).

It's also very common to include a "sandbox" area in your data lake. This provides your highly trained analysts and data scientists an area where they can run processes and write data output.

Organizing the Data Lake

In the above section are a couple of organization examples. Above all else, the data lake should be organized for optimal data retrieval. Metadata capabilities of your data lake will greatly influence how you handle organization.


Organization of the data lake can be influenced by:

  • Subject matter
  • Source
  • Security boundaries
  • Downstream applications & uses
  • Data load pattern
    • Real-time, streaming
    • Incremental
    • Full load
    • One time
  • Time partitioning & probability of data access
    • Recent/current data
    • Historical data
  • Data classification, trust value
    • Public information
    • Internal use only
    • Supplier/partner confidential
    • Personally identifiable information (PII)
    • Sensitive - financial
    • Sensitive - intellectual property
  • Business impact
    • High (HBI)
    • Medium (MBI)
    • Low (LBI)
  • Data retention policy
    • Temporary data
    • Permanent data
    • Applicable period (ex: project lifetime)

Does a Data Lake Replace a Data Warehouse?

I'm biased here, and a firm believer that modern data warehousing is still very important. Therefore, I believe that a data lake, in an of itself, doesn't entirely replace the need for a data warehouse (or data marts) which contain cleansed data in a user-friendly format. The data warehouse doesn't absolutely have to be in a relational database anymore, but it does need a semantic layer which is easy to work with that most business users can access for the most common reporting needs.

Lessons learned via analysis done from the data lake can often be taken advantage of for the data warehouse (the exact same thing we often say with regard to self-service BI efforts influencing and improving corporate BI initiatives). It is true that agility can be faster when conducting analysis directly from the data lake because you're not constrained to a predefined schema -- anyone who has experienced shortcomings in the data warehouse will appreciate this. For instance, let's say the data warehouse relates the customer and region dimensions to a fact table such as sales. Then someone wants to count customers by region, even if the customer doesn't have any sales. In the DW there's at least a couple of different ways to handle it, but we need to build on the initial schema to handle it (because dimensions usually only relate through facts, not directly to each other). Whereas when you start in the data lake with the question, such as customers by region, you have more freedom and flexibility. However, that freedom and flexibility come at a cost -- data quality, standardization, and so forth are usually not at the same maturity level as the data warehouse. In other words, "schema on read" systems put more burden on the analyst.

There's always tradeoffs between performing analytics on a data lake versus from a cleansed data warehouse: Query performance, data load performance, scalability, reusability, data quality, and so forth. Therefore, I believe that a data lake and a data warehouse are complementary technologies that can offer balance. For a fast analysis by a highly qualified analyst or data scientist, especially exploratory analysis, the data lake is appealing. For delivering cleansed, user-friendly data to a large number of users across the organization, the data warehouse still rules.

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018

You Might Also Like...

Zones in a Data Lake

Querying Multi-Structured JSON Files with U-SQL in Azure Data Lake

Handling Row Headers in U-SQL

Why You Should Use an SSDT Database Project For Your Data Warehouse

Tips for Using Azure Data Catalog

It seems there's a lot of interest in Azure Data Catalog from the customers that I work with. Since I've been discussing it a lot recently during projects, I thought I'd share a few thoughts and suggestions.

Register only Production data sources. Typically you won't want to register Development or UAT data sources. That could lead to users seeing what appears to be duplicates, and it also could lead to inconsistency in metadata between sources if someone adds an annotation to, say, a table in UAT but not in Production. 


Register only data sources that users interact with. Usually the first priority is to register data sources that the users see-for instance, the reporting database or DW that you want users to go to rather than the original source data. Depending on how you want to use the data catalog, you might also want to register the original source. In that case you probably want to hide it from business users so it's not confusing. Which leads me to the next tip...

Use security capabilities to hide unnecessary sources. The Standard (paid) version will allow you to have some sources registered but only discoverable by certain users & hidden from other users (i.e., asset level authorization). This is great for sensitive data like HR. It's also useful for situations when, say, IT wants to document certain data sources that business users don't access directly.

Use the business glossary to ensure consistency of tags. The business glossary is a capability of the Standard (paid) version and is absolutely worth looking into. By creating standardized tags in the business glossary, you'll minimize issues with tag inconsistency that would be annoying. For example, the business glossary would contain just one of these variations: "Sales & Marketing", "Sales and Marketing", or "Sales + Marketing".

Check the sources in the "Create Manual Entry" area of Publish if you're not seeing what you're looking for. There's a few more options available in Manual Entry than the click-once application.

Use the pinning & save search functionality to save time. For data sources you refer to often, you can pin the asset or save search criteria. This will display them on the home page at so they're quicker to access the next time.

Use the Preview & Profile when registering data when possible. The preview of data (i.e., first X rows) and profile of data (ex: a particular column has 720 unique values that range from A110 to M270) are both extremely useful when evaluating if a data source contains what the user really wants. So, unless the data is highly sensitive, go ahead and use this functionality whenever you can.


Be a little careful with friendly names for tables. When you set a friendly name, that becomes the primary thing a user sees. If it's very different from the original name, it could be more confusing than helpful because users will be working with the primary name over in tools such as Power BI.


Define use of "expert" so expectations are clear. A subject matter expert can be assigned to data sources and/or individual objects. In some organizations it might imply owner of the data; however, in the Standard (paid) version there is a separate option to take over ownership. Usually the expert(s) assigned indicates who knows the most about the data & who should be contacted with questions. 

Be prepared for this to potentially be a culture change. First, it's a culture change for DBAs who are responsible for securing data. The data catalog may absolutely expose the existence of a data source that a user didn't know about--however, remember that it only exposes metadata and the original data security is still in force. The other culture change affects the subject matter experts who know the data inside and out. These folks may not be used to documenting and sharing what they know about the data. 

You Might Also Like...

Overview of Azure Data Catalog in the Cortana Intelligence Suite <--Check the "Things to Know about Azure Data Catalog" towards the bottom this post for more tips

How to Create a Demo/Test Environment for Azure Data Catalog 

Setting Up a PC for Azure Cortana Intelligence Suite Development

Some development aspects of the Cortana Intelligence Suite can occur in the Azure Portal. There are also some additional client tools which are helpful, or potentially required, to fully create solutions. This is a quick checklist of tools you probably want to install on a development machine for purposes of working on the analytics, BI, and/or data warehousing elements of Azure.

1. SQL Server Management Studio (SSMS)

The latest version of SSMS is recommended for compatibility with Azure services, as well as backwards compatible with all SQL Server versions back to 2008.

Download SSMS:

2. Visual Studio 2015

The latest 2015 version of Visual Studio is recommended for full functionality for the newest components. If you choose to do a customized VS installation, be sure to select the option to install Microsoft Web Developer Tools. (If you don't, when you try to install the Azure SDK later it won't install properly because prerequisites are missing. Yeah, yeah, I've been there.)

If you don't have a license available, look into using the VS Community edition.

Download VS 2015: 

Note: in addition to "Visual Studio 2015," there's also a "Visual Studio 15 Preview." The 15 Preview is *not* the same thing as Visual Studio 2015, even though 15 is in its name. So, just watch out for that in terms of naming.

3. SQL Server Data Tools (SSDT) for Visual Studio 2015

Here's is where you gain the ability to create BI projects: SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS). These SQL Server BI projects aren't considered part of Cortana Intelligence Suite, but if you're creating an analytics, BI, and/or data warehousing solution you may need at least one of types of BI projects as part of the overall solution.

With the latest version of SSDT for VS 2015, you'll also be able to interact with all versions of SQL Server, as well as Azure SQL Database and Azure SQL Data Warehouse.

Example of what an Azure SQL Data Warehouse cloud resource looks like from within Visual Studio (SSDT):

4. Azure SDK

The Azure SDK sets up lots of libraries; the main features we are looking for from the Azure SDK right away are (a) the ability to use the Cloud Explorer within Visual Studio, and (b) the ability to create ARM template projects for automated deployment purposes. In addition to the Server Explorer we get from Visual Studio, the Cloud Explorer from the SDK gives us another way to interact with our resources in Azure. 

Note that Visual Studio 2017 will have the latest Azure SDK built in, so this separate installation step won't be necessary. At that time of this writing, Visual Studio 2017 is not fully released yet.

Example of what the Cloud Explorer pane looks like in Visual Studio (by Resource Group, and by Resource Type):

Example of what you'll see related to ARM template projects after the Azure SDK is installed:

The above Azure Resource Group project is useful for source-controlling your Azure infrastructure. This can hold your ARM templates (JSON files), as well as PowerShell scripts (see #8 below).

Download the Azure SDK for Visual Studio 2015: 

5.  Relevant Visual Studio Extensions

These are important extensions for working with Cortana Intelligence Suite at this time:
-Microsoft Azure Data Lake Tools for Visual Studio 2015 <--Automatically installed as part of the Azure SDK
-Microsoft Azure HDInsight Tools for Visual Studio 2015 <--Automatically installed as part of the Azure SDK
-Microsoft Azure Data Factory Tools for Visual Studio 2015

At the time of this writing (June 2016), Azure Data Factory Tools are not automatically installed with the Azure SDK. That will probably change at some point I would guess.

Example of the Cortana Intelligence Suite projects you'll see after the Azure extensions are installed:

Depending on the particular service, there may be links within the Azure portal as well:

Download: <--The file called "VS 2015" is the Azure Pack

Ongoing updates for Azure Data Lake and Stream Analytics Tools for Visual Studio:

The latest extension for Azure Data Factory:

6.  Microsoft Azure Storage Explorer and/or AzCopy

I really like the new standalone Azure Storage Explorer for uploading and downloading files to Azure Storage. AzCopy is another alternative - AzCopy is a command line utility instead of a graphical UI, so it's better suited for automation and scripting purposes. 

Example of what Azure Storage Explorer looks like:

Download Microsoft Azure Storage Explorer: 
Download AzCopy:

Note: There is a different Azure Storage Explorer on CodePlex with an almost identical name to the one I'm referring to.

7. Azure PowerShell

Even if you don't tend to utilize scripting and automation tremendously, chances are you'll need PowerShell for something sooner rather than later. Installing Azure PowerShell adds the Azure management cmdlets to Windows PowerShell.


There's also an Integrated Scripting Environment (ISE) you may want to look into using. The ISE also has an Azure Automation Toolkit which makes local authoring of Azure Automation runbooks easier.


8. PowerShell Tools for Visual Studio

You may want to store and source control various PowerShell scripts:

9. R Tools and/or Python Tools for Visual Studio

R Tools for Visual Studio (RTVS) and Python Tools for Visual Studio (PTVS) are optional alternatives to other design environments (such as R Studio).

Download RTVS:

Download PTVS:

10. Azure Feature Pack for SQL Server Integration Services

In step 3 above we installed the BI projects including SSIS. There's also a feature pack for SSIS which adds connection managers, tasks, and components for Azure services such as Blob Storage, Azure HDInsight, Azure Data Lake Store and Azure SQL Data Warehouse.

Download for SSIS 2016:

Download for SSIS 2014:

Depending on what you need to do, there might be additional items to install (such as the Azure CLI), but the above list should be a good start for tools related to development of analytics, DW, and/or BI solutions. 

11. Team Foundation Version Control or GitHub

Usage of Team Foundation Version Control (TFVC), Git, or another form of source control is not just a good practice, it'll save your bacon one of these days.