Getting Started with Azure

I've been asked numerous times where to start when it comes to Azure. Getting a handle on Azure concepts, terminology, and features can be more than a little intimidating. Not only is Azure an extremely large platform, but the pace of change is also *very* fast. This post has taken me quite a while to figure out how to write, because where to start depends so much on which services you're responsible for and what your role is. Each individual service in Azure has its own set of gotchas and lessons learned. So, this post will stay at the 1,000-foot level.

Before we dive in further, I do want to offer a few thoughts and suggestions:

  • Even though you can create a resource quickly and easily, that doesn't mean that it's enterprise-ready immediately. By all means, save time getting started. Just don't forget that getting everything working securely with sound and repeatable practices is often counter to the "deploy in 10 minutes" goal for the cloud that we hear a lot. So, plan to spend time learning how things work more deeply if you're building anything besides a test/sandbox environment.
  • Always build a proof of concept if you're developing on an unfamiliar resource. There will be something you expect it to do that it won't do. New features and functionality are always coming out, but everything these days begins as an MVP: minimum viable product. It can take months for a new service to mature. So, do yourself a favor and POC pretty much everything as normal practice.
  • Try to keep up with Azure announcements for the resources you work with. The pace of change is crazy fast and it's really difficult to keep things straight. However, you need to try because, in addition to new functionality, behavior changes occur and features get deprecated as resources evolve and mature.
  • Although it's tempting to start with a virtual machine solution (IaaS) because it's the most familiar, give PaaS (platform as a service) a fair shot because it's really the future and where significant investments are being made by Microsoft.
  • Objects in an Azure subscription can turn into a mess really fast if there are no standards. How dev/test/prod will be separated, and how resources will be arranged by resource group, are two of the most critical decisions you will make. Naming conventions are also very important (and yes, I know about the pets and the cattle -- I still like my naming conventions).
  • Choosing a location for services is an extremely important decision. You want to select a geographic location based on data storage, proximity to users, co-locating related resources, minimizing latency, backup and recovery needs, service availability in a region, and minimizing data egress charges (data coming into a region is free, whereas data leaving a region is not -- data egress charges are small but can add up to a surprise bill if a decent volume of data is being transmitted).
  • The cloud is not *always* cheaper, and the cloud is not *always* easier. Often both are true (especially for sandbox and POCs), but don't take it for granted that both are always true for an enterprise solution. It really depends. 
  • Developers and administrators have an immense impact on cost in a cloud computing environment. Time spent learning cost savings techniques is well worthwhile.
  • ARM (Azure Resource Manager) templates and PowerShell are both your friend when it comes to Azure productivity. However, I wouldn't get hung up there if you are brand new to Azure. It's ok to use the portal to create and modify items in Dev and/or while you are learning. You can evolve to use automation techniques as soon as you're ready and/or when things need to be repeatable.
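The egress-pricing point above is easy to quantify. Here's a minimal sketch; the per-GB rate and free allowance below are placeholder assumptions, so always check the current Azure bandwidth pricing page:

```python
def monthly_egress_cost(gb_out_per_month, free_gb=5, rate_per_gb=0.087):
    """Estimate monthly data egress charges.

    Ingress is free; egress is billed per GB beyond a small free
    allowance. The rate and free tier here are placeholder values --
    check the current Azure bandwidth pricing page for real numbers.
    """
    billable_gb = max(0, gb_out_per_month - free_gb)
    return billable_gb * rate_per_gb

# A "small" per-GB charge adds up at volume:
print(f"${monthly_egress_cost(10):.2f}")    # a modest 10 GB/month
print(f"${monthly_egress_cost(5000):.2f}")  # ~5 TB/month is a real bill
```

The point isn't the exact rate; it's that egress cost scales linearly with volume, so chatty cross-region architectures deserve scrutiny up front.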

A number of the above thoughts are part of a presentation I've delivered recently. You can find those slides here: Tips for Getting Started with the Azure Data Platform

What is Azure?

Azure offers infrastructure and services for customers (us) to build and deploy software and services through a global network of Microsoft-managed data centers. Although many individual aspects will feel familiar, there can be different design patterns and best practices in cloud and hybrid environments. Here is an overview of the types of resources we can provision in Azure:

The Azure Interactives site (shown above) gives a short snippet as to what each service is, and links to where to find more information. It's a good way to get a quick sense of what is there.

Azure Terms & Definitions

Here are a few of the most important basic building blocks to know related to Azure:

AzureTerminology.png

Subscription: A billing relationship. Each subscription trusts one directory; a directory can be trusted by multiple subscriptions. Subscriptions have documented limits and quotas. Subscriptions also represent a security boundary.

Tenant (aka Directory): The AAD (Azure Active Directory) service which handles identity management. A global administrator in Azure is also a global administrator in Office 365.

ARM: The Azure Resource Manager (ARM) framework which allows us to deploy and manage resources in Azure. The various service management APIs behind Visual Studio, the command line, and the Azure portal all share the same underlying ARM layer. Managing ARM scripts in source control is part of the practice referred to as 'Infrastructure as Code.' By default, ARM scripts generated by the Azure portal encompass an entire resource group (the exception being Resource Explorer).

Resource Group: A logical container for resources in Azure. Typically, one resource group contains resources with the same lifecycle which should be managed, secured, and deployed together as a unit (for instance, an Azure Virtual Machine and its related Azure Storage Accounts). Resource groups are important for organizing resources, and can also serve as a security boundary (particularly for decentralized administration purposes).

Resource: Resources are the individual services in Azure, such as Azure SQL Database, Azure Virtual Machine, or an Azure Storage account. There is a wide variety of resources available in the Azure platform (as shown in the first screenshot near the top of this post). A resource must be assigned to a resource group.

Resource Provider: Registering a provider allows your subscription to work with a specific type of resource. Many commonly used resources are registered by default, but not all.
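The hierarchy in these definitions shows up directly in every ARM resource ID, which embeds the subscription, resource group, provider, and resource in a fixed path. A small parsing sketch (the GUID and names are made up for illustration):

```python
def parse_resource_id(resource_id):
    """Split an ARM resource ID into its hierarchy:
    subscription -> resource group -> provider -> resource."""
    parts = resource_id.strip("/").split("/")
    # Canonical shape:
    # subscriptions/{sub}/resourceGroups/{rg}/providers/{ns}/{type}/{name}
    return {
        "subscription": parts[1],
        "resource_group": parts[3],
        "provider": parts[5],
        "resource_type": parts[6],
        "resource_name": parts[7],
    }

rid = ("/subscriptions/00000000-0000-0000-0000-000000000000"
       "/resourceGroups/rg-sales-prod"
       "/providers/Microsoft.Sql/servers/sql-sales-prod")
print(parse_resource_id(rid))
```

Seeing the subscription and resource group baked into every resource's identity is a good reminder of why those two organizational decisions matter so much.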

And, here are a few more common cloud computing terms:

CloudIaaSPaaSSaaS.png

IaaS: Infrastructure as a Service is a close equivalent to virtual server environments that we've been using for years--just that the servers are in Microsoft's data center instead. Virtual machines are IaaS. With IaaS, the customer is responsible for managing nearly all of the environment (such as the operating system -- though there are a few features that blur the lines between IaaS and PaaS a bit because you can allow Azure to install O/S and database patches for certain VM images).

PaaS: With Platform as a Service, the customer manages much less: usually just the data and the custom application. The cloud provider is responsible for everything else; in fact, the customer typically cannot even directly access the VM that the PaaS service is actually running on. With PaaS, the customer focuses on building a solution rather than on managing the environment. PaaS examples include Azure SQL Database, Azure SQL Data Warehouse, and CosmosDB.

SaaS: With Software as a Service, the focus is much more on consumption of a pre-existing application. You can think of it as "renting an application." SaaS examples include Power BI or Office 365.

Serverless: The term serverless feels like a misnomer -- there are many thousands of servers running behind everything in Azure. From the customer's perspective though, services are "serverless" because the infrastructure is invisible. Azure Functions are considered a serverless technology.

Azure Stack: Although sometimes you hear someone referring to 'the Azure stack' as a general term for all of Azure, what Azure Stack actually refers to is a way to deploy Azure resources in a private data center (as opposed to the public cloud) via an appliance from a hardware vendor. Only certain types of resources are available via Azure Stack.

Your Purpose For Going to the Cloud

Following are some of the most common reasons why cloud computing is attractive, and also some common concerns. I know you're likely most interested in the technical stuff -- the reason why I bring this up is because it's very important for you to really know these objectives. You will continually make tradeoffs during the design and development process, and the tradeoffs you make will depend completely on your goals. By tradeoffs, I mean decisions around:

  • Simplicity vs. control   (Do you want the 'easy' option, or do you prefer customization & control?)
  • Cost vs. features available  (Do you really need feature xyz even if it costs a lot more?)
  • Cost vs. complexity  (Are you willing to manage something on your own if it's a lot cheaper?)
  • etc...
AzureCommonProsAndConcerns.png

The remainder of this post consists of URLs which may be helpful to you, depending on your starting point and what you're trying to achieve.

Important Azure URLs

Introductory Azure Information

Azure Reference Materials

Pre-Built Azure Templates and QuickStart Solutions

Azure Security and Compliance

How to Get an Azure Subscription

You Might Also Like...

Presentation materials: Tips for Getting Started with the Azure Data Platform

Naming Conventions in Azure

Overview of Azure Automation

Setting Up a PC for Cortana Intelligence Suite Development

Why Some Azure VM Sizes are Unavailable when Resizing in the Portal

Deciding on Encryption at Rest for an Azure Virtual Machine

Why the Default Summarization Property in Power BI is So Important

This post describes an issue I ran into this week on a project. If the Default Summarization settings in your data model are not correct, it results in wrong data being displayed, or even missing data, on your report.

This post is based on Power BI Desktop as of mid November 2017. The concept also is applicable to Tabular models in Azure Analysis Services and SQL Server Analysis Services. 

Let's say I have a small table of data:

PBISum_rawdata.png
 

By default, Power BI detects numeric columns and sets the summarization property to sum (or count). We can tell this by the sigma symbol next to YearNbr and SalespersonDept, and the Default Summarization property for the column:

PBISum_fieldlist_wrong.png

What you really want to do is set both SalespersonDept and YearNbr to "Don't summarize" as shown here:

PBISum_fieldlist_correct.png

Let me explain why setting this summarization property is so very important.  

Summarization Behavior on the Report Canvas

Since Bob has 4 rows of data, when we put this data on the canvas, SalespersonDept and YearNbr sum themselves up. If you don't know the data well enough, you might not even notice these are bogus numbers (especially SalespersonDept in our example). So, not only is the data displayed incorrectly, but you can also waste time trying to test or cross-reference SalespersonDept 4004 because it doesn't exist.

PBISum_canvas.png
 

Now, you might say to me that it's an annoyance and you can fix it in the visual. Of course you can, but do you want to have to fix it on *every* visual you create? What if you don't catch the fact that SalespersonDept is a bogus number? So if that doesn't convince you to set the summarization property on the dataset, I'll bet this next reason will...
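This "sum everything numeric" default is easy to reproduce outside of Power BI. A pandas sketch of the same mistake, with illustrative data mirroring the post's example (Bob has 4 rows, and SalespersonDept is a code, not a measure):

```python
import pandas as pd

# Illustrative data: 4 rows for Bob, dept code 1001
df = pd.DataFrame({
    "Salesperson": ["Bob"] * 4,
    "SalespersonDept": [1001] * 4,
    "YearNbr": [2014, 2015, 2016, 2017],
    "SalesAmt": [100, 200, 300, 400],
})

# Default "sum every numeric column" behavior produces a bogus dept:
summed = df.groupby("Salesperson").sum()
print(summed.loc["Bob", "SalespersonDept"])  # 4004 -- not a real dept

# Correct: only the true measure should aggregate
print(df.groupby("Salesperson")["SalesAmt"].sum())
```

The 4004 comes from 1001 summed across 4 rows, exactly the bogus value a Power BI visual shows when the Default Summarization property is left on Sum.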

Summarization Behavior in Visual Filters

Let's say I add a visual filter on my table because I only want to show Bob's rows of data. Bob is SalespersonDept 1001. However, the visual filter does not return any rows of data for 1001 -- the only way it finds data is if we search on the bogus *aggregated* value of 4004 (which is 1001 * 4 rows of data):

PBISum_visual filter.png

Not a good user experience, right? So the message here is that visual-level filters are applied *after* the data aggregation occurs within the individual table or chart. Note that slicers, page filters, and report filters don't behave this way--they each return the correct data for SalespersonDept 1001 because they filter before summarization. It's just visual-level filters that are applied after the data aggregation occurs in the visual (thanks to Javier Guillen for confirming I wasn't crazy!).
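This order-of-operations difference can be reproduced in pandas (a sketch of the behavior, not Power BI's engine; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Salesperson": ["Bob"] * 4 + ["Ann"] * 2,
    "SalespersonDept": [1001] * 4 + [3003] * 2,
    "SalesAmt": [100, 200, 300, 400, 500, 600],
})

# Visual-level filter behavior: aggregate first, THEN filter.
agg = df.groupby("Salesperson", as_index=False).sum()
after = agg[agg["SalespersonDept"] == 1001]  # empty! Bob's rows sum to 4004
print(len(after))  # 0

# Slicer/page/report filter behavior: filter first, THEN aggregate.
before = df[df["SalespersonDept"] == 1001].groupby("Salesperson").sum()
print(before)  # Bob's correct totals
```

Filtering the aggregated rowset for 1001 finds nothing, while filtering the detail rows first returns Bob's data as expected -- the same asymmetry described above.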

Options for Resolving Aggregation Issues

First choice: Solve this in the underlying data storage. For instance, in a data warehouse, SalespersonDept should be a varchar(4) instead of an integer. With this choice, no downstream reporting tool will try to aggregate columns like SalespersonDept or YearNbr, because Power BI will detect the column as text. Of course, we live in the real world, and this often can't be done unless we're creating a brand new source database. Which brings us to...

Second choice (shown in this post): Set the summarization setting in the data model correctly for every column in every table. Another way is to set the data type to text. The dataset serves as our semantic layer, so by setting this in the data model then each report visual can take advantage of this setting. This is the most common way of handling the issue.

Third choice (worst choice): Your last choice is to change the aggregation in every individual visual. This is risky because what if you forget? Or what if someone new creates a new report and doesn't know? So, don't do this please. The previous choice of handling in the data model is a much more sound practice.

You Might Also Like...

Reusing Datasets Imported to the Power BI Service
 

Deciding Whether to Use Azure SQL Data Warehouse

From time to time I publish on the BlueGranite team blog. My newest post is a decision tree about whether or not Azure SQL Data Warehouse is a good fit.

In Azure you have several technology choices for where to implement a data warehouse. Since Azure SQL DW is an MPP (massively parallel processing) platform, it's only appropriate in certain circumstances. Hopefully the decision tree can help educate people on the best use cases and situations for Azure SQL DW, and help prevent making the wrong technology choice, which leads to performance issues down the road.

Please head on over to the BlueGranite site to check out the post.

Two Ways to Approach Federated Queries with U-SQL and Azure Data Lake Analytics

Did you know there are two ways to do federated queries with Azure Data Lake Analytics (ADLA)? By federated queries, I mean a query that combines (federates) data from multiple sources -- in this case, from within Azure Data Lake and another data store. Federated queries are one aspect of data virtualization which helps us to access data without requiring the physical movement of data or data integration:

FederatedQueries.jpg

The two methods for federated queries with U-SQL and ADLA are:

  1. Schema-less (aka "lazy metadata")
  2. With a pre-defined schema in an external table

You might be familiar with external tables in SQL Server, Azure SQL Data Warehouse, or APS. In those platforms, external tables work with PolyBase for purposes of querying data where it lives elsewhere, often for the purpose of loading it into the relational database. That same premise exists in Azure Data Lake Analytics as well. However, in the data lake there are two approaches -- an external table is still a good idea most of the time, but it isn't absolutely required.

Option 1: Schema-Less

Following are the components of making schema-less federated queries work in ADLA:

ADLA_Schemaless.jpg
 

Pros of the schema-less option:

  • Access the data quickly for exploration without requiring an external table to be defined in the ADLA Catalog
  • More closely aligned to a schema-on-read paradigm because of its flexibility 
  • Query flexibility: can retrieve a subset of columns without having to define all the columns

Cons of the schema-less option:

  • Additional "burden" on the data analyst doing the ad hoc querying with U-SQL to always perform the schema-on-read within the query
  • Repeating the same schema-on-read syntax in numerous U-SQL queries, rather than reusing the definition via an external table -- so if the source system table or view changes, it could involve altering numerous U-SQL scripts.
  • Requires a rowset in the U-SQL schema-on-read queries - i.e., cannot do a direct join so this approach involves slightly longer, more complex syntax

Option 2: With a Pre-Defined Schema in an External Table

The following introduces an external table to the picture in order to enforce a schema:

ADLA_SchemaExternalTable.jpg
 

Pros of using an external table:

  • Most efficient for the data analyst doing the ad hoc querying with U-SQL
  • Easier, shorter syntax on the query side because columns and data types have already been predefined in the ADLA Catalog, so a direct join to an external table can be used in the query without having to define a rowset
  • Only one external table to change if a modification does occur to the underlying SQL table

Cons of using an external table:

  • Schema must remain consistent - a downstream U-SQL query will error if a new column is added to the remote source and the external table has not been kept in sync
  • All remote columns must be defined in the external table (not necessarily a big con - but definitely important to know)

In summary, the schema-less approach is most appropriate for initial data exploration because of the freedom and flexibility. An external table is better suited for ongoing, routine queries in which the SQL side is stable and unchanging. Solutions which have been operationalized and promoted to production will typically warrant an external table. 
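The tradeoff above reads the same in almost any tool. A pandas analogy (deliberately not U-SQL; the table and column names are made up) of "re-declare the schema in every query" versus "declare it once and keep it in sync":

```python
import io
import pandas as pd

PATIENT_CSV = "PatientID,PatientName\n1,Ann\n2,Bob\n"

# Option 1 -- schema-less: every ad hoc query re-declares the schema
# inline. Flexible, but the declaration is repeated across N scripts,
# so a source change means editing all of them.
def adhoc_query():
    return pd.read_csv(
        io.StringIO(PATIENT_CSV),
        dtype={"PatientID": "int64", "PatientName": "string"},
    )

# Option 2 -- pre-defined schema: declared once (like an external
# table in the ADLA Catalog) and reused everywhere. Only one place to
# update, but it must stay in sync with the source.
PATIENT_SCHEMA = {"PatientID": "int64", "PatientName": "string"}

def routine_query():
    df = pd.read_csv(io.StringIO(PATIENT_CSV), dtype=PATIENT_SCHEMA)
    missing = set(PATIENT_SCHEMA) - set(df.columns)
    if missing:  # schema drift -> fail loudly, as a stale external table would
        raise ValueError(f"schema out of sync: {missing}")
    return df
```

The centralized declaration is why the external-table route wins for operationalized solutions: one definition to maintain, at the cost of a hard failure when it drifts out of sync.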

Want to Know More?

During my all-day workshop, we set up each piece step by step, including the service principal, credential, data source, external table, and so forth, so you can see the whole thing in action. The next workshop is in Washington DC on December 8th. For more details and how to register, check here: Presenting a New Training Class on Architecting a Data Lake

You Might Also Like...

Querying Multi-Structured JSON Files with U-SQL

Running U-SQL on a Schedule with Azure Data Factory to Populate Azure Data Lake

Handling Row Headers in U-SQL

Data Lake Use Cases and Planning Considerations