Find Pipelines Currently Running in Azure Data Factory with PowerShell

This is a quick post to share a few scripts for finding what is currently executing in Azure Data Factory. These PowerShell scripts apply to ADF version 1 (not version 2, which uses different cmdlets).

Prerequisite: In addition to having installed the Azure Resource Manager modules, you'll have to register the provider for Azure Data Factory:

#One-time registration of the ADF provider
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

Check for Unfinished ADF Jobs 

This script finds unfinished jobs (i.e., not in a ready state), after the start date, sorted by status (i.e., window state):

#-----------------------------------------
#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$dataFactoryName = 'YourDataFactoryName'
$runStartDateTimeUTC = '2018-01-20T14:00:00Z'
#-----------------------------------------
#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName
#-----------------------------------------
Get-AzureRmDataFactoryActivityWindow `
     -DataFactoryName $dataFactoryName `
     -ResourceGroupName $resourceGroupName `
     -RunStart $runStartDateTimeUTC `
     | Where-Object {$PSItem.WindowState -ne 'Ready'} `
     | Select-Object PipelineName, ActivityName, ActivityType, WindowState, `
               PercentComplete, RunStart, RunEnd, Duration `
     | Sort-Object WindowState | Format-Table

For brevity, the rest of this post skips repeating the input area and login as shown in the first script above.

Check Status of All ADF Jobs

This script shows status of all jobs after the start date, sorted by pipeline name:

Get-AzureRmDataFactoryActivityWindow `
     -DataFactoryName $dataFactoryName `
     -ResourceGroupName $resourceGroupName `
     -RunStart $runStartDateTimeUTC `
     | Select-Object PipelineName, ActivityName, ActivityType, `
               WindowState, PercentComplete `
     | Sort-Object PipelineName | Format-Table

Find Status of Certain ADF Jobs Based on a Naming Convention

If you have implemented a nice consistent naming convention (yes, I'm totally a naming convention junkie), you can search based on the name of an activity or pipeline. This example is looking for the status of the jobs which copy data from Oracle into Azure Data Lake Store:

Get-AzureRmDataFactoryActivityWindow `
     -DataFactoryName $dataFactoryName `
     -ResourceGroupName $resourceGroupName `
     -RunStart $runStartDateTimeUTC `
     | Where-Object {$PSItem.ActivityName -like 'ORA*' `
          -or $PSItem.ActivityName -like 'DW*'} `
     | Select-Object PipelineName, ActivityName, ActivityType, WindowState, `
               PercentComplete, RunStart, RunEnd, Duration `
     | Sort-Object PipelineName | Format-Table

Zones in a Data Lake

As we are approaching the end of 2017, many people have resolutions or goals for the new year. How about a goal to get organized...in your data lake?

The most important aspect of organizing a data lake is optimal data retrieval.

It all starts with the zones of your data lake, as shown in the following diagram:

Hopefully the above diagram is a helpful starting place when planning a data lake structure. I have used all of the above zones in projects (with the exception of a transient zone which I haven't had a requirement for).

Want to Know More?

During my all-day workshop, we discuss zones and organizing the data lake in detail. The next workshop is in Raleigh, NC on April 13, 2018.

You Might Also Like...

Data Lake Use Cases and Planning Considerations  <--More tips on organizing the data lake in this post

Checklist for Finalizing a Data Model in Power BI Desktop

My colleague and friend Meagan Longoria has been giving me a few “gentle nudges” to update an old post about this topic.

 
 

This post is a checklist of suggested items to be completed in your Power BI data model, within Power BI Desktop, before you consider it “done done.” This is not a how-to post; it's meant to truly be a checklist with just a bit of context in each 'why' column. (Though I concede I got more than a little wordy at 14 pages.)

A downloadable version:  Checklist

Checklist last updated: 12/27/2017 (V2)

The checklist focuses on the data model itself, rather than report design. For sound practices and suggestions about report design, check out Meagan's series:  Design Concepts for Better Power BI Reports

This checklist does **NOT** include:

  • Visuals or reports in Power BI Desktop
  • Performance-related specifics (because it always depends)
  • Testing specifics or approaches
  • After publishing to either Power BI Service or Power BI Report Server
  • Ongoing administration and management such as data refresh

Terminology:

  • Data model = the data imported to Power BI; also known as the dataset (formerly known as Power Pivot)
  • PBIX = the Power BI Desktop file
  • Field = columns and derived measures which are displayed in the field list
  • Query = source query in the Query Editor (formerly known as Power Query)

Why Is the Power BI Data Model So Important?

There are many things you can do with report visuals – change field names, adjust formats, and lots more. However, adjustments to things like names and formats within the report should be the exception rather than the norm. If all fine-tuning is done within the data model, it’s done *once* and every downstream report takes advantage of it. You will save yourself much time and effort, and improve consistency from report to report tremendously. Other users who consume your data model through reports and/or Q&A will appreciate the extra effort you’ve made to improve the user experience.

PBIX File:

To Do

Why

Use a Consistent, Unchanging PBIX File Name with Business Relevance

The PBIX name you assign is what the user will see after it’s been published. The name for the PBIX file should describe the set of reports, and it should be a consistent name which does not change over time. While a version number (like “SalesAnalysis-V1-December”) is OK for keeping interim copies of the PBIX straight, the file name needs to remain consistent once it is published. This ensures favorites, links, and apps all continue to work whenever changes are released.

Consider Reusing an Existing Dataset

Before starting a new PBIX, give some consideration to whether the data already exists in another Power BI data model. If new reports are the goal, yet the data already exists, a “hub and spoke” approach may work in which an existing published data model is reused. This can be done by creating new reports in the Power BI Service that refer to an existing dataset, by using a Power BI Service Live Connection, or via Analyze in Excel functionality. A hub and spoke approach (a) decreases the amount of redundant data stored in various PBIX files, (b) reduces the number of data refresh operations, (c) reduces future maintenance needs, (d) improves consistency because there are fewer files with data models, and (e) facilitates the separation of dataset development from report development. More details: Reusing Datasets Imported to the Power BI Service.

Use the Correct Release of Power BI Desktop Based on Your Target Deployment Destination

There are now two ‘flavors’ of Power BI Desktop. The standard one is updated monthly, and it correlates with functionality releases in the Power BI Service. The newer one is Power BI Desktop Optimized for Power BI Report Server – that one is updated every 3-4 months. If you develop Power BI solutions which are deployed to both the Power BI Service and to Power BI Report Server, you will need to carefully manage which release you are using for your data model development (they can run side by side on the same machine). Keep in mind that if collaboration will occur on a Power BI Desktop file, all users need to have the same version of Power BI Desktop installed on their machine (this is easier if you install the standard version of Power BI Desktop via the Windows Store so that it will update itself in the background).

Manage the Dataset Size

File size limits impact the amount of data which can be imported into the data model. At the time of this writing, the file size limits (after compression is applied) are as follows:

 

  • Power BI Desktop – based on individual PC system capability
  • Power BI Service – 1GB, or 10GB if the PBIX is deployed to a Premium workspace
  • Power BI Report Server – 2GB

Utilize Version History for PBIX File

Where to store the original PBIX file is an important decision. One common location is OneDrive for Business, which has built-in versioning available, so you can go back to a prior version if something goes wrong. A source control repository is also ideal. The original file should reside in a team site, not an individual user’s laptop. If report edits are made directly in the Power BI Service, you may also need to download from the Service and synchronize with the version in your version history.

Consider Using a Power BI Template

A Power BI template (PBIT) can contain everything a PBIX can, except the data itself. (Note this is different from solution templates available on AppSource.) Templates can contain common queries and data tables (such as a Date table in every model). The original template file should reside in the same original location with version history.

Consider Using a Power BI Theme

A Power BI report theme allows for a consistent use of a color palette in a report. Though theming affects reports and not the data model (the data model is the focus of this checklist), it is possible a data modeler will handle this task since custom report themes are based on JSON. The original JSON file should reside in the same original location with version history.
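As a sketch of what a report theme file looks like (the theme name and colors below are hypothetical), a minimal theme JSON might be:

```
{
  "name": "Hypothetical Corporate Theme",
  "dataColors": [ "#31B6FD", "#4584D3", "#5BD078", "#A5D028" ],
  "background": "#FFFFFF",
  "foreground": "#4584D3",
  "tableAccent": "#31B6FD"
}
```

Saving this as a .json file and importing it via the report theming feature applies the palette consistently across report visuals.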

 

PBIX File Properties:

To Do

Where

Why

Decide if Using Auto Time Intelligence

File > Options and Settings > Options > Current File: Auto Date Time

Auto time intelligence is enabled by default, and it applies to each individual PBIX file (there’s not a global option). For most datetime columns that exist in the dataset, a hidden date table is created in the model to support time-oriented DAX calculations. This is great functionality for newer users, or if you have a very simple data model. However, if you typically utilize a standard Date table for the analytical dates, then you may want to disable the hidden date tables to reduce the file size – particularly if you have some dates which are just attributes (most dates can be analytical wherein you do analysis like YoY, though some dates can just be simple attributes which don't need to be connected to a Date dimension). Using the built-in functionality along with a user-generated Date table can work as well. (Tip: You can view the hidden date tables if you connect to the PBIX via DAX Studio.)

Control Options for Relationships

File > Options and Settings > Options > Current File: Relationships

Once the data model and relationships are complete, you may wish to ensure the relationships do not change without your knowledge. To do this, disable two items in each PBIX file (there’s not a global option): “Update relationships when refreshing queries” and “Autodetect new relationships after data is loaded.”

Use Preview Features Carefully

File > Options and Settings > Options > Global: Preview Features

Public preview features should be enabled cautiously, since preview features are typically not supported in the Power BI Service or Power BI Report Server. It’s easy to forget you used a preview feature which won’t work properly after the report has been published.

 

Query Editor:

To Do

Where

Why

Decide the Data Connectivity Mode

Home > Edit Queries

A very important decision is whether to utilize import mode or DirectQuery mode. Guidance about this decision can be found in the Planning a Power BI Enterprise Deployment whitepaper.

Import the Minimum # of Columns

Home > Edit Queries

Importing only the columns that are needed simplifies the model and reduces the size. Model size is very important since the Power BI model resides in-memory (when in import mode). Since hidden columns still consume size and memory, the rule of thumb is: if a column doesn’t contribute to a report, calculation, or a relationship, then it shouldn’t be imported to the model.

Disable Load for Intermediary Queries

Home > Edit Queries > All Properties > Enable Load

Intermediary queries in the Query Editor are very useful in certain circumstances, such as when two queries are then merged together into one. Any intermediary queries should be disabled from being loaded to the data model.

Parameterize Data Source Connections or Other Changing Data

Home > Edit Queries > Manage Parameters

Parameters can be useful to minimize hard-coding and reduce maintenance when changes need to occur. For data source connections, by default the connection is stored in the first step of every query. For dev>test>prod scenarios, changing the connection only once in a parameter is far less tedious than for every query. It also reduces the chance of error or omission.
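As a sketch (the server, database, table, and parameter names below are hypothetical), a query's Source step can reference a parameter defined via Manage Parameters rather than a hard-coded server:

```
let
    // ServerName is a query parameter defined via Manage Parameters
    Source = Sql.Database(ServerName, "SalesDW"),
    // Navigate to a hypothetical dbo.Customer table
    Customers = Source{[Schema = "dbo", Item = "Customer"]}[Data]
in
    Customers
```

Changing the ServerName parameter value once then flows through every query that references it.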

Set Data Source Privacy Level

File > Options and Settings > Data Source Settings > Edit Permissions > Privacy Level

When importing data from various types of sources, the types of optimizations available vary per data source (such as query folding which passes filters down to an underlying data source). If data is sensitive and you do not want it to be passed in-memory to another server for consolidation or aggregations, and/or you want to ensure a DBA monitoring the server cannot see values, you can control isolation by setting the privacy level. This may result in a slower consolidation or comparison of data, but is more secure for sensitive data.

Set Encryption for Data Source Connections

File > Options and Settings > Data Source Settings > Edit Permissions > Encryption

Connections should be encrypted whenever possible as a standard security measure. This does need to be supported by the data source which is being queried.

Use Consistent Source File Locations

Home > Edit Queries > Data Source Settings

When accessing sources such as flat files (ex: CSV or Excel), make sure the files are accessible from a centralized location which is not based on an individual user path. This will ensure a data refresh operation won’t fail if another person executes it.

Retrieve Source Data from a Database View

Home > Edit Queries

When possible, when retrieving data from a relational database, select data from a database view rather than a table. Different sets of views allow for different calculations and customization for different purposes: DW views can feed a semantic layer for corporate BI delivery (ex: Analysis Services). Alternatively, USR or RPT (user / report) views can be customized for power users who are allowed to access relational data (ex: from Power BI Desktop).

 

Tables and Relationships:

To Do

Where

Why

Use Friendly Business Name for Tables

Data pane > Fields

A table name like “Students” is nicer to look at within a field list than something like “TblStdt.” The name should represent the data contained within the table. Other contextual information about the type of table is very helpful to include in the name, such as: Financial Monthly Snapshot, Financial Transactions, or Financial Current Balances. Though often debated, either singular or plural table names are fine - just be consistent.

Validate Relationships are Accurate

Relationships pane

Depending on the source of data, relationships may or may not be auto-generated. One of the most critical tasks is to verify all required relationships are in place and accurate with respect to the 1-to-many direction of each.

Verify Relationships Do Not Change Direction

Relationships pane

The direction of a relationship affects how filters propagate through a model. If the direction of relationships changes, a ‘dead-end’ is created, which means reports do not display correct data. This is more likely to occur in more normalized data models. Verifying this is a critical piece of model validation.

Validate and Use Bi-Directional Relationships Purposefully

Relationships pane

Bidirectional relationships should be used minimally, and with a specific purpose. They should be used as the exception – to accomplish something specific – rather than the default, particularly if you have any complexity to your data model. Sometimes CrossFilter() can be a substitute for a bidirectional relationship. Verifying that any bidirectional relationships are set up correctly is critical to model validation. (Tip: According to the BiDi whitepaper, a relationship with a Date table should always be one-to-many, never bidirectional.)
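For instance, rather than making a relationship bidirectional model-wide, CROSSFILTER() can enable both filter directions inside a single measure (the table and column names below are hypothetical):

```
Customers With Sales =
CALCULATE (
    DISTINCTCOUNT ( Customer[CustomerKey] ),
    CROSSFILTER ( Sales[CustomerKey], Customer[CustomerKey], BOTH )
)
```

This limits the bidirectional behavior to the one calculation that needs it, leaving the model's default filter direction intact.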

Validate and Use Inactive Relationships Purposefully

Relationships pane

Inactive relationships in the data model should exist only when they have purposefully been set up that way, in which case the relationship is invoked via a DAX calculation. Verifying that any inactive relationships are set up correctly is critical to model validation.

Create Synonyms to Improve Q&A Experience

Relationships pane > Modeling > Synonyms

If report consumers also use Q&A functionality for natural language querying (ex: Sales by Year by Division), the thoughtful creation of synonyms improves the user experience significantly so that users do not always have to input field names exactly. For example, perhaps the terms Division and Business Unit are interchangeable at your organization. Or, perhaps there are common acronyms users might input into Q&A that need to be translated to full field names.

Assume Referential Integrity When Possible

Home > Manage Relationships

If you are utilizing data in DirectQuery mode from a reliable and clean data source (such as a data warehouse), you can improve performance by assuming referential integrity.

 

Fields, Columns, and Measures:

To Do

Where

Why

Create Unique Field Names Across the Entire Dataset

Data pane

Although the Power BI software permits columns to exist which are named the same across tables, that is a poor practice to allow in a data model. Let’s say that there is a Sales Rep table and a Sales Manager table, and both have a Sales Region column. Both Sales Regions should be renamed to be Sales Rep Region, and Sales Manager Region for clarity (or, even better, renamed in an underlying database view where Power BI retrieves the data). The general rule to ensure report labels are easy to understand: the name assigned to each column needs to be self-explanatory on its own rather than relying on the context of its table.

Expose a Field Only Once in the Entire Dataset

Data pane

Sometimes a column exists in the data model more than once. The ideal choice is for the redundant column to not be imported at all to the dataset. If that isn’t a viable choice for whatever reason, then make sure the “extra” copy of the column is hidden from report view. For instance, if a Sales Header ID exists in both the Header and the Lines tables – make sure it’s hidden in the Lines table and only shown once in the data model. This way, users won’t accidentally use both columns. Using both could result in possibly needing to use two separate filters which would provide a very poor user experience.

Hide Fields Not Utilized for Reporting

Data pane

IDs and surrogate keys are needed for relationships, but are not useful for reporting. Hiding them simplifies the data model because there are fewer fields shown in the field list. The consumers of your data model will appreciate the lack of what they’d view as clutter. (Reminder: if a column is not needed for something – a relationship, basis for a calculation, sorting, something – then don’t import it at all.)

Use Friendly Names for Fields

Data pane

A field such as “Student Name” is nicer to look at for reporting than something like “Stdt_nm” which may be how it’s stored in the source. Since field names impact the naming of column titles, filters, and slicers, assigning friendly names with business relevance is well worth a bit of time investment. Just be consistent with respect to spaces, casing, and abbreviations (such as Desc or Nbr). Try to only use acronyms if you are certain all users know the meaning.

Set Formatting for All Numeric and Date Fields

Modeling > Formatting

It’s no fun to add a field onto a report that needs to be reformatted every single time. Defining units as whole numbers, or amounts as currency, is a helpful timesaver on the reporting end. Don’t forget to also set the thousands separator to assist with readability.

Specify Sorting for Columns

Modeling > Sort

Creating a default sort order is a common need for certain types of columns. For example, you may need to sort the Month Name column by the Month Number column. Oftentimes the column serving as the sort value can be hidden from report view.

Set Default Summarization for All Numeric Columns

Modeling > Properties > Default Summarization

The summarization default is sum, but some numeric columns are more suitable as count, min, or max. For example, high temperature per day would never be summed for a meaningful value; average is likely a better choice. Setting this properly allows subtotals and totals to be presented properly. Summarization for columns that aren’t truly quantitative, such as Customer Number or Year Number, should be set to “don’t summarize.” More details: Why the Default Summarization Property in Power BI is So Important.

Create Useful Calculated Columns

Modeling > Calculations > New Column

Creation of calculated columns (aka derived fields) is vital for enrichment of the data model. A very simple example of this is names – perhaps the underlying data source keeps First Name and Last Name in separate columns; you may wish to derive a Full Name field for reporting purposes which concatenates them together. Note: there are times when it's preferable to derive columns in the Query Editor instead of DAX: (a) to take advantage of query folding, (b) to remove intermediary columns from being loaded to the model, (c) to achieve faster processing time since DAX columns compute sequentially, or (d) to potentially achieve a better compression rate. Marco Russo discussed the last two items in this post. For complex calculations in a larger model, test the difference in performance.
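Continuing the name example (the table and column names below are hypothetical), a simple derived column in DAX might be:

```
Full Name = Customer[First Name] & " " & Customer[Last Name]
```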

Create Useful Calculated Measures

Modeling > Calculations > New Measure

Calculated measures (aka explicit measures) are extremely useful to augment reporting and analysis. Year-to-date and year-over-year are common examples of calculated measures which rely upon current context for calculation at runtime. Typically a model starts with common calculations, and continues to evolve over time as business users identify additional relevant calculations.
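A common year-to-date measure, assuming a hypothetical Sales table and a marked Date table, might look like:

```
Sales YTD =
TOTALYTD ( SUM ( Sales[Sales Amount] ), 'Date'[Date] )
```

TOTALYTD evaluates the expression over the year-to-date range implied by the current filter context, so the same measure works at day, month, or quarter grain.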

Decide on Location for Calculated Measures

Modeling > Properties > Home Table

Some Power BI practitioners prefer to locate all calculated measures in a specific table just for measures. However, this results in calculated columns and calculated measures residing in different tables which might seem confusing to users. If you do choose to use a “home table” for all measures, or certain critical measures, focus on organization to help users find them easily.

Consolidate Intermediary Calculated Columns

N/A

Frequently it is easier to understand and test a complicated calculated column in a series of steps that build upon each other. The tradeoff for this understandability is additional size and memory usage. Therefore, you may consider consolidating calculated columns once testing is complete rather than leaving the separate steps permanently in the model. If they remain, make certain that any intermediary calculated columns are hidden from report view. Alternatively, you could handle adding new columns in the Query Editor (instead of DAX), and delete the intermediary columns before the last step in the query.

Decide on Use of Implicit Measures

N/A

An implicit measure utilizes the default summarization setting such as “Sum of Sales Amount.” Some Power BI practitioners prefer to use explicit calculated measures only, in which case base numeric columns are hidden. Like many other things, consistency here is what matters.

Set the Data Category

Modeling > Properties > Data Category

Address-related columns, such as city/state/zip and so forth, need to be specified in the data model to ensure geocoding can occur properly. Typically, Power BI is very good about detecting data categories from the data, but the data category for addresses, URLs and barcodes needs to be verified.

Create Columns to Support Query String Parameters

Modeling > Calculations > New Column

There are some rules for using URL query string parameters, such as no spaces allowed in the name. Therefore, there may be circumstances when you need to create some specific columns for use with query string parameters. In this case, you'll likely want to name them something similar to 'pStateName' and ensure they are hidden from report view.
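As a sketch (the table and column names below are hypothetical), a space-free column suitable for URL query string filtering could be derived like so:

```
pStateName = SUBSTITUTE ( Geography[State Name], " ", "" )
```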

 

Other Dataset Items:

To Do

Where

Why

Ensure a ‘Data As Of Date’ Exists in the Dataset

Modeling > Calculations > New Column

It is common to display a ‘Data As Of’ date on reports. This should exist in the data model, based on the data itself and not reliant on the current datetime. The objective is for report consumers to understand the effective date of the data being displayed. As of dates tend to be most significant for certain types of data such as sales quotas, or during key periods such as month-end.
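A simple 'data as of' calculation based on the data itself, rather than the current datetime (the table and column names below are hypothetical):

```
Data As Of Date = MAX ( Sales[Transaction Date] )
```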

Create Hierarchies for Usability

Field List > New Hierarchy

Date columns (such as Year>Quarter>Month) are a common scenario for use of a hierarchy and should be present in most models. Geography columns (such as Country>State>City) are great candidates as well. After a column has been added to a hierarchy, then it’s up to you whether the individual columns should still be visible and available for reporting (sometimes users find that seeing both the individual columns & the hierarchy columns to be confusing).

Set Up Row-Level Security

Modeling > Security > Manage Roles

Row-level security is useful to specify which rows of data certain users are permitted to view. There are several techniques for how to accomplish RLS. (Tip: if you find you are repeating work in numerous PBIX files to set RLS, that means it may be time to look at using Analysis Services instead.)

 

A Few Common DAX Best Practices:

This is a very short list of good DAX habits – certainly not intended to be a complete reference.

To Do

Where

Why

Use a Naming Convention When Referencing Measures and Columns

Modeling > Calculations

When referencing a measure in a DAX calculation, do not include the table name. Conversely, when referencing a column in a DAX calculation, always include the table name as a prefix in the format of 'table'[Column]. This commonly used best practice in the Power BI community helps you immediately identify whether a reference is a column or a measure, which is especially helpful when troubleshooting context and calculation results. If you need convincing, watch this 2017 PASS Summit video from Marco Russo.
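The convention in practice (all names below are hypothetical; [Total Margin] is assumed to be another measure in the model):

```
Total Sales = SUM ( Sales[Sales Amount] )  // column reference: table prefix included
Margin % = DIVIDE ( [Total Margin], [Total Sales] )  // measure references: no table prefix
```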

Apply the Most Selective Condition on Inner Functions

Modeling > Calculations

If a DAX calculation is nested, the innermost function is evaluated first. For performance optimization purposes, the most selective condition should typically be nested to limit data as early as possible.

Use Error Messaging in DAX

Modeling > Calculations

The Error() function provides a description if a calculation result cannot be rendered, which can go a long way toward preventing user frustration should something go wrong with the report display. Error messaging is particularly important when displaying data which is manually maintained.
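A sketch of Error() inside an IF (the measure names and message below are hypothetical):

```
Converted Amount =
IF (
    ISBLANK ( [Exchange Rate] ),
    ERROR ( "No exchange rate is available for the selected period." ),
    [Sales Amount] * [Exchange Rate]
)
```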

Consider Using the Divide() Function

Modeling > Calculations

If you utilize the Divide() function, instead of Column A / Column B, you don’t have to worry about nulls or coding around divide-by-zero errors. This standard practice prevents invalid numbers being displayed on reports (which in turn minimizes users being tempted to download to Excel just to ‘clean up’ the report output). In a typical data model, use of Divide() will be fine. However, if you have complex calculations and/or a complex data model, check out Alberto Ferrari's post about Divide() performance.
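For example (table and column names hypothetical), instead of dividing the columns directly:

```
Average Unit Price =
DIVIDE ( SUM ( Sales[Sales Amount] ), SUM ( Sales[Quantity] ) )
```

Divide() returns blank on a zero or blank denominator; an optional third argument can supply an alternate result if blank isn't desired.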

Use Variables as a Common Practice

Modeling > Calculations

Use of variables in DAX is becoming a best practice, as it can make the syntax shorter and easier to read. Variables can improve performance because they are “lazy” meaning they evaluate only once and won’t be evaluated if not used.
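A sketch of a variable-based measure (assuming hypothetical [Total Sales] measure and a marked Date table):

```
Sales YoY % =
VAR CurrentSales = [Total Sales]
VAR PriorSales =
    CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
RETURN
    DIVIDE ( CurrentSales - PriorSales, PriorSales )
```

Each variable is named, computed once, and reused in the RETURN expression, which is both faster and easier to read than repeating the CALCULATE inline.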

 

Documentation Within a PBIX:

Last, but certainly not least!

To Do

Where

Why

Solution Documentation

Report > New Page

You may consider using the first or last report page as a place to include helpful tips, definitions, or an introduction to the solution. It can also link to additional information, identify who to contact with questions, and/or who the owner of the solution is.

Descriptions for Fields

Field List > Properties > Description

A description can be immensely helpful for users to understand the contents of a column or measure, when and how to use it. When a description has been defined, it is shown in the field list as a tooltip when the mouse hovers on the field name. (Tip: if you find you are repeating work in lots of different PBIX files to input descriptions, that means it may be time to look at using Analysis Services instead.)

Comments in DAX Calculations

Modeling > Calculations

Comments in DAX (using // for single-line comments or /* */ for block comments) can be very helpful to describe what’s happening in a calculation. This can help others understand the calculation, or leave yourself a reminder.
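Both comment styles in a hypothetical measure:

```
Total Sales =
// Single-line comment: base revenue measure
/* Block comment:
   returns are excluded upstream in the source view */
SUM ( Sales[Sales Amount] )
```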

Comments Within M Scripts

Home > Edit Queries > View > Advanced Editor --or-- Home > Edit Queries > Applied Steps > Properties > Description

Comments in the M script (using // for single-line comments or /* */ for block comments) can be very helpful to describe the data transformations which are occurring. Comments can be created one of two ways, as Chris Webb describes in his blog post.
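Both comment styles in a hypothetical query (server, database, table, and column names are illustrative only):

```
let
    // Single-line comment: connect to the source database
    Source = Sql.Database(ServerName, "SalesDW"),
    Customers = Source{[Schema = "dbo", Item = "Customer"]}[Data],
    /* Block comment: remove columns
       not needed in the data model */
    Result = Table.RemoveColumns(Customers, {"RowAuditID"})
in
    Result
```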

Description for Query

Home > Edit Queries > All Properties > Description

If a query is doing something unusual, or is serving as an intermediary query, it is very helpful to include a description of the purpose for the query.

Naming of Query Steps

Home > Edit Queries > Query Settings > Applied Steps

Steps in the Query Editor are auto-named, which is sufficient for most needs. If you are doing a lot of data transformations, or something complex, it can be helpful to rename the query steps to describe what’s happening at a particular stage.

Getting Started with Azure

I've been asked numerous times where to start when it comes to Azure. Getting a handle on Azure concepts, terminology, and features can be more than a little intimidating. Not only is Azure an extremely large platform, the pace of change is *very* fast. This post took me quite a while to figure out how to write, because where to start depends so much on which services you're responsible for and what your role is. Each individual service in Azure has its own set of gotchas and lessons learned. So, this post will stay at the 1,000-foot level.

Before we dive in further, I do want to offer a few thoughts and suggestions:

  • Even though you can create a resource quickly and easily, that doesn't mean that it's enterprise-ready immediately. By all means, save time getting started. Just don't forget that getting everything working securely with sound and repeatable practices is often counter to the "deploy in 10 minutes" goal for the cloud that we hear a lot. So, plan to spend time learning how things work more deeply if you're building anything besides a test/sandbox environment.
  • Always build a proof of concept if you're developing on an unfamiliar resource. There will be something you expect it to do that it won't do. New features and functionality are always coming out, but everything these days begins as an MVP: minimum viable product. It can take months for a new service to mature. So, do yourself a favor and POC pretty much everything as normal practice.
  • Try to keep up with Azure announcements for the resources you work with. The pace of change is crazy fast and it's really difficult to keep things straight. However, you need to try because, in addition to new functionality, behavior changes occur and features get deprecated as resources evolve and mature.
  • Although it's tempting to start with a virtual machine solution (IaaS) because it's the most familiar, give PaaS (platform as a service) a fair shot because it's really the future and where significant investments are being made by Microsoft.
  • Objects in an Azure subscription can turn into a mess really fast if there are no standards. How dev/test/prod will be separated, and how resources will be arranged into resource groups, are two of the most critical decisions you will make. Naming conventions are also very important (and yes, I know about the pets and the cattle -- I still like my naming conventions).
  • Choosing a location for services is an extremely important decision. You want to select a geographic location based on data storage, proximity to users, co-location of related resources, minimizing latency, backup and recovery needs, service availability in a region, and minimizing data egress charges (data coming into a region is free whereas data leaving a region is not -- data egress charges are small but can add up to a surprise bill if a decent volume of data is being transmitted).
  • The cloud is not *always* cheaper, and the cloud is not *always* easier. Often both are true (especially for sandbox and POCs), but don't take it for granted that both are always true for an enterprise solution. It really depends. 
  • Developers and administrators have an immense impact on cost in a cloud computing environment. Time spent learning cost savings techniques is well worthwhile.
  • ARM (Azure Resource Manager) templates and PowerShell are both your friend when it comes to Azure productivity. However, I wouldn't get hung up there if you are brand new to Azure. It's ok to use the portal to create and modify items in Dev and/or while you are learning. You can evolve to use automation techniques as soon as you're ready and/or when things need to be repeatable.
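On that last point, here is a small sketch of what "graduating" from the portal to PowerShell can look like, using the same AzureRm cmdlets as the ADF scripts earlier in this post (the subscription, resource group, and storage account names are placeholders -- this requires an Azure login, so it can't run standalone):

```
#-----------------------------------------
#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$location = 'East US'
#-----------------------------------------
#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName
#-----------------------------------------
#Create a resource group, then a storage account inside it
New-AzureRmResourceGroup -Name $resourceGroupName -Location $location

New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName `
     -Name 'yourstorageacct01' `
     -Location $location `
     -SkuName 'Standard_LRS'
```

Once a pattern like this is scripted, re-creating the same resources in test or prod becomes repeatable instead of a series of portal clicks.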

A number of the above thoughts are part of a presentation I've delivered recently. You can find those slides here: Tips for Getting Started with the Azure Data Platform

What is Azure?

Azure offers infrastructure and services for customers (us) to build and deploy software and services through a global network of Microsoft-managed data centers. Although many individual aspects will feel familiar, there can be different design patterns and best practices in cloud and hybrid environments. Here is an overview of the types of resources we can provision in Azure:

The Azure Interactives site (shown above) gives a short snippet as to what each service is, and links to where to find more information. It's a good way to get a quick sense of what is there.

Azure Terms & Definitions

Here are a few of the most important basic building blocks to know related to Azure:

AzureTerminology.png

Subscription: A billing relationship. Each subscription has a directory; a directory can trust multiple subscriptions. Subscriptions have documented limits and quotas. Subscriptions also represent a security boundary.

Tenant (aka Directory): The AAD (Azure Active Directory) service which handles identity management. A global administrator in Azure is also a global administrator in Office 365.

ARM: The Azure Resource Manager (ARM) framework which allows us to deploy and manage resources in Azure. The various service management APIs behind Visual Studio, command line, and the Azure portal all share the same underlying ARM layer. Checking ARM scripts into source control is referred to as 'Infrastructure as Code.' By default, ARM scripts generated by the Azure portal encompass an entire resource group (the exception being Resource Explorer).
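To make 'Infrastructure as Code' a bit more concrete, here is a minimal sketch of an ARM template that deploys a single storage account (the account name is a placeholder; a real template would typically parameterize it):

```
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "yourstorageacct01",
      "apiVersion": "2017-10-01",
      "location": "East US",
      "sku": { "name": "Standard_LRS" },
      "kind": "Storage",
      "properties": {}
    }
  ]
}
```

A template like this is what you would check into source control, then deploy with New-AzureRmResourceGroupDeployment (pointing at the file with -TemplateFile) to get repeatable environments.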

Resource Group: A logical container for resources in Azure. Typically, one resource group contains resources with the same lifecycle which should be managed, secured, and deployed together as a unit (for instance, an Azure Virtual Machine and its related Azure Storage Accounts). Resource groups are important for organizing resources, and can also serve as a security boundary (particularly for decentralized administration purposes).

Resource: Resources are the individual services in Azure, such as Azure SQL Database, Azure Virtual Machine, or an Azure Storage account. There is a wide variety of resources available in the Azure platform (as shown in the first screenshot near the top of this post). A resource must be assigned to a resource group.

Resource Provider: Registering a provider allows your subscription to work with a specific type of resource. Many commonly used resources are registered by default, but not all.
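Checking and registering a provider is a quick PowerShell one-liner each. As a sketch (using the same Microsoft.DataFactory provider registered at the top of this post; this requires an Azure login):

```
#See which providers are currently registered in the subscription
Get-AzureRmResourceProvider | SELECT ProviderNamespace, RegistrationState

#One-time registration of a provider for this subscription
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory
```

If a deployment fails with a "subscription not registered to use namespace" type of error, a missing provider registration is usually the culprit.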

And, here are a few more common cloud computing terms:

CloudIaaSPaaSSaaS.png

IaaS: Infrastructure as a Service is a close equivalent to virtual server environments that we've been using for years--just that the servers are in Microsoft's data center instead. Virtual machines are IaaS. With IaaS, the customer is responsible for managing nearly all of the environment (such as the operating system -- though there are a few features that blur the lines between IaaS and PaaS a bit because you can allow Azure to install O/S and database patches for certain VM images).

PaaS: With Platform as a Service, the customer manages much less: usually just the data and the custom application. The cloud provider is responsible for everything else and in fact, the customer cannot (typically) even access the VM directly that the PaaS service is really running on. With PaaS, the customer focuses on building a solution and not on managing the environment. PaaS examples include Azure SQL Database, Azure SQL Data Warehouse, and CosmosDB.

SaaS: With Software as a Service, the focus is much more on consumption of a pre-existing application. You can think of it as "renting an application." SaaS examples include Power BI or Office 365.

Serverless: The term serverless feels like a misnomer -- there are many thousands of servers running behind everything in Azure. From the customer's perspective though, services are "serverless" because the infrastructure is invisible. Azure Functions are considered a serverless technology.

Azure Stack: Although sometimes you hear someone referring to 'the Azure stack' as a general term for all of Azure, what Azure Stack actually refers to is a way to deploy Azure resources in a private data center (as opposed to the public cloud) via an appliance from a hardware vendor. Only certain types of resources are available via Azure Stack.

Your Purpose For Going to the Cloud

Following are some of the most common reasons why cloud computing is attractive, and also some common concerns. I know you're likely most interested in the technical stuff -- the reason why I bring this up is because it's very important for you to really know these objectives. You will continually make tradeoffs during the design and development process, and the tradeoffs you make will depend completely on your goals. By tradeoffs, I mean decisions around:

  • Simplicity vs. control   (Do you want the 'easy' option, or do you prefer customization & control?)
  • Cost vs. features available  (Do you really need feature xyz even if it costs a lot more?)
  • Cost vs. complexity  (Are you willing to manage something on your own if it's a lot cheaper?)
  • etc...
AzureCommonProsAndConcerns.png

The remainder of this post consists of URLs which may be helpful to you, depending on your starting point and what you're trying to achieve.

Important Azure URLs

Introductory Azure Information

Azure Reference Materials

Pre-Built Azure Templates and QuickStart Solutions

Azure Security and Compliance

How to Get an Azure Subscription

You Might Also Like...

Presentation materials: Tips for Getting Started with the Azure Data Platform

Naming Conventions in Azure

Overview of Azure Automation

Setting Up a PC for Cortana Intelligence Suite Development

Why Some Azure VM Sizes are Unavailable when Resizing in the Portal

Deciding on Encryption at Rest for an Azure Virtual Machine