Querying Multi-Structured JSON Files with U-SQL in Azure Data Lake

A while back I posted about this same topic using CosmosDB, for handling situations when the data structure varies from file to file. This new post uses the same example data file, but this time we're using U-SQL in Azure Data Lake instead. This technique is important because reporting tools frequently need a standard, predictable structure. In a project I'm currently doing at work, conceptually what is happening is this:

USQLStandardizeFromJSONtoCSV.png

In this scenario, the raw JSON files can differ in format from file to file, but the output format is consistent, so analytics can be performed on the data much, much more easily.

Here is a (highly simplified!) example which shows how I have a PrevVal in one JSON document, but not the other:

DocumentDBSimpleExampleOfVaryingFormat.png
 

Handling JSON Format in Azure Data Lake

Handling the varying formats in U-SQL involves a few steps if it's the first time you've done this:

  1. Upload custom JSON assemblies  [one time setup]
  2. Create a database   [one time setup]
  3. Register custom JSON assemblies   [one time setup]
  4. Upload JSON file to Azure Data Lake Store [manual step as an example--usually automated]
  5. Run U-SQL script to "standardize" the JSON file(s) into a consistent CSV column/row format

Step 1: Upload Custom JSON Assemblies to Azure Data Lake Store

Currently the JSON extractor isn't built into Azure Data Lake Analytics, but it is available on GitHub, and we need to register it ourselves in order to use it. There are two assemblies we are concerned with, and they are available from: https://github.com/Azure/usql/tree/master/Examples

To obtain the custom JSON assemblies:

  • Fork the USQL repository from GitHub.
  • In Visual Studio, open the Microsoft.Analytics.Samples.Formats project (under the usql-master\Examples\DataFormats folder).
  • Build the project in Visual Studio (or from the command line; see the sketch after this list).
  • Locate these two DLLs in the bin folder: Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. They should be in <PathYouSelected>\usql-master\Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug.
  • Make a note of where these two DLLs are - we'll need them in step 3.
  • Make sure these DLLs are added to your source control project, along with the rest of the U-SQL scripts mentioned in the remainder of this post.
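
If you prefer building from the command line rather than the Visual Studio UI, a rough sketch looks like this. It assumes msbuild.exe is on your PATH (e.g., a Developer Command Prompt), that the project file name matches the project name above, and that the solution's NuGet packages have already been restored (for example, by opening the solution once in Visual Studio):

# Assumed path and project file name -- adjust to wherever you extracted the repository.
cd "<PathYouSelected>\usql-master\Examples\DataFormats\Microsoft.Analytics.Samples.Formats"

# Build the Debug configuration; the DLLs land in bin\Debug.
msbuild .\Microsoft.Analytics.Samples.Formats.csproj /p:Configuration=Debug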

Step 2: Create a Database in Azure Data Lake

We want a new database because we need somewhere to register the assemblies, and using Master for this purpose isn't ideal (though it will work for these two assemblies, I've found it doesn't work for another custom assembly I tried -- so I make a habit of not using Master for any user-defined objects). Also, if you end up with other related objects (like stored procedures or tables), they can also go in this database.

CREATE DATABASE IF NOT EXISTS BankingADLDB;

Mine is called 'BankingADLDB' for two reasons: My demo database is about Banking. And, being the naming convention queen that I am, I prefer having 'ADLDB' as part of the name so it's very clear what type of database this is. 

You'll want to run this as a job in Azure Data Lake Analytics (or from within a Visual Studio U-SQL project if you prefer):

USQLCreateDatabase.png
 

Note that I gave the script a relevant job name so that, if I'm looking in the job history later, the scripts are easy to identify.
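
If you'd rather submit the job from PowerShell instead of the portal, a minimal sketch using the AzureRM.DataLakeAnalytics module looks something like this (the account name and script path are placeholders):

Login-AzureRmAccount

# Submit the CREATE DATABASE script as an ADLA job with a descriptive name.
Submit-AzureRmDataLakeAnalyticsJob -Account "myadlaaccount" `
    -Name "Create Banking ADL Database" `
    -ScriptPath "C:\Scripts\CreateBankingADLDB.usql"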

To view the database, use the Data Explorer from Azure Data Lake Analytics (rather than the Store):

ADLACatalog.png
 

Step 3: Register Custom JSON Assemblies in Azure Data Lake

Upload your two DLLs from the bin folder to your desired location in Azure Data Lake Store. You'll need to create a folder for them first. I like to use this path: \Assemblies\JSON in ADLS:

JSONCustomAssembliesInADLS.png
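
For reference, the folder creation and upload can also be done with the AzureRM.DataLakeStore PowerShell module rather than the portal; here's a rough sketch (the account name and local paths are placeholders):

$adlsAccount = "myadlsaccount"

# Create the target folder in ADLS, then upload the two DLLs built in step 1.
New-AzureRmDataLakeStoreItem -Account $adlsAccount -Path "/Assemblies/JSON" -Folder

Import-AzureRmDataLakeStoreItem -Account $adlsAccount `
    -Path "C:\Builds\Newtonsoft.Json.dll" `
    -Destination "/Assemblies/JSON/Newtonsoft.Json.dll"

Import-AzureRmDataLakeStoreItem -Account $adlsAccount `
    -Path "C:\Builds\Microsoft.Analytics.Samples.Formats.dll" `
    -Destination "/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll"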

Now that the files are somewhere Azure Data Lake Analytics (ADLA) can find them, we want to register the assemblies:

USE DATABASE [BankingADLDB];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";

USQLRegisterCustomJSONAssemblies.png

Sidenote: Don't forget to specify relevant security on the new Assemblies folder so your users are able to reference the assemblies when needed.
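
For example, a hypothetical ACL entry granting a user read and execute on the folder could look like this (the account name and Azure AD object ID are placeholders; add -Default if you also want the entry applied to items created in the folder later):

Set-AzureRmDataLakeStoreItemAclEntry -Account "myadlsaccount" `
    -Path "/Assemblies" `
    -AceType User `
    -Id "00000000-0000-0000-0000-000000000000" `
    -Permissions ReadExecute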

Step 4: Upload JSON File to Azure Data Lake Store

Normally this step would be done in an automated fashion. However, for this post we need to manually get the file into the Data Lake Store. To simulate a realistic scenario, I have shown partitioning of the raw data down to the month level:

ADLSRawData.png
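
As a point of reference, manually landing the file with PowerShell would look roughly like this (the account, local path, and file name are placeholders; the year/month folders match the partitioning pattern used by the U-SQL script in step 5):

# Upload the sample JSON file into the month-level folder of the raw data area.
Import-AzureRmDataLakeStoreItem -Account "myadlsaccount" `
    -Path "C:\SampleData\BankingData.json" `
    -Destination "/BankingMachineData/RawData/2017/03/BankingData.json"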

Here is the text for the JSON file contents:

[
  {
    "AID": "Document1",
    "Timestamp": "2017-03-14T20:58:13.3896042Z",
    "Data": {
      "Val": 67,
      "PrevVal": 50,
      "Descr": "ValueA"
    }
  },
  {
    "AID": "Document2",
    "Timestamp": "2017-03-14T20:04:12.9693345Z",
    "Data": {
      "Val": 92,
      "Descr": "ValueB"
    }
  }
]

Step 5: Run U-SQL Script to Standardize JSON Data to a Consistent CSV Format

Finally! We're past the setup and have arrived at the good part.

I call this "StandardizedData" in my implementation, because I'm really just taking the RawData and changing its format into standardized, consistent columns and rows. (In my real project, I do have another folder structure for CuratedData, where the data is truly enhanced in some way.)

USQLStandardizeJSONtoCSV.png

Here's what the CSV output file looks like in the web preview - each row has a consistent number of columns, which was the objective:

USQLOutputFile.png
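
Based on the two sample documents above, the output should look roughly like the following (the exact timestamp formatting depends on how the Csv outputter serializes DateTime values). Note how Document2, which has no PrevVal, picks up "0" from the ?? null-coalescing defaults in the script:

"AssignedID","TimestampUtc","DataValue","PreviousDataValue","Description"
"Document1",2017-03-14T20:58:13.3896042Z,"67","50","ValueA"
"Document2",2017-03-14T20:04:12.9693345Z,"92","0","ValueB"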

And that's it. Very easy to do on an ongoing basis once the initial setup is complete. There's also a multi-level JSON extractor posted to GitHub which I haven't needed to use yet. If you have numerous levels of nesting, you will want to look into that extractor.

One reason this approach works well for me: if a new property slips into the raw JSON and you don't know about it, as long as you're keeping the raw data indefinitely you can always re-generate the standardized data. Until the new property is discovered, the script won't error out (that isn't the behavior of every U-SQL extractor, but it does work for this JSON extractor).

Here is the full text of the U-SQL script which is copy/paste friendly:

REFERENCE ASSEMBLY BankingADLDB.[Newtonsoft.Json];
REFERENCE ASSEMBLY BankingADLDB.[Microsoft.Analytics.Samples.Formats]; 

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @InputPath string = "/BankingMachineData/RawData/{date:yyyy}/{date:MM}/{filename}.json";

DECLARE @OutputFile string = "/BankingMachineData/StandardizedData/LogCapture.csv";

@RawData = 
EXTRACT 
 [AID] string
,[Timestamp] DateTime
,[Data] string
,date DateTime // virtual column
,filename string // virtual column
FROM @InputPath
USING new JsonExtractor();

@CreateJSONTuple = 
SELECT 
 [AID] AS AssignedID
,[Timestamp] AS TimestampUtc
,JsonFunctions.JsonTuple([Data]) AS EventData 
FROM @RawData;

@Dataset =
SELECT
AssignedID
,TimestampUtc
,EventData["Val"] ?? "0" AS DataValue
,EventData["PrevVal"] ?? "0" AS PreviousDataValue
,EventData["Descr"] ?? "N/A" AS Description
FROM @CreateJSONTuple;

OUTPUT @Dataset
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);

 

If You Like This Content...

Consider joining me for an all-day session I'm presenting on October 13th, 2017. More info here: Presenting a New Training Class on Architecting a Data Lake  

You Might Also Like...

Handling Row Headers in U-SQL

Querying Documents with Different Structures in Azure CosmosDB

Data Lake Use Cases and Planning Considerations

 

Presenting a New Training Class on Architecting a Data Lake

I'm very excited to be presenting an all-day session on Architecting a Data Lake on Friday, October 13, 2017. It is a pre-conference session as part of the annual SQLSaturday event in Charlotte, NC -- which means the price is a bargain. The early bird price is $125 up to Sept 20th; after that the regular price is $165. You can register here: http://bit.ly/ArchitectingADataLake.

The full title of the session is "Architecting a Data Lake to Modernize Your Data Warehouse." The title has a 'qualifier' about the DW because that's the viewpoint we'll be taking -- i.e., the data lake is augmenting an existing DW/BI/Analytics type of system. Having said that, the vast majority of the information I'll be sharing will apply to any type of data lake implementation (supporting Data Science, IoT, Big Data, and so forth).

DataLakeZones.png

This full-day session will focus on principles for designing and implementing a data lake. There will be a mix of concepts, lessons learned, and technical implementation details. During the session we will build a data lake from the ground up, populate it, organize it, secure it, integrate it with a data warehouse via logical constructs, and query the data. You will leave this session with an understanding of the benefits and challenges of a multi-platform analytics/DW/BI environment, as well as recommendations for how to get started. You will learn:

  • Scenarios and use cases for expanding an analytics/DW/BI environment into a multi-platform environment which includes a data lake
  • Strengths and limitations of a logical data architecture which follows a polyglot persistence strategy
  • Planning considerations for a data lake which supports streaming data as well as batch data processing
  • Methods for organizing a data lake which focuses on optimal data retrieval and data security
  • Techniques for speeding up development and refining user requirements via data virtualization and federated query approaches
  • Benefits and challenges of schema-on-read vs. schema-on-write approaches for data integration and on-demand querying needs
  • Deciding between Azure Blob Storage vs. Azure Data Lake Store vs. a relational platform

Specific technologies discussed and/or demonstrated in this session include Azure Data Lake Store, Azure Data Lake Analytics, Azure SQL Data Warehouse, Azure Blob Storage, SQL Server, PolyBase, and U-SQL:

AzureDataLakeIntegration.png

If you have an Azure account, you will be able to follow along during the demonstrations if you'd like. Demo scripts will be provided with the course materials.

SQLSaturday is a free (with optional $10 lunch) training day run by members of the local community. There are usually 250-350 people at the annual Charlotte SQLSaturday, so it's a lot of fun as well as an excellent chance to get your learn on. The full schedule for Saturday can be found here: http://www.sqlsaturday.com/683/Sessions/Schedule.aspx

I hope you can join me for the Friday pre-conference session on Architecting a Data Lake. That same day, Leila Etaati is also giving a pre-conference session on Advanced Analytics with R, Microsoft SQL Server, Power BI, and Azure ML - it will also be a great session if that suits your needs a little better.

If you have any questions, you can contact me via the form on my About page (scroll down to find the form).

You Might Also Like...

Data Lake Use Cases and Planning Considerations

Querying Multi-Structured JSON Files with U-SQL in Azure Data Lake

Handling Row Headers in U-SQL

This is a quick tip about syntax for handling row headers in U-SQL, the data processing language of Azure Data Lake Analytics. There are two components: handling row headers on the source data which is being queried, and row headers on the dataset being generated by ADLA.

Telling the extractor that row headers are present on the first row of the source data (so that it skips them):

USING Extractors.Csv(skipFirstNRows:1)
 

Outputting row headers on row 1 of the dataset being generated:

USING Outputters.Csv(outputHeader:true);

   
Here is a full U-SQL example which includes both:

DECLARE @inputPath string = "/RawData/{date:yyyy}/{date:MM}/{filename:*}.csv";
DECLARE @outputPath string = "/CuratedData/POC.csv";

@data = 
EXTRACT
InstanceID string
,TransactionID string
,TimestampUtc string
,ClassID string
,ClassName string
,CurrentValue string
,date DateTime //virtual column
,filename string //virtual column 
FROM @inputPath 
USING Extractors.Csv(skipFirstNRows:1);

@result =
SELECT 
 InstanceID AS InstanceID
,TransactionID AS TransactionID
,TimestampUtc AS TimestampUTC
,ClassID + "-" + ClassName AS Class
,CurrentValue AS CurrentValue
,date AS TransactionDate
,filename AS SourceFileName
,1 AS NbrOfTransactions
FROM @data
WHERE ClassName == "EX1";

OUTPUT @result 
TO @outputPath 
USING Outputters.Csv(outputHeader:true,quoting:true);


As a reminder: U-SQL is a batch-oriented language which requires its output to be written to a destination file. It's not intended to be an ad hoc query language at the time of this writing.

If You Like This Content...

Consider joining me for an all-day session I'm presenting on October 13th, 2017. More info here: Presenting a New Training Class on Architecting a Data Lake  

You Might Also Like...

Data Lake Use Cases and Planning Considerations

Querying Multi-Structured JSON Files with U-SQL in Azure Data Lake

PowerShell for Assigning and Querying Tags in Azure

Tags in Azure are useful pieces of metadata for documenting (annotating) things such as:

  • Billing or cost center categories (ex: general ledger code)
  • Environment names (ex: Dev, Test, Prod, Sandbox)
  • Project or system
  • Purpose or application
  • Team, group, department, or business unit
  • Who owns or supports the resource
  • Release or version numbers (ex: for testing infrastructure)
  • Archival date (ex: if infrastructure is only needed temporarily)
  • Who initially created the resource
  • Which customer a resource applies to (ex: for an ISV)
  • Service level agreement
  • Patching or maintenance window
  • etc…

Tags are free-form key/value pairs, so they can be used for tracking anything you find helpful. Tags are particularly useful for breaking down invoicing costs. For instance, rather than seeing the entire cost for certain resources (like VMs or storage) in a resource group, tags allow you to subdivide the resource costs further, or to group costs in another way across resource groups. Here's what the tags look like when you download usage (the new V2 format) for your subscription:

You can assign tags to resource groups, as well as to individual resources which support Azure Resource Manager. The individual resources do not automatically inherit tags from the parent resource group. A maximum of 15 key/value pairs can be assigned to each resource or resource group (though you could store concatenated values or embedded JSON in a single tag value as a workaround). You may want to assign tags at just the resource group level and use custom queries to "inherit" them at the resource level. Alternatively, you may want to assign tags to the individual resources directly, particularly if you want to see them clearly on the standard "download usage" billing report.

Since the key/value pairs are just free-form text, watch out for uniformity issues. To improve consistency, you can use policies to require tags and/or apply defaults if you'd like (for example, you might want to enforce a "Created By" tag; a rough sketch of such a policy follows below). Tags can be set in the ARM template when you initially deploy a resource (which is best, so that no billing occurs without proper tagging), or applied afterwards to existing resources via the portal, PowerShell, or the CLI.
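
As an illustrative sketch only (the tag name, definition name, and scope are placeholders, not from a real implementation), a policy that denies new resources missing a "createdBy" tag could be defined and assigned like this:

# Policy rule: deny the request if the 'createdBy' tag does not exist.
$policyRule = '{
    "if":   { "field": "tags.createdBy", "exists": "false" },
    "then": { "effect": "deny" }
}'

$definition = New-AzureRmPolicyDefinition -Name "require-createdBy-tag" `
    -DisplayName "Require createdBy tag" `
    -Policy $policyRule

# Assign the policy at the resource group scope (a subscription scope also works).
New-AzureRmPolicyAssignment -Name "require-createdBy-tag-rg" `
    -PolicyDefinition $definition `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/InternalReportingRGDev"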

The three tags I'm currently using in an implementation are Billing Category, Environment Type, and Support Contact:

 

The above screen shot shows setting tags within the portal. If you have more than a handful of resources, that won't be efficient at all. Following are a few PowerShell scripts to help with setting tags.

Assign Tags to a Resource

This script will *overwrite* any and all tags previously assigned to one resource.

$resourceGroupName = 'InternalReportingRGDev'
$resourceName = 'bisqlvm1datastdstrgdev'

$azureResourceInfo = Find-AzureRmResource -ResourceGroupNameEquals $resourceGroupName -ResourceNameEquals $resourceName 

Set-AzureRmResource -Tag @{ billingCategory="Internal Analytics"; supportContact="Analytics Team"; environmentType="Dev" } -ResourceName $resourceName -ResourceType $azureResourceInfo.ResourceType -ResourceGroupName $resourceGroupName -Force 

Assign Tags to a Resource Group

This script will *overwrite* any and all tags previously assigned to one resource group.

$resourceGroupName = 'InternalReportingRGDev'

$azureRGInfo = Get-AzureRmResourceGroup -Name $resourceGroupName

Set-AzureRmResourceGroup -Id $azureRGInfo.ResourceId -Tag @{ billingCategory="Internal Analytics"; supportContact="Analytics Team"; environmentType="Dev" } 

Assign Tags to All Resources Within a Resource Group (Inherited from the RG)

This script will *overwrite* any and all tags previously assigned to one resource.

$resourceGroupName = 'InternalReportingRGDev'

$azureRGInfo = Get-AzureRmResourceGroup -Name $resourceGroupName
foreach ($item in $azureRGInfo) 
{
Find-AzureRmResource -ResourceGroupNameEquals $item.ResourceGroupName | ForEach-Object {Set-AzureRmResource -ResourceId $PSItem.ResourceId -Tag $item.Tags -Force } 
}

Add an Additional Tag to a Resource Group

This script adds a new tag and preserves existing tags for one resource group. It only accepts new tags (i.e., it will error out if you repeat existing tags).

$resourceGroupName = 'InternalReportingRGDev'

$azureRGTags = (Get-AzureRmResourceGroup -Name $resourceGroupName).Tags

$azureRGTags+= @{ billingCategory345="Internal Analytics" }

Set-AzureRmResourceGroup -Tag $azureRGTags -Name $resourceGroupName
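
If you'd rather skip the add when the tag name already exists (to avoid the error mentioned above), a variation along these lines should work, assuming the Tags property is returned as a hashtable as in the script above:

$resourceGroupName = 'InternalReportingRGDev'

$azureRGTags = (Get-AzureRmResourceGroup -Name $resourceGroupName).Tags
if ($null -eq $azureRGTags) { $azureRGTags = @{} }   # resource group has no tags yet

# Only add the tag if that key isn't already present.
if (-not $azureRGTags.ContainsKey('billingCategory345'))
{
    $azureRGTags += @{ billingCategory345 = "Internal Analytics" }
    Set-AzureRmResourceGroup -Tag $azureRGTags -Name $resourceGroupName
}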

 

Query to Get List of Tag Names & Values for a Specific Resource

This checks for the tags assigned to one resource.

$resourceGroupName = 'InternalReportingRGDev'
$resourceName = 'bisqlvm1datastdstrgdev'

(Find-AzureRmResource -ResourceGroupNameEquals $resourceGroupName -ResourceNameEquals $resourceName).Tags

Output is a list of each Name/Value pair which has been assigned:

 

Query to Get List of Resource Groups With a Specific Tag Value Assigned

(Find-AzureRmResourceGroup -Tag @{ billingCategory="Internal Analytics" }).Name 

Output is a list of resource groups which have been assigned that tag name and value.

 

Query to Get List of Resources With a Specific Tag Value Assigned

(Find-AzureRmResource -Tag @{ environmentType="Dev" }).Name 

Output is a list of resources which have been assigned that tag name and value.

 

Query to Get List of Resources With a Tag Set Based on Tag Name

(Find-AzureRmResource -TagName 'billingCategory').Name 

Output is a list of which resources have a specific tag assigned (regardless of the tag's value).

 

You Might Also Like...

Naming Conventions in Azure

Setting Up Disk Encryption for a Virtual Machine with PowerShell