When working on big data systems, it can be very helpful to include file properties and other metadata directly within the data results. Capturing data lineage comes in handy, especially when reconciling or troubleshooting issues (for instance, if retry logic kicked in somewhere in the data stream and you now have duplicate rows to handle).
I just learned that U-SQL has some new syntax which supports the following file properties:
- URI (uniform resource identifier)
- Modified date
- Created date
- Length (file size in bytes)
In the following example, I'm using U-SQL (Azure Data Lake Analytics) to iterate over files that sit in date-partitioned subfolders under Raw Data within Azure Data Lake Store. The new file properties are part of the schema-on-read definition of the source files (aka the extract statement), and are shown in yellow:
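For anyone who prefers text they can copy and paste, here is a minimal sketch of what an extract statement along these lines can look like. The folder path, data columns, column aliases, and output location are hypothetical placeholders, not my actual environment. The `FILE.URI()`, `FILE.MODIFIED()`, `FILE.CREATED()`, and `FILE.LENGTH()` calls follow the syntax as I understand it from the release notes, so double-check them against your runtime version:

```usql
// Hypothetical layout: /RawData/yyyy/MM/dd/<file name>.csv
DECLARE @inputPath string = "/RawData/{FileDate:yyyy}/{FileDate:MM}/{FileDate:dd}/{FileName}.csv";

@rawData =
    EXTRACT
        // Hypothetical data columns read from inside each file
        CustomerId int,
        OrderAmount decimal,

        // Virtual columns populated from the file set pattern above
        FileDate DateTime,
        FileName string,

        // File property columns (aliases are my own naming)
        FileUri = FILE.URI(),
        FileModifiedDate = FILE.MODIFIED(),
        FileCreatedDate = FILE.CREATED(),
        FileLength = FILE.LENGTH()
    FROM @inputPath
    USING Extractors.Csv(skipFirstNRows: 1);

// Hypothetical output location
OUTPUT @rawData
TO "/CuratedData/RawDataWithFileProperties.csv"
USING Outputters.Csv(outputHeader: true);
```

In this sketch, `FileDate` and `FileName` are virtual columns derived from the folder and file names in the path pattern, while the four file property columns come from the file itself. With both in place, every row carries where it came from and when that file was created or last modified.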
The output for the virtual columns looks like this:
You can find more info about this in the release notes on GitHub.
Like This Content?
If you are integrating data between Azure services, you might be interested in an all-day session Meagan Longoria and I are presenting at PASS Summit in November. It's called "Designing Modern Data and Analytics Solutions in Azure." Check out info here: http://www.pass.org/summit/2018/Sessions/Details.aspx?sid=78885