Defining the Components of a Modern Data Warehouse

As I put together a new presentation on my current favorite topic (modern data warehousing), it occurred to me that others might feel like there's some confusion and/or overlap with terminology. Some terms are somewhat fuzzy and mean different things within different organizations, so here's my best effort at a glossary of the components within a Modern Data Warehouse.

In alphabetical order:

Advanced Analytics

The term advanced analytics is a broad phrase for sophisticated statistical techniques to find patterns in the data for the purpose of predictions, recommendations, optimizations, and descriptions of information. It can include subcategories such as predictive analytics, prescriptive analytics, operational analytics, descriptive analytics, and so forth. Advanced analytics can be employed for use cases such as fraud detection, customer segmentation, assessing credit risk, or predictions such as student dropouts, customer churn, or hospital readmissions.


In general, use of the term analytics has evolved over time. It used to imply the usage of statistical methods, such as machine learning and data mining. More recently the term analytics has evolved to be commonly used to describe finding meaningful patterns in data.

Analytics Sandbox

An analytics sandbox is an exploratory environment which a knowledgeable analyst or data scientist controls. In this ungoverned (or less governed) personal environment, an analyst can move very quickly with usage of preferred tools and techniques. The advantage of an analytics sandbox is agility for purposes of prototyping and/or data exploration. Sandbox solutions may be productionized and moved into the governed DW/BI/analytics environment.

Big Data

The term 'big data' is terribly overused, and used in different ways. One meaning refers to sheer size of data volumes. Another meaning is associated with multi-structured data (the combination of unstructured, semi-structured, and structured) of varying data types. Further, others mean it to imply analysis of data in new and interesting ways. Low latency data (the velocity portion of the 3 V's of volume, variety, and velocity) is also attributed to big data.

Bimodal BI

Originally coined by Gartner, Bimodal BI refers to two modes for development and delivery of information. One mode is focused on agility: business-driven, rapid delivery which values the freedom of exploration and speed. The second mode focuses on traditional IT-driven processes which value reliability, governance, standardization, and security. Having groups of knowledgeable staff each devoted to their own mode, while working collaboratively with each other, is optimal for achieving a good balance around delivery of BI. 

Data Catalog

An enterprise data catalog is a repository which describes corporate data assets. One aspect of the data catalog is documentation which is often referred to as a data dictionary, but can be much more than that depending on the software vendor's implementation. The ability to search for data based on keywords is another major benefit. In a search scenario, the underlying data security is still present; what is exposed is metadata such as columns, owner, and meaning of the data. A data catalog should also assist with data provisioning (i.e., where to request access to a particular data source if a user does not currently have access).

Data Governance

Data governance processes oversee corporate data assets so they are properly utilized with integrity and the data is understood as intended. Implementation of data governance varies wildly from organization to organization.

Data Federation

Frequently data federation is used synonymously with data virtualization, and the concepts are certainly related. Technically speaking, however, a federated query returns data from multiple data stores - federating, or combining, the query results. A federated query works using data virtualization techniques.

Data Integration

The term data integration can be used in a couple of ways. Conceptually, it refers to the integration, or consolidation, of data from multiple sources together. However, usually when we refer to data integration for a data warehouse, we're referring to data integration being performed physically, via an ETL (Extract>Transform>Load) process. Oftentimes data virtualization (in which the consolidated data is not materialized) is contrasted with data integration (in which the consolidated data is materialized). Data integration processes typically involve significant development effort, which is one reason why data virtualization has become more prevalent.
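The materialized ETL pattern described above can be sketched in a few lines of Python; the source systems, column names, and values here are all hypothetical:

```python
# Minimal ETL sketch: extract rows from two hypothetical source systems,
# standardize them (Transform), and load the consolidated result into a
# target list that stands in for a data warehouse table.

def extract():
    crm = [{"cust": "Acme Corp", "revenue": "1200.50"}]
    erp = [{"customer_name": "acme corp", "sales": 800}]
    return crm, erp

def transform(crm, erp):
    # Standardize column names, casing, and numeric types.
    rows = []
    for r in crm:
        rows.append({"customer": r["cust"].title(), "amount": float(r["revenue"])})
    for r in erp:
        rows.append({"customer": r["customer_name"].title(), "amount": float(r["sales"])})
    return rows

def load(rows, warehouse):
    # Materialize the consolidated data in the target store.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
```

The transform step is where the bulk of real-world development effort lands, which is exactly the cost that data virtualization tries to defer.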

Data Lake

A data lake is one piece of an overall data management strategy. Conceptually, a data lake is nothing more than a data repository. The data lake can store any type of data, so it's well-suited to ingestion of multi-structured data such as logs and machinery output. Cost and effort associated with data ingestion are reduced because the data is stored in its original native format with no structure (schema) required of it initially. Data lakes usually align with an "ELT" strategy, which means we can Extract and Load into the data lake in its original format, then Transform later *if* a need presents itself. A data lake is commonly implemented with HDFS (Hadoop Distributed File System), but can also involve other technologies such as NoSQL.
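A minimal Python sketch of that ELT pattern, using a temp directory as a stand-in for the lake and hypothetical device logs:

```python
import json, os, tempfile

# ELT sketch: land raw device logs in their native JSON format first
# (Extract and Load), and only apply a Transform later, when a use
# case appears. The directory stands in for a data lake.
lake_dir = tempfile.mkdtemp()

raw_events = ['{"device": "pump-1", "temp_c": 71}',
              '{"device": "pump-2", "temp_c": 64}']
with open(os.path.join(lake_dir, "machine_logs.json"), "w") as f:
    f.write("\n".join(raw_events))   # no schema imposed at load time

# Later: a need arises, so we Transform on the way out of the lake.
def hot_devices(path, threshold):
    with open(path) as f:
        return [json.loads(line)["device"]
                for line in f
                if json.loads(line)["temp_c"] > threshold]

result = hot_devices(os.path.join(lake_dir, "machine_logs.json"), 70)
```

Nothing about the threshold query had to be known when the logs were landed — that deferral is the cost/effort savings the paragraph describes.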

Data Mart

A data mart is usually focused on reporting and analysis for a particular department or a particular subject area. Data marts typically are considered to be a subset of what is contained in an enterprise data warehouse, though that is not always consistent for all implementations. Like a data warehouse, data marts are usually populated from source systems with an ETL process (data integration), so some latency is expected for the data to become available for reporting and analysis in the marts. It's also possible for data marts to employ some level of data virtualization.

Data Virtualization

A data virtualization layer is one aspect of a logical data warehouse implementation. It can be thought of as a method to access one or more underlying data sources, regardless of where the data resides, without requiring the data to be physically materialized in another data structure. Depending on the data virtualization platform, data cleansing and transformations can be performed. On a small scale, data virtualization can be used very successfully during pilot/exploratory/prototyping phases to improve BI agility, shorten the learning curve, and influence what should be built in the data warehouse. It can also be used successfully to access data in small data volumes, and/or for infrequently needed data. Using a data virtualization approach has numerous challenges, most notably user query performance and reporting load on the source systems -- therefore, a full-fledged logical data warehouse is likely necessary if data virtualization is utilized significantly across an enterprise.

Data Warehouse

Traditionally a data warehouse is a repository of enterprise-wide data which has been consolidated from multiple source systems, thus increasing the value of the data after it's been correlated. The source data is cleansed, transformed, standardized, enriched with calculations, and stored historically to facilitate time-oriented analysis. The traditional data warehouse is a centralized database, separate and distinct from the source systems, which usually translates to some level of delay in the data being available for reporting and analysis. The level of effort in developing an end-to-end data warehouse can involve long development cycles, which has opened up opportunities for alternative methods for handling data integration and data access.

Data Warehouse Appliance

A DW appliance is a combination of both hardware and software which is optimized specifically for data warehousing and analytics at high scale. Most DW appliances are based on MPP (massively parallel processing) architecture for high performance operations.

Distributed Query

A query which returns results from multiple, disparate data sources. A federated query and a distributed query are synonymous.
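A toy federated query can be sketched in Python, with an in-memory SQLite table and a plain dict standing in for two disparate stores (all names here are illustrative):

```python
import sqlite3

# Federated-query sketch: one logical query whose results are combined
# from two disparate stores -- a SQLite table and a Python dict standing
# in for a NoSQL source.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, total REAL)")
db.execute("INSERT INTO orders VALUES ('Contoso', 250.0)")

nosql_profiles = {"Contoso": {"segment": "Retail"}}

def federated_customer_view(customer):
    # Query each source independently, then federate (combine) the results.
    total = db.execute("SELECT SUM(total) FROM orders WHERE customer = ?",
                       (customer,)).fetchone()[0]
    profile = nosql_profiles.get(customer, {})
    return {"customer": customer, "total": total, **profile}

view = federated_customer_view("Contoso")
```

A real data virtualization platform does this combining for you, behind a single query interface, with optimizations this sketch omits.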

Distributed Processing

Distributed processing is one aspect of a logical data warehouse implementation. It involves pushing down the processing effort to each distributed source system whenever possible in order to maximize performance via parallelism. There are numerous techniques utilized by vendors to improve the performance of distributed processing.


Hadoop

Hadoop is a broad-ranging set of open source Apache projects which form an ecosystem for distributed storage as well as distributed processing/computations for large amounts of data (i.e., Big Data). Hadoop has continued to expand its presence due to its ability to scale both storage and processing at lower cost. A Hadoop environment complements a data warehouse by offering an environment which can handle data exploration, processing for a variety of data types, advanced analytic computations, as well as a capable ETL alternative. Because the Hadoop ecosystem is extremely broad, there's a myriad of options available for using Hadoop.

Lambda Architecture

A Lambda architecture is more about data processing than data storage. It's an architectural pattern designed to process large data volumes using both batch and streaming methods. Batch processing is typically pull-oriented, whereas streaming data is push-oriented. A Lambda architecture implementation involves a batch layer, a speed layer, and a serving layer. The serving layer, which handles data access/consumption, may serve consolidated data which was ingested through the different batch and speed layers.
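The three layers can be sketched in a few lines of Python; the page-view counts are hypothetical, and the real batch and speed layers would of course be distributed systems:

```python
# Lambda-architecture sketch: a batch layer holds precomputed counts,
# a speed layer accumulates recent events, and the serving layer merges
# the two views at query time.
batch_view = {"page_a": 1000, "page_b": 500}   # recomputed in bulk (pull)
speed_view = {}                                # updated per event (push)

def on_stream_event(page):
    # Speed layer: incrementally absorb each streaming event.
    speed_view[page] = speed_view.get(page, 0) + 1

def serving_layer(page):
    # Serving layer: consolidate the batch and speed views.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

on_stream_event("page_a")
on_stream_event("page_a")
count = serving_layer("page_a")
```

When the batch layer next recomputes, the events it has absorbed are dropped from the speed view — keeping the two layers reconciled is the hard part of real implementations.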

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.

Massively Parallel Processing (MPP)

An MPP system operates on high volumes of data in a parallel manner across distributed nodes. This "shared-nothing architecture" differs from an SMP system because each separate physical node has its own disk storage, memory, and CPU. Though an MPP platform has a lot of functionality in common with SMP, there are also many differences with respect to table design, handling distributed data, and alternative data loading patterns in order to take full advantage of parallelism. Most MPP systems are oriented towards DW/analytics/big data workloads, as opposed to transactional system workloads.
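The shared-nothing idea can be sketched in Python by partitioning rows across two "nodes" and merging their partial aggregates — a toy model of the pattern, not a real MPP engine:

```python
# MPP sketch: a shared-nothing system distributes rows across nodes,
# each node aggregates only its own partition, and a coordinator
# combines the partial results.
rows = [("east", 10), ("west", 20), ("east", 5), ("west", 7)]

# Distribute rows across two "nodes" by hashing the distribution key.
nodes = [[], []]
for region, amount in rows:
    nodes[hash(region) % 2].append((region, amount))

def node_local_sum(partition):
    # Each node works only on the data it owns (shared nothing).
    partial = {}
    for region, amount in partition:
        partial[region] = partial.get(region, 0) + amount
    return partial

# Coordinator merges the partial aggregates from every node.
totals = {}
for partial in map(node_local_sum, nodes):
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + amount
```

This is also why table design matters so much on MPP platforms: a poorly chosen distribution key skews the partitions and defeats the parallelism.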

Master Data Management (MDM)

A master data management system is a set of tools and processes for the purpose of managing, standardizing, and augmenting an organization's key reference and descriptive information for subject areas such as customers, products, accounts, or stores. Master data management is often thought of in two forms: analytical MDM and operational MDM. The value of a data warehouse can be exponentially increased with skillful master data management.

Modern Data Warehouse

Characteristics of a modern data warehouse frequently include (in no particular order):

  • Capability of handling a variety of subject areas and diverse data sources
  • Ability to handle large volumes of data
  • Expansion beyond a single DW/relational data mart structure (to include structures such as Hadoop, data lake, and/or NoSQL databases)
  • Multi-platform architecture which balances scalability and performance
  • Data virtualization in addition to data integration
  • Ability to facilitate near real-time analysis on high velocity data (potentially via Lambda architecture)
  • A flexible deployment model which is decoupled from the tool used for development
  • Built with agile, modular approach with fast delivery cycles
  • Hybrid integration with cloud services
  • Some DW automation to improve speed and consistency, and to flexibly adapt to change
  • Data cataloging to facilitate data search and to document business terminology
  • Governance model to support trust and security
  • Master data management for curation of reference data
  • Support for all types of users, and all levels of users
  • Availability of an analytics sandbox or workbench area to facilitate agility within a bimodal BI environment
  • Support for self-service BI to augment corporate BI
  • Delivery of data discovery and data exploration, in addition to reports and dashboards
  • Ability to certify and promote self-service solutions to the corporate BI/analytics environment

NoSQL Database

Like a data lake, a NoSQL database is a schema-agnostic data storage option. NoSQL databases are well-suited to hierarchical/nested data with simple data structures (ex: property and its value). There are various types of NoSQL databases which suit different use cases, including: key-value DBs, document DBs, column family stores, and graph DBs.

Polyglot Persistence

These days, we anticipate our BI and analytics environment will be composed of multiple platforms (which is ideally transparent to users). Polyglot persistence (aka "best fit engineering") refers to the approach of using the most suitable data storage technology based on the data itself. For instance, multi-structured data is usually well-suited to a data lake or HDFS storage solution; log data in XML or JSON format may be suitable for a NoSQL solution; structured data aligns with a relational database. Using multiple technologies in an information management platform is a form of scaling out, and although it simplifies needs related to certain data types, it does add complexity to the overall data management solution. The more the polyglot persistence methodology is followed, the more likely data virtualization will be of benefit, due to the varied infrastructure.
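A toy routing function can make the "best fit" idea concrete; the store names and payloads below are hypothetical:

```python
import json

# Polyglot-persistence sketch: route each incoming payload to the most
# suitable (hypothetical) store based on its shape.
def choose_store(payload):
    try:
        doc = json.loads(payload)
    except (ValueError, TypeError):
        return "data_lake"            # multi-structured / unknown: land it raw
    if isinstance(doc, dict) and any(isinstance(v, (dict, list))
                                     for v in doc.values()):
        return "nosql_document_db"    # nested / hierarchical documents
    return "relational_db"            # flat, structured records

routes = [choose_store(p) for p in (
    "raw sensor blob \x00\x01",
    '{"order": {"id": 1, "lines": [{"sku": "A"}]}}',
    '{"id": 1, "name": "Contoso"}',
)]
```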

Schema on Read

Querying techniques such as "Schema on Read" have become prevalent because they apply structure at the time the data is queried, rather than at the time the data is written or initially stored (i.e., "Schema on Write"). The value of the Schema on Read approach is being able to store the data relatively easily, with less up-front time investment, then query the data "where it lives" later once a use case is defined. Schema on Read offers significant flexibility, which can provide agility in a modern data warehousing environment and allows us to learn and deliver value more quickly. Schema on Read is often associated with an ELT (Extract>Load>Transform) approach, which is common for big data projects. The implementation of Schema on Read is usually only a portion of the overall information management platform.

Schema on Write

Schema on Write refers to how a traditional data warehouse is designed. The dimensional data model is defined up front, its schema is created, then the dimension and fact tables are loaded with data. Once the data has been loaded, users can begin using it for reporting. This approach requires up-front data analysis, data modeling, and creation of data load processes, all of which can involve long development cycles which challenge business agility. Schema on Write is associated with the traditional ETL approach: Extract>Transform>Load.
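The contrast can be sketched in Python: the raw text below is stored with no schema, and the column names and types are applied only at read time (a toy model of Schema on Read, with hypothetical sales data):

```python
import csv, io

# Schema-on-read sketch: raw text is stored as-is; a schema (column
# names and types) is applied only when the data is queried.
raw = "2016-12-01,widget,3\n2016-12-02,gadget,5\n"   # stored with no schema

def read_with_schema(text):
    # Apply the schema while reading -- i.e., Schema on Read.
    schema = ("sale_date", "product", "qty")
    return [dict(zip(schema, (d, p, int(q))))
            for d, p, q in csv.reader(io.StringIO(text))]

rows = read_with_schema(raw)
```

Under Schema on Write, by contrast, the `schema` tuple would be defined and enforced before the first row was ever stored.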

Semantic Model

A semantic model extends a data warehouse with calculations, relationships, formatting, and friendly names which make sense to functional users. The semantic model is the primary interface for users, and can significantly improve the user experience for reporting, particularly for less technical users who are responsible for some level of self-service reporting and analysis. A semantic model can be a pass-through only (with some similar characteristics to data virtualization), or could be combined with cached data storage mechanisms (such as OLAP or in-memory analytical models).

Operational Data Store (ODS)

An ODS is a data storage repository intended for near real-time operational reporting and analysis. Whereas a data warehouse and data marts are usually considered nonvolatile, an ODS is considered volatile because the data changes so frequently. To reduce data latency, data transformations and cleansing are usually minimized during the ETL process to populate the ODS. An ODS typically contains minimal history.

Symmetrical Multiprocessing (SMP)

An SMP system is a traditional relational database platform in which resources such as disk storage and memory are shared. For very large-scale databases, when an SMP database begins to experience limits related to performance and scalability, an MPP (massively parallel processing) database may be an alternative.

Where Azure Analysis Services Fits Into BI & Analytics Architecture (Part 3)

This is part 3 of a discussion about a new service in Microsoft Azure: Azure Analysis Services. Azure AS is a Platform-as-a-Service (PaaS) offering which is in public preview mode (as of December 2016).

Part 1: Why a Semantic Layer Like Azure Analysis Services is Relevant

Part 2: Use Cases for Azure Analysis Services

Part 3: Where Azure Analysis Services Fits Into BI & Analytics Architecture {you are here}

From an infrastructure perspective, there are a variety of ways you can use Azure Analysis Services. Basically, you can think of it as a "mix & match" of the following three options:

(1) Location of Assets

  • Hybrid. This is most common for companies which are extending their existing infrastructure.
  • All cloud-based. Utilization of cloud infrastructure, such as Azure services, is most common for brand new companies.
  • All on-premises. N/A for Azure AS - you'll want to use SQL Server Analysis Services instead of Azure Analysis Services for a fully on-premises implementation.

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.
  • Directly from various systems. This can be done, but works well only in specific cases - it definitely won't work well if there are a lot of highly normalized tables, or if there's no straightforward way to relate the disparate data together. Skipping an intermediary data warehouse and going directly to the source systems puts the "integration" burden on the data source view in Analysis Services, so plan for plenty of testing time if you try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services, because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, won't work with more than one data source if a user tries to combine them, because the data is not inherently related).

There is one twist to the data source options, and that is the use of federated queries. See the last example at the end of this post for more on that.

(3) Data Storage

  • Stored (cached) in the AS data model. Data is stored in the in-memory model. In this case, we're using Analysis Services for its in-memory database as well as a semantic layer. This requires a scheduled refresh and provides the best performance.
  • DirectQuery. Data is *not* stored in Analysis Services; rather, AS is basically a semantic layer only wherein all queries are actually sent to underlying data source(s). This is useful when near real-time data is desired. However, unless your source system is tuned for it, performance may not be acceptable.

Note that the property to specify if the AS model is DirectQuery or not is associated with the .bim file. This means that the entire model is either DirectQuery or it's not (can't choose on a data source-by-data source basis or a table-by-table basis). 


Below are some common scenarios (not all possible combinations are depicted, but enough to give you ideas).

In this first scenario, we have a traditional data warehouse which has integrated data from four different source systems. The data warehouse resides in an on-premises server, and the Analysis Services semantic layer resides in Azure. Data in Analysis Services is refreshed on a schedule. Reporting is primarily handled through the semantic layer to improve the user experience.

The data sources & Visual Studio pieces are removed from the rest of the examples to simplify (though they certainly still pertain).

The following depicts using Azure AS in DirectQuery mode back to the data warehouse. In this case, the DAX or MDX (whichever is passed from the client tool) is converted to SQL, sent to the data warehouse through the gateway. Data is then retrieved and sent back securely to the client tool.

Alternatively, the DirectQuery mode could be directed at a cloud-based data warehouse. Note in this scenario that a gateway is not necessary (because AS and the data warehouse both reside in Azure).

The next scenario varies only in that the data is stored in Analysis Services, and refreshed on a schedule.

Lastly, we have the concept of federated queries. PolyBase allows us to define an "external table" in SQL Server or Azure SQL Data Warehouse and reach into data stored in Azure Blob Storage (Azure Data Lake Store support is coming soon). These external tables are known as "schema on read" because the data isn't physically stored in the data warehouse.

In the following example, we are using Analysis Services in DirectQuery mode directed to the data warehouse. Under the covers, since an external table is involved, the user queries will actually reach back farther to get the Equipment Health Data as well. The users don't have to know that's happening, though they do need to be willing to accept slower performance. Federated queries like this offer great flexibility to avoid or delay data integration for data analysis which is infrequent (if the data is frequently accessed, or has progressed beyond proof of concept, you likely want to implement full data integration).

Though I didn't depict every possible combination, hopefully this gives you a good idea of ways to use Analysis Services, and where it can fit into your existing architecture. As you can tell from Part 1 of this series, I'm a big fan of using a semantic layer for a consistent and friendly end-user experience.

You Might Also Like...

Overview of SQL Server Analysis Services Tabular in DirectQuery Mode for SQL Server 2016

Building Blocks of Cortana Intelligence Suite in Azure

Power BI Features End-to-End

Resolving 'Identity Not Found' Error When Provisioning Azure Analysis Services

This week I ran into an issue when trying to provision Azure Analysis Services (AAS) in Azure. The AAS team was fantastically helpful in getting me to understand what was going on.

When provisioning Azure Analysis Services in the portal, the final selection is to specify who the administrator is. AAS requires the administrator to be an account (user or group) which exists in Azure Active Directory (AAD). Because this was in a sandbox area for testing purposes, I went ahead and used my personal account for the AAS administrator selection:


After making the selections in the portal, the provisioning failed. In looking at the details, the error was: The identity was not found in Azure Active Directory:

This error was puzzling because of course I'm in Azure Active Directory. Turns out the tenant I was using to provision the service (the Development tenant in the diagram below) really only contains guest accounts:


The Analysis Services product team explained to me that a user from a tenant which has never provisioned Azure Analysis Services cannot be added to another tenant's provisioned server. Put another way, our Corporate tenant had never provisioned AAS, so the Development tenant could not do so via cross-tenant guest security.

One resolution for this is to provision an AAS server in a subscription associated with the Corporate tenant, and then immediately delete the service from the Corporate tenant. Doing that initial provisioning will do the magic behind the scenes and allow the tenant to be known to Azure Analysis Services. Then we can proceed to provision it for real in the Development tenant.

Another resolution is to utilize an AAD account which is directly from the Development tenant as the AAS administrator. 

You Might Also Like...

Why a Semantic Layer Like Azure Analysis Services is Relevant

Use Cases for Azure Analysis Services (Part 2)

This is part 2 of a discussion about a new service in Microsoft Azure: Azure Analysis Services. Azure AS is a Platform-as-a-Service (PaaS) offering which is in public preview mode (as of November 2016).

Part 1: Why a Semantic Layer Like Azure Analysis Services is Relevant

Part 2: Use Cases for Azure Analysis Services {you are here}

Part 3: Where Azure Analysis Services Fits Into BI & Analytics Architecture

Because Azure AS is a cloud service, it has several meaningful benefits we've come to associate with this type of PaaS offering:

  • Ability to scale up or down to increase performance when the demand is there (yet not pay for that level of performance when it's not required)
  • Ability to pause to save money if no queries are issued during a particular period of time (such as overnight on Dev/Test databases)
  • No need to provision hardware or handle upgrades and patching
  • Inherent redundancy of the data

Note: at this early time of the public preview, not every feature mentioned in this post is available just yet, but they are coming.

Azure SSAS May Be Beneficial For...

Upsizing Data Size from Power BI

If you import the data into a Power BI data model (vs. issuing queries via DirectQuery mode), the imported data is currently limited to 1GB if you want to upload the file to the Power BI Service. This 1GB limit applies after the data has been compressed into its in-memory columnar format. The columnar compression is very efficient, so it can contain quite a bit of data, but 1GB certainly doesn't represent enterprise-level data models (again, I'm referring to imported data, not DirectQuery scenarios). Upsizing to Analysis Services changes all that, since it can handle larger data volumes imported to the in-memory data model.

Faster Data Refresh Schedules than Power BI

Currently you can set a Power BI dataset (which has been imported) to refresh up to 4 times per day. If your data latency requirements dictate fresher data than that, then Analysis Services can be scheduled more frequently. (Just like the previous item, this refers to imported data and not DirectQuery models.)

Varying Levels of Peak Workloads

Let's say during month-end close the reporting activity spikes much higher than the rest of a typical month. In this situation, it's a shame to provision hardware that is underutilized a large percentage of the rest of the month. This type of scenario makes a scalable PaaS service more attractive than dedicated hardware. Do note that currently Azure SSAS scales compute, known as the QPU or Query Processing Unit level, along with max data size (which is different than some other Azure services which decouple those two).

User Activity Occurs During Fixed Hours Only

We will be able to pause the Azure AS service in order to save charges. If you're a regional company with users who don't need to query the system from, say, 10pm to 7am, you'll be able to pause the service programmatically if you choose.

Cloud or Hybrid Fits Your Strategic IT Direction

If you're purposely attempting to reduce the overhead of running a data center, then more and more service offerings like this one may be a fit. To the extent you're using multiple services such as Azure SQL Data Warehouse and/or Azure SQL Database, I'm hoping we're going to see some performance benefits (assuming you've selected Azure locations in close proximity to each other for the related services that pass data around).

You Are Short on Performance Tuning Expertise

I tend to get a little preachy that running workloads in the cloud does *not* mean your DBA can retire. However, it is certainly the case that there's less administrative oversight with cloud services. With this Azure AS PaaS service, rather than tuning your server specs, you would change the scale level for your instance in Azure - certainly a lot easier, because there are way fewer "knobs" to adjust. Having said that, there are still *lots* of things to pay attention to: the performance of queries against the original source (if your data refresh window is small, or if you are using DirectQuery mode), and good design patterns, which are always *incredibly* important in order to achieve optimal performance from an Analysis Services model.

Getting New Features Fastest

We're going to be seeing new features hit Azure Analysis Services faster than SQL Server Analysis Services. Using the AAS cloud service, versus the SSAS service that ships with the SQL Server box product, offers that advantage if getting the latest and greatest quickly appeals to you.

Next up is Part 3: Where Azure Analysis Services Fits Into BI & Analytics Architecture

Finding More Information

Kasper de Jonge's blog: Analysis Services in Azure, When and Why (I had 95% of this post written before Kasper published his, so I decided to publish mine anyway even though they're fairly similar.)

Azure Documentation - What is Azure Analysis Services?

Feedback to the Product Team - Feedback on Azure AS

Channel 9 Videos: Azure AS Videos

You Might Also Like...

Building Blocks of Cortana Intelligence Suite in Azure

Power BI Features End-to-End