As I put together a new presentation on my current favorite topic (modern data warehousing), it occurred to me that others might feel like there's some confusion and/or overlap with terminology. Some terms are somewhat fuzzy and mean different things within different organizations, so here's my best effort at a glossary of the components within a Modern Data Warehouse.
In alphabetical order:
Advanced Analytics
The term advanced analytics is a broad phrase for sophisticated statistical techniques used to find patterns in data for the purpose of predictions, recommendations, optimizations, and descriptions of information. It can include subcategories such as predictive analytics, prescriptive analytics, operational analytics, descriptive analytics, and so forth. Advanced analytics can be employed for use cases such as fraud detection, customer segmentation, assessing credit risk, or predictions such as student dropouts, customer churn, or hospital readmissions.
Analytics
In general, use of the term analytics has evolved over time. It used to imply the usage of statistical methods, such as machine learning and data mining. More recently the term analytics has evolved to be commonly used to describe finding meaningful patterns in data.
Analytics Sandbox
An analytics sandbox is an exploratory environment which a knowledgeable analyst or data scientist controls. In this ungoverned (or less governed) personal environment, an analyst can move very quickly using preferred tools and techniques. The advantage of an analytics sandbox is agility for purposes of prototyping and/or data exploration. Sandbox solutions may be productionized and moved into the governed DW/BI/analytics environment.
Big Data
The term 'big data' is terribly overused, and is used in different ways. One meaning refers to the sheer size of data volumes. Another meaning is associated with multi-structured data (the combination of unstructured, semi-structured, and structured) of varying data types. Further, others mean it to imply analysis of data in new and interesting ways. Low latency data (the velocity portion of the 3 V's of volume, variety, and velocity) is also attributed to big data.
Bimodal BI
Originally coined by Gartner, Bimodal BI refers to two modes for development and delivery of information. One mode is focused on agility: business-driven, rapid delivery which values the freedom of exploration and speed. The second mode focuses on traditional IT-driven processes which value reliability, governance, standardization, and security. Having groups of knowledgeable staff each devoted to their own mode, while working collaboratively with each other, is optimal for achieving a good balance around delivery of BI.
Data Catalog
An enterprise data catalog is a repository which describes corporate data assets. One aspect of the data catalog is documentation which is often referred to as a data dictionary, but can be much more than that depending on the software vendor's implementation. The ability to search for data based on keywords is another major benefit. In a search scenario, the underlying data security is still present; what is exposed is metadata such as columns, owner, and meaning of the data. A data catalog should also assist with data provisioning (i.e., where to request access to a particular data source if a user does not currently have access).
Data Governance
Data governance processes oversee corporate data assets so they are properly utilized with integrity and the data is understood as intended. Implementation of data governance varies widely from organization to organization.
Data Federation
Frequently data federation is used synonymously with data virtualization, and the concepts are certainly related. Technically speaking, however, a federated query returns data from multiple data stores - federating, or combining, the query results. A federated query works using data virtualization techniques.
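As a rough illustrative sketch of the idea (all store names and data below are hypothetical), the example queries two separate in-memory "stores" and combines the matching rows into a single result set, which is essentially what a federated query does behind a data virtualization layer:

```python
# Illustrative only: two in-memory lists stand in for two disparate
# data stores (e.g., a CRM database and an ERP database).
crm_store = [{"customer": "Ann", "region": "West"},
             {"customer": "Bob", "region": "East"}]
erp_store = [{"customer": "Ann", "total_orders": 12},
             {"customer": "Bob", "total_orders": 7}]

def federated_query(region):
    """Query both stores, then federate (combine) the results."""
    results = []
    for c in (c for c in crm_store if c["region"] == region):
        # look up the matching record in the second store
        orders = next(e for e in erp_store if e["customer"] == c["customer"])
        results.append({**c, **orders})   # one combined result row
    return results
```

A real data virtualization platform does this matching and combining across networked systems, with optimizations such as pushing filters down to each source.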
Data Integration
The term data integration can be used in a couple of ways. Conceptually, it refers to the integration, or consolidation, of data from multiple sources together. However, usually when we refer to data integration for a data warehouse, we're referring to data integration being performed physically, via an ETL (Extract>Transform>Load) process. Oftentimes data virtualization (in which the consolidated data is not materialized) is contrasted with data integration (in which the consolidated data is materialized). Data integration processes typically involve significant development effort, which is one reason why data virtualization has become more prevalent.
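A minimal sketch of the Extract>Transform>Load sequence, using assumed sample rows (real ETL tools add scheduling, error handling, lookups, and much more):

```python
# Hypothetical raw extract: messy delimited rows from a source system.
raw_source = [" ann,WEST,100 ", "bob,east,200"]

def transform(row):
    """Cleanse and standardize one extracted row before loading."""
    name, region, amount = [f.strip() for f in row.strip().split(",")]
    return {"name": name.title(), "region": region.title(), "amount": int(amount)}

warehouse = []                        # stand-in for the target DW table
for row in raw_source:                # Extract
    warehouse.append(transform(row))  # Transform, then Load (materialized)
```

Note the consolidated result is physically materialized in the target, which is the key contrast with data virtualization.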
Data Lake
A data lake is one piece of an overall data management strategy. Conceptually, a data lake is nothing more than a data repository. The data lake can store any type of data, so it's well-suited to ingestion of multi-structured data such as logs and machinery output. Cost and effort associated with data ingestion are reduced because the data is stored in its original native format with no structure (schema) required of it initially. Data lakes usually align with an "ELT" strategy which means we can Extract and Load into the data lake in its original format, then Transform later *if* a need presents itself. A data lake is commonly implemented with HDFS (Hadoop Distributed File System), but could also easily involve other technologies such as NoSQL.
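The ELT flow can be sketched in miniature as follows; the raw records land untouched, and a transform runs only later, on demand (sensor names, fields, and the Celsius use case are all illustrative assumptions):

```python
import json

data_lake = []  # stand-in for raw file storage such as HDFS paths

# Extract + Load: land the data exactly as received, no schema imposed
data_lake.append('{"sensor": "t1", "reading": 21.5, "unit": "C"}')
data_lake.append('{"sensor": "t2", "reading": 70.1, "unit": "F"}')

# Transform (later, only once a use case appears): normalize to Celsius
def readings_in_celsius():
    out = []
    for raw in data_lake:
        rec = json.loads(raw)
        value = rec["reading"] if rec["unit"] == "C" else (rec["reading"] - 32) * 5 / 9
        out.append({"sensor": rec["sensor"], "celsius": round(value, 1)})
    return out
```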
Data Mart
A data mart is usually focused on reporting and analysis for a particular department or a particular subject area. Data marts typically are considered to be a subset of what is contained in an enterprise data warehouse, though that is not always consistent for all implementations. Like a data warehouse, data marts are usually populated from source systems with an ETL process (data integration), so some latency is expected for the data to become available for reporting and analysis in the marts. It's also possible for data marts to employ some level of data virtualization.
Data Virtualization
A data virtualization layer is one aspect of a logical data warehouse implementation. It can be thought of as a method to access one or more underlying data sources, regardless of where the data resides, without requiring the data to be physically materialized in another data structure. Depending on the data virtualization platform, data cleansing and transformations can be performed. On a small scale, data virtualization can be used very successfully during pilot/exploratory/prototyping phases to improve BI agility, shorten the learning curve, and influence what should be built in the data warehouse. It can also be used successfully to access data in small data volumes, and/or for infrequently needed data. Using a data virtualization approach has numerous challenges, most notably user query performance and reporting load on the source systems; therefore, a full-fledged logical data warehouse is likely necessary if data virtualization is utilized significantly across an enterprise.
Data Warehouse
Traditionally a data warehouse is a repository of enterprise-wide data which has been consolidated from multiple source systems, thus increasing the value of the data after it's been correlated. The source data is cleansed, transformed, standardized, enriched with calculations, and stored historically to facilitate time-oriented analysis. The traditional data warehouse is a centralized database, separate and distinct from the source systems, which usually translates to some level of delay in the data being available for reporting and analysis. The level of effort in developing an end-to-end data warehouse can involve long development cycles, which has opened up opportunities for alternative methods for handling data integration and data access.
Data Warehouse Appliance
A DW appliance is a combination of both hardware and software which is optimized specifically for data warehousing and analytics at high scale. Most DW appliances are based on MPP (massively parallel processing) architecture for high performance operations.
Federated Query
A query which returns results from multiple, disparate data sources. A federated query and a distributed query are synonymous.
Distributed Processing
Distributed processing is one aspect of a logical data warehouse implementation. It involves pushing down the processing effort to each distributed source system whenever possible in order to maximize performance via parallelism. There are numerous techniques utilized by vendors to improve the performance of distributed processing.
Hadoop
Hadoop is a broad-ranging set of open source Apache projects which form an ecosystem for distributed storage as well as distributed processing/computations for large amounts of data (i.e., Big Data). Hadoop has continued to expand its presence due to its ability to scale both storage and processing at lower cost. A Hadoop environment complements a data warehouse by offering an environment which can handle data exploration, processing for a variety of data types, advanced analytic computations, as well as a capable ETL alternative. Because the Hadoop ecosystem is extremely broad, there's a myriad of options available for using Hadoop.
Lambda Architecture
A Lambda architecture is more about data processing than data storage. It's an architectural pattern designed to process large data volumes using both batch and streaming methods. Batch processing is typically pull-oriented, whereas streaming data is push-oriented. A Lambda architecture implementation involves a batch layer, a speed layer, and a serving layer. The serving layer, which handles data access/consumption, may serve consolidated data which was ingested through the different batch and speed layers.
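The interplay of the three layers can be sketched with in-memory stand-ins (page names and counts below are assumptions for illustration): a batch view precomputed periodically, a speed layer holding recent events not yet folded into the batch view, and a serving layer merging both at query time.

```python
# Batch layer output: a view recomputed periodically (pull-oriented)
batch_view = {"page_a": 100, "page_b": 40}

# Speed layer: recent streamed events not yet in the batch view (push-oriented)
speed_layer = [("page_a", 1), ("page_a", 1), ("page_c", 1)]

def serving_layer(page):
    """Merge batch and speed results into one consolidated, current count."""
    recent = sum(n for p, n in speed_layer if p == page)
    return batch_view.get(page, 0) + recent
```

When the next batch run completes, the speed layer's events would be absorbed into the batch view and dropped from the speed layer.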
Logical Data Warehouse
A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.
Massively Parallel Processing (MPP)
An MPP system operates on high volumes of data in a parallel manner across distributed nodes. This "shared-nothing architecture" differs from an SMP system because each separate physical node has its own disk storage, memory, and CPU. Though an MPP platform has a lot of functionality in common with SMP, there are also many differences with respect to table design, handling distributed data, and alternative data loading patterns in order to take full advantage of parallelism. Most MPP systems are oriented towards DW/analytics/big data workloads, as opposed to transactional system workloads.
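The shared-nothing pattern can be sketched as follows; each "node" aggregates only its own slice of the data in parallel, and the partial results are combined at the end. A thread pool merely simulates the parallelism here (real MPP systems run the local step on separate physical nodes against their own disks):

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed sample data, distributed so that each node owns one partition
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def local_sum(partition):
    # each node computes over its own storage, independently
    return sum(partition)

def mpp_total():
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(local_sum, partitions)  # parallel local work
    return sum(partials)                            # final combine step
```

Table design on an MPP platform revolves around choosing a good distribution key so that these partitions stay balanced and co-located for joins.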
Master Data Management (MDM)
A master data management system is a set of tools and processes for the purpose of managing, standardizing, and augmenting an organization’s key reference and descriptive information for subject areas such as customers, products, accounts, or stores. Master data management is often thought of in two forms: analytical MDM, and operational MDM. The value of a data warehouse can be exponentially increased with skillful master data management.
Modern Data Warehouse
Characteristics of a modern data warehouse frequently include (in no particular order):
- Capability of handling a variety of subject areas and diverse data sources
- Ability to handle large volumes of data
- Expansion beyond a single DW/relational data mart structure (to include structures such as Hadoop, data lake, and/or NoSQL databases)
- Multi-platform architecture which balances scalability and performance
- Data virtualization in addition to data integration
- Ability to facilitate near real-time analysis on high velocity data (potentially via Lambda architecture)
- A flexible deployment model which is decoupled from the tool used for development
- Built with an agile, modular approach and fast delivery cycles
- Hybrid integration with cloud services
- Some DW automation to improve speed and consistency, and to adapt flexibly to change
- Data cataloging to facilitate data search and to document business terminology
- Governance model to support trust and security
- Master data management for curation of reference data
- Support for all types of users, and all levels of users
- Availability of an analytics sandbox or workbench area to facilitate agility within a bimodal BI environment
- Support for self-service BI to augment corporate BI
- Delivery of data discovery and data exploration, in addition to reports and dashboards
- Ability to certify and promote self-service solutions to the corporate BI/analytics environment
NoSQL
Like a data lake, a NoSQL database is a schema-agnostic data storage option. NoSQL databases are well-suited to hierarchical/nested data with simple data structures (e.g., a property and its value). There are various types of NoSQL databases which suit different use cases, including: key-value DBs, document DBs, column family stores, and graph DBs.
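The key-value/document pattern can be sketched with plain dictionaries standing in for a NoSQL engine (keys, fields, and data are illustrative assumptions); note that each document can have its own nested shape, with no schema declared up front:

```python
doc_store = {}  # key-value: document id -> document

# Nested/hierarchical documents; structure can vary per document
doc_store["cust-1"] = {
    "name": "Ann",
    "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
doc_store["cust-2"] = {"name": "Bob", "orders": []}

# Retrieval is by key; any further shaping happens in application code
total_qty = sum(o["qty"] for o in doc_store["cust-1"]["orders"])
```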
Polyglot Persistence
These days, we expect our BI and analytics environment to comprise multiple platforms (which is ideally transparent to users). Polyglot persistence (aka "best fit engineering") refers to the approach of using the most suitable data storage technology based on the data itself. For instance, multi-structured data is usually well-suited to a data lake or HDFS storage solution; log data in XML or JSON format may be suitable for a NoSQL solution; structured data aligns with a relational database. Using multiple technologies in an information management platform is a form of scaling out, and although it simplifies needs related to certain data types, it does add complexity to the overall data management solution. The more the polyglot persistence methodology is followed, the more likely data virtualization is to be of benefit due to the varied infrastructure.
Schema on Read
Querying techniques such as "Schema on Read" have become prevalent because they apply structure at the time the data is queried, rather than at the time the data is written or initially stored (i.e., "Schema on Write"). The value of the Schema on Read approach is the ability to store the data relatively easily with less up-front time investment, then query the data "where it lives" later once a use case is defined. Schema on Read offers significant flexibility, which can provide agility in a modern data warehousing environment and allows us to learn and deliver value more quickly. Schema on Read is often associated with an ELT (Extract>Load>Transform) approach, which is common for big data projects. The implementation of Schema on Read is usually only a portion of the overall information management platform.
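In miniature (with assumed sample data), schema on read looks like this: raw lines are stored with no structure imposed, and field names and types are applied only at query time:

```python
import json

# Stored as-is at write time: no schema declared or enforced
raw_storage = [
    '{"user": "ann", "clicks": "3"}',
    '{"user": "bob", "clicks": "5"}',
]

def query_clicks():
    """Apply the schema while reading: parse JSON and cast types now."""
    results = []
    for line in raw_storage:
        rec = json.loads(line)
        results.append((rec["user"], int(rec["clicks"])))  # structure imposed here
    return results
```

The tradeoff is that every query pays the parsing/casting cost, and bad records surface at read time rather than at load time.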
Schema on Write
Schema on Write refers to how a traditional data warehouse is designed. The dimensional data model is defined up front, its schema is created, then the dimension and fact tables are loaded with data. Once the data has been loaded, users can begin using it for reporting. This approach requires up-front data analysis, data modeling, and creation of data load processes, all of which can involve long development cycles which challenge business agility. Schema on Write is associated with the traditional ETL approach: Extract>Transform>Load.
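The define-first sequence can be sketched with an in-memory SQLite database (table and column names are illustrative assumptions): the schema is declared before any data arrives, and every load must conform to it at write time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema defined up front, before any data is loaded
conn.execute("""CREATE TABLE fact_sales (
                  product TEXT    NOT NULL,
                  amount  INTEGER NOT NULL)""")

# The load must conform to the schema; a NULL here would fail at write time
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", ("widget", 100))
conn.commit()

rows = conn.execute("SELECT product, amount FROM fact_sales").fetchall()
```

Because the structure is enforced on the way in, queries are simple and fast, at the cost of the up-front modeling effort described above.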
Semantic Model
A semantic model extends a data warehouse with calculations, relationships, formatting, and friendly names which make sense to functional users. The semantic model is the primary interface for users, and can significantly improve the user experience for reporting, particularly for less technical users who are responsible for some level of self-service reporting and analysis. A semantic model can be a pass-through only (with some similar characteristics to data virtualization), or could be combined with cached data storage mechanisms (such as OLAP or in-memory analytical models).
Operational Data Store (ODS)
An ODS is a data storage repository intended for near real-time operational reporting and analysis. Whereas a data warehouse and data marts are usually considered nonvolatile, an ODS is considered volatile because the data changes so frequently. To reduce data latency, data transformations and cleansing are usually minimized during the ETL process to populate the ODS. An ODS typically contains minimal history.
Symmetric Multiprocessing (SMP)
An SMP system is a traditional relational database platform in which resources such as disk storage and memory are shared. For very large-scale databases, when an SMP database begins to experience limits related to performance and scalability, an MPP (massively parallel processing) database may be an alternative.