The data warehouse enables enterprise-level analysis and reporting commonly used for decision support. The data warehouse system includes data from different data sources. Data are typically extracted from different sources and then transformed and integrated several times before being stored in a central repository.
Extraction and transformation processes vary both in theory and across solution providers. Some are standard, while others are hand-built solutions tailored to the user’s transformation and reporting requirements.
Most research on data integration focuses on this area, i.e., data transformation. Because data in the data warehouse undergo complex transformations, often at several different levels and stages, it is very important to ensure their quality. This blog aims to examine and compare existing approaches for tracking data provenance, with a focus on solutions tailored to data warehouse architecture.
Data Warehouse Architecture
Data warehousing is the process of extracting, transforming and loading data into a data warehouse for reporting and analysis. A data warehouse consists of two main components: the first is data collection, integration and storage, and the second is maintenance, reporting and analysis.
The maintenance component is complex and involves several processes, which mainly include data extraction, transformation and loading.
Data in the data warehouse act as a focal point for data integration and as a distribution point.
Data Collection
Data collection and integration are important processes in data warehouses. Data undergo cleansing and integration before they enter the data warehouse.
Linking data from heterogeneous and disparate sources is essential, but it is a major challenge due to differences in nomenclature, domain definitions, identification numbers, etc.
Since data in the warehouse are integrated from different sources and transformed through complex processes, the original source is often obscured.
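To make the linking problem concrete, here is a minimal Python sketch of how two sources with different field names and identifier formats might be brought into one shape while still recording where each record came from. The source names (`crm`, `billing`), the field mappings, and the `normalize_id` rule are all hypothetical and chosen only for illustration; a real warehouse would need its own matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegratedRecord:
    customer_id: str  # harmonized identifier used inside the warehouse
    name: str
    source: str       # which source system the record came from
    source_key: str   # the identifier exactly as it appeared in that source


# Hypothetical per-source field mappings (the nomenclature differs per source).
FIELD_MAPS = {
    "crm": {"id": "CustID", "name": "FullName"},
    "billing": {"id": "customer_no", "name": "cust_name"},
}


def normalize_id(raw):
    """Keep only the digits and drop leading zeros so IDs compare equal."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return digits.lstrip("0") or "0"


def integrate(source, rows):
    """Map one source's rows onto the common record shape, keeping provenance."""
    fields = FIELD_MAPS[source]
    return [
        IntegratedRecord(
            customer_id=normalize_id(row[fields["id"]]),
            name=row[fields["name"]].strip().title(),
            source=source,
            source_key=str(row[fields["id"]]),
        )
        for row in rows
    ]


crm_rows = [{"CustID": "C-0042", "FullName": "ada lovelace"}]
billing_rows = [{"customer_no": 42, "cust_name": "ADA LOVELACE "}]

records = integrate("crm", crm_rows) + integrate("billing", billing_rows)
```

Keeping `source` and `source_key` next to the harmonized values is one simple way to avoid losing the origin of a record once it has been integrated.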
Data Transformation
Data acquisition and integration are usually part of the extraction, transformation and loading (ETL) process. ETL is commonly called data transformation and is a well-known process cycle inherent in the warehouse environment.
A typical end result of the ETL process in a warehouse environment is data stored in a multidimensional schema. ETL typically contains transformation programs that perform tasks to clean, combine and summarize source data before loading it into the data warehouse.
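The clean, combine and summarize steps described above can be sketched in a few lines of Python. This is only an illustrative example under assumed inputs: the `store_sales` and `web_sales` rows, their field names, and the choice of region and month as dimensions are hypothetical, not taken from any particular ETL tool.

```python
from collections import defaultdict


def clean(rows):
    """Drop rows with a missing amount and normalize region and date values."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue
        cleaned.append({
            "region": row["region"].strip().upper(),
            "month": row["date"][:7],  # 'YYYY-MM-DD' -> 'YYYY-MM'
            "amount": float(row["amount"]),
        })
    return cleaned


def combine(*sources):
    """Concatenate already-cleaned rows from several sources."""
    return [row for source in sources for row in source]


def summarize(rows):
    """Aggregate amounts per (region, month), as a tiny fact table."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["month"])] += row["amount"]
    return dict(totals)


store_sales = [
    {"region": " north", "date": "2024-01-15", "amount": "100.0"},
    {"region": "north", "date": "2024-01-20", "amount": None},
]
web_sales = [
    {"region": "North", "date": "2024-01-03", "amount": 50},
    {"region": "south", "date": "2024-02-01", "amount": 25.5},
]

fact_sales = summarize(combine(clean(store_sales), clean(web_sales)))
```

Here the keys of `fact_sales` play the role of dimensions and the summed amount is the measure, which loosely mirrors how a multidimensional schema is organized.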
Support
Supporting objects, processes, and data is an important administrative task in the data warehouse. Data provenance reports play a very important role in maintenance tasks in data warehouses. An overview of data provenance is valuable at every stage, from design to development.
A common data warehouse administration task is modifying, updating, or analyzing a particular transformation. Another common administrative task is analyzing the quality of the data loaded into the data warehouse.
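As a rough illustration of how provenance information could support these two tasks, the following Python sketch stores lineage as a small mapping and answers two questions: which original sources feed a given table, and which tables are affected if a transformation changes. The table names, transformation names, and the `LINEAGE` structure itself are hypothetical; existing solutions record lineage in their own ways.

```python
# Hypothetical lineage: target table -> (transformation name, input tables).
LINEAGE = {
    "stg_customers": ("clean_customers", ["crm.customers"]),
    "stg_orders": ("clean_orders", ["billing.orders"]),
    "fact_sales": ("build_fact_sales", ["stg_customers", "stg_orders"]),
}


def trace_sources(target):
    """Walk backwards to the original source tables behind a target."""
    if target not in LINEAGE:
        return {target}  # no recorded transformation: treat as an original source
    _, inputs = LINEAGE[target]
    sources = set()
    for name in inputs:
        sources |= trace_sources(name)
    return sources


def affected_by(transformation):
    """Find every target that depends, directly or indirectly, on a transformation."""
    affected = {t for t, (name, _) in LINEAGE.items() if name == transformation}
    changed = True
    while changed:
        changed = False
        for target, (_, inputs) in LINEAGE.items():
            if target not in affected and affected & set(inputs):
                affected.add(target)
                changed = True
    return affected
```

For example, `trace_sources("fact_sales")` walks back to the two source tables, which helps when investigating a data quality issue, while `affected_by("clean_orders")` lists the tables to re-check before modifying that transformation.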
This blog presents different approaches to maintaining data sets that have been implemented in existing solutions.