data warehouse system processes
We have a fixed number of operations to be applied on the operational databases and we have well-defined techniques such as use normalized data, keep table small, etc. These techniques are suitable for delivering a solution. But in case of decision-support systems, we do not know what query and operation needs to be executed in future. Therefore techniques applied on operational databases are not suitable for data warehouses.
In this chapter, we will discuss how to build data warehousing solutions on top open-system technologies like Unix and relational databases.
Process Flow in Data Warehouse
There are four major processes that contribute to a data warehouse:
- Extract and load the data.
- Cleaning and transforming the data.
- Backup and archive the data.
- Managing queries and directing them to the appropriate data sources.
Extract and Load Process
Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.
Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.
Controlling the Process
Controlling the process involves determining when to start data extraction and the consistency check on data. Controlling process ensures that the tools, the logic modules, and the programs are executed in correct sequence and at correct time.
When to Initiate Extract
Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.
For example, in a customer profiling data warehouse in telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated subscriptions.
Loading the Data
After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.
Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.
Clean and Transform Process
Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming:
- Clean and transform the loaded data into a structure
- Partition the data
- Aggregation
Clean and Transform the Loaded Data into a Structure
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent:
- within itself.
- with other data within the same data source.
- with the data in other source systems.
- with the existing data present in the warehouse.
Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.
Partition the Data
It will optimize the hardware performance and simplify the management of data warehouse. Here we partition each fact table into multiple separate partitions.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular back ups. Archiving involves removing the old data from the system in a format that allow it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years with the latest 6 months data being kept online. In such as scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.
Query Management Process
This process performs the following functions:
- manages the queries.
- helps speed up the execution time of queris.
- directs the queries to their most effective data sources.
- ensures that all the system sources are used in the most effective way.
- monitors actual query profiles.
The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into data warehouse.