This post will be the first in a short series of blog posts showing an overview of Data Warehousing with a slant on how we as ColdFusion developers may encounter it. In these articles we will use a simple database example.
The first major stepping stone in understanding Data Warehousing is to grasp the concepts and differences between the two overall database categories. The type most of us are used to dealing with is the On Line Transactional Processing (OLTP) category. I go into greater detail below. The other major category is On Line Analytical Processing (OLAP), in this design we are “flattening” out the data and the examples below should help to clarify this.
Accessing data to produce required reports is a crucial need in most organizations; in this example we will take data contained in an OLTP format and create two notional OLAP modeled tables from the OLTP format of the data. The OLAP format is how we might hold that same data if our goal was to produce an efficient version of the same data for reporting on large numbers of records. This need for large reports with lots of data can sometimes conflict with the needs of storing data in the first place as the database structures needed to accept, store and easily manipulate data efficiently can be markedly different from the structures needed to produce effective reports from that same data. This is clearly illustrated if we look at the differences between two basic kinds of database categories:
ON-LINE TRANSACTIONAL PROCESSING (OLTP)
This is what we would characterize as the ongoing day-to-day functional copy of the database. It is where data is added and updated but never overwritten or deleted. The main needs of the OLTP operational database being easily controlled insertion and updating of data with efficient access to data manipulation and viewing mechanisms. Typically only single record or small record-sets should be manipulated in a single operation in an OLTP designed database. The main thrust here is to avoid having the same data in different tables. This basic tenet of Relational Database modeling is known as “normalizing” data.
Here is an example of a simple OLTP normalized database. (By the way I use Visio to create this and it is a bit “quirky” in accurately assigning Foreign Keys (FK)). The main thing to note is the number of tables and the relationships. This is really a simple database with in essence, people, companies and related items. In order to pull meaningful reports we have to join tables at the query level. In larger databases that are not well designed this can be a very challenging and inefficient process.
ON-LINE ANALYTICAL PROCESSING (OLAP)
OLAP is a broad term that also encompasses data warehousing. In this model data is stored in a format which enables the efficient creation of data mining/reports. OLAP design should accommodate reporting on very large record sets with little degradation in operational efficiency. The overall term used to describe taking data structures in an OLTP format and holding the same data in an OLAP format is “Dimensional Modeling” It is the primary building block of Data Warehousing and we will go into greater detail on what that is and how it is used in a later blog post. Data in an OLAP database should always be read-only.
Here is an example of how some of the data shown above might be stored in an OLAP format for the purpose of producing large reports of a particular extract of the OLTP data format above. Here we created a dimensional model of two types of user reports. One for details of where users live the other of where they work, note that the same data is now in a de-normalized state. The OLTP and OLAP data could be kept synchronized by using Transactional Replication. It is important to note that all day-to-day manipulation of the data is carried out in the OLTP format database; the OLAP version being a replicated, read-only copy.
We have two tables here that contain data from the OLTP database but in this case it is in a much more “flattened” state. In this case we could and probably would have lots of repeated data particularly in the “ReportPeopleWork” table. In an OLAP model repetition is not necessarily a bad thing.
In this first article we have demonstrated some differences between OLTP and OLAP database types.
Firstly all transactions on the data are carried out in the OLTP, transactional, copy of the database. The normalized structure is best for ensuring that little or no data is repeated. Basic user information is contained in the “People” table, the home and work address for the user is maintained in the “Location” table and the home address work address for each person is determined by the “Locationtype”. These different tables are related via Link tables. All adding and updating of data is done in the OLTP database.
Secondly data which needs to be used in large recordsets in a read-only format, such as for reporting, is best held in an OLAP format. Here each table can be thought of in the same way as we might think of an object in OO programming paradigms, with each row in the table being a single complete object. There still may be reasons for linking tables but in the OLAP model this should be kept to a bare minimum. An excessive need for linking tables in the OLAP database probably denotes a structural weakness in how the OLAP database was modeled.