Fundamentals of A Data Warehouse


A database specifically for analytics

Data warehouses are special databases that use the OLAP model designed for analtyics and business intelligence. Let's get started.


Why You Want a Data Warehouse

Why would you want a data warehouse? A variety of reasons.

A data warehouse helps you to stop guessing and hoping for the best.

Helps You Perform Better

Organizations that use analytics perform better. That's why all the big names do it. It helps you to understand the customer, the market, create better products, etc. You end up making better business decisions over the long run because it is based on data. You aren't guessing and hoping for the best.

Designed for Analytics

It has been designed for analytics, not order fulfillment. Thus, you can easily run reports and get the data you need regarding business performance.

Won't Mess With Production Systems

Production applications are changing data frequently. If you run a large analytics query out of these systems it can slow it down and you don't want that. It is not good to take down your E-Commerce system because one of your analysts is running a large query for a report.

Database Models

There are two database models. They are OLAP, which is short for Online Analytical Processing and OLTP which is short for Online Transactional Processing.

The OLTP model powers most applications where data is added and updated frequently. This would be the case for things such as websites, E-Commerce, CRMs, etc.

Conversely, the OLAP model powers data warehouses, which are specially designed for analytics and business intelligence. Data isn't added directly by users but extracted from other systems such as POS or CRM and delivered through a data pipeline. More on that in a moment. Data also isn't updated. Once it is there it stays there.

A data warehouse has few related terms that are similar which you may or may not have heard. There are defined below.

  • Data Warehouse: Main database designed for analytics. Data has been processed and structured.
  • Data Mart: Sub-database of the data warehouse that is specific to a certain area of the business.
  • Data Lake: Stores unstructured and un-processed raw data.

Design

Data warehouses are designed differently than apps. Now they both have tables, rows and columns. However, the data warehouse uses types of tables: facts and dimensions.

  • Facts Tables store a metric such as sales.
  • Dimensions Tables give context to the facts. Such as sales dates.

Data warehouses also are more generalized or aggregated. They aren't concerned with a single transaction but the group as a whole. Since they store archival data, they can get big.

A schema is how the data is inter-related. It is a design for the data.

Two types of schemas are used in data warehouses, Star and Snowflake.

Star Schema

Star schema gets its name because it typically has a facts table surrounded by multiple dimensions tables. Thus, the schema appearance looks like a star. It has fewer joins, which is how relationships between queries are made.

Snowflake Schema

The snowflake schema has more relationships than the star schema. It is more efficient in terms of data storage (less redundant data) but takes longer to query due to many joins.

Related Technology

A couple of technologies are used in conjunction with a data warehouse.

  • Data Pipelines
  • Visualization & Reporting Tools
  • Machine Learning & AI

Data Pipeline

Data Pipelines are how data ends up in a data warehouse. Data pipelines are scripts written in Python or other languages that extract data from a myriad of sources, perform a process called ETL and deliver it to the destination database, which is the data warehouse.

ETL means Extract Transform Load. So it is Extracts data from systems such as POS, E-Commerce, CRM, Social media APIs, etc. It then Transforms to data while in transit so it is property formatted. It then Loads it into the destination database. Once data is added to a data warehouse it rarely changes. That's one of its benefits, it doesn't change.

More sophisticated data warehouses use what is called staging area. It moves raw (unchanged) data to a staging table temporarily, then do transformations and deliver it to the final destination at a later time.

In terms of processing, Data Pipelines are automated on a schedule or when a specific event occurs using two methods:

  • Stream: Processes lots of data all at once, typically overnight. For example, you extract all of your POS data from today at 3:00am when you are closed.
  • Batch: Processes data in small batches on a regular schedule or when an event occurs. For example, it runs every 15 minutes all day every day to get the latest data.

Visualization & Reporting Tools

Data warehouses are the source of the data but they are connected to visualization and reporting tools. These tools enable non-technical users to easily create reports and visualizations for business intelligence purposes.

They offer a variety of methods for visualization and data aggregation. They may be installed on a computer, run in the cloud as a SaaS, be automated or an interactive dashboard.

Machine Learning & AI

Machine Learning and AI require lots of data. The more data you have, the better your models perform. The data warehouse is quite often the source. Remember its purpose is analytics and you don't want to train models on production data, slowing those systems down.

Machine Learning use cases I am familiar with is more on the machine learning side. This would be predictive analytics, classification, clustering and similarity detection.


Conclusions

In summary, we discussed why you want a data warehouse, some design concepts and related technologies that you can use with a data warehouse.