Yup, workloads drive modeling. Not the industry, not your platform, not your beliefs nor religions.

There continues to be a lot of chatter about data design (schema modeling) lately as it’s a process we’re re-discovering after picking up a lot of bad habbits from the big data experience.

There are all sorts of competing ideas in the industry today. Some relational purists (I am one sometimes), some pushing “one big table” techniques including activity schema, then there are those that are absolute adherents to dimensional modeling, both star and snowflake schema.

These discussions, to me, aren’t very productive, as they almost never account for workloads. So, lets take a look at some of the workloads that will drive these decisions:

Operational systems/transactional systems:

These are highly normalized systems as our goals are to manage high concurrency updates, deletes and inserts while mantaining constraint. It’s in our best interests to maintain high relational fidelity here.

Operational Data Stores:

It’s an old term, but ODS workloads are everywhere though we may no longer call them that. The goal of the ODS is to collocate source datasets into a single structure for operational reporting. Rapid ingestion is an absolute requirement, as operational decisions depend on speed. Because of this, we avoid transformation and integration in these environments. Our goal is just to collocate the data so we can query across source systems in a hurry.

Schemas here will likely be absolute mirrors of the source systems with lightly added metadata. Since transactional systems are highly normalized, they will be in the ODS as well. This makes for extremely simple CDC and ETL/ELT processes. Wedging something like OBT into this structure would be nearly physically impossible while meeting ingestion SLAs. In some advanced implementations, the relations will be temporalized using history logging, or temporal primary key surrogation.

Data Warehouses:

This is a very distinct structure in the data world. Not all data workloads hosted on a data warehouse platform are data warehouses. From my work in the field, most workloads on data warehouse platforms are ODSs. The primary goal of the data warehouse is to support strategic decision making in the organization. In order to accomplish this, it must be time variant, non volatile, subject oriented, and integrated. These are big asks, and will affect schema design. The biggest influencers are subject orientation and integration. We normalize to the subject level, no further. An employee record from salesforce and an employee record from the HR system will be in two distinct relations in an ODS, but in the data warehouse they become one entity. Integration and subject orientation drive the data model. It will likely be less normalized than the source systems but not much. This level of normalization gives us amazing advantages in defining what correct is, and helps us avoid update anomalies. If something goes wrong in the DW, we’ve got strategic problems. Moderate normalization helps us avoid these things. Of course, with integration comes more complex transformation, so ingestion times will be slower than an ODS. For that matter, streaming ingestion may not be possible.

DataMarts and OperMarts:

Marts are logically just views of their source systems. So long as a data mart accurately reflects the current state of the data warehouse, normalization levels are meaningless. Same is true of opermarts reflecting the current state of the ODS. This is where ideas like dimensional modeling, OBT, etc are very effective as they simplify the subject matter for easy consumption of almost all staff in the organization. They can be physical tables materialized from the source system, materialized views within the same database as the warehouse or ODS, or depending on the performance of your platform they can be just direct views within these same structures.

So what modeling paradigm is best? It’s kinda like asking what type of nail gun is best. Don’t use a framing nailer to build drawers for cabinetry. Don’t use a finish nailer to fasten structural framing, and whatever you do, don’t use a brad gun for hanging shingles. Use the right tool for the right job and your team can see amazing performance.

image credit: Bill Inmon’s Corporate Information Factory

Source: https://www.linkedin.com/in/robert-harmon-a910b74/

Leave a Reply

Your email address will not be published. Required fields are marked *