One of the most important choices in building your data warehouse is schema design. This choice sets the table for everything that follows. How much time is your staff going to spend on building difficult queries. Which data platform is best built to support your schematic decisions? How much time are we going to burn chasing anomalous data as each design pattern brings its own risks?

There are myriad options. And from an implementation perspective, virtually any choice is valid. But only if you know the unintended consequences.

There’s three goals we’re trying to balance here: Logical accuracy, performance (speed and concurrency), and maintenance/development time.

Let’s look around at what’s possible given platforms that exist today and common design patterns:

Dimensional design:

  • Options for relations and domain constraints are limited   
  • Constraint isn’t addressed
  • Development rotations are generally long, as transforming source data into dimensional stars or snowflake schemas is problematic
  • Query response time may suffer due to complex joins on SCD2 relations and in some cases, the confusion that dimensions/facts present
  • Query design is VERY straight forward

Activity Schema design (EAV design):

  • Completely lacks any constraint
  • Ignores the temporality of entities which generates update anomalies
  • Ingest development is VERY fast as time variance is ignored
  • For queris understood before schema design, it’s very fast
  • Requires data duplication, which requires more development time, and LOTS of energy committed to data governance
  • Queries not anticipated will run extremely long

Anchor design:

  • Amazing opportunities for constraint available
  • High relational fidelity
  • Low manpower required for schema change
  • Ingest development can be automated
  • Bonus: Bitemporality is a walk in the park compared to the others
  • Assuming the correct physical platform, really low latency and high concurrency
  • Relatively easy to build out ingestion

Multi Temporal subject oriented design:

  • Easily constrained
  • High relational fidelity
  • Ingest development can be automated
  • Bitemporality is easy to implement, but hard to understand
  • Assuming the correct physical platform, really low latency and high concurrency
  • A bit hard on storage volume

For all of the above, there are limitations. It’s to you, the implimentor, to fill in those limitations.

There are more proposed out there, but few that hit all the high spots. Choose your design carefully, if you’re incorrect you’ll likely spend an order of magnitude more than necessary on both manpower and hardware/hosting.

And last, a warning: New design methods are being publicized it seems weekly lately. Few improve on previous design methods. Learn as much about all of them so you can compare, contrast, then do the hard work of filling in their limits.

#dataarchitecture #datawarehousing #highperformance

Source: https://www.linkedin.com/in/robert-harmon-a910b74/

Leave a Reply

Your email address will not be published. Required fields are marked *