High performance Data Warehouse rule #7

One of the most important choices in building your data warehouse is schema design. This choice sets the table for everything that follows. How much time is your staff going to spend on building difficult queries. Which data platform is best built to support your schematic decisions? How much time are we going to burn chasing anomalous data as each design pattern brings its own risks?

There are myriad options. And from an implementation perspective, virtually any choice is valid. But only if you know the unintended consequences.

There’s three goals we’re trying to balance here: Logical accuracy, performance (speed and concurrency), and maintenance/development time.

Let’s look around at what’s possible given platforms that exist today and common design patterns:

Dimensional design:

Options for relations and domain constraints are limited
Constraint isn’t addressed
Development rotations are generally long, as transforming source data into dimensional stars or snowflake schemas is problematic
Query response time may suffer due to complex joins on SCD2 relations and in some cases, the confusion that dimensions/facts present
Query design is VERY straight forward

Activity Schema design (EAV design):

Completely lacks any constraint
Ignores the temporality of entities which generates update anomalies
Ingest development is VERY fast as time variance is ignored
For queris understood before schema design, it’s very fast
Requires data duplication, which requires more development time, and LOTS of energy committed to data governance
Queries not anticipated will run extremely long

Anchor design:

Amazing opportunities for constraint available
High relational fidelity
Low manpower required for schema change
Ingest development can be automated
Bonus: Bitemporality is a walk in the park compared to the others
Assuming the correct physical platform, really low latency and high concurrency
Relatively easy to build out ingestion

Multi Temporal subject oriented design:

Easily constrained
High relational fidelity
Ingest development can be automated
Bitemporality is easy to implement, but hard to understand
Assuming the correct physical platform, really low latency and high concurrency
A bit hard on storage volume

For all of the above, there are limitations. It’s to you, the implimentor, to fill in those limitations.

There are more proposed out there, but few that hit all the high spots. Choose your design carefully, if you’re incorrect you’ll likely spend an order of magnitude more than necessary on both manpower and hardware/hosting.

And last, a warning: New design methods are being publicized it seems weekly lately. Few improve on previous design methods. Learn as much about all of them so you can compare, contrast, then do the hard work of filling in their limits.

#dataarchitecture #datawarehousing #highperformance

Source: https://www.linkedin.com/in/robert-harmon-a910b74/

High performance Data Warehouse rule #7

By tronghuy2807

Related Post

Leave a Reply Cancel reply

Bạn có bỏ lỡ

Các ngôn ngữ lập trình đáng để học vào 2024

Vector Database là gì? Ứng dụng như thế nào?

Upsert trong Database là gì? Có gì hay ho?

Làm Thế Nào Để Lưu Password An Toàn Trong Database?