Schema
designing for the warehouse is a very crucial task for an architect and data
modeler. There are two well established design patterns for warehouse schema.
First one is the Star Schema
and other one is the Snowflake
Schema. Both have their own advantages, disadvantages and design life cycle.
This paper highlights top 5 considerations while designing a snow flake dimension.
This paper highlights top 5 considerations while designing a snow flake dimension.
1.
Different
Granularity of Facts is one of the primary reasons for snow
flaking. When a dimension cuts two different fact tables in different level of
hierarchies or two different grains that can be considered as must snow flake
scenario. For instance let us take a dimension DimCustomer containing personal
details and the address which is geography information like city, state country
etc. and there is another dimension DimStore containing basic vendor
information along with geographical details. Let us assume one transactional
fact FactSales which can be analyzed at Store, Geography and Customer level. So
here the fact table contains 2 different grains of customer i.e. Customer and
Geography and two different grains from Store i.e. Store and Geography.
Geography being common should be snow flaked from both the dimensions.
2.
Sparsity in
the Dimension mostly is the data modelers view point. Many times
we have seen that the developer merge these type of snow flaked dimensions into
primary dimensions due to the data availability or code management at later
point in project lifecycle. Let us consider a dimension DimProduct which
contains large number of fields (columns) and most of the values are optional
hence they are either populated to Unknown values or kept null. So these type
of sparse fields are collected together to create another snow flaked
dimensions. The result is simple that the primary dimension contains soiled
values and very less number of unknowns.
3.
Monster Dimensions
are an interest point for architects/developers struggling to get good
performance of the queries. Here we consider the number of records for the snow
flaking. If the number of records are enormous than we divide them into snow
flake. Another reason for monster dimensions can be fast changing dimensions
and can be solved by mini dimensions but this is not the scope of current
document. Let us take example of a date dimension i.e. DimDate containing
information till seconds. Here if we store data for 20 years the number of
records in the date dimension will be around 630,720,000(20*365*24*60*60) which
is large and difficult to manage. So it can be snow flake as DimDate containing
data till date i.e. 7300(365*20) and DimTime with 86400 (24*60*60) records.
Most of the times data modelers do not prefer to have snow flaking due to
monster dimensions but practically it is very useful and manageable.
4.
Reporting
Needs considerations are to ease the complexity of queries
attracting interest of architects/developers. Let us understand this with an example.
The requirement is to have a report which will have Product Name and Product
Category as the user selection list and list has to show all the categories.
Here if we will de-normalize the structure to have one DimProduct than it will
only have the Categories for which we have the products. This leads to the
requirement inconsistency and again snow flake design serves as rescue boat.
This point is mostly requirement driven and to ease some of the queries.
Although from the data modeler and the higher level architect viewpoints do not
support this design many times but it really is very handy.
5.
Technical
Considerations/Constraints is part of the data modeler and the
architect viewpoint but ironically it is most unseen part of it. Every database
has some limits for storing data e.g. SQL Server have a limit of 8060
bytes per row or 1024 columns per non wide table. If we have a dimension
violating this kind of limits then we have to do the vertical partitioning and snowflake
can be one of the solution. We can achieve the same by dividing into two
primary dimensions but the better approach will be to snow flake. This snow
flaking need is bit different because both primary and secondary dimension will
be at same level and will be serving the same grains.
Most of the
times we have seen hybrid approach of the schema where some of the dimensions
will be snow flaked. Choosing schema and driving solution architecture contains
many other considerations but the target should be towards star for two primary
reasons one - fast query results and two – better usability for the business
users.