Tuesday, 14 August 2012

Top 5 Considerations to Snowflake a Dimension

Schema designing for the warehouse is a very crucial task for an architect and data modeler. There are two well established design patterns for warehouse schema. First one is the Star Schema and other one is the Snowflake Schema. Both have their own advantages, disadvantages and design life cycle.

This paper highlights top 5 considerations while designing a snow flake dimension.

1.       Different Granularity of Facts is one of the primary reasons for snow flaking. When a dimension cuts two different fact tables in different level of hierarchies or two different grains that can be considered as must snow flake scenario. For instance let us take a dimension DimCustomer containing personal details and the address which is geography information like city, state country etc. and there is another dimension DimStore containing basic vendor information along with geographical details. Let us assume one transactional fact FactSales which can be analyzed at Store, Geography and Customer level. So here the fact table contains 2 different grains of customer i.e. Customer and Geography and two different grains from Store i.e. Store and Geography. Geography being common should be snow flaked from both the dimensions.

2.       Sparsity in the Dimension mostly is the data modelers view point. Many times we have seen that the developer merge these type of snow flaked dimensions into primary dimensions due to the data availability or code management at later point in project lifecycle. Let us consider a dimension DimProduct which contains large number of fields (columns) and most of the values are optional hence they are either populated to Unknown values or kept null. So these type of sparse fields are collected together to create another snow flaked dimensions. The result is simple that the primary dimension contains soiled values and very less number of unknowns.

3.       Monster Dimensions are an interest point for architects/developers struggling to get good performance of the queries. Here we consider the number of records for the snow flaking. If the number of records are enormous than we divide them into snow flake. Another reason for monster dimensions can be fast changing dimensions and can be solved by mini dimensions but this is not the scope of current document. Let us take example of a date dimension i.e. DimDate containing information till seconds. Here if we store data for 20 years the number of records in the date dimension will be around 630,720,000(20*365*24*60*60) which is large and difficult to manage. So it can be snow flake as DimDate containing data till date i.e. 7300(365*20) and DimTime with 86400 (24*60*60) records. Most of the times data modelers do not prefer to have snow flaking due to monster dimensions but practically it is very useful and manageable.

4.       Reporting Needs considerations are to ease the complexity of queries attracting interest of architects/developers. Let us understand this with an example. The requirement is to have a report which will have Product Name and Product Category as the user selection list and list has to show all the categories. Here if we will de-normalize the structure to have one DimProduct than it will only have the Categories for which we have the products. This leads to the requirement inconsistency and again snow flake design serves as rescue boat. This point is mostly requirement driven and to ease some of the queries. Although from the data modeler and the higher level architect viewpoints do not support this design many times but it really is very handy.

5.       Technical Considerations/Constraints is part of the data modeler and the architect viewpoint but ironically it is most unseen part of it. Every database has some limits for storing data e.g. SQL Server have a limit of 8060 bytes per row or 1024 columns per non wide table. If we have a dimension violating this kind of limits then we have to do the vertical partitioning and snowflake can be one of the solution. We can achieve the same by dividing into two primary dimensions but the better approach will be to snow flake. This snow flaking need is bit different because both primary and secondary dimension will be at same level and will be serving the same grains.

Most of the times we have seen hybrid approach of the schema where some of the dimensions will be snow flaked. Choosing schema and driving solution architecture contains many other considerations but the target should be towards star for two primary reasons one - fast query results and two – better usability for the business users.

1 comment: