Tuesday, 14 August 2012

Top 5 Considerations to Snowflake a Dimension

Designing the warehouse schema is a crucial task for the architect and the data modeler. There are two well-established design patterns for a warehouse schema: the Star schema and the Snowflake schema. Each has its own advantages, disadvantages and design life cycle.

This post highlights the top five considerations when designing a snowflaked dimension.

1. Different Granularity of Facts is one of the primary reasons for snowflaking. When a dimension joins two different fact tables at different levels of its hierarchy, i.e. at two different grains, that is a must-snowflake scenario. For instance, take a dimension DimCustomer containing personal details plus address (geography) information such as city, state and country, and another dimension DimStore containing basic store information along with the same geographical details. Assume a transactional fact FactSales which can be analyzed at Store, Geography and Customer level. The fact table therefore touches two different grains of customer (Customer and Geography) and two different grains of store (Store and Geography). Geography, being common, should be snowflaked out of both dimensions (see the sketch after this list).

2. Sparsity in the Dimension is mostly the data modeler's viewpoint. We have often seen developers merge these snowflaked dimensions back into the primary dimension later in the project life cycle, because of data availability or to simplify code management. Consider a dimension DimProduct that contains a large number of columns, most of which are optional and therefore either populated with Unknown values or left null. These sparse columns can be collected together into a separate snowflaked dimension. The result is simple: the primary dimension contains solid values and very few unknowns.

3. Monster Dimensions are a point of interest for architects and developers struggling to get good query performance. Here the number of records drives the snowflaking: if the record count is enormous, we split the dimension into a snowflake. (Fast-changing attributes can also produce monster dimensions; those are usually handled with mini-dimensions, which are outside the scope of this post.) Take a date dimension DimDate that stores information down to the second. If we store 20 years of data, the dimension holds around 630,720,000 rows (20 × 365 × 24 × 60 × 60), which is large and difficult to manage. It can instead be split into a DimDate at day grain with 7,300 rows (365 × 20) and a DimTime with 86,400 rows (24 × 60 × 60), as shown in the sketch after this list. Data modelers often prefer not to snowflake monster dimensions, but in practice it is very useful and manageable.

4. Reporting Needs considerations ease the complexity of queries and therefore attract the interest of architects and developers. Let us understand this with an example. The requirement is a report with Product Name and Product Category as user selection lists, and the category list has to show all categories. If we denormalize the structure into a single DimProduct, it will only contain the categories for which products exist, which breaks the requirement; once again the snowflake design serves as the rescue boat (see the query in the sketch after this list). This point is mostly requirement driven and is intended to ease some of the queries. Data modelers and higher-level architects often do not support this design, but it really is very handy.

5. Technical Considerations/Constraints are part of the data modeler's and architect's viewpoint, yet ironically they are the most overlooked part of it. Every database has limits on how a row can be stored; for example, SQL Server has a limit of 8,060 bytes per row and 1,024 columns per non-wide table. If a dimension violates such limits, we have to partition it vertically, and a snowflake can be one of the solutions. We could achieve the same by splitting it into two primary dimensions, but the better approach is to snowflake. This snowflaking need is a bit different because both the primary and the secondary dimension sit at the same level and serve the same grain.
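To make considerations 1, 3 and 4 concrete, here is a minimal T-SQL sketch. All object and column names (DimGeography, DimTime, DimProductCategory and so on) are illustrative assumptions, not a prescribed design.

-- Consideration 1: Geography snowflaked out of both DimCustomer and DimStore
CREATE TABLE DimGeography (
    GeographyKey INT PRIMARY KEY,
    City         NVARCHAR(50),
    StateName    NVARCHAR(50),
    Country      NVARCHAR(50)
);
CREATE TABLE DimCustomer (
    CustomerKey  INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    GeographyKey INT REFERENCES DimGeography (GeographyKey)
);
CREATE TABLE DimStore (
    StoreKey     INT PRIMARY KEY,
    StoreName    NVARCHAR(100),
    GeographyKey INT REFERENCES DimGeography (GeographyKey)
);

-- Consideration 3: split the second-grain monster date dimension into a
-- day-grain DimDate (7,300 rows for 20 years) and a DimTime (86,400 rows)
CREATE TABLE DimTime (
    TimeKey  INT PRIMARY KEY,   -- seconds since midnight, 0..86399
    [Hour]   TINYINT,
    [Minute] TINYINT,
    [Second] TINYINT
);

-- Consideration 4: a snowflaked DimProductCategory lets the report list every
-- category, even the ones that currently have no products
SELECT c.CategoryName, COUNT(p.ProductKey) AS ProductCount
FROM DimProductCategory AS c
LEFT JOIN DimProduct    AS p ON p.CategoryKey = c.CategoryKey
GROUP BY c.CategoryName;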

Most of the time we see a hybrid schema in which some of the dimensions are snowflaked. Choosing a schema and driving the solution architecture involves many other considerations, but the target should lean towards a star schema for two primary reasons: one, faster query results, and two, better usability for business users.

Monday, 13 August 2012

SSAS Role Members Limit


Working with AMO (Microsoft.AnalysisServices) sometimes leaves you amazed, and adding role members is one such case. SSAS has a limit on the number of members that can be added to a role, and it is 1,820. Adding the next member throws a much weirder error than expected, shown below.

Microsoft.AnalysisServices.OperationException: The following system error occurred:  The parameter is incorrect.

   at Microsoft.AnalysisServices.AnalysisServicesClient.SendExecuteAndReadResponse(ImpactDetailCollection impacts, Boolean expectEmptyResults, Boolean throwIfError)
   at Microsoft.AnalysisServices.AnalysisServicesClient.Alter(IMajorObject obj, ObjectExpansion expansion, ImpactDetailCollection impact, Boolean allowCreate, XmlaWarningCollection warnings)
   at Microsoft.AnalysisServices.Server.Update(IMajorObject obj, UpdateOptions options, UpdateMode mode, XmlaWarningCollection warnings, ImpactDetailCollection impactResult)
   at Microsoft.AnalysisServices.Server.SendUpdate(IMajorObject obj, UpdateOptions options, UpdateMode mode, XmlaWarningCollection warnings, ImpactDetailCollection impactResult)
   at Microsoft.AnalysisServices.MajorObject.Update(UpdateOptions options, UpdateMode mode, XmlaWarningCollection warnings)
   at Microsoft.AnalysisServices.MajorObject.Update()
   at CheckRoleCreation.Program.AddMembersToRole(Database db, Int32 num, Role role)



On investigating further we found that:

1. There is a limit on the number of members in a role, and it is 1,820. We can reproduce it with the code segment below.

using System;
using A = Microsoft.AnalysisServices;    // AMO
using DS = System.DirectoryServices;

// Fields assumed by the snippet: a running counter for unique member names and a
// name format such as "{0}AmoRoleMember{1}" (the actual value is illustrative).
int MemberCounter = 0;
string RoleMemberName = "{0}AmoRoleMember{1}";

void AddMembers(A.Database db)
{
    // 1,820 members succeed; asking for 1,821 makes the final Update() fail
    foreach (A.Role r in db.Roles)
        AddMembersToRole(db, 1821, r);
}
void AddMembersToRole(A.Database db, int num, A.Role role)
{
    if (null != role)
    {
        // remove all existing members from the role
        for (int count = role.Members.Count - 1; count >= 0; count--)
            role.Members.RemoveAt(count);
        role.Update();

        // create local Windows accounts and add them to the role
        int i = MemberCounter;
        for (int count = MemberCounter; count < (num + i); count++, MemberCounter++)
        {
            AddMembersToLocalComp(String.Format(RoleMemberName, String.Empty, MemberCounter), false);
            role.Members.Add(new A.RoleMember(String.Format(RoleMemberName, Environment.MachineName + @"\", MemberCounter)));
        }
        role.Update();   // throws OperationException once the member count exceeds 1,820

        Console.WriteLine("Total Role Members in role {0} is: {1}", role.Name, role.Members.Count);
    }
}
void AddMembersToLocalComp(string memberName, bool printMemberInfo)
{
    // create a local Windows user so that the role member resolves to a real account
    string path = "WinNT://" + Environment.MachineName;
    string schema = "User";
    DS.DirectoryEntry localMc = new DS.DirectoryEntry(path);
    try
    {
        DS.DirectoryEntry newUser = localMc.Children.Add(memberName, schema);
        newUser.CommitChanges();
    }
    catch (System.Runtime.InteropServices.COMException e)
    {
        if (e.ErrorCode != -2147022672) // mask the "user already exists" error
            throw;
    }
    if (printMemberInfo)
        foreach (DS.DirectoryEntry e in localMc.Children)
            if (e.SchemaClassName.Equals(schema)) Console.WriteLine("MemberName: {0}", e.Name);
}



2. We can have more than 1,820 roles; the limit applies only to the role members (individual users or groups) within a single role.

3. We can have more than 1,820 members in total, as long as they are scattered across multiple roles.

4. There are many workarounds; one of them is to check the member count before it crosses 1,820 and, when needed, add one more role with the same permissions to hold the overflow members. These situations usually arise only when roles are populated by automated scripts.

Monday, 23 July 2012

Analysis of Business Workflows & Pipelines

Systematic expansion of a business across geographies brings in both simple and complex business processes. These processes combine numerous workflows & pipelines representing the life cycle of business events. Many data warehouse (DW) sponsors deprioritize the analytical needs of workflows & pipelines, which ironically are a very rich source of analytical reports that can act as a game changer for that particular business.


Let us start with an example of a business workflow for an online retail store. The company, named “eBayz”, sells commodities online. Customers can log into the website and place orders. For out-of-stock items a purchase order is raised to the respective supplier, the item is shipped to the warehouse, and from there it follows the in-stock workflow. In-stock products are dispatched from the warehouse to the local area store and from there delivered to the customer. Payment can be made while placing the order or at the time of delivery.


With the existing analytics, business users found that customer complaints related to delivery issues had increased. The “Customer Complaints” analytics show that the trend graph has been touching all-time highs for the last three months, and the reactions captured were...


Alas! This is going to sink our business. What do we do now?
What is the problem?
Where is the problem?
How do we know about the problem?

These are business questions which can’t be answered from the regular analytics. Here the accumulated snapshot fact serves as the rescue boat. The whole life cycle/pipeline/workflow of the order is captured in an accumulated snapshot fact table: every event has a date, and each event date is captured as a column. For our example the fact table design may look like the one below to capture the complete pipeline.
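The original post showed this design as a diagram; a minimal T-SQL sketch along the same lines is below. Apart from OrderPlacedDateId, DeliveryDateId and ArrivedToStoreLag, which the text refers to, the column names are illustrative assumptions.

-- Accumulated snapshot fact: one row per order, one date column per pipeline event,
-- plus the time lags between events. Each DateId column role-plays against DimDate.
CREATE TABLE FactOnlineOrders (
    OrderId                  INT PRIMARY KEY,
    CustomerKey              INT,
    OrderPlacedDateId        INT,
    PaymentDateId            INT,
    ShippedToWarehouseDateId INT,
    ArrivedToStoreDateId     INT,
    DeliveryDateId           INT,
    ArrivedToStoreLag        INT,    -- days from warehouse shipment to local-store arrival
    OverallLag               INT,    -- days from order placement to delivery
    OrderAmount              MONEY
);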




Interestingly, the time lags are also part of the fact table. With this design we can find the answers to the above business questions: we can very easily see where the lag is largest, and that is the likely problem area. To answer the questions above, it was found that for the same time period the trend chart for “ArrivedToStoreLag” was also touching peaks, implying that the problem area was late shipping of products to the local store. On investigation it was found that, to save shipping cost, products were kept on hold for more days, which resulted in the peaks in the “Customer Complaints” analytics. This answers all the related questions while raising some new ones, such as...

What drove the cost saving for shipping? Was it the low in-stock commodities?
How many new orders and how many undelivered orders do we have on a daily basis?


These new business questions are also supported by the current design. New orders can be queried directly by joining the DimDate and FactOnlineOrders entities, whereas undelivered orders are really a daily snapshot calculated from more than one role of the date dimension. Here the undelivered orders for 4-June-2012 are the records having a delivery date greater than, and an OrderPlacedDateId less than, 4-June-2012, involving two different roles of the date dimension.
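As a sketch of that calculation, assuming the column names above and integer YYYYMMDD date keys:

-- Orders placed before 4-June-2012 and not yet delivered on that date.
-- Not-yet-occurred events are assumed to be stored as NULL (a future placeholder
-- key would work the same way with an adjusted predicate).
DECLARE @AsOfDateId INT = 20120604;

SELECT COUNT(*) AS UndeliveredOrders
FROM FactOnlineOrders
WHERE OrderPlacedDateId < @AsOfDateId
  AND (DeliveryDateId IS NULL OR DeliveryDateId > @AsOfDateId);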

Another question, from the DW designer's perspective, is: “Is this the only design supporting the above features and answering all the business questions in the most performant manner?” The answer lies in the individual's assessment of the problem statements being targeted by the analytics. Here is one more design approach, capturing the pipeline in a transactional manner, i.e. storing the life cycle in multiple rows rather than putting it together as a single row. Obviously it has its own pros and cons, and they are discussed below along with the changed fact table and the newly introduced dimension.
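The original post showed this second design as a diagram; a minimal T-SQL sketch of the idea is below. The table and column names (FactOrderEvents, DimOrderEvent, EventGroup and so on) are illustrative assumptions.

-- Event dimension: EventGroup lets more than one life cycle live in the same
-- dimension (e.g. 'OnlineOrder' vs. 'PurchaseOrder') without making it a junk dimension
CREATE TABLE DimOrderEvent (
    EventKey   INT PRIMARY KEY,
    EventName  NVARCHAR(50),     -- OrderPlaced, ShippedToWarehouse, ArrivedAtStore, Delivered, ...
    EventGroup NVARCHAR(50)      -- which pipeline/life cycle the event belongs to
);

-- Transactional-grain fact: one row per order per event instead of one wide row per order
CREATE TABLE FactOrderEvents (
    OrderId     INT,
    EventKey    INT REFERENCES DimOrderEvent (EventKey),
    EventDateId INT,             -- role-playing reference to DimDate
    CustomerKey INT
);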




The latter design moves the multiple columns into rows, i.e. the fact table is transformed from accumulated snapshot grain to transactional grain. Event grouping gives us the flexibility to hold more than one life cycle in the same dimension without turning it into a junk dimension. The new design answers the questions asked in the initial state, but only restrictively. In the former design we can easily keep multiple time lags, such as the individual lags and the overall lag of the process, which cannot be done with the latter approach. Moreover, it is also difficult to get other advanced answers, such as the daily count of undelivered orders, orders waiting at the warehouse, or orders waiting at the store, which we were getting easily from the former approach. One design that answers these questions is to create one table per question on a daily basis, but that needs very strong archive and purge policies or the data will grow exponentially.
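For comparison, counting undelivered orders against the transactional-grain sketch needs an anti-join per event, noticeably clumsier than the single-row predicate shown earlier (names assumed from the sketches above):

-- Orders with an OrderPlaced event before 4-June-2012 and no Delivered event by that date
SELECT COUNT(DISTINCT p.OrderId) AS UndeliveredOrders
FROM FactOrderEvents AS p
JOIN DimOrderEvent   AS pe ON pe.EventKey = p.EventKey AND pe.EventName = 'OrderPlaced'
WHERE p.EventDateId < 20120604
  AND NOT EXISTS (SELECT 1
                  FROM FactOrderEvents AS d
                  JOIN DimOrderEvent   AS de ON de.EventKey = d.EventKey AND de.EventName = 'Delivered'
                  WHERE d.OrderId = p.OrderId
                    AND d.EventDateId <= 20120604);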


With the above discussion, the accumulated snapshot provides the flexible and preferred design. The Kimball Group advocates serving the analytical needs of workflows & pipelines using accumulated snapshots; the detailed behavior of accumulated snapshot facts is covered in Design Tips #37, #130 & #145 at the Kimball Group. Microsoft's SSAS supports the workflow & pipeline scenarios with accumulated snapshot facts very nicely, with some tricks in place, particularly around role-playing dimensions. The blog post “Counting 'in-flight' events in MDX” from Mosha discusses in detail how to handle these scenarios in the cube and in MDX.

Friday, 20 July 2012

Handling Multiple Fiscal Calendars

As enterprises expand their horizons, the enterprise data warehouse faces multiple challenges, and date and time is one of them. It is amazingly interesting and challenging to work with the date dimension in an enterprise scenario. The same challenge came to us while developing data warehouse changes for one of our customers.

This post discusses solution approaches for that problem. Adding multiple hierarchies is just like giving different names to the same date and can be done easily, as shown below. The solutions below do not deal with a fiscal year starting in the middle of a month, e.g. fiscal year 2012 starting on 15-April-2011; we assume that the fiscal year starts on the 1st of some month, e.g. fiscal year 2012 starting on 01-April-2011.

Approach 1: There are several solutions to the problem, and one of the most common is to add multiple date hierarchies to the same date dimension, one for each consumer's fiscal year. Diagram 1 visualizes the data model.


Approach 2: The first approach works nicely if the data warehouse has only a few consumers with different fiscal years; adding hundreds of consumers, each with its own fiscal year, would ruin it. A small modification to the approach helps maintain all the fiscal years for any number of consumers: since every one of those fiscal years must start in one of the twelve months, the date dimension is modified to include all twelve hierarchies. Diagram 2 visualizes the data model.

Each consumer uses the hierarchy matching its fiscal year; e.g. if the fiscal year for 'Consumer-1' starts on 01-July, it will use JulFiscalYear, JulFiscalQuarter, JulFiscalMonth and so on. This can be extended to weeks and other fields, but we just have to make sure that the row does not exceed the maximum row size.
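As a rough illustration of Approach 2, each of the twelve sets of fiscal columns can be derived from the calendar date with a simple month offset. The sketch below shows only the July-start set; DimDate and CalendarDate are assumed names, and the fiscal year is assumed to be named after the calendar year in which it ends (01-Jul-2011 to 30-Jun-2012 is FY2012).

-- Fiscal year starting 01-July: shift the calendar date by 6 months (13 - 7)
-- so that July maps to fiscal month 1 and the year rolls over on 1 July
SELECT
    CalendarDate,
    YEAR(DATEADD(MONTH, 6, CalendarDate))              AS JulFiscalYear,
    DATEPART(QUARTER, DATEADD(MONTH, 6, CalendarDate)) AS JulFiscalQuarter,
    MONTH(DATEADD(MONTH, 6, CalendarDate))             AS JulFiscalMonth
FROM DimDate;

The same pattern, with an offset of 13 minus the fiscal start month, yields the other eleven sets of columns.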