Monday, 16 February 2015

NoSQL DBs - Feature Comparison


"Hi there, come on in and welcome again to the ocean of data. Which DB solution are you planning to ride the waves of data?" The answer can be easy if only RDBMS (ACID) based solutions are point of consideration. This becomes little difficult to choose among BASE (No-SQL) providers.
Here is a small matrix of feature comparison on BASE(No-SQL) databases and probably it can help choosing one. Feel free to comment back if you guys think something has changed since I last updated it.


Thursday, 29 January 2015

Agile Delivery - Awards & Rewards




"We believe in agile fashion of delivery to add more value to the delivered/working product."
This sounds like having great understanding the scrum triangles like "Stakeholders => Scrum Team => Working product" OR "Product Owner => Scrum Master => Team" and execution excellence by eliminating the waste.
But is this all? May be not? What else is left out?
Come-on guys it is "Awards". If the company is really agile than all the awards should be team oriented. Which means the "The STAR", "Mr. DEPENDABLE" or any other individual awards should be stopped. If it is a failure it is for the team and if it is a success it is also for team. If any individual is performing good or bad then it is team's behaviour.
Now the question is how to recognize the good work of individual and the simple answer is "Rewards". Good work to be reflected in promotions and money.

Friday, 16 January 2015

Moving Trends - ACID to BASE (NoSql)

Sales guy - "Hey! Are you guys just stuck with SQL Databases?" and the reply "Yes ..." leaves the sales guys with a thought of plenteous opportunities and weirdy expressions as if he is talking to someone from stone age. Surely the trends are moving from ACID to BASE i.e. SQL to NoSQL (Not only SQL) but the question is should the business adopt? Which solution should they adopt.
There can be multiple answers but my simple take is "Yes, but only if the business needs it and it is making the transformation to the existing business.".
Most of us know but trying to make the point again that No-SQL have nothing to do with the SQL or query language but it is just related to the transaction processing side i.e. reads and writes of the system. Therefore I am considering ACID (Atomicity, Consistency, Isolation, Durability) for the SQL databases and BASE (Basically Available, Soft state, Eventual consistency) for the No-SQL database.
Another big debate can start on "Is the whole planet moving from ACID to BASE?" And the answer is "Noooo !". We are not throwing away whatever we learnt from ACID theories and an excellent example can be Spanner. A planet scale database with ACID properties. Yes I know the CAP (Consistency, Availability, Partition tolerance) theorem and having my fingers crossed for Spanner.
Best wishes for a very-2 happy new year 2015.

Tuesday, 30 December 2014

SQL Server - Current Execution Statistics

Happy holidays! Most of the time we hear that "the queries are not working as expected." Such simple statements push the investigation back to the DBAs or the SQL developers, and the most common question is what exactly is happening inside SQL Server. Thanks to DMVs, this question can easily be answered without admin permissions; all that's needed is the VIEW SERVER STATE permission.
The query in the linked file shows a snapshot of SQL Server at that particular time, including CPU time, memory, blocking, I/O issues, the exact query being executed, the query plan in use and much more. This has helped me a lot in the past and I hope it helps you as well. Shoot me any questions you think are not answered by this query and I'll try my best to answer them.
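Since the linked query itself is not reproduced here, a minimal sketch of this kind of DMV snapshot query (my own sketch, not the exact one behind the link) could look like the following; it joins the standard DMVs sys.dm_exec_requests, sys.dm_exec_sql_text and sys.dm_exec_query_plan.

-- Minimal sketch (not the original linked query): currently executing requests
-- with their SQL text and plan. Requires VIEW SERVER STATE.
SELECT  r.session_id,
        r.status,
        r.cpu_time,
        r.total_elapsed_time,
        r.reads,
        r.writes,
        r.wait_type,
        r.blocking_session_id,
        t.text        AS running_sql,
        p.query_plan  AS running_plan
FROM    sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
WHERE   r.session_id <> @@SPID;  -- exclude this monitoring session itself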

Sunday, 21 December 2014

SSAS Tabular Model - Drillthrough & Other Actions

BI dashboards, reports etc. throw numbers and graphs onto the screen, but data analysts always want to know how those numbers are being cooked up. SSAS provides actions as one method to show that. Actions are far more powerful than just drillthrough, though: they can add a lot of context to the data being shown.
With the tabular model, actions used to be somewhat difficult to define, but with the help of BIDS Helper they can be created with ease. Here is the reference link showing how.
http://bidshelper.codeplex.com/wikipage?title=Tabular%20Actions%20Editor

Finding Outliers - Simple Statistical Way

Outliers? Why do we need them?
In statistics, a value that is too big or too small is an outlier. We need to consider outliers separately for good business decisions and clean graphs. Business decisions are mostly taken around (a) the most frequent case, i.e. the center of the population, and (b) the exceptions, i.e. the outliers, which are the points far away from the center. My take is that outliers should never be thrown away, but treated as special cases for further analysis; they may open new doors of possibilities.
The most common way to find an outlier is by using quartiles or percentiles: Quartile 1 (Q1) == 25th percentile, Quartile 2 (Q2) == median == 50th percentile, and Quartile 3 (Q3) == 75th percentile. To find the accepted data range we need the spread of the data, which is usually measured either by the standard deviation or by the IQR (interquartile range), i.e. Q3 - Q1. The simple formula for flagging an outlier is given below.
In the simplest form, any data point that falls outside the range (Q1 - 1.5 * IQR) to (Q3 + 1.5 * IQR) can be treated as a candidate for outlier analysis. There are many other methods, but this one makes the most sense for common business analysis. The 1.5 is a statistically agreed constant.
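As a rough sketch of how this can be done directly in T-SQL (the table dbo.Sales and column Amount below are hypothetical), PERCENTILE_CONT can compute Q1 and Q3 and the 1.5 * IQR fences:

-- Minimal sketch (hypothetical table/column): flag values outside
-- [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as candidates for outlier analysis.
WITH q AS
(
    SELECT DISTINCT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Amount) OVER () AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Amount) OVER () AS Q3
    FROM dbo.Sales
)
SELECT s.*
FROM   dbo.Sales AS s
CROSS JOIN q
WHERE  s.Amount < q.Q1 - 1.5 * (q.Q3 - q.Q1)
   OR  s.Amount > q.Q3 + 1.5 * (q.Q3 - q.Q1);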

Sunday, 14 December 2014

Don't disable SQL Jobs like ...

How to disable a SQL Server job?
This hardly seems like a good enough question to warrant a post. But sometimes it is. In the recent past we found that SQL Server jobs were disabled yet still started at their scheduled time. Amazing! What the f**k is happening? Why is this happening? Is SQL Server not reliable enough?
One of the curious operations support guys figured out that the jobs had been disabled by updating the metadata directly with the query "UPDATE [dbo].[sysjobs] SET [enabled] = 0 WHERE name IN (...)". The "Job Activity Monitor" showed all the jobs as disabled. So what's wrong?
SQL Server Agent caches the job status. A plain UPDATE statement changes only the metadata, not the cache. So it is recommended to use the SSMS UI or call sp_update_job, both of which update the cache as well.
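For reference, disabling a job the supported way looks like this; N'Nightly Load' is just a placeholder job name.

-- sp_update_job changes the metadata and refreshes the Agent cache.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Nightly Load',
    @enabled  = 0;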

Monday, 28 July 2014

Verifying Credit Card Numbers

Recently we came across an interesting problem of credit card theft, and as part of the investigation we found tons of digit strings matching the pattern of a credit card. With that much data it was difficult to report the correct numbers: any random number of 16 or 15 digits (and so on) could be a credit card, some identifier or just a dummy number in the system. The problem was really interesting.

After some research on the web we found that most credit card companies do not generate random numbers; they use algorithms to generate them. The Luhn test is one of the widely accepted algorithms for distinguishing credit card numbers. The steps and SQL code below can be used to validate a credit card number. We take 49927398716 as the example number for the steps.

Luhn’s Test:
  1. Get the credit card number. For our example we assume 49927398716 as the given number.
  2. Reverse the digits of the given number. For our example it becomes 61789372994.
  3. Take the sum of the digits in the odd positions of the reversed number: 6 + 7 + 9 + 7 + 9 + 4 = 42.
  4. Take the digits in the even positions, i.e. 1, 8, 3, 2, 9, and multiply each of them by two. For our example the digits become 2, 16, 6, 4, 18.
  5. Wherever a number is greater than nine, sum its digits to make it a single-digit number. The numbers from step 4 become 2, 7, 6, 4, 9.
  6. Sum the numbers from step 5: 2 + 7 + 6 + 4 + 9 = 28.
  7. Now sum the results of step 3 and step 6: 42 + 28 = 70.
  8. If the sum ends in zero then it can be a valid credit card number. In other words, if the remainder of the step 7 total divided by 10 is zero, then it can be a valid credit card number. In our case 70 % 10 = 0, so it can be a valid credit card number.


Sample SQL Code for Luhn’s Test:

Here is a T-SQL script that performs the above test; it can of course be implemented in any other manner as well.


DECLARE
    @cardNo NVARCHAR(20) = '49927398716'
    ,@reverseCardNo NVARCHAR(20)
    ,@sum SMALLINT
    ,@number SMALLINT
    ,@counter SMALLINT

SELECT
    @cardNo = LTRIM(RTRIM(@cardNo))
    ,@reverseCardNo = REVERSE(@cardNo)
    ,@sum = 0
    ,@counter = 1

WHILE(@counter <= LEN(@cardNo))
BEGIN
    SELECT
        @number = TRY_CONVERT(SMALLINT, LEFT(@reverseCardNo, 1)) -- next digit of the reversed number

    IF(@counter%2 = 0) -- digit in an even position: double it
    BEGIN
        SELECT
            @number *= 2
        IF (LEN(@number) > 1) -- doubling gave a two-digit value: add its digits
        BEGIN
            SELECT
                @number = TRY_CONVERT(SMALLINT, LEFT(@number, 1)) + TRY_CONVERT(SMALLINT, RIGHT(@number, 1))
        END
    END
   
    SELECT
        @sum += @number
        ,@counter += 1
        ,@reverseCardNo = RIGHT(@reverseCardNo, LEN(@reverseCardNo)-1)
END

IF(@sum > 0 AND @sum%10 = 0)
    PRINT CONCAT('Test passed. ', @cardNo, ' can be a potential card number.')
ELSE
    PRINT CONCAT('Test failed. ', @cardNo, ' invalid card number.')

Wednesday, 29 May 2013

Considerations – Loading Oracle Data Using SSIS

With the changing dynamics of technology, extracting data from heterogeneous databases is an enterprise need. Microsoft SSIS serves as a good ETL tool for extracting data from multiple, heterogeneous sources. Below are some considerations and best practices to take care of during the pull.


Pulling Data from Oracle

There are several methods to pull data from Oracle. This link provides detailed documentation for creating the ETL and planning the SSIS solution using the Oracle OLE DB provider. Below are some more points to consider while pulling data from Oracle.


Choosing a Provider

Choosing a provider can change the speed of the data pull by multiple folds, hence it is one of the most critical parts of planning the ETL. Currently the options below can be used.

  1. The Microsoft OLE DB provider for Oracle. It is available only in a 32-bit version and is deprecated.
  2. The Oracle OLE DB provider, which can be downloaded from here. It is widely used, very stable in terms of performance and maturity, and causes the least amount of issues.
  3. The fastest providers are the Microsoft connectors by Attunity for Oracle, which support both 32-bit and 64-bit versions; the SQL Server 2012 version can be downloaded from here. These providers tend to be very fast, almost 20 percent faster than the Oracle providers.

In-Memory Transformations & Minimized/Required Column Approach

SSIS is an in-memory pipeline. It works very fast for in-memory operations, but if memory is full the transformations spill to disk and performance starts degrading. The methods below can be considered to keep the work in memory.

  1. Ensure that only the required, minimized set of columns is pulled from the Oracle source, either through views or by using a SQL command that selects only the required columns. Minimizing the columns keeps the data size per row to a minimum.
  2. Prefer SQL queries over the table drop-down option. The drop-down option uses a different set of commands to prepare the SQL and then execute it; because of this, SSIS can sometimes hang while Management Studio returns results for the same query very quickly. The same behaviour is visible when selecting a table from the drop-down and the list of tables takes a long time to show up.
  3. Whether the transformations stay in memory can be checked with the "Buffers spooled" perfmon counter for SSIS (see the details about SSIS performance counters). The initial value of this counter is zero; anything above zero indicates that the SSIS pipeline is swapping to disk.

Data Type Transformations

Using efficient data types always ensures the best performance with minimum resources. Consider a big integer value like '19876234987': it can be stored as an integer or as characters. Numbers use different algorithms than characters for indexing, comparing etc., so in this case characters give slower performance, take more space to store the data and use more memory (RAM) during the ETL. Best practice is to use the optimal length for each data type and make it as narrow as possible. For monetary values the MONEY data type is best and it is fast; NUMERIC and FLOAT carry larger precision, which can lead to performance issues.
Below is the default mapping of the data types from Oracle to SSIS.


Oracle Data Type -> Integration Services Data Type
VARCHAR2 -> DT_STR
NVARCHAR2 -> DT_WSTR
NUMBER -> DT_R8
NUMBER(P, S) -> When the scale is 0: DT_I1, DT_I2, DT_I4 or DT_NUMERIC(P, 0) according to the precision (P); otherwise DT_NUMERIC(P, S)
LONG -> Not supported
DATE -> DT_DBTIMESTAMP
RAW -> DT_BYTES
LONG RAW -> Not supported
ROWID -> Not supported
CHAR -> DT_STR
User-defined type (object type, VARRAY, nested table) -> Not supported
REF -> Not supported
CLOB, NCLOB, BLOB, BFILE -> Not supported
TIMESTAMP, TIMESTAMP WITH TIME ZONE, INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND, TIMESTAMP WITH LOCAL TIME ZONE -> DT_STR
UROWID -> Not supported

We can use the Data Conversion transformation to convert between data types, and the package might look like the image below. Casting data types is important, but too much casting degrades the performance of the pull as well. Staging tables play a vital role for these kinds of issues, as sketched below.
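As a rough illustration (the table and column names below are hypothetical), landing the data in a staging table and casting once in a set-based statement can replace a long chain of Data Conversion transformations:

-- Minimal sketch (hypothetical tables): cast once, set-based, while moving
-- rows from the staging table into the final table.
INSERT INTO dbo.Orders (OrderId, OrderDate, OrderAmount)
SELECT  CAST(OrderId     AS BIGINT),
        CAST(OrderDate   AS DATETIME2(0)),
        CAST(OrderAmount AS DECIMAL(19, 4))
FROM    dbo.StagingOrders;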

Using Bulk Operations and Minimal Logging

While inserting data into SQL Server, minimally logged operations provide the maximum performance. Below are some of the options that can be used.
  1. In the data flow task, use the fast load option to put data into the target tables.
  2. For a fast ETL the foreign keys can be removed so inserts/updates run in parallel; if needed, best practice is to drop all the foreign keys and recreate them once the data pull is complete. If a large amount of data is changing in the target table, dropping primary keys and other indexes is also part of best practice.
  3. Use TABLOCK if the target table is not in frequent use, and disable check constraints (see the sketch after this list).
  4. Set a proper value for "Maximum insert commit size" to ensure batch-wise inserts.
  5. Prefer NOLOCK for lookups and other tasks to minimize locking impact.
  6. Using SQL Server partitioning and the SWITCH statement increases performance by several folds when working with large data.
  7. Minimize blocking and slow-performing SSIS operations, e.g. sorts and aggregations like SUM, MIN, MAX etc.
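The T-SQL equivalent of the fast load option with a table lock is roughly the following (hypothetical table names); with a heap target and the simple or bulk-logged recovery model such an insert can be minimally logged:

-- Minimal sketch (hypothetical tables): bulk insert with TABLOCK so the engine
-- can use minimal logging (heap target, simple or bulk-logged recovery model).
INSERT INTO dbo.TargetSales WITH (TABLOCK)
        (OrderId, OrderDate, Amount)
SELECT  OrderId, OrderDate, Amount
FROM    dbo.StagingSales;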

Revisit ETL Design

Every ETL tool has its pros and cons. To get optimal performance out of SSIS, the design considerations below can be taken into account; most of them apply to large sets of data.
  1. Organize the data using staging tables in such a way that TRUNCATE is used rather than DELETE.
  2. To UPDATE large amounts of data from a data flow task, prefer landing the rows in a staging table and running a set-based UPDATE from SQL Server rather than row-by-row OLE DB calls (see the sketch after this list); for small dimensions row-by-row calls are fine.
  3. Set-based operations and transformations perform best in SQL Server, as it is designed for set-based work; row-by-row transformations are best left to SSIS.
  4. Date and time conversions tend to be faster in SQL Server than in SSIS, so it is better to do them in SQL Server.
  5. A delta data load should be evaluated well before implementation. If a large amount of data changes, a full load performs better than the delta detection logic.
  6. The ideal pull from a heterogeneous source looks like the diagram below; different parts of the ETL can be removed or added as needed.
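A minimal sketch of point 2 (with hypothetical table and column names), replacing a per-row OLE DB Command with one set-based statement:

-- Minimal sketch (hypothetical tables): update the dimension from a staging
-- table in one set-based statement instead of row-by-row OLE DB Command calls.
UPDATE  d
SET     d.CustomerName = s.CustomerName,
        d.City         = s.City
FROM    dbo.DimCustomer     AS d
JOIN    dbo.StagingCustomer AS s
        ON s.CustomerKey = d.CustomerKey;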

Thursday, 7 February 2013

(5 + 1) Steps to Resolve Network Related Instance Error


This post was inspired by spending ample time resolving a very common SQL Server error while connecting to an instance: "A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible …". Below are the steps to follow to resolve the error.

1.    Check Services: Make sure "SQL Server" and "SQL Server Browser" are up and running. We can check this by going to "All Programs -> SQL Server -> Configuration Tools -> SQL Server Configuration Manager", or by searching "SQL Server Con" in Windows search.

2.    Check Client Protocols: Now check that the required native client protocol, e.g. TCP/IP, Named Pipes etc., is enabled. Most of the time this information appears towards the end of the error message. Enable the protocol if it is not already enabled. Check the image below.

3.    Check Server Protocols: Check "SQL Server Network Configuration" and make sure the respective protocol is enabled on the server side as well. Check the image below.

4.    Know the ports and open them: Check the properties of the protocol by right-clicking and selecting Properties, or by double-clicking it. Select the "IP Addresses" tab and scroll to the end of the window until "IPAll" appears, as in the image below. Check the port on which SQL Server is listening; in this case it is 3680 (not the default 1433) and it is dynamic. A static port can also be specified in the "TCP Port" section. Now the same port (3680 in this case) needs to be opened on the machine by creating an inbound rule (or exception) in Windows Firewall. The port for the "SQL Server Browser" service may need to be opened as well (a query-based way to find the listening port is shown at the end of this post).

5.    Check Remote Connections: Make sure SQL Server is enabled for remote connections by running the T-SQL command below, or from the screen reached via "Right-click the server in SSMS -> Properties -> Connections".
EXEC sp_configure 'remote access', 1;
RECONFIGURE WITH OVERRIDE;
GO


6.    Force Protocol and Port: While connecting from SSMS we can also specify the port and protocol to connect with.
The protocol can be forced by prefixing the protocol name, e.g. "tcp:ServerName,PortNumber" for TCP/IP or "np:ServerName" for named pipes. The same can be achieved from the "Options >>" button in the connection dialog as well.
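Related to step 4, once any local connection works (for example over shared memory), the TCP port SQL Server is listening on for that connection can also be read from a standard DMV. A minimal sketch:

-- Report the TCP port the current connection is using.
-- Returns NULL when the connection is not over TCP (e.g. shared memory).
SELECT  local_tcp_port
FROM    sys.dm_exec_connections
WHERE   session_id = @@SPID;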