Profiling Data Using T-SQL

Joe Hunter
2 min readJun 9, 2022

--

When working with a new dataset, one of the most important initial steps is to get a grasp of the spread of the data. This can easily be done with a simple “group by” statement for each pertinent column. But there is a more comprehensive means of doing this by leveraging some of the more recent analytical extensions to the T-SQL language. Please note this code requires SQL Server version 2012 or higher.

The first of these functions is PERCENT_RANK(). As best described by the Microsoft documentation it “…calculates the relative rank of a row within a group of rows in SQL Server. Use PERCENT_RANK to evaluate the relative standing of a value within a query result set or partition.

The second function used is CUME_DIST() and it is very similar to PERCENT_RANK(). Referring to the Microsoft documentation again: “…this function calculates the cumulative distribution of a value within a group of values. In other words, CUME_DIST calculates the relative position of a specified value in a group of values. Assuming ascending ordering, the CUME_DIST of a value in row r is defined as the number of rows with values less than or equal to that value in row r, divided by the number of rows evaluated in the partition or query result set

So armed with both these functions we can analyse one or more columns in a SQL Server database table and determine how the data values are distributed. I often have to do this to understand how a new variable introduced into a dataset is being used by end users. It can help to learn if the variable is being used, is being understood and properly represents what it is supposed to be recording.

First things first we need some data, so run script below to create a table in tempDB with the data used in this example:

Now that the dummy data is available we can run the analytical code against it and get an idea of the data distribution:

Et voila! This produces the following output. If there is a large number of columns and values I copy the output to Excel and dice and slice it there.

Output generated

--

--

Joe Hunter
Joe Hunter

Written by Joe Hunter

Digital Health Software Engineer

No responses yet