[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Monday, October 07, 2019

Implementing Spearman's Rank Correlation in SQL

In my last post, I showed how to implement Pearson's Correlation as an SQL Window function with window frame support. In this post, I'll follow up with implementing Spearman's Rank correlation co-efficient in SQL.

While Pearson's correlation looks for linear relationships between two vectors (ie, you wouldn't use it for exponential relationships), Spearman's rank correlation looks for monotonicity, or in plain english, do the two values go up & down together?

So here's the really cool part. Spearman's Rank correlation co-efficient is the Pearson's correlation co-efficient of the ranks of the two vectors. We already know how to calculate Pearson's correlation co-efficient, so what we need to do here is first calculate ranks of our vectors.

We can do this using the SQL RANK function, which also works as a window function with window frame support:

RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y,

The two important things to note here are that RANK() does not take a parameter, instead you specify what you want to rank on in the ORDER BY clause, and secondly, make sure both parameters are ordered in the same direction, ASC or DESC.

Now even though the RANK() function supports window frames, you don't want to use them here. This is so because if you're using sliding windows, each row will have a different rank depending on the window, and we won't be able to correlate an outer window.

Once we have the ranks in an inner query, we can run either the standard CORR function, or the windowed CORR that we developed in the previous post on these derived columns instead:

SELECT CORR(R_X, R_Y) FROM (
    SELECT
        RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

        RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y
      FROM ...
)

If implementing this as a window function, then use R_X and R_Y as the inputs to the SUM() functions with an additional nested query.

I hope this was helpful, leave a comment or tweet @bluesmoon if you'd like to chat.

0 comments :

Post a Comment

...===...