[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Monday, October 07, 2019

Implementing Spearman's Rank Correlation in SQL

In my last post, I showed how to implement Pearson's Correlation as an SQL Window function with window frame support. In this post, I'll follow up with implementing Spearman's Rank correlation co-efficient in SQL.

While Pearson's correlation looks for linear relationships between two vectors (ie, you wouldn't use it for exponential relationships), Spearman's rank correlation looks for monotonicity, or in plain english, do the two values go up & down together?

So here's the really cool part. Spearman's Rank correlation co-efficient is the Pearson's correlation co-efficient of the ranks of the two vectors. We already know how to calculate Pearson's correlation co-efficient, so what we need to do here is first calculate ranks of our vectors.

We can do this using the SQL RANK function, which also works as a window function with window frame support:

RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y,

The two important things to note here are that RANK() does not take a parameter, instead you specify what you want to rank on in the ORDER BY clause, and secondly, make sure both parameters are ordered in the same direction, ASC or DESC.

Now even though the RANK() function supports window frames, you don't want to use them here. This is so because if you're using sliding windows, each row will have a different rank depending on the window, and we won't be able to correlate an outer window.

Once we have the ranks in an inner query, we can run either the standard CORR function, or the windowed CORR that we developed in the previous post on these derived columns instead:

SELECT CORR(R_X, R_Y) FROM (
    SELECT
        RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

        RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y
      FROM ...
)

If implementing this as a window function, then use R_X and R_Y as the inputs to the SUM() functions with an additional nested query.

I hope this was helpful, leave a comment or tweet @bluesmoon if you'd like to chat.

Wednesday, October 02, 2019

Implementing Pearson's CORR as a database window function

I recently needed to find the Pearson's Correlation Coefficient between two columns in a Snowflake database. Now Snowflake and PostgreSQL both support a CORR aggregate function that correlates along a GROUP BY. Snowflake additionally supports CORR as a window function, but only for use with PARTITION BY. It does not support window frames. Other databases like MySQL do not have a CORR function at all. See SQL in a nutshell for more information on database support.

If you want to know about window functions, Julia Evans (@b0rk) has a great SQL tip on them.

In my case, I needed to find the Pearson's coefficient along a sliding window of rows. ie, for a list of N rows, I needed N-k coefficients, one for each sliding window of size k. As far as I could tell, only Oracle supports this functionality, and I wasn't that desperate, so I set about figuring out how to implement it myself.

Fortunately, Pearson's correlation coefficient is calculated using a very simple algebraic function. The full details are on the Wikipedia page linked above, but it's helpful to break it down into manageable pieces.

At a high level, the coefficient ρ (the greek letter rho) is defined as the covariance of the vectors divided by the product of standard deviation of the two vectors, or mathematically:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

In SQL, this would be

COVAR_POP(y, x) / (STDDEV_POP(x) * STDDEV_POP(y))

This simplifies things a bit since STDDEV_POP does support window frames, but COVAR_POP does not. That reduces our problem to implementing COVAR_POP as a window function.

This is much simpler, because the covariance uses sum and count, both of which are implemented as window functions with window frame support:

COVAR_POP = (SUM(x * y) - SUM(x) * SUM(y) / COUNT(*)) / COUNT(*), or mathematically:

cov(x, y) = (Σ(x * y) - Σx * Σy / N) / N

But it gets even better. Since we're calculating these SUMs and COUNTs anyway, why not use them to implement STDDEV as well? A simplified formula for STDDEV uses the the sum of squares and the square of the sum as follows:

σ = SQRT(N * Σx^2 - (Σx)^2) / N.

Combining the formulae above, we get:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

       = ( (Σ(x * y) - Σx * Σy / N) / N ) / (     (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / ( N * (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) / N )

       = N * (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

       =     (N * Σ(x * y) - Σx * Σy)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

I've left the product of the two squareroots in the denominator as-is rather than simplifying it further because the simplifaction could result in numeric overflow.

So, we now have a function for Pearson's correlation coefficient using only SUM & COUNT, both of which support window functions and window frames.

For each of these, we can now SELECT something like this in an inner query:

COUNT( * )   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS N,

SUM(x)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_X,

SUM(x * x)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_X,

SUM(y)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_Y,

SUM(y * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_Y,

SUM(x * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_XY,

$k2 is half the sliding window size, so if you wanted a window of 60 elements, k2 would be 30. The ROWS BETWEEN r1 PRECEDING AND r2 FOLLOWING syntax specifies a window of rows extending at most r1 rows before the current row, and at most r2 rows beyond the current row. We follow that with an outer query that SELECTs this:

x,
y,
CASE
    WHEN N * SUM2_X > SUM_X * SUM_X AND N * SUM2_Y > SUM_Y * SUM_Y
    THEN (N * SUM_XY - SUM_X * SUM_Y) / (SQRT(N * SUM2_X - SUM_X * SUM_X) * SQRT(N * SUM2_Y - SUM_Y * SUM_Y))
    ELSE 0.0
END AS corr_yx

And there we have it... Pearson's correlation coefficient implemented as a window function with window frame support.

With the above combination, we can get a Pearson's correlation for each (x, y) tuple in the table that correlates a sliding window of data. In my case this was a timeseries database, so for each minute of data, I get a correlation co-efficient of (at most) 61 minutes around that minute (at most 30 before and at most 30 after).

We can still use the CORR aggregate function on the entire list of (x, y) tuples, or post calculate that in a language like Julia.

For the next installment, maybe I'll write up how to do Spearman's Rank Correlation Coefficient.

...===...