[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Wednesday, October 02, 2019

Implementing Pearson's CORR as a database window function

I recently needed to find the Pearson's Correlation Coefficient between two columns in a Snowflake database. Now Snowflake and PostgreSQL both support a CORR aggregate function that correlates along a GROUP BY. Snowflake additionally supports CORR as a window function, but only for use with PARTITION BY. It does not support window frames. Other databases like MySQL do not have a CORR function at all. See SQL in a nutshell for more information on database support.

If you want to know about window functions, Julia Evans (@b0rk) has a great SQL tip on them.

In my case, I needed to find the Pearson's coefficient along a sliding window of rows. ie, for a list of N rows, I needed N-k coefficients, one for each sliding window of size k. As far as I could tell, only Oracle supports this functionality, and I wasn't that desperate, so I set about figuring out how to implement it myself.

Fortunately, Pearson's correlation coefficient is calculated using a very simple algebraic function. The full details are on the Wikipedia page linked above, but it's helpful to break it down into manageable pieces.

At a high level, the coefficient ρ (the greek letter rho) is defined as the covariance of the vectors divided by the product of standard deviation of the two vectors, or mathematically:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

In SQL, this would be

COVAR_POP(y, x) / (STDDEV_POP(x) * STDDEV_POP(y))

This simplifies things a bit since STDDEV_POP does support window frames, but COVAR_POP does not. That reduces our problem to implementing COVAR_POP as a window function.

This is much simpler, because the covariance uses sum and count, both of which are implemented as window functions with window frame support:

COVAR_POP = (SUM(x * y) - SUM(x) * SUM(y) / COUNT(*)) / COUNT(*), or mathematically:

cov(x, y) = (Σ(x * y) - Σx * Σy / N) / N

But it gets even better. Since we're calculating these SUMs and COUNTs anyway, why not use them to implement STDDEV as well? A simplified formula for STDDEV uses the the sum of squares and the square of the sum as follows:

σ = SQRT(N * Σx^2 - (Σx)^2) / N.

Combining the formulae above, we get:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

       = ( (Σ(x * y) - Σx * Σy / N) / N ) / (     (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / ( N * (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) / N )

       = N * (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

       =     (N * Σ(x * y) - Σx * Σy)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

I've left the product of the two squareroots in the denominator as-is rather than simplifying it further because the simplifaction could result in numeric overflow.

So, we now have a function for Pearson's correlation coefficient using only SUM & COUNT, both of which support window functions and window frames.

For each of these, we can now SELECT something like this in an inner query:

COUNT( * )   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS N,

SUM(x)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_X,

SUM(x * x)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_X,

SUM(y)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_Y,

SUM(y * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_Y,

SUM(x * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_XY,

$k2 is half the sliding window size, so if you wanted a window of 60 elements, k2 would be 30. The ROWS BETWEEN r1 PRECEDING AND r2 FOLLOWING syntax specifies a window of rows extending at most r1 rows before the current row, and at most r2 rows beyond the current row. We follow that with an outer query that SELECTs this:

x,
y,
CASE
    WHEN N * SUM2_X > SUM_X * SUM_X AND N * SUM2_Y > SUM_Y * SUM_Y
    THEN (N * SUM_XY - SUM_X * SUM_Y) / (SQRT(N * SUM2_X - SUM_X * SUM_X) * SQRT(N * SUM2_Y - SUM_Y * SUM_Y))
    ELSE 0.0
END AS corr_yx

And there we have it... Pearson's correlation coefficient implemented as a window function with window frame support.

With the above combination, we can get a Pearson's correlation for each (x, y) tuple in the table that correlates a sliding window of data. In my case this was a timeseries database, so for each minute of data, I get a correlation co-efficient of (at most) 61 minutes around that minute (at most 30 before and at most 30 after).

We can still use the CORR aggregate function on the entire list of (x, y) tuples, or post calculate that in a language like Julia.

For the next installment, maybe I'll write up how to do Spearman's Rank Correlation Coefficient.

0 comments :

Post a Comment

...===...