[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

Saturday, February 01, 2025

When users interact

When looking at the Core Web Vitals, we often try optimizing each independently of the others, but that's not how users experience the web. A user's web experience is made up of many metrics, and it's important to look at these metrics together for each experience. Real User Measurement (RUM) allows us to do that by collecting operational metrics in conjunction with user actions, the combination of which can tell us whether our pages actually meet the user's expectations.

In this experiment, I decided to look at each of the events in a page's loading cycle, and break that down by when the user tried interacting with the page. For those interactions, I looked at the Interaction to Next Paint, and the rate of rage clicking to get an idea of user experience and whether that experience may have been frustrating or not.

Before I jump into the charts, I should note an important caveat about the data. This analysis was done using RUM data collected by Akamai's mPulse product which collects data at or soon after page load. Not all page views resulted in an interaction before data was collected. Most of the analysis was restricted to page views where we had at least one interaction prior to data collection. We see on average, between 2-25% of beacons collected (across sites) had an interaction. Most sites had a recorded interaction on about 10% of beacons. I also separately looked at data collected during page unload/pagehide and while it captured more interactions, it did not have a noticeable effect on the results.

Each of the following charts is from a different website in mPulse's dataset.

Exploring the chart

Interaction Analysis - Virtual Globe Trotting
Interaction analysis chart for Virtual Globe Trotting

Let's now look at the various features of this chart.

The chart shows multiple dimensions of data projected onto a 2D surface, so some parts of it will appear wonky. We'll walk through that in this section.

Event labels

The first thing we'll describe are the events. These are the vertical colored lines with labels to their right. These represent transition events in the page load cycle. The events we include are:

You may have already noticed that in this particular chart, First Paint is _after_ First Contentful Paint, which is counter-intuitive. The reason we see this is that the number of data points with First Paint on them is different from those with First Contentful Paint. Safari and Firefox, for example, support FCP but not FP. When aggregating these points, the same percentile value when applied to two data sets will likely get you values from two different experiences. This effect is more prominent when the sample sizes are different. In general we would not expect the delta to be too far off, and in the data I've looked at, it hasn't been more than 50ms off.

The events to keep an eye on are the Largest Contentful Paint or Time to Visually Ready, the Time to Interactive, and the delta between them. LCP is not currently supported on Safari, so we use boomerang's cross-browser calculation of TTVR in those cases.

Time to Interactive is considered a lab measurement, but `boomerang` measures it in a cross-browser manner during RUM sessions, and passes that data back to mPulse. It is approximately the time when interactions with the page are expected to be smooth due to no more long animation frames and blocking time.

The next thing to note are that these events are positioned on this projection based on when they occurred relative to interactions _as well as_ when they occurred relative to page load time. By definition this means that all interactions should show up after LCP but it may show up differently on the chart due to the projection from multiple dimensions down to two. There's also the fact that TTVR calculations do not stop at first interaction, so on browsers that do not support LCP, we may see interactions before the proxy for that event.

The absolute value of each event is calculated across the entire dataset, even on pages without intereactions, so it might look like events aren't placed where their values dictate they should be, however the percentage of users interacting before & after an event is always correct.

The last label to take note of is the fraction of users that interacted before `boomerang` considered the page to be interactive. In this case, it's 12% of users.

Data distributions
Interaction Analysis Site. 2
Interaction analysis chart showing mouseover details.

There are a few different distributions shown on this chart, (and even more when we look at the mouseover in the chart above).

The blue area chart is the _population density_. It shows, for every 5% interval of the page load time, how many users first interacted with the page at that point in the page's loading cycle.

The blue dots that trace the population density chart show the median _Interaction to Next Paint_ value for all of those interactions. Keep in mind that INP is not supported on Safari, whereas `boomerang`'s own measurements for TTI do work across browsers.

The vertical position of the red dots shows the _probability_ that interactions at that time resulted in _rage clicks_ while the size of the red dots shows the _intensity_ of these rage clicks. Rage clicks are collected across browsers.

The thin orange line shows Frustration Index for users that interacted within that window.

We also have the median Total Blocking Time for each of these interactions, though that's only visible in the live versions of these charts and not in most of the screenshots posted here.

In this second chart, we see that 59% of users interacted with the site before it became interactive. Its TTI is further from the LCP time compared to the first site.

Insights from the data

Interaction Analysis Fig. 3
Interaction analysis chart showing INP increasing around TTI.

When we look at this data across websites, we see the same patterns. Users expect to be able to interact with the site once the page is largely visible, however, the user experience for interactions is sub-optimal until the time to interactive which can be much later in the page's loading cycle.

In most cases we see a high Total Blocking Time in the period between LCP and TTI, resulting in a slow INP, and higher probability of rage clicking.

When looking to optimize a site for user experience, we shouldn't look at each metric in isolation. A really fast LCP is a great first user experience, but it's also a signal to the user that they can proceed with interacting to complete their task. It's important that the rest of the page be ready for those interactions and keep up the good experience.

The elephant in the room

Interaction Analysis Fig. 4
Interaction analysis chart for Akamai.com focussing on the population series.

As an aside, has anyone else noticed that these charts almost always look like a sleeping elephant (or maybe a hat)? I've seen very few sites where this isn't the case, so I looked into that pattern.

The population distribution pattern we see is a gradual curve increasing, then a dip that looks like the elephant's neck, then a bump that could be its ears, a sharp dip and long flat region that could be its trunk.

It could well be a Normal distribution if it weren't for the dip and spike right around PLT.

Normal Distribution
A basic Normal Distribution curve with a mean of 75 and standard deviation of 30.

The drop-off after OnLoad is expected. `boomerang.js` sends a beacon on or soon after page load (sites can configure a beacon delay of a few seconds to capture post-onload events). This results in a drop-off in data with interactions after onload. The post onload interactions are on pages that are faster than the average.

The strange pattern is the spike in interactions just at or after onload (it's sometimes at 100% and sometimes at 105%). The dip at 95% & 100% shows up on most, but not all sites, but the spike shows up everywhere.

I looked closer at the data around those buckets and there is very little difference in terms of experience. The page load time, LCP time, TTI time, etc. are all very similar at the 25th and 75th percentile (in other words, the experiences are comparable). The only difference is that more users prefer to interact with the site just after the onload event has fired than just before it. It's not a big delay - about 200-400ms on average across sites, but it does look like some portion of users still wait for the loading indicator to complete before they interact.

Conclusions

In conclusion, I think there's a lot to be learned from looking at when your users interact with your site. Which parts of the page have finished loading when that interaction happens? What's still in flight? What do they experience? Is there too much of a delay between your LCP and the site becoming usable?

A good loading experience needs your page to transition from state to state smoothly without too much delay between states. Looking at the loading Frustration Index can identify pages where this isn't the case.

When comparing different events on the page, look at the aggregate of deltas rather than the delta of aggregates.

And lastly, keep an eye out for that elephant.

References

Glossary on Mozilla Developer Network
Web Vitals on Google's Web.Dev
Implementations in mPulse
Other useful links

Monday, October 07, 2019

Implementing Spearman's Rank Correlation in SQL

In my last post, I showed how to implement Pearson's Correlation as an SQL Window function with window frame support. In this post, I'll follow up with implementing Spearman's Rank correlation co-efficient in SQL.

While Pearson's correlation looks for linear relationships between two vectors (ie, you wouldn't use it for exponential relationships), Spearman's rank correlation looks for monotonicity, or in plain english, do the two values go up & down together?

So here's the really cool part. Spearman's Rank correlation co-efficient is the Pearson's correlation co-efficient of the ranks of the two vectors. We already know how to calculate Pearson's correlation co-efficient, so what we need to do here is first calculate ranks of our vectors.

We can do this using the SQL RANK function, which also works as a window function with window frame support:

RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y,

The two important things to note here are that RANK() does not take a parameter, instead you specify what you want to rank on in the ORDER BY clause, and secondly, make sure both parameters are ordered in the same direction, ASC or DESC.

Now even though the RANK() function supports window frames, you don't want to use them here. This is so because if you're using sliding windows, each row will have a different rank depending on the window, and we won't be able to correlate an outer window.

Once we have the ranks in an inner query, we can run either the standard CORR function, or the windowed CORR that we developed in the previous post on these derived columns instead:

SELECT CORR(R_X, R_Y) FROM (
    SELECT
        RANK() OVER (PARTITION BY <partition cols> ORDER BY x ASC) as R_X,

        RANK() OVER (PARTITION BY <partition cols> ORDER BY y ASC) as R_Y
      FROM ...
)

If implementing this as a window function, then use R_X and R_Y as the inputs to the SUM() functions with an additional nested query.

I hope this was helpful, leave a comment or tweet @bluesmoon if you'd like to chat.

Wednesday, October 02, 2019

Implementing Pearson's CORR as a database window function

I recently needed to find the Pearson's Correlation Coefficient between two columns in a Snowflake database. Now Snowflake and PostgreSQL both support a CORR aggregate function that correlates along a GROUP BY. Snowflake additionally supports CORR as a window function, but only for use with PARTITION BY. It does not support window frames. Other databases like MySQL do not have a CORR function at all. See SQL in a nutshell for more information on database support.

If you want to know about window functions, Julia Evans (@b0rk) has a great SQL tip on them.

In my case, I needed to find the Pearson's coefficient along a sliding window of rows. ie, for a list of N rows, I needed N-k coefficients, one for each sliding window of size k. As far as I could tell, only Oracle supports this functionality, and I wasn't that desperate, so I set about figuring out how to implement it myself.

Fortunately, Pearson's correlation coefficient is calculated using a very simple algebraic function. The full details are on the Wikipedia page linked above, but it's helpful to break it down into manageable pieces.

At a high level, the coefficient ρ (the greek letter rho) is defined as the covariance of the vectors divided by the product of standard deviation of the two vectors, or mathematically:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

In SQL, this would be

COVAR_POP(y, x) / (STDDEV_POP(x) * STDDEV_POP(y))

This simplifies things a bit since STDDEV_POP does support window frames, but COVAR_POP does not. That reduces our problem to implementing COVAR_POP as a window function.

This is much simpler, because the covariance uses sum and count, both of which are implemented as window functions with window frame support:

COVAR_POP = (SUM(x * y) - SUM(x) * SUM(y) / COUNT(*)) / COUNT(*), or mathematically:

cov(x, y) = (Σ(x * y) - Σx * Σy / N) / N

But it gets even better. Since we're calculating these SUMs and COUNTs anyway, why not use them to implement STDDEV as well? A simplified formula for STDDEV uses the the sum of squares and the square of the sum as follows:

σ = SQRT(N * Σx^2 - (Σx)^2) / N.

Combining the formulae above, we get:

ρ(x,y) = cov(x, y) / (σ(x) * σ(y))

       = ( (Σ(x * y) - Σx * Σy / N) / N ) / (     (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / ( N * (SQRT(N * Σx^2 - (Σx)^2) / N) * (SQRT(N * Σy^2 - (Σy)^2) / N) )

       =     (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) / N )

       = N * (Σ(x * y) - Σx * Σy / N)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

       =     (N * Σ(x * y) - Σx * Σy)      / (     SQRT(N * Σx^2 - (Σx)^2)       * SQRT(N * Σy^2 - (Σy)^2) )

I've left the product of the two squareroots in the denominator as-is rather than simplifying it further because the simplifaction could result in numeric overflow.

So, we now have a function for Pearson's correlation coefficient using only SUM & COUNT, both of which support window functions and window frames.

For each of these, we can now SELECT something like this in an inner query:

COUNT( * )   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS N,

SUM(x)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_X,

SUM(x * x)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_X,

SUM(y)       OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_Y,

SUM(y * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM2_Y,

SUM(x * y)   OVER (PARTITION BY <partition cols> ORDER BY <minute> ROWS BETWEEN $k2 PRECEDING AND $k2 FOLLOWING) AS SUM_XY,

$k2 is half the sliding window size, so if you wanted a window of 60 elements, k2 would be 30. The ROWS BETWEEN r1 PRECEDING AND r2 FOLLOWING syntax specifies a window of rows extending at most r1 rows before the current row, and at most r2 rows beyond the current row. We follow that with an outer query that SELECTs this:

x,
y,
CASE
    WHEN N * SUM2_X > SUM_X * SUM_X AND N * SUM2_Y > SUM_Y * SUM_Y
    THEN (N * SUM_XY - SUM_X * SUM_Y) / (SQRT(N * SUM2_X - SUM_X * SUM_X) * SQRT(N * SUM2_Y - SUM_Y * SUM_Y))
    ELSE 0.0
END AS corr_yx

And there we have it... Pearson's correlation coefficient implemented as a window function with window frame support.

With the above combination, we can get a Pearson's correlation for each (x, y) tuple in the table that correlates a sliding window of data. In my case this was a timeseries database, so for each minute of data, I get a correlation co-efficient of (at most) 61 minutes around that minute (at most 30 before and at most 30 after).

We can still use the CORR aggregate function on the entire list of (x, y) tuples, or post calculate that in a language like Julia.

For the next installment, maybe I'll write up how to do Spearman's Rank Correlation Coefficient.

Monday, August 13, 2012

Analyzing Performance Data with Statistics

At LogNormal, we’re all about collecting and making sense of real user performance data. We collect over a billion data points a month, and there’s a lot you can tell about the web and your users if you know what to look at in your data. In this post, I’d like to go over some of the statistical methods we use to make sense of this data.

You’d use these methods if you wanted to build your own Real User Measurement (RUM) tool.

The entire topic is much larger than I can cover in a single post, so go through the references if you’re interested in more information.

Tuesday, January 19, 2010

Bandwidth test v1.2

After testing this out for about a week, I'm ready to release v1.2 of the bandwidth testing code.

Get the ZIP file or Try it online

Changes

The changes in this release are all statistical, and related to the data I've collected while running the test.
  1. Switch from the geometric mean to the arithmetic mean, because data for a single user is more or less centrally clustered.

    This is not true across users, but for readings for a single user (single connection actually), I've found that the data does not really have outliers.
  2. Instead of downloading all images on every run, use the first run as a pilot test, and based on the results from that run, only download the 3 largest images that can be downloaded for this user.

    This allows us to download fewer bytes, and get more consistent results across runs.
  3. Add random sampling. In previous versions, the test would be run for every page view. By adding random sampling, we now only run it on a percentage of page views.

    You can now set PERFORMANCE.BWTest.sample to a number between 0 and 100, and that's the percentage of page views that will be tested for bandwidth. Note that this is not an integer, so you can set it to 0.1 or 10.25, or anything as long as it's between 0 and 100 (both inclusive). Setting it to 0 means that the test isn't run for anyone, so you probably don't want that, and 100 means that the test is run for everybody. This is also the default.
  4. Fire an onload event when the script completes loading. This is mainly so that you can asynchronously load the script so that it does not affect your page load time. Instead of checking for script.onload or onreadystatechange, you can just implement the PERFORMANCE.BWTest.onload method.
You can see the entire history on github.

Sunday, January 17, 2010

The statistics of performance measurement - Random Sampling

I've been working with performance data for over two years now, and over that time I've picked up a little bit of statistics. I am by no means an expert in the subject and I'd probably have flunked statistics in college if it wasn't for a classmate ramming the entire 3Kg statistics book into my head the week before our exams. One thing I understand now though, is that statistics is an extremely important skill for any professional in any field.

Anyway, with a few refreshers from wikipedia and many other texts and papers and a bunch of experts at Yahoo!, I think I've learnt a bit. I'll put it down here mainly so I don't forget.

Why statistics?

So we're often asked by management for a number that represents the performance of a web page. The big question is, what should that number be? Is it the YSlow score, the network roundtrip time, page load time, a combination of these, or some other measure that shows how much the user likes or dislikes the page? Once we know which number we want to measure, how do we determine how this affects all our users? The easiest way is to actually measure what every user sees, but then we end up with thousands, possibly even millions of measurements for different users, and not all of them are the same.

This is where statistics comes in, and where I realised how little I knew.

There are a whole bunch of terms that come up. The few that came up are: Random Sample, Central Tendency, Mean, Median, Geometric Mean, Normal distribution, Log-normal distribution, Outliers, IQR filtering, Standard Error, Confidence Interval. There are probably more that will come up in the course of the next few posts, but this what I can remember for now.

Random Samples

Now say the app whose performance we want to measure typically has about 1 million users a day. If we want to measure our users' perceived performance, we could measure the performance for all interactions of all 1 million users, but this is overkill. In reality, you only need to know the performance for maybe 10% of your users, and just project from there.

Now how we select these 10% is important. If we select the fastest 10%, then our numbers will appear better than they actually are, and if we select the slowest 10%, the opposite happens. There are, in fact, many variables on the users end that affect performance, and are beyond our control. These are called confounding factors and are covered in this article by Zed Shaw under the section on Confounding. Since we can't control these variables, the best thing we can do is completely randomize our sample, which reduces the effect that these variables have over a large sample.

Some important things about a random sample are:
  • It should not be deterministic, i.e., I shouldn't be able to predict who will be chosen using a formula
  • It shouldn't be repetitive, i.e., if I run my randomiser multiple times, I should not end up with identical sets
  • All members of the population should have an equal probability of ending up in the selected sample
The wikipedia articles on random sampling may not be up to the mark, but this article in the social research methods knowledge base is pretty good. There's also an excellent video on TED by Oxford Mathematician Paul Donnelly on how stats fool juries. The real stats is at 3:30 and random sampling comes in at 11:00.

So, how do we go about getting a random sample of all our users or page views? On the face of it, it's really quite easy. I'll only consider page views here because it means I don't have to bother with maintaining sessions. Here's the simple algorithm:
  1. N is your population size — that's the total number of page views that you get in, say, one day
  2. f is the fraction of your population that you want in your sample.
  3. Use something like simple random sampling to pull out N*f items from your population — this is n
  4. Measure the performance of these n items
In PHP, you'd do something like this:
$N_a = array(...);                           // array of entire population
$f = 0.10;                                   // 10%
$n_a = array_rand($N, intval(count($N)*$f)); // $n now contains your sample
Other languages are similar. This is easy to do if you already have access to your entire population. For example, at the end of a day, you can pull out requests from your HTTP access logs. The problem, however, is that you'd need to first measure the performance of all page views, and then randomly pick 10% of them. You'll be throwing away a lot of data, and all your users have to pay the price of the performance test which could be non-negligible. (Heisenberg's uncertainty principle sort of plays a part here — you can't measure something without changing what you're measuring.)

A better solution would be to only instrument your code with performance measurement hooks if the particular request falls into your sample. This makes the whole thing a little more complicated because we do not have access to the entire population when we need to make this decision. What we do have though, is an estimate (based on past experience) of the total number of page views (N) and the sampling fraction (f) based on how big a sample we want. Can we use these two to figure out if a particular request should fall into the sample or not? A simple approach is something like this (PHP):
// Using systematic random sampling
// $N == estimated population size, $n == desired sample size
// $k == floor($N/$n)

$counter = cache_fetch('counter');
if(is_null($counter)) {     // first hit
    $seed = mt_rand(1, $k);
    $counter = $k-$seed;
}

$counter++;

if($counter === $k) {
    measure_performance(); // instrument with performance code
    $counter = 1;
}

cache_store('counter', $counter);
The $counter variable needs to persist across requests, which is why we need to store it in a cross-request cache. Unfortunately, the obvious problem here is that if we have concurrent connections (which any web server has), then we end up with a race condition where multiple processes all read the same value from cache, increment it independently, and then write the new value back, ie, multiple requests get counted as a single request. To fix this, we'd need to use a mutex to serialize that section of code, but that will throw performance and scalability out of the window. A better solution is required.

A more naïve approach might work:
// $N == estimated population size, $n == desired sample size
// $f == $n/$N;

if(mt_rand() < $f * mt_getrandmax()) {
    measure_performance(); // instrument with performance code
}
This approach does not require global variables shared across processes, and is actually really fast, the question is, is it correct. Let's look at an example that's easier to imagine... Let's say we have a box with 20 marbles in it — identical in all respects except that they are numbered from 1 to 20. We need to pick 5 marbles out of this box at random. The probability of a particular marble being picked depends on whether we pick all five at the same time or one at a time. If we pick them all together, the probability of a particular marble being selected is 5/20 == 1/4 == 25%.

Instead, if we pick them one at a time, then the probability depends on when a marble is picked. Each marble has a 1/20 chance of being the first, a 1/19 chance of being the second, 1/18 chance of being the third and so on. Therefore, the probability of a particular marble being selected is now (1/20 + 1/19 + 1/18 + 1/17 + 1/16) == 27.95%

A third possibility is that we look at each marble in sequence, and then decide whether to select it or throw it away. With this approach, the probability of the first marble being selected is 5/20 == 25%. The probability of the second marble being selected though, depends on whether the first was selected. If the first marble was selected (25%), then we only need 4 more marbles (4/19), but if the first marble was thrown away (75%), we still need 5 marbles (5/19). Since at this point we already know whether the first marble was selected or not, that event is no longer uncertain, and we need to pick one of the two probabilities for the second marble.

What we're doing in the code above, however, is that for each marble that we see, there's a 25% chance that we'll pick it and a 75% chance that we'll discard it, however, this has no effect on the probability of selection of the next marble. There is still a 25% chance that it will be selected regardless of whether the first marble was selected or not. In order for the selection of previous marbles to be considered, we'd need to maintain a count of all selected marbles — which is the problem we're trying to avoid here.

The outcome here is that we may not always end up with exactly $n elements. In fact, it's possible that we'll end up with no elements or all elements, though on average, we should end up with close to $n elements.

All things considered, I think it's a close enough approximation without degrading performance.

What about users?

So what about measuring users instead of page views? All of the above looked only at picking a random sample of page views. In order to pick a random sample of users instead, you'll need a unique identifier for each user. This is typically a cookie. You'll want the cookie to last for a reasonably long time. Now, since you know how many cookies you've issued over a given period of time, and since you know the validity of a cookie, you can determine on a given date how many cookies are currently valid, and what they are — this is your population (N). You can then use the simple random sampling method (array_rand()) to pull out a random sample (n) of cookies that you want to measure. Once you have this, it's a simple case of matching the user's cookie to your sample list, and if it matches, send across the measurement code. This is much easier and more correct than what we've seen above, but it has a few caveats:
  • It won't work with users who have cookies disabled
  • There's no guarantee that a user who has fallen into your bucket will ever return to your site
  • New users, who don't already have a cookie won't get tested
  • If you have multiple web servers in a farm, you need to make sure all of them know the sample list
You'll also need to tweak your cookie lifetime based on how often you get new users and what percentage of your users return to the site. If your site requires users to log in, this might be much easier. A combination of the two approaches may also work well.

Closing thoughts

I don't know everything, and wrt statistics, I know very little. There's a good chance that I've gotten things wrong here, so feel free to leave a comment correcting me and chiding me for spreading false information. It will help me learn, and I can update this post accordingly.

Random sampling is pretty important when it comes to getting a good estimate of a measure for a large population. It's applicable in many fields, though I currently only care about its application to performance analysis. I'll post again about the other things I've learnt.

...===...