The other side of the moon: lcp


Saturday, February 01, 2025

When users interact

When looking at the Core Web Vitals, we often try optimizing each independently of the others, but that's not how users experience the web. A user's web experience is made up of many metrics, and it's important to look at these metrics together for each experience. Real User Measurement (RUM) allows us to do that by collecting operational metrics in conjunction with user actions, the combination of which can tell us whether our pages actually meet the user's expectations.

In this experiment, I decided to look at each of the events in a page's loading cycle, and break that down by when the user tried interacting with the page. For those interactions, I looked at the Interaction to Next Paint, and the rate of rage clicking to get an idea of user experience and whether that experience may have been frustrating or not.

Before I jump into the charts, I should note an important caveat about the data. This analysis was done using RUM data collected by Akamai's mPulse product, which collects data at or soon after page load. Not all page views resulted in an interaction before data was collected, so most of the analysis was restricted to page views where we had at least one interaction prior to data collection. Across sites, between 2% and 25% of collected beacons had an interaction, and most sites had a recorded interaction on about 10% of beacons. I also separately looked at data collected during page unload/pagehide, and while it captured more interactions, it did not have a noticeable effect on the results.

Each of the following charts is from a different website in mPulse's dataset.

Exploring the chart

Interaction Analysis - Virtual Globe Trotting

Interaction analysis chart for Virtual Globe Trotting

Let's now look at the various features of this chart.

The chart shows multiple dimensions of data projected onto a 2D surface, so some parts of it will appear wonky. We'll walk through that in this section.

Event labels

The first thing we'll describe is the set of events. These are the vertical colored lines with labels to their right, and they represent transition events in the page load cycle. The events we include are:

You may have already noticed that in this particular chart, First Paint is _after_ First Contentful Paint, which is counter-intuitive. The reason we see this is that the number of data points with First Paint on them is different from those with First Contentful Paint. Safari and Firefox, for example, support FCP but not FP. When aggregating these points, the same percentile value when applied to two data sets will likely get you values from two different experiences. This effect is more prominent when the sample sizes are different. In general we would not expect the delta to be too far off, and in the data I've looked at, it hasn't been more than 50ms off.
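The effect can be reproduced with a tiny made-up sample. The sketch below is illustrative only: the beacon values are invented and deliberately exaggerated, and `None` stands for a browser that does not report First Paint.

```python
from statistics import median

# Hypothetical beacons: (first_paint_ms, first_contentful_paint_ms).
# None means the browser did not report First Paint.
beacons = [
    (900, 950),    # FP-capable browsers: FP <= FCP on every page view
    (1000, 1100),
    (1200, 1250),
    (None, 400),   # browsers without FP, which happen to be faster here
    (None, 500),
    (None, 600),
]

fp_values = [fp for fp, _ in beacons if fp is not None]
fcp_values = [fcp for _, fcp in beacons]

median_fp = median(fp_values)    # computed over 3 beacons
median_fcp = median(fcp_values)  # computed over all 6 beacons

print(f"median FP:  {median_fp} ms")
print(f"median FCP: {median_fcp} ms")
```

Every individual page view here has FP at or before FCP, yet the median FP lands after the median FCP because the two medians are drawn from different populations.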

The events to keep an eye on are the Largest Contentful Paint or Time to Visually Ready, the Time to Interactive, and the delta between them. LCP is not currently supported on Safari, so we use boomerang's cross-browser calculation of TTVR in those cases.

Time to Interactive is considered a lab measurement, but `boomerang` measures it in a cross-browser manner during RUM sessions, and passes that data back to mPulse. It is approximately the time when interactions with the page are expected to be smooth due to no more long animation frames and blocking time.

The next thing to note is that these events are positioned on this projection based on when they occurred relative to interactions _as well as_ when they occurred relative to page load time. Because LCP stops updating once the user first interacts with the page, all interactions should by definition show up after LCP, but they may show up differently on the chart due to the projection from multiple dimensions down to two. There's also the fact that TTVR calculations do not stop at first interaction, so on browsers that do not support LCP, we may see interactions before the proxy for that event.

The absolute value of each event is calculated across the entire dataset, even on pages without interactions, so it might look like events aren't placed where their values dictate they should be. However, the percentage of users interacting before and after an event is always correct.

The last label to take note of is the fraction of users that interacted before `boomerang` considered the page to be interactive. In this case, it's 12% of users.
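As a rough sketch of how such a fraction could be computed, assuming each page view carries the time of its first interaction and its TTI (the tuple layout and the sample values below are invented for illustration, not mPulse's actual schema):

```python
# Hypothetical page views with at least one interaction:
# (first_interaction_ms, time_to_interactive_ms)
page_views = [
    (1200, 3000),
    (3500, 3000),
    (4000, 2500),
    (5000, 4100),
    (2600, 2500),
    (900, 2000),
    (3100, 3000),
    (4500, 4000),
]


def share_interacting_before_tti(views):
    """Fraction of page views whose first interaction came before TTI."""
    early = sum(1 for interaction_ms, tti_ms in views if interaction_ms < tti_ms)
    return early / len(views)


share = share_interacting_before_tti(page_views)
print(f"{share:.0%} of users interacted before TTI")
```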

Data distributions

Interaction analysis chart showing mouseover details.

There are a few different distributions shown on this chart (and even more when we look at the mouseover in the chart above).

The blue area chart is the _population density_. It shows, for every 5% interval of the page load time, how many users first interacted with the page at that point in the page's loading cycle.
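A minimal sketch of that bucketing, under the assumption that each page view records its first interaction time and its page load time. The function name, the choice to label a bucket by its lower edge, and the sample data are all hypothetical.

```python
from collections import Counter

BUCKET_WIDTH = 5  # percent of page load time per bucket


def load_cycle_bucket(first_interaction_ms, page_load_ms):
    """Return the lower edge (in % of page load time) of the 5% bucket."""
    percent = 100 * first_interaction_ms / page_load_ms
    return int(percent // BUCKET_WIDTH) * BUCKET_WIDTH


# Hypothetical page views: (first_interaction_ms, page_load_ms)
page_views = [(1000, 4000), (1100, 4000), (3000, 3000), (2100, 2000), (450, 1000)]

# Count how many users first interacted in each bucket.
density = Counter(load_cycle_bucket(i, plt) for i, plt in page_views)
for bucket in sorted(density):
    print(f"{bucket:>3}%: {density[bucket]}")
```

Interactions after onload fall into buckets above 100%, which is why the charts extend past the page load time.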

The blue dots that trace the population density chart show the median _Interaction to Next Paint_ value for all of those interactions. Keep in mind that INP is not supported on Safari, whereas `boomerang`'s own measurements for TTI do work across browsers.

The vertical position of the red dots shows the _probability_ that interactions at that time resulted in _rage clicks_ while the size of the red dots shows the _intensity_ of these rage clicks. Rage clicks are collected across browsers.

The thin orange line shows Frustration Index for users that interacted within that window.

We also have the median Total Blocking Time for each of these interactions, though that's only visible in the live versions of these charts and not in most of the screenshots posted here.
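The per-bucket series described above could be derived roughly as follows. This is a sketch with invented data, not the actual mPulse query. In particular, treating "intensity" as the mean number of rage clicks among page views that had any is my assumption for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical page views: (bucket_percent, inp_ms, rage_click_count)
page_views = [
    (25, 120, 0),
    (25, 200, 0),
    (25, 400, 3),
    (25, 180, 0),
    (100, 80, 0),
    (100, 100, 0),
]

# Group page views by the load-cycle bucket of their first interaction.
by_bucket = defaultdict(list)
for bucket, inp_ms, rage_clicks in page_views:
    by_bucket[bucket].append((inp_ms, rage_clicks))


def summarize(rows):
    inps = [inp for inp, _ in rows]
    raged = [rage for _, rage in rows if rage > 0]
    return {
        "median_inp_ms": median(inps),
        # Share of page views in this bucket with at least one rage click.
        "rage_probability": len(raged) / len(rows),
        # Assumption: intensity = mean rage clicks among page views that had any.
        "rage_intensity": mean(raged) if raged else 0,
    }


summary = {bucket: summarize(rows) for bucket, rows in by_bucket.items()}
for bucket in sorted(summary):
    print(bucket, summary[bucket])
```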

In this second chart, we see that 59% of users interacted with the site before it became interactive. Its TTI is further from the LCP time compared to the first site.

Insights from the data

Interaction analysis chart showing INP increasing around TTI.

When we look at this data across websites, we see the same patterns. Users expect to be able to interact with the site once the page is largely visible. However, the user experience for interactions is sub-optimal until the Time to Interactive, which can be much later in the page's loading cycle.

In most cases we see a high Total Blocking Time in the period between LCP and TTI, resulting in a slow INP, and higher probability of rage clicking.

When looking to optimize a site for user experience, we shouldn't look at each metric in isolation. A really fast LCP is a great first user experience, but it's also a signal to the user that they can proceed with interacting to complete their task. It's important that the rest of the page be ready for those interactions and keep up the good experience.

The elephant in the room

Interaction analysis chart for Akamai.com focussing on the population series.

As an aside, has anyone else noticed that these charts almost always look like a sleeping elephant (or maybe a hat)? I've seen very few sites where this isn't the case, so I looked into that pattern.

The population distribution pattern we see is a gradual curve increasing, then a dip that looks like the elephant's neck, then a bump that could be its ears, a sharp dip and long flat region that could be its trunk.

It could well be a Normal distribution if it weren't for the dip and spike right around PLT.

A basic Normal Distribution curve with a mean of 75 and standard deviation of 30.

The drop-off after OnLoad is expected. `boomerang.js` sends a beacon on or soon after page load (sites can configure a beacon delay of a few seconds to capture post-onload events). This results in a drop-off in data with interactions after onload. The post onload interactions are on pages that are faster than the average.

The strange pattern is the spike in interactions just at or after onload (it's sometimes at 100% and sometimes at 105%). The dip at 95% and 100% shows up on most, but not all, sites, while the spike shows up everywhere.

I looked closer at the data around those buckets and there is very little difference in terms of experience. The page load time, LCP time, TTI time, etc. are all very similar at the 25th and 75th percentile (in other words, the experiences are comparable). The only difference is that more users prefer to interact with the site just after the onload event has fired than just before it. It's not a big delay - about 200-400ms on average across sites, but it does look like some portion of users still wait for the loading indicator to complete before they interact.

Conclusions

In conclusion, I think there's a lot to be learned from looking at when your users interact with your site. Which parts of the page have finished loading when that interaction happens? What's still in flight? What do they experience? Is there too much of a delay between your LCP and the site becoming usable?

A good loading experience needs your page to transition from state to state smoothly without too much delay between states. Looking at the loading Frustration Index can identify pages where this isn't the case.

When comparing different events on the page, look at the aggregate of deltas rather than the delta of aggregates.
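To see why the order of operations matters, here is a small sketch with invented LCP and TTI values. The numbers are made up to make the difference obvious.

```python
from statistics import median

# Hypothetical page views: (lcp_ms, tti_ms)
page_views = [
    (1000, 1200),
    (1500, 4000),
    (2000, 2100),
    (2500, 6000),
    (3000, 3300),
]

lcps = [lcp for lcp, _ in page_views]
ttis = [tti for _, tti in page_views]

# Delta of aggregates: subtract the median of one metric from the other.
delta_of_medians = median(ttis) - median(lcps)

# Aggregate of deltas: compute the gap per page view first, then take the median.
median_of_deltas = median(tti - lcp for lcp, tti in page_views)

print(f"delta of medians: {delta_of_medians} ms")
print(f"median of deltas: {median_of_deltas} ms")
```

The two medians in the first calculation come from different page views, so their difference doesn't describe the gap that any particular user experienced. The per-page-view delta does.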

And lastly, keep an eye out for that elephant.

References

Glossary on Mozilla Developer Network

Web Vitals on Google's Web.Dev

Implementations in mPulse

Monday, August 30, 2021

The metrics game

A recent tweet by Punit Sethi about a WordPress plugin that reduces Largest Contentful Paint (LCP) without actually improving user experience led to a discussion about faking/gaming metrics.

Core Web Vitals

Google recently started using the LCP and other Core Web Vitals (aka CWV) as a signal for ranking search results. Google's goal in using CWV as a ranking signal is to make the web better for end users. The understanding is that these metrics (Input delays, Layout shift, and Contentful paints) reflect the end user experience, so sites with good CWV scores should (in theory) be better for users... reducing wait time, frustration, and annoyance with the web.

If I've learnt anything over the last 20 years of working with the web, it's that getting to the top of a Google search result page (SRP) is a major goal for most site owners, so metrics that affect that ranking tend to be researched a lot. The LCP is no different, and the result often shows up in such "quick fix" plugins that Punit discusses above. Web performance (Page Load Time) was only ever spoken about as a sub-topic in highly technical spaces until Google decided to start using it as a signal for page ranking, and then suddenly everyone wanted to make their sites faster.

My background in performance

I started working with web performance in the mid 2000s at Yahoo!. We had amazing Frontend Engineering experts at Yahoo!, and for the first time, engineering processes on the front-end were as strong as the back-end. In many cases we had to be far more disciplined, because Frontend Engineers do not have the luxury of their code being private and running on pre-selected hardware and software specs.

At the time, Yahoo! had a performance team of one person — Steve "Chief Performance Yahoo" Souders. He'd gotten a small piece of JavaScript to measure front-end performance onto the header of all pages by pretending it was an "Ad", and Ash Patel, who may have been an SVP at the time, started holding teams accountable for their performance.

Denial

Most sites' first reaction was to deny the results, showing scans from Keynote and Gomez, which at the time only synthetically measured load times from the perspective of well connected backbone agents, and were very far off from the numbers that roundtrip was showing.

The Wall of Shame

I wasn't working on any public facing properties, but became interested in Steve's work when he introduced the Wall of Fame/Shame (depending on which way you sorted it). It would periodically show up on the big screen at URLs (the Yahoo! cafeteria). Steve now had a team of 3 or 4, and somehow in late 2007 I managed to get myself transferred into this team.

The Wall of Shame showed a kind of stock-ticker like view where a site's current performance was compared against its performance from a week ago, and one day we saw a couple of sites (I won't mention them) jump from the worst position to the best! We quickly visited the sites and timed things with a stop-watch, but they didn't actually appear much faster. In many instances they might have even been slower. We started looking through the source and saw what was happening.

The sites had discovered AJAX!

Faking it

There was almost nothing loaded on the page before the onload event. The only content was some JavaScript that ran on onload and downloaded the framework and data for the rest of the site. Once loaded, it was a long-lived single page application with far fewer traditional page views.

Site owners argued that it would make the overall experience better, and they weren't intentionally trying to fake things. Unfortunately we had no way to actually measure this, so we added a way for them to call an API when their initial framework had completed loading. That way we'd get some data to trend over time.

At Yahoo! we had the option of speaking to every site builder and working with them to make things better. Outside, though, it's a different matter.

Measuring Business Impact

Once we'd started LogNormal (and continuing with mPulse), and were serving multiple customers, it soon became clear that we'd need both business and engineering champions at each customer site. We needed to sell the business case for performance, but also make sure engineering used it for their benefit rather than gaming the metrics. We started correlating business metrics like revenue, conversions, and activity with performance. There is no cheap way to game these metrics because they depend on the behaviour of real users.
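A very simplified sketch of that kind of correlation is shown below: it groups sessions by page load time and computes the conversion rate per group. The session data, the two-second bucket width, and the function name are all invented for illustration.

```python
# Hypothetical sessions: (page_load_seconds, converted)
sessions = [
    (0.8, True),
    (1.5, True),
    (1.9, False),
    (2.4, True),
    (2.6, False),
    (3.7, False),
    (4.2, False),
    (5.5, False),
]


def conversion_rate_by_load_time(sessions, bucket_seconds=2):
    """Map each load-time bucket (lower edge, in seconds) to its conversion rate."""
    totals = {}
    for load_time, converted in sessions:
        bucket = int(load_time // bucket_seconds) * bucket_seconds
        seen, won = totals.get(bucket, (0, 0))
        totals[bucket] = (seen + 1, won + int(converted))
    return {bucket: won / seen for bucket, (seen, won) in sorted(totals.items())}


rates = conversion_rate_by_load_time(sessions)
for bucket, rate in rates.items():
    print(f"{bucket}-{bucket + 2}s: {rate:.0%} converted")
```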

Sites that truly cared about performance, and the business impact of that performance, worked hard to make their sites faster.

This changed when Google started using speed as a ranking signal.

With this change, sites now had to serve two users, and when in conflict, Real Users lost out to Googlebot. After all, you can't serve real users if they can't see your site. Switching to CWV does not change the situation because things like Page Load Time, Largest Contentful Paint, and Layout Shift can all be faked or gamed by clever developers.

Ungameable Metrics

This brings us back to the metrics that we've seen couldn't be gamed. Things like time spent on a site, bounce rate, conversions, and revenue, are an indication of actual user behaviour. Users are only motivated by their ability to complete the task they set out to do, and using this as a ranking signal is probably a better idea.

Unfortunately, activity, conversions, and revenue are also fairly private corporate data. Leaking this data can affect stock prices and clue competitors in to how you're doing.

User frustration & CrUX

Now the goal of using these signals is to measure user frustration. Google Chrome periodically sends user interaction measurements back to their servers, collected as part of the Chrome User Experience Report (CrUX). This includes things like the actual user-experienced LCP, FID, and CLS. In my opinion, it should also include measures like rage clicks, missed clicks, dead clicks, jank while scrolling, CPU busy-ness, battery drain, etc.: metrics that only come into play while a user is interacting with the site, and that affect or reflect how frustrating the experience may be.

It would also need to have buy-in from a few more browsers. Chrome has huge market share, but doesn't reflect the experience of all users. Data from mPulse shows that across websites, Chrome only makes up, on average, 44% of page loads. Edge and Safari (including mobile) also have a sizeable share. Heck, even IE has a 3% share on sites where it's still supported.

In the chart below, each box shows the distribution of a browser's traffic share across sites. The plot includes (in descending order of number of websites with sizeable traffic for that browser) Chrome, Edge, Mobile Safari, Chrome Mobile, Firefox, Safari, Samsung Internet, Chrome Mobile iOS, Google, IE, and Chrome Mobile WebView.

It's unlikely that other browsers would trust Google with this raw information, so there probably needs to be an independent consortium that collects, anonymizes, and summarizes the data, and makes it available to any search provider.

Using something like the Frustration Index is another way to make it hard to fake ranking metrics without also accidentally making the user experience better.

Comparing these metrics with Googlebot's measures could hint at whether the metrics are being gamed or not. It might even justify lowering the weight of Googlebot's measures, restricting them to pages that haven't received a critical mass of users.

We need to move the balance of ranking power back to the users whose experience matters!
