[philiptellis] /bb|[^b]{2}/
Never stop Grokking



Wednesday, August 25, 2010

An equation to predict a page's roundtrip time


Note that I'm using MathJax to render the equations on this post. It can take a while to render everything, so you may need to wait a bit before everything shows up. If you're reading this in a feed reader, then it will all look like gibberish (or LaTeX if you can tell the difference).

So I've been playing with this idea for a while now, and bounced it off YSlow dev Antonia Kwok a few times, and we came up with something that might work. The question was whether we could estimate a page's expected roundtrip time for a particular user just by looking at the page structure. The answer is much longer than that, but it tends towards possibly.

Let's break the problem down first. There are two large unknowns in there:
  • a particular user
  • the page's structure
These can be broken down into their atomic components, each of which is either known or measurable:
  • Network characteristics:
    • Bandwidth to the origin server ( \(B_O\) )
    • Bandwidth to your CDN ( \(B_C\) )
    • Latency to the origin server ( \(L_O\) )
    • Latency to your CDN ( \(L_C\) )
    • DNS latency to their local DNS server ( \(L_D\) )
  • Browser characteristics:
    • Number of parallel connections to a host ( \(N_{Hmax}\) )
    • Number of parallel connections overall ( \(N_{max}\) )
    • Number of DNS lookups it can do in parallel ( \(N_{Dmax}\) )
    • Ability to download scripts in parallel
    • Ability to download css in parallel (with each other and with scripts)
    • Ability to download images in parallel with scripts
  • Page characteristics:
    • Document size (\(S_O\) )
    • Size of each script (\(S_{S_i}\))
    • Size of each non-script resource (images, css, etc.) (\(S_{R_i}\))
    • Number of scripts ( \(N_S\))
    • Number of non-script resources (\(N_R\))
    • Number of hostnames (\(N_H\)), further broken down into:
      • Number of script hostnames (\(N_{SH}\))
      • Number of non-script hostnames (\(N_{RH}\))
All sizes are on the wire, so if a resource is sent across compressed, we consider the compressed size and not the uncompressed size. Additionally, scripts and resources within the page can be combined into groups based on the parallelisation factor of the browser in question. We use the terms \(SG_i\) and \(RG_i\) to identify these groups. We treat scripts and non-script resources differently because browsers treat them differently, i.e., some browsers will not download scripts in parallel even if they download other resources in parallel.

To simplify the equation a bit, we assume that bandwidth and network latency from the user to the CDN and the origin are the same. Additionally, the latency for the main page includes both network latency and the time it takes the server to generate the page (\(L_S\)). Often this time can be significant, so we redefine the terms slightly:
\begin{align}
B_O & = B_C \\
L_O & = L_S + L_C
\end{align}

Browser characteristics are easy enough to obtain. Simply pull the data from BrowserScope's Network tab. It contains almost all the information we need. The only parameter not listed is the number of parallel DNS lookups that a browser can make. Since it's better to err on the side of caution, we assume that this number is 1, so for all further equations, assume \(N_{Dmax} = 1\).

Before I get to the equation, I should mention a few caveats. It's fairly naïve, assuming that all resources that can be downloaded in parallel will be downloaded in parallel, that there's no blank time between downloads, and that the measured bandwidth \(B_C\) is less than the actual channel capacity, so multiple parallel TCP connections will all have access to the full bandwidth. This is not entirely untrue for high bandwidth users, but it does break down when we get down to dial-up speeds. Here's the equation:
\[
T_{RT} = T_P + T_D + T_S + T_R
\]
Where:
\begin{align}
T_P \quad & = \quad L_O + \frac{S_O}{B_C}\\
T_D \quad & = \quad \frac{N_H}{N_{Dmax}} \times L_D\\
T_S \quad & = \quad \sum_{i=1}^{N_{SG}} \left( \frac{S_{SG_imax}}{B_C} + L_C \right) \\
N_{SG} \quad & = \quad \left\{
\begin{array}{l l}
\frac{N_S}{\min \left( N_{Hmax} \times N_{SH}, N_{max} \right) } & \quad \text{if browser supports parallel scripts}\\
N_S & \quad \text{if browser does not support parallel scripts}
\end{array} \right. \\
S_{SG_imax} \quad & = \quad \text{Size of the largest script in script group } SG_i\\
T_R \quad & = \quad \sum_{i=1}^{N_{RG}} \left( \frac{S_{RG_imax}}{B_C} + L_C \right) \\
N_{RG} \quad & = \quad \frac{N_R}{\min \left( N_{Hmax} \times N_{RH}, N_{max} \right) }\\
S_{RG_imax} \quad & = \quad \text{Size of the largest resource in resource group } RG_i
\end{align}

So this is how it works...

We assume that the main page's download time is a linear function of its size, bandwidth, the time it takes for the server to build the page and the network latency between the user and the server. While this is not correct (consider multiple flushes, bursty networks, and other factors), it is close.

We then consider all scripts in groups based on whether the browser can handle parallel script downloads or not. Script groups are populated based on the following algorithm:
for each script:
   if size of group > Nmax:
      process and empty group
   else if number of scripts in group for a given host > NHmax:
      ignore script for the current group, reconsider for next group
   else
      add script to group

process and empty group
If a browser cannot handle parallel scripts, then we just temporarily set \(N_{max}\) to 1.

Similarly, we consider the case for all non-script resources:
for each resource:
   if size of group > Nmax:
      process and empty group
   else if number of resources in group for a given host > NHmax:
      ignore resource for the current group, reconsider for next group
   else
      add resource to group

process and empty group
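
Both loops follow the same pattern, so here's a minimal PHP sketch of the grouping step, just to make the algorithm concrete. The function name and the shape of the $items array (one entry per script or resource, with its hostname and size) are my own assumptions, not anything that exists in YSlow:
// Group scripts or resources based on the browser's parallelisation limits.
// $items is a hypothetical array of array('host' => ..., 'size' => ...) entries.
function build_groups(array $items, $NHmax, $Nmax) {
    $groups = array();
    while (count($items) > 0) {
        $group    = array();
        $deferred = array();    // items pushed to the next group
        $per_host = array();    // connections used per host in this group
        foreach ($items as $item) {
            $host = $item['host'];
            $used = isset($per_host[$host]) ? $per_host[$host] : 0;
            if (count($group) >= $Nmax || $used >= $NHmax) {
                $deferred[] = $item;            // reconsider for the next group
            } else {
                $group[] = $item['size'];
                $per_host[$host] = $used + 1;
            }
        }
        $groups[] = $group;                     // "process and empty group"
        $items = $deferred;
    }
    return $groups; // for a browser without parallel script downloads, call with $Nmax = 1
}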

For DNS, we assume that all DNS lookups are done sequentially. This makes our equation fairly simple, but turns our result into an overestimate.

Overall, this gives us a fairly good guess at what the roundtrip time for the page would be, but it only works well for high bandwidth values.
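To make the terms concrete, here's a hypothetical worked example; all the numbers below are made up purely for illustration. Assume \(L_S = 200ms\), \(L_C = 50ms\), \(L_D = 60ms\), \(B_C = 250kB/s\), a 50kB document, two hostnames, a single script group whose largest script is 25kB, and two resource groups whose largest resources are 50kB and 25kB:
\begin{align}
T_P & = (200ms + 50ms) + \frac{50kB}{250kB/s} = 450ms\\
T_D & = \frac{2}{1} \times 60ms = 120ms\\
T_S & = \frac{25kB}{250kB/s} + 50ms = 150ms\\
T_R & = \left( \frac{50kB}{250kB/s} + 50ms \right) + \left( \frac{25kB}{250kB/s} + 50ms \right) = 400ms\\
T_{RT} & = 450ms + 120ms + 150ms + 400ms = 1120ms
\end{align}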

We go wrong with our assumptions in a few places. For example, we don't consider the fact that resources may download in parallel with the page itself, or that once the smallest script/resource in a group has been downloaded, the browser can start downloading the next script/resource. We ignore the fact that some browsers can download scripts and resources in parallel, and we assume that the browser takes no time to actually execute scripts and render the page. These assumptions introduce an error into our calculations; however, we can overcome them in the lab. Since the primary purpose of this experiment is to determine the roundtrip time of a page without actually pushing it out to users, this isn't a bad thing.

So, where do we get our numbers from?

All browser characteristics come from BrowserScope.

The user's bandwidth is variable, so we leave that as a variable to be filled in by the developer running the test. We could simply select 5 or 6 bandwidth values that best represent our users based on the numbers we get from boomerang. Again, since this equation breaks down at low bandwidth values, we could simply ignore those.

The latency to our CDN is something we can either pull out of data that we've already gathered from boomerang, or something we can calculate with a simple and not terribly incorrect formula:
\[
L_C = 4 \times \frac{distance\left(U \leftrightarrow C\right)}{c_{fiber}}
\]
Where \(c_{fiber}\) is the speed of light in fiber, which is approximately \(2 \times 10^8 m/s\).
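As a made-up example, for a user roughly 2,500km from the CDN node:
\[
L_C = 4 \times \frac{2.5 \times 10^6 m}{2 \times 10^8 m/s} = 50ms
\]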

DNS latency is a tough number, but since most people are fairly close to their ISPs, we can assume that this number is between 40ms and 80ms. The worst case is much higher than that, but on average, this range should be correct.

The last number we need is \(L_S\), the time it takes for the server to generate the page. This is something that we can determine just by hitting our server from a nearby box, which is pretty much what we do during development. This brings us to the tool we use to do all the calculations.

YSlow already analyses a page's structure and looks at the time it takes to download each resource. We just pull the time out from what YSlow already has. YSlow also knows the size of all resources (both compressed and uncompressed), how many domains are in use and more. By sticking these calculations into YSlow, we could get a number that a developer can use during page development.
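
In code, that calculation might look something like the hedged sketch below. The parameter names are mine, not YSlow's; sizes are in bytes (on the wire), bandwidth in bytes per second, latencies in seconds, and the group arrays are the output of the grouping step sketched earlier:
// Hypothetical sketch of the T_RT calculation; this is not actual YSlow code.
function estimate_roundtrip($L_S, $L_C, $L_D, $B_C, $S_O, $N_H,
                            array $script_groups, array $resource_groups) {
    $T_P = ($L_S + $L_C) + $S_O / $B_C;        // main document: latency + transfer
    $T_D = $N_H * $L_D;                        // DNS lookups, assuming N_Dmax = 1
    $T_S = 0;
    foreach ($script_groups as $group) {       // one round trip per script group
        $T_S += max($group) / $B_C + $L_C;
    }
    $T_R = 0;
    foreach ($resource_groups as $group) {     // one round trip per resource group
        $T_R += max($group) / $B_C + $L_C;
    }
    return $T_P + $T_D + $T_S + $T_R;
}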

The number may not be spot on with what real users experience, but a developer should be able to compare two page designs and determine which of these will perform better even if they get the same YSlow score.

Naturally this isn't the end of the story. We've been going back and forth on this some more, and are tending towards more of a CPM approach to the problem. I'll write more about that when we've sorted it out.

For now, leave a comment letting me know what you think. Am I way off? Do I have the right idea? Can this be improved upon? Is this something you'd like to see in YSlow?

Sunday, January 17, 2010

The statistics of performance measurement - Random Sampling

I've been working with performance data for over two years now, and over that time I've picked up a little bit of statistics. I am by no means an expert in the subject, and I'd probably have flunked statistics in college if it weren't for a classmate ramming the entire 3kg statistics book into my head the week before our exams. One thing I understand now, though, is that statistics is an extremely important skill for any professional in any field.

Anyway, with a few refreshers from Wikipedia, many other texts and papers, and a bunch of experts at Yahoo!, I think I've learnt a bit. I'll put it down here mainly so I don't forget.

Why statistics?

So we're often asked by management for a number that represents the performance of a web page. The big question is, what should that number be? Is it the YSlow score, the network roundtrip time, page load time, a combination of these, or some other measure that shows how much the user likes or dislikes the page? Once we know which number we want to measure, how do we determine how this affects all our users? The easiest way is to actually measure what every user sees, but then we end up with thousands, possibly even millions of measurements for different users, and not all of them are the same.

This is where statistics comes in, and where I realised how little I knew.

There are a whole bunch of terms that come up. The few that came up so far are: Random Sample, Central Tendency, Mean, Median, Geometric Mean, Normal distribution, Log-normal distribution, Outliers, IQR filtering, Standard Error and Confidence Interval. There are probably more that will come up in the course of the next few posts, but this is what I can remember for now.

Random Samples

Now say the app whose performance we want to measure typically has about 1 million users a day. If we want to measure our users' perceived performance, we could measure the performance for all interactions of all 1 million users, but this is overkill. In reality, you only need to know the performance for maybe 10% of your users, and just project from there.

Now, how we select these 10% is important. If we select the fastest 10%, then our numbers will appear better than they actually are, and if we select the slowest 10%, the opposite happens. There are, in fact, many variables on the user's end that affect performance and are beyond our control. These are called confounding factors, and are covered in this article by Zed Shaw under the section on Confounding. Since we can't control these variables, the best thing we can do is completely randomise our sample, which reduces the effect that these variables have over a large sample.

Some important things about a random sample are:
  • It should not be deterministic, i.e., I shouldn't be able to predict who will be chosen using a formula
  • It shouldn't be repetitive, i.e., if I run my randomiser multiple times, I should not end up with identical sets
  • All members of the population should have an equal probability of ending up in the selected sample
The Wikipedia articles on random sampling may not be up to the mark, but this article in the social research methods knowledge base is pretty good. There's also an excellent video on TED by Oxford mathematician Peter Donnelly on how stats fool juries. The real stats start at 3:30 and random sampling comes in at 11:00.

So, how do we go about getting a random sample of all our users or page views? On the face of it, it's really quite easy. I'll only consider page views here because it means I don't have to bother with maintaining sessions. Here's the simple algorithm:
  1. N is your population size — that's the total number of page views that you get in, say, one day
  2. f is the fraction of your population that you want in your sample.
  3. Use something like simple random sampling to pull out N*f items from your population — this is n
  4. Measure the performance of these n items
In PHP, you'd do something like this:
$N_a = array(...);                                // array of entire population
$f = 0.10;                                        // 10%
$n_a = array_rand($N_a, intval(count($N_a)*$f));  // $n_a now holds the keys of your random sample
Other languages are similar. This is easy to do if you already have access to your entire population. For example, at the end of a day, you can pull out requests from your HTTP access logs. The problem, however, is that you'd need to first measure the performance of all page views, and then randomly pick 10% of them. You'll be throwing away a lot of data, and all your users have to pay the price of the performance test which could be non-negligible. (Heisenberg's uncertainty principle sort of plays a part here — you can't measure something without changing what you're measuring.)

A better solution would be to only instrument your code with performance measurement hooks if the particular request falls into your sample. This makes the whole thing a little more complicated because we do not have access to the entire population when we need to make this decision. What we do have though, is an estimate (based on past experience) of the total number of page views (N) and the sampling fraction (f) based on how big a sample we want. Can we use these two to figure out if a particular request should fall into the sample or not? A simple approach is something like this (PHP):
// Using systematic random sampling
// $N == estimated population size, $n == desired sample size
// cache_fetch()/cache_store() stand in for whatever cross-request cache you use

$k = (int)floor($N / $n);   // sampling interval

$counter = cache_fetch('counter');
if(is_null($counter)) {     // first hit, start at a random offset into the interval
    $seed = mt_rand(1, $k);
    $counter = $k - $seed;
}

$counter++;

if($counter === $k) {
    measure_performance(); // instrument with performance code
    $counter = 1;
}

cache_store('counter', $counter);
The $counter variable needs to persist across requests, which is why we need to store it in a cross-request cache. Unfortunately, the obvious problem here is that if we have concurrent connections (which any web server has), then we end up with a race condition where multiple processes all read the same value from cache, increment it independently, and then write the new value back, i.e., multiple requests get counted as a single request. To fix this, we'd need to use a mutex to serialise that section of code, but that would throw performance and scalability out of the window. A better solution is required.

A more naïve approach might work:
// $N == estimated population size, $n == desired sample size

$f = $n / $N;               // sampling fraction

if(mt_rand() < $f * mt_getrandmax()) {
    measure_performance(); // instrument with performance code
}
This approach does not require global variables shared across processes, and is actually really fast. The question is: is it correct? Let's look at an example that's easier to imagine. Say we have a box with 20 marbles in it, identical in all respects except that they are numbered from 1 to 20. We need to pick 5 marbles out of this box at random. How we reason about the probability of a particular marble being picked depends on whether we pick all five at the same time or one at a time. If we pick them all together, the probability of a particular marble being selected is 5/20 == 1/4 == 25%.

Instead, if we pick them one at a time, the odds change as we go: each marble has a 1/20 chance of being the first pick, and, if it survived the first draw, a 1/19 chance of being the second, a 1/18 chance of being the third, and so on. Those per-draw probabilities are conditional on the marble not having been picked yet, though, and when you multiply them out and add up the five cases, the overall probability of a particular marble being selected still works out to 5/20 == 25%. Drawing one at a time changes the bookkeeping, not a marble's overall chances.

A third possibility is that we look at each marble in sequence and decide whether to select it or throw it away. With this approach, the probability of the first marble being selected is 5/20 == 25%. The probability of the second marble being selected, though, depends on whether the first was selected. If the first marble was selected (25%), then we only need 4 more marbles (4/19), but if the first marble was thrown away (75%), we still need 5 marbles (5/19). Since at this point we already know whether the first marble was selected or not, that event is no longer uncertain, and we need to pick one of the two probabilities for the second marble.
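
That third approach is essentially classic selection sampling, and for contrast, here's a hedged PHP sketch of it (not something proposed above for production use). It does guarantee exactly $n selections, but notice that it needs running counts of items seen and selected, which is exactly the shared state we're trying to avoid across web requests:
// Selection sampling: pick exactly $n of $N items, examining them one at a time.
// $seen and $selected would have to be shared state across requests.
$seen = 0;
$selected = 0;
foreach ($items as $item) {            // $items stands in for the stream of requests
    $remaining = $N - $seen;
    $needed    = $n - $selected;
    // Select this item with probability (still needed) / (still to be seen)
    if (mt_rand(1, $remaining) <= $needed) {
        measure_performance();         // instrument this request
        $selected++;
    }
    $seen++;
}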

What we're doing in the naïve mt_rand() check, though, is giving each marble that we see a 25% chance of being picked and a 75% chance of being discarded, and this has no effect on the probability of selection of the next marble. There is still a 25% chance that it will be selected regardless of whether the first marble was selected or not. In order for the selection of previous marbles to be considered, we'd need to maintain a count of all selected marbles, which is the problem we're trying to avoid here.

The outcome here is that we may not always end up with exactly $n elements. In fact, it's possible that we'll end up with no elements or all elements, though on average, we should end up with close to $n elements.
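
How far off can it be? Since each request is an independent coin flip, the sample size follows a binomial distribution with mean N*f and standard deviation sqrt(N*f*(1-f)); for N = 1,000,000 page views and f = 0.1 that's 100,000 give or take about 300, which is negligible. Here's a quick PHP simulation sketch of that per-request coin flip (the numbers are just illustrative):
// Simulate the per-request coin flip to see how much the sample size varies.
$N = 1000000;    // page views in a day
$f = 0.1;        // sampling fraction
$sampled = 0;
for ($i = 0; $i < $N; $i++) {
    if (mt_rand() < $f * mt_getrandmax()) {
        $sampled++;
    }
}
printf("wanted %d, got %d\n", $N * $f, $sampled);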

All things considered, I think it's a close enough approximation without degrading performance.

What about users?

So what about measuring users instead of page views? All of the above looked only at picking a random sample of page views. In order to pick a random sample of users instead, you'll need a unique identifier for each user. This is typically a cookie. You'll want the cookie to last for a reasonably long time. Now, since you know how many cookies you've issued over a given period of time, and since you know the validity of a cookie, you can determine on a given date how many cookies are currently valid, and what they are — this is your population (N). You can then use the simple random sampling method (array_rand()) to pull out a random sample (n) of cookies that you want to measure. Once you have this, it's a simple case of matching the user's cookie to your sample list, and if it matches, send across the measurement code. This is much easier and more correct than what we've seen above, but it has a few caveats:
  • It won't work with users who have cookies disabled
  • There's no guarantee that a user who has fallen into your bucket will ever return to your site
  • New users, who don't already have a cookie, won't get tested
  • If you have multiple web servers in a farm, you need to make sure all of them know the sample list
You'll also need to tweak your cookie lifetime based on how often you get new users and what percentage of your users return to the site. If your site requires users to log in, this might be much easier. A combination of the two approaches may also work well.
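
As an illustration of the matching step, here's a minimal PHP sketch. The cookie name and the helper that loads the sample list are my own stand-ins; the sample itself would come from the array_rand() selection over your currently valid cookie IDs described above:
// $sample_ids: cookie IDs chosen with array_rand() from all currently valid cookies.
// This list needs to be available to every web server in the farm.
$sample_ids = fetch_sample_list();     // hypothetical helper

$user_id = isset($_COOKIE['uid']) ? $_COOKIE['uid'] : null;   // hypothetical cookie name

if ($user_id !== null && in_array($user_id, $sample_ids, true)) {
    measure_performance();             // send the measurement code to this user
}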

Closing thoughts

I don't know everything, and wrt statistics, I know very little. There's a good chance that I've gotten things wrong here, so feel free to leave a comment correcting me and chiding me for spreading false information. It will help me learn, and I can update this post accordingly.

Random sampling is pretty important when it comes to getting a good estimate of a measure for a large population. It's applicable in many fields, though I currently only care about its application to performance analysis. I'll post again about the other things I've learnt.
