[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Sunday, January 17, 2010

The statistics of performance measurement - Random Sampling

I've been working with performance data for over two years now, and over that time I've picked up a little bit of statistics. I am by no means an expert in the subject, and I'd probably have flunked statistics in college if it weren't for a classmate ramming the entire 3kg statistics book into my head the week before our exams. One thing I do understand now, though, is that statistics is an extremely important skill for any professional, in any field.

Anyway, with a few refreshers from Wikipedia, many other texts and papers, and a bunch of experts at Yahoo!, I think I've learnt a bit. I'll put it down here mainly so I don't forget.

Why statistics?

So we're often asked by management for a number that represents the performance of a web page. The big question is, what should that number be? Is it the YSlow score, the network roundtrip time, page load time, a combination of these, or some other measure that shows how much the user likes or dislikes the page? Once we know which number we want to measure, how do we determine how this affects all our users? The easiest way is to actually measure what every user sees, but then we end up with thousands, possibly even millions of measurements for different users, and not all of them are the same.

This is where statistics comes in, and where I realised how little I knew.

A whole bunch of terms come up. The ones I remember are: Random Sample, Central Tendency, Mean, Median, Geometric Mean, Normal distribution, Log-normal distribution, Outliers, IQR filtering, Standard Error, Confidence Interval. More will probably come up over the course of the next few posts, but this is what I can remember for now.

Random Samples

Now say the app whose performance we want to measure typically has about 1 million users a day. If we want to measure our users' perceived performance, we could measure the performance for all interactions of all 1 million users, but this is overkill. In reality, you only need to know the performance for maybe 10% of your users, and just project from there.

Now, how we select these 10% is important. If we select the fastest 10%, our numbers will appear better than they actually are, and if we select the slowest 10%, the opposite happens. There are, in fact, many variables on the user's end that affect performance and are beyond our control. These are called confounding factors, and they're covered in this article by Zed Shaw under the section on Confounding. Since we can't control these variables, the best thing we can do is completely randomize our sample, which reduces the effect these variables have over a large sample.
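To make that concrete, here's a toy simulation (all the numbers are invented) comparing the mean of a random 10% sample against the mean of the fastest 10%. The random sample tracks the population; the biased one doesn't:

// Toy simulation: half the page views come from "slow" connections.
// A random 10% sample gets roughly the right mix; the fastest 10% does not.
// All numbers here are made up purely for illustration.
$population = array();
for ($i = 0; $i < 100000; $i++) {
    // half "fast" users around 1000ms, half "slow" users around 4000ms
    $population[] = (mt_rand(0, 1) ? 1000 : 4000) + mt_rand(-200, 200);
}

$keys   = array_rand($population, 10000);        // random 10% sample
$sample = array();
foreach ($keys as $key) {
    $sample[] = $population[$key];
}

sort($population);
$fastest = array_slice($population, 0, 10000);   // biased: fastest 10% only

printf("population mean: %d ms\n", array_sum($population) / count($population));
printf("random sample:   %d ms\n", array_sum($sample) / count($sample));
printf("fastest 10%%:    %d ms\n", array_sum($fastest) / count($fastest));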

Some important things about a random sample are:
  • It should not be deterministic, i.e., I shouldn't be able to predict who will be chosen using a formula
  • It shouldn't be repetitive, i.e., if I run my randomiser multiple times, I should not end up with identical sets
  • All members of the population should have an equal probability of ending up in the selected sample
The Wikipedia articles on random sampling may not be up to the mark, but this article in the Social Research Methods Knowledge Base is pretty good. There's also an excellent video on TED by Oxford mathematician Peter Donnelly on how stats fool juries. The real stats start at 3:30 and random sampling comes in at 11:00.

So, how do we go about getting a random sample of all our users or page views? On the face of it, it's really quite easy. I'll only consider page views here because it means I don't have to bother with maintaining sessions. Here's the simple algorithm:
  1. N is your population size — that's the total number of page views that you get in, say, one day
  2. f is the fraction of your population that you want in your sample.
  3. Use something like simple random sampling to pull out N*f items from your population — this is n
  4. Measure the performance of these n items
In PHP, you'd do something like this:
$N = array(...);                                 // array of entire population
$f = 0.10;                                       // sampling fraction (10%)
$keys = array_rand($N, intval(count($N) * $f));  // array_rand returns random keys
$n = array_intersect_key($N, array_flip($keys)); // $n now contains your sample
Other languages are similar. This is easy to do if you already have access to your entire population; for example, at the end of the day you can pull requests out of your HTTP access logs. The problem, however, is that you'd need to first measure the performance of all page views and then randomly pick 10% of them. You'll be throwing away a lot of data, and all of your users will have paid the price of the performance test, which could be non-negligible. (Heisenberg's uncertainty principle sort of plays a part here: you can't measure something without changing what you're measuring.)
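If you do go the log-processing route, the same array_rand() trick works on a day's worth of log lines. A minimal sketch; the log path is just a placeholder, and the timing extraction is left out:

// Pick a 10% random sample of requests from a day's access log.
// The path below is a placeholder; point it at your own logs.
$lines = file('/var/log/httpd/access_log', FILE_IGNORE_NEW_LINES);

$f    = 0.10;
$keys = array_rand($lines, max(1, intval(count($lines) * $f)));

foreach ((array) $keys as $key) {
    // $lines[$key] is one sampled request; parse out its timing fields here
    echo $lines[$key], "\n";
}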

A better solution would be to instrument your code with performance measurement hooks only if the particular request falls into your sample. This makes the whole thing a little more complicated, because we do not have access to the entire population when we need to make this decision. What we do have, though, is an estimate (based on past experience) of the total number of page views (N), and the sampling fraction (f) based on how big a sample we want. Can we use these two to figure out whether a particular request should fall into the sample? A simple approach is something like this (PHP):
// Using systematic random sampling
// $N == estimated population size, $n == desired sample size
$k = intval(floor($N / $n));    // sampling interval: measure every $k-th request

$counter = cache_fetch('counter');
if(is_null($counter)) {         // first hit: start at a random offset into the interval
    $seed = mt_rand(1, $k);
    $counter = $k - $seed;
}

$counter++;

if($counter >= $k) {
    measure_performance();      // instrument with performance code
    $counter = 0;               // reset so the next measurement is exactly $k requests away
}

cache_store('counter', $counter);
The $counter variable needs to persist across requests, which is why we store it in a cross-request cache. Unfortunately, the obvious problem here is that with concurrent connections (which any web server has), we end up with a race condition: multiple processes read the same value from the cache, increment it independently, and then write the new value back, i.e., multiple requests get counted as a single request. To fix this, we'd need to use a mutex to serialize that section of code, but that would throw performance and scalability out of the window. A better solution is required.
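To make that cost concrete, the serialized version would look roughly like the sketch below (the lock-file path is a placeholder). Every request now queues behind the lock, which is exactly the scalability problem:

// Rough sketch of serializing the counter update with an exclusive flock().
// The lock-file path is a placeholder.
$fp = fopen('/tmp/perf-sample-counter.lock', 'c');
if($fp !== false && flock($fp, LOCK_EX)) {   // blocks until we hold the lock
    $counter = cache_fetch('counter');
    // ... same seed / increment / compare / reset logic as above ...
    cache_store('counter', $counter);
    flock($fp, LOCK_UN);
}
if($fp !== false) {
    fclose($fp);
}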

A more naïve approach might work:
// $N == estimated population size, $n == desired sample size
$f = $n / $N;    // sampling fraction

if(mt_rand() < $f * mt_getrandmax()) {
    measure_performance(); // instrument with performance code
}
This approach does not require global variables shared across processes, and it's actually really fast. The question is: is it correct? Let's look at an example that's easier to imagine. Say we have a box with 20 marbles in it, identical in all respects except that they are numbered from 1 to 20. We need to pick 5 marbles out of this box at random. Does the probability of a particular marble being picked depend on whether we pick all five at the same time or one at a time? If we pick them all together, the probability of a particular marble being selected is 5/20 == 1/4 == 25%.

If, instead, we pick them one at a time, the numbers look different at first glance: a marble has a 1/20 chance of being the first pick and, if it survives that, a 1/19 chance of being the second pick, and so on. But the overall chance of being the second pick is (19/20) × (1/19) == 1/20, and the same works out for the third, fourth and fifth picks, so the probability of a particular marble being selected is still 5 × 1/20 == 25%. Picking one at a time without replacement doesn't change the odds.

A third possibility is that we look at each marble in sequence, and then decide whether to select it or throw it away. With this approach, the probability of the first marble being selected is 5/20 == 25%. The probability of the second marble being selected, though, depends on whether the first was selected. If the first marble was selected (25%), then we only need 4 more marbles (4/19), but if the first marble was thrown away (75%), we still need 5 marbles (5/19). Since at this point we already know whether the first marble was selected or not, that event is no longer uncertain, and we pick one of the two probabilities for the second marble. (Averaged over both cases it's still 0.25 × 4/19 + 0.75 × 5/19 == 25%, and we always end up with exactly 5 marbles.)

What we're doing in the code above is different: for each marble that we see, there's a 25% chance that we'll pick it and a 75% chance that we'll discard it, and this has no effect on the probability of selecting the next marble. The next marble still has a 25% chance of being selected regardless of whether the first one was. For the selection of previous marbles to be taken into account, we'd need to maintain a count of all selected marbles, which is exactly the shared state we're trying to avoid here.

The outcome here is that we may not always end up with exactly $n elements. In fact, it's possible that we'll end up with no elements or all elements, though on average, we should end up with close to $n elements.
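A quick standalone simulation (just a sanity check, not production code) of the mt_rand() test applied to the marble example shows exactly this: the average sample size is right around 5, but you only get exactly 5 marbles in about a fifth of the runs:

// Simulate the mt_rand() approach on the marble example:
// 20 marbles, each kept with probability 5/20 == 25%, repeated many times.
$trials  = 100000;
$marbles = 20;
$p       = 5 / 20;

$total_picked = 0;
$exactly_five = 0;

for ($t = 0; $t < $trials; $t++) {
    $picked = 0;
    for ($m = 0; $m < $marbles; $m++) {
        if (mt_rand() < $p * mt_getrandmax()) {
            $picked++;
        }
    }
    $total_picked += $picked;
    if ($picked === 5) {
        $exactly_five++;
    }
}

printf("average sample size: %.2f (want 5)\n", $total_picked / $trials);
printf("exactly 5 marbles:   %.1f%% of trials\n", 100 * $exactly_five / $trials);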

All things considered, I think it's a close enough approximation without degrading performance.

What about users?

So what about measuring users instead of page views? All of the above looked only at picking a random sample of page views. In order to pick a random sample of users instead, you'll need a unique identifier for each user, which is typically a cookie, and you'll want that cookie to last for a reasonably long time. Now, since you know how many cookies you've issued over a given period of time, and since you know the validity of a cookie, you can determine how many cookies are currently valid on a given date, and what they are: this is your population (N). You can then use the simple random sampling method (array_rand()) to pull out a random sample (n) of cookies that you want to measure. Once you have this, it's a simple case of matching each user's cookie against your sample list and, if it matches, sending across the measurement code. This is much easier and more correct than what we've seen above, but it has a few caveats:
  • It won't work with users who have cookies disabled
  • There's no guarantee that a user who has fallen into your bucket will ever return to your site
  • New users who don't already have a cookie won't get tested
  • If you have multiple web servers in a farm, you need to make sure all of them know the sample list
You'll also need to tweak your cookie lifetime based on how often you get new users and what percentage of your users return to the site. If your site requires users to log in, this might be much easier. A combination of the two approaches may also work well.
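The per-request check for the cookie approach is tiny. A minimal sketch, assuming the sampled cookie IDs have already been distributed to every web server as a lookup array called $sample_cookies, and that the identifying cookie is named uid (both names are made up for the example):

// $sample_cookies: cookie-ID => true lookup table, generated offline with
// array_rand() over all currently-valid cookies and pushed to each web server.
// 'uid' is a placeholder name for the identifying cookie.
$uid = isset($_COOKIE['uid']) ? $_COOKIE['uid'] : null;

if($uid !== null && isset($sample_cookies[$uid])) {
    measure_performance();   // only users in the sample get the measurement code
}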

Closing thoughts

I don't know everything, and wrt statistics, I know very little. There's a good chance that I've gotten things wrong here, so feel free to leave a comment correcting me and chiding me for spreading false information. It will help me learn, and I can update this post accordingly.

Random sampling is pretty important when it comes to getting a good estimate of a measure for a large population. It's applicable in many fields, though I currently only care about its application to performance analysis. I'll post again about the other things I've learnt.

2 comments:

semanticvoid
January 19, 2010 1:25 PM

A book I'm reading on statistics (may be helpful for others): Principles of Statistics by M. G. Bulmer - http://tinyurl.com/yc7f3tr

Stoyan
January 19, 2010 8:24 PM

I think it makes sense.

Actually I've seen something like this approach as "poor man's cron job"
http://www.phpied.com/phpbb-old-topics-locker/

