(This is the 7th post in a series that started here)
We’ve been using linear regression to analyze and compare races, partly because it can be a useful summary of the entire field, and partly because Excel makes it easy to do.
What happens if we try to use linear regression to look at smaller groups of runners within a race?
Here’s what happens if we split up the Boston 2014 field by finish time:
The main thing we can see is that R² plummets. It’s almost 0.8 for the entire data set, but 0.33 or less for the smaller chunks, getting very near zero for one group. In other words, when we fit a linear regression to an hour’s worth of finishers at a time, an individual runner’s first half split is almost useless for predicting their second half split.
That’s consistent with what we learned from the 2013 results – linear regression isn’t very useful when applied to partial race fields.
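For anyone who would rather check numbers like these in a script than in Excel, here’s a minimal Python sketch. It assumes splits are stored as (first half, second half) pairs in minutes, and that “an hour’s worth of finishers” means grouping by whole hours of finish time. The function names are made up for illustration and are not taken from the spreadsheet behind this post. R² is computed as the squared correlation, which matches the R² of a straight-line fit with an intercept.

```python
from collections import defaultdict


def r_squared(xs, ys):
    """Squared Pearson correlation (equals R^2 for a one-variable linear fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either split has no spread
    return (sxy * sxy) / (sxx * syy)


def r_squared_by_group(splits, width_min=60):
    """Group (first_half, second_half) pairs by finish time, then R^2 per group.

    Returns a dict mapping each group's starting finish time (minutes)
    to the R^2 of the runners in that group.
    """
    groups = defaultdict(list)
    for first, second in splits:
        groups[int((first + second) // width_min)].append((first, second))
    result = {}
    for key in sorted(groups):
        pairs = groups[key]
        if len(pairs) >= 3:  # skip groups too small to say anything
            xs, ys = zip(*pairs)
            result[key * width_min] = r_squared(xs, ys)
    return result
```

Calling `r_squared` on every runner’s splits gives the whole-field number, and `r_squared_by_group` gives one number per hour of finishers to set beside it.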
Instead, let’s take a look at the “moving average” for our data set. A moving average is a set of numbers, each of which is the average of one subset (here, one group of finishers) of a larger data set. It’s a simple way of smoothing out the data without forcing it into a straight line.
To get the moving average, I divided the field up by finish time into half-hour groups, then plotted the average first half split versus the average second half split for each smaller group of runners. Unfortunately, that’s more difficult to tease out of Excel than a linear regression, but after fiddling around for a while I ended up with this:
The overall average for the entire data set is shown by the red square. Note in passing how it falls right on the linear regression trendline.
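That placement isn’t a coincidence. A least-squares line fitted with an intercept always passes through the point (average first half, average second half). Here’s a short Python sketch of why, assuming the trendline is an ordinary least-squares fit. The function name and the sample splits are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares intercept works out to this expression,
    # which forces the line through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up splits in minutes, for illustration only.
first_half = [85.0, 95.0, 105.0, 120.0]
second_half = [88.0, 101.0, 118.0, 131.0]

slope, intercept = fit_line(first_half, second_half)
mean_first = sum(first_half) / len(first_half)
mean_second = sum(second_half) / len(second_half)

# The line's prediction at the average first half matches the average
# second half, up to floating-point rounding.
predicted_at_mean = intercept + slope * mean_first
```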
The average for each half-hour chunk is represented by a blue diamond. Connect them up, and the resulting moving average line still follows the linear regression, but instead of forming a completely straight line, it makes a curve with an ‘S’ shape that swerves back and forth slightly relative to the linear regression.
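The half-hour averages are fiddly in a spreadsheet but short in a script. This is a minimal Python sketch of the grouping, assuming splits are (first half, second half) pairs in minutes and that each group covers a fixed half-hour window of finish time. The function name and the sample data are hypothetical, and this is not the exact procedure used to build the chart.

```python
from collections import defaultdict


def binned_averages(splits, width_min=30):
    """Average first and second half splits per finish-time group.

    splits: iterable of (first_half, second_half) pairs in minutes.
    Returns a list of (group_start, runners, avg_first, avg_second),
    ordered from the fastest group to the slowest.
    """
    groups = defaultdict(list)
    for first, second in splits:
        finish = first + second
        groups[int(finish // width_min)].append((first, second))

    rows = []
    for key in sorted(groups):
        pairs = groups[key]
        avg_first = sum(p[0] for p in pairs) / len(pairs)
        avg_second = sum(p[1] for p in pairs) / len(pairs)
        rows.append((key * width_min, len(pairs), avg_first, avg_second))
    return rows


# Made-up splits in minutes, for illustration only.
sample = [(60, 70), (62, 72), (100, 110)]
for start, runners, avg_first, avg_second in binned_averages(sample):
    print(start, runners, avg_first, avg_second)
```

Plotting `avg_first` against `avg_second` for each row gives one point per group. The `runners` count comes along for free, which is handy for seeing how many finishers sit behind each point.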
What this shows us is that while the overall trend for all pairs of splits is positive, the trend for the runners at either end of the data set is less positive (closer to even splits) than it is for the runners in the middle.
This makes sense if you think about it.
Even (or negative) splits are the goal of most runners. We expect the runners on the fast end of the distribution to be able to do a better job of holding their pace and achieving that goal.
On the other hand, the slowest runners’ splits curve back toward the even split line because they have fewer illusions. From the start, their goals are less about speed and more about surviving the 26.2 mile distance. So, as the saying goes, they “start out slow, then back off”, which makes them less likely to crash-and-burn.
Meanwhile, the middle of the pack (my people!) has more runners failing to hold on (on average) because they push the pace more than the slower group but lack the ability (on that day, at least) of the faster group.
You can see results consistent with that theory in the moving average chart for Chicago 2013:
And for our set of five smaller races:
Is the moving average a better representation of the data? Depends on what you need to do.
If you’re looking at runners within a single race, the moving average can tell you more than a straight line about how those runners are distributed, without overwhelming you with data points the way the scattergram can.
But it ignores the “weight” of each group. As this histogram shows:
there are many more runners in the mid-pack groups than in either the fastest or slowest groups.
The linear regression takes the relative size of each group into consideration. And the resulting straight line is handier for comparing split trends between races.
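One way to see what “weight” means here is to compare two ways of averaging group averages: treating every group as equal, or counting each group in proportion to its runners. The Python sketch below uses invented group sizes and times purely for illustration, and the function names are hypothetical.

```python
def unweighted_mean(groups):
    """Average of the group averages, every group counting the same."""
    return sum(mean for _, mean in groups) / len(groups)


def weighted_mean(groups):
    """Average of the group averages, weighted by runners per group."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


# (runners in group, average second half in minutes); invented numbers
# with a crowded middle group and two small groups at the ends.
groups = [(100, 80.0), (800, 100.0), (100, 150.0)]

equal_vote = unweighted_mean(groups)   # each group counts once
by_headcount = weighted_mean(groups)   # each runner counts once
```

The weighted version is the same as averaging every runner directly, so the crowded middle group dominates it. The unweighted version lets a small group at either end pull just as hard as the mid-pack.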
I’ll take a closer look at one of the groups of runners in my next post.
(Note: I’m figuring this all out as I go along. Comments or questions, especially from people who’ve taken more stats classes, are always welcome!)