(This is the 12th post in a series that started here)
When I looked at the 5K splits from the 2014 Boston Marathon, I mostly saw what I expected to see.
I calculated the average 5K splits for the 2014 race both as a whole and broken down into specific groups of runners. This chart shows those 5K splits converted to a minute-per-mile pace:
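For anyone who wants to reproduce the pace conversion, here is a minimal sketch of the arithmetic. The function names and the 25-minute sample split are made up for illustration; they are not from my spreadsheet.

```python
KM_PER_MILE = 1.609344  # international mile, in kilometers


def pace_min_per_mile(split_minutes, distance_km=5.0):
    """Convert a split over distance_km into a decimal minute-per-mile pace."""
    miles = distance_km / KM_PER_MILE
    return split_minutes / miles


def format_pace(pace):
    """Render a decimal pace (e.g. 8.05) as m:ss per mile."""
    total_seconds = round(pace * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


# Hypothetical 25-minute 5K split
print(format_pace(pace_min_per_mile(25.0)))  # prints 8:03
```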
Here’s the elevation map for the Boston course:
The Newton hills start at about 28K and end near the 33K mark. They don’t line up perfectly with the 5K splits so their effect is muted some, but you can still see from the 30K and 35K splits how the hills slow everyone down. Splitting up the data by finish time shows that the hills hit the slower runners harder.
Then comes something I found interesting. Once you get through the hills, the next 7K through Brookline is pretty easy, trending mostly flat or downhill. Overall, the average runner was able to take advantage of the terrain and pick up the pace. Perhaps unsurprisingly, the “Even splits” group did a particularly good job of making up time lost in the hills.
A closer look shows that the overall effect was created by the extreme groups – the very fastest runners in the race and the slowest. The middle of the pack averaged even slower after the hills than they did through Newton, though the course is much easier. Even the runners finishing between 2:30 and 3:00, a pretty fast group, lost ground.
The last 2K was even more surprising to me.
That part of the course has its challenges (the bump over the Mass Pike, the Mass Ave. underpass, and the climb up Hereford St.) but they’re nothing compared to what came earlier in the race. Countering that, almost everyone who runs a marathon gets a burst of energy when their goal finally comes within reach. And at Boston, the thickest packs of spectators start just past the 40K mark, in Kenmore Square. The roar of the crowd is relentless from Kenmore on in.
I expected to see splits dropping across the board. But the trend was clear. As runners’ finish times got faster, their times over the last 2K got slower (not in an absolute sense, but relative to their previous splits). 2:30 runners still finished faster than 5:30 runners, but the 2:30 runners slowed down while the 5:30 runners sped up. The fastest runners were dramatically (for them) slower over the last 2K.
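To make "relative to their previous splits" concrete, here is a rough sketch of one way to measure it: compare the pace over the final 2.195K with the pace over the 35K to 40K split. The function names and both sample runners are hypothetical, chosen only to show the sign of the result, and this isn't necessarily the exact calculation behind the chart.

```python
MARATHON_KM = 42.195


def pace_per_km(seconds, km):
    """Seconds per kilometer over a segment."""
    return seconds / km


def final_segment_change(split_35_40_seconds, final_seconds):
    """Percent change in pace from the 35K-40K split to the final 2.195K.

    Positive means the runner slowed down; negative means they sped up.
    """
    prev_pace = pace_per_km(split_35_40_seconds, 5.0)
    last_pace = pace_per_km(final_seconds, MARATHON_KM - 40.0)
    return (last_pace - prev_pace) / prev_pace * 100.0


# Hypothetical fast runner: 17:30 from 35K to 40K, then 8:00 to the finish
fast = final_segment_change(17 * 60 + 30, 8 * 60)
# Hypothetical slow runner: 40:00 from 35K to 40K, then 17:00 to the finish
slow = final_segment_change(40 * 60, 17 * 60)

print(f"fast runner: {fast:+.1f}%  slow runner: {slow:+.1f}%")
```

In this made-up example the fast runner still covers the last stretch in less than half the slow runner's time, but the fast runner's change is positive (slowing) while the slow runner's is negative (speeding up).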
Even the “Even splits” group lost ground after 40K, while the “Most positive” group made up more than anyone.
Maybe there’s an element of “oh well, I’m not winning” in the sub-2:20 group, but that doesn’t apply to the 3:30 marathoners, and they still slowed down.
It’s not just Boston. The 5K splits for New York show the same pattern:
though the elevation profile is different:
To run your fastest race, you want to “leave it all out on the course”. If you have the energy to pick up the pace as you approach the finish, you probably could have run faster earlier on and ended up with a better finish time.
But there’s a fine line between running your best race and going out too fast. As we know, most runners run positive splits. Marathoning is hard.
The 5K charts show that in general, slower runners drop off from their initial pace sooner than faster runners.
For the slowest runners, part of the drop may be because, on average, they aren’t pushing themselves quite as close to their potential as the faster runners. That does leave the slower runners with more energy left after the Newton hills to speed up and look good for their finish line photos.
Mid-packers don’t give in as easily, holding on to their initial pace farther into the race. Actually, mid-packers might be better served by taking it a little easier from the start; on average, they’re not able to take advantage of the easier terrain after the hills.
Even the fastest runners have difficulty getting the first 40K just right, so they’re tired, but not too tired, for the last 2K. Apparently the best runners believe that if you’re going to make a mistake in pacing, it’s best to err slightly on the optimistic side and risk running out of gas a little early, just as long as you don’t get too sanguine in the early miles.
It sounds funny to say that the fastest runners “went out too fast” over the first 40K, but I think that’s what’s happening.
- There is a wide variance within any of the groups of runners. While the average runner in a group may have a certain split pattern, many individuals within the group will have entirely different splits.
- I used split scores to determine who belonged in the “Even splits” and “Most positive” groups. The first and second half splits for the 310 runners in the Even splits group were within ±0.2% of each other. The Most positive group was made up of about 300 runners whose second half was at least 20% slower than their first.
- I looked into creating a “Most Negative” group. There weren’t any runners with 20% negative splits, and only about 100 with splits that were more than 2% negative. The average 5K splits for those runners were badly skewed by a few runners with individual splits way out of line with the rest of their race, perhaps because of portapotty breaks or other temporary issues, so I decided not to chart them.
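Here is a sketch of how the grouping described in these notes could be done. It assumes both thresholds are applied to the split score (raw split divided by finish time, as a percentage); the function names, the "Other" label, and the sample times are hypothetical.

```python
def split_score(first_half, second_half):
    """Raw split divided by finish time, as a percentage.

    Both halves must be in the same unit (minutes here).
    """
    return (second_half - first_half) / (first_half + second_half) * 100.0


def classify(first_half, second_half, even_band=0.2, positive_cutoff=20.0):
    """Assign a runner to a group (assumed thresholds: +/-0.2% and 20%)."""
    score = split_score(first_half, second_half)
    if abs(score) <= even_band:
        return "Even splits"
    if score >= positive_cutoff:
        return "Most positive"
    return "Other"


# Hypothetical half times, in minutes
print(classify(100.0, 100.2))  # nearly identical halves
print(classify(100.0, 160.0))  # big fade in the second half
print(classify(100.0, 110.0))  # ordinary positive split
```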
More on this topic in my next post.
(This is the 11th post in a series that started here)
Last time, we defined a “split score” as a runner’s raw split divided by their finish time.
Split scores work great for individual runners, easily showing us that a 5-hour marathoner’s 10 minute split is smaller, relatively, than the same raw split would be for a 3-hour runner.
That’s all well and good, but split scores will only matter to anyone if we can use them to gain new insights about runners – in general, or when we group them by time, age, gender, or in other ways.
The core of any potential usefulness for split scores is their ability to make comparisons between runners of different abilities more meaningful – to say, “All else being equal, here’s how the 5 hour marathoners compare to the 3 hour marathoners.”
The math is easy, and the logic behind it makes superficial sense. What I’m not sure of is what, if anything, we gain by doing this. Because the main difference between 5 hour and 3 hour marathoners IS their finish time.
Whatever. Get to the point, Charbonneau. Show us some split scores. Maybe we’ll see something.
Here’s finish time vs. split score for Boston 2014:
And to refresh your memory, here’s finish time vs. raw splits:
You can’t compare the two directly, but you can see how the general shape of the split score scattergram hews closer to the horizontal axis.
If you plot raw splits vs. split score, you can see that the relationship is non-linear:
As it should be, since another way of looking at split scores is that if you make a right triangle by plotting splits versus finish times, the split score is the tangent of the angle between the hypotenuse of that triangle and the side represented by the finish time.
In spite of that, on the next chart, I plotted the moving averages and 4th order polynomials for the two data sets, scaling the raw split data so the linear regression for the two sets of data overlaid each other as much as possible. That choice is entirely arbitrary, and is probably more misleading than it is revealing since the relationship between the two data sets isn’t linear, but it does show something about how the two sets of curves change relative to one another:
Let’s try comparing men vs. women. Here’s raw splits vs finish times, divided by gender:
And here’s the same chart, with split scores replacing raw splits on the Y axis:
The use of split scores seems to accentuate the spread of values for any specific finishing time, which also increases the difference between men’s splits and women’s splits. Perhaps because split scores remove another source of autocorrelation error by factoring out finish time?
This chart tries to summarize the differences:
Men have faster average finish times, but larger average splits, whether you measure raw splits or split scores. The difference is slightly greater for split scores. Note that the average split divided by the average finish time is not the same as the average split score.
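That last note is easy to check with two made-up runners. This is only an illustration with hypothetical numbers, not the Boston data:

```python
# Hypothetical (raw split, finish time) pairs, in minutes
runners = [
    (30.0, 150.0),  # split score 0.20
    (30.0, 300.0),  # split score 0.10
]

# Average of the individual split scores
avg_of_scores = sum(split / finish for split, finish in runners) / len(runners)

# Average split divided by average finish time
avg_split = sum(split for split, _ in runners) / len(runners)
avg_finish = sum(finish for _, finish in runners) / len(runners)
score_of_avgs = avg_split / avg_finish

print(f"average split score:        {avg_of_scores:.4f}")
print(f"avg split / avg finish time: {score_of_avgs:.4f}")
```

The first number comes out to 0.15 and the second to about 0.1333, because dividing the averages gives the slower runner's finish time more weight.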
There’s certainly a difference between raw splits and split scores. Does that difference help us find any better answers for our questions? What do you think? I still don’t know, but I did find one use for split scores in my next post.
(This is the 10th post in a series that started here)
Up until now, I’ve been talking a lot about the rate of change for marathon splits and not as much about the raw splits themselves.
That’s because raw split numbers aren’t always a very good way to compare splits, especially between individual runners.
Think about it. If a runner runs a 1:30 first half of a marathon and a 1:40 second half, she’s run a 3:10 marathon with a 10-minute positive split. Simple, right?
Now, suppose another runner runs a 2:30 first half and a 2:40 second half. That’s a 5:10 marathon, also with a 10-minute positive split.
Ten minutes is ten minutes. So when we compare the two races, both runners ran the same positive split, right?
Not really. Ten minutes is proportionally larger when compared to a 3:10 marathon (about 5.3% of the finish time) than it is when compared to a 5:10 (about 3.2%). Both runners had raw splits of 10 minutes, but that doesn’t make their splits the same. The 5:10 runner clearly ran closer to even splits than the 3:10 runner.
So if we’re going to look at runners and compare their splits or add splits together to analyze them in groups, we need to come up with a better way of assigning a value to each set of splits, a “split score” if you will.
It doesn’t have to be complicated. Let’s define a “split score” as “how far a runner is from even splits, relative to their finish time”.
Here are some sample splits:
I’ve plotted them on this chart:
Runner A ran a 2:30 by running a 1:00 first half (wow!) and a 1:30 second half, giving him a raw split of 30 minutes (represented by the red line).
Our initial “split score” for A is 30 minutes divided by 2 hours and 30 minutes, or .2.
Note that the result is just “.2”, not “.2 minutes”. Since we’re dividing a time by a time, the “time” part drops out, leaving us with what they call a “dimensionless number”.
Most of us don’t have an intuitive feel for what a split of “point 2” means. To convert the score to something we do understand, we can turn it into a percentage by multiplying by 100. So A ends up with a split score of 20%.
Both runner C and runner D took 5 hours to finish the race, twice as long as runner A. Runner C’s raw split is 1 hour, also twice runner A’s, so C’s split score is also 20%.
On the other hand, while runner D’s raw split of 30 minutes is equal to runner A’s, since D’s finish time is twice A’s, D’s split score is half of A’s, or 10%.
Of course, as runner B shows, split scores can be negative, too (a 1:00 second half!?!).
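The worked examples for runners A, C, and D can be checked with a few lines of code. This is a sketch: the helper names are mine, and the half times for C and D are derived from the finish times and raw splits given above rather than copied from the table. Runner B is left out because only the second half is quoted here.

```python
def to_minutes(hmm):
    """Convert an 'h:mm' string to minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def split_score_pct(first_half, second_half):
    """Raw split divided by finish time, expressed as a percentage."""
    first = to_minutes(first_half)
    second = to_minutes(second_half)
    return (second - first) / (first + second) * 100.0


# (first half, second half); C and D are derived from the 5:00 finish
# times and the raw splits of 1:00 and 0:30 described in the text
runners = {
    "A": ("1:00", "1:30"),
    "C": ("2:00", "3:00"),
    "D": ("2:15", "2:45"),
}

for name, (first, second) in runners.items():
    print(name, f"{split_score_pct(first, second):.0f}%")
```

A runner with a faster second half gets a negative score from the same formula.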
We’ll start to find out what, if anything, split scores are good for in my next post.
(This is the 9th post in a series that started here)
In this series of posts, I’m trying out different things to see what they reveal. I didn’t expect that I’d do everything the best way the first time. Besides, viewing the same data sets in multiple ways can help create new insights and verify (or correct) old ones. I expect to learn things as I go along.
A wiser person might do their research behind closed doors, only revealing their conclusions once they’ve thought everything through. I’m exposing the process (and my lack of advanced statistical skilz) in the hope that I might improve my results because of helpful and interesting comments from y’all, and on the odd chance that other people might find my learning process somewhat entertaining or enlightening.
Today, I’m presenting another view of the Boston 2014 data, inspired in part by one of your comments. The chart below plots splits vs. finish time:
It’s essentially the same as my original chart of first half vs. second half splits:
…if you grab the latter chart by the even split line and rotate it 45° clockwise around the origin (0,0) so the even split line ends up on the X (horizontal) axis.
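If you want to convince yourself of the rotation, here is a small sketch. It rotates a (first half, second half) point 45° clockwise and shows that the result is the (finish time, split) point, shrunk by a factor of √2. The function name and the sample runner are hypothetical.

```python
import math


def rotate_clockwise_45(x, y):
    """Rotate the point (x, y) by 45 degrees clockwise about the origin."""
    c = math.cos(math.pi / 4)
    s = math.sin(math.pi / 4)
    return (x * c + y * s, -x * s + y * c)


# Hypothetical runner: 90-minute first half, 100-minute second half
first_half, second_half = 90.0, 100.0
rx, ry = rotate_clockwise_45(first_half, second_half)

# Undo the uniform scaling to recover finish time and raw split
finish_time = rx * math.sqrt(2)
raw_split = ry * math.sqrt(2)
print(round(finish_time, 6), round(raw_split, 6))

# A point on the even split line lands on the horizontal axis
_, even_y = rotate_clockwise_45(95.0, 95.0)
print(abs(even_y) < 1e-9)
```

So the two charts differ by a rotation plus a uniform change of scale, which doesn't alter the shape of the scattergram.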
The new chart is better for many, perhaps most, uses.
You can see that Excel says R² for the linear regression is lower, even though the scattergram is essentially the same. That’s because even though we’re looking at the same thing, by plotting finish time against splits we eliminate (I think) a source of autocorrelation error that was making R² artificially high (something the reader “Stats” called out in a comment).
Polynomial regressions now reveal more about the actual distribution of the data. The 4th order poly on the chart below does a good job of confirming trends within the data set that we discovered earlier. And you can now use Excel to automatically generate moving averages, which makes life much easier.
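For readers working outside Excel, a simple moving average over this kind of data might look like the sketch below. It sorts runners by finish time and averages each window of consecutive runners. This is my own simplified version with made-up data and a hypothetical function name, and it isn't necessarily identical to what Excel's trendline computes.

```python
def moving_average(points, window=3):
    """Average finish time and split over each window of consecutive runners.

    points: list of (finish_time, split) pairs, in any order.
    Returns a list of (mean finish_time, mean split) pairs.
    """
    ordered = sorted(points)  # sort by finish time
    averages = []
    for i in range(len(ordered) - window + 1):
        chunk = ordered[i:i + window]
        mean_finish = sum(finish for finish, _ in chunk) / window
        mean_split = sum(split for _, split in chunk) / window
        averages.append((mean_finish, mean_split))
    return averages


# Hypothetical (finish time, split) pairs, in minutes
sample = [(180.0, 5.0), (200.0, 10.0), (190.0, 0.0), (210.0, 15.0)]
print(moving_average(sample, window=2))
```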
You could also say this: Racing is about getting to the finish as fast as possible, so we’re usually interested in how things relate to finish time. Nevertheless, focusing on the relationship between the first and second half splits can sometimes lead to fresh enlightenment.
An aside for the few who might care: Excel was originally designed to be compatible with Lotus 1-2-3, bugs and all, so the default time system used for calculations does not understand negative times. So to calculate splits with Excel for Windows, you have to go into the options and tell Excel to “Use 1904 Date System”. Excel for the Mac uses the 1904 system by default.
Anyhow, we’ll see more charts like these as we address new questions (and perhaps revisit a few old ones) in my next few posts.
(This is the 8th post in a series that started here)
I thought I’d take a closer look at the split data for the sub-2:20 runners, since I never get to see them while we’re all on the course.
In my last post, I pointed out that on average, the faster runners tend toward negative splits more than the bulk of the field. But a close look at the fastest group reveals a lot of positive splits:
There’s a simple reason for this: the runners up front aren’t just competing against the clock for a fast time. They’re also competing against each other to win the race.
A relatively large number of runners are capable of sticking with the lead pack through the first half of a marathon. The very fastest runners hold the pace (or go even faster) all the way to the finish line. The remaining runners slow down and get dropped along the way. Those runners are represented by the vertical column of data points lined up above the winners.
In the case of Boston 2014, the lead pack went through the first half in about 1:05. In the second half, as most of the pack dropped off one by one, three runners were able to hold on through the Newton Hills and make it to the finish at a pace as fast or faster than their first half pace.
You can see the same effect among the women’s leaders:
at the 2013 New York Marathon:
and even at smaller races with less competitive fields, like MDI 2013:
Now go back and take another look at the men’s chart from Boston 2014. What’s unique about it?
Only at Boston did one runner (Meb!) dare to go out ahead of the lead pack. Though he didn’t run the fastest second half split, he did run a negative split, and more importantly, he held on to win.
More to come in my next post…