What Are Split Scores?

June 8, 2014

(This is the 10th post in a series that started here)

Up until now, I’ve been talking a lot about the rate of change for marathon splits and not as much about the raw splits themselves.

That’s because raw split numbers aren’t always a very good way to compare splits, especially between individual runners.

Think about it. If a runner runs a 1:30 first half of a marathon and a 1:40 second half, she’s run a 3:10 marathon with a 10-minute positive split. Simple, right?

Now, suppose another runner runs a 2:30 first half and a 2:40 second half. That’s a 5:10 marathon, also with a 10-minute positive split.

Ten minutes is ten minutes. So when we compare the two races, both runners ran the same positive split, right?

Not really. 10 minutes is proportionally larger when compared to a 3:10 marathon than it is when compared to a 5:10. Just because both of them had raw splits of 10 minutes, saying that the splits are the same isn’t right. The 5:10 runner clearly ran closer to even splits than the 3:10 runner.

So if we’re going to look at runners and compare their splits or add splits together to analyze them in groups, we need to come up with a better way of assigning a value to each set of splits, a “split score” if you will.

It doesn’t have to be complicated. Let’s define a “split score” as “how far a runner is from even splits, relative to their finish time”.

Here are some sample splits:

Click any image to enlarge


I’ve plotted them on this chart:

sample split score

Runner A ran a 2:30 by running a 1:00 first half (wow!) and a 1:30 second half, giving him a raw split of 30 minutes (represented by the red line).

Our initial “split score” for A is 30 minutes divided by 2 hours and 30 minutes, or .2.

Note that the result is just “.2”, not “.2 minutes”. Since we’re dividing a time by a time, the “time” part drops out, leaving us with what they call a “dimensionless number”.

Most of us don’t have an intuitive feel for what a split of “point 2” means. To convert the score to something we do understand, we can turn it into a percentage by multiplying by 100. So A ends up with a split score of 20%.

Both runner C and runner D took 5 hours to finish the race, twice as long as runner A. Runner C’s raw split is 1 hour, also twice runner A’s, so C’s split score is also 20%.

On the other hand, while runner D’s raw split of 30 minutes is equal to runner A’s, since D’s finish time is twice A’s, D’s split score is half of A’s, or 10%.

Of course, as runner B shows, split scores can be negative, too (a 1:00 second half!?!).
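For anyone who would rather see the arithmetic as code, here is a minimal Python sketch of the split score as defined above. The function names `to_minutes` and `split_score` are made up for this example, and the half times for runners C and D are inferred from the finish times and raw splits given in the text. Runner B is left out because only the second half time is stated.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'H:MM' time string to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def split_score(first_half: str, second_half: str) -> float:
    """Raw split divided by finish time, expressed as a percentage.

    Positive means the second half was slower; negative means it was faster.
    """
    first = to_minutes(first_half)
    second = to_minutes(second_half)
    raw_split = second - first
    finish = first + second
    return 100.0 * raw_split / finish


runners = {
    "A": ("1:00", "1:30"),  # 2:30 finish, 30-minute raw split
    "C": ("2:00", "3:00"),  # 5:00 finish, 60-minute raw split (halves inferred)
    "D": ("2:15", "2:45"),  # 5:00 finish, 30-minute raw split (halves inferred)
}

for name, (first, second) in runners.items():
    print(f"Runner {name}: {split_score(first, second):.0f}%")
```

This prints 20%, 20% and 10% for A, C and D. Run on the two runners from the top of the post, the same function gives about 5.3% for the 1:30/1:40 runner and about 3.2% for the 2:30/2:40 runner, which is the "same raw split, different split score" point in numbers.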

We’ll start to find out what, if anything, split scores are good for in my next post.


A Different Look at the Same Ol’ Data

June 6, 2014

(This is the 9th post in a series that started here)

In this series of posts, I’m trying out different things to see what they reveal. I didn’t expect that I’d do everything the best way the first time. Besides, viewing the same data sets in multiple ways can help create new insights and verify (or correct) old ones. I expect to learn things as I go along.

A wiser person might do their research behind closed doors, only revealing their conclusions once they’ve thought everything through. I’m exposing the process (and my lack of advanced statistical skilz) in the hope that I might improve my results because of helpful and interesting comments from y’all, and on the odd chance that other people might find my learning process somewhat entertaining or enlightening.

Today, I’m presenting another view of the Boston 2014 data, inspired in part by one of your comments. The chart below plots splits vs. finish time:


Click any image to enlarge


It’s essentially the same as my original chart of first half vs. second half splits:

Click to enlarge

…if you grab the latter chart by the even split line and rotate it 45° clockwise around the origin (0,0) so the even split line ends up on the X (horizontal) axis.
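The rotation can be checked with a little trigonometry. The sketch below is an illustration only: the function name `rotate_clockwise_45` and the example runner (a 90-minute first half and a 100-minute second half) are made up. It rotates a (first half, second half) point 45° clockwise and shows that the result is the (finish time, split) point shrunk by a factor of √2, which is why the two charts have the same shape.

```python
import math


def rotate_clockwise_45(first_half: float, second_half: float):
    """Rotate the point (first_half, second_half) 45 degrees clockwise about (0, 0)."""
    angle = math.radians(45)
    x = first_half * math.cos(angle) + second_half * math.sin(angle)
    y = -first_half * math.sin(angle) + second_half * math.cos(angle)
    return x, y


# Hypothetical runner, times in minutes.
first, second = 90.0, 100.0
x, y = rotate_clockwise_45(first, second)

finish = first + second  # horizontal axis of the new chart
split = second - first   # vertical axis of the new chart

# Undo the sqrt(2) shrink and compare with finish time and split.
print(x * math.sqrt(2), finish)
print(y * math.sqrt(2), split)
```

A runner with perfectly even splits lands on the horizontal axis after the rotation, since second half minus first half is zero.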

The new chart is better for many, perhaps most, uses.

You can see that Excel says R² for the linear regression is lower, even though the scattergram is essentially the same. That’s because even though we’re looking at the same thing, by plotting finish time against splits we eliminate (I think) a source of autocorrelation error that was making R² artificially high (something the reader “Stats” called out in a comment).

Polynomial regressions now reveal more about the actual distribution of the data. The 4th order poly on the chart below does a good job of confirming trends within the data set that we discovered earlier. And you can now use Excel to automatically generate moving averages, which makes life much easier.

Boston 2014 Fin vs split w poly movave

You could also say this: Racing is about getting to the finish as fast as possible, so we’re usually interested in how things relate to finish time. Nevertheless, focusing on the relationship between the first and second half splits can sometimes lead to fresh enlightenment.

An aside for the few who might care: Excel was originally designed to be compatible with Lotus 1-2-3, bugs and all, so the default time system used for calculations does not understand negative times. So to calculate splits with Excel for Windows, you have to go into the options and tell Excel to “Use 1904 Date System”. Excel for the Mac uses the 1904 system by default.
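Outside of Excel, negative splits don't need a special setting. As a rough sketch (the helper name `format_split` and the half times are hypothetical), Python's standard `datetime.timedelta` subtracts to a negative duration without complaint, and a few extra lines turn it into a signed minutes-and-seconds string:

```python
from datetime import timedelta


def format_split(first_half: timedelta, second_half: timedelta) -> str:
    """Return second half minus first half as a signed 'M:SS' string."""
    seconds = int((second_half - first_half).total_seconds())
    sign = "-" if seconds < 0 else "+"
    minutes, secs = divmod(abs(seconds), 60)
    return f"{sign}{minutes}:{secs:02d}"


# Hypothetical runner: 1:35:00 first half, 1:33:30 second half.
print(format_split(timedelta(hours=1, minutes=35),
                   timedelta(hours=1, minutes=33, seconds=30)))  # -1:30
```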

Anyhow, we’ll see more charts like these as we address new questions (and perhaps revisit a few old ones) in my next few posts.


Marathon Splits in the Lead Pack

June 5, 2014

(This is the 8th post in a series that started here)

I thought I’d take a closer look at the split data for the sub-2:20 runners, since I never get to see them while we’re all on the course.

In my last post, I pointed out that on average, the faster runners tend toward negative splits more than the bulk of the field. But a close look at the fastest group reveals a lot of positive splits:

Click on any image to enlarge


There’s a simple reason for this: the runners up front aren’t just competing against the clock for a fast time. They’re also competing against each other to win the race.

A relatively large number of runners are capable of sticking with the lead pack through the first half of a marathon. The very fastest runners hold the pace (or go even faster) all the way to the finish line. The remaining runners slow down and get dropped along the way. Those runners are represented by the vertical column of data points lined up above the winners.

In the case of Boston 2014, the lead pack went through the first half in about 1:05. In the second half, as most of the pack dropped off one by one, three runners were able to hold on through the Newton Hills and make it to the finish at a pace as fast as or faster than their first half pace.

You can see the same effect among the women’s leaders:



at the 2013 New York Marathon:


and even at smaller races with less competitive fields, like MDI 2013:


Now go back and take another look at the men’s chart from Boston 2014. What’s unique about it?

Only at Boston did one runner (Meb!) dare to go out ahead of the lead pack. Though he didn’t run the fastest second half split, he did run a negative split, and more importantly, he held on to win.


More to come in my next post.

Comparing Runners Within a Race with Moving Average

June 4, 2014

(This is the 7th post in a series that started here)

We’ve been using linear regression to analyze and compare races, partly because it can be a useful summary of the entire field, and partly because Excel makes it easy to do.

What happens if we try to use linear regression to look at smaller groups of runners within a race?

Here’s what happens if we split up the Boston 2014 field by finish time:

Click any image to enlarge


The main thing we can see is that R² plummets. It’s almost .8 for the entire data set, but .33 or less for the smaller chunks, getting very near zero for one group. In other words, when we look at linear regression for an hour’s worth of finishers at a time, an individual runner’s first half split is almost useless when trying to predict their second half split.

That’s consistent with what we learned from the 2013 results – linear regression isn’t very useful when applied to partial race fields.

Instead, let’s take a look at the “moving average” for our data set. A moving average is a set of numbers, each of which is the average of the corresponding subset of a larger data set. It’s a simple way of smoothing out the data without forcing it into a straight line.

To get the moving average, I divided the field up by finish time into half-hour groups, then plotted the average first half split versus the average second half split for each smaller group of runners. Unfortunately, that’s more difficult to tease out of Excel than a linear regression, but after fiddling around for a while I ended up with this:
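The grouping step is easier to write down in code than in a spreadsheet. Here is a minimal Python sketch of the idea, not the Excel workbook itself: the function name `binned_averages` is made up, the sample splits are invented for illustration (they are not Boston data), and all times are in minutes. It sorts each runner into a half-hour bin by finish time, then averages the first and second halves within each bin.

```python
from collections import defaultdict


def binned_averages(splits, bin_minutes=30):
    """Group (first_half, second_half) pairs by finish time and average each group.

    Returns a list of (bin_start, mean_first, mean_second, count) tuples,
    sorted by bin_start. All times are in minutes.
    """
    groups = defaultdict(list)
    for first, second in splits:
        finish = first + second
        bin_start = int(finish // bin_minutes) * bin_minutes
        groups[bin_start].append((first, second))

    rows = []
    for bin_start in sorted(groups):
        members = groups[bin_start]
        mean_first = sum(f for f, _ in members) / len(members)
        mean_second = sum(s for _, s in members) / len(members)
        rows.append((bin_start, mean_first, mean_second, len(members)))
    return rows


# Made-up splits in minutes, for illustration only.
sample = [(88, 92), (90, 96), (105, 115), (108, 118), (130, 150)]
for row in binned_averages(sample):
    print(row)
```

Each row carries a count as well as the two averages, so the number of runners behind each point on the moving average line stays visible.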


The overall average for the entire data set is shown by the red square. Note in passing how it falls right on the linear regression trendline.

The average for each half-hour chunk is represented by a blue diamond. Connect them up, and the resulting moving average line still follows the linear regression, but instead of forming a completely straight line, it makes a curve with an ‘S’ shape that swerves back and forth slightly relative to the linear regression.

What this shows us is that while the overall trend for all pairs of splits is positive, the trend for the runners at either end of the data set is less positive (closer to even splits) than it is for the runners in the middle.

This makes sense if you think about it.

Even (or negative) splits are the goal of most runners. We expect the runners on the fast end of the distribution to be able to do a better job of holding their pace and achieving that goal.

On the other hand, the slowest runners’ splits curve back toward the even split line because they have fewer illusions. From the start, their goals are less about speed and more about surviving the 26.2 mile distance. So, as the saying goes, they “start out slow, then back off”, which makes them less likely to crash-and-burn.

Meanwhile, the middle of the pack (my people!) has more runners failing to hold on (on average) because they push the pace more than the slower group but lack the ability (on that day, at least) of the faster group.

You can see results consistent with that theory in the moving average chart for Chicago 2013:


And for our set of five smaller races:

others - moving ave

Is the moving average a better representation of the data? Depends on what you need to do.

If you’re looking at runners within a single race, the moving average can tell you more than a straight line about how those runners are distributed without overwhelming you with data points, like the scattergram can.

But it ignores the “weight” of each group. As this histogram shows:


there are many more runners in the mid-pack groups than in either the fastest or slowest groups.

The linear regression takes the relative size of each group into consideration. And the resulting straight line is handier for comparing split trends between races.

I’ll take a closer look at one of the groups of runners in my next post.

(Note: I’m figuring this all out as I go along. Comments or questions, especially from people who’ve taken more stats classes, are always welcome!)

Marathon Splits Sorted by Gender

June 3, 2014

(This is the 6th post in a series that started here)

When we sort the Boston 2014 split data by gender, this is the result:

Boston 2014 gender

Click on any image to enlarge

Yes, I colored the boys’ data blue. Sue me.

The graph shows that as finish times go up (as runners get slower), second half split times for men increase faster relative to first half splits than they do for women.

As we know, the average man runs faster than the average woman (Sorry, ladies!). For example, at Boston 2014, the average man ran about 3:52, while the average woman ran a 4:13. So our data illustrates that while slower runners do tend to have more positive splits, the fact that splits for one group of runners trend towards the positive more than another group’s splits does not necessarily mean that the first group is slower.

Assuming that most runners are trying for even splits, what this probably does show is that men are more likely to go out too fast and fade in the second half (Congratulations, ladies!).

Let’s look at some other races to see if the trend is consistent.

To compare Boston data with another large race, last time we used Chicago 2013. Unfortunately, I don’t have gender data for that race. Luckily, I just got the data from the 2013 New York Marathon (thank you, NYRR!). They included information about each runner’s gender:

2013 NY - gender

And here’s the data from the five smaller races that we looked at earlier:

others - gender

Congratulations again, ladies.

Coming up next, we’ll start to look at race fields sliced up by time.


Marathon Splits for a Single Runner

June 3, 2014

(This is the 5th post in a series that started here)

Over the years, I’ve collected split data from 12 of the 25 marathons that I’ve run:



Here’s the graph, which uses the custom data marker set from my version of Excel:

Click to embiggen


I ran negative splits in my three best marathons, which were all run more than 10 years ago, including my PR at Cape Cod 2002. But my best age-graded marathon (Cape Cod 2012) was a 5 minute positive split.

How much does even pacing matter? Something to think about.

Meanwhile, in my next post, we’ll go back to Boston 2014 and look at split data sorted by gender.

Marathon Split Data Sorted by Age

June 2, 2014

(This is the 4th post in a series that started here)

In this post, we’ll see how marathon split data is affected by age. But first, what might seem to be a digression:

Accuracy is sometimes overrated.

No one can keep 30,000 sets of marathon splits in their head, so we use charts and formulas to help simplify and summarize big sets of data.

The scattergram plots we’ve been making help us visualize how our data is distributed, but a graph with 30,000 points is still too complicated. So we reduce the complexity even more, by asking Excel to create a formula to describe the data.

As an example, let’s look at the trendlines for our Boston 2014 runners, sorted by age:

Click on any image to display a larger version


As we’ve seen, linear regression creates a trendline that’s a pretty good fit for our split data. And a straight line is the easiest kind of formula to understand.

However, Excel has other options available. Some of them fit our split data even better than linear regression.

The best fit from the options provided by Excel is the “power” trendline. Instead of fitting the data to a straight line, Excel fits the data to a curve with a formula of:

y = ax^b


The R² values for each line show that the power trendlines do a better job of representing the data:


But the formula isn’t quite as easy to visualize as a straight line. Comparing two different power trendlines is much harder. I know that for values of b close to 1, the line is pretty close to a straight line, but I can’t guess how y = 1.8454x^1.18 compares to y = 1.8674x^1.186 without drawing the two graphs.
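Short of drawing the graphs, the two curves can be compared by plugging in a few first half times. The sketch below is only an illustration and rests on one loud assumption: that x and y are fractions of a day, which is how Excel stores times. The post doesn't state the units, so treat the printed minutes as indicative at best. The function name `power_curve` is made up.

```python
def power_curve(a: float, b: float, x: float) -> float:
    """Evaluate y = a * x**b."""
    return a * x ** b


# Coefficients of the two power trendlines quoted above.
curve_1 = (1.8454, 1.18)
curve_2 = (1.8674, 1.186)

MINUTES_PER_DAY = 24 * 60

for first_half_minutes in (65, 90, 120, 150, 180):
    # ASSUMPTION: the trendline works in fractions of a day (Excel's time unit).
    x = first_half_minutes / MINUTES_PER_DAY
    y1 = power_curve(*curve_1, x) * MINUTES_PER_DAY
    y2 = power_curve(*curve_2, x) * MINUTES_PER_DAY
    print(f"{first_half_minutes:>4} min first half -> "
          f"{y1:6.1f} vs {y2:6.1f} min second half (ratio {y1 / y2:.4f})")
```

Whatever the units turn out to be, the ratio of the two curves is (1.8454 / 1.8674) times x raised to the power -0.006, so it changes very slowly with x. That fits the observation that the lines are hard to tell apart on the chart.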

Here are the age trendlines again, this time generated with the power option:


Since the data for an entire race is fairly linear, the differences aren’t dramatic. Unless you blow the chart up to an enormous size (click on the image to see), it’s hard to tell the difference between the power curves and the straight lines.

Since the straight line is so easy to understand, it’s often not worth the extra effort to calculate a nominally more accurate type of regression.

Anyhow, going back to the straight line chart, when we sort runners into 10-year groups by age, we can see that there isn’t a whole lot of difference between the groups. To the extent that age matters at all, it appears that the young folks’ splits trend less positive than the rest of the field.

Next time, we’ll take a quick look at another data set.

Marathon splits at Boston 2014

May 29, 2014

Hold on to your slide rules, folks. My next few blog posts are going to get nerdy.

Here’s a picture:

Click to enlarge


This chart graphs the first half splits (on the horizontal, or X, axis) vs. the second half splits (on the vertical, or Y, axis) for runners in the 2014 Boston Marathon.

The data covers about 94% of the finishers. The missing results are scattered randomly, so for all practical purposes this chart can be used to represent the entire field.

Any analysis of this data, to be more than just playing with numbers, should be designed to reveal information that might help you become a better runner. But I do like playing with numbers, so let’s start with the basics and go from there, shall we?

The dashed line on the chart represents even splits. Any runner below the line ran Boston with negative splits, faster over the second half than the first. Anyone above the line ran positive splits – slower in the second half of the race.

The first, and most obvious, conclusion we can draw from the chart is that many, many people, most of the field, ran positive splits at Boston this year. This is in spite of the commonly held opinion that to run your best race, you should run even splits, with your first half as close to the second as you can manage.

The solid line is the result of running a linear regression (using Microsoft Excel) on the data to find the straight line that is the best fit. As you can see from the formula, that line has a slope of about 1.38. In other words, for a “typical” runner, every extra minute in the first half means an extra 1:23 in the second half.

R² is a measurement of how closely the data matches the regression line. In this case, it measures how accurately you can predict a runner’s second half split by knowing their first half split. R² can range from 0 (not at all) to 1 (perfectly). The value of .79 for this particular regression means it’s a pretty good fit.
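For the curious, the slope and R² can be worked out by hand with the standard least-squares recipe. The Python sketch below uses a made-up function name (`linear_fit`) and a handful of invented splits in minutes, so the numbers it prints say nothing about Boston. It only shows where the two figures on the chart come from.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R squared is 1 minus (unexplained variation / total variation).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented first and second half splits in minutes, for illustration only.
first_halves = [65, 90, 105, 120, 150]
second_halves = [66, 97, 118, 135, 180]

slope, intercept, r_squared = linear_fit(first_halves, second_halves)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}, R squared = {r_squared:.3f}")
```

If every point sat exactly on a straight line, the unexplained variation would be zero and R² would come out as exactly 1.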

How much of the difference between the first and second halves was due to the course?

Boston’s notorious hills come in the second half, but the last 10K is a fast, mostly flat or slightly downhill roll to the finish.

The splits from an “even effort” pace calculator that allows for changes in the terrain show the second half as only slightly slower. According to the calculator, the contours of the course will create a positive split of about 2 minutes and 30 seconds for a 3:10 marathoner. Extend that over the complete range of finish times, and the calculator indicates that for our “typical” runner, every extra minute in the first half means an extra 1:01 in the second half. That leaves 22 seconds unaccounted for.

Heat slows a runner down. How much of the difference was due to temperature?

Boston 2014 started out cool. It was near 50 degrees in Hopkinton when the gun went off for the first runners. But it was a bright, sunny day, and temperatures warmed up into the 60’s by the time the last runner crossed the finish.

One study states that the ideal temperature for the average runner is in the low 40’s. At 50, that average runner loses about 1% of their speed, while at 60, it’s about 4%. We have to make a few assumptions here, but it’s not unreasonable to assume that the rise in temperature might turn a 4 hour marathoner into a 4:07 marathoner, or a 2 hour first half into a 2:03:30 second half.

In big, round numbers (we ignored a lot of variables), our “typical” runner lost about two more seconds for every additional minute compared to their first half time due to the temperature.

That still leaves us with 20 seconds unaccounted for.
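The running tally so far, written out as a short Python sketch. It just restates the numbers quoted above (the 1.38 slope, the calculator's 1:01, and the roughly two seconds for heat); the variable names are made up for the example.

```python
# Regression slope: second-half minutes per extra first-half minute.
slope = 1.38
total_seconds = round(slope * 60)  # 83 seconds, i.e. about 1:23

# Pace calculator: the course alone turns each first-half minute into 1:01.
course_seconds = 61
after_course = total_seconds - course_seconds  # left over once the course is accounted for

# Rough temperature estimate: about two more seconds per minute.
heat_seconds = 2
unexplained = after_course - heat_seconds

print(f"Second half per first-half minute: {total_seconds} s")
print(f"Left over after the course:        {after_course} s")
print(f"Left over after course and heat:   {unexplained} s")
```

That reproduces the 22 seconds left after the course and the 20 seconds left after the heat.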

How much of the difference is because running a marathon is hard? Because normal humans start to slow down after 18 to 20 miles in almost every case? Maybe that 20 seconds is what the typical runner loses in every marathon?

We’ll look into this more… next time!

(Note: I’m figuring this all out as I go along. Comments or questions, especially from people who’ve taken more stats classes, are always welcome!)

“If You Map It” in Level Renner

May 5, 2014

Do you ever dream about a run through open countryside, under an infinite sky washed clean by wind and rain, with runners who take responsibility for themselves and each other, keeping expenses low and camaraderie high? If so, you should read my article, “If You Map It”, in the May/June Level Renner.

Level Renner is still free, if you were wondering.

It’s All He Can Do – Blind Grandfather David Kuhn’s Run Around the U.S.

May 1, 2014

I first met David Kuhn at the 2013 Boston Marathon, where we were both running for the Mass. Association for the Blind’s Team With A Vision. David, a 61-year-old blind grandfather of four, returned for the 2014 race, finishing the marathon in 6:15.

David and Joslynn, one of his guides, at Boston 2014


David is apparently a modest guy. When I met David again at this year’s race, one of the things we talked about was how winter weather and injuries had limited his training. He was wondering how he’d do given that his longest run coming in was only 8 miles.

One of the things we didn’t talk about was David’s upcoming run around the country to benefit cystic fibrosis. 11,000 miles over 14 months? Not worth mentioning.

I found out about David’s project when we connected via Facebook after the marathon. His motivation comes from his desire to do something for his 11 year old granddaughter, Kylie, who has cystic fibrosis.

“This run is all about emotions, a powerful driving force,” says David. “It is as though I had all the experiences I had so that together, me and my granddaughter, we will make a difference. Without her, none of my experiences have any real value. Without the pain I feel for her, I don’t put all my experiences toward something that may drive one more nail in the coffin of cystic fibrosis. She and I have come together at this time in human history to create a unique team.”

David, a lifetime Illinois resident, plans to begin his trip from the Seattle area in the next few weeks. His granddaughter has not been doing well, so he wants to get going as soon as he can.

He has a very rough idea of the course: Seattle to Bangor, Maine, south to Jacksonville, Florida, west to San Diego, and back to the start in Seattle.

David finds solving problems as they pop up energizing in itself, and planning every detail makes a project this big overwhelming, so he’s keeping things loose. His goal is to knock the miles down one at a time, taking little bites out of the big picture as he goes.

As with any effort of this type, David needs attention and support of all kinds. But uniquely to him, he will need sighted guides every step of the way. As David notes, “I am blind. As with the many marathons I run, I cannot do this alone. The one thing that will keep me going are all the sighted guides along the way – their energy, their well wishes, their desire to help out on a daily basis.” Of course, there are other needs, for transportation, lodging, food, and more.

He has the right outlook. David says, “For me, life has always been fun. That I continue to find ways of getting things done, even without sight, is exhilarating to say the least. I love re-tooling myself. I thoroughly enjoy taking on tasks that I may not have done when I had sight. If you are wondering what a deck that a blind guy built, and a two car garage and front porch that he rebuilt looks like – so am I.”

“I have learned to fall in love with running after losing my sight, and I feel confident that my legs can carry me over the distance of 11,000 miles. To trust in so many strangers to get through it all feels like the most amazing adventure.”

Join in the adventure!


Follow David’s progress:

David’s web site
“It’s All I Can Do” on Facebook
@allicandoisrun on Twitter


Cystic Fibrosis Foundation
Help finance David’s run

