Ljubljana Marathon 2014

Author: A. Blejec

Introduction

More than 15,000 runners competed in the Ljubljana Marathon on October 26, 2014. The results are available at http://vw-ljubljanskimaraton.si/en/result/16lm# and some preliminary checks showed nice statistical properties, so I decided to see how to get the data from the web page and gain some insight into the results. Just to let you know: I am not a runner :)

Getting the results

First, we have to get the results from the web page. There are several categories; for now I will not go into selecting among them and will simply use one. The names are nicely composed: M42 is the 42 km marathon for all men (Moški in Slovenian), Z42 is the 42 km marathon for all women (Ženske in Slovenian), and 42MA is Marathon - Men, group A.

The URL for M42 is: http://www.pohod.si/lm/M42.asp. Let us read the web page.

lfn <- "http://www.pohod.si/lm/M42.asp"
page <- readLines(lfn,enc="UTF-8")
length(page)
## [1] 1545

The whole HTML page is now stored in page. Here are the first few lines

head(page,10)
##  [1] "<html><head><META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1250\">"                                                                                                                                                                                                                                                                                                                                                                                  
##  [2] "<title>Ljubljanski Maraton</title><style type=\"text/css\">"                                                                                                                                                                                                                                                                                                                                                                                                                 
##  [3] "body      {font-family:\"Tahoma\"; font-face:\"Tahoma\"; font-size:9pt}"                                                                                                                                                                                                                                                                                                                                                                                                     
##  [4] "table     {font-family:\"Tahoma\"; font-face:\"Tahoma\"; font-size:9pt}"                                                                                                                                                                                                                                                                                                                                                                                                     
##  [5] "</style></head><body oncontextmenu=\"return false;\">"                                                                                                                                                                                                                                                                                                                                                                                                                       
##  [6] "<p align=center style=\"font-size:18px; font-weight:bold; color:#73b508\">Volkswagen 19. Ljubljanski maraton<br>"                                                                                                                                                                                                                                                                                                                                                            
##  [7] "<font style=\"font-size:12px; font-weight:bold; color:#73b508\"> Ljubljana, 26.10.2014 <br>"                                                                                                                                                                                                                                                                                                                                                                                 
##  [8] "<p align=center><B>M42 - Maraton - Mo<U+009A>ki<br>Marathon - Men</b><font><br>"                                                                                                                                                                                                                                                                                                                                                                                             
##  [9] "<table cellpadding=1 align=center><tr><td style='background-color:white'><table align=center border=0 cellpadding=2 cellspacing=2><TR bgcolor=#AAAAAA><td>#</td><td>St.<br>Bib</td><td>Ime in priimek<br>Name and Surname</td><td>LR<br>YoB</td><td>Dr<U+009E>ava<br>Country</td><td>Klub<br>Club</td><td>5km</td><td>10km</td><td>15km</td><td>20km</td><td>25km</td><td>30km</td><td>35km</td><td>40km</td><td>Rezultat<br>Result</td><td>Kat.</td><td>#</td><td> </td></tr>"
## [10] "<TR class=r0><td><b>1</b></td><td>6</td><td>Ishimael Bushendich</td><td>1991</td><td>KEN</td><td></td><td>0:15:15</td><td>0:30:23</td><td>0:45:30</td><td>1:00:39</td><td>1:15:52</td><td>1:31:07</td><td>1:46:12</td><td>2:01:24</td><td><b>2:08:25</b></td><td>42MA</td><td>1</td><td> </td></tr>"

and the last few lines

tail(page)
## [1] "<TR class=r0><td><b></b></td><td>2143</td><td>Benjamin Dobnikar</td><td>1989</td><td>SLO</td><td>Zavod Mitikas</td><td>0:25:16</td><td>0:50:05</td><td>1:16:00</td><td>1:41:39</td><td>2:07:28</td><td>2:34:22</td><td></td><td></td><td><b>DNF</b></td><td>42MA</td><td></td><td> </td></tr>"       
## [2] "<TR class=r0><td><b></b></td><td>1434</td><td>Teo Mikl</td><td>1963</td><td>SLO</td><td>Adria Airways Team</td><td>0:25:37</td><td>0:51:13</td><td>1:16:54</td><td>1:42:54</td><td>2:09:24</td><td>2:36:26</td><td></td><td></td><td><b>DNF</b></td><td>42MF</td><td></td><td> </td></tr>"           
## [3] "<TR class=r0><td><b></b></td><td>1438</td><td>Mi<U+009A>o Brkan</td><td>1977</td><td>CRO</td><td>AK FORCA</td><td>0:23:29</td><td>0:46:33</td><td>1:09:42</td><td>1:32:40</td><td>1:56:19</td><td>2:20:03</td><td></td><td></td><td><b>DNF</b></td><td>42MC</td><td></td><td> </td></tr>"            
## [4] "<TR class=r0><td><b></b></td><td>2118</td><td><U+008E>iga Rosensteinn</td><td>1989</td><td>SLO</td><td>Urbani teka<e8>i</td><td>0:26:37</td><td>0:52:57</td><td>1:19:45</td><td>1:47:42</td><td>2:18:15</td><td>2:55:41</td><td></td><td></td><td><b>DNF</b></td><td>42MA</td><td></td><td> </td></tr>"
## [5] "<TR class=r0><td><b></b></td><td>2006</td><td>Rok Kolari<e8></td><td>1976</td><td>SLO</td><td>Teka<U+009A>ki forum</td><td>0:19:36</td><td>0:39:30</td><td>0:59:34</td><td>1:19:38</td><td></td><td></td><td></td><td></td><td><b>DNF</b></td><td>42MC</td><td></td><td> </td></tr>"                 
## [6] "</table></td></tr></table><br><br></body></html>"
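The header printed above declares charset=windows-1250, while the page was read as UTF-8, which is probably why letters such as š show up as <U+009A>. A minimal sketch of the repair, using a hand-built byte string instead of the real page (the names bytes and decoded are illustrative, and "CP1250" is assumed to be an encoding name your iconv build accepts):

```r
# "Moški" as it is stored in a windows-1250 page: 0x9A is the byte for "š"
bytes <- rawToChar(as.raw(c(0x4D, 0x6F, 0x9A, 0x6B, 0x69)))

# Re-encode from the declared code page to UTF-8
decoded <- iconv(bytes, from = "CP1250", to = "UTF-8")

# Code points of the result; 353 (U+0161) is "š"
utf8ToInt(decoded)
```

For the real page, the same idea would be to read through a connection that declares the encoding, for example url(lfn, encoding = "CP1250"); I have not rerun the analysis that way, so treat it as a suggestion.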

Using package XML

A better option than working with the raw lines is to parse the HTML with package XML.

library(XML)
doc <- xmlRoot(htmlTreeParse(lfn,enc="UTF-8"))
#doc

Node names

table(names(doc))
## 
## body head 
##    1    1

The body node has a single child, p

 fields = xmlApply(doc, names)
 table(fields$body)
## 
## p 
## 1

Extract the table. This is a somewhat clumsy way to do it, but after inspecting the HTML page you can see how to proceed.

tbl <- doc[["body"]]
names(xmlChildren(tbl))
## [1] "p"
tbl <- doc[["body"]][["p"]][["font"]][["p"]][["font"]][["table"]][["tr"]][["td"]][["table"]]
xmlName(tbl)
## [1] "table"
xmlSize(tbl)
## [1] 1536
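The long chain of [[ ]] steps above depends on the exact nesting of the page. A possible alternative, sketched here on a small made-up HTML string rather than on the real page, is to let XPath pick the innermost table. The names html, doc2, inner and cells are illustrative only, and the toy rows merely mimic the layout of the real ones.

```r
library(XML)

# Tiny stand-in for the results page: a table nested inside another table
html <- paste0(
  "<html><body><table><tr><td>",
  "<table>",
  "<tr><td>#</td><td>Bib</td><td>Result</td></tr>",
  "<tr><td>1</td><td>6</td><td>2:08:25</td></tr>",
  "<tr><td>2</td><td>2</td><td>2:08:37</td></tr>",
  "</table>",
  "</td></tr></table></body></html>")

doc2 <- htmlParse(html, asText = TRUE)

# Innermost table: a <table> that does not contain another <table>
inner <- getNodeSet(doc2, "//table[not(.//table)]")[[1]]

# One character vector per <tr>, one element per <td>
rows <- xpathApply(inner, ".//tr",
                   function(tr) xpathSApply(tr, "./td", xmlValue))
cells <- do.call(rbind, rows)
cells
```

The XPath expression does not care how many p and font elements wrap the table, so it should survive small changes in the page layout better than a fixed path.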

Get values and attributes

tmp = xmlSApply(tbl, function(x) xmlSApply(x, xmlValue))
tmp <- t(tmp)
atr <- t(xmlSApply(tbl, function(x) xmlSApply(x, xmlAttrs)))
head(atr)
##    td   td   td   td   td   td   td   td   td   td   td   td   td   td  
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
## tr NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
##    td   td   td   td  
## tr NULL NULL NULL NULL
## tr NULL NULL NULL NULL
## tr NULL NULL NULL NULL
## tr NULL NULL NULL NULL
## tr NULL NULL NULL NULL
## tr NULL NULL NULL NULL
head(tmp)
##    td  td       td                               td      td              
## tr "#" "St.Bib" "Ime in priimekName and Surname" "LRYoB" "DržavaCountry"
## tr "1" "6"      "Ishimael Bushendich"            "1991"  "KEN"           
## tr "2" "2"      "Elijah Keitany"                 "1983"  "KEN"           
## tr "3" "19"     "Edwin Kibet Koech"              "1987"  "KEN"           
## tr "4" "8"      "Julius Chepkwony"               "1986"  "KEN"           
## tr "5" "3"      "Augustine Ronoh"                "1982"  "KEN"           
##    td          td        td        td        td        td        td       
## tr "KlubClub"  "5km"     "10km"    "15km"    "20km"    "25km"    "30km"   
## tr Character,0 "0:15:15" "0:30:23" "0:45:30" "1:00:39" "1:15:52" "1:31:07"
## tr Character,0 "0:15:16" "0:30:23" "0:45:31" "1:00:40" "1:15:53" "1:31:07"
## tr Character,0 "0:15:15" "0:30:23" "0:45:31" "1:00:40" "1:15:53" "1:31:07"
## tr Character,0 "0:15:15" "0:30:23" "0:45:30" "1:00:40" "1:15:53" "1:31:07"
## tr Character,0 "0:15:15" "0:30:23" "0:45:30" "1:00:40" "1:15:52" "1:31:07"
##    td        td        td               td     td  td         
## tr "35km"    "40km"    "RezultatResult" "Kat." "#" Character,0
## tr "1:46:12" "2:01:24" "2:08:25"        "42MA" "1" Character,0
## tr "1:46:13" "2:01:33" "2:08:37"        "42MB" "1" Character,0
## tr "1:46:13" "2:01:31" "2:09:04"        "42MA" "2" Character,0
## tr "1:46:13" "2:02:26" "2:10:16"        "42MA" "3" Character,0
## tr "1:46:50" "2:02:52" "2:10:17"        "42MB" "2" Character,0
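The Character,0 entries in the output above mark empty cells: xmlValue returns a zero-length character vector for them, so tmp ends up as a matrix of list elements rather than a plain character matrix. A small base-R sketch of how such a matrix could be flattened, using made-up cells (m and flat are illustrative names, not objects from the analysis):

```r
# A 2 x 3 list-matrix; character(0) stands in for an empty <td>
m <- matrix(list("1", "2",
                 "KEN", "SLO",
                 character(0), "Zavod Mitikas"),
            nrow = 2)

# Replace zero-length cells by "" and return an ordinary character matrix
flat <- matrix(
  vapply(m, function(v) if (length(v) == 0) "" else v, character(1)),
  nrow = nrow(m))
flat
```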

Clean the table

header <- tmp[1,]
data <- tmp[-1,]
dimnames(data) <- list(data[,1],header)
data <- data[dimnames(data)[[1]]!="",-1]
data <- data.frame(data)
## Warning: some row.names duplicated:
## 59,65,96,129,132,135,142,150,183,191,195,199,204,221,229,252,272,280,287,302,306,315,322,324,340,356,358,359,363,378,384,399,402,404,408,424,426,428,430,432,436,448,450,459,478,480,481,483,485,486,492,494,503,505,512,515,521,530,533,535,536,539,542,558,566,568,575,581,585,591,599,601,613,629,648,660,666,672,679,683,686,690,692,695,714,719,722,728,729,733,750,752,757,758,763,765,768,772,776,791,797,802,804,810,813,816,819,838,844,846,850,852,862,870,873,888,914,915,917,918,921,923,924,937,940,942,953,963,964,968,972,983,1006,1015,1040,1051,1061,1070,1078,1083,1089,1097,1107,1115,1123,1125,1131,1140,1143,1145,1149,1172,1185,1188,1201,1224,1225,1230,1235,1240,1242,1250,1262,1264,1280,1293,1305,1317,1348,1355,1378,1389,1390,1393,1404,1412,1419,1430,1458
## --> row.names NOT used
head(data)
##   St.Bib Ime.in.priimekName.and.Surname LRYoB DržavaCountry KlubClub
## 1      6            Ishimael Bushendich  1991            KEN         
## 2      2                 Elijah Keitany  1983            KEN         
## 3     19              Edwin Kibet Koech  1987            KEN         
## 4      8               Julius Chepkwony  1986            KEN         
## 5      3                Augustine Ronoh  1982            KEN         
## 6     12                  Wilfred Kirwa  1986            KEN         
##      X5km   X10km   X15km   X20km   X25km   X30km   X35km   X40km
## 1 0:15:15 0:30:23 0:45:30 1:00:39 1:15:52 1:31:07 1:46:12 2:01:24
## 2 0:15:16 0:30:23 0:45:31 1:00:40 1:15:53 1:31:07 1:46:13 2:01:33
## 3 0:15:15 0:30:23 0:45:31 1:00:40 1:15:53 1:31:07 1:46:13 2:01:31
## 4 0:15:15 0:30:23 0:45:30 1:00:40 1:15:53 1:31:07 1:46:13 2:02:26
## 5 0:15:15 0:30:23 0:45:30 1:00:40 1:15:52 1:31:07 1:46:50 2:02:52
## 6 0:15:15 0:30:23 0:45:30 1:00:40 1:15:52 1:31:06 1:46:12 2:02:10
##   RezultatResult Kat. X. character.0.
## 1        2:08:25 42MA  1             
## 2        2:08:37 42MB  1             
## 3        2:09:04 42MA  2             
## 4        2:10:16 42MA  3             
## 5        2:10:17 42MB  2             
## 6        2:10:26 42MA  4
data <- data[,-grep("character",names(data))]
dim(data)
## [1] 1501   16

Extract results and category data

result <- unlist(data[,grep("Result",names(data))])
names(result) <- NULL
head(result)
## [1] "2:08:25" "2:08:37" "2:09:04" "2:10:16" "2:10:17" "2:10:26"

Number of runners in different categories

category <- as.factor(unlist(data[,grep("Kat",names(data))]))
table(category)
## category
## 42MA 42MB 42MC 42MD 42ME 42MF 42MG 42MH 42MI 42MJ 
##  180  187  294  306  213  187   77   34   17    6
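Since the category codes are composed systematically, they can be split into their parts. The sketch below assumes every code has the form digits, one letter, one letter, which holds for the ten codes in the table above; the pattern pat and the column names are my own choice.

```r
codes <- c("42MA", "42MB", "42MJ")      # a few of the codes listed above
pat <- "^([0-9]+)([A-Z])([A-Z])$"       # assumed layout: distance, sex, group

parts <- data.frame(
  code     = codes,
  distance = as.numeric(sub(pat, "\\1", codes)),  # distance in km
  sex      = sub(pat, "\\2", codes),              # "M" = men
  group    = sub(pat, "\\3", codes),              # group letter
  stringsAsFactors = FALSE)
parts
```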

Time conversion function

A function that converts time strings of the form h:mm:ss to minutes

as.mins <- function(x) {
  # Convert "h:mm:ss" strings to minutes (vectorized over x)
  pat <- "^([0-9]+):([0-9]+):([0-9]+)"
  h <- as.numeric(sub(pat, "\\1", x))  # hours
  m <- as.numeric(sub(pat, "\\2", x))  # minutes
  s <- as.numeric(sub(pat, "\\3", x))  # seconds
  secs <- (h * 60 + m) * 60 + s
  mins <- secs / 60
  return(mins)
}
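As a check of the arithmetic, the winning time 2:08:25 is (2*60+8)*60+25 = 7705 seconds, or about 128.42 minutes. The sketch below shows an alternative way to do the conversion by splitting on the colon; to_minutes is a hypothetical helper, not the function used in the rest of this analysis, and it returns NA for entries such as DNF.

```r
# Alternative sketch: split on ":" instead of using regular expressions
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)   # not h:mm:ss, e.g. "DNF"
    p <- as.numeric(p)
    (p[1] * 3600 + p[2] * 60 + p[3]) / 60  # total seconds, then minutes
  }, numeric(1))
}

res <- to_minutes(c("2:08:25", "0:15:15", "DNF"))
res
```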

Convert results to minutes and calculate the time lag from the winner.

result <- as.mins(result)  # as.mins is vectorized, so sapply() is not needed
lag <- result-result[1]

Distribution of results

par(mfrow=c(2,2))
hist(lag)
plot(lag,rank(lag),type="S",lwd=4)
filter <- category==levels(category)[1]
points(lag[filter],rank(lag)[filter],pch=16,col=2)
boxplot(lag,horizontal=TRUE)
qqnorm(lag,col=filter+1)

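[Figure: four panels showing a histogram of lag, the rank of lag against lag with the first category (42MA) marked in red, a horizontal boxplot of lag, and a normal Q-Q plot of lag]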
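The col=filter+1 argument in the qqnorm call relies on R turning logical values into numbers: FALSE + 1 is 1 and TRUE + 1 is 2, so runners outside and inside the selected category get the first and second colour of the current palette. A tiny sketch with made-up flags (in_group and col_index are illustrative names):

```r
# Made-up membership flags: TRUE marks runners in the highlighted category
in_group <- c(TRUE, FALSE, FALSE, TRUE)

col_index <- in_group + 1   # FALSE -> 1, TRUE -> 2
col_index

palette()[col_index]        # colours that would be used for each point
```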

SessionInfo

Windows 7 x64 (build 7601) Service Pack 1
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Slovenian_Slovenia.1250 LC_CTYPE=Slovenian_Slovenia.1250
[3] LC_MONETARY=Slovenian_Slovenia.1250 LC_NUMERIC=C
[5] LC_TIME=Slovenian_Slovenia.1250

attached base packages:
[1] stats graphics datasets utils grDevices methods base

other attached packages:
[1] XML_3.98-1.1

loaded via a namespace (and not attached):
[1] digest_0.6.4 evaluate_0.5.5 formatR_0.10 htmltools_0.2.4
[5] knitr_1.6 rmarkdown_0.2.50 stringr_0.6.2 tools_3.0.2

Project path: D:/_Y/R/CompStatistics\
Main file: ../doc/LjMarathon.Rnw