Methods

The Algorithm

Several years' worth of race results from various professional-level competitions were gathered and organized in a database. A network analysis is then performed on the compiled results in Python using the NetworkX library. The program creates a directed graph in which the nodes represent unique athletes and the edges (the links between them) represent race outcomes, within a specified time frame, between a pair of athletes. For each race, edges are created for all "n choose 2" pairwise comparisons (every possible unique combination of 2 athletes out of n total finishers).
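The pairwise expansion described above can be sketched with the standard library alone. This is a minimal illustration, not the ranking program's actual code: the function name `pairwise_outcomes` and the athlete names are hypothetical, and ties are ignored here.

```python
from itertools import combinations
from math import comb


def pairwise_outcomes(finish_order):
    """Return (winner, loser) pairs for a single race.

    finish_order is a list of athlete names, fastest first.
    Ties are not handled in this sketch.
    """
    # combinations() preserves input order, so the first athlete in each
    # pair always finished ahead of the second.
    return list(combinations(finish_order, 2))


# Hypothetical race with n = 4 finishers.
race = ["Athlete A", "Athlete B", "Athlete C", "Athlete D"]
pairs = pairwise_outcomes(race)

# The number of pairs equals "n choose 2".
print(len(pairs), comb(len(race), 2))
```

A race with 4 finishers yields 6 pairs, while a field of 60 finishers yields 1,770, so the number of comparisons grows quickly with the size of the field.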

Edges between nodes are given a weight based on four factors:

  • Age: The date of the race relative to the "as of" date of the ranking.

  • Depreciation: The rate at which the weight of a result decreases over time.

  • Competition: The event at which the race took place (results from the Olympic Games and LEN Cup are weighted differently).

  • Distance: The distance of the race, which allows the ranking to be optimized for a specific distance.
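One way the four factors could combine into a single weight is sketched below. The exact formula is not given on this page, so everything here is an assumption made for illustration: the multiplicative form, the monthly compounding of depreciation, the competition multipliers in `COMPETITION_WEIGHT`, and the distance penalty are all hypothetical.

```python
from datetime import date

# Hypothetical multipliers; the real values are tuned inputs and are not
# published in this section.
COMPETITION_WEIGHT = {"Olympic Games": 2.0, "LEN Cup": 1.0}


def result_weight(race_date, as_of, monthly_depreciation,
                  competition, race_km, target_km):
    """Return an illustrative weight for one race result."""
    # Age: how long before the "as of" date the race took place.
    age_months = (as_of - race_date).days / 30.44
    if age_months < 0:
        raise ValueError("race takes place after the 'as of' date")
    # Depreciation: assumed here to compound once per month.
    age_factor = (1.0 - monthly_depreciation) ** age_months
    # Competition: unknown events fall back to a neutral multiplier.
    competition_factor = COMPETITION_WEIGHT.get(competition, 1.0)
    # Distance: races far from the target distance count for less.
    distance_factor = 1.0 / (1.0 + abs(race_km - target_km) / target_km)
    return age_factor * competition_factor * distance_factor


as_of = date(2022, 1, 1)
recent = result_weight(date(2021, 12, 1), as_of, 0.05, "LEN Cup", 10.0, 10.0)
older = result_weight(date(2020, 12, 1), as_of, 0.05, "LEN Cup", 10.0, 10.0)
print(recent > older)
```

Under these assumptions, a newer result always outweighs an otherwise identical older one, and a race at the target distance keeps its full weight.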

As the program loops through race results within the specified time frame, nodes are added to the graph if they do not already exist, and edges are either added or modified by calculating the new result's weight and adding that to the edge's existing weight. All race weight contributions to each edge are stored in a dictionary as an attribute of the edge for future troubleshooting and analysis.
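The update step described above might look like the following with NetworkX. This is a sketch under stated assumptions: the function name `add_race`, the attribute names `weight` and `contributions`, and the race identifiers are hypothetical, and the edge direction (from the beaten athlete to the athlete who beat them) is an assumption that the page does not confirm.

```python
from itertools import combinations

import networkx as nx


def add_race(graph, race_id, finish_order, weight):
    """Add one race's pairwise outcomes to the graph.

    finish_order lists athletes fastest first; weight is the value already
    calculated for this race.
    """
    for winner, loser in combinations(finish_order, 2):
        # Assumed direction: the edge points from the loser to the winner.
        if graph.has_edge(loser, winner):
            edge = graph[loser][winner]
            edge["weight"] += weight
            # Keep each race's contribution for later troubleshooting.
            edge["contributions"][race_id] = weight
        else:
            # add_edge() also creates any node that does not exist yet.
            graph.add_edge(loser, winner, weight=weight,
                           contributions={race_id: weight})


G = nx.DiGraph()
add_race(G, "race-1", ["A", "B", "C"], 1.0)
add_race(G, "race-2", ["A", "C", "B"], 0.5)
print(G["B"]["A"])
```

Storing the per-race contributions next to the running total makes it possible to trace any edge weight back to the races that produced it.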

Once the graph is created, a "random walk" PageRank algorithm is run. Through thousands of iterative calculations, it converges on a value for each node that represents the likelihood that the "random walker" is on that node at any given time. Because these values are probabilities, the PageRank values of all nodes sum to 1, and athletes (nodes) are ranked by that value. The beauty of this methodology is that, in addition to taking into account the four factors mentioned above, it also accounts for the most important factor: who an athlete beats, and is beaten by, in a given race.
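To make the random walk concrete, here is a small pure-Python power iteration over weighted edges. It is an illustrative sketch, not the program's implementation: the damping factor of 0.85, the tolerance, the loser-to-winner edge direction, and the sample weights are assumptions. NetworkX provides an equivalent calculation through `networkx.pagerank(G, alpha=0.85, weight="weight")`.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """Weighted PageRank by power iteration.

    edges maps (source, target) -> weight; here the source is assumed to be
    the beaten athlete and the target the athlete who beat them.
    """
    nodes = sorted({node for pair in edges for node in pair})
    positive = {pair: w for pair, w in edges.items() if w > 0}
    out_weight = {node: 0.0 for node in nodes}
    for (source, _target), w in positive.items():
        out_weight[source] += w

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        # Rank sitting on nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), w in positive.items():
            new_rank[target] += damping * rank[source] * w / out_weight[source]
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical weights: A beat B and C twice; B and C each beat the other once.
edges = {("B", "A"): 1.5, ("C", "A"): 1.5, ("C", "B"): 1.0, ("B", "C"): 0.5}
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, sum(scores.values()))
```

Because every step redistributes the full probability mass, the scores keep summing to 1, and the athlete who is never beaten in this toy example ends up ranked first.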

Thoughts and constructive feedback are welcome! If you have any, get in touch.

Network visualizations for top 8, 24, and 100 athletes:

Top 8

Top 24

Top 100

Accuracy / Predictability

The program has a feature that calculates the accuracy, or predictability, of a generated ranking. As the algorithm loops through the races within the specified time frame, a new ranking is produced each time a race is added. Accuracy is calculated by comparing every race's results to the ranking as of the day before that race. The denominator is the number of possible pairings of one winner and one loser across all races in the time frame. The numerator is the number of those pairings in which the higher-ranked athlete beat the lower-ranked athlete. This value is usually between 0.78 and 0.82. Accuracy is optimized by iteratively testing the inputs that affect the depreciation, competition weight, and distance weight of edges in the graph.

Every ranking produced may require slightly different inputs in order to maximize predictability. For example, before the COVID-19 pandemic, when race results were plentiful, limiting the races used in a ranking to those from the previous 16-19 months produced the ranking with the highest predictability. However, because there were few competitions in 2020 and 2021, the most accurate ranking for a date in early 2022 uses races from the previous 3.6 years.
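The accuracy ratio described in this section can be sketched as follows. The function name `pairwise_accuracy` and the sample data are hypothetical, and treating an athlete with no prior ranking as having a score of 0 is an assumption; the page does not say how unranked athletes are handled.

```python
from itertools import combinations


def pairwise_accuracy(races):
    """Return the share of winner/loser pairs the prior ranking got right.

    races is a list of (finish_order, scores_before) tuples, where
    finish_order lists athletes fastest first and scores_before maps each
    athlete to their ranking value as of the day before the race.
    """
    correct = 0
    total = 0
    for finish_order, scores_before in races:
        for winner, loser in combinations(finish_order, 2):
            total += 1  # denominator: every winner/loser pairing
            # Assumption: athletes without a prior ranking score 0.
            if scores_before.get(winner, 0.0) > scores_before.get(loser, 0.0):
                correct += 1  # numerator: higher-ranked athlete won
    return correct / total if total else float("nan")


# Hypothetical prior ranking and two race results.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
races = [
    (["A", "B", "C"], prior),  # all three pairs match the ranking
    (["B", "A", "C"], prior),  # B beating A is an upset
]
accuracy = pairwise_accuracy(races)
print(round(accuracy, 3))
```

In this toy example, 5 of the 6 pairings agree with the prior ranking. Tuning would then amount to recomputing this ratio for different depreciation, competition, and distance inputs and keeping the combination with the highest value.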

Data

For a list of all races included in this ranking, click here. Results are compiled from fina.org, len.eu, longswimsdb, and other organizations' websites. Rankings only take race finishers into account - DNFs do not affect the ranking.

Ties, though rare, do happen occasionally and are taken into account.

Retired athletes will continue to appear in the rankings until their last race falls out of the relevant time window. Efforts are made to identify ranked but retired swimmers with an asterisk (*).

Known Challenges

Names: Occasionally results are only available in PDF format, and an Adobe product is used to convert the data to Excel format. Sometimes, especially when the PDF is of poor quality or is a scan of a paper document, names are misspelled after conversion. An athlete's name may also be presented differently in different race results (e.g., Alex Meyer, Alexander Meyer, Alex M Meyer, or maiden / married names). A strong effort is made to standardize each name to one spelling, but some may occasionally be missed, resulting in an athlete appearing more than once in the rankings under different spellings. In addition, it is possible that at some point in the future there could be two athletes with the same name from the same country, in which case a middle name or initial would need to be used.
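One possible aid for spotting spelling variants is a string-similarity pass that flags pairs of names for manual review. This is only a sketch of the idea, not a description of how names are currently standardized: the function name `likely_duplicates` and the 0.75 threshold are hypothetical, and the approach would not catch maiden / married name changes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(names, threshold=0.75):
    """Return pairs of names similar enough to deserve a manual check."""
    flagged = []
    for first, second in combinations(sorted(set(names)), 2):
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            flagged.append((first, second))
    return flagged


# "Jane Doe" is a made-up name added for contrast.
names = ["Alex Meyer", "Alexander Meyer", "Alex M Meyer", "Jane Doe"]
candidates = likely_duplicates(names)
print(candidates)
```

Flagged pairs are only candidates: two different athletes can have similar names, so a person would still need to confirm each match before merging results.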

Complexity: Most ranking systems that people are familiar with are based on a "points accumulation rating", where an athlete's performances are assigned an arbitrary pre-determined value based on certain factors such as how they placed, what round of a tournament they reached, or their time behind the winner. In some cases, point values start at zero for all athletes at the beginning of a season and are tallied at the end to declare a winner. These are usually repeated annually and are referred to as a "series" or "tour", such as the FINA/CNSG Marathon Swim World Series. In other sports, such as golf and tennis, a points accumulation system is used but points expire after a certain period of time, so athletes are continuously maintaining a balance of points, which forms the basis of a ranking. The advantage of such systems lies in their simplicity and objectivity, making it easy for someone to understand exactly why they are ranked above or below another athlete. However, the number of points ascribed to certain outcomes is arbitrary, and rankings generated by such systems are less accurate predictors of future outcomes than those from more advanced systems. The downside of the more advanced systems, however, is that they are difficult to explain to the average person, and inevitably there will be athletes who disagree with the ranking and can't understand why they are not ranked higher. The purpose of this Methods section is to provide transparency into how the ranking is created in order to avoid these frustrations as much as possible.

Conditions: In order to perform detailed analyses, or even generate rankings based on a particular race environment, qualitative data needs to be collected for every race. The current categories of interest are rough vs. normal surface conditions and wetsuit vs. non-wetsuit. Read more about this on the Roadmap page.