CLOP Pools Package 1.4

This manual is for the CLOP Pools Package (version 1.4, 29 September 2016), which calculates optimal picks for sports betting pools.

Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved.

1 Introduction

The CLOP Pools Package is a suite of command line utilities for working with sports betting pools. The package generally implements algorithms from the article Optimal Strategies for Sports Betting Pools (Clair, Letscher 2005), but also includes the main algorithm from March Madness and the Office Pool (Kaplan, Garstka 2001). The package and supporting information are maintained at http://math.slu.edu/~clair/pools.

The model used for a betting pool requires three inputs:

Size of the pool (the number of participants).
Actual probabilities, used to model the actual outcomes of the games.
Perceived probabilities, used to model the behavior of the other participants. These are known as “pool probabilities” in the Optimal Strategies paper. Pool is a better name than perceived, but the code predates the name change.

The size of the pool is given (when needed) via the command line argument -n. The actual and perceived probabilities are stored in ASCII text files and are passed as command line arguments.

Pick sets are coded as single lines of ASCII text. The pools utilities read them on standard input and write them on standard output.

Generally, the package is intended to play well with Unix utilities such as cut, paste, and sort, and the programs are designed to fit nicely into pipelines. The programs fall, fseek, fsmooth, fcanon, tgreedy, and tcanon all generate pick sets on stdout. The programs fstats and tstats calculate interesting statistics for pick sets provided on stdin. For example, the command:

fcanon -qd nfl_data | fstats -n50 nfl_data

calculates the expected return for a bet on all the underdogs in a 50 player football pool described in file nfl_data.

This package uses the GNU Scientific Library (http://www.gnu.org/software/gsl/) for numeric computations. The programs fall, fseek, and fsmooth are all threaded for speed on multiprocessor machines.

2 Football pools

Football pools consist of g games which are assumed to be independent. The number of games is limited by the number of bits in an unsigned int. Running a 16 game pool on a machine with 16 bit ints has not been tested, and could potentially cause problems.

2.1 Football pool data file format

A football pool data file contains all actual and perceived probabilities needed to model the pool. A line beginning with ’#’ is a comment and is ignored, as are blank lines. The file begins with a header record on a single line which is followed by one game record line for each game in the pool.

The header record gives text names for each column of fields.

Each game record has from 3 to 16 whitespace separated fields:

Field 1: Team 1 Name
Field 2: Team 2 Name
Remaining fields: Probability for team 1 winning or being picked.

Here is a sample data file:

# Data from week 3 of the 2005 NFL season

HOME    AWAY    SAGARIN ESPN    YAHOO
BUF     ATL     0.60    .459    .448
CHI     CIN     0.48    .214    .228
DEN     KC.     0.42    .278    .274
GB.     TB.     0.41    .341    .334
IND     CLE     0.87    .975    .970
MIA     CAR     0.63    .206    .152
MIN     NO.     0.51    .509    .533
NYJ     JAX     0.64    .527    .480
PHI     OAK     0.82    .940    .935
PIT     NE.     0.61    .709    .619
SD.     NYG     0.47    .635    .720
SEA     ARZ     0.80    .877    .860
SF.     DAL     0.67    .140    .125
STL     TEN     0.54    .756    .763

In this file, the columns are the probability of the Home team winning as predicted by Sagarin ratings, the probability that a given participant in ESPN’s football pool chose the Home team, and the probability that a given participant in Yahoo’s football pool chose the Home team.

By default, the 3rd and 4th columns are used as the actual and perceived probabilities of Team 1 winning. To choose different columns, use the -A and -P options to the football programs. For example,

fseek -n150 nfl05_week3_data -PYAHOO

would search for the best picks in a 150 participant pool, using SAGARIN data as actual probabilities and YAHOO data as perceived probabilities.
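
The file format and column selection described above can be sketched in Python. This is only an illustration of the format, not the parser used by the package; the function name parse_pool and its keyword arguments are hypothetical stand-ins for the -A and -P options.

```python
def parse_pool(text, actual=None, perceived=None):
    """Parse football pool data into a list of (team1, team2, a, p) tuples."""
    # Drop blank lines and comment lines beginning with '#'.
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
    header, games = rows[0], rows[1:]
    # Defaults follow the manual: 3rd column is actual, 4th is perceived.
    a_col = header.index(actual) if actual else 2
    p_col = header.index(perceived) if perceived else 3
    return [(g[0], g[1], float(g[a_col]), float(g[p_col])) for g in games]


SAMPLE = """\
# Data from week 3 of the 2005 NFL season

HOME    AWAY    SAGARIN ESPN    YAHOO
BUF     ATL     0.60    .459    .448
CHI     CIN     0.48    .214    .228
"""

# Comparable to passing -PYAHOO: SAGARIN stays the actual column.
games = parse_pool(SAMPLE, perceived="YAHOO")
print(games[0])
```

An unknown column name raises ValueError from header.index in this sketch; how the CLOP programs report that case is not covered here.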

2.2 Football picks format

Picks are read and written as a whitespace separated list of winners. Names must match exactly and be in the same order as the team names in the associated pool data file. For example:

BUF CHI DEN TB. IND MIA MIN NYJ PHI NE. NYG SEA SF. TEN
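
Since the number of games is limited by the bits in an unsigned int, it is natural to think of a pick set as a bit mask. The following Python sketch checks a picks line against the game list and converts it to such a mask. The bit convention (bit i set when Team 1 of game i is picked) and the function names are assumptions made for illustration; the internal encoding used by the package is not documented here.

```python
def picks_to_mask(picks_line, games):
    """games: list of (team1, team2). Returns an int with bit i set when
    team 1 of game i is picked (assumed convention)."""
    picks = picks_line.split()
    if len(picks) != len(games):
        raise ValueError("expected exactly one pick per game")
    mask = 0
    for i, (pick, (t1, t2)) in enumerate(zip(picks, games)):
        if pick == t1:
            mask |= 1 << i
        elif pick != t2:
            # Names must match exactly and appear in data file order.
            raise ValueError(f"{pick!r} is not playing in game {i + 1}")
    return mask


def mask_to_picks(mask, games):
    """Inverse of picks_to_mask: render a mask as a picks line."""
    return " ".join(t1 if (mask >> i) & 1 else t2
                    for i, (t1, t2) in enumerate(games))


games = [("BUF", "ATL"), ("CHI", "CIN"), ("DEN", "KC."), ("GB.", "TB.")]
mask = picks_to_mask("BUF CHI DEN TB.", games)
print(mask, mask_to_picks(mask, games))
```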

2.3 Football programs

Programs to generate picks:
• fall:		Calculate expected return for all possible picks.
• fsmooth:		Calculate an approximation to expected return for all possible picks.
• fseek:		Hill-climbing search for optimal picks.
• fcanon:		Calculate canonical picks.
Programs for calculating statistics:
• fstats:		Compute statistics for picks.

2.3.1 fall

Calculate expected return for all possible picks. Writes 2^g lines of output. Each line shows the expected return for a set of picks, a tab character, and then the picks in the above format.

It is useful to pipe the output of fall to sort -nr to get a list sorted in descending order of quality.

Usage: fall [-tqnAP] datafile

-q: Quiet. Suppress display of header.
-t<threads>: Threads. Specify number of computation threads (default 2).
-n<competitors>: Number of competitors.
-A<actuals>
-P<perceiveds>: Actual or perceived probabilities. Specify column header from datafile.

2.3.2 fsmooth

Calculate expected return for all possible picks. Writes 2^g lines of output. Each line shows the expected return for a set of picks, a tab character, and then the picks in the above format.

fsmooth operates exactly the same as fall, except that the normal approximation is used to calculate the expected return for each set of picks. fsmooth is considerably faster than fall.

It is useful to pipe the output of fsmooth to sort -nr to get a list sorted in descending order of quality.

Usage: fsmooth [-tqnAP] datafile

-q: Quiet. Suppress display of header.
-t<threads>: Threads. Specify number of computation threads (default 2).
-n<competitors>: Number of competitors.
-A<actuals>
-P<perceiveds>: Actual or perceived probabilities. Specify column header from datafile.

2.3.3 fseek

Hill climbing search for a pickset which is a local maximum for expected return. The search begins with a good guess based on typical results. The search ends at a pick set which has larger expected return than any other set which differs by at most two games.

fseek is very effective at finding the best pickset quickly. However, it might in theory become stuck at a local maximum which is not the global maximum.
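
The stopping rule above (no improvement from changing one or two games) can be sketched as follows. This is a minimal illustration of that kind of search, not the fseek implementation: the score function is a stand-in supplied by the caller rather than CLOP's expected return, and the starting point and traversal order are assumptions.

```python
from itertools import combinations


def hill_climb(start, g, score):
    """Climb from bitmask `start` over g games until no change of one or
    two games gives a strictly larger score."""
    # Every mask that flips exactly one game or exactly two games.
    flips = [1 << i for i in range(g)]
    flips += [(1 << i) | (1 << j) for i, j in combinations(range(g), 2)]
    current, best = start, score(start)
    improved = True
    while improved:
        improved = False
        for f in flips:
            candidate = current ^ f
            s = score(candidate)
            if s > best:
                current, best, improved = candidate, s, True
    return current, best


# Toy objective: negative Hamming distance to a hidden target pick set.
TARGET = 0b10110
result = hill_climb(0, 5, lambda m: -bin(m ^ TARGET).count("1"))
print(result)
```

With an objective that has several local maxima, the result depends on the starting mask, which is the limitation noted above.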

Usage: fseek [-tqnAP] datafile

-q: Quiet. Suppress display of header.
-t<threads>: Threads. Specify number of computation threads (default 2).
-n<competitors>: Number of competitors.
-A<actuals>
-P<perceiveds>: Actual or perceived probabilities. Specify column header from datafile.

2.3.4 fcanon

Calculate canonical picks for a football pool. Canonical picks include the favorites, the underdogs, and the edge picks (displayed in that order if more than one is requested). Favorites and underdogs are computed from the actual probabilities, so to see perceived favorites or underdogs, use the -A option to select the perceived column. The edge picks are optimal for a sufficiently large number of competitors, and maximize the ratio A/P.

Usage: fcanon [-qfdeAP] datafile

-q: Quiet. Display only the picks.
-A<actuals>
-P<perceiveds>: Actual or perceived probabilities. Specify column header from datafile. Note that -P is only useful in conjunction with -e.
-f: Calculate actual favorites.
-d: Calculate actual underdogs.
-e: Calculate edge picks.
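
As an illustration of the three canonical pick sets, here is a small Python sketch. It assumes each game is given as a tuple (team1, team2, a, p), where a and p are the actual and perceived probabilities of Team 1 winning; that layout, the function name, and the handling of ties are assumptions, not taken from fcanon itself. Because games are independent, the ratio A/P factors game by game, so the edge pick for each game is the team with the larger ratio of actual to perceived probability.

```python
def canonical_picks(games):
    """Return (favorites, underdogs, edge) as lists of team names."""
    fav = [t1 if a >= 0.5 else t2 for t1, t2, a, p in games]
    dog = [t2 if a >= 0.5 else t1 for t1, t2, a, p in games]
    # Team 1 contributes a/p to A/P, team 2 contributes (1-a)/(1-p).
    # Cross-multiplying avoids division by zero when p is 0 or 1.
    edge = [t1 if a * (1 - p) >= p * (1 - a) else t2
            for t1, t2, a, p in games]
    return fav, dog, edge


# First four games of the sample file, SAGARIN and ESPN columns.
games = [
    ("BUF", "ATL", 0.60, 0.459),
    ("CHI", "CIN", 0.48, 0.214),
    ("DEN", "KC.", 0.42, 0.278),
    ("GB.", "TB.", 0.41, 0.341),
]
fav, dog, edge = canonical_picks(games)
print(" ".join(fav))
print(" ".join(dog))
print(" ".join(edge))
```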

2.3.5 fstats

Calculate statistics for picksets read on standard input. The default behavior is to print the expected return followed by the picks.

Usage: fstats [-qnAPsdgvw] datafile

-q: Quiet. Suppress display of header.
-n<competitors>: Number of competitors.
-A<actuals>
-P<perceiveds>: Actual or perceived probabilities. Specify column header from datafile.
-s: Smooth. Use the normal approximation to calculate expected return.
-d: Detailed. Show detailed statistics. Shows the expected return (exp), and expected return with smooth approximation (sexp). For both the actual and perceived data, it shows the probability that these picks occur exactly (prob), the mean and variance of the number of games these picks will agree with (mean, var), and the number of underdogs picked (upsets).
-g: Game-by-game. Displays five columns of data for each game. The first (Pick) is 1 or 0 depending on whether the actual favorite or actual underdog was chosen by the pickset. The next columns give the actual and perceived probabilities for the chosen team to win. The final two (which take some time to compute) give numeric calculations of the partial derivative of expected return with respect to a change in the input variables a_i or p_i for that game. These can be thought of as a measure of the sensitivity of expected return to the data for that particular game. Keep in mind that probabilities range from 0 to 1, and that a change of .01 makes a much bigger difference to a probability of .98 than it does to a probability of .5.
-v<actual spread>: Vary actuals. This option is intended to test robustness of the expected return value. This option calculates the expected return for the given set of picks 200 times while varying the actual probabilities used to model the pool. For each calculation, each value a_i is chosen uniformly randomly from an interval centered at the original a_i with width 2*<actual spread>. Statistics are calculated for the 200 values of expected return, and displayed.
-w<perceived spread>: Vary perceiveds. Same as -v, but varies p_i. Using both -v and -w will vary both at the same time.
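
The per-pickset quantities listed under -d (prob, mean, var, upsets) follow directly from the independence of games. The Python sketch below shows one way to compute them; it is not the fstats code, and the function name and input convention (one probability per game for the team that was picked) are assumptions. Apply it once with the actual probabilities and once with the perceived ones.

```python
def pick_stats(chosen_probs):
    """chosen_probs: for each game, the probability that the picked team
    wins (or is picked). Returns (prob, mean, var, upsets)."""
    prob = 1.0
    for q in chosen_probs:
        prob *= q                                   # every pick is right
    mean = sum(chosen_probs)                        # expected matches
    var = sum(q * (1 - q) for q in chosen_probs)    # independent games
    upsets = sum(1 for q in chosen_probs if q < 0.5)
    return prob, mean, var, upsets


stats = pick_stats([0.75, 0.25])
print(stats)
```

Expected return itself depends on the pool size and on the model in the Optimal Strategies paper, so it is left out of this sketch.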

3 Tournament pools

A tournament pool involves picking all games of an R round single elimination tournament with 2^R teams. Currently, the maximum number of allowable rounds is 14 (which is ridiculously large).

With tournament pools, the scoring method is variable. In this version of CLOP, only two scoring methods are implemented: power-of-two scoring and ESPN scoring. In power-of-two scoring, correct picks are worth 1,2,4,8,… in increasing rounds. In ESPN scoring, correct picks are worth 10,20,40,80,120, and 160 points in increasing rounds. Any tournament program that uses scoring will use power-of-two scoring by default and accept the -E option to switch to ESPN scoring.
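
The two scoring rules can be written as a small lookup. This Python sketch only restates the point values given above; the function name and the 1-based round numbering are assumptions.

```python
ESPN_POINTS = [10, 20, 40, 80, 120, 160]


def round_points(rnd, espn=False):
    """Points for one correct pick in round rnd (1 = first round)."""
    if espn:
        # Only six rounds of ESPN values are listed in this manual.
        return ESPN_POINTS[rnd - 1]
    return 2 ** (rnd - 1)


print([round_points(r) for r in range(1, 7)])
print([round_points(r, espn=True) for r in range(1, 7)])
```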

3.1 Tournament pool data file formats

A tournament pool is described by three collections of data: team names, actual probabilities, and perceived probabilities. A collection of probabilities can be given in one of two ways, as head-to-head data or as winround data.

Within data files, team order is important, because it determines which teams play in which round (using the usual single elimination bracket) and it must remain consistent for all files used in a given pool.

In the sections below, T is the number of teams in the tournament and R is the number of rounds.

3.1.1 Names file format

A team names file begins with a header line containing the keyword names followed by the number of teams (T) in the tournament, followed by an optional comment to the end of the line. Each subsequent line contains a team name, which may optionally be in double quotes. Quotes are useful to include whitespace in the team name, which makes ASCII picks output much nicer.

Here is an example tournament with four teams. The first round matchups are Aardvarks-Bison and Chihuahuas-Ducks.

names 4 Bryan's Imaginary Playoffs
"Aardvarks "
"Bison     "
"Chihuahuas"
"Ducks     "

3.1.2 Head-to-head probability file format

A head-to-head data file begins with a header line containing the keyword h2h followed by the number of teams T in the tournament, followed by an optional comment to the end of the line.

Data follows as T*T floating point numbers, in order:

        P(0 beats 0) P(0 beats 1) .. P(0 beats T-1)
              ...
        P(T-1 beats 0) ..            P(T-1 beats T-1)

The data is redundant since P(i beats j) = 1 - P(j beats i). Values for P(x beats x) are required but ignored.

Here is an example that goes with Bryan’s Imaginary Playoffs:

h2h 4   Close. Team 2 (Bison) have an edge.
.5 .4 .4 .7
.6 .5 .7 .6
.6 .3 .5 .6
.3 .4 .4 .5
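
Because the data is redundant, a hand-written h2h file is easy to get wrong. The following Python sketch reads the format and checks that P(i beats j) + P(j beats i) = 1 for every pair. It is an illustrative check only; the function names are hypothetical and CLOP may or may not perform a similar validation.

```python
import math


def parse_h2h(text):
    """Return the T x T matrix from h2h-format text."""
    lines = text.strip().splitlines()
    keyword, count = lines[0].split()[:2]   # rest of the header is comment
    if keyword != "h2h":
        raise ValueError("not an h2h file")
    t = int(count)
    values = [float(x) for ln in lines[1:] for x in ln.split()]
    if len(values) != t * t:
        raise ValueError(f"expected {t * t} values, found {len(values)}")
    return [values[i * t:(i + 1) * t] for i in range(t)]


def is_consistent(m):
    """True when every off-diagonal pair of entries sums to 1."""
    t = len(m)
    return all(math.isclose(m[i][j] + m[j][i], 1.0)
               for i in range(t) for j in range(i + 1, t))


EXAMPLE = """\
h2h 4   Close. Team 2 (Bison) have an edge.
.5 .4 .4 .7
.6 .5 .7 .6
.6 .3 .5 .6
.3 .4 .4 .5
"""
matrix = parse_h2h(EXAMPLE)
print(is_consistent(matrix))
```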

3.1.3 Winround probability file format

A winround data file begins with a header line containing the keyword winround followed by the number of teams T in the tournament, followed by an optional comment to the end of the line.

Data in the file comes in two series, the solo and pair series. The solo series begins with the keyword solo followed by the probabilities of team i winning round r for all i,r. The pair series begins with the keyword pair followed by the probabilities of team i winning round r and team j winning round s for all i,j,r,s.

The pair series is optional. If it is omitted, the data no longer contains enough information for the theoretical model of the pool. In that case, CLOP will estimate the pair data and print a warning message to stderr. See the Optimal Strategies paper for details.

The solo series is size (T * (R+1)), in order:

   P(0->0) P(0->1) ... P(0->R) P(1->0) ... P((T-1)->R)

The pair array is size (T * T * (R+1) * (R+1)), in order:

   P(0->0 & 0->0) P(0->0 & 0->1) .. P(0->0 & 0->R)
   P(0->1 & 0->0) ..                P(0->1 & 0->R)
    ...
   P(0->R & 0->0) ..                P(0->R & 0->R)
   P(0->0 & 1->0) P(0->0 & 1->1) .. P(0->0 & 1->R)
    ...
   P(0->R & 1->0) ..                P(0->R & 1->R)
    ...
   P(0->R & (T-1)->0) ..        P(0->R & (T-1)->R)
   P(1->0 & 0->0) ...
                            P((T-1)->R & (T-1)->R)
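
Reading the two orderings above as flat arrays, the position of each probability can be computed directly. The index formulas in this Python sketch are inferred from the listings (for the pair series, s varies fastest, then r, then j, with i slowest); they are not taken from the CLOP source, and the function names are hypothetical.

```python
def solo_index(i, r, R):
    """Position of P(i->r) in the solo series."""
    return i * (R + 1) + r


def pair_index(i, r, j, s, T, R):
    """Position of P(i->r & j->s) in the pair series."""
    n = R + 1
    return ((i * T + j) * n + r) * n + s


T, R = 4, 2
# First 18 pair values of the winround example for Bryan's Imaginary Playoffs.
pair = [1.000, 0.300, 0.180, 0.300, 0.300, 0.180, 0.180, 0.180, 0.180,
        1.000, 0.700, 0.525, 0.300, 0.000, 0.000, 0.180, 0.000, 0.000]

# Teams 0 and 1 meet in round 1, so they cannot both win it.
print(pair[pair_index(0, 1, 1, 1, T, R)])
```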

Here is an example that goes with Bryan’s Imaginary Playoffs:

winround 4   Team 2 (Bison) very strong.
solo
  1.000 0.300 0.180 
  1.000 0.700 0.525 
  1.000 0.500 0.125 
  1.000 0.500 0.170 
pair
  1.000 0.300 0.180 
  0.300 0.300 0.180 
  0.180 0.180 0.180 

  1.000 0.700 0.525 
  0.300 0.000 0.000 
  0.180 0.000 0.000 

  (14x9 more floats)...

3.2 Tournament picks format

A set of picks for a tournament is stored in “depth format” as a list of integers in the range [1…R+1], one for each team. The number for each team indicates which round that team will reach.

In Bryan’s Imaginary Playoffs, here is a bracket in which the Bison beat the Chihuahuas in the finals:

1 3 2 1
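
Not every list of integers is a valid bracket. A consistent depth list has, for each round r, exactly one team per block of 2^r adjacent teams that advances past round r. The Python sketch below checks this; it illustrates the format only, and the function name and error messages are hypothetical.

```python
def check_depth_picks(depths):
    """Return the number of rounds R if depths is a consistent bracket,
    otherwise raise ValueError."""
    T = len(depths)
    R = T.bit_length() - 1
    if T < 2 or (1 << R) != T:
        raise ValueError("number of teams must be a power of two")
    if any(d < 1 or d > R + 1 for d in depths):
        raise ValueError("depths must lie in the range 1..R+1")
    for r in range(1, R + 1):
        size = 1 << r
        for start in range(0, T, size):
            # Teams in this block that survive round r.
            survivors = [d for d in depths[start:start + size] if d > r]
            if len(survivors) != 1:
                raise ValueError(f"round {r}: block at team {start} "
                                 f"has {len(survivors)} winners")
    return R


print(check_depth_picks([1, 3, 2, 1]))
```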

The tstats program can display brackets in a human readable ASCII format. The pix2tex utility can create a LaTeX file that displays the bracket graphically.

3.3 Tournament programs

Programs to generate picks:
• tseek:		Hill-climbing search for optimal picks.
• tcanon:		Calculate canonical picks.
• trandom:		Generate random picks.
Programs for calculating statistics:
• tstats:		Compute statistics for picks.
• tscore:		Calculate score of picks given an outcome.
Utility programs:
• tsim:		Simulate tournaments.
• dumph2h:		Dump probability data in h2h format.
• dumpwinround:		Dump probability data in winround format.
• pcalc:		Calculate probability data from a collection of opponent picks.
• pix2tex:		Generate a LaTeX picture of a filled in bracket.

3.3.1 tseek

Performs a hill-climbing search for picks that maximize expected return. Each trial chooses a random starting pick (uniformly distributed over the set of all possible brackets) and hill climbs to a local maximum. The process is repeated for the specified number of trials. Picks that improve on previous results are displayed when found.

Usage: tseek [-nEqvts] namesfile actualfile perceivedfile

-n<competitors>: Number of competitors.
-E: Use ESPN scoring.
-q: Quiet. Display only one set of picks (the best found) when all trials are finished.
-v: Verbose. Display all intermediate picks for each trial. Using tseek -v -t1 … is a good way to get a feel for the hill climbing process.
-t<trials>: Trials. Specify number of trials (default is to run trials forever).
-s<seed>: Seed. Specify (long integer) seed for random number generator (default seeds with the current time).

3.3.2 tcanon

Display canonical statistics and picks for a tournament pool. The statistics (shown unless -q is used) describe opponent scoring. The six sets of picks are:

Picks giving the maximum expected score.
The actual favorites.
The perceived favorites.
The result most likely to actually occur.
The picks most likely to be made by an opponent.
The picks that optimize expected return in the limit as the number of competitors approaches infinity.

Usage: tcanon [-Eq] namesfile actualfile perceivedfile

-E: Use ESPN scoring.
-q: Quiet. Suppress headers.

3.3.3 trandom

Generate random picks. Each game is 50-50 unless the optional datafile is given to specify the probabilities.

Usage: trandom [-nR] [datafile]

-n<count>: Generate <count> sets of picks. Default is 1.
-R<rounds>: Specify number of rounds in the tournament. If datafile is given, uses the rounds for that datafile. If unspecified, defaults to 6.
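
The 50-50 case can be sketched by playing out the bracket with fair coin flips and recording how far each team gets. This Python sketch produces depth-format picks; it is not the trandom source, and the function name and use of Python's random module are assumptions.

```python
import random


def random_depth_picks(rounds, rng):
    """Play a 2**rounds team bracket with 50-50 games and return the
    result as depth-format picks."""
    T = 1 << rounds
    depths = [1] * T
    alive = list(range(T))          # teams still in the tournament
    for _ in range(rounds):
        survivors = []
        for k in range(0, len(alive), 2):
            winner = alive[k + rng.randrange(2)]   # fair coin flip
            depths[winner] += 1
            survivors.append(winner)
        alive = survivors
    return depths


picks = random_depth_picks(6, random.Random(2016))
print(" ".join(str(d) for d in picks))
```

Passing a seeded random.Random makes the output reproducible, which is convenient when comparing runs.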

3.3.4 tstats

Calculate statistics for picksets read on standard input. After reading input, tstats produces a header with the comments from the input files and statistics describing opponent scores. Then, for each set of picks on stdin, tstats displays the picks in a human readable ASCII form and displays statistics for the picks. The statistics are:

expected return: The expected return on a bet of 1 on these picks.
actual probability: The probability these picks actually occur.
actual mean score, actual score standard deviation: The mean score and SD for these picks.
perceived probability: The probability one opponent will make these picks exactly.
perceived mean score, perceived score standard deviation: The mean score and SD for these picks if the tournament games were played using the perceived probabilities.
correlation with opponents: The correlation (\in [-1,1]) between the score of these picks and the score of one opponent.

Usage: tstats [-nEqsetP] namesfile actualfile perceivedfile

-n<competitors>: Number of competitors.
-E: Use ESPN scoring.
-q: Quiet. Suppress headers.
-s: Don’t show stats.
-e: Don’t show expected return.
-t: Don’t show teams.
-P: Show perceived probability only. (This was useful, once.)

3.3.5 tscore

Quick and dirty program to calculate the scores of picks on stdin, given a set of picks as the input file outcome.

Usage: tscore [-E] [-r rounds] outcome

-E: Use ESPN scoring.
-r rounds: Specify number of rounds. Default is 6.
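
With picks and outcome both in depth format, scoring reduces to counting, for each team, the rounds that it wins in both brackets. The following Python sketch shows the idea under that reading; it is not the tscore source, the function name is hypothetical, and the outcome used in the example is made up for illustration.

```python
def score_picks(picks, outcome, espn=False):
    """Score depth-format picks against a depth-format outcome."""
    espn_points = [10, 20, 40, 80, 120, 160]
    total = 0
    for p, o in zip(picks, outcome):
        # A team at depth d won rounds 1..d-1. A pick is correct in a
        # round when the team wins that round in both brackets.
        for rnd in range(1, min(p, o)):
            total += espn_points[rnd - 1] if espn else 2 ** (rnd - 1)
    return total


picks = [1, 3, 2, 1]      # Bison beat the Chihuahuas in the finals
outcome = [2, 1, 3, 1]    # hypothetical result: Chihuahuas beat the Aardvarks
print(score_picks(picks, outcome), score_picks(picks, outcome, espn=True))
```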

3.3.6 tsim

Simulate tournaments. Computes results for each set of picks Y read on standard input. Each trial chooses n competitor picks, either randomly using perceived probabilities or by selecting from the opponentpicks file if provided. Each trial chooses winners using actual probabilities and calculates the score and winnings for picks Y. After all trials are finished, the summary results for picks Y are displayed.

Usage: tsim [-nEqst] namesfile actualfile perceivedfile [opponentpicks]

-n<competitors>: Number of competitors.
-E: Use ESPN scoring.
-q: Quiet. Suppress headers.
-s<seed>: Seed. Seed random number generator with seed. Default is to use current time.
-t<trials>: Number of tournaments to simulate. Default is 10000.

3.3.7 dumph2h

Utility program to read in a probability file and dump a correctly formatted probability file in h2h format. Use for converting winround to h2h.

Usage: dumph2h probfile

3.3.8 dumpwinround

Utility program to read in a probability file and dump a correctly formatted probability file in winround format. Useful for converting h2h to winround (the solo information is interesting when the h2h file was generated from computer rankings).

Usage: dumpwinround [-p] probfile

-p: Only dump solo data.
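
To show how solo data relates to head-to-head data, here is a Python sketch of the standard bracket recursion: a team wins round r if it won round r-1 and beats whichever opponent emerges from the other half of its block. It assumes game results are independent given the matchup. The method dumpwinround actually uses is not described in this manual, so treat this as an illustration with hypothetical names.

```python
def solo_from_h2h(h2h):
    """Return solo[i][r] = P(team i wins round r) for r = 0..R."""
    T = len(h2h)
    R = T.bit_length() - 1
    solo = [[1.0] + [0.0] * R for _ in range(T)]
    for r in range(1, R + 1):
        size, half = 1 << r, 1 << (r - 1)
        for i in range(T):
            block = (i // size) * size
            # Possible round-r opponents: the other half of i's block.
            lo = block + half if i < block + half else block
            meet = sum(solo[j][r - 1] * h2h[i][j]
                       for j in range(lo, lo + half))
            solo[i][r] = solo[i][r - 1] * meet
    return solo


# Head-to-head example for Bryan's Imaginary Playoffs.
H2H = [[.5, .4, .4, .7],
       [.6, .5, .7, .6],
       [.6, .3, .5, .6],
       [.3, .4, .4, .5]]
for row in solo_from_h2h(H2H):
    print(" ".join(f"{x:.3f}" for x in row))
```

The last column sums to 1, since exactly one team wins the final round.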

3.3.9 pcalc

Calculate a table of winround data from a list of picks. Given a series of picks on either stdin or in picksfile, computes solo and pair data by counting occurrences of teams reaching rounds. Dumps results to stdout as a winround format file. This is how you get perceived probabilities if you have a large collection of opponent picksets.

Usage: pcalc [-r<rounds>] [picksfile]

-r<rounds>: Specify number of rounds in tournament. Default is 6.
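
The counting step for the solo series can be sketched as follows. Each pick set is a depth-format list, and a team at depth d is counted as winning rounds 0 through d-1. This Python sketch covers only the solo data and is not the pcalc source; the function name is hypothetical and the two pick sets in the example are made up for illustration.

```python
def solo_from_picks(pick_sets, rounds):
    """Estimate solo[i][r], the fraction of pick sets in which team i
    wins round r, from depth-format pick sets."""
    T = 1 << rounds
    counts = [[0] * (rounds + 1) for _ in range(T)]
    for depths in pick_sets:
        if len(depths) != T:
            raise ValueError("pick set has the wrong number of teams")
        for i, d in enumerate(depths):
            for r in range(d):       # rounds 0..d-1 are won
                counts[i][r] += 1
    n = len(pick_sets)
    return [[c / n for c in row] for row in counts]


opponents = [[1, 3, 2, 1], [2, 1, 1, 3]]   # two illustrative pick sets
for row in solo_from_picks(opponents, 2):
    print(" ".join(f"{x:.3f}" for x in row))
```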

3.3.10 pix2tex

From a set of picks and a tournament names file, pix2tex generates LaTeX output to draw a filled in bracket.

Width and height are specified as floating point numbers and are used to position the elements of the bracket. LaTeX will interpret these as points, by default, although you could change \unitlength in your document to adjust this.

Usage: pix2tex [-h<height>] [-w<width>] namesfile
