tl;dr — This is almost certainly interesting only to baseball stat geeks. It’s clear from Jeff Sackmann’s documentation that his Minor League Splits play-by-play files have errors. The purpose of this document is to begin to get a sense of their scope.
This is a preliminary report. I’ll likely do more of these.
In May of 2006 Jeff Sackmann announced that he’d begun publishing splits (left/right, day/night, home/away, etc.) information for minor league players. His Minor League Splits (MLS) website became one of the most valuable minor league stats sources. He’s no longer providing the information in an easily accessible form, but he’s made the underlying data available. This is potentially an immensely valuable resource, as it should enable folks to study the disparities in performance between players taking the field at different levels of professional competition.
Jeff initially accumulated the data by screen-scraping the Minor League Baseball website’s daily game accounts, then reducing the published play-by-play pages to Retrosheet (RS) format. (He may have changed methods later; I’ve a comment on that below.) Here’s an example MiLB.com play-by-play from September 15, 2009, and here’s the same game in Jeff’s RS-like format.
A Retrosheet-Like Format
Although useful information can be gleaned directly from game data in the Retrosheet format, it can be made more generally accessible if you run the file through a parser and reformat it for database use. Two such parsers exist: BEVENT was one of several tools written by Tom Tippett of Diamond Mind (and now the Red Sox), and CWEVENT is part of Ted Turocy’s Chadwick project. While the two parsers work similarly and usually produce the same output, Chadwick is more flexible and addresses some shortcomings of the older program.
I have mostly examined the 2009 Midwest League file, though I’ve also played with some other files. My experiments demonstrated that neither parser could read the Sackmann RS files as they come off the MLS website. The first thing I learned was that both parsers need a file named TEAM2009 (or the equivalent for another season) in order to function, which was fairly easy to mock up. That mock-up got the programs past the initial hurdle, but both still failed to process the file completely. So I posted an inquiry on a couple of SABR mailing lists.
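For anyone repeating the experiment, the team file can be mocked up with a few lines of Python. This is a minimal sketch, assuming the layout of Retrosheet's own TEAMyyyy files (team ID, league code, city, nickname, one team per line). The function name, the Peoria entry, and the league code "M" are illustrative placeholders, not values taken from the MLS data.

```python
import csv
import io


def team_file_text(teams):
    """Render (team_id, league, city, nickname) tuples as TEAMyyyy-style lines.

    Assumed layout: one comma-separated line per team, no header row.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for team_id, league, city, nickname in teams:
        writer.writerow([team_id, league, city, nickname])
    return buf.getvalue()


# Hypothetical entry for illustration. Real IDs should be copied from the
# info,visteam and info,hometeam records of the event file being parsed.
teams_2009 = [("PEO", "M", "Peoria", "Chiefs")]

print(team_file_text(teams_2009), end="")
```

Saving that text as TEAM2009 in the same directory as the event file should be enough to get the parsers past their first check, though as noted it does not make the event file itself parse.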
Retrosheet VP Clem Comly responded by noting some issues with Sackmann’s “RS” data–including team ID formatting, some invalid codes, and some impossible plays. Ted Turocy responded similarly. This set me to comparing the data in the Midwest League 2009 file with Retrosheet’s defined format. Meantime, Clem was doing some testing and passing along the results. Here’s a summary of the differences and issues identified by these efforts:
Missing Record Types
The following record types are defined in the Retrosheet format document but are not present in the 2009 Midwest League event file.
- . version
- # info,site
- . info,number
- * info,starttime
- # info,daynight
- . info,usedh
- ! info,umphome
- ! info,ump1b
- ! info,ump2b
- ! info,ump3b
- . info,howscored
- * info,pitches
- ! info,temp
- ! info,winddir
- ! info,windspeed
- * info,fieldcond
- ? info,precip
- ? info,sky
- ! info,timeofgame
- ! info,attendance
- ! info,wp
- ! info,lp
- ! info,save
- ! data,er
- # padj
- # badj
- . ladj
- # com
Key (these are comments from Joel):
- # unavailable in source play by play
- * could/should have been defaulted to unknown (or equivalent)
- . could be inferred or fairly easily derived (well, ladj might not be *easy*)
- ! available in associated box score (but not in play by play)
- ? may have been available in associated box score
Refer to the Retrosheet event file spec for explanations of those record types. I presume, but haven’t verified, that these are missing from all the MLS event files. While much of this missing information would be valuable to have, it’s not obvious that any of it is essential to parsing the event file. On the other hand, CWEVENT’s behavior (it appears to recognize only the first game in whatever MLS event file I try to process) suggests that this parser is failing because it expects to find one or more of these record types. I’m guessing those include info,wp, info,lp, info,save, and (perhaps) data,er.
There’s little here that couldn’t be fairly easily fixed by someone with the appropriate skills. At the very least, the derivable records could be filled in and the unknown data could be indicated as such. Much of the missing data is readily available in the game box scores, which remain accessible on the web. We might want to discuss ways to acquire that information.
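As a rough illustration of how the "could have been defaulted to unknown" records might be filled in, here is a minimal Python sketch. It assumes each game begins with an id record and that the parsers don't care where the info records sit relative to one another. The function name is mine, and the default values in the table are my reading of the Retrosheet conventions; each should be checked against the spec before use.

```python
# Assumed "unknown" values; verify each against the Retrosheet event file spec.
UNKNOWN_INFO = {
    "starttime": "0:00",
    "pitches": "none",
    "fieldcond": "unknown",
    "precip": "unknown",
    "sky": "unknown",
}


def add_unknown_info(lines, defaults=UNKNOWN_INFO):
    """Insert missing info records just before each game's first start record."""
    out = []
    seen = set()      # info keys already present in the current game
    pending = False   # True until defaults have been emitted for this game
    for line in lines:
        fields = line.split(",")
        kind = fields[0]
        if kind == "id":
            seen = set()
            pending = True
        elif kind == "info" and len(fields) > 1:
            seen.add(fields[1])
        elif kind == "start" and pending:
            for key, value in defaults.items():
                if key not in seen:
                    out.append(f"info,{key},{value}")
            pending = False
        out.append(line)
    return out
```

Records that are already present are left untouched, so the same pass could later be extended with values pulled from the box scores.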
Besides the format deficiencies, there are numerous problems, of varying impact, with the MLS data:
- MLS lacks quotes around player names in start & sub records, which is a variance from the spec.
- Clem found play codes /FF & /PF which don’t have meaning in Retrosheet files. It’s likely there are others.
- MLS has hit location data which isn’t coded for RS; Sackmann added extra fields to the play record to accommodate these.
  - The presence of this data suggests that Jeff stopped screen-scraping and began receiving data direct from Major League Baseball Advanced Media (MLBAM) at some point, as no explicit hit location data exists on the MiLB.com play by play pages.
  - This data is, itself, pretty opaque, as Jeff’s documentation notes. I expect it could be converted into the Project Scoresheet zones familiar to baseball defense analysts.
  - This may be moot, since both parsers drop this information.
- Team IDs in the MLS files are 6 characters long, half of which are a league identifier. The RS standard defines Team IDs as having 3 characters.
  - Clem, by testing, discovered that BEVENT doesn’t handle these IDs well; removing the league identifier from the Team IDs results in cleaner output data. Ted says CWEVENT does not have this issue.
  - Question: Does this mess up Game ID decoding in any Diamond Mind or Chadwick tool? The spec expects 12 characters with a set format; if anyone is parsing that format this will mess them up.
- Retrosheet IDs are very different from the MLBAM IDs used in the MLS files. It’s unlikely that this is an issue; I’m just documenting a difference.
- Every MLS “RS” play record has an empty field for [pitch] count. The RS spec defines ?? as the unknown-count code.
- No plays are marked #, which would indicate a possible scoring problem. This is not a surprise, and is likely not important to this effort, as we’ve no reason to believe Jeff’s screen-scraper could identify those; I’m just reporting it as a known anomaly.
- Similarly, there are no comment (com) records. Again, this is expected, given the data source.
- The MLS files contain a master roster, while the RS parsers expect (but do not require) annual team roster files (with names which look like PEO2009.ROS).
  - The parsers use these files to determine pitcher and hitter handedness. That could easily be handled at the database level, though doing so might cause a (probably light) performance hit.
  - Clem points out that separating the team rosters facilitates data validation.
- Although Jeff’s provided event data for the 2005 season, his MLS team file (x_minorLeagueSplitsTeams.csv) does not list teams or affiliates for that season.
  - Moreover, neither the format nor the file name matches the files expected by the parsers.
  - These are small issues, but they will trip up the user.
- Some play records have “placeholder” for the Player ID.
- Clem reports finding errors in player substitution (sub) records.
  - Gosh it’s nice to have someone who easily reads these files looking over my shoulder. Thanks.
- Relatedly, I’m guessing–haven’t looked–that Jeff’s scraper treats out-of-sequence batting as a series of substitutions, not a lineup adjustment. I’m quite certain there are no ladj records in any of these files.
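Two of the items above, the unquoted names and the empty count field, are mechanical enough to sketch a fix for. The Python below is a minimal sketch, not a tested converter. It assumes the MLS records keep the Retrosheet field order (name in the third field of start and sub records, count in the fifth field of play records, with Jeff's extra fields appended after the event) and that unquoted names contain no commas. The function name and the sample player are invented for illustration.

```python
def normalize_record(line):
    """Quote player names in start/sub records and fill empty counts with ??."""
    fields = line.split(",")
    kind = fields[0]
    if kind in ("start", "sub") and len(fields) == 6:
        # Exactly six fields means the name itself held no comma.
        name = fields[2]
        if not (name.startswith('"') and name.endswith('"')):
            fields[2] = f'"{name}"'
    elif kind == "play" and len(fields) >= 5 and fields[4] == "":
        fields[4] = "??"  # the spec's unknown-count code
    return ",".join(fields)


# Invented sample records, shaped like the Retrosheet layout assumed above.
print(normalize_record("start,123456,Joe Smith,0,1,8"))
print(normalize_record("play,1,0,123456,,,S7"))
```

Team IDs are left alone here, since which three of the six characters carry the league identifier would need to be checked against the files first.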
Sackmann’s brief documentation for the files promises errors, and Clem has verified that there are “impossible” plays recorded in the data. This is a bigger issue, and has some potential to become the project’s dominant activity. We will want to do some validation of the data, and may well find it necessary to implement a proofreading regime. Some of those errors will be failures of Sackmann’s webpage scraper; those can, in theory, be fixed by checking the MiLB play-by-play page for the affected game. Others are scoring errors preserved on those play-by-play pages, and won’t be fixable by that method; they may be otherwise fixable. This merits discussion.
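A first validation pass could simply list the records we already know to be suspect, so a proofreader has somewhere to start. The sketch below flags play records whose player ID is the literal "placeholder" and events carrying the /FF or /PF modifiers Clem found. It assumes the event text sits in the seventh field of the play record, with modifiers separated by slashes and runner advances following a period, as in Retrosheet. The function name and sample records are invented. It does not detect "impossible" plays; that would require tracking outs and baserunners through each game.

```python
SUSPECT_MODIFIERS = {"FF", "PF"}  # codes Clem found; likely not a complete list


def scan_plays(lines):
    """Yield (line_number, game_id, reason) for suspicious play records."""
    game_id = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(",")
        if fields[0] == "id" and len(fields) > 1:
            game_id = fields[1]
        elif fields[0] == "play" and len(fields) >= 7:
            if fields[3] == "placeholder":
                yield number, game_id, "placeholder player ID"
            # Drop the advances after the period, then split off the modifiers.
            description = fields[6].split(".")[0]
            for modifier in description.split("/")[1:]:
                if modifier in SUSPECT_MODIFIERS:
                    yield number, game_id, f"unrecognized modifier /{modifier}"


# Invented sample records for illustration only.
sample = [
    "id,GAME1",
    "play,1,0,placeholder,,,K",
    "play,1,0,123456,,,2/PF",
    "play,1,1,123457,,,S7/G.1-2",
]
for finding in scan_plays(sample):
    print(finding)
```

Grouping the findings by game ID would show whether the problems cluster in a few games or are spread across the season, which bears directly on how much proofreading is realistic.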
Finally: Clem determined that the 2009 Midwest League file could be successfully processed through BEVENT by deleting a single incompletely-coded game. I presume that problems in other files are similar–though I’m using that “similar” word quite loosely. That now-deleted game was played, and a complete play-by-play is available, so a practiced Retrosheet coder could quickly reconstruct the game and replace it in the file. That may ultimately be the direction we wish to go. This, too, merits discussion.
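Pulling a bad game out of an event file is easy to script, which also makes it easy to set the removed records aside for later re-coding. A minimal sketch, assuming every game starts with an id record and runs until the next one; the function name and the game IDs in the usage note are placeholders, not real MLS identifiers.

```python
def split_out_games(lines, unwanted_ids):
    """Return (kept, removed) record lists, dropping whole games by game ID."""
    kept, removed = [], []
    keep = True  # records before the first id record are kept as-is
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[0] == "id" and len(fields) > 1:
            keep = fields[1] not in unwanted_ids
        (kept if keep else removed).append(line)
    return kept, removed
```

Called as `kept, removed = split_out_games(records, {"GAME2"})`, it leaves `kept` ready to feed to the parser, and `removed` can be saved alongside a note pointing at the MiLB.com play-by-play for whoever reconstructs the game.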