Another tl;dr essay discussing Jeff Sackmann’s minor league play-by-play data; the first was here. This will be far more understandable if you have worked with Retrosheet event files than if you’ve not, though anyone who habitually scores ballgames can likely follow the discussion if they’re really determined. Retrosheet file documentation begins here, and BEVENT’s default output is described near the end of this file.
Out of the Box
It may be helpful to start with a box score. This was generated from the Sackmann event file by Retrosheet’s program BOX for the September 3, 2009, game I mentioned in the title.
Game of 9/3/2009 -- Beloit at Quad Cities (N)

Beloit               AB  R  H RBI   Quad Cities          AB  R  H RBI
Beresford J, ss       1  1  0   0   Ingram D, cf          3  1  1   1
De La Osa D, ss       4  0  1   1   Stidham J, 2B         4  0  0   0
Thompson D, 2b        4  0  0   0   Curtis J, 3b          3  1  1   1
Hicks A, cf           3  1  1   0   Scruggs X, 1b         5  1  1   0
Waltenbury J, 1b      4  0  0   1   Racobaldo R, dh       5  1  1   2
Rams D, c             4  0  1   1   Parejo F, lf          3  2  3   1
Harrington M, lf      3  0  0   0   Rodriguez R, rf       3  1  1   0
Hanson N, 3b          4  0  1   0   Cawley J, c           4  1  2   3
Severino A, dh        3  0  1   0   Bolivar D, ss         4  0  0   0
Morales A, rf         4  0  1   0
                     -- -- --  --                        -- -- --  --
                     34  2  6   3                        34  8 10   8

Beloit        111 000 000 --  3
Quad Cities   123 020 00x --  8
1 out when game ended.

Beloit                IP   H   R  ER  BB  SO
Hendriks L           4.0   8   6   0   1   3
Marquez W            2.1   2   2   0   4   2
Stillings B          2.0   0   0   0   1   2

Quad Cities           IP   H   R  ER  BB  SO
Miller S             1.0   1   1   0   1   1
McGregor S           6.0   4   1   0   1  11
Delgado R            1.1   1   0   0   0   2

E -- Bolivar D, Thompson D, Hicks A 2, Scruggs X
DP -- Beloit 1
LOB -- Beloit 10, Quad Cities 7
2B -- Curtis J, Scruggs X
3B -- Hicks A, Morales A
SB -- Ingram D, Severino A, Hanson N, Harrington M
CS -- Ingram D
HBP -- by Marquez W (Curtis J), by Delgado R (Harrington M)
WP -- Hendriks L, Marquez W 3
PB -- Rams D, Cawley J
T -- 0:00
A -- 0
You may wish to compare this box to MiLB’s box for the same game. Even without comparing, though, two issues are readily apparent. First off, it’s difficult to imagine why an 8-3 game would end with one out in the ninth. Baseball just doesn’t work like that. Similarly puzzling are the innings totals for both pitching staffs: It seems that this was indeed an 8 1/3 inning game.
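The innings arithmetic is easy to check by hand, but here is a minimal Python sketch of it anyway. It assumes the usual box-score convention that "2.1" means two and one-third innings; the function names are mine, not anything BOX provides.

```python
def ip_to_outs(ip: str) -> int:
    """Convert box-score innings ('2.1' = 2 1/3 innings) to a count of outs."""
    whole, _, thirds = ip.partition(".")
    return int(whole) * 3 + int(thirds or 0)


def outs_to_ip(outs: int) -> str:
    """Convert a count of outs back to box-score innings notation."""
    return f"{outs // 3}.{outs % 3}"


# Innings pitched, copied from the BOX output above.
beloit = ["4.0", "2.1", "2.0"]
quad_cities = ["1.0", "6.0", "1.1"]

beloit_total = outs_to_ip(sum(ip_to_outs(ip) for ip in beloit))
qc_total = outs_to_ip(sum(ip_to_outs(ip) for ip in quad_cities))
print(beloit_total, qc_total)  # prints: 8.1 8.1
```

Both staffs come out at 25 outs. In a nine-inning game that the home team leads after the top of the ninth, the visiting staff should record 24 outs (8.0 innings) and the home staff 27 (9.0), so neither total can be right.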
Comparisons with the MiLB box raise some more flags: Ten of the hitters have different counts in AB, R, H, and/or RBI. Four of the pitchers differ in IP, R, H, BB, and/or SO. (ER is a separate issue, not under discussion today.) I see other differences elsewhere, but see no need to go into detail. I think I’ve demonstrated that there are problems here, folks. Let’s see if we can figure them out.
Some Useful Background
Jeff Sackmann collected several years’ minor league play-by-play data to use for a specific project, his Minor League Splits website. He’s discontinued that project, but has voluntarily shared the underlying data with other researchers. There are problems, which he recognizes, with the data store, and I’m exploring the scope of those. I have some questions which can only be examined with minor league play-by-play data, so it’s necessary that I understand this data and its shortcomings.
Sackmann built his data store by collecting the game accounts on the Minor League Baseball (MiLB) website with a bot, then running them through a program which I usually call the Sackmann parser. Since the Sackmann files are nominally in Retrosheet (RS) format, my immediate project is to run those files through what you might call a translator, called BEVENT, which converts RS files to a standard database format and is available from the Retrosheet website. This is a progress report on that conversion project. I gave a preliminary report about the effort a couple weeks ago in a prior essay.
I’ve been using Jeff’s 2009 Midwest League event file, which contains game accounts for the entire 2009 season, for a testbed. Retrosheet VP Clem Comly has experimented some with the 2009 MWL file and reports that it averages two or three erroneous records per game. Erroneous, in this case, means records which won’t be interpreted correctly by the BEVENT parser. Since the league played about 1,000 games in 2009, including the championship playoff, that works out to 2,500 or so bad records in that play-by-play file, which contains 115,278 records overall (I’ve deleted one badly-damaged game account, For200908170, from the file, as has Clem). Averages can mislead, though, as the errors are clustered. Some of the clustering results from transcription errors which make subsequent, correct, records appear to be erroneous, thus creating an error cascade. The common case is a data record transcription which loses a putout, thus apparently extending the inning. This confuses BEVENT, which blindly assumes innings have three outs. So there are some game accounts with many errors, and many evidently-flawless game accounts.
That’s my paraphrase of Clem’s analysis, by the way. I believe this document summarizes his main points adequately, but it’s fair to say I’ve twisted his commentary around a bit.
An Example Game
Perhaps we’ll profit if we examine the Beloit/Quad Cities game which is incorrectly summarized above. Let’s compare three versions of the play-by-play:
- The game as reported on the Minor League Baseball (MiLB) website.
- Sackmann’s version, which reformats the MiLB report into Retrosheet format. (I’ve shortened Jeff’s team designators, but it’s otherwise an exact copy of his data. Within this essay I’ve also removed hit location data to reduce the clutter.)
- BEVENT’s version, which reformats Sackmann’s into a database-friendly format. This essay shows only the first few fields of the standard BEVENT output, though the linked file has the complete standard output.
If you compare the files, you should be able to convince yourself that they’re the same game. For instance, all show the game’s first play as an error by the shortstop, and the last as fly to right. It shouldn’t take long to verify that all three show that the first inning ends with a shortstop-to-first groundout. Besides, they all claim to be the same game, which is presumably significant.
The data errors in the play-by-play are less obvious, and I’m pleased that Clem helped me identify those. Let’s take a little tour:
Second Inning
In the top half of the second, Angel Morales struck out, with some subsequent action on the basepaths. Here’s how the various versions record this:
- MiLB: Angel Morales strikes out swinging. Adan Severino steals (3) 2nd base. Adan Severino advances to 3rd, on throwing error by catcher Jack Cawley.
- Sackmann: play,2,0,519044,,,K+SB2;1-3(E2)(E2/TH)
- BEVENT:
Qua200909030,Bel,2,0,1,0,0,1,1,519044,?,543520,?,458733,,,K+SB2;1-3(E2)(E2/TH)
Key to the partial BEVENT output format I’m using here:
- “Qua200909030”: Game ID, with home team embedded
- “Bel”: Visiting Team
- “2”: Inning
- “0”: Team at Bat (0 = visitor, 1 = home)
- “1”: Outs
- “0”: Balls (never known in this file)
- “0”: Strikes (likewise)
- “1”: Visiting Team Score
- “1”: Home Team Score
- “519044”: Responsible Batter’s ID
- “?”: Batter’s Handedness (missing in this specific file)
- “543520”: Responsible Pitcher’s ID
- “?”: Pitcher’s Handedness (missing in this specific file)
- “458733”: ID of Runner on First
- “”: ID of Runner on Second
- “”: ID of Runner on Third
- “K+SB2;1-3(E2)(E2/TH)”: Sackmann parser’s representation of the play, in Retrosheet notation
- The player ID numbers are those assigned by the Minor League Baseball website (by Major League Baseball Advanced Media [MLBAM], actually); every professional player has one.
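If you would rather read these rows by name than by position, the key above translates into a small Python sketch. The field order follows the key; the class and attribute names are my own labels (not BEVENT's), and the sketch covers only the seventeen fields shown in this essay.

```python
import csv
from typing import NamedTuple


class BeventRow(NamedTuple):
    """The first seventeen fields of a BEVENT row, in the order of the key above."""
    game_id: str
    visiting_team: str
    inning: int
    batting_team: int   # 0 = visitor, 1 = home
    outs: int
    balls: int
    strikes: int
    vis_score: int
    home_score: int
    batter: str
    batter_hand: str
    pitcher: str
    pitcher_hand: str
    runner_1b: str      # empty string when the base is unoccupied
    runner_2b: str
    runner_3b: str
    event: str


def parse_row(line: str) -> BeventRow:
    """Parse one comma-separated row; csv copes with quoted or unquoted fields."""
    fields = next(csv.reader([line.strip()]))
    if len(fields) < 17:
        raise ValueError(f"expected at least 17 fields, got {len(fields)}")
    numbers = [int(value) for value in fields[2:9]]
    return BeventRow(fields[0], fields[1], *numbers, *fields[9:17])


row = parse_row(
    "Qua200909030,Bel,2,0,1,0,0,1,1,519044,?,543520,?,458733,,,K+SB2;1-3(E2)(E2/TH)"
)
print(row.inning, row.outs, row.runner_1b, row.event)
```

Any fields beyond the seventeenth are ignored here, so the same function should cope with the fuller standard output, provided those first seventeen fields appear in this order.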
Sackmann’s parser made a mistake here; K+SB2;1-3(E2)(E2/TH) should have a period (dot) where the semicolon is, and this play would better have been scored K+SB2.1-3(E2/TH). (Note that the Sackmann parser double-reported the error.) All this matters because it confused BEVENT, which couldn’t interpret the code and left baserunner 458733 (that would be Severino) on first base, rather than third. Which caused problems for BEVENT on the next play:
- MiLB: Dominic De La Osa singles on a line drive to right fielder Ryde Rodriguez. Adan Severino scores.
- Sackmann: play,2,0,448279,,,S9/L.3-H
- BEVENT:
Qua200909030,Bel,2,0,2,0,0,1,1,448279,?,543520,?,,458733,,S9/L.3-H
BEVENT’s parser panics. “Hey, who’s this guy on third you’ve got scoring? And what am I supposed to do with the guy on first base? He’s in the batter’s way.” So the BEVENT-generated file has misplaced a run and lost track of the batter-runner. Not good.
The next batter grounded out to end both the inning and this short error cascade.
How often does this SB with subsequent play pattern/error occur? I estimate there are two or three hundred instances in the 2009 MWL event file. It looks to me like these could be fixed by running search-and-replace on the file a couple times.
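Here is one way that search-and-replace might look in Python. It is a sketch under assumptions, not a tested repair for the whole file: I am assuming the pattern always resembles the one instance shown above (a stolen base, then a semicolon where the dot belongs, then a runner advance), and the function names are hypothetical. The semicolon is left alone when it separates two steals, as in SB2;SB3, since that usage is legitimate.

```python
import re

# A semicolon followed directly by a runner advance such as 1-3 or 1X3.
ADVANCE_AFTER_SEMICOLON = re.compile(r";(?=[B123][-X])")
# An error reported twice, e.g. (E2)(E2/TH); the detailed form is kept.
DOUBLED_ERROR = re.compile(r"\((E\d)\)\(\1(/[A-Z]+)\)")


def fix_event(event: str) -> str:
    """Repair the steal-plus-advance pattern; leave every other event untouched."""
    if "SB" not in event:
        return event
    if "." not in event:
        # Only the first such semicolon becomes the dot; later ones separate advances.
        event = ADVANCE_AFTER_SEMICOLON.sub(".", event, count=1)
    return DOUBLED_ERROR.sub(r"(\1\2)", event)


def fix_line(line: str) -> str:
    """Apply fix_event to the event field of a 'play' record; pass other records through."""
    parts = line.rstrip("\n").split(",", 6)
    if len(parts) == 7 and parts[0] == "play":
        parts[6] = fix_event(parts[6])
        return ",".join(parts)
    return line.rstrip("\n")


print(fix_line("play,2,0,519044,,,K+SB2;1-3(E2)(E2/TH)"))
# prints: play,2,0,519044,,,K+SB2.1-3(E2/TH)
```

I would want to review a listing of every changed record before trusting the result, since two or three hundred instances is few enough to eyeball.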
Fourth Inning
The Quad Cities half of the fourth ended with a double play:
- MiLB: Jermaine Curtis pops into double play in foul territory, first baseman Jon Waltenbury to pitcher Liam Hendriks. D’Marcus Ingram doubled off 1st.
- Sackmann: play,4,1,543079,,,3/PF.?X?(31)
- BEVENT:
Qua200909030,Bel,4,1,1,0,0,3,6,543079,?,521230,?,502080,,,3/PF.?X?(31)
Oops. What’s that? ?X?(31) doesn’t mean anything to the BEVENT parser, which ignores it (notice 502080/Ingram still standing on first). Better if the Sackmann parser had coded this play as 3/FL/DP.1X1(31). (Clem counts this as two coding errors, by the way; one is purely technical and could be classified as a parser quirk.)
This data error is pretty serious. Instead of an inning-ending DP, BEVENT believes there are two out and a baserunner on first. This seems likely to have consequences. They’ll begin to show up on the next play:
- MiLB: Drew Thompson grounds out to first baseman Xavier Scruggs.
- Sackmann: play,5,0,458711,,,3/G
- BEVENT:
Qua200909030,Bel,4,1,2,0,0,3,6,489305,?,521230,?,502080,,,3/G
The most important thing to notice is that 4 following Bel in the BEVENT line: While the MiLB and Sackmann accounts of the game have moved on to the fifth inning, BEVENT thinks we’re still in the fourth. As far as this account is concerned, 489305/Drew Thompson has jumped teams and is now batting for Beloit; similarly, pitcher 521230/Liam Hendriks has been swapped to the River Bandits. And 502080/D’Marcus Ingram remains on first base. YIKES!
The next play looks like this:
- MiLB: Aaron Hicks flies out to left fielder Frederick Parejo.
- Sackmann: play,5,0,543305,,,7/F
- BEVENT:
Qua200909030,Bel,5,0,0,0,0,3,6,458711,?,543520,?,,,,7/F
We’ve straightened out the pitching situation–Scott McGregor’s magically appeared on the mound. And we’ve released Ingram from his baserunning duties so he can return to QC’s CF. But: We’ve still lost track of one out. That will haunt us.
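The inning disagreement is the easiest symptom to detect mechanically. The sketch below walks the Sackmann play records and the BEVENT rows side by side and reports the first pair whose inning and team-at-bat differ. It assumes BEVENT writes exactly one row per play record other than NP, in file order; that matches what I see in this game, but treat it as an assumption. The function name is hypothetical.

```python
def first_desync(play_lines, bevent_rows):
    """Return (index, sackmann_half, bevent_half) for the first mismatch, or None."""
    plays = []
    for line in play_lines:
        parts = line.strip().split(",", 6)
        # Keep real plays only; NP (no play) records are assumed to produce no BEVENT row.
        if len(parts) == 7 and parts[0] == "play" and parts[6] != "NP":
            plays.append(parts)
    for index, (play, row) in enumerate(zip(plays, bevent_rows)):
        fields = row.strip().split(",")
        sackmann_half = (play[1], play[2])    # inning, team at bat
        bevent_half = (fields[2], fields[3])  # same two values in the BEVENT row
        if sackmann_half != bevent_half:
            return index, sackmann_half, bevent_half
    return None


# The three plays discussed above, in both formats.
sackmann = [
    "play,4,1,543079,,,3/PF.?X?(31)",
    "play,5,0,458711,,,3/G",
    "play,5,0,543305,,,7/F",
]
bevent = [
    "Qua200909030,Bel,4,1,1,0,0,3,6,543079,?,521230,?,502080,,,3/PF.?X?(31)",
    "Qua200909030,Bel,4,1,2,0,0,3,6,489305,?,521230,?,502080,,,3/G",
    "Qua200909030,Bel,5,0,0,0,0,3,6,458711,?,543520,?,,,,7/F",
]
print(first_desync(sackmann, bevent))  # prints: (1, ('5', '0'), ('4', '1'))
```

The play just before the first mismatch is the one to inspect: here that is the popup at index 0, where the double play was lost.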
Fifth Inning
This sort of thing’s going to go on for the rest of the game. The bottom of the fifth starts with a pitching change–
- MiLB: Pitcher Change: Winston Marquez replaces Liam Hendriks.
- Sackmann: play,5,1,489305,,,NP
sub,470504,Winston Marquez,0,0,1
- BEVENT:
Qua200909030,Bel,5,0,2,0,0,3,6,501858,?,543520,?,,,,K
–except BEVENT believes Beloit’s still at bat and recognizes 470504/Marquez as a Beloit pitcher. The program doesn’t make the substitution because Marquez shouldn’t be pitching for the opposition. Parsers can be quirky, folks. That BEVENT recognizes this is an error after missing a similar data conflict a few lines ago can likely be explained, but it’s still odd. And you could certainly make a case that it should stop processing the game and report an error when it finds this sort of contradiction.
Seventh Inning
The top of the seventh begins with a Drew Thompson single, then Aaron Hicks hits into a DP–
- MiLB: Drew Thompson singles on a line drive to center fielder D’Marcus Ingram.
Aaron Hicks grounds into double play, second baseman Jason Stidham to shortstop Domnit Bolivar to first baseman Xavier Scruggs. Drew Thompson out at 2nd.
- Sackmann: play,7,0,458711,,,S8/L
play,7,0,543305,,,46(1)3/GDP/G4
- BEVENT:
Qua200909030,Bel,6,1,2,0,0,3,8,521088,?,470504,?,543079,502781,502080,S8/L
Qua200909030,Bel,6,1,2,0,0,3,8,527050,?,470504,?,521088,502781,502080,46(1)3/GDP/G4
–except BEVENT’s still in the sixth with the bases loaded. So it throws away the current runner on first (543079/Jermaine Curtis), replaces him with 521088/Thompson, and wipes out whichever of them is actually there on the subsequent DP. (No, I don’t know why it thought replacing the baserunner made sense. A while back it threw away the batter.) But it can’t be a DP; there are already two out. So we’ve misplaced another out, and will be off by two for the rest of the contest. This is getting pretty ugly, friends.
You should be getting the picture. Before the game ends we’ll see two more pitching changes that the BEVENT parser will mishandle, and nearly every total is thrown off when the first two batters’ results in each inning are awarded to the opposing team. All because we missed the second out on a fourth inning double play.
So how often does the missed-double-play event error occur? Looks like there are about 50 in the 2009 Midwest League event file. These could reasonably, albeit inconveniently, be recovered by eyeballing the MiLB game accounts and manually fixing the data.
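Eyeballing goes faster with a worklist. This sketch lists the play records whose event text contains the unresolved ?X? placeholder, tagged with the game ID from the preceding Retrosheet id record. Whether every missed double play in the file carries exactly that placeholder is an assumption on my part; I have shown only one instance here. The function name is hypothetical.

```python
def unresolved_runner_plays(lines):
    """Return (game_id, inning, team_at_bat, event) for plays containing '?X?'."""
    game_id = None
    hits = []
    for line in lines:
        parts = line.rstrip("\n").split(",", 6)
        if parts[0] == "id" and len(parts) >= 2:
            # Retrosheet event files open each game with an 'id' record.
            game_id = parts[1]
        elif parts[0] == "play" and len(parts) == 7 and "?X?" in parts[6]:
            hits.append((game_id, parts[1], parts[2], parts[6]))
    return hits


sample = [
    "id,Qua200909030",
    "play,4,1,543079,,,3/PF.?X?(31)",
    "play,5,0,458711,,,3/G",
]
for hit in unresolved_runner_plays(sample):
    print(hit)
```

Run against the whole event file (for instance, by passing an open file object), this should produce the list of games to look up on the MiLB site.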
Back in the Box
So how did it do? If I apply those two fixes, does the BOX program generate the correct information? Let’s check:
Game of 9/3/2009 -- Beloit at Quad Cities (N)

Beloit               AB  R  H RBI   Quad Cities          AB  R  H RBI
Beresford J, ss       1  1  0   0   Ingram D, cf          4  1  1   1
De La Osa D, ss       4  0  1   1   Stidham J, 2b         4  0  0   0
Thompson D, 2b        5  0  2   0   Curtis J, 3b          3  1  1   1
Hicks A, cf           4  1  1   0   Scruggs X, 1b         4  1  1   0
Waltenbury J, 1b      4  0  0   1   Racobaldo R, dh       4  1  1   2
Rams D, c             4  0  1   1   Parejo F, lf          3  2  2   1
Harrington M, lf      3  0  0   0   Rodriguez R, rf       3  1  1   0
Hanson N, 3b          4  0  0   0   Cawley J, c           4  1  2   3
Severino A, dh        4  1  2   0   Bolivar D, ss         3  0  0   0
Morales A, rf         3  0  0   0
                     -- -- --  --                        -- -- --  --
                     36  3  7   3                        32  8  9   8

Beloit        111 000 000 --  3
Quad Cities   123 020 00x --  8

Beloit                IP   H   R  ER  BB  SO
Hendriks L           4.0   8   6   0   1   3
Marquez W            2.0   1   2   0   4   3
Stillings B          2.0   0   0   0   0   2

Quad Cities           IP   H   R  ER  BB  SO
Miller S             1.0   1   1   0   1   1
McGregor S           6.0   5   2   0   1   9
Delgado R            2.0   1   0   0   1   3

E -- Bolivar D, Thompson D, Cawley J, Hicks A 2, Scruggs X
DP -- Beloit 1, Quad Cities 1
LOB -- Beloit 9, Quad Cities 7
2B -- Curtis J, Scruggs X
3B -- Hicks A, Thompson D
SB -- Ingram D, Severino A 2, Hanson N
CS -- Ingram D
HBP -- by Marquez W (Curtis J), by Stillings B (Bolivar D)
WP -- Hendriks L, Marquez W 3
PB -- Rams D, Cawley J
T -- 0:00
A -- 0
Yes! That’s much better.
Where Does This Leave Us?
Sackmann’s parser made two significant errors in this game account, each of which generated problems on subsequent plays. These problems appear as five more data errors, because BEVENT mishandles them even though the plays (events) were correctly coded. That’s a common pattern in this data, and something we’ll need to give some thought to. But I’m not ready to go there yet.
The next essay will address some technical points; then I’ll raise some questions for discussion.