[GH-ISSUE #304] [FBref] Handle canceled / forfeited games #58

Closed
opened 2026-03-02 15:55:24 +03:00 by kerem · 6 comments

Originally created by @probberechts on GitHub (Jul 23, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/304

As pointed out in #286 by @lorenzodb1, the `fbref.read_player_match_stats` function fails when the seasons to be scraped contain canceled or forfeited games. Examples of such games are [Lyon vs. Reims on Friday, March 13, 2020](https://fbref.com/en/matches/1d845950/Lyon-Reims-March-13-2020-Ligue-1) and [Hellas Verona vs. Roma on Saturday, September 19, 2020](https://fbref.com/en/matches/e0a20cfe/Hellas-Verona-Roma-September-19-2020-Serie-A). The main issue is that the summary player stats table for these games has different columns than the corresponding table for completed games (e.g., it adds a "PkWon" column and lacks all non-performance stats).

I see two options, currently preferring the first one:

  • Skip these games (i.e., return an empty dataframe).
  • Return a dataframe with all columns that are present in completed games, setting the columns that are not available for the forfeited game to None.
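To make the second option concrete, here is a minimal sketch, not soccerdata's actual code. The column layout in `COMPLETED_COLUMNS` and the helper `align_to_completed` are hypothetical names made up for illustration. It relies on pandas' `DataFrame.reindex`, which drops columns that are not requested and fills requested-but-absent columns with missing values (NaN rather than a literal None).

```python
import pandas as pd

# Hypothetical column layout of a completed game (not FBref's real layout).
COMPLETED_COLUMNS = pd.MultiIndex.from_tuples(
    [("", "player"), ("Performance", "Gls"), ("Expected", "xG")]
)


def align_to_completed(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the expected columns; absent ones are filled with NaN."""
    return df.reindex(columns=COMPLETED_COLUMNS)


# A forfeited game: it has an extra "PkWon" column and no "Expected" stats.
forfeited = pd.DataFrame(
    [["Player A", 0, 1]],
    columns=pd.MultiIndex.from_tuples(
        [("", "player"), ("Performance", "Gls"), ("Performance", "PkWon")]
    ),
)

aligned = align_to_completed(forfeited)
print(aligned)
```

The first option would instead amount to returning an empty dataframe with the same columns for such a game.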
kerem closed this issue and added the bug and FBref labels on 2026-03-02 15:55:24 +03:00.

@lorenzodb1 commented on GitHub (Jul 24, 2023):

Thank you for creating this issue. I prefer the second option, as it maintains the integrity of the data scraped (i.e., no data will be missing).


@probberechts commented on GitHub (Jul 24, 2023):

It will require a much more complicated implementation for only a limited number of games. But fine with me if you can implement it.

Also, I am wondering whether stats collected in forfeited games count toward a team/player's season totals. Maybe that should decide how we address this?


@lorenzodb1 commented on GitHub (Jul 24, 2023):

> It will require a much more complicated implementation for only a limited number of games. But fine with me if you can implement it.

That's what https://github.com/probberechts/soccerdata/pull/286 did. Maybe we can pick up that PR again and improve it?

> Also, I am wondering whether stats collected in forfeited games count toward a team/player's season totals. Maybe that should decide how we address this?

IIRC they do, but I'm not 100% sure.

@probberechts commented on GitHub (Jul 25, 2023):

> That's what https://github.com/probberechts/soccerdata/pull/286 did. Maybe we can pick up that PR again and improve it?

I've added a unit test case in 5a4c724 to illustrate the intended behavior of the `_concat` function. The PR that you previously created broke this.

But you will indeed have to modify the `_concat` function. What you'll have to do is:

  1. Figure out which level 1 columns should be in the output. A reasonable assumption is that the column names used by the majority of the inputs are the "valid" ones.
  2. If an input data frame lacks some of these columns and/or has extra columns, select only the "valid" columns and discard the others. Note that the same level 1 column name can appear multiple times. In that case, you should look at the level 0 column name too.
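The two steps above could be sketched roughly as follows. This is a simplified illustration, not the real `_concat`: the helpers `majority_columns` and `concat_majority` are hypothetical, the vote is taken on full (level 0, level 1) pairs instead of level 1 names alone, and each input is assumed to have unique columns.

```python
from collections import Counter

import pandas as pd


def majority_columns(dfs):
    """Return the (level 0, level 1) pairs used by more than half of the inputs."""
    counts = Counter(col for df in dfs for col in df.columns)
    # Keep the order in which columns are first seen across the inputs.
    ordered = list(dict.fromkeys(col for df in dfs for col in df.columns))
    return [col for col in ordered if counts[col] * 2 > len(dfs)]


def concat_majority(dfs):
    """Restrict every input to the majority columns, then stack the rows."""
    cols = pd.MultiIndex.from_tuples(majority_columns(dfs))
    return pd.concat([df.reindex(columns=cols) for df in dfs], ignore_index=True)


def frame(cols, row):
    """Build a one-row frame with two-level columns (toy data only)."""
    return pd.DataFrame([row], columns=pd.MultiIndex.from_tuples(cols))


completed = [("", "player"), ("Performance", "Gls"), ("Expected", "xG")]
forfeited = [("", "player"), ("Performance", "Gls"), ("Performance", "PkWon")]
games = [
    frame(completed, ["A", 1, 0.4]),
    frame(completed, ["B", 0, 0.1]),
    frame(forfeited, ["C", 0, 0]),
]

result = concat_majority(games)
print(result)
```

In this toy input the "PkWon" column appears in only one of three games, so it is discarded, and the forfeited game's missing "xG" value shows up as NaN.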

@lorenzodb1 commented on GitHub (Jul 25, 2023):

So is `_concat` intended to map the column `("", "90s")` to `("Performance", "90s")`?

@probberechts commented on GitHub (Jul 25, 2023):

Yes, that's indeed one of the most common inconsistencies that it fixes.
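For illustration only, the kind of relabeling discussed here could look like the sketch below. The helper `relabel_columns` and the explicit mapping are assumptions made for this example. How `_concat` actually decides on the target label is not shown in this thread.

```python
import pandas as pd


def relabel_columns(df, mapping):
    """Return a copy of df with full column pairs renamed according to mapping."""
    renamed = [mapping.get(col, col) for col in df.columns]
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(renamed)
    return out


# Toy frame in which "90s" is missing its level 0 label.
game = pd.DataFrame(
    [["Player A", 1.0]],
    columns=pd.MultiIndex.from_tuples([("", "player"), ("", "90s")]),
)

fixed = relabel_columns(game, {("", "90s"): ("Performance", "90s")})
print(list(fixed.columns))
```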
