[GH-ISSUE #576] [FBRef] read_player_season_stats includes Women's World Cup by default (season 2023) #105

Closed
opened 2026-03-02 15:55:49 +03:00 by kerem · 2 comments
Owner

Originally created by @mvantschip on GitHub (May 12, 2024).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/576

I am fetching player data for the 2023 season. According to the docs, the default should return data from the top 5 leagues only. However, I noticed that stats from the Women's World Cup are included as well.
I can reproduce this issue with the following code:

import soccerdata as sd
import pandas as pd

fbref = sd.FBref(seasons=2023)
stats = fbref.read_player_season_stats(stat_type='standard')
print(stats.index.unique(level='league'))

Output:

Index(['ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga',
       'INT-Women's World Cup', 'ITA-Serie A'],
      dtype='object', name='league')

In addition, I get a dataframe where each row occurs twice, but I am not sure if that problem is related.
See, from the same code, the output of stats.head():

import soccerdata as sd
import pandas as pd

fbref = sd.FBref(seasons=2023)
stats = fbref.read_player_season_stats(stat_type='standard')
print(stats.head())

Output:

                                                 nation pos     age  born Playing Time                    Performance                                 Expected                      Progression           Per 90 Minutes
                                                                                    MP Starts   Min   90s         Gls Ast G+A G-PK PK PKatt CrdY CrdR       xG  npxG   xAG npxG+xAG        PrgC PrgP PrgR            Gls   Ast   G+A  G-PK G+A-PK    xG   xAG xG+xAG  npxG npxG+xAG
league             season team    player
ENG-Premier League 2324   Arsenal Aaron Ramsdale    ENG  GK  25-364  1998            6      6   540   6.0           0   0   0    0  0     0    0    0      0.0   0.0   0.0      0.0           0    2    0            0.0   0.0   0.0   0.0    0.0   0.0   0.0    0.0   0.0      0.0
                                  Aaron Ramsdale    ENG  GK  25-364  1998            6      6   540   6.0           0   0   0    0  0     0    0    0      0.0   0.0   0.0      0.0           0    2    0            0.0   0.0   0.0   0.0    0.0   0.0   0.0    0.0   0.0      0.0
                                  Ben White         ENG  DF  26-217  1997           35     33  2830  31.4           4   4   8    4  0     0    8    0      1.1   1.1   3.5      4.6          41  175  153           0.13  0.13  0.25  0.13   0.25  0.04  0.11   0.15  0.04     0.15
                                  Ben White         ENG  DF  26-217  1997           35     33  2830  31.4           4   4   8    4  0     0    8    0      1.1   1.1   3.5      4.6          41  175  153           0.13  0.13  0.25  0.13   0.25  0.04  0.11   0.15  0.04     0.15
                                  Bukayo Saka       ENG  FW  22-250  2001           34     34  2838  31.5          16   9  25   10  6     6    3    0     15.1  10.4  10.2     20.6         153  122  502           0.51  0.29  0.79  0.32    0.6  0.48  0.32    0.8  0.33     0.65

Thanks for the wonderful work!

kerem 2026-03-02 15:55:49 +03:00
Author
Owner

@probberechts commented on GitHub (May 13, 2024):

The docs are outdated. When no leagues are given, it returns the data for all the supported leagues. Previously, only the Big 5 leagues were supported but I've added support for the World Cups and Euros since.
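Given the behavior described above, one way to get back to the five domestic leagues is to pass them explicitly through the `leagues` argument of `sd.FBref(...)`, or to filter the returned frame on its `league` index level. The snippet below is only a sketch of the second option using plain pandas. The `BIG5` list is copied from the output in the issue, and `keep_leagues` and the tiny `demo` frame (including its placeholder World Cup row) are hypothetical names made up for illustration, not part of soccerdata.

```python
import pandas as pd

# League IDs copied from the issue's output; assumed to be the intended "top 5".
BIG5 = [
    "ENG-Premier League",
    "ESP-La Liga",
    "FRA-Ligue 1",
    "GER-Bundesliga",
    "ITA-Serie A",
]


def keep_leagues(stats: pd.DataFrame, leagues: list) -> pd.DataFrame:
    """Keep only rows whose 'league' index level is in `leagues` (hypothetical helper)."""
    mask = stats.index.get_level_values("league").isin(leagues)
    return stats[mask]


# Tiny stand-in for the real frame, using the same index level names.
# The World Cup row is a placeholder, not real data.
index = pd.MultiIndex.from_tuples(
    [
        ("ENG-Premier League", "2324", "Arsenal", "Ben White"),
        ("INT-Women's World Cup", "2023", "Example Team", "Example Player"),
    ],
    names=["league", "season", "team", "player"],
)
demo = pd.DataFrame({"MP": [35, 7]}, index=index)

filtered = keep_leagues(demo, BIG5)
print(filtered.index.unique(level="league"))
```

With the real data this would be called as `keep_leagues(stats, BIG5)`. Passing the leagues to the constructor is probably preferable in practice, since it should avoid downloading the tournament data in the first place.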

Author
Owner

@mvantschip commented on GitHub (May 14, 2024):

I see! Thanks. Any idea about the duplicate rows? Or should I make a separate issue for that?
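The thread does not say what causes the duplicate rows. Until that is known, a possible workaround is to drop exact repeats on the pandas side. The sketch below assumes the duplicates are identical in both index and values, as they appear in the `stats.head()` output; the `demo` frame is a hypothetical two-column stand-in built from the `MP` and `Min` values shown there, and nothing here is a fix in soccerdata itself.

```python
import pandas as pd

# Stand-in frame reproducing the doubled rows from the issue (MP and Min only).
index = pd.MultiIndex.from_tuples(
    [
        ("ENG-Premier League", "2324", "Arsenal", "Aaron Ramsdale"),
        ("ENG-Premier League", "2324", "Arsenal", "Aaron Ramsdale"),
        ("ENG-Premier League", "2324", "Arsenal", "Ben White"),
        ("ENG-Premier League", "2324", "Arsenal", "Ben White"),
    ],
    names=["league", "season", "team", "player"],
)
demo = pd.DataFrame(
    {"MP": [6, 6, 35, 35], "Min": [540, 540, 2830, 2830]},
    index=index,
)

# Count index entries that repeat an earlier entry.
n_dupes = int(demo.index.duplicated(keep="first").sum())

# Mark rows that repeat an earlier row in BOTH index and values,
# then keep everything else. Rows that share an index but differ
# in values are left untouched.
is_repeat = demo.reset_index().duplicated().to_numpy()
deduped = demo[~is_repeat]

print(n_dupes)
print(deduped)
```

Comparing index and values together is the cautious choice here: if two rows share an index but carry different numbers, they survive and can be inspected instead of being silently dropped.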

Reference
starred/soccerdata#105