[GH-ISSUE #99] [FBref] Unable to scrape Men's World Cup stats #22

Closed
opened 2026-03-02 15:55:06 +03:00 by kerem · 6 comments

Originally created by @philbywalsh on GitHub (Nov 14, 2022).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/99

Hi @probberechts - this looks like a wonderful set of tools. Can't wait to get stuck deeper into it. Thank you!

Objective: To scrape FBref stats for historic World Cups (and the upcoming 2022 World Cup) from these pages:

World Cup stats landing page -> https://fbref.com/en/comps/1/World-Cup-Stats
Stats page for 2018 World Cup -> https://fbref.com/en/comps/1/2018/2018-FIFA-World-Cup-Stats

1. Adding a new league - Working as expected

Following the "Adding additional leagues" section of the docs (here: https://soccerdata.readthedocs.io/en/latest/usage.html), I successfully added a new league called "INTL-WorldCup".

Content of league_dict.json

{
  "INTL-WorldCup": {
    "FBref": "World-Cup-Stats",
    "season_start": "Aug",
    "season_end": "May"
  }
}

Note: I had to remove the comma just after the second-to-last curly bracket.
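The trailing comma matters because strict JSON parsers reject it. A minimal sketch using Python's standard `json` module; the file contents are inlined as strings for illustration, the helper name `is_valid_json` is hypothetical, and how soccerdata itself loads `league_dict.json` is not shown in this thread:

```python
import json

# Same entry as above, but with a trailing comma after the inner object.
with_trailing_comma = """
{
  "INTL-WorldCup": {
    "FBref": "World-Cup-Stats",
    "season_start": "Aug",
    "season_end": "May"
  },
}
"""

# The corrected version: the only "}," in the text is the trailing comma.
without_trailing_comma = with_trailing_comma.replace("},", "}")


def is_valid_json(text):
    """Return True if `text` parses as strict JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(with_trailing_comma))
print(is_valid_json(without_trailing_comma))
```

Running a check like this on the file before pointing soccerdata at it separates JSON syntax problems from scraper problems.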

Result: When I call sd.FBref.available_leagues(), it returns the expected result below

[
  'Big 5 European Leagues Combined',
  'ENG-Premier League',
  'ESP-La Liga',
  'FRA-Ligue 1',
  'GER-Bundesliga',
  'INTL-WorldCup',
  'ITA-Serie A'
]

2. Can I pull back scraped data?

This line ran without error: fbref = sd.FBref(leagues="INTL-WorldCup", seasons=2018)

However, when I ran the 2 lines below

team_season_stats = fbref.read_team_season_stats(stat_type="standard")
team_season_stats.head()

...I got this error below. What am I doing wrong?


ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11984/128004415.py in <module>
----> 1 team_season_stats = fbref.read_team_season_stats(stat_type="standard")
      2 team_season_stats.head()

soccerdata\fbref.py in read_team_season_stats(self, stat_type, opponent_stats)
    252 
    253         # get league IDs
--> 254         seasons = self.read_seasons()
    255 
    256         # collect teams

soccerdata\fbref.py in read_seasons(self)
    169             seasons.append(df_table)
    170 
--> 171         df = pd.concat(seasons).pipe(standardize_colnames)
    172         # A competition name field is not inlcuded in the Big 5 European Leagues Combined
    173         if "competition_name" in df.columns:

~\Miniconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    302         verify_integrity=verify_integrity,
    303         copy=copy,
--> 304         sort=sort,
    305     )
    306 

~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    349 
    350         if len(objs) == 0:
--> 351             raise ValueError("No objects to concatenate")
    352 
    353         if keys is None:

ValueError: No objects to concatenate

@probberechts commented on GitHub (Nov 14, 2022):

First, the competition ID for the World Cup used by FBref is "FIFA World Cup". Hence, your league_dict.json file should contain the following:

{
  "INTL-WorldCup": {
    "FBref": "FIFA World Cup"
  }
}

Next, you'll have to make a few more changes in the code to make it work, since the HTML structure for international tournaments differs slightly from that of the domestic leagues, which I support out of the box. For example, in the season's list (https://fbref.com/en/comps/1/history/World-Cup-Seasons), the first column is identified by a "year" data attribute instead of "year_id" (which you can change here: https://github.com/probberechts/soccerdata/blob/5fa087f4b3443f771f8c052c03a14fac5e06a7e7/soccerdata/fbref.py#L167). There are probably a few more differences.
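The attribute difference described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not soccerdata's actual code: the HTML fragments are made up, the attribute name `data-stat` is an assumption about FBref's markup, and `SeasonCellParser` and `extract_seasons` are hypothetical names.

```python
from html.parser import HTMLParser

# Accept either attribute value, so that domestic leagues ("year_id") and
# international tournaments ("year") go through the same code path.
SEASON_STATS = {"year_id", "year"}


class SeasonCellParser(HTMLParser):
    """Collect the text of <th> cells that identify a season (hypothetical)."""

    def __init__(self):
        super().__init__()
        self._in_season_cell = False
        self.seasons = []

    def handle_starttag(self, tag, attrs):
        # `data-stat` is an assumed attribute name, see the note above.
        if tag == "th" and dict(attrs).get("data-stat") in SEASON_STATS:
            self._in_season_cell = True
            self.seasons.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_season_cell = False

    def handle_data(self, data):
        if self._in_season_cell:
            self.seasons[-1] += data.strip()


def extract_seasons(html):
    """Return the season labels found in an HTML fragment."""
    parser = SeasonCellParser()
    parser.feed(html)
    return parser.seasons


# Made-up fragments that only mimic the difference described above.
domestic = '<tr><th data-stat="year_id"><a href="#">2018-2019</a></th></tr>'
international = '<tr><th data-stat="year"><a href="#">2018</a></th></tr>'

print(extract_seasons(domestic))
print(extract_seasons(international))
```

Matching against a set of accepted values, rather than swapping one hard-coded name for another, keeps the domestic leagues working while adding the tournament pages.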

@philbywalsh commented on GitHub (Nov 15, 2022):

Thanks @probberechts - I've applied those adjustments, and they get me past the "no objects to concatenate" error.

The next barrier is the error below, which also seems to be linked to the season vs. year difference in how FBref structures the "history" page.

--> 177 df["season"] = df["season"].apply(lambda x: season_code(x))

https://fbref.com/en/comps/Big5/history/Big-5-European-Leagues-Seasons -> uses "Season"
https://fbref.com/en/comps/1/history/World-Cup-Seasons -> uses "Year"

Is there a quick-ish fix here? Or is it more complicated?


KeyError Traceback (most recent call last)
~\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:

~\Miniconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\Miniconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'season'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28692/128004415.py in <module>
----> 1 team_season_stats = fbref.read_team_season_stats(stat_type="standard")
2 team_season_stats.head()

~\Documents\Python\Projects\3. Scraping\SoccerData (scrape multiple sites)\soccerdata\fbref.py in read_team_season_stats(self, stat_type, opponent_stats)
252
253 # get league IDs
--> 254 seasons = self.read_seasons()
255
256 # collect teams

~\Documents\Python\Projects\3. Scraping\SoccerData (scrape multiple sites)\soccerdata\fbref.py in read_seasons(self)
175 else:
176 df["league"] = "Big 5 European Leagues Combined"
--> 177 df["season"] = df["season"].apply(lambda x: season_code(x))
178 df = df.set_index(["league", "season"]).sort_index()
179 return df.loc[df.index.isin(itertools.product(self.leagues, self.seasons))]

~\Miniconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]

~\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'season'


@probberechts commented on GitHub (Nov 15, 2022):

These are all easy fixes. You just have to adapt some class names / ids / data attributes in the scraper's code and rename some column names.
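As a rough illustration of the column-renaming part, here is a minimal sketch in plain Python. It assumes the World Cup history table yields a "year" column where the code expects "season"; the rows, the "host_country" key, `COLUMN_ALIASES`, and `normalize_columns` are all hypothetical and not part of soccerdata.

```python
# Hypothetical mapping from international-tournament column names to the
# names the rest of the code expects.
COLUMN_ALIASES = {"year": "season"}


def normalize_columns(row):
    """Return a copy of `row` with aliased keys renamed (hypothetical helper)."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


# Made-up rows standing in for one parsed table row each.
world_cup_row = {"year": "2018", "host_country": "Russia"}
league_row = {"season": "2018-2019", "competition_name": "Premier League"}

# After normalization both rows expose a "season" key, so a later lookup of
# "season" no longer fails for the World Cup row.
print(normalize_columns(world_cup_row)["season"])
print(normalize_columns(league_row)["season"])
```

On a pandas DataFrame, the same idea is `df.rename(columns={"year": "season"})`, which leaves frames that already have a "season" column unchanged.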


@coreyrastello commented on GitHub (Nov 23, 2022):

@philbywalsh Curious if you resolved this and are willing to share steps before I dig in myself.


@philbywalsh commented on GitHub (Nov 23, 2022):

Hi @coreyrastello - I've not yet had an opportunity I'm afraid. Wishing you good luck if you get there first :-)


@probberechts commented on GitHub (Nov 26, 2022):

I'm working on it.

Reference
starred/soccerdata#22