Mirror of https://github.com/probberechts/soccerdata.git (synced 2026-04-25 18:15:58 +03:00)
[GH-ISSUE #99] [FBref] Unable to scrape Men's World Cup stats #22
Originally created by @philbywalsh on GitHub (Nov 14, 2022).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/99
Hi @probberechts - this looks like a wonderful set of tools. Can't wait to get stuck deeper into it. Thank you!
Objective: To be able to scrape FBref stats for historic World Cups (and the upcoming 2022 World Cup) from these pages:
World Cup stats landing page -> https://fbref.com/en/comps/1/World-Cup-Stats
Stats page for the 2018 World Cup -> https://fbref.com/en/comps/1/2018/2018-FIFA-World-Cup-Stats
1. Adding a new league - Working as expected
In the "Adding additional leagues" section of the docs (here: https://soccerdata.readthedocs.io/en/latest/usage.html), I successfully added a new league called "INTL-WorldCup".
Content of league_dict.json
Note: I had to remove a comma from just after the second-to-last curly bracket (trailing commas are invalid JSON).
Result: When I run sd.FBref.available_leagues(), it returns the expected result below.
2. Can I pull back scraped data?
This line ran without error:
fbref = sd.FBref(leagues="INTL-WorldCup", seasons=2018)
However, when I ran the two lines below, I got the error below. What am I doing wrong?
@probberechts commented on GitHub (Nov 14, 2022):
First, the competition ID for the World Cup used by FBref is "FIFA World Cup". Hence, your league_dict.json file should contain the following:
Next, you'll have to make a few more changes to the code to make it work, since the HTML structure is slightly different for international tournaments compared to the domestic leagues supported out of the box. For example, in the seasons list, the first column is identified by a "year" data attribute instead of "year_id" (which you can change here). There are probably a few more differences.
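The JSON snippet itself didn't survive the mirroring. As a hedged sketch of what such an entry could look like (the key names follow soccerdata's "Adding additional leagues" docs, "FIFA World Cup" is the competition ID given above, and the season months are illustrative assumptions, not values from the thread):

```python
import json

# Sketch of a league_dict.json entry for scraping the World Cup via FBref.
# "INTL-WorldCup" is the custom league name from the original post;
# "FIFA World Cup" is the FBref competition ID mentioned in this comment.
# The season_start/season_end months are assumptions for illustration.
league_dict = {
    "INTL-WorldCup": {
        "FBref": "FIFA World Cup",
        "season_start": "Jun",
        "season_end": "Jul",
    }
}

# Note there is no trailing comma after the last entry: JSON rejects
# trailing commas, which is why one had to be removed in the original post.
print(json.dumps(league_dict, indent=2))
```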
@philbywalsh commented on GitHub (Nov 15, 2022):
Thanks @probberechts - I've applied those adjustments, and they get me past the "no objects to concatenate" error.
The next barrier is the error below, which also seems to be linked to the season-vs-year difference in how FBref structures the "history" page.
--> 177 df["season"] = df["season"].apply(lambda x: season_code(x))
https://fbref.com/en/comps/Big5/history/Big-5-European-Leagues-Seasons -> uses "Season"
https://fbref.com/en/comps/1/history/World-Cup-Seasons -> uses "Year"
Is there a quick-ish fix here, or is it more complicated?
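The quick-ish fix for this particular KeyError is a column rename before season_code() is applied. A minimal sketch, where the DataFrame is a stand-in for the scraped history table and the lower-casing step mirrors how the scraped headers end up as "season" for domestic leagues:

```python
import pandas as pd

# Stand-in for the table scraped from the World Cup history page: FBref
# labels its first column "Year", where domestic leagues use "Season".
df = pd.DataFrame({"Year": ["2018", "2014"], "Squads": [32, 32]})

# Normalize the headers, then map the tournament variant onto the column
# name that the rest of read_seasons() expects.
df.columns = [c.lower() for c in df.columns]
df = df.rename(columns={"year": "season"})

print(df.columns.tolist())  # the "season" column now exists for season_code()
```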
KeyError Traceback (most recent call last)
~\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~\Miniconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
~\Miniconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'season'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28692/128004415.py in <module>
----> 1 team_season_stats = fbref.read_team_season_stats(stat_type="standard")
2 team_season_stats.head()
~\Documents\Python\Projects\3. Scraping\SoccerData (scrape multiple sites)\soccerdata\fbref.py in read_team_season_stats(self, stat_type, opponent_stats)
252
253 # get league IDs
--> 254 seasons = self.read_seasons()
255
256 # collect teams
~\Documents\Python\Projects\3. Scraping\SoccerData (scrape multiple sites)\soccerdata\fbref.py in read_seasons(self)
175 else:
176 df["league"] = "Big 5 European Leagues Combined"
--> 177 df["season"] = df["season"].apply(lambda x: season_code(x))
178 df = df.set_index(["league", "season"]).sort_index()
179 return df.loc[df.index.isin(itertools.product(self.leagues, self.seasons))]
~\Miniconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'season'
@probberechts commented on GitHub (Nov 15, 2022):
These are all easy fixes. You just have to adapt some class names / IDs / data attributes in the scraper's code and rename some columns.
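The attribute adaptation described here can be sketched with the standard library's html.parser (soccerdata's actual scraper uses a different HTML toolchain, so this is purely illustrative): accept both the domestic "year_id" and tournament "year" values of FBref's data-stat attribute when picking out the season cells.

```python
from html.parser import HTMLParser

class SeasonCellParser(HTMLParser):
    """Collect season cells, tolerating both FBref data-stat variants."""

    def __init__(self):
        super().__init__()
        self.seasons = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Domestic league pages use data-stat="year_id"; international
        # tournament pages use data-stat="year". Accept either.
        self._capture = tag == "th" and attrs.get("data-stat") in ("year_id", "year")

    def handle_data(self, data):
        if self._capture and data.strip():
            self.seasons.append(data.strip())
            self._capture = False

# Illustrative HTML mixing one row of each variant.
html = (
    '<tr><th data-stat="year">2018</th></tr>'
    '<tr><th data-stat="year_id">2021-2022</th></tr>'
)
parser = SeasonCellParser()
parser.feed(html)
print(parser.seasons)  # both variants are captured
```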
@coreyrastello commented on GitHub (Nov 23, 2022):
@philbywalsh Curious if you resolved this and are willing to share steps before I dig in myself.
@philbywalsh commented on GitHub (Nov 23, 2022):
Hi @coreyrastello - I've not yet had an opportunity I'm afraid. Wishing you good luck if you get there first :-)
@probberechts commented on GitHub (Nov 26, 2022):
I'm working on it.