mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-26 18:46:00 +03:00
[GH-ISSUE #787] [WhoScored] Cannot scrape WhoScored data - probably not able to retrieve the stage_id #166
Originally created by @meetdesai25 on GitHub (Jan 15, 2025).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/787
Describe the bug
I was scraping data for the 24-25 EPL season when I encountered this error. It's not retrieving any data.
Affected scrapers
This affects the following scrapers: WhoScored
Code example
Error message
[01/15/25 14:03:50] INFO Retrieving calendar for ENG-Premier League 2324 whoscored.py:371
[01/15/25 14:04:02] INFO [1/10] Retrieving fixtures for ENG-Premier League 2324 whoscored.py:400
[01/15/25 14:04:08] ERROR Error while scraping https://www.whoscored.com/tournaments/None/data/?d=202311. Retrying in 0 seconds... (attempt 1 of 5). _common.py:658
Traceback (most recent call last):
  File "/Users/meetdesai/Library/Python/3.9/lib/python/site-packages/soccerdata/_common.py", line 642, in _download_and_save
    raise Exception("Empty response.")
Exception: Empty response.
Contributor Action Plan
Potential Root Cause
I believe the root cause is in the read_season_stages() method: retrieving the stage_id fails on the following line:
stage_id = _parse_url(fixtures_url)["stage_id"]
Because the stage_id is not available, none of the subsequent method calls retrieve any data.
After I hard-coded a default value for the 2425 EPL season stage, it at least reads the schedule.
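To illustrate the failure mode, here is a minimal sketch, assuming the None in the logged URL is the unparsed stage_id being formatted straight into the request path. The helper name build_fixtures_data_url is hypothetical and not part of soccerdata; the URL shape is copied from the error message above.

```python
from typing import Optional


def build_fixtures_data_url(stage_id: Optional[int], month: str) -> str:
    """Hypothetical helper: build the monthly fixtures URL seen in the log."""
    if stage_id is None:
        # Fail early with a clear message instead of requesting
        # .../tournaments/None/data/... and getting an empty response.
        raise ValueError("stage_id could not be parsed from the fixtures URL")
    return f"https://www.whoscored.com/tournaments/{stage_id}/data/?d={month}"


# Without the guard, None is silently formatted into the URL,
# which reproduces the address from the error log.
broken_url = f"https://www.whoscored.com/tournaments/{None}/data/?d=202311"
```

A guard like this would not fix the parsing itself, but it would turn the five "Empty response" retries into one error that points at the stage_id.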
@earlk1 commented on GitHub (Jan 15, 2025):
I'm also getting the same error
@Messe57 commented on GitHub (Jan 15, 2025):
I am facing the same problem but for Bundesliga and La Liga.
I use a VPN because with my local language the library wasn't working. I don't know if it might help.
@probberechts commented on GitHub (Jan 15, 2025):
I think the fixtures_url changed. Does it work if you update the regexp in the _parse_url method like below:
@yureed commented on GitHub (Jan 15, 2025):
I had a similar issue. I tried your code, and I think it is missing the flag that makes the pattern case-insensitive. I think it would work if it were like the following. I tested it, and it has started retrieving fixtures and data for games.
@meetdesai25 commented on GitHub (Jan 16, 2025):
I tried your changes in the package file locally, but it gives the following error when reading the schedule:
@chrisdebo commented on GitHub (Jan 16, 2025):
The problem is that _parse_url has to parse two different URLs: the one from the fixtures link and the one from the dropdown menu.
The parser must therefore return a correct result for both URLs.
Here is my fix:
def _parse_url(url: str) -> dict:
    """Parse a URL from WhoScored.
@probberechts commented on GitHub (Jan 16, 2025):
This was actually a very easy fix. The URLs changed from https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/10316/Stages/23400/ to https://www.whoscored.com/regions/252/tournaments/2/seasons/10316/stages/23400/. They were simply capitalized before. Thanks for the hint @yureed
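For anyone patching locally, the case change can be handled roughly as follows. This is only a sketch: the pattern is modelled on the two URLs quoted above and is not the library's actual regex, and the function name parse_whoscored_url and the key names other than stage_id are assumptions.

```python
import re

# Assumed pattern, modelled on the two URLs quoted in this thread.
# re.IGNORECASE lets it match both "/Regions/.../Stages/" and "/regions/.../stages/".
_URL_PATTERN = re.compile(
    r"/regions/(?P<region_id>\d+)"
    r"/tournaments/(?P<league_id>\d+)"
    r"/seasons/(?P<season_id>\d+)"
    r"(?:/stages/(?P<stage_id>\d+))?",
    re.IGNORECASE,
)


def parse_whoscored_url(url: str) -> dict:
    """Hypothetical stand-in for _parse_url: extract the numeric IDs from a URL."""
    match = _URL_PATTERN.search(url)
    if match is None:
        raise ValueError(f"Unrecognised WhoScored URL: {url}")
    # Groups that did not participate in the match (e.g. no stage) stay None.
    return {
        key: int(value) if value is not None else None
        for key, value in match.groupdict().items()
    }


old_url = "https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/10316/Stages/23400/"
new_url = "https://www.whoscored.com/regions/252/tournaments/2/seasons/10316/stages/23400/"
old_ids = parse_whoscored_url(old_url)
new_ids = parse_whoscored_url(new_url)
```

With the flag set, both the old capitalized URL and the new lowercase URL yield the same IDs, so the stage_id lookup no longer depends on how the site capitalizes its paths.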