[GH-ISSUE #787] [WhoScored] Cannot scrape WhoScored data - probably not able to retrieve the stage_id #166

Closed
opened 2026-03-02 15:56:20 +03:00 by kerem · 7 comments
Owner

Originally created by @meetdesai25 on GitHub (Jan 15, 2025).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/787

Describe the bug
I was scraping data for the 24-25 EPL season when I encountered this error. It's not retrieving any data.

Affected scrapers
This affects the following scrapers:

  • WhoScored

Code example

import soccerdata as sd
ws = sd.WhoScored(leagues='ENG-Premier League', seasons='2024')
ws.read_schedule()

Error message
[01/15/25 14:03:50] INFO Retrieving calendar for ENG-Premier League 2324 whoscored.py:371
[01/15/25 14:04:02] INFO [1/10] Retrieving fixtures for ENG-Premier League 2324 whoscored.py:400
[01/15/25 14:04:08] ERROR Error while scraping _common.py:658
https://www.whoscored.com/tournaments/None/data/?d=202311.
Retrying in 0 seconds... (attempt 1 of 5).
Traceback (most recent call last):
  File "/Users/meetdesai/Library/Python/3.9/lib/python/site-packages/soccerdata/_common.py", line 642, in _download_and_save
    raise Exception("Empty response.")
Exception: Empty response.

Contributor Action Plan

  • I’m not able to fix this issue, but I might have found the root cause.

Potential Root Cause

I believe the root cause is in the read_season_stages() method: when it retrieves the stage_id, an error occurs on the following line:

stage_id = _parse_url(fixtures_url)["stage_id"]

Hence, from here on, no data is retrieved by the subsequent method calls because the stage_id is not available.

After hard-coding a default value for the 2425 EPL season stage, it at least reads the schedule.
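A minimal illustration (my own sketch; the URL template is inferred from the log above, not taken from the library source) of how a missing stage_id produces the tournaments/None request URL in the error message:

```python
# When _parse_url fails to extract an ID, f-string formatting happily
# interpolates None, producing the literal "None" in the request URL.
stage_id = None  # the value _parse_url effectively yields here
url = f"https://www.whoscored.com/tournaments/{stage_id}/data/?d=202311"
print(url)  # https://www.whoscored.com/tournaments/None/data/?d=202311
```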

kerem 2026-03-02 15:56:20 +03:00
Author
Owner

@earlk1 commented on GitHub (Jan 15, 2025):

I'm also getting the same error


@Messe57 commented on GitHub (Jan 15, 2025):

I am facing the same problem but for Bundesliga and La Liga.
I use a VPN because with my local language the library wasn't working. I don't know if it might help.


@probberechts commented on GitHub (Jan 15, 2025):

I think the fixtures_url changed. Does it work if you update the regexp in the _parse_url method like below:

def _parse_url(url: str) -> dict:
    """Parse a URL from WhoScored.

    Parameters
    ----------
    url : str
        URL to parse.

    Raises
    ------
    ValueError
        If the URL could not be parsed.

    Returns
    -------
    dict
    """
    patt = (
        r"^(?:https://www\.whoscored\.com)?/"
        r"(?:regions/(?P<region_id>\d+)/)?"
        r"(?:tournaments/(?P<league_id>\d+)/)?"
        r"(?:seasons/(?P<season_id>\d+)/)?"
        r"(?:stages/(?P<stage_id>\d+)|"
        r"matches/(?P<match_id>\d+))"
    )

    matches = re.search(patt, url)
    if matches:
        return {
            "region_id": matches.group("region_id"),
            "league_id": matches.group("league_id"),
            "season_id": matches.group("season_id"),
            "stage_id": matches.group("stage_id"),
            "match_id": matches.group("match_id"),
        }

    raise ValueError(f"Could not parse URL: {url}")

@yureed commented on GitHub (Jan 15, 2025):

> I think the fixtures_url changed. Does it work if you update the regexp in the _parse_url method like below:
>
> (the _parse_url code from the previous comment)

I had a similar issue. I tried your code, and I think it's missing the flag to make the pattern case-insensitive. With the following change, I think it would work. I tested it, and it has started retrieving fixtures and data for games.

patt = (
    r"(?i)^(?:https://www\.whoscored\.com)?/" 
    r"(?:regions/(?P<region_id>\d+)/)?"
    r"(?:tournaments/(?P<league_id>\d+)/)?"
    r"(?:seasons/(?P<season_id>\d+)/)?"
    r"(?:stages/(?P<stage_id>\d+)|"
    r"matches/(?P<match_id>\d+))"
)
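A quick standalone check (my own sketch, not part of the original thread) that the (?i) flag makes the pattern accept both the capitalized and the lowercase spelling of a stage URL:

```python
import re

# The case-insensitive pattern suggested above, checked against both
# the old capitalized and the new lowercase URL spelling.
patt = (
    r"(?i)^(?:https://www\.whoscored\.com)?/"
    r"(?:regions/(?P<region_id>\d+)/)?"
    r"(?:tournaments/(?P<league_id>\d+)/)?"
    r"(?:seasons/(?P<season_id>\d+)/)?"
    r"(?:stages/(?P<stage_id>\d+)|"
    r"matches/(?P<match_id>\d+))"
)

for url in (
    "/Regions/252/Tournaments/2/Seasons/10316/Stages/23400/",
    "/regions/252/tournaments/2/seasons/10316/stages/23400/",
):
    m = re.search(patt, url)
    print(m.group("stage_id"))  # 23400 in both cases
```

Note, however, that a URL without a stages/ or matches/ segment still fails to match, because that alternative is mandatory in this pattern; that is exactly the error reported in a later comment.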


@meetdesai25 commented on GitHub (Jan 16, 2025):

> > I think the fixtures_url changed. Does it work if you update the regexp in the _parse_url method like below:
> >
> > (the _parse_url code from @probberechts' comment above)
>
> I had a similar issue. I tried your code, and I think it's missing the flag to make the pattern case-insensitive. I tested it, and it has started retrieving fixtures and data for games.
>
> (the (?i) pattern from @yureed's comment above)

I tried your changes in the package file locally, but it's giving the following error when reading the schedule:

ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 schedule = ws.read_schedule()
      2 schedule.head()

File ~/Library/Python/3.9/lib/python/site-packages/soccerdata/whoscored.py:348, in WhoScored.read_schedule(self, force_cache)
    335 def read_schedule(self, force_cache: bool = False) -> pd.DataFrame:
    336     """Retrieve the game schedule for the selected leagues and seasons.
    337 
    338     Parameters
   (...)
    346     pd.DataFrame
    347     """
--> 348     df_season_stages = self.read_season_stages(force_cache=force_cache)
    349     filemask_schedule = "matches/{}_{}_{}_{}.json"
    351     all_schedules = []

File ~/Library/Python/3.9/lib/python/site-packages/soccerdata/whoscored.py:278, in WhoScored.read_season_stages(self, force_cache)
    265 def read_season_stages(self, force_cache: bool = False) -> pd.DataFrame:
    266     """Retrieve the season stages for the selected leagues.
    267 
    268     Parameters
   (...)
    276     pd.DataFrame
    277     """
--> 278     df_seasons = self.read_seasons()
    279     filemask = "seasons/{}_{}.html"
    281     season_stages = []

File ~/Library/Python/3.9/lib/python/site-packages/soccerdata/whoscored.py:247, in WhoScored.read_seasons(self)
    244     for node in tree.xpath("//select[contains(@id,'seasons')]/option"):
    245         # extract team IDs from links
    246         season_url = node.get("value")
--> 247         season_id = _parse_url(season_url)["season_id"]
    248         seasons.append(
    249             {
    250                 "league": lkey,
   (...)
    255             }
    256         )
    258 return (
    259     pd.DataFrame(seasons)
    260     .set_index(["league", "season"])
    261     .sort_index()
    262     .loc[itertools.product(self.leagues, self.seasons)]
    263 )

File ~/Library/Python/3.9/lib/python/site-packages/soccerdata/whoscored.py:110, in _parse_url(url)
    101     print('yes')
    102     return {
    103         "region_id": matches.group("region_id"),
    104         "league_id": matches.group("league_id"),
   (...)
    107         "match_id": matches.group("match_id"),
    108     }
--> 110 raise ValueError(f"Could not parse URL: {url}")

ValueError: Could not parse URL: /Regions/252/Tournaments/2/Seasons/10316/England-Premier-League
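For context (a quick standalone check of my own, not from the original thread): the failing URL above has no stages/ or matches/ segment at all, so case-insensitivity alone cannot fix it, because that alternative is mandatory in the (?i) pattern:

```python
import re

# The (?i) pattern from the earlier comment still requires a trailing
# stages/<id> or matches/<id> segment, which this season-dropdown URL
# lacks, so the search finds no match and _parse_url raises ValueError.
patt = (
    r"(?i)^(?:https://www\.whoscored\.com)?/"
    r"(?:regions/(?P<region_id>\d+)/)?"
    r"(?:tournaments/(?P<league_id>\d+)/)?"
    r"(?:seasons/(?P<season_id>\d+)/)?"
    r"(?:stages/(?P<stage_id>\d+)|"
    r"matches/(?P<match_id>\d+))"
)

url = "/Regions/252/Tournaments/2/Seasons/10316/England-Premier-League"
print(re.search(patt, url))  # None
```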

@chrisdebo commented on GitHub (Jan 16, 2025):

The problem is that _parse_url has to parse two different kinds of URLs: the URL from the fixtures link and the URL from the season dropdown menu.

The parser must therefore return a correct result for both URLs.

Here is my fix:

def _parse_url(url: str) -> dict:
    """Parse a URL from WhoScored.

    Parameters
    ----------
    url : str
        URL to parse.

    Raises
    ------
    ValueError
        If the URL could not be parsed.

    Returns
    -------
    dict
    """
    patt = re.compile(
        r"^(?:https?:\/\/(?:www\.)?whoscored\.com)?\/"
        r"(?:regions\/(?P<region_id>\d+)\/)?"
        r"(?:tournaments\/(?P<league_id>\d+)\/)?"
        r"(?:seasons\/(?P<season_id>\d+)\/?)?"
        r"(?:(?:stages\/(?P<stage_id>\d+))|"
        r"(?:matches\/(?P<match_id>\d+)))?"
        r"(?:\/[^\?]*)?$",
        re.IGNORECASE,
    )

    matches = patt.match(url)

    if not matches:
        raise ValueError(f"Could not parse URL: {url}")

    return {
        "region_id": matches.group("region_id"),
        "league_id": matches.group("league_id"),
        "season_id": matches.group("season_id"),
        "stage_id": matches.group("stage_id"),
        "match_id": matches.group("match_id"),
    }
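A quick standalone check (my own sketch, using an unescaped but equivalent spelling of the same regex) that this combined pattern handles both URL shapes from this thread, the lowercase stage URL and the capitalized season-dropdown URL that previously raised ValueError:

```python
import re

# Same pattern as in the fix above: optional stages/matches segment plus
# an optional trailing path, compiled with re.IGNORECASE.
patt = re.compile(
    r"^(?:https?://(?:www\.)?whoscored\.com)?/"
    r"(?:regions/(?P<region_id>\d+)/)?"
    r"(?:tournaments/(?P<league_id>\d+)/)?"
    r"(?:seasons/(?P<season_id>\d+)/?)?"
    r"(?:(?:stages/(?P<stage_id>\d+))|"
    r"(?:matches/(?P<match_id>\d+)))?"
    r"(?:/[^?]*)?$",
    re.IGNORECASE,
)

m1 = patt.match("/regions/252/tournaments/2/seasons/10316/stages/23400/")
m2 = patt.match("/Regions/252/Tournaments/2/Seasons/10316/England-Premier-League")
print(m1.group("stage_id"))                         # 23400
print(m2.group("season_id"), m2.group("stage_id"))  # 10316 None
```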

@probberechts commented on GitHub (Jan 16, 2025):

This was actually a very easy fix. The URLs changed from https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/10316/Stages/23400/ to https://www.whoscored.com/regions/252/tournaments/2/seasons/10316/stages/23400/. They were simply capitalized before. Thanks for the hint @yureed

Reference
starred/soccerdata#166