[GH-ISSUE #93] [WhoScored] Schedule keeps scraping the same dates #20

Open
opened 2026-03-02 15:55:04 +03:00 by kerem · 2 comments

Originally created by @aegonwolf on GitHub (Oct 16, 2022).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/93

Hello,
Thank you for this wonderful package!
I am using python 3.10.6 and the latest version of soccerdata.

I am not sure whether this is a bug or expected behaviour; I apologize in advance if it is the latter.
The following code:

ws = soccerdata.WhoScored(leagues=['ENG-Premier League'], proxy={
    "http": "socks5://127.0.0.1:9150",
    "https": "socks5://127.0.0.1:9150",
}, seasons='22-23', headless=False)
#%%
schedule = ws.read_schedule()

keeps printing `Scraping game schedule for date...` for the same dates over and over, and it iterates in random order, i.e. the dates are not sequential: Sept 30, then Oct 5, then Oct 1, then Oct 30 (which is in the future), etc.
Is this expected?

There is no error.

Edit:
Might be related.
When trying to scrape a particular game, the following error occurs, again after printing the same schedule-scraping info messages:

whoscored.py:568, in WhoScored.read_events(self, match_id, force_cache, live, output_fmt)
    565 urlmask = WHOSCORED_URL + "/Matches/{}/Live"
    566 filemask = "events/{}_{}/{}.json"
--> 568 df_schedule = self.read_schedule(force_cache).reset_index()
    569 if match_id is not None:
    570     iterator = df_schedule[
    571         df_schedule.game_id.isin([match_id] if isinstance(match_id, int) else match_id)
    572     ]

whoscored.py:326, in WhoScored.read_schedule(self, force_cache)
    324         self._driver.get(summary_nav.get_attribute("href"))
    325     logger.info("Scraping game schedule from %s", url)
--> 326     schedule.extend(self._parse_schedule())
    328 # Cache the data
    329 df_schedule = pd.DataFrame(schedule).assign(league=lkey, season=skey)

whoscored.py:241, in WhoScored._parse_schedule(self, stage)
    239 schedule = []
    240 # Parse first page
--> 241 page_schedule, next_page = self._parse_schedule_page()
    242 schedule.extend(page_schedule)
    243 # Go to next page

whoscored.py:219, in WhoScored._parse_schedule_page(self)
    213     time_str = node.find_element(By.XPATH, time_selector).get_attribute("textContent")
    214     match_url = node.find_element(By.XPATH, result_selector).get_attribute("href")
    215     schedule_page.append(
    216         {
    217             "date": datetime.strptime(f"{date_str} {time_str}", "%A, %b %d %Y %H:%M"),
    218             "home_team": node.find_element(By.XPATH, home_team_selector).text,
--> 219             "away_team": node.find_element(By.XPATH, away_team_selector).text,
    220             "game_id": match_id,
    221             "url": match_url,
    222         }
    223     )
    224 else:
    225     date_str = node.find_element(By.XPATH, date_selector).text

AttributeError: 'NoneType' object has no attribute 'text'
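The failing line dereferences `.text` on a value that is `None`, i.e. the away-team lookup returned nothing for at least one fixture node. A None-safe lookup pattern would avoid the crash (this is a hypothetical sketch, not soccerdata's actual code; `FakeElement`/`FakeNode` are stand-ins for Selenium `WebElement` objects so the example runs without a browser):

```python
def safe_text(node, selector):
    """Return the matched element's text, or None if the lookup fails
    or yields no element, instead of raising AttributeError."""
    try:
        element = node.find_element(selector)
    except Exception:  # real Selenium raises NoSuchElementException
        return None
    return element.text if element is not None else None


class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeNode:
    """Minimal stand-in for a WebElement; missing selectors yield None,
    mimicking the situation in the traceback above."""
    def __init__(self, elements):
        self._elements = elements

    def find_element(self, selector):
        return self._elements.get(selector)


node = FakeNode({"home": FakeElement("Arsenal")})
print(safe_text(node, "home"))  # Arsenal
print(safe_text(node, "away"))  # None, instead of an AttributeError
```

Rows with a missing home or away team could then be skipped (or logged) rather than aborting the whole schedule parse.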

@probberechts commented on GitHub (Oct 17, 2022):

I've just tested it and everything seems to work fine. I have absolutely no clue why it would iterate at random over the schedule. In the expected flow, the scraper should go to the [league page](https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/9075/England-Premier-League), click on [fixtures](https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/9075/Stages/20934/Fixtures/England-Premier-League-2022-2023) in the menu and then cycle back from the current month to the first month of the season.

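The expected "cycle back from the current month to the first month of the season" can be sketched abstractly (a hypothetical illustration, not the scraper's actual code; the August season start is an assumption):

```python
from datetime import date


def months_to_scrape(current, season_start):
    """Yield (year, month) pairs stepping backwards from `current`
    to `season_start`, inclusive, in the order the scraper would
    visit the monthly fixture pages."""
    y, m = current.year, current.month
    while (y, m) >= (season_start.year, season_start.month):
        yield (y, m)
        m -= 1
        if m == 0:
            y, m = y - 1, 12


# From mid-October 2022 back to an assumed August 2022 season start:
print(list(months_to_scrape(date(2022, 10, 16), date(2022, 8, 1))))
# [(2022, 10), (2022, 9), (2022, 8)]
```

Each month would be visited exactly once, so repeated log lines for the same dates suggest the month navigation is not advancing as intended.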
The "AttributeError" seems to suggest that there was a game without an away team in the schedule. Maybe it was a bug on the WhoScored website that was resolved by now? Could you try to run the code again?


@aegonwolf commented on GitHub (Oct 17, 2022):

Hmm, I still get the same error for the 22-23 season. It works fine for earlier seasons until I get blocked (no worries, I found a workaround). I was wondering whether it might be retrying several processes in parallel and the IP addresses are simply getting blocked.
