mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 18:15:58 +03:00
[GH-ISSUE #366] [WhoScored] ConnectionError: Could not download https://www.whoscored.com. #69
Originally created by @ds-oliver on GitHub (Sep 13, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/366
I think the logs should have all necessary info to cite this issue.
Imports:
import tqdm
from pathlib import Path
import soccerdata as sd
from socceraction.data.opta import OptaLoader
import socceraction.spadl as spadl
import pandas as pd
import datetime
import os
import warnings
import pickle
import socceraction.atomic.spadl as atomicspadl
import zipfile
from io import BytesIO
from urllib.request import urlretrieve
Code:
`# Initialize the WhoScored object
ws = sd.WhoScored(
    leagues=["ENG-Premier League"],
    seasons=2223,
    headless=True
)
api = ws.read_events(output_fmt='loader')`
Traceback:
ConnectionError Traceback (most recent call last)
/Users/hogan/soccerdata/scrape.ipynb Cell 2 line 8
1 # Initialize the WhoScored object
2 ws = sd.WhoScored(
3 leagues=["ENG-Premier League"],
4 seasons=2223,
5 headless=True
6 )
----> 8 api = ws.read_events(output_fmt='loader')
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:667, in WhoScored.read_events(self, match_id, force_cache, live, output_fmt)
664 urlmask = WHOSCORED_URL + "/Matches/{}/Live"
665 filemask = "events/{}_{}/{}.json"
--> 667 df_schedule = self.read_schedule(force_cache).reset_index()
668 if match_id is not None:
669 iterator = df_schedule[
670 df_schedule.game_id.isin([match_id] if isinstance(match_id, int) else match_id)
671 ]
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:370, in WhoScored.read_schedule(self, force_cache)
357 def read_schedule(self, force_cache: bool = False) -> pd.DataFrame:
358 """Retrieve the game schedule for the selected leagues and seasons.
359
360 Parameters
(...)
368 pd.DataFrame
369 """
--> 370 df_seasons = self.read_seasons()
371 filemask = "matches/{}_{}.csv"
373 all_schedules = []
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:246, in WhoScored.read_seasons(self)
239 def read_seasons(self) -> pd.DataFrame:
240 """Retrieve the selected seasons for the selected leagues.
241
242 Returns
243 -------
244 pd.DataFrame
245 """
--> 246 df_leagues = self.read_leagues()
248 seasons = []
249 for lkey, league in df_leagues.iterrows():
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:212, in WhoScored.read_leagues(self)
210 url = WHOSCORED_URL
211 filepath = self.data_dir / "tiers.json"
--> 212 reader = self.get(url, filepath, var="allRegions")
214 data = json.load(reader)
216 leagues = []
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/_common.py:132, in BaseReader.get(self, url, filepath, max_age, no_cache, var)
130 if no_cache or self.no_cache or not is_cached:
131 logger.debug("Scraping %s", url)
--> 132 return self._download_and_save(url, filepath, var)
133 logger.debug("Retrieving %s from cache", url)
134 assert filepath is not None
File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/_common.py:452, in BaseSeleniumReader._download_and_save(self, url, filepath, var)
449 self._driver = self._init_webdriver()
450 continue
--> 452 raise ConnectionError("Could not download %s." % url)
ConnectionError: Could not download https://www.whoscored.com/.
Edit to add context/files:
I have since tried running the scraper over Tor using ='Tor' and by defining the proxies as a dict.
https://github.com/probberechts/soccerdata/assets/77216918/13aafeb1-2e64-4dac-b115-0799c93e1afb
error.log
@OnlineAnalytics commented on GitHub (Sep 20, 2023):
Unfortunately, it doesn’t look like they’ll do anything to try and fix it. Will need to find another means of scraping.
@aegonwolf commented on GitHub (Sep 27, 2023):
Hmm, I do get this now too.
@aegonwolf commented on GitHub (Sep 27, 2023):
I think "they" is a single person, and this is not necessarily a helpful comment. People have work and lives, and we enjoy an awesome free package that the author has spent a lot of time and effort building.
@OnlineAnalytics commented on GitHub (Sep 27, 2023):
I know it's a single person. Hence me using the singular pronoun. You don't really need to try and start drama where there isn't any.
@probberechts commented on GitHub (Sep 27, 2023):
I do not have this issue, so I am unable to fix it as I would have no way to verify it.
It looks like WhoScored does a security check. I do not know why it does it, but here are two options:
What happens after the "verifying..." message? Does it show a captcha, or does it simply redirect to the WhoScored webpage? In the latter case, a straightforward solution could be to check whether the current page contains the text "checking if the site connection is secure" and wait until it redirects before proceeding. You can add that after this line.
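For anyone who wants to try this, here is a minimal sketch of such a wait. It is not code from soccerdata: the function name, the timeout values, and the idea of passing in a callable that returns the page source are all assumptions made for illustration, and the exact wording of the interstitial on the live site may differ from the text quoted above.

```python
import time
from typing import Callable

# Text quoted in the comment above; the wording on the live page is an assumption.
SECURITY_CHECK_TEXT = "checking if the site connection is secure"


def wait_for_security_check(
    get_page_source: Callable[[], str],
    timeout: float = 30.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the security-check text disappears from the page.

    Returns True once the text is gone and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Compare case-insensitively, since the page may capitalize the message.
        if SECURITY_CHECK_TEXT not in get_page_source().lower():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


# Hypothetical usage with a Selenium driver (not executed here):
#     if not wait_for_security_check(lambda: driver.page_source):
#         raise ConnectionError("Security check did not clear in time.")
```

Taking a callable instead of a driver keeps the helper independent of Selenium, so it can be tried against canned page sources first. Note that it only helps if the check redirects on its own; it does nothing for a captcha.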
@ds-oliver commented on GitHub (Sep 27, 2023):
Funny you should mention adding a wait period. I actually had already done so...

Any other suggestions?
@TimelessUsername commented on GitHub (Sep 29, 2023):
Running with headless=False (while on Selenium 4.12 or earlier) does the trick.
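A rough sketch of that workaround, reusing the arguments from the original report. The helper names and the choice to raise on a newer Selenium are assumptions for illustration only; the 4.12 ceiling comes from the comment above and has not been verified independently.

```python
from importlib.metadata import version as installed_version
from typing import Tuple


def selenium_at_most(version: str, ceiling: Tuple[int, int] = (4, 12)) -> bool:
    """Return True if a 'major.minor[.patch]' version string is <= ceiling."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) <= ceiling


def build_reader():
    """Create a WhoScored reader that opens a visible browser window."""
    # Imported lazily so the version helper works without soccerdata installed.
    import soccerdata as sd

    current = installed_version("selenium")
    if not selenium_at_most(current):
        # Assumption: treat newer Selenium releases as unsupported for this workaround.
        raise RuntimeError(f"Selenium {current} is newer than 4.12")

    return sd.WhoScored(
        leagues=["ENG-Premier League"],
        seasons=2223,
        headless=False,  # the only change from the original report
    )


# Hypothetical usage (needs soccerdata, a browser, and network access):
#     ws = build_reader()
#     api = ws.read_events(output_fmt="loader")
```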
@ds-oliver commented on GitHub (Oct 1, 2023):
@TimelessUsername @probberechts
This has solved the issue. You have been a huge help @TimelessUsername.
@OnlineAnalytics I'm tagging you so that you can see the resolution, and I hope you can see that this is how most issues get resolved in open-source projects such as this one. The project relies on the whole community, not just the author, to resolve complicated issues. This is how the technology improves, and now that we have found a workaround, @probberechts can spend his valuable time patching instead of testing.
Thanks all. Closing this now. :)