mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 10:05:53 +03:00
[GH-ISSUE #396] Sporadic re-attempts that initiate new webdriver instances eventually cause performance issues #78
Originally created by @TimelessUsername on GitHub (Oct 5, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/396
Hi,
I'm not 100% sure on all of this yet and further investigation is required, but at least when running multiple concurrent scraping instances, the re-attempts that initiate new webdriver instances will eventually cause "ghost" instances of Google Chrome to run in the background and borderline freeze the PC with 100% CPU usage. It might be related to running the code with headless=False, as WhoScored currently requires, or something else entirely. It is probably not caused by running multiple concurrent processes, but that is where the problem becomes most apparent. Further information will come when I investigate more, but if anyone else has had these issues, at least there is a note of them now. This might not be a soccerdata issue at all (or not only), but I will also try to work out a fix. Fixing the issue manually every few hours is easy if you just kill the processes, but long scraping sessions currently end with a near-frozen PC for me.
@probberechts commented on GitHub (Oct 14, 2023):
I just would like to point out a few things here.
Running multiple concurrent scraping instances is not supported. Moreover, I am a strong advocate of scraping responsibly, so I do my best to respect each website's scraping policies. For example, FBref only allows up to 20 requests per minute. SoccerData respects this by implementing a delay between requests. If you run multiple concurrent instances, you are no longer respecting this limit. Overloading a site also makes the user experience worse for everyone using it (think: slow response times) and gives all forms of web scraping a bad reputation.
European law allows scraping web data as long as (a) you don't scrape a 'substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database' and you don't re-use it (meaning, basically, selling or publishing it); or (b) the scraping falls under the TDM (text and data mining) exception; or (c) you have received an appropriate license. If you scrape data for multiple hours, you are probably violating the first clause.
A webdriver instance is always properly closed before a new one is initialized, so I do not really see where these ghost instances would originate from.
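The quit-before-reinit pattern being described can be sketched as follows (hypothetical `DriverManager` name and `driver_factory` parameter — an illustration under stated assumptions, not soccerdata's actual code; the `atexit` hook additionally guards against leaking a browser process when the program errors out mid-scrape):

```python
import atexit
import contextlib


class DriverManager:
    """Hypothetical wrapper that guarantees the previous webdriver is
    quit before a new one is created, and registers an atexit hook so
    the last driver is also cleaned up on interpreter shutdown."""

    def __init__(self, driver_factory):
        self._factory = driver_factory  # e.g. a callable returning a Selenium driver
        self._driver = None
        atexit.register(self.close)

    def restart(self):
        # Quit the previous instance first, so a re-attempt can never
        # leave a ghost browser process behind.
        self.close()
        self._driver = self._factory()
        return self._driver

    def close(self):
        if self._driver is not None:
            # Suppress errors from quit(): a driver whose browser has
            # already died should not prevent cleanup of our reference.
            with contextlib.suppress(Exception):
                self._driver.quit()
            self._driver = None
```

If `quit()` is only called on the happy path, an exception between driver creations is one plausible way orphaned Chrome processes could accumulate.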
github.com/probberechts/soccerdata@8303840b0f/soccerdata/_common.py (L403-L404)
@TimelessUsername commented on GitHub (Oct 14, 2023):
I'm quite confident I have not done anything illegal, so that aside: I have not been able to reliably reproduce the issue, but I can confirm that it is very much present with a single instance. More info will come when I'm able to pinpoint the issue better.
Edit: It is really confusing how this happens, precisely because of the above, but it might be related to the code erroring out, or to the program going through a list of years, for example. These ghost instances often seem to pop up while loading things purely from memory.