mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 10:05:53 +03:00
[GH-ISSUE #396] Sporadic re-attempts that initiate new webdriver instances eventually cause performance issues #78
Originally created by @TimelessUsername on GitHub (Oct 5, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/396
Hi,
I'm not 100% sure on all of this yet and further investigation is required, but at least when running multiple concurrent scraping instances, the re-attempts that initiate new webdriver instances will eventually cause "ghost" instances of Google Chrome to run in the background and borderline freeze the PC with 100% CPU usage. It might be related to running the code with headless=False, as WhoScored currently requires, or something else entirely. It is probably not caused by running multiple concurrent processes, but that is where the problem becomes most apparent. Further information will come when I investigate more, but if anyone else has had these issues, at least there is a note of them now. This might not be a soccerdata issue at all (or not only), but I will also try to work out a fix. Fixing the issue manually every few hours is easy if you just kill the processes, but long scraping sessions currently end with a near-frozen PC for me.
@probberechts commented on GitHub (Oct 14, 2023):
I just would like to point out a few things here.
Running multiple concurrent scraping instances is not supported. Moreover, I am a strong advocate of scraping responsibly, so I do my best to respect each website's scraping policies. For example, FBref only allows up to 20 requests per minute. SoccerData respects this by implementing a delay between requests. If you run multiple concurrent instances, you are no longer respecting this limit. Overloading a site also makes the user experience worse for everyone using it (think: slow response times) and gives all forms of web scraping a bad reputation.
European law allows scraping web data as long as (a) you don't scrape a 'substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database' and you don't re-use it (meaning, basically, selling or publishing it); or (b) the scraping falls under the TDM (text and data mining) exception; or (c) you have received an appropriate license. If you scrape data for multiple hours, you are probably violating the first clause.
A webdriver instance is always properly closed before a new one is initialized, so I do not really see where these ghost instances would originate from.
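The quit-before-reinit pattern being described can be sketched as follows (hypothetical `DriverManager` name and `driver_factory` parameter — an illustration under stated assumptions, not soccerdata's actual code; the `atexit` hook additionally guards against leaking a browser process when the program errors out mid-scrape):

```python
import atexit
import contextlib


class DriverManager:
    """Hypothetical wrapper that guarantees the previous webdriver is
    quit before a new one is created, and registers an atexit hook so
    the last driver is also cleaned up on interpreter shutdown."""

    def __init__(self, driver_factory):
        self._factory = driver_factory  # e.g. a callable returning a Selenium driver
        self._driver = None
        atexit.register(self.close)

    def restart(self):
        # Quit the previous instance first, so a re-attempt can never
        # leave a ghost browser process behind.
        self.close()
        self._driver = self._factory()
        return self._driver

    def close(self):
        if self._driver is not None:
            # Suppress errors from quit(): a driver whose browser has
            # already died should not prevent cleanup of our reference.
            with contextlib.suppress(Exception):
                self._driver.quit()
            self._driver = None
```

If `quit()` is only called on the happy path, an exception between driver creations is one plausible way orphaned Chrome processes could accumulate.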
github.com/probberechts/soccerdata@8303840b0f/soccerdata/_common.py (L403-L404)
@TimelessUsername commented on GitHub (Oct 14, 2023):
I'm quite confident I have not done anything illegal, so that aside: I have not been able to reliably reproduce the issue, but I can confirm that it is very much present with a single instance. More info will come when I'm able to pinpoint the issue better.
Edit: It is really confusing how this happens, precisely because of the above, but it might be related to the code erroring out, or to the program going through a list of years, for example. These ghost instances often seem to pop up while loading things purely from memory.