[GH-ISSUE #310] [WhoScored] Error when running scraper with Tor #59

Closed
opened 2026-03-02 15:55:24 +03:00 by kerem · 7 comments

Originally created by @petmo on GitHub (Jul 27, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/310

Which Python version are you using: Python 3.11.4
Which version of soccerdata are you using? 1.4.0
Note: Actually copied and installed the full soccerdata poetry.lock file into my env so versions should be identical

Running on OSX Ventura

The issue:

import soccerdata as sd
if __name__ == '__main__':
    ws = sd.WhoScored(proxy='tor')
    epl_schedule = ws.read_schedule()
    print(epl_schedule.head())

Gives

[07/27/23 17:56:25] ERROR    Error while scraping https://www.whoscored.com. Retrying... (attempt 3 of 5).                                                                                     _common.py:446
                             Traceback (most recent call last):                                                                                                                                              
                               File "/.../ph-env/lib/python3.11/site-packages/soccerdata/_common.py", line 437, in _download_and_save                       
                                 response = json.dumps(self._driver.execute_script("return " + var)).encode(                                                                                                 
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                          
                               File "/.../ph-env/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 406, in                         
                             execute_script                                                                                                                                                                  
                                 return self.execute(command, {"script": script, "args": converted_args})["value"]                                                                                           
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                    
                               File "/.../ph-env/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 346, in execute                 
                                 self.error_handler.check_response(response)                                                                                                                                 
                               File "/.../ph-env/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in                      
                             check_response                                                                                                                                                                  
                                 raise exception_class(message, screen, stacktrace)                                                                                                                          
                             selenium.common.exceptions.JavascriptException: Message: javascript error: allRegions is not defined                                                                            
                               (Session info: chrome=115.0.5790.114)                                                                                                                                         
                             Stacktrace:                                                                                                                                                                     
                             0   undetected_chromedriver             0x00000001054ad6b8 undetected_chromedriver + 4937400                                                                                    
                             1   undetected_chromedriver             0x00000001054a4b73 undetected_chromedriver + 4901747                                                                                    
                             2   undetected_chromedriver             0x0000000105062616 undetected_chromedriver + 435734                                                                                     
                             3   undetected_chromedriver             0x0000000105067adf undetected_chromedriver + 457439                                                                                     
                             4   undetected_chromedriver             0x000000010506a9bf undetected_chromedriver + 469439                                                                                     
                             5   undetected_chromedriver             0x00000001050e5bce undetected_chromedriver + 973774                                                                                     
                             6   undetected_chromedriver             0x00000001050ca012 undetected_chromedriver + 860178                                                                                     
                             7   undetected_chromedriver             0x00000001050e4e76 undetected_chromedriver + 970358                                                                                     
                             8   undetected_chromedriver             0x00000001050c9de3 undetected_chromedriver + 859619                                                                                     
                             9   undetected_chromedriver             0x0000000105097d7f undetected_chromedriver + 654719                                                                                     
                             10  undetected_chromedriver             0x00000001050990de undetected_chromedriver + 659678                                                                                     
                             11  undetected_chromedriver             0x00000001054692ad undetected_chromedriver + 4657837                                                                                    
                             12  undetected_chromedriver             0x000000010546e130 undetected_chromedriver + 4677936                                                                                    
                             13  undetected_chromedriver             0x0000000105474def undetected_chromedriver + 4705775                                                                                    
                             14  undetected_chromedriver             0x000000010546f05a undetected_chromedriver + 4681818                                                                                    
                             15  undetected_chromedriver             0x000000010544192c undetected_chromedriver + 4495660                                                                                    
                             16  undetected_chromedriver             0x000000010548c838 undetected_chromedriver + 4802616                                                                                    
                             17  undetected_chromedriver             0x000000010548c9b7 undetected_chromedriver + 4802999                                                                                    
                             18  undetected_chromedriver             0x000000010549d99f undetected_chromedriver + 4872607                                                                                    
                             19  libsystem_pthread.dylib             0x00007ff81c1031d3 _pthread_start + 125                                                                                                 
                             20  libsystem_pthread.dylib             0x00007ff81c0febd3 thread_start + 15      
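
The `JavascriptException` above is raised because the page that loaded never defined `allRegions`; the traceback shows the scraper running `execute_script("return " + var)` directly. As a rough diagnostic sketch (not the library's own code), the lookup can be guarded with a `typeof` check so that a missing variable comes back as `None` instead of raising. The helper name `read_js_variable` is hypothetical, and `driver` is assumed to be any Selenium-style object exposing `execute_script`.

```python
def read_js_variable(driver, name):
    """Return a global JS variable's value, or None if the page never defined it.

    `driver` is assumed to be a Selenium-style WebDriver (anything with an
    `execute_script(script)` method). This helper is illustrative only.
    """
    # Refuse anything that is not a plain identifier so the name cannot
    # smuggle extra JavaScript into the script string.
    if not name.isidentifier():
        raise ValueError(f"not a plain variable name: {name!r}")
    # `typeof` is safe to use on undeclared names, unlike a bare reference,
    # which is what triggers "allRegions is not defined".
    script = f"return (typeof {name} !== 'undefined') ? {name} : null;"
    return driver.execute_script(script)
```

If `read_js_variable(driver, "allRegions")` returns `None`, the browser is most likely showing something other than the expected WhoScored page (for example a challenge page), which is worth confirming by running with `headless=False`.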

Tor is installed and running according to documentation in a separate window:

Jul 27 17:38:02.345 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://support.torproject.org/faq/staying-anonymous/
Jul 27 17:38:02.345 [notice] Configuration file "/opt/homebrew/etc/tor/torrc" not present, using reasonable defaults.
Jul 27 17:38:02.347 [notice] Opening Socks listener on 127.0.0.1:9050
Jul 27 17:38:02.347 [notice] Opened Socks listener connection (ready) on 127.0.0.1:9050
Jul 27 17:38:02.000 [notice] Parsing GEOIP IPv4 file /opt/homebrew/Cellar/tor/0.4.7.13_1/share/tor/geoip.
Jul 27 17:38:02.000 [notice] Parsing GEOIP IPv6 file /opt/homebrew/Cellar/tor/0.4.7.13_1/share/tor/geoip6.
Jul 27 17:38:02.000 [notice] Bootstrapped 0% (starting): Starting
Jul 27 17:38:02.000 [notice] Starting with guard context "default"
Jul 27 17:38:03.000 [notice] Bootstrapped 5% (conn): Connecting to a relay
Jul 27 17:38:03.000 [notice] Bootstrapped 10% (conn_done): Connected to a relay
Jul 27 17:38:03.000 [notice] Bootstrapped 14% (handshake): Handshaking with a relay
Jul 27 17:38:05.000 [notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
Jul 27 17:38:05.000 [notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits
Jul 27 17:38:05.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits
Jul 27 17:38:05.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Jul 27 17:38:05.000 [notice] Bootstrapped 100% (done): Done
Jul 27 17:58:37.000 [notice] Our IP address has changed.  Rotating keys...
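
The log shows Tor opening its SOCKS listener on 127.0.0.1:9050. A quick way to double-check that from Python is a plain TCP connection attempt, sketched below with the standard library only. The function name `socks_port_is_open` is made up for this example, and a successful connection only shows that something is listening on the port; it says nothing about whether the current exit node is blocked by the target site.

```python
import socket


def socks_port_is_open(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `socks_port_is_open()` with the defaults probes the address and port shown in the log above.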

Note that it works fine without the Tor proxy.

kerem 2026-03-02 15:55:24 +03:00
  • closed this issue
  • added the WhoScored label

@probberechts commented on GitHub (Jul 27, 2023):

Can you try to run the code below and check what happens in your browser window? Does it say that your IP is blocked?

import soccerdata as sd
ws = sd.WhoScored("ENG-Premier League", "2223", proxy='tor', headless=False, no_cache=True)
leagues = ws.read_leagues()

@petmo commented on GitHub (Jul 27, 2023):

Thanks for the quick help.

Tried that, but got the same error. It says nothing about the IP being blocked, just the error message above.

It opens a bunch of Chrome windows that seem to get a captcha prompt. Is this expected?

[Image: Screenshot 2023-07-27 at 18 57 28 (https://github.com/probberechts/soccerdata/assets/26343993/b1e62766-ef8a-45f0-b52d-3b69f3328e07)]

@probberechts commented on GitHub (Jul 27, 2023):

No, that's not expected. It looks like WhoScored has blacklisted the IP of your Tor exit node. You can try a different exit node (see https://stackoverflow.com/questions/1969958/how-to-change-the-tor-exit-node-programmatically-to-get-a-new-ip) or use a different proxy.
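
One way to ask Tor for a different circuit is the `SIGNAL NEWNYM` command of its control protocol. The sketch below uses only the standard library and rests on several assumptions: that `ControlPort 9051` has been enabled in `torrc` (the log in the report shows Tor running with defaults, where no control port is opened), that authentication is either disabled or password-based, and that the function names are hypothetical. `NEWNYM` is rate-limited by Tor and does not guarantee a different exit IP, so treat this as something to try rather than a fix.

```python
import socket


def build_newnym_commands(password=None):
    """Return the control-port lines that authenticate and request new circuits."""
    if password is None:
        # Valid only when the control port has no authentication configured.
        auth = "AUTHENTICATE"
    else:
        # The control protocol expects a quoted string; escape \ and ".
        escaped = password.replace("\\", "\\\\").replace('"', '\\"')
        auth = f'AUTHENTICATE "{escaped}"'
    lines = (auth, "SIGNAL NEWNYM", "QUIT")
    return [f"{line}\r\n".encode("utf-8") for line in lines]


def request_new_identity(host="127.0.0.1", port=9051, password=None, timeout=5.0):
    """Send the commands to Tor's control port and return its reply lines.

    Assumes ControlPort is enabled on `port`; raises OSError if it is not.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        reader = conn.makefile("rb")
        for command in build_newnym_commands(password):
            conn.sendall(command)
            reply = reader.readline().decode("utf-8", "replace").strip()
            replies.append(reply)
            # Status 250 means success; stop at the first failure.
            if not reply.startswith("250"):
                break
    return replies
```

After a successful call, new connections made through the SOCKS port should use fresh circuits. Libraries such as `stem` wrap the same protocol if a dependency is acceptable.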


@probberechts commented on GitHub (Jul 27, 2023):

You could also try to solve the captcha once. Maybe you can continue scraping afterwards?


@hkzid commented on GitHub (Aug 9, 2023):

> Thanks for the quick help.
>
> Tried that - but got the same error. Says nothing about IP being blocked, just the error msg above.
>
> It opens a bunch of chrome windows, that seems to get a captcha prompt? Is this expected?
>
> Screenshot 2023-07-27 at 18 57 28

I've got a similar problem. I tried to run read_schedule(). It gets all the competition names, but the HTML downloaded to \soccerdata\data\WhoScored\seasons looks like the page in this picture.
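
If challenge pages have been written to the cache as described here, later runs may keep reading them instead of real data. The sketch below lists cached files whose text contains a marker string so they can be inspected. The marker `"captcha"` is a guess at what such a page contains, not something confirmed in this thread, and the function name is hypothetical; a legitimate page could also match, so review the results before deleting anything.

```python
from pathlib import Path

# Assumption: a saved challenge page mentions "captcha" somewhere in its HTML.
CHALLENGE_MARKERS = ("captcha",)


def find_suspect_cache_files(cache_dir, markers=CHALLENGE_MARKERS):
    """Return cached files under `cache_dir` whose text contains any marker."""
    suspects = []
    for path in sorted(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        # Lower-case the content so the match is case-insensitive.
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if any(marker in text for marker in markers):
            suspects.append(path)
    return suspects
```

For the directory mentioned above, a call might look like `find_suspect_cache_files(Path.home() / "soccerdata" / "data" / "WhoScored")`, assuming the cache lives under the home directory. Passing `no_cache=True`, as in the maintainer's snippet earlier in the thread, is another way to avoid reusing stale pages.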


@OnlineAnalytics commented on GitHub (Aug 14, 2023):

Any fix for this yet? I'm seeing the same issues.


@hkzid commented on GitHub (Aug 15, 2023):

I think it's a problem with undetected-chromedriver, and I don't think anybody has found a way to solve it.
