[GH-ISSUE #619] 'read_events' Function Ignoring 'live=False' Parameter and Issues with Group Stage vs Knockout Stage HTML Structure #118

Closed
opened 2026-03-02 15:55:57 +03:00 by kerem · 3 comments

Originally created by @ds-oliver on GitHub (Jun 27, 2024).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/619

While using the `soccerdata` library to scrape event data from WhoScored, I've encountered an issue where the `read_events` function seems to ignore the `live=False` parameter. Despite explicitly setting `live=False`, the function attempts to scrape the live URL, resulting in repeated errors. (Please ignore the "priority game" parts of the script; they are carryover from another project that I did not remove from this function call.)

Here are some relevant details:

  1. Script Parameters and Logs:

    • The script sets live=False for the read_events function.
    • However, the logs indicate that the function tries to access the live URL: https://www.whoscored.com/Matches/1787316/Live.
    INFO     Setting read_events params: match_id=1729479, output_fmt=spadl, force_cache=False, live=False             scrape_euros.py:133
    INFO     Could not find priority game 1729479.                                                                      scrape_euros.py:151
    INFO     Processing home team: Scotland [424]                                                                       scrape_euros.py:160
    INFO     Processing away team: Hungary [327]                                                                        scrape_euros.py:161
    INFO     Processing game 1787316...                                                                                 scrape_euros.py:168
    ERROR    Error while scraping https://www.whoscored.com/Matches/1787316/Live. Retrying in 0 seconds... (attempt 1 of 5). _common.py:469
    
  2. HTML Structure for Group Stage vs Knockouts:

    • Another observation that might be relevant is the difference in HTML structure when accessing group stage games versus knockout stage games. This difference could potentially affect the scraping process.

Steps to Reproduce:

  1. Set up a script to scrape event data using the soccerdata library.
  2. Ensure the `read_events` function is called with `live=False`.
  3. Run the script and observe the logs.

Expected Behavior:
The `read_events` function should not attempt to access the live URL when `live=False` is set.

Actual Behavior:
The function tries to scrape the live URL, leading to repeated errors.

Logs:

[06/27/24 10:19:08] INFO     Custom team name replacements loaded from                                                   _config.py:85
                             /Users/hogan/soccerdata/config/teamname_replacements.json.
[06/27/24 10:19:11] INFO     Saving cached data to /Users/hogan/soccerdata/v2                                            _common.py:92
[06/27/24 10:19:18] INFO     Team ID Map: {'Germany': 336, 'Scotland': 424, 'Hungary': 327, 'Switzerland': 423,    scrape_euros.py:298
                             'Albania': 814, 'Italy': 343, 'Spain': 338, 'Croatia': 337, 'Poland': 342,
                             'Netherlands': 335, 'England': 345, 'Serbia': 771, 'Denmark': 425, 'Slovenia': 464,
                             'Austria': 324, 'France': 341, 'Belgium': 339, 'Slovakia': 484, 'Ukraine': 462,
                             'Romania': 412, 'Portugal': 340, 'Czechia': 332, 'Georgia': 413, 'Turkiye': 333}
[06/27/24 10:38:01] ERROR    Error while scraping https://www.whoscored.com/Matches/1787316/Live. Retrying in 0 seconds... (attempt 1 of 5). _common.py:469
                             Traceback (most recent call last):
                               File "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/soccerdata/_common.py", line 460, in
                             _download_and_save
                                 response = json.dumps(self._driver.execute_script("return " + var)).encode(
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line
                             408, in execute_script
                                 return self.execute(command, {"script": script, "args": converted_args})["value"]
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line
                             348, in execute
                                 self.error_handler.check_response(response)
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py",
                             line 229, in check_response
                                 raise exception_class(message, screen, stacktrace)
                             selenium.common.exceptions.JavascriptException: Message: javascript error: requirejs is not defined
                               (Session info: chrome=126.0.6478.127)
                             Stacktrace:
                             0   undetected_chromedriver             0x000000010d5230e8 undetected_chromedriver + 5169384
                             1   undetected_chromedriver             0x000000010d51afba undetected_chromedriver + 5136314
                             2   undetected_chromedriver             0x000000010d09736c undetected_chromedriver + 402284
                             3   undetected_chromedriver             0x000000010d09cb99 undetected_chromedriver + 424857
                             4   undetected_chromedriver             0x000000010d09ec2c undetected_chromedriver + 433196
                             5   undetected_chromedriver             0x000000010d127ee8 undetected_chromedriver + 995048
                             6   undetected_chromedriver             0x000000010d107ab2 undetected_chromedriver + 862898
                             7   undetected_chromedriver             0x000000010d126f57 undetected_chromedriver + 991063
                             8   undetected_chromedriver             0x000000010d107853 undetected_chromedriver + 862291
                             9   undetected_chromedriver             0x000000010d0d75c6 undetected_chromedriver + 665030
                             10  undetected_chromedriver             0x000000010d0d7e4e undetected_chromedriver + 667214
                             11  undetected_chromedriver             0x000000010d4e5d00 undetected_chromedriver + 4918528
                             12  undetected_chromedriver             0x000000010d4eacfd undetected_chromedriver + 4939005
                             13  undetected_chromedriver             0x000000010d4eb3d5 undetected_chromedriver + 4940757
                             14  undetected_chromedriver             0x000000010d4c6de4 undetected_chromedriver + 4791780
                             15  undetected_chromedriver             0x000000010d4eb6c9 undetected_chromedriver + 4941513
                             16  undetected_chromedriver             0x000000010d4b85b4 undetected_chromedriver + 4732340
                             17  undetected_chromedriver             0x000000010d50b898 undetected_chromedriver + 5073048
                             18  undetected_chromedriver             0x000000010d50ba57 undetected_chromedriver + 5073495
                             19  undetected_chromedriver             0x000000010d51ab6e undetected_chromedriver + 5135214
                             20  libsystem_pthread.dylib             0x00007ff819c0418b _pthread_start + 99
                             21  libsystem_pthread.dylib             0x00007ff819bffae3 thread_start + 15

Code:

```python
# Relevant snippet showing the function call
ws.read_events(
    match_id=game_id,
    output_fmt=output_fmt,
    force_cache=force_cache,
    live=live,
)
```

Additional Context:
As noted above, the difference in HTML structure between group stage and knockout stage pages might also be affecting the scraping process.

Environment:

  • soccerdata version: [please specify]
  • Python version: 3.10.4
  • Operating System: macOS

Potential Fix:
Please investigate why the `live=False` parameter is not being respected by the `read_events` function. Additionally, consider any differences in HTML structure between group stage and knockout stage games that might affect scraping.

Thank you for your attention to this issue. Let me know if you need any additional information.

kerem closed this issue 2026-03-02 15:55:57 +03:00

@ds-oliver commented on GitHub (Jul 3, 2024):

@probberechts any help here?


@probberechts commented on GitHub (Jul 3, 2024):

First, you may misunderstand the purpose of the `live` parameter. Setting `live=False` doesn't really do anything: it corresponds to the default behaviour, where events are scraped only if they are not in the cache. Setting `live=True` disables the cache, which is mainly useful when retrieving event data during a game (because you want to ignore what is in the cache).

Otherwise, I do not see what the issue could be. Basically, it reduces to this block of code, where `no_cache` takes the value of `live`:

https://github.com/probberechts/soccerdata/blob/64a4fa02616dfb85cd32758664b7f69265ba79b8/soccerdata/_common.py#L304-L306

So, if you set `live=False`, it's

```py
if False or self.no_cache or not is_cached:
    logger.debug("Scraping %s", url)
```

which implies that it will scrape the data only if you've set `no_cache=True` in the constructor or if the game has not been cached before.
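The gating logic above can be sketched as a small standalone function. This is only an illustration of the condition described in the snippet; `should_scrape` and its parameter names are hypothetical and not part of the `soccerdata` API:

```python
def should_scrape(live: bool, no_cache: bool, is_cached: bool) -> bool:
    """Mirror the download gate: a URL is (re)scraped when live mode is
    requested, caching is disabled on the reader, or the file is not in
    the cache yet."""
    return live or no_cache or not is_cached

# With live=False, a scrape still happens whenever the game is uncached:
print(should_scrape(live=False, no_cache=False, is_cached=False))  # True
print(should_scrape(live=False, no_cache=False, is_cached=True))   # False
print(should_scrape(live=False, no_cache=True, is_cached=True))    # True
```

So `live=False` never prevents the initial fetch of an uncached match; it only avoids forcing a re-fetch of cached data.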


@ds-oliver commented on GitHub (Jul 4, 2024):

Thank you for looking.

Well, whatever the issue was, it appears to have been patched in the most recent version. Updated the package and my script works fine now!
