[GH-ISSUE #905] [Understat] KeyError 'statData' on Google Colab (IP Blocking/Cloudflare) #198

Closed
opened 2026-03-02 15:56:35 +03:00 by kerem · 0 comments

Originally created by @GiorgioMerolla on GitHub (Dec 10, 2025).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/905

**Describe the bug**
When attempting to fetch Understat data on Google Colab, the scraper fails with `KeyError: 'statData'`.
This appears to be caused by Understat blocking Google Cloud IPs (a Cloudflare 403 Forbidden or challenge page), so the scraper cannot find the expected JSON data variable in the response.

**To Reproduce**
Run the following code in a standard Google Colab environment:

```python
import soccerdata as sd

leagues = ['ENG-Premier League']
seasons = ['24-25']
us = sd.Understat(leagues=leagues, seasons=seasons)
df = us.read_team_match_stats()
```

Output:

```
INFO     Saving cached data to /root/soccerdata/data/Understat
KeyError: 'statData'
```

**Context**

* OS: Linux (Google Colab Standard Runtime)
* Soccerdata Version: [Insert your version here, e.g., 1.8.7]

**Observations**

* The issue persists even after clearing the cache.
* Using `proxy='tor'` or `no_cache=True` does not always resolve the issue, suggesting the block is aggressive against Colab IPs.
* The FBref scraper also returns 403 Forbidden in the same environment.

**Suggested Solution / Feature Request**
Since running scrapers on Colab is a common use case, could we:

* Improve the error handling to raise a specific `AccessDeniedError` instead of `KeyError` when the response is a Cloudflare block page?
* Add documentation on using proxies (or Tor) specifically for Colab users?
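
To make the first request concrete, here is a minimal sketch of what such a check might look like. It is not taken from soccerdata's code: `AccessDeniedError`, `looks_blocked`, `extract_js_variable`, the marker strings, and the assumption that the page embeds its data as `name = JSON.parse('...')` are all hypothetical and for illustration only.

```python
import re


class AccessDeniedError(RuntimeError):
    """Hypothetical exception: the site returned a block page instead of data."""


# Heuristic markers for block/challenge pages (assumed, not an official list).
BLOCK_MARKERS = ("just a moment", "attention required", "cf-chl")


def looks_blocked(status_code: int, html: str) -> bool:
    """Guess whether a response is a block or challenge page."""
    if status_code in (403, 503):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def extract_js_variable(status_code: int, html: str, name: str) -> str:
    """Return the raw payload of `name = JSON.parse('...')` found in the page.

    Raises AccessDeniedError when the response looks like a block page,
    and KeyError only when a normal page genuinely lacks the variable.
    """
    if looks_blocked(status_code, html):
        raise AccessDeniedError(
            f"HTTP {status_code}: response looks like a block or challenge page"
        )
    pattern = rf"\b{re.escape(name)}\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, html, re.S)
    if match is None:
        raise KeyError(name)
    return match.group(1)
```

With a check like this, a blocked request would surface as `AccessDeniedError` with the HTTP status in the message, while `KeyError` would remain reserved for pages that loaded normally but whose layout changed.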


### **Why this is better than a "Google Drive" request**
Asking them to "add Google Drive support" might get rejected because the library *already* supports it!
* **The Library's View:** "We already gave you the `data_dir` parameter. You can set that to your Drive folder. We don't need to add code for that."
* **The Real Problem:** The real problem is the **Blocking**. By reporting the `KeyError`, you help them fix the *real* bug (the crash).

### **If you specifically want to propose the "Tor" workaround**
If you want to be very helpful, you can comment on your own Issue saying:
*"I found a workaround for Colab users: Installing `tor` in the notebook and passing `