[GH-ISSUE #440] [WhoScored] Unable to select English locale #80

Open
opened 2026-03-02 15:55:36 +03:00 by kerem · 12 comments

Originally created by @Gibranium on GitHub (Dec 14, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/440

Sorry to bring this back, but I have not been able to scrape WhoScored for months.
To explain: from April onwards everything worked properly. Then I switched from an Intel MacBook Pro to an M2 Mac mini, reinstalled Tor via Homebrew, and set up Anaconda with a dedicated environment containing only soccerdata and its dependencies. Since then I have not been able to scrape a single file from WhoScored, while FBref scraping, at least, works flawlessly. I've tried everything that was recommended in the previously opened issues about this problem; the only thing I haven't tried yet is a VPN, because I'd really rather not spend money right now to make it work. If anyone is able to help me get it working, feel free to contact me on Twitter: @gualanodavide.
Thanks a lot to anyone who can help.

Screenshot 2023-12-14 alle 16 31 12
Screenshot 2023-12-14 alle 16 31 22


@probberechts commented on GitHub (Dec 14, 2023):

Can you try to run the code in non-headless mode and check what happens in your browser window? Does it say that your IP is blocked or show a captcha?

import soccerdata as sd
ws = sd.WhoScored("ENG-Premier League", "2223", headless=False, no_cache=True)
leagues = ws.read_leagues()

If that's the case it's a problem with the undetected-chromedriver library, not with soccerdata. You can test with:

import undetected_chromedriver as uc
driver = uc.Chrome(headless=False, use_subprocess=False)
driver.get('https://www.whoscored.com/')

You might find a solution in the issue tracker of the undetected-chromedriver library (https://github.com/ultrafunkamsterdam/undetected-chromedriver).


@Gibranium commented on GitHub (Dec 14, 2023):

I didn't have any problem in the browser window: it opened WhoScored and didn't ask for a captcha. But I think that, since I'm in Italy, it doesn't find the same names in the links as the code requests, so it fails. Am I right, and how can I solve it?

Screenshot 2023-12-14 alle 20 28 21


@probberechts commented on GitHub (Dec 14, 2023):

Oh, but now you have a different error. You got past the error in your first comment. Can you share the "tiers.json" file in "/Users/davidegualona/soccerdata/data/WhoScored"?


@Gibranium commented on GitHub (Dec 14, 2023):

Yes, of course.

Here it is:

tiers.json (https://github.com/probberechts/soccerdata/files/13678681/tiers.json)


@probberechts commented on GitHub (Dec 14, 2023):

You were right, the country names are in Italian in your "tiers.json" file. One option is to add the Italian names in the config/league_dict.json file (see https://soccerdata.readthedocs.io/en/latest/howto/custom-leagues.html). For example,

{
  "ENG-Premier League": {
    "WhoScored": "Inghilterra - Premier League"
  }
}

You might experience more problems in other parts of the code though.
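
If you would rather script that edit than change the file by hand, here is a minimal sketch. It assumes the config file lives at ~/soccerdata/config/league_dict.json (the default data directory seen in the path above); `add_league_alias` is a hypothetical helper, not part of soccerdata.

```python
import json
from pathlib import Path


def add_league_alias(config_file: Path, league_id: str, source: str, name: str) -> dict:
    """Merge one source-specific league name into a league_dict.json file.

    Hypothetical helper: existing entries in the file are kept, and the
    merged dictionary is returned after it has been written back.
    """
    if config_file.exists():
        leagues = json.loads(config_file.read_text(encoding="utf-8"))
    else:
        leagues = {}

    # Create the league entry if needed, then set the name for this source.
    leagues.setdefault(league_id, {})[source] = name

    config_file.parent.mkdir(parents=True, exist_ok=True)
    config_file.write_text(
        json.dumps(leagues, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return leagues


# Example call (assumed default location of the config file):
# add_league_alias(
#     Path.home() / "soccerdata" / "config" / "league_dict.json",
#     "ENG-Premier League", "WhoScored", "Inghilterra - Premier League",
# )
```

soccerdata reads this file when it is imported, so restart your Python session after changing it.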

Alternatively, you could try to set the default language of your browser to English or configure selenium accordingly (see https://stackoverflow.com/questions/55150118/trouble-modifying-the-language-option-in-selenium-python-bindings).
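
For the selenium route, a rough sketch of the two settings involved: the `--lang` command-line switch and the `intl.accept_languages` preference. `prefer_english` is a hypothetical helper that works on any options object with selenium's `add_argument` and `add_experimental_option` methods. How to hand such options to the driver that soccerdata creates is not covered here, and a site that picks the language from the visitor's IP address may ignore both settings.

```python
def prefer_english(options):
    """Ask Chrome to request English pages (hypothetical helper).

    `options` is expected to behave like selenium's ChromeOptions.
    """
    # UI language of the browser itself.
    options.add_argument("--lang=en-US")
    # Languages sent in the Accept-Language request header.
    options.add_experimental_option(
        "prefs", {"intl.accept_languages": "en-US,en"}
    )
    return options


# Example use with plain selenium (needs selenium and Chrome installed):
# from selenium import webdriver
# driver = webdriver.Chrome(options=prefer_english(webdriver.ChromeOptions()))
```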

Let me know what works.


@Gibranium commented on GitHub (Dec 15, 2023):

I've tried the first option, but the code immediately runs into another problem, so I don't think it's viable. As for the other two: I changed the language of Chrome and Safari, but that doesn't solve it, because the results on the search page are already in Italian. As for the adjustment described in your link, I don't think I have the skills to get it working. I tried with some help from ChatGPT, but after an hour we still hadn't found a solution. Apparently this:

driver = webdriver.Chrome(chrome_options=options)

needs to be this:

driver = webdriver.Chrome(options=options)

in order to apply the options, but I still don't know how to make the driver work with the scraping part. Nonetheless, ChatGPT had me try this:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Set up the WebDriver with language preference
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'intl.accept_languages': 'en,en_US'})
driver = webdriver.Chrome(options=options)

# Navigate to the WhoScored page using Selenium
driver.get("https://www.whoscored.com/") # Replace with the actual URL

# Extract the HTML content after the page has loaded
html_content = driver.page_source

# Continue with requests and BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

I wanted to see whether this could work before merging it with the soccerdata part. I found that even though I can make the page load in English, after a second WhoScored refreshes itself and loads in Italian.
So either I'm not skilled enough to pull this off, or I need to get a NordVPN subscription. Am I right?
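
One way to check which language a page was actually served in, instead of judging by eye, is to read the `lang` attribute of its `<html>` tag. This is only a sketch using the standard library: `page_language` is a hypothetical helper, and it assumes the site declares its language in that attribute, which this thread does not confirm for WhoScored.

```python
from html.parser import HTMLParser


class _LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def page_language(html: str):
    """Return the declared page language, or None if there is none."""
    finder = _LangFinder()
    finder.feed(html)
    return finder.lang


# Example use with the driver from the snippet above:
# print(page_language(driver.page_source))
```

Calling it once right after `driver.get(...)` and again a few seconds later would show whether the page really switches language after the first load.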


@probberechts commented on GitHub (Dec 15, 2023):

You can also try to redirect to the English version by simulating a click on the language menu at the top left.

import soccerdata as sd
ws = sd.WhoScored("ENG-Premier League", "2223", headless=False, no_cache=True)
ws._driver.get("https://www.whoscored.com/")
ws._driver.execute_script("location = 'https://whoscored.com/'")
leagues = ws.read_leagues()

@Gibranium commented on GitHub (Dec 15, 2023):

It does what it is supposed to do, but WhoScored still refreshes itself and loads in Italian.


@probberechts commented on GitHub (Dec 15, 2023):

Is there any way in which you can switch to English when browsing the website manually?


@Gibranium commented on GitHub (Dec 15, 2023):

There's a toggle in which you can choose the language, but if I set EN it switches automatically back to IT


@Gibranium commented on GitHub (Dec 15, 2023):

Anyway, I've solved it by subscribing to NordVPN; right now it seems worth the money given the effort involved.
One more question, and then you can close the issue if you need to: regarding "[WhoScored] Ignore cached events file if empty #420", has the improvement already been added to soccerdata, or do we have to write the enhancement ourselves? If so, where should I do it? Thank you very much for all the help.


@probberechts commented on GitHub (Dec 16, 2023):

Ok, great! If the locale is hard-coded based on IP location I think the only possible fixes are indeed translating some parts of the implementation or using a VPN.
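
To illustrate what the translation route could look like, here is a minimal sketch that maps a localized "Region - Competition" label back to English. The mapping and `to_english_label` are hypothetical and not part of soccerdata; only the "Inghilterra - Premier League" label comes from this thread, and the other entries are ordinary Italian country names added for illustration.

```python
# Hypothetical Italian-to-English region names; extend as needed.
ITALIAN_TO_ENGLISH = {
    "Inghilterra": "England",
    "Italia": "Italy",
    "Spagna": "Spain",
    "Germania": "Germany",
    "Francia": "France",
}


def to_english_label(label: str, translations=ITALIAN_TO_ENGLISH) -> str:
    """Translate the region part of a 'Region - Competition' label.

    Regions missing from the mapping are returned unchanged.
    """
    region, sep, competition = label.partition(" - ")
    return translations.get(region, region) + sep + competition


print(to_english_label("Inghilterra - Premier League"))
```

Competition names and other localized strings on the site would need the same treatment, which is why this route can be a lot of work.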

#420 is not yet released. If you can't wait for the next release, you can install the latest build from test.pypi (https://test.pypi.org/project/soccerdata/).
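
In the meantime, empty cached files can be found by hand. The sketch below is not the change made in #420; it only lists zero-byte files so they can be inspected or removed before scraping again. `is_usable_cache` and `empty_cache_files` are hypothetical helpers, and the cache location in the example is the directory mentioned earlier in this thread.

```python
from pathlib import Path


def is_usable_cache(path: Path) -> bool:
    """A cached file is usable only if it exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


def empty_cache_files(cache_dir: Path) -> list:
    """Return all zero-byte files below cache_dir, sorted by path."""
    return sorted(
        p for p in cache_dir.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


# Example use (cache directory as seen earlier in this thread):
# for path in empty_cache_files(Path.home() / "soccerdata" / "data" / "WhoScored"):
#     print(path)
```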
