mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 10:05:53 +03:00
[GH-ISSUE #440] [WhoScored] Unable to select English locale #80
Originally created by @Gibranium on GitHub (Dec 14, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/440
Sorry to bring this back, but I have not been able to scrape WhoScored for months.
To explain in full: everything had been working properly since April. Then I switched from an Intel MacBook Pro to an M2 Mac mini, reinstalled Tor via Homebrew, and set up Anaconda with a dedicated environment containing only soccerdata and its dependencies. Since then I have not been able to scrape a single file from WhoScored, while FBref scraping, at least, works flawlessly. I've tried everything recommended in the earlier issues about this problem; the only thing I have not tried yet is a VPN, because I'd rather not spend money right now to make it work. If anyone is able to help, feel free to contact me on Twitter: @gualanodavide.
Thanks a lot to anyone
@probberechts commented on GitHub (Dec 14, 2023):
Can you try to run the code in non-headless mode and check what happens in your browser window? Does it say that your IP is blocked or show a captcha?
If that's the case it's a problem with the undetected-chromedriver library, not with soccerdata. You can test with:
You might find a solution in the issue tracker of the undetected-chromedriver library.
@Gibranium commented on GitHub (Dec 14, 2023):
I did not see any problem in my browser window: it opened WhoScored and didn't ask for a captcha. But since I'm in Italy, I think it doesn't find the same names on the page as the code requests, so it fails. Am I right, and how can I solve it?
@probberechts commented on GitHub (Dec 14, 2023):
Oh, but now you have a different error. You got past the error in your first comment. Can you share the "tiers.json" file in "/Users/davidegualona/soccerdata/data/WhoScored"?
@Gibranium commented on GitHub (Dec 14, 2023):
Yes, of course.
Here it is:
tiers.json
@probberechts commented on GitHub (Dec 14, 2023):
You were right, the country names are in Italian in your "tiers.json" file. One option is to add the Italian names to the config/league_dict.json file (see https://soccerdata.readthedocs.io/en/latest/howto/custom-leagues.html). You might experience more problems in other parts of the code though.
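As a rough illustration of the first option, the sketch below merges an entry with an Italian label into a league_dict.json file. The key layout ("WhoScored", "season_start", "season_end") is assumed from the custom-leagues how-to linked above, the label "Inghilterra - Premier League" is a guess at what the localized site shows, and `write_league_dict` is a hypothetical helper, not part of soccerdata. Check the how-to for the exact keys before relying on it.

```python
import json
from pathlib import Path

# Hypothetical entry: key names assumed from the custom-leagues how-to;
# the Italian label is an assumption about what the localized site displays.
ITALIAN_LEAGUES = {
    "ENG-Premier League": {
        "WhoScored": "Inghilterra - Premier League",
        "season_start": "Aug",
        "season_end": "May",
    }
}


def write_league_dict(config_path: Path, leagues: dict) -> dict:
    """Merge `leagues` into the JSON file at `config_path` and return the result."""
    existing = {}
    if config_path.exists():
        existing = json.loads(config_path.read_text(encoding="utf-8"))
    existing.update(leagues)  # new entries replace same-named old ones
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(
        json.dumps(existing, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return existing
```

It could be called as `write_league_dict(Path.home() / "soccerdata" / "config" / "league_dict.json", ITALIAN_LEAGUES)`, assuming the default soccerdata directory sits in the home folder as in the path quoted earlier in this thread.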
Alternatively, you could try to set the default language of your browser to English or configure selenium accordingly (see https://stackoverflow.com/questions/55150118/trouble-modifying-the-language-option-in-selenium-python-bindings).
Let me know what works.
@Gibranium commented on GitHub (Dec 15, 2023):
I tried the first option, but the code immediately hits another problem, so I don't think it's viable. As for the other two: I changed the language of Chrome and Safari, but that doesn't solve it, because the search page already returns results in Italian. For the adjustment described in your link, I don't think I have the skills to put together a working fix. I tried with some help from ChatGPT, but after an hour we couldn't find a solution. Apparently this:
driver = webdriver.Chrome(chrome_options=options)
needs to be this:
driver = webdriver.Chrome(options=options)
in order to apply the options, but I still don't know how to make the driver work with the scraping part. Nonetheless, ChatGPT had me try this:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
# Set up the WebDriver with language preference
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'intl.accept_languages': 'en,en_US'})
driver = webdriver.Chrome(options=options)
# Navigate to the WhoScored page using Selenium
driver.get("https://www.whoscored.com/")  # Replace with the actual URL
# Extract the HTML content after the page has loaded
html_content = driver.page_source
# Continue with requests and BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
I did this to see whether it could work before merging the adjustment into the soccerdata part. I found that even though I can make the page load in English, after a second WhoScored refreshes itself and loads in Italian.
So either I am not good enough to pull this off, or I need to get a NordVPN subscription. Am I right?
@probberechts commented on GitHub (Dec 15, 2023):
You can also try to redirect to the English version by simulating a click on the language menu at the top left.
@Gibranium commented on GitHub (Dec 15, 2023):
It does what it is supposed to do, but WhoScored still refreshes itself and loads in Italian.
@probberechts commented on GitHub (Dec 15, 2023):
Is there any way in which you can switch to English when browsing the website manually?
@Gibranium commented on GitHub (Dec 15, 2023):
There's a toggle where you can choose the language, but if I set it to EN it automatically switches back to IT.
@Gibranium commented on GitHub (Dec 15, 2023):
Anyway, I've resolved it by subscribing to NordVPN; right now it seems worth the money given the effort.
I'd ask just one more thing (then you can close the issue if you need to): regarding [WhoScored] Ignore cached events file if empty #420, has the improvement already been added to soccerdata, or should we write the enhancement ourselves? In that case, where should I do it? Thank you very much for all the help.
@probberechts commented on GitHub (Dec 16, 2023):
Ok, great! If the locale is hard-coded based on IP location I think the only possible fixes are indeed translating some parts of the implementation or using a VPN.
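To give an idea of what "translating some parts of the implementation" could look like, here is a minimal sketch that maps a localized country name back to English before the label is matched. It assumes labels have the form "Country - League"; `COUNTRY_TRANSLATIONS` and `to_english_label` are hypothetical names for illustration, not soccerdata code, and the mapping is deliberately incomplete.

```python
# Hypothetical lookup table: Italian country names -> English names.
COUNTRY_TRANSLATIONS = {
    "Inghilterra": "England",
    "Spagna": "Spain",
    "Germania": "Germany",
    "Francia": "France",
    "Italia": "Italy",
}


def to_english_label(label: str, translations: dict = COUNTRY_TRANSLATIONS) -> str:
    """Translate the country part of a 'Country - League' label, if known."""
    country, sep, league = label.partition(" - ")
    if not sep:
        return label  # no " - " separator: leave the label unchanged
    country = country.strip()
    # Unknown countries fall through unchanged.
    return f"{translations.get(country, country)} - {league.strip()}"


print(to_english_label("Inghilterra - Premier League"))
```

A lookup like this only covers the names it lists, so other localized strings elsewhere on the site would still need their own handling.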
#420 is not yet released. If you can't wait for the next release, you can install the latest build from test.pypi.