mirror of
https://github.com/benbusby/whoogle-search.git
synced 2026-04-25 04:05:57 +03:00
[GH-ISSUE #211] [FEATURE] anti-captcha support. #145
Originally created by @ghost on GitHub (Feb 26, 2021).
Original GitHub issue: https://github.com/benbusby/whoogle-search/issues/211
With stuff that can get blocked easily, anti-captcha support would be huge. Invidious has it implemented perfectly, and it allows public instances to be used without any major rate-limiting. Just an idea. Thanks!
@benbusby commented on GitHub (Mar 1, 2021):
I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.
My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.
In any case, I'll look into it. Thanks for the recommendation!
@ghost commented on GitHub (Mar 1, 2021):
Invidious implements it very nicely, I actually didn't know it was an option until talking with the developers. You'd just add a line in your config for your API key. I'm willing to pay if it means my instance can be used without worry of getting blocked, plus it's super cheap.
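Editor's sketch of the "add a line in your config for your API key" idea discussed above: the feature stays off unless the instance maintainer supplies a key. The environment variable name `WHOOGLE_ANTICAPTCHA_KEY` and both function names are hypothetical; nothing here is taken from Whoogle's or Invidious's actual configuration.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, chosen only for this illustration.
SOLVER_KEY_VAR = "WHOOGLE_ANTICAPTCHA_KEY"


def load_solver_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the maintainer-supplied solver key, or None if unset or blank."""
    key = env.get(SOLVER_KEY_VAR, "").strip()
    return key or None


def solver_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The paid solver is only activated when a key has been configured."""
    return load_solver_key(env) is not None
```

With this shape, an instance without a key behaves exactly as before, so the cost of a paid service falls only on maintainers who opt in.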
@unixfox commented on GitHub (Apr 25, 2021):
You can implement the anti-captcha API, it's not universal nor a standard, but it's very common and easy to clone.
A lot of projects provide an anti-captcha API clone, like https://capmonster.cloud or mine, which I plan to release publicly as soon as I find it stable.
Implementing an anti-captcha solution in whoogle is a great way to give public instance maintainers the tools to offer a reliable service that works even when Google is trying to rate limit the server.
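Editor's sketch of why a cloneable API helps: if the base URL is configurable, the same request body can be sent to any compatible provider. The field names (`clientKey`, `task`, `websiteURL`, `websiteKey`), the `createTask` method name, and the default task type are assumptions based on the commonly documented anti-captcha request format and should be checked against the chosen provider's documentation. The host below is a placeholder, and no request is actually sent.

```python
from typing import Any, Dict
from urllib.parse import urljoin


def build_create_task(
    client_key: str,
    website_url: str,
    website_key: str,
    task_type: str = "RecaptchaV2TaskProxyless",  # assumed task type name
) -> Dict[str, Any]:
    """Build a createTask-style JSON body (field names are assumptions)."""
    return {
        "clientKey": client_key,
        "task": {
            "type": task_type,
            "websiteURL": website_url,
            "websiteKey": website_key,
        },
    }


def endpoint(base_url: str, method: str) -> str:
    """Join a configurable provider base URL with an API method name."""
    return urljoin(base_url.rstrip("/") + "/", method)


# Placeholder host: swap in whichever compatible provider is configured.
url = endpoint("https://solver.example.invalid", "createTask")
body = build_create_task("my-key", "https://www.google.com/search", "site-key")
```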
@Albonycal commented on GitHub (Jun 22, 2021):
Yea I'm also getting blocked by google...
This would be cool..
Any updates?
Thank you :D
@maxdesalle commented on GitHub (Jul 13, 2021):
Also having this issue on a DigitalOcean droplet.
@randomwalk3141592 commented on GitHub (Jul 16, 2021):
Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
@unixfox commented on GitHub (Jul 16, 2021):
That's not possible: you can't load the same Google reCAPTCHA site key from another domain that is not whitelisted.
@Albonycal commented on GitHub (Jul 16, 2021):
Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime
@randomwalk3141592 commented on GitHub (Jul 16, 2021):
I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.
@randomwalk3141592 commented on GitHub (Jul 16, 2021):
I use a VPN and Whoogle docker queries Google through this VPN. Once in a while, Whoogle will get a captcha and I would have to reconnect the VPN connection so that I get a new public IP address. I know this solves the problem.
What is interesting is that when Whoogle chokes on the Google captcha, if I go to Google directly (also through the VPN, so my direct connection would come from the same public IP as Whoogle docker), Google does NOT show me a captcha.
It seems Google is somehow detecting that the Whoogle query is "weird" while my direct query to Google from my computer is not weird.
@Albonycal commented on GitHub (Jul 17, 2021):
hmm.. Can it be different user agents?
or fingerprint thing?
@unixfox commented on GitHub (Jul 17, 2021):
Whoogle is not a browser, it doesn't interpret JavaScript so it can't "show" you the CAPTCHA.
No, Google rate limits based on the IP address and that's it.
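Editor's sketch: since the proxy cannot render the CAPTCHA itself, the most it can do is recognize the block and report it. The heuristic below assumes that a blocked request either returns HTTP 429 or ends up on a path under `/sorry/`; both signals are assumptions about Google's current behavior rather than a documented contract, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def looks_rate_limited(status_code: int, final_url: str) -> bool:
    """Heuristic: treat HTTP 429 or a redirect to /sorry/ as an IP block."""
    path = urlparse(final_url).path
    return status_code == 429 or path.startswith("/sorry/")
```

A caller could use this to show a clear "this instance's IP is rate limited" message instead of an empty results page.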
@unixfox commented on GitHub (Aug 13, 2021):
Just wanted to say that there is a way to bypass Google reCAPTCHA entirely, I explained how here: https://github.com/searxng/searxng/issues/159
@benbusby "just" needs to switch to this special endpoint, and we will have the CAPTCHA issue fixed.
@JaneJeon commented on GitHub (Apr 19, 2022):
I honestly think all of this can be solved by using a better scraping method. I worked on scrapers to get past "gated" sites such as GSRPs (Google Search Result Pages), paywalls, etc., and the reliability of scraping (i.e. not getting cockblocked by a captcha because they detected that you were a "bot" - making a request not from their frontend) comes down to these factors:
requests.get() does
Now, 1 should be solved with proxies, 2 should be solved with careful rate limiting implementation (esp. w/ multiple proxies) within whoogle, 3 can be solved with careful HTTPS handshake implementation within whoogle, and 4 can be implemented using something like playwright plus browser stealth libraries that plug into playwright OR (given that this application is written in python and won't be able to use those stealth libraries that are typically written in js) use playwright to control a "stealthy" web browser instance, such as https://github.com/ulixee/secret-agent. Note that 4 is extreme overkill for most people's use cases (most bot solutions grade you on a sliding scale, so as long as you get 1, 2, and 3, your score will still be high enough to not require this bullshit)!!
I literally never get blocked on Google this way (not from Whoogle, my private application), no matter how many requests I send. Whoogle should adopt at the very least 2 and 3, and really direct people towards using a proxy (instead of trying to remove the captcha - that is a losing solution). Honestly, that should suffice to close this issue once and for all.
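Editor's sketch of points 1 and 2 combined (rate limiting across multiple proxies): each proxy must rest for a minimum interval between requests, and the proxy that has been idle longest is picked first. The class name, method name, and interval are hypothetical, and the clock is injectable so the behavior can be checked without sleeping. This is not Whoogle's implementation.

```python
import time
from typing import Callable, Dict, List, Optional


class ProxyRotator:
    """Hand out proxies so each one rests min_interval seconds between uses."""

    def __init__(
        self,
        proxies: List[str],
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._min_interval = min_interval
        self._clock = clock
        self._last_used: Dict[str, float] = {}

    def acquire(self) -> Optional[str]:
        """Return the longest-idle proxy that is ready, or None if all are resting."""
        now = self._clock()
        best: Optional[str] = None
        best_idle = -1.0
        for proxy in self._proxies:
            last = self._last_used.get(proxy)
            # A proxy that was never used counts as infinitely idle.
            idle = float("inf") if last is None else now - last
            if idle >= self._min_interval and idle > best_idle:
                best, best_idle = proxy, idle
        if best is not None:
            self._last_used[best] = now
        return best
```

When `acquire()` returns `None`, the caller can either wait or reject the search, which keeps the per-IP request rate bounded no matter how many users share the instance.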
@JaneJeon commented on GitHub (Apr 19, 2022):
Actually, ignore all that bullshit I said above, @unixfox's method is 10000 times easier. We should do that.
@unixfox commented on GitHub (Aug 9, 2022):
New way of fetching the Google results (search, videos, news, images and more) with an internal Google API that returns JSON results: https://github.com/searxng/searxng/issues/1642. This doesn't have any rate limit.
@yannduran commented on GitHub (Dec 31, 2024):
Still happening at the end of 2024