[GH-ISSUE #211] [FEATURE] anti-captcha support. #145

Open
opened 2026-02-25 20:35:01 +03:00 by kerem · 17 comments

Originally created by @ghost on GitHub (Feb 26, 2021).
Original GitHub issue: https://github.com/benbusby/whoogle-search/issues/211

With stuff that can get blocked easily, anti-captcha support would be huge. Invidious has it implemented perfectly, and it allows public instances to be used without any major rate-limiting. Just an idea. Thanks!


@benbusby commented on GitHub (Mar 1, 2021):

I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.

My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.

In any case, I'll look into it. Thanks for the recommendation!


@ghost commented on GitHub (Mar 1, 2021):

Invidious implements it very nicely, I actually didn't know it was an option until talking with the developers. You'd just add a line in your config for your API key. I'm willing to pay if it means my instance can be used without worry of getting blocked, plus it's super cheap.


@unixfox commented on GitHub (Apr 25, 2021):

> I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.
>
> My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.
>
> In any case, I'll look into it. Thanks for the recommendation!

You can implement the anti-captcha API. It's neither universal nor a standard, but it's very common and easy to clone.

A lot of projects provide an anti-captcha API clone, like https://capmonster.cloud or mine, which I plan to release publicly as soon as I find it stable.

Implementing an anti-captcha solution in Whoogle is a great way to give public instance maintainers the tools to offer a reliable service that works even when Google is trying to rate-limit the server.
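
For anyone unfamiliar with what "implementing the anti-captcha API" involves, here is a rough sketch of the usual two-step exchange: create a task, then poll for its result. It is not Whoogle code. The base URL, endpoint paths, task type name and JSON field names are written from memory of the public anti-captcha.com API and should be checked against the provider's documentation; `solve_recaptcha` and `post_json` are hypothetical names. Submitting the returned token back to Google is not covered.

```python
import json
import time
import urllib.request

# Assumed base URL; an API clone would expose the same paths on another host.
API_BASE = "https://api.anti-captcha.com"


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def solve_recaptcha(client_key, page_url, site_key, post=post_json,
                    poll_seconds=5.0, max_polls=24):
    """Create a solving task, then poll until the service returns a token."""
    created = post(f"{API_BASE}/createTask", {
        "clientKey": client_key,  # the instance maintainer's API key
        "task": {
            "type": "RecaptchaV2TaskProxyless",  # assumed task type name
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    })
    if created.get("errorId"):
        raise RuntimeError(f"createTask failed: {created}")

    for _ in range(max_polls):
        time.sleep(poll_seconds)
        result = post(f"{API_BASE}/getTaskResult", {
            "clientKey": client_key,
            "taskId": created["taskId"],
        })
        if result.get("errorId"):
            raise RuntimeError(f"getTaskResult failed: {result}")
        if result.get("status") == "ready":
            return result["solution"]["gRecaptchaResponse"]
    raise TimeoutError("captcha was not solved in time")
```

The `post` parameter exists only so the network call can be swapped out, for example to point at a self-hosted clone or to test without a paid key. The key itself would come from whatever config mechanism the instance already uses.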


@Albonycal commented on GitHub (Jun 22, 2021):

Yea I'm also getting blocked by google...
This would be cool..
Any updates?
Thank you :D


@maxdesalle commented on GitHub (Jul 13, 2021):

Also having this issue on a DigitalOcean droplet.


@randomwalk3141592 commented on GitHub (Jul 16, 2021):

Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.


@unixfox commented on GitHub (Jul 16, 2021):

> Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.

That's not possible: you can't load the same Google reCAPTCHA key from another domain that isn't whitelisted.


@Albonycal commented on GitHub (Jul 16, 2021):

Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime


@randomwalk3141592 commented on GitHub (Jul 16, 2021):

> > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
>
> That's not possible, you can't load a same key of Google Recaptcha from another domain that is not whitelisted.

I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.


@randomwalk3141592 commented on GitHub (Jul 16, 2021):

> Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime

I use a VPN and Whoogle docker queries Google through this VPN. Once in a while, Whoogle will get a captcha and I have to reconnect the VPN so that I get a new public IP address. I know this solves the problem.

What is interesting is that when Whoogle chokes on the Google captcha, if I go to Google directly (also through the VPN, so my direct connection would come from the same public IP as Whoogle docker), Google does NOT show me a captcha.

It seems Google is somehow detecting that the Whoogle query is "weird" while my direct query to Google from my computer is not weird.


@Albonycal commented on GitHub (Jul 17, 2021):

hmm.. Can it be different user agents?
or fingerprint thing?


@unixfox commented on GitHub (Jul 17, 2021):

> > > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
> >
> > That's not possible, you can't load a same key of Google Recaptcha from another domain that is not whitelisted.
>
> I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.

Whoogle is not a browser; it doesn't interpret JavaScript, so it can't "show" you the CAPTCHA.

> hmm.. Can it be different user agents?
> or fingerprint thing?

No, Google rate-limits based on the IP address, and that's it.
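
Since the block page can't be rendered and solved through a proxy that only fetches HTML, about the most such a proxy can do is recognise the block and report it cleanly. This is a minimal sketch of that check, not Whoogle's actual logic. The three markers (an HTTP 429 status, a `/sorry` path, and a `captcha-form` element id) are assumptions about what Google's block response looks like, and `looks_like_captcha_block` is a hypothetical name.

```python
from urllib.parse import urlparse


def looks_like_captcha_block(status_code, final_url, body):
    """Heuristically decide whether an upstream response is a CAPTCHA block page.

    All three markers are assumptions, not documented behaviour.
    """
    path = urlparse(final_url).path
    return (
        status_code == 429                # assumed "too many requests" status
        or path.startswith("/sorry")      # assumed path of the block page
        or 'id="captcha-form"' in body    # assumed marker in the block page HTML
    )
```

A caller could use this to show an "instance is rate-limited" message, or to trigger a change of outgoing IP, instead of passing a broken page through.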


@unixfox commented on GitHub (Aug 13, 2021):

Just wanted to say that there is a way to bypass Google reCAPTCHA entirely, I explained how here: https://github.com/searxng/searxng/issues/159

@benbusby "just" needs to switch to this special endpoint, and we will have the CAPTCHA issue fixed.


@JaneJeon commented on GitHub (Apr 19, 2022):

I honestly think all of this can be solved by using a better scraping method. I worked on scrapers to get past "gated" sites such as GSRPs (Google Search Result Pages), paywalls, etc. and the reliability of scraping (i.e. not getting cockblocked by a captcha because they detected that you were a "bot" - making a request not from their frontend) comes down to these factors:

  1. IP (holy shit people, this is the number 1 thing that gets you blocked by Google. USE THEM PROXIES!!)
  2. Rate limiting (how many requests per second/minute/hour are you sending to Google, per IP?)
  3. SSL fingerprinting (browsers make HTTPS requests in a different manner than just calling `requests.get()` does)
  4. Browser Fingerprinting (this is the big boy shit, and you almost never have to worry about it, except client-side rendered stuff, which is most definitely _not_ GSRPs)

Now, 1 should be solved with proxies, 2 should be solved with careful rate limiting implementation (esp. w/ multiple proxies) within whoogle, 3 can be solved with careful HTTPS handshake implementation within whoogle, and 4 can be implemented using something like playwright plus browser stealth libraries that plug into playwright OR (given that this application is written in python and won't be able to use those stealth libraries that are typically written in js) use playwright to control a "stealthy" web browser instance, such as https://github.com/ulixee/secret-agent. Note that 4 is extreme overkill for most people's use cases (most bot solutions grade you on a sliding scale, so as long as you get 1, 2, and 3, your score will be still high enough to not require this bullshit)!!

I literally never get blocked on Google this way (not from Whoogle, my private application), no matter how many requests I send. Whoogle should adopt at the very least 2 and 3, and really direct people towards using a proxy (instead of trying to remove the captcha - that is a losing solution). Honestly, that should suffice to close this issue once and for all.
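
Points 1 and 2 above can be sketched together as a small round-robin rotator that enforces a minimum gap between uses of each proxy. This is only an illustration under stated assumptions: the `ProxyRotator` class, its parameters and the interval value are hypothetical, and nothing here is taken from Whoogle or from the private application mentioned above.

```python
import itertools
import time


class ProxyRotator:
    """Round-robin over proxies, keeping a minimum gap between uses of each one."""

    def __init__(self, proxies, min_interval, clock=time.monotonic, sleep=time.sleep):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval  # seconds between requests per proxy
        self._last_used = {}               # proxy -> time of its last use
        self._clock = clock
        self._sleep = sleep

    def acquire(self):
        """Return the next proxy, sleeping first if it was used too recently."""
        proxy = next(self._cycle)
        last = self._last_used.get(proxy)
        if last is not None:
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy
```

Each outgoing search would call `acquire()` and send the request through the returned proxy, for example via the `proxies=` argument of `requests.get()`. The injectable `clock` and `sleep` are there so the pacing can be tested without real delays.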


@JaneJeon commented on GitHub (Apr 19, 2022):

Actually, ignore all that bullshit I said above, @unixfox's method is 10000 times easier. We should do that.


@unixfox commented on GitHub (Aug 9, 2022):

New way of fetching the Google results (search, videos, news, images and more) with an internal API of Google and with **JSON results**: https://github.com/searxng/searxng/issues/1642! This doesn't have any rate limit.


@yannduran commented on GitHub (Dec 31, 2024):

Still happening at the end of 2024
