[GH-ISSUE #183] [BUG] Using Whoogle to scrape Google search results is very difficult due to strange class names #128

Closed
opened 2026-02-25 20:34:57 +03:00 by kerem · 3 comments
Owner

Originally created by @mendel5 on GitHub (Jan 24, 2021).
Original GitHub issue: https://github.com/benbusby/whoogle-search/issues/183

Describe the bug
I am using Whoogle to scrape Google search results. I am using Whoogle rather than Google directly because Google regularly asks my scraper to solve a reCAPTCHA.

My technical setup is Python with Selenium, using Firefox as the WebDriver. The rendered HTML is parsed with Beautiful Soup.

The naming scheme of the HTML classes on Whoogle's search result page makes it very difficult to scrape and parse what I am actually looking for. I would like to scrape the website's title, the description and the URL.

For example, the URLs on the search results page are contained in <div class="kCrYT">. However, there are other elements of the same class that don't contain a link. It would be very helpful for the <a href=""> tag to have its own class.

The website description is contained in <div class="BNeawe s3v9rd AP7Wnd">. The links to the previous page and the next page both use the same class="nBDE1b G5eFlf". It would be helpful for the two page links to have different classes or specific ids.

To Reproduce
Steps to reproduce the behavior:

  1. Open Whoogle and search for something
  2. Open the browser's developer tools and inspect the HTML elements
  3. Be confused by strange HTML class names

Deployment Method
Download git repo and use run executable

Version of Whoogle Search
Whoogle Search v0.3.0

kerem closed this issue and added the bug label on 2026-02-25 20:34:57 +03:00.

@mendel5 commented on GitHub (Jan 25, 2021):

As far as I can see the strange HTML class names are originating from Google and Whoogle just inherits them?


@benbusby commented on GitHub (Jan 25, 2021):

Yes, all classes are inherited from Google directly. They also change fairly frequently, as I've discovered in the past. It'd probably be better to use BeautifulSoup to look for patterns in the HTML rather than classes (i.e. extract all a tags with a h3 child, since that's how the results are formatted), but off the top of my head I'm not sure of an exact solution.


@mendel5 commented on GitHub (Jan 25, 2021):

> Yes, all classes are inherited from Google directly. They also change fairly frequently

That seems quite evil of Google... (see https://en.wikipedia.org/wiki/Don%27t_be_evil)

> It'd probably be better to use BeautifulSoup to look for patterns in the HTML rather than classes (i.e. extract all a tags with a h3 child, since that's how the results are formatted)

Thanks for the suggestion. I'll see what I can do.
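
The structural approach suggested above (collect every a tag that contains an h3, instead of matching on class names) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names ResultLinkParser and extract_results are hypothetical, the sample markup is made up to resemble the description in this issue rather than copied from real Whoogle output, and the sketch uses Python's standard-library html.parser instead of BeautifulSoup so that it runs without extra dependencies. It also accepts an h3 anywhere inside the link, not only as a direct child.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (title, url) pairs for every <a href> that contains an <h3>."""

    def __init__(self):
        super().__init__()
        self.results = []       # finished (title, url) pairs
        self._href = None       # href of the <a> we are currently inside
        self._in_h3 = False     # True while inside an <h3> within that <a>
        self._title_parts = []  # text fragments of the current <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href is not None:
            self._in_h3 = True
            self._title_parts = []

    def handle_data(self, data):
        if self._in_h3:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            title = "".join(self._title_parts).strip()
            self.results.append((title, self._href))
            self._in_h3 = False
        elif tag == "a":
            self._href = None


def extract_results(html):
    """Return a list of (title, url) tuples found in the given HTML string."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.results


# Made-up sample: one real result, one same-class div without a link,
# and a pagination link that has no <h3> and is therefore ignored.
sample = """
<div class="kCrYT"><a href="https://example.com/"><h3><div>Example Domain</div></h3></a></div>
<div class="kCrYT"><span>no link here</span></div>
<a class="nBDE1b G5eFlf" href="/search?start=10">Next</a>
"""
print(extract_results(sample))
# [('Example Domain', 'https://example.com/')]
```

With Selenium, the string passed to extract_results would presumably be driver.page_source. In BeautifulSoup the same idea can be expressed by iterating over soup.find_all("h3") and calling find_parent("a") on each heading. Because the match depends only on tag structure, it should keep working when Google renames its classes, but it would still break if the result markup itself changes.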

