[GH-ISSUE #904] [BUG] Most of the search terms are not bold in Chinese results #562

Closed
opened 2026-02-25 20:36:02 +03:00 by kerem · 6 comments
Originally created by @ghost on GitHub (Dec 11, 2022).
Original GitHub issue: https://github.com/benbusby/whoogle-search/issues/904

Describe the bug
Most of the search terms are not bold in Chinese results.

Whoogle results:
[Screenshot: 2022-12-11 at 6 43 14 PM]

Google results for reference:
[Screenshot: 2022-12-11 at 6 43 21 PM]

To Reproduce
Search "新聞" or any other term in Chinese.

Deployment Method

  • Docker

Version of Whoogle Search

  • Latest build from [source] (i.e. GitHub, Docker Hub, pip, etc)

Desktop (please complete the following information):

  • OS: MacOS and iOS
  • Browser: Safari
kerem 2026-02-25 20:36:02 +03:00
  • closed this issue
  • added the bug label

@ahmad-alkadri commented on GitHub (Jan 6, 2023):

I managed to reproduce this issue on my local instance and on other public instances of Whoogle.

I think the problem stems from the regular expression on this line:

https://github.com/benbusby/whoogle-search/blob/ccf9f06f2f9b3064594006ac9b3dfc4762819912/app/utils/results.py#L74

Basically, this regex looks for words matching the query, but only where the match stands on its own, with no other word characters directly before or after it. When it finds a match, the function wraps it in bold tags.

This is why the characters `新聞` in the string `Google 新聞` are bolded, giving `Google <b>新聞</b>`, while in `匯集了世界各地的新聞來源` they are not touched at all.
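The behavior described above can be reproduced with Python's `re` module. This is a minimal sketch, not the Whoogle code: `original_pattern` and `bold` are hypothetical helpers, and the pattern is the one quoted in this thread with the f-string escapes `{{}}` resolved to literal braces. In Python 3, `\b` is Unicode-aware and CJK ideographs count as word characters, so there is no word boundary between `的` and `新`.

```python
import re


def original_pattern(target_word: str) -> str:
    # Hypothetical helper: same shape as the pattern quoted in this thread.
    # In an f-string, {{}} produces the literal characters { and }.
    return rf'\b((?![{{}}<>-]){re.escape(target_word)}(?![{{}}<>-]))\b'


def bold(text: str, target_word: str) -> str:
    # Wrap every match of the pattern in <b> tags.
    return re.sub(original_pattern(target_word), r'<b>\1</b>', text)


# Preceded by a space and followed by end of string: \b matches on both sides.
standalone = bold('Google 新聞', '新聞')

# Surrounded by other ideographs (word characters): \b never matches.
embedded = bold('匯集了世界各地的新聞來源', '新聞')

print(standalone)
print(embedded)
```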

I think a simple fix would be to remove the `\b` anchors from `\b((?![{{}}<>-]){target_word}(?![{{}}<>-]))\b`, so that it becomes `((?![{{}}<>-]){target_word}(?![{{}}<>-]))`. I've tested this fix on my local instance and it seems to work for other Chinese characters as well:

[Screenshot: afterchange_regex_bold]

Comparison with the search result from a public instance with the original regex expression:

[Screenshot: beforechange_regex_bold]

The implication of this change is, of course, that other queries would also be bolded even when they don't stand on their own. For example, with the query `murakami`, the string `thismurakamiisanauthor` would become `this<b>murakami</b>isanauthor`. This would therefore be a design change.
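The side effect described above is easy to check in isolation. The sketch below assumes the same pattern with both `\b` anchors removed; `relaxed_pattern` and `bold_relaxed` are hypothetical names used only for illustration.

```python
import re


def relaxed_pattern(target_word: str) -> str:
    # Hypothetical helper: the quoted pattern without the \b anchors.
    return rf'((?![{{}}<>-]){re.escape(target_word)}(?![{{}}<>-]))'


def bold_relaxed(text: str, target_word: str) -> str:
    return re.sub(relaxed_pattern(target_word), r'<b>\1</b>', text)


# The Chinese term is now bolded even when surrounded by other ideographs...
chinese = bold_relaxed('匯集了世界各地的新聞來源', '新聞')

# ...but so is a Latin term embedded in a longer word.
latin = bold_relaxed('thismurakamiisanauthor', 'murakami')

print(chinese)
print(latin)
```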

I would like your opinion on this, @benbusby. If you agree with this modification, I can push it and open a PR. What do you think?


@benbusby commented on GitHub (Jan 7, 2023):

@ahmad-alkadri I think the best approach would be to look up the user's configured language and use a separate regex if that language is in a predefined list of languages (Chinese, Japanese, etc.) that should ignore whitespace on either side of the search term (your proposed solution). Otherwise it should continue to use the existing regex.

I'm not a language expert by any stretch, but at least with English searches, I don't think the expected behavior would be to bold every search term on the result page regardless of where it appears (e.g. a search containing "a" or "the" would bold a lot of terms that the user probably wouldn't care about). The way around this would be to support separate, language-dependent behaviors.
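A rough sketch of this language-dependent selection follows. Everything here is an assumption made for illustration: the language codes in `NO_BOUNDARY_LANGS`, the helper names, and the way the configured language reaches the function are all hypothetical and may not match how Whoogle stores its settings.

```python
import re

# Hypothetical language codes; the real configuration values may differ.
NO_BOUNDARY_LANGS = {'zh', 'ja'}


def pattern_for(target_word: str, lang: str) -> str:
    core = rf'((?![{{}}<>-]){re.escape(target_word)}(?![{{}}<>-]))'
    if lang in NO_BOUNDARY_LANGS:
        # Languages written without spaces: match anywhere.
        return core
    # Everything else keeps the word-boundary anchors.
    return rf'\b{core}\b'


def bold(text: str, target_word: str, lang: str) -> str:
    return re.sub(pattern_for(target_word, lang), r'<b>\1</b>', text)


# With boundaries, only the standalone "the" is bolded.
english = bold('the other theory', 'the', 'en')

# Without boundaries, "the" inside "other" and "theory" is bolded too.
english_relaxed = bold('the other theory', 'the', 'zh')

print(english)
print(english_relaxed)
```

The second call shows the over-bolding concern with short English terms when the anchors are dropped.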


@ahmad-alkadri commented on GitHub (Jan 7, 2023):

Thank you for your reply, @benbusby. Yes, I agree with you on this part:

> The way around this would be to support separate, language-dependent behaviors.

On the other hand, I think taking only the user's configured language into account might not be enough. There could be scenarios where the user has English as their configured language but the query isn't in English. For example, I sometimes search for things like `Kanji 友達 how to write`. Some results are as follows:

[Screenshot: screenshot-search garudalinux org-2023 01 07-08_11_22]

On that page, as you can see, bolding every instance of the query terms requires both the normal regex and the one that ignores whitespace.

Ideally, the app should detect whether each word on the result page contains Chinese characters or not, probably by using some kind of Unicode detection. The app should then implement the normal regex for the words that do not contain Chinese characters (thus, not ignoring whitespace) while for the words that contain Chinese characters, it would use the regex that ignores the whitespace. What do you think?
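One possible form of the "Unicode detection" mentioned above is a character-range check. The sketch below is an assumption, not the approach the project settled on: it only tests the CJK Unified Ideographs block and Extension A, and `contains_cjk_ideograph` is a hypothetical name.

```python
import re

# Assumption: only two Unicode blocks are checked.
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
#   U+4E00-U+9FFF  CJK Unified Ideographs
CJK_IDEOGRAPH = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]')


def contains_cjk_ideograph(word: str) -> bool:
    return CJK_IDEOGRAPH.search(word) is not None


# Classify each word of the mixed-language query from this comment.
query = 'Kanji 友達 how to write'
flags = {word: contains_cjk_ideograph(word) for word in query.split()}

for word, is_cjk in flags.items():
    print(word, is_cjk)
```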


@benbusby commented on GitHub (Jan 7, 2023):

> Ideally, the app should detect whether each word on the result page contains Chinese characters or not, probably by using some kind of Unicode detection. The app should then implement the normal regex for the words that do not contain Chinese characters (thus, not ignoring whitespace) while for the words that contain Chinese characters, it would use the regex that ignores the whitespace.

That sounds great. Rather than searching the result page though, you could actually just check if `target_word` in `bold_search_terms.replace_any_case` is Chinese and apply the regex that doesn't check for whitespace. That way each search term would be bolded differently. Maybe that's what you already meant, just wanted to clarify.
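A sketch of this per-term selection is shown below. It is not the body of `bold_search_terms.replace_any_case`; `pattern_for_term` and `bold_terms` are hypothetical helpers, and the ideograph ranges are an assumed stand-in for "is Chinese".

```python
import re

# Assumed check for "is Chinese": CJK Unified Ideographs plus Extension A.
CJK_IDEOGRAPH = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]')


def pattern_for_term(target_word: str) -> str:
    core = rf'((?![{{}}<>-]){re.escape(target_word)}(?![{{}}<>-]))'
    if CJK_IDEOGRAPH.search(target_word):
        # Term contains ideographs: skip the word-boundary anchors.
        return core
    return rf'\b{core}\b'


def bold_terms(text: str, query: str) -> str:
    # The pattern is chosen per search term, not per page or per language.
    for target_word in query.split():
        text = re.sub(pattern_for_term(target_word), r'<b>\1</b>', text)
    return text


result = bold_terms('how to write 友達の書き方, rewrite', '友達 write')
print(result)
```

Here `友達` is bolded inside a longer run of Japanese text, while `write` is bolded only as a standalone word and `rewrite` is left alone.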


@ahmad-alkadri commented on GitHub (Jan 7, 2023):

> That sounds great. Rather than searching the result page though, you could actually just check if `target_word` in `bold_search_terms.replace_any_case` is Chinese and apply the regex that doesn't check for whitespace. That way each search term would be bolded differently.

@benbusby I completely agree with this approach. I'll try to implement it and open a PR ASAP. I'll tag you once it's done. Thanks!


@ahmad-alkadri commented on GitHub (Jan 9, 2023):

Hi @benbusby, I opened the PR last night at #928. In the end I included not only Chinese characters but also Japanese and Korean. I'm available for any follow-up discussion on this. Thanks.
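For readers who want to see what covering Japanese and Korean could look like, here is a hedged sketch. It is not taken from the PR: the choice of Unicode blocks and the name `contains_cjk` are assumptions made for illustration.

```python
import re

# Assumed set of Unicode blocks; the PR may use a different selection.
CJK_CHAR = re.compile(
    r'['
    r'\u3040-\u309f'  # Hiragana
    r'\u30a0-\u30ff'  # Katakana
    r'\u3400-\u4dbf'  # CJK Unified Ideographs Extension A
    r'\u4e00-\u9fff'  # CJK Unified Ideographs
    r'\uac00-\ud7a3'  # Hangul Syllables
    r']'
)


def contains_cjk(word: str) -> bool:
    return CJK_CHAR.search(word) is not None


samples = ['新聞', 'ひらがな', 'カタカナ', '한국어', 'murakami']
results = [contains_cjk(sample) for sample in samples]

for sample, is_cjk in zip(samples, results):
    print(sample, is_cjk)
```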
