[GH-ISSUE #235] Search function not working with national characters #196

Closed
opened 2026-02-25 21:34:24 +03:00 by kerem · 18 comments
Owner

Originally created by @szilardx on GitHub (Nov 30, 2017).
Original GitHub issue: https://github.com/cypht-org/cypht/issues/235

Originally assigned to: @jasonmunro on GitHub.

I have realized that if my search term contains a national character, such as éáőúöüóíúűú, the search function cannot find the message.

Another question: Is search supposed to return partial matches? For example if I search for "every", should it return "everything"?

Thanks for the great work, it is a delight to use Cypht every day!

kerem 2026-02-25 21:34:24 +03:00
  • closed this issue
  • added the imap label
Author
Owner

@jasonmunro commented on GitHub (Nov 30, 2017):

Just pushed a potential fix for this. IMAP SEARCH supports a character set argument, which we were not setting. I added UTF-8 as the charset in the fix, which I think will resolve your issue.

https://github.com/jasonmunro/cypht/commit/9d68f31e72b11948c6c4776d97ba14b25bf6387b

AFAIK, search will return partial matches. A quick test here shows that it does for the IMAP servers I'm using. I don't see anything specific in the RFC about that however, so your IMAP implementation may behave differently than mine.
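On the partial-match question: RFC 3501 (section 6.4.4) states that for search keys that take strings, a message matches if the string is a substring of the field, and that matching is case-insensitive. A minimal sketch of that rule, using a hypothetical helper name that is not part of Cypht:

```python
def imap_string_key_matches(field_value: str, key: str) -> bool:
    """Illustrate the RFC 3501 rule for string search keys:
    case-insensitive substring matching."""
    return key.lower() in field_value.lower()


print(imap_string_key_matches("Everything is ready", "every"))  # True
```

Individual servers may still tokenize or index differently, so results can vary in practice.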


@szilardx commented on GitHub (Dec 1, 2017):

I am testing the fix.
For some reason, the Combined folders (Everything, Flagged) show no messages, and search does not work this way either.
Going back one commit seemed to fix this "no messages shown" issue.


@jasonmunro commented on GitHub (Dec 1, 2017):

Yikes, I will revert that change :) All the combined pages use IMAP SEARCH behind the scenes, so the change broke all of those for you. I misread the RFC, thinking UTF-8 was a required charset, but only US-ASCII is required. So I suspect your IMAP server does not support UTF-8, which is why nothing is being returned from the search. I'm not having any issues with Dovecot and a few Gmail accounts.

The search examples in the IMAP RFC suggest that we should use an "IMAP literal" to send non-ASCII search terms, but that is not clearly defined in the spec - only suggested by the examples. We could try changing that to see if it helps. I don't think it will however, because the search commands used by the combined views don't have actual search terms, only keywords to filter the results (like UNREAD or FLAGGED).


@jasonmunro commented on GitHub (Dec 1, 2017):

Change reverted. I left the line in the code that sets the charset to UTF-8, but commented it out. You could try modifying that and setting it to a charset your IMAP server supports to see if searching on non-ASCII characters starts working.

To fix this properly we probably need a per-IMAP-server charset setting. The problem is there is no way I know of to query the IMAP server for charsets that are supported. I will keep thinking about it!


@dumblob commented on GitHub (Dec 1, 2017):

@jasonmunro we might support a per-IMAP-server setting, "auto guess encoding/charset", which would take the encoding of the emails delivered by the IMAP server (I mean "IMAP header info" and not Content-Type from the email data themselves).


@jasonmunro commented on GitHub (Dec 1, 2017):

I'm not sure I follow, @dumblob. What IMAP header info are you referring to?


@szilardx commented on GitHub (Dec 2, 2017):

@jasonmunro This is interesting because, if I understand you correctly, you don't have any issues with the UTF-8 setting and Gmail.

If I enable the commented line with UTF-8, I have nothing in the combined folders.
I only have Gmail accounts: 3 regular, and 1 provided by my company (I guess Gmail for business).
I use the accounts in Hungarian.

The only difference I have from the latest commit is an old hm3.ini.

I am googling for Gmail, national characters, and UTF-8 issues.
One idea I had for testing was to change my Gmail to English (US), but it did not help.


@jasonmunro commented on GitHub (Dec 2, 2017):

@szilardx I retested and you are correct, it's totally NOT working for Gmail (or any IMAP server I'm using). Not sure how I thought it was working! Anyway, it turns out the command format for search with a charset included was incorrect. I made a change here https://github.com/jasonmunro/cypht/commit/ecc15557835ff5007468daaae2f267b5942f312b that seems to actually work :)


@szilardx commented on GitHub (Dec 2, 2017):

Thanks! Now everything functions normally.
But search terms with national characters do not work.
You can try it by sending a mail with the subject "levél" (mail :) in Hungarian), and then searching for it.
Maybe this has something to do with this (I am no expert here): https://stackoverflow.com/questions/11517375/search-utf-8-string-with-gmail-x-gm-raw-imap-command ?

Thank you very much!


@dumblob commented on GitHub (Dec 3, 2017):

@jasonmunro I meant the issue described in https://stackoverflow.com/a/11590440 .


@jasonmunro commented on GitHub (Dec 4, 2017):

@dumblob The solution in that post is Gmail specific, I'm hoping to avoid that :). And we can't use the format of a message in an IMAP mailbox as an indicator of what character sets the IMAP server supports, because those are externally created. I could have a message in Big5 encoding, another one in UTF-8, and another one in US-ASCII, all in my INBOX. This does not mean the IMAP server understands or knows how to deal with those character sets - that is the job of the client when rendering the message. IMAP servers MUST support US-ASCII, but any other character set is optional.

With that said, it appears most IMAP servers support UTF-8, and input from Cypht is already in UTF-8, so if we are going to attempt to set a search character set, I think UTF-8 is the best way to go.

@szilardx With regards to the current code - I created a few messages with "levél" as a subject, and the search actually works with Dovecot, but still fails with Gmail. I think the reason is Gmail expects the non-ASCII search term formatted as an "IMAP literal". IMAP clients and servers are allowed to send literals for any variable data. Typically they are used to send/fetch data that has embedded newlines that would otherwise be seen by the IMAP client or server as an "end of command" indicator.

So I made changes to send the search term as a literal, and once again it works in Dovecot, but actually hangs on a blocking socket in Gmail :(
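To illustrate what sending the search term as a literal looks like on the wire, here is a minimal sketch. It is not Cypht's code (Cypht is written in PHP); the function name, the tag "A1", and the SUBJECT field are assumptions for illustration. The client first sends a line ending in a byte count in braces, waits for the server's continuation response, and only then sends the raw bytes.

```python
def build_search_with_literal(tag, field, term):
    """Split a SEARCH command into the line that announces a literal
    and the literal data sent after the server's continuation response."""
    data = term.encode("utf-8")
    # {N} announces a literal of N bytes (bytes, not characters).
    first_line = (
        f"{tag} SEARCH CHARSET UTF-8 {field} {{{len(data)}}}\r\n".encode("ascii")
    )
    # These bytes go out only after the server answers with a "+ ..." line.
    literal = data + b"\r\n"
    return first_line, literal


first_line, literal = build_search_with_literal("A1", "SUBJECT", "levél")
print(first_line)  # b'A1 SEARCH CHARSET UTF-8 SUBJECT {6}\r\n'
print(literal)     # b'lev\xc3\xa9l\r\n'
```

Note that "levél" is five characters but six bytes in UTF-8, which is why the count must be taken after encoding.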

Anyway, I'm still working on it, I will let you know when I have something new to test!


@szilardx commented on GitHub (Dec 4, 2017):

Thanks for the update, and the explanation!


@jasonmunro commented on GitHub (Dec 4, 2017):

Ok, figured out why Gmail was hanging. When submitting the IMAP literal, we first send the number of bytes in the literal, the server returns with a "+ OK" meaning we can proceed with the literal data. Gmail however returns with "+ go ahead", and our IMAP read response routine was expecting "OK". The code has been updated to just check for the leading "+", and searching on non-ASCII terms is now working with Gmail! @szilardx if you can test the latest, that would be great!
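The difference between the two checks can be sketched as follows. Both function names are hypothetical, and the strict variant is only a guess at the shape of the old behavior, not the actual Cypht routine. In IMAP a continuation request is identified by the leading "+"; the text after it is free-form.

```python
def is_continuation_strict(line: bytes) -> bool:
    """Hypothetical stand-in for the old check: only accepts '+ OK'."""
    return line.strip() == b"+ OK"


def is_continuation(line: bytes) -> bool:
    """Accept any continuation request: only the leading '+' matters."""
    return line.startswith(b"+")


for reply in (b"+ OK\r\n", b"+ go ahead\r\n", b"* 3 EXISTS\r\n"):
    print(reply, is_continuation_strict(reply), is_continuation(reply))
```

With the strict check, a client waiting on a blocking socket never recognizes "+ go ahead" and so never sends the literal, which matches the hang described above.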

Note I still have not solved how we will determine the character set to use per IMAP server; as of now it's being set to UTF-8 for all of them.


@szilardx commented on GitHub (Dec 4, 2017):

@jasonmunro Just tested it, works perfectly with my test cases. Thank you very much! 👍


@dumblob commented on GitHub (Dec 4, 2017):

@jasonmunro I'm sorry, I still haven't caught up ;) I don't see anything specific to Gmail in the post I linked. What I meant is mainly the notion of modified UTF-7, which is in the IMAP standard (https://tools.ietf.org/html/rfc3501#section-5.1.3). Does this answer the question "Which encoding does a specific IMAP server use?"?


@jasonmunro commented on GitHub (Dec 4, 2017):

@dumblob I see what you are saying, thanks for the clarification. Modified UTF-7 does not apply to search terms as far as I know. It may work with X-GM-RAW as mentioned in that post, however that is a Gmail specific search extension (and honestly I would be surprised if modified UTF-7 works even then).
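For context, section 5.1.3 of RFC 3501 defines modified UTF-7 as the convention for international mailbox names, which is separate from how search terms are sent. A rough sketch of the encoding direction only, with a hypothetical function name: printable ASCII stays as-is, "&" becomes "&-", and everything else is UTF-16BE in base64 with "," in place of "/" and no padding.

```python
import base64


def encode_modified_utf7(name: str) -> str:
    """Encode a mailbox name using the RFC 3501 modified UTF-7 rules."""
    out = []
    pending = []  # run of characters that must be base64-encoded

    def flush():
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:  # printable US-ASCII
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(encode_modified_utf7("Levél"))  # Lev&AOk-l
```

A folder named "Levél" would therefore appear as "Lev&AOk-l" in LIST and SELECT, while a SEARCH for the subject "levél" still needs a charset and the raw encoded bytes.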

It's a tricky situation really. If the search term contains non-ASCII characters, and no character set is defined, the IMAP server will end up searching for a match against an inaccurate representation of the search term (its US-ASCII version). If a character set is defined, AND the IMAP server supports it, it can run the search more effectively - but even then only if the encoded headers can be converted to a universal character set to compare against the search term. If a character set is defined in the search and the IMAP server does NOT support it, it returns a BAD/NO response to the request.
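On the failure case: RFC 3501 says a server that does not support the requested charset must answer with a tagged NO, which should carry a BADCHARSET response code that may list the charsets the server does support. Whether a given server actually includes that list is not guaranteed. A minimal sketch of reading it, with a hypothetical function name and made-up sample responses:

```python
import re

# Matches "[BADCHARSET]" or "[BADCHARSET (cs1 cs2 ...)]" in a tagged response.
_BADCHARSET = re.compile(rb"\[BADCHARSET(?: \(([^)]*)\))?\]", re.IGNORECASE)


def parse_badcharset(tagged_line: bytes):
    """Return None when there is no BADCHARSET code, otherwise the list
    of charsets the server advertised (empty if it listed none)."""
    match = _BADCHARSET.search(tagged_line)
    if match is None:
        return None
    listed = match.group(1)
    if not listed:
        return []
    return [name.strip('"') for name in listed.decode("ascii").split()]


print(parse_badcharset(b"A1 NO [BADCHARSET (US-ASCII ISO-8859-1)] unsupported"))
print(parse_badcharset(b"A1 NO [BADCHARSET] unsupported"))
print(parse_badcharset(b"A1 OK SEARCH completed"))
```

A client could use a non-empty list to retry the search in a supported charset, and fall back to an ASCII-only search otherwise.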

Cypht uses UTF-8 as the "universal character set". If you open a message encoded in Big5 that contains Chinese characters, we convert it to UTF-8 for display. In this way we can display an INBOX listing in which message subjects are encoded in different ways on the same page - just convert them all to UTF-8. I have never written an IMAP server search algo, but I suspect it works in a similar way:

  • Convert the supplied search term to a universal character set
  • Fetch the header (or body or whatever) for each message and convert it to the universal character set
  • Check for a string match
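The three steps above can be sketched with Python's standard `email.header` module. This is only a guess at how a server might do it, not any real server's algorithm; the function name is hypothetical and Unicode serves as the universal character set.

```python
from email.header import decode_header, make_header


def header_matches(raw_header: str, term: str) -> bool:
    """Decode a MIME encoded-word header to Unicode, then do a
    case-insensitive substring comparison against the search term."""
    decoded = str(make_header(decode_header(raw_header)))
    return term.casefold() in decoded.casefold()


# The same subject, "levél", stored in two different encodings.
print(header_matches("=?ISO-8859-2?Q?lev=E9l?=", "levél"))  # True
print(header_matches("=?UTF-8?B?bGV2w6ls?=", "LEVÉL"))      # True
```

The sketch only works because Python knows both ISO-8859-2 and UTF-8. A server missing the codec for a header's charset cannot complete the conversion step, which is the limitation the rest of this comment describes.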

The issue then becomes: what character sets does the IMAP server software support for both the search terms supplied, and the MIME word encoded headers (or encoded message parts) in the mailbox?

So even if we use a well supported "super-set" character set like UTF-8, we cannot be 100% sure that the IMAP server can understand how to convert a Big5 encoded string to do a proper comparison. Nor can we assume the IMAP server supports UTF-8 encoded search terms at all, since the only required charset is US-ASCII.

With all that said, I think what we have is an improvement, and I think we should add one more bit of logic to the equation. Currently we are setting UTF-8 on all search terms. I want to modify that to only set UTF-8 if the search term has non-ASCII characters. This way searches will generally work even for older IMAP servers that may not understand UTF-8, and will work for non-ASCII search terms for servers that support them.
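A minimal sketch of that last bit of logic, assuming a hypothetical helper that returns the arguments to place right after SEARCH (this is not the actual Cypht change):

```python
def charset_args(term: str) -> list:
    """Return the CHARSET arguments for a SEARCH command, if any."""
    # Pure ASCII terms need no charset, so servers that only implement
    # the mandatory US-ASCII still accept the command.
    if term.isascii():
        return []
    return ["CHARSET", "UTF-8"]


print(charset_args("every"))  # []
print(charset_args("levél"))  # ['CHARSET', 'UTF-8']
```

Keyword-only searches such as the ones behind the combined views carry no term at all, so under this rule they would never include a charset.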


@dumblob commented on GitHub (Dec 5, 2017):

> Modified UTF-7 does not apply to search terms as far as I know.

Ok, this was the game changer. I didn't know it does not apply to search terms in general. Now everything is clear to me and I understand why it's so difficult.

What I would do is read all existing open source IMAP server implementations and take a look at how they handle searching, which encoding they use, which "extensions" to the IMAP standard they support, etc. Then I'd try to decide what to do next.

I have to admit I love reading code and evaluating it. I just won't have enough time to do that until Christmas. I'd propose leaving this issue open or creating a new, specific one, so that it'll stay in my TODO list.


@jasonmunro commented on GitHub (Dec 21, 2017):

I added the last bit for this, only setting the search charset to UTF-8 if the search terms are non-ASCII. I'm going to close this and open a new issue for @dumblob WRT assessing character set support in major open source IMAP servers.
