mirror of
https://github.com/cypht-org/cypht.git
synced 2026-04-25 13:05:53 +03:00
[GH-ISSUE #235] Search function not working with national characters #196
Originally created by @szilardx on GitHub (Nov 30, 2017).
Original GitHub issue: https://github.com/cypht-org/cypht/issues/235
Originally assigned to: @jasonmunro on GitHub.
I have realized that if my search term contains a national character, like éáőúöüóíúűú, the search function cannot find the message.
Another question: Is search supposed to return partial matches? For example if I search for "every", should it return "everything"?
Thanks for the great work, it is a delight to use Cypht every day!
@jasonmunro commented on GitHub (Nov 30, 2017):
Just pushed a potential fix for this. IMAP SEARCH supports a character set argument, which we were not setting. I added UTF-8 as the charset in the fix, I think it will resolve your issue.
github.com/jasonmunro/cypht@9d68f31e72
AFAIK, search will return partial matches. A quick test here shows that it does for the IMAP servers I'm using. I don't see anything specific in the RFC about that however, so your IMAP implementation may behave differently than mine.
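For readers following along, here is a minimal sketch of what a SEARCH command with a charset argument looks like on the wire. It is not Cypht's code: `build_search_command` is a hypothetical helper, and the sketch assumes the search criteria are plain ASCII. Per RFC 3501, the optional `CHARSET` specification comes directly after the `SEARCH` keyword.

```python
from typing import Optional


def build_search_command(tag: str, criteria: str, charset: Optional[str] = None) -> bytes:
    """Build one IMAP SEARCH command line (hypothetical helper, not Cypht code)."""
    parts = [tag, "SEARCH"]
    if charset is not None:
        # The optional charset specification goes directly after SEARCH.
        parts += ["CHARSET", charset]
    parts.append(criteria)
    # This sketch only handles ASCII criteria; encode() raises otherwise.
    return (" ".join(parts) + "\r\n").encode("ascii")


print(build_search_command("A1", "UNSEEN"))
print(build_search_command("A2", 'SUBJECT "every"', charset="UTF-8"))
```

A server that does not support the named charset is expected to reject the command, so adding the argument unconditionally can break searches that previously worked.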
@szilardx commented on GitHub (Dec 1, 2017):
I am testing the fix.
For some reason, the Combined folders (Everything, Flagged) show no messages, and search does not work this way either.
Going back one commit seemed to fix the "no messages shown" issue.
@jasonmunro commented on GitHub (Dec 1, 2017):
Yikes, I will revert that change :) All the combined pages use IMAP SEARCH behind the scenes, so the change broke all of those for you. I misread the RFC thinking UTF-8 was a required charset, but only US-ASCII is required. So I suspect your IMAP server does not support UTF-8 which is why nothing is being returned from the search. I'm not having any issues with dovecot and a few Gmail accounts.
The search examples in the IMAP RFC suggest that we should use an "IMAP literal" to send non-ASCII search terms, but that is not clearly defined in the spec - only suggested by the examples. We could try changing that to see if it helps. I don't think it will however, because the search commands used by the combined views don't have actual search terms, only keywords to filter the results (like UNREAD or FLAGGED).
@jasonmunro commented on GitHub (Dec 1, 2017):
Change reverted. I left the line in the code that set the charset to UTF-8 but commented it out. You could try modifying that and setting it to a charset your imap server supports to see if searching on non-ascii characters starts working.
To fix this properly we probably need a per imap server charset setting. The problem is there is no way I know of to query the IMAP server for charsets that are supported. I will keep thinking about it!
@dumblob commented on GitHub (Dec 1, 2017):
@jasonmunro we might support a per-IMAP setting "auto guess encoding/charset", which would take the encoding of the emails delivered by the IMAP servers (I mean "IMAP header info" and not Content-Type from the email data themselves).
@jasonmunro commented on GitHub (Dec 1, 2017):
I'm not sure I follow @dumblob . What IMAP header info are you referring to?
@szilardx commented on GitHub (Dec 2, 2017):
@jasonmunro This is interesting because if I understand you correctly, you don't have any issues with the utf-8 setting, and gmail.
If I enable the commented line, with utf-8 I have nothing in the combined folders.
I only have gmail accounts. 3 regular, and 1 provided by my company (I guess gmail for business).
I use the accounts in Hungarian.
Only difference I have from the latest commit is an old hm3.ini.
I am googling for gmail, and national characters, utf-8 issues.
One idea I had for testing is to change my gmail to English(US), but it did not help.
@jasonmunro commented on GitHub (Dec 2, 2017):
@szilardx I retested and you are correct, it's totally NOT working for Gmail (or any IMAP server I'm using). Not sure how I thought it was working! Anyway, it turns out the command format for search with the charset included was incorrect. I made a change here: github.com/jasonmunro/cypht@ecc1555783 that seems to actually work :)
@szilardx commented on GitHub (Dec 2, 2017):
Thanks! Now everything functions normally.
But search terms with national characters still do not work.
You can try it by sending a mail with the subject "levél" ("mail" in Hungarian :)), and then searching for it.
Maybe this has something to do with it (I am no expert here): https://stackoverflow.com/questions/11517375/search-utf-8-string-with-gmail-x-gm-raw-imap-command ?
Thank you very much!
@dumblob commented on GitHub (Dec 3, 2017):
@jasonmunro I meant the issue described in https://stackoverflow.com/a/11590440 .
@jasonmunro commented on GitHub (Dec 4, 2017):
@dumblob The solution in that post is Gmail specific, I'm hoping to avoid that :). And we can't use the format of a message in an IMAP mailbox as an indicator of what character sets the IMAP server supports, because those messages are externally created. I could have a message in BIG 5 encoding, another one in UTF-8, and another one in US-ASCII all in my INBOX. This does not mean the IMAP server understands or knows how to deal with those character sets - that is the job of the client when rendering the message. IMAP servers MUST support US-ASCII, but any other character set is optional.
With that said, it appears most IMAP servers support UTF-8, and input from Cypht is already in UTF-8, so if we are going to attempt to set a search character set, I think UTF-8 is the best way to go.
@szilardx With regards to the current code - I created a few messages with "levél" as a subject, and the search actually works with Dovecot, but still fails with Gmail. I think the reason is Gmail expects the non-ASCII search term formatted as an "IMAP literal". IMAP clients and servers are allowed to send literals for any variable data. Typically they are used to send/fetch data that has embedded newlines that would otherwise be seen by the IMAP client or server as an "end of command" indicator.
So I made changes to send the search term as a literal, and once again it works in Dovecot, but actually hangs on a blocking socket in Gmail :(
Anyway, I'm still working on it, I will let you know when I have something new to test!
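To make the "IMAP literal" idea above concrete, here is a minimal sketch under stated assumptions, not the actual Cypht change. `search_with_literal` is a hypothetical helper. A literal announces its byte count in braces at the end of the command line, and the raw bytes follow only after the server sends a continuation response.

```python
from typing import Tuple


def search_with_literal(tag: str, key: str, term: str) -> Tuple[bytes, bytes]:
    """Split a SEARCH into the two writes a non-synchronizing-free literal needs.

    Hypothetical helper: returns (command line announcing the literal,
    literal payload to send after the server's continuation response).
    """
    payload = term.encode("utf-8")
    # The count in braces is the number of bytes, not characters.
    announce = f"{tag} SEARCH CHARSET UTF-8 {key} {{{len(payload)}}}\r\n".encode("ascii")
    # The command is terminated by CRLF after the literal data.
    return announce, payload + b"\r\n"


announce, data = search_with_literal("A3", "SUBJECT", "levél")
print(announce)  # the client must wait for a "+" response before sending data
print(data)
```

Note that "levél" is five characters but six bytes in UTF-8, which is why the byte length has to be computed after encoding.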
@szilardx commented on GitHub (Dec 4, 2017):
Thanks for the update, and the explanation!
@jasonmunro commented on GitHub (Dec 4, 2017):
Ok, figured out why Gmail was hanging. When submitting the IMAP literal, we first send the number of bytes in the literal, the server returns with a "+ OK" meaning we can proceed with the literal data. Gmail however returns with "+ go ahead", and our IMAP read response routine was expecting "OK". The code has been updated to just check for the leading "+", and searching on non-ASCII terms is now working with Gmail! @szilardx if you can test the latest, that would be great!
Note I still have not solved how we will determine the character set to use per IMAP server; as of now it's being set to UTF-8 for all of them.
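The Gmail hang described above comes down to how the continuation response is recognized. A minimal sketch of the relaxed check, assuming the response line has already been read from the socket as bytes (`is_continuation` is a hypothetical name, not Cypht's function):

```python
def is_continuation(line: bytes) -> bool:
    """Return True if the server line is a command continuation request.

    Only the leading "+" is significant; the text after it is free-form
    and differs between servers.
    """
    return line.startswith(b"+")


for response in (b"+ OK\r\n", b"+ go ahead\r\n", b"A3 BAD could not parse command\r\n"):
    print(response, is_continuation(response))
```

Matching on the full text "+ OK" would treat Gmail's "+ go ahead" as an unknown line, leaving the client waiting on a blocking read while the server waits for the literal.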
@szilardx commented on GitHub (Dec 4, 2017):
@jasonmunro Just tested it, works perfectly with my test cases. Thank you very much! 👍
@dumblob commented on GitHub (Dec 4, 2017):
@jasonmunro I'm sorry, I still didn't catch up ;) I don't see anything specific to Gmail in the post I linked. What I meant is mainly the notion of modified UTF-7, which is in the IMAP standard. Does this answer the question "Which encoding does a specific IMAP server use?"?
@jasonmunro commented on GitHub (Dec 4, 2017):
@dumblob I see what you are saying, thanks for the clarification. Modified UTF-7 does not apply to search terms as far as I know. It may work with X-GM-RAW as mentioned in that post, however that is a Gmail specific search extension (and honestly I would be surprised if modified UTF-7 works even then).
It's a tricky situation really. If the search term contains non-ASCII characters, and no character set is defined, the IMAP server will end up searching for a match against an inaccurate representation of the search term (its US-ASCII version). If a character set is defined, AND the IMAP server supports it, it can run the search more effectively - but even then only if the encoded headers can be converted to a universal character set to compare against the search term. If a character set is defined in the search and the IMAP server does NOT support it, it returns a BAD/NO response to the request.
Cypht uses UTF-8 as the "universal character set". If you open a message encoded in BIG 5 that contains Chinese characters, we convert it to UTF-8 for display. In this way we can display an INBOX listing in which message subjects are encoded in different ways on the same page - just convert them all to UTF-8. I have never written an IMAP server search algo, but I suspect it works in a similar way:
The issue then becomes - what character sets does the IMAP server software support for both the search terms supplied, and the MIME word encoded headers (or encoded message parts) in the mailbox.
So even if we use a well supported "super-set" character set like UTF-8, we cannot be 100% sure that the IMAP server can understand how to convert a BIG 5 encoded string to do a proper comparison. Nor can we assume the IMAP server supports UTF-8 encoded search terms at all since the only required charset is US-ASCII.
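As an illustration of the "convert everything to one universal character set" idea, here is a minimal Python sketch that decodes MIME encoded-word headers in different charsets into ordinary Unicode text. It is only an analogy for what a client (or a server's search code) has to do, not Cypht's implementation; `header_to_text` is a hypothetical helper built on the standard library's `email.header.decode_header`.

```python
import base64
from email.header import decode_header


def header_to_text(raw: str) -> str:
    """Decode a header that may contain MIME encoded-words into Unicode text."""
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Undeclared parts are treated as ASCII; undecodable bytes are replaced.
            pieces.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            pieces.append(chunk)
    return "".join(pieces)


# A Hungarian subject in ISO-8859-2 using Q-encoding.
latin2_subject = "=?iso-8859-2?Q?lev=E9l?="
# A Chinese subject in Big5 using base64, built here instead of hard-coded.
big5_subject = "=?big5?B?" + base64.b64encode("中文".encode("big5")).decode("ascii") + "?="

print(header_to_text(latin2_subject))
print(header_to_text(big5_subject))
```

Both subjects end up as comparable Unicode strings, but only because this code knows the ISO-8859-2 and Big5 codecs. A server lacking a codec for some stored header has nothing reliable to compare a UTF-8 search term against.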
With all that said, I think what we have is an improvement, and I think we should add one more bit of logic to the equation. Currently we are setting UTF-8 on all search terms. I want to modify that to only set UTF-8 if the search term has non-ASCII characters. This way searches will generally work even for older IMAP servers that may not understand UTF-8, and will work for non-ASCII search terms for servers that support them.
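The extra bit of logic proposed above can be sketched in a few lines. This is an illustration only, with a hypothetical function name: request UTF-8 only when the term actually contains non-ASCII characters, and otherwise send no charset at all.

```python
from typing import Optional


def search_charset_for(term: str) -> Optional[str]:
    """Return the charset to declare for a search term, or None for plain ASCII."""
    # str.isascii() requires Python 3.7 or newer.
    return None if term.isascii() else "UTF-8"


for term in ("every", "levél"):
    print(term, search_charset_for(term))
```

With this check, ASCII-only searches (including the keyword-only searches used by the combined views) never send a charset, so servers that only implement US-ASCII are unaffected.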
@dumblob commented on GitHub (Dec 5, 2017):
OK, this was the game changer. I didn't know it does not apply to search terms in general. Now everything is clear to me and I understand why it's so difficult.
What I would do is to read all existing open source IMAP server implementations and just take a look how they handle searching, which encoding they use, which "extensions" to the IMAP standard they support etc. Then I'd try to decide what to do next.
I have to admit I love reading code and evaluating it. I just won't have enough time to do that before Christmas. I'd propose leaving this issue open or creating a new, specific one, so that it stays on my TODO list.
@jasonmunro commented on GitHub (Dec 21, 2017):
I added the last bit for this, only setting the search charset to UTF-8 if the search terms are non-ASCII. I'm going to close this and open a new issue for @dumblob WRT assessing character set support in major OS IMAP servers.