[GH-ISSUE #782] Problem when converting from byte to UTF8 #703

Closed
opened 2026-02-27 11:13:07 +03:00 by kerem · 11 comments
Owner

Originally created by @carragom on GitHub (Nov 13, 2015).
Original GitHub issue: https://github.com/modoboa/modoboa/issues/782

Hi there,

I have some data in my amavis database that's rendering modoboa unusable. All parts of the system throw an internal server error. I have traced the issue to the file modoboa/extensions/amavis/sql_conector.py. The problem happens with all queries that use the function convert_from(maddr.email, 'UTF8'). Turns out if I manually run the query directly in PostgreSQL it throws the error:

ERROR: invalid byte sequence for encoding "UTF8": 0xf1657340

There seems to be data in there that's not UTF8 compliant. At first I saw #677 and upgraded to 1.2.2 thinking it was the same problem. But this did not fixed the issue. In the end I replaced all convert_from functions in modoboa/extensions/amavis/sql_conector.py to use LATIN1 instead of UTF8 and the issue is now fixed.

In the end, the problematic character was a ñ. I know RFC6531 states that email addresses could be encoded in UTF8 but amavis does not seem to be playing by those rules; yet?. An immediate solution is to switch to LATIN1 and hope it's the right encoding.

As an alternate solution the data could be retrieved in binary form from the database and converted in python maybe even a django filter that would fallback to something like "invalid address" if the conversion fails. But this would force modoboa to retrive all record from the database and in my case that would be around 7K and counting.

Hope this helps.
Cheers.

Originally created by @carragom on GitHub (Nov 13, 2015). Original GitHub issue: https://github.com/modoboa/modoboa/issues/782 Hi there, I have some data in my amavis database that's rendering modoboa unusable. All parts of the system throw an internal server error. I have traced the issue to the file `modoboa/extensions/amavis/sql_conector.py`. The problem happens with all queries that use the function `convert_from(maddr.email, 'UTF8')`. Turns out if I manually run the query directly in PostgreSQL it throws the error: ``` ERROR: invalid byte sequence for encoding "UTF8": 0xf1657340 ``` There seems to be data in there that's not UTF8 compliant. At first I saw #677 and upgraded to 1.2.2 thinking it was the same problem. But this did not fixed the issue. In the end I replaced all `convert_from` functions in `modoboa/extensions/amavis/sql_conector.py` to use `LATIN1` instead of `UTF8` and the issue is now fixed. In the end, the problematic character was a `ñ`. I know RFC6531 states that email addresses could be encoded in `UTF8` but amavis does not seem to be playing by those rules; yet?. An immediate solution is to switch to `LATIN1` and hope it's the right encoding. As an alternate solution the data could be retrieved in binary form from the database and converted in python maybe even a django filter that would fallback to something like "invalid address" if the conversion fails. But this would force modoboa to retrive all record from the database and in my case that would be around 7K and counting. Hope this helps. Cheers.
kerem closed this issue 2026-02-27 11:13:07 +03:00
Author
Owner

@tonioo commented on GitHub (Nov 16, 2015):

Hi,

as for the Quarantine.mail_text field, we could try to use a BinaryField and to remove all calls to convert_from.

Do you think you could try it ?

<!-- gh-comment-id:157090560 --> @tonioo commented on GitHub (Nov 16, 2015): Hi, as for the Quarantine.mail_text field, we could try to use a BinaryField and to remove all calls to convert_from. Do you think you could try it ?
Author
Owner

@tonioo commented on GitHub (Nov 27, 2015):

@carragom ping

<!-- gh-comment-id:160172791 --> @tonioo commented on GitHub (Nov 27, 2015): @carragom ping
Author
Owner

@carragom commented on GitHub (Dec 18, 2015):

Hi @tonioo sorry for the absence. At least in version 1.2.2 Quarantine.mail_text already is a BinaryField as shown here or am I looking at the wrong place ?. Maybe I did not understand what you meant ?

<!-- gh-comment-id:165879568 --> @carragom commented on GitHub (Dec 18, 2015): Hi @tonioo sorry for the absence. At least in version `1.2.2` `Quarantine.mail_text` already is a `BinaryField` as shown [here](/tonioo/modoboa/blob/1.2.2/modoboa/extensions/amavis/models.py#L165) or am I looking at the wrong place ?. Maybe I did not understand what you meant ?
Author
Owner

@tonioo commented on GitHub (Dec 18, 2015):

Hi @carragom, your problem seems to be related to the email field:
https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/models.py#L20

<!-- gh-comment-id:165889497 --> @tonioo commented on GitHub (Dec 18, 2015): Hi @carragom, your problem seems to be related to the email field: https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/models.py#L20
Author
Owner

@carragom commented on GitHub (Dec 18, 2015):

@tonioo Yes that's the field causing the problems. Like I said, I see two options to fix this:

1- Keep the conversion in the database as it's now, but ask the database to convert to LATIN1 instead of UTF8 in all occurrences of the convert_from function here. I did this and it's working for me. I'm not sure if this is the right encoding but for sure it's better than using UTF8 that we already know it breaks.

2- Switch to using BinaryField for the Maddr.email and handle the conversion in python. Just keep in mind that the number of rows in the Maddr table grows fast, when I reported this a month ago the table was around 7K rows, right now it's sitting at 12K rows. So doing this conversion out of the database could be a performance issue.

I have been looking around here to see if there is any indication on what encoding it's actually used without luck. But it does mention that the option $sql_allow_8bit_address needs to be set to use this field as bytea. So LATIN1 sounds like a safe encoding to use.

If you ask me I would just go option number 1 which is simple to implement and have no performance issues.

I can provide a quick PR for option 1 if you decide to go with it.
Let me know.

<!-- gh-comment-id:165911029 --> @carragom commented on GitHub (Dec 18, 2015): @tonioo Yes that's the field causing the problems. Like I said, I see two options to fix this: 1- Keep the conversion in the database as it's now, but ask the database to convert to `LATIN1` instead of `UTF8` in all occurrences of the `convert_from` function [here](https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/sql_connector.py). I did this and it's working for me. I'm not sure if this is the right encoding but for sure it's better than using `UTF8` that we already know it breaks. 2- Switch to using `BinaryField` for the `Maddr.email` and handle the conversion in python. Just keep in mind that the number of rows in the `Maddr` table grows fast, when I reported this a month ago the table was around 7K rows, right now it's sitting at 12K rows. So doing this conversion out of the database could be a performance issue. I have been looking around [here](https://www.ijs.si/software/amavisd/README.sql-pg.txt) to see if there is any indication on what encoding it's actually used without luck. But it does mention that the option `$sql_allow_8bit_address` needs to be set to use this field as `bytea`. So `LATIN1` sounds like a safe encoding to use. If you ask me I would just go option number 1 which is simple to implement and have no performance issues. I can provide a quick PR for option 1 if you decide to go with it. Let me know.
Author
Owner

@tonioo commented on GitHub (Jan 27, 2016):

@carragom Using LATIN1 as encoding won't cover all cases. I guess we will encounter the same issues with another encoding soon. I still think a BinaryField is the right answer and I do hope Django uses the appropriate field when it generates queries. The manual conversion you see in the current code would also disappear.

<!-- gh-comment-id:175688169 --> @tonioo commented on GitHub (Jan 27, 2016): @carragom Using LATIN1 as encoding won't cover all cases. I guess we will encounter the same issues with another encoding soon. I still think a BinaryField is the right answer and I do hope Django uses the appropriate field when it generates queries. The manual conversion you see in the current code would also disappear.
Author
Owner

@tonioo commented on GitHub (Jan 27, 2016):

@carragom BTW, the right place for this issue is into the https://github.com/modoboa/modoboa-amavis repository.

<!-- gh-comment-id:175692310 --> @tonioo commented on GitHub (Jan 27, 2016): @carragom BTW, the right place for this issue is into the https://github.com/modoboa/modoboa-amavis repository.
Author
Owner

@carragom commented on GitHub (Jan 27, 2016):

@tonioo I agree that LATIN1 does not cover all cases and it's far from ideal. The one thing for sure is that this problem renders Modoboa unusable, all of it, not just the amavis module. So this should be fixed in any way necessary.

It might be possible that amavis does not intent for these fields to be used as text, from the amavis README.sql-pg.txt:

Upgrade note: field quarantine.mail_text should be of data type 'bytea'
and not 'text' as suggested in earlier documentation; this is to prevent
it from being unjustifiably associated with a character set
, and to be
able to store any byte value; to convert existing field from type 'text'
to type 'bytea' the following clause may be used:
ALTER TABLE quarantine ALTER mail_text TYPE bytea
USING decode(replace(mail_text,'','\'),'escape');

Thanks a lot for your time.

<!-- gh-comment-id:175912111 --> @carragom commented on GitHub (Jan 27, 2016): @tonioo I agree that LATIN1 does not cover all cases and it's far from ideal. The one thing for sure is that this problem renders Modoboa unusable, all of it, not just the amavis module. So this should be fixed in any way necessary. It might be possible that amavis does not intent for these fields to be used as text, from the amavis [_README.sql-pg.txt_](https://www.ijs.si/software/amavisd/README.sql-pg.txt): > Upgrade note: field quarantine.mail_text should be of data type 'bytea' > and not 'text' as suggested in earlier documentation; **this is to prevent > it from being unjustifiably associated with a character set**, and to be > able to store any byte value; to convert existing field from type 'text' > to type 'bytea' the following clause may be used: > ALTER TABLE quarantine ALTER mail_text TYPE bytea > USING decode(replace(mail_text,'\','\\'),'escape'); Thanks a lot for your time.
Author
Owner

@tonioo commented on GitHub (Jan 28, 2016):

Please look at this thread (the end of the page is interesting):
https://code.djangoproject.com/ticket/2417
And this commit (django source code):
github.com/django/django@8ee1eddb7e

And tell me what do you think :)

<!-- gh-comment-id:176046392 --> @tonioo commented on GitHub (Jan 28, 2016): Please look at this thread (the end of the page is interesting): https://code.djangoproject.com/ticket/2417 And this commit (django source code): https://github.com/django/django/commit/8ee1eddb7e148de89aebde9e68da495633fc1ec9 And tell me what do you think :)
Author
Owner

@carragom commented on GitHub (Jan 29, 2016):

Yes using a BinaryField is definitively an option see here. But switching to BinaryField alone is not enough. Every custom query using convert_from needs to be replaced with something that fetches the data from the table and filter's it on the python side. This means probably rewriting this entire class.

In any case, it does not matter what type of field is used or where the conversion happens (db or app) at some point those bytes on the database will have to be converted to text in order to be useful and the conversion will require a character set. UTF8 is not the right charset for that data and currently breaks the entire application. The main objective here is to find a way where the application does not break even if the conversion fails.

Again I see two options:

1- Find a way to handle the conversion gracefully at the database level (maybe a stored procedure would help here or just use LATIN1 as charset which is working for me and seems to be what amavis is using).
2- Use a BinaryField and move the entire logic of converting/filtering the data to the web app which is inefficient and a lot of work and will still break if we keep trying to use UTF8 as charset.

Again thanks for your time, I hope I was a bit more clear this time.
Cheers.

<!-- gh-comment-id:176914271 --> @carragom commented on GitHub (Jan 29, 2016): Yes using a `BinaryField` is definitively an option see [here](https://github.com/tonioo/modoboa/issues/782#issuecomment-165911029). But switching to `BinaryField` alone is not enough. Every custom query using `convert_from` needs to be replaced with something that fetches the data from the table and filter's it on the python side. This means probably rewriting [this](https://github.com/modoboa/modoboa-amavis/blob/master/modoboa_amavis/sql_connector.py#L223) entire class. In any case, it does not matter what type of field is used or where the conversion happens (db or app) at some point those bytes on the database will have to be converted to text in order to be useful and the conversion will require a character set. `UTF8` is not the right charset for that data and currently breaks the entire application. The main objective here is to find a way where the application does not break even if the conversion fails. Again I see two options: 1- Find a way to handle the conversion gracefully at the database level (maybe a stored procedure would help here or just use `LATIN1` as charset which is working for me and seems to be what amavis is using). 2- Use a `BinaryField` and move the entire logic of converting/filtering the data to the web app which is inefficient and a lot of work and will still break if we keep trying to use `UTF8` as charset. Again thanks for your time, I hope I was a bit more clear this time. Cheers.
Author
Owner

@tonioo commented on GitHub (Jun 9, 2016):

This issue was moved to modoboa/modoboa-amavis#35

<!-- gh-comment-id:224916779 --> @tonioo commented on GitHub (Jun 9, 2016): This issue was moved to modoboa/modoboa-amavis#35
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/modoboa-modoboa#703
No description provided.