[GH-ISSUE #30] Wrong HTML sanitization for search column #25

Closed
opened 2026-03-15 12:09:44 +03:00 by kerem · 3 comments

Originally created by @kzaitsev on GitHub (Dec 30, 2022).
Original GitHub issue: https://github.com/axllent/mailpit/issues/30

Hello, it seems something is wrong with HTML sanitization when you build the `search` column. It looks like some tags are ignored and not unwrapped to text. As a result, searching for a word that appears in the email body does not find the message.

To reproduce, I'll attach a zipped eml file. In this case, the text "massmailgoodhost" will be dropped, and the "search" field will not contain it.

example.com - sales sales@example.com massmail testufof9hiyjgo8best regards! liam nison autotestmassmailwasservice@good.good https //www.example.com test. https //portal.example.com/services/my/15413 you have received this notification because you are a example.com customer.email address autotestmassmailwasservice@good.good is attached to account 11092.

It looks like a bug in https://github.com/k3a/html2text, but instead of relying on it, why not use the `Text` field of the envelope structure returned by the eml parser (github.com/jhillyerd/enmime)?

d80e6ca4-fb3c-4dcb-a6f8-030af2f8278f.eml.zip (https://github.com/axllent/mailpit/files/10325337/d80e6ca4-fb3c-4dcb-a6f8-030af2f8278f.eml.zip)
kerem closed this issue and added the bug label on 2026-03-15 12:09:44 +03:00.

@axllent commented on GitHub (Dec 30, 2022):

Thanks for the information @kzaitsev. We can't rely on the envelope `Text` value because so many HTML emails contain something like `You require an HTML-compatible email program to read this` instead of an actual text version of the HTML, or only a very dumbed-down or broken version of it. From memory, the enmime `Text` value isn't a conversion of the HTML but rather the `Content-Type: text/plain;` part of the email (if set, else blank).

The best solution is still to manually convert the HTML (if set) to text, but I'll need to dig much deeper to find out exactly why this is happening; if it is an issue with html2text, it will need to be reported there to be fixed. Unfortunately I'm just heading off for a short holiday today, so it will probably be two weeks before I can look into this.
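For readers following along, the preference described above (convert the HTML part when it exists, and only fall back to the text/plain part otherwise) could be sketched roughly as below. This is an illustration only, not Mailpit's code: `buildSearchText` and the `htmlToText` callback are hypothetical names, and the callback stands in for whichever converter is actually used.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSearchText is a hypothetical helper (not taken from Mailpit).
// It prefers text converted from the HTML part and falls back to the
// text/plain part only when the message has no HTML body.
// htmlToText stands in for the real converter, for example html2text.
func buildSearchText(htmlBody, textBody string, htmlToText func(string) string) string {
	if strings.TrimSpace(htmlBody) != "" {
		return htmlToText(htmlBody)
	}
	return textBody
}

func main() {
	// Placeholder converter, used only for this illustration.
	fake := func(s string) string { return "converted: " + s }

	// HTML present: the unhelpful text/plain stub is ignored.
	fmt.Println(buildSearchText("<p>hello</p>", "You require an HTML-compatible email program to read this", fake))

	// No HTML part: the text/plain part is used as-is.
	fmt.Println(buildSearchText("", "plain text only", fake))
}
```

Passing the converter in as a function keeps the sketch independent of any particular library and makes the fallback rule easy to test on its own.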

I also see I am stripping out `:` (when I "clean" the text), which results in `https //www.example.com` (just noting it here so I don't forget to remove that).


@kzaitsev commented on GitHub (Dec 30, 2022):

@axllent thank you for your quick response, I understand.

I did some investigating, and it seems https://github.com/jhillyerd/enmime uses https://github.com/jaytaylor/html2text to convert HTML to text for HTML-only emails.


@axllent commented on GitHub (Jan 4, 2023):

Thanks again for reporting this. I found an option I could pass to html2text to include anchor content in the returned output, so this should now be solved in the latest v1.3.5 release. Please feel free to re-open this if it does not solve your issue!
