[GH-ISSUE #1458] MD013: Incorrect count on lines with multi-byte unicode characters #687

Open

opened 2026-03-03 01:29:05 +03:00 by kerem · 1 comment

kerem commented

2026-03-03 01:29:05 +03:00

Owner

Originally created by @Maneren on GitHub (Dec 26, 2024).
Original GitHub issue: https://github.com/DavidAnson/markdownlint/issues/1458

Hi, I copied a paragraph from a PDF that contained hardcoded Unicode italic characters, each of which takes 4 bytes in UTF-8 or 2 code units (4 bytes) in UTF-16. After pasting it into a Markdown file and saving that file in UTF-8 encoding, I started receiving a `Line length [Expected: 80, Actual: 85]` warning, even though only 74 Unicode characters are displayed on the line (stored as 107 bytes).

- $\forall 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝑇_𝑛: 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝐾 \iff

(I assume the intention of the rule is to consider the "visual count of characters" as rendered in the editor - 74 in this case)
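The three numbers in this report (85, 74 and 107) can be reproduced with a small script. This is only a sketch for checking the counts in Node.js; the variable names are illustrative and nothing here is taken from markdownlint itself.

```javascript
// The sample line from this report. String.raw keeps the LaTeX backslashes literal.
const line = String.raw`- $\forall 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝑇_𝑛: 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝐾 \iff`;

// UTF-16 code units: what String.prototype.length reports.
const codeUnits = line.length;

// Unicode code points: the string iterator yields one item per code point.
const codePoints = [...line].length;

// UTF-8 bytes: the size of the line as stored in a UTF-8 file.
const utf8Bytes = new TextEncoder().encode(line).length;

console.log({ codeUnits, codePoints, utf8Bytes });
```

The line has 11 italic math characters; each one adds an extra code unit in UTF-16 (74 + 11 = 85) and three extra bytes in UTF-8 (74 + 33 = 107).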

I may be missing some context or detail of the implementation, but I think the issue is a combination of two things: JS handling strings as UTF-16 code units rather than whole Unicode characters (hence the seemingly incorrect `.length` reported for the line), and the use of "unicode-unaware" regular expressions, where `.` again matches a single UTF-16 code unit.
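The regular-expression half of that claim is easy to check on a single character. A minimal sketch, using one of the italic letters from the sample line written as an escape so the code point is unambiguous:

```javascript
// "\u{1D463}" is 𝑣 (MATHEMATICAL ITALIC SMALL V), stored as a surrogate pair in UTF-16.
const v = "\u{1D463}";

// Without the `u` flag, `.` consumes one UTF-16 code unit, i.e. half of the pair.
const dotMatches = v.match(/./g).length; // 2
const oneDot = /^.$/.test(v);            // false
const twoDots = /^..$/.test(v);          // true

console.log({ length: v.length, dotMatches, oneDot, twoDots });
```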

So I think the correct way to handle these would be to use `[...line].length` to get the total length of the line and to add the `u` flag to the regular expressions to switch them to Unicode mode.
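As a rough sketch of what that proposal would change, the snippet below compares both approaches on a synthetic line. The helper `codePointLength` and the two "too long" patterns are hypothetical and written only for illustration; they are not markdownlint's actual code.

```javascript
// Hypothetical helper (not from markdownlint): length in Unicode code points.
const codePointLength = (line) => [...line].length;

// 50 copies of 𝑣 (U+1D463): 50 code points, but 100 UTF-16 code units.
const line = "\u{1D463}".repeat(50);
const limit = 80;

// Hypothetical "longer than the limit" pattern, built without and with the `u` flag.
const tooLongLegacy = new RegExp(`^.{${limit + 1},}`);
const tooLongUnicode = new RegExp(`^.{${limit + 1},}`, "u");

console.log(line.length > limit);           // true: counts code units
console.log(codePointLength(line) > limit); // false: counts code points
console.log(tooLongLegacy.test(line));      // true: `.` matches a code unit
console.log(tooLongUnicode.test(line));     // false: `.` matches a code point
```

Code points are still only an approximation of what an editor displays, since combining marks and emoji sequences span several code points, but they do cover the case in this report.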
