[GH-ISSUE #1458] MD013: Incorrect count on lines with multi-byte unicode characters #2534

Open
opened 2026-03-07 20:08:41 +03:00 by kerem · 1 comment
Owner

Originally created by @Maneren on GitHub (Dec 26, 2024).
Original GitHub issue: https://github.com/DavidAnson/markdownlint/issues/1458

Hi, I copied a paragraph from a PDF that contained hardcoded Unicode italic characters, each of which takes 4 bytes in UTF-8 and two 16-bit code units (also 4 bytes) in UTF-16. After pasting it into a Markdown file saved as UTF-8, I started receiving the `Line length [Expected: 80, Actual: 85]` warning, even though only 74 Unicode characters are displayed on the line (stored as 107 bytes).

- $\forall 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝑇_𝑛: 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝐾 \iff

(I assume the intention of the rule is to consider the "visual count of characters" as rendered in the editor - 74 in this case)

I may be missing some context or implementation detail, but I think the issue is a combination of two things: JS strings are sequences of UTF-16 code units rather than Unicode code points (hence the seemingly incorrect `.length` reported for the line), and the rule uses regular, "unicode-unaware" regular expressions, in which `.` likewise matches a single UTF-16 code unit.
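The three numbers in this report can be reproduced with a short script. This is only a sketch for illustration, not markdownlint's code; the variable names are made up here, and it assumes the sample line has no trailing whitespace.

```javascript
// The sample line from this report, kept raw so the backslashes stay literal.
const line = String.raw`- $\forall 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝑇_𝑛: 𝑣_1, 𝑣_2, \ldots, 𝑣_𝑛 \in 𝐾 \iff`;

// String.prototype.length counts UTF-16 code units;
// each italic letter is stored as a surrogate pair (two units).
const utf16Units = line.length;

// The string iterator walks code points, so spreading
// yields one array element per code point.
const codePoints = [...line].length;

// TextEncoder always encodes to UTF-8.
const utf8Bytes = new TextEncoder().encode(line).length;

console.log({ utf16Units, codePoints, utf8Bytes });
// { utf16Units: 85, codePoints: 74, utf8Bytes: 107 }
```

The line holds 11 italic letters, which accounts for the gap between 74 and 85.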

So I think the correct way to handle these would be to use `[...line].length` to get the total length of the line and to add the `u` flag to the regular expressions to switch them to Unicode mode.
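A minimal sketch of both suggestions, assuming a hypothetical `lineLength` helper and a made-up length-check pattern (neither is taken from markdownlint's source):

```javascript
// Hypothetical helper: length in Unicode code points
// rather than UTF-16 code units.
function lineLength(line) {
  return [...line].length;
}

const sample = "𝑣_𝑛"; // two italic letters and an underscore

console.log(sample.length);      // 5 (each italic letter is two code units)
console.log(lineLength(sample)); // 3

// Made-up pattern meaning "longer than 3 characters".
const tooLong = /^.{4,}$/;
const tooLongUnicode = /^.{4,}$/u;

console.log(tooLong.test(sample));        // true: `.` matches each surrogate half
console.log(tooLongUnicode.test(sample)); // false: `.` matches whole code points
```

Note that this counts code points, which can still differ from the number of rendered glyphs when a line contains combining marks or emoji sequences.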


@DavidAnson commented on GitHub (Dec 26, 2024):

Related: #564
