mirror of https://github.com/anomalyco/opentui.git synced 2026-04-25 13:06:00 +03:00

[GH-ISSUE #596] bug: ctrl+w (deleteWordBackward) doesn't recognize word boundaries between CJK and ASCII characters #163

Closed

opened 2026-03-02 23:45:00 +03:00 by kerem · 3 comments

kerem commented

2026-03-02 23:45:00 +03:00

Owner

Originally created by @mkusaka on GitHub (Jan 27, 2026).
Original GitHub issue: https://github.com/anomalyco/opentui/issues/596

Description

When mixing CJK (Japanese/Chinese/Korean) characters with ASCII text, ctrl+w / deleteWordBackward deletes both together instead of treating them as separate words.

Steps to Reproduce

1. Type mixed text like 日本語abc or テストtest
2. Place cursor at the end
3. Press ctrl+w

Expected Behavior

Only abc (or test) should be deleted, stopping at the CJK-ASCII boundary.

This is the expected behavior per Unicode UAX #29 (Text Segmentation), where CJK characters are treated as individual word units, creating implicit boundaries between CJK and Latin scripts.

Actual Behavior

The entire string 日本語abc is deleted at once.

Root Cause

Looking at utf8.zig, the findWrapBreaks function only recognizes these as word boundaries:

- ASCII: spaces, tabs, punctuation (-, /, ., etc.), brackets
- Unicode: various space characters (NBSP, ideographic space, etc.)

CJK characters and script transitions are not considered word boundaries.
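
To make the failure mode concrete, here is a minimal TypeScript sketch of a delimiter-only boundary rule. It is a simplified model of the behavior described above, not the actual `findWrapBreaks` code from `utf8.zig`; the `BREAK_CHARS` set and the function name `naiveDeleteWordBackwardStart` are hypothetical.

```typescript
// Simplified model, NOT the real findWrapBreaks implementation.
// Assumption: boundaries exist only at a fixed set of delimiter characters.
const BREAK_CHARS = new Set<string>([
  " ", "\t", "-", "/", ".", ",", "(", ")", "[", "]", "{", "}",
  "\u00A0", // NBSP
  "\u3000", // ideographic space
]);

// Returns the UTF-16 offset at which a backward word deletion would stop.
// Assumes `cursor` sits on a code point boundary.
function naiveDeleteWordBackwardStart(text: string, cursor: number): number {
  const chars = Array.from(text.slice(0, cursor)); // iterate by code point
  let i = chars.length;
  // Skip any delimiters directly before the cursor.
  while (i > 0 && BREAK_CHARS.has(chars[i - 1])) i--;
  // Consume everything up to the previous delimiter.
  while (i > 0 && !BREAK_CHARS.has(chars[i - 1])) i--;
  // Convert the code point index back to a UTF-16 offset.
  return chars.slice(0, i).join("").length;
}
```

Under this model, 日本語abc contains no delimiter, so the whole string is treated as one word and the deletion runs back to offset 0.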

Suggested Fix

Consider implementing UAX #29 compliant word boundary detection, or at minimum:

1. Treat each CJK character as a word boundary (per UAX #29 default)
2. Treat script transitions (e.g., Han → Latin, Hiragana → Latin) as word boundaries

References

Unicode UAX #29 (Text Segmentation): https://www.unicode.org/reports/tr29/#Word_Boundaries
Related code: https://github.com/anomalyco/opentui/blob/main/packages/core/src/zig/utf8.zig

Environment

OS: macOS
Terminal: iTerm2
opencode version: 1.1.36 (discovered while using opencode)
opentui version: 0.1.75 (@opentui/core, @opentui/solid)

kerem closed this issue

2026-03-02 23:45:00 +03:00

kerem commented

2026-03-02 23:45:01 +03:00

Author

Owner

@hwisu commented on GitHub (Jan 28, 2026):

Treating each character as a separate word does not match the expectations of Korean users.

In practice, word boundaries are expected to be determined by whitespace, not by individual characters.
Changing word-navigation behavior depending on the script (especially when mixed with other languages)
would break long-established editor behavior inherited from Vim and similar tools.

This kind of language-specific branching can easily lead to inconsistent and unpredictable cursor movement.

kerem commented

2026-03-02 23:45:01 +03:00

Author

Owner

@simonklee commented on GitHub (Jan 28, 2026):

If I understand this issue correctly, you're both right about the core issue: there is no word boundary detection at CJK-ASCII transitions, so ctrl+w on 日本語abc deletes everything.

Could the following approach work:

1. Break at script transitions (CJK-ASCII). 日本語abc with ctrl+w would delete abc, then 日本語 on the next press.
2. Group consecutive CJK characters together rather than treating each as an individual word. A run of 日本語 would be one unit. This matches Vim's behavior and should address @hwisu's concern about Korean, where per-character deletion would break normal whitespace-based word navigation.
3. Break at CJK punctuation (。、！？ etc.)

Am I missing something?
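
The three rules in this proposal could be sketched roughly as follows in TypeScript. This is only an illustration under stated assumptions: the `CharClass` names, the code point ranges in `classify`, and the function `wordStartBefore` are all hypothetical and deliberately coarse; none of them come from opentui.

```typescript
// Hypothetical character classes; names and ranges are illustrative only.
type CharClass = "space" | "punct" | "cjk" | "word";

function inRange(cp: number, lo: number, hi: number): boolean {
  return cp >= lo && cp <= hi;
}

function classify(ch: string): CharClass {
  const cp = ch.codePointAt(0) ?? 0;
  if (cp === 0x20 || cp === 0x09 || cp === 0x0a || cp === 0xa0 || cp === 0x3000) return "space";
  // CJK symbols/punctuation and some fullwidth punctuation (coarse approximation).
  if (inRange(cp, 0x3001, 0x303f) || inRange(cp, 0xff01, 0xff0f) || inRange(cp, 0xff1a, 0xff20)) return "punct";
  if (cp < 0x80) {
    const alnum = inRange(cp, 0x30, 0x39) || inRange(cp, 0x41, 0x5a) || inRange(cp, 0x61, 0x7a) || cp === 0x5f;
    return alnum ? "word" : "punct";
  }
  // Hiragana, Katakana, CJK ideographs (incl. Extension A), Hangul syllables.
  if (inRange(cp, 0x3040, 0x30ff) || inRange(cp, 0x3400, 0x4dbf) ||
      inRange(cp, 0x4e00, 0x9fff) || inRange(cp, 0xac00, 0xd7a3)) return "cjk";
  return "word";
}

// Returns the UTF-16 offset where a backward word deletion would stop.
// Assumes `cursor` sits on a code point boundary.
function wordStartBefore(text: string, cursor: number): number {
  const chars = Array.from(text.slice(0, cursor));
  let i = chars.length;
  // Skip whitespace directly before the cursor.
  while (i > 0 && classify(chars[i - 1]) === "space") i--;
  if (i === 0) return 0;
  // Consume the run of characters sharing the class of the one before the cursor.
  const cls = classify(chars[i - 1]);
  while (i > 0 && classify(chars[i - 1]) === cls) i--;
  return chars.slice(0, i).join("").length;
}
```

With this sketch, the first ctrl+w on 日本語abc stops at offset 3 (deleting abc) and the second removes 日本語 as one unit. A Hangul run such as 세계 in 안녕하세요 세계 is also deleted as a whole, so whitespace-delimited Korean words behave as before.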

kerem commented

2026-03-02 23:45:01 +03:00

Author

Owner

@mkusaka commented on GitHub (Feb 1, 2026):

@simonklee Thanks — makes sense to me.

FYI: what we’re discussing here seems close to how Vim gets its word-boundary behavior (via “character classes” in mbyte.c):
https://github.com/vim/vim/blob/master/src/mbyte.c#L2925-L2932

Not saying we need to replicate Vim — just sharing it as a reference in case we want to extend the heuristics later.
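
For anyone who wants a feel for the "character classes" idea, one common way to implement it is a sorted table of code point intervals with a binary search lookup. The TypeScript sketch below follows that general shape; the table contents, the class ids, and the names `CLASS_TABLE` and `lookupClass` are illustrative assumptions, not values copied from Vim or from opentui.

```typescript
interface ClassInterval {
  first: number; // first code point in the interval (inclusive)
  last: number;  // last code point in the interval (inclusive)
  cls: number;   // class id shared by every code point in the interval
}

// Must stay sorted and non-overlapping. Ranges and ids are illustrative only.
const CLASS_TABLE: ClassInterval[] = [
  { first: 0x3000, last: 0x3000, cls: 0 },      // ideographic space
  { first: 0x3001, last: 0x303f, cls: 1 },      // CJK symbols and punctuation
  { first: 0x3040, last: 0x309f, cls: 0x3040 }, // Hiragana
  { first: 0x30a0, last: 0x30ff, cls: 0x30a0 }, // Katakana
  { first: 0x4e00, last: 0x9fff, cls: 0x4e00 }, // CJK unified ideographs
  { first: 0xac00, last: 0xd7a3, cls: 0xac00 }, // Hangul syllables
];

// Fallback class for code points not covered by the table (hypothetical).
const DEFAULT_WORD_CLASS = 2;

// Binary search over the interval table.
function lookupClass(cp: number): number {
  let lo = 0;
  let hi = CLASS_TABLE.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = CLASS_TABLE[mid];
    if (cp < entry.first) hi = mid - 1;
    else if (cp > entry.last) lo = mid + 1;
    else return entry.cls;
  }
  return DEFAULT_WORD_CLASS;
}
```

A word boundary then falls wherever two adjacent characters map to different classes. Note the design choice in this table: Hiragana, Katakana, and Han get separate classes, so 日本語のテスト would split into 日本語, の, and テスト. Merging them into a single class would instead treat the whole run as one word.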

kerem referenced this issue

2026-03-02 23:46:00 +03:00

[PR #163] [MERGED] Test utils + char snaphot tests #335

kerem referenced this issue

2026-03-14 09:19:09 +03:00

[PR #163] [MERGED] Test utils + char snaphot tests #1115