[GH-ISSUE #596] bug: ctrl+w (deleteWordBackward) doesn't recognize word boundaries between CJK and ASCII characters #163

Closed
opened 2026-03-02 23:45:00 +03:00 by kerem · 3 comments

Originally created by @mkusaka on GitHub (Jan 27, 2026).
Original GitHub issue: https://github.com/anomalyco/opentui/issues/596

Description

When mixing CJK (Japanese/Chinese/Korean) characters with ASCII text, ctrl+w / deleteWordBackward deletes both together instead of treating them as separate words.

Steps to Reproduce

  1. Type mixed text like 日本語abc or テストtest
  2. Place cursor at the end
  3. Press ctrl+w

Expected Behavior

Only abc (or test) should be deleted, stopping at the CJK-ASCII boundary.

This is the expected behavior per Unicode UAX #29 (Text Segmentation), where CJK characters are treated as individual word units, creating implicit boundaries between CJK and Latin scripts.

Actual Behavior

The entire string 日本語abc is deleted at once.

Root Cause

Looking at utf8.zig, the findWrapBreaks function only recognizes these as word boundaries:

  • ASCII: spaces, tabs, punctuation (-, /, ., etc.), brackets
  • Unicode: various space characters (NBSP, ideographic space, etc.)

CJK characters and script transitions are not considered word boundaries.

Suggested Fix

Consider implementing UAX #29 compliant word boundary detection, or at minimum:

  1. Treat each CJK character as a word boundary (per UAX #29 default)
  2. Treat script transitions (e.g., Han → Latin, Hiragana → Latin) as word boundaries
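
As a rough illustration of the two minimal rules above, the TypeScript sketch below marks a boundary wherever the script changes or where either neighbor is a CJK character. The names `scriptOf`, `isBoundary`, and `boundaryOffsets` are hypothetical and not part of opentui, and the code point ranges cover only the main Unicode blocks, which is an assumption made for brevity.

```typescript
// Hypothetical helpers for illustration; none of these exist in opentui.
type Script = "han" | "hiragana" | "katakana" | "hangul" | "other";

// Coarse script lookup by code point range (main blocks only, an assumption).
function scriptOf(ch: string): Script {
  const cp = ch.codePointAt(0) ?? 0;
  if (cp >= 0x4e00 && cp <= 0x9fff) return "han"; // CJK Unified Ideographs
  if (cp >= 0x3040 && cp <= 0x309f) return "hiragana";
  if (cp >= 0x30a0 && cp <= 0x30ff) return "katakana";
  if (cp >= 0xac00 && cp <= 0xd7a3) return "hangul"; // Hangul Syllables
  return "other"; // Latin, digits, spaces, punctuation, everything else
}

// True when a word boundary falls between two adjacent characters.
function isBoundary(prev: string, next: string): boolean {
  const a = scriptOf(prev);
  const b = scriptOf(next);
  if (a !== b) return true; // rule 2: script transition (e.g. Han -> Latin)
  return a !== "other"; // rule 1: each CJK character is its own unit
}

// Offsets (counted in code points) at which a boundary occurs.
function boundaryOffsets(text: string): number[] {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const offsets: number[] = [];
  for (let i = 1; i < chars.length; i++) {
    if (isBoundary(chars[i - 1], chars[i])) offsets.push(i);
  }
  return offsets;
}

console.log(boundaryOffsets("日本語abc")); // [1, 2, 3]
console.log(boundaryOffsets("abc")); // []
```

Under these rules 日本語abc has boundaries after each of 日, 本, and 語, so ctrl+w would remove abc first and then one CJK character per press. As written, rule 1 also splits Hangul syllable by syllable.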

References

  • Unicode UAX #29 (Text Segmentation): https://www.unicode.org/reports/tr29/#Word_Boundaries
  • Related code: https://github.com/anomalyco/opentui/blob/main/packages/core/src/zig/utf8.zig

Environment

  • OS: macOS
  • Terminal: iTerm2
  • opencode version: 1.1.36 (discovered while using opencode)
  • opentui version: 0.1.75 (@opentui/core, @opentui/solid)
kerem closed this issue 2026-03-02 23:45:00 +03:00

@hwisu commented on GitHub (Jan 28, 2026):

Treating each character as a separate word does not match the expectations of Korean users.

In practice, word boundaries are expected to be determined by whitespace, not by individual characters.
Changing word-navigation behavior depending on the script (especially when mixed with other languages)
would break long-established editor behavior inherited from Vim and similar tools.

This kind of language-specific branching can easily lead to inconsistent and unpredictable cursor movement.


@simonklee commented on GitHub (Jan 28, 2026):

If I understand this issue correctly, you're both right about the core issue. There's no word boundary detection at CJK-ASCII transitions, so ctrl+w on 日本語abc deletes everything.

Could the following approach work:

  1. Break at script transitions (CJK-ASCII). 日本語abc with ctrl+w would delete abc, then 日本語 on the next press.
  2. Group consecutive CJK characters together rather than treating each as an individual word. A run of 日本語 would be one unit. This matches Vim's behavior and should address @hwisu's concern about Korean, where per-character deletion would break normal whitespace-based word navigation.
  3. Break at CJK punctuation (。、!? etc.)

Am I missing something?
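
A minimal sketch of the three rules above, written in TypeScript for illustration only. It assigns each character a coarse class and deletes the trailing run of a single class. The names `classOf` and `deleteWordBackward` here are hypothetical stand-ins, not opentui's implementation; the code point ranges are simplified assumptions, and the cursor is assumed to sit at the end of the text.

```typescript
// Hypothetical sketch; names and ranges are illustrative, not opentui's API.
type CharClass = "space" | "punct" | "cjk" | "word";

function classOf(ch: string): CharClass {
  const cp = ch.codePointAt(0) ?? 0;
  if (/\s/.test(ch)) return "space"; // \s also matches U+3000 ideographic space
  if (cp < 0x80) return /[A-Za-z0-9_]/.test(ch) ? "word" : "punct";
  // Rule 3: CJK punctuation (、。「」 etc.) and fullwidth ASCII punctuation.
  if (cp >= 0x3001 && cp <= 0x303f) return "punct";
  if ((cp >= 0xff01 && cp <= 0xff0f) || (cp >= 0xff1a && cp <= 0xff20)) return "punct";
  // Rule 2: Han, Hiragana, Katakana, and Hangul syllables share one class,
  // so a consecutive run is deleted as a single unit.
  if (
    (cp >= 0x4e00 && cp <= 0x9fff) ||
    (cp >= 0x3040 && cp <= 0x30ff) ||
    (cp >= 0xac00 && cp <= 0xd7a3)
  ) {
    return "cjk";
  }
  return "word"; // other non-ASCII letters are treated as ordinary word characters
}

// Delete one "word" before a cursor assumed to be at the end of `text`.
function deleteWordBackward(text: string): string {
  const chars = Array.from(text); // iterate by code point
  let i = chars.length;
  // Skip whitespace directly before the cursor.
  while (i > 0 && classOf(chars[i - 1]) === "space") i--;
  if (i > 0) {
    // Rule 1: stop as soon as the class changes (e.g. Latin -> CJK).
    const cls = classOf(chars[i - 1]);
    while (i > 0 && classOf(chars[i - 1]) === cls) i--;
  }
  return chars.slice(0, i).join("");
}

console.log(deleteWordBackward("日本語abc")); // "日本語"
console.log(deleteWordBackward("日本語")); // ""
console.log(deleteWordBackward("日本語。テスト")); // "日本語。"
console.log(deleteWordBackward("한국어 텍스트")); // "한국어 "
```

Because Hangul runs stay together and only whitespace, punctuation, or a class change ends a run, Korean text separated by spaces is still deleted one whitespace-delimited word at a time.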


@mkusaka commented on GitHub (Feb 1, 2026):

@simonklee Thanks — makes sense to me.

FYI: what we’re discussing here seems close to how Vim gets its word-boundary behavior (via “character classes” in mbyte.c):
https://github.com/vim/vim/blob/master/src/mbyte.c#L2925-L2932

Not saying we need to replicate Vim — just sharing it as a reference in case we want to extend the heuristics later.
