[PR #682] [MERGED] fix(input): handle surrogate pairs in stdin buffer #710

Closed

opened 2026-03-02 23:47:46 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/anomalyco/opentui/pull/682
Author: @hugojosefson
Created: 2/13/2026
Status: ✅ Merged
Merged: 2/13/2026
Merged by: @simonklee

Base: main ← Head: fix/stdin-surrogate-pairs

📝 Commits (2)

2106921 fix(input): handle surrogate pairs in stdin buffer
923638e fix(input): preserve surrogate pairs across chunk boundaries

📊 Changes

2 files changed (+48 additions, -3 deletions)

📝 packages/core/src/lib/stdin-buffer.test.ts (+29 -0)
📝 packages/core/src/lib/stdin-buffer.ts (+19 -3)

📄 Description

Bug

extractCompleteSequences in stdin-buffer.ts iterates the buffer using remaining[0] and pos++, both of which operate on UTF-16 code units. Characters above U+FFFF (emoji, CJK Extension B, etc.) are stored as surrogate pairs, that is, two code units each.

The existing code splits each such pair into two lone surrogates, and TextEncoder.encode() then converts each lone surrogate to U+FFFD (the replacement character) downstream.

For example, typing 👍 (U+1F44D) in the input field produces �� instead.
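The failure can be reproduced outside the project with a minimal sketch. splitByCodeUnit and encodeAll below are hypothetical helpers that only imitate the per-code-unit iteration described above; they are not the actual extractCompleteSequences code.

```typescript
// Hypothetical stand-in for the old behavior: one entry per UTF-16 code unit.
function splitByCodeUnit(input: string): string[] {
  const out: string[] = [];
  for (let pos = 0; pos < input.length; pos++) {
    out.push(input.charAt(pos));
  }
  return out;
}

// Encode each entry separately, the way downstream code would see them.
function encodeAll(parts: string[]): number[] {
  const encoder = new TextEncoder();
  const bytes: number[] = [];
  for (const part of parts) {
    for (const b of encoder.encode(part)) {
      bytes.push(b);
    }
  }
  return bytes;
}

const thumbsUp = "\u{1F44D}"; // 👍, stored as the surrogate pair D83D DC4D

// Encoded as one string: the valid 4-byte UTF-8 sequence F0 9F 91 8D.
const encodedWhole = encodeAll([thumbsUp]);

// Split first: two lone surrogates, each encoded as U+FFFD (EF BF BD).
const units = splitByCodeUnit(thumbsUp);
const encodedSplit = encodeAll(units);
```

Here encodedSplit holds two copies of EF BF BD, which is the pair of replacement characters shown in the example.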

Note that this doesn't seem to happen with pasted characters, probably because they go through a different code path. It does happen when I type an emoji using the compose key on Linux. I suppose it could also happen for other characters above U+FFFF, such as CJK Extension B.

Fix

Check whether the code unit is a high surrogate (0xD800–0xDBFF) followed by a low surrogate (0xDC00–0xDFFF), and if so keep both as a single sequence entry.
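A rough sketch of that check, not the code merged in the PR: splitKeepingSurrogatePairs, SplitResult, and the remainder field are hypothetical names. Holding back a trailing high surrogate is an assumption about how the second commit ("preserve surrogate pairs across chunk boundaries") might work, since the description does not spell it out.

```typescript
const isHighSurrogate = (u: number): boolean => u >= 0xd800 && u <= 0xdbff;
const isLowSurrogate = (u: number): boolean => u >= 0xdc00 && u <= 0xdfff;

interface SplitResult {
  sequences: string[]; // complete entries, surrogate pairs kept together
  remainder: string;   // trailing high surrogate held back for the next chunk
}

function splitKeepingSurrogatePairs(input: string): SplitResult {
  const sequences: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    const unit = input.charCodeAt(pos);
    if (isHighSurrogate(unit)) {
      if (pos + 1 >= input.length) {
        // Assumption: the low surrogate may arrive in the next chunk,
        // so keep the high surrogate in the buffer instead of emitting it.
        return { sequences, remainder: input.slice(pos) };
      }
      if (isLowSurrogate(input.charCodeAt(pos + 1))) {
        // Valid pair: emit both code units as a single entry.
        sequences.push(input.slice(pos, pos + 2));
        pos += 2;
        continue;
      }
    }
    // BMP character (or an unpaired surrogate): emit one code unit.
    sequences.push(input.charAt(pos));
    pos += 1;
  }
  return { sequences, remainder: "" };
}

// Emoji mixed with ASCII stays intact.
const mixed = splitKeepingSurrogatePairs("a\u{1F44D}b");

// A pair split across two chunks is reassembled via the remainder.
const firstChunk = splitKeepingSurrogatePairs("a\uD83D");
const secondChunk = splitKeepingSurrogatePairs(firstChunk.remainder + "\uDC4D");
```

In this sketch an unpaired surrogate in the middle of the input is still emitted on its own, so only a trailing high surrogate is deferred.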

Two regression tests added:

emoji as sole input
emoji mixed with ASCII

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.