mirror of https://github.com/ceaksan/turkish-diacritics.git synced 2026-04-25 20:45:56 +03:00

Claude Code plugin that validates Turkish diacritics (ç, ğ, ı, ö, ş, ü) in content files using hunspell-based multi-layer detection

Find a file

ceaksan 701be15c24 docs: update README with expanded lookup stats and benchmarks - Update dictionary entry counts (lookup 7,466, whitelist ~560, skip ~240) - Add Benchmarks section comparing rule-based vs ByT5 model - Add Scripts section documenting benchmark tools - Update validation stats (134 posts, 97% FP reduction) - Add etc, often, Turkce to whitelist to prevent FP on English README - Rewrite Limitations section to note context disambiguation gap		2026-04-04 19:48:40 +03:00
.claude-plugin	feat: Initial release - Turkish diacritics validator for Claude Code	2026-02-10 02:22:48 +03:00
docs/adr	feat: add Charformer diacritics experiment notebook and ADR template	2026-04-02 12:55:17 +03:00
hooks	docs: update README with expanded lookup stats and benchmarks	2026-04-04 19:48:40 +03:00
notebooks	feat: add ByT5 experiment, benchmarks, and fix 157 lookup/whitelist conflicts	2026-04-04 15:59:52 +03:00
scripts	feat: add ByT5 experiment, benchmarks, and fix 157 lookup/whitelist conflicts	2026-04-04 15:59:52 +03:00
.gitignore	feat: add ByT5 experiment, benchmarks, and fix 157 lookup/whitelist conflicts	2026-04-04 15:59:52 +03:00
.knowledge.toml	feat: Add project configuration via `.knowledge.toml` and update `tech.dic` and `whitelist.dic` with new terms.	2026-03-19 21:44:03 +03:00
byt5_diacritics_experiment	Kaggle Notebook \| byt5_diacritics_experiment \| Version 7	2026-04-03 22:12:40 +03:00
LICENSE	feat: Initial release - Turkish diacritics validator for Claude Code	2026-02-10 02:22:48 +03:00
README.md	docs: update README with expanded lookup stats and benchmarks	2026-04-04 19:48:40 +03:00
training_data.tsv	feat: add ByT5 experiment, benchmarks, and fix 157 lookup/whitelist conflicts	2026-04-04 15:59:52 +03:00

README.md

turkish-diacritics

Claude Code plugin and global hook that validates Turkish diacritics in content files. Catches missing c, g, i, o, s, u characters using hunspell-based multi-layer detection with zero token cost.

Problem

LLMs (including Claude) drop Turkish diacritics when generating long-form content:

c becomes c, g becomes g, i becomes i, o becomes o, s becomes s, u becomes u
Prompt-level warnings ("use proper Turkish characters") are insufficient
Manual proofreading is slow, expensive, and error-prone
Errors are often caught only after publication

This plugin adds a deterministic quality gate that runs after every file edit, providing instant feedback so Claude can self-correct.

Who needs this

Turkish content creators using AI writing tools. If you generate blog posts, documentation, or marketing copy with Claude, GPT, or Gemini in Turkish, diacritics errors are inevitable. This catches them before publication.
Developers building Turkish-language applications. Any pipeline that processes LLM-generated Turkish text needs a validation layer. This is that layer.
Claude Code users writing in Turkish. Install as a plugin, and every file edit is validated automatically. Zero manual intervention.

Validated against 134 real blog posts with 97% false positive reduction after dictionary tuning.

How It Works

The plugin runs as a PostToolUse hook after every Edit or Write operation. If the edited file is a markdown file (.md, .mdx in global mode; tr.mdx, tr.md in plugin mode), it validates diacritics using a 4-layer detection system:

Layer	Method	Confidence	What It Catches
Layer 0	Word dedup	-	Reduces hunspell calls
Layer 1	hunspell suggestion matching	HIGH	Most diacritics errors
Layer 2	Brute-force variant generation	HIGH	Multi-character changes
Layer 3	Ambiguity lookup table (7,466 entries)	MEDIUM	Valid ASCII but wrong form

Performance: 2 hunspell calls per file, ~5s average, zero timeouts on 134 real posts.

When errors are found, the plugin reports them via stderr (exit code 2), and Claude automatically corrects them.

Prerequisites

hunspell + Turkish Dictionary

macOS:

brew install hunspell

Linux (Debian/Ubuntu):

sudo apt-get install hunspell hunspell-tr

Verify installation:

echo "ozellik" | hunspell -d tr_TR -a
# Should show suggestions including "ozellik"

If tr_TR dictionary is not found, download from tdd-ai/hunspell-tr and place tr_TR.dic + tr_TR.aff in your hunspell dictionary path (~/Library/Spelling/ on macOS, /usr/share/hunspell/ on Linux).

Installation

As Claude Code Plugin (project-level)

Add the marketplace and install:

/plugin marketplace add ceaksan/turkish-diacritics
/plugin install turkish-diacritics

Plugin mode checks only tr.mdx and tr.md files by default.

As Global Hook (all projects)

Copy the hook and dictionaries to your global Claude Code directory:

mkdir -p ~/.claude/hooks/dicts
cp hooks/turkish-check.py ~/.claude/hooks/turkish-check.py
cp hooks/dicts/* ~/.claude/hooks/dicts/

Add the hook to ~/.claude/settings.json under the PostToolUse matcher for Edit|Write:

{
  "type": "command",
  "command": "TURKISH_CHECK_GLOBAL=1 python3 ~/.claude/hooks/turkish-check.py",
  "statusMessage": "Checking Turkish characters..."
}

Global mode checks all .md and .mdx files across every project.

Modes

Mode	File Filter	Scope	Install Method
Plugin (default)	`tr.mdx`, `tr.md` only	Single project	`/plugin install`
Global hook	Any `.md` or `.mdx`	All projects	Manual copy to `~/.claude/hooks/`

In global mode, non-Turkish files pass through cleanly since hunspell only flags words that should have Turkish diacritics.

What Gets Checked

The plugin extracts prose text by stripping:

YAML frontmatter
Code blocks (fenced and inline)
JSX/HTML tags and expressions
Import/export statements
URLs and markdown links/images
Footnote references
HTML comments
Markdown formatting (headings, bold, italic)

What's not checked:

Code identifiers (camelCase, snake_case, UPPER_CASE, kebab-case)
Technical terms (API, React, frontend, cache, etc.)
English words used in tech context (status, editor, suite, etc.)
Known dual-use words (words valid in both ASCII and diacritics form)

Dictionaries

The plugin ships with 4 curated dictionary files:

File	Entries	Purpose
`tech.dic`	~360	hunspell personal dictionary: brands, acronyms, Turkish inflections
`whitelist.dic`	~560	Words to skip in all detection layers
`ambiguous-lookup.tsv`	7,466	ASCII-to-diacritics mapping for Layer 3
`ambiguous-skip.dic`	~240	Context-dependent words to skip in Layer 3

Adding Custom Words

To prevent a false positive on an English/tech word: Add it to hooks/dicts/whitelist.dic (one word per line, case-insensitive).

To teach hunspell a new word: Add it to hooks/dicts/tech.dic (one word per line).

To skip a valid Turkish ASCII word: Add it to hooks/dicts/ambiguous-skip.dic (one word per line, case-insensitive).

Example Output

Turkish diacritics errors in src/content/posts/2026/01/01.my-post/tr.mdx (5 found):

[HIGH confidence - 3 errors]
  'ozellik' -> 'özellik' (line 12)
  'gelistirme' -> 'geliştirme' (line 24)
  'Turkce' -> 'Türkçe' (line 8)

[MEDIUM confidence - 2 errors]
  'urun' -> 'ürün' (line 30)
  'sureci' -> 'süreci' (line 45)

Fix these Turkish character errors using Edit tool.

Benchmarks

The rule-based system was benchmarked against a fine-tuned ByT5-small (300M params) model on the same test set:

Metric	Rule-based	ByT5
Word accuracy (200 sample)	100%	96.5%
False positives (50 sample)	0	6
Speed	0ms/word	836ms/word
Context disambiguation (7 pairs)	N/A	2/7

The rule-based system outperforms ML on word-level tasks. ML models may add value for sentence-level context disambiguation with larger training datasets and better architectures (token classification vs seq2seq). See notebooks/ for experiment details.

Data Sources

The ambiguity lookup table was built from multiple sources:

134 Turkish blog posts from production content. Words with diacritics extracted, ASCII-folded, and validated with hunspell. Primary source for the 7,466-entry lookup table.
tdd-ai/hunspell-tr - Turkish hunspell dictionary (75,910 words). Used for base spell checking and suggestion generation.
CanNuhlar/Turkce-Kelime-Listesi - 76K Turkish word list. Used to extract ambiguous ASCII/diacritics word pairs.
erogluegemen/TDK-Dataset - TDK (Turkish Language Association) dataset with 92K words including madde and madde_duz columns. Used for cross-validation and ambiguous-skip curation.

Architecture

Claude Write/Edit tr.mdx
        |
        v
PostToolUse Hook (stdin JSON)
        |
        v
turkish-check.py
  1. File filter (.md/.mdx in global mode, tr.mdx/tr.md in plugin mode)
  2. Strip non-prose content
  3. Extract unique words
  4. hunspell -a: suggestions for misspelled words
  5. Layer 1: Diacritics-only suggestion matching
  6. Layer 2: Brute-force variant generation + batch validation
  7. hunspell -l: batch validate Layer 2 variants
  8. Layer 3: Ambiguity lookup (no hunspell)
  9. Structured output with confidence levels
  10. stderr + exit 2 -> Claude feedback loop

Scripts

Script	Purpose
`scripts/prepare_training_data.py`	Generate training pairs from Turkish content for ML experiments
`scripts/compare_benchmark.py`	Rule-based vs ML model comparison benchmark
`scripts/sentence_benchmark.py`	Sentence-level context disambiguation benchmark

Limitations

No context disambiguation: The rule-based system cannot distinguish between "bas" (press) and "bas" (head) based on sentence context. Context-dependent words are skipped to avoid false positives.
hunspell suggestion quality: Some corrections are partial (e.g., karsilasilan gets partially corrected). This is a hunspell limitation.
Morphological gaps: Not all inflected forms are in the lookup table. Simple suffix concatenation doesn't cover Turkish vowel harmony and stem changes.
Prefix whitelist: Short whitelist words (3-4 chars) only do exact matching to avoid false negatives on unrelated Turkish words.
Source text typos: The hook correctly flags source typos but the suggested correction may be wrong (e.g., isiyorum flagged but correction should be istiyorum, not isiyorum).

License

MIT