[GH-ISSUE #15] semantic_code_search and semantic_navigate fail with 'Unable to embed oversized input' on projects with large data files #12

Closed
opened 2026-03-15 15:59:51 +03:00 by kerem · 2 comments

Originally created by @Morghot42 on GitHub (Mar 5, 2026).
Original GitHub issue: https://github.com/ForLoopCodes/contextplus/issues/15

Bug Description

semantic_code_search and semantic_navigate fail with the error:

Unable to embed oversized input after adaptive retries

This happens on projects that contain large non-code files (JSON data files, GeoJSON, CSV, etc.) alongside source code. The tools appear to embed each file's entire content, including these large data files, and that content exceeds the embedding model's context window.

Environment

  • Context+ version: latest via bunx contextplus
  • OS: macOS (Apple Silicon, 64 GB RAM)
  • Ollama: running locally
  • Embed model: nomic-embed-text (context window: 2048 tokens)
  • Chat model: gemma2:27b

Reproduction

Any project that has both source code files and large data files (JSON > 100KB, GeoJSON, CSV, etc.) in the project tree.

Steps

  1. Configure Context+ as MCP server with Ollama (nomic-embed-text)
  2. Have a project with a few JS/TS source files and some large .json data files (100KB+)
  3. Run semantic_code_search with any query:
    semantic_code_search({ query: "authentication logic", top_k: 3 })
    
  4. Result: Unable to embed oversized input after adaptive retries

What works vs what doesn't

| Tool | Status | Notes |
|------|--------|-------|
| get_context_tree | Works | AST-based, no embeddings |
| get_file_skeleton | Works | AST-based, no embeddings |
| semantic_identifier_search | Works | Embeds function signatures (small) |
| get_blast_radius | Works | Import/usage tracing |
| semantic_code_search | Fails | Tries to embed large data files |
| semantic_navigate | Fails | Same issue |

Expected Behavior

Context+ should do one of the following:

  1. Skip non-code files (.json, .geojson, .csv, etc.) during embedding, or
  2. Chunk large files before embedding instead of sending the entire content, or
  3. Add a configurable max_file_size threshold (e.g., 50KB) beyond which files are skipped for embedding, or
  4. Gracefully degrade — skip files that exceed the model's context window and continue with the rest
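
Option 2 above (chunking) could look roughly like the sketch below. It is not taken from the Context+ source: `chunkText` and its character-based limit are hypothetical, and a character count is only a stand-in for the model's real token count.

```typescript
// Hypothetical helper: splits text into chunks of at most maxChars characters,
// preferring line boundaries. Each line is re-emitted with a trailing "\n",
// so the joined chunks equal the input plus one final newline.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    const piece = line + "\n";
    // Flush the running chunk if this line would push it over the limit.
    if (current.length > 0 && current.length + piece.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    if (piece.length > maxChars) {
      // A single oversized line (common in minified JSON) is hard-split.
      for (let i = 0; i < piece.length; i += maxChars) {
        chunks.push(piece.slice(i, i + maxChars));
      }
    } else {
      current += piece;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const sample = "line one\nline two\nline three";
const sampleChunks = chunkText(sample, 10);
console.log(sampleChunks.length); // 4
```

Each chunk would then be embedded separately, so no single request exceeds the limit. The hard split can cut a multi-byte character or a JSON token in half, which is acceptable for a size guard but not for anything that needs to parse the chunk.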

Suggested Fix

The embeddings.ts core module could:

  • Filter out known data-only extensions (.json, .geojson, .csv, .xlsx) from the embedding pipeline
  • Add a CONTEXTPLUS_MAX_EMBED_FILE_SIZE env var (default ~50KB)
  • Or use the existing Tree-sitter parser to detect if a file has meaningful code symbols — if not, skip it
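
The first two suggestions could be combined into a single eligibility check, sketched below. This is an illustration only: `shouldEmbedFile`, `resolveMaxEmbedFileSize`, and `extensionOf` are hypothetical names, the env var is the one proposed above rather than an existing setting, and the actual structure of `embeddings.ts` is not known here.

```typescript
// Data-only extensions from the suggestion above (hypothetical constant).
const DATA_ONLY_EXTENSIONS = new Set([".json", ".geojson", ".csv", ".xlsx"]);

// Assumed default for the proposed CONTEXTPLUS_MAX_EMBED_FILE_SIZE (~50KB), in bytes.
const DEFAULT_MAX_EMBED_FILE_SIZE = 50 * 1024;

// Reads the proposed env var, falling back to the default when it is unset,
// empty, or not a positive number.
function resolveMaxEmbedFileSize(env: Record<string, string | undefined>): number {
  const parsed = Number(env["CONTEXTPLUS_MAX_EMBED_FILE_SIZE"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_EMBED_FILE_SIZE;
}

// Returns the lowercased extension including the dot, or "" if there is none.
// Dotfiles such as ".env" are treated as having no extension.
function extensionOf(filePath: string): string {
  const base = filePath.split(/[\\/]/).pop() ?? "";
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// A file is embedded only if it is not a known data format and fits the size limit.
function shouldEmbedFile(filePath: string, sizeInBytes: number, maxBytes: number): boolean {
  if (DATA_ONLY_EXTENSIONS.has(extensionOf(filePath))) return false;
  return sizeInBytes <= maxBytes;
}

const limit = resolveMaxEmbedFileSize({});
console.log(shouldEmbedFile("src/auth.ts", 1200, limit)); // true
console.log(shouldEmbedFile("data/regions.geojson", 1200, limit)); // false
```

Skipping `.json` outright would also exclude small config files such as `package.json`; whether that is desirable is a design decision for the maintainers, and the size threshold alone may be enough.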

Additional Context

semantic_identifier_search works because it only embeds function/class signatures (small strings). The bug is specifically in the file-level embedding pipeline used by semantic_code_search and semantic_navigate.

The nomic-embed-text model has a 2048-token context window. Large data files far exceed this limit.
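
As a rough illustration of the gap, the estimate below assumes about 4 ASCII characters per token. That ratio is a common rule of thumb, not a measured value for this model's tokenizer, and dense JSON often tokenizes less efficiently.

```typescript
// Assumption: ~4 characters (bytes, for ASCII) per token. Real tokenizers vary.
const ASSUMED_CHARS_PER_TOKEN = 4;

// Context window stated in this issue for nomic-embed-text.
const CONTEXT_WINDOW_TOKENS = 2048;

// Rough token estimate for a file of the given size.
function estimateTokens(sizeInBytes: number): number {
  return Math.ceil(sizeInBytes / ASSUMED_CHARS_PER_TOKEN);
}

const estimatedTokens = estimateTokens(100 * 1024); // 100KB file
const timesOverWindow = estimatedTokens / CONTEXT_WINDOW_TOKENS;
console.log(estimatedTokens, timesOverWindow); // 25600 12.5
```

Under that assumption, a single 100KB data file is about 12.5 times larger than the window.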

Great project — the AST tools work beautifully. Looking forward to semantic search handling mixed codebases!

kerem closed this issue 2026-03-15 15:59:56 +03:00

@ForLoopCodes commented on GitHub (Mar 5, 2026):

on it


@ForLoopCodes commented on GitHub (Mar 5, 2026):

fixed
