mirror of
https://github.com/ForLoopCodes/contextplus.git
synced 2026-04-26 06:25:50 +03:00
[GH-ISSUE #15] semantic_code_search and semantic_navigate fail with 'Unable to embed oversized input' on projects with large data files #12
Originally created by @Morghot42 on GitHub (Mar 5, 2026).
Original GitHub issue: https://github.com/ForLoopCodes/contextplus/issues/15
Bug Description
semantic_code_search and semantic_navigate fail with the error:
Unable to embed oversized input after adaptive retries
This happens on projects that contain large non-code files (JSON data files, GeoJSON, CSV, etc.) alongside source code. The tools appear to embed the full content of every file, including these large data files, which exceeds the embedding model's context window.
Environment
- bunx contextplus
- nomic-embed-text (context window: 2048 tokens)
- gemma2:27b
Reproduction
Any project that has both source code files and large data files (JSON > 100KB, GeoJSON, CSV, etc.) in the project tree.
Steps
1. [...] nomic-embed-text)
2. [...] .json data files (100KB+)
3. [...] semantic_code_search with any query:
   Unable to embed oversized input after adaptive retries
What works vs what doesn't
Works: get_context_tree, get_file_skeleton, semantic_identifier_search, get_blast_radius
Fails: semantic_code_search, semantic_navigate
Expected Behavior
Context+ should either:
- [...] .json, .geojson, .csv, etc.) during embedding, or
- [...] max_file_size threshold (e.g., 50KB) beyond which files are skipped for embedding, or
- [...]
Suggested Fix
The embeddings.ts core module could:
- [...] .json, .geojson, .csv, .xlsx) from the embedding pipeline
- [...] CONTEXTPLUS_MAX_EMBED_FILE_SIZE env var (default ~50KB)
Additional Context
semantic_identifier_search works because it only embeds function/class signatures (small strings). The bug is specifically in the file-level embedding pipeline used by semantic_code_search and semantic_navigate.
The nomic-embed-text model has a 2048-token context window. Large data files far exceed this limit.
Great project; the AST tools work beautifully. Looking forward to semantic search handling mixed codebases!
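For illustration, the two ideas under Suggested Fix (an extension skip list and a size cap) could be sketched roughly as below. This is only a sketch under assumptions: the thread does not show how the maintainer actually fixed it, and `shouldEmbedFile`, `parseMaxEmbedFileSize`, `extensionOf`, and both constants are hypothetical names, not part of the contextplus codebase. Only the extension list, the ~50KB default, and the env var name come from the issue text.

```typescript
// Hypothetical sketch: none of these names come from the contextplus codebase.

// Data-file extensions the issue proposes excluding from embedding.
const SKIPPED_EXTENSIONS = new Set([".json", ".geojson", ".csv", ".xlsx"]);

// Default threshold of roughly 50KB, as suggested in the issue.
const DEFAULT_MAX_EMBED_FILE_SIZE = 50 * 1024;

// Parse a raw setting (for example the value of the proposed
// CONTEXTPLUS_MAX_EMBED_FILE_SIZE env var), falling back to the default
// when it is missing or not a positive number.
function parseMaxEmbedFileSize(raw: string | undefined): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_MAX_EMBED_FILE_SIZE;
}

// Lower-cased extension including the dot, or "" when the name has none.
function extensionOf(filePath: string): string {
  const lastSeparator = Math.max(
    filePath.lastIndexOf("/"),
    filePath.lastIndexOf("\\"),
  );
  const fileName = filePath.slice(lastSeparator + 1);
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot).toLowerCase() : "";
}

// Decide whether a file should enter the file-level embedding pipeline:
// skip known data extensions first, then anything above the size cap.
function shouldEmbedFile(
  filePath: string,
  sizeBytes: number,
  maxBytes: number = DEFAULT_MAX_EMBED_FILE_SIZE,
): boolean {
  if (SKIPPED_EXTENSIONS.has(extensionOf(filePath))) {
    return false;
  }
  return sizeBytes <= maxBytes;
}
```

A caller would check each file before embedding it, e.g. `shouldEmbedFile("data/regions.geojson", 250_000)` returns `false`, while a 4KB `src/index.ts` passes. Keeping the size cap separate from the extension list means an oversized source file is also skipped, not just data files.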
@ForLoopCodes commented on GitHub (Mar 5, 2026):
on it
@ForLoopCodes commented on GitHub (Mar 5, 2026):
fixed