How It Works
When you open a project in Creor, the RAG pipeline indexes your source files in the background. Each file is split into semantically meaningful chunks, converted into vector embeddings, and stored in a local vector database. When the agent needs to find code relevant to your request, it queries this index instead of reading every file.
Your codebase
-> File chunking (semantic splitting + AST parsing)
-> Embedding (Voyage AI or Nomic)
-> Vector store (LanceDB)
-> Query time: hybrid search (vector + keyword)
-> Reranking (Jina or Voyage AI)
-> Top results returned to the agent

Indexing and storage happen locally on your machine, and your full source files are never sent to external servers. Embedding generation does make lightweight API calls, but each call carries only a small text chunk, never an entire file.
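The chunking stage above can be sketched in miniature. This is an illustrative stand-in, not Creor's implementation: real chunking uses semantic splitting and AST parsing, while this hypothetical `chunk_file` simply breaks on blank lines and a maximum chunk size, keeping the file path and starting line number that later travel with each search result.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    path: str        # source file the chunk came from
    start_line: int  # 1-based line where the chunk begins
    text: str        # the chunk's contents

def chunk_file(path: str, text: str, max_lines: int = 20) -> list[Chunk]:
    """Naive chunker: split on blank lines, capping chunk size.
    A real pipeline would split on semantic/AST boundaries instead."""
    chunks: list[Chunk] = []
    buf: list[str] = []
    start = 1
    for i, line in enumerate(text.splitlines(), start=1):
        buf.append(line)
        if line.strip() == "" or len(buf) >= max_lines:
            chunks.append(Chunk(path, start, "\n".join(buf)))
            buf, start = [], i + 1
    if buf:  # flush the trailing chunk
        chunks.append(Chunk(path, start, "\n".join(buf)))
    return chunks
```

Each chunk, not each file, is what gets embedded and stored, which is why only small text fragments ever leave your machine.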
Hybrid Search
Creor does not rely on a single search strategy. It combines two complementary approaches to maximize recall and precision.
Vector Similarity (Semantic Search)
Each code chunk is converted into a high-dimensional vector that captures its semantic meaning. When the agent searches, the query is also embedded, and the most similar vectors are retrieved. This finds code that is conceptually related even if it uses different variable names or phrasing.
- Finds code by meaning, not exact wording.
- Works across languages -- a Python function and its TypeScript equivalent can match.
- Handles natural language queries like "the function that validates user email addresses".
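The retrieval step boils down to comparing vectors by cosine similarity. The sketch below uses a hypothetical `toy_embed` (a bag-of-words counter) purely so the example is self-contained; real embeddings from Voyage AI or Nomic are dense vectors that capture meaning rather than token overlap, but the similarity computation is the same idea.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

At query time, the query is embedded the same way and the chunks with the highest cosine scores are retrieved.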
Keyword Matching (Grep Search)
Alongside vector search, Creor runs a keyword-based grep search. This catches exact matches that semantic search might rank lower -- function names, error messages, specific strings, and import paths.
- Exact match for identifiers, class names, and error strings.
- Fast fallback when the semantic index is still building.
- Catches recently added code that may not yet be indexed.
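The keyword pass is conceptually a grep over the project. A minimal in-memory sketch (function name and return shape are illustrative, not Creor's API):

```python
import re

def grep_search(pattern: str, files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Literal keyword search over file contents, like a grep pass.
    Returns (path, line_number, line) for every matching line."""
    rx = re.compile(re.escape(pattern))  # escape so identifiers match literally
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((path, lineno, line))
    return hits
```

Because this pass needs no index, it works even while the semantic index is still building.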
Result Fusion
Results from both search strategies are merged and deduplicated. A reranker then scores each result against the original query, producing a single ranked list. The top results are injected into the agent's context as code snippets with file paths and line numbers.
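One common way to merge two ranked lists before reranking is reciprocal rank fusion; whether Creor uses this exact scheme is not specified, so treat this as an assumed illustration. A result that appears in both lists accumulates score from both, which naturally handles deduplication:

```python
def fuse(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each result id by 1/(k + rank) in every
    list it appears in, then sort by total score. Duplicates are merged."""
    scores: dict[str, float] = {}
    for ranked in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list is then handed to the reranker, which scores each candidate against the original query before the top snippets go into the agent's context.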
Query Classification
Not every query benefits from the same search strategy. Creor's query classifier analyzes each search request and routes it to the optimal pipeline.
| Query Type | Strategy | Example |
|---|---|---|
| Conceptual | Vector-heavy with broad retrieval | "How does authentication work in this project?" |
| Identifier lookup | Grep-heavy with exact matching | "Find the UserService class" |
| Mixed | Balanced hybrid with reranking | "Where is the rate limiter configured and how does it work?" |
| File path | Direct file lookup, skip search | "Show me src/auth/middleware.ts" |
The classifier runs before the search and adds no perceptible latency. It examines the query structure, presence of identifiers (camelCase, PascalCase, snake_case), and natural language indicators to make its routing decision.
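A heuristic classifier along these lines can be sketched with a few regexes. The rules below are assumptions for illustration (Creor's actual classifier is internal): detect file paths first, then code identifiers (camelCase, PascalCase, snake_case), and use question words as a natural-language signal.

```python
import re

# camelCase, PascalCase, or snake_case identifiers
IDENTIFIER = re.compile(r"\b(?:[a-z]+[A-Z]\w*|[A-Z][a-z]+[A-Z]\w*|\w+_\w+)\b")
# a path ending in a common source-file extension
FILE_PATH = re.compile(r"\b[\w./-]+\.(?:ts|tsx|js|py|go|rs|java)\b")
NATURAL_LANGUAGE = re.compile(r"\b(how|why|where|what|does|work)\b", re.I)

def classify(query: str) -> str:
    """Route a query to a search strategy based on its structure."""
    if FILE_PATH.search(query):
        return "file_path"      # direct lookup, skip search
    has_ident = bool(IDENTIFIER.search(query))
    has_nl = bool(NATURAL_LANGUAGE.search(query))
    if has_ident and has_nl:
        return "mixed"          # balanced hybrid with reranking
    if has_ident:
        return "identifier"     # grep-heavy with exact matching
    return "conceptual"         # vector-heavy with broad retrieval
```

Because it is only pattern matching on the query string, a classifier like this runs in microseconds, consistent with adding no perceptible latency.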
When Search Is Used
The agent does not search your codebase on every message. Search is triggered when the agent determines it needs additional context that is not already in the conversation.
- You ask about code the agent has not read yet in this session.
- The agent needs to find all usages of a function before refactoring it.
- You ask a question about project architecture or how a feature is implemented.
- The agent is planning a multi-file change and needs to understand dependencies.
- You reference a concept ("the auth middleware") without specifying a file path.
Search Quality
Several factors affect how well codebase search performs in your project.
| Factor | Impact | What You Can Do |
|---|---|---|
| Project size | Larger projects benefit more from semantic search | Let indexing complete before relying on search-heavy queries |
| Code documentation | Well-commented code produces better embeddings | JSDoc, docstrings, and inline comments improve search recall |
| File types | Source code is indexed; binary files and media are skipped | Check .gitignore -- files ignored by git are also ignored by the indexer |
| Embedding model | Different models have different strengths | Voyage AI code-3 is optimized for code; Nomic is a solid general-purpose alternative |