Indexing

Creor's indexer processes your source files into searchable chunks stored in a local vector database. Indexing is automatic, incremental, and security-aware.

File Chunking

Raw source files are too large and noisy to embed as single vectors. The indexer splits each file into smaller, semantically meaningful chunks that each capture a coherent unit of code -- a function, a class, a configuration block, or a group of related statements.

Chunking Strategy

Creor uses a multi-pass chunking strategy that prioritizes semantic boundaries over arbitrary line counts.

  • First pass: AST parsing identifies top-level declarations (functions, classes, interfaces, types, exports).
  • Second pass: Large declarations are split further at logical boundaries (methods within a class, branches within a switch).
  • Third pass: Non-code files (Markdown, JSON, YAML) are split by headings, keys, or fixed-size windows with overlap.
  • Each chunk retains its file path, start/end line numbers, and parent context (e.g., which class a method belongs to).
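
As a rough illustration of the first two passes, the sketch below chunks Python source using the standard-library ast module. It is not Creor's implementation: the Chunk record, the chunk_python_source function, and the max_lines threshold (a line count standing in for a token limit) are assumptions made for this example.

```python
import ast
from dataclasses import dataclass
from typing import Optional

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)
DECL_TYPES = FUNC_TYPES + (ast.ClassDef,)


@dataclass
class Chunk:
    path: str
    start_line: int          # 1-indexed, inclusive
    end_line: int            # 1-indexed, inclusive
    parent: Optional[str]    # enclosing class name for a method, else None
    text: str


def chunk_python_source(path, source, max_lines=40):
    """Hypothetical two-pass chunker: top-level declarations, then methods of large classes."""
    lines = source.splitlines()

    def make(node, parent):
        start, end = node.lineno, node.end_lineno
        return Chunk(path, start, end, parent, "\n".join(lines[start - 1:end]))

    chunks = []
    for node in ast.parse(source).body:
        # First pass: keep only top-level functions and classes.
        if not isinstance(node, DECL_TYPES):
            continue
        size = node.end_lineno - node.lineno + 1
        # Second pass: a class over the limit is split into one chunk per method.
        # (This simplified version drops class-level attributes and the class header.)
        if isinstance(node, ast.ClassDef) and size > max_lines:
            methods = [n for n in node.body if isinstance(n, FUNC_TYPES)]
            if methods:
                chunks.extend(make(m, node.name) for m in methods)
                continue
        chunks.append(make(node, None))
    return chunks
```

Each resulting chunk carries the file path, its line range, and the name of the parent class where one exists, mirroring the metadata described in the list above.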

Chunk Sizing

| Parameter | Default | Description |
| --- | --- | --- |
| Max chunk size | 1500 tokens | Upper limit for a single chunk. Larger declarations are split. |
| Min chunk size | 50 tokens | Chunks below this threshold are merged with adjacent chunks. |
| Overlap | 100 tokens | Overlap between adjacent chunks to preserve context at boundaries. |
| Context window | Parent declaration | Each chunk includes a header comment identifying its parent scope. |

These defaults work well for most codebases. You can adjust them in the RAG plugin configuration if your project has unusually large or small files.
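
To make the interaction of the three size parameters concrete, here is a minimal sketch of windowing with overlap and of merging undersized chunks. The function names are hypothetical, and a plain list of strings stands in for whatever tokenizer the indexer actually uses.

```python
def split_with_overlap(tokens, max_size=1500, overlap=100):
    """Split tokens into windows of at most max_size that share `overlap` tokens with their neighbor."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    if len(tokens) <= max_size:
        return [tokens]
    step = max_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


def merge_small(chunks, min_size=50):
    """Merge a chunk into its predecessor when either of the two is below min_size."""
    merged = []
    for chunk in chunks:
        if merged and (len(chunk) < min_size or len(merged[-1]) < min_size):
            merged[-1] = merged[-1] + chunk
        else:
            merged.append(list(chunk))
    return merged
```

This simplified merge step does not re-check the maximum size afterward; a fuller version would need to balance both limits.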

AST-Aware Splitting

For supported languages, the indexer parses the Abstract Syntax Tree (AST) before chunking. This ensures that chunks never split a function in the middle of a logic block or break a class definition between its constructor and methods.

| Language | AST Support | Splitting Granularity |
| --- | --- | --- |
| TypeScript / JavaScript | Full | Functions, classes, methods, interfaces, type aliases, exports |
| Python | Full | Functions, classes, methods, decorators, module-level assignments |
| Go | Full | Functions, methods, structs, interfaces, package declarations |
| Rust | Full | Functions, impl blocks, structs, enums, traits, modules |
| Java / Kotlin | Full | Classes, methods, interfaces, enums, annotations |
| C / C++ | Partial | Functions, classes, structs. Macros treated as text. |
| Other languages | Fallback | Line-based splitting with overlap at logical boundaries (blank lines, comments) |

Tip

Even without AST support, the fallback chunker finds natural split points. It uses heuristics such as blank-line clusters, comment blocks, and indentation changes.
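
The following sketch shows one of those heuristics in isolation: cutting at the most recent blank line once a span reaches a size limit. It is an assumption-laden simplification (the function name and max_lines limit are invented here, and comment blocks, indentation changes, and overlap are left out).

```python
def split_at_blank_lines(source, max_lines=30):
    """Return (start_line, end_line) spans, 1-indexed and inclusive, breaking at blank lines when possible."""
    lines = source.splitlines()
    spans = []
    start = 0          # 0-based index of the first line in the current span
    last_blank = None  # 0-based index of the latest blank line seen in the current span
    for i, line in enumerate(lines):
        if not line.strip():
            last_blank = i
        if i - start + 1 >= max_lines:
            # Prefer to end the span at the last blank line; otherwise cut at the current line.
            end = last_blank if last_blank is not None and last_blank > start else i
            spans.append((start + 1, end + 1))
            start = end + 1
            last_blank = None
    if start < len(lines):
        spans.append((start + 1, len(lines)))
    return spans
```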

Incremental Indexing

Creor does not re-index your entire codebase every time you save a file. A file watcher monitors your workspace and triggers re-indexing only for files that have changed.

How It Works

  • On project open: the indexer compares file modification timestamps against the stored index state. Only changed or new files are processed.
  • On file save: the watcher detects the change and queues the file for re-indexing. Old chunks from that file are removed and replaced.
  • On file delete: all chunks associated with the deleted file are removed from the vector store.
  • On branch switch: the indexer detects changed files via git and re-indexes them. This typically takes a few seconds.
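
The comparison against stored index state can be sketched as a diff between two snapshots. This example assumes the state is a simple mapping from relative path to content hash; the real state.json schema is not documented here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (as a POSIX-style relative path) to a SHA-256 hash of its content."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def plan_reindex(current, state):
    """Return (to_index, to_remove) given the current snapshot and the stored state.

    to_index: new or changed files whose chunks must be (re)built.
    to_remove: files in the stored state that no longer exist.
    """
    to_index = sorted(p for p, h in current.items() if state.get(p) != h)
    to_remove = sorted(p for p in state if p not in current)
    return to_index, to_remove
```

Comparing hashes rather than modification times is a choice made for this sketch; it avoids re-indexing files that were touched but not changed, at the cost of reading every file.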

Indexing Performance

| Codebase Size | Initial Index Time | Incremental Update |
| --- | --- | --- |
| Small (< 1K files) | 10-30 seconds | < 1 second |
| Medium (1K-10K files) | 1-5 minutes | 1-3 seconds |
| Large (10K-50K files) | 5-15 minutes | 2-5 seconds |
| Very large (50K+ files) | 15-45 minutes | 3-10 seconds |

Initial indexing runs in the background and does not block the editor. You can start working immediately; until a file has been indexed, the agent falls back to grep-based search for it.

Note

Indexing performance depends on your embedding provider's API speed and rate limits. Voyage AI and Nomic both support batch embedding requests, which Creor uses to minimize round trips.

Security Filtering

The indexer includes a security filter that prevents sensitive data from being embedded and stored in the vector database. The filter runs before the embedding step, so anything it catches is never sent to an embedding API.

What Gets Filtered

  • Files matching .gitignore patterns are skipped entirely.
  • Files matching common secret patterns (.env, .pem, credentials.json, *_secret*, *.key) are excluded.
  • Chunks containing high-entropy strings that look like API keys or tokens are redacted before embedding.
  • Binary files, images, videos, and other non-text files are skipped.
  • Lock files (package-lock.json, yarn.lock, bun.lock, Cargo.lock) are skipped -- they add noise without useful semantics.
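
A minimal sketch of two of these checks, filename matching and high-entropy redaction, is shown below. The pattern list is adapted from the bullet above (with .pem treated as the glob *.pem, which is an assumption), and the regular expression, the 4.0-bit threshold, and all function names are illustrative choices rather than Creor's actual rules.

```python
import fnmatch
import math
import re
from collections import Counter
from pathlib import PurePosixPath

# Assumed glob forms of the secret-file patterns; ".pem" is read here as "*.pem".
SECRET_FILE_PATTERNS = [".env", "*.pem", "credentials.json", "*_secret*", "*.key"]

# Candidate tokens: runs of 20+ characters drawn from a base64-like alphabet.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/_\-]{20,}")


def is_secret_file(path):
    """True if the file name matches any secret-file pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch.fnmatchcase(name, pat) for pat in SECRET_FILE_PATTERNS)


def shannon_entropy(s):
    """Shannon entropy of the character distribution of s, in bits per character."""
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())


def redact_high_entropy(text, threshold=4.0):
    """Replace long tokens whose entropy meets the threshold with a placeholder."""
    def repl(match):
        token = match.group()
        return "[REDACTED]" if shannon_entropy(token) >= threshold else token
    return TOKEN_RE.sub(repl, text)
```

Entropy thresholds are a heuristic: long, repetitive identifiers tend to fall below the cutoff, but random-looking constants can be redacted by mistake and short secrets can slip through.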

Custom Exclusions

You can add custom exclusion patterns in your project's creor.json file to skip additional files or directories.

{
  "plugins": {
    "devflow-rag": {
      "exclude": [
        "vendor/**",
        "generated/**",
        "**/*.generated.ts",
        "test/fixtures/**"
      ]
    }
  }
}

Warning

The security filter is a best-effort safeguard. Do not rely on it as your only line of defense for sensitive data. If your codebase contains secrets, use a dedicated secrets manager and keep secrets out of source files.

Index State

The vector index is stored locally in your project's .creor/ directory. It persists across editor restarts, so you only pay the initial indexing cost once.

Index Location

.creor/
  rag/
    index/          # LanceDB vector store files
    state.json      # File hashes and timestamps for incremental updates
    config.json     # Snapshot of indexing configuration

Managing the Index

  • Rebuild index: Delete the .creor/rag/ directory and reopen the project. A full re-index will start automatically.
  • Check index status: The status bar shows indexing progress. Hover over it to see the number of indexed files and chunks.
  • Pause indexing: Close the project or disable the RAG plugin in settings. Indexing resumes when you reopen the project or re-enable the plugin.
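
If you prefer to script the rebuild step, a small helper along these lines would remove the index directory. The function name reset_index is hypothetical; the only detail taken from this page is the .creor/rag/ location.

```python
import shutil
from pathlib import Path


def reset_index(project_root):
    """Delete <project_root>/.creor/rag/ if it exists. Returns True if something was removed."""
    rag_dir = Path(project_root) / ".creor" / "rag"
    if rag_dir.is_dir():
        shutil.rmtree(rag_dir)
        return True
    return False
```

Run it while the project is closed, then reopen the project to trigger the full re-index.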

Tip

Add .creor/rag/ to your .gitignore. The index is machine-specific and should not be committed to version control.

Troubleshooting

Index not building

  • Verify the RAG plugin is enabled in settings.
  • Check that an embedding provider API key is configured (Voyage AI or Nomic).
  • Look at the Creor output panel (View > Output > Creor Engine) for error messages.

Search returning irrelevant results

  • The index may be stale. Delete .creor/rag/ and let it rebuild.
  • Check if the relevant files are excluded by .gitignore or custom exclusion patterns.
  • Try a more specific query -- include function names or file paths when possible.

Indexing is slow

  • Large codebases take longer on first index. Subsequent updates are incremental.
  • Check your embedding provider's rate limits. Voyage AI's free tier has lower throughput.
  • Exclude large generated or vendored directories to reduce index size.
