# Document Indexing
When a document is added to a knowledge-base collection with `USE_KB` or `ADD_WEBSITE`, the system performs several steps to make it searchable:
1. **Content Extraction**: Files are read and plain text is extracted (PDF, DOCX, HTML, etc.).
2. **Chunking**: The text is split into 500-token chunks to keep embeddings manageable.
3. **Embedding Generation**: Each chunk is sent to the configured LLM embedding model (default **BGE-small-en-v1.5**) to produce a dense vector.
4. **Storage**: Vectors, along with metadata (source file, chunk offset), are stored in VectorDB under the collection's namespace.
5. **Indexing**: VectorDB builds an IVF-PQ index for fast approximate nearest-neighbor search.
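
The chunking step and the metadata kept with each chunk can be pictured with a minimal Python sketch. This is an illustration only, not the server's implementation: the `Chunk` type and `chunk_text` function are hypothetical names, and whitespace-separated words stand in for the model's real tokens.

```python
from dataclasses import dataclass

CHUNK_SIZE = 500  # tokens per chunk, matching the chunking step above


@dataclass
class Chunk:
    source: str  # file or URL the text came from
    offset: int  # index of the chunk's first token in the document
    text: str


def chunk_text(text: str, source: str, size: int = CHUNK_SIZE) -> list[Chunk]:
    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    return [
        Chunk(source=source, offset=i, text=" ".join(tokens[i:i + size]))
        for i in range(0, len(tokens), size)
    ]


chunks = chunk_text("word " * 1200, source="policies.pdf")
print([(c.offset, len(c.text.split())) for c in chunks])
# prints [(0, 500), (500, 500), (1000, 200)]
```

Each chunk would then be embedded and stored together with its `source` and `offset`, which is the metadata listed in the storage step.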
## Index Refresh
If a document is updated, the system reprocesses the file and replaces the old vectors. The index is automatically refreshed; no manual action is required.
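
One way to picture the replacement behavior is the following Python sketch. It assumes a hypothetical in-memory store (`InMemoryStore` and `replace_source` are illustrative names, not the VectorDB API) in which every record carries its source, so that all old vectors for a document can be dropped before the new ones are written.

```python
from collections import defaultdict


class InMemoryStore:
    """Hypothetical stand-in for a vector store, keyed by collection name."""

    def __init__(self):
        # collection -> list of (source, chunk_offset, vector)
        self._records = defaultdict(list)

    def replace_source(self, collection, source, new_records):
        # Drop every vector that came from `source`, then add the new ones.
        kept = [r for r in self._records[collection] if r[0] != source]
        added = [(source, offset, vector) for offset, vector in new_records]
        self._records[collection] = kept + added

    def offsets(self, collection, source):
        return [r[1] for r in self._records[collection] if r[0] == source]


store = InMemoryStore()
store.replace_source("company-policies", "handbook.pdf", [(0, [0.1, 0.2]), (500, [0.3, 0.4])])
store.replace_source("company-policies", "faq.html", [(0, [0.5, 0.6])])

# The handbook shrinks to a single chunk after an update.
store.replace_source("company-policies", "handbook.pdf", [(0, [0.9, 0.8])])
print(store.offsets("company-policies", "handbook.pdf"))  # prints [0]
print(store.offsets("company-policies", "faq.html"))      # prints [0]
```

Replacing by source rather than by chunk offset avoids stale vectors when the updated document is shorter than the original.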
## Example
```basic
USE_KB "company-policies"
ADD_WEBSITE "https://example.com/policies"
```
After execution, the `company-policies` collection contains indexed vectors ready for semantic search via the `FIND` keyword.
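
Conceptually, semantic search ranks stored chunks by how close their vectors are to the query's vector. The Python sketch below shows this with exact cosine similarity over toy three-dimensional vectors. It is an assumption-laden illustration: the `cosine` and `search` functions and the sample records are hypothetical, and the real system uses an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, top_k=2):
    # records: list of (label, vector); exact scan, highest similarity first.
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [label for label, _ in ranked[:top_k]]


# Toy vectors standing in for real embeddings.
records = [
    ("vacation policy", [0.9, 0.1, 0.0]),
    ("expense policy", [0.1, 0.9, 0.0]),
    ("security policy", [0.0, 0.2, 0.9]),
]
print(search([1.0, 0.0, 0.0], records))
# prints ['vacation policy', 'expense policy']
```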