# Document Indexing
When a document is added to a knowledge-base collection with `ADD_KB` or `ADD_WEBSITE`, the system performs several steps to make it searchable:

1. **Content Extraction**: Files are read and plain text is extracted (PDF, DOCX, HTML, etc.).
2. **Chunking**: The text is split into 500-token chunks to keep embeddings manageable.
3. **Embedding Generation**: Each chunk is sent to the configured LLM embedding model (default **BGE-small-en-v1.5**) to produce a dense vector.
4. **Storage**: Vectors, along with metadata (source file, chunk offset), are stored in Qdrant under the collection's namespace.
5. **Indexing**: Qdrant builds an HNSW index for fast approximate nearest-neighbor search.
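
The chunking step and the metadata kept alongside each chunk can be sketched in Python as follows. This is a minimal illustration, not botserver's implementation. It assumes whitespace-separated words stand in for tokens, and the names `Chunk` and `chunk_text` are hypothetical. The real tokenizer, and whether chunks overlap, are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the chunk came from
    offset: int  # index of the chunk's first token within the document
    text: str    # content that would be sent to the embedding model


def chunk_text(text: str, source: str, max_tokens: int = 500) -> list[Chunk]:
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: whitespace-separated words approximate tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        Chunk(source=source, offset=start,
              text=" ".join(tokens[start:start + max_tokens]))
        for start in range(0, len(tokens), max_tokens)
    ]


chunks = chunk_text("word " * 1200, source="policies.pdf")
print([(c.offset, len(c.text.split())) for c in chunks])
# [(0, 500), (500, 500), (1000, 200)]
```

Each `Chunk` carries the source file and chunk offset, the two metadata fields named in the storage step.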
## Index Refresh
If a document is updated, the system reprocesses the file and replaces the old vectors. The index is automatically refreshed; no manual action is required.
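
The replace-on-update behaviour can be sketched with a plain in-memory dictionary standing in for Qdrant. The `VectorStore` class, its method name, and the example vectors are hypothetical. The sketch only illustrates that every old vector for a source is dropped before the new ones are written, so chunks that no longer exist do not linger.

```python
class VectorStore:
    """Toy in-memory stand-in for a vector collection (not the Qdrant API)."""

    def __init__(self) -> None:
        # Maps (source, chunk_offset) -> vector
        self.points: dict[tuple[str, int], list[float]] = {}

    def replace_document(self, source: str, vectors: dict[int, list[float]]) -> None:
        """Remove every vector stored for source, then store the new ones."""
        stale = [key for key in self.points if key[0] == source]
        for key in stale:
            del self.points[key]
        for offset, vector in vectors.items():
            self.points[(source, offset)] = vector


store = VectorStore()
store.replace_document("policies.pdf", {0: [0.1, 0.2], 500: [0.3, 0.4]})
# The updated file is shorter, so the chunk at offset 500 must disappear.
store.replace_document("policies.pdf", {0: [0.9, 0.8]})
print(store.points)  # {('policies.pdf', 0): [0.9, 0.8]}
```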
## Example
```basic
ADD_KB "company-policies"
ADD_WEBSITE "https://example.com/policies"
```
After execution, the `company-policies` collection contains indexed vectors ready for semantic search via the `FIND` keyword.
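
To illustrate what semantic search over the stored vectors means, the following sketch ranks vectors by cosine similarity using an exhaustive scan. It is a toy under stated assumptions: the two-dimensional vectors and point ids are made up, `cosine` and `top_k` are hypothetical names, and a real deployment would query Qdrant's approximate index instead of comparing against every vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], points: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to query."""
    ranked = sorted(points, key=lambda pid: cosine(query, points[pid]), reverse=True)
    return ranked[:k]


# Made-up vectors; real embeddings have hundreds of dimensions.
points = {
    "vacation-policy": [1.0, 0.0],
    "expense-policy": [0.0, 1.0],
    "leave-policy": [0.9, 0.1],
}
print(top_k([1.0, 0.0], points))  # ['vacation-policy', 'leave-policy']
```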