Document Indexing

When a document is added to a knowledge base collection with USE_KB or ADD_WEBSITE, the system performs the following steps to make it searchable:

  1. Content Extraction: Files are read and plain text is extracted (PDF, DOCX, HTML, etc.).
  2. Chunking: The text is split into 500-token chunks to keep embeddings manageable.
  3. Embedding Generation: Each chunk is sent to the configured LLM embedding model (default BGE-small-en-v1.5) to produce a dense vector.
  4. Storage: Vectors, along with metadata (source file, chunk offset), are stored in VectorDB under the collection's namespace.
  5. Indexing: VectorDB builds an IVF-PQ index for fast approximate nearest-neighbor search.
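
The chunking and metadata steps above can be sketched roughly as follows. This is a minimal illustration, not the server's actual implementation: the `Chunk` class and `chunk_text` function are hypothetical names, and whitespace-separated words stand in for real model tokens.

```python
from dataclasses import dataclass

CHUNK_SIZE = 500  # tokens per chunk, as stated in step 2


@dataclass
class Chunk:
    source: str  # source file, kept as metadata (step 4)
    offset: int  # index of the chunk's first token in the document
    text: str


def chunk_text(text: str, source: str, size: int = CHUNK_SIZE) -> list[Chunk]:
    # Assumption: words split on whitespace approximate tokens.
    tokens = text.split()
    return [
        Chunk(source=source, offset=i, text=" ".join(tokens[i:i + size]))
        for i in range(0, len(tokens), size)
    ]


# A 1200-word document yields two full chunks and one partial chunk.
chunks = chunk_text("word " * 1200, "policies.pdf")
print([(c.offset, len(c.text.split())) for c in chunks])
# [(0, 500), (500, 500), (1000, 200)]
```

A real tokenizer would count subword tokens rather than words, so chunk boundaries in practice would likely differ from this sketch.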

Index Refresh

If a document is updated, the system reprocesses the file and replaces the old vectors. The index is automatically refreshed; no manual action is required.
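
One way to picture the replace-on-update behavior is the sketch below. It uses a hypothetical in-memory store (`InMemoryStore` and its methods are illustrative names, not the VectorDB API) and simply drops every vector stored for a source before adding the new ones.

```python
from collections import defaultdict


class InMemoryStore:
    """Toy stand-in for a vector store, keyed by collection name."""

    def __init__(self):
        # collection name -> list of (source, offset, vector)
        self._rows = defaultdict(list)

    def replace_document(self, collection, source, rows):
        # Remove all vectors previously stored for this source, then add the new ones.
        kept = [r for r in self._rows[collection] if r[0] != source]
        self._rows[collection] = kept + [(source, off, vec) for off, vec in rows]

    def count(self, collection, source):
        return sum(1 for r in self._rows[collection] if r[0] == source)


store = InMemoryStore()
# First version of the document produced two chunks.
store.replace_document("company-policies", "handbook.pdf",
                       [(0, [0.1, 0.2]), (500, [0.3, 0.4])])
# The updated version produces one chunk; the old vectors are gone.
store.replace_document("company-policies", "handbook.pdf", [(0, [0.5, 0.6])])
print(store.count("company-policies", "handbook.pdf"))  # 1
```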

Example

USE_KB "company-policies"
ADD_WEBSITE "https://example.com/policies"

After execution, the company-policies collection contains indexed vectors ready for semantic search via the FIND keyword.
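
To give a feel for what semantic search over the stored vectors involves, here is a small sketch that ranks entries by cosine similarity. It is an exact brute-force search over made-up two-dimensional vectors, not the approximate IVF-PQ search described above, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, rows, k=2):
    # rows: list of (label, vector); returns the k most similar rows.
    return sorted(rows, key=lambda r: cosine(query, r[1]), reverse=True)[:k]


# Toy vectors invented for illustration; real embeddings have many more dimensions.
rows = [
    ("vacation policy", [0.9, 0.1]),
    ("expense policy", [0.1, 0.9]),
    ("leave policy", [0.8, 0.3]),
]
best = top_k([1.0, 0.0], rows)
print([label for label, _ in best])
# ['vacation policy', 'leave policy']
```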