# Document Indexing

When a document is added to a knowledge‑base collection with `USE_KB` or `ADD_WEBSITE`, the system performs several steps to make it searchable:

1. **Content Extraction** – Each file (PDF, DOCX, HTML, etc.) is read and its plain text is extracted.
2. **Chunking** – The text is split into 500‑token chunks to keep embeddings manageable.
3. **Embedding Generation** – Each chunk is sent to the configured embedding model (default **BGE‑small‑en‑v1.5**) to produce a dense vector.
4. **Storage** – Vectors, along with metadata (source file, chunk offset), are stored in VectorDB under the collection’s namespace.
5. **Indexing** – VectorDB builds an IVF‑PQ index for fast approximate nearest‑neighbor search.
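
The chunking step and the metadata recorded in the storage step can be illustrated with a minimal Python sketch. It is not the platform's implementation: the `Chunk` class and `chunk_text` function are hypothetical names, and whitespace-separated words stand in for real model tokens, which the actual tokenizer would count differently.

```python
from dataclasses import dataclass

CHUNK_SIZE = 500  # tokens per chunk, as stated in step 2


@dataclass
class Chunk:
    source: str  # source file name (metadata from step 4)
    offset: int  # index of the chunk's first token in the document
    text: str    # the chunk's text, ready to be embedded


def chunk_text(text: str, source: str, size: int = CHUNK_SIZE) -> list[Chunk]:
    """Split text into consecutive chunks of at most `size` tokens.

    Assumption: whitespace-separated words approximate model tokens.
    """
    tokens = text.split()
    return [
        Chunk(source=source, offset=start, text=" ".join(tokens[start:start + size]))
        for start in range(0, len(tokens), size)
    ]


# A 1200-token document yields two full chunks and one partial chunk.
chunks = chunk_text("word " * 1200, "policies.pdf")
print([(c.offset, len(c.text.split())) for c in chunks])
# [(0, 500), (500, 500), (1000, 200)]
```

Keeping the offset alongside each chunk is what lets a search result be traced back to its position in the source file.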

## Index Refresh

If a document is updated, the system re‑processes the file and replaces the old vectors. The index is automatically refreshed; no manual action is required.
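
The replace-on-update behavior can be sketched as follows. This is an illustration under stated assumptions, not the VectorDB API: `store` is a hypothetical in-memory stand-in for the vector store, and `reindex` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the vector store:
# collection name -> list of (source, chunk offset, vector)
store: dict[str, list[tuple[str, int, list[float]]]] = defaultdict(list)


def reindex(collection: str, source: str, new_vectors: list[tuple[int, list[float]]]) -> None:
    """Drop every vector that came from `source`, then add the new ones."""
    kept = [entry for entry in store[collection] if entry[0] != source]
    store[collection] = kept + [(source, offset, vec) for offset, vec in new_vectors]


# Initial indexing of two documents.
reindex("company-policies", "policies.pdf", [(0, [0.1, 0.2]), (500, [0.3, 0.4])])
reindex("company-policies", "handbook.pdf", [(0, [0.5, 0.6])])

# The first document is updated and now produces a single chunk.
reindex("company-policies", "policies.pdf", [(0, [0.9, 0.8])])

print([(src, off) for src, off, _ in store["company-policies"]])
# [('handbook.pdf', 0), ('policies.pdf', 0)]
```

Removing by source before inserting ensures that stale chunks from a longer previous version of the file do not linger in the collection.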

## Example

```basic
USE_KB "company-policies"
ADD_WEBSITE "https://example.com/policies"
```

After execution, the `company-policies` collection contains indexed vectors ready for semantic search via the `FIND` keyword.
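
To make "semantic search" concrete, the sketch below ranks stored vectors by cosine similarity to a query vector. It is an exact brute-force search, which an approximate index such as IVF‑PQ trades a little accuracy to speed up; it does not show how `FIND` is implemented. The function names, labels, and three-dimensional vectors are made up for illustration, and real embeddings have many more dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], entries: list[tuple[str, list[float]]], k: int = 2):
    """Return the k entries whose vectors are most similar to the query."""
    return sorted(entries, key=lambda e: cosine(query, e[1]), reverse=True)[:k]


# Toy collection: (chunk label, embedding vector)
entries = [
    ("vacation policy", [0.9, 0.1, 0.0]),
    ("expense policy", [0.1, 0.9, 0.0]),
    ("dress code", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.2, 0.0], entries, k=2)
print([label for label, _ in results])
# ['vacation policy', 'expense policy']
```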