
Web Crawler Template (crawler.gbai)

A General Bots template that automatically crawls websites and extracts their content to populate a knowledge base.

Overview

The Crawler template enables your bot to automatically fetch, parse, and index web content. It's designed for building knowledge bases from websites, monitoring web pages for changes, and extracting structured data from online sources.

Features

  • Automated Web Scraping - Fetch and parse web pages automatically
  • Document Mode - Answer questions based on crawled content
  • Configurable Depth - Control how many pages to crawl
  • Content Indexing - Automatically add content to knowledge base
  • LLM Integration - Use AI to understand and summarize crawled content

Package Structure

crawler.gbai/
├── README.md
├── crawler.gbkb/          # Knowledge base for crawled content
│   └── docs/              # Indexed documents
└── crawler.gbot/
    └── config.csv         # Crawler configuration

Configuration

Configure the crawler in crawler.gbot/config.csv:

| Parameter | Description | Example |
|---|---|---|
| Website | Target URL to crawl | https://pragmatismo.com.br/ |
| website Max Documents | Maximum pages to crawl | 2 |
| Answer Mode | How to respond to queries | document |
| Theme Color | UI theme color | purple |
| LLM Provider | AI provider for processing | openai |

Example config.csv

name,value
Website,https://pragmatismo.com.br/
website Max Documents,2
Answer Mode,document
Theme Color,purple
LLM Provider,openai

How It Works

  1. Initialization - Bot reads the target website from configuration
  2. Crawling - Fetches pages starting from the root URL
  3. Extraction - Parses HTML and extracts meaningful text content
  4. Indexing - Stores content in the knowledge base for RAG
  5. Q&A - Users can ask questions about the crawled content
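
The same flow can also be driven explicitly from a BASIC dialog. The sketch below is illustrative only; it assumes the GET, USE KB, and ADD TO KB keywords shown later in this README and indexes into the crawler.gbkb knowledge base that ships with this template:

' manual-crawl.bas
' Fetch a single page and index it so it becomes part of the bot's RAG context.
content = GET "https://pragmatismo.com.br/"

IF content THEN
    USE KB "crawler.gbkb"
    ADD TO KB content, "Home Page"
    TALK "The page has been indexed. Ask me anything about it."
END IF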

Usage

Basic Setup

  1. Copy the template to your bot's packages directory:

     cp -r templates/crawler.gbai /path/to/your/bot/packages/

  2. Edit crawler.gbot/config.csv with your target website:

     name,value
     Website,https://your-website.com/
     website Max Documents,10
     Answer Mode,document

  3. Deploy, and the bot will automatically crawl the configured site.

Querying Crawled Content

Once crawled, users can ask questions naturally:

  • "What services does the company offer?"
  • "Tell me about the pricing"
  • "Summarize the about page"
  • "What are the main features?"

Answer Modes

| Mode | Behavior |
|---|---|
| document | Answers strictly based on crawled content |
| hybrid | Combines crawled content with general knowledge |
| summary | Provides concise summaries of relevant pages |
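
The mode is just another row in config.csv. For example, to blend crawled content with the model's general knowledge:

name,value
Answer Mode,hybrid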

Advanced Configuration

Limiting Crawl Scope

Control which pages are crawled:

name,value
Website,https://example.com/docs/
website Max Documents,50
Website Include Pattern,/docs/*
Website Exclude Pattern,/docs/archive/*

Scheduling Recrawls

Set up periodic recrawling to keep content fresh:

name,value
Website Refresh Schedule,0 0 * * 0

This example recrawls every Sunday at midnight.
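
The value is a standard five-field cron expression (minute, hour, day of month, month, day of week). For instance, using the same parameter, a daily recrawl at 2 AM would be:

name,value
Website Refresh Schedule,0 2 * * *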

Authentication

For sites requiring authentication:

name,value
Website Auth Type,basic
Website Username,user
Website Password,secret

Customization

Creating Custom Crawl Logic

Create a BASIC dialog for custom crawling:

' custom-crawl.bas
' List of sites to fetch on demand.
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

FOR EACH url IN urls
    ' Fetch the page; an empty result means the request failed.
    content = GET url

    IF content THEN
        ' Record the page and make it available to the LLM as context.
        SAVE "crawled_pages.csv", url, content, NOW()
        SET CONTEXT content
    END IF
NEXT

TALK "Crawled " + UBOUND(urls) + " pages successfully."

Processing Crawled Content

Use LLM to process and structure crawled data:

' process-crawled.bas
' Read back previously crawled pages and summarize each one with the LLM.
pages = FIND "crawled_pages.csv"

FOR EACH page IN pages
    page_summary = LLM "Summarize this content in 3 bullet points: " + page.content

    ' Build a record with the source URL, its summary, and a timestamp.
    WITH processed
        url = page.url
        summary = page_summary
        processed_at = NOW()
    END WITH

    SAVE "processed_content.csv", processed
NEXT

Extracting Structured Data

Extract specific information from pages:

' extract-products.bas
SET CONTEXT "You are a data extraction assistant. Extract product information as JSON."

page_content = GET "https://store.example.com/products"

products = LLM "Extract all products with name, price, and description as JSON array: " + page_content

SAVE "products.json", products

Integration Examples

With Knowledge Base

' Add crawled content to KB
content = GET "https://docs.example.com/api"

IF content THEN
    USE KB "api-docs.gbkb"
    ADD TO KB content, "API Documentation"
END IF

With Notifications

' Monitor for changes
previous = GET BOT MEMORY "last_content"
current = GET "https://news.example.com"

IF current <> previous THEN
    SEND EMAIL "admin@company.com", "Website Changed", "The monitored page has been updated."
    SET BOT MEMORY "last_content", current
END IF

With Data Analysis

' Analyze competitor pricing
competitor_page = GET "https://competitor.com/pricing"

analysis = LLM "Compare this pricing to our prices and identify opportunities: " + competitor_page

TALK analysis

Best Practices

  1. Respect robots.txt - Only crawl pages allowed by the site's robots.txt
  2. Rate limiting - Don't overwhelm target servers with requests
  3. Set reasonable limits - Start with low Max Documents values
  4. Monitor content quality - Review crawled content for accuracy
  5. Keep content fresh - Schedule periodic recrawls for dynamic sites
  6. Handle errors gracefully - Implement retry logic for failed requests
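
Retry logic (practice 6) can live directly in a BASIC dialog. The sketch below is only illustrative: it assumes GET returns an empty value when a request fails and uses plain classic-BASIC IF/ELSE rather than any crawler-specific retry setting.

' retry-fetch.bas
' Try the fetch once, then retry a single time before giving up.
url = "https://example.com/docs"
content = GET url

IF NOT content THEN
    content = GET url
END IF

IF content THEN
    SAVE "crawled_pages.csv", url, content, NOW()
ELSE
    TALK "Could not fetch " + url + " after two attempts."
END IF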

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| No content indexed | Invalid URL | Verify the Website URL is accessible |
| Partial content | Max Documents too low | Increase the limit in config |
| Stale answers | Content not refreshed | Set up scheduled recrawls |
| Authentication errors | Missing credentials | Add auth settings to config |
| Timeout errors | Slow target site | Increase timeout settings |

Limitations

  • JavaScript-rendered content may not be fully captured
  • Some sites block automated crawlers
  • Large sites may take significant time to fully crawl
  • Dynamic content may require special handling

Related Templates

  • ai-search.gbai - AI-powered document search
  • talk-to-data.gbai - Natural language data queries
  • law.gbai - Legal document processing with similar RAG approach

Use Cases

  • Documentation Bots - Index product docs for support
  • Competitive Intelligence - Monitor competitor websites
  • News Aggregation - Collect news from multiple sources
  • Research Assistants - Build knowledge bases from academic sources
  • FAQ Generators - Extract FAQs from help sites

License

AGPL-3.0 - Part of General Bots Open Source Platform.


Pragmatismo - General Bots