# Web Crawling Guide

## Overview

The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.

## Features

### Content Extraction

- **Text Content**: Extract readable text from web pages
- **Links**: Follow and index linked pages
- **Metadata**: Capture page titles, descriptions, and keywords
- **Structured Data**: Extract data from tables and lists

### Crawl Management

- **Depth Control**: Set how many levels of links to follow
- **Domain Restrictions**: Limit crawling to specific domains
- **URL Patterns**: Include or exclude URLs by pattern
- **Rate Limiting**: Control request frequency to avoid overloading servers

### Content Processing

- **Duplicate Detection**: Avoid indexing the same content twice
- **Content Filtering**: Skip irrelevant pages (login, error pages, etc.)
- **Format Conversion**: Convert HTML to clean, searchable text
- **Language Detection**: Identify content language for proper indexing

## How to Use

### Starting a Crawl

To start crawling a website:

1. Provide the starting URL (seed URL)
2. Configure crawl parameters (depth, limits)
3. Start the crawl process (see the sketch below)
4. Monitor progress and results
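
The sketch below shows what these four steps can look like in code. It is a minimal illustration only: the `crawl` function, its parameter names, and the use of the `requests` and `beautifulsoup4` libraries are assumptions made for the example, not the bot's actual implementation.

```python
# Illustrative only: hypothetical function and parameter names, not the bot's API.
# Assumes the third-party `requests` and `beautifulsoup4` packages are installed.
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0, same_domain=True):
    seed_host = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])             # step 1: start from the seed URL
    seen, results, errors = {seed_url}, [], 0

    while queue and len(results) < max_pages:  # step 2: honor the page limit
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10,
                                    headers={"User-Agent": "ExampleCrawler/1.0"})
            response.raise_for_status()
        except requests.RequestException:
            errors += 1                        # step 4: track failures for monitoring
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        results.append({
            "url": url,
            "title": soup.title.get_text(strip=True) if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })

        if depth < max_depth:                  # step 2: honor the depth limit
            for link in soup.find_all("a", href=True):
                next_url = urljoin(url, link["href"])
                if same_domain and urlparse(next_url).netloc != seed_host:
                    continue
                if next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, depth + 1))

        time.sleep(delay)                      # rate limiting between requests

    print(f"Crawled {len(results)} pages, {errors} errors")  # step 4: report progress
    return results
```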

### Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| `max_depth` | How many link levels to follow | 3 |
| `max_pages` | Maximum pages to crawl | 100 |
| `delay` | Seconds between requests | 1 |
| `same_domain` | Stay within starting domain | true |
| `follow_external` | Follow links to other domains | false |
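
Continuing the illustrative `crawl` sketch above (hypothetical names, not the bot's API), these options would map onto a call like the following; `follow_external` is not modeled in that sketch.

```python
# Defaults from the table above, spelled out for clarity (illustrative only).
pages = crawl(
    "https://docs.example.com/",
    max_depth=3,        # how many link levels to follow
    max_pages=100,      # stop after this many pages
    delay=1,            # seconds between requests
    same_domain=True,   # stay within the starting domain
)
```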

### URL Patterns

You can filter URLs using patterns (a matching sketch follows the lists):

**Include patterns:**
- `/blog/*` - Only crawl blog pages
- `/products/*` - Only crawl product pages

**Exclude patterns:**
- `/admin/*` - Skip admin pages
- `/login` - Skip login pages
- `*.pdf` - Skip PDF files
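
Glob-style patterns like these can be evaluated with Python's standard `fnmatch` module. The helper below is a hypothetical illustration of the include/exclude logic, not the crawler's actual matching code:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include=(), exclude=()):
    """Return True if the URL's path passes the include/exclude filters."""
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False                  # exclusions always win
    if include:                       # with no include patterns, everything passes
        return any(fnmatch(path, pattern) for pattern in include)
    return True

# Example: only blog pages, but never PDFs.
print(url_allowed("https://example.com/blog/hello",
                  include=["/blog/*"], exclude=["*.pdf"]))    # True
print(url_allowed("https://example.com/blog/file.pdf",
                  include=["/blog/*"], exclude=["*.pdf"]))    # False
print(url_allowed("https://example.com/admin/users",
                  include=["/blog/*"]))                       # False
```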

## Best Practices

### Respectful Crawling

1. **Respect robots.txt**: Always check and honor robots.txt rules (see the sketch below)
2. **Rate limiting**: Don't overload servers with too many requests
3. **Identify yourself**: Use a proper user agent string
4. **Off-peak hours**: Schedule large crawls during low-traffic times
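
For points 1 and 3, robots.txt can be checked with Python's standard `urllib.robotparser` before any page is fetched. This is a minimal sketch; the user agent string is a placeholder:

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"  # placeholder identity

def robots_allows(url, user_agent=USER_AGENT):
    """Check robots.txt on the target host before fetching a URL."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()                     # fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

if robots_allows("https://docs.example.com/api/intro"):
    print("Allowed to fetch")
```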

### Efficient Crawling

1. **Start focused**: Begin with a specific section rather than the entire site
2. **Set limits**: Use reasonable depth and page limits
3. **Filter content**: Exclude irrelevant sections early
4. **Monitor progress**: Watch for errors and adjust as needed

### Content Quality

1. **Remove navigation**: Filter out repeated headers/footers
2. **Extract main content**: Focus on the primary page content (see the sketch below)
3. **Handle dynamic content**: Some sites require JavaScript rendering
4. **Check encoding**: Ensure proper character encoding
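
Points 1 and 2 can be approximated by stripping boilerplate elements before indexing. The function below is a sketch that assumes the `beautifulsoup4` library; the list of tags to remove is illustrative:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def extract_main_text(html):
    """Strip boilerplate elements and return the page's main readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()               # drop navigation, page chrome, and scripts
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(" ", strip=True)
```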

## Common Crawl Scenarios

### Documentation Site

```
Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*
```

### Blog Archive

```
Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*
```

### Product Catalog

```
Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*
```

## Understanding Results

### Crawl Statistics

After a crawl completes, you'll see:

- **Pages Crawled**: Total pages successfully processed
- **Pages Skipped**: Pages excluded by filters
- **Errors**: Pages that failed to load
- **Time Elapsed**: Total crawl duration
- **Content Size**: Total indexed content size

### Content Index

Crawled content is indexed and available for:

- Semantic search queries
- Knowledge base answers
- Document retrieval
- AI-powered Q&A

## Troubleshooting

### Pages Not Crawling

- Check that the URL is accessible (not behind a login)
- Verify robots.txt allows crawling
- Ensure the URL matches your include patterns
- Check for JavaScript-only content

### Slow Crawling

- Increase the delay between requests if you are seeing errors
- Reduce concurrent connections
- Check network connectivity
- Monitor server response times

### Missing Content

- Some sites require JavaScript rendering (see the sketch below)
- Content may be loaded dynamically via AJAX
- Check if the content is within an iframe
- Verify the content isn't blocked by a login wall
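
For JavaScript-heavy pages, one common workaround is to render the page in a headless browser before extracting text. The sketch below assumes the third-party `playwright` package and its browsers are installed; it is a possible fallback, not a built-in crawler feature:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    """Load the page in a headless browser so dynamically injected content is present."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX-loaded content to settle
        html = page.content()
        browser.close()
    return html
```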

### Duplicate Content

- Enable duplicate detection (see the sketch below)
- Use canonical URL handling
- Filter out URL parameters that don't change the content
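
A common way to implement these points is to canonicalize URLs and fingerprint the extracted text, so that equivalent pages are indexed only once. The helpers below are an illustrative sketch using only the Python standard library; the tracking-parameter list is an assumption:

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}  # illustrative

def canonicalize(url):
    """Drop fragments and tracking parameters so equivalent URLs compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/") or "/",
                       parts.params, urlencode(query), ""))

def content_fingerprint(text):
    """Hash the extracted text so identical content is indexed only once."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

seen_hashes = set()

def is_duplicate(text):
    digest = content_fingerprint(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```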

## Scheduled Crawling

Set up recurring crawls to keep content fresh:

- **Daily**: For frequently updated news/blog sites
- **Weekly**: For documentation and knowledge bases
- **Monthly**: For stable reference content

## Legal Considerations

Always ensure you have the right to crawl and index content:

- Check website terms of service
- Respect copyright and intellectual property
- Honor robots.txt directives
- Don't crawl private or restricted content
- Consider data protection regulations (GDPR, LGPD)

## Frequently Asked Questions

**Q: How do I crawl a site that requires login?**
A: The crawler works best with public content. For authenticated content, consider using API integrations instead.

**Q: Can I crawl PDF documents?**
A: Yes, PDFs can be downloaded and processed separately for text extraction.

**Q: How often should I re-crawl?**
A: It depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly re-crawls.

**Q: What happens if a page moves or is deleted?**
A: The crawler will detect 404 errors and can remove outdated content from the index.

**Q: Can I crawl multiple sites at once?**
A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.

## Support

For crawling issues:

- Review crawl logs for error details
- Check network and firewall settings
- Verify the target site is accessible
- Contact your administrator for configuration help