# Web Crawling Guide
## Overview
The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.
## Features
### Content Extraction
- **Text Content**: Extract readable text from web pages
- **Links**: Follow and index linked pages
- **Metadata**: Capture page titles, descriptions, and keywords
- **Structured Data**: Extract data from tables and lists
### Crawl Management
- **Depth Control**: Set how many levels of links to follow
- **Domain Restrictions**: Limit crawling to specific domains
- **URL Patterns**: Include or exclude URLs by pattern
- **Rate Limiting**: Control request frequency to avoid overloading servers
### Content Processing
- **Duplicate Detection**: Avoid indexing the same content twice
- **Content Filtering**: Skip irrelevant pages (login, error pages, etc.)
- **Format Conversion**: Convert HTML to clean, searchable text
- **Language Detection**: Identify content language for proper indexing
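The duplicate-detection step above usually comes down to hashing the cleaned page text and skipping anything whose hash has already been seen. As a minimal sketch in Python (the crawler's actual mechanism isn't documented here, so treat the function below as illustrative):
```python
import hashlib

seen_hashes: set[str] = set()

def is_duplicate(page_text: str) -> bool:
    """Return True if a page with the same (normalized) text was already indexed."""
    # Normalize whitespace so trivial formatting differences don't defeat the check
    normalized = " ".join(page_text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```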
## How to Use
### Starting a Crawl
To start crawling a website:
1. Provide the starting URL (seed URL)
2. Configure crawl parameters (depth, limits)
3. Start the crawl process
4. Monitor progress and results
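Conceptually, the crawl is a breadth-first traversal from the seed URL, bounded by depth and page limits and throttled by a delay between requests. The sketch below illustrates that shape with Python's `requests` and `BeautifulSoup`; it is not the bot's internal implementation, and the parameter names simply mirror the configuration table in the next section.
```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0):
    """Depth-limited, rate-limited breadth-first crawl starting from a single seed URL."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])   # (url, depth)
    pages = {}                       # url -> extracted text

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue                 # count as an error and move on

        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)

        # Enqueue same-domain links until the depth limit is reached
        if depth < max_depth:
            for link in soup.find_all("a", href=True):
                target = urljoin(url, link["href"])
                same_domain = urlparse(target).netloc == urlparse(seed_url).netloc
                if same_domain and target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))

        time.sleep(delay)            # rate limiting: be polite to the server
    return pages
```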
### Configuration Options
| Option | Description | Default |
|--------|-------------|---------|
| `max_depth` | How many link levels to follow | 3 |
| `max_pages` | Maximum pages to crawl | 100 |
| `delay` | Seconds between requests | 1 |
| `same_domain` | Stay within starting domain | true |
| `follow_external` | Follow links to other domains | false |
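If it helps to see the table as a single object, the defaults above can be written out as a plain key/value mapping. The dictionary below is only an illustration of those defaults, not the exact format the bot expects:
```python
default_config = {
    "max_depth": 3,            # how many link levels to follow
    "max_pages": 100,          # maximum pages to crawl
    "delay": 1,                # seconds between requests
    "same_domain": True,       # stay within the starting domain
    "follow_external": False,  # follow links to other domains
}
```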
### URL Patterns
You can filter URLs using patterns:
**Include patterns:**
- `/blog/*` - Only crawl blog pages
- `/products/*` - Only crawl product pages
**Exclude patterns:**
- `/admin/*` - Skip admin pages
- `/login` - Skip login pages
- `*.pdf` - Skip PDF files
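These patterns behave like ordinary filename globs. One way to apply include/exclude lists of this shape is Python's `fnmatch`, sketched below; the bot's actual matching rules may differ (for example, whether patterns are tested against the path or the full URL):
```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include=("*",), exclude=()):
    """Keep a URL only if its path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    return any(fnmatch(path, pattern) for pattern in include)

# Example: crawl blog pages, but skip admin pages and PDF files
url_allowed("https://example.com/blog/post-1",
            include=["/blog/*"], exclude=["/admin/*", "*.pdf"])   # True
```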
## Best Practices
### Respectful Crawling
1. **Respect robots.txt**: Always check and honor robots.txt rules
2. **Rate limiting**: Don't overload servers with too many requests
3. **Identify yourself**: Use a proper user agent string
4. **Off-peak hours**: Schedule large crawls during low-traffic times
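For the first and third points, Python's standard library already includes a robots.txt parser. A minimal sketch follows; the user agent string is a placeholder rather than the bot's real identity, and in practice you would cache one parser per domain instead of re-fetching robots.txt for every URL:
```python
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "ExampleCrawlerBot/1.0 (+https://example.com/bot-info)"  # placeholder identity

def can_fetch(url: str) -> bool:
    """Check robots.txt before requesting a page."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()                                  # fetch and parse robots.txt
    return parser.can_fetch(USER_AGENT, url)
```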
### Efficient Crawling
1. **Start focused**: Begin with a specific section rather than the entire site
2. **Set limits**: Use reasonable depth and page limits
3. **Filter content**: Exclude irrelevant sections early
4. **Monitor progress**: Watch for errors and adjust as needed
### Content Quality
1. **Remove navigation**: Filter out repeated headers/footers
2. **Extract main content**: Focus on the primary page content
3. **Handle dynamic content**: Some sites require JavaScript rendering
4. **Check encoding**: Ensure proper character encoding
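A common way to handle the first two points is to drop boilerplate elements before extracting text, as in the rough `BeautifulSoup` sketch below; real sites often need per-site tuning or a dedicated readability library:
```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    """Strip navigation and other boilerplate, then return the remaining visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that rarely contain primary content
    for tag in soup(["nav", "header", "footer", "aside", "script", "style", "form"]):
        tag.decompose()
    # Prefer the <main> or <article> element when the page provides one
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator=" ", strip=True)
```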
## Common Crawl Scenarios
### Documentation Site
```
Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*
```
### Blog Archive
```
Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*
```
### Product Catalog
```
Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*
```
## Understanding Results
### Crawl Statistics
After a crawl completes, you'll see:
- **Pages Crawled**: Total pages successfully processed
- **Pages Skipped**: Pages excluded by filters
- **Errors**: Pages that failed to load
- **Time Elapsed**: Total crawl duration
- **Content Size**: Total indexed content size
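Internally these figures are just counters kept while the crawl runs. A minimal sketch of such a record, with field names that mirror the list above rather than the bot's actual schema:
```python
from dataclasses import dataclass

@dataclass
class CrawlStats:
    """Running tally updated as the crawl proceeds."""
    pages_crawled: int = 0
    pages_skipped: int = 0
    errors: int = 0
    elapsed_seconds: float = 0.0
    content_bytes: int = 0
```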
### Content Index
Crawled content is indexed and available for:
- Semantic search queries
- Knowledge base answers
- Document retrieval
- AI-powered Q&A
## Troubleshooting
### Pages Not Crawling
- Check that the URL is publicly accessible (not behind a login)
- Verify that robots.txt allows crawling
- Ensure the URL matches your include patterns
- Check for JavaScript-only content
### Slow Crawling
- Check network connectivity
- Monitor the target server's response times
- Reduce concurrent connections if the server is struggling to keep up
- Remember that a larger delay slows the crawl; increase it only if you are seeing errors or rate limiting
### Missing Content
- Some sites require JavaScript rendering
- Content may be loaded dynamically via AJAX
- Check whether the content is inside an iframe
- Verify the content isn't behind a login wall
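If the first two points apply, the page must be rendered in a real browser before its text can be extracted. One common option is a headless-browser library such as Playwright; the sketch below assumes Playwright and its browsers are installed, and is not necessarily how the bot itself handles JavaScript-heavy pages:
```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Load a JavaScript-heavy page in a headless browser and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for AJAX-loaded content to settle
        html = page.content()
        browser.close()
    return html
```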
### Duplicate Content
- Enable duplicate detection
- Use canonical URL handling
- Filter URL parameters that don't change content
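For the last point, URLs can be normalized so that tracking parameters and fragments don't create duplicate index entries. A sketch with `urllib.parse`; which parameters are safe to drop is site-specific, and the list below is purely illustrative:
```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}  # illustrative

def canonicalize(url: str) -> str:
    """Drop fragments and common tracking parameters so equivalent URLs compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

canonicalize("https://example.com/page?id=7&utm_source=mail#top")
# -> "https://example.com/page?id=7"
```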
## Scheduled Crawling
Set up recurring crawls to keep content fresh:
- **Daily**: For frequently updated news/blog sites
- **Weekly**: For documentation and knowledge bases
- **Monthly**: For stable reference content
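How recurring crawls are scheduled depends on your deployment (a cron job is often enough). As one illustration, a plain Python loop can drive a weekly re-crawl; `run_crawl` is a placeholder for starting the crawl described above:
```python
import time
from datetime import datetime, timedelta

def run_crawl():
    """Placeholder: kick off the crawl job here."""
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] starting scheduled crawl...")

RECRAWL_INTERVAL = timedelta(weeks=1)   # weekly, e.g. for documentation sites

next_run = datetime.now()
while True:
    if datetime.now() >= next_run:
        run_crawl()
        next_run = datetime.now() + RECRAWL_INTERVAL
    time.sleep(60)                      # check once a minute
```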
## Legal Considerations
Always ensure you have the right to crawl and index content:
- Check website terms of service
- Respect copyright and intellectual property
- Honor robots.txt directives
- Don't crawl private or restricted content
- Consider data protection regulations (GDPR, LGPD)
## Frequently Asked Questions
**Q: How do I crawl a site that requires login?**
A: The crawler works best with public content. For authenticated content, consider using API integrations instead.
**Q: Can I crawl PDF documents?**
A: Yes, PDFs can be downloaded and processed separately for text extraction.
**Q: How often should I re-crawl?**
A: It depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly re-crawls.
**Q: What happens if a page moves or is deleted?**
A: The crawler will detect 404 errors and can remove outdated content from the index.
**Q: Can I crawl multiple sites at once?**
A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.
## Support
For crawling issues:
- Review crawl logs for error details
- Check network and firewall settings
- Verify target site is accessible
- Contact your administrator for configuration help