# Web Crawling Guide
## Overview
The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.
## Features
### Content Extraction
- **Text Content**: Extract readable text from web pages
- **Links**: Follow and index linked pages
- **Metadata**: Capture page titles, descriptions, and keywords
- **Structured Data**: Extract data from tables and lists
### Crawl Management
- **Depth Control**: Set how many levels of links to follow
- **Domain Restrictions**: Limit crawling to specific domains
- **URL Patterns**: Include or exclude URLs by pattern
- **Rate Limiting**: Control request frequency to avoid overloading servers
### Content Processing
- **Duplicate Detection**: Avoid indexing the same content twice
- **Content Filtering**: Skip irrelevant pages (login, error pages, etc.)
- **Format Conversion**: Convert HTML to clean, searchable text
- **Language Detection**: Identify content language for proper indexing
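The duplicate-detection step above usually comes down to hashing the cleaned page text and skipping anything whose hash has already been seen. As a minimal sketch in Python (the crawler's actual mechanism isn't documented here, so treat the function below as illustrative):
```python
import hashlib

seen_hashes: set[str] = set()

def is_duplicate(page_text: str) -> bool:
    """Return True if a page with the same (normalized) text was already indexed."""
    # Normalize whitespace so trivial formatting differences don't defeat the check
    normalized = " ".join(page_text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```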
## How to Use
### Starting a Crawl
To start crawling a website:
1. Provide the starting URL (seed URL)
2. Configure crawl parameters (depth, limits)
3. Start the crawl process
4. Monitor progress and results
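Conceptually, the crawl is a breadth-first traversal from the seed URL, bounded by depth and page limits and throttled by a delay between requests. The sketch below illustrates that shape with Python's `requests` and `BeautifulSoup`; it is not the bot's internal implementation, and the parameter names simply mirror the configuration table in the next section.
```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0):
    """Depth-limited, rate-limited breadth-first crawl starting from a single seed URL."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])   # (url, depth)
    pages = {}                       # url -> extracted text

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue                 # count as an error and move on

        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)

        # Enqueue same-domain links until the depth limit is reached
        if depth < max_depth:
            for link in soup.find_all("a", href=True):
                target = urljoin(url, link["href"])
                same_domain = urlparse(target).netloc == urlparse(seed_url).netloc
                if same_domain and target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))

        time.sleep(delay)            # rate limiting: be polite to the server
    return pages
```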
### Configuration Options
| Option | Description | Default |
|--------|-------------|---------|
| `max_depth` | How many link levels to follow | 3 |
| `max_pages` | Maximum pages to crawl | 100 |
| `delay` | Seconds between requests | 1 |
| `same_domain` | Stay within starting domain | true |
| `follow_external` | Follow links to other domains | false |
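If it helps to see the table as a single object, the defaults above can be written out as a plain key/value mapping. The dictionary below is only an illustration of those defaults, not the exact format the bot expects:
```python
default_config = {
    "max_depth": 3,            # how many link levels to follow
    "max_pages": 100,          # maximum pages to crawl
    "delay": 1,                # seconds between requests
    "same_domain": True,       # stay within the starting domain
    "follow_external": False,  # follow links to other domains
}
```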
### URL Patterns
You can filter URLs using patterns:
**Include patterns:**
- `/blog/*` - Only crawl blog pages
- `/products/*` - Only crawl product pages
**Exclude patterns:**
- `/admin/*` - Skip admin pages
- `/login` - Skip login pages
- `*.pdf` - Skip PDF files
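These patterns behave like ordinary filename globs. One way to apply include/exclude lists of this shape is Python's `fnmatch`, sketched below; the bot's actual matching rules may differ (for example, whether patterns are tested against the path or the full URL):
```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include=("*",), exclude=()):
    """Keep a URL only if its path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    return any(fnmatch(path, pattern) for pattern in include)

# Example: crawl blog pages, but skip admin pages and PDF files
url_allowed("https://example.com/blog/post-1",
            include=["/blog/*"], exclude=["/admin/*", "*.pdf"])   # True
```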
## Best Practices
### Respectful Crawling
1. **Respect robots.txt**: Always check and honor robots.txt rules
2. **Rate limiting**: Don't overload servers with too many requests
3. **Identify yourself**: Use a proper user agent string
4. **Off-peak hours**: Schedule large crawls during low-traffic times
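For the first and third points, Python's standard library already includes a robots.txt parser. A minimal sketch follows; the user agent string is a placeholder rather than the bot's real identity, and in practice you would cache one parser per domain instead of re-fetching robots.txt for every URL:
```python
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "ExampleCrawlerBot/1.0 (+https://example.com/bot-info)"  # placeholder identity

def can_fetch(url: str) -> bool:
    """Check robots.txt before requesting a page."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()                                  # fetch and parse robots.txt
    return parser.can_fetch(USER_AGENT, url)
```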
### Efficient Crawling
1. **Start focused**: Begin with a specific section rather than the entire site
2. **Set limits**: Use reasonable depth and page limits
3. **Filter content**: Exclude irrelevant sections early
4. **Monitor progress**: Watch for errors and adjust as needed
### Content Quality
1. **Remove navigation**: Filter out repeated headers/footers
2. **Extract main content**: Focus on the primary page content
3. **Handle dynamic content**: Some sites require JavaScript rendering
4. **Check encoding**: Ensure proper character encoding
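A common way to handle the first two points is to drop boilerplate elements before extracting text, as in the rough `BeautifulSoup` sketch below; real sites often need per-site tuning or a dedicated readability library:
```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    """Strip navigation and other boilerplate, then return the remaining visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that rarely contain primary content
    for tag in soup(["nav", "header", "footer", "aside", "script", "style", "form"]):
        tag.decompose()
    # Prefer the <main> or <article> element when the page provides one
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator=" ", strip=True)
```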
## Common Crawl Scenarios
### Documentation Site
```
Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*
```
### Blog Archive
```
Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*
```
### Product Catalog
```
Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*
```
## Understanding Results
### Crawl Statistics
After a crawl completes, you'll see:
- **Pages Crawled**: Total pages successfully processed
- **Pages Skipped**: Pages excluded by filters
- **Errors**: Pages that failed to load
- **Time Elapsed**: Total crawl duration
- **Content Size**: Total indexed content size
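Internally these figures are just counters kept while the crawl runs. A minimal sketch of such a record, with field names that mirror the list above rather than the bot's actual schema:
```python
from dataclasses import dataclass

@dataclass
class CrawlStats:
    """Running tally updated as the crawl proceeds."""
    pages_crawled: int = 0
    pages_skipped: int = 0
    errors: int = 0
    elapsed_seconds: float = 0.0
    content_bytes: int = 0
```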
### Content Index
Crawled content is indexed and available for:
- Semantic search queries
- Knowledge base answers
- Document retrieval
- AI-powered Q&A
## Troubleshooting
### Pages Not Crawling
- Check that the URL is publicly accessible (not behind a login)
- Verify that robots.txt allows crawling
- Ensure the URL matches your include patterns
- Check for JavaScript-only content
### Slow Crawling
- Check network connectivity
- Monitor the target server's response times
- Reduce concurrent connections if the server is struggling to keep up
- Remember that a larger delay slows the crawl; increase it only if you are seeing errors or rate limiting
### Missing Content
- Some sites require JavaScript rendering
- Content may be loaded dynamically via AJAX
- Check whether the content is inside an iframe
- Verify the content isn't behind a login wall
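If the first two points apply, the page must be rendered in a real browser before its text can be extracted. One common option is a headless-browser library such as Playwright; the sketch below assumes Playwright and its browsers are installed, and is not necessarily how the bot itself handles JavaScript-heavy pages:
```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Load a JavaScript-heavy page in a headless browser and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for AJAX-loaded content to settle
        html = page.content()
        browser.close()
    return html
```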
### Duplicate Content
- Enable duplicate detection
- Use canonical URL handling
- Filter URL parameters that don't change content
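For the last point, URLs can be normalized so that tracking parameters and fragments don't create duplicate index entries. A sketch with `urllib.parse`; which parameters are safe to drop is site-specific, and the list below is purely illustrative:
```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}  # illustrative

def canonicalize(url: str) -> str:
    """Drop fragments and common tracking parameters so equivalent URLs compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

canonicalize("https://example.com/page?id=7&utm_source=mail#top")
# -> "https://example.com/page?id=7"
```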
## Scheduled Crawling
Set up recurring crawls to keep content fresh:
- **Daily**: For frequently updated news/blog sites
- **Weekly**: For documentation and knowledge bases
- **Monthly**: For stable reference content
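How recurring crawls are scheduled depends on your deployment (a cron job is often enough). As one illustration, a plain Python loop can drive a weekly re-crawl; `run_crawl` is a placeholder for starting the crawl described above:
```python
import time
from datetime import datetime, timedelta

def run_crawl():
    """Placeholder: kick off the crawl job here."""
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] starting scheduled crawl...")

RECRAWL_INTERVAL = timedelta(weeks=1)   # weekly, e.g. for documentation sites

next_run = datetime.now()
while True:
    if datetime.now() >= next_run:
        run_crawl()
        next_run = datetime.now() + RECRAWL_INTERVAL
    time.sleep(60)                      # check once a minute
```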
## Legal Considerations
Always ensure you have the right to crawl and index content:
- Check website terms of service
- Respect copyright and intellectual property
- Honor robots.txt directives
- Don't crawl private or restricted content
- Consider data protection regulations (GDPR, LGPD)
## Frequently Asked Questions
**Q: How do I crawl a site that requires login?**
A: The crawler works best with public content. For authenticated content, consider using API integrations instead.
**Q: Can I crawl PDF documents?**
A: Yes, PDFs can be downloaded and processed separately for text extraction.
**Q: How often should I re-crawl?**
A: It depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly re-crawls.
**Q: What happens if a page moves or is deleted?**
A: The crawler will detect 404 errors and can remove outdated content from the index.
**Q: Can I crawl multiple sites at once?**
A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.
## Support
For crawling issues:
- Review crawl logs for error details
- Check network and firewall settings
- Verify target site is accessible
- Contact your administrator for configuration help