# Web Crawling Guide

## Overview

The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.

## Features

### Content Extraction

- **Text Content**: Extract readable text from web pages
- **Links**: Follow and index linked pages
- **Metadata**: Capture page titles, descriptions, and keywords
- **Structured Data**: Extract data from tables and lists

### Crawl Management

- **Depth Control**: Set how many levels of links to follow
- **Domain Restrictions**: Limit crawling to specific domains
- **URL Patterns**: Include or exclude URLs by pattern
- **Rate Limiting**: Control request frequency to avoid overloading servers

### Content Processing

- **Duplicate Detection**: Avoid indexing the same content twice
- **Content Filtering**: Skip irrelevant pages (login, error pages, etc.)
- **Format Conversion**: Convert HTML to clean, searchable text
- **Language Detection**: Identify content language for proper indexing

## How to Use

### Starting a Crawl

To start crawling a website:

1. Provide the starting URL (seed URL)
2. Configure crawl parameters (depth, limits)
3. Start the crawl process (see the sketch below)
4. Monitor progress and results
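
The sketch below shows what these four steps can look like in code. It is a minimal illustration only: the `crawl` function, its parameter names, and the use of the `requests` and `beautifulsoup4` libraries are assumptions made for the example, not the bot's actual implementation.

```python
# Illustrative only: hypothetical function and parameter names, not the bot's API.
# Assumes the third-party `requests` and `beautifulsoup4` packages are installed.
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0, same_domain=True):
    seed_host = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])             # step 1: start from the seed URL
    seen, results, errors = {seed_url}, [], 0

    while queue and len(results) < max_pages:  # step 2: honor the page limit
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10,
                                    headers={"User-Agent": "ExampleCrawler/1.0"})
            response.raise_for_status()
        except requests.RequestException:
            errors += 1                        # step 4: track failures for monitoring
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        results.append({
            "url": url,
            "title": soup.title.get_text(strip=True) if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })

        if depth < max_depth:                  # step 2: honor the depth limit
            for link in soup.find_all("a", href=True):
                next_url = urljoin(url, link["href"])
                if same_domain and urlparse(next_url).netloc != seed_host:
                    continue
                if next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, depth + 1))

        time.sleep(delay)                      # rate limiting between requests

    print(f"Crawled {len(results)} pages, {errors} errors")  # step 4: report progress
    return results
```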

### Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| `max_depth` | How many link levels to follow | 3 |
| `max_pages` | Maximum pages to crawl | 100 |
| `delay` | Seconds between requests | 1 |
| `same_domain` | Stay within starting domain | true |
| `follow_external` | Follow links to other domains | false |
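
Continuing the illustrative `crawl` sketch above (hypothetical names, not the bot's API), these options would map onto a call like the following; `follow_external` is not modeled in that sketch.

```python
# Defaults from the table above, spelled out for clarity (illustrative only).
pages = crawl(
    "https://docs.example.com/",
    max_depth=3,        # how many link levels to follow
    max_pages=100,      # stop after this many pages
    delay=1,            # seconds between requests
    same_domain=True,   # stay within the starting domain
)
```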

### URL Patterns

You can filter URLs using patterns (a matching sketch follows the lists):

**Include patterns:**
- `/blog/*` - Only crawl blog pages
- `/products/*` - Only crawl product pages

**Exclude patterns:**
- `/admin/*` - Skip admin pages
- `/login` - Skip login pages
- `*.pdf` - Skip PDF files
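
Glob-style patterns like these can be evaluated with Python's standard `fnmatch` module. The helper below is a hypothetical illustration of the include/exclude logic, not the crawler's actual matching code:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include=(), exclude=()):
    """Return True if the URL's path passes the include/exclude filters."""
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False                  # exclusions always win
    if include:                       # with no include patterns, everything passes
        return any(fnmatch(path, pattern) for pattern in include)
    return True

# Example: only blog pages, but never PDFs.
print(url_allowed("https://example.com/blog/hello",
                  include=["/blog/*"], exclude=["*.pdf"]))    # True
print(url_allowed("https://example.com/blog/file.pdf",
                  include=["/blog/*"], exclude=["*.pdf"]))    # False
print(url_allowed("https://example.com/admin/users",
                  include=["/blog/*"]))                       # False
```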

## Best Practices

### Respectful Crawling

1. **Respect robots.txt**: Always check and honor robots.txt rules (see the sketch below)
2. **Rate limiting**: Don't overload servers with too many requests
3. **Identify yourself**: Use a proper user agent string
4. **Off-peak hours**: Schedule large crawls during low-traffic times
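
For points 1 and 3, robots.txt can be checked with Python's standard `urllib.robotparser` before any page is fetched. This is a minimal sketch; the user agent string is a placeholder:

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"  # placeholder identity

def robots_allows(url, user_agent=USER_AGENT):
    """Check robots.txt on the target host before fetching a URL."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()                     # fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

if robots_allows("https://docs.example.com/api/intro"):
    print("Allowed to fetch")
```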

### Efficient Crawling

1. **Start focused**: Begin with a specific section rather than the entire site
2. **Set limits**: Use reasonable depth and page limits
3. **Filter content**: Exclude irrelevant sections early
4. **Monitor progress**: Watch for errors and adjust as needed

### Content Quality

1. **Remove navigation**: Filter out repeated headers/footers
2. **Extract main content**: Focus on the primary page content (see the sketch below)
3. **Handle dynamic content**: Some sites require JavaScript rendering
4. **Check encoding**: Ensure proper character encoding
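
Points 1 and 2 can be approximated by stripping boilerplate elements before indexing. The function below is a sketch that assumes the `beautifulsoup4` library; the list of tags to remove is illustrative:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def extract_main_text(html):
    """Strip boilerplate elements and return the page's main readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()               # drop navigation, page chrome, and scripts
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(" ", strip=True)
```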

## Common Crawl Scenarios

### Documentation Site

```
Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*
```

### Blog Archive

```
Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*
```

### Product Catalog

```
Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*
```

## Understanding Results

### Crawl Statistics

After a crawl completes, you'll see:

- **Pages Crawled**: Total pages successfully processed
- **Pages Skipped**: Pages excluded by filters
- **Errors**: Pages that failed to load
- **Time Elapsed**: Total crawl duration
- **Content Size**: Total indexed content size

### Content Index

Crawled content is indexed and available for:

- Semantic search queries
- Knowledge base answers
- Document retrieval
- AI-powered Q&A

## Troubleshooting

### Pages Not Crawling

- Check that the URL is accessible (not behind a login)
- Verify robots.txt allows crawling
- Ensure the URL matches your include patterns
- Check for JavaScript-only content

### Slow Crawling

- Increase the delay between requests if you are seeing errors
- Reduce concurrent connections
- Check network connectivity
- Monitor server response times

### Missing Content

- Some sites require JavaScript rendering (see the sketch below)
- Content may be loaded dynamically via AJAX
- Check if the content is within an iframe
- Verify the content isn't blocked by a login wall
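
For JavaScript-heavy pages, one common workaround is to render the page in a headless browser before extracting text. The sketch below assumes the third-party `playwright` package and its browsers are installed; it is a possible fallback, not a built-in crawler feature:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    """Load the page in a headless browser so dynamically injected content is present."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX-loaded content to settle
        html = page.content()
        browser.close()
    return html
```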

### Duplicate Content

- Enable duplicate detection (see the sketch below)
- Use canonical URL handling
- Filter out URL parameters that don't change the content
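
A common way to implement these points is to canonicalize URLs and fingerprint the extracted text, so that equivalent pages are indexed only once. The helpers below are an illustrative sketch using only the Python standard library; the tracking-parameter list is an assumption:

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}  # illustrative

def canonicalize(url):
    """Drop fragments and tracking parameters so equivalent URLs compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/") or "/",
                       parts.params, urlencode(query), ""))

def content_fingerprint(text):
    """Hash the extracted text so identical content is indexed only once."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

seen_hashes = set()

def is_duplicate(text):
    digest = content_fingerprint(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```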

## Scheduled Crawling

Set up recurring crawls to keep content fresh:

- **Daily**: For frequently updated news/blog sites
- **Weekly**: For documentation and knowledge bases
- **Monthly**: For stable reference content

## Legal Considerations

Always ensure you have the right to crawl and index content:

- Check website terms of service
- Respect copyright and intellectual property
- Honor robots.txt directives
- Don't crawl private or restricted content
- Consider data protection regulations (GDPR, LGPD)

## Frequently Asked Questions

**Q: How do I crawl a site that requires login?**
A: The crawler works best with public content. For authenticated content, consider using API integrations instead.

**Q: Can I crawl PDF documents?**
A: Yes, PDFs can be downloaded and processed separately for text extraction.

**Q: How often should I re-crawl?**
A: It depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly re-crawls.

**Q: What happens if a page moves or is deleted?**
A: The crawler will detect 404 errors and can remove outdated content from the index.

**Q: Can I crawl multiple sites at once?**
A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.

## Support

For crawling issues:

- Review crawl logs for error details
- Check network and firewall settings
- Verify the target site is accessible
- Contact your administrator for configuration help