Web Crawling Guide
Overview
The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.
Features
Content Extraction
- Text Content: Extract readable text from web pages
- Links: Follow and index linked pages
- Metadata: Capture page titles, descriptions, and keywords
- Structured Data: Extract data from tables and lists
Crawl Management
- Depth Control: Set how many levels of links to follow
- Domain Restrictions: Limit crawling to specific domains
- URL Patterns: Include or exclude URLs by pattern
- Rate Limiting: Control request frequency to avoid overloading servers
Content Processing
- Duplicate Detection: Avoid indexing the same content twice
- Content Filtering: Skip irrelevant pages (login, error pages, etc.)
- Format Conversion: Convert HTML to clean, searchable text
- Language Detection: Identify content language for proper indexing
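As an illustration of the duplicate detection step, crawlers commonly hash the cleaned page text and skip anything already seen. The sketch below is one possible approach, not the bot's actual implementation:

```python
import hashlib

seen_hashes = set()  # hashes of content already indexed

def is_duplicate(text: str) -> bool:
    """Return True if this cleaned page text was already indexed.

    Whitespace is normalized before hashing so trivial layout
    differences do not defeat the check.
    """
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```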
How to Use
Starting a Crawl
To start crawling a website:
1. Provide the starting URL (seed URL)
2. Configure crawl parameters (depth, limits)
3. Start the crawl process
4. Monitor progress and results
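To make the flow concrete, here is a minimal, simplified sketch of what a crawl like this does under the hood. It uses the requests and BeautifulSoup libraries and hypothetical parameter names; the bot's real implementation may differ.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0):
    """Breadth-first crawl that stays on the seed URL's domain."""
    domain = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])
    visited, pages = set(), []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({
            "url": url,
            "title": soup.title.string if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })

        # Queue in-domain links one level deeper.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:
                queue.append((link, depth + 1))

        time.sleep(delay)  # rate limiting between requests
    return pages
```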
Configuration Options
| Option | Description | Default |
|---|---|---|
| max_depth | How many link levels to follow | 3 |
| max_pages | Maximum pages to crawl | 100 |
| delay | Seconds between requests | 1 |
| same_domain | Stay within the starting domain | true |
| follow_external | Follow links to other domains | false |
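In practice these options are supplied together as a crawl configuration. The snippet below is a hypothetical example only; the exact keys and format depend on how your bot is set up.

```python
# Hypothetical crawl configuration mirroring the options above.
crawl_config = {
    "seed_url": "https://docs.example.com/",
    "max_depth": 3,          # link levels to follow
    "max_pages": 100,        # hard page cap
    "delay": 1,              # seconds between requests
    "same_domain": True,     # stay on the seed's domain
    "follow_external": False,
}
```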
URL Patterns
You can filter URLs using patterns:
Include patterns:
- /blog/* - Only crawl blog pages
- /products/* - Only crawl product pages
Exclude patterns:
- /admin/* - Skip admin pages
- /login - Skip login pages
- *.pdf - Skip PDF files
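Under the hood, patterns like these are typically matched against the URL path with shell-style wildcards. Here is a small sketch using Python's fnmatch, as an illustration rather than the bot's exact matching rules:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/blog/*", "/products/*"]
EXCLUDE = ["/admin/*", "/login", "*.pdf"]

def should_crawl(url: str) -> bool:
    """Apply exclude patterns first, then require an include match."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(path, pattern) for pattern in INCLUDE)
```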
Best Practices
Respectful Crawling
- Respect robots.txt: Always check and honor robots.txt rules
- Rate limiting: Don't overload servers with too many requests
- Identify yourself: Use a proper user agent string
- Off-peak hours: Schedule large crawls during low-traffic times
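As an example of the first three points, Python's built-in urllib.robotparser can check robots.txt before each request, and a descriptive User-Agent identifies your crawler to site operators. The agent string below is a placeholder:

```python
import time
import urllib.robotparser

import requests

USER_AGENT = "MyKnowledgeBot/1.0 (+https://example.com/bot-info)"  # placeholder

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://docs.example.com/robots.txt")
robots.read()

def polite_fetch(url: str, delay: float = 1.0):
    """Fetch a page only if robots.txt allows it, then pause briefly."""
    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay)  # basic rate limiting
    return resp
```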
Efficient Crawling
- Start focused: Begin with a specific section rather than entire site
- Set limits: Use reasonable depth and page limits
- Filter content: Exclude irrelevant sections early
- Monitor progress: Watch for errors and adjust as needed
Content Quality
- Remove navigation: Filter out repeated headers/footers
- Extract main content: Focus on the primary page content
- Handle dynamic content: Some sites require JavaScript rendering
- Check encoding: Ensure proper character encoding
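One common way to isolate the main content is to drop boilerplate tags before extracting text. The sketch below uses BeautifulSoup and is only an illustration of the idea:

```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    """Strip boilerplate elements and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop navigation, headers, footers, scripts, and styles.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    # Prefer <main> or <article> if the page marks its main content.
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(" ", strip=True)
```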
Common Crawl Scenarios
Documentation Site
Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*
Blog Archive
Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*
Product Catalog
Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*
Understanding Results
Crawl Statistics
After a crawl completes, you'll see:
- Pages Crawled: Total pages successfully processed
- Pages Skipped: Pages excluded by filters
- Errors: Pages that failed to load
- Time Elapsed: Total crawl duration
- Content Size: Total indexed content size
Content Index
Crawled content is indexed and available for:
- Semantic search queries
- Knowledge base answers
- Document retrieval
- AI-powered Q&A
Troubleshooting
Pages Not Crawling
- Check if URL is accessible (not behind login)
- Verify robots.txt allows crawling
- Ensure URL matches include patterns
- Check for JavaScript-only content
Slow Crawling
- Increase delay between requests if seeing errors
- Reduce concurrent connections
- Check network connectivity
- Monitor server response times
Missing Content
- Some sites require JavaScript rendering
- Content may be loaded dynamically via AJAX
- Check if content is within an iframe
- Verify content isn't blocked by login wall
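If a page only produces its content after JavaScript runs, a headless browser can render the final HTML before extraction. Here is a minimal sketch using Playwright, one of several possible tools and not necessarily what your crawler uses:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Render a JavaScript-heavy page and return the final HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX to settle
        html = page.content()
        browser.close()
    return html
```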
Duplicate Content
- Enable duplicate detection
- Use canonical URL handling
- Filter URL parameters that don't change content
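Canonical URL handling usually means normalizing URLs so that equivalent addresses collapse to a single index entry. One possible normalization step is sketched below; the ignored parameters are just examples:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Query parameters that typically do not change page content (assumption).
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent pages map to one index entry."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunparse(parts._replace(
        query=urlencode(query),
        fragment="",                         # fragments never reach the server
        path=parts.path.rstrip("/") or "/",  # treat /page and /page/ the same
    ))
```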
Scheduled Crawling
Set up recurring crawls to keep content fresh:
- Daily: For frequently updated news/blog sites
- Weekly: For documentation and knowledge bases
- Monthly: For stable reference content
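How you schedule recurring crawls depends on your deployment. As one illustration, a weekly re-crawl could be driven by the third-party schedule package, with run_crawl standing in for whatever starts your crawl:

```python
import time

import schedule  # third-party package: pip install schedule

def run_crawl():
    # Placeholder: kick off the crawl with your usual configuration here.
    print("Starting scheduled crawl...")

# Weekly re-crawl for documentation-style content, during off-peak hours.
schedule.every().monday.at("02:00").do(run_crawl)

while True:
    schedule.run_pending()
    time.sleep(60)
```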
Legal Considerations
Always ensure you have the right to crawl and index content:
- Check website terms of service
- Respect copyright and intellectual property
- Honor robots.txt directives
- Don't crawl private or restricted content
- Consider data protection regulations (GDPR, LGPD)
Frequently Asked Questions
Q: How do I crawl a site that requires login? A: The crawler works best with public content. For authenticated content, consider using API integrations instead.
Q: Can I crawl PDF documents? A: Yes, PDFs can be downloaded and processed separately for text extraction.
Q: How often should I re-crawl? A: Depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly.
Q: What happens if a page moves or is deleted? A: The crawler will detect 404 errors and can remove outdated content from the index.
Q: Can I crawl multiple sites at once? A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.
Support
For crawling issues:
- Review crawl logs for error details
- Check network and firewall settings
- Verify target site is accessible
- Contact your administrator for configuration help