botserver/templates/crawler.gbai/crawler.gbkb/web-crawling-guide.md

Web Crawling Guide

Overview

The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.

Features

Content Extraction

  • Text Content: Extract readable text from web pages
  • Links: Follow and index linked pages
  • Metadata: Capture page titles, descriptions, and keywords
  • Structured Data: Extract data from tables and lists

Crawl Management

  • Depth Control: Set how many levels of links to follow
  • Domain Restrictions: Limit crawling to specific domains
  • URL Patterns: Include or exclude URLs by pattern
  • Rate Limiting: Control request frequency to avoid overloading servers

Content Processing

  • Duplicate Detection: Avoid indexing the same content twice
  • Content Filtering: Skip irrelevant pages (login, error pages, etc.)
  • Format Conversion: Convert HTML to clean, searchable text
  • Language Detection: Identify content language for proper indexing

How to Use

Starting a Crawl

To start crawling a website (a code sketch of this loop follows the steps):

  1. Provide the starting URL (seed URL)
  2. Configure crawl parameters (depth, limits)
  3. Start the crawl process
  4. Monitor progress and results
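
Conceptually, the crawl is a breadth-first walk of the link graph, bounded by the depth and page limits. The sketch below shows one way such a loop could look using only the Python standard library; the `fetch` helper, the `LinkCollector` parser, and the user agent string are illustrative assumptions, not the bot's actual implementation.

```python
# Minimal sketch of a breadth-first crawl loop with depth, page-count,
# and same-domain limits. Standard library only; illustrative names.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
import time


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url, timeout=10):
    """Fetch a page and return its HTML, or None on error."""
    try:
        req = Request(url, headers={"User-Agent": "ExampleCrawler/0.1"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None


def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0, same_domain=True):
    seed_host = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])      # (url, depth) pairs
    seen, pages = {seed_url}, {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                     # failed pages are skipped
        pages[url] = html

        if depth < max_depth:
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link).split("#")[0]
                if same_domain and urlparse(absolute).netloc != seed_host:
                    continue
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        time.sleep(delay)                # simple rate limiting
    return pages
```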

Configuration Options

| Option | Description | Default |
| --- | --- | --- |
| max_depth | How many link levels to follow | 3 |
| max_pages | Maximum pages to crawl | 100 |
| delay | Seconds between requests | 1 |
| same_domain | Stay within the starting domain | true |
| follow_external | Follow links to other domains | false |
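
Taken together, these options might be expressed as a configuration like the one below. The dictionary form and key names simply mirror the table; the crawler's real configuration format may differ.

```python
# Illustrative only: the options above expressed as a Python dict,
# using the documented defaults and an example seed URL.
crawl_config = {
    "start_url": "https://docs.example.com/",  # seed URL
    "max_depth": 3,             # how many link levels to follow
    "max_pages": 100,           # stop after this many pages
    "delay": 1,                 # seconds between requests
    "same_domain": True,        # stay within the starting domain
    "follow_external": False,   # do not follow links to other domains
}
```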

URL Patterns

You can filter URLs using patterns:

Include patterns:

  • /blog/* - Only crawl blog pages
  • /products/* - Only crawl product pages

Exclude patterns:

  • /admin/* - Skip admin pages
  • /login - Skip login pages
  • *.pdf - Skip PDF files
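
Patterns like these are shell-style globs matched against the URL path. A minimal matcher could look like the sketch below; the pattern syntax the crawler actually supports may vary (for example, regular expressions instead of globs).

```python
# Sketch of include/exclude matching against URL paths using shell-style
# globs, as in the pattern examples above.
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/blog/*", "/products/*"]
EXCLUDE = ["/admin/*", "/login", "*.pdf"]


def should_crawl(url):
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    # With no include patterns, everything not excluded is allowed.
    return not INCLUDE or any(fnmatch(path, pattern) for pattern in INCLUDE)


print(should_crawl("https://example.com/blog/post-1"))     # True
print(should_crawl("https://example.com/admin/users"))     # False
print(should_crawl("https://example.com/files/spec.pdf"))  # False
```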

Best Practices

Respectful Crawling

  1. Respect robots.txt: Always check and honor robots.txt rules
  2. Rate limiting: Don't overload servers with too many requests
  3. Identify yourself: Use a proper user agent string
  4. Off-peak hours: Schedule large crawls during low-traffic times
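
The first three points can be combined into a single "polite fetch" routine. The sketch below checks robots.txt with Python's standard `urllib.robotparser`, sends an identifying user agent, and pauses between requests; the user agent string and delay value are placeholders, not the bot's real identity.

```python
# Polite fetch: honor robots.txt, identify the crawler, wait between requests.
import time
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot-info)"  # placeholder

_robots_cache = {}


def allowed_by_robots(url):
    """Check robots.txt for the URL's host (cached per host)."""
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in _robots_cache:
        rp = RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except Exception:
            rp = None              # robots.txt unreachable: treat as disallowed
        _robots_cache[host] = rp
    rp = _robots_cache[host]
    return rp is not None and rp.can_fetch(USER_AGENT, url)


def polite_fetch(url, delay=1.0):
    """Fetch a URL only if robots.txt allows it, then pause."""
    if not allowed_by_robots(url):
        return None
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=10) as resp:
        body = resp.read()
    time.sleep(delay)              # rate limiting between requests
    return body
```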

Efficient Crawling

  1. Start focused: Begin with a specific section rather than entire site
  2. Set limits: Use reasonable depth and page limits
  3. Filter content: Exclude irrelevant sections early
  4. Monitor progress: Watch for errors and adjust as needed

Content Quality

  1. Remove navigation: Filter out repeated headers/footers
  2. Extract main content: Focus on the primary page content
  3. Handle dynamic content: Some sites require JavaScript rendering
  4. Check encoding: Ensure proper character encoding
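
As a rough illustration of points 1 and 2, the sketch below drops common boilerplate elements (`nav`, `header`, `footer`, and so on) and keeps the remaining text. Production extraction is usually more sophisticated (readability heuristics, JavaScript rendering), so treat this as a starting point only.

```python
# Strip navigation/boilerplate tags and keep the remaining page text.
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return "\n".join(self.chunks)


extractor = MainTextExtractor()
extractor.feed("<html><nav>Home | About</nav><main><h1>Title</h1>"
               "<p>Body text.</p></main><footer>© Example</footer></html>")
print(extractor.text())   # "Title" and "Body text." only
```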

Common Crawl Scenarios

Documentation Site

Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*

Blog Archive

Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*

Product Catalog

Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*

Understanding Results

Crawl Statistics

After a crawl completes, you'll see:

  • Pages Crawled: Total pages successfully processed
  • Pages Skipped: Pages excluded by filters
  • Errors: Pages that failed to load
  • Time Elapsed: Total crawl duration
  • Content Size: Total indexed content size
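
Internally these numbers are simple counters. The dataclass below illustrates how such a summary might be tracked; the field names and report format are hypothetical, not the bot's actual output.

```python
# Hypothetical crawl statistics tracker mirroring the fields listed above.
from dataclasses import dataclass, field
import time


@dataclass
class CrawlStats:
    pages_crawled: int = 0     # pages successfully processed
    pages_skipped: int = 0     # pages excluded by filters
    errors: int = 0            # pages that failed to load
    content_bytes: int = 0     # total size of indexed content
    started_at: float = field(default_factory=time.time)

    def report(self):
        elapsed = time.time() - self.started_at
        return (f"crawled={self.pages_crawled} skipped={self.pages_skipped} "
                f"errors={self.errors} size={self.content_bytes / 1024:.1f} KiB "
                f"elapsed={elapsed:.0f}s")
```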

Content Index

Crawled content is indexed and available for:

  • Semantic search queries
  • Knowledge base answers
  • Document retrieval
  • AI-powered Q&A

Troubleshooting

Pages Not Crawling

  • Check if URL is accessible (not behind login)
  • Verify robots.txt allows crawling
  • Ensure URL matches include patterns
  • Check for JavaScript-only content

Slow Crawling

  • Increase the delay between requests if errors suggest the server is throttling you
  • Reduce concurrent connections
  • Check network connectivity
  • Monitor server response times

Missing Content

  • Some sites require JavaScript rendering
  • Content may be loaded dynamically via AJAX
  • Check if content is within an iframe
  • Verify content isn't blocked by login wall

Duplicate Content

  • Enable duplicate detection
  • Use canonical URL handling
  • Filter URL parameters that don't change content
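
One common approach combines URL normalization with content hashing, as in the sketch below. The set of ignorable query parameters is only an example; adjust it to the sites you crawl.

```python
# Duplicate handling: normalize URLs (drop parameters that don't change the
# content) and hash page text so identical content is indexed only once.
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}


def normalize_url(url):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))


seen_hashes = set()


def is_duplicate(page_text):
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


print(normalize_url("https://example.com/item?id=7&utm_source=mail"))
# https://example.com/item?id=7
```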

Scheduled Crawling

Set up recurring crawls to keep content fresh:

  • Daily: For frequently updated news/blog sites
  • Weekly: For documentation and knowledge bases
  • Monthly: For stable reference content
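
How you schedule re-crawls depends on your deployment; a cron job or the platform's own scheduler is typical. The sketch below shows the simplest possible version as a plain loop, with `run_crawl` standing in for whatever starts the crawl.

```python
# Simplest possible recurring crawl: run, then sleep until the next interval.
import time

INTERVALS = {"daily": 24 * 3600, "weekly": 7 * 24 * 3600, "monthly": 30 * 24 * 3600}


def schedule_crawl(run_crawl, frequency="weekly"):
    interval = INTERVALS[frequency]
    while True:
        run_crawl()            # e.g. re-crawl the documentation site
        time.sleep(interval)   # wait until the next scheduled run
```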

Legal and Ethical Considerations

Always ensure you have the right to crawl and index content:

  • Check website terms of service
  • Respect copyright and intellectual property
  • Honor robots.txt directives
  • Don't crawl private or restricted content
  • Consider data protection regulations (GDPR, LGPD)

Frequently Asked Questions

Q: How do I crawl a site that requires login? A: The crawler works best with public content. For authenticated content, consider using API integrations instead.

Q: Can I crawl PDF documents? A: Yes, PDFs can be downloaded and processed separately for text extraction.

Q: How often should I re-crawl? A: Depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly.

Q: What happens if a page moves or is deleted? A: The crawler will detect 404 errors and can remove outdated content from the index.

Q: Can I crawl multiple sites at once? A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.

Support

For crawling issues:

  • Review crawl logs for error details
  • Check network and firewall settings
  • Verify target site is accessible
  • Contact your administrator for configuration help