botserver/templates/crawler.gbai/crawler.gbkb/web-crawling-guide.md

Web Crawling Guide

Overview

The Web Crawler bot helps you extract and index content from websites. It automatically navigates through web pages, collects information, and makes it searchable through your knowledge base.

Features

Content Extraction

  • Text Content: Extract readable text from web pages
  • Links: Follow and index linked pages
  • Metadata: Capture page titles, descriptions, and keywords
  • Structured Data: Extract data from tables and lists

Crawl Management

  • Depth Control: Set how many levels of links to follow
  • Domain Restrictions: Limit crawling to specific domains
  • URL Patterns: Include or exclude URLs by pattern
  • Rate Limiting: Control request frequency to avoid overloading servers

Content Processing

  • Duplicate Detection: Avoid indexing the same content twice
  • Content Filtering: Skip irrelevant pages (login, error pages, etc.)
  • Format Conversion: Convert HTML to clean, searchable text
  • Language Detection: Identify content language for proper indexing

How to Use

Starting a Crawl

To start crawling a website (a code sketch of this loop follows the steps):

  1. Provide the starting URL (seed URL)
  2. Configure crawl parameters (depth, limits)
  3. Start the crawl process
  4. Monitor progress and results
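
Conceptually, the crawl is a breadth-first walk of the link graph, bounded by the depth and page limits. The sketch below shows one way such a loop could look using only the Python standard library; the `fetch` helper, the `LinkCollector` parser, and the user agent string are illustrative assumptions, not the bot's actual implementation.

```python
# Minimal sketch of a breadth-first crawl loop with depth, page-count,
# and same-domain limits. Standard library only; illustrative names.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
import time


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url, timeout=10):
    """Fetch a page and return its HTML, or None on error."""
    try:
        req = Request(url, headers={"User-Agent": "ExampleCrawler/0.1"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None


def crawl(seed_url, max_depth=3, max_pages=100, delay=1.0, same_domain=True):
    seed_host = urlparse(seed_url).netloc
    queue = deque([(seed_url, 0)])      # (url, depth) pairs
    seen, pages = {seed_url}, {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                     # failed pages are skipped
        pages[url] = html

        if depth < max_depth:
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link).split("#")[0]
                if same_domain and urlparse(absolute).netloc != seed_host:
                    continue
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        time.sleep(delay)                # simple rate limiting
    return pages
```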

Configuration Options

| Option | Description | Default |
| --- | --- | --- |
| max_depth | How many link levels to follow | 3 |
| max_pages | Maximum pages to crawl | 100 |
| delay | Seconds between requests | 1 |
| same_domain | Stay within the starting domain | true |
| follow_external | Follow links to other domains | false |
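
Taken together, these options might be expressed as a configuration like the one below. The dictionary form and key names simply mirror the table; the crawler's real configuration format may differ.

```python
# Illustrative only: the options above expressed as a Python dict,
# using the documented defaults and an example seed URL.
crawl_config = {
    "start_url": "https://docs.example.com/",  # seed URL
    "max_depth": 3,             # how many link levels to follow
    "max_pages": 100,           # stop after this many pages
    "delay": 1,                 # seconds between requests
    "same_domain": True,        # stay within the starting domain
    "follow_external": False,   # do not follow links to other domains
}
```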

URL Patterns

You can filter URLs using patterns:

Include patterns:

  • /blog/* - Only crawl blog pages
  • /products/* - Only crawl product pages

Exclude patterns:

  • /admin/* - Skip admin pages
  • /login - Skip login pages
  • *.pdf - Skip PDF files
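
Patterns like these are shell-style globs matched against the URL path. A minimal matcher could look like the sketch below; the pattern syntax the crawler actually supports may vary (for example, regular expressions instead of globs).

```python
# Sketch of include/exclude matching against URL paths using shell-style
# globs, as in the pattern examples above.
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/blog/*", "/products/*"]
EXCLUDE = ["/admin/*", "/login", "*.pdf"]


def should_crawl(url):
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    # With no include patterns, everything not excluded is allowed.
    return not INCLUDE or any(fnmatch(path, pattern) for pattern in INCLUDE)


print(should_crawl("https://example.com/blog/post-1"))     # True
print(should_crawl("https://example.com/admin/users"))     # False
print(should_crawl("https://example.com/files/spec.pdf"))  # False
```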

Best Practices

Respectful Crawling

  1. Respect robots.txt: Always check and honor robots.txt rules
  2. Rate limiting: Don't overload servers with too many requests
  3. Identify yourself: Use a proper user agent string
  4. Off-peak hours: Schedule large crawls during low-traffic times
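
The first three points can be combined into a single "polite fetch" routine. The sketch below checks robots.txt with Python's standard `urllib.robotparser`, sends an identifying user agent, and pauses between requests; the user agent string and delay value are placeholders, not the bot's real identity.

```python
# Polite fetch: honor robots.txt, identify the crawler, wait between requests.
import time
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot-info)"  # placeholder

_robots_cache = {}


def allowed_by_robots(url):
    """Check robots.txt for the URL's host (cached per host)."""
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in _robots_cache:
        rp = RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except Exception:
            rp = None              # robots.txt unreachable: treat as disallowed
        _robots_cache[host] = rp
    rp = _robots_cache[host]
    return rp is not None and rp.can_fetch(USER_AGENT, url)


def polite_fetch(url, delay=1.0):
    """Fetch a URL only if robots.txt allows it, then pause."""
    if not allowed_by_robots(url):
        return None
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=10) as resp:
        body = resp.read()
    time.sleep(delay)              # rate limiting between requests
    return body
```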

Efficient Crawling

  1. Start focused: Begin with a specific section rather than entire site
  2. Set limits: Use reasonable depth and page limits
  3. Filter content: Exclude irrelevant sections early
  4. Monitor progress: Watch for errors and adjust as needed

Content Quality

  1. Remove navigation: Filter out repeated headers/footers
  2. Extract main content: Focus on the primary page content
  3. Handle dynamic content: Some sites require JavaScript rendering
  4. Check encoding: Ensure proper character encoding
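
As a rough illustration of points 1 and 2, the sketch below drops common boilerplate elements (`nav`, `header`, `footer`, and so on) and keeps the remaining text. Production extraction is usually more sophisticated (readability heuristics, JavaScript rendering), so treat this as a starting point only.

```python
# Strip navigation/boilerplate tags and keep the remaining page text.
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return "\n".join(self.chunks)


extractor = MainTextExtractor()
extractor.feed("<html><nav>Home | About</nav><main><h1>Title</h1>"
               "<p>Body text.</p></main><footer>© Example</footer></html>")
print(extractor.text())   # "Title" and "Body text." only
```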

Common Crawl Scenarios

Documentation Site

Starting URL: https://docs.example.com/
Depth: 4
Include: /docs/*, /api/*
Exclude: /changelog/*

Blog Archive

Starting URL: https://blog.example.com/
Depth: 2
Include: /posts/*, /articles/*
Exclude: /author/*, /tag/*

Product Catalog

Starting URL: https://shop.example.com/products/
Depth: 3
Include: /products/*, /categories/*
Exclude: /cart/*, /checkout/*

Understanding Results

Crawl Statistics

After a crawl completes, you'll see:

  • Pages Crawled: Total pages successfully processed
  • Pages Skipped: Pages excluded by filters
  • Errors: Pages that failed to load
  • Time Elapsed: Total crawl duration
  • Content Size: Total indexed content size
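
Internally these numbers are simple counters. The dataclass below illustrates how such a summary might be tracked; the field names and report format are hypothetical, not the bot's actual output.

```python
# Hypothetical crawl statistics tracker mirroring the fields listed above.
from dataclasses import dataclass, field
import time


@dataclass
class CrawlStats:
    pages_crawled: int = 0     # pages successfully processed
    pages_skipped: int = 0     # pages excluded by filters
    errors: int = 0            # pages that failed to load
    content_bytes: int = 0     # total size of indexed content
    started_at: float = field(default_factory=time.time)

    def report(self):
        elapsed = time.time() - self.started_at
        return (f"crawled={self.pages_crawled} skipped={self.pages_skipped} "
                f"errors={self.errors} size={self.content_bytes / 1024:.1f} KiB "
                f"elapsed={elapsed:.0f}s")
```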

Content Index

Crawled content is indexed and available for:

  • Semantic search queries
  • Knowledge base answers
  • Document retrieval
  • AI-powered Q&A

Troubleshooting

Pages Not Crawling

  • Check if URL is accessible (not behind login)
  • Verify robots.txt allows crawling
  • Ensure URL matches include patterns
  • Check for JavaScript-only content

Slow Crawling

  • Increase the delay between requests if errors suggest the server is throttling you
  • Reduce concurrent connections
  • Check network connectivity
  • Monitor server response times

Missing Content

  • Some sites require JavaScript rendering
  • Content may be loaded dynamically via AJAX
  • Check if content is within an iframe
  • Verify content isn't blocked by login wall

Duplicate Content

  • Enable duplicate detection
  • Use canonical URL handling
  • Filter URL parameters that don't change content
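
One common approach combines URL normalization with content hashing, as in the sketch below. The set of ignorable query parameters is only an example; adjust it to the sites you crawl.

```python
# Duplicate handling: normalize URLs (drop parameters that don't change the
# content) and hash page text so identical content is indexed only once.
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}


def normalize_url(url):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))


seen_hashes = set()


def is_duplicate(page_text):
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


print(normalize_url("https://example.com/item?id=7&utm_source=mail"))
# https://example.com/item?id=7
```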

Scheduled Crawling

Set up recurring crawls to keep content fresh:

  • Daily: For frequently updated news/blog sites
  • Weekly: For documentation and knowledge bases
  • Monthly: For stable reference content
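
How you schedule re-crawls depends on your deployment; a cron job or the platform's own scheduler is typical. The sketch below shows the simplest possible version as a plain loop, with `run_crawl` standing in for whatever starts the crawl.

```python
# Simplest possible recurring crawl: run, then sleep until the next interval.
import time

INTERVALS = {"daily": 24 * 3600, "weekly": 7 * 24 * 3600, "monthly": 30 * 24 * 3600}


def schedule_crawl(run_crawl, frequency="weekly"):
    interval = INTERVALS[frequency]
    while True:
        run_crawl()            # e.g. re-crawl the documentation site
        time.sleep(interval)   # wait until the next scheduled run
```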

Legal and Ethical Considerations

Always ensure you have the right to crawl and index content:

  • Check website terms of service
  • Respect copyright and intellectual property
  • Honor robots.txt directives
  • Don't crawl private or restricted content
  • Consider data protection regulations (GDPR, LGPD)

Frequently Asked Questions

Q: How do I crawl a site that requires login? A: The crawler works best with public content. For authenticated content, consider using API integrations instead.

Q: Can I crawl PDF documents? A: Yes, PDFs can be downloaded and processed separately for text extraction.

Q: How often should I re-crawl? A: Depends on how frequently the site updates. News sites may need daily crawls; documentation might only need weekly or monthly.

Q: What happens if a page moves or is deleted? A: The crawler will detect 404 errors and can remove outdated content from the index.

Q: Can I crawl multiple sites at once? A: Yes, you can configure multiple seed URLs and the crawler will process them in sequence.

Support

For crawling issues:

  • Review crawl logs for error details
  • Check network and firewall settings
  • Verify target site is accessible
  • Contact your administrator for configuration help