Crawl4Claude
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
Documentation Scraper & MCP Server
🚀 Features
Core Functionality
- 🌐 Universal Documentation Scraper: Works with any documentation website
- 📊 Structured Database: SQLite database with full-text search capabilities
- 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
- 📝 LLM-Optimized Output: Ready-to-use context files for AI applications
- ⚙️ Configuration-Driven: Single config file controls all settings
Advanced Tools
- 🔍 Query Interface: Command-line tool for searching and analyzing scraped content
- 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
- 📋 Auto-Configuration: Automatic MCP setup file generation
- 📈 Progress Tracking: Detailed logging and error handling
- 💾 Resumable Crawls: Smart caching for interrupted crawls
📋 Prerequisites
- Python 3.8 or higher
- Internet connection
- ~500MB free disk space per documentation site
🛠️ Quick Start
1. Installation
# Clone the repository
git clone <repository-url>
cd documentation-scraper
# Install dependencies
pip install -r requirements.txt
2. Configure Your Target
Edit config.py to set your documentation site:
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}
3. Run the Scraper
python docs_scraper.py
4. Query Your Documentation
# Search for content
python query_docs.py --search "tutorial"
# Browse by section
python query_docs.py --section "getting-started"
# Get statistics
python query_docs.py --stats
5. Set Up Claude Integration
# Generate MCP configuration files
python utils/gen_mcp.py
# Follow the instructions to add to Claude Desktop
🏗️ Project Structure
📁 documentation-scraper/
├── 📄 config.py # Central configuration file
├── 🕷️ docs_scraper.py # Main scraper script
├── 🔍 query_docs.py # Query and analysis tool
├── 🤖 mcp_docs_server.py # MCP server for Claude integration
├── 📋 requirements.txt # Python dependencies
├── 📁 utils/ # Debug and utility tools
│ ├── 🛠️ gen_mcp.py # Generate MCP config files
│ ├── 🧪 debug_scraper.py # Test scraper functionality
│ ├── 🔧 debug_mcp_server.py # Debug MCP server
│ ├── 🎯 debug_mcp_client.py # Test MCP tools directly
│ ├── 📡 debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│ └── 🌐 debug_site_content.py # Debug content extraction
├── 📁 docs_db/ # Generated documentation database
│ ├── 📊 documentation.db # SQLite database
│ ├── 📄 documentation.json # JSON export
│ ├── 📋 scrape_summary.json # Statistics
│ └── 📁 llm_context/ # LLM-ready context files
└── 📁 mcp/ # Generated MCP configuration
    ├── 🔧 run_mcp_server.bat # Windows launcher script
    └── ⚙️ claude_mcp_config.json # Claude Desktop config
⚙️ Configuration
Main Configuration (config.py)
The entire system is controlled by a single configuration file:
# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
Environment Overrides
You can override any setting with environment variables:
export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
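As a rough sketch of how such overrides can be resolved (the helper functions below are illustrative, not part of the project), the idea is simply to prefer environment variables over the values in config.py:

import os

from config import SCRAPER_CONFIG  # the project's central configuration

def resolve_db_path() -> str:
    # Environment variable wins; otherwise fall back to the configured output directory
    default = os.path.join(SCRAPER_CONFIG["output_dir"], "documentation.db")
    return os.environ.get("DOCS_DB_PATH", default)

def resolve_base_url() -> str:
    return os.environ.get("DOCS_BASE_URL", SCRAPER_CONFIG["base_url"])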
🤖 Claude Desktop Integration
Automatic Setup
1. Generate configuration files:
   python utils/gen_mcp.py
2. Copy the generated config to Claude Desktop:
   - Windows: %APPDATA%\Claude\claude_desktop_config.json
   - macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
3. Restart Claude Desktop
Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
Available MCP Tools
Once connected, Claude can use these tools:
- 🔍 search_documentation: Search for content across all documentation (a conceptual sketch follows this list)
- 📚 get_documentation_sections: List all available sections
- 📄 get_page_content: Get full content of specific pages
- 🗂️ browse_section: Browse pages within a section
- 📊 get_documentation_stats: Get database statistics
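Conceptually, each of these tools is a thin wrapper around the SQLite database. The sketch below is not the server's actual code, and it assumes the FTS table's rowid maps back to pages.id, but it shows what search_documentation boils down to:

import sqlite3

def search_documentation(db_path, query, limit=10):
    # Conceptual sketch: run an FTS5 match and join back to the pages table
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT p.title, p.url, p.section
            FROM pages_fts
            JOIN pages p ON p.id = pages_fts.rowid
            WHERE pages_fts MATCH ?
            LIMIT ?
            """,
            (query, limit),
        ).fetchall()
        return [{"title": t, "url": u, "section": s} for t, u, s in rows]
    finally:
        conn.close()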
🔧 Command Line Tools
Documentation Scraper
# Basic scraping
python docs_scraper.py
# All behavior is controlled by config.py; edit it (or use the
# environment overrides described above) and re-run
python docs_scraper.py
Query Tool
# Search for content
python query_docs.py --search "authentication guide"
# Browse specific sections
python query_docs.py --section "api-reference"
# Get database statistics
python query_docs.py --stats
# List all sections
python query_docs.py --list-sections
# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md
# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"
Debug Tools
# Test scraper functionality
python utils/debug_scraper.py
# Test MCP server
python utils/debug_mcp_server.py
# Test MCP tools directly
python utils/debug_mcp_client.py
# Test MCP protocol
python utils/debug_mcp_server_protocol.py
# Debug content extraction
python utils/debug_site_content.py
# Generate MCP config files
python utils/gen_mcp.py
📊 Database Schema
Pages Table
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);
Full-Text Search
-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

# Or use the query tool
python query_docs.py --search "your search term"
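From Python you can also ask FTS5 for highlighted excerpts via its built-in snippet() function; passing -1 as the column lets SQLite pick the best-matching column, so no assumption about the FTS schema is needed:

import sqlite3

conn = sqlite3.connect('docs_db/documentation.db')
# snippet(table, column, start_mark, end_mark, ellipsis, max_tokens)
rows = conn.execute(
    "SELECT snippet(pages_fts, -1, '[', ']', '...', 12) "
    "FROM pages_fts WHERE pages_fts MATCH ? LIMIT 5",
    ("your search term",),
).fetchall()
for (excerpt,) in rows:
    print(excerpt)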
🎯 Example Use Cases
1. Documentation Analysis
# Get overview of documentation
python query_docs.py --stats
# Find all tutorial content
python query_docs.py --search "tutorial guide example"
# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md
2. AI Integration with Claude
# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"
3. Custom Applications
import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials'
      AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
🔍 Debugging and Testing
Test Scraper Before Full Run
python utils/debug_scraper.py
Validate Content Extraction
python utils/debug_site_content.py
Test MCP Integration
# Test server functionality
python utils/debug_mcp_server.py
# Test tools directly
python utils/debug_mcp_client.py
# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py
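For reference, the protocol-level test exchanges plain JSON-RPC messages with the server over stdio. An illustrative request (the debug script constructs its own messages) asking the server to list its tools looks like this:

import json

# Minimal MCP JSON-RPC request: ask the server which tools it exposes
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}
print(json.dumps(list_tools_request))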
📈 Performance and Optimization
Scraping Performance
- Start small: Use max_pages=50 for testing
- Adjust depth: max_depth=2 covers most content efficiently
- Rate limiting: Increase delay_between_requests if getting blocked
- Caching: Enabled by default for resumable crawls (a sketch of the idea follows this list)
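A conceptual sketch of what resumable crawling means in practice (the project's actual cache format may differ): persist the set of already-scraped URLs so a restarted crawl skips them.

import json
import os

CACHE_FILE = "docs_db/visited_urls.json"  # hypothetical cache location

def load_visited():
    # URLs scraped by a previous, possibly interrupted, run
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return set(json.load(f))
    return set()

def mark_visited(visited, url):
    # Record each page as soon as it is stored, so progress survives a crash
    visited.add(url)
    with open(CACHE_FILE, "w") as f:
        json.dump(sorted(visited), f)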
Database Performance
- Full-text search: Automatic FTS5 index for fast searching
- Indexing: Optimized indexes on URL and section columns (see the check after this list)
- Word counts: Pre-calculated for quick statistics
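To check that a lookup actually uses one of these indexes rather than scanning the whole table, SQLite's EXPLAIN QUERY PLAN is enough (the index names it reports depend on how the database was built):

import sqlite3

conn = sqlite3.connect('docs_db/documentation.db')
# Each row describes one step of the query plan; look for "USING INDEX"
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM pages WHERE section = ?",
    ("tutorials",),
).fetchall()
for row in plan:
    print(row)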
MCP Performance
- Configurable limits: Set appropriate search and section limits
- Snippet length: Adjust snippet size for optimal response times
- Connection pooling: Efficient database connections
🌐 Supported Documentation Sites
This scraper works with most documentation websites, including:
- Static sites: Hugo, Jekyll, MkDocs, Docusaurus
- Documentation platforms: GitBook, Notion, Confluence
- API docs: Swagger/OpenAPI documentation
- Wiki-style: MediaWiki, TiddlyWiki
- Custom sites: Any site with consistent HTML structure
Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',   # Skip API endpoint docs
        r'/edit/',  # Skip edit pages
        r'\.pdf$',  # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
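Conceptually, remove_patterns are just regular expressions stripped from the extracted text before it is stored. A minimal sketch of that step (the scraper's real implementation may differ):

import re

def clean_content(text, remove_patterns):
    # Strip boilerplate such as "Edit this page" links before storing the text
    for pattern in remove_patterns:
        text = re.sub(pattern, "", text)
    return text

# e.g. clean_content(page_markdown, CONTENT_FILTER_CONFIG["remove_patterns"])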
🤝 Contributing
We welcome contributions! Here are some areas where you can help:
- New export formats: PDF, EPUB, Word documents
- Enhanced content filtering: Better noise removal
- Additional debug tools: More comprehensive testing
- Documentation: Improve guides and examples
- Performance optimizations: Faster scraping and querying
⚠️ Responsible Usage
- Respect robots.txt: Check the target site's robots.txt file (see the snippet after this list)
- Rate limiting: Use appropriate delays between requests
- Terms of service: Respect the documentation site's terms
- Fair use: Use for educational, research, or personal purposes
- Attribution: Credit the original documentation source
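Checking robots.txt takes only a few lines with the standard library, for example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")  # your target site
rp.read()
# Only crawl a page if the site's robots.txt allows it
print(rp.can_fetch("*", "https://docs.example.com/getting-started/"))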
📄 License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
🎉 Getting Started Examples
Example 1: Scrape Python Documentation
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}
Example 2: Scrape API Documentation
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}
Example 3: Corporate Documentation
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}
MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}
Happy Documenting! 📚✨
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
🙏 Attribution
This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀
Check out Crawl4AI:
- Repository: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com
- Discord Community: https://discord.gg/jP8KfhDhyN