# 🔍 Web Analyzer MCP
A powerful MCP (Model Context Protocol) server for intelligent web content analysis and summarization. Built with FastMCP, this server provides smart web scraping, content extraction, and AI-powered question-answering capabilities.
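For orientation, this is roughly what a FastMCP tool registration looks like; a minimal sketch assuming the published `fastmcp` package, not the project's actual `server.py`:

```python
# Minimal sketch of a FastMCP server exposing one tool (illustrative only;
# the real tools live in web_analyzer_mcp/server.py).
from fastmcp import FastMCP

mcp = FastMCP("web-analyzer")

@mcp.tool()
def url_to_markdown(url: str) -> str:
    """Extract and summarize key web page content as markdown."""
    ...  # extraction pipeline (see Architecture below)

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```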
## ✨ Features
### 🎯 Core Tools
- **`url_to_markdown`** - Extract and summarize key web page content
  - Analyzes content importance using custom algorithms
  - Removes ads, navigation, and irrelevant content
  - Keeps only essential information (tables, images, key text)
  - Outputs structured markdown optimized for analysis
- **`web_content_qna`** - AI-powered Q&A about web content
  - Extracts relevant content sections from web pages
  - Uses intelligent chunking and relevance matching
  - Answers questions using OpenAI GPT models
### 🚀 Key Features
- **Smart Content Ranking**: Algorithm-based content importance scoring
- **Essential Content Only**: Removes clutter, keeps what matters
- **Multi-IDE Support**: Works with Claude Desktop, Cursor, VS Code, PyCharm
- **Flexible Models**: Choose from GPT-3.5, GPT-4, GPT-4 Turbo, or GPT-5
## 📦 Installation
### Prerequisites
- uv (Python package manager)
- Chrome/Chromium browser (for Selenium)
- OpenAI API key (for Q&A functionality)
### 🚀 Quick Start with uv (Recommended)
```bash
# Clone the repository
git clone https://github.com/kimdonghwi94/web-analyzer-mcp.git
cd web-analyzer-mcp

# Run directly with uv (auto-installs dependencies)
uv run mcp-webanalyzer
```
### Installing via Smithery
To install web-analyzer-mcp for Claude Desktop automatically via Smithery:
```bash
npx -y @smithery/cli install @kimdonghwi94/web-analyzer-mcp --client claude
```
### IDE/Editor Integration
#### Install Claude Desktop
Add to your `claude_desktop_config.json` file. See the Claude Desktop MCP documentation for more details.
```json
{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run",
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}
```
#### Install Claude Code (VS Code Extension)
Add the server using the Claude Code CLI:
```bash
claude mcp add web-analyzer -e OPENAI_API_KEY=your_api_key_here -e OPENAI_MODEL=gpt-4 -- uv --directory /path/to/web-analyzer-mcp run mcp-webanalyzer
```
#### Install Cursor IDE
Add to your Cursor settings (File > Preferences > Settings > Extensions > MCP):
```json
{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run",
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}
```
#### Install JetBrains AI Assistant
See the JetBrains AI Assistant documentation for more details.
- In JetBrains IDEs, go to Settings → Tools → AI Assistant → Model Context Protocol (MCP)
- Click + Add
- Click Command in the top-left corner of the dialog and select the As JSON option from the list
- Add this configuration and click OK:
```json
{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run",
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}
```
## 🎛️ Tool Descriptions
### `url_to_markdown`
Converts web pages to clean markdown format with essential content extraction.

**Parameters:**
- `url` (string): The web page URL to analyze

**Returns:** Clean markdown content with structured data preservation

### `web_content_qna`
Answers questions about web page content using intelligent content analysis.

**Parameters:**
- `url` (string): The web page URL to analyze
- `question` (string): Question about the page content

**Returns:** AI-generated answer based on page content
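A minimal client-side sketch of invoking both tools, assuming FastMCP's `Client` connecting to the server script over stdio; the URL and question are placeholders:

```python
# Hypothetical usage sketch: call both tools from a FastMCP client.
# Assumes the fastmcp package; script path and inputs are placeholders.
import asyncio
from fastmcp import Client

async def main() -> None:
    async with Client("web_analyzer_mcp/server.py") as client:
        markdown = await client.call_tool(
            "url_to_markdown", {"url": "https://example.com"}
        )
        answer = await client.call_tool(
            "web_content_qna",
            {"url": "https://example.com", "question": "What is this page about?"},
        )
        print(markdown, answer, sep="\n\n")

asyncio.run(main())
```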
## 🏗️ Architecture
### Content Extraction Pipeline
- **URL Validation** - Ensures proper URL format
- **HTML Fetching** - Uses Selenium for dynamic content
- **Content Parsing** - BeautifulSoup for HTML processing
- **Element Scoring** - Custom algorithm ranks content importance (see the sketch after this list)
- **Content Filtering** - Removes duplicates and low-value content
- **Markdown Conversion** - Structured output generation
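As a rough illustration of the scoring and filtering steps above, here is a minimal sketch assuming BeautifulSoup; the tag weights and threshold are invented, not the project's actual algorithm (which lives in `web_analyzer_mcp/web_extractor.py`):

```python
# Illustrative content-scoring sketch (not the project's real algorithm).
from bs4 import BeautifulSoup

NOISE_TAGS = ["nav", "aside", "footer", "script", "style", "form"]

def score_element(el) -> float:
    """Crude importance score: favor text-dense elements, tables, images."""
    score = len(el.get_text(strip=True)) / 100.0  # reward text density
    if el.name == "table":
        score += 5.0  # tables count as essential structured data
    if el.find("img") is not None:
        score += 1.0  # keep elements that carry images
    return score

def extract_main_content(html: str, threshold: float = 1.0) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()  # drop ads/navigation-style clutter outright
    kept = []
    for el in soup.find_all(["p", "table", "h1", "h2", "h3", "ul", "ol"]):
        if score_element(el) >= threshold:
            kept.append(el.get_text(" ", strip=True))
    return kept
```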
### Q&A Processing Pipeline
- **Content Chunking** - Intelligent text segmentation
- **Relevance Scoring** - Matches content to questions
- **Context Selection** - Picks most relevant chunks
- **Answer Generation** - OpenAI GPT integration (sketched below)
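A minimal sketch of the four Q&A steps, assuming the `openai` v1 client; the chunk size, overlap-based relevance, and prompt are invented placeholders, not the code in `rag_processor.py`:

```python
# Illustrative Q&A pipeline sketch (invented helpers, not the real module).
# Assumes OPENAI_API_KEY is set in the environment.
import os
from openai import OpenAI

def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Step 1: naive fixed-size segmentation."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Steps 2-3: rank chunks by question-word overlap, keep the best k."""
    words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(words & set(c.lower().split())),
        reverse=True,
    )[:k]

def answer(page_text: str, question: str) -> str:
    """Step 4: ask an OpenAI GPT model, grounded in the selected chunks."""
    context = "\n---\n".join(top_chunks(chunk_text(page_text), question))
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=os.getenv("OPENAI_MODEL", "gpt-4"),
        messages=[
            {"role": "system", "content": "Answer using only the given context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```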
## 🏗️ Project Structure
```text
web-analyzer-mcp/
├── web_analyzer_mcp/        # Main Python package
│   ├── __init__.py          # Package initialization
│   ├── server.py            # FastMCP server with tools
│   ├── web_extractor.py     # Web content extraction engine
│   └── rag_processor.py     # RAG-based Q&A processor
├── scripts/                 # Build and utility scripts
│   └── build.js             # Node.js build script
├── README.md                # English documentation
├── README.ko.md             # Korean documentation
├── package.json             # npm configuration and scripts
├── pyproject.toml           # Python package configuration
├── .env.example             # Environment variables template
└── dist-info.json           # Build information (generated)
```
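The `.env.example` template presumably holds the same two variables used in the IDE configurations above; a plausible sketch:

```bash
# Plausible .env contents (same variables as the IDE configs above)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4
```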
## 🛠️ Development
### Modern Development with uv
```bash
# Clone repository
git clone https://github.com/kimdonghwi94/web-analyzer-mcp.git
cd web-analyzer-mcp

# Development commands
uv run mcp-webanalyzer    # Start development server
uv run python -m pytest   # Run tests
uv run ruff check .       # Lint code
uv run ruff format .      # Format code
uv sync                   # Sync dependencies

# Install development dependencies
uv add --dev pytest ruff mypy

# Create production build
npm run build
```
### Alternative: Traditional Python Development
```bash
# Set up Python environment (if not using uv)
pip install -e ".[dev]"

# Development commands
python -m web_analyzer_mcp.server   # Start server
python -m pytest tests/             # Run tests
python -m ruff check .              # Lint code
python -m ruff format .             # Format code
python -m mypy web_analyzer_mcp/    # Type checking
```
## 🤝 Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## 📋 Roadmap
- Support for more content types (PDFs, videos)
- Multi-language content extraction
- Custom extraction rules
- Caching for frequently accessed content
- Webhook support for real-time updates
## ⚠️ Limitations
- Requires Chrome/Chromium for JavaScript-heavy sites
- OpenAI API key needed for Q&A functionality
- Rate limited to prevent abuse
- Some sites may block automated access
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙋‍♂️ Support
- Create an issue for bug reports or feature requests
- Contribute to discussions in the GitHub repository
- Check the documentation for detailed guides
## 🌟 Acknowledgments
- Built with the FastMCP framework
- Inspired by HTMLRAG techniques for web content processing
- Thanks to the MCP community for feedback and contributions
Made with ❤️ for the MCP community