Web-curl MCP Server
A powerful tool for fetching and extracting text content from web pages and APIs, supporting web scraping, REST API requests, and Google Custom Search integration.
Google Custom Search API
Google Custom Search API is free with usage limits (e.g., 100 queries per day for free, with additional queries requiring payment). For full details on quotas, pricing, and restrictions, see the official documentation.
Web-curl

Developed by Rayss
Open Source Project
Built with Node.js & TypeScript (Node.js v18+ required)
Demo Video
Click here to watch the demo video directly in your browser.
If your platform supports it, you can also download and play demo/demo_1.mp4 directly.
Table of Contents
- Changelog / Update History
- Overview
- Features
- Architecture
- Installation
- Usage
- CLI Usage
- MCP Server Usage
- Configuration
- Examples
- Troubleshooting
- Tips & Best Practices
- Contributing & Issues
- License & Attribution
Changelog / Update History
See CHANGELOG.md for a complete history of updates and new features.
Overview
Web-curl is a powerful tool for fetching and extracting text content from web pages and APIs. Use it as a standalone CLI or as an MCP (Model Context Protocol) server. Web-curl leverages Puppeteer for robust web scraping and supports advanced features such as resource blocking, custom headers, authentication, and Google Custom Search.
Features
Deep Research & Automation (v1.4.2)
- Advanced Browser Automation: Full control over Chromium via Puppeteer (click, type, scroll, hover, key presses).
- Always-On Session Persistence: Browser profiles are now always persistent. Login sessions, cookies, and cache are automatically saved in a local user_data/ directory.
- Token-Efficient Snapshots:
  - Accessibility Tree: Clean, structured snapshots instead of messy HTML.
  - HTML Slice Mode: Raw HTML with startIndex/endIndex for safe chunking when needed.
  - Viewport Filtering: Automatically filters out elements not visible on screen, saving up to 90% of context tokens on long pages.
- Chrome DevTools Integration (implemented, but hidden from list_tools):
  - Network Monitoring (browser_network_requests)
  - Console Logs (browser_console_messages)
- Parallel Search:
  - multi_search: Run multiple Google searches at once (only exposed search tool).
- Intelligent Resource Management:
  - Idle Auto-Close: Browser automatically shuts down after 15 minutes of inactivity to save RAM/CPU.
  - Tab Rotation: Automatically replaces the oldest tab when the 10-tab limit is reached.
- Media & Documents:
  - Full-Page Screenshots: Capture high-quality screenshots with a 5-day auto-cleanup lifecycle and custom destination support.
  - Document Parsing: Extract text from PDF and DOCX files directly from URLs.
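The tab rotation rule above (replace the oldest tab once the 10-tab limit is reached) can be sketched as a small first-in, first-out pool. This is an illustration only: the TabPool class and the Tab interface are hypothetical names, not taken from the Web-curl source, and the real implementation may track tabs differently.

```typescript
// Hypothetical sketch of tab rotation; names are illustrative, not Web-curl's.
interface Tab {
  id: string;
  close(): void;
}

class TabPool {
  private tabs: Tab[] = []; // oldest tab first
  private readonly maxTabs: number;

  constructor(maxTabs: number = 10) {
    this.maxTabs = maxTabs;
  }

  // Register a new tab. When the pool is full, the oldest tab is closed
  // and removed first; it is returned so callers can log the eviction.
  open(tab: Tab): Tab | undefined {
    let evicted: Tab | undefined;
    if (this.tabs.length >= this.maxTabs) {
      evicted = this.tabs.shift();
      evicted?.close();
    }
    this.tabs.push(tab);
    return evicted;
  }

  // IDs of the open tabs, oldest first.
  ids(): string[] {
    return this.tabs.map((tab) => tab.id);
  }
}
```

With a limit of 3, opening tabs a, b, c, d in order would leave b, c, d open and close a.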
Storage & Download Details
- Error log rotation: logs/error-log.txt is rotated when it exceeds ~1MB (renamed to error-log.txt.bak) to prevent unbounded growth.
- Logs & temp cleanup: old temporary files in the logs/ directory are cleaned up at startup.
- Browser lifecycle: Puppeteer browser instances are closed in finally blocks to avoid Chromium temp file leaks.
- Content extraction:
  - Returns raw text, HTML, and Readability "main article" when available. Readability attempts to extract the primary content of a webpage, removing headers, footers, sidebars, and other non-essential elements, providing a cleaner, more focused text.
  - Readability output is subject to startIndex/maxLength/chunkSize slicing when requested.
- Resource blocking: blockResources is now always forced to false, meaning resources are never blocked for faster page loads.
- Timeout control: navigation and API request timeouts are configurable via tool arguments.
- Output: results can be printed to stdout or written to a file via CLI options.
- Download behavior (download_file):
  - destinationFolder accepts relative paths (resolved against the project root) or absolute paths.
  - The server creates destinationFolder if it does not exist.
  - Downloads are streamed using Node streams + pipeline to minimize memory use and ensure robust writes.
  - Filenames are derived from the URL path (e.g., https://.../path/file.jpg -> file.jpg). If no filename is present, the fallback name is downloaded_file.
  - Overwrite semantics: by default the implementation will overwrite an existing file with the same name.
- Usage modes: CLI and MCP server (stdin/stdout transport).
- REST client: fetch_api returns JSON/text when appropriate and base64 for binary responses.
- Google Custom Search: requires APIKEY_GOOGLE_SEARCH and CX_GOOGLE_SEARCH.
- Smart command:
  - Auto language detection (franc-min) and optional translation (dynamic translate import).
  - Query enrichment is heuristic-based; results depend on the detected intent.
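The filename rule for download_file described above (take the last segment of the URL path, otherwise fall back to downloaded_file) could look roughly like the following. The function name filenameFromUrl is hypothetical, and details such as percent-decoding or sanitizing are deliberately left out because this README does not describe them.

```typescript
// Hypothetical helper, not taken from the Web-curl source.
// Derives a download filename from the URL path, ignoring query and hash.
function filenameFromUrl(rawUrl: string): string {
  const segments = new URL(rawUrl).pathname
    .split("/")
    .filter((segment) => segment.length > 0);
  // pop() yields undefined when the path has no segments (e.g. "/").
  return segments.pop() ?? "downloaded_file";
}
```

Under this sketch, a URL ending in /path/file.jpg maps to file.jpg, and a bare host with no path maps to downloaded_file.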
Architecture
This section outlines the high-level architecture of Web-curl.
graph TD
A[User/MCP Host] --> B(CLI / MCP Server)
B --> C{Tool Handlers}
C -- browser_flow --> D["Puppeteer (Web Scraping)"]
C -- fetch_api --> E["REST Client"]
C -- multi_search --> F["Google Custom Search API"]
C -- parse_document --> G["Document Parser (PDF/DOCX)"]
C -- download_file --> H["File System (Downloads)"]
D --> I["Web Content"]
E --> J["External APIs"]
F --> K["Google Search Results"]
H --> L["Local Storage"]
- CLI & MCP Server: src/index.ts implements both the CLI entry point and the MCP server.
- Web Scraping: Uses Puppeteer for headless browsing and content extraction.
- REST Client: src/rest-client.ts provides a flexible HTTP client for API requests.
MCP Server Configuration Example
To integrate web-curl as an MCP server, add the following configuration to your mcp_settings.json:
{
  "mcpServers": {
    "web-curl": {
      "command": "node",
      "args": [
        "build/index.js"
      ],
      "disabled": false,
      "alwaysAllow": [
        "browser_flow",
        "browser_configure",
        "browser_close",
        "multi_search",
        "fetch_api",
        "download_file",
        "parse_document"
      ],
      "env": {
        "APIKEY_GOOGLE_SEARCH": "YOUR_GOOGLE_API_KEY",
        "CX_GOOGLE_SEARCH": "YOUR_CX_ID"
      }
    }
  }
}
How to Obtain Google API Key and CX
- Get a Google API Key:
  - Go to Google Cloud Console.
  - Create/select a project, then go to APIs & Services > Credentials.
  - Click Create Credentials > API key and copy it.
- Get a Custom Search Engine (CX) ID:
  - Go to Google Custom Search Engine.
  - Create/select a search engine, then copy the Search engine ID (CX).
- Enable Custom Search API:
  - In Google Cloud Console, go to APIs & Services > Library.
  - Search for Custom Search API and enable it.
Replace YOUR_GOOGLE_API_KEY and YOUR_CX_ID in the config above.
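Once both values are in place, a quick way to sanity-check them is to build a request against the public Custom Search JSON API, which takes key, cx, and q as query parameters on https://www.googleapis.com/customsearch/v1. The sketch below only constructs the URL. The helper name buildSearchUrl is hypothetical, and how multi_search builds its requests internally is not shown in this README, so treat this as an assumption rather than the server's actual code.

```typescript
// Hypothetical helper: builds a Custom Search JSON API URL from the two
// environment variables this README asks for. Not Web-curl's own code.
function buildSearchUrl(
  query: string,
  env: Record<string, string | undefined>
): string {
  const key = env["APIKEY_GOOGLE_SEARCH"];
  const cx = env["CX_GOOGLE_SEARCH"];
  if (!key || !cx) {
    throw new Error("APIKEY_GOOGLE_SEARCH and CX_GOOGLE_SEARCH must both be set");
  }
  const url = new URL("https://www.googleapis.com/customsearch/v1");
  url.searchParams.set("key", key);
  url.searchParams.set("cx", cx);
  url.searchParams.set("q", query);
  return url.toString();
}

// Typical use in Node.js: buildSearchUrl("web scraping", process.env)
```

Failing fast when either variable is missing gives a clearer error than a rejected request from the API.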
Installation
# Clone the repository
git clone https://github.com/rayss868/MCP-Web-Curl
cd MCP-Web-Curl
# Install dependencies
npm install
# Build the project
npm run build
- Prerequisites: Ensure you have Node.js (v18+) and Git installed on your system.
Puppeteer installation notes
- Windows: Just run npm install.
- Linux / Ubuntu Server: You must install extra dependencies for Chromium to handle rendering and screenshots in a headless environment. Run:
sudo apt-get update && sudo apt-get install -y \
  fonts-liberation libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 \
  libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 \
  libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 \
  libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 \
  libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget \
  xdg-utils
For more details, see the Puppeteer troubleshooting guide.
Usage
CLI Usage
The CLI supports fetching and extracting text content from web pages.
# Basic usage
node build/index.js https://example.com
# With options
node build/index.js --timeout 30000 https://example.com
# Save output to a file
node build/index.js -o result.json https://example.com
Command Line Options
- --timeout <ms>: Set navigation timeout (default: 60000)
- -o <file>: Output result to specified file
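For readers wiring the CLI into scripts, the two options above can be modeled with a small argument parser. This is a minimal sketch under the assumption that any other argument is the target URL. The names parseCliArgs and CliOptions are hypothetical, and the actual parsing in src/index.ts may differ.

```typescript
// Hypothetical model of the documented CLI options; not Web-curl's parser.
interface CliOptions {
  url?: string;
  timeout: number; // milliseconds, documented default is 60000
  output?: string; // file path given with -o
}

function parseCliArgs(argv: string[]): CliOptions {
  const options: CliOptions = { timeout: 60000 };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--timeout") {
      const value = Number(argv[++i]);
      if (!Number.isFinite(value) || value <= 0) {
        throw new Error("--timeout expects a positive number of milliseconds");
      }
      options.timeout = value;
    } else if (arg === "-o") {
      const file = argv[++i];
      if (!file) throw new Error("-o expects a file path");
      options.output = file;
    } else {
      options.url = arg; // assumption: any other argument is the URL
    }
  }
  return options;
}
```

In a Node.js entry point this would typically be called as parseCliArgs(process.argv.slice(2)).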
MCP Server Usage
Web-curl can be run as an MCP server for integration with Roo Context or other MCP-compatible environments.
Exposed Tools (v1.4.2)
Only the tools below are exposed via list_tools to reduce tool-chaining in agent clients.
- browser_flow: One-call browser workflow (optional navigate -> optional actions -> return ONE result).
- browser_configure: Set proxy/user-agent/viewport (session persistence is always on via user_data/).
- browser_close: Close browser and tabs (also auto-closes after 15 minutes of inactivity).
- multi_search: Run multiple Google searches in parallel (the only exposed search entrypoint).
- fetch_api: REST API request with response truncation (limit).
- download_file: Download a file from a URL.
- parse_document: Extract text from PDF/DOCX URLs.
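The 15-minute idle auto-close mentioned for browser_close can be pictured as a tracker that records the last activity time and reports when the idle window has elapsed. The IdleTracker class below is a hypothetical illustration with an injected clock value so it can be tested without real timers; it is not how Web-curl necessarily implements the behavior.

```typescript
// Hypothetical idle tracker; times are passed in as epoch milliseconds.
class IdleTracker {
  private lastActivity: number;
  private readonly idleLimitMs: number;

  constructor(nowMs: number, idleLimitMs: number = 15 * 60 * 1000) {
    this.lastActivity = nowMs;
    this.idleLimitMs = idleLimitMs;
  }

  // Call on every tool invocation to restart the idle window.
  touch(nowMs: number): void {
    this.lastActivity = nowMs;
  }

  // True once the full idle window has elapsed since the last activity.
  shouldClose(nowMs: number): boolean {
    return nowMs - this.lastActivity >= this.idleLimitMs;
  }
}
```

A server could poll shouldClose(Date.now()) on an interval and close the browser when it returns true.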
Running as MCP Server
npm run start
The server will communicate via stdin/stdout and expose the tools as defined in src/index.ts.
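The tool-call objects shown in this README (a name plus arguments) correspond to the params of an MCP tools/call request. As a rough sketch of what a client writes to the server's stdin, assuming the standard MCP stdio transport (newline-delimited JSON-RPC 2.0 messages), the helper below wraps a call in that envelope. The name toToolCallMessage is hypothetical, and a real client must complete the MCP initialize handshake first, which is omitted here.

```typescript
// Hypothetical helper: wraps a tool call in a JSON-RPC 2.0 "tools/call"
// request, serialized as a single newline-terminated line for stdio.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function toToolCallMessage(id: number, call: ToolCall): string {
  const message = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: call.name, arguments: call.arguments },
  };
  // JSON.stringify never emits raw newlines, so one message is one line.
  return JSON.stringify(message) + "\n";
}
```

In practice most users let an MCP-compatible host or SDK produce these messages instead of writing them by hand.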
HTML Slicing Example (Recommended for Large Pages)
Use browser_flow with result: { type: "snapshot", mode: "html" } when you need raw HTML but want to keep the response small.
Client request for first slice:
{
  "name": "browser_flow",
  "arguments": {
    "result": {
      "type": "snapshot",
      "mode": "html",
      "startIndex": 0,
      "endIndex": 20000
    }
  }
}
Response (example):
{
  "mode": "html",
  "totalLength": 123456,
  "startIndex": 0,
  "endIndex": 20000,
  "remainingCharacters": 103456,
  "content": "<html>...first slice...</html>"
}
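The relationship between the response fields above is simple arithmetic: remainingCharacters is totalLength minus endIndex (123456 - 20000 = 103456 in the example). The sketch below reproduces that shape locally and walks a document slice by slice. The function names sliceHtml and sliceAll are hypothetical, and the clamping of out-of-range indices is an assumption, since this README does not say how the server handles them.

```typescript
// Hypothetical local model of the slice response; not the server's code.
interface HtmlSlice {
  mode: "html";
  totalLength: number;
  startIndex: number;
  endIndex: number;
  remainingCharacters: number;
  content: string;
}

function sliceHtml(html: string, startIndex: number, endIndex: number): HtmlSlice {
  // Assumption: indices are clamped to the document bounds.
  const start = Math.max(0, Math.min(startIndex, html.length));
  const end = Math.max(start, Math.min(endIndex, html.length));
  return {
    mode: "html",
    totalLength: html.length,
    startIndex: start,
    endIndex: end,
    remainingCharacters: html.length - end,
    content: html.slice(start, end),
  };
}

// Collect consecutive slices, each starting where the previous one ended.
function sliceAll(html: string, chunkSize: number): HtmlSlice[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  const slices: HtmlSlice[] = [];
  let start = 0;
  do {
    const slice = sliceHtml(html, start, start + chunkSize);
    slices.push(slice);
    start = slice.endIndex;
  } while (start < html.length);
  return slices;
}
```

A client following the same pattern would request the next slice with startIndex set to the previous endIndex until remainingCharacters reaches 0.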
Configuration
- Session Persistence: Always enabled. Logins and cookies are automatically reused across restarts.
- Timeout: Set navigation and API request timeouts.
- Environment Variables: Used for Google Search API integration (used by multi_search).
Examples
Make a REST API Request
{
  "name": "fetch_api",
  "arguments": {
    "url": "https://api.github.com/repos/nodejs/node",
    "method": "GET",
    "headers": {
      "Accept": "application/vnd.github.v3+json"
    },
    "limit": 10000
  }
}
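This README states that fetch_api returns JSON or text when appropriate and base64 for binary responses. One plausible way to make that decision is to branch on the Content-Type header, as sketched below. The decodeBody function and the ApiBody type are hypothetical, the exact rules in src/rest-client.ts are not documented here, and truncation via limit is left out because its semantics are not specified.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical result shape; not the actual fetch_api response schema.
type ApiBody =
  | { kind: "json"; value: unknown }
  | { kind: "text"; value: string }
  | { kind: "base64"; value: string };

// Assumption: the content type alone decides how the body is represented.
function decodeBody(contentType: string, raw: Uint8Array): ApiBody {
  const type = contentType.toLowerCase();
  const buffer = Buffer.from(raw);
  if (type.includes("json")) {
    const text = buffer.toString("utf8");
    try {
      return { kind: "json", value: JSON.parse(text) };
    } catch {
      // Malformed JSON is still useful to the caller as plain text.
      return { kind: "text", value: text };
    }
  }
  if (type.startsWith("text/")) {
    return { kind: "text", value: buffer.toString("utf8") };
  }
  return { kind: "base64", value: buffer.toString("base64") };
}
```

Under this sketch, the application/vnd.github.v3+json response from the request above would be parsed as JSON, while an image would come back base64-encoded.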
Download File
{
  "name": "download_file",
  "arguments": {
    "url": "https://example.com/image.jpg",
    "destinationFolder": "downloads"
  }
}
Note: destinationFolder can be either a relative path (resolved against the project root) or an absolute path. The server will create the destination folder if it does not exist.
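The path rule in the note above maps directly onto Node's path.resolve, which returns an absolute second argument unchanged and otherwise resolves it against the first. The following is a minimal sketch; resolveDestination and its projectRoot parameter are hypothetical names, not Web-curl's API.

```typescript
import * as path from "node:path";

// Hypothetical helper: computes where a downloaded file would be written.
// Relative folders resolve against projectRoot; absolute folders win as-is.
function resolveDestination(
  projectRoot: string,
  destinationFolder: string,
  filename: string
): string {
  const folder = path.resolve(projectRoot, destinationFolder);
  return path.join(folder, filename);
}
```

Creating the folder when it is missing could then be done with fs.mkdirSync(folder, { recursive: true }) before writing.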
Configure Browser
{
  "name": "browser_configure",
  "arguments": {
    "proxy": "http://proxy.example.com:8080",
    "viewport": { "width": 1920, "height": 1080 }
  }
}
Note: Session persistence is always enabled. Cookies and login sessions are automatically stored in the user_data/ directory.
Troubleshooting
- Timeout Errors: Increase the timeout parameter if requests are timing out.
- Google Search Fails: Ensure APIKEY_GOOGLE_SEARCH and CX_GOOGLE_SEARCH are set in your environment.
- Error Logs: Check the logs/error-log.txt file for detailed error messages.
Tips & Best Practices
- For large pages, use maxLength and startIndex to fetch content in slices.
- Always validate your tool arguments to avoid errors.
- Secure your API keys and sensitive data using environment variables.
- Review the MCP tool schemas in src/index.ts for all available options.
Contributing & Issues
Contributions are welcome! If you want to contribute, fork this repository and submit a pull request.
If you find any issues or have suggestions, please open an issue on the repository page.
License & Attribution
This project was developed by Rayss.
For questions, improvements, or contributions, please contact the author or open an issue in the repository.
Note: Google Search API is free with usage limits. For details, see: Google Custom Search API Overview