Job URL Analyzer MCP Server

A comprehensive FastAPI-based microservice for analyzing job URLs and extracting detailed company information. Built with modern async Python, this service crawls job postings and company websites to build rich company profiles with data enrichment from external providers.

✨ Features

  • 🕷️ Intelligent Web Crawling: Respectful crawling with robots.txt compliance and rate limiting
  • 🧠 Content Extraction: Advanced HTML parsing using Selectolax for fast, accurate data extraction
  • 🔗 Data Enrichment: Pluggable enrichment providers (Crunchbase, LinkedIn, custom APIs)
  • 📊 Quality Scoring: Completeness and confidence metrics for extracted data
  • 📝 Markdown Reports: Beautiful, comprehensive company analysis reports
  • 🔍 Observability: OpenTelemetry tracing, Prometheus metrics, structured logging
  • 🚀 Production Ready: Docker, Kubernetes, health checks, graceful shutdown
  • 🧪 Well Tested: Comprehensive test suite with 80%+ coverage
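The quality-scoring idea can be illustrated with a minimal sketch. The field names and the scoring rule below are illustrative assumptions, not the service's actual algorithm (which lives in the source alongside `models.py`):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class CompanyProfile:
    # Illustrative subset of profile fields; the real model is defined in models.py
    name: Optional[str] = None
    description: Optional[str] = None
    industry: Optional[str] = None
    employee_count: Optional[int] = None
    headquarters: Optional[str] = None

def completeness_score(profile: CompanyProfile) -> float:
    """Fraction of profile fields that were actually populated."""
    values = [getattr(profile, f.name) for f in fields(profile)]
    filled = sum(1 for v in values if v not in (None, "", []))
    return filled / len(values)

profile = CompanyProfile(name="TechCorp", industry="Technology", employee_count=150)
print(round(completeness_score(profile), 2))  # 3 of 5 fields filled -> 0.6
```

A confidence score would weight each field by how reliable its extraction source is; the same shape applies.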

🏗️ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │───▶│   Orchestrator  │───▶│   Web Crawler   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │ Content Extract │    │    Database     │
                       └─────────────────┘    │   (SQLAlchemy)  │
                                │             └─────────────────┘
                                ▼                        
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Enrichment    │───▶│    Providers    │
                       │    Manager      │    │ (Crunchbase,etc)│
                       └─────────────────┘    └─────────────────┘
                                │                        
                                ▼                        
                       ┌─────────────────┐              
                       │ Report Generator│              
                       └─────────────────┘              
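The flow above can be sketched as an async pipeline. The stage functions below are illustrative stand-ins for the real crawler, extractor, enrichment manager, and report generator modules:

```python
import asyncio

# Stub stages standing in for crawler.py, extractor.py, enricher/, report_generator.py
async def crawl(url: str) -> str:
    return f"<html>job posting at {url}</html>"

async def extract(html: str) -> dict:
    return {"name": "TechCorp", "source_len": len(html)}

async def enrich(profile: dict) -> dict:
    # e.g. merged in from a Crunchbase provider
    return {**profile, "funding_stage": "Series B"}

async def generate_report(profile: dict) -> str:
    return f"# {profile['name']} - Company Analysis Report"

async def analyze(url: str) -> dict:
    """Orchestrator: crawl -> extract -> enrich -> report."""
    html = await crawl(url)
    profile = await enrich(await extract(html))
    return {"company_profile": profile,
            "markdown_report": await generate_report(profile)}

result = asyncio.run(analyze("https://company.com/jobs/software-engineer"))
print(result["markdown_report"])  # -> # TechCorp - Company Analysis Report
```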

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Poetry (for dependency management)
  • Docker & Docker Compose (optional)

Local Development

  1. Clone and Setup

    git clone https://github.com/subslink326/job-url-analyzer-mcp.git
    cd job-url-analyzer-mcp
    poetry install
    
  2. Environment Configuration (Optional)

    # The application has sensible defaults and can run without environment configuration
    # To customize settings, create a .env file with your configuration
    # See src/job_url_analyzer/config.py for available settings
    
  3. Database Setup

    poetry run alembic upgrade head
    
  4. Run Development Server

    poetry run python -m job_url_analyzer.main
    # Server starts at http://localhost:8000
    

Docker Deployment

  1. Development

    docker-compose up --build
    
  2. Production

    docker-compose -f docker-compose.prod.yml up -d
    

📡 API Usage

Analyze Job URL

curl -X POST "http://localhost:8000/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://company.com/jobs/software-engineer",
    "include_enrichment": true,
    "force_refresh": false
  }'

Response Example

{
  "profile_id": "123e4567-e89b-12d3-a456-426614174000",
  "source_url": "https://company.com/jobs/software-engineer",
  "company_profile": {
    "name": "TechCorp",
    "description": "Leading AI company...",
    "industry": "Technology",
    "employee_count": 150,
    "funding_stage": "Series B",
    "total_funding": 25.0,
    "headquarters": "San Francisco, CA",
    "tech_stack": ["Python", "React", "AWS"],
    "benefits": ["Health insurance", "Remote work"]
  },
  "completeness_score": 0.85,
  "confidence_score": 0.90,
  "processing_time_ms": 3450,
  "enrichment_sources": ["crunchbase"],
  "markdown_report": "# TechCorp - Company Analysis Report\n..."
}
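The same request can be issued from Python with only the standard library. A sketch (assumes the server is running locally on port 8000; the helper name is our own):

```python
import json
import urllib.request

def build_analyze_request(job_url: str, *, include_enrichment: bool = True,
                          force_refresh: bool = False) -> urllib.request.Request:
    """Build the POST /analyze request documented above."""
    payload = {
        "url": job_url,
        "include_enrichment": include_enrichment,
        "force_refresh": force_refresh,
    }
    return urllib.request.Request(
        "http://localhost:8000/analyze",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request("https://company.com/jobs/software-engineer")
# with urllib.request.urlopen(req) as resp:   # requires the server to be up
#     result = json.load(resp)
#     print(result["completeness_score"])
```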

⚙️ Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Enable debug mode | `false` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DATABASE_URL` | Database connection string | `sqlite+aiosqlite:///./data/job_analyzer.db` |
| `MAX_CONCURRENT_REQUESTS` | Max concurrent HTTP requests | `10` |
| `REQUEST_TIMEOUT` | HTTP request timeout (seconds) | `30` |
| `CRAWL_DELAY` | Delay between requests (seconds) | `1.0` |
| `RESPECT_ROBOTS_TXT` | Respect robots.txt | `true` |
| `ENABLE_CRUNCHBASE` | Enable Crunchbase enrichment | `false` |
| `CRUNCHBASE_API_KEY` | Crunchbase API key | `""` |
| `DATA_RETENTION_DAYS` | Data retention period (days) | `90` |
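How these variables become typed settings can be pictured with a stdlib-only sketch. The real service reads them in `src/job_url_analyzer/config.py`; the parsing rules below are illustrative:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    debug: bool
    port: int
    crawl_delay: float
    respect_robots_txt: bool

def load_settings(env=os.environ) -> Settings:
    """Map environment variables to typed settings, falling back to the defaults above."""
    def flag(name: str, default: str) -> bool:
        return env.get(name, default).lower() in ("1", "true", "yes")
    return Settings(
        debug=flag("DEBUG", "false"),
        port=int(env.get("PORT", "8000")),
        crawl_delay=float(env.get("CRAWL_DELAY", "1.0")),
        respect_robots_txt=flag("RESPECT_ROBOTS_TXT", "true"),
    )

print(load_settings({}))  # all defaults: debug=False, port=8000, ...
```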

📊 Monitoring

Metrics Endpoints

  • Health Check: GET /health
  • Prometheus Metrics: GET /metrics

Key Metrics

  • job_analyzer_requests_total - Total API requests
  • job_analyzer_analysis_success_total - Successful analyses
  • job_analyzer_completeness_score - Data completeness distribution
  • job_analyzer_crawl_requests_total - Crawl requests by status
  • job_analyzer_enrichment_success_total - Enrichment success by provider
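A couple of example Prometheus queries over these metrics (the query shapes and the `provider` label are assumptions based on the metric descriptions above):

```promql
# API request rate over the last 5 minutes
rate(job_analyzer_requests_total[5m])

# Enrichment success rate, broken out by provider
sum by (provider) (rate(job_analyzer_enrichment_success_total[5m]))
```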

🧪 Testing

Run Tests

# Unit tests
poetry run pytest

# With coverage
poetry run pytest --cov=job_url_analyzer --cov-report=html

# Integration tests only
poetry run pytest -m integration

# Skip slow tests
poetry run pytest -m "not slow"

🚀 Deployment

Kubernetes

# Apply manifests
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=job-analyzer
kubectl logs -f deployment/job-analyzer
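The manifests in `kubernetes/` are the source of truth; for orientation, a minimal hypothetical Deployment wired to the `/health` endpoint might look like:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: job-analyzer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: job-analyzer
  template:
    metadata:
      labels:
        app: job-analyzer
    spec:
      containers:
        - name: job-analyzer
          image: job-url-analyzer:latest   # hypothetical image tag
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
```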

Production Checklist

  • Environment variables configured
  • Database migrations applied
  • SSL certificates configured
  • Monitoring dashboards set up
  • Log aggregation configured
  • Backup strategy implemented
  • Rate limiting configured
  • Resource limits set

🔧 Development

Project Structure

job-url-analyzer/
├── src/job_url_analyzer/          # Main application code
│   ├── enricher/                  # Enrichment providers
│   ├── main.py                    # FastAPI application
│   ├── config.py                  # Configuration
│   ├── models.py                  # Pydantic models
│   ├── database.py                # Database models
│   ├── crawler.py                 # Web crawler
│   ├── extractor.py               # Content extraction
│   ├── orchestrator.py            # Main orchestrator
│   └── report_generator.py        # Report generation
├── tests/                         # Test suite
├── alembic/                       # Database migrations
├── kubernetes/                    # K8s manifests
├── monitoring/                    # Monitoring configs
├── docker-compose.yml             # Development setup
├── docker-compose.prod.yml        # Production setup
└── Dockerfile                     # Container definition

Code Quality

The project uses:

  • Black for code formatting
  • Ruff for linting
  • MyPy for type checking
  • Pre-commit hooks for quality gates

# Setup pre-commit
poetry run pre-commit install

# Run quality checks
poetry run black .
poetry run ruff check .
poetry run mypy src/

📝 Recent Changes

Dependency Updates

  • Fixed: Replaced non-existent aiohttp-robotparser dependency with robotexclusionrulesparser for robots.txt parsing
  • Improved: Setup process now works out-of-the-box without requiring .env file configuration

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass (poetry run pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Documentation: This README and inline code comments
  • Issues: GitHub Issues for bug reports and feature requests
  • Discussions: GitHub Discussions for questions and community

Built with ❤️ using FastAPI, SQLAlchemy, and modern Python tooling.
