JUHE API Marketplace
rustomax avatar
MCP Server

Observe MCP Server

A Model Context Protocol server that provides access to Observe API functionality, enabling LLMs to execute OPAL queries, manage datasets/monitors, and leverage vector search for documentation and troubleshooting runbooks.

1
GitHub Stars
8/18/2025
Last Updated
MCP Server Configuration
1{
2 "name": "observe-community",
3 "command": "npx",
4 "args": [
5 "mcp-remote@latest",
6 "http://localhost: 8000/sse",
7 "--header",
8 "Authorization: Bearer bearer_token"
9 ]
10}
JSON10 lines

README Documentation

Observe Community MCP Server

A Model Context Protocol (MCP) server providing access to Observe API functionality, OPAL query assistance, and troubleshooting runbooks through vector search.

Purpose

This server is designed to work with technically capable LLM models, specifically tested with Claude Sonnet 3.7 and Claude Sonnet 4. Unlike traditional chatbots, this MCP server:

  • Interfaces directly with the Observe platform in an LLM-friendly way
  • Avoids using LLMs for internal functions to prevent private data leakage
  • Serves as a secure bridge between third-party LLMs and your Observe data

⚠️ IMPORTANT: The newest addition to this MCP - the execute_nlp_query tool does leverage LLM under the hood (specifically Anthropic models). If this is not desirable, you can disable the tool. The caller LLM or agent can still use the MCP, but it will need to do more heavy lifting to get to the end result.

⚠️ IMPORTANT DISCLAIMER: This is an experimental and unsupported MCP server created for testing and collaboration with Observe AI design partners. Observe or the author bear no liability for any use of this server and does not support it in any way. A separate production-ready closed-source MCP server is available to Observe customers.

Table of Contents

Overview

This MCP server provides comprehensive access to the Observe platform through multiple layers of functionality:

Core API Integration

  • OPAL Query Execution: Direct query execution against Observe datasets
  • Worksheet Export: Flexible data export with multiple time parameter options
  • Dataset Discovery: Listing and detailed schema inspection of available datasets
  • Monitor Management: Creation, listing, and configuration of monitoring alerts

Natural Language Query Interface

  • Natural Language Processing: The execute_nlp_query tool converts natural language requests into validated OPAL queries, featuring:
    • 90%+ Success Rate: OPAL syntax generation with multi-tier error recovery
    • Schema-Aware Generation: Automatic field validation and syntax correction
    • Multi-Tier Error Recovery: Fallback strategies for failed queries
    • Automatic Validation: Pre-execution syntax checking and correction

Knowledge Base Integration

The server leverages Pinecone as a vector database for semantic search across:

  • OPAL Reference Documentation: Comprehensive syntax guides and examples
  • Troubleshooting Runbooks: Structured investigation methodologies
  • Best Practices: Validated patterns for common monitoring scenarios

Value Proposition

This experimental server delivers significant value by enabling:

  • Conversational Data Analysis: Ask questions in natural language and get structured OPAL queries with results
  • Rapid Investigation: Get investigation strategies through semantic runbook recommendations
  • Comprehensive Troubleshooting: Analyze logs, metrics, and traces through a unified interface
  • Knowledge-Augmented Queries: Access to vectorized documentation ensures syntactically correct and efficient queries

Claude Desktop using Observe MCP Server

Getting Started

Prerequisites

  • Python 3.8+
  • Pinecone account with an API key
  • Observe API credentials
  • For Natural Language Processing: Anthropic API key (for execute_nlp_query tool)

Installation

First, clone the repository:

git clone https://github.com/rustomax/observe-experimental-mcp.git
cd observe-experimental-mcp

Create a virtual environment and activate it:

python3 -m venv .venv
source .venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Environment Setup

Copy .env.template to .env and fill in the values.

Populating Vector Database

Note: Pinecone's vector indices are eventually consistent - after populating an index, allow a few minutes for lookups to return the most recent data.

Documentation Index

The vector database needs to be populated with OPAL reference data before the server can provide semantic search capabilities.

IMPORTANT: if you are not an Observe employee, you won't have access to documentation markdown files. Please contact your Observe representative for assistance for this step.

Run the following command to populate the documentation database:

python populate_docs_index.py

Note: If you have previously created the index and you want a clean slate, you can use the --force flag to recreate the index.

This will process all the markdown files in the observe-docs directory and create vector embeddings for semantic search. The script will:

  • Read markdown files from the observe-docs directory (both documentation and prompts)
  • Extract metadata and content from these files
  • Generate embeddings for each document using Pinecone's integrated embedding model (llama-text-embed-v2)
  • Store the embeddings and metadata in the Pinecone index

Options:

  • --force: Force recreation of the index even if it already exists
  • --verbose: Enable detailed logging
  • --limit <n>: Limit the number of files to process (0 for no limit)

The script will show a progress bar as it processes the files and uploads them to Pinecone.

Runbooks Index

To populate the runbooks vector database, run:

python populate_runbooks_index.py

Note: If you have previously created the index and you want a clean slate, you can use the --force flag to recreate the index.

This script will:

  • Find all markdown (.md) files in the runbooks directory
  • Chunk the content into semantically meaningful segments (approximately 1000 characters each)
  • Generate embeddings for each chunk using Pinecone's integrated embedding model
  • Store the embeddings and metadata in a separate Pinecone index for runbooks

Options:

  • --force: Force recreation of the index even if it already exists
  • --runbooks_dir <path>: Specify a custom path to the runbooks directory

The chunking process preserves the semantic meaning of the content while optimizing for vector search, but the recommend_runbook function will return complete runbooks, not just chunks.

Example run:

python ./populate_runbooks_index.py --force

Forcing recreation of index
Deleting existing index: observe-runbooks
Initializing Pinecone client
Creating new Pinecone index with integrated embedding model: observe-runbooks
Waiting for index to be ready...
Connected to Pinecone index: observe-runbooks
Found 11 runbook files
Processing files: 100%|███████████████████████████| 11/11 [00:00<00:00, 5327.02it/s]
Collected 116 chunks from 11 files
Upserting batch 1/2 with 95 records
Successfully upserted batch 1/2
Upserting batch 2/2 with 21 records
Successfully upserted batch 2/2
Total chunks added to Pinecone: 116

Running the Server

Traditional Python Execution

python observe_server.py

The server runs with Server-Sent Events (SSE) transport by default on port 8000. You can modify the port and transport method in the observe_server.py file if needed.

Docker Deployment

The server can be containerized using Docker for easier deployment and isolation.

Development with Docker

  1. Copy environment configuration:

    cp .env.docker .env
    # Edit .env with your actual API keys and configuration
    
  2. Start with docker-compose (development mode):

    docker-compose up --build
    

    This runs in development mode with source code mounted for live reloading.

  3. Access the server:

    • MCP Server: http://localhost:8000 (SSE transport)
    • Health monitoring: Automatic via Docker health checks

Production Deployment

For production deployments with nginx proxy and resource limits:

# Use production configuration
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --build -d

# Or with nginx proxy (for load balancing/SSL termination)
docker-compose --profile production up --build -d

Docker Configuration

  • Port: 8000 (configurable via docker-compose)
  • Health Check: Automatic health monitoring with restart policies
  • Security: Non-root user execution, minimal attack surface
  • Resource Limits: Configurable CPU/memory limits in production mode
  • Data Persistence: Vector database and documentation mounted as volumes

Customizing Docker Deployment

  • Environment Variables: Modify .env file or docker-compose environment section
  • Resource Limits: Adjust in docker-compose.prod.yml
  • SSL/TLS: Configure nginx section in docker-compose with your certificates
  • Custom Docs: Mount your own observe-docs or runbooks directories

Key Components

The server uses a clean, modular architecture for maintainability and reusability. Core functionality is organized into specialized packages, keeping the main server file focused and lean.

ComponentDescription
observe_server.pyMain MCP server implementation with tool definitions
Natural Language Processing
src/nlp_agent/Intelligent OPAL query generation from natural language
src/nlp_agent/agent.pyCore NLP agent with 90%+ success rate, schema-aware generation, and multi-tier error recovery
Observe API Package
src/observe/Organized package containing all Observe API operations
src/observe/client.pyHTTP client with error handling and request utilities
src/observe/config.pyConfiguration management and environment validation
src/observe/datasets.pyDataset listing, filtering, and information retrieval
src/observe/queries.pyOPAL query execution with QueryBuilder helper
src/observe/monitors.pyMonitor creation, listing, and management
src/observe/worksheets.pyWorksheet export with WorksheetExporter class
Authentication Package
src/auth/Complete authentication and authorization system
src/auth/jwt_utils.pyJWT token decoding, validation, and utilities
src/auth/scopes.pyScope-based authorization middleware with decorators
src/auth/permissions.pyPermission management and user capability checking
src/auth/middleware.pyFastMCP authentication integration and setup
Vector Database Package
src/pinecone/Organized package containing all Pinecone vector database operations
src/pinecone/client.pyPinecone client initialization and connection management
src/pinecone/embeddings.pyEmbedding generation for single texts and batches
src/pinecone/search.pySemantic search operations for docs and runbooks
src/pinecone/indexing.pyDocument and runbook indexing operations
Scripts and Data
populate_docs_index.pySimplified script to ingest markdown files from observe-docs into Pinecone
populate_runbooks_index.pySimplified script to ingest troubleshooting runbooks from runbooks into Pinecone
runbooks/Directory containing troubleshooting runbooks
observe-docs/Directory containing Observe documentation markdown files (not included in public repo)
generate_mcp_token.shScript to generate MCP tokens

Architecture Benefits

  • Natural Language Processing: Natural language to OPAL translation with 90%+ success rate
  • Maintainability: Each module has a single responsibility and clear boundaries
  • Reusability: Core logic can be easily imported and used in other projects
  • Type Safety: TypedDict definitions and type hints throughout
  • Error Handling: Standardized error patterns with both dictionary and exception-based approaches
  • Self-Healing: Automatic query validation, correction, and multi-tier error recovery
  • Testing: Individual modules can be tested in isolation
  • Documentation: Each module is self-documenting with docstrings

Key Features

  • Natural Language Interface: Convert plain English questions into validated OPAL queries
  • Schema-Aware Generation: Automatic field validation using real dataset schemas
  • Multi-Tier Error Recovery: Progressive fallback strategies for query failures
  • Helper Classes: QueryBuilder, WorksheetExporter, and PermissionChecker for developer experience
  • Standardized Error Handling: Both dictionary-based and exception-based error patterns
  • Authentication: JWT utilities, scope-based authorization, and permission management
  • Configuration Management: Centralized environment validation and header sanitization
  • Type Safety: Type definitions and Optional type handling

Available MCP Tools

🚀 Intelligent Query Interface (Recommended)

  • execute_nlp_query: Convert natural language requests into validated OPAL queries and execute them
    • Required Parameters: dataset_id (string) + request (natural language query)
    • Prerequisite: Caller must first identify the target dataset using list_datasets or get_dataset_info
    • 90%+ Success Rate: Advanced OPAL syntax generation with comprehensive error recovery
    • Schema-Aware: Automatically validates field names against dataset schemas
    • Self-Healing: Multi-tier error recovery with intelligent fallback strategies
    • Example: execute_nlp_query("42160988", "Show me error rates by service in the last hour")

Core Observe API Tools

  • execute_opal_query: Execute an OPAL query on a dataset (manual OPAL syntax required)
  • export_worksheet: Export data from an Observe worksheet with flexible time parameters (defaults to 15m interval)
  • list_datasets: List available datasets in Observe with filtering options
  • get_dataset_info: Get detailed schema information about a specific dataset
  • create_monitor: Create a new monitor in Observe with OPAL queries
  • list_monitors: List all monitors in Observe
  • get_monitor: Get detailed information about a specific monitor

Knowledge Base & Documentation Tools

  • get_relevant_docs: Get relevant OPAL documentation using semantic vector search
  • recommend_runbook: Get structured troubleshooting methodologies for specific problems

Authentication and System Tools

  • get_system_prompt: Get the system prompt that configures LLMs as Observe experts
  • decode_jwt_token: Decode a JWT token and return its contents (debugging tool)
  • get_auth_token_info: Get comprehensive authentication and authorization information
  • admin_system_info: Get detailed system information (requires admin scope)
  • public_server_info: Get basic public server information (available to all users)

Scope-Based Access Control

All tools are protected by scope-based authorization:

  • Admin scope (admin): Full access to all tools including system information
  • Write scope (write): Can create monitors and execute queries
  • Read scope (read): Can view datasets, monitors, and execute read-only operations

Scopes follow a hierarchical model where admin includes write and read permissions, and write includes read permissions.

Worksheet Export Tool

The export_worksheet tool provides flexible worksheet data export functionality with multiple time parameter options:

Parameters

  • worksheet_id (required): The ID of the worksheet to export
  • time_range (optional): Time range for the export (e.g., "15m", "1h", "24h"). Defaults to "15m" if no time parameters are provided
  • start_time (optional): Start time in ISO format (e.g., "2025-07-21T00:00:00Z")
  • end_time (optional): End time in ISO format (e.g., "2025-07-22T00:00:00Z")

Time Parameter Combinations

The tool supports all standard Observe API time parameter patterns:

  1. Interval only (relative to now): export_worksheet("42566610", time_range="1h")
  2. Start and end times: export_worksheet("42566610", start_time="2025-07-21T00:00:00Z", end_time="2025-07-22T00:00:00Z")
  3. Start time + interval: export_worksheet("42566610", start_time="2025-07-21T00:00:00Z", time_range="24h")
  4. End time + interval: export_worksheet("42566610", end_time="2025-07-22T00:00:00Z", time_range="24h")

Response Format

The tool returns data in NDJSON (newline-delimited JSON) format. For large datasets, responses are automatically truncated with a summary showing the first 50 lines and total row count.

Vector Database Helpers

The organized src/pinecone/ package provides comprehensive Pinecone functionality:

Core Modules:

  • src/pinecone/client.py - Initialize Pinecone client with support for multiple index types
  • src/pinecone/embeddings.py - Generate embeddings for single texts and batches
  • src/pinecone/search.py - Semantic search operations for both docs and runbooks
  • src/pinecone/indexing.py - Document and runbook indexing with portable path handling

Key Functions:

  • initialize_pinecone(index_type) - Initialize client for specific index type ("docs" or "runbooks")
  • get_embedding(pc, text, is_query) - Get embedding using Pinecone's integrated embedding model
  • semantic_search(query, n_results, index_type) - Unified semantic search across index types
  • index_documents(docs_dir) - Index documents with relative path conversion
  • index_runbooks(runbooks_dir) - Index runbooks with consistent chunking strategy

Helper Classes and Utilities

The modular architecture includes several helper classes that improve developer experience and code maintainability:

Observe API Helpers

QueryBuilder (src/observe/queries.py)

# Build complex OPAL queries programmatically
builder = QueryBuilder()
query = builder.filter("service_name", "cartservice") \
              .filter("status_code", ">=", 400) \
              .timechart("5m", "count()") \
              .build()

WorksheetExporter (src/observe/worksheets.py)

# Export worksheets with flexible time parameters
exporter = WorksheetExporter()
data = await exporter.export_with_time_range("42566610", "24h")
data = await exporter.export_with_time_window("42566610", start="2025-01-01T00:00:00Z", end="2025-01-02T00:00:00Z")

Authentication Helpers

PermissionChecker (src/auth/permissions.py)

# Check user permissions programmatically
checker = PermissionChecker(user_scopes=["read", "write"])
if checker.can_create_monitors():
    # User can create monitors
if checker.can_access_admin_tools():
    # User has admin access

ScopeValidator (src/auth/scopes.py)

# Validate scopes for complex scenarios
validator = ScopeValidator(required_scopes=["admin", "write"])
result = validator.validate(user_scopes)
if not result["has_access"]:
    return validator.get_access_error()

Error Handling

The modules provide both dictionary-based and exception-based error handling patterns:

# Dictionary-based errors (for MCP tools)
{"error": True, "message": "Access denied"}

# Exception-based errors (for internal logic)  
try:
    result = await some_operation()
except ObserveAPIError as e:
    return {"error": True, "message": str(e)}

Architecture and How It Works

The system uses a multi-index vector database architecture to provide both documentation assistance and runbook recommendations. There are two main workflows: the indexing process and the runtime query process.

Indexing Process

This diagram shows how documentation and runbooks are processed and loaded into the Pinecone vector database:

Data Import

Runtime Query Process

This diagram shows how user queries are processed at runtime:

Query Architecture

Documentation Assistance Flow

  1. The LLM assistant calls the MCP server with a user query for documentation
  2. The get_relevant_docs function is called to process the query
  3. The system generates an embedding for the query using Pinecone's inference API
  4. This embedding is used to search the "observe-docs" index for relevant document chunks
  5. The system groups chunks by source document and calculates relevance scores
  6. Complete documents are retrieved from the filesystem and returned, ranked by relevance

Runbook Recommendation Flow

  1. The LLM assistant calls the MCP server with a user troubleshooting query
  2. The recommend_runbook function is called to process the query
  3. The system generates an embedding for the query using Pinecone's inference API
  4. This embedding is used to search the "observe-runbooks" index for relevant runbook chunks
  5. The system groups chunks by source runbook and calculates average relevance scores
  6. The most relevant complete runbook is retrieved from the filesystem and returned to the user

Advantages of the Vector Database Approach

  • Semantic Understanding: Vector search understands the meaning of queries, not just keywords
  • Fuzzy Matching: Can find relevant results even with typos or different wording
  • Relevance Ranking: Results are ranked by semantic similarity
  • Extensibility: Easy to add new documents without schema changes
  • Direct Use of Markdown: Uses markdown files as the source of truth
  • Integrated Embeddings: Uses Pinecone's built-in embedding model, eliminating the need for OpenAI API keys
  • Scalability: Pinecone provides a managed, scalable vector database service
  • Chunking with Full Document Retrieval: Optimizes search accuracy through chunking while providing complete context through full document retrieval

Setting up Authentication

⚠️ CRITICAL: DO NOT SKIP THIS SECTION. READ IT IN ITS ENTIRETY.

There are two types of authentication mechanisms used in this server: Observe API authentication and MCP authentication.

Observe API authentication (Observe API bearer token) - this inherits the context of the token that is used to authenticate the user who created the token to the Observe platform. This token should be considered secret and is never exposed to the MCP users.

⚠️ DANGER ZONE: The consequence of the above is that once a user is authenticated to the MCP server, they will NOT assume their identity in Observe, but the identity of the user who generated the Observe token. Make sure to use RBAC and limit access for the Observe API token to specific roles and permissions that you want to make available to the Observe MCP server users.

MCP authentication (MCP bearer token) - this is the authentication that is used to authorize the user to access and use the functionality of the MCP server. This token is generated by the server administrator and is exposed to the MCP users, for instance to use in Claude Desktop or other MCP clients.

This second layer of authentication is necessary because the server exposes resource-intensive APIs (like Pinecone) to MCP users. It allows server administrators to control access and prevent resource abuse.

Authentication

IMPORTANT: Current implementation of the MCP server also includes a basic RBAC via predefined roles: admin, read, write. These DO NOT map into any roles in Observe. They are used to control access to the MCP server tools.

Local-only deployment: If you're running the server locally without public access, you can disable MCP authentication by modifying observe_server.py and removing the Authorization header from the MCP client configuration.

Setting up MCP authentication

Create private and public key files in a secure location, i.e. in _secure directory.

openssl genrsa -out private_key.pem 2048
openssl rsa -in private_key.pem -pubout -out public_key.pem

This will create two files:

  • private_key.pem - the private key file. Keep this secret - you will need it to sign the MCP bearer tokens.
  • public_key.pem - the public key file. You will need to add this to the observe_server.py file.
cat public_key.pem

Copy the public key and add it to the .env file:

# Public PEM key for MCP token verification
PUBLIC_KEY_PEM="-----BEGIN PUBLIC KEY-----
<your_public_key_here>
-----END PUBLIC KEY-----"

Sign the bearer token with the private key:

./generate_mcp_token.sh 'user@example.com' 'admin,read,write' '4H'

Security Best Practice: Keep token expiration times short (hours rather than days). Avoid issuing long-lived tokens to minimize security risks.

Using in Claude Desktop

Add the following to your claude_desktop_config.json if you are running the MPC server locally, or provide the URL if you exposed the server publically.

Network Configuration Note: MCP clients typically restrict HTTP access to localhost only. For internet-accessible deployments, implement an HTTPS reverse proxy (Nginx, Caddy, etc.) with proper DNS configuration and SSL certificates.

{
  "mcpServers": {
    "observe-community": {
      "command": "npx",
      "args": [
        "mcp-remote@latest",
        "http://localhost:8000/sse",
        "--header",
        "Authorization: Bearer bearer_token"
      ]
    }
  }
}

The server runs on port 8000 by default. You can modify the port in the observe_server.py file if needed.

Performance Note: The server uses Server-Sent Events (SSE) transport by default, streaming responses as they're generated for improved efficiency with large payloads. Transport method can be modified in observe_server.py if needed.

Example Startup:

mcp run ./observe_server.py

Starting observe_server.py
Python version: 3.13.5 (main, Jun 11 2025, 15:36:57) [Clang 15.0.0 (clang-1500.1.0.2.5)]
Python path: ...
Basic imports successful
Attempting to import pinecone module...
Pinecone import successful. Version: 7.0.2
Attempting to import src.pinecone modules...
Successfully imported search functions from src.pinecone.search

Python MCP server starting...
[07/22/25 08:50:22] INFO     Starting MCP server 'observe-community'   server.py:1297
                             with transport 'sse' on                            
                             http://0.0.0.0:8000/sse/                                   

INFO:     Started server process [67344]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Test it out in Claude Desktop or any other MCP client, by checking out tools available in Observe MCP server.

Claude Desktop using Observe MCP Server

Making the Most of the MCP Server

Start with Natural Language Queries

The execute_nlp_query tool is designed to be your primary interface for data analysis. Simply describe what you want to investigate in natural language:

  • "Show me error rates by service in the last hour"
  • "Find slow traces with high latency"
  • "What services are having performance issues?"
  • "Calculate 95th percentile response times by endpoint"

The system automatically:

  • Converts your request to proper OPAL syntax
  • Validates field names against dataset schemas
  • Handles error recovery if queries fail
  • Returns formatted results with analysis

Use a Smart Client LLM

This MCP server is designed to work with technically capable LLM models and has been specifically tested with Claude Sonnet 3.7 and Claude Sonnet 4. The combination of intelligent client LLMs and the server's NLP capabilities creates a powerful troubleshooting environment.

Understand the Recommended Flow

The optimal workflow leverages both natural language processing and semantic search:

  1. Connect & Configure: MCP client connects and adopts the system prompt to become an Observe expert
  2. Investigation Strategy: Use recommend_runbook() to get structured troubleshooting approaches
  3. Data Discovery: Use list_datasets() to find relevant data sources
  4. Natural Language Analysis: Use execute_nlp_query() for most data analysis tasks
  5. Knowledge Augmentation: Use get_relevant_docs() for OPAL syntax questions or advanced features
  6. Fallback to Manual: Use execute_opal_query() only when you need precise control over query syntax

Query Types and Tool Selection

For Data Analysis (90% of use cases):

  • Use execute_nlp_query() - it handles OPAL generation, validation, and error recovery automatically

For Platform Knowledge:

  • Use get_relevant_docs() for OPAL syntax questions and platform capabilities

For Structured Investigation:

  • Use recommend_runbook() to get methodology, then execute_nlp_query() for data analysis

For Manual Control:

  • Use execute_opal_query() only when you need to validate specific OPAL syntax or have precise requirements

Hard-Code the System Prompt

As mentioned earlier, when the MCP client first connects to the MCP server it will discover available tools and their descriptions and the MCP server will try to convince the model to use system prompt in prompts/Observe MCP System Prompt.md to configure itself as an Observe expert. In situations where the model fails to do so, it might need more calls to accomplish the requested tasks. To ensure the most consistent use of the MCP server, hardcode the system prompt into your LLM configuration, do not rely on the MCP server to configure the model. Here is an example of how to do it in Claude Desktop:

Claude Desktop System Prompt

Craft Effective Prompts

Prompt engineering still applies! You can speed up the investigations by crafting effective prompts, by including clear objectives for the investigation, specifying time ranges of interest to focus the analysis, identifying relevant datasets or systems that should be examined and defining the expected output format, such as requesting a summary with the top 3 issues.

You can also use progressive disclosure to help the LLM build a mental model of your environment. Start with high-level questions and progressively dive deeper based on findings. As you progress through investigation, provide feedback. When the LLM makes incorrect assumptions or uses tools ineffectively, nudge it in the right direction to guide it back on track.

Use Runbooks to Guide LLMs

Feel free to create, update or delete runbooks as you see fit. Tweak them to your particular environment or use-cases. For instance, if you created a number of custom datasets that are unique to your environment, create a custom runbook that helps LLM use them in investigations. Use LLMs to improve your runbooks by using Improve this prompt type of prompt.

Remind LLMs to Use Tools

Due to sheer volume of tokens pushed into LLM's context window for the investigative process, the LLM can become "forgetful" and stop using available tools effectively. For instance, it might repeatedly struggle to craft a working OPAL query. You can nudge it in the right direction by reminding it to use get_dataset_info tool, which returns a list of fields and their types, as well as get_relevant_docs tool, which performs semantic search across the documentation and returns relevant chunks. Using these tools in combination dramatically improves the LLM's ability to craft working OPAL queries.

Maintenance

Updating Documentation

To update the vector database with new documentation:

  1. Add or update markdown files in the observe-docs directory
  2. Run python populate_docs_index.py --force to rebuild the index

Updating Runbooks

To update the runbooks vector database:

  1. Add or update markdown files in the runbooks directory
  2. Run python populate_runbooks_index.py --force to rebuild the index

Both populate scripts now use the organized src/pinecone/ modules for consistent functionality and maintenance.

Addendum: NLP Agent Architecture

This section is light bed-time reading, because the rest of the document wasn't dense enough. You are welcome.

Critical Usage Requirements

The execute_nlp_query tool requires two essential parameters:

  1. dataset_id (string): The specific Observe dataset to query against
  2. request (string): Natural language description of the desired analysis

Important: The caller must identify the appropriate dataset BEFORE using this tool. Use list_datasets to explore available datasets or get_dataset_info to understand dataset schemas and contents. The NLP agent cannot determine which dataset to use from the natural language request alone.

Typical Workflow:

  1. list_datasets → Identify relevant datasets
  2. get_dataset_info("dataset_id") → Understand schema and data structure
  3. execute_nlp_query("dataset_id", "natural language request") → Execute analysis

Technical Architecture

The execute_nlp_query tool implements a multi-component agent system for converting natural language requests into OPAL queries. Rather than simple pattern matching, the system uses several specialized components to handle schema validation, error recovery, and query optimization.

Agent System Overview

graph TD
    A[Natural Language Request] --> B[Schema Analysis Agent]
    B --> C[Documentation Retrieval Agent]
    C --> D[Query Generation Agent]
    D --> E[Validation Agent]
    E --> F[Execution Agent]
    F --> G{Success?}
    G -->|Yes| H[Result Analysis Agent]
    G -->|No| I[Error Classification Agent]
    I --> J[Recovery Strategy Agent]
    J --> K[Query Correction Agent]
    K --> F
    H --> L[Formatted Response]
    
    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style I fill:#fff3e0
    style J fill:#fff3e0

Component Architecture

The NLP agent operates through several specialized components, each handling different aspects of query processing:

1. Schema Analysis Agent

Purpose: Understanding available data structures

  • Retrieves real dataset schemas from Observe
  • Identifies available fields, types, and relationships
  • Validates field names against actual dataset structure
  • Prevents "field not found" errors before query execution

2. Documentation Retrieval Agent

Purpose: Knowledge-augmented query generation

  • Performs semantic search across OPAL documentation
  • Retrieves relevant syntax patterns and examples
  • Ensures generated queries follow best practices
  • Provides context for complex analytical patterns

3. Query Generation Agent

Purpose: Natural language to OPAL translation

  • Uses Claude Sonnet 4 for sophisticated language understanding
  • Converts analytical intent into syntactically correct OPAL
  • Applies schema knowledge to use correct field names
  • Incorporates documentation patterns for optimal query structure

4. Validation Agent

Purpose: Pre-execution quality assurance

  • Validates OPAL syntax before execution
  • Checks for common anti-patterns (SQL syntax, time filtering, etc.)
  • Applies automatic corrections for known issues
  • Prevents execution of malformed queries

5. Execution Agent

Purpose: Reliable query execution with resilience

  • Executes queries against Observe APIs
  • Implements exponential backoff for transient failures
  • Monitors execution performance and resource usage
  • Provides detailed execution context for debugging

6. Error Recovery System

Purpose: Automated failure handling and correction

graph LR
    A[Query Failure] --> B{Error Classification}
    B -->|Syntax Error| C[Pattern-Based Fix]
    B -->|Field Error| D[Schema Consultation]
    B -->|Logic Error| E[Documentation Search]
    B -->|Unknown Error| F[Simplification Strategy]
    
    C --> G[Corrected Query]
    D --> G
    E --> G
    F --> G
    
    G --> H[Re-execution]
    H --> I{Success?}
    I -->|No| J[Next Strategy]
    I -->|Yes| K[Success]
    
    J --> B
    
    style A fill:#ffebee
    style K fill:#e8f5e8
    style J fill:#fff3e0

Recovery Strategy Matrix

The system implements a five-tier recovery approach:

AttemptStrategyFocusSuccess Rate
1Direct ExecutionGenerated query as-is~70%
2Syntax CorrectionFix obvious OPAL syntax issues~85%
3Schema-Guided RecoveryUse dataset schema to fix field references~90%
4Documentation-AugmentedRetrieve specific OPAL patterns for error type~95%
5Simplification FallbackGenerate simpler, safer query~98%

Error Classification

The system categorizes errors to select appropriate recovery strategies:

flowchart TD
    A[Query Error] --> B{Error Analysis}
    
    B --> C[Syntax Errors]
    B --> D[Field Errors]  
    B --> E[Logic Errors]
    B --> F[Performance Errors]
    B --> G[Unknown Errors]
    
    C --> C1[Time filtering detected]
    C --> C2[SQL syntax detected]
    C --> C3[Invalid operators]
    
    D --> D1[Field not found]
    D --> D2[Type mismatch]
    D --> D3[Invalid aggregation]
    
    E --> E1[Empty result set]
    E --> E2[Query timeout]
    E --> E3[Logic inconsistency]
    
    F --> F1[Resource limits]
    F --> F2[Query complexity]
    
    G --> G1[API errors]
    G --> G2[System errors]
    
    C1 --> H[Remove time filters]
    C2 --> I[Convert to OPAL syntax]
    D1 --> J[Schema consultation]
    E1 --> K[Broaden filters]
    F1 --> L[Add limits/simplify]
    
    style A fill:#ffebee
    style H fill:#e8f5e8
    style I fill:#e8f5e8
    style J fill:#e8f5e8
    style K fill:#e8f5e8
    style L fill:#e8f5e8

Learning and Adaptation

The system incorporates several learning mechanisms:

Pattern Recognition

  • Identifies common error patterns across queries
  • Builds library of successful correction strategies
  • Adapts validation rules based on frequent failures

Context Preservation

  • Maintains context across recovery attempts
  • Uses previous attempt failures to inform next strategy
  • Preserves user intent while adapting implementation

Performance Optimization

  • Tracks query performance and optimization opportunities
  • Suggests more efficient query patterns when possible
  • Balances accuracy with execution speed

Agent Communication Flow

sequenceDiagram
    participant User
    participant NLP as NLP Agent
    participant Schema as Schema Agent
    participant Docs as Documentation Agent
    participant Query as Query Generator
    participant Exec as Execution Agent
    participant Recovery as Recovery Agent
    
    User->>NLP: "Show error rates by service"
    NLP->>Schema: Get dataset schema
    Schema-->>NLP: Field names, types
    NLP->>Docs: Search OPAL patterns
    Docs-->>NLP: Error rate examples
    NLP->>Query: Generate OPAL query
    Query-->>NLP: statsby error_rate:avg(if(error, 1.0, 0.0))...
    NLP->>Exec: Execute query
    
    alt Query Success
        Exec-->>NLP: Results
        NLP-->>User: Formatted analysis
    else Query Failure
        Exec-->>Recovery: Error details
        Recovery->>Recovery: Classify error type
        Recovery->>Query: Generate corrected query
        Query-->>Recovery: Fixed query
        Recovery->>Exec: Re-execute
        Exec-->>NLP: Results
        NLP-->>User: Results + recovery info
    end

Key Agent Capabilities

Semantic Understanding

  • Parses analytical intent beyond keyword matching
  • Recognizes patterns: "error rates", "slow queries", "performance issues"
  • Maps business concepts to technical implementation

Contextual Adaptation

  • Adapts query complexity based on dataset characteristics
  • Adjusts aggregation strategies based on data types
  • Optimizes performance based on dataset size

Self-Healing Operation

  • Automatically corrects common syntax errors
  • Recovers from schema mismatches
  • Handles API-level failures gracefully

This multi-component architecture achieves the 90%+ success rate through layered validation, error recovery, and continuous learning from query execution patterns.

Real-World Agent Examples

Example 1: Simple Error Rate Analysis

User Request: "Show me error rates by service in the last hour"

Agent Process:
1. Schema Agent: Identifies fields like 'error', 'service_name' in dataset
2. Documentation Agent: Retrieves error rate calculation patterns
3. Query Generator: Creates OPAL query with conditional aggregation
4. Generated Query: statsby error_rate:avg(if(error, 1.0, 0.0)), total_requests:count(), group_by(service_name)
5. Result: ✅ Success on first attempt

Example 2: Complex Recovery Scenario

User Request: "Find slow database queries over 2 seconds"

Attempt 1: Filter duration > 2s (FAILS - field not found)
├── Error Classification: Field reference error
├── Schema Consultation: Finds 'response_time' field instead
└── Recovery Query: filter response_time > 2000

Attempt 2: Filter response_time > 2000 (FAILS - wrong units)  
├── Error Classification: Logic error (no results)
├── Documentation Search: Finds duration patterns in nanoseconds
└── Recovery Query: filter response_time > 2000000000

Attempt 3: ✅ Success - Returns slow query analysis

Example 3: Multi-Tier Validation and Correction

User Request: "Show me case when error count by hour"

Initial Generation: 

SELECT CASE WHEN error_count > 0 THEN 'has_errors' ELSE 'no_errors' END, 
       COUNT(*) 
FROM dataset 
WHERE timestamp > timestamp - 1h 
GROUP BY hour

Validation Agent Detects:

  • ❌ SQL syntax (not OPAL)
  • ❌ Time filtering in query
  • ❌ case() function (should be if())

Auto-Corrections Applied:

  • Convert SQL to OPAL syntax
  • Remove time filtering (handled by API)
  • Replace case() with if() function

Final Query:

statsby error_status:if(error_count > 0, "has_errors", "no_errors"), 
        total_count:count(), 
        group_by(error_status) | 
timechart 1h, count()

Result: ✅ Success after automatic correction

Performance Characteristics

MetricValueDescription
Overall Success Rate90%+Queries that execute successfully and return meaningful results
First-Attempt Success~70%Queries that succeed without any recovery needed
Average Recovery Time<2 secondsTime from failure detection to successful re-execution
Schema Validation Accuracy98%+Correct field name usage after schema consultation
Syntax Correction Rate95%+Automatic fixes that result in valid OPAL
Mean Response Time1.2 secondsAverage time from request to formatted result

The agent system makes observability data accessible through natural language while maintaining the precision of expert-written queries.

Quick Install

Quick Actions

Key Features

Model Context Protocol
Secure Communication
Real-time Updates
Open Source