AI Phone Assistant Automation: Hands-Free Access to Your Agent from Any Phone
Imagine you're driving and need to confirm whether the critical Jira ticket blocking your release has been updated, add a calendar event for a follow-up call, and get a quick summary of the latest news on a key dependency. Your laptop is packed away, but your phone is connected to your car's system.
A phone-based OpenClaw agent handles all these tasks with ease: just call the agent’s number, speak your request, and hear a response in under three seconds. No app, no screen—totally hands-free. When voice isn't practical, such as in noisy settings or meetings, the same assistant is reachable via SMS, providing uninterrupted access.
This setup is among the most infrastructure-intensive OpenClaw use cases in the productivity space due to telephony's complexity versus straightforward chat agents. The reward is a personal AI assistant accessible from any phone, no smartphone required.
By following this guide, you'll build a working phone-based voice and SMS agent that classifies intents, queries backends like calendar or Jira, and returns fast, natural responses. Test intent classification and response generation safely in WisGate's AI Studio (https://wisgate.ai/studio/image) before wiring any telephony infrastructure, and obtain your API key at https://wisgate.ai/hall/tokens.
What the Phone-Based Agent Stack Looks Like
Before diving into configuration, understand the full stack powering this phone-based agent. Voice calls add layers not present in standard text chats, each with latency implications.
| Layer | Component | Role |
|---|---|---|
| Telephony | Twilio Voice / SMS | Handles incoming calls and SMS, routes to your webhook |
| Speech-to-Text | Twilio STT or Deepgram | Converts caller's speech into a text transcript |
| Intent & Response | OpenClaw + WisGate (Haiku) | Classifies intent, interacts with backend, generates replies |
| Backend Integrations | Calendar API, Jira API, Web Search | Executes requests like calendar updates, ticket status, or searches |
| Text-to-Speech | Twilio TTS or ElevenLabs | Converts OpenClaw's text response back into spoken audio |
| SMS fallback | Twilio SMS | Sends text responses for SMS-based access |
Voice interfaces are uniquely sensitive to latency. The combined delays from STT, model processing, TTS, and telephony round-trips determine whether callers experience a fluid interaction or frustrating waits. Of these layers, model choice is the one the developer controls most directly, which makes model selection the first critical architectural decision, not the last.
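The latency budget above can be sketched as simple arithmetic. The per-stage figures below are illustrative assumptions for a short reply, not measured values; the function names are ours:

```python
# Rough latency budget check for one voice turn.
# All per-stage figures are illustrative assumptions, not measurements.

def total_latency(stages: dict) -> float:
    """Sum per-stage latencies (in seconds) for one voice turn."""
    return sum(stages.values())

def within_budget(stages: dict, target: float = 3.0) -> bool:
    """True if the end-to-end turn fits the 2-3 second perceived-wait target."""
    return total_latency(stages) <= target

# Hypothetical estimates for a short (100-200 token) reply:
stages = {
    "stt": 0.5,        # speech-to-text transcription
    "inference": 1.0,  # model generates the reply
    "tts": 0.7,        # text-to-speech synthesis
    "telephony": 0.4,  # network and carrier round-trips
}

print(f"{total_latency(stages):.1f}s within budget: {within_budget(stages)}")
```

The point of writing it down: every stage competes for the same 3-second budget, so a slower model directly crowds out the STT and TTS stages you control less.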
OpenClaw API Voice Assistant: WisGate Configuration and Model Selection
Step 1 — Open the configuration file
Open your terminal and edit your OpenClaw config:
nano ~/.openclaw/openclaw.json
Step 2 — Add the WisGate provider to your models section
Insert this JSON snippet within the models section. It configures WisGate as a custom provider and registers the Claude Haiku model, optimized for voice latency:
"models": {
"mode": "merge",
"providers": {
"moonshot": {
"baseUrl": "https://api.wisgate.ai/v1",
"apiKey": "WISGATE-API-KEY",
"api": "openai-completions",
"models": [
{
"id": "claude-haiku-4-5-20251001",
"name": "Claude Haiku 4.5",
"reasoning": false,
"input": ["text"],
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
},
"contextWindow": 256000,
"maxTokens": 8192
}
]
}
}
}
Replace WISGATE-API-KEY with your actual key from https://wisgate.ai/hall/tokens. The "mode": "merge" option adds WisGate models alongside your existing providers.
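Before wiring any telephony, you can smoke-test the configured model directly. A minimal sketch, assuming the openai-completions API maps to the standard /chat/completions path under the configured baseUrl; the helper name is ours, not part of OpenClaw:

```python
import json
import urllib.request

BASE_URL = "https://api.wisgate.ai/v1"   # matches baseUrl in openclaw.json
MODEL_ID = "claude-haiku-4-5-20251001"   # matches the registered model id

def build_chat_request(api_key: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    body = {
        "model": MODEL_ID,
        "max_tokens": 200,  # roughly 120 spoken words, per the voice budget
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Example (requires a real key and network access):
# req = build_chat_request("WISGATE-API-KEY", "[VOICE] What's on my calendar today?")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If this round-trip works, OpenClaw's config will hit the same endpoint with the same credentials.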
Step 3 — Save, exit, and restart OpenClaw
- Press Ctrl + O then Enter to save
- Press Ctrl + X to exit
- Stop OpenClaw: Ctrl + C
- Restart with:
openclaw tui
Upon restart, Claude Haiku is selectable. This sets the base model configuration; the telephony stack sits upstream, feeding transcribed text into OpenClaw.
Why Haiku is the Correct Model for Voice Interfaces
A voice assistant must respond quickly: the user expects audible feedback within approximately 2–3 seconds after finishing their speech. Total latency includes STT processing, model inference, and TTS playback.
Model inference speed is crucial since every additional second adds to perceived wait time. At typical voice output lengths (100–200 tokens), Claude Haiku's inference speed noticeably outperforms bigger models like Sonnet or Opus. This speed advantage leads to a smoother, more responsive experience.
Unlike text chat, where longer waits are tolerable, voice interactions demand low latency as a UX imperative. Confirm Claude Haiku model pricing and details at https://wisgate.ai/models.
For this agent, create a dedicated API key labeled openclaw-phone-agent at https://wisgate.ai/hall/tokens. Phones are a broad access point; isolated keys limit exposure if leaked.
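To keep that dedicated key out of code and version-controlled config, load it from the environment at startup. A minimal sketch; the WISGATE_API_KEY variable name is our convention, not a WisGate requirement:

```python
import os

def load_wisgate_key() -> str:
    """Read the dedicated phone-agent key from the environment; fail fast if missing."""
    key = os.environ.get("WISGATE_API_KEY")  # assumed variable name
    if not key:
        raise RuntimeError("WISGATE_API_KEY is not set")
    return key
```

Failing fast here is deliberate: a phone agent that silently starts without credentials just plays dead air to callers.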
LLM SMS Voice Automation: Twilio Webhook Configuration
Configure Twilio to relay voice and SMS inputs to your OpenClaw agent.
Voice call setup:
- Purchase a Twilio phone number via the Twilio console.
- Set the phone number's Voice webhook to your endpoint: POST https://your-server.com/voice-incoming.
- Use Twilio <Gather> with input="speech" to capture spoken input, forwarding transcripts to POST https://your-server.com/voice-transcript.
- Your server receives the transcript as SpeechResult in the POST body; pass this text to OpenClaw as the user message.
- Receive the agent's textual response and use Twilio's <Say> verb to convert it to speech and play it back.

SMS setup:
- Set the number's Messaging webhook to POST https://your-server.com/sms-incoming.
- Receive SMS messages as Body in the POST; relay the text to OpenClaw.
- Send the OpenClaw text response back via the Twilio SMS API.
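Each webhook responds with TwiML. A framework-agnostic sketch that builds the three responses with the standard library only; the server wiring is omitted and the helper names are illustrative:

```python
from xml.sax.saxutils import escape

def gather_twiml(prompt: str, action: str = "/voice-transcript", timeout: int = 4) -> str:
    """TwiML for /voice-incoming: prompt the caller, then capture speech."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Gather input="speech" action="{action}" method="POST" timeout="{timeout}">'
        f"<Say>{escape(prompt)}</Say>"
        "</Gather>"
        "</Response>"
    )

def say_twiml(agent_reply: str) -> str:
    """TwiML for /voice-transcript: speak the agent's text response."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(agent_reply)}</Say></Response>"
    )

def sms_twiml(agent_reply: str) -> str:
    """TwiML for /sms-incoming: reply to the sender by SMS."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Message>{escape(agent_reply)}</Message></Response>"
    )
```

Escaping the agent's text matters: a reply containing "&" or "<" would otherwise produce invalid XML and a failed call.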
Latency considerations:
Set Twilio’s <Gather> timeout to 3–5 seconds of silence to balance premature cutoffs against dead air. This timeout governs when the speech transcript is sent for processing.
The Voice Agent System Prompt and Token Budget
Below is a copy-ready template for the voice agent system prompt, designed to enforce voice-specific constraints and delineate intent classification:
You are a hands-free personal assistant reachable by phone and SMS.
RESPONSE FORMAT RULES — MANDATORY:
- Maximum response length: 120 words for voice; 200 words for SMS
- [VOICE] Voice responses must be speakable: no markdown, no bullet points, no URLs,
no code blocks — only natural spoken sentences
- [SMS] SMS responses may use line breaks but no markdown headers or code blocks
- Always lead with the answer — never with "Great question" or preamble
INTENT CLASSIFICATION:
Classify each request into one of the following and respond accordingly:
- CALENDAR: query or update [CALENDAR PROVIDER] calendar events
- JIRA: query ticket status or add a comment to [PROJECT KEY] tickets
- WEB_SEARCH: summarize the top result for the query
- GENERAL: answer from context without querying external systems
BACKEND ACCESS:
- Calendar: [insert calendar API integration instructions]
- Jira: [insert Jira API base URL and auth method]
- Web search: [insert search tool or API]
VOICE CONSTRAINTS:
- Never read out a URL aloud — say "I'll send that link by SMS"
- Never read out more than two items in a list — summarize and offer to SMS the full list
- If the request requires more than 120 words to answer correctly,
respond with the key point verbally and offer: "Want me to SMS you the details?"
CHANNEL DETECTION:
The user message will be prefixed with [VOICE] or [SMS].
Apply voice constraints only when [VOICE] is present.
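The [VOICE]/[SMS] prefix the prompt relies on must be added by your server before the message reaches the model. A minimal sketch; the function name is ours:

```python
def tag_channel(channel: str, text: str) -> str:
    """Prefix the user message with [VOICE] or [SMS] so the system prompt
    can apply the right response-format rules."""
    if channel not in ("VOICE", "SMS"):
        raise ValueError(f"unknown channel: {channel}")
    return f"[{channel}] {text.strip()}"
```

Call it with "VOICE" in the /voice-transcript handler and "SMS" in the /sms-incoming handler, so the same system prompt serves both paths.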
Token budget: The 120-word voice limit maps roughly to 180 output tokens. Combined with about 400 tokens in system prompt and around 50 tokens per user input, each voice interaction consumes approximately 600–650 tokens. Adjust max_tokens in your WisGate API calls accordingly.
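The per-interaction arithmetic above, written out with the text's approximate figures as constants:

```python
# Per-interaction token estimate for the voice path, using the
# approximate figures from the text: ~400 system prompt tokens,
# ~50 user input tokens, ~180 output tokens (~120 spoken words).
SYSTEM_PROMPT_TOKENS = 400
USER_INPUT_TOKENS = 50
VOICE_OUTPUT_TOKENS = 180

def voice_tokens_per_interaction() -> int:
    """Total tokens billed for one voice turn."""
    return SYSTEM_PROMPT_TOKENS + USER_INPUT_TOKENS + VOICE_OUTPUT_TOKENS

print(voice_tokens_per_interaction())  # 630, within the 600-650 estimate
```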
OpenClaw Use Cases: Cost Per 1,000 Voice and SMS Interactions
Estimate your token consumption and costs before deploying at scale.
| Channel | Input tokens | Output tokens | Total tokens |
|---|---|---|---|
| Voice | ~450 | ~180 | ~630 |
| SMS | ~450 | ~300 | ~750 |
Cost comparison at these token counts:

| Volume | Haiku (WisGate) | Sonnet (WisGate) | Saving vs. Sonnet |
|---|---|---|---|
| 1,000 interactions | Confirm pricing and calculate | Confirm pricing and calculate | Calculate difference |
| 10,000 interactions/month | Confirm pricing and calculate | Confirm pricing and calculate | Calculate difference |
At typical voice interaction rates, Haiku’s faster inference noticeably improves the user experience while reducing compute costs. Sonnet’s richer reasoning is overkill for brief voice responses capped at 120 words, and at 10,000 interactions/month the savings become a meaningful line item.
Confirm up-to-date figures at https://wisgate.ai/models.
OpenClaw Use Cases: Any Phone, Any Request, No Screen Required
Your telephony stack is now fully defined and the system prompt is ready for deployment. Validate intent classification and response formatting in WisGate's AI Studio before connecting any live number.
Deploy by creating a Twilio phone number and linking voice and SMS webhooks to your server. Activate OpenClaw configured with Claude Haiku. Test thoroughly with SMS-only mode for at least 24 hours to ensure reliable intent parsing and manageable response length before enabling the voice path.
The prerequisites are simple: a Twilio account and a WisGate API key.
Explore tokens and workflow in WisGate’s AI Studio: https://wisgate.ai/studio/image and manage your API keys securely at https://wisgate.ai/hall/tokens.
Start with SMS, then expand to full voice assistant automation with low-latency response — unlocking hands-free access to your AI on any phone.