AI Phone Assistant Automation: Hands-Free Access to Your Agent from Any Phone
Imagine you're driving and need to confirm whether the critical Jira ticket blocking your release has been updated, add a calendar event for a follow-up call, and get a quick summary of the latest news on a key dependency. Your laptop is packed away, but your phone is connected to your car's system.
A phone-based OpenClaw agent handles all these tasks with ease: just call the agent’s number, speak your request, and hear a response in under three seconds. No app, no screen—totally hands-free. When voice isn't practical, such as in noisy settings or meetings, the same assistant is reachable via SMS, providing uninterrupted access.
This setup is among the most infrastructure-intensive OpenClaw use cases in the productivity space due to telephony's complexity versus straightforward chat agents. The reward is a personal AI assistant accessible from any phone, no smartphone required.
By following this guide, you'll build a working phone-based voice and SMS agent that classifies intents, queries backends like calendar or Jira, and returns fast, natural responses. Test intent classification and response generation safely in WisGate's AI Studio (https://wisgate.ai/studio/image) before wiring any telephony infrastructure, and obtain your API key at https://wisgate.ai/hall/tokens.
What the Phone-Based Agent Stack Looks Like
Before diving into configuration, understand the full stack powering this phone-based agent. Voice calls add layers not present in standard text chats, each with latency implications.
| Layer | Component | Role |
|---|---|---|
| Telephony | Twilio Voice / SMS | Handles incoming calls and SMS, routes to your webhook |
| Speech-to-Text | Twilio STT or Deepgram | Converts caller's speech into a text transcript |
| Intent & Response | OpenClaw + WisGate (Haiku) | Classifies intent, interacts with backend, generates replies |
| Backend Integrations | Calendar API, Jira API, Web Search | Executes requests like calendar updates, ticket status, or searches |
| Text-to-Speech | Twilio TTS or ElevenLabs | Converts OpenClaw's text response back into spoken audio |
| SMS fallback | Twilio SMS | Sends text responses for SMS-based access |
Voice interfaces are uniquely sensitive to latency. The combined delays from STT, model processing, TTS, and telephony round-trips determine whether callers experience a fluid interaction or frustrating waits. Of these layers, model choice is the one the developer controls most directly, which makes model selection the first critical architectural decision, not the last.
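The latency budget above can be sketched as simple arithmetic. The per-stage figures below are illustrative assumptions for a short reply, not measured values; the function names are ours:

```python
# Rough latency budget check for one voice turn.
# All per-stage figures are illustrative assumptions, not measurements.

def total_latency(stages: dict) -> float:
    """Sum per-stage latencies (in seconds) for one voice turn."""
    return sum(stages.values())

def within_budget(stages: dict, target: float = 3.0) -> bool:
    """True if the end-to-end turn fits the 2-3 second perceived-wait target."""
    return total_latency(stages) <= target

# Hypothetical estimates for a short (100-200 token) reply:
stages = {
    "stt": 0.5,        # speech-to-text transcription
    "inference": 1.0,  # model generates the reply
    "tts": 0.7,        # text-to-speech synthesis
    "telephony": 0.4,  # network and carrier round-trips
}

print(f"{total_latency(stages):.1f}s within budget: {within_budget(stages)}")
```

The point of writing it down: every stage competes for the same 3-second budget, so a slower model directly crowds out the STT and TTS stages you control less.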
OpenClaw API Voice Assistant: WisGate Configuration and Model Selection
Step 1 — Open the configuration file
Open your terminal and edit your OpenClaw config:
nano ~/.openclaw/openclaw.json
Step 2 — Add the WisGate provider to your models section
Insert this JSON snippet within the models section. It configures WisGate as a custom provider and registers the Claude Haiku model, optimized for voice latency:
"models": {
"mode": "merge",
"providers": {
"moonshot": {
"baseUrl": "https://api.wisgate.ai/v1",
"apiKey": "WISGATE-API-KEY",
"api": "openai-completions",
"models": [
{
"id": "claude-haiku-4-5-20251001",
"name": "Claude Haiku 4.5",
"reasoning": false,
"input": ["text"],
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
},
"contextWindow": 256000,
"maxTokens": 8192
}
]
}
}
}
Replace WISGATE-API-KEY with your actual key from https://wisgate.ai/hall/tokens. The "mode": "merge" option adds WisGate models alongside your existing providers.
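Before wiring any telephony, you can smoke-test the configured model directly. A minimal sketch, assuming the openai-completions API maps to the standard /chat/completions path under the configured baseUrl; the helper name is ours, not part of OpenClaw:

```python
import json
import urllib.request

BASE_URL = "https://api.wisgate.ai/v1"   # matches baseUrl in openclaw.json
MODEL_ID = "claude-haiku-4-5-20251001"   # matches the registered model id

def build_chat_request(api_key: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    body = {
        "model": MODEL_ID,
        "max_tokens": 200,  # roughly 120 spoken words, per the voice budget
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Example (requires a real key and network access):
# req = build_chat_request("WISGATE-API-KEY", "[VOICE] What's on my calendar today?")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If this round-trip works, OpenClaw's config will hit the same endpoint with the same credentials.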
Step 3 — Save, exit, and restart OpenClaw
- Press Ctrl + O then Enter to save
- Press Ctrl + X to exit
- Stop OpenClaw: Ctrl + C
- Restart with:
openclaw tui
Upon restart, Claude Haiku is selectable. This sets the base model configuration; the telephony stack sits upstream, feeding transcribed text into OpenClaw.
Why Haiku is the Correct Model for Voice Interfaces
A voice assistant must respond quickly: the user expects audible feedback within approximately 2–3 seconds after finishing their speech. Total latency includes STT processing, model inference, and TTS playback.
Model inference speed is crucial since every additional second adds to perceived wait time. At typical voice output lengths (100–200 tokens), Claude Haiku's inference speed noticeably outperforms bigger models like Sonnet or Opus. This speed advantage leads to a smoother, more responsive experience.
Unlike text chat, where longer waits are tolerable, voice interactions demand low latency as a UX imperative. Confirm Claude Haiku model pricing and details at https://wisgate.ai/models.
For this agent, create a dedicated API key labeled openclaw-phone-agent at https://wisgate.ai/hall/tokens. Phones are a broad access point; isolated keys limit exposure if leaked.
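To keep that dedicated key out of code and version-controlled config, load it from the environment at startup. A minimal sketch; the WISGATE_API_KEY variable name is our convention, not a WisGate requirement:

```python
import os

def load_wisgate_key() -> str:
    """Read the dedicated phone-agent key from the environment; fail fast if missing."""
    key = os.environ.get("WISGATE_API_KEY")  # assumed variable name
    if not key:
        raise RuntimeError("WISGATE_API_KEY is not set")
    return key
```

Failing fast here is deliberate: a phone agent that silently starts without credentials just plays dead air to callers.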
LLM SMS Voice Automation: Twilio Webhook Configuration
Configure Twilio to relay voice and SMS inputs to your OpenClaw agent.
Voice call setup:
- Purchase a Twilio phone number via the Twilio console.
- Set the phone number's Voice webhook to your endpoint: POST https://your-server.com/voice-incoming.
- Use Twilio <Gather> with input="speech" to capture spoken input, forwarding transcripts to POST https://your-server.com/voice-transcript.
- Your server receives the transcript as SpeechResult in the POST body; pass this text to OpenClaw as the user message.
- Receive the agent's textual response and use Twilio's <Say> verb to convert it to speech and play it back.

SMS setup:
- Set the number's Messaging webhook to POST https://your-server.com/sms-incoming.
- Receive SMS messages as Body in the POST; relay the text to OpenClaw.
- Send the OpenClaw text response back via the Twilio SMS API.
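Each webhook responds with TwiML. A framework-agnostic sketch that builds the three responses with the standard library only; the server wiring is omitted and the helper names are illustrative:

```python
from xml.sax.saxutils import escape

def gather_twiml(prompt: str, action: str = "/voice-transcript", timeout: int = 4) -> str:
    """TwiML for /voice-incoming: prompt the caller, then capture speech."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Gather input="speech" action="{action}" method="POST" timeout="{timeout}">'
        f"<Say>{escape(prompt)}</Say>"
        "</Gather>"
        "</Response>"
    )

def say_twiml(agent_reply: str) -> str:
    """TwiML for /voice-transcript: speak the agent's text response."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(agent_reply)}</Say></Response>"
    )

def sms_twiml(agent_reply: str) -> str:
    """TwiML for /sms-incoming: reply to the sender by SMS."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Message>{escape(agent_reply)}</Message></Response>"
    )
```

Escaping the agent's text matters: a reply containing "&" or "<" would otherwise produce invalid XML and a failed call.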
Latency considerations:
Set Twilio’s <Gather> timeout to 3–5 seconds of silence to balance premature cutoffs against dead air. This timeout governs when the speech transcript is sent for processing.
The Voice Agent System Prompt and Token Budget
Below is a copy-ready template for the voice agent system prompt, designed to enforce voice-specific constraints and delineate intent classification:
You are a hands-free personal assistant reachable by phone and SMS.
RESPONSE FORMAT RULES — MANDATORY:
- Maximum response length: 120 words for voice; 200 words for SMS
- [VOICE] Voice responses must be speakable: no markdown, no bullet points, no URLs,
no code blocks — only natural spoken sentences
- [SMS] SMS responses may use line breaks but no markdown headers or code blocks
- Always lead with the answer — never with "Great question" or preamble
INTENT CLASSIFICATION:
Classify each request into one of the following and respond accordingly:
- CALENDAR: query or update [CALENDAR PROVIDER] calendar events
- JIRA: query ticket status or add a comment to [PROJECT KEY] tickets
- WEB_SEARCH: summarize the top result for the query
- GENERAL: answer from context without querying external systems
BACKEND ACCESS:
- Calendar: [insert calendar API integration instructions]
- Jira: [insert Jira API base URL and auth method]
- Web search: [insert search tool or API]
VOICE CONSTRAINTS:
- Never read out a URL aloud — say "I'll send that link by SMS"
- Never read out more than two items in a list — summarize and offer to SMS the full list
- If the request requires more than 120 words to answer correctly,
respond with the key point verbally and offer: "Want me to SMS you the details?"
CHANNEL DETECTION:
The user message will be prefixed with [VOICE] or [SMS].
Apply voice constraints only when [VOICE] is present.
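The [VOICE]/[SMS] prefix the prompt relies on must be added by your server before the message reaches the model. A minimal sketch; the function name is ours:

```python
def tag_channel(channel: str, text: str) -> str:
    """Prefix the user message with [VOICE] or [SMS] so the system prompt
    can apply the right response-format rules."""
    if channel not in ("VOICE", "SMS"):
        raise ValueError(f"unknown channel: {channel}")
    return f"[{channel}] {text.strip()}"
```

Call it with "VOICE" in the /voice-transcript handler and "SMS" in the /sms-incoming handler, so the same system prompt serves both paths.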
Token budget: The 120-word voice limit maps roughly to 180 output tokens. Combined with about 400 tokens in system prompt and around 50 tokens per user input, each voice interaction consumes approximately 600–650 tokens. Adjust max_tokens in your WisGate API calls accordingly.
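The per-interaction arithmetic above, written out with the text's approximate figures as constants:

```python
# Per-interaction token estimate for the voice path, using the
# approximate figures from the text: ~400 system prompt tokens,
# ~50 user input tokens, ~180 output tokens (~120 spoken words).
SYSTEM_PROMPT_TOKENS = 400
USER_INPUT_TOKENS = 50
VOICE_OUTPUT_TOKENS = 180

def voice_tokens_per_interaction() -> int:
    """Total tokens billed for one voice turn."""
    return SYSTEM_PROMPT_TOKENS + USER_INPUT_TOKENS + VOICE_OUTPUT_TOKENS

print(voice_tokens_per_interaction())  # 630, within the 600-650 estimate
```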
OpenClaw Use Cases: Cost Per 1,000 Voice and SMS Interactions
Estimate your token consumption and costs before deploying at scale.
| Channel | Input tokens | Output tokens | Total tokens |
|---|---|---|---|
| Voice | ~450 | ~180 | ~630 |
| SMS | ~450 | ~300 | ~750 |
Cost comparison at these token counts:

| Volume | Haiku (WisGate) | Sonnet (WisGate) | Saving vs. Sonnet |
|---|---|---|---|
| 1,000 interactions | Confirm pricing and calculate | Confirm pricing and calculate | Calculate difference |
| 10,000 interactions/month | Confirm pricing and calculate | Confirm pricing and calculate | Calculate difference |
At typical voice interaction rates, Haiku’s faster inference noticeably improves the user experience while reducing compute costs. Sonnet’s richer reasoning is overkill for brief voice responses capped at 120 words, and at 10,000 interactions/month the savings become a meaningful line item.
Confirm up-to-date figures at https://wisgate.ai/models.
OpenClaw Use Cases: Any Phone, Any Request, No Screen Required
Your telephony stack is now fully defined and the system prompt is ready for deployment. Validate intent classification and response formatting in WisGate's AI Studio before connecting any live number.
Deploy by creating a Twilio phone number and linking voice and SMS webhooks to your server. Activate OpenClaw configured with Claude Haiku. Test thoroughly with SMS-only mode for at least 24 hours to ensure reliable intent parsing and manageable response length before enabling the voice path.
The prerequisites are simple: a Twilio account and a WisGate API key.
Explore tokens and workflow in WisGate’s AI Studio: https://wisgate.ai/studio/image and manage your API keys securely at https://wisgate.ai/hall/tokens.
Start with SMS, then expand to full voice assistant automation with low-latency response — unlocking hands-free access to your AI on any phone.