You don’t need to understand the technical details to use this feature. The configuration examples below are copy-paste ready.
How Voice Works
User sends a voice message
A user records a voice note in Telegram, Discord, WhatsApp, or any connected
platform.
Comis transcribes the audio
The voice message is automatically converted to text using your configured
speech-to-text provider. The agent sees the written transcript.
Agent reads and responds
Your agent processes the transcript like any text message and writes a reply.
Voice also helps with mention detection. In group chats where the bot only responds
when mentioned, Comis can pre-transcribe voice messages to check if the user said
the bot’s name.
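The mention check described above can be pictured as a whole-word search over the transcript. The sketch below is only an illustration of that idea, not Comis's actual implementation; the function name `mentions_bot` and the list of bot names are hypothetical.

```python
import re


def mentions_bot(transcript: str, bot_names: list[str]) -> bool:
    """Return True if any bot name appears as a whole word in the transcript."""
    for name in bot_names:
        # \b keeps a name like "Ava" from matching inside "lava";
        # IGNORECASE lets "comis" match "Comis".
        if re.search(rf"\b{re.escape(name)}\b", transcript, re.IGNORECASE):
            return True
    return False


print(mentions_bot("Hey Comis, what's the weather?", ["comis"]))
print(mentions_bot("Anyone up for lunch?", ["comis"]))
```

Speech-to-text output can misspell names, so a real check would likely need aliases or fuzzy matching on top of this.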
Transcription (Speech-to-Text)
Comis supports three speech-to-text providers. You only need one configured, but you can set up multiple for automatic failover. Each provider connects to a different speech recognition service, so the best choice depends on your needs and which API keys you already have.
| Provider | Default Model | API Key | Special Features |
|---|---|---|---|
| OpenAI | gpt-4o-mini-transcribe | OPENAI_API_KEY | Default provider, fast |
| Groq | whisper-large-v3-turbo | GROQ_API_KEY | Returns language detection and duration |
| Deepgram | nova-3 | DEEPGRAM_API_KEY | High accuracy, real-time streaming |
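The automatic failover mentioned above amounts to trying providers in order until one succeeds. The following is a minimal sketch under that assumption; the `Transcriber` type, the `transcribe_with_failover` helper, and the stub providers are all hypothetical and do not call any real API.

```python
from typing import Callable, Sequence

# Hypothetical provider type: takes raw audio bytes, returns a transcript.
Transcriber = Callable[[bytes], str]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def transcribe_with_failover(
    audio: bytes,
    providers: Sequence[tuple[str, Transcriber]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, transcript)."""
    errors: list[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # simplification: treat any error as a failure
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real services.
def flaky_openai(audio: bytes) -> str:
    raise TimeoutError("simulated timeout")


def fake_groq(audio: bytes) -> str:
    return "hello from the fallback"


chain = [("openai", flaky_openai), ("groq", fake_groq)]
used, text = transcribe_with_failover(b"\x00\x01", chain)
print(used, text)
```

Catching every exception is a shortcut for the sketch; a production chain would probably distinguish retryable failures (timeouts, rate limits) from permanent ones such as an invalid API key.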
OpenAI
Default transcription provider. Uses the gpt-4o-mini-transcribe model. Reliable
and fast for most use cases. Requires an OpenAI API key. Does not return
language or duration metadata in the response.

This is the provider Comis uses out of the box if you have an OpenAI API key
configured. No additional setup beyond the API key is needed.
Groq
Uses Whisper Large V3 Turbo. Returns rich metadata including the detected
language of the audio and total duration. Requires a Groq API key. A good
alternative if you want automatic language detection.

The language detection feature is especially useful for multilingual communities
where voice messages arrive in different languages.
Deepgram
Uses the Nova 3 model. Known for high accuracy and fast processing speeds.
Requires a Deepgram API key. Internally uses a different API format (raw binary
instead of form data), but this is handled automatically, so you only need to
provide the API key.

Deepgram is a good choice when transcription accuracy is your top priority,
especially for noisy audio or accented speech.
Fallback Chains
If your primary provider fails (API error, timeout, rate limit), Comis tries the next provider in your fallback list. This keeps voice transcription working even when a single provider has issues.
Text-to-Speech (TTS)
Comis can convert your agent’s text replies into spoken audio using three TTS providers. Like transcription, you only need one provider configured. The right choice depends on voice quality requirements, budget, and whether you need an API key.
| Provider | Default Model | Default Voice | API Key | Cost |
|---|---|---|---|---|
| OpenAI | tts-1 | alloy | OPENAI_API_KEY | Paid |
| ElevenLabs | eleven_multilingual_v2 | Rachel | ELEVENLABS_API_KEY | Paid |
| Edge | — | en-US-AvaMultilingualNeural | None needed | Free |
OpenAI
Default TTS provider. Uses the tts-1 model with the “alloy” voice. Supports
speed adjustment from 0.25x to 4.0x. Requires an OpenAI API key. A good
balance of quality and speed.

OpenAI offers several voice options beyond “alloy.” You can change the voice
in your configuration or per-reply using TTS directives.
ElevenLabs
Premium voice synthesis using the eleven_multilingual_v2 model. Default voice
is “Rachel.” Offers advanced voice settings: stability, similarity boost, style,
and speaker boost. Requires an ElevenLabs API key. Best voice quality for
human-like speech.

ElevenLabs voices sound the most natural and expressive. The advanced settings
let you fine-tune how the voice sounds. For example, increasing stability
makes the voice more consistent, while boosting style adds more expressiveness.
Edge TTS
Microsoft Edge text-to-speech. Free, no API key required. Default voice is
en-US-AvaMultilingualNeural. A good option for testing or budget-conscious
setups. Limited customization compared to OpenAI and ElevenLabs.

Since Edge TTS is free and requires no API key, it is the easiest way to try
out voice replies. You can always switch to OpenAI or ElevenLabs later for
higher quality.
Auto-Reply Modes
The autoMode setting controls when Comis automatically converts text replies
into voice messages.
| Mode | Behavior |
|---|---|
| "off" (default) | TTS only when the agent explicitly uses the tts_synthesize tool |
| "always" | Every text reply is automatically converted to a voice message |
| "inbound" | Reply with voice only when the user sent a voice message first |
| "tagged" | Reply with voice only when the agent’s response contains a [[tts]] directive |
Off
The default. Your agent’s text replies are sent as text. The agent can still generate voice on demand using the tts_synthesize tool, but nothing is automatic.
Always
Every text reply from the agent is also sent as a voice message. Useful for accessibility or when users prefer listening. Replies that already contain media attachments skip TTS to avoid sending double media.
Inbound
The agent mirrors the user. If the user sends a voice message, the reply is a voice message. If the user sends text, the reply is text. This creates natural conversational behavior where the agent adapts to how the user is communicating.
Tagged
The agent controls when to use voice. When the LLM includes [[tts]] in its
output, that reply is converted to speech. The tag is stripped from the final
text so the user never sees it.
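The four modes reduce to a small decision function. The sketch below is one way to express the rules described above; `plan_reply` and its parameters are hypothetical names, and the media-attachment skip is applied only in "always" mode because that is the only place the text states it.

```python
import re

# Matches a bare [[tts]] tag as well as the [[tts:...]] directive form.
TTS_TAG = re.compile(r"\[\[tts(?::[^\]]*)?\]\]")


def plan_reply(
    auto_mode: str,
    inbound_was_voice: bool,
    reply_text: str,
    reply_has_media: bool = False,
) -> tuple[bool, str]:
    """Return (send_as_voice, text_shown_to_user) for one agent reply."""
    if auto_mode == "off":
        return False, reply_text
    if auto_mode == "always":
        # Replies that already carry media skip TTS.
        return (not reply_has_media), reply_text
    if auto_mode == "inbound":
        return inbound_was_voice, reply_text
    if auto_mode == "tagged":
        tagged = TTS_TAG.search(reply_text) is not None
        # The tag is stripped so the user never sees it.
        return tagged, TTS_TAG.sub("", reply_text).strip()
    raise ValueError(f"unknown autoMode: {auto_mode!r}")


print(plan_reply("tagged", False, "[[tts]] Good morning!"))
print(plan_reply("inbound", True, "Sure, one moment."))
```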
TTS Directives
In tagged mode, the agent can include directives to fine-tune how the voice reply sounds. The [[tts:...]] syntax allows overriding the voice, provider,
format, and speed for individual replies.
| Parameter | Description |
|---|---|
| voice | Voice identifier (provider-specific) |
| provider | TTS provider to use for this reply |
| format | Audio format (opus, mp3, etc.) |
| speed | Playback speed multiplier |
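To make the override idea concrete, here is a rough parser sketch. This page does not spell out what goes after the colon in a directive, so the space-separated `key=value` form used below is purely an assumption for illustration; check the real syntax before relying on it. The function name `parse_tts_directive` is likewise hypothetical. Only the four parameters from the table are accepted.

```python
import re

ALLOWED_KEYS = {"voice", "provider", "format", "speed"}

# ASSUMPTION: parameters are written as space-separated key=value pairs,
# e.g. [[tts:voice=alloy speed=1.25]]. The real syntax may differ.
DIRECTIVE = re.compile(r"\[\[tts:([^\]]*)\]\]")


def parse_tts_directive(reply: str) -> tuple[dict[str, object], str]:
    """Return (overrides, reply_without_directive)."""
    match = DIRECTIVE.search(reply)
    if match is None:
        return {}, reply
    overrides: dict[str, object] = {}
    for pair in match.group(1).split():
        key, sep, value = pair.partition("=")
        if sep and value and key in ALLOWED_KEYS:
            # speed is a numeric multiplier; the other values stay strings.
            overrides[key] = float(value) if key == "speed" else value
    cleaned = (reply[: match.start()] + reply[match.end():]).strip()
    return overrides, cleaned


print(parse_tts_directive("[[tts:voice=alloy speed=1.25 format=opus]] Here you go."))
```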
Per-Channel Audio Formats
Different platforms work best with different audio formats. Comis automatically converts audio to the best format for each platform so you do not have to worry about compatibility. You can override these defaults in your configuration if needed.
| Platform | Default Format | Why |
|---|---|---|
| Telegram | opus | Native voice message format for Telegram |
| Discord | mp3 | Widely supported across Discord clients |
| WhatsApp | mp3 | Compatible with WhatsApp voice playback |
| Slack | mp3 | Standard audio format for Slack file uploads |
| Other | mp3 | Safe default for any platform |
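The table behaves like a lookup with a default. A minimal sketch follows, assuming platforms are identified by lowercase names and that overrides arrive as a plain mapping; both of those details, along with the `audio_format_for` helper itself, are hypothetical rather than Comis's actual configuration shape.

```python
from typing import Mapping, Optional

# Defaults taken from the table above.
DEFAULT_FORMATS = {
    "telegram": "opus",
    "discord": "mp3",
    "whatsapp": "mp3",
    "slack": "mp3",
}
FALLBACK_FORMAT = "mp3"  # the "Other" row


def audio_format_for(
    platform: str,
    overrides: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the audio format for a platform, honouring user overrides first."""
    key = platform.strip().lower()
    if overrides and key in overrides:
        return overrides[key]
    return DEFAULT_FORMATS.get(key, FALLBACK_FORMAT)


print(audio_format_for("Telegram"))
print(audio_format_for("discord", {"discord": "opus"}))
```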
Configuration
Full configuration reference for both transcription and text-to-speech:
What Happens Without a Provider
Voice features degrade gracefully. You do not need every provider configured for the system to work.

Without a transcription provider, voice messages are not auto-transcribed. Instead, your agent receives a hint: “Voice message attached — use transcribe_audio tool to listen.” The agent can still process voice messages using the transcribe_audio on-demand tool if a provider key is available to the tool.

Without a TTS provider, auto-reply modes have no effect. Text replies are sent as text regardless of the autoMode setting. The agent can still use tts_synthesize if the tool and a provider key are available.

This means you can start with just transcription, add TTS later, or use neither
and rely entirely on the on-demand tools.
Walkthrough: voice → transcript → reply → TTS
Here is the full round-trip for a WhatsApp voice note when transcription (OpenAI) and auto-TTS (autoMode: inbound) are both configured.
User records a voice note
A user holds the microphone in WhatsApp and says: “Can you check whether
we have any meetings tomorrow before 10am?” — sends the audio.
Comis downloads and transcribes
The WhatsApp adapter pulls the audio, decrypts it via Baileys, and hands
it to the transcription factory. OpenAI’s gpt-4o-mini-transcribe returns:
“Can you check whether we have any meetings tomorrow before 10am?”
The agent reads the transcript
The agent receives the text plus a metadata flag indicating the source
was a voice message. With autoMode: inbound, that flag tells Comis to
reply by voice as well.
The agent calls calendar tools and writes a reply
After looking up the calendar via an MCP tool, the agent writes:
“You have one meeting before 10am tomorrow — the standup at 9:30. Nothing
else booked until 11.”
Comis synthesises the reply
Because auto-TTS fired, the OpenAI TTS provider renders the reply with
the alloy voice into an mp3. The WhatsApp adapter sends it as an
audioMessage so it appears as a native voice note on the user’s phone.

The inbound setting only triggers voice when the user spoke first, so
written messages always get written replies.
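The round trip above can be condensed into a single handler. This is a sketch under stated assumptions, not Comis's code: the `Inbound` and `Outbound` types, the `handle` function, and the three stub services are hypothetical, and only the "inbound" auto-reply mode is modelled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Inbound:
    platform: str
    text: Optional[str] = None
    audio: Optional[bytes] = None  # set when the user sent a voice note


@dataclass
class Outbound:
    text: str
    audio: Optional[bytes] = None
    audio_format: Optional[str] = None


def handle(
    msg: Inbound,
    transcribe: Callable[[bytes], str],
    agent: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
) -> Outbound:
    """autoMode "inbound": voice in gives voice out, text in gives text out."""
    was_voice = msg.audio is not None
    prompt = transcribe(msg.audio) if was_voice else (msg.text or "")
    reply = agent(prompt)
    if not was_voice:
        return Outbound(text=reply)
    fmt = "opus" if msg.platform == "telegram" else "mp3"
    return Outbound(text=reply, audio=synthesize(reply, fmt), audio_format=fmt)


# Stand-ins for the real transcription, agent, and TTS services.
def fake_transcribe(audio: bytes) -> str:
    return "Can you check whether we have any meetings tomorrow before 10am?"


def fake_agent(prompt: str) -> str:
    return "You have one meeting before 10am tomorrow: the standup at 9:30."


def fake_synthesize(text: str, fmt: str) -> bytes:
    return f"<{fmt} audio for {len(text)} chars>".encode()


voice_reply = handle(Inbound("whatsapp", audio=b"..."), fake_transcribe, fake_agent, fake_synthesize)
text_reply = handle(Inbound("whatsapp", text="Any meetings tomorrow?"), fake_transcribe, fake_agent, fake_synthesize)
print(voice_reply.audio_format, text_reply.audio)
```

Passing the services in as callables keeps the handler independent of any particular provider, which mirrors how the walkthrough treats transcription and synthesis as swappable steps.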
Related
Vision
Image and video analysis
Documents
Document text extraction
Media Tools
transcribe_audio and tts_synthesize tool reference
Media Overview
Back to media overview
