You don’t need to understand the technical details to use this feature. The configuration examples below are copy-paste ready.
What Comis Handles
Your agent can work with five types of media. Some processing happens automatically before your agent even starts thinking, so the results are ready by the time your agent sees the message. Other capabilities are available as on-demand tools that your agent can call when it decides more analysis is needed.

| Media Type | What Happens Automatically | On-Demand Tool | Details |
|---|---|---|---|
| Images | Your agent sees the image directly (when vision is enabled) | image_analyze | Vision |
| Voice Messages | Audio is transcribed to text so your agent can read it | transcribe_audio, tts_synthesize | Voice |
| Documents | Text is extracted from PDFs, CSVs, code files, and more | extract_document | Documents |
| Links | URLs in messages are fetched and the content is summarized | — | Links |
| Videos | Video content is described using vision AI | describe_video | Vision |
Automatic processing happens before your agent even starts thinking. On-demand
tools let your agent actively request analysis when it decides to.
How Media Processing Works
Understanding the flow helps you see where configuration matters. Here is what happens step by step when someone sends a message with media attached:

1. Someone sends a message with media. A user sends a photo, voice note, document, or link in any connected chat platform.
2. Comis detects the media type. The media preprocessor identifies what was sent: an image, audio file, video, document, or URL. Each type has its own processing pipeline optimized for that kind of content.
3. Automatic processing runs. Based on your configuration, Comis processes the media automatically. This might mean transcribing a voice message to text, sending an image to vision AI for analysis, extracting text from a PDF, or fetching the content of a linked web page.
4. Your agent receives the enriched message. The agent sees the original message text plus all of the processed media results. A voice message becomes readable text. An image comes with a description. A document’s content is available inline. The agent has everything it needs to respond intelligently.
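The steps above can be pictured as a small detect-and-dispatch loop. The following is only an illustrative sketch, not Comis's actual implementation: the `Attachment` and `Message` classes, the `PROCESSORS` table, and the placeholder output strings are all hypothetical, and real processing would call a transcription, vision, or extraction provider instead of returning a stub.

```python
import re
from dataclasses import dataclass, field

# Hypothetical URL matcher; a real preprocessor may be stricter.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Attachment:
    mime_type: str
    filename: str = ""


@dataclass
class Message:
    text: str
    attachments: list[Attachment] = field(default_factory=list)


def detect_media_type(attachment: Attachment) -> str:
    """Map a MIME type to a media category; anything else counts as a document."""
    major = attachment.mime_type.split("/", 1)[0]
    return major if major in ("image", "audio", "video") else "document"


# Stand-ins for the per-type pipelines (assumed names, stubbed results).
PROCESSORS = {
    "audio": lambda a: f"[transcript of {a.filename}]",
    "image": lambda a: f"[description of image {a.filename}]",
    "video": lambda a: f"[description of video {a.filename}]",
    "document": lambda a: f"[extracted text of {a.filename}]",
}


def enrich(message: Message) -> str:
    """Return the original text followed by one processed result per media item."""
    parts = [message.text]
    for attachment in message.attachments:
        parts.append(PROCESSORS[detect_media_type(attachment)](attachment))
    for url in URL_PATTERN.findall(message.text):
        parts.append(f"[summary of {url}]")
    return "\n".join(parts)
```

Calling `enrich` on a message with a voice note and a PDF would yield the original text followed by one stub line per attachment and per URL, which mirrors the "enriched message" the agent receives.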
What Happens Without Configuration
Even if you have not configured any media providers, Comis still handles media gracefully. When media arrives without a configured provider, your agent does not crash or ignore the message. Instead, it receives a helpful hint about what was sent so it can still respond appropriately. Here is what your agent sees for each media type when no provider is configured:

- Voice messages: Your agent receives a hint like “Voice message attached — use transcribe_audio tool to listen” so it knows a voice message was sent. If the on-demand transcription tool is available, the agent can still process the audio by calling it explicitly.
- Images: Your agent sees a hint that includes the image URL. If a vision provider is available for on-demand analysis, the agent can use the image_analyze tool to understand the image contents. Without any vision provider at all, the agent still knows an image was shared and can acknowledge it.
- Documents: Your agent sees a hint about the attached file, including the filename, MIME type, and file size. The agent can use extract_document to read the content on demand if the extraction tool is available.
- Videos: Similar to images, your agent sees a hint about the video attachment. If Google Gemini is configured, the agent can use the describe_video tool to get a text description of the video content.
- Links: When link understanding is disabled (the default), URLs in messages are passed through as plain text. Your agent can still see and reference the URLs, but Comis does not automatically fetch and summarize the linked content.

This means you can start using Comis without configuring every media provider up front. Add providers later as you need more automatic processing. Your agent always knows when media was sent, even without full media configuration.
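The fallback behavior described above amounts to "process if a provider exists, otherwise describe the attachment and name the tool". Here is a minimal sketch of that idea. The `Attachment` class, the `present` function, and the exact hint wording are hypothetical; only the tool names are taken from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attachment:
    kind: str        # "voice", "image", "document", or "video"
    filename: str
    mime_type: str
    size_bytes: int


# On-demand tool for each media kind, as named on this page.
ON_DEMAND_TOOL = {
    "voice": "transcribe_audio",
    "image": "image_analyze",
    "document": "extract_document",
    "video": "describe_video",
}


def present(
    attachment: Attachment,
    provider: Optional[Callable[[Attachment], str]] = None,
) -> str:
    """Return processed content when a provider is configured, else a hint."""
    if provider is not None:
        return provider(attachment)
    tool = ON_DEMAND_TOOL[attachment.kind]
    # Hypothetical hint format; the real wording may differ.
    return (
        f"[{attachment.kind} attached: {attachment.filename} "
        f"({attachment.mime_type}, {attachment.size_bytes} bytes). "
        f"Provider not configured; call {tool} to process it.]"
    )
```

Either way the agent gets something it can act on: real content when a provider is set up, or enough metadata to acknowledge the attachment and decide whether to call a tool.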
Channel × Media Capability Matrix
Not every channel can deliver every media type. That is a platform limitation, not a Comis decision. The table below shows what each connected channel can actually receive and send.

| Channel | Images in | Voice in (audio attachment) | Voice out (TTS reply) | Video in | Documents in | Link previews |
|---|---|---|---|---|---|---|
| Telegram | Yes | Yes | Yes (opus) | Yes | Yes | Yes |
| Discord | Yes | Yes | Yes (mp3) | Yes | Yes | Yes |
| Slack | Yes | Yes | Yes (mp3) | Yes | Yes | Yes |
| WhatsApp | Yes | Yes | Yes (mp3) | Yes | Yes | No |
| Signal | Yes | Yes | Yes | Yes | Yes | No |
| LINE | Yes | Yes | Yes | Yes | No | No |
| iMessage | Yes | Yes | Yes | Yes | Yes | No |
| IRC | No | No | No | No | No | No |
| Email | Yes (inline/attachment) | Yes (audio attachment) | Yes (audio attachment) | Yes (attachment) | Yes (full MIME) | No |
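
If you want to reason about the matrix programmatically, for example to decide whether a voice reply makes sense on a given channel, it can be encoded as a lookup table. This sketch copies three rows from the table as an illustration; the `CAPABILITIES` dictionary, its key names, and the helper functions are hypothetical and not part of Comis.

```python
# Three rows copied from the matrix above; key names are assumed for this sketch.
CAPABILITIES = {
    "Telegram": {
        "images_in": True, "voice_in": True, "voice_out": True,
        "video_in": True, "documents_in": True, "link_previews": True,
    },
    "LINE": {
        "images_in": True, "voice_in": True, "voice_out": True,
        "video_in": True, "documents_in": False, "link_previews": False,
    },
    "IRC": {
        "images_in": False, "voice_in": False, "voice_out": False,
        "video_in": False, "documents_in": False, "link_previews": False,
    },
}


def supports(channel: str, capability: str) -> bool:
    """Unknown channels or capabilities are treated as unsupported."""
    return CAPABILITIES.get(channel, {}).get(capability, False)


def deliverable(channel: str, wanted: list[str]) -> list[str]:
    """Filter a list of desired capabilities down to what the channel supports."""
    return [cap for cap in wanted if supports(channel, cap)]
```

Treating unknown entries as unsupported is a conservative design choice: a missing row never causes the caller to attempt something the platform cannot deliver.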
Cross-Platform Support
Media processing works across all connected chat platforms. The same configuration applies everywhere: you set up vision once, and it works whether someone sends a photo in Telegram, Discord, Slack, or any other connected channel. However, each platform handles media attachments differently behind the scenes. Comis abstracts these differences so you do not need to worry about them:

- Telegram sends voice messages as .ogg audio files
- Discord uses various audio formats depending on the client
- WhatsApp sends voice notes in a specific format that Comis converts automatically
- Slack hosts files on its own CDN with authentication tokens

Some media features are only available on certain platforms. For example, rich messages with buttons currently render on Discord, Telegram, and Slack; LINE and WhatsApp button rendering is not yet wired through sendMessage, and Signal, iMessage, IRC, and Email do not support buttons. See each capability page for platform-specific details.

Configuration Overview
All media settings live under integrations.media in your config.yaml. Each media capability has its own section with sensible defaults, so you only need to configure the features you actually want to use. The main sections are:

- vision: Image and video analysis (see Vision)
- transcription: Speech-to-text for voice messages (see Voice)
- tts: Text-to-speech for voice replies (see Voice)
- linkUnderstanding: Automatic URL content fetching (see Links)
- documentExtraction: File text extraction (see Documents)
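
To make the nesting concrete, here is a sketch of how a parsed config.yaml might look as a plain dictionary, with a helper that reports which media sections are present. Only the integrations.media path and the five section names come from this page; the `enabled` flag and the `configured_sections` helper are hypothetical, so check the Configuration Reference for the real keys.

```python
# Section names listed on this page, in the order given.
MEDIA_SECTIONS = (
    "vision",
    "transcription",
    "tts",
    "linkUnderstanding",
    "documentExtraction",
)

# Hypothetical parsed form of config.yaml. The "enabled" key is an assumption
# made for this sketch, not a documented setting.
config = {
    "integrations": {
        "media": {
            "vision": {"enabled": True},
            "transcription": {"enabled": True},
            # tts, linkUnderstanding, documentExtraction omitted: defaults apply
        }
    }
}


def configured_sections(cfg: dict) -> list[str]:
    """List the media sections explicitly present under integrations.media."""
    media = cfg.get("integrations", {}).get("media", {})
    return [name for name in MEDIA_SECTIONS if name in media]
```

Sections left out of the dictionary are simply not listed, which matches the idea that you only configure the features you want and leave the rest at their defaults.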
Explore Capabilities
- Vision: How your agent sees and understands images and videos
- Voice: Speech-to-text transcription and text-to-speech auto-reply
- Documents: PDF, CSV, code files, and more
- Links: Automatic URL content understanding
- Rich Messages: Buttons, cards, embeds, and polls across platforms
Related
- Agent Tools: Media. See all media tool parameters and usage
- Configuration Reference: Full config.yaml reference
