Media & Voice

When someone sends your agent a photo, a voice note, a PDF, or a link, Comis processes it before the agent responds. Your agent can see images, read transcripts of voice messages, read documents, and understand web pages. How much of this happens automatically depends on which capabilities you configure. This page gives an overview of what Comis does with each type of media and how to configure each capability.

You don’t need to understand the technical details to use this feature. The configuration examples below are copy-paste ready.

What Comis Handles

Your agent can work with five types of media. For each type, the table below lists what Comis processes automatically, which on-demand tool the agent can call, and which page covers the details.

| Media Type | What Happens Automatically | On-Demand Tool | Details |
|---|---|---|---|
| Images | Your agent sees the image directly (when vision is enabled) | `image_analyze` | Vision |
| Voice Messages | Audio is transcribed to text so your agent can read it | `transcribe_audio`, `tts_synthesize` | Voice |
| Documents | Text is extracted from PDFs, CSVs, code files, and more | `extract_document` | Documents |
| Links | URLs in messages are fetched and the content is summarized | None | Links |
| Videos | Video content is described using vision AI | `describe_video` | Vision |

Automatic processing happens before your agent even starts thinking. On-demand tools let your agent actively request analysis when it decides to.

How Media Processing Works

Understanding the flow helps you see where configuration matters. Here is what happens step by step when someone sends a message with media attached:

1. Someone sends a message with media. A user sends a photo, voice note, document, or link in any connected chat platform.

2. Comis detects the media type. The media preprocessor identifies what was sent: an image, audio file, video, document, or URL. Each type has its own processing pipeline optimized for that kind of content.

3. Automatic processing runs. Based on your configuration, Comis processes the media automatically. This might mean transcribing a voice message to text, sending an image to vision AI for analysis, extracting text from a PDF, or fetching the content of a linked web page.

4. Your agent receives the enriched message. The agent sees the original message text plus all of the processed media results. A voice message becomes readable text. An image comes with a description. A document’s content is available inline. The agent has everything it needs to respond intelligently.

5. Agent can do more on demand. If the automatic processing was not enough, your agent can use on-demand tools to analyze further. For example, it can ask specific questions about an image, re-analyze a document with different settings, or transcribe audio that was not automatically processed.
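The detect, process, and enrich steps described here can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Attachment` and `EnrichedMessage` types, the `detect_media_type` and `preprocess` functions, and the MIME-based classification are all hypothetical and are not Comis's actual implementation. Link handling is left out for brevity.

```python
from dataclasses import dataclass, field


# Hypothetical types for illustration; Comis's internal names are not documented here.
@dataclass
class Attachment:
    filename: str
    mime_type: str


@dataclass
class EnrichedMessage:
    text: str
    media_results: list = field(default_factory=list)


def detect_media_type(attachment: Attachment) -> str:
    """Classify an attachment by the top-level part of its MIME type."""
    top_level = attachment.mime_type.split("/", 1)[0]
    if top_level in ("image", "audio", "video"):
        return top_level
    # Everything else (PDF, CSV, code files, ...) is treated as a document.
    return "document"


def preprocess(text: str, attachments: list, processors: dict) -> EnrichedMessage:
    """Run the configured processor for each attachment and collect the results."""
    message = EnrichedMessage(text=text)
    for attachment in attachments:
        processor = processors.get(detect_media_type(attachment))
        if processor is not None:  # types without a processor are skipped
            message.media_results.append(processor(attachment))
    return message


# Usage: only an audio processor is "configured" in this example.
processors = {"audio": lambda a: f"[transcript of {a.filename}]"}
msg = preprocess("listen to this", [Attachment("note.ogg", "audio/ogg")], processors)
print(msg.media_results)
```

The agent in this sketch would receive `msg.text` together with `msg.media_results`, which mirrors the "enriched message" idea: the original text plus whatever the automatic processing produced.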

What Happens Without Configuration

Even if you have not configured any media providers, Comis still handles media gracefully. When media arrives without a configured provider, your agent does not crash or ignore the message. Instead, it receives a helpful hint about what was sent so it can still respond appropriately. Here is what your agent sees for each media type when no provider is configured:

Voice messages: Your agent receives a hint like “Voice message attached — use transcribe_audio tool to listen” so it knows a voice message was sent. If the on-demand transcription tool is available, the agent can still process the audio by calling it explicitly.

Images: Your agent sees a hint that includes the image URL. If a vision provider is available for on-demand analysis, the agent can use the `image_analyze` tool to understand the image contents. Without any vision provider at all, the agent still knows an image was shared and can acknowledge it.

Documents: Your agent sees a hint about the attached file, including the filename, MIME type, and file size. The agent can use `extract_document` to read the content on demand if the extraction tool is available.

Videos: Similar to images, your agent sees a hint about the video attachment. If Google Gemini is configured, the agent can use the `describe_video` tool to get a text description of the video content.

Links: When link understanding is disabled (the default), URLs in messages are passed through as plain text. Your agent can still see and reference the URLs, but Comis does not automatically fetch and summarize the linked content.

This means you can start using Comis without configuring every media provider up front. Add providers later as you need more automatic processing.

Your agent always knows when media was sent, even without full media configuration.
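As a rough illustration of this fallback behavior, the Python sketch below builds a hint for an unprocessed attachment and names the on-demand tool the agent could try. The tool names come from this page; the `fallback_hint` function and the hint wording are hypothetical, and the hints Comis actually emits will read differently.

```python
# Tool names come from this page; the hint wording and this function are hypothetical.
FALLBACK_TOOLS = {
    "voice": "transcribe_audio",
    "image": "image_analyze",
    "document": "extract_document",
    "video": "describe_video",
}


def fallback_hint(media_type: str, detail: str = "") -> str:
    """Describe an unprocessed attachment and name the tool that could handle it."""
    if media_type == "link":
        # With link understanding disabled, the URL passes through as plain text.
        return detail
    tool = FALLBACK_TOOLS[media_type]
    suffix = f" ({detail})" if detail else ""
    return f"{media_type.capitalize()} attached{suffix}. Use the {tool} tool to process it."


print(fallback_hint("document", "report.pdf, application/pdf, 48213 bytes"))
print(fallback_hint("link", "https://example.com/article"))
```

The point of the sketch is that every media type yields some text for the agent, so a message with media is never silently dropped.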

Channel × Media Capability Matrix

Not every channel can deliver every media type — that is a platform limitation, not a Comis decision. The table below shows what each connected channel can actually receive and send.

| Channel | Images in | Voice in (audio attachment) | Voice out (TTS reply) | Video in | Documents in | Link previews |
|---|---|---|---|---|---|---|
| Telegram | Yes | Yes | Yes (opus) | Yes | Yes | Yes |
| Discord | Yes | Yes | Yes (mp3) | Yes | Yes | Yes |
| Slack | Yes | Yes | Yes (mp3) | Yes | Yes | Yes |
| WhatsApp | Yes | Yes | Yes (mp3) | Yes | Yes | No |
| Signal | Yes | Yes | Yes | Yes | Yes | No |
| LINE | Yes | Yes | Yes | Yes | No | No |
| iMessage | Yes | Yes | Yes | Yes | Yes | No |
| IRC | No | No | No | No | No | No |
| Email | Yes (inline/attachment) | Yes (audio attachment) | Yes (audio attachment) | Yes (attachment) | Yes (full MIME) | No |

The “voice out” column reflects whether the channel adapter can deliver an audio reply when auto-TTS is enabled. The default audio format per channel is documented in Voice → Per-Channel Audio Formats.
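If you want to query the matrix programmatically, for example to decide which channels to advertise a feature on, it can be transcribed into a small lookup. The Python sketch below is illustrative only: the `CHANNELS` data mirrors the matrix on this page, while the column keys and the `supports` and `channels_with` helpers are hypothetical names, not part of Comis.

```python
# Transcribed from the matrix above; the structure and helpers are illustrative only.
COLUMNS = ("images_in", "voice_in", "voice_out", "video_in", "documents_in", "link_previews")

CHANNELS = {
    "telegram": (True, True, True, True, True, True),
    "discord": (True, True, True, True, True, True),
    "slack": (True, True, True, True, True, True),
    "whatsapp": (True, True, True, True, True, False),
    "signal": (True, True, True, True, True, False),
    "line": (True, True, True, True, False, False),
    "imessage": (True, True, True, True, True, False),
    "irc": (False, False, False, False, False, False),
    "email": (True, True, True, True, True, False),
}


def supports(channel: str, capability: str) -> bool:
    """Return True if the channel's row in the matrix says Yes for the capability."""
    return CHANNELS[channel.lower()][COLUMNS.index(capability)]


def channels_with(capability: str) -> list:
    """List the channels that support a capability, in table order."""
    return [name for name in CHANNELS if supports(name, capability)]


print(channels_with("link_previews"))
print(supports("LINE", "documents_in"))
```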

Cross-Platform Support

Media processing works across all connected chat platforms. The same configuration applies everywhere — you set up vision once, and it works whether someone sends a photo in Telegram, Discord, Slack, or any other connected channel. However, each platform handles media attachments differently behind the scenes. Comis abstracts these differences so you do not need to worry about them:

Telegram sends voice messages as .ogg audio files
Discord uses various audio formats depending on the client
WhatsApp sends voice notes in a specific format that Comis converts automatically
Slack hosts files on its own CDN with authentication tokens

Your configuration stays the same regardless of which platform the media comes from. Comis handles the platform-specific details internally so your agent gets consistent results no matter where the message originated.

Some media features are only available on certain platforms. For example, rich messages with buttons currently render on Discord, Telegram, and Slack; LINE and WhatsApp button rendering is not yet wired through sendMessage, and Signal, iMessage, IRC, and Email do not support buttons. See each capability page for platform-specific details.

Configuration Overview

All media settings live under integrations.media in your config.yaml. Each media capability has its own section with sensible defaults, so you only need to configure the features you actually want to use. Here is a minimal example showing the main toggles:

```yaml
# config.yaml
integrations:
  media:
    vision:
      enabled: true
    transcription:
      provider: openai
    tts:
      provider: openai
      autoMode: off
    linkUnderstanding:
      enabled: false
    documentExtraction:
      enabled: true
```

Each section in the configuration controls one capability:

vision — Image and video analysis (see Vision)
transcription — Speech-to-text for voice messages (see Voice)
tts — Text-to-speech for voice replies (see Voice)
linkUnderstanding — Automatic URL content fetching (see Links)
documentExtraction — File text extraction (see Documents)

Each capability page below documents the full set of configuration options for its section.
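To make the nesting under `integrations.media` concrete, the Python sketch below writes the example configuration as the nested dictionary a YAML parser would typically produce and reads individual settings by dotted path. This is not how Comis loads its configuration; `get_setting` is a hypothetical helper, and representing `autoMode` as the string `"off"` is an assumption about how the value is parsed.

```python
# The example config from this page as a nested dict.
# Assumption: autoMode is kept as the string "off".
config = {
    "integrations": {
        "media": {
            "vision": {"enabled": True},
            "transcription": {"provider": "openai"},
            "tts": {"provider": "openai", "autoMode": "off"},
            "linkUnderstanding": {"enabled": False},
            "documentExtraction": {"enabled": True},
        }
    }
}


def get_setting(cfg: dict, dotted_path: str, default=None):
    """Walk a dotted path such as 'integrations.media.vision.enabled'."""
    node = cfg
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default  # missing keys fall back to the caller's default
        node = node[key]
    return node


print(get_setting(config, "integrations.media.vision.enabled"))
print(get_setting(config, "integrations.media.transcription.provider"))
```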

Explore Capabilities

Vision

How your agent sees and understands images and videos

Voice

Speech-to-text transcription and text-to-speech auto-reply

Documents

PDF, CSV, code files, and more

Links

Automatic URL content understanding

Rich Messages

Buttons, cards, embeds, and polls across platforms

Related

Agent Tools: Media

See all media tool parameters and usage

Configuration Reference

Full config.yaml reference