Media - Comis

What it does: Gives agents the ability to process images, audio, video, and documents: analyze, transcribe, synthesize, generate, and extract.

Who it’s for: Agents living in group chats where users share photos, voice notes, documents, and video clips. Instead of ignoring media content, your agent can understand and respond to it.

Quick Reference

Tool	What It Does	Provider Categories
`image_analyze`	Analyze images using AI vision (file, URL, base64, or attachment URL)	Vision (Anthropic, OpenAI, Google)
`image_generate`	Generate an image from a text prompt and deliver it to the current channel	Image gen (FAL, OpenAI DALL-E)
`tts_synthesize`	Generate spoken audio from text — returns a workspace file path	TTS (OpenAI, ElevenLabs, Edge)
`transcribe_audio`	Convert speech to text	STT (OpenAI Whisper, Groq, Deepgram)
`describe_video`	Extract key frames and describe video content using vision	Vision (frame-by-frame)
`extract_document`	Extract text from PDFs, CSVs, and other documents	None (local extraction)

For per-provider setup, model defaults, and configuration, see the Media & Voice section.

Tool Details

image_analyze -- Analyze images with AI vision

Use image_analyze to have your agent look at an image and describe what it sees, answer questions about the content, or extract information like text in a screenshot. The tool supports multiple ways to provide an image:

Parameter	Type	Required	Description
`action`	string	Yes	Always `"analyze"`
`source_type`	string	No	How the image is provided: `"file"`, `"url"`, or `"base64"` (required unless `attachment_url` is used)
`source`	string	No	The image data — a file path, URL, or base64-encoded string (required unless `attachment_url` is used)
`prompt`	string	No	A specific question about the image (e.g., “What text is visible in this screenshot?”). Defaults to a general description.
`attachment_url`	string	No	Platform attachment URL from a message hint (e.g., `tg-file://...`, `discord://...`). When provided, overrides `source_type`/`source`.
`mime_type`	string	No	MIME type for base64 input (auto-detected if omitted)

Source types explained:

file — A path to an image file on the server (e.g., /tmp/screenshot.png)
url — A publicly accessible image URL
base64 — Raw image data encoded as a base64 string

For images attached to chat messages, use the attachment_url parameter instead of source_type/source. The value comes from the message hint, and the tool resolves the platform-specific attachment URL automatically.

Example usage: When a user sends a photo in a chat, the agent can analyze it:

"Analyze this screenshot and tell me what application is shown."

The agent calls image_analyze with attachment_url from the message hint, then responds with a description of the image content.

Use the prompt parameter to ask specific questions about images. Without a prompt, the agent generates a general description. With a prompt like “What error message is shown?”, the agent focuses on extracting that specific information.
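The parameter rules above can be sketched as a small argument builder. This is only an illustration in Python, not part of Comis: the helper name `build_image_analyze_args` is hypothetical, and `tg-file://EXAMPLE` in the usage note is a placeholder rather than a real attachment URL. Only the parameter names and allowed values come from the table.

```python
from typing import Dict, Optional


def build_image_analyze_args(
    *,
    attachment_url: Optional[str] = None,
    source_type: Optional[str] = None,
    source: Optional[str] = None,
    prompt: Optional[str] = None,
    mime_type: Optional[str] = None,
) -> Dict[str, str]:
    """Build an image_analyze argument dict (hypothetical helper)."""
    args = {"action": "analyze"}  # action is always "analyze"
    if attachment_url is not None:
        # attachment_url overrides source_type/source, so they are left out.
        args["attachment_url"] = attachment_url
    else:
        if source_type not in ("file", "url", "base64"):
            raise ValueError("source_type must be 'file', 'url', or 'base64'")
        if not source:
            raise ValueError("source is required when attachment_url is not used")
        args["source_type"] = source_type
        args["source"] = source
        # mime_type only applies to base64 input; it is auto-detected if omitted.
        if mime_type is not None and source_type == "base64":
            args["mime_type"] = mime_type
    if prompt is not None:
        args["prompt"] = prompt
    return args
```

For a chat attachment, `build_image_analyze_args(attachment_url="tg-file://EXAMPLE", prompt="What error message is shown?")` yields a payload with only `action`, `attachment_url`, and `prompt`.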

tts_synthesize -- Generate spoken audio from text

Use tts_synthesize to convert text into spoken audio. The tool returns a workspace file path (under media/tts/) along with the MIME type and size. The agent must then send that file via the message tool’s attach action to deliver it to a chat — TTS itself does not auto-deliver.

Parameter	Type	Required	Description
`action`	string	Yes	Always `"synthesize"`
`text`	string	Yes	The text to speak
`voice`	string	No	Voice identifier (provider-specific, available voices depend on your configured provider)
`format`	string	No	Audio format: `"mp3"`, `"opus"`, or `"wav"` (default: resolved from config and channel; Telegram defaults to opus)

The TTS provider (OpenAI, ElevenLabs, or Edge TTS) is determined by your config.yaml; there is no per-call provider parameter.

Supported providers (configured in config.yaml):

OpenAI — High-quality voices with natural intonation. Requires an OpenAI API key.
ElevenLabs — Wide selection of realistic voices with emotion control. Requires an ElevenLabs API key.
Edge TTS — Free text-to-speech using Microsoft Edge’s built-in voices. No API key required.

Returns: { filePath, mimeType, sizeBytes }.

Example two-step usage:

Call tts_synthesize with the summary text and read filePath from the result.
Call the message tool’s attach action with attachment_url: file://{filePath} and attachment_type: audio to deliver the audio file.
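The two steps can be sketched as follows. This is a minimal illustration that assumes a generic `call_tool(name, args)` dispatcher, which is hypothetical and stands in for however your agent runtime invokes tools. The tool names, argument names, and the `filePath` result field come from this page.

```python
from typing import Any, Callable, Dict

# Hypothetical dispatcher type: takes a tool name and arguments, returns the result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def speak_and_deliver(call_tool: ToolCaller, text: str) -> Dict[str, Any]:
    """Synthesize speech, then attach the resulting file to the chat."""
    # Step 1: tts_synthesize only writes a workspace file; it does not deliver it.
    tts_result = call_tool("tts_synthesize", {"action": "synthesize", "text": text})
    file_path = tts_result["filePath"]

    # Step 2: deliver the file through the message tool's attach action.
    return call_tool(
        "message",
        {
            "action": "attach",
            "attachment_url": f"file://{file_path}",
            "attachment_type": "audio",
        },
    )
```

Keeping synthesis and delivery separate lets the agent decide whether to send the audio at all, for example after checking `sizeBytes`.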

transcribe_audio -- Convert speech to text

Use transcribe_audio to convert an audio or voice message into written text. This is useful for processing voice messages that users send in chat.

Parameter	Type	Required	Description
`attachment_url`	string	Yes	The URL of the audio file to transcribe (from a message hint, e.g., `tg-file://...`, `discord://...`)
`language`	string	No	BCP-47 language hint to improve transcription accuracy (e.g., `"en"`, `"he"`, `"es"`)

Supported providers:

OpenAI Whisper — High accuracy across many languages
Groq — Fast transcription with Whisper models
Deepgram — Real-time transcription with speaker detection

The provider used depends on your configuration. The tool returns the transcribed text, which the agent can then use in its response.

Supported audio formats: Most common audio formats are supported, including MP3, WAV, OGG, M4A, and WEBM. Voice messages from chat platforms (Discord, Telegram, WhatsApp) are automatically handled.

Example usage: When a user sends a voice message, the agent automatically transcribes it and responds to the spoken content. This works especially well in chat platforms where voice messages are common: the agent can participate in voice-message conversations by reading the transcribed text and responding in text or with a TTS audio reply.

describe_video -- Describe video content

Use describe_video to have your agent watch a video and describe what happens in it. The tool uses the vision pipeline to extract key frames from the video and analyze them.

Parameter	Type	Required	Description
`attachment_url`	string	Yes	The URL of the video to describe (e.g., `tg-file://...`, `discord://...`)
`prompt`	string	No	Custom analysis prompt to guide the description (defaults to a generic description prompt)

The tool extracts representative frames from the video, analyzes each frame with AI vision, and returns a coherent description of the video content. This is useful for understanding short clips, tutorials, or screen recordings shared in chat.

How it works:

The video is downloaded from the provided URL
Key frames are extracted at intervals throughout the video
Each frame is analyzed using AI vision
The individual frame descriptions are combined into a coherent narrative
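To make "extracted at intervals" concrete, here is one plausible sampling scheme. It is an assumption for illustration only: this page does not say how Comis picks frames, and the function name `frame_timestamps` is hypothetical. The sketch splits the video into equal segments and samples the midpoint of each.

```python
from typing import List


def frame_timestamps(duration_s: float, frame_count: int) -> List[float]:
    """Return evenly spaced sample times (seconds), one per equal segment."""
    if duration_s <= 0 or frame_count <= 0:
        return []
    segment = duration_s / frame_count
    # Sample the middle of each segment to avoid the very first and last frames,
    # which are often black or a title card.
    return [round(segment * (i + 0.5), 3) for i in range(frame_count)]
```

Each sampled frame becomes one vision call, so the frame count directly drives cost and latency.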

Example usage: When a user shares a video clip, the agent can describe what is happening:

"What's going on in this video?"

The agent watches the video and responds with a description like: “The video shows a step-by-step tutorial for configuring a Discord bot, starting with the developer portal and ending with the bot joining a server.”

extract_document -- Extract text from documents

Use extract_document to pull readable text from documents. This is useful for processing PDFs, spreadsheets, and other files that users share in chat.

Parameter	Type	Required	Description
`attachment_url`	string	Yes	The URL of the document to extract text from
`max_chars`	number	No	Maximum number of characters to extract from the document

Supported formats:

PDF — Extracts text from PDF documents
CSV — Reads comma-separated data
TXT — Plain text files
Other text-based formats

The tool returns the extracted text content, which the agent can then summarize, answer questions about, or process further.

Example usage:

"Read this PDF and give me a summary of the key points."

The agent extracts the text from the PDF, then uses that text to generate a summary. For spreadsheets (CSV), the data is returned in a structured format that the agent can analyze, filter, or summarize.

Document extraction works best with text-based PDFs. Scanned documents (images of text) may require the image_analyze tool instead, since the content is stored as images rather than selectable text.
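The fallback for scanned documents can be sketched as a simple check on the extracted text. Everything here beyond the tool and parameter names is an assumption: the helper name, the 20-character threshold, and the prompt text are illustrative, and the sketch assumes the scanned page is available as an image attachment that image_analyze can read.

```python
from typing import Any, Dict, Optional, Tuple


def plan_scanned_fallback(
    extracted_text: str,
    attachment_url: str,
    min_chars: int = 20,  # assumed threshold for "effectively empty"
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return a follow-up image_analyze call if extraction found almost no text."""
    if len(extracted_text.strip()) >= min_chars:
        return None  # extraction worked; no fallback needed
    return (
        "image_analyze",
        {
            "action": "analyze",
            "attachment_url": attachment_url,
            "prompt": "Transcribe all text visible in this document.",
        },
    )
```

A `None` result means the extracted text is usable as is; otherwise the tuple names the tool to call next and its arguments.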

image_generate -- Generate images from text prompts

Use image_generate to create images from text descriptions. The tool is available only when an image generation provider is configured (API key present). The generated image is delivered directly to the current channel via the daemon.

Parameter	Type	Required	Description
`prompt`	string	Yes	Text description of the image to generate
`size`	string	No	Provider-specific size. fal.ai uses presets (`square_hd`, `landscape_16_9`); OpenAI uses pixel dimensions (`1024x1024`, `1792x1024`). Omit for the provider default.

The daemon-side handler applies rate limiting, safety checking, and provider execution before delivering the generated image to the current channel.
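Because `size` values differ by provider, a caller that knows which provider is configured can map a generic shape to the right string. The sketch below is illustrative: the helper and the `shape` names are hypothetical, and `configured_provider` is the caller's own knowledge of config.yaml, not a tool parameter. The size strings are the ones listed in the table.

```python
from typing import Dict, Optional

# Size strings from the parameter table, keyed by an assumed provider label.
SIZE_PRESETS = {
    "fal": {"square": "square_hd", "landscape": "landscape_16_9"},
    "openai": {"square": "1024x1024", "landscape": "1792x1024"},
}


def build_image_generate_args(
    prompt: str,
    configured_provider: Optional[str] = None,
    shape: Optional[str] = None,
) -> Dict[str, str]:
    """Build image_generate arguments; omit size to get the provider default."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    args = {"prompt": prompt}
    if configured_provider is not None and shape is not None:
        args["size"] = SIZE_PRESETS[configured_provider][shape]
    return args
```

When the configured provider is unknown, leaving `size` out is the safer choice, since a preset meant for one provider may not be valid for the other.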

Common Workflows

Here are some typical ways agents use media tools together:

Voice message handling — When a user sends a voice note, the agent uses transcribe_audio to convert it to text, processes the request, and optionally uses tts_synthesize to reply with audio.
Document Q&A — A user shares a PDF report. The agent uses extract_document to read it, then answers questions about the content.
Image moderation — In a group chat, the agent uses image_analyze to check shared images for content that violates community guidelines.
Accessibility — The agent uses describe_video and image_analyze to provide text descriptions of visual content for users who need them.
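As an illustration of the voice message workflow, the sketch below chains transcription into a reply. It makes two assumptions not confirmed by this page: `call_tool(name, args)` is a hypothetical dispatcher, and the transcribe_audio result is treated as a plain string. The `answer` callback stands in for whatever the agent does with the transcript.

```python
from typing import Any, Callable, Dict, Optional


def handle_voice_note(
    call_tool: Callable[[str, Dict[str, Any]], Any],
    attachment_url: str,
    answer: Callable[[str], str],
    language: Optional[str] = None,
) -> str:
    """Transcribe a voice note, then produce a text reply from the transcript."""
    args: Dict[str, Any] = {"attachment_url": attachment_url}
    if language is not None:
        args["language"] = language  # optional BCP-47 hint, e.g. "en"
    transcript = call_tool("transcribe_audio", args)  # assumed to return text
    return answer(transcript)
```

The returned text can be sent as a normal message or passed to tts_synthesize for an audio reply.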

Provider Configuration

Media tools route to providers based on the credentials present in your secret store and the providers listed in config.yaml. None of the tools accept a provider parameter at call time — selection is config-driven, not per-call.

Capability	Available providers	Auto-selection order	Required env keys
Vision (`image_analyze`, `describe_video`)	OpenAI (gpt-4o), Anthropic (claude-sonnet-4.5), Google Gemini	OpenAI -> Anthropic -> Google (first available)	`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY` (at least one)
TTS (`tts_synthesize`)	OpenAI TTS, ElevenLabs, Edge	Configured order; `auto-tts` heuristic decides when to use voice	`OPENAI_API_KEY`, `ELEVENLABS_API_KEY` (Edge is free)
STT (`transcribe_audio`)	OpenAI Whisper (gpt-4o-mini-transcribe), Groq Whisper, Deepgram nova-3	Fallback chain in declared order	`OPENAI_API_KEY`, `GROQ_API_KEY`, `DEEPGRAM_API_KEY` (at least one)
Image generation (`image_generate`)	FAL (flux-pro), OpenAI DALL-E 3	First with credentials present	`FAL_KEY`, `OPENAI_API_KEY` (optional — feature disabled silently if missing)
Document extraction (`extract_document`)	Local pdf.js + CSV / text decoders	n/a	None (uses FFmpeg for some audio metadata)
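The "first available" rule in the table can be sketched as a walk over an ordered list of providers. This is a simplified illustration, not the Comis implementation: the function name and provider labels are hypothetical, and only the order and environment key names for vision are taken from the table.

```python
from typing import List, Mapping, Optional, Tuple

# Vision auto-selection order and required keys, as listed in the table.
VISION_ORDER: List[Tuple[str, str]] = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
]


def pick_first_available(
    order: List[Tuple[str, str]], env: Mapping[str, str]
) -> Optional[str]:
    """Return the first provider whose credential is present, else None."""
    for provider, key_name in order:
        if env.get(key_name):  # empty strings count as missing
            return provider
    return None
```

Passing `os.environ` as `env` would apply the same rule to the current process, though Comis itself reads credentials from its secret store.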

Cost notes

Media calls bill against your model providers, not Comis itself. Rough rules of thumb:

Vision is significantly more expensive per call than text generation — a single high-res image can cost 1-5K tokens of input. Use prompt to focus the analysis and avoid re-analyzing the same image across turns.
STT is generally cheap (Groq’s whisper-large-v3-turbo is the lowest-latency / lowest-cost option in the fallback chain).
TTS is priced per character. ElevenLabs is the priciest tier; Edge TTS is free but voice quality is more limited.
Image generation has the highest per-call cost. The image-gen rate limiter is conservative by default (~1 req/min on OpenAI, ~1 req/sec on FAL).
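One way to avoid paying twice for the same image is to cache results by content hash and prompt. The sketch below is an illustration only: the `AnalysisCache` class is hypothetical and not part of Comis, and the `analyze` callback stands in for an actual image_analyze call.

```python
import hashlib
from typing import Callable, Dict, Tuple


class AnalysisCache:
    """Memoize vision results by (image hash, prompt)."""

    def __init__(self, analyze: Callable[[bytes, str], str]) -> None:
        self._analyze = analyze
        self._results: Dict[Tuple[str, str], str] = {}

    def describe(self, image_bytes: bytes, prompt: str = "") -> str:
        # Hash the bytes so identical images share a key regardless of filename.
        key = (hashlib.sha256(image_bytes).hexdigest(), prompt)
        if key not in self._results:
            self._results[key] = self._analyze(image_bytes, prompt)
        return self._results[key]
```

The prompt is part of the key because a different question about the same image needs a fresh analysis.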

Comis applies graceful degradation: if no credentials are configured for a capability, the corresponding tool is omitted from the agent’s tool catalog rather than failing at call time. Vision falls back to Anthropic by default since most setups already have an Anthropic key for the agent runtime itself. See the Media & Voice section for detailed provider setup, the Vision page for image flow details, and Voice for TTS/STT options.
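The catalog behavior described above can be sketched as a filter over tools and the credentials that unlock them. This is a simplified model under stated assumptions: the function and constant names are hypothetical, and tts_synthesize is left out because Edge TTS needs no key, so its availability depends on config rather than credentials. The key names are the ones listed in the provider table.

```python
from typing import Dict, List, Mapping

VISION_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

# Each tool is offered if at least one of its keys is present.
CREDENTIAL_GATED: Dict[str, List[str]] = {
    "image_analyze": VISION_KEYS,
    "describe_video": VISION_KEYS,
    "transcribe_audio": ["OPENAI_API_KEY", "GROQ_API_KEY", "DEEPGRAM_API_KEY"],
    "image_generate": ["FAL_KEY", "OPENAI_API_KEY"],
}
ALWAYS_AVAILABLE = ["extract_document"]  # local extraction, no provider needed


def media_tool_catalog(env: Mapping[str, str]) -> List[str]:
    """Return the media tools an agent would see for the given credentials."""
    tools = list(ALWAYS_AVAILABLE)
    for tool, key_names in CREDENTIAL_GATED.items():
        if any(env.get(name) for name in key_names):
            tools.append(tool)
    return sorted(tools)
```

With only `GROQ_API_KEY` set, the agent in this model would see extract_document and transcribe_audio, and the vision and image generation tools would simply be absent.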

All media tools process content on your server. Files are not sent to third parties beyond the configured AI provider for analysis.

Enabling Media Tools

Media tools are enabled by default in the full tool policy profile. If you are using a restricted profile (like minimal or coding), you can enable specific media tools by adding them to your agent’s allow list:

skills:
  toolPolicy:
    profile: minimal
    allow:
      - image_analyze
      - transcribe_audio

See Tool Policy for more details on controlling which tools your agents can access.

Related

Media & Voice: Detailed media provider setup and configuration
Vision: How AI vision processing works
Voice: Text-to-speech and transcription configuration
Agent Tools Overview: See all available agent tools
