What it does: Gives agents the ability to process images, audio, video, and documents: analyze, transcribe, synthesize, generate, and extract.
Who it’s for: Agents living in group chats where users share photos, voice notes, documents, and video clips. Instead of ignoring media content, your agent can understand and respond to it.

Quick Reference

| Tool | What It Does | Provider Categories |
| --- | --- | --- |
| image_analyze | Analyze images using AI vision (file, URL, base64, or attachment URL) | Vision (Anthropic, OpenAI, Google) |
| image_generate | Generate an image from a text prompt and deliver it to the current channel | Image gen (FAL, OpenAI DALL-E) |
| tts_synthesize | Generate spoken audio from text; returns a workspace file path | TTS (OpenAI, ElevenLabs, Edge) |
| transcribe_audio | Convert speech to text | STT (OpenAI Whisper, Groq, Deepgram) |
| describe_video | Extract key frames and describe video content using vision | Vision (frame-by-frame) |
| extract_document | Extract text from PDFs, CSVs, and other documents | None (local extraction) |
For per-provider setup, model defaults, and configuration, see the Media & Voice section.

Tool Details

Use image_analyze to have your agent look at an image and describe what it sees, answer questions about the content, or extract information like text in a screenshot.
The tool supports multiple ways to provide an image:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Always "analyze" |
| source_type | string | No | How the image is provided: "file", "url", or "base64" (required unless attachment_url is used) |
| source | string | No | The image data: a file path, URL, or base64-encoded string (required unless attachment_url is used) |
| prompt | string | No | A specific question about the image (e.g., “What text is visible in this screenshot?”). Defaults to a general description. |
| attachment_url | string | No | Platform attachment URL from a message hint (e.g., tg-file://..., discord://...). When provided, overrides source_type/source. |
| mime_type | string | No | MIME type for base64 input (auto-detected if omitted) |
Source types explained:
  • file — A path to an image file on the server (e.g., /tmp/screenshot.png)
  • url — A publicly accessible image URL
  • base64 — Raw image data encoded as a base64 string
For images attached to chat messages, use the attachment_url parameter instead of source_type/source. The value comes from the message hint, and the tool resolves platform-specific attachment URLs automatically.
Example usage: When a user sends a photo in a chat, the agent can analyze it:
"Analyze this screenshot and tell me what application is shown."
The agent calls image_analyze with attachment_url from the message hint, then responds with a description of the image content.
Use the prompt parameter to ask specific questions about images. Without a prompt, the agent generates a general description. With a prompt like “What error message is shown?”, the agent focuses on extracting that specific information.
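As a minimal sketch of the parameter rules above, the helper below assembles an image_analyze argument object. The function name build_image_analyze_args is hypothetical and not part of the tool itself; only the parameter names and allowed values are taken from the table.

```python
from typing import Optional

VALID_SOURCE_TYPES = {"file", "url", "base64"}


def build_image_analyze_args(
    attachment_url: Optional[str] = None,
    source_type: Optional[str] = None,
    source: Optional[str] = None,
    prompt: Optional[str] = None,
    mime_type: Optional[str] = None,
) -> dict:
    """Build an image_analyze argument dict (hypothetical helper)."""
    args = {"action": "analyze"}
    if attachment_url is not None:
        # attachment_url overrides source_type/source, so those are dropped.
        args["attachment_url"] = attachment_url
    else:
        if source_type not in VALID_SOURCE_TYPES or not source:
            raise ValueError(
                "source_type and source are required without attachment_url"
            )
        args["source_type"] = source_type
        args["source"] = source
        # mime_type only applies to base64 input; it is auto-detected if omitted.
        if mime_type is not None and source_type == "base64":
            args["mime_type"] = mime_type
    if prompt is not None:
        args["prompt"] = prompt
    return args
```

Passing only attachment_url and a prompt such as "What error message is shown?" yields the smallest useful call for an image attached to a chat message.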
Use tts_synthesize to convert text into spoken audio. The tool returns a workspace file path (under media/tts/) along with the MIME type and size. The agent must then send that file via the message tool’s attach action to deliver it to a chat — TTS itself does not auto-deliver.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Always "synthesize" |
| text | string | Yes | The text to speak |
| voice | string | No | Voice identifier (provider-specific; available voices depend on your configured provider) |
| format | string | No | Audio format: "mp3", "opus", or "wav" (default: resolved from config and channel; Telegram defaults to opus) |
The TTS provider (OpenAI, ElevenLabs, or Edge TTS) is determined by your config.yaml; there is no per-call provider parameter.
Supported providers (configured in config.yaml):
  • OpenAI — High-quality voices with natural intonation. Requires an OpenAI API key.
  • ElevenLabs — Wide selection of realistic voices with emotion control. Requires an ElevenLabs API key.
  • Edge TTS — Free text-to-speech using Microsoft Edge’s built-in voices. No API key required.
Returns: { filePath, mimeType, sizeBytes }.
Example two-step usage:
  1. tts_synthesize with the summary text -> obtain filePath.
  2. message action attach with attachment_url: file://{filePath}, attachment_type: audio to deliver the audio file.
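The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the runtime's actual API: call_tool is a hypothetical callable that takes a tool name and an argument dict and returns the tool result, and speak_reply is a made-up helper name. The argument and result field names come from the text above.

```python
from typing import Callable

# Hypothetical signature: (tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, dict], dict]


def speak_reply(call_tool: ToolCaller, text: str) -> dict:
    """Synthesize text and then deliver the audio file to the chat."""
    # Step 1: synthesize. The result carries filePath, mimeType, and sizeBytes.
    audio = call_tool("tts_synthesize", {"action": "synthesize", "text": text})
    # Step 2: TTS does not auto-deliver, so attach the file explicitly.
    return call_tool(
        "message",
        {
            "action": "attach",
            "attachment_url": f"file://{audio['filePath']}",
            "attachment_type": "audio",
        },
    )
```

Keeping the tool caller as a parameter makes the flow easy to exercise with a stub before wiring it to a real agent.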
Use transcribe_audio to convert an audio or voice message into written text. This is useful for processing voice messages that users send in chat.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| attachment_url | string | Yes | The URL of the audio file to transcribe (from a message hint, e.g., tg-file://..., discord://...) |
| language | string | No | BCP-47 language hint to improve transcription accuracy (e.g., "en", "he", "es") |
Supported providers:
  • OpenAI Whisper — High accuracy across many languages
  • Groq — Fast transcription with Whisper models
  • Deepgram — Real-time transcription with speaker detection
The provider used depends on your configuration. The tool returns the transcribed text, which the agent can then use in its response.
Supported audio formats: Most common audio formats are supported, including MP3, WAV, OGG, M4A, and WEBM. Voice messages from chat platforms (Discord, Telegram, WhatsApp) are automatically handled.
Example usage: When a user sends a voice message, the agent automatically transcribes it and responds to the spoken content. This works especially well in chat platforms where voice messages are common: the agent can participate in voice-message conversations by reading the transcribed text and responding in text or with a TTS audio reply.
Use describe_video to have your agent watch a video and describe what happens in it. The tool uses the vision pipeline to extract key frames from the video and analyze them.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| attachment_url | string | Yes | The URL of the video to describe (e.g., tg-file://..., discord://...) |
| prompt | string | No | Custom analysis prompt to guide the description (defaults to a generic description prompt) |
The tool extracts representative frames from the video, analyzes each frame with AI vision, and returns a coherent description of the video content. This is useful for understanding short clips, tutorials, or screen recordings shared in chat.
How it works:
  1. The video is downloaded from the provided URL
  2. Key frames are extracted at intervals throughout the video
  3. Each frame is analyzed using AI vision
  4. The individual frame descriptions are combined into a coherent narrative
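To make the interval idea in step 2 concrete, here is a small sketch of one way to pick evenly spaced frame timestamps. The text only says frames are taken "at intervals", so the midpoint-of-equal-segments strategy and the name frame_timestamps are assumptions for illustration, not a description of the tool's actual sampling.

```python
from typing import List


def frame_timestamps(duration_s: float, frame_count: int) -> List[float]:
    """Return frame_count timestamps at the midpoints of equal segments.

    Assumed strategy: split the video into frame_count equal segments and
    sample the middle of each, which avoids the first and last instants.
    """
    if duration_s <= 0 or frame_count <= 0:
        return []
    segment = duration_s / frame_count
    return [round(segment * (i + 0.5), 3) for i in range(frame_count)]
```

For a 10-second clip and four frames, this samples at 1.25, 3.75, 6.25, and 8.75 seconds.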
Example usage: When a user shares a video clip, the agent can describe what is happening:
"What's going on in this video?"
The agent watches the video and responds with a description like: “The video shows a step-by-step tutorial for configuring a Discord bot, starting with the developer portal and ending with the bot joining a server.”
Use extract_document to pull readable text from documents. This is useful for processing PDFs, spreadsheets, and other files that users share in chat.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| attachment_url | string | Yes | The URL of the document to extract text from |
| max_chars | number | No | Maximum number of characters to extract from the document |
Supported formats:
  • PDF — Extracts text from PDF documents
  • CSV — Reads comma-separated data
  • TXT — Plain text files
  • Other text-based formats
The tool returns the extracted text content, which the agent can then summarize, answer questions about, or process further.
Example usage:
"Read this PDF and give me a summary of the key points."
The agent extracts the text from the PDF, then uses that text to generate a summary. For spreadsheets (CSV), the data is returned in a structured format that the agent can analyze, filter, or summarize.
Document extraction works best with text-based PDFs. Scanned documents (images of text) may require the image_analyze tool instead, since the content is stored as images rather than selectable text.
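The fallback described above might look like the sketch below. Several things here are assumptions: call_tool is a hypothetical tool caller, read_document is a made-up helper, the results are assumed to expose a "text" key, and whether image_analyze accepts a given scanned file depends on your setup.

```python
from typing import Callable

# Hypothetical signature: (tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, dict], dict]


def read_document(call_tool: ToolCaller, attachment_url: str, max_chars: int = 20000) -> str:
    """Extract text locally, falling back to vision when nothing is found."""
    result = call_tool(
        "extract_document",
        {"attachment_url": attachment_url, "max_chars": max_chars},
    )
    # The "text" result key is an assumption made for this sketch.
    text = (result.get("text") or "").strip()
    if text:
        return text
    # No selectable text: likely a scan, so ask the vision tool to read it.
    vision = call_tool(
        "image_analyze",
        {
            "action": "analyze",
            "attachment_url": attachment_url,
            "prompt": "Transcribe all text visible in this document.",
        },
    )
    return (vision.get("text") or "").strip()
```

Trying local extraction first keeps the common case free of provider cost, since vision calls bill against your model provider.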
Use image_generate to create images from text descriptions. Available only when an image generation provider is configured (API key present). The generated image is delivered directly to the current channel via the daemon.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | Yes | Text description of the image to generate |
| size | string | No | Provider-specific size. fal.ai uses presets (square_hd, landscape_16_9); OpenAI uses pixel dimensions (1024x1024, 1792x1024). Omit for the provider default. |
The daemon-side handler applies rate limiting, safety checking, and provider execution before delivering the generated image to the current channel.
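Because size values differ by provider, a caller that knows which provider is configured could map a generic shape to the right value. The sketch below is illustrative only: the provider labels "fal" and "openai", the shape names, and build_image_generate_args are hypothetical, while the size values are the ones listed in the table above. The provider is used only to choose a size and is never sent to the tool.

```python
from typing import Optional

# Size values from the parameter table; provider and shape labels are assumed.
SIZE_PRESETS = {
    "fal": {"square": "square_hd", "landscape": "landscape_16_9"},
    "openai": {"square": "1024x1024", "landscape": "1792x1024"},
}


def build_image_generate_args(
    prompt: str,
    provider: Optional[str] = None,
    shape: Optional[str] = None,
) -> dict:
    """Build image_generate arguments, omitting size for the provider default."""
    args = {"prompt": prompt}
    if provider is not None and shape is not None:
        # Raises KeyError for an unknown provider or shape.
        args["size"] = SIZE_PRESETS[provider][shape]
    return args
```

Leaving both provider and shape unset produces a prompt-only call, which lets the configured provider pick its default size.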

Common Workflows

Here are some typical ways agents use media tools together:
  • Voice message handling — When a user sends a voice note, the agent uses transcribe_audio to convert it to text, processes the request, and optionally uses tts_synthesize to reply with audio.
  • Document Q&A — A user shares a PDF report. The agent uses extract_document to read it, then answers questions about the content.
  • Image moderation — In a group chat, the agent uses image_analyze to check shared images for content that violates community guidelines.
  • Accessibility — The agent uses describe_video and image_analyze to provide text descriptions of visual content for users who need them.
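The voice message workflow in the list above can be sketched end to end. As before, call_tool is a hypothetical tool caller, handle_voice_note and the answer callback are made-up names, and the "text" key on the transcription result is an assumption; the tool names and arguments follow the parameter tables.

```python
from typing import Callable, Optional

# Hypothetical signature: (tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, dict], dict]


def handle_voice_note(
    call_tool: ToolCaller,
    attachment_url: str,
    answer: Callable[[str], str],
    reply_with_audio: bool = False,
    language: Optional[str] = None,
) -> str:
    """Transcribe a voice note, produce a reply, and optionally speak it."""
    stt_args = {"attachment_url": attachment_url}
    if language:
        stt_args["language"] = language  # BCP-47 hint such as "en"
    # The "text" result key is an assumption made for this sketch.
    transcript = call_tool("transcribe_audio", stt_args)["text"]
    reply = answer(transcript)
    if reply_with_audio:
        audio = call_tool("tts_synthesize", {"action": "synthesize", "text": reply})
        call_tool(
            "message",
            {
                "action": "attach",
                "attachment_url": f"file://{audio['filePath']}",
                "attachment_type": "audio",
            },
        )
    return reply
```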

Provider Configuration

Media tools route to providers based on the credentials present in your secret store and the providers listed in config.yaml. None of the tools accept a provider parameter at call time — selection is config-driven, not per-call.
| Capability | Available providers | Auto-selection order | Required env keys |
| --- | --- | --- | --- |
| Vision (image_analyze, describe_video) | OpenAI (gpt-4o), Anthropic (claude-sonnet-4.5), Google Gemini | OpenAI -> Anthropic -> Google (first available) | OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY (at least one) |
| TTS (tts_synthesize) | OpenAI TTS, ElevenLabs, Edge | Configured order; auto-tts heuristic decides when to use voice | OPENAI_API_KEY, ELEVENLABS_API_KEY (Edge is free) |
| STT (transcribe_audio) | OpenAI Whisper (gpt-4o-mini-transcribe), Groq Whisper, Deepgram nova-3 | Fallback chain in declared order | OPENAI_API_KEY, GROQ_API_KEY, DEEPGRAM_API_KEY (at least one) |
| Image generation (image_generate) | FAL (flux-pro), OpenAI DALL-E 3 | First with credentials present | FAL_KEY, OPENAI_API_KEY (optional; feature disabled silently if missing) |
| Document extraction (extract_document) | Local pdf.js + CSV / text decoders | n/a | None (uses FFmpeg for some audio metadata) |
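The "first available" selection in the table can be illustrated with a short sketch. It assumes credentials are visible as a plain mapping of key names to values; the provider labels and the first_available helper are hypothetical, while the key names and ordering come from the table.

```python
from typing import Mapping, Optional, Sequence, Tuple

# (provider label, required key) pairs in the documented selection order.
VISION_ORDER = (
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
)
IMAGE_GEN_ORDER = (
    ("fal", "FAL_KEY"),
    ("openai", "OPENAI_API_KEY"),
)


def first_available(
    order: Sequence[Tuple[str, str]],
    credentials: Mapping[str, str],
) -> Optional[str]:
    """Return the first provider whose key is present and non-empty."""
    for provider, key in order:
        if credentials.get(key):
            return provider
    return None
```

A result of None corresponds to the case where no credentials exist for the capability. Calling first_available(VISION_ORDER, os.environ) would apply the same check to the process environment.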

Cost notes

Media calls bill against your model providers, not Comis itself. Rough rules of thumb:
  • Vision is significantly more expensive per call than text generation — a single high-res image can cost 1-5K tokens of input. Use prompt to focus the analysis and avoid re-analyzing the same image across turns.
  • STT is generally cheap (Groq’s whisper-large-v3-turbo is the lowest-latency / lowest-cost option in the fallback chain).
  • TTS is priced per character. ElevenLabs is the priciest tier; Edge TTS is free but voice quality is more limited.
  • Image generation has the highest per-call cost. The image-gen rate limiter is conservative by default (~1 req/min on OpenAI, ~1 req/sec on FAL).
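To show what limits such as ~1 req/min or ~1 req/sec mean in practice, here is a minimal minimum-interval limiter. It is a generic sketch, not the limiter the daemon uses: the class name, the rejection behavior, and the injectable clock are all assumptions.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Allow at most one request per min_interval_s seconds."""

    def __init__(self, min_interval_s: float, clock: Callable[[], float] = time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None

    def try_acquire(self) -> bool:
        """Return True and record the time if a request is allowed now."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False  # Too soon after the previous accepted request.
        self._last = now
        return True
```

With min_interval_s set to 60, a second request 30 seconds after the first is rejected, and one at the 60-second mark is accepted.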
Comis applies graceful degradation: if no credentials are configured for a capability, the corresponding tool is omitted from the agent’s tool catalog rather than failing at call time.
Vision falls back to Anthropic by default since most setups already have an Anthropic key for the agent runtime itself.
See the Media & Voice section for detailed provider setup, the Vision page for image flow details, and Voice for TTS/STT options.
All media tools process content on your server. Files are not sent to third parties beyond the configured AI provider for analysis.

Enabling Media Tools

Media tools are enabled by default in the full tool policy profile. If you are using a restricted profile (like minimal or coding), you can enable specific media tools by adding them to your agent’s allow list:
skills:
  toolPolicy:
    profile: minimal
    allow:
      - image_analyze
      - transcribe_audio
See Tool Policy for more details on controlling which tools your agents can access.
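One way to reason about the allow list is as a set of tools added on top of whatever the profile already grants. The sketch below encodes that reading; it is an assumption about the semantics, is_tool_enabled is a hypothetical helper, and because the tools included in each profile are not listed here, they are passed in as a parameter.

```python
from typing import Iterable, Mapping


def is_tool_enabled(tool: str, profile_tools: Iterable[str], tool_policy: Mapping) -> bool:
    """Check a tool against the profile's tools plus the allow list.

    tool_policy is the parsed toolPolicy mapping from the config;
    profile_tools is whatever the chosen profile grants (supplied by caller).
    """
    allow = tool_policy.get("allow") or []
    return tool in set(profile_tools) or tool in allow
```

With the policy from the example above and a profile that grants no media tools, image_analyze and transcribe_audio are enabled while tts_synthesize is not.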

  • Media & Voice: Detailed media provider setup and configuration
  • Vision: How AI vision processing works
  • Voice: Text-to-speech and transcription configuration
  • Agent Tools Overview: See all available agent tools