Quick Reference
| Tool | What It Does | Provider Categories |
|---|---|---|
| image_analyze | Analyze images using AI vision (file, URL, base64, or attachment URL) | Vision (Anthropic, OpenAI, Google) |
| image_generate | Generate an image from a text prompt and deliver it to the current channel | Image gen (FAL, OpenAI DALL-E) |
| tts_synthesize | Generate spoken audio from text; returns a workspace file path | TTS (OpenAI, ElevenLabs, Edge) |
| transcribe_audio | Convert speech to text | STT (OpenAI Whisper, Groq, Deepgram) |
| describe_video | Extract key frames and describe video content using vision | Vision (frame-by-frame) |
| extract_document | Extract text from PDFs, CSVs, and other documents | None (local extraction) |
Tool Details
image_analyze -- Analyze images with AI vision
Use image_analyze to have your agent look at an image and describe what it sees, answer questions about the content, or extract information such as the text in a screenshot.

The tool supports multiple ways to provide an image:

| Parameter | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Always "analyze" |
| source_type | string | No | How the image is provided: "file", "url", or "base64" (required unless attachment_url is used) |
| source | string | No | The image data: a file path, URL, or base64-encoded string (required unless attachment_url is used) |
| prompt | string | No | A specific question about the image (e.g., “What text is visible in this screenshot?”). Defaults to a general description. |
| attachment_url | string | No | Platform attachment URL from a message hint (e.g., tg-file://..., discord://...). When provided, overrides source_type/source. |
| mime_type | string | No | MIME type for base64 input (auto-detected if omitted) |

Source types explained:
- file: A path to an image file on the server (e.g., /tmp/screenshot.png)
- url: A publicly accessible image URL
- base64: Raw image data encoded as a base64 string

For images attached to a chat message, use the attachment_url parameter instead of source_type/source. The attachment_url is provided in the message hint and resolves platform-specific attachment URLs automatically.

Example usage:
When a user sends a photo in a chat, the agent calls image_analyze with attachment_url from the message hint, then responds with a description of the image content.
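The parameter rules for image_analyze can be sketched as a small argument builder. This is only an illustration: the helper name build_image_analyze_args is hypothetical and not part of the tool's API, and the attachment URL in the usage note is a placeholder. Only the parameter names and allowed values come from the table above.

```python
from typing import Optional

# Allowed values for source_type, taken from the parameter table.
ALLOWED_SOURCE_TYPES = {"file", "url", "base64"}


def build_image_analyze_args(
    *,
    attachment_url: Optional[str] = None,
    source_type: Optional[str] = None,
    source: Optional[str] = None,
    prompt: Optional[str] = None,
    mime_type: Optional[str] = None,
) -> dict:
    """Hypothetical helper: assemble an image_analyze argument payload."""
    args = {"action": "analyze"}  # action is always "analyze"
    if attachment_url is not None:
        # attachment_url overrides source_type/source, so they are left out.
        args["attachment_url"] = attachment_url
    else:
        if source_type not in ALLOWED_SOURCE_TYPES or not source:
            raise ValueError(
                "source_type and source are required unless attachment_url is used"
            )
        args["source_type"] = source_type
        args["source"] = source
        # mime_type only applies to base64 input and may be omitted.
        if mime_type is not None and source_type == "base64":
            args["mime_type"] = mime_type
    if prompt is not None:
        args["prompt"] = prompt
    return args
```

For a chat attachment, build_image_analyze_args(attachment_url="tg-file://abc") yields a payload with only action and attachment_url; for a server-side file, pass source_type="file" and source="/tmp/screenshot.png".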
tts_synthesize -- Generate spoken audio from text
Use tts_synthesize to convert text into spoken audio. The tool returns a workspace file path (under media/tts/) along with the MIME type and size. TTS itself does not auto-deliver: the agent must then send that file via the message tool’s attach action to deliver it to a chat.

| Parameter | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Always "synthesize" |
| text | string | Yes | The text to speak |
| voice | string | No | Voice identifier (provider-specific; available voices depend on your configured provider) |
| format | string | No | Audio format: "mp3", "opus", or "wav" (default: resolved from config and channel; Telegram defaults to opus) |

The TTS provider (OpenAI, ElevenLabs, or Edge TTS) is determined by your config.yaml; there is no per-call provider parameter.

Supported providers (configured in config.yaml):
- OpenAI: High-quality voices with natural intonation. Requires an OpenAI API key.
- ElevenLabs: Wide selection of realistic voices with emotion control. Requires an ElevenLabs API key.
- Edge TTS: Free text-to-speech using Microsoft Edge’s built-in voices. No API key required.

The tool returns { filePath, mimeType, sizeBytes }.

Example two-step usage:
- Call tts_synthesize with the summary text and obtain filePath.
- Call message with action attach, attachment_url: file://{filePath}, and attachment_type: audio to deliver the audio file.
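The two-step flow (synthesize, then attach) can be sketched as two payload builders. The function names and the sample result values are assumptions made for illustration; the parameter names, the allowed formats, and the { filePath, mimeType, sizeBytes } result shape come from the description above.

```python
from typing import Optional

# Audio formats listed for the format parameter.
ALLOWED_FORMATS = {"mp3", "opus", "wav"}


def build_tts_args(
    text: str, voice: Optional[str] = None, audio_format: Optional[str] = None
) -> dict:
    """Hypothetical helper: arguments for step 1 (tts_synthesize)."""
    if not text:
        raise ValueError("text is required")
    args = {"action": "synthesize", "text": text}
    if voice is not None:
        args["voice"] = voice  # provider-specific identifier
    if audio_format is not None:
        if audio_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {audio_format}")
        args["format"] = audio_format
    return args


def build_attach_args(tts_result: dict) -> dict:
    """Hypothetical helper: arguments for step 2 (message tool, attach action)."""
    return {
        "action": "attach",
        "attachment_url": f"file://{tts_result['filePath']}",
        "attachment_type": "audio",
    }


# Made-up sample of what tts_synthesize might return.
sample_result = {
    "filePath": "media/tts/summary.opus",
    "mimeType": "audio/ogg",
    "sizeBytes": 20480,
}
attach_args = build_attach_args(sample_result)
```

Keeping the two steps separate mirrors the tool contract: synthesis only produces a file, and delivery is an explicit second call.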
transcribe_audio -- Convert speech to text
Use transcribe_audio to convert an audio or voice message into written text. This is useful for processing voice messages that users send in chat.

| Parameter | Type | Required | Description |
|---|---|---|---|
| attachment_url | string | Yes | The URL of the audio file to transcribe (from a message hint, e.g., tg-file://..., discord://...) |
| language | string | No | BCP-47 language hint to improve transcription accuracy (e.g., "en", "he", "es") |

Supported providers:
- OpenAI Whisper: High accuracy across many languages
- Groq: Fast transcription with Whisper models
- Deepgram: Real-time transcription with speaker detection
describe_video -- Describe video content
Use describe_video to have your agent watch a video and describe what happens in it. The tool uses the vision pipeline to extract key frames from the video and analyze them.

| Parameter | Type | Required | Description |
|---|---|---|---|
| attachment_url | string | Yes | The URL of the video to describe (e.g., tg-file://..., discord://...) |
| prompt | string | No | Custom analysis prompt to guide the description (defaults to a generic description prompt) |

The tool extracts representative frames from the video, analyzes each frame with AI vision, and returns a coherent description of the video content. This is useful for understanding short clips, tutorials, or screen recordings shared in chat.

How it works:
- The video is downloaded from the provided URL
- Key frames are extracted at intervals throughout the video
- Each frame is analyzed using AI vision
- The individual frame descriptions are combined into a coherent narrative

The agent watches the video and responds with a description like: “The video shows a step-by-step tutorial for configuring a Discord bot, starting with the developer portal and ending with the bot joining a server.”
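The frame-sampling and combining steps can be illustrated with a short sketch. It assumes evenly spaced sampling at the midpoint of each interval and a simple timestamped join; the actual interval strategy and the way the tool merges frame descriptions are not specified here, and both function names are hypothetical.

```python
from typing import List


def frame_timestamps(duration_s: float, max_frames: int) -> List[float]:
    """Pick evenly spaced timestamps, one at the midpoint of each interval."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def combine_descriptions(timestamps: List[float], descriptions: List[str]) -> str:
    """Join per-frame descriptions into one timestamped text block."""
    lines = [f"[{t:.1f}s] {d}" for t, d in zip(timestamps, descriptions)]
    return "\n".join(lines)


# A 60-second clip sampled at four frames.
stamps = frame_timestamps(60, 4)
summary = combine_descriptions(
    stamps,
    ["portal opens", "bot is created", "token is copied", "bot joins server"],
)
```

In a real pipeline the joined text would typically be passed to a model together with the prompt parameter to produce the final narrative.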
extract_document -- Extract text from documents
Use extract_document to pull readable text from documents. This is useful for processing PDFs, spreadsheets, and other files that users share in chat.

| Parameter | Type | Required | Description |
|---|---|---|---|
| attachment_url | string | Yes | The URL of the document to extract text from |
| max_chars | number | No | Maximum number of characters to extract from the document |

Supported formats:
- PDF: Extracts text from PDF documents
- CSV: Reads comma-separated data
- TXT: Plain text files
- Other text-based formats

For example, when a user shares a PDF and asks for a summary, the agent extracts the text from the PDF, then uses that text to generate a summary. For spreadsheets (CSV), the data is returned in a structured format that the agent can analyze, filter, or summarize.

Document extraction works best with text-based PDFs. Scanned documents (images of text) may require the image_analyze tool instead, since the content is stored as images rather than selectable text.
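To make the max_chars cap and the structured CSV output concrete, here is a rough sketch using only the Python standard library. It is not the tool's implementation: the function names, the truncated flag, and the row format are assumptions chosen for illustration.

```python
import csv
import io
from typing import List, Optional


def extract_plain_text(data: bytes, max_chars: Optional[int] = None) -> dict:
    """Decode text and cap it at max_chars characters when a cap is given."""
    text = data.decode("utf-8", errors="replace")
    truncated = max_chars is not None and len(text) > max_chars
    if truncated:
        text = text[:max_chars]
    return {"text": text, "truncated": truncated}


def extract_csv_rows(data: bytes) -> List[dict]:
    """Parse CSV bytes into one dict per row, keyed by the header line."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8", errors="replace")))
    return [dict(row) for row in reader]
```

Capping the extracted text keeps a large document from consuming the agent's whole context, which is the usual reason to set max_chars.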
image_generate -- Generate images from text prompts
Use image_generate to create images from text descriptions. Available only when an image generation provider is configured (API key present). The generated image is delivered directly to the current channel via the daemon.

| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | Yes | Text description of the image to generate |
| size | string | No | Provider-specific size. fal.ai uses presets (square_hd, landscape_16_9); OpenAI uses pixel dimensions (1024x1024, 1792x1024). Omit for the provider default. |

The daemon-side handler applies rate limiting, safety checking, and provider execution before delivering the generated image to the current channel.
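Because size is provider-specific, a caller may want to check that a value has the right shape before sending it. The sketch below assumes a simple rule (pixel dimensions for OpenAI, named presets for fal.ai) and uses hypothetical helper names and provider labels; the tool itself does not expose a provider parameter, so this is purely a client-side illustration.

```python
import re
from typing import Optional

# Matches pixel dimensions such as 1024x1024.
PIXEL_SIZE = re.compile(r"^\d+x\d+$")


def size_fits_provider(provider: str, size: Optional[str]) -> bool:
    """Return True when size has the shape the given provider expects."""
    if size is None:
        return True  # omitting size selects the provider default
    is_pixels = PIXEL_SIZE.match(size) is not None
    if provider == "openai":
        return is_pixels  # e.g. 1024x1024, 1792x1024
    if provider == "fal":
        return not is_pixels  # e.g. square_hd, landscape_16_9
    return False


def build_image_generate_args(prompt: str, size: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble an image_generate argument payload."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    args = {"prompt": prompt}
    if size is not None:
        args["size"] = size
    return args
```

When the configured provider is unknown to the caller, omitting size is the safest choice, since every provider accepts its own default.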
Common Workflows
Here are some typical ways agents use media tools together:
- Voice message handling: When a user sends a voice note, the agent uses transcribe_audio to convert it to text, processes the request, and optionally uses tts_synthesize to reply with audio.
- Document Q&A: A user shares a PDF report. The agent uses extract_document to read it, then answers questions about the content.
- Image moderation: In a group chat, the agent uses image_analyze to check shared images for content that violates community guidelines.
- Accessibility: The agent uses describe_video and image_analyze to provide text descriptions of visual content for users who need them.
Provider Configuration
Media tools route to providers based on the credentials present in your secret store and the providers listed in config.yaml. None of the tools accept a provider parameter at call time; selection is config-driven, not per-call.
| Capability | Available providers | Auto-selection order | Required env keys |
|---|---|---|---|
| Vision (image_analyze, describe_video) | OpenAI (gpt-4o), Anthropic (claude-sonnet-4.5), Google Gemini | OpenAI -> Anthropic -> Google (first available) | OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY (at least one) |
| TTS (tts_synthesize) | OpenAI TTS, ElevenLabs, Edge | Configured order; auto-tts heuristic decides when to use voice | OPENAI_API_KEY, ELEVENLABS_API_KEY (Edge is free) |
| STT (transcribe_audio) | OpenAI Whisper (gpt-4o-mini-transcribe), Groq Whisper, Deepgram nova-3 | Fallback chain in declared order | OPENAI_API_KEY, GROQ_API_KEY, DEEPGRAM_API_KEY (at least one) |
| Image generation (image_generate) | FAL (flux-pro), OpenAI DALL-E 3 | First with credentials present | FAL_KEY, OPENAI_API_KEY (optional; feature disabled silently if missing) |
| Document extraction (extract_document) | Local pdf.js + CSV / text decoders | n/a | None (uses FFmpeg for some audio metadata) |
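The "first available" selection in the table can be sketched as a walk over an ordered list of (provider, key) pairs. This is a simplified illustration that reads keys from a plain mapping; the real lookup goes through the secret store and config.yaml, and the function name and provider labels are hypothetical.

```python
from typing import List, Mapping, Optional, Tuple

# Orders taken from the table: vision and image generation.
VISION_ORDER: List[Tuple[str, str]] = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
]
IMAGE_GEN_ORDER: List[Tuple[str, str]] = [
    ("fal", "FAL_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def first_available(
    order: List[Tuple[str, str]], secrets: Mapping[str, str]
) -> Optional[str]:
    """Return the first provider whose key is present and non-empty."""
    for provider, key_name in order:
        if secrets.get(key_name):
            return provider
    return None  # no credentials: the capability is unavailable


# Example with only Anthropic and Google keys configured.
secrets = {"ANTHROPIC_API_KEY": "sk-example", "GOOGLE_API_KEY": "g-example"}
vision_provider = first_available(VISION_ORDER, secrets)
image_provider = first_available(IMAGE_GEN_ORDER, secrets)
```

Returning None for image generation matches the table's note that the feature is disabled when no key is present.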
Cost notes
Media calls bill against your model providers, not Comis itself. Rough rules of thumb:
- Vision is significantly more expensive per call than text generation; a single high-res image can cost 1-5K tokens of input. Use prompt to focus the analysis and avoid re-analyzing the same image across turns.
- STT is generally cheap (Groq’s whisper-large-v3-turbo is the lowest-latency / lowest-cost option in the fallback chain).
- TTS is priced per character. ElevenLabs is the priciest tier; Edge TTS is free but voice quality is more limited.
- Image generation has the highest per-call cost. The image-gen rate limiter is conservative by default (~1 req/min on OpenAI, ~1 req/sec on FAL).
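A conservative per-provider limit like the one described for image generation can be approximated with a minimum-interval limiter. The sketch below is an assumption about how such a limiter might behave, not the daemon's actual code; the class name is hypothetical and the intervals are derived from the approximate defaults quoted above.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Allow one request per min_interval_s seconds; reject anything sooner."""

    def __init__(
        self, min_interval_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None  # time of the last accepted request

    def try_acquire(self) -> bool:
        now = self._clock()
        if self._last is not None and now - self._last < self.min_interval_s:
            return False  # still inside the cool-down window
        self._last = now
        return True


# Approximate defaults: ~1 req/min on OpenAI, ~1 req/sec on FAL.
IMAGE_GEN_INTERVALS_S = {"openai": 60.0, "fal": 1.0}
```

The injectable clock exists only to make the limiter easy to exercise without real waiting.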
All media tools process content on your server. Files are not sent to third parties beyond the configured AI provider for analysis.
Enabling Media Tools
Media tools are enabled by default in the full tool policy profile. If you are using a restricted profile (like minimal or coding), you can enable specific media tools by adding them to your agent’s allow list.
Related
- Media & Voice: Detailed media provider setup and configuration
- Vision: How AI vision processing works
- Voice: Text-to-speech and transcription configuration
- Agent Tools Overview: See all available agent tools
