Browser - Comis

What it does: Drives a headless Chromium browser via Playwright. It can navigate, click, type, take screenshots, read pages via accessibility snapshots, manage tabs and isolated profiles, and run JavaScript.

Who it’s for: Anyone whose agent needs to interact with sites that don’t yield to a simple web_fetch, such as single-page apps, sites behind logins, dynamic dashboards, file uploads, or workflows that require multiple steps. For static page content, the lighter web_fetch tool is faster and cheaper.

Browser Actions

The browser tool supports 16 actions, verified against BROWSER_TOOL_ACTIONS in packages/skills/src/builtin/platform/browser-tool-schema.ts:

Action	What It Does
`status`	Check if the browser is running or stopped
`start`	Launch the browser
`stop`	Close the browser
`profiles`	Manage isolated browser profiles
`tabs`	List all open tabs
`open`	Open a new tab with a URL
`focus`	Switch to a specific tab (by `targetId`)
`close`	Close a tab (by `targetId`)
`snapshot`	Get the page accessibility tree (aria or ai format)
`screenshot`	Capture a page screenshot
`navigate`	Go to a URL in the current tab
`console`	View browser console output
`pdf`	Save the current page as a PDF
`upload`	Upload a file to the page
`dialog`	Handle browser dialogs (alerts, confirms, prompts)
`act`	Interact with page elements (11 sub-kinds)
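
As a minimal sketch, a caller could check an action name against the table above before dispatching a call. The `BROWSER_ACTIONS` tuple and `check_action` helper are hypothetical names used only for illustration; they are not part of Comis, and the authoritative list remains BROWSER_TOOL_ACTIONS in the schema file.

```python
# Action names copied from the table above. The tuple and helper are
# illustrative only and are not part of the Comis codebase.
BROWSER_ACTIONS = (
    "status", "start", "stop", "profiles", "tabs", "open", "focus", "close",
    "snapshot", "screenshot", "navigate", "console", "pdf", "upload",
    "dialog", "act",
)


def check_action(name: str) -> str:
    """Return the name unchanged if it is a known action, else raise ValueError."""
    if name not in BROWSER_ACTIONS:
        raise ValueError(f"unknown browser action: {name!r}")
    return name
```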

Page Actions

navigate -- Go to a URL

Navigate the current tab to a new URL.

Parameter	Type	Required	Description
`action`	string	Yes	`"navigate"`
`targetUrl`	string	No	The URL to navigate to

The browser loads the page and waits for it to finish loading before returning. SSRF validation prevents navigation to internal network addresses (see Security below).

screenshot -- Capture a screenshot

Take a screenshot of the current page. The image is returned as a message attachment.

Parameter	Type	Required	Description
`action`	string	Yes	`"screenshot"`
`targetId`	string	No	The tab to capture (defaults to the active tab)
`fullPage`	boolean	No	Capture the full scrollable page (not just the viewport)
`type`	string	No	Image format: `"png"` (lossless) or `"jpeg"` (compressed)

Screenshots are useful for visual verification — checking how a page looks, confirming form submissions, or capturing error states.

snapshot -- Get the accessibility tree

Get a structured representation of the page content using the accessibility tree. This is how agents “read” web pages.

Parameter	Type	Required	Description
`action`	string	Yes	`"snapshot"`
`targetId`	string	No	The tab to snapshot (defaults to the active tab)
`snapshotFormat`	string	No	`"aria"` for the raw accessibility tree, or `"ai"` for an AI-optimized format
`mode`	string	No	Snapshot mode: `"efficient"` for compact output optimized for LLM consumption
`refs`	string	No	Element reference type: `"role"` (ARIA role references) or `"aria"` (ARIA label references)
`interactive`	boolean	No	Show only interactive elements
`compact`	boolean	No	Use compact output format
`depth`	number	No	Maximum tree depth to include
`maxChars`	number	No	Maximum characters in the snapshot output

Format options:

aria — The full accessibility tree with all ARIA roles and properties. Detailed but verbose.
ai — A condensed format designed for AI consumption. Shows interactive elements (buttons, links, inputs) with labels and references that can be used with the act action.

The ai format is recommended for most use cases — it gives agents the information they need to interact with the page without overwhelming them with structural details.
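
The snapshot parameters above can be assembled into a call like the following sketch. The `snapshot_request` helper is a hypothetical name; it assumes the tool accepts a flat object with the parameter names from the table and that unset optional parameters can simply be omitted.

```python
from typing import Any, Dict, Optional


def snapshot_request(
    snapshot_format: str = "ai",
    interactive: Optional[bool] = None,
    max_chars: Optional[int] = None,
    target_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a snapshot call, leaving out any optional parameter that is unset."""
    if snapshot_format not in ("aria", "ai"):
        raise ValueError("snapshotFormat must be 'aria' or 'ai'")
    optional = {
        "targetId": target_id,
        "interactive": interactive,
        "maxChars": max_chars,
    }
    request: Dict[str, Any] = {
        "action": "snapshot",
        "snapshotFormat": snapshot_format,
    }
    # Only include the optional keys the caller actually set.
    request.update({k: v for k, v in optional.items() if v is not None})
    return request
```

For example, `snapshot_request(interactive=True, max_chars=5000)` yields an ai-format request limited to interactive elements and 5000 characters.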

act -- Interact with page elements

Interact with elements on the page. The act action supports 11 different interaction types (called “sub-kinds”), each designed for a specific type of interaction.

Parameter	Type	Required	Description
`action`	string	Yes	`"act"`
`request.kind`	string	Yes	The interaction type (see table below)
`request.*`	(varies)	(varies)	Additional parameters depend on the `kind`

The act action wraps its interaction parameters inside a request object. See the Interaction Types section below for all 11 sub-kinds and their parameters.

open -- Open a new tab

Open a new browser tab and navigate to a URL.

Parameter	Type	Required	Description
`action`	string	Yes	`"open"`
`targetUrl`	string	No	The URL to open in the new tab

SSRF validation applies to the URL (same as navigate).

pdf -- Save page as PDF

Save the current page as a PDF document, returned as a message attachment.

Parameter	Type	Required	Description
`action`	string	Yes	`"pdf"`

Useful for archiving web pages or generating printable versions of online content.

upload -- Upload files

Upload one or more files to a file input element on the page.

Parameter	Type	Required	Description
`action`	string	Yes	`"upload"`
`inputRef`	string	No	CSS selector or aria reference for the file input element
`paths`	string[]	No	Array of file paths to upload

dialog -- Handle browser dialogs

Respond to browser dialogs such as JavaScript alerts, confirmation prompts, and input prompts.

Parameter	Type	Required	Description
`action`	string	Yes	`"dialog"`
`accept`	boolean	No	Whether to accept (`true`) or dismiss (`false`) the dialog
`promptText`	string	No	Text to enter in a prompt dialog

console -- View console output

Retrieve the browser’s developer console output, including errors, warnings, and log messages.

Parameter	Type	Required	Description
`action`	string	Yes	`"console"`

Useful for debugging page issues or monitoring JavaScript errors.

Interaction Types (act sub-kinds)

When using the act action, the kind parameter determines what type of interaction to perform. There are 11 interaction types:

Kind	What It Does	Key Parameters
`click`	Click an element	`ref` (element reference or CSS selector), `doubleClick`, `button`, `modifiers`
`type`	Type text character by character	`ref`, `text`, `submit`, `slowly`
`press`	Press a keyboard key	`key` (e.g., `"Enter"`, `"Tab"`, `"Escape"`)
`hover`	Hover over an element	`ref`
`drag`	Drag from one element to another	`startRef`, `endRef` (both element references)
`select`	Select a dropdown option	`ref`, `values` (array of option values)
`fill`	Fill form fields (clears existing content first)	`ref`, `fields` (array of field objects)
`resize`	Resize the browser viewport	`width`, `height`
`wait`	Wait for a condition	`timeMs` (milliseconds), `textGone` (text to wait to disappear)
`evaluate`	Run JavaScript in the page	`fn` (JavaScript code)
`close`	Close the current page or dialog	(no additional parameters)
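
To illustrate how the `request` wrapper and the `kind` values fit together, here is a small sketch. `ACT_KINDS` and `act_request` are hypothetical names, and the helper does not check which parameters each kind accepts; it only validates the kind and nests the parameters under `request`.

```python
from typing import Any, Dict

# The 11 interaction kinds from the table above.
ACT_KINDS = frozenset({
    "click", "type", "press", "hover", "drag", "select",
    "fill", "resize", "wait", "evaluate", "close",
})


def act_request(kind: str, **params: Any) -> Dict[str, Any]:
    """Wrap an interaction in the request object that the act action expects."""
    if kind not in ACT_KINDS:
        raise ValueError(f"unknown act kind: {kind!r}")
    return {"action": "act", "request": {"kind": kind, **params}}
```

For example, `act_request("press", key="Enter")` produces an `act` call whose `request` object holds `kind: "press"` and `key: "Enter"`.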

Selecting Elements

Most interaction types require a ref parameter to identify which element to interact with. You can use:

CSS selectors — Standard CSS selectors like #login-button, .submit-btn, or input[name="email"]
Aria references — References from the snapshot output in ai format, which label interactive elements with short identifiers

The recommended workflow is: take a snapshot in ai format to see the page structure, then use the element references from the snapshot in your act calls via the ref parameter.
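
The snapshot-then-act workflow can be sketched as below. Both callables are assumptions: `send` stands in for whatever transport delivers a call to the browser tool, and `pick_ref` is caller-supplied logic that chooses a reference from the snapshot output, whose exact shape is not specified here.

```python
from typing import Any, Callable, Dict

Call = Dict[str, Any]


def snapshot_then_click(
    send: Callable[[Call], Any],
    pick_ref: Callable[[Any], str],
) -> Any:
    """Take an ai-format snapshot, choose a ref from it, then click that ref."""
    snapshot = send(
        {"action": "snapshot", "snapshotFormat": "ai", "interactive": True}
    )
    # The snapshot format is opaque to this sketch; pick_ref interprets it.
    ref = pick_ref(snapshot)
    return send({"action": "act", "request": {"kind": "click", "ref": ref}})
```

Keeping the ref selection in a separate callable reflects the point above: references should come from a fresh snapshot of the current page rather than being hard-coded.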

Browser Profiles

Profiles create isolated browser contexts with separate cookies, local storage, and session data. This is useful for:

Managing multiple accounts on the same website
Testing with different user states (logged in vs. logged out)
Keeping browsing sessions separate between different tasks

Use the profiles action to create, list, and switch between browser profiles.

Stealth mode (anti-bot detection)

By default the browser runs headless Chrome. That is fine for most public sites and internal tools, but modern bot-detection services (Cloudflare Turnstile, FingerprintJS, BrowserScan, reCAPTCHA v3 scoring, Reddit’s secondary fingerprint check) flag it immediately. For agents that hit those sites, you can install Comis with progressively stronger stealth.

Three install-time flags compose together. They’re available in both the bare-metal installer (install.sh) and the Docker image (as build args):

Flag	What it adds	When you need it
`--with-browser`	Stock Google Chrome + headless shared libs. Sandbox `ReadWritePaths` widened for Chrome’s out-of-profile writes (`~/.config/google-chrome`, `~/.local/share/applications`).	Baseline. The browser tool needs a Chrome binary to launch; this provisions one if you didn’t bring your own.
`--with-xvfb`	Adds Xvfb + a `comis-xvfb.service` systemd companion that runs a virtual display on `:99`. The main daemon unit joins its `/tmp` namespace so the X11 socket is reachable. Config seeded with `headless: false`.	Sites that detect headless mode itself (BrowserScan, DataDome-tier, Cloudflare managed). On the test VPS, BrowserScan flipped from `Robot` to `Normal` just by switching to headed mode via Xvfb.
`--with-cloakbrowser`	Installs CloakBrowser — a stealth Chromium fork with source-level fingerprint patches at the C++ level (canvas, WebGL, audio, fonts, GPU, screen, WebRTC, hardware reporting). `findChrome()` auto-detects and prefers it over stock Chrome. Sandbox paths are tighter than the Chrome variant.	Sites that fingerprint visitors. Verified bypass on Cloudflare Turnstile (non-interactive), FingerprintJS, BrowserScan, bot.incolumitas (1 fail vs 4 fails for stock Chrome — only the irreducible W3C WebDriver-spec leak remains), and Reddit’s secondary fingerprint check on non-datacenter IPs.

Verified results (Ubuntu 24.04 VPS, head-to-head against the same probes)

Config	bot.incolumitas detection-tests	browserscan.net verdict
Stock Chrome, headless	4 fails (UA leak, HEADCHR_UA, WEBDRIVER, CHR_MEMORY)	`Robot`
Stock Chrome, headed via Xvfb	1 fail (WEBDRIVER spec only)	`Normal`
CloakBrowser, headless	1 fail (WEBDRIVER spec only)	`Normal`
CloakBrowser + Xvfb, headed	1 fail (WEBDRIVER spec only)	`Normal`

The WEBDRIVER fail is unavoidable for any CDP-connected browser — it’s a W3C-spec observable side effect, not a fingerprint defect. CloakBrowser’s own documentation calls this out as the single irreducible cost.

Datacenter IPs are pre-blocked regardless of fingerprint. Reddit, X/Twitter, LinkedIn, and many anti-bot services blocklist datacenter ASNs (AWS, DigitalOcean, Hetzner, Hostinger, …) at the network layer before any browser fingerprint check runs. CloakBrowser does not include a proxy. If your daemon runs on a datacenter VPS, pair the stealth flag with a residential proxy.

Measured in testing: stock Chrome and CloakBrowser produced identical 403 “blocked by network security” responses on Reddit from a Hostinger VPS. From a residential ISP, stock Chrome was still blocked, but CloakBrowser passed Reddit’s secondary fingerprint check and returned the actual subreddit content.

Install examples

# Bare-metal: stealth + headed for the hardest tier
curl -fsSL https://comis.ai/install.sh | bash -s -- --with-cloakbrowser --with-xvfb

# Docker: same matrix via build args
docker build \
  --build-arg COMIS_WITH_CLOAKBROWSER=1 \
  --build-arg COMIS_WITH_XVFB=1 \
  -t comisai/comis:cloak-xvfb .

# Docker: validate via install.sh-based image (same end state, exercises
# the actual installer path)
docker build -f Dockerfile.install \
  --build-arg COMIS_WITH_CLOAKBROWSER=1 \
  --build-arg COMIS_WITH_XVFB=1 \
  -t comis-installed:cloak-xvfb .

No runtime config change is required: findChrome() probes ~/.cloakbrowser/chromium-*/chrome first, then platform-specific Chrome paths, so CloakBrowser is used when it is installed and stock Chrome otherwise.
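
The probing order can be pictured with the following sketch. It is not the real findChrome(): the function name `find_chrome`, the `fallback_paths` argument, and the sorted ordering of multiple CloakBrowser matches are all assumptions made for illustration.

```python
import glob
import os
from typing import Iterable, Optional


def find_chrome(home: str, fallback_paths: Iterable[str]) -> Optional[str]:
    """Return the first existing browser binary, preferring CloakBrowser."""
    pattern = os.path.join(home, ".cloakbrowser", "chromium-*", "chrome")
    # sorted() keeps the choice deterministic when several versions exist;
    # how the real findChrome() orders multiple matches is not stated here.
    for candidate in sorted(glob.glob(pattern)):
        if os.path.isfile(candidate):
            return candidate
    # Fall back to the platform-specific Chrome locations, in the order given.
    for candidate in fallback_paths:
        if os.path.isfile(candidate):
            return candidate
    return None
```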

License note

CloakBrowser’s wrapper is MIT-licensed; the compiled binary is under a separate license — free for self-hosted use (including bringing it into your own VPS / Docker deploys), but bundling into a service you distribute to third-party customers requires a separate OEM license from CloakHQ. See the CloakBrowser binary license for the full terms.

Security

The browser includes built-in security protections:

SSRF validation — The navigate and open actions validate URLs before loading them. The browser cannot access localhost, internal IP addresses (like 10.x.x.x or 192.168.x.x), or cloud metadata endpoints (like 169.254.169.254). This prevents attacks where a crafted message tricks the browser into accessing your internal network.
Screenshot sanitization — Screenshots are processed and sanitized before being stored or sent.
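
To make the SSRF rule concrete, here is a minimal sketch of the kind of check described above. It is not Comis's implementation, and `is_blocked_url` is a hypothetical name. The sketch only inspects `localhost` and IP literals; a production validator would also resolve DNS names and check every resulting address.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a non-public IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host to validate, so refuse
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch does not resolve it, so it is not blocked.
        return False
    # Covers 10.x.x.x and 192.168.x.x (private), 127.x.x.x (loopback),
    # and 169.254.x.x (link-local, including cloud metadata endpoints).
    return addr.is_private or addr.is_loopback or addr.is_link_local
```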

For more details on network security protections, see Security.

End-to-end example: scraping a SPA

A typical multi-step workflow — visit a JavaScript-rendered dashboard, log in, navigate to a data view, and pull a value the agent can act on. The agent strings together navigate, snapshot, act, and screenshot in sequence:

# 1. Make sure the browser is running
action: status
# (start it if not)
action: start

# 2. Open the target URL in a new tab
action: open
targetUrl: "https://app.example.com/login"

# 3. Read the page structure to find the login form
action: snapshot
snapshotFormat: ai
interactive: true

# 4. Fill credentials using the refs from the snapshot
action: act
request:
  kind: fill
  fields:
    - { ref: "email-input", value: "agent@example.com" }
    - { ref: "password-input", value: "${LOGIN_PASSWORD}" }
# (the value substitution happens at the agent layer; the secret store
# resolves ${LOGIN_PASSWORD} to the encrypted value)

# 5. Submit the form
action: act
request:
  kind: click
  ref: "submit-button"

# 6. Wait for the dashboard to render after login
action: act
request:
  kind: wait
  textGone: "Signing in..."

# 7. Navigate to the data view
action: navigate
targetUrl: "https://app.example.com/dashboard/metrics"

# 8. Read the rendered DOM via accessibility tree
action: snapshot
snapshotFormat: ai
maxChars: 5000

# 9. Capture a screenshot for the response
action: screenshot
fullPage: true

The agent now has the rendered text in a structured snapshot and a visual screenshot, and can post both back to the user via the message tool’s attach action. For sites that emit network calls you want to inspect, the agent can pair the browser with console (to see JS logs) and act kind: evaluate (to run JavaScript like document.querySelector(...).innerText). For high-throughput scraping that does not need rendered JavaScript, prefer web_fetch.
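
As a sketch of the evaluate pairing mentioned above, the helper below builds an `act` call that reads one element's text. The helper name `evaluate_inner_text` and any selector passed to it are hypothetical, and the sketch assumes `fn` accepts a plain JavaScript expression as a string, which is not confirmed here.

```python
import json
from typing import Any, Dict


def evaluate_inner_text(selector: str) -> Dict[str, Any]:
    """Build an act/evaluate call that reads the innerText of one element."""
    # json.dumps quotes and escapes the selector as a JavaScript string literal.
    js = f"document.querySelector({json.dumps(selector)}).innerText"
    return {"action": "act", "request": {"kind": "evaluate", "fn": js}}
```

Quoting the selector through `json.dumps` avoids building JavaScript by naive string concatenation, which matters if the selector itself contains quotes.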

Related

Built-in Tools

All built-in tools including browser

Web Tools

Web search and page fetching tools

Agent Tools Overview

See all available agent tools

Config Reference

Browser and tool configuration options