What it does: Drives a headless Chromium browser via Playwright. It can navigate, click, type, take screenshots, read pages via accessibility snapshots, manage tabs and isolated profiles, and run JavaScript.

Who it’s for: Anyone whose agent needs to interact with sites that don’t yield to a simple web_fetch, such as single-page apps, sites behind logins, dynamic dashboards, file uploads, or workflows that require multiple steps. For static page content, the lighter web_fetch tool is faster and cheaper.

Browser Actions

The browser tool supports 16 actions, verified against BROWSER_TOOL_ACTIONS in packages/skills/src/builtin/platform/browser-tool-schema.ts:

| Action | What It Does |
| --- | --- |
| status | Check if the browser is running or stopped |
| start | Launch the browser |
| stop | Close the browser |
| profiles | Manage isolated browser profiles |
| tabs | List all open tabs |
| open | Open a new tab with a URL |
| focus | Switch to a specific tab (by targetId) |
| close | Close a tab (by targetId) |
| snapshot | Get the page accessibility tree (aria or ai format) |
| screenshot | Capture a page screenshot |
| navigate | Go to a URL in the current tab |
| console | View browser console output |
| pdf | Save the current page as a PDF |
| upload | Upload a file to the page |
| dialog | Handle browser dialogs (alerts, confirms, prompts) |
| act | Interact with page elements (11 sub-kinds) |

Page Actions

Take a screenshot of the current page. The image is returned as a message attachment.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "screenshot" |
| targetId | string | No | The tab to capture (defaults to the active tab) |
| fullPage | boolean | No | Capture the full scrollable page (not just the viewport) |
| type | string | No | Image format: "png" (lossless) or "jpeg" (compressed) |

Screenshots are useful for visual verification: checking how a page looks, confirming form submissions, or capturing error states.

Get a structured representation of the page content using the accessibility tree. This is how agents “read” web pages.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "snapshot" |
| targetId | string | No | The tab to snapshot (defaults to the active tab) |
| snapshotFormat | string | No | "aria" for the raw accessibility tree, or "ai" for an AI-optimized format |
| mode | string | No | Snapshot mode: "efficient" for compact output optimized for LLM consumption |
| refs | string | No | Element reference type: "role" (ARIA role references) or "aria" (ARIA label references) |
| interactive | boolean | No | Show only interactive elements |
| compact | boolean | No | Use compact output format |
| depth | number | No | Maximum tree depth to include |
| maxChars | number | No | Maximum characters in the snapshot output |

Format options:
  • aria — The full accessibility tree with all ARIA roles and properties. Detailed but verbose.
  • ai — A condensed format designed for AI consumption. Shows interactive elements (buttons, links, inputs) with labels and references that can be used with the act action.
The ai format is recommended for most use cases — it gives agents the information they need to interact with the page without overwhelming them with structural details.
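As a minimal sketch of assembling a snapshot call from the parameters in the table above: the helper name snapshot_request is hypothetical, and how the payload reaches the browser tool depends on your agent runtime, which is not shown here.

```python
# Hypothetical helper: builds a snapshot payload from the documented parameters.
# It only assembles a dict; sending it to the browser tool is left to the caller.
ALLOWED_SNAPSHOT_OPTIONS = {
    "targetId", "mode", "refs", "interactive", "compact", "depth", "maxChars",
}


def snapshot_request(snapshot_format="ai", **options):
    """Return a snapshot payload, omitting any option left as None."""
    if snapshot_format not in ("aria", "ai"):
        raise ValueError("snapshotFormat must be 'aria' or 'ai'")
    unknown = set(options) - ALLOWED_SNAPSHOT_OPTIONS
    if unknown:
        raise ValueError(f"not a documented snapshot parameter: {sorted(unknown)}")
    payload = {"action": "snapshot", "snapshotFormat": snapshot_format}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


# Interactive elements only, capped at 5000 characters.
request = snapshot_request(interactive=True, maxChars=5000)
```

Rejecting undocumented option names early is a design choice of this sketch, not a statement about how the tool itself validates input.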
Interact with elements on the page. The act action supports 11 different interaction types (called “sub-kinds”), each designed for a specific type of interaction.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "act" |
| request.kind | string | Yes | The interaction type (see table below) |
| request.* | (varies) | (varies) | Additional parameters depend on the kind |

The act action wraps its interaction parameters inside a request object. See the Interaction Types section below for all 11 sub-kinds and their parameters.
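To illustrate the nesting, here is a small sketch that wraps interaction parameters in the request object. The helper name act_request is an assumption for illustration; only the action, request, and kind keys come from the table above.

```python
# Hypothetical helper: shows where `kind` and its parameters sit in an act call.
def act_request(kind, **params):
    """Nest the interaction kind and its parameters under `request`."""
    return {"action": "act", "request": {"kind": kind, **params}}


click = act_request("click", ref="#login-button")
press = act_request("press", key="Enter")
# click == {"action": "act", "request": {"kind": "click", "ref": "#login-button"}}
```

The point of the sketch is that ref, key, and similar parameters are siblings of kind inside request, not top-level fields next to action.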
Open a new browser tab and navigate to a URL.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "open" |
| targetUrl | string | No | The URL to open in the new tab |

SSRF validation applies to the URL (same as navigate).

Save the current page as a PDF document, returned as a message attachment.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "pdf" |

Useful for archiving web pages or generating printable versions of online content.

Upload one or more files to a file input element on the page.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "upload" |
| inputRef | string | No | CSS selector or aria reference for the file input element |
| paths | string[] | No | Array of file paths to upload |

Respond to browser dialogs such as JavaScript alerts, confirmation prompts, and input prompts.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "dialog" |
| accept | boolean | No | Whether to accept (true) or dismiss (false) the dialog |
| promptText | string | No | Text to enter in a prompt dialog |

Retrieve the browser’s developer console output, including errors, warnings, and log messages.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | "console" |

Useful for debugging page issues or monitoring JavaScript errors.

Interaction Types (act sub-kinds)

When using the act action, the kind parameter determines what type of interaction to perform. There are 11 interaction types:

| Kind | What It Does | Key Parameters |
| --- | --- | --- |
| click | Click an element | ref (element reference or CSS selector), doubleClick, button, modifiers |
| type | Type text character by character | ref, text, submit, slowly |
| press | Press a keyboard key | key (e.g., "Enter", "Tab", "Escape") |
| hover | Hover over an element | ref |
| drag | Drag from one element to another | startRef, endRef (both element references) |
| select | Select a dropdown option | ref, values (array of option values) |
| fill | Fill form fields (clears existing content first) | ref, fields (array of field objects) |
| resize | Resize the browser viewport | width, height |
| wait | Wait for a condition | timeMs (milliseconds), textGone (text to wait to disappear) |
| evaluate | Run JavaScript in the page | fn (JavaScript code) |
| close | Close the current page or dialog | (no additional parameters) |
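The table can be turned into a quick lint for act requests. The sketch below is an assumption-laden convenience, not the tool's schema: the names KEY_PARAMS and unlisted_params are hypothetical, and because the table lists only "key" parameters, a flagged name may still be accepted by the real tool.

```python
# Key parameters per kind, copied from the table above.
# This is a lint for typos, not a validator: the real schema may accept more.
KEY_PARAMS = {
    "click": {"ref", "doubleClick", "button", "modifiers"},
    "type": {"ref", "text", "submit", "slowly"},
    "press": {"key"},
    "hover": {"ref"},
    "drag": {"startRef", "endRef"},
    "select": {"ref", "values"},
    "fill": {"ref", "fields"},
    "resize": {"width", "height"},
    "wait": {"timeMs", "textGone"},
    "evaluate": {"fn"},
    "close": set(),
}


def unlisted_params(request):
    """Return parameter names in `request` that the table does not list for its kind."""
    kind = request.get("kind")
    if kind not in KEY_PARAMS:
        raise ValueError(f"unknown act kind: {kind!r}")
    return set(request) - {"kind"} - KEY_PARAMS[kind]


# drag takes startRef/endRef, so a stray `ref` is reported.
suspect = unlisted_params({"kind": "drag", "startRef": "a", "endRef": "b", "ref": "c"})
```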

Selecting Elements

Most interaction types require a ref parameter to identify which element to interact with. You can use:
  • CSS selectors — Standard CSS selectors like #login-button, .submit-btn, or input[name="email"]
  • Aria references — References from the snapshot output in ai format, which label interactive elements with short identifiers
The recommended workflow is: take a snapshot in ai format to see the page structure, then use the element references from the snapshot in your act calls via the ref parameter.
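The snapshot-then-act workflow can be sketched as follows. Both callables are hypothetical: call_browser stands in for however your runtime invokes the browser tool, and pick_ref is caller-supplied logic that extracts a reference from the snapshot, since the snapshot's exact output format is not specified here.

```python
# Sketch only: `call_browser` and `pick_ref` are assumptions supplied by the caller.
def snapshot_then_click(call_browser, pick_ref):
    """Take an ai-format snapshot, choose an element reference, then click it."""
    snapshot = call_browser(
        {"action": "snapshot", "snapshotFormat": "ai", "interactive": True}
    )
    ref = pick_ref(snapshot)  # e.g. the reference of the button the agent wants
    return call_browser(
        {"action": "act", "request": {"kind": "click", "ref": ref}}
    )
```

Keeping reference selection in a separate function reflects that the choice is the agent's decision, made from what the snapshot shows.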

Browser Profiles

Profiles create isolated browser contexts with separate cookies, local storage, and session data. This is useful for:
  • Managing multiple accounts on the same website
  • Testing with different user states (logged in vs. logged out)
  • Keeping browsing sessions separate between different tasks
Use the profiles action to create, list, and switch between browser profiles.

Stealth mode (anti-bot detection)

By default the browser runs headless Chrome. That is fine for most public sites and internal tools, but modern bot-detection services (Cloudflare Turnstile, FingerprintJS, BrowserScan, reCAPTCHA v3 scoring, Reddit’s secondary fingerprint check) flag it instantly. For agents that hit those sites, you can install Comis with progressively stronger stealth.

Three install-time flags compose together. They are available in both the bare-metal installer (install.sh) and the Docker image (as build args):

| Flag | What it adds | When you need it |
| --- | --- | --- |
| --with-browser | Stock Google Chrome + headless shared libs. Sandbox ReadWritePaths widened for Chrome’s out-of-profile writes (~/.config/google-chrome, ~/.local/share/applications). | Baseline. The browser tool needs a Chrome binary to launch; this provisions one if you didn’t bring your own. |
| --with-xvfb | Adds Xvfb + a comis-xvfb.service systemd companion that runs a virtual display on :99. The main daemon unit joins its /tmp namespace so the X11 socket is reachable. Config seeded with headless: false. | Sites that detect headless mode itself (BrowserScan, DataDome-tier, Cloudflare managed). On the test VPS, BrowserScan flipped from Robot to Normal just by switching to headed mode via Xvfb. |
| --with-cloakbrowser | Installs CloakBrowser, a stealth Chromium fork with source-level fingerprint patches at the C++ level (canvas, WebGL, audio, fonts, GPU, screen, WebRTC, hardware reporting). findChrome() auto-detects and prefers it over stock Chrome. Sandbox paths are tighter than the Chrome variant. | Sites that fingerprint visitors. Verified bypass on Cloudflare Turnstile (non-interactive), FingerprintJS, BrowserScan, bot.incolumitas (1 fail vs 4 fails for stock Chrome; only the irreducible W3C WebDriver-spec leak remains), and Reddit’s secondary fingerprint check on non-datacenter IPs. |

Verified results (Ubuntu 24.04 VPS, head-to-head against the same probes)

| Config | bot.incolumitas detection-tests | browserscan.net verdict |
| --- | --- | --- |
| Stock Chrome, headless | 4 fails (UA leak, HEADCHR_UA, WEBDRIVER, CHR_MEMORY) | Robot |
| Stock Chrome, headed via Xvfb | 1 fail (WEBDRIVER spec only) | Normal |
| CloakBrowser, headless | 1 fail (WEBDRIVER spec only) | Normal |
| CloakBrowser + Xvfb, headed | 1 fail (WEBDRIVER spec only) | Normal |

The WEBDRIVER fail is unavoidable for any CDP-connected browser — it’s a W3C-spec observable side effect, not a fingerprint defect. CloakBrowser’s own documentation calls this out as the single irreducible cost.
Datacenter IPs are pre-blocked regardless of fingerprint. Reddit, X/Twitter, LinkedIn, and many anti-bot services blocklist datacenter ASNs (AWS, DigitalOcean, Hetzner, Hostinger, …) at the network layer before any browser fingerprint check runs. CloakBrowser does not include a proxy. If your daemon runs on a datacenter VPS, pair the stealth flag with a residential proxy.

Concretely, as measured during testing: stock Chrome and CloakBrowser produced identical 403 “blocked by network security” responses on Reddit from a Hostinger VPS. From a residential ISP, stock Chrome was still blocked, but CloakBrowser bypassed Reddit’s secondary fingerprint check and returned the actual subreddit content.

Install examples

```bash
# Bare-metal: stealth + headed for the hardest tier
curl -fsSL https://comis.ai/install.sh | bash -s -- --with-cloakbrowser --with-xvfb

# Docker: same matrix via build args
docker build \
  --build-arg COMIS_WITH_CLOAKBROWSER=1 \
  --build-arg COMIS_WITH_XVFB=1 \
  -t comisai/comis:cloak-xvfb .

# Docker: validate via install.sh-based image (same end state, exercises
# the actual installer path)
docker build -f Dockerfile.install \
  --build-arg COMIS_WITH_CLOAKBROWSER=1 \
  --build-arg COMIS_WITH_XVFB=1 \
  -t comis-installed:cloak-xvfb .
```

No runtime config change required — findChrome() probes ~/.cloakbrowser/chromium-*/chrome first, then platform-specific Chrome paths. Whatever was installed wins.
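The probe order described above can be sketched roughly as follows. This is not the actual findChrome() implementation: the Python function name, the fallbacks argument, and the tie-breaking between multiple chromium-* directories are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional


def find_chrome(home: Path, fallbacks: List[Path]) -> Optional[Path]:
    """Return the first existing browser binary, preferring CloakBrowser.

    `fallbacks` stands in for the platform-specific Chrome paths, which the
    caller must supply; this sketch does not know the real list.
    """
    # CloakBrowser installs are probed first: ~/.cloakbrowser/chromium-*/chrome
    # Sorting is only here to make the sketch deterministic.
    cloak = sorted(home.glob(".cloakbrowser/chromium-*/chrome"))
    for candidate in [*cloak, *fallbacks]:
        if candidate.is_file():
            return candidate
    return None
```

A call such as find_chrome(Path.home(), [Path("/usr/bin/google-chrome")]) would return the CloakBrowser binary when one is installed and the stock path otherwise; the stock path shown is an example, not a documented location.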

License note

CloakBrowser’s wrapper is MIT-licensed; the compiled binary is under a separate license — free for self-hosted use (including bringing it into your own VPS / Docker deploys), but bundling into a service you distribute to third-party customers requires a separate OEM license from CloakHQ. See the CloakBrowser binary license for the full terms.

Security

The browser includes built-in security protections:
  • SSRF validation — The navigate and open actions validate URLs before loading them. The browser cannot access localhost, internal IP addresses (like 10.x.x.x or 192.168.x.x), or cloud metadata endpoints (like 169.254.169.254). This prevents attacks where a crafted message tricks the browser into accessing your internal network.
  • Screenshot sanitization — Screenshots are processed and sanitized before being stored or sent.
For more details on network security protections, see Security.
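To make the SSRF rule concrete, here is a minimal sketch of the kind of check it implies, using only the Python standard library. It is not the tool's actual validator: the function name is hypothetical, and it inspects only literal hosts, whereas a production check would also resolve hostnames and test every resolved address.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_target(url: str) -> bool:
    """Return True for localhost, internal IPs, and link-local metadata addresses."""
    host = urlparse(url).hostname
    if host is None:
        return True  # nothing to validate, so refuse
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch stops here. A real check would resolve it
        # and apply the same tests to each address it resolves to.
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local


blocked = is_blocked_target("http://169.254.169.254/latest/meta-data/")
```

The metadata address 169.254.169.254 falls in the link-local range, which is why the link-local test covers it alongside the private ranges such as 10.x.x.x and 192.168.x.x.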

End-to-end example: scraping a SPA

A typical multi-step workflow — visit a JavaScript-rendered dashboard, log in, navigate to a data view, and pull a value the agent can act on. The agent strings together navigate, snapshot, act, and screenshot in sequence:
```yaml
# 1. Make sure the browser is running
action: status
# (start it if not)
action: start

# 2. Open the target URL in a new tab
action: open
targetUrl: "https://app.example.com/login"

# 3. Read the page structure to find the login form
action: snapshot
snapshotFormat: ai
interactive: true

# 4. Fill credentials using the refs from the snapshot
action: act
request:
  kind: fill
  fields:
    - { ref: "email-input", value: "agent@example.com" }
    - { ref: "password-input", value: "${LOGIN_PASSWORD}" }
# (the value substitution happens at the agent layer; the secret store
# resolves ${LOGIN_PASSWORD} to the encrypted value)

# 5. Submit the form
action: act
request:
  kind: click
  ref: "submit-button"

# 6. Wait for the dashboard to render after login
action: act
request:
  kind: wait
  textGone: "Signing in..."

# 7. Navigate to the data view
action: navigate
targetUrl: "https://app.example.com/dashboard/metrics"

# 8. Read the rendered DOM via accessibility tree
action: snapshot
snapshotFormat: ai
maxChars: 5000

# 9. Capture a screenshot for the response
action: screenshot
fullPage: true
```

The agent now has the rendered text in a structured snapshot and a visual screenshot, and can post both back to the user via the message tool’s attach action. For sites that emit network calls you want to inspect, the agent can pair the browser with console (to see JS logs) and act kind: evaluate (to run JavaScript like document.querySelector(...).innerText). For high-throughput scraping that does not need rendered JavaScript, prefer web_fetch.
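As a sketch of pairing console with evaluate, the payloads might be built like this. The helper name and the #total selector are hypothetical, and whether fn expects a bare expression or a function body is not stated here, so treat the JavaScript string as an assumption to verify against the schema.

```python
# Hypothetical helper: wraps a JavaScript snippet in an act/evaluate request.
def evaluate_request(js_source):
    """Build an act call of kind `evaluate` carrying JavaScript in `fn`."""
    return {"action": "act", "request": {"kind": "evaluate", "fn": js_source}}


# Read one value from the rendered page (`#total` is an example selector).
read_total = evaluate_request("document.querySelector('#total').innerText")

# Then fetch console output to check for JavaScript errors.
console_logs = {"action": "console"}
```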
