interact — vision-grounded computer-use MCP

MCP
computer-use
VLM
LiteLLM
Rust
MIT

MCP server that lets any agent act on what it sees in the browser and on a real desktop (navigate, click, type, scroll, drag). It returns text diffs of what changed instead of raw screenshots. GUI grounding fuses VLM detection with the AT-SPI accessibility tree. A LiteLLM multi-provider router picks models automatically with cost-aware selection, ranked from public benchmarks (MMMU, ScreenSpot-Pro, Video-MME). An isolated software-GL sandbox lets GPU, Flutter, and Electron apps render. Installs into the major agent clients and files GitHub issues automatically. MIT licensed.
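
To make "cost-aware auto model-selection" concrete, here is a minimal Rust sketch of one way such a rule could work: pick the highest-scoring model whose price fits a budget, and break ties toward the cheaper one. This is an assumption about the general idea, not the project's actual router. The `Candidate` type, `select_model`, `example_candidates`, the model names, and all numbers are hypothetical placeholders, not real benchmark results or prices.

```rust
// Hypothetical sketch: none of these names or numbers come from the interact codebase.

/// One candidate model with an aggregate quality score and a price.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    /// Placeholder quality score in [0, 1], e.g. a normalized benchmark average.
    pub score: f64,
    /// Placeholder price in USD per million input tokens.
    pub usd_per_mtok: f64,
}

/// Return the highest-scoring candidate whose price is within the budget.
/// When scores tie, the cheaper candidate wins. Returns `None` if nothing fits.
pub fn select_model(candidates: &[Candidate], max_usd_per_mtok: f64) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.usd_per_mtok <= max_usd_per_mtok)
        .max_by(|a, b| {
            a.score
                .total_cmp(&b.score)
                // Reverse the price comparison so the cheaper model ranks higher on a tie.
                .then(b.usd_per_mtok.total_cmp(&a.usd_per_mtok))
        })
}

/// Illustrative table only; the values are made up for the example.
pub fn example_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "model-small", score: 0.55, usd_per_mtok: 0.5 },
        Candidate { name: "model-medium", score: 0.70, usd_per_mtok: 3.0 },
        Candidate { name: "model-medium-alt", score: 0.70, usd_per_mtok: 4.0 },
        Candidate { name: "model-large", score: 0.82, usd_per_mtok: 15.0 },
    ]
}
```

With this table, `select_model(&example_candidates(), 5.0)` chooses `model-medium`: two models tie on score within the budget, and the cheaper one wins. A real router would likely weight scores per task type (GUI grounding versus video, say) instead of using one aggregate number.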
