interact — vision-grounded computer-use MCP
- MCP
- computer-use
- VLM
- LiteLLM
- Rust
- MIT
MCP server that lets any agent act on what it sees across the browser and a real desktop (navigate, click, type, scroll, drag) and returns text diffs of what changed instead of raw screenshots. GUI grounding fuses VLM detection with the AT-SPI accessibility tree. A LiteLLM multi-provider router auto-selects models on a cost-aware basis, ranked from public benchmarks (MMMU, ScreenSpot-Pro, Video-MME). An isolated software-GL sandbox lets GPU, Flutter, and Electron apps render. Installs into the major agent clients and files GitHub issues automatically. MIT licensed.
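
To illustrate the idea of returning a text diff rather than a screenshot, here is a minimal sketch in Python. It is not the project's implementation: the function name `ui_text_diff`, the one-line-per-element snapshot format, and the sample UI states are all hypothetical, and it assumes each UI state has already been flattened to a list of strings.

```python
from difflib import SequenceMatcher


def ui_text_diff(before: list[str], after: list[str]) -> list[str]:
    """Return only the lines that changed between two UI snapshots.

    Hypothetical helper: each snapshot is assumed to be a list of
    one-line element descriptions (role, name, state).
    """
    changes: list[str] = []
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Unchanged runs ("equal") are skipped so the agent sees only deltas.
        if tag in ("delete", "replace"):
            changes += [f"- {line}" for line in before[i1:i2]]
        if tag in ("insert", "replace"):
            changes += [f"+ {line}" for line in after[j1:j2]]
    return changes


# Made-up snapshots taken before and after a hypothetical "type" action.
before = [
    'button "Submit" enabled',
    'textbox "Email" value=""',
    'label "Status: idle"',
]
after = [
    'button "Submit" enabled',
    'textbox "Email" value="a@b.co"',
    'label "Status: idle"',
    'alert "Saved"',
]

for line in ui_text_diff(before, after):
    print(line)
```

In this sketch the agent receives three short lines (one removed, two added) instead of a full image, which is the kind of compact result the description refers to.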