
Build a Voice AI Assistant with OpenVoiceUI

What is OpenVoiceUI?

OpenVoiceUI is an open-source voice AI platform that connects to any large language model through the OpenClaw gateway. Unlike closed platforms that lock you into a single provider, OpenVoiceUI gives you full control over which AI model powers your assistant, how your data is handled, and what your assistant can do.

The platform includes a visual canvas system where your AI builds live web pages during conversation — dashboards, reports, image galleries, interactive tools. It is not just a chatbot. It is a full workspace where voice meets vision.

Prerequisites

  • Node.js 18+ (or Docker if you prefer containers)
  • A microphone — any USB or built-in mic works
  • An API key for your chosen LLM provider (OpenAI, Anthropic, Groq, or a local Ollama instance)

That's it. No GPU required, no special hardware. OpenVoiceUI runs on anything from a Raspberry Pi to a cloud VPS.

Quick Start — 3 Ways to Install

Option 1: npm (Fastest)

The quickest way to get started. One command sets up the project, installs dependencies, and walks you through configuration:

npx openvoiceui setup

Option 2: Pinokio (One-Click)

If you use Pinokio, just search for "OpenVoiceUI" in the app store and click Install. Pinokio handles all dependencies, environment setup, and launches the application automatically. Best for non-technical users who want zero terminal interaction.

Option 3: Docker (Production)

For production deployments or if you prefer containerized environments:

git clone https://github.com/MCERQUA/OpenVoiceUI
cd OpenVoiceUI
docker compose up

Docker Compose spins up both OpenVoiceUI and the OpenClaw gateway in isolated containers. Copy .env.example to .env and add your API keys before starting.
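As a sketch, a minimal .env might look like the following. Only the key for the provider you actually use is required; the variable names match the provider sections below, while OLLAMA_BASE_URL is an illustrative name, so check .env.example for the exact variables your version expects:

```
# LLM provider keys - set only the one(s) you use
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...

# For local models, point OpenClaw at a running Ollama instance
OLLAMA_BASE_URL=http://localhost:11434
```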

Choosing Your LLM

OpenVoiceUI works with any LLM that the OpenClaw gateway supports. Here are the most common options:

OpenAI (GPT-4o, GPT-4)

Best overall quality. Fast inference, strong tool use, great for general-purpose assistants. Set OPENAI_API_KEY in your environment.

Anthropic (Claude)

Excellent for nuanced conversation, long-context tasks, and careful reasoning. Set ANTHROPIC_API_KEY in your environment.

Groq (Llama, Mixtral)

Fastest inference available. Free tier is generous. Great for low-latency voice applications where response time matters most. Set GROQ_API_KEY.

Ollama (Local Models)

Run models entirely on your hardware. Zero API costs, full privacy. Point OpenClaw at http://localhost:11434 and use any model Ollama supports.
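The pattern across all four options is the same: set one credential and the gateway does the rest. A minimal sketch of that selection logic, assuming the env var names above (OpenClaw's real config shape may differ):

```javascript
// Sketch: choose an LLM provider based on which credentials are present.
// Provider names and env vars mirror the list above; the fallback is a
// local Ollama instance on its default port.
const PROVIDERS = [
  { name: "openai",    envKey: "OPENAI_API_KEY" },
  { name: "anthropic", envKey: "ANTHROPIC_API_KEY" },
  { name: "groq",      envKey: "GROQ_API_KEY" },
];

function resolveProvider(env) {
  for (const p of PROVIDERS) {
    if (env[p.envKey]) return { name: p.name, apiKey: env[p.envKey] };
  }
  // No hosted key set: fall back to local Ollama.
  return { name: "ollama", baseUrl: "http://localhost:11434" };
}
```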

Your First Conversation

Once OpenVoiceUI is running, open your browser and navigate to the application. Here is what happens when you start talking:

  1. Speech-to-Text (STT) captures your voice through the browser microphone. OpenVoiceUI supports the Web Speech API (built into Chrome), Deepgram, and Groq Whisper.
  2. OpenClaw gateway receives the transcribed text and routes it to your configured LLM. The gateway handles model selection, API authentication, and response streaming.
  3. The LLM generates a response, which streams back through OpenClaw to the browser in real time.
  4. Text-to-Speech (TTS) converts the response to audio and plays it through your speakers. Multiple TTS engines are supported, including Supertonic (self-hosted) and browser-native synthesis.
  5. Canvas output — if the AI generates visual content (HTML pages, charts, images), it renders live in the canvas panel alongside the conversation.

The entire loop — from speaking to hearing a response — typically takes 2-4 seconds with a warm cache and a fast LLM provider like Groq.

The Canvas System

This is what makes OpenVoiceUI different from every other voice AI tool. The canvas is a live rendering surface where your AI builds real web pages during conversation.

Ask your assistant to "show me a dashboard of today's tasks" and it generates a full HTML page with styled components, renders it in the canvas panel, and you can interact with it. Ask it to "create an image gallery from my uploads" and it builds one live. The pages persist — you can revisit them anytime through the built-in file explorer.
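A generated page is ordinary, self-contained HTML that the canvas renders as-is. Purely as an illustration (the real markup is whatever your model produces):

```html
<main style="font-family: sans-serif; padding: 1rem">
  <h1>Today's Tasks</h1>
  <ul>
    <li>Review pull requests</li>
    <li>Record demo video</li>
  </ul>
</main>
```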

Canvas pages support the full desktop experience: taskbar navigation, right-click context menus, wallpaper customization, and multiple window management. It is designed to feel like an operating system, not a chat window.

Adding Skills

Skills are modular capability packages that extend what your assistant can do. OpenVoiceUI ships with 35+ built-in skills covering:

  • Social media management — post scheduling, content creation, analytics
  • SEO optimization — keyword research, topical maps, content briefs
  • Email via AgentMail — send, receive, and manage email through voice
  • Image generation — FLUX.1 and Stable Diffusion 3.5 with multiple presets
  • Music creation — AI-generated tracks via Suno
  • Video production — Remotion Studio integration with voice-over
  • Business tools — briefings, reports, CRM integrations, lead tracking

Creating a custom skill is straightforward: add a SKILL.md file to the shared skills directory. No code changes to the core application required. Your skill file describes capabilities, provides examples, and defines any API endpoints the skill needs.
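The exact schema is documented in the repository; as a rough sketch, a skill file describes the capability, gives sample phrases, and lists any endpoints it calls. Everything below, including the skill name and URL, is hypothetical:

```markdown
# Weather Briefing

A hypothetical skill: gives a spoken weather summary for a saved location.

## Example phrases
- "What's the weather looking like today?"
- "Give me my morning briefing."

## Endpoints
- GET https://api.example.com/weather (illustrative placeholder)
```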

Deploying to a VPS

For always-on access, deploy OpenVoiceUI to any VPS provider. The Docker Compose setup works out of the box on any Linux server with Docker installed. Point a domain at your server, set up SSL (we recommend Cloudflare for the easiest setup), and your voice assistant is accessible from any browser, anywhere.

OpenVoiceUI runs comfortably on a 2-core, 4GB RAM server. For multiple users or heavy workloads, scale up to 4 cores and 16GB. The entire stack — OpenVoiceUI, OpenClaw, and TTS — typically uses under 3GB of memory.

What's Next

You now have a working voice AI assistant. From here:

  • Explore the GitHub repository for full documentation and source code
  • Check out the features page for a deep dive into every capability
  • Read about why we built OpenVoiceUI for the philosophy behind the project
  • Star the repo, file issues, submit PRs — OpenVoiceUI is community-driven and MIT licensed

Ready to build?

OpenVoiceUI is free, open source, and MIT licensed.