
Building a Voice Agent That Runs Entirely On-Device

A step-by-step tutorial for building an on-device voice agent using Whisper, a local LLM, and Kokoro TTS — no cloud APIs, no internet required.

Glenn Sonna
11 min read
Tags: tutorial, voice-agent, tts, asr, on-device-ai, flutter

Most voice assistants send your audio to a server, process it remotely, and stream an answer back. That round trip adds latency, costs money per request, and requires a network connection. It also means your voice data leaves your device.

We are going to build the opposite: a voice agent where everything runs locally. The user taps a button, speaks a question, and the app transcribes it, generates a response, and reads it back out loud — all without touching the internet.

The full loop looks like this:

Audio In → Whisper (ASR) → Text → Llama 3.2 (LLM) → Text → Kokoro (TTS) → Audio Out

By the end of this tutorial you will have a working Flutter app that performs this entire pipeline on an iPhone or Android device.

Architecture Overview

The voice agent is a three-stage pipeline. Each stage is a separate ML model, and Xybrid handles the data flow between them automatically.

Stage 1 — Speech Recognition (ASR). Whisper Tiny takes raw audio and produces a text transcript. It runs well on mobile hardware and supports multiple languages, though we will stick with English here.

Stage 2 — Reasoning (LLM). Llama 3.2 1B takes the transcript and generates a conversational response. The 1B parameter variant is small enough to run on-device while still producing coherent, useful answers.

Stage 3 — Voice Synthesis (TTS). Kokoro 82M takes the generated text and produces natural-sounding speech audio. At only 82 million parameters, it is fast enough for real-time synthesis on modern phones.
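To make the data flow concrete, here is a minimal sketch of the three stages as plain Dart functions chained together. The typedefs and runVoicePipeline are hypothetical names used only for illustration, not part of Xybrid; the actual Xybrid calls appear in the steps that follow.

```dart
import 'dart:typed_data';

// Hypothetical stage signatures, one per model in the pipeline.
typedef Transcribe = Future<String> Function(Uint8List audio);
typedef Generate = Future<String> Function(String prompt);
typedef Synthesize = Future<Uint8List> Function(String text);

/// Runs audio -> text -> text -> audio, awaiting each stage in turn.
Future<Uint8List> runVoicePipeline(
  Uint8List audioIn, {
  required Transcribe asr,
  required Generate llm,
  required Synthesize tts,
}) async {
  final transcript = await asr(audioIn); // Stage 1: speech recognition
  final reply = await llm(transcript); // Stage 2: reasoning
  return tts(reply); // Stage 3: voice synthesis
}
```

Each stage only sees the output of the previous one, which is why the models can be swapped independently.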

Prerequisites

Before you start, make sure you have:

  • Flutter SDK 3.22 or later installed and configured
  • A physical device for testing (iPhone 12+ or Pixel 6+). The simulator works for development but performance benchmarks require real hardware.
  • About 2 GB of free storage on the test device for the three models
  • Basic familiarity with Flutter and async Dart

Step 1: Set Up the Flutter Project

Create a new Flutter project or open an existing one. Add the Xybrid Flutter package to your pubspec.yaml:

dependencies:
  flutter:
    sdk: flutter
  xybrid_flutter: ^0.1.0

Run flutter pub get, then initialize Xybrid in your app’s entry point:

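import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';

Future<void> main() async {
  // The binding must exist before any plugin is initialized.
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  // VoiceAgentApp is your root widget; it should show the VoiceAgentScreen built in Step 6.
  runApp(const VoiceAgentApp());
}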

Step 2: Load the Models

Load all three models and verify they are ready:

final whisper = await Xybrid.model(modelId: 'whisper-tiny').load();
final llm = await Xybrid.model(modelId: 'llama-3.2-1b').load();
final tts = await Xybrid.model(modelId: 'kokoro-82m').load();

The first call downloads each model to the device cache. Subsequent calls load from the cache and skip the download.

Step 3: Implement Speech Recognition

Record audio from the device microphone. Whisper expects 16 kHz mono WAV, so configure your recorder to produce that format.

// Requires: import 'dart:typed_data';
// `recorder` stands for whichever recording package you use.
final Uint8List audioBytes = await recorder.stopAndGetBytes();

final transcript = await whisper.run(
  envelope: Envelope.audio(bytes: audioBytes),
);

print('User said: ${transcript.text}');
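If transcripts come back empty or garbled, the recording format is a likely cause. The sketch below reads the channel count and sample rate from a WAV header so you can check them before calling Whisper. WavFormat and readWavFormat are hypothetical helpers written for this tutorial, not part of Xybrid; they rely only on the standard RIFF/WAVE layout.

```dart
import 'dart:typed_data';

/// The fields of a WAV "fmt " chunk that matter for Whisper input.
class WavFormat {
  const WavFormat(this.channels, this.sampleRate, this.bitsPerSample);
  final int channels;
  final int sampleRate;
  final int bitsPerSample;

  /// True when the audio is mono at 16 kHz.
  bool get isWhisperReady => channels == 1 && sampleRate == 16000;
}

/// Returns the format described by the header, or null if [bytes]
/// is not a RIFF/WAVE file with a readable "fmt " chunk.
WavFormat? readWavFormat(Uint8List bytes) {
  if (bytes.length < 12) return null;
  final data = ByteData.sublistView(bytes);
  String tag(int offset) => String.fromCharCodes(bytes, offset, offset + 4);
  if (tag(0) != 'RIFF' || tag(8) != 'WAVE') return null;

  // Walk the chunk list; "fmt " is usually first but does not have to be.
  var offset = 12;
  while (offset + 8 <= bytes.length) {
    final size = data.getUint32(offset + 4, Endian.little);
    if (tag(offset) == 'fmt ') {
      if (size < 16 || offset + 24 > bytes.length) return null;
      return WavFormat(
        data.getUint16(offset + 10, Endian.little), // channel count
        data.getUint32(offset + 12, Endian.little), // sample rate in Hz
        data.getUint16(offset + 22, Endian.little), // bits per sample
      );
    }
    // Chunk bodies are padded to an even number of bytes.
    offset += 8 + size + (size.isOdd ? 1 : 0);
  }
  return null;
}
```

A typical use is to call readWavFormat(audioBytes) right after recording and show an error instead of transcribing when isWhisperReady is false.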

Step 4: Add the Reasoning Layer

Pass the transcript into the local LLM:

final response = await llm.run(
  envelope: Envelope.text(text: transcript.text!),
);

print('Assistant: ${response.text}');

The LLM generates a conversational response. Keep max_tokens moderate (100–256) — voice responses that run longer than 20–30 seconds feel unnatural.
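To sanity-check a response before speaking it, you can estimate its spoken duration from the word count. This sketch assumes a speaking rate of roughly 150 words per minute, which is a common rule of thumb and not a measured Kokoro figure; estimateSpeechSeconds is a hypothetical helper.

```dart
/// Estimates how long [text] takes to speak at [wordsPerMinute].
/// The default of 150 is an assumed conversational rate.
double estimateSpeechSeconds(String text, {double wordsPerMinute = 150}) {
  final words = text
      .trim()
      .split(RegExp(r'\s+'))
      .where((word) => word.isNotEmpty)
      .length;
  return words * 60 / wordsPerMinute;
}
```

At that rate a 75-word answer takes about 30 seconds, so responses much longer than that are worth shortening.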

Step 5: Add Voice Synthesis

Convert the LLM’s text response to speech:

final audio = await tts.run(
  envelope: Envelope.text(text: response.text!),
);

final audioPlayer = AudioPlayer();
await audioPlayer.playBytes(audio.audioBytes!);

Kokoro produces natural-sounding audio at a 24 kHz sample rate.

Step 6: Wire It All Together

Here is the complete interaction flow as a single widget:

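import 'dart:typed_data';

import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';
// Also import the audio package that provides AudioPlayer.

class VoiceAgentScreen extends StatefulWidget {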
  const VoiceAgentScreen({super.key});

  @override
  State<VoiceAgentScreen> createState() => _VoiceAgentScreenState();
}

class _VoiceAgentScreenState extends State<VoiceAgentScreen> {
  XybridModel? _whisper;
  XybridModel? _llm;
  XybridModel? _tts;
  bool _isProcessing = false;
  String _transcript = '';
  String _response = '';

  @override
  void initState() {
    super.initState();
    _loadModels();
  }

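  Future<void> _loadModels() async {
    final whisper = await Xybrid.model(modelId: 'whisper-tiny').load();
    final llm = await Xybrid.model(modelId: 'llama-3.2-1b').load();
    final tts = await Xybrid.model(modelId: 'kokoro-82m').load();
    if (!mounted) return; // The screen may have been closed while loading.
    setState(() {
      _whisper = whisper;
      _llm = llm;
      _tts = tts;
    });
  }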

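  Future<void> _processAudio(Uint8List audioBytes) async {
    // Do nothing until all three models have finished loading.
    final whisper = _whisper, llm = _llm, tts = _tts;
    if (whisper == null || llm == null || tts == null) return;

    setState(() => _isProcessing = true);

    try {
      // 1. Transcribe
      final transcript = await whisper.run(
        envelope: Envelope.audio(bytes: audioBytes),
      );
      final userText = transcript.text?.trim() ?? '';
      if (!mounted) return;
      setState(() => _transcript = userText);
      if (userText.isEmpty) return; // Silence or unintelligible audio.

      // 2. Generate response
      final response = await llm.run(
        envelope: Envelope.text(text: userText),
      );
      final replyText = response.text?.trim() ?? '';
      if (!mounted) return;
      setState(() => _response = replyText);
      if (replyText.isEmpty) return;

      // 3. Speak
      final audio = await tts.run(
        envelope: Envelope.text(text: replyText),
      );
      final speech = audio.audioBytes;
      if (speech == null) return;
      final player = AudioPlayer();
      await player.playBytes(speech);
    } finally {
      // Runs on every exit path, including the early returns above.
      if (mounted) setState(() => _isProcessing = false);
    }
  }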

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Center(
        child: Column(
          mainAxisAlignment: MainAxisAlignment.center,
          children: [
            Text(_transcript, style: Theme.of(context).textTheme.bodyLarge),
            const SizedBox(height: 16),
            Text(_response, style: Theme.of(context).textTheme.headlineSmall),
          ],
        ),
      ),
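      floatingActionButton: FloatingActionButton(
        // Disabled until the models are loaded and while a request is running.
        onPressed: (_tts == null || _isProcessing)
            ? null
            : () {
                // Record audio, then call _processAudio(audioBytes)
              },
        child: Icon(_isProcessing ? Icons.hourglass_empty : Icons.mic),
      ),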
    );
  }
}

The interaction flow:

  1. User taps the microphone button to start recording
  2. Taps again to stop — recorded bytes go to _processAudio
  3. Whisper transcribes the audio
  4. Llama generates a response
  5. Kokoro synthesizes speech and plays it back
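One way to model the tap behaviour in the list above is a small state machine. AgentState, onMicTap, and onPipelineDone are hypothetical names for this sketch; the widget in this tutorial only tracks _isProcessing, so treat this as one possible extension and not as the tutorial's implementation.

```dart
/// The three phases the microphone button can be in.
enum AgentState { idle, recording, processing }

/// Returns the next state when the user taps the microphone button.
AgentState onMicTap(AgentState current) => switch (current) {
      // First tap starts recording.
      AgentState.idle => AgentState.recording,
      // Second tap stops recording and hands the bytes to the pipeline.
      AgentState.recording => AgentState.processing,
      // Taps are ignored while the pipeline is running.
      AgentState.processing => AgentState.processing,
    };

/// Returns the state after playback has finished or the pipeline has failed.
AgentState onPipelineDone(AgentState current) =>
    current == AgentState.processing ? AgentState.idle : current;
```

Keeping the transitions in pure functions makes them easy to unit test without a device or a widget tree.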

Performance Results

Benchmarked on real devices with models loaded in memory:

iPhone 15 Pro (A17 Pro, 8 GB RAM)

Stage   Model                Latency
ASR     Whisper Tiny         ~200 ms
LLM     Llama 3.2 1B (Q4)    ~500 ms to first token
TTS     Kokoro 82M           ~150 ms
Total                        ~850 ms to first audio

Pixel 8 Pro (Tensor G3, 12 GB RAM)

Stage   Model                Latency
ASR     Whisper Tiny         ~280 ms
LLM     Llama 3.2 1B (Q4)    ~650 ms to first token
TTS     Kokoro 82M           ~180 ms
Total                        ~1.1 s to first audio

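On the iPhone, first audio arrives in under one second, which makes the interaction feel responsive. The LLM is the slowest stage on both devices. Note that the totals add the LLM's time to first token, not its full generation time, so they describe a pipeline that streams LLM output into TTS; the non-streaming code in this tutorial waits for the complete response and will take longer to produce audio.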
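The totals are simply the sum of the per-stage figures, and the same arithmetic identifies the slowest stage. The sketch below reproduces it with the numbers from the two tables; StageLatency, totalMs, and bottleneck are hypothetical helper names.

```dart
/// One row of a latency table.
class StageLatency {
  const StageLatency(this.stage, this.ms);
  final String stage;
  final int ms;
}

/// Sums the per-stage latencies in milliseconds.
int totalMs(List<StageLatency> stages) =>
    stages.fold<int>(0, (sum, s) => sum + s.ms);

/// Returns the stage with the highest latency (the list must not be empty).
StageLatency bottleneck(List<StageLatency> stages) =>
    stages.reduce((a, b) => b.ms > a.ms ? b : a);

// Figures copied from the tables in this section.
const iphone15Pro = [
  StageLatency('ASR', 200),
  StageLatency('LLM first token', 500),
  StageLatency('TTS', 150),
];
const pixel8Pro = [
  StageLatency('ASR', 280),
  StageLatency('LLM first token', 650),
  StageLatency('TTS', 180),
];
```

totalMs(iphone15Pro) is 850 and totalMs(pixel8Pro) is 1110, matching the table totals, and bottleneck returns the LLM row for both devices.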

Memory usage peaks around 1.8 GB with all three models loaded. On devices with 6 GB or more RAM, this leaves plenty of room for the rest of the app.

Next Steps

Conversation Memory

Right now, each interaction is stateless. Xybrid supports conversation context that carries history across turns:

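// Create the context once (for example as a field on the State object)
// and pass the same instance on every turn so the history accumulates.
final conversation = ConversationContext();

final response = await llm.run(
  envelope: Envelope.text(text: transcript.text!),
  context: conversation,
);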

The ConversationContext manages a rolling window of past messages (default: 50 turns).
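How Xybrid implements this internally is not covered here. Purely as an illustration of what a rolling window means, the following sketch keeps the most recent turns and drops the oldest once a limit is reached. Turn and RollingHistory are hypothetical classes, not Xybrid types.

```dart
import 'dart:collection';

/// One message in the conversation.
class Turn {
  const Turn(this.role, this.text);
  final String role; // for example 'user' or 'assistant'
  final String text;
}

/// Keeps at most [maxTurns] turns, discarding the oldest first.
class RollingHistory {
  RollingHistory({this.maxTurns = 50}) : assert(maxTurns > 0);

  final int maxTurns;
  final Queue<Turn> _turns = ListQueue<Turn>();

  void add(Turn turn) {
    _turns.addLast(turn);
    while (_turns.length > maxTurns) {
      _turns.removeFirst();
    }
  }

  /// The retained turns, oldest first.
  List<Turn> get turns => List.unmodifiable(_turns);
}
```

A bounded window matters on-device because every retained turn is fed back into the LLM, which increases both memory use and time to first token.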

Wake Word Detection

A hands-free experience requires wake word detection. You can add a lightweight wake word model that listens continuously and triggers the main flow only when it hears a specific phrase.

Streaming Responses

The current implementation waits for the LLM to finish before starting TTS. With streaming, you can begin synthesizing audio as soon as the first sentence is complete, reducing perceived latency.
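The piece that makes this possible is a buffer that collects LLM tokens and releases text one sentence at a time. Below is a minimal sketch of such a buffer. SentenceChunker is a hypothetical class, and how tokens are streamed out of Xybrid is not shown in this tutorial, so the token source is left to you.

```dart
/// Buffers streamed tokens and emits complete sentences.
class SentenceChunker {
  // A terminator followed by whitespace marks a finished sentence.
  static final RegExp _boundary = RegExp(r'[.!?]\s+');
  String _pending = '';

  /// Adds [token] and returns any sentences it completed.
  List<String> add(String token) {
    _pending += token;
    final finished = <String>[];
    var match = _boundary.firstMatch(_pending);
    while (match != null) {
      // Keep the terminator, drop the whitespace after it.
      final sentence = _pending.substring(0, match.start + 1).trim();
      if (sentence.isNotEmpty) finished.add(sentence);
      _pending = _pending.substring(match.end);
      match = _boundary.firstMatch(_pending);
    }
    return finished;
  }

  /// Returns the remaining text once the stream ends, or null if none.
  String? flush() {
    final rest = _pending.trim();
    _pending = '';
    return rest.isEmpty ? null : rest;
  }
}
```

Each sentence returned by add can be sent to TTS while the LLM keeps generating, and flush handles the final sentence, which has no trailing whitespace. The boundary rule is deliberately simple and will split on abbreviations such as "e.g. this".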


The complete source code for this tutorial is available in the Xybrid examples repository. If you run into issues, open a discussion on the GitHub repo.
