Most voice assistants send your audio to a server, process it remotely, and stream an answer back. That round trip adds latency, costs money per request, and requires a network connection. It also means your voice data leaves your device.
We are going to build the opposite: a voice agent where everything runs locally. The user taps a button, speaks a question, and the app transcribes it, generates a response, and reads it back out loud, all without touching the internet once the models are on the device.
The full loop looks like this:
Audio In → Whisper (ASR) → Text → Llama 3.2 (LLM) → Text → Kokoro (TTS) → Audio Out
By the end of this tutorial you will have a working Flutter app that performs this entire pipeline on an iPhone or Android device.
Architecture Overview
The voice agent is a three-stage pipeline. Each stage is a separate ML model, and Xybrid handles the data flow between them automatically.
Stage 1 — Speech Recognition (ASR). Whisper Tiny takes raw audio and produces a text transcript. It runs well on mobile hardware and supports multiple languages, though we will stick with English here.
Stage 2 — Reasoning (LLM). Llama 3.2 1B takes the transcript and generates a conversational response. The 1B parameter variant is small enough to run on-device while still producing coherent, useful answers.
Stage 3 — Voice Synthesis (TTS). Kokoro 82M takes the generated text and produces natural-sounding speech audio. At only 82 million parameters, it is fast enough for real-time synthesis on modern phones.
Prerequisites
Before you start, make sure you have:
- Flutter SDK 3.22 or later installed and configured
- A physical device for testing (iPhone 12+ or Pixel 6+). The simulator works for development but performance benchmarks require real hardware.
- About 2 GB of free storage on the test device for the three models
- Basic familiarity with Flutter and async Dart
Step 1: Set Up the Flutter Project
Create a new Flutter project or open an existing one. Add the Xybrid Flutter package to your pubspec.yaml:
dependencies:
  flutter:
    sdk: flutter
  xybrid_flutter: ^0.1.0
Run flutter pub get, then initialize Xybrid in your app’s entry point:
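import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';

Future<void> main() async {
  // Required because we call plugin code before runApp().
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  runApp(const VoiceAgentApp());
}
Step 2: Load the Models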
Load all three models and verify they are ready:
final whisper = await Xybrid.model(modelId: 'whisper-tiny').load();
final llm = await Xybrid.model(modelId: 'llama-3.2-1b').load();
final tts = await Xybrid.model(modelId: 'kokoro-82m').load();
The first call downloads each model to the device cache, so it needs a network connection once. Subsequent calls load from the cache without downloading again.
Step 3: Implement Speech Recognition
Record audio from the device microphone. The recorded audio should be 16 kHz mono WAV, the format Whisper expects.
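Some recording plugins return raw PCM samples rather than a finished WAV file. If yours does, the sketch below wraps 16-bit samples in a canonical 44-byte WAV header. The function name `pcm16ToWav` is our own and is not part of Xybrid. It assumes the samples are already 16 kHz mono, because it does not resample or downmix.

```dart
import 'dart:typed_data';

/// Wraps raw 16-bit PCM samples in a canonical 44-byte WAV (RIFF) header.
/// Hypothetical helper: it only writes the header and copies the samples.
Uint8List pcm16ToWav(
  Int16List samples, {
  int sampleRate = 16000,
  int channels = 1,
}) {
  const bytesPerSample = 2; // 16-bit PCM
  final dataSize = samples.length * bytesPerSample;
  final out = ByteData(44 + dataSize);

  // Writes a four-character chunk tag such as 'RIFF' at [offset].
  void writeTag(int offset, String tag) {
    for (var i = 0; i < 4; i++) {
      out.setUint8(offset + i, tag.codeUnitAt(i));
    }
  }

  writeTag(0, 'RIFF');
  out.setUint32(4, 36 + dataSize, Endian.little); // file size minus 8 bytes
  writeTag(8, 'WAVE');
  writeTag(12, 'fmt ');
  out.setUint32(16, 16, Endian.little); // fmt chunk size for PCM
  out.setUint16(20, 1, Endian.little); // audio format 1 = uncompressed PCM
  out.setUint16(22, channels, Endian.little);
  out.setUint32(24, sampleRate, Endian.little);
  out.setUint32(
      28, sampleRate * channels * bytesPerSample, Endian.little); // byte rate
  out.setUint16(32, channels * bytesPerSample, Endian.little); // block align
  out.setUint16(34, 16, Endian.little); // bits per sample
  writeTag(36, 'data');
  out.setUint32(40, dataSize, Endian.little);

  // WAV stores samples little-endian, so write them explicitly.
  for (var i = 0; i < samples.length; i++) {
    out.setInt16(44 + i * bytesPerSample, samples[i], Endian.little);
  }
  return out.buffer.asUint8List();
}
```

If your recorder already produces a 16 kHz mono WAV file, you can skip this helper and pass its bytes straight through.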
// `recorder` stands in for whichever audio recording plugin you use.
final Uint8List audioBytes = await recorder.stopAndGetBytes();
final transcript = await whisper.run(
  envelope: Envelope.audio(bytes: audioBytes),
);
print('User said: ${transcript.text}');
Step 4: Add the Reasoning Layer
Pass the transcript into the local LLM:
final response = await llm.run(
  envelope: Envelope.text(text: transcript.text!),
);
print('Assistant: ${response.text}');
The LLM generates a conversational response. Keep max_tokens moderate (100–256): voice responses that run longer than 20–30 seconds feel unnatural.
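A token limit only bounds speaking time indirectly, so it can help to check the text itself before sending it to TTS. The sketch below estimates spoken duration from a word count and trims a response to whole sentences that fit a time budget. Both function names are our own, the 150 words-per-minute rate is an assumed average rather than a measured Kokoro figure, and the sentence splitting is deliberately naive.

```dart
/// Rough spoken duration in seconds, assuming a fixed speaking rate.
/// The 150 wpm default is an assumption; measure your TTS voice to tune it.
double estimateSpeechSeconds(String text, {int wordsPerMinute = 150}) {
  final words =
      text.trim().split(RegExp(r'\s+')).where((w) => w.isNotEmpty).length;
  return words * 60 / wordsPerMinute;
}

/// Keeps whole sentences from the start of [text] while the estimated
/// duration stays within [maxSeconds]. The first sentence is always kept,
/// even if it is over budget, so the user never gets silence.
String trimToSpokenBudget(
  String text, {
  double maxSeconds = 25,
  int wordsPerMinute = 150,
}) {
  // Naive splitter: a sentence ends at '.', '!' or '?'. Abbreviations and
  // decimals such as "3.2" will be split too.
  final sentences = RegExp(r'[^.!?]+[.!?]*')
      .allMatches(text)
      .map((m) => m.group(0)!.trim())
      .where((s) => s.isNotEmpty);

  final kept = <String>[];
  for (final sentence in sentences) {
    final candidate = [...kept, sentence].join(' ');
    final tooLong =
        estimateSpeechSeconds(candidate, wordsPerMinute: wordsPerMinute) >
            maxSeconds;
    if (kept.isNotEmpty && tooLong) break;
    kept.add(sentence);
  }
  return kept.join(' ');
}
```

You would call `trimToSpokenBudget(response.text ?? '')` before the synthesis step. Trimming on sentence boundaries avoids audio that stops mid-thought.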
Step 5: Add Voice Synthesis
Convert the LLM’s text response to speech:
final audio = await tts.run(
  envelope: Envelope.text(text: response.text!),
);
final audioPlayer = AudioPlayer();
await audioPlayer.playBytes(audio.audioBytes!);
Kokoro produces high-quality, natural-sounding audio at a 24 kHz sample rate.
Step 6: Wire It All Together
Here is the complete interaction flow as a single widget:
import 'dart:typed_data';

import 'package:flutter/material.dart';
import 'package:xybrid_flutter/xybrid_flutter.dart';
// AudioPlayer comes from your audio playback package of choice.

class VoiceAgentScreen extends StatefulWidget {
  const VoiceAgentScreen({super.key});

  @override
  State<VoiceAgentScreen> createState() => _VoiceAgentScreenState();
}

class _VoiceAgentScreenState extends State<VoiceAgentScreen> {
  XybridModel? _whisper;
  XybridModel? _llm;
  XybridModel? _tts;
  bool _isProcessing = false;
  String _transcript = '';
  String _response = '';

  bool get _modelsReady => _whisper != null && _llm != null && _tts != null;

  @override
  void initState() {
    super.initState();
    _loadModels();
  }

  Future<void> _loadModels() async {
    _whisper = await Xybrid.model(modelId: 'whisper-tiny').load();
    _llm = await Xybrid.model(modelId: 'llama-3.2-1b').load();
    _tts = await Xybrid.model(modelId: 'kokoro-82m').load();
    // The widget may have been disposed while the models were loading.
    if (mounted) setState(() {});
  }

  Future<void> _processAudio(Uint8List audioBytes) async {
    // Ignore input until the models are loaded and while a request is running.
    if (!_modelsReady || _isProcessing) return;
    setState(() => _isProcessing = true);
    try {
      // 1. Transcribe
      final transcript = await _whisper!.run(
        envelope: Envelope.audio(bytes: audioBytes),
      );
      final userText = transcript.text ?? '';
      if (!mounted) return;
      setState(() => _transcript = userText);
      if (userText.trim().isEmpty) return; // nothing was recognized

      // 2. Generate response
      final response = await _llm!.run(
        envelope: Envelope.text(text: userText),
      );
      final replyText = response.text ?? '';
      if (!mounted) return;
      setState(() => _response = replyText);
      if (replyText.trim().isEmpty) return; // nothing to speak

      // 3. Speak
      final audio = await _tts!.run(
        envelope: Envelope.text(text: replyText),
      );
      final speech = audio.audioBytes;
      if (speech == null) return;
      final player = AudioPlayer();
      await player.playBytes(speech);
    } finally {
      // Runs on every exit path, including the early returns above.
      if (mounted) setState(() => _isProcessing = false);
    }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Center(
        child: Column(
          mainAxisAlignment: MainAxisAlignment.center,
          children: [
            Text(_transcript, style: Theme.of(context).textTheme.bodyLarge),
            const SizedBox(height: 16),
            Text(_response, style: Theme.of(context).textTheme.headlineSmall),
          ],
        ),
      ),
      floatingActionButton: FloatingActionButton(
        // A null callback disables the button while loading or processing.
        onPressed: _modelsReady && !_isProcessing
            ? () {
                // Record audio, then call _processAudio(audioBytes)
              }
            : null,
        child: Icon(_isProcessing ? Icons.hourglass_empty : Icons.mic),
      ),
    );
  }
}
The interaction flow:
- User taps the microphone button to start recording
- User taps again to stop, and the recorded bytes go to _processAudio
- Whisper transcribes the audio
- Llama generates a response
- Kokoro synthesizes speech and plays it back
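The tap-to-start, tap-to-stop flow above is easier to reason about as a small state machine than as a set of booleans. The following is one possible sketch, not part of the tutorial's widget. The enum names and the `nextState` function are our own.

```dart
/// States the microphone button can be in (hypothetical names).
enum AgentState { loading, idle, recording, processing }

/// Things that can happen to the agent (hypothetical names).
enum AgentEvent { modelsLoaded, micTapped, pipelineFinished }

/// Pure transition function: returns the next state for a given event.
/// Events that make no sense in the current state are ignored, so a tap
/// during loading or processing leaves the state unchanged.
AgentState nextState(AgentState state, AgentEvent event) {
  switch (state) {
    case AgentState.loading:
      return event == AgentEvent.modelsLoaded ? AgentState.idle : state;
    case AgentState.idle:
      // First tap starts recording.
      return event == AgentEvent.micTapped ? AgentState.recording : state;
    case AgentState.recording:
      // Second tap stops recording and hands the bytes to the pipeline.
      return event == AgentEvent.micTapped ? AgentState.processing : state;
    case AgentState.processing:
      return event == AgentEvent.pipelineFinished ? AgentState.idle : state;
  }
}
```

In the widget, you could hold a single `AgentState` field, update it with `nextState` inside `setState`, and start or stop the recorder when the state enters or leaves `recording`. Keeping the transition function pure also makes it easy to unit test without a device.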
Performance Results
Benchmarked on real devices with models loaded in memory:
iPhone 15 Pro (A17 Pro, 8 GB RAM)
| Stage | Model | Latency |
|---|---|---|
| ASR | Whisper Tiny | ~200ms |
| LLM | Llama 3.2 1B (Q4) | ~500ms to first token |
| TTS | Kokoro 82M | ~150ms |
| Total | | ~850ms to first audio |
Pixel 8 Pro (Tensor G3, 12 GB RAM)
| Stage | Model | Latency |
|---|---|---|
| ASR | Whisper Tiny | ~280ms |
| LLM | Llama 3.2 1B (Q4) | ~650ms to first token |
| TTS | Kokoro 82M | ~180ms |
| Total | | ~1.1s to first audio |
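At roughly 850 ms to first audio, the iPhone stays under one second, which makes the interaction feel responsive. The LLM is the largest contributor on both devices. Note that the totals count the LLM's time to first token, so reaching them in practice depends on streaming LLM output into TTS rather than waiting for the full response.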
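The totals in the tables are plain sums of the three stage latencies. The short sketch below reproduces that arithmetic so you can plug in measurements from your own devices. The map keys and function names are our own, and the numbers are copied from the tables above.

```dart
/// Sums per-stage latencies (in milliseconds) into a time-to-first-audio
/// estimate, mirroring how the tables add the three stages together.
int timeToFirstAudioMs(Map<String, int> stageLatenciesMs) =>
    stageLatenciesMs.values.fold(0, (sum, ms) => sum + ms);

/// Fraction of the total spent in one stage (0.0 to 1.0).
double stageShare(Map<String, int> stageLatenciesMs, String stage) =>
    stageLatenciesMs[stage]! / timeToFirstAudioMs(stageLatenciesMs);

// Figures from the benchmark tables, in milliseconds.
const iphone15Pro = <String, int>{
  'asr': 200,
  'llmFirstToken': 500,
  'tts': 150,
};
const pixel8Pro = <String, int>{
  'asr': 280,
  'llmFirstToken': 650,
  'tts': 180,
};
```

`timeToFirstAudioMs(iphone15Pro)` gives 850 and `timeToFirstAudioMs(pixel8Pro)` gives 1110, matching the rounded totals. On both devices `stageShare(..., 'llmFirstToken')` comes out a little under 60%, which is why the LLM stage is the first place to look for savings.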
Memory usage peaks around 1.8 GB with all three models loaded. On devices with 6 GB or more RAM, this leaves plenty of room for the rest of the app.
Next Steps
Conversation Memory
Right now, each interaction is stateless. Xybrid supports conversation context that carries history across turns:
// Create the context once and reuse it across turns so history accumulates.
final context = ConversationContext();
final response = await llm.run(
  envelope: Envelope.text(text: transcript.text!),
  context: context,
);
The ConversationContext manages a rolling window of past messages (default: 50 turns).
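To make the rolling-window idea concrete, here is a minimal sketch of how such a window can behave. It is not Xybrid's implementation. `Turn` and `RollingHistory` are hypothetical names, and the sketch counts every message as one turn, which may differ from how ConversationContext counts.

```dart
import 'dart:collection';

/// One message in the conversation (hypothetical type).
class Turn {
  const Turn(this.role, this.text);
  final String role; // 'user' or 'assistant'
  final String text;
}

/// Keeps only the most recent [maxTurns] messages, dropping the oldest first.
class RollingHistory {
  RollingHistory({this.maxTurns = 50});

  final int maxTurns;
  final Queue<Turn> _turns = ListQueue<Turn>();

  /// Appends a message and evicts from the front once the window is full.
  void add(Turn turn) {
    _turns.addLast(turn);
    while (_turns.length > maxTurns) {
      _turns.removeFirst();
    }
  }

  /// Read-only snapshot of the current window, oldest first.
  List<Turn> get turns => List.unmodifiable(_turns);

  /// Flattens the window into a simple "role: text" transcript.
  String toPrompt() => _turns.map((t) => '${t.role}: ${t.text}').join('\n');
}
```

A bounded window matters on-device because a 1B model has a limited context length, and every extra turn in the prompt adds to the time before the first token.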
Wake Word Detection
A hands-free experience requires wake word detection. You can add a lightweight wake word model that listens continuously and triggers the main flow only when it hears a specific phrase.
Streaming Responses
The current implementation waits for the LLM to finish before starting TTS. With streaming, you can begin synthesizing audio as soon as the first sentence is complete, reducing perceived latency.
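The core of that approach is a buffer that collects streamed text and releases it one complete sentence at a time. The sketch below shows one way to do it. The class name `SentenceBuffer` is our own, and how you obtain streamed chunks from the LLM depends on the Xybrid streaming API, which this tutorial does not cover.

```dart
/// Accumulates streamed text chunks and emits complete sentences.
/// Hypothetical helper; the boundary rule is deliberately simple.
class SentenceBuffer {
  final StringBuffer _pending = StringBuffer();

  // A sentence ends at '.', '!' or '?' followed by whitespace. Waiting for
  // the whitespace avoids splitting inside tokens such as "3.2".
  static final RegExp _boundary = RegExp(r'[.!?]\s');

  /// Appends [chunk] and returns any sentences it completed, in order.
  List<String> add(String chunk) {
    _pending.write(chunk);
    final done = <String>[];
    var text = _pending.toString();
    var match = _boundary.firstMatch(text);
    while (match != null) {
      // Include the punctuation mark, drop the whitespace after it.
      final sentence = text.substring(0, match.start + 1).trim();
      if (sentence.isNotEmpty) done.add(sentence);
      text = text.substring(match.end);
      match = _boundary.firstMatch(text);
    }
    _pending
      ..clear()
      ..write(text);
    return done;
  }

  /// Returns whatever is left once the stream ends, and resets the buffer.
  String flush() {
    final rest = _pending.toString().trim();
    _pending.clear();
    return rest;
  }
}
```

Each sentence returned by `add` can be sent to the TTS model right away while the LLM keeps generating, and `flush` picks up the final sentence when generation finishes. Sentence-sized chunks are a reasonable unit because they give the synthesizer enough context for natural intonation.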
The complete source code for this tutorial is available in the Xybrid examples repository. If you run into issues, open a discussion on the GitHub repo.