Voice mode
Real-time voice — STT, TTS, push to talk, continuous
Headmaster supports real-time voice interaction — talk to the agent with your voice and hear responses spoken aloud.
Under the hood
Voice mode uses:
- Speech-to-text (STT): Your spoken audio is transcribed to text and sent to the agent as a message. Headmaster uses a local faster-whisper model by default, or OpenAI Whisper for higher accuracy.
- Text-to-speech (TTS): The agent's text response is converted to speech audio and played back. Headmaster supports OpenAI, xAI, MiniMax, ElevenLabs, and Edge as TTS providers.
The cycle is: you speak → STT transcribes → agent processes → TTS speaks the response → you speak again.
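One turn of this cycle can be sketched as three pluggable steps. This is only an illustration of the data flow, not Headmaster's actual code: `run_voice_turn` and the `fake_*` backends are hypothetical names, and the stand-ins exist so the sketch runs without a model or an API key.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn of the cycle: spoken audio in, spoken audio out."""
    text = transcribe(audio)   # STT: audio -> text
    reply = ask_agent(text)    # the agent processes the message
    return synthesize(reply)   # TTS: text -> audio


# Hypothetical stand-in backends (assumptions, not real STT/TTS calls).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(message: str) -> str:
    return f"You said: {message}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_voice_turn(b"hello", fake_stt, fake_agent, fake_tts)
print(reply_audio)  # b'You said: hello'
```

Passing the three steps in as callables mirrors the fact that the STT backend and TTS provider can be swapped independently.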
Enabling voice mode
- Open Settings → My Headmaster → Look → Voice.
- Turn on Enable voice mode.
- Choose a TTS provider and voice.
- Choose an STT backend (local faster-whisper or OpenAI Whisper).
- Save.
TTS providers
| Provider | Voices | Notes |
|---|---|---|
| OpenAI | Alloy, Echo, Fable, Onyx, Nova, Shimmer | Natural, high quality. Requires OpenAI key. |
| xAI | Various | Requires xAI key. |
| MiniMax | Various | Requires MiniMax key. |
| ElevenLabs | 5k-40k voice options | Premium quality. Requires ElevenLabs key. |
| Edge | Built-in voices | Free, no API key needed. Lower quality. |
STT backends
| Backend | Quality | Notes |
|---|---|---|
| Local faster-whisper | Good | Free, runs locally, no API key. Default. |
| OpenAI Whisper | High | Requires OpenAI key. Better accuracy for accents and noise. |
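The key requirements in the two tables above can be summarized as a small lookup. The sketch below is illustrative only: the key identifiers (`"openai"`, `"xai"`, and so on) and the `usable` helper are hypothetical and do not correspond to Headmaster setting names.

```python
# Key requirements copied from the tables above; None means no key is needed.
# The key identifiers are made up for this sketch.
TTS_KEY = {
    "OpenAI": "openai",
    "xAI": "xai",
    "MiniMax": "minimax",
    "ElevenLabs": "elevenlabs",
    "Edge": None,
}
STT_KEY = {
    "Local faster-whisper": None,
    "OpenAI Whisper": "openai",
}


def usable(options: dict, keys: set) -> list:
    """Return the options that work with the API keys you have."""
    return [name for name, key in options.items() if key is None or key in keys]


print(usable(TTS_KEY, {"openai"}))  # ['OpenAI', 'Edge']
print(usable(STT_KEY, set()))       # ['Local faster-whisper']
```

With no keys at all, Edge for TTS and local faster-whisper for STT remain available.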
Using voice mode
In the desktop app
Click the microphone icon in the chat composer. The icon turns red to indicate recording. Speak your message, then click the icon again (or press Esc) to stop recording. The agent transcribes your speech, processes it, and speaks the response.
Push to talk
Enable Push to talk in voice settings. Hold the microphone button (or a keyboard shortcut) to talk, release to send. The agent responds with speech automatically.
Continuous conversation
In continuous mode, the agent listens for your speech, responds, then automatically listens again. You don't need to click the microphone each time — just talk.
To enable it, open voice settings (Settings → My Headmaster → Look → Voice) and turn on Continuous mode.
Interrupting
While the agent is speaking, click the stop button or press Esc to interrupt. The agent stops speaking and the partial audio is discarded. You can then speak a new message.
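One common way to get this behavior is to play audio in chunks and check a stop flag between them. The sketch below assumes that design; Headmaster's real playback code may work differently, and `play_response` and `play_chunk` are hypothetical names.

```python
import threading


def play_response(chunks, stop: threading.Event, play_chunk) -> int:
    """Play chunks until finished or interrupted; return how many were played."""
    played = 0
    for chunk in chunks:
        if stop.is_set():
            break  # the remaining audio is discarded
        play_chunk(chunk)
        played += 1
    return played


stop = threading.Event()
heard = []


def play_chunk(chunk: bytes) -> None:
    heard.append(chunk)
    if chunk == b"two":
        stop.set()  # simulate pressing Esc during the second chunk


played = play_response([b"one", b"two", b"three"], stop, play_chunk)
print(played, heard)  # 2 [b'one', b'two']
```

In a real app the stop button or Esc handler would call `stop.set()` from the UI thread, which is why a `threading.Event` is used instead of a plain boolean.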
Voice on messaging platforms
On Telegram and Discord, voice messages you send are transcribed and processed. The agent's response is sent as text (or as a voice message if TTS is enabled for that platform).
For example, send a voice message to your Headmaster bot on Telegram: the agent transcribes it, processes it, and responds. If TTS is enabled, the response comes back as a voice message.
Voice settings
| Setting | What it controls |
|---|---|
| TTS provider | Which service generates speech |
| TTS voice | Which voice to use |
| TTS speed | How fast the agent speaks (0.5x to 2x) |
| STT backend | Which service transcribes your speech |
| Auto-listen | Start listening automatically after the agent responds |
| Push to talk | Hold to talk, release to send |
| Continuous mode | Agent listens → responds → listens again |
| Voice volume | Output volume for TTS audio |
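As a rough mental model, the settings above could be represented as a single record. The field names and most defaults below are assumptions made for illustration; only the STT default (local faster-whisper) and the 0.5x to 2x speed range come from this page.

```python
from dataclasses import dataclass


@dataclass
class VoiceSettings:
    # Field names are hypothetical; they mirror the rows of the table above.
    tts_provider: str = "OpenAI"               # placeholder default (assumption)
    tts_voice: str = "Alloy"                   # placeholder default (assumption)
    tts_speed: float = 1.0                     # documented range: 0.5x to 2x
    stt_backend: str = "Local faster-whisper"  # documented default
    auto_listen: bool = False
    push_to_talk: bool = False
    continuous_mode: bool = False
    voice_volume: float = 1.0                  # placeholder; scale is not documented

    def __post_init__(self) -> None:
        # Reject speeds outside the documented range.
        if not 0.5 <= self.tts_speed <= 2.0:
            raise ValueError("tts_speed must be between 0.5 and 2.0")


settings = VoiceSettings(tts_speed=1.25, push_to_talk=True)
print(settings.tts_speed)  # 1.25
```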
Speech input button
The microphone button in the chat composer shows the current voice state:
- Gray mic — voice mode is off. Click to start recording.
- Red mic — recording in progress. Click to stop and send.
- Blue mic — processing. The agent is transcribing or generating speech.
- Green mic — speaking. The agent is speaking the response.
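The four colors behave like a small state machine. The sketch below models it under some assumptions: the event names (`click`, `audio_ready`, `done`, `interrupt`) are hypothetical, and the return to recording after a response applies only when continuous mode is on.

```python
from enum import Enum


class MicState(Enum):
    OFF = "gray"
    RECORDING = "red"
    PROCESSING = "blue"
    SPEAKING = "green"


# Hypothetical event names; unknown events leave the state unchanged.
TRANSITIONS = {
    (MicState.OFF, "click"): MicState.RECORDING,
    (MicState.RECORDING, "click"): MicState.PROCESSING,
    (MicState.PROCESSING, "audio_ready"): MicState.SPEAKING,
    (MicState.SPEAKING, "done"): MicState.OFF,
    (MicState.SPEAKING, "interrupt"): MicState.OFF,
}


def next_state(state: MicState, event: str, continuous: bool = False) -> MicState:
    # In continuous mode the mic starts listening again after the response.
    if state is MicState.SPEAKING and event == "done" and continuous:
        return MicState.RECORDING
    return TRANSITIONS.get((state, event), state)


state = MicState.OFF
for event in ["click", "click", "audio_ready", "done"]:
    state = next_state(state, event)
    print(state.value)
# prints red, blue, green, gray on separate lines
```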
Tips for better voice interaction
- Speak clearly — the STT model works best with clear, moderate-paced speech.
- Use a quiet environment — background noise reduces transcription accuracy.
- Try different voices — some TTS voices sound more natural for your use case. Try them all.
- Use push to talk — prevents the agent from picking up background conversation as input.
- Adjust speed — if the agent speaks too fast or slow, adjust the TTS speed setting.