🧩 1. What Is “Voice Detection”?
Voice activity detection (VAD) means detecting when someone is talking, as opposed to silence or background noise.
It doesn't transcribe speech; it only identifies that speech is happening.
This is the foundation of:
- Voice assistants
- Smart recording tools
- Real-time speech analytics
🧰 2. Tools You’ll Need (All Free)
| Library | Purpose | Offline? |
|---|---|---|
| `sounddevice` | Access microphone input | ✅ |
| `numpy` | Handle audio data | ✅ |
| `webrtcvad` | Voice Activity Detection (from Google's open-source WebRTC project) | ✅ |
| `wave` | Save audio files (Python standard library) | ✅ |
Install them (Python ≥3.8):
```bash
pip install sounddevice numpy webrtcvad
```
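Not sure which microphone `sounddevice` will pick up? A quick sanity check (a minimal sketch using sounddevice's `query_devices` helper) lists your audio devices, with the default input marked:

```python
import sounddevice as sd

# Print all audio devices; the default input device is marked with ">"
print(sd.query_devices())
```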
⚙️ 3. Basic Concept
We’ll:
- Capture short chunks of microphone audio.
- Use WebRTC VAD to detect if that chunk contains voice.
- Print a message (“Voice detected”) when someone talks.
💻 4. Full Example: voice_detector.py
```python
import sounddevice as sd
import numpy as np
import webrtcvad

# -----------------------------
# SETTINGS
# -----------------------------
SAMPLE_RATE = 16000  # samples per second
FRAME_DURATION = 30  # ms
FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION / 1000)  # samples per frame (480)
VAD_MODE = 2  # aggressiveness: 0 = most permissive, 3 = strictest

vad = webrtcvad.Vad(VAD_MODE)

# -----------------------------
# Helper: convert numpy audio chunk to 16-bit PCM bytes
# -----------------------------
def audio_to_bytes(audio):
    # Clip to [-1, 1] before scaling so full-scale samples can't overflow int16
    ints = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    return ints.tobytes()

# -----------------------------
# Main Loop
# -----------------------------
def main():
    print("🎙️ Voice Detection started (Ctrl+C to stop)")
    with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, blocksize=FRAME_SIZE) as stream:
        while True:
            audio_chunk, _ = stream.read(FRAME_SIZE)  # blocks until a full frame arrives
            audio_chunk = np.squeeze(audio_chunk)     # (FRAME_SIZE, 1) -> (FRAME_SIZE,)
            audio_bytes = audio_to_bytes(audio_chunk)
            if vad.is_speech(audio_bytes, SAMPLE_RATE):
                print("🟢 Voice detected!")
            else:
                print("⚪ Silence...", end="\r")

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nStopped.")
```
▶️ 5. Run It
```bash
python voice_detector.py
```
Then speak near your microphone. When it hears speech, you'll see:

```
🟢 Voice detected!
```
🧠 6. How It Works
- `sounddevice` streams real-time microphone data.
- Each frame (~30 ms of sound) is analyzed independently.
- `webrtcvad` applies a lightweight Gaussian-mixture-model classifier to detect speech patterns in each frame.
- The model runs completely offline, using CPU only.
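One practical detail: `webrtcvad` only accepts 16-bit mono PCM at 8, 16, 32, or 48 kHz, and each frame must be exactly 10, 20, or 30 ms long (that's why `FRAME_DURATION` is 30 above). A minimal sketch that probes each valid duration with a synthetic all-zero (silent) frame:

```python
import webrtcvad

vad = webrtcvad.Vad(2)
SAMPLE_RATE = 16000

for duration_ms in (10, 20, 30):        # the only frame lengths webrtcvad accepts
    n_samples = SAMPLE_RATE * duration_ms // 1000
    frame = b"\x00\x00" * n_samples     # synthetic silent 16-bit PCM frame
    print(f"{duration_ms} ms frame -> speech: {vad.is_speech(frame, SAMPLE_RATE)}")
```

Any other frame length raises an error, which is a common gotcha when you change `SAMPLE_RATE` without recomputing `FRAME_SIZE`.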
🎧 7. Optional: Record Only When Voice Is Detected
You can extend it to save audio segments that contain voice:
```python
import wave
import time

# Add this to voice_detector.py; it reuses SAMPLE_RATE, FRAME_SIZE,
# vad, and audio_to_bytes defined above.

def record_voice_segments():
    print("🎙️ Recording voice segments (Ctrl+C to stop)")
    with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, blocksize=FRAME_SIZE) as stream:
        buffer = []
        speaking = False
        while True:
            audio_chunk, _ = stream.read(FRAME_SIZE)
            audio_chunk = np.squeeze(audio_chunk)
            audio_bytes = audio_to_bytes(audio_chunk)
            if vad.is_speech(audio_bytes, SAMPLE_RATE):
                buffer.append(audio_bytes)
                if not speaking:
                    print("🟢 Voice detected — recording...")
                    speaking = True
            else:
                if speaking and len(buffer) > 0:
                    # Speech just ended: flush the buffered frames to a WAV file
                    filename = f"voice_{int(time.time())}.wav"
                    with wave.open(filename, "wb") as wf:
                        wf.setnchannels(1)        # mono
                        wf.setsampwidth(2)        # 16-bit samples
                        wf.setframerate(SAMPLE_RATE)
                        wf.writeframes(b"".join(buffer))
                    print(f"💾 Saved segment to {filename}")
                    buffer.clear()
                    speaking = False

if __name__ == "__main__":
    try:
        record_voice_segments()
    except KeyboardInterrupt:
        print("\nStopped.")
```
🎯 This script saves a `.wav` file every time you speak, all offline.
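To double-check a saved segment, you can read it back with the same `wave` module. A minimal sketch (the filename below is hypothetical; substitute one the script actually printed):

```python
import wave

# Hypothetical filename: use one printed by record_voice_segments()
with wave.open("voice_1700000000.wav", "rb") as wf:
    print("channels:    ", wf.getnchannels())
    print("sample rate: ", wf.getframerate())
    print("duration (s):", wf.getnframes() / wf.getframerate())
```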
🧱 8. Improvements You Can Add
- Noise filtering → use `pydub` or `scipy` to denoise the input
- Visualization → use `matplotlib` to plot waveforms in real time (see the sketch after the example below)
- Trigger command → when voice is detected, call another script or AI agent
Example idea:
```python
import subprocess

if vad.is_speech(audio_bytes, SAMPLE_RATE):
    print("🟢 Voice detected — launching local AI...")
    subprocess.run(["python", "local_agent.py"])
```
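And for the visualization idea, here's a minimal sketch (assuming the same `SAMPLE_RATE` and `FRAME_SIZE` settings as above) that redraws the live waveform with matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
FRAME_SIZE = 480  # 30 ms at 16 kHz, as in voice_detector.py

plt.ion()  # interactive mode so the plot updates while audio streams
fig, ax = plt.subplots()
line, = ax.plot(np.zeros(FRAME_SIZE))
ax.set_ylim(-1, 1)
ax.set_xlabel("Sample")
ax.set_ylabel("Amplitude")

with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, blocksize=FRAME_SIZE) as stream:
    while plt.fignum_exists(fig.number):   # stop when the plot window is closed
        chunk, _ = stream.read(FRAME_SIZE)
        line.set_ydata(np.squeeze(chunk))  # update the waveform in place
        plt.pause(0.001)                   # give matplotlib time to redraw
```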
🔒 9. Advantages of Local Voice Detection
✅ 100% offline — no Google, no cloud
✅ Zero cost
✅ Privacy-safe
✅ Low CPU usage (works on any laptop)
🚀 Summary
You just built a real-time local voice detection system using:
- Python 🐍
- `sounddevice` 🎤
- `webrtcvad` 🧠