
🎤 Build a Local Voice Detection System (Offline, Free)

🧩 1. What Is “Voice Detection”?

Voice activity detection (VAD) = detecting when someone is talking vs. silence or noise.
It doesn’t transcribe speech — it just identifies that speech is happening.

This is the foundation of:

  • Voice assistants
  • Smart recording tools
  • Real-time speech analytics
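
The idea can be illustrated with a toy energy threshold: call a frame "speech" if its RMS level exceeds a cutoff (the 0.01 threshold below is an arbitrary assumption). Real VADs such as WebRTC's are far more robust to noise, but this shows the shape of the problem:

```python
import numpy as np

def naive_vad(frame, threshold=0.01):
    """Toy detector: call a frame 'speech' if its RMS energy exceeds
    a fixed threshold. Real VADs use statistical models instead."""
    rms = np.sqrt(np.mean(frame ** 2))
    return rms > threshold

silence = np.zeros(480)  # 30 ms of silence at 16 kHz
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(480) / 16000)  # steady tone

print(naive_vad(silence))  # → False
print(naive_vad(tone))     # → True
```

An energy threshold like this breaks down in noisy rooms, which is exactly why a trained model such as WebRTC's is used below.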

🧰 2. Tools You’ll Need (All Free)

Library       Purpose                                                Offline?
sounddevice   Access microphone input                                ✅
numpy         Handle audio data                                      ✅
webrtcvad     Voice activity detection (from Google's WebRTC)        ✅
wave          Save audio files (ships with the Python standard library) ✅

Install them (Python ≥3.8):

pip install sounddevice numpy webrtcvad

⚙️ 3. Basic Concept

We’ll:

  1. Capture short chunks of microphone audio.
  2. Use WebRTC VAD to detect if that chunk contains voice.
  3. Print a message (“Voice detected”) when someone talks.

💻 4. Full Example: voice_detector.py

import sounddevice as sd
import numpy as np
import webrtcvad
import struct

# -----------------------------
# SETTINGS
# -----------------------------
SAMPLE_RATE = 16000  # samples per second
FRAME_DURATION = 30  # ms
FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION / 1000)  # samples per frame
VAD_MODE = 2  # aggressiveness: 0 = most permissive (flags more frames as speech), 3 = strictest

vad = webrtcvad.Vad(VAD_MODE)

# -----------------------------
# Helper: convert numpy audio chunk to bytes
# -----------------------------
def audio_to_bytes(audio):
    # Scale float samples in [-1.0, 1.0] to 16-bit PCM; clip first so that
    # full-scale input doesn't overflow int16 (32768 wraps around).
    ints = np.int16(np.clip(audio, -1.0, 1.0) * 32767)
    return struct.pack("%dh" % len(ints), *ints)

# -----------------------------
# Main Loop
# -----------------------------
def main():
    print("🎙️ Voice Detection started (Ctrl+C to stop)")
    with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, blocksize=FRAME_SIZE) as stream:
        while True:
            audio_chunk, _ = stream.read(FRAME_SIZE)
            audio_chunk = np.squeeze(audio_chunk)

            audio_bytes = audio_to_bytes(audio_chunk)

            if vad.is_speech(audio_bytes, SAMPLE_RATE):
                print("🟢 Voice detected!")
            else:
                print("⚪ Silence...", end="\r")

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nStopped.")

▶️ 5. Run It

python voice_detector.py

Then speak near your microphone — you’ll see:

🟢 Voice detected!

when it hears speech.


🧠 6. How It Works

  • sounddevice streams real-time microphone data.
  • Each frame (~30 ms of sound) is analyzed.
  • webrtcvad classifies each frame with a lightweight Gaussian mixture model (GMM) from Google's WebRTC project, trained to separate speech from noise.
  • The model runs completely offline, using CPU only.

🎧 7. Optional: Record Only When Voice Is Detected

You can extend it to save audio segments that contain voice:

import wave
import time

def record_voice_segments():
    print("🎙️ Recording voice segments (Ctrl+C to stop)")
    with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, blocksize=FRAME_SIZE) as stream:
        buffer = []
        speaking = False
        while True:
            audio_chunk, _ = stream.read(FRAME_SIZE)
            audio_chunk = np.squeeze(audio_chunk)
            audio_bytes = audio_to_bytes(audio_chunk)

            if vad.is_speech(audio_bytes, SAMPLE_RATE):
                buffer.append(audio_bytes)
                if not speaking:
                    print("🟢 Voice detected — recording...")
                    speaking = True
            else:
                if speaking and len(buffer) > 0:
                    filename = f"voice_{int(time.time())}.wav"
                    with wave.open(filename, "wb") as wf:
                        wf.setnchannels(1)
                        wf.setsampwidth(2)
                        wf.setframerate(SAMPLE_RATE)
                        wf.writeframes(b"".join(buffer))
                    print(f"💾 Saved segment to {filename}")
                    buffer.clear()
                    speaking = False

if __name__ == "__main__":
    try:
        record_voice_segments()
    except KeyboardInterrupt:
        print("\nStopped.")

🎯 This script saves a .wav file every time you speak — all offline.
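
To sanity-check a saved segment, the same wave module can read the file back. (This snippet writes a throwaway one-frame file so it stands alone; your real recordings are named voice_<timestamp>.wav.)

```python
import wave

# Write one 30 ms frame of silence the same way the recorder does...
with wave.open("check.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 2 bytes = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 480)

# ...then read the header back.
with wave.open("check.wav", "rb") as wf:
    print(wf.getnchannels(), wf.getframerate(), wf.getnframes())  # → 1 16000 480
```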


🧱 8. Improvements You Can Add

  • Noise filtering → use pydub or scipy to denoise input
  • Visualization → use matplotlib to plot waveforms in real time
  • Trigger command → when voice detected, call another script or AI agent

Example idea:

import subprocess

if vad.is_speech(audio_bytes, SAMPLE_RATE):
    print("🟢 Voice detected — launching local AI...")
    subprocess.run(["python", "local_agent.py"])
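
Raw per-frame decisions also tend to flicker: a single noisy frame can flip the state back and forth. A common refinement is a small "hangover" debounce that requires several consecutive frames before changing state. This is a standalone sketch (the on_frames/off_frames counts are arbitrary assumptions) that operates on the booleans returned by vad.is_speech:

```python
def smooth_vad(flags, on_frames=3, off_frames=10):
    """Debounce per-frame VAD flags: switch to 'speaking' only after
    on_frames consecutive speech frames, and back to silence only after
    off_frames consecutive non-speech frames."""
    speaking = False
    run = 0
    out = []
    for is_speech in flags:
        if is_speech == speaking:
            run = 0  # decision agrees with current state; reset the streak
        else:
            run += 1
            if run >= (on_frames if not speaking else off_frames):
                speaking = not speaking
                run = 0
        out.append(speaking)
    return out

# A single noisy frame (index 1) is not enough to flip the state.
flags = [False, True, False, False, True, True, True, True, False, True]
print(smooth_vad(flags))  # → six False then four True
```

With 30 ms frames, off_frames=10 means speech must pause for about 300 ms before a segment is considered finished, which keeps short pauses inside one recording.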

🔒 9. Advantages of Local Voice Detection

  • 100% offline — no Google, no cloud
  • Zero cost
  • Privacy-safe
  • Low CPU usage (works on any laptop)


🚀 Summary

You just built a real-time local voice detection system using:

  • Python 🐍
  • sounddevice 🎤
  • webrtcvad 🧠

 
