I have learned to keep voice systems boring at first. Over-designed architectures look impressive in diagrams, but they are painful to debug when latency spikes in real calls. My preference is a simple stack that can be measured, tuned, and trusted.
Layer 1: Audio And Speech
Start with a stable real-time input pipeline: reliable voice activity detection plus streaming ASR. If this layer is noisy or delayed, everything above it looks like an AI problem even when it is not.
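A minimal sketch of the gating idea, assuming an energy-based VAD in front of the ASR feed. The threshold, frame size, and hangover count are illustrative assumptions, not tuned values; a production system would use a trained VAD.

```python
# Energy-based VAD gate: only forward frames to the ASR while speech is
# detected, plus a short "hangover" tail so word endings are not clipped.

def is_speech(frame, threshold=0.02):
    """True if the frame's mean absolute amplitude exceeds the threshold."""
    if not frame:
        return False
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

def gate_frames(frames, hangover=2, threshold=0.02):
    """Yield only frames worth sending to the ASR."""
    tail = 0
    for frame in frames:
        if is_speech(frame, threshold):
            tail = hangover       # speech: reset the hangover counter
            yield frame
        elif tail > 0:
            tail -= 1             # brief pause: keep a few trailing frames
            yield frame

# Example: silence, a burst of "speech", then silence again.
silence = [0.0] * 160
speech = [0.5, -0.5] * 80
stream = [silence, speech, speech, silence, silence, silence]
kept = list(gate_frames(stream))  # two speech frames + two hangover frames
```

The hangover matters more than it looks: without it, the gate clips final consonants and the ASR upstream produces truncated words that read like model errors.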
Layer 2: Reasoning Core
Keep the reasoning loop compact. Detect intent, call tools only when needed, then respond. Long invisible chains feel clever in development and slow in production.
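The compact loop can be sketched in a few lines. The intent rules and the tool table here are hypothetical placeholders; a real system would classify intent with a model, but the control flow is the point: one decision, at most one tool call, then a reply.

```python
# Compact reasoning loop: detect intent, call a tool only when needed,
# then respond. No long invisible chains.

TOOLS = {
    # Stand-in for a real API call; the name and behavior are illustrative.
    "weather": lambda query: "It is sunny.",
}

def detect_intent(utterance):
    """Trivial keyword intent detection; a real system would use a model."""
    if "weather" in utterance.lower():
        return "weather"
    return "chat"

def respond(utterance):
    intent = detect_intent(utterance)
    if intent in TOOLS:              # tool call only when the intent needs it
        return TOOLS[intent](utterance)
    return "Got it."                 # direct reply, no extra hop

print(respond("What's the weather like?"))
print(respond("Tell me a joke."))
```

Keeping the loop this flat also keeps latency measurable: each turn is one intent decision plus at most one tool round-trip, so a slow call shows up immediately in traces.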
Layer 3: Voice Output
Voice quality matters, but controllability matters more. You need quick cutoff, barge-in support, and pacing control so conversations feel natural instead of scripted.
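Barge-in support mostly comes down to one design choice: emit audio in small chunks and check an interrupt flag between them. A minimal sketch, assuming the VAD sets an event when the user starts talking; the `Player` class and its fields are invented for illustration.

```python
# Controllable playback sketch: speech goes out chunk by chunk so a
# barge-in event can cut it off mid-sentence.

import threading

class Player:
    def __init__(self):
        self.barge_in = threading.Event()  # set by the VAD on user speech
        self.played = []                   # stands in for the audio device

    def speak(self, chunks):
        """Play chunks one at a time; stop immediately on barge-in."""
        for chunk in chunks:
            if self.barge_in.is_set():     # user started talking: cut off
                return False               # report interrupted playback
            self.played.append(chunk)
        return True

player = Player()

def chunks_with_interrupt():
    """Simulate the VAD firing partway through a long utterance."""
    yield "Hello,"
    yield " this is"
    player.barge_in.set()                  # user barges in here
    yield " a long sentence."

finished = player.speak(chunks_with_interrupt())
```

The same chunked structure is what makes pacing control possible: inserting or shortening pauses between chunks is trivial, while a monolithic synthesized clip can only be played or killed.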
Layer 4: Memory
Memory should support decisions, not archive everything forever. I keep stable preferences, unresolved tasks, and short context summaries. Full transcripts usually add noise, cost, and retrieval confusion.
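The three buckets above map directly onto a small data structure. A minimal sketch, with field names and the rolling-summary cap chosen for illustration rather than taken from any particular system:

```python
# Decision-oriented memory: stable preferences, unresolved tasks, and a
# short rolling summary instead of full transcripts.

from collections import deque

class Memory:
    def __init__(self, max_summaries=3):
        self.preferences = {}    # stable, e.g. {"speaking_rate": "slow"}
        self.open_tasks = []     # unresolved items to follow up on
        self.summaries = deque(maxlen=max_summaries)  # rolling context

    def remember_preference(self, key, value):
        self.preferences[key] = value

    def add_task(self, task):
        self.open_tasks.append(task)

    def resolve_task(self, task):
        self.open_tasks.remove(task)

    def summarize_turn(self, summary):
        self.summaries.append(summary)  # oldest summary rolls off the deque

mem = Memory(max_summaries=2)
mem.remember_preference("speaking_rate", "slow")
mem.add_task("confirm appointment time")
mem.summarize_turn("User asked to reschedule.")
mem.summarize_turn("Offered Tuesday 3pm.")
mem.summarize_turn("User prefers mornings.")  # evicts the oldest summary
```

The bounded deque is the whole argument in miniature: retrieval stays cheap and unambiguous because there is never more than a screenful of context to search.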