🧠 Executive Summary
A modern Conversational AI platform is not a chatbot with a microphone attached. It is a real time distributed system that combines audio streaming, speech recognition, LLM reasoning, tool calling, memory, knowledge retrieval, text to speech, observability, security, and human handoff.
The core real time voice pipeline is simple on paper: speech to text converts user audio into text, an LLM generates the response, and text to speech converts the response back to audio. LiveKit describes this as the standard three stage voice agent pipeline, but the production challenge is latency, turn taking, interruptions, reliability, and enterprise integration.
A serious platform should support both real time conversation during the call and deeper post call intelligence.
That means you should not rely on one model for everything. Use fast streaming models during live calls, then use stronger long context models after the call for transcript cleanup, summaries, compliance, CRM updates, coaching, and analytics.
🏗️ High Level Platform Architecture
*(Diagram: high level platform architecture.)*
🧩 Core Components
1. User Channel Layer
The platform should support multiple customer entry points:
Website voice widget
Mobile app voice assistant
Phone calls through Twilio, Telnyx, SIP, or WebRTC
WhatsApp voice
Zoom or Teams meeting agent
Call center integration
Healthcare portal
Internal enterprise assistant
This layer should normalize every channel into a common session format:
Session ID
User ID
Tenant ID
Channel type
Audio stream reference
Consent status
Language
Agent configuration
Security policy
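As a sketch, the normalized session could be a single dataclass that every channel adapter must produce before anything downstream runs. All field names here are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Field names are illustrative; align them with your own schema.
    session_id: str
    user_id: str
    tenant_id: str
    channel_type: str          # "web", "phone", "whatsapp", ...
    audio_stream_ref: str      # handle into the media gateway
    consent_status: bool = False
    language: str = "en-US"
    agent_config_id: str = "default"
    security_policy: str = "standard"

def normalize_phone_call(call_sid: str, tenant: str) -> Session:
    """Map a telephony event into the common session format."""
    return Session(
        session_id=f"sess_{call_sid}",
        user_id="unknown",     # phone callers may be anonymous at first
        tenant_id=tenant,
        channel_type="phone",
        audio_stream_ref=f"sip:{call_sid}",
    )
```

Each channel (widget, SIP, WhatsApp) gets its own adapter, but everything after this point sees only `Session`.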
2. Real Time Media Gateway
This is the first major technical decision.
For a modern open architecture, use LiveKit as the real time media layer. LiveKit Agents supports LLMs, speech to text, text to speech, realtime APIs, and virtual avatars through a unified agent framework.
LiveKit is useful because it handles:
WebRTC media streaming
Audio rooms
Low latency transport
Agent workers
Voice pipeline orchestration
Telephony integrations
Browser and mobile clients
Realtime model integration
Alternative options exist, such as building directly on raw WebRTC or a SIP stack, but my recommendation is this: use LiveKit for the first version, add Twilio or Telnyx for phone numbers, and keep SIP support for enterprise call centers.
3. Voice Activity Detection and Turn Detection
This is one of the most important parts of a voice AI platform.
The platform must know:
When the user starts speaking
When the user stops speaking
When to interrupt the AI
When to ignore background noise
When to continue listening
When to pass control to the LLM
Without good turn detection, the agent feels slow and robotic.
Recommended components:
Silero VAD for open source voice activity detection
A turn detection model layered on top of VAD
Client side echo cancellation and noise suppression
Design rule:
Never send every audio frame directly to the LLM pipeline. First clean, segment, and classify the audio stream.
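A minimal energy based gate illustrates the idea (a real deployment would use a trained VAD such as Silero, but the gating pattern is the same). The threshold and hangover values are assumptions to tune per codec and microphone:

```python
import struct
import math

FRAME_MS = 20
ENERGY_THRESHOLD = 500   # tune per microphone/codec; an assumption here
HANGOVER_FRAMES = 10     # keep "speaking" ~200 ms after energy drops

def frame_rms(pcm16: bytes) -> float:
    """Root-mean-square energy of a little-endian 16-bit PCM frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

class EnergyVAD:
    """Minimal energy gate: only frames inside a speech segment go downstream."""

    def __init__(self):
        self.hangover = 0

    def is_speech(self, pcm16: bytes) -> bool:
        if frame_rms(pcm16) >= ENERGY_THRESHOLD:
            self.hangover = HANGOVER_FRAMES
            return True
        if self.hangover > 0:
            # Bridge short pauses so we do not chop words apart.
            self.hangover -= 1
            return True
        return False
```

Frames rejected by the gate never reach ASR, which is exactly the "clean, segment, classify first" rule in practice.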
4. Speech To Text Layer
There should be two speech to text paths.
Real Time Path
For live conversations, use a fast streaming speech model.
Options:
Faster Whisper streaming
NVIDIA Riva
NVIDIA Riva is designed as a GPU accelerated SDK for real time speech AI applications and supports customization and deployment across different environments.
Batch Intelligence Path
For post call processing, use a stronger long context model.
Options:
Microsoft VibeVoice ASR
Whisper large
NVIDIA Parakeet
Canary
Custom fine tuned ASR
Use this path after the call to create:
Clean final transcripts
Call summaries
Compliance records
CRM updates
Coaching notes
Analytics
Architecture rule:
Use fast ASR for live interaction. Use high accuracy ASR for final intelligence.
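That two path rule can live in one small router. The model identifiers below are placeholders for whatever you actually deploy:

```python
def select_asr_model(mode: str, language: str = "en") -> str:
    """Pick an ASR backend by processing mode.

    Model identifiers are placeholders; substitute your deployed models.
    """
    if mode == "live":
        # Streaming path: optimize for first-partial latency.
        return "riva-streaming" if language == "en" else "faster-whisper-small"
    if mode == "batch":
        # Post call path: optimize for accuracy and long context.
        return "whisper-large-v3"
    raise ValueError(f"unknown ASR mode: {mode}")
```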
5. LLM Runtime Layer
The LLM is the reasoning engine of the platform.
You need two LLM classes.
Fast Live Conversation LLM
Use small, optimized models that can stream tokens fast enough for live conversation.
Deeper Reasoning LLM
Use larger, long context models for post call work such as summaries, compliance review, coaching, and analytics.
Runtime options:
vLLM
TensorRT LLM
Ollama
Recommendation:
Use vLLM for production open source LLM serving. Use Ollama only for local development and demos.
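vLLM exposes an OpenAI compatible HTTP API, so the live path can talk to it with nothing but the standard library. A minimal sketch; the model name, port, and parameter values are assumptions:

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default port

def build_chat_request(model: str, user_text: str, stream: bool = True) -> dict:
    """Payload for vLLM's OpenAI compatible chat endpoint.

    Streaming is on by default: for live voice you care about first-token
    latency, not the full completion.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,
        "max_tokens": 256,      # keep live turns short
        "temperature": 0.3,
    }

def send(payload: dict) -> bytes:
    """Fire the request (requires a running vLLM server)."""
    req = request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req).read()
```

Because the API shape is OpenAI compatible, swapping the live model for a deeper post call model is a one line change in the payload.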
6. Conversation Orchestration Layer
This is the brainstem of the platform.
It manages:
Conversation state
Agent instructions
Memory
Tool calls
RAG retrieval
Response validation
Escalation
Fallback behavior
Multi step flows
Human handoff
Recommended frameworks:
LangGraph
LiveKit Agents
For enterprise systems, avoid a fully free form agent for every use case. Use a hybrid model:
LLM for language and reasoning
State machine for business critical flows
Tools for deterministic actions
Guardrails for safety
Human handoff for exceptions
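The hybrid model can be made concrete: let a deterministic state machine own the business critical flow while the LLM only phrases questions and extracts slot values. A sketch with hypothetical states and events:

```python
# A deterministic flow skeleton: the LLM phrases questions and extracts
# answers, but it cannot reorder or skip business critical steps.
TRANSITIONS = {
    ("collect_name", "slot_filled"): "collect_date",
    ("collect_date", "slot_filled"): "confirm",
    ("confirm", "user_confirmed"): "booked",
    ("confirm", "user_declined"): "collect_date",
}

class BookingFlow:
    def __init__(self):
        self.state = "collect_name"

    def advance(self, event: str) -> str:
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            # Unknown event: stay put and let the agent re-ask or escalate.
            return self.state
        self.state = nxt
        return self.state
```

The LLM can hallucinate phrasing, but it cannot hallucinate a transition: a booking is only confirmed when the machine reaches `booked`.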
7. RAG Knowledge Layer
A conversational AI platform becomes useful only when it can answer from business data.
The RAG layer should include:
Document ingestion
Chunking
Embedding generation
Vector database
Hybrid search
Permission filtering
Reranking
Citation generation
Answer grounding
AWS Bedrock Knowledge Bases connects foundation models to internal company data sources for RAG and can use enterprise vector databases such as OpenSearch, Aurora PostgreSQL, MongoDB Atlas, Weaviate, and Pinecone.
Open source choices:
Qdrant
Weaviate
Milvus
pgvector
OpenSearch
Elasticsearch
Recommended first version:
PostgreSQL plus pgvector for MVP
Qdrant for better vector search
OpenSearch or Elasticsearch for hybrid enterprise search
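The retrieval logic, permission filtering before ranking, can be sketched in miniature. In production the same query runs inside pgvector or Qdrant; this pure Python version only shows the ordering of the two steps:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, chunks, tenant_id, top_k=3):
    """Vector search with tenant permission filtering applied BEFORE ranking.

    `chunks` is a list of dicts: {"vec": [...], "text": ..., "tenant_id": ...}.
    Filtering first guarantees another tenant's chunk can never rank.
    """
    allowed = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    return ranked[:top_k]
```

Filtering after ranking is a classic multi tenant leak: a near-duplicate document from another tenant can displace legitimate results or, worse, surface in a citation.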
8. Tool Calling Layer
A voice agent is only valuable when it can take action.
Tools may include:
Calendar lookup
Appointment booking
CRM lookup and updates
SMS confirmation
Support ticket creation
Human handoff
The tool layer must have strict controls:
Do not let the LLM directly call production APIs without a policy layer.
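One way to enforce that rule is a tool gateway that checks a per tenant allowlist and argument constraints before any function executes. The policy shape and tool names here are illustrative:

```python
class ToolPolicyError(Exception):
    pass

# Per-tenant allowlists and simple argument constraints; in production
# these would live in the tenant service, not in code.
TOOL_POLICIES = {
    "clinic_123": {
        "calendar_lookup": {},
        "appointment_booking": {"max_days_ahead": 90},
    },
}

def call_tool(tenant_id: str, tool: str, args: dict, registry: dict):
    """Every LLM-requested tool call passes through this gate."""
    policy = TOOL_POLICIES.get(tenant_id, {})
    if tool not in policy:
        raise ToolPolicyError(f"tool '{tool}' not allowed for {tenant_id}")
    limit = policy[tool].get("max_days_ahead")
    if limit is not None and args.get("days_ahead", 0) > limit:
        raise ToolPolicyError("booking window exceeds policy")
    return registry[tool](**args)
```

The LLM proposes; the gateway disposes. A denied call becomes an agent message or an escalation, never a silent pass-through.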
9. Text To Speech Layer
The TTS layer converts the AI response into natural speech.
Open source options:
Kokoro TTS
XTTS v2
Piper
OpenVoice
StyleTTS2
Parler TTS
Managed options:
Azure Speech
Google Text to Speech
ElevenLabs
Cartesia
PlayHT
Google’s Text to Speech service offers a large catalog of voices across many languages and variants, which is useful when you need broad language coverage quickly.
For open source production:
Use Kokoro for speed and quality
Use Piper for lightweight edge deployments
Use XTTS v2 or OpenVoice when voice cloning is required
Important:
For live calls, TTS latency matters more than perfect voice quality. For video narration, quality matters more than latency.
10. Memory Layer
Conversational AI needs multiple types of memory.
Session Memory
Current call or chat context.
Stored in:
Redis
PostgreSQL
Agent runtime state
User Memory
Preferences, history, profile, past conversations.
Stored in:
PostgreSQL
Document database
Vector memory store
Enterprise Memory
Policies, products, FAQs, procedures, documents.
Stored in:
RAG system
Search index
Vector database
Long Term Conversation Intelligence
Call summaries, transcripts, decisions, follow ups.
Stored in:
Object storage
PostgreSQL
Analytics warehouse
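Session memory is the simplest of these layers to sketch. The in-process class below mimics what Redis provides for real (a rolling turn window with a TTL); the limits are assumptions:

```python
import time

class SessionMemory:
    """In-process stand-in for Redis-backed session memory.

    Each session keeps a rolling window of turns with a TTL, so a crashed
    call does not leak context forever.
    """

    def __init__(self, ttl_seconds: float = 1800, max_turns: int = 20):
        self.ttl = ttl_seconds
        self.max_turns = max_turns
        self._store = {}   # session_id -> (expires_at, [turns])

    def append(self, session_id: str, role: str, text: str):
        _, turns = self._store.get(session_id, (0, []))
        turns = (turns + [(role, text)])[-self.max_turns:]
        self._store[session_id] = (time.monotonic() + self.ttl, turns)

    def get(self, session_id: str):
        entry = self._store.get(session_id)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(session_id, None)
            return []
        return entry[1]
```

User memory and enterprise memory need durable stores (PostgreSQL, the RAG index); only this hot per call window belongs in something Redis-shaped.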
11. Safety, Compliance, and Guardrails
This layer is mandatory for production.
It should handle:
PII detection
PHI detection
PCI redaction
Toxicity filters
Prompt injection defense
Tool permission enforcement
Medical disclaimer policy
Financial compliance rules
Legal escalation rules
Consent capture
Data retention policy
Audit logging
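A minimal redaction pass might look like the following. The regex patterns are illustrative only; production PII and PHI detection should combine NER models with validators rather than rely on regexes alone:

```python
import re

# Illustrative patterns only. Real detection needs NER plus validators
# (Luhn checks for cards, locale-aware phone parsing, and so on).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{8,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with typed placeholders before logging."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Run redaction before transcripts hit logs, analytics, or the post call LLM, so downstream systems never see the raw values.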
For healthcare, add:
HIPAA compliant storage
BAA supported vendors
Encryption at rest
Encryption in transit
Role based access control
Patient consent
PHI redaction
EMR access controls
For finance, add:
Call recording notices
Payment data masking
Fraud detection
Regulatory logs
Human review workflows
12. Human Handoff Layer
A serious conversational platform must know when to stop being autonomous.
Human handoff should trigger when:
User asks for a person
Confidence is low
User is angry
Policy blocks response
Payment dispute appears
Medical emergency appears
Legal risk appears
System error occurs
Multiple failed attempts happen
Handoff targets:
Live agent dashboard
Call center queue
Slack or Teams alert
CRM task
Support ticket
Emergency escalation workflow
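The triggers above reduce to a small rule table evaluated on every turn. The signal names and targets are assumptions:

```python
from typing import Optional

# Priority-ordered: the first matching signal wins.
HANDOFF_RULES = [
    ("user_requested_human", "call_center_queue"),
    ("emergency_detected",   "emergency_escalation"),
    ("policy_blocked",       "live_agent_dashboard"),
    ("low_confidence",       "live_agent_dashboard"),
]

def check_handoff(signals: dict, failed_attempts: int) -> Optional[str]:
    """Return a handoff target, or None to stay autonomous.

    Rules fire in priority order; repeated failures are the catch-all.
    """
    for signal, target in HANDOFF_RULES:
        if signals.get(signal):
            return target
    if failed_attempts >= 3:
        return "support_ticket"
    return None
```

Keeping this as declarative data rather than prompt text means compliance can review and change escalation behavior without touching the LLM.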
Google’s Contact Center AI products include capabilities such as agent assist, insights, and customer touchpoint agents, showing how enterprise conversational AI often includes both automation and human agent support.
🏢 Deployment Architecture
*(Diagram: deployment architecture.)*
☁️ Hosting Options
Option 1: Cloud Native MVP
Best for:
Prototype
Early SaaS
Demos
First 10 customers
Low compliance requirements
Components:
LiveKit Cloud
Managed PostgreSQL
Managed Redis
Qdrant Cloud or pgvector
GPU cloud for inference
S3 compatible object storage
Docker based services
Cloudflare for frontend and DNS
Estimated monthly infrastructure:
$1,500 to $5,000
Option 2: Kubernetes Production SaaS
Best for:
Multi tenant SaaS
High availability
Customer isolation
Autoscaling
Enterprise pilots
Components:
Kubernetes
GPU node pool
CPU node pool
Ingress controller
API gateway
LiveKit self hosted or LiveKit Cloud
PostgreSQL primary plus read replica
Redis cluster
Vector database
Object storage
Observability stack
CI CD pipeline
Estimated monthly infrastructure:
$8,000 to $30,000
Option 3: Enterprise Private Cloud
Best for:
Healthcare
Legal
Finance
Government
Large enterprise deployments
Components:
Private VPC
Dedicated Kubernetes cluster
Dedicated GPU servers
Private object storage
Private database cluster
VPN or private link
SIEM integration
Audit logs
Disaster recovery
Data residency controls
Estimated monthly cost:
$30,000 to $100,000 plus
Option 4: On Premises Deployment
Best for:
Hospitals
Government agencies
Defense
Strict privacy environments
Offline capable systems
Components:
GPU server rack
Kubernetes or Docker Swarm
Local object storage
PostgreSQL cluster
Redis
Local vector database
Local model registry
Local monitoring
Private network
Backup appliance
Estimated hardware cost:
$80,000 to $300,000 plus depending on GPU count
🖥️ Hardware Requirements
Developer Workstation
Use this for local model testing.
Recommended:
CPU: 16 to 32 cores
RAM: 128GB
GPU: RTX 4090 or RTX 5090 class
VRAM: 24GB or more
Storage: 2TB to 4TB NVMe
OS: Ubuntu Linux
Runtime: Docker, CUDA, Python, PyTorch
Estimated cost:
$4,000 to $8,000
MVP Server Setup
Good for early demos and limited concurrency.
Recommended:
1 GPU server for LLM and TTS
1 CPU server for API and orchestration
Managed PostgreSQL
Managed Redis
Object storage
LiveKit Cloud
GPU options:
RTX 4090
L40S
A10
A100 if budget allows
Estimated monthly cloud cost:
$1,500 to $5,000
Production Setup for 50 to 100 Concurrent Calls
Recommended:
2 to 4 GPU servers
Separate ASR service
Separate LLM service
Separate TTS service
3 CPU application nodes
PostgreSQL cluster
Redis cluster
Object storage
Vector database
Monitoring stack
GPU options:
L40S for cost effective inference
A100 80GB for larger models
H100 for high throughput
H200 for heavy enterprise workloads
Estimated monthly cloud cost:
$8,000 to $30,000
Enterprise Setup
Recommended:
4 to 8 GPU servers
Kubernetes GPU node pool
Dedicated media servers
Dedicated inference services
Dedicated database cluster
Dedicated vector database
Private networking
DR environment
Security monitoring
Estimated cloud cost:
$30,000 to $100,000 plus per month
Estimated on premises hardware:
$150,000 to $500,000 plus for serious scale
🧮 Capacity Planning Model
Voice AI capacity depends on:
Concurrent calls
Average call duration
ASR latency
LLM token speed
TTS generation speed
Tool call latency
RAG latency
Model size
Quantization
GPU memory
Batching strategy
A good starting assumption:
One live voice session needs low latency access to ASR, LLM, and TTS.
The LLM is usually the bottleneck.
TTS becomes the bottleneck when using high quality voices.
Post call jobs should run asynchronously on cheaper GPUs.
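These assumptions translate into a back-of-envelope sizing function, with the LLM as the densest per session consumer. The per GPU session counts below are placeholders; measure your own models (tokens per second, real time factor) and replace them:

```python
import math

def gpu_estimate(concurrent_calls: int,
                 sessions_per_llm_gpu: int = 15,
                 sessions_per_tts_gpu: int = 30,
                 sessions_per_asr_gpu: int = 40) -> dict:
    """Back-of-envelope GPU counts for live calls.

    All per-GPU densities are placeholder assumptions; benchmark your
    actual ASR, LLM, and TTS services and substitute measured numbers.
    """
    return {
        "llm_gpus": math.ceil(concurrent_calls / sessions_per_llm_gpu),
        "tts_gpus": math.ceil(concurrent_calls / sessions_per_tts_gpu),
        "asr_gpus": math.ceil(concurrent_calls / sessions_per_asr_gpu),
    }
```

Under these placeholder densities, 100 concurrent calls would need roughly seven LLM GPUs, four TTS GPUs, and three ASR GPUs, before any headroom for spikes or post call jobs.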
Recommended Concurrency Design
*(Diagram: recommended concurrency design.)*
🧱 Recommended Technology Stack
Frontend
React
Next.js
WebRTC client
Admin dashboard
Agent builder UI
Call review UI
Transcript viewer
Analytics dashboard
Backend
FastAPI or Node.js
gRPC for internal services
REST APIs for dashboard
Webhooks for integrations
Background workers with Celery, Temporal, or BullMQ
Real Time Voice
LiveKit
SIP bridge
Twilio or Telnyx
WebRTC
Voice activity detection
Interruption handling
AI Runtime
vLLM for LLM inference
TensorRT LLM for optimized NVIDIA deployment
Faster Whisper for ASR
NVIDIA Riva for real time speech
Kokoro or XTTS for TTS
VibeVoice ASR for long context post call intelligence
Data
PostgreSQL
Redis
Qdrant
S3 compatible object storage
ClickHouse or BigQuery for analytics
OpenSearch for logs and transcript search
Infrastructure
Docker
Kubernetes
Helm
Terraform
Prometheus
Grafana
OpenTelemetry
Loki or ELK
Vault or cloud secrets manager
Security
OAuth
JWT
RBAC
Tenant isolation
Encryption at rest
Encryption in transit
Secrets manager
Audit logging
PII and PHI redaction
🔐 Multi Tenant SaaS Architecture
For a SaaS platform, every request must be tenant aware.
Tenant isolation should apply to:
Agent configuration
Knowledge base
Conversation logs
Audio recordings
Transcripts
API keys
Tool permissions
Billing
Analytics
Admin users
Recommended tenant model:
Shared application services
Tenant scoped database rows
Tenant scoped vector collections
Tenant scoped object storage prefixes
Dedicated encryption keys for enterprise customers
Optional dedicated deployment for large customers
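One way to make tenant scoping hard to forget is to derive every resource name from the tenant ID in a single place. The naming scheme below is an assumption:

```python
def tenant_scoped_names(tenant_id: str) -> dict:
    """Derive every tenant-scoped resource name from one ID.

    The naming scheme is an assumption; the point is that no shared
    resource name ever exists without a tenant prefix.
    """
    safe = tenant_id.lower().replace("-", "_")
    return {
        "vector_collection": f"kb_{safe}",
        "storage_prefix":    f"tenants/{safe}/recordings/",
        "db_row_filter":     {"tenant_id": tenant_id},
    }
```

Services request names through this helper instead of building their own strings, so an unscoped query or bucket path fails code review by construction.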
🔁 Real Time Call Flow
*(Diagram: real time call flow.)*
🧠 Agent Design Pattern
Each agent should be configured as a structured object.
```json
{
  "agent_name": "Clinic Appointment Assistant",
  "tenant_id": "clinic_123",
  "voice_profile": "professional_friendly",
  "language": "en-US",
  "goals": [
    "answer clinic questions",
    "book appointments",
    "collect patient details",
    "escalate emergencies"
  ],
  "knowledge_sources": [
    "clinic_faq",
    "insurance_policy",
    "doctor_availability"
  ],
  "tools": [
    "calendar_lookup",
    "appointment_booking",
    "sms_confirmation",
    "human_handoff"
  ],
  "guardrails": [
    "do_not_diagnose",
    "do_not_collect_payment_card_audio",
    "escalate_emergency_language"
  ]
}
```
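A platform should refuse to deploy an agent whose configuration is missing safety essentials. A sketch of such a validator against the example above; the required key set and rules are assumptions:

```python
# Required keys mirror the example agent object; adjust to your schema.
REQUIRED_KEYS = {"agent_name", "tenant_id", "language",
                 "goals", "tools", "guardrails"}

def validate_agent_config(config: dict) -> list:
    """Return a list of problems; an empty list means deployable."""
    problems = [f"missing key: {k}"
                for k in sorted(REQUIRED_KEYS - config.keys())]
    if not config.get("guardrails"):
        problems.append("agent must declare at least one guardrail")
    if "human_handoff" not in config.get("tools", []):
        problems.append("agent must expose a human_handoff tool")
    return problems
```

Running this at save time in the agent builder UI, not at call time, keeps misconfigured agents from ever answering a live call.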
🏥 Healthcare Example Architecture
*(Diagram: healthcare example architecture.)*
Healthcare architecture must be stricter than a general support bot. The AI should not diagnose. It should summarize, route, schedule, educate from approved sources, and escalate.
⚙️ Infrastructure Services Required
Minimum production services:
API gateway
Auth service
Tenant service
Agent configuration service
LiveKit media service
Agent worker service
ASR inference service
LLM inference service
TTS inference service
RAG service
Tool gateway
Transcript service
Recording service
Analytics service
Billing service
Admin dashboard
Monitoring service
Audit log service
📊 Observability Requirements
Track these metrics:
End to end latency
ASR latency
LLM first token latency
TTS first audio latency
Average response time
Call drop rate
Interruption success rate
Tool call failure rate
Fallback rate
Human handoff rate
Hallucination reports
User satisfaction
Cost per call minute
GPU utilization
Tokens per minute
Audio minutes processed
💰 Cost Model
Your cost per call minute depends on:
ASR cost
LLM inference cost
TTS cost
Media streaming cost
Telephony cost
Recording storage
Post call processing
RAG queries
Monitoring and logs
A practical MVP cost target:
$0.03 to $0.12 per minute for infrastructure if optimized
$0.15 to $0.50 per minute if using premium APIs
$0.50 plus per minute if using expensive TTS and large LLMs without optimization
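A simple per minute cost model makes those target ranges concrete. Every default below is an illustrative placeholder, not a vendor quote:

```python
def cost_per_minute(asr=0.006, llm=0.02, tts=0.015,
                    media=0.004, telephony=0.0085, storage=0.001) -> float:
    """Sum per-minute component costs in dollars.

    All defaults are illustrative placeholders; substitute your measured
    infrastructure costs or vendor pricing.
    """
    return round(asr + llm + tts + media + telephony + storage, 4)
```

With these placeholder numbers a call lands around $0.05 per minute, inside the optimized target band; swapping in a premium TTS or a large hosted LLM is what pushes a call past $0.50.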
The architecture should be designed to move expensive workloads out of the live path.
🚀 MVP Roadmap
Phase 1: Working Voice Agent
Build:
Web voice widget
LiveKit integration
Streaming ASR
Small LLM
Open source TTS
Basic conversation memory
Basic admin dashboard
Call recording
Transcript storage
Phase 2: Business Agent
Add:
RAG knowledge base
Tool calling
Calendar integration
CRM integration
Human handoff
Conversation analytics
Agent configuration UI
Phase 3: Production SaaS
Add:
Multi tenant architecture
Billing
RBAC
Audit logs
Monitoring
Prompt versioning
Model routing
A/B testing
Enterprise security
Phase 4: Industry Solutions
Build vertical agents:
Healthcare receptionist
Dental clinic agent
Legal intake agent
Sales qualification agent
Customer support agent
Education tutor
Podcast content agent
Meeting intelligence agent
🧭 Final Architecture Recommendation
For your first serious build, I would use this stack:
LiveKit for real time media
Twilio or Telnyx for phone calls
FastAPI for backend services
React or Next.js for dashboard
PostgreSQL for system data
Redis for real time session state
Qdrant for vector search
S3 compatible storage for audio and transcripts
vLLM for open source LLM serving
Faster Whisper or NVIDIA Riva for live speech to text
Kokoro or XTTS v2 for text to speech
VibeVoice ASR for post call long context intelligence
LangGraph for agent orchestration
Kubernetes for production deployment
Prometheus, Grafana, and OpenTelemetry for observability
The winning architecture is hybrid.
Fast models handle live conversation.
Long context models handle post call intelligence.
State machines control business critical workflows.
LLMs handle natural language and reasoning.
Human handoff protects trust.
That is how you build a real Conversational AI platform, not just a demo voice bot.