VocRT - Personal Realtime Voice-to-Voice AI Solution
VocRT is a comprehensive, privacy-first Realtime Voice-to-Voice (V2V) solution that enables natural conversations with AI. Built with cutting-edge TTS models, RAG capabilities, and seamless integration, VocRT processes your voice input and responds with high-quality synthesized speech in real-time.
Key Features
Real-time Voice Processing
- Ultra-low latency voice-to-voice conversion
- High-quality speech synthesis using Kokoro-82M model
- Customizable voice selection with multiple voice options
- Adjustable threshold and silence duration for optimal user experience
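To illustrate how the threshold and silence-duration settings interact, here is a minimal sketch of energy-based end-of-utterance detection. The names and defaults are illustrative only, not VocRT's actual implementation:

```python
import math

def is_silence(frame, threshold=0.01):
    """A frame counts as 'silent' when its RMS energy is below the threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms < threshold

def detect_end_of_utterance(frames, threshold=0.01, min_silent_frames=25):
    """End of speech = a run of consecutive silent frames.

    With 10 ms frames, min_silent_frames=25 corresponds to a 250 ms
    silence duration; raising it makes the assistant wait longer
    before responding.
    """
    silent_run = 0
    for i, frame in enumerate(frames):
        silent_run = silent_run + 1 if is_silence(frame, threshold) else 0
        if silent_run >= min_silent_frames:
            return i
    return None  # speaker has not stopped yet
```

Lowering the threshold makes the detector more sensitive to quiet speech; lengthening the silence run trades latency for fewer premature cut-offs.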
Advanced RAG Capabilities
- Multi-format document support: PDF, CSV, TXT, PPT, PPTX, DOC, DOCX, XLS, XLSX
- URL content extraction: Process web pages, Medium blogs, and online PDFs
- Unlimited document uploads without usage limits or billing concerns
- 100% privacy-first approach with local processing
Privacy & Cost Benefits
- No API usage limits or recurring charges
- Complete data privacy - all processing happens locally
- Offline capability: use a local LLM if your hardware allows
- No data sharing with external AI services
Architecture Overview

┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│   React Client    │────▶│  Express Server   │────▶│   VocRT Engine    │
│    (Frontend)     │     │    (Backend)      │     │     (Python)      │
└───────────────────┘     └───────────────────┘     └───────────────────┘
                                                             │
                                      ┌──────────────────────┴────┐
                                      ▼                           ▼
                            ┌────────────────────┐      ┌───────────────────┐
                            │     Embeddings     │      │    Whisper STT    │
                            │    (e5-base-v2)    │      │    Kokoro TTS     │
                            │     Qdrant DB      │      │                   │
                            │   (Vector Store)   │      └───────────────────┘
                            └────────────────────┘
Repository Structure

VocRT/
├── backend/           # Express.js server
├── frontend/          # React client application
├── models/            # AI models directory
├── voices/            # Available voice profiles
├── demo/              # Sample audio and demo files
├── .env               # Environment configuration
├── requirements.txt   # Python dependencies
└── README.md          # Project documentation
Manual Installation
Prerequisites
- Python 3.10 (required)
- Node.js 16+ and npm
- Docker (for Qdrant vector database)
- Git for cloning repositories
Step 1: Clone Repository
git clone https://huggingface.co/anuragsingh922/VocRT
cd VocRT
Step 2: Python Environment Setup
macOS/Linux:
python3.10 -m venv venv
source venv/bin/activate
Windows:
python3.10 -m venv venv
venv\Scripts\activate
Step 3: Install Python Dependencies
pip install -r requirements.txt
If the installation fails (e.g. due to dependency or PyTorch issues), try the following recovery steps:
pip install --upgrade pip setuptools wheel
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install -r requirements.txt
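After the recovery steps, a quick import check confirms that PyTorch is usable. This is a sanity check only, not part of VocRT itself:

```python
# Verify that PyTorch installed correctly and report the build in use.
import torch

print(torch.__version__)          # e.g. a nightly CPU build
print(torch.cuda.is_available())  # False is expected on the CPU-only wheel
```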
Step 4: Install eSpeak
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install espeak
macOS:
# Install Homebrew if not present
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install eSpeak
brew install espeak
Windows:
- Download from eSpeak official website
- Run installer and follow instructions
- Add installation path to system PATH environment variable
- Verify installation:
espeak --version
Verification:
espeak "VocRT installation successful!"
Step 5: Backend Setup (Express.js)
cd backend
npm install
npm run dev
Step 6: Frontend Setup (Vite)
cd frontend
npm install
npm run dev
Step 7: Qdrant Vector Database Setup
Documentation: Qdrant Quickstart Guide
# Pull Qdrant image
docker pull qdrant/qdrant
# Start Qdrant container
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
Access Points:
- REST API: http://localhost:6333
- Web Dashboard: http://localhost:6333/dashboard
- gRPC API: http://localhost:6334
Step 8: Download Required Models
Embedding Model:
Clone e5-base-v2 to models/e5-base-v2
Whisper STT Model:
Choose your preferred Whisper model size:
Just specify the model name in app.py; it will be downloaded and loaded automatically on first run.
- tiny: Fastest, lower accuracy
- base: Balanced performance
- small: Better accuracy
- medium/large: Highest accuracy, slower processing
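For example, the selection in app.py might look like the line below. The variable name is an assumption for illustration; check app.py for the actual one:

```python
# Hypothetical setting in app.py: the named Whisper checkpoint is
# downloaded automatically the first time the server starts.
whisper_model_name = "base"  # one of: tiny, base, small, medium, large
```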
Step 9: Environment Configuration
Edit the .env file with your API credentials:
# LLM Configuration
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
LLM_PROVIDER=google # 'openai' or 'google'
LLM_MODEL=gemini-2.0-flash # or your preferred model
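How these variables reach the process depends on the server code; below is a minimal stdlib sketch of a .env loader. VocRT may well use a library such as python-dotenv instead, so treat this as illustrative:

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: one KEY=VALUE per line; '#' starts a comment."""
    env = {}
    with open(path) as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if line and "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env

def apply_env(path=".env"):
    """Export the parsed values so os.getenv() sees them."""
    os.environ.update(load_env(path))
```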
Step 10: Launch VocRT Server
python3 app.py
Usage Guide
- Access the application: Navigate to http://localhost:3000
- Select voice profile: Choose from available voice options
- Configure settings: Adjust silence duration for optimal performance
- Add context: Upload documents, provide URLs, or enter text for AI context
- Start conversation: Begin speaking and enjoy real-time voice responses
Supported Document Formats

| Format | Extension | Description |
|---|---|---|
| PDF | `.pdf` | Portable Document Format |
| Text | `.txt` | Plain text files |
| Word | `.doc`, `.docx` | Microsoft Word documents |
| Excel | `.xls`, `.xlsx` | Microsoft Excel spreadsheets |
| PowerPoint | `.ppt`, `.pptx` | Microsoft PowerPoint presentations |
| CSV | `.csv` | Comma-separated values |
| URLs | Web links | Online content, blogs, PDFs |
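Ingestion of this kind is typically a dispatch on file extension (or URL scheme). A minimal sketch; the loader names here are hypothetical and VocRT's actual ingestion code may differ:

```python
from pathlib import Path

# Hypothetical loader names, one per supported family of formats.
LOADERS = {
    ".pdf": "load_pdf",
    ".txt": "load_text",
    ".doc": "load_word", ".docx": "load_word",
    ".xls": "load_excel", ".xlsx": "load_excel",
    ".ppt": "load_powerpoint", ".pptx": "load_powerpoint",
    ".csv": "load_csv",
}

def pick_loader(path_or_url: str) -> str:
    """Route a user-supplied source to the right loader by scheme/extension."""
    if path_or_url.startswith(("http://", "https://")):
        return "load_url"
    suffix = Path(path_or_url).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return LOADERS[suffix]
```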
AI Models & Technology Stack
Core Models
- TTS: Kokoro-82M - High-quality text-to-speech
- STT: OpenAI Whisper - Accurate speech recognition
- Embeddings: e5-base-v2 - Semantic text understanding
- LLM: OpenAI GPT / Google Gemini - Natural language processing
Technology Stack
- Backend: Python, Express.js, gRPC
- Frontend: React, Vite
- Database: Qdrant (Vector Database)
- Audio Processing: Whisper, eSpeak, phonemizer
Performance Optimization
Hardware Recommendations
- CPU: Multi-core processor (4+ cores recommended)
- RAM: 4GB+ for optimal performance
- Storage: SSD for faster model loading
- GPU: Optional; accelerated inference can reduce latency by up to 60%
Configuration Tips
- Modify silence duration for natural conversation flow
- Use smaller Whisper models for faster STT processing
- Enable GPU acceleration if available
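The GPU-acceleration tip can be automated with a standard PyTorch device check. A minimal sketch, assuming the inference code is PyTorch-based; how the chosen device is actually passed to VocRT's models may differ:

```python
import torch

# Pick the best available device and fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = "mps"   # Apple Silicon
else:
    device = "cpu"

print(f"Running inference on: {device}")
```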
Contributing
We welcome contributions from the community! Here's how you can help:
Ways to Contribute
- π Bug Reports: Submit issues with detailed reproduction steps
- π‘ Feature Requests: Suggest new capabilities and improvements
- π Documentation: Improve guides, tutorials, and API docs
- π§ Code Contributions: Submit pull requests with enhancements
Development Setup
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
License
This project is licensed under the MIT License
Acknowledgments
Special thanks to the amazing open-source communities:
- Hugging Face - For hosting and maintaining AI models
- Kokoro-82M Team - Exceptional TTS model
- OpenAI Whisper - Revolutionary speech recognition
- Qdrant - High-performance vector database
- React & Node.js communities
Support & Contact
- Email: [email protected]
Website
- VocRT: https://vocrt.vercel.app
If VocRT helps your projects, please consider giving it a star!
Built with ❤️ for the open-source community