VocRT - Personal Realtime Voice-to-Voice AI Solution
VocRT is a comprehensive, privacy-first Realtime Voice-to-Voice (V2V) solution that enables natural conversations with AI. Built with cutting-edge TTS models, RAG capabilities, and seamless integration, VocRT processes your voice input and responds with high-quality synthesized speech in real-time.
Key Features
Real-time Voice Processing
- Ultra-low latency voice-to-voice conversion
- High-quality speech synthesis using Kokoro-82M model
- Customizable voice selection with multiple voice options
- Adjustable threshold and silence duration for optimal user experience
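To illustrate how the threshold and silence-duration settings interact, here is a minimal sketch of energy-based end-of-utterance detection. The names and defaults are illustrative only, not VocRT's actual implementation:

```python
import math

def is_silence(frame, threshold=0.01):
    """A frame counts as 'silent' when its RMS energy is below the threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms < threshold

def detect_end_of_utterance(frames, threshold=0.01, min_silent_frames=25):
    """End of speech = a run of consecutive silent frames.

    With 10 ms frames, min_silent_frames=25 corresponds to a 250 ms
    silence duration; raising it makes the assistant wait longer
    before responding.
    """
    silent_run = 0
    for i, frame in enumerate(frames):
        silent_run = silent_run + 1 if is_silence(frame, threshold) else 0
        if silent_run >= min_silent_frames:
            return i
    return None  # speaker has not stopped yet
```

Lowering the threshold makes the detector more sensitive to quiet speech; lengthening the silence run trades latency for fewer premature cut-offs.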
Advanced RAG Capabilities
- Multi-format document support: PDF, CSV, TXT, PPT, PPTX, DOC, DOCX, XLS, XLSX
- URL content extraction: Process web pages, Medium blogs, and online PDFs
- Unlimited document uploads without usage limits or billing concerns
- 100% privacy-first approach with local processing
Privacy & Cost Benefits
- No API usage limits or recurring charges
- Complete data privacy - all processing happens locally
- Offline capability: use a local LLM if your hardware allows
- No data sharing with external AI services
Architecture Overview

┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│   React Client    │────▶│  Express Server   │────▶│   VocRT Engine    │
│    (Frontend)     │     │    (Backend)      │     │     (Python)      │
└───────────────────┘     └───────────────────┘     └───────────────────┘
                                                             │
                                      ┌──────────────────────┴────┐
                                      ▼                           ▼
                            ┌────────────────────┐      ┌───────────────────┐
                            │     Embeddings     │      │    Whisper STT    │
                            │    (e5-base-v2)    │      │    Kokoro TTS     │
                            │     Qdrant DB      │      │                   │
                            │   (Vector Store)   │      └───────────────────┘
                            └────────────────────┘
Repository Structure

VocRT/
├── backend/           # Express.js server
├── frontend/          # React client application
├── models/            # AI models directory
├── voices/            # Available voice profiles
├── demo/              # Sample audio and demo files
├── .env               # Environment configuration
├── requirements.txt   # Python dependencies
└── README.md          # Project documentation
Manual Installation
Prerequisites
- Python 3.10 (required)
- Node.js 16+ and npm
- Docker (for Qdrant vector database)
- Git for cloning repositories
Step 1: Clone Repository
git clone https://huggingface.co/anuragsingh922/VocRT
cd VocRT
Step 2: Python Environment Setup
macOS/Linux:
python3.10 -m venv venv
source venv/bin/activate
Windows:
python3.10 -m venv venv
venv\Scripts\activate
Step 3: Install Python Dependencies
pip install -r requirements.txt
If the installation fails (e.g. due to dependency or PyTorch issues), try the following recovery steps:
pip install --upgrade pip setuptools wheel
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install -r requirements.txt
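After the recovery steps, a quick import check confirms that PyTorch is usable. This is a sanity check only, not part of VocRT itself:

```python
# Verify that PyTorch installed correctly and report the build in use.
import torch

print(torch.__version__)          # e.g. a nightly CPU build
print(torch.cuda.is_available())  # False is expected on the CPU-only wheel
```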
Step 4: Install eSpeak
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install espeak
macOS:
# Install Homebrew if not present
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install eSpeak
brew install espeak
Windows:
- Download from eSpeak official website
- Run installer and follow instructions
- Add installation path to system PATH environment variable
- Verify installation:
espeak --version
Verification:
espeak "VocRT installation successful!"
Step 5: Backend Setup (Express.js)
cd backend
npm install
npm run dev
Step 6: Frontend Setup (Vite)
cd frontend
npm install
npm run dev
Step 7: Qdrant Vector Database Setup
Documentation: Qdrant Quickstart Guide
# Pull Qdrant image
docker pull qdrant/qdrant
# Start Qdrant container
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
Access Points:
- REST API: http://localhost:6333
- Web Dashboard: http://localhost:6333/dashboard
- gRPC API: http://localhost:6334
Step 8: Download Required Models
Embedding Model:
Clone e5-base-v2 to models/e5-base-v2
Whisper STT Model:
Choose your preferred Whisper model size:
Just specify the model name in app.py; it will be downloaded and loaded automatically on first run.
- tiny: Fastest, lower accuracy
- base: Balanced performance
- small: Better accuracy
- medium/large: Highest accuracy, slower processing
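For example, the selection in app.py might look like the line below. The variable name is an assumption for illustration; check app.py for the actual one:

```python
# Hypothetical setting in app.py: the named Whisper checkpoint is
# downloaded automatically the first time the server starts.
whisper_model_name = "base"  # one of: tiny, base, small, medium, large
```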
Step 9: Environment Configuration
Edit the .env file with your API credentials:
# LLM Configuration
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
LLM_PROVIDER=google # 'openai' or 'google'
LLM_MODEL=gemini-2.0-flash # or your preferred model
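How these variables reach the process depends on the server code; below is a minimal stdlib sketch of a .env loader. VocRT may well use a library such as python-dotenv instead, so treat this as illustrative:

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: one KEY=VALUE per line; '#' starts a comment."""
    env = {}
    with open(path) as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()
            if line and "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env

def apply_env(path=".env"):
    """Export the parsed values so os.getenv() sees them."""
    os.environ.update(load_env(path))
```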
Step 10: Launch VocRT Server
python3 app.py
Usage Guide
- Access the application: Navigate to http://localhost:3000
- Select voice profile: Choose from available voice options
- Configure settings: Adjust silence duration for optimal performance
- Add context: Upload documents, provide URLs, or enter text for AI context
- Start conversation: Begin speaking and enjoy real-time voice responses
Supported Document Formats

| Format | Extension | Description |
|---|---|---|
| PDF | `.pdf` | Portable Document Format |
| Text | `.txt` | Plain text files |
| Word | `.doc`, `.docx` | Microsoft Word documents |
| Excel | `.xls`, `.xlsx` | Microsoft Excel spreadsheets |
| PowerPoint | `.ppt`, `.pptx` | Microsoft PowerPoint presentations |
| CSV | `.csv` | Comma-separated values |
| URLs | Web links | Online content, blogs, PDFs |
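Ingestion of this kind is typically a dispatch on file extension (or URL scheme). A minimal sketch; the loader names here are hypothetical and VocRT's actual ingestion code may differ:

```python
from pathlib import Path

# Hypothetical loader names, one per supported family of formats.
LOADERS = {
    ".pdf": "load_pdf",
    ".txt": "load_text",
    ".doc": "load_word", ".docx": "load_word",
    ".xls": "load_excel", ".xlsx": "load_excel",
    ".ppt": "load_powerpoint", ".pptx": "load_powerpoint",
    ".csv": "load_csv",
}

def pick_loader(path_or_url: str) -> str:
    """Route a user-supplied source to the right loader by scheme/extension."""
    if path_or_url.startswith(("http://", "https://")):
        return "load_url"
    suffix = Path(path_or_url).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return LOADERS[suffix]
```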
AI Models & Technology Stack
Core Models
- TTS: Kokoro-82M - High-quality text-to-speech
- STT: OpenAI Whisper - Accurate speech recognition
- Embeddings: e5-base-v2 - Semantic text understanding
- LLM: OpenAI GPT / Google Gemini - Natural language processing
Technology Stack
- Backend: Python, Express.js, gRPC
- Frontend: React, Vite
- Database: Qdrant (Vector Database)
- Audio Processing: Whisper, eSpeak, phonemizer
Performance Optimization
Hardware Recommendations
- CPU: Multi-core processor (4+ cores recommended)
- RAM: 4GB+ for optimal performance
- Storage: SSD for faster model loading
- GPU: Optional; accelerated inference can reduce latency by up to 60%
Configuration Tips
- Modify silence duration for natural conversation flow
- Use smaller Whisper models for faster STT processing
- Enable GPU acceleration if available
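The GPU-acceleration tip can be automated with a standard PyTorch device check. A minimal sketch, assuming the inference code is PyTorch-based; how the chosen device is actually passed to VocRT's models may differ:

```python
import torch

# Pick the best available device and fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = "mps"   # Apple Silicon
else:
    device = "cpu"

print(f"Running inference on: {device}")
```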
Contributing
We welcome contributions from the community! Here's how you can help:
Ways to Contribute
- π Bug Reports: Submit issues with detailed reproduction steps
- π‘ Feature Requests: Suggest new capabilities and improvements
- π Documentation: Improve guides, tutorials, and API docs
- π§ Code Contributions: Submit pull requests with enhancements
Development Setup
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
License
This project is licensed under the MIT License
Acknowledgments
Special thanks to the amazing open-source communities:
- Hugging Face - For hosting and maintaining AI models
- Kokoro-82M Team - Exceptional TTS model
- OpenAI Whisper - Revolutionary speech recognition
- Qdrant - High-performance vector database
- React & Node.js communities
Support & Contact
- Email: [email protected]
Website
- VocRT: https://vocrt.vercel.app
If VocRT helps your projects, please consider giving it a star!
Built with ❤️ for the open-source community