---
title: NovaEval by Noveum.ai
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# NovaEval by Noveum.ai

Advanced AI Model Evaluation Platform powered by Hugging Face Models

## 🚀 Features

### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)

### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, and HumanEval

### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)

### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations

## 🔧 **Technical Stack**

- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces

## 📋 **Available Models**

### Small Models (1-3B)
- **FLAN-T5 Large** (0.8B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
- **Gemma 2B** (2B) - Google

### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **DialoGPT Medium** (345M) - Microsoft
- **CodeLlama 7B Python** (7B) - Meta

### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba

## 📊 **Available Datasets**

### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)

### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)

### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)

### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)

### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)

## 🎯 **Evaluation Metrics**

- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
- **BLEU Score** - Text generation quality
- **ROUGE Score** - Summarization quality
- **Pass@K** - Code generation success rate
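These metrics follow their standard definitions. For reference, here is a minimal sketch of how accuracy, F1, and the unbiased Pass@K estimator are commonly computed; it is illustrative only, the helper names are ours rather than NovaEval's internal API, and it assumes scikit-learn is installed:

```python
# Illustrative sketch: standard metric formulations, not NovaEval's internal code.
from math import comb
from sklearn.metrics import accuracy_score, f1_score

def classification_metrics(references, predictions):
    """Accuracy and macro-averaged F1 over parallel lists of labels."""
    return {
        "accuracy": accuracy_score(references, predictions),
        "f1": f1_score(references, predictions, average="macro"),
    }

def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper): 1 - C(n - c, k) / C(n, k)."""
    if n_samples - n_correct < k:
        return 1.0
    return 1.0 - comb(n_samples - n_correct, k) / comb(n_samples, k)

# Example: 4 labelled answers, and 2 of 5 generated programs pass the unit tests.
print(classification_metrics(["A", "B", "A", "B"], ["A", "B", "B", "B"]))  # accuracy 0.75
print(round(pass_at_k(n_samples=5, n_correct=2, k=1), 3))                  # 0.4
```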
## 🚀 **Quick Start**

### Option 1: Direct Upload to Hugging Face Spaces
1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
   - `app.py` (renamed from `advanced_novaeval_app.py`)
   - `requirements.txt`
   - `Dockerfile`
   - `README.md`
4. Commit and push - your Space will build automatically!

### Option 2: Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python advanced_novaeval_app.py

# Open browser to http://localhost:7860
```

## 🔧 **Configuration Options**

### Model Parameters
- **Sample Size**: 10-1000 samples
- **Temperature**: 0.0-2.0 (creativity control)
- **Max Tokens**: 128-2048 (response length)
- **Top-p**: 0.9 (nucleus sampling)

### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format

## 📱 **User Experience**

### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons

### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side

## 🌟 **Why NovaEval?**

### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis

### For Developers
- **Easy Integration** with the Hugging Face ecosystem
- **No API Keys Required** - uses the free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis

### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities

## 🔗 **Links**

- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface

## 📄 **License**

This project is open source and available under the MIT License.

## 🤝 **Contributing**

We welcome contributions! Please see our contributing guidelines for more information.

---

**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**