---
title: NovaEval by Noveum.ai
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# NovaEval by Noveum.ai
Advanced AI Model Evaluation Platform powered by Hugging Face Models
## 🚀 Features
### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)
### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval
### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1 Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)
### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations
## 🔧 **Technical Stack**
- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces
## 📋 **Available Models**
### Small Models (1-3B)
- **FLAN-T5 Large** (0.8B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
- **Gemma 2B** (2B) - Google
### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **DialoGPT Medium** (345M) - Microsoft
- **CodeLlama 7B Python** (7B) - Meta
### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba
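The model search and size filtering features can be illustrated with a small in-memory catalog. This is a minimal sketch, not NovaEval's actual implementation: the `ModelInfo` record and the `search_models` function are hypothetical, and the entries simply mirror the list above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass(frozen=True)
class ModelInfo:
    # Hypothetical catalog record; not the app's real schema
    name: str
    params_b: float      # parameter count in billions, as listed in this README
    provider: str
    size_category: str   # "small", "medium" or "large", following the README headings

CATALOG: List[ModelInfo] = [
    ModelInfo("FLAN-T5 Large", 0.8, "Google", "small"),
    ModelInfo("Qwen 2.5 3B", 3, "Alibaba", "small"),
    ModelInfo("Gemma 2B", 2, "Google", "small"),
    ModelInfo("Qwen 2.5 7B", 7, "Alibaba", "medium"),
    ModelInfo("Mistral 7B", 7, "Mistral AI", "medium"),
    ModelInfo("DialoGPT Medium", 0.345, "Microsoft", "medium"),
    ModelInfo("CodeLlama 7B Python", 7, "Meta", "medium"),
    ModelInfo("Qwen 2.5 14B", 14, "Alibaba", "large"),
    ModelInfo("Qwen 2.5 32B", 32, "Alibaba", "large"),
    ModelInfo("Qwen 2.5 72B", 72, "Alibaba", "large"),
]

def search_models(
    models: Iterable[ModelInfo],
    query: str = "",
    provider: Optional[str] = None,
    size_category: Optional[str] = None,
) -> List[ModelInfo]:
    """Case-insensitive substring search on name or provider, plus optional exact filters."""
    q = query.strip().lower()
    results = []
    for m in models:
        # Free-text query matches either the model name or the provider
        if q and q not in m.name.lower() and q not in m.provider.lower():
            continue
        if provider is not None and m.provider.lower() != provider.lower():
            continue
        if size_category is not None and m.size_category != size_category:
            continue
        results.append(m)
    return results
```

Storing `size_category` explicitly, rather than deriving it from `params_b`, reproduces the grouping exactly as this README lists it, including DialoGPT Medium (345M) under Medium Models.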
## 📊 **Available Datasets**
### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)
### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)
### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)
### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)
### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)
## 🎯 **Evaluation Metrics**
- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
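- **BLEU Score** - N-gram precision against reference text, used to measure text generation and translation quality
- **ROUGE Score** - N-gram and longest-common-subsequence overlap with reference summaries, used to measure summarization quality
- **Pass@K** - Probability that at least one of K generated code samples passes the problem's unit tests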
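The metric definitions above can be made concrete with a short sketch. It assumes exact string match for accuracy and binary labels for F1, and it uses the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), 1 - C(n-c, k) / C(n, k) for n generated samples of which c pass; whether NovaEval computes Pass@K this way is an assumption. BLEU and ROUGE are omitted because they are usually computed with dedicated libraries.

```python
from math import comb
from typing import Sequence

# Assumption: illustrative metric code; NovaEval's own implementation may differ

def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need equal-length, non-empty inputs")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)

def f1_binary(predictions: Sequence[int], references: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for a single positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all problems in a dataset such as HumanEval gives the benchmark-level score.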
## 🚀 **Quick Start**
### Option 1: Direct Upload to Hugging Face Spaces
1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
   - `app.py` (renamed from `advanced_novaeval_app.py`)
   - `requirements.txt`
   - `Dockerfile`
   - `README.md`
4. Commit and push - your Space will build automatically!
### Option 2: Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run the application
python advanced_novaeval_app.py
# Then open http://localhost:7860 in your browser
```
## 🔧 **Configuration Options**
### Model Parameters
- **Sample Size**: 10-1000 samples
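- **Temperature**: 0.0-2.0 (higher values make output more random; lower values make it more deterministic)
- **Max Tokens**: 128-2048 (maximum number of tokens generated per response)
- **Top-p**: 0.9 (nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches p)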
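To show what the Temperature and Top-p settings do, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over a toy next-token distribution. With the Hugging Face Inference API this sampling runs on the model server, so the functions below are illustrative only and their names are hypothetical.

```python
import math
import random
from typing import Dict, Optional

def temperature_softmax(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Convention assumed here: temperature 0 means greedy (argmax) decoding
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p - 1e-9:  # tolerance absorbs floating-point round-off
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def sample_token(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token: temperature scaling first, then nucleus (top-p) filtering."""
    rng = rng if rng is not None else random.Random()
    probs = top_p_filter(temperature_softmax(logits, temperature), top_p)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
```

Temperature reshapes the whole distribution (flatter above 1.0, sharper below), and top-p then cuts off the low-probability tail; treating temperature 0 as greedy decoding is a common convention rather than something this README specifies.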
### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format
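The parameter ranges and model limit above could be enforced with a small validated config object, which also makes the JSON export straightforward. This is a sketch under assumptions: `EvalConfig`, its field names, its default values and the `export_results` helper are hypothetical, and the app's real request schema may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List

MAX_MODELS = 10  # "Compare up to 10 models"

@dataclass
class EvalConfig:
    # Hypothetical schema; defaults are illustrative, not the app's
    models: List[str]
    dataset: str
    metrics: List[str]
    sample_size: int = 100
    temperature: float = 0.7
    max_tokens: int = 512
    top_p: float = 0.9

    def __post_init__(self) -> None:
        # Bounds taken from the Configuration Options section of this README
        if not 1 <= len(self.models) <= MAX_MODELS:
            raise ValueError(f"select between 1 and {MAX_MODELS} models")
        if not self.metrics:
            raise ValueError("select at least one metric")
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")

def export_results(config: EvalConfig, scores: Dict[str, Dict[str, float]]) -> str:
    """Serialize the config and per-model metric scores to a JSON string."""
    payload: Dict[str, Any] = {"config": asdict(config), "results": scores}
    return json.dumps(payload, indent=2, sort_keys=True)
```

Validating at construction time means an out-of-range slider value is rejected before any model is queried.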
## 📱 **User Experience**
### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons
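The workflow steps map onto a simple evaluation loop. The sketch below is hypothetical: `run_evaluation`, the `generate` callable (a stand-in for a Hugging Face Inference API call) and the `on_progress` callback (a stand-in for the WebSocket progress updates) are assumptions, and scoring is plain exact-match accuracy.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (prompt, reference answer)

def run_evaluation(
    models: List[str],
    dataset: List[Sample],
    generate: Callable[[str, str], str],
    sample_size: int,
    seed: int = 0,
    on_progress: Callable[[str, int, int], None] = lambda model, done, total: None,
) -> Dict[str, float]:
    """Score each model by exact-match accuracy on the same reproducible random subset."""
    rng = random.Random(seed)
    subset = rng.sample(dataset, min(sample_size, len(dataset)))
    results: Dict[str, float] = {}
    for model in models:
        correct = 0
        for i, (prompt, reference) in enumerate(subset, start=1):
            prediction = generate(model, prompt)  # hypothetical model call
            correct += int(prediction.strip() == reference.strip())
            on_progress(model, i, len(subset))  # e.g. push a progress message to the UI
        results[model] = correct / len(subset) if subset else 0.0
    return results
```

Drawing the subset once with a fixed seed keeps the comparison fair: every selected model is scored on exactly the same samples.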
### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side
## 🌟 **Why NovaEval?**
### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis
### For Developers
- **Easy Integration** with Hugging Face ecosystem
- **No API Keys Required** - uses free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis
### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities
## 🔗 **Links**
- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface
## 📄 **License**
This project is open source and available under the MIT License.
## 🤝 **Contributing**
We welcome contributions! Please see our contributing guidelines for more information.
---
**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**