---
title: NovaEval by Noveum.ai
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# NovaEval by Noveum.ai

Advanced AI Model Evaluation Platform powered by Hugging Face Models

## 🚀 Features

### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)

### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, and HumanEval

### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)

### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations

## 🔧 **Technical Stack**

- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces

## 📋 **Available Models**

### Small Models (1-3B)
- **FLAN-T5 Large** (0.8B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
- **Gemma 2B** (2B) - Google

### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **DialoGPT Medium** (345M) - Microsoft
- **CodeLlama 7B Python** (7B) - Meta

### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba

## 📊 **Available Datasets**

### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)

### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)

### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)

### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)

### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)

## 🎯 **Evaluation Metrics**

- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
- **BLEU Score** - Text generation quality
- **ROUGE Score** - Summarization quality
- **Pass@K** - Code generation success rate
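These metrics follow their standard definitions. For reference, here is a minimal sketch of how accuracy, F1, and the unbiased Pass@K estimator are commonly computed; it is illustrative only, the helper names are ours rather than NovaEval's internal API, and it assumes scikit-learn is installed:

```python
# Illustrative sketch: standard metric formulations, not NovaEval's internal code.
from math import comb
from sklearn.metrics import accuracy_score, f1_score

def classification_metrics(references, predictions):
    """Accuracy and macro-averaged F1 over parallel lists of labels."""
    return {
        "accuracy": accuracy_score(references, predictions),
        "f1": f1_score(references, predictions, average="macro"),
    }

def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper): 1 - C(n - c, k) / C(n, k)."""
    if n_samples - n_correct < k:
        return 1.0
    return 1.0 - comb(n_samples - n_correct, k) / comb(n_samples, k)

# Example: 4 labelled answers, and 2 of 5 generated programs pass the unit tests.
print(classification_metrics(["A", "B", "A", "B"], ["A", "B", "B", "B"]))  # accuracy 0.75
print(round(pass_at_k(n_samples=5, n_correct=2, k=1), 3))                  # 0.4
```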
## 🚀 **Quick Start**

### Option 1: Direct Upload to Hugging Face Spaces
1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
   - `app.py` (renamed from `advanced_novaeval_app.py`)
   - `requirements.txt`
   - `Dockerfile`
   - `README.md`
4. Commit and push - your Space will build automatically!

### Option 2: Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python advanced_novaeval_app.py

# Open browser to http://localhost:7860
```

## 🔧 **Configuration Options**

### Model Parameters
- **Sample Size**: 10-1000 samples
- **Temperature**: 0.0-2.0 (creativity control)
- **Max Tokens**: 128-2048 (response length)
- **Top-p**: 0.9 (nucleus sampling)

### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format

## 📱 **User Experience**

### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons

### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side

## 🌟 **Why NovaEval?**

### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis

### For Developers
- **Easy Integration** with the Hugging Face ecosystem
- **No API Keys Required** - uses the free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis

### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities

## 🔗 **Links**

- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface

## 📄 **License**

This project is open source and available under the MIT License.

## 🤝 **Contributing**

We welcome contributions! Please see our contributing guidelines for more information.

---

**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**