---
title: NovaEval by Noveum.ai
emoji: 
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# NovaEval by Noveum.ai

A web-based platform for evaluating and comparing Hugging Face models on standard benchmarks

## 🚀 Features

### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)

### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval

### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)

### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations

## 🔧 **Technical Stack**

- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces

## 📋 **Available Models**

### Small Models (1-3B)
- **FLAN-T5 Large** (0.8B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
- **Gemma 2B** (2B) - Google

### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **DialoGPT Medium** (345M) - Microsoft
- **CodeLlama 7B Python** (7B) - Meta

### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba

## 📊 **Available Datasets**

### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)

### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)

### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)

### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)

### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)

## 🎯 **Evaluation Metrics**

- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
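- **BLEU Score** - N-gram precision of generated text against reference text
- **ROUGE Score** - N-gram and longest-common-subsequence overlap with reference summaries
- **Pass@K** - Share of coding problems solved by at least one of K generated samples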
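
As a rough illustration of three of the metrics above, here is a minimal sketch in plain Python. The function names are hypothetical and are not taken from the NovaEval codebase; `pass_at_k` uses the commonly cited unbiased estimator `1 - C(n-c, k) / C(n, k)`, which may differ from what the app actually computes.

```python
from math import comb


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_score(predictions, references, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples per problem, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a whole benchmark, `pass_at_k` would typically be averaged over all problems.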

## 🚀 **Quick Start**

### Option 1: Direct Upload to Hugging Face Spaces

1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
   - `app.py` (renamed from `advanced_novaeval_app.py`)
   - `requirements.txt`
   - `Dockerfile`
   - `README.md`
4. Commit and push - your Space will build automatically!

### Option 2: Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python advanced_novaeval_app.py

# Open browser to http://localhost:7860
```

## 🔧 **Configuration Options**

### Model Parameters
- **Sample Size**: 10-1000 samples
- **Temperature**: 0.0-2.0 (sampling randomness; higher values give more varied output)
- **Max Tokens**: 128-2048 (response length)
- **Top-p**: 0.9 (nucleus sampling)
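
The ranges above could be enforced with a small validated container. This is only a sketch: `EvalParams` is a hypothetical name, and the defaults for sample size, temperature, and max tokens are assumptions, since this README states only the allowed ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalParams:
    """Hypothetical container for the model parameters listed above."""

    sample_size: int = 100    # assumed default; the README gives only 10-1000
    temperature: float = 0.7  # assumed default; the README gives only 0.0-2.0
    max_tokens: int = 512     # assumed default; the README gives only 128-2048
    top_p: float = 0.9        # value stated in the README

    def __post_init__(self):
        # Reject values outside the documented ranges.
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
```

For example, `EvalParams(sample_size=50, temperature=0.0)` is accepted, while `EvalParams(sample_size=5)` raises `ValueError`.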

### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format
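
The README does not document the schema of the exported JSON. The sketch below assumes a simple mapping from model name to metric scores; both that shape and the `results_to_json` helper are hypothetical.

```python
import json


def results_to_json(results):
    """Serialize evaluation results to a stable, human-readable JSON string."""
    return json.dumps(results, indent=2, sort_keys=True)


# Assumed result shape: {model name: {metric name: score}}. Values are placeholders.
example_results = {
    "model-a": {"accuracy": 0.5},
    "model-b": {"accuracy": 0.75},
}
exported = results_to_json(example_results)
```

Writing `exported` to a file, for instance with `pathlib.Path("results.json").write_text(exported)`, gives a document that can be loaded back with `json.loads` for further analysis.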

## 📱 **User Experience**

### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons
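
Steps 1 to 6 can be pictured as a single loop over models and samples. The sketch below is not the app's implementation: `run_evaluation` is a hypothetical name, each model is stubbed as a plain callable in place of the real inference call, and exact-match accuracy is the only metric.

```python
import random


def run_evaluation(models, dataset, sample_size, seed=0, on_progress=None):
    """Score every model on the same random sample and return accuracy per model.

    models: dict mapping a model name to a callable(prompt) -> answer string.
    dataset: list of (prompt, expected_answer) pairs.
    on_progress: optional callable(model_name, done, total) for live updates.
    """
    if not dataset:
        raise ValueError("dataset must not be empty")
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    results = {}
    for name, generate in models.items():
        correct = 0
        for done, (prompt, expected) in enumerate(sample, start=1):
            if generate(prompt).strip() == expected:
                correct += 1
            if on_progress is not None:
                # A real app might forward this to the UI over a WebSocket.
                on_progress(name, done, len(sample))
        results[name] = correct / len(sample)
    return results
```

Using one shared sample for all models keeps the side-by-side comparison fair, and the fixed seed makes a run repeatable.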

### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side
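
Model search and category filtering can be sketched as one case-insensitive substring match plus an optional size filter. The catalog below is a hand-copied subset of the model list in this README, and `search_models` is a hypothetical helper, not the app's actual API.

```python
# Subset of the models listed in this README, with the size group each appears under.
MODELS = [
    {"name": "FLAN-T5 Large", "provider": "Google", "size": "small"},
    {"name": "Qwen 2.5 7B", "provider": "Alibaba", "size": "medium"},
    {"name": "Mistral 7B", "provider": "Mistral AI", "size": "medium"},
    {"name": "Qwen 2.5 72B", "provider": "Alibaba", "size": "large"},
]


def search_models(models, query="", size=None):
    """Return models whose name or provider contains the query, optionally by size."""
    q = query.strip().lower()
    return [
        m for m in models
        if (not q or q in m["name"].lower() or q in m["provider"].lower())
        and (size is None or m["size"] == size)
    ]
```

For example, `search_models(MODELS, "alibaba", size="large")` returns only the Qwen 2.5 72B entry.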

## 🌟 **Why NovaEval?**

### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis

### For Developers
- **Easy Integration** with Hugging Face ecosystem
- **No API Keys Required** - uses free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis

### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities

## 🔗 **Links**

- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface

## 📄 **License**

This project is open source and available under the MIT License.

## 🤝 **Contributing**

We welcome contributions! Please see our contributing guidelines for more information.

---

**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**