Upload 4 files
- Dockerfile +5 -7
- README.md +165 -56
- app.py +0 -0
- requirements.txt +5 -22
Dockerfile
CHANGED
@@ -1,4 +1,3 @@
|
|
1 |
-
# Comprehensive NovaEval Space Dockerfile
|
2 |
FROM python:3.11-slim
|
3 |
|
4 |
# Set working directory
|
@@ -6,18 +5,17 @@ WORKDIR /app
|
|
6 |
|
7 |
# Install system dependencies
|
8 |
RUN apt-get update && apt-get install -y \
|
9 |
-
|
10 |
-
build-essential \
|
11 |
&& rm -rf /var/lib/apt/lists/*
|
12 |
|
13 |
# Copy requirements and install Python dependencies
|
14 |
COPY requirements.txt .
|
15 |
RUN pip install --no-cache-dir -r requirements.txt
|
16 |
|
17 |
-
# Copy application
|
18 |
-
COPY
|
19 |
|
20 |
-
# Create non-root user
|
21 |
RUN useradd -m -u 1000 user
|
22 |
USER user
|
23 |
|
@@ -28,6 +26,6 @@ EXPOSE 7860
|
|
28 |
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
|
29 |
CMD curl -f http://localhost:7860/api/health || exit 1
|
30 |
|
31 |
-
# Run application
|
32 |
CMD ["python", "app.py"]
|
33 |
|
|
|
1 |
FROM python:3.11-slim
|
2 |
|
3 |
# Set working directory
|
|
|
5 |
|
6 |
# Install system dependencies
|
7 |
RUN apt-get update && apt-get install -y \
|
8 |
+
curl \
|
9 |
&& rm -rf /var/lib/apt/lists/*
|
10 |
|
11 |
# Copy requirements and install Python dependencies
|
12 |
COPY requirements.txt .
|
13 |
RUN pip install --no-cache-dir -r requirements.txt
|
14 |
|
15 |
+
# Copy application code
|
16 |
+
COPY advanced_novaeval_app.py app.py
|
17 |
|
18 |
+
# Create non-root user for security
|
19 |
RUN useradd -m -u 1000 user
|
20 |
USER user
|
21 |
|
|
|
26 |
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
|
27 |
CMD curl -f http://localhost:7860/api/health || exit 1
|
28 |
|
29 |
+
# Run the application
|
30 |
CMD ["python", "app.py"]
|
31 |
|
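The `HEALTHCHECK` in this Dockerfile polls `http://localhost:7860/api/health` with `curl -f`, which exits non-zero when the server cannot be reached or answers with an HTTP status of 400 or above. That is why this commit adds `curl` to the installed system packages. The `app.py` diff is not rendered, so the real endpoint is not visible here. The sketch below is only a stand-in built on Python's standard library, not the FastAPI app itself: `HealthHandler`, `start_server`, `probe`, and the `{"status": "healthy"}` body are hypothetical names and values chosen for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in server: answers GET /api/health with a small JSON body."""

    def do_GET(self):
        if self.path == "/api/health":
            body = json.dumps({"status": "healthy"}).encode("utf-8")  # assumed payload
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server():
    """Start the stand-in server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: the OS picks one
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(url, timeout=10.0):
    """Rough equivalent of `curl -f URL || exit 1`: True only if the request succeeds."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # ignore proxies
    try:
        with opener.open(url, timeout=timeout):
            return True
    except OSError:  # covers HTTPError (status >= 400), URLError, and timeouts
        return False


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    print(probe(f"http://127.0.0.1:{port}/api/health"))
    print(probe(f"http://127.0.0.1:{port}/missing"))
    server.shutdown()
    server.server_close()
```

The 10 second default for `timeout` mirrors `--timeout=10s` in the `HEALTHCHECK` line; Docker marks the container unhealthy only after three consecutive failures (`--retries=3`).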
README.md
CHANGED
@@ -1,73 +1,182 @@
|
|
1 |
-
|
2 |
-
title: NovaEval by Noveum.ai - Advanced AI Model Evaluation Platform
|
3 |
-
emoji: 🚀
|
4 |
-
colorFrom: blue
|
5 |
-
colorTo: purple
|
6 |
-
sdk: docker
|
7 |
-
pinned: false
|
8 |
-
license: mit
|
9 |
-
app_port: 7860
|
10 |
-
---
|
11 |
-
|
12 |
-
# NovaEval by Noveum.ai - Advanced AI Model Evaluation Platform
|
13 |
-
|
14 |
-
A comprehensive platform for evaluating AI language models using the NovaEval framework. Built by [Noveum.ai](https://noveum.ai) for the AI research community.
|
15 |
-
|
16 |
-
## 🌟 Features
|
17 |
|
18 |
-
|
19 |
-
|
20 |
-
|
21 |
-
- **Amazon Titan, Cohere Command** (AWS Bedrock)
|
22 |
-
- **Noveum AI Gateway** (Noveum.ai)
|
23 |
|
24 |
-
###
|
25 |
-
- **
|
26 |
-
- **
|
27 |
-
- **
|
28 |
-
- **
|
29 |
-
- **TruthfulQA** - Truthfulness Assessment
|
30 |
-
- **Custom Dataset Upload** - Bring your own data
|
31 |
|
32 |
-
###
|
33 |
-
- **
|
34 |
-
- **
|
35 |
-
- **
|
36 |
-
- **
|
37 |
|
38 |
-
###
|
39 |
-
- **
|
40 |
-
- **
|
41 |
-
- **
|
42 |
-
- **
|
43 |
|
44 |
-
##
|
45 |
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
|
51 |
|
52 |
-
## 🔗 Links
|
53 |
|
54 |
- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
|
55 |
-
- **NovaEval
|
56 |
-
- **
|
57 |
|
58 |
-
##
|
59 |
|
60 |
-
|
61 |
-
- **Backend**: FastAPI with WebSocket support
|
62 |
-
- **Frontend**: Modern HTML5/CSS3/JavaScript
|
63 |
-
- **Models**: OpenAI, Anthropic, AWS Bedrock, Noveum.ai APIs
|
64 |
-
- **Deployment**: Docker on Hugging Face Spaces
|
65 |
|
66 |
-
##
|
67 |
|
68 |
-
|
69 |
|
70 |
---
|
71 |
|
72 |
-
**
|
73 |
|
|
|
1 |
+
# NovaEval by Noveum.ai
|
2 |
|
3 |
+
Advanced AI Model Evaluation Platform powered by Hugging Face Models
|
4 |
+
|
5 |
+
## 🚀 Features
|
6 |
|
7 |
+
### 🤖 **Comprehensive Model Selection**
|
8 |
+
- **15+ Top Hugging Face Models** across different size categories
|
9 |
+
- **Real-time Model Search** with provider filtering
|
10 |
+
- **Detailed Model Information** including capabilities, size, and provider
|
11 |
+
- **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)
|
12 |
|
13 |
+
### 📊 **Rich Dataset Collection**
|
14 |
+
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
|
15 |
+
- **Category-based Filtering** for easy dataset discovery
|
16 |
+
- **Detailed Dataset Information** including sample counts and difficulty levels
|
17 |
+
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval
|
18 |
|
19 |
+
### ⚡ **Advanced Evaluation Engine**
|
20 |
+
- **Real-time Progress Tracking** with WebSocket updates
|
21 |
+
- **Live Evaluation Logs** showing detailed request/response data
|
22 |
+
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
|
23 |
+
- **Configurable Parameters** (sample size, temperature, max tokens)
|
24 |
+
|
25 |
+
### 🎨 **Modern User Interface**
|
26 |
+
- **Responsive Design** optimized for desktop and mobile
|
27 |
+
- **Interactive Model Cards** with hover effects and selection states
|
28 |
+
- **Real-time Configuration** with sliders and checkboxes
|
29 |
+
- **Professional Gradient Design** with smooth animations
|
30 |
|
31 |
+
## 🔧 **Technical Stack**
|
32 |
+
|
33 |
+
- **Backend**: FastAPI + Python 3.11
|
34 |
+
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
|
35 |
+
- **Real-time**: WebSocket for live updates
|
36 |
+
- **Models**: Hugging Face Inference API (free tier)
|
37 |
+
- **Deployment**: Docker + Hugging Face Spaces
|
38 |
+
|
39 |
+
## 📋 **Available Models**
|
40 |
+
|
41 |
+
### Small Models (1-3B)
|
42 |
+
- **FLAN-T5 Large** (0.8B) - Google
|
43 |
+
- **Qwen 2.5 3B** (3B) - Alibaba
|
44 |
+
- **Gemma 2B** (2B) - Google
|
45 |
+
|
46 |
+
### Medium Models (7B)
|
47 |
+
- **Qwen 2.5 7B** (7B) - Alibaba
|
48 |
+
- **Mistral 7B** (7B) - Mistral AI
|
49 |
+
- **DialoGPT Medium** (345M) - Microsoft
|
50 |
+
- **CodeLlama 7B Python** (7B) - Meta
|
51 |
+
|
52 |
+
### Large Models (14B+)
|
53 |
+
- **Qwen 2.5 14B** (14B) - Alibaba
|
54 |
+
- **Qwen 2.5 32B** (32B) - Alibaba
|
55 |
+
- **Qwen 2.5 72B** (72B) - Alibaba
|
56 |
+
|
57 |
+
## 📊 **Available Datasets**
|
58 |
+
|
59 |
+
### Reasoning
|
60 |
+
- **HellaSwag** - Commonsense reasoning (60K samples)
|
61 |
+
- **CommonsenseQA** - Reasoning questions (12.1K samples)
|
62 |
+
- **ARC** - Science reasoning (7.8K samples)
|
63 |
+
|
64 |
+
### Knowledge
|
65 |
+
- **MMLU** - Multitask understanding (231K samples)
|
66 |
+
- **BoolQ** - Reading comprehension (12.7K samples)
|
67 |
+
|
68 |
+
### Math
|
69 |
+
- **GSM8K** - Grade school math (17.6K samples)
|
70 |
+
- **AQUA-RAT** - Algebraic reasoning (196K samples)
|
71 |
+
|
72 |
+
### Code
|
73 |
+
- **HumanEval** - Python code generation (164 samples)
|
74 |
+
- **MBPP** - Basic Python problems (1.4K samples)
|
75 |
+
|
76 |
+
### Language
|
77 |
+
- **IMDB Reviews** - Sentiment analysis (100K samples)
|
78 |
+
- **CNN/DailyMail** - Summarization (936K samples)
|
79 |
+
|
80 |
+
## 🎯 **Evaluation Metrics**
|
81 |
+
|
82 |
+
- **Accuracy** - Percentage of correct predictions
|
83 |
+
- **F1 Score** - Harmonic mean of precision and recall
|
84 |
+
- **BLEU Score** - Text generation quality
|
85 |
+
- **ROUGE Score** - Summarization quality
|
86 |
+
- **Pass@K** - Code generation success rate
|
87 |
+
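Three of the metrics listed above have short closed-form definitions, sketched here in plain Python. This is an illustration only: the `app.py` diff is not rendered, so how the application actually computes its scores is unknown, and the function names `accuracy`, `f1_score`, and `pass_at_k` are hypothetical. `pass_at_k` uses the standard unbiased estimator `1 - C(n - c, k) / C(n, k)`, where `n` samples were generated for a problem and `c` of them passed. BLEU and ROUGE are left out because they need tokenization and n-gram matching that would not fit in a short sketch.

```python
from math import comb


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_score(predictions, references, positive=1):
    """Binary F1: harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n passes, given c pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    preds, refs = [1, 0, 1, 1], [1, 0, 0, 1]
    print(accuracy(preds, refs))
    print(f1_score(preds, refs))
    print(pass_at_k(n=10, c=3, k=1))
```

For a benchmark such as HumanEval, the per-problem `pass_at_k` values would then be averaged over all problems.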
|
88 |
+
## 🚀 **Quick Start**
|
89 |
+
|
90 |
+
### Option 1: Direct Upload to Hugging Face Spaces
|
91 |
+
|
92 |
+
1. Create a new Space on Hugging Face
|
93 |
+
2. Choose "Docker" as the SDK
|
94 |
+
3. Upload these files:
|
95 |
+
- `app.py` (renamed from `advanced_novaeval_app.py`)
|
96 |
+
- `requirements.txt`
|
97 |
+
- `Dockerfile`
|
98 |
+
- `README.md`
|
99 |
+
4. Commit and push - your Space will build automatically!
|
100 |
+
|
101 |
+
### Option 2: Local Development
|
102 |
+
|
103 |
+
```bash
|
104 |
+
# Install dependencies
|
105 |
+
pip install -r requirements.txt
|
106 |
+
|
107 |
+
# Run the application
|
108 |
+
python advanced_novaeval_app.py
|
109 |
+
|
110 |
+
# Open browser to http://localhost:7860
|
111 |
+
```
|
112 |
+
|
113 |
+
## 🔧 **Configuration Options**
|
114 |
+
|
115 |
+
### Model Parameters
|
116 |
+
- **Sample Size**: 10-1000 samples
|
117 |
+
- **Temperature**: 0.0-2.0 (creativity control)
|
118 |
+
- **Max Tokens**: 128-2048 (response length)
|
119 |
+
- **Top-p**: 0.9 (nucleus sampling)
|
120 |
+
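The ranges in the Model Parameters list above can be enforced before an evaluation starts. The sketch below is one hedged way to do that with a standard-library dataclass. The class name `EvalParams` and the default values for `sample_size`, `temperature`, and `max_tokens` are assumptions made for this example; only the ranges and the top-p value of 0.9 come from the README. The real application may validate differently, for example with the `pydantic` package pinned in `requirements.txt`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalParams:
    """Evaluation parameters, range-checked against the README's stated limits."""

    sample_size: int = 100    # assumed default; README range is 10-1000
    temperature: float = 0.7  # assumed default; README range is 0.0-2.0
    max_tokens: int = 512     # assumed default; README range is 128-2048
    top_p: float = 0.9        # value listed in the README

    def __post_init__(self):
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0.0, 1.0]")


if __name__ == "__main__":
    print(EvalParams(sample_size=50, temperature=0.2))
    try:
        EvalParams(sample_size=5)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Freezing the dataclass keeps a validated configuration from being changed to an out-of-range value after construction.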
|
121 |
+
### Evaluation Settings
|
122 |
+
- **Multiple Model Selection**: Compare up to 10 models
|
123 |
+
- **Flexible Metrics**: Choose relevant metrics for your task
|
124 |
+
- **Real-time Monitoring**: Watch evaluations progress live
|
125 |
+
- **Export Results**: Download results in JSON format
|
126 |
+
|
127 |
+
## 📱 **User Experience**
|
128 |
+
|
129 |
+
### Workflow
|
130 |
+
1. **Select Models** - Choose from 15+ Hugging Face models
|
131 |
+
2. **Pick Dataset** - Select from 11 evaluation datasets
|
132 |
+
3. **Configure Metrics** - Choose relevant evaluation metrics
|
133 |
+
4. **Set Parameters** - Adjust sample size, temperature, etc.
|
134 |
+
5. **Start Evaluation** - Watch real-time progress and logs
|
135 |
+
6. **View Results** - Analyze performance comparisons
|
136 |
+
|
137 |
+
### Features
|
138 |
+
- **Model Search** - Find models by name or provider
|
139 |
+
- **Category Filtering** - Filter by model size or dataset type
|
140 |
+
- **Real-time Logs** - See actual evaluation steps
|
141 |
+
- **Progress Tracking** - Visual progress bars and percentages
|
142 |
+
- **Interactive Results** - Compare models side-by-side
|
143 |
+
|
144 |
+
## 🌟 **Why NovaEval?**
|
145 |
+
|
146 |
+
### For Researchers
|
147 |
+
- **Comprehensive Benchmarking** across multiple models and datasets
|
148 |
+
- **Standardized Evaluation** with consistent metrics and procedures
|
149 |
+
- **Real-time Monitoring** to track evaluation progress
|
150 |
+
- **Export Capabilities** for further analysis
|
151 |
|
152 |
+
### For Developers
|
153 |
+
- **Easy Integration** with Hugging Face ecosystem
|
154 |
+
- **No API Keys Required** - uses free HF Inference API
|
155 |
+
- **Modern Interface** with responsive design
|
156 |
+
- **Detailed Logging** for debugging and analysis
|
157 |
+
|
158 |
+
### For Teams
|
159 |
+
- **Collaborative Evaluation** with shareable results
|
160 |
+
- **Professional Interface** suitable for presentations
|
161 |
+
- **Comprehensive Documentation** for easy onboarding
|
162 |
+
- **Open Source** with full customization capabilities
|
163 |
|
164 |
+
## 🔗 **Links**
|
165 |
|
166 |
- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
|
167 |
+
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
|
168 |
+
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
|
169 |
+
- **Documentation**: Available in the application interface
|
170 |
|
171 |
+
## 📄 **License**
|
172 |
|
173 |
+
This project is open source and available under the MIT License.
|
174 |
|
175 |
+
## 🤝 **Contributing**
|
176 |
|
177 |
+
We welcome contributions! Please see our contributing guidelines for more information.
|
178 |
|
179 |
---
|
180 |
|
181 |
+
**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**
|
182 |
|
app.py
CHANGED
The diff for this file is too large to render.
See raw diff
|
|
requirements.txt
CHANGED
@@ -1,23 +1,6 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
pydantic>=2.5.0
|
7 |
-
python-multipart>=0.0.6
|
8 |
-
|
9 |
-
# NovaEval and dependencies
|
10 |
-
git+https://github.com/Noveum/NovaEval.git
|
11 |
-
|
12 |
-
# Additional ML dependencies
|
13 |
-
transformers>=4.35.0
|
14 |
-
torch>=2.1.0
|
15 |
-
datasets>=2.14.0
|
16 |
-
evaluate>=0.4.0
|
17 |
-
accelerate>=0.24.0
|
18 |
-
tokenizers>=0.15.0
|
19 |
-
|
20 |
-
# Optional: For better performance
|
21 |
-
numpy>=1.24.0
|
22 |
-
pandas>=2.0.0
|
23 |
|
|
|
1 |
+
fastapi==0.116.0
|
2 |
+
uvicorn==0.35.0
|
3 |
+
websockets==15.0.1
|
4 |
+
httpx==0.28.1
|
5 |
+
pydantic==2.11.7
|
6 |
|