shashankagar committed
Commit 81e8991 · verified · 1 Parent(s): 900252d

Upload 4 files

Files changed (4)
  1. Dockerfile +5 -7
  2. README.md +165 -56
  3. app.py +0 -0
  4. requirements.txt +5 -22
Dockerfile CHANGED
@@ -1,4 +1,3 @@
- # Comprehensive NovaEval Space Dockerfile
  FROM python:3.11-slim
 
  # Set working directory
@@ -6,18 +5,17 @@ WORKDIR /app
 
  # Install system dependencies
  RUN apt-get update && apt-get install -y \
- git \
- build-essential \
+ curl \
  && rm -rf /var/lib/apt/lists/*
 
  # Copy requirements and install Python dependencies
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
 
- # Copy application
- COPY app.py .
+ # Copy application code
+ COPY advanced_novaeval_app.py app.py
 
- # Create non-root user
+ # Create non-root user for security
  RUN useradd -m -u 1000 user
  USER user
 
@@ -28,6 +26,6 @@ EXPOSE 7860
  HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:7860/api/health || exit 1
 
- # Run application
+ # Run the application
  CMD ["python", "app.py"]
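
Note on the health check: the `HEALTHCHECK` shells out to `curl`, which appears to be why this commit adds `curl` to the apt packages (slim Python base images do not normally ship it). As a rough alternative sketch, the same probe could be written with the Python standard library. The function name `is_healthy` is hypothetical and not taken from the Space's code; only the endpoint URL comes from the Dockerfile.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 10.0) -> bool:
    """Return True if `url` answers with a 2xx status, roughly like `curl -f`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP 4xx/5xx responses, refused connections, and timeouts
        # are all treated as "unhealthy".
        return False
```

A container health check would call `is_healthy("http://localhost:7860/api/health")` and exit with status 0 or 1 accordingly. Whether that is preferable to installing `curl` is a judgment call; the 10 second default simply mirrors the `--timeout=10s` value above.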
README.md CHANGED
@@ -1,73 +1,182 @@
- ---
- title: NovaEval by Noveum.ai - Advanced AI Model Evaluation Platform
- emoji: 🚀
- colorFrom: blue
- colorTo: purple
- sdk: docker
- pinned: false
- license: mit
- app_port: 7860
- ---
-
- # NovaEval by Noveum.ai - Advanced AI Model Evaluation Platform
-
- A comprehensive platform for evaluating AI language models using the NovaEval framework. Built by [Noveum.ai](https://noveum.ai) for the AI research community.
-
- ## 🌟 Features
 
- ### 🤖 Latest LLMs
- - **GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo** (OpenAI)
- - **Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku** (Anthropic)
- - **Amazon Titan, Cohere Command** (AWS Bedrock)
- - **Noveum AI Gateway** (Noveum.ai)
 
- ### 📊 Comprehensive Datasets
- - **MMLU** - Massive Multitask Language Understanding
- - **HumanEval** - Code Generation Benchmark
- - **HellaSwag** - Commonsense Reasoning
- - **GSM8K** - Grade School Math
- - **TruthfulQA** - Truthfulness Assessment
- - **Custom Dataset Upload** - Bring your own data
 
- ### Advanced Analytics
- - **Real-time Evaluation Logs** - Live request/response monitoring
- - **Detailed Metrics** - Accuracy, F1-Score, BLEU, ROUGE, Semantic Similarity
- - **Interactive Visualizations** - Charts, comparisons, statistical analysis
- - **Export Results** - JSON, CSV formats
 
- ### 🔧 Advanced Configuration
- - **Sample Size Control** - 10 to 1000 samples
- - **Model Parameters** - Temperature, max tokens, top-p
- - **Evaluation Settings** - Batch size, timeout, retry logic
- - **Cost Estimation** - Real-time cost tracking
 
- ## 🚀 Quick Start
 
- 1. **Select Models** - Choose up to 5 LLMs from different providers
- 2. **Choose Dataset** - Pick from academic benchmarks or upload custom data
- 3. **Pick Metrics** - Select evaluation metrics for your use case
- 4. **Configure** - Set parameters and start evaluation
- 5. **Analyze** - View real-time results and detailed analytics
 
- ## 🔗 Links
 
  - **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- - **NovaEval GitHub**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- - **Documentation**: [NovaEval Docs](https://github.com/Noveum/NovaEval#readme)
 
- ## 🛠️ Technical Details
 
- - **Framework**: NovaEval v0.3.3
- - **Backend**: FastAPI with WebSocket support
- - **Frontend**: Modern HTML5/CSS3/JavaScript
- - **Models**: OpenAI, Anthropic, AWS Bedrock, Noveum.ai APIs
- - **Deployment**: Docker on Hugging Face Spaces
 
- ## 📝 License
 
- MIT License - See [LICENSE](https://github.com/Noveum/NovaEval/blob/main/LICENSE) for details.
 
  ---
 
- **Powered by NovaEval v0.3.3 | Built with ❤️ by [Noveum.ai](https://noveum.ai)**
 
 
+ # NovaEval by Noveum.ai
 
+ Advanced AI Model Evaluation Platform powered by Hugging Face Models
+
+ ## 🚀 Features
 
+ ### 🤖 **Comprehensive Model Selection**
+ - **15+ Top Hugging Face Models** across different size categories
+ - **Real-time Model Search** with provider filtering
+ - **Detailed Model Information** including capabilities, size, and provider
+ - **Size-based Filtering** (Small 1-3B, Medium 7B, Large 14B+)
 
+ ### 📊 **Rich Dataset Collection**
+ - **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
+ - **Category-based Filtering** for easy dataset discovery
+ - **Detailed Dataset Information** including sample counts and difficulty levels
+ - **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval
 
+ ### **Advanced Evaluation Engine**
+ - **Real-time Progress Tracking** with WebSocket updates
+ - **Live Evaluation Logs** showing detailed request/response data
+ - **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
+ - **Configurable Parameters** (sample size, temperature, max tokens)
+
+ ### 🎨 **Modern User Interface**
+ - **Responsive Design** optimized for desktop and mobile
+ - **Interactive Model Cards** with hover effects and selection states
+ - **Real-time Configuration** with sliders and checkboxes
+ - **Professional Gradient Design** with smooth animations
 
+ ## 🔧 **Technical Stack**
+
+ - **Backend**: FastAPI + Python 3.11
+ - **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
+ - **Real-time**: WebSocket for live updates
+ - **Models**: Hugging Face Inference API (free tier)
+ - **Deployment**: Docker + Hugging Face Spaces
+
+ ## 📋 **Available Models**
+
+ ### Small Models (1-3B)
+ - **FLAN-T5 Large** (0.8B) - Google
+ - **Qwen 2.5 3B** (3B) - Alibaba
+ - **Gemma 2B** (2B) - Google
+
+ ### Medium Models (7B)
+ - **Qwen 2.5 7B** (7B) - Alibaba
+ - **Mistral 7B** (7B) - Mistral AI
+ - **DialoGPT Medium** (345M) - Microsoft
+ - **CodeLlama 7B Python** (7B) - Meta
+
+ ### Large Models (14B+)
+ - **Qwen 2.5 14B** (14B) - Alibaba
+ - **Qwen 2.5 32B** (32B) - Alibaba
+ - **Qwen 2.5 72B** (72B) - Alibaba
+
+ ## 📊 **Available Datasets**
+
+ ### Reasoning
+ - **HellaSwag** - Commonsense reasoning (60K samples)
+ - **CommonsenseQA** - Reasoning questions (12.1K samples)
+ - **ARC** - Science reasoning (7.8K samples)
+
+ ### Knowledge
+ - **MMLU** - Multitask understanding (231K samples)
+ - **BoolQ** - Reading comprehension (12.7K samples)
+
+ ### Math
+ - **GSM8K** - Grade school math (17.6K samples)
+ - **AQUA-RAT** - Algebraic reasoning (196K samples)
+
+ ### Code
+ - **HumanEval** - Python code generation (164 samples)
+ - **MBPP** - Basic Python problems (1.4K samples)
+
+ ### Language
+ - **IMDB Reviews** - Sentiment analysis (100K samples)
+ - **CNN/DailyMail** - Summarization (936K samples)
+
+ ## 🎯 **Evaluation Metrics**
+
+ - **Accuracy** - Percentage of correct predictions
+ - **F1 Score** - Harmonic mean of precision and recall
+ - **BLEU Score** - Text generation quality
+ - **ROUGE Score** - Summarization quality
+ - **Pass@K** - Code generation success rate
+
+ ## 🚀 **Quick Start**
+
+ ### Option 1: Direct Upload to Hugging Face Spaces
+
+ 1. Create a new Space on Hugging Face
+ 2. Choose "Docker" as the SDK
+ 3. Upload these files:
+ - `app.py` (renamed from `advanced_novaeval_app.py`)
+ - `requirements.txt`
+ - `Dockerfile`
+ - `README.md`
+ 4. Commit and push - your Space will build automatically!
+
+ ### Option 2: Local Development
+
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python advanced_novaeval_app.py
+
+ # Open browser to http://localhost:7860
+ ```
+
+ ## 🔧 **Configuration Options**
+
+ ### Model Parameters
+ - **Sample Size**: 10-1000 samples
+ - **Temperature**: 0.0-2.0 (creativity control)
+ - **Max Tokens**: 128-2048 (response length)
+ - **Top-p**: 0.9 (nucleus sampling)
+
+ ### Evaluation Settings
+ - **Multiple Model Selection**: Compare up to 10 models
+ - **Flexible Metrics**: Choose relevant metrics for your task
+ - **Real-time Monitoring**: Watch evaluations progress live
+ - **Export Results**: Download results in JSON format
+
+ ## 📱 **User Experience**
+
+ ### Workflow
+ 1. **Select Models** - Choose from 15+ Hugging Face models
+ 2. **Pick Dataset** - Select from 11 evaluation datasets
+ 3. **Configure Metrics** - Choose relevant evaluation metrics
+ 4. **Set Parameters** - Adjust sample size, temperature, etc.
+ 5. **Start Evaluation** - Watch real-time progress and logs
+ 6. **View Results** - Analyze performance comparisons
+
+ ### Features
+ - **Model Search** - Find models by name or provider
+ - **Category Filtering** - Filter by model size or dataset type
+ - **Real-time Logs** - See actual evaluation steps
+ - **Progress Tracking** - Visual progress bars and percentages
+ - **Interactive Results** - Compare models side-by-side
+
+ ## 🌟 **Why NovaEval?**
+
+ ### For Researchers
+ - **Comprehensive Benchmarking** across multiple models and datasets
+ - **Standardized Evaluation** with consistent metrics and procedures
+ - **Real-time Monitoring** to track evaluation progress
+ - **Export Capabilities** for further analysis
 
+ ### For Developers
+ - **Easy Integration** with Hugging Face ecosystem
+ - **No API Keys Required** - uses free HF Inference API
+ - **Modern Interface** with responsive design
+ - **Detailed Logging** for debugging and analysis
+
+ ### For Teams
+ - **Collaborative Evaluation** with shareable results
+ - **Professional Interface** suitable for presentations
+ - **Comprehensive Documentation** for easy onboarding
+ - **Open Source** with full customization capabilities
 
+ ## 🔗 **Links**
 
  - **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
+ - **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
+ - **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
+ - **Documentation**: Available in the application interface
 
+ ## 📄 **License**
 
+ This project is open source and available under the MIT License.
 
+ ## 🤝 **Contributing**
 
+ We welcome contributions! Please see our contributing guidelines for more information.
 
  ---
 
+ **Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**
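
Note on the metrics listed in the new README: Accuracy, F1 Score, and Pass@K have short closed-form definitions, sketched below with the standard library only. This is an illustration, not the code in `app.py` (whose diff is not rendered). The function names are hypothetical, F1 is shown for the binary case, and Pass@K uses the commonly cited unbiased estimator `1 - C(n-c, k) / C(n, k)` for `n` generated samples of which `c` pass.

```python
from math import comb
from typing import Sequence


def accuracy(predictions: Sequence, references: Sequence) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_score(predictions: Sequence, references: Sequence, positive=1) -> float:
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb() returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

BLEU and ROUGE are omitted here because they depend on tokenization and n-gram matching choices that the README does not specify.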
app.py CHANGED
The diff for this file is too large to render. See raw diff
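
Because the `app.py` diff is not shown, the parameter ranges documented in the README (sample size 10-1000, temperature 0.0-2.0, max tokens 128-2048, top-p 0.9) cannot be checked against the code here. A minimal sketch of how such ranges might be enforced follows. The class name `EvalConfig` and all default values except `top_p = 0.9` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical evaluation settings; ranges follow the README."""

    sample_size: int = 100      # assumed default
    temperature: float = 0.7    # assumed default
    max_tokens: int = 512       # assumed default
    top_p: float = 0.9          # value stated in the README

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges at construction time.
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0.0, 1.0]")
```

Validating in `__post_init__` means an out-of-range request fails before any model call is made. Since the stack pins `pydantic`, the real application may well use a pydantic model with field constraints instead.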
 
requirements.txt CHANGED
@@ -1,23 +1,6 @@
- # Comprehensive NovaEval Space Requirements
- fastapi>=0.104.0
- uvicorn[standard]>=0.24.0
- websockets>=12.0
- httpx>=0.25.0
- pydantic>=2.5.0
- python-multipart>=0.0.6
-
- # NovaEval and dependencies
- git+https://github.com/Noveum/NovaEval.git
-
- # Additional ML dependencies
- transformers>=4.35.0
- torch>=2.1.0
- datasets>=2.14.0
- evaluate>=0.4.0
- accelerate>=0.24.0
- tokenizers>=0.15.0
-
- # Optional: For better performance
- numpy>=1.24.0
- pandas>=2.0.0
+ fastapi==0.116.0
+ uvicorn==0.35.0
+ websockets==15.0.1
+ httpx==0.28.1
+ pydantic==2.11.7
 
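
Note on the requirements change: the commit replaces `>=` version ranges and a `git+https` dependency with exact `==` pins. A small check like the one below could confirm that a requirements file stays fully pinned. It is only a sketch: the helper name `unpinned_requirements` is hypothetical, and the regular expression covers simple `name==version` lines (with optional extras), not the full requirement specifier grammar.

```python
import re

# Matches lines such as "fastapi==0.116.0" or "uvicorn[standard]==0.35.0".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with `==`."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            offenders.append(line)
    return offenders
```

Run against the new file contents this returns an empty list, while entries from the old file such as `fastapi>=0.104.0` or the `git+https://github.com/Noveum/NovaEval.git` line would be reported.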