# NovaEval by Noveum.ai

[![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
[![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

## 🚧 Development Status

> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

## 🤝 We Need Your Help!

NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

### 🎯 High-Priority Contribution Areas

We're actively looking for contributors in these key areas:

- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📖 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

### 🚀 Getting Started as a Contributor

1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network

## 🚀 Features

- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install novaeval
```

### From Source

```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```

### Docker

```bash
docker pull noveum/novaeval:latest
```

## πŸƒβ€β™‚οΈ Quick Start

### Basic Evaluation

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```

### Configuration-Based Evaluation

```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```

### Command Line Interface

NovaEval provides a comprehensive CLI for running evaluations:

```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```

📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

### Example Configuration

```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```

## πŸ—οΈ Architecture

NovaEval is built with extensibility and modularity in mind:

```
src/novaeval/
├── datasets/          # Dataset loaders and processors
├── evaluators/        # Core evaluation logic
├── integrations/      # External service integrations
├── models/            # Model interfaces and adapters
├── reporting/         # Report generation and visualization
├── scorers/           # Scoring mechanisms and metrics
└── utils/             # Utility functions and helpers
```

### Core Components

- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrate the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handle external services (S3, credential stores, etc.)

## 📊 Supported Datasets

- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub (see the sketch below)
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
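
For illustration, loading a Hub dataset might look like the following. This is a hedged sketch: the `HuggingFaceDataset` class name and its arguments are assumptions patterned on `MMLUDataset` from the Quick Start, so check the API reference for the exact interface.

```python
from novaeval.datasets import HuggingFaceDataset  # class name is an assumption

# Hypothetical: pull 100 validation samples from a Hub dataset,
# mirroring the MMLUDataset arguments shown in the Quick Start.
dataset = HuggingFaceDataset(
    dataset_name="squad_v2",
    split="validation",
    num_samples=100,
)
```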

## 🤖 Supported Models

- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models (see the sketch below)
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model
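
Because all providers share the same model interface, several models can be evaluated side by side. A minimal sketch, assuming an `AnthropicModel` class that mirrors the `OpenAIModel` constructor from the Quick Start (the YAML example above uses the corresponding `anthropic` type):

```python
from novaeval.models import OpenAIModel, AnthropicModel  # AnthropicModel name assumed

models = [
    OpenAIModel(model_name="gpt-4o-mini", temperature=0.0),
    AnthropicModel(model_name="claude-3-opus", temperature=0.0),  # assumed constructor
]
# Pass the whole list to Evaluator(models=models, ...) to compare providers in one run.
```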

## πŸ“ Built-in Scorers

### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks

### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring (see the sketch below)
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation

### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis

### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows
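
Scorers share a common interface and can be stacked in a single evaluation. A minimal sketch, assuming the semantic-similarity scorer is exposed as `SemanticSimilarityScorer` (only `AccuracyScorer` appears verbatim in the Quick Start; the threshold mirrors the YAML example above):

```python
from novaeval.scorers import AccuracyScorer, SemanticSimilarityScorer  # second name assumed

scorers = [
    AccuracyScorer(extract_answer=True),
    SemanticSimilarityScorer(threshold=0.8),  # threshold as in the YAML example
]
# Every scorer in the list is applied to each model output during evaluator.run().
```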

## 🚀 Deployment

### Local Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```

### Docker

```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```

## 🔧 Configuration

NovaEval supports configuration through:

- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides

### Environment Variables

```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```

### CI/CD Integration

NovaEval includes optimized GitHub Actions workflows:
- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows

## 📈 Reporting and Artifacts

NovaEval generates comprehensive evaluation reports:

- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF

### Example Report Structure

```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```
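
The JSON artifacts are straightforward to post-process. A hedged sketch that prints per-scorer means from `summary.json`, assuming the file mirrors the `model_results` structure iterated in the Quick Start:

```python
import json
from pathlib import Path

summary = json.loads(Path("results/summary.json").read_text())

# Assumed layout: the same nesting as the results dict in the Quick Start.
for model_name, model_results in summary.get("model_results", {}).items():
    for scorer_name, score_info in model_results.get("scores", {}).items():
        if isinstance(score_info, dict):
            print(f"{model_name} / {scorer_name}: {score_info.get('mean', 0):.4f}")
```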

## 🔌 Extending NovaEval

### Custom Datasets

```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Implement data loading logic (file, API, database, ...)
        samples = [{"input": "What is 2 + 2?", "expected": "4"}]
        return samples

    def get_sample(self, index):
        # Return the individual sample at `index`
        return self.load_data()[index]
```

### Custom Scorers

```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic; this example scores an exact match
        return float(prediction.strip() == ground_truth.strip())
```

### Custom Models

```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implement model inference; replace the stub with your provider's API call
        response = f"stub response for: {prompt}"
        return response
```
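
The custom classes plug into the same `Evaluator` shown in the Quick Start. Assuming the base classes need no extra constructor arguments, wiring them together looks like this:

```python
from novaeval import Evaluator

evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```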

## 🤝 Contributing

We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

### 🎯 Priority Contribution Areas

As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

### Development Setup

```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```

### πŸ—οΈ Contribution Workflow

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### 📋 Contribution Guidelines

- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description

### 🎉 Recognition

Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community

## 📞 Support

- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
- **Email**: [email protected]

---

Made with ❤️ by the Noveum.ai team