shashankagar committed on
Commit 70eb27e · verified · 1 parent: 4f7f8d4
Files changed (1): README.md (+452 -8)
README.md CHANGED
@@ -1,10 +1,454 @@
- ---
- title: README
- emoji: 🌖
- colorFrom: gray
- colorTo: red
- sdk: static
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.

# NovaEval by Noveum.ai

[![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
[![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

## 🚧 Development Status

> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

## 🤝 We Need Your Help!

NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

### 🎯 High-Priority Contribution Areas

We're actively looking for contributors in these key areas:

- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📖 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

### 🚀 Getting Started as a Contributor

1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network

## 🚀 Features

- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Deployment-Oriented**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install novaeval
```

### From Source

```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```

### Docker

```bash
docker pull noveum/novaeval:latest
```

## 🏃‍♂️ Quick Start

### Basic Evaluation

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```

### Configuration-Based Evaluation

```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```

### Command Line Interface

NovaEval provides a comprehensive CLI for running evaluations:

```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```

📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

### Example Configuration

```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```

## 🏗️ Architecture

NovaEval is built with extensibility and modularity in mind:

```
src/novaeval/
├── datasets/      # Dataset loaders and processors
├── evaluators/    # Core evaluation logic
├── integrations/  # External service integrations
├── models/        # Model interfaces and adapters
├── reporting/     # Report generation and visualization
├── scorers/       # Scoring mechanisms and metrics
└── utils/         # Utility functions and helpers
```

### Core Components

- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrates the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handles external services (S3, credential stores, etc.)

## 📊 Supported Datasets

- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub (see the sketch after this list)
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
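
A minimal sketch of what Hub-based loading could look like. The `HuggingFaceDataset` class name and its parameters are assumptions for illustration (by analogy with `MMLUDataset`), so check `novaeval.datasets` for the actual loader and signature:

```python
# Hypothetical sketch: the class name and parameters below are assumed,
# by analogy with MMLUDataset, and are not confirmed by this README.
from novaeval.datasets import HuggingFaceDataset  # assumed import

dataset = HuggingFaceDataset(
    dataset_name="squad",  # any dataset id from the HuggingFace Hub
    split="validation",
    num_samples=100,
)
```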

## 🤖 Supported Models

- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model (see the comparison sketch after this list)
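
The Quick Start evaluates a single model, but `Evaluator` takes a list of models, which is how side-by-side provider comparisons run. A sketch follows; the `AnthropicModel` class name is an assumption by analogy with `OpenAIModel` (only the `"anthropic"` config type appears in this README):

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel, AnthropicModel  # AnthropicModel name assumed
from novaeval.scorers import AccuracyScorer

# Evaluate two providers on the same samples for a direct comparison.
evaluator = Evaluator(
    dataset=MMLUDataset(subset="elementary_mathematics", num_samples=10, split="test"),
    models=[
        OpenAIModel(model_name="gpt-4o-mini", temperature=0.0),
        AnthropicModel(model_name="claude-3-opus", temperature=0.0),  # assumed class
    ],
    scorers=[AccuracyScorer(extract_answer=True)],
    output_dir="./results",
)
results = evaluator.run()
```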

## 📝 Built-in Scorers

### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks

### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation

### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis

### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows (see the scoring sketch after this list)
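
Scorers can also be exercised on their own, which is useful for sanity-checking a metric before a full run. A minimal sketch using the `score(prediction, ground_truth, context=None)` interface shown in the Extending section below; the exact return value of each built-in scorer is not specified here:

```python
from novaeval.scorers import AccuracyScorer

scorer = AccuracyScorer(extract_answer=True)

# Score one prediction against its reference answer.
result = scorer.score(
    prediction="The correct answer is B.",
    ground_truth="B",
)
print(result)
```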

## 🚀 Deployment

### Local Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```

### Docker

```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```

## 🔧 Configuration

NovaEval supports configuration through:

- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides

### Environment Variables

```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```

### CI/CD Integration

NovaEval includes optimized GitHub Actions workflows:

- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on the main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows

## 📈 Reporting and Artifacts

NovaEval generates comprehensive evaluation reports:

- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF

### Example Report Structure

```
results/
├── summary.json           # High-level metrics
├── detailed_results.csv   # Per-sample results
├── artifacts/
│   ├── model_outputs/     # Raw model responses
│   ├── intermediate/      # Processing artifacts
│   └── debug/             # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html            # Interactive HTML report
```
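
Because results land on disk as plain files, downstream tooling can consume them directly. A small sketch that reads the summary after a run, assuming the layout above (the keys inside `summary.json` are not documented here):

```python
import json
from pathlib import Path

# Load the high-level metrics written by an evaluation run.
summary = json.loads(Path("results/summary.json").read_text())
print(json.dumps(summary, indent=2))
```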

## 🔌 Extending NovaEval

### Custom Datasets

```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Load or build the full list of samples
        # (illustrative shape: input text plus expected answer).
        return [
            {"input": "What is 2 + 2?", "expected": "4"},
            {"input": "What is 3 * 3?", "expected": "9"},
        ]

    def get_sample(self, index):
        # Return a single sample by index.
        return self.load_data()[index]
```

### Custom Scorers

```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Return a numeric score; here, a case-insensitive exact match.
        return float(prediction.strip().lower() == ground_truth.strip().lower())
```

### Custom Models

```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Call your provider's API here and return the generated text.
        # A fixed echo keeps this skeleton self-contained and runnable.
        return f"Echo: {prompt}"
```
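
The custom classes above should slot into the same `Evaluator` flow as the built-ins. A sketch of the wiring, assuming the subclasses defined above and that the base constructors require no additional arguments (not verified here):

```python
from novaeval import Evaluator

# Combine the custom dataset, model, and scorer defined above.
evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```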

## 🤝 Contributing

We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

### 🎯 Priority Contribution Areas

As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

### Development Setup

```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```

### 🏗️ Contribution Workflow

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### 📋 Contribution Guidelines

- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description

### 🎉 Recognition

Contributors will be:

- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
435
+ ## πŸ“„ License
436
+
437
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
438
+
439
+ ## πŸ™ Acknowledgments
440
+
441
+ - Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
442
+ - Built with modern Python best practices and industry standards
443
+ - Designed for the AI evaluation community
444
+
445
+ ## πŸ“ž Support
446
+
447
+ - **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
448
+ - **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
449
+ - **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
450
+ - **Email**: [email protected]
451
+
452
  ---
453
 
454
+ Made with ❀️ by the Noveum.ai team