shashankagar committed on
Commit c9cb337 · verified · 1 Parent(s): c7b6f6b
Files changed (1)
  1. README.md +422 -39
README.md CHANGED
@@ -8,76 +8,459 @@ pinned: false
  app_build_command: npm run build
  app_file: build/index.html
  license: apache-2.0
- short_description: A comprehensive, extensible AI model evaluation framework de
+ short_description: A comprehensive AI model evaluation framework.
  ---
 
- # Getting Started with Create React App

- This project was bootstrapped with [Create React App](https://github.com/facebook/create-react-app).

- ## Available Scripts

- In the project directory, you can run:

- ### `npm start`

- Runs the app in the development mode.\
- Open [http://localhost:3000](http://localhost:3000) to view it in your browser.

- The page will reload when you make changes.\
- You may also see any lint errors in the console.

- ### `npm test`

- Launches the test runner in the interactive watch mode.\
- See the section about [running tests](https://facebook.github.io/create-react-app/docs/running-tests) for more information.

- ### `npm run build`

- Builds the app for production to the `build` folder.\
- It correctly bundles React in production mode and optimizes the build for the best performance.

- The build is minified and the filenames include the hashes.\
- Your app is ready to be deployed!

- See the section about [deployment](https://facebook.github.io/create-react-app/docs/deployment) for more information.

- ### `npm run eject`

- **Note: this is a one-way operation. Once you `eject`, you can't go back!**

- If you aren't satisfied with the build tool and configuration choices, you can `eject` at any time. This command will remove the single build dependency from your project.

- Instead, it will copy all the configuration files and the transitive dependencies (webpack, Babel, ESLint, etc) right into your project so you have full control over them. All of the commands except `eject` will still work, but they will point to the copied scripts so you can tweak them. At this point you're on your own.

- You don't have to ever use `eject`. The curated feature set is suitable for small and middle deployments, and you shouldn't feel obligated to use this feature. However we understand that this tool wouldn't be useful if you couldn't customize it when you are ready for it.

- ## Learn More

- You can learn more in the [Create React App documentation](https://facebook.github.io/create-react-app/docs/getting-started).

- To learn React, check out the [React documentation](https://reactjs.org/).

- ### Code Splitting

- This section has moved here: [https://facebook.github.io/create-react-app/docs/code-splitting](https://facebook.github.io/create-react-app/docs/code-splitting)

- ### Analyzing the Bundle Size

- This section has moved here: [https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size](https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size)

- ### Making a Progressive Web App

- This section has moved here: [https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app](https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app)

- ### Advanced Configuration

- This section has moved here: [https://facebook.github.io/create-react-app/docs/advanced-configuration](https://facebook.github.io/create-react-app/docs/advanced-configuration)

- ### Deployment

- This section has moved here: [https://facebook.github.io/create-react-app/docs/deployment](https://facebook.github.io/create-react-app/docs/deployment)

- ### `npm run build` fails to minify

- This section has moved here: [https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify](https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify)
+ # NovaEval by Noveum.ai

+ [![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
+ [![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
+ [![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
+ [![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

+ A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

+ ## 🚧 Development Status

+ > **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
+ >
+ > NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
+ >
+ > **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

+ ## 🤝 We Need Your Help!

+ NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

+ ### 🎯 High-Priority Contribution Areas

+ We're actively looking for contributors in these key areas:

+ - **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
+ - **📚 Examples**: Create real-world evaluation examples and use cases
+ - **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
+ - **📖 Documentation**: Improve API documentation and user guides
+ - **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
+ - **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

+ ### 🚀 Getting Started as a Contributor

+ 1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
+ 2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
+ 3. **Review Code**: Help review pull requests and provide feedback
+ 4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
+ 5. **Spread the Word**: Star the repository and share with your network

+ ## 🚀 Features

+ - **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
+ - **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
+ - **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
+ - **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
+ - **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
+ - **Secure**: Built-in credential management and secret store integration
+ - **Scalable**: Designed for both local testing and large-scale production evaluations
+ - **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

+ ## 📦 Installation

+ ### From PyPI (Recommended)

+ ```bash
+ pip install novaeval
+ ```

+ ### From Source

+ ```bash
+ git clone https://github.com/Noveum/NovaEval.git
+ cd NovaEval
+ pip install -e .
+ ```

+ ### Docker

+ ```bash
+ docker pull noveum/novaeval:latest
+ ```

+ ## 🏃‍♂️ Quick Start

+ ### Basic Evaluation

+ ```python
+ from novaeval import Evaluator
+ from novaeval.datasets import MMLUDataset
+ from novaeval.models import OpenAIModel
+ from novaeval.scorers import AccuracyScorer

+ # Configure for cost-conscious evaluation
+ MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

+ # Initialize components
+ dataset = MMLUDataset(
+     subset="elementary_mathematics",  # Easier subset for demo
+     num_samples=10,
+     split="test"
+ )

+ model = OpenAIModel(
+     model_name="gpt-4o-mini",  # Cost-effective model
+     temperature=0.0,
+     max_tokens=MAX_TOKENS
+ )

+ scorer = AccuracyScorer(extract_answer=True)

+ # Create and run evaluation
+ evaluator = Evaluator(
+     dataset=dataset,
+     models=[model],
+     scorers=[scorer],
+     output_dir="./results"
+ )

+ results = evaluator.run()

+ # Display detailed results
+ for model_name, model_results in results["model_results"].items():
+     for scorer_name, score_info in model_results["scores"].items():
+         if isinstance(score_info, dict):
+             mean_score = score_info.get("mean", 0)
+             count = score_info.get("count", 0)
+             print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
+ ```

+ ### Configuration-Based Evaluation

+ ```python
+ from novaeval import Evaluator

+ # Load configuration from YAML/JSON
+ evaluator = Evaluator.from_config("evaluation_config.yaml")
+ results = evaluator.run()
+ ```

+ ### Command Line Interface

+ NovaEval provides a comprehensive CLI for running evaluations:

+ ```bash
+ # Run evaluation from configuration file
+ novaeval run config.yaml

+ # Quick evaluation with minimal setup
+ novaeval quick -d mmlu -m gpt-4 -s accuracy

+ # List available datasets, models, and scorers
+ novaeval list-datasets
+ novaeval list-models
+ novaeval list-scorers

+ # Generate sample configuration
+ novaeval generate-config sample-config.yaml
+ ```

+ 📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

+ ### Example Configuration

+ ```yaml
+ # evaluation_config.yaml
+ dataset:
+   type: "mmlu"
+   subset: "abstract_algebra"
+   num_samples: 500

+ models:
+   - type: "openai"
+     model_name: "gpt-4"
+     temperature: 0.0
+   - type: "anthropic"
+     model_name: "claude-3-opus"
+     temperature: 0.0

+ scorers:
+   - type: "accuracy"
+   - type: "semantic_similarity"
+     threshold: 0.8

+ output:
+   directory: "./results"
+   formats: ["json", "csv", "html"]
+   upload_to_s3: true
+   s3_bucket: "my-eval-results"
+ ```
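
+ Before launching a long run, it can help to sanity-check the file. The snippet below is an illustrative helper, not part of NovaEval; it assumes only the layout shown above and that PyYAML (`pip install pyyaml`) is available.

+ ```python
+ # check_config.py - illustrative sanity check for the config above (not a NovaEval API)
+ import yaml

+ with open("evaluation_config.yaml") as f:
+     config = yaml.safe_load(f)

+ # The example layout has four top-level sections
+ for section in ("dataset", "models", "scorers", "output"):
+     assert section in config, f"missing section: {section}"

+ print(f"{len(config['models'])} model(s), "
+       f"{config['dataset']['num_samples']} samples each")
+ ```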

+ ## 🏗️ Architecture

+ NovaEval is built with extensibility and modularity in mind:

+ ```
+ src/novaeval/
+ ├── datasets/      # Dataset loaders and processors
+ ├── evaluators/    # Core evaluation logic
+ ├── integrations/  # External service integrations
+ ├── models/        # Model interfaces and adapters
+ ├── reporting/     # Report generation and visualization
+ ├── scorers/       # Scoring mechanisms and metrics
+ └── utils/         # Utility functions and helpers
+ ```

+ ### Core Components

+ - **Datasets**: Standardized interface for loading evaluation datasets
+ - **Models**: Unified API for different AI model providers
+ - **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
+ - **Evaluators**: Orchestrates the evaluation process
+ - **Reporting**: Generates comprehensive reports and artifacts
+ - **Integrations**: Handles external services (S3, credential stores, etc.)
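
+ To make the division of labor concrete, here is a deliberately simplified sketch of the control flow (conceptual only, not NovaEval's actual internals; the sample keys and `.name` attribute are illustrative):

+ ```python
+ # Conceptual evaluator loop; the real Evaluator adds batching,
+ # error handling, reporting, and integrations on top of this.
+ def run_evaluation(dataset, models, scorers):
+     results = {model.name: [] for model in models}   # assumes a .name attribute
+     for sample in dataset:                           # Datasets yield samples
+         for model in models:                         # Models generate predictions
+             prediction = model.generate(sample["input"])
+             for scorer in scorers:                   # Scorers grade each prediction
+                 score = scorer.score(prediction, sample["expected"])
+                 results[model.name].append(score)
+     return results
+ ```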

+ ## 📊 Supported Datasets

+ - **MMLU**: Massive Multitask Language Understanding
+ - **HuggingFace**: Any dataset from the HuggingFace Hub
+ - **Custom**: JSON, CSV, or programmatic dataset definitions
+ - **Code Evaluation**: Programming benchmarks and code generation tasks
+ - **Agent Traces**: Multi-turn conversation and agent evaluation

+ ## 🤖 Supported Models

+ - **OpenAI**: GPT-3.5, GPT-4, and newer models
+ - **Anthropic**: Claude family models
+ - **AWS Bedrock**: Amazon's managed AI services
+ - **Noveum AI Gateway**: Integration with Noveum's model gateway
+ - **Custom**: Extensible interface for any API-based model

+ ## 📏 Built-in Scorers

+ ### Accuracy-Based
+ - **ExactMatch**: Exact string matching
+ - **Accuracy**: Classification accuracy
+ - **F1Score**: F1 score for classification tasks
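
+ For intuition, F1 is the harmonic mean of precision and recall. A quick pure-Python check of the arithmetic (independent of NovaEval's implementation):

+ ```python
+ # Binary labels: 1 = positive class
+ predictions  = [1, 1, 0, 1, 0, 0]
+ ground_truth = [1, 0, 0, 1, 1, 0]

+ tp = sum(p == g == 1 for p, g in zip(predictions, ground_truth))        # 2
+ fp = sum(p == 1 and g == 0 for p, g in zip(predictions, ground_truth))  # 1
+ fn = sum(p == 0 and g == 1 for p, g in zip(predictions, ground_truth))  # 1

+ precision = tp / (tp + fp)                            # 2/3
+ recall = tp / (tp + fn)                               # 2/3
+ f1 = 2 * precision * recall / (precision + recall)    # 0.667
+ print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
+ ```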

+ ### Semantic-Based
+ - **SemanticSimilarity**: Embedding-based similarity scoring
+ - **BERTScore**: BERT-based semantic evaluation
+ - **RougeScore**: ROUGE metrics for text generation

+ ### Code-Specific
+ - **CodeExecution**: Execute and validate code outputs
+ - **SyntaxChecker**: Validate code syntax
+ - **TestCoverage**: Code coverage analysis

+ ### Custom
+ - **LLMJudge**: Use another LLM as a judge
+ - **HumanEval**: Integration with human evaluation workflows

+ ## 🚀 Deployment

+ ### Local Development

+ ```bash
+ # Install dependencies
+ pip install -e ".[dev]"

+ # Run tests
+ pytest

+ # Run example evaluation
+ python examples/basic_evaluation.py
+ ```

+ ### Docker

+ ```bash
+ # Build image
+ docker build -t nova-eval .

+ # Run evaluation
+ docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
+ ```

+ ### Kubernetes

+ ```bash
+ # Deploy to Kubernetes
+ kubectl apply -f kubernetes/

+ # Check status
+ kubectl get pods -l app=nova-eval
+ ```

+ ## 🔧 Configuration

+ NovaEval supports configuration through:

+ - **YAML/JSON files**: Declarative configuration
+ - **Environment variables**: Runtime configuration
+ - **Python code**: Programmatic configuration
+ - **CLI arguments**: Command-line overrides

+ ### Environment Variables

+ ```bash
+ export NOVA_EVAL_OUTPUT_DIR="./results"
+ export NOVA_EVAL_LOG_LEVEL="INFO"
+ export OPENAI_API_KEY="your-api-key"
+ export AWS_ACCESS_KEY_ID="your-aws-key"
+ ```
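
+ These variables can be read with standard `os.environ` lookups; the snippet below is a generic sketch of that pattern, not a description of how NovaEval resolves them internally:

+ ```python
+ import os

+ # Fall back to defaults when a variable is unset (names as exported above)
+ output_dir = os.environ.get("NOVA_EVAL_OUTPUT_DIR", "./results")
+ log_level = os.environ.get("NOVA_EVAL_LOG_LEVEL", "INFO")

+ # Fail fast on required credentials instead of failing mid-evaluation
+ api_key = os.environ["OPENAI_API_KEY"]
+ ```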

+ ### CI/CD Integration

+ NovaEval includes optimized GitHub Actions workflows:

+ - **Unit tests** run on all PRs and pushes for quick feedback
+ - **Integration tests** run on the main branch only to minimize API costs
+ - **Cross-platform testing** on macOS, Linux, and Windows

+ ## 📈 Reporting and Artifacts

+ NovaEval generates comprehensive evaluation reports:

+ - **Summary Reports**: High-level metrics and insights
+ - **Detailed Results**: Per-sample predictions and scores
+ - **Visualizations**: Charts and graphs for result analysis
+ - **Artifacts**: Model outputs, intermediate results, and debug information
+ - **Export Formats**: JSON, CSV, HTML, PDF

+ ### Example Report Structure

+ ```
+ results/
+ ├── summary.json              # High-level metrics
+ ├── detailed_results.csv      # Per-sample results
+ ├── artifacts/
+ │   ├── model_outputs/        # Raw model responses
+ │   ├── intermediate/         # Processing artifacts
+ │   └── debug/                # Debug information
+ ├── visualizations/
+ │   ├── accuracy_by_category.png
+ │   ├── score_distribution.png
+ │   └── confusion_matrix.png
+ └── report.html               # Interactive HTML report
+ ```
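
+ Because the report directory is plain JSON and CSV, results are easy to post-process. A minimal sketch, assuming only the layout above (the field names inside the files vary by evaluation):

+ ```python
+ import csv
+ import json
+ from pathlib import Path

+ results_dir = Path("results")

+ # High-level metrics
+ summary = json.loads((results_dir / "summary.json").read_text())
+ print(summary)

+ # Per-sample rows as dictionaries
+ with open(results_dir / "detailed_results.csv", newline="") as f:
+     rows = list(csv.DictReader(f))
+ print(f"{len(rows)} per-sample results")
+ ```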

+ ## 🔌 Extending NovaEval

+ ### Custom Datasets

+ ```python
+ from novaeval.datasets import BaseDataset

+ class MyCustomDataset(BaseDataset):
+     def load_data(self):
+         # Implement data loading logic
+         return samples

+     def get_sample(self, index):
+         # Return individual sample
+         return sample
+ ```

+ ### Custom Scorers

+ ```python
+ from novaeval.scorers import BaseScorer

+ class MyCustomScorer(BaseScorer):
+     def score(self, prediction, ground_truth, context=None):
+         # Implement scoring logic
+         return score
+ ```
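
+ To make the stub concrete, here is a complete if naive scorer built on the same `score` signature shown above. The keyword-overlap metric itself is our illustration, not a built-in:

+ ```python
+ from novaeval.scorers import BaseScorer

+ class KeywordOverlapScorer(BaseScorer):
+     """Fraction of ground-truth words that also appear in the prediction."""

+     def score(self, prediction, ground_truth, context=None):
+         truth_words = set(ground_truth.lower().split())
+         if not truth_words:
+             return 0.0
+         pred_words = set(prediction.lower().split())
+         return len(truth_words & pred_words) / len(truth_words)
+ ```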

+ ### Custom Models

+ ```python
+ from novaeval.models import BaseModel

+ class MyCustomModel(BaseModel):
+     def generate(self, prompt, **kwargs):
+         # Implement model inference
+         return response
+ ```
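
+ The same pattern supports offline testing: a model that returns canned responses lets you exercise datasets and scorers without spending API credits. Only the `generate` signature comes from the stub above; the constructor arguments are our own illustration:

+ ```python
+ from novaeval.models import BaseModel

+ class CannedModel(BaseModel):
+     """Returns pre-recorded responses; handy for offline pipeline tests."""

+     def __init__(self, responses, **kwargs):
+         super().__init__(**kwargs)   # assumes BaseModel accepts keyword args
+         self._responses = responses  # maps prompt -> canned reply

+     def generate(self, prompt, **kwargs):
+         return self._responses.get(prompt, "I don't know.")
+ ```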

+ ## 🤝 Contributing

+ We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

+ ### 🎯 Priority Contribution Areas

+ As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

+ 1. **Unit Tests** - Expand test coverage beyond the current 23%
+ 2. **Examples** - Real-world evaluation scenarios and use cases
+ 3. **Guides & Notebooks** - Interactive evaluation tutorials
+ 4. **Documentation** - API docs, user guides, and tutorials
+ 5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
+ 6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

+ ### Development Setup

+ ```bash
+ # Clone repository
+ git clone https://github.com/Noveum/NovaEval.git
+ cd NovaEval

+ # Create virtual environment
+ python -m venv venv
+ source venv/bin/activate  # On Windows: venv\Scripts\activate

+ # Install development dependencies
+ pip install -e ".[dev]"

+ # Install pre-commit hooks
+ pre-commit install

+ # Run tests
+ pytest

+ # Run with coverage
+ pytest --cov=src/novaeval --cov-report=html
+ ```

+ ### 🏗️ Contribution Workflow

+ 1. **Fork** the repository
+ 2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
+ 3. **Make** your changes following our coding standards
+ 4. **Add** tests for your changes
+ 5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
+ 6. **Push** to the branch (`git push origin feature/amazing-feature`)
+ 7. **Open** a Pull Request

+ ### 📋 Contribution Guidelines

+ - **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
+ - **Testing**: Add unit tests for new features and bug fixes
+ - **Documentation**: Update documentation for API changes
+ - **Commit Messages**: Use conventional commit format
+ - **Issues**: Reference relevant issues in your PR description

+ ### 🎉 Recognition

+ Contributors will be:

+ - Listed on our contributors page
+ - Mentioned in release notes for significant contributions
+ - Invited to join our contributor Discord community

+ ## 📄 License

+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

+ ## 🙏 Acknowledgments

+ - Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
+ - Built with modern Python best practices and industry standards
+ - Designed for the AI evaluation community

+ ## 📞 Support

+ - **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
+ - **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
+ - **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
+ - **Email**: [email protected]

+ ---

+ Made with ❤️ by the Noveum.ai team