# NovaEval by Noveum.ai
[CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[Coverage](https://codecov.io/gh/Noveum/NovaEval)
[PyPI](https://badge.fury.io/py/novaeval)
[Python](https://www.python.org/downloads/)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
## 🚧 Development Status
> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.
## 🤝 We Need Your Help!
NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:
### 🎯 High-Priority Contribution Areas
We're actively looking for contributors in these key areas:
- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📖 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📝 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations
### 🚀 Getting Started as a Contributor
1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network
## 🚀 Features
- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD
## 📦 Installation
### From PyPI (Recommended)
```bash
pip install novaeval
```
### From Source
```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```
### Docker
```bash
docker pull noveum/novaeval:latest
```
## 🏃‍♂️ Quick Start
### Basic Evaluation
```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```
### Configuration-Based Evaluation
```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```
### Command Line Interface
NovaEval provides a comprehensive CLI for running evaluations:
```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```
📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options
### Example Configuration
```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```
## 🏗️ Architecture
NovaEval is built with extensibility and modularity in mind:
```
src/novaeval/
├── datasets/       # Dataset loaders and processors
├── evaluators/     # Core evaluation logic
├── integrations/   # External service integrations
├── models/         # Model interfaces and adapters
├── reporting/      # Report generation and visualization
├── scorers/        # Scoring mechanisms and metrics
└── utils/          # Utility functions and helpers
```
### Core Components
- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrates the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handles external services (S3, credential stores, etc.)
## 📊 Supported Datasets
- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
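For a custom dataset, the simplest route is to subclass `BaseDataset` (see Extending NovaEval below). The sketch here keeps a few samples in memory; the per-sample field names (`input`, `expected`) are illustrative assumptions rather than a documented schema:

```python
from novaeval.datasets import BaseDataset

class TinyQADataset(BaseDataset):
    """Minimal in-memory dataset sketch; sample field names are illustrative only."""

    SAMPLES = [
        {"input": "What is 2 + 2?", "expected": "4"},
        {"input": "What is the capital of France?", "expected": "Paris"},
    ]

    def load_data(self):
        # Return all samples for the evaluation run
        return self.SAMPLES

    def get_sample(self, index):
        # Return a single sample by position
        return self.SAMPLES[index]
```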
## 🤖 Supported Models
- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model
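Because the `Evaluator` accepts a list of models, several providers can be compared in one run. The sketch below reuses the Quick Start components; `AnthropicModel` is an assumed class name (the YAML config above uses the `anthropic` type), so check the API reference for the exact import:

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel, AnthropicModel  # AnthropicModel import path is assumed
from novaeval.scorers import AccuracyScorer

models = [
    OpenAIModel(model_name="gpt-4o-mini", temperature=0.0, max_tokens=100),
    AnthropicModel(model_name="claude-3-opus", temperature=0.0, max_tokens=100),  # assumed class name
]

evaluator = Evaluator(
    dataset=MMLUDataset(subset="elementary_mathematics", num_samples=10, split="test"),
    models=models,
    scorers=[AccuracyScorer(extract_answer=True)],
    output_dir="./results",
)
results = evaluator.run()
```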
## 📏 Built-in Scorers
### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks
### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation
### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis
### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows
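Scorers are passed to the `Evaluator` as a list, so several metrics can be computed in a single pass. In this sketch only `AccuracyScorer` is taken from the Quick Start; `ExactMatchScorer` is an assumed class name based on the ExactMatch scorer listed above, so verify the import against the API docs:

```python
from novaeval.scorers import AccuracyScorer, ExactMatchScorer  # ExactMatchScorer name is assumed

scorers = [
    AccuracyScorer(extract_answer=True),  # classification accuracy with answer extraction
    ExactMatchScorer(),                   # strict string matching (constructor assumed)
]

# Each scorer contributes its own entry under results["model_results"][<model>]["scores"].
```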
## 🚀 Deployment
### Local Development
```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```
### Docker
```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```
### Kubernetes
```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```
## 🔧 Configuration
NovaEval supports configuration through:
- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides
### Environment Variables
```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```
### CI/CD Integration
NovaEval includes optimized GitHub Actions workflows:
- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows
## 📈 Reporting and Artifacts
NovaEval generates comprehensive evaluation reports:
- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF
### Example Report Structure
```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```
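Once a run finishes, the generated files can be inspected with standard tooling. This sketch assumes only the file names shown above and uses the Python standard library:

```python
import csv
import json
from pathlib import Path

results_dir = Path("./results")

# High-level metrics written by the evaluator
summary = json.loads((results_dir / "summary.json").read_text())
print(summary)

# Per-sample rows, useful for error analysis
with open(results_dir / "detailed_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(f"{len(rows)} evaluated samples")
```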
## 🔌 Extending NovaEval
### Custom Datasets
```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Implement data loading logic
        return samples

    def get_sample(self, index):
        # Return individual sample
        return sample
```
### Custom Scorers
```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic
        return score
```
### Custom Models
```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implement model inference
        return response
```
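Custom components plug into the same `Evaluator` entry point as the built-ins. A minimal sketch using the classes defined above; their no-argument constructors are assumed here purely for illustration:

```python
from novaeval import Evaluator

evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```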
## 🤝 Contributing
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.
### 🎯 Priority Contribution Areas
As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:
1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations
### Development Setup
```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```
### 🏗️ Contribution Workflow
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request
### 📋 Contribution Guidelines
- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description
### 🏆 Recognition
Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community
## 📞 Support
- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
- **Email**: [email protected]
---
Made with ❤️ by the Noveum.ai team