# NovaEval by Noveum.ai

[CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[Coverage](https://codecov.io/gh/Noveum/NovaEval)
[PyPI](https://badge.fury.io/py/novaeval)
[Python](https://www.python.org/downloads/)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
## Development Status

> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#contributing) section below for ways to help.
## We Need Your Help!

NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

### High-Priority Contribution Areas

We're actively looking for contributors in these key areas:

- **Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **Examples**: Create real-world evaluation examples and use cases
- **Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **Documentation**: Improve API documentation and user guides
- **RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

### Getting Started as a Contributor

1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network
## Features

- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD
## Installation

### From PyPI (Recommended)

```bash
pip install novaeval
```

### From Source

```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```

### Docker

```bash
docker pull noveum/novaeval:latest
```
## Quick Start

### Basic Evaluation

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```
### Configuration-Based Evaluation

```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```
### Command Line Interface

NovaEval provides a comprehensive CLI for running evaluations:

```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```

**[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options
### Example Configuration

```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```
## Architecture

NovaEval is built with extensibility and modularity in mind:

```
src/novaeval/
├── datasets/        # Dataset loaders and processors
├── evaluators/      # Core evaluation logic
├── integrations/    # External service integrations
├── models/          # Model interfaces and adapters
├── reporting/       # Report generation and visualization
├── scorers/         # Scoring mechanisms and metrics
└── utils/           # Utility functions and helpers
```

### Core Components

- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrate the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handles external services (S3, credential stores, etc.)
## Supported Datasets

- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
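
Dataset loaders follow the constructor pattern shown in the Quick Start. The sketch below uses `MMLUDataset`, which appears above; the HuggingFace loader name and its parameters are illustrative assumptions, not confirmed API:

```python
from novaeval.datasets import MMLUDataset
# from novaeval.datasets import HuggingFaceDataset  # hypothetical class name, for illustration only

# MMLU loader, as in the Quick Start, pointed at a different subset
mmlu = MMLUDataset(subset="abstract_algebra", num_samples=500, split="test")

# A HuggingFace Hub dataset would presumably follow a similar pattern (assumed):
# squad = HuggingFaceDataset(dataset_name="squad", split="validation", num_samples=100)
```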
## Supported Models

- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model
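
Model wrappers share a common constructor style. `OpenAIModel` is confirmed in the Quick Start; the Anthropic class name below is an assumption shown only for illustration:

```python
from novaeval.models import OpenAIModel
# from novaeval.models import AnthropicModel  # hypothetical class name, for illustration only

gpt4 = OpenAIModel(model_name="gpt-4", temperature=0.0, max_tokens=100)

# An Anthropic model would likely be configured the same way (assumed interface):
# claude = AnthropicModel(model_name="claude-3-opus", temperature=0.0, max_tokens=100)
```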
## Built-in Scorers

### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks

### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation

### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis

### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows
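
Multiple scorers can be combined in a single run. `AccuracyScorer` is confirmed in the Quick Start; the semantic-similarity class name and `threshold` argument below mirror the YAML example but are assumptions at the Python level:

```python
from novaeval.scorers import AccuracyScorer
# from novaeval.scorers import SemanticSimilarityScorer  # hypothetical class name, for illustration only

scorers = [
    AccuracyScorer(extract_answer=True),
    # SemanticSimilarityScorer(threshold=0.8),  # assumed Python analogue of the YAML "semantic_similarity" scorer
]
# Pass the list to Evaluator(..., scorers=scorers) exactly as in the Quick Start.
```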
## Deployment

### Local Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```

### Docker

```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```
## Configuration

NovaEval supports configuration through:

- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides

### Environment Variables

```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```
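
As a rough sketch (not NovaEval's internal configuration loader), the same variables can be read at runtime with the standard library:

```python
import os

# Illustrative only; NovaEval's own loader may resolve these differently.
output_dir = os.environ.get("NOVA_EVAL_OUTPUT_DIR", "./results")
log_level = os.environ.get("NOVA_EVAL_LOG_LEVEL", "INFO")

# Provider credentials such as OPENAI_API_KEY and AWS_ACCESS_KEY_ID are
# typically picked up directly by the respective client libraries.
print(f"Writing results to {output_dir} at log level {log_level}")
```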
### CI/CD Integration

NovaEval includes optimized GitHub Actions workflows:

- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows
## Reporting and Artifacts

NovaEval generates comprehensive evaluation reports:

- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF

### Example Report Structure

```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```
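
The generated files can be inspected with standard tooling. A minimal sketch, assuming the layout above; the internal schema of `summary.json` is not specified here:

```python
import json
from pathlib import Path

results_dir = Path("./results")

# Load the high-level summary; keys depend on the NovaEval version and config.
summary = json.loads((results_dir / "summary.json").read_text())
print(json.dumps(summary, indent=2))

# detailed_results.csv can be loaded with csv or pandas for per-sample analysis.
```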
## Extending NovaEval

### Custom Datasets

```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Implement data loading logic and return a list of samples
        samples = [
            {"input": "What is 2 + 2?", "expected": "4"},  # illustrative sample fields
        ]
        return samples

    def get_sample(self, index):
        # Return an individual sample by index
        sample = self.load_data()[index]
        return sample
```
### Custom Scorers

```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic and return a numeric score,
        # e.g. a simple exact-match check:
        score = float(prediction.strip() == ground_truth.strip())
        return score
```
### Custom Models

```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implement model inference (e.g. call your provider's API here)
        # and return the generated text.
        response = f"Echo: {prompt}"  # placeholder implementation
        return response
```
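
Custom components plug into the same `Evaluator` shown in the Quick Start. A minimal sketch using the classes defined above, assuming they can be instantiated without additional constructor arguments:

```python
from novaeval import Evaluator

# Wire the custom pieces together exactly like the built-in ones.
evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```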
## Contributing

We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

### Priority Contribution Areas

As mentioned in the [We Need Your Help](#we-need-your-help) section, we're particularly looking for help with:

1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations
### Development Setup

```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```
### Contribution Workflow

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### Contribution Guidelines

- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description

### Recognition

Contributors will be:

- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community

## Support

- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
- **Email**: [email protected]

---

Made with ❤️ by the Noveum.ai team