NovaEval by Noveum.ai
Evaluate AI models with NovaEval
A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY
NovaEval is currently in active development and not recommended for production use. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
We're looking for contributors! See the Contributing section below for ways to help.
NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:
We're actively looking for contributors in several key areas. Issues labeled good first issue or help wanted are a great place to start.
# Install from PyPI
pip install novaeval

# Or install from source
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .

# Or pull the Docker image
docker pull noveum/novaeval:latest
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
from novaeval import Evaluator
# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
NovaEval provides a comprehensive CLI for running evaluations:
# Run evaluation from configuration file
novaeval run config.yaml
# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy
# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers
# Generate sample configuration
novaeval generate-config sample-config.yaml
Complete CLI Reference - Detailed documentation for all CLI commands and options
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
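With a configuration file like this in place, the same evaluation can be launched from the command line using the `novaeval run` command shown above:

# Run the evaluation described in evaluation_config.yaml
novaeval run evaluation_config.yaml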
NovaEval is built with extensibility and modularity in mind:
src/novaeval/
├── datasets/      # Dataset loaders and processors
├── evaluators/    # Core evaluation logic
├── integrations/  # External service integrations
├── models/        # Model interfaces and adapters
├── reporting/     # Report generation and visualization
├── scorers/       # Scoring mechanisms and metrics
└── utils/         # Utility functions and helpers
# Install dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run example evaluation
python examples/basic_evaluation.py
# Build image
docker build -t nova-eval .
# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
# Deploy to Kubernetes
kubectl apply -f kubernetes/
# Check status
kubectl get pods -l app=nova-eval
NovaEval supports configuration through YAML/JSON configuration files and environment variables:
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
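These variables are read from the process environment at runtime. As a small illustration (standard library only, not a NovaEval API), you can verify the required keys are set before launching an evaluation:

import os

# Hypothetical pre-flight check: fail fast if required settings are missing
required = ["OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

# Fall back to a default output directory if none is configured
output_dir = os.environ.get("NOVA_EVAL_OUTPUT_DIR", "./results")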
NovaEval includes optimized GitHub Actions workflows:
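As an illustrative sketch only (not the project's actual workflow files), a minimal CI job along these lines would install the development dependencies and run the test suite:

# .github/workflows/ci.yml (illustrative sketch)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      - run: pytest --cov=src/novaeval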
NovaEval generates comprehensive evaluation reports:
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Implement data loading logic
        return samples

    def get_sample(self, index):
        # Return individual sample
        return sample
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic
        return score
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implement model inference
        return response
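Custom components plug into the same Evaluator interface used in the quick start. A minimal sketch, assuming the subclasses above are instantiable without extra constructor arguments:

from novaeval import Evaluator
from novaeval.models import OpenAIModel

# Mix custom and built-in components in a single evaluation run
evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[OpenAIModel(model_name="gpt-4o-mini", temperature=0.0, max_tokens=100)],
    scorers=[MyCustomScorer()],
    output_dir="./results"
)
results = evaluator.run()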
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our Contributing Guide for detailed guidelines.
As mentioned in the We Need Your Help section above, we're particularly looking for help in the areas listed there.
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run with coverage
pytest --cov=src/novaeval --cov-report=html
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature

Contributors will be:
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Made with ❤️ by the Noveum.ai team