shashankagar committed on
Commit 70eb27e · verified · 1 parent: 4f7f8d4
Files changed (1): README.md (+452 -8)
README.md CHANGED
@@ -1,10 +1,454 @@
- ---
- title: README
- emoji: 🌖
- colorFrom: gray
- colorTo: red
- sdk: static
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.

# NovaEval by Noveum.ai

[![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
[![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

## 🚧 Development Status

> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

## 🤝 We Need Your Help!

NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

### 🎯 High-Priority Contribution Areas

We're actively looking for contributors in these key areas:

- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📖 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

### 🚀 Getting Started as a Contributor

1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network

## 🚀 Features

- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Deployment-Oriented**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install novaeval
```

### From Source

```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```

### Docker

```bash
docker pull noveum/novaeval:latest
```

## 🏃‍♂️ Quick Start

### Basic Evaluation

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```

### Configuration-Based Evaluation

```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```

### Command Line Interface

NovaEval provides a comprehensive CLI for running evaluations:

```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```

📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

### Example Configuration

```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```

## 🏗️ Architecture

NovaEval is built with extensibility and modularity in mind:

```
src/novaeval/
├── datasets/      # Dataset loaders and processors
├── evaluators/    # Core evaluation logic
├── integrations/  # External service integrations
├── models/        # Model interfaces and adapters
├── reporting/     # Report generation and visualization
├── scorers/       # Scoring mechanisms and metrics
└── utils/         # Utility functions and helpers
```

### Core Components

- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrates the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handles external services (S3, credential stores, etc.)

## 📊 Supported Datasets

- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub (see the sketch after this list)
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
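
A minimal sketch of what Hub-based loading could look like. The `HuggingFaceDataset` class name and its parameters are assumptions for illustration (by analogy with `MMLUDataset`), so check `novaeval.datasets` for the actual loader and signature:

```python
# Hypothetical sketch: the class name and parameters below are assumed,
# by analogy with MMLUDataset, and are not confirmed by this README.
from novaeval.datasets import HuggingFaceDataset  # assumed import

dataset = HuggingFaceDataset(
    dataset_name="squad",  # any dataset id from the HuggingFace Hub
    split="validation",
    num_samples=100,
)
```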

## 🤖 Supported Models

- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model (see the comparison sketch after this list)
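
The Quick Start evaluates a single model, but `Evaluator` takes a list of models, which is how side-by-side provider comparisons run. A sketch follows; the `AnthropicModel` class name is an assumption by analogy with `OpenAIModel` (only the `"anthropic"` config type appears in this README):

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel, AnthropicModel  # AnthropicModel name assumed
from novaeval.scorers import AccuracyScorer

# Evaluate two providers on the same samples for a direct comparison.
evaluator = Evaluator(
    dataset=MMLUDataset(subset="elementary_mathematics", num_samples=10, split="test"),
    models=[
        OpenAIModel(model_name="gpt-4o-mini", temperature=0.0),
        AnthropicModel(model_name="claude-3-opus", temperature=0.0),  # assumed class
    ],
    scorers=[AccuracyScorer(extract_answer=True)],
    output_dir="./results",
)
results = evaluator.run()
```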

## 📝 Built-in Scorers

### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks

### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation

### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis

### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows (see the scoring sketch after this list)
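
Scorers can also be exercised on their own, which is useful for sanity-checking a metric before a full run. A minimal sketch using the `score(prediction, ground_truth, context=None)` interface shown in the Extending section below; the exact return value of each built-in scorer is not specified here:

```python
from novaeval.scorers import AccuracyScorer

scorer = AccuracyScorer(extract_answer=True)

# Score one prediction against its reference answer.
result = scorer.score(
    prediction="The correct answer is B.",
    ground_truth="B",
)
print(result)
```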

## 🚀 Deployment

### Local Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```

### Docker

```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```

## 🔧 Configuration

NovaEval supports configuration through:

- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides

### Environment Variables

```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```

### CI/CD Integration

NovaEval includes optimized GitHub Actions workflows:

- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on the main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows

## 📈 Reporting and Artifacts

NovaEval generates comprehensive evaluation reports:

- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF

### Example Report Structure

```
results/
├── summary.json           # High-level metrics
├── detailed_results.csv   # Per-sample results
├── artifacts/
│   ├── model_outputs/     # Raw model responses
│   ├── intermediate/      # Processing artifacts
│   └── debug/             # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html            # Interactive HTML report
```
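
Because results land on disk as plain files, downstream tooling can consume them directly. A small sketch that reads the summary after a run, assuming the layout above (the keys inside `summary.json` are not documented here):

```python
import json
from pathlib import Path

# Load the high-level metrics written by an evaluation run.
summary = json.loads(Path("results/summary.json").read_text())
print(json.dumps(summary, indent=2))
```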

## 🔌 Extending NovaEval

### Custom Datasets

```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Load or build the full list of samples
        # (illustrative shape: input text plus expected answer).
        return [
            {"input": "What is 2 + 2?", "expected": "4"},
            {"input": "What is 3 * 3?", "expected": "9"},
        ]

    def get_sample(self, index):
        # Return a single sample by index.
        return self.load_data()[index]
```

### Custom Scorers

```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Return a numeric score; here, a case-insensitive exact match.
        return float(prediction.strip().lower() == ground_truth.strip().lower())
```

### Custom Models

```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Call your provider's API here and return the generated text.
        # A fixed echo keeps this skeleton self-contained and runnable.
        return f"Echo: {prompt}"
```
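
The custom classes above should slot into the same `Evaluator` flow as the built-ins. A sketch of the wiring, assuming the subclasses defined above and that the base constructors require no additional arguments (not verified here):

```python
from novaeval import Evaluator

# Combine the custom dataset, model, and scorer defined above.
evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```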

## 🤝 Contributing

We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

### 🎯 Priority Contribution Areas

As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

### Development Setup

```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```

### 🏗️ Contribution Workflow

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### 📋 Contribution Guidelines

- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description

### 🎉 Recognition

Contributors will be:

- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
435
+ ## πŸ“„ License
436
+
437
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
438
+
439
+ ## πŸ™ Acknowledgments
440
+
441
+ - Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
442
+ - Built with modern Python best practices and industry standards
443
+ - Designed for the AI evaluation community
444
+
445
+ ## πŸ“ž Support
446
+
447
+ - **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
448
+ - **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
449
+ - **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
450
+ - **Email**: [email protected]
451
+
452
  ---
453
 
454
+ Made with ❀️ by the Noveum.ai team