shashankagar committed on
Commit c9cb337 · verified · 1 Parent(s): c7b6f6b
Files changed (1)
  1. README.md +422 -39
README.md CHANGED
@@ -8,76 +8,459 @@ pinned: false
  app_build_command: npm run build
  app_file: build/index.html
  license: apache-2.0
- short_description: A comprehensive, extensible AI model evaluation framework de
+ short_description: A comprehensive AI model evaluation framework.
  ---
 
- # Getting Started with Create React App

- This project was bootstrapped with [Create React App](https://github.com/facebook/create-react-app).

- ## Available Scripts

- In the project directory, you can run:

- ### `npm start`

- Runs the app in the development mode.\
- Open [http://localhost:3000](http://localhost:3000) to view it in your browser.

- The page will reload when you make changes.\
- You may also see any lint errors in the console.

- ### `npm test`

- Launches the test runner in the interactive watch mode.\
- See the section about [running tests](https://facebook.github.io/create-react-app/docs/running-tests) for more information.

- ### `npm run build`

- Builds the app for production to the `build` folder.\
- It correctly bundles React in production mode and optimizes the build for the best performance.

- The build is minified and the filenames include the hashes.\
- Your app is ready to be deployed!

- See the section about [deployment](https://facebook.github.io/create-react-app/docs/deployment) for more information.

- ### `npm run eject`

- **Note: this is a one-way operation. Once you `eject`, you can't go back!**

- If you aren't satisfied with the build tool and configuration choices, you can `eject` at any time. This command will remove the single build dependency from your project.

- Instead, it will copy all the configuration files and the transitive dependencies (webpack, Babel, ESLint, etc) right into your project so you have full control over them. All of the commands except `eject` will still work, but they will point to the copied scripts so you can tweak them. At this point you're on your own.

- You don't have to ever use `eject`. The curated feature set is suitable for small and middle deployments, and you shouldn't feel obligated to use this feature. However we understand that this tool wouldn't be useful if you couldn't customize it when you are ready for it.

- ## Learn More

- You can learn more in the [Create React App documentation](https://facebook.github.io/create-react-app/docs/getting-started).

- To learn React, check out the [React documentation](https://reactjs.org/).

- ### Code Splitting

- This section has moved here: [https://facebook.github.io/create-react-app/docs/code-splitting](https://facebook.github.io/create-react-app/docs/code-splitting)

- ### Analyzing the Bundle Size

- This section has moved here: [https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size](https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size)

- ### Making a Progressive Web App

- This section has moved here: [https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app](https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app)

- ### Advanced Configuration

- This section has moved here: [https://facebook.github.io/create-react-app/docs/advanced-configuration](https://facebook.github.io/create-react-app/docs/advanced-configuration)

- ### Deployment

- This section has moved here: [https://facebook.github.io/create-react-app/docs/deployment](https://facebook.github.io/create-react-app/docs/deployment)

- ### `npm run build` fails to minify

- This section has moved here: [https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify](https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify)
+ # NovaEval by Noveum.ai

+ [![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
+ [![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
+ [![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
+ [![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

+ A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

+ ## 🚧 Development Status

+ > **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
+ >
+ > NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
+ >
+ > **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

+ ## 🤝 We Need Your Help!

+ NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

+ ### 🎯 High-Priority Contribution Areas

+ We're actively looking for contributors in these key areas:

+ - **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
+ - **📚 Examples**: Create real-world evaluation examples and use cases
+ - **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
+ - **📖 Documentation**: Improve API documentation and user guides
+ - **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
+ - **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

+ ### 🚀 Getting Started as a Contributor

+ 1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
+ 2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
+ 3. **Review Code**: Help review pull requests and provide feedback
+ 4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
+ 5. **Spread the Word**: Star the repository and share with your network

+ ## 🚀 Features

+ - **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
+ - **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
+ - **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
+ - **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
+ - **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
+ - **Secure**: Built-in credential management and secret store integration
+ - **Scalable**: Designed for both local testing and large-scale production evaluations
+ - **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

+ ## 📦 Installation

+ ### From PyPI (Recommended)

+ ```bash
+ pip install novaeval
+ ```

+ ### From Source

+ ```bash
+ git clone https://github.com/Noveum/NovaEval.git
+ cd NovaEval
+ pip install -e .
+ ```

+ ### Docker

+ ```bash
+ docker pull noveum/novaeval:latest
+ ```

+ ## 🏃‍♂️ Quick Start

+ ### Basic Evaluation

+ ```python
+ from novaeval import Evaluator
+ from novaeval.datasets import MMLUDataset
+ from novaeval.models import OpenAIModel
+ from novaeval.scorers import AccuracyScorer

+ # Configure for cost-conscious evaluation
+ MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

+ # Initialize components
+ dataset = MMLUDataset(
+     subset="elementary_mathematics",  # Easier subset for demo
+     num_samples=10,
+     split="test"
+ )

+ model = OpenAIModel(
+     model_name="gpt-4o-mini",  # Cost-effective model
+     temperature=0.0,
+     max_tokens=MAX_TOKENS
+ )

+ scorer = AccuracyScorer(extract_answer=True)

+ # Create and run evaluation
+ evaluator = Evaluator(
+     dataset=dataset,
+     models=[model],
+     scorers=[scorer],
+     output_dir="./results"
+ )

+ results = evaluator.run()

+ # Display detailed results
+ for model_name, model_results in results["model_results"].items():
+     for scorer_name, score_info in model_results["scores"].items():
+         if isinstance(score_info, dict):
+             mean_score = score_info.get("mean", 0)
+             count = score_info.get("count", 0)
+             print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
+ ```

+ ### Configuration-Based Evaluation

+ ```python
+ from novaeval import Evaluator

+ # Load configuration from YAML/JSON
+ evaluator = Evaluator.from_config("evaluation_config.yaml")
+ results = evaluator.run()
+ ```

+ ### Command Line Interface

+ NovaEval provides a comprehensive CLI for running evaluations:

+ ```bash
+ # Run evaluation from configuration file
+ novaeval run config.yaml

+ # Quick evaluation with minimal setup
+ novaeval quick -d mmlu -m gpt-4 -s accuracy

+ # List available datasets, models, and scorers
+ novaeval list-datasets
+ novaeval list-models
+ novaeval list-scorers

+ # Generate sample configuration
+ novaeval generate-config sample-config.yaml
+ ```

+ 📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

+ ### Example Configuration

+ ```yaml
+ # evaluation_config.yaml
+ dataset:
+   type: "mmlu"
+   subset: "abstract_algebra"
+   num_samples: 500

+ models:
+   - type: "openai"
+     model_name: "gpt-4"
+     temperature: 0.0
+   - type: "anthropic"
+     model_name: "claude-3-opus"
+     temperature: 0.0

+ scorers:
+   - type: "accuracy"
+   - type: "semantic_similarity"
+     threshold: 0.8

+ output:
+   directory: "./results"
+   formats: ["json", "csv", "html"]
+   upload_to_s3: true
+   s3_bucket: "my-eval-results"
+ ```
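
+ Before launching a long run, it can help to sanity-check the file. The snippet below is an illustrative helper, not part of NovaEval; it assumes only the layout shown above and that PyYAML (`pip install pyyaml`) is available.

+ ```python
+ # check_config.py - illustrative sanity check for the config above (not a NovaEval API)
+ import yaml

+ with open("evaluation_config.yaml") as f:
+     config = yaml.safe_load(f)

+ # The example layout has four top-level sections
+ for section in ("dataset", "models", "scorers", "output"):
+     assert section in config, f"missing section: {section}"

+ print(f"{len(config['models'])} model(s), "
+       f"{config['dataset']['num_samples']} samples each")
+ ```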

+ ## 🏗️ Architecture

+ NovaEval is built with extensibility and modularity in mind:

+ ```
+ src/novaeval/
+ ├── datasets/      # Dataset loaders and processors
+ ├── evaluators/    # Core evaluation logic
+ ├── integrations/  # External service integrations
+ ├── models/        # Model interfaces and adapters
+ ├── reporting/     # Report generation and visualization
+ ├── scorers/       # Scoring mechanisms and metrics
+ └── utils/         # Utility functions and helpers
+ ```

+ ### Core Components

+ - **Datasets**: Standardized interface for loading evaluation datasets
+ - **Models**: Unified API for different AI model providers
+ - **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
+ - **Evaluators**: Orchestrates the evaluation process
+ - **Reporting**: Generates comprehensive reports and artifacts
+ - **Integrations**: Handles external services (S3, credential stores, etc.)
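
+ To make the division of labor concrete, here is a deliberately simplified sketch of the control flow (conceptual only, not NovaEval's actual internals; the sample keys and `.name` attribute are illustrative):

+ ```python
+ # Conceptual evaluator loop; the real Evaluator adds batching,
+ # error handling, reporting, and integrations on top of this.
+ def run_evaluation(dataset, models, scorers):
+     results = {model.name: [] for model in models}   # assumes a .name attribute
+     for sample in dataset:                           # Datasets yield samples
+         for model in models:                         # Models generate predictions
+             prediction = model.generate(sample["input"])
+             for scorer in scorers:                   # Scorers grade each prediction
+                 score = scorer.score(prediction, sample["expected"])
+                 results[model.name].append(score)
+     return results
+ ```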

+ ## 📊 Supported Datasets

+ - **MMLU**: Massive Multitask Language Understanding
+ - **HuggingFace**: Any dataset from the HuggingFace Hub
+ - **Custom**: JSON, CSV, or programmatic dataset definitions
+ - **Code Evaluation**: Programming benchmarks and code generation tasks
+ - **Agent Traces**: Multi-turn conversation and agent evaluation

+ ## 🤖 Supported Models

+ - **OpenAI**: GPT-3.5, GPT-4, and newer models
+ - **Anthropic**: Claude family models
+ - **AWS Bedrock**: Amazon's managed AI services
+ - **Noveum AI Gateway**: Integration with Noveum's model gateway
+ - **Custom**: Extensible interface for any API-based model

+ ## 📏 Built-in Scorers

+ ### Accuracy-Based
+ - **ExactMatch**: Exact string matching
+ - **Accuracy**: Classification accuracy
+ - **F1Score**: F1 score for classification tasks
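
+ For intuition, F1 is the harmonic mean of precision and recall. A quick pure-Python check of the arithmetic (independent of NovaEval's implementation):

+ ```python
+ # Binary labels: 1 = positive class
+ predictions  = [1, 1, 0, 1, 0, 0]
+ ground_truth = [1, 0, 0, 1, 1, 0]

+ tp = sum(p == g == 1 for p, g in zip(predictions, ground_truth))        # 2
+ fp = sum(p == 1 and g == 0 for p, g in zip(predictions, ground_truth))  # 1
+ fn = sum(p == 0 and g == 1 for p, g in zip(predictions, ground_truth))  # 1

+ precision = tp / (tp + fp)                            # 2/3
+ recall = tp / (tp + fn)                               # 2/3
+ f1 = 2 * precision * recall / (precision + recall)    # 0.667
+ print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
+ ```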

+ ### Semantic-Based
+ - **SemanticSimilarity**: Embedding-based similarity scoring
+ - **BERTScore**: BERT-based semantic evaluation
+ - **RougeScore**: ROUGE metrics for text generation

+ ### Code-Specific
+ - **CodeExecution**: Execute and validate code outputs
+ - **SyntaxChecker**: Validate code syntax
+ - **TestCoverage**: Code coverage analysis

+ ### Custom
+ - **LLMJudge**: Use another LLM as a judge
+ - **HumanEval**: Integration with human evaluation workflows

+ ## 🚀 Deployment

+ ### Local Development

+ ```bash
+ # Install dependencies
+ pip install -e ".[dev]"

+ # Run tests
+ pytest

+ # Run example evaluation
+ python examples/basic_evaluation.py
+ ```

+ ### Docker

+ ```bash
+ # Build image
+ docker build -t nova-eval .

+ # Run evaluation
+ docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
+ ```

+ ### Kubernetes

+ ```bash
+ # Deploy to Kubernetes
+ kubectl apply -f kubernetes/

+ # Check status
+ kubectl get pods -l app=nova-eval
+ ```

+ ## 🔧 Configuration

+ NovaEval supports configuration through:

+ - **YAML/JSON files**: Declarative configuration
+ - **Environment variables**: Runtime configuration
+ - **Python code**: Programmatic configuration
+ - **CLI arguments**: Command-line overrides

+ ### Environment Variables

+ ```bash
+ export NOVA_EVAL_OUTPUT_DIR="./results"
+ export NOVA_EVAL_LOG_LEVEL="INFO"
+ export OPENAI_API_KEY="your-api-key"
+ export AWS_ACCESS_KEY_ID="your-aws-key"
+ ```
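
+ These variables can be read with standard `os.environ` lookups; the snippet below is a generic sketch of that pattern, not a description of how NovaEval resolves them internally:

+ ```python
+ import os

+ # Fall back to defaults when a variable is unset (names as exported above)
+ output_dir = os.environ.get("NOVA_EVAL_OUTPUT_DIR", "./results")
+ log_level = os.environ.get("NOVA_EVAL_LOG_LEVEL", "INFO")

+ # Fail fast on required credentials instead of failing mid-evaluation
+ api_key = os.environ["OPENAI_API_KEY"]
+ ```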

+ ### CI/CD Integration

+ NovaEval includes optimized GitHub Actions workflows:

+ - **Unit tests** run on all PRs and pushes for quick feedback
+ - **Integration tests** run on the main branch only to minimize API costs
+ - **Cross-platform testing** on macOS, Linux, and Windows

+ ## 📈 Reporting and Artifacts

+ NovaEval generates comprehensive evaluation reports:

+ - **Summary Reports**: High-level metrics and insights
+ - **Detailed Results**: Per-sample predictions and scores
+ - **Visualizations**: Charts and graphs for result analysis
+ - **Artifacts**: Model outputs, intermediate results, and debug information
+ - **Export Formats**: JSON, CSV, HTML, PDF

+ ### Example Report Structure

+ ```
+ results/
+ ├── summary.json              # High-level metrics
+ ├── detailed_results.csv      # Per-sample results
+ ├── artifacts/
+ │   ├── model_outputs/        # Raw model responses
+ │   ├── intermediate/         # Processing artifacts
+ │   └── debug/                # Debug information
+ ├── visualizations/
+ │   ├── accuracy_by_category.png
+ │   ├── score_distribution.png
+ │   └── confusion_matrix.png
+ └── report.html               # Interactive HTML report
+ ```
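
+ Because the report directory is plain JSON and CSV, results are easy to post-process. A minimal sketch, assuming only the layout above (the field names inside the files vary by evaluation):

+ ```python
+ import csv
+ import json
+ from pathlib import Path

+ results_dir = Path("results")

+ # High-level metrics
+ summary = json.loads((results_dir / "summary.json").read_text())
+ print(summary)

+ # Per-sample rows as dictionaries
+ with open(results_dir / "detailed_results.csv", newline="") as f:
+     rows = list(csv.DictReader(f))
+ print(f"{len(rows)} per-sample results")
+ ```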

+ ## 🔌 Extending NovaEval

+ ### Custom Datasets

+ ```python
+ from novaeval.datasets import BaseDataset

+ class MyCustomDataset(BaseDataset):
+     def load_data(self):
+         # Implement data loading logic
+         return samples

+     def get_sample(self, index):
+         # Return individual sample
+         return sample
+ ```

+ ### Custom Scorers

+ ```python
+ from novaeval.scorers import BaseScorer

+ class MyCustomScorer(BaseScorer):
+     def score(self, prediction, ground_truth, context=None):
+         # Implement scoring logic
+         return score
+ ```
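
+ To make the stub concrete, here is a complete if naive scorer built on the same `score` signature shown above. The keyword-overlap metric itself is our illustration, not a built-in:

+ ```python
+ from novaeval.scorers import BaseScorer

+ class KeywordOverlapScorer(BaseScorer):
+     """Fraction of ground-truth words that also appear in the prediction."""

+     def score(self, prediction, ground_truth, context=None):
+         truth_words = set(ground_truth.lower().split())
+         if not truth_words:
+             return 0.0
+         pred_words = set(prediction.lower().split())
+         return len(truth_words & pred_words) / len(truth_words)
+ ```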

+ ### Custom Models

+ ```python
+ from novaeval.models import BaseModel

+ class MyCustomModel(BaseModel):
+     def generate(self, prompt, **kwargs):
+         # Implement model inference
+         return response
+ ```
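
+ The same pattern supports offline testing: a model that returns canned responses lets you exercise datasets and scorers without spending API credits. Only the `generate` signature comes from the stub above; the constructor arguments are our own illustration:

+ ```python
+ from novaeval.models import BaseModel

+ class CannedModel(BaseModel):
+     """Returns pre-recorded responses; handy for offline pipeline tests."""

+     def __init__(self, responses, **kwargs):
+         super().__init__(**kwargs)   # assumes BaseModel accepts keyword args
+         self._responses = responses  # maps prompt -> canned reply

+     def generate(self, prompt, **kwargs):
+         return self._responses.get(prompt, "I don't know.")
+ ```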

+ ## 🤝 Contributing

+ We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

+ ### 🎯 Priority Contribution Areas

+ As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

+ 1. **Unit Tests** - Expand test coverage beyond the current 23%
+ 2. **Examples** - Real-world evaluation scenarios and use cases
+ 3. **Guides & Notebooks** - Interactive evaluation tutorials
+ 4. **Documentation** - API docs, user guides, and tutorials
+ 5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
+ 6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

+ ### Development Setup

+ ```bash
+ # Clone repository
+ git clone https://github.com/Noveum/NovaEval.git
+ cd NovaEval

+ # Create virtual environment
+ python -m venv venv
+ source venv/bin/activate  # On Windows: venv\Scripts\activate

+ # Install development dependencies
+ pip install -e ".[dev]"

+ # Install pre-commit hooks
+ pre-commit install

+ # Run tests
+ pytest

+ # Run with coverage
+ pytest --cov=src/novaeval --cov-report=html
+ ```

+ ### 🏗️ Contribution Workflow

+ 1. **Fork** the repository
+ 2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
+ 3. **Make** your changes following our coding standards
+ 4. **Add** tests for your changes
+ 5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
+ 6. **Push** to the branch (`git push origin feature/amazing-feature`)
+ 7. **Open** a Pull Request

+ ### 📋 Contribution Guidelines

+ - **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
+ - **Testing**: Add unit tests for new features and bug fixes
+ - **Documentation**: Update documentation for API changes
+ - **Commit Messages**: Use conventional commit format
+ - **Issues**: Reference relevant issues in your PR description

+ ### 🎉 Recognition

+ Contributors will be:

+ - Listed on our contributors page
+ - Mentioned in release notes for significant contributions
+ - Invited to join our contributor Discord community

+ ## 📄 License

+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

+ ## 🙏 Acknowledgments

+ - Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
+ - Built with modern Python best practices and industry standards
+ - Designed for the AI evaluation community

+ ## 📞 Support

+ - **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
+ - **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
+ - **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
+ - **Email**: [email protected]

+ ---

+ Made with ❤️ by the Noveum.ai team