---
title: Tp 1 Dgx Node Estimator
emoji: ⚙️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: for NVIDIA TRDC estimation
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# 🚀 H100 Node & CUDA Version Estimator
An interactive Gradio application for estimating H100 GPU node requirements and CUDA version recommendations based on your machine learning workload specifications.
## Features

- **Comprehensive Model Support**: 40+ models, including:
  - Text Models: LLaMA-2/3/3.1, Nemotron-4, Qwen2/2.5
  - Vision-Language: Qwen-VL, Qwen2-VL, NVIDIA VILA series
  - Audio Models: Qwen-Audio, Qwen2-Audio
  - Physics-ML: NVIDIA PhysicsNeMo (FNO, PINN, GraphCast, SFNO)
- **Smart Estimation**: Calculates memory requirements, including model weights, KV cache, and operational overhead
- **Multimodal Support**: Handles vision-language and audio-language models with specialized memory calculations
- **Use Case Optimization**: Provides separate estimates for inference, training, and fine-tuning scenarios
- **Precision Support**: Handles FP32, FP16, BF16, INT8, and INT4 formats
- **Interactive Visualizations**: Memory breakdown charts and node utilization graphs
- **CUDA Recommendations**: Suggests optimal CUDA versions and driver requirements
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd tp-1-dgx-node-estimator
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

1. Run the application:

   ```bash
   python app.py
   ```

2. Open your browser and navigate to `http://localhost:7860`.

3. Configure your parameters:
   - **Model**: Select one of the supported models (see Supported Models below)
   - **Input Tokens**: Number of input tokens per request
   - **Output Tokens**: Number of output tokens per request
   - **Batch Size**: Number of concurrent requests
   - **Use Case**: Choose inference, training, or fine-tuning
   - **Precision**: Select the model precision/quantization level

4. Click "💡 Estimate Requirements" to get your recommendations.
## Key Calculations

### Memory Estimation

- **Model Memory**: Base model weights, scaled by the chosen precision
- **KV Cache**: Calculated from sequence length and model architecture
- **Overhead**: Use-case-specific multipliers (see the sketch below):
  - Inference: 1.2x (20% overhead)
  - Training: 3.0x (gradients + optimizer states)
  - Fine-tuning: 2.5x (moderate overhead)
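
As a rough illustration, the estimate combines these three terms. The following is a minimal sketch of that math, assuming the multipliers above and the standard dense KV-cache formula; the app's actual formulas may differ in detail (e.g. for GQA models):

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
OVERHEAD = {"inference": 1.2, "training": 3.0, "fine-tuning": 2.5}

def estimate_memory_gb(params_billion, precision, use_case,
                       batch_size, seq_len, n_layers, hidden_size):
    """Return (model_gb, kv_cache_gb, total_gb) for a decoder-only model."""
    # Model weights: parameter count x bytes per parameter
    model_gb = params_billion * BYTES_PER_PARAM[precision]
    # KV cache: 2 tensors (K and V) per layer, one entry per token per sequence
    kv_gb = (2 * n_layers * hidden_size * seq_len * batch_size
             * BYTES_PER_PARAM[precision]) / 1e9
    total_gb = (model_gb + kv_gb) * OVERHEAD[use_case]
    return model_gb, kv_gb, total_gb

# LLaMA-3-8B-like shapes, FP16 inference, batch 1, 2048 input + 512 output tokens
print(estimate_memory_gb(8, "FP16", "inference", 1, 2048 + 512, 32, 4096))
# -> (16.0, ~1.3, ~20.8)
```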
### Node Calculation

- **H100 Node**: 8 × H100 GPUs per node = 640GB HBM3 total, of which 576GB is treated as usable per node (see the sketch below)
- **Model Parallelism**: Automatically considered for large models
- **Memory Efficiency**: Optimal distribution across nodes
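
A minimal sketch of the node count under the 576GB-usable assumption above (the function name is illustrative, not the app's API):

```python
import math

USABLE_GB_PER_NODE = 576  # 8 x 80GB H100s, minus ~10% runtime headroom

def nodes_required(total_memory_gb: float) -> int:
    """Smallest number of H100 nodes whose usable memory covers the estimate."""
    return max(1, math.ceil(total_memory_gb / USABLE_GB_PER_NODE))

print(nodes_required(20.8))   # 1 node
print(nodes_required(900.0))  # 2 nodes
```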
## Example Scenarios

| Model | Tokens (In/Out) | Batch Size | Use Case | Precision | Estimated Nodes |
|---|---|---|---|---|---|
| LLaMA-3-8B | 2048/512 | 1 | Inference | FP16 | 1 |
| LLaMA-3-70B | 4096/1024 | 4 | Inference | FP16 | 1 |
| Qwen2.5-72B | 8192/2048 | 2 | Fine-tuning | BF16 | 1 |
| Nemotron-4-340B | 2048/1024 | 1 | Inference | INT8 | 1-2 |
| Qwen2-VL-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| VILA-1.5-13B | 2048/512 | 2 | Inference | BF16 | 1 |
| Qwen2-Audio-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| PhysicsNeMo-FNO-Large | 512/128 | 8 | Training | FP32 | 1 |
| PhysicsNeMo-GraphCast-Medium | 1024/256 | 4 | Training | FP16 | 1 |
## CUDA Recommendations

The application provides tailored CUDA version recommendations (a quick local check follows the list):

- **Optimal**: CUDA 12.4 + cuDNN 8.9+
- **Recommended**: CUDA 12.1+ with cuDNN 8.7+
- **Minimum**: CUDA 11.8 + cuDNN 8.5+
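
If PyTorch is installed, you can compare your local stack against these tiers; this generic check is not part of the app itself:

```python
# Report the CUDA runtime and cuDNN build that this PyTorch wheel targets.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime:", torch.version.cuda)             # e.g. "12.4"
print("cuDNN build:", torch.backends.cudnn.version())  # e.g. 8907
```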
## Output Features

### 📊 Detailed Analysis

- Complete memory breakdown
- Parameter counts and model specifications
- Step-by-step calculation explanation

### 🔧 CUDA Recommendations

- Version compatibility matrix
- Driver requirements
- Compute capability information

### 📈 Memory Utilization

- Visual memory breakdown (pie chart)
- Node utilization distribution (bar chart)
- Efficiency metrics
## Technical Details

### Supported Models

#### Text Models

- **LLaMA**: 2-7B, 2-13B, 2-70B, 3-8B, 3-70B, 3.1-8B, 3.1-70B, 3.1-405B
- **Nemotron**: 4-15B, 4-340B
- **Qwen2**: 0.5B, 1.5B, 7B, 72B
- **Qwen2.5**: 0.5B, 1.5B, 7B, 14B, 32B, 72B

#### Vision-Language Models

- **Qwen-VL**: Base, Chat, Plus, Max variants
- **Qwen2-VL**: 2B, 7B, 72B
- **NVIDIA VILA**: 1.5-3B, 1.5-8B, 1.5-13B, 1.5-40B

#### Audio Models

- **Qwen-Audio**: Base, Chat variants
- **Qwen2-Audio**: 7B

#### Physics-ML Models (NVIDIA PhysicsNeMo)

- **Fourier Neural Operators (FNO)**: Small (1M), Medium (10M), Large (50M)
- **Physics-Informed Neural Networks (PINN)**: Small (0.5M), Medium (5M), Large (20M)
- **GraphCast**: Small (50M), Medium (200M), Large (1B) - for weather/climate modeling
- **Spherical FNO (SFNO)**: Small (25M), Medium (100M), Large (500M) - for global simulations
### Precision Impact

Bytes per parameter drive the weight footprint (worked numbers below):

- **FP32**: Full precision (4 bytes per parameter)
- **FP16/BF16**: Half precision (2 bytes per parameter)
- **INT8**: 8-bit quantization (1 byte per parameter)
- **INT4**: 4-bit quantization (0.5 bytes per parameter)
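
For intuition, here is the weight-only footprint of a hypothetical 70B-parameter model at each precision:

```python
# Weight-only footprint (GB) of a hypothetical 70B-parameter model.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: {70e9 * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, FP16/BF16: 140 GB, INT8: 70 GB, INT4: 35 GB
```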
### Multimodal Considerations

- **Vision Models**: Process images as token sequences (typically 256-1024 tokens per image)
- **Audio Models**: Handle audio segments with frame-based tokenization
- **Memory Overhead**: Additional memory for vision/audio encoders and cross-modal attention
- **Token Estimation**: Include multimodal inputs when counting tokens (see the example below)
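
A hedged example of that token accounting, assuming roughly 576 tokens per image (a mid-range value within the 256-1024 span above; real counts depend on the model's vision encoder):

```python
# Fold image inputs into the effective input-token count.
# tokens_per_image is an illustrative assumption, not a model constant.
def effective_input_tokens(text_tokens: int, n_images: int = 0,
                           tokens_per_image: int = 576) -> int:
    return text_tokens + n_images * tokens_per_image

print(effective_input_tokens(1024, n_images=2))  # 1024 + 2 * 576 = 2176
```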
### PhysicsNeMo Considerations

- **Grid-Based Data**: Physics models operate on spatial/temporal grids rather than text tokens
- **Batch Training**: Physics-ML models typically require larger batch sizes for stable training
- **Memory Patterns**: Different from LLMs - little to no KV cache, more gradient memory for PDE constraints
- **Precision Requirements**: Many physics simulations require FP32 for numerical stability
- **Use Cases** (a grid-memory sketch follows this list):
  - FNO: Solving PDEs on regular grids (fluid dynamics, heat transfer)
  - PINN: Physics-informed training with PDE constraints
  - GraphCast: Weather prediction and climate modeling
  - SFNO: Global atmospheric and oceanic simulations
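
Because grid data replaces token sequences, per-batch memory scales with grid resolution rather than context length. A rough sketch under illustrative shapes (the sizes are assumptions, not PhysicsNeMo defaults):

```python
# Memory of one FP32 grid tensor of shape (batch, channels, nx, ny), in GB.
# Training holds several such tensors (activations, gradients, optimizer states).
def grid_tensor_gb(batch: int, channels: int, nx: int, ny: int,
                   bytes_per_elem: int = 4) -> float:
    return batch * channels * nx * ny * bytes_per_elem / 1e9

print(grid_tensor_gb(8, 16, 1024, 1024))  # ~0.54 GB per stored tensor
```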
## Limitations

Estimates are approximate and may vary based on:

- Specific model implementation details
- Framework overhead (PyTorch, TensorFlow, etc.)
- Hardware configuration
- Network topology for multi-node setups
## Contributing

Feel free to submit issues and enhancement requests!

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Notes

- **Node Configuration**: Each H100 node contains 8 × H100 GPUs (640GB total memory)
- For production deployments, consider adding a 10-20% buffer to the estimates
- Network bandwidth and storage requirements are not included in the calculations
- Estimates assume optimal memory layout and efficient implementations
- Multi-node setups require high-speed interconnects (NVLink within a node, InfiniBand across nodes) for optimal performance