---
title: Tp 1 Dgx Node Estimator
emoji: ⚙️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: for NVIDIA TRDC estimation
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

🚀 H100 Node & CUDA Version Estimator

An interactive Gradio application for estimating H100 GPU node requirements and CUDA version recommendations based on your machine learning workload specifications.

Features

  • Comprehensive Model Support: Supports 40+ models including:
    • Text Models: LLaMA-2/3/3.1, Nemotron-4, Qwen2/2.5
    • Vision-Language: Qwen-VL, Qwen2-VL, NVIDIA VILA series
    • Audio Models: Qwen-Audio, Qwen2-Audio
    • Physics-ML: NVIDIA PhysicsNeMo (FNO, PINN, GraphCast, SFNO)
  • Smart Estimation: Calculates memory requirements including model weights, KV cache, and operational overhead
  • Multimodal Support: Handles vision-language and audio-language models with specialized memory calculations
  • Use Case Optimization: Provides different estimates for inference, training, and fine-tuning scenarios
  • Precision Support: Handles different precision formats (FP32, FP16, BF16, INT8, INT4)
  • Interactive Visualizations: Memory breakdown charts and node utilization graphs
  • CUDA Recommendations: Suggests optimal CUDA versions and driver requirements

Installation

  1. Clone the repository:
     git clone <repository-url>
     cd tp-1-dgx-node-estimator
  2. Install dependencies:
     pip install -r requirements.txt

Usage

  1. Run the application:
     python app.py
  2. Open your browser and navigate to http://localhost:7860

  3. Configure your parameters:

    • Model: Select from the supported models (LLaMA, Nemotron, Qwen, VILA, PhysicsNeMo, and more)
    • Input Tokens: Number of input tokens per request
    • Output Tokens: Number of output tokens per request
    • Batch Size: Number of concurrent requests
    • Use Case: Choose between inference, training, and fine-tuning
    • Precision: Select the model precision/quantization level
  4. Click "💡 Estimate Requirements" to get your recommendations

Key Calculations

Memory Estimation

  • Model Memory: Base model weights adjusted for precision
  • KV Cache: Calculated based on sequence length and model architecture
  • Overhead: Use-case-specific multipliers (combined in the sketch after this list):
    • Inference: 1.2x (20% overhead)
    • Training: 3.0x (gradients + optimizer states)
    • Fine-tuning: 2.5x (moderate overhead)
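
A minimal sketch of how these pieces can combine, assuming standard transformer KV-cache accounting. The function names and constants are illustrative, not the app's actual code; the overhead multipliers are the ones listed above:

```python
# Illustrative memory estimate; app.py may use different formulas.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
OVERHEAD = {"inference": 1.2, "training": 3.0, "fine-tuning": 2.5}

def estimate_memory_gb(params_b, precision, use_case,
                       n_layers, n_kv_heads, head_dim,
                       seq_len, batch_size):
    """Model weights + KV cache, scaled by the use-case multiplier."""
    bpp = BYTES_PER_PARAM[precision]
    weights_gb = params_b * bpp  # params in billions x bytes/param = GB
    # KV cache: 2 (K and V) x layers x KV heads x head dim x tokens x batch.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bpp / 1e9
    return (weights_gb + kv_gb) * OVERHEAD[use_case]

# LLaMA-3-8B (32 layers, 8 KV heads, head dim 128), FP16 inference,
# 2048 input + 512 output tokens, batch 1 -> roughly 20 GB.
print(f"{estimate_memory_gb(8, 'FP16', 'inference', 32, 8, 128, 2560, 1):.1f} GB")
```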

Node Calculation

  • H100 Node: 8 × H100 GPUs per node = 640GB HBM3 total (576GB usable per node)
  • Model Parallelism: Automatic consideration for large models
  • Memory Efficiency: Distributes the estimate across nodes to keep per-node usage within the usable budget (see the sketch below)
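
The node count is then a ceiling division over the usable per-node budget; a minimal sketch using the 576 GB usable figure stated above:

```python
import math

H100_NODE_USABLE_GB = 576  # 8 x 80 GB HBM3 = 640 GB raw, ~90% treated as usable

def nodes_required(total_memory_gb: float) -> int:
    """Smallest number of 8-GPU H100 nodes that fits the estimate."""
    return max(1, math.ceil(total_memory_gb / H100_NODE_USABLE_GB))

print(nodes_required(19.6))   # -> 1 (the LLaMA-3-8B inference example above)
print(nodes_required(700.0))  # -> 2 (estimate exceeds one node's budget)
```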

Example Scenarios

| Model | Tokens (In/Out) | Batch Size | Use Case | Precision | Estimated Nodes |
|-------|-----------------|------------|----------|-----------|-----------------|
| LLaMA-3-8B | 2048/512 | 1 | Inference | FP16 | 1 |
| LLaMA-3-70B | 4096/1024 | 4 | Inference | FP16 | 1 |
| Qwen2.5-72B | 8192/2048 | 2 | Fine-tuning | BF16 | 1 |
| Nemotron-4-340B | 2048/1024 | 1 | Inference | INT8 | 1-2 |
| Qwen2-VL-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| VILA-1.5-13B | 2048/512 | 2 | Inference | BF16 | 1 |
| Qwen2-Audio-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| PhysicsNeMo-FNO-Large | 512/128 | 8 | Training | FP32 | 1 |
| PhysicsNeMo-GraphCast-Medium | 1024/256 | 4 | Training | FP16 | 1 |

CUDA Recommendations

The application provides tailored CUDA version recommendations (a quick way to check your local environment follows the list):

  • Optimal: CUDA 12.4 with cuDNN 8.9+
  • Recommended: CUDA 12.1+ with cuDNN 8.7+
  • Minimum: CUDA 11.8 with cuDNN 8.5+
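
This snippet is a convenience for comparing a local setup against the tiers above (it assumes PyTorch is installed and is not part of the app itself):

```python
import torch

print("CUDA runtime :", torch.version.cuda)              # e.g. "12.4"
print("cuDNN        :", torch.backends.cudnn.version())  # e.g. 8907
if torch.cuda.is_available():
    # H100 (Hopper) reports compute capability 9.0; CUDA 11.8 is the
    # first release with sm_90 support, hence the minimum above.
    print("Compute cap. :", "%d.%d" % torch.cuda.get_device_capability(0))
```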

Output Features

📊 Detailed Analysis

  • Complete memory breakdown
  • Parameter counts and model specifications
  • Step-by-step calculation explanation

🔧 CUDA Recommendations

  • Version compatibility matrix
  • Driver requirements
  • Compute capability information

📈 Memory Utilization

  • Visual memory breakdown (pie chart)
  • Node utilization distribution (bar chart)
  • Efficiency metrics

Technical Details

Supported Models

Text Models

  • LLaMA: 2-7B, 2-13B, 2-70B, 3-8B, 3-70B, 3.1-8B, 3.1-70B, 3.1-405B
  • Nemotron: 4-15B, 4-340B
  • Qwen2: 0.5B, 1.5B, 7B, 72B
  • Qwen2.5: 0.5B, 1.5B, 7B, 14B, 32B, 72B

Vision-Language Models

  • Qwen-VL: Base, Chat, Plus, Max variants
  • Qwen2-VL: 2B, 7B, 72B
  • NVIDIA VILA: 1.5-3B, 1.5-8B, 1.5-13B, 1.5-40B

Audio Models

  • Qwen-Audio: Base, Chat variants
  • Qwen2-Audio: 7B

Physics-ML Models (NVIDIA PhysicsNeMo)

  • Fourier Neural Operators (FNO): Small (1M), Medium (10M), Large (50M)
  • Physics-Informed Neural Networks (PINN): Small (0.5M), Medium (5M), Large (20M)
  • GraphCast: Small (50M), Medium (200M), Large (1B) - for weather/climate modeling
  • Spherical FNO (SFNO): Small (25M), Medium (100M), Large (500M) - for global simulations

Precision Impact

  • FP32: Full precision (4 bytes per parameter)
  • FP16/BF16: Half precision (2 bytes per parameter)
  • INT8: 8-bit quantization (1 byte per parameter)
  • INT4: 4-bit quantization (0.5 bytes per parameter)
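
For example, LLaMA-3-70B needs roughly 70 × 10⁹ parameters × 2 bytes ≈ 140 GB for weights alone in FP16, but only about 35 GB in INT4, which is how quantization can turn a multi-GPU deployment into a comfortable single-node one.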

Multimodal Considerations

  • Vision Models: Process images as token sequences (typically 256-1024 tokens per image)
  • Audio Models: Handle audio segments with frame-based tokenization
  • Memory Overhead: Additional memory for vision/audio encoders and cross-modal attention
  • Token Estimation: Include image/audio tokens when calculating request token counts (see the sketch below)
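
A hedged sketch of that token accounting; the per-image and per-clip token counts are assumptions (the image figure is the midpoint of the 256-1024 range above):

```python
def effective_input_tokens(text_tokens: int,
                           n_images: int = 0, tokens_per_image: int = 512,
                           n_audio_clips: int = 0, tokens_per_clip: int = 400) -> int:
    """Illustrative accounting of multimodal inputs as extra input tokens.
    tokens_per_image: assumed midpoint of the 256-1024 range above.
    tokens_per_clip: assumed figure for a short audio segment."""
    return text_tokens + n_images * tokens_per_image + n_audio_clips * tokens_per_clip

# A 1024-token prompt with one attached image behaves like ~1536 input tokens.
print(effective_input_tokens(1024, n_images=1))
```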

PhysicsNeMo Considerations

  • Grid-Based Data: Physics models work with spatial/temporal grids rather than text tokens
  • Batch Training: Physics-ML models typically require larger batch sizes for stable training
  • Memory Patterns: Different from LLMs - less KV cache, more gradient memory for PDE constraints (sketched after this list)
  • Precision Requirements: Many physics simulations require FP32 for numerical stability
  • Use Cases:
    • FNO: Solving PDEs on regular grids (fluid dynamics, heat transfer)
    • PINN: Physics-informed training with PDE constraints
    • GraphCast: Weather prediction and climate modeling
    • SFNO: Global atmospheric and oceanic simulations
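
A sketch of that gradient-memory point, using the 3.0x training multiplier from Key Calculations (an illustrative assumption, not the app's code):

```python
def physics_training_memory_gb(params_m: float, bytes_per_param: float = 4.0,
                               training_overhead: float = 3.0) -> float:
    """FP32 weights (common for numerical stability, per the note above)
    times the 3.0x training multiplier covering gradients and optimizer
    states. Activations on dense spatial grids come on top of this."""
    return params_m * 1e6 * bytes_per_param * training_overhead / 1e9

# GraphCast-Large (1B parameters) in FP32 -> ~12 GB before activations.
print(f"{physics_training_memory_gb(1000):.1f} GB")
```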

Limitations

  • Estimates are approximate and may vary based on:
    • Specific model implementation details
    • Framework overhead (PyTorch, TensorFlow, etc.)
    • Hardware configuration
    • Network topology for multi-node setups

Contributing

Feel free to submit issues and enhancement requests!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Notes

  • Node Configuration: Each H100 node contains 8 × H100 GPUs (640GB total memory)
  • For production deployments, consider adding a 10-20% buffer to estimates (worked example after this list)
  • Network bandwidth and storage requirements are not included in calculations
  • Estimates assume optimal memory layout and efficient implementations
  • Multi-node setups require high-speed interconnects (InfiniBand/NVLink) for optimal performance
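
As a worked example of that buffer: a 520 GB estimate becomes roughly 570-625 GB for planning purposes, which brushes against a single node's 576 GB usable budget and may justify provisioning a second node.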