Empathic-Insight-Face-Large

This work is based on the research paper "EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition" by Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Maurice Kraus, Felix Friedrich, Huu Nguyen, Krishna Kalyan, Kourosh Nadi, Kristian Kersting, and Sören Auer. (Please refer to the full paper for the complete list of authors and affiliations.) Paper link: (Insert ArXiv/Conference link here when available)
The models and datasets are released under the CC-BY-4.0 license.
Model Description
The Empathic-Insight-Face-Large suite consists of 40 individual MLP models. Each model takes a 1152-dimensional SigLIP2 image embedding as input and outputs a continuous score (typically 0-7, can be mean-subtracted) for one of the 40 emotion categories defined in the EMoNet-FACE taxonomy.
The models were pre-trained on the EMoNet-FACE BIG dataset (over 203k synthetic images with generated labels) and fine-tuned on the EMoNet-FACE BINARY dataset (nearly 20k synthetic images with over 65k human expert binary annotations).
Key Features:
- Fine-grained Emotions: Covers a novel 40-category emotion taxonomy.
- High Performance: Approaches human-expert-level agreement on the EMoNet-FACE HQ benchmark.
- Synthetic Data: Trained on AI-generated, demographically balanced, full-face expressions.
- Open: Publicly released models, datasets, and taxonomy.
Intended Use
These models are intended for research purposes in affective computing, human-AI interaction, and emotion recognition. They can be used to:
- Analyze and predict fine-grained emotional expressions in facial images.
- Serve as a baseline for developing more advanced emotion recognition systems.
- Facilitate research into nuanced emotional understanding in AI.
Out-of-Scope Use: These models are trained on synthetic faces and may not generalize well to real-world, in-the-wild images without further adaptation. They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes.
How to Use
These are individual `.pth` files, each corresponding to one emotion classifier. To use them, you will typically:
- Obtain SigLIP2 Embeddings:
  - Use a pre-trained SigLIP2 model (e.g., `google/siglip2-so400m-patch16-384`).
  - Extract the 1152-dimensional image embedding for your target facial image.
- Load an MLP Model:
  - Each `.pth` file (e.g., `model_elation_best.pth`) is a PyTorch state dictionary for an MLP.
  - The MLP architecture used for "Empathic-Insight-Face-Large" (big models) is:
    - Input: 1152 features
    - Hidden Layer 1: 1024 neurons, ReLU, Dropout (0.2)
    - Hidden Layer 2: 512 neurons, ReLU, Dropout (0.2)
    - Hidden Layer 3: 256 neurons, ReLU, Dropout (0.2)
    - Output Layer: 1 neuron (continuous score)
- Perform Inference:
  - Pass the SigLIP2 embedding through the loaded MLP model(s).
- (Optional) Mean Subtraction:
  - The raw output scores can be adjusted by subtracting the model's mean score on neutral faces. The `neutral_stats_cache-_human-binary-big-mlps_v8_two_stage_higher_lr_stage2_5_200+` file in this repository contains these mean values for each emotion model.
Example (Conceptual PyTorch for all 40 emotions):
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoProcessor
from PIL import Image
import numpy as np
import json
from pathlib import Path  # Used for cleaner path handling

# --- 1. Define MLP Architecture (Big Model) ---
class MLP(nn.Module):
    def __init__(self, input_size=1152, output_size=1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 1024),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, output_size)
        )

    def forward(self, x):
        return self.layers(x)

# --- 2. Load Models and Processor ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# === IMPORTANT: Set this to the directory where your .pth models are downloaded ===
# If you've cloned the repo, it might be "./" or the name of the cloned folder.
# Example: MODEL_DIRECTORY = Path("./Empathic-Insight-Face-Large_cloned_repo")
MODEL_DIRECTORY = Path(".")  # Assumes models are in the current directory or a sub-directory
# If the models are in the root of where this script runs after cloning, "." is fine.
# If they are in a subfolder, e.g., "Empathic-Insight-Face-Large", use Path("./Empathic-Insight-Face-Large")
# ================================================================================

# Load SigLIP (ensure it's the correct one for 1152 dim)
siglip_model_id = "google/siglip2-so400m-patch16-384"  # Produces 1152-dim embeddings
siglip_processor = AutoProcessor.from_pretrained(siglip_model_id)
siglip_model = AutoModel.from_pretrained(siglip_model_id).to(device).eval()

# Load neutral stats
neutral_stats_filename = "neutral_stats_cache-_human-binary-big-mlps_v8_two_stage_higher_lr_stage2_5_200+"
neutral_stats_path = MODEL_DIRECTORY / neutral_stats_filename
neutral_stats_all = {}
if neutral_stats_path.exists():
    with open(neutral_stats_path, 'r') as f:
        neutral_stats_all = json.load(f)
else:
    print(f"Warning: Neutral stats file not found at {neutral_stats_path}. Mean subtraction will use 0.0 for all models.")

# Load all emotion MLP models
emotion_mlps = {}
print(f"Loading emotion MLP models from: {MODEL_DIRECTORY.resolve()}")  # .resolve() gives absolute path
model_files_found = list(MODEL_DIRECTORY.glob("model_*_best.pth"))
if not model_files_found:
    print(f"Warning: No model files found in {MODEL_DIRECTORY.resolve()}. Please check the MODEL_DIRECTORY path.")

for pth_file in model_files_found:
    model_key_name = pth_file.stem  # e.g., "model_elation_best"
    try:
        mlp_model = MLP().to(device)
        mlp_model.load_state_dict(torch.load(pth_file, map_location=device))
        mlp_model.eval()
        emotion_mlps[model_key_name] = mlp_model
        # print(f"Loaded: {model_key_name}")
    except Exception as e:
        print(f"Error loading {model_key_name} from {pth_file}: {e}")

if not emotion_mlps:
    print("Error: No MLP models were successfully loaded. Check MODEL_DIRECTORY and file integrity.")
else:
    print(f"Successfully loaded {len(emotion_mlps)} emotion MLP models.")

# --- 3. Prepare Image and Get Embedding ---
def normalized(a, axis=-1, order=2):
    a = np.asarray(a)  # Ensure 'a' is a numpy array
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)

# === Replace with your actual image path ===
# image_path_str = "path/to/your/image.jpg"
# try:
#     image = Image.open(image_path_str).convert("RGB")
#     inputs = siglip_processor(images=[image], return_tensors="pt", padding="max_length", truncation=True).to(device)
#     with torch.no_grad():
#         image_features = siglip_model.get_image_features(**inputs)  # PyTorch tensor
#     embedding_numpy_normalized = normalized(image_features.cpu().numpy())  # Normalize on CPU
#     embedding_tensor = torch.from_numpy(embedding_numpy_normalized).to(device).float()
# except FileNotFoundError:
#     print(f"Error: Image not found at {image_path_str}")
#     embedding_tensor = None  # Or handle error as appropriate
# except Exception as e:
#     print(f"Error processing image {image_path_str}: {e}")
#     embedding_tensor = None
# ==========================================

# --- For demonstration, use a random embedding if no image is processed ---
print("\nUsing a random embedding for demonstration purposes as no image path was set.")
embedding_tensor = torch.randn(1, 1152).to(device).float()
# ==============================================================================

# --- 4. Inference for all loaded models ---
results = {}
if embedding_tensor is not None and emotion_mlps:
    with torch.no_grad():
        for model_key_name, mlp_model_instance in emotion_mlps.items():
            raw_score = mlp_model_instance(embedding_tensor).item()
            neutral_mean = neutral_stats_all.get(model_key_name, {}).get("mean", 0.0)
            mean_subtracted_score = raw_score - neutral_mean

            # Derive a human-readable emotion name from the model key
            emotion_name = model_key_name.replace("model_", "").replace("_best", "").replace("_", " ").title()
            results[emotion_name] = {
                "raw_score": raw_score,
                "neutral_mean": neutral_mean,
                "mean_subtracted_score": mean_subtracted_score
            }

    # Print results, sorted alphabetically by emotion name for consistent output
    print("\n--- Emotion Scores (Mean-Subtracted) ---")
    for emotion, scores in sorted(results.items()):
        print(f"{emotion:<35}: {scores['mean_subtracted_score']:.4f} (Raw: {scores['raw_score']:.4f}, Neutral Mean: {scores['neutral_mean']:.4f})")
else:
    print("Skipping inference as either embedding_tensor is None or no MLP models were loaded.")
```
Performance on EMoNet-FACE HQ Benchmark
The Empathic-Insight-Face models demonstrate strong performance, achieving near human-expert-level agreement on the EMoNet-FACE HQ benchmark.
Key Metric: Weighted Kappa (κw) Agreement with Human Annotators (Aggregated pairwise agreement between model predictions and individual human expert annotations on the EMoNet-FACE HQ dataset)
| Annotator Group | Mean κw (vs. Humans) |
|---|---|
| Human Annotators (vs. Humans) | ~0.20 - 0.26* |
| Empathic-Insight-Face LARGE | ~0.18 |
| Empathic-Insight-Face SMALL | ~0.14 |
| Proprietary Models (e.g., HumeFace) | ~0.11 |
| Random Baseline | ~0.00 |
*Human inter-annotator agreement (pairwise κw) varies per annotator; this is an approximate range from Table 6 in the paper.
Interpretation (from paper Figure 3 & Table 6):
- Empathic-Insight-Face LARGE (our big models) achieves agreement scores that are statistically very close to human inter-annotator agreement and significantly outperforms other evaluated systems like proprietary models and general-purpose VLMs on this benchmark.
- The performance indicates that with focused dataset construction and careful fine-tuning, specialized models can approach human-level reliability on synthetic facial emotion recognition tasks for fine-grained emotions.
For more detailed benchmark results, including per-emotion performance and comparisons with other models using Spearman's Rho, please refer to the full EMoNet-FACE paper (Figures 3, 4, 9 and Table 6 in particular).
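For readers who want to run a similar agreement analysis on their own annotations, the sketch below shows one way a pairwise weighted Cohen's kappa could be computed with scikit-learn. The quadratic weighting, the 0-7 rating scale, and the example rating arrays are assumptions made here for illustration only; refer to the paper for the exact evaluation protocol used on EMoNet-FACE HQ.

```python
# A minimal sketch of pairwise weighted Cohen's kappa.
# Assumptions (not taken from the paper): integer ratings on a 0-7 scale
# and quadratic weighting.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings for the same images from a model and one human expert.
model_ratings = [0, 3, 5, 7, 1, 2]
human_ratings = [0, 4, 5, 6, 0, 2]

kappa_w = cohen_kappa_score(
    model_ratings,
    human_ratings,
    weights="quadratic",
    labels=list(range(8)),  # full 0-7 rating scale
)
print(f"Weighted kappa: {kappa_w:.3f}")
```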
Taxonomy
The 40 emotion categories are: Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph.
(See Table 4 in the paper for associated descriptive words for each category).
Limitations
- Synthetic Data: Models are trained on synthetic faces. Generalization to real-world, diverse, in-the-wild images is not guaranteed and requires further investigation.
- Static Faces: Analysis is restricted to static facial expressions, without broader contextual or multimodal cues.
- Cultural Universality: The 40-category taxonomy, while expert-validated, is one perspective; its universality across cultures is an open research question.
- Subjectivity: Emotion perception is inherently subjective.
Ethical Considerations
The EMoNet-FACE suite was developed with ethical considerations in mind, including:
- Mitigating Bias: Efforts were made to create demographically diverse synthetic datasets and prompts were manually filtered.
- No PII: All images are synthetic, and no personally identifiable information was used.
- Responsible Use: These models are released for research. Users are urged to consider the ethical implications of their applications and avoid misuse, such as for emotional manipulation or in ways that could lead to unfair or harmful outcomes.
Please refer to the "Ethical Considerations" and "Data Integrity, Safety, and Fairness" sections in the EMoNet-FACE paper for a comprehensive discussion.
Citation
If you use these models or the EMoNet-FACE benchmark in your research, please cite the original paper:
```bibtex
@inproceedings{schuhmann2025emonetface,
  title={{EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition}},
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Kraus, Maurice and Friedrich, Felix and Nguyen, Huu and Kalyan, Krishna and Nadi, Kourosh and Kersting, Kristian and Auer, Sören},
  booktitle={arXiv preprint},
  year={2025} % Or actual year of publication
  % TODO: Add URL/DOI when available
}
```