AC-MIL

AC-MIL (Action-Aware Capsule Multiple Instance Learning for Live-Streaming Room Risk Assessment) is a weakly supervised model for room-level risk assessment in live-streaming platforms. It is designed for scenarios where only binary room-level labels are available, while risk evidence is often sparse, localized, and manifested through coordinated behaviors across users and time.

AC-MIL formulates each live room as a Multiple Instance Learning (MIL) bag, where each instance is a user–timeslot capsule—a short action subsequence performed by a particular user within a fixed time window. The model produces:

  • a room-level risk score, indicating the probability that the room is risky, and
  • capsule-level attributions, providing interpretable evidence by highlighting suspicious user–time segments that contribute most to the prediction.
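In code terms, the bag/instance structure can be pictured as follows. This is an illustrative sketch only; the class and field names are assumptions, not part of any released API:

from dataclasses import dataclass
from typing import List

@dataclass
class Capsule:
    """One MIL instance: a short action subsequence of one user in one timeslot."""
    u_idx: int             # row in the users x timeslots grid
    t: int                 # timeslot (column) index
    action_ids: List[int]  # the actions inside this window

@dataclass
class RoomBag:
    """One MIL bag: a live room carrying a single room-level label."""
    room_id: str
    room_label: int          # 0 = non-risky, >0 = risky
    capsules: List[Capsule]  # instances; capsule-level labels are never observed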

Key idea

Given a room’s action stream, we construct a 2D grid of capsules over users × timeslots. Each capsule summarizes localized behavioral patterns within a specific user–time window. AC-MIL then models:

  • temporal dynamics: how users’ behaviors evolve over time,
  • cross-user dependencies: interactions between viewers and the streamer, as well as coordination patterns among viewers,
  • multi-level signals: evidence captured at the action, capsule, user, and timeslot levels.

AC-MIL fuses these signals to produce a robust room-level risk prediction.
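As a concrete illustration, constructing the capsule grid amounts to bucketing actions by (user, timeslot). A minimal sketch, assuming dict-shaped action records with the field names introduced in the dataset section below:

from collections import defaultdict

def build_capsule_grid(actions):
    """Group one room's actions into the users x timeslots capsule grid.

    `actions` is an iterable of dicts that already carry a row index
    `u_idx` and a timeslot column index `t` (field names follow the
    dataset schema described later in this card).
    """
    grid = defaultdict(list)  # (u_idx, t) -> this capsule's action subsequence
    for act in actions:
        grid[(act["u_idx"], act["t"])].append(act)
    return grid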

Architecture overview

AC-MIL follows a hierarchical serial–parallel design (a compressed end-to-end sketch appears after this list):

  1. Action Field Encoder

    • Encodes the full action sequence with a Transformer to produce contextualized action embeddings.
    • Produces an action-level room representation via a learnable [CLS] token.
  2. Capsule Constructor

    • Partitions actions into user–timeslot capsules.
    • Encodes each capsule with an LSTM (final hidden state as capsule embedding).
  3. Relational Capsule Reasoner

    • Builds an adaptive relation-aware graph over capsules using semantic similarity and relation masks.
    • Runs a graph-aware Transformer to refine capsule embeddings.
    • Provides capsule-level interpretability via [CLS] → capsule attention.
  4. Dual-View Integrator

    • User-view: GRU over each user’s capsule sequence and attention pooling across users.
    • Timeslot-view: attention pooling within each timeslot and GRU across timeslots.
  5. Cross-Level Risk Decoder

    • Learns gates over multi-level room representations.
    • Produces the final room embedding and risk score.
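The following is a compressed, hypothetical PyTorch sketch of the five stages. The dimensions, layer counts, the flattened U × T × S input layout, the mean-pooled capsule-level room vector, and the softmax fusion gate are all simplifying assumptions, not the released implementation (which, per the description above, uses relation masks and [CLS] → capsule attention):

import torch
import torch.nn as nn

class ACMILSketch(nn.Module):
    """Illustrative skeleton of AC-MIL's serial-parallel pipeline."""

    def __init__(self, d=128, n_heads=4):
        super().__init__()
        # 1. Action Field Encoder: Transformer over the full action stream.
        self.action_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True), num_layers=2)
        self.cls = nn.Parameter(torch.randn(1, 1, d))
        # 2. Capsule Constructor: LSTM over each capsule's action subsequence.
        self.capsule_lstm = nn.LSTM(d, d, batch_first=True)
        # 3. Relational Capsule Reasoner (relation-mask construction omitted).
        self.capsule_reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True), num_layers=2)
        # 4. Dual-View Integrator: GRUs plus a shared attention scorer.
        self.user_gru = nn.GRU(d, d, batch_first=True)
        self.slot_gru = nn.GRU(d, d, batch_first=True)
        self.scorer = nn.Linear(d, 1)
        # 5. Cross-Level Risk Decoder: gated fusion of four level vectors.
        self.gate = nn.Linear(4 * d, 4)
        self.head = nn.Linear(d, 1)

    def attn_pool(self, h):
        # Additive-style attention pooling over dim 1.
        w = torch.softmax(self.scorer(h), dim=1)
        return (w * h).sum(dim=1)

    def forward(self, actions, U, T):
        # `actions`: (B, U*T*S, d) pre-encoded action features, laid out as
        # U users x T timeslots x S actions per capsule (fixed S is an
        # assumption; real rooms need padding and masking).
        B, L, d = actions.shape
        S = L // (U * T)
        # 1. Contextualize actions; [CLS] is the action-level room vector.
        x = torch.cat([self.cls.expand(B, -1, -1), actions], dim=1)
        x = self.action_encoder(x)
        room_action, acts = x[:, 0], x[:, 1:]
        # 2. Capsule embedding = final LSTM hidden state per (user, timeslot).
        _, (h, _) = self.capsule_lstm(acts.reshape(B * U * T, S, d))
        caps = h[-1].reshape(B, U * T, d)
        # 3. Refine capsule embeddings with capsule-to-capsule attention.
        caps = self.capsule_reasoner(caps)
        grid = caps.reshape(B, U, T, d)
        # 4a. User view: GRU over each user's timeline, pool across users.
        user_h, _ = self.user_gru(grid.reshape(B * U, T, d))
        room_user = self.attn_pool(user_h[:, -1].reshape(B, U, d))
        # 4b. Timeslot view: pool within each slot, GRU across slots.
        slot = self.attn_pool(grid.transpose(1, 2).reshape(B * T, U, d))
        slot_h, _ = self.slot_gru(slot.reshape(B, T, d))
        room_slot = slot_h[:, -1]
        # Capsule-level room vector (mean here; the model uses [CLS] attention).
        room_cap = caps.mean(dim=1)
        # 5. Gate and fuse the four levels into the final room embedding.
        levels = torch.stack([room_action, room_cap, room_user, room_slot], 1)
        g = torch.softmax(self.gate(levels.flatten(1)), dim=-1).unsqueeze(-1)
        return torch.sigmoid(self.head((g * levels).sum(dim=1)))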

Input / output specification

Input (conceptual)

Each dataset sample corresponds to a live room room_id with a room-level label room_label ∈ {0,1,2,3}; any label > 0 marks the room as risky, so the label is binarized (risky vs. non-risky) for the weakly supervised setting.
A room is represented as an action sequence patch_list = {α_i} ordered by the tuple (user, time), where each action follows the paper’s definition: α = (u, t, a, x) (user, timestamp, action type, and an optional textual/multimodal feature).

In our May/June datasets, each action record is stored with the following fields:

  • u_idx (int): user index within the room, used to build the users × timeslots grid (e.g., 0 = streamer, 1..U = selected viewers).
  • global_user_idx (int/str): global user identifier across the whole dataset (before remapping to u_idx).
  • timestamp (int/float): the action timestamp. In the formulation, timestamps lie within a window [0, T] after the room starts.
  • t (int): timeslot index derived by discretizing timestamp into fixed-length windows. This is the column index when constructing the users × timeslots capsule grid (see the indexing sketch after this list).
  • l (int): role indicator (recommended convention: 0 = viewer, 1 = streamer).
  • action_id (int): the action type id a (e.g., enter, comment, like, gift, share; streamer-side actions may include stream start, ASR text, OCR text, etc.).
  • action_desc (str / null): raw textual content associated with the action (e.g., comment text, ASR transcript, OCR text).
  • action_vec (numpy): pre-encoded feature vector for action_desc.
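For example, deriving u_idx from global_user_idx and t from timestamp might look like the following sketch; the window length is an assumption, and the field names follow the schema above:

def index_room_actions(records, window_len=60):
    """Derive u_idx (0 = streamer, 1..U = viewers) and timeslot index t.

    `records` is one room's raw action list with `global_user_idx`,
    `l` (1 = streamer, 0 = viewer), and `timestamp` fields as described
    above; `window_len` (here 60 time units) is an assumed slot length.
    """
    u_map, next_viewer = {}, 1
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        gid = rec["global_user_idx"]
        if gid not in u_map:
            if rec["l"] == 1:
                u_map[gid] = 0            # streamer occupies row 0
            else:
                u_map[gid] = next_viewer  # viewers numbered by first appearance
                next_viewer += 1
        rec["u_idx"] = u_map[gid]
        rec["t"] = int(rec["timestamp"] // window_len)  # capsule grid column
    return records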

Example (JSONL-like):

{
  "room_id": "1",
  "room_label": 2,
  "patch_list": [
    [u_idx, t, l, action_id, action_vec, timestamp, action_desc, global_user_idx],
    [0, 1, 0, 5, [0.0, 0.3, ...], 4, "Streamer voice-over: ...", 5415431],
    ...
  ]
}
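A record like this could be loaded as follows (a hypothetical loader sketch; the binarization follows the label convention above):

import json

def load_room(jsonl_line):
    """Parse one JSONL record and binarize room_label (>0 means risky)."""
    rec = json.loads(jsonl_line)
    rec["room_label"] = int(int(rec["room_label"]) > 0)
    return rec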

Intended use

Primary use cases

  • Early detection of risky rooms (fraud, collusion, policy-violating coordinated behaviors)
  • Evidence-based moderation: highlight localized suspicious segments (user–time capsules)

Out of scope

  • Identifying or tracking specific individuals
  • Any use that violates privacy laws, platform policies, or user consent requirements
