Model Card for RecNeXt-T
Abstract
Recent advances in vision transformers (ViTs) have demonstrated the advantage of global modeling capabilities, prompting widespread integration of large-kernel convolutions for enlarging the effective receptive field (ERF). However, the quadratic scaling of parameter count and computational complexity (FLOPs) with respect to kernel size poses significant efficiency and optimization challenges. This paper introduces RecConv, a recursive decomposition strategy that efficiently constructs multi-frequency representations using small-kernel convolutions. RecConv establishes a linear relationship between parameter growth and decomposition levels: the effective receptive field is $k \times 2^\ell$ for a base kernel $k$ and $\ell$ levels of decomposition, while FLOPs remain constant regardless of the ERF expansion. Specifically, RecConv achieves a parameter expansion of only $\ell+2$ times and a maximum FLOPs increase of $5/3$ times, compared to the exponential growth ($4^\ell$) of standard and depthwise convolutions. RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 $AP^{box}$ on COCO with similar FLOPs. This innovation provides a promising avenue towards designing efficient and compact networks across various modalities. Code and models can be found at https://github.com/suous/RecNeXt.
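To make those scaling claims concrete, here is a small worked example (our own arithmetic from the formulas above, not figures from the paper):

```python
k, l = 5, 3          # base kernel size and number of decomposition levels
print(k * 2 ** l)    # effective receptive field: 40
print(l + 2)         # RecConv parameter multiplier: 5x
print(4 ** l)        # standard/depthwise large-kernel multiplier at the same ERF: 64x
```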


Model Details
Model Type: Image Classification / Feature Extraction
Model Series: L
Model Stats:
- Parameters: 12.1M
- MACs: 0.3G
- Latency: 1.8ms (iPhone 13, iOS 18)
- Throughput: 13878 images/s (RTX 3090)
- Image Size: 224x224
Architecture Configuration:
- Embedding Dimensions: (64, 128, 256, 512)
- Depths: (0, 2, 8, 10)
- MLP Ratio: (2, 2, 2, 1.5)
Paper: RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations
Dataset: ImageNet-1K
Recent Updates 🔥
- 2025/07/23: Added a simple architecture whose overall design follows LSNet.
- 2025/07/04: Uploaded classification models to Hugging Face 🤗.
- 2025/07/01: Added more comparisons with LSNet.
- 2025/06/27: Added A series code and logs, replacing convolution with linear attention.
- 2025/03/19: Added more ablation study results, including using attention with RecConv design.
- 2025/01/02: Uploaded checkpoints and training logs of RecNeXt-M0.
- 2024/12/29: Uploaded checkpoints and training logs of RecNeXt-M1 - M5.
Model Usage
Image Classification
```python
from urllib.request import urlopen

from PIL import Image
import timm
import torch

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('recnext_t', pretrained=True, distillation=False)
model = model.eval()

# get model-specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
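Since the card also lists feature extraction, here is a hedged sketch using timm's `features_only` mechanism (this assumes the RecNeXt port registers feature metadata like other timm backbones; check the repository if it errors):

```python
# reuses `timm`, `transforms`, and `img` from the classification snippet above
feature_model = timm.create_model('recnext_t', pretrained=True, features_only=True)
feature_model = feature_model.eval()

features = feature_model(transforms(img).unsqueeze(0))
for f in features:
    print(f.shape)  # one feature map per stage
```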
Converting to Inference Mode
```python
import utils

# convert the training-time model to inference structure and fuse batch norms
utils.replace_batchnorm(model)
```
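For reference, `replace_batchnorm` works by folding each BatchNorm2d into the convolution that precedes it. Below is a minimal sketch of that fold (our illustration of the idea, not the repo's exact `utils` implementation):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta can be rewritten
    # as a single convolution with rescaled weights and a shifted bias
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```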
Model Comparison
Classification
We introduce two series of models: the A series uses linear attention and nearest interpolation, while the M series employs convolution and bilinear interpolation for simplicity and broader hardware compatibility (e.g., to address suboptimal nearest interpolation support in some iOS versions).
dist: with distillation; base: without distillation (all models are trained for 300 epochs).
model | top-1 (dist / base) | params | GMACs | NPU latency | CPU latency | throughput (im/s) | fused weights | training logs
---|---|---|---|---|---|---|---|---
M0 | 74.7* / 73.2 | 2.5M | 0.4 | 1.0ms | 189ms | 750 | dist / base | dist / base
M1 | 79.2* / 78.0 | 5.2M | 0.9 | 1.4ms | 361ms | 384 | dist / base | dist / base
M2 | 80.3* / 79.2 | 6.8M | 1.2 | 1.5ms | 431ms | 325 | dist / base | dist / base
M3 | 80.9* / 79.6 | 8.2M | 1.4 | 1.6ms | 482ms | 314 | dist / base | dist / base
M4 | 82.5* / 81.4 | 14.1M | 2.4 | 2.4ms | 843ms | 169 | dist / base | dist / base
M5 | 83.3* / 82.9 | 22.9M | 4.7 | 3.4ms | 1487ms | 104 | dist / base | dist / base
A0 | 75.0* / 73.6 | 2.8M | 0.4 | 1.4ms | 177ms | 4891 | dist / base | dist / base
A1 | 79.6* / 78.3 | 5.9M | 0.9 | 1.9ms | 334ms | 2730 | dist / base | dist / base
A2 | 80.8* / 79.6 | 7.9M | 1.2 | 2.2ms | 413ms | 2331 | dist / base | dist / base
A3 | 81.1* / 80.1 | 9.0M | 1.4 | 2.4ms | 447ms | 2151 | dist / base | dist / base
A4 | 82.5* / 81.6 | 15.8M | 2.4 | 3.6ms | 764ms | 1265 | dist / base | dist / base
A5 | 83.5* / 83.1 | 25.7M | 4.7 | 5.6ms | 1376ms | 733 | dist / base | dist / base
Comparison with LSNet
We present a simple architecture whose overall design follows LSNet. The framework centers on sharing channel features from previous layers; our motivation is to reduce the computational cost of token mixers and to minimize feature redundancy in the final stage.
With Shared-Channel Blocks
model | top-1 (dist / norm) | params | GMACs | NPU latency | CPU latency | throughput (im/s) | fused weights | training logs
---|---|---|---|---|---|---|---|---
T | 76.8 / 75.2 | 12.1M | 0.3 | 1.8ms | 105ms | 13957 | dist / norm | dist / norm
S | 79.5 / 78.3 | 15.8M | 0.7 | 2.0ms | 182ms | 8034 | dist / norm | dist / norm
B | 81.5 / 80.3 | 19.2M | 1.1 | 2.5ms | 296ms | 4472 | dist / norm | dist / norm
Without Shared-Channel Blocks
model | top-1 (dist / base) | params | GMACs | NPU latency | CPU latency | throughput (im/s) | fused weights | training logs
---|---|---|---|---|---|---|---|---
T | 76.6* / 75.1 | 12.1M | 0.3 | 1.8ms | 109ms | 13878 | dist / base | dist / base
S | 79.6* / 78.3 | 15.8M | 0.7 | 2.0ms | 188ms | 7989 | dist / base | dist / base
B | 81.4* / 80.3 | 19.3M | 1.1 | 2.5ms | 290ms | 4450 | dist / base | dist / base
The NPU latency is measured on an iPhone 13 with models compiled by Core ML Tools. The CPU latency is measured on a quad-core ARM Cortex-A57 processor with models in ONNX format, and the throughput is tested on an NVIDIA RTX 3090 with the maximum power-of-two batch size that fits in memory.
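For reference, a minimal sketch of such a CPU latency measurement with onnxruntime (the file name `recnext_t.onnx` is a placeholder; export the model to ONNX first):

```python
import time

import numpy as np
import onnxruntime as ort

# 'recnext_t.onnx' is a placeholder path for an exported model
session = ort.InferenceSession('recnext_t.onnx', providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(10):                            # warmup
    session.run(None, {input_name: x})
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: x})
print(f'{(time.perf_counter() - start) / 100 * 1000:.1f} ms / image')
```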
Latency Measurement
The latency reported in RecNeXt for the iPhone 13 (iOS 18) uses the benchmark tool from Xcode 14.
Per-model latency benchmark screenshots (RecNeXt-M0 through M5, A0 through A5, and T/S/B) are omitted here; see the repository for the measurement images.
Tip: export the model to Core ML:
```bash
python export_coreml.py --model recnext_m1 --ckpt pretrain/recnext_m1_distill_300e.pth
```
Tip: measure the throughput on GPU:
```bash
python speed_gpu.py --model recnext_m1
```
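Roughly, such a throughput measurement looks like the sketch below (`speed_gpu.py` is the authoritative script; the batch size of 256 and the registered model name are assumptions following the usage section above):

```python
import timm
import torch

model = timm.create_model('recnext_m1', pretrained=False).cuda().eval()
batch = torch.randn(256, 3, 224, 224, device='cuda')

with torch.no_grad():
    for _ in range(10):                       # warmup
        model(batch)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(30):
        model(batch)
    end.record()
    torch.cuda.synchronize()

elapsed_s = start.elapsed_time(end) / 1000    # elapsed_time returns milliseconds
print(f'{256 * 30 / elapsed_s:.0f} images/s')
```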
ImageNet (Training and Evaluation)
Prerequisites
A conda virtual environment is recommended:
```bash
conda create -n recnext python=3.8
conda activate recnext
pip install -r requirements.txt
```
Data preparation
Download and extract ImageNet train and val images from http://image-net.org/. The training and validation data are expected to be in the train folder and val folder, respectively:
```
# script to extract ImageNet dataset: https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh
# ILSVRC2012_img_train.tar (about 138 GB)
# ILSVRC2012_img_val.tar (about 6.3 GB)
# organize the ImageNet dataset as follows:
imagenet
├── train
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── ......
│   ├── ......
└── val
    ├── n01440764
    │   ├── ILSVRC2012_val_00000293.JPEG
    │   ├── ILSVRC2012_val_00002138.JPEG
    │   ├── ......
    ├── ......
```
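A quick, optional sanity check of the layout (assumes torchvision is installed; both splits should expose 1000 classes for ImageNet-1K):

```python
import os

from torchvision.datasets import ImageFolder

train = ImageFolder(os.path.expanduser('~/imagenet/train'))
val = ImageFolder(os.path.expanduser('~/imagenet/val'))
print(len(train.classes), len(val.classes))  # expect: 1000 1000
```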
Training
To train RecNeXt-M1 on an 8-GPU machine:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --master_port 12346 --use_env main.py --model recnext_m1 --data-path ~/imagenet --dist-eval
```
Tip: specify your data path and model name.
Testing
For example, to test RecNeXt-M1:
```bash
python main.py --eval --model recnext_m1 --resume pretrain/recnext_m1_distill_300e.pth --data-path ~/imagenet
```
To use the pretrained model without knowledge distillation from Hugging Face 🤗:
```bash
python main.py --eval --model recnext_m1 --data-path ~/imagenet --pretrained --distillation-type none
```
To use the pretrained model with knowledge distillation from Hugging Face 🤗:
```bash
python main.py --eval --model recnext_m1 --data-path ~/imagenet --pretrained --distillation-type hard
```
Fused model evaluation
For example, to evaluate RecNeXt-M1 with the fused model:
```bash
python fuse_eval.py --model recnext_m1 --resume pretrain/recnext_m1_distill_300e_fused.pt --data-path ~/imagenet
```
Extract model for publishing
```bash
# without distillation
python publish.py --model_name recnext_m1 --checkpoint_path pretrain/checkpoint_best.pth --epochs 300
# with distillation
python publish.py --model_name recnext_m1 --checkpoint_path pretrain/checkpoint_best.pth --epochs 300 --distillation
# fused model
python publish.py --model_name recnext_m1 --checkpoint_path pretrain/checkpoint_best.pth --epochs 300 --fused
```
Downstream Tasks
Object Detection and Instance Segmentation
Object detection and instance segmentation results on COCO:
model | $AP^b$ | $AP_{50}^b$ | $AP_{75}^b$ | $AP^m$ | $AP_{50}^m$ | $AP_{75}^m$ | Latency | Ckpt | Log |
---|---|---|---|---|---|---|---|---|---|
M3 | 41.7 | 63.4 | 45.4 | 38.6 | 60.5 | 41.4 | 5.2ms | M3 | M3 |
M4 | 43.5 | 64.9 | 47.7 | 39.7 | 62.1 | 42.4 | 7.6ms | M4 | M4 |
M5 | 44.6 | 66.3 | 49.0 | 40.6 | 63.5 | 43.5 | 12.4ms | M5 | M5 |
A3 | 42.1 | 64.1 | 46.2 | 38.8 | 61.1 | 41.6 | 8.3ms | A3 | A3 |
A4 | 43.5 | 65.4 | 47.6 | 39.8 | 62.4 | 42.9 | 14.0ms | A4 | A4 |
A5 | 44.4 | 66.3 | 48.9 | 40.3 | 63.3 | 43.4 | 25.3ms | A5 | A5 |
Semantic Segmentation
Model | mIoU | Latency | Ckpt | Log |
---|---|---|---|---|
RecNeXt-M3 | 41.0 | 5.6ms | M3 | M3 |
RecNeXt-M4 | 43.6 | 7.2ms | M4 | M4 |
RecNeXt-M5 | 46.0 | 12.4ms | M5 | M5 |
RecNeXt-A3 | 41.9 | 8.4ms | A3 | A3 |
RecNeXt-A4 | 43.0 | 14.0ms | A4 | A4 |
RecNeXt-A5 | 46.5 | 25.3ms | A5 | A5 |
Ablation Study
Overall Experiments
Ablation Logs
```
logs/ablation
├── 224
│   ├── recnext_m1_120e_224x224_3x3_7464.txt
│   ├── recnext_m1_120e_224x224_7x7_7552.txt
│   ├── recnext_m1_120e_224x224_bxb_7541.txt
│   ├── recnext_m1_120e_224x224_rec_3x3_7548.txt
│   ├── recnext_m1_120e_224x224_rec_5x5_7603.txt
│   ├── recnext_m1_120e_224x224_rec_7x7_7567.txt
│   ├── recnext_m1_120e_224x224_rec_7x7_nearest_7571.txt
│   ├── recnext_m1_120e_224x224_rec_7x7_nearest_ssm_7593.txt
│   └── recnext_m1_120e_224x224_rec_7x7_unpool_7548.txt
└── 384
    ├── recnext_m1_120e_384x384_3x3_7635.txt
    ├── recnext_m1_120e_384x384_7x7_7742.txt
    ├── recnext_m1_120e_384x384_bxb_7800.txt
    ├── recnext_m1_120e_384x384_rec_3x3_7772.txt
    ├── recnext_m1_120e_384x384_rec_5x5_7811.txt
    ├── recnext_m1_120e_384x384_rec_7x7_7803.txt
    ├── recnext_m1_120e_384x384_rec_convtrans_3x3_basic_7726.txt
    ├── recnext_m1_120e_384x384_rec_convtrans_5x5_basic_7787.txt
    ├── recnext_m1_120e_384x384_rec_convtrans_7x7_basic_7824.txt
    ├── recnext_m1_120e_384x384_rec_convtrans_7x7_group_7791.txt
    └── recnext_m1_120e_384x384_rec_convtrans_7x7_split_7683.txt
```
RecConv Recurrent Aggregation
```python
import torch.nn as nn


class RecConv2d(nn.Module):
    def __init__(self, in_channels, kernel_size=5, bias=False, level=1, mode='nearest'):
        super().__init__()
        self.level = level
        self.mode = mode
        kwargs = {
            'in_channels': in_channels,
            'out_channels': in_channels,
            'groups': in_channels,
            'kernel_size': kernel_size,
            'padding': kernel_size // 2,
            'bias': bias,
        }
        self.n = nn.Conv2d(stride=2, **kwargs)
        self.a = nn.Conv2d(**kwargs) if level > 1 else None
        self.b = nn.Conv2d(**kwargs)
        self.c = nn.Conv2d(**kwargs)
        self.d = nn.Conv2d(**kwargs)

    def forward(self, x):
        # 1. Generate multi-scale features.
        fs = [x]
        for _ in range(self.level):
            fs.append(self.n(fs[-1]))
        # 2. Multi-scale recurrent aggregation, from coarsest to finest.
        h = None
        for i, o in reversed(list(zip(fs[1:], fs[:-1]))):
            h = self.a(h) + self.b(i) if h is not None else self.b(i)
            h = nn.functional.interpolate(h, size=o.shape[2:], mode=self.mode)
        return self.c(h) + self.d(x)
```
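A quick shape check of the module above (our illustration; the channel count and resolution are arbitrary, since upsampling uses explicit target sizes):

```python
import torch

x = torch.randn(1, 64, 56, 56)
conv = RecConv2d(64, kernel_size=5, level=3)
assert conv(x).shape == x.shape  # ERF grows to 5 * 2**3 = 40 while the shape is preserved
```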
RecConv Variants


RecConv Variant Details
- RecConv using group convolutions
```python
# RecConv Variant A
# recursive decomposition on both spatial and channel dimensions
# downsample and upsample through group convolutions
import torch.nn as nn


class RecConv2d(nn.Module):
    def __init__(self, in_channels, kernel_size=5, bias=False, level=2):
        super().__init__()
        self.level = level
        kwargs = {'kernel_size': kernel_size, 'padding': kernel_size // 2, 'bias': bias}

        downs = []
        for l in range(level):
            i_channels = in_channels // (2 ** l)
            o_channels = in_channels // (2 ** (l + 1))
            downs.append(nn.Conv2d(in_channels=i_channels, out_channels=o_channels, groups=o_channels, stride=2, **kwargs))
        self.downs = nn.ModuleList(downs)

        convs = []
        for l in range(level + 1):
            channels = in_channels // (2 ** l)
            convs.append(nn.Conv2d(in_channels=channels, out_channels=channels, groups=channels, **kwargs))
        self.convs = nn.ModuleList(reversed(convs))

        # this is the simplest modification, and only supports resolutions like 256, 384, etc.
        kwargs['kernel_size'] = kernel_size + 1
        ups = []
        for l in range(level):
            i_channels = in_channels // (2 ** (l + 1))
            o_channels = in_channels // (2 ** l)
            ups.append(nn.ConvTranspose2d(in_channels=i_channels, out_channels=o_channels, groups=i_channels, stride=2, **kwargs))
        self.ups = nn.ModuleList(reversed(ups))

    def forward(self, x):
        i = x
        features = []
        for down in self.downs:
            x, s = down(x), x.shape[2:]  # s: pre-downsample spatial size
            features.append((x, s))

        x = 0
        for conv, up, (f, s) in zip(self.convs, self.ups, reversed(features)):
            x = up(conv(f + x))
        return self.convs[self.level](i + x)
```
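A quick shape check for this variant (our illustration; both the channel count and the resolution must be divisible by 2**level):

```python
import torch

x = torch.randn(1, 64, 32, 32)
conv = RecConv2d(64, kernel_size=5, level=2)  # the group-convolution variant above
assert conv(x).shape == x.shape
```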
- RecConv using channel-wise concatenation
```python
# RecConv Variant B
# recursive decomposition on both spatial and channel dimensions
# downsample using channel-wise split, followed by depthwise convolution with a stride of 2
# upsample through channel-wise concatenation
import torch
import torch.nn as nn


class RecConv2d(nn.Module):
    def __init__(self, in_channels, kernel_size=5, bias=False, level=2):
        super().__init__()
        self.level = level
        kwargs = {'kernel_size': kernel_size, 'padding': kernel_size // 2, 'bias': bias}

        downs = []
        for l in range(level):
            channels = in_channels // (2 ** (l + 1))
            downs.append(nn.Conv2d(in_channels=channels, out_channels=channels, groups=channels, stride=2, **kwargs))
        self.downs = nn.ModuleList(downs)

        convs = []
        for l in range(level + 1):
            channels = in_channels // (2 ** l)
            convs.append(nn.Conv2d(in_channels=channels, out_channels=channels, groups=channels, **kwargs))
        self.convs = nn.ModuleList(reversed(convs))

        # this is the simplest modification, and only supports resolutions like 256, 384, etc.
        kwargs['kernel_size'] = kernel_size + 1
        ups = []
        for l in range(level):
            channels = in_channels // (2 ** (l + 1))
            ups.append(nn.ConvTranspose2d(in_channels=channels, out_channels=channels, groups=channels, stride=2, **kwargs))
        self.ups = nn.ModuleList(reversed(ups))

    def forward(self, x):
        features = []
        for down in self.downs:
            r, x = torch.chunk(x, 2, dim=1)  # keep half the channels as a residual branch
            x, s = down(x), x.shape[2:]      # s: pre-downsample spatial size
            features.append((r, s))

        for conv, up, (r, s) in zip(self.convs, self.ups, reversed(features)):
            x = torch.cat([r, up(conv(x))], dim=1)
        return self.convs[self.level](x)
```
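The same shape check holds for the concatenation variant (again illustrative; half of the channels are split off at each level, so in_channels must be divisible by 2**level):

```python
import torch

x = torch.randn(1, 64, 32, 32)
conv = RecConv2d(64, kernel_size=5, level=2)  # the channel-concatenation variant above
assert conv(x).shape == x.shape
```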
RecConv Beyond
We apply RecConv to the MLLA small variants, replacing linear attention and downsampling layers. This results in higher throughput and lower training memory usage.
Ablation Logs
```
mlla/logs
├── 1_mlla_nano
│   ├── 01_baseline.txt
│   ├── 02_recconv_5x5_conv_trans.txt
│   ├── 03_recconv_5x5_nearest_interp.txt
│   ├── 04_recattn_nearest_interp.txt
│   └── 05_recattn_nearest_interp_simplify.txt
└── 2_mlla_mini
    ├── 01_baseline.txt
    ├── 02_recconv_5x5_conv_trans.txt
    ├── 03_recconv_5x5_nearest_interp.txt
    ├── 04_recattn_nearest_interp.txt
    └── 05_recattn_nearest_interp_simplify.txt
```
Limitations
- RecNeXt exhibits the lowest throughput among models of comparable parameter size due to extensive use of bilinear interpolation, which can be mitigated by employing transposed convolution.
- The recursive decomposition may introduce numerical instability during mixed-precision training, which can be alleviated by using fixed-point or BFloat16 arithmetic (see the sketch after this list).
- Compatibility issues with bilinear interpolation and transposed convolution on certain iOS versions may also result in performance degradation.
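For instance, a minimal sketch of running the forward pass under BFloat16 autocast (`RecConv2d` refers to the module defined in the ablation section above; shapes are illustrative):

```python
import torch

model = RecConv2d(64).cuda()
x = torch.randn(8, 64, 56, 56, device='cuda')

# BFloat16 keeps the FP32 exponent range, avoiding the overflow/underflow
# that FP16 can hit in the recursive aggregation
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    y = model(x)
```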
Acknowledgement
The classification (ImageNet) codebase is partly built upon LeViT, PoolFormer, EfficientFormer, RepViT, LSNet, MLLA, and MogaNet.
The detection and segmentation pipeline is from MMCV (MMDetection and MMSegmentation).
Thanks for the great implementations!
Citation
```bibtex
@misc{zhao2024recnext,
    title={RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations},
    author={Mingshu Zhao and Yi Luo and Yong Ouyang},
    year={2024},
    eprint={2412.19628},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```