VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Abstract
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
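To make the reformulation in the abstract concrete, here is a minimal sketch of how a univariate series might be laid out as a 2D grid whose right-hand region is masked, so that forecasting becomes filling in the masked pixels. The abstract does not give these details: folding by a known `period`, the per-series standardization, and the function names `series_to_masked_image` and `image_to_forecast` are all assumptions made for illustration, and no MAE is called here.

```python
import numpy as np

def series_to_masked_image(context, horizon, period):
    """Fold a 1D series into a 2D grid and mark the forecast region as masked.

    Assumption for illustration: the series has a known period, and each
    column of the grid holds one full period.
    """
    context = np.asarray(context, dtype=float)
    n_ctx = len(context) // period          # whole periods available as context
    if n_ctx == 0:
        raise ValueError("context must cover at least one full period")
    context = context[-n_ctx * period:]     # keep the most recent whole periods
    n_fut = -(-horizon // period)           # ceil(horizon / period) masked columns

    # Standardize so the values behave like pixel intensities.
    mean, std = context.mean(), context.std() + 1e-8
    normed = (context - mean) / std

    visible = normed.reshape(n_ctx, period).T            # shape (period, n_ctx)
    image = np.concatenate([visible, np.zeros((period, n_fut))], axis=1)
    mask = np.zeros(image.shape, dtype=bool)
    mask[:, n_ctx:] = True                               # True = to be reconstructed
    return image, mask, (mean, std)

def image_to_forecast(reconstructed, first_masked_col, horizon, stats):
    """Read the masked columns back out as a 1D forecast in original units."""
    mean, std = stats
    future = reconstructed[:, first_masked_col:].T.reshape(-1)[:horizon]
    return future * std + mean
```

In a real pipeline the masked columns would be filled by the pre-trained visual MAE after the grid is rendered and resized to the model's input resolution; in this sketch any array of the same shape as `image` can stand in for the reconstruction when calling `image_to_forecast`.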
Community
VisionTS is a time series forecasting foundation model built from rich, high-quality natural images without any time-series training. It uses a visual MAE as its backbone and recasts time series forecasting as a patch-level image reconstruction problem. In zero-shot forecasting evaluations, VisionTS outperforms state-of-the-art TSF foundation models such as Moirai and TimesFM.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ViTime: A Visual Intelligence-Based Foundation Model for Time Series Forecasting (2024)
- Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation (2024)
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning (2024)
- CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging (2024)