arxiv:2401.02709

German Text Embedding Clustering Benchmark

Published on Jan 5

Authors:

Silvan Wehrli

Abstract

This work introduces a benchmark assessing the performance of clustering German text embeddings in different domains. This benchmark is driven by the increasing use of clustering neural text embeddings in tasks that require the grouping of texts (such as topic modeling) and the need for German resources in existing benchmarks. We provide an initial analysis for a range of pre-trained mono- and multilingual models evaluated on the outcome of different clustering algorithms. Results include strong-performing mono- and multilingual models. Reducing the dimensions of embeddings can further improve clustering. Additionally, we conduct experiments with continued pre-training for German BERT models to estimate the benefits of this additional training. Our experiments suggest that significant performance improvements are possible for short text. All code and datasets are publicly available.
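As a rough illustration of the kind of evaluation the abstract describes (cluster the embeddings, then compare the clusters with known labels), the following is a minimal sketch and not the authors' code. The vectors are made-up stand-ins for real text embeddings, the scoring metric (V-measure) is an assumption about how such a benchmark might score a clustering, and all function names are hypothetical.

```python
import math
from collections import Counter

def farthest_first_centers(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on tuples of floats; returns one label per point."""
    centers = farthest_first_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) estimated from label co-occurrence counts."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((nab / n) * math.log(nab / b_counts[bv]) for (_, bv), nab in joint.items())

def v_measure(true, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(true), entropy(pred)
    hom = 1.0 if h_true == 0 else 1 - conditional_entropy(true, pred) / h_true
    com = 1.0 if h_pred == 0 else 1 - conditional_entropy(pred, true) / h_pred
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

# Toy "embeddings": three made-up topics, three texts each (assumed data).
embeddings = [
    (0.9, 0.1, 0.0), (1.0, 0.0, 0.1), (0.8, 0.2, 0.1),
    (0.0, 0.9, 0.1), (0.1, 1.0, 0.0), (0.2, 0.8, 0.1),
    (0.0, 0.1, 1.0), (0.1, 0.0, 0.9), (0.1, 0.2, 0.8),
]
gold_topics = [0, 0, 0, 1, 1, 1, 2, 2, 2]

predicted = kmeans(embeddings, k=3)
score = v_measure(gold_topics, predicted)
print(f"V-measure: {score:.3f}")
```

In a real setting the toy tuples would be replaced by vectors from an embedding model, optionally passed through a dimensionality reduction step before clustering, which the abstract reports can improve results.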

Community

davanstrien

Jan 8

Really happy to see more language-specific evaluations for embedding models!

slvnwhrl

Paper author Jan 8

Thank you! I am happy to contribute to the German NLP landscape and hope our work is somehow helpful to others :)
