---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
language: multilingual
datasets:
- owsm_v3.1
license: cc-by-4.0
---

## OWLS: Open Whisper-style Large-scale neural model Suite

OWLS is a suite of Whisper-style models designed to help researchers understand the scaling properties of speech models.
OWLS models range from 0.25B to 18B parameters and are trained on up to 360K hours of data.

OWLS models are developed using [ESPnet](https://github.com/espnet/espnet) and support multilingual speech recognition and translation.

OWLS is part of the [OWSM](https://www.wavlab.org/activities/2024/owsm/) project, which aims to develop fully open speech foundation models using publicly available data and open-source toolkits.

The model in this repo has 4.66B parameters in total and is trained on 180K hours of public speech data.
It supports the following speech-to-text tasks:
- Speech recognition
- Any-to-any-language speech translation
- Utterance-level alignment
- Long-form transcription
- Language identification
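
OWSM-family models typically select the task with special language and task symbols such as `<eng>`, `<asr>`, or `<st_deu>`. The sketch below builds such symbols for recognition versus translation. It assumes this checkpoint follows the same OWSM v3 convention with three-letter ISO 639-3 language codes; `owsm_style_symbols` is a hypothetical helper, not part of ESPnet.

```python
def owsm_style_symbols(src_lang, tgt_lang=None):
    """Build language/task symbols in the OWSM v3 style.

    Assumption: the checkpoint uses `<xxx>` for the spoken language,
    `<asr>` for recognition, and `<st_xxx>` for translation into `xxx`.
    This helper is illustrative and not part of ESPnet.
    """
    for code in (src_lang, tgt_lang):
        if code is None:
            continue
        # ISO 639-3 codes are three lowercase ASCII letters, e.g. "eng"
        if not (len(code) == 3 and code.isascii() and code.isalpha() and code.islower()):
            raise ValueError(f"expected a 3-letter language code, got {code!r}")

    lang_sym = f"<{src_lang}>"
    if tgt_lang is None or tgt_lang == src_lang:
        task_sym = "<asr>"            # transcribe in the spoken language
    else:
        task_sym = f"<st_{tgt_lang}>"  # translate into the target language
    return {"lang_sym": lang_sym, "task_sym": task_sym}


if __name__ == "__main__":
    print(owsm_style_symbols("eng"))
    print(owsm_style_symbols("eng", "deu"))
```

In published OWSM examples, symbols like these are passed as the `lang_sym` and `task_sym` keyword arguments when constructing `Speech2Text`; check the ESPnet version you have installed before relying on that.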

## Use this model

You can use this model in your projects with the following code:

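```python
# Requires ESPnet and soundfile: pip install espnet soundfile
import soundfile
from espnet2.bin.s2t_inference import Speech2Text

# Download and load the pretrained checkpoint from the Hugging Face Hub
model = Speech2Text.from_pretrained(
    "espnet/owls_4B_180K"
)

# Read the waveform; `rate` is the sample rate of the file
speech, rate = soundfile.read("speech.wav")
# The model returns a list of hypotheses; take the text of the first one
text, *_ = model(speech)[0]
```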
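
The snippet above reads `rate` but never uses it. A small sanity check before inference can catch mismatched audio early. This sketch assumes the model expects 16 kHz mono input, as OWSM recipes do; that value is not stated in this card, and `check_audio` is a hypothetical helper.

```python
EXPECTED_RATE = 16000  # assumption: 16 kHz input, as in OWSM recipes


def check_audio(shape, rate, expected_rate=EXPECTED_RATE):
    """Return a list of problems found; an empty list means the audio looks usable.

    `shape` is the array shape returned by soundfile.read (e.g. speech.shape),
    and `rate` is the sample rate it returned.
    """
    problems = []
    if len(shape) != 1:
        # soundfile returns (frames, channels) for multi-channel files
        problems.append(f"expected mono audio, got shape {tuple(shape)}")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, got {rate} Hz")
    return problems


if __name__ == "__main__":
    print(check_audio((32000,), 16000))    # mono, 16 kHz
    print(check_audio((88200, 2), 44100))  # stereo, 44.1 kHz
```

With the variables from the snippet above, this would be called as `check_audio(speech.shape, rate)`; if it reports problems, downmix or resample the audio before passing it to the model.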


## Citations

TBA