Real-Time Mic Transcription on free 2vCPU - using this model, check it out

#41
by WJ88 - opened

Hi,

https://youtube.com/shorts/YV-waEbg1S8

I am working on a Real-Time Mic version, and I have a working one ready to test:

https://huggingface.co/spaces/WJ88/NVIDIA-Parakeet-TDT-0.6B-v2-INT8-Real-Time-Mic-Transcription
*the whole point of this space is to fit the model into 2vCPUs :) and it works!

**THERE IS a 4-second delay, but the audio is transcribed in less than 2 seconds. It will be improved, but it is still real-time compared to other SPACES,
where you have to RECORD first and then wait for the transcription.** (Any FEEDBACK appreciated) //Wojciech
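
For readers wondering how a 4-second delay and a sub-2-second transcription time fit together, here is a minimal sketch of fixed-window buffering. It is not the Space's actual code: the function names, the 16 kHz sample rate, and the 0.1-second frame size in the usage note are assumptions for illustration only.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000    # assumed microphone sample rate in Hz
WINDOW_SECONDS = 4.0    # the delay mentioned in the post


def windows(frames: Iterable[List[float]],
            sample_rate: int = SAMPLE_RATE,
            window_seconds: float = WINDOW_SECONDS) -> Iterator[List[float]]:
    """Group small microphone frames into fixed-length windows.

    A window is only emitted once it is full, which is where the
    delay comes from. Leftover samples shorter than one window are
    kept in the buffer and never emitted in this sketch.
    """
    window_size = int(sample_rate * window_seconds)
    buffer: List[float] = []
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= window_size:
            yield buffer[:window_size]
            buffer = buffer[window_size:]


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean transcription keeps up with incoming audio."""
    return processing_seconds / audio_seconds
```

With 0.1-second frames, `windows(...)` yields one window per 40 frames, and `real_time_factor(2.0, 4.0)` gives 0.5, so a window is finished well before the next one is full.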
likeIRCbutAIlol.jpg

The UI may not be pretty, but overall you just click RECORD, speak, and watch the transcription. After you finish, please refresh the browser tab to free resources.
NOTE: the app is currently public, meaning that each user's transcriptions accumulate and other users can see them. I am working on isolation, but it is what it is, it works :)
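
One way the isolation mentioned in the note could look is a transcript store keyed by a per-visitor session id. This is only a sketch under assumptions: `TranscriptStore` and its methods are hypothetical names, not taken from the Space's code.

```python
import uuid
from typing import Dict, List


class TranscriptStore:
    """Keeps each visitor's transcript separate (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_session: Dict[str, List[str]] = {}

    def new_session(self) -> str:
        # A random id per browser tab, so users cannot guess each other's key.
        session_id = uuid.uuid4().hex
        self._by_session[session_id] = []
        return session_id

    def append(self, session_id: str, text: str) -> None:
        self._by_session[session_id].append(text)

    def transcript(self, session_id: str) -> str:
        # Unknown or closed sessions simply see an empty transcript.
        return " ".join(self._by_session.get(session_id, []))

    def close(self, session_id: str) -> None:
        # Free the memory when the user leaves or refreshes the tab.
        self._by_session.pop(session_id, None)
```

Each user then only ever reads `store.transcript(own_id)`, and calling `store.close(own_id)` frees the resources without asking for a manual refresh.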

You can use NVIDIA-Parakeet-TDT-0.6B-v2 without an NVIDIA card in REAL-TIME. I encourage you to check it out, check the code (it's interesting that the model fits into 2 vCPUs), and finally clone it and build your own version on top of it! I will stick to optimizations rather than fancy features in my repo.

"I love Pain"

Updated with preprocessing of audio from different platforms, so it should work for you now if it did not before. I also added a video on how to use the app (a short one, less than one minute). It is also a DEMO showing that the app works :)
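
Browsers and operating systems deliver microphone audio at different sample rates and channel counts, so some normalization step is usually needed before the model sees it. The sketch below shows one simple way to do that, assuming the target is 16 kHz mono; it is not the preprocessing used in the Space, the function names are made up for illustration, and linear interpolation is a stand-in for a proper resampler with anti-aliasing.

```python
from typing import List, Sequence


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame, e.g. [(left, right), ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float],
                    src_rate: int,
                    dst_rate: int = 16_000) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples per output sample
    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, one second of 48 kHz stereo input becomes 16,000 mono samples after `resample_linear(to_mono(frames), 48_000)`.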

I tried the integer version of this model to fill HTML forms. Anyone who wants to turn this into an okayish server can have a look at this repository:
https://github.com/deepanshu-yadav/voice-form-filler

Hi,
I am not sure if I understood correctly. My Space has INT8 in its name (I was running on int8 at first), but at the moment I have reverted to normal and am using other optimizations. I am letting you know so that, if you use my code directly, you do not expect int8 (I need to add this to the description so as not to misinform users...).

There was a problem making it int8 because of deepcopy. I will do this, but first I have to work on managing user sessions ;)
BR//
Wojciech

Hi,
Just to clarify for people who are interested in this thread: I did not convert this model to the integer version myself. Someone did that here: https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/nemo/parakeet-tdt-0.6b-v2
I just wrapped it in a server for my use case. Done here: https://github.com/deepanshu-yadav/voice-form-filler/blob/main/asr_server.py
You are right, we need more optimizations in order to use it in real time. More importantly, I am trying really hard to fine-tune it for other languages.

Cheers

UPDATE:

Clone it to your own private Space on HuggingFace and have your own transcriber.
Also, I encourage you to build your own app on top of it, it's just a beginning. I am going for an assistant that will use real-time transcription to help me, but for this I need to use a GPU, so there will probably be no more updates here.
It's a good base for projects. Check the link again, I have released the final version with user session separation:

https://huggingface.co/spaces/WJ88/NVIDIA-Parakeet-TDT-0.6B-v2-INT8-Real-Time-Mic-Transcription

done_clone_use_buildontopofit.png
