Did your team consider using an encodec + embedding model (e.g. Moshi)?
#27 · opened by RonanMcGovern
Thanks for releasing this model. I'm curious why you went with an encoder-based system rather than a tokenised approach (would that be too slow?).
Also, the two-part transformer (think + talk) is quite unusual. Did you try using a single unified transformer instead?