A code-based version next, please. Also, why aren't this one and the similarly sized multilingual model available through Ollama pulls?
The Ollama website only lists the 30m English and 278m multilingual models.
Hi @TimeLordRaps! At the moment, we have only released the smallest and largest models on Ollama to simplify the user story. If there's interest in getting the two middle-sized versions released there as well, we're certainly open to it! For the time being, you can also convert these checkpoints to GGUF and run them with Ollama locally:
1. Convert using `convert_hf_to_gguf.py` from `llama.cpp`
   - For the english models (`roberta` architecture), you will need to use the version from this PR as we work to get it merged
   - Make sure you have installed the right python requirements from `requirements/requirements-convert_hf_to_gguf.txt`
   ```
   convert_hf_to_gguf.py /path/to/granite-embedding-125m-english --outfile /path/to/granite-embedding-125m-english/granite-embedding-125m-english.gguf
   ```
2. Create a `Modelfile` pointing at the converted GGUF file
   ```
   FROM /path/to/granite-embedding-125m-english.gguf
   ```
3. Import it directly into Ollama
   ```
   ollama create granite-embedding-local:125m -f Modelfile
   ```
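Once imported, one way to sanity-check the model is to request an embedding through Ollama's local REST API. Here is a minimal sketch in Python, assuming Ollama is serving on its default port (11434) and the tag from the `ollama create` command above; the sample prompt is just illustrative:

```python
import requests

# Minimal sketch: request an embedding from the locally imported model.
# Assumes Ollama is running on the default port and the model tag matches
# the `ollama create` command above.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "granite-embedding-local:125m",
        "prompt": "def binary_search(arr, target): ...",
    },
)
resp.raise_for_status()
embedding = resp.json()["embedding"]
print(len(embedding))  # embedding dimensionality
```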
Hi @TimeLordRaps, this model does well on code too. It is better than most other similarly sized models on the CoIR benchmark.
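If you want to try the code-retrieval angle directly from the Hugging Face checkpoint, here is a minimal sketch using `sentence-transformers`; the query and the toy code corpus are illustrative only:

```python
from sentence_transformers import SentenceTransformer, util

# Minimal code-retrieval sketch; the snippets below are illustrative.
model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

code_corpus = [
    "def quicksort(arr): ...",
    "class LRUCache: ...",
    "def http_get(url): ...",
]
query = "function that sorts a list"

# Embed the corpus and the query, then rank snippets by cosine similarity.
corpus_emb = model.encode(code_corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]

best = scores.argmax().item()
print(code_corpus[best], scores[best].item())
```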
Did you investigate different pretraining token quantities to gauge the effect of training-data scale on embedding model performance? Based on the training data, even slight variations in pretraining scale could be captured by saving parallel checkpoints, reserving the most general and difficult dataset for the last step of pretraining. That would effectively create a narrow range of pretrained models in pretraining token scale, relative to the overall scale. From my perspective, this narrow range would be a simple, easily executable first step toward understanding embedding scaling, which seems to be lacking, or to have gone unnoticed more broadly.
The above assumes training in dataset stages, rather than dataset mixing...
My whole original point about making a code-based version rested on the expectation that this model is already quite good at code. I don't think there is currently a widely recognized code embedding model, and I was offering that perception subtly, while also, so to speak, giving more insight into how I would go about this post-training...
Based on their comparative performance across benchmarks, it seems these models were in fact trained on a lot of code and are already what I would personally consider code-targeted embedding models. Since that seems to be the case, and you value the user story, I think the story here is that these are code embedding models... My apologies for not having done deep enough research initially. Can't wait to give these a try in mindcraft.
After using this model for some time, I have observed not only the top results being better, but all results being better than with previous comparably sized models. This embedding model seems capable of something like internal reasoning. It might be interesting to, say, tokenize the embeddings; need I say more?
Though I am using the multilingual 278m, which is still comparable in size to other embeddings I have tried, this one is special.