metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
What advantage do open models have over closed, hosted models according to
the context?
sentences:
- >-
On the other hand, as software engineers we are better placed to take
advantage of this than anyone else. We’ve all been given weird coding
interns—we can use our deep knowledge to prompt them to solve coding
problems more effectively than anyone else can.
The ethics of this space remain diabolically complex
In September last year Andy Baio and I produced the first major story on
the unlicensed training data behind Stable Diffusion.
Since then, almost every major LLM (and most of the image generation
models) have also been trained on unlicensed data.
- >-
There’s now a fascinating ecosystem of people training their own models
on top of these foundations, publishing those models, building
fine-tuning datasets and sharing those too.
The Hugging Face Open LLM Leaderboard is one place that tracks these. I
can’t even attempt to count them, and any count would be out-of-date
within a few hours.
The best overall openly licensed LLM at any time is rarely a foundation
model: instead, it’s whichever fine-tuned community model has most
recently discovered the best combination of fine-tuning data.
This is a huge advantage for open over closed models: the closed, hosted
models don’t have thousands of researchers and hobbyists around the
world collaborating and competing to improve them.
- >-
Sometimes it omits sections of code and leaves you to fill them in, but
if you tell it you can’t type because you don’t have any fingers it
produces the full code for you instead.
There are so many more examples like this. Offer it cash tips for better
answers. Tell it your career depends on it. Give it positive
reinforcement. It’s all so dumb, but it works!
Gullibility is the biggest unsolved problem
I coined the term prompt injection in September last year.
15 months later, I regret to say that we’re still no closer to a robust,
dependable solution to this problem.
I’ve written a ton about this already.
Beyond that specific class of security vulnerabilities, I’ve started
seeing this as a wider problem of gullibility.
- source_sentence: >-
How is the user applying a similar interactive app concept in their
Datasette project?
sentences:
- >-
Meta’s Llama 3.2 models deserve a special mention. They may not be GPT-4
class, but at 1B and 3B sizes they punch massively above their weight. I
run Llama 3.2 3B on my iPhone using the free MLC Chat iOS app and it’s a
shockingly capable model for its tiny (<2GB) size. Try firing it up and
asking it for “a plot outline of a Netflix Christmas movie where a data
journalist falls in love with a local ceramacist”. Here’s what I got, at
a respectable 20 tokens per second:
- >-
Then in December, the Chatbot Arena team introduced a whole new
leaderboard for this feature, driven by users building the same
interactive app twice with two different models and voting on the
answer. Hard to come up with a more convincing argument that this
feature is now a commodity that can be effectively implemented against
all of the leading models.
I’ve been tinkering with a version of this myself for my Datasette
project, with the goal of letting users use prompts to build and iterate
on custom widgets and data visualizations against their own data. I also
figured out a similar pattern for writing one-shot Python programs,
enabled by uv.
- >-
260 input tokens, 92 output tokens. Cost approximately 0.0024 cents
(that’s less than a 400th of a cent).
This increase in efficiency and reduction in price is my single
favourite trend from 2024. I want the utility of LLMs at a fraction of
the energy cost and it looks like that’s what we’re getting.
Multimodal vision is common, audio and video are starting to emerge
My butterfly example above illustrates another key trend from 2024: the
rise of multi-modal LLMs.
A year ago the single most notable example of these was GPT-4 Vision,
released at OpenAI’s DevDay in November 2023. Google’s multi-modal
Gemini 1.0 was announced on December 7th 2023 so it also (just) makes it
into the 2023 window.
- source_sentence: >-
What were the OpenAI pricing rates for GPT-4, GPT-4 Turbo, and GPT-35
Turbo in December 2023?
sentences:
- >-
I run a bunch of them on my laptop. I run Mistral 7B (a surprisingly
great model) on my iPhone. You can install several different apps to get
your own, local, completely private LLM. My own LLM project provides a
CLI tool for running an array of different models via plugins.
You can even run them entirely in your browser using WebAssembly and the
latest Chrome!
Hobbyists can build their own fine-tuned models
I said earlier that building an LLM was still out of reach of hobbyists.
That may be true for training from scratch, but fine-tuning one of those
models is another matter entirely.
- >-
Here’s the rest of the transcript. It’s bland and generic, but my phone
can pitch bland and generic Christmas movies to Netflix now!
LLM prices crashed, thanks to competition and increased efficiency
The past twelve months have seen a dramatic collapse in the cost of
running a prompt through the top tier hosted LLMs.
In December 2023 (here’s the Internet Archive for the OpenAI pricing
page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok
for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.
- >-
Prince Canuma’s excellent, fast moving mlx-vlm project brings vision
LLMs to Apple Silicon as well. I used that recently to run Qwen’s QvQ.
While MLX is a game changer, Apple’s own “Apple Intelligence” features
have mostly been a disappointment. I wrote about their initial
announcement in June, and I was optimistic that Apple had focused hard
on the subset of LLM applications that preserve user privacy and
minimize the chance of users getting mislead by confusing features.
- source_sentence: >-
According to the context, how many lines of Python code are generally
needed to train a basic version of a powerful system?
sentences:
- >-
The May 13th announcement of GPT-4o included a demo of a brand new voice
mode, where the true multi-modal GPT-4o (the o is for “omni”) model
could accept audio input and output incredibly realistic sounding speech
without needing separate TTS or STT models.
The demo also sounded conspicuously similar to Scarlett Johansson... and
after she complained the voice from the demo, Skye, never made it to a
production product.
The delay in releasing the new voice mode after the initial demo caused
quite a lot of confusion. I wrote about that in ChatGPT in “4o” mode is
not running the new features yet.
- >-
This remains astonishing to me. I thought a model with the capabilities
and output quality of GPT-4 needed a datacenter class server with one or
more $40,000+ GPUs.
These models take up enough of my 64GB of RAM that I don’t run them
often—they don’t leave much room for anything else.
The fact that they run at all is a testament to the incredible training
and inference performance gains that we’ve figured out over the past
year. It turns out there was a lot of low-hanging fruit to be harvested
in terms of model efficiency. I expect there’s still more to come.
- >-
Intuitively, one would expect that systems this powerful would take
millions of lines of complex code. Instead, it turns out a few hundred
lines of Python is genuinely enough to train a basic version!
What matters most is the training data. You need a lot of data to make
these things work, and the quantity and quality of the training data
appears to be the most important factor in how good the resulting model
is.
If you can gather the right data, and afford to pay for the GPUs to
train it, you can build an LLM.
- source_sentence: >-
What challenges does the author face when trying to evaluate multiple
LLMs?
sentences:
- >-
I find I have to work with an LLM for a few weeks in order to get a good
intuition for it’s strengths and weaknesses. This greatly limits how
many I can evaluate myself!
The most frustrating thing for me is at the level of individual
prompting.
Sometimes I’ll tweak a prompt and capitalize some of the words in it, to
emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did
capitalizing those words make a difference? I still don’t have a good
methodology for figuring that out.
We’re left with what’s effectively Vibes Based Development. It’s vibes
all the way down.
I’d love to see us move beyond vibes in 2024!
LLMs are really smart, and also really, really dumb
- >-
Getting back to models that beat GPT-4: Anthropic’s Claude 3 series
launched in March, and Claude 3 Opus quickly became my new favourite
daily-driver. They upped the ante even more in June with the launch of
Claude 3.5 Sonnet—a model that is still my favourite six months later
(though it got a significant upgrade on October 22, confusingly keeping
the same 3.5 version number. Anthropic fans have since taken to calling
it Claude 3.6).
- >-
I think this means that, as individual users, we don’t need to feel any
guilt at all for the energy consumed by the vast majority of our
prompts. The impact is likely neglible compared to driving a car down
the street or maybe even watching a video on YouTube.
Likewise, training. DeepSeek v3 training for less than $6m is a
fantastic sign that training costs can and should continue to drop.
For less efficient models I find it useful to compare their energy usage
to commercial flights. The largest Llama 3 model cost about the same as
a single digit number of fully loaded passenger flights from New York to
London. That’s certainly not nothing, but once trained that model can be
used by millions of people at no extra training cost.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.9583333333333334
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.9583333333333334
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.9583333333333334
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9846220730654774
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9791666666666666
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9791666666666666
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("s4um1l/legal-ft-84f7d2b4-c963-45b6-b749-04d2d76a110f")
# Run inference
sentences = [
'What challenges does the author face when trying to evaluate multiple LLMs?',
'I find I have to work with an LLM for a few weeks in order to get a good intuition for it’s strengths and weaknesses. This greatly limits how many I can evaluate myself!\nThe most frustrating thing for me is at the level of individual prompting.\nSometimes I’ll tweak a prompt and capitalize some of the words in it, to emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did capitalizing those words make a difference? I still don’t have a good methodology for figuring that out.\nWe’re left with what’s effectively Vibes Based Development. It’s vibes all the way down.\nI’d love to see us move beyond vibes in 2024!\nLLMs are really smart, and also really, really dumb',
'I think this means that, as individual users, we don’t need to feel any guilt at all for the energy consumed by the vast majority of our prompts. The impact is likely neglible compared to driving a car down the street or maybe even watching a video on YouTube.\nLikewise, training. DeepSeek v3 training for less than $6m is a fantastic sign that training costs can and should continue to drop.\nFor less efficient models I find it useful to compare their energy usage to commercial flights. The largest Llama 3 model cost about the same as a single digit number of fully loaded passenger flights from New York to London. That’s certainly not nothing, but once trained that model can be used by millions of people at no extra training cost.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.9583 |
cosine_accuracy@3 | 1.0 |
cosine_accuracy@5 | 1.0 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.9583 |
cosine_precision@3 | 0.3333 |
cosine_precision@5 | 0.2 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.9583 |
cosine_recall@3 | 1.0 |
cosine_recall@5 | 1.0 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.9846 |
cosine_mrr@10 | 0.9792 |
cosine_map@100 | 0.9792 |
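The table above comes from the InformationRetrievalEvaluator. The evaluation split itself is not published with this card, so the snippet below is only a sketch with hypothetical placeholder queries, corpus passages, and relevance judgments, showing how such numbers can be reproduced:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Hypothetical placeholders: replace with the real held-out queries, corpus
# passages, and query -> relevant passage ids used for this evaluation.
queries = {"q1": "What challenges does the author face when trying to evaluate multiple LLMs?"}
corpus = {"d1": "I find I have to work with an LLM for a few weeks in order to get a good intuition for its strengths and weaknesses."}
relevant_docs = {"q1": {"d1"}}

model = SentenceTransformer("s4um1l/legal-ft-84f7d2b4-c963-45b6-b749-04d2d76a110f")
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="eval")
results = evaluator(model)
print(results)
# dict with keys such as cosine_accuracy@1, cosine_ndcg@10, cosine_mrr@10, ...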
Training Details
Training Dataset
Unnamed Dataset
- Size: 156 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 156 samples:

 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 12 tokens, mean: 20.9 tokens, max: 33 tokens | min: 43 tokens, mean: 135.28 tokens, max: 214 tokens |
- Samples:

sentence_0 | sentence_1 |
---|---|
When did Meta release the original Llama model? | Then in February, Meta released Llama. And a few weeks later in March, Georgi Gerganov released code that got it working on a MacBook. I wrote about how Large language models are having their Stable Diffusion moment, and with hindsight that was a very good call! This unleashed a whirlwind of innovation, which was accelerated further in July when Meta released Llama 2—an improved version which, crucially, included permission for commercial use. Today there are literally thousands of LLMs that can be run locally, on all manner of different devices. |
What was significant about the release of Llama 2 in July? | Then in February, Meta released Llama. And a few weeks later in March, Georgi Gerganov released code that got it working on a MacBook. I wrote about how Large language models are having their Stable Diffusion moment, and with hindsight that was a very good call! This unleashed a whirlwind of innovation, which was accelerated further in July when Meta released Llama 2—an improved version which, crucially, included permission for commercial use. Today there are literally thousands of LLMs that can be run locally, on all manner of different devices. |
When did OpenAI make GPT-4o free for all users? | OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June. This was a momentus change, because for the previous year free users had mostly been restricted to GPT-3.5 level models, meaning new users got a very inaccurate mental model of what a capable LLM could actually do. That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT Pro. This $200/month subscription service is the only way to access their most capable model, o1 Pro. Since the trick behind the o1 series (and the future models it will undoubtedly inspire) is to expend more compute time to get better results, I don’t think those days of free access to the best available models are likely to return. |
- Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
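A minimal sketch of how this loss configuration maps onto the Sentence Transformers API, assuming fine-tuning starts from the base model named above:
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Wrap the ranking loss so it is applied at each truncated embedding size
# listed in matryoshka_dims, with equal weight per dimension.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,  # use every listed dimension at every training step
)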
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
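As a rough sketch, the non-default hyperparameters above correspond to a SentenceTransformerTrainingArguments object like the following; the output_dir value is a hypothetical placeholder, and everything not set explicitly keeps its default:
from sentence_transformers import SentenceTransformerTrainingArguments

# Only the non-default values listed above are set explicitly;
# "output" is a hypothetical output directory.
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    eval_strategy="steps",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=10,
    multi_dataset_batch_sampler="round_robin",
)
These arguments would then be passed to a SentenceTransformerTrainer together with the base model, the 156-pair training dataset, and the MatryoshkaLoss shown earlier.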
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 16 | 0.9554 |
2.0 | 32 | 0.9455 |
3.0 | 48 | 0.9484 |
3.125 | 50 | 0.9484 |
4.0 | 64 | 0.9692 |
5.0 | 80 | 0.9692 |
6.0 | 96 | 0.9692 |
6.25 | 100 | 0.9846 |
7.0 | 112 | 0.9846 |
8.0 | 128 | 0.9846 |
9.0 | 144 | 0.9846 |
9.375 | 150 | 0.9846 |
10.0 | 160 | 0.9846 |
Framework Versions
- Python: 3.11.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}