metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:157
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
What is the most important factor in determining the quality of a trained
model according to the context?
sentences:
- |-
The GPT-4 barrier was comprehensively broken
Some of those GPT-4 models run on my laptop
LLM prices crashed, thanks to competition and increased efficiency
Multimodal vision is common, audio and video are starting to emerge
Voice and live camera mode are science fiction come to life
Prompt driven app generation is a commodity already
Universal access to the best models lasted for just a few short months
“Agents” still haven’t really happened yet
Evals really matter
Apple Intelligence is bad, Apple’s MLX library is excellent
The rise of inference-scaling “reasoning” models
Was the best currently available LLM trained in China for less than $6m?
The environmental impact got better
The environmental impact got much, much worse
- >-
Intuitively, one would expect that systems this powerful would take
millions of lines of complex code. Instead, it turns out a few hundred
lines of Python is genuinely enough to train a basic version!
What matters most is the training data. You need a lot of data to make
these things work, and the quantity and quality of the training data
appears to be the most important factor in how good the resulting model
is.
If you can gather the right data, and afford to pay for the GPUs to
train it, you can build an LLM.
- >-
DeepSeek v3 is a huge 685B parameter model—one of the largest openly
licensed models currently available, significantly bigger than the
largest of Meta’s Llama series, Llama 3.1 405B.
Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka
the Chatbot Arena) currently rank it 7th, just behind the Gemini 2.0 and
OpenAI 4o/o1 models. This is by far the highest ranking openly licensed
model.
The really impressive thing about DeepSeek v3 is the training cost. The
model was trained on 2,788,000 H800 GPU hours at an estimated cost of
$5,576,000. Llama 3.1 405B trained 30,840,000 GPU hours—11x that used by
DeepSeek v3, for a model that benchmarks slightly worse.
- source_sentence: Which company released the QwQ model under an Apache 2.0 license?
sentences:
- >-
There’s now a fascinating ecosystem of people training their own models
on top of these foundations, publishing those models, building
fine-tuning datasets and sharing those too.
The Hugging Face Open LLM Leaderboard is one place that tracks these. I
can’t even attempt to count them, and any count would be out-of-date
within a few hours.
The best overall openly licensed LLM at any time is rarely a foundation
model: instead, it’s whichever fine-tuned community model has most
recently discovered the best combination of fine-tuning data.
This is a huge advantage for open over closed models: the closed, hosted
models don’t have thousands of researchers and hobbyists around the
world collaborating and competing to improve them.
- >-
OpenAI are not the only game in town here. Google released their first
entrant in the category, gemini-2.0-flash-thinking-exp, on December
19th.
Alibaba’s Qwen team released their QwQ model on November 28th—under an
Apache 2.0 license, and that one I could run on my own machine. They
followed that up with a vision reasoning model called QvQ on December
24th, which I also ran locally.
DeepSeek made their DeepSeek-R1-Lite-Preview model available to try out
through their chat interface on November 20th.
To understand more about inference scaling I recommend Is AI progress
slowing down? by Arvind Narayanan and Sayash Kapoor.
- >-
The most recent twist, again from December (December was a lot) is live
video. ChatGPT voice mode now provides the option to share your camera
feed with the model and talk about what you can see in real time. Google
Gemini have a preview of the same feature, which they managed to ship
the day before ChatGPT did.
- source_sentence: When was GPT-4 officially released by OpenAI?
sentences:
- >-
Then in February, Meta released Llama. And a few weeks later in March,
Georgi Gerganov released code that got it working on a MacBook.
I wrote about how Large language models are having their Stable
Diffusion moment, and with hindsight that was a very good call!
This unleashed a whirlwind of innovation, which was accelerated further
in July when Meta released Llama 2—an improved version which, crucially,
included permission for commercial use.
Today there are literally thousands of LLMs that can be run locally, on
all manner of different devices.
- >-
We don’t yet know how to build GPT-4
Frustratingly, despite the enormous leaps ahead we’ve had this year, we
are yet to see an alternative model that’s better than GPT-4.
OpenAI released GPT-4 in March, though it later turned out we had a
sneak peek of it in February when Microsoft used it as part of the new
Bing.
This may well change in the next few weeks: Google’s Gemini Ultra has
big claims, but isn’t yet available for us to try out.
The team behind Mistral are working to beat GPT-4 as well, and their
track record is already extremely strong considering their first public
model only came out in September, and they’ve released two significant
improvements since then.
- >-
Nothing yet from Anthropic or Meta but I would be very surprised if they
don’t have their own inference-scaling models in the works. Meta
published a relevant paper Training Large Language Models to Reason in a
Continuous Latent Space in December.
Was the best currently available LLM trained in China for less than $6m?
Not quite, but almost! It does make for a great attention-grabbing
headline.
The big news to end the year was the release of DeepSeek v3—dropped on
Hugging Face on Christmas Day without so much as a README file, then
followed by documentation and a paper the day after that.
- source_sentence: >-
What are some ways mentioned to run local, private large language models
(LLMs) on personal devices?
sentences:
- >-
So training an LLM still isn’t something a hobbyist can afford, but it’s
no longer the sole domain of the super-rich. I like to compare the
difficulty of training an LLM to that of building a suspension
bridge—not trivial, but hundreds of countries around the world have
figured out how to do it. (Correction: Wikipedia’s Suspension bridges by
country category lists 44 countries).
You can run LLMs on your own devices
In January of this year, I thought it would be years before I could run
a useful LLM on my own computer. GPT-3 and 3.5 were pretty much the only
games in town, and I thought that even if the model weights were
available it would take a $10,000+ server to run them.
- >-
DeepSeek v3 is a huge 685B parameter model—one of the largest openly
licensed models currently available, significantly bigger than the
largest of Meta’s Llama series, Llama 3.1 405B.
Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka
the Chatbot Arena) currently rank it 7th, just behind the Gemini 2.0 and
OpenAI 4o/o1 models. This is by far the highest ranking openly licensed
model.
The really impressive thing about DeepSeek v3 is the training cost. The
model was trained on 2,788,000 H800 GPU hours at an estimated cost of
$5,576,000. Llama 3.1 405B trained 30,840,000 GPU hours—11x that used by
DeepSeek v3, for a model that benchmarks slightly worse.
- >-
I run a bunch of them on my laptop. I run Mistral 7B (a surprisingly
great model) on my iPhone. You can install several different apps to get
your own, local, completely private LLM. My own LLM project provides a
CLI tool for running an array of different models via plugins.
You can even run them entirely in your browser using WebAssembly and the
latest Chrome!
Hobbyists can build their own fine-tuned models
I said earlier that building an LLM was still out of reach of hobbyists.
That may be true for training from scratch, but fine-tuning one of those
models is another matter entirely.
- source_sentence: >-
How can LLMs like Claude create full interactive applications using web
technologies in a single prompt?
sentences:
- >-
We already knew LLMs were spookily good at writing code. If you prompt
them right, it turns out they can build you a full interactive
application using HTML, CSS and JavaScript (and tools like React if you
wire up some extra supporting build mechanisms)—often in a single
prompt.
Anthropic kicked this idea into high gear when they released Claude
Artifacts, a groundbreaking new feature that was initially slightly lost
in the noise due to being described half way through their announcement
of the incredible Claude 3.5 Sonnet.
With Artifacts, Claude can write you an on-demand interactive
application and then let you use it directly inside the Claude
interface.
Here’s my Extract URLs app, entirely generated by Claude:
- >-
Language Models are gullible. They “believe” what we tell them—what’s in
their training data, then what’s in the fine-tuning data, then what’s in
the prompt.
In order to be useful tools for us, we need them to believe what we feed
them!
But it turns out a lot of the things we want to build need them not to
be gullible.
Everyone wants an AI personal assistant. If you hired a real-world
personal assistant who believed everything that anyone told them, you
would quickly find that their ability to positively impact your life was
severely limited.
- >-
An interesting point of comparison here could be the way railways rolled
out around the world in the 1800s. Constructing these required enormous
investments and had a massive environmental impact, and many of the
lines that were built turned out to be unnecessary—sometimes multiple
lines from different companies serving the exact same routes!
The resulting bubbles contributed to several financial crashes, see
Wikipedia for Panic of 1873, Panic of 1893, Panic of 1901 and the UK’s
Railway Mania. They left us with a lot of useful infrastructure and a
great deal of bankruptcies and environmental damage.
The year of slop
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.875
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.875
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.875
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9538662191964322
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9375
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9375
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("dwb2023/legal-ft-bd105ada-eeb1-440e-8bf8-91c5de53b1a7")
# Run inference
sentences = [
'How can LLMs like Claude create full interactive applications using web technologies in a single prompt?',
'We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms)—often in a single prompt.\nAnthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet.\nWith Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.\nHere’s my Extract URLs app, entirely generated by Claude:',
'An interesting point of comparison here could be the way railways rolled out around the world in the 1800s. Constructing these required enormous investments and had a massive environmental impact, and many of the lines that were built turned out to be unnecessary—sometimes multiple lines from different companies serving the exact same routes!\nThe resulting bubbles contributed to several financial crashes, see Wikipedia for Panic of 1873, Panic of 1893, Panic of 1901 and the UK’s Railway Mania. They left us with a lot of useful infrastructure and a great deal of bankruptcies and environmental damage.\nThe year of slop',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
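Since the model is tuned for question-to-passage retrieval, here is a small illustrative semantic-search sketch. The corpus and query below are made up for demonstration and are not part of the training data; it relies on the standard util.semantic_search helper from Sentence Transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("dwb2023/legal-ft-bd105ada-eeb1-440e-8bf8-91c5de53b1a7")

# Hypothetical corpus of passages to search over
corpus = [
    "DeepSeek v3 was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000.",
    "With Artifacts, Claude can write you an on-demand interactive application.",
]
query = "What did it cost to train DeepSeek v3?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus passages by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # list of {'corpus_id': ..., 'score': ...}, best match first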
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.875 |
cosine_accuracy@3 | 1.0 |
cosine_accuracy@5 | 1.0 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.875 |
cosine_precision@3 | 0.3333 |
cosine_precision@5 | 0.2 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.875 |
cosine_recall@3 | 1.0 |
cosine_recall@5 | 1.0 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.9539 |
cosine_mrr@10 | 0.9375 |
cosine_map@100 | 0.9375 |
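The table above was produced by the InformationRetrievalEvaluator. Below is a hedged sketch of how such an evaluation is usually wired up; the queries, corpus, and relevance judgments are placeholders, not the actual evaluation split used for these numbers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dwb2023/legal-ft-bd105ada-eeb1-440e-8bf8-91c5de53b1a7")

# Placeholder evaluation data: query id -> text, doc id -> text,
# and query id -> set of relevant doc ids.
queries = {"q1": "When was GPT-4 officially released by OpenAI?"}
corpus = {
    "d1": "OpenAI released GPT-4 in March, though it later turned out we had a sneak peek of it in February.",
    "d2": "The most recent twist, again from December, is live video.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="example")
results = evaluator(model)
print(results)  # includes cosine_accuracy@k, cosine_ndcg@10, and related metrics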
Training Details
Training Dataset
Unnamed Dataset
- Size: 157 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 157 samples:
 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 2 tokens, mean: 20.9 tokens, max: 37 tokens | min: 43 tokens, mean: 135.72 tokens, max: 214 tokens |
- Samples:

sentence_0 | sentence_1 |
---|---|
Why are language models described as gullible in the given context? | Language Models are gullible. They “believe” what we tell them—what’s in their training data, then what’s in the fine-tuning data, then what’s in the prompt.<br>In order to be useful tools for us, we need them to believe what we feed them!<br>But it turns out a lot of the things we want to build need them not to be gullible.<br>Everyone wants an AI personal assistant. If you hired a real-world personal assistant who believed everything that anyone told them, you would quickly find that their ability to positively impact your life was severely limited. |
What is the challenge in building AI personal assistants based on the gullibility of language models? | Language Models are gullible. They “believe” what we tell them—what’s in their training data, then what’s in the fine-tuning data, then what’s in the prompt.<br>In order to be useful tools for us, we need them to believe what we feed them!<br>But it turns out a lot of the things we want to build need them not to be gullible.<br>Everyone wants an AI personal assistant. If you hired a real-world personal assistant who believed everything that anyone told them, you would quickly find that their ability to positively impact your life was severely limited. |
What challenges does the author face when trying to evaluate multiple LLMs? | I find I have to work with an LLM for a few weeks in order to get a good intuition for it’s strengths and weaknesses. This greatly limits how many I can evaluate myself!<br>The most frustrating thing for me is at the level of individual prompting.<br>Sometimes I’ll tweak a prompt and capitalize some of the words in it, to emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did capitalizing those words make a difference? I still don’t have a good methodology for figuring that out.<br>We’re left with what’s effectively Vibes Based Development. It’s vibes all the way down.<br>I’d love to see us move beyond vibes in 2024!<br>LLMs are really smart, and also really, really dumb |
- Loss: MatryoshkaLoss with these parameters:

{ "loss": "MultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
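For reference, a loss configured with these parameters would typically be built roughly as follows in Sentence Transformers (a sketch, not the exact training script used here):
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# MultipleNegativesRankingLoss treats the other in-batch pairs as negatives;
# MatryoshkaLoss applies it again at each truncated embedding size.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)
Because the loss is applied at truncated sizes, embeddings from this model can plausibly be shortened at inference time (for example by loading the model with truncate_dim=256) with only a modest drop in retrieval quality.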
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin
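As a rough sketch of how these non-default values map onto the Sentence Transformers trainer API (the dataset, output path, and passage texts below are placeholders; the real run used the 157-pair dataset described above):
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Placeholder (question, passage) pairs standing in for the real training split
train_dataset = Dataset.from_dict({
    "sentence_0": ["When was GPT-4 officially released by OpenAI?"],
    "sentence_1": ["OpenAI released GPT-4 in March ..."],
})

# Same loss configuration as shown in the dataset section above
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="legal-ft-output",        # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    eval_strategy="steps",
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder; a held-out split would normally go here
    loss=loss,
)
trainer.train()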
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 16 | 0.9484 |
2.0 | 32 | 0.9692 |
3.0 | 48 | 0.9638 |
3.125 | 50 | 0.9539 |
4.0 | 64 | 0.9539 |
5.0 | 80 | 0.9539 |
6.0 | 96 | 0.9539 |
6.25 | 100 | 0.9539 |
7.0 | 112 | 0.9539 |
8.0 | 128 | 0.9539 |
9.0 | 144 | 0.9539 |
9.375 | 150 | 0.9539 |
10.0 | 160 | 0.9539 |
Framework Versions
- Python: 3.11.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.6.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}