metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
What significant change occurred in the AI landscape regarding models
surpassing GPT-4 in the past twelve months?
sentences:
- >-
Except... you can run generated code to see if it’s correct. And with
patterns like ChatGPT Code Interpreter the LLM can execute the code
itself, process the error message, then rewrite it and keep trying until
it works!
So hallucination is a much lesser problem for code generation than for
anything else. If only we had the equivalent of Code Interpreter for
fact-checking natural language!
How should we feel about this as software engineers?
On the one hand, this feels like a threat: who needs a programmer if
ChatGPT can write code for you?
- >-
The GPT-4 barrier was comprehensively broken
In my December 2023 review I wrote about how We don’t yet know how to
build GPT-4—OpenAI’s best model was almost a year old at that point, yet
no other AI lab had produced anything better. What did OpenAI know that
the rest of us didn’t?
I’m relieved that this has changed completely in the past twelve months.
18 organizations now have models on the Chatbot Arena Leaderboard that
rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the
board)—70 models in total.
- >-
If you think about what they do, this isn’t such a big surprise. The
grammar rules of programming languages like Python and JavaScript are
massively less complicated than the grammar of Chinese, Spanish or
English.
It’s still astonishing to me how effective they are though.
One of the great weaknesses of LLMs is their tendency to hallucinate—to
imagine things that don’t correspond to reality. You would expect this
to be a particularly bad problem for code—if an LLM hallucinates a
method that doesn’t exist, the code should be useless.
- source_sentence: >-
How does Claude enable users to interact with applications created through
its interface?
sentences:
- >-
OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was
freely available from its launch in June. This was a momentus change,
because for the previous year free users had mostly been restricted to
GPT-3.5 level models, meaning new users got a very inaccurate mental
model of what a capable LLM could actually do.
That era appears to have ended, likely permanently, with OpenAI’s launch
of ChatGPT Pro. This $200/month subscription service is the only way to
access their most capable model, o1 Pro.
Since the trick behind the o1 series (and the future models it will
undoubtedly inspire) is to expend more compute time to get better
results, I don’t think those days of free access to the best available
models are likely to return.
- >-
We already knew LLMs were spookily good at writing code. If you prompt
them right, it turns out they can build you a full interactive
application using HTML, CSS and JavaScript (and tools like React if you
wire up some extra supporting build mechanisms)—often in a single
prompt.
Anthropic kicked this idea into high gear when they released Claude
Artifacts, a groundbreaking new feature that was initially slightly lost
in the noise due to being described half way through their announcement
of the incredible Claude 3.5 Sonnet.
With Artifacts, Claude can write you an on-demand interactive
application and then let you use it directly inside the Claude
interface.
Here’s my Extract URLs app, entirely generated by Claude:
- >-
Industry’s Tardy Response to the AI Prompt Injection Vulnerability on
RedMonk Conversations
Posted 31st December 2023 at 11:59 pm · Follow me on Mastodon or Twitter
or subscribe to my newsletter
More recent articles
LLM 0.22, the annotated release notes - 17th February 2025
Run LLMs on macOS using llm-mlx and Apple's MLX framework - 15th
February 2025
URL-addressable Pyodide Python environments - 13th February 2025
This is Stuff we figured out about AI in 2023 by Simon Willison, posted
on 31st December 2023.
Part of series LLMs annual review
Stuff we figured out about AI in 2023 - Dec. 31, 2023, 11:59 p.m.
Things we learned about LLMs in 2024 - Dec. 31, 2024, 6:07 p.m.
blogging
69
- source_sentence: >-
What incident involving Google Search is mentioned in the context, and
what was the nature of the misinformation?
sentences:
- >-
My personal laptop is a 64GB M2 MacBook Pro from 2023. It’s a powerful
machine, but it’s also nearly two years old now—and crucially it’s the
same laptop I’ve been using ever since I first ran an LLM on my computer
back in March 2023 (see Large language models are having their Stable
Diffusion moment).
That same laptop that could just about run a GPT-3-class model in March
last year has now run multiple GPT-4 class models! Some of my notes on
that:
- >-
Terminology aside, I remain skeptical as to their utility based, once
again, on the challenge of gullibility. LLMs believe anything you tell
them. Any systems that attempts to make meaningful decisions on your
behalf will run into the same roadblock: how good is a travel agent, or
a digital assistant, or even a research tool if it can’t distinguish
truth from fiction?
Just the other day Google Search was caught serving up an entirely fake
description of the non-existant movie “Encanto 2”. It turned out to be
summarizing an imagined movie listing from a fan fiction wiki.
- >-
On the other hand, as software engineers we are better placed to take
advantage of this than anyone else. We’ve all been given weird coding
interns—we can use our deep knowledge to prompt them to solve coding
problems more effectively than anyone else can.
The ethics of this space remain diabolically complex
In September last year Andy Baio and I produced the first major story on
the unlicensed training data behind Stable Diffusion.
Since then, almost every major LLM (and most of the image generation
models) have also been trained on unlicensed data.
- source_sentence: >-
What are the limitations of Apple's LLM features compared to frontier
LLMs, according to the context?
sentences:
- >-
DeepSeek v3 is a huge 685B parameter model—one of the largest openly
licensed models currently available, significantly bigger than the
largest of Meta’s Llama series, Llama 3.1 405B.
Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka
the Chatbot Arena) currently rank it 7th, just behind the Gemini 2.0 and
OpenAI 4o/o1 models. This is by far the highest ranking openly licensed
model.
The really impressive thing about DeepSeek v3 is the training cost. The
model was trained on 2,788,000 H800 GPU hours at an estimated cost of
$5,576,000. Llama 3.1 405B trained 30,840,000 GPU hours—11x that used by
DeepSeek v3, for a model that benchmarks slightly worse.
- >-
An interesting point of comparison here could be the way railways rolled
out around the world in the 1800s. Constructing these required enormous
investments and had a massive environmental impact, and many of the
lines that were built turned out to be unnecessary—sometimes multiple
lines from different companies serving the exact same routes!
The resulting bubbles contributed to several financial crashes, see
Wikipedia for Panic of 1873, Panic of 1893, Panic of 1901 and the UK’s
Railway Mania. They left us with a lot of useful infrastructure and a
great deal of bankruptcies and environmental damage.
The year of slop
- >-
Now that those features are rolling out they’re pretty weak. As an LLM
power-user I know what these models are capable of, and Apple’s LLM
features offer a pale imitation of what a frontier LLM can do. Instead
we’re getting notification summaries that misrepresent news headlines
and writing assistant tools that I’ve not found useful at all. Genmoji
are kind of fun though.
The rise of inference-scaling “reasoning” models
The most interesting development in the final quarter of 2024 was the
introduction of a new shape of LLM, exemplified by OpenAI’s o1
models—initially released as o1-preview and o1-mini on September 12th.
- source_sentence: What new feature was introduced in ChatGPT's voice mode in December?
sentences:
- >-
The most recent twist, again from December (December was a lot) is live
video. ChatGPT voice mode now provides the option to share your camera
feed with the model and talk about what you can see in real time. Google
Gemini have a preview of the same feature, which they managed to ship
the day before ChatGPT did.
- >-
The two main categories I see are people who think AI agents are
obviously things that go and act on your behalf—the travel agent
model—and people who think in terms of LLMs that have been given access
to tools which they can run in a loop as part of solving a problem. The
term “autonomy” is often thrown into the mix too, again without
including a clear definition.
(I also collected 211 definitions on Twitter a few months ago—here they
are in Datasette Lite—and had gemini-exp-1206 attempt to summarize
them.)
Whatever the term may mean, agents still have that feeling of
perpetually “coming soon”.
- >-
The GPT-4 barrier was comprehensively broken
In my December 2023 review I wrote about how We don’t yet know how to
build GPT-4—OpenAI’s best model was almost a year old at that point, yet
no other AI lab had produced anything better. What did OpenAI know that
the rest of us didn’t?
I’m relieved that this has changed completely in the past twelve months.
18 organizations now have models on the Chatbot Arena Leaderboard that
rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the
board)—70 models in total.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.9166666666666666
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.9166666666666666
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.9166666666666666
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9692441461309548
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9583333333333334
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9583333333333334
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
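The last two modules above are simple to describe: the pooling layer keeps only the first ([CLS]) token vector, and the normalization layer rescales it to unit length. A minimal NumPy sketch of those two steps follows. The random array is a stand-in for the transformer output, and the helper name `cls_pool_and_normalize` is hypothetical, not part of the library.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Keep the first ([CLS]) token vector of each sequence and L2-normalize it.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    # Mirrors Pooling with pooling_mode_cls_token=True
    cls_vectors = token_embeddings[:, 0, :]
    # Mirrors Normalize(): divide each vector by its L2 norm
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)

# Random stand-in for the transformer output: 2 sequences, 5 tokens, 1024 dims.
rng = np.random.default_rng(0)
fake_token_embeddings = rng.normal(size=(2, 5, 1024))

sentence_embeddings = cls_pool_and_normalize(fake_token_embeddings)
print(sentence_embeddings.shape)
print(np.linalg.norm(sentence_embeddings, axis=1))
```

Because every output vector has unit length, the cosine similarity between two embeddings reduces to a plain dot product.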
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("dabraldeepti25/legal-ft-v0-updated")
# Run inference
sentences = [
"What new feature was introduced in ChatGPT's voice mode in December?",
'The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did.',
'The GPT-4 barrier was comprehensively broken\nIn my December 2023 review I wrote about how We don’t yet know how to build GPT-4—OpenAI’s best model was almost a year old at that point, yet no other AI lab had produced anything better. What did OpenAI know that the rest of us didn’t?\nI’m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
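Since the model was trained with MatryoshkaLoss at 768, 512, 256, 128 and 64 dimensions, the leading dimensions of each embedding should remain usable on their own. The sketch below shows the truncate-then-renormalize step with NumPy, using random vectors in place of real model output so that it runs without downloading anything. The helper name `truncate_and_normalize` is hypothetical.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and rescale to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random stand-in for model.encode(sentences): 3 embeddings of 1024 dims.
rng = np.random.default_rng(0)
full_embeddings = rng.normal(size=(3, 1024))

small_embeddings = truncate_and_normalize(full_embeddings, 256)
# Cosine similarity of unit-length vectors is a matrix product.
small_similarities = small_embeddings @ small_embeddings.T
print(small_embeddings.shape)
print(small_similarities.shape)
```

With the real model, passing `truncate_dim=256` when constructing `SentenceTransformer` should achieve the same truncation. How much retrieval quality is lost at each size is not reported in this card.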
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.9167 |
cosine_accuracy@3 | 1.0 |
cosine_accuracy@5 | 1.0 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.9167 |
cosine_precision@3 | 0.3333 |
cosine_precision@5 | 0.2 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.9167 |
cosine_recall@3 | 1.0 |
cosine_recall@5 | 1.0 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.9692 |
cosine_mrr@10 | 0.9583 |
cosine_map@100 | 0.9583 |
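The values in this table are easier to read with a worked example. The sketch below computes the same metrics from the rank of the single relevant passage per query. It assumes 12 evaluation queries, 11 of which rank the relevant passage first and one of which ranks it second. The card does not state the evaluation set size, so this is only one pattern that happens to reproduce the reported numbers.

```python
import math

def retrieval_metrics(ranks, k):
    """Metrics at cutoff k, given the 1-based rank of the single relevant document per query."""
    n = len(ranks)
    hits = sum(1 for r in ranks if r <= k)
    return {
        "accuracy": hits / n,         # fraction of queries with a hit in the top k
        "precision": hits / (n * k),  # at most 1 of the k retrieved items is relevant
        "recall": hits / n,           # equals accuracy when each query has one relevant item
        "mrr": sum(1.0 / r for r in ranks if r <= k) / n,
        # The ideal DCG is 1 with a single relevant item, so NDCG equals DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n,
    }

# Assumed ranking pattern: 11 queries answered at rank 1, one at rank 2.
ranks = [1] * 11 + [2]
for k in (1, 3, 5, 10):
    print(k, retrieval_metrics(ranks, k))
```

Under this assumption, precision@k falls as 1/k once every query has its hit, which is why precision@3, @5 and @10 read 0.3333, 0.2 and 0.1 even though recall is already 1.0.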
Training Details
Training Dataset
Unnamed Dataset
- Size: 156 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 156 samples:
 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 13 tokens, mean: 20.17 tokens, max: 34 tokens | min: 43 tokens, mean: 135.18 tokens, max: 214 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What is the significance of prompt engineering in DALL-E 3? | Now add a walrus: Prompt engineering in DALL-E 3<br>32.8k<br>41.2k<br>Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive<br>32.5k<br>38.2k<br>ChatGPT can’t access the internet, even though it really looks like it can<br>30.5k<br>34.2k<br>Stanford Alpaca, and the acceleration of on-device large language model development<br>29.7k<br>35.7k<br>Run Llama 2 on your own Mac using LLM and Homebrew<br>27.9k<br>33.6k<br>Midjourney 5.1<br>26.7k<br>33.4k<br>Think of language models like ChatGPT as a “calculator for words”<br>25k<br>31.8k<br>Multi-modal prompt injection image attacks against GPT-4V<br>23.7k<br>27.4k |
How does the vicuna-7b Large Language Model operate within a browser? | Now add a walrus: Prompt engineering in DALL-E 3<br>32.8k<br>41.2k<br>Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive<br>32.5k<br>38.2k<br>ChatGPT can’t access the internet, even though it really looks like it can<br>30.5k<br>34.2k<br>Stanford Alpaca, and the acceleration of on-device large language model development<br>29.7k<br>35.7k<br>Run Llama 2 on your own Mac using LLM and Homebrew<br>27.9k<br>33.6k<br>Midjourney 5.1<br>26.7k<br>33.4k<br>Think of language models like ChatGPT as a “calculator for words”<br>25k<br>31.8k<br>Multi-modal prompt injection image attacks against GPT-4V<br>23.7k<br>27.4k |
What model of MacBook Pro is being used in the context, and what is its storage capacity? | My personal laptop is a 64GB M2 MacBook Pro from 2023. It’s a powerful machine, but it’s also nearly two years old now—and crucially it’s the same laptop I’ve been using ever since I first ran an LLM on my computer back in March 2023 (see Large language models are having their Stable Diffusion moment).<br>That same laptop that could just about run a GPT-3-class model in March last year has now run multiple GPT-4 class models! Some of my notes on that: |
- Loss: MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
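To make the loss configuration above more concrete, here is a rough NumPy sketch of the idea: an in-batch ranking loss (each question should score its own passage above the other passages in the batch) is computed on embeddings truncated to each Matryoshka dimension, and the weighted results are summed. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the embeddings are random stand-ins, and the similarity scale of 20 is assumed rather than taken from this card.

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch ranking loss: cross-entropy over scaled cosine similarities.

    Row i of `anchors` should match row i of `positives`; all other rows act as negatives.
    The scale value is an assumption for this sketch.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # correct pairs sit on the diagonal

def matryoshka_loss(anchors, positives, dims, weights) -> float:
    """Weighted sum of the ranking loss over embeddings truncated to each dimension."""
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d]) for d, w in zip(dims, weights))

# Random stand-ins for a batch of 10 question/passage embedding pairs.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(10, 1024))
positives = anchors + 0.1 * rng.normal(size=(10, 1024))

dims = [768, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1]
matched_loss = matryoshka_loss(anchors, positives, dims, weights)
# Shifting the passages by one row breaks every pairing, so the loss should rise.
mismatched_loss = matryoshka_loss(anchors, np.roll(positives, 1, axis=0), dims, weights)
print(matched_loss, mismatched_loss)
```

Training against every truncation at once is what pushes the most useful information into the leading dimensions of the embedding.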
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 16 | 0.9692 |
2.0 | 32 | 0.9692 |
3.0 | 48 | 1.0 |
3.125 | 50 | 1.0 |
4.0 | 64 | 1.0 |
5.0 | 80 | 0.9692 |
6.0 | 96 | 0.9692 |
6.25 | 100 | 0.9692 |
7.0 | 112 | 0.9692 |
8.0 | 128 | 0.9692 |
9.0 | 144 | 0.9692 |
9.375 | 150 | 0.9692 |
10.0 | 160 | 0.9692 |
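The step counts and fractional epochs in this table follow from the dataset size and batch size reported earlier. The arithmetic below assumes training on a single device. The card lists gradient_accumulation_steps as 1 and dataloader_drop_last as False, so the final partial batch of each epoch still counts as a step.

```python
import math

num_samples = 156   # training samples reported in this card
batch_size = 10     # per_device_train_batch_size
num_epochs = 10     # num_train_epochs

# 15 full batches plus one partial batch of 6 samples.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)

# Fractional epochs for the rows that do not fall on an epoch boundary.
for step in (50, 100, 150):
    print(step, step / steps_per_epoch)
```

The rows at steps 50, 100 and 150 appear to come from a step-based evaluation interval layered on top of the per-epoch evaluations, though the interval itself is not listed among the hyperparameters.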
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.1
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}