The Llama-SmolTalk-3.2-1B-Instruct model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.

Key Features:

  1. Instruction-Tuned Performance: Optimized to understand and execute user-provided instructions across diverse domains.
  2. Lightweight Architecture: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
  3. Versatile Use Cases: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

Intended Applications:

  • Conversational AI: Engage users with dynamic and contextually aware dialogue.
  • Content Generation: Produce summaries, explanations, or other creative text outputs efficiently.
  • Instruction Execution: Follow user commands to generate precise and relevant responses.

Technical Details:

The model leverages the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing.

Dataset description

This is a synthetic dataset designed for supervised fine-tuning (SFT) of LLMs. It was used to build the SmolLM2-Instruct family of models and contains 1M samples.

During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets that improve instruction following while covering diverse tasks including text editing, rewriting, summarization, and reasoning. Through a series of data ablations at 1.7B scale, we enhanced our SFT mix by incorporating public datasets to strengthen specific capabilities such as mathematics, coding, system prompt following and long-context understanding.

All the new datasets were generated with distilabel and you can find the generation code here https://github.com/huggingface/smollm/tree/main/distilabel_pipelines.

Dataset composition

The mix consists of:

New datasets

  • Smol-Magpie-Ultra: the core component of our mix, consisting of 400K samples generated using the Magpie pipeline with Llama-3.1-405B-Instruct. We also heavily curate and filter this dataset compared to the original Magpie-Pro pipeline. SmolLM models trained on this dataset alone outperform those trained on popular public datasets like OpenHermes and Magpie Pro across key benchmarks including IFEval and MT-Bench.
  • Smol-constraints: a 36K-sample dataset that trains models to follow specific constraints, such as generating responses with a fixed number of sentences or words, or incorporating specified words in the output. The dataset has been decontaminated against IFEval to prevent overlap.
  • Smol-rewrite: a 50K-sample collection focused on text rewriting tasks, such as adjusting tone to be more friendly or professional. Note that Smol-Magpie-Ultra also includes some rewriting, editing, and summarization examples.
  • Smol-summarize: a 100K-sample dataset specialized in email and news summarization.

Existing public datasets

To enhance capabilities in mathematics, coding, system prompts, and long-context understanding, we fine-tuned SmolLM2-1.7B on various public SFT datasets and included subsets of the best performing ones using tuned ratios. These include:

  • OpenHermes2.5: we added 100k samples from OpenHermes2.5, since we found that it helps preserve and boost performance on benchmarks such as MMLU, WinoGrande, and BBH.
  • MetaMathQA: we add 50k random samples from this dataset to improve the model on mathematics and reasoning.
  • NuminaMath-CoT: we find that this dataset helps on mathematics, especially hard problems found in benchmarks such as MATH.
  • Self-Oss-Starcoder2-Instruct: we use this dataset to improve coding capabilities.
  • SystemChats2.0: to make the model support a variety of system prompt formats, we add 30k samples from the SystemChat-2.0 dataset. Note that the Smol-rewrite and Smol-summarize datasets also include system prompts.
  • LongAlign: we find that fine-tuning the model on only short samples makes it lose long-context abilities beyond 2048 tokens, so we add English samples (with fewer than 16k tokens) from the LongAlign-10k dataset and train with an 8192-token sequence length.
  • Everyday-conversations: this dataset includes multi-turn everyday conversations such as greetings and was used in SmolLM v1 post-training.
  • APIGen-Function-Calling: we use 80k samples from apigen-function-calling which is a mix of Synth-APIGen-v0.1 and xlam-function-calling-60k datasets.
  • Explore-Instruct-Rewriting: 30k samples from this rewriting dataset.

You can find the code for generating the new datasets with distilabel here: https://github.com/huggingface/smollm. The ablation details will be included in an upcoming blog post.

License

All the new datasets (Smol-Magpie-Ultra, Smol-constraints, Smol-rewrite, Smol-summarize) are licensed under Apache 2.0. For the existing public datasets, please refer to the original dataset for the license.


Prompt format

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
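
For reference, this exact layout can be produced programmatically with the tokenizer's chat template. Below is a minimal sketch, assuming the original model weights have been downloaded locally into a folder called Llama-SmolTalk-3.2-1B-Instruct (as in the conversion step further down):

# Minimal sketch: render the prompt format above with the tokenizer chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Llama-SmolTalk-3.2-1B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize OpenVINO in one sentence."},
]
# tokenize=False returns the formatted string; add_generation_prompt=True appends
# the assistant header so the model knows it should start its reply
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)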

This is the OpenVINO IR format of the model, quantized to int8.

The model was created with the Optimum-Intel library CLI command.

Dependencies required to create the model

There is an open clash in dependency versions between optimum-intel and openvino-genai.

⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion, you only need to install the following:

pip install  -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu

The instructions are from the amazing OpenVINO notebooks; a vanilla pip install will create clashes among dependency versions.
This command will install, among others:

tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
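
To double-check which versions were actually resolved in your environment, you can print them with plain importlib.metadata introspection (the names below are the pip distribution names listed above):

# Optional sanity check: print the installed versions of the key packages
from importlib.metadata import version, PackageNotFoundError

for pkg in ("openvino", "openvino-genai", "openvino-tokenizers",
            "optimum", "optimum-intel", "transformers", "tokenizers", "nncf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")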

How to quantize the original model

After the previous step you can run the following command (assuming you downloaded all the model weights and files from the official model repository into a subfolder called Llama-SmolTalk-3.2-1B-Instruct):

optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct

This will start the conversion process and complete without any fatal error.
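
As an alternative to the CLI, the same export with int8 weight compression can also be done from Python with optimum-intel. This is only a minimal sketch under the same folder names as above, not the command actually used to produce this repository:

# Minimal sketch (assumption): export + int8 weight compression via the optimum-intel Python API
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

src = "Llama-SmolTalk-3.2-1B-Instruct"      # original weights downloaded locally
dst = "ov_Llama-SmolTalk-3.2-1B-Instruct"   # output folder with the OpenVINO IR

model = OVModelForCausalLM.from_pretrained(src, export=True, load_in_8bit=True)  # load_in_8bit needs nncf
model.save_pretrained(dst)
AutoTokenizer.from_pretrained(src).save_pretrained(dst)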

Dependencies required to run the model with openvino-genai

If you simply need to run models already converted to the OpenVINO IR format, you only need to install openvino-genai:

pip install openvino-genai==2024.5.0

How to use the model with openvino-genai

The code follows the official tutorial at https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html, with changes because here we are using chat templates; refer to https://huggingface.co/docs/transformers/main/chat_templating.
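
Before the full chat script, here is a minimal smoke test with the plain string API (no chat template, just a raw prompt), assuming the converted folder ov_Llama-SmolTalk-3.2-1B-Instruct produced in the step above:

# Quick smoke test: load the OpenVINO IR model and generate from a raw prompt string
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))

The full script below adds chat templating, streaming, and simple throughput statistics.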

# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from transformers import AutoTokenizer #for chat templating
import openvino_genai as ov_genai
import tiktoken
import sys

def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base") #context_count = len(encoding.encode(yourtext))
    numoftokens = len(encoding.encode(text))
    return numoftokens

# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use tokenizer chat templating
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅  done')
print('Ready for generation')

print('Starting normal chat-based interface with NO TURNS - chat history disabled...')
counter = 1
while True:
    # Reset history always: chat history is disabled, each prompt starts fresh
    history = []
    userinput = ""
    print("\033[1;30m")  #dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  #red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line  # readlines() already keeps trailing newlines
    if not lines or "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE setting eos_token_id = 128009 (the Llama 3 <|eot_id|> token)
    start = datetime.datetime.now() 
    print("\033[92;1m")
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat, temperature=0.2, 
                        do_sample=True, 
                        max_new_tokens=500, 
                        repetition_penalty=1.178,
                        streamer=streamer,
                        eos_token_id = 128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens/totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f}  t/s')
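
To try it, run the script from the folder that contains ov_Llama-SmolTalk-3.2-1B-Instruct, type your prompt, and close the input with Ctrl+D on Unix or Ctrl+Z on Windows; type quit! to leave the chat.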