lbourdois committed · Commit d10cc4e · verified · 1 Parent(s): 6bf5272

Improve language tag

Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve its discoverability. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13.
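As background on why the tag matters: the `language` entries in the YAML front matter feed the Hub's per-language filtering, so listing the supported languages makes the model easier to find. Below is a minimal sketch of such a query; it assumes a recent `huggingface_hub` release in which `HfApi.list_models` accepts `language` and `pipeline_tag` filters.

```python
from huggingface_hub import HfApi

api = HfApi()

# Find text-generation models tagged with one of the languages added here
# ("fra" is one of the 13 codes in this PR). The `language` and `pipeline_tag`
# arguments are assumed to be available in the installed huggingface_hub.
for model in api.list_models(language="fra", pipeline_tag="text-generation", limit=5):
    print(model.id)
```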

Files changed (1)
  1. README.md +125 -113
README.md CHANGED
@@ -1,114 +1,126 @@
  ---
  license: apache-2.0
  language:
- - en
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
  base_model:
  - Qwen/Qwen2.5-0.5B-Instruct
  pipeline_tag: text-generation
  library_name: transformers
  tags:
  - text-generation-inference
  - Reasoner
  - cot
  ---

# **FastThink-0.5B-Tiny**

FastThink-0.5B-Tiny is a reasoning-focused model based on Qwen2.5, a family of base and instruction-tuned language models spanning 0.5 billion to 72 billion parameters. Qwen2.5 introduces the following improvements over Qwen2:

- Significantly enhanced knowledge and greatly improved capabilities in coding and mathematics, thanks to specialized expert models in these domains.
- Major improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. It is more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens and the ability to generate outputs of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

**Architecture**: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings.
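
These architectural claims can be checked against the published configuration. The snippet below is a minimal sketch assuming the checkpoint ships a standard Qwen2-style config; the field names (`hidden_act`, `rope_theta`, `rms_norm_eps`, `tie_word_embeddings`) come from that assumption rather than from this card.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("prithivMLmods/FastThink-0.5B-Tiny")

# SwiGLU appears as the "silu" activation, RoPE via rope_theta, RMSNorm via
# its epsilon, and tied input/output embeddings via tie_word_embeddings.
for field in ("hidden_act", "rope_theta", "rms_norm_eps", "tie_word_embeddings", "max_position_embeddings"):
    print(field, "=", getattr(config, field, "n/a"))
```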

# **Quickstart with Transformers**

The following code snippet shows how to load the tokenizer and model and generate a response using `apply_chat_template`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/FastThink-0.5B-Tiny"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the chat prompt with the model's chat template
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

# **Dataset Preparation**

This script loads, processes, and combines multiple datasets into a single, standardized format suitable for training conversational AI models. It uses the `datasets` library to load and manipulate the datasets and the `chat_templates` utilities to standardize the conversation format.

## Example

```python
from datasets import load_dataset, concatenate_datasets

# Note: `standardize_sharegpt` and `get_chat_template` come from the
# chat-template utilities mentioned above, `tokenizer` is the tokenizer loaded
# earlier, and the mapping helpers (add_conversations_column,
# add_conversations_column_prompt_qwq, add_conversations_column_prompt_solution,
# formatting_prompts_func) are defined elsewhere in the training script.

# Load the initial three datasets
dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")

# Map conversation columns for all datasets
dataset1 = dataset1.map(add_conversations_column, batched=False)
dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)

# Combine all datasets
combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])

# Standardize using the ShareGPT format
combined_dataset = standardize_sharegpt(combined_dataset)

# Initialize the tokenizer with a specific chat template
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# Apply the formatting function to the combined dataset
combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)

# Inspect a slice of the combined dataset to verify the output
print(combined_dataset[:50000])
```
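
The mapping helpers used above are not shown in the card. As an illustration only, a sketch of what they might look like follows; the source column names (`prompt`, `response`) and the ShareGPT-style `from`/`value` keys are assumptions, and `tokenizer` is assumed to be the chat-templated tokenizer from the script above.

```python
# Hypothetical sketches of the helpers referenced in the script above.

def add_conversations_column(example):
    # Wrap a prompt/response pair in a ShareGPT-style "conversations" list;
    # the column names are assumed, not documented in this card.
    return {
        "conversations": [
            {"from": "human", "value": example.get("prompt", "")},
            {"from": "gpt", "value": example.get("response", "")},
        ]
    }

def formatting_prompts_func(batch):
    # Render each (already standardized) conversation with the tokenizer's
    # chat template into a single training string.
    texts = [
        tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=False)
        for conv in batch["conversations"]
    ]
    return {"text": texts}
```
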
# **Intended Use**
1. **Reasoning Tasks**: FastThink-0.5B-Tiny is optimized for reasoning-focused applications, such as logical problem-solving, decision-making, and analytical workflows.
2. **Instruction Following**: Ideal for scenarios where precise adherence to instructions is required, including generating structured outputs like JSON or tables (see the sketch after this list).
3. **Multilingual Support**: Suitable for use in multilingual environments, supporting over 29 languages, making it versatile for global applications.
4. **Coding and Mathematics**: Highly effective in tasks involving coding, debugging, or solving mathematical problems, leveraging expert domain knowledge.
5. **Role-play Scenarios**: Can simulate conversational agents or personas for role-playing, enhancing chatbot and virtual assistant implementations.
6. **Long-form Content Creation**: Designed to generate and manage long-form text (up to 8K tokens) while maintaining context, making it ideal for tasks like report writing or storytelling.
7. **Understanding and Processing Structured Data**: Efficient at interpreting and working with structured data, such as tables or hierarchical formats.
8. **Low-Resource Applications**: With a smaller parameter size (0.5B), it is well-suited for applications with limited computational resources or edge deployment.
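
To make the structured-output point concrete, here is a minimal sketch of prompting for JSON. It reuses the `model` and `tokenizer` from the quickstart above; the schema in the system prompt is illustrative only, and a model this small may still need output validation or retries.

```python
import json

# Ask for a strictly formatted JSON object (schema is illustrative).
messages = [
    {"role": "system", "content": "Reply only with a JSON object with keys 'answer' and 'reasoning'."},
    {"role": "user", "content": "Is 17 a prime number?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
reply = tokenizer.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

# Validate that the reply parses as JSON before using it downstream.
try:
    print(json.loads(reply))
except json.JSONDecodeError:
    print("Model did not return valid JSON:\n", reply)
```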

# **Limitations**
1. **Limited Model Size**: As a 0.5B-parameter model, its reasoning and comprehension capabilities are less advanced than those of larger models, particularly for highly complex tasks.
2. **Contextual Limitations**: Although it supports a context length of up to 128K tokens, its ability to effectively use such a long context may vary, particularly in tasks requiring intricate cross-referencing of earlier inputs.
3. **Accuracy in Domain-Specific Tasks**: While capable in coding and mathematics, it may struggle with highly specialized or esoteric domain knowledge compared to models fine-tuned specifically for those areas.
4. **Ambiguity Handling**: May misinterpret vague or poorly structured prompts, leading to less accurate or unintended results.
5. **Long-Context Tradeoffs**: Generating or processing very long outputs (e.g., close to the 8K-token limit) can result in decreased coherence or relevance toward the end.
6. **Multilingual Performance**: Although it supports 29 languages, proficiency and fluency vary across languages, and underrepresented languages may see reduced performance.
7. **Resource-Intensive for Long Contexts**: Using its long-context capability (128K tokens) can be computationally demanding, requiring significant memory and processing power.
8. **Dependence on Fine-Tuning**: For highly specialized tasks or domains, additional fine-tuning may be necessary to achieve optimal performance.