vxo hyperclovax committed
Commit 2c33877 · verified · 0 Parent(s):

Duplicate from naver-hyperclovax/HyperCLOVAX-SEED-Think-14B


Co-authored-by: HyperCLOVA X (admin) <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,122 @@
1
+ HyperCLOVA X SEED 14B Think Model License Agreement
2
+
3
+ Model Release Date: July 22, 2025
4
+
5
+ This HyperCLOVA X SEED 14B Think Model License Agreement (the “Agreement”) is a legal agreement between you and NAVER Corporation (“Naver Corp.”) and NAVER Cloud Corporation (“Naver Cloud Corp.”) (Naver Corp. and Naver Cloud Corp. are collectively referred to as “NAVER”) and governs your use of the Models that NAVER provides to You under this Agreement.
6
+
7
+ NAVER Corp., as the holder of the intellectual property of the Model, and its affiliate, NAVER Cloud Corp., as the exclusive business operator of HyperCLOVA X, enter into this Agreement with you. NAVER and you are each a “party” and collectively the “parties.”
8
+
9
+ By using, reproducing, modifying, distributing, performing or displaying any portion or element of the Model or Derivative Model, or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement. You represent to us that you are lawfully able to enter into contracts, and if you are entering into this Agreement for an entity, that you have legal authority to bind that entity.
10
+
11
+ 1. Definitions.
12
+
13
+ 1.1. "Affiliate” means any entity directly or indirectly controlling, controlled by or under common control with either party, where “control” means the possession, directly or indirectly, of the power to independently direct or cause the direction of the management and policies of an entity, whether through ownership of more than fifty percent (50%) of the stock or other equity interests entitled to vote for representation on its board of directors, or body performing similar functions, by contract or otherwise.
14
+
15
+ 1.2. “Derivative Model” means all (i) modifications to the Model, (ii) works based on the Model, or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of the Model, to that model in order to cause that model to perform similarly to the Model, including distillation methods that use intermediate data representations or methods based on the generation of synthetic data Outputs by the Model for training that Model. For clarity, Outputs are not deemed Derivative Model.
16
+
17
+ 1.3. “Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
18
+
19
+ 1.4. “Model” means the foundational large language models and software and algorithms, including machine-learning model code and trained model weights distributed by NAVER.
20
+
21
+
22
+ 1.5. “Output” means the information content output of the Model or a Derivative Model that results from operating or otherwise using the Model or Derivative Model.
23
+
24
+ 2. Conditions for Use, License Grant and Restrictions
25
+
26
+ 2.1. Conditions for Use. The Model and any Derivative Model are subject to the terms of this Agreement and govern your use. If You institute copyright or patent litigation against any entity (including a crossclaim or counterclaim in a lawsuit) alleging that the Model or Derivative Model constitutes direct or contributory copyright or patent infringement, then any license granted to you under this Agreement for that Model or Derivative Model will terminate as of the date such litigation is filed. NAVER may update this Agreement to comply with legal and regulatory requirements any time and You agree to either comply with any updated license or cease your copying, use, and distribution of the Model and any Derivative Model.
27
+
28
+ 2.2. License Grant. Subject to the terms and conditions of this Agreement, NAVER hereby grants to you a non-exclusive, worldwide, non-transferable, revocable and royalty-free limited license under NAVER’s intellectual property or other rights owned by NAVER embodied in the Model to access, download, install, copy, use, reproduce, distribute, create derivative works of, and make modifications to the Model.
29
+
30
+ 2.3. Prohibited Use Policy. NAVER is committed to ensuring safety, trust, and transparency in the development and use of AI technologies. Accordingly, your use of the Model and any Derivative Models is subject to the following conditions:
31
+ (i) You must ensure that any product or service you develop, use, offer as a service, or distribute complies with all applicable laws and regulations, and is operated appropriately for the relevant industry or use case.
32
+ (ii) You must comply with the Acceptable Use Policy applicable to the Model and any Derivative Models, which is attached hereto as Addendum A and incorporated by reference into this Agreement.
33
+ (iii) NAVER expressly prohibits the use of its products or services for any purpose in violation of applicable law and regulation, including but not limited to:
34
+ (a) illegal surveillance,
35
+ (b) illegal collection or processing of biometric information without the consent of the subject which is required under applicable law, or
36
+ (c) illegal harassment, abuse, threatening or bullying of individuals or groups of individuals or intentionally misleading or deceiving others.
37
+ (iv) You must take reasonable measures to address unintended bias and to mitigate harm to others, including underrepresented or vulnerable groups.
38
+
39
+
40
+ 3. Redistribution.
41
+
42
+ 3.1. You may reproduce, distribute or make available the Model or Derivative Models thereof, or a product or service (including another AI model) that contains any of them, if you meet all of the following conditions: you must (i) include the Prohibited Use Policy referenced in Section 2.3. as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of the Model or Derivative Model, and you must provide notice to subsequent users to whom you distribute the Model or Derivative Models that they are subject to the use restrictions in Section 2.3., (ii) provide all third party recipients of the Model or Derivative Models a copy of this Agreement, (iii) cause any modified files to carry prominent notices stating that you modified the files; (iv) include the following attribution notice within a “Notice” text file distributed as part of such copies: “HyperCLOVA X SEED 14B Think Model is licensed under the HyperCLOVA X SEED 14B Think Model License Agreement, Copyright © NAVER Corp. All Rights Reserved.”, and (v) prominently display “Powered by HyperCLOVA X” on a related website, user interface, blogpost, about page, or product documentation. If you use the Model or any Outputs of the Model to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “HyperCLOVA X” at the beginning of any such AI model name.
43
+ 3.2. You may add your own copyright statement to your modifications and, except as set forth in this Section, may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such Derivative Models as a whole, provided your use, reproduction, and distribution of the Model or Derivative Models otherwise comply with the terms and conditions stated in this Agreement. Any additional or different terms and conditions you impose must not conflict with the terms of this Agreement.
44
+
45
+ 4. Additional Commercial Terms. If (i) as of the Model Release Date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s Affiliates, is greater than 10 million monthly active users in the preceding calendar month, or (ii) the Licensee or its Affiliate distributes or makes available any product or service, which is substantially similar to or directly competes with any product and service provided by NAVER, then the Licensee must request a license from NAVER. Such a license may be granted by NAVER at its sole discretion, and the Licensee is not authorized to exercise any rights under this Agreement unless and until NAVER expressly grants you such rights.
46
+
47
+ 5. Generated Output. NAVER claims no rights in Outputs you generate using the Model. You and your use are solely responsible for Outputs and their subsequent uses.
48
+
49
+ 6. DISCLAIMER OF WARRANTY. UNLESS REQUIRED BY APPLICABLE LAW, THE MODEL AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND NAVER DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE MODEL, DERIVATIVE MODELS, OUTPUTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE MODEL AND ANY OUTPUTS AND RESULTS AND YOUR EXERCISE OF PERMISSION UNDER THIS AGREEMENT.
50
+
51
+ 7. LIMITATION OF LIABILITY. IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, UNLESS REQUIRED BY APPLICABLE LAW (SUCH AS IN CASES OF DELIBERATE AND GROSSLY NEGLIGENT ACTS), WILL NAVER BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY, OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND, ARISING FROM OR RELATED TO THIS AGREEMENT, OR RESULTING FROM THE USE OR INABILITY TO USE THE MODEL, DERIVATIVE MODELS OR, OUTPUTS (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF GOODWILL, WORK STOPPAGES, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF NAVER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
52
+
53
+ 8. Indemnity. You will indemnify and hold harmless NAVER from and against any claim by any third party arising out of or related to your use or distribution of the Model, Derivative Model or Outputs.
54
+
55
+ 9. Intellectual Property.
56
+
57
+ 9.1. This Agreement does not grant permission to use the trade names, trademarks, service marks, or product names of NAVER, except as required for reasonable and customary use in describing the origin of the Model and reproducing the content of the “Notice” text file.
58
+
59
+ 9.2. NAVER Corp. owns the Model and any Derivative Model created by NAVER Corp. Except as expressly granted in this Agreement, NAVER Corp. reserves all rights, interests and remedies in connection with the Model and Derivative Model created by NAVER Corp. and no other license or right is granted to you by implication, estoppel or otherwise. Subject to NAVER Corp.’s ownership of the Model and any Derivative Model made by or for NAVER Corp., with respect to any derivative works and modifications of the Model that are made by you, as between you and NAVER Corp., you are and will be the owner of such derivative works and modifications.
60
+
61
+ 10. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Model and will continue in full force and effect until terminated in accordance with the terms and conditions of this Agreement. NAVER may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Model and Derivative Model. Section 5, 6, 7 and 10 shall survive the termination of this Agreement.
62
+
63
+ 11. Governing Law and Jurisdiction.
64
+
65
+ 11.1. This Agreement will be governed by and construed in accordance with the laws of the Republic of Korea, without regard to its conflicts of laws principles.
66
+
67
+ 11.2. Any disputes, controversies, or claims arising out of or relating to this Agreement, including its existence, validity, interpretation, performance, breach, or termination, shall be referred to and finally resolved by arbitration administered by the Korean Commercial Arbitration Board (KCAB) in accordance with the International Arbitration Rules of the Korean Commercial Arbitration Board in force at the time of the commencement of the arbitration. The seat of arbitration shall be Seoul, Republic of Korea. The tribunal shall consist of one arbitrator. The language of the arbitration shall be English. Either party may seek interim or provisional relief from a court of competent jurisdiction and doing so shall not be considered a waiver of any provision in this section. The arbitral tribunal also has the authority to issue orders for interim or provisional relief.
68
+
69
+ 12. Modifications. NAVER reserves the right to modify or amend this Agreement at any time, in its sole discretion. Any modifications will be effective upon posting the updated Agreement on our website or through other means of communication. You are responsible for reviewing the Agreement periodically for changes.
70
+
71
+ 13. No Waiver. NAVER will not be treated as having waived any rights by not exercising (or delaying the exercise of) any rights under this Agreement.
72
+
73
+
74
+
75
+
76
+ Addendum A – Acceptable Use Policy
77
+
78
+ NAVER is committed to promoting safe and responsible use of its AI technologies, including the HyperCLOVA X SEED 14B Think Model (the “Model”). By accessing or using the Model and Derivative Model (Defined in the Model License Agreement) (the Model and Derivative Model are collectively referred to as the “Models”), you agree to this Acceptable Use Policy (“Policy”).
79
+
80
+ We want everyone to use the Models safely, legally, and ethically. You agree that you will not use, or allow others to use, the Models to:
81
+
82
+ 1. Violate applicable laws or the rights of others, including by:
83
+ a. Engaging in, promoting, contributing to, encouraging, planning, inciting, or furthering illegal or unlawful activity or content, such as:
84
+ * Violence or terrorism
85
+ * Exploitation or harm to children, including the creation or dissemination of child exploitative content
86
+ * Human trafficking, exploitation, or sexual violence
87
+ * The unlawful distribution of obscene or harmful material to minors, or failure to apply legally required age restrictions
88
+ * Sexual solicitation or sexually exploitative behavior
89
+ * Any other criminal activity
90
+ b. Engaging in, promoting, inciting, or facilitating the harassment, abuse, threatening, or bullying of individuals or groups
91
+ c. Engaging in, promoting, inciting, or facilitating discrimination or other unlawful or harmful conduct in the provision of employment, credit, housing, or access to essential goods and services
92
+ d. Providing unauthorized or unlicensed professional services, including but not limited to financial, legal, medical/health, or related services
93
+ e. Collecting, processing, disclosing, generating, or inferring private or sensitive personal information, including identity, health, or demographic data, unless lawfully permitted under applicable laws
94
+ f. Infringing, misappropriating, or otherwise violating third-party rights, including through the generation or use of outputs derived from the Models
95
+ g. Creating, generating, or facilitating malicious code, malware, or computer viruses, or interfering with the functioning, security, or integrity of a website, application, or system
96
+ h. Intentionally bypassing or disabling usage restrictions, safety measures, or access controls imposed by NAVER
97
+
98
+ 2. Engage in or promote use cases that may pose a risk of death, bodily harm, or significant safety hazard to individuals, including use of the Models in connection with:
99
+ a. Military, warfare, nuclear technology or espionage
100
+ b. The development or distribution of firearms or illegal weapons
101
+ c. Illegal drugs or regulated controlled substances
102
+ d. Operation of critical infrastructure, transportation systems, or heavy machinery
103
+ e. Content promoting self-harm, including suicide, or eating disorders
104
+ f. Any other use intended to incite or cause physical harm
105
+
106
+ 3. Intentionally deceive or mislead others, including by:
107
+ a. Generating, promoting, or disseminating fraudulent or misleading content
108
+ b. Creating or sharing defamatory content
109
+ c. Generating or distributing spam
110
+ d. Impersonating another individual or entity without proper authorization
111
+ e. Representing Model output as human-generated
112
+ f. Generating or enabling fake online engagement, such as fake reviews or fake users
113
+
114
+ 4. Fail to disclose to end users any known risks or limitations of an AI system that incorporates the Models.
115
+
116
+ 5. Use the Models in conjunction with third-party tools, models, or software designed to generate unlawful content or conduct, or falsely represent outputs from such tools as associated with NAVER or HyperCLOVA X.
117
+
118
+ If you become aware of a violation of this Policy, a bug, or any behavior that could result in a breach of this Policy, please report it to us:
119
+
120
+ Reporting risky outputs: [email protected]
121
+ Reporting policy violations or unauthorized use: [email protected]
122
+
README.md ADDED
@@ -0,0 +1,603 @@
1
+ ---
2
+ license: other
3
+ license_name: hyperclovax-seed
4
+ license_link: LICENSE
5
+ library_name: transformers
6
+ ---
7
+
8
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6512d9827fccffe1e9e28fa7/7BT1W9eHLQjRCCENwcXmE.png)
9
+
10
+ ## Overview
11
+
12
+ HyperCLOVA X SEED 14B Think is a next-generation language model that moves beyond the conventional approach of simply increasing model size to improve performance. It combines [HyperCLOVA X’s lightweighting technology](https://tinyurl.com/y3hrfz67) for building high-efficiency LLMs with advanced reasoning capabilities. Its development relied on two key technologies: (1) Pruning & Knowledge Distillation, which achieves both compactness and high performance, and (2) a Reinforcement Learning (RL) pipeline, which maximizes reasoning ability. By pruning low-importance parameters and distilling knowledge from a large model into a smaller one, training costs have been significantly reduced. On top of this, [the latest RL recipe validated in HyperCLOVA X Think](https://arxiv.org/pdf/2506.22403) is applied in a multi-stage process: (1) Supervised Fine-Tuning (SFT), (2) Reinforcement Learning with Verifiable Rewards (RLVR), (3) Length Controllability (LC) for reasoning path optimization, and (4) a joint training of Reinforcement Learning from Human Feedback (RLHF) and RLVR.
13
+
14
+ It is a considerable challenge to equip a pruned, knowledge-distilled model with reasoning capabilities, since reductions in training costs and model size often degrade reasoning performance. However, through extensive research experience and persistent trial and error, the HyperCLOVA X team has succeeded in lowering training costs while maintaining reasoning performance comparable to that of larger, resource-intensive models.
15
+
16
+
17
+ ## Basic Information
18
+
19
+ - **Architecture**: Transformer-based architecture with Peri-Layer Normalization and Maximal Update Parameterization (μP) (Dense Model)
+ - **Parameters**: 14.74B
+ - **Input/Output Format**: Text / Text
+ - **Context Length**: 32k
23
+
24
+ ## Training Cost
25
+
26
+ `HyperCLOVA X SEED 14B Think` was trained at a significantly lower cost compared to high-performance external models of similar scale. By utilizing HCX’s lightweight training pipeline, it was trained at approximately **52.60×** lower cost than `Qwen2.5-14B` and **91.38×** lower cost than `Qwen3-14B`.
27
+
28
+ | Model (Base) | GPU Hours (A100-80GB, MFU 50%) |
29
+ | ------------------------------- | ---------------------------------- |
30
+ | **HyperCLOVA X SEED 14B Think** | **68,049** |
31
+ | Qwen2.5-0.5B | 169,257 |
32
+ | Qwen2.5-1.5B | 449,643 |
33
+ | Qwen3-0.6B | 602,460 |
34
+ | Qwen3-1.7B | 1,063,991 |
35
+ | HyperCLOVA X Think | 2,197,732 |
36
+ | **Qwen2.5-14B** | **3,603,432** |
37
+ | Qwen3-8B | 3,993,607 |
38
+ | **Qwen3-14B** | **6,267,077** |
39
+ | Qwen3-32B | 14,108,748 |
40
+
41
+ ## Benchmarks
42
+
43
+ Compared to global models of a similar scale, such as Qwen3 14B, HyperCLOVA X SEED 14B Think demonstrates superior performance in Korean language and cultural understanding, while showing competitive performance in math and coding tasks, which are directly or indirectly related to agent capabilities. This trend remains consistent even when compared with larger models like Qwen3 32B and LG Exaone-Deep 32B.
44
+
45
+ ### Backbone Benchmarks Performance Comparison (Non-think)
46
+
47
+ **Korean/Korea Culture**
48
+
49
+ | Model | Average | CLIcK | HAERAE-Bench | KOBEST | KorMedMCQA | KMMLU | KoBigBench | KoCommonGEN-v2 |
50
+ | ------------------------------- | ------- | ------ | ------------ | ------ | ---------- | ------ | ---------- | -------------- |
51
+ | **HyperCLOVA X SEED 14B Think** | 0.7269 | 0.7208 | 0.8506 | 0.8570 | 0.6411 | 0.5428 | 0.7482 | 0.6682 |
52
+ | QWEN3-8B | 0.6759 | 0.6206 | 0.6618 | 0.7919 | 0.6471 | 0.5543 | 0.7186 | 0.5773 |
53
+ | QWEN3-14B | 0.7079 | 0.6707 | 0.6975 | 0.8174 | 0.6979 | 0.5864 | 0.7507 | 0.5927 |
54
+
55
+
56
+ **English/American Culture**
57
+ | Model | Average | MMLU | BigBench-Hard | Hellaswag | Winogrande | PIQA | ARC-challenge | Social IQa |
58
+ | ------------------------------- | ------- | ------ | ------------- | --------- | ---------- | ------ | ------------- | ---------- |
59
+ | **HyperCLOVA X SEED 14B Think** | 0.6614 | 0.7121 | 0.6216 | 0.6125 | 0.7593 | 0.7791 | 0.6246 | 0.5205 |
60
+ | QWEN3-8B | 0.6548 | 0.7490 | 0.6072 | 0.5817 | 0.7198 | 0.7666 | 0.6433 | 0.5159 |
61
+ | QWEN3-14B | 0.6807 | 0.7885 | 0.6325 | 0.6143 | 0.7356 | 0.8025 | 0.6698 | 0.5215 |
62
+
63
+
64
+ ### Reasoning Performance Comparison
65
+
66
+ **Korean/Korea Culture**
67
+ | Model | KMMLU | CSAT-ko-2025 | KorMedMCQA | KoBALT | HAERAE | CLIcK | KoBigBench | LogicKor |
68
+ |-----------------------------------------|--------|--------|--------|--------|--------|--------|--------|------|
69
+ | HyperCLOVA X SEED 14B Think **(Think)** | 0.6649 | 0.7516 | 0.6933 | 0.4500 | 0.8537 | 0.7280 | 0.7974 | 8.74 |
70
+ | QWEN3-8B | 0.5543 | 0.7200 | 0.6782 | 0.3060 | 0.6618 | 0.6690 | 0.7850 | 8.92 |
71
+ | QWEN3-14B | 0.4930 | 0.7710 | 0.6850 | 0.3840 | 0.7410 | 0.6880 | 0.8380 | 9.15 |
72
+
73
+ **Coding/Math**
74
+ | Model | GSM8k | MATH500 | HumanEval | MBPP |
75
+ |-----------------------------------------|--------|--------|--------|--------|
76
+ | HyperCLOVA X SEED 14B Think | 0.9553 | 0.9380 | 0.9451 | 0.8759 |
77
+ | QWEN3-14B | 0.9590 | 0.9680 | 0.9570 | 0.9080 |
78
+
79
+
80
+ ### Non-Think / Think Performance Comparison
81
+
82
+ | Model | GSM8k | GPT4Eval | MT Bench | Arena-Hard-v0.1 |
83
+ |---------------------------------------------|--------|--------|--------|--------|
84
+ | HyperCLOVA X SEED 14B Think **(Non-think)** | 0.9348 | 0.6741 | 8.2063 | 0.2733 |
85
+ | HyperCLOVA X SEED 14B Think **(Think)** | 0.9553 | 0.8200 | 8.8313 | 0.5826 |
86
+
87
+
88
+ ## ChatML Block
89
+
90
+ The chat template for HyperCLOVA X consists of the following elements.
91
+
92
+ - **tool_list**: A list of tools available to the model (in JSON format). If no tools are available, an empty block should be provided.
+ - **system**: System prompt. If none is available, an empty block should be provided.
+ - **user**: User input.
+ - **assistant**: Assistant output.
96
+
97
+ The basic structure of a ChatML block is as follows.
98
+
99
+ ```
100
+ <|im_start|>{tool_list/system/user/assistant}
101
+ {content}<|im_end|>
102
+ ```
103
+
104
+ - `<|im_start|>` : Start token of ChatML block
105
+ - `<|im_end|>` : End token of ChatML block
106
+
107
+ ## (ChatML) General Conversation
108
+
109
+ ### First turn
110
+
111
+ Given a two-turn conversation between the user and the assistant (`user_query_1`, `assistant_answer_1`, `user_query_2`, `assistant_answer_2`), the prompt for the first-turn can be constructed in its simplest form as follows:
112
+
113
+ ```
114
+ <|im_start|>tool_list
115
+ <|im_end|>
116
+ <|im_start|>system
117
+ - You are "CLOVA X," an AI language model developed by NAVER.
118
+ - The current date is {day of the week}, {month} {dd}, {yyyy}.<|im_end|>
119
+ <|im_start|>user
120
+ {user_query_1}<|im_end|>
121
+ <|im_start|>assistant
122
+ ```
123
+
124
+ After the `<|im_end|>` token (indicating the end of the ChatML block), the model generates text up to the `<|endofturn|>` token. This output corresponds to the assistant's first response (`assistant_answer_1`).
125
+
126
+ ### Second turn
127
+
128
+ Based on the assistant's first response (`assistant_answer_1`), when the user asks an additional question (`user_query_2`), the prompt input to the model is constructed as follows:
129
+
130
+ ```
131
+ {following the previous context}
132
+ {assistant_answer_1}<|im_end|><|endofturn|>
133
+ <|im_start|>user
134
+ {user_query_2}<|im_end|>
135
+ <|im_start|>assistant
136
+ ```
137
+
138
+ As in the previous turn, generation continues until the `<|endofturn|>` token appears after `<|im_end|>`. This corresponds to the assistant's second response (`assistant_answer_2`).
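+
+ For illustration, the two turns above can also be driven programmatically. The following is a minimal sketch, not an official API: the `chatml` and `generate` helpers and the placeholder queries are assumptions made for this example, and the recommended route is `apply_chat_template` as shown in the **Huggingface Usage Example** section below.
+
+ ```python
+ # Minimal sketch: assemble the raw ChatML prompt by hand and generate with the same
+ # stop strings used throughout this document. Helper names and queries are placeholders.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ def chatml(role, content):
+     # One ChatML block: <|im_start|>{role}\n{content}<|im_end|>
+     return f"<|im_start|>{role}\n{content}<|im_end|>\n"
+
+ def generate(prompt):
+     inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+     output_ids = model.generate(**inputs, max_new_tokens=512,
+                                 stop_strings=["<|endofturn|>", "<|stop|>"], tokenizer=tokenizer)
+     # Return only the newly generated text; special tokens are kept so the markers stay visible.
+     return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
+
+ system = '- You are "CLOVA X," an AI language model developed by NAVER.\n- The current date is Monday, July 21, 2025.'
+
+ # First turn
+ prompt = chatml("tool_list", "") + chatml("system", system) + chatml("user", "{user_query_1}") + "<|im_start|>assistant\n"
+ assistant_answer_1 = generate(prompt)  # typically ends with <|im_end|><|endofturn|>
+
+ # Second turn: append the first answer and the follow-up question, then generate again
+ prompt += assistant_answer_1 + "\n" + chatml("user", "{user_query_2}") + "<|im_start|>assistant\n"
+ assistant_answer_2 = generate(prompt)
+ ```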
139
+
140
+ ## (ChatML) Function Call (Using Tools)
141
+
142
+ Insert the list of tools available into the tool_list block as a JSON list. For example, the following is a case where the only available tool is `get_weather`.
143
+
144
+ ```
145
+ <|im_start|>tool_list
146
+ [{"name": "get_weather", "description": "Check the current weather at the requested location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "Name of the city"}}, "required": ["location"]}}]<|im_end|>
147
+ ```
148
+
149
+ Additional instructions can be included in the system prompt if needed, and it is recommended to format them as `- {content}`. For example:
150
+
151
+ ```
152
+ <|im_start|>system
153
+ - In this environment, various tools can be used to answer users' questions.
154
+ - You are "CLOVA X," an AI language model developed by NAVER.
155
+ - Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.
156
+ - The current date is {day of the week}, {month} {dd}, {yyyy}.
157
+ - Latest information such as news, stock prices, and shopping is retrieved through the tool_list.
158
+ - If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond.<|im_end|>
159
+ ```
160
+
161
+ ### First turn
162
+
163
+ Suppose the user gives the instruction, 'Tell me the weather in Seoul.' The prompt input to the model would then be as follows:
164
+
165
+ ```
166
+ <|im_start|>tool_list
167
+ [{"name": "get_weather", "description": "Check the current weather at the requested location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "Name of the city"}}, "required": ["location"]}}]<|im_end|>
168
+ <|im_start|>system
169
+ - In this environment, various tools can be used to answer users' questions.
170
+ - You are "CLOVA X," an AI language model developed by NAVER.
171
+ - Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.
172
+ - The current date is {day of the week}, {month} {dd}, {yyyy}.
173
+ - Latest information such as news, stock prices, and shopping is retrieved through the tool_list.
174
+ - If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond.<|im_end|>
175
+ <|im_start|>user
176
+ Tell me the weather in Seoul<|im_end|>
177
+ <|im_start|>assistant
178
+ ```
179
+
180
+ Generation continues until either the `<|stop|>` or `<|endofturn|>` token appears immediately after `<|im_end|>`. An example of the model’s output is shown below. HyperCLOVA X checks the list of available tools (tool_list), selects the appropriate tool (`get_weather`), and returns the necessary information for the tool call in JSON format.
181
+
182
+ ```
183
+ {following the previous context}
184
+ Let's check the weather using the get_weather tool.<|im_end|>
185
+ <|im_start|>assistant -> tool/function_call
186
+ {"name": "get_weather","input": {"location":"Seoul"}}<|im_end|><|stop|>
187
+ ```
188
+
189
+ - `assistant -> tool/function_call` means that the assistant (model) invokes a function call.
190
+
191
+ ### Second turn
192
+
193
+ The model stopped generating because the `<|stop|>` token appeared immediately after `<|im_end|>`. Based on the generated information, `get_weather` should now be called. Calling the external function and parsing its result must be implemented separately (a minimal sketch is shown below). For this explanation, let's assume that the result of calling and parsing the external function is `{"result":{"location": "Seoul", "weather": "Sunny", "temperature": 25}}`.
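+
+ As an illustration of that glue code, here is a minimal sketch; it is not part of the model or library, and `get_weather` is a hypothetical stand-in for a real implementation.
+
+ ```python
+ # Parse the JSON emitted in the assistant -> tool/function_call block,
+ # dispatch it to a local tool, and wrap the result for the next prompt.
+ import json
+
+ def get_weather(location):
+     # Placeholder: call a real weather API here.
+     return {"location": location, "weather": "Sunny", "temperature": 25}
+
+ TOOLS = {"get_weather": get_weather}
+
+ def run_function_call(raw_block):
+     # raw_block is the text generated inside the assistant -> tool/function_call block,
+     # e.g. '{"name": "get_weather","input": {"location":"Seoul"}}'
+     call = json.loads(raw_block)
+     result = TOOLS[call["name"]](**call["input"])
+     # Wrap the result in the format used by the tool/function_call block below.
+     return json.dumps({"result": result}, ensure_ascii=False)
+
+ print(run_function_call('{"name": "get_weather","input": {"location":"Seoul"}}'))
+ # {"result": {"location": "Seoul", "weather": "Sunny", "temperature": 25}}
+ ```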
194
+
195
+ The model is now ready to respond to the second turn. Using all the information gathered so far, input the following prompt into the model and have it follow accordingly.
196
+
197
+ ```
198
+ {following the previous context}
199
+ <|im_start|>tool/function_call
200
+ {"result":{"location": "Seoul", "weather": "Sunny", "temperature": 25}}<|im_end|>
201
+ <|im_start|>assistant
202
+ ```
203
+
204
+ - `tool/function_call` means the result of the function call is passed back to the assistant (model).
205
+
206
+ Just like in the previous turn, generation continues until the `<|stop|>` or `<|endofturn|>` token appears immediately after `<|im_end|>`. If the model behaves as expected, the output will look like this.
207
+
208
+ ```
209
+ {following the previous context}
210
+ The weather in Seoul is clear and the temperature is 25 degrees.<|im_end|><|endofturn|>
211
+ ```
212
+
213
+ ## (ChatML) Inducing reasoning/non-reasoning
214
+
215
+
216
+ HyperCLOVA X can handle both reasoning and non-reasoning tasks; the difference lies in whether the assistant is prompted to 'think' before responding. Based on the previous example, to make HyperCLOVA X respond in reasoning mode, you can input the prompt into the model as follows (the tool_list and system blocks are omitted here):
217
+
218
+ ```
219
+ <|im_start|>user
220
+ Tell me the weather in Seoul<|im_end|>
221
+ <|im_start|>assistant/think
222
+
223
+ ```
224
+
225
+ - Note that the prompt ends with `assistant/think\n` (think + `\n`).
+ - Generation continues until either the `<|stop|>` or `<|endofturn|>` token appears immediately after `<|im_end|>`.
227
+
228
+ To have the assistant respond in non-reasoning mode (i.e., answer directly), you can input the following prompt.
229
+
230
+ ```
231
+ <|im_start|>user
232
+ Tell me the weather in Seoul<|im_end|>
233
+ <|im_start|>assistant
234
+
235
+ ```
236
+
237
+ - Note that the prompt ends with `assistant\n` (assistant + `\n`).
+ - Generation continues until either the `<|stop|>` or `<|endofturn|>` token appears immediately after `<|im_end|>`.
239
+
240
+
241
+ ### Adjusting inference length
242
+ The length of reasoning can be controlled by appending "\nThink for maximum {N} tokens" at the end of the user's utterance.
243
+
244
+ Example
245
+ Suppose the user says, "Tell me the prime number closest to 1000."
246
+ If you want the model to reason for approximately 1024 tokens before answering, you would construct the prompt as follows:
247
+
248
+
249
+ ```
250
+ <|im_start|>tool_list
251
+ <|im_end|>
252
+ <|im_start|>system
253
+ - You are "CLOVA X," an AI language model developed by NAVER.
254
+ - The current date is {day of the week}, {month} {dd}, {yyyy}.<|im_end|>
255
+ <|im_start|>user
256
+ Tell me the prime number closest to 1000.
257
+ Think for maximum 1024 tokens.<|im_end|>
258
+ <|im_start|>assistant/think
259
+
260
+ ```
261
+ - Adjusting the reasoning length means guiding the model to reason for approximately the specified number of tokens; the model does not always generate exactly that many tokens.
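+
+ A small sketch of how this instruction could be appended programmatically; the `with_thinking_budget` helper is an assumption for illustration, not part of the library.
+
+ ```python
+ def with_thinking_budget(user_query, max_think_tokens=1024):
+     # Append the length-control instruction described above to the user's message.
+     return f"{user_query}\nThink for maximum {max_think_tokens} tokens."
+
+ chat = [{"role": "user", "content": with_thinking_budget("Tell me the prime number closest to 1000.", 1024)}]
+ # Pass `chat` to tokenizer.apply_chat_template(..., force_reasoning=True, ...) as in the
+ # Hugging Face examples below; the budget guides, rather than hard-limits, the reasoning length.
+ ```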
262
+
263
+
264
+ ### Multiturn inference
265
+
266
+ When engaging in multi-turn conversations with reasoning enabled, the reasoning content from previous turns (`<|im_start|>assistant/think\n...<|im_end|>`) is not included in the next turn's prompt. Instead, as in the standard multi-turn dialogue format, only the final assistant response is included in the prompt passed to the model.
267
+
268
+ The same applies when a function call is made during reasoning: the reasoning content is excluded, and only the function call block (`<|im_start|>assistant -> tool/function_call\n...<|im_end|><|stop|>`) is included in the prompt provided to the model.
269
+
270
+ Example: Solving a simple math problem
271
+ Suppose the user sends a request: "Calculate the sum of odd numbers from 1 to 10."
272
+ The following explains step-by-step how this is processed as a multi-turn reasoning interaction.
273
+
274
+ 1. First Turn
275
+ Input Prompt
276
+ ```
277
+ <|im_start|>tool_list
278
+ <|im_end|>
279
+ <|im_start|>system
280
+ <|im_end|>
281
+ <|im_start|>user
282
+ Calculate the sum of odd numbers from 1 to 10
283
+ <|im_end|>
284
+ <|im_start|>assistant/think
285
+
286
+ ```
287
+ The model then begins to generate its reasoning content.
288
+ ```
289
+ {following the previous context}
290
+ The user wants me to find the sum of the odd numbers from 1 to 10. First, I need to clearly understand the problem. I should identify which numbers between 1 and 10 are odd. Going through the numbers in order, the odd numbers are 1, 3, 5, 7, and 9. Since 10 is even, it is excluded. Now, I need to add these numbers together. Adding them step by step: 1+3=4, plus 5 gives 9, plus 7 gives 16, and finally adding 9 gives 25. But is there another way to calculate this? For example, I could use the formula for the sum of an arithmetic sequence. The odd numbers here form an arithmetic sequence with the first term 1 and a common difference of 2. I need to count the number of terms first—there are 5 odd numbers between 1 and 10. The formula for the sum of an arithmetic sequence is n*(a₁ + aₙ)/2, where n is 5, a₁ is 1, and aₙ is 9. Plugging these in, we get 5*(1 + 9)/2 = 5*10/2 = 25, which matches the previous result. Since both methods give 25, I’ll double-check that this is the answer the user expects, and then provide a concise explanation.<|im_end|>
291
+ <|im_start|>assistant
292
+ The odd numbers from 1 to 10 are **1, 3, 5, 7, 9**.
293
+ The sum of these numbers is calculated as follows:
294
+
295
+ \[
296
+ 1 + 3 + 5 + 7 + 9 = 25
297
+ \]
298
+
299
+ Alternatively, the sum can be easily calculated using **the formula for the sum of an arithmetic sequence**:
300
+ - First term (\(a_1\)) = 1
301
+ - Last term (\(a_n\)) = 9
302
+ - Number of terms (\(n\)) = 5
303
+
304
+ \[
305
+ \text{Sum} = \frac{n(a_1 + a_n)}{2} = \frac{5 \times (1 + 9)}{2} = \frac{5 \times 10}{2} = 25
306
+ \]
307
+
308
+ **Result:** The answer is **25**.<|im_end|><|endofturn|>
309
+ ```
310
+ As the `<|endofturn|>` token has been generated, the model concludes the current turn and awaits the next user input.
311
+
312
+ 2. Second Turn
313
+ Suppose you want to follow up with: "What is the result of adding 10 to that sum?" Here, the prompt is created by excluding the reasoning content (assistant/think) from the first turn and including only the final response (assistant).
314
+
315
+ - Input Prompt
316
+
317
+ ```
318
+ <|im_start|>tool_list
319
+ <|im_end|>
320
+ <|im_start|>system
321
+ <|im_end|>
322
+ <|im_start|>user
323
+ Calculate the sum of odd numbers from 1 to 10
324
+ <|im_end|>
325
+ <|im_start|>assistant
326
+ The odd numbers from 1 to 10 are **1, 3, 5, 7, 9**.
327
+ The sum of these numbers is calculated as follows:
328
+
329
+ \[
330
+ 1 + 3 + 5 + 7 + 9 = 25
331
+ \]
332
+
333
+ Alternatively, the sum can be easily calculated using **the formula for the sum of an arithmetic sequence**:
334
+ - First term (\(a_1\)) = 1
335
+ - Last term (\(a_n\)) = 9
336
+ - Number of terms (\(n\)) = 5
337
+
338
+ \[
339
+ \text{Sum} = \frac{n(a_1 + a_n)}{2} = \frac{5 \times (1 + 9)}{2} = \frac{5 \times 10}{2} = 25
340
+ \]
341
+
342
+ **Result:** The answer is **25**.<|im_end|><|endofturn|>
344
+ <|im_start|>user
345
+ What is the result of adding 10 to that sum?
346
+ <|im_end|>
347
+ <|im_start|>assistant/think
348
+
349
+ ```
350
+
351
+ - Excluded contents in the second turn input
352
+ ```
353
+ <|im_start|>assistant/think
354
+ {following the previous context}
355
+ The user wants me to find the sum of the odd numbers from 1 to 10. First, I need to clearly understand the problem. I should identify which numbers between 1 and 10 are odd. Going through the numbers in order, the odd numbers are 1, 3, 5, 7, and 9. Since 10 is even, it is excluded. Now, I need to add these numbers together. Adding them step by step: 1+3=4, plus 5 gives 9, plus 7 gives 16, and finally adding 9 gives 25. But is there another way to calculate this? For example, I could use the formula for the sum of an arithmetic sequence. The odd numbers here form an arithmetic sequence with the first term 1 and a common difference of 2. I need to count the number of terms first—there are 5 odd numbers between 1 and 10. The formula for the sum of an arithmetic sequence is n*(a₁ + aₙ)/2, where n is 5, a₁ is 1, and aₙ is 9. Plugging these in, we get 5*(1 + 9)/2 = 5*10/2 = 25, which matches the previous result. Since both methods give 25, I’ll double-check that this is the answer the user expects, and then provide a concise explanation.<|im_end|>
356
+ ```
357
+
358
+ For all later turns, the reasoning (think) content from previous turns is likewise excluded from the prompt.
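+
+ A minimal sketch of this filtering step, assuming the raw history is kept as a list of (role, content) pairs whose role names mirror the ChatML blocks above; the helper itself is illustrative, not an official API.
+
+ ```python
+ def prompt_history(history):
+     # Drop previous-turn reasoning blocks; keep user turns and final assistant answers only.
+     return [(role, content) for role, content in history if role != "assistant/think"]
+
+ history = [
+     ("user", "Calculate the sum of odd numbers from 1 to 10"),
+     ("assistant/think", "...first-turn reasoning..."),
+     ("assistant", "**Result:** The answer is **25**."),
+     ("user", "What is the result of adding 10 to that sum?"),
+ ]
+ print(prompt_history(history))  # the assistant/think entry is gone
+ ```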
359
+
360
+ ### Difference between `<|stop|>` and `<|endofturn|>`
361
+
362
+ - **Similarity**
+   - They both act as signals to stop the model from generating further responses.
+ - **Difference**
+   - `<|stop|>`: Response generation is halted; after the tool call results are processed, the AI turn resumes.
+   - `<|endofturn|>`: Response generation is halted, the AI turn is fully terminated, and the model waits for the user's next input.
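+
+ A minimal sketch of an agent-style turn loop built on this distinction; `run_turn` and the fake generator below are illustrative assumptions, not an official API.
+
+ ```python
+ def run_turn(prompt, generate, call_tool):
+     while True:
+         output = generate(prompt)             # generates until <|stop|> or <|endofturn|>
+         prompt += output
+         if output.endswith("<|endofturn|>"):
+             return prompt                     # AI turn fully finished; wait for the next user input
+         # The model stopped with <|stop|> after an assistant -> tool/function_call block:
+         tool_result = call_tool(output)       # run the requested tool (see the function-call sketch earlier)
+         prompt += "\n<|im_start|>tool/function_call\n" + tool_result + "<|im_end|>\n<|im_start|>assistant\n"
+
+ # Toy demo with a fake generator that "calls" a tool once and then finishes the turn.
+ fake_outputs = iter([
+     "Let's check the weather using the get_weather tool.<|im_end|>\n"
+     "<|im_start|>assistant -> tool/function_call\n"
+     '{"name": "get_weather","input": {"location":"Seoul"}}<|im_end|><|stop|>',
+     "The weather in Seoul is clear and the temperature is 25 degrees.<|im_end|><|endofturn|>",
+ ])
+ print(run_turn("<|im_start|>assistant\n",
+                generate=lambda p: next(fake_outputs),
+                call_tool=lambda o: '{"result":{"location": "Seoul", "weather": "Sunny", "temperature": 25}}'))
+ ```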
368
+
369
+
370
+ ## **Huggingface Usage Example**
371
+
372
+ After downloading the model binaries, including the configuration files, to a local path (`/path/to/hyperclova-x-seed-think-14b`), you can run the following in a Python environment with the [Huggingface library](https://huggingface.co/docs/transformers/installation) (verified to work with version 4.45.0) and [timm (pytorch-image-models)](https://github.com/huggingface/pytorch-image-models) installed.
373
+
374
+ You can use the parameters of `apply_chat_template` to explicitly enable or disable the reasoning feature.
375
+
376
+ - The default value for both options is `None`, in which case the model decides on its own whether to reason before answering or to answer directly without reasoning.
377
+ - `force_reasoning=True`: Forces the model to always reason before answering.
378
+ - `skip_reasoning=True`: Forces the model to answer directly without reasoning.
379
+ - Passing `None` or `False` has the same effect.
380
+ - If both are set to True, `force_reasoning=True` takes precedence.
381
+
382
+ ```python
383
+ # Default example
384
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_dict=True, return_tensors="pt")
385
+
386
+ # By adding force_reasoning=True, the model is forced to always reason before responding
387
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, force_reasoning=True, return_dict=True, return_tensors="pt")
388
+
389
+ # By adding skip_reasoning=True, the model is forced to always answer directly without reasoning
390
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, skip_reasoning=True, return_dict=True, return_tensors="pt")
391
+ ```
392
+
393
+ ### Non-think Example Code
394
+ ```python
395
+ from transformers import AutoModelForCausalLM, AutoTokenizer
396
+
397
+ model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
398
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
399
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
400
+
401
+ chat = [
402
+ {"role": "system", "content": "- In this environment, various tools can be used to answer users' questions.\n- You are \"CLOVA X,\" an AI language model developed by NAVER.\n- Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.\n- The current date is Monday, July 21, 2025.\n- Latest information such as news, stock prices, and shopping is retrieved through the tool_list.\n- If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond."},
403
+ {"role": "user", "content": "Explain in as much detail as possible the relationship between the Schrödinger equation and quantum mechanics."},
404
+ ]
405
+
406
+ # By adding skip_reasoning=True, the model is forced to always answer directly without reasoning
407
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, skip_reasoning=True, return_dict=True, return_tensors="pt")
408
+ inputs = inputs.to("cuda")
409
+
410
+ output_ids = model.generate(
411
+ **inputs,
412
+ max_length=1024,
413
+ stop_strings=["<|endofturn|>", "<|stop|>"],
414
+ temperature=0.5,
415
+ top_p=0.6,
416
+ repetition_penalty=1.05,
417
+ tokenizer=tokenizer
418
+ )
419
+ print(tokenizer.batch_decode(output_ids))
420
+ ```
421
+
422
+ ### Think Example Code
423
+ ```python
424
+ from transformers import AutoModelForCausalLM, AutoTokenizer
425
+
426
+ model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
427
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
428
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
429
+
430
+ chat = [
431
+ {"role": "system", "content": "- In this environment, various tools can be used to answer users' questions.\n- You are \"CLOVA X,\" an AI language model developed by NAVER.\n- Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.\n- The current date is Monday, July 21, 2025.\n- Latest information such as news, stock prices, and shopping is retrieved through the tool_list.\n- If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond."},
432
+ {"role": "user", "content": "Explain in as much detail as possible the relationship between the Schrödinger equation and quantum mechanics."},
433
+ ]
434
+
435
+ # By adding force_reasoning=True, the model is forced to always reason before responding
436
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, force_reasoning=True, return_dict=True, return_tensors="pt")
437
+ inputs = inputs.to("cuda")
438
+
439
+ output_ids = model.generate(
440
+ **inputs,
441
+ max_length=1024,
442
+ stop_strings=["<|endofturn|>", "<|stop|>"],
443
+ temperature=0.5,
444
+ top_p=0.6,
445
+ repetition_penalty=1.05,
446
+ tokenizer=tokenizer
447
+ )
448
+ print(tokenizer.batch_decode(output_ids))
449
+ ```
450
+
451
+ ### Hybrid (the model decides whether to use think or non-think mode) Example Code
452
+ ```python
453
+ from transformers import AutoModelForCausalLM, AutoTokenizer
454
+
455
+ model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
456
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
457
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
458
+
459
+ chat = [
460
+ {"role": "system", "content": "- In this environment, various tools can be used to answer users' questions.\n- You are \"CLOVA X,\" an AI language model developed by NAVER.\n- Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.\n- The current date is Monday, July 21, 2025.\n- Latest information such as news, stock prices, and shopping is retrieved through the tool_list.\n- If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond."},
461
+ {"role": "user", "content": "Explain in as much detail as possible the relationship between the Schrödinger equation and quantum mechanics."},
462
+ ]
463
+
464
+ # The model decides whether to answer after reasoning or to respond immediately without reasoning
465
+ inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_dict=True, return_tensors="pt")
466
+ inputs = inputs.to("cuda")
467
+
468
+ output_ids = model.generate(
469
+ **inputs,
470
+ max_length=1024,
471
+ stop_strings=["<|endofturn|>", "<|stop|>"],
472
+ temperature=0.5,
473
+ top_p=0.6,
474
+ repetition_penalty=1.05,
475
+ tokenizer=tokenizer
476
+ )
477
+ print(tokenizer.batch_decode(output_ids))
478
+ ```
479
+
480
+ ### Example code for function calls (tool usage)
481
+ For a scenario involving tool usage, you can execute it as follows.
482
+
483
+ ```python
484
+ import json
485
+ from transformers import AutoModelForCausalLM, AutoTokenizer
486
+
487
+ model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
488
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
489
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
490
+
491
+ # 1) The name of the tool should be written as function_call.{{ name }}.
492
+ # 2) Parameters follow the specifications described at https://platform.openai.com/docs/guides/function-calling?api-mode=responses.
493
+ tool_list = [
494
+ {
495
+ "type": "function",
496
+ "function": {
497
+ "name": "add",
498
+ "description": "Add two numbers.",
499
+ "parameters": {
500
+ "type": "object",
501
+ "properties": {
502
+ "x": {"type": "number", "description": "First number"},
503
+ "y": {"type": "number", "description": "Second number"}
504
+ },
505
+ "required": ["x", "y"]
506
+ }
507
+ }
508
+ }, {
509
+ "type": "function",
510
+ "function": {
511
+ "name": "subtract",
512
+ "description": "Subtract two numbers.",
513
+ "parameters": {
514
+ "type": "object",
515
+ "properties": {
516
+ "x": {"type": "number", "description": "First number"},
517
+ "y": {"type": "number", "description": "Second number"}
518
+ },
519
+ "required": ["x", "y"]
520
+ }
521
+ }
522
+ }
523
+ ]
524
+
525
+
526
+ chat = [
527
+ {"role": "system", "content": "- In this environment, various tools can be used to answer users' questions.\n- You are \"CLOVA X,\" an AI language model developed by NAVER.\n- Begin by creating a plan for solving the problem, and then utilize the tools accordingly to address the problem.\n- The current date is Monday, July 21, 2025.\n- Latest information such as news, stock prices, and shopping is retrieved through the tool_list.\n- If external tools are required, the assistant should not answer directly but must first obtain the necessary information via the assistant -> tool/function_call role, and then respond."},
528
+ {"role": "user", "content": "What is 1588 + 1234? Please calculate it using the provided tool."},
529
+ ]
530
+
531
+ inputs = tokenizer.apply_chat_template(chat, tools=tool_list, add_generation_prompt=True, return_dict=True, return_tensors="pt")
532
+ inputs = inputs.to("cuda")
533
+
534
+ output_ids = model.generate(
535
+ **inputs,
536
+ max_length=1024,
537
+ stop_strings=["<|endofturn|>", "<|stop|>"],
538
+ temperature=0.5,
539
+ top_p=0.6,
540
+ repetition_penalty=1.05,
541
+ tokenizer=tokenizer
542
+ )
543
+ print(tokenizer.batch_decode(output_ids))
544
+ ```
545
+
546
+ - If you have any questions or issues regarding usage, please leave them as an issue in the Discussions section of this page.
547
+
548
+ ## **vLLM Usage Example**
549
+
550
+ The HyperCLOVA X SEED Think model is built on a custom LLM architecture based on the LLaMA architecture, incorporating μP and Peri-LN techniques. For convenient use with vLLM, a dedicated vLLM plugin is provided; once vLLM is set up, the plugin can be installed and used with ease.
551
+
552
+ 1. Download vLLM plugin source code
553
+
554
+ ```bash
555
+ git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/hcx-vllm-plugin
556
+ ```
557
+
558
+ 2. vLLM plugin build & installation: from inside the NAVER-Cloud-HyperCLOVA-X/hcx-vllm-plugin directory downloaded in step 1, run the command below.
559
+
560
+ ```bash
561
+ pip install -e .
562
+ ```
563
+
564
+ After downloading the model checkpoint to a local path (`/path/to/hyperclova-x-seed-think-14b`), you can perform text inference by running the following commands on a GPU environment with A100 or higher.
565
+
566
+ ```bash
567
+ python -m vllm.entrypoints.openai.api_server --model=/path/to/hyperclova-x-seed-think-14b --trust_remote_code --port=8000
568
+
569
+ curl http://localhost:8000/v1/completions \
570
+ -H "Content-Type: application/json" \
571
+ -d '{
572
+ "prompt": "<|im_start|>tool_list\n<|im_end|>\n<|im_start|>system\n- The AI language model is named \"CLOVA X\" and was developed by NAVER.\n- Today is Friday, July 18, 2025.<|im_end|>\n<|im_start|>user\nExplain in as much detail as possible the relationship between the Schrödinger equation and quantum mechanics.<|im_end|>\n<|im_start|>assistant/think\n",
573
+ "top_k":-1,
574
+ "temperature":0.5,
575
+ "top_p":0.6,
576
+ "repetition_penalty":1.05,
577
+ "stop":["<|im_end|><|endofturn|>", "<|im_end|><|stop|>"],
578
+ "max_tokens":8192,
579
+ "skip_special_tokens":false
580
+ }'
581
+ ```
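+
+ The same request can also be sent from Python with the `requests` package (assumed to be installed); this simply mirrors the curl example above.
+
+ ```python
+ import requests
+
+ payload = {
+     "prompt": (
+         "<|im_start|>tool_list\n<|im_end|>\n"
+         "<|im_start|>system\n"
+         "- The AI language model is named \"CLOVA X\" and was developed by NAVER.\n"
+         "- Today is Friday, July 18, 2025.<|im_end|>\n"
+         "<|im_start|>user\n"
+         "Explain in as much detail as possible the relationship between the Schrödinger equation and quantum mechanics.<|im_end|>\n"
+         "<|im_start|>assistant/think\n"
+     ),
+     "top_k": -1,
+     "temperature": 0.5,
+     "top_p": 0.6,
+     "repetition_penalty": 1.05,
+     "stop": ["<|im_end|><|endofturn|>", "<|im_end|><|stop|>"],
+     "max_tokens": 8192,
+     "skip_special_tokens": False,
+ }
+ response = requests.post("http://localhost:8000/v1/completions", json=payload)
+ print(response.json()["choices"][0]["text"])
+ ```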
582
+
583
+ ## License
584
+
585
+ The model is licensed under the [HyperCLOVA X SEED 14B Think Model License Agreement](./LICENSE).
586
+
587
+ ## Citation
588
+
589
+ ```
590
+ @misc{navercloudhyperclovaxteam2025hyperclovaxthinktechnical,
591
+ title={HyperCLOVA X THINK Technical Report},
592
+ author={NAVER Cloud HyperCLOVA X Team},
593
+ year={2025},
594
+ eprint={2506.22403},
595
+ archivePrefix={arXiv},
596
+ primaryClass={cs.CL},
597
+ url={https://arxiv.org/abs/2506.22403},
598
+ }
599
+ ```
600
+
601
+ ## Questions
602
+
603
+ For any other questions, please feel free to contact us at [[email protected]](mailto:[email protected]).
added_tokens.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "<EMAIL>": 110521,
3
+ "<KEY>": 110522,
4
+ "<NAME>": 110520,
5
+ "<PASSWORD>": 110523,
6
+ "<code_to_intermediate>": 110502,
7
+ "<empty_output>": 110501,
8
+ "<file_sep>": 110492,
9
+ "<intermediate_to_code>": 110503,
10
+ "<issue_closed>": 110495,
11
+ "<issue_comment>": 110494,
12
+ "<issue_start>": 110493,
13
+ "<jupyter_code>": 110498,
14
+ "<jupyter_output>": 110499,
15
+ "<jupyter_script>": 110500,
16
+ "<jupyter_start>": 110496,
17
+ "<jupyter_text>": 110497,
18
+ "<pr>": 110504,
19
+ "<pr_base>": 110507,
20
+ "<pr_base_code>": 110509,
21
+ "<pr_comment>": 110512,
22
+ "<pr_diff>": 110510,
23
+ "<pr_diff_hunk>": 110511,
24
+ "<pr_diff_hunk_comment_line>": 110519,
25
+ "<pr_event_id>": 110513,
26
+ "<pr_file>": 110508,
27
+ "<pr_in_reply_to_comment_id>": 110518,
28
+ "<pr_in_reply_to_review_id>": 110517,
29
+ "<pr_is_merged>": 110506,
30
+ "<pr_review>": 110514,
31
+ "<pr_review_comment>": 110516,
32
+ "<pr_review_state>": 110515,
33
+ "<pr_status>": 110505,
34
+ "<repo_name>": 110491
35
+ }
config.json ADDED
@@ -0,0 +1,45 @@
1
+ {
2
+ "architectures": [
3
+ "HyperCLOVAXForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "attention_multiplier": 0.0078125,
8
+ "attn_pdrop": 0.0,
9
+ "auto_map": {
10
+ "AutoConfig": "configuration_hyperclovax.HyperCLOVAXConfig",
11
+ "AutoModel": "modeling_hyperclovax.HyperCLOVAXModel",
12
+ "AutoModelForCausalLM": "modeling_hyperclovax.HyperCLOVAXForCausalLM"
13
+ },
14
+ "bos_token_id": 100257,
15
+ "embd_pdrop": 0.0,
16
+ "embedding_multiplier": 10.0,
17
+ "end_token_id": 100257,
18
+ "eos_token_id": 100257,
19
+ "head_dim": 128,
20
+ "hidden_act": "silu",
21
+ "hidden_size": 6144,
22
+ "initializer_range": 0.012727922061357854,
23
+ "intermediate_size": 14336,
24
+ "logits_scaling": 0.125,
25
+ "max_position_embeddings": 131072,
26
+ "mlp_bias": false,
27
+ "model_type": "hyperclovax",
28
+ "num_attention_heads": 48,
29
+ "num_hidden_layers": 38,
30
+ "num_key_value_heads": 8,
31
+ "pad_token_id": 100257,
32
+ "pretraining_tp": 1,
33
+ "resid_pdrop": 0.0,
34
+ "residual_multiplier": 1.0,
35
+ "rms_norm_eps": 1e-05,
36
+ "rope_scaling": null,
37
+ "rope_theta": 100000000,
38
+ "summary_first_dropout": 0.0,
39
+ "tie_word_embeddings": false,
40
+ "torch_dtype": "float32",
41
+ "transformers_version": "4.52.4",
42
+ "use_cache": false,
43
+ "use_post_norm": true,
44
+ "vocab_size": 110592
45
+ }
configuration_hyperclovax.py ADDED
@@ -0,0 +1,235 @@
1
+ # coding=utf-8
2
+ # This file was created for the HyperCLOVA X SEED 14B Think architecture.
3
+ # partially copied and modified from https://github.com/huggingface/transformers
4
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
5
+ #
6
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
7
+ # and OPT implementations in this library. It has been modified from its
8
+ # original forms to accommodate minor architectural differences compared
9
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ """HyperCLOVAX model configuration"""
23
+
24
+ from transformers.configuration_utils import PretrainedConfig
25
+
26
+ class HyperCLOVAXConfig(PretrainedConfig):
27
+ r"""
28
+ This is the configuration class to store the configuration of a [`HyperCLOVAXModel`]. It is used to instantiate a HyperCLOVAX
29
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
30
+ defaults will yield a similar configuration to that of the HyperCLOVAX.
31
+
32
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
33
+ documentation from [`PretrainedConfig`] for more information.
34
+
35
+
36
+ Args:
37
+ vocab_size (`int`, *optional*, defaults to 32000):
38
+ Vocabulary size of the HyperCLOVAX model. Defines the number of different tokens that can be represented by the
39
+ `input_ids` passed when calling [`HyperCLOVAXModel`]
40
+ hidden_size (`int`, *optional*, defaults to 4096):
41
+ Dimension of the hidden representations.
42
+ intermediate_size (`int`, *optional*, defaults to 11008):
43
+ Dimension of the MLP representations.
44
+ num_hidden_layers (`int`, *optional*, defaults to 32):
45
+ Number of hidden layers in the Transformer decoder.
46
+ num_attention_heads (`int`, *optional*, defaults to 32):
47
+ Number of attention heads for each attention layer in the Transformer decoder.
48
+ num_key_value_heads (`int`, *optional*):
49
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
50
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
51
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
52
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
53
+ by mean-pooling all the original heads within that group. For more details check out [this
54
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
55
+ `num_attention_heads`.
56
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
57
+ The non-linear activation function (function or string) in the decoder.
58
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
59
+ The maximum sequence length that this model might ever be used with.
60
+ initializer_range (`float`, *optional*, defaults to 0.02):
61
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
62
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
63
+ The epsilon used by the rms normalization layers.
64
+ use_cache (`bool`, *optional*, defaults to `True`):
65
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
66
+ relevant if `config.is_decoder=True`.
67
+ pad_token_id (`int`, *optional*):
68
+ Padding token id.
69
+ bos_token_id (`int`, *optional*, defaults to 1):
70
+ Beginning of stream token id.
71
+ eos_token_id (`int`, *optional*, defaults to 2):
72
+ End of stream token id.
73
+ pretraining_tp (`int`, *optional*, defaults to 1):
74
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
75
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
76
+ understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
77
+ results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
78
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
79
+ Whether to tie weight embeddings
80
+ rope_theta (`float`, *optional*, defaults to 10000.0):
81
+ The base period of the RoPE embeddings.
82
+ rope_scaling (`Dict`, *optional*):
83
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type
84
+ and you expect the model to work on longer `max_position_embeddings`, we recommend updating this value
85
+ accordingly.
86
+ Expected contents:
87
+ `rope_type` (`str`):
88
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
89
+ 'llama3'], with 'default' being the original RoPE implementation.
90
+ `factor` (`float`, *optional*):
91
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
92
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
93
+ original maximum pre-trained length.
94
+ `original_max_position_embeddings` (`int`, *optional*):
95
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
96
+ pretraining.
97
+ `attention_factor` (`float`, *optional*):
98
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
99
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
100
+ `factor` field to infer the suggested value.
101
+ `beta_fast` (`float`, *optional*):
102
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
103
+ ramp function. If unspecified, it defaults to 32.
104
+ `beta_slow` (`float`, *optional*):
105
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
106
+ ramp function. If unspecified, it defaults to 1.
107
+ `short_factor` (`List[float]`, *optional*):
108
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
109
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
110
+ size divided by the number of attention heads divided by 2
111
+ `long_factor` (`List[float]`, *optional*):
112
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (>
113
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
114
+ size divided by the number of attention heads divided by 2
115
+ `low_freq_factor` (`float`, *optional*):
116
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
117
+ `high_freq_factor` (`float`, *optional*):
118
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
119
+ attention_bias (`bool`, *optional*, defaults to `False`):
120
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
121
+ attention_dropout (`float`, *optional*, defaults to 0.0):
122
+ The dropout ratio for the attention probabilities.
123
+ mlp_bias (`bool`, *optional*, defaults to `False`):
124
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
125
+ head_dim (`int`, *optional*):
126
+ The attention head dimension. If None, it will default to hidden_size // num_heads
127
+ embedding_multiplier (`float`, *optional*, defaults to `None`):
128
+ Multiplier applied to the embedding weights. If `None`, it is equivalent to `1.0`.
129
+ logits_scaling (`float`, *optional*, defaults to `None`):
130
+ Scaling factor for logits. If `None`, it is equivalent to `1.0`.
131
+ attention_multiplier (`float`, *optional*, defaults to `None`):
132
+ Multiplier applied to the attention weights. If `None`, it is equivalent to `self.head_dim ** -0.5`.
133
+ residual_multiplier (`float`, *optional*, defaults to `None`):
134
+ Scaling factor for residual connections. If `None`, it is equivalent to `1.0`.
135
+ use_post_norm (`bool`, *optional*, defaults to `False`):
136
+ Whether to apply Peri-Layer Normalization (Peri-LN), which adds post-normalization after the attention and MLP sub-layers.
137
+
138
+ ```python
139
+ >>> from transformers import HyperCLOVAXModel, HyperCLOVAXConfig
140
+
141
+ >>> # Initializing a HyperCLOVAX style configuration
142
+ >>> configuration = HyperCLOVAXConfig()
143
+
144
+ >>> # Initializing a model from the HyperCLOVAX style configuration
145
+ >>> model = HyperCLOVAXModel(configuration)
146
+
147
+ >>> # Accessing the model configuration
148
+ >>> configuration = model.config
149
+ ```"""
150
+
151
+ model_type = "hyperclovax"
152
+ keys_to_ignore_at_inference = ["past_key_values"]
153
+
154
+ def __init__(
155
+ self,
156
+ vocab_size=32000,
157
+ hidden_size=4096,
158
+ intermediate_size=11008,
159
+ num_hidden_layers=32,
160
+ num_attention_heads=32,
161
+ num_key_value_heads=None,
162
+ hidden_act="silu",
163
+ max_position_embeddings=2048,
164
+ initializer_range=0.02,
165
+ rms_norm_eps=1e-6,
166
+ use_cache=True,
167
+ pad_token_id=None,
168
+ bos_token_id=1,
169
+ eos_token_id=2,
170
+ pretraining_tp=1,
171
+ tie_word_embeddings=False,
172
+ rope_theta=10000.0,
173
+ rope_scaling=None,
174
+ attention_bias=False,
175
+ attention_dropout=0.0,
176
+ mlp_bias=False,
177
+ head_dim=None,
178
+ embedding_multiplier=None, # MuP
179
+ logits_scaling=None, # MuP
180
+ attention_multiplier=None, # MuP
181
+ residual_multiplier=None, # MuP
182
+ use_post_norm=False, # Peri-LN (post-norm)
183
+ auto_map={
184
+ "AutoConfig": "configuration_hyperclovax.HyperCLOVAXConfig",
185
+ "AutoModel": "modeling_hyperclovax.HyperCLOVAXModel",
186
+ "AutoModelForCausalLM": "modeling_hyperclovax.HyperCLOVAXForCausalLM"
187
+ },
188
+ **kwargs,
189
+ ):
190
+ self.vocab_size = vocab_size
191
+ self.max_position_embeddings = max_position_embeddings
192
+ self.hidden_size = hidden_size
193
+ self.intermediate_size = intermediate_size
194
+ self.num_hidden_layers = num_hidden_layers
195
+ self.num_attention_heads = num_attention_heads
196
+
197
+ # for backward compatibility
198
+ if num_key_value_heads is None:
199
+ num_key_value_heads = num_attention_heads
200
+
201
+ self.num_key_value_heads = num_key_value_heads
202
+ self.hidden_act = hidden_act
203
+ self.initializer_range = initializer_range
204
+ self.rms_norm_eps = rms_norm_eps
205
+ self.pretraining_tp = pretraining_tp
206
+ self.use_cache = use_cache
207
+ self.rope_theta = rope_theta
208
+ self.rope_scaling = rope_scaling
209
+ self.attention_bias = attention_bias
210
+ self.attention_dropout = attention_dropout
211
+ self.mlp_bias = mlp_bias
212
+ self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
213
+ # Validate the correctness of rotary position embeddings parameters
214
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
215
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
216
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
217
+ # rope_config_validation(self)
218
+
219
+ # MuP
220
+ self.embedding_multiplier = embedding_multiplier if embedding_multiplier is not None else 1.0
221
+ self.logits_scaling = logits_scaling if logits_scaling is not None else 1.0
222
+ self.attention_multiplier = attention_multiplier if attention_multiplier is not None else self.head_dim ** -0.5
223
+ self.residual_multiplier = residual_multiplier if residual_multiplier is not None else 1.0
224
+
225
+ # Peri-LN (post-norm)
226
+ self.use_post_norm = use_post_norm
227
+
228
+ super().__init__(
229
+ pad_token_id=pad_token_id,
230
+ bos_token_id=bos_token_id,
231
+ eos_token_id=eos_token_id,
232
+ tie_word_embeddings=tie_word_embeddings,
233
+ auto_map=auto_map,
234
+ **kwargs,
235
+ )
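
For reference, when the muP multipliers are left at `None`, `__init__` resolves them itself: `attention_multiplier` falls back to `head_dim ** -0.5` and the other multipliers fall back to `1.0`. A minimal sketch, assuming it is run from a local checkout of this repository so that `configuration_hyperclovax.py` is importable:

```python
# Minimal sketch; run from the repository root so the module below is importable.
from configuration_hyperclovax import HyperCLOVAXConfig

cfg = HyperCLOVAXConfig()  # muP multipliers left at their default of None

# With the docstring defaults (hidden_size=4096, num_attention_heads=32), head_dim is 128,
# so attention_multiplier resolves to 128 ** -0.5 and the remaining multipliers to 1.0.
assert cfg.head_dim == 4096 // 32 == 128
assert cfg.attention_multiplier == cfg.head_dim ** -0.5
assert cfg.embedding_multiplier == 1.0
assert cfg.logits_scaling == 1.0
assert cfg.residual_multiplier == 1.0
```

Note that the released `config.json` overrides these defaults (e.g. `embedding_multiplier` 10.0, `logits_scaling` 0.125, `attention_multiplier` 1/128).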
generation_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 100257,
4
+ "eos_token_id": 100257,
5
+ "pad_token_id": 100257,
6
+ "transformers_version": "4.52.4",
7
+ "use_cache": false
8
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:305362a08287f1a07c8901b17569030003342501439732b35372f141488e624d
3
+ size 4831938472
model-00002-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ff239283211a4d509ae03a47d26eb81981a1e6629b8593e7fea6f7bc982cc7ee
3
+ size 4932899128
model-00003-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a87408267364edb5e0ad365c01d33272ecd767eef3a59dcae20b5c4e2b80d2d9
3
+ size 4932800744
model-00004-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:34be38a9cd94474e2145567c0d8d8e772e38811a6b7cc7ff21ad32b78cd82e28
3
+ size 4932899160
model-00005-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:04b93ef8634259a055d34eda87bc7f0c07493be1729979da93c7fd0b63b4cfc8
3
+ size 4932800784
model-00006-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0166c99db8788a872fe9677d88095116586807e61466caeffb1b317815025485
3
+ size 4932899168
model-00007-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3091a6c988544d07a79a12ac4e6327d0b80802af7463f657fa55706760a7b16
3
+ size 4932800784
model-00008-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:437804c0b8b0ac88e24253304a7c0d27c0e94ee54fdba2d6b6474f0fcfd71f77
3
+ size 4932899168
model-00009-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2bae7aa1fb3ec4ac0d8a9046a15b8674d80b1165ec423b66c3429416846bd076
3
+ size 4932800784
model-00010-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17ec60ba7881c3afa594a584bde88d806582eab0f7768a70549d377b06544c27
3
+ size 4932899168
model-00011-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1c2e182da134b1702e84be4f6bff0157182bb8870963b470da45acc4fd9102f5
3
+ size 4932800784
model-00012-of-00012.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c5ced46cb6d54cc2ace518803b1d350211b110560a7d841412b61d22b09f55e
3
+ size 4832061536
model.safetensors.index.json ADDED
@@ -0,0 +1,428 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 58992451584
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00012-of-00012.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00012.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00012.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00012.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00012.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00012.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00012.safetensors",
13
+ "model.layers.0.post_norm1.weight": "model-00001-of-00012.safetensors",
14
+ "model.layers.0.post_norm2.weight": "model-00001-of-00012.safetensors",
15
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00012.safetensors",
16
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00012.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00012.safetensors",
18
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00012.safetensors",
19
+ "model.layers.1.input_layernorm.weight": "model-00002-of-00012.safetensors",
20
+ "model.layers.1.mlp.down_proj.weight": "model-00002-of-00012.safetensors",
21
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00012.safetensors",
22
+ "model.layers.1.mlp.up_proj.weight": "model-00002-of-00012.safetensors",
23
+ "model.layers.1.post_attention_layernorm.weight": "model-00002-of-00012.safetensors",
24
+ "model.layers.1.post_norm1.weight": "model-00002-of-00012.safetensors",
25
+ "model.layers.1.post_norm2.weight": "model-00002-of-00012.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00012.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00012.safetensors",
28
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00012.safetensors",
29
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00012.safetensors",
30
+ "model.layers.10.input_layernorm.weight": "model-00004-of-00012.safetensors",
31
+ "model.layers.10.mlp.down_proj.weight": "model-00004-of-00012.safetensors",
32
+ "model.layers.10.mlp.gate_proj.weight": "model-00004-of-00012.safetensors",
33
+ "model.layers.10.mlp.up_proj.weight": "model-00004-of-00012.safetensors",
34
+ "model.layers.10.post_attention_layernorm.weight": "model-00004-of-00012.safetensors",
35
+ "model.layers.10.post_norm1.weight": "model-00004-of-00012.safetensors",
36
+ "model.layers.10.post_norm2.weight": "model-00004-of-00012.safetensors",
37
+ "model.layers.10.self_attn.k_proj.weight": "model-00004-of-00012.safetensors",
38
+ "model.layers.10.self_attn.o_proj.weight": "model-00004-of-00012.safetensors",
39
+ "model.layers.10.self_attn.q_proj.weight": "model-00004-of-00012.safetensors",
40
+ "model.layers.10.self_attn.v_proj.weight": "model-00004-of-00012.safetensors",
41
+ "model.layers.11.input_layernorm.weight": "model-00004-of-00012.safetensors",
42
+ "model.layers.11.mlp.down_proj.weight": "model-00004-of-00012.safetensors",
43
+ "model.layers.11.mlp.gate_proj.weight": "model-00004-of-00012.safetensors",
44
+ "model.layers.11.mlp.up_proj.weight": "model-00004-of-00012.safetensors",
45
+ "model.layers.11.post_attention_layernorm.weight": "model-00004-of-00012.safetensors",
46
+ "model.layers.11.post_norm1.weight": "model-00004-of-00012.safetensors",
47
+ "model.layers.11.post_norm2.weight": "model-00004-of-00012.safetensors",
48
+ "model.layers.11.self_attn.k_proj.weight": "model-00004-of-00012.safetensors",
49
+ "model.layers.11.self_attn.o_proj.weight": "model-00004-of-00012.safetensors",
50
+ "model.layers.11.self_attn.q_proj.weight": "model-00004-of-00012.safetensors",
51
+ "model.layers.11.self_attn.v_proj.weight": "model-00004-of-00012.safetensors",
52
+ "model.layers.12.input_layernorm.weight": "model-00005-of-00012.safetensors",
53
+ "model.layers.12.mlp.down_proj.weight": "model-00005-of-00012.safetensors",
54
+ "model.layers.12.mlp.gate_proj.weight": "model-00005-of-00012.safetensors",
55
+ "model.layers.12.mlp.up_proj.weight": "model-00005-of-00012.safetensors",
56
+ "model.layers.12.post_attention_layernorm.weight": "model-00005-of-00012.safetensors",
57
+ "model.layers.12.post_norm1.weight": "model-00005-of-00012.safetensors",
58
+ "model.layers.12.post_norm2.weight": "model-00005-of-00012.safetensors",
59
+ "model.layers.12.self_attn.k_proj.weight": "model-00005-of-00012.safetensors",
60
+ "model.layers.12.self_attn.o_proj.weight": "model-00005-of-00012.safetensors",
61
+ "model.layers.12.self_attn.q_proj.weight": "model-00005-of-00012.safetensors",
62
+ "model.layers.12.self_attn.v_proj.weight": "model-00005-of-00012.safetensors",
63
+ "model.layers.13.input_layernorm.weight": "model-00005-of-00012.safetensors",
64
+ "model.layers.13.mlp.down_proj.weight": "model-00005-of-00012.safetensors",
65
+ "model.layers.13.mlp.gate_proj.weight": "model-00005-of-00012.safetensors",
66
+ "model.layers.13.mlp.up_proj.weight": "model-00005-of-00012.safetensors",
67
+ "model.layers.13.post_attention_layernorm.weight": "model-00005-of-00012.safetensors",
68
+ "model.layers.13.post_norm1.weight": "model-00005-of-00012.safetensors",
69
+ "model.layers.13.post_norm2.weight": "model-00005-of-00012.safetensors",
70
+ "model.layers.13.self_attn.k_proj.weight": "model-00005-of-00012.safetensors",
71
+ "model.layers.13.self_attn.o_proj.weight": "model-00005-of-00012.safetensors",
72
+ "model.layers.13.self_attn.q_proj.weight": "model-00005-of-00012.safetensors",
73
+ "model.layers.13.self_attn.v_proj.weight": "model-00005-of-00012.safetensors",
74
+ "model.layers.14.input_layernorm.weight": "model-00005-of-00012.safetensors",
75
+ "model.layers.14.mlp.down_proj.weight": "model-00005-of-00012.safetensors",
76
+ "model.layers.14.mlp.gate_proj.weight": "model-00005-of-00012.safetensors",
77
+ "model.layers.14.mlp.up_proj.weight": "model-00005-of-00012.safetensors",
78
+ "model.layers.14.post_attention_layernorm.weight": "model-00005-of-00012.safetensors",
79
+ "model.layers.14.post_norm1.weight": "model-00005-of-00012.safetensors",
80
+ "model.layers.14.post_norm2.weight": "model-00005-of-00012.safetensors",
81
+ "model.layers.14.self_attn.k_proj.weight": "model-00005-of-00012.safetensors",
82
+ "model.layers.14.self_attn.o_proj.weight": "model-00005-of-00012.safetensors",
83
+ "model.layers.14.self_attn.q_proj.weight": "model-00005-of-00012.safetensors",
84
+ "model.layers.14.self_attn.v_proj.weight": "model-00005-of-00012.safetensors",
85
+ "model.layers.15.input_layernorm.weight": "model-00006-of-00012.safetensors",
86
+ "model.layers.15.mlp.down_proj.weight": "model-00006-of-00012.safetensors",
87
+ "model.layers.15.mlp.gate_proj.weight": "model-00005-of-00012.safetensors",
88
+ "model.layers.15.mlp.up_proj.weight": "model-00006-of-00012.safetensors",
89
+ "model.layers.15.post_attention_layernorm.weight": "model-00006-of-00012.safetensors",
90
+ "model.layers.15.post_norm1.weight": "model-00006-of-00012.safetensors",
91
+ "model.layers.15.post_norm2.weight": "model-00006-of-00012.safetensors",
92
+ "model.layers.15.self_attn.k_proj.weight": "model-00005-of-00012.safetensors",
93
+ "model.layers.15.self_attn.o_proj.weight": "model-00005-of-00012.safetensors",
94
+ "model.layers.15.self_attn.q_proj.weight": "model-00005-of-00012.safetensors",
95
+ "model.layers.15.self_attn.v_proj.weight": "model-00005-of-00012.safetensors",
96
+ "model.layers.16.input_layernorm.weight": "model-00006-of-00012.safetensors",
97
+ "model.layers.16.mlp.down_proj.weight": "model-00006-of-00012.safetensors",
98
+ "model.layers.16.mlp.gate_proj.weight": "model-00006-of-00012.safetensors",
99
+ "model.layers.16.mlp.up_proj.weight": "model-00006-of-00012.safetensors",
100
+ "model.layers.16.post_attention_layernorm.weight": "model-00006-of-00012.safetensors",
101
+ "model.layers.16.post_norm1.weight": "model-00006-of-00012.safetensors",
102
+ "model.layers.16.post_norm2.weight": "model-00006-of-00012.safetensors",
103
+ "model.layers.16.self_attn.k_proj.weight": "model-00006-of-00012.safetensors",
104
+ "model.layers.16.self_attn.o_proj.weight": "model-00006-of-00012.safetensors",
105
+ "model.layers.16.self_attn.q_proj.weight": "model-00006-of-00012.safetensors",
106
+ "model.layers.16.self_attn.v_proj.weight": "model-00006-of-00012.safetensors",
107
+ "model.layers.17.input_layernorm.weight": "model-00006-of-00012.safetensors",
108
+ "model.layers.17.mlp.down_proj.weight": "model-00006-of-00012.safetensors",
109
+ "model.layers.17.mlp.gate_proj.weight": "model-00006-of-00012.safetensors",
110
+ "model.layers.17.mlp.up_proj.weight": "model-00006-of-00012.safetensors",
111
+ "model.layers.17.post_attention_layernorm.weight": "model-00006-of-00012.safetensors",
112
+ "model.layers.17.post_norm1.weight": "model-00006-of-00012.safetensors",
113
+ "model.layers.17.post_norm2.weight": "model-00006-of-00012.safetensors",
114
+ "model.layers.17.self_attn.k_proj.weight": "model-00006-of-00012.safetensors",
115
+ "model.layers.17.self_attn.o_proj.weight": "model-00006-of-00012.safetensors",
116
+ "model.layers.17.self_attn.q_proj.weight": "model-00006-of-00012.safetensors",
117
+ "model.layers.17.self_attn.v_proj.weight": "model-00006-of-00012.safetensors",
118
+ "model.layers.18.input_layernorm.weight": "model-00006-of-00012.safetensors",
119
+ "model.layers.18.mlp.down_proj.weight": "model-00006-of-00012.safetensors",
120
+ "model.layers.18.mlp.gate_proj.weight": "model-00006-of-00012.safetensors",
121
+ "model.layers.18.mlp.up_proj.weight": "model-00006-of-00012.safetensors",
122
+ "model.layers.18.post_attention_layernorm.weight": "model-00006-of-00012.safetensors",
123
+ "model.layers.18.post_norm1.weight": "model-00006-of-00012.safetensors",
124
+ "model.layers.18.post_norm2.weight": "model-00006-of-00012.safetensors",
125
+ "model.layers.18.self_attn.k_proj.weight": "model-00006-of-00012.safetensors",
126
+ "model.layers.18.self_attn.o_proj.weight": "model-00006-of-00012.safetensors",
127
+ "model.layers.18.self_attn.q_proj.weight": "model-00006-of-00012.safetensors",
128
+ "model.layers.18.self_attn.v_proj.weight": "model-00006-of-00012.safetensors",
129
+ "model.layers.19.input_layernorm.weight": "model-00007-of-00012.safetensors",
130
+ "model.layers.19.mlp.down_proj.weight": "model-00007-of-00012.safetensors",
131
+ "model.layers.19.mlp.gate_proj.weight": "model-00007-of-00012.safetensors",
132
+ "model.layers.19.mlp.up_proj.weight": "model-00007-of-00012.safetensors",
133
+ "model.layers.19.post_attention_layernorm.weight": "model-00007-of-00012.safetensors",
134
+ "model.layers.19.post_norm1.weight": "model-00007-of-00012.safetensors",
135
+ "model.layers.19.post_norm2.weight": "model-00007-of-00012.safetensors",
136
+ "model.layers.19.self_attn.k_proj.weight": "model-00007-of-00012.safetensors",
137
+ "model.layers.19.self_attn.o_proj.weight": "model-00007-of-00012.safetensors",
138
+ "model.layers.19.self_attn.q_proj.weight": "model-00007-of-00012.safetensors",
139
+ "model.layers.19.self_attn.v_proj.weight": "model-00007-of-00012.safetensors",
140
+ "model.layers.2.input_layernorm.weight": "model-00002-of-00012.safetensors",
141
+ "model.layers.2.mlp.down_proj.weight": "model-00002-of-00012.safetensors",
142
+ "model.layers.2.mlp.gate_proj.weight": "model-00002-of-00012.safetensors",
143
+ "model.layers.2.mlp.up_proj.weight": "model-00002-of-00012.safetensors",
144
+ "model.layers.2.post_attention_layernorm.weight": "model-00002-of-00012.safetensors",
145
+ "model.layers.2.post_norm1.weight": "model-00002-of-00012.safetensors",
146
+ "model.layers.2.post_norm2.weight": "model-00002-of-00012.safetensors",
147
+ "model.layers.2.self_attn.k_proj.weight": "model-00002-of-00012.safetensors",
148
+ "model.layers.2.self_attn.o_proj.weight": "model-00002-of-00012.safetensors",
149
+ "model.layers.2.self_attn.q_proj.weight": "model-00002-of-00012.safetensors",
150
+ "model.layers.2.self_attn.v_proj.weight": "model-00002-of-00012.safetensors",
151
+ "model.layers.20.input_layernorm.weight": "model-00007-of-00012.safetensors",
152
+ "model.layers.20.mlp.down_proj.weight": "model-00007-of-00012.safetensors",
153
+ "model.layers.20.mlp.gate_proj.weight": "model-00007-of-00012.safetensors",
154
+ "model.layers.20.mlp.up_proj.weight": "model-00007-of-00012.safetensors",
155
+ "model.layers.20.post_attention_layernorm.weight": "model-00007-of-00012.safetensors",
156
+ "model.layers.20.post_norm1.weight": "model-00007-of-00012.safetensors",
157
+ "model.layers.20.post_norm2.weight": "model-00007-of-00012.safetensors",
158
+ "model.layers.20.self_attn.k_proj.weight": "model-00007-of-00012.safetensors",
159
+ "model.layers.20.self_attn.o_proj.weight": "model-00007-of-00012.safetensors",
160
+ "model.layers.20.self_attn.q_proj.weight": "model-00007-of-00012.safetensors",
161
+ "model.layers.20.self_attn.v_proj.weight": "model-00007-of-00012.safetensors",
162
+ "model.layers.21.input_layernorm.weight": "model-00007-of-00012.safetensors",
163
+ "model.layers.21.mlp.down_proj.weight": "model-00007-of-00012.safetensors",
164
+ "model.layers.21.mlp.gate_proj.weight": "model-00007-of-00012.safetensors",
165
+ "model.layers.21.mlp.up_proj.weight": "model-00007-of-00012.safetensors",
166
+ "model.layers.21.post_attention_layernorm.weight": "model-00007-of-00012.safetensors",
167
+ "model.layers.21.post_norm1.weight": "model-00007-of-00012.safetensors",
168
+ "model.layers.21.post_norm2.weight": "model-00007-of-00012.safetensors",
169
+ "model.layers.21.self_attn.k_proj.weight": "model-00007-of-00012.safetensors",
170
+ "model.layers.21.self_attn.o_proj.weight": "model-00007-of-00012.safetensors",
171
+ "model.layers.21.self_attn.q_proj.weight": "model-00007-of-00012.safetensors",
172
+ "model.layers.21.self_attn.v_proj.weight": "model-00007-of-00012.safetensors",
173
+ "model.layers.22.input_layernorm.weight": "model-00008-of-00012.safetensors",
174
+ "model.layers.22.mlp.down_proj.weight": "model-00008-of-00012.safetensors",
175
+ "model.layers.22.mlp.gate_proj.weight": "model-00007-of-00012.safetensors",
176
+ "model.layers.22.mlp.up_proj.weight": "model-00008-of-00012.safetensors",
177
+ "model.layers.22.post_attention_layernorm.weight": "model-00008-of-00012.safetensors",
178
+ "model.layers.22.post_norm1.weight": "model-00008-of-00012.safetensors",
179
+ "model.layers.22.post_norm2.weight": "model-00008-of-00012.safetensors",
180
+ "model.layers.22.self_attn.k_proj.weight": "model-00007-of-00012.safetensors",
181
+ "model.layers.22.self_attn.o_proj.weight": "model-00007-of-00012.safetensors",
182
+ "model.layers.22.self_attn.q_proj.weight": "model-00007-of-00012.safetensors",
183
+ "model.layers.22.self_attn.v_proj.weight": "model-00007-of-00012.safetensors",
184
+ "model.layers.23.input_layernorm.weight": "model-00008-of-00012.safetensors",
185
+ "model.layers.23.mlp.down_proj.weight": "model-00008-of-00012.safetensors",
186
+ "model.layers.23.mlp.gate_proj.weight": "model-00008-of-00012.safetensors",
187
+ "model.layers.23.mlp.up_proj.weight": "model-00008-of-00012.safetensors",
188
+ "model.layers.23.post_attention_layernorm.weight": "model-00008-of-00012.safetensors",
189
+ "model.layers.23.post_norm1.weight": "model-00008-of-00012.safetensors",
190
+ "model.layers.23.post_norm2.weight": "model-00008-of-00012.safetensors",
191
+ "model.layers.23.self_attn.k_proj.weight": "model-00008-of-00012.safetensors",
192
+ "model.layers.23.self_attn.o_proj.weight": "model-00008-of-00012.safetensors",
193
+ "model.layers.23.self_attn.q_proj.weight": "model-00008-of-00012.safetensors",
194
+ "model.layers.23.self_attn.v_proj.weight": "model-00008-of-00012.safetensors",
195
+ "model.layers.24.input_layernorm.weight": "model-00008-of-00012.safetensors",
196
+ "model.layers.24.mlp.down_proj.weight": "model-00008-of-00012.safetensors",
197
+ "model.layers.24.mlp.gate_proj.weight": "model-00008-of-00012.safetensors",
198
+ "model.layers.24.mlp.up_proj.weight": "model-00008-of-00012.safetensors",
199
+ "model.layers.24.post_attention_layernorm.weight": "model-00008-of-00012.safetensors",
200
+ "model.layers.24.post_norm1.weight": "model-00008-of-00012.safetensors",
201
+ "model.layers.24.post_norm2.weight": "model-00008-of-00012.safetensors",
202
+ "model.layers.24.self_attn.k_proj.weight": "model-00008-of-00012.safetensors",
203
+ "model.layers.24.self_attn.o_proj.weight": "model-00008-of-00012.safetensors",
204
+ "model.layers.24.self_attn.q_proj.weight": "model-00008-of-00012.safetensors",
205
+ "model.layers.24.self_attn.v_proj.weight": "model-00008-of-00012.safetensors",
206
+ "model.layers.25.input_layernorm.weight": "model-00008-of-00012.safetensors",
207
+ "model.layers.25.mlp.down_proj.weight": "model-00008-of-00012.safetensors",
208
+ "model.layers.25.mlp.gate_proj.weight": "model-00008-of-00012.safetensors",
209
+ "model.layers.25.mlp.up_proj.weight": "model-00008-of-00012.safetensors",
210
+ "model.layers.25.post_attention_layernorm.weight": "model-00008-of-00012.safetensors",
211
+ "model.layers.25.post_norm1.weight": "model-00008-of-00012.safetensors",
212
+ "model.layers.25.post_norm2.weight": "model-00008-of-00012.safetensors",
213
+ "model.layers.25.self_attn.k_proj.weight": "model-00008-of-00012.safetensors",
214
+ "model.layers.25.self_attn.o_proj.weight": "model-00008-of-00012.safetensors",
215
+ "model.layers.25.self_attn.q_proj.weight": "model-00008-of-00012.safetensors",
216
+ "model.layers.25.self_attn.v_proj.weight": "model-00008-of-00012.safetensors",
217
+ "model.layers.26.input_layernorm.weight": "model-00009-of-00012.safetensors",
218
+ "model.layers.26.mlp.down_proj.weight": "model-00009-of-00012.safetensors",
219
+ "model.layers.26.mlp.gate_proj.weight": "model-00009-of-00012.safetensors",
220
+ "model.layers.26.mlp.up_proj.weight": "model-00009-of-00012.safetensors",
221
+ "model.layers.26.post_attention_layernorm.weight": "model-00009-of-00012.safetensors",
222
+ "model.layers.26.post_norm1.weight": "model-00009-of-00012.safetensors",
223
+ "model.layers.26.post_norm2.weight": "model-00009-of-00012.safetensors",
224
+ "model.layers.26.self_attn.k_proj.weight": "model-00009-of-00012.safetensors",
225
+ "model.layers.26.self_attn.o_proj.weight": "model-00009-of-00012.safetensors",
226
+ "model.layers.26.self_attn.q_proj.weight": "model-00009-of-00012.safetensors",
227
+ "model.layers.26.self_attn.v_proj.weight": "model-00009-of-00012.safetensors",
228
+ "model.layers.27.input_layernorm.weight": "model-00009-of-00012.safetensors",
229
+ "model.layers.27.mlp.down_proj.weight": "model-00009-of-00012.safetensors",
230
+ "model.layers.27.mlp.gate_proj.weight": "model-00009-of-00012.safetensors",
231
+ "model.layers.27.mlp.up_proj.weight": "model-00009-of-00012.safetensors",
232
+ "model.layers.27.post_attention_layernorm.weight": "model-00009-of-00012.safetensors",
233
+ "model.layers.27.post_norm1.weight": "model-00009-of-00012.safetensors",
234
+ "model.layers.27.post_norm2.weight": "model-00009-of-00012.safetensors",
235
+ "model.layers.27.self_attn.k_proj.weight": "model-00009-of-00012.safetensors",
236
+ "model.layers.27.self_attn.o_proj.weight": "model-00009-of-00012.safetensors",
237
+ "model.layers.27.self_attn.q_proj.weight": "model-00009-of-00012.safetensors",
238
+ "model.layers.27.self_attn.v_proj.weight": "model-00009-of-00012.safetensors",
239
+ "model.layers.28.input_layernorm.weight": "model-00009-of-00012.safetensors",
240
+ "model.layers.28.mlp.down_proj.weight": "model-00009-of-00012.safetensors",
241
+ "model.layers.28.mlp.gate_proj.weight": "model-00009-of-00012.safetensors",
242
+ "model.layers.28.mlp.up_proj.weight": "model-00009-of-00012.safetensors",
243
+ "model.layers.28.post_attention_layernorm.weight": "model-00009-of-00012.safetensors",
244
+ "model.layers.28.post_norm1.weight": "model-00009-of-00012.safetensors",
245
+ "model.layers.28.post_norm2.weight": "model-00009-of-00012.safetensors",
246
+ "model.layers.28.self_attn.k_proj.weight": "model-00009-of-00012.safetensors",
247
+ "model.layers.28.self_attn.o_proj.weight": "model-00009-of-00012.safetensors",
248
+ "model.layers.28.self_attn.q_proj.weight": "model-00009-of-00012.safetensors",
249
+ "model.layers.28.self_attn.v_proj.weight": "model-00009-of-00012.safetensors",
250
+ "model.layers.29.input_layernorm.weight": "model-00010-of-00012.safetensors",
251
+ "model.layers.29.mlp.down_proj.weight": "model-00010-of-00012.safetensors",
252
+ "model.layers.29.mlp.gate_proj.weight": "model-00009-of-00012.safetensors",
253
+ "model.layers.29.mlp.up_proj.weight": "model-00010-of-00012.safetensors",
254
+ "model.layers.29.post_attention_layernorm.weight": "model-00010-of-00012.safetensors",
255
+ "model.layers.29.post_norm1.weight": "model-00010-of-00012.safetensors",
256
+ "model.layers.29.post_norm2.weight": "model-00010-of-00012.safetensors",
257
+ "model.layers.29.self_attn.k_proj.weight": "model-00009-of-00012.safetensors",
258
+ "model.layers.29.self_attn.o_proj.weight": "model-00009-of-00012.safetensors",
259
+ "model.layers.29.self_attn.q_proj.weight": "model-00009-of-00012.safetensors",
260
+ "model.layers.29.self_attn.v_proj.weight": "model-00009-of-00012.safetensors",
261
+ "model.layers.3.input_layernorm.weight": "model-00002-of-00012.safetensors",
262
+ "model.layers.3.mlp.down_proj.weight": "model-00002-of-00012.safetensors",
263
+ "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00012.safetensors",
264
+ "model.layers.3.mlp.up_proj.weight": "model-00002-of-00012.safetensors",
265
+ "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00012.safetensors",
266
+ "model.layers.3.post_norm1.weight": "model-00002-of-00012.safetensors",
267
+ "model.layers.3.post_norm2.weight": "model-00002-of-00012.safetensors",
268
+ "model.layers.3.self_attn.k_proj.weight": "model-00002-of-00012.safetensors",
269
+ "model.layers.3.self_attn.o_proj.weight": "model-00002-of-00012.safetensors",
270
+ "model.layers.3.self_attn.q_proj.weight": "model-00002-of-00012.safetensors",
271
+ "model.layers.3.self_attn.v_proj.weight": "model-00002-of-00012.safetensors",
272
+ "model.layers.30.input_layernorm.weight": "model-00010-of-00012.safetensors",
273
+ "model.layers.30.mlp.down_proj.weight": "model-00010-of-00012.safetensors",
274
+ "model.layers.30.mlp.gate_proj.weight": "model-00010-of-00012.safetensors",
275
+ "model.layers.30.mlp.up_proj.weight": "model-00010-of-00012.safetensors",
276
+ "model.layers.30.post_attention_layernorm.weight": "model-00010-of-00012.safetensors",
277
+ "model.layers.30.post_norm1.weight": "model-00010-of-00012.safetensors",
278
+ "model.layers.30.post_norm2.weight": "model-00010-of-00012.safetensors",
279
+ "model.layers.30.self_attn.k_proj.weight": "model-00010-of-00012.safetensors",
280
+ "model.layers.30.self_attn.o_proj.weight": "model-00010-of-00012.safetensors",
281
+ "model.layers.30.self_attn.q_proj.weight": "model-00010-of-00012.safetensors",
282
+ "model.layers.30.self_attn.v_proj.weight": "model-00010-of-00012.safetensors",
283
+ "model.layers.31.input_layernorm.weight": "model-00010-of-00012.safetensors",
284
+ "model.layers.31.mlp.down_proj.weight": "model-00010-of-00012.safetensors",
285
+ "model.layers.31.mlp.gate_proj.weight": "model-00010-of-00012.safetensors",
286
+ "model.layers.31.mlp.up_proj.weight": "model-00010-of-00012.safetensors",
287
+ "model.layers.31.post_attention_layernorm.weight": "model-00010-of-00012.safetensors",
288
+ "model.layers.31.post_norm1.weight": "model-00010-of-00012.safetensors",
289
+ "model.layers.31.post_norm2.weight": "model-00010-of-00012.safetensors",
290
+ "model.layers.31.self_attn.k_proj.weight": "model-00010-of-00012.safetensors",
291
+ "model.layers.31.self_attn.o_proj.weight": "model-00010-of-00012.safetensors",
292
+ "model.layers.31.self_attn.q_proj.weight": "model-00010-of-00012.safetensors",
293
+ "model.layers.31.self_attn.v_proj.weight": "model-00010-of-00012.safetensors",
294
+ "model.layers.32.input_layernorm.weight": "model-00010-of-00012.safetensors",
295
+ "model.layers.32.mlp.down_proj.weight": "model-00010-of-00012.safetensors",
296
+ "model.layers.32.mlp.gate_proj.weight": "model-00010-of-00012.safetensors",
297
+ "model.layers.32.mlp.up_proj.weight": "model-00010-of-00012.safetensors",
298
+ "model.layers.32.post_attention_layernorm.weight": "model-00010-of-00012.safetensors",
299
+ "model.layers.32.post_norm1.weight": "model-00010-of-00012.safetensors",
300
+ "model.layers.32.post_norm2.weight": "model-00010-of-00012.safetensors",
301
+ "model.layers.32.self_attn.k_proj.weight": "model-00010-of-00012.safetensors",
302
+ "model.layers.32.self_attn.o_proj.weight": "model-00010-of-00012.safetensors",
303
+ "model.layers.32.self_attn.q_proj.weight": "model-00010-of-00012.safetensors",
304
+ "model.layers.32.self_attn.v_proj.weight": "model-00010-of-00012.safetensors",
305
+ "model.layers.33.input_layernorm.weight": "model-00011-of-00012.safetensors",
306
+ "model.layers.33.mlp.down_proj.weight": "model-00011-of-00012.safetensors",
307
+ "model.layers.33.mlp.gate_proj.weight": "model-00011-of-00012.safetensors",
308
+ "model.layers.33.mlp.up_proj.weight": "model-00011-of-00012.safetensors",
309
+ "model.layers.33.post_attention_layernorm.weight": "model-00011-of-00012.safetensors",
310
+ "model.layers.33.post_norm1.weight": "model-00011-of-00012.safetensors",
311
+ "model.layers.33.post_norm2.weight": "model-00011-of-00012.safetensors",
312
+ "model.layers.33.self_attn.k_proj.weight": "model-00011-of-00012.safetensors",
313
+ "model.layers.33.self_attn.o_proj.weight": "model-00011-of-00012.safetensors",
314
+ "model.layers.33.self_attn.q_proj.weight": "model-00011-of-00012.safetensors",
315
+ "model.layers.33.self_attn.v_proj.weight": "model-00011-of-00012.safetensors",
316
+ "model.layers.34.input_layernorm.weight": "model-00011-of-00012.safetensors",
317
+ "model.layers.34.mlp.down_proj.weight": "model-00011-of-00012.safetensors",
318
+ "model.layers.34.mlp.gate_proj.weight": "model-00011-of-00012.safetensors",
319
+ "model.layers.34.mlp.up_proj.weight": "model-00011-of-00012.safetensors",
320
+ "model.layers.34.post_attention_layernorm.weight": "model-00011-of-00012.safetensors",
321
+ "model.layers.34.post_norm1.weight": "model-00011-of-00012.safetensors",
322
+ "model.layers.34.post_norm2.weight": "model-00011-of-00012.safetensors",
323
+ "model.layers.34.self_attn.k_proj.weight": "model-00011-of-00012.safetensors",
324
+ "model.layers.34.self_attn.o_proj.weight": "model-00011-of-00012.safetensors",
325
+ "model.layers.34.self_attn.q_proj.weight": "model-00011-of-00012.safetensors",
326
+ "model.layers.34.self_attn.v_proj.weight": "model-00011-of-00012.safetensors",
327
+ "model.layers.35.input_layernorm.weight": "model-00011-of-00012.safetensors",
328
+ "model.layers.35.mlp.down_proj.weight": "model-00011-of-00012.safetensors",
329
+ "model.layers.35.mlp.gate_proj.weight": "model-00011-of-00012.safetensors",
330
+ "model.layers.35.mlp.up_proj.weight": "model-00011-of-00012.safetensors",
331
+ "model.layers.35.post_attention_layernorm.weight": "model-00011-of-00012.safetensors",
332
+ "model.layers.35.post_norm1.weight": "model-00011-of-00012.safetensors",
333
+ "model.layers.35.post_norm2.weight": "model-00011-of-00012.safetensors",
334
+ "model.layers.35.self_attn.k_proj.weight": "model-00011-of-00012.safetensors",
335
+ "model.layers.35.self_attn.o_proj.weight": "model-00011-of-00012.safetensors",
336
+ "model.layers.35.self_attn.q_proj.weight": "model-00011-of-00012.safetensors",
337
+ "model.layers.35.self_attn.v_proj.weight": "model-00011-of-00012.safetensors",
338
+ "model.layers.36.input_layernorm.weight": "model-00012-of-00012.safetensors",
339
+ "model.layers.36.mlp.down_proj.weight": "model-00012-of-00012.safetensors",
340
+ "model.layers.36.mlp.gate_proj.weight": "model-00011-of-00012.safetensors",
341
+ "model.layers.36.mlp.up_proj.weight": "model-00012-of-00012.safetensors",
342
+ "model.layers.36.post_attention_layernorm.weight": "model-00012-of-00012.safetensors",
343
+ "model.layers.36.post_norm1.weight": "model-00012-of-00012.safetensors",
344
+ "model.layers.36.post_norm2.weight": "model-00012-of-00012.safetensors",
345
+ "model.layers.36.self_attn.k_proj.weight": "model-00011-of-00012.safetensors",
346
+ "model.layers.36.self_attn.o_proj.weight": "model-00011-of-00012.safetensors",
347
+ "model.layers.36.self_attn.q_proj.weight": "model-00011-of-00012.safetensors",
348
+ "model.layers.36.self_attn.v_proj.weight": "model-00011-of-00012.safetensors",
349
+ "model.layers.37.input_layernorm.weight": "model-00012-of-00012.safetensors",
350
+ "model.layers.37.mlp.down_proj.weight": "model-00012-of-00012.safetensors",
351
+ "model.layers.37.mlp.gate_proj.weight": "model-00012-of-00012.safetensors",
352
+ "model.layers.37.mlp.up_proj.weight": "model-00012-of-00012.safetensors",
353
+ "model.layers.37.post_attention_layernorm.weight": "model-00012-of-00012.safetensors",
354
+ "model.layers.37.post_norm1.weight": "model-00012-of-00012.safetensors",
355
+ "model.layers.37.post_norm2.weight": "model-00012-of-00012.safetensors",
356
+ "model.layers.37.self_attn.k_proj.weight": "model-00012-of-00012.safetensors",
357
+ "model.layers.37.self_attn.o_proj.weight": "model-00012-of-00012.safetensors",
358
+ "model.layers.37.self_attn.q_proj.weight": "model-00012-of-00012.safetensors",
359
+ "model.layers.37.self_attn.v_proj.weight": "model-00012-of-00012.safetensors",
360
+ "model.layers.4.input_layernorm.weight": "model-00002-of-00012.safetensors",
361
+ "model.layers.4.mlp.down_proj.weight": "model-00002-of-00012.safetensors",
362
+ "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00012.safetensors",
363
+ "model.layers.4.mlp.up_proj.weight": "model-00002-of-00012.safetensors",
364
+ "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00012.safetensors",
365
+ "model.layers.4.post_norm1.weight": "model-00002-of-00012.safetensors",
366
+ "model.layers.4.post_norm2.weight": "model-00002-of-00012.safetensors",
367
+ "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00012.safetensors",
368
+ "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00012.safetensors",
369
+ "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00012.safetensors",
370
+ "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00012.safetensors",
371
+ "model.layers.5.input_layernorm.weight": "model-00003-of-00012.safetensors",
372
+ "model.layers.5.mlp.down_proj.weight": "model-00003-of-00012.safetensors",
373
+ "model.layers.5.mlp.gate_proj.weight": "model-00003-of-00012.safetensors",
374
+ "model.layers.5.mlp.up_proj.weight": "model-00003-of-00012.safetensors",
375
+ "model.layers.5.post_attention_layernorm.weight": "model-00003-of-00012.safetensors",
376
+ "model.layers.5.post_norm1.weight": "model-00003-of-00012.safetensors",
377
+ "model.layers.5.post_norm2.weight": "model-00003-of-00012.safetensors",
378
+ "model.layers.5.self_attn.k_proj.weight": "model-00003-of-00012.safetensors",
379
+ "model.layers.5.self_attn.o_proj.weight": "model-00003-of-00012.safetensors",
380
+ "model.layers.5.self_attn.q_proj.weight": "model-00003-of-00012.safetensors",
381
+ "model.layers.5.self_attn.v_proj.weight": "model-00003-of-00012.safetensors",
382
+ "model.layers.6.input_layernorm.weight": "model-00003-of-00012.safetensors",
383
+ "model.layers.6.mlp.down_proj.weight": "model-00003-of-00012.safetensors",
384
+ "model.layers.6.mlp.gate_proj.weight": "model-00003-of-00012.safetensors",
385
+ "model.layers.6.mlp.up_proj.weight": "model-00003-of-00012.safetensors",
386
+ "model.layers.6.post_attention_layernorm.weight": "model-00003-of-00012.safetensors",
387
+ "model.layers.6.post_norm1.weight": "model-00003-of-00012.safetensors",
388
+ "model.layers.6.post_norm2.weight": "model-00003-of-00012.safetensors",
389
+ "model.layers.6.self_attn.k_proj.weight": "model-00003-of-00012.safetensors",
390
+ "model.layers.6.self_attn.o_proj.weight": "model-00003-of-00012.safetensors",
391
+ "model.layers.6.self_attn.q_proj.weight": "model-00003-of-00012.safetensors",
392
+ "model.layers.6.self_attn.v_proj.weight": "model-00003-of-00012.safetensors",
393
+ "model.layers.7.input_layernorm.weight": "model-00003-of-00012.safetensors",
394
+ "model.layers.7.mlp.down_proj.weight": "model-00003-of-00012.safetensors",
395
+ "model.layers.7.mlp.gate_proj.weight": "model-00003-of-00012.safetensors",
396
+ "model.layers.7.mlp.up_proj.weight": "model-00003-of-00012.safetensors",
397
+ "model.layers.7.post_attention_layernorm.weight": "model-00003-of-00012.safetensors",
398
+ "model.layers.7.post_norm1.weight": "model-00003-of-00012.safetensors",
399
+ "model.layers.7.post_norm2.weight": "model-00003-of-00012.safetensors",
400
+ "model.layers.7.self_attn.k_proj.weight": "model-00003-of-00012.safetensors",
401
+ "model.layers.7.self_attn.o_proj.weight": "model-00003-of-00012.safetensors",
402
+ "model.layers.7.self_attn.q_proj.weight": "model-00003-of-00012.safetensors",
403
+ "model.layers.7.self_attn.v_proj.weight": "model-00003-of-00012.safetensors",
404
+ "model.layers.8.input_layernorm.weight": "model-00004-of-00012.safetensors",
405
+ "model.layers.8.mlp.down_proj.weight": "model-00004-of-00012.safetensors",
406
+ "model.layers.8.mlp.gate_proj.weight": "model-00003-of-00012.safetensors",
407
+ "model.layers.8.mlp.up_proj.weight": "model-00004-of-00012.safetensors",
408
+ "model.layers.8.post_attention_layernorm.weight": "model-00004-of-00012.safetensors",
409
+ "model.layers.8.post_norm1.weight": "model-00004-of-00012.safetensors",
410
+ "model.layers.8.post_norm2.weight": "model-00004-of-00012.safetensors",
411
+ "model.layers.8.self_attn.k_proj.weight": "model-00003-of-00012.safetensors",
412
+ "model.layers.8.self_attn.o_proj.weight": "model-00003-of-00012.safetensors",
413
+ "model.layers.8.self_attn.q_proj.weight": "model-00003-of-00012.safetensors",
414
+ "model.layers.8.self_attn.v_proj.weight": "model-00003-of-00012.safetensors",
415
+ "model.layers.9.input_layernorm.weight": "model-00004-of-00012.safetensors",
416
+ "model.layers.9.mlp.down_proj.weight": "model-00004-of-00012.safetensors",
417
+ "model.layers.9.mlp.gate_proj.weight": "model-00004-of-00012.safetensors",
418
+ "model.layers.9.mlp.up_proj.weight": "model-00004-of-00012.safetensors",
419
+ "model.layers.9.post_attention_layernorm.weight": "model-00004-of-00012.safetensors",
420
+ "model.layers.9.post_norm1.weight": "model-00004-of-00012.safetensors",
421
+ "model.layers.9.post_norm2.weight": "model-00004-of-00012.safetensors",
422
+ "model.layers.9.self_attn.k_proj.weight": "model-00004-of-00012.safetensors",
423
+ "model.layers.9.self_attn.o_proj.weight": "model-00004-of-00012.safetensors",
424
+ "model.layers.9.self_attn.q_proj.weight": "model-00004-of-00012.safetensors",
425
+ "model.layers.9.self_attn.v_proj.weight": "model-00004-of-00012.safetensors",
426
+ "model.norm.weight": "model-00012-of-00012.safetensors"
427
+ }
428
+ }
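
The index maps every parameter name to the shard that stores it, so a single tensor can be loaded without reading all twelve shards. This matters here because the float32 checkpoint totals roughly 59 GB. A minimal sketch, assuming a local checkout of this repository with the LFS weights pulled:

```python
import json
from pathlib import Path

from safetensors import safe_open

MODEL_DIR = Path(".")  # assumed: a local checkout of this repository with the LFS weights pulled

# 1) Find which shard stores a given parameter.
index = json.loads((MODEL_DIR / "model.safetensors.index.json").read_text())
name = "model.layers.0.self_attn.q_proj.weight"
shard = index["weight_map"][name]  # "model-00001-of-00012.safetensors" for this tensor

# 2) Load only that tensor from its shard.
with safe_open(str(MODEL_DIR / shard), framework="pt", device="cpu") as f:
    tensor = f.get_tensor(name)

print(shard, tuple(tensor.shape))
```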
modeling_hyperclovax.py ADDED
@@ -0,0 +1,979 @@
1
+ # coding=utf-8
2
+ # This file was created for the HyperCLOVA X SEED 14B Think architecture.
3
+ # partially copied and modified from https://github.com/huggingface/transformers
4
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
5
+ #
6
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
7
+ # and OPT implementations in this library. It has been modified from its
8
+ # original forms to accommodate minor architectural differences compared
9
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ from typing import Callable, Optional, Union
23
+
24
+ import torch
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+
28
+ from transformers.activations import ACT2FN
29
+ from transformers.cache_utils import Cache, DynamicCache
30
+ from transformers.generation import GenerationMixin
31
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
32
+ from transformers.integrations import use_kernel_forward_from_hub
33
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
34
+ from transformers.modeling_layers import GradientCheckpointingLayer
35
+ from transformers.modeling_outputs import (
36
+ BaseModelOutputWithPast,
37
+ CausalLMOutputWithPast,
38
+ QuestionAnsweringModelOutput,
39
+ SequenceClassifierOutputWithPast,
40
+ TokenClassifierOutput,
41
+ )
42
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
43
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
44
+ from transformers.processing_utils import Unpack
45
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
46
+ from transformers.utils import LossKwargs, auto_docstring, can_return_tuple, is_torch_flex_attn_available, logging
47
+ from .configuration_hyperclovax import HyperCLOVAXConfig
48
+ if is_torch_flex_attn_available():
49
+ from torch.nn.attention.flex_attention import BlockMask
50
+
51
+ from transformers.integrations.flex_attention import make_flex_block_causal_mask
52
+
53
+ logger = logging.get_logger(__name__)
54
+
55
+
56
+ @use_kernel_forward_from_hub("RMSNorm")
57
+ class HyperCLOVAXRMSNorm(nn.Module):
58
+ def __init__(self, hidden_size, eps=1e-6):
59
+ """
60
+ HyperCLOVAXRMSNorm is equivalent to T5LayerNorm
61
+ """
62
+ super().__init__()
63
+ self.weight = nn.Parameter(torch.ones(hidden_size))
64
+ self.variance_epsilon = eps
65
+
66
+ def forward(self, hidden_states):
67
+ input_dtype = hidden_states.dtype
68
+ hidden_states = hidden_states.to(torch.float32)
69
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
70
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
71
+ return self.weight * hidden_states.to(input_dtype)
72
+
73
+ def extra_repr(self):
74
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
75
+
76
+ ALL_LAYERNORM_LAYERS.append(HyperCLOVAXRMSNorm)
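A minimal sketch (not part of the repository code) checking that `HyperCLOVAXRMSNorm` matches the explicit root-mean-square formula implemented in its `forward`; the shapes below are arbitrary illustration values.
```python
# Hypothetical usage sketch for HyperCLOVAXRMSNorm defined above (illustration only).
import torch

norm = HyperCLOVAXRMSNorm(hidden_size=8, eps=1e-6)
x = torch.randn(2, 4, 8)

# Manual RMS normalization: x / sqrt(mean(x^2) + eps), then the learned per-channel scale.
rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
manual = norm.weight * (x / rms)

assert torch.allclose(norm(x), manual, atol=1e-5)
```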
77
+ class HyperCLOVAXRotaryEmbedding(nn.Module):
78
+ def __init__(self, config: HyperCLOVAXConfig, device=None):
79
+ super().__init__()
80
+ # BC: "rope_type" was originally "type"
81
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
82
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
83
+ else:
84
+ self.rope_type = "default"
85
+ self.max_seq_len_cached = config.max_position_embeddings
86
+ self.original_max_seq_len = config.max_position_embeddings
87
+
88
+ self.config = config
89
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
90
+
91
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
92
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
93
+ self.original_inv_freq = self.inv_freq
94
+
95
+ @torch.no_grad()
96
+ @dynamic_rope_update # power user: used with advanced RoPE types (e.g. dynamic rope)
97
+ def forward(self, x, position_ids):
98
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
99
+ position_ids_expanded = position_ids[:, None, :].float()
100
+
101
+ device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
102
+ with torch.autocast(device_type=device_type, enabled=False): # Force float32
103
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
104
+ emb = torch.cat((freqs, freqs), dim=-1)
105
+ cos = emb.cos() * self.attention_scaling
106
+ sin = emb.sin() * self.attention_scaling
107
+
108
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
109
+
110
+
111
+ def rotate_half(x):
112
+ """Rotates half the hidden dims of the input."""
113
+ x1 = x[..., : x.shape[-1] // 2]
114
+ x2 = x[..., x.shape[-1] // 2 :]
115
+ return torch.cat((-x2, x1), dim=-1)
116
+
117
+
118
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
119
+ """Applies Rotary Position Embedding to the query and key tensors.
120
+
121
+ Args:
122
+ q (`torch.Tensor`): The query tensor.
123
+ k (`torch.Tensor`): The key tensor.
124
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
125
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
126
+ position_ids (`torch.Tensor`, *optional*):
127
+ Deprecated and unused.
128
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
129
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
130
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
131
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
132
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
133
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
134
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
135
+ Returns:
136
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
137
+ """
138
+ cos = cos.unsqueeze(unsqueeze_dim)
139
+ sin = sin.unsqueeze(unsqueeze_dim)
140
+ q_embed = (q * cos) + (rotate_half(q) * sin)
141
+ k_embed = (k * cos) + (rotate_half(k) * sin)
142
+ return q_embed, k_embed
143
+
144
+
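A shape-level sketch of `apply_rotary_pos_emb` as described in its docstring: `cos`/`sin` of shape `[batch, seq_len, head_dim]` are broadcast over the heads dimension with `unsqueeze_dim=1`. The frequency construction below is an illustrative stand-in, not the exact `HyperCLOVAXRotaryEmbedding` output.
```python
# Hypothetical broadcast check for apply_rotary_pos_emb (illustration only).
import torch

batch, heads, seq_len, head_dim = 1, 2, 5, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

# Build cos/sin with shape [batch, seq_len, head_dim], as the rotary module returns.
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)   # [seq_len, head_dim // 2]
emb = torch.cat((freqs, freqs), dim=-1)[None]                  # [1, seq_len, head_dim]
cos, sin = emb.cos(), emb.sin()

# unsqueeze_dim=1 makes cos/sin broadcastable to [batch, heads, seq_len, head_dim].
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1)
assert q_rot.shape == q.shape and k_rot.shape == k.shape
```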
145
+ class HyperCLOVAXMLP(nn.Module):
146
+ def __init__(self, config):
147
+ super().__init__()
148
+ self.config = config
149
+ self.hidden_size = config.hidden_size
150
+ self.intermediate_size = config.intermediate_size
151
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
152
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
153
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
154
+ self.act_fn = ACT2FN[config.hidden_act]
155
+
156
+ def forward(self, x):
157
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
158
+ return down_proj
159
+
160
+
161
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
162
+ """
163
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
164
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
165
+ """
166
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
167
+ if n_rep == 1:
168
+ return hidden_states
169
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
170
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
171
+
172
+
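The docstring above states that `repeat_kv` is equivalent to `torch.repeat_interleave` along the head dimension; a small self-check of that claim, with arbitrary example shapes:
```python
# Equivalence check for repeat_kv (illustration only).
import torch

kv = torch.randn(2, 4, 10, 16)   # (batch, num_key_value_heads, seq_len, head_dim)
n_rep = 3                        # num_attention_heads // num_key_value_heads

expanded = repeat_kv(kv, n_rep)  # (batch, num_key_value_heads * n_rep, seq_len, head_dim)
reference = torch.repeat_interleave(kv, repeats=n_rep, dim=1)
assert torch.equal(expanded, reference)
```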
173
+ def eager_attention_forward(
174
+ module: nn.Module,
175
+ query: torch.Tensor,
176
+ key: torch.Tensor,
177
+ value: torch.Tensor,
178
+ attention_mask: Optional[torch.Tensor],
179
+ scaling: float,
180
+ dropout: float = 0.0,
181
+ **kwargs,
182
+ ):
183
+ key_states = repeat_kv(key, module.num_key_value_groups)
184
+ value_states = repeat_kv(value, module.num_key_value_groups)
185
+
186
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
187
+ if attention_mask is not None:
188
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
189
+ attn_weights = attn_weights + causal_mask
190
+
191
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
192
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
193
+ attn_output = torch.matmul(attn_weights, value_states)
194
+ attn_output = attn_output.transpose(1, 2).contiguous()
195
+
196
+ return attn_output, attn_weights
197
+
198
+
199
+ class HyperCLOVAXAttention(nn.Module):
200
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
201
+
202
+ def __init__(self, config: HyperCLOVAXConfig, layer_idx: int):
203
+ super().__init__()
204
+ self.config = config
205
+ self.layer_idx = layer_idx
206
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
207
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
208
+ self.scaling = getattr(config, "attention_multiplier", self.head_dim**-0.5) # MuP
209
+ self.attention_dropout = config.attention_dropout
210
+ self.is_causal = True
211
+
212
+ self.q_proj = nn.Linear(
213
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
214
+ )
215
+ self.k_proj = nn.Linear(
216
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
217
+ )
218
+ self.v_proj = nn.Linear(
219
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
220
+ )
221
+ self.o_proj = nn.Linear(
222
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
223
+ )
224
+
225
+ def forward(
226
+ self,
227
+ hidden_states: torch.Tensor,
228
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
229
+ attention_mask: Optional[torch.Tensor],
230
+ past_key_value: Optional[Cache] = None,
231
+ cache_position: Optional[torch.LongTensor] = None,
232
+ **kwargs: Unpack[FlashAttentionKwargs],
233
+ ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
234
+ input_shape = hidden_states.shape[:-1]
235
+ hidden_shape = (*input_shape, -1, self.head_dim)
236
+
237
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
238
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
239
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
240
+
241
+ cos, sin = position_embeddings
242
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
243
+
244
+ if past_key_value is not None:
245
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
246
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
247
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
248
+
249
+ attention_interface: Callable = eager_attention_forward
250
+
251
+ if self.config._attn_implementation != "eager":
252
+ if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
253
+ logger.warning_once(
254
+ "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
255
+ 'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
256
+ )
257
+ else:
258
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
259
+
260
+ attn_output, attn_weights = attention_interface(
261
+ self,
262
+ query_states,
263
+ key_states,
264
+ value_states,
265
+ attention_mask,
266
+ dropout=0.0 if not self.training else self.attention_dropout,
267
+ scaling=self.scaling,
268
+ **kwargs,
269
+ )
270
+
271
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
272
+ attn_output = self.o_proj(attn_output)
273
+ return attn_output, attn_weights
274
+
275
+
276
+ class HyperCLOVAXDecoderLayer(GradientCheckpointingLayer):
277
+ def __init__(self, config: HyperCLOVAXConfig, layer_idx: int):
278
+ super().__init__()
279
+ self.hidden_size = config.hidden_size
280
+
281
+ self.self_attn = HyperCLOVAXAttention(config=config, layer_idx=layer_idx)
282
+
283
+ self.mlp = HyperCLOVAXMLP(config)
284
+ self.input_layernorm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
285
+ self.post_attention_layernorm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
286
+ self.use_post_norm = getattr(config, "use_post_norm", False)
287
+
288
+ # Peri-LN (post-norm)
289
+ if self.use_post_norm:
290
+ self.post_norm1 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
291
+ self.post_norm2 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
292
+
293
+ self.residual_multiplier = getattr(config, "residual_multiplier", 1.0) # MuP
294
+
295
+ def forward(
296
+ self,
297
+ hidden_states: torch.Tensor,
298
+ attention_mask: Optional[torch.Tensor] = None,
299
+ position_ids: Optional[torch.LongTensor] = None,
300
+ past_key_value: Optional[Cache] = None,
301
+ output_attentions: Optional[bool] = False,
302
+ use_cache: Optional[bool] = False,
303
+ cache_position: Optional[torch.LongTensor] = None,
304
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
305
+ **kwargs: Unpack[FlashAttentionKwargs],
306
+ ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
307
+ residual = hidden_states
308
+ hidden_states = self.input_layernorm(hidden_states)
309
+
310
+ # Self Attention
311
+ hidden_states, self_attn_weights = self.self_attn(
312
+ hidden_states=hidden_states,
313
+ attention_mask=attention_mask,
314
+ position_ids=position_ids,
315
+ past_key_value=past_key_value,
316
+ output_attentions=output_attentions,
317
+ use_cache=use_cache,
318
+ cache_position=cache_position,
319
+ position_embeddings=position_embeddings,
320
+ **kwargs,
321
+ )
322
+
323
+ if self.use_post_norm: # Peri-LN
324
+ hidden_states = self.post_norm1(hidden_states)
325
+
326
+ hidden_states = residual + hidden_states * self.residual_multiplier # MuP
327
+
328
+ # Fully Connected
329
+ residual = hidden_states
330
+ hidden_states = self.post_attention_layernorm(hidden_states)
331
+ hidden_states = self.mlp(hidden_states)
332
+
333
+ if self.use_post_norm: # Peri-LN
334
+ hidden_states = self.post_norm2(hidden_states)
335
+
336
+ hidden_states = residual + hidden_states * self.residual_multiplier # MuP
337
+
338
+ outputs = (hidden_states,)
339
+ if output_attentions:
340
+ outputs += (self_attn_weights,)
341
+
342
+ return outputs
343
+
344
+
345
+ @auto_docstring
346
+ class HyperCLOVAXPreTrainedModel(PreTrainedModel):
347
+ config_class = HyperCLOVAXConfig
348
+ base_model_prefix = "model"
349
+ supports_gradient_checkpointing = True
350
+ _no_split_modules = ["HyperCLOVAXDecoderLayer"]
351
+ _skip_keys_device_placement = ["past_key_values"]
352
+ _supports_flash_attn_2 = True
353
+ _supports_sdpa = True
354
+ _supports_flex_attn = True
355
+ _supports_cache_class = True
356
+ _supports_quantized_cache = True
357
+ _supports_static_cache = True
358
+ _supports_attention_backend = True
359
+
360
+ def _init_weights(self, module):
361
+ std = self.config.initializer_range
362
+ if isinstance(module, nn.Linear):
363
+ module.weight.data.normal_(mean=0.0, std=std)
364
+ if module.bias is not None:
365
+ module.bias.data.zero_()
366
+ elif isinstance(module, nn.Embedding):
367
+ module.weight.data.normal_(mean=0.0, std=std)
368
+ if module.padding_idx is not None:
369
+ module.weight.data[module.padding_idx].zero_()
370
+ elif isinstance(module, HyperCLOVAXRMSNorm):
371
+ module.weight.data.fill_(1.0)
372
+
373
+
374
+ @auto_docstring
375
+ class HyperCLOVAXModel(HyperCLOVAXPreTrainedModel):
376
+ def __init__(self, config: HyperCLOVAXConfig):
377
+ super().__init__(config)
378
+ self.padding_idx = config.pad_token_id
379
+ self.vocab_size = config.vocab_size
380
+
381
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
382
+ self.layers = nn.ModuleList(
383
+ [HyperCLOVAXDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
384
+ )
385
+ self.norm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
386
+ self.rotary_emb = HyperCLOVAXRotaryEmbedding(config=config)
387
+ self.gradient_checkpointing = False
388
+
389
+ # Initialize weights and apply final processing
390
+ self.post_init()
391
+
392
+ # MuP
393
+ self.embedding_multiplier = getattr(config, "embedding_multiplier", 1.0)
394
+
395
+ def get_input_embeddings(self):
396
+ return self.embed_tokens
397
+
398
+ def set_input_embeddings(self, value):
399
+ self.embed_tokens = value
400
+
401
+ @can_return_tuple
402
+ @auto_docstring
403
+ def forward(
404
+ self,
405
+ input_ids: Optional[torch.LongTensor] = None,
406
+ attention_mask: Optional[torch.Tensor] = None,
407
+ position_ids: Optional[torch.LongTensor] = None,
408
+ past_key_values: Optional[Cache] = None,
409
+ inputs_embeds: Optional[torch.FloatTensor] = None,
410
+ use_cache: Optional[bool] = None,
411
+ output_attentions: Optional[bool] = None,
412
+ output_hidden_states: Optional[bool] = None,
413
+ cache_position: Optional[torch.LongTensor] = None,
414
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
415
+ ) -> BaseModelOutputWithPast:
416
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
417
+ output_hidden_states = (
418
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
419
+ )
420
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
421
+
422
+ if (input_ids is None) ^ (inputs_embeds is not None):
423
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
424
+
425
+ if self.gradient_checkpointing and self.training and use_cache:
426
+ logger.warning_once(
427
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
428
+ )
429
+ use_cache = False
430
+
431
+ # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
432
+ if not isinstance(past_key_values, (type(None), Cache)):
433
+ raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
434
+
435
+ if inputs_embeds is None:
436
+ inputs_embeds = self.embed_tokens(input_ids)
437
+
438
+ inputs_embeds = inputs_embeds * self.embedding_multiplier # MuP
439
+
440
+ if use_cache and past_key_values is None:
441
+ past_key_values = DynamicCache()
442
+
443
+ if cache_position is None:
444
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
445
+ cache_position = torch.arange(
446
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
447
+ )
448
+
449
+ if position_ids is None:
450
+ position_ids = cache_position.unsqueeze(0)
451
+
452
+ causal_mask = self._update_causal_mask(
453
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
454
+ )
455
+
456
+ hidden_states = inputs_embeds
457
+
458
+ # create position embeddings to be shared across the decoder layers
459
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
460
+
461
+ # decoder layers
462
+ all_hidden_states = () if output_hidden_states else None
463
+ all_self_attns = () if output_attentions else None
464
+
465
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
466
+ if output_hidden_states:
467
+ all_hidden_states += (hidden_states,)
468
+
469
+ layer_outputs = decoder_layer(
470
+ hidden_states,
471
+ attention_mask=causal_mask,
472
+ position_ids=position_ids,
473
+ past_key_value=past_key_values,
474
+ output_attentions=output_attentions,
475
+ use_cache=use_cache,
476
+ cache_position=cache_position,
477
+ position_embeddings=position_embeddings,
478
+ **flash_attn_kwargs,
479
+ )
480
+
481
+ hidden_states = layer_outputs[0]
482
+
483
+ if output_attentions:
484
+ all_self_attns += (layer_outputs[1],)
485
+
486
+ hidden_states = self.norm(hidden_states)
487
+
488
+ # add hidden states from the last decoder layer
489
+ if output_hidden_states:
490
+ all_hidden_states += (hidden_states,)
491
+
492
+ return BaseModelOutputWithPast(
493
+ last_hidden_state=hidden_states,
494
+ past_key_values=past_key_values if use_cache else None,
495
+ hidden_states=all_hidden_states,
496
+ attentions=all_self_attns,
497
+ )
498
+
499
+ def _update_causal_mask(
500
+ self,
501
+ attention_mask: Union[torch.Tensor, "BlockMask"],
502
+ input_tensor: torch.Tensor,
503
+ cache_position: torch.Tensor,
504
+ past_key_values: Cache,
505
+ output_attentions: bool = False,
506
+ ):
507
+ if self.config._attn_implementation == "flash_attention_2":
508
+ if attention_mask is not None and (attention_mask == 0.0).any():
509
+ return attention_mask
510
+ return None
511
+ if self.config._attn_implementation == "flex_attention":
512
+ if isinstance(attention_mask, torch.Tensor):
513
+ attention_mask = make_flex_block_causal_mask(attention_mask)
514
+ return attention_mask
515
+
516
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
517
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
518
+ # to infer the attention mask.
519
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
520
+ using_compilable_cache = past_key_values.is_compileable if past_key_values is not None else False
521
+
522
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
523
+ if self.config._attn_implementation == "sdpa" and not using_compilable_cache and not output_attentions:
524
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
525
+ attention_mask,
526
+ inputs_embeds=input_tensor,
527
+ past_key_values_length=past_seen_tokens,
528
+ is_training=self.training,
529
+ ):
530
+ return None
531
+
532
+ dtype = input_tensor.dtype
533
+ sequence_length = input_tensor.shape[1]
534
+ if using_compilable_cache:
535
+ target_length = past_key_values.get_max_cache_shape()
536
+ else:
537
+ target_length = (
538
+ attention_mask.shape[-1]
539
+ if isinstance(attention_mask, torch.Tensor)
540
+ else past_seen_tokens + sequence_length + 1
541
+ )
542
+
543
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
544
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
545
+ attention_mask,
546
+ sequence_length=sequence_length,
547
+ target_length=target_length,
548
+ dtype=dtype,
549
+ cache_position=cache_position,
550
+ batch_size=input_tensor.shape[0],
551
+ )
552
+
553
+ if (
554
+ self.config._attn_implementation == "sdpa"
555
+ and attention_mask is not None
556
+ and attention_mask.device.type in ["cuda", "xpu", "npu"]
557
+ and not output_attentions
558
+ ):
559
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
560
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
561
+ # Details: https://github.com/pytorch/pytorch/issues/110213
562
+ min_dtype = torch.finfo(dtype).min
563
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
564
+
565
+ return causal_mask
566
+
567
+ @staticmethod
568
+ def _prepare_4d_causal_attention_mask_with_cache_position(
569
+ attention_mask: torch.Tensor,
570
+ sequence_length: int,
571
+ target_length: int,
572
+ dtype: torch.dtype,
573
+ cache_position: torch.Tensor,
574
+ batch_size: int,
575
+ **kwargs,
576
+ ):
577
+ """
578
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
579
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
580
+
581
+ Args:
582
+ attention_mask (`torch.Tensor`):
583
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
584
+ `(batch_size, 1, query_length, key_value_length)`.
585
+ sequence_length (`int`):
586
+ The sequence length being processed.
587
+ target_length (`int`):
588
+ The target length: when generating with static cache, the mask should be as long as the static cache,
589
+ to account for the 0 padding, the part of the cache that is not filled yet.
590
+ dtype (`torch.dtype`):
591
+ The dtype to use for the 4D attention mask.
592
+ cache_position (`torch.Tensor`):
593
+ Indices depicting the position of the input sequence tokens in the sequence.
594
+ batch_size (`torch.Tensor`):
595
+ Batch size.
596
+ """
597
+ if attention_mask is not None and attention_mask.dim() == 4:
598
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
599
+ causal_mask = attention_mask
600
+ else:
601
+ min_dtype = torch.finfo(dtype).min
602
+ causal_mask = torch.full(
603
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
604
+ )
605
+ if sequence_length != 1:
606
+ causal_mask = torch.triu(causal_mask, diagonal=1)
607
+ causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
608
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
609
+ if attention_mask is not None:
610
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
611
+ mask_length = attention_mask.shape[-1]
612
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
613
+ causal_mask.device
614
+ )
615
+ padding_mask = padding_mask == 0
616
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
617
+ padding_mask, min_dtype
618
+ )
619
+
620
+ return causal_mask
621
+
622
+
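A tiny sketch of the 2D-to-4D conversion documented above, calling the static helper directly on a left-padded mask; the values and sizes are illustrative only.
```python
# Hypothetical illustration of _prepare_4d_causal_attention_mask_with_cache_position.
import torch

attention_mask = torch.tensor([[0, 1, 1, 1]])   # 2D mask, batch_size=1, left padding
cache_position = torch.arange(4)

mask_4d = HyperCLOVAXModel._prepare_4d_causal_attention_mask_with_cache_position(
    attention_mask,
    sequence_length=4,
    target_length=4,
    dtype=torch.float32,
    cache_position=cache_position,
    batch_size=1,
)
# Shape (batch_size, 1, query_length, key_value_length); masked slots hold the dtype minimum.
assert mask_4d.shape == (1, 1, 4, 4)
```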
623
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
624
+
625
+
626
+ @auto_docstring
627
+ class HyperCLOVAXForCausalLM(HyperCLOVAXPreTrainedModel, GenerationMixin):
628
+ _tied_weights_keys = ["lm_head.weight"]
629
+ _tp_plan = {"lm_head": "colwise_rep"}
630
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
631
+
632
+ def __init__(self, config):
633
+ super().__init__(config)
634
+ self.model = HyperCLOVAXModel(config)
635
+ self.vocab_size = config.vocab_size
636
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
637
+ self.logits_scaling = getattr(config, "logits_scaling", 1.0)
638
+
639
+ # Initialize weights and apply final processing
640
+ self.post_init()
641
+
642
+ def get_input_embeddings(self):
643
+ return self.model.embed_tokens
644
+
645
+ def set_input_embeddings(self, value):
646
+ self.model.embed_tokens = value
647
+
648
+ def get_output_embeddings(self):
649
+ return self.lm_head
650
+
651
+ def set_output_embeddings(self, new_embeddings):
652
+ self.lm_head = new_embeddings
653
+
654
+ def set_decoder(self, decoder):
655
+ self.model = decoder
656
+
657
+ def get_decoder(self):
658
+ return self.model
659
+
660
+ @can_return_tuple
661
+ @auto_docstring
662
+ def forward(
663
+ self,
664
+ input_ids: Optional[torch.LongTensor] = None,
665
+ attention_mask: Optional[torch.Tensor] = None,
666
+ position_ids: Optional[torch.LongTensor] = None,
667
+ past_key_values: Optional[Cache] = None,
668
+ inputs_embeds: Optional[torch.FloatTensor] = None,
669
+ labels: Optional[torch.LongTensor] = None,
670
+ use_cache: Optional[bool] = None,
671
+ output_attentions: Optional[bool] = None,
672
+ output_hidden_states: Optional[bool] = None,
673
+ cache_position: Optional[torch.LongTensor] = None,
674
+ logits_to_keep: Union[int, torch.Tensor] = 0,
675
+ **kwargs: Unpack[KwargsForCausalLM],
676
+ ) -> CausalLMOutputWithPast:
677
+ r"""
678
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
679
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
680
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
681
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
682
+
683
+ Example:
684
+
685
+ ```python
686
+ >>> from transformers import AutoTokenizer, HyperCLOVAXForCausalLM
687
+
688
+ >>> model = HyperCLOVAXForCausalLM.from_pretrained("naver-hyperclovax/{model_name}")
689
+ >>> tokenizer = AutoTokenizer.from_pretrained("naver-hyperclovax/{model_name}")
690
+
691
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
692
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
693
+
694
+ >>> # Generate
695
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
696
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
697
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
698
+ ```"""
699
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
700
+ output_hidden_states = (
701
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
702
+ )
703
+
704
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
705
+ outputs: BaseModelOutputWithPast = self.model(
706
+ input_ids=input_ids,
707
+ attention_mask=attention_mask,
708
+ position_ids=position_ids,
709
+ past_key_values=past_key_values,
710
+ inputs_embeds=inputs_embeds,
711
+ use_cache=use_cache,
712
+ output_attentions=output_attentions,
713
+ output_hidden_states=output_hidden_states,
714
+ cache_position=cache_position,
715
+ **kwargs,
716
+ )
717
+
718
+ hidden_states = outputs.last_hidden_state
719
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
720
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
721
+ # MuP
722
+ logits = self.lm_head(hidden_states[:, slice_indices, :]) * self.logits_scaling
723
+
724
+ loss = None
725
+ if labels is not None:
726
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
727
+
728
+ return CausalLMOutputWithPast(
729
+ loss=loss,
730
+ logits=logits,
731
+ past_key_values=outputs.past_key_values,
732
+ hidden_states=outputs.hidden_states,
733
+ attentions=outputs.attentions,
734
+ )
735
+
736
+
737
+ @auto_docstring(
738
+ custom_intro="""
739
+ The HyperCLOVAX Model transformer with a sequence classification head on top (linear layer).
740
+
741
+ [`HyperCLOVAXForSequenceClassification`] uses the last token in order to do the classification, as other causal models
742
+ (e.g. GPT-2) do.
743
+
744
+ Since it does classification on the last token, it requires to know the position of the last token. If a
745
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
746
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
747
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
748
+ each row of the batch).
749
+ """
750
+ )
751
+ class HyperCLOVAXForSequenceClassification(HyperCLOVAXPreTrainedModel):
752
+ def __init__(self, config):
753
+ super().__init__(config)
754
+ self.num_labels = config.num_labels
755
+ self.model = HyperCLOVAXModel(config)
756
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
757
+
758
+ # Initialize weights and apply final processing
759
+ self.post_init()
760
+
761
+ def get_input_embeddings(self):
762
+ return self.model.embed_tokens
763
+
764
+ def set_input_embeddings(self, value):
765
+ self.model.embed_tokens = value
766
+
767
+ @can_return_tuple
768
+ @auto_docstring
769
+ def forward(
770
+ self,
771
+ input_ids: Optional[torch.LongTensor] = None,
772
+ attention_mask: Optional[torch.Tensor] = None,
773
+ position_ids: Optional[torch.LongTensor] = None,
774
+ past_key_values: Optional[Cache] = None,
775
+ inputs_embeds: Optional[torch.FloatTensor] = None,
776
+ labels: Optional[torch.LongTensor] = None,
777
+ use_cache: Optional[bool] = None,
778
+ output_attentions: Optional[bool] = None,
779
+ output_hidden_states: Optional[bool] = None,
780
+ ) -> SequenceClassifierOutputWithPast:
781
+ r"""
782
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
783
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
784
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
785
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
786
+ """
787
+
788
+ transformer_outputs: BaseModelOutputWithPast = self.model(
789
+ input_ids,
790
+ attention_mask=attention_mask,
791
+ position_ids=position_ids,
792
+ past_key_values=past_key_values,
793
+ inputs_embeds=inputs_embeds,
794
+ use_cache=use_cache,
795
+ output_attentions=output_attentions,
796
+ output_hidden_states=output_hidden_states,
797
+ )
798
+ hidden_states = transformer_outputs.last_hidden_state
799
+ logits = self.score(hidden_states)
800
+
801
+ if input_ids is not None:
802
+ batch_size = input_ids.shape[0]
803
+ else:
804
+ batch_size = inputs_embeds.shape[0]
805
+
806
+ if self.config.pad_token_id is None and batch_size != 1:
807
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
808
+ if self.config.pad_token_id is None:
809
+ last_non_pad_token = -1
810
+ elif input_ids is not None:
811
+ # To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
812
+ non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
813
+ token_indices = torch.arange(input_ids.shape[-1], device=logits.device, dtype=torch.int32)
814
+ last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
815
+ else:
816
+ last_non_pad_token = -1
817
+ logger.warning_once(
818
+ f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
819
+ "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
820
+ )
821
+
822
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
823
+
824
+ loss = None
825
+ if labels is not None:
826
+ loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
827
+
828
+ return SequenceClassifierOutputWithPast(
829
+ loss=loss,
830
+ logits=pooled_logits,
831
+ past_key_values=transformer_outputs.past_key_values,
832
+ hidden_states=transformer_outputs.hidden_states,
833
+ attentions=transformer_outputs.attentions,
834
+ )
835
+
836
+
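The intro above says classification pools the last token that is not a padding token; the indexing trick used in `forward` can be sketched in isolation like this (toy ids, hypothetical `pad_token_id`):
```python
# Last-non-pad-token selection as used by HyperCLOVAXForSequenceClassification (illustration only).
import torch

pad_token_id = 0
input_ids = torch.tensor([[7, 8, 9, 0, 0],    # right padding -> last real token at index 2
                          [0, 0, 7, 8, 9]])   # left padding  -> last real token at index 4

non_pad_mask = (input_ids != pad_token_id).int()
token_indices = torch.arange(input_ids.shape[-1])
last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
assert last_non_pad_token.tolist() == [2, 4]
```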
837
+ @auto_docstring
838
+ class HyperCLOVAXForQuestionAnswering(HyperCLOVAXPreTrainedModel):
839
+ base_model_prefix = "transformer"
840
+
841
+ # Copied from transformers.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->HyperCLOVAX
842
+ def __init__(self, config):
843
+ super().__init__(config)
844
+ self.transformer = HyperCLOVAXModel(config)
845
+ self.qa_outputs = nn.Linear(config.hidden_size, 2)
846
+
847
+ # Initialize weights and apply final processing
848
+ self.post_init()
849
+
850
+ def get_input_embeddings(self):
851
+ return self.transformer.embed_tokens
852
+
853
+ def set_input_embeddings(self, value):
854
+ self.transformer.embed_tokens = value
855
+
856
+ @can_return_tuple
857
+ @auto_docstring
858
+ def forward(
859
+ self,
860
+ input_ids: Optional[torch.LongTensor] = None,
861
+ attention_mask: Optional[torch.Tensor] = None,
862
+ position_ids: Optional[torch.LongTensor] = None,
863
+ past_key_values: Optional[Cache] = None,
864
+ inputs_embeds: Optional[torch.FloatTensor] = None,
865
+ start_positions: Optional[torch.LongTensor] = None,
866
+ end_positions: Optional[torch.LongTensor] = None,
867
+ output_attentions: Optional[bool] = None,
868
+ output_hidden_states: Optional[bool] = None,
869
+ **kwargs,
870
+ ) -> QuestionAnsweringModelOutput:
871
+ outputs: BaseModelOutputWithPast = self.transformer(
872
+ input_ids,
873
+ attention_mask=attention_mask,
874
+ position_ids=position_ids,
875
+ past_key_values=past_key_values,
876
+ inputs_embeds=inputs_embeds,
877
+ output_attentions=output_attentions,
878
+ output_hidden_states=output_hidden_states,
879
+ )
880
+
881
+ sequence_output = outputs.last_hidden_state
882
+
883
+ logits = self.qa_outputs(sequence_output)
884
+ start_logits, end_logits = logits.split(1, dim=-1)
885
+ start_logits = start_logits.squeeze(-1).contiguous()
886
+ end_logits = end_logits.squeeze(-1).contiguous()
887
+
888
+ loss = None
889
+ if start_positions is not None and end_positions is not None:
890
+ loss = self.loss_function(start_logits, end_logits, start_positions, end_positions, **kwargs)
891
+
892
+ return QuestionAnsweringModelOutput(
893
+ loss=loss,
894
+ start_logits=start_logits,
895
+ end_logits=end_logits,
896
+ hidden_states=outputs.hidden_states,
897
+ attentions=outputs.attentions,
898
+ )
899
+
900
+
901
+ @auto_docstring
902
+ class HyperCLOVAXForTokenClassification(HyperCLOVAXPreTrainedModel):
903
+ def __init__(self, config):
904
+ super().__init__(config)
905
+ self.num_labels = config.num_labels
906
+ self.model = HyperCLOVAXModel(config)
907
+ if getattr(config, "classifier_dropout", None) is not None:
908
+ classifier_dropout = config.classifier_dropout
909
+ elif getattr(config, "hidden_dropout", None) is not None:
910
+ classifier_dropout = config.hidden_dropout
911
+ else:
912
+ classifier_dropout = 0.1
913
+ self.dropout = nn.Dropout(classifier_dropout)
914
+ self.score = nn.Linear(config.hidden_size, config.num_labels)
915
+
916
+ # Initialize weights and apply final processing
917
+ self.post_init()
918
+
919
+ def get_input_embeddings(self):
920
+ return self.model.embed_tokens
921
+
922
+ def set_input_embeddings(self, value):
923
+ self.model.embed_tokens = value
924
+
925
+ @can_return_tuple
926
+ @auto_docstring
927
+ def forward(
928
+ self,
929
+ input_ids: Optional[torch.LongTensor] = None,
930
+ attention_mask: Optional[torch.Tensor] = None,
931
+ position_ids: Optional[torch.LongTensor] = None,
932
+ past_key_values: Optional[Cache] = None,
933
+ inputs_embeds: Optional[torch.FloatTensor] = None,
934
+ labels: Optional[torch.LongTensor] = None,
935
+ use_cache: Optional[bool] = None,
936
+ output_attentions: Optional[bool] = None,
937
+ output_hidden_states: Optional[bool] = None,
938
+ ) -> TokenClassifierOutput:
939
+ r"""
940
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
941
+ Labels for computing the token classification loss. Indices should be in `[0, ...,
942
+ config.num_labels - 1]`. A classification loss (Cross-Entropy) is computed over the
943
+ labeled tokens.
944
+ """
945
+
946
+ outputs: BaseModelOutputWithPast = self.model(
947
+ input_ids,
948
+ attention_mask=attention_mask,
949
+ position_ids=position_ids,
950
+ past_key_values=past_key_values,
951
+ inputs_embeds=inputs_embeds,
952
+ use_cache=use_cache,
953
+ output_attentions=output_attentions,
954
+ output_hidden_states=output_hidden_states,
955
+ )
956
+ sequence_output = outputs.last_hidden_state
957
+ sequence_output = self.dropout(sequence_output)
958
+ logits = self.score(sequence_output)
959
+
960
+ loss = None
961
+ if labels is not None:
962
+ loss = self.loss_function(logits, labels, self.config)
963
+
964
+ return TokenClassifierOutput(
965
+ loss=loss,
966
+ logits=logits,
967
+ hidden_states=outputs.hidden_states,
968
+ attentions=outputs.attentions,
969
+ )
970
+
971
+
972
+ __all__ = [
973
+ "HyperCLOVAXForCausalLM",
974
+ "HyperCLOVAXModel",
975
+ "HyperCLOVAXPreTrainedModel",
976
+ "HyperCLOVAXForSequenceClassification",
977
+ "HyperCLOVAXForQuestionAnswering",
978
+ "HyperCLOVAXForTokenClassification",
979
+ ]
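Since this modeling file ships as custom code inside the model repository, loading it through `transformers` requires `trust_remote_code=True`. A minimal, hedged usage sketch (the repository id below is the docstring's placeholder; substitute the real one):
```python
# Hypothetical end-to-end usage of the architecture defined above (sketch, not shipped code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "naver-hyperclovax/{model_name}"  # placeholder, as in the docstring above

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

inputs = tokenizer("Hey, are you conscious?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```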
special_tokens_map.json ADDED
@@ -0,0 +1,86 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<|fim_prefix|>",
5
+ "<|fim_middle|>",
6
+ "<|fim_suffix|>",
7
+ "<|endofprompt|>",
8
+ "<|_unuse_missing_100256|>",
9
+ "<|_unuse_missing_100261|>",
10
+ "<|_unuse_missing_100262|>",
11
+ "<|_unuse_missing_100263|>",
12
+ "<|_unuse_missing_100264|>",
13
+ "<|_unuse_missing_100265|>",
14
+ "<|_unuse_missing_100266|>",
15
+ "<|_unuse_missing_100267|>",
16
+ "<|_unuse_missing_100268|>",
17
+ "<|_unuse_missing_100269|>",
18
+ "<|_unuse_missing_100270|>",
19
+ "<|_unuse_missing_100271|>",
20
+ "<|im_start|>",
21
+ "<|im_end|>",
22
+ "<|stop|>",
23
+ "<|endofturn|>",
24
+ "<repo_name>",
25
+ "<file_sep>",
26
+ "<issue_start>",
27
+ "<issue_comment>",
28
+ "<issue_closed>",
29
+ "<jupyter_start>",
30
+ "<jupyter_text>",
31
+ "<jupyter_code>",
32
+ "<jupyter_output>",
33
+ "<jupyter_script>",
34
+ "<empty_output>",
35
+ "<code_to_intermediate>",
36
+ "<intermediate_to_code>",
37
+ "<pr>",
38
+ "<pr_status>",
39
+ "<pr_is_merged>",
40
+ "<pr_base>",
41
+ "<pr_file>",
42
+ "<pr_base_code>",
43
+ "<pr_diff>",
44
+ "<pr_diff_hunk>",
45
+ "<pr_comment>",
46
+ "<pr_event_id>",
47
+ "<pr_review>",
48
+ "<pr_review_state>",
49
+ "<pr_review_comment>",
50
+ "<pr_in_reply_to_review_id>",
51
+ "<pr_in_reply_to_comment_id>",
52
+ "<pr_diff_hunk_comment_line>",
53
+ "<NAME>",
54
+ "<EMAIL>",
55
+ "<KEY>",
56
+ "<PASSWORD>"
57
+ ],
58
+ "bos_token": {
59
+ "content": "<|endoftext|>",
60
+ "lstrip": false,
61
+ "normalized": false,
62
+ "rstrip": false,
63
+ "single_word": false
64
+ },
65
+ "eos_token": {
66
+ "content": "<|endoftext|>",
67
+ "lstrip": false,
68
+ "normalized": false,
69
+ "rstrip": false,
70
+ "single_word": false
71
+ },
72
+ "pad_token": {
73
+ "content": "<|endoftext|>",
74
+ "lstrip": false,
75
+ "normalized": false,
76
+ "rstrip": false,
77
+ "single_word": false
78
+ },
79
+ "unk_token": {
80
+ "content": "<|endoftext|>",
81
+ "lstrip": false,
82
+ "normalized": false,
83
+ "rstrip": false,
84
+ "single_word": false
85
+ }
86
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,501 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "100256": {
5
+ "content": "<|_unuse_missing_100256|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "100257": {
13
+ "content": "<|endoftext|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "100258": {
21
+ "content": "<|fim_prefix|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "100259": {
29
+ "content": "<|fim_middle|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "100260": {
37
+ "content": "<|fim_suffix|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "100261": {
45
+ "content": "<|_unuse_missing_100261|>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "100262": {
53
+ "content": "<|_unuse_missing_100262|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "100263": {
61
+ "content": "<|_unuse_missing_100263|>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "100264": {
69
+ "content": "<|_unuse_missing_100264|>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "100265": {
77
+ "content": "<|_unuse_missing_100265|>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "100266": {
85
+ "content": "<|_unuse_missing_100266|>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "100267": {
93
+ "content": "<|_unuse_missing_100267|>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "100268": {
101
+ "content": "<|_unuse_missing_100268|>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "100269": {
109
+ "content": "<|_unuse_missing_100269|>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "100270": {
117
+ "content": "<|_unuse_missing_100270|>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "100271": {
125
+ "content": "<|_unuse_missing_100271|>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "100272": {
133
+ "content": "<|im_start|>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "100273": {
141
+ "content": "<|im_end|>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "100274": {
149
+ "content": "<|stop|>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "100275": {
157
+ "content": "<|endofturn|>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": true
163
+ },
164
+ "100276": {
165
+ "content": "<|endofprompt|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "110491": {
173
+ "content": "<repo_name>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": true
179
+ },
180
+ "110492": {
181
+ "content": "<file_sep>",
182
+ "lstrip": false,
183
+ "normalized": false,
184
+ "rstrip": false,
185
+ "single_word": false,
186
+ "special": true
187
+ },
188
+ "110493": {
189
+ "content": "<issue_start>",
190
+ "lstrip": false,
191
+ "normalized": false,
192
+ "rstrip": false,
193
+ "single_word": false,
194
+ "special": true
195
+ },
196
+ "110494": {
197
+ "content": "<issue_comment>",
198
+ "lstrip": false,
199
+ "normalized": false,
200
+ "rstrip": false,
201
+ "single_word": false,
202
+ "special": true
203
+ },
204
+ "110495": {
205
+ "content": "<issue_closed>",
206
+ "lstrip": false,
207
+ "normalized": false,
208
+ "rstrip": false,
209
+ "single_word": false,
210
+ "special": true
211
+ },
212
+ "110496": {
213
+ "content": "<jupyter_start>",
214
+ "lstrip": false,
215
+ "normalized": false,
216
+ "rstrip": false,
217
+ "single_word": false,
218
+ "special": true
219
+ },
220
+ "110497": {
221
+ "content": "<jupyter_text>",
222
+ "lstrip": false,
223
+ "normalized": false,
224
+ "rstrip": false,
225
+ "single_word": false,
226
+ "special": true
227
+ },
228
+ "110498": {
229
+ "content": "<jupyter_code>",
230
+ "lstrip": false,
231
+ "normalized": false,
232
+ "rstrip": false,
233
+ "single_word": false,
234
+ "special": true
235
+ },
236
+ "110499": {
237
+ "content": "<jupyter_output>",
238
+ "lstrip": false,
239
+ "normalized": false,
240
+ "rstrip": false,
241
+ "single_word": false,
242
+ "special": true
243
+ },
244
+ "110500": {
245
+ "content": "<jupyter_script>",
246
+ "lstrip": false,
247
+ "normalized": false,
248
+ "rstrip": false,
249
+ "single_word": false,
250
+ "special": true
251
+ },
252
+ "110501": {
253
+ "content": "<empty_output>",
254
+ "lstrip": false,
255
+ "normalized": false,
256
+ "rstrip": false,
257
+ "single_word": false,
258
+ "special": true
259
+ },
260
+ "110502": {
261
+ "content": "<code_to_intermediate>",
262
+ "lstrip": false,
263
+ "normalized": false,
264
+ "rstrip": false,
265
+ "single_word": false,
266
+ "special": true
267
+ },
268
+ "110503": {
269
+ "content": "<intermediate_to_code>",
270
+ "lstrip": false,
271
+ "normalized": false,
272
+ "rstrip": false,
273
+ "single_word": false,
274
+ "special": true
275
+ },
276
+ "110504": {
277
+ "content": "<pr>",
278
+ "lstrip": false,
279
+ "normalized": false,
280
+ "rstrip": false,
281
+ "single_word": false,
282
+ "special": true
283
+ },
284
+ "110505": {
285
+ "content": "<pr_status>",
286
+ "lstrip": false,
287
+ "normalized": false,
288
+ "rstrip": false,
289
+ "single_word": false,
290
+ "special": true
291
+ },
292
+ "110506": {
293
+ "content": "<pr_is_merged>",
294
+ "lstrip": false,
295
+ "normalized": false,
296
+ "rstrip": false,
297
+ "single_word": false,
298
+ "special": true
299
+ },
300
+ "110507": {
301
+ "content": "<pr_base>",
302
+ "lstrip": false,
303
+ "normalized": false,
304
+ "rstrip": false,
305
+ "single_word": false,
306
+ "special": true
307
+ },
308
+ "110508": {
309
+ "content": "<pr_file>",
310
+ "lstrip": false,
311
+ "normalized": false,
312
+ "rstrip": false,
313
+ "single_word": false,
314
+ "special": true
315
+ },
316
+ "110509": {
317
+ "content": "<pr_base_code>",
318
+ "lstrip": false,
319
+ "normalized": false,
320
+ "rstrip": false,
321
+ "single_word": false,
322
+ "special": true
323
+ },
324
+ "110510": {
325
+ "content": "<pr_diff>",
326
+ "lstrip": false,
327
+ "normalized": false,
328
+ "rstrip": false,
329
+ "single_word": false,
330
+ "special": true
331
+ },
332
+ "110511": {
333
+ "content": "<pr_diff_hunk>",
334
+ "lstrip": false,
335
+ "normalized": false,
336
+ "rstrip": false,
337
+ "single_word": false,
338
+ "special": true
339
+ },
340
+ "110512": {
341
+ "content": "<pr_comment>",
342
+ "lstrip": false,
343
+ "normalized": false,
344
+ "rstrip": false,
345
+ "single_word": false,
346
+ "special": true
347
+ },
348
+ "110513": {
349
+ "content": "<pr_event_id>",
350
+ "lstrip": false,
351
+ "normalized": false,
352
+ "rstrip": false,
353
+ "single_word": false,
354
+ "special": true
355
+ },
356
+ "110514": {
357
+ "content": "<pr_review>",
358
+ "lstrip": false,
359
+ "normalized": false,
360
+ "rstrip": false,
361
+ "single_word": false,
362
+ "special": true
363
+ },
364
+ "110515": {
365
+ "content": "<pr_review_state>",
366
+ "lstrip": false,
367
+ "normalized": false,
368
+ "rstrip": false,
369
+ "single_word": false,
370
+ "special": true
371
+ },
372
+ "110516": {
373
+ "content": "<pr_review_comment>",
374
+ "lstrip": false,
375
+ "normalized": false,
376
+ "rstrip": false,
377
+ "single_word": false,
378
+ "special": true
379
+ },
380
+ "110517": {
381
+ "content": "<pr_in_reply_to_review_id>",
382
+ "lstrip": false,
383
+ "normalized": false,
384
+ "rstrip": false,
385
+ "single_word": false,
386
+ "special": true
387
+ },
388
+ "110518": {
389
+ "content": "<pr_in_reply_to_comment_id>",
390
+ "lstrip": false,
391
+ "normalized": false,
392
+ "rstrip": false,
393
+ "single_word": false,
394
+ "special": true
395
+ },
396
+ "110519": {
397
+ "content": "<pr_diff_hunk_comment_line>",
398
+ "lstrip": false,
399
+ "normalized": false,
400
+ "rstrip": false,
401
+ "single_word": false,
402
+ "special": true
403
+ },
404
+ "110520": {
405
+ "content": "<NAME>",
406
+ "lstrip": false,
407
+ "normalized": false,
408
+ "rstrip": false,
409
+ "single_word": false,
410
+ "special": true
411
+ },
412
+ "110521": {
413
+ "content": "<EMAIL>",
414
+ "lstrip": false,
415
+ "normalized": false,
416
+ "rstrip": false,
417
+ "single_word": false,
418
+ "special": true
419
+ },
420
+ "110522": {
421
+ "content": "<KEY>",
422
+ "lstrip": false,
423
+ "normalized": false,
424
+ "rstrip": false,
425
+ "single_word": false,
426
+ "special": true
427
+ },
428
+ "110523": {
429
+ "content": "<PASSWORD>",
430
+ "lstrip": false,
431
+ "normalized": false,
432
+ "rstrip": false,
433
+ "single_word": false,
434
+ "special": true
435
+ }
436
+ },
437
+ "additional_special_tokens": [
438
+ "<|endoftext|>",
439
+ "<|fim_prefix|>",
440
+ "<|fim_middle|>",
441
+ "<|fim_suffix|>",
442
+ "<|endofprompt|>",
443
+ "<|_unuse_missing_100256|>",
444
+ "<|_unuse_missing_100261|>",
445
+ "<|_unuse_missing_100262|>",
446
+ "<|_unuse_missing_100263|>",
447
+ "<|_unuse_missing_100264|>",
448
+ "<|_unuse_missing_100265|>",
449
+ "<|_unuse_missing_100266|>",
450
+ "<|_unuse_missing_100267|>",
451
+ "<|_unuse_missing_100268|>",
452
+ "<|_unuse_missing_100269|>",
453
+ "<|_unuse_missing_100270|>",
454
+ "<|_unuse_missing_100271|>",
455
+ "<|im_start|>",
456
+ "<|im_end|>",
457
+ "<|stop|>",
458
+ "<|endofturn|>",
459
+ "<repo_name>",
460
+ "<file_sep>",
461
+ "<issue_start>",
462
+ "<issue_comment>",
463
+ "<issue_closed>",
464
+ "<jupyter_start>",
465
+ "<jupyter_text>",
466
+ "<jupyter_code>",
467
+ "<jupyter_output>",
468
+ "<jupyter_script>",
469
+ "<empty_output>",
470
+ "<code_to_intermediate>",
471
+ "<intermediate_to_code>",
472
+ "<pr>",
473
+ "<pr_status>",
474
+ "<pr_is_merged>",
475
+ "<pr_base>",
476
+ "<pr_file>",
477
+ "<pr_base_code>",
478
+ "<pr_diff>",
479
+ "<pr_diff_hunk>",
480
+ "<pr_comment>",
481
+ "<pr_event_id>",
482
+ "<pr_review>",
483
+ "<pr_review_state>",
484
+ "<pr_review_comment>",
485
+ "<pr_in_reply_to_review_id>",
486
+ "<pr_in_reply_to_comment_id>",
487
+ "<pr_diff_hunk_comment_line>",
488
+ "<NAME>",
489
+ "<EMAIL>",
490
+ "<KEY>",
491
+ "<PASSWORD>"
492
+ ],
493
+ "bos_token": "<|endoftext|>",
494
+ "chat_template": "{% if tools is not defined or tools is none %}\n {{- '<|im_start|>tool_list\\n<|im_end|>\\n' }}\n{%- else %}\n {{- '<|im_start|>tool_list\\n[' }}\n {%- for tool in tools %}\n {{- '{\"name\": \"' }}\n {{- tool.function.name }}\n {{- '\", ' }}\n {{- '\"description\": \"' }}\n {{- tool.function.description }}\n {{- '\"' }}\n {%- if tool.function.parameters is defined %}\n {{- ', \"parameters\": ' }}\n {{- tool.function.parameters | tojson }}\n {%- endif %}\n {{- '}' }}\n {%- if not loop.last %}\n {{- ', ' }}\n {%- endif %}\n {%- endfor %}\n{{- ']<|im_end|>\\n' }}\n{%- endif %}\n\n{%- set ns = namespace(is_searching=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.is_searching and (message.role == 'user' or message.role == 'tool') %}\n {%- set ns.last_query_index = index %}\n {%- set ns.is_searching = false %}\n {%- endif %}\n{%- endfor %}\n\n{%- for message in messages %}\n {%- if loop.index0 == 0 and message.role != 'system' %}\n {{- '<|im_start|>system\\n<|im_end|>\\n' }}\n {%- endif %}\n\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is defined and message.reasoning_content is not none %}\n {%- set reasoning_content = message.reasoning_content %} \n {%- endif %}\n {%- if message.role == \"assistant\" %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if reasoning_content %}\n {{- '<|im_start|>assistant/think\\n' + reasoning_content.strip('\\n') + '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n\n {%- if content %}\n {{- '<|im_start|>assistant\\n' + content.strip('\\n') + '<|im_end|>' }}\n {%- if message.tool_calls %}\n {{- '\\n' }}\n {%- else %}\n {{- '<|endofturn|>\\n' }}\n {%- endif %}\n {%- endif %}\n\n {%- if message.tool_calls %}\n {{- '<|im_start|>assistant -> tool/function_call\\n[' }}\n {%- for tool_call in message.tool_calls %}\n {%- if not loop.first %}\n {{- ', ' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}' }}\n {%- endfor %}\n {{- ']<|im_end|><|stop|>\\n' }}\n\n {%- endif %}\n {%- elif message.role == \"tool\" %}\n {{- '<|im_start|>tool/function_call\\n' + content + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {%- if force_reasoning is defined and force_reasoning is true %}\n {{- '<|im_start|>assistant/think\\n' }}\n {%- elif skip_reasoning is defined and skip_reasoning is true %}\n {{- '<|im_start|>assistant\\n' }}\n {%- else %}\n {{- '<|im_start|>assistant' }}\n {%- endif %}\n{%- endif %}",
495
+ "clean_up_tokenization_spaces": true,
496
+ "eos_token": "<|endoftext|>",
497
+ "model_max_length": 1000000000000000019884624838656,
498
+ "pad_token": "<|endoftext|>",
499
+ "tokenizer_class": "GPT2Tokenizer",
500
+ "unk_token": "<|endoftext|>"
501
+ }
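The `chat_template` above reads optional `force_reasoning` / `skip_reasoning` flags to decide whether the generation prompt opens the `assistant/think` channel. A hedged sketch of driving it through `apply_chat_template` (extra keyword arguments are forwarded to the template renderer in recent `transformers` releases; the repository id is a placeholder):
```python
# Hypothetical sketch of using the chat template from tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naver-hyperclovax/{model_name}")  # placeholder id

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RMSNorm in one sentence."},
]

# Default: the prompt ends with "<|im_start|>assistant", letting the model pick the channel.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Explicitly open or skip the reasoning channel via the template's flags.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, force_reasoning=True
)
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, skip_reasoning=True
)
```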
vocab.json ADDED
The diff for this file is too large to render. See raw diff