Alexis-Az committed on
Commit 62d53a9 · 1 Parent(s): b228316

added jina base model

README.md ADDED
@@ -0,0 +1,90 @@
---
model-index:
- name: starcoder-1b-textbook
  results:
  - task:
      type: text-generation
    dataset:
      type: openai_humaneval
      name: HumanEval
    metrics:
    - name: pass@1
      type: pass@1
      value: 27.0%
      verified: false
datasets:
- jinaai/code_exercises
language:
- en
tags:
- HumanEval
- StarCoder
license: cc-by-nc-sa-4.0
---

# StarCoder-1b-textbook

StarCoder-1b-textbook is a fine-tuned version of [starcoderbase-1b](https://huggingface.co/bigcode/starcoderbase-1b) on the [code_exercises](https://huggingface.co/datasets/jinaai/code_exercises) dataset.

It achieves 27.0 pass@1 on the [HumanEval](https://github.com/openai/human-eval) coding benchmark while having only 1B parameters.
That is an improvement of almost 12 points over the starcoderbase-1b baseline, nearly doubling its score.

Its HumanEval results are on par with larger open-source models such as StarCoderBase (30.4), StarCoder (33.6), and CodeGen-16B-Mono (29.3), despite the model being roughly 15 times smaller.

It still underperforms models such as Code Llama (53), GPT-4 (82), or WizardCoder (73.2), but those models are more than 30 times larger.

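Pass@1 is the standard HumanEval metric: the expected probability that a single generated sample for a problem passes all of its unit tests. As a small reference sketch (the sample counts below are illustrative, not the exact evaluation setup used for this model), the unbiased pass@k estimator from the HumanEval paper looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn for a problem
    and c of them passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: 200 samples for one problem, 54 of them correct.
print(pass_at_k(n=200, c=54, k=1))  # 0.27
```

The reported score is this quantity averaged over all 164 HumanEval problems.
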
## Usage
You can download and use the model like so:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/starcoder-1b-textbook", device_map='auto'
)

tokenizer = AutoTokenizer.from_pretrained("jinaai/starcoder-1b-textbook")

# HumanEval-style prompt: a function signature plus a docstring to complete
prompt = '''
def unique(l: list):
    """Return sorted unique elements in a list
    >>> unique([5, 3, 5, 2, 3, 3, 9, 0, 123])
    [0, 2, 3, 5, 9, 123]
    """
'''

# Tokenize the prompt and move it to the GPU (a CUDA device is assumed here)
inputs = tokenizer(prompt.rstrip(), return_tensors="pt").to("cuda")

generation_output = model.generate(
    **inputs,
    max_new_tokens=128,
    eos_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
)

# Decode the full sequence (prompt + completion) back to text
s = generation_output.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)

print(output)
```

## Finetuning details

We performed full-parameter fine-tuning on an NVIDIA A40 for 12 hours, with a batch size of 128 and a micro-batch size of 8.

To reproduce the training, follow the training instructions in our [open source codebase](https://github.com/jina-ai/textbook).

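For orientation, a batch size of 128 with a micro-batch size of 8 corresponds to 16 gradient-accumulation steps. Below is a minimal sketch of that mapping using Hugging Face `TrainingArguments`; it is a hypothetical illustration, not the actual training configuration, which lives in the codebase linked above:

```python
from transformers import TrainingArguments

# Hypothetical illustration of the reported batch sizes:
# effective batch size = per_device_train_batch_size * gradient_accumulation_steps
#                      = 8 * 16 = 128
training_args = TrainingArguments(
    output_dir="starcoder-1b-textbook-finetune",  # hypothetical output path
    per_device_train_batch_size=8,   # micro-batch size
    gradient_accumulation_steps=16,  # 128 / 8 accumulation steps
)
```
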
## Disclaimer

* The HumanEval benchmark is not a perfect benchmark and does not fully represent the coding abilities of an LLM. This model performs well on the tasks it measures, but that does not necessarily mean it is on par with larger models as a general coding assistant.
* This model is not instruction-tuned and cannot be used as a chatbot. We recommend fine-tuning it on [Evol-Instruct-Code-80k-v1](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) to turn it into an instruction-following model (see the sketch after this list).
* This model has not been aligned with human preferences and could therefore generate harmful content.
* This model has been trained on a dataset generated by ChatGPT 3.5, and you should check the legal status of AI-generated content in your jurisdiction before using it. You should make sure that your usage complies with the OpenAI Terms of Use, insofar as legally applicable.

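As a minimal starting point for that instruction-tuning step (a hypothetical sketch, not something that was run for this release), the recommended dataset can be loaded and inspected like so:

```python
from datasets import load_dataset

# Load the recommended instruction-tuning dataset and inspect its fields
ds = load_dataset("nickrosh/Evol-Instruct-Code-80k-v1", split="train")
print(ds.column_names)  # check the prompt/response columns before building training pairs
print(ds[0])
```
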
## Credits

This model was trained and released by [Jina.ai](https://jina.ai/).
config.json ADDED
@@ -0,0 +1,39 @@
{
  "_name_or_path": "bigcode/starcoderbase-1b",
  "activation_function": "gelu_pytorch_tanh",
  "architectures": [
    "GPTBigCodeForCausalLM"
  ],
  "attention_softmax_in_fp32": true,
  "attn_pdrop": 0.1,
  "bos_token_id": 0,
  "embd_pdrop": 0.1,
  "eos_token_id": 0,
  "inference_runner": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_batch_size": null,
  "max_sequence_length": null,
  "model_type": "gpt_bigcode",
  "multi_query": true,
  "n_embd": 2048,
  "n_head": 16,
  "n_inner": 8192,
  "n_layer": 24,
  "n_positions": 8192,
  "pad_key_length": true,
  "pre_allocate_kv_cache": false,
  "resid_pdrop": 0.1,
  "scale_attention_softmax_in_fp32": true,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "validate_runner_input": true,
  "vocab_size": 49152
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.31.0"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:22e2ebd6c08e650edf4959ba737a5337c15ce36952d7144bcb0dbd77c9dc77a0
size 4548924617
special_tokens_map.json ADDED
@@ -0,0 +1,27 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,31 @@
{
  "add_prefix_space": false,
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff