Update README.md
README.md
CHANGED
@@ -111,7 +111,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 The models have been pre-trained using a blend of the following datasets.
 
 | Language | Dataset | Tokens|
-
+|:---|:---|:---|
 |Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B
 ||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus)|130.7B
 |English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B
@@ -123,11 +123,14 @@ The models have been pre-trained using a blend of the following datasets.
 The models have been fine-tuned on the following datasets.
 
 | Language | Dataset | description |
-
-|Japanese|[
-
-
-
+|:---|:---|:---|
+|Japanese|[ichikara-instruction-004-001](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
+| |[answer-carefully-001]()| A manually constructed Japanese instruction dataset focusing on LLMs' safety |
+| |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL |
+| |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
+| |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
+|English |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
+| |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
 
 ## Evaluation
 