---
license: apache-2.0
datasets:
- graelo/wikipedia
- uonlp/CulturaX
- HuggingFaceH4/ultrachat_200k
language:
- ja
- en
---
# Base checkpoint

[augmxnt/shisa-7b-v1](https://huggingface.co/augmxnt/shisa-7b-v1)
- Mistral-7B base
- Pre-trained on 8B tokens of MADLAD-Ja
- Fine-tuned on Japanese instructions
- Highest-scoring 7B model on a Japanese conversation benchmark (JA MT-Bench)
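Since the base checkpoint is hosted on the Hugging Face Hub, it can be loaded directly with `transformers`. The sketch below is illustrative and not from the original card; the prompt and generation settings are placeholder assumptions.

```python
# Minimal sketch of loading the base checkpoint with Hugging Face
# transformers. The prompt and generation settings are illustrative,
# not the authors' configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "augmxnt/shisa-7b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "日本で一番高い山は何ですか？"  # "What is the tallest mountain in Japan?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```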
# Training datasets (total ~7B tokens)

- Aozora Bunko
- Japanese Law Precedent Dataset
- Japanese Wikipedia
- Webscrapes of the .lg.jp, .go.jp, and .ac.jp domains from CulturaX (documents sharing the same first 25 characters were de-duplicated; see the sketch after this list)
- English Ultrachat200K-gen (included so the model does not lose the English and chat ability learned in the base checkpoint)
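As a rough illustration of the CulturaX domain filtering and first-25-character de-duplication described above, here is a hedged sketch. It assumes CulturaX's `text` and `url` columns and is not the authors' actual pipeline; the domain check and de-duplication key are simplified.

```python
# Hedged sketch of the domain filter + first-25-character de-duplication
# described above. Assumes CulturaX's `text` and `url` columns; this is
# not the authors' actual pipeline.
from datasets import load_dataset
from urllib.parse import urlparse

TARGET_SUFFIXES = (".lg.jp", ".go.jp", ".ac.jp")

def from_target_domain(example):
    # Keep only documents whose source host ends in one of the target domains.
    host = urlparse(example["url"]).netloc
    return host.endswith(TARGET_SUFFIXES)

# Stream the Japanese split of CulturaX and apply the domain filter.
culturax_ja = load_dataset("uonlp/CulturaX", "ja", split="train", streaming=True)
filtered = culturax_ja.filter(from_target_domain)

# De-duplicate: drop any document whose first 25 characters were seen before.
seen = set()
deduped = []
for doc in filtered:
    key = doc["text"][:25]
    if key not in seen:
        seen.add(key)
        deduped.append(doc)
```

Keying on a short prefix is a cheap approximation of exact de-duplication: it catches boilerplate-heavy near-duplicates while storing only 25 characters per document.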
# Developed by

### Engineers
- Peter Devine
- Sho Higuchi

### Advisors
- Yuuki Yamanaka
- Atom Sonoda

### Dataset evaluator
- Renju Aoki