not the latest tech
We need to find a new formula; this is old technology and the results are not the best. Models can't correctly remember everything they were taught. Mixture-of-experts (MoE) models, where each of the many experts is not overloaded with information, work well compared to a single MLP overloaded with everything. This old tech would work better if external info (depending on the question) were loaded into the context window, but current inference players are not advanced enough to auto-load that external info. A 7B model alone, with the current formula/scheme, will not achieve anything more; with more training or not, nothing will change.
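For what it's worth, the MoE point above (each expert holds only a slice of the knowledge, and only a few experts run per token) can be sketched like this. The shapes, the top-2 gating, and the linear experts are my own illustrative assumptions, not any particular model's config:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_routing(x, gate_w, k=2):
    """Pick the k experts with the highest gate score for this token."""
    logits = x @ gate_w                       # one score per expert
    top = np.argsort(logits)[::-1][:k]        # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the chosen experts only
    return top, weights

d_model, n_experts = 8, 4
gate_w = rng.normal(size=(d_model, n_experts))
# Each expert is a toy linear layer; in a real MoE these are full MLPs.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

x = rng.normal(size=d_model)                  # one token's hidden state
idx, w = top_k_routing(x, gate_w, k=2)
# Only the 2 selected experts compute; the other 2 stay idle for this token.
y = sum(wi * (x @ experts[i]) for wi, i in zip(w, idx))
```

The point of the sketch: per token, only 2 of the 4 expert matrices are touched, so each expert only ever has to "remember" the subset of data the router sends its way.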
Just for fun, train it on 200T-300T tokens and compare with a 20T-token model.
https://huggingface.co/datasets/institutional/institutional-books-1.0
983K books, published largely in the 19th and 20th centuries
242B o200k_base tokens