HELP!
I'm a student trying to recreate it. Please help me clear up some doubts.
- How were the books used for training? OCR -> JSON?
- What was the format of the dataset?
- Did you use a custom script, or something like LLaMA-Factory or Unsloth?
- How did you convert it to GGUF?
- What do I need to consider when doing it?
I converted the PDFs' OCR output to JSON. I only have one field, called text:
format: "text": "DATA_FROM_OCR"
Or does the dataset need to be in
"instruction" / "input" / "output" format???
LLaMA-Factory and Unsloth report a lot of issues with my dataset, and some token errors I don't understand.
Please help me complete my project!
You really need to do some research! YouTube it.
Can you just give me some directions?
Perhaps: it's possible to convert the documents to text first.
Then you can create your dataset in Alpaca style: content, question, answer.
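A sketch of one Alpaca-style record as described above ("instruction" / "input" / "output" are the standard Alpaca field names; the values here are invented placeholders, not from any real dataset):

```python
import json

# One Alpaca-style training record; the field names are the standard
# Alpaca keys, the values are made-up placeholders.
record = {
    "instruction": "Answer the question using the book excerpt.",
    "input": "Excerpt: (OCR text goes here)\nQuestion: Who is the narrator?",
    "output": "The narrator is (answer goes here).",
}

# A dataset is just a JSON list (or JSONL file) of such records.
dataset_json = json.dumps([record], ensure_ascii=False, indent=2)
```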
...
If you're doing image-based training, then the model will need to learn images?? So you would need an image encoder/decoder... perhaps base64-encoding each image, so that each image becomes a text-based image...
Mistral is a text-based model, so for VQA (visual question answering) you should be using a BLIP model, perhaps?
You should look at the Hugging Face documentation and find the correct model for the task... perhaps a Space which is doing the same task...
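For the BLIP suggestion above, a rough sketch of VQA with Hugging Face `transformers` (assumes the `transformers` and Pillow packages and the public `Salesforce/blip-vqa-base` checkpoint; the image path and question are placeholders, and this is untested sketch code, not a verified pipeline):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Downloads the BLIP VQA checkpoint from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Placeholder inputs: a scanned page image and a question about it.
image = Image.open("page.png").convert("RGB")
question = "What is the chapter title?"

inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```

Note the contrast with the base64 idea earlier in the thread: BLIP has a real vision encoder, so it sees pixels directly rather than a text serialization of the image.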