Text Generation
Transformers
Safetensors
Indonesian
llama
text-generation-inference
alxxtexxr committed
Commit 97fdeaf · verified · 1 Parent(s): 514ce42

Create README.md

Files changed (1)
  1. README.md +73 -0
README.md ADDED
@@ -0,0 +1,73 @@
---
license: apache-2.0
datasets:
- alxxtexxr/indowebgen-dataset
language:
- id
---
# 🇮🇩🌐🤖 IndoWebGen: LLM for Automated (Bootstrap-Based) Website Generation Based on Indonesian Instructions
Hugely inspired by [Web App Factory](https://huggingface.co/spaces/jbilcke-hf/webapp-factory-wizardcoder).

## Model Description:
- Base Model: [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) [[1](https://arxiv.org/abs/2308.12950)]
- Finetuning Method: LoRA [[2](https://arxiv.org/abs/2106.09685)]
- Dataset: [alxxtexxr/indowebgen-dataset](https://huggingface.co/datasets/alxxtexxr/indowebgen-dataset)
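To get a feel for the training data, the dataset can be loaded with the 🤗 `datasets` library. The snippet below is only a minimal sketch; the `train` split name and the column layout are assumptions, so check the dataset card for the exact schema.

```python
from datasets import load_dataset

# Load the instruction-to-website dataset used for finetuning.
# The split name 'train' is an assumption; see the dataset card for details.
dataset = load_dataset('alxxtexxr/indowebgen-dataset', split='train')

print(dataset)     # number of rows and column names
print(dataset[0])  # one instruction/output example
```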

## Finetuning Hyperparameters:
- Number of Epochs: 20
- Microbatch Size: 4
- Gradient Accumulation Steps: 8
- LoRA Rank: 16
- LoRA Alpha: 32

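With a microbatch size of 4 and 8 gradient accumulation steps, the effective batch size per optimizer step is 32. For reference, below is a minimal PEFT-style sketch of a LoRA setup matching the rank and alpha above; it is not the original training script, and the dropout, bias, and target-module defaults are assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Rough sketch of the LoRA setup (not the original training code).
base_model = AutoModelForCausalLM.from_pretrained('codellama/CodeLlama-7b-hf')

lora_config = LoraConfig(
    r=16,               # LoRA Rank (from the list above)
    lora_alpha=32,      # LoRA Alpha (from the list above)
    lora_dropout=0.05,  # assumption, not listed above
    bias='none',
    task_type='CAUSAL_LM',
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```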
## Inference:
Try the inference demo [here](https://indowebgen.alimtegar.my.id) or run the inference code in the provided Google Colab notebook [here](https://colab.research.google.com/drive/1pqqLGcgRcUTBLCNeF0V6REi7INJ43IZb?usp=sharing). The inference code is shown below:
```python
# Install the required libraries
!pip install transformers bitsandbytes accelerate

# Import the necessary modules
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (quantized to 8-bit) and the tokenizer
model_id = 'alxxtexxr/indowebgen-7b'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    # load_in_4bit=True,  # for low memory
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize the prompt template. In English, the preamble reads:
# "Below is a website-creation instruction along with its output,
#  the HTML code of the created website:"
prompt_template = '''Berikut adalah instruksi pembuatan website beserta output-nya yang berupa kode HTML dari website yang dibuat:

### Instruksi:
{instruction}

### Output:
<html lang="id">'''

# INSERT YOUR OWN INDONESIAN INSTRUCTION BELOW
# (the example means "Create a portfolio website for Budi")
instruction = 'Buatlah website portfolio untuk Budi'

prompt = prompt_template.format(instruction=instruction)

# Generate the output
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=2400,
    do_sample=True,
    temperature=1.0,
    top_k=3,
    top_p=0.8,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.unk_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
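The decoded output contains the prompt followed by the generated markup, and the opening `<html lang="id">` tag is part of the prompt itself. The short follow-up sketch below (the file name and doctype handling are suggestions, not part of the original code) cuts out the HTML and saves it so it can be opened in a browser:

```python
# Everything from the opening <html> tag onward is the generated website.
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
html = output_text[output_text.index('<html lang="id">'):]

# Save it to a file (prepending a doctype) so it can be previewed locally.
with open('index.html', 'w') as f:
    f.write('<!DOCTYPE html>\n' + html)
```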

## Limitations
- The training dataset is limited to only 500 examples, so the model's performance may still not be optimal.
- The generated websites rely on Bootstrap for styling and Font Awesome for icons.
- The content of the generated websites, including the images, is placeholder content, so users need to customize the websites further.