Double the memory usage when calling the generate function (inference)
#36 · opened by isaranto
I am experiencing OOM errors while trying to run inference on a GPU with 16 GB of VRAM. I manage to load the model, but memory then maxes out when I try to generate samples.
Digging into this a bit while profiling on a different machine, I found that the RAM used by the model.generate() call is about 2x the model size, which is not something I have seen with other models. The amount of extra memory requested looks as if a second copy of the model were being loaded onto the same device. Has anyone experienced anything similar?
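On the GPU side, this is roughly how I would check whether generate() really grabs another model-sized chunk, using PyTorch's peak-allocation counters (just a sketch, not the run I profiled below; the dtype/device handling and generation arguments are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "falcon-7b-instruct"  # local path, same as in the trace below

# Load as in the trace below and move the weights to the GPU
# (dtype is left at whatever from_pretrained picks).
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
                                             trust_remote_code=True,
                                             low_cpu_mem_usage=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

inputs = tokenizer("Once upon a time ", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
baseline = torch.cuda.memory_allocated()   # weights already resident on the GPU
outputs = model.generate(inputs["input_ids"], max_length=100,
                         do_sample=True, top_k=50, top_p=0.9)
peak = torch.cuda.max_memory_allocated()   # highest allocation reached inside generate()

print(f"extra GPU memory during generate: {(peak - baseline) / 1024 ** 3:.2f} GiB")
```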
This is the code I am profiling on CPU, together with the results:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
     8     84.0 MiB     84.0 MiB           1    @profile()
     9                                          def predict():
    10     84.0 MiB      0.0 MiB           1        start = time.time()
    11     84.0 MiB      0.0 MiB           1        model_path = "falcon-7b-instruct"
    12  16616.1 MiB  16532.1 MiB           2        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
    13     84.0 MiB      0.0 MiB           1                                                     trust_remote_code=True, low_cpu_mem_usage=True)
    14  16616.2 MiB      0.1 MiB           1        print(f"Memory Footprint {round(model.get_memory_footprint()/(1024*1024*1024), 2)} GB")
    15  16616.2 MiB      0.0 MiB           1        print(f"Memory Footprint {round(model.get_memory_footprint(return_buffers=False) / (1024 * 1024 * 1024), 2)} GB")
    16  16655.0 MiB     38.8 MiB           2        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True,
    17  16616.2 MiB      0.0 MiB           1                                                  low_cpu_mem_usage=True)
    18  16655.0 MiB      0.0 MiB           1        print(f"model loaded in {time.time() - start}")
    19  16655.0 MiB      0.0 MiB           1        prompt = "Once upon a time "
    20  16655.0 MiB      0.0 MiB           1        result_length = 100
    21  16655.0 MiB      0.0 MiB           1        start = time.time()
    22  16656.0 MiB      1.1 MiB           1        inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    23  16656.0 MiB      0.0 MiB           1        result_length = inputs["input_ids"].size()[1] + result_length
    24
    25  30170.0 MiB  13514.0 MiB           2        outputs = model.generate(inputs["input_ids"],
    26  16656.0 MiB      0.0 MiB           1                                 max_length=result_length,
    27  16656.0 MiB      0.0 MiB           1                                 do_sample=True,
    28  16656.0 MiB      0.0 MiB           1                                 top_k=50,
    29  16656.0 MiB      0.0 MiB           1                                 top_p=0.9
    30                                              )
    31
    32  30170.1 MiB      0.0 MiB           1        response = tokenizer.decode(outputs[0])
    33  30170.1 MiB      0.0 MiB           1        print(time.time() - start)
    34  30170.1 MiB      0.0 MiB           1        print(response)
```
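For anyone who wants to reproduce the numbers, here is the same function as a plain, runnable script reconstructed from the trace above (the local model path and all arguments are taken directly from the trace; the file name in the run comment is just an example):

```python
# Run with: python profile_falcon.py  (any file name works; memory_profiler prints
# the line-by-line table above because of the @profile() decorator)
import time

from memory_profiler import profile
from transformers import AutoModelForCausalLM, AutoTokenizer


@profile()
def predict():
    start = time.time()
    model_path = "falcon-7b-instruct"
    # Weights stay on the CPU; low_cpu_mem_usage avoids a second full copy while loading.
    model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
                                                 trust_remote_code=True, low_cpu_mem_usage=True)
    print(f"Memory Footprint {round(model.get_memory_footprint() / (1024 * 1024 * 1024), 2)} GB")
    print(f"Memory Footprint {round(model.get_memory_footprint(return_buffers=False) / (1024 * 1024 * 1024), 2)} GB")
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True,
                                              low_cpu_mem_usage=True)  # kwargs kept as in the trace
    print(f"model loaded in {time.time() - start}")

    prompt = "Once upon a time "
    result_length = 100
    start = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    result_length = inputs["input_ids"].size()[1] + result_length

    # This is the call whose increment (~13.5 GiB) is roughly another model-sized block.
    outputs = model.generate(inputs["input_ids"],
                             max_length=result_length,
                             do_sample=True,
                             top_k=50,
                             top_p=0.9
                             )

    response = tokenizer.decode(outputs[0])
    print(time.time() - start)
    print(response)


if __name__ == "__main__":
    predict()
```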