Double the memory usage when calling the generate function (inference)
#36 · opened by isaranto
I am experiencing OOM errors while trying to run inference on a GPU with 16 GB of VRAM. I manage to load the model, but memory then maxes out when I try to generate samples.
Digging into this a bit while profiling on a different machine, I found that the RAM used by the model.generate() call is about 2x the model size, which is not something I have seen with other models. The amount of extra memory requested looks as if a second copy of the model were being loaded onto the same device. Has anyone experienced anything similar?
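On the GPU side, this is roughly how I would check whether generate() really grabs another model-sized chunk, using PyTorch's peak-allocation counters (just a sketch, not the run I profiled below; the dtype/device handling and generation arguments are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "falcon-7b-instruct"  # local path, same as in the trace below

# Load as in the trace below and move the weights to the GPU
# (dtype is left at whatever from_pretrained picks).
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
                                             trust_remote_code=True,
                                             low_cpu_mem_usage=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

inputs = tokenizer("Once upon a time ", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
baseline = torch.cuda.memory_allocated()   # weights already resident on the GPU
outputs = model.generate(inputs["input_ids"], max_length=100,
                         do_sample=True, top_k=50, top_p=0.9)
peak = torch.cuda.max_memory_allocated()   # highest allocation reached inside generate()

print(f"extra GPU memory during generate: {(peak - baseline) / 1024 ** 3:.2f} GiB")
```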
This is the code I am profiling on CPU, together with the results:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
     8     84.0 MiB     84.0 MiB           1    @profile()
     9                                          def predict():
    10     84.0 MiB      0.0 MiB           1        start = time.time()
    11     84.0 MiB      0.0 MiB           1        model_path = "falcon-7b-instruct"
    12  16616.1 MiB  16532.1 MiB           2        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
    13     84.0 MiB      0.0 MiB           1                                                     trust_remote_code=True, low_cpu_mem_usage=True)
    14  16616.2 MiB      0.1 MiB           1        print(f"Memory Footprint {round(model.get_memory_footprint()/(1024*1024*1024), 2)} GB")
    15  16616.2 MiB      0.0 MiB           1        print(f"Memory Footprint {round(model.get_memory_footprint(return_buffers=False) / (1024 * 1024 * 1024), 2)} GB")
    16  16655.0 MiB     38.8 MiB           2        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True,
    17  16616.2 MiB      0.0 MiB           1                                                  low_cpu_mem_usage=True)
    18  16655.0 MiB      0.0 MiB           1        print(f"model loaded in {time.time() - start}")
    19  16655.0 MiB      0.0 MiB           1        prompt = "Once upon a time "
    20  16655.0 MiB      0.0 MiB           1        result_length = 100
    21  16655.0 MiB      0.0 MiB           1        start = time.time()
    22  16656.0 MiB      1.1 MiB           1        inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    23  16656.0 MiB      0.0 MiB           1        result_length = inputs["input_ids"].size()[1] + result_length
    24
    25  30170.0 MiB  13514.0 MiB           2        outputs = model.generate(inputs["input_ids"],
    26  16656.0 MiB      0.0 MiB           1                                 max_length=result_length,
    27  16656.0 MiB      0.0 MiB           1                                 do_sample=True,
    28  16656.0 MiB      0.0 MiB           1                                 top_k=50,
    29  16656.0 MiB      0.0 MiB           1                                 top_p=0.9
    30                                              )
    31
    32  30170.1 MiB      0.0 MiB           1        response = tokenizer.decode(outputs[0])
    33  30170.1 MiB      0.0 MiB           1        print(time.time() - start)
    34  30170.1 MiB      0.0 MiB           1        print(response)
```
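For anyone who wants to reproduce the numbers, here is the same function as a plain, runnable script reconstructed from the trace above (the local model path and all arguments are taken directly from the trace; the file name in the run comment is just an example):

```python
# Run with: python profile_falcon.py  (any file name works; memory_profiler prints
# the line-by-line table above because of the @profile() decorator)
import time

from memory_profiler import profile
from transformers import AutoModelForCausalLM, AutoTokenizer


@profile()
def predict():
    start = time.time()
    model_path = "falcon-7b-instruct"
    # Weights stay on the CPU; low_cpu_mem_usage avoids a second full copy while loading.
    model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
                                                 trust_remote_code=True, low_cpu_mem_usage=True)
    print(f"Memory Footprint {round(model.get_memory_footprint() / (1024 * 1024 * 1024), 2)} GB")
    print(f"Memory Footprint {round(model.get_memory_footprint(return_buffers=False) / (1024 * 1024 * 1024), 2)} GB")
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True,
                                              low_cpu_mem_usage=True)  # kwargs kept as in the trace
    print(f"model loaded in {time.time() - start}")

    prompt = "Once upon a time "
    result_length = 100
    start = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    result_length = inputs["input_ids"].size()[1] + result_length

    # This is the call whose increment (~13.5 GiB) is roughly another model-sized block.
    outputs = model.generate(inputs["input_ids"],
                             max_length=result_length,
                             do_sample=True,
                             top_k=50,
                             top_p=0.9
                             )

    response = tokenizer.decode(outputs[0])
    print(time.time() - start)
    print(response)


if __name__ == "__main__":
    predict()
```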