Model not running on CPU due to flash_attn package requirement.
I am trying to load the prosparse-llama-2-7b model on an ARM CPU machine (a gr3 instance).
The model requires flash_attn, and installing flash_attn on this machine fails with an nvcc error.
Other SparseLLM models such as SparseLLM/ReluLLaMA-7B and https://huggingface.co/SparseLLM/ProSparse-MiniCPM-1B-sft load and run on CPU without issues; the problem is specific to this model and its larger variant, prosparse-llama-2-13b.
Could you please look into this? I think the hard flash_attn dependency needs to be removed, because otherwise the model cannot run on CPU-only machines.
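For reference, here is roughly the code I am running (a minimal sketch; the prompt, dtype, and generation settings are just illustrative choices on my side):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SparseLLM/prosparse-llama-2-7b"

# trust_remote_code pulls in the custom modeling_sparsellama.py,
# which is where the flash_attn import happens.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Simple CPU generation to check the model actually runs.
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```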
This problem seems strange. I tried to load the model on a CPU machine with model = AutoModelForCausalLM.from_pretrained("SparseLLM/prosparse-llama-2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True) and it succeeded. Generally, if the machine has no GPU or flash-attn is not installed, transformers.utils.is_flash_attn_2_available() should return False, so flash_attn will not be required. You may check line 47 in modeling_sparsellama.py.
Therefore, you may want to check the return value of is_flash_attn_2_available() on your machine. Neither the flash-attn package nor a GPU is necessary to load these models on CPU machines.
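For example, a quick sanity check on your CPU machine would be something like this (a minimal sketch; on a machine without a GPU or without flash-attn installed it should print False):

```python
from transformers.utils import is_flash_attn_2_available

# On a CPU-only machine, or one where flash-attn is not installed, this should
# return False, and the custom modeling code should then skip flash_attn.
print(is_flash_attn_2_available())
```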
From your screenshots, the problem seems to lie in the import phase of the transformers package. My transformers version is 4.43.3. If changing the version does not solve the problem, I suggest digging into the source code that raises the exception to fix it.
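For example, something like this can confirm which version you have (4.43.3 is just the version on my machine, not a hard requirement):

```python
import transformers

# Check the installed version; mine is 4.43.3.
print(transformers.__version__)

# If it differs, you can pin/upgrade with, e.g.:
#   pip install -U "transformers==4.43.3"
```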
Thanks, I updated the transformers version and it seems to work!



