Hi, I've just started playing around with Quantized LLMs and their capabilities and to be honest it is amazing! But when I try to programmatically infer something the result I get is not to par with LMStudio results.

For example :

1)
I have given a huge text from a Resume and asked the model the following questions. Summarize the resume in less than 100 words. What are the job roles to which this profile will be suitable to? What are the technologies in which they worked on? What is your profile rating for the below resume? - It answered great, though it was slow which is expected

I was expecting a similar response when I tried to infer the model with langchain using below code.

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

Callbacks support token-wise streaming

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

Make sure the model path is correct for your system!

llm = LlamaCpp(
model_path=MODEL_PATH,
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
n_ctx=4096,
n_threads=4,
n_batch = 512,
temperature=0.8,
top_p=0.95,
top_k=40,
repeat_penalty=1.1,
)

llm.invoke("""Resume Text Here""")

But it failed midway when it was trying to infer the content alone. It never reached to prediction of the answer.

When I tried with a resume text which is half of what I used initially, I was getting a response from the model now (through langchain) but the context was not up-to the mark.

I played around with prompts/prompt templates/Jinja2 Templates but did not have much luck.

I am looking for approach/guidance which I am missing here.

Long Story Short : What is it that I need to do extra to get a decent quality output that I get in LMStudio? Assuming I already have the default setting of the model whichever is available in LMStudio. A sample of how to infer the model, like the format of text/messages/content/instruction for better output.

Any little help would be appreciated. Thanks in Advance

MaziyarPanahi
/

Meta-Llama-3-8B-Instruct-GGUF

What am I missing? - Langchain vs LMStudio

Callbacks support token-wise streaming

Make sure the model path is correct for your system!