Could you please describe in simple words how it really works? I've read the paper but I'm unable to understand it.
How is it possible for an LLM to "think" in "latent space" if it is a language model operating on "words" (tokens)?
From what I understand, it basically passes information between specialized inner layers over and over for a while, without turning it into output tokens.
The model uses vectors of numbers that represent a state of information. On these vectors it performs mathematical transformations that simulate reasoning processes, and at the end it converts the result back to tokens so you can read the output/answer.
So, regarding this part of your question:
if it is a language model operating on "words" (tokens).
It's essentially using mathematical transformations to reason.
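As a rough illustration of that last step (my own sketch, not code from the paper; the names and sizes here are made up), the latent vector only becomes a "word" when it is projected onto the vocabulary at the very end:
import torch

hidden_size, vocab_size = 4096, 32000            # made-up sizes, purely for illustration
lm_head = torch.nn.Linear(hidden_size, vocab_size)

x = torch.randn(1, hidden_size)                  # a latent state: just numbers, not a word
logits = lm_head(x)                              # one score per vocabulary entry
probs = torch.softmax(logits, dim=-1)            # probabilities over possible next tokens
next_token = probs.argmax(dim=-1)                # only here does a token id appear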
Conceptually (and I do NOT mean in practice), it does chain-of-thought internally before outputting a token. The more steps, the longer it thinks before outputting the first token. The result is a reasoning-type model that uses a smaller context window and fewer tokens, delivering better performance than models of a similar parameter size (I believe they suggest it is equivalent to models with around 50bn more parameters).
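A toy version of that loop (purely illustrative, nothing like the paper's actual architecture):
import torch

core = torch.nn.Linear(8, 8)                     # tiny stand-in for the recurrent core

def think(state, num_steps):
    for _ in range(num_steps):                   # more steps = more internal computation
        state = torch.tanh(core(state))          # refine the latent state; no tokens emitted here
    return state                                 # a token is only decoded from this afterwards

state = torch.randn(1, 8)
after_4 = think(state, 4)                        # a quick answer
after_64 = think(state, 64)                      # same weights, much more "thinking"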
Thanks for your answers.
My knowledge of LLMs is simplified. My understanding is that the output of a deep network of transformers is a set of vectors (eventually mapped to a set of tokens), each with a probability that it is the next token in the output.
Is the approach from the paper to (instead of outputting the token) put this set of vectors through another network of transformers and do so a number of times? How does it constitute the "internal thinking"?
I would like the paper to be explained (ideally by the authors or anyone who understands it) to someone who is a software engineer with a computer science background but not an expert in AI/LLMs. I've heard explanations on YouTube, but these were targeted at the general public and were too vague (basically they come down to "the model thinks before speaking").
Hi MarcinCF,
It looks like you know that operations in a normal (fixed-depth) transformer are laid out in blocks like so:
# Normal transformer:
x = self.transformer.wte(input_tokens) # look up tokens in embedding matrix
for block in self.transformer.layers:
    x = block(x)
logits = self.lm_head(x)
With that notation, this (recurrent-depth) model is instead operating like so (with some slight simplifications):
input_embeds = self.transformer.wte(input_tokens) # look up tokens in embedding matrix
# Non-recurrent prelude
for block in self.transformer.prelude:
    input_embeds = block(input_embeds)
# Initialize state
x = torch.randn_like(input_embeds)
# Main recurrence
for step in range(num_steps_recurrence):
    x = self.transformer.adapter(torch.cat([x, input_embeds], dim=-1)) # Adapter
    for block in self.transformer.core_block: # 4 Inner layers
        x = block(x)
# Non-recurrent coda
for block in self.transformer.coda:
    x = block(x)
x = self.transformer.ln_f(x)
# Prediction
logits = self.lm_head(x)
where num_steps_recurrence can be varied at inference time to do more or less computation. The important point is that x here no longer corresponds to words (only input_tokens are words); x and input_embeds are 5280-dimensional vectors, which the model is processing and refining. This process can be described as reasoning, but it is important to separate the statement that "the model is reasoning more when it computes more", which is an observation of the behavior of the trained model, from the intuition referenced in the introduction of the paper and the mathematical definition of the model as laid out in Section 3.
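To make the dimensions concrete, here is a sketch of just the adapter step above (assuming, for illustration only, that the adapter is a plain linear map from 2·5280 down to 5280 dimensions; check the repository for the real module):
import torch

d = 5280                                              # the latent width mentioned above
adapter = torch.nn.Linear(2 * d, d)                   # assumed stand-in for self.transformer.adapter
x = torch.randn(1, 16, d)                             # random initial state (batch, sequence, d)
input_embeds = torch.randn(1, 16, d)                  # embedded prompt after the prelude
x = adapter(torch.cat([x, input_embeds], dim=-1))     # re-inject the prompt into the latent state
print(x.shape)                                        # torch.Size([1, 16, 5280]): still latent, not tokens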