The shunts are available for ComfyUI now. Sorry about the delay; this isn't my only project, and I have game dev tasks to handle alongside it.


This adapter currently uses the T5-flan-base to guide the outputs of ViT-L-14 and/or ViT-bigG-14, and the reverse direction is equally usable within the same architecture, meaning the CLIP_G can also guide the T5-flan-base.
These checkpoints were trained on 20 million synthetic human-templated captions, and they can be heavily improved with multiple languages, additional depiction context, or any finetune task the user wishes to apply to the T5-flan-base, with little to no training thanks to the adapter's functionality and accuracy.
The ViT-L-14 adapters only took a couple of hours on a Colab A100, and the ViT-bigG-14 took about 4 hours, so you can rapidly produce many of these in short periods of time with almost no additional overhead beyond the single T5-flan-base required. Each can be compiled, loaded, and offloaded.
This is a cross-attention system meant to shape the encoded text after the output is received from the CLIP models, and it is very fast to inference; the T5-flan-base, on the other hand, isn't the fastest.
It's trained on a form of cooperative association with a series of complex losses designed specifically for this associative process.
This adapter has individual gating for tokenization context, with a multitude of safeguards to prevent overfitting during rapid learning, and it can be paired with any number of additional adapters.
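To make the mechanism more concrete, here is a minimal PyTorch sketch of the cross-attention shunt idea: CLIP token embeddings attend over projected T5 hidden states and receive a gated residual delta. The class name, dimensions, and gating scheme are illustrative assumptions, not the released architecture.

```python
# Minimal sketch, assuming T5-flan-base and CLIP ViT-L/14 text widths of 768.
# The gate starts closed, so an untrained adapter leaves the CLIP output untouched.
import torch
import torch.nn as nn

class ShuntAdapter(nn.Module):
    def __init__(self, t5_dim=768, clip_dim=768, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(t5_dim, clip_dim)       # map T5 states into CLIP space
        self.attn = nn.MultiheadAttention(clip_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(clip_dim)
        self.gate = nn.Parameter(torch.zeros(clip_dim))  # per-channel residual gate

    def forward(self, clip_embeds, t5_hidden):
        # CLIP tokens query the projected T5 hidden states (cross-attention),
        # and the result is added back as a gated residual shift.
        guidance = self.proj_in(t5_hidden)
        delta, _ = self.attn(self.norm(clip_embeds), guidance, guidance)
        return clip_embeds + torch.tanh(self.gate) * delta
```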
I'm currently formatting the ComfyUI nodes that will allow easy conditioning shifts and showcase the full power of this cooperative system.
The ComfyUI nodes will be available here shortly; I just need to write them.
https://github.com/AbstractEyes/comfy-clip-shunts

An upgraded version based on the T5-flan-base is cooking, with a more powerful schema, stronger block catches, a more carefully curated bottleneck, additional loss calculations, and roughly 6.7M params.

This variation was based on taking only the T5 in at runtime, with the adapter attempting to approximate the guidance that the ViT-L-14 would provide via an associative loss at training time.
That outcome wasn't robust enough and lacked the detail that an adapter of this nature requires, so I recreated the adapter to be trained to accept both the t5-small and ViT-L-14 as inputs. The outcomes are substantially more stable. The adapter weights and outcomes are posted as t5-vit-14-v1.

AbstractPhil/t5-vit-14-v1
Included is a simple drop-in for SDXL experimentation using Colab.
The outcome is okay but not great; diffusers is a headache, so I spent more time untangling that machinery than I did actually experimenting with this adapter.
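For orientation, a rough sketch of what the Colab drop-in amounts to on the diffusers side might look like this. The adapter call is an assumption (the real shunt would need to match SDXL's concatenated text-embedding width), and encode_prompt's exact return values can differ between diffusers versions.

```python
# Hedged usage sketch: encode the prompt normally, let the adapter reshape the CLIP
# conditioning using T5 hidden states, then pass the embeddings back into the pipeline.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda"
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

tok = T5Tokenizer.from_pretrained("t5-small")
t5 = T5EncoderModel.from_pretrained("t5-small").to(device, torch.float16)

prompt = "a brightly lit room completely filled with many tacos"
t5_ids = tok(prompt, return_tensors="pt").input_ids.to(device)
t5_hidden = t5(t5_ids).last_hidden_state

pe, npe, ppe, nppe = pipe.encode_prompt(
    prompt, device=device, num_images_per_prompt=1, do_classifier_free_guidance=True
)
# `adapter` is the trained shunt (assumed loaded), sized to SDXL's text-embedding width.
pe = adapter(pe, t5_hidden)

image = pipe(
    prompt_embeds=pe, negative_prompt_embeds=npe,
    pooled_prompt_embeds=ppe, negative_pooled_prompt_embeds=nppe,
).images[0]
```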
I trained two variations of the baseline adapter:
t5-small vanilla and t5-small-human-associated-try2-pass3.
The vanilla was more accurate at adding context, while the human-associated one stays locked onto human topics like a bloodhound... badly. Both ended up substandard, even with a robust adapter like this.
Finetunes with specific goals can complete at runtime if desired, thanks to the t5-small's tiny size, clip_l's inference speed, and the adapter's size. The adapter is very small and has overfitting safeguards that can be disabled, so runtime freezing and adaptive shifts are a viable methodology for immediate task-pipeline adaptation.
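As a rough illustration of what that runtime adaptation could look like: both backbones stay frozen and only the tiny adapter trains for a few hundred steps. `ShuntAdapter` is the illustrative class sketched earlier, `task_batches` stands in for whatever task data you have, and the MSE target is a placeholder, not the actual training loss.

```python
# Hedged sketch of a runtime finetune: freeze t5-small and clip_l, train only the adapter.
import torch
from transformers import CLIPTextModel, T5EncoderModel

t5 = T5EncoderModel.from_pretrained("t5-small").cuda().eval()
clip_l = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").cuda().eval()
for p in list(t5.parameters()) + list(clip_l.parameters()):
    p.requires_grad_(False)

adapter = ShuntAdapter(t5_dim=512, clip_dim=768).cuda()  # t5-small hidden width is 512
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for step, (t5_ids, clip_ids, target) in enumerate(task_batches):
    with torch.no_grad():
        t5_hidden = t5(t5_ids).last_hidden_state
        clip_hidden = clip_l(clip_ids).last_hidden_state
    shifted = adapter(clip_hidden, t5_hidden)
    loss = torch.nn.functional.mse_loss(shifted, target)  # placeholder task loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    if step >= 200:  # a model this small converges in very few steps
        break
```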
The t5-small lacks the behavioral complexity of a model better built for such a task, such as the base, large, or XXL, or even the Flan-T5-small. However, that doesn't slow the little brain slug down. It guides, and its wrappers have plenty of rapid-generation potential, whether it's trained the way I trained it or not.
The proof of concept is there, and the outcomes are present. Judge for yourself.
The next variation will have more dims, more catches, higher conv, and additional safeguards to prevent overfitting, as well as considerably more LAION flavors so the T5-flan-base doesn't overwhelm the adapter or vice versa.


Try 2 seems to be normalizing around the intentionally low loss valuation I set up, built on a combination of scaled weights and careful use of alternating special tokens meant to teach models like the T5 how to behave intrinsically, in line with the paper and the research done on SentencePiece.
Slow bleed should help preserve a large portion of the internal structure while slowly attenuating it and reshaping the weights of the high/low curve around each batch directly, creating a bit of a cascading 4D bleed effect: one middle-ground high-end topic is swapped per one middle-ground low-end topic's valuation for learning and rephrasing.
In the process I've introduced inverse weighting to account for too many of one token or too many of another, while simultaneously boosting the power of the lowest tokens without overfitting everything to them on a minimal scale, and while reducing the overall effect of the highest accountable token. This helps avoid overfitting everything to a generic linear flood, and it allows training on much smaller captions without completely obliterating the structure of the T5-small in less than a few thousand steps.
Additionally, the highest and lowest tokens are automatically weighted up or down, and once a token is masked, the structure is automatically rescaled around the variant being masked and attended to.
Over time this will teach it to predict the middle-curve portions of variant captions, allowing it to eventually conform to the normalization of the new "middle-ground" prediction, which is purposed to swap elements from one caption to another. This should allow a very high learning rate without completely destroying the T5-small, thanks to the various regularization techniques engineered for this task.
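A hedged sketch of the inverse-weighting idea described above, for readers who want it in code: tokens that dominate a batch get weighted down, rare tokens get weighted up, and masked positions are dropped before the per-token loss is averaged. The scaling curve and normalization are illustrative, not the exact formula used in training.

```python
# Inverse-frequency token weighting sketch (assumed shapes: logits [B, T, V], labels [B, T]).
import torch
import torch.nn.functional as F

def inverse_weighted_loss(logits, labels, pad_id=0, eps=1e-6):
    mask = labels.ne(pad_id)                              # drop masked/pad positions
    flat_labels = labels[mask]
    counts = torch.bincount(flat_labels, minlength=logits.size(-1)).float()
    freq = counts[flat_labels] / flat_labels.numel()      # per-token batch frequency
    weights = 1.0 / (freq + eps)                          # rare tokens weigh more
    weights = weights / weights.mean()                    # keep the lowest from dominating
    per_token = F.cross_entropy(logits[mask], flat_labels, reduction="none")
    return (weights * per_token).mean()
```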
This process taps into training methods similar to those originally applied to the T5-small, as well as newer methods designed and implemented on much larger-scale training runs, with weighted substructures meant to not rip the arms and legs off this little model.
It won't do exactly what I want it to do yet, but that's where the high-complexity captions come into play. They are based entirely on compartmentalizing and sub-sectioning the critical systems into usable pieces. For example:
Example prompt: a room of tacos
Potential goal: a brightly lit room completely filled with many tacos
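As a toy illustration of how those compartmentalized captions could be templated (the slot lists and the parsing here are entirely made up, not the actual caption generator):

```python
# Hypothetical template expansion: split a simple caption into slots and enrich each slot.
import random

LIGHTING = ["a brightly lit", "a dimly lit", "a sunlit"]
FILL = ["completely filled with many", "scattered with a few", "overflowing with"]

def expand(simple_caption: str) -> str:
    # naive parse for captions of the form "a {place} of {subject}"
    _, place, _, subject = simple_caption.split(maxsplit=3)
    return f"{random.choice(LIGHTING)} {place} {random.choice(FILL)} {subject}"

print(expand("a room of tacos"))
# e.g. "a brightly lit room completely filled with many tacos"
```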

The experiment's captions are available... mostly, on my HF. I've had some rate-limit problems that caused the uploads to halt, and I think I need to autogen another 100 million complex captions.
This WILL form heavy bias and burn points. Random words will be peppered into the mix so the T5-small retains at least some semblance of what it was before I lobotomize it.
Likely I'll completely freeze half and burn the other half for a couple million steps as a test point, then see how it takes, or whether it dies before 50k or so and needs a refined process.
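A rough sketch of that freeze-half setup, assuming a standard transformers T5 checkpoint (which half gets frozen, and at what granularity, is a guess on my part):

```python
# Freeze the first half of the encoder blocks; everything else keeps training.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
blocks = model.encoder.block
for block in blocks[: len(blocks) // 2]:
    for p in block.parameters():
        p.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params after freezing: {trainable:,}")
```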
Oh great, even better. It didn't include the longer prompt variations, so this won't start today.
Alright, training has begun. I'm introducing a high-degree variant of noise and chatter for the T5 to learn to bypass, while simultaneously increasing the additional information output from the T5 in the process.
So far the outcome has been a degree of new information introduced into the output, while simultaneously introducing rule-of-3 parameterization into the T5-small.
I have high hopes.
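For clarity, the noise-and-chatter injection can be thought of along these lines: random filler tokens are peppered into the input while the target stays clean, so the T5 learns to pass over them. The filler list and injection rate here are made up for illustration.

```python
# Hypothetical chatter injection for building noisy-input / clean-target training pairs.
import random

FILLER = ["um", "basically", "sort of", "static", "noise", "whatever"]

def add_chatter(caption: str, rate: float = 0.2) -> str:
    out = []
    for word in caption.split():
        out.append(word)
        if random.random() < rate:
            out.append(random.choice(FILLER))
    return " ".join(out)

target = "a brightly lit room completely filled with many tacos"
source = add_chatter(target)  # e.g. "a um brightly lit room sort of completely filled ..."
```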

Simply put, I'm not that good with SD15. I got a few cool things working, but training a zeroed model isn't my forte; I'm better at enhancing or improving than at creating entire full structures. My limit here is almost strictly a logistics one: the libraries don't like SD15, and they especially don't like it when I start tinkering with internals and recording information from layer activations.
I need additional fundamental tests and foundational documentation before I proceed with my SD15 finetune from baseline to useful, as training still tends to be quite vague to me. Training SD15 from zero to a useful state requires additional tests, practice, and experience: the full model has to reach an anchorable enough state for Surge to latch onto the necessary points of interest. Those points of interest won't heat up without the necessary level of attachment and consistency within the inference, and they are required to create the powerful web of information needed to learn and advance a model from point A to B without simply replacing the neurons themselves and making a flat copy.
There are other potentials here as well, where implanting neurons and finetuning has shown very good results. However, this isn't about that. This is about showcasing the power of the baseline system.
Hence the other end of the spectrum: where the model already exists and has some fairly well-tested capability, I can interpolate between two exacting classifier models, teacher/student, using direct layer-to-layer learning and optimizations for time and speed. Classifiers can be small, lightweight, and fully capable of anchor-point information transfer as well.
So instead of a full 1.5 finetune with various extra improvements (the original plan was to allow interpolation from a much stronger model into a much less robust one), I plan to take a classifier that is fully trained and pair it with one that has never seen the data, then attempt to interpolation-train this unfinished, imperfect classifier. The goal is to introduce the necessary information, using Surge and anchor points, into neurons that were already trained on different data. This process must be repeatable and useful for realms beyond simple 1:1 transfer, which is the point of an adapter like Surge.
I'd say a fair, fully repeatable notebook is in order, using simple processes and a simple classifier layer set. I'll choose well-known models whose training intentionally omits certain data, and introduce that training using another model trained in the notebook itself. No tricks, no barriers, nothing special. Just Keras, a bit of code, a bit of research, and a bit of elbow grease.
This will showcase the power of Surge while simultaneously introducing a new type of rapid interpolative learning.
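A minimal Keras sketch of the kind of notebook I mean: a trained teacher and an untrained student with matching hidden widths, where the student's hidden layer is pulled directly toward the teacher's activations alongside the ordinary classification loss. The architecture, loss mix, and alpha value are assumptions for illustration; this is not the Surge implementation itself.

```python
# Layer-to-layer teacher/student sketch in plain Keras.
import tensorflow as tf
from tensorflow import keras

def build_classifier(num_classes=10, hidden=128):
    inputs = keras.Input(shape=(28, 28))
    x = keras.layers.Flatten()(inputs)
    h = keras.layers.Dense(hidden, activation="relu", name="hidden")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(h)
    return keras.Model(inputs, outputs)

teacher = build_classifier()  # assume this one has already been trained on the omitted data
student = build_classifier()  # this one has never seen that data

# Expose the hidden activations of both models so the layers can be matched directly.
teacher_tap = keras.Model(teacher.input, teacher.get_layer("hidden").output)
student_tap = keras.Model(student.input, student.get_layer("hidden").output)

opt = keras.optimizers.Adam(1e-3)
ce = keras.losses.SparseCategoricalCrossentropy()

@tf.function
def distill_step(x, y, alpha=0.5):
    t_hidden = teacher_tap(x, training=False)
    with tf.GradientTape() as tape:
        s_hidden = student_tap(x, training=True)
        preds = student(x, training=True)
        layer_loss = tf.reduce_mean(tf.square(t_hidden - s_hidden))  # anchor-point pull
        cls_loss = ce(y, preds)                                      # ordinary task loss
        loss = alpha * layer_loss + (1.0 - alpha) * cls_loss
    grads = tape.gradient(loss, student.trainable_variables)
    opt.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```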

https://civitai.com/articles/14195/the-methodology-of-surge-training-loss-math
The datasets I'm sourcing will serve as catalysts and tests for Surge's ability to teach very sticky or difficult-to-learn elements, such as text, positioning, offset, and ControlNet poses, directly into the very stubborn SDXL infrastructure without additional tools.
It should be noted that my currently running finetunes based on BeatriXL are not Surge-trained, so you won't gain knowledge about Surge from them.
GPT and I have prototyped a new version of SD15 that operates with additional attention heads to match the Surge formula, the reformed Omega-VIT-L, a zeroed UNet, and the Flux 16-channel AE.
I'll call it SD-SURGE, as it's not SD15 anymore.
The first Surge trainings are already under way.

Usually, anything learning from its own output just causes cascade failure given enough time. Without large amounts of additional information and context, along with additional learning weights and parameters in a relative sense, it's likely not going to be particularly effective at remembering new information, even when the conversation is left in context and the context itself is regulated and weighted.
Cascade failure eventually becomes imminent when the weights get too heavy and the model starts to form a natural introspective bias on individual tokens. Looking at the historical valuation of English used in literature reveals that some words are used far more than others, and with that same historic bias applied you can likely cut the edge off traditional literature; but modern literature has a completely different spectrum of words with entirely different odds.
So either cascade failure is imminent, even with self-regularizing tokenization, or the process of learning is essentially nonexistent and the model simply remembers nothing outside of the context window.
Though... I think I may have an idea to fix it.