lambdasec (Lambda Security)

codelion

posted an update 3 days ago

Post

1841

New Research: Theoretical Foundations for In-Context Learning in Transformers

I'm excited to share our latest theoretical work that formally proves an interesting property of large language models: base transformer models can approximate fine-tuned capabilities using only inference-time techniques like in-context learning.

The core question we investigated: Can specialized behaviors typically acquired through expensive supervised fine-tuning be elicited from base models without any parameter updates?

Our theoretical contribution: We provide a formal proof, grounded in the Turing completeness of transformers, showing that this is indeed possible under certain assumptions. The work establishes mathematical bounds on the minimal dataset sizes needed for approximation.

Key theoretical results:

- For text generation tasks: O(mV/ε²) examples suffice (where m = number of contexts, V = vocabulary size, ε = error tolerance)
- For linear classification: O(d/ε) examples (where d = input dimension)
- Extensions to finite context scenarios with practical bounds

This work helps explain why techniques like few-shot prompting, retrieval-augmented generation, and in-context learning work so effectively in practice. It bridges formal computer science theory with empirical observations about modern language models.

While the assumptions are idealized (unbounded computational resources, full dataset access), the results provide mathematical foundations for understanding inference-time adaptation strategies that are increasingly important in AI deployment.

Paper: Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques (2506.08060)

1 reply

·

codelion

authored 2 papers 5 days ago

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24 • 76

Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques

Paper • 2506.08060 • Published 7 days ago • 6

codelion

posted an update 14 days ago

Post

3389

🧠 We just implemented Andrej Karpathy's "third paradigm" for LLM learning!

System Prompt Learning (SPL) enables LLMs to automatically learn problem-solving strategies from experience, rather than relying on static prompts.

🚀 How it works:
Your LLM builds a database of effective strategies, selects the best ones for each problem, and refines them over time based on success rates.

📊 Results across math benchmarks:
Arena Hard: 29% → 37.6% (+8.6%)
AIME24: 23.33% → 30% (+6.67%)
OptILLMBench: 61% → 65% (+4%)

The best part? All strategies are human-readable and the system gets progressively better at problem types you use frequently.

✨ Key benefits:
🔄 Cumulative learning over time
📖 Transparent, inspectable strategies
🔌 Works with any OpenAI-compatible API
⚡ Simple integration: just add "spl-" prefix to your model

Built as an open-source plugin in optillm. After 500 queries, our system developed 129 strategies and refined 97 of them!

This feels like a genuine step toward AI that learns from experience while staying completely interpretable.

🔗 GitHub: https://github.com/codelion/optillm/tree/main/optillm/plugins/spl
📖 Full article: https://huggingface.co/blog/codelion/system-prompt-learning
🐦 Original Karpathy tweet: https://x.com/karpathy/status/1921368644069765486

Have you experimented with advanced system prompting? What strategies would you want your LLM to learn?

codelion

posted an update 19 days ago

Post

2335

Introducing AutoThink: Adaptive reasoning for LLMs that improves performance by 43% on reasoning benchmarks!

Instead of using fixed thinking budgets, AutoThink:
- Classifies query complexity (HIGH/LOW) using adaptive classification
- Dynamically allocates thinking tokens based on complexity
- Uses steering vectors derived from Pivotal Token Search to guide reasoning patterns

Results on DeepSeek-R1-Distill-Qwen-1.5B:
- GPQA-Diamond: 31.06% vs 21.72% baseline (+9.34 points)
- MMLU-Pro: 26.38% vs 25.58% baseline (+0.8 points)
- Uses fewer tokens than baseline approaches

Works with any local reasoning model - DeepSeek, Qwen, Llama, custom models. The technique combines our research on Pivotal Token Search (PTS) implementation and adaptive classification frameworks.

Paper: AutoThink: efficient inference for reasoning LLMs
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327

Code and examples:
https://github.com/codelion/optillm/tree/main/optillm/autothink

PTS implementation and technical details:
https://github.com/codelion/pts
https://huggingface.co/blog/codelion/pts

Adaptive classifier framework:
https://github.com/codelion/adaptive-classifier

Would love to hear your thoughts on adaptive resource allocation for LLM reasoning! Have you experimented with similar approaches?

5 replies

·

codelion

posted an update 26 days ago

Post

2833

🧬 Hey everyone! Just released **OpenEvolve** - an open-source implementation of Google DeepMind's AlphaEvolve system.

It's an evolutionary coding agent that uses LLMs to discover and optimize algorithms. I successfully replicated DeepMind's results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm.

✨ Key features:
- Evolves entire codebases (not just single functions)
- Works with any OpenAI-compatible API
- LLM ensemble approach for better results
- Multi-objective optimization

👉 Check it out:
GitHub: https://github.com/codelion/openevolve
Blog post: https://huggingface.co/blog/codelion/openevolve

Would love to hear your thoughts or answer any questions about it!

codelion

posted an update 28 days ago

Post

2408

Introducing Pivotal Token Search (PTS): A new technique for targeted LLM alignment

Excited to share Pivotal Token Search (PTS), a technique for identifying and optimizing critical decision points in LLM generations!

GitHub repository: https://github.com/codelion/pts

What is PTS?
PTS helps identify specific "pivotal tokens" that dramatically shift the probability of a successful generation. Unlike traditional DPO which treats all tokens equally, PTS focuses optimization on the tokens that actually matter for success.

Inspired by Microsoft's recent Phi-4 paper (which used this technique to achieve SOTA reasoning with only 14B parameters), PTS is especially effective for:
- Mathematical reasoning
- Coding tasks
- Multi-step problem solving
- Any domain where specific decision points strongly impact outcomes

What we're releasing today: codelion/pivotal-token-search-68241145d8b8502122f3ce4f

1. Open-source code:
- Complete implementation of the PTS algorithm
- Data generation pipelines
- Usage examples and documentation

2. Huggingface resources:
- Datasets collection: https://huggingface.co/datasets?other=pts
* Pre-generated preference pairs for various domains
* Ready to use in your DPO training pipelines

- Models collection: https://huggingface.co/models?other=pts
* Pre-trained models fine-tuned with PTS
* Specialized versions for different reasoning tasks

The algorithm is straightforward to implement and can significantly improve your model's reasoning capabilities. Check out the repository for details on getting started!

We welcome feedback, contributions, and collaborations. Let us know if you use PTS in your projects!

codelion

authored 3 papers 9 months ago

codelion

posted an update 10 months ago

Post

2324

We recently worked with OpenAI to fine-tune gpt-4o and built the SOTA model for the patched-codes/static-analysis-eval benchmark. All the code and data patched-codes/synth-vuln-fixes on how we did it is available on their GitHub - https://github.com/openai/build-hours/tree/main/5-4o_fine_tuning.

Here are some tips based on our experience:

→ Establish baseline with "conditioning" / prompting

→ Task-specific datasets are ideal for PEFT; hard to beat gpt-4o on "broad" tasks

→ Add your best system prompt to each example

→ Ensure training data distribution is similar to inference data

→ Shorten instructions with concise prompts; may require more examples.

→ Define clear evaluation metrics (seriously, please eval!)

You can see more details on the benchmark and process here - https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities

codelion

posted an update 11 months ago

Post

2886

A new paper titled "STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis" shows the benefits of integrating static analysis with LLMs. (https://arxiv.org/abs/2406.10018)

Authors evaluate 4 key questions:

- How does each static analysis integration strategy perform in LLM-based repository-level code completion?
> They found that integrating static analysis in the prompting phase (especially with file-level dependencies) can achieve the substantially larger improvements than other phases.

- How do different combinations of integration strategies affect LLM-based repository-level code completion?
> Languages that are easier to analyze like Java show more improvements compared to dynamic languages like Python.

- How do static analysis integration strategies perform when compared or combined with RAG in LLM-based repository-level code completion?
> Static analysis and RAG are complementary and boost the overall accuracy.

- What are the online costs of different integration strategies in LLM-based repository-level code completion?
> Combining prompting-phase static analysis and RAG is the best option for cost-effectiveness.

In my @owasp App Sec keynote last year, I had described how one can do static analysis augmented generation (SaAG) to boost the accuracy of LLM based patches for vulnerability remediation. (you can see the talk here - https://www.youtube.com/watch?v=Cw4-ZnUNVLs)

codelion

posted an update 12 months ago

Post

2246

LLM-Assisted Patching of Polyfill Supply Chain Attack

A recent supply chain attack on polyfill.io affected over 100,000 websites (see https://www.patched.codes/blog/patching-the-polyfill-supply-chain-attack). To address this issue, we show how developers can leverage Large Language Models (LLMs) for efficient vulnerability patching:

1. Automated Detection: Using Semgrep rules (see https://semgrep.dev/playground/r/KxUvD7w/asankhaya_personal_org.polyfill-compromise-copy) to identify vulnerable code.

2. LLM-Powered Patching: Utilizing Patchwork (https://github.com/patched-codes/patchwork), an open-source solution that employs LLMs to automatically fix vulnerabilities.

3. Custom Workflows: The "Fixpolyfill" patchflow (https://github.com/patched-codes/patchwork-configs/tree/main/patchflows/Fixpolyfill) , tailored for this specific attack, can be easily run across multiple repositories.

4. Scalable Solutions: Options to scan and patch entire GitHub/GitLab organizations, with automated pull request generation.

5. Rapid Response: LLM-assisted patching enables swift action to minimize damage from supply chain attacks.

This approach demonstrates how LLMs can be effectively used to quickly respond to and remediate widespread security vulnerabilities in code.

codelion

posted an update 12 months ago

Post

9247

The new Claude Sonnet 3.5 model from Anthropic AI has been getting good reviews on since last night. It is quite good at coding related tasks. We tried it on the Static Analysis Eval benchmark ( patched-codes/static-analysis-eval) which measures the ability of a LLM to fix vulnerabilities. The model scores 59.21% which is good but not better than other frontier models (like GPT-4, Gemini-1.5 and LLama-3).

11 replies

·

codelion

posted an update about 1 year ago

Post

2531

Automatically generate docstrings for your code using LLMs. We just released a new patchflow that can generate docstrings - https://github.com/patched-codes/patchwork/tree/main/patchwork/patchflows/GenerateDocstring

Here is an example PR that does it - https://github.com/codelion/example-java-maven/pull/4

You can check out other patchflows to automate developer chores with patchwork https://github.com/patched-codes/patchwork

codelion

posted an update about 1 year ago

Post

1615

WorkerSafetyQAEval: A new benchmark to evaluate worker safety domain question and answering

Happy to share a new benchmark on question and answers for worker safety domain. The benchmark and leaderboard is available at
codelion/worker-safety-qa-eval

We evaluate popular generic chatbots like ChatGPT and HuggingChat on WorkerSafetyQAEval and compare it with a domain specific RAG bot called Securade.ai Safety Copilot - codelion/safety-copilot It highlights the importance of having domain specific knowledge for critical domains like worker safety that require high accuracy. Securade.ai Safety Copilot achieves ~97% on the benchmark setting a new SOTA.

You can read more about the Safety Copilot on https://securade.ai/blog/how-securade-ai-safety-copilot-transforms-worker-safety.html

codelion

posted an update about 1 year ago

Post

1124

After the announcements yesterday, I got a chance to try the new gemini-1.5-flash model from @goog1e , it is almost as good as gpt-4o on the StaticAnalaysisEval ( patched-codes/static-analysis-eval) It is also a bit faster than gpt-4o and much cheaper.

I did run into a recitation flag with an example in the dataset where the api refused to fix the vulnerability and flagged the input as using copyrighted content. This is something you cannot unset even with the safety filters and seems to be an existing bug https://issuetracker.google.com/issues/331677495

But overall you get gpt-4o level performance for 7% the price, we are thinking of making it default in patchwork - https://github.com/patched-codes/patchwork You can use the google_api_key and model options to choose gemini-1.5-flash-latest to run it with patchwork.

2 replies

·

codelion

posted an update about 1 year ago

Post

1797

The new gpt-4o model seems to a very good coder. OpenAI reported a 90+ score on https://huggingface.co/datasets/openai_humaneval

We tried the new model on our patched-codes/static-analysis-eval which evaluates the model on vulnerability remediation. gpt-4o has reclaimed the top spot on our leaderboard (from meta-llama/Meta-Llama-3-70B-Instruct).

You can now use the new model with our open-source framework PatchWork - https://github.com/patched-codes/patchwork by passing model=gpt-4o on the CLI.

5 replies

·

codelion

posted an update about 1 year ago

Post

1765

Happy to announce the open source framework to turbo charge devops called patchwork - https://github.com/patched-codes/patchwork

You can use it to build patchflows - workflows that use LLMs for software development tasks like bug fixing, pull request review, library migration and documentation.

Supports any LLM of your choice including our own MoE model - patched-codes/patched-mix-4x7B

Give it a try!

2 replies

·

codelion

posted an update about 1 year ago

Post

1820

Meta's new LLama-3 ( meta-llama/Meta-Llama-3-8B-Instruct) is an extremely capable model out of the box for coding related tasks. It is the first model that we have seen that beats GPT-4 on Static-Analysis-Eval - patched-codes/static-analysis-eval.

Lambda Security

AI & ML interests

Recent Activity

lambdasec's activity

Humanity's Last Exam

Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques

Patched MOA: optimizing inference for diverse software development tasks

Patched RTC: evaluating LLMs for diverse software development tasks

Evaluating Pre-trained Language Models for Repairing API Misuses

AI & ML interests

Recent Activity

Team members 1

lambdasec's activity