The new DeepSite space is really insane for vibe-coders enzostvs/deepsite
With the wave of vibe-coding-optimized LLMs like the latest open-source DeepSeek model (version V3-0324), you can basically prompt out-of-the-box and create any app and game in one-shot.
It feels so powerful to me, no more complex framework or under-the-hood prompt engineering to have a working text-to-app tool.
AI is eating the world and *open-source* AI is eating AI itself!
PS: and even more meta is that the DeepSite app and DeepSeek model are both fully open-source code => time to start recursively improve?
PPS: you still need some inference hosting unless you're running the 600B param model at home, so check the very nice list of HF Inference Providers for this model: deepseek-ai/DeepSeek-V3-0324
It's beating Claude 3.7 on (competitive) programming βa domain Anthropic has been historically really strong atβ and it's getting close to o1-mini/R1 on olympiad level coding with just 7B parameters!
And the best part is that we're open-sourcing all about its training dataset, the new IOI benchmark, and more in our Open-R1 progress report #3: https://huggingface.co/blog/open-r1/update-3
Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.
Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**. (Which everybody does, but people usually don't say)
For a tech report, it makes a lot of sense to report model performance when used optimally! On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)
Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!
Because if your model knows its evals by heart, you're not testing for generalization.
We applied the same data-driven approach that led to SOTA English performance inπ· FineWeb to thousands of languages.
π₯ FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive π ODC-By 1.0 license, and the π» code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a π blogpost walking you through the entire dataset creation process. Stay tuned!
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, Distilllabel and π€οΈLighteval to generate evaluation dataset and evaluate models.
It is an LLM controlled Rogue-Like in which the LLM gets a markdown representation of the map, and should generate a JSON with the objective to fulfill on the map as well as the necessary objects and their placements.