Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up

All HF Hub posts

anakin87ย 
posted an update 2 days ago
view post
Post
10203
How LLM training with RL Environments works?

It all starts with ๐—ฅ๐—ฒ๐—ถ๐—ป๐—ณ๐—ผ๐—ฟ๐—ฐ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐˜„๐—ถ๐˜๐—ต ๐—ฉ๐—ฒ๐—ฟ๐—ถ๐—ณ๐—ถ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ฅ๐—ฒ๐˜„๐—ฎ๐—ฟ๐—ฑ๐˜€
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training


In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)

Consider a more complex tic-tac-toe env โŒโญ•
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions

(envs can also include tools)

---

What happens at training?

We use ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ ๐—ฅ๐—ฒ๐—น๐—ฎ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฃ๐—ผ๐—น๐—ถ๐—ฐ๐˜† ๐—ข๐—ฝ๐˜๐—ถ๐—บ๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป with a tic-tac-toe env

No critic model needed, the group is the baseline
Simpler than PPO

1๏ธโƒฃ Rollout generation: from the same board, model plays N games via sampling
2๏ธโƒฃ Each game scored with deterministic rewards (win, format, ...)
3๏ธโƒฃ Mean score computed across the group
4๏ธโƒฃ Each rollout's advantage = its score minus the group mean
5๏ธโƒฃ Model updated to favor trajectories above baseline

๐Ÿ” Repeat


For a deep dive, check out
๐ŸŒฑ https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs
  • 2 replies
ยท
dealermatt72ย 
posted an update 2 days ago
view post
Post
6509
Hey Hugging Face community ๐Ÿ‘‹

My name is M. I'm a solo founder and self-taught developer based in Houston, TX. I build AI-powered apps โ€” I have an iOS app called DeFilter currently in App Store review, a security scanning platform called Sentinel, and a job marketplace called HireHuman.fyi for connecting humans with companies that prefer non-AI workers.

I'm also a poker dealer by night, which means I think a lot about reading situations in real time โ€” and that's exactly what sparked this idea.

I'm not the most technical person in the room. But I have a vision, I have drive, and I believe the best projects get built when people with different skills come together around a shared idea.

That's why I'm posting here. I want to build this with the community.

โ€” M (@dealermatt )

  • 3 replies
ยท
SeanLee97ย 
posted an update about 16 hours ago
view post
Post
2178
Our lab recently released a paper where we introduce ShadowPEFT, a new Parameter-Efficient Fine-Tuning (PEFT) paradigm tailored for edge computing scenarios.

Unlike traditional approaches such as LoRA and its variants, which inject trainable parameters directly into the weights of Transformer, requiring tight coupling with the backbone.

ShadowPEFT instead enhances the frozen large base model by adding a lightweight, centralized, pretrainable, and detachable Shadow network.
This shadow network operates in parallel with the base model, delivering learned corrections to each decoder layer. Because the shadow module is architecturally decoupled from the backbone, it can be independently trained, stored, and deployed, benefiting edge computing scenarios and edge-cloud collaboration computing.

- HF Paper: ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning (2604.19254)
- GitHub: https://github.com/ShadowLLM/shadow-peft
- HF Collection: https://huggingface.co/collections/shadow-llm/shadow-peft-models
  • 5 replies
ยท
Ujjwal-Tyagiย 
posted an update 1 day ago
view post
Post
395
We are hiring at Shirova AI. We need AI researchers and engineers to work in our research lab. Shirova AI is a research lab in India, so we can help our researchers move to nearby workspaces or let them work from home without ever coming to the lab. We're building our founding team, so the pay will be good. You can learn, so don't hesitate to mail us at: careers@shirova.com
consome2ย 
posted an update 2 days ago
view post
Post
3156
Built a small site for tracking speech-to-speech, full-duplex, and audio foundation model work.
It covers models, benchmarks, datasets, and some blog posts to organize the landscape in one place.

Still early, but sharing in case it is useful:
https://www.fullduplex.ai/

If you spot missing entries or mistakes, I would really appreciate corrections.
  • 2 replies
ยท
sequelboxย 
posted an update 3 days ago
view post
Post
1757
NEW RELEASE: Esper 3.1 for Qwen 3.6!

- Your dedicated DevOps expert: Esper 3.1 maximizes DevOps and architecture helpfulness, powered by high-difficulty DevOps and architecture data generated with DeepSeek-V3.1-Terminus!
- Improved coding performance: challenging code-reasoning datasets stretch DeepSeek-V3.1-Terminus and DeepSeek-V3.2 to the limits, allowing Esper 3.1 to tackle harder coding tasks!
- AI to build AI: our high-difficulty AI expertise data boosts Esper 3.1's MLOps, AI architecture, AI research, and general reasoning skills.

Get it now: ValiantLabs/Qwen3.6-35B-A3B-Esper3.1

We're working on more finetunes for the newest Qwen and Gemma models, and we've also started working on the agentic-first datasets for Esper 4 :) we're going to make open source better and better for your work!

Please note that real life financial and family concerns have popped up and have imposed unfortunate limitations on our ability to devote time to our open-source work :( If you would like to see Esper 4 and our other releases speed up instead of slowing down, this is the best way you can help us: sequelbox/SupportOpenSource

No matter what, we'll keep fighting and we won't give up!

with love,
allegra
ajibawa-2023ย 
posted an update 3 days ago
view post
Post
1116
Ruby-Code-Large
Dataset : ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamic programming, object-oriented design, and rapid application developmentโ€”areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
eaddarioย 
posted an update 3 days ago
view post
Post
157
Experimental global target bitsโ€‘perโ€‘weight quantization of google/gemma-4-E2B-it, google/gemma-4-E4B-it and google/gemma-4-26B-A4B-it

Unlike standard llama.cpp quantizations that rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters the most, and produces high quality models that meet a precise global file size target.

Key Advantages:
- VRAM Maximization: Can generate high quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB VRAM).
- Data-Driven Precision: Quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD size trade-offs.

Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology in the models' cards

eaddario/gemma-4-E2B-it-GGUF
eaddario/gemma-4-E4B-it-GGUF
eaddario/gemma-4-26B-A4B-it-GGUF
SeaWolf-AIย 
posted an update 12 days ago
view post
Post
2989
Why This Matters โ€” David Defeats Goliath

MODEL: FINAL-Bench/Darwin-4B-David
SPACE: FINAL-Bench/Darwin-4B-david

We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it achieves 85.0% on GPQA Diamond โ€” surpassing its 58.6% original ancestor and even gemma-4-31B (84.3%) โ€” with just 4.5B parameters.

Second-Generation Evolution
Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) was already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation โ€” a Gen-1 model. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets with thinking mode by default. Crossbreeding these two produced the first Gen-2 Darwin model.

Darwin V6's Model MRI scanned both parents across all 42 layers, assigning independent optimal ratios per layer. The Mother's creativity and Korean language hotspot (Layer 22-25, weight 0.95) was maximally absorbed, while the Father's reasoning core (Layer 30-40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively โ€” evolution of evolution.

Benchmarks
Darwin-4B-David scores 85.0% on GPQA Diamond (+26.4%p over original 58.6%), evaluated generatively with maj@8 (8 generations per question, majority vote), Epoch AI prompt format, thinking mode enabled, 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), both score 64.93% โ€” expected, as loglikelihood doesn't capture thinking-mode reasoning differences.

Why This Matters
gemma-4-31B (30.7B) scores 84.3%. Darwin-4B-David surpasses it at 1/7th the size โ€” no training, no RL, just 45 minutes of MRI-guided DARE-TIES on one H100. The name "David" honors Mother creator DavidAU and evokes David vs. Goliath.
tomaarsenย 
posted an update 13 days ago
view post
Post
548
๐ŸŒ I've just published Sentence Transformers v5.4 to make the project fully multimodal for embeddings and reranking. The release also includes a modular CrossEncoder, and automatic Flash Attention 2 input flattening. Details:

You can now use SentenceTransformer and CrossEncoder with text, images, audio, and video, with the same familiar API. That means you can compute embeddings for an image and a text query using model.encode(), compare them with model.similarity(), and it just works. Models like Qwen3-VL-Embedding-2B and jinaai/jina-reranker-m0 are supported out of the box.

Beyond multimodal, I also fully modularized the CrossEncoder class. It's now a torch.nn.Sequential of composable modules, just like SentenceTransformer has been. This unlocked support for generative rerankers (CausalLM-based models like mxbai-rerank-v2 and the Qwen3 rerankers) via a new LogitScore module, which wasn't possible before without custom code.

Also, Flash Attention 2 now automatically skips padding for text-only inputs. If your batch has a mix of short and long texts, this gives you a nice speedup and lower VRAM usage for free.

I wrote a blog post walking through the multimodal features with practical examples. Check it out if you want to get started, or just point your Agent to the URL: https://huggingface.co/blog/multimodal-sentence-transformers

This release has set up the groundwork for more easily introducing late-interaction models (both text-only and multimodal) into Sentence Transformers in the next major release. I'm looking forward to it!