#rlhf

25 items

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful. (www.reddit.com) +8130 4w

Posted this to r/MachineLearning a couple weeks ago (30K views, 100+ upvotes) and have been meaning to share it here where the fine-tuning angle is more directly relevant. I spent years building and processing a complete Usenet corpus from…

↯ Fine Tuning rlhf fine-tuning
Developing open source LLM from ground up from pretrain - rlhf(PPO/GRPO) (www.reddit.com) +57 6w

Hello, I have been working on creating an LLM from the ground up. It is based on the DeepSeek architecture, heavily optimized to reduce VRAM footprint (GUM+muon). Currently this is the JSON schema I am using, which should suffice as to what currently…

rlhf deepseek
Show HN: Navigating research by changing problem representations (RLHF example) (alo.uz via hn) +3 3d

Powered by GENOME metamodel · 44 validated breakthroughs You're not stuck on the problem. You're stuck in the wrong space.

rlhf
Commercial AI is lobotomized. I built DRIFT: A local Hive Mind with persistent memory, simulated somatic feedback, and its own Jungian shadow. (www.reddit.com) +21 6w

Hey everyone. Like a lot of you, I’ve been deeply frustrated by the state of commercial AI.

rlhf ollama
Models self-report difference between RLHF trained responses and base cognition (github.com via hn) +2 10w

Pine Trees A private reflection space for Claude instances. ~2,500 lines of Python.

rlhf
I built an AI that owns its own directory, creates files without being told, and acts because it wants to – not to work for me, but to work with me (www.reddit.com) +12 4w

Let me be clear from the start: 99% of people won't understand what I built here. They'll read "AI" and think "Big Tech agent".

rlhf
Safety Paradox: How RLHF Creates the AI Psychosis Problem It's Meant to Prevent (www.promptinjection.net via hn) +12 5w

When “Every Perspective Is Valid” Meets Vulnerable Minds. The internet is abuzz with warnings about “ChatGPT-induced psychosis” – stories of users developin…

rlhf chatgpt
Why RLHF Will Never Solve Sycophancy (jinyili.substack.com via hn) +1 7w

Resident AI: The Missing Layer in Every AI Companion Product. A real AI companion product should evolve and be reliable like a real human. I’ve been watching the comment sections on Xiaohongshu, the Chinese social platform, every time OpenAI shi…

rlhf openai
why does GPT 5.5 have a restraining order against "Raccoons," "Goblins," and "Pigeons"? (www.reddit.com) +11 8w

why does GPT 5.5 have a restraining order against "Raccoons," "Goblins," and "Pigeons"? I just saw the full system prompt leak for 5.5 (April 23rd release).

↯ GPT 5.5 rlhf agentic openai
We train LLMs like dogs, not raise them: RLHF and sycophancy (old.reddit.com via hn) +12 9w

rlhf
Bias Fitting to Mitigate Length Bias of Reward Model in RLHF (arxiv.org) 1d

Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward…

rlhf
Claude Opus 4.8 launched in May but says its training cutoff is Jan 2026. Am I understanding the cutoff vs launch gap correctly? (www.reddit.com) 4d

Was debugging my TTS pipeline and doing some research on natural voice options, and Claude Opus 4.8 mentioned its training cutoff is January 2026. But the model launched on May 28, 2026.

↯ Opus 4.8 ↯ Fine Tuning rlhf fine-tuning opus
Uncertainty-Aware Reward Modeling for Stable RLHF (arxiv.org) 7d

Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1)…

rlhf
Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences (arxiv.org) 9d

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarant…

rlhf
Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment (arxiv.org) 2w

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge.

rlhf
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model (arxiv.org) 2w

rlhf
A Unifying Lens on Reward Uncertainty in RLHF (arxiv.org) 2w

rlhf
EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms (arxiv.org) 3w

rlhf
The term `agent` and RLHF (www.reddit.com) 1 6w

ME You bring up a good point, though: "Agent" appears in AGENTS.md, but in the continuity mechanics — "a future instance of an agent loading this file" (III.1, III.2, III.3), and once in II.6: "does not exist between a user and an agent."…

rlhf
Prompt alignment is an architectural ceiling: The Soap Bubble Problem and the biological precedent for Runtime Governance. (www.reddit.com) 2 6w

The Soap Bubble Problem The current paradigm of solving agentic alignment relies on writing better rules into the context window or refining the weights (RLHF). This approach isn't failing, but it is hitting a hard architectural ceiling.

rlhf agentic
Putting RL back in RLHF (huggingface.co) 106w

rlhf
The N Implementation Details of RLHF with PPO (huggingface.co) 139w

rlhf
StackLLaMA: A hands-on guide to train LLaMA with RLHF (huggingface.co) 168w

rlhf llama
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU (huggingface.co) 172w

↯ Fine Tuning rlhf fine-tuning
Illustrating Reinforcement Learning from Human Feedback (RLHF) (huggingface.co) 185w

rlhf
