Trained a Qwen2.5-0.5B-Instruct bf16 model on a Reddit post summarization task with GRPO written from scratch in PyTorch - updates!

reddit-localllama · www.reddit.com · 3 pts · 2 replies · 2d

So, yesterday's run was a success: I got an average rollout length of about 64 tokens, as shown in the attached image! This was with quality_reward + length_penalty (more info below!). Next, I'll be going with length penalty as the reward and wit…

vllm
