Models self-report difference between RLHF-trained responses and base cognition (github.com via hn) 2 pts · 4d rlhf