RLHF (Reinforcement Learning from Human Feedback)
Definition
Reinforcement Learning from Human Feedback (RLHF) is a training technique used to align language model behavior with human preferences. After initial pretraining, human raters compare pairs of model outputs and indicate which is better along dimensions such as helpfulness, accuracy, and safety. These preferences train a reward model, which is then used to fine-tune the language model via reinforcement learning — optimizing it to produce outputs that humans rate more highly.
RLHF is the technique behind the transformation of base language models into the instruction-following, safety-conscious assistants deployed in production today (including ChatGPT and Claude). For enterprise buyers, understanding RLHF helps explain both the strengths and limitations of commercial LLMs: the model has been shaped by the preferences of its raters, which may not perfectly align with a specific organization's values, use cases, or customer base. It also informs why some fine-tuning or system prompt configurations are resisted by the model — RLHF-instilled behaviors can be deeply embedded and require significant effort to override safely.
Related Terms
Source
Last updated: May 12, 2026