
Talk:Reinforcement learning from human feedback

Technical

@Moorlock: What's your reasoning in tagging this article with the {{jargon}} template? All of the terms used are already explained in the reinforcement learning article, and I certainly would assume that someone coming to this specialized article either already knows the basics of RL, or if not that they will go to the RL article to learn more. Needless to say, we shouldn't be redefining every technical term from RL in this article, too. PopoDameron ⁠talk 22:13, 29 March 2023 (UTC)[reply]

I'm less convinced that readers of this page should be expected to have read a different page first as a prerequisite. I see the acronym RLHF tossed around enough nowadays that I anticipate people may come to Wikipedia just to find out what it stands for. Terms like "robustness," "exploration," "agent," "reward model," "policy," "reward function" are not necessarily meaningful to people who are not already well-versed in the discipline. Wikipedia is best when it makes some effort to explain such jargon to the general reader. Moorlock (talk) 22:33, 29 March 2023 (UTC)[reply]
@Moorlock: I don't know, I would find it strange to define terms like "agent," "reward model," "policy," and "reward function" in this article when they're just core RL terms. Why would I not assume that they'll follow wikilinks to learn more? I mean, if someone who has no knowledge about RL comes here, then these terms will make no sense until they do understand RL. But that's not the topic of this article, so it shouldn't be expected for it to give a self-contained explanation of basic RL, no? That's what the reinforcement learning article is for.
As a similar example, the ChatGPT article says in its lead that the model is built on top of "families of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques," yet terms like "language models," "transfer learning," "supervised learning," and "reinforcement learning" are never defined in the article. Instead, an interested reader can navigate through wikilinks and learn about them. PopoDameron ⁠talk 23:06, 29 March 2023 (UTC)[reply]
I agree that additional wikilinks to pages or sections that more thoroughly explain some of these terms of art would be helpful and maybe sufficient. Moorlock (talk) 23:17, 29 March 2023 (UTC)[reply]

Early learning from human feedback

I think that this page should include a reference to one of the earliest examples of RLHF, from 2014: <ref> Scheirer, W. J., Anthony, S. E., Nakayama, K., & Cox, D. D. (2014). Perceptual annotation: Measuring human vision to improve computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1679-1686. </ref> To the best of my (admittedly limited) knowledge, this is the earliest example of systematically using human feedback to improve machine learning, and the authors deserve the credit of a citation and concise discussion of this original work. Jj1236 (talk) 13:59, 25 October 2023 (UTC)jj1236[reply]

I agree that background is important, but that particular paper does not appear to be about RL at all. Even if it were and with regards to other early papers, we should be careful not to give undue weight, as "RLHF" is a recently coined term that refers to a rather specific concept and not just any kind of RL based on human feedback. popodameron ⁠talk 03:41, 12 March 2024 (UTC)[reply]

GA Review

This review is transcluded from Talk:Reinforcement learning from human feedback/GA1. The edit link for this section can be used to add comments to the review.

Nominator: PopoDameron (talk · contribs)

Reviewer: Esculenta (talk · contribs) 05:07, 23 March 2024 (UTC)[reply]

Hi, I'll take on this review. Not a SME, but highly interested in this and related topics, so we'll see how this article matches up to the GA criteria. Will have comments here within a few days. Esculenta (talk) 05:07, 23 March 2024 (UTC)[reply]

Hey, sounds good, thanks for reviewing! popodameron ⁠talk 06:48, 23 March 2024 (UTC)[reply]

Ok, here are my thoughts after an initial read-through. I think the article is informative and generally well-written, but there are parts that would be difficult for laypeople to follow. Of course, this is largely unavoidable given the technical nature of the underlying computational science. Most of my comments are suggestions that aim to ease the difficulty of these technical parts. Esculenta (talk) 17:21, 26 March 2024 (UTC)[reply]

Great, thanks, I'll go through those. popodameron ⁠talk 22:49, 26 March 2024 (UTC)[reply]
Esculenta, just finished going through and incorporating your feedback. Please let me know if anything is missing or if there's anything else you'd like to see. popodameron ⁠talk 02:18, 28 March 2024 (UTC)[reply]
Looking good! I'll reread more thoroughly in the next day or 2, but it crossed my mind that the article doesn't mention who first thought up this technique, or when it was first used practically, which seems like it would be an important encyclopaedic addition. More later, Esculenta (talk) 02:29, 28 March 2024 (UTC)[reply]
Thanks! Just added a bit on that on the (now renamed) background & motivation section. popodameron ⁠talk 03:24, 28 March 2024 (UTC)[reply]
I still think there might need to be more said about the early developments of this technique. If I'm a reader who wants to know "who first thought up this cool idea", I think I'd leave with the impression that it was OpenAI ("The algorithm for RLHF as used today was introduced by OpenAI in a paper...") in 2020. But my research seems to contradict this. This 2010 paper describes the TAMER framework "for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals", and cites pubs from 2009-2010. This 2011 paper actually uses "Reinforcement learning from human feedback" in the title, so the idea's been around for at least a decade before its first practical usage in the cited 2020 paper. However, I'm not savvy enough to fully understand how or if these early pubs were necessary stepping stones of understanding along the way, or diversions from algorithms used now. Sorry for proposing more work for you, but I think this "Background" could/should be fleshed out into its own section. The Ziegler et al. (2020) paper also gives some historical RLHF background and earlier sources in its introductory section that could be used. What do you think? Esculenta (talk) 17:28, 30 March 2024 (UTC)[reply]
It's a bit tricky because there's a difference between reinforcement learning from human feedback in general and "RLHF." When someone refers to RLHF today, they are almost definitely referring to the specific algorithm that was indeed first described by OpenAI. Of course, this algorithm was not the first attempt to incorporate human feedback into RL, but "RLHF" usually doesn't refer to that general concept. That's why right now in the background section, I've cited a bunch of older papers that do RL+HF, but without considering them to be the first "instances of RLHF" or anything like that. So, I think I can try to make that a bit clearer in the current background, and maybe I can elaborate a bit more about some of the background methods because, while not exactly RLHF, RLHF was no doubt at least partly inspired by many of them. popodameron ⁠talk 21:43, 30 March 2024 (UTC)[reply]
The Ziegler 2019 paper is as far as I can tell the first paper to introduce RLHF very closely to how it is now, except that their formulation is online instead of offline, which is much more common. Nevertheless, it definitely counts. In their introduction, they mention a lot of papers that try to use human feedback, some using RL and some not, but none of them are very close in terms of the actual method used. popodameron ⁠talk 21:52, 30 March 2024 (UTC)[reply]

Lead

Done. popodameron ⁠talk 23:01, 26 March 2024 (UTC)[reply]
Tried to make the transition less stark by introducing the second paragraph with a short sentence. popodameron ⁠talk 23:01, 26 March 2024 (UTC)[reply]

Motivation

Done. popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]
The story example was not mine, and I'd forgotten that it's unsourced. I replaced it with a sourced example of RLHF's main use case today: reducing "harmful" outputs from LLMs while remaining helpful. If this is not enough, I can also add an example about text summarization (coming from the paper that largely invented RLHF). popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]

Collecting human feedback

I added some brief discussion directed at unfamiliar readers regarding the implications of the MLE convergence. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]
Fixed. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]
The impact of the quality of feedback is already discussed directly in the limitations section. Let me know if this should be moved around or repeated here. I also added a paragraph discussing the impact of quantity. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

Applications

Done. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]
Added more on that. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]
Clarified. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]
Clarified and added detail. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]

Training

I already have a short parenthetical explanation about KL divergence as a measure of distance (let me know if it is not enough), but for the other two, I'm not really sure about explaining what cross-entropy and sigmoid are. Feels a bit too far out of the scope of the article, and any reader who would be interested in reading this section at all would most likely be familiar with these terms. Correct me if you think I'm wrong about that, though. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
Done per below. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original: "In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy." Suggested Explanatory Sentence: "The reward model determines what outcomes are desirable based on human feedback, while the RL policy decides the actions the AI should take to achieve those outcomes."
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "Both models are commonly initialized using a pre-trained autoregressive language model." Suggested Explanatory Sentence: "Starting with a pre-trained model, which already understands language to some extent, speeds up training and improves the model's initial performance."
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head that outputs a number corresponding to the score of any given prompt and response." Suggested Explanatory Sentence: "This process adapts the model to evaluate responses based on the quality standards set by human feedback, scoring them on how well they meet these standards."
Made things a bit simpler. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "This model is trained to minimize the following cross-entropy loss function." Suggested Explanatory Sentence: "Minimizing the cross-entropy loss function helps the model to make predictions that are closer to the actual human ratings, improving its ability to judge responses."
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "The environment randomly presents the policy with prompts from the dataset and expects responses to them." Suggested Explanatory Sentence: "This step simulates real-world scenarios where the AI must understand various prompts and generate appropriate responses, helping it learn from diverse situations."
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "The constant β controls the strength of the second term, which is a per-token KL penalty from the initial unaligned model added to prevent over-optimization of the reward model." Suggested Explanatory Sentence: "By adjusting β, the training can balance learning from new data while retaining useful information from the initial model, avoiding the pitfall of fitting too closely to the training data, which can reduce generalization."
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
  • Original Sentence: "A second term is commonly added to the objective function that allows the policy to incorporate the pre-training gradients." Suggested Explanatory Sentence: "Incorporating pre-training gradients helps the model to not forget its initial language understanding abilities while it learns new tasks based on human feedback."
Merged that into the existing explanation. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]
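For readers less familiar with the notation discussed in this section, here is a minimal numerical sketch of the two objectives mentioned above: the pairwise cross-entropy loss used to train the reward model, and the per-token KL penalty added to the policy's reward. The function names and inputs are illustrative only, not taken from the article or from any particular implementation.

```python
import math


def reward_model_loss(score_preferred, score_rejected):
    """Pairwise cross-entropy loss for one pair of ranked responses.

    The reward model assigns each response a scalar score; the loss is the
    negative log-sigmoid of the score difference, so it shrinks as the
    preferred response is scored higher than the rejected one.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))


def shaped_reward(reward, logprob_policy, logprob_ref, beta):
    """Per-token reward with a KL penalty toward the initial model.

    Subtracting beta * (log pi(token) - log pi_ref(token)) discourages the
    policy from drifting far from the pre-trained (unaligned) model, which
    guards against over-optimizing the learned reward model.
    """
    return reward - beta * (logprob_policy - logprob_ref)
```

When the two scores are equal, the pairwise loss equals log 2 (the model is maximally uncertain), and it decreases toward zero as the gap in favor of the preferred response grows.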

Limitations

Done. popodameron ⁠talk 01:46, 28 March 2024 (UTC)[reply]
Added a little bit of introductory stuff. That specific sentence was weakly sourced, so I took it out. popodameron ⁠talk 01:46, 28 March 2024 (UTC)[reply]
I tried looking through the literature for good discussion on ethical implications, but unfortunately I couldn't find anything really worth including for now. Will keep this in mind for the future in case I do find something like that, though. popodameron ⁠talk 01:46, 28 March 2024 (UTC)[reply]
I added some more discussion about how the model can game the system to the last paragraph. I also added some information to an earlier paragraph about how under-represented groups can be put at a disadvantage the way things work. popodameron ⁠talk 01:58, 28 March 2024 (UTC)[reply]

Alternatives

Added a bit on that, but there isn't much clarity yet as to when each might actually be better. All that's clear right now is that each has their strengths, but deciding which one to use doesn't seem to be possible without just trying both. popodameron ⁠talk 02:17, 28 March 2024 (UTC)[reply]
Added a second source from Nvidia research that does train a model using DPO. popodameron ⁠talk 02:17, 28 March 2024 (UTC)[reply]
Done. popodameron ⁠talk 02:17, 28 March 2024 (UTC)[reply]
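Since DPO comes up in this section, a minimal sketch of its pairwise loss may help readers compare it with RLHF: DPO folds the reward model into the loss and optimizes the policy directly on preference pairs against a frozen reference model. The function name and the numbers below are illustrative assumptions, not taken from the cited sources.

```python
import math


def dpo_loss(lp_preferred, lp_rejected, ref_lp_preferred, ref_lp_rejected, beta=0.1):
    """Direct preference optimization loss for one (preferred, rejected) pair.

    lp_* are the policy's log-probabilities of the two responses;
    ref_lp_* are the frozen reference model's. The loss falls as the
    policy favors the preferred response more strongly than the
    reference model does, scaled by beta.
    """
    margin = beta * ((lp_preferred - ref_lp_preferred) - (lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Unlike the RLHF pipeline, no separate reward model is trained and no RL optimization loop is run; the preference data is consumed by this supervised-style loss directly.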

Final comments

Removed. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Done. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Done. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
These are all pretty good and definitely an improvement to the article. Added with some small edits. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Not sure I agree about this one, considering that, as you mention, Paul Christiano is one of the biggest names in the field, so he almost certainly has more experience to be talking about the limitations of the method than the people who would be peer-reviewing him, at least in my opinion. Would this not work in accordance with WP:EXPERTSPS (Self-published expert sources may be considered reliable when produced by an established subject-matter expert, whose work in the relevant field has previously been published by reliable, independent publications.)? popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Removed. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Done. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]
Looks good to me. popodameron ⁠talk 20:22, 1 April 2024 (UTC)[reply]

Ok, I think that about wraps up what I have to say for this GA nomination. I'll put the review on hold to let you consider and respond to my final comments. Esculenta (talk) 18:44, 1 April 2024 (UTC)[reply]

Esculenta just finished addressing everything. Only point I disagreed about was the Christiano source, but if you believe strongly that it is not a reliable source, I do not mind removing it. Thanks for all your work and effort reviewing! popodameron ⁠talk 20:23, 1 April 2024 (UTC)[reply]
No, I don't mind that source really, and am fine with the WP:EXPERTSPS rationale for keeping it. Thanks for your efforts in writing! Promoting now. Esculenta (talk) 20:39, 1 April 2024 (UTC)[reply]

Did you know nomination

The following is an archived discussion of the DYK nomination of the article below. Please do not modify this page. Subsequent comments should be made on the appropriate discussion page (such as this nomination's talk page, the article's talk page or Wikipedia talk:Did you know), unless there is consensus to re-open the discussion at this page. No further edits should be made to this page.

The result was: promoted by Hilst talk 14:19, 12 April 2024 (UTC)[reply]

Improved to Good Article status by PopoDameron (talk).

Number of QPQs required: 0. Nominator has less than 5 past nominations.

Post-promotion hook changes will be logged on the talk page; consider watching the nomination until the hook appears on the Main Page.

popodameron ⁠talk 00:08, 2 April 2024 (UTC).[reply]

New GA, and hook is interesting and long enough. Source provided (MIT Technology Review) is reliable. No QPQ is needed for now. Article is properly sourced, and Earwig did not return any plagiarism concerns, so everything should be ok. Passing this nomination. Good job! Davest3r08 >:3 (talk) 14:46, 4 April 2024 (UTC)[reply]

Attempt to simplify the first paragraph

It seems to me that the first paragraph of the introduction could be better explained, especially for non-specialists. So I tried to rewrite it. But I'm not sure whether it's better or worse, so I'm posting it here instead:

Original version:

In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent to human preferences. In classical reinforcement learning, the goal of such an agent is to learn a function that guides its behavior called a policy. This function learns to maximize the reward it receives from a separate reward function based on its task performance. However, it is difficult to define explicitly a reward function that approximates human preferences. Therefore, RLHF seeks to train a "reward model" directly from human feedback. The reward model is first trained in a supervised fashion—independently from the policy being optimized—to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model is then used as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization.

Proposed version:

In machine learning, reinforcement learning from human feedback (RLHF) is a technique used to align an intelligent agent with human preferences.

In reinforcement learning, the agent learns how to behave in order to maximize a reward function. However, it is difficult to explicitly define a reward function that approximates human preferences. RLHF first trains a preference model in a supervised fashion, based on how humans rank different AI-generated answers. This model is then used as a reward function to train other models, through an optimization algorithm like proximal policy optimization.
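For what it's worth, the two-stage pipeline that both versions describe can be sketched with a toy example. Everything below (the single-weight "model", the feature scores, the helper names) is an illustrative assumption, not from the article; real RLHF trains large neural networks and optimizes stage 2 with an RL algorithm such as proximal policy optimization.

```python
import math

# Stage 1: fit a preference (reward) model from human rankings.
# Stage 2: use the learned reward to score responses for the policy.


def fit_preference_model(comparisons, features, lr=0.1, steps=200):
    """Learn a scalar reward weight w from (preferred, rejected) pairs.

    A response is scored as w * features[response]; w is fit by gradient
    descent on the pairwise log-sigmoid (cross-entropy) loss, so preferred
    responses end up with higher scores.
    """
    w = 0.0
    for _ in range(steps):
        for good, bad in comparisons:
            delta = features[good] - features[bad]
            sigma = 1.0 / (1.0 + math.exp(-w * delta))
            w += lr * (1.0 - sigma) * delta  # gradient step on -log sigma
    return w


def reward(w, features, response):
    """The learned reward an RL algorithm (e.g. PPO) would then maximize."""
    return w * features[response]
```

A toy run: given human feedback preferring a "helpful" response over a "harmful" one, the fitted model assigns the helpful response the higher reward, which is exactly the signal the policy-optimization stage consumes.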

Feel free to give some honest feedback on what you think of this proposition. Alenoach (talk) 02:32, 21 May 2024 (UTC)[reply]