Talk:Reinforcement learning from human feedback

Technical

@Moorlock: What's your reasoning in tagging this article with the {{jargon}} template? None of the terms used aren't already explained in the reinforcement learning article, and I certainly would assume that someone coming to this specialized article either already knows the basics of RL, or if not that they will go to the RL article to learn more. Needless to say, we shouldn't be redefining every technical term from RL in this article, too. PopoDameron ⁠talk 22:13, 29 March 2023 (UTC)[reply]

I'm less convinced that readers of this page should be expected to have read a different page first as a prerequisite. I see the acronym RLHF tossed around enough nowadays that I anticipate people may come to Wikipedia just to find out what it stands for. Terms like "robustness," "exploration," "agent," "reward model," "policy," "reward function" are not necessarily meaningful to people who are not already well-versed in the discipline. Wikipedia is best when it makes some effort to explain such jargon to the general reader. Moorlock (talk) 22:33, 29 March 2023 (UTC)[reply]

@Moorlock: I don't know, I would find it strange to define terms like "agent," "reward model," "policy," and "reward function" in this article when they're just core RL terms. Why would I not assume that they'll follow wikilinks to learn more? I mean, if someone who has no knowledge about RL comes here, then these terms will make no sense until they do understand RL. But that's not the topic of this article, so it shouldn't be expected for it to give a self-contained explanation of basic RL, no? That's what the reinforcement learning article is for.

As a similar example, the ChatGPT article says in its lead that the model is built on top "families of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques," yet terms like "language models," "transfer learning," "supervised learning," and "reinforcement learning," are never defined in the article. Instead, an interested reader can navigate through wikilinks and learn about them. PopoDameron ⁠talk 23:06, 29 March 2023 (UTC)[reply]

I agree that additional wikilinks to pages or sections that more thoroughly explain some of these terms of art would be helpful and maybe sufficient. Moorlock (talk) 23:17, 29 March 2023 (UTC)[reply]

Early learning from human feedback

I think that this page should include a reference to one of the earliest examples of RLHF, from 2014: <ref> Scheirer, W. J., Anthony, S. E., Nakayama, K., & Cox, D. D. (2014). Perceptual annotation: Measuring human vision to improve computer vision. IEEE transactions on pattern analysis and machine intelligence, 36(8), 1679-1686. <ref> To the best of my (admittedly limited) knowledge, this is the earliest example of systematically using human feedback to improve machine learning, and the authors deserve the credit of a citation and concise discussion of this original work. Jj1236 (talk) 13:59, 25 October 2023 (UTC)jj1236[reply]

I agree that background is important, but that particular paper does not appear to be about RL at all. Even if it were and with regards to other early papers, we should be careful not to give undue weight, as "RLHF" is a recently coined term that refers to a rather specific concept and not just any kind of RL based on human feedback. popodameron ⁠talk 03:41, 12 March 2024 (UTC)[reply]

GA Review

This review is transcluded from Talk:Reinforcement learning from human feedback/GA1. The edit link for this section can be used to add comments to the review.

Nominator: PopoDameron (talk · contribs)

Reviewer: Esculenta (talk · contribs) 05:07, 23 March 2024 (UTC)[reply]

Hi, I'll take on this review. Not a SME, but highly interested in this and related topics, so we'll see how this article matches up to the GA criteria. Will have comments here within a few days. Esculenta (talk) 05:07, 23 March 2024 (UTC)[reply]

Hey, sounds good, thanks for reviewing! popodameron ⁠talk 06:48, 23 March 2024 (UTC)[reply]

Ok, here are my thoughts after an initial read-through. I think the article is informative and generally well-written, but there are parts that would be difficult for laypeople to follow. Of course, this is largely unavoidable given the technical nature of the underlying computational science. Most of my commentary are suggestions that aim to ameliorate the difficulty of these technical parts. Esculenta (talk) 17:21, 26 March 2024 (UTC)[reply]

Great, thanks, I'll go through those. popodameron ⁠talk 22:49, 26 March 2024 (UTC)[reply]

Esculenta, just finished going through and incorporating your feedback. Please let me know if anything is missing or if there's anything else you'd like to see. popodameron ⁠talk 02:18, 28 March 2024 (UTC)[reply]

Looking good! I'll reread more thoroughly in the next day or 2, but it crossed my mind that the article doesn't mention who first thought up this technique, or when it was first used practically, which seems like it would be an important encyclopaedic addition. More later, Esculenta (talk) 02:29, 28 March 2024 (UTC)[reply]

Thanks! Just added a bit on that on the (now renamed) background & motivation section. popodameron ⁠talk 03:24, 28 March 2024 (UTC)[reply]

I still think there might need to be more said about the early developments of this technique. If I'm a reader who wants to know "who first thought up this cool idea", I think I'd leave with the impression that it was OpenAI ("The algorithm for RLHF as used today was introduced by OpenAI in a paper...") in 2020. But my research seems to contradict this. This 2010 paper describes the TAMER framework "for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals", and cites pubs from 2009-2010. This 2011 paper actually uses "Reinforcement learning from human feedback" in the title, so the idea's been around for at least a decade before its first practical usage in the cited 2020 paper. However, I'm not savvy enough to fully understand how or if these early pubs were necessary stepping stones of understanding along the way, or diversions from algorithms used now. Sorry for proposing more work for you, but I think this "Background" could/should be fleshed out into its own section. The Ziegler et al. (2020) paper also gives some historical RLHF background and earlier sources in its introductory section that could be used. What do you think? Esculenta (talk) 17:28, 30 March 2024 (UTC)[reply]

It's a bit tricky because there's a difference between reinforcement learning from human feedback in general and "RLHF." When someone refers to RLHF today, they are almost definitely referring to the specific algorithm that was indeed first described by OpenAI. Of course, this algorithm was not the first attempt to incorporate human feedback into RL, but "RLHF" usually doesn't refer to that general concept. That's why right now in the background section, I've cited a bunch of older papers that do RL+HF, but without considering them to be the first "instances of RLHF" or anything like that. So, I think I can try to make that a bit clearer in the current background, and maybe I can elaborate a bit more about some of the background methods because, while not exactly RLHF, RLHF was no doubt at least partly inspired by many of them. popodameron ⁠talk 21:43, 30 March 2024 (UTC)[reply]

The Ziegler 2019 paper is as far as I can tell the first paper to introduce RLHF very closely to how it is now, except that their formulation is online instead of offline, which is much more common. Nevertheless, it definitely counts. In their introduction, they mention a lot of papers that try to use human feedback, some using RL and some not, but none of them are very close in terms of the actual method used. popodameron ⁠talk 21:52, 30 March 2024 (UTC)[reply]

Lead

in sentence "In classical reinforcement learning, the goal of such an agent is to learn a function called a policy that maximizes the reward it receives based on how well it performs its task." I suggest that it could be split for clarity, e.g., something like "potentially be split for clarity. For example: "In classical reinforcement learning, the goal of an agent is to learn a policy—a function that guides its actions. This policy aims to maximize the agent's reward, which depends on its task performance."

Done. popodameron ⁠talk 23:01, 26 March 2024 (UTC)[reply]

The transition between discussing RLHF's definition and its application areas (in the second paragraph) could be smoother. Consider introducing the application areas with a connecting sentence like "RLHF's adaptability extends to numerous machine learning domains, such as..."

Tried to make the transition less stark by introducing the second paragraph with a short sentence. popodameron ⁠talk 23:01, 26 March 2024 (UTC)[reply]

Motivation

the transition between discussing the general motivation for RLHF and the specifics of previous attempts could be enhanced by a connecting sentence that acknowledges the initial appeal of human-feedback optimization while setting the stage for discussing its complexities and shortcomings in earlier methods, such as: "Despite the clear benefits, prior efforts to integrate human feedback into model optimization have encountered significant challenges."

Done. popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]

for clarity, consider explaining or briefly defining less familiar terms for a general audience, such as "sparse or noisy reward function," "robust optimization," and "exploration in reinforcement learning." While the targeted readers might be familiar with these concepts, a succinct explanation could make the section more accessible.

Done. popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]

The motivation for RLHF is well established, but the section could briefly mention any specific domains or examples where RLHF has shown promise or has been particularly needed, to give readers a clearer context. The "write a compelling story" example is good (but unsourced, so it looks a bit like WP:OR), are there other (preferably sourced) examples?

The story example was not mine, and I'd forgotten that it's unsourced. I replaced it with a sourced example of RLHF's main use case today: reducing "harmful" outputs from LLMs while remaining helpful. If this is not enough, I can also add an example about text summarization (coming from the paper that largely invented RLHF). popodameron ⁠talk 03:52, 27 March 2024 (UTC)[reply]

Collecting human feedback

this section could be enhanced by adding more context or explanations for readers unfamiliar with some of the more technical terms and models.

I added some brief discussion directed at unfamiliar readers regarding the implications of the MLE convergence. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

link Pairwise comparison; regret; explain K-wise comparison; a gloss of Elo rating system might be good

Done. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

"Both offline data collection models, where the model is learning by interacting with a static dataset and updating its policy in batches, as well as offline data collection models," one of these is supposed to be online, no?

Fixed. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

"The fundamental difficulty encountered in RLHF from pairwise comparisons, sometimes referred to as dueling reinforcement learning, lies in the non-Markovian nature of its optimal policies, as the optimal policy is not necessarily memoryless." might be difficult to follow for non-experts; how about a bit more explanation, like "A key challenge encountered in RLHF when employing pairwise (or dueling) comparisons is associated with the non-Markovian character of its optimal policies. Unlike typical scenarios where the optimal strategy does not require memory of past actions, in RLHF, the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent."

Done. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

It would be helpful to discuss the impact of the quality and quantity of feedback on the training process and the final model performance.

The impact of the quality of feedback is already discussed directly in the limitations section. Let me know if this should be moved around or repeated here. I also added a paragraph discussing the impact of quantity. popodameron ⁠talk 05:19, 27 March 2024 (UTC)[reply]

Applications

some of the technical terms (e.g., KL regularization, semi-supervised learning) could be explained in simpler terms or provided with a brief description.

Done. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]

The part about video game bots is interesting but brief; expanding on how human feedback specifically influences the development and performance of these bots could add depth to the discussion. I'm interested to know the mechanics of how a human provides feedback for video game playing.

Added more on that. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]

The sentence "Successful methods noted that the use of KL regularization in RLHF helped to stabilize the training process." is generally clear but could be improved for better clarity and accuracy in a few ways: methods do not "note" things; researchers, studies, or analyses do (i.e. personification); "successful methods" is vague, it would help the reader to specify what kinds of methods or approaches are being referred to. Are these computational methods, algorithmic enhancements, or specific instances of RLHF application?

Clarified. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]

While the sentence states that KL regularization "helped to stabilize the training process," it could provide more detail on how it contributed to stabilization. For example, did it reduce variability, prevent overfitting, or make the model's learning curve smoother?

Clarified and added detail. popodameron ⁠talk 05:57, 27 March 2024 (UTC)[reply]

Training

Clarity in this section could be enhanced for a broader audience. For example, terms like "cross-entropy loss function," "sigmoid function," and "KL divergence" are not explained within the context.

I already have a short parenthetical explanation about KL divergence as a measure of distance (let me know if it is not enough), but for the other two, I'm not really sure about explaining what cross-entropy and sigmoid are. Feels a bit too far out of the scope of the article, and any reader who would be interested in reading this section at all would most likely be familiar with these terms. Correct me if you think I'm wrong about that, though. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]

Definitions or brief explanations of key terms and why they are essential could significantly improve understanding for readers unfamiliar with the subject. Adding some explanatory sentences can make the section more understandable and engaging for readers, especially those not deeply familiar with machine learning terminology and processes.

Done per below. popodameron ⁠talk 22:57, 27 March 2024 (UTC)[reply]

"The human feedback policy is also fine-tuned over the previously trained supervised model." in the spirit of the above comments, might I suggest adding an explanatory sentence to follow, e.g. "This fine-tuning process adapts the pre-existing model (initially trained in a supervised manner) to better align with human feedback by adjusting its parameters based on the rewards derived from human judgments." Other similar suggestions: