Reinforcement learning from human feedback


In machine learning, reinforcement learning from human feedback (RLHF), also known as reinforcement learning from human preferences, is a technique to align an intelligent agent with human preferences. In classical reinforcement learning, the goal of such an agent is to learn a function called a policy that maximizes the reward it receives based on how well it performs its task.[1] In the case of human preferences, however, it tends to be difficult to explicitly define a reward function that approximates human preferences. Therefore, RLHF seeks to train a "reward model" directly from human feedback.[2] The reward model is first trained in a supervised fashion, independently from the policy being optimized, to predict whether a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model is then used as a reward function to improve an agent's policy through an optimization algorithm such as proximal policy optimization.[3]
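
To make the reward-modeling step concrete, the sketch below shows a minimal, illustrative reward model trained with a pairwise (Bradley–Terry-style) preference loss in PyTorch. The tiny scoring network, the random stand-in features, and the names RewardModel and preference_loss are assumptions introduced for illustration, not an implementation from the cited papers; in practice the reward model is typically a large language model scoring a full prompt–response pair.

```python
# Minimal sketch, assuming encoded (prompt, response) pairs are available as
# fixed-size feature vectors. Not the exact method of any cited paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: maximize the log-probability that the
    # human-preferred response receives the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training loop on random "features" standing in for encoded
# (prompt, response) pairs labeled by annotators as chosen vs. rejected.
feature_dim = 32
model = RewardModel(feature_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    chosen = torch.randn(16, feature_dim)    # features of preferred responses
    rejected = torch.randn(16, feature_dim)  # features of rejected responses
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, the scalar output of such a model would stand in for the environment reward when the policy is subsequently updated with an optimization algorithm such as proximal policy optimization.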

RLHF can be applied to various domains in machine learning, including natural language processing tasks such as text summarization and conversational agents, computer vision tasks such as text-to-image models, and the development of video game bots. While RLHF is an effective method of training models to act in better accordance with human preferences, it also faces challenges due to the way the human preference data is collected. Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.

High-level overview of reinforcement learning from human feedback

Motivation

Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge.[4][5] For example, for the task of generating a compelling story, while asking humans to generate examples of good and bad stories would be difficult and time-consuming, humans can easily and quickly assess the quality of different AI-generated stories. The goal would then be for the model to use this human feedback to improve its story generation.

There have been various prior attempts at using human feedback to optimize a model's outputs, including through reinforcement learning, but most attempts were either narrow and difficult to generalize, breaking down on more complex tasks,[6][7][8][9] or they faced difficulties learning from a sparse or noisy reward function.[10][11] RLHF was an attempt to create a general algorithm for learning from a practical amount of human feedback.[4][3] RLHF has also been shown to improve the robustness and exploration of RL agents.[12]

References

  1. ^ Russell, Stuart J.; Norvig, Peter (2016). Artificial Intelligence: A Modern Approach (Third, Global ed.). Boston: Pearson. pp. 830–831. ISBN 978-0-13-604259-4.
  2. ^ Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
  3. ^ a b Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
  4. ^ a b "Learning from human preferences". openai.com. Retrieved 4 March 2023.
  5. ^ "Learning through human feedback". www.deepmind.com. 12 June 2017. Retrieved 4 March 2023.
  6. ^ Knox, W. Bradley; Stone, Peter; Breazeal, Cynthia (2013). "Training a Robot via Human Feedback: A Case Study". Social Robotics. Lecture Notes in Computer Science. 8239. Springer International Publishing: 460–470. doi:10.1007/978-3-319-02675-6_46. ISBN 978-3-319-02674-9. Retrieved 26 February 2024.
  7. ^ Akrour, Riad; Schoenauer, Marc; Sebag, Michèle (2012). "APRIL: Active Preference Learning-Based Reinforcement Learning". Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. 7524. Springer: 116–131. arXiv:1208.0984. doi:10.1007/978-3-642-33486-3_8. ISBN 978-3-642-33485-6. Retrieved 26 February 2024.
  8. ^ Wilson, Aaron; Fern, Alan; Tadepalli, Prasad (2012). "A Bayesian Approach for Policy Learning from Trajectory Preference Queries". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc. Retrieved 26 February 2024.
  9. ^ Schoenauer, Marc; Akrour, Riad; Sebag, Michele; Souplet, Jean-Christophe (18 June 2014). "Programming by Feedback". Proceedings of the 31st International Conference on Machine Learning. PMLR: 1503–1511. Retrieved 26 February 2024.
  10. ^ Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.10163. doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
  11. ^ MacGlashan, James; Ho, Mark K.; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
  12. ^ Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862 [cs.CL].