Effectiveness of RLHF at Removing Random Preferences and Rewarding Indifference

Citations

Mysteries of mode collapse — LessWrong

https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties

Secrets of RLHF in Large Language Models Part II: Reward Modeling

https://arxiv.org/html/2401.06080v2

RLHF does not appear to differentially cause mode-collapse — LessWrong

https://www.lesswrong.com/posts/pjesEx526ngE6dnmr/rlhf-does-not-appear-to-differentially-cause-mode-collapse

Reward Learning From Preference With Ties

https://arxiv.org/html/2410.05328v1

Theoretical Tensions in RLHF: Reconciling Empirical Success with ...

https://arxiv.org/html/2506.12350v1

MaxMin-RLHF: Towards Equitable Alignment of Large Language ...

https://arxiv.org/html/2402.08925v1

Indifference and compensatory rewards — AI Alignment Forum

https://www.alignmentforum.org/posts/5bd75cc58225bf0670375363/indifference-and-compensatory-rewards
