Of models, preferences and alignment

Introduction

The alignment of Large Language Models (LLMs) with human preferences and values is a critical area of research in the field of artificial intelligence. Below is a brief comparison of three prominent alignment techniques: Direct Preference Optimization (DPO), Kahneman-Tversky Optimization (KTO), and Identity Preference Optimization (IPO).

Direct Preference Optimization (DPO)

DPO is recognized for its stability, performance, and computational efficiency: it eliminates the need for sampling from the language model during fine-tuning and for extensive hyperparameter tuning. It has been successfully applied to align LLMs such as Hugging Face’s Zephyr and Intel’s Neural Chat. However, DPO is prone to overfitting the preference data, as demonstrated by DeepMind researchers. Nevertheless, DPO has been shown to match the quality of Reinforcement Learning from Human Feedback (RLHF) with a simpler training recipe, and it has become an increasingly popular choice for aligning LLMs.
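To make the idea concrete, here is a minimal PyTorch-style sketch of the DPO objective, assuming per-sequence log-probabilities have already been computed for the policy and a frozen reference model. The function and argument names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective on per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs, one entry per
    (prompt, completion) pair; `beta` controls the implicit KL penalty
    that keeps the policy close to the reference model.
    """
    # Log-ratio of policy to reference for preferred and dispreferred outputs
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Maximize the margin between the two log-ratios under a log-sigmoid
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because the loss depends only on log-probabilities of already-collected completions, no sampling from the model is needed during training, which is where DPO's computational savings over RLHF come from.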

Kahneman-Tversky Optimization (KTO)

KTO aligns LLMs with human feedback by directly maximizing the utility of generations as modeled by Kahneman and Tversky’s prospect theory. It addresses a key limitation of methods such as RLHF and DPO: the high cost of collecting paired preference data. KTO does not require preference pairs, only a binary signal indicating whether an output is desirable or undesirable for a given input. It has been reported to match or outperform DPO and standard supervised fine-tuning. KTO is gaining attention because it mimics human decision-making while relying on feedback that is far cheaper to collect than ranked preference pairs.
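The sketch below shows what a KTO-style objective might look like, assuming per-example sequence log-probabilities and a detached, batch-level estimate of the policy-to-reference KL (used as the prospect-theoretic reference point) are already available. The function name, argument names, and default loss-aversion weights are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of a KTO-style objective (names are illustrative).

    policy_logps / ref_logps: per-example sequence log-probs.
    is_desirable: boolean tensor marking outputs labeled good vs. bad.
    kl_estimate: detached batch-level estimate of KL(policy || reference),
    serving as the reference point from prospect theory.
    """
    log_ratio = policy_logps - ref_logps

    # Prospect-theoretic value function: gains and losses are measured
    # relative to the reference point and weighted asymmetrically.
    desirable_value = lambda_d * torch.sigmoid(beta * (log_ratio - kl_estimate))
    undesirable_value = lambda_u * torch.sigmoid(beta * (kl_estimate - log_ratio))

    value = torch.where(is_desirable, desirable_value, undesirable_value)
    weight = torch.where(is_desirable,
                         torch.full_like(value, lambda_d),
                         torch.full_like(value, lambda_u))
    # Minimizing (weight - value) pushes desirable outputs above the
    # reference point and undesirable outputs below it.
    return (weight - value).mean()
```

Note that each example needs only a single desirable/undesirable label, not a matched pair of completions, which is the practical advantage KTO claims over pairwise methods.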

Identity Preference Optimization (IPO)

IPO uses a simpler learning objective that fits pairwise preferences directly, rather than mapping them through logit (Bradley-Terry) preferences or Elo-style scores. Its built-in regularization bounds the preference margin, so it is less prone to overfitting than DPO and can be trained to convergence without tricks such as early stopping. IPO's direct approach to learning preferences makes it a strong candidate for aligning LLMs, although results can be sensitive to the choice of the regularization coefficient that controls how closely the policy stays to the reference model.
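For comparison with the DPO sketch above, here is a minimal sketch of an IPO-style squared-error objective, again assuming precomputed sequence log-probabilities; the function name, argument names, and the tau parameterization are illustrative assumptions.

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """Sketch of an IPO-style objective (names are illustrative).

    Instead of pushing the preference margin through a log-sigmoid as DPO
    does, IPO regresses the log-ratio gap toward a fixed target of
    1 / (2 * tau), which bounds the margin and limits overfitting to the
    preference data.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratio - rejected_logratio

    # Squared regression toward the target margin set by the regularizer tau
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

Because the target margin is finite, the policy cannot drift arbitrarily far from the reference model even on "easy" preference pairs, which is the mechanism behind IPO's resistance to overfitting; it also makes the behavior sensitive to how tau is set.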

Conclusion

When comparing these techniques, the main considerations are performance, stability, and computational requirements. DPO is noted for stable convergence and strong performance on conversational and other downstream tasks. KTO leverages prospect theory to align with human feedback using only binary desirability labels and has been reported to deliver a meaningful performance boost. IPO provides a simpler, less overfitting-prone alternative to DPO that learns pairwise preferences directly.

Each alignment technique has its strengths and potential drawbacks. DPO is stable and computationally lightweight but may overfit. KTO is effective in aligning with human feedback and does not require extensive preference data. IPO offers a simpler objective and is less prone to overfitting but may be sensitive to parameter tuning. The choice of technique may depend on the specific requirements of the LLM and the desired outcomes of the alignment process.

References:

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)

Better, Cheaper, Faster LLM Alignment with KTO (Contextual AI)

A General Theoretical Paradigm to Understand Learning from Human Preferences (arXiv:2310.12036)

Fine-tune Better Chat Models with Distilled Identity Preference Optimization (IPO) (Substack)

Fine-Tuning Language Models Using Direct Preference Optimization (Cerebras)