Large-scale self-rewarding: Meta lets Llama 2 fine-tune itself, surpassing GPT-4's performance.


Source: Synced


Will Artificial Intelligence Feedback (AIF) replace RLHF?

In the field of large models, fine-tuning is an important step in improving model performance. As open-source large models have proliferated, a variety of fine-tuning methods have emerged, some of which have achieved good results.

Recently, researchers from Meta and New York University used a "self-rewarding" approach to let a large model generate its own fine-tuning data, a result that caught many people by surprise.

In the new method, the authors fine-tuned Llama 2 70B through three iterations, and the generated model outperformed a number of existing important large models on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4.


As a result, the paper attracted attention shortly after it was posted on arXiv.

Although the code has not been open-sourced yet, many believe the method described in the paper should be easy to reproduce.


It is well known that tuning large language models (LLMs) with human preference data can greatly improve the instruction-following performance of pre-trained models. With the GPT series, OpenAI established the now-standard approach of reinforcement learning from human feedback (RLHF): a reward model is learned from human preferences, then frozen and used to train the LLM with reinforcement learning. This approach has been hugely successful.

A more recent alternative avoids training a reward model altogether and uses human preferences to train the LLM directly, as in Direct Preference Optimization (DPO). In both cases, fine-tuning is bottlenecked by the size and quality of the human preference data, and in the case of RLHF it is additionally bottlenecked by the quality of the frozen reward model trained on that data.
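
For reference, DPO (Rafailov et al., 2023) skips the explicit reward model and optimizes the policy directly on preference pairs, where each pair consists of a prompt x, a chosen response y_w, and a rejected response y_l, measured against a frozen reference policy:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) is the frozen reference (typically the supervised fine-tuned model), \(\sigma\) is the logistic function, and \(\beta\) controls how far the policy may drift from the reference.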

To avoid this bottleneck, in Meta's new work the authors propose training a self-improving reward model that is not frozen but is continuously updated during LLM fine-tuning.

The key to this method is to build a single agent that possesses all of the abilities needed during training, rather than splitting them across separate reward and language models. Just as the instruction-following tasks used in pre-training and multi-task training enable task transfer by training on many tasks at once, folding reward modeling into the same model lets the two abilities transfer to each other.

The authors therefore introduce Self-Rewarding Language Models, in which a single agent acts both as an instruction-following model that generates responses to given prompts and as a judge that generates and evaluates new instruction-following examples to add to its own training set.

The new method trains these models with a framework similar to iterative DPO. Starting from a seed model, each iteration runs a self-instruction creation process, as shown in Figure 1: the model generates candidate responses to newly created prompts, and the same model then assigns rewards to them. The reward assignment is done through LLM-as-a-Judge prompting, which can itself be seen as an instruction-following task. A preference dataset is built from the generated data, and the model for the next iteration is trained with DPO.
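
To make this loop concrete, below is a minimal Python sketch of one self-rewarding iteration. It is an illustration rather than the paper's actual implementation: `model` is assumed to be any text-in/text-out callable, and the helper names, prompt wording, and candidate count are stand-ins.

```python
import random

def generate_new_prompts(model, seed_prompts, k=8):
    """Self-instruction creation: ask the model for new prompts in the
    style of a few human-written seed examples (few-shot prompting)."""
    examples = "\n".join(random.sample(seed_prompts, min(3, len(seed_prompts))))
    reply = model(
        f"Here are some example tasks:\n{examples}\n"
        f"Write {k} new tasks in a similar style, one per line."
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def judge_score(model, prompt, response):
    """Self-rewarding: the same model rates its own candidate response (0-5)."""
    reply = model(
        "Rate the following response to the question on a scale of 0 to 5. "
        "Answer with the number only.\n"
        f"Question: {prompt}\nResponse: {response}\nScore:"
    )
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0

def dpo_train(model, preference_pairs):
    """Placeholder: run Direct Preference Optimization on the preference
    pairs and return the next-iteration model."""
    return model

def self_rewarding_iteration(model, seed_prompts, n_candidates=4):
    """One iteration: create new instructions, sample candidate responses,
    self-score them, build preference pairs, and train the next model."""
    preference_pairs = []
    for prompt in generate_new_prompts(model, seed_prompts):
        candidates = [model(prompt) for _ in range(n_candidates)]
        scored = sorted((judge_score(model, prompt, c), c) for c in candidates)
        (low, rejected), (high, chosen) = scored[0], scored[-1]
        if high > low:  # a tie carries no preference signal
            preference_pairs.append((prompt, chosen, rejected))
    return dpo_train(model, preference_pairs)
```

In the paper this scheme is applied to Llama 2 70B, and the judging itself is done with a more detailed LLM-as-a-Judge prompt (a sketch of that step appears further below).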


  • Paper Title: Self-Rewarding Language Models
  • Paper Link: https://arxiv.org/abs/2401.10020

Self-Rewarding Language Models

The proposed method first assumes access to a basic pre-trained language model and a small amount of human-annotated seed data, and then builds a model intended to possess two skills simultaneously:

  1. Instruction following: Given prompts describing user requests, it can generate high-quality, helpful (and harmless) responses.

  2. Self-instruction creation: It can generate and evaluate new instruction-following examples to add to its own training set.

These skills are used so that the model can perform self-alignment, i.e., they are the components it uses to iteratively train itself with Artificial Intelligence Feedback (AIF).

Self-instruction creation involves generating candidate responses and then letting the model itself judge their quality; that is, the model acts as its own reward model, removing the need for an external one. This is implemented through the LLM-as-a-Judge mechanism [Zheng et al., 2023b], which frames response evaluation as an instruction-following task. The self-created AIF preference data is then used as the training set.
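
As an illustration of what such a judging step could look like, here is a hedged sketch. The prompt wording and the additive 5-point scale below only paraphrase the spirit of the paper's LLM-as-a-Judge prompt, and `complete` stands in for whatever generation interface is actually used.

```python
import re

JUDGE_TEMPLATE = """Review the user's question and the candidate response below.
Award up to 5 points additively, one point each for: relevance, substantial
coverage, a helpful answer, clear organization, and expert-level quality.
End your review with a final line of the form "Score: <0-5>".

Question: {question}

Response: {response}
"""

def judge(complete, question, response, n_samples=3):
    """Score a response with the model itself acting as judge.

    `complete` is any callable complete(prompt) -> text. The score is
    averaged over several sampled judgements to reduce variance; the
    sample count here is illustrative.
    """
    scores = []
    for _ in range(n_samples):
        reply = complete(JUDGE_TEMPLATE.format(question=question, response=response))
        match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
        if match:
            scores.append(float(match.group(1)))
    return sum(scores) / len(scores) if scores else 0.0
```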

In the fine-tuning process, the same model is therefore used in two roles: as a "learner" and as a "judge." Thanks to the newly added judge role, the model can keep improving its own performance through further rounds of fine-tuning.

The overall self-alignment procedure is iterative: it builds a series of models, each an improvement over the previous one. Importantly, because the model both improves its generation ability and acts as its own reward model through the same generation mechanism, the reward model itself can improve through these iterations, unlike the standard practice of keeping the reward model fixed.
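
Put together, the iteration chain can be pictured with a toy driver like the one below. Every function here is a stand-in stub; in the paper, the real pipeline fine-tunes Llama 2 70B with supervised fine-tuning on the seed data and then runs DPO on self-generated preference pairs.

```python
def sft_train(model, seed_data):
    """Stub: supervised fine-tuning on the human-written seed data."""
    return model

def build_preference_pairs(model):
    """Stub: self-instruction creation plus LLM-as-a-Judge scoring."""
    return []

def dpo_train(model, preference_pairs):
    """Stub: Direct Preference Optimization on the preference pairs."""
    return model

base_model = object()                           # pre-trained base model
m1 = sft_train(base_model, seed_data=[])        # iteration 1: seed model
m2 = dpo_train(m1, build_preference_pairs(m1))  # iteration 2
m3 = dpo_train(m2, build_preference_pairs(m2))  # iteration 3
```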

The researchers believe this approach raises the potential ceiling for future self-improvement of such models, removing a restrictive bottleneck.

Figure 1 shows an overview of this method.


Experiments

In the experiments, the researchers used Llama 2 70B as the base pre-trained model. They found that compared to the baseline seed model, self-rewarding LLM alignment not only improved instruction-following performance, but also improved reward modeling capability.

This means that during iterative training, the model can provide itself with a higher-quality preference dataset in each iteration than in the previous one. Although this effect is likely to saturate in practice, it raises an interesting possibility: the reward model (and the LLM) trained this way can end up better than models trained only on the original human-written seed data.

In terms of instruction-following ability, the experimental results are shown in Figure 3:

[Figure 3: instruction-following evaluation results]

The researchers evaluated the self-rewarding models on the AlpacaEval 2.0 leaderboard, with the results shown in Table 1. They observed the same trend as in the head-to-head evaluation: the win rate against GPT-4 Turbo increases with each training iteration, from 9.94% in iteration 1 to 15.38% in iteration 2 and 20.44% in iteration 3. In addition, the iteration-3 model outperformed many existing models, including Claude 2, Gemini Pro, and GPT-4 0613.


The results of the reward modeling evaluation are shown in Table 2, and the conclusions include:

  • Adding EFT (evaluation fine-tuning) data improves on the SFT baseline. Training with IFT (instruction fine-tuning) data plus EFT data improves all five measured metrics compared with IFT alone; for example, pairwise-accuracy agreement with humans rises from 65.1% to 78.7%.
  • Reward modeling capability improves through self-training. After a round of self-reward training, the model becomes better at assigning rewards to itself for the next iteration, and its instruction-following ability also improves.
  • The LLM-as-a-Judge prompt matters. With the SFT baseline, the researchers found that their LLM-as-a-Judge prompt yielded higher pairwise accuracy than other prompt formats they tried.

The authors believe that the self-rewarding training method has improved both the model's instruction-following ability and its reward modeling capability during iterations.

While this is only preliminary research, it seems to be an exciting research direction, as this type of model can better allocate rewards in future iterations to improve instruction-following, creating a virtuous cycle.

This method also opens up possibilities for more complex judgment methods. For example, large models can verify the accuracy of their answers by searching databases, resulting in more accurate and reliable outputs.

Reference: Reddit - Self-Rewarding Language Models (Meta, 2024)
