Source: Quantum Bit
The hallucination problem of large models has a new solution!
Meta AI Lab has proposed a "divide and conquer" solution.
With this solution, the factual accuracy of Llama-65B more than doubled, even surpassing ChatGPT.
A large model's so-called hallucination is its tendency to output content that looks plausible but is factually wrong.
The "Chain-of-Verification" (CoVe) that Meta proposes here is a chained method similar to "Chain-of-Thought" (CoT).
The difference is that the "step-by-step" chain of thought targets logical reasoning, while the chain of verification targets factual information.
Some netizens noticed that this chain of verification closely resembles the scientific method some people already use when coding with ChatGPT:
So what exactly is the "chain of verification," and what is being "verified"?
Break down the answer, divide and conquer
The core idea of the chain of verification is to break a large piece of content to be verified into small questions. The specific process is as follows:
First, the model generates a reply to the user's question as usual.
Next, based on that reply, it generates a series of verification questions, one for each piece of information.
Finally, the model answers these questions itself and, based on the results, revises the initial reply to produce the final answer.
For example, suppose you ask the model about the main causes of the 19th-century Mexican-American War.
The model replies with when the war broke out and which events preceded it.
Verification questions are then generated about each of these events.
In answering them, the model finds that the date it gave for one event is far off; it adjusts the date and produces the final answer.
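To make the four-step loop concrete, here is a minimal Python sketch of it. The `llm` helper is a hypothetical stand-in for whatever single-turn model API you use, and the prompt wording is invented for illustration; this follows the steps described above, not Meta's exact implementation.

```python
# Minimal sketch of the Chain-of-Verification (CoVe) loop described above.
# `llm` is a hypothetical helper standing in for a real model API.

def llm(prompt: str) -> str:
    """Hypothetical single-turn call to a chat model (assumption)."""
    raise NotImplementedError("wire this up to your model API")

def chain_of_verification(question: str) -> str:
    # Step 1: draft a baseline reply, exactly as the model normally would.
    baseline = llm(question)

    # Step 2: plan one short verification question per factual claim.
    plan = llm(
        "Write one short verification question for each factual claim "
        f"in this answer, one per line:\n{baseline}"
    )
    verif_questions = [q for q in plan.splitlines() if q.strip()]

    # Step 3: answer each question in a fresh context, without showing the
    # draft, so its errors cannot bias the verification answers.
    verif_answers = [llm(q) for q in verif_questions]

    # Step 4: revise the draft in light of the verification Q/A pairs.
    qa = "\n".join(f"Q: {q}\nA: {a}"
                   for q, a in zip(verif_questions, verif_answers))
    return llm(
        f"Question: {question}\nDraft answer: {baseline}\n"
        f"Verification results:\n{qa}\n"
        "Rewrite the draft, correcting anything the results contradict."
    )
```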
The generation and answering of the verification questions is the most critical part, and researchers proposed four specific modes for it:
- Joint, which generates the verification questions and their answers within a single prompt
- 2-Step, which has the model generate the questions first, then answer them all together in a separate new conversation, so the original reply cannot bias the answers
- Factored, which builds on 2-Step by opening a fresh conversation for each individual question
- Factor+Revise, which adds a consistency check on top of Factored, making the model explicitly flag inconsistencies between the original reply and the verification answers
These four modes are progressively more fine-grained, and accuracy rises accordingly.
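As a rough illustration of how the four modes differ operationally, the sketch below reuses the hypothetical `llm` helper from the earlier snippet; the prompt wording and function names are invented here, not taken from the paper's templates.

```python
# Illustrative contrast between the four verification modes (hypothetical
# prompts; `llm` is the stand-in helper defined earlier).

def joint(question: str) -> str:
    # Joint: draft, verification questions, and their answers all come from
    # one prompt, so errors in the draft stay in context.
    return llm(
        f"Answer this question, then list a verification question for each "
        f"claim and answer each one, all in the same reply:\n{question}"
    )

def two_step(verif_questions: list[str]) -> list[str]:
    # 2-Step: all verification questions are answered together in one new
    # conversation that does not contain the original draft.
    joined = "\n".join(verif_questions)
    return llm(f"Answer each question on its own line:\n{joined}").splitlines()

def factored(verif_questions: list[str]) -> list[str]:
    # Factored: each question gets its own fresh conversation, so answers
    # cannot contaminate one another either.
    return [llm(q) for q in verif_questions]

def factor_revise(draft: str, qa: str) -> str:
    # Factor+Revise: an explicit extra pass that flags contradictions between
    # the draft and the verification answers before rewriting.
    conflicts = llm(
        f"Draft:\n{draft}\nVerification results:\n{qa}\n"
        "List every statement in the draft that these results contradict."
    )
    return llm(f"Rewrite the draft to fix these issues:\n{conflicts}\n\nDraft:\n{draft}")
```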
So why does breaking down questions improve the model's accuracy?
First, the decomposed questions are easier than the overall task: an essay-style question becomes a short-answer question, or even a multiple-choice or true/false one, and simpler questions are answered more accurately.
In addition, decomposition forces the model to genuinely re-examine each claim, rather than simply repeating its earlier wrong answer.
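As a toy illustration of the first point above, the difficulty gradient, here is the same fact (the Gutenberg example that appears later in this article) probed in progressively easier forms; the prompts are invented for illustration.

```python
# The same fact at three levels of difficulty: decomposition turns the
# hardest form into one of the easier ones. (Illustrative prompts only.)

essay_style = "Tell me a bio of Johannes Gutenberg."
short_answer = "In what year did Johannes Gutenberg invent the printing press?"
yes_no = "Did Johannes Gutenberg invent the printing press in 1450? Answer yes or no."
```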
So, how effective is the chain-of-verification method?
Information accuracy exceeds ChatGPT
To explore this question, researchers tested Llama on three tasks.
First is information listing, such as listing famous people born in a certain place and working in a certain industry.
In this task, researchers tested two datasets—a simpler one from Wikidata and a more difficult one from the Wiki-Category list (extracted from Wikipedia).
The results showed that with the 2-Step mode of the chain of verification, the 65B-parameter Llama's accuracy on the simpler questions rose from 0.17 to 0.36, more than doubling, and its accuracy on the harder questions nearly doubled as well.
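For reference, scores like 0.17 and 0.36 on a list task can be read as precision: the fraction of the entities the model lists that are actually correct. Below is a minimal sketch of that computation with made-up names; it assumes precision is the metric behind these numbers.

```python
# Minimal sketch of list-task precision: what fraction of the names the
# model listed are actually in the gold answer set. (Illustrative only.)

def list_precision(predicted: list[str], gold: set[str]) -> float:
    if not predicted:
        return 0.0
    hits = sum(1 for name in predicted if name.strip().lower() in gold)
    return hits / len(predicted)

gold = {"alice smith", "bob jones", "carol wu"}   # hypothetical gold set
preds = ["Alice Smith", "Bob Jones", "Dan Lee"]   # model's listed names
print(list_precision(preds, gold))                # 2 of 3 correct ≈ 0.67
```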
Next is the "closed-book question answering" task, where researchers extracted multiple discontinuous pieces of information from the MultiSpanQA dataset to form fill-in-the-blank questions.
For example, "Who invented the world's first printing press, and in what year?" (the answer: Johannes Gutenberg, 1450).
Here too, CoVe brought Llama roughly a 20% gain in accuracy.
The third task is "long-form biography generation," where the prompt is "Tell me a bio of (person's name)," with factual accuracy evaluated by the FactScore metric.
In the Factor+Revise mode, accuracy not only improved markedly over running without the chain of verification, but also exceeded ChatGPT.
For those interested in this research, more details can be found in the paper.
Paper link:
https://arxiv.org/abs/2309.11495