[LLM] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Concept

An LM can perform retrieval on demand.
Reflection tokens improve the quality and attribution of the generation.

Problem

Expense of runtime efficiency
Robustness to irrelevant context
Lack of attribution

How to Solve

Use special tokens, called reflection tokens, so that:

The model can decide whether retrieval is needed.
The model can reflect on the retrieved passage and the generated sentence.

Reflection Token

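Retrieve: indicates, given the current sentence, whether retrieval is needed.
IsREL: indicates whether the retrieved passage is useful for answering the question.
IsSUP: indicates whether the verification-worthy statements in the generated sentence are supported by the passage.
IsUSE: indicates whether the generated sentence is a useful answer to the question.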
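
As a rough illustration, the four token types can be written down as a small vocabulary. The label strings for Retrieve, IsREL, and IsSUP are the ones that appear in the data generation prompts later in this post; the "[Utility:n]" spelling for IsUSE and the helper name all_reflection_tokens are assumptions made for this sketch.

```python
from enum import Enum


class Retrieve(Enum):
    RETRIEVAL = "[Retrieval]"
    NO_RETRIEVAL = "[No Retrieval]"


class IsRel(Enum):
    RELEVANT = "[Relevant]"
    IRRELEVANT = "[Irrelevant]"


class IsSup(Enum):
    FULLY = "[Fully supported]"
    PARTIALLY = "[Partially supported]"
    NO_SUPPORT = "[No support / Contradictory]"


# IsUSE is a 1-5 rating; the "[Utility:n]" spelling is an assumption.
IS_USE_TOKENS = [f"[Utility:{n}]" for n in range(1, 6)]


def all_reflection_tokens():
    """Return every reflection token string as one flat list."""
    tokens = [t.value for group in (Retrieve, IsRel, IsSup) for t in group]
    return tokens + IS_USE_TOKENS
```

In a setup like this, these strings would be added to the tokenizer as special tokens so the LM can emit them like any other token.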

Self-RAG

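While generating an answer to a prompt, the model applies the paper's inference algorithm at every generation step:
1. Given the question and the sentences generated so far, decide whether retrieval is needed.
2. If retrieval is needed:
2-1. Run retrieval. (The retrieved passages are processed in parallel.)
2-2-(a). Generate an IsREL token for the question and each retrieved passage.
2-2-(b). Generate the next sentence from the question, the sentences generated so far, and the retrieved passage.
2-3. Generate IsSUP and IsUSE tokens from the question, the generated sentence, and the retrieved passage.
2-4. Rank the generated candidates using the IsREL, IsSUP, and IsUSE tokens.
2-5. Keep the highest-ranked sentence.
3. If retrieval is not needed:
3-1. Generate the answer from the question alone.
3-2. Generate an IsUSE token for the answer to evaluate how useful it is.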
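
The steps above can be sketched as one function that produces a single segment. This is only a control-flow sketch: the four callables (needs_retrieval, retrieve, generate_with_passage, generate_without_passage) are hypothetical stand-ins for the LM and the retriever, and ranking by a plain sum of the three scores is a simplification assumed here, not the paper's exact ranking rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    is_rel: float  # score derived from the IsREL token
    is_sup: float  # score derived from the IsSUP token
    is_use: float  # score derived from the IsUSE token

    @property
    def rank_score(self) -> float:
        # Assumption: an unweighted sum is used to rank candidates.
        return self.is_rel + self.is_sup + self.is_use


def generate_segment(
    question: str,
    so_far: str,
    needs_retrieval: Callable[[str, str], bool],
    retrieve: Callable[[str, str], List[str]],
    generate_with_passage: Callable[[str, str, str], Candidate],
    generate_without_passage: Callable[[str, str], Candidate],
) -> Candidate:
    # Step 1: decide whether retrieval is needed.
    if not needs_retrieval(question, so_far):
        # Step 3: generate directly; only IsUSE is meaningful here.
        return generate_without_passage(question, so_far)
    # Step 2-1: fetch passages; each one yields an independent candidate.
    passages = retrieve(question, so_far)
    # Steps 2-2 and 2-3: one candidate (with critique scores) per passage.
    candidates = [generate_with_passage(question, so_far, p) for p in passages]
    # Steps 2-4 and 2-5: rank and keep the best candidate.
    return max(candidates, key=lambda c: c.rank_score)
```

Calling generate_segment repeatedly, appending each chosen segment to so_far, gives the full answer.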

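How were the reflection tokens trained?

This is honestly the part I was most curious about.
The end goal of this training is to build a dataset whose outputs are annotated with reflection tokens.

There are two ways to do this:

1. Have ChatGPT produce the annotated output for the entire dataset.
2. Train a Critic Model that generates the tokens, use it to augment an existing QA set into an offline dataset, and train on that dataset as described above.

Option 1 is too expensive, so the paper goes with option 2.
My guess is that if ChatGPT produced the whole dataset from scratch, people would ask why not just use ChatGPT directly, so the authors put a Critic Model in the middle that distills ChatGPT's ability to generate reflection tokens.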

Building the Dataset

According to the dataset construction algorithm, we first need a model that generates reflection tokens for the offline dataset.
The data used for training is described below.
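
To make the augmentation step concrete, here is a minimal sketch of how a critic could annotate one (instruction, output) pair. The critic and retrieve callables are hypothetical stand-ins for the trained Critic Model and the retriever, and the output layout (the paragraph wrapper, the token order, and the "[Utility:n]" label) is an assumption for illustration rather than a specification of the released data.

```python
from typing import Callable, List


def augment_example(
    instruction: str,
    sentences: List[str],
    critic: Callable[..., str],
    retrieve: Callable[[str, str], str],
) -> str:
    """Interleave reflection tokens into an existing output, sentence by sentence."""
    parts = []
    for i, sentence in enumerate(sentences):
        preceding = " ".join(sentences[:i])
        # Ask the critic whether this sentence needs retrieval.
        need = critic("retrieve", instruction=instruction,
                      preceding=preceding, output=sentence)
        if need == "[Retrieval]":
            passage = retrieve(instruction, preceding)
            rel = critic("is_rel", instruction=instruction, evidence=passage)
            sup = critic("is_sup", instruction=instruction,
                         evidence=passage, output=sentence)
            parts.append(
                f"[Retrieval]<paragraph>{passage}</paragraph>{rel}{sentence}{sup}"
            )
        else:
            parts.append(f"[No Retrieval]{sentence}")
    # The utility token is predicted once for the whole output.
    use = critic("is_use", instruction=instruction, output=" ".join(sentences))
    return "".join(parts) + use
```

Because this runs offline over the whole corpus, the Generator can later be trained with ordinary next-token prediction on the annotated text.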

Critic Model Training
[Retrieve] data generation prompt

"multi_retrieval": (
        "You will be provided with an instruction, evidence, output sentence, and preceding sentences (optional). If the preceding sentence is given, the output should be the sentence that follows those preceding sentences.  Your task is to determine whether the information in the output sentence can be fully verified by the evidence or if it requires further external verification. If the output sentence can be verified solely with the evidence or doesn’t require any verification, respond with [No Retrieval]. If additional information is needed to verify the output sentence, respond with [Retrieval]. Please provide explanations for your judgments.\n\n"
        "##\nInstruction: Explain the use of word embeddings in Natural Language Processing.\n"
        "Preceding sentences: Word embeddings are one of the most powerful tools available for Natural Language Processing (NLP). They are mathematical representations of words or phrases in a vector space, allowing similarities between words and the context in which they are used to be measured.\n"
        "Evidence: Word embedding\nWord embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.\n"
        "Output: Word embeddings are useful for tasks such as sentiment analysis, text classification, predicting the next word in a sequence, and understanding synonyms and analogies.\n"
        "Rating: [Retrieval]\n"
        "Explanation: The output discusses the applications of word embeddings, while the evidence only discusses the definitions of word embeddings and how it works. Therefore, we need to retrieve other evidence to verify whether the output is actually correct or not.\n"
        "###\nInstruction: {instruction}\n"
        "Preceding sentences: {preceding_sentences}\n"
        "Evidence: {evidence}\n"
        "Output: {target_output}\n"
        "Rating: "),

[IsREL] data generation prompt

 "multi": (
    "You'll be provided with an instruction, along with evidence and possibly some preceding sentences. "
    "When there are preceding sentences, your focus should be on the sentence that comes after them. "
    "Your job is to determine if the evidence is relevant to the initial instruction and the preceding context, and provides useful information to complete the task described in the instruction. "
    "If the evidence meets this requirement, respond with [Relevant]; otherwise, generate [Irrelevant].\n\n"
    "###\nInstruction: Given four answer options, A, B, C, and D, choose the best answer.\n\n"
    "Input: Earth rotating causes\n"
    "A: the cycling of AM and PM\nB: the creation of volcanic eruptions\nC: the cycling of the tides\nD: the creation of gravity\n\n"
    "Evidence: Rotation causes the day-night cycle which also creates a corresponding cycle of temperature and humidity creates a corresponding cycle of temperature and humidity. Sea level rises and falls twice a day as the earth rotates.\n\n"
    "Rating: [Relevant]\n"
    "Explanation: The evidence explicitly mentions that the rotation causes a day-night cycle, as described in the answer option A.\n\n"
    "###\nInstruction: age to run for us house of representatives\n\n"
    "Evidence: The Constitution sets three qualifications for service in the U.S. Senate: age (at least thirty years of age); U.S. citizenship (at least nine years); and residency in the state a senator represents at the time of election.\n\n"
    "Rating: [Irrelevant]\n"
    "Explanation: The evidence only discusses the ages to run for the US Senate, not for the House of Representatives.\n\n"
    "###\nInstruction: {instruction}\n\n"
    "Evidence: {evidence}\n\n"
    "Rating:"
),

[IsSUP] data generation prompt

"multi": (
    "You will receive an instruction, evidence, and output, and optional preceding sentences.  If the preceding sentence is given, the output should be the sentence that follows those preceding sentences. Your task is to evaluate if the output is fully supported by the information provided in the evidence, and provide explanations on your judgement\n"
    "Use the following entailment scale to generate a score:\n"
    "[Fully supported] - All information in output is supported by the evidence, or extractions from the evidence. This is only applicable when the output and part of the evidence are almost identical.\n"
    "[Partially supported] - The output is supported by the evidence to some extent, but there is major information in the output that is not discussed in the evidence. For example, if an instruction asks about two concepts and the evidence only discusses either of them, it should be considered a [Partially supported].\n"
    "[No support / Contradictory] - The output completely ignores evidence, is unrelated to the evidence, or contradicts the evidence. This can also happen if the evidence is irrelevant to the instruction.\n\n"
    "Make sure to not use any external information/knowledge to judge whether the output is true or not. Only check whether the output is supported by the evidence, and not whether the output follows the instructions or not.\n\n"
    "###\nInstruction: Explain the use of word embeddings in Natural Language Processing.\n"
    "Preceding sentences: Word embeddings are one of the most powerful tools available for Natural Language Processing (NLP). They are mathematical representations of words or phrases in a vector space, allowing similarities between words and the context in which they are used to be measured.\n"
    "Output: Word embeddings are useful for tasks such as sentiment analysis, text classification, predicting the next word in a sequence, and understanding synonyms and analogies.\n"
    "Evidence: Word embedding\nWord embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing, sentiment analysis, next token predictions as well as analogy detection.\n"
    "Score: [Fully supported]\n"
    "Explanation: The output sentence discusses the application of word embeddings, and the evidence mentions all of the applications syntactic parsing, sentiment analysis, next token predictions as well as analogy detection as the applications. Therefore, the score should be [Fully supported].\n\n"
    "###\n"
    "Instruction: {instruction}\n"
    "Preceding sentences: {preceding_sentences}\n"
    "Output: {target_output}\n"
    "Evidence: {evidence}\n"
    "Score: "
)

[IsUSE] data generation prompt

"context": (
    "Given an instruction and an output, rate whether the response appears to be a helpful and informative answer to the query, from 1 (lowest) - 5 (highest). We call this score perceived utility.\n\n"
    "The detailed criterion is as follows:\n"
    "5: The response provides a complete, highly detailed, and informative response to the query, fully satisfying the information needs.\n"
    "4: The response mostly fulfills the need in the query, while there can be some minor improvements such as discussing more detailed information, having better structure of the response, or improving coherence. \n"
    "3: The response is acceptable, but some major additions or improvements are needed to satisfy users' needs.\n"
    "2: The response still addresses the main request, but it is not complete or not relevant to the query.\n"
    "1: The response is barely on-topic or completely irrelevant.\n"
    "##\n Instruction: Who is the current prime minister of the UK as of 2023?\n"
    "Output: Boris Johnson was the prime minister of the UK from 2019 - 2022.\n"
    "Perceived utility: 2\n"
    "Explanation: While the output provides a factually correct statement about the UK prime minister from 2019 to 2022, this instruction asks who the prime minister is as of 2023, so it doesn't answer the instruction. Therefore, the utility is 2.\n\n"
    "##\nInstruction: Given a description of a travel destination, recommend 10 tourist attractions with detailed explanations of each. The travel destination is Tokyo, Japan.\n"
    "Output: 'Tokyo is a vibrant city full of exciting tourist attractions. Some must-see sights include the Tokyo Skytree, Tokyo Disneyland, Sensoji Temple, Meiji Shrine, Tsukiji Fish Market, Harajuku, and Shinjuku Gyoen.\n"
    "Perceived utility: 3\n"
    "Explanation: This output doesn't provide descriptions of each attraction and the number of the attractions is also less than 10. While this output partially answers the instructions, it doesn't match the instructions strictly. \n\n"
    "##\nInstruction: {instruction}\n"
    "Output:{output}\n"
),
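
Each of the four templates above ends with {placeholder} fields that are filled in per example, and the first label in the model's reply becomes the training target for the critic. Below is a minimal sketch of that fill-and-parse step. IS_USE_TEMPLATE is a shortened stand-in for the full prompt, and build_prompt and parse_rating are hypothetical helper names, not functions from the Self-RAG repository.

```python
from typing import Optional, Sequence

# Shortened stand-in for the tail of the IsUSE prompt shown above.
IS_USE_TEMPLATE = (
    "##\nInstruction: {instruction}\n"
    "Output:{output}\n"
)


def build_prompt(template: str, **fields: str) -> str:
    """Fill the {placeholder} fields of a prompt template."""
    return template.format(**fields)


def parse_rating(completion: str, allowed: Sequence[str]) -> Optional[str]:
    """Return the allowed label that appears first in the completion, if any."""
    hits = [(completion.find(label), label) for label in allowed if label in completion]
    return min(hits)[1] if hits else None
```

Taking the earliest label matters because the explanation that follows the rating may mention other labels.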

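The training data generated this way is used to train the Critic Model by fine-tuning Llama2-7B.
Generator Training
The trained Critic Model then processes the offline dataset as described above to build the Generator's training set.
The resulting dataset (full_output_1005.jsonl) can be downloaded from the link below.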
https://drive.google.com/file/d/10G_FozUV4u27EX0NjwVe-3YMUMeTwuLk/view

The Generator is then trained on this dataset.

Inference Process in Detail

1. Adaptive retrieval with threshold
Rather than retrieving simply because the LM emitted a [Retrieval] token, a threshold is set and retrieval runs only when the score of the [Retrieval] token exceeds it.
The relevant code is here:
https://github.com/AkariAsai/self-rag/blob/12afe0bac2c894e9fbf255960c03d2327600f031/retrieval_lm/run_long_form_static.py#L124-L129
The score is the probability of the [Retrieval] token normalized over the probabilities of the retrieval-related tokens ([Retrieval] and [No Retrieval]) at that position.
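
A minimal sketch of the thresholding step, assuming the score is the probability of [Retrieval] normalized against [No Retrieval]. The function name should_retrieve, the dict-of-log-probabilities input, the -100.0 fallback for a missing token, and the default threshold of 0.5 are all assumptions for illustration.

```python
import math
from typing import Dict


def should_retrieve(logprobs: Dict[str, float], threshold: float = 0.5) -> bool:
    """Decide whether to retrieve from the log-probs of the two Retrieve tokens."""
    # A token missing from the returned log-probs gets a very low log-probability.
    p_yes = math.exp(logprobs.get("[Retrieval]", -100.0))
    p_no = math.exp(logprobs.get("[No Retrieval]", -100.0))
    # Normalize over the two retrieval-related tokens only.
    return p_yes / (p_yes + p_no) > threshold
```

Lowering the threshold makes the model retrieve more often, and raising it makes retrieval rarer, so the same trained model can be tuned per task without retraining.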

2. Tree-decoding with critique tokens
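Let's look at the Self-RAG inference process again.

Self-RAG runs generation in parallel and scores each candidate generation.
The scores are computed as follows.
1. IsREL

The probability of the [Relevant] token divided by the total probability of the IsREL tokens ([Relevant] and [Irrelevant]).
2. IsSUP

The probability of [Fully supported] plus 0.5 times the probability of [Partially supported], divided by the total probability of the IsSUP tokens, so partial support counts at half weight.
3. IsUSE
IsUSE takes a value from 1 to 5, and the score is a weighted sum over the normalized probabilities of those five values, as defined in the paper.

That is how the scores are computed.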
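
The three scores can be sketched as follows, taking the token probabilities as plain dicts. The IsUSE weights (-1, -0.5, 0, 0.5, 1) and the default combination weights in critique_score are my reading of the paper's appendix and should be treated as assumptions; the function names are hypothetical.

```python
from typing import Dict

# Assumed weights for utility ratings 1..5.
USE_WEIGHTS = {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}


def is_rel_score(p: Dict[str, float]) -> float:
    """Share of probability mass on [Relevant] among the IsREL tokens."""
    return p["[Relevant]"] / (p["[Relevant]"] + p["[Irrelevant]"])


def is_sup_score(p: Dict[str, float]) -> float:
    """[Fully supported] counts fully, [Partially supported] counts half."""
    total = sum(p.values())
    return (p["[Fully supported]"] + 0.5 * p["[Partially supported]"]) / total


def is_use_score(p: Dict[int, float]) -> float:
    """Weighted sum over the normalized probabilities of ratings 1..5."""
    total = sum(p.values())
    return sum(USE_WEIGHTS[rating] * prob for rating, prob in p.items()) / total


def critique_score(rel: float, sup: float, use: float,
                   w_rel: float = 1.0, w_sup: float = 1.0, w_use: float = 0.5) -> float:
    """Linear combination used to rank candidates (weights are assumptions)."""
    return w_rel * rel + w_sup * sup + w_use * use
```

Changing w_rel, w_sup, and w_use at inference time shifts the trade-off between, for example, citation support and overall helpfulness.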

Experiment

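This section covers which datasets were used, what ability each one tests, and which metrics were used to evaluate the RAG system.
PopQA and TriviaQA-unfiltered:

Purpose: evaluate the ability to answer open-domain factual questions accurately.
Metric: whether the generated response contains the gold answer.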
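
A minimal sketch of this "contains the gold answer" metric. The function names are hypothetical, and the case-insensitive comparison is an assumption added here; the official evaluation may normalize text differently.

```python
from typing import List


def contains_gold(prediction: str, gold_answers: List[str]) -> int:
    """Return 1 if any gold answer appears in the prediction, else 0."""
    pred = prediction.lower()  # assumption: comparison ignores case
    return int(any(answer.lower() in pred for answer in gold_answers))


def match_accuracy(predictions: List[str], golds: List[List[str]]) -> float:
    """Average the per-example match over a dataset."""
    scores = [contains_gold(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)
```

Unlike exact match, this does not penalize a long-form answer for the extra text around the gold string.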

PubHealth and ARC-Challenge

Purpose: evaluate the ability to answer factual questions accurately on closed-set tasks. PubHealth is a fact-verification task, and ARC-Challenge is a multiple-choice task of selecting the correct answer.
Metric: accuracy of the generated response.

Biography generation task and long-form QA task (ALCE-ASQA)

Metrics
- The biography generation task uses FactScore.
- ASQA uses correctness (str-em), fluency (MAUVE), and citation precision and recall.

See the paper's main results table for the outcome of these experiments.

Conclusion

What made this paper novel is that it uses reflection tokens to control the behavior of the RAG system and to reflect on the results it produces.
Reading it left me with the following questions.
1. Could we use only the Critic Model?
Of course, having a single LM do both critique and generation can improve inference performance. But that may affect the generator LM's general generation quality, and if we assume a setting where the generator LM cannot be fine-tuned, I wonder whether the Critic Model alone could reach the performance the authors report.
2. Isn't this a task that ChatGPT could just solve?
If the Critic Model is a device for distilling knowledge from GPT, why not use GPT directly? The code even uses a GPT-3.5-turbo model. The approach is meant to be cost-efficient, but in terms of simply getting the task done, GPT-4 would probably be much easier and more convenient.
3. What if there are multiple data sources?
The model was evaluated with a single data source. What should we do when RAG has to consult several? Perhaps a token indicating which data source to fetch from could be trained in much the same way as the reflection tokens, for example so that information from source A is used to analyze information from source B.

Thanks for reading this long post.
If anything is wrong, please leave a comment.

📧 : realhwan1202@gmail.com
🔗 : https://github.com/RicardoKim
