Introducing ChatQA: Building Question Answering Models at the Level of GPT-4

In recent years, the field of question answering (QA) has advanced rapidly, driven by large language models such as ChatGPT (OpenAI, 2022) and its successors. However, building a conversational QA model that matches the accuracy of state-of-the-art proprietary models like GPT-4 remains a challenge for researchers.

Addressing this challenge, the NVIDIA research team presents the paper ChatQA: Building GPT-4 Level Conversational QA Models. They introduce a family of conversational QA models that achieve GPT-4-level accuracy without relying on synthetic data from OpenAI's GPT models.

The researchers first propose a two-stage instruction tuning method for ChatQA. In the first stage, they perform supervised fine-tuning (SFT) on a combination of instruction-following and dialogue datasets, which enables the model to follow instructions effectively as a conversational agent. The second stage, context-aware instruction fine-tuning, improves the model's ability to generate responses grounded in user-provided or retrieved context in conversational QA tasks, as sketched below.
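To make the two-stage recipe concrete, here is a minimal sketch of how such a pipeline could be wired up with the Hugging Face transformers library. The dataset variables, model checkpoint names, and hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of a two-stage instruction tuning pipeline (illustrative only).
# Stage 1: supervised fine-tuning (SFT) on instruction-following / dialogue data.
# Stage 2: context-aware instruction fine-tuning on conversational QA data with context.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def finetune(model_name_or_path, train_dataset, output_dir, epochs):
    """Fine-tune a causal LM on one dataset and return the checkpoint directory."""
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=4,
        learning_rate=5e-6,  # assumed value, not from the paper
    )
    Trainer(model=model, args=args, train_dataset=train_dataset,
            tokenizer=tokenizer).train()
    return output_dir

# sft_dataset and context_qa_dataset are assumed to be pre-tokenized Dataset objects
# prepared separately; they stand in for the paper's training mixtures.
# Stage 1: general instruction/dialogue SFT starting from a base Llama2 checkpoint.
stage1_ckpt = finetune("meta-llama/Llama-2-7b-hf", sft_dataset, "chatqa-stage1", epochs=3)
# Stage 2: context-aware instruction tuning, continuing from the stage-1 checkpoint,
# on conversational QA examples paired with their grounding documents.
stage2_ckpt = finetune(stage1_ckpt, context_qa_dataset, "chatqa-stage2", epochs=1)
```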

Additionally, the researchers introduce a new dataset, HumanAnnotatedConvQA, which substantially improves the model's ability to integrate user-provided or retrieved context in conversational QA tasks and removes the need for synthetic data from OpenAI's GPT models. A rough picture of such a training example is given below.
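The article does not reproduce the dataset's schema, but a context-grounded conversational QA example of this kind can be pictured roughly as follows; the field names and contents are illustrative assumptions, not the actual annotation format.

```python
# Illustrative shape of a context-grounded conversational QA training example.
# Field names and contents are assumptions, not HumanAnnotatedConvQA's real schema.
example = {
    # Grounding document supplied by the user or returned by a retriever.
    "document": "NVIDIA announced ChatQA, a family of conversational QA models ...",
    # Multi-turn dialogue history ending with the question to be answered.
    "dialogue": [
        {"role": "user", "content": "What did NVIDIA announce?"},
        {"role": "assistant", "content": "A family of conversational QA models called ChatQA."},
        {"role": "user", "content": "Which base models were they built on?"},
    ],
    # Human-written answer grounded in the document above.
    "answer": "Llama2-7B, Llama2-13B, Llama2-70B, and in-house GPT-8B and GPT-22B models.",
}
```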

The team builds ChatQA variants on Llama2-7B, Llama2-13B, and Llama2-70B (Touvron et al., 2023), as well as on in-house GPT-8B and GPT-22B models, and evaluates them on ten conversational QA datasets. In terms of average score across these benchmarks, ChatQA-70B (54.14) outperforms both GPT-3.5-turbo (50.37) and GPT-4 (53.90) without using synthetic data from ChatGPT models.
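The headline comparison is simply a ranking of the models by their average score over those ten benchmarks. The tiny snippet below reproduces that ordering from the averages reported in the article; no per-dataset scores are assumed.

```python
# Average scores over the ten conversational QA benchmarks, as reported in the article.
reported_avg = {"ChatQA-70B": 54.14, "GPT-4": 53.90, "GPT-3.5-turbo": 50.37}

# Sorting by the average reproduces the ranking stated above.
for model, score in sorted(reported_avg.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.2f}")
```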

Furthermore, the researchers examine the “unanswerable” scenario, in which the desired answer is not present in the provided or retrieved context. In such cases the model should respond with something like “unable to provide an answer” rather than fabricate one. ChatQA-70B outperforms GPT-3.5-turbo in this scenario, though it still trails GPT-4 by a small margin (approximately 3.5%).
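One way to picture how such a scenario can be scored is a simple check of whether the model declines to answer exactly when the context lacks the answer. This is only a sketch of the idea; the refusal phrases and scoring logic are assumptions, not the paper's exact evaluation protocol.

```python
# Sketch of scoring the "unanswerable" scenario (assumed refusal phrases and logic,
# not the paper's actual evaluation protocol).
REFUSAL_MARKERS = ("unable to provide an answer", "cannot find the answer")

def is_refusal(response: str) -> bool:
    """Return True if the model's response declines to answer."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def unanswerable_accuracy(responses, answerable_flags):
    """A response is correct if the model refuses on unanswerable questions
    and does not refuse on answerable ones."""
    correct = sum(
        (not answerable and is_refusal(resp)) or (answerable and not is_refusal(resp))
        for resp, answerable in zip(responses, answerable_flags)
    )
    return correct / len(responses)
```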

Publication: ChatQA: Building GPT-4 Level Conversational QA Models, arXiv.

Author: Hecate He | Editor: Chain Zhang

To stay up-to-date with the latest news and research breakthroughs, subscribe to our popular newsletter, Synced Global AI Weekly, for weekly updates on artificial intelligence.

FAQ:

1. What is ChatGPT?
ChatGPT is a conversational language model developed by OpenAI. Its release has spurred significant progress in the development of question-answering models.

2. What QA models are presented in the publication “ChatQA: Building GPT-4 Level Conversational QA Models”?
The publication introduces a family of conversational QA models that achieve GPT-4-level accuracy without relying on synthetic data from OpenAI's GPT models. These models are built on Llama2-7B, Llama2-13B, Llama2-70B, GPT-8B, and GPT-22B.

3. What is the instruction fine-tuning method for ChatQA?
The method consists of two stages. The first stage applies supervised fine-tuning (SFT) on instruction-following and dialogue datasets, enabling the model to follow instructions effectively as a conversational agent. The second stage is context-aware instruction fine-tuning, which enhances the model's ability to generate responses grounded in the provided context.

4. How does the new dataset, HumanAnnotatedConvQA, assist ChatQA?
The HumanAnnotatedConvQA dataset significantly improves the language model’s capability to integrate user-provided or retrieved context in conversational QA tasks without relying on synthetic data from ChatGPT models.

5. How does ChatQA compare to other models like GPT-4?
The results show that the ChatQA-70B model achieves an average score of 54.14, surpassing GPT-3.5-turbo (50.37) and slightly exceeding GPT-4 (53.90), without using synthetic data from ChatGPT models.

6. How does the ChatQA model perform in the “unanswerable” scenario?
In cases where the desired answer is not present in the provided or retrieved context, the ChatQA model needs to generate an answer like “unable to provide an answer.” ChatQA-70B outperforms GPT-3.5-turbo in handling this scenario, although there is still a slight difference compared to GPT-4.

Related Links:
– openai.com
– arxiv.org
