6000-word interpretation: The 10 major challenges in the current research of large language models (LLMs)

巴比特
2 years ago

Author: Chip Huyen

Translation: Alpha Rabbit

Source Link: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

Never before have I seen so many smart people working toward a common goal: making large language models (LLMs) better. After exchanging ideas with many industry and academic professionals, I noticed ten major research directions emerging. The two directions receiving the most attention right now are hallucinations and in-context learning.

For myself, I am most interested in the 3rd direction (Multimodality), the 5th direction (New architecture), and the 6th direction (GPU alternatives) listed below.

Top Ten Open Challenges in LLM Research

  1. Reduce and evaluate hallucinations (fabricated information)
  2. Optimize context length and context construction
  3. Integrate other data modalities
  4. Improve the speed and cost-effectiveness of language models
  5. Design new model architectures
  6. Develop GPU alternatives
  7. Enhance the usability of agents (AI)
  8. Improve the ability to learn from human preferences
  9. Increase the efficiency of chat interfaces
  10. Build language models for non-English languages

1. Reduce and evaluate hallucinations

Hallucination in production settings has been discussed extensively. Hallucination occurs when an AI model fabricates information. While fabrication can be a feature in many creative use cases, it is a bug in most applications. I recently joined a panel discussion on LLMs with experts from Dropbox, Langchain, Elastic, and Anthropic, and they identified hallucinated output as the primary obstacle enterprises must overcome before putting LLMs into production.

Reducing hallucinations and developing metrics to measure them is a burgeoning research topic that many startups are focusing on. There are practical techniques for reducing the probability of hallucination, such as adding more context to the prompt, chain-of-thought (CoT) prompting, self-consistency, or asking the model to keep its responses concise.
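Of these techniques, self-consistency is the easiest to illustrate: sample the same prompt several times and take a majority vote over the final answers. The sketch below uses a hypothetical `generate` callable as a stand-in for any sampled LLM call; the agreement ratio it returns can double as a rough confidence signal.

```python
from collections import Counter
import itertools

def self_consistency(generate, prompt, n_samples=5):
    # Sample several completions and majority-vote over the answers.
    answers = [generate(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    # Low agreement across samples often correlates with
    # hallucinated or unreliable answers.
    return winner, count / n_samples

# Toy stand-in "model" that answers inconsistently:
fake_answers = itertools.cycle(["42", "42", "41", "42", "40"])
answer, agreement = self_consistency(lambda p: next(fake_answers), "What is 6 x 7?")
# answer == "42", agreement == 0.6
```

A real implementation would sample with nonzero temperature and extract the final answer from each chain of thought before voting.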

Here are a series of papers and references on hallucination output:

  • Survey of Hallucination in Natural Language Generation (Ji et al., 2022)
  • How Language Model Hallucinations Can Snowball (Zhang et al., 2023)
  • A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (Bang et al., 2023)
  • Contrastive Learning Reduces Hallucination in Conversations (Sun et al., 2022)
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)
  • SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (Manakul et al., 2023)
  • A simple example of fact-checking and hallucination detection from NVIDIA's NeMo Guardrails

2. Optimize context length and context construction

Most problems require context. For example, if we ask ChatGPT, "Which is the best Vietnamese restaurant?", the necessary context is the user's location, because the best Vietnamese restaurant in Vietnam and the best Vietnamese restaurant in the United States are different questions.

According to the paper "SITUATEDQA: Incorporating Extra-Linguistic Contexts into QA" (Zhang & Choi, 2021), a significant share of information-seeking questions have context-dependent answers: roughly 16.5% in the Natural Questions NQ-Open dataset.

(NQ-Open: https://ai.google.com/research/NaturalQuestions)

In my opinion, this proportion is likely even higher in enterprise use cases. For example, suppose a company builds a customer-support chatbot to answer questions about its products; the required context is then the customer's history or information about the product in question. Because the language model "learns" from the context provided to it, this process is also called in-context learning.

Context length is crucial for RAG (retrieval-augmented generation), which has become the primary mode of application for large language models in the industry. Specifically, retrieval-augmented generation is mainly divided into two stages:

Stage 1: Chunking (also known as indexing)

Collect all documents used by LLM, divide these documents into chunks that can be fed into the model to generate embeddings, and store these embeddings in a vector database.

Stage 2: Query

When a user sends a query, such as "Can my insurance cover drug X?", the large language model converts this query into an embedding (the query embedding), and the vector database retrieves the chunks most similar to it.
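The two stages can be sketched end to end. Everything here is a stand-in for illustration: `embed` uses character-frequency vectors instead of a real embedding model, and the "vector database" is a plain Python list rather than an approximate-nearest-neighbor index.

```python
import math

def embed(text):
    # Hypothetical embedding: character-frequency vectors stand in
    # for a real embedding model, purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stage 1 (chunking/indexing): embed each chunk and store it.
chunks = [
    "Drug X is covered by the standard insurance plan.",
    "The gym opens at 6 a.m. on weekdays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2 (query): embed the query, retrieve the most similar chunk.
query = "Can my insurance cover drug X?"
q_emb = embed(query)
best_chunk, _ = max(index, key=lambda pair: cosine(q_emb, pair[1]))
```

A production system would batch the embedding calls, store vectors in a real vector database, and pass the retrieved chunks to the LLM as context.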

The longer the context length, the more chunks we can squeeze into the context. The more information the model obtains, the higher the quality of its output and response, right?

Not always. How much context a model can use and how effectively it uses that context are two different questions. While striving to increase context length, we should also strive to improve context efficiency; some call this "prompt engineering" or "prompt construction." For example, a recent paper showed that models understand information at the beginning and end of the context much better than information in the middle: Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023).

3. Integrate other data modalities (multimodality)

In my view, multimodality is enormously powerful yet underrated. Here is why it matters:

First, many specific application scenarios require multimodal data, especially in industries with mixed data modalities such as healthcare, robotics, e-commerce, retail, gaming, and entertainment. For example:

Medical diagnostics typically require text (such as doctor's notes, patient questionnaires) and images (such as CT scans, X-rays, MRI scans).

Product metadata usually includes images, videos, descriptions, and even tabular data (such as production date, weight, color). On the demand side, you may need to automatically fill in missing product information from user reviews or product photos, or you may want users to search for products using visual attributes such as shape or color.

Second, multimodality is expected to significantly improve model performance. Shouldn't a model that can understand both text and images perform better than a model that can only understand text? Text-based models require a large amount of text, to the point where we are concerned that we will soon run out of internet data to train text-based models. Once text is exhausted, we need to utilize other data modalities.

A particularly exciting use case for multimodal technology is enabling visually impaired individuals to browse the internet and navigate the real world.

Here are a series of papers and references related to multimodality:

  • [CLIP] Learning Transferable Visual Models From Natural Language Supervision (OpenAI, 2021)
  • Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind, 2022)
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce, 2023)
  • KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models (Microsoft, 2023)
  • PaLM-E: An embodied multimodal language model (Google, 2023)
  • LLaVA: Visual Instruction Tuning (Liu et al., 2023)
  • NeVA: NeMo Vision and Language Assistant (NVIDIA, 2023)

4. Make LLMs faster and more cost-effective

When ChatGPT (backed by GPT-3.5) was released at the end of November 2022, many people worried about the latency and cost of using it in production. That calculus has changed rapidly since: within half a year, the community had found ways to build models whose performance comes close to GPT-3.5 while requiring only about 2% of its memory footprint.

The insight here is that if you create something good enough, people will find a way to make it fast and cost-effective.

The figure below compares the performance of Guanaco 7B with ChatGPT (GPT-3.5) and GPT-4, as reported in the Guanaco paper. Note: overall, such comparisons are far from perfect, and evaluating LLMs is very, very hard.

[Figure: Performance comparison of Guanaco 7B with ChatGPT GPT-3.5 and GPT-4]

Four years ago, when I started writing notes for what later became the "model compression" section of the book "Designing Machine Learning Systems," I wrote about four main techniques for model optimization/compression:

  • Quantization: the most widely used model-optimization method to date. Quantization shrinks a model by using fewer bits to represent its parameters, e.g., 16 or even 4 bits instead of 32-bit floats.
  • Knowledge distillation: training a small model to mimic a larger model or an ensemble of models.
  • Low-rank factorization: replacing high-dimensional tensors with low-dimensional ones to reduce the parameter count.
  • Pruning: removing parameters that contribute little to the model's output.

All four techniques remain applicable and popular today. Alpaca was trained with knowledge distillation; QLoRA combines low-rank factorization with quantization.
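As a concrete illustration of the first technique, here is a minimal sketch of symmetric 8-bit quantization; real schemes (per-channel scales, 4-bit NF4 as used in QLoRA, and so on) are considerably more sophisticated.

```python
def quantize_int8(weights):
    # Symmetric 8-bit quantization: map float weights onto the
    # integer range [-127, 127] with a single shared scale.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each weight now fits in 1 byte instead of 4 (float32), at the
# cost of a rounding error of at most scale / 2 per weight.
```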

5. Design new model architectures

Since the emergence of AlexNet in 2012, we have seen many architectures rise and fall, including LSTM and seq2seq. Compared with those, the Transformer's staying power is remarkable: it has been around since 2017, and how much longer it will dominate is an open question.

Developing a new architecture to surpass the Transformer is not easy. The Transformer has undergone a lot of optimization over the past 6 years, and this new architecture must run on the hardware and at the scale that people currently care about.

Note: Google originally designed the Transformer to run fast on TPUs, and later optimized it for GPUs.

In 2021, S4 from Chris Ré's lab garnered widespread attention, as detailed in "Efficiently Modeling Long Sequences with Structured State Spaces" (Gu et al., 2021). Chris Ré's lab is still heavily involved in developing new architectures, and one of them is Monarch Mixer (Fu, 2023), developed in collaboration with the startup Together.

Their main idea is that for the existing Transformer architecture, the complexity of attention is quadratic in the length of the sequence, and the complexity of the MLP is quadratic in the model's dimension. An architecture with subquadratic complexity will be more efficient.
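The quadratic-versus-subquadratic point can be made concrete with rough FLOP counts (the constants here are illustrative, not exact): doubling the sequence length quadruples the attention cost but only doubles the MLP cost.

```python
def attention_flops(n, d):
    # Scores (QK^T) and the attention-weighted sum of V each cost
    # roughly n^2 * d multiply-adds: quadratic in sequence length n.
    return 2 * n * n * d

def mlp_flops(n, d, expansion=4):
    # Two projections d -> 4d -> d per token: quadratic in the
    # model dimension d, but only linear in n.
    return 2 * n * d * (expansion * d)

n, d = 4096, 1024
ratio_attn = attention_flops(2 * n, d) / attention_flops(n, d)  # 4.0
ratio_mlp = mlp_flops(2 * n, d) / mlp_flops(n, d)               # 2.0
```

An architecture whose cost grows subquadratically in both n and d would beat this scaling, which is what Monarch Mixer and related work aim for.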

[Figure: Monarch Mixer]

6. Develop GPU alternatives

Since the emergence of AlexNet in 2012, GPUs have been the dominant hardware for deep learning. In fact, one of the universally recognized reasons for the popularity of AlexNet is that it was the first paper to successfully use GPUs to train neural networks. Before GPUs, training models at the scale of AlexNet would have required thousands of CPUs, as Google did with a model released a few months before AlexNet. Compared to thousands of CPUs, a few GPUs are more accessible to doctoral students and researchers, leading to a boom in deep learning research.

Over the past decade, many companies, including large enterprises and startups, have attempted to create new hardware for artificial intelligence. The most notable attempts include Google's TPU, Graphcore's IPU (how is the progress of IPU?), and Cerebras. SambaNova raised over a billion dollars to develop a new AI chip but seems to have shifted towards becoming a generative AI platform.

For a while, there was great hope for quantum computing, with key players including:

  • IBM's QPU
  • Google's quantum computer reported a significant milestone in reducing quantum errors earlier this year in the journal "Nature." Its quantum virtual machine is publicly accessible through Google Colab.
  • Research labs such as the MIT Quantum Engineering Center, Max Planck Institute for Quantum Optics, Chicago Quantum Exchange, Oak Ridge National Laboratory, and more.

Another equally exciting direction is photonic chips. I have limited knowledge in this area, so please correct me if I'm wrong. Existing chips use electricity to transmit data, consuming a lot of energy and producing latency. Photonic chips use photons to transmit data, enabling faster and more efficient computation at the speed of light. In this field, various startups have raised hundreds of millions of dollars, including Lightmatter ($270 million), Ayar Labs ($220 million), Lightelligence (over $200 million), and Luminous Computing ($115 million).

The timeline of the progress of three main methods of photonic matrix computation is excerpted from the paper "Photonic matrix multiplication lights up photonic accelerator and beyond" (Zhou, Nature 2022). These three different methods are planar lightwave circuits (PLC), Mach-Zehnder interferometers (MZI), and wavelength-division multiplexing (WDM).

[Figure: Timeline of the three main approaches to photonic matrix computation]

7. Enhance the usability of agents

An agent is a large language model that can take actions: browsing the internet, sending emails, making reservations, and so on (think of it as a proxy that completes tasks on your behalf, hence the name). This may be the newest of the directions in this article, and people are enthusiastic about agents because of their novelty and immense potential. Auto-GPT is currently the 25th most-starred repository on GitHub, and GPT-Engineer is another popular one.

While this direction is exciting, there are still doubts about whether LLMs are reliable and performant enough to be entrusted with the power to act. One application scenario has already emerged, however: using agents for social-science research. In the well-known Stanford experiment, a small society of generative agents produced emergent social behavior: starting from a single user-specified idea that one agent wanted to host a Valentine's Day party, the agents autonomously spread invitations over the next two days, made new friends, and asked each other to the party (Generative Agents: Interactive Simulacra of Human Behavior, Park et al., 2023).

Perhaps the most notable startup in this field is Adept, founded by two co-authors of the Transformer paper and a former OpenAI vice president; it has raised almost $500 million to date. Last year, it demonstrated an agent browsing the internet and adding a new account to Salesforce.
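At its core, an agent is just an LLM in a loop that chooses actions and observes their results. The sketch below assumes a hypothetical `llm` callable returning (action, argument) pairs; real frameworks such as Auto-GPT layer planning, memory, and error handling on top of this loop.

```python
def run_agent(llm, tools, task, max_steps=5):
    # Minimal agent loop: the model picks an action at each step,
    # the result is fed back in, until it decides to finish.
    observation = task
    for _ in range(max_steps):
        action, arg = llm(observation)
        if action == "finish":
            return arg
        observation = tools[action](arg)
    return None  # gave up after max_steps

# Toy stand-in "model": search once, then finish with the result.
def toy_llm(observation):
    if observation.startswith("result: "):
        return ("finish", observation[len("result: "):])
    return ("search", observation)

tools = {"search": lambda query: "result: Paris"}
answer = run_agent(toy_llm, tools, "capital of France?")
# answer == "Paris"
```

The reliability concern above lives in this loop: every step compounds the chance that the model picks a wrong action or misreads an observation.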

8. Iterate on RLHF

RLHF (Reinforcement Learning from Human Feedback) is cool but a bit tricky. It would not be surprising if people find better ways to train LLMs. That said, RLHF still has many unresolved issues, such as:

  1. How do we mathematically represent human preference? Currently, preference is captured through comparison: a human annotator judges whether response A is better than response B, but this does not capture how much better A is than B.

  2. What is "human preference"?

Anthropic measures the quality of its model based on usefulness, honesty, and harmlessness. Please refer to "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022).

DeepMind attempts to generate responses that can please the majority of people. Please refer to "Fine-tuning language models to find agreement among humans with diverse preferences" (Bakker et al., 2022).

Additionally, do we want AI to be able to take a stance, or should it avoid any potentially controversial topic, as traditional AI tends to do?

  3. Whose preferences are "human" preferences? Should cultural, religious, and political differences be taken into account? Obtaining training data that represents all potential users poses many challenges.

For example, in OpenAI's InstructGPT data, there are no annotators over 65 years old. The annotators are mainly Filipino and Bangladeshi. Please refer to "InstructGPT: Training language models to follow instructions with human feedback" (Ouyang et al., 2022).

[Figure: Nationality statistics of InstructGPT annotators]

Although community-driven efforts are commendable in intent, they may produce biased data. For example, in the OpenAssistant dataset, 201 of the 222 respondents (about 90%) self-identified as male. Jeremy Howard has a good Twitter thread on this:

[Figure: Gender distribution of OpenAssistant dataset respondents]
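The comparison-based setup in question 1 is commonly modeled, for instance in InstructGPT-style reward modeling, with a Bradley-Terry formulation: the probability that A is preferred over B is sigmoid(r(A) - r(B)). A minimal sketch; note that the loss depends only on the score difference, which is exactly why "how much better" is not captured.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(reward_a, reward_b):
    # Negative log-likelihood that A is preferred over B under a
    # Bradley-Terry model: P(A > B) = sigmoid(r(A) - r(B)).
    # Only the score *difference* matters, so the loss encodes
    # which response is better, not by how much.
    return -math.log(sigmoid(reward_a - reward_b))

# A reward model that already ranks A above B incurs a small loss;
# one that ranks them the wrong way around incurs a large one.
good = pairwise_loss(2.0, 0.0)
bad = pairwise_loss(0.0, 2.0)
```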

9. Improve chat interface efficiency

Since ChatGPT, there has been ongoing discussion about whether chat is a suitable interface for various tasks.

Refer to:

  • "Natural language is the lazy user interface" (Austin Z. Henley, 2023)
  • "Why Chatbots Are Not the Future" (Amelia Wattenberger, 2023)
  • "What Types of Questions Require Conversation to Answer? A Case Study of AskReddit Questions" (Huang et al., 2023)
  • "AI chat interfaces could become the primary user interface to read documentation" (Tom Johnson, 2023)
  • "Interacting with LLMs with Minimal Chat" (Eugene Yan, 2023)

However, this is not a new topic. In many countries, especially in Asia, chat has been used as the interface for super apps for about a decade, as Dan Grover reported back in 2014.

[Figure: Chat interface usage statistics]

In 2016, when many people thought apps were dead and chatbots would be the future, the discussion became intense again:

  • "On chat as interface" (Alistair Croll, 2016)
  • "Is the Chatbot Trend One Big Misunderstanding?" (Will Knight, 2016)
  • "Bots won’t replace apps. Better apps will replace apps" (Dan Grover, 2016)

I personally like chat interfaces for the following reasons:

  1. Chat interfaces are universal and can be quickly learned by everyone, even those who have not previously used computers or the internet. In the early 2010s, when I volunteered in a low-income neighborhood in Kenya, I was amazed at how familiar everyone was with conducting banking on their phones through text messages. That community had no computers.

  2. Chat interfaces are easily accessible. If your hands are busy, you can use voice instead of text.

  3. Chat is a very powerful interface—you can make any request, and it will respond, even if the response is not necessarily perfect.

However, I believe there are areas where chat interfaces can be further improved:

  1. Multi-message interactions

Currently, we more or less assume each turn consists of a single message. But that is not how my friends and I text. I often need multiple messages to complete a thought, because I need to insert different data (images, locations, links), I forgot something in an earlier message, or I simply don't want to cram everything into one large paragraph.

  2. Multi-modal input

In the field of multi-modal applications, most of the focus is on building better models, with little focus on building better interfaces. Take Nvidia's NeVA chatbot, for example. I'm not a user experience expert, but I think there may be room for improvement here.

Note: Apologies to the NeVA team mentioned here; your work is still very cool!

[Figure: NVIDIA's NeVA chatbot interface]

  3. Integrating generative AI into workflows

Linus Lee covered this well in his talk "Generative AI interface beyond chats." For example, if you want to ask a question about a specific column in a chart you're working on, you should be able to simply point to that column and ask.

  4. Message editing and deletion

How would user input editing or deletion change the flow of conversation with a chatbot?

10. Creating LLMs for non-English languages

We know that current LLMs trained with English as the first language do not work well for many other languages, in terms of performance, latency, and cost.

Refer to:

  • "ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning" (Lai et al., 2023)
  • "All languages are NOT created (tokenized) equal" (Yennie Jun, 2023)
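The tokenization gap measured in the second reference has a simple lower-level analogue: UTF-8 itself encodes Latin letters in one byte but most CJK characters in three, so byte-level tokenizers (a common fallback for text a learned vocabulary does not cover well) spend several times more tokens per character on non-Latin scripts. A rough illustration:

```python
english = "hello"
chinese = "你好"  # "hello" in Chinese: two characters

eng_bytes = len(english.encode("utf-8"))   # 5 bytes for 5 letters
zh_bytes = len(chinese.encode("utf-8"))    # 6 bytes for 2 characters

bytes_per_char_en = eng_bytes / len(english)  # 1.0
bytes_per_char_zh = zh_bytes / len(chinese)   # 3.0
```

Learned BPE vocabularies trained on mostly-English corpora compound this further, which is the phenomenon the reference above measures.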

[Figure: Multilingual language model performance]

I am only aware of attempts to train LLMs for Vietnamese (e.g., by the Symato community). A few early readers of this article told me they thought I should not include this direction, for two reasons:

  1. This is less a research problem than a logistics problem: we already know how to do it; someone just needs to invest the money and effort. This is not entirely correct, however. Most languages are low-resource languages, with far less high-quality data than English or Chinese, so training large language models for them may require different techniques. See:

  • "Low-resource Languages: A Review of Past Work and Future Challenges" (Magueresse et al., 2020)
  • "JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages" (Agić et al., 2019)

  2. Some are more pessimistic and believe that in the future many languages will die out, leaving an internet made of two universes in two languages: English and Chinese. This line of thinking is not new. Does anyone remember Esperanto?

The impact of AI tools, such as machine translation and chatbots, on language learning is still unclear. Will they help people learn new languages faster, or will they completely eliminate the need to learn new languages?

Conclusion

If there are any omissions in this article, please let me know. For other perspectives, please refer to the comprehensive paper "Challenges and Applications of Large Language Models" (Kaddour et al., 2023).

Some of the problems above are harder than others. For example, I believe that problem 1, reducing and evaluating hallucination, will be much harder than it appears, because hallucination is simply an LLM doing its probabilistic thing.

The fourth problem, making LLMs faster and cheaper, can never be completely solved. There has been great progress in this area, and there will be more in the future, but improvements in this direction will continue indefinitely.

The fifth and sixth items, namely new architectures and new hardware, are very challenging, but they are inevitable over time. Due to the symbiotic relationship between architecture and hardware—new architectures need to be optimized for general-purpose hardware, and hardware needs to support general architectures—they may be accomplished by the same company.

Some problems cannot be solved by technical knowledge alone. For example, the eighth problem, improving methods for learning from human preferences, may be more of a policy issue than a technical one. The ninth problem is about improving the efficiency of chat interfaces, which is more of a user experience issue. We need more people with non-technical backgrounds to help us solve these problems.

What research direction are you most interested in? What do you think is the most promising solution to these problems? I would love to hear your thoughts.
