Charts
DataOn-chain
VIP
Market Cap
API
Rankings
CoinOSNew
CoinClaw🦞
Language
  • 简体中文
  • 繁体中文
  • English
Leader in global market data applications, committed to providing valuable information more efficiently.

Features

  • Real-time Data
  • Special Features
  • AI Grid

Services

  • News
  • Open Data(API)
  • Institutional Services

Downloads

  • Desktop
  • Android
  • iOS

Contact Us

  • Chat Room
  • Business Email
  • Official Email
  • Official Verification

Join Community

  • Telegram
  • Twitter
  • Discord

© Copyright 2013-2026. All rights reserved.

简体繁體English
|Legacy

GPT-4's abstract reasoning far surpasses that of humans! Its multimodal capabilities are far inferior to pure text, and the spark of AGI is difficult to ignite independently.

CN
巴比特
Follow
2 years ago
AI summarizes in 5 seconds.

Source: New Smart Element

Image

Image Source: Generated by Wujie AI

Researchers at the Santa Fe Institute used very rigorous quantitative research methods to test that GPT-4 still has a significant gap in reasoning and abstraction compared to human levels. It is still a long way to go to develop AGI from the level of GPT-4!

GPT-4 may be the most powerful general language model currently available. Once released, in addition to marveling at its outstanding performance in various tasks, many have also raised questions: Is GPT-4 AGI? Does it really herald the day when AI will replace humans?

There are also many netizens on Twitter who have initiated polls:

Poll

Among them, the opposing views mainly focus on:

  • Limited reasoning ability: GPT-4 is most criticized for its inability to perform "reverse reasoning" and its difficulty in forming abstract models of the world for estimation.

  • Task-specific generalization: Although GPT-4 can generalize formally, it may encounter difficulties in cross-task objectives.

So how big is the gap between GPT-4's reasoning and abstraction abilities compared to humans? This kind of intuition seems to have lacked quantitative research support.

Recently, researchers at the Santa Fe Institute systematically compared the gap between humans and GPT-4 in reasoning and abstract generalization.

Comparison

Paper link: https://arxiv.org/abs/2311.09247

In terms of GPT-4's abstract reasoning ability, researchers evaluated the performance of GPT-4's text version and multimodal version through the ConceptARC benchmark test. The results indicate that GPT-4 still has a significant gap compared to humans.

How does ConceptARC test work?

ConceptARC is based on ARC, a set of 1000 manually created analogy puzzles (tasks), each containing a small part (usually 2-4) of demonstrations of transformations on a grid, and a "test input" grid.

The challenger's task is to induce the underlying abstract rule from the demonstration and apply that rule to the test input to generate a transformed grid.

As shown in the figure below, by observing the rules of the demonstration, the challenger needs to generate a new grid.

ConceptARC

The purpose of ARC is to emphasize capturing the core of abstract reasoning: inducing general rules or patterns from a small number of examples and being able to flexibly apply them to new, previously unseen situations; and to weaken language or learned symbolic knowledge to avoid reliance on "approximate retrieval" and pattern matching based on previous training data, which may be the reason for superficial success in language-based reasoning tasks.

Building on this, ConceptARC has been improved to include 480 tasks organized into systematic variations of specific core spaces and semantic concepts, such as Top and Bottom, Inside and Outside, Center, and Same and Different. Each task instantiates the concept in different ways and has different levels of abstraction.

With these changes, the concepts become more abstract, making them easier for humans to understand, and the results can better illustrate the comparison of GPT-4 and humans in abstract reasoning abilities.

Test results show that GPT-4 still lags behind humans significantly

Researchers tested both the pure text version of GPT-4 and the multimodal version of GPT-4.

For the pure text version of GPT-4, researchers evaluated its performance using more expressive prompts, including explanations and examples of solved tasks. If GPT-4 answered incorrectly, it was asked to provide a different answer, with up to three attempts.

However, under different temperature settings (temperature is an adjustable parameter used to adjust the diversity and uncertainty of the generated text. The higher the temperature, the more random and diverse the generated text, which may contain more misspellings and uncertainties), for the complete set of 480 tasks, GPT-4's accuracy performance was far inferior to that of humans, as shown in the figure below.

Text Version

In the multimodal experiment, researchers evaluated GPT-4V on the visual version of the simplest ConceptARC tasks (only 48 tasks), providing it with similar prompts as in the first experiment, but using images instead of text to represent the tasks.

The results, as shown in the figure below, indicate that the performance of providing extremely simple tasks as images to multimodal GPT-4 is even significantly lower than in the text-only case.

Multimodal Version

It is not difficult to conclude that GPT-4, possibly the most powerful general LLM currently, still cannot robustly form abstractions and reason about fundamental core concepts that appear in contexts not seen in its training data.

Analysis by netizens

A well-known netizen made a total of 5 comments on GPT-4's performance on ConceptARC. One of the main reasons explained:

The benchmark testing of large language models based on Transformers made a serious mistake. The testing usually guides the model to produce answers by providing brief descriptions, but in reality, these models are not designed solely to generate the next most likely token.

If the model is not properly guided with propositional logic to guide and lock in relevant concepts, it may fall into patterns of re-generating training data or providing the closest answer related to concepts that are not fully developed or correctly anchored in logic.

Comments

In other words, if the problem-solving approach of large models is as shown in the above image, the actual problem that needs to be solved may be as shown in the following images.

Problem Solving

Problem Solving

The researchers said that the next step to improve the abstract reasoning abilities of GPT-4 and GPT-4V may be to try other prompts or task representation methods.

It can only be said that the road to achieving human-level capabilities for large models is still a long and arduous one.

Reference: https://arxiv.org/abs/2311.09247

免责声明:本文章仅代表作者个人观点,不代表本平台的立场和观点。本文章仅供信息分享,不构成对任何人的任何投资建议。用户与作者之间的任何争议,与本平台无关。如网页中刊载的文章或图片涉及侵权,请提供相关的权利证明和身份证明发送邮件到support@aicoin.com,本平台相关工作人员将会进行核查。

返20%!Boost新规,参与平分+交易量多赚
广告
|
|
APP
Windows
Mac
Share To

X

Telegram

Facebook

Reddit

CopyLink

|
|
APP
Windows
Mac
Share To

X

Telegram

Facebook

Reddit

CopyLink

Selected Articles by 巴比特

1 year ago
Baidu AI, needs to make money through Killer App.
1 year ago
Global AICoin Music Concert, the first time hearing the voice of China
2 years ago
These five women are changing the AI industry
View More

Table of Contents

|
|
APP
Windows
Mac
Share To

X

Telegram

Facebook

Reddit

CopyLink

Related Articles

avatar
avatar深潮TechFlow
6 minutes ago
April 2 Market Overview: Trump's speech on "withdrawing from Iran within 2-3 weeks" ignites the start of Q2, the world awaits that statement at 9 PM tonight.
avatar
avatarBBX
2 hours ago
The opening password of the second quarter - from the reversal of Web2's stance to the dimensionality reduction strike of computing power.
avatar
avatar链捕手
10 hours ago
Dialogue with BlackRock's Head of Digital Assets: How Do Tokenized Stocks Work?
avatar
avatarOdaily星球日报
12 hours ago
Millions of traditional traders are flocking to cryptocurrency exchanges.
avatar
avatar律动BlockBeats
13 hours ago
How to regain your buried creativity in the AI era.
APP
Windows
Mac

X

Telegram

Facebook

Reddit

CopyLink