I found that there are many evaluations of AI programming, but very few evaluations of AI text summarization and writing.

Recently, a work requirement came up in this area, so I compared the text capabilities of several major models. Here is a summary of my findings.

For practical use, I divided them into two main categories.

The first category includes large models like Gemini 3 Pro, Sonnet 4.5, and GPT 5.1.

They perform better than other large models, and Gemini 3 Pro is the best of the three: its reports offer the best reading experience, flowing naturally and staying highly engaging.

However, the common drawback of all three is that they are expensive. They mainly suit two scenarios:

1) Infrequent writing

2) Usage on web pages or clients

They are poorly suited, however, to high-volume API calls.

The second category: several Chinese domestic large models.

In many scenarios, there is a need to use APIs for a large amount of text summarization and analysis, such as collecting a large number of articles for AI to produce summary reports, writing long novels, or batch writing articles. In such cases, using models from the first category is clearly too costly. At this point, several domestic large models are the best choice: they can save a lot of costs while still providing decent results.
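As a rough illustration of what this kind of batch workflow looks like, here is a minimal Python sketch using an OpenAI-compatible chat completions API. The base URL, model name, environment variable, and file layout are placeholders I chose for the example, not recommendations from the comparison; most of the models discussed below expose a similar interface.

```python
# Minimal sketch: batch-summarize a folder of articles via an
# OpenAI-compatible API. The base_url, model name, and API key
# variable are placeholders -- swap in whichever provider you use.
import os
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],               # assumed env variable
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
)

def summarize(text: str) -> str:
    """Ask the model for a short analytical summary of one article."""
    response = client.chat.completions.create(
        model="example-model",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize the article and note its key arguments."},
            {"role": "user", "content": text},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Summarize every .txt article in ./articles and write the results
    # to ./summaries, one output file per input file.
    Path("summaries").mkdir(exist_ok=True)
    for article in sorted(Path("articles").glob("*.txt")):
        summary = summarize(article.read_text(encoding="utf-8"))
        Path("summaries", article.name).write_text(summary, encoding="utf-8")
```

The cost comparison in this post matters precisely because a loop like this may run over hundreds or thousands of documents, so per-token pricing dominates the choice of model.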

Kimi K2 Thinking

Kimi K2 performs the best in text writing among the domestic models. It reads like an excellent news reporter: its writing fluency is second only to Gemini 3, and it excels at balanced reporting. Its API cost is also the highest among domestic models, but still much cheaper than the foreign large models.

DeepSeek V3.2

DeepSeek V3.2 is the newly released official version. Its output has clear logic and is dense with information; it is not as readable as Gemini, but acceptable. Of all the models here, it digs the deepest into large bodies of text.

It is the model I currently use: it performs well at summarizing and analyzing large amounts of text, and the API is cheap.

Qwen3-235b-a22b-thinking

Alibaba's Qwen3 produces particularly professional articles, but the downside is that they are very long and hard to read. It is also an open-source model, and its analysis time is the longest, more than three times that of the others.

Minimax M2

Minimax M2 is a middle-of-the-road model: its advantages are not prominent, and its disadvantages are not glaring either. It has decent readability and professionalism, and it shares the shortcomings of the others, though to a lesser extent. I don't see a compelling reason to use it.

GLM 4.6

GLM 4.6 visibly tries hard to follow my prompts, but it still falls a bit short. For example, when asked for an analysis based on the provided materials, it tends to quote large chunks of the material and then tack on a forced summary. As its official introduction notes, GLM 4.6 is a model designed for code and agents.

Doubao 251015

Doubao is the worst performer here: it drops details, context, and emotion, and has the weakest analytical depth.

I am quite puzzled by this. When Doubao 1.6 was released, I tested it and its performance was quite good, and it has even had an upgrade since then. So why has its performance deteriorated? It seems the perceived decline in intelligence of large models is not limited to Claude; other large models may experience it as well.

Note: all of the above tests used the models' thinking mode.
