Source: Stone Learning Notes
Editor's Note:
At the end of 2024, Chinese large-model companies are launching new products in waves, showing that AI is still thriving. In Silicon Valley, after enthusiastic discussions, AI practitioners have distilled some consensus and many "non-consensus" views on the AI industry in 2025. For example, Silicon Valley investors believe AI companies are a "new species," and AI applications will be a hot investment theme in 2025.
From November 11 to 15, Jin Qiu Fund held the "Scale with AI" event in Silicon Valley, inviting experts from companies such as A16Z, Pear VC, Soma Capital, Leonis Capital, Old Friendship Capital, OpenAI, xAI, Anthropic, Google, Meta, Microsoft, Apple, Tesla, Nvidia, Scale AI, Perplexity, Character.ai, Midjourney, Augment, Replit, Codeium, Limitless, Luma, and Runway to exchange ideas.
After the discussions, we compiled the experts' viewpoints into these 60 insights.
01 Model Insights
1. The pre-training phase of LLM is nearing its bottleneck
But there are many opportunities in post-training
In the pre-training phase, scaling is slowing down, and there is still some time before saturation.
The reason for the slowdown, for single-modality models: architecture > computing power > data.
However, for multimodal models: data = computing power > architecture.
For multimodal models, the right combination of modalities must be chosen. Pre-training can be considered finished under the current architecture, but new architectures can still be developed.
The current limited investment in pre-training is mainly due to resource constraints, and the marginal benefits of post-training will be higher.
2. The relationship between pre-training and RL
Pre-training does not care much about data quality.
Post-training has higher requirements for data quality, but due to computing power limitations, high-quality data is used only in the final stages of training.
Pre-training is imitation, only capable of mimicking.
RL is creative and can achieve different outcomes.
Pre-training comes first, followed by RL in post-training; the model must have foundational capabilities for RL to be effective.
RL does not change the model's intelligence; it is more about the mode of thinking. For example, using RL in C.AI to optimize engagement has shown great results.
3. Large model optimization will affect product capabilities
This mainly occurs in the post-training phase and helps ensure safety, for example addressing C.AI's teen-suicide controversy by serving different age groups with different models.
Secondly, there is the multi-agent framework: the model considers how to solve the problem and then assigns tasks to different agents; after each agent completes its task, the results are combined and optimized.
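A minimal sketch of this kind of multi-agent orchestration, assuming a hypothetical `call_model` function standing in for whatever LLM backend is actually used:

```python
# Minimal multi-agent orchestration sketch. `call_model` is a placeholder, not a real API.

def call_model(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string so the sketch runs."""
    return f"[{role}] response to: {prompt[:60]}"

def plan(task: str) -> list[str]:
    # A planner model would decompose the task into sub-tasks for worker agents.
    call_model("planner", f"Break this task into sub-tasks: {task}")
    # In practice the plan is parsed from model output; here we fake two sub-tasks.
    return [f"sub-task 1 of: {task}", f"sub-task 2 of: {task}"]

def run_agents(subtasks: list[str]) -> list[str]:
    # Each sub-task is handed to a worker agent; these calls could run in parallel.
    return [call_model(f"worker-{i}", s) for i, s in enumerate(subtasks)]

def optimize(task: str, results: list[str]) -> str:
    # A final pass merges and refines the workers' outputs.
    joined = "\n".join(results)
    return call_model("optimizer", f"Task: {task}\nCombine and improve:\n{joined}")

if __name__ == "__main__":
    task = "Summarize user feedback and draft a reply"
    print(optimize(task, run_agents(plan(task))))
```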
4. Some non-consensus points may achieve consensus next year
Is it necessary to adopt large models for everything? There have been many good small models, and it may not be necessary to create another model.
Current large models may become small models in a year.
Model architectures may change. As the scaling law reaches its limit, future discussions will focus on decoupling knowledge from the model, which may happen quickly.
5. In the LLM field, as the scaling law reaches its limit, the gap between closed-source and open-source is narrowing.
6. Video generation is still at the stage of GPT-1 and 2
Currently, the level of video generation is close to SD 1.4, and in the future there will be open-source versions with commercially viable performance.
The current challenge is the dataset; image generation relies on the LAION dataset, which everyone can clean. However, due to copyright issues, there is no comparably large public dataset for video. Each company's methods of obtaining, processing, and cleaning data lead to significant differences in model capabilities, so the difficulty of producing open-source versions varies.
The next challenging point in the DiT solution is how to enhance adherence to physical laws rather than just statistical probabilities.
The efficiency of video generation is a bottleneck. Currently, it takes a long time to run on high-end graphics cards, which is a barrier to commercialization and a topic of discussion in academia.
Similar to LLMs, although the speed of model iteration is slowing down, applications are not. From a product perspective, focusing solely on text-to-video generation is not a good direction; related products involving editing and creativity will emerge continuously, and there will be no bottleneck in the short term.
7. Choosing different tech stacks for different scenarios will be a trend
When Sora was released, many believed it would converge to DiT, but in reality, there are many technical paths being explored, such as those based on GANs and real-time generation using AutoRegressive methods, like the recently popular project Oasis, as well as combining CG and CV for better consistency and control. Each company has different choices, and selecting different tech stacks for different scenarios will be a trend in the future.
8. The scaling law for video does not reach the level of LLM
The scaling law for video exists within a certain range but does not reach the level of LLM. The largest model parameters currently are about 30 billion, which has been proven effective; however, there are no successful cases at the 300 billion level.
Current technical solutions are converging, and there are not many significant differences in methods. The main differences lie in the data, including data ratios.
It will take 1-2 years for the DiT technical route to reach saturation, and there are many areas for optimization within it. More efficient model architectures are crucial. For example, in LLMs, everyone initially focused on making models larger, but it was later found that adding MoE and optimizing data distribution can achieve the same results without such large models.
More research investment is needed; simply scaling up DiT is very inefficient. If we consider video data from YouTube and TikTok, the quantity is enormous, and it is impossible to use all of it for model training.
Currently, there is still relatively little open-source work, especially in data preparation, where the cleaning methods vary significantly among companies, and the data preparation process greatly impacts the final results, leaving many points for optimization.
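As a rough illustration of the kind of data-cleaning choices that differentiate teams, here is a minimal clip-filtering sketch; the thresholds and the `aesthetic_score` function are hypothetical placeholders, not any company's actual pipeline.

```python
# Hypothetical video-clip filtering sketch; thresholds and the aesthetic scorer
# are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    width: int
    height: int
    duration_s: float
    caption: str

def aesthetic_score(clip: Clip) -> float:
    """Placeholder for a learned aesthetic/quality model; returns a dummy score."""
    return 0.7

def keep(clip: Clip,
         min_res: int = 480,
         min_dur: float = 2.0,
         max_dur: float = 20.0,
         min_aesthetic: float = 0.5) -> bool:
    # Resolution, duration, caption presence, and an aesthetic threshold are the kind
    # of knobs whose exact values differ widely between teams.
    if min(clip.width, clip.height) < min_res:
        return False
    if not (min_dur <= clip.duration_s <= max_dur):
        return False
    if not clip.caption.strip():
        return False
    return aesthetic_score(clip) >= min_aesthetic

clips = [Clip("a.mp4", 1280, 720, 5.0, "a dog running on grass"),
         Clip("b.mp4", 320, 240, 1.0, "")]
print([c.path for c in clips if keep(c)])  # only the first clip survives
```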
9. Methods to improve the speed of video generation
The simplest method is to generate at lower resolution and lower frame rate. The most common approach is step distillation: diffusion inference runs over multiple steps, and generation currently requires at least 2 steps; if it can be distilled to 1-step inference, it will be much faster. Recently there was a paper on generating video in one step; although it is currently only a proof of concept, it is worth attention.
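A minimal sketch of the step-distillation idea: a one-step student learns to land where a multi-step teacher sampler ends up. The networks and the sampler below are toy stand-ins, not a working diffusion system.

```python
# Step-distillation sketch: a one-step student matches the endpoint of a multi-step
# teacher sampler. Models and the sampler are toy stand-ins.
import torch
import torch.nn as nn

dim = 64
teacher = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
student = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

@torch.no_grad()
def teacher_sample(x_noisy: torch.Tensor, steps: int = 16) -> torch.Tensor:
    # Stand-in for an iterative diffusion sampler: repeatedly refine the sample.
    x = x_noisy
    for _ in range(steps):
        x = x + 0.1 * (teacher(x) - x)   # small denoising update per step
    return x

for it in range(100):
    x_noisy = torch.randn(32, dim)
    target = teacher_sample(x_noisy)      # expensive multi-step result
    pred = student(x_noisy)               # single forward pass
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```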
10. Priorities in video model iteration
In fact, clarity, consistency, and controllability have not yet reached saturation; we have not yet reached a point where improving one aspect sacrifices another. This is currently a stage of simultaneous improvement in the pre-training phase.
11. Technical solutions for speeding up long video generation
We can see where the limits of DiT capabilities lie; the larger the model and the better the data, the higher the clarity, longer the duration, and higher the success rate of generation.
There is currently no answer to how large the DiT model can scale. If a bottleneck appears at a certain size, new model architectures may emerge. From an algorithmic perspective, DiT needs to develop a new inference algorithm to support speed. The challenge is how to incorporate these during training.
Currently, the model's understanding of physical laws is statistical; it can simulate phenomena seen in the dataset to a certain extent but does not truly understand physics. There are some academic discussions, such as incorporating physical rules into video generation.
12. The fusion of video models with other modalities
There will be two kinds of unification: one is the unification of modalities, and the other is the unification of generation and understanding. For the former, representations need to be unified first. For the latter, text and speech can be unified, but unifying VLM and diffusion is currently believed to yield 1+1<2 results. This work is quite challenging, not necessarily because the models are not smart enough, but because the two tasks are inherently contradictory, and striking a delicate balance is a complex problem.
The simplest idea is to tokenize everything and input it into a transformer model for unified input and output. However, my personal experience suggests that working on a single specific modality yields better results than merging everything together.
In industrial practice, people do not typically work on them together. The latest MIT paper potentially indicates that unifying multiple modalities could yield better results.
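The "tokenize everything and feed it into one transformer" idea described above can be sketched roughly as follows; the per-modality tokenizers are placeholders (a real system would use a text BPE, an image VQ model, and a video VQ model), and the vocabulary split is arbitrary.

```python
# Rough sketch of a unified token sequence over text, image, and video tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, VIDEO_VOCAB = 1000, 1000, 1000
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + VIDEO_VOCAB   # shared vocabulary, disjoint ranges

def tokenize_text(s: str) -> list[int]:
    return [hash(w) % TEXT_VOCAB for w in s.split()]          # placeholder text tokenizer

def tokenize_image(img) -> list[int]:
    return [TEXT_VOCAB + i for i in range(16)]                # fake 16 image tokens

def tokenize_video(vid) -> list[int]:
    return [TEXT_VOCAB + IMAGE_VOCAB + i for i in range(64)]  # fake 64 video tokens

tokens = tokenize_text("a cat jumps") + tokenize_image(None) + tokenize_video(None)
x = torch.tensor(tokens).unsqueeze(0)                         # (1, seq_len)

embed = nn.Embedding(VOCAB, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(256, VOCAB)            # next-token prediction over all modalities

logits = head(encoder(embed(x)))
print(logits.shape)                     # (1, seq_len, VOCAB)
```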
13. There is still a lot of training data for the video modality
There is actually a lot of video data; how to efficiently select high-quality data is crucial.
The quantity depends on the understanding of copyright. However, computing power is also a bottleneck; even with so much data, there may not be enough computing power, especially for high-definition data. Sometimes it is necessary to reverse-engineer the required high-quality dataset based on the available computing power.
High-quality data has always been scarce, but even when data is available, a significant issue is that people do not know what kind of image descriptions are correct and what keywords should be included in the descriptions.
14. The future of long video generation lies in storytelling
Current video generation is material-oriented, producing clips. In the future it will be story-oriented, with video generated for a purpose. Long videos are defined not by their duration but by their storytelling, and this will take the form of tasks.
For video editing, the speed requirement is higher. A current bottleneck is that generation is too slow: it now takes minutes to generate a few seconds of video, so even good algorithms are unusable. (Editing here does not mean cutting but editing the image itself, such as changing characters or actions; the technology exists, but it is too slow to be usable.)
15. The aesthetic enhancement of video generation mainly relies on post-training
This mainly relies on the post-training phase, such as using a large amount of film data. The realism is based on the foundational model's capabilities.
16. The two challenges of video understanding are long context and latency.
17. The visual modality may not be the best pathway to AGI
The text modality may be a better one: text can also be converted into images and then into videos.
Text is a shortcut to intelligence; the efficiency gap between video and text is hundreds of times.
18. There have been significant advancements in end-to-end speech models
There is no need for manual labeling and judgment of data, allowing for fine emotional understanding and output.
19. Multimodal models are still in the early stages
Multimodal models are still in the early stages, and predicting the next 5 seconds of a video after the first second is already quite challenging, especially when adding text.
In theory, training video and text together is ideal, but it is very difficult to implement overall.
Currently, multimodal models cannot enhance intelligence, but perhaps in the future they can. Compression algorithms can learn the relationships within datasets, requiring only pure text and pure image data, which can then facilitate mutual understanding between video and text.
20. The technical paths for multimodal models have not fully converged
Diffusion models offer good quality, and the current model structures are still being continuously modified;
The logic of autoregressive approaches is also sound.
21. There is currently no consensus on the alignment of different modalities
It has not yet been determined whether video consists of discrete or continuous tokens.
There are not many high-quality alignments available yet.
It is still unclear whether this is a scientific or engineering problem.
22. Generating data with large models and then training smaller models is feasible, but the reverse is more challenging
The main difference between synthetic data and real data is the quality issue.
Various types of data can also be pieced together for synthesis, yielding good results. This can be used in the pre-training phase because the requirements for data quality are not high.
23. For LLMs, the era of pre-training is essentially over
Now everyone is talking about post-training, which has high requirements for data quality.
24. Team building for post-training
In theory, a team size of 5 is sufficient (not necessarily full-time).
One person builds the pipeline (infrastructure).
One person manages data (data effectiveness).
One person is responsible for the model itself and SFT (a scientist who reads papers).
One person is responsible for product judgments regarding model orchestration and collecting user data.
In the AI era, post-training gives an advantage on products and UI: AI can compensate for gaps in understanding products and UI, enabling richer development, as long as the team is not misled by AI.
25. Building a data pipeline
Data loop: data enters the pipeline, generating new data that flows back.
Efficient iteration: data labeling combined with the pipeline and AB testing, structured data warehouse.
Data input: efficient labeling and rich user feedback to build a moat.
Initial phase: SFT (constantly looping back to this phase).
Subsequent phase: RL (branching into heavier RLHF); scoring is used to guide RL; DPO methods are prone to collapse; SFT can be viewed as a simplified version of RL.
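For the preference-optimization step mentioned above, here is a minimal sketch of the standard DPO objective; the inputs are sequence log-probabilities under the trained policy and a frozen reference model, and the toy tensors at the end are purely illustrative.

```python
# Minimal DPO loss sketch: maximize the margin between chosen and rejected responses,
# measured relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit reward = beta * (log pi(y|x) - log pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy tensors standing in for batched sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```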
02 Embodiment Insights
1. Embodied robots have not yet reached a "critical moment" similar to ChatGPT
A core reason is that robots need to complete tasks in the physical world, not just generate text through virtual language.
Breakthroughs in robotic intelligence require solving the core issue of "embodied intelligence," which is how to complete tasks in dynamic and complex physical environments.
The "critical moment" for robots must meet the following conditions: Generality: the ability to adapt to different tasks and environments. Reliability: a high success rate in the real world. Scalability: the ability to continuously iterate and optimize through data and tasks.
2. The most core issue solved by this generation of machine learning is generalization
Generalization is the ability of AI systems to learn patterns from training data and apply them to unseen data.
There are two modes of generalization:
Interpolation: the test data falls within the distribution range of the training data.
Extrapolation: the test data falls outside that range. The difficulty of extrapolation lies in whether the training data can adequately cover the test data, given the distribution range of the test data and the cost of covering it. Here, "coverage" is the key concept: whether the training data can effectively encompass the diversity of the test data.
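A toy sketch of the coverage idea: check per feature whether test points fall inside the range spanned by the training data (inside reads as interpolation, outside as extrapolation). A real coverage check would use density estimates or nearest-neighbor distances rather than axis-aligned ranges.

```python
# Toy coverage check for interpolation vs. extrapolation.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 4))             # training feature vectors
lo, hi = train.min(axis=0), train.max(axis=0)  # per-feature coverage range

def coverage(test: np.ndarray) -> np.ndarray:
    inside = ((test >= lo) & (test <= hi)).all(axis=1)
    return np.where(inside, "interpolation", "extrapolation")

test = np.vstack([rng.normal(size=(3, 4)),         # likely in-range
                  rng.normal(size=(2, 4)) + 10])   # shifted far out of range
print(coverage(test))
```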
3. Visual tasks (such as facial recognition and object detection) mostly belong to interpolation problems
The work of machine vision mainly mimics biological perception capabilities, understanding and perceiving the environment.
Machine vision models have become very mature in certain tasks (like cat and dog recognition) due to the abundance of relevant data. However, for more complex or dynamic tasks, the diversity and coverage of data remain bottlenecks.
Visual tasks (such as facial recognition and object detection) mostly belong to interpolation problems, where models cover most test scenarios through training data.
However, in extrapolation problems (such as new angles or lighting conditions), the model's capabilities are still limited.
4. The difficulty of generalization for this generation of robots: most situations belong to extrapolation scenarios
Environmental complexity: the diversity and dynamic changes of home and industrial environments.
Physical interaction issues: such as the weight of doors, angle differences, wear and tear, and other physical characteristics.
The uncertainty of human-robot interaction: the unpredictability of human behavior places higher demands on robots.
5. Fully human-like generalization capabilities in robots may not be achievable in the current or future generations
It is extremely challenging for robots to cope with complexity and diversity in the real world. The dynamic changes in real environments (such as pets, children, and furniture arrangements in homes) make it difficult for robots to achieve complete generalization.
Humans themselves are not omnipotent individuals; they complete complex tasks in society through division of labor and cooperation. Similarly, robots do not necessarily pursue "human-level" generalization capabilities but may focus more on specific tasks, even achieving "superhuman" performance (such as efficiency and precision in industrial production).
Even seemingly simple tasks (like vacuuming or cooking) have high generalization requirements due to environmental complexity and dynamics. For example, a vacuuming robot must handle the different layouts, obstacles, and floor materials of thousands of households, all of which increase the difficulty of generalization.
Thus, should robots focus on specific tasks (Pick Your Task)? For instance, robots may need to concentrate on specific tasks rather than pursuing comprehensive human capabilities.
6. Stanford Laboratory's Choice: Focus on Home Scenarios
Stanford's robotics lab primarily focuses on tasks in home scenarios, especially household robots related to an aging society. For example, robots can assist with daily tasks such as folding blankets, picking up items, and opening bottle caps.
Reasons for focus: countries such as the United States, Western Europe, and China face serious aging issues. The main challenges brought by aging include cognitive decline (Alzheimer's disease, i.e. dementia, is widespread, affecting about half the population over 95) and motor function decline (diseases such as Parkinson's and ALS make basic daily operations difficult for the elderly).
7. Defining generalization conditions based on specific scenarios
Clearly define the environments and scenarios that robots need to handle, such as homes, restaurants, or nursing homes.
Once the scenarios are defined, it becomes easier to delineate the task scope and ensure coverage of potential changes in item states and environmental dynamics within these scenarios.
The importance of scenario debugging: debugging robot products involves not only solving technical issues but also covering all possible situations. For example, in a nursing home, robots need to handle various complex situations (such as elderly individuals moving slowly, items being placed inconsistently, etc.). Collaborating with domain experts (such as nursing home managers and caregivers) can help better define task requirements and collect relevant data.
The real-world environment is not as completely controllable as an industrial assembly line, but it can be made "known" through debugging. For instance, defining common types of objects, placement locations, and dynamic changes in home environments can cover key aspects in both simulation and real environments.
8. The contradiction between generalization and specialization
The conflict between general models and specific task models: models need to possess strong generalization capabilities to adapt to diverse tasks and environments; however, this often requires substantial data and computational resources.
Specific task models are easier to commercialize, but their capabilities are limited and difficult to extend to other fields.
Future robotic intelligence needs to find a balance between generality and specificity. For example, through modular design, making general models the foundation and then achieving rapid adaptation through fine-tuning for specific tasks.
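A minimal sketch of the modular idea above: freeze a shared "general" backbone and fine-tune only a small task-specific head per task. The networks, observation sizes, and demo data here are toy stand-ins, not a real robot policy.

```python
# Toy sketch of a frozen general backbone with per-task heads fine-tuned separately.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False              # keep the general model fixed

task_heads = {
    "fold_blanket": nn.Linear(256, 7),   # e.g. 7-DoF arm action
    "open_bottle":  nn.Linear(256, 7),
}

def finetune(task: str, steps: int = 50):
    head = task_heads[task]
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(steps):
        obs = torch.randn(16, 128)       # placeholder observations
        target = torch.randn(16, 7)      # placeholder demonstration actions
        action = head(backbone(obs))
        loss = nn.functional.mse_loss(action, target)
        opt.zero_grad(); loss.backward(); opt.step()

finetune("fold_blanket")
```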
9. The potential of embodied multimodal models
Integration of multimodal data: multimodal models can simultaneously process various inputs such as vision, touch, and language, enhancing the robot's understanding and decision-making capabilities in complex scenarios. For instance, in grasping tasks, visual data can help the robot identify the position and shape of objects, while tactile data can provide additional feedback to ensure the stability of the grasp.
The challenge lies in how to achieve efficient fusion of multimodal data within the model. How to enhance the robot's adaptability in dynamic environments through multimodal data.
The importance of tactile data: tactile data can provide additional information to help robots complete tasks in complex environments. For example, when grasping flexible objects, tactile data can help the robot perceive the deformation and stress conditions of the object.
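A rough sketch of late fusion of a visual embedding with a tactile time series for a grasp policy; both encoders are toy stand-ins for real backbones, and the action dimension is an assumption.

```python
# Toy late-fusion sketch: a linear stand-in for a vision backbone plus a GRU over the
# tactile time series, concatenated and mapped to a grasp action.
import torch
import torch.nn as nn

class GraspPolicy(nn.Module):
    def __init__(self, vis_dim=512, tactile_dim=16, hidden=128, action_dim=7):
        super().__init__()
        self.vis_enc = nn.Linear(vis_dim, hidden)         # stand-in for a vision backbone
        self.tactile_enc = nn.GRU(tactile_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, action_dim))

    def forward(self, vis_feat, tactile_seq):
        v = self.vis_enc(vis_feat)                        # (B, hidden)
        _, h = self.tactile_enc(tactile_seq)              # h: (1, B, hidden)
        return self.head(torch.cat([v, h[-1]], dim=-1))   # (B, action_dim)

policy = GraspPolicy()
action = policy(torch.randn(4, 512), torch.randn(4, 50, 16))  # 50 tactile timesteps
print(action.shape)  # torch.Size([4, 7])
```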
10. The data closed loop in robotics is difficult to achieve
The robotics field currently lacks iconic datasets like ImageNet, making it challenging to form unified evaluation standards for research.
The cost of data collection is high, especially when it involves real-world interaction data. For example, collecting multimodal data such as tactile, visual, and dynamic data requires complex hardware and environmental support.
Simulators are considered an important tool for solving the data closed loop problem, but the "Sim-to-Real Gap" between simulation and the real world remains significant.
11. Challenges of the Sim-to-Real Gap
Simulators have gaps in visual rendering, physical modeling (such as friction and material properties), etc., compared to the real world. Robots may perform well in simulated environments but fail in real environments. This gap limits the direct application of simulated data.
12. Advantages and challenges of real data
Real data can more accurately reflect the complexity of the physical world, but its collection is costly. Data labeling is a bottleneck, especially when it involves multimodal data (such as tactile, visual, and dynamic data).
Industrial environments are more standardized, with clearer task objectives, making them suitable for the early deployment of robotic technology. For example, in the construction of solar power plants, robots can perform repetitive tasks such as piling, panel installation, and screw tightening. Industrial robots can gradually enhance model capabilities through data collection from specific tasks, forming a data closed loop.
13. In robotic operations, tactile and force data can provide critical feedback information
In robotic operations, tactile and force data can provide critical feedback information, especially in continuous tasks (such as grasping and placing).
Forms of tactile data: tactile data is typically time-series data that reflects the mechanical changes when the robot contacts an object.
Recent research work has integrated tactile data into large models.
14. Advantages of simulated data
Simulators can quickly generate large-scale data, suitable for early model training and validation. The cost of generating simulated data is low, allowing for coverage of various scenarios and tasks in a short time. In the field of industrial robots, simulators have been widely used for training tasks such as grasping and handling.
Limitations of simulated data: the physical modeling accuracy of simulators is limited, for example, they may not accurately simulate material properties, friction, flexibility, etc. The visual rendering quality of simulated environments is often insufficient, which may lead to poor performance of models in real environments.
15. Data simulation: Stanford has launched the BEHAVIOR simulation platform
BEHAVIOR is a simulation platform centered on home scenarios, supporting 1,000 tasks and 50 different environments, covering a diverse range from ordinary apartments to five-star hotels.
The platform includes over 10,000 objects and reproduces the physical and semantic properties of objects (such as doors that can be opened, clothes that can be folded, and glass cups that can be broken) through high-precision 3D models and interactive annotations.
To ensure the authenticity of the simulation environment, the team has invested significant manpower (such as PhD students labeling data) to meticulously annotate the physical properties (mass, friction, texture, etc.) and interaction properties (such as whether an object is detachable or deformable). For example, annotating the flexible characteristics of clothing to support the task of folding clothes, or annotating the wet effect after watering plants.
The BEHAVIOR project not only provides fixed simulation environments but also allows users to upload their own scenes and objects, which can be annotated and configured through an annotation pipeline.
Currently, simulation can account for 80% of pre-training, with the remaining 20% needing to be supplemented through data collection and debugging in real environments.
16. Application of Hybrid Models
Initial training is conducted using simulated data, followed by fine-tuning and optimization with real data. Attempts have been made to scan real scenes into the simulator, allowing robots to interact and learn in the simulated environment, thereby narrowing the Sim-to-Real Gap.
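A minimal two-stage sketch of this hybrid recipe: pre-train on plentiful simulated rollouts, then fine-tune on a small amount of real data. The data generators below are toy placeholders for a simulator and a real robot.

```python
# Two-stage sim-to-real sketch: large simulated dataset for pre-training,
# small real dataset for fine-tuning.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 7))

def batch(n, noise):
    obs = torch.randn(n, 64)
    act = obs[:, :7] + noise * torch.randn(n, 7)   # fake "expert" actions
    return obs, act

def train(obs, act, lr, steps):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(policy(obs), act)
        opt.zero_grad(); loss.backward(); opt.step()

sim_obs, sim_act = batch(10000, noise=0.05)    # cheap, plentiful simulated data
real_obs, real_act = batch(200, noise=0.20)    # scarce, noisier real-world data

train(sim_obs, sim_act, lr=1e-3, steps=200)    # stage 1: pre-train in simulation
train(real_obs, real_act, lr=1e-4, steps=50)   # stage 2: fine-tune on real data
```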
17. Challenges of Data Sharing in Robotics
Data is a core asset for companies, and enterprises are reluctant to share data easily. There is a lack of a unified data-sharing mechanism and incentive system.
Possible solutions:
Data exchange: Companies contributing data for specific tasks in exchange for capabilities of general models.
Data intermediaries: Establishing third-party platforms to collect, integrate, and distribute data while protecting privacy.
Model sharing: Reducing reliance on raw data through APIs or model fine-tuning.
Some companies are already experimenting with these three approaches.
18. Choice Between Dexterous Hands and Grippers
Advantages of dexterous hands: high degrees of freedom, capable of completing more complex tasks. Dexterous hands can compensate for inaccuracies in model predictions through multi-degree-of-freedom adjustments.
Advantages of grippers: Low cost, suitable for specific tasks in industrial scenarios. They perform well in material handling tasks on assembly lines but lack generalization capabilities.
19. Co-evolution of Hardware and Software in Embodied Robots
The hardware platform and software models need to iterate synchronously. For example, improvements in sensor accuracy can provide higher-quality data for the model. Different companies have different strategies for hardware-software collaboration.
03 AI Application Investment Insights
1. Silicon Valley VCs Believe 2025 Will Be a Big Year for AI Application Investment
Silicon Valley VCs lean towards 2025 being a significant opportunity for application investment. In the U.S. there is basically no single killer app used by everyone; people are accustomed to using different apps with different functions in different scenarios, and the key is to make the user experience as seamless as possible.
Last year, there was little focus on application companies; everyone was looking at LLMs and foundation models.
When investing in applications, VCs will ask, what's your moat?
One of Silicon Valley investors' standards for AI products: it is best to focus on one direction that is difficult for competitors to replicate, requiring some network effect, hard-to-replicate insights, hard-to-replicate technological edges, or monopolistic capital that others cannot access. Otherwise, it is hard to call it entrepreneurship; it feels more like just a business.
2. Silicon Valley VCs View AI Product Companies as a New Species
AI companies are seen as a new species, quite different from previous SaaS models. Once they find product-market fit (PMF), their revenue can boom very quickly, with real value creation occurring before the hype, especially in the seed stage.
3. A Niche View in VC is to Consider Investing in Chinese Entrepreneurs Under Certain Conditions
The reason is that the new generation of Chinese founders is vibrant and capable of creating good business models.
However, the premise is that they are based in the U.S.
Chinese entrepreneurs are making many new attempts, but international investors are often fearful and lack understanding. The niche view sees this as a value gap.
4. Silicon Valley VCs Are Finding Ways to Establish Their Investment Strategies
Soma Capital: Connect the best people, have the best people introduce their friends, and create lifelong friendships. Inspire, support, and connect these individuals in the process; establish a panoramic map, including market segmentation and project mapping, aiming for data-driven investments. They will invest from seed to Series C, observing successful and failed samples.
Leonis Capital: A research-driven venture capital fund, primarily focused on first checks.
Old Friendship Capital: Work first, invest later. They work with founders first, conduct customer interviews, establish interview guidelines, and clarify product issues, similar to consulting work. They invest in Chinese projects and, through this collaboration, can assess whether Chinese founders have the opportunity to work with U.S. customers.
Storm Venture: They like "unlocking growth" and prefer Series A companies with PMF. These companies typically have $1-2 million in revenue, and the fund then assesses whether unlocking growth can scale them to $20 million. The core consideration for B2B SaaS is wages: they believe the significant opportunity in enterprise-level scenarios still lies in automating work.
Inference Venture: A $50 million fund that believes barriers are built on interpersonal relationships and domain knowledge.
5. Silicon Valley VCs Believe the Requirements for MVPs in the AI Era Have Increased
More money is being spent in AI product directions such as engineering, fintech, and HR.
White-collar work is expensive, around $40 per hour, and only about 25% of that time goes to actual work; in the future, the middle-management layer may be eliminated.
Companies with the highest labor costs are generally the ones most susceptible to AI disruption. By contrast, hospital operators, for example, are often not based in the U.S. and may earn less than $2 per hour, which makes it hard for AI to compete with them on cost.
There will be a shift from service as software to AI agents.
6. Five AI Predictions for 2025 from Leonis Capital, Founded by OpenAI Researchers
There will be a popular AI programming application.
Model providers will start controlling costs: entrepreneurs will need to choose models/agents to create a unique supply.
Pricing based on cost per action will emerge.
Data centers will cause power shocks, possibly leading to new architectural reconfigurations. New frameworks will emerge, and models will become smaller. Multi-agent systems will become more mainstream.
7. Standards for AI Native Startups
Compared to large companies, AI native startups often lack money and personnel, and their organizational structures differ from traditional SaaS companies. Notion and Canva have struggled when using AI, as they do not want to compromise their core functions.
AI-native companies have lower customer acquisition costs, and the ROI provided by AI products is clearer. As AI companies scale, there is no need to hire many people; a company with $50 million in revenue might only need 20 employees.
Their moat lies in model architecture and customization.
8. Large Models Focus on Pre-training, While Application Companies Emphasize Reasoning
Each industry has fixed ways and methods of viewing problems, and each has its unique cognitive architecture. The newly emerging AI agents build on LLMs by incorporating cognitive architecture.
9. How to Implement Reasoning for AI Applications in Daily Life
Reasoning for AI applications in daily life can focus on intentions.
Rewards are hard to define for everyday tasks, whereas math and coding are relatively easy to reward.
Consider topic relevance and geographical location.
Only dynamic rewards can be implemented, using similar groups for comparison.
10. Content Generated by AI May Not Be Very Authentic, Potentially Leading to a New Form of Content
For example, videos of cats walking and cooking.
04 AI Coding Insights
1. Possible Approaches for AI Coding Company Model Training
One possible approach: initially use better APIs from model companies to achieve better results, even if costs are higher. After accumulating customer usage data, continuously train small models for narrow scenarios, gradually replacing part of the API-served traffic to achieve better results at lower cost.
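One way to picture this gradual replacement is a per-scenario router that falls back to the external API unless a distilled local model has been validated for that scenario. All names below are hypothetical placeholders, not a real vendor API.

```python
# Hypothetical routing sketch: serve a request with a distilled local model when one has
# been validated for the scenario, otherwise fall back to the (costlier) external API.

LOCAL_MODELS = {"code_completion": "local-completion-v2"}   # validated scenarios only

def call_external_api(prompt: str) -> str:
    return f"[external API] {prompt[:40]}"                  # placeholder

def call_local_model(name: str, prompt: str) -> str:
    return f"[{name}] {prompt[:40]}"                        # placeholder

def serve(scenario: str, prompt: str) -> str:
    model = LOCAL_MODELS.get(scenario)
    if model is not None:
        return call_local_model(model, prompt)
    # Requests served by the API can be logged to later distill a local model.
    return call_external_api(prompt)

print(serve("code_completion", "def quicksort(arr):"))
print(serve("code_review", "review this diff ..."))
```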
2. Differences Between Copilot and Agent Modes
The main difference lies in the degree of asynchronicity: the primary distinction is how asynchronously the AI assistant executes tasks. Copilots typically require immediate interaction and feedback from users, while agents can work more independently for longer periods before seeking user input. For example, code completion and code chat tools require real-time user observation and response. In contrast, agents can perform tasks asynchronously and require less feedback, allowing them to complete more tasks.
Initially, agents were designed to work independently for longer periods (10-20 minutes) before providing results. However, user feedback indicated a preference for more control and frequent interactions. As a result, agents were adjusted to work for shorter periods (a few minutes) before requesting feedback, striking a balance between autonomy and user engagement.
Challenges in developing fully autonomous coding agents: Two main obstacles hinder the development of fully autonomous coding agents. The technology is not yet advanced enough to handle complex, long-term tasks without failure, leading to user dissatisfaction. Users are still adapting to the concept of AI assistants making significant changes across multiple files or repositories.
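A minimal sketch of the autonomy/feedback balance described above: the agent works through steps until it hits a step or time budget, then pauses for user confirmation before continuing. `plan_steps`, `execute_step`, and `ask_user` are placeholders.

```python
# Sketch of an agent loop that works autonomously up to a budget, then checkpoints with the user.
import time

def plan_steps(task):   return [f"step {i} of {task}" for i in range(1, 9)]
def execute_step(step): time.sleep(0.01); return f"done: {step}"
def ask_user(summary):  print(f"[checkpoint] {summary}"); return True   # pretend approval

def run_agent(task, max_steps_per_round=3, max_seconds_per_round=120):
    steps, results = plan_steps(task), []
    round_start, steps_this_round = time.time(), 0
    for step in steps:
        results.append(execute_step(step))
        steps_this_round += 1
        budget_hit = (steps_this_round >= max_steps_per_round or
                      time.time() - round_start > max_seconds_per_round)
        if budget_hit:
            if not ask_user(f"Completed {len(results)}/{len(steps)} steps; continue?"):
                break
            round_start, steps_this_round = time.time(), 0
    return results

run_agent("refactor the payments module")
```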
3. Core Challenges and Improvements for Coding Agents
Key areas needing further development include: 1. Event modeling 2. Memory and world modeling 3. Accurate future planning 4. Improving context utilization, especially for long contexts (the utilization rate drops significantly for contexts exceeding 10,000 tokens), enhancing reasoning capabilities for extended memory lengths (e.g., 100,000 tokens or more). Ongoing research aims to improve memory and reasoning capabilities for longer contexts.
Although world modeling seems unrelated to coding agents, it plays an important role in addressing common issues such as inaccurate planning. Overcoming world modeling challenges can enhance the ability of coding agents to formulate more effective and accurate plans.
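One common mitigation for the long-context utilization drop noted above is to rank and pack only the most relevant chunks into a fixed token budget. The word-overlap scorer below is a trivial placeholder for a real embedding-based retriever, and the budget is arbitrary.

```python
# Sketch of relevance-ranked context packing under a token budget.

def score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def pack_context(query: str, chunks: list[str], budget_tokens: int = 8000) -> str:
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    picked, used = [], 0
    for c in ranked:
        n = len(c.split())                 # crude token estimate
        if used + n > budget_tokens:
            continue
        picked.append(c)
        used += n
    return "\n\n".join(picked)

chunks = ["def pay(order): ...", "README: this repo handles payments",
          "unrelated notes about the office plants"]
print(pack_context("where is the payment logic", chunks, budget_tokens=50))
```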
4. An Important Trend in AI Coding is the Use of Reasoning Enhancement Techniques, Similar to O3 or O1 Methods
These methods can significantly improve the overall efficiency of code agents. While they currently involve high costs (10-100 times more), they can reduce error rates by half or even a quarter. As language models evolve, these costs are expected to decrease rapidly, potentially making this approach a common technical route.
O3 performs significantly better than other models on benchmarks such as SWE-bench: the current industry score is generally around 50, while O3 scores between 70 and 75.
SWE-bench scores have risen rapidly over the past few months; a few months ago scores were in the 30s, and they have now risen to over 50.
Model performance enhancement techniques: According to internal tests, applying advanced techniques can further raise scores to around 62. Utilizing O3 can push scores up to 74-75. Although these enhancements may significantly increase costs, the overall performance improvement is substantial.
User experience and latency thresholds: Determining the optimal balance between performance and user experience is challenging. For auto-completion features, response times exceeding 215-500 milliseconds may lead users to disable the feature. In chat applications, a few seconds of response time is generally acceptable, but waiting 50-75 minutes is impractical. The acceptable latency threshold varies by application and user expectations.
The two main obstacles to maximizing model quality are computational power requirements and associated costs.
5. GitHub Copilot is Viewed as a Major Competitor.
6. Customer Success is Crucial for the Adoption of AI Coding Tools.
After-sales support, training, onboarding, and adoption are key differentiators. A startup has 60-70 people dedicated to customer success, accounting for about half of its total workforce. This significant investment helps ensure customer satisfaction.