AI storage is booming: can Filecoin step in to pick up the trash? What are hot storage and cold storage?
Introduction: Filecoin has not sought out collaborations for several years, and Juan has not made any public appearances. I am writing about Filecoin because my neighbor Kangkang (@tktang88), a major Filecoin holder, and many friends who are large Filecoin miners have shared their knowledge of Filecoin and their expectations for it, and one point Kangkang raised particularly interests me.
So here comes this tweet. This is not a commercial advertisement and does not encourage anyone to buy $FIL; it looks at decentralized storage from a new angle.
Main Text
The day before yesterday, Micron's earnings forecast cast a shadow over the entire market, while yesterday, Micron's earnings report exceeded expectations and drove a significant short-term surge in the market. Micron's market capitalization even briefly exceeded that of Meta and Tesla, and the reason is that storage demand in the AI era may exceed what many people imagine.
AI training and inference require high-speed reads and writes, and vector databases, KV cache offloading, model parameters, and intermediate inference state all demand more memory and storage capacity. This logic sits at the hardware level, is more deterministic, and is more directly tied to revenue.
However, AI storage demand will not stop at high-speed memory and SSDs. As model training, inference, agents, and user-generated content grow, another troublesome type of data will inevitably appear: large volumes of data that are worthless in the short term, are accessed extremely rarely, and may never be used again, yet that companies do not dare to delete lightly.
This is the focus of today's discussion: the storage of garbage data!
Data in the AI era will naturally be layered. At the front is hot data, which is currently being used for training and inference and requires high-speed access. This part is dominated by HBM, DRAM, NVMe SSD, and high-speed networks.
The middle layer is warm data, which may be reused in the near future, such as model checkpoints, training shards, vector indexes, experimental logs, evaluation data, and datasets still in iteration.
Lastly, there is cold data, which has completed training and will not be called in the short term, but may be needed again in the future due to retraining, rollback, copyright, regulation, auditing, safety incidents, or model reproducibility.
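The three layers above can be sketched as a simple classification rule. This is a minimal illustration only: the `DataAsset` type, the `classify_tier` function, and the 7-day and 90-day thresholds are assumptions made for the example, not values from any real system, and a real policy would likely also weigh access frequency and business rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


@dataclass
class DataAsset:
    name: str
    last_accessed: datetime


def classify_tier(asset: DataAsset, now: datetime) -> str:
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    idle = now - asset.last_accessed
    if idle <= HOT_WINDOW:
        return "hot"    # in active use for training or inference
    if idle <= WARM_WINDOW:
        return "warm"   # may be reused soon, e.g. checkpoints or indexes
    return "cold"       # finished with for now, kept for retraining or audits
```

An asset last touched yesterday lands in the hot tier, while one untouched for a year lands in the cold tier.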
Cold data and the demand segment that Micron currently dominates sit in very different positions. Micron dominates high-speed storage for data currently being used in training and inference. That data is the most valuable and the most expensive to store, which is why the hardware that holds it is in short supply.
But what about cold data? Cold data is data that is accessed very rarely: raw training data, cleaned data, deduplication records, annotation records, images and videos that users generated early on, and so on. It is treated almost as garbage. Most of it is rarely opened again and may go unread for years, yet deleting it outright is not an option.
The reasons are that a model may need to be retrained or rolled back, a particular output may need to be explained, a copyright dispute or regulatory audit may arise, or a new model may suddenly make previously useless-looking data useful.
So the most troublesome aspect of the AI era is that data keeps growing while the risk of deleting it also keeps rising.
Many AI businesses manage data crudely in their early days, without clearly separating hot, warm, and cold data. If large volumes of rarely accessed data keep occupying high-cost storage, storage costs rise significantly and the arrangement becomes uneconomical in the long run; high-speed cloud storage is not cost-effective for this data either. So can we just dump this cold data into a hard-drive "cold store"?
The answer is no.
If this AI data is simply thrown into a cold store without indexes, labels, source information, model version mappings, or records of the cleaning process, then even if the data still physically exists, it is almost as good as lost.
What is needed is for the metadata to remain hot while the data itself remains cold. The data itself can be placed in cold storage, but the metadata directory, source, hash, CID, license, creation time, cleaning method, corresponding model, usage records, privacy tags, retention period, and recovery test results need to be placed in a searchable, readable, and auditable hot index layer.
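A minimal sketch of such a hot index layer, assuming an in-memory dictionary stands in for a real database. The `ColdRecord` and `HotIndex` names and the field selection are hypothetical; they mirror a subset of the metadata listed above and do not describe any existing Filecoin tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColdRecord:
    """Metadata kept hot while the payload itself sits in cold storage."""
    content_id: str                 # hash or CID of the cold object
    source: str
    license: str
    created: str                    # ISO 8601 date string
    cleaning_method: str
    model_version: str
    privacy_tags: tuple = ()
    retain_until: str = ""
    last_restore_ok: Optional[bool] = None  # result of the latest recovery test


class HotIndex:
    """Searchable metadata layer; holds no payload bytes."""

    def __init__(self) -> None:
        self._records: dict[str, ColdRecord] = {}

    def add(self, record: ColdRecord) -> None:
        self._records[record.content_id] = record

    def find(self, **criteria: object) -> list[ColdRecord]:
        """Return records whose fields equal every given criterion."""
        return [
            r for r in self._records.values()
            if all(getattr(r, key) == value for key, value in criteria.items())
        ]
```

A query such as `index.find(model_version="v1", license="CC-BY-4.0")` then answers "which cold objects fed this model under this license" without touching cold storage at all.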
This is why Filecoin and decentralized storage are worth discussing again, especially decentralized storage infrastructure that already has storage capacity online across its network.
Filecoin has a large amount of network storage capacity. Having many hard drives is not very meaningful on its own, but these drives, tied to a blockchain, already form a prototype of verifiable cold storage. In particular, what sets Filecoin apart from traditional cloud storage is content addressing, storage across multiple providers, and on-chain proofs.
To put it simply, customers do not need to blindly trust a single cloud vendor claiming, "Your data has been saved," but can continuously verify whether this data still exists, whether the content has been altered, and can retrieve it in the future using the same content identifier.
This capability is meaningful for AI cold data.
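The idea of content addressing can be illustrated with a toy example. This sketch uses a plain hex SHA-256 digest as the identifier, which is a simplification: real CIDs use a richer encoding, and Filecoin's on-chain proofs are a separate and far more involved mechanism that is not modeled here. `ToyStore` is a hypothetical in-memory stand-in for a storage provider.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (simplified, not a real CID)."""
    return hashlib.sha256(data).hexdigest()


class ToyStore:
    """Hypothetical in-memory stand-in for a remote storage provider."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # The client re-hashes what it received, so it does not have to
        # take the provider's word that the content is unchanged.
        if content_id(data) != cid:
            raise ValueError("content does not match its identifier")
        return data
```

Because the identifier is derived from the bytes, any alteration of the stored content is detected on retrieval, whichever provider served it.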
From this perspective, the real opportunity for decentralized storage may lie in the management layer for AI cold data. That layer would be responsible for migrating data out of training clusters, cloud object storage, and on-premises enterprise servers: first deduplicating, compressing, privacy scanning, copyright tagging, encrypting, and sharding it, then putting the large files into cold storage while keeping the indexes hot.
When models need to be retrained in the future, the system can retrieve the data based on source, time, tags, and model version. Without this capability, Filecoin is just a warehouse, but with this capability, decentralized storage may become part of the AI data infrastructure.
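The ingest and retrieval flow described above might look roughly like the following. This is a sketch under stated assumptions: plain dictionaries stand in for the cold store and the hot index, the `ingest` and `retrieve` functions and the `shard_size` parameter are hypothetical, and the privacy scanning, copyright tagging, and encryption steps are omitted for brevity.

```python
import hashlib
import zlib


def ingest(files, cold_store: dict, hot_index: dict, shard_size: int = 1024) -> None:
    """Deduplicate, compress, and shard each (raw_bytes, metadata) pair.

    Shards go to cold_store; only metadata goes to hot_index.
    """
    for raw, meta in files:
        digest = hashlib.sha256(raw).hexdigest()
        if digest in hot_index:
            continue  # deduplication: identical content is already archived
        packed = zlib.compress(raw)
        shards = [packed[i:i + shard_size] for i in range(0, len(packed), shard_size)]
        cold_store[digest] = shards
        hot_index[digest] = dict(meta, shard_count=len(shards))


def retrieve(cold_store: dict, hot_index: dict, **criteria) -> list:
    """Query the hot index first, then restore only the matching objects."""
    restored = []
    for digest, meta in hot_index.items():
        if all(meta.get(key) == value for key, value in criteria.items()):
            restored.append(zlib.decompress(b"".join(cold_store[digest])))
    return restored
```

The design point is that `retrieve` never scans cold storage: it consults the hot index and pulls back only the objects that match, for example everything tagged with a given model version.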
Different decentralized storage projects should be assessed separately. Filecoin fits best in the discussion of verifiable cold data storage, since its core is a storage market plus proofs of storage. That makes it suitable for large files, rarely accessed data, fixed-version dataset snapshots, model checkpoints, scientific research data, public training corpora, and privacy-compliant audit logs.
Arweave is better suited to permanently public data, model descriptions, data source records, and immutable public archives. Data subject to privacy requirements or deletion rights is hard to place there directly, because permanent storage can itself create compliance problems.
Storj and Sia are closer to decentralized object storage. If their user experience and pricing are good enough, they can capture some backup and archiving demand, but they still have to prove their usability, recovery speed, enterprise service, and long-term economic models.
Of course, the most important thing is to be sufficiently cheap.
AWS Glacier Deep Archive, Google Archive, Azure Archive, enterprise tape libraries, local object storage, hard drive vendors, and cloud vendors will all compete for AI cold data.
Especially for data with extremely low access frequency, tapes and deep archiving remain highly competitive. For decentralized storage to win, the primary requirement is to be cost-effective, but in addition to being cheap, it still needs to meet capabilities like verifiable data, multi-provider support, vendor neutrality, and content addressing. Being cheap is just the door opener.
As AI continues to develop, cold data, or garbage data, will keep growing, and it is likely to become one of the most troublesome costs for future AI companies.
This is also why I believe existing low-cost decentralized storage is worth discussing again.
Previously, the biggest problem for projects like Filecoin was that there was supply (mining machines) but no matching demand. There are plenty of hard drives, a multitude of storage providers, and decentralization narratives, but the picture on real customers and real payments has been a mess.
If AI cold data truly becomes a large market, and if decentralized storage can deliver "hot indexing, cold storage" more cheaply than traditional storage, then these existing hard drives will have a chance at real use.
Of course, from an investment perspective today, one cannot simply conclude that because Micron has risen, Filecoin should rise as well. The business logic of the two is completely different.
Micron sells hardware. For Filecoin, what matters is paid storage volume, the number of real customers, renewal rates, retrieval success rates, recovery costs, storage provider profits, and whether growth in these areas ultimately translates into demand for $FIL through collateral, transaction fees, or token burns.
The road ahead for decentralized storage is still long. Whether this "hot indexing, cold storage" system can actually run effectively is the gap Filecoin most needs to close.
The demand for AI cold data is likely to emerge, but where this demand will ultimately flow depends on who can achieve sufficient affordability, stability, ease of retrieval, and ease of auditing.
If Filecoin can only prove that it has many hard drives, that is of little significance.
If Filecoin can prove that these hard drives can hold real paid data, and that the data can still be reliably retrieved, fully recovered, and continuously renewed over the years, then this seemingly unwanted garbage data of the AI era may indeed give decentralized storage a second chance.
End
