NVIDIA CEO Jensen Huang visited South Korea for the first time in 15 years. On the 30th of last month, he met with Samsung Electronics Chairman Lee Jae-yong and Hyundai Motor Group Chairman Chung Eui-sun to deepen cooperation on memory and an AI megafactory. Kim Jung-Ho, a professor at South Korea's KAIST known as the father of HBM (High Bandwidth Memory), said bluntly on a YouTube program that "dominance in the AI era is shifting from the GPU to memory." Given memory's growing importance, he believes NVIDIA may acquire a memory company such as Micron or SanDisk.
Kim Jung-Ho argued that because memory matters more and more to AI, NVIDIA is likely to acquire a memory company such as Micron or SanDisk, rather than the far larger Samsung or SK Hynix, in order to secure its leadership in the field. He also quipped that SanDisk's recent share-price rise is partly due to growing demand for NAND Flash in data centers, and that SanDisk's size makes it a more suitable acquisition target.
SanDisk's shares rose 4.3% over five days to $199.33.
In fact, the memory bottleneck is an urgent problem for the coming era of AI inference, and how international manufacturers resolve it will be a critical question. The fan page "Richard only talks about fundamentals – Richard's Research Blog" also noted that memory's value contribution within the GPU package, and the technical difficulty of integrating it, keep rising, so the probability that NVIDIA will consider buying or investing in a memory company should not be zero.
How can the memory bottleneck in AI inference be relieved?

Yue Feng, Vice President of Huawei Data Storage Products, has said at past events that AI inference currently faces three major problems: it "can't run" (the input is too long and exceeds what the model can handle), it "runs too slowly" (responses take too long), and it "costs too much" (computing is too expensive).
Memory requirements fall mainly into HBM, DRAM and SSD. HBM holds real-time working data, roughly 10 GB to hundreds of GB of extremely hot data such as live conversations; DRAM serves as short-term memory, roughly hundreds of GB to the TB range of hot data such as multi-turn conversations; SSDs store long-term memory and external knowledge, roughly TB to PB of hot and warm data such as historical conversations, RAG knowledge bases, and corpora.
(Source: Zhixixi)
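As a rough sketch of how such a tiered layout could be expressed in code, the Python snippet below routes cached data to HBM, DRAM, or SSD based on its access "temperature." The CacheItem class, the choose_tier function, and the capacity notes are hypothetical, written only to mirror the hierarchy described above, not any vendor's actual software.

```python
# Illustrative sketch of the HBM / DRAM / SSD tiering described above.
# Tier names, capacity notes, and the routing rule are hypothetical.
from dataclasses import dataclass

@dataclass
class CacheItem:
    name: str
    size_gb: float
    temperature: str  # "extremely_hot", "hot", or "warm"

def choose_tier(item: CacheItem) -> str:
    """Map a cached item to a memory tier by data temperature."""
    if item.temperature == "extremely_hot":
        return "HBM"    # tens to hundreds of GB: real-time conversation state
    if item.temperature == "hot":
        return "DRAM"   # hundreds of GB to TB: recent multi-turn context
    return "SSD"        # TB to PB: historical chats, RAG knowledge bases, corpora

if __name__ == "__main__":
    items = [
        CacheItem("live KV cache", 0.5, "extremely_hot"),
        CacheItem("recent multi-turn context", 40, "hot"),
        CacheItem("RAG knowledge base", 2000, "warm"),
    ]
    for it in items:
        print(f"{it.name}: {choose_tier(it)}")
```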
In the AI inference stage, an "attention mechanism" loosely analogous to the human brain is used: the model compares what is currently being asked (the Query) against the important parts of the context (the Keys and Values) in order to answer the prompt.
Each time a new token (new word) is processed, the model would otherwise have to recalculate the Keys and Values of every previously processed token in order to update the attention weights. Large language models (LLMs) therefore add a mechanism called the "KV cache," which stores the previously computed Keys and Values in memory, eliminating the cost of recalculating them each time and speeding up token processing and generation by several orders of magnitude.
This means the KV cache acts as the AI model's "short-term memory": it lets the model remember what it has already processed in earlier questions. Every time the user resumes a previous discussion or asks a new question, nothing has to be recomputed from scratch; the AI can keep track of what the user has said, reasoned about, and provided, and deliver faster, more thorough answers to longer and deeper discussions.
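To make the mechanism concrete, here is a minimal NumPy sketch of single-head attention with a KV cache: at each decoding step, only the newest token's Key and Value are computed and appended to the cache, while the Keys and Values of earlier tokens are reused instead of being reprojected. The matrices, dimensions, and the decode_step function are toy assumptions for illustration, not a production implementation.

```python
# Minimal single-head attention with a KV cache (illustrative toy example).
import numpy as np

d_model = 16                       # toy hidden size
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_new, k_cache, v_cache):
    """Process one new token, reusing cached Keys/Values for all earlier tokens."""
    q = x_new @ W_q                                  # Query for the newest token only
    k_cache = np.vstack([k_cache, x_new @ W_k])      # append the new Key
    v_cache = np.vstack([v_cache, x_new @ W_v])      # append the new Value
    scores = softmax(q @ k_cache.T / np.sqrt(d_model))
    out = scores @ v_cache                           # attention output for the new token
    return out, k_cache, v_cache

# Simulate generating 5 tokens: earlier Keys/Values are never recomputed.
k_cache = np.empty((0, d_model))
v_cache = np.empty((0, d_model))
for step in range(5):
    x_new = rng.standard_normal((1, d_model))        # stand-in for the new token's embedding
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print("cached keys:", k_cache.shape, "cached values:", v_cache.shape)  # (5, 16) each
```

Without the cache, every step would have to reproject the Keys and Values for the entire history, so the work per step would grow with the sequence length; with it, the cost of each step stays roughly constant, but the cache itself must live in fast memory, which is exactly why it weighs on HBM and DRAM capacity.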
Further reading: Breaking through the HBM capacity problem! Huawei's UMC technology and NVIDIA-backed startups look to the "KV cache" for new solutions