Even the chip makers are making LLMs
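Source: Stack Overflow Blog
Summary
NVIDIA VP Kari Briski discusses how the chip giant moved into LLMs through extreme co-design.
Key highlights
- Since 2018: NVIDIA has been building LLMs since 2018, working closely with its hardware teams
- NVFP4 precision: training and inference both use a 4-bit floating-point format, preserving accuracy rather than quantizing a model after training
- Extreme co-design: hardware architects and model builders exchange feedback daily
- Nemotron model family: Nano, Super, and Ultra, plus vision-language models and embedding models
- Hybrid model architecture: Mamba state space layers combined with Transformer layers, for better token efficiency
- Mixture of experts (MoE): further improves training and inference efficiency
- Million-token context: requires a new memory hierarchy
- Open-source strategy: releases weights, training data, and recipes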
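
The hybrid-architecture and million-token points can be made concrete with back-of-the-envelope arithmetic. The sketch below compares how an attention key/value cache grows with context length against the fixed-size recurrent state of a state space layer. The model shapes (`ATTN`, `SSM`) and the 2 bytes per value are hypothetical numbers chosen for illustration, not the configuration of any Nemotron model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache attention keys and values for one sequence."""
    # The factor 2 covers keys plus values; each layer stores both for every token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """Bytes for a recurrent state whose size does not depend on sequence length."""
    return n_layers * d_inner * d_state * bytes_per_value


# Hypothetical shapes, for illustration only.
ATTN = dict(n_layers=32, n_kv_heads=8, head_dim=128)
SSM = dict(n_layers=32, d_inner=8192, d_state=128)

if __name__ == "__main__":
    ssm_gib = ssm_state_bytes(**SSM) / 2**30
    for seq_len in (8_192, 131_072, 1_000_000):
        kv_gib = kv_cache_bytes(seq_len, **ATTN) / 2**30
        print(f"{seq_len:>9,} tokens: KV cache {kv_gib:8.2f} GiB, "
              f"SSM state {ssm_gib:.4f} GiB")
```

Under these assumed shapes the cache grows linearly with the number of tokens while the recurrent state stays constant, which is one way to read why long contexts put pressure on the memory hierarchy and why mixing in state space layers can help.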
Core insight:
"The most accurate agentic systems take systems of models. It's not just one model"
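Technical details
- NVFP4 advantage: moving from FP16 to NVFP4 is said to save 50% of memory while preserving accuracy (post-training quantization loses 1-2% accuracy)
- Disaggregated serving: prefill and decode run on different GPU types, improving efficiency
- Context Memory Engine: a dedicated context memory engine announced at CES
- GPUs remain general-purpose compute, which matters because the most accurate agentic systems are multi-model systems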
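
To give a feel for what a 4-bit floating-point format with shared scales involves, here is a minimal pure-Python sketch of block-scaled quantization. It assumes the publicly described NVFP4 layout: 4-bit E2M1 values with one 8-bit scale per block of 16 values. The function names are hypothetical, the scale is kept as an ordinary Python float rather than rounded to 8 bits, and the optional per-tensor scale is ignored, so this is an illustration and not NVIDIA's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one shared scale per 16 values


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the codes."""
    return [scale * c for c in codes]


def bits_per_value(block_size=BLOCK_SIZE, value_bits=4, scale_bits=8):
    """Average storage cost per value, including the amortized block scale."""
    return value_bits + scale_bits / block_size


if __name__ == "__main__":
    weights = [0.03 * i - 0.2 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"bits per value: {bits_per_value()}")
    print(f"max abs error in this block: {max_err:.4f}")
```

With these assumed parameters, `bits_per_value()` comes to 4.5 bits. Against a 16-bit baseline that would be a larger reduction than the roughly 50% quoted, so the quoted figure may rest on a different baseline or include memory other than the weights; the sketch does not settle which.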
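Industry views
- Not a single model: the most accurate agentic systems need multiple models working together
- Diffusion models: currently the most promising research direction
- Token efficiency: reasoning models need higher token efficiency
- RAG is still a tool: retrieval-augmented generation is just one tool in an agentic system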
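
The idea that retrieval is one tool among several can be sketched as a tiny dispatcher. Everything here is hypothetical: `route` is a crude rule standing in for a planner model, `retrieve` is a toy word-overlap search over three hard-coded sentences, and `add_numbers` represents any non-retrieval tool. A real agentic system would put separate models behind these roles.

```python
from typing import Callable, Dict, List, Tuple

# Tiny in-memory corpus, for illustration only.
DOCS = [
    "NVFP4 is a 4-bit floating-point format.",
    "Disaggregated serving splits prefill and decode.",
    "Nemotron models come in Nano, Super, and Ultra sizes.",
]


def retrieve(query: str, top_k: int = 1) -> List[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    terms = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(terms & set(doc.lower().rstrip(".").split()))

    return sorted(DOCS, key=overlap, reverse=True)[:top_k]


def add_numbers(text: str) -> str:
    """A second, non-retrieval tool: sum whitespace-separated numbers."""
    return str(sum(float(token) for token in text.split()))


def is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


TOOLS: Dict[str, Callable[[str], str]] = {
    "retrieve": lambda q: " ".join(retrieve(q)),
    "add": add_numbers,
}


def route(request: str) -> str:
    """Stand-in for a planner model: choose a tool name with a simple rule."""
    tokens = request.split()
    if tokens and all(is_number(t) for t in tokens):
        return "add"
    return "retrieve"


def run_agent(request: str) -> Tuple[str, str]:
    """Pick a tool, call it, and return (tool name, tool output)."""
    tool = route(request)
    return tool, TOOLS[tool](request)


if __name__ == "__main__":
    print(run_agent("what is disaggregated serving"))
    print(run_agent("2 3.5"))
```

Retrieval sits in the same registry as any other capability, so swapping the rule-based router for a model, or adding further tools, would not change the overall shape.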
Interview quotes
"We always employ people who truly understand the applications... You have to truly know the workload in order to accelerate it"
"It's funny because when we're around, we laugh because there's a lot of computer system design that's still relevant to agents, and we laugh that agents are almost kinda like a new type of object oriented programming"