
Google's Neural Network Processor Patents

2017-04-07 唐杉 StarryHeavensAbove

The Google TPU news has no doubt flooded everyone's feeds over the past couple of days. The TPU paper actually contains a great deal of information, and most analyses so far have only given brief introductions; more detailed discussions will surely follow. I will also pick a few points to expand on.


While looking at the implementation details, I noticed that the paper cites several references [1-6], which are in fact patents filed by Google. They were filed in 2015 and published in 2016. They seem to have attracted little attention so far, but they are well worth reading now. Among them, [1] "Neural Network Processor" describes the overall architecture, while [2-6] cover more detailed topics built on that architecture: how convolutions are computed, the implementation of the vector computation unit, the handling of weights, the data-rotation method, and batch processing. All of this is very helpful for a deep understanding of the Google TPU design.


I have collected the original texts of these patents (both the US and WO versions; I recommend the WO versions). You can download them by replying googlepat to my public account.


Below we focus on the "Neural Network Processor" patent, which should be the foundation of the series.


Abstract

A circuit for performing neural network computations for a neural network comprising a plurality of network layers, the circuit comprising: a matrix computation unit configured to, for each of the plurality of neural network layers, receive a plurality of weight inputs and a plurality of activation inputs for the neural network layer, and generate a plurality of accumulated values based on the plurality of weight inputs and the plurality of activation inputs; and a vector computation unit, communicatively coupled to the matrix computation unit, configured to, for each of the plurality of neural network layers, apply an activation function to each accumulated value generated by the matrix computation unit to generate a plurality of activation values for the neural network layer.
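To make the division of labor concrete, here is a minimal NumPy sketch of the two units' roles. This is my own illustration, not code from the patent; the function names and the choice of tanh as the activation function are assumptions.

```python
import numpy as np

def matrix_computation_unit(weights, activations):
    """Reduce weight and activation inputs to accumulated values
    (here modeled as a plain matrix multiply)."""
    return weights @ activations

def vector_computation_unit(accumulated, activation_fn=np.tanh):
    """Apply the activation function to each accumulated value to
    produce the layer's activation values."""
    return activation_fn(accumulated)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))      # weight inputs for one layer
x = rng.standard_normal(4)           # activation inputs for the layer
acc = matrix_computation_unit(W, x)  # accumulated values
act = vector_computation_unit(acc)   # activation values for the layer
print(act.shape)  # (3,)
```

The key point the claim makes is exactly this split: all multiply-accumulate work happens in the matrix unit, and the per-element nonlinearity is a separate, downstream unit.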


Detailed Description

The overall block diagram is as follows:

This block diagram is of course more abstract than the one in the paper, but comparing it with the paper's overall TPU block diagram (below), you can see that the modules in the patent are the key functional blocks, the ones most important to the patent claims. The core among them are the Matrix Computation Unit and the Vector Computation Unit. Around these sit the Unified Buffer, the Sequencer, the DMA engine, and the memory unit; the arrows in the figure indicate the direction of data flow.

The next figure shows the architecture of the matrix computation unit.

The figure below shows the architecture of a single cell in the matrix computation unit (a systolic array).

The figure below shows the architecture of the vector computation unit.

The text of the detailed description mainly explains the function of each module and how the modules relate to one another. I recommend reading through it yourself; it gives a clear picture of the complete data flow. Note that this patent covers the main modules; finer design details, and how the system operates, are described in the other patents.
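As an illustration of the data flow through the systolic array, here is a toy weight-stationary simulation. This is my own sketch, not code from the patent, and the real array overlaps these steps in time with skewed inputs; the point is only the movement pattern: activations stream across rows while partial sums accumulate down columns.

```python
def systolic_matmul(weights, activations):
    """Toy weight-stationary pass: cell (i, j) holds weights[i][j];
    activation activations[i] streams across row i, and each cell adds
    its product to the partial sum flowing down column j.  Timing skew
    between cells is omitted for clarity."""
    rows, cols = len(weights), len(weights[0])
    partial_sums = [0.0] * cols        # sums entering the top row
    for i in range(rows):              # sums move down one row per step
        for j in range(cols):
            # the cell's multiply-accumulate
            partial_sums[j] += weights[i][j] * activations[i]
    return partial_sums                # accumulated values at the bottom
```

The accumulated values leaving the bottom of the array are the columns of the matrix product, which is what the vector computation unit then consumes.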


Claims

There are 28 claims in total; they mainly cover what the block diagrams above already show, so I won't go through them here.


The other patents also reveal some details of the Google TPU implementation and are well worth a read. Below I paste their abstracts and the more important figures for reference.


COMPUTING CONVOLUTIONS USING A NEURAL NETWORK PROCESSOR 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing a layer output for a convolutional neural network layer, the method comprising: receiving the layer input, the layer input comprising a plurality of activation inputs, the plurality of activation inputs represented as a multi-dimensional matrix comprising a plurality of depth levels, each depth level being a respective matrix of distinct activation inputs from the plurality of activation inputs; sending each respective kernel matrix structure to a distinct cell along a first dimension of the systolic array; for each depth level, sending the respective matrix of distinct activation inputs to a distinct cell along a second dimension of the systolic array; causing the systolic array to generate an accumulated output from the respective matrices sent to the cells; and generating the layer output from the accumulated output.
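The abstract maps kernel matrices and depth-level activation matrices onto the two dimensions of the systolic array. A common way to see why a convolution fits a matrix unit at all is the im2col trick, which turns the convolution into a single matrix multiply. The sketch below is my own illustration of that reduction, not the patent's exact mapping:

```python
import numpy as np

def conv2d_via_matmul(inputs, kernels):
    """inputs: (depth, H, W); kernels: (num_kernels, depth, kH, kW).
    Flatten each receptive field into a column (im2col) so the whole
    convolution becomes one matrix multiply - the kind of operation a
    systolic matrix unit is built for."""
    d, H, W = inputs.shape
    n, _, kH, kW = kernels.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Gather every receptive field into a (d*kH*kW, oH*oW) matrix.
    cols = np.stack([
        inputs[:, i:i + kH, j:j + kW].reshape(-1)
        for i in range(oH) for j in range(oW)
    ], axis=1)
    flat_kernels = kernels.reshape(n, -1)   # one row per kernel
    return (flat_kernels @ cols).reshape(n, oH, oW)
```

In the patent's scheme, the kernel structures supply one array dimension and the per-depth-level activation matrices the other, but the underlying idea is the same reduction to a matrix product.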

PREFETCHING WEIGHTS FOR USE IN A NEURAL NETWORK PROCESSOR 

Abstract

A circuit for performing neural network computations for a neural network, the circuit comprising: a systolic array comprising a plurality of cells; a weight fetcher unit configured to, for each of the plurality of neural network layers: send, for the neural network layer, a plurality of weight inputs to cells along a first dimension of the systolic array; and a plurality of weight sequencer units, each weight sequencer unit coupled to a distinct cell along the first dimension of the systolic array, the plurality of weight sequencer units configured to, for each of the plurality of neural network layers: shift, for the neural network layer, the plurality of weight inputs to cells along the second dimension of the systolic array over a plurality of clock cycles and where each cell is configured to compute a product of an activation input and a respective weight input using multiplication circuitry.
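The shifting the abstract describes can be sketched as a toy model (my own illustration; clocking and the per-cell sequencer hardware are omitted): one row of weights enters the array per clock cycle and already-loaded rows shift one position deeper, so the deepest row's weights must be fed first.

```python
def load_weights(weight_rows):
    """Toy model of the weight shift-in: one row of weights enters the
    top of the array per clock cycle, and rows already in the array
    shift down one position.  Feeding rows bottom-first means that
    after len(weight_rows) cycles, cell row i holds weight_rows[i]."""
    num_rows = len(weight_rows)
    array = [None] * num_rows
    for row in reversed(weight_rows):      # one row per clock cycle
        for r in range(num_rows - 1, 0, -1):
            array[r] = array[r - 1]        # loaded rows shift down
        array[0] = row                     # new row enters at the top
    return array
```

Prefetching in this style lets the next layer's weights stream in while the array is still busy, which is one reason the weight fetcher and weight sequencers are separate units in the claim.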


ROTATING DATA FOR NEURAL NETWORK COMPUTATIONS 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing a layer output for a convolutional neural network layer, the method comprising: receiving a plurality of activation inputs; forming a plurality of vector inputs from the plurality of activation inputs, each vector input comprising values from a distinct region within the multi-dimensional matrix; sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array; generating a plurality of rotated kernel structures from each of a plurality of kernels; sending each kernel structure and each rotated kernel structure to one or more cells along a second dimension of the systolic array; causing the systolic array to generate an accumulated output based on the plurality of vector inputs and the plurality of kernels; and generating the layer output from the accumulated output.
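The abstract does not spell out how the rotated kernel structures are generated; the full patent has the details. Purely as a hypothetical illustration of what a set of rotated structures could look like, the sketch below produces all cyclic rotations of a flattened kernel:

```python
def rotated_kernel_structures(kernel_flat):
    """All cyclic rotations of a flattened kernel - a hypothetical
    stand-in for the patent's 'rotated kernel structures'."""
    n = len(kernel_flat)
    return [kernel_flat[i:] + kernel_flat[:i] for i in range(n)]
```

Sending the original structure plus its rotations along one array dimension lets different cells see differently aligned copies of the same kernel, which is the reuse pattern the claim is after.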

VECTOR COMPUTATION UNIT IN A NEURAL NETWORK PROCESSOR 

Abstract

A circuit for performing neural network computations for a neural network comprising a plurality of layers, the circuit comprising: activation circuitry configured to receive a vector of accumulated values and configured to apply a function to each accumulated value to generate a vector of activation values; and normalization circuitry coupled to the activation circuitry and configured to generate a respective normalized value from each activation value.
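A minimal sketch of that two-stage pipeline follows. This is my own illustration: the patent does not specify which activation or normalization functions are used, so ReLU and L2 normalization here are assumptions.

```python
import numpy as np

def vector_computation_unit(accumulated):
    # Activation circuitry: apply a function to each accumulated
    # value (ReLU is an assumption; the patent leaves it open).
    activated = np.maximum(accumulated, 0.0)
    # Normalization circuitry: one normalized value per activation
    # value (L2 normalization is likewise an assumption).
    norm = np.linalg.norm(activated)
    return activated / norm if norm else activated
```

The structural point is that activation and normalization are separate pieces of circuitry chained after the matrix unit, so the accumulated values never need to leave the chip between these steps.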

BATCH PROCESSING IN A NEURAL NETWORK PROCESSOR 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a respective neural network output for each of a plurality of inputs, the method comprising, for each of the neural network layers: receiving a plurality of inputs to be processed at the neural network layer; forming one or more batches of inputs from the plurality of inputs, each batch having a number of inputs up to the respective batch size for the neural network layer; selecting a number of the one or more batches of inputs to process, where a count of the inputs in the number of the one or more batches is greater than or equal to the respective associated batch size of a subsequent layer in the sequence; and processing the number of the one or more batches of inputs to generate the respective neural network layer output.
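The batching rule in the abstract can be sketched as follows (my own illustration of the claim language; the function names are hypothetical): inputs are grouped into batches of at most the layer's batch size, and batches are selected for processing until the queued input count covers the next layer's batch size.

```python
def form_batches(inputs, batch_size):
    """Group a layer's inputs into batches of at most batch_size."""
    return [inputs[i:i + batch_size]
            for i in range(0, len(inputs), batch_size)]

def select_batches(batches, next_layer_batch_size):
    """Select batches to process until the count of queued inputs
    reaches the batch size of the subsequent layer."""
    selected, count = [], 0
    for batch in batches:
        if count >= next_layer_batch_size:
            break
        selected.append(batch)
        count += len(batch)
    return selected

batches = form_batches(list(range(7)), 3)   # [[0,1,2], [3,4,5], [6]]
chosen = select_batches(batches, 4)         # enough inputs for the next layer
```

Deferring processing until the next layer's batch is covered amortizes the cost of reloading weights, since each layer's weights are streamed into the array once per batch rather than once per input.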


Finally, as always: comments, discussion, and shares are all welcome.

T.S.

References:

1. Ross, J., Jouppi, N., Phelps, A., Young, C., Norrie, T., Thorson, G., Luu, D., 2015. Neural Network Processor, Patent Application No. 62/164,931.

2. Ross, J., Phelps, A., 2015. Computing Convolutions Using a Neural Network Processor, Patent Application No. 62/164,902.

3. Ross, J., 2015. Prefetching Weights for a Neural Network Processor, Patent Application No. 62/164,981.

4. Ross, J., Thorson, G., 2015. Rotating Data for Neural Network Computations, Patent Application No. 62/164,908.

5. Thorson, G., Clark, C., Luu, D., 2015. Vector Computation Unit in a Neural Network Processor, Patent Application No. 62/165,022.

6. Young, C., 2015. Batch Processing in a Neural Network Processor, Patent Application No. 62/165,020.
