
Google's Neural Network Processor Patents

2017-04-07 唐杉 StarryHeavensAbove

The Google TPU news has no doubt flooded everyone's feeds over the past couple of days. The TPU paper actually contains a great deal of information, and most analyses so far have only given brief introductions; more detailed discussions will surely follow. I will also pick a few points to expand on.


While looking at the implementation details, I noticed that the paper cites several references [1-6], which are in fact patents filed by Google. They were filed in 2015 and published in 2016. They seem to have attracted little attention so far, but they are well worth reading now. Among them, [1] "Neural Network Processor" describes the overall architecture, while [2-6] cover more detailed topics built on that architecture: how convolutions are computed, the implementation of the vector computation unit, the handling of weights, the data-rotation method, and batch processing. All of this is very helpful for a deep understanding of the Google TPU design.


I have collected the original texts of these patents (both the US and WO versions; I recommend the WO versions). You can download them by replying googlepat to my public account.


Below we focus on the "Neural Network Processor" patent, which should be the foundation of the series.


Abstract

A circuit for performing neural network computations for a neural network comprising a plurality of network layers, the circuit comprising: a matrix computation unit configured to, for each of the plurality of neural network layers, receive a plurality of weight inputs and a plurality of activation inputs for the neural network layer, and generate a plurality of accumulated values based on the plurality of weight inputs and the plurality of activation inputs; and a vector computation unit, communicatively coupled to the matrix computation unit, configured to, for each of the plurality of neural network layers, apply an activation function to each accumulated value generated by the matrix computation unit to generate a plurality of activation values for the neural network layer.
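To make the division of labor concrete, here is a minimal NumPy sketch of the two units' roles. This is my own illustration, not code from the patent; the function names and the choice of tanh as the activation function are assumptions.

```python
import numpy as np

def matrix_computation_unit(weights, activations):
    """Reduce weight and activation inputs to accumulated values
    (here modeled as a plain matrix multiply)."""
    return weights @ activations

def vector_computation_unit(accumulated, activation_fn=np.tanh):
    """Apply the activation function to each accumulated value to
    produce the layer's activation values."""
    return activation_fn(accumulated)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))      # weight inputs for one layer
x = rng.standard_normal(4)           # activation inputs for the layer
acc = matrix_computation_unit(W, x)  # accumulated values
act = vector_computation_unit(acc)   # activation values for the layer
print(act.shape)  # (3,)
```

The key point the claim makes is exactly this split: all multiply-accumulate work happens in the matrix unit, and the per-element nonlinearity is a separate, downstream unit.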


Detailed Description

The overall block diagram is as follows:

This block diagram is of course more abstract than the one in the paper, but comparing it with the paper's overall TPU block diagram (below), you can see that the modules in the patent are the key functional blocks, the ones most important to the patent claims. The core among them are the Matrix Computation Unit and the Vector Computation Unit. Around these sit the Unified Buffer, the Sequencer, the DMA engine, and the memory unit; the arrows in the figure indicate the direction of data flow.

The next figure shows the architecture of the matrix computation unit.

The figure below shows the architecture of a single cell in the matrix computation unit (a systolic array).

The figure below shows the architecture of the vector computation unit.

The text of the detailed description mainly explains the function of each module and how the modules relate to one another. I recommend reading through it yourself; it gives a clear picture of the complete data flow. Note that this patent covers the main modules; finer design details, and how the system operates, are described in the other patents.
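As an illustration of the data flow through the systolic array, here is a toy weight-stationary simulation. This is my own sketch, not code from the patent, and the real array overlaps these steps in time with skewed inputs; the point is only the movement pattern: activations stream across rows while partial sums accumulate down columns.

```python
def systolic_matmul(weights, activations):
    """Toy weight-stationary pass: cell (i, j) holds weights[i][j];
    activation activations[i] streams across row i, and each cell adds
    its product to the partial sum flowing down column j.  Timing skew
    between cells is omitted for clarity."""
    rows, cols = len(weights), len(weights[0])
    partial_sums = [0.0] * cols        # sums entering the top row
    for i in range(rows):              # sums move down one row per step
        for j in range(cols):
            # the cell's multiply-accumulate
            partial_sums[j] += weights[i][j] * activations[i]
    return partial_sums                # accumulated values at the bottom
```

The accumulated values leaving the bottom of the array are the columns of the matrix product, which is what the vector computation unit then consumes.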


Claims

There are 28 claims in total; they mainly cover what the block diagrams above already show, so I won't go through them here.


The other patents also reveal some details of the Google TPU implementation and are well worth a read. Below I paste their abstracts and the more important figures for reference.


COMPUTING CONVOLUTIONS USING A NEURAL NETWORK PROCESSOR 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing a layer output for a convolutional neural network layer, the method comprising: receiving the layer input, the layer input comprising a plurality of activation inputs, the plurality of activation inputs represented as a multi-dimensional matrix comprising a plurality of depth levels, each depth level being a respective matrix of distinct activation inputs from the plurality of activation inputs; sending each respective kernel matrix structure to a distinct cell along a first dimension of the systolic array; for each depth level, sending the respective matrix of distinct activation inputs to a distinct cell along a second dimension of the systolic array; causing the systolic array to generate an accumulated output from the respective matrices sent to the cells; and generating the layer output from the accumulated output.
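The abstract maps kernel matrices and depth-level activation matrices onto the two dimensions of the systolic array. A common way to see why a convolution fits a matrix unit at all is the im2col trick, which turns the convolution into a single matrix multiply. The sketch below is my own illustration of that reduction, not the patent's exact mapping:

```python
import numpy as np

def conv2d_via_matmul(inputs, kernels):
    """inputs: (depth, H, W); kernels: (num_kernels, depth, kH, kW).
    Flatten each receptive field into a column (im2col) so the whole
    convolution becomes one matrix multiply - the kind of operation a
    systolic matrix unit is built for."""
    d, H, W = inputs.shape
    n, _, kH, kW = kernels.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Gather every receptive field into a (d*kH*kW, oH*oW) matrix.
    cols = np.stack([
        inputs[:, i:i + kH, j:j + kW].reshape(-1)
        for i in range(oH) for j in range(oW)
    ], axis=1)
    flat_kernels = kernels.reshape(n, -1)   # one row per kernel
    return (flat_kernels @ cols).reshape(n, oH, oW)
```

In the patent's scheme, the kernel structures supply one array dimension and the per-depth-level activation matrices the other, but the underlying idea is the same reduction to a matrix product.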

PREFETCHING WEIGHTS FOR USE IN A NEURAL NETWORK PROCESSOR 

Abstract

A circuit for performing neural network computations for a neural network, the circuit comprising: a systolic array comprising a plurality of cells; a weight fetcher unit configured to, for each of the plurality of neural network layers: send, for the neural network layer, a plurality of weight inputs to cells along a first dimension of the systolic array; and a plurality of weight sequencer units, each weight sequencer unit coupled to a distinct cell along the first dimension of the systolic array, the plurality of weight sequencer units configured to, for each of the plurality of neural network layers: shift, for the neural network layer, the plurality of weight inputs to cells along the second dimension of the systolic array over a plurality of clock cycles and where each cell is configured to compute a product of an activation input and a respective weight input using multiplication circuitry.
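The shifting the abstract describes can be sketched as a toy model (my own illustration; clocking and the per-cell sequencer hardware are omitted): one row of weights enters the array per clock cycle and already-loaded rows shift one position deeper, so the deepest row's weights must be fed first.

```python
def load_weights(weight_rows):
    """Toy model of the weight shift-in: one row of weights enters the
    top of the array per clock cycle, and rows already in the array
    shift down one position.  Feeding rows bottom-first means that
    after len(weight_rows) cycles, cell row i holds weight_rows[i]."""
    num_rows = len(weight_rows)
    array = [None] * num_rows
    for row in reversed(weight_rows):      # one row per clock cycle
        for r in range(num_rows - 1, 0, -1):
            array[r] = array[r - 1]        # loaded rows shift down
        array[0] = row                     # new row enters at the top
    return array
```

Prefetching in this style lets the next layer's weights stream in while the array is still busy, which is one reason the weight fetcher and weight sequencers are separate units in the claim.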


ROTATING DATA FOR NEURAL NETWORK COMPUTATIONS 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing a layer output for a convolutional neural network layer, the method comprising: receiving a plurality of activation inputs; forming a plurality of vector inputs from the plurality of activation inputs, each vector input comprising values from a distinct region within the multi-dimensional matrix; sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array; generating a plurality of rotated kernel structures from each of a plurality of kernels; sending each kernel structure and each rotated kernel structure to one or more cells along a second dimension of the systolic array; causing the systolic array to generate an accumulated output based on the plurality of vector inputs and the plurality of kernels; and generating the layer output from the accumulated output.
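The abstract does not spell out how the rotated kernel structures are generated; the full patent has the details. Purely as a hypothetical illustration of what a set of rotated structures could look like, the sketch below produces all cyclic rotations of a flattened kernel:

```python
def rotated_kernel_structures(kernel_flat):
    """All cyclic rotations of a flattened kernel - a hypothetical
    stand-in for the patent's 'rotated kernel structures'."""
    n = len(kernel_flat)
    return [kernel_flat[i:] + kernel_flat[:i] for i in range(n)]
```

Sending the original structure plus its rotations along one array dimension lets different cells see differently aligned copies of the same kernel, which is the reuse pattern the claim is after.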

VECTOR COMPUTATION UNIT IN A NEURAL NETWORK PROCESSOR 

Abstract

A circuit for performing neural network computations for a neural network comprising a plurality of layers, the circuit comprising: activation circuitry configured to receive a vector of accumulated values and configured to apply a function to each accumulated value to generate a vector of activation values; and normalization circuitry coupled to the activation circuitry and configured to generate a respective normalized value from each activation value.
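A minimal sketch of that two-stage pipeline follows. This is my own illustration: the patent does not specify which activation or normalization functions are used, so ReLU and L2 normalization here are assumptions.

```python
import numpy as np

def vector_computation_unit(accumulated):
    # Activation circuitry: apply a function to each accumulated
    # value (ReLU is an assumption; the patent leaves it open).
    activated = np.maximum(accumulated, 0.0)
    # Normalization circuitry: one normalized value per activation
    # value (L2 normalization is likewise an assumption).
    norm = np.linalg.norm(activated)
    return activated / norm if norm else activated
```

The structural point is that activation and normalization are separate pieces of circuitry chained after the matrix unit, so the accumulated values never need to leave the chip between these steps.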

BATCH PROCESSING IN A NEURAL NETWORK PROCESSOR 

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a respective neural network output for each of a plurality of inputs, the method comprising, for each of the neural network layers: receiving a plurality of inputs to be processed at the neural network layer; forming one or more batches of inputs from the plurality of inputs, each batch having a number of inputs up to the respective batch size for the neural network layer; selecting a number of the one or more batches of inputs to process, where a count of the inputs in the number of the one or more batches is greater than or equal to the respective associated batch size of a subsequent layer in the sequence; and processing the number of the one or more batches of inputs to generate the respective neural network layer output.
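The batching rule in the abstract can be sketched as follows (my own illustration of the claim language; the function names are hypothetical): inputs are grouped into batches of at most the layer's batch size, and batches are selected for processing until the queued input count covers the next layer's batch size.

```python
def form_batches(inputs, batch_size):
    """Group a layer's inputs into batches of at most batch_size."""
    return [inputs[i:i + batch_size]
            for i in range(0, len(inputs), batch_size)]

def select_batches(batches, next_layer_batch_size):
    """Select batches to process until the count of queued inputs
    reaches the batch size of the subsequent layer."""
    selected, count = [], 0
    for batch in batches:
        if count >= next_layer_batch_size:
            break
        selected.append(batch)
        count += len(batch)
    return selected

batches = form_batches(list(range(7)), 3)   # [[0,1,2], [3,4,5], [6]]
chosen = select_batches(batches, 4)         # enough inputs for the next layer
```

Deferring processing until the next layer's batch is covered amortizes the cost of reloading weights, since each layer's weights are streamed into the array once per batch rather than once per input.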


Finally, as always: comments, discussion, and shares are all welcome.

T.S.

References:

1. Ross, J., Jouppi, N., Phelps, A., Young, C., Norrie, T., Thorson, G., Luu, D., 2015. Neural Network Processor, Patent Application No. 62/164,931.

2. Ross, J., Phelps, A., 2015. Computing Convolutions Using a Neural Network Processor, Patent Application No. 62/164,902.

3. Ross, J., 2015. Prefetching Weights for a Neural Network Processor, Patent Application No. 62/164,981.

4. Ross, J., Thorson, G., 2015. Rotating Data for Neural Network Computations, Patent Application No. 62/164,908.

5. Thorson, G., Clark, C., Luu, D., 2015. Vector Computation Unit in a Neural Network Processor, Patent Application No. 62/165,022.

6. Young, C., 2015. Batch Processing in a Neural Network Processor, Patent Application No. 62/165,020.
