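Attention is one of the key mechanisms behind biological vision and cognition, and this paper offers an implementation that comes closer to how biological attention works.
We first present the 12-page slide deck for the paper, followed by a passage-by-passage reading of the paper.
Slides:
Paper walkthrough: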
Abstract
We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps.
The goal is a top-down, task-specific attention model for CNNs. Doesn't "top-down" feel a bit like attention being steered by self-awareness?
Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process.
Taking its cue from a model of human visual attention, the paper builds a new backpropagation scheme, Excitation Backprop, that passes top-down signals down the network hierarchy through a probabilistic Winner-Take-All process.
Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative.
Contrastive attention is introduced next; it makes the top-down attention maps more discriminative.
The rest of the abstract summarizes the experimental results:
In experiments, we demonstrate the accuracy and generalizability of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images.
1 Introduction
Top-down task-driven attention is an important mechanism for efficient visual search. In other words, attention steered by the task at hand is what makes visual search efficient.
Various top-down attention models have been proposed, e.g. [1,2,3,4]. (These are the prior works.)
Among them, the Selective Tuning attention model [3] provides a biologically plausible formulation.
Of these, the Selective Tuning attention model is the closest to biology.
Assuming a pyramidal neural network for visual processing, the Selective Tuning model is composed of a bottom-up sweep of the network to process input stimuli, and a top-down Winner-Take-All (WTA) process to localize the most relevant neurons in the network for a given top-down signal.
That is, visual input is processed by a bottom-up sweep through a pyramidal network, and a top-down Winner-Take-All (WTA) pass then localizes the neurons most relevant to the given top-down attention signal.
Inspired by the Selective Tuning model, we propose a top-down attention formulation for modern CNN classifiers. Instead of the deterministic WTA process used in [3], which can only generate binary attention maps, we formulate the top-down attention of a CNN classifier as a probabilistic WTA process.
Building on Selective Tuning, the paper proposes a top-down attention formulation for CNN classifiers.
The earlier WTA model can only produce binary attention maps, whereas this one is probabilistic.
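The probabilistic WTA formulation is realized by a novel backpropagation scheme, called Excitation Backprop, which integrates both top-down and bottom-up information to compute the winning probability of each neuron efficiently.
The probabilistic WTA is realized by a new backpropagation scheme, Excitation Backprop, which combines top-down and bottom-up information to compute each neuron's winning probability efficiently.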
Interpretable attention maps can be generated by Excitation Backprop at intermediate convolutional layers, thus avoiding the need to perform a complete backward sweep.
Excitation Backprop can produce interpretable attention maps at intermediate layers of the CNN, so a full backward pass is not required.
We further introduce the concept of contrastive top-down attention, which captures the differential effect between a pair of contrastive top-down signals. The contrastive top-down attention can significantly improve the discriminativeness of the generated attention maps.
The paper also introduces contrastive top-down attention, which captures the difference in effect between a pair of contrasting top-down signals. This contrast markedly improves how discriminative the generated attention maps are.
In experiments, our method achieves superior weakly supervised localization performance vs. [5,6,7,8,9] on challenging datasets such as PASCAL VOC [10] and MS COCO [11]. (The upshot: the method outperforms these baselines.)
We further explore the scalability of our method for localizing a large number of visual concepts. (This is a scalability test.)
For this purpose, we train a CNN tag classifier to predict ∼18K tags using 6M weakly labeled web images. (This is the training set.)
By leveraging our top-down attention model, our image tag classifier can be used to localize a variety of visual concepts. (The approach extends to widely varying visual concepts.)
Moreover, our method can also help to understand what has been learned by our tag classifier. Some examples are shown in Fig. 1. (The method also helps reveal what the tag classifier has learned.)
The performance of our large-scale tag localization method is evaluated on the challenging Flickr30k Entities dataset [12]. Without using a language model or any localization supervision, our top-down attention based approach achieves competitive phrase-to-region performance vs. a fully-supervised baseline [12].
Without a language model or any localization supervision, the approach is competitive with a fully supervised baseline.
To summarize, the main contributions of this paper are:
– a top-down attention model for CNN based on a probabilistic Winner-Take-All process using a novel Excitation Backprop scheme;
(a CNN top-down attention model built on a probabilistic Winner-Take-All process and realized with Excitation Backprop)
– a contrastive top-down attention formulation for enhancing the discriminativeness of attention maps; and
(a contrastive top-down attention formulation that makes the attention maps more discriminative)
– a large-scale empirical exploration of weakly supervised text-to-region association by leveraging the top-down neural attention model.
(a large-scale exploration of weakly supervised text-to-image-region association using the top-down attention model)
2 Related Work
There is a rich literature about modeling the top-down influences on selective attention in the human visual system (see [13] for a review). It is hypothesized that top-down factors like knowledge, expectations and behavioral goals can affect the feature and location expectancy in visual processing [4,14,1,15], and bias the competition among the neurons [16,3,15,17,18]. Our attention model is related to the Selective Tuning model of [3], which proposes a biologically inspired attention model using a top-down WTA inference process.
There is a large literature on top-down selective attention in the human visual system. The hypothesis is that top-down factors such as knowledge, expectations and goals influence feature and location expectancy in visual processing and bias the competition among neurons. The paper's model relates to Selective Tuning, a biologically inspired attention model that uses a top-down WTA inference process.
Various methods have been proposed for grounding a CNN classifier's prediction [5,6,7,8,19,9]. In [5,6,20], error backpropagation based methods are used for visualizing relevant regions for a predicted class or the activation of a hidden neuron. Recently, a layer-wise relevance backpropagation method is proposed by [9] to provide a pixel-level explanation of CNNs' classification decisions. Cao et al. [7] propose a feedback CNN architecture for capturing the top-down attention mechanism that can successfully identify task relevant regions. In [19], it is shown that replacing fully-connected layers with an average pooling layer can help generate coarse class activation maps that highlight task relevant regions. Unlike these previous methods, our top-down attention model is based on the WTA principle, and has an interpretable probabilistic formulation. Our method is also conceptually simpler than [7,19] as we do not require modifying a network's architecture or additional training. The ultimate goal of our method goes beyond visualization and explanation of a classifier's decision [6,20,9], as we aim to maneuver CNNs' top-down attention to generate highly discriminative attention maps for the benefits of localization.
This paragraph surveys earlier neural attention methods and their characteristics. The paper positions its own method as the furthest step forward: it has a probabilistic formulation, needs no architecture change or extra training, and targets localization rather than visualization alone.
Training CNN models for weakly supervised localization has been studied by [21,22,23,24,25]. In [21,25,24], a CNN model is transformed into a fully convolutional net to perform efficient sliding window inference, and then Multiple Instance Learning (MIL) is integrated in the training process through various pooling methods over the confidence score map. Due to the large receptive field and stride of the output layer, the resultant score maps only provide very coarse location information. To overcome this issue, a variety of strategies, e.g. image re-scaling and shifting, have been proposed to increase the granularity of the score maps [21,24,26]. Image and object priors are also leveraged to improve the object localization accuracy in [22,23,24]. Compared with weakly supervised localization, the problem setting of our task is essentially different. We assume a pre-trained deep CNN model is given, which may not use any dedicated training process or model architecture for the purpose of localization. Our focus, instead, is to model the top-down attention mechanism of generic CNN models to produce interpretable and useful task-relevant attention maps.
The goal here is different: rather than training a network specifically for localization, the paper models the top-down attention of a generic pre-trained CNN.
3 Method
3.1 Top-down Neural Attention based on Probabilistic WTA
We consider a generic feedforward neural network model. The goal of a top-down attention model is to identify the task-relevant neurons in the network.
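The setting is an ordinary feedforward neural network, and the job of a top-down attention model is to pick out the neurons in it that are relevant to the task.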
Given a selected output unit, a deterministic top-down WTA scheme is used in the biologically inspired Selective Tuning model [3] to localize the most relevant neurons in the processing cone (see Fig. 2 (a)) and generate a binary attention map.
The biologically inspired Selective Tuning model uses a deterministic top-down WTA scheme to locate the most relevant neurons in the processing cone and to produce a binary attention map.
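Inspired by the deterministic WTA, we propose a probabilistic WTA formulation to model a neural network's top-down attention (Fig. 2 (b) and (c)), which leverages more information in the network and generates soft attention maps that can capture subtle differences between top-down signals. This is critical to our contrastive attention formulation in Sec. 3.3.
Starting from the deterministic WTA, the paper moves to a probabilistic WTA for top-down attention. It draws on more of the information in the network and yields soft attention maps that can pick up subtle differences between top-down signals.
This matters for the contrastive attention introduced later.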
In our formulation, the top-down signal is specified by a prior distribution $P(A_0)$ over the output units, which can model the uncertainty in the top-down control process. Then the winner neurons are recursively sampled in a top-down fashion based on a conditional winning probability $P(A_t \mid A_{t-1})$, where $A_t, A_{t-1} \in \mathcal{N}$ denote the selected winner neuron at the current and the previous step respectively, and $\mathcal{N}$ is the overall neuron set. We formulate the top-down relevance of each neuron as its probability of being selected as a winner in this process. Formally, given a neuron $a_j \in \mathcal{N}$ (note that $a_j$ denotes a specific neuron and $A_t$ denotes a variable over the neurons), we would like to compute its Marginal Winning Probability (MWP) $P(a_j)$. The MWP $P(a_j)$ can be factorized as
$$P(a_j) = \sum_{a_i \in \mathcal{P}_j} P(a_j \mid a_i)\, P(a_i), \qquad (1)$$
where $\mathcal{P}_j$ is the parent node set of $a_j$ (in top-down order). As Eqn. 1 indicates, given $P(a_j \mid a_i)$, $P(a_j)$ is a function of the marginal winning probability of the parent nodes in the preceding layers. It follows that $P(a_j)$ can be computed in a top-down layer-wise fashion.
To recap: the top-down signal is a prior $P(A_0)$ over the output units, which also models uncertainty in the top-down control process. Winner neurons are then sampled recursively from the top down according to $P(A_t \mid A_{t-1})$, and a neuron's top-down relevance is its probability of being selected as a winner, the MWP. Because $P(a_j)$ depends only on the MWPs of its parent nodes, it can be computed layer by layer from the top.
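The layer-wise use of Eqn. 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `propagate_mwp`, the toy network sizes and all the probability values are made up for the example and do not come from the paper.

```python
def propagate_mwp(parent_mwp, cond_prob):
    """One top-down step of Eqn. 1.

    parent_mwp[i]   : P(a_i) for each parent neuron a_i
    cond_prob[i][j] : P(a_j | a_i); each row sums to 1
    returns         : P(a_j) for each child neuron a_j
    """
    n_children = len(cond_prob[0])
    return [
        sum(parent_mwp[i] * cond_prob[i][j] for i in range(len(parent_mwp)))
        for j in range(n_children)
    ]


# Hypothetical toy network: 2 output units -> 3 hidden units -> 2 input units.
prior = [1.0, 0.0]  # P(A0): all attention on output unit 0
p_out_to_hidden = [[0.5, 0.5, 0.0],
                   [0.0, 0.25, 0.75]]
p_hidden_to_input = [[1.0, 0.0],
                     [0.5, 0.5],
                     [0.0, 1.0]]

hidden_mwp = propagate_mwp(prior, p_out_to_hidden)
input_mwp = propagate_mwp(hidden_mwp, p_hidden_to_input)
print(hidden_mwp)  # [0.5, 0.5, 0.0]
print(input_mwp)   # [0.75, 0.25]
```

Each layer's MWP still sums to one here, because every row of the conditional tables is a probability distribution.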
Our formulation is equivalent to an absorbing Markov chain process [27]. A Markov Chain is an absorbing chain if 1) there is at least one absorbing state and 2) it is possible to go from any state to at least one absorbing state in a finite number of steps. Any walk will eventually end at one of the absorbing states. Non-absorbing states are called Transient States.
The formulation is equivalent to an absorbing Markov chain, which must satisfy two conditions: 1) it has at least one absorbing state, and 2) from any transient state, at least one absorbing state can be reached in a finite number of steps.
For an absorbing Markov Chain, the canonical form of the transition matrix $P$ can be represented by
$$P = \begin{bmatrix} Q & R \\ 0 & I_r \end{bmatrix}, \qquad (2)$$
where the entry $p_{ij}$ is the transition probability from state $i$ to $j$, $Q$ collects the transitions among transient states, and $R$ collects the transitions from transient states to absorbing states. Each row sums up to one and $I_r$ is an $r \times r$ identity matrix corresponding to the $r$ absorbing states. In our formulation, each random walk starts from an output neuron and ends at some absorbing node of the bottom layer in the network; and $p_{ij} := P(a_j \mid a_i)$ is the transition probability.
The fundamental matrix of the absorbing Markov chain process is
$$N = \sum_{k=0}^{\infty} Q^k = (I_t - Q)^{-1}, \qquad (3)$$
where $I_t$ is the identity matrix over the $t$ transient states.
The $(i, j)$ entry of $N$ can be interpreted as the expected number of visits to node $j$, given that the walker starts at $i$. In our formulation, the MWP $P(a_j)$ can then be interpreted as the expected number of visits when a walker starts from a random node of the output layer according to $P(A_0)$. This expected number of visits can be computed by a simple matrix multiplication using the fundamental matrix of the absorbing Markov chain. In this light, the MWP $P(a_i)$ is a linear function of the top-down signal $P(A_0)$, which will be shown to be convenient later (see Sec. 3.3). In practice, our Excitation Backprop does the computation in a layer-wise fashion, without the need to explicitly construct the fundamental matrix. This layer-wise propagation is possible due to the acyclic nature of the feedforward network.
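The Markov chain view can be checked numerically on a small case. The sketch below is a hedged illustration, not the paper's code: the helper names `matmul` and `fundamental_matrix`, the state ordering and the toy transition values are all assumptions. It builds the fundamental matrix as the finite sum I + Q + Q^2 + ..., which terminates only because an acyclic feedforward network makes the transient block nilpotent.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fundamental_matrix(q):
    """N = I + Q + Q^2 + ...; assumes Q is nilpotent (acyclic network)."""
    n = len(q)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in q]
    while any(x != 0.0 for row in power for x in row):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, q)
    return total


# Hypothetical toy chain. Transient states: output units 0-1, hidden units 2-4.
# Absorbing states: the two bottom-layer units.
q = [[0.0, 0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.25, 0.75],
     [0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]]
r = [[0.0, 0.0],
     [0.0, 0.0],
     [1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]

start = [[1.0, 0.0, 0.0, 0.0, 0.0]]  # P(A0) placed on output unit 0
n_mat = fundamental_matrix(q)
visits = matmul(start, n_mat)[0]     # expected visits to each transient state
absorbed = matmul([visits], r)[0]    # probability of ending at each bottom unit
print(visits)    # [1.0, 0.0, 0.5, 0.5, 0.0]
print(absorbed)  # [0.75, 0.25]
```

The hidden-unit entries of `visits` and the values in `absorbed` are what a layer-by-layer application of Eqn. 1 gives on the same toy network, without ever forming the matrix.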
3.2 Excitation Backprop
In this section, we propose the Excitation Backprop method to realize the probabilistic WTA formulation for modern CNN models.
Excitation Backprop is the method that implements the probabilistic WTA.
A modern CNN model [28,29,30] is mostly composed of a basic type of neuron $a_i$, whose response is computed by $a_i = \varphi(\sum_j w_{ji} a_j + b_i)$. (This is the usual CNN neuron formula.) Here $w_{ji}$ is the weight, $a_j$ is the input, $b_i$ is the bias and $\varphi$ is the nonlinear activation function. We call this type of neuron an Activation Neuron. We have the following assumptions about the activation neurons.
A1. The response of the activation neuron is non-negative.
A2. An activation neuron is tuned to detect certain visual features. Its response is positively correlated to its confidence of the detection.
In other words, the stronger the response, the more confident the detection.
A1 holds for a majority of the modern CNN models, as they adopt the Rectified Linear Unit (ReLU) as the activation function. A2 has been empirically verified by many recent works [19,6,31,32]. It is observed that neurons at lower layers detect simple features like edge and color, while neurons at higher layers can detect complex features like objects and body parts.
Between activation neurons, we define a connection to be excitatory if its weight is non-negative, and inhibitory otherwise.
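Between activation neurons, then, a connection counts as excitatory when its weight is non-negative and as inhibitory when its weight is negative.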
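Our Excitation Backprop passes top-down signals through excitatory connections between activation neurons. (The signal travels from the top down along excitatory connections only.) Formally, let $C_i$ denote the child node set of $a_i$ (in the top-down order). For each $a_j \in C_i$, the conditional winning probability $P(a_j \mid a_i)$ is defined as
$$P(a_j \mid a_i) = \begin{cases} Z_i\, a_j\, w_{ji} & \text{if } w_{ji} \ge 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$$
where $Z_i = 1 / \sum_{j:\, w_{ji} \ge 0} a_j w_{ji}$ is a normalization factor so that $\sum_{a_j \in C_i} P(a_j \mid a_i) = 1$.
Eqn. 4 assumes that if $a_i$ is a winner neuron, the next winner neuron will be sampled among its child node set $C_i$ based on the connection weight $w_{ji}$ and the input neuron's response $a_j$. The weight $w_{ji}$ captures the top-down feature expectancy, while $a_j$ represents the bottom-up feature strength, as assumed in A2. Due to A1, child neurons of $a_i$ with negative connection weights always have an inhibitory effect on $a_i$, and thus are excluded from the competition.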
Eqn. 4 recursively propagates the top-down signal layer by layer, and we can compute attention maps from any intermediate convolutional layer. For our method, we simply take the sum across channels to generate a marginal winning probability (MWP) map as our attention map, which is a 2D probability histogram. Fig. 3 shows some example MWP maps generated using the pre-trained VGG16 model [29]. Neurons at higher-level layers have larger receptive fields and strides. Thus, they can capture larger areas but with lower spatial accuracy. Neurons at lower layers tend to more precisely localize features at smaller scales.
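One step of this propagation through a single fully connected layer can be sketched as follows. It is a minimal illustration, not the authors' implementation: the function name `excitation_backprop_layer`, the weight layout `weights[i][j]` (weight from child j to parent i) and the numbers are assumptions, and a parent with no excitatory input is assumed to pass nothing down.

```python
def excitation_backprop_layer(parent_mwp, weights, child_act):
    """Propagate MWP from parents a_i to children a_j through one linear layer.

    parent_mwp[i] : P(a_i) for each parent neuron
    weights[i][j] : weight of the connection from child j to parent i
    child_act[j]  : non-negative response of child j (assumption A1)
    """
    child_mwp = [0.0] * len(child_act)
    for i, p_i in enumerate(parent_mwp):
        # Only excitatory connections (weight >= 0) take part in the competition.
        excit = [max(w, 0.0) * a for w, a in zip(weights[i], child_act)]
        norm = sum(excit)
        if norm == 0.0:
            # Assumption: no excitatory input means this parent passes nothing down.
            continue
        for j, e in enumerate(excit):
            child_mwp[j] += p_i * e / norm
    return child_mwp


# Hypothetical layer: 2 parent neurons, 3 child neurons.
weights = [[2.0, -1.0, 1.0],
           [-3.0, 1.0, 1.0]]
child_act = [1.0, 4.0, 2.0]

mwp_single = excitation_backprop_layer([1.0, 0.0], weights, child_act)
mwp_mixed = excitation_backprop_layer([0.5, 0.5], weights, child_act)
print(mwp_single)  # [0.5, 0.0, 0.5]
print(mwp_mixed)
```

For parent 0, child 1 has the strongest response but a negative weight, so it receives no probability. For a convolutional layer the same rule applies per spatial location, and summing the resulting MWP over channels gives the attention map described above.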
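For a class of activation functions that is lower bounded, e.g. the sigmoid function, tanh function and the Exponential Linear Unit (ELU) function [33], we can slightly modify our formulation of Excitation Backprop. Suppose $\lambda$ is the minimum value in the range of the activation function. The modified formulation corresponding to Eqn. 4 is
$$P(a_j \mid a_i) = \begin{cases} Z_i\, (a_j - \lambda)\, w_{ji} & \text{if } w_{ji} \ge 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$
where $Z_i$ again normalizes the probabilities over the child node set. Subtracting $\lambda$ shifts every response to be non-negative, so assumption A1 holds for the shifted responses.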
3.3 Contrastive Top-down Attention
Since the MWP is a linear function of the top-down signal (see Sec. 3.1), we can compute any linear combination of MWP maps for an image by a single backward pass. All we need to do is linearly combine the top-down signal vectors at the top layer before performing the Excitation Backprop. In this section, we take advantage of this property to generate highly discriminative top-down attention maps by passing down pairs of contrastive signals.
For each output unit $o_i$, we virtually construct a dual unit $\bar{o}_i$, whose input weights are the negation of those of $o_i$. For example, if an output unit corresponds to an elephant classifier, then its dual unit will correspond to a non-elephant classifier. Subtracting the MWP map for non-elephant from the one for elephant will cancel out common winner neurons and amplify the discriminative neurons for elephant. We call the resulting map a contrastive MWP map, which can be computed by a single backward pass. Fig. 4 shows some examples.
Formally, let $W_1$ be the weights of the top layer, and $P_1$ be the corresponding transition matrix whose entries are the conditional probabilities defined by Eqn. 4. Suppose the number of the neurons at the top is $m$ and at the next lower layer is $n$, so that $P_1$ is an $m \times n$ matrix.
The weights of the contrastive output units are the negation of the original weights at the top layer, namely $-W_1$. Let $\bar{P}_1$ denote the corresponding transition matrix. Regarding $\bar{P}_1$, the entries that are positive were previously thresholded in $P_1$ according to Eqn. 4 and vice versa. For example, $p_{ij} > 0$ in $P_1$ indicates $\bar{p}_{ij} = 0$ in $\bar{P}_1$.
The MWP of a target layer, say the n-th layer from the top, is formulated as
$$C = P_0 \cdot P_1 \cdot P_2 \cdots P_{n-1}, \qquad (6)$$
and the dual MWP for the contrastive output units is
$$\bar{C} = P_0 \cdot \bar{P}_1 \cdot P_2 \cdots P_{n-1}, \qquad (7)$$
where $P_0$ is the input top-down signal in the form of a horizontal vector. The resultant contrastive MWP is formulated as
$$C - \bar{C} = P_0 \cdot (P_1 - \bar{P}_1) \cdot P_2 \cdots P_{n-1}. \qquad (8)$$
In practice, we compute $P_0 \cdot P_1$ and $P_0 \cdot \bar{P}_1$ respectively by Excitation Backprop. Then, we do the subtraction and propagate the contrastive signals $P_0 \cdot (P_1 - \bar{P}_1)$ downwards by Excitation Backprop again. Moreover, we truncate the contrastive MWP map at zero so that only positive parts are kept. Our probabilistic formulation ensures that there are always some positive parts on the contrastive MWP map, unless the MWP map and its dual are identical.
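The contrastive procedure just described can be sketched on a toy two-layer network. Everything here is illustrative and assumed rather than taken from the paper's code: the helper `excitation_backprop_layer`, the weight layout (`weights[i][j]` is the weight from child j to parent i), and all weights and activations.

```python
def excitation_backprop_layer(parent_signal, weights, child_act):
    """One Excitation Backprop step; linear in parent_signal, which may be signed."""
    child_signal = [0.0] * len(child_act)
    for i, s_i in enumerate(parent_signal):
        excit = [max(w, 0.0) * a for w, a in zip(weights[i], child_act)]
        norm = sum(excit)
        if norm == 0.0:  # assumption: nothing is passed down in this case
            continue
        for j, e in enumerate(excit):
            child_signal[j] += s_i * e / norm
    return child_signal


# Hypothetical network: 2 output units -> 3 hidden units -> 2 bottom units.
p0 = [1.0, 0.0]                            # top-down signal: output unit 0
w1 = [[2.0, -1.0, 1.0], [-3.0, 1.0, 1.0]]  # top-layer weights W1
w1_dual = [[-w for w in row] for row in w1]  # dual units use -W1
hidden_act = [1.0, 4.0, 2.0]
w2 = [[1.0, 1.0], [0.0, 2.0], [3.0, -1.0]]
bottom_act = [1.0, 1.0]

pos = excitation_backprop_layer(p0, w1, hidden_act)       # P0 . P1
neg = excitation_backprop_layer(p0, w1_dual, hidden_act)  # P0 . P1-bar
diff = [a - b for a, b in zip(pos, neg)]                  # contrastive signal

# Propagate the signed difference downwards, then truncate at zero.
contrastive = [max(v, 0.0)
               for v in excitation_backprop_layer(diff, w2, bottom_act)]
plain = excitation_backprop_layer(pos, w2, bottom_act)
print(plain)        # [0.75, 0.25]
print(contrastive)  # [0.75, 0.0]
```

In this toy case the plain MWP still assigns 0.25 to the second bottom unit, while the contrastive map removes it, because that unit is driven mostly by the hidden unit that the dual (negated) classifier favors.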
3.4 Implementation of Excitation Backprop
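This section describes the concrete implementation architecture; it is best read in the original paper alongside the reference code.
The remainder of the paper covers the experimental results and conclusions, which are skipped here; see the paper directly if needed.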