[exploratory DSP]
Deep Learning and Its Applications to Signal and Information Processing
Dong Yu and Li Deng
Today, signal processing research has significantly widened its scope compared with just a few years ago [4],
and machine learning has
been an important technical area of the
signal processing society. Since 2006,
deep learning—a new area of machine
learning research—has emerged [7],
impacting a wide range of signal and
information processing work within the
traditional and the new, widened scopes.
Various workshops, such as the 2009
ICML Workshop on Learning Feature
Hierarchies; the 2008 NIPS Deep
Learning Workshop: Foundations and
Future Directions; and the 2009 NIPS
Workshop on Deep Learning for Speech
Recognition and Related Applications as
well as an upcoming special issue on deep
learning for speech and language process-
ing in IEEE Transactions on Audio,
Speech, and Language Processing (2010)
have been devoted exclusively to deep
learning and its applications to classical
signal processing areas. We have also seen
the government sponsor research on deep
learning (e.g., the DARPA deep learning
program, available at http://www.darpa.
mil/ipto/solicit/baa/BAA-09-40_PIP.pdf).
The purpose of this article is to intro-
duce the readers to the emerging technol-
ogies enabled by deep learning and to
review the research work conducted in
this area that is of direct relevance to sig-
nal processing. We also point out, in our view, future research directions that may attract interest from, and require the efforts of, more signal processing researchers and practitioners in this emerging area, in order to advance signal and information processing technology and applications.
INTRODUCTION TO DEEP LEARNING
Many traditional machine learning and
signal processing techniques exploit shal-
low architectures, which contain a single
layer of nonlinear feature transformation.
Examples of shallow architectures are
conventional hidden Markov models
(HMMs), linear or nonlinear dynamical
systems, conditional random fields
(CRFs), maximum entropy (MaxEnt)
models, support vector machines (SVMs),
kernel regression, and multilayer percep-
tron (MLP) with a single hidden layer. A
property common to these shallow learn-
ing models is the simple architecture that
consists of only one layer responsible for
transforming the raw input signals or fea-
tures into a problem-specific feature
space, which may be unobservable. Take the support vector machine as an example: it is a shallow linear separation model with one feature transformation layer when the kernel trick is used, and with zero feature transformation layers when the kernel trick is not used.
Human information processing mechanisms (e.g., vision and speech), however, suggest the need for deep architectures to extract complex structure and build internal representations from rich sensory inputs (e.g., natural images and their motion, speech, and
music). For example, human speech pro-
duction and perception systems are both
equipped with clearly layered hierarchical
structures in transforming information
from the waveform level to the linguistic
level and vice versa. It is natural to
believe that the state of the art can be
advanced in processing these types of
media signals if efficient and effective
deep learning algorithms are developed.
Signal processing systems with deep
architectures are composed of many lay-
ers of nonlinear processing stages, where
each lower layer’s outputs are fed to its
immediate higher layer as the input. The
successful deep learning techniques
developed so far share two additional key
properties: the generative nature of the
model, which typically requires an addi-
tional top layer to perform the discrimi-
native task, and an unsupervised
pretraining step that makes effective use
of large amounts of unlabeled training
data for extracting structures and regular-
ities in the input features.
A BRIEF HISTORY
The concept of deep learning originated
from artificial neural network research.
A multilayer perceptron with many hidden layers is a good example of a model with a deep architecture. Backpropagation, invented in the 1980s, has been a well-known algorithm for learning the weights of these networks. Unfortunately, backpropa-
gation alone does not work well in prac-
tice for learning networks with more than
a small number of hidden layers (see a
review and interesting analysis in [1]).
The pervasive presence of local optima in
the nonconvex objective function of the
deep networks is the main source of diffi-
culty in learning. Backpropagation is based on local gradient descent and usually starts from some random initial points. It often gets trapped in poor local optima, and the severity of the problem increases significantly as the depth of the networks increases. This
difficulty is partly responsible for steering most machine learning and signal processing research away from neural networks and toward shallow models that have convex loss functions (e.g., SVMs, CRFs, and MaxEnt models), for which the global optimum can be obtained efficiently, at the cost of less powerful models.
The optimization difficulty associated
with the deep models was empirically
alleviated when a reasonably efficient,
unsupervised learning algorithm was
introduced in 2006 by Hinton et al. [7] for
a class of deep generative models that
they called deep belief networks (DBNs). A
core component of the DBN is a greedy, layer-by-layer learning algorithm that optimizes the DBN weights with time complexity linear in the size and depth of the networks. Separately, and somewhat surprisingly, initializing the weights of an MLP with a correspondingly configured DBN often produces much better results than initializing them with random weights [1], [5]. As such, deep networks that are learned with unsupervised DBN pretraining followed by backpropagation fine-tuning are also called DBNs in the literature (e.g., [8] and [9]).
A DBN comes with additional attrac-
tive properties: 1) The learning algorithm
makes effective use of unlabeled data; 2) It
can be interpreted as a Bayesian probabilistic generative model; 3) The values of the hidden variables in the deepest layer are efficient to compute; and 4) The overfitting problem often observed in models with millions of parameters, such as DBNs, and the underfitting problem that often occurs in deep networks are both effectively addressed by the generative pretraining step.
The DBN training procedure is not the
only one that makes deep learning possi-
ble. Since the publication of the seminal
work of [7], numerous researchers have
been improving and applying the deep
learning techniques with success.
Another popular technique is to pretrain
the deep networks layer by layer by con-
sidering each pair of layers as a denoising
auto-encoder [1]. We will provide a brief overview of the original DBN work and the subsequent progress in the remainder of this article.
A PRIME ARCHITECTURE
OF DEEP LEARNING
In this section, we present a short tuto-
rial on the most extensively investigated
and widely deployed deep learning
architecture, the DBN, as originally
published in [7].
DBNs are probabilistic generative
models that are composed of multiple lay-
ers of stochastic, latent variables. The
unobserved variables can have binary val-
ues and are often called hidden units or
feature detectors. The top two layers have
undirected, symmetric connections
between them and form an associative
memory. The lower layers receive top-
down, directed connections from the
layer above. The states of the units in the
lowest layer, or the visible units, represent
an input data vector.
A DBN is built as a stack of its constit-
uents, called restricted Boltzmann
machines (RBMs) that we introduce next.
RESTRICTED BOLTZMANN
MACHINE
An RBM is a special type of Markov ran-
dom field that has one layer of (typically
Bernoulli) stochastic hidden units and
one layer of (typically Bernoulli or
Gaussian) stochastic visible or observable
units. RBMs can be represented as bipar-
tite graphs as shown in Figure 1, where
all visible units are connected to all hid-
den units, and there are no visible-visible
or hidden-hidden connections.
In an RBM, the joint distribution $p(v, h; \theta)$ over the visible units $v$ and hidden units $h$, given the model parameters $\theta$, is defined in terms of an energy function $E(v, h; \theta)$ as
$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z}, \qquad (1)$$
where $Z = \sum_{v}\sum_{h} \exp(-E(v, h; \theta))$ is a normalization factor or partition function, and the marginal probability that the model assigns to a visible vector $v$ is
$$p(v; \theta) = \frac{\sum_{h} \exp(-E(v, h; \theta))}{Z}. \qquad (2)$$
For a Bernoulli (visible)-Bernoulli (hidden) RBM, the energy function is defined as
$$E(v, h; \theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j, \qquad (3)$$
where $w_{ij}$ represents the symmetric interaction term between visible unit $v_i$ and hidden unit $h_j$, $b_i$ and $a_j$ are the bias terms, and $I$ and $J$ are the numbers of visible and hidden units. The conditional probabilities can be efficiently calculated as
$$p(h_j = 1 \mid v; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big), \qquad (4)$$
$$p(v_i = 1 \mid h; \theta) = \sigma\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i\Big), \qquad (5)$$
where $\sigma(x) = 1/(1 + \exp(-x))$. See a derivation in [1].
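To make (3)-(5) concrete, here is a minimal NumPy sketch (not from the original article) that evaluates the Bernoulli-Bernoulli energy and the two conditionals for a toy RBM and, because the model is tiny, also computes the partition function Z and the marginal p(v) of (1)-(2) by brute force; all layer sizes and parameter values are arbitrary illustrations.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
I, J = 4, 3                               # numbers of visible and hidden units
W = 0.1 * rng.standard_normal((I, J))     # symmetric interaction terms w_ij
b = np.zeros(I)                           # visible biases b_i
a = np.zeros(J)                           # hidden biases a_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # E(v, h; theta) of (3)
    return -(v @ W @ h) - b @ v - a @ h

def p_h_given_v(v):
    # p(h_j = 1 | v; theta) of (4)
    return sigmoid(v @ W + a)

def p_v_given_h(h):
    # p(v_i = 1 | h; theta) of (5)
    return sigmoid(W @ h + b)

v = rng.integers(0, 2, size=I).astype(float)
print("p(h = 1 | v) =", p_h_given_v(v))

# Brute-force Z and p(v; theta) of (1)-(2); feasible only for toy layer sizes.
states_v = [np.array(s, float) for s in product([0, 1], repeat=I)]
states_h = [np.array(s, float) for s in product([0, 1], repeat=J)]
Z = sum(np.exp(-energy(vv, hh)) for vv in states_v for hh in states_h)
p_v = sum(np.exp(-energy(v, hh)) for hh in states_h) / Z
print("p(v) =", p_v)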
Similarly, for a Gaussian (visible)-Bernoulli (hidden) RBM, the energy is
$$E(v, h; \theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2}\sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j. \qquad (6)$$
The corresponding conditional probabilities become
$$p(h_j = 1 \mid v; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big), \qquad (7)$$
$$p(v_i \mid h; \theta) = \mathcal{N}\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i,\; 1\Big), \qquad (8)$$
where $v_i$ takes real values and follows a Gaussian distribution with mean $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and variance one. Gaussian-Bernoulli RBMs can be used to convert real-valued stochastic variables to binary stochastic variables, which can then be further processed using the Bernoulli-Bernoulli RBMs.
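In the same spirit, a small sketch of the Gaussian-Bernoulli conditionals (7) and (8): the hidden units are sampled as Bernoulli variables, while each visible unit is drawn from a unit-variance Gaussian whose mean is set by the hidden layer. The feature dimensionality and parameter values are again placeholder assumptions.

import numpy as np

rng = np.random.default_rng(1)
I, J = 4, 3
W = 0.1 * rng.standard_normal((I, J))
b = np.zeros(I)                           # visible biases
a = np.zeros(J)                           # hidden biases
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # Bernoulli hidden units, p(h_j = 1 | v; theta) of (7)
    p = sigmoid(v @ W + a)
    return (rng.random(J) < p).astype(float)

def sample_v_given_h(h):
    # Real-valued visible units, p(v_i | h; theta) of (8): N(W h + b, 1)
    return (W @ h + b) + rng.standard_normal(I)

v = rng.standard_normal(I)                # e.g., mean- and variance-normalized features
h = sample_h_given_v(v)                   # binary representation passed to higher RBMs
v_reconstructed = sample_v_given_h(h)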
Taking the gradient of the log likelihood $\log p(v; \theta)$, we can derive the update rule for the RBM weights as
$$\Delta w_{ij} = E_{\mathrm{data}}(v_i h_j) - E_{\mathrm{model}}(v_i h_j), \qquad (9)$$
where $E_{\mathrm{data}}(v_i h_j)$ is the expectation observed in the training set and $E_{\mathrm{model}}(v_i h_j)$ is that same expectation under the distribution defined by the model. Unfortunately, $E_{\mathrm{model}}(v_i h_j)$ is intractable to compute, so the contrastive divergence (CD) approximation to the gradient is used, where $E_{\mathrm{model}}(v_i h_j)$ is replaced by running the Gibbs sampler initialized at the data for one full step [7].

[FIG1] An RBM with I visible units and J hidden units.
Careful training of RBMs is essential to
the success of applying deep learning to
practical problems. A practical guide to RBM training is provided in [6].
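The sketch below implements the one-step contrastive divergence (CD-1) approximation to (9) for a Bernoulli-Bernoulli RBM, in the spirit of [7] and the practical recipes of [6]; the learning rate, number of sweeps, and toy data are illustrative assumptions rather than recommended settings.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1):
    # One CD-1 step on a batch V of binary visible vectors (one row per example).
    # W: (I, J) weights, a: (J,) hidden biases, b: (I,) visible biases.
    # Positive phase: E_data(v_i h_j), hidden units driven by the data.
    ph_data = sigmoid(V @ W + a)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one full Gibbs step initialized at the data.
    pv_model = sigmoid(h_sample @ W.T + b)
    ph_model = sigmoid(pv_model @ W + a)
    # Approximation to (9): <v h>_data - <v h>_model, averaged over the batch;
    # analogous updates are applied to the bias terms.
    n = V.shape[0]
    W = W + lr * (V.T @ ph_data - pv_model.T @ ph_model) / n
    a = a + lr * (ph_data - ph_model).mean(axis=0)
    b = b + lr * (V - pv_model).mean(axis=0)
    return W, a, b

# Toy usage: 8-dimensional binary data, 5 hidden units, 10 sweeps over the batch.
V = (rng.random((100, 8)) < 0.3).astype(float)
W, a, b = 0.01 * rng.standard_normal((8, 5)), np.zeros(5), np.zeros(8)
for _ in range(10):
    W, a, b = cd1_update(V, W, a, b)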
FROM RBM TO DBN
Stacking a number of RBMs learned layer by layer from the bottom up gives rise to
a DBN, an example of which is shown in
Figure 2. The stacking procedure is as fol-
lows. After learning a Gaussian-Bernoulli
RBM (for applications with continuous fea-
tures such as speech) or Bernoulli-
Bernoulli RBM (for applications with
nominal or binary features such as black-
white image or coded text), we treat the
activation probabilities of its hidden units
as the data for training the Bernoulli-
Bernoulli RBM one layer up. The activa-
tion probabilities of the second-layer
Bernoulli-Bernoulli RBM are then used as
the visible data input for the third-layer
Bernoulli-Bernoulli RBM, and so on.
Theoretical justification of this efficient
layer-by-layer greedy learning strategy is
given in [7], where it is shown that the
stacking procedure above improves a varia-
tional lower bound on the likelihood of the
training data under the composite model.
That is, the greedy procedure above
achieves approximate maximum likelihood
learning. Note that this learning procedure
is unsupervised and requires no class label.
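A minimal sketch of this greedy stacking procedure, assuming binary input features so that every layer is a Bernoulli-Bernoulli RBM; train_rbm is a compact CD-1 trainer of the kind sketched earlier, and all hyperparameters are placeholders.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, epochs=10, lr=0.1):
    # Train one Bernoulli-Bernoulli RBM with CD-1 and return its parameters (W, a, b).
    I = V.shape[1]
    W, a, b = 0.01 * rng.standard_normal((I, n_hidden)), np.zeros(n_hidden), np.zeros(I)
    for _ in range(epochs):
        ph = sigmoid(V @ W + a)                                  # positive phase
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)                                # one Gibbs step
        ph2 = sigmoid(pv @ W + a)
        n = V.shape[0]
        W += lr * (V.T @ ph - pv.T @ ph2) / n
        a += lr * (ph - ph2).mean(axis=0)
        b += lr * (V - pv).mean(axis=0)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    # Greedy layer-by-layer pretraining; returns the list of per-layer RBM parameters.
    stack, inputs = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inputs, n_hidden)
        stack.append((W, a, b))
        # The hidden activation probabilities become the visible data one layer up.
        inputs = sigmoid(inputs @ W + a)
    return stack

data = (rng.random((200, 16)) < 0.3).astype(float)   # toy binary feature vectors
dbn = pretrain_dbn(data, layer_sizes=[32, 32, 16])   # a three-hidden-layer DBN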
When a DBN is applied to classification
tasks, the generative pretraining can be
followed by or combined with other, typ-
ically discriminative, learning proce-
dures that fine-tune all of the weights
jointly to improve the performance of
the DBN. This discriminative fine-tuning
is often performed by adding a final layer
of variables that represent the desired
outputs or labels provided in the train-
ing data. Then, the backpropagation
algorithm can be used to adjust or fine-
tune the DBN weights. For example, for
speech recognition, the output layer can
represent either syllables, phones, sub-
phones, phone states, or other speech
units used in the HMM-based speech
recognition system.
The learning procedure discussed
above is typically expensive compared
with the inference procedure, which can
be efficiently carried out by a single for-
ward pass. The inference procedure of a DBN is analogous to the forward pass of a conventional MLP.
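The following sketch shows how the pretrained stack can be used for classification and inference: the DBN weights initialize the sigmoid hidden layers, a final softmax layer supplies the label variables, and inference is the single forward pass described above. Random weights stand in for the pretrained and fine-tuned parameters so the sketch is self-contained; the joint backpropagation fine-tuning itself is not shown.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dbn_forward(x, stack, W_out, c_out):
    # Single forward pass: pretrained sigmoid hidden layers, then a softmax label layer.
    for W, a in stack:
        x = sigmoid(x @ W + a)
    return softmax(x @ W_out + c_out)

# In practice each (W, a) comes from the generative pretraining and is then fine-tuned
# jointly with (W_out, c_out) by backpropagation; random values are used here only to
# keep the sketch self-contained.
sizes = [16, 32, 32, 16]
stack = [(0.1 * rng.standard_normal((sizes[k], sizes[k + 1])), np.zeros(sizes[k + 1]))
         for k in range(len(sizes) - 1)]
W_out, c_out = 0.1 * rng.standard_normal((sizes[-1], 10)), np.zeros(10)   # ten labels

x = (rng.random((5, 16)) < 0.3).astype(float)
posteriors = dbn_forward(x, stack, W_out, c_out)   # class posteriors, one row per input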
APPLICATIONS OF DEEP LEARNING
TO SIGNAL PROCESSING AREAS
In the expanded technical scope of signal processing, the signal includes not only the traditional types, such as audio, speech, image, and video, but also text, language, and documents that convey high-level semantic information for human consumption. In addition,
the scope of processing has been extend-
ed from the conventional coding,
enhancement, analysis, and recognition
to include more human-centric tasks of
interpretation, understanding, retrieval,
mining, and user interface [4]. Many sig-
nal processing researchers have been
working on one or more of the signal
processing areas defined by the matrix
constructed with the two axes of “signal”
and “processing” discussed here. The
deep learning techniques discussed in
this article have recently been applied to
quite a number of extended signal pro-
cessing areas. We now provide a brief
survey of this body of work in three
main categories. Due to the limitation
on the number of references, we have
omitted some reference listings in the
following survey.
SPEECH AND AUDIO
The traditional MLP has been in use for speech recognition for many years, but when used alone, its performance is typically lower than that of state-of-the-art HMM systems with observation probabilities approximated with Gaussian mixture
models (GMMs). Recently, the deep learn-
ing technique was successfully applied to
phone [8], [9] and large vocabulary con-
tinuous speech recognition (LVCSR) tasks
by integrating the powerful discrimina-
tive training ability of the DBNs and the
sequential modeling ability of the HMMs.
Such a model as shown in Figure 3 is typ-
ically named DBN-HMM, where the
observation probability is estimated using
the DBN and the sequential information
is modeled using the HMM.
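In hybrid systems of this kind, a common recipe (an assumption here, not a detail given in the article) is to convert the DBN's per-frame state posteriors into scaled likelihoods by dividing by the state priors, so that they can replace the GMM observation probabilities inside the HMM decoder:

import numpy as np

def scaled_likelihoods(posteriors, state_priors, floor=1e-10):
    # Convert per-frame state posteriors p(s | x_t) into log scaled likelihoods
    # log p(x_t | s) + const = log p(s | x_t) - log p(s), which stand in for the
    # GMM observation log-likelihoods during HMM (Viterbi) decoding.
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(state_priors, floor))

# Toy example: 4 frames, 3 HMM states (e.g., phone states or senones).
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=4)     # DBN outputs; each row sums to one
priors = np.array([0.5, 0.3, 0.2])           # state priors, e.g., from forced alignments
log_obs = scaled_likelihoods(post, priors)   # plugged into the HMM decoder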
In [9], a five-layer DBN was used to
replace the Gaussian mixture component
of the GMM-HMM, and the monophone state was used as the modeling unit.
Although the monophone model was
used, the DBN-HMM approach achieved phone recognition accuracy competitive with state-of-the-art triphone GMM-HMM systems.
The work in [8] improved the DBN-
HMM used in [9] by using the CRF
instead of the HMM to model the sequen-
tial information and by applying the max-
imum mutual information (MMI)
training technique successfully developed
in speech recognition to the DBN-CRF
training. The sequential discriminative
learning technique developed in [8]
jointly optimizes the DBN weights, transi-
tion weights, and phone language model
and achieved higher accuracy than the
DBN-HMM phone recognizer with the
frame-discriminative training criterion
implicit in the DBN’s fine-tuning proce-
dure implemented in [9].
The DBN-HMM can be extended from
the context-independent model to the
context-dependent model and from phone recognition to LVCSR.
[FIG2] The DBN model used for classification. The hidden layers are generatively pretrained layer by layer by considering each pair of layers as an RBM. The output layer has labels from the supervised data.

Experiments on the challenging Bing mobile voice search data set collected
under the real usage scenario demon-
strate that the context-dependent DBN-
HMM significantly outperforms the
state-of-the-art HMM system. Three factors contribute to the success: the use of triphone senones as the DBN modeling units, the use of the best available triphone GMM-HMM to generate the senone alignment, and the tuning of the transition probabilities. Experiments also indicate that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM.
In [5], the deep auto-encoder [7] is explored on the speech feature coding problem, with the goal of compressing the data to a predefined number of bits with
minimal reproduction error. DBN pre-
training is found to be crucial for high
coding efficiency. When DBN pretraining
is used, the deep auto-encoder is shown
to significantly outperform a traditional
vector quantization technique. If weights
in the deep auto-encoder are randomly
initialized, the performance is substan-
tially degraded.
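As a rough, assumption-laden illustration of the auto-encoder setup in [5], the sketch below passes real-valued feature vectors through a narrow code layer and measures the reproduction error; in the actual work the weights are initialized by DBN pretraining and the code layer is quantized to a fixed number of bits, neither of which is reproduced here.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def autoencode(x, enc, dec):
    # Encoder layers down to a narrow code, mirrored decoder layers back up;
    # the last decoder layer is linear because the features are real valued.
    code = x
    for W, a in enc:
        code = sigmoid(code @ W + a)
    recon = code
    for idx, (W, a) in enumerate(dec):
        recon = recon @ W + a
        if idx < len(dec) - 1:
            recon = sigmoid(recon)
    return code, recon

# Toy 39-dimensional "speech feature" vectors squeezed through an 8-unit code layer.
sizes = [39, 128, 8]
enc = [(0.1 * rng.standard_normal((sizes[k], sizes[k + 1])), np.zeros(sizes[k + 1]))
       for k in range(len(sizes) - 1)]
dec = [(0.1 * rng.standard_normal((sizes[k], sizes[k - 1])), np.zeros(sizes[k - 1]))
       for k in range(len(sizes) - 1, 0, -1)]
x = rng.standard_normal((10, 39))
code, recon = autoencode(x, enc, dec)
mse = float(((x - recon) ** 2).mean())       # reproduction error minimized during training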
Another popular deep model is the
convolutional DBN, which has been
applied to audio and speech data for a
number of tasks including music artist
and genre classification, speaker identifi-
cation, speaker gender classification,
and phone classification, with strong
results presented.
Other deep models have also been
developed and presented. For example,
the deep-structured CRF, which stacks many layers of CRFs, has been successfully used in the speech-related tasks of language identification, phone recognition, sequential labeling [15], and confidence calibration.
IMAGE AND VIDEO
The original DBN and deep auto-encoder
were developed and demonstrated with
success on the simple image recognition
and dimensionality reduction (coding)
tasks (MNIST) in [7]. It is interesting to
note that the gain of coding efficiency
using the DBN-based auto-encoder on
the image data over the conventional
method of principal component analysis
as demonstrated in [7] is very similar to
the gain reported in [5] on the speech
data over the traditional technique of vec-
tor quantization.
In [10], Nair and Hinton developed a
modified DBN where the top-layer model
uses a third-order Boltzmann machine.
They applied this type of DBN to the
NORB database—a three-dimensional
object recognition task. An error rate
close to the best published result on this
task was reported. In particular, it was
shown that the DBN substantially outper-
forms shallow models such as SVMs.
Tang and Eliasmith developed two
strategies to improve the robustness of
the DBN in [14]. First, they used sparse
connections in the first layer of the DBN
as a way to regularize the model. Second,
they developed a probabilistic denoising
algorithm. Both techniques are shown to
be effective in improving the robustness
against occlusion and random noise in a
noisy image recognition task. Another
interesting work on image recognition
with a more general approach than DBN
appears in [11].
DBNs have also been successfully
applied to create compact but meaningful
representations of images for retrieval
purposes. On large-collection image retrieval tasks, deep learning approaches have also produced strong results.
The u