[exploratory DSP]
Deep Learning and Its Applications to Signal and Information Processing
Dong Yu and Li Deng
Today, signal processing research has significantly widened its scope compared with just a few years ago [4],
and machine learning has
been an important technical area of the
signal processing society. Since 2006,
deep learning—a new area of machine
learning research—has emerged [7],
impacting a wide range of signal and
information processing work within the
traditional and the new, widened scopes.
Various workshops, such as the 2009
ICML Workshop on Learning Feature
Hierarchies; the 2008 NIPS Deep
Learning Workshop: Foundations and
Future Directions; and the 2009 NIPS
Workshop on Deep Learning for Speech
Recognition and Related Applications as
well as an upcoming special issue on deep
learning for speech and language process-
ing in IEEE Transactions on Audio,
Speech, and Language Processing (2010)
have been devoted exclusively to deep
learning and its applications to classical
signal processing areas. We have also seen
the government sponsor research on deep
learning (e.g., the DARPA deep learning
program, available at http://www.darpa.
mil/ipto/solicit/baa/BAA-09-40_PIP.pdf).
The purpose of this article is to intro-
duce the readers to the emerging technol-
ogies enabled by deep learning and to
review the research work conducted in
this area that is of direct relevance to sig-
nal processing. We also point out, in our view, future research directions that may attract interest from, and require the efforts of, more signal processing researchers and practitioners in this emerging area, in order to advance signal and information processing technology and applications.
INTRODUCTION TO DEEP LEARNING
Many traditional machine learning and
signal processing techniques exploit shal-
low architectures, which contain a single
layer of nonlinear feature transformation.
Examples of shallow architectures are
conventional hidden Markov models
(HMMs), linear or nonlinear dynamical
systems, conditional random fields
(CRFs), maximum entropy (MaxEnt)
models, support vector machines (SVMs),
kernel regression, and multilayer percep-
tron (MLP) with a single hidden layer. A
property common to these shallow learn-
ing models is the simple architecture that
consists of only one layer responsible for
transforming the raw input signals or fea-
tures into a problem-specific feature
space, which may be unobservable. Take the support vector machine as an example: it is a shallow linear separation model with one feature transformation layer when the kernel trick is used, and with zero feature transformation layers when the kernel trick is not used.
Human information processing mechanisms (e.g., vision and speech), however, suggest the need for deep architectures to extract complex structure and build internal representations from rich sensory inputs (e.g., natural images and their motion, speech, and
music). For example, human speech pro-
duction and perception systems are both
equipped with clearly layered hierarchical
structures in transforming information
from the waveform level to the linguistic
level and vice versa. It is natural to
believe that the state of the art can be
advanced in processing these types of
media signals if efficient and effective
deep learning algorithms are developed.
Signal processing systems with deep
architectures are composed of many lay-
ers of nonlinear processing stages, where
each lower layer’s outputs are fed to its
immediate higher layer as the input. The
successful deep learning techniques
developed so far share two additional key
properties: the generative nature of the
model, which typically requires an addi-
tional top layer to perform the discrimi-
native task, and an unsupervised
pretraining step that makes effective use
of large amounts of unlabeled training
data for extracting structures and regular-
ities in the input features.
A BRIEF HISTORY
The concept of deep learning originated
from artificial neural network research.
A multilayer perceptron with many hidden layers is a good example of a model with a deep architecture. Backpropagation, invented in the 1980s, has been a well-known algorithm for learning the weights of these networks. Unfortunately, backpropa-
gation alone does not work well in prac-
tice for learning networks with more than
a small number of hidden layers (see a
review and interesting analysis in [1]).
The pervasive presence of local optima in
the nonconvex objective function of the
deep networks is the main source of diffi-
culty in learning. Backpropagation is based on local gradient descent and usually starts from some random initial points. It often gets trapped in poor local optima, and the severity of the problem increases significantly as the depth of the networks increases. This
difficulty is partly responsible for steering most machine learning and signal processing research away from neural networks and toward shallow models that have convex loss functions (e.g., SVMs, CRFs, and MaxEnt models), for which the global optimum can be obtained efficiently, at the cost of less powerful models.
The optimization difficulty associated
with the deep models was empirically
alleviated when a reasonably efficient,
unsupervised learning algorithm was
introduced in 2006 by Hinton et al. [7] for
a class of deep generative models that
they called deep belief networks (DBNs). A
core component of the DBN is a greedy, layer-by-layer learning algorithm that optimizes the DBN weights with time complexity linear in the size and depth of the networks. Separately, and somewhat surprisingly, initializing the weights of an MLP with a correspondingly configured DBN often produces much better results than initializing them with random weights [1], [5]. As such, deep networks that are learned with unsupervised DBN pretraining followed by backpropagation fine-tuning are also called DBNs in the literature (e.g., [8] and [9]).
A DBN comes with additional attrac-
tive properties: 1) The learning algorithm
makes effective use of unlabeled data; 2) It
can be interpreted as a Bayesian probabilistic generative model; 3) The values of the hidden variables in the deepest layer are efficient to compute; and 4) The overfitting problem often observed in models with millions of parameters, such as DBNs, and the underfitting problem that often occurs in deep networks are both effectively addressed by the generative pretraining step.
The DBN training procedure is not the
only one that makes deep learning possi-
ble. Since the publication of the seminal
work of [7], numerous researchers have
been improving and applying the deep
learning techniques with success.
Another popular technique is to pretrain
the deep networks layer by layer by con-
sidering each pair of layers as a denoising
auto-encoder [1]. We will provide a brief overview of the original DBN work and the subsequent progress in the remainder of this article.
A PRIME ARCHITECTURE
OF DEEP LEARNING
In this section, we present a short tuto-
rial on the most extensively investigated
and widely deployed deep learning
architecture, the DBN, as originally
published in [7].
DBNs are probabilistic generative
models that are composed of multiple lay-
ers of stochastic, latent variables. The
unobserved variables can have binary val-
ues and are often called hidden units or
feature detectors. The top two layers have
undirected, symmetric connections
between them and form an associative
memory. The lower layers receive top-
down, directed connections from the
layer above. The states of the units in the
lowest layer, or the visible units, represent
an input data vector.
A DBN is built as a stack of its constit-
uents, called restricted Boltzmann
machines (RBMs) that we introduce next.
RESTRICTED BOLTZMANN
MACHINE
An RBM is a special type of Markov ran-
dom field that has one layer of (typically
Bernoulli) stochastic hidden units and
one layer of (typically Bernoulli or
Gaussian) stochastic visible or observable
units. RBMs can be represented as bipar-
tite graphs as shown in Figure 1, where
all visible units are connected to all hid-
den units, and there are no visible-visible
or hidden-hidden connections.
In an RBM, the joint distribution $p(v, h; \theta)$ over the visible units $v$ and hidden units $h$, given the model parameters $\theta$, is defined in terms of an energy function $E(v, h; \theta)$ as
$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z}, \qquad (1)$$
where $Z = \sum_{v}\sum_{h} \exp(-E(v, h; \theta))$ is a normalization factor or partition function, and the marginal probability that the model assigns to a visible vector $v$ is
$$p(v; \theta) = \frac{\sum_{h} \exp(-E(v, h; \theta))}{Z}. \qquad (2)$$
For a Bernoulli (visible)-Bernoulli (hidden) RBM, the energy function is defined as
$$E(v, h; \theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j, \qquad (3)$$
where $w_{ij}$ represents the symmetric interaction term between visible unit $v_i$ and hidden unit $h_j$, $b_i$ and $a_j$ are the bias terms, and $I$ and $J$ are the numbers of visible and hidden units. The conditional probabilities can be efficiently calculated as
$$p(h_j = 1 \mid v; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big), \qquad (4)$$
$$p(v_i = 1 \mid h; \theta) = \sigma\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i\Big), \qquad (5)$$
where $\sigma(x) = 1/(1 + \exp(-x))$. See a derivation in [1].
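To make (3)-(5) concrete, here is a minimal NumPy sketch (not from the original article) that evaluates the Bernoulli-Bernoulli energy and the two conditionals for a toy RBM and, because the model is tiny, also computes the partition function Z and the marginal p(v) of (1)-(2) by brute force; all layer sizes and parameter values are arbitrary illustrations.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
I, J = 4, 3                               # numbers of visible and hidden units
W = 0.1 * rng.standard_normal((I, J))     # symmetric interaction terms w_ij
b = np.zeros(I)                           # visible biases b_i
a = np.zeros(J)                           # hidden biases a_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # E(v, h; theta) of (3)
    return -(v @ W @ h) - b @ v - a @ h

def p_h_given_v(v):
    # p(h_j = 1 | v; theta) of (4)
    return sigmoid(v @ W + a)

def p_v_given_h(h):
    # p(v_i = 1 | h; theta) of (5)
    return sigmoid(W @ h + b)

v = rng.integers(0, 2, size=I).astype(float)
print("p(h = 1 | v) =", p_h_given_v(v))

# Brute-force Z and p(v; theta) of (1)-(2); feasible only for toy layer sizes.
states_v = [np.array(s, float) for s in product([0, 1], repeat=I)]
states_h = [np.array(s, float) for s in product([0, 1], repeat=J)]
Z = sum(np.exp(-energy(vv, hh)) for vv in states_v for hh in states_h)
p_v = sum(np.exp(-energy(v, hh)) for hh in states_h) / Z
print("p(v) =", p_v)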
Similarly, for a Gaussian (visible)-Bernoulli (hidden) RBM, the energy is
$$E(v, h; \theta) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2}\sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j. \qquad (6)$$
The corresponding conditional probabilities become
$$p(h_j = 1 \mid v; \theta) = \sigma\Big(\sum_{i=1}^{I} w_{ij} v_i + a_j\Big), \qquad (7)$$
$$p(v_i \mid h; \theta) = \mathcal{N}\Big(\sum_{j=1}^{J} w_{ij} h_j + b_i,\; 1\Big), \qquad (8)$$
where $v_i$ takes real values and follows a Gaussian distribution with mean $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and variance one. Gaussian-Bernoulli RBMs can be used to convert real-valued stochastic variables to binary stochastic variables, which can then be further processed using the Bernoulli-Bernoulli RBMs.
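In the same spirit, a small sketch of the Gaussian-Bernoulli conditionals (7) and (8): the hidden units are sampled as Bernoulli variables, while each visible unit is drawn from a unit-variance Gaussian whose mean is set by the hidden layer. The feature dimensionality and parameter values are again placeholder assumptions.

import numpy as np

rng = np.random.default_rng(1)
I, J = 4, 3
W = 0.1 * rng.standard_normal((I, J))
b = np.zeros(I)                           # visible biases
a = np.zeros(J)                           # hidden biases
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # Bernoulli hidden units, p(h_j = 1 | v; theta) of (7)
    p = sigmoid(v @ W + a)
    return (rng.random(J) < p).astype(float)

def sample_v_given_h(h):
    # Real-valued visible units, p(v_i | h; theta) of (8): N(W h + b, 1)
    return (W @ h + b) + rng.standard_normal(I)

v = rng.standard_normal(I)                # e.g., mean- and variance-normalized features
h = sample_h_given_v(v)                   # binary representation passed to higher RBMs
v_reconstructed = sample_v_given_h(h)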
Taking the gradient of the log likelihood $\log p(v; \theta)$, we can derive the update rule for the RBM weights as
$$\Delta w_{ij} = E_{\mathrm{data}}(v_i h_j) - E_{\mathrm{model}}(v_i h_j), \qquad (9)$$
where $E_{\mathrm{data}}(v_i h_j)$ is the expectation observed in the training set and $E_{\mathrm{model}}(v_i h_j)$ is that same expectation under the distribution defined by the model. Unfortunately, $E_{\mathrm{model}}(v_i h_j)$ is intractable to compute, so the contrastive divergence (CD) approximation to the gradient is used, where $E_{\mathrm{model}}(v_i h_j)$ is replaced by running the Gibbs sampler initialized at the data for one full step [7].

[FIG1] An RBM with I visible units and J hidden units.
Careful training of RBMs is essential to
the success of applying deep learning to
practical problems. A practical guide to RBM training is provided in [6].
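The sketch below implements the one-step contrastive divergence (CD-1) approximation to (9) for a Bernoulli-Bernoulli RBM, in the spirit of [7] and the practical recipes of [6]; the learning rate, number of sweeps, and toy data are illustrative assumptions rather than recommended settings.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1):
    # One CD-1 step on a batch V of binary visible vectors (one row per example).
    # W: (I, J) weights, a: (J,) hidden biases, b: (I,) visible biases.
    # Positive phase: E_data(v_i h_j), hidden units driven by the data.
    ph_data = sigmoid(V @ W + a)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one full Gibbs step initialized at the data.
    pv_model = sigmoid(h_sample @ W.T + b)
    ph_model = sigmoid(pv_model @ W + a)
    # Approximation to (9): <v h>_data - <v h>_model, averaged over the batch;
    # analogous updates are applied to the bias terms.
    n = V.shape[0]
    W = W + lr * (V.T @ ph_data - pv_model.T @ ph_model) / n
    a = a + lr * (ph_data - ph_model).mean(axis=0)
    b = b + lr * (V - pv_model).mean(axis=0)
    return W, a, b

# Toy usage: 8-dimensional binary data, 5 hidden units, 10 sweeps over the batch.
V = (rng.random((100, 8)) < 0.3).astype(float)
W, a, b = 0.01 * rng.standard_normal((8, 5)), np.zeros(5), np.zeros(8)
for _ in range(10):
    W, a, b = cd1_update(V, W, a, b)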
FROM RBM TO DBN
Stacking a number of RBMs learned layer by layer from the bottom up gives rise to
a DBN, an example of which is shown in
Figure 2. The stacking procedure is as fol-
lows. After learning a Gaussian-Bernoulli
RBM (for applications with continuous fea-
tures such as speech) or Bernoulli-
Bernoulli RBM (for applications with
nominal or binary features such as black-
white image or coded text), we treat the
activation probabilities of its hidden units
as the data for training the Bernoulli-
Bernoulli RBM one layer up. The activa-
tion probabilities of the second-layer
Bernoulli-Bernoulli RBM are then used as
the visible data input for the third-layer
Bernoulli-Bernoulli RBM, and so on.
Theoretical justification of this efficient
layer-by-layer greedy learning strategy is
given in [7], where it is shown that the
stacking procedure above improves a varia-
tional lower bound on the likelihood of the
training data under the composite model.
That is, the greedy procedure above
achieves approximate maximum likelihood
learning. Note that this learning procedure
is unsupervised and requires no class label.
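A minimal sketch of this greedy stacking procedure, assuming binary input features so that every layer is a Bernoulli-Bernoulli RBM; train_rbm is a compact CD-1 trainer of the kind sketched earlier, and all hyperparameters are placeholders.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, epochs=10, lr=0.1):
    # Train one Bernoulli-Bernoulli RBM with CD-1 and return its parameters (W, a, b).
    I = V.shape[1]
    W, a, b = 0.01 * rng.standard_normal((I, n_hidden)), np.zeros(n_hidden), np.zeros(I)
    for _ in range(epochs):
        ph = sigmoid(V @ W + a)                                  # positive phase
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)                                # one Gibbs step
        ph2 = sigmoid(pv @ W + a)
        n = V.shape[0]
        W += lr * (V.T @ ph - pv.T @ ph2) / n
        a += lr * (ph - ph2).mean(axis=0)
        b += lr * (V - pv).mean(axis=0)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    # Greedy layer-by-layer pretraining; returns the list of per-layer RBM parameters.
    stack, inputs = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inputs, n_hidden)
        stack.append((W, a, b))
        # The hidden activation probabilities become the visible data one layer up.
        inputs = sigmoid(inputs @ W + a)
    return stack

data = (rng.random((200, 16)) < 0.3).astype(float)   # toy binary feature vectors
dbn = pretrain_dbn(data, layer_sizes=[32, 32, 16])   # a three-hidden-layer DBN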
When a DBN is applied to classification
tasks, the generative pretraining can be
followed by or combined with other, typ-
ically discriminative, learning proce-
dures that fine-tune all of the weights
jointly to improve the performance of
the DBN. This discriminative fine-tuning
is often performed by adding a final layer
of variables that represent the desired
outputs or labels provided in the train-
ing data. Then, the backpropagation
algorithm can be used to adjust or fine-
tune the DBN weights. For example, for
speech recognition, the output layer can
represent either syllables, phones, sub-
phones, phone states, or other speech
units used in the HMM-based speech
recognition system.
The learning procedure discussed
above is typically expensive compared
with the inference procedure, which can
be efficiently carried out by a single for-
ward pass. The inference procedure of a DBN is analogous to the forward pass of a conventional MLP.
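The following sketch shows how the pretrained stack can be used for classification and inference: the DBN weights initialize the sigmoid hidden layers, a final softmax layer supplies the label variables, and inference is the single forward pass described above. Random weights stand in for the pretrained and fine-tuned parameters so the sketch is self-contained; the joint backpropagation fine-tuning itself is not shown.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dbn_forward(x, stack, W_out, c_out):
    # Single forward pass: pretrained sigmoid hidden layers, then a softmax label layer.
    for W, a in stack:
        x = sigmoid(x @ W + a)
    return softmax(x @ W_out + c_out)

# In practice each (W, a) comes from the generative pretraining and is then fine-tuned
# jointly with (W_out, c_out) by backpropagation; random values are used here only to
# keep the sketch self-contained.
sizes = [16, 32, 32, 16]
stack = [(0.1 * rng.standard_normal((sizes[k], sizes[k + 1])), np.zeros(sizes[k + 1]))
         for k in range(len(sizes) - 1)]
W_out, c_out = 0.1 * rng.standard_normal((sizes[-1], 10)), np.zeros(10)   # ten labels

x = (rng.random((5, 16)) < 0.3).astype(float)
posteriors = dbn_forward(x, stack, W_out, c_out)   # class posteriors, one row per input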
APPLICATIONS OF DEEP LEARNING
TO SIGNAL PROCESSING AREAS
In the expanded technical scope of signal processing, the signal includes not only the traditional types, such as audio, speech, image, and video, but also text, language, and documents that convey high-level semantic information for human consumption. In addition,
the scope of processing has been extend-
ed from the conventional coding,
enhancement, analysis, and recognition
to include more human-centric tasks of
interpretation, understanding, retrieval,
mining, and user interface [4]. Many sig-
nal processing researchers have been
working on one or more of the signal
processing areas defined by the matrix
constructed with the two axes of “signal”
and “processing” discussed here. The
deep learning techniques discussed in
this article have recently been applied to
quite a number of extended signal pro-
cessing areas. We now provide a brief
survey of this body of work in three
main categories. Due to the limitation
on the number of references, we have
omitted some reference listings in the
following survey.
SPEECH AND AUDIO
The traditional MLP has been in use for speech recognition for many years, but when used alone, its performance is typically lower than that of state-of-the-art HMM systems with observation probabilities approximated with Gaussian mixture
models (GMMs). Recently, the deep learn-
ing technique was successfully applied to
phone [8], [9] and large vocabulary con-
tinuous speech recognition (LVCSR) tasks
by integrating the powerful discrimina-
tive training ability of the DBNs and the
sequential modeling ability of the HMMs.
Such a model as shown in Figure 3 is typ-
ically named DBN-HMM, where the
observation probability is estimated using
the DBN and the sequential information
is modeled using the HMM.
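In hybrid systems of this kind, a common recipe (an assumption here, not a detail given in the article) is to convert the DBN's per-frame state posteriors into scaled likelihoods by dividing by the state priors, so that they can replace the GMM observation probabilities inside the HMM decoder:

import numpy as np

def scaled_likelihoods(posteriors, state_priors, floor=1e-10):
    # Convert per-frame state posteriors p(s | x_t) into log scaled likelihoods
    # log p(x_t | s) + const = log p(s | x_t) - log p(s), which stand in for the
    # GMM observation log-likelihoods during HMM (Viterbi) decoding.
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(state_priors, floor))

# Toy example: 4 frames, 3 HMM states (e.g., phone states or senones).
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=4)     # DBN outputs; each row sums to one
priors = np.array([0.5, 0.3, 0.2])           # state priors, e.g., from forced alignments
log_obs = scaled_likelihoods(post, priors)   # plugged into the HMM decoder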
In [9], a five-layer DBN was used to
replace the Gaussian mixture component
of the GMM-HMM, and the monophone state was used as the modeling unit.
Although the monophone model was
used, the DBN-HMM approach achieved phone recognition accuracy competitive with state-of-the-art triphone GMM-HMM systems.
The work in [8] improved the DBN-
HMM used in [9] by using the CRF
instead of the HMM to model the sequen-
tial information and by applying the max-
imum mutual information (MMI)
training technique successfully developed
in speech recognition to the DBN-CRF
training. The sequential discriminative
learning technique developed in [8]
jointly optimizes the DBN weights, transi-
tion weights, and phone language model
and achieved higher accuracy than the
DBN-HMM phone recognizer with the
frame-discriminative training criterion
implicit in the DBN’s fine-tuning proce-
dure implemented in [9].
The DBN-HMM can be extended from
the context-independent model to the
context-dependent model and from phone recognition to LVCSR.
[FIG2] The DBN model used for classification. The hidden layers are generatively pretrained layer by layer by considering each pair of layers as an RBM. The output layer has labels from the supervised data.

Experiments on the challenging Bing mobile voice search data set collected
under the real usage scenario demon-
strate that the context-dependent DBN-
HMM significantly outperforms the
state-of-the-art HMM system. Three factors contribute to the success: the use of triphone senones as the DBN modeling units, the use of the best available triphone GMM-HMM to generate the senone alignment, and the tuning of the transition probabilities. Experiments also indicate that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM.
In [5], the deep auto-encoder [7] is explored on the speech feature coding problem, with the goal of compressing the data to a predefined number of bits with
minimal reproduction error. DBN pre-
training is found to be crucial for high
coding efficiency. When DBN pretraining
is used, the deep auto-encoder is shown
to significantly outperform a traditional
vector quantization technique. If weights
in the deep auto-encoder are randomly
initialized, the performance is substan-
tially degraded.
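As a rough, assumption-laden illustration of the auto-encoder setup in [5], the sketch below passes real-valued feature vectors through a narrow code layer and measures the reproduction error; in the actual work the weights are initialized by DBN pretraining and the code layer is quantized to a fixed number of bits, neither of which is reproduced here.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def autoencode(x, enc, dec):
    # Encoder layers down to a narrow code, mirrored decoder layers back up;
    # the last decoder layer is linear because the features are real valued.
    code = x
    for W, a in enc:
        code = sigmoid(code @ W + a)
    recon = code
    for idx, (W, a) in enumerate(dec):
        recon = recon @ W + a
        if idx < len(dec) - 1:
            recon = sigmoid(recon)
    return code, recon

# Toy 39-dimensional "speech feature" vectors squeezed through an 8-unit code layer.
sizes = [39, 128, 8]
enc = [(0.1 * rng.standard_normal((sizes[k], sizes[k + 1])), np.zeros(sizes[k + 1]))
       for k in range(len(sizes) - 1)]
dec = [(0.1 * rng.standard_normal((sizes[k], sizes[k - 1])), np.zeros(sizes[k - 1]))
       for k in range(len(sizes) - 1, 0, -1)]
x = rng.standard_normal((10, 39))
code, recon = autoencode(x, enc, dec)
mse = float(((x - recon) ** 2).mean())       # reproduction error minimized during training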
Another popular deep model is the
convolutional DBN, which has been
applied to audio and speech data for a
number of tasks including music artist
and genre classification, speaker identifi-
cation, speaker gender classification,
and phone classification, with strong
results presented.
Other deep models have also been
developed and presented. For example,
the deep-structured CRF, which stacks many layers of CRFs, has been successfully used in the speech-related tasks of language identification, phone recognition, sequential labeling [15], and confidence calibration.
IMAGE AND VIDEO
The original DBN and deep auto-encoder
were developed and demonstrated with
success on the simple image recognition
and dimensionality reduction (coding)
tasks (MNIST) in [7]. It is interesting to
note that the gain of coding efficiency
using the DBN-based auto-encoder on
the image data over the conventional
method of principal component analysis
as demonstrated in [7] is very similar to
the gain reported in [5] on the speech
data over the traditional technique of vec-
tor quantization.
In [10], Nair and Hinton developed a
modified DBN where the top-layer model
uses a third-order Boltzmann machine.
They applied this type of DBN to the
NORB database—a three-dimensional
object recognition task. An error rate
close to the best published result on this
task was reported. In particular, it was
shown that the DBN substantially outper-
forms shallow models such as SVMs.
Tang and Eliasmith developed two
strategies to improve the robustness of
the DBN in [14]. First, they used sparse
connections in the first layer of the DBN
as a way to regularize the model. Second,
they developed a probabilistic denoising
algorithm. Both techniques are shown to
be effective in improving the robustness
against occlusion and random noise in a
noisy image recognition task. Another
interesting work on image recognition
with a more general approach than DBN
appears in [11].
DBNs have also been successfully
applied to create compact but meaningful
representations of images for retrieval
purposes. On large-collection image retrieval tasks, deep learning approaches have also produced strong results.
The u