Today's Academic Horizon (2015.9.10)

September 10, 2015, 06:32

cs.AI - Artificial Intelligence
cs.CL - Computation and Language
cs.CV - Computer Vision and Pattern Recognition
cs.CY - Computers and Society
cs.DC - Distributed, Parallel, and Cluster Computing
cs.DS - Data Structures and Algorithms
cs.IR - Information Retrieval
cs.IT - Information Theory
cs.LG - Machine Learning
cs.NA - Numerical Analysis
cs.NE - Neural and Evolutionary Computing
cs.SI - Social and Information Networks
math.ST - Statistics Theory
stat.ME - Methodology
stat.ML - Machine Learning (Statistics)

• [cs.AI]C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching
• [cs.CL]Enhancing Automatically Discovered Multi-level Acoustic Patterns Considering Context Consistency With Applications in Spoken Term Detection
• [cs.CL]Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
• [cs.CL]Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization
• [cs.CL]Unsupervised Domain Discovery using Latent Dirichlet Allocation for Acoustic Modelling in Speech Recognition
• [cs.CL]Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity
• [cs.CV]Accelerated graph-based spectral polynomial filters
• [cs.CV]Deep Attributes from Context-Aware Regional Neural Codes
• [cs.CV]Diffusion tensor imaging with deterministic error bounds
• [cs.CV]Edge-enhancing Filters with Negative Weights
• [cs.CV]HEp-2 Cell Classification: The Role of Gaussian Scale Space Theory as A Pre-processing Approach
• [cs.CV]Object Proposals for Text Extraction in the Wild
• [cs.DC]Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Stores
• [cs.DC]linalg: Matrix Computations in Apache Spark
• [cs.DS]Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection
• [cs.IR]Improved Twitter Sentiment Prediction through Cluster-then-Predict Model
• [cs.IR]Personalized Search
• [cs.LG]A Behavior Analysis-Based Game Bot Detection Approach Considering Various Play Styles
• [cs.LG]Data-selective Transfer Learning for Multi-Domain Speech Recognition
• [cs.LG]Sampled Weighted Min-Hashing for Large-Scale Topic Mining
• [cs.NA]SEP-QN: Scalable and Extensible Proximal Quasi-Newton Method for Dirty Statistical Models
• [cs.NE]DeepCough: A Deep Convolutional Neural Network in A Wearable Cough Detection System
• [cs.SI]Coupling Analysis Between Twitter and Call Centre
• [cs.SI]Wikipedia Page View Reflects Web Search Trend
• [math.ST]A new non-parametric detector of univariate outliers for distributions with positive unbounded support
• [stat.ML]A Variational Bayesian State-Space Approach to Online Passive-Aggressive Regression
• [stat.ML]Empirical risk minimization is consistent with the mean absolute percentage error
• [stat.ML]Modelling time evolving interactions in networks through a non stationary extension of stochastic block models
• [stat.ML]On the complexity of piecewise affine system identification 

·····································

• [cs.AI]C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching
Daniel Ritchie, Andreas Stuhlmüller, Noah D. Goodman
//arxiv.org/abs/1509.02151v2 

Lightweight, source-to-source transformation approaches to implementing MCMC for probabilistic programming languages are popular for their simplicity, support of existing deterministic code, and ability to execute on existing fast runtimes. However, they are also slow, requiring a complete re-execution of the program on every Metropolis-Hastings proposal. We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. C3 is based on two core ideas: transforming probabilistic programs into continuation passing style (CPS), and caching the results of function calls. We show that on several common models, C3 reduces proposal runtime by 20-100x, in some cases reducing runtime complexity from linear in model size to constant. We also demonstrate nearly an order of magnitude speedup on a complex inverse procedural modeling application.
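
A minimal Python sketch of the callsite-caching half of this idea (the CPS transformation is omitted, and all names below are hypothetical, not the authors' implementation): function results are memoized by call-site identity, so re-executing the program after a proposal whose inputs to a call are unchanged becomes a cache hit.

```python
# Illustrative sketch of callsite caching, assuming pure functions:
# results are keyed by (call site, arguments), so an MH re-execution
# that leaves a call's inputs unchanged reuses the cached value.
cache = {}  # (callsite_id, args) -> cached return value

def cached_call(callsite_id, fn, *args):
    """Reuse fn(*args) if this call site saw the same arguments before."""
    key = (callsite_id, args)
    if key not in cache:
        cache[key] = fn(*args)   # re-executed only when inputs changed
    return cache[key]

def expensive_likelihood(x):
    return sum((x - i) ** 2 for i in range(10_000))

# First run: the call executes in full.
v1 = cached_call("site-A", expensive_likelihood, 0.5)
# A proposal that does not touch this call's inputs is a cache hit:
v2 = cached_call("site-A", expensive_likelihood, 0.5)
assert v1 == v2
```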

• [cs.CL]Enhancing Automatically Discovered Multi-level Acoustic Patterns Considering Context Consistency With Applications in Spoken Term Detection
Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, Lin-Shan Lee
//arxiv.org/abs/1509.02217v1 

This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus. In a previous work it was proposed that different HMM configurations (number of states per model, number of distinct models) for the acoustic patterns form a two-dimensional space. Multiple sets of acoustic patterns automatically discovered with the HMM configurations properly located on different points over this two-dimensional space were shown to be complementary to one another, jointly capturing the characteristics of the given corpus. By representing the given corpus as sequences of acoustic patterns on different HMM sets, the pattern indices in these sequences can be relabeled considering the context consistency across the different sequences. Good improvements were observed in preliminary experiments of pattern spoken term detection (STD) performed on both TIMIT and Mandarin Broadcast News with such enhanced patterns. 

• [cs.CL]Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Octavian-Eugen Ganea, Marina Horlescu, Aurelien Lucchi, Carsten Eickhoff, Thomas Hofmann
//arxiv.org/abs/1509.02301v1 

The goal of entity linking is to map spans of text to canonical entity representations such as Freebase entries or Wikipedia articles. It provides a foundation for various natural language processing tasks, including text understanding, summarization and machine translation. Name ambiguity, word polysemy, context dependencies, and a heavy-tailed distribution of entities contribute to the complexity of this problem. We propose a simple, yet effective, probabilistic graphical model for collective entity linking, which resolves entity links jointly across an entire document. Our model captures local information from linkable token spans (i.e., mentions) and their surrounding context and combines it with a document-level prior of entity co-occurrences. The model is acquired automatically from entity-linked text repositories with a lightweight computational step for parameter adaptation. Loopy belief propagation is used as an efficient approximate inference algorithm. In contrast to state-of-the-art methods, our model is conceptually simple and easy to reproduce. It comes with a small memory footprint and is sufficiently fast for real-time usage. We demonstrate its benefits on a wide range of well-known entity linking benchmark datasets. Our empirical results show the merits of the proposed approach and its competitiveness in comparison to state-of-the-art methods. 

• [cs.CL]Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization
Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee
//arxiv.org/abs/1509.02208v1 

Techniques for unsupervised discovery of acoustic patterns are getting increasingly attractive, because huge quantities of speech data are becoming available but manual annotations remain hard to acquire. In this paper, we propose an approach for unsupervised discovery of linguistic structure for the target spoken language given raw speech data. This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns and the N-gram language model based on word-like patterns. All patterns, models, and parameters can be automatically learned from the unlabelled speech corpus. This is achieved by an initialization step followed by three cascaded stages for acoustic, linguistic, and lexical iterative optimization. The lexicon of word-like patterns defines the allowed consecutive sequences of HMMs for subword-like patterns. In each iteration, model training and decoding produces updated labels from which the lexicon and HMMs can be further updated. In this way, model parameters and decoded labels are respectively optimized in each iteration, and the knowledge about the linguistic structure is learned gradually layer after layer. The proposed approach was tested in preliminary experiments on a corpus of Mandarin broadcast news, including a task of spoken term detection with performance compared to a parallel test using models trained in a supervised way. Results show that the proposed system not only yields reasonable performance on its own, but is also complementary to existing large vocabulary ASR systems.

• [cs.CL]Unsupervised Domain Discovery using Latent Dirichlet Allocation for Acoustic Modelling in Speech Recognition
Mortaza Doulaty, Oscar Saz, Thomas Hain
//arxiv.org/abs/1509.02412v1 

Speech recognition systems are often highly domain dependent, a fact widely reported in the literature. However the concept of domain is complex and not bound to clear criteria. Hence it is often not evident if data should be considered to be out-of-domain. While both acoustic and language models can be domain specific, work in this paper concentrates on acoustic modelling. We present a novel method to perform unsupervised discovery of domains using Latent Dirichlet Allocation (LDA) modelling. Here a set of hidden domains is assumed to exist in the data, whereby each audio segment can be considered to be a weighted mixture of domain properties. The classification of audio segments into domains allows the creation of domain specific acoustic models for automatic speech recognition. Experiments are conducted on a dataset of diverse speech data covering speech from radio and TV broadcasts, telephone conversations, meetings, lectures and read speech, with a joint training set of 60 hours and a test set of 6 hours. Maximum A Posteriori (MAP) adaptation to LDA based domains was shown to yield relative Word Error Rate (WER) improvements of up to 16% compared to pooled training, and up to 10% compared with models adapted with human-labelled prior domain knowledge.
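
As a rough illustration of the pipeline, the sketch below fits an LDA model with scikit-learn and hard-assigns segments to their dominant latent domain. It assumes each audio segment has already been reduced to counts of discrete acoustic tokens (that front-end step is not shown, and the data here is synthetic).

```python
# Minimal LDA-based domain discovery sketch (synthetic stand-in data).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Hypothetical data: 200 segments x 500 acoustic-token types (counts).
counts = rng.poisson(0.3, size=(200, 500))

lda = LatentDirichletAllocation(n_components=8, random_state=0)
theta = lda.fit_transform(counts)   # per-segment domain mixture weights

domain = theta.argmax(axis=1)       # hard-assign each segment to a domain
# Segments sharing a domain label would then get a MAP-adapted acoustic model.
print(np.bincount(domain, minlength=8))
```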

• [cs.CL]Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity
Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee
//arxiv.org/abs/1509.02213v1 

This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered at different points properly distributed over this three-dimensional space are complementary to one another, and thus can jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only can the on-line computation load be reduced, but the signal variations caused by different speakers and acoustic conditions can be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus.
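
For reference, the feature-based baseline mentioned above is dynamic time warping (DTW). Below is a generic textbook DTW distance in Python, not the authors' segmental variant, with random vectors standing in for MFCC frames.

```python
# Textbook dynamic time warping between two sequences of feature vectors.
import numpy as np

def dtw(a, b):
    """DTW distance between sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

query = np.random.randn(20, 13)      # e.g. 20 frames of 13-dim MFCCs
utterance = np.random.randn(50, 13)
print(dtw(query, utterance))
```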

• [cs.CV]Accelerated graph-based spectral polynomial filters
Andrew Knyazev, Alexander Malyshev
//arxiv.org/abs/1509.02468v1 

Graph-based spectral denoising is a low-pass filtering using the eigendecomposition of the graph Laplacian matrix of a noisy signal. Polynomial filtering avoids costly computation of the eigendecomposition by projections onto suitable Krylov subspaces. Polynomial filters can be based, e.g., on the bilateral and guided filters. We propose constructing accelerated polynomial filters by running flexible Krylov subspace based linear and eigenvalue solvers such as the Block Locally Optimal Preconditioned Conjugate Gradient (LOBPCG) method. 
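
To make "polynomial filtering" concrete, the toy sketch below smooths a noisy signal on a path graph by iterating (I - aL), i.e. applying a degree-30 polynomial in the Laplacian with no eigendecomposition. The paper's actual contribution, the accelerated Krylov/LOBPCG construction, is not reproduced here.

```python
# Toy polynomial low-pass graph filter: repeated (I - a*L) smoothing.
import numpy as np

n = 50
W = np.zeros((n, n))                    # path graph: W[i, i+1] = 1
idx = np.arange(n - 1)
W[idx, idx + 1] = 1.0
W[idx + 1, idx] = 1.0
L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian

signal = np.sin(np.linspace(0, 3, n)) + 0.3 * np.random.randn(n)

alpha = 0.25                            # < 2 / lambda_max(L), which is <= 4
x = signal.copy()
for _ in range(30):                     # degree-30 polynomial in L
    x = x - alpha * (L @ x)             # one smoothing iteration
print(np.round(x[:5], 3))
```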

• [cs.CV]Deep Attributes from Context-Aware Regional Neural Codes
Jianwei Luo, Jianguo Li, Jun Wang, Zhiguo Jiang, Yurong Chen
//arxiv.org/abs/1509.02470v1 

Recently, many works employ middle-layer outputs of convolutional neural network (CNN) models as features for different visual recognition tasks. Although promising results have been achieved in some empirical studies, such representations still suffer from the well-known issue of semantic gap. This paper proposes a so-called deep attribute framework to alleviate this issue from three aspects. First, we introduce object region proposals as an intermediate representation of target images, and extract features from region proposals. Second, we study aggregating features from different CNN layers for all region proposals. The aggregation yields a holistic yet compact representation of input images. Results show that cross-region max-pooling of the soft-max layer outputs outperforms all other layers. As the soft-max layer directly corresponds to semantic concepts, this representation is named "deep attributes". Third, we observe that only a small portion of the regions generated by the object proposals algorithm are correlated with the classification target. Therefore, we introduce a context-aware region refining algorithm to pick out contextual regions and build context-aware classifiers. We apply the proposed deep attributes framework to various vision tasks. Extensive experiments are conducted on standard benchmarks for three visual recognition tasks, i.e., image classification, fine-grained recognition and visual instance retrieval. Results show that deep attribute approaches achieve state-of-the-art results, and outperform existing peer methods by a significant margin, even though some benchmarks have little overlap of concepts with the pre-trained CNN models.
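
The aggregation step the abstract highlights is simple to state in code. In the sketch below, the CNN and proposal generator are stubbed with random per-region soft-max scores; the "deep attribute" vector is just the per-class maximum across regions.

```python
# Cross-region max-pooling of per-region soft-max outputs (stubbed data).
import numpy as np

num_regions, num_classes = 120, 1000
# Hypothetical per-region soft-max outputs (each row sums to 1).
region_scores = np.random.dirichlet(np.ones(num_classes), size=num_regions)

deep_attributes = region_scores.max(axis=0)   # cross-region max-pooling
assert deep_attributes.shape == (num_classes,)
# deep_attributes is the holistic, semantically interpretable image code.
```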

• [cs.CV]Diffusion tensor imaging with deterministic error bounds
Artur Gorokh, Yury Korolev, Tuomo Valkonen
//arxiv.org/abs/1509.02223v1 

Errors in the data and the forward operator of an inverse problem can be handily modelled using partial order in Banach lattices. We present some existing results of the theory of regularisation in this novel framework, where errors are represented as bounds by means of the appropriate partial order. We apply the theory to Diffusion Tensor Imaging, where correct noise modelling is challenging: it involves the Rician distribution and the nonlinear Stejskal-Tanner equation. Linearisation of the latter in the statistical framework would complicate the noise model even further. We avoid this using the error bounds approach, which preserves simple error structure under monotone transformations. 

• [cs.CV]Edge-enhancing Filters with Negative Weights
Andrew Knyazev
//arxiv.org/abs/1509.02491v1 

In [DOI:10.1109/ICMEW.2014.6890711], a graph-based denoising is performed by projecting the noisy image to a lower dimensional Krylov subspace of the graph Laplacian, constructed using nonnegative weights determined by distances between image data corresponding to image pixels. We extend the construction of the graph Laplacian to the case where some graph weights can be negative. Removing the positivity constraint provides a more accurate inference of a graph model behind the data, and thus can improve quality of filters for graph-based signal processing, e.g., denoising, compared to the standard construction, without affecting the costs.
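
One plausible way to build such a Laplacian (a sketch under my own assumptions, not necessarily the paper's construction): shift a similarity kernel so that dissimilar pairs get negative weights, and use the signed row sum as the degree so constant signals remain in the kernel of L.

```python
# Graph Laplacian whose off-diagonal weights may be negative.
import numpy as np

def laplacian(values, sigma=0.5, shift=0.2):
    """L = D - W, where W is a shifted Gaussian kernel (can dip below 0)."""
    diff = values[:, None] - values[None, :]
    W = np.exp(-diff ** 2 / (2 * sigma ** 2)) - shift
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W     # signed row sums as degrees

L = laplacian(np.random.rand(16))
print(np.allclose(L @ np.ones(16), 0.0))  # constant signals stay in kernel
```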

• [cs.CV]HEp-2 Cell Classification: The Role of Gaussian Scale Space Theory as A Pre-processing Approach
Xianbiao Qi, Guoying Zhao, Jie Chen, Matti Pietikäinen
//arxiv.org/abs/1509.02320v1 

Indirect Immunofluorescence Imaging of Human Epithelial Type 2 (HEp-2) cells is an effective way to identify the presence of Anti-Nuclear Antibody (ANA). Most existing works on HEp-2 cell classification mainly focus on feature extraction, feature encoding and classifier design. Very few efforts have been devoted to study the importance of the pre-processing techniques. In this paper, we analyze the importance of the pre-processing, and investigate the role of Gaussian Scale Space (GSS) theory as a pre-processing approach for the HEp-2 cell classification task. We validate the GSS pre-processing under the Local Binary Pattern (LBP) and the Bag-of-Words (BoW) frameworks. Under the BoW framework, the introduced pre-processing approach, using only one Local Orientation Adaptive Descriptor (LOAD), achieved superior performance on the Executable Thematic on Pattern Recognition Techniques for Indirect Immunofluorescence (ET-PRT-IIF) image analysis. Our system, using only one feature, outperformed the winner of the ICPR 2014 contest that combined four types of features. Meanwhile, the proposed pre-processing method is not restricted to this work; it can be generalized to many existing works.
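
Gaussian scale-space pre-processing itself is only a few lines: filter the image at several sigmas and extract features from every scale. In the sketch below the descriptor (LBP/LOAD) step is out of scope and the image is a random stand-in.

```python
# Gaussian scale-space stack as a pre-processing step (stubbed image).
import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(64, 64)              # stand-in for a HEp-2 cell image
scales = [0.0, 1.0, 2.0, 4.0]
stack = [gaussian_filter(image, sigma=s) if s > 0 else image for s in scales]
# Each smoothed plane would now be fed to the texture descriptor; features
# pooled over scales form the final representation.
print(len(stack), stack[1].shape)
```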

• [cs.CV]Object Proposals for Text Extraction in the Wild
Lluis Gomez, Dimosthenis Karatzas
//arxiv.org/abs/1509.02317v1 

Object Proposals is a recent computer vision technique receiving increasing interest from the research community. Its main objective is to generate a relatively small set of bounding box proposals that are most likely to contain objects of interest. The use of Object Proposals techniques in the scene text understanding field is innovative. Motivated by the success of powerful yet expensive techniques that recognize words in a holistic way, Object Proposals techniques emerge as an alternative to traditional text detectors. In this paper we study to what extent the existing generic Object Proposals methods may be useful for scene text understanding. Also, we propose a new Object Proposals algorithm that is specifically designed for text and compare it with other generic methods in the state of the art. Experiments show that our proposal is superior in its ability to produce good quality word proposals in an efficient way. The source code of our method is made publicly available.

• [cs.DC]Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Stores
Muntasir Raihan Rahman, Lewis Tseng, Son Nguyen, Indranil Gupta, Nitin Vaidya
//arxiv.org/abs/1509.02464v1 

The CAP theorem is a fundamental result that applies to distributed storage systems. In this paper, we first present and prove a probabilistic variation of the CAP theorem. We present probabilistic models to characterize the three important elements of the CAP theorem: consistency (C), availability or latency (A), and partition-tolerance (P). Then, we provide a quantitative characterization of the tradeoff among these three elements. Next, we leverage this result to present a new system, called PCAP, which allows applications running on a single data-center to specify either a latency SLA or a consistency SLA. The PCAP system automatically adapts, in real time and under changing network conditions, to meet the SLA while optimizing the other C/A metric. We incorporate PCAP into two popular key-value stores: Apache Cassandra and Riak. Our experiments with these two deployments, under realistic workloads, reveal that the PCAP system satisfactorily meets SLAs, and performs close to the bounds dictated by our tradeoff analysis. We also extend PCAP from a single data-center to multiple geo-distributed data-centers.

• [cs.DC]linalg: Matrix Computations in Apache Spark
Reza Bosagh Zadeh, Xiangrui Meng, Burak Yavuz, Aaron Staple, Li Pu, Shivaram Venkataraman, Evan Sparks, Alexander Ulanov, Matei Zaharia
//arxiv.org/abs/1509.02256v1 

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM.
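
The library's primary API is Scala; as a hedged illustration, the PySpark counterpart below (the Python binding for computeSVD arrived in later Spark releases) shows the division of labor the abstract describes: the matrix lives on the cluster, while the small factors come back to the driver.

```python
# Distributed SVD via Spark's mllib.linalg (requires a recent PySpark).
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-demo").getOrCreate()
rows = spark.sparkContext.parallelize(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
)
mat = RowMatrix(rows)                   # rows are distributed on the cluster
svd = mat.computeSVD(2, computeU=True)  # matrix work on the cluster,
print(svd.s)                            # small factors local to the driver
spark.stop()
```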

• [cs.DS]Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection
Ahmad Mahmoody, Evgenios M. Kornaropoulos, Eli Upfal
//arxiv.org/abs/1509.02487v1 

We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failures in large systems with many individual machines. We consider a large system consisting of many nodes, where each node has its own rate of generating new events, or items. A monitoring application can probe a small number of nodes at each step, and our goal is to compute a probing schedule that minimizes the expected number of undiscovered items in the system, or equivalently, minimizes the expected time to discover a new item in the system. We study the Schedule Optimization problem for both deterministic and randomized memoryless algorithms. We provide lower bounds on the cost of an optimal schedule and construct close to optimal schedules with rigorous mathematical guarantees. Finally, we present an adaptive algorithm that starts with no prior information on the system and converges to the optimal memoryless algorithm by adapting to observed data.

• [cs.IR]Improved Twitter Sentiment Prediction through Cluster-then-Predict Model
Rishabh Soni, K. James Mathai
//arxiv.org/abs/1509.02437v1 

Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Facebook, Twitter, YouTube and also mobile applications such as WhatsApp, Line, etc. Many companies have identified these resources as a rich mine of marketing knowledge. This knowledge provides valuable feedback which allows them to further develop the next generation of their product. In this paper, sentiment analysis of a product is performed by extracting tweets about that product and classifying them as expressing positive or negative sentiment. The authors propose a hybrid approach which combines unsupervised learning, in the form of K-means clustering of the tweets, with supervised learning methods such as Decision Trees and Support Vector Machines for classification.
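
A compact cluster-then-predict pipeline of this kind is sketched below with scikit-learn. The six tweets and labels are placeholders; real data would come from the Twitter API with hand-labelled sentiment.

```python
# Cluster-then-predict: K-means groups tweets, then one SVM per cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

tweets = ["love this phone", "battery is awful", "great camera",
          "screen broke fast", "amazing value", "worst purchase ever"]
labels = np.array([1, 0, 1, 0, 1, 0])          # 1 = positive sentiment

X = TfidfVectorizer().fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

models = {}
for c in range(2):                             # one classifier per cluster
    mask = km.labels_ == c
    if len(set(labels[mask])) > 1:             # need both classes to train
        models[c] = LinearSVC().fit(X[mask], labels[mask])
# At prediction time: assign a new tweet to its nearest cluster, then use
# that cluster's classifier (falling back to a global model if missing).
```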

• [cs.IR]Personalized Search
Fredrik Nygård Carlsen
//arxiv.org/abs/1509.02207v1 

As the volume of electronically available information grows, relevant items become harder to find. This work presents an approach to personalizing search results in scientific publication databases, focusing on re-ranking search results from existing search engines like Solr or ElasticSearch. It also includes the development of Obelix, a new recommendation system used to re-rank search results. The project was proposed and performed at CERN, using the scientific publications available on the CERN Document Server (CDS). The work experiments with re-ranking using offline and online evaluation of users and documents in CDS. The experiments conclude that personalized search results outperform both latest-first and word-similarity ranking in terms of click position in the search results for global search in CDS.

• [cs.LG]A Behavior Analysis-Based Game Bot Detection Approach Considering Various Play Styles
Yeounoh Chung, Chang-yong Park, Noo-ri Kim, Hana Cho, Taebok Yoon, Hunjoo Lee, Jee-Hyong Lee
//arxiv.org/abs/1509.02458v1 

An approach for game bot detection in MMORPGs is proposed based on the analysis of game-playing behavior. Since MMORPGs are large-scale games, users can play in various ways. This variety in playing behavior makes it hard to detect game bots based on play behaviors. In order to cope with this problem, the proposed approach observes the game-playing behaviors of users and groups them by their behavioral similarities. Then, it develops a local bot detection model for each player group. Since the locally optimized models can more accurately detect game bots within each player group, the combination of those models brings about an overall improvement. For the practical purpose of reducing the workload of game servers in service, the game data is collected at a low resolution in time. Behavioral features are selected and developed to accurately detect game bots with the low-resolution data, considering common aspects of MMORPG playing. An experiment with real data from a game currently in service shows that the proposed local model approach yields more accurate results.

• [cs.LG]Data-selective Transfer Learning for Multi-Domain Speech Recognition
Mortaza Doulaty, Oscar Saz, Thomas Hain
//arxiv.org/abs/1509.02409v1 

Negative transfer in the training of acoustic models for automatic speech recognition has been reported in several contexts, such as domain change or speaker characteristics. This paper proposes a novel technique to overcome negative transfer by efficient selection of speech data for acoustic model training. Here, data is chosen based on its relevance to a specific target. A submodular function based on likelihood ratios is used to determine how acoustically similar each training utterance is to a target test set. The approach is evaluated on a wide-domain data set, covering speech from radio and TV broadcasts, telephone conversations, meetings, lectures and read speech. Experiments demonstrate that the proposed technique both finds relevant data and limits negative transfer. Results on a 6-hour test set show a relative improvement of 4% with data selection over using all data in PLP based models, and 2% with DNN features.
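
The likelihood-ratio scoring is sketched below with two GMMs standing in for the paper's acoustic models (the submodular selection and real feature extraction are out of scope; all data is synthetic): each candidate utterance is scored under a target-domain model and a background model, and the most target-like utterances are kept.

```python
# Likelihood-ratio data selection sketch with GMM stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
target_feats = rng.normal(1.0, 1.0, size=(500, 13))   # matches test domain
background_feats = rng.normal(0.0, 1.0, size=(500, 13))

target_gmm = GaussianMixture(4, random_state=0).fit(target_feats)
backgr_gmm = GaussianMixture(4, random_state=0).fit(background_feats)

# Candidate training utterances, scored by mean frame log-likelihood ratio.
utterances = [rng.normal(m, 1.0, size=(100, 13)) for m in (0.0, 0.5, 1.0)]
ratios = [target_gmm.score(u) - backgr_gmm.score(u) for u in utterances]
order = np.argsort(ratios)[::-1]    # most target-like utterances first
print(order)                        # expect the mean-1.0 utterance first
```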

• [cs.LG]Sampled Weighted Min-Hashing for Large-Scale Topic Mining
Gibran Fuentes-Pineda, Ivan Vladimir Meza-Ruiz
//arxiv.org/abs/1509.01771v2 

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification. 
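
To convey the partitioning idea, here is a plain (unweighted) MinHash over toy term co-occurrence sets: terms whose signatures collide land in the same cell of a random vocabulary partition. The actual method uses a weighted min-hash and agglomerates overlapping cells across many partitions, which this sketch does not attempt.

```python
# Plain MinHash partition of a vocabulary by co-occurrence (toy data).
import random
from collections import defaultdict

random.seed(0)
# term -> set of documents it occurs in (toy co-occurrence sets).
cooc = {"neural": {1, 2, 3}, "network": {1, 2, 3}, "market": {7, 8},
        "stock": {7, 8, 9}, "gradient": {2, 3}}

universe = sorted(set().union(*cooc.values()))
perms = []
for _ in range(2):                 # two hash functions -> signature length 2
    p = universe[:]
    random.shuffle(p)              # one random permutation per hash function
    perms.append(dict(zip(universe, p)))

def minhash(docs, perm):
    return min(perm[d] for d in docs)

cells = defaultdict(list)          # signature -> cell of co-occurring terms
for term, docs in cooc.items():
    sig = tuple(minhash(docs, p) for p in perms)
    cells[sig].append(term)
print(dict(cells))                 # similar terms tend to share a cell
```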

• [cs.NA]SEP-QN: Scalable and Extensible Proximal Quasi-Newton Method for Dirty Statistical Models
Shenjian Zhao, Zhihua Zhang
//arxiv.org/abs/1509.02314v1 

We develop a generalized proximal quasi-Newton method for handling “dirty” statistical models where multiple structural constraints are imposed. We consider a general class of M-estimators that minimize the sum of a smooth loss function and a hybrid regularization. We show that the generalized proximal quasi-Newton method inherits the superlinear convergence theoretically and empirically. By employing the smoothed conic dual approach with a quasi-LBFGS updating formula, we obtain a scalable and extensible proximal quasi-Newton (SEP-QN) method. Our method is potentially powerful because it can solve some popular “dirty” statistical models like the fused sparse group lasso with superlinear convergence rate. 

• [cs.NE]DeepCough: A Deep Convolutional Neural Network in A Wearable Cough Detection System
Justice Amoh, Kofi Odame
//arxiv.org/abs/1509.02512v1 

In this paper, we present a system that employs a wearable acoustic sensor and a deep convolutional neural network for detecting coughs. We evaluate the performance of our system on 14 healthy volunteers and compare it to that of other cough detection systems that have been reported in the literature. Experimental results show that our system achieves a classification sensitivity of 95.1% and a specificity of 99.5%. 

• [cs.SI]Coupling Analysis Between Twitter and Call Centre
Fangfang Li, Yanchang Zhao, Klaus Felsche, Guandong Xu, Longbing Cao
//arxiv.org/abs/1509.02238v1 

Social media has been contributing to many research areas such as data mining, recommender systems, time series analysis, etc. However, there are not many successful applications of social media in government agencies. In fact, many governments have social media accounts such as Twitter and Facebook. More and more customers are likely to communicate with governments on social media, generating massive external social media data for governments. This external data can be beneficial for analysing the behaviours and real needs of customers. Besides this, most governments also have a call centre to help customers solve their problems. It is not difficult to imagine that the enquiries on external social media and at the internal call centre may have some coupling relationships. The couplings could be helpful for studying customers' intent and allocating a government's limited resources for better service. In this paper, we mainly focus on analysing the coupling relations between the internal call centre and external public media using time series analysis methods, for the Australian Department of Immigration and Border Protection. The discovered couplings demonstrate that the call centre and public media indeed have correlations, which are significant for understanding customers' behaviours.
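
A basic tool for this kind of coupling analysis is lagged cross-correlation between two count series. The sketch below uses synthetic data in which the call series echoes the tweet series with a 2-day delay, which the lag scan then recovers; the paper's actual methods may differ.

```python
# Lagged cross-correlation between two daily count series (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
tweets = rng.poisson(50, 100).astype(float)
calls = np.roll(tweets, 2) + rng.normal(0, 2, 100)   # lag-2 echo + noise

def xcorr(x, y, max_lag=7):
    """corr between x[t] and y[t+lag], for lag = 0..max_lag-1."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return {lag: np.mean(x[:len(x) - lag] * y[lag:]) for lag in range(max_lag)}

corr = xcorr(tweets, calls)
print(max(corr, key=corr.get))   # expect lag 2: tweets lead calls by 2 days
```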

• [cs.SI]Wikipedia Page View Reflects Web Search Trend
Mitsuo Yoshida, Yuki Arase, Takaaki Tsunoda, Mikio Yamamoto
//arxiv.org/abs/1509.02218v1 

The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different, openly available resource: Wikipedia page views. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.
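
The paper's core measurement in miniature: correlate a keyword's weekly search frequency with its Wikipedia page-view counts. Both series below are synthetic stand-ins for the real logs.

```python
# Pearson correlation between search frequency and page views (synthetic).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
interest = np.abs(np.cumsum(rng.normal(size=52)))   # latent public interest
search_freq = interest * 1000 + rng.normal(0, 50, 52)
wiki_views = interest * 800 + rng.normal(0, 50, 52)

r, p = pearsonr(search_freq, wiki_views)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")   # a high r supports the claim
```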

• [math.ST]A new non-parametric detector of univariate outliers for distributions with positive unbounded support
Jean-Marc Bardet, Faniaha Dimby
//arxiv.org/abs/1509.02473v1 

The purpose of this paper is the construction and the study of the asymptotic properties of a new non-parametric detector of univariate outliers. This detector, based on a Hill-type statistic, is valid for a large set of probability distributions with positive unbounded support, for instance the absolute value of Gaussian, Gamma, Weibull, Student or regularly varying distributions. We illustrate our results by numerical simulations, which show the accuracy of this detector with respect to other usual univariate outlier detectors (Tukey, MADe or Local Outlier Factor detectors). An application to real-life data allows us to detect outliers in a database providing the prices of used cars.
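
For orientation, the Hill estimator on the k largest order statistics is the standard building block behind "Hill-type" statistics. The sketch below computes it and then applies a crude Weissman-type extrapolated quantile as a cut-off; the paper's actual decision rule is more refined than this illustration.

```python
# Hill tail-index estimate plus a crude extrapolated-quantile cut-off.
import numpy as np

def hill(x, k):
    """Hill estimate from the k largest values of x (requires x > 0)."""
    xs = np.sort(x)
    return np.mean(np.log(xs[-k:])) - np.log(xs[-k - 1])

rng = np.random.default_rng(3)
data = np.append(np.abs(rng.standard_normal(1000)), 25.0)  # plant an outlier
k = 50
h = hill(data, k)
# Weissman-type (1 - 1/n) quantile extrapolation as an illustrative rule.
threshold = np.sort(data)[-(k + 1)] * k ** h
print(round(h, 3), data[data > threshold])  # expect the planted 25.0 flagged
```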

• [stat.ML]A Variational Bayesian State-Space Approach to Online Passive-Aggressive Regression
Arnold Salas, Stephen J. Roberts, Michael A. Osborne
//arxiv.org/abs/1509.02438v1 

Online Passive-Aggressive (PA) learning is a class of online margin-based algorithms suitable for a wide range of real-time prediction tasks, including classification and regression. PA algorithms are formulated in terms of deterministic point-estimation problems governed by a set of user-defined hyperparameters: this approach fails to capture model/prediction uncertainty and makes performance highly sensitive to hyperparameter configurations. In this paper, we introduce a novel PA learning framework for regression that overcomes the above limitations. We contribute a Bayesian state-space interpretation of PA regression, along with a novel online variational inference scheme, that not only produces probabilistic predictions, but also offers the benefit of automatic hyperparameter tuning. Experiments with various real-world data sets show that our approach performs significantly better than a more standard, linear Gaussian state-space model.

• [stat.ML]Empirical risk minimization is consistent with the mean absolute percentage error
Arnaud De Myttenaere, Bénédicte Le Grand, Fabrice Rossi
//arxiv.org/abs/1509.02357v1 

We study in this paper the consequences of using the Mean Absolute Percentage Error (MAPE) as a measure of quality for regression models. We show that finding the best model under the MAPE is equivalent to doing weighted Mean Absolute Error (MAE) regression. We also show that, under some assumptions, universal consistency of Empirical Risk Minimization remains possible using the MAPE.
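
The stated equivalence is easy to check numerically: the MAPE of a linear model is exactly an MAE in which each sample is weighted by 1/|y_i|, so any weighted-MAE solver also minimizes the MAPE. The data and optimizer below are illustrative choices, not the paper's.

```python
# Numerical check: MAPE == MAE with per-sample weights 1/|y|.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 10 + rng.normal(0, 0.1, 200)   # y stays far from zero

def mape(b):
    resid = np.abs(y - (X @ b[:3] + b[3]))
    return np.mean(resid / np.abs(y))

def weighted_mae(b, w=1.0 / np.abs(y)):
    resid = np.abs(y - (X @ b[:3] + b[3]))
    return np.mean(w * resid)

b0 = np.zeros(4)
print(np.isclose(mape(b0), weighted_mae(b0)))      # the objectives coincide
fit = minimize(mape, b0, method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8})
print(fit.x.round(2))       # approximately recovers [2, -1, 0.5, 10]
```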

• [stat.ML]Modelling time evolving interactions in networks through a non stationary extension of stochastic block models
Marco Corneli, Pierre Latouche, Fabrice Rossi
//arxiv.org/abs/1509.02347v1 

In this paper, we focus on the stochastic block model (SBM), a probabilistic tool describing interactions between nodes of a network using latent clusters. The SBM assumes that the network has a stationary structure, in which connections of time varying intensity are not taken into account. In other words, interactions between two groups are forced to have the same features during the whole observation time. To overcome this limitation, we propose a partition of the whole time horizon in which interactions are observed, and develop a non stationary extension of the SBM, allowing us to simultaneously cluster the nodes of a network along with the fixed time intervals in which the interactions take place. The number of clusters (K for nodes, D for time intervals) as well as the class memberships are finally obtained by maximizing the complete-data integrated likelihood by means of a greedy search approach. After showing that the model works properly with simulated data, we focus on a real data set: the three-day ACM Hypertext conference held in Turin, June 29th - July 1st 2009. Proximity interactions between attendees during the first day are modelled, and an interesting clustering of the daily hours is finally obtained, with times of social gathering (e.g. coffee breaks) recovered by the approach. Applications to large networks are limited due to the computational complexity of the greedy search, which is dominated by the numbers $K_{max}$ and $D_{max}$ of clusters used in the initialization. Therefore, advanced clustering tools are considered to reduce the number of clusters expected in the data, making the greedy search applicable to large networks.

• [stat.ML]On the complexity of piecewise affine system identification
Fabien Lauer
//arxiv.org/abs/1509.02348v1 

The paper provides results regarding the computational complexity of hybrid system identification. More precisely, we focus on the estimation of piecewise affine (PWA) maps from input-output data and analyze the complexity of computing a global minimizer of the error. Previous work showed that a global solution could be obtained for continuous PWA maps with a worst-case complexity exponential in the number of data. In this paper, we show how global optimality can be reached for a slightly more general class of possibly discontinuous PWA maps with a complexity that is only polynomial in the number of data, but exponential in the data dimension. This result is obtained via an analysis of the intrinsic classification subproblem of associating the data points to the different modes. In addition, we prove that the problem is NP-hard, and thus that the exponential complexity in the dimension is a natural expectation for any exact algorithm.

Teacher Chen, PRIS Pattern Recognition Lab, Beijing University of Posts and Telecommunications. Business cooperation: QQ 1289468869, Email: 1289468869@qq.com