Today's Academic Horizons (2015.9.5)

September 5, 2015, 07:07
• [astro-ph.HE]Machine Learning Model of the Swift/BAT Trigger Algorithm for Long GRB Population Studies
• [astro-ph.IM]A Gibbs Sampler for Multivariate Linear Regression
• [cs.AI]Building a Truly Distributed Constraint Solver with JADE
• [cs.AI]Generating Weather Forecast Texts with Case Based Reasoning
• [cs.CL]Encoding Prior Knowledge with Eigenword Embeddings
• [cs.CL]On TimeML-Compliant Temporal Expression Extraction in Turkish
• [cs.CV]A Novice Guide towards Human Motion Analysis and Understanding
• [cs.CV]Vision-Based Road Detection using Contextual Blocks
• [cs.CY]Big data, bigger dilemmas: A critical review
• [cs.DC]Parallel Knowledge Embedding with MapReduce on a Multi-core Processor
• [cs.LG]A tree-based kernel for graphs with continuous attributes
• [cs.LG]Fast Clustering and Topic Modeling Based on Rank-2 Nonnegative Matrix Factorization
• [cs.LG]On-the-Fly Learning in a Perpetual Learning Machine
• [cs.LG]Train faster, generalize better: Stability of stochastic gradient descent
• [cs.LG]Training a Restricted Boltzmann Machine for Classification by Labeling Model Samples
• [cs.NE]A compact aVLSI conductance-based silicon neuron
• [cs.NE]Training of CC4 Neural Network with Spread Unary Coding
• [cs.SI]Tag Me Maybe: Perceptions of Public Targeted Sharing on Facebook
• [cs.SY]Model Predictive Path Integral Control using Covariance Variable Importance Sampling
• [math.ST]Active Learning for Adaptive Clinical Trials: a Stream-based Selective Sampling Strategy
• [math.ST]Generalized Quantile Treatment Effect: A Flexible Bayesian Approach Using Quantile Ratio Smoothing
• [math.ST]Necessary and Sufficient Conditions for High-Dimensional Posterior Consistency under $g$-Priors
• [physics.data-an]Comparing non-nested models in the search for new physics
• [stat.AP]Extreme Value Theory for Time Series using Peak-Over-Threshold method
• [stat.ME]A novel principal component analysis for spatially-misaligned multivariate air pollution data
• [stat.ME]PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data
• [stat.ML]Community Detection in Networks with Node Features
• [stat.ML]Semi-described and semi-supervised learning with Gaussian processes 

·····································

• [astro-ph.HE]Machine Learning Model of the Swift/BAT Trigger Algorithm for Long GRB Population Studies
Philip B Graff, Amy Y Lien, John G Baker, Takanori Sakamoto
//arxiv.org/abs/1509.01228v1 

To draw inferences about gamma-ray burst (GRB) source populations based on Swift observations, it is essential to understand the detection efficiency of the Swift burst alert telescope (BAT). This study considers the problem of modeling the Swift/BAT triggering algorithm for long GRBs, a computationally expensive procedure, and models it using machine learning algorithms. A large sample of simulated GRBs from Lien 2014 is used to train various models: random forests, boosted decision trees (with AdaBoost), support vector machines, and artificial neural networks. The best models have accuracies of $\gtrsim 97\%$ ($\lesssim 3\%$ error), which is a significant improvement on a cut in GRB flux, which has an accuracy of $89.6\%$ ($10.4\%$ error). These models are then used to measure the detection efficiency of Swift as a function of redshift $z$, which is used to perform Bayesian parameter estimation on the GRB rate distribution. We find a local GRB rate density of $n_0 \sim 0.48^{+0.41}_{-0.23} \ {\rm Gpc}^{-3}\,{\rm yr}^{-1}$ with power-law indices of $n_1 \sim 1.7^{+0.6}_{-0.5}$ and $n_2 \sim -5.9^{+5.7}_{-0.1}$ for GRBs above and below a break point of $z_1 \sim 6.8^{+2.8}_{-3.2}$. This methodology is able to improve upon earlier studies by more accurately modeling Swift detection and using this for fully Bayesian model fitting. The code used in this analysis is publicly available online (https://github.com/PBGraff/SwiftGRB_PEanalysis).
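
A minimal sketch of this kind of pipeline, assuming synthetic stand-in features (the paper's simulated-GRB inputs are far richer): train a random forest to emulate the trigger decision, then bin its predictions in redshift to obtain an efficiency curve.

```python
# Hypothetical sketch: emulate a trigger algorithm with a random forest.
# All features and labels below are synthetic stand-ins, not the paper's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([rng.lognormal(0, 1, n),    # stand-in peak flux
                     rng.uniform(0, 10, n),     # stand-in redshift z
                     rng.lognormal(3, 1, n)])   # stand-in duration [s]
# Synthetic "triggered" label: brighter bursts trigger more often.
y = (np.log(X[:, 0]) + rng.normal(0, 0.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# Detection efficiency vs. redshift: mean predicted trigger probability per bin.
z, p = X_te[:, 1], clf.predict_proba(X_te)[:, 1]
bins = np.linspace(0, 10, 11)
eff = [p[(z >= lo) & (z < hi)].mean() for lo, hi in zip(bins[:-1], bins[1:])]
```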

• [astro-ph.IM]A Gibbs Sampler for Multivariate Linear Regression
Adam B. Mantz
//arxiv.org/abs/1509.00908v1 

Kelly (2007, hereafter K07) described an efficient algorithm, using Gibbs sampling, for performing linear regression in the fairly general case where non-zero measurement errors exist for both the covariates and response variables, where these measurements may be correlated (for the same data point), where the response variable is affected by intrinsic scatter in addition to measurement error, and where the prior distribution of covariates is modeled by a flexible mixture of Gaussians rather than assumed to be uniform. Here I extend the K07 algorithm in two ways. First, the procedure is generalized to the case of multiple response variables. Second, I describe how to model the prior distribution of covariates using a Dirichlet process, which can be thought of as a Gaussian mixture where the number of mixture components is learned from the data. I present an example of multivariate regression using the extended algorithm, namely fitting scaling relations of the gas mass, temperature, and luminosity of dynamically relaxed galaxy clusters as a function of their mass and redshift. An implementation of the Gibbs sampler in the R language, called LRGS, is provided. 
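
A stripped-down sketch of the conjugate core of such a sampler, assuming no measurement errors, a flat prior on the coefficients, and an inverse-Wishart prior on the intrinsic covariance (the paper's measurement-error and Dirichlet-process components, and the LRGS R implementation, are omitted): alternate between drawing the coefficient matrix given the covariance and the covariance given the coefficients.

```python
# Minimal Gibbs sampler sketch for Y = X B + E, rows of E ~ N(0, Sigma).
import numpy as np
from scipy.stats import invwishart

def gibbs_mvreg(X, Y, n_iter=2000, nu0=None, S0=None):
    n, p = X.shape
    q = Y.shape[1]
    nu0 = q + 2 if nu0 is None else nu0          # assumed IW prior df
    S0 = np.eye(q) if S0 is None else S0         # assumed IW prior scale
    XtX_inv = np.linalg.inv(X.T @ X)
    B_hat = XtX_inv @ (X.T @ Y)                  # OLS coefficient matrix
    L = np.linalg.cholesky(XtX_inv)              # row-covariance factor
    Sigma = np.cov(Y, rowvar=False)              # initial covariance
    draws_B, draws_S = [], []
    for _ in range(n_iter):
        # B | Sigma:  vec(B) ~ N(vec(B_hat), Sigma (x) (X'X)^{-1})
        R = np.linalg.cholesky(Sigma)
        B = B_hat + L @ np.random.standard_normal((p, q)) @ R.T
        # Sigma | B:  inverse-Wishart with residual sum of squares added
        resid = Y - X @ B
        Sigma = invwishart.rvs(df=nu0 + n, scale=S0 + resid.T @ resid)
        draws_B.append(B)
        draws_S.append(Sigma)
    return np.array(draws_B), np.array(draws_S)
```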

• [cs.AI]Building a Truly Distributed Constraint Solver with JADE
Ibrahim Adeyanju
//arxiv.org/abs/1509.01040v1 

Real life problems such as scheduling meetings between people at different locations can be modelled as distributed Constraint Satisfaction Problems (CSPs). Suitable and satisfactory solutions can then be found using constraint satisfaction algorithms, which can be exhaustive (backtracking) or otherwise (local search). However, most research in this area has tested algorithms by simulation on a single PC with a single program entry point. The main contribution of our work is the design and implementation of a truly distributed constraint solver based on a local search algorithm, using the Java Agent DEvelopment framework (JADE) to enable communication between agents on different machines. In particular, we discuss design and implementation issues related to a truly distributed constraint solver which might not be critical when simulated on a single machine. Evaluation results indicate that our truly distributed constraint solver works well within the observed limitations when tested with various distributed CSPs. Our application can also incorporate any constraint solving algorithm with little modification.
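
A minimal single-machine sketch of the local-search core such agents could run, assuming a min-conflicts heuristic (this is not necessarily the paper's exact algorithm, and all JADE agent messaging is omitted):

```python
# Min-conflicts local search for a binary-constraint CSP.
import random

def min_conflicts(variables, domains, conflicts, max_steps=10000, seed=0):
    """conflicts(var, val, assignment) -> number of violated constraints."""
    rng = random.Random(seed)
    assign = {v: rng.choice(domains[v]) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables if conflicts(v, assign[v], assign) > 0]
        if not conflicted:
            return assign                 # all constraints satisfied
        var = rng.choice(conflicted)
        # move to the value that minimizes this variable's conflicts
        assign[var] = min(domains[var], key=lambda d: conflicts(var, d, assign))
    return None                           # no solution found within budget

# Toy meeting-scheduling instance: three people pick a slot, all must differ.
people = ["A", "B", "C"]
doms = {p: [1, 2, 3] for p in people}
clash = lambda v, d, a: sum(1 for u in people if u != v and a[u] == d)
print(min_conflicts(people, doms, clash))
```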

• [cs.AI]Generating Weather Forecast Texts with Case Based Reasoning
Ibrahim Adeyanju
//arxiv.org/abs/1509.01023v1 

Several techniques have been used to generate weather forecast texts. In this paper, case based reasoning (CBR) is proposed for weather forecast text generation because similar weather conditions occur over time and should have similar forecast texts. CBR-METEO, a system for generating weather forecast texts, was developed using a generic framework (jCOLIBRI) which provides modules for the standard components of the CBR architecture. The advantage of a CBR approach is that systems can be built in minimal time with far less human effort after initial consultation with experts. The approach depends heavily on the goodness of the retrieval and revision components of the CBR process. We evaluated CBR-METEO with NIST, an automated metric which has been shown to correlate well with human judgements for this domain. The system shows comparable performance with other NLG systems that perform the same task.
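
A toy sketch of the retrieve/reuse steps, assuming weather conditions are numeric vectors and forecasts are retrieved by nearest neighbour; the features and case base are invented for illustration, and the revision step is omitted:

```python
# Hypothetical CBR retrieval: nearest stored case supplies the forecast text.
import numpy as np

case_base = [  # (condition vector: [wind speed, temperature, rain prob], text)
    (np.array([30.0, 12.0, 0.8]), "Windy with rain, becoming heavy at times."),
    (np.array([ 5.0, 22.0, 0.1]), "Calm and warm with sunny spells."),
    (np.array([15.0, 16.0, 0.4]), "Moderate breeze, scattered showers."),
]

def retrieve(query):
    dists = [np.linalg.norm(query - c) for c, _ in case_base]
    return case_base[int(np.argmin(dists))]

_, text = retrieve(np.array([28.0, 11.0, 0.7]))
print(text)  # reuse step: "Windy with rain, becoming heavy at times."
```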

• [cs.CL]Encoding Prior Knowledge with Eigenword Embeddings
Dominique Osborne, Shashi Narayan, Shay B. Cohen
//arxiv.org/abs/1509.01007v1 

Canonical correlation analysis (CCA) is a method for reducing the dimension of data represented using two views. It has been previously used to derive word embeddings, where one view indicates a word, and the other view indicates its context. We describe a way to incorporate prior knowledge into CCA, give a theoretical justification for it, and test it by deriving word embeddings and evaluating them on a myriad of datasets. 
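
A small sketch of the embedding mechanics, assuming two synthetic views with shared latent structure; real eigenword pipelines build the views from corpus co-occurrence statistics, and the paper's prior-knowledge extension is not shown:

```python
# CCA of a word view and a context view; the word-view projections act as
# embeddings. Both views here are random stand-ins with a shared latent Z.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d_w, d_c, k = 1000, 50, 40, 10
Z = rng.standard_normal((n, k))                   # shared latent structure
W = Z @ rng.standard_normal((k, d_w)) + 0.1 * rng.standard_normal((n, d_w))
C = Z @ rng.standard_normal((k, d_c)) + 0.1 * rng.standard_normal((n, d_c))

cca = CCA(n_components=k).fit(W, C)
W_emb, C_emb = cca.transform(W, C)                # k-dimensional embeddings
print(W_emb.shape)                                # (1000, 10)
```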

• [cs.CL]On TimeML-Compliant Temporal Expression Extraction in Turkish
Dilek Küçük, Doğan Küçük
//arxiv.org/abs/1509.00963v1 

It is commonly acknowledged that temporal expression extractors are important components of larger natural language processing systems like information retrieval and question answering systems. Extraction and normalization of temporal expressions in Turkish has not been given attention so far, apart from the extraction of some date and time expressions in the course of named entity recognition. As TimeML is the current standard for temporal expression and event annotation in natural language texts, in this paper we present an analysis of temporal expressions in Turkish based on the related TimeML classification (i.e., date, time, duration, and set expressions). We have created a lexicon of Turkish temporal expressions and devised considerably wide-coverage patterns using the lexical classes as the building blocks. We believe that the proposed patterns, together with convenient normalization rules, can be readily used by prospective temporal expression extraction tools for Turkish.
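
A toy illustration of the lexicon-plus-patterns idea, assuming a single hypothetical DATE pattern over Turkish month names; the paper's pattern set and normalization rules are far more extensive:

```python
# Illustrative lexicon entry (month names) plus one DATE pattern.
import re

MONTHS = r"(Ocak|Şubat|Mart|Nisan|Mayıs|Haziran|Temmuz|Ağustos|Eylül|Ekim|Kasım|Aralık)"
DATE = re.compile(rf"\b([0-3]?\d)\s+{MONTHS}(\s+(\d{{4}}))?\b")

text = "Toplantı 5 Eylül 2015 tarihinde yapılacak."
for m in DATE.finditer(text):
    day, month, year = m.group(1), m.group(2), m.group(4)
    # A TimeML-style normalizer would map the month name to its number
    # and emit a TIMEX3 value; here we just print the pieces.
    print(m.group(0), "->", f"{year or 'XXXX'}-{month}-{day.zfill(2)}")
```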

• [cs.CV]A Novice Guide towards Human Motion Analysis and Understanding
Ahmed Nabil Mohamed
//arxiv.org/abs/1509.01074v1 

Human motion analysis and understanding has been, and still is, the focus of attention of many disciplines, which is an obvious indicator of the subject's wide and massive importance. The purpose of this article is to shed some light on this very important subject, so that it can serve as a good entry point for a novice computer vision researcher in this field by providing him/her with a wealth of knowledge about the subject, covering many directions. There are two main contributions of this article. The first investigates various aspects of some disciplines (e.g., arts, philosophy, psychology, and neuroscience) that are interested in the subject and reviews some of their contributions, stressing those that can be useful for computer vision researchers. Moreover, many examples are illustrated to indicate the benefits of integrating concepts and results among different disciplines. The second contribution is concerned with the subject from the computer vision aspect, where we discuss the following issues. First, we explore many demanding and promising applications to reveal the wide and massive importance of the field. Second, we list various types of sensors that may be used for acquiring various data. Third, we review different taxonomies used for classifying motions. Fourth, we review various processes involved in motion analysis. Fifth, we exhibit how different surveys are structured. Sixth, we examine many of the most cited and recent reviews in the field that have been published during the past two decades to reveal various approaches used for implementing different stages of the problem and refer to various algorithms and their suitability for different situations. Moreover, we provide a long list of public datasets and discuss briefly some examples of these datasets. Finally, we provide a general discussion of the subject from the aspect of computer vision.

• [cs.CV]Vision-Based Road Detection using Contextual Blocks
Caio César Teodoro Mendes, Vincent Frémont, Denis Fernando Wolf
//arxiv.org/abs/1509.01122v1 

Road detection is a fundamental task in autonomous navigation systems. In this paper, we consider the case of monocular road detection, where images are segmented into road and non-road regions. Our starting point is the well-known machine learning approach, in which a classifier is trained to distinguish road and non-road regions based on hand-labeled images. We proceed by introducing the use of “contextual blocks” as an efficient way of providing contextual information to the classifier. Overall, the proposed methodology, including its image feature selection and classifier, was conceived with computational cost in mind, leaving room for optimized implementations. Regarding experiments, we perform a sensible evaluation of each phase and feature subset that composes our system. The results show a great benefit from using contextual blocks and demonstrate their computational efficiency. Finally, we submit our results to the KITTI road detection benchmark, achieving scores comparable with state-of-the-art methods.
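
A rough sketch of the contextual-blocks idea, assuming mean-RGB block features and a square neighbourhood (the paper's feature set and block layout differ): each block's classifier input is its own features concatenated with those of the surrounding blocks.

```python
# Hypothetical contextual-block feature construction.
import numpy as np

def block_features(img, bs):
    """Mean color per bs x bs block -> grid of shape (H//bs, W//bs, 3)."""
    H, W, _ = img.shape
    g = img[:H // bs * bs, :W // bs * bs].reshape(H // bs, bs, W // bs, bs, 3)
    return g.mean(axis=(1, 3))

def contextual(grid, r=1):
    """Concatenate each block's features with its (2r+1)^2 neighborhood."""
    gh, gw, d = grid.shape
    padded = np.pad(grid, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.empty((gh, gw, d * (2 * r + 1) ** 2))
    for i in range(gh):
        for j in range(gw):
            out[i, j] = padded[i:i + 2 * r + 1, j:j + 2 * r + 1].ravel()
    return out  # feed rows of this to any per-block classifier

img = np.random.rand(120, 160, 3)          # stand-in image
X = contextual(block_features(img, bs=8))
print(X.shape)                              # (15, 20, 27)
```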

• [cs.CY]Big data, bigger dilemmas: A critical review
Hamid Ekbia, Michael Mattioli, Inna Kouper, G. Arave, Ali Ghazinejad, Timothy Bowman, Venkata Ratandeep Suri, Andrew Tsou, Scott Weingart, Cassidy R. Sugimoto
//arxiv.org/abs/1509.00909v1 

The recent interest in Big Data has generated a broad range of new academic, corporate, and policy practices along with an evolving debate amongst its proponents, detractors, and skeptics. While the practices draw on a common set of tools, techniques, and technologies, most contributions to the debate come either from a particular disciplinary perspective or with an eye on a domain-specific issue. A close examination of these contributions reveals a set of common problematics that arise in various guises in different places. It also demonstrates the need for a critical synthesis of the conceptual and practical dilemmas surrounding Big Data. The purpose of this article is to provide such a synthesis by drawing on relevant writings in the sciences, humanities, policy, and trade literature. In bringing these diverse literatures together, we aim to shed light on the common underlying issues that concern and affect all of these areas. By contextualizing the phenomenon of Big Data within larger socio-economic developments, we also seek to provide a broader understanding of its drivers, barriers, and challenges. This approach allows us to identify attributes of Big Data that need to receive more attention (autonomy, opacity, generativity, disparity, and futurity), leading to questions and ideas for moving beyond dilemmas.

• [cs.DC]Parallel Knowledge Embedding with MapReduce on a Multi-core Processor
Miao Fan, Qiang Zhou, Thomas Fang Zheng, Ralph Grishman
//arxiv.org/abs/1509.01183v1 

This article makes a first attempt to explore parallel algorithms for learning distributed representations of both entities and relations in large-scale knowledge repositories with the {\it MapReduce} programming model on a multi-core processor. We accelerate the training progress of a canonical knowledge embedding method, the {\it translating embedding} ({\bf TransE}) model, by dividing a whole knowledge repository into several balanced subsets, and feeding each subset into an individual core where local embedding updates can run concurrently during the {\it Map} phase. However, this usually suffers from inconsistent low-dimensional vector representations of the same key collected from different {\it Map} workers, which leads to conflicts when conducting {\it Reduce} to merge the various vectors associated with the same key. Therefore, we try several strategies to acquire merged embeddings which may not only retain the performance of {\it entity inference}, {\it relation prediction}, and even {\it triplet classification} evaluated by the single-thread {\bf TransE} on several well-known knowledge bases such as Freebase and NELL, but also scale up the learning speed with the number of cores within a processor. So far, the empirical studies show that we can achieve results comparable to the single-thread {\bf TransE} trained with the {\it stochastic gradient descent} (SGD) algorithm, while increasing the training speed multiple times by adapting the {\it batch gradient descent} (BGD) algorithm to the {\it MapReduce} paradigm.
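
For concreteness, a single-thread sketch of one TransE update with the usual margin loss and L2 distance (standard TransE, not the article's parallel variant); in the MapReduce setting, each Map worker runs updates like this on its shard, and the Reduce step must merge the per-worker copies of each vector, e.g. by averaging:

```python
# One TransE SGD step: push h + r toward t for a true triple, apart for a
# corrupted one, when the margin is violated.
import numpy as np

def transe_step(E, R, pos, neg, lr=0.01, margin=1.0):
    """E: entity embeddings, R: relation embeddings, pos/neg: (h, r, t) ids."""
    h, r, t = pos
    h2, _, t2 = neg                       # corrupted triple, same relation
    d_pos = E[h] + R[r] - E[t]
    d_neg = E[h2] + R[r] - E[t2]
    loss = margin + np.linalg.norm(d_pos) - np.linalg.norm(d_neg)
    if loss > 0:                          # only violated triples update
        g_pos = d_pos / (np.linalg.norm(d_pos) + 1e-12)
        g_neg = d_neg / (np.linalg.norm(d_neg) + 1e-12)
        E[h]  -= lr * g_pos
        E[t]  += lr * g_pos
        R[r]  -= lr * (g_pos - g_neg)
        E[h2] += lr * g_neg
        E[t2] -= lr * g_neg
    return max(loss, 0.0)
```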

• [cs.LG]A tree-based kernel for graphs with continuous attributes
Giovanni Da San Martino, Nicolò Navarin, Alessandro Sperduti
//arxiv.org/abs/1509.01116v1 

The availability of graph data with node attributes that can be either discrete or real-valued is constantly increasing. While existing kernel methods are effective techniques for dealing with graphs having discrete node labels, their adaptation to non-discrete or continuous node attributes has been limited, mainly owing to computational issues. Recently, a few kernels especially tailored for this domain have been proposed. In order to alleviate the computational problems, the feature spaces of such kernels tend to be smaller than those of the kernels for discrete node attributes. However, this choice might have a negative impact on predictive performance. In this paper, we propose a graph kernel for complex and continuous node attributes, whose features are tree structures extracted from specific graph visits. Experimental results obtained on real-world datasets show that the (approximated version of the) proposed kernel is comparable with current state-of-the-art kernels in terms of classification accuracy while requiring shorter running times.

• [cs.LG]Fast Clustering and Topic Modeling Based on Rank-2 Nonnegative Matrix Factorization
Da Kuang, Richard Boyd, Barry Drake, Haesun Park
//arxiv.org/abs/1509.01208v1 

The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data. In this paper, we propose a fast method for hierarchical clustering and topic modeling called HierNMF2. Our method is based on fast Rank-2 nonnegative matrix factorization (NMF) that performs binary clustering and an efficient node splitting rule. Further utilizing the final leaf nodes generated in HierNMF2 and the idea of nonnegative least squares fitting, we propose a new clustering/topic modeling method called FlatNMF2 that recovers a flat clustering/topic modeling result in a very simple yet significantly more effective way than other existing methods. We describe highly optimized open source software in C++ for both HierNMF2 and FlatNMF2 for hierarchical and partitional clustering/topic modeling of document data sets. Substantial experimental tests are presented that illustrate significant improvements in both computational time and quality of solutions. We compare our methods to other clustering methods including K-means, standard NMF, and CLUTO, and also topic modeling methods including latent Dirichlet allocation (LDA) and recently proposed algorithms for NMF with separability constraints. Overall, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains.
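
A hedged sketch of one binary split via rank-2 NMF, using plain multiplicative updates and an argmax assignment in place of the paper's fast active-set solver and node-splitting rule:

```python
# One binary split: factor the term-by-document matrix A into W (m x 2)
# and H (2 x n), then assign each document to its dominant topic.
import numpy as np

def rank2_split(A, n_iter=200, seed=0):
    """A: nonnegative term-by-document matrix. Returns a 0/1 doc labeling."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, 2)) + 1e-3
    H = rng.random((2, n)) + 1e-3
    for _ in range(n_iter):                      # Lee-Seung updates
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
    return H.argmax(axis=0)                      # cluster of each document

A = np.random.rand(100, 30)                      # stand-in corpus matrix
labels = rank2_split(A)
# Recursing on each side of the split yields a HierNMF2-style hierarchy.
```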

• [cs.LG]On-the-Fly Learning in a Perpetual Learning Machine
Andrew J. R. Simpson
//arxiv.org/abs/1509.00913v1 

Despite the promise of brain-inspired machine learning, deep neural networks (DNN) have frustratingly failed to bridge the deceptively large gap between learning and memory. Here, we introduce a Perpetual Learning Machine; a new type of DNN that is capable of brain-like dynamic ‘on the fly’ learning because it exists in a self-supervised state of Perpetual Stochastic Gradient Descent. Thus, we provide the means to unify learning and memory within a machine learning framework. 

• [cs.LG]Train faster, generalize better: Stability of stochastic gradient descent
Moritz Hardt, Benjamin Recht, Yoram Singer
//arxiv.org/abs/1509.01240v1 

We show that any model trained by a stochastic gradient method with few iterations has vanishing generalization error. We prove this by showing the method is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. Our results apply to both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new explanations for why multiple epochs of stochastic gradient descent generalize well in practice. In the nonconvex case, we provide a new interpretation of common practices in neural networks, and provide a formal rationale for stability-promoting mechanisms in training large, deep models. Conceptually, our findings underscore the importance of reducing training time beyond its obvious benefit. 
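
For reference, the convex-case result has a compact form. A sketch in LaTeX, stated from memory and up to notation and constants, under the paper's assumptions (convex, $L$-Lipschitz, $\beta$-smooth losses; $T$ steps of SGD with step sizes $\alpha_t \le 2/\beta$ on $n$ samples):

```latex
% Uniform stability of SGD bounds the expected generalization gap:
\epsilon_{\mathrm{stab}} \;\le\; \frac{2L^2}{n}\sum_{t=1}^{T}\alpha_t,
\qquad
\bigl|\,\mathbb{E}[R_{\mathrm{test}} - R_{\mathrm{train}}]\,\bigr| \;\le\; \epsilon_{\mathrm{stab}}.
```

Fewer steps (or smaller step sizes) mean a smaller stability constant, hence the title's "train faster, generalize better".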

• [cs.LG]Training a Restricted Boltzmann Machine for Classification by Labeling Model Samples
Malte Probst, Franz Rothlauf
//arxiv.org/abs/1509.01053v1 

We propose an alternative method for training a classification model. Using the MNIST set of handwritten digits and Restricted Boltzmann Machines, it is possible to reach a classification performance competitive to semi-supervised learning if we first train a model in an unsupervised fashion on unlabeled data only, and then manually add labels to model samples instead of training data samples with the help of a GUI. This approach can benefit from the fact that model samples can be presented to the human labeler in a video-like fashion, resulting in a higher number of labeled examples. Also, after some initial training, hard-to-classify examples can be distinguished from easy ones automatically, saving manual work. 

• [cs.NE]A compact aVLSI conductance-based silicon neuron
Runchun Wang, Chetan Singh Thakur, Tara Julia Hamilton, Jonathan Tapson, Andre van Schaik
//arxiv.org/abs/1509.00962v1 

We present an analogue Very Large Scale Integration (aVLSI) implementation that uses first-order lowpass filters to implement a conductance-based silicon neuron for high-speed neuromorphic systems. The aVLSI neuron consists of a soma (cell body) and a single synapse, which is capable of linearly summing both the excitatory and inhibitory postsynaptic potentials (EPSP and IPSP) generated by the spikes arriving from different sources. Rather than biasing the silicon neuron with different parameters for different spiking patterns, as is typically done, we provide digital control signals, generated by an FPGA, to the silicon neuron to obtain different spiking behaviours. The proposed neuron occupies only ~26.5 μm² in the IBM 130 nm process and thus can be integrated at very high density. Circuit simulations show that this neuron can emulate different spiking behaviours observed in biological neurons.
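
A behavioural sketch of the basic building block, assuming illustrative time constants and weights (not the chip's parameters): a first-order lowpass filter turning an input spike train into a postsynaptic potential.

```python
# Discrete-time first-order lowpass filter as a synapse model:
#   tau * dV/dt = -V + w * s(t),  integrated with forward Euler.
import numpy as np

dt, tau, w = 1e-4, 5e-3, 1.0          # 0.1 ms step, 5 ms synaptic tau
spikes = np.zeros(1000)
spikes[[100, 120, 140, 600]] = 1.0    # input spike times (illustrative)

psp = np.zeros_like(spikes)
for i in range(1, len(spikes)):
    psp[i] = psp[i - 1] + dt / tau * (-psp[i - 1] + w * spikes[i])
# Excitatory and inhibitory inputs would sum linearly into the soma.
```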

• [cs.NE]Training of CC4 Neural Network with Spread Unary Coding
Pushpa Sree Potluri
//arxiv.org/abs/1509.01126v1 

This paper adapts the corner classification algorithm (CC4) to train the neural networks using spread unary inputs. This is an important problem as spread unary appears to be at the basis of data representation in biological learning. The modified CC4 algorithm is tested using the pattern classification experiment and the results are found to be good. Specifically, we show that the number of misclassified points is not particularly sensitive to the chosen radius of generalization. 

• [cs.SI]Tag Me Maybe: Perceptions of Public Targeted Sharing on Facebook
Saiph Savage, Andres Monroy-Hernandez, Kasturi Bhattacharjee, Tobias Hollerer
//arxiv.org/abs/1509.01095v1 

Social network sites allow users to publicly tag people in their posts. These tagged posts allow users to share to both the general public and a targeted audience, dynamically assembled via notifications that alert the people mentioned. We investigate people’s perceptions of this mixed sharing mode through a qualitative study with 120 participants. We found that individuals like this sharing modality as they believe it strengthens their relationships. Individuals also report using tags to have more control of Facebook’s ranking algorithm, and to expose one another to novel information and people. This work helps us understand people’s complex relationships with the algorithms that mediate their interactions with one another. We conclude by discussing the design implications of these findings.

• [cs.SY]Model Predictive Path Integral Control using Covariance Variable Importance Sampling
Grady Williams, Andrew Aldrich, Evangelos Theodorou
//arxiv.org/abs/1509.01149v1 

In this paper we present a Model Predictive Path Integral (MPPI) control algorithm that is derived from the path integral control framework and a generalized importance sampling scheme. In order to operate in real time we parallelize the sampling-based component of the algorithm and achieve massive speed-up by using a Graphics Processing Unit (GPU). We compare MPPI against traditional model predictive control methods based on Differential Dynamic Programming (MPC-DDP) on a benchmark cart-pole swing-up task as well as navigation tasks for a racing car and a quadrotor in simulation. Finally, we use MPPI to navigate teams of three (48 states) and nine (144 states) quadrotors through cluttered environments of fixed and moving obstacles in simulation.
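
A generic MPPI sketch under stand-in dynamics and costs (a 1-D point mass; the paper's GPU parallelization replaces the Python loop over samples): perturb the nominal controls, roll out, and re-weight by exponentiated trajectory cost.

```python
# Generic MPPI update: U <- U + sum_k w_k * eps_k, w_k ∝ exp(-S_k / lambda).
import numpy as np

def mppi(x0, U, dynamics, cost, K=256, sigma=0.5, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    T = len(U)
    noise = rng.normal(0.0, sigma, size=(K, T))   # control perturbations
    S = np.zeros(K)                               # trajectory costs
    for k in range(K):
        x = x0.copy()
        for t in range(T):
            x = dynamics(x, U[t] + noise[k, t])
            S[k] += cost(x)
    w = np.exp(-(S - S.min()) / lam)              # importance weights
    w /= w.sum()
    return U + w @ noise                          # re-weighted control update

# Toy 1-D point mass: state (position, velocity), control = acceleration.
dyn = lambda x, u: np.array([x[0] + 0.05 * x[1], x[1] + 0.05 * u])
cst = lambda x: (x[0] - 1.0) ** 2 + 0.1 * x[1] ** 2
U = mppi(np.zeros(2), np.zeros(20), dyn, cst)
```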

• [math.ST]Active Learning for Adaptive Clinical Trials: a Stream-based Selective Sampling Strategy
James E. Barrett
//arxiv.org/abs/1509.01058v1 

Active machine learning seeks out data samples that are estimated to be maximally informative, with the aim of requiring fewer data samples overall to achieve successful learning. We apply active machine learning techniques to a novel adaptive clinical trial design in which only patients that are expected to provide sufficient statistical information are recruited. Allocation to a treatment arm is also done in an optimal manner. The proposed design is a type of stream-based selective sampling where candidate patients form a stream of data points. This work builds on previous selective recruitment designs by applying the approach to binary outcomes and developing four methods for evaluating the informativeness of a patient, based on uncertainty sampling, the posterior entropy, the expected generalisation error, and variance reduction. The proposed design can be extended to any type of stream-based active learning. We examine the performance of these methods using both experimental data and numerical simulations and illustrate that selective recruitment designs can achieve statistically significant results using fewer recruits than randomised trials, thus offering both economic and ethical advantages.
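
A minimal sketch of the uncertainty-sampling variant for a binary outcome, assuming a logistic model and an illustrative 0.4 to 0.6 recruitment band (the paper's other three criteria and the allocation rule are not shown):

```python
# Stream-based selective sampling: recruit only candidates whose predicted
# outcome probability is near 0.5, i.e. where the model is most uncertain.
import numpy as np
from sklearn.linear_model import LogisticRegression

def should_recruit(model, x, low=0.4, high=0.6):
    p = model.predict_proba(x.reshape(1, -1))[0, 1]
    return low <= p <= high               # near 0.5 = most informative

# After each recruited patient's outcome arrives, refit on the data so far:
# model = LogisticRegression().fit(X_recruited, y_recruited)
```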

• [math.ST]Generalized Quantile Treatment Effect: A Flexible Bayesian Approach Using Quantile Ratio Smoothing
Sergio Venturini, Francesca Dominici, Giovanni Parmigiani
//arxiv.org/abs/1509.01042v1 

We propose a new general approach for estimating the effect of a binary treatment on a continuous and potentially highly skewed response variable, the generalized quantile treatment effect (GQTE). The GQTE is defined as the difference between a function of the quantiles under the two treatment conditions. As such, it represents a generalization over the standard approaches typically used for estimating a treatment effect (i.e., the average treatment effect and the quantile treatment effect) because it allows the comparison of any arbitrary characteristic of the outcome’s distribution under the two treatments. Following Dominici et al. (2005), we assume that a pre-specified transformation of the two quantiles is modeled as a smooth function of the percentiles. This assumption allows us to link the two quantile functions and thus to borrow information from one distribution to the other. The main theoretical contribution we provide is the analytical derivation of a closed form expression for the likelihood of the model. Exploiting this result we propose a novel Bayesian inferential methodology for the GQTE. We show some finite sample properties of our approach through a simulation study which confirms that in some cases it performs better than other nonparametric methods. As an illustration we finally apply our methodology to the 1987 National Medicare Expenditure Survey data to estimate the difference in the single hospitalization medical cost distributions between cases (i.e., subjects affected by smoking attributable diseases) and controls. 
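
An empirical plug-in version of the GQTE definition (not the paper's Bayesian smoothing approach), assuming g = log so the contrast is a log quantile ratio:

```python
# Plug-in GQTE: g(Q_treated(p)) - g(Q_control(p)) on a percentile grid.
import numpy as np

def gqte(treated, control, g=np.log, ps=np.linspace(0.05, 0.95, 19)):
    qt = np.quantile(treated, ps)
    qc = np.quantile(control, ps)
    return ps, g(qt) - g(qc)              # effect at each percentile

rng = np.random.default_rng(0)
ps, effect = gqte(rng.lognormal(1.0, 1.0, 500), rng.lognormal(0.8, 1.0, 500))
```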

• [math.ST]Necessary and Sufficient Conditions for High-Dimensional Posterior Consistency under $g$-Priors
Douglas K. Sparks, Kshitij Khare, Malay Ghosh
//arxiv.org/abs/1509.01060v1 

We examine necessary and sufficient conditions for posterior consistency under $g$-priors, including extensions to hierarchical and empirical Bayesian models. The key features of this article are that we allow the number of regressors to grow at the same rate as the sample size and define posterior consistency under the sup vector norm instead of the more conventional Euclidean norm. We consider in particular the empirical Bayesian model of George and Foster (2000), the hyper-$g$-prior of Liang et al. (2008), and the prior considered by Zellner and Siow (1980). 
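
For readers unfamiliar with the prior family, a brief reminder of Zellner's g-prior in its standard form (my notation; the hierarchical, empirical Bayes, and hyper-g variants discussed in the paper place priors on or estimate $g$):

```latex
% Gaussian linear model y = X\beta + \varepsilon, \varepsilon ~ N(0, \sigma^2 I):
\beta \mid \sigma^2, g \;\sim\; N\!\bigl(0,\; g\,\sigma^2 (X^\top X)^{-1}\bigr),
\qquad p(\sigma^2) \propto 1/\sigma^2 .
```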

• [physics.data-an]Comparing non-nested models in the search for new physics
Sara Algeri, Jan Conrad, David A. van Dyk
//arxiv.org/abs/1509.01010v1 

Searches for unknown physics and deciding between competing physical models to explain data rely on statistical hypothesis testing. A common approach, used for example in the discovery of the Brout-Englert-Higgs boson, is based on the statistical Likelihood Ratio Test (LRT) and its asymptotic properties. In the common situation when neither of the two models under comparison is a special case of the other, i.e., when the hypotheses are non-nested, this test is not applicable, and so far no efficient solution exists. In physics, this problem occurs when two models that reside in different parameter spaces are to be compared. An important example is the recently reported excess emission in astrophysical $\gamma$-rays and the question whether its origin is known astrophysics or dark matter. We develop and study a new, generally applicable, frequentist method and validate its statistical properties using a suite of simulation studies. We exemplify it on realistic simulated data of the Fermi-LAT $\gamma$-ray satellite, where non-nested hypothesis testing appears in the search for particle dark matter.

• [stat.AP]Extreme Value Theory for Time Series using Peak-Over-Threshold method
Gianluca Rosso
//arxiv.org/abs/1509.01051v1 

This brief paper summarizes the opportunities offered by the Peak-Over-Threshold method for the analysis of extremes. An appropriate Value at Risk can be identified by fitting the data with a Generalized Pareto Distribution. An estimate of the Expected Shortfall can also be useful, and these few concepts apply to a very wide range of risk analyses, from financial applications to operational risk assessment, through the analysis of climate time series, resolving the problem of borderline data.
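
A sketch of the workflow with scipy, assuming Student-t stand-in losses, a 95th-percentile threshold choice, and the standard GPD tail formulas for VaR and Expected Shortfall (McNeil-style; valid for shape $\xi < 1$):

```python
# Peaks-over-threshold: fit a GPD to exceedances, then read off tail risk.
import numpy as np
from scipy.stats import genpareto

x = np.random.default_rng(0).standard_t(df=4, size=5000)  # stand-in losses
u = np.quantile(x, 0.95)                                  # threshold choice
exc = x[x > u] - u                                        # exceedances over u
xi, _, beta = genpareto.fit(exc, floc=0)                  # shape, scale

p = 0.99                                                  # target level
zeta = exc.size / x.size                                  # empirical P(X > u)
var = u + beta / xi * (((1 - p) / zeta) ** (-xi) - 1)     # Value at Risk
es = var / (1 - xi) + (beta - xi * u) / (1 - xi)          # Expected Shortfall
```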

• [stat.ME]A novel principal component analysis for spatially-misaligned multivariate air pollution data
Roman A. Jandarov, Lianne A. Sheppard, Paul D. Sampson, Adam A. Szpiro
//arxiv.org/abs/1509.01171v1 

We propose novel methods for predictive (sparse) PCA with spatially misaligned data. These methods identify principal component loading vectors that explain as much variability in the observed data as possible, while also ensuring the corresponding principal component scores can be predicted accurately by means of spatial statistics at locations where air pollution measurements are not available. This will make it possible to identify important mixtures of air pollutants and to quantify their health effects in cohort studies, where currently available methods cannot be used. We demonstrate the utility of predictive (sparse) PCA in simulated data and apply the approach to annual averages of particulate matter speciation data from national Environmental Protection Agency (EPA) regulatory monitors. 

• [stat.ME]PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data
Amanda F. Mejia, Mary Beth Nebel, Ani Eloyan, Brian Caffo, Martin A. Lindquist
//arxiv.org/abs/1509.00882v1 

Outlier detection for high-dimensional data is a popular topic in modern statistical research. However, one source of high-dimensional data that has received relatively little attention is functional magnetic resonance images (fMRI), which consist of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing—primarily through large, publicly available grassroots datasets consisting of resting-state fMRI data—automated quality control and outlier detection methods are greatly needed. We propose PCA leverage and demonstrate how it can be used to identify outlying time points in an fMRI scan. Furthermore, PCA leverage is a measure of the influence of each observation on the estimation of principal components, which forms the basis of independent component analysis (ICA) and seed connectivity, two of the most widely used methods for analyzing resting-state fMRI data. We also propose an alternative measure, PCA robust distance, which is less sensitive to outliers and has controllable statistical properties. The proposed methods are validated through simulation studies and are shown to be highly accurate. We also conduct a reliability study using resting-state fMRI data from the Autism Brain Imaging Data Exchange (ABIDE) and find that removal of outliers using the proposed methods results in more reliable estimation of subject-level resting-state networks using ICA.
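
A compact sketch of the leverage computation, assuming time points are the observations and an illustrative flagging threshold (the paper's robust-distance variant and its calibrated thresholds are not shown):

```python
# PCA leverage for a T x V data matrix: the leverage of time point t is the
# t-th diagonal entry of the hat matrix of the retained PC scores, i.e. a
# squared row norm of U from the SVD.
import numpy as np

def pca_leverage(Y, k):
    Yc = Y - Y.mean(axis=0)               # center each voxel's time series
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    return np.sum(U[:, :k] ** 2, axis=1)  # leverage of each time point

Y = np.random.default_rng(0).standard_normal((200, 5000))  # stand-in scan
lev = pca_leverage(Y, k=20)
outliers = lev > 3 * np.median(lev)       # e.g., flag high-leverage volumes
```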

• [stat.ML]Community Detection in Networks with Node Features
Yuan Zhang, Elizaveta Levina, Ji Zhu
//arxiv.org/abs/1509.01173v1 

Many methods have been proposed for community detection in networks, but most of them do not take into account additional information on the nodes that is often available in practice. In this paper, we propose a new joint community detection criterion that uses both the network edge information and the node features to detect community structures. One advantage our method has over existing joint detection approaches is the flexibility of learning the impact of different features which may differ across communities. Another advantage is the flexibility of choosing the amount of influence the feature information has on communities. The method is asymptotically consistent under the block model with additional assumptions on the feature distributions, and performs well on simulated and real networks. 

• [stat.ML]Semi-described and semi-supervised learning with Gaussian processes
Andreas Damianou, Neil D. Lawrence
//arxiv.org/abs/1509.01168v1 

Propagating input uncertainty through non-linear Gaussian process (GP) mappings is intractable. This hinders the task of training GPs using uncertain and partially observed inputs. In this paper we refer to this task as “semi-described learning”. We then introduce a GP framework that solves both, the semi-described and the semi-supervised learning problems (where missing values occur in the outputs). Auto-regressive state space simulation is also recognised as a special case of semi-described learning. To achieve our goal we develop variational methods for handling semi-described inputs in GPs, and couple them with algorithms that allow for imputing the missing values while treating the uncertainty in a principled, Bayesian manner. Extensive experiments on simulated and real-world data study the problems of iterative forecasting and regression/classification with missing values. The results suggest that the principled propagation of uncertainty stemming from our framework can significantly improve performance in these tasks.
