"topic";"title";"abstract" "clustering";"City-Scale Map Creation and Updating using GPS Collections";"Applications such as autonomous driving or real-time route recommendations require up-to-date and accurate digital maps. However, manually creating and updating such maps is too costly to meet the rising demands. As large collections of GPS trajectories become widely available, constructing and updating maps using such trajectory collections can greatly reduce the cost of such maps. Unfortunately, due to GPS noise and varying trajectory sampling rates, inferring maps from GPS trajectories can be very challenging. In this paper, we present a framework to create up-to-date maps with rich knowledge from GPS trajectory collections. Starting from an unstructured GPS point cloud, we discover road segments using novel graph-based clustering techniques with prior knowledge on road design. Based on road segments, we develop a scale- and orientation-invariant traj-SIFT feature to localize and recognize junctions using a supervised learning framework. Maps with rich knowledge are created based on discovered road segments and junctions. Compared to state-of-the-art methods, our approach can efficiently construct high-quality maps at city scales from large collections of GPS trajectories." "clustering";"A Text Clustering Algorithm Using an Online Clustering Scheme for Initialization";"In this paper, we propose a text clustering algorithm using an online clustering scheme for initialization called FGSD-MM+. FGSDMM+ assumes that there are at most Kmax clusters in the corpus, and regards these Kmax potential clusters as one large potential cluster at the beginning. During initialization, FGSDMM+ processes the documents one by one in an online clustering scheme. The first document will choose the potential cluster, and FGSDMM+ will create a new cluster to store this document. Later documents will choose one of the non-empty clusters or the potential cluster with probabilities derived from the Dirichlet multinomial mixture model. Each time a document chooses the potential cluster, FGSDMM+ will create a new cluster to store that document and decrease the probability of later documents choosing the potential cluster. After initialization, FGSDMM+ will run a collapsed Gibbs sampling algorithm several times to obtain the final clustering result. Our extensive experimental study shows that FGSDMM+ can achieve better performance than three other clustering methods on both short and long text datasets." "clustering";"Batch model for batched timestamps data analysis with application to the SSA disability program";"The Office of Disability Adjudication and Review (ODAR) is responsible for holding hearings, issuing decisions, and reviewing appeals as part of the Social Security Administration’s disability determining process. In order to control and process cases, the ODAR has established a Case Processing and Management System (CPMS) to record management information since December 2003. The CPMS provides a detailed case status history for each case. Due to the large number of appeal requests and limited resources, the number of pending claims at ODAR was over one million cases by March 31, 2015. Our National Institutes of Health (NIH) team collaborated with SSA and developed a Case Status Change Model (CSCM) project to meet the ODAR’s urgent need of reducing backlogs and improve hearings and appeals process. 
One of the key issues in our CSCM project is to estimate the expected service time and its variation for each case status code. The challenge is that the system-recorded job departure times may not be the true job finish times. As the CPMS timestamps data of case status codes showed apparent batch patterns, we proposed a batch model and applied the constrained least squares method to estimate the mean service times and the variances. We also proposed a batch search algorithm to determine the optimal batch partition, as no batch partition was given in the real data. Simulation studies were conducted to evaluate the performance of the proposed methods. Finally, we applied the method to analyze real CPMS data from ODAR/SSA." "clustering";"Efficient Frequent Directions Algorithm for Sparse Matrices";"This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ R^(n×d) one by one in the streaming setting and compute a small sketch B ∈ R^(ℓ×d). Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndℓ) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)ℓ + nℓ^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data." "clustering";"AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets";"The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, its O(n^2) time complexity still remains a severe weakness. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters for reducing the label propagation time. Moreover, instead of passively performing the range query for all objects like existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still guaranteeing the exact final result of DBSCAN. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants on very large real and synthetic complex datasets." "clustering";"Infinite Ensemble for Image Clustering";"Image clustering has been a critical preprocessing step for vision tasks, e.g., visual concept discovery, content-based image retrieval. Conventional image clustering methods use handcraft visual descriptors as basic features via K-means, or build the graph within spectral clustering. Recently, representation learning with deep structure shows appealing performance in unsupervised feature pre-treatment.
However, few studies have discussed how to deploy deep representation learning to image clustering problems, especially the unified framework which integrates both representation learning and ensemble clustering for efficient image clustering still remains void. In addition, even though it is widely recognized that with the increasing number of basic partitions, ensemble clustering gets better performance and lower variances, the best number of basic partitions for a given data set is a pending problem. In light of this, we propose the Infinite Ensemble Clustering (IEC), which incorporates the power of deep representation and ensemble clustering in a one-step framework to fuse infinite basic partitions. Generally speaking, a set of basic partitions is firstly generated from the image data. Then by converting the basic partitions to the 1-of-K codings, we link the marginalized auto-encoder to the infinite ensemble clustering with i.i.d. basic partitions, which can be approached by the closed-form solutions. Finally we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate both the time efficiency and superior performance of IEC com-pared to the state-of-the-art ensemble clustering and deep clustering methods." "clustering";"Data-driven Automatic Treatment Regimen Development and Recommendation";"The analysis of large-scale Electrical Medical Records (EMRs) has the potential to develop and optimize clinical treatment regimens. A treatment regimen usually includes a series of doctor orders containing rich temporal and heterogeneous information. However, in many existing studies, a doctor order is simplified as an event code and a treatment record is simplified as a code sequence. Thus, the information inherent in doctor orders is not fully used for in-depth analysis. In this paper, we aim at exploiting the rich information in doctor orders and developing data-driven approaches for improving clinical treatments. To this end, we first propose a novel method to measure the similarities between treatment records with consideration of sequential and multifaceted information in doctor orders. Then, we propose an efficient density-based clustering algorithm to summarize large-scale treatment records, and extract a semantic representation of each treatment cluster. Finally, we develop a unified frame-work to evaluate the discovered treatment regimens, and find the most effective treatment regimen for new patients. In the empirical study, we validate our methods with EMRs of 27,678 patients from 14 hospitals. The results show that: 1) Our method can successfully extract typical treatment regimens from large-scale treatment records. The extracted treatment regimens are intuitive and provide managerial implications for treatment regimen design and optimization. 2) By recommending the most effective treatment regimens, the total cure rate in our data improves from 19.89% to 21.28%, and the effective rate increases up to 98.29%." "clustering";"Structured Doubly Stochastic Matrix for Graph Based Clustering";"As one of the most significant machine learning topics, clustering has been extensively employed in various kinds of area. Its prevalent application in scientific research as well as industrial practice has drawn high attention in this day and age. 
A multitude of clustering methods have been developed, among which the graph-based clustering method using the affinity matrix has received great emphasis. Recent research work used the doubly stochastic matrix to normalize the input affinity matrix and enhance the graph-based clustering models. Although the doubly stochastic matrix can improve the clustering performance, the clustering structure in the doubly stochastic matrix is not as clear as expected. Thus, a post-processing step is required to extract the final clustering results, which may not be optimal. To address this problem, in this paper, we propose a novel convex model to learn the structured doubly stochastic matrix by imposing a low-rank constraint on the graph Laplacian matrix. Our new structured doubly stochastic matrix can explicitly uncover the clustering structure and encode the probabilities of pair-wise data points to be connected, such that the clustering results are enhanced. An efficient optimization algorithm is derived to solve our new objective. Also, we provide theoretical discussions that, when the input differs, our method possesses interesting connections with K-means and spectral graph cut models respectively. We conduct experiments on both synthetic and benchmark datasets to validate the performance of our proposed method. The empirical results demonstrate that our model provides an approach to better solving the K-means clustering problem. By using the cluster indicator provided by our model as initialization, K-means converges to a smaller objective function value with better clustering performance. Moreover, we compare the clustering performance of our model with spectral clustering and the related doubly stochastic model. On all datasets, our method performs equally or better than the related methods." "frequent-pattern-mining";"DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks";"In this paper, we investigate the use of recurrent neural networks (RNNs) in the context of search-based online advertising. We use RNNs to map both queries and ads to real-valued vectors, with which the relevance of a given (query, ad) pair can be easily computed. On top of the RNN, we propose a novel attention network, which learns to assign attention scores to different word locations according to their intent importance (hence the name DeepIntent). The vector output of a sequence is thus computed by a weighted sum of the hidden states of the RNN at each word according to their attention scores. We perform end-to-end training of both the RNN and attention network under the guidance of user click logs, which are sampled from a commercial search engine. We show that in most cases the attention network improves the quality of learned vector representations, evaluated by AUC on a manually labeled dataset. Moreover, we highlight the effectiveness of the learned attention scores from two aspects: query rewriting and a modified BM25 metric. We show that using the learned attention scores, one is able to produce sub-queries of better quality than those of the state-of-the-art methods. Also, by modifying the term frequency with the attention scores in a standard BM25 formula, one is able to improve its performance evaluated by AUC." "frequent-pattern-mining";"Annealed Sparsity via Adaptive and Dynamic Shrinking";"Sparse learning has received a tremendous amount of interest in high-dimensional data analysis due to its model interpretability and low computational cost.
Among the various techniques, adaptive l1-regularization is an effective framework to improve the convergence behaviour of the LASSO, by using varying strength of regularization across different features. In the meantime, the adaptive structure makes it very powerful in modelling grouped sparsity patterns as well, being particularly useful in high-dimensional multi-task problems. However, choosing an appropriate, global regularization weight is still an open problem. In this paper, inspired by the annealing technique in material science, we propose to achieve “annealed sparsity” by designing a dynamic shrinking scheme that simultaneously optimizes the regularization weights and model coefficients in sparse (multi-task) learning. The dynamic structures of our algorithm are twofold. Feature-wise (“spatially”), the regularization weights are updated interactively with model coefficients, allowing us to improve the global regularization structure. Iteration-wise (“temporally”), such interaction is coupled with gradually boosted l1-regularization by adjusting an equality norm-constraint, achieving an “annealing” effect to further improve model selection. This renders interesting shrinking behaviour in the whole solution path. Our method competes favorably with state-of-the-art methods in sparse (multi-task) learning. We also apply it in expression quantitative trait loci analysis (eQTL), which gives useful biological insights in a human cancer (melanoma) study." "frequent-pattern-mining";"Multi-Task Feature Interaction Learning";"Linear models are widely used in various data mining and machine learning algorithms. One major limitation of such models is the lack of capability to capture predictive information from interactions between features. While introducing high-order feature interaction terms can overcome this limitation, this approach dramatically increases the model complexity and imposes significant challenges in learning against overfitting. When there are multiple related learning tasks, feature interactions from these tasks are usually related, and modeling such relatedness is the key to improving their generalization. In this paper, we propose a novel Multi-Task feature Interaction Learning (MTIL) framework to exploit the task relatedness from high-order feature interactions. Specifically, we collectively represent the feature interactions from multiple tasks as a tensor, and prior knowledge of task relatedness can be incorporated into different structured regularizations on this tensor. We formulate two concrete approaches under this framework, namely the shared interaction approach and the embedded interaction approach. The former assumes tasks share the same set of interactions, and the latter assumes feature interactions from multiple tasks share a common subspace. We have provided efficient algorithms for solving the two formulations. Extensive empirical studies on both synthetic and real datasets have demonstrated the effectiveness of the proposed framework." "frequent-pattern-mining";"Analyzing Volleyball Match Data from the 2014 World Championships Using Machine Learning Techniques";"This paper proposes a relational-learning based approach for discovering strategies in volleyball matches based on optical tracking data. In contrast to most existing methods, our approach permits discovering patterns that account for both spatial (that is, partial configurations of the players on the court) and temporal (that is, the order of events and positions) aspects of the game.
We analyze both the men’s and women’s final match from the 2014 FIVB Volleyball World Championships, and are able to identify several interesting and relevant strategies from the matches." "frequent-pattern-mining";"Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data";"Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that creates repeatedly more complex strings from simpler substrings. We propose a framework, referred to as Lexis, that produces an optimized hierarchical representation of a given set of “target” strings. The resulting hierarchy, “Lexis-DAG”, shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the “core” of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents." "frequent-pattern-mining";"Just One More: Modeling Binge Watching Behavior";"Easy accessibility can often lead to over-consumption, as seen in food and alcohol habits. On video on-demand (VOD) services, this has recently been referred to as “binge watching”, where potentially entire seasons of TV shows are consumed in a single viewing session. While a user viewership model may reveal this binging behavior, creating an accurate model has several challenges, including censored data, deviations in the population, and the need to consider external influences on consumption habits. In this paper, we introduce a novel statistical mixture model that incorporates these factors and presents a “first of its kind” characterization of viewer consumption behavior using a real-world dataset that includes playback data from a VOD service. From our modeling, we tackle various predictive tasks to infer the consumption decisions of a user in a viewing session, including estimating the number of episodes they watch and classifying if they continue watching another episode. Using these insights, we then identify binge watching sessions based on deviation from normal viewing behavior. We observe different types of binging behavior, that binge watchers often view certain content out-of-order, and that binge watching is not a consistent behavior among our users. These insights and our findings have application in VOD revenue generation, consumer health applications, and customer retention analysis." "frequent-pattern-mining";"Online Feature Selection: A Limited-Memory Substitution Algorithm and its Asynchronous Parallel Vari";"This paper considers the feature selection scenario where only a few features are accessible at any time point. For example, features are generated sequentially and visible one by one. 
Therefore, one has to make an online decision to identify key features after all features are only scanned once or twice. The optimization based approach is a powerful tool for the online feature selection." "frequent-pattern-mining";"Generalized Hierarchical Sparse Model for Arbitrary-Order Interactive Antigenic Sites Identification";"Recent statistical evidence has shown that a regression model by incorporating the interactions among the original covariates (features) can significantly improve the interpretability for biological data. One major challenge is the exponentially expanded feature space when adding high-order feature interactions to the model. To tackle the huge dimensionality, Hierarchical Sparse Models (HSM) are developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a Generalized Hierarchical Sparse Model (GHSM) as a generalization of the HSM models to learn arbitrary-order inter-actions. The GHSM applies the l1 penalty to all the model coefficients under a constraint that given any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex with a challenge lying in the coupled variables appearing in the arbitrary-order hierarchical constraints and we devise an efficient optimization algorithm to directly solve it. Specifically, we decouple the variables in the constraints via both the GIST and ADMM methods into three subproblems, each of which is proved to admit an efficiently analytical solution. We evaluate the GHSM method in both synthetic problem and the antigenic sites identification problem for the flu virus data, where we expand the feature space up to the 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed method and the learned high-order interactions have meaningful synergistic covariate patterns in the virus antigenicity." "frequent-pattern-mining";"Predict Risk of Relapse for Patients with Multiple Stages of Treatment of Depression";"Depression is a serious mood disorder afflicting millions of people around the globe. Medications of different types and with different effects on neural activity have been developed for its treatments during the past few decades. Due to the heterogeneity of the disorder, many patients cannot achieve symptomatic remission from a single clinical trial. Instead they need multiple clinical trials to achieve remission, resulting in a multiple stage treatment pattern. Furthermore those who indeed achieve symptom remission are still faced with substantial risk of relapse. One promising approach to predicting the risk of relapse is censored regression. Traditional censored regression typically applies only to situations in which the exact time of event of interest is known. How-ever, follow-up studies that track the patients’ relapse status can only provide an interval of time during which relapse occurs. The exact time of relapse is usually unknown. In this paper, we present a censored regression approach with a truncated l1 loss function that can handle the uncertainty of relapse time. 
Based on this general loss function, we develop a gradient boosting algorithm and a stochastic dual coordinate ascent algorithm when the hypothesis in the loss function is represented as (1) an ensemble of decision trees and (2) a linear combination of covariates, respectively. As an extension of our linear model, a multi-stage linear approach is further proposed to harness the data collected from multiple stages of treatment. We evaluate the proposed algorithms using a real-world clinical trial dataset. Results show that our methods outperform the well-known Cox proportional hazards model. In addition, the risk factors identified by our multi-stage linear model not only corroborate findings from recent research but also yield some new insights into how to develop effective measures for prevention of relapse among patients after their initial remission from the acute treatment stage." "frequent-pattern-mining";"A Closed-Loop Approach in Data-Driven Resource Allocation to Improve Network User Experience";"Machine learning methods have been widely used in modeling and predicting network user experience. In this paper, moving beyond user experience prediction, we propose a closed-loop approach that uses data-generated prediction models to explicitly guide resource allocation for user experience improvement. The closed-loop approach leverages and verifies the causal relation that often exists between certain feature values (e.g., bandwidth) and user experience in computer networks. The approach consists of three components: we train a neural network classifier to predict user experience, utilize the trained neural network classifier as the objective function to allocate network resources, and then evaluate user experience with the allocated resources to (in)validate and adjust the original model. Specifically, we propose a dual decomposition algorithm to solve the neural network-based resource optimization problem, which is complex and non-convex. We further develop an iterative mechanism for classifier optimization. Numerical results show that the dual algorithm reduces the expected number of unsatisfied users by up to 2x compared with the baseline, and the optimized classifier further improves the performance by 50%." "frequent-pattern-mining";"Towards Robust and Versatile Causal Discovery for Business Applications";"Causal discovery algorithms can induce some of the causal relations from the data, commonly in the form of a causal network such as a causal Bayesian network. Arguably, however, all such algorithms lag far behind what is necessary for a true business application. We develop an initial version of a new, general causal discovery algorithm called ETIO with many features suitable for business applications. These include (a) ability to accept prior causal knowledge (e.g., taking senior driving courses improves driving skills), (b) admitting the presence of latent confounding factors, (c) admitting the possibility of (a certain type of) selection bias in the data (e.g., clients sampled mostly from a given region), (d) ability to analyze data with missing-by-design (i.e., not planned to measure) values (e.g., if two companies merge and their databases measure different attributes), and (e) ability to analyze data from different interventions (e.g., prior and posterior to an advertisement campaign).
ETIO is an instance of the logical approach to integrative causal discovery that has been introduced relatively recently and enables the solution of complex reverse-engineering problems in causal discovery. ETIO is compared against the state-of-the-art and is shown to be more effective in terms of speed, with only a slight degradation in terms of learning accuracy, while incorporating all the features above. The code is available on the mensxmachina.org website." "frequent-pattern-mining";"Interpretable Decision Sets: A Joint Framework for Description and Prediction";"One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model’s prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems." "frequent-pattern-mining";"Causal Clustering for 1-Factor Measurement Models";"Many scientific research programs aim to learn the causal structure of real-world phenomena. This learning problem is made more difficult when the target of study cannot be directly observed. One strategy commonly used by social scientists is to create measurable “indicator” variables that covary with the latent variables of interest. Before leveraging the indicator variables to learn about the latent variables, however, one needs a measurement model of the causal relations between the indicators and their corresponding latents. These measurement models are a special class of Bayesian networks. This paper addresses the problem of reliably inferring measurement models from measured indicators, without prior knowledge of the causal relations or the number of latent variables. We present a provably correct novel algorithm, FindOneFactorClusters (FOFC), for solving this inference problem. Compared to other state-of-the-art algorithms, FOFC is faster, scales to larger sets of indicators, and is more reliable at small sample sizes. We also present the first correctness proofs for this problem that do not assume linearity or acyclicity among the latent variables." "frequent-pattern-mining";"Efficient Frequent Directions Algorithm for Sparse Matrices";"This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ R^(n×d) one by one in the streaming setting and compute a small sketch B ∈ R^(ℓ×d). Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndℓ) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)ℓ + nℓ^2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data." "frequent-pattern-mining";"Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations";"Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data.
Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but it is also often effective, because the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e, projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user’s prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time-series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections." "frequent-pattern-mining";"Robust and Effective Metric Learning Using Capped Trace Norm";"Metric learning aims at automatically learning a metric from pair or triplet based constraints in data, and it can be potentially beneficial whenever the notion of metric between instances plays a nontrivial role. In Mahalanobis distance metric learning, distance matrix M is in symmetric positive semi-definite cone, and in order to avoid overfitting and to learn a better Mahalanobis distance from weakly supervised constraints, the low-rank regularization has been often imposed on matrix M to learn the correlations between features and samples. As the approximations of the rank minimization function, the trace norm and Fantope have been utilized to regularize the metric learning objectives and achieve good performance. However, these low-rank regularization models are either not tight enough to approximate rank minimization or time-consuming to tune an optimal rank. In this paper, we introduce a novel metric learning model using the capped trace norm based regularization, which uses a singular value threshold to constraint the metric matrix M as low-rank explicitly such that the rank of matrix M is stable when the large singular values vary. The capped trace norm regularization can also be viewed as the adaptive Fantope regularization. We minimize singular values which are less than threshold value and the rank of M is not necessary to be k, thus our method is more stable and applicable in practice when we do not know the optimal rank of matrix M. We derive an efficient optimization algorithm to solve the proposed new model and the algorithm convergence proof is also provided in this paper. We evaluate our method on a variety of challenging benchmarks, such as LFW and Pubfig datasets. Face verification experiments are performed and results show that our method consistently outperforms the state-of-the-art metric learning algorithms." "frequent-pattern-mining";"Inferring Network Effects from Observational Data";"We present Relational Covariate Adjustment (RCA), a general method for estimating causal effects in relational data. Relational Covariate Adjustment is implemented through two high-level operations: identification of an adjustment set and relational regression adjustment. The former is achieved through an extension of Pearl’s back-door criterion to relational domains. 
We demonstrate how this extended definition can be used to estimate causal effects in the presence of network interference and confounding. RCA is agnostic to functional form, and it can easily model both discrete and continuous treatments as well as estimate the effects of a wider array of network interventions than existing experimental approaches. We show that RCA can yield robust estimates of causal effects using common regression models without extensive parameter tuning. Through a series of simulation experiments on a variety of synthetic and real-world network structures, we show that causal effects estimated on observational data with RCA are nearly as accurate as those estimated from well-designed network experiments." "outlier-and-anomaly-detection";"Assessing Human Error Against a Benchmark of Perfection";"An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors." "outlier-and-anomaly-detection";"Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems";"In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environmental, or human factors, in order to identify operationally significant events and highlight potential safety risks. For this purpose, we propose a framework which represents each flight using a semi-Markov switching vector autoregressive (SMS-VAR) model. Detection of anomalies is then based on measuring dissimilarities between the model’s prediction and data observation. The framework is scalable, due to the inherent parallel nature of most computations, and can be used to perform online anomaly detection. Extensive experimental results on simulated and real datasets illustrate that the framework can detect various types of anomalies along with the key parameters involved." "outlier-and-anomaly-detection";"Catch Me If You Can: Detecting Pickpocket Suspects from Large-Scale Transit Records";"Massive data collected by automated fare collection (AFC) systems provide opportunities for studying both personal traveling behaviors and collective mobility patterns in the urban area. Existing studies on the AFC data have primarily focused on identifying passengers’ movement patterns. In this paper, however, we creatively leveraged such data for identifying thieves in the public transit systems. Indeed, stopping pickpockets in the public transit systems has been critical for improving passenger satisfaction and public safety. However, it is challenging to tell thieves from regular passengers in practice. To this end, we developed a suspect detection and surveillance system, which can identify pickpocket suspects based on their daily transit records. Specifically, we first extracted a number of features from each passenger’s daily activities in the transit systems. Then, we took a two-step approach that exploits the strengths of unsupervised outlier detection and supervised classification models to identify thieves, who exhibit abnormal traveling behaviors.
Experimental results demonstrated the effective-ness of our method. We also developed a prototype system with a user-friendly interface for the security personnel." "outlier-and-anomaly-detection";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide sup-porting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "dimensionality-reduction";"Accelerating Online CP Decompositions for Higher Order Tensors";"Tensors are a natural representation for multidimensional data. In recent years, CANDECOMP/PARAFAC (CP) decomposition, one of the most popular tools for analyzing multi-way data, has been extensively studied and widely applied. However, today’s datasets are often dynamically changing over time. Tracking the CP decomposition for such dynamic tensors is a crucial but challenging task, due to the large scale of the tensor and the velocity of new data arriving. Traditional techniques, such as Alternating Least Squares (ALS), cannot be directly applied to this problem because of their poor scalability in terms of time and memory. Additionally, existing online approaches have only partially addressed this problem and can only be deployed on third-order tensors. To fill this gap, we propose an efficient online algorithm that can incrementally track the CP decompositions of dynamic tensors with an arbitrary number of dimensions. In terms of effectiveness, our algorithm demonstrates comparable results with the most accurate algorithm, ALS, whilst being computationally much more efficient. Specifically, on small and moderate datasets, our approach is tens to hundreds of times faster than ALS, while for large-scale datasets, the speedup can be more than 3,000 times. Compared to other state-of-the-art online approaches, our method shows not only significantly better decomposition quality, but also better performance in terms of stability, efficiency and scalability." "dimensionality-reduction";"Efficient Shift-Invariant Dictionary Learning";"Shift-invariant dictionary learning (SIDL) refers to the problem of discovering a set of latent basis vectors (the dictionary) that captures informative local patterns at different locations of the input sequences, and a sparse coding for each sequence as a linear combination of the latent basis elements. 
It differs from conventional dictionary learning and sparse coding where the latent basis has the same dimension as the input vectors, where the focus is on global patterns instead of shift-invariant local patterns. Unsupervised discovery of shift-invariant dictionary and the corresponding sparse coding has been an open challenge as the number of candidate local patterns is extremely large, and the number of possible linear combinations of such local patterns is even more so. In this paper we propose a new framework for unsupervised discovery of both the shift-invariant basis and the sparse coding of input data, with efficient algorithms for tractable optimization. Empirical evaluations on multiple time series data sets demonstrate the effectiveness and efficiency of the proposed method." "dimensionality-reduction";"Targeted Topic Modeling for Focused Analysis";"One of the overarching tasks of document analysis is to find what topics people talk about. One of the main techniques for this purpose is topic modeling. So far many models have been proposed. However, the existing models typically per-form full analysis on the whole data to find all topics. This is certainly useful, but in practice we found that the user almost always also wants to perform more detailed analyses on some specific aspects, which we refer to as targets (or targeted aspects). Current full-analysis models are not suitable for such analyses as their generated topics are often too coarse and may not even be on target. For example, given a set of tweets about e-cigarette, one may want to find out what topics under discussion are specifically related to children. Likewise, given a collection of online reviews about a camera, a consumer or camera manufacturer may be interested in finding out all topics about the camera’s screen, the targeted aspect. As we will see in our experiments, current full topic models are ineffective for such targeted analyses. This paper studies this problem and proposes a novel targeted topic model (TTM) to enable focused analyses on any specific aspect of interest. Our experimental results demonstrate the effectiveness of the TTM." "dimensionality-reduction";"Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization";"Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. A key to success for a bike sharing systems is the effectiveness of rebalancing operations, that is, the effort-s of restoring the number of bikes in each station to its target value by routing vehicles through pick-up and drop-off operations. There are two major issues for this bike rebalancing problem: the determination of station inventory target level and the large scale multiple capacitated vehicle routing optimization with outlier stations. The key challenges include demand prediction accuracy for inventory target level determination, and an effective optimizer for vehicle routing with hundreds of stations. To this end, in this paper, we develop a Meteorology Similarity Weighted K-Nearest-Neighbor (M-SWK) regressor to predict the station pick-up demand based on large-scale historic trip records. Based on further analysis on the station network constructed by station-station connections and the trip duration, we propose an inter station bike transition (ISBT) model to predict the station drop-off demand. 
Then, we provide a mixed integer nonlinear programming (MINLP) formulation of multiple capacitated bike routing problem with the objective of minimizing total travel distance. To solve it, we propose an Adaptive Capacity Constrained K-centers Clustering (AdaCCKC) algorithm to separate outlier stations (the demands of these stations are very large and make the optimization infeasible) and group the rest stations into clusters within which one vehicle is scheduled to redistribute bikes between stations. In this way, the large scale multiple vehicle routing problem is reduced to inner cluster one vehicle routing problem with guaranteed feasible solutions. Finally, the extensive experimental results on the NYC Citi Bike system show the advantages of our approach for bike demand prediction and large-scale bike rebalancing optimization." "dimensionality-reduction";"Data-driven Automatic Treatment Regimen Development and Recommendation";"The analysis of large-scale Electrical Medical Records (EMRs) has the potential to develop and optimize clinical treatment regimens. A treatment regimen usually includes a series of doctor orders containing rich temporal and heterogeneous information. However, in many existing studies, a doctor order is simplified as an event code and a treatment record is simplified as a code sequence. Thus, the information inherent in doctor orders is not fully used for in-depth analysis. In this paper, we aim at exploiting the rich information in doctor orders and developing data-driven approaches for improving clinical treatments. To this end, we first propose a novel method to measure the similarities between treatment records with consideration of sequential and multifaceted information in doctor orders. Then, we propose an efficient density-based clustering algorithm to summarize large-scale treatment records, and extract a semantic representation of each treatment cluster. Finally, we develop a unified frame-work to evaluate the discovered treatment regimens, and find the most effective treatment regimen for new patients. In the empirical study, we validate our methods with EMRs of 27,678 patients from 14 hospitals. The results show that: 1) Our method can successfully extract typical treatment regimens from large-scale treatment records. The extracted treatment regimens are intuitive and provide managerial implications for treatment regimen design and optimization. 2) By recommending the most effective treatment regimens, the total cure rate in our data improves from 19.89% to 21.28%, and the effective rate increases up to 98.29%." "dimensionality-reduction";"Overcoming key weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilari";"This paper introduces the first generic version of data dependent dissimilarity and shows that it provides a better closest match than distance measures for three existing algorithms in clustering, anomaly detection and multi-label classification. For each algorithm, we show that by simply replacing the distance measure with the data dependent dissimilarity measure, it overcomes a key weakness of the otherwise unchanged algorithm." "dimensionality-reduction";"Dynamic Clustering of Streaming Short Documents";"Clustering technology has found numerous applications in mining textual data. 
It was shown to enhance the performance of retrieval systems in various different ways, such as identifying different query aspects in search result diversification, improving smoothing in the context of language modeling, matching queries with documents in a latent topic space in ad-hoc retrieval, summarizing documents etc. The vast majority of clustering methods have been developed under the assumption of a static corpus of long (and hence textually rich) documents. Little attention has been given to streaming corpora of short text, which is the predominant type of data in Web 2.0 applications, such as social media, forums, and blogs. In this paper, we consider the problem of dynamically clustering a streaming corpus of short documents. The short length of documents makes the inference of the latent topic distribution challenging, while the temporal dynamics of streams allow topic distributions to change over time. To tackle these two challenges we propose a new dynamic clustering topic model - DCT - that enables tracking the time-varying distributions of topics over documents and words over topics. DCT models temporal dynamics by a short-term or long-term dependency model over sequential data, and overcomes the difficulty of handling short text by assigning a single topic to each short document and using the distributions inferred at a certain point in time as priors for the next inference, allowing the aggregation of information. At the same time, taking a Bayesian approach allows evidence obtained from new streaming documents to change the topic distribution. Our experimental results demonstrate that the proposed clustering algorithm outperforms state-of-the-art dynamic and non-dynamic clustering topic models in terms of perplexity and when integrated in a cluster-based query likelihood model it also outperforms state-of-the-art models in terms of retrieval quality." "dimensionality-reduction";"Continuous Experience-aware Language Model";"Online review communities are dynamic as users join and leave, adopt new vocabulary, and adapt to evolving trends. Recent work has shown that recommender systems benefit from explicit consideration of user experience. However, prior work assumes a fixed number of discrete experience levels, whereas in reality users gain experience and mature continuously over time." "dimensionality-reduction";"Structured Doubly Stochastic Matrix for Graph Based Clustering";"As one of the most significant machine learning topics, clustering has been extensively employed in various kinds of area. Its prevalent application in scientific research as well as industrial practice has drawn high attention in this day and age. A multitude of clustering methods have been developed, among which the graph based clustering method using the affinity matrix has been laid great ephasis on. Recent research work used the doubly stochastic matrix to normalize the input affinity matrix and enhance the graph based clustering models. Although the doubly stochastic matrix can improve the clustering performance, the clustering structure in the doubly stochastic matrix is not clear as expected. Thus, post-processing step is required to extract the final clustering results, which may not be optimal. To address this problem, in this paper, we propose a novel convex model to learn the structured doubly stochastic matrix by imposing low-rank constraint on the graph Laplacian matrix. 
Our new structured doubly stochastic matrix can explicitly uncover the clustering structure and encode the probabilities of pair-wise data points to be connected, such that the clustering results are enhanced. An efficient optimization algorithm is derived to solve our new objective. Also, we provide theoretical discussions that when the input differs, our method possesses interesting connections with K-means and spectral graph cut models respectively. We conduct experiments on both synthetic and bench-mark datasets to validate the performance of our proposed method. The empirical results demonstrate that our model provides an approach to better solving the K-mean clustering problem. By using the cluster indicator provided by our model as initialization, K-means converges to a smaller objective function value with better clustering performance. Moreover, we compare the clustering performance of our model with spectral clustering and related double stochastic model. On all datasets, our method performs equally or better than the related methods." "dimensionality-reduction";"Streaming-LDA: A Copula-based Approach to Modeling Topic Dependencies in Document Streams";"We propose in this paper two new models for modeling topic and word-topic dependencies between consecutive documents in document streams. The first model is a direct extension of Latent Dirichlet Allocation model (LDA) and makes use of a Dirichlet distribution to balance the influence of the LDA prior parameters wrt to topic and word-topic distribution of the previous document. The second extension makes use of copulas, which constitute a generic tools to model dependencies between random variables. We rely here on Archimedean copulas, and more precisely on Franck copulas, as they are symmetric and associative and are thus appropriate for exchangeable random variables. Our experiments, conducted on three standard collections that have been used in several studies on topic modeling, show that our proposals outperform previous ones (as dynamic topic models and temporal LDA), both in terms of perplexity and for tracking similar topics in a document stream." "dimensionality-reduction";"Mining Subgroups with Exceptional Transition Behavior";"We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. 
We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data." "dimensionality-reduction";"Bayesian Inference of Arrival Rate and Substitution Behavior from Sales Transaction Data with Stocko";"When an item goes out of stock, sales transaction data no longer reflect the original customer demand, since some customers leave with no purchase while others substitute alternative products for the one that was out of stock. Here we develop a Bayesian hierarchical model for inferring the underlying customer arrival rate and choice model from sales transaction data and the corresponding stock levels. The model uses a nonhomogeneous Poisson process to allow the arrival rate to vary throughout the day, and allows for a variety of choice models. Model parameters are inferred using a stochastic gradient MCMC algorithm that can scale to large transaction databases. We fit the model to data from a local bakery and show that it is able to make accurate out-of-sample predictions, and to provide actionable insight into lost cookie sales." "dimensionality-reduction";"FUSE: Full Spectral Clustering";"Multi-scale data which contains structures at different scales of size and density is a big challenge for spectral clustering. Even given a suitable locally scaled affinity matrix, the first k eigenvectors of such a matrix still cannot separate clusters well. Thus, in this paper, we exploit the fusion of the cluster-separation information from all eigenvectors to achieve a better clustering result. Our method FUll Spectral ClustEring (FUSE) is based on Power Iteration (PI) and Independent Component Analysis (ICA). PI is used to fuse all eigenvectors to one pseudo-eigenvector which inherits all the cluster-separation information. To conquer the cluster-collision problem, we utilize PI to generate p (p > k) pseudo-eigenvectors. Since these pseudo-eigenvectors are redundant and the cluster-separation information is contaminated with noise, ICA is adopted to rotate the pseudo-eigenvectors to make them pair-wise statistically independent. To let ICA overcome local optima and speed up the search process, we develop a self-adaptive and self-learning greedy search method. Finally, we select k rotated pseudoeigenvectors (independent components) which have more cluster-separation information measured by kurtosis for clustering. Various synthetic and real-world data verifies the effectiveness and efficiency of our FUSE method." "dimensionality-reduction";"Infinite Ensemble for Image Clustering";"Image clustering has been a critical preprocessing step for vision tasks, e.g., visual concept discovery, content-based image retrieval. Conventional image clustering methods use handcraft visual descriptors as basic features via K-means, or build the graph within spectral clustering. Recently, representation learning with deep structure shows appealing performance in unsupervised feature pre-treatment. However, few studies have discussed how to deploy deep representation learning to image clustering problems, especially the unified framework which integrates both representation learning and ensemble clustering for efficient image clustering still remains void. 
In addition, even though it is widely recognized that with the increasing number of basic partitions, ensemble clustering gets better performance and lower variances, the best number of basic partitions for a given data set is a pending problem. In light of this, we propose the Infinite Ensemble Clustering (IEC), which incorporates the power of deep representation and ensemble clustering in a one-step framework to fuse infinite basic partitions. Generally speaking, a set of basic partitions is firstly generated from the image data. Then by converting the basic partitions to the 1-of-K codings, we link the marginalized auto-encoder to the infinite ensemble clustering with i.i.d. basic partitions, which can be approached by the closed-form solutions. Finally we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate both the time efficiency and superior performance of IEC compared to the state-of-the-art ensemble clustering and deep clustering methods." "dimensionality-reduction";"MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories";"In this paper, we study the problem of mining temporally anomalous sub-trajectory patterns from an input trajectory in a scalable manner. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we design a technique called MANTRA. We study the properties unique to anomalous sub-trajectories and utilize them in MANTRA to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets establish MANTRA as more than 3 orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we demonstrate that the proposed model conforms to human intuition." "dimensionality-reduction";"Generalized Hierarchical Sparse Model for Arbitrary-Order Interactive Antigenic Sites Identification";"Recent statistical evidence has shown that a regression model incorporating the interactions among the original covariates (features) can significantly improve the interpretability for biological data. One major challenge is the exponentially expanded feature space when adding high-order feature interactions to the model. To tackle the huge dimensionality, Hierarchical Sparse Models (HSM) are developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a Generalized Hierarchical Sparse Model (GHSM) as a generalization of the HSM models to learn arbitrary-order interactions. The GHSM applies the l1 penalty to all the model coefficients under a constraint that given any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. 
The resulting objective function is non-convex, with a challenge lying in the coupled variables appearing in the arbitrary-order hierarchical constraints, and we devise an efficient optimization algorithm to directly solve it. Specifically, we decouple the variables in the constraints via both the GIST and ADMM methods into three subproblems, each of which is proved to admit an efficient analytical solution. We evaluate the GHSM method on both a synthetic problem and the antigenic sites identification problem for the flu virus data, where we expand the feature space up to the 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed method and the learned high-order interactions have meaningful synergistic covariate patterns in the virus antigenicity." "dimensionality-reduction";"Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns";"We study how to obtain concise descriptions of discrete multivariate sequential data. In particular, how to do so in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the domains (alphabets) of all sequences, allow patterns to overlap temporally, as well as allow for gaps in their occurrences." "dimensionality-reduction";"Hierarchical Incomplete Multi-source Feature Learning for Spatiotemporal Event Forecasting";"Forecasting significant societal events is an interesting and challenging problem as it takes into consideration multiple aspects of a society, including its economics, politics, and culture. Traditional forecasting methods based on a single data source find it hard to cover all these aspects comprehensively, thus limiting model performance. Multi-source event forecasting has proven promising but still suffers from several challenges, including 1) geographical hierarchies in multi-source data features, 2) missing values, and 3) characterization of structured feature sparsity. This paper proposes a novel feature learning model that concurrently addresses all the above challenges. Specifically, given multi-source data from different geographical levels, we design a new forecasting model by characterizing the lower-level features’ dependence on higher-level features. To handle the correlations amidst structured feature sets and deal with missing values among the coupled features, we propose a novel feature learning model based on an Nth-order strong hierarchy and fused-overlapping group Lasso. An efficient algorithm is developed to optimize model parameters and ensure global optima. Extensive experiments on 10 datasets in different domains demonstrate the effectiveness and efficiency of the proposed model." "dimensionality-reduction";"Skinny-dip: Clustering in a Sea of Noise";"Can we find heterogeneous clusters hidden in data sets with 80% noise? Although such settings occur in the real world, we struggle to find methods from the abundance of clustering techniques that perform well with noise at this level. Indeed, perhaps this is enough of a departure from classical clustering to warrant its study as a separate problem. In this paper we present SkinnyDip which, based on Hartigan’s elegant dip test of unimodality, represents an intriguing approach to clustering with an attractive set of properties. Specifically, SkinnyDip is highly noise-robust, practically parameter-free and completely deterministic. 
SkinnyDip never performs multivariate distance calculations, but rather employs insightful recursion based on “dips” into univariate projections of the data. It is able to detect a range of cluster shapes and densities, assuming only that each cluster admits a unimodal shape. Practically, its run-time grows linearly with the data. Finally, for high-dimensional data, continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit in order to find an appropriate basis for clustering. Although not without its limitations, SkinnyDip compares favorably to a variety of clustering approaches on synthetic and real data, particularly in high-noise settings." "dimensionality-reduction";"City-Scale Map Creation and Updating using GPS Collections";"Applications such as autonomous driving or real-time route recommendations require up-to-date and accurate digital maps. However, manually creating and updating such maps is too costly to meet the rising demands. As large collections of GPS trajectories become widely available, constructing and updating maps using such trajectory collections can greatly reduce the cost of such maps. Unfortunately, due to GPS noise and varying trajectory sampling rates, inferring maps from GPS trajectories can be very challenging. In this paper, we present a framework to create up-to-date maps with rich knowledge from GPS trajectory collections. Starting from an unstructured GPS point cloud, we discover road segments using novel graph-based clustering techniques with prior knowledge on road design. Based on road segments, we develop a scale- and orientation-invariant traj-SIFT feature to localize and recognize junctions using a supervised learning framework. Maps with rich knowledge are created based on discovered road segments and junctions. Compared to state-of-the-art methods, our approach can efficiently construct high-quality maps at city scales from large collections of GPS trajectories." "dimensionality-reduction";"Positive-Unlabeled Learning in Streaming Networks";"Data of many problems in real-world systems such as link prediction and one-class recommendation share common characteristics. First, data are in the form of positive-unlabeled (PU) measurements (e.g., Twitter “following”, Facebook “like”, etc.) that do not provide negative information, which can be naturally represented as networks. Second, in the era of big data, such data are generated temporally-ordered, continuously and rapidly, which determines its streaming nature. These common characteristics allow us to unify many problems into a novel framework - PU learning in streaming networks. In this paper, a principled probabilistic approach SPU is proposed to leverage the characteristics of the streaming PU inputs. In particular, SPU captures temporal dynamics and provides real-time adaptations and predictions by identifying the potential negative signals concealed in unlabeled data. Our empirical results on various real-world datasets demonstrate the effectiveness of the proposed framework over other state-of-the-art methods in both link prediction and recommendation." "dimensionality-reduction";"How to Compete Online for News Audience: Modeling Words that Attract Clicks";"Headlines are particularly important for online news outlets where there are many similar news stories competing for users’ attention. 
Traditionally, journalists have followed rules-of-thumb and experience to master the art of crafting catchy headlines, but with the valuable resource of large-scale click-through data of online news articles, we can apply quantitative analysis and text mining techniques to acquire an in-depth understanding of headlines. In this paper, we conduct a large-scale analysis and modeling of 150K news articles published over a period of four months on the Yahoo home page. We define a simple method to measure click-value of individual words, and analyze how temporal trends and linguistic attributes affect click-through rate (CTR). We then propose a novel generative model, headline click-based topic model (HCTM), that extends latent Dirichlet allocation (LDA) to reveal the effect of topical context on the click-value of words in headlines. HCTM leverages clicks in aggregate on previously published headlines to identify words for headlines that will generate more clicks in the future. We show that by jointly taking topics and clicks into account we can detect changes in user interests within topics. We evaluate HCTM in two different experimental settings and compare its performance with ALDA (adapted LDA), LDA, and TextRank. The first task, full headline, is to retrieve the full headline used for a news article given the body of the news article. The second task, good headline, is to specifically identify words in the headline that have high click values for the current news audience. For the full headline task, our model performs on par with ALDA, a state-of-the-art web-page summarization method that utilizes click-through information. For the good headline task, which is of more practical importance to both individual journalists and online news outlets, our model significantly outperforms all other comparative methods." "dimensionality-reduction";"A Text Clustering Algorithm Using an Online Clustering Scheme for Initialization";"In this paper, we propose a text clustering algorithm using an online clustering scheme for initialization called FGSDMM+. FGSDMM+ assumes that there are at most Kmax clusters in the corpus, and regards these Kmax potential clusters as one large potential cluster at the beginning. During initialization, FGSDMM+ processes the documents one by one in an online clustering scheme. The first document will choose the potential cluster, and FGSDMM+ will create a new cluster to store this document. Later documents will choose one of the non-empty clusters or the potential cluster with probabilities derived from the Dirichlet multinomial mixture model. Each time a document chooses the potential cluster, FGSDMM+ will create a new cluster to store that document and decrease the probability of later documents choosing the potential cluster. After initialization, FGSDMM+ will run a collapsed Gibbs sampling algorithm several times to obtain the final clustering result. Our extensive experimental study shows that FGSDMM+ can achieve better performance than three other clustering methods on both short and long text datasets." "dimensionality-reduction";"AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets";"The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, its O(n²) time complexity still remains a severe weakness. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. 
Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters for reducing the label propagation time. Moreover, instead of passively performing the range query for all objects like existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still guaranteeing the exact final result of DBSCAN. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants on very large real and synthetic complex datasets." "dimensionality-reduction";"Goal-Directed Inductive Matrix Completion";"Matrix completion (MC) with additional information has found wide applicability in several machine learning applications. Among algorithms for solving such problems, Inductive Matrix Completion (IMC) has drawn a considerable amount of attention, not only for its well established theoretical guarantees but also for its superior performance in various real-world applications. However, IMC based methods usually place very strong constraints on the quality of the features (side information) to ensure accurate recovery, which might not be met in practice. In this paper, we propose Goal-directed Inductive Matrix Completion (GIMC) to learn a nonlinear mapping of the features so that they satisfy the required properties. A key distinction between GIMC and IMC is that the feature mapping is learnt in a supervised manner, deviating from the traditional approach of unsupervised feature learning followed by model training. We establish the superiority of our method on several popular machine learning applications including multi-label learning, multi-class classification, and semi-supervised clustering." "dimensionality-reduction";"A Subsequence Interleaving Model for Sequential Pattern Mining";"Recent sequential pattern mining methods have used the minimum description length (MDL) principle to define an encoding scheme which describes an algorithm for mining the most compressing patterns in a database. We present a novel subsequence interleaving model based on a probabilistic model of the sequence database, which allows us to search for the most compressing set of patterns without designing a specific encoding scheme. Our proposed algorithm is able to efficiently mine the most relevant sequential patterns and rank them using an associated measure of interestingness. The efficient inference in our model is a direct result of our use of a structural expectation-maximization framework, in which the expectation-step takes the form of a submodular optimization problem subject to a coverage constraint. We show on both synthetic and real-world datasets that our model mines a set of sequential patterns with low spuriousness and redundancy, high interpretability and usefulness in real-world applications. Furthermore, we demonstrate that the quality of the patterns from our approach is comparable to, if not better than, existing state-of-the-art sequential pattern mining algorithms." "dimensionality-reduction";"node2vec: Scalable Feature Learning for Networks";"Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. 
Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks." "dimensionality-reduction";"Distributing the Stochastic Gradient Sampler for Large-Scale LDA";"Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing. The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learning small models that have about 10^3 topics and 10^3 vocabulary size." "dimensionality-reduction";"Online Feature Selection: A Limited-Memory Substitution Algorithm and its Asynchronous Parallel Vari";"This paper considers the feature selection scenario where only a few features are accessible at any time point. For example, features are generated sequentially and visible one by one. Therefore, one has to make an online decision to identify key features after all features are only scanned once or twice. The optimization-based approach is a powerful tool for online feature selection." "dimensionality-reduction";"Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations";"Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data. Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but also because it is often effective, as the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e., projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user’s prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time-series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections." "dimensionality-reduction";"Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining";"In this paper we study predictive pattern mining problems where the goal is to construct a predictive model based on a subset of predictive patterns in the database. Our main contribution is to introduce a novel method called safe pattern pruning (SPP) for a class of predictive pattern mining problems. 
The SPP method allows us to efficiently find a superset of all the predictive patterns in the database that are needed for the optimal predictive model. The advantage of the SPP method over the existing boosting-type method is that the former can find the superset by a single search over the database, while the latter requires multiple searches. The SPP method is inspired by the recent development of safe feature screening. In order to extend the idea of safe feature screening into predictive pattern mining, we derive a novel pruning rule, called the safe pattern pruning (SPP) rule, that can be used for searching over the tree defined among patterns in the database. The SPP rule has a property that, if a node corresponding to a pattern in the database is pruned out by the SPP rule, then it is guaranteed that all the patterns corresponding to its descendant nodes are never needed for the optimal predictive model. We apply the SPP method to graph mining and item-set mining problems, and demonstrate its computational advantage." "recommender-systems";"The Limits of Popularity-Based Recommendations, and the Role of Social Ties";"In this paper we introduce a mathematical model that captures some of the salient features of recommender systems that are based on popularity and that try to exploit social ties among the users. We show that, under very general conditions, the market always converges to a steady state, for which we are able to give an explicit form. Thanks to this we can tell rather precisely how much a market is altered by a recommendation system, and determine the power of users to influence others. Our theoretical results are complemented by experiments with real-world social networks showing that social graphs prevent large market distortions in spite of the presence of highly influential users." "recommender-systems";"The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain Recommendation";"With email overload becoming a billion-level drag on the economy, personalized email prioritization is of urgent need to help predict the importance level of an email. Despite lots of previous effort on the topic, broadcast email, an important type of email with its unique challenges and intriguing opportunities, has been overlooked. The most salient opportunity lies in that effective collaborative filtering can be exploited due to thousands of receivers of a typical broadcast email. However, every broadcast email is completely “cold” and it is very costly to obtain users’ preference feedback. Fortunately, there exist up to million-level broadcast mailing lists in a real-life email system. Similar mailing lists can provide useful extra information for broadcast email prioritization in a target mailing list. How to mine such useful extra information is a challenging problem that has never been touched. In this work, we propose the first broadcast email prioritization framework considering large numbers of mailing lists by formulating this problem as a cross-domain recommendation problem. An optimization framework is proposed to select the optimal set of source domains considering multiple criteria including overlap of users, feedback pattern similarity and coverage of users. Our method is thoroughly evaluated on a real-world industrial dataset from Samsung Electronics and is proved highly effective and outperforms all the baselines." 
"recommender-systems";"Minimizing Legal Exposure for High-Tech Companies through Collaborative Filtering Methods";"Patent litigation not only covers legal and technical issues, it is also a key consideration for managers of high-technology (high-tech) companies when making strategic decisions. Paten-t litigation influences the market value of high-tech companies. However, this raises unique challenges. To this end, in this paper, we develop a novel recommendation framework to solve the problem of litigation risk prediction. We will introduce a specific type of patent-related litigation, that is, Section 337 investigations, which prohibit all acts of unfair competition, or any unfair trade practices, when exporting products to the United States. To build this recommendation framework, we collect and exploit a large amount of published information related to almost all Section 337 investigation cases. This study has two aims: (1) to predict the litigation risk in a specific industry category for high-tech companies and (2) to predict the litigation risk from competitors for high-tech companies. These aims can be achieved by mining historical investigation cases and related patents. Specifically, we propose two methods to meet the needs of both aims: a proximal slope one predictor and a time-aware predictor. Several factors are considered in the proposed methods, including the litigation risk if a company wants to enter a new market and the risk that a potential competitor would file a lawsuit against the new entrant. Comparative experiments using real-world data demonstrate that the proposed methods outperform several base-lines with a significant margin." "recommender-systems";"Compute Job Memory Recommender System Using Machine Learning";"This paper presents a machine learning approach to predict the amount of compute memory needed by jobs which are submitted to Load Sharing Facility (LSF®) with a high level of accuracy. " "recommender-systems";"Online Context-Aware Recommendation with Time Varying Multi-Arm Bandit";"Contextual multi-armed bandit problems have gained increasing popularity and attention in recent years due to their capability of leveraging contextual information to deliver online personalized recommendation services (e.g., online advertising and news article selection). To predict the reward of each arm given a particular con-text, existing relevant research studies for contextual multi-armed bandit problems often assume the existence of a fixed yet unknown reward mapping function. However, this assumption rarely hold-s in practice, since real-world problems often involve underlying processes that are dynamically evolving over time." "recommender-systems";"Scalable Time-Decaying Adaptive Prediction Algorithm";"Online learning is used in a wide range of real applications, e.g., predicting ad click-through rates (CTR) and personalized recommendations. Based on the analysis of users’ behaviors in Video-On-Demand (VoD) recommender system-s, we discover that the most recent users’ actions can better reflect users’ current intentions and preferences. Under this observation, we thereby propose a novel time-decaying online learning algorithm derived from the state-of-the-art FTRL-proximal algorithm, called Time-Decaying Adaptive Prediction (TDAP) algorithm." 
"recommender-systems";"CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents";"User experience at social media and web platforms such as LinkedIn is heavily dependent on the performance and scalability of its products. Applications such as personalized search and recommendations require real-time scoring of millions of structured candidate documents associated with each query, with strict latency constraints. In such applications, the query incorporates the context of the user (in addition to search keywords if present), and hence can become very large, comprising of thousands of Boolean clauses over hundreds of document attributes. Consequently, candidate se-lection techniques need to be applied since it is infeasible to retrieve and score all matching documents from the underlying in-verted index. We propose CaSMoS, a machine learned candidate selection framework that makes use of Weighted AND (WAND) query. Our framework is designed to prune irrelevant documents and retrieve documents that are likely to be part of the top-k results for the query. We apply a constrained feature selection algorithm to learn positive weights for feature combinations that are used as part of the weighted candidate selection query. We have implemented and deployed this system to be executed in real time using LinkedIn’s Galene search platform. We perform extensive evaluation with different training data approaches and parameter settings, and investigate the scalability of the proposed candidate selection model. Our deployment of this system as part of LinkedIn’s job recommendation engine has resulted in significant reduction in latency (up to 25%) without sacrificing the quality of the retrieved results, thereby paving the way for more sophisticated scoring models." "recommender-systems";"When Recommendation Goes Wrong - Anomalous Link Discovery in Recommendation Networks";"We present a secondary ranking system to find and remove erroneous suggestions from a geospatial recommendation system. We discover such anomalous links by “double checking” the recommendation system’s output to ensure that it is both structurally cohesive, and semantically consistent." "recommender-systems";"Towards Conversational Recommender Systems";"People often ask others for restaurant recommendations as a way to discover new dining experiences. This makes restaurant recommendation an exciting scenario for recommender systems and has led to substantial research in this area. However, most such systems behave very differently from a human when asked for a recommendation. The goal of this paper is to begin to reduce this gap." "recommender-systems";"Collaborative Knowledge Base Embedding for Recommender Systems";"Among different recommendation techniques, collaborative filtering usually suffer from limited performance due to the sparsity of user-item interactions. To address the issues, auxiliary information is usually used to boost the performance. Due to the rapid collection of information on the web, the knowledge base provides heterogeneous information including both structured and unstructured data with different semantics, which can be consumed by various applications. In this paper, we investigate how to leverage the heterogeneous information in a knowledge base to improve the quality of recommender systems. 
First, by exploiting the knowledge base, we design three components to extract items’ semantic representations from structural content, textual content and visual content, respectively. To be specific, we adopt a heterogeneous network embedding method, termed as TransR, to extract items’ structural representations by considering the heterogeneity of both nodes and relationships. We apply stacked denoising auto-encoders and stacked convolutional auto-encoders, which are two types of deep learning based embedding techniques, to extract items’ textual representations and visual representations, respectively. Finally, we propose our final integrated framework, which is termed as Collaborative Knowledge Base Embedding (CKE), to jointly learn the latent representations in collaborative filtering as well as items’ semantic representations from the knowledge base. To evaluate the performance of each embedding component as well as the whole system, we conduct extensive experiments with two real-world datasets from different scenarios. The results reveal that our approaches outperform several widely adopted state-of-the-art recommendation methods." "recommender-systems";"Point-of-Interest Recommendations: Learning Potential Check-ins from Friends";"The emergence of Location-based Social Network (LBSN) services provides a wonderful opportunity to build personalized Point-of-Interest (POI) recommender systems. Although a personalized POI recommender system can significantly facilitate users’ outdoor activities, it faces many challenging problems, such as the difficulty of modeling users’ POI decision-making processes and of addressing data sparsity and the user/location cold-start problem. To cope with these challenges, we define three types of friends (i.e., social friends, location friends, and neighboring friends) in LBSN, and develop a two-step framework to leverage the information of friends to improve POI recommendation accuracy and address the cold-start problem. Specifically, we first propose to learn a set of potential locations that each individual’s friends have checked in at before and that this individual is most interested in. Then we incorporate three types of check-ins (i.e., observed check-ins, potential check-ins and other unobserved check-ins) into a matrix factorization model using two different loss functions (i.e., the square error based loss and the ranking error based loss). To evaluate the proposed model, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on two real-world data sets. The experimental results demonstrate the effectiveness of our methods." "recommender-systems";"Contextual Intent Tracking for Personal Assistants";"A new paradigm of recommendation is emerging in intelligent personal assistants such as Apple’s Siri, Google Now, and Microsoft Cortana, which recommends “the right information at the right time” and proactively helps you “get things done”. This type of recommendation requires precisely tracking users’ contemporaneous intent, i.e., what type of information (e.g., weather, stock prices) users currently intend to know, and what tasks (e.g., playing music, getting taxis) they intend to do. Users’ intent is closely related to context, which includes both external environments such as time and location, and users’ internal activities that can be sensed by personal assistants. 
The relationship between context and intent exhibits complicated co-occurring and sequential correlation, and contextual signals are also heterogeneous and sparse, which makes modeling the context-intent relationship a challenging task. To solve the intent tracking problem, we propose the Kalman filter regularized PARAFAC2 (KP2) nowcasting model, which compactly represents the structure and co-movement of context and intent. The KP2 model utilizes collaborative capabilities among users, and learns for each user a personalized dynamic system that enables efficient nowcasting of users’ intent. Extensive experiments using real-world data sets from a commercial personal assistant show that the KP2 model significantly outperforms various methods, and provides inspiring implications for deploying large-scale proactive recommendation systems in personal assistants." "recommender-systems";"Goal-Directed Inductive Matrix Completion";"Matrix completion (MC) with additional information has found wide applicability in several machine learning applications. Among algorithms for solving such problems, Inductive Matrix Completion (IMC) has drawn a considerable amount of attention, not only for its well established theoretical guarantees but also for its superior performance in various real-world applications. However, IMC based methods usually place very strong constraints on the quality of the features (side information) to ensure accurate recovery, which might not be met in practice. In this paper, we propose Goal-directed Inductive Matrix Completion (GIMC) to learn a nonlinear mapping of the features so that they satisfy the required properties. A key distinction between GIMC and IMC is that the feature mapping is learnt in a supervised manner, deviating from the traditional approach of unsupervised feature learning followed by model training. We establish the superiority of our method on several popular machine learning applications including multi-label learning, multi-class classification, and semi-supervised clustering." "recommender-systems";"From Online Behaviors to Offline Retailing";"To combat the ease of online shopping in pajamas, offline mall owners focus increasingly on driving satisfaction and improving retention by identifying customers’ preferences. However, most of these studies are based on customers’ offline consuming history only. Benefiting from the internet, we can also get customers’ online behaviors, such as the search logs, web browsing logs, online shopping logs, and so on. Might this seemingly irrelevant information from two different modalities (i.e., online and offline) be somehow interrelated? How can we make use of the online behaviors and offline actions jointly to promote recommendation for offline retailing? " "recommender-systems";"An Empirical Study on Recommendation with Multiple Types of Feedback";"User feedback like clicks and ratings on recommended items provides important information for recommender systems to predict users’ interests in unseen items. Most systems rely on models trained using a single type of feedback, e.g., ratings for movie recommendation and clicks for online news recommendation. However, in addition to the primary feedback, many systems also allow users to provide other types of feedback, e.g., liking or sharing an article, or hiding all articles from a source. This additional feedback potentially provides extra information for the recommendation models. 
To optimize user experience and business objectives, it is important for a recommender system to use both the primary feedback and additional feedback. This paper presents an empirical study on various training methods for incorporating multiple user feedback types based on LinkedIn recommendation products. We study three important problems that we face at LinkedIn: (1) whether to send an email based on clicks and complaints, (2) how to rank updates in LinkedIn feeds based on clicks and hides, and (3) how to jointly optimize for viral actions and clicks in LinkedIn feeds. Extensive offline experiments on historical data show the effectiveness of these methods in different situations. Online A/B testing results further demonstrate the impact of these methods on LinkedIn production systems." "recommender-systems";"Assessing Human Error Against a Benchmark of Perfection";"An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors." "recommender-systems";"Unified Point-of-Interest Recommendation with Temporal Interval Assessment";"Point-of-interest (POI) recommendation, which helps mobile users explore new places, has become an important location-based service. Existing approaches for POI recommendation have been mainly focused on exploiting the information about user preferences, social influence, and geographical influence. However, these approaches cannot handle the scenario where users are expecting to have POI recommendation for a specific time period. To this end, in this paper, we propose a unified recommender system, named the ‘Where and When to gO’ (WWO) recommender system, to integrate the user interests and their evolving sequential preferences with temporal interval assessment. As a result, the WWO system can make recommendations dynamically for a specific time period and the traditional POI recommender system can be treated as a special case of the WWO system by setting this time period long enough. Specifically, to quantify users’ sequential preferences, we consider the distributions of the temporal intervals between dependent POIs in the historical check-in sequences. Then, to estimate the distributions with only sparse observations, we develop the low-rank graph construction model, which identifies a set of bi-weighted graph bases so as to learn the static user preferences and the dynamic sequential preferences in a coherent way. Finally, we evaluate the proposed approach using real-world data sets from several location-based social networks (LBSNs). The experimental results show that our method outperforms the state-of-the-art approaches for POI recommendation in terms of various metrics, such as F-measure and NDCG, by a significant margin." "recommender-systems";"Continuous Experience-aware Language Model";"Online review communities are dynamic as users join and leave, adopt new vocabulary, and adapt to evolving trends. Recent work has shown that recommender systems benefit from explicit consideration of user experience. However, prior work assumes a fixed number of discrete experience levels, whereas in reality users gain experience and mature continuously over time." 
"recommender-systems";"Positive-Unlabeled Learning in Streaming Networks";"Data of many problems in real-world systems such as link prediction and one-class recommendation share common characteristics. First, data are in the form of positive-unlabeled (PU) measurements ( e.g. Twitter “following”, Facebook “like”, etc.) that do not provide negative information, which can be naturally represented as networks. Second, in the era of big data, such data are generated temporally-ordered, continuously and rapidly, which determines its streaming nature. These common characteristics allow us to unify many problems into a novel framework - PU learning in streaming networks. In this paper, a principled probabilistic approach SPU is proposed to leverage the characteristics of the streaming PU inputs. In particular, SPU captures temporal dynamics and provides real-time adaptations and predictions by identifying the potential negative signals concealed in unlabeled data. Our empirical results on various real-world datasets demonstrate the effectiveness of the proposed framework over other state-of-the-art methods in both link prediction and recommendation." "recommender-systems";"Smart Reply: Automated Response Suggestion for Email";"In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning." "graph-mining-and-social-networks";"A Truth Discovery Approach with Theoretical Guarantee";"In the information age, people can easily collect information about the same set of entities from multiple sources, among which conflicts are inevitable. This leads to an important task, truth discovery, i.e., to identify true facts (truths) via iteratively updating truths and source reliability. However, the convergence to the truths is never discussed in existing work, and thus there is no theoretical guarantee in the results of these truth discovery approaches. In contrast, in this paper we propose a truth discovery approach with theoretical guarantee. We propose a randomized gaussian mixture model (RGMM) to represent multi-source data, where truths are model parameters. We incorporate source bias which captures its reliability degree into RGMM formulation. The truth discovery task is then modeled as seeking the maximum likelihood estimate (MLE) of the truth-s. Based on expectation-maximization (EM) techniques, we propose population-based (i.e., on the limit of infinite data) and sample-based (i.e., on a finite set of samples) solutions for the MLE. Theoretically, we prove that both solutions are contractive to an ϵ-ball around the MLE, under certain conditions. Experimentally, we evaluate our method on both simulated and real-world datasets. Experimental results show that our method achieves high accuracy in identifying truths with convergence guarantee." "graph-mining-and-social-networks";"Smart broadcasting: Do you want to be seen?";"Many users in online social networks are constantly trying to gain attention from their followers by broadcasting posts to them. 
These broadcasters are likely to gain greater attention if their posts can remain visible for a longer period of time among their followers’ most recent feeds. Then when to post? In this paper, we study the problem of smart broadcasting using the framework of temporal point processes, where we model users’ feeds and posts as discrete events occurring in continuous time. Based on such a continuous-time model, choosing a broadcasting strategy for a user becomes a problem of designing the conditional intensity of her posting events. We derive a novel formula which links this conditional intensity with the “visibility” of the user in her followers’ feeds. Furthermore, by exploiting this formula, we develop an efficient convex optimization framework for the “when-to-post” problem. Our method can find broadcasting strategies that reach a desired “visibility” level with provable guarantees. We experimented with data gathered from Twitter, and show that our framework can consistently make broadcasters’ posts more visible than alternatives." "graph-mining-and-social-networks";"When Social Influence Meets Item Inference";"Research issues and data mining techniques for product recommendation and viral marketing have been widely studied. Existing works on seed selection in social networks do not take into account the effect of product recommendations in e-commerce stores. In this paper, we investigate the seed selection problem for viral marketing that considers both effects of social influence and item inference (for product recommendation). We develop a new model, Social Item Graph (SIG), that captures both effects in the form of hyperedges. Accordingly, we formulate a seed selection problem, called Social Item Maximization Problem (SIMP), and prove the hardness of SIMP. We design an efficient algorithm with performance guarantee, called Hyperedge-Aware Greedy (HAG), for SIMP and develop a new index structure, called SIG-index, to accelerate the computation of the diffusion process in HAG. Moreover, to construct realistic SIG models for SIMP, we develop a statistical inference based framework to learn the weights of hyperedges from data. Finally, we perform a comprehensive evaluation on our proposals with various baselines. Experimental results validate our ideas and demonstrate the effectiveness and efficiency of the proposed model and algorithms over baselines." "graph-mining-and-social-networks";"A multiple test correction for streams and cascades of statistical hypothesis tests";"Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e. rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance." 
"graph-mining-and-social-networks";"Effcient Processing of Network Proximity Queries via Chebyshev Acceleration";"Network proximity is at the heart of a large class of network analytics and information retrieval techniques, including node/ edge rankings, network alignment, and random-walk based proximity queries, among many others. Owing to its importance, significant effort has been devoted to accelerating iterative processes underlying network proximity computations. These techniques rely on numerical proper-ties of power iterations, as well as structural properties of the networks to reduce the runtime of iterative algorithms." "graph-mining-and-social-networks";"Robust Influence Maximization";"Uncertainty about models and data is ubiquitous in the computational social sciences, and it creates a need for robust social network algorithms, which can simultaneously provide guarantees across a spectrum of models and parameter set-tings. We begin an investigation into this broad domain by studying robust algorithms for the Influence Maximization problem, in which the goal is to identify a set of k nodes in a social network whose joint influence on the network is maximized." "graph-mining-and-social-networks";"How to Compete Online for News Audience: Modeling Words that Attract Clicks";"Headlines are particularly important for online news out-lets where there are many similar news stories competing for users’ attention. Traditionally, journalists have followed rules-of-thumb and experience to master the art of crafting catchy headlines, but with the valuable resource of large-scale click-through data of online news articles, we can apply quantitative analysis and text mining techniques to acquire an in-depth understanding of headlines. In this paper, we conduct a large-scale analysis and modeling of 150K news articles published over a period of four months on the Yahoo home page. We define a simple method to measure click-value of individual words, and analyze how temporal trends and linguistic attributes affect click-through rate (CTR). We then propose a novel generative model, headline click-based topic model (HCTM), that extends latent Dirichlet allocation (LDA) to reveal the effect of topical context on the click-value of words in headlines. HCTM leverages clicks in aggregate on previously published headlines to identify words for headlines that will generate more clicks in the future. We show that by jointly taking topics and clicks into account we can detect changes in user interests within topics. We evaluate HCTM in two different experimental settings and compare its performance with ALDA (adapted LDA), LDA, and TextRank. The first task, full headline, is to retrieve full headline used for a news article given the body of news article. The second task, good headline, is to specifically identify words in the headline that have high click values for current news audience. For full headline task, our model performs on par with ALDA, a state-of-the art web-page summarization method that utilizes click-through information. For good headline task, which is of more practical importance to both individual journalists and online news outlets, our model significantly outperforms all other comparative methods." "graph-mining-and-social-networks";"Approximate Personalized PageRank on Dynamic Graphs";"We propose and analyze two algorithms for maintaining approximate Personalized PageRank (PPR) vectors on a dynamic graph, where edges are added or deleted. 
Our algorithms are natural dynamic versions of two known local variations of power iteration. One, Forward Push, propagates probability mass forwards along edges from a source node, while the other, Reverse Push, propagates local changes backwards along edges from a target. In both variations, we maintain an invariant between two vectors, and when an edge is updated, our algorithm first modifies the vectors to restore the invariant, then performs any needed local push operations to restore accuracy." "graph-mining-and-social-networks";"Compact and Scalable Graph Neighborhood Sketching";"The all-distances sketch (ADS) has recently emerged as a promising paradigm of graph neighborhood sketching. An ADS is a probabilistic data structure that is defined for each vertex of a graph. ADSs facilitate accurate estimation of many useful indicators for network analysis with the guarantee of accuracy, and the ADSs for all the vertices in a graph can be computed in near-linear time. Because of these useful properties, ADS has attracted considerable attention. However, a critical drawback of ADS is its space requirement, which tends to be much larger than that of the graph itself. In the present study, we address this issue by designing a new graph sketching scheme, namely, sketch retrieval shortcuts (SRS). Although SRSs are more space-efficient than ADSs by an order of magnitude, an ADS of any vertex can be quickly retrieved from the SRSs. The retrieved ADSs can be used to estimate the aforementioned indicators in exactly the same manner as with plain ADSs, inheriting the same accuracy guarantee. Our experiments on real-world networks demonstrate the usefulness of SRSs as a practical back-end of large-scale graph data mining." "graph-mining-and-social-networks";"Structural Neighborhood based Classification of Nodes in a Network";"Classification of entities based on the underlying network structure is an important problem. Networks encountered in practice are sparse and have many missing and noisy links. Statistical learning techniques have been used in intra-network classification; however, they typically exploit only the local neighborhood, and so may not perform well. In this paper, we propose a novel structural neighborhood-based classifier learning using a random walk. For classifying a node, we take a random walk from the node and make a decision based on how nodes in the respective kth-level neighborhood are labeled. We observe that random walks of short length are helpful in classification. Emphasizing the role of longer random walks may cause the underlying Markov chain to converge to a stationary distribution. Considering this, we take a lazy random walk based approach with variable termination probability for each node, based on the node’s structural properties including its degree. Our experimental study on real-world datasets demonstrates the superiority of the proposed approach over the existing state-of-the-art approaches." "graph-mining-and-social-networks";"User Identity Linkage by Latent User Space Modelling";"User identity linkage across social platforms is an important problem of great research challenge and practical value. In real applications, the task often assumes an extra degree of difficulty by requiring linkage across multiple platforms. 
While pair-wise user linkage between two platforms, which has been the focus of most existing solutions, provides reasonably convincing linkage, the result depends by nature on the order of platform pairs in execution with no theoretical guarantee on its stability. In this paper, we explore a new concept of “Latent User Space” to more naturally model the relationship between the underlying real users and their observed projections onto the varied social platforms, such that the more similar the real users, the closer their profiles in the latent user space. We propose two effective algorithms, a batch model (ULink) and an online model (ULink-On), based on latent user space modelling. Two simple yet effective optimization methods are used for optimizing the objective function: the first one based on the constrained concave-convex procedure (CCCP) and the second on accelerated proximal gradient. To the best of our knowledge, this is the first work to propose a unified framework to address the following two important aspects of the multi-platform user identity linkage problem: (I) the platform multiplicity and (II) online data generation. We present experimental evaluations on real-world data sets for not only traditional pairwise-platform linkage but also multi-platform linkage. The results demonstrate the superiority of our proposed method over the state-of-the-art ones." "graph-mining-and-social-networks";"Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering";"Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringements when the source code is unavailable. It is also a common practice to discover exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of this process, since it can identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study the concerns and challenges that make existing assembly code clone approaches not practically applicable from the perspective of data mining. We propose a new variant of the LSH scheme and incorporate it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the given query assembly function’s subgraph clones from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling a large volume of assembly code." "graph-mining-and-social-networks";"Joint Community and Structural Hole Spanner Detection via Harmonic Modularity";"Detecting communities (or modular structures) and structural hole spanners, the nodes bridging different communities in a network, are two essential tasks in the realm of network analytics. 
Due to the topological nature of communities and structural hole spanners, these two tasks are naturally tangled with each other, while there has been little synergy between them. In this paper, we propose a novel harmonic modularity method to tackle both tasks simultaneously. Specifically, we apply a harmonic function to measure the smoothness of community structure and to obtain the community indicator. We then investigate the sparsity level of the interactions between communities, with particular emphasis on the nodes connecting to multiple communities, to discriminate the indicator of SH spanners and assist the community guidance. Extensive experiments on real-world networks demonstrate that our proposed method outperforms several state-of-the-art methods in the community detection task and also in the SH spanner identification task (even the methods that require the supervised community information). Furthermore, by removing the SH spanners spotted by our method, we show that the quality of other community detection methods can be further improved." "graph-mining-and-social-networks";"Burstiness Scale: a highly parsimonious model for characterizing random series of events";"The problem of accurately and parsimoniously characterizing random series of events (RSEs) seen in the Web, such as Yelp reviews or Twitter hashtags, is not trivial. Reports found in the literature reveal two apparently conflicting visions of how RSEs should be modeled. On one side are the Poissonian processes, in which consecutive events follow each other at relatively regular times and should not be correlated. On the other side are the self-exciting processes, which are able to generate bursts of correlated events. The existence of many and sometimes conflicting approaches to model RSEs is a consequence of the unpredictability of the aggregated dynamics of our individual and routine activities, which sometimes show simple patterns, but sometimes result in irregular rising and falling trends. In this paper we propose a parsimonious way to characterize general RSEs, namely the Burstiness Scale (BuSca) model. BuSca views each RSE as a mix of two independent processes: a Poissonian and a self-exciting one. Here we describe a fast method to extract the two parameters of BuSca that, together, give the burstiness scale, which represents how much of the RSE is due to bursty and viral effects. We validated our method in eight diverse and large datasets containing real random series of events seen in Twitter, Yelp, e-mail conversations, Digg, and online forums. Results showed that, even using only two parameters, BuSca is able to accurately describe RSEs seen in these diverse systems, which can leverage many applications." "graph-mining-and-social-networks";"Talent Circle Detection in Job Transition Networks";"With the high mobility of talent, it becomes critical for the recruitment team to find the right talent from the right source in an efficient manner. The prevalence of Online Professional Networks (OPNs), such as LinkedIn, enables the new paradigm for talent recruitment and job search. However, the dynamic and complex nature of such talent information imposes significant challenges to identify prospective talent sources from large-scale professional networks. Therefore, in this paper, we propose to create a job transition network where vertices stand for organizations and a directed edge represents the talent flow between two organizations for a time period. 
By analyzing this job transition network, we are able to extract talent circles in a way such that every circle includes the organizations with similar talent exchange patterns. Then, the characteristics of these talent circles can be used for talent recruitment and job search. To this end, we develop a talent circle detection model and design the corresponding learning method by maximizing the Normalized Discounted Cumulative Gain (NDCG) of inferred probability for the edge existence based on edge weights. Then, the identified circles will be labeled by the representative organizations as well as keywords in job descriptions. Moreover, based on these identified circles, we develop a talent exchange prediction method for talent recommendation. Finally, we have performed extensive experiments on real-world data. The results show that our method can achieve much higher modularity when compared to the benchmark approaches, as well as high precision and recall for talent exchange prediction." "graph-mining-and-social-networks";"Sampling of Attributed Networks From Hierarchical Generative Models";"Network sampling is a widely used procedure in social network analysis where a random network is sampled from a generative network model (GNM). Recently proposed GNMs allow generation of networks with more realistic structural characteristics than earlier ones. This facilitates tasks such as hypothesis testing and sensitivity analysis. However, sampling of networks with correlated vertex attributes remains a challenging problem. While the recent work of [16] has provided a promising approach for attributed-network sampling, the approach was developed for use with relatively simple GNMs and does not work well with more complex hierarchical GNMs (which can model the range of characteristics and variation observed in real world networks more accurately). In contrast to simple GNMs where the probability mass is spread throughout the space of edges more evenly, hierarchical GNMs concentrate the mass to smaller regions of the space to reflect dependencies among edges in the network—this produces more realistic network characteristics, but also makes it more difficult to identify candidate networks from the sampling space." "graph-mining-and-social-networks";"FINAL: Fast Attributed Network Alignment";"Multiple networks naturally appear in numerous high-impact applications. Network alignment (i.e., finding the node correspondence across different networks) is often the very first step for many data mining tasks. Most, if not all, of the existing alignment methods are solely based on the topology of the underlying networks. Nonetheless, many real networks often have rich attribute information on nodes and/or edges. In this paper, we propose a family of algorithms (FINAL) to align attributed networks. The key idea is to leverage the node/edge attribute information to guide the (topology-based) alignment process. We formulate this problem from an optimization perspective based on the alignment consistency principle, and develop effective and scalable algorithms to solve it. 
Our experiments on real networks show that (1) by leveraging the attribute information, our algorithms can significantly improve the alignment accuracy (i.e., up to a 30% improvement over the existing methods); (2) compared with the exact solution, our proposed fast alignment algorithm leads to a more than 10× speed-up, while preserving a 95% accuracy; and (3) our on-query alignment method scales linearly, with around 90% ranking accuracy compared with our exact full alignment method and a near real-time response time." "graph-mining-and-social-networks";"FASCINATE: Fast Cross-Layer Dependency Inference on Multi-layered Networks";"Multi-layered networks have recently emerged as a new network model, which naturally finds itself in many high-impact application domains, ranging from critical inter-dependent infrastructure networks, biological systems, organization-level collaborations, to cross-platform e-commerce, etc. Cross-layer dependency, which describes the dependencies or the associations between nodes across different layers/networks, often plays a central role in many data mining tasks on such multi-layered networks. Yet, it remains a daunting task to accurately know the cross-layer dependency a priori. In this paper, we address the problem of inferring the missing cross-layer dependencies on multi-layered networks. The key idea behind our method is to view it as a collective collaborative filtering problem. By formulating the problem into a regularized optimization model, we propose an effective algorithm to find the local optima with linear complexity. Furthermore, we derive an online algorithm to accommodate newly arrived nodes, whose complexity is just linear with respect to the size of the neighborhood of the new node. We perform extensive empirical evaluations to demonstrate the effectiveness and the efficiency of the proposed methods." "graph-mining-and-social-networks";"Finding Gangs in War from Signed Networks";"Given a signed network where edges are weighted with real numbers, and positive weights indicate cohesion between vertices and negative weights indicate opposition, we are interested in finding k-Oppositive Cohesive Groups (k-OCG). Each k-OCG is a group of k subgraphs such that (1) the edges within each subgraph are dense and cohesive; and (2) the edges crossing different subgraphs are dense and oppositive. Finding k-OCGs is challenging since the subgraphs are often small, there are multiple k-OCGs in a large signed network, and many existing dense subgraph extraction methods cannot handle edges of two signs. We model the k-OCG finding task as a quadratic optimization problem. However, the classical Proximal Gradient method is very costly since it has to use the entire adjacency matrix, which is huge on large networks. Thus, we develop FOCG, an algorithm that is two orders of magnitude faster than the Proximal Gradient method. The main idea is to only search in small subgraphs and thus avoid using a major portion of the adjacency matrix. Our experimental results on synthetic and real data sets as well as a case study clearly demonstrate the effectiveness and efficiency of our method." "graph-mining-and-social-networks";"Engagement Capacity and Engaging Team Formation for Reach Maximization of Online Social Media Platforms";"The challenges of assessing the “health” of online social media platforms and strategically growing them are recognized by many practitioners and researchers. 
For those platforms that primarily rely on user-generated content, the reach (the degree of participation, referring to the percentage and involvement of users) is a key indicator of success. This paper lays a theoretical foundation for measuring engagement as a driver of reach that achieves growth via positive externality effects. The paper takes a game theoretic approach to quantifying engagement, viewing a platform’s social capital as a cooperatively created value and finding a fair distribution of this value among the contributors. It introduces engagement capacity, a measure of the ability of users and user groups to engage peers, and formulates the Engaging Team Formation Problem (EngTFP) to identify the sets of users that “make a platform go”. We show how engagement capacity can be useful in characterizing forum user behavior and in reach maximization efforts. We also stress how engagement analysis differs from influence measurement. Computational investigations with Twitter and Health Forum data reveal the properties of engagement capacity and the utility of EngTFP." "graph-mining-and-social-networks";"QUINT: On Query-Specific Optimal Networks";"Measuring node proximity on large scale networks is a fundamental building block in many application domains, ranging from computer vision, e-commerce, social networks, software engineering, disaster management to biology and epidemiology. The state of the art (e.g., random walk based methods) typically assumes the input network is given a priori, with the known network topology and the associated edge weights. A few recent works aim to further infer the optimal edge weights based on the side information." "graph-mining-and-social-networks";"Dynamics of Large Multi-View Social Networks: Synergy, Cannibalization and Cross-View Interplay";"Most social networking services support multiple types of relationships between users, such as getting connected, sending messages, and consuming feed updates. These users and relationships can be naturally represented as a dynamic multi-view network, which is a set of weighted graphs with shared common nodes but having their own respective edges. Different network views, representing structural relationship and interaction types, could have very distinctive properties individually and these properties may change due to interplay across views. Therefore, it is of interest to study how multiple views interact and affect network dynamics and, in addition, explore possible applications to social networking." "graph-mining-and-social-networks";"Ranking Universities Based on Career Outcomes of Graduates";"Every year, millions of new students enter higher educational programs. Publicly available rankings of academic programs play a key role in prospective students’ decisions regarding which universities to apply to and enroll in. While surveys indicate that the majority of freshmen enter college to get good jobs after graduation, established methodologies for ranking universities rely on indirect indicators of career outcomes such as reputational assessments of the universities among academic peers, acceptance and graduation rates, learning environment, and availability of research funding. In addition, many of these methodologies rely on arbitrary choices of weighting factors for the different ranking indicators, and suffer from lack of analyses of statistical stability. 
In this paper, we address these challenges holistically by developing a novel methodology for ranking and recommending universities for different professions on the basis of career outcomes of professionals who graduated from those schools. Our methodology incorporates a number of techniques for achieving statistical stability, and represents a step towards personalized educational recommendations based on interests and ambitions of individuals. We have applied this methodology on LinkedIn’s Economic Graph data of over 400 million professionals from around the world. The resulting university rankings have been made available to the public and demonstrate that there are valuable insights to be gleaned from professional career data on LinkedIn." "graph-mining-and-social-networks";"Come-and-Go Patterns of Group Evolution: A Dynamic Model";"How do social groups, such as Facebook groups and Wechat groups, dynamically evolve over time? How do people join the social groups, uniformly or in bursts? What is the pattern of people quitting from groups? Is there a simple universal model to depict the come-and-go patterns of various groups? In this paper, we examine temporal evolution patterns of more than 100 thousand social groups with more than 10 million users. We surprisingly find that the evolution patterns of real social groups go far beyond the classic dynamic models like SI and SIR. For example, we observe both diffusion and non-diffusion mechanisms in the group joining process, and power-law decay in the group quitting process, rather than exponential decay as expected in the SIR model. Therefore we propose a new model comeNgo, a concise yet flexible dynamic model for group evolution. Our model has the following advantages: (a) unification power: it generalizes earlier theoretical models and different joining and quitting mechanisms we find from observation. (b) succinctness and interpretability: it contains only six parameters with clear physical meanings. (c) accuracy: it can capture various kinds of group evolution patterns precisely, and the goodness of fit increases by 58% over the baseline. (d) usefulness: it can be used in multiple application scenarios such as forecasting and pattern discovery." "graph-mining-and-social-networks";"node2vec: Scalable Feature Learning for Networks";"Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks." "graph-mining-and-social-networks";"Meta Structure: Computing Relevance in Large Heterogeneous Information Networks";"A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures can be used in various applications, including entity resolution, recommendation, and information retrieval. Several studies have investigated the use of HIN information for relevance computation; however, most of them only utilize simple structures, such as paths, to measure the similarity between objects. 
In this paper, we propose to use meta structure, which is a directed acyclic graph of object types with edge types connecting in between, to measure the proximity between objects. The strength of meta structure is that it can describe complex relationships between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We develop three relevance measures based on meta structure. Due to the computational complexity of these measures, we further design an algorithm with supporting data structures for their evaluation. Our extensive experiments on YAGO and DBLP show that meta structure-based relevance is more effective than state-of-the-art approaches, and can be efficiently computed." "graph-mining-and-social-networks";"Reconstructing an Epidemic over Time";"We consider the problem of reconstructing an epidemic over time, or, more generally, reconstructing the propagation of an activity in a network. Our input consists of a temporal network, which contains information about when two nodes interacted, and a sample of nodes that have been reported as infected. The goal is to recover the flow of the spread, including discovering the starting nodes, and identifying other likely-infected nodes that are not reported. The problem we consider has multiple applications, from public health to social media and viral marketing purposes." "graph-mining-and-social-networks";"Identifying Decision Makers from Professional Social Networks";"Sales professionals help organizations win clients for products and services. Generating new clients starts with identifying the right decision makers at the target organization. For the past decade, online professional networks have collected a tremendous amount of data on people’s identity, their networks, and the behavior of buyers and sellers building relationships with each other for a variety of use-cases. Sales professionals are increasingly relying on these networks to research, identify and reach out to potential prospects, but it is often hard to find the right people effectively and efficiently. In this paper we present LDMS, the LinkedIn Decision Maker Score, to quantify the ability of making a sales decision for each of the 400M+ LinkedIn members. It is the key data-driven technology underlying Sales Navigator, a proprietary LinkedIn product that is designed for sales professionals. We will specifically discuss the modeling challenges of LDMS, and present two graph-based approaches to tackle this problem by leveraging the professional network data at LinkedIn. Both approaches are able to leverage both the graph information and the contextual information on the vertices, deal with a small amount of labels on the graph, and handle heterogeneous graphs among different types of vertices. We will show some offline evaluations of LDMS on historical data, and also discuss its online usage in multiple applications in live production systems as well as future use cases within the LinkedIn ecosystem." "graph-mining-and-social-networks";"PTE: Enumerating Trillion Triangles On Distributed Systems";"How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications including identifying suspicious users in social networks, detecting web spams, finding communities, etc. However, recent networks are so large that most of the previous algorithms fail to process them. 
Recently, several MapReduce algorithms have been proposed to address such large networks; however, they suffer from the massive shuffled data, resulting in a very long processing time." "graph-mining-and-social-networks";"Scalable Betweenness Centrality Maximization via Sampling";"Betweenness centrality (BWC) is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of great importance to many key applications that rely on BWC, including community detection and understanding graph vulnerability. Despite the large amount of work on scalable approximation algorithm design for BWC, estimating BWC on large-scale networks remains a computational challenge." "graph-mining-and-social-networks";"Robust Influence Maximization";"In this paper, we address the important issue of uncertainty in the edge influence probability estimates for the well-studied influence maximization problem — the task of finding k seed nodes in a social network to maximize the influence spread. We propose the problem of robust influence maximization, which maximizes the worst-case ratio between the influence spread of the chosen seed set and the optimal seed set, given the uncertainty of the parameter input. We design an algorithm that solves this problem with a solution-dependent bound. We further study uniform sampling and adaptive sampling methods to effectively reduce the uncertainty on parameters and improve the robustness of the influence maximization task. Our empirical results show that parameter uncertainty may greatly affect influence maximization performance, that prior studies that learned influence probabilities could lead to poor performance in robust influence maximization due to relatively large uncertainty in parameter estimates, and that information cascade based adaptive sampling methods may be an effective way to improve the robustness of influence maximization." "graph-mining-and-social-networks";"Graph Wavelets via Sparse Cuts";"Modeling information that resides on vertices of large graphs is a key problem in several real-life applications, ranging from social networks to the Internet-of-things. Signal Processing on Graphs and, in particular, graph wavelets can exploit the intrinsic smoothness of these datasets in order to represent them in a compact and accurate manner. However, how to discover wavelet bases that capture the geometry of the data with respect to the signal as well as the graph structure remains an open problem. In this paper, we study the problem of computing graph wavelet bases via sparse cuts in order to produce low-dimensional encodings of data-driven bases. This problem is connected to known hard problems in graph theory (e.g. multiway cuts) and thus requires an efficient heuristic. We formulate the basis discovery task as a relaxation of a vector optimization problem, which leads to an elegant solution as a regularized eigenvalue computation. Moreover, we propose several strategies in order to scale our algorithm to large graphs. Experimental results show that the proposed algorithm can effectively encode both the graph structure and signal, producing compressed and accurate representations for vertex values in a wide range of datasets (e.g. sensor and gene networks) and significantly outperforming the best baseline." 
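As a rough illustration of the regularized-eigenvalue relaxation mentioned in the "Graph Wavelets via Sparse Cuts" abstract above, the sketch below is an assumption-laden toy, not the authors' algorithm: the function name, the `gamma` trade-off parameter, and the tiny example graph are all invented for illustration. It computes a two-sided cut vector that trades off smoothness on the graph (Laplacian quadratic form) against capturing variation in the vertex signal, via a single eigen-decomposition.

```python
import numpy as np

def regularized_cut_vector(A, x, gamma=1.0):
    """Illustrative relaxation (not the paper's method): find a cut/basis
    vector balancing graph smoothness against alignment with the signal x.
    A: symmetric adjacency matrix (n x n); x: vertex signal of length n."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                          # combinatorial graph Laplacian
    c = x - x.mean()
    S = np.outer(c, c)                          # signal "energy" term
    # Minimizing v^T (L - gamma*S) v favors vectors that are smooth on the
    # graph yet have a large projection onto the centered signal.
    _, V = np.linalg.eigh(L - gamma * S)
    v = V[:, 0]                                 # eigenvector of smallest eigenvalue
    return np.sign(v)                           # threshold into a two-sided cut

# toy usage: a 4-node path graph carrying a step-like signal
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([0.0, 0.1, 0.9, 1.0])
print(regularized_cut_vector(A, x))
```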
"graph-mining-and-social-networks";"Diversified Temporal Subgraph Pattern Mining";"Many graphs in real-world applications, such as telecommunications networks, social-interaction graphs and co-authorship graphs, contain temporal information. However, existing graph mining algorithms fail to exploit these temporal information and the resulting subgraph patterns do not contain any temporal attribute. In this paper, we study the problem of mining a set of diversified temporal subgraph patterns from a temporal graph, where each subgraph is associated with the time interval that the pattern spans. This problem motivates important applications such as finding social trends in social networks, or detecting temporal hotspots in telecommunications networks. We propose a divide-and-conquer algorithm along with effective pruning techniques, and our approach runs 2 to 3 orders of magnitude faster than a baseline algorithm and obtains high-quality temporal subgraph patterns in real temporal graphs." "graph-mining-and-social-networks";"CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors";"Representing and summarizing human behaviors with rich contexts facilitates behavioral sciences and user-oriented services. Traditional behavioral modeling represents a behavior as a tuple in which each element is one contextual factor of one type, and the tensor-based summaries look for high-order dense blocks by clustering the values (including timestamps) in each dimension. However, the human behaviors are multicontextual and dynamic: (1) each behavior takes place within multiple contexts in a few dimensions, which requires the representation to enable non-value and set-values for each dimension; (2) many behavior collections, such as tweets or papers, evolve over time. In this paper, we represent the behavioral data as a two-level matrix (temporal-behaviors by dimensional-values) and propose a novel representation for behavioral summary called Tartan that includes a set of dimensions, the values in each dimension, a list of consecutive time slices and the behaviors in each slice. We further develop a propagation method CATCHTAR-TAN to catch the dynamic multicontextual patterns from the temporal multidimensional data in a principled and scalable way: it determines the meaningfulness of updating every element in the Tartan by minimizing the encoding cost in a compression manner. CATCHTARTAN outperforms the baselines on both the accuracy and speed. We apply CATCHTARTAN to four Twitter datasets up to 10 million tweets and the DBLP data, providing comprehensive summaries for the events, human life and scientific development." "graph-mining-and-social-networks";"Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs";"Given a stream of heterogeneous graphs containing different types of nodes and edges, how can we spot anomalous ones in real-time while consuming bounded memory? This problem is motivated by and generalizes from its application in security to host-level advanced persistent threat (APT) detection. We propose StreamSpot, a clustering based anomaly detection approach that addresses challenges in two key fronts: (1) heterogeneity, and (2) streaming nature. We introduce a new similarity function for heterogeneous graphs that compares two graphs based on their relative frequency of local substructures, represented as short strings. 
This function lends itself to a vector representation of a graph, which is (a) fast to compute, and (b) amenable to a sketched version with bounded size that preserves similarity." "graph-mining-and-social-networks";"Asymmetric Transitivity Preserving Graph Embedding";"Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. The existing graph embedding methods cannot preserve the asymmetric transitivity well, which is a critical property of directed graphs. Asymmetric transitivity depicts the correlation among directed edges, that is, if there is a directed path from u to v, then there is likely a directed edge from u to v. Asymmetric transitivity can help in capturing structures of graphs and recovering from partially observed graphs. To tackle this challenge, we propose the idea of preserving asymmetric transitivity by approximating high-order proximity, which is based on asymmetric transitivity. In particular, we develop a novel graph embedding algorithm, High-Order Proximity preserved Embedding (HOPE for short), which is scalable to preserve high-order proximities of large-scale graphs and capable of capturing the asymmetric transitivity. More specifically, we first derive a general formulation that covers multiple popular high-order proximity measurements, then propose a scalable embedding algorithm to approximate the high-order proximity measurements based on their general formulation. Moreover, we provide a theoretical upper bound on the RMSE (Root Mean Squared Error) of the approximation. Our empirical experiments on a synthetic dataset and three real-world datasets demonstrate that HOPE can approximate the high-order proximities significantly better than the state-of-the-art algorithms and outperforms them in the tasks of reconstruction, link prediction and vertex recommendation." "graph-mining-and-social-networks";"The Limits of Popularity-Based Recommendations, and the Role of Social Ties";"In this paper we introduce a mathematical model that captures some of the salient features of recommender systems that are based on popularity and that try to exploit social ties among the users. We show that, under very general conditions, the market always converges to a steady state, for which we are able to give an explicit form. Thanks to this we can tell rather precisely how much a market is altered by a recommendation system, and determine the power of users to influence others. Our theoretical results are complemented by experiments with real world social networks showing that social graphs prevent large market distortions in spite of the presence of highly influential users." "graph-mining-and-social-networks";"Structural Deep Network Embedding";"Network embedding is an important method to learn low-dimensional representations of vertexes in networks, aiming to capture and preserve the network structure. Almost all the existing network embedding methods adopt shallow models. However, since the underlying network structure is complex, shallow models cannot capture the highly non-linear network structure, resulting in sub-optimal network representations. Therefore, how to find a method that is able to effectively capture the highly non-linear network structure and preserve the global and local structure is an open yet important problem. To solve this problem, in this paper we propose a Structural Deep Network Embedding method, namely SDNE. 
More specifically, we first propose a semi-supervised deep model, which has multiple layers of non-linear functions, thereby being able to capture the highly non-linear network structure. Then we propose to exploit the first-order and second-order proximity jointly to preserve the network structure. The second-order proximity is used by the unsupervised component to capture the global network structure, while the first-order proximity is used as the supervised information in the supervised component to preserve the local network structure. By jointly optimizing them in the semi-supervised deep model, our method can preserve both the local and global network structure and is robust to sparse networks. Empirically, we conduct the experiments on five real-world networks, including a language network, a citation network and three social networks. The results show that compared to the baselines, our method can reconstruct the original network significantly better and achieves substantial gains in three applications, i.e. multi-label classification, link prediction and visualization." "graph-mining-and-social-networks";"ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages";"We present ABRA, a suite of algorithms to compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis relies on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. Our experimental results show that ABRA is much faster than exact methods, and vastly outperforms, in both runtime and number of samples, state-of-the-art algorithms with the same quality guarantees." "graph-mining-and-social-networks";"Positive-Unlabeled Learning in Streaming Networks";"Data of many problems in real-world systems such as link prediction and one-class recommendation share common characteristics. First, data are in the form of positive-unlabeled (PU) measurements ( e.g. Twitter “following”, Facebook “like”, etc.) that do not provide negative information, which can be naturally represented as networks. Second, in the era of big data, such data are generated temporally-ordered, continuously and rapidly, which determines its streaming nature. These common characteristics allow us to unify many problems into a novel framework - PU learning in streaming networks. In this paper, a principled probabilistic approach SPU is proposed to leverage the characteristics of the streaming PU inputs. In particular, SPU captures temporal dynamics and provides real-time adaptations and predictions by identifying the potential negative signals concealed in unlabeled data. Our empirical results on various real-world datasets demonstrate the effectiveness of the proposed framework over other state-of-the-art methods in both link prediction and recommendation." "graph-mining-and-social-networks";"Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations";"The modern world has witnessed a dramatic increase in our ability to collect, transmit and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. 
Detecting system anomalies thus attracts a significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, the invariant network has been shown to be a powerful way of characterizing complex system behaviours. In the invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariant network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches to detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible causal components, which have several limitations: 1) fault propagation in the network is ignored; 2) the root causal anomalies may not always be the nodes with a high-percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network diffusion based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network, and can perform joint inference on both the structural and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations, and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach." "graph-mining-and-social-networks";"Compressing Graphs and Indexes with Recursive Graph Bisection";"Graph reordering is a powerful technique to increase the locality of the representations of graphs, which can be helpful in several applications. We study how the technique can be used to improve compression of graphs and inverted indexes." "graph-mining-and-social-networks";"Inferring Network Effects from Observational Data";"We present Relational Covariate Adjustment (RCA), a general method for estimating causal effects in relational data. Relational Covariate Adjustment is implemented through two high-level operations: identification of an adjustment set and relational regression adjustment. The former is achieved through an extension of Pearl’s back-door criterion to relational domains. We demonstrate how this extended definition can be used to estimate causal effects in the presence of network interference and confounding. RCA is agnostic to functional form, and it can easily model both discrete and continuous treatments as well as estimate the effects of a wider array of network interventions than existing experimental approaches. We show that RCA can yield robust estimates of causal effects using common regression models without extensive parameter tuning. Through a series of simulation experiments on a variety of synthetic and real-world network structures, we show that causal effects estimated on observational data with RCA are nearly as accurate as those estimated from well-designed network experiments." 
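To make the "regression adjustment" step in the "Inferring Network Effects from Observational Data" (RCA) abstract above concrete, here is a deliberately simplified, non-relational sketch. The function name and the synthetic data are illustrative assumptions, not the authors' implementation, and the relational back-door identification that RCA adds on top of this idea is not shown: the sketch merely adjusts for an observed confounder by regressing the outcome on the treatment plus the adjustment set and reading off the treatment coefficient.

```python
import numpy as np

def regression_adjusted_effect(y, t, Z):
    """Toy covariate adjustment: regress outcome y on treatment t plus an
    adjustment set Z; the coefficient on t estimates the treatment effect.
    (Non-relational illustration only; see the RCA abstract for the full idea.)"""
    X = np.column_stack([np.ones_like(t), t, Z])   # intercept, treatment, covariates
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# synthetic example: one confounder z drives both the treatment and the outcome
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
t = (z + rng.normal(size=5000) > 0).astype(float)
y = 2.0 * t + 1.5 * z + rng.normal(size=5000)       # true effect of t is 2.0
print(regression_adjusted_effect(y, t, z.reshape(-1, 1)))  # should be close to 2.0
```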
"graph-mining-and-social-networks";"TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size";"We present TRIEST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions." "graph-mining-and-social-networks";"FRAUDAR: Bounding Graph Fraud in the Face of Camouflage";"Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look “normal”. Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic." "deep-learning";"Causal Clustering for 1-Factor Measurement Models";"Many scientific research programs aim to learn the causal structure of real world phenomena. This learning problem is made more difficult when the target of study cannot be directly observed. One strategy commonly used by social scientists is to create measurable “indicator” variables that covary with the latent variables of interest. Before leveraging the indicator variables to learn about the latent variables, however, one needs a measurement model of the causal relations between the indicators and their corresponding latents. These measurement models are a special class of Bayesian networks. This paper addresses the problem of reliably infer-ring measurement models from measured indicators, with-out prior knowledge of the causal relations or the number of latent variables. We present a provably correct novel algorithm, FindOneFactorClusters (FOFC), for solving this inference problem. Compared to other state of the art algorithms, FOFC is faster, scales to larger sets of indicators, and is more reliable at small sample sizes. We also present the first correctness proofs for this problem that do not assume linearity or acyclicity among the latent variables." "deep-learning";"Interpretable Decision Sets: A Joint Framework for Description and Prediction";"One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model’s prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems." "deep-learning";"Compressing Convolutional Neural Networks in the Frequency Domain";"Convolutional neural networks (CNN) are increasingly used in many areas of computer vision. They are particularly attractive because of their ability to “absorb” great quantities of labeled data through millions of parameters. However, as model sizes increase, so do the storage and memory requirements of the classifiers, hindering many applications such as image and speech recognition on mobile phones and other devices. In this paper, we present a novel network architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both convolutional layers and fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption. 
Based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency, we first convert filter weights to the frequency domain with a discrete cosine transform (DCT) and use a low-cost hash function to randomly group frequency parameters into hash buckets. All parameters assigned to the same hash bucket share a single value learned with standard back-propagation. To further reduce model size, we allocate fewer hash buckets to high-frequency components, which are generally less important. We evaluate FreshNets on eight data sets, and show that it leads to better compressed performance than several relevant baselines." "deep-learning";"Optimal Reserve Prices in Upstream Auctions: Empirical Application on Online Video Advertising";"We consider optimal reserve prices in BrightRoll Video Exchange when the inventory opportunity comes from other exchanges (downstream marketplaces). We show that the existence of downstream auctions impacts the optimal floor. Moreover, it renders the classical derivation of the floor set by a monopolist inadequate and suboptimal. We derive the new downstream-corrected reserve price and compare its performance with respect to existing floors and the classical optimal monopoly price. In our application, the downstream-corrected reserve price proves superior to both. The proposed model also deals with data challenges commonly faced by exchanges: a limited number of logged bids in an auction, and uncertainty regarding the bidding behavior in other exchanges." "deep-learning";"Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations";"Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data. Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but also because it is often effective, as the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in, the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e., projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user’s prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time-series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections." "deep-learning";"Images Don’t Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank";"Search is at the heart of modern e-commerce. As a result, the task of ranking search results automatically (learning to rank) is a multibillion dollar machine learning problem. Traditional models optimize over a few hand-constructed features based on the item’s text. In this paper, we introduce a multimodal learning to rank model that combines these traditional features with visual semantic features transferred from a deep convolutional neural network. 
In a large-scale experiment using data from the online marketplace Etsy, we verify that moving to a multimodal representation significantly improves ranking quality. We show how image features can capture fine-grained style information not available in a text-only representation. In addition, we show concrete examples of how image information can successfully disentangle pairs of highly different items that are ranked similarly by a text-only model." "deep-learning";"Robust and Effective Metric Learning Using Capped Trace Norm";"Metric learning aims at automatically learning a metric from pair or triplet based constraints in data, and it can be potentially beneficial whenever the notion of metric between instances plays a nontrivial role. In Mahalanobis distance metric learning, the distance matrix M lies in the symmetric positive semi-definite cone, and in order to avoid overfitting and to learn a better Mahalanobis distance from weakly supervised constraints, low-rank regularization has often been imposed on matrix M to learn the correlations between features and samples. As the approximations of the rank minimization function, the trace norm and Fantope have been utilized to regularize the metric learning objectives and achieve good performance. However, these low-rank regularization models are either not tight enough to approximate rank minimization or time-consuming to tune an optimal rank. In this paper, we introduce a novel metric learning model using the capped trace norm based regularization, which uses a singular value threshold to constrain the metric matrix M to be explicitly low-rank such that the rank of matrix M is stable when the large singular values vary. The capped trace norm regularization can also be viewed as the adaptive Fantope regularization. We minimize singular values which are less than the threshold value, and the rank of M is not required to be k; thus our method is more stable and applicable in practice when we do not know the optimal rank of matrix M. We derive an efficient optimization algorithm to solve the proposed new model, and the algorithm convergence proof is also provided in this paper. We evaluate our method on a variety of challenging benchmarks, such as LFW and Pubfig datasets. Face verification experiments are performed and results show that our method consistently outperforms the state-of-the-art metric learning algorithms." "deep-learning";"Convolutional Neural Networks for Steady Flow Approximation";"In aerodynamics related design, analysis and optimization problems, flow fields are simulated using computational fluid dynamics (CFD) solvers. However, CFD simulation is usually a computationally expensive, memory demanding and time consuming iterative process. These drawbacks of CFD limit opportunities for design space exploration and forbid interactive design. We propose a general and flexible approximation model for real-time prediction of non-uniform steady laminar flow in a 2D or 3D domain based on convolutional neural networks (CNNs). We explored alternatives for the geometry representation and the network architecture of CNNs. We show that convolutional neural networks can estimate the velocity field two orders of magnitude faster than a GPU-accelerated CFD solver and four orders of magnitude faster than a CPU-based CFD solver at a cost of a low error rate. This approach can provide immediate feedback for real-time design iterations at the early stage of design. 
Compared with existing approximation models in the aerodynamics domain, CNNs enable an efficient estimation for the entire velocity field. Furthermore, designers and engineers can directly apply the CNN approximation model in their design space exploration algorithms without training extra lower-dimensional surrogate models." "deep-learning";"Large-Scale Item Categorization in e-Commerce Using Multiple Recurrent Neural Networks";"Precise item categorization is a key issue in e-commerce domains. However, it still remains a challenging problem due to data size, category skewness, and noisy metadata. Here, we demonstrate a successful report on a deep learning-based item categorization method, i.e., deep categorization network (DeepCN), in an e-commerce website. DeepCN is an end-to-end model using multiple recurrent neural networks (RNNs) dedicated to metadata attributes for generating features from text metadata and fully connected layers for classifying item categories from the generated features. The categorization errors are propagated back through the fully connected layers to the RNNs for weight update in the learning process. This deep learning-based approach allows diverse attributes to be integrated into a common representation, thus overcoming sparsity and scalability problems. We evaluate DeepCN on large-scale real-world data including more than 94 million items with approximately 4,100 leaf categories from a Korean e-commerce website. Experiment results show our method improves the categorization accuracy compared to the model using a single RNN as well as a standard classification model using unigram-based bag-of-words. Furthermore, we investigate how much the model parameters and the used attributes influence categorization performance." "deep-learning";"Predict Risk of Relapse for Patients with Multiple Stages of Treatment of Depression";"Depression is a serious mood disorder afflicting millions of people around the globe. Medications of different types and with different effects on neural activity have been developed for its treatment during the past few decades. Due to the heterogeneity of the disorder, many patients cannot achieve symptomatic remission from a single clinical trial. Instead they need multiple clinical trials to achieve remission, resulting in a multiple stage treatment pattern. Furthermore, those who indeed achieve symptom remission are still faced with substantial risk of relapse. One promising approach to predicting the risk of relapse is censored regression. Traditional censored regression typically applies only to situations in which the exact time of event of interest is known. However, follow-up studies that track the patients’ relapse status can only provide an interval of time during which relapse occurs. The exact time of relapse is usually unknown. In this paper, we present a censored regression approach with a truncated l1 loss function that can handle the uncertainty of relapse time. Based on this general loss function, we develop a gradient boosting algorithm and a stochastic dual coordinate ascent algorithm when the hypothesis in the loss function is represented as (1) an ensemble of decision trees and (2) a linear combination of covariates, respectively. As an extension of our linear model, a multi-stage linear approach is further proposed to harness the data collected from multiple stages of treatment. We evaluate the proposed algorithms using a real-world clinical trial dataset. 
Results show that our methods outperform the well-known Cox proportional hazard model. In addition, the risk factors identified by our multi-stage linear model not only corroborate findings from recent research but also yield some new insights into how to develop effective measures for prevention of relapse among patients after their initial remission from the acute treatment stage." "deep-learning";"Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features";"Manually crafted combinatorial features have been the “secret sauce” behind many successful models. For web-scale applications, however, the variety and volume of features make these manually crafted features expensive to create, maintain, and deploy. This paper proposes the Deep Crossing model which is a deep neural network that automatically combines features to produce superior models. The input of Deep Crossing is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well as a cascade of Residual Units." "deep-learning";"Inferring Network Effects from Observational Data";"We present Relational Covariate Adjustment (RCA), a general method for estimating causal effects in relational data. Relational Covariate Adjustment is implemented through two high-level operations: identification of an adjustment set and relational regression adjustment. The former is achieved through an extension of Pearl’s back-door criterion to relational domains. We demonstrate how this extended definition can be used to estimate causal effects in the presence of network interference and confounding. RCA is agnostic to functional form, and it can easily model both discrete and continuous treatments as well as estimate the effects of a wider array of network interventions than existing experimental approaches. We show that RCA can yield robust estimates of causal effects using common regression models without extensive parameter tuning. Through a series of simulation experiments on a variety of synthetic and real-world network structures, we show that causal effects estimated on observational data with RCA are nearly as accurate as those estimated from well-designed network experiments." "deep-learning";"A Closed-Loop Approach in Data-Driven Resource Allocation to Improve Network User Experience";"Machine learning methods have been widely used in modeling and predicting network user experience. In this paper, moving beyond user experience prediction, we propose a closed-loop approach that uses data-generated prediction models to explicitly guide resource allocation for user experience improvement. The closed-loop approach leverages and verifies the causal relation that often exists between certain feature values (e.g., bandwidth) and user experience in computer networks. The approach consists of three components: we train a neural network classifier to predict user experience, utilize the trained neural network classifier as the objective function to allocate network resource, and then evaluate user experience with allocated resource to (in)validate and adjust the original model. Specifically, we propose a dual decomposition algorithm to solve the neural network-based resource optimization problem, which is complex and non-convex. We further develop an iterative mechanism for classifier optimization. 
Numerical results show that the dual algorithm reduces the expected number of unsatisfied users by up to 2x compared with the baseline, and the optimized classifier further improves the performance by 50%." "deep-learning";"Towards Robust and Versatile Causal Discovery for Business Applications";"Causal discovery algorithms can induce some of the causal relations from the data, commonly in the form of a causal network such as a causal Bayesian network. Arguably, however, all such algorithms lag far behind what is necessary for a true business application. We develop an initial version of a new, general causal discovery algorithm called ETIO with many features suitable for business applications. These include (a) ability to accept prior causal knowledge (e.g., taking senior driving courses improves driving skills), (b) admitting the presence of latent confounding factors, (c) admitting the possibility of (a certain type of) selection bias in the data (e.g., clients sampled mostly from a given region), (d) ability to analyze data with missing-by-design (i.e., not planned to measure) values (e.g., if two companies merge and their databases measure different attributes), and (e) ability to analyze data from different interventions (e.g., prior and posterior to an advertisement campaign). ETIO is an instance of the logical approach to integrative causal discovery that has been relatively recently introduced and enables the solution of complex reverse-engineering problems in causal discovery. ETIO is compared against the state-of-the-art and is shown to be more effective in terms of speed, with only a slight degradation in terms of learning accuracy, while incorporating all the features above. The code is available on the mensxmachina.org website." "deep-learning";"Online Feature Selection: A Limited-Memory Substitution Algorithm and its Asynchronous Parallel Vari";"This paper considers the feature selection scenario where only a few features are accessible at any time point. For example, features are generated sequentially and visible one by one. Therefore, one has to make an online decision to identify key features after all features are only scanned once or twice. The optimization-based approach is a powerful tool for online feature selection." "deep-learning";"Just One More: Modeling Binge Watching Behavior";"Easy accessibility can often lead to over-consumption, as seen in food and alcohol habits. On video on-demand (VOD) services, this has recently been referred to as “binge watching”, where potentially entire seasons of TV shows are consumed in a single viewing session. While a user viewership model may reveal this binging behavior, creating an accurate model has several challenges, including censored data, deviations in the population, and the need to consider external influences on consumption habits. In this paper, we introduce a novel statistical mixture model that incorporates these factors and presents a “first of its kind” characterization of viewer consumption behavior using a real-world dataset that includes playback data from a VOD service. From our modeling, we tackle various predictive tasks to infer the consumption decisions of a user in a viewing session, including estimating the number of episodes they watch and classifying if they continue watching another episode. Using these insights, we then identify binge watching sessions based on deviation from normal viewing behavior. 
We observe different types of binging behavior, that binge watchers often view certain content out-of-order, and that binge watching is not a consistent behavior among our users. These insights and our findings have application in VOD revenue generation, consumer health applications, and customer retention analysis." "deep-learning";"Multi-Task Feature Interaction Learning";"Linear models are widely used in various data mining and machine learning algorithms. One major limitation of such models is the lack of capability to capture predictive information from interactions between features. While introducing high-order feature interaction terms can overcome this limitation, this approach dramatically increases the model complexity and imposes significant challenges in learning due to the increased risk of overfitting. When there are multiple related learning tasks, feature interactions from these tasks are usually related and modeling such relatedness is the key to improving their generalization. In this paper, we propose a novel Multi-Task feature Interaction Learning (MTIL) framework to exploit the task relatedness from high-order feature interactions. Specifically, we collectively represent the feature interactions from multiple tasks as a tensor, and prior knowledge of task relatedness can be incorporated into different structured regularizations on this tensor. We formulate two concrete approaches under this framework, namely the shared interaction approach and the embedded interaction approach. The former assumes tasks share the same set of interactions, and the latter assumes feature interactions from multiple tasks share a common subspace. We have provided efficient algorithms for solving the two formulations. Extensive empirical studies on both synthetic and real datasets have demonstrated the effectiveness of the proposed framework." "big-data";"Positive-Unlabeled Learning in Streaming Networks";"Data of many problems in real-world systems such as link prediction and one-class recommendation share common characteristics. First, data are in the form of positive-unlabeled (PU) measurements (e.g., Twitter “following”, Facebook “like”, etc.) that do not provide negative information, which can be naturally represented as networks. Second, in the era of big data, such data are generated temporally-ordered, continuously and rapidly, which determines their streaming nature. These common characteristics allow us to unify many problems into a novel framework - PU learning in streaming networks. In this paper, a principled probabilistic approach SPU is proposed to leverage the characteristics of the streaming PU inputs. In particular, SPU captures temporal dynamics and provides real-time adaptations and predictions by identifying the potential negative signals concealed in unlabeled data. Our empirical results on various real-world datasets demonstrate the effectiveness of the proposed framework over other state-of-the-art methods in both link prediction and recommendation." "big-data";"TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size";"We present TRIEST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions." 
"big-data";"Learning Cumulatively to Become More Knowledgeable";"In classic supervised learning, a learning algorithm takes a fixed training data of several classes to build a classifier. In this paper, we propose to study a new problem, i.e., building a learning system that learns cumulatively. As time goes by, the system sees and learns more and more classes of data and becomes more and more knowledgeable. We believe that this is similar to human learning. We humans learn continuously, retaining the learned knowledge, identifying and learning new things, and updating the existing knowledge with new experiences. Over time, we cumulate more and more knowledge. A learning system should be able to do the same. As algorithmic learning matures, it is time to tackle this cumulative machine learning (or simply cumulative learning) problem, which is a kind of lifelong machine learning problem. It presents two major challenges. First, the system must be able to detect data from unseen classes in the test set. Classic supervised learning, however, assumes all classes in testing are known or seen at the training time. Second, the system needs to be able to selectively update its models whenever a new class of data arrives without re-training the whole system using the entire past and present training data. This paper proposes a novel approach and system to tackle these challenges. Experimental results on two datasets with learning from 2 classes to up to 100 classes show that the proposed approach is highly promising in terms of both classification accuracy and computational efficiency." "big-data";"Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices";"With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or have prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn predictive linear models with a high interpretability from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix is in a compressed form. The original data matrix is grammar-compressed and then the linear model in PLS is learned on the compressed data matrix, which results in a significant reduction in working space, greatly improving scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression and feature extraction with various massive high-dimensional data, and show that cPLS performs superiorly in terms of prediction accuracy, computational efficiency, and interpretability." "big-data";"Boosted Decision Tree Regression Adjustment for Variance Reduction of Online Controlled Experiments";"Nowadays, the development of most leading web services is controlled by online experiments that qualify and quantify the steady stream of their updates achieving more than a thousand concurrent experiments per day. Despite the in-creasing need for running more experiments, these services are limited in their user traffic. 
This situation leads to the problem of finding new, or improving existing, key performance metrics with higher sensitivity and lower variance. We focus on the problem of variance reduction for engagement metrics of user loyalty that are widely used in A/B testing of web services. We develop a general framework that is based on evaluation of the mean difference between the actual and the approximated values of the key performance metric (instead of the mean of this metric). On the one hand, it allows us to incorporate the state-of-the-art techniques widely used in randomized experiments of clinical and social research, but rarely used in online evaluation. On the other hand, we propose a new class of methods based on advanced machine learning algorithms, including ensembles of decision trees, that, to the best of our knowledge, have not been applied earlier to the problem of variance reduction. We validate the variance reduction approaches on a very large set of real large-scale A/B experiments run at Yandex for different engagement metrics of user loyalty. Our best approach demonstrates 63% average variance reduction (which is equivalent to 63% saved user traffic) and detects the treatment effect in twice as many A/B experiments." "big-data";"Robust Influence Maximization";"In this paper, we address the important issue of uncertainty in the edge influence probability estimates for the well-studied influence maximization problem — the task of finding k seed nodes in a social network to maximize the influence spread. We propose the problem of robust influence maximization, which maximizes the worst-case ratio between the influence spread of the chosen seed set and the optimal seed set, given the uncertainty of the parameter input. We design an algorithm that solves this problem with a solution-dependent bound. We further study uniform sampling and adaptive sampling methods to effectively reduce the uncertainty on parameters and improve the robustness of the influence maximization task. Our empirical results show that parameter uncertainty may greatly affect influence maximization performance and prior studies that learned influence probabilities could lead to poor performance in robust influence maximization due to relatively large uncertainty in parameter estimates, and the information cascade based adaptive sampling method may be an effective way to improve the robustness of influence maximization." "big-data";"Lightweight Monitoring of Distributed Streams";"As data becomes dynamic, large, and distributed, there is increasing demand for what have become known as distributed stream algorithms. Since continuously collecting the data to a central server and processing it there incurs very high communication and computation complexities, it is advantageous to define local conditions at the nodes, such that – as long as they are maintained – some desirable global condition holds." "big-data";"Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test";"Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly at a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. 
However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected." "big-data";"Scalable Betweenness Centrality Maximization via Sampling";"Betweenness centrality (BWC) is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of great importance to many key applications that rely on BWC, including community detection and understanding graph vulnerability. Despite the large amount of work on scalable approximation algorithm design for BWC, estimating BWC on large-scale networks remains a computational challenge." "big-data";"Recruitment Market Trend Analysis with Sequential Latent Variable Models";"Recruitment market analysis provides valuable understanding of industry-specific economic growth and plays an important role for both employers and job seekers. With the rapid development of online recruitment services, massive recruitment data have been accumulated and enable a new paradigm for recruitment market analysis. However, traditional methods for recruitment market analysis largely rely on the knowledge of domain experts and classic statistical models, which are usually too general to model large-scale dynamic recruitment data, and have difficulty capturing the fine-grained market trends. To this end, in this paper, we propose a new research paradigm for recruitment market analysis by leveraging unsupervised learning techniques for automatically discovering recruitment market trends based on large-scale recruitment data. Specifically, we develop a novel sequential latent variable model, named MTLVM, which is designed for capturing the sequential dependencies of corporate recruitment states and is able to automatically learn the latent recruitment topics within a Bayesian generative framework. In particular, to capture the variability of recruitment topics over time, we design hierarchical Dirichlet processes for MTLVM. These processes allow us to dynamically generate the evolving recruitment topics. Finally, we implement a prototype system to empirically evaluate our approach based on real-world recruitment data in China. 
Indeed, by visualizing the results from MTLVM, we can successfully reveal many interesting findings, such as the popularity of LBS-related jobs peaking in the second half of 2014 and then declining in 2015." "big-data";"Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective";"With recent advances in mobile and sensor technologies, a great deal of effort has been made on developing intelligent applications for taxi drivers, which provide beneficial guidance and opportunities to improve profit and work efficiency. However, little work focuses on the latent social interaction among cab drivers, and the corresponding social propagation scheme for sharing driving behaviors has been largely ignored. To that end, in this paper, we propose a comprehensive study to reveal how social propagation helps better predict cab drivers’ future behaviors. To be specific, we first investigate the correlation between drivers’ skills and their mutual interactions in the latent vehicle-to-vehicle network, which intuitively indicates the effects of social influences. Along this line, by leveraging classic social influence theory, we develop a two-stage framework for quantitatively revealing the latent driving pattern propagation within taxi drivers. Comprehensive experiments on a real-world data set collected from New York City clearly validate the effectiveness of our proposed framework on predicting future taxi driving behaviors, which also support the hypothesis that social factors indeed improve the predictability of driving behaviors." "big-data";"Distributing the Stochastic Gradient Sampler for Large-Scale LDA";"Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing. The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learning small models that have about 10^3 topics and 10^3 vocabulary size." "big-data";"Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining";"In this paper we study predictive pattern mining problems where the goal is to construct a predictive model based on a subset of predictive patterns in the database. Our main contribution is to introduce a novel method called safe pattern pruning (SPP) for a class of predictive pattern mining problems. The SPP method allows us to efficiently find a superset of all the predictive patterns in the database that are needed for the optimal predictive model. The advantage of the SPP method over existing boosting-type methods is that the former can find the superset by a single search over the database, while the latter requires multiple searches. The SPP method is inspired by the recent development of safe feature screening. In order to extend the idea of safe feature screening into predictive pattern mining, we derive a novel pruning rule called the safe pattern pruning (SPP) rule that can be used for searching over the tree defined among patterns in the database. 
The SPP rule has a property that, if a node corresponding to a pattern in the database is pruned out by the SPP rule, then it is guaranteed that all the patterns corresponding to its descendant nodes are never needed for the optimal predictive model. We apply the SPP method to graph mining and item-set mining problems, and demonstrate its computational advantage." "big-data";"Singapore in Motion: insights on public transport service level through farecard and mobile data ana";"Given the changing dynamics of mobility patterns and rapid growth of cities, transport agencies seek to respond more rapidly to the needs of the public with the goal of offering an effective and competitive public transport system. A more data-centric approach for transport planning is part of the evolution of this process. In particular, the vast penetration of mobile phones provides an opportunity to monitor and derive insights on transport usage. Real-time and historical analyses of such data can give a detailed understanding of mobility patterns of people and also suggest improvements to current transit systems. On its own, however, mobile geolocation data has a number of limitations. We thus propose a joint telco-and-farecard-based learning approach to understanding urban mobility. The approach enhances telecommunications data by leveraging it jointly with other sources of real-time data. The approach is illustrated on the first- and last-mile problem as well as route choice estimation within a densely-connected train network." "big-data";"Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned";"Online controlled experiments, also called A/B testing, have been established as the mantra for data-driven decision making in many web-facing companies. In recent years, emerging research has focused on building the platform and scaling it up [34], best practices and lessons learned to obtain trustworthy results [19; 20; 23; 26], and experiment design techniques and various issues related to statistical inference and testing [6; 7; 8]. However, despite the central role metrics play in online controlled experiments, there is little published work on treating metric development itself as a data-driven process. In this paper, we focus on the topic of how to develop meaningful and useful metrics for online services in their online experiments, and show how data-driven techniques and criteria can be applied in the metric development process. In particular, we emphasize two fundamental qualities for the goal metrics (or Overall Evaluation Criteria) of any online service: directionality and sensitivity. We share lessons on why these two qualities are critical, how to measure these two qualities of metrics of interest, how to develop metrics with clear directionality and high sensitivity by using approaches based on user behavior models and data-driven calibration, and how to choose the right goal metrics for the entire online service." "big-data";"Communication Efficient Distributed Kernel Principal Component Analysis";"Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. 
Can we perform kernel PCA on the entire dataset in a distributed and communication-efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication-efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing k principal components with relative error ε over s workers has communication cost Õ(sρκ/ε + sκ^2/ε^3) words, where ρ is the average number of nonzero entries in each data point. Furthermore, we evaluated the algorithm on large-scale real-world datasets. The experimental results showed that the algorithm produces a high-quality kernel PCA solution while using significantly less communication than alternative approaches." "big-data";"Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences";"Given a large, online stream of multiple co-evolving event sequences, such as sensor data and Web-click logs, that contains various types of non-linear dynamic evolving patterns of different durations, how can we efficiently and effectively capture important patterns? How do we go about forecasting long-term future events?" "big-data";"Multi-layer Representation Learning for Medical Concepts";"Proper representations of medical concepts such as diagnosis, medication, procedure codes and visits from Electronic Health Records (EHR) have broad applications in healthcare analytics. Patient EHR data consists of a sequence of visits over time, where each visit includes multiple medical concepts, e.g., diagnosis, procedure, and medication codes. This hierarchical structure provides two types of relational information, namely sequential order of visits and co-occurrence of the codes within a visit. In this work, we propose Med2Vec, which not only learns the representations for both medical codes and visits from large EHR datasets with over a million visits, but also allows us to interpret the learned representations confirmed positively by clinical experts. In the experiments, Med2Vec shows significant improvement in prediction accuracy in clinical applications compared to baselines such as Skip-gram, GloVe, and stacked autoencoder, while providing clinically meaningful interpretation." "big-data";"Fast Component Pursuit for Large-Scale Inverse Covariance Estimation";"The maximum likelihood estimation (MLE) for the Gaussian graphical model, which is also known as the inverse covariance estimation problem, has gained increasing interest recently. Most existing works assume that inverse covariance estimators contain sparse structure and then construct models with the l1 regularization. In this paper, different from existing works, we study the inverse covariance estimation problem from another perspective by efficiently modeling the low-rank structure in the inverse covariance, which is assumed to be a combination of a low-rank part and a diagonal matrix. 
One motivation for this assumption is that the low-rank structure is common in many applications including climate and financial analysis, and another is that such an assumption can reduce the computational complexity when computing its inverse. Specifically, we propose an efficient COmponent Pursuit (COP) method to obtain the low-rank part, where each component can be sparse. For optimization, the COP method greedily learns a rank-one component in each iteration by maximizing the log-likelihood. Moreover, the COP algorithm enjoys several appealing properties including the existence of an efficient solution in each iteration and the theoretical guarantee on the convergence of this greedy approach. Experiments on large-scale synthetic and real-world datasets including thousands of millions of variables show that the COP method is faster than the state-of-the-art techniques for the inverse covariance estimation problem when achieving comparable log-likelihood on test data." "big-data";"Scalable Fast Rank-1 Dictionary Learning for fMRI Big Data Analysis";"It has been shown from various functional neuroimaging studies that sparsity-regularized dictionary learning could achieve superior performance in decomposing comprehensive and neuroscientifically meaningful functional networks from massive fMRI signals. However, the computational cost for solving the dictionary learning problem has been known to be very demanding, especially when dealing with large-scale data sets. Thus in this work, we propose a novel distributed rank-1 dictionary learning (D-r1DL) model and apply it for fMRI big data analysis. The model estimates one rank-1 basis vector with sparsity constraint on its loading coefficient from the input data at each learning step through alternating least squares updates. By iteratively learning the rank-1 basis and deflating the input data at each step, the model is then capable of decomposing the whole set of functional networks. We implement and parallelize the rank-1 dictionary learning algorithm using the Spark engine and deploy the resilient distributed dataset (RDD) abstraction for data distribution and operations. Experimental results from applying the model on the Human Connectome Project (HCP) data show that the proposed D-r1DL model is efficient and scalable towards fMRI big data analytics, thus enabling data-driven neuroscientific discovery from massive fMRI big data in the future." "big-data";"Evaluating Mobile App Release";"We have seen an explosive growth of mobile usage, particularly on mobile apps. It is more important than ever to be able to properly evaluate mobile app release. A/B testing is a standard framework to evaluate new ideas. We have seen many of its applications in the online world across the industry [9,10,12]. Running A/B tests on mobile apps turns out to be quite different, and much of it is attributed to the fact that we cannot ship code easily to mobile apps other than going through a lengthy build, review and release process. Mobile infrastructure and user behavior differences also contribute to how A/B tests are conducted differently on mobile apps, which will be discussed in detail in this paper. In addition to measuring features individually in the new app version through randomized A/B tests, we have a unique opportunity to evaluate the mobile app as a whole using the quasi-experimental framework [21]. Not all features can be A/B tested due to infrastructure changes and holistic product redesign. 
We propose and establish quasi-experimental techniques for measuring impact from mobile app release, with results shared from a recent major app launch at LinkedIn." "big-data";"Revisiting Random Binning Feature: Fast Convergence and Strong Parallelizability";"The kernel method has been developed as one of the standard approaches for nonlinear learning, which, however, does not scale to large data sets due to its quadratic complexity in the number of samples. A number of kernel approximation methods have thus been proposed in recent years, among which the random features method gains much popularity due to its simplicity and direct reduction of the non-linear problem to a linear one. Different random feature functions have since been proposed to approximate a variety of kernel functions. Among them the Random Binning (RB) feature, proposed in the first random-feature paper [21], has drawn much less attention than the Random Fourier (RF) feature proposed also in [21]. In this work, we observe that the RB features, with the right choice of optimization solver, could be orders-of-magnitude more efficient than other random features and kernel approximation methods under the same requirement of accuracy. We thus propose the first analysis of RB from the perspective of optimization, which by interpreting RB as a Randomized Block Coordinate Descent in the infinite-dimensional space, gives a faster convergence rate compared to that of other random features. In particular, we show that by drawing R random grids with at least κ non-empty bins per grid in expectation, the RB method achieves a convergence rate of O(1/(κR)), which not only sharpens its O(1/√R) rate from Monte Carlo analysis, but also shows a κ times speedup over other random features under the same analysis framework. In addition, we demonstrate another advantage of RB in the L1-regularized setting, where unlike other random features, an RB-based Coordinate Descent solver can be parallelized with guaranteed speedup proportional to κ. Our extensive experiments demonstrate the superior performance of the RB features over other random features and kernel approximation methods." "big-data";"PTE: Enumerating Trillion Triangles On Distributed Systems";"How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications including identifying suspicious users in social networks, detecting web spam, finding communities, etc. However, recent networks are so large that most of the previous algorithms fail to process them. Recently, several MapReduce algorithms have been proposed to address such large networks; however, they suffer from the massive shuffled data resulting in a very long processing time." "big-data";"“Why Should I Trust you?” Explaining the Predictions of Any Classifier";"Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. 
" "big-data";"Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environm";"Dual coordinate descent method is one of the most effective approaches for large-scale linear classification. However, its sequential design makes the parallelization difficult. In this work, we target at the parallelization in a multi-core environment. After pointing out difficulties faced in some existing approaches, we propose a new framework to parallelize the dual coordinate descent method. The key idea is to make the majority of all operations (gradient calculation here) parallelizable. The proposed framework is shown to be theoretically sound. Further, we demonstrate through experiments that the new framework is robust and efficient in a multi-core environment." "big-data";"Online Asymmetric Active Learning with Imbalanced Data";"This paper considers online learning with imbalanced streaming data under a query budget, where the act of querying for labels is constrained to a budget limit. We study different active querying strategies for classification. In particular, we propose an asymmetric active querying strategy that assigns different probabilities for query to examples predicted as positive and negative. To corroborate the proposed asymmetric query model, we provide a theoretical analysis on a weighted mistake bound. We conduct extensive evaluations of the proposed asymmetric active querying strategy in comparison with several baseline querying strategies and with previous online learning algorithms for imbalanced data. In particular, we perform two types of evaluations ac-cording to which examples appear as “positive”/ “negative”. In push evaluation only the positive predictions given to the user are taken into account; in push and query evaluation the decision to query is also considered for evaluation. The push and query evaluation strategy is particularly suited for a recommendation setting because the items selected for querying for labels may go to the end-user to enable customization and personalization. These would not be shown any differently to the end-user compared to recommended content (i.e., the examples predicated as positive). Additionally, given our interest in imbalanced data we measure F -score instead of accuracy that is traditionally considered by online classification algorithms. We also compare the querying strategies on five classification tasks from different domains, and show that the probabilistic query strategy achieves higher F -scores on both types of evaluation than deterministic strategy, especially when the budget is small, and the asymmetric query model further improves performance. When compared to the state-of-the-art cost-sensitive online learning algorithm under a budget, our online classification algorithm with asymmetric querying achieves a higher F -score on four of the five tasks, especially on the push evaluation." "big-data";"Deploying Analytics with the Portable Format for Analytics (PFA)";"We introduce a new language for deploying analytic models into products, services and operational systems called the Portable Format for Analytics (PFA). PFA is an example of what is sometimes called a model interchange format, a language for describing analytic models that is independent of specific tools, applications or systems. Model interchange formats allow one application (the model producer) to ex-port models and another application (the model consumer or scoring engine) to import models. 
The core idea behind PFA is to support the safe execution of statistical functions, mathematical functions, and machine learning algorithms and their compositions within a safe execution environment. With this approach, the common analytic models used in data science can be implemented, as well as the data transformations and data aggregations required for pre- and post-processing data. PFA-compliant scoring engines can be extended by adding new user-defined functions described in PFA. We describe the design of PFA. A Data Mining Group (DMG) Working Group is developing the PFA standard. The current version is 0.8.1 and contains many of the commonly used statistical and machine learning models, including regression, clustering, support vector machines, neural networks, etc. We also describe two implementations of Hadrian, one in Scala and one in Python. We discuss four case studies that use PFA and Hadrian to specify analytic models, including two that are deployed in operations at client sites." "big-data";"EMBERS at 4 years: Experiences operating an Open Source Indicators Forecasting System";"EMBERS is an anticipatory intelligence system forecasting population-level events in multiple countries of Latin America. A deployed system since 2012, EMBERS has been generating alerts 24x7 by ingesting a broad range of data sources including news, blogs, tweets, machine coded events, currency rates, and food prices. In this paper, we describe our experiences operating EMBERS continuously for nearly 4 years, with specific attention to the discoveries it has enabled, correct as well as missed forecasts, lessons learnt from participating in a forecasting tournament, and our perspectives on the limits of forecasting including ethical considerations." "big-data";"Robust Large-Scale Machine Learning in the Cloud";"The convergence behavior of many distributed machine learning (ML) algorithms can be sensitive to the number of machines being used or to changes in the computing environment. As a result, scaling to a large number of machines can be challenging. In this paper, we describe a new scalable coordinate descent (SCD) algorithm for generalized linear models whose convergence behavior is always the same, regardless of how much SCD is scaled out and regardless of the computing environment. This makes SCD highly robust and enables it to scale to massive datasets on low-cost commodity servers. Experimental results on a real advertising dataset in Google are used to demonstrate SCD’s cost effectiveness and scalability. Using Google’s internal cloud, we show that SCD can provide near linear scaling using thousands of cores for 1 trillion training examples on a petabyte of compressed data. This represents 10,000x more training examples than the ‘large-scale’ Netflix prize dataset. We also show that SCD can learn a model for 20 billion training examples in two hours for about $10." "big-data";"Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs";"Given a stream of heterogeneous graphs containing different types of nodes and edges, how can we spot anomalous ones in real-time while consuming bounded memory? This problem is motivated by and generalizes from its application in security to host-level advanced persistent threat (APT) detection. We propose StreamSpot, a clustering-based anomaly detection approach that addresses challenges on two key fronts: (1) heterogeneity, and (2) streaming nature. 
We introduce a new similarity function for heterogeneous graphs that compares two graphs based on their relative frequency of local substructures, represented as short strings. This function lends itself to a vector representation of a graph, which is (a) fast to compute, and (b) amenable to a sketched version with bounded size that preserves similarity." "big-data";"FLASH: Fast Bayesian Optimization for Data Analytic Pipelines";"Modern data science relies on data analytic pipelines to organize interdependent computational steps. Such analytic pipelines often involve different algorithms across multiple steps, each with its own hyperparameters. To achieve the best performance, it is often critical to select optimal algorithms and to set appropriate hyperparameters, which requires large computational efforts. Bayesian optimization provides a principled way for searching optimal hyperparameters for a single algorithm. However, many challenges remain in solving pipeline optimization problems with high-dimensional and highly conditional search space. In this work, we propose Fast LineAr SearcH (FLASH), an efficient method for tuning analytic pipelines. FLASH is a two-layer Bayesian optimization framework, which firstly uses a parametric model to select promising algorithms, then computes a nonparametric model to fine-tune hyperparameters of the promising algorithms. FLASH also includes an effective caching algorithm which can further accelerate the search process. Extensive experiments on a number of benchmark datasets have demonstrated that FLASH significantly outperforms previous state-of-the-art methods in both search speed and accuracy. Using 50% of the time budget, FLASH achieves up to 20% improvement on test error rate compared to the baselines. FLASH also yields state-of-the-art performance on a real-world application for healthcare predictive modeling." "big-data";"Scalable Pattern Matching over Compressed Graphs via Dedensification";"One of the most common operations on graph databases is graph pattern matching (e.g., graph isomorphism and more general types of “subgraph pattern matching”). In fact, in some graph query languages every single query is expressed as a graph matching operation. Consequently, there has been a significant amount of research effort in optimizing graph matching operations in graph database systems. As graph databases have scaled in recent years, so too has recent work on scaling graph matching operations. However, the performance of recent proposals for scaling graph pattern matching is limited by the presence of high-degree nodes. These high-degree nodes result in an explosion of intermediate result sizes during query execution, and therefore significant performance bottlenecks. In this paper we present a dedensification technique that losslessly compresses the neighborhood around high-degree nodes. Furthermore, we introduce a query processing technique that enables direct operation of graph query processing operations over the compressed data, without ever having to decompress the data. For pattern matching operations, we show how this technique can be implemented as a layer above existing graph database systems, so that the end-user can benefit from this technique without requiring modifications to the core graph database engine code. Our technique reduces the size of the intermediate result sets during query processing, and thereby improves query performance." 
"big-data";"Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix";"Controlled experiments are widely regarded as the most scientific way to establish a true causal relationship between product changes and their impact on business metrics. Many technology companies rely on such experiments as their main data-driven decision-making tool. The sensitivity of a controlled experiment refers to its ability to detect differences in business metrics due to product changes. At Netflix, with tens of millions of users, increasing the sensitivity of con-trolled experiments is critical as failure to detect a small effect, either positive or negative, can have a substantial revenue impact. This paper focuses on methods to increase sensitivity by reducing the sampling variance of business metrics. We define Netflix business metrics and share context around the critical need for improved sensitivity. We review popular variance reduction techniques that are broadly applicable to any type of controlled experiment and metric. We describe an innovative implementation of stratified sampling at Netflix where users are assigned to experiments in real time and discuss some surprising challenges with the implementation. We conduct case studies to compare these variance reduction techniques on a few Netflix datasets. Based on the empirical results, we recommend to use post-assignment variance reduction techniques such as post stratification [7] and CUPED [3] instead of at-assignment variance reduction techniques such as stratified sampling [2] in large-scale controlled experiments." "big-data";"XGBoost: A Scalable Tree Boosting System";"Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems." "big-data";"Deep Visual-Semantic Hashing for Cross-Modal Retrieval";"Due to the storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependency of images and temporal dynamics of text sentences for learning powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture, which capture the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture that constitutes a visual-semantic fusion network for learning joint embedding space of images and text sentences, and two modality-specific hashing networks for learning hash functions to generate compact binary codes. 
Our architecture effectively unifies joint multi-modal embedding and cross-modal hashing, which is based on a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates all things together to enable learning of similarity-preserving and high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state-of-the-art results in cross-modal retrieval experiments on image-sentence datasets, i.e., the standard IAPR TC-12 and the large-scale Microsoft COCO." "big-data";"Transfer Knowledge between Cities";"Rapid urbanization has motivated extensive research on urban computing. It is critical for urban computing tasks to unlock the power of the diversity of data modalities generated by different sources in urban spaces, such as vehicles and humans. However, we are more likely to encounter the label scarcity problem and the data insufficiency problem when solving an urban computing task in a city where services and infrastructures are not ready or just built. In this paper, we propose a FLexible multimOdal tRAnsfer Learning (FLORAL) method to transfer knowledge from a city where there exist sufficient multimodal data and labels to similar kinds of cities to fully alleviate the problems of label scarcity and data insufficiency. FLORAL learns semantically related dictionaries for multiple modalities from a source domain and simultaneously transfers the dictionaries and labelled instances from the source into a target domain. We evaluate the proposed method with a real-world study of air quality prediction." "big-data";"Parallel Lasso Screening for Big Data Optimization";"Lasso regression is a widely used technique in data mining for model selection and feature extraction. In many applications, it remains challenging to apply the regression model to large-scale problems that have massive data samples with high-dimensional features. One popular and promising strategy is to solve the Lasso problem in parallel. Parallel solvers run multiple cores in parallel on a shared memory system to speed up the computation, while their practical usage is limited by the huge dimensionality of the feature space. Screening is a promising method to solve the problem of high dimensionality by discarding the inactive features and removing them from optimization. However, when integrating screening methods with parallel solvers, most solvers cannot guarantee convergence on the reduced feature matrix. In this paper, we propose a novel parallel framework by parallelizing screening methods and integrating them with our proposed parallel solver. We propose two parallel screening algorithms: Parallel Strong Rule (PSR) and Parallel Dual Polytope Projection (PDPP). For the parallel solver, we propose an Asynchronous Grouped Coordinate Descent method (AGCD) to optimize the regression problem in parallel on the reduced feature matrix. AGCD is based on a grouped selection strategy to select the coordinate that has the maximum descent for the objective function in a group of candidates. Empirical studies on real-world datasets demonstrate that the proposed parallel framework has superior performance compared to the state-of-the-art parallel solvers." "big-data";"Crime Rate Inference with Big Data";"Crime is one of the most important social problems in the country, affecting public safety, children's development, and adult socioeconomic status. 
Understanding what factors cause higher crime is critical for policy makers in their efforts to reduce crime and increase citizens’ quality of life. We tackle a fundamental problem in our paper: crime rate inference at the neighborhood level. Traditional approaches have used demographics and geographical influences to estimate crime rates in a region. With the fast development of positioning technology and prevalence of mobile devices, a large amount of modern urban data have been collected and such big data can provide new perspectives for understanding crime. In this paper, we used large-scale Point-Of-Interest data and taxi flow data in the city of Chicago, IL in the USA. We observed significantly improved performance in crime rate inference compared to using traditional features. Such an improvement is consistent over multiple years. We also show that these new features are significant in the feature importance analysis." "big-data";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide supporting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "big-data";"Skinny-dip: Clustering in a Sea of Noise";"Can we find heterogeneous clusters hidden in data sets with 80% noise? Although such settings occur in the real world, we struggle to find methods from the abundance of clustering techniques that perform well with noise at this level. Indeed, perhaps this is enough of a departure from classical clustering to warrant its study as a separate problem. In this paper we present SkinnyDip which, based on Hartigan’s elegant dip test of unimodality, represents an intriguing approach to clustering with an attractive set of properties. Specifically, SkinnyDip is highly noise-robust, practically parameter-free and completely deterministic. SkinnyDip never performs multivariate distance calculations, but rather employs insightful recursion based on “dips” into univariate projections of the data. It is able to detect a range of cluster shapes and densities, assuming only that each cluster admits a unimodal shape. Practically, its run-time grows linearly with the data. Finally, for high-dimensional data, continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit in order to find an appropriate basis for clustering. 
Although not without its limitations, SkinnyDip compares favorably to a variety of clustering approaches on synthetic and real data, particularly in high-noise settings." "big-data";"ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages";"We present ABRA, a suite of algorithms to compute and maintain probabilistically guaranteed, high-quality approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis relies on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. Our experimental results show that ABRA is much faster than exact methods, and vastly outperforms, in both runtime and number of samples, state-of-the-art algorithms with the same quality guarantees." "big-data";"Accelerated Stochastic Block Coordinate Descent with Optimal Sampling";"We study the composite minimization problem where the objective function is the sum of two convex functions: one is the sum of a finite number of strongly convex and smooth functions, and the other is a general convex function that is non-differentiable. Specifically, we consider the case where the non-differentiable function is block separable and admits a simple proximal mapping for each block. This type of composite optimization is common in many data mining and machine learning problems, and can be solved by block coordinate descent algorithms. We propose an accelerated stochastic block coordinate descent (ASBCD) algorithm, which incorporates the incrementally averaged partial derivative into the stochastic partial derivative and exploits optimal sampling. We prove that ASBCD attains a linear rate of convergence. In contrast to uniform sampling, we reveal that the optimal non-uniform sampling can be employed to achieve a lower iteration complexity. Experimental results on different large-scale real data sets support our theory." "big-data";"Stochastic Optimization Techniques for Quantification Performance Measures";"The estimation of class prevalence, i.e., of the fraction of a population that belongs to a certain class, is an important task in data analytics, and finds applications in many domains such as the social sciences, market research, epidemiology, and others. For example, in sentiment analysis the goal is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments, e.g., in a certain time frame. A popular way of performing the above task, often dubbed quantification, is to use supervised learning in order to train a prevalence estimator from labeled data." "big-data";"Compressing Graphs and Indexes with Recursive Graph Bisection";"Graph reordering is a powerful technique to increase the locality of the representations of graphs, which can be helpful in several applications. We study how the technique can be used to improve compression of graphs and inverted indexes." "big-data";"GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction";"Generalized linear model (GLM) is a widely used class of models for statistical inference and response prediction problems. 
For instance, in order to recommend relevant content to a user or optimize for revenue, many web companies use logistic regression models to predict the probability of the user’s clicking on an item (e.g., ad, news article, job). In scenarios where the data is abundant, having a more fine-grained model at the user or item level would potentially lead to more accurate prediction, as the user’s personal preferences on items and the item’s specific attraction for users can be better captured. One common approach is to introduce ID-level regression coefficients in addition to the global regression coefficients in a GLM setting, and such models are called generalized linear mixed models (GLMix) in the statistical literature. However, for big data sets with a large number of ID-level coefficients, fitting a GLMix model can be computationally challenging. In this paper, we report how we successfully overcame the scalability bottleneck by applying parallelized block coordinate descent under the Bulk Synchronous Parallel (BSP) paradigm. We deployed the model in the LinkedIn job recommender system, and generated 20% to 40% more job applications for job seekers on LinkedIn." "big-data";"Convex Optimization for Linear Query Processing under Approximate Differential Privacy";"Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals’ privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate strategy, may yield higher accuracy than computing each of them independently. However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-convex. Hence, in the past much effort has been devoted to solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem." "big-data";"Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches";"Estimating the cardinality of unions and intersections of sets is a problem of interest in OLAP. Large data applications often require the use of approximate methods based on small sketches of the data. We give new estimators for the cardinality of unions and intersections and show they approximate an optimal estimation procedure. These estimators enable the improved accuracy of the streaming MinCount sketch to be exploited in distributed settings. Both theoretical and empirical results demonstrate substantial improvements over existing methods." "big-data";"Dynamic Clustering of Streaming Short Documents";"Clustering technology has found numerous applications in mining textual data. It was shown to enhance the performance of retrieval systems in various ways, such as identifying different query aspects in search result diversification, improving smoothing in the context of language modeling, matching queries with documents in a latent topic space in ad-hoc retrieval, summarizing documents, etc. The vast majority of clustering methods have been developed under the assumption of a static corpus of long (and hence textually rich) documents. Little attention has been given to streaming corpora of short text, which is the predominant type of data in Web 2.0 applications, such as social media, forums, and blogs. 
In this paper, we consider the problem of dynamically clustering a streaming corpus of short documents. The short length of documents makes the inference of the latent topic distribution challenging, while the temporal dynamics of streams allow topic distributions to change over time. To tackle these two challenges, we propose a new dynamic clustering topic model - DCT - that enables tracking the time-varying distributions of topics over documents and words over topics. DCT models temporal dynamics by a short-term or long-term dependency model over sequential data, and overcomes the difficulty of handling short text by assigning a single topic to each short document and using the distributions inferred at a certain point in time as priors for the next inference, allowing the aggregation of information. At the same time, taking a Bayesian approach allows evidence obtained from new streaming documents to change the topic distribution. Our experimental results demonstrate that the proposed clustering algorithm outperforms state-of-the-art dynamic and non-dynamic clustering topic models in terms of perplexity, and when integrated in a cluster-based query likelihood model it also outperforms state-of-the-art models in terms of retrieval quality." "big-data";"Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization";"Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. A key to success for a bike sharing system is the effectiveness of rebalancing operations, that is, the efforts of restoring the number of bikes in each station to its target value by routing vehicles through pick-up and drop-off operations. There are two major issues for this bike rebalancing problem: the determination of station inventory target level and the large scale multiple capacitated vehicle routing optimization with outlier stations. The key challenges include demand prediction accuracy for inventory target level determination, and an effective optimizer for vehicle routing with hundreds of stations. To this end, in this paper, we develop a Meteorology Similarity Weighted K-Nearest-Neighbor (MSWK) regressor to predict the station pick-up demand based on large-scale historic trip records. Based on further analysis of the station network constructed by station-station connections and the trip duration, we propose an inter station bike transition (ISBT) model to predict the station drop-off demand. Then, we provide a mixed integer nonlinear programming (MINLP) formulation of the multiple capacitated bike routing problem with the objective of minimizing total travel distance. To solve it, we propose an Adaptive Capacity Constrained K-centers Clustering (AdaCCKC) algorithm to separate outlier stations (the demands of these stations are very large and make the optimization infeasible) and group the remaining stations into clusters within which one vehicle is scheduled to redistribute bikes between stations. In this way, the large scale multiple vehicle routing problem is reduced to an inner-cluster single-vehicle routing problem with guaranteed feasible solutions. Finally, the extensive experimental results on the NYC Citi Bike system show the advantages of our approach for bike demand prediction and large-scale bike rebalancing optimization." "big-data";"Accelerating Online CP Decompositions for Higher Order Tensors";"Tensors are a natural representation for multidimensional data. 
In recent years, CANDECOMP/PARAFAC (CP) decomposition, one of the most popular tools for analyzing multi-way data, has been extensively studied and widely applied. However, today’s datasets are often dynamically changing over time. Tracking the CP decomposition for such dynamic tensors is a crucial but challenging task, due to the large scale of the tensor and the velocity of new data arriving. Traditional techniques, such as Alternating Least Squares (ALS), cannot be directly applied to this problem because of their poor scalability in terms of time and memory. Additionally, existing online approaches have only partially addressed this problem and can only be deployed on third-order tensors. To fill this gap, we propose an efficient online algorithm that can incrementally track the CP decompositions of dynamic tensors with an arbitrary number of dimensions. In terms of effectiveness, our algorithm demonstrates comparable results with the most accurate algorithm, ALS, whilst being computationally much more efficient. Specifically, on small and moderate datasets, our approach is tens to hundreds of times faster than ALS, while for large-scale datasets, the speedup can be more than 3,000 times. Compared to other state-of-the-art online approaches, our method shows not only significantly better decomposition quality, but also better performance in terms of stability, efficiency and scalability." "big-data";"Approximate Personalized PageRank on Dynamic Graphs";"We propose and analyze two algorithms for maintaining approximate Personalized PageRank (PPR) vectors on a dynamic graph, where edges are added or deleted. Our algorithms are natural dynamic versions of two known local variations of power iteration. One, Forward Push, propagates probability mass forwards along edges from a source node, while the other, Reverse Push, propagates local changes backwards along edges from a target. In both variations, we maintain an invariant between two vectors, and when an edge is updated, our algorithm first modifies the vectors to restore the invariant, then performs any needed local push operations to restore accuracy." "big-data";"Annealed Sparsity via Adaptive and Dynamic Shrinking";"Sparse learning has received a tremendous amount of interest in high-dimensional data analysis due to its model interpretability and low computational cost. Among the various techniques, adaptive l1-regularization is an effective framework to improve the convergence behaviour of the LASSO, by using varying strength of regularization across different features. In the meantime, the adaptive structure makes it very powerful in modelling grouped sparsity patterns as well, being particularly useful in high-dimensional multi-task problems. However, choosing an appropriate, global regularization weight is still an open problem. In this paper, inspired by the annealing technique in material science, we propose to achieve “annealed sparsity” by designing a dynamic shrinking scheme that simultaneously optimizes the regularization weights and model coefficients in sparse (multi-task) learning. The dynamic structures of our algorithm are twofold. Feature-wise (“spatially”), the regularization weights are updated interactively with model coefficients, allowing us to improve the global regularization structure. Iteration-wise (“temporally”), such interaction is coupled with gradually boosted l1-regularization by adjusting an equality norm-constraint, achieving an “annealing” effect to further improve model selection. 
This renders interesting shrinking behaviour along the whole solution path. Our method competes favorably with state-of-the-art methods in sparse (multi-task) learning. We also apply it to expression quantitative trait loci (eQTL) analysis, which gives useful biological insights in a human cancer (melanoma) study." "big-data";"Smart Reply: Automated Response Suggestion for Email";"In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning." "big-data";"Efficient Frequent Directions Algorithm for Sparse Matrices";"This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ ℝ^(n×d) one by one in the streaming setting and compute a small sketch B ∈ ℝ^(ℓ×d). Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndℓ) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)ℓ + nℓ²) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data." "big-data";"Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detection";"Attention Deficit Hyperactive Disorder (ADHD) is one of the most common childhood disorders and can continue through adolescence and adulthood. Although the root cause of the problem still remains unknown, recent advancements in brain imaging technology reveal that there exist differences between the neural activities of Typically Developing Children (TDC) and ADHD subjects. Inspired by this, we propose a novel First-Take-All (FTA) hashing framework to investigate the problem of fast ADHD subject detection through the fMRI time series of neuron activities. By hashing time courses from regions of interest (ROIs) in the brain into fixed-size hash codes, FTA can compactly encode the temporal order differences between the neural activity patterns that are key to distinguishing TDC and ADHD subjects. Such patterns can be directly learned via minimizing the training loss incurred by the generated FTA codes. By conducting similarity search on the resultant FTA codes, data-driven ADHD detection can be achieved in an efficient fashion. The experimental results on real-world ADHD detection benchmarks demonstrate that FTA can outperform the state-of-the-art baselines using only neural activity time series without any phenotypic information." "big-data";"Assessing Human Error Against a Benchmark of Perfection";"An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. 
Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors." "time-series-and-stream-mining";"Computational Drug Repositioning Using Continuous Self-controlled Case Series";"Computational Drug Repositioning (CDR) is the task of discovering potential new indications for existing drugs by mining large-scale heterogeneous drug-related data sources. Leveraging the patient-level temporal ordering information between numeric physiological measurements and various drug prescriptions provided in Electronic Health Records (EHRs), we propose a Continuous Self-controlled Case Series (CSCCS) model for CDR. As an initial evaluation, we look for drugs that can control Fasting Blood Glucose (FBG) level in our experiments. Applying CSCCS to the Marshfield Clinic EHR, well-known drugs that are indicated for controlling blood glucose level are rediscovered. Furthermore, some drugs with recent literature support for the potential effect of blood glucose level control are also identified." "time-series-and-stream-mining";"Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches";"Estimating the cardinality of unions and intersections of sets is a problem of interest in OLAP. Large data applications often require the use of approximate methods based on small sketches of the data. We give new estimators for the cardinality of unions and intersections and show they approximate an optimal estimation procedure. These estimators enable the improved accuracy of the streaming MinCount sketch to be exploited in distributed settings. Both theoretical and empirical results demonstrate substantial improvements over existing methods." "time-series-and-stream-mining";"Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detection";"Attention Deficit Hyperactive Disorder (ADHD) is one of the most common childhood disorders and can continue through adolescence and adulthood. Although the root cause of the problem still remains unknown, recent advancements in brain imaging technology reveal that there exist differences between the neural activities of Typically Developing Children (TDC) and ADHD subjects. Inspired by this, we propose a novel First-Take-All (FTA) hashing framework to investigate the problem of fast ADHD subject detection through the fMRI time series of neuron activities. By hashing time courses from regions of interest (ROIs) in the brain into fixed-size hash codes, FTA can compactly encode the temporal order differences between the neural activity patterns that are key to distinguishing TDC and ADHD subjects. Such patterns can be directly learned via minimizing the training loss incurred by the generated FTA codes. By conducting similarity search on the resultant FTA codes, data-driven ADHD detection can be achieved in an efficient fashion. The experimental results on real-world ADHD detection benchmarks demonstrate that FTA can outperform the state-of-the-art baselines using only neural activity time series without any phenotypic information." "time-series-and-stream-mining";"Aircraft Trajectory Prediction made easy with Predictive Analytics";"At the heart of Air Traffic Management (ATM) lie the Decision Support Systems (DSTs) that rely upon accurate trajectory prediction to determine what the airspace will look like in the future in order to make better decisions and advisories. 
Dealing with airspace that is prone to congestion due to environmental factors remains a challenge, especially when a deterministic approach is used in the trajectory prediction process. In this paper, we describe a novel stochastic trajectory prediction approach for ATM that can be used for more efficient and realistic flight planning and to assist airspace flow management, potentially resulting in higher safety, capacity, and efficiency, commensurate with fuel savings, thereby reducing emissions for a better environment." "time-series-and-stream-mining";"Anomaly Detection Using Program Control Flow Graph Mining from Execution Logs";"We focus on the problem of detecting anomalous run-time behavior of distributed applications from their execution logs. Specifically, we mine templates and template sequences from logs to form a control flow graph (cfg) spanning distributed components. This cfg represents the baseline healthy system state and is used to flag deviations from the expected behavior of runtime logs. The novelty in our work stems from the new techniques employed to: (1) overcome the instrumentation requirements or application-specific assumptions made in prior log mining approaches, (2) improve the accuracy of mined templates and the cfg in the presence of long parameters and a high amount of interleaving, respectively, and (3) improve by orders of magnitude the scalability of the cfg mining process in terms of the volume of log data that can be processed per day." "time-series-and-stream-mining";"Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective";"With recent advances in mobile and sensor technologies, a large amount of effort has been made to develop intelligent applications for taxi drivers, which provide beneficial guidance and opportunities to improve profit and work efficiency. However, limited work focuses on the latent social interaction among cab drivers, and the corresponding social propagation scheme for sharing driving behaviors has been largely ignored. To that end, in this paper, we propose a comprehensive study to reveal how social propagation can be exploited for better prediction of cab drivers’ future behaviors. To be specific, we first investigate the correlation between drivers’ skills and their mutual interactions in the latent vehicle-to-vehicle network, which intuitively indicates the effects of social influences. Along this line, by leveraging classic social influence theory, we develop a two-stage framework for quantitatively revealing the latent driving pattern propagation among taxi drivers. Comprehensive experiments on a real-world data set collected from New York City clearly validate the effectiveness of our proposed framework on predicting future taxi driving behaviors, which also supports the hypothesis that social factors indeed improve the predictability of driving behaviors." "time-series-and-stream-mining";"Online Asymmetric Active Learning with Imbalanced Data";"This paper considers online learning with imbalanced streaming data under a query budget, where the act of querying for labels is constrained to a budget limit. We study different active querying strategies for classification. In particular, we propose an asymmetric active querying strategy that assigns different probabilities for query to examples predicted as positive and negative. To corroborate the proposed asymmetric query model, we provide a theoretical analysis on a weighted mistake bound. 
We conduct extensive evaluations of the proposed asymmetric active querying strategy in comparison with several baseline querying strategies and with previous online learning algorithms for imbalanced data. In particular, we perform two types of evaluations according to which examples appear as “positive”/“negative”. In push evaluation, only the positive predictions given to the user are taken into account; in push-and-query evaluation, the decision to query is also considered for evaluation. The push-and-query evaluation strategy is particularly suited for a recommendation setting because the items selected for querying for labels may go to the end-user to enable customization and personalization. These would not be shown any differently to the end-user compared to recommended content (i.e., the examples predicted as positive). Additionally, given our interest in imbalanced data, we measure F-score instead of accuracy, which is traditionally considered by online classification algorithms. We also compare the querying strategies on five classification tasks from different domains, and show that the probabilistic query strategy achieves higher F-scores on both types of evaluation than the deterministic strategy, especially when the budget is small, and that the asymmetric query model further improves performance. When compared to the state-of-the-art cost-sensitive online learning algorithm under a budget, our online classification algorithm with asymmetric querying achieves a higher F-score on four of the five tasks, especially on the push evaluation." "time-series-and-stream-mining";"TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size";"We present TRIEST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions." "time-series-and-stream-mining";"Recurrent Marked Temporal Point Processes: Embedding Event History to Vector";"Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and time-series data where time and space are treated as indexes rather than random variables. Marked temporal point processes are the mathematical framework for modeling event data with covariates. However, typical point process models often make strong assumptions about the generative processes of the event data, which may or may not reflect the reality, and the specifically fixed parametric assumptions have also restricted the expressive power of the respective processes. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data?" "time-series-and-stream-mining";"Improving Survey Aggregation with Sparsely Represented Signals";"In this paper, we develop a new aggregation technique to reduce the cost of surveying. Our method aims to jointly estimate a vector of target quantities such as public opinion or voter intent across time and maintain good estimates when using only a fraction of the data. 
Inspired by the James-Stein estimator, we resolve this challenge by shrinking the estimates toward a global mean which is assumed to have a sparse representation in some known basis. This assumption has led to two different methods for estimating the global mean: orthogonal matching pursuit and deep learning. Both significantly reduce the number of samples needed to achieve good estimates of the true means of the data and, in the case of presidential elections, can estimate the outcome of the 2012 United States elections while saving hundreds of thousands of samples and maintaining accuracy." "time-series-and-stream-mining";"Smart Reply: Automated Response Suggestion for Email";"In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning." "time-series-and-stream-mining";"Lightweight Monitoring of Distributed Streams";"As data becomes dynamic, large, and distributed, there is increasing demand for what have become known as distributed stream algorithms. Since continuously collecting the data to a central server and processing it there incurs very high communication and computation complexities, it is advantageous to define local conditions at the nodes, such that – as long as they are maintained – some desirable global condition holds." "time-series-and-stream-mining";"MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories";"In this paper, we study the problem of mining temporally anomalous sub-trajectory patterns from an input trajectory in a scalable manner. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we design a technique called MANTRA. We study the properties unique to anomalous sub-trajectories and utilize them in MANTRA to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets establish MANTRA as more than 3 orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we demonstrate that the proposed model conforms to human intuition." "time-series-and-stream-mining";"Burstiness Scale: a highly parsimonious model for characterizing random series of events";"The problem of accurately and parsimoniously characterizing random series of events (RSEs) seen in the Web, such as Yelp reviews or Twitter hashtags, is not trivial. Reports found in the literature reveal two apparently conflicting visions of how RSEs should be modeled. On one side are the Poissonian processes, in which consecutive events follow each other at relatively regular times and should not be correlated. 
On the other side are the self-exciting processes, which are able to generate bursts of correlated events. The existence of many and sometimes conflicting approaches to model RSEs is a consequence of the unpredictability of the aggregated dynamics of our individual and routine activities, which sometimes show simple patterns, but sometimes result in irregular rising and falling trends. In this paper we propose a parsimonious way to characterize general RSEs, namely the Burstiness Scale (BuSca) model. BuSca views each RSE as a mix of two independent processes: a Poissonian and a self-exciting one. Here we describe a fast method to extract the two parameters of BuSca that, together, give the burstiness scale, which represents how much of the RSE is due to bursty and viral effects. We validated our method on eight diverse and large datasets containing real random series of events seen in Twitter, Yelp, e-mail conversations, Digg, and online forums. Results showed that, even using only two parameters, BuSca is able to accurately describe RSEs seen in these diverse systems, which can benefit many applications." "time-series-and-stream-mining";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide supporting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources, we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "time-series-and-stream-mining";"Data-driven Automatic Treatment Regimen Development and Recommendation";"The analysis of large-scale Electronic Medical Records (EMRs) has the potential to develop and optimize clinical treatment regimens. A treatment regimen usually includes a series of doctor orders containing rich temporal and heterogeneous information. However, in many existing studies, a doctor order is simplified as an event code and a treatment record is simplified as a code sequence. Thus, the information inherent in doctor orders is not fully used for in-depth analysis. In this paper, we aim at exploiting the rich information in doctor orders and developing data-driven approaches for improving clinical treatments. To this end, we first propose a novel method to measure the similarities between treatment records with consideration of sequential and multifaceted information in doctor orders. Then, we propose an efficient density-based clustering algorithm to summarize large-scale treatment records, and extract a semantic representation of each treatment cluster. 
Finally, we develop a unified framework to evaluate the discovered treatment regimens, and find the most effective treatment regimen for new patients. In the empirical study, we validate our methods with EMRs of 27,678 patients from 14 hospitals. The results show that: 1) Our method can successfully extract typical treatment regimens from large-scale treatment records. The extracted treatment regimens are intuitive and provide managerial implications for treatment regimen design and optimization. 2) By recommending the most effective treatment regimens, the total cure rate in our data improves from 19.89% to 21.28%, and the effective rate increases up to 98.29%." "time-series-and-stream-mining";"Assessing Human Error Against a Benchmark of Perfection";"An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors." "time-series-and-stream-mining";"Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs";"Given a stream of heterogeneous graphs containing different types of nodes and edges, how can we spot anomalous ones in real-time while consuming bounded memory? This problem is motivated by and generalizes its application in security to host-level advanced persistent threat (APT) detection. We propose StreamSpot, a clustering based anomaly detection approach that addresses challenges on two key fronts: (1) heterogeneity, and (2) streaming nature. We introduce a new similarity function for heterogeneous graphs that compares two graphs based on their relative frequency of local substructures, represented as short strings. This function lends itself to a vector representation of a graph, which is (a) fast to compute, and (b) amenable to a sketched version with bounded size that preserves similarity." "time-series-and-stream-mining";"Efficient Frequent Directions Algorithm for Sparse Matrices";"This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ ℝ^(n×d) one by one in the streaming setting and compute a small sketch B ∈ ℝ^(ℓ×d). Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions, which runs in O(ndℓ) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)ℓ + nℓ²) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data." "mining-rich-data-types";"GMove: Group-Level Mobility Modeling using Geo-Tagged Social Media";"Understanding human mobility is of great importance to various applications, such as urban planning, traffic scheduling, and location prediction. 
While there has been fruitful research on modeling human mobility using tracking data (e.g., GPS traces), the recent growth of geo-tagged social media (GeoSM) brings new opportunities to this task because of its sheer size and multi-dimensional nature. Nevertheless, how to obtain quality mobility models from the highly sparse and complex GeoSM data remains a challenge that cannot be readily addressed by existing techniques." "mining-rich-data-types";"Structural Deep Network Embedding";"Network embedding is an important method to learn low-dimensional representations of vertices in networks, aiming to capture and preserve the network structure. Almost all the existing network embedding methods adopt shallow models. However, since the underlying network structure is complex, shallow models cannot capture the highly non-linear network structure, resulting in sub-optimal network representations. Therefore, how to find a method that is able to effectively capture the highly non-linear network structure and preserve the global and local structure is an open yet important problem. To solve this problem, in this paper we propose a Structural Deep Network Embedding method, namely SDNE. More specifically, we first propose a semi-supervised deep model, which has multiple layers of non-linear functions, thereby being able to capture the highly non-linear network structure. Then we propose to exploit the first-order and second-order proximity jointly to preserve the network structure. The second-order proximity is used by the unsupervised component to capture the global network structure, while the first-order proximity is used as the supervised information in the supervised component to preserve the local network structure. By jointly optimizing them in the semi-supervised deep model, our method can preserve both the local and global network structure and is robust to sparse networks. Empirically, we conduct experiments on five real-world networks, including a language network, a citation network and three social networks. The results show that compared to the baselines, our method can reconstruct the original network significantly better and achieves substantial gains in three applications, i.e. multi-label classification, link prediction and visualization." "mining-rich-data-types";"Graph Wavelets via Sparse Cuts";"Modeling information that resides on vertices of large graphs is a key problem in several real-life applications, ranging from social networks to the Internet-of-things. Signal Processing on Graphs and, in particular, graph wavelets can exploit the intrinsic smoothness of these datasets in order to represent them in a compact and accurate manner. However, how to discover wavelet bases that capture the geometry of the data with respect to the signal as well as the graph structure remains an open problem. In this paper, we study the problem of computing graph wavelet bases via sparse cuts in order to produce low-dimensional encodings of data-driven bases. This problem is connected to known hard problems in graph theory (e.g. multiway cuts) and thus requires an efficient heuristic. We formulate the basis discovery task as a relaxation of a vector optimization problem, which leads to an elegant solution as a regularized eigenvalue computation. Moreover, we propose several strategies in order to scale our algorithm to large graphs. 
Experimental results show that the proposed algorithm can effectively encode both the graph structure and signal, producing compressed and accurate representations for vertex values in a wide range of datasets (e.g. sensor and gene networks) and significantly outperforming the best baseline." "mining-rich-data-types";"Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics";"Vehicle trajectories are one of the most important data in location-based services. The quality of trajectories directly affects the services. However, in real applications, trajectory data are not always sampled densely. In this paper, we study the problem of recovering the entire route between two distant consecutive locations in a trajectory. Most existing works solve the problem without using the informative historical data, or solve it in an empirical way. We claim that a data-driven and probabilistic approach is actually more suitable, as long as data sparsity can be well handled. We propose a novel route recovery system in a fully probabilistic way which incorporates both temporal and spatial dynamics and addresses the data sparsity problems introduced by the probabilistic method. It outperforms the existing works with a high accuracy (over 80%) and shows strong robustness even when the length of routes to be recovered is very long (about 30 road segments) or the data is very sparse." "mining-rich-data-types";"Squish: Near-Optimal Compression for Archival of Relational Datasets";"Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes." "mining-rich-data-types";"Beyond Sigmoids: the NetTide Model for Social Network Growth, and its Applications";"What is the growth pattern of social networks, like Facebook and WeChat? Does it truly exhibit exponential early growth, as predicted by textbook models like the Bass model, SI, or the Branching Process? How about the count of links, over time, for which there are few published models?" "mining-rich-data-types";"Recurrent Marked Temporal Point Processes: Embedding Event History to Vector";"Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and time-series data where time and space are treated as indexes rather than random variables. Marked temporal point processes are the mathematical framework for modeling event data with covariates. However, typical point process models often make strong assumptions about the generative processes of the event data, which may or may not reflect the reality, and the specifically fixed parametric assumptions have also restricted the expressive power of the respective processes. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data?" 
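As a concrete illustration of the fixed parametric intensities that the marked temporal point process abstract above refers to, the sketch below computes the conditional intensity and log-likelihood of a classical exponential-kernel Hawkes process. This is a textbook example, not the recurrent model proposed in that paper; the parameter names mu, alpha and beta are illustrative.

import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    # lambda*(t) = mu + alpha * sum over past events t_j < t of exp(-beta * (t - t_j))
    history = np.asarray(history, dtype=float)
    past = history[history < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def hawkes_loglik(events, T, mu, alpha, beta):
    # log L = sum_i log lambda*(t_i) - integral_0^T lambda*(s) ds;
    # the integral has a closed form for the exponential kernel.
    events = np.asarray(events, dtype=float)
    log_term = sum(np.log(hawkes_intensity(t, events, mu, alpha, beta)) for t in events)
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - events)))
    return log_term - compensator

# Example: three events on [0, 10] with illustrative parameters.
print(hawkes_loglik([1.0, 2.5, 6.0], T=10.0, mu=0.2, alpha=0.5, beta=1.0))

Under such a fixed form, all flexibility lies in a handful of scalar parameters, which is exactly the kind of restriction the abstract argues limits expressive power.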
"mining-rich-data-types";"Absolute Fused Lasso and Its Application to Genome-Wide Association Studies";"In many real-world applications, the samples/features acquired are in spatial or temporal order. In such cases, the magnitudes of adjacent samples/features are typically close to each other. Meanwhile, in the high-dimensional scenario, identifying the most relevant samples/features is also desired. In this paper, we consider a regularized model which can simultaneously identify important features and group similar features together. The model is based on a penalty called Absolute Fused Lasso (AFL). The AFL penalty encourages sparsity in the coefficients as well as their successive differences of absolute values—i.e., local constancy of the coefficient components in absolute values. Due to the non-convexity of AFL, it is challenging to develop efficient algorithms to solve the optimization problem. To this end, we employ the Difference of Convex functions (DC) programming to optimize the proposed non-convex problem. At each DC iteration, we adopt the proximal algorithm to solve a convex regularized sub-problem. One of the major contributions of this paper is to develop a highly efficient algorithm to compute the proximal operator. Empirical studies on both synthetic and real-world data sets from Genome-Wide Association Studies demonstrate the efficiency and effectiveness of the proposed approach in simultaneous identifying important features and grouping similar features." "mining-rich-data-types";"Diversified Temporal Subgraph Pattern Mining";"Many graphs in real-world applications, such as telecommunications networks, social-interaction graphs and co-authorship graphs, contain temporal information. However, existing graph mining algorithms fail to exploit these temporal information and the resulting subgraph patterns do not contain any temporal attribute. In this paper, we study the problem of mining a set of diversified temporal subgraph patterns from a temporal graph, where each subgraph is associated with the time interval that the pattern spans. This problem motivates important applications such as finding social trends in social networks, or detecting temporal hotspots in telecommunications networks. We propose a divide-and-conquer algorithm along with effective pruning techniques, and our approach runs 2 to 3 orders of magnitude faster than a baseline algorithm and obtains high-quality temporal subgraph patterns in real temporal graphs." "mining-rich-data-types";"Improving Survey Aggregation with Sparsely Represented Signals";"In this paper, we develop a new aggregation technique to reduce the cost of surveying. Our method aims to jointly estimate a vector of target quantities such as public opinion or voter intent across time and maintain good estimates when using only a fraction of the data. Inspired by the James-Stein estimator, we resolve this challenge by shrinking the estimates to a global mean which is assumed to have a sparse representation in some known basis. This assumption has lead to two different methods for estimating the global mean: orthogonal matching pursuit and deep learning. Both of which significantly reduce the number of samples needed to achieve good estimates of the true means of the data and, in the case of presidential elections, can estimate the outcome of the 2012 United States elections while saving hundreds of thousands of samples and maintaining accuracy." 
"mining-rich-data-types";"A Subsequence Interleaving Model for Sequential Pattern Mining";"Recent sequential pattern mining methods have used the minimum description length (MDL) principle to define an encoding scheme which describes an algorithm for mining the most compressing patterns in a database. We present a novel subsequence interleaving model based on a probabilistic model of the sequence database, which allows us to search for the most compressing set of patterns without designing a specific encoding scheme. Our proposed algorithm is able to efficiently mine the most relevant sequential patterns and rank them using an associated measure of interestingness. The efficient inference in our model is a direct result of our use of a structural expectation-maximization framework, in which the expectation-step takes the form of a submodular optimization problem subject to a coverage constraint. We show on both synthetic and real world datasets that our model mines a set of sequential patterns with low spuriousness and redundancy, high interpretability and usefulness in real-world applications. Furthermore, we demonstrate that the quality of the patterns from our approach is comparable to, if not better than, existing state of the art sequential pattern mining algorithms." "mining-rich-data-types";"Unbounded Human Learning: Optimal Scheduling for Spaced Repetition";"In the study of human learning, there is broad evidence that our ability to retain information improves with repeated exposure and decays with delay since last exposure. This plays a crucial role in the design of educational software, leading to a trade-off between teaching new material and reviewing what has already been taught. A common way to balance this trade-off is spaced repetition, which uses periodic review of content to improve long-term retention. Though spaced repetition is widely used in practice, e.g., in electronic flash-card software, there is little formal understanding of the design of these systems. Our paper addresses this gap in three ways. First, we mine log data from spaced repetition software to establish the functional dependence of retention on reinforcement and delay. Second, we use this memory model to develop a stochastic model for spaced repetition systems. We propose a queueing network model of the Leitner system for reviewing flashcards, along with a heuristic approximation that admits a tractable optimization problem for review scheduling. Finally, we empirically evaluate our queueing model through a Mechanical Turk experiment, verifying a key qualitative prediction of our model: the existence of a sharp phase transition in learning outcomes upon increasing the rate of new item introductions." "mining-rich-data-types";"Topic Modeling of Short Texts: A Pseudo-Document View";"Recent years have witnessed the unprecedented growth of online social media, which empower short texts as the prevalent format for information of Internet. Given the nature of sparsity, however, short text topic modeling remains a critical yet much-watched challenge in both academy and industry. Rich research efforts have been put on building different types of probabilistic topic models for short texts, among which the self aggregation methods without using auxiliary information become an emerging solution for pro-viding informative cross-text word co-occurrences. 
However, models along this line are still rarely seen, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, in this paper, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of pseudo documents to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than short texts, PTM is expected to gain excellent performance in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying a Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets with state-of-the-art baselines demonstrate the high quality of topics learned by PTM and its robustness with reduced training samples. It is also interesting to show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. We finally conduct an in-depth semantic analysis to directly unveil the role of pseudo documents in finding cross-text word co-occurrences for topic modeling." "mining-rich-data-types";"FRAUDAR: Bounding Graph Fraud in the Face of Camouflage";"Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look “normal”. Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic." "mining-rich-data-types";"Predicting Socio-Economic Indicators using News Events";"Many socio-economic indicators are sensitive to real-world events. Proper characterization of the events can help to identify the relevant events that drive fluctuations in these indicators. In this paper, we propose a novel generative model of real-world events and employ it to extract events from a large corpus of news articles. We introduce the notion of an event class, which is an abstract grouping of similarly themed events. These event classes are manifested in news articles in the form of event triggers, which are specific words that describe the actions or incidents reported in any article. We use the extracted events to predict fluctuations in different socioeconomic indicators. Specifically, we focus on food prices and predict the price of 12 different crops based on real-world events that potentially influence food price volatility, such as transport strikes, festivals, etc. Our experiments demonstrate that incorporating event information in the prediction tasks reduces the root mean square error (RMSE) of prediction by 22% compared to the standard ARIMA model. We also predict sudden increases in the food prices (i.e. spikes) using events as features, and achieve an average 5-10% increase in accuracy compared to baseline models, including an LDA topic-model based predictive model. 
" "mining-rich-data-types";"CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors";"Representing and summarizing human behaviors with rich contexts facilitates behavioral sciences and user-oriented services. Traditional behavioral modeling represents a behavior as a tuple in which each element is one contextual factor of one type, and the tensor-based summaries look for high-order dense blocks by clustering the values (including timestamps) in each dimension. However, the human behaviors are multicontextual and dynamic: (1) each behavior takes place within multiple contexts in a few dimensions, which requires the representation to enable non-value and set-values for each dimension; (2) many behavior collections, such as tweets or papers, evolve over time. In this paper, we represent the behavioral data as a two-level matrix (temporal-behaviors by dimensional-values) and propose a novel representation for behavioral summary called Tartan that includes a set of dimensions, the values in each dimension, a list of consecutive time slices and the behaviors in each slice. We further develop a propagation method CATCHTAR-TAN to catch the dynamic multicontextual patterns from the temporal multidimensional data in a principled and scalable way: it determines the meaningfulness of updating every element in the Tartan by minimizing the encoding cost in a compression manner. CATCHTARTAN outperforms the baselines on both the accuracy and speed. We apply CATCHTARTAN to four Twitter datasets up to 10 million tweets and the DBLP data, providing comprehensive summaries for the events, human life and scientific development." "mining-rich-data-types";"Unified Point-of-Interest Recommendation with Temporal Interval Assessment";"Point-of-interest (POI) recommendation, which helps mobile users explore new places, has become an important location-based service. Existing approaches for POI recommendation have been mainly focused on exploiting the information about user preferences, social influence, and geographical influence. However, these approaches cannot handle the scenario where users are expecting to have POI recommendation for a specific time period. To this end, in this paper, we propose a unified recommender system, named the ‘Where and When to gO’ (WWO) recommender system, to integrate the user interests and their evolving sequential preferences with temporal interval assessment. As a result, the WWO system can make recommendations dynamically for a specific time period and the traditional POI recommender system can be treated as the special case of the WWO system by setting this time period long enough. Specifically, to quantify users’ sequential preferences, we consider the distributions of the temporal intervals between dependent POIs in the historical check-in sequences. Then, to estimate the distributions with only sparse observations, we develop the low-rank graph construction model, which identifies a set of bi-weighted graph bases so as to learn the static user preferences and the dynamic sequential preferences in a coherent way. Finally, we evaluate the proposed approach using real-world data sets from several location-based social networks (LBSNs). The experimental results show that our method outperforms the state-of-the-art approaches for POI recom-mendation in terms of various metrics, such as F-measure and NDCG, with a significant margin." 
"mining-rich-data-types";"Asymmetric Transitivity Preserving Graph Embedding";"Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. The existing graph embedding methods cannot preserve the asymmetric transitivity well, which is a critical property of directed graphs. Asymmetric transitivity depicts the correlation among directed edges, that is, if there is a directed path from u to v, then there is likely a directed edge from u to v. Asymmetric transitivity can help in capturing structures of graphs and recovering from partially observed graphs. To tackle this challenge, we propose the idea of preserving asymmetric transitivity by approximating high-order proximity which are based on asymmetric transitivity. In particular, we develop a novel graph embed-ding algorithm, High-Order Proximity preserved Embedding (HOPE for short), which is scalable to preserve high-order proximities of large scale graphs and capable of capturing the asymmetric transitivity. More specifically, we first derive a general formulation that cover multiple popular high-order proximity measurements, then propose a scalable embedding algorithm to approximate the high-order proximity measurements based on their general formulation. Moreover, we provide a theoretical upper bound on the RMSE (Root Mean Squared Error) of the approximation. Our empirical experiments on a synthetic dataset and three real-world datasets demonstrate that HOPE can approximate the high-order proximities significantly better than the state-of-art algorithms and outperform the state-of-art algorithms in tasks of reconstruction, link prediction and vertex recommendation." "mining-rich-data-types";"Predicting Matchups and Preferences in Context";"We present a general probabilistic framework for predicting the outcome of pairwise matchups (e.g. two-player sport matches) and pairwise preferences (e.g. product preferences), both of which have widespread applications ranging from matchmaking in computer games to recommendation in e-commerce. Unlike existing models for these tasks, our model not only learns representations of the items in a more expressive latent vector space, but also models how context modifies matchup and preference outcomes. For example, the context “weather” may alter the winning probability in a tennis match, or the fact that the user is on a mobile device may alter his preferences among restaurants. More generally, the model is capable of handling any symmetric game/comparison problem that can be described by vectorized player/item and game/context features. We provide a comprehensive evaluation of its predictive performance with real datasets from both domains to show its ability to predict preference and game outcomes more accurately than existing models. Furthermore, we demonstrate on synthetic datasets the expressiveness of the model when compared against theoretical limits." "mining-rich-data-types";"Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding";"Current systems of fine-grained entity typing use distant supervision in conjunction with existing knowledge bases to assign categories (type labels) to entity mentions. However, the type labels so obtained from knowledge bases are often noisy (i.e., incorrect for the entity mention’s local context). 
We define a new task, Label Noise Reduction in Entity Typing (LNR), to be the automatic identification of correct type labels (type-paths) for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy. The unknown type labels for individual entity mentions and the semantic similarity between entity types pose unique challenges for solving the LNR task. We propose a general framework, called PLE, to jointly embed entity mentions, text features and entity types into the same low-dimensional space, where objects whose types are semantically close have similar representations. Then we estimate the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlation derived from knowledge bases. Our experiments on three public typing datasets demonstrate the effectiveness and robustness of PLE, with an average of 25% improvement in accuracy compared to the next best method." "mining-rich-data-types";"A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm";"Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In the Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem of searching for the Multiple Longest Common Subsequences (MLCS) of multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms cannot be applied to long and large-scale sequence alignments. To overcome their defects and tackle longer and large-scale or even big sequence alignments, we present a novel parallel MLCS algorithm based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization. Exhaustive experiments on datasets of both synthetic and real-world biological sequences demonstrate that both the time and space requirements of the proposed algorithm are only linear in the number of dominants from the aligned sequences, and that the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequence alignments." "mining-rich-data-types";"Latent Space Model for Road Networks to Predict Time-Varying Traffic";"Real-time traffic prediction from high-fidelity spatiotemporal traffic sensor datasets is an important problem for intelligent transportation systems and sustainability. However, it is challenging due to the complex topological dependencies and high dynamism associated with changing road conditions. In this paper, we propose a Latent Space Model for Road Networks (LSM-RN) to address these challenges holistically. In particular, given a series of road network snapshots, we learn the attributes of vertices in latent spaces which capture both topological and temporal properties. As these latent attributes are time-dependent, they can estimate how traffic patterns form and evolve. In addition, we present an incremental online algorithm which sequentially and adaptively learns the latent attributes from the temporal graph changes. 
Our frame-work enables real-time traffic prediction by 1) exploiting real-time sensor readings to adjust/update the existing latent spaces, and 2) training as data arrives and making predictions on-the-fly. By con-ducting extensive experiments with a large volume of real-world traffic sensor data, we demonstrate the superiority of our frame-work for real-time traffic prediction on large road networks over competitors as well as baseline graph-based LSM’s." "mining-rich-data-types";"FINAL: Fast Attributed Network Alignment";"Multiple networks naturally appear in numerous high-impact applications. Network alignment (i.e., finding the node correspondence across different networks) is often the very first step for many data mining tasks. Most, if not all, of the existing alignment methods are solely based on the topology of the underlying networks. Nonetheless, many real networks often have rich at-tribute information on nodes and/or edges. In this paper, we propose a family of algorithms (FINAL) to align attributed networks. The key idea is to leverage the node/edge attribute information to guide (topology-based) alignment process. We formulate this problem from an optimization perspective based on the alignment consistency principle, and develop effective and scalable algorithms to solve it. Our experiments on real networks show that (1) by leveraging the attribute information, our algorithms can significantly improve the alignment accuracy (i.e., up to a 30% improvement over the existing methods); (2) compared with the exact solution, our proposed fast alignment algorithm leads to a more than 10× speed-up, while preserving a 95% ac-curacy; and (3) our on-query alignment method scales linearly, with an around 90% ranking accuracy compared with our exact full alignment method and a near real-time response time." "mining-rich-data-types";"Efficient Shift-Invariant Dictionary Learning";"Shift-invariant dictionary learning (SIDL) refers to the problem of discovering a set of latent basis vectors (the dictionary) that captures informative local patterns at different locations of the input sequences, and a sparse coding for each sequence as a linear combination of the latent basis elements. It differs from conventional dictionary learning and sparse coding where the latent basis has the same dimension as the input vectors, where the focus is on global patterns instead of shift-invariant local patterns. Unsupervised discovery of shift-invariant dictionary and the corresponding sparse coding has been an open challenge as the number of candidate local patterns is extremely large, and the number of possible linear combinations of such local patterns is even more so. In this paper we propose a new framework for unsupervised discovery of both the shift-invariant basis and the sparse coding of input data, with efficient algorithms for tractable optimization. Empirical evaluations on multiple time series data sets demonstrate the effectiveness and efficiency of the proposed method." "mining-rich-data-types";"Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences";"Given a large, online stream of multiple co-evolving event sequences, such as sensor data and Web-click logs, that contains various types of non-linear dynamic evolving patterns of different durations, how can we efficiently and effectively capture important patterns? How do we go about forecasting long-term future events?" 
"mining-rich-data-types";"Point-of-Interest Recommendations: Learning Potential Check-ins from Friends";"The emergence of Location-based Social Network (LBSN) services provides a wonderful opportunity to build personalized Point-of-Interest (POI) recommender systems. Although a personalized POI recommender system can significantly facilitate users’ outdoor activities, it faces many challenging problems, such as the hardness to model user’s POI decision making process and the difficulty to address data sparsity and user/location cold-start problem. To cope with these challenges, we define three types of friends (i.e., social friends, location friends, and neighboring friends) in LBSN, and develop a two-step framework to leverage the information of friends to improve POI recommendation accuracy and address cold-start problem. Specifically, we first propose to learn a set of potential locations that each individual’s friends have checked-in before and this individual is most interested in. Then we incorporate three types of check-ins (i.e., observed check-ins, potential check-ins and other unobserved check-ins) into matrix factorization model using two different loss functions (i.e., the square error based loss and the ranking error based loss). To evaluate the proposed model, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on two real-world data sets. The experimental results demonstrate the effectiveness of our methods." "mining-rich-data-types";"Structural Neighborhood based Classification of Nodes in a Network";"Classification of entities based on the underlying network structure is an important problem. Networks encountered in practice are sparse and have many missing and noisy links. Statistical learning techniques have been used in intra-network classification; however, they typically exploit only the local neighborhood, so may not perform well. In this paper, we propose a novel structural neighborhood-based classifier learning using a random walk. For classifying a node, we take a random walk from the node and make a decision based on how nodes in the respective kth-level neighborhood are labeled. We observe that random walks of short length are helpful in classification. Emphasizing role of longer random walks may cause the underlying Markov chain to converge to a stationary distribution. Considering this, we take a lazy random walk based approach with variable termination probability for each node, based on the node’s structural properties including its degree. Our experimental study on real world datasets demonstrates the superiority of the proposed approach over the existing state-of-the-art approaches." "mining-rich-data-types";"QUINT: On Query-Specific Optimal Networks";"Measuring node proximity on large scale networks is a fundamental building block in many application domains, ranging from computer vision, e-commerce, social networks, software engineering, disaster management to biology and epidemiology. The state of the art (e.g., random walk based methods) typically assumes the input network is given a priori, with the known network topology and the associated edge weights. A few recent works aim to further infer the optimal edge weights based on the side information." "mining-rich-data-types";"DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks";"In this paper, we investigate the use of recurrent neural networks (RNNs) in the context of search-based online advertising. 
We use RNNs to map both queries and ads to real valued vectors, with which the relevance of a given (query, ad) pair can be easily computed. On top of the RNN, we propose a novel attention network, which learns to assign attention scores to different word locations according to their intent importance (hence the name DeepIntent). The vector output of a sequence is thus computed by a weighted sum of the hidden states of the RNN at each word according their attention scores. We perform end-to-end training of both the RNN and attention network under the guidance of user click logs, which are sampled from a commercial search engine. We show that in most cases the attention network improves the quality of learned vector representations, evaluated by AUC on a manually labeled dataset. Moreover, we highlight the effectiveness of the learned attention scores from two aspects: query rewriting and a modified BM25 metric. We show that using the learned attention scores, one is able to produce sub-queries that are of better qualities than those of the state-of-the-art methods. Also, by modifying the term frequency with the attention scores in a standard BM25 formula, one is able to improve its performance evaluated by AUC." "mining-rich-data-types";"Dynamics of Large Multi-View Social Networks: Synergy, Cannibalization and Cross-View Interplay";"Most social networking services support multiple types of relation-ships between users, such as getting connected, sending messages, and consuming feed updates. These users and relationships can be naturally represented as a dynamic multi-view network, which is a set of weighted graphs with shared common nodes but having their own respective edges. Different network views, representing structural relationship and interaction types, could have very distinctive properties individually and these properties may change due to interplay across views. Therefore, it is of interest to study how multiple views interact and affect network dynamics and, in addition, explore possible applications to social networking." "mining-rich-data-types";"Effcient Processing of Network Proximity Queries via Chebyshev Acceleration";"Network proximity is at the heart of a large class of network analytics and information retrieval techniques, including node/ edge rankings, network alignment, and random-walk based proximity queries, among many others. Owing to its importance, significant effort has been devoted to accelerating iterative processes underlying network proximity computations. These techniques rely on numerical proper-ties of power iterations, as well as structural properties of the networks to reduce the runtime of iterative algorithms." "mining-rich-data-types";"Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns";"We study how to obtain concise descriptions of discrete multivariate sequential data. In particular, how to do so in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the domains (alphabets) of all sequences, allow patterns to overlap temporally, as well as allow for gaps in their occurrences." "mining-rich-data-types";"Transfer Knowledge between Cities";"The rapid urbanization has motivated extensive research on urban computing. 
It is critical for urban computing tasks to unlock the power of the diversity of data modalities generated by different sources in urban spaces, such as vehicles and humans. However, we are more likely to encounter the label scarcity problem and the data insufficiency problem when solving an urban computing task in a city where services and infrastructures are not ready or just built. In this paper, we propose a FLexible multimOdal tRAnsfer Learning (FLORAL) method to transfer knowledge from a city where there exist sufficient multimodal data and labels to similar kind of cities to fully alleviate the problems of label scarcity and data insufficiency. FLORAL learns semantically related dictionaries for multiple modalities from a source domain and simultaneously transfers the dictionaries and labelled instances from the source into a target domain. We evaluate the proposed method with a real-world study of air quality prediction." "mining-rich-data-types";"Burstiness Scale: a highly parsimonious model forcharacterizing random series of events";"The problem to accurately and parsimoniously characterize random series of events (RSEs) seen in the Web, such as Yelp reviews or Twitter hashtags, is not trivial. Reports found in the literature reveal two apparent conflicting visions of how RSEs should be modeled. From one side, the Poissonian processes, of which consecutive events follow each other at a relatively regular time and should not be correlated. On the other side, the self-exciting processes, which are able to generate bursts of correlated events. The existence of many and sometimes conflicting approaches to model RSEs is a consequence of the unpredictability of the aggregated dynamics of our individual and routine activities, which sometimes show simple patterns, but sometimes results in irregular rising and falling trends. In this paper we propose a parsimonious way to characterize general RSEs, namely the Burstiness Scale (BuSca) model. BuSca views each RSE as a mix of two independent process: a Poissonian and a self-exciting one. Here we describe a fast method to extract the two parameters of BuSca that, together, gives the burstiness scale , which represents how much of the RSE is due to bursty and viral effects. We validated our method in eight diverse and large datasets containing real random series of events seen in Twitter, Yelp, e-mail conversations, Digg, and online forums. Results showed that, even using only two parameters, BuSca is able to accurately describe RSEs seen in these diverse systems, what can leverage many applications." "mining-rich-data-types";"Compact and Scalable Graph Neighborhood Sketching";"The all-distances sketch (ADS) has recently emerged as a promising paradigm of graph neighborhood sketching. An ADS is a probabilistic data structure that is defined for each vertex of a graph. ADSs facilitate accurate estimation of many useful indicators for network analysis with the guarantee of accuracy, and the ADSs for all the vertices in a graph can be computed in near-linear time. Because of these useful properties, ADS has attracted considerable attention. However, a critical drawback of ADS is its space requirement, which tends to be much larger than that of the graph itself. In the present study, we address this issue by designing a new graph sketching scheme, namely, sketch retrieval shortcuts (SRS). Although SRSs are more space-efficient than ADSs by an order of magnitude, an ADS of any vertex can be quickly retrieved from the SRSs. 
The retrieved ADSs can be used to estimate the aforementioned indicators in exactly the same manner as with plain ADSs, inheriting the same accuracy guarantee. Our experiments on real-world networks demonstrate the usefulness of SRSs as a practical back-end of large-scale graph data mining." "mining-rich-data-types";"Finding Gangs in War from Signed Networks";"Given a signed network where edges are weighted in real number, and positive weights indicate cohesion between vertices and negative weights indicate opposition, we are interested in finding k-Oppositive Cohesive Groups (k-OCG). Each k-OCG is a group of k subgraphs such that (1) the edges within each subgraph are dense and cohesive; and (2) the edges crossing different subgraphs are dense and oppositive. Finding k-OCGs is challenging since the subgraphs are often small, there are multiple k-OCGs in a large signed net-work, and many existing dense subgraph extraction methods cannot handle edges of two signs. We model k-OCG finding task as a quadratic optimization problem. However, the classical Proximal Gradient method is very costly since it has to use the entire adjacency matrix, which is huge on large networks. Thus, we develop FOCG, an algorithm that is two orders of magnitudes faster than the Proximal Gradient method. The main idea is to only search in small subgraphs and thus avoids using a major portion of the adjacency matrix. Our experimental results on synthetic and real data sets as well as a case study clearly demonstrate the effectiveness and efficiency of our method." "mining-rich-data-types";"Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detectio";"Attention Deficit Hyperactive Disorder (ADHD) is one of the most common childhood disorders and can continue through adolescence and adulthood. Although the root cause of the problem still remains unknown, recent advancements in brain imaging technology reveal there exists differences between neural activities of Typically Developing Children (TDC) and ADHD subjects. Inspired by this, we propose a novel First-Take-All (FTA) hashing framework to investigate the problem of fast ADHD subjects detection through the fMRI time-series of neuron activities. By hashing time courses from regions of interests (ROIs) in the brain into fixed-size hash codes, FTA can compactly encode the temporal order differences between the neural activity patterns that are key to distinguish TDC and ADHD subjects. Such patterns can be directly learned via minimizing the training loss incurred by the generated FTA codes. By conducting similarity search on the resultant FTA codes, data-driven ADHD detection can be achieved in an efficient fashion. The experiments’ results on real-world ADHD detection bench-marks demonstrate the FTA can outperform the state-of-the-art baselines using only neural activity time series with-out any phenotypic information." "mining-rich-data-types";"Mining Subgroups with Exceptional Transition Behavior";"We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. 
Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data." "mining-rich-data-types";"Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations";"Modern world has witnessed a dramatic increase in our ability to collect, transmit and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. Detecting system anomalies thus attracts significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, invariant network has shown to be a powerful way in characterizing complex system behaviours. In the invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariance network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches to detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible casual components, which have several limitations: 1) fault propagation in the network is ignored; 2) the root casual anomalies may not always be the nodes with a high-percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network diffusion based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network, and can perform joint inference on both the structural, and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations, and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach." "mining-rich-data-types";"Multi-layer Representation Learning for Medical Concepts";"Proper representations of medical concepts such as diagnosis, medication, procedure codes and visits from Electronic Health Records (EHR) has broad applications in healthcare analytics. Patient EHR data consists of a sequence of visits over time, where each visit includes multiple medical concepts, e.g., diagnosis, procedure, and medication codes. 
This hierarchical structure provides two types of relational information, namely sequential order of visits and co-occurrence of the codes within a visit. In this work, we propose Med2Vec, which not only learns the representations for both medical codes and visits from large EHR datasets with over million visits, but also allows us to interpret the learned representations confirmed positively by clinical experts. In the experiments, Med2Vec shows significant improvement in prediction accuracy in clinical applications compared to baselines such as Skip-gram, GloVe, and stacked autoencoder, while providing clinically meaningful interpretation." "mining-rich-data-types";"Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data";"Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that creates repeatedly more complex strings from simpler substrings. We propose a framework, referred to as Lexis, that produces an optimized hierarchical representation of a given set of “target” strings. The resulting hierarchy, “Lexis-DAG”, shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the “core” of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents." "mining-rich-data-types";"Inferring Network Effects from Observational Data";"We present Relational Covariate Adjustment (RCA), a general method for estimating causal effects in relational data. Relational Covariate Adjustment is implemented through two high-level operations: identification of an adjustment set and relational regression adjustment. The former is achieved through an extension of Pearl’s back-door criterion to relational domains. We demonstrate how this extended definition can be used to estimate causal effects in the presence of network interference and confounding. RCA is agnostic to functional form, and it can easily model both discrete and continuous treatments as well as estimate the effects of a wider array of network interventions than existing experimental approaches. We show that RCA can yield robust estimates of causal effects using common regression models without extensive parameter tuning. Through a series of simulation experiments on a variety of synthetic and real- world network structures, we show that causal effects estimated on observational data with RCA are nearly as accurate as those estimated from well-designed network experiments." 
"mining-rich-data-types";"MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories";"In this paper, we study the problem of mining temporally anomalous sub-trajectory patterns from an input trajectory in a scalable manner. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we design a technique called MANTRA. We study the properties unique to anomalous sub-trajectories and utilize them in MANTRA to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets establish MANTRA as more than 3 orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we demonstrate that the proposed model conforms to human intuition." "mining-rich-data-types";"Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems";"In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environmental, or human factors in order to identifying operationally significant events and highlight potential safety risks. For this purpose, we propose a framework which represents each flight using a semi-Markov switching vector autoregressive (SMS-VAR) model. Detection of anomalies is then based on measuring dissimilarities between the model’s prediction and data observation. The framework is scalable, due to the inherent parallel nature of most computations, and can be used to perform online anomaly detection. Extensive experimental results on simulated and real datasets illustrate that the framework can detect various types of anomalies along with the key parameters involved." "mining-rich-data-types";"PTE: Enumerating Trillion Triangles On Distributed Systems";"How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications including identifying suspicious users in social networks, detecting web spams, finding communities, etc. However, recent networks are so large that most of the previous algorithms fail to process them. Recently, several MapReduce algorithms have been proposed to address such large networks; however, they suffer from the massive shuffled data resulting in a very long processing time." "mining-rich-data-types";"Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs";"Given a stream of heterogeneous graphs containing different types of nodes and edges, how can we spot anomalous ones in real-time while consuming bounded memory? This problem is motivated by and generalizes from its application in security to host-level advanced persistent threat (APT) detection. 
We propose StreamSpot, a clustering based anomaly detection approach that addresses challenges in two key fronts: (1) heterogeneity, and (2) streaming nature. We introduce a new similarity function for heterogeneous graphs that compares two graphs based on their relative frequency of local substructures, represented as short strings. This function lends itself to a vector representation of a graph, which is (a) fast to compute, and (b) amenable to a sketched version with bounded size that preserves similarity." "mining-rich-data-types";"ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages";"We present ABRA, a suite of algorithms to compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis rely on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. Our experimental results show that ABRA is much faster than exact methods, and vastly outperforms, in both runtime and number of samples, state-of-the-art algorithms with the same quality guarantees." "mining-rich-data-types";"TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size";"We present TRIEST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions." "mining-rich-data-types";"Distributing the Stochastic Gradient Sampler for Large-Scale LDA";"Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing. The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learn small models that have about 10^3 topics and 10^3 vocabulary size." "mining-rich-data-types";"City-Scale Map Creation and Updating using GPS Collections";"Applications such as autonomous driving or real-time route recommendations require up-to-date and accurate digital maps. However, manually creating and updating such maps is too costly to meet the rising demands. As large collections of GPS trajectories become widely available, constructing and updating maps using such trajectory collections can greatly reduce the cost of such maps. Unfortunately, due to GPS noise and varying trajectory sampling rates, inferring maps from GPS trajectories can be very challenging. In this paper, we present a framework to create up-to-date maps with rich knowledge from GPS trajectory collections. Starting from an unstructured GPS point cloud, we discover road segments using novel graph-based clustering techniques with prior knowledge on road design. 
Based on road segments, we develop a scale- and orientation-invariant traj-SIFT feature to localize and recognize junctions using a supervised learning framework. Maps with rich knowledge are created based on discovered road segments and junctions. Compared to state-of-the-art methods, our approach can efficiently construct high-quality maps at city scales from large collections of GPS trajectories." "mining-rich-data-types";"Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective";"With recent advances in mobile and sensor technologies, a large amount of efforts have been made on developing intelligent applications for taxi drivers, which provide beneficial guide and opportunity to improve the profit and work efficiency. However, limited scopes focus on the latent social interaction within cab drivers, and corresponding social propagation scheme to share driving behaviors has been largely ignored. To that end, in this paper, we propose a comprehensive study to reveal how the social propagation affects for better prediction of cab drivers’ future behaviors. To be specific, we first investigate the correlation between drivers’ skills and their mutual interactions in the latent vehicle-to-vehicle network, which intuitively indicates the effects of social influences. Along this line, by leveraging the classic social influence theory, we develop a two-stage framework for quantitatively revealing the latent driving pattern propagation within taxi drivers. Comprehensive experiments on a real-word data set collected from the New York City clearly validate the effectiveness of our proposed framework on predicting future taxi driving behaviors, which also support the hypothesis that social factors indeed improve the predictability of driving behaviors." "mining-rich-data-types";"Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization";"Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. A key to success for a bike sharing systems is the effectiveness of rebalancing operations, that is, the effort-s of restoring the number of bikes in each station to its target value by routing vehicles through pick-up and drop-off operations. There are two major issues for this bike rebalancing problem: the determination of station inventory target level and the large scale multiple capacitated vehicle routing optimization with outlier stations. The key challenges include demand prediction accuracy for inventory target level determination, and an effective optimizer for vehicle routing with hundreds of stations. To this end, in this paper, we develop a Meteorology Similarity Weighted K-Nearest-Neighbor (M-SWK) regressor to predict the station pick-up demand based on large-scale historic trip records. Based on further analysis on the station network constructed by station-station connections and the trip duration, we propose an inter station bike transition (ISBT) model to predict the station drop-off demand. Then, we provide a mixed integer nonlinear programming (MINLP) formulation of multiple capacitated bike routing problem with the objective of minimizing total travel distance. 
To solve it, we propose an Adaptive Capacity Constrained K-centers Clustering (AdaCCKC) algorithm to separate outlier stations (the demands of these stations are very large and make the optimization infeasible) and group the rest stations into clusters within which one vehicle is scheduled to redistribute bikes between stations. In this way, the large scale multiple vehicle routing problem is reduced to inner cluster one vehicle routing problem with guaranteed feasible solutions. Finally, the extensive experimental results on the NYC Citi Bike system show the advantages of our approach for bike demand prediction and large-scale bike rebalancing optimization." "mining-rich-data-types";"Data-driven Automatic Treatment Regimen Development and Recommendation";"The analysis of large-scale Electrical Medical Records (EMRs) has the potential to develop and optimize clinical treatment regimens. A treatment regimen usually includes a series of doctor orders containing rich temporal and heterogeneous information. However, in many existing studies, a doctor order is simplified as an event code and a treatment record is simplified as a code sequence. Thus, the information inherent in doctor orders is not fully used for in-depth analysis. In this paper, we aim at exploiting the rich information in doctor orders and developing data-driven approaches for improving clinical treatments. To this end, we first propose a novel method to measure the similarities between treatment records with consideration of sequential and multifaceted information in doctor orders. Then, we propose an efficient density-based clustering algorithm to summarize large-scale treatment records, and extract a semantic representation of each treatment cluster. Finally, we develop a unified frame-work to evaluate the discovered treatment regimens, and find the most effective treatment regimen for new patients. In the empirical study, we validate our methods with EMRs of 27,678 patients from 14 hospitals. The results show that: 1) Our method can successfully extract typical treatment regimens from large-scale treatment records. The extracted treatment regimens are intuitive and provide managerial implications for treatment regimen design and optimization. 2) By recommending the most effective treatment regimens, the total cure rate in our data improves from 19.89% to 21.28%, and the effective rate increases up to 98.29%." "mining-rich-data-types";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide sup-porting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. 
Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "mining-rich-data-types";"Smart Reply: Automated Response Suggestion for Email";"In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning." "privacy-preserving-data-mining";"Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations";"Modern world has witnessed a dramatic increase in our ability to collect, transmit and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. Detecting system anomalies thus attracts significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, invariant network has shown to be a powerful way in characterizing complex system behaviours. In the invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariance network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches to detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible casual components, which have several limitations: 1) fault propagation in the network is ignored; 2) the root casual anomalies may not always be the nodes with a high-percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network diffusion based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network, and can perform joint inference on both the structural, and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations, and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach." "privacy-preserving-data-mining";"Convex Optimization for Linear Query Processing under Approximate Differential Privacy";"Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals’ privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate strategy, may yield higher accuracy than computing each of them independently. 
However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-convex. Hence, in the past much effort has been devoted in solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem." "privacy-preserving-data-mining";"Privacy-preserving Class Ratio Estimation";"In this paper we present learning models for the class ratio estimation problem, which takes as input an unlabeled set of instances and predicts the proportions of instances in the set belonging to the different classes. This problem has applications in social and commercial data analysis. Existing models for class-ratio estimation however require instance-level supervision. Whereas in domains like politics, and demography, set-level supervision is more common. We present a new method for directly estimating class-ratios using set-level supervision. Another serious limitation in applying these techniques to sensitive domains like health is data privacy. We propose a novel label privacy-preserving mechanism that is well-suited for supervised class ratio estimation and has guarantees for achieving efficient differential privacy, provided the per-class counts are large enough. We derive learning bounds for the estimation with and with-out privacy constraints, which lead to important insights for the data-publisher. Extensive empirical evaluation shows that our model is more accurate than existing methods and that the proposed privacy mechanism and learning model are well-suited for each other." "privacy-preserving-data-mining";"FRAUDAR: Bounding Graph Fraud in the Face of Camouflage";"Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look “normal”. Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic." "privacy-preserving-data-mining";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide sup-porting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. 
Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "optimization-techniques";"Online dual decomposition for performance and delivery-based distributed ad allocation";"Online optimization is central to display advertising, where we must sequentially allocate ad impressions to maximize the total welfare among advertisers, while respecting various advertiser-specified long-term constraints (e.g., total amount of the ad’s budget that is consumed at the end of the campaign). In this paper, we present the online dual decomposition (ODD) framework for large-scale, online, distributed ad allocation, which combines dual decomposition and on-line convex optimization. ODD allows us to account for the distributed and the online nature of the ad allocation problem and is extensible to a variety of ad allocation problems arising in real-world display advertising systems. Moreover, ODD does not require assumptions about auction dynamics, stochastic or adversarial feedback, or any other characteristics of the ad marketplace. We further provide guarantees for the online solution as measured by bounds on cumulative regret. The regret analysis accounts for the impact of having to estimate constraints in an online setting before they are observed and for the dependence on the smooth-ness with which constraints and constraint violations are generated. We provide an extensive set of results from a large-scale production advertising system at Amazon to validate the framework and compare its behavior to various ad allocation algorithms." "optimization-techniques";"Lossless Separation of Web Pages into Layout Code and Data";"A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front/back matters and layout/style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has various important applications: page encoding may be significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval and extraction), and repetitions are diminished (expediting updates and redesign)." "optimization-techniques";"Email Volume Optimization at LinkedIn";"Online social networking services distribute various types of messages geared towards providing increased value to their members. Common types of messages include news, con- nection requests, membership noti_cations, promotions, and event noti_cations. Such communication, if used judiciously, can provide an enormous value to the members. However sending a message for every instance of news, connection re- quest, or the like can result in an overwhelming number of messages in a member’s mailbox. This may result in reduced e_ectiveness of communication if the messages are not su_- ciently relevant to the member’s interests, and potentially a poor brand perception. In this paper, we discuss our strat- egy and experience with regard to the problem of email vol- ume optimization at LinkedIn. In particular, we present a cost-bene_t analysis of sending emails, the key factors to administer an e_ective volume optimization, our algorithm for volume optimization, the architecture of the supporting system, and experimental results from online A/B tests." 
"optimization-techniques";"Portfolio Selections in P2P Lending: A Multi-Objective Perspective";"P2P lending is an emerging wealth-management service for individuals, which allows lenders to directly bid and invest on the loans created by borrowers. In these platforms, lender-s often pursue multiple objectives (e.g., non-default probability, fully-funded probability and winning-bid probability) when they select loans to invest. How to automatically assess loans from these objectives and help lenders select loan portfolios is a very important but challenging problem. To that end, in this paper, we present a holistic study on portfolio selections in P2P lending. Specifically, we first propose to adapt gradient boosting decision tree, which combines both static features and dynamic features, to assess loans from multiple objectives. Then, we propose two strategies, i.e., weighted objective optimization strategy and multi-objective optimization strategy, to select portfolios for lenders. For each lender, the first strategy attempts to provide one optimal portfolio while the second strategy attempts to provide a Pareto-optimal portfolio set. Further, we design two algorithms, namely DPA and EVA, which can efficiently resolve the optimizations in these two strategies, respectively. Finally, extensive experiments on a large-scale real-world data set demonstrate the effectiveness of our solutions." "optimization-techniques";"Optimally Discriminative Choice Sets in Discrete Choice Models: Application to Data-Driven Test Desi";"Difficult multiple-choice (MC) questions can be made easy by providing a set of answer options of which most are obviously wrong. In the education literature, a plethora of instructional guides exist for crafting a suitable set of wrong choices (distractors) that enable the assessment of the students’ understanding. The art of MC question design thus hinges on the question-maker’s experience and knowledge of the potential misconceptions. In contrast, we advocate a data-driven approach, where correct and incorrect options are assembled directly from the students’ own past submissions. Large-scale online classroom settings, such as massively open online courses (MOOCs), provide an opportunity to design optimal and adaptive multiple-choice questions that are maximally informative about the students’ level of understanding of the material. In this work, we (i) develop a multinomial-logit discrete choice model for the setting of MC testing, (ii) derive an optimization objective for selecting optimally discriminative option sets, (iii) propose an algorithm for finding a globally-optimal solution, and (iv) demonstrate the effective-ness of our approach via synthetic experiments and a user study. We finally showcase an application of our approach to crowd-sourcing tests from technical online forums." "optimization-techniques";"Joint Optimization of Multiple Performance Metrics in Online Video Advertising";"The field of online advertising, in essence, deals with the problem of presenting ads to online users in the most appropriate contexts to achieve a multitude of advertiser goals. A vast amount of work in online advertising has been focused on optimizing banner display advertising campaigns where the main goal lies in direct response metrics, often as clicks or conversions. In this paper, we explore the newly popularized space of online video advertising, where brand recognition is the key focus. 
We propose a framework based on a feedback mechanism where we optimize multiple video specific performance indicators while making sure the delivery constraints (budget and user reach) of advertisers are satisfied. While our main focus is on improving metrics such as engagement (amount of view time), and viewability (whether a campaign is within eyesight of a user), we also discuss the possibilities of expanding to other metrics. We demonstrate the benefit of our framework via empirical results in multiple real-world advertising campaigns. To the best of our knowledge, this is the first paper that deals with the unique challenges arising from the nature of online video advertising." "optimization-techniques";"MAP: Frequency-Based Maximization of Airline Profits based on an Ensemble Forecasting Approach";"Though there are numerous traditional models to predict market share and demand along airline routes, the prediction of existing models is not precise enough and, to the best of our knowledge, there is no use of data-mining based forecasting techniques to improve airline profitability. We propose the MAP (Maximizing Air-line Profits) architecture designed to help airlines and make two key contributions in airline market share and route demand prediction and prediction-based airline profit optimization. Compared with past methods to forecast market share and demand along airline routes, we introduce a novel Ensemble Forecasting (MAP-EF) approach considering two new classes of features: (i) features derived from clusters of similar routes, and (ii) features based on equilibrium pricing. We show that MAP-EF achieves much better Pearson Correlation Coefficients (over 0.95 vs. 0.82 for market share, 0.98 vs. 0.77 for demand) and R2-values compared with three state-of-the-art works for forecasting market share and demand, while showing much lower variance. Using the results of MAP-EF, we develop MAP-Bilevel Branch and Bound (MAP-BBB) and MAP-Greedy (MAP-G) algorithms to optimally allocate flight frequencies over multiple routes, to maximize an airline’s profit. Experimental results show that airlines can increase profits by a significant margin. All experiments were conducted with data aggregated from four sources: US Bureau of Transportation Statistics (BTS), US Bureau of Economic Analysis (BEA), the National Transportation Safety Board (NTSB), and the US Census Bureau (CB)." "optimization-techniques";"Matrix Computations and Optimization in Apache Spark";"We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be ran on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving Linear programs as well as a variety of other convex programs. 
We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services." "optimization-techniques";"Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices";"With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or have prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn predictive linear models with a high interpretability from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix is in a compressed form. The original data matrix is grammar-compressed and then the linear model in PLS is learned on the compressed data matrix, which results in a significant reduction in working space, greatly improving scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression and feature extraction with various massive high-dimensional data, and show that cPLS performs superiorly in terms of prediction accuracy, computational efficiency, and interpretability." "semi-supervised-learning";"Partial Label Learning via Feature-Aware Disambiguation";"Partial label learning deals with the problem where each training example is represented by a feature vector while associated with a set of candidate labels, among which only one label is valid. To learn from such ambiguous labeling information, the key is to try to disambiguate the candidate label sets of partial label training examples. Existing disambiguation strategies work by either identifying the ground-truth label iteratively or treating each candidate label equally. Nonetheless, the disambiguation process is generally conducted by focusing on manipulating the label space, and thus ignores making full use of potentially useful information from the feature space. In this paper, a novel two-stage approach is proposed to learning from partial label examples based on feature-aware disambiguation. In the first stage, the manifold structure of feature space is utilized to generate normalized labeling confidences over candidate label set. In the second stage, the predictive model is learned by performing regularized multi-output regression over the generated labeling confidences. Extensive experiments on artificial as well as real-world partial label data sets clearly validate the superiority of the proposed feature-aware disambiguation approach." "semi-supervised-learning";"A Multi-Task Learning Formulation for Survival Analysis";"Predicting the occurrence of a particular event of interest at future time points is the primary goal of survival analysis. 
The presence of incomplete observations due to time limitations or loss of data traces is known as censoring, which brings unique challenges in this domain and differentiates survival analysis from other standard regression methods. The popularly used survival analysis methods such as the Cox proportional hazards model and parametric survival regression suffer from some strict assumptions and hypotheses that are not realistic in most of the real-world applications. To overcome the weaknesses of these two types of methods, we reformulate the survival analysis problem as a multi-task learning problem and propose a new multi-task learning based formulation to predict the survival time by estimating the survival status at each time interval during the study duration. We propose an indicator matrix to enable the multi-task learning algorithm to handle censored instances and incorporate some of the important characteristics of survival problems such as non-negative non-increasing list structure into our model through max-heap projection. We employ the l2,1-norm penalty to learn a shared representation across related tasks and hence select important features and alleviate over-fitting in high-dimensional feature spaces, thus reducing the prediction error of each task. To efficiently handle the two non-smooth constraints, in this paper, we propose an optimization method which employs the Alternating Direction Method of Multipliers (ADMM) algorithm to solve the proposed multi-task learning problem. We demonstrate the performance of the proposed method using real-world microarray gene expression datasets and show that our methods outperform state-of-the-art methods." "semi-supervised-learning";"Smart Reply: Automated Response Suggestion for Email";"In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning." "semi-supervised-learning";"Overcoming key weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure";"This paper introduces the first generic version of data dependent dissimilarity and shows that it provides a better closest match than distance measures for three existing algorithms in clustering, anomaly detection and multi-label classification. For each algorithm, we show that by simply replacing the distance measure with the data dependent dissimilarity measure, it overcomes a key weakness of the otherwise unchanged algorithm." "semi-supervised-learning";"Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding";"Current systems of fine-grained entity typing use distant supervision in conjunction with existing knowledge bases to assign categories (type labels) to entity mentions. However, the type labels so obtained from knowledge bases are often noisy (i.e., incorrect for the entity mention’s local context). We define a new task, Label Noise Reduction in Entity Typing (LNR), to be the automatic identification of correct type labels (type-paths) for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy.
The unknown type labels for individual entity mentions and the semantic similarity between entity types pose unique challenges for solving the LNR task. We propose a general framework, called PLE, to jointly embed entity mentions, text features and entity types into the same low-dimensional space, where objects whose types are semantically close have similar representations. Then we estimate the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlation derived from knowledge bases. Our experiments on three public typing datasets demonstrate the effectiveness and robustness of PLE, with an average of 25% improvement in accuracy compared to the next best method." "semi-supervised-learning";"Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test";"Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly at a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected." "semi-supervised-learning";"Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems";"In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environmental, or human factors in order to identify operationally significant events and highlight potential safety risks.
For this purpose, we propose a framework which represents each flight using a semi-Markov switching vector autoregressive (SMS-VAR) model. Detection of anomalies is then based on measuring dissimilarities between the model’s prediction and data observation. The framework is scalable, due to the inherent parallel nature of most computations, and can be used to perform online anomaly detection. Extensive experimental results on simulated and real datasets illustrate that the framework can detect various types of anomalies along with the key parameters involved." "semi-supervised-learning";"FRAUDAR: Bounding Graph Fraud in the Face of Camouflage";"Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look “normal”. Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic." "semi-supervised-learning";"Goal-Directed Inductive Matrix Completion";"Matrix completion (MC) with additional information has found wide applicability in several machine learning applications. Among algorithms for solving such problems, Inductive Matrix Completion (IMC) has drawn a considerable amount of attention, not only for its well established theoretical guarantees but also for its superior performance in various real-world applications. However, IMC based methods usually place very strong constraints on the quality of the features (side information) to ensure accurate recovery, which might not be met in practice. In this paper, we propose Goal-directed Inductive Matrix Completion (GIMC) to learn a nonlinear mapping of the features so that they satisfy the required properties. A key distinction between GIMC and IMC is that the feature mapping is learnt in a supervised manner, deviating from the traditional approach of unsupervised feature learning followed by model training. We establish the superiority of our method on several popular machine learning applications including multi-label learning, multi-class classification, and semi-supervised clustering." "semi-supervised-learning";"Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning";"Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide supporting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources, we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests.
Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events." "data-reliability-and-truthfulness";"Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach";"The demand for automatic extraction of true information (i.e., truths) from conflicting multi-source data has soared recently. A variety of truth discovery methods have witnessed great successes via jointly estimating source reliability and truths. All existing truth discovery methods focus on providing a point estimator for each object’s truth, but in many real-world applications, confidence interval estimation of truths is more desirable, since confidence interval contains richer information. To address this challenge, in this paper, we propose a novel truth discovery method (ETCIBoot) to construct confidence interval estimates as well as identify truths, where the bootstrapping techniques are nicely integrated into the truth discovery procedure. Due to the properties of bootstrapping, the estimators obtained by ETCIBoot are more accurate and robust compared with the state-of-the-art truth discovery approaches. Theoretically, we prove the asymptotic consistency of the confidence interval obtained by ETCIBoot. Experimentally, we demonstrate that ETCIBoot is not only effective in constructing confidence intervals but also able to obtain better truth estimates." "data-reliability-and-truthfulness";"From Truth Discovery to Trustworthy Opinion Discovery: An Uncertainty-Aware Quantitative Modeling Approach";"In this era of information explosion, conflicts are often encountered when information is provided by multiple sources. The traditional truth discovery task aims to identify the truth – the most trustworthy information – from conflicting sources in different scenarios. In this kind of task, truth is regarded as a fixed value or a set of fixed values. However, in a number of real-world cases, objective truth existence cannot be ensured and we can only identify single or multiple reliable facts from opinions. Different from the traditional truth discovery task, we address this uncertainty and introduce the concept of trustworthy opinion of an entity, treat it as a random variable, and use its distribution to describe consistency or controversy, which is particularly difficult for data which can be numerically measured, i.e., quantitative information. In this study, we focus on the quantitative opinion, propose an uncertainty-aware approach called Kernel Density Estimation from Multiple Sources (KDEm) to estimate its probability distribution, and summarize trustworthy information based on this distribution. Experiments indicate that KDEm not only has outstanding performance on the classical numeric truth discovery task, but also shows good performance on multi-modality detection and anomaly detection in the uncertain-opinion setting." "large-scale-machine-learning-systems";"Fast Component Pursuit for Large-Scale Inverse Covariance Estimation";"The maximum likelihood estimation (MLE) for the Gaussian graphical model, which is also known as the inverse covariance estimation problem, has gained increasing interest recently. Most existing works assume that inverse covariance estimators contain sparse structure and then construct models with the l1 regularization.
In this paper, different from existing works, we study the inverse covariance estimation problem from another perspective by efficiently modeling the low-rank structure in the inverse covariance, which is assumed to be a combination of a low-rank part and a diagonal matrix. One motivation for this assumption is that the low-rank structure is common in many applications including climate and financial analysis, and another one is that such an assumption can reduce the computational complexity when computing its inverse. Specifically, we propose an efficient COmponent Pursuit (COP) method to obtain the low-rank part, where each component can be sparse. For optimization, the COP method greedily learns a rank-one component in each iteration by maximizing the log-likelihood. Moreover, the COP algorithm enjoys several appealing properties including the existence of an efficient solution in each iteration and the theoretical guarantee on the convergence of this greedy approach. Experiments on large-scale synthetic and real-world datasets including thousands of millions of variables show that the COP method is faster than the state-of-the-art techniques for the inverse covariance estimation problem when achieving comparable log-likelihood on test data." "large-scale-machine-learning-systems";"Parallel Lasso Screening for Big Data Optimization";"Lasso regression is a widely used technique in data mining for model selection and feature extraction. In many applications, it remains challenging to apply the regression model to large-scale problems that have massive data samples with high-dimensional features. One popular and promising strategy is to solve the Lasso problem in parallel. Parallel solvers run multiple cores in parallel on a shared memory system to speed up the computation, while the practical usage is limited by the huge dimension in the feature space. Screening is a promising method to solve the problem of high dimensionality by discarding the inactive features and removing them from optimization. However, when integrating screening methods with parallel solvers, most solvers cannot guarantee the convergence on the reduced feature matrix. In this paper, we propose a novel parallel framework by parallelizing screening methods and integrating them with our proposed parallel solver. We propose two parallel screening algorithms: Parallel Strong Rule (PSR) and Parallel Dual Polytope Projection (PDPP). For the parallel solver, we propose an Asynchronous Grouped Coordinate Descent method (AGCD) to optimize the regression problem in parallel on the reduced feature matrix. AGCD is based on a grouped selection strategy to select the coordinate that has the maximum descent for the objective function in a group of candidates. Empirical studies on the real-world datasets demonstrate that the proposed parallel framework has a superior performance compared to the state-of-the-art parallel solvers." "large-scale-machine-learning-systems";"Compressing Graphs and Indexes with Recursive Graph Bisection";"Graph reordering is a powerful technique to increase the locality of the representations of graphs, which can be helpful in several applications. We study how the technique can be used to improve compression of graphs and inverted indexes."
"large-scale-machine-learning-systems";"Accelerated Stochastic Block Coordinate Descent with Optimal Sampling";"We study the composite minimization problem where the objective function is the sum of two convex functions: one is the sum of a finite number of strongly convex and smooth functions, and the other is a general convex function that is non-differentiable. Specifically, we consider the case where the non-differentiable function is block separable and admits a simple proximal mapping for each block. This type of composite optimization is common in many data mining and machine learning problems, and can be solved by block co-ordinate descent algorithms. We propose an accelerated stochastic block coordinate descent (ASBCD) algorithm, which incorporates the incrementally averaged partial derivative into the stochastic partial derivative and exploits optimal sampling. We prove that ASBCD attains a linear rate of convergence. In contrast to uniform sampling, we reveal that the optimal non-uniform sampling can be employed to achieve a lower iteration complexity. Experimental results on different large-scale real data sets support our theory." "large-scale-machine-learning-systems";"Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environm";"Dual coordinate descent method is one of the most effective approaches for large-scale linear classification. However, its sequential design makes the parallelization difficult. In this work, we target at the parallelization in a multi-core environment. After pointing out difficulties faced in some existing approaches, we propose a new framework to parallelize the dual coordinate descent method. The key idea is to make the majority of all operations (gradient calculation here) parallelizable. The proposed framework is shown to be theoretically sound. Further, we demonstrate through experiments that the new framework is robust and efficient in a multi-core environment." "large-scale-machine-learning-systems";"Stochastic Optimization Techniques for Quantification Performance Measures";"The estimation of class prevalence, i.e., of the fraction of a population that belongs to a certain class, is an important task in data analytics, and finds applications in many domains such as the social sciences, market research, epidemiology, and others. For example, in sentiment analysis the goal is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments, e.g., in a certain time frame. A popular way of performing the above task, often dubbed quantification, is to use supervised learning in order to train a prevalence estimator from labeled data." "large-scale-machine-learning-systems";"Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining";"In this paper we study predictive pattern mining problems where the goal is to construct a predictive model based on a subset of predictive patterns in the database. Our main contribution is to introduce a novel method called safe pat-tern pruning (SPP) for a class of predictive pattern mining problems. The SPP method allows us to efficiently find a superset of all the predictive patterns in the database that are needed for the optimal predictive model. The advantage of the SPP method over existing boosting-type method is that the former can find the superset by a single search over the database, while the latter requires multiple searches. 
The SPP method is inspired by the recent development of safe feature screening. In order to extend the idea of safe feature screening into predictive pattern mining, we derive a novel pruning rule called the safe pattern pruning (SPP) rule that can be used for searching over the tree defined among patterns in the database. The SPP rule has a property that, if a node corresponding to a pattern in the database is pruned out by the SPP rule, then it is guaranteed that all the patterns corresponding to its descendant nodes are never needed for the optimal predictive model. We apply the SPP method to graph mining and item-set mining problems, and demonstrate its computational advantage." "large-scale-machine-learning-systems";"Convex Optimization for Linear Query Processing under Approximate Differential Privacy";"Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals’ privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate strategy, may yield higher accuracy than computing each of them independently. However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-convex. Hence, in the past much effort has been devoted to solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem." "classification";"Detecting Devastating Diseases in Search Logs";"Web search queries can offer a unique population-scale window onto streams of evidence that are useful for detecting the emergence of health conditions. We explore the promise of harnessing behavioral signals in search logs to provide advance warning about the presence of devastating diseases such as pancreatic cancer. Pancreatic cancer is often diagnosed too late to be treated effectively as the cancer has usually metastasized by the time of diagnosis. Symptoms of the early stages of the illness are often subtle and nonspecific. We identify searchers who issue credible, first-person diagnostic queries for pancreatic cancer and we learn models from prior search histories that predict which searchers will later input such queries. We show that we can infer the likelihood of seeing the rise of diagnostic queries months before they appear and characterize the tradeoff between predictivity and false positive rate. The findings highlight the potential of harnessing search logs for the early detection of pancreatic cancer and more generally for harnessing search systems to reduce health risks for individuals." "classification";"Audience Expansion for Online Social Network Advertising";"Online social network advertising platforms, such as that provided by LinkedIn, generally allow marketers to specify targeting options so that their ads appear to a desired demographic. Audience Expansion is a technique developed at LinkedIn to simplify targeting and identify new audiences with similar attributes to the original target audience. We developed two methods to achieve Audience Expansion: campaign-agnostic expansion and campaign-aware expansion.
In this paper, we describe the details of these methods, present an in-depth analysis of their trade-offs, and demonstrate a hybrid strategy that possesses the combined strength of both methods. Through large scale online experiments, we show the effectiveness of the proposed approach, and as a result, the benefits it brings to the whole marketplace including both LinkedIn and advertisers. The achieved benefits can be characterized as: 1) simplified targeting process and increased reach for advertisers, and 2) better utilization of LinkedIn’s ads inventory and higher and more efficient market participation." "classification";"Identifying Police Officers at Risk of Adverse Events";"Adverse events between police and the public, such as deadly shootings or instances of racial profiling, can cause serious or deadly harm, damage police legitimacy, and result in costly litigation. Evidence suggests these events can be prevented by targeting interventions based on an Early Intervention System (EIS) that flags police officers who are at a high risk for involvement in such adverse events. Today’s EIS are not data-driven and typically rely on simple thresholds based entirely on expert intuition. In this paper, we describe our work with the Charlotte-Mecklenburg Police Department (CMPD) to develop a machine learning model to predict which officers are at risk for an adverse event. Our approach significantly outperforms CMPD’s existing EIS, increasing true positives by ∼ 12% and decreasing false positives by ∼ 32%. Our work also sheds light on features related to officer characteristics, situational factors, and neighborhood factors that are predictive of adverse events. This work provides a starting point for police departments to take a comprehensive, data-driven approach to improve policing and reduce harm to both officers and members of the public." "classification";"Crystal: Employer Name Normalization in the Online Recruitment Industry";"Entity linking links entity mentions in text to the corresponding entities in a knowledge base (KB) and has many applications in both open domain and specific domains. For example, in the recruitment domain, linking employer names in job postings or resumes to entities in an employer KB is very important to many business applications. In this paper, we focus on this employer name normalization task, which has several unique challenges: handling employer names from both job postings and resumes, leveraging the corresponding location context, and handling name variations, irrelevant input data, and noise in the KB. We present a system called CompanyDepot which contains a machine learning based approach CompanyDepot-ML and a heuristic approach CompanyDepot-H to address these challenges in three steps: (1) searching for candidate entities based on a customized search engine for the KB; (2) ranking the candidate entities using learning-to-rank methods or heuristics; and (3) validating the top-ranked entity via binary classification or heuristics. While CompanyDepot-ML shows better extendability and flexibility, CompanyDepot-H serves as a strong baseline and a useful way to collect training data for CompanyDepot-ML. The proposed system achieves 2.5%-21.4% higher coverage at the same precision level compared to an existing system used at CareerBuilder over multiple real-world datasets. Applying the system to a similar task of academic institution name normalization further shows the generalization ability of the method."
"classification";"Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta";"The Atlanta Fire Rescue Department (AFRD), like many municipal fire departments, actively works to reduce fire risk by inspecting commercial properties for potential hazards and fire code violations. However, AFRD’s fire inspection practices relied on tradition and intuition, with no existing data-driven process for prioritizing fire inspections or identifying new properties requiring inspection. In collaboration with AFRD, we developed the Firebird framework to help municipal fire departments identify and prioritize commercial property fire inspections, using machine learning, geocoding, and information visualization. Firebird computes fire risk scores for over 5,000 buildings in the city, with true positive rates of up to 71% in predicting fires. It has identified 6,096 new potential commercial properties to inspect, based on AFRD’s criteria for inspection. Furthermore, through an interactive map, Firebird integrates and visualizes fire incidents, property information and risk scores to help AFRD make informed decisions about fire inspections. Firebird has already begun to make positive impact at both local and national levels. It is improving AFRD’s inspection processes and Atlanta residents’ safety, and was highlighted by National Fire Protection Association (NFPA) as a best practice for using data to inform fire inspections." "classification";"Predicting Disk Replacement towards Reliable Data Centers";"Disks are among the most frequently failing components in today’s IT environments. Despite a set of defense mechanisms such as RAID, the availability and reliability of the system are still often impacted severely." "classification";"The Profile of an Online Purchaser: A Case Study of Pinterest";"Online e-commerce applications are becoming a primary vehicle for people to find, compare, and ultimately purchase products. One of the fundamental questions that arises in e-commerce is to characterize, understand, and model user long-term purchasing intent, which is important as it allows for personalized and context relevant e-commerce services." "classification";"Dynamic and Robust Wildfire Risk Prediction System: An Unsupervised Approach";"Ability to predict the risk of damaging events (e.g. wildfires) is crucial in helping emergency services in their decision-making processes, to mitigate and reduce the impact of such events. Today, wildfire rating systems have been in operation extensively in many countries around the world to estimate the danger of wildfires. In this paper we propose a data-driven approach to predict wildfire risk using weather data. We show how we address the inherent challenge arising due to the temporal dynamicity of weather data. Weather observations naturally change in time, with finer-scale variation (e.g. stationary day or night) or large variations (non-stationary day or night), and this determines a temporal variation of the predicted wildfire danger." "classification";"Designing Policy Recommendations to Reduce Home Abandonment in Mexico";"Infonavit, the largest provider of mortgages in Mexico, assists working families to obtain low-interest rate housing solutions. An increasingly prevalent problem is home abandonment: when a homeowner decides to leave their property and forego their investment. 
A major causal factor of this outcome is a mismatch between the homeowner’s needs, in terms of access to services and employment, and the location characteristics of the home." "classification";"Gemello: Creating a Detailed Energy Breakdown from just the Monthly Electricity Bill";"The first step to saving energy in the home is often to create an energy breakdown: the amount of energy used by each individual appliance in the home. Unfortunately, current techniques that produce an energy breakdown are not scalable: they require hardware to be installed in each and every home. In this paper, we propose a more scalable solution called Gemello that estimates the energy breakdown for one home by matching it with similar homes for which the breakdown is already known. This matching requires only the monthly energy bill and household characteristics such as square footage of the home and the size of the household. We evaluate this approach using 57 homes and results indicate that the accuracy of Gemello is comparable to or better than existing techniques that use sensing infrastructure in each home. The information required by Gemello is often publicly available and, as such, it can be immediately applied to many homes around the world." "classification";"Days on Market: Measuring Liquidity in Real Estate Markets";"Days on Market (DOM) refers to the number of days a property is on the active market, which is an important measurement of market liquidity in the real estate industry. Indeed, at the micro level, DOM is not only a special concern of house sellers, but also a useful indicator for potential buyers to evaluate the popularity of a house. At the macro level, DOM is an important indicator of real estate market status. However, it is very challenging to measure DOM, since there are a variety of factors which can impact the DOM of a property. To this end, in this paper, we aim to measure real estate liquidity by examining multiple factors in a holistic manner. A special goal is to predict the DOM of a given property listing. Specifically, we first extract key features from multiple types of heterogeneous real estate-related data, such as house profiles and geo-social information of residential communities. Then, based on these features, we develop a multi-task learning based regression approach for predicting the DOM of real estate. This approach can effectively learn district-aware models for different property listings by considering multiple factors. Finally, we conduct extensive experiments on real-world real estate data collected in Beijing and develop a prototype system for practical use. The experimental results clearly validate the effectiveness of the proposed approach for measuring liquidity in real estate markets." "classification";"A Non-parametric Approach to Detect Epileptogenic Lesions using Restricted Boltzmann Machines";"Visual detection of lesional areas on a cortical surface is critical in rendering a successful surgical operation for Treatment Resistant Epilepsy (TRE) patients. Unfortunately, 45% of Focal Cortical Dysplasia (FCD, the most common kind of TRE) patients have no visual abnormalities in their brains’ 3D-MRI images. We collaborate with doctors from NYU Langone’s Comprehensive Epilepsy Center and apply machine learning methodologies to identify the resective zones for these MRI-negative FCD patients. Our task is particularly challenging because MRI images can only provide a limited number of features.
Furthermore, data from different patients often exhibit inter-patient variability due to age, gender, left/right handedness, etc. In this paper, we introduce a new approach which combines restricted Boltzmann machines and a Bayesian non-parametric mixture model to address these issues. We demonstrate the efficacy of our model by applying it to a retrospective dataset of MRI-negative FCD patients who are seizure free after surgery." "classification";"Domain adaptation in the absence of source domain data";"The overwhelming majority of existing domain adaptation methods makes an assumption of freely available source domain data. Equal access to both source and target data makes it possible to measure the discrepancy between their distributions and to build representations common to both target and source domains. In reality, such a simplifying assumption rarely holds, since source data are routinely a subject of legal and contractual constraints between data owners and data customers. When source domain data cannot be accessed, decision making procedures are often available for adaptation nevertheless. These procedures are often presented in the form of classification, identification, ranking, etc., rules trained on source data and made ready for a direct deployment and later reuse. In other cases, the owner of the source data is allowed to share a few representative examples such as class means." "classification";"EMBERS AutoGSR: Automated Coding of Civil Unrest Events";"We describe the EMBERS AutoGSR system that conducts automated coding of civil unrest events from news articles published in multiple languages. The nuts and bolts of the AutoGSR system constitute an ecosystem of filtering, ranking, and recommendation models to determine if an article reports a civil unrest event and, if so, proceed to identify and encode specific characteristics of the civil unrest event such as the when, where, who, and why of the protest. AutoGSR has been deployed for the past 6 months, continually processing data 24x7 in languages such as Spanish, Portuguese, and English, and encoding civil unrest events in 10 countries of Latin America: Argentina, Brazil, Chile, Colombia, Ecuador, El Salvador, Mexico, Paraguay, Uruguay, and Venezuela. We demonstrate the superiority of AutoGSR over both manual approaches and other state-of-the-art encoding systems for civil unrest." "classification";"An Engagement-Based Customer Lifetime Value System for E-commerce";"A comprehensive understanding of individual customer value is crucial to any successful customer relationship management strategy. It is also the key to building products for long-term value returns. Modeling customer lifetime value (CLTV) can be fraught with technical difficulties, however, due to both the noisy nature of user-level behavior and the potentially large customer base. Here we describe a new CLTV system that solves these problems. This was built at Groupon, a large global e-commerce company, where confronting the unique challenges of local commerce means quickly iterating on new products and the optimal inventory to appeal to a wide and diverse audience. Given current purchaser frequency, we need a faster way to determine the health of individual customers, and given finite resources, we need to know where to focus our energy.
" "classification";"Predictors without Borders: Behavioral Modeling of Product Adoption in Three Developing Countries";"Billions of people around the world live without access to banks or other formal financial institutions. In the past several years, many mobile operators have launched “Mobile Money” platforms that deliver basic financial services over the mobile phone network. While many believe that these services can improve the lives of the poor, in many countries adoption of Mobile Money still remains anemic. In this paper, we develop a predictive model of Mobile Money adoption that uses billions of mobile phone communications records to understand the behavioral determinants of adoption. We describe a novel approach to feature engineering that uses a Deterministic Finite Automaton to construct thousands of behavioral metrics of phone use from a concise set of recursive rules. These features provide the foundation for a predictive model that is tested on mobile phone operators logs from Ghana, Pakistan, and Zambia, three very different developing-country contexts. The results highlight the key correlates of Mobile Money use in each country, as well as the potential for such methods to predict and drive adoption. More generally, our analysis provides insight into the extent to which homogenized supervised learning methods can generalize across geographic contexts. We find that without careful tuning, a model that performs very well in one country frequently does not generalize to another." "classification";"Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising";"In real-time display advertising, ad slots are sold per impression via an auction mechanism. For an advertiser, the campaign information is incomplete — the user responses (e.g, clicks or conversions) and the market price of each ad impression are observed only if the advertiser’s bid had won the corresponding ad auction. The predictions, such as bid land-scape forecasting, click-through rate (CTR) estimation, and bid optimization, are all operated in the pre-bid stage with full-volume bid request data. However, the training data is gathered in the post-bid stage with a strong bias towards the winning impressions. A common solution for learning over such censored data is to reweight data instances to correct the discrepancy between training and prediction. However, little study has been done on how to obtain the weights in-dependent of previous bidding strategies and consequently integrate them into the final CTR prediction and bid generation steps. In this paper, we formulate CTR estimation and bid optimization under such censored auction data. Derived from a survival model, we show that historic bid information is naturally incorporated to produce Bid-aware Gradient De-scents (BGD) which controls both the importance and the direction of the gradient to achieve unbiased learning. The empirical study based on two large-scale real-world datasets demonstrates remarkable performance gains from our solution. The learning framework has been deployed on Yahoo!’s real-time bidding platform and provided 2.97% AUC lift for CTR estimation and 9.30% eCPC drop for bid optimization in an online A/B test." "classification";"Repeat Buyer Prediction for E-Commerce";"A large number of new buyers are often acquired by merchants during promotions. However, many of the attracted buyers are one-time deal hunters, and the promotions may have little long-lasting impact on sales. 
It is important for merchants to identify who can be converted to regular loyal buyers and then target them to reduce promotion cost and increase the return on investment (ROI). At International Joint Conferences on Artificial Intelligence (IJCAI) 2015, Alibaba hosted an international competition for repeat buyer prediction based on the sales data of the “Double 11” shopping event in 2014 at Tmall.com. We won first place at stage 1 of the competition out of 753 teams. In this paper, we present our winning solution, which consists of comprehensive feature engineering and model training. We created profiles for users, merchants, brands, categories, items and their interactions via extensive feature engineering. These profiles are not only useful for this particular prediction task, but can also be used for other important tasks in e-commerce, such as customer segmentation, product recommendation, and customer base augmentation for brands. Feature engineering is often the most important factor for the success of a prediction task, but not much work can be found in the literature on feature engineering for prediction tasks in e-commerce. Our work provides some useful hints and insights for data science practitioners in e-commerce." "classification";"Developing a Data-Driven Player Ranking in Soccer using Predictive Model Weights";"Quantitative evaluation of the ability of soccer players to contribute to team offensive performance is typically based on goals scored, assists made, and shots taken. In this paper, we describe a novel player ranking system based entirely on the value of passes completed. This value is derived based on the relationship of pass locations in a possession and shot opportunities generated. This relationship is learned by applying a supervised machine learning model to pass locations in event data from the 2012-2013 La Liga season. Interestingly, though this metric is based entirely on passes, the derived player rankings are largely consistent with general perceptions of offensive ability, e.g., Messi and Ronaldo are near the top. Additionally, when used to rank midfielders, it separates the more offensively-minded players from others." "classification";"Identifying Earmarks in Congressional Bills";"Earmarks are legislative provisions that direct federal funds to specific projects, circumventing the competitive grant-making process of federal agencies. Identifying and cataloging earmarks is a tedious, time-consuming process carried out by experts from public interest groups. In this paper, we present a machine learning system for automatically extracting earmarks from congressional bills and reports. We first describe a table-parsing algorithm for extracting budget allocations from appropriations tables in congressional bills. We then use machine learning classifiers to identify budget allocations as earmarked objects with an out-of-sample ROC AUC score of 0.89. Using this system, we construct the first publicly available database of earmarks dating back to 1995. Our machine learning approach adds transparency, accuracy and speed to the congressional appropriations process." "classification";"Text Mining in Clinical Domain: Dealing with Noise.";"Text mining in the clinical domain is usually more difficult than in general domains (e.g., newswire reports and scientific literature) because of the high level of noise in both the corpus and training data for machine learning (ML).
A large number of unknown words, non-words and poorly grammatical sentences make up the noise in the clinical corpus. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns including scores and measures. This noise produces obstacles in the initial lexical processing step as well as subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. And even when created by experts, the training examples usually contain errors and inconsistencies due to variations in human annotators’ attentiveness. The clinical domain also suffers from the imbalanced data distribution problem. These kinds of noise are very common and potentially affect the overall information extraction performance, but they have not been carefully investigated in most presented health informatics systems." "classification";"Ranking Relevance in Yahoo Search";"Search engines play a crucial role in our daily lives. Relevance is the core problem of a commercial search engine. It has attracted thousands of researchers from both academia and industry and has been studied for decades. Relevance in a modern search engine has gone far beyond text matching, and now involves tremendous challenges. The semantic gap between queries and URLs is the main barrier for improving base relevance. Clicks help provide hints to improve relevance, but unfortunately for most tail queries, the click information is too sparse, noisy, or missing entirely. For comprehensive relevance, the recency and location sensitivity of results is also critical." "classification";"Question Independent Grading using Machine Learning: The Case of Computer Program Grading";"Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work [25] wherein we introduced a grammar of features to learn question-specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both question-specific models and the consensus among human experts, while substantially outperforming extant ways of evaluating code. We demonstrate the system’s value by deploying it to grade programs in a high-stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach."
"classification";"How to Get Them a Dream Job?";"This paper proposes an approach to applying standardized entity data to improve job search quality and to make search results more personalized. Specifically, we explore three types of entity-aware features and incorporate them into the job search ranking function. The first is query-job matching features which extract and standardize entities mentioned in queries and documents, then semantically match them based on these entities. The second type, searcher-job expertise homophily, aims to capture the fact that job searchers tend to be interested in the jobs requiring similar expertise as theirs. To measure the similarity, we use standardized skills in job descriptions and searchers’ profiles as well as skills that we infer searchers might have but not explicitly list in their profiles. Third, we propose a concept of entity-faceted historical click-through-rates (CTRs) to capture job document quality. Faceting jobs by their standardized companies, titles, locations, etc., and computing historical CTRs at the facet level instead of individual job level alleviate sparseness issue in historical action data. This is particularly important in job search where job lifetime is typically short. Both offline and online experiments confirm the effectiveness of the features. In offline experiment, using the entity-aware features gives improvements of +20%, +12.1%and +8.3% on Precision@1, MRR and NDCG@25, respectively. Online A/B test shows that a new model with these features is +11.3% and +5.3% better than the baseline in terms of click-through-rate and apply rate." "classification";"DopeLearning: A Computational Approach to Rap Lyrics Generation";"Writing rap lyrics requires both creativity to construct a meaningful, interesting story and lyrical skills to produce complex rhyme patterns, which form the cornerstone of good ow. We present a rap lyrics generation method that captures both of these aspects. First, we develop a prediction model to identify the next line of existing lyrics from a set of candidate next lines. This model is based on two machine-learning techniques: the RankSVM algorithm and a deep neural network model with a novel structure. Results show that the prediction model can identify the true next line among 299 randomly selected lines with an accuracy of 17%, i.e., over 50 times more likely than by random. Second, we employ the prediction model to combine lines from existing songs, producing lyrics with rhyme and a meaning. An evaluation of the produced lyrics shows that in terms of quantitative rhyme density, the method outperforms the best human rappers by 21%. The rap lyrics generator has been deployed as an online tool called DeepBeat, and the performance of the tool has been assessed by analyzing its usage logs. This analysis shows that machine-learned rankings correlate with user preferences." "classification";"The Legislative Influence Detector: Finding Text Reuse in State Legislation";"State legislatures introduce at least 45,000 bills each year. However, we lack a clear understanding of who is actually writing those bills. As legislators often lack the time and staff to draft each bill, they frequently copy text written by other states or interest groups."