Publications

Dingming Wu, Erjia Xiao, Yi Zhu, Christian S. Jensen, Kezhong Lu ,"Efficient Retrieval of the Top-k Most Relevant Event-Partner Pairs" in IEEE Transactions on Knowledge and Data Engineering, 2023

Link [Publicly available]

The proliferation of event-based social networking (EBSN) motivates studies on topics such as event, venue, and friend recommendation as well as event creation and organization. In this setting, the notion of event-partner recommendation has attracted attention. When recommending an event to a user, this functionality allows the recommendation of partners with whom to attend the event. However, in existing proposals, recommendations are pushed to users at the system's initiative. In contrast, EBSNs provide users with keyword-based search functionality. This way, users may retrieve information in pull mode. We propose a new way of accessing information in EBSNs that combines pull and push, thus allowing users to not only conduct ad-hoc searches for events, but also to receive partner recommendations for retrieved events. Specifically, we define and study top-k event-partner (kEP) pair retrieval querying that integrates keyword-based search for events with event-partner recommendation. This type of query retrieves event-partner pairs, taking into account the relevance of events to user-supplied keywords and so-called together preferences that indicate the extent of a user's preference to attend an event with a given partner. To compute kEP queries efficiently, we propose a rank-join based framework with three optimizations. Results of empirical studies with implementations of the proposed techniques demonstrate that the proposed techniques are capable of excellent performance.

Bolong Zheng, Qi Hu, Lingfeng Ming, Jilin Hu, Lu Chen, Kai Zheng, Christian S. Jensen ,"SOUP: Spatial-Temporal Demand Forecasting and Competitive Supply in Transportation" in IEEE Transactions on Knowledge and Data Engineering, 2023

Link [Publicly available]

We consider a setting with an evolving set of requests for transportation from an origin to a destination before a deadline and a set of agents capable of servicing the requests. In this setting, an authority assigns agents to requests such that the average idle time of the agents is minimized. An example is the scheduling of taxis (agents) to meet incoming passenger requests for trips while ensuring that the taxis are empty as little as possible. We address the problem of spatial-temporal demand forecasting and competitive supply (SOUP) in two steps. First, we build a granular model that provides spatial-temporal predictions of requests. Specifically, we propose a Spatial-Temporal Graph Convolutional Sequential Learning (ST-GCSL) model that predicts requests across locations and time slots. Second, we provide means of routing agents to request origins while avoiding competition among the agents. In particular, we develop a demand-aware route planning (DROP) algorithm that considers both the spatial-temporal predictions and the supply-demand state. We report on extensive experiments with real-world data that offer insight into the performance of the solution and show that it is capable of outperforming the state-of-the-art proposals.

Yan Zhao, Liwei Deng, Xuanhao Chen, Chenjuan Guo, Bin Yang, Tung Kieu, Feiteng Huang, Torben Bach Pedersen, Kai Zheng, Christian S. Jensen ,"A Comparative Study on Unsupervised Anomaly Detection for Time Series: Experiments and Analysis." in arXiv, 2022

Link

The continued digitization of societal processes translates into a proliferation of time series data that cover applications such as fraud detection, intrusion detection, and energy management, where anomaly detection is often essential to enable reliability and safety. Many recent studies target anomaly detection for time series data. Indeed, the area of time series anomaly detection is characterized by diverse data, methods, and evaluation strategies, and comparisons in existing studies consider only part of this diversity, which makes it difficult to select the best method for a particular problem setting. To address this shortcoming, we introduce taxonomies for data, methods, and evaluation strategies, provide a comprehensive overview of unsupervised time series anomaly detection using the taxonomies, and systematically evaluate and compare state-of-the-art traditional as well as deep learning techniques. In the empirical study using nine publicly available datasets, we apply the most commonly-used performance evaluation metrics to typical methods under a fair implementation standard. Based on the structuring offered by the taxonomies, we report on empirical studies and provide guidelines, in the form of comparative tables, for choosing the methods most suitable for particular application settings. Finally, we propose research directions for this dynamic field.

Tung Kieu, Bin Yang, Chenjuan Guo, Razvan-Gabriel Cirstea, Yan Zhao, Yale Song, Christian S. Jensen (corresponding author) ,"Anomaly Detection in Time Series with Robust Variational Quasi-Recurrent Autoencoders" in 38th International Conference on Data Engineering (ICDE), 2022

Link

We propose variational quasi-recurrent autoencoders (VQRAEs) to enable robust and efficient anomaly detection in time series in unsupervised settings. The proposed VQRAEs employ a judiciously designed objective function based on robust divergences, including α-, β-, and γ-divergence, making it possible to separate anomalies from normal data without the reliance on anomaly labels, thus achieving robustness and fully unsupervised training. To better capture temporal dependencies in time series data, VQRAEs are built upon quasi-recurrent neural networks, which employ convolution and gating mechanisms to avoid the inefficient recursive computations used by classic recurrent neural networks. Further, VQRAEs can be extended to bi-directional BiVQRAEs that utilize bi-directional information to further improve the accuracy. The above design choices make VQRAEs not only robust and thus accurate, but also efficient at detecting anomalies in streaming settings. Experiments on five real-world time series offer insight into the design properties of VQRAEs and demonstrate that VQRAEs are capable of outperforming state-of-the-art methods.

Zezhi Shao, Zhao Zhang, Wei Wei, Fei Wang, Yongjun Xu, Xin Cao, Christian S. Jensen ,"Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting." in arXiv, 2022

Link

We all depend on mobility, and vehicular transportation affects the daily lives of most of us. Thus, the ability to forecast the state of traffic in a road network is an important functionality and a challenging task. Traffic data is often obtained from sensors deployed in a road network. Recent proposals on spatial-temporal graph neural networks have achieved great progress at modeling complex spatial-temporal correlations in traffic data, by modeling traffic data as a diffusion process. However, intuitively, traffic data encompasses two different kinds of hidden time series signals, namely the diffusion signals and inherent signals. Unfortunately, nearly all previous works coarsely consider traffic signals entirely as the outcome of the diffusion, while neglecting the inherent signals, which impacts model performance negatively. To improve modeling performance, we propose a novel Decoupled Spatial-Temporal Framework (DSTF) that separates the diffusion and inherent traffic information in a data-driven manner, which encompasses a unique estimation gate and a residual decomposition mechanism. The separated signals can be handled subsequently by the diffusion and inherent modules separately. Further, we propose an instantiation of DSTF, Decoupled Dynamic Spatial-Temporal Graph Neural Network (D2STGNN), that captures spatial-temporal correlations and also features a dynamic graph learning module that targets the learning of the dynamic characteristics of traffic networks. Extensive experiments with four real-world traffic datasets demonstrate that the framework is capable of advancing the state-of-the-art.

Zezhi Shao, Zhao Zhang, Wei Wei, Fei Wang, Yongjun Xu, Xin Cao, Christian S. Jensen ,"Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting." in Proceedings of the VLDB Endowment, 2022

Link [Publicly available]

We all depend on mobility, and vehicular transportation affects the daily lives of most of us. Thus, the ability to forecast the state of traffic in a road network is an important functionality and a challenging task. Traffic data is often obtained from sensors deployed in a road network. Recent proposals on spatial-temporal graph neural networks have achieved great progress at modeling complex spatial-temporal correlations in traffic data, by modeling traffic data as a diffusion process. However, intuitively, traffic data encompasses two different kinds of hidden time series signals, namely the diffusion signals and inherent signals. Unfortunately, nearly all previous works coarsely consider traffic signals entirely as the outcome of the diffusion, while neglecting the inherent signals, which impacts model performance negatively. To improve modeling performance, we propose a novel Decoupled Spatial-Temporal Framework (DSTF) that separates the diffusion and inherent traffic information in a data-driven manner, which encompasses a unique estimation gate and a residual decomposition mechanism. The separated signals can be handled subsequently by the diffusion and inherent modules separately. Further, we propose an instantiation of DSTF, Decoupled Dynamic Spatial-Temporal Graph Neural Network (D2STGNN), that captures spatial-temporal correlations and also features a dynamic graph learning module that targets the learning of the dynamic characteristics of traffic networks. Extensive experiments with four real-world traffic datasets demonstrate that the framework is capable of advancing the state-of-the-art.

Dingming Wu, Ilkcan Keles, Song Wu, Hao Zhou, Simonas Saltenis, Christian S. Jensen, Kezhong Lu (corresponding author) ,"Density-Based Top-K Spatial Textual Clusters Retrieval" in IEEE Transactions on Knowledge and Data Engineering, 2022

Link [Publicly available]

So-called spatial web queries retrieve web content representing points of interest, such that the points of interest have descriptions that are relevant to query keywords and are located close to a query location. Two broad categories of such queries exist. The first encompasses queries that retrieve single spatial web objects that each satisfy the query arguments. Most proposals belong to this category. The second category, to which this paper's proposal belongs, encompasses queries that support exploratory user behavior and retrieve sets of objects that represent regions of space that may be of interest to the user. Specifically, the paper proposes a new type of query, the top-k spatial textual cluster retrieval (k-STC) query that returns the top-k clusters that (i) are located close to a query location, (ii) contain objects that are relevant with regard to given query keywords, and (iii) have an object density that exceeds a given threshold. To compute this query, we propose a DBSCAN-based approach and an OPTICS-based approach that rely on on-line density-based clustering and that exploit early stop conditions. Empirical studies on real data sets offer evidence that the paper's proposals can find good quality clusters and are capable of excellent performance.

Dalin Zhang, Kaixuan Chen, Yan Zhao, Bin Yang, Lina Yao, Christian S. Jensen ,"Design Automation for Fast, Lightweight, and Effective Deep Learning Models: A Survey." in arXiv, 2022

Link

Deep learning technologies have demonstrated remarkable effectiveness in a wide range of tasks, and deep learning holds the potential to advance a multitude of applications, including in edge computing, where deep models are deployed on edge devices to enable instant data processing and response. A key challenge is that while the application of deep models often incurs substantial memory and computational costs, edge devices typically offer only very limited storage and computational capabilities that may vary substantially across devices. These characteristics make it difficult to build deep learning solutions that unleash the potential of edge devices while complying with their constraints. A promising approach to addressing this challenge is to automate the design of effective deep learning models that are lightweight, require only a little storage, and incur only low computational overheads. This survey offers comprehensive coverage of studies of design automation techniques for deep learning models targeting edge computing. It offers an overview and comparison of key metrics that are used commonly to quantify the proficiency of models in terms of effectiveness, lightness, and computational costs. The survey then proceeds to cover three categories of the state-of-the-art of deep model design automation techniques: automated neural architecture search, automated model compression, and joint automated design and compression. Finally, the survey covers open issues and directions for future research.

Christian S. Jensen ,"Digitalization in the Service of Society: The Case of Big Vehicle Trajectory Data."

Link

The ongoing, sweeping digitalization of societal processes generates massive volumes of data that capture the underlying processes at an unprecedented level of detail, in turn enabling us to better understand and improve those processes. Put differently, if harnessed properly, data holds the potential to enable value creation throughout society. Considering primarily vehicle trajectory data, this talk focuses on the important process of transportation: While we all depend on it for mobility, transportation has adverse effects on (i) our productivity due to lack of predictability and congestion, (ii) the climate due to greenhouse gas emissions, and (iii) our health and safety due to air and noise pollution and accidents. Thus, it makes good sense to invent techniques capable of leveraging big data for the improvement of transportation. The talk describes how the availability of massive trajectory data renders the traditional routing paradigm, where a road network is modeled as an edge-weighted graph, inadequate. Instead, new paradigms that thrive on massive trajectory data are called for. The talk covers several such paradigms, including path-centric, on-the-fly, and cost-oblivious routing [2, 3, 4, 10, 11, 12]. As even massive volumes of trajectory data are sparse in these settings, the talk also covers means of making good use of available data [6, 7, 13]. Finally, trajectory data has many uses beyond routing; the talk covers several such uses [1, 5, 8, 9].

Huan Li, Lanjing Yi, Bo Tang, Hua Lu, Christian S. Jensen ,"Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments" in 48th International Conference on Very Large Data Bases, VLDB 2022, 2022

Link [Publicly available]

Underlying many types of data analytics, a spatiotemporal quantile monitoring (SQM) query continuously returns the quantiles of a dataset observed in a spatiotemporal range. In this paper, we study SQM in an Internet of Things (IoT) based edge computing environment, where concurrent SQM queries share the same infrastructure asynchronously. To minimize query latency while providing result accuracy guarantees, we design a processing framework that virtualizes edge-resident data sketches for quantile computing. In the framework, a coordinator edge node manages edge sketches and synchronizes edge sketch processing and query executions. The coordinator also controls the processed data fractions of edge sketches, which helps to achieve the optimal latency with error-bounded results for each single query. To support concurrent queries, we employ a grid to decompose queries into subqueries and process them efficiently using shared edge sketches. We also devise a relaxation algorithm to converge to optimal latencies for those subqueries whose result errors are still bounded. We evaluate our proposals using two high-speed streaming datasets in a simulated IoT setting with edge nodes. The results show that our proposals achieve efficient, scalable, and error-bounded SQM.

Lu Chen, Yunjun Gao, Xingrui Huang, Christian S. Jensen, Bolong Zheng ,"Efficient Distributed Clustering Algorithms on Star-Schema Heterogeneous Graphs" in IEEE Transactions on Knowledge and Data Engineering, 2022

Link [Publicly available]

Clustering graphs is able to provide useful insights into the structure of the data. To improve the quality of clustering, node attributes can be considered, resulting in attributed graphs. Existing attributed graph clustering methods generally consider attribute similarity and structural similarity separately. In this paper, we represent attributed graphs as star-schema heterogeneous graphs, where attributes are modeled as different types of graph nodes. This enables the use of personalized pagerank (PPR) as a unified distance measure that captures both structural and attribute similarities. We employ DBSCAN for clustering, and update edge weights iteratively to balance the importance of different attributes. The rapidly growing volume of data nowadays challenges traditional clustering algorithms, and thus, a distributed method is required. Hence, we adopt a popular distributed graph computing system Blogel, based on which, we develop four exact and approximate approaches that enable efficient PPR score computation when edge weights are updated. To improve the effectiveness of the clustering, we propose a simple yet effective edge weight update strategy based on entropy. Also, we present a game theory based method that enables trading efficiency for result quality. Extensive experiments on real-life datasets demonstrate the effectiveness and efficiency of our proposals.
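As a rough illustration of how personalized PageRank can act as a single proximity measure on a star-schema graph, the following minimal sketch runs plain power iteration on a toy graph in which one attribute value is modeled as an extra node. It is not the paper's Blogel-based method: the function name, the restart probability `alpha`, the toy graph, and the node labels are all assumptions made for this example.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for personalized PageRank (illustrative sketch).

    adj maps each node to a dict {neighbor: edge_weight}; every node is
    assumed to have at least one outgoing edge, and every neighbor is
    assumed to be a key of adj. alpha is the restart probability.
    """
    scores = {v: 0.0 for v in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[source] = alpha  # restart mass always returns to the source
        for u, nbrs in adj.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                # Spread the remaining mass along weighted out-edges.
                nxt[v] += (1 - alpha) * scores[u] * w / total
        scores = nxt
    return scores


# Toy star-schema graph (hypothetical): objects o1..o3 plus one attribute
# node. o1 and o3 are not directly linked but share the attribute "red".
toy_graph = {
    "o1": {"o2": 1.0, "attr:red": 1.0},
    "o2": {"o1": 1.0, "o3": 1.0},
    "o3": {"o2": 1.0, "attr:red": 1.0},
    "attr:red": {"o1": 1.0, "o3": 1.0},
}
ppr_from_o1 = personalized_pagerank(toy_graph, "o1")
```

In this toy graph, o3 receives score both through the structural path via o2 and through the shared attribute node, which is the sense in which one measure covers both kinds of similarity. Changing the weights on the attribute edges shifts how much the attribute contributes.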

Tianyi Li, Christian S. Jensen, Torben Bach Pedersen, Yunjun Gao, Jilin Hu ,"Evolutionary Clustering of Moving Objects" in 38th IEEE International Conference on Data Engineering, ICDE 2022, 2022

Link

The widespread deployment of smartphones, networked in-vehicle devices with geo-positioning capabilities, and vessel tracking technologies renders it feasible to collect the evolving geo-locations of populations of land- and sea-based moving objects. The continuous clustering of such data can enable a variety of real-time services, such as road traffic management and vessel collision risk assessment. However, little attention has so far been given to the quality of moving-object clusters; for example, it is beneficial to smooth short-term fluctuations in clusters to achieve robustness to exceptional data and to improve existing applications. We propose the notion of evolutionary clustering of moving objects, abbreviated ECM, that enhances the quality of moving object clustering by means of temporal smoothing that prevents abrupt changes in clusters across successive timestamps. Employing the notions of snapshot and historical costs, we formalize ECM and formulate ECM as an optimization problem. We prove that ECM can be performed approximately in linear time, thus eliminating iterative processes employed in previous studies. Further, we propose a minimal-group structure and a seed-point shifting strategy to facilitate temporal smoothing. Finally, we present all algorithms underlying ECM along with a set of optimization techniques. Extensive experiments with three real-life datasets offer insights into ECM and show that it outperforms state-of-the-art solutions in terms of both clustering quality and clustering efficiency.

Lu Chen, Yunjun Gao, Xuan Song, Zheng Li, Yifan Zhu, Xiaoye Miao, Christian S. Jensen (corresponding author) ,"Indexing Metric Spaces for Exact Similarity Search" in ACM Computing Surveys, 2022

Link [Publicly available]

With the continued digitization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer studies concern the variety. Metric spaces are ideal for addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize existing partitioning, pruning, and validation techniques used by metric indexes to support exact similarity search; we provide the time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend highly on the effectiveness of available pruning and validation as well as on the data distribution, which means that complexity analyses often offer limited insights. This article aims at revealing strengths and weaknesses of different indexing techniques to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.

Xuanhao Chen, Yan Zhao, Kai Zheng, Bin Yang, Christian S. Jensen (corresponding author) ,"Influence-aware Task Assignment in Spatial Crowdsourcing" in 38th International Conference on Data Engineering (ICDE), 2022

Link

With the widespread diffusion of smartphones, Spatial Crowdsourcing (SC), which aims to assign spatial tasks to mobile workers, has drawn increasing attention in both academia and industry. One of the major issues is how to best assign tasks to workers. Given a worker and a task, the worker will choose to accept the task based on her affinity towards the task, and the worker can propagate the information of the task to attract more workers to perform it. These factors can be measured as worker-task influence. Since workers' affinities towards tasks are different and task issuers may ask workers who performed tasks to propagate the information of tasks to attract more workers to perform them, it is important to analyze worker-task influence when making assignments. We propose and solve a novel influence-aware task assignment problem in SC, where tasks are assigned to workers in a manner that achieves high worker-task influence. In particular, we aim to maximize the number of assigned tasks and worker-task influence. To solve the problem, we first determine workers' affinities towards tasks by identifying workers' historical task-performing patterns. Next, a Historical Acceptance approach is developed to measure workers' willingness of performing a task, i.e., the probability of workers visiting the location of the task when they are informed. Next, we propose a Random reverse reachable-based Propagation Optimization algorithm that exploits reverse reachable sets to calculate the probability of workers being informed about tasks in a social network. Based on worker-task influence derived from the above three factors, we propose three influence-aware task assignment algorithms that aim to maximize the number of assigned tasks and worker-task influence. Extensive experiments on two real-world datasets offer detailed insight into the effectiveness of our solutions.

Xuanhao Chen, Yan Zhao, Kai Zheng, Bin Yang, Christian S. Jensen ,"Influence-aware Task Assignment in Spatial Crowdsourcing (Technical Report)." in arXiv, 2022

Link

Anton Dignös, Michael H. Böhlen, Johann Gamper, Christian S. Jensen, Peter Moser (corresponding author) ,"Leveraging range joins for the computation of overlap joins" in VLDB Journal, 2022

Link [Publicly available]

Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
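The idea of expressing an overlap join as a union of range joins can be sketched as follows. This is a minimal in-memory illustration, not the paper's implementation: it assumes non-empty half-open periods `(start, end)`, uses sorted lists with binary search as a stand-in for a B+-tree on the start attribute, and all function and variable names are chosen for this example. It relies on the standard observation that two periods overlap exactly when the start of one falls inside the other, which splits into two disjoint range predicates.

```python
from bisect import bisect_left, bisect_right


def overlap_join(r, s):
    """Join half-open periods (start, end) from r and s that overlap.

    The overlap predicate r.start < s.end and s.start < r.end is
    evaluated as the union of two disjoint range joins:
      1) r.start <= s.start < r.end
      2) s.start <  r.start < s.end
    """
    s_sorted = sorted(s)                    # stand-in for an index on s.start
    s_starts = [p[0] for p in s_sorted]
    r_sorted = sorted(r)                    # stand-in for an index on r.start
    r_starts = [p[0] for p in r_sorted]

    result = []
    # Range join 1: s periods whose start lies in [r.start, r.end).
    for rp in r:
        lo = bisect_left(s_starts, rp[0])
        hi = bisect_left(s_starts, rp[1])
        result.extend((rp, sp) for sp in s_sorted[lo:hi])
    # Range join 2: r periods whose start lies in (s.start, s.end).
    for sp in s:
        lo = bisect_right(r_starts, sp[0])
        hi = bisect_left(r_starts, sp[1])
        result.extend((rp, sp) for rp in r_sorted[lo:hi])
    return result


def naive_overlap_join(r, s):
    """Nested-loop reference used only to check the sketch."""
    return [(rp, sp) for rp in r for sp in s
            if rp[0] < sp[1] and sp[0] < rp[1]]
```

Because the two range predicates cannot both hold for the same pair, the union needs no duplicate elimination. Closed or open period boundaries would change which of the comparisons are strict.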

Pengfei Jin, Lu Chen, Yunjun Gao, Xueqin Chang, Zhanyu Liu, Shu Shen, Christian S. Jensen (corresponding author) ,"Maximizing the influence of bichromatic reverse k nearest neighbors in geo-social networks" in World Wide Web, 2022

Link [Publicly available]

Geo-social networks offer opportunities for the marketing and promotion of geo-located services. In this setting, we explore a new problem, called Maximizing the Influence of Bichromatic Reverse k Nearest Neighbors (MaxInfBRkNN). The objective is to find a set of points of interest (POIs), which are geo-textually and socially relevant to social influencers who are expected to largely promote the POIs online. In other words, the problem aims to detect an optimal set of POIs with the largest word-of-mouth (WOM) marketing potential. This functionality is useful in various real-life applications, including social advertising, location-based viral marketing, and personalized POI recommendation. However, solving MaxInfBRkNN with theoretical guarantees is challenging because of the prohibitive overheads on BRkNN retrieval in geo-social networks, and the NP and #P-hardness of finding the optimal POI set. To achieve practical solutions, we present a framework with carefully designed indexes, efficient batch BRkNN processing algorithms, and alternative POI selection policies that support both approximate and heuristic solutions. Extensive experiments on real and synthetic datasets demonstrate the good performance of our proposed methods.

Pengfei Jin, Lu Chen, Yunjun Gao, Xueqin Chang, Zhanyu Liu, Christian S. Jensen ,"Maximizing the Influence of Bichromatic Reverse k Nearest Neighbors in Geo-Social Networks." in arXiv, 2022

Link

Geo-social networks offer opportunities for the marketing and promotion of geo-located services. In this setting, we explore a new problem, called Maximizing the Influence of Bichromatic Reverse k Nearest Neighbors (MaxInfBRkNN). The objective is to find a set of points of interest (POIs), which are geo-textually and socially attractive to social influencers who are expected to largely promote the POIs through online influence propagation. In other words, the problem aims to detect an optimal set of POIs with the largest word-of-mouth (WOM) marketing potential. This functionality is useful in various real-life applications, including social advertising, location-based viral marketing, and personalized POI recommendation. However, solving MaxInfBRkNN with theoretical guarantees is challenging, because of the prohibitive overheads on BRkNN retrieval in geo-social networks, and the NP and #P-hardness in finding the optimal POI set. To achieve practical solutions, we present a framework with carefully designed indexes, efficient batch BRkNN processing algorithms, and alternative POI selection policies that support both approximate and heuristic solutions. Extensive experiments on real and synthetic datasets demonstrate the good performance of our proposed methods.

Karl Aberer, Christian S. Jensen, Kian Lee Tan ,"Message from the Test-of-Time Committee" in 23rd IEEE International Conference on Mobile Data Management, MDM 2022, 2022

Link

Presents the conference keynote speech or messages from conference chairs.

Mohamed F. Mokbel, Mahmoud Attia Sakr, Li Xiong, Andreas Züfle, Jussara M. Almeida, Taylor Anderson, Walid G. Aref, Gennady L. Andrienko, Natalia V. Andrienko, Yang Cao, Sanjay Chawla, Reynold Cheng, Panos K. Chrysanthis, Xiqi Fei, Gabriel Ghinita, Anita Graser, Dimitrios Gunopulos, Christian S. Jensen, Joon-Sook Kim, Kyoung-Sook Kim, Peer Kröger, John Krumm, Johannes Lauer, Amr Magdy, Mario A. Nascimento, Siva Ravada, Matthias Renz, Dimitris Sacharidis, Cyrus Shahabi, Flora D. Salim, Mohamed Sarwat, Maxime Schoemans, Bettina Speckmann, Egemen Tanin, Yannis Theodoridis, Kristian Torp, Goce Trajcevski, Marc J. van Kreveld, Carola Wenk, Martin Werner, Raymond Chi-Wing Wong, Song Wu, Jianqiu Xu, Moustafa Youssef, Demetris Zeinalipour, Mengxuan Zhang, Esteban Zimányi ,"Mobility Data Science: Dagstuhl Seminar 22021" in Dagstuhl Seminar 22021, 2022

Link [Publicly available]

This report documents the program and the outcomes of Dagstuhl Seminar 22021 "Mobility Data Science". This seminar was held January 9-14, 2022, including 47 participants from industry and academia. The goal of this Dagstuhl Seminar was to create a new research community of mobility data science in which the whole is greater than the sum of its parts by bringing together established leaders as well as promising young researchers from all fields related to mobility data science. Specifically, this report summarizes the main results of the seminar by (1) defining Mobility Data Science as a research domain, (2) sketching its agenda in the coming years, and (3) building a mobility data science community. (1) Mobility data science is defined as spatiotemporal data that additionally captures the behavior of moving entities (human, vehicle, animal, etc.). To understand, explain, and predict behavior, we note that a strong collaboration with research in behavioral and social sciences is needed. (2) Future research directions for mobility data science described in this report include a) mobility data acquisition and privacy, b) mobility data management and analysis, and c) applications of mobility data science. (3) We identify opportunities towards building a mobility data science community, towards collaborations between academia and industry, and towards a mobility data science curriculum.

Yan Zhao, Xuanhao Chen, Liwei Deng, Tung Kieu, Chenjuan Guo, Bin Yang, Kai Zheng, Christian S. Jensen ,"Outlier Detection for Streaming Task Assignment in Crowdsourcing." in 31st ACM Web Conference, WWW 2022, 2022

Link

Crowdsourcing aims to enable the assignment of available resources to the completion of tasks at scale. The continued digitization of societal processes translates into increased opportunities for crowdsourcing. For example, crowdsourcing enables the assignment of computational resources of humans, called workers, to tasks that are notoriously hard for computers. In settings faced with malicious actors, detection of such actors holds the potential to increase the robustness of crowdsourcing platforms. We propose a framework called Outlier Detection for Streaming Task Assignment that aims to improve robustness by detecting malicious actors. In particular, we model the arrival of workers and the submission of tasks as evolving time series and provide means of detecting malicious actors by means of outlier detection. We propose a novel socially aware Generative Adversarial Network (GAN) based architecture that is capable of contending with the complex distributions found in time series. The architecture includes two GANs that are designed to adversarially train an autoencoder to learn the patterns of distributions in worker and task time series, thus enabling outlier detection based on reconstruction errors. A GAN structure encompasses a game between a generator and a discriminator, where it is desirable that the two can learn to coordinate towards socially optimal outcomes, while avoiding being exploited by selfish opponents. To this end, we propose a novel training approach that incorporates social awareness into the loss functions of the two GANs. Additionally, to improve task assignment efficiency, we propose an efficient greedy algorithm based on degree reduction that transforms task assignment into a bipartite graph matching. Extensive experiments offer insight into the effectiveness and efficiency of the proposed framework.

Yifan Zhu, Lu Chen, Yunjun Gao, Christian S. Jensen ,"Pivot selection algorithms in metric spaces: a survey and experimental study" in VLDB Journal, 2022

Link

Similarity search in metric spaces is used widely in areas such as multimedia retrieval, data mining, data integration, to name but a few. To accelerate metric similarity search, pivot-based indexing is often employed. Pivot-based indexing first computes the distances between data objects and pivots and then exploits filtering techniques that use the triangle inequality on pre-computed distances to prune search space during search. The performance of pivot-based indexing depends on the quality of the pivots used, and many algorithms have been proposed for selecting high-quality pivots. We present a comprehensive empirical study of pivot selection algorithms. Specifically, we classify all existing algorithms into three categories according to the types of distances they use for selecting pivots. We also propose a new pivot selection algorithm that exploits the power law probabilistic distribution. Next, we report on a comprehensive empirical study of the search performance enabled by different pivot selection approaches, using different datasets and indexes, thus contributing new insight into the strengths and weaknesses of existing selection techniques. Finally, we offer advice on how to select appropriate pivot selection algorithms for different settings.
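
The filtering step described in the abstract above can be illustrated with a minimal sketch. It is not taken from the survey: the class name PivotTable, the choice of Euclidean distance, and the hand-picked pivots are assumptions made only for illustration. The sketch pre-computes object-to-pivot distances and uses the triangle inequality to skip objects whose lower bound on the query distance already exceeds the search radius.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric could be substituted here."""
    return math.dist(a, b)


class PivotTable:
    """Hypothetical pivot table: stores d(object, pivot) for every object and pivot."""

    def __init__(self, data: Sequence[Point], pivots: Sequence[Point]) -> None:
        self.data = list(data)
        self.pivots = list(pivots)
        # Pre-computed distances, one row per data object.
        self.table = [[euclidean(o, p) for p in self.pivots] for o in self.data]
        # Counts distance computations against data objects (not pivots).
        self.distance_computations = 0

    def range_query(self, q: Point, r: float) -> List[Point]:
        """Return all objects within distance r of q."""
        q_to_pivots = [euclidean(q, p) for p in self.pivots]
        result = []
        for obj, row in zip(self.data, self.table):
            # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for every pivot p.
            lower = max(
                (abs(dq - do) for dq, do in zip(q_to_pivots, row)), default=0.0
            )
            if lower > r:
                continue  # pruned without computing d(q, obj)
            self.distance_computations += 1
            if euclidean(q, obj) <= r:
                result.append(obj)
        return result


# Usage: a 10 x 10 grid with two hand-picked corner pivots (an assumption).
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
index = PivotTable(grid, pivots=[(0.0, 0.0), (9.0, 9.0)])
hits = index.range_query((2.5, 2.5), 1.0)
```

How many objects are pruned depends on where the pivots lie relative to the data and the query, which is why pivot selection matters for search performance.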

Bolong Zheng, Xi Zhao, Lianggui Weng, Quoc Viet Hung Nguyen, Hang Liu, Christian S. Jensen (corresponding author) ,"PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search" in VLDB Journal, 2022

Link [Publicly available]

Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional computational overhead when having to examine unnecessary points. This in turn reduces the performance of query processing. In contrast, we propose a fast and accurate in-memory LSH framework, called PM-LSH, that aims to compute the c-ANN query on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. In addition, we extend PM-LSH to support closest pair (CP) search in high-dimensional spaces. Here, we again adopt the PM-tree to organize the points in a low-dimensional space, and we propose a branch and bound algorithm together with a radius pruning technique to improve the performance of computing c-approximate closest pair (c-ACP) queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy for both NN and CP search.

Yan Zhao, Kai Zheng, Yunchuan Li, Jinfu Xia, Bin Yang, Torben Bach Pedersen, Rui Mao, Christian S. Jensen, Xiaofang Zhou ,"Profit Optimization in Spatial Crowdsourcing: Effectiveness and Efficiency" in IEEE Transactions on Knowledge and Data Engineering, 2022

Link

In spatial crowdsourcing, mobile users perform spatio-temporal tasks that involve travel to specified locations. Spatial crowdsourcing (SC) is enabled by SC platforms that support mobile worker recruitment and retention, as well as task assignment, which is essential to maximize the profits that accrue from serving task requests. Specifically, how to best achieve task assignment in a cost-effective manner while contending with spatio-temporal constraints is a key challenge in SC. To address this challenge, we formalize and study a novel Profit-driven Task Assignment problem. We first establish a task reward pricing model that takes into account the temporal constraints (i.e., expected completion time and deadline) of tasks. Then we adopt an optimal algorithm based on tree decomposition to achieve an optimal task assignment and propose greedy algorithms based on Random Tuning Optimization to improve the computational efficiency. To balance effectiveness and efficiency, we also provide a heuristic task assignment algorithm based on Ant Colony Optimization that assigns tasks by simulating the behavior of ant colonies foraging for food. Finally, we conduct extensive experiments using real and synthetic data, offering detailed insight into the effectiveness and efficiency of the proposed methods.

Tobias Skovgaard Jepsen, Christian S. Jensen, Thomas Dyhre Nielsen (corresponding author) ,"Relational Fusion Networks: Graph Convolutional Networks for Road Networks" in IEEE Transactions on Intelligent Transportation Systems, 2022

Link [Publicly available]

The application of machine learning techniques in the setting of road networks holds the potential to facilitate many important intelligent transportation applications. Graph Convolutional Networks (GCNs) are neural networks that are capable of leveraging the structure of a network. However, many implicit assumptions of GCNs do not apply to road networks. We introduce the Relational Fusion Network (RFN), a novel type of Graph Convolutional Network (GCN) designed specifically for road networks. In particular, we propose methods that outperform state-of-the-art GCN architectures by up to 21-40% on two machine learning tasks in road networks. Furthermore, we show that state-of-the-art GCNs may fail to effectively leverage road network structure and may not generalize well to other road networks.

Tung Kieu, Bin Yang, Chenjuan Guo, Christian S. Jensen, Yan Zhao, Feiteng Huang, Kai Zheng (corresponding author) ,"Robust and Explainable Autoencoders for Unsupervised Time Series Outlier Detection"

Link

Time series data occurs widely, and outlier detection is a fundamental problem in data mining, which has numerous applications. Existing autoencoder-based approaches deliver state-of-the-art performance on challenging real-world data but are vulnerable to outliers and exhibit low explainability. To address these two limitations, we propose robust and explainable unsupervised autoencoder frameworks that decompose an input time series into a clean time series and an outlier time series using autoencoders. Improved explainability is achieved because clean time series are better explained with easy-to-understand patterns such as trends and periodicities. We provide insight into this by means of a post-hoc explainability analysis and empirical studies. In addition, since outliers are separated from clean time series iteratively, our approach offers improved robustness to outliers, which in turn improves accuracy. We evaluate our approach on five real-world datasets and report improvements over the state-of-the-art approaches in terms of robustness and explainability.

Tung Kieu, Bin Yang, Chenjuan Guo, Christian S. Jensen, Yan Zhao, Feiteng Huang, Kai Zheng ,"Robust and Explainable Autoencoders for Unsupervised Time Series Outlier Detection - Extended Version." in CoRR, 2022

Link

Huan Li, Bo Tang, Hua Lu, Muhammad Aamir Cheema, Christian S. Jensen ,"Spatial Data Quality in the IoT Era: Management and Exploitation" in 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022, 2022

Link [Publicly available]

Within the rapidly expanding Internet of Things (IoT), growing amounts of spatially referenced data are being generated. Due to the dynamic, decentralized, and heterogeneous nature of the IoT, spatial IoT data (SID) quality has attracted considerable attention in academia and industry. How to invent and use technologies for managing spatial data quality and exploiting low-quality spatial data are key challenges in the IoT. In this tutorial, we highlight the SID consumption requirements in applications and offer an overview of spatial data quality in the IoT setting. In addition, we review pertinent technologies for quality management and low-quality data exploitation, and we identify trends and future directions for quality-aware SID management and utilization. The tutorial aims to not only help researchers and practitioners to better comprehend SID quality challenges and solutions, but also offer insights that may enable innovative research and applications.

Ziquan Fang, Yuntao Du, Xinjun Zhu, Danlei Hu, Lu Chen, Yunjun Gao, Christian S. Jensen ,"Spatio-Temporal Trajectory Similarity Learning in Road Networks" in 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2022, 2022

Link

Deep learning based trajectory similarity computation holds the potential for improved efficiency and adaptability over traditional similarity computation. However, existing learning-based trajectory similarity solutions prioritize spatial similarity over temporal similarity, making them suboptimal for time-aware analyses. To this end, we propose ST2Vec, a representation learning based solution that considers fine-grained spatial and temporal relations between trajectories to enable spatio-temporal similarity computation in road networks. Specifically, ST2Vec encompasses two steps: (i) spatial and temporal modeling that encode spatial and temporal information of trajectories, where a generic temporal modeling module is proposed for the first time; and (ii) spatio-temporal co-attention fusion, where two fusion strategies are designed to enable the generation of unified spatio-temporal embeddings of trajectories. Further, under the guidance of triplet loss, ST2Vec employs curriculum learning in model optimization to improve convergence and effectiveness. An experimental study offers evidence that ST2Vec outperforms state-of-the-art competitors substantially in terms of effectiveness and efficiency, while showing low parameter sensitivity and good model robustness. Moreover, similarity-based case studies, including top-k querying and DBSCAN clustering, offer further insight into the capabilities of ST2Vec.

Bezaye Tesfaye, Nikolaus Augsten, Mateusz Pawlik, Michael H. Böhlen, Christian S. Jensen (corresponding author) ,"Speeding Up Reachability Queries in Public Transport Networks Using Graph Partitioning" in Information Systems Frontiers, 2022

Link [Publicly available]

Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. Neither of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.
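
Baseline option (1) from the abstract above, an edge-by-edge Dijkstra-style expansion with time-dependent costs, can be sketched as follows. This is an illustrative sketch only and not the paper's index: the function name reachable_pois, the timetable layout (departure time, arrival time, target stop), and the use of elapsed time as the cost budget are all assumptions.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical timetable: stop -> list of (departure_time, arrival_time, target_stop).
Timetable = Dict[Hashable, List[Tuple[float, float, Hashable]]]


def reachable_pois(
    timetable: Timetable,
    pois: Iterable[Hashable],
    source: Hashable,
    start_time: float,
    budget: float,
) -> Dict[Hashable, float]:
    """Return each POI reachable within `budget` time units, with its cost."""
    deadline = start_time + budget
    earliest = {source: start_time}  # earliest known arrival time per stop
    heap = [(start_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > earliest.get(u, float("inf")):
            continue  # stale heap entry
        for dep, arr, v in timetable.get(u, ()):
            # A connection is usable only if it departs after we arrive at u
            # and arrives before the budget is exhausted.
            if dep >= t and arr <= deadline and arr < earliest.get(v, float("inf")):
                earliest[v] = arr
                heapq.heappush(heap, (arr, v))
    return {p: earliest[p] - start_time for p in pois if p in earliest}


# Usage with a tiny hand-made timetable (all values are made up).
timetable = {
    "A": [(10, 20, "B"), (5, 8, "C")],
    "B": [(25, 30, "D"), (15, 18, "D")],
}
short_trip = reachable_pois(timetable, ["B", "C", "D"], "A", start_time=9, budget=15)
long_trip = reachable_pois(timetable, ["B", "C", "D"], "A", start_time=9, budget=25)
```

The sketch touches every usable connection one at a time, which reflects the edge-by-edge expansion that the abstract identifies as the reason this option is slow on large networks.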

Tobias Skovgaard Jepsen, Christian S. Jensen, Thomas Dyhre Nielsen ,"UniTE - The Best of Both Worlds - Unifying Function-Fitting and Aggregation-Based Approaches to Travel Time and Travel Speed Estimation." in Transactions on Spatial Algorithms and Systems, 2022

Link [Publicly available]

Travel time and speed estimation are part of many intelligent transportation applications. Existing estimation approaches rely on either function fitting or data aggregation and represent different tradeoffs between generalizability and accuracy. Function-fitting approaches learn functions that map feature vectors of, e.g., routes to travel time or speed estimates, which enables generalization to unseen routes. However, mapping functions are imperfect and offer poor accuracy in practice. Aggregation-based approaches instead form estimates by aggregating historical data, e.g., traversal data for routes. This enables very high accuracy given sufficient data. However, they rely on simplistic heuristics when insufficient data is available, yielding poor generalizability. We present a Unifying approach to Travel time and speed Estimation (UniTE) that combines function-fitting and aggregation-based approaches into a unified framework that aims to achieve the generalizability of function-fitting approaches and the accuracy of aggregation-based approaches when data is available. We demonstrate empirically that an instance of UniTE can improve the accuracies of travel speed and travel time estimation by 40–64% and 3–23%, respectively, compared to using only function fitting or data aggregation.

Sean Bin Yang, Chenjuan Guo, Jilin Hu, Bin Yang, Jian Tang, Christian S. Jensen ,"Weakly-supervised Temporal Path Representation Learning with Contrastive Curriculum Learning"

Link [Publicly available]

In step with the digitalization of transportation, we are witnessing a growing range of path-based smart-city applications, e.g., travel-time estimation and travel path ranking. A temporal path (TP) that includes temporal information, e.g., departure time, into the path is fundamental to enabling such applications. In this setting, it is essential to learn generic temporal path representations (TPRs) that consider spatial and temporal correlations simultaneously and that can be used in different applications, i.e., downstream tasks. Existing methods fail to achieve the goal since (i) supervised methods require large amounts of task-specific labels when training and thus fail to generalize the obtained TPRs to other tasks; (ii) though unsupervised methods can learn generic representations, they disregard the temporal aspect, leading to sub-optimal results. To contend with the limitations of existing solutions, we propose a Weakly-Supervised Contrastive learning model. We first propose a temporal path encoder that encodes both the spatial and temporal information of a temporal path into a TPR. To train the encoder, we introduce weak labels that are easy and inexpensive to obtain, and are relevant to different tasks, e.g., temporal labels indicating peak vs. off-peak hour from departure times. Based on the weak labels, we construct meaningful positive and negative temporal path samples by considering both spatial and temporal information, which facilitates training the encoder using contrastive learning by pulling closer the positive samples' representations while pushing away the negative samples' representations. To better guide the contrastive learning, we propose a learning strategy based on Curriculum Learning such that the learning proceeds from easy to hard training instances. 
Experimental studies involving three downstream tasks, i.e., travel time estimation, path ranking, and path recommendation, on three road networks offer strong evidence that the proposal is superior to state-of-the-art unsupervised and supervised methods and that it can be used as a pre-training approach to enhance supervised TPR learning.

Sean Bin Yang, Chenjuan Guo, Jilin Hu, Bin Yang, Jian Tang, Christian S. Jensen ,"Weakly-supervised Temporal Path Representation Learning with Contrastive Curriculum Learning - Extended Version." in arXiv, 2022

Link

In step with the digitalization of transportation, we are witnessing a growing range of path-based smart-city applications, e.g., travel-time estimation and travel path ranking. A temporal path (TP) that includes temporal information, e.g., departure time, into the path is fundamental to enabling such applications. In this setting, it is essential to learn generic temporal path representations (TPRs) that consider spatial and temporal correlations simultaneously and that can be used in different applications, i.e., downstream tasks. Existing methods fail to achieve the goal since (i) supervised methods require large amounts of task-specific labels when training and thus fail to generalize the obtained TPRs to other tasks; (ii) though unsupervised methods can learn generic representations, they disregard the temporal aspect, leading to sub-optimal results. To contend with the limitations of existing solutions, we propose a Weakly-Supervised Contrastive (WSC) learning model. We first propose a temporal path encoder that encodes both the spatial and temporal information of a temporal path into a TPR. To train the encoder, we introduce weak labels that are easy and inexpensive to obtain and are relevant to different tasks, e.g., temporal labels indicating peak vs. off-peak hours from departure times. Based on the weak labels, we construct meaningful positive and negative temporal path samples by considering both spatial and temporal information, which facilitates training the encoder using contrastive learning by pulling closer the positive samples' representations while pushing away the negative samples' representations. To better guide contrastive learning, we propose a learning strategy based on Curriculum Learning such that the learning proceeds from easy to hard training instances. Experimental studies verify the effectiveness of the proposed method.

Jingyi Wan, Yongyong Gao, Yong Ma, Kai Huang, Xiaofang Zhou, Christian S. Jensen, Bolong Zheng ,"Workload-Aware Shortest Path Distance Querying in Road Networks" in 38th IEEE International Conference on Data Engineering, ICDE 2022, 2022

Link

Computing shortest-path distances in road networks is core functionality in a range of applications. To enable the efficient computation of such distance queries, existing proposals frequently apply 2-hop labeling that constructs a label for each vertex and enables the computation of a query by performing only a linear scan of labels. However, few proposals take into account the spatio-temporal characteristics of query workloads. We observe that real-world workloads exhibit (1) spatial skew, meaning that only a small subset of vertices are queried frequently, and (2) temporal locality, meaning that adjacent time intervals have similar query distributions. We propose a Workload-aware Core-Forest label index (WCF) to exploit spatial skew in workloads. In addition, we develop a Reinforcement Learning based Time Interval Partitioning (RL-TIP) algorithm that exploits temporal locality to partition workloads to achieve further performance improvements. Extensive experiments with real-world data offer insights into the performance of the proposals, showing that they achieve 62% speedup on average for query processing with less preprocessing time and space overhead when compared with the state-of-the-art proposals.
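
The linear label scan mentioned in the abstract above can be sketched generically. This illustrates plain 2-hop labeling rather than the WCF index: the function name query_distance, the label layout (lists of hub and distance pairs sorted by hub id), and the hand-built labels for a three-vertex path graph are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical label layout: (hub_id, distance_to_hub) pairs sorted by hub_id.
Label = List[Tuple[int, float]]


def query_distance(label_s: Label, label_t: Label) -> float:
    """Merge-style linear scan over two sorted labels.

    Returns the minimum of d(s, h) + d(h, t) over hubs h shared by both
    labels, or infinity if the labels share no hub.
    """
    i = j = 0
    best = float("inf")
    while i < len(label_s) and j < len(label_t):
        hub_s, dist_s = label_s[i]
        hub_t, dist_t = label_t[j]
        if hub_s == hub_t:
            best = min(best, dist_s + dist_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best


# Hand-built labels for the path graph 0 -(1.0)- 1 -(2.0)- 2, with vertex 1
# acting as the common hub (constructed by hand, not by a labeling algorithm).
labels: Dict[int, Label] = {
    0: [(0, 0.0), (1, 1.0)],
    1: [(1, 0.0)],
    2: [(1, 2.0), (2, 0.0)],
}
d_02 = query_distance(labels[0], labels[2])
d_01 = query_distance(labels[0], labels[1])
```

The scan runs in time linear in the combined label sizes. Its result equals the true shortest-path distance only when the labels satisfy the 2-hop cover property, which label construction is responsible for ensuring.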

Xinle Wu, Dalin Zhang, Chenjuan Guo, Chaoyang He, Bin Yang, Christian S. Jensen ,"AutoCTS: Automated Correlated Time Series Forecasting" in Proceedings of the VLDB Endowment, 2021

Link [Publicly available]

Correlated time series (CTS) forecasting plays an essential role in many cyber-physical systems, where multiple sensors emit time series that capture interconnected processes. Solutions based on deep learning that deliver state-of-the-art CTS forecasting performance employ a variety of spatio-temporal (ST) blocks that are able to model temporal dependencies and spatial correlations among time series. However, two challenges remain. First, ST-blocks are designed manually, which is time consuming and costly. Second, existing forecasting models simply stack the same ST-blocks multiple times, which limits the model potential. To address these challenges, we propose AutoCTS that is able to automatically identify highly competitive ST-blocks as well as forecasting models with heterogeneous ST-blocks connected using diverse topologies, as opposed to the same ST-blocks connected using simple stacking. Specifically, we design both a micro and a macro search space to model possible architectures of ST-blocks and the connections among heterogeneous ST-blocks, and we provide a search strategy that is able to jointly explore the search spaces to identify optimal forecasting models. Extensive experiments on eight commonly used CTS forecasting benchmark datasets justify our design choices and demonstrate that AutoCTS is capable of automatically discovering forecasting models that outperform state-of-the-art human-designed models.

Xinle Wu, Dalin Zhang, Chenjuan Guo, Chaoyang He, Bin Yang, Christian S. Jensen ,"AutoCTS - Automated Correlated Time Series Forecasting - Extended Version."

Link

Christian S. Jensen (Editor), Ee-Peng Lim (Editor), De-Nian Yang (Editor), Wang-Chien Lee (Editor), Vincent S. Tseng (Editor), Vana Kalogeraki (Editor), Jen-Wei Huang (Editor), Chih-Ya Shen (Editor) ,"Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part II" in 26th International Conference, DASFAA 2021, 2021

Link

The three-volume set LNCS 12681-12683 constitutes the proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA 2021, held in Taipei, Taiwan, in April 2021. The total of 156 papers presented in this three-volume set was carefully reviewed and selected from 490 submissions. The topic areas for the selected papers include information retrieval, search and recommendation techniques; RDF, knowledge graphs, semantic web, and knowledge management; and spatial, temporal, sequence, and streaming data management, while the dominant keywords are network, recommendation, graph, learning, and model. These topic areas and keywords shed light on the direction in which research in DASFAA is moving. Due to the Corona pandemic, this event was held virtually.

Christian S. Jensen (Editor), Ee-Peng Lim (Editor), De-Nian Yang (Editor), Wang-Chien Lee (Editor), Vincent S. Tseng (Editor), Vana Kalogeraki (Editor), Jen-Wei Huang (Editor), Chih-Ya Shen (Editor) ,"Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part I" in 26th International Conference, DASFAA 2021, 2021

Link

Christian S. Jensen (Editor), Ee-Peng Lim (Editor), De-Nian Yang (Editor), Wang-Chien Lee (Editor), Vincent S. Tseng (Editor), Vana Kalogeraki (Editor), Jen-Wei Huang (Editor), Chih-Ya Shen (Editor) ,"Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part III" in 26th International Conference, DASFAA 2021, 2021

Link

Christian S. Jensen (Editor), Ee-Peng Lim (Editor), De-Nian Yang (Editor), Chia-Hui Chang (Editor), Jianliang Xu (Editor), Wen-Chih Peng (Editor), Jen-Wei Huang (Editor), Chih-Ya Shen (Editor) ,"Database Systems for Advanced Applications. DASFAA 2021 International Workshops: BDQM, GDMA, MLDLDSA, MobiSocial, and MUST, Taipei, Taiwan, April 11-14, 2021, Proceedings" in 26th International Conference, DASFAA 2021, 2021

Link

This volume constitutes the papers of several workshops which were held in conjunction with the 26th International Conference on Database Systems for Advanced Applications, DASFAA 2021, held in Taipei, Taiwan, in April 2021. The 29 revised full papers presented in this book were carefully reviewed and selected from 84 submissions. DASFAA 2021 presents the following five workshops: the 6th International Workshop on Big Data Quality Management (BDQM 2021); the 5th International Workshop on Graph Data Management and Analysis (GDMA 2021); the First International Workshop on Machine Learning and Deep Learning for Data Security Applications (MLDLDSA 2021); the 6th International Workshop on Mobile Data Management, Mining, and Computing on Social Network (MobiSocial 2021); and the 2021 International Workshop on Mobile Ubiquitous Systems and Technologies (MUST 2021). Due to the Corona pandemic, this event was held virtually.

Ziquan Fang, Lu Chen, Yunjun Gao, Lu Pan, Christian S. Jensen (corresponding author) ,"Dragoon: a hybrid and efficient big trajectory management system for offline and online analytics" in VLDB Journal, 2021

Link

With the explosive use of GPS-enabled devices, increasingly massive volumes of trajectory data capturing the movements of people and vehicles are becoming available, which is useful in many application areas, such as transportation, traffic management, and location-based services. As a result, many trajectory data management and analytic systems have emerged that target either offline or online settings. However, some applications call for both offline and online analyses. For example, in traffic management scenarios, offline analyses of historical trajectory data can be used for traffic planning purposes, while online analyses of streaming trajectories can be adopted for congestion monitoring purposes. Existing trajectory-based systems tend to perform offline and online trajectory analysis separately, which is inefficient. In this paper, we propose a hybrid and efficient framework, called Dragoon, based on Spark, to support both offline and online big trajectory management and analytics. The framework features a mutable resilient distributed dataset model, including RDD Share, RDD Update, and RDD Mirror, which enables hybrid storage of historical and streaming trajectories. It also contains a real-time partitioner capable of efficiently distributing trajectory data and supporting both offline and online analyses. Therefore, Dragoon provides a hybrid analysis pipeline. Support for several typical trajectory queries and mining tasks demonstrates the flexibility of Dragoon. 
An extensive experimental study using both real and synthetic trajectory datasets shows that Dragoon (1) achieves offline trajectory query performance similar to that of the state-of-the-art system UlTraMan; (2) reduces storage overhead by up to a factor of two compared with UlTraMan during trajectory editing; (3) achieves at least a 40% improvement in scalability compared with popular stream processing frameworks (i.e., Flink and Spark Streaming); and (4) offers, on average, a twofold performance improvement for online trajectory data analytics.

Tianyi Li, Lu Chen, Christian S. Jensen, Torben Bach Pedersen, Jilin Hu ,"Evolutionary Clustering of Streaming Trajectories."

Link

Yan Zhao, Kai Zheng, Jiannan Guo, Bin Yang, Torben Bach Pedersen, Christian S. Jensen (corresponding author) ,"Fairness-aware task assignment in spatial crowdsourcing: Game-theoretic approaches" in 37th IEEE International Conference on Data Engineering, ICDE 2021, 2021

Link

The widespread diffusion of smartphones offers a capable foundation for the deployment of Spatial Crowdsourcing (SC), where mobile users, called workers, perform location-dependent tasks assigned to them. A key issue in SC is how best to assign tasks, e.g., the delivery of food and packages, to appropriate workers. Specifically, we study the problem of Fairness-aware Task Assignment (FTA) in SC, where tasks are to be assigned in a manner that achieves some notion of fairness across workers. In particular, we aim to minimize the payoff difference among workers while maximizing the average worker payoff. To solve the problem, we first generate so-called Valid Delivery Point Sets (VDPSs) for each worker according to an approach that exploits dynamic programming and distance-constrained pruning. Next, we show that FTA is NP-hard and proceed to propose two heuristic algorithms, a Fairness-aware Game-Theoretic (FGT) algorithm and an Improved Evolutionary Game-Theoretic (IEGT) algorithm. More specifically, we formulate FTA as a multi-player game. In this setting, the FGT approach represents a best-response method with sequential and asynchronous updates of workers' strategies, given by the VDPSs, that achieves a satisfying task assignment when a pure Nash equilibrium is reached. Next, the IEGT approach considers a setting with a large population of workers that repeatedly engage in strategic interactions. The IEGT approach exploits replicator dynamics that cause the whole population to evolve and choose better resources, i.e., VDPSs. Using the property of evolutionary equilibrium, a satisfying task assignment is obtained that corresponds to a stable state with similar payoffs among workers and good average worker payoff. Extensive experiments offer insight into the effectiveness and efficiency of the proposed solutions.

Zhida Chen, Lisi Chen, Gao Cong, Christian S. Jensen (corresponding author) ,"Location- and keyword-based querying of geo-textual data: a survey" in VLDB Journal, 2021

Link [Publicly available]

With the broad adoption of mobile devices, notably smartphones, keyword-based search for content has seen increasing use by mobile users, who are often interested in content related to their geographical location. We have also witnessed a proliferation of geo-textual content that encompasses both textual and geographical information. Examples include geo-tagged microblog posts, yellow pages, and web pages related to entities with physical locations. Over the past decade, substantial research has been conducted on integrating location into keyword-based querying of geo-textual content in settings where the underlying data is assumed to be either relatively static or is assumed to stream into a system that maintains a set of continuous queries. This paper offers a survey of both the research problems studied and the solutions proposed in these two settings. As such, it aims to offer the reader a first understanding of key concepts and techniques, and it serves as an “index” for researchers who are interested in exploring the concepts and techniques underlying proposed solutions to the querying of geo-textual data.

Bolong Zheng, Xi Zhao, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, Christian S. Jensen ,"PM-LSH - a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search."

Link

Zhe Li, Tsz Nam Chan, Man Lung Yiu, Christian S. Jensen ,"PolyFit: Polynomial-based indexing approach for fast approximate range aggregate queries" in Advances in Database Technology - 24th International Conference on Extending Database Technology, EDBT 2021, 2021

Link [Publicly available]

Range aggregate queries find frequent application in data analytics. In many use cases, approximate results are preferred over accurate results if they can be computed rapidly and satisfy approximation guarantees. Inspired by a recent indexing approach, we provide means of representing a discrete point dataset by continuous functions that can then serve as compact index structures. More specifically, we develop a polynomial-based indexing approach, called PolyFit, for processing approximate range aggregate queries. PolyFit is capable of supporting multiple types of range aggregate queries, including COUNT, SUM, MIN and MAX aggregates, with guaranteed absolute and relative error bounds. Experimental results show that PolyFit is faster and more accurate and compact than existing learned index structures.

Bolong Zheng, Lianggui Weng, Xi Zhao, Kai Zeng, Xiaofang Zhou, Christian S. Jensen ,"REPOSE: Distributed Top-k Trajectory Similarity Search with Local Reference Point Tries"

Link

Trajectory similarity computation is a fundamental component in a variety of real-world applications, such as ridesharing, road planning, and transportation optimization. Recent advances in mobile devices have enabled an unprecedented increase in the amount of available trajectory data such that efficient query processing can no longer be supported by a single machine. As a result, means of performing distributed in-memory trajectory similarity search are called for. However, existing distributed proposals either suffer from computing resource waste or are unable to support the range of similarity measures that are being used. We propose a distributed in-memory management framework called REPOSE for processing top-k trajectory similarity queries on Spark. We develop a reference point trie (RP-Trie) index to organize trajectory data for local search. In addition, we design a novel heterogeneous global partitioning strategy to eliminate load imbalance in distributed settings. We report on extensive experiments with real-world data that offer insight into the performance of the solution, and show that the solution is capable of outperforming the state-of-the-art proposals.

Bolong Zheng, Lianggui Weng, Xi Zhao, Kai Zeng, Xiaofang Zhou, Christian S. Jensen ,"REPOSE: Distributed top-k trajectory similarity search with local reference point tries" in 37th IEEE International Conference on Data Engineering, ICDE 2021, 2021

Link [Publicly available]

Trajectory similarity computation is a fundamental component in a variety of real-world applications, such as ridesharing, road planning, and transportation optimization. Recent advances in mobile devices have enabled an unprecedented increase in the amount of available trajectory data such that efficient query processing can no longer be supported by a single machine. As a result, means of performing distributed in-memory trajectory similarity search are called for. However, existing distributed proposals either suffer from computing resource waste or are unable to support the range of similarity measures that are being used. We propose a distributed in-memory management framework called REPOSE for processing top-k trajectory similarity queries on Spark. We develop a reference point trie (RP-Trie) index to organize trajectory data for local search. In addition, we design a novel heterogeneous global partitioning strategy to eliminate load imbalance in distributed settings. We report on extensive experiments with real-world data that offer insight into the performance of the solution, and show that the solution is capable of outperforming the state-of-the-art proposals.

Qi Hu, Lingfeng Ming, Ruijie Xi, Lu Chen, Christian S. Jensen, Bolong Zheng (corresponding author) ,"SOUP: A fleet management system for passenger demand prediction and competitive taxi supply" in 37th IEEE International Conference on Data Engineering, ICDE 2021, 2021

Link [Publicly available]

Online car-hailing services have gained substantial popularity. An effective taxi fleet management strategy should not only increase taxi utilization by reducing taxi idle time, but should also improve passenger satisfaction by minimizing passenger waiting time. We demonstrate a fleet management system called SOUP that aims at minimizing taxi idle time and that monitors the fleet movement status. SOUP includes a passenger request prediction model called ST-GCSL that predicts the number of requests in the near future, and it includes a demand-aware route planning algorithm called DROP that provides idle taxis with search routes to serve potential requests. In addition, SOUP supports visualizing and analyzing historical passenger requests, simulating fleet movement, and computing evaluation metrics. We demonstrate how SOUP accurately predicts passenger demand and significantly reduces taxi idle time.

Lei Bi, Juan Cao, Guohui Li, Nguyen Quoc Viet Hung, Christian S. Jensen, Bolong Zheng (corresponding author) ,"SpeakNav: A voice-based navigation system via route description language understanding" in 37th IEEE International Conference on Data Engineering, ICDE 2021, 2021

Link [Publicly available]

Many navigation applications take natural language speech as input, which spares users from typing by hand and reduces the risk of traffic accidents. We propose the SpeakNav navigation system that enables users to describe intended routes via speech and supports clue-based route retrieval. SpeakNav includes a route description language understanding model for determining POIs and distances along expected routes, and it includes an efficient algorithm to compute desired routes. In addition, SpeakNav supports basic POI and location search and location-based route navigation. We demonstrate how SpeakNav accurately recognizes users' intentions and recommends appropriate routes in real application scenarios.

Bolong Zheng, Lei Bi, Juan Cao, Hua Chai, Jun Fang, Lu Chen, Yunjun Gao, Xiaofang Zhou, Christian S. Jensen ,"Speaknav: Voice-based route description language understanding for template-driven path search" in 47th International Conference on Very Large Data Bases, VLDB 2021, 2021

Link [Publicly available]

Many navigation applications take natural language speech as input, which avoids users typing in words and thus improves traffic safety. However, navigation applications often fail to understand a user’s free-form description of a route. In addition, they only support input of a specific source or destination, which does not enable users to specify additional route requirements. We propose a SpeakNav framework that enables users to describe intended routes via speech and then recommends appropriate routes. Specifically, we propose a novel Route Template based Bidirectional Encoder Representation from Transformers (RT-BERT) model that supports the understanding of natural language route descriptions. The model enables extraction of information of intended POI keywords and related distances. Then we formalize a template-driven path query that uses the extracted information. To enable efficient query processing, we develop a hybrid label index for computing network distances between POIs, and we propose a branch-and-bound algorithm along with a pivot reverse B-tree (PB-tree) index. Experiments with real and synthetic data indicate that RT-BERT offers high accuracy and that the proposed algorithm is capable of outperforming baseline algorithms.

Ziquan Fang, Yuntao Du, Xinjun Zhu, Lu Chen, Yunjun Gao, Christian S. Jensen ,"ST2Vec - Spatio-Temporal Trajectory Similarity Learning in Road Networks."

Link

Hao Huang, Qian Yan, Lu Chen, Yunjun Gao, Christian S. Jensen ,"Statistical Inference of Diffusion Networks" in IEEE Transactions on Knowledge & Data Engineering, 2021

Link [Publicly available]

To infer structures in diffusion networks, existing approaches mostly need to know not only the final infection statuses of network nodes, but also the exact times when infections occur. In contrast, in many real-world settings, such as disease propagation, monitoring exact infection times is often infeasible due to a high cost. We investigate the problem of how to learn diffusion network structures based on only the final infection statuses of nodes. Instead of utilizing sequences of timestamps to determine potential parent-child influence relationships between nodes, we propose to find influence relationships with high statistical significance. To this end, we design a probabilistic generative model of the final infection statuses to quantitatively measure the likelihood of potential structures of the objective diffusion network, taking into account network complexity. Based on this model, we can infer an appropriate number of most probable parent nodes for each node in the network. Furthermore, to reduce redundant inference computations, we are able to preclude insignificant candidate parent nodes from being considered during inferencing, if their infections have little correlation with the infections of the corresponding child nodes. Extensive experiments on both synthetic and real-world networks offer evidence that the proposed approach is effective and efficient.

Tianyi Li, Lu Chen, Christian S. Jensen, Torben Bach Pedersen ,"TRACE: Real-time Compression of Streaming Trajectories in Road Networks" in Proceedings of the VLDB Endowment, 2021

Link [Publicly available]

The deployment of vehicle location services generates increasingly massive vehicle trajectory data, which incurs high storage and transmission costs. A range of studies target offline compression to reduce the storage cost. However, to enable online services such as real-time traffic monitoring, it is attractive to also reduce transmission costs by being able to compress streaming trajectories in real-time. Hence, we propose a framework called TRACE that enables compression, transmission, and querying of network-constrained streaming trajectories in a fully online fashion. We propose a compact two-stage representation of streaming trajectories: a speed-based representation removes redundant information, and a multiple-references based referential representation exploits subtrajectory similarities. In addition, the online referential representation is extended with reference selection, deletion and rewriting functions that further improve the compression performance. An efficient data transmission scheme is provided for achieving low transmission overhead. Finally, indexing and filtering techniques support efficient real-time range queries over compressed trajectories. Extensive experiments with real-life and synthetic datasets evaluate the different parts of TRACE, offering evidence that it is able to outperform the existing representative methods in terms of both compression ratio and transmission cost.

David Campos, Tung Kieu, Chenjuan Guo, Feiteng Huang, Kai Zheng, Bin Yang, Christian S. Jensen ,"Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles." in Proceedings of the VLDB Endowment, 2021

Link [Publicly available]

With the sweeping digitalization of societal, medical, industrial, and scientific processes, sensing technologies are being deployed that produce increasing volumes of time series data, thus fueling a plethora of new or improved applications. In this setting, outlier detection is frequently important, and while solutions based on neural networks exist, they leave room for improvement in terms of both accuracy and efficiency. With the objective of achieving such improvements, we propose a diversity-driven, convolutional ensemble. To improve accuracy, the ensemble employs multiple basic outlier detection models built on convolutional sequence-to-sequence autoencoders that can capture temporal dependencies in time series. Further, a novel diversity-driven training method maintains diversity among the basic models, with the aim of improving the ensemble’s accuracy. To improve efficiency, the approach enables a high degree of parallelism during training. In addition, it is able to transfer some model parameters from one basic model to another, which reduces training time. We report on extensive experiments using real-world multivariate time series that offer insight into the design choices underlying the new approach and offer evidence that it is capable of improved accuracy and efficiency.

Zhong Yang, Bolong Zheng, Guohui Li, Zhao Xi, Xiaofang Zhou, Christian S. Jensen ,"Adaptive Top-k Overlap Set Similarity Joins" in 36th IEEE International Conference on Data Engineering, 2020

Link

The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. As results, and their utility, are very sensitive to the choice of threshold value, the difficulty of choosing an appropriate value is a problem. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm. This affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.
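
To make the query semantics concrete, the following is a brute-force baseline for the top-k overlap set similarity join. It is not the fixed or adaptive step size algorithm described in the abstract: it simply enumerates all pairs, and it assumes that overlap similarity is the size of the intersection. The function name `topk_overlap_join` is hypothetical.

```python
import heapq
from itertools import combinations


def topk_overlap_join(sets, k):
    """Return the k pairs with the largest overlap as (overlap, i, j) tuples.

    Brute force: every pair is examined, so the cost is quadratic in the
    number of sets. Ties are broken by the larger index pair.
    """
    scored = (
        (len(a & b), i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
    )
    return heapq.nlargest(k, scored)


example = [{1, 2, 3, 4}, {2, 3, 4, 5}, {4, 5, 6}, {7, 8}]
top2 = topk_overlap_join(example, 2)
```

No threshold has to be chosen here, only `k`, which is the usability argument the abstract makes for the top-k formulation.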

Simon Aagaard Pedersen, Bin Yang, Christian S. Jensen ,"A Hybrid Learning Approach to Stochastic Routing" in International Conference on Data Engineering, 2020

Link

Increasingly available trajectory data enables detailed capture of traffic conditions. We consider an uncertain road network graph, where each graph edge is associated with a travel time distribution, and we study probabilistic budget routing that aims to find the path with the highest probability of arriving within a given time budget. In this setting, a fundamental operation is to compute the travel cost distribution of a path from the cost distributions of the edges in the path. Solutions that rely on convolution generally assume independence among the edges' distributions, which often does not hold and thus incurs poor accuracy. We propose a hybrid approach that combines convolution and machine learning-based estimation to take into account dependencies among distributions in order to improve accuracy. Next, we propose an efficient routing algorithm that is able to utilize the hybrid approach and that features effective pruning techniques to enable faster routing. Empirical studies on a substantial real-world trajectory set offer insight into the properties of the proposed solution, indicating that it is promising.
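
The convolution step mentioned above can be sketched as follows. This is only the independence-based baseline that the abstract says is often inaccurate, not the hybrid approach. It assumes discrete travel-time distributions stored as dictionaries from time to probability. The names `convolve` and `prob_within_budget` are hypothetical.

```python
from collections import defaultdict


def convolve(dist_a, dist_b):
    """Travel-time distribution of two consecutive edges.

    Assumes the two edge distributions are independent, which is exactly
    the assumption that often fails on real road networks.
    """
    out = defaultdict(float)
    for time_a, prob_a in dist_a.items():
        for time_b, prob_b in dist_b.items():
            out[time_a + time_b] += prob_a * prob_b
    return dict(out)


def prob_within_budget(dist, budget):
    """Probability that the travel time does not exceed the budget."""
    return sum(prob for time, prob in dist.items() if time <= budget)


edge_1 = {10: 0.5, 20: 0.5}
edge_2 = {5: 0.8, 15: 0.2}
path = convolve(edge_1, edge_2)
on_time = prob_within_budget(path, 25)
```

Probabilistic budget routing then compares candidate paths by this on-time probability for the given budget.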

Bezaye Tesfaye, Nikolaus Augsten, Mateusz Pawlik, Michael Hanspeter Böhlen, Christian S. Jensen ,"An Efficient Index for Reachability Queries in Public Transport Networks" in European Conference on Advances in Databases and Information Systems 2020, 2020

Link [Publicly available]

Bolong Zheng, Kai Zheng, Christian S. Jensen, Quoc Viet Hung Nguyen, Han Su, Guohui Li, Xiaofang Zhou ,"Answering Why-Not Group Spatial Keyword Queries" in IEEE Transactions on Knowledge & Data Engineering, 2020

Link

With the proliferation of geo-textual objects on the web, extensive efforts have been devoted to improving the efficiency of top-k spatial keyword queries in different settings. However, comparatively much less work has been reported on enhancing the quality and usability of such queries. In this context, we propose means of enhancing the usability of a top-k group spatial keyword query, where a group of users aim to find k objects that contain given query keywords and are nearest to the users. Specifically, when users receive the result of such a query, they may find that one or more objects that they expect to be in the result are in fact missing, and they may wonder why. To address this situation, we develop a so-called why-not query that is able to minimally modify the original query into a query that returns the expected, but missing, objects, in addition to other objects. Specifically, we formalize the why-not query in relation to the top-k group spatial keyword query, called the Why-not Group Spatial Keyword Query (WGSK) that is able to provide a group of users with a more satisfactory query result. We propose a three-phase framework for efficiently computing the WGSK. The first phase substantially reduces the search space for the subsequent phases by retrieving a set of objects that may affect the ranking of the user-expected objects. The second phase provides an incremental sampling algorithm that generates candidate weightings of more promising queries. The third phase determines the penalty of each refined query and returns the query with minimal penalty, i.e., the minimally modified query. Extensive experiments with real and synthetic data offer evidence that the proposed solution excels over baselines with respect to both effectiveness and efficiency.

Simon Aagaard Pedersen, Bin Yang, Christian S. Jensen ,"Anytime Stochastic Routing with Hybrid Learning" in 2020 International Conference on Very Large Databases PhD Workshop, VLDB-PhD 2020, 2020

Link

Increasingly massive volumes of vehicle trajectory data hold the potential to enable higher-resolution traffic services than hitherto possible. We use trajectory data to create a high-resolution, uncertain road-network graph, where edges are associated with travel-time distributions. In this setting, we study probabilistic budget routing that aims to find the path with the highest probability of arriving at a destination within a given time budget. A key challenge is to compute accurately and efficiently the travel-time distribution of a path from the travel-time distributions of the edges in the path. Existing solutions that rely on convolution assume independence among the distributions to be convolved, but as distributions are often dependent, the result distributions exhibit poor accuracy. We propose a hybrid approach that combines convolution with estimation based on machine learning to account for dependencies among distributions in order to improve accuracy. Since the hybrid approach cannot rely on the independence assumption that enables effective pruning during routing, naive use of the hybrid approach is costly. To address the resulting efficiency challenge, we propose an anytime routing algorithm that is able to return a “good enough” path at any time and that eventually computes a high-quality path. Empirical studies involving a substantial real-world trajectory set offer insight into the design properties of the proposed solution, indicating that it is practical in real-world settings.

Ziquan Fang, Yunjun Gao, Lu Pan, Lu Chen, Xiaoye Miao, Christian S. Jensen ,"CoMing: A Real-time Co-Movement Mining System for Streaming Trajectories" in ACM SIGMOD International Conference on Management of Data 2020, 2020

Link

The aim of real-time co-movement pattern mining for streaming trajectories is to discover co-moving objects that satisfy specific spatio-temporal constraints in real time. This functionality serves a range of real-world applications, such as traffic monitoring and management. However, little work targets visualization of and interaction with such co-movement detection on streaming trajectories. To this end, we develop CoMing, a real-time co-movement pattern mining system, to handle streaming trajectories. CoMing leverages ICPE, a real-time distributed co-movement pattern detection framework, and thus inherits its good performance. This demonstration offers hands-on experience with CoMing's visual and user-friendly interface. Moreover, several applications in the traffic domain, including object monitoring and traffic statistics visualization, are also provided to users.

Tianyi Li, Ruikai Huang, Lu Chen, Christian S. Jensen, Torben Bach Pedersen ,"Compression of Uncertain Trajectories in Road Networks" in Proceedings of the VLDB Endowment, 2020

Link [Publicly available]

Massive volumes of uncertain trajectory data are being generated by GPS devices. Due to the limitations of GPS data, these trajectories are generally uncertain. This state of affairs renders it attractive to be able to compress uncertain trajectories and to be able to query the trajectories efficiently without the need for (full) decompression. Unlike existing studies that target accurate trajectories, we propose a framework that accommodates uncertain trajectories in road networks. To address the large cardinality of instances of a single uncertain trajectory, we exploit the similarity between uncertain trajectory instances and provide a referential representation. First, we propose a reference selection algorithm based on the notion of Fine-grained Jaccard Distance to efficiently select trajectory instances as references. Then we provide referential representations of the different types of information contained in trajectories to achieve high compression ratios. In particular, a new compression scheme for temporal information is presented to take into account variations in sample intervals. Finally, we propose an index and develop filtering techniques to support efficient queries over compressed uncertain trajectories. Extensive experiments with real-life datasets offer insight into the properties of the framework and suggest that it is capable of outperforming the existing state-of-the-art method in terms of both compression ratio and efficiency.

Chenjuan Guo, Bin Yang, Jilin Hu, Christian S. Jensen, Lu Chen (corresponding author) ,"Context-aware, preference-based vehicle routing" in VLDB Journal, 2020

Link

Vehicle routing is an important service that is used by both private individuals and commercial enterprises. Drivers may have different contexts that are characterized by different routing preferences. For example, during different times of day or weather conditions, drivers may make different routing decisions such as preferring or avoiding highways. The increasing availability of vehicle trajectory data yields an increasingly rich data foundation for context-aware, preference-based vehicle routing. We aim to improve routing quality by providing new, efficient routing techniques that identify and take contexts and their preferences into account. In particular, we first provide means of learning contexts and their preferences, and we apply these to enhance routing quality while ensuring efficiency. Our solution encompasses an off-line phase that exploits a contextual preference tensor to learn the relationships between contexts and routing preferences. Given a particular context for which trajectories exist, we learn a routing preference. Then, we transfer learned preferences from contexts with trajectories to similar contexts without trajectories. In the on-line phase, given a context, we identify the corresponding routing preference and use it for routing. To achieve efficiency, we propose preference-based contraction hierarchies that are capable of speeding up both off-line learning and on-line routing. Empirical studies with vehicle trajectory data offer insight into the properties of the proposed solution, indicating that it is capable of improving quality and is efficient.

Christian S. Jensen ,"Editorial: Updates to the Editorial Board" in ACM Transactions on Database Systems, 2020

Link [Publicly available]

Jianzhong Qi, Guanli Liu, Christian S. Jensen, Lars Kulik ,"Effectively Learning Spatial Indices" in Proceedings of the VLDB Endowment, 2020

Link [Publicly available]

Machine learning, especially deep learning, is used increasingly to enable better solutions for data management tasks previously solved by other means, including database indexing. A recent study shows that a neural network can not only learn to predict the disk address of the data value associated with a one-dimensional search key but also outperform B-tree-based indexing, and thus promises to speed up a broad range of database queries that rely on B-trees for efficient data access. We consider the problem of learning an index for two-dimensional spatial data. A direct application of a neural network is unattractive because there is no obvious ordering of spatial point data. Instead, we introduce a rank space based ordering technique to establish an ordering of point data and group the points into blocks for index learning. To enable scalability, we propose a recursive strategy that partitions a large point set and learns indices for each partition. Experiments on real and synthetic data sets with more than 100 million points show that our learned indices are highly effective and efficient. Query processing using our indices is more than an order of magnitude faster than the use of R-trees or a recently proposed learned index.

Jiehuan Luo, Xin Cao, Xike Xie, Qiang Qu, Zhiqiang Xu, Christian S. Jensen ,"Efficient Attribute-Constrained Co-Located Community Search" in 36th IEEE International Conference on Data Engineering, 2020

Link

Networked data, notably social network data, often comes with a rich set of annotations, or attributes, such as documents (e.g., tweets) and locations (e.g., check-ins). Community search in such attributed networks has been studied intensively due to its many applications in friend recommendation, event organization, advertising, etc. We study the problem of attribute-constrained co-located community (ACOC) search, which returns a community that satisfies three properties: i) structural cohesiveness: the members in the community are densely connected; ii) spatial co-location: the members are close to each other; and iii) attribute constraint: a set of attributes are covered by the attributes associated with the members. The ACOC problem is shown to be NP-hard. We develop four efficient approximation algorithms with guaranteed error bounds in addition to an exact solution that works on relatively small graphs. Extensive experiments conducted with both real and synthetic data offer insight into the efficiency and effectiveness of the proposed methods, showing that they outperform three adapted state-of-the-art algorithms by an order of magnitude. We also find that the approximation algorithms are much faster than the exact solution and yet offer high accuracy.

Xinjue Wang, Ke Deng, Jianxing Li, Jeffery Xu Yu, Christian S. Jensen, Xiaochun Yang ,"Efficient targeted influence minimization in big social networks" in World Wide Web, 2020

Link

An online social network can be used for the diffusion of malicious information like derogatory rumors, disinformation, hate speech, revenge pornography, etc. This motivates the study of influence minimization that aims to prevent the spread of malicious information. Unlike previous influence minimization work, this study considers influence minimization in relation to a particular group of social network users, called targeted influence minimization. Thus, the objective is to protect a set of users, called target nodes, from malicious information originating from another set of users, called active nodes. This study also addresses two fundamental, but largely ignored, issues in different influence minimization problems: (i) the impact of a budget on the solution; (ii) robust sampling. To this end, two scenarios are investigated, namely unconstrained and constrained budget. Given an unconstrained budget, we provide an optimal solution; given a constrained budget, we show the problem is NP-hard and develop a greedy algorithm with a (1 − 1/e)-approximation. More importantly, in order to solve the influence minimization problem in large, real-world social networks, we propose a robust sampling-based solution with a desirable theoretic bound. Extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.

Simon Aagaard Pedersen, Bin Yang, Christian S. Jensen ,"Fast stochastic routing under time-varying uncertainty" in The VLDB Journal, 2020

Link

Data are increasingly available that enable detailed capture of travel costs associated with the movements of vehicles in road networks, notably travel time and greenhouse gas emissions. In addition to varying across time, such costs are inherently uncertain, due to varying traffic volumes, weather conditions, different driving styles among drivers, etc. In this setting, we address the problem of enabling fast route planning with time-varying, uncertain edge weights. We initially present a practical approach to transforming GPS trajectories into time-varying, uncertain edge weights that guarantee the first-in-first-out property. Next, we propose time-dependent uncertain contraction hierarchies (TUCHs), a generic speed-up technique that supports a wide variety of stochastic route planning functionality in the paper’s setting. In particular, we propose query processing methods based on TUCH for two representative types of stochastic routing: non-dominated routing and probabilistic budget routing. Experimental studies with a substantial GPS data set offer insight into the design properties of the paper’s proposals and suggest that they are capable of enabling efficient stochastic routing.

Shuo Shang, Lisi Chen, Christian S. Jensen, Panos Kalnis ,"Introduction to Spatio-temporal data management and analytics for Smart City research" in Geoinformatica, 2020

Link [Publicly available]

This special issue of the GeoInformatica journal covers recent advances in spatio-temporal data management and analytics in the context of smart city and urban computing. It contains 11 articles that present solid research studies and innovative ideas in the area of spatio-temporal data management for smart city research. All of the 11 papers went through several rounds of rigorous reviews by the guest editors and invited reviewers.

Geo-textual query processing has been receiving much attention in the area of spatio-temporal data management. The paper, by Xinyu Chen et al., “S2R-tree: a pivot-based indexing structure for semantic-aware spatial keyword search,” proposes a pivot-based hierarchical indexing structure to integrate spatial and semantic information in a seamless way. The proposed index is able to return accurate query results that take semantic meaning of geo-textual objects into consideration. Another paper, by Zhongpu Chen et al., “ITISS: an efficient framework for querying big temporal data,” proposes an in-memory based two-level index structure in Spark, which is easily understood and implemented, but without loss of effectiveness and efficiency. Additionally, the paper, by Xiaozhao Song et al., “Collective spatial keyword search on activity trajectories,” presents an effective and efficient collective spatial keyword query processing algorithm on activity trajectories. Finally, Lisi Chen et al., “Spatial keyword search: a survey,” present a survey of existing studies regarding spatial keyword search.

Location-based social networks (LBSNs) are becoming increasingly indispensable in smart cities. Hao Wang and Ziyu Lu develop the first unified and generic framework to support user-preference based sequence matching in their paper “Preference-aware sequence matching for location-based services.” Yanhui Li et al. propose an approach to extracting similar user patterns from LBSNs and annotating semantic tags of locations in their paper “Annotating semantic tags of locations in location-based social networks.” The problem is solved by training a binary ELM classifier for each tag in the tag space to support multi-label classification.

Spatial crowdsourcing (SC) is an emerging research direction in spatio-temporal data analytics. Tianshu Song et al. focus on solving a fundamental issue in SC, assigning tasks to suitable workers to obtain multiple global objectives, in their paper “Multi-skill aware task assignment in real-time spatial crowdsourcing.” They define the multi-skill aware task assignment problem in real-time SC, which is proven to be NP-hard, and propose an online greedy algorithm that iteratively assigns optimal workers. Yiming Li et al., in their paper “Two-sided online bipartite matching in spatial data: experiments and analysis,” present a comprehensive evaluation and analysis of the representative algorithms for the two-sided online bipartite matching problem, which is widely studied in the area of spatio-temporal data management.

Furthermore, the paper, by Yuliang Ma et al., “Graph simulation on large scale temporal graphs,” investigates the problem of temporal bounded simulation on temporal graphs, which is a fundamental problem in urban computing. It presents a simulation matching framework consisting of pattern segmentation, temporal bounded simulation of pattern segments, and result integration. Mengqing Mei et al. focus on another fundamental problem in urban computing, identifying the correlation between features and labels from multi-label urban datasets, in their paper “An innovative multi-label learning based algorithm for city data computing.” In particular, they propose a multi-label learning algorithm that learns separate subspaces for features and labels by maximizing the independence between the components in each subspace.

Finally, the paper, by Jihai Yang et al., “Joint hyperspectral unmixing for urban computing,” focuses on an important problem related to urban computing: joint hyperspectral unmixing. Specifically, it presents an algorithm to process two hyperspectral images, simultaneously, and makes full use of the available information when most of the signals at the two end points are similar.

These papers represent a variety of directions in the fast-growing area of spatio-temporal data management and analytics in smart city applications. We hope that these papers will foster the development of smart cities and inspire more research in this promising area.

Bolong Zheng, Chenze Huang, Christian S. Jensen, Lu Chen, Nguyen Quoc Viet Hung, Guanfeng Liu, Guohui Li, Kai Zheng ,"Online Trichromatic Pickup and Delivery Scheduling in Spatial Crowdsourcing" in International Conference on Data Engineering, 2020

Link [Publicly available]

In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts that focus on finding a schedule that enables the delivery of as many items as possible at the lowest cost, we consider trichromatic (worker-item-task) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options, thus further increasing the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD) that aims to find optimal delivery schedules with highest overall utility. In order to quickly respond to submitted tasks, we propose a greedy solution that finds the schedule with the highest utility-cost ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve the result quality. Finally, we propose a density-based grouping solution that partitions streaming tasks and efficiently assigns them to the workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions excel over baselines with respect to both effectiveness and efficiency.

Lisi Chen, Shuo Shang, Christian S. Jensen, Bin Yao, Panos Kalnis ,"Parallel Semantic Trajectory Similarity Join" in International Conference on Data Engineering, 2020

Link [Publicly available]

Matching similar pairs of trajectories, called trajectory similarity join, is a fundamental functionality in spatial data management. We consider the problem of semantic trajectory similarity join (STS-Join). Each semantic trajectory is a sequence of Points-of-interest (POIs) with both location and text information. Thus, given two sets of semantic trajectories and a threshold θ, the STS-Join returns all pairs of semantic trajectories from the two sets with spatio-textual similarity no less than θ. This join targets applications such as term-based trajectory near-duplicate detection, geo-text data cleaning, personalized ridesharing recommendation, keyword-aware route planning, and travel itinerary recommendation. With these applications in mind, we provide a purposeful definition of spatio-textual similarity. To enable efficient STS-Join processing on large sets of semantic trajectories, we develop trajectory pair filtering techniques and consider the parallel processing capabilities of modern processors. Specifically, we present a two-phase parallel search algorithm. We first group semantic trajectories based on their text information. The algorithm's per-group searches are independent of each other and thus can be performed in parallel. For each group, the trajectories are further partitioned based on the spatial domain. We generate spatial and textual summaries for each trajectory batch, based on which we develop batch filtering and trajectory-batch filtering techniques to prune unqualified trajectory pairs in a batch mode. Additionally, we propose an efficient divide-and-conquer algorithm to derive bounds of spatial similarity and textual similarity between two semantic trajectories, which enable us to prune dissimilar trajectory pairs without the need to compute the exact value of spatio-textual similarity. An experimental study with large semantic trajectory data confirms that our algorithm for processing semantic trajectory joins is capable of outperforming our well-designed baseline by a factor of 8-12.

Bolong Zheng, Zhao Xi, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, Christian S. Jensen ,"PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search" in Proceedings of the VLDB Endowment, 2020

Link [Publicly available]

Nearest neighbor (NN) search in high-dimensional spaces is inherently computationally expensive due to the curse of dimensionality. As a well-known solution to approximate NN search, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional computational overhead when having to examine unnecessary points. This in turn reduces the performance of query processing. In contrast, we propose a fast and accurate LSH framework, called PM-LSH, that aims to compute the c-ANN query on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy.

Dingming Wu, Can Hou, Erjia Xiao, Christian S. Jensen ,"Semantic Region Retrieval from Spatial RDF Data" in International Conference on Database Systems for Advanced Applications, 2020

Link

The top-k most relevant Semantic Place retrieval (kSP) query on spatial RDF data combines keyword-based and location-based retrieval. The query returns semantic places that are subgraphs rooted at a place entity with an associated location. The relevance to the query keywords of a semantic place is measured by a looseness score that aggregates the graph distances between the place (root) and the occurrences of the keywords in the nodes of the tree. We observe that kSP queries may retrieve semantic places that are spatially close to the query location, but with very low keyword relevance. When any single nearby place has low relevance, returning instead multiple relevant places may be helpful. Hence, we propose a generalization of semantic place retrieval, namely semantic region (SR) retrieval. An SR query aims to return multiple places that are spatially close to the query location such that each place is relevant to one or more query keywords. An algorithm and optimization techniques are proposed for the efficient processing of SR queries. Extensive empirical studies with two real datasets offer insight into the performance of the proposals.

Jilin Hu, Bin Yang, Chenjuan Guo, Christian S. Jensen, Hui Xiong ,"Stochastic Origin-Destination Matrix Forecasting Using Dual-Stage Graph Convolutional, Recurrent Neural Networks" in International Conference on Data Engineering, 2020

Link

Origin-destination (OD) matrices are used widely in transportation and logistics to record the travel cost (e.g., travel speed or greenhouse gas emission) between pairs of OD regions during different intervals within a day. We model a travel cost as a distribution because when traveling between a pair of OD regions, different vehicles may travel at different speeds even during the same interval, e.g., due to different driving styles or different waiting times at intersections. This yields stochastic OD matrices. We consider an increasingly pertinent setting where a set of vehicle trips is used for instantiating OD matrices. Since the trips may not cover all OD pairs for each interval, the resulting OD matrices are likely to be sparse. We then address the problem of forecasting complete, near future OD matrices from sparse, historical OD matrices. To solve this problem, we propose a generic learning framework that (i) employs matrix factorization and graph convolutional neural networks to contend with the data sparseness while capturing spatial correlations and that (ii) captures spatio-temporal dynamics via recurrent neural networks extended with graph convolutions. Empirical studies using two taxi trajectory data sets offer detailed insight into the properties of the framework and indicate that it is effective.

Lisi Chen, Shuo Shang, Christian S. Jensen, Jianliang Xu, Panos Kalnis, Bin Yao, Ling Shao ,"Top-k term publish/subscribe for geo-textual data streams" in VLDB Journal, 2020

Link

Massive amounts of data that contain spatial, textual, and temporal information are being generated at a rapid pace. With streams of such data, which includes check-ins and geo-tagged tweets, available, users may be interested in being kept up-to-date on which terms are popular in the streams in a particular region of space. To enable this functionality, we aim at efficiently processing two types of general top-k term subscriptions over streams of spatio-temporal documents: region-based top-k spatial-temporal term (RST) subscriptions and similarity-based top-k spatio-temporal term (SST) subscriptions. RST subscriptions continuously maintain the top-k most popular trending terms within a user-defined region. SST subscriptions free users from defining a region and maintain top-k locally popular terms based on a ranking function that combines term frequency, term recency, and term proximity. To solve the problem, we propose solutions that are capable of supporting real-life location-based publish/subscribe applications that process large numbers of SST and RST subscriptions over a realistic stream of spatio-temporal documents. The performance of our proposed solutions is studied in extensive experiments using two spatio-temporal datasets.
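As a rough illustration of the region-based (RST) idea only, the following sketch counts terms in documents that fall inside a rectangular region and returns the k most frequent ones. The function name `top_k_terms_in_region`, the `(x, y, terms)` document layout, and the rectangular region are assumptions made for this example; the sketch ignores recency and continuous maintenance and is not the paper's method.

```python
from collections import Counter

def top_k_terms_in_region(documents, region, k):
    """Return the k most frequent terms among documents located inside region.

    documents: iterable of (x, y, terms) tuples (hypothetical layout).
    region: (min_x, min_y, max_x, max_y) rectangle (an assumption of this sketch).
    """
    min_x, min_y, max_x, max_y = region
    counts = Counter()
    for x, y, terms in documents:
        # Only documents whose location lies inside the region contribute.
        if min_x <= x <= max_x and min_y <= y <= max_y:
            counts.update(terms)
    return [term for term, _ in counts.most_common(k)]

# Hypothetical stream snapshot: three geo-tagged documents.
docs = [
    (1.0, 1.0, ["concert", "jazz"]),
    (2.0, 1.5, ["concert", "traffic"]),
    (9.0, 9.0, ["traffic", "traffic", "traffic"]),  # outside the region below
]
popular = top_k_terms_in_region(docs, (0.0, 0.0, 5.0, 5.0), 1)
```

A continuous subscription would additionally have to update the counts as documents arrive and expire, which this snapshot-style sketch does not attempt.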

Simon Aagaard Pedersen, Bin Yang, Christian S. Jensen ,"A Hybrid Learning Approach to Stochastic Routing"

Link [Publicly available]

Emerging disruptive innovations in transportation, e.g., autonomous vehicles and transportation-as-a-service, will benefit from high-resolution routing, where travel-time uncertainty is captured accurately.

Robert Waury, Peter Dolog, Christian S. Jensen, Kristian Torp ,"Analyzing trajectories using a path-based API" in 16th International Symposium on Spatial and Temporal Databases, SSTD 2019, 2019

Link

Large vehicle trajectory data sets can give detailed insight into traffic and congestion that is useful for routing as well as transportation planning. Making information from such data sets available to more users can enable applications that reduce travel time and fuel consumption. However, extracting such information efficiently requires deep knowledge of the underlying schema and indexing methods. To enable more users to extract information from trajectory data, we have developed an API that removes the need to be familiar with the schema. Furthermore, when giving access to trajectory data, privacy concerns often call for the application of anonymization methods before analysis results are made available. In our demonstration, owners of trajectory data are able to experiment with different levels of anonymization to see how this affects the quality of different types of trajectory analysis services implemented on top of a large trajectory data set.

Bolong Zheng, Kai Zheng, Christian S. Jensen, Nguyen Quoc Viet Hung, Han Su, Guohui Li, Xiaofang Zhou ,"Answering Why-Not Group Spatial Keyword Queries (Extended Abstract)" in 35th IEEE International Conference on Data Engineering, ICDE 2019, 2019

Link

With the proliferation of geo-textual objects on the web, extensive efforts have been devoted to improving the efficiency of top-k spatial keyword queries in different settings. However, comparatively much less work has been reported on enhancing the quality and usability of such queries. In this context, we propose means of enhancing the usability of a top-k group spatial keyword query, where a group of users aim to find k objects that contain given query keywords and are nearest to the users. Specifically, when users receive the result of such a query, they may find that one or more objects that they expect to be in the result are in fact missing, and they may wonder why. To address this situation, we develop a so-called why-not query that is able to minimally modify the original query into a query that returns the expected, but missing, objects, in addition to other objects. Specifically, we formalize the why-not query in relation to the top-k group spatial keyword query, called the Why-not Group Spatial Keyword Query (WGSK) that is able to provide a group of users with a more satisfactory query result. We propose a three-phase framework for efficiently computing the WGSK. Extensive experiments with real and synthetic data offer evidence that the proposed solution excels over baselines with respect to both effectiveness and efficiency.

Robert Waury, Christian S. Jensen, Kristian Torp ,"A NUMA-aware Trajectory Store for Travel-Time Estimation" in International Conference on Advances in Geographic Information Systems, 2019

Link

The increasingly massive volumes of vehicle trajectory data that are becoming available hold the potential to enable more accurate vehicle travel-time estimation than hitherto possible. To enable such uses, we present a multi-threaded, in-memory trajectory store that supports efficient and accurate travel-time estimation for road-network paths based on network-constrained trajectories. The trajectory store employs advanced indexing to support so-called strict-path queries that retrieve all trajectories that traverse a given path to provide accurate travel-time estimations. As a key novel feature, the store is designed and implemented to exploit modern non-uniform memory access (NUMA) systems. We provide a detailed experimental study of the performance of the trajectory store using a synthetic trajectory data set based on real traffic data. The study shows that query latency can be halved compared to our baseline system.

Lisi Chen, Shuo Shang, Christian S. Jensen, Bin Yao, Zhiwei Zhang, Ling Shao ,"Effective and Efficient Reuse of Past Travel Behavior for Route Recommendation" in ACM Conference on Knowledge Discovery and Data Mining , 2019

Link

With the increasing availability of moving-object tracking data, use of this data for route search and recommendation is increasingly important. To this end, we propose a novel parallel split-and-combine approach to enable route search by locations (RSL-Psc). Given a set of routes, a set of places to visit O, and a threshold θ, we retrieve the route composed of sub-routes that (i) has similarity to O no less than θ and (ii) contains the minimum number of sub-route combinations. The resulting functionality targets a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general. To enable efficient and effective RSL-Psc computation on massive route data, we develop novel search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we develop two parallel algorithms, Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). We divide the route split-and-combine task into $\sum_{k=0}^{M} S(|O|, k+1)$ sub-tasks, where M is the maximum number of combinations and S(⋅) is the Stirling number of the second kind. In each sub-task, we use network expansion and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent and are performed in parallel. Extensive experiments with real data offer insight into the performance of the algorithms, indicating that our RSL-Psc problem can generate high-quality results and that the two algorithms are capable of achieving high efficiency and scalability.
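To give a feel for how the number of sub-tasks stated in the abstract grows, the following sketch evaluates the sum of Stirling numbers of the second kind using the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1). The function names `stirling2` and `num_subtasks` are chosen for this example and do not come from the paper.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: partitions of n items into k non-empty blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def num_subtasks(num_places: int, max_combinations: int) -> int:
    """Evaluate sum_{k=0}^{M} S(|O|, k+1) with |O| = num_places and M = max_combinations."""
    return sum(stirling2(num_places, k + 1) for k in range(max_combinations + 1))

# With |O| = 3 places and M = 2: S(3,1) + S(3,2) + S(3,3) = 1 + 3 + 1 = 5 sub-tasks.
example = num_subtasks(3, 2)
```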

Lu Chen, Yunjun Gao, Yuanliang Zhang, Christian S. Jensen, Bolong Zheng ,"Efficient and Incremental Clustering Algorithms on Star-Schema Heterogeneous Graphs" in The 35th IEEE International Conference on Data Engineering (ICDE), 2019

Link

Many datasets including social media data and bibliographic data can be modeled as graphs. Clustering such graphs is able to provide useful insights into the structure of the data. To improve the quality of clustering, node attributes can be taken into account, resulting in attributed graphs. Existing attributed graph clustering methods generally consider attribute similarity and structural similarity separately. In this paper, we represent attributed graphs as star-schema heterogeneous graphs, where attributes are modeled as different types of graph nodes. This enables the use of personalized pagerank (PPR) as a unified distance measure that captures both structural and attribute similarity. We employ DBSCAN for clustering, and we update edge weights iteratively to balance the importance of different attributes. To improve the efficiency of the clustering, we develop two incremental approaches that aim to enable efficient PPR score computation when edge weights are updated. To boost the effectiveness of the clustering, we propose a simple yet effective edge weight update strategy based on entropy. In addition, we present a game theory based method that enables trading efficiency for result quality. Extensive experiments on real-life datasets offer insight into the effectiveness and efficiency of our proposals, compared with existing methods.
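For readers unfamiliar with personalized PageRank, the following is a minimal power-iteration sketch on a small weighted graph in which an attribute is modeled as an extra node, in the spirit of the star-schema representation. The graph layout (a dict of dicts), the restart probability `alpha`, the fixed iteration count, and the node names are assumptions of this example; the paper's incremental algorithms and edge weight update strategy are not reproduced here.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Approximate PPR scores with respect to source by power iteration.

    graph: dict mapping each node to a dict {neighbor: edge_weight};
           every neighbor must also appear as a key (assumption of this sketch).
    alpha: probability of restarting the walk at source.
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            if not neighbors:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += (1 - alpha) * scores[node]
                continue
            total = sum(neighbors.values())
            for neighbor, weight in neighbors.items():
                nxt[neighbor] += (1 - alpha) * scores[node] * weight / total
        nxt[source] += alpha  # restart mass
        scores = nxt
    return scores

# Hypothetical graph: u1 and u2 are linked structurally; u1 and u3 only share an attribute node.
graph = {
    "u1": {"u2": 1.0, "attr:db": 1.0},
    "u2": {"u1": 1.0},
    "u3": {"attr:db": 1.0},
    "attr:db": {"u1": 1.0, "u3": 1.0},
}
ppr = personalized_pagerank(graph, "u1")
```

In this toy graph, u3 receives a non-zero score from u1 purely through the shared attribute node, which is the intuition behind using one measure for both structural and attribute similarity.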

Tianming Zhang, Yunjun Gao, Lu Chen, Wei Guo, Shiliang Pu, Baihua Zheng, Christian S. Jensen ,"Efficient distributed reachability querying of massive temporal graphs" in VLDB Journal, 2019

Link [Publicly available]

Reachability computation is a fundamental graph functionality with a wide range of applications. In spite of this, little work has as yet been done on efficient reachability queries over temporal graphs, which are used extensively to model time-varying networks, such as communication networks, social networks, and transportation schedule networks. Moreover, we are faced with increasingly large real-world temporal networks that may be distributed across multiple data centers. This state of affairs motivates the paper’s study of efficient reachability queries on distributed temporal graphs. We propose an efficient index, called Temporal Vertex Labeling (TVL), which is a labeling scheme for distributed temporal graphs. We also present algorithms that exploit TVL to achieve efficient support for distributed reachability querying over temporal graphs in Pregel-like systems. The algorithms exploit several optimizations that hinge upon non-trivial lemmas. Extensive experiments using massive real and synthetic temporal graphs are conducted to provide detailed insight into the efficiency and scalability of the proposed methods, covering both index construction and query processing. Compared with the state-of-the-art methods, the TVL based query algorithms are capable of up to an order of magnitude speedup with lower index construction overhead.

Dingming Wu, Dexin Luo, Christian S. Jensen, Joshua Zhexu Huang ,"Efficiently Mining Maximal Diverse Frequent Itemsets" in International Conference on Database Systems for Advanced Applications, 2019

Link [Publicly available]

Given a database of transactions, where each transaction is a set of items, maximal frequent itemset mining aims to find all itemsets that are frequent, meaning that they consist of items that co-occur in transactions more often than a given threshold, and that are maximal, meaning that they are not contained in other frequent itemsets. Such itemsets are the most interesting ones in a meaningful sense. We study the problem of efficiently finding such itemsets with the added constraint that only the top-k most diverse ones should be returned. An itemset is diverse if its items belong to many different categories according to a given hierarchy of item categories. We propose a solution that relies on a purposefully designed index structure called the FP*-tree and an accompanying bound-based algorithm. An extensive experimental study offers insight into the performance of the solution, indicating that it is capable of outperforming an existing method by orders of magnitude and of scaling to large databases of transactions.
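The definitions of "frequent" and "maximal" can be made concrete with a brute-force sketch that enumerates all candidate itemsets. This is exponential in the number of items and serves only to illustrate the definitions; it is not the FP*-tree based solution, and it omits the diversity ranking. The function name and the choice of treating an itemset as frequent when its support is at least `min_support` are assumptions of this example.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of maximal frequent itemsets.

    transactions: list of sets of items.
    min_support: minimum number of transactions that must contain an itemset
                 (threshold convention assumed for this sketch).
    """
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                frequent.append(itemset)
    # Keep only itemsets that are not strictly contained in another frequent itemset.
    return [s for s in frequent if not any(s < other for other in frequent)]

# Each pair occurs in 3 transactions, the full triple in only 2.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = maximal_frequent_itemsets(transactions, 3)
```

With a support threshold of 3, the three pairs are frequent and maximal, while the single items are frequent but not maximal and the triple is not frequent.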

Kaiyu Feng, Gao Cong, Christian S. Jensen, Tao Guo ,"Finding Attribute-Aware Similar Region for Data Analysis" in Proceedings of the VLDB Endowment, 2019

Link [Publicly available]

With the proliferation of mobile devices and location-based services, increasingly massive volumes of geo-tagged data are becoming available. This data typically also contains non-location information. We study how to use such information to characterize a region and then how to find a region of the same size and with the most similar characteristics. This functionality enables a user to identify regions that share characteristics with a user-supplied region that the user is familiar with and likes. More specifically, we formalize and study a new problem called the attribute-aware similar region search (ASRS) problem. We first define so-called composite aggregators that are able to express aspects of interest in terms of the information associated with a user-supplied region. When applied to a region, an aggregator captures the region's relevant characteristics. Next, given a query region and a composite aggregator, we propose a novel algorithm called DS-Search to find the most similar region of the same size. Unlike any previous work on region search, DS-Search repeatedly discretizes and splits regions until a split region either satisfies a drop condition or is guaranteed not to contribute to the result. In addition, we extend DS-Search to solve the ASRS problem approximately. Finally, we report on extensive empirical studies that offer insight into the efficiency and effectiveness of the paper's proposals.

Tobias Skovgaard Jepsen, Christian S. Jensen, Thomas Dyhre Nielsen ,"Graph Convolutional Networks for Road Networks" in 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2019

Link [Publicly available]

The application of machine learning techniques in the setting of road networks holds the potential to facilitate many important transportation applications. Graph Convolutional Networks (GCNs) are neural networks that are capable of leveraging the structure of a network. However, many implicit assumptions of GCNs do not apply to road networks. We introduce the Relational Fusion Network (RFN), a novel type of GCN designed specifically for road networks. In particular, we propose methods that substantially outperform state-of-the-art GCNs on two machine learning tasks in road networks. Furthermore, we show that state-of-the-art GCNs fail to effectively leverage road network structure on these tasks.

Congcong Ge, Yunjun Gao, Xiaoye Miao, Lu Chen, Christian S. Jensen, Ziyuan Zhu ,"IHCS: An Integrated Hybrid Cleaning System" in 45th International Conference on Very Large Data Bases, 2019

Link [Publicly available]

Data cleaning is a prerequisite to subsequent data analysis, and is known to often be time-consuming and labor-intensive. We present IHCS, a hybrid data cleaning system that integrates error detection and repair to contend effectively with multiple error types. In a preprocessing step that precedes the data cleaning, IHCS formats an input dataset to be cleaned, and transforms applicable data quality rules into a unified format. Then, an MLN index structure is formed according to the unified rules, enabling IHCS to handle multiple error types simultaneously. During the cleaning, IHCS first tackles abnormalities through an abnormal group process, and then, it generates multiple data versions based on the MLN index. Finally, IHCS eliminates conflicting values across the multiple versions, and derives the final unified clean data. A visual interface enables cleaning process monitoring and cleaning result analysis.

Robert Waury, Christian S. Jensen, Satoshi Koide, Yoshiharu Ishikawa, Chuan Xiao ,"Indexing Trajectories for Travel-Time Histogram Retrieval" in 22nd International Conference on Extending Database Technology, EDBT 2019, 2019

Link [Publicly available]

A key service in vehicular transportation is routing according to estimated travel times. With the availability of massive volumes of vehicle trajectory data, it has become increasingly feasible to estimate travel times, which are typically modeled as probability distributions in the form of histograms. An earlier study shows that use of a carefully selected, context-dependent subset of available trajectories when estimating a travel-time histogram along a user-specified path can significantly improve the accuracy of the estimates. This selection of trajectories cannot occur in a pre-processing step, but must occur online—it must be integrated into the routing itself. It is then a key challenge to be able to select very efficiently the "right" subset of trajectories that offer the best accuracy when the cost of a route is to be assessed. To address this challenge, we propose a solution that applies novel indexing to all available trajectories and that then is capable of selecting the most relevant trajectories and of computing a travel-time distribution based on these trajectories. Specifically, the solution utilizes an in-memory trajectory index and a greedy algorithm to identify and retrieve the relevant trajectories. The paper reports on an extensive empirical study with a large real-world GPS data set that offers insight into the accuracy and efficiency of the proposed solution. The study shows that the proposed online selection of trajectories can be performed efficiently and is able to provide highly accurate travel-time distributions.

Dingming Wu, Yi Zhu, Christian S. Jensen ,"In Good Company: Efficient Retrieval of the Top-k Most Relevant Event-Partner Pairs" in International Conference on Database Systems for Advanced Applications, 2019

Link

The proliferation of event-based social networking (EBSN) motivates a range of studies on topics such as event, venue, and friend recommendation and event creation and organization. In this setting, the notion of event-partner recommendation has recently attracted attention. When recommending an event to a user, this functionality allows recommendation of partners with whom to attend the event. However, existing proposals are push-based: recommendations are pushed to users at the system’s initiative. In contrast, EBSNs provide users with keyword-based search functionality. This way, users may retrieve information in pull mode. We propose a new way of accessing information in EBSNs that combines push and pull, thus allowing users to not only conduct ad-hoc searches for events, but also to receive partner recommendations for retrieved events. Specifically, we define and study the top-k event-partner (kEP) pair retrieval query that integrates event-partner recommendation and keyword-based search for events. The query retrieves event-partner pairs, taking into account the relevance of events to user-supplied keywords and so-called together preferences that indicate the extent of a user’s preference to attend an event with a given partner. In order to compute kEP queries efficiently, we propose a rank-join based framework with three optimizations. Results of empirical studies with implementations of the proposed techniques demonstrate that the proposed techniques are capable of excellent performance.

Christian S. Jensen ,"Letter from the Impact Award Winner"

Link [Publicly available]

Christian S. Jensen, Dik Lee, Ling Liu ,"Message from the General Co-Chairs" in 20th International Conference on Mobile Data Management, MDM 2019, 2019

Link

Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.

Wenfei Fan, Xuemin Lin, Divesh Srivastava, Christian S. Jensen, Lionel M. Ni, M. Tamer Özsu ,"Message from the ICDE 2019 Chairs" in The 35th IEEE International Conference on Data Engineering (ICDE), 2019

Link

Alvis Logins, Panagiotis Karras, Christian S. Jensen ,"Multicapacity Facility Selection in Networks" in The 35th IEEE International Conference on Data Engineering (ICDE), 2019

Link

Consider the task of selecting a set of facilities, e.g., hotspots, shops, or utility stations, each with a capacity to serve a certain number of customers. Given a set of customer locations, we have to minimize a cumulative distance between each customer and the facility earmarked to serve this customer within its capacity. This problem is known as the Capacitated k-Median (CKM) problem. In a data-intensive variant, distances are calculated over a network, while a data set associates each candidate facility location with a different capacity. In other words, going beyond positioning facilities in a metric space, the problem is to select a small subset out of a large data set of candidate network-based facilities with capacity constraints. We call this variant the Multicapacity Facility Selection (MCFS) problem. Linear Programming solutions are unable to contend with the network sizes and supplies of candidate facilities encountered in real-world applications; yet the problem may need to be solved scalably and repeatedly, as in applications requiring the dynamic reallocation of customers to facilities. We present the first, to our knowledge, solution to the MCFS problem that achieves both scalability and high quality, the Wide Matching Algorithm (WMA). WMA iteratively assigns customers to candidate facilities and leverages a data-driven heuristic for the SETCOVER problem inherent to the MCFS problem. An extensive experimental study with real-world and synthetic networks demonstrates that WMA scales gracefully to million-node networks and large facility and customer data sets; further, WMA provides a solution quality superior to scalable baselines (also proposed in the paper) and competitive vis-à-vis the optimal solution, returned by an off-the-shelf solver that runs only on small facility databases.

Qiang Qu, Ildar Nurgaliev, Muhammad Muzammal, Christian S. Jensen, Jianping Fan ,"On spatio-temporal blockchain query processing" in Future Generation Computer Systems, 2019

Link [Publicly available]

Recent advances in blockchain technology suggest that the technology has potential for use in applications in a variety of new domains including spatio-temporal data management. The reliability and immutability of blockchains combined with the support for decentralized, trustless data processing offer new opportunities for applications in such domains. However, current blockchain proposals do not support spatio-temporal data processing, and the block-based sequential access in blockchain hinders efficient query processing. We propose spatio-temporal blockchain technology that supports fast query processing. More specifically, we propose blockchain technology that records time and location attributes for the transactions, maintains data integrity, and supports fast spatial queries by the introduction of a cryptographically signed tree data structure, the Merkle Block Space Index (BSI), which is a modification of the Merkle KD-tree. We consider Bitcoin-like near-uniform block generation, and we process temporal queries by means of a block-DAG data structure, called Temporal Graph Search (TGS), without the need for temporal indexes. To enable the experiments, we propose a random graph model to generate a block-DAG topology for an abstract peer-to-peer network. We perform a comprehensive evaluation to offer insight into the applicability and effectiveness of the proposed technology. The evaluation indicates that TGS-BSI is a promising solution for efficient spatio-temporal query processing on blockchains.

Tung Kieu, Bin Yang, Chenjuan Guo, Christian S. Jensen ,"Outlier Detection for Time Series with Recurrent Autoencoder Ensembles" in the 28th International Joint Conference on Artificial Intelligence, 2019

Link [Publicly available]

We propose two solutions to outlier detection in time series based on recurrent autoencoder ensembles. The solutions exploit autoencoders built using sparsely-connected recurrent neural networks (S-RNNs). Such networks make it possible to generate multiple autoencoders with different neural network connection structures. The two solutions are ensemble frameworks, specifically an independent framework and a shared framework, both of which combine multiple S-RNN based autoencoders to enable outlier detection. This ensemble-based approach aims to reduce the effects of some autoencoders being overfitted to outliers, this way improving overall detection quality. Experiments with two large real-world time series data sets, including univariate and multivariate time series, offer insight into the design properties of the proposed frameworks and demonstrate that the resulting solutions are capable of outperforming both baselines and the state-of-the-art methods.

Walid Aref, Michela Bertolotto, Panagiotis Bouros, Christian S. Jensen, Ahmed Mahmood, Kjetil Nørvåg, Dimitris Sacharidis, Mohamed Sarwat ,"Preface" in 16th International Symposium on Spatial and Temporal Databases, SSTD 2019, 2019

Link

The symposium brought together, for three days, researchers, practitioners, and developers for the presentation and discussion of current research on concepts, tools, and techniques related to spatial and temporal databases. SSTD 2019 was the 16th in a series of biennial events. Previous symposia were held in Santa Barbara (1989), Zurich (1991), Singapore (1993), Portland (1995), Berlin (1997), Hong Kong (1999), Los Angeles (2001), Santorini, Greece (2003), Angra dos Reis (2005), Boston (2007), Aalborg (2009), Minneapolis (2011), Munich (2013), Hong Kong (2015), and Arlington (2017).

Walid Aref (Editor), Michela Bertolotto (Editor), Panagiotis Bouros (Editor), Christian S. Jensen (Editor), Ahmed Mahmood (Editor), Kjetil Nørvåg (Editor), Dimitris Sacharidis (Editor), Mohamed Sarwat (Editor) ,"Proceedings of the 16th International Symposium on Spatial and Temporal Databases" in 16th International Symposium on Spatial and Temporal Databases, SSTD 2019, 2019

Link

Lu Chen, Yunjun Gao, Ziquan Fang, Xiaoye Miao, Christian S. Jensen, Chenjuan Guo ,"Real-time Distributed Co-Movement Pattern Detection on Streaming Trajectories" in Proceedings of the VLDB Endowment, 2019

Link [Publicly available]

With the widespread deployment of mobile devices with positioning capabilities, increasingly massive volumes of trajectory data are being collected that capture the movements of people and vehicles. This data enables co-movement pattern detection, which is important in applications such as trajectory compression and future-movement prediction. Existing co-movement pattern detection studies generally consider historical data and thus propose offline algorithms. However, applications such as future-movement prediction need real-time processing over streaming trajectories. Thus, we investigate real-time distributed co-movement pattern detection over streaming trajectories. Existing offline methods assume that all data is available when the processing starts. Nevertheless, in a streaming setting, unbounded data arrives in real time, making pattern detection challenging. To this end, we propose a framework based on Apache Flink, which is designed for efficient distributed streaming data processing. The framework encompasses two phases: clustering and pattern enumeration. To accelerate the clustering, we use a range join based on two-layer indexing, and provide techniques that eliminate unnecessary verifications. To perform pattern enumeration efficiently, we present two methods, FBA and VBA, that utilize id-based partitioning. When coupled with bit compression and candidate-based enumeration techniques, we reduce the enumeration cost from exponential to linear. Extensive experiments offer insight into the efficiency of the proposed framework and its constituent techniques compared with existing methods.

Gao Cong, Christian Søndergaard Jensen ,"Spatio-Textual Data"

Link

Jilin Hu, Chenjuan Guo, Bin Yang, Christian Søndergaard Jensen ,"Stochastic Weight Completion for Road Networks using Graph Convolutional Networks" in The 35th IEEE International Conference on Data Engineering (ICDE), 2019

Link

Innovations in transportation, such as mobility-on-demand services and autonomous driving, call for high-resolution routing that relies on an accurate representation of travel time throughout the underlying road network. Specifically, the travel time of a road-network edge is modeled as a time-varying distribution that captures the variability of traffic over time and the fact that different drivers may traverse the same edge at the same time at different speeds. Such stochastic weights may be extracted from data sources such as GPS and loop detector data. However, even very large data sources are incapable of covering all edges of a road network at all times. Yet, high-resolution routing needs stochastic weights for all edges. We solve the problem of filling in the missing weights. To achieve that, we provide techniques capable of estimating stochastic edge weights for all edges from traffic data that covers only a fraction of all edges. We propose a generic learning framework called Graph Convolutional Weight Completion (GCWC) that exploits the topology of a road network graph and the correlations of weights among adjacent edges to estimate stochastic weights for all edges. Next, we incorporate contextual information into GCWC to further improve accuracy. Empirical studies using loop detector data from a highway toll gate network and GPS data from a large city offer insight into the design properties of GCWC and its effectiveness.

Christian S. Jensen ,"Value Creation from Massive Data in Transportation - The Case of Vehicle Routing."

Link [Publicly available]

Robert Waury, Christian Søndergaard Jensen, Kristian Torp ,"Adaptive Travel-Time Estimation: A Case for Custom Predicate Selection" in 19th IEEE International Conference on Mobile Data Management, MDM 2018, 2018

Link

Travel-time estimation for paths in a road network often relies on pre-computed histograms that are usually available on a road segment level. Then the pre-computed histograms of the segments of a path are convolved to obtain a histogram that estimates the travel time. With the growing sizes of trajectory datasets, it becomes possible to compute histograms for increasingly longer sub-paths. Since pre-computation is infeasible for all sub-paths in a road network, we propose computing histograms on-the-fly, i.e., during routing. Such an on-the-fly method must filter the underlying trajectory dataset by spatio-temporal predicates to obtain the relevant trajectories and offers the opportunity to apply additional filtering predicates to the trajectories with little overhead. We report on a study showing that considerable improvements in the accuracy of the histograms obtained for paths can be achieved by choosing filtering predicates that adapt not only to the intended start of a trip, but also to the driver and the weather. We also make the case for a sub-path partitioning based on segment categories since there are significant differences between road types when applying our on-the-fly method.
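
The segment-level convolution mentioned in this abstract can be sketched as follows. This is a minimal Python illustration, assuming each histogram is a mapping from travel time (in seconds) to probability and that segment travel times are independent; the `convolve` helper and the sample values are hypothetical.

```python
def convolve(h1, h2):
    """Convolve two discrete travel-time distributions.

    Each argument maps a travel time (e.g., seconds) to a probability.
    The result is the distribution of the sum, assuming independence.
    """
    out = {}
    for t1, p1 in h1.items():
        for t2, p2 in h2.items():
            out[t1 + t2] = out.get(t1 + t2, 0.0) + p1 * p2
    return out

# Hypothetical histograms for two consecutive road segments.
seg_a = {30: 0.5, 40: 0.5}
seg_b = {20: 0.25, 30: 0.75}
path = convolve(seg_a, seg_b)  # distribution over 50, 60, and 70 seconds
```

The independence assumption is exactly what longer sub-path histograms avoid, since they capture dependencies between adjacent segments directly.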

Jianzhong Qi, Rui Zhang, Christian Søndergaard Jensen, Ramamohanarao Kotagiri, Jiayuan He ,"Continuous Spatial Query Processing: A Survey of Safe Region Based Techniques" in ACM Computing Surveys, 2018

Link [Publicly available]

In the past decade, positioning system-enabled devices such as smartphones have become most prevalent. This functionality brings the increasing popularity of location-based services in business as well as daily applications such as navigation, targeted advertising, and location-based social networking. Continuous spatial queries serve as a building block for location-based services. As an example, an Uber driver may want to be kept aware of the nearest customers or service stations. Continuous spatial queries require updates to the query result as the query or data objects are moving. This poses challenges to the query efficiency, which is crucial to the user experience of a service. A large number of approaches address this efficiency issue using the concept of safe region. A safe region is a region within which arbitrary movement of an object leaves the query result unchanged. Such a region helps reduce the frequency of query result update and hence improves query efficiency. As a result, safe region-based approaches have been popular for processing various types of continuous spatial queries. Safe regions have interesting theoretical properties and are worth in-depth analysis. We provide a comparative study of safe region-based approaches. We describe how safe regions are computed for different types of continuous spatial queries, showing how they improve query efficiency. We compare the different safe region-based approaches and discuss possible further improvements.
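
To make the safe-region idea concrete, here is a small Python sketch for one simple case: a moving object and a static circular range query. The helper names and coordinates are hypothetical, and the disk computed here is only one conservative choice of safe region, not a technique taken from the survey.

```python
import math

def safe_radius(obj, center, r):
    """Radius of a disk around obj within which movement cannot change
    whether obj lies inside the circular range query (center, r)."""
    return abs(r - math.dist(obj, center))

def needs_update(last_reported, current, radius):
    """True once the object has reached or left the edge of its safe disk."""
    return math.dist(last_reported, current) >= radius

# Object at (3, 4), query centered at the origin with radius 10.
radius = safe_radius((3, 4), (0, 0), 10)         # distance to center is 5
still_safe = needs_update((3, 4), (3, 6), radius)   # moved 2 units
must_report = needs_update((3, 4), (3, 10), radius)  # moved 6 units
```

While the object stays inside its disk, it need not contact the server, which is how safe regions reduce the frequency of result updates.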

Michael Hanspeter Böhlen, Anton Dignös, Johann Gamper, Christian Søndergaard Jensen ,"Database Technology for Processing Temporal Data" in 25th International Symposium on Temporal Representation and Reasoning, 2018

Link [Publicly available]

Despite the ubiquity of temporal data and considerable research on processing such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in processing historical or temporal data. The SQL:2011 standard introduced some temporal features, and commercial database management systems have started to offer temporal functionalities in a step-by-step manner. There has also been a proposal for a more fundamental and comprehensive solution for sequenced temporal queries, which allows a tight integration into relational database systems, thereby taking advantage of existing query optimization and evaluation technologies. New challenges for processing temporal data arise with multiple dimensions of time and the increasing amounts of data, including time series data that represent a special kind of temporal data.

Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian Søndergaard Jensen, Wei Wei ,"Deep representation learning for trajectory similarity computation" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

Trajectory similarity computation is fundamental functionality with many applications such as animal migration pattern studies and vehicle trajectory mining to identify popular routes and similar drivers. While a trajectory is a continuous curve in some spatial domain, e.g., 2D Euclidean space, trajectories are often represented by point sequences. Existing approaches that compute similarity based on point matching suffer from the problem that they treat two different point sequences differently even when the sequences represent the same trajectory. This is particularly a problem when the point sequences are non-uniform, have low sampling rates, and have noisy points. We propose the first deep learning approach to learning representations of trajectories that is robust to low data quality, thus supporting accurate and efficient trajectory similarity computation and search. Experiments show that our method is capable of higher accuracy and is at least one order of magnitude faster than the state-of-the-art methods for k-nearest trajectory search.

Tung Kieu, Bin Yang, Chenjuan Guo, Christian S. Jensen ,"Distinguishing Trajectories from Different Drivers using Incompletely Labeled Trajectories" in 27th ACM International Conference on Information and Knowledge Management, 2018

Link

We consider a scenario that occurs often in the auto insurance industry. We are given a large collection of trajectories that stem from many different drivers. Only a small number of the trajectories are labeled with driver identifiers, and only some drivers are used in labels. The problem is to label correctly the unlabeled trajectories with driver identifiers. This is important in auto insurance to detect possible fraud and to identify the driver in, e.g., pay-as-you-drive settings when a vehicle has been involved in an incident. To solve the problem, we first propose a Trajectory-to-Image (T2I) encoding scheme that captures both geographic features and driving behavior features of trajectories in 3D images. Next, we propose a multi-task, deep learning model called T2INet for estimating the total number of drivers in the unlabeled trajectories, and then we partition the unlabeled trajectories into groups so that the trajectories in a group belong to the same driver. Experimental results on a large trajectory data set offer insight into the design properties of T2INet and demonstrate that T2INet is capable of outperforming baselines and the state-of-the-art method.

Christian Søndergaard Jensen ,"Editorial: Updates to the Editorial Board" in ACM Transactions on Database Systems, 2018

Link

Ilkcan Keles, Christian Søndergaard Jensen, Simonas Saltenis ,"Extracting Rankings for Spatial Keyword Queries from GPS Data" in 14th International Conference on Location Based Services, 2018

Link

Studies suggest that many search engine queries have local intent. We consider the evaluation of ranking functions important for such queries. The key challenge is to be able to determine the “best” ranking for a query, as this enables evaluation of the results of ranking functions. We propose a model that synthesizes a ranking of points of interest (PoI) for a given query using historical trips extracted from GPS data. To extract trips, we propose a novel PoI assignment method that makes use of distances and temporal information. We also propose a PageRank-based smoothing method to be able to answer queries for regions that are not covered well by trips. We report experimental results on a large GPS dataset that show that the proposed model is capable of capturing the visits of users to PoIs and of synthesizing rankings.

Qing Liu, Zijin Feng, Xike Xi, Jianliang Xu, Xin Lin, Christian Søndergaard Jensen ,"iZone: Efficient influence zone evaluation over geo-textual data" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

Owing to the widespread use of location-aware devices and the increased popularity of micro-blogging applications, we are witnessing a rapid proliferation of geo-textual data. In this demonstration, we present iZone, an efficient system for determining influence zones over geo-textual data. Specifically, iZone allows users to browse geo-textual objects, evaluate the influence zones of specified geo-textual objects, and obtain explanations of the evaluation results. The iZone system adopts a browser-server model. The server side integrates two types of spatial keyword search, namely top-k spatial keyword query and reverse top-k keyword-based location query, to support the functionality of the system. A variety of spatial indexes are employed to enhance the efficiency of the system. The browser side provides a map-based GUI, which enables convenient and user-friendly interaction with the system. Using a real hotel dataset from Hong Kong, iZone offers hands-on experience with influence zone evaluation in real-life applications.

Chenjuan Guo, Bin Yang, Jilin Hu, Christian Søndergaard Jensen (Corresponding author) ,"Learning to route with sparse trajectory sets" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

Motivated by the increasing availability of vehicle trajectory data, we propose learn-to-route, a comprehensive trajectory-based routing solution. Specifically, we first construct a graph-like structure from trajectories as the routing infrastructure. Second, we enable trajectory-based routing given an arbitrary (source, destination) pair. In the first step, given a road network and a collection of trajectories, we propose a trajectory-based clustering method that identifies regions in a road network. If a pair of regions are connected by trajectories, we maintain the paths used by these trajectories and learn a routing preference for travel between the regions. As trajectories are skewed and sparse, many region pairs are not connected by trajectories. We thus transfer routing preferences from region pairs with sufficient trajectories to such region pairs and then use the transferred preferences to identify paths between the regions. In the second step, we exploit the above graph-like structure to achieve a comprehensive trajectory-based routing solution. Empirical studies with two substantial trajectory data sets offer insight into the proposed solution, indicating that it is practical. A comparison with a leading routing service offers evidence that the paper's proposal is able to enhance routing quality.

Lisi Chen, Shuo Shang, Zhiwei Zhang, Xin Cao, Christian Søndergaard Jensen, Panos Kalnis ,"Location-aware top-k term publish/subscribe" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

Massive amounts of data that contain spatial, textual, and temporal information are being generated at a high rate. These spatio-temporal documents cover a wide range of topics in local areas. Users are interested in receiving local popular terms from spatio-temporal documents published within a specified region. We consider the Top-k Spatial-Temporal Term (ST2) Subscription. Given an ST2 subscription, we continuously maintain the up-to-date top-k most popular terms over a stream of spatio-temporal documents. The ST2 subscription takes into account both the frequency and the recency of a term generated from spatio-temporal document streams in evaluating its popularity. We propose an efficient solution to process a large number of ST2 subscriptions over a stream of spatio-temporal documents. The performance of processing ST2 subscriptions is studied in extensive experiments based on two real spatio-temporal datasets.
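
One way to picture a popularity score that combines frequency and recency is an exponentially decayed term count. The Python sketch below is an assumption for illustration, not the paper's scoring function or maintenance algorithm; the `top_k_terms` helper, the `half_life` parameter, and the sample documents are hypothetical, and the result is recomputed from scratch rather than maintained continuously.

```python
import heapq
from collections import defaultdict

def top_k_terms(docs, region, now, k, half_life=3600.0):
    """Return the k highest-scoring (term, score) pairs inside a region.

    docs:   iterable of (timestamp, (x, y), terms)
    region: (xmin, ymin, xmax, ymax)
    Each occurrence contributes 0.5 ** (age / half_life) to its term's score,
    so frequent and recent terms rank highest.
    """
    xmin, ymin, xmax, ymax = region
    scores = defaultdict(float)
    for ts, (x, y), terms in docs:
        if xmin <= x <= xmax and ymin <= y <= ymax:
            weight = 0.5 ** ((now - ts) / half_life)
            for term in terms:
                scores[term] += weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

docs = [
    (0,    (1, 1),   ["concert", "traffic"]),
    (3600, (2, 2),   ["traffic"]),
    (3600, (50, 50), ["storm"]),      # outside the subscribed region
    (7200, (1, 2),   ["festival"]),
]
result = top_k_terms(docs, region=(0, 0, 10, 10), now=7200, k=2)
```

In this example a single very recent mention outranks two older mentions, which shows how recency can outweigh raw frequency.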

Tobias S. Jepsen, Christian Søndergaard Jensen, Thomas Dyhre Nielsen, Kristian Torp ,"On Network Embedding for Machine Learning on Road Networks: A Case Study on the Danish Road Network" in 2018 IEEE International Conference on Big Data, 2018

Link

Road networks are a type of spatial network, where edges may be associated with qualitative information such as road type and speed limit. Unfortunately, such information is often incomplete; for instance, OpenStreetMap only has speed limits for 13% of all Danish road segments. This is problematic for analysis tasks that rely on such information for machine learning. To enable machine learning in such circumstances, one may consider the application of network embedding methods to extract structural information from the network. However, these methods have so far mostly been used in the context of social networks, which differ significantly from road networks in terms of, e.g., node degree and level of homophily (which are key to the performance of many network embedding methods). We analyze the use of network embedding methods, specifically node2vec, for learning road segment embeddings in road networks. Due to the often limited availability of information on other relevant road characteristics, the analysis focuses on leveraging the spatial network structure. Our results suggest that network embedding methods can indeed be used for deriving relevant network features (that may, e.g., be used for predicting speed limits), but that the qualities of the embeddings differ from embeddings for social networks.

Tung Kieu, Bin Yang, Christian Søndergaard Jensen (Corresponding author) ,"Outlier Detection for Multidimensional Time Series using Deep Neural Networks" in 19th IEEE International Conference on Mobile Data Management, MDM 2018, 2018

Link

Due to the continued digitization of industrial and societal processes, including the deployment of networked sensors, we are witnessing a rapid proliferation of time-ordered observations, known as time series. For example, the behavior of drivers can be captured by GPS or accelerometer as a time series of speeds, directions, and accelerations. We propose a framework for outlier detection in time series that, for example, can be used for identifying dangerous driving behavior and hazardous road locations. Specifically, we first propose a method that generates statistical features to enrich the feature space of raw time series. Next, we utilize an autoencoder to reconstruct the enriched time series. The autoencoder performs dimensionality reduction to capture, using a small feature space, the most representative features of the enriched time series. As a result, the reconstructed time series only capture representative features, whereas outliers often have non-representative features. Therefore, deviations of the enriched time series from the reconstructed time series can be taken as indicators of outliers. We propose and study autoencoders based on convolutional neural networks and long short-term memory neural networks. In addition, we show that embedding of contextual information into the framework has the potential to further improve the accuracy of identifying outliers. We report on empirical studies with multiple time series data sets, which offer insight into the design properties of the proposed framework, indicating that it is effective at detecting outliers.

Bin Yang, Jian Dai, Chenjuan Guo, Christian S. Jensen, Jilin Hu (Corresponding author) ,"PACE: a PAth-CEntric paradigm for stochastic path finding" in VLDB Journal, 2018

Link

With the growing volumes of vehicle trajectory data, it becomes increasingly possible to capture time-varying and uncertain travel costs, e.g., travel time, in a road network. The current paradigm for doing so is edge-centric: it represents a road network as a weighted graph and splits trajectories into small fragments that fit the underlying edges to assign time-varying and uncertain weights to edges. It then applies path finding algorithms to the resulting, weighted graph. We propose a new PAth-CEntric paradigm, PACE, that targets more accurate and more efficient path cost estimation and path finding. By assigning weights to paths, PACE avoids splitting trajectories into small fragments. We solve two fundamental problems to establish the PACE paradigm: (i) how to compute accurately the travel cost distribution of a path and (ii) how to conduct path finding for a source–destination pair. To solve the first problem, given a departure time and a query path, we show how to select an optimal set of paths that cover the query path and such that the weights of the paths enable the most accurate joint cost distribution estimation for the query path. The joint cost distribution models well the travel cost dependencies among the edges in the query path, which in turn enables accurate estimation of the cost distribution of the query path. We solve the second problem by showing that the resulting path cost distribution estimation method satisfies an incremental property that enables the method to be integrated seamlessly into existing stochastic path finding algorithms. Further, we propose a new stochastic path finding algorithm that fully explores the improved accuracy and efficiency provided by PACE. Empirical studies with trajectory data from two different cities offer insight into the design properties of the PACE paradigm and offer evidence that PACE is accurate, efficient, and effective in real-world settings.

Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Kai Zheng, Panos Kalnis (Corresponding author) ,"Parallel trajectory similarity joins in spatial networks" in VLDB Journal, 2018

Link [Publicly available]

The matching of similar pairs of objects, called similarity join, is fundamental functionality in data management. We consider two cases of trajectory similarity joins (TS-Joins), including a threshold-based join (Tb-TS-Join) and a top-k TS-Join (k-TS-Join), where the objects are trajectories of vehicles moving in road networks. Given two sets of trajectories and a threshold θ, the Tb-TS-Join returns all pairs of trajectories from the two sets with similarity above θ. In contrast, the k-TS-Join does not take a threshold as a parameter, and it returns the top-k most similar trajectory pairs from the two sets. The TS-Joins target diverse applications such as trajectory near-duplicate detection, data cleaning, ridesharing recommendation, and traffic congestion prediction. With these applications in mind, we provide purposeful definitions of similarity. To enable efficient processing of the TS-Joins on large sets of trajectories, we develop search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we present a two-phase divide-and-conquer search framework that lays the foundation for the algorithms for the Tb-TS-Join and the k-TS-Join that rely on different pruning techniques to achieve efficiency. For each trajectory, the algorithms first find similar trajectories. Then they merge the results to obtain the final result. The algorithms for the two joins exploit different upper and lower bounds on the spatiotemporal trajectory similarity and different heuristic scheduling strategies for search space pruning. Their per-trajectory searches are independent of each other and can be performed in parallel, and the mergings have constant cost. An empirical study with real data offers insight into the performance of the algorithms and demonstrates that they are capable of outperforming well-designed baseline algorithms by an order of magnitude.
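
The two join semantics defined in this abstract can be sketched with a brute-force Python baseline. This sketch has no pruning and no parallelism, and it substitutes a toy Jaccard similarity over road-segment identifiers for the paper's spatiotemporal similarity; all function names and data are hypothetical.

```python
import heapq

def jaccard(a, b):
    """Toy similarity (an assumption): overlap of road-segment id sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tb_ts_join(P, Q, theta, sim=jaccard):
    """Threshold-based join: all (i, j, similarity) with similarity above theta."""
    result = []
    for i, p in enumerate(P):
        for j, q in enumerate(Q):
            s = sim(p, q)
            if s > theta:
                result.append((i, j, s))
    return result

def k_ts_join(P, Q, k, sim=jaccard):
    """Top-k join: the k most similar (i, j, similarity) pairs, best first."""
    pairs = [(i, j, sim(p, q)) for i, p in enumerate(P) for j, q in enumerate(Q)]
    return heapq.nlargest(k, pairs, key=lambda t: t[2])

# Trajectories given as sequences of road-segment ids.
P = [[1, 2, 3], [4, 5]]
Q = [[1, 2, 3, 4], [4, 5], [6]]
above = tb_ts_join(P, Q, theta=0.5)
top2 = k_ts_join(P, Q, k=2)
```

Each outer-loop iteration is independent of the others, which is the property that allows per-trajectory searches to run in parallel before the results are merged.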

Shuo Shang, Lisi Chen, Kai Zheng, Christian S. Jensen, Zhewei Wei, Panos Kalnis ,"Parallel Trajectory-to-Location Join" in IEEE Transactions on Knowledge and Data Engineering, 2018

Link [Publicly available]

The matching between trajectories and locations, called Trajectory-to-Location join (TL-Join), is fundamental functionality in spatiotemporal data management. Given a set of trajectories, a set of locations, and a threshold θ, the TL-Join finds all (trajectory, location) pairs from the two sets with spatiotemporal correlation above θ. This join targets diverse applications, including location recommendation, event tracking, and trajectory activity analyses. We address three challenges in relation to the TL-Join: how to define the spatiotemporal correlation between trajectories and locations, how to prune the search space effectively when computing the join, and how to perform the computation in parallel. Specifically, we define new metrics to measure the spatiotemporal correlation between trajectories and locations. We develop a novel parallel collaborative (PCol) search method based on a divide-and-conquer strategy. For each location o, we retrieve the trajectories with high spatiotemporal correlation to o, and then we merge the results. An upper bound on the spatiotemporal correlation and a heuristic scheduling strategy are developed to prune the search space. The trajectory searches from different locations are independent and are performed in parallel, and the result merging cost is independent of the degree of parallelism. Studies of the performance of the developed algorithms using large spatiotemporal data sets are reported.

Lu Chen, Qilu Zhong, Xiaokui Xiao, Yunjun Gao, Pengfei Jin, Christian Søndergaard Jensen ,"Price-and-Time-Aware Dynamic Ridesharing" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

Ridesharing refers to a transportation scenario where travellers with similar itineraries and time schedules share a vehicle for a trip and split the travel cost, which may include fuel, tolls, and parking fees. Ridesharing is popular among travellers because it can reduce their travel costs, and it also holds the potential to reduce travel time, congestion, air pollution, and overall fuel consumption. However, existing ridesharing systems often offer each traveller only one choice that aims to minimize system-wide vehicle travel distance or time. We propose a solution that offers more options. Specifically, we do this by considering both pick-up time and price, so that travellers are able to choose the vehicle that matches their preferences best. In order to identify quickly vehicles that satisfy incoming ridesharing requests, we propose two efficient matching algorithms that follow the single-side and dual-side search paradigms, respectively. To further accelerate the matching, indexes on the road network and vehicles are developed, based on which several pruning heuristics are designed. Extensive experiments on a large Shanghai taxi dataset offer insights into the performance of our proposed techniques and compare with a baseline that extends the state-of-the-art method.

Lu Chen, Yunjun Gao, Zixian Liu, Xiaokui Xiao, Christian Søndergaard Jensen, Yifan Zhu ,"PTRider: A Price-and-Time-Aware Ridesharing System" in Proceedings of the VLDB Endowment, 2018

Link [Publicly available]

Ridesharing is popular among travellers because it can reduce their travel costs, and it also holds the potential to reduce travel time, congestion, air pollution, and overall fuel consumption. Existing ridesharing systems (e.g., Lyft, uberPOOL) often offer each traveller only one choice that aims to minimize system-wide vehicle travel distance or time. In this demonstration, we present a price-and-time-aware ridesharing system, termed PTRider, which provides more options. It considers both pick-up time and price, so that travellers are able to choose the vehicle matching their preferences best. To answer the ridesharing request in real time, PTRider builds indexes on the road network and vehicles separately, and utilizes corresponding efficient matching methods. A real-life dataset that contains 432,327 trips extracted from 17,000 Shanghai taxis for one day (May 29, 2009) is used to demonstrate that PTRider can return various options for every ridesharing request in real time.

Jilin Hu, Bin Yang, Chenjuan Guo, Christian Søndergaard Jensen (Corresponding author) ,"Risk-aware path selection with time-varying, uncertain travel costs: a time series approach" in VLDB Journal, 2018

Link

We address the problem of choosing the best paths among a set of candidate paths between the same origin–destination pair. This functionality is used extensively when constructing origin–destination matrices in logistics and flex transportation. Because the cost of a path, e.g., travel time, varies over time and is uncertain, there is generally no single best path. We partition time into intervals and represent the cost of a path during an interval as a random variable, resulting in an uncertain time series for each path. When facing uncertainties, users generally have different risk preferences, e.g., risk-loving or risk-averse, and thus prefer different paths. We develop techniques that, for each time interval, are able to find paths with non-dominated lowest costs while taking the users’ risk preferences into account. We represent risk by means of utility function categories and show how the use of first-order and two kinds of second-order stochastic dominance relationships among random variables makes it possible to find all paths with non-dominated lowest costs. We report on empirical studies with large uncertain time series collections derived from a 2-year GPS data set. The study offers insight into the performance of the proposed techniques, and it indicates that the best techniques combine to offer an efficient and robust solution.
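
First-order stochastic dominance between path costs can be illustrated with a short Python sketch. It assumes each path's cost in one time interval is a discrete distribution mapping cost to probability, with lower cost being better; the helper names and sample distributions are hypothetical, and the second-order variants discussed in the abstract are not covered.

```python
def cdf_at(dist, c):
    """P(cost <= c) for a discrete distribution {cost: probability}."""
    return sum(p for cost, p in dist.items() if cost <= c)

def fsd_dominates(a, b, eps=1e-12):
    """True if a first-order stochastically dominates b when lower cost is
    better: CDF_a(c) >= CDF_b(c) at every cost c, and strictly at some c."""
    strictly = False
    for c in sorted(set(a) | set(b)):
        fa, fb = cdf_at(a, c), cdf_at(b, c)
        if fa < fb - eps:
            return False
        if fa > fb + eps:
            strictly = True
    return strictly

def non_dominated(paths):
    """Names of paths that no other path dominates."""
    return [name for name, d in paths.items()
            if not any(fsd_dominates(o, d)
                       for other, o in paths.items() if other != name)]

# Hypothetical travel-time distributions (minutes) for three candidate paths.
paths = {
    "a": {10: 0.5, 20: 0.5},
    "b": {10: 0.2, 20: 0.3, 30: 0.5},
    "c": {5: 0.3, 40: 0.7},
}
best = non_dominated(paths)
```

Paths "a" and "c" remain because neither dominates the other: "c" has a chance of a very short trip but also a large risk of a long one, so the choice depends on the user's risk preference.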

Shuo Shang, Lisi Chen, Christian Søndergaard Jensen, Ji-Rong Wen, Panos Kalnis ,"Searching Trajectories by Regions of Interest" in 34th IEEE International Conference on Data Engineering, ICDE 2018, 2018

Link

We propose and investigate a novel query type named trajectory search by regions of interest (TSR query). Given an argument set of trajectories, a TSR query takes a set of regions of interest as a parameter and returns the trajectory in the argument set with the highest spatial-density correlation to the query regions. This type of query is useful in applications such as trip planning and recommendation. To process the TSR query, a set of new metrics are defined to model spatial-density correlations. An efficient trajectory search algorithm is developed that exploits upper and lower bounds to prune the search space and that adopts a query-source selection strategy, as well as integrates a heuristic search strategy based on priority ranking to schedule multiple query sources. The performance of TSR query processing is studied in extensive experiments based on real and synthetic spatial data.

Xinjue Wang, Ke Deng, Jianxin Li, Jeffery Xu Yu, Christian Søndergaard Jensen, Xiaochun Yang ,"Targeted Influence Minimization in Social Networks" in 22nd Pacific-Asia Conference, 2018

Link

An online social network can be used for the diffusion of malicious information like derogatory rumors, disinformation, hate speech, revenge pornography, etc. This motivates the study of influence minimization that aims to prevent the spread of malicious information. Unlike previous influence minimization work, this study considers the influence minimization in relation to a particular group of social network users, called targeted influence minimization. Thus, the objective is to protect a set of users, called target nodes, from malicious information originating from another set of users, called active nodes. This study also addresses two fundamental, but largely ignored, issues in different influence minimization problems: (i) the impact of a budget on the solution; (ii) robust sampling. To this end, two scenarios are investigated, namely unconstrained and constrained budget. Given an unconstrained budget, we provide an optimal solution; given a constrained budget, we show the problem is NP-hard and develop a greedy algorithm with a (1−1/e)-approximation. More importantly, in order to solve the influence minimization problem in large, real-world social networks, we propose a robust sampling-based solution with a desirable theoretic bound. Extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.

Michael Hanspeter Böhlen, Anton Dignös, Johann Gamper, Christian Søndergaard Jensen ,"Temporal Data Management—An Overview" in European Business Intelligence and Big Data Summer School, 2018

Link [Publicly available]

Despite the ubiquity of temporal data and considerable research on the effective and efficient processing of such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in the processing of temporal data that captures multiple states of reality. The SQL:2011 standard incorporates some temporal support, and commercial DBMSs have started to offer temporal functionality in a step-by-step manner, such as the representation of temporal intervals, temporal primary and foreign keys, and the support for so-called time-travel queries that enable access to past states. This tutorial gives an overview of state-of-the-art research results and technologies for storing, managing, and processing temporal data in relational database management systems. Following an introduction that offers a historical perspective, we provide an overview of basic temporal database concepts. Then we survey the state-of-the-art in temporal database research, followed by a coverage of the support for temporal data in the current SQL standard and the extent to which the temporal aspects of the standard are supported by existing systems. The tutorial ends by covering a recently proposed framework that provides comprehensive support for processing temporal data and that has been implemented in PostgreSQL.

Lei Chen, Yafei Li, Jianliang Xu, Christian S. Jensen ,"Towards Why-Not Spatial Keyword Top-k Queries: A Direction-Aware Approach" in IEEE Transactions on Knowledge and Data Engineering, 2018

Link [Publicly available]

With the continued proliferation of location-based services, a growing number of web-accessible data objects are geo-tagged and have text descriptions. An important query over such web objects is the direction-aware spatial keyword query that aims to retrieve the top-k objects that best match query parameters in terms of spatial distance and textual similarity in a given query direction. In some cases, it can be difficult for users to specify appropriate query parameters. After getting a query result, users may find some desired objects are unexpectedly missing and may therefore question the entire result. Enabling why-not questions in this setting may aid users to retrieve better results, thus improving the overall utility of the query functionality. This paper studies the direction-aware why-not spatial keyword top-k query problem. We propose efficient query refinement techniques to revive missing objects by minimally modifying users' direction-aware queries. We prove that the best refined query directions lie in a finite solution space for a special case and reduce the search for the optimal refinement to a linear programming problem for the general case. Extensive experimental studies demonstrate that the proposed techniques outperform a baseline method by two orders of magnitude and are robust in a broad range of settings.

Xin Ding, Lu Chen, Yunjun Gao, Christian Søndergaard Jensen, Hujun Bao ,"UlTraMan: A Unified Platform for Big Trajectory Data Management and Analytics" in Proceedings of the VLDB Endowment, 2018

Link [Publicly available]

Massive trajectory data is being generated by GPS-equipped devices, such as cars and mobile phones, which is used increasingly in transportation, location-based services, and urban computing. As a result, a variety of methods have been proposed for trajectory data management and analytics. However, traditional systems and methods are usually designed for very specific data management or analytics needs, which forces users to stitch together heterogeneous systems to analyze trajectory data in an inefficient manner. Targeting the overall data pipeline of big trajectory data management and analytics, we present a unified platform, termed as UlTraMan. In order to achieve scalability, efficiency, persistence, and flexibility, (i) we extend Apache Spark with respect to both data storage and computing by seamlessly integrating a key-value store, and (ii) we enhance the MapReduce paradigm to allow flexible optimizations based on random data access. We study the resulting system's flexibility using case studies on data retrieval, aggregation analyses, and pattern mining. Extensive experiments on real and synthetic trajectory data are reported to offer insight into the scalability and performance of UlTraMan.

Xin Ding, Rui Chen, Lu Chen, Yunjun Gao, Christian Søndergaard Jensen ,"VIPTRA: Visualization and Interactive Processing on Big Trajectory Data" in 19th IEEE International Conference on Mobile Data Management, MDM 2018, 2018

Link

Massive trajectory data is being collected and used widely in many applications such as transportation, location-based services, and urban computing. As a result, abundant methods and systems have been proposed for managing and processing trajectory data. However, it remains difficult for users to interact well with data management and processing, due to the lack of efficient data processing methods and effective visualization techniques for big trajectory data. In this demonstration, we present a new framework, VIPTRA, to process big trajectory data visually and interactively. VIPTRA builds upon UlTraMan, a distributed in-memory system for big trajectory data, and thus, it takes advantage of its capability of high performance. The demonstration shows the efficiency of data processing and user-friendly visualization and interaction techniques provided in VIPTRA, via several scenarios of visual analysis and trajectory editing tasks.

Robert Waury, Jilin Hu, Bin Yang, Christian S. Jensen ,"Assessing the accuracy benefits of on-the-fly trajectory selection in fine-grained travel-time estimation" in 18th IEEE International Conference on Mobile Data Management, MDM 2017, 2017

Link

Today's one-size-fits-all approach to travel-time computation in spatial networks proceeds in two steps. In a preparatory off-line step, a set of distributions, e.g., one per hour of the day, is computed for each network segment. Then, when a path and a departure time are provided, a distribution for the path is computed on-line from pertinent pre-computed distributions. Motivated by the availability of massive trajectory data from vehicles, we propose a completely on-line approach, where distributions are computed from trajectories on-the-fly, i.e., when a query arrives. This new approach makes it possible to use arbitrary sets of underlying trajectories for a query. Specifically, we study the potential for accuracy improvements over the one-size-fits-all approach that can be obtained using the on-the-fly approach and report findings from an empirical study that suggest that the on-the-fly approach is able to improve accuracy significantly and has the potential to replace the current one-size-fits-all approach.

Junling Liu, Ke Deng, Huanliang Sun, Yu Ge, Xiaofang Zhou, Christian Søndergaard Jensen ,"Clue-based Spatio-textual Query" in Proceedings of the VLDB Endowment, 2017

Link [Publicly available]

Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Ji Rong Wen, Panos Kalnis ,"Collective travel planning in spatial networks" in 33rd IEEE International Conference on Data Engineering, ICDE 2017, 2017

Link

Jinpeng Chen, Hua Lu, Ilkcan Keles, Christian S. Jensen ,"Crowdsourcing Based Evaluation of Ranking Approaches for Spatial Keyword Querying" in 18th IEEE International Conference on Mobile Data Management, MDM 2017, 2017

Link

Lei Chen, Yafei Li, Jianliang Xu, Christian S. Jensen ,"Direction-Aware why-not spatial keyword Top-k queries" in 33rd IEEE International Conference on Data Engineering, ICDE 2017, 2017

Link

With the continued proliferation of location-based services, a growing number of web-accessible data objects are geotagged and have text descriptions. An important query over such web objects is the direction-aware spatial keyword query that aims to retrieve the top-k objects that best match query parameters in terms of spatial distance and textual similarity in a given query direction. In some cases, it can be difficult for users to specify appropriate query parameters. After getting a query result, users may find some desired objects are unexpectedly missing and may therefore question the entire result. Enabling why-not questions in this setting may aid users to retrieve better results, thus improving the overall utility of the query functionality. This paper studies the direction-aware why-not spatial keyword top-k query problem. We propose efficient query refinement techniques to revive missing objects by minimally modifying users' direction-aware queries. Experimental studies demonstrate the efficiency and effectiveness of the proposed techniques.

Christian Søndergaard Jensen ,"Editorial: Updates to the Editorial Board" in A C M Transactions on Database Systems, 2017

Link

Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen ,"Efficient Metric Indexing for Similarity Search and Similarity Joins" in IEEE Transactions on Knowledge and Data Engineering, 2017

Link

Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient disk-based index for metric data, the Space-filling curve and Pivot-based B+-tree (SPB-tree). This index leverages the B+-tree, and uses a space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.
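The way pivots can save distance computations may be sketched as follows. This is a generic illustration of pivot-based filtering via the triangle inequality, not the SPB-tree itself: it omits the space-filling curve and the B+-tree, and it assumes Euclidean points as a stand-in for arbitrary metric data. The function names and the sample data are hypothetical.

```python
import math


def build_pivot_table(objects, pivots, dist):
    """Precompute the distance from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]


def range_query(q, r, objects, pivots, table, dist):
    """Return objects within distance r of q and the number of verified objects.

    For any pivot p, the triangle inequality gives
    |d(q, p) - d(o, p)| <= d(q, o), so an object whose lower bound
    exceeds r can be discarded without computing d(q, o).
    """
    q_to_pivots = [dist(q, p) for p in pivots]
    results, verified = [], 0
    for o, row in zip(objects, table):
        lower_bound = max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
        if lower_bound > r:
            continue  # pruned by the pivots alone
        verified += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, verified


# Hypothetical 2D data; math.dist is the Euclidean metric.
objects = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(objects, pivots, math.dist)
results, verified = range_query((1, 0), 1.5, objects, pivots, table, math.dist)
print(results, verified)
```

Here only two of the five objects need an exact distance computation; the other three are discarded using the precomputed pivot distances.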

Jilin Hu, Bin Yang, Christian Søndergaard Jensen, Yu Ma ,"Enabling time-dependent uncertain eco-weights for road networks" in Geoinformatica, 2017

Link

Reduction of greenhouse gas (GHG) emissions from transportation is an essential part of the efforts to prevent global warming and climate change. Eco-routing, which enables drivers to use the most environmentally friendly routes, is able to substantially reduce GHG emissions from vehicular transportation. The foundation of eco-routing is a weighted-graph representation of a road network in which road segments, or edges, are associated with eco-weights that capture the GHG emissions caused by traversing the edges. Due to the dynamics of traffic, the eco-weights are best modeled as being time dependent and uncertain. We formalize the problem of assigning a time-dependent, uncertain eco-weight to each edge in a road network based on historical GPS records. In particular, a sequence of histograms is employed to describe the uncertain eco-weight of an edge at different time intervals. Compression techniques, including histogram merging and buckets reduction, are proposed to maintain compact histograms while retaining their accuracy. In addition, to better model real traffic conditions, virtual edges and extended virtual edges are proposed in order to represent adjacent edges with highly dependent travel costs. Based on the techniques above, different histogram aggregation methods are proposed to accurately estimate time-dependent GHG emissions for routes. Based on a 200-million GPS record data set collected from 150 vehicles in Denmark over two years, a comprehensive empirical study is conducted in order to gain insight into the effectiveness and efficiency of the proposed approach.
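One simple way to aggregate per-edge histograms into a route-level histogram is convolution, sketched below. This is an illustrative assumption rather than the aggregation methods proposed in the paper: it treats the costs of consecutive edges as independent, which is the simplification that the virtual edges described above are meant to go beyond. The `{representative value: probability}` representation, the function name `convolve`, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def convolve(h1, h2):
    """Distribution of the summed cost of two edges, assuming independence.

    Each histogram maps a representative bucket value (e.g., grams of
    emissions) to the probability of that bucket.
    """
    out = defaultdict(float)
    for v1, p1 in h1.items():
        for v2, p2 in h2.items():
            out[v1 + v2] += p1 * p2
    return dict(out)


# Hypothetical eco-weight histograms for two consecutive edges in one interval.
edge_a = {100: 0.5, 200: 0.5}
edge_b = {50: 0.25, 150: 0.75}
route = convolve(edge_a, edge_b)
print(route)
```

Repeated convolution grows the number of buckets, which gives a sense of why the compression techniques mentioned in the abstract matter.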

Ilkcan Keles, Matthias Schubert, Peer Kröger, Simonas Saltenis, Christian Søndergaard Jensen ,"Extracting Visited Points of Interest from Vehicle Trajectories"

Link

Identifying visited points of interest (PoIs) from vehicle trajectories remains an open problem that is difficult due to vehicles parking often at some distance from the visited PoI and due to some regions having a high PoI density. We propose a visited PoI extraction (VPE) method that identifies visited PoIs using a Bayesian network. The method considers stay duration, weekday, arrival time, and PoI category to compute the probability that a PoI is visited. We also provide a method to generate labeled data from unlabeled GPS trajectories. An experimental evaluation shows that VPE achieves a precision@3 value of 0.8, indicating that VPE is able to model the relationship between the temporal features of a stop and the category of the visited PoI.

Saad Aljubayrin, Jianzhong Qi, Christian S. Jensen, Rui Zhang, Zhen He, Yuan Li (corresponding author) ,"Finding lowest-cost paths in settings with safe and preferred zones" in VLDB Journal, 2017

Link

We define and study Euclidean and spatial network variants of a new path finding problem: given a set of safe or preferred zones with zero or low cost, find paths that minimize the cost of travel from an origin to a destination. In this problem, the entire space is passable, with preference given to safe or preferred zones. Existing algorithms for problems that involve unsafe regions to be avoided strictly are not effective for this new problem. To solve the Euclidean variant, we devise a transformation of the continuous data space with safe zones into a discrete graph upon which shortest path algorithms apply. A naive transformation yields a large graph that is expensive to search. In contrast, our transformation exploits properties of hyperbolas in Euclidean space to safely eliminate graph edges, thus improving performance without affecting correctness. To solve the spatial network variant, we propose a different graph-to-graph transformation that identifies critical points that serve the same purpose as do the hyperbolas, thus also avoiding the extraneous edges. Having solved the problem for safe zones with zero costs, we extend the transformations to the weighted version of the problem, where travel in preferred zones has nonzero costs. Experiments on both real and synthetic data show that our approaches outperform baseline approaches by more than an order of magnitude in graph construction time, storage space, and query response time.

Lu Chen, Yunjun Gao, Aoxiao Zhong, Christian S. Jensen, Gang Chen, Baihua Zheng (corresponding author) ,"Indexing metric uncertain data for range queries and range joins" in VLDB Journal, 2017

Link

Range queries and range joins in metric spaces have applications in many areas, including GIS, computational biology, and data integration, where metric uncertain data exist in different forms, resulting from circumstances such as equipment limitations, high-throughput sequencing technologies, and privacy preservation. We represent metric uncertain data by using an object-level model and a bi-level model, respectively. Two novel indexes, the uncertain pivot B+-tree (UPB-tree) and the uncertain pivot B+-forest (UPB-forest), are proposed in order to support probabilistic range queries and range joins for a wide range of uncertain data types and similarity metrics. Both index structures use a small set of effective pivots chosen based on a newly defined criterion and employ the B+-tree(s) as the underlying index. In addition, we present efficient metric probabilistic range query and metric probabilistic range join algorithms, which utilize validation and pruning techniques based on derived probability lower and upper bounds. Extensive experiments with both real and synthetic data sets demonstrate that, compared against existing state-of-the-art indexes for metric uncertain data, the UPB-tree and the UPB-forest incur much lower construction costs, consume less storage space, and can support more efficient metric probabilistic range queries and metric probabilistic range joins.

Christian Søndergaard Jensen, Dan Lin, Beng Chin Ooi ,"Indexing of Moving Objects, Bx-Tree"

Link

Sadegh Nobari, Qiang Qu, Christian Søndergaard Jensen ,"In-Memory Spatial Join: The Data Matters!" in 20th International Conference on Extending Database Technology, 2017

Link [Publicly available]

Johannes Lindhart Borresen, Ove Andersen, Christian Søndergaard Jensen, Kristian Torp ,"Interactive Intersection Analysis using Trajectory Data" in 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2017

Link

Francesco Lettich, Salvatore Orlando, Claudio Silvestri, Christian S. Jensen (corresponding author) ,"Manycore GPU processing of repeated range queries over streams of moving objects observations" in Concurrency Computation, 2017

Link

The ability to process significant amounts of continuously updated spatial data in a timely manner is mandatory for an increasing number of applications. Parallelism enables such applications to face this data-intensive challenge and allows the devised systems to feature low latency and high scalability. In this paper, we focus on a specific data-intensive problem concerning the repeated processing of huge amounts of range queries over massive sets of moving objects, where the spatial extent of queries and objects is continuously modified over time. To tackle this problem and significantly accelerate query processing, we devise a hybrid CPU/GPU pipeline that compresses data output and saves query processing work. The devised system relies on an ad-hoc spatial index leading to a problem decomposition that results in a set of independent data-parallel tasks. The index is based on a point-region quadtree space decomposition and makes it possible to tackle effectively a broad range of spatial object distributions, even very skewed ones. Also, to deal with the architectural peculiarities and limitations of the GPUs, we adopt non-trivial GPU data structures that avoid the need for locked memory accesses while favouring coalesced memory accesses, thus enhancing the overall memory throughput. To the best of our knowledge, this is the first work that exploits GPUs to efficiently solve repeated range queries over massive sets of continuously moving objects, possibly characterized by highly skewed spatial distributions. In comparison with state-of-the-art CPU-based implementations, our method achieves significant speedups on the order of 10–20×, depending on the dataset.

Christian Søndergaard Jensen, Dan Lin, Beng Chin Ooi ,"Maximum Update Interval in Moving Objects Databases"

Link

Lu Chen, Yunjun Gao, Baihua Zheng, Christian Søndergaard Jensen, Hanyu Yang, Keyu Yang ,"Pivot-based Metric Indexing" in Proceedings of the VLDB Endowment, 2017

Link

Xike Xie, Xin Lin, Jianliang Xu, Christian S. Jensen ,"Reverse keyword-based location search" in 33rd IEEE International Conference on Data Engineering, ICDE 2017, 2017

Link

The proliferation of geo-textual data gives prominence to spatial keyword search. The basic top-k spatial keyword query returns k geo-textual objects that rank the highest according to their textual relevance and spatial proximity to query keywords and a query location. We define, study, and provide means of computing the reverse top-k keyword-based location query. This new type of query takes a set of keywords, a query object q, and a number k as arguments, and it returns a spatial region such that any top-k spatial keyword query with the query keywords and a location in this region would contain object q in its result. This query targets applications in market analysis, geographical planning, and location optimization, and it may support applications related to safe zones and influence zones that are used widely in location-based services. We show that computing an exact query result requires evaluating and merging a set of weighted Voronoi cells, which is expensive. We therefore devise effective algorithms that approximate result regions with quality guarantees. We develop novel pruning techniques on top of an index, and we offer a series of optimization techniques that aim to further accelerate query processing. Empirical studies suggest that the proposed query processing is efficient and scalable.

Jingwen Zhao, Yunjun Gao, Gang Chen, Christian S. Jensen, Rui Chen, Deng Cai ,"Reverse Top-k geo-social keyword queries in road networks" in 33rd IEEE International Conference on Data Engineering, ICDE 2017, 2017

Link

Identifying prospective customers is an important aspect of marketing research. In this paper, we provide support for a new type of query, the Reverse Top-k Geo-Social Keyword (RkGSK) query. This query takes into account spatial, textual, and social information, and finds prospective customers for geotagged objects. As an example, a restaurant manager might apply the query to find prospective customers. To address this, we propose a hybrid index, the GIM-Tree, which indexes locations, keywords, and social information of geo-tagged users and objects, and then, using the GIM-Tree, we present efficient RkGSK query processing algorithms that exploit several pruning strategies. The effectiveness of RkGSK retrieval is characterized via a case study, and extensive experiments using real datasets offer insight into the efficiency of the proposed index and algorithms.

Shuo Shang, Lisi Chen, Christian S. Jensen, Ji-Rong Wen, Panos Kalnis ,"Searching trajectories by regions of interest" in IEEE Transactions on Knowledge and Data Engineering, 2017

Link

With the increasing availability of moving-object tracking data, trajectory search is increasingly important. We propose and investigate a novel query type named trajectory search by regions of interest (TSR query). Given an argument set of trajectories, a TSR query takes a set of regions of interest as a parameter and returns the trajectory in the argument set with the highest spatial-density correlation to the query regions. This type of query is useful in many popular applications such as trip planning and recommendation, and location based services in general. TSR query processing faces three challenges: how to model the spatial-density correlation between query regions and data trajectories, how to effectively prune the search space, and how to effectively schedule multiple so-called query sources. To tackle these challenges, a series of new metrics are defined to model spatial-density correlations. An efficient trajectory search algorithm is developed that exploits upper and lower bounds to prune the search space and that adopts a query-source selection strategy, as well as integrates a heuristic search strategy based on priority ranking to schedule multiple query sources. The performance of TSR query processing is studied in extensive experiments based on real and synthetic spatial data.

Nectaria Tryfona, Christian Søndergaard Jensen ,"Spatiotemporal Database Modeling with an Extended Entity-Relationship Model"

Link

Mann Willi, Nikolaus Augsten, Christian Søndergaard Jensen ,"SWOOP: Top-k Similarity Joins over Set Streams" in 43rd International Conference on Very Large Data Bases, VLDB 2017, 2017

Link

Shuo Shang, Lisi Chen, Zhewei Wei, Christian Søndergaard Jensen, Kai Zheng, Panos Kalnis ,"Trajectory Similarity Join in Spatial Networks" in Proceedings of the VLDB Endowment, 2017

Link [Publicly available]

Christian Søndergaard Jensen ,"Updates to the TODS Editorial Board" in SIGMOD Record, 2017

Link [Publicly available]

Lei Chen (Editor), Christian Søndergaard Jensen (Editor), Cyrus Shahabi (Editor), Xiaochun Yang (Editor), Xiang Lian (Editor) ,"Web and Big Data: First International Joint Conference, APWeb-WAIM 2017, Part II"

Link

Xie, X., P. Jin, M.-L. Yiu, J. Du, M. Yuan, C. S. Jensen ,"Enabling Scalable Geographic Service Sharing with Weighted Imprecise Voronoi Cells" in IEEE Transactions on Knowledge and Data Engineering, 28(2): 439–453, 2016

Prof. Christian S. Jensen List of Publications

This page contains a list of research publications with abstracts and, generally, links to full paper versions.
