DATA 2020 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 30
Title:

Integrating Lightweight Compression Capabilities into Apache Arrow

Authors:

Juliana Hildebrandt, Dirk Habich and Wolfgang Lehner

Abstract: With the ongoing shift to a data-driven world in almost all application domains, the management and in particular the analytics of large amounts of data gain in importance. For that reason, a variety of new big data systems has been developed in recent years. Aside from that, a revision of data organization and formats has been initiated as a foundation for these big data systems. In this context, Apache Arrow is a novel cross-language development platform for in-memory data with a standardized, language-independent columnar memory format. The data is organized for efficient analytic operations on modern hardware, but Apache Arrow supports only dictionary encoding as a specific compression approach. However, there exists a large corpus of lightweight compression algorithms for columnar data which help to reduce the necessary memory space as well as to increase the processing performance. Thus, in this paper we present a flexible and language-independent approach for integrating lightweight compression algorithms into the Apache Arrow framework. With our so-called ArrowComp approach, we preserve the unique properties of Apache Arrow, but enhance the platform with a large variety of lightweight compression capabilities.
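
The abstract notes that stock Apache Arrow ships only dictionary encoding. A minimal sketch of that baseline (not of the ArrowComp extension itself), assuming the pyarrow package is installed:

    import pyarrow as pa

    # A column with repeated values, as is typical for categorical data.
    values = pa.array(["red", "blue", "red", "red", "green", "blue"])

    # Dictionary encoding is the one compression scheme built into Arrow:
    # the column is split into a small dictionary plus integer indices.
    encoded = values.dictionary_encode()

    print(encoded.dictionary)   # unique values: ["red", "blue", "green"]
    print(encoded.indices)      # per-row indices into the dictionary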

Paper Nr: 35
Title:

On Generating Efficient Data Summaries for Logistic Regression: A Coreset-based Approach

Authors:

Nery Riquelme-Granada, Khuong A. Nguyen and Zhiyuan Luo

Abstract: In the era of datasets of unprecedented sizes, data compression techniques are an attractive approach for speeding up machine learning algorithms. One of the most successful paradigms for achieving good-quality compression is that of coresets: small summaries of data that act as proxies to the original input data. Even though coresets have proved extremely useful for accelerating unsupervised learning problems, applying them to supervised learning problems may bring unexpected computational bottlenecks. We show that this is the case for logistic regression classification, and hence propose two methods for accelerating the computation of coresets for this problem. When coresets are computed using our methods on three public datasets, computing the coreset and learning from it is, in the worst case, 11 times faster than learning directly from the full input data, and 34 times faster in the best case. Furthermore, our results indicate that our acceleration approaches do not degrade the empirical performance of coresets.
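
The paper’s two acceleration methods are not reproduced here; the following sketch only illustrates the general coreset paradigm of importance (sensitivity) sampling with bias-correcting weights, using a crude norm-based sensitivity as a placeholder:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))                      # full input data
    y = (X[:, 0] + rng.normal(size=10000) > 0).astype(int)

    # Placeholder "sensitivity": points far from the data mean are sampled more
    # often. Real logistic-regression coresets use a tighter sensitivity bound.
    s = np.linalg.norm(X - X.mean(axis=0), axis=1) + 1.0
    p = s / s.sum()

    m = 500                                              # coreset size
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                               # importance weights

    # Any weighted learner can now be trained on the summary, e.g. sklearn's
    # LogisticRegression(...).fit(X[idx], y[idx], sample_weight=w).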

Short Papers
Paper Nr: 40
Title:

Real-time Visualization of Sensor Data in Smart Manufacturing using Lambda Architecture

Authors:

Nadeem Iftikhar, Bartosz P. Lachowicz, Akos Madarasz, Finn E. Nordbjerg, Thorkil Baattrup-Andersen and Karsten Jeppesen

Abstract: Smart manufacturing technologies (Industry 4.0), as solutions to enhance productivity and improve efficiency, are a priority for manufacturing industries worldwide. Such solutions have the ability to extract, integrate, analyze and visualize sensor data and data from other legacy systems in order to enhance operational performance. This paper proposes a solution to the challenge of real-time analysis and visualization of sensor and ERP data. Dynamic visualization is achieved using a machine learning approach. The combination of real-time visualization and machine learning allows for early detection and prevention of undesirable situations or outcomes. The prototype system has so far been tested by a smart manufacturing company with promising results.

Paper Nr: 44
Title:

Identification of Social Influence on Social Networks and Its Use in Recommender Systems: A Systematic Review

Authors:

Lesly G. Camacho and Solange N. Alves-Souza

Abstract: Currently, the popularization of social networks has encouraged people to interact more on the internet through information sharing or posting activities. Different social media are a source of information that can provide valuable insight into user feedback, interaction history and social relationships. With this information it is possible to discover relationships of trust between people that can influence their potential behavior when purchasing a product or service. Social networks have been shown to play an important role in e-commerce for the diffusion or acquisition of products. Knowing how to mine information from social networks to discover patterns of social influence can be very useful for e-commerce platforms, or for streaming of music, TV or movies. Discovering influence patterns can make item recommendations more accurate, especially when there is no knowledge about a user’s tastes. This paper presents a systematic literature review (SLR) that shows the main works that use social networking data to identify the most influential set of users within a social network and how this information is used in recommender systems. The results of this work show the main techniques used to calculate social influence, as well as identify which data are the most used to determine influence and which evaluation metrics are used to validate each of the proposals. From 80 papers analyzed, 14 were classified as completely relevant regarding the research questions defined in the SLR.

Area 2 - Business Analytics

Full Papers
Paper Nr: 18
Title:

Auxiliary Decision-making for Controlled Experiments based on Mid-term Treatment Effect Prediction: Applications in Ant Financial’s Offline-payment Business

Authors:

Gang Li and Huizhi Xie

Abstract: Controlled experiments are commonly used in technology companies for product development, algorithm improvement, marketing strategy evaluation, etc. These experiments are usually run for a short period of time to enable fast business/product iteration. Due to the relatively short lifecycle of these experiments, key business metrics that span a longer window cannot be calculated and compared among different variations of these experiments. This is essentially a treatment effect prediction problem. The research in this paper focuses on experiments in the offline-payment business at Ant Financial. Experiments in this area are usually run for one or two weeks, sometimes even shorter, yet the accumulation window of key business metrics such as payment days and payment counts is one month. In this paper, we apply the classic BG/NBD model (Fader et al., 2005) from marketing to predict users’ payment behavior based on data collected during the relatively short experimentation periods. The predictions are then used to evaluate the impact on the key business metrics. We compare this method with supervised learning methods and with direct modelling of the treatment effect as a time series. We show the advantage of the proposed method using data collected from a large number of controlled experiments at Ant Financial. The proposed technique has been integrated into Ant Financial’s experimentation reporting platform, where metrics based on the predictions serve as one of the auxiliary evaluation criteria in offline-payment experiments.
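
A hedged illustration of the prediction step, assuming the open-source lifetimes package, which implements the BG/NBD model of Fader et al. (2005); the column names and numbers are made up:

    import pandas as pd
    from lifetimes import BetaGeoFitter

    # Per-user summary from a short (14-day) experiment window:
    # frequency = repeat payment days, recency and T in days.
    df = pd.DataFrame({
        "frequency": [0, 3, 1, 6, 2, 5],
        "recency":   [0.0, 10.0, 4.0, 12.0, 7.0, 13.0],
        "T":         [14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
    })

    bgf = BetaGeoFitter(penalizer_coef=0.01)
    bgf.fit(df["frequency"], df["recency"], df["T"])

    # Extrapolate expected payments over the remaining 16 days so that a
    # one-month metric can be compared across experiment variations.
    df["predicted"] = bgf.conditional_expected_number_of_purchases_up_to_time(
        16, df["frequency"], df["recency"], df["T"])
    print(df)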

Paper Nr: 47
Title:

Catalog Integration of Low-quality Product Data by Attribute Label Ranking

Authors:

Oliver Schmidts, Bodo Kraft, Marvin Winkens and Albert Zündorf

Abstract: The integration of product data from heterogeneous sources and manufacturers into a single catalog is often still a laborious, manual task. Small- and medium-sized enterprises in particular face the challenge of integrating the data their business relies on in a timely manner to keep an up-to-date product catalog, due to format specifications, low data quality and the need for expert knowledge. Additionally, modern approaches to simplify catalog integration demand experience in machine learning, word vectorization, or semantic similarity that such enterprises do not have. Furthermore, most approaches struggle with low-quality data. We propose Attribute Label Ranking (ALR), an easy-to-understand and simple-to-adapt learning approach. ALR leverages a model trained on real-world integration data to identify the best possible mapping of a previously unknown, proprietary, tabular format onto a standardized catalog schema. Our approach predicts multiple labels for every attribute of an input column. The whole column is then taken into consideration to rank these labels. We evaluate ALR regarding the correctness of its predictions and compare the results on real-world data to state-of-the-art approaches. Additionally, we report findings from our experiments and limitations of our approach.
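
ALR itself is not published as code here; the sketch below only illustrates the column-level idea described in the abstract: predict candidate labels per cell and rank them over the whole column. The per-cell predictions are assumed to come from some trained classifier:

    from collections import Counter

    def rank_column_labels(cell_label_predictions):
        """cell_label_predictions: one list of candidate labels per cell of an
        input column. Returns labels ranked by how often they were predicted
        across the whole column."""
        votes = Counter(label
                        for candidates in cell_label_predictions
                        for label in candidates)
        return [label for label, _ in votes.most_common()]

    # Toy example: three cells of an unknown column with per-cell candidates.
    predictions = [["ean", "article_no"], ["ean"], ["ean", "weight"]]
    print(rank_column_labels(predictions))   # ['ean', 'article_no', 'weight']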

Paper Nr: 53
Title:

A Gradient Descent based Heuristic for Solving Regression Clustering Problems

Authors:

Enis Kayış

Abstract: Regression analysis is the method of quantifying the effects of a set of independent variables on a dependent variable. In regression clustering problems, data points with similar regression estimates are grouped into the same cluster, either due to a business need or to increase the statistical significance of the resulting regression estimates. In this paper, we consider an extension of this problem where data points belonging to the same level of another partitioning categorical variable must belong to the same partition. Due to the combinatorial nature of this problem, an exact solution is computationally prohibitive. We provide an integer programming formulation and offer a gradient descent based heuristic to solve this problem. Through simulated datasets, we analyze the performance of our heuristic across a variety of different settings. In our computational study, we find that our heuristic provides remarkably better solutions than the benchmark method within a reasonable time. Despite a slight decrease in performance as the number of levels increases, our heuristic provides good solutions when each of the true underlying partitions has a similar number of levels.
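
The paper’s exact heuristic is not reproduced here; the sketch below shows the generic alternating scheme such heuristics build on, under the constraint named in the abstract: all data points of a categorical level stay together, one regression is fitted per cluster, and whole levels are reassigned to the cluster whose model fits them best.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def cluster_levels(X, y, levels, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        uniq = np.unique(levels)
        assign = {lv: int(rng.integers(k)) for lv in uniq}   # level -> cluster
        for _ in range(iters):
            # Fit one regression per cluster on the levels assigned to it.
            models = {}
            for c in range(k):
                mask = np.isin(levels, [lv for lv in uniq if assign[lv] == c])
                if mask.sum() > X.shape[1]:
                    models[c] = LinearRegression().fit(X[mask], y[mask])
            # Reassign each level as a whole to its best-fitting cluster.
            for lv in uniq:
                m = levels == lv
                errs = {c: ((mdl.predict(X[m]) - y[m]) ** 2).sum()
                        for c, mdl in models.items()}
                assign[lv] = min(errs, key=errs.get)
        return assign

    rng = np.random.default_rng(1)
    levels = rng.integers(0, 6, size=300)                    # partitioning variable
    X = rng.normal(size=(300, 1))
    y = np.where(levels < 3, 2.0, -2.0) * X[:, 0] + rng.normal(scale=0.1, size=300)
    print(cluster_levels(X, y, levels, k=2))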

Short Papers
Paper Nr: 8
Title:

Applied Data Science: An Approach to Explain a Complex Team Ball Game

Authors:

Friedemann Schwenkreis and Eckard Nothdurft

Abstract: Team handball is a fast and complex game with a long tradition and, so far, almost no digital data collection. Only a few attempts have been made to come up with models that explain the mechanisms of the game based on measured indicators. CoCoAnDa is a project located at the Baden-Wuerttemberg Cooperative State University that addresses this gap. While it started with the aim of introducing data mining technology into an almost non-digitalized team sport, the project has extended its scope by introducing mechanisms to collect digital information as well as by developing field-specific models to interpret the collected data. The work presented shows the design of specialized apps that have been implemented to allow a single observer to manually collect a maximum of data during team handball matches. This paper also describes the analysis of available data collected as part of the match organization of 1,190 matches of the first and 1,559 matches of the second German team handball league, HBL. Furthermore, the data of more than 150 games of national teams, the first league, and the third league have been manually collected using the apps developed as part of the project.

Paper Nr: 29
Title:

Improving Statistical Reporting Data Explainability via Principal Component Analysis

Authors:

Shengkun Xie and Clare Chua-Chow

Abstract: The study of high dimensional data for decision-making is rapidly growing, since it often leads to more accurate information that is needed to make reliable decisions. To better understand the natural variation and the pattern of statistical reporting data, visualization and interpretability of data have been an ongoing challenge, mainly in the area of complex statistical data analysis. In this work, we propose an approach of dimension reduction and feature extraction using principal component analysis, in a novel way, for analyzing the statistical reporting data of auto insurance. We investigate the functional relationship of loss relative frequency to the size-of-loss, as well as the pattern and variability of the extracted features, for a better understanding of the nature of auto insurance loss data. The proposed method helps improve data explainability and gives an in-depth analysis of the overall pattern of the size-of-loss relative frequency. The findings of our study will help insurance regulators make better rate filing decisions in auto insurance that would benefit both insurers and their clients. The approach is also applicable to similar data analysis problems in other business applications.
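
As a generic illustration of the dimension-reduction step only (not of the paper’s specific insurance analysis), principal component analysis can be applied to a size-of-loss relative-frequency table; the data below is synthetic and the row/column semantics are assumptions:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Hypothetical table: 50 reporting units x 12 size-of-loss bands holding
    # loss relative frequencies (each row sums to 1).
    freq = rng.dirichlet(np.ones(12), size=50)

    pca = PCA(n_components=2)
    scores = pca.fit_transform(freq)

    print(pca.explained_variance_ratio_)   # variation captured per component
    print(scores[:3])                      # extracted features, first 3 units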

Paper Nr: 34
Title:

Improving Public Sector Efficiency using Advanced Text Mining in the Procurement Process

Authors:

Nikola Modrušan, Kornelije Rabuzin and Leo Mršić

Abstract: The analysis of Public Procurement Processes (PPP) and the detection of suspicious or corrupt procedures is an important topic, especially for improving the process’s transparency and for protecting public financial interests. Creating a quality model as a foundation for a quality analysis largely depends on the quality and volume of the data that is analyzed. It is important to find a way to identify anomalies before they occur and to prevent any kind of harm to the public interest. For this reason, we focused our research on an early phase of the PPP, the preparation of the tender documentation. During this phase, it is important to collect documents, detect and extract quality content from them, and analyze this content for any possible manipulation of the PPP’s outcome. The part of the documentation that defines the rules and restrictions for the PPP is usually found within a specific section of the documents, often called “technical and professional ability.” In previous studies, the authors extracted and processed these sections and used the extracted content to develop a prediction model for indicating fraudulent activities. As the criteria and conditions can also be found in other parts of the PPP’s documentation, the idea of this research is to detect additional content and to investigate its impact on the outcome of the prediction model. Therefore, our goal was to determine a list of relevant terms and to develop a data science model that finds and extracts these terms in order to improve the prediction of suspicious tenders. An evaluation was conducted based on an initial prediction model trained with the extracted content as additional input parameters. The training results show a significant improvement in the output metrics. This study presents a methodology for detecting the content needed to predict suspicious procurement procedures, for measuring the relevance of extracted terms, and for storing the most important information in a relational structure in a database.
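
The paper’s term-relevance measurement is richer than this, but the basic idea of scoring candidate terms extracted from tender documents can be sketched with TF-IDF; the document snippets below are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical excerpts from "technical and professional ability" sections.
    docs = [
        "bidder must provide three references for comparable road works",
        "minimum annual turnover of 5 million and ISO 9001 certificate",
        "references for comparable works and ISO 9001 certificate required",
    ]

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vectorizer.fit_transform(docs)

    # Rank terms by their average TF-IDF weight across the documents.
    weights = tfidf.mean(axis=0).A1
    terms = vectorizer.get_feature_names_out()
    print(sorted(zip(terms, weights), key=lambda t: -t[1])[:5])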

Paper Nr: 42
Title:

A Conceptual Framework for a Flexible Data Analytics Network

Authors:

Daniel Tebernum and Dustin Chabrowski

Abstract: It is becoming increasingly important for enterprises to generate insights into their own data and thus make business decisions based on it. A common way to generate insights is to collect the available data and use suitable analysis methods to process and prepare it so that decisions can be made faster and with more confidence. This can be computationally and storage intensive and is therefore often outsourced to cloud services or a local server setup. With regard to data sovereignty, bandwidth limitations, and potentially high charges, this is not always a good solution. Therefore, we present a conceptual framework that gives enterprises a guideline for building a flexible data analytics network that is able to incorporate already existing edge device resources in the enterprise computer network. The proposed solution can automatically distribute data and code to the nodes in the network using customizable workflows. With data management focused on content addressing, workflows can be replicated with no effort, ensuring the integrity of results and thus strengthening business decisions. We implemented our concept and were able to apply it successfully in a laboratory pilot.
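
The content-addressing principle the framework relies on can be illustrated in a few lines: a blob is stored under the hash of its own content, so replicated workflows can verify that they received exactly the data and code they expected. This is only the principle, not the framework’s actual storage layer:

    import hashlib

    store = {}

    def put(data: bytes) -> str:
        """Store a blob under the hash of its content and return that address."""
        address = hashlib.sha256(data).hexdigest()
        store[address] = data
        return address

    def get(address: str) -> bytes:
        data = store[address]
        # Integrity check comes for free: the address is the content's hash.
        assert hashlib.sha256(data).hexdigest() == address
        return data

    addr = put(b"analysis-code-or-dataset")
    print(addr, get(addr) == b"analysis-code-or-dataset")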

Paper Nr: 71
Title:

Context-aware Retrieval and Classification: Design and Benefits

Authors:

Kurt Englmeier

Abstract: Context encompasses the classification of a certain environment by its key attributes. It is an abstract representation of a certain data environment. In texts, the context classifies and represents a piece of text in a generalized form. Context can be a recursive construct when summarizing text on a more coarse-grained level. Context-aware information retrieval and classification has many aspects. This paper presents the identification and standardization of context on different levels of granularity, which supports faster and more precise location of relevant text sections. The prototypical system presented here applies supervised learning in a semiautomatic approach to extract, distil, and standardize data from text. The approach is based on named-entity recognition and simple ontologies for the identification and disambiguation of context. Even though the prototype shown here is still work in progress, it demonstrates the potential of information retrieval on different levels of context granularity. The paper presents the application of the prototype in the realm of economic information and hate speech detection.
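
Named-entity recognition is the first building block of the described pipeline; a minimal sketch with spaCy (the ontology-based disambiguation step is omitted, and the small English model is assumed to be installed):

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The ECB raised interest rates in Frankfurt last Thursday.")

    # The recognized entities are the raw material for a context signature;
    # mapping them onto a small ontology (e.g. ECB -> central bank) would be
    # the standardization and disambiguation step described above.
    for ent in doc.ents:
        print(ent.text, ent.label_)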

Area 3 - Data Science

Full Papers
Paper Nr: 33
Title:

Classification of Products in Retail using Partially Abbreviated Product Names Only

Authors:

Oliver Allweyer, Christian Schorr, Rolf Krieger and Andreas Mohr

Abstract: The management of product data in ERP systems is a big challenge for most retail companies. The reason lies in the large amount of data and its complexity. Some companies have millions of product data records, and sometimes more than one thousand data records are created daily. Because data entry and maintenance processes involve considerable manual effort, the costs of data management - both in time and money - are high. In many systems, the product name and product category must be specified before the product data can be entered manually. Based on the product category, many default values are proposed to simplify the manual data entry process. Consequently, classification is essential for error-free and efficient data entry. In this paper, we show how to classify products automatically and compare different machine learning approaches to this end. In order to minimize the effort for manual data entry, and due to the severely limited length of the product name field, the classification algorithms are based on shortened names of the products. In particular, we analyse the benefits of different pre-processing strategies and compare the quality of classification models on different hierarchy levels. Our results show that, even in this special case, machine learning can considerably simplify the process of data input.
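
The paper compares several machine learning approaches; as one hedged illustration of classifying heavily abbreviated product names, character n-grams are a common choice because they survive truncation. The names, categories and model below are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    names = ["choc.bar milk 100g", "choc.bar dark 85g", "toothp. mint 75ml",
             "toothp. whitening", "appl.juice 1l", "oran.juice 1l"]
    categories = ["sweets", "sweets", "hygiene", "hygiene", "beverages", "beverages"]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # robust to abbreviations
        LogisticRegression(max_iter=1000),
    )
    clf.fit(names, categories)
    print(clf.predict(["choc.bar hazelnut", "grape juice 0.5l"]))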

Paper Nr: 74
Title:

Learning Interpretable and Statistically Significant Knowledge from Unlabeled Corpora of Social Text Messages: A Novel Methodology of Descriptive Text Mining

Authors:

Giacomo Frisoni, Gianluca Moro and Antonella Carbonaro

Abstract: Though the strong evolution of knowledge learning models has characterized the last few years, the explanation of a phenomenon from text documents, called descriptive text mining, is still a difficult and poorly addressed problem. The need to work with unlabeled data, explainable approaches, and unsupervised and domain-independent solutions further increases the complexity of this task. Currently, existing techniques only partially solve the problem and have several limitations. In this paper, we propose a novel methodology for descriptive text mining, capable of offering accurate explanations in unsupervised settings and of quantifying the results based on their statistical significance. Considering the strong growth of patient communities on social platforms such as Facebook, we demonstrate the effectiveness of the contribution by taking the short social posts related to Esophageal Achalasia as a typical case study. Specifically, the methodology produces useful explanations about the experiences of patients and caregivers. Starting directly from the unlabeled patients’ posts, we derive correct scientific correlations among symptoms, drugs, treatments, foods and so on.
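
The methodology’s statistical-significance element can be illustrated by testing a single co-occurrence with Fisher’s exact test; the counts are invented and the actual methodology quantifies far more than one association:

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 contingency table over patient posts:
    #                        mentions "dilation"   does not
    #   mentions "relief"            30               10
    #   no "relief"                  20               60
    result = fisher_exact([[30, 10], [20, 60]])
    print(result)   # a small p-value means the association is unlikely to be chance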

Short Papers
Paper Nr: 14
Title:

Leveraging Machine Learning for Fake News Detection

Authors:

Elio Masciari, Vincenzo Moscato, Antonio Picariello and Giancarlo Sperlì

Abstract: The uncontrolled growth of fake news creation and dissemination observed in recent years poses continuous threats to democracy, justice, and public trust. This problem has significantly driven the efforts of both academia and industry to develop more accurate fake news detection strategies. Early detection of fake news is crucial; however, the availability of information about news propagation is limited. Moreover, it has been shown that people tend to believe fake news due to its characteristics (Vosoughi et al., 2018). In this paper, we present our complete framework for fake news detection and discuss in detail a solution based on machine learning. Our experiments, conducted on two well-known and widely used real-world datasets, suggest that our settings can outperform the state-of-the-art approaches and allow accurate fake news detection, even in the case of limited content information.

Paper Nr: 21
Title:

Multi-view Clustering Analyses for District Heating Substations

Authors:

Shahrooz Abghari, Veselka Boeva, Jens Brage and Håkan Grahn

Abstract: In this study, we propose a multi-view clustering approach for mining and analysing multi-view network datasets. The proposed approach is applied and evaluated on a real-world scenario for monitoring and analysing district heating (DH) network conditions and identifying substations with sub-optimal behaviour. Initially, the geographical locations of the substations are used to build an approximate graph representation of the DH network. Two different analyses can further be applied in this context: step-wise and parallel-wise multi-view clustering. The step-wise analysis sequentially considers and analyses the substations with respect to a few different views. At each step, a new clustering solution is built on top of the one generated by the previously considered view, which organizes the substations in a hierarchical structure that can be used for multi-view comparisons. The parallel-wise analysis, on the other hand, provides the opportunity to analyse the substations with regard to two different views in parallel. This analysis aims to represent and identify the relationships between substations by organizing them in a bipartite graph and analysing the substations’ distribution with respect to each view. The proposed data analysis and visualization approach arms domain experts with means for analysing DH network performance. In addition, it facilitates the identification of substations with deviating operational behaviour based on comparative analysis with their closely located neighbours.
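
A toy sketch of the step-wise analysis described above: cluster the substations on one view, then cluster again within each cluster on a second view, which yields the hierarchical structure used for comparison. The views, sizes and cluster counts are placeholders:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n = 200
    view_geo = rng.normal(size=(n, 2))    # e.g. substation coordinates
    view_load = rng.normal(size=(n, 3))   # e.g. heat-load profile features

    # Step 1: cluster on the first view.
    top = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(view_geo)

    # Step 2: within each first-level cluster, cluster again on the second view.
    hierarchy = {}
    for c in np.unique(top):
        idx = np.where(top == c)[0]
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(view_load[idx])
        hierarchy[int(c)] = dict(zip(idx.tolist(), sub.tolist()))
    print({c: len(members) for c, members in hierarchy.items()})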

Paper Nr: 24
Title:

Automatic Detection of Gait Asymmetry

Authors:

Maciej Cwierlikowski and Mercedes T. Torres

Abstract: Gait analysis, and gait symmetry assessment in particular, are commonly adopted in clinical settings to determine sensorimotor fitness, reflecting the body’s ability to integrate multi-sensory stimuli and use this information to induce ongoing motor commands. Inter-limb deviation can serve as a non-invasive marker of gait function to identify health conditions and monitor the effects of a rehabilitation regimen. This paper examines the performance of machine learning methods (decision trees, k-NN, SVMs, ANNs) in learning and predicting gait symmetry from kinetic and kinematic data of 42 participants walking across a range of speeds on a treadmill. Classification was conducted for each speed independently, with several feature extraction techniques applied. Subjects exhibited gait asymmetry, yet ground reaction forces were more discriminative than joint angles. Walking speed affected gait symmetry, with larger discrepancies registered at slower speeds; the highest F1 scores were noted at the slowest condition (decision trees: 87.35%, k-NN: 91.46%, SVMs: 88.88%, ANNs: 87.22%). None of the existing research has yet addressed ML-assisted assessment of gait symmetry across a range of walking speeds using both kinetic and kinematic information. The proposed methodology was sufficiently sensitive to discern subtle deviations in healthy subjects and hence could facilitate an early diagnosis when anomalies in gait patterns emerge.

Paper Nr: 37
Title:

Reference Data Abstraction and Causal Relation based on Algebraic Expressions

Authors:

Susumu Yamasaki and Mariko Sasakura

Abstract: This paper is concerned with algebraic aspects of referential relations in distributed systems, where the sites, regarded as states, are assumed to contain pages, and each page, as reference data, involves links to other pages as well as its own contents. The links among pages are abstracted into causal relations in terms of algebraic expressions. As the algebraic basis for representing causal relations, the more abstract Heyting algebra (a bounded lattice with Heyting implication) is taken rather than Boolean algebra with classical implication, since the meanings of negation differ between the two algebras. A standard form may be obtained from any Heyting algebra expression, which may denote causal relations with Heyting negations. If the evaluation domain is taken to be 3-valued, the algebraic expressions are abstract enough to represent referential links of pages in a distributed system, where a link may be interpreted as active, inactive or unknown. There is a critical problem to be solved for such a framework to serve as a theoretical basis: the model theory involves nonmonotonic functions and reasoning, in the sense of AI, with respect to the mapping associated with the causal relations, so that fixed point theory cannot always be applied routinely. This paper presents a method to inductively construct models of algebraic expressions, conditioned in accordance with the characteristics of the reference data. We then examine the traversal of states with models of algebraic expressions clustering at states, as a metatheory for searching reference data in a distributed system. By abstracting from state transitions, an algebraic structure is refined such that the operational aspect of traversal may be well formulated.
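
A worked micro-example of the 3-valued setting mentioned above, using the Heyting algebra on the chain 0 < 0.5 < 1 (read as inactive, unknown, active); it shows why Heyting negation behaves differently from Boolean negation:

    # Heyting operations on the 3-element chain 0 < 0.5 < 1.
    VALUES = (0.0, 0.5, 1.0)

    def meet(a, b): return min(a, b)
    def join(a, b): return max(a, b)

    def implies(a, b):
        """Heyting implication on a linear order: a -> b = 1 if a <= b, else b."""
        return 1.0 if a <= b else b

    def neg(a):
        """Heyting negation: not a = (a -> 0)."""
        return implies(a, 0.0)

    for a in VALUES:
        # The law of excluded middle fails for the 'unknown' value:
        # 0.5 or not 0.5 evaluates to 0.5, not to 1.
        print(a, neg(a), join(a, neg(a)))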

Paper Nr: 61
Title:

Towards Large-scale Gaussian Process Models for Efficient Bayesian Machine Learning

Authors:

Fabian Berns and Christian Beecks

Abstract: Gaussian Process Models (GPMs) are applicable to a large variety of different data analysis tasks, such as time series interpolation, regression, and classification. Frequently, these models of Bayesian machine learning instantiate a Gaussian Process with a zero-mean function and the well-known Gaussian kernel. While these default instantiations yield acceptable analytical quality for many use cases, GPM retrieval algorithms allow to automatically search for an application-specific model suitable for a particular dataset. State-of-the-art GPM retrieval algorithms have only been applied to small datasets, as their cubic runtime complexity impedes analyzing datasets beyond a few thousand data records. Even though global approximations of Gaussian Processes extend the applicability of those models to medium-sized datasets, datasets of millions of data records are still far beyond their reach. Therefore, we develop a new large-scale GPM structure, which incorporates a divide-&-conquer-based paradigm and thus enables efficient GPM retrieval for large-scale data. We outline challenges concerning this newly developed GPM structure regarding its algorithmic retrieval, its integration with given data platforms and technologies, as well as cross-model comparability and interpretability.
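
The divide-&-conquer idea can be illustrated, in a much-simplified form, by fitting independent local Gaussian Processes on data partitions and routing each query to the partition that owns it; the paper’s GPM structure and retrieval algorithm are considerably more elaborate:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 10, size=3000))[:, None]
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=3000)

    # Divide: one local GP per chunk avoids the cubic cost of a single global GP.
    edges = np.linspace(0, 10, 11)
    experts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (X[:, 0] >= lo) & (X[:, 0] < hi)
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
        experts.append(((lo, hi), gp.fit(X[m], y[m])))

    # Conquer: route a query to the expert responsible for its region.
    def predict(x):
        for (lo, hi), gp in experts:
            if lo <= x < hi or (x == edges[-1] and hi == edges[-1]):
                return gp.predict(np.array([[x]]))[0]

    print(predict(2.5), np.sin(2.5))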

Paper Nr: 66
Title:

Ensemble Clustering based Semi-supervised Learning for Revenue Accounting Workflow Management

Authors:

Tianshu Yang, Nicolas Pasquier and Frederic Precioso

Abstract: We present a semi-supervised ensemble clustering framework for identifying multi-level clusters that are relevant to application objectives in large datasets and mapping them to application classes in order to predict the class of new instances. This framework extends the MultiCons closed-sets-based multiple consensus clustering approach but can easily be adapted to other ensemble clustering approaches. It was developed to optimize the Amadeus Revenue Management application. Revenue accounting in the travel industry is a complex task when trips include several transport segments, with associated services, performed by distinct operators and in geographical areas with different taxes and currencies, for example. Preliminary results show the relevance of the proposed approach for automating anomaly corrections in the Amadeus Revenue Management workflow.

Paper Nr: 67
Title:

Predicting the Environment of a Neighborhood: A Use Case for France

Authors:

Nelly Barret, Fabien Duchateau, Franck Favetta and Loïc Bonneval

Abstract: The notion of neighbourhoods is critical in many applications such as social studies, cultural heritage management, urban planning or the study of environmental impact on health. Two main challenges concern the definition and representation of this spatial concept and the gathering of descriptive data over a large area (a country). In this paper, we present a use case in the context of real estate search for representing French neighbourhoods in a uniform manner, using a few environment variables (e.g., building type, social class). Since it is not possible to manually classify all neighbourhoods, our objective is to automatically predict this new information.

Paper Nr: 75
Title:

Sentiment Polarity Classification of Corporate Review Data with a Bidirectional Long-Short Term Memory (biLSTM) Neural Network Architecture

Authors:

R. E. Loke and O. Kachaniuk

Abstract: A considerable amount of literature has been published on Corporate Reputation, Branding and Brand Image. These studies are extensive and focus particularly on questionnaires and statistical analysis. Although extensive research has been carried out, no single study was found that attempted to predict corporate reputation performance based on data collected from media sources. To perform this task, a biLSTM neural network extended with an attention mechanism was utilized. The advantage of this architecture is that it obtains excellent performance on NLP tasks. The designed state-of-the-art model achieves highly competitive results: F1 scores around 72%, accuracy of 92% and loss around 20%.
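
A minimal Keras sketch of a biLSTM sentiment-polarity classifier; the attention mechanism used in the paper is omitted for brevity, and the vocabulary size, layer widths and the preprocessed inputs (padded_token_ids, labels) are assumptions:

    import tensorflow as tf

    VOCAB_SIZE = 20000   # assumed tokenizer vocabulary size

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 128),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.GlobalMaxPooling1D(),            # the paper adds attention here
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative polarity
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(padded_token_ids, labels, validation_split=0.1, epochs=5)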

Paper Nr: 76
Title:

Towards Self-adaptive Defect Classification in Industrial Monitoring

Authors:

Andreas Margraf, Jörg Hähner, Philipp Braml and Steffen Geinitz

Abstract: The configuration of monitoring applications is usually performed using annotations created by experts. Unlike many industrial products, carbon fiber textiles exhibit low rigidity. Hence, surface anomalies vary to a great extent, which poses challenges to quality monitoring and decision makers. This paper therefore proposes an unsupervised learning approach for carbon fiber production. The data consists of images continuously acquired using a line scan camera. An image processing pipeline, generated by an evolutionary algorithm, is applied to segment regions of interest. We then cluster the incoming defect data with stream clustering algorithms in order to identify structures, tendencies and anomalies. We compare well-known heuristics based on k-means, hierarchical and density-based clustering, and configure them to work best under the given circumstances. The clustering results are then compared to expert labels. A best-practice approach is presented to analyse the defects and their origin in the given image data. The experiments show promising results for the classification of highly specialised production processes with low defect rates which do not allow reliable, repeatable manual identification of classes. We show that unsupervised learning enables quality managers to gain better insights into measurement data in the context of image classification without prior knowledge. In addition, our approach helps to reduce the training effort of image-based monitoring systems.
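
The paper compares several stream clustering heuristics; as a stand-in illustration of the clustering step, scikit-learn's MiniBatchKMeans can update cluster centres as defect feature vectors arrive from the segmentation pipeline (the features and batch sizes below are invented):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    stream = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

    # Defect descriptors (e.g. area, elongation, intensity) arriving in batches.
    for _ in range(50):
        batch = rng.normal(size=(32, 3)) + rng.integers(0, 3) * 4.0
        stream.partial_fit(batch)

    new_defects = rng.normal(size=(5, 3))
    print(stream.predict(new_defects))   # cluster id per incoming defect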

Paper Nr: 77
Title:

A CART-based Genetic Algorithm for Constructing Higher Accuracy Decision Trees

Authors:

Elif Ersoy, Erinç Albey and Enis Kayış

Abstract: Decision trees are among the most popular classification methods due to their ease of implementation and simple interpretation. In traditional methods like CART (classification and regression tree), ID4 and C4.5, trees are constructed by a myopic, greedy top-down induction strategy. In this strategy, the possible impact of future splits in the tree is not considered while determining each split. Therefore, the generated tree may not be the optimal solution for the classification problem. In this paper, to improve the accuracy of decision trees, we propose a genetic algorithm with a genuine chromosome structure. We also address the selection of the initial population by considering a blend of randomly generated solutions and solutions from traditional, greedy tree generation algorithms constructed for reduced problem instances. The performance of the proposed genetic algorithm is tested using different datasets, varying bounds on the depth of the resulting trees, and different initial population blends within the mentioned varieties. Results reveal that the performance of the proposed genetic algorithm is superior to that of CART in almost all datasets used in the analysis.

Paper Nr: 11
Title:

Trading Desk Behavior Modeling via LSTM for Rogue Trading Fraud Detection

Authors:

Marine Neyret, Jaouad Ouaggag and Cédric Allain

Abstract: Rogue trading is a term used to designate fraudulent trading activity, and rogue traders are operators who take unauthorised positions with regard to the mandate of the desk to which they belong and to the regulations in force. Through this fraudulent behavior, a rogue trader exposes their group to operational and market risks that can lead to heavy financial losses and to financial and criminal sanctions. We present a two-step methodology to detect rogue trading activity among the deals of a desk. Using a dataset of transactions booked by operators, we first build time-series behavioral features that describe their activity and predict these features’ future values using a Long Short-Term Memory (LSTM) network. The detection step is then performed by comparing the predictions made by the LSTM to the real values, assuming that unexpected values in the predicted behavioral features reflect potential rogue trading activity. In order to detect anomalies, we define a prediction error that is used to compute an anomaly score based on the Mahalanobis distance.
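
The anomaly-scoring step can be sketched directly from its definition: collect the LSTM prediction errors, estimate their mean and covariance on normal activity, and score each new error vector by its (squared) Mahalanobis distance. The numbers below are synthetic:

    import numpy as np

    def mahalanobis_scores(errors):
        """errors: (n_periods, n_features) prediction errors (real - forecast).
        Returns one squared-Mahalanobis anomaly score per period."""
        mu = errors.mean(axis=0)
        inv_cov = np.linalg.pinv(np.cov(errors, rowvar=False))
        centered = errors - mu
        return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

    rng = np.random.default_rng(0)
    normal_errors = rng.normal(scale=0.5, size=(100, 4))
    deviating_day = np.array([[3.0, -2.5, 4.0, 3.5]])    # hypothetical rogue behaviour
    scores = mahalanobis_scores(np.vstack([normal_errors, deviating_day]))
    print(scores[-1], scores[:-1].mean())                # the last period stands out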

Paper Nr: 58
Title:

Initializing k-means Clustering

Authors:

Christian Borgelt and Olha Yarikova

Abstract: The quality of clustering results obtained with the k-means algorithm depends heavily on the initialization of the cluster centers. Simply sampling centers uniformly at random from the data points usually yields fairly poor and unstable results. Hence, several alternatives have been suggested in the past, among which Maximin (Hathaway et al., 2006) and k-means++ (Arthur and Vassilvitskii, 2007) are the best known and most widely used. In this paper we explore modifications of these methods that deal with cases in which the original methods still yield suboptimal choices of the initial cluster centers. Furthermore, we present efficient implementations of our new methods.
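
For reference, the standard k-means++ seeding that the paper modifies (the authors' new variants are not reproduced here) can be written in a few lines of NumPy:

    import numpy as np

    def kmeans_pp_init(X, k, rng=None):
        """k-means++ seeding (Arthur and Vassilvitskii, 2007): each new center
        is drawn with probability proportional to the squared distance to the
        nearest center chosen so far."""
        rng = rng or np.random.default_rng()
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            diffs = X[:, None, :] - np.array(centers)[None, :, :]
            d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

    X = np.random.default_rng(0).normal(size=(500, 2))
    print(kmeans_pp_init(X, k=3))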

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 26
Title:

Capability-based Scheduling of Scientific Workflows in the Cloud

Authors:

Michel Krämer

Abstract: We present a distributed task scheduling algorithm and a software architecture for a system executing scientific workflows in the Cloud. The main challenges we address are (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the Cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance because in the Cloud, faults can happen at any time. Our software architecture consists of loosely coupled components communicating with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently. Our scheduling algorithm collects distinct required capability sets for the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly. We present the results of four experiments we conducted to evaluate if our approach meets the aforementioned challenges. We finish the paper with a discussion, conclusions, and future research opportunities. An implementation of our algorithm and software architecture is publicly available with the open-source workflow management system “Steep”.
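
A toy sketch of the matching step described above: collect the distinct required capability sets, ask each agent which sets it can serve, and assign process chains accordingly. The real Steep scheduler additionally handles dynamic resources and faults; the data below is illustrative:

    from collections import defaultdict

    process_chains = [
        {"id": "pc1", "capabilities": frozenset({"gpu"})},
        {"id": "pc2", "capabilities": frozenset({"gpu"})},
        {"id": "pc3", "capabilities": frozenset()},
    ]
    agents = {"agent-a": {"gpu", "highmem"}, "agent-b": set()}

    # 1. Collect the distinct required capability sets.
    required_sets = {pc["capabilities"] for pc in process_chains}

    # 2. Ask which sets each agent can manage (here: a simple subset test).
    can_serve = {rs: [a for a, caps in agents.items() if rs <= caps]
                 for rs in required_sets}

    # 3. Assign each process chain to a capable agent (round-robin per set).
    assignment, counters = {}, defaultdict(int)
    for pc in process_chains:
        candidates = can_serve[pc["capabilities"]]
        assignment[pc["id"]] = candidates[counters[pc["capabilities"]] % len(candidates)]
        counters[pc["capabilities"]] += 1
    print(assignment)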

Paper Nr: 64
Title:

Is Open Data Ready for Use by Enterprises? Learnings from Corporate Registers

Authors:

Pavel Krasikov, Timo Obrecht, Christine Legner and Markus Eurich

Abstract: Open data initiatives have long focused on motivating governmental bodies to open up their data. The number of open datasets is growing steadily, but their adoption is still lagging behind. An increasing number of studies assess open data portals and open data quality to shed light on open data’s current state. Since prior research addressed neither the datasets’ content nor whether it meets enterprises’ data needs, our study aims to address this gap by investigating the extent to which open data is ready for use in the enterprise context. We focus on open corporate registers as an important segment of open government data with high relevance for enterprises. Our findings confirm that open datasets are heterogeneous in terms of access, licensing, and content, which makes them difficult to use in a business context. Our content analysis reveals that fewer than 50% of the analyzed registers provide companies’ full legal addresses, while only 10% note their contact information. We conclude that open data in corporate registers is of limited use due to its lack of required attributes and of relevant business concepts for typical use cases.

Short Papers
Paper Nr: 48
Title:

iTLM: A Privacy Friendly Crowdsourcing Architecture for Intelligent Traffic Light Management

Authors:

Christian Roth, Mirja Nitschke, Matthias Hörmann and Doğan Kesdoğan

Abstract: Vehicle-to-everything (V2X) communication interconnects participants in vehicular environments to exchange information. This enables a broad range of new opportunities. We propose a self-learning traffic light system which uses crowdsourced information from vehicles in a privacy-friendly manner to optimize the overall traffic flow. Our simulation, based on real-world data, shows that the information gain vastly decreases waiting time at traffic lights, ultimately reducing CO2 emissions. A privacy analysis shows that our approach provides a significant level of k-anonymity even in low-traffic scenarios.
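
The k-anonymity claim can be checked with a simple grouping over quasi-identifiers, as one might do in such a privacy analysis; the columns and values below are illustrative only:

    import pandas as pd

    reports = pd.DataFrame({
        "approach_road": ["A3", "A3", "A3", "B12", "B12"],
        "time_slot":     ["08:00", "08:00", "08:00", "08:00", "08:00"],
        "speed_kmh":     [42, 38, 40, 55, 53],
    })

    def k_anonymity(df, quasi_identifiers):
        """Smallest group size over the quasi-identifiers: every report is
        indistinguishable from at least k-1 others."""
        return int(df.groupby(quasi_identifiers).size().min())

    print(k_anonymity(reports, ["approach_road", "time_slot"]))   # k = 2 here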

Paper Nr: 60
Title:

The Need for an Enterprise Risk Management Framework for Big Data Science Projects

Authors:

Jeffrey Saltz and Sucheta Lahiri

Abstract: This position paper explores the need for, and benefits of, a Big Data Science Enterprise Risk Management Framework (RMF). The paper highlights the need for an RMF for Big Data Science projects, as well as the gaps and deficiencies of current risk management frameworks in addressing Big Data Science project risks. Furthermore, via a systematic literature review, the paper notes a dearth of research that looks at risk management frameworks for Big Data Science projects. The paper also reviews other emerging technology domains and notes the creation of enhanced risk management frameworks to address the new risks introduced by those emerging technologies. Finally, this paper charts a possible path forward to define a risk management framework for Big Data Science projects.

Paper Nr: 31
Title:

Toward a New Quality Measurement Model for Big Data

Authors:

Mandana Omidbakhsh and Olga Ormandjieva

Abstract: Along with wide accessibility to Big Data arises the need for a standardized quality measurement model in order to facilitate the complex modeling, analysis and interpretation of Big Data quality requirements and the evaluation of data quality. In this paper we propose a new hierarchical, goal-driven quality model for ten Big Data characteristics (V’s) at different levels of granularity, built on the basis of: i) the NIST (National Institute of Standards and Technology) definitions and taxonomies for Big Data, and ii) the ISO/IEC standard data terminology and measurements. According to our research findings, there are no related measurements in ISO/IEC for important Big Data characteristics such as Volume, Variety and Valence. As future work, we intend to investigate theoretically valid methods for the quality assessment of the above-mentioned V’s.

Area 5 - Databases and Data Security

Full Papers
Paper Nr: 23
Title:

A Framework for Creating Policy-agnostic Programming Languages

Authors:

Fabian Bruckner, Julia Pampus and Falk Howar

Abstract: This paper introduces the policy system of the domain-specific language D◦ (pronounced di’grē). The central feature of this DSL is the automatic integration of usage control mechanisms into the application logic. The DSL is cross-compiled to a host language. D◦ implements the policy-agnostic programming paradigm, which means that application logic and policy enforcement are considered separately during development and are automatically combined at a later stage. We propose a well-defined combination of blacklisting and whitelisting, which we call greylisting. Based on a simple example, we present the different aspects of the proposed policy system. Extensibility of the policy system and of D◦ is another central functionality of the DSL. We demonstrate how the policy system and the language itself can be extended with new elements by implementing a simple use case. For this implementation, we use a prototypical implementation of D◦ with Java as the host language.
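
D◦ itself is a DSL that is cross-compiled to a host language; the sketch below only illustrates the greylisting semantics in plain Python: an access is permitted only if some whitelist rule allows it and no blacklist rule forbids it. The rules shown are invented:

    def greylisted(action, whitelist, blacklist):
        """Greylisting = well-defined combination of whitelisting and blacklisting:
        allowed only if explicitly whitelisted and not explicitly blacklisted."""
        allowed = any(rule(action) for rule in whitelist)
        forbidden = any(rule(action) for rule in blacklist)
        return allowed and not forbidden

    whitelist = [lambda a: a["purpose"] == "billing"]
    blacklist = [lambda a: a["field"] == "birthdate"]

    print(greylisted({"purpose": "billing", "field": "amount"}, whitelist, blacklist))     # True
    print(greylisted({"purpose": "billing", "field": "birthdate"}, whitelist, blacklist))  # False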

Short Papers
Paper Nr: 25
Title:

Hybrid Multi-model Multi-platform (HM3P) Databases

Authors:

Sven Groppe and Jinghua Groppe

Abstract: There exist various standards for different data models, and hence users often must handle a zoo of data models. Storing and processing data in their native models, while spanning optimizations and processing across these models, seems to be the most efficient way, and we have therefore recently observed the advent of multi-model databases for this purpose. Companies, end users and developers typically run different platforms like mobile devices, web, desktops, servers, clouds and post-clouds (e.g., fog and edge computing) as execution environments for their applications at the same time. In this paper, we propose to utilize the different platforms according to their advantages and benefits for data distribution, query processing and transaction handling in an overall integrated hybrid multi-model multi-platform (HM3P) database. We analyze current state-of-the-art multi-model databases with respect to their support for multiple platforms. Furthermore, we analyze the properties of databases running on different types of platforms. We detail new challenges for the novel concept of HM3P databases concerning the global optimization of data distribution, query processing and transaction handling across multiple platforms.

Paper Nr: 46
Title:

Trust Profile based Trust Negotiation for the FHIR Standard

Authors:

Eugene Sanzi and Steven A. Demurjian

Abstract: Sensitive healthcare data within Electronic Healthcare Records (EHRs) is traditionally protected through an authentication and authorization process. The user is authenticated based on a username/password combination, which requires a pre-registration process. Trust profile based trust negotiation replaces the human intervention required by the traditional pre-registration process with an automated approach that verifies, via digital signatures, that the user owns the trust profile. To accomplish this, the negotiation process gradually exchanges the credentials within the trust profile to build trust and automatically assign authorization rules to previously unknown users. In this paper, we propose a new model for attaching trust profile authorization data to Fast Healthcare Interoperability Resources (FHIR), a standard created by HL7, in order to integrate trust profile based trust negotiation into FHIR.
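
Proving ownership of a trust profile ultimately rests on signing a fresh challenge with the profile holder's private key; a minimal sketch with the pyca/cryptography package (FHIR resource handling and the gradual credential exchange are omitted):

    import os
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Key pair bound to the user's trust profile (in practice backed by certificates).
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    # The relying party issues a fresh challenge; the user signs it to prove
    # ownership of the trust profile without any manual pre-registration.
    challenge = os.urandom(32)
    signature = private_key.sign(challenge)

    try:
        public_key.verify(signature, challenge)
        print("trust profile ownership verified")
    except InvalidSignature:
        print("verification failed")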