DATA 2021 Abstracts

Area 1 - Big Data

Full Papers

Paper Nr:	51
Title:	Detecting Twitter Fake Accounts using Machine Learning and Data Reduction Techniques
Authors:	Ahmad Homsi, Joyce Al Nemri, Nisma Naimat, Hamzeh Abdul Kareem, Mustafa Al-Fayoumi and Mohammad Abu Snober
Abstract:	Internet communities are rife with fake accounts. Fake accounts are used to spread spam, post false product reviews, publish fake news, and even interfere in political campaigns. In business, fake accounts can do massive damage, such as wasted money, damaged reputation, and legal problems. The number of fake accounts is increasing dramatically with the enormous growth of online social networks; thus, such accounts must be detected. In recent years, researchers have been trying to develop and enhance machine learning (ML) algorithms to detect fake accounts efficiently and effectively. This paper applies four machine learning algorithms (J48, Random Forest, Naive Bayes, and KNN) and two reduction techniques (PCA and Correlation) to the MIB Twitter dataset. Our results provide a detailed comparison among those algorithms. We show that combining Correlation with the Random Forest algorithm gave better results, reaching about 98.6%.

Short Papers

Paper Nr:	58
Title:	Enhanced AI On-the-Edge 3D Vision Accelerated Point Cloud Spatial Computing Solution
Authors:	Gaurav Kumar Wankar and Shubham Vohra
Abstract:	With the emergence of Industry 5.0, the 3D vision product market is growing rapidly. Leveraging disruptive technologies, we explore artificial-intelligence-driven, immersive 3D vision business solutions with transformative experiences. These solutions use deep learning accelerated with Voxelization, PointPillars, and PointNet approaches to classify point clouds and make feature extraction more accurate. An NVIDIA Jetson TX2, targeted at power-constrained AI on-the-edge applications, maintains awareness of its surroundings by visualizing them in 3D space, using Azure Kinect DK depth sensing, rather than in 2D space, thereby improving the performance of the edge AI computing device. Drawing on state-of-the-art technologies that converge AI and mixed reality, we further encourage readers to explore the possibilities of next-generation services that bring accurate and immersive real-world information and allow decision-making based on digital reality, driving digital transformation.

Area 2 - Business Analytics

Full Papers

Paper Nr:	49
Title:	A Study on the Effects of Response Time on Travel Package Attributes
Authors:	Usha Ananthakumar and Sagun Pai
Abstract:	The rapid growth of online surveys in the past decade has raised questions about the effects of response time on the results. The focus of our current study is the impact of response time on various travel package attributes, as a way to understand the consumer cognitive process. This study makes use of a recently conducted conjoint analysis experiment on travel package preferences in order to gain insights into the impact of response time on attribute importance and willingness to pay (WTP). Accordingly, the respondents are grouped as fast or slow depending on their response time, and the differences in their conjoint attribute importance estimates are investigated. The study also examines the changes in consumer willingness to pay for the two groups. Additionally, the distinctions in socioeconomic characteristics between the fast and slow respondents are analyzed. The results and conclusions obtained from this research will help tour operators to scrutinize the time taken by consumers and thereby deploy an appropriate marketing strategy based on the respective importance values and WTP trends.

Paper Nr:	54
Title:	A Longitudinal Model for Song Popularity Prediction
Authors:	Ahmet Çimen and Enis Kayış
Abstract:	Usage of new-generation music streaming platforms such as Spotify and Apple Music has increased rapidly in recent years. Automatic prediction of a song’s popularity is valuable for these firms, which in turn translates into higher customer satisfaction. In this study, we develop and compare several statistical models to predict song popularity by using acoustic and artist-related features. We compare results from two countries to understand whether there are any cultural differences for popular songs. To compare the results, we use weekly charts and songs’ acoustic features as data sources. In addition to acoustic features, we add acoustic similarity, genre, local popularity, and song recentness features to the dataset. We apply the Flexible Least Squares (FLS) method to estimate song streams and observe time-varying regression coefficients using a quadratic program. The FLS method predicts the number of weekly streams of a song using the acoustic features and the additional features in the dataset while keeping weekly model differences as small as possible. Results show that significant changes in the regression coefficients may reflect changes in the music tastes of the countries.

Paper Nr:	61
Title:	A Graph-based Approach at Passage Level to Investigate the Cohesiveness of Documents
Authors:	Ghulam Sarwar and Colm O’Riordan
Abstract:	Approaches involving the representation of documents as a series of passages have been used in the past to improve the performance of ad-hoc retrieval systems. In this paper, we represent the top returned passages as a graph with each passage corresponding to a vertex. We connect the vertices (passages) that belong to the same document to form a graph. The underlying intuition behind this approach is to identify some measure of the cohesiveness of the documents. We introduce a graph-based approach at the passage level to calculate the cohesion score of each document. The scores for both relevant and non-relevant documents are compared, and we illustrate that the cohesion score differs for relevant and non-relevant documents. Moreover, we re-rank the documents by combining the cohesion score with a document similarity score to inspect its impact on the system’s performance.

Paper Nr:	62
Title:	A Reference Process for Judging Reliability of Classification Results in Predictive Analytics
Authors:	Simon Staudinger, Christoph G. Schuetz and Michael Schrefl
Abstract:	Organizations employ data mining to discover patterns in historic data. The models that are learned from the data allow analysts to make predictions about future events of interest. Different global measures, e.g., accuracy, sensitivity, and specificity, are employed to evaluate a predictive model. In order to properly assess the reliability of an individual prediction for a specific input case, global measures may not suffice. In this paper, we propose a reference process for the development of predictive analytics applications that allow analysts to better judge the reliability of individual classification results. The proposed reference process is aligned with the CRISP-DM stages and complements each stage with a number of tasks required for reliability checking. We further explain two generic approaches that assist analysts with the assessment of reliability of individual predictions, namely perturbation and local quality measures.

Short Papers

Paper Nr:	8
Title:	GRASP: Graph-based Mining of Scientific Papers
Authors:	Navid Nobani, Mauro Pelucchi, Matteo Perico, Andrea Scrivanti and Alessandro Vaccarino
Abstract:	Over the past two decades, academia has witnessed numerous tools and search engines which facilitate the retrieval procedure in the literature review process and help researchers review the literature with more ease and accuracy. These tools mostly work based on a simple textual input which supposedly encapsulates the primary keywords in the desired research areas. Such tools mainly suffer from the following shortcomings: (i) they rely on textual search queries that are expected to reflect all the desired keywords and concepts, and (ii) they return shallow results, which makes following a paper through time via citations a cumbersome task. In this paper, we introduce GRASP, a search engine that retrieves scientific papers starting from a sub-graph query provided by the user, offering (i) a list of time papers based on the query and (ii) a graph with papers and authors as vertices and edges being cited and published-by. GRASP has been created using a Neo4j graph database, based on DBLP and AMiner corpora provided by their API. Through a performance evaluation with ten computer science experts, we demonstrate how GRASP can efficiently retrieve and rank the most related papers based on the user’s input.

Paper Nr:	13
Title:	A Comparison of Methods for the Evaluation of Text Summarization Techniques
Authors:	Marcello Barbella, Michele Risi and Genoveffa Tortora
Abstract:	Automatic Text Summarization techniques aim to extract key information from one or more input texts automatically, producing summaries and preserving the meaning of the content. These techniques are divided into two main families, Extractive and Abstractive, which differ in their operating mode. The former picks up sentences directly from the document text, whilst the latter produces a summary by interpreting the text and rephrasing sentences by incorporating information. Therefore, there is a need to evaluate and verify how close a summary is to the original text. The research question is: how can the quality of the summaries produced by these techniques be evaluated? Different metrics and scores have been proposed in the literature (e.g., ROUGE) for the evaluation of text summarization. Thus, the main purpose of this paper is to examine in depth the behaviour of the ROUGE metric. In particular, we performed a first experiment to compare the metric's efficiency for the evaluation of Abstractive versus Extractive Text Summarization algorithms while, in a second one, we compared the obtained score for two different summary approaches: the simple execution of a summarization algorithm versus the multiple execution of different algorithms on the same text. The conclusions lead to the following interesting results: ROUGE does not achieve excellent results, because it has similar performance on both the Abstractive and Extractive algorithms; multiple execution works better than a single one most of the time.
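As an illustration of the kind of score this abstract discusses, the following is a minimal sketch of ROUGE-N computed as clipped n-gram overlap between a candidate summary and a reference. The function name `rouge_n`, the whitespace tokenization, and the example sentences are assumptions made for this sketch; they are not taken from the paper, whose exact ROUGE configuration is not stated here.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall and F1 from clipped n-gram overlap.

    Tokenization is a plain lowercase whitespace split, which is a
    simplification chosen for this sketch.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example sentences, used only to exercise the function.
scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Because the score depends only on surface n-gram overlap, a faithful abstractive summary that rephrases the source can score lower than an extractive one that copies sentences, which is one reason the metric's behaviour merits the closer examination described above.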

Paper Nr:	39
Title:	Well-Being in Plastic Surgery: Deep Learning Reveals Patients’ Evaluations
Authors:	Joschka Kersting and Michaela Geierhos
Abstract:	This study deals with aspect-based sentiment analysis and with the correlation of extracted aspects and their sentiment polarities with metadata. There are millions of review texts on the Internet that cannot be analyzed, and thus people cannot benefit from the information they contain. While most research so far has focused on explicit aspects from product or service data (e.g., hotels), we extract and classify implicit and explicit aspect phrases from German-language physician review texts. We annotated aspect phrases that indicate ratings about the doctor’s practice, such as waiting time or general perceived well-being conveyed by all staff members of a practice. We also apply a sentiment polarity classifier. We compare several traditional and transformer networks and apply the best model, the multilingual XLM-RoBERTa, to a dedicated German-language dataset dealing with plastic surgeons. We choose plastic surgery as a sample domain because it is especially sensitive, given its relation to a person’s self-image and felt acceptance. In addition to standard evaluation measures such as Precision, Recall, and F1-Score, we correlate our results with metadata from physician review websites, such as a physician’s gender. We identify several correlations and present methods for analyzing unstructured review texts to enable service improvements in healthcare.

Paper Nr:	50
Title:	Estimating Territory Risk Relativity for Auto Insurance Rate Regulation using Generalized Linear Mixed Models
Authors:	Shengkun Xie, Chong Gan and Clare Chua-Chow
Abstract:	Territory risk analysis has played an essential role in auto insurance rate regulation. It aims to obtain a set of regions and to estimate their respective relativities to reflect the regional risk. Cluster as a latent variable has not yet been considered in modelling the regional risk of auto insurance. In this work, spatially constrained clustering is first applied to insurance loss data to form such regions. The generalized linear mixed model is then proposed to derive the risk relativities for the obtained clusters and then for each basic rating unit. The results are compared to the ones from generalized linear models. Grouping Forward Sortation Areas (FSAs) into a specific region by spatially constrained clustering is intended to reduce the insurance rate heterogeneity caused by small numbers of risk exposures. The spatially constrained clustering and risk relativity estimation help obtain a set of territory risk benchmarks, which can be used in rate filings within the regulation process. It also provides guidance for auto insurance companies on rate-making. The proposed methodologies could be helpful and applicable in many other fields, including business data analytics.

Paper Nr:	63
Title:	A Company’s Corporate Reputation through the Eyes of Employees Measured with Sentiment Analysis of Online Reviews
Authors:	R. E. Loke and R. Lam-Lion
Abstract:	Corporate reputation can be defined as the overall assessment of a company’s performance over time (Kircova & Esen, 2018). Organizations with a positive corporate reputation create a competitive advantage and are more likely to influence customers’ behaviors and attitudes (Kircova, 2018). Measuring corporate reputation from online data is an increasingly important area in business studies because the amount of opinions and comments on the internet is growing steadily and has become very accessible to strangers (Shayaa, 2018). Traditionally, corporate reputation is measured with well-known approaches such as surveys, qualitative interviews, and sample groups (Smith, 2010). Researchers like Fombrun, Fonzy and Newburry (2015) developed instruments to measure corporate reputation and predictively modeled its impact on stakeholder outcomes. So far, however, there has been little attention in the literature on sophisticated measurement techniques for corporate reputation that can be applied to online reviews from the public web. This paper applies sentiment analysis in combination with semantic search as a suitable technique to explore how employees perceive organizations. By using our toolbox, organizations can adapt to market changes and cater to stakeholders’ needs. It can also be used to raise awareness in organizations that are unaware of negative reviews online.

Paper Nr:	2
Title:	Determining How Different Factors Affect Police-Allegation’s Sustainability in Chicago using Decision-Tree
Authors:	Linxin Yang
Abstract:	The Citizen Police Data Project (CPDP) is a database of allegations made against the Chicago Police Department (CPD). Reports made against officers are rarely sustained, which results in the perception of little officer accountability and contributes to widespread distrust of law enforcement. Using a decision tree model on the CPDP database, this work explores how the following factors (officer years of employment, complainant type, investigation agency, and allegation severity level) work together to increase or decrease the sustainability of allegations made against the CPD between 2008 and 2018. The results show that when a CPD employee reports an allegation, it has a higher chance of being sustained. However, for allegations reported by civilians, a third-party agency increases the likelihood of allegation sustainability.

Paper Nr:	4
Title:	Building an Integrated Relational Database from Swiss Nutrition’s (menuCH) and Multiple Swiss Health Datasets Acquired from 1992 to 2012 for Data Mining Purposes
Authors:	Timo Lustenberger, Helena Jenzer and Farshideh Einsele
Abstract:	Objective: The objective of the study was to integrate a large database from the Swiss national nutrition survey (menuCH) with 5 extensive databases derived from 5 consecutive Swiss national health surveys from 1992 to 2012 for data mining purposes. Each database additionally contains demographic base data. An integrated Swiss database is built to later discover critical food consumption patterns linked with lifestyle diseases known to be strongly tied to food consumption, and to compare the derived rules with the rules obtained in a previous study which used a significantly smaller database. Design: The Swiss national nutrition survey (menuCH), with approx. 2000 respondents from two different surveys, one by phone and the other by questionnaire, along with the Swiss national health surveys from 1992 to 2012, with more than 100,000 respondents, were preprocessed, cleaned, transformed and finally integrated into a single relational database. Results: The result of this study is an integrated relational database built from the Swiss nutrition data and 20 years of Swiss health data.

Paper Nr:	38
Title:	Using BPMN for ETL Conceptual Modelling: A Case Study
Authors:	Bruno Oliveira, Óscar Oliveira and Orlando Belo
Abstract:	One of the most important parts of a Data Warehousing System is the Extract-Transform-Load (ETL) component. It is responsible for extracting, transforming, conciliating, and loading data to support decision-making requirements. Usually, due to the complexity of managing heterogeneous data, this component consumes most of the resources required for implementing a Data Warehousing System, making it a critical component that can compromise the adequacy of the system. Despite its importance, ETL development is essentially ad hoc and does not always follow or embody best practices. With the emergence of Big Data and associated tools, script-based ETL has become an even more common approach. In recent years, BPMN (Business Process Model and Notation) has been proposed and used to support ETL conceptual models. Still, as an expressive language, it provides different approaches for representing the same requirements. In this paper, we explore the use of BPMN for ETL conceptual modelling, analyzing existing approaches and proposing a set of guidelines for using this notation in a more consistent way.

Paper Nr:	53
Title:	Aspect Based Sentiment Analysis on Online Review Data to Predict Corporate Reputation
Authors:	R. E. Loke and W. Reitter
Abstract:	Corporate reputation is an intangible resource that is closely tied to an organization’s success, but measuring it and deriving actions that can improve it can be a long and expensive journey for an organization. In the available literature, corporate reputation is primarily measured through surveys, which can be time and cost intensive. This paper uses online reviews on the web as the source for a machine-learning-driven aspect-based sentiment analysis that can enable organizations to evaluate their corporate reputation at a fine-grained level. The analysis is unsupervised, so organizations do not need to manually label datasets. Using the insights generated through the analysis, organizations can, on the one hand, save costs and time in measuring corporate reputation; on the other hand, the approach provides an in-depth analysis that splits the overall reputation into multiple aspects, with which organizations can identify weaknesses and in turn improve their corporate reputation. Therefore, this research is relevant for organizations aiming to understand and improve their corporate reputation to achieve success, for example, in the form of financial performance, or for organizations that help and consult other organizations on their journeys to increased success. Our approach is validated, evaluated and illustrated with Trustpilot review data.

Area 3 - Data Science

Full Papers

Paper Nr:	11
Title:	Similarity of Software Libraries: A Tag-based Classification Approach
Authors:	Maximilian Auch, Maximilian Balluff, Peter Mandl and Christian Wolff
Abstract:	The number of software libraries has increased over time, so grouping them into classes according to their functionality simplifies repository management and analyses. With the large number of software libraries, the task of categorization requires automation. Using a crawled dataset based on Java software libraries from Apache Maven repositories as well as tags and categories from the indexing platform MvnRepository.com, we show how the data in this set is structured and point out an imbalance of classes. We introduce a class mapping relevant for the procedure, which maps the libraries from very specific, technical classes into more generic classes. Using this mapping, we investigate supervised machine learning techniques that classify software libraries from the dataset based on their available tags. We show that a tag-based approach using neural networks can classify libraries with an accuracy of 97.46%. Overall, we found techniques such as neural networks and naive Bayes more suitable in this use case than logistic regression or a random forest.

Paper Nr:	20
Title:	A Comparative Study on Inflated and Dispersed Count Data
Authors:	Monika Arora, Yash Kalyani and Shivam Shanker
Abstract:	The availability of zero-inflated count data has led to the demonstration of various statistical models and machine learning algorithms applied in diverse fields such as healthcare, economics and travel. However, in real life there could be a count k > 0 that is inflated. There are only a few studies on k-inflated count models. To the best of our knowledge, there is no article that demonstrates machine learning algorithms on such data sets. We apply existing k-inflated count models as well as machine learning algorithms to travel data to compare the prediction and fitness of the models and find the significant covariates. Our study shows that the k-inflated models provide a good fit to the data; however, the predictions from machine learning algorithms are superior. This study can be extended further to include other artificial neural network approaches on a larger data set.

Paper Nr:	31
Title:	textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data
Authors:	Rob Churchill and Lisa Singh
Abstract:	With the rapid growth of social media in recent years, there has been considerable effort toward understanding the topics of online discussions. Unfortunately, state-of-the-art topic models tend to perform poorly on this new form of data, due to its noisy and unstructured nature. There has been a lot of research focused on improving topic modeling algorithms, but very little focused on improving the quality of the data that goes into the algorithms. In this paper, we formalize the notion of preprocessing configurations and propose a standardized, modular toolkit and pipeline for performing preprocessing on social media texts for use in topic models. We perform topic modeling on three different social media data sets and in the process show the importance of preprocessing and the usefulness of our preprocessing pipeline when dealing with different social media data. We release our preprocessing toolkit code (textPrep) in a Python package for others to use for advancing research on data mining and machine learning on social media text data.

Paper Nr:	33
Title:	Semantic Entanglement on Verb Negation
Authors:	Yuto Kikuchi, Kazuo Hara and Ikumi Suzuki
Abstract:	Word2vec, developed by Mikolov et al. in 2013, is an epoch-making method that embeds words into a vector space to capture their fine-grained meaning. However, the reliability of word2vec is inconsistent. To evaluate the reliability of word vectors, we perform Mikolov’s word analogy task, where word_A, word_B, and word_C are provided. Under the condition that word_B exhibits a particular relation with word_A, the task involves searching the vocabulary and returning the most relevant word for word_C for the same relation. We conduct an experiment to return negative words for verbs using word2vec for 100 typical Japanese verbs and investigate the effect of context (i.e., surrounding words) on correct or incorrect responses. It is shown that the task fails when the sense of verbs and the negative relation are entangled, because the semantic calculation of verb negation does not hold.
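The word analogy task described in this abstract is commonly solved with vector arithmetic: the answer is the vocabulary word closest to vec(B) - vec(A) + vec(C). The following is a minimal sketch of that calculation. The three-dimensional toy vectors and the tokens such as `not_go` are hypothetical placeholders invented for illustration; they are not word2vec output and do not represent the Japanese verb forms used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, word_a, word_b, word_c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c).

    The three query words are excluded from the candidates, as is usual
    for this task.
    """
    target = [b - a + c for a, b, c in
              zip(vectors[word_a], vectors[word_b], vectors[word_c])]
    candidates = [w for w in vectors if w not in {word_a, word_b, word_c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy embedding (assumption): the third dimension plays the role of "negation".
toy_vectors = {
    "go": [1.0, 0.0, 0.0],
    "not_go": [1.0, 0.0, 1.0],
    "eat": [0.0, 1.0, 0.0],
    "not_eat": [0.0, 1.0, 1.0],
    "sleep": [0.5, 0.5, 0.0],
}

answer = analogy(toy_vectors, "go", "not_go", "eat")
print(answer)
```

In this toy space negation is a clean, separate direction, so the arithmetic works. The failure case the abstract reports corresponds to embeddings where the negation offset is not independent of the verb's sense, so that vec(B) - vec(A) does not transfer to word_C.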

Paper Nr:	60
Title:	Forecasting Stock Market Trends using Deep Learning on Financial and Textual Data
Authors:	Georgios-Markos Chatziloizos, Dimitrios Gunopulos and Konstantinos Konstantinou
Abstract:	Stock market research has increased significantly in recent years. Researchers from both economics and computer science backgrounds are applying novel machine learning techniques to the stock market. In this paper we combine some of the techniques used in both of these fields, namely Technical Analysis and Sentiment Analysis techniques, to show whether or not it is possible to successfully forecast the trend of the stock price and to what extent. Using the four tickers AAPL, GOOG, NVDA and S&P 500 Information Technology, we collected historical financial data and historical textual data, and we used each type of data individually and in unison to show in which case the results were more accurate and more profitable. We describe in detail how we analysed each type of data and how we used it to obtain our results.

Short Papers

Paper Nr:	6
Title:	Data Mining for Animal Health to Improve Human Quality of Life: Insights from a University Veterinary Hospital
Authors:	Oscar Tamburis, Elio Masciari, Christian Esposito and Gerardo Fatone
Abstract:	The increasing importance of Veterinary Informatics is driving the implementation of integrated veterinary information management systems (VIMS) for the capture, storage, analysis and retrieval of animal data. In this paper, a decision tree algorithm was implemented, starting from the database of the University Veterinary Hospital at Federico II University of Naples, with the aim of building a predictive model for the effective recognition of neoplastic diseases and zoonoses in cats and dogs, focusing on the Campania Region. The goal is to understand, from the One (Digital) Health perspective, the connection between humans, animals, and the surrounding environment.

Paper Nr:	12
Title:	Biomedical Dataset Recommendation
Authors:	Xu Wang, Frank van Harmelen and Zhisheng Huang
Abstract:	Dataset search is a special application of information retrieval, which aims to help scientists find the datasets they want. Current dataset search engines are query-driven, which implies that the results are limited by the ability of the user to formulate the appropriate query. In this paper we aim to address this limitation by framing dataset search as a recommendation task: given a dataset by the user, the search engine recommends similar datasets. We solve this dataset recommendation task using a similarity approach. We provide a simple benchmark task to evaluate different approaches for this dataset recommendation task. We also evaluate the recommendation task with several similarity approaches in the biomedical domain. We benchmark 8 different similarity metrics between datasets, including both ontology-based techniques and techniques from machine learning. Our results show that recommending scientific datasets based on metadata as it occurs in realistic dataset collections is a hard task. None of the ontology-based methods performs well on this task, and all are outscored by the majority of the machine-learning methods. Of these ML methods, only one performs reasonably well, and even then it only reaches 70% accuracy.

Paper Nr:	14
Title:	Deep Learning for RF-based Drone Detection and Identification using Welch’s Method
Authors:	Mahmoud Almasri
Abstract:	Radio Frequency (RF) combined with deep learning methods promises a solution for detecting the presence of drones. Indeed, the classical techniques (e.g., radar, vision and acoustics) suffer from several drawbacks, such as difficulty in detecting small drones, false alarms caused by flying birds or balloons, and the influence of wind on performance. For effective drone detection, two main stages should be established: feature extraction and feature classification. The approach proposed in this paper is based on a novel feature extraction method and an optimized deep neural network (DNN). First, we present a novel method based on Welch's method to extract meaningful features from the RF signal of drones. Then, three optimized DNN models are considered to classify the extracted features. The first DNN model can be used to detect the presence of drones and contains two classes. The second DNN helps us to detect and recognize the type of the drone with 4 classes: a class for each drone and the last one for the RF background activities. In the third model, 10 classes are considered: the presence of the drone, its type, and its flight mode (i.e., stationary, hovering, flying with or without video recording). Our proposed approach can achieve an average accuracy higher than 94%, and it significantly improves the accuracy, by up to 30%, compared to existing methods.

Paper Nr:	19
Title:	A Survey of Social Emotion Prediction Methods
Authors:	Abdullah Alsaedi, Phillip Brooker, Floriana Grasso and Stuart Thomason
Abstract:	Emotions are an important factor that affects our communication. Considerable research has been done to detect and classify emotion in text. However, most of it deals with emotion from the writer’s perspective. Social emotion is the emotion of the reader when exposed to the text. With the increased use of social media, many studies have addressed social emotion prediction. In this paper, we attempt to provide a survey of social emotion prediction methods. To the best of our knowledge, this is the first work to survey the literature on social emotion, review the methods and techniques used, compare the methods, and highlight their limitations.

Paper Nr:	23
Title:	A Network based Approach for Reducing Variant Diversity in Production Planning and Control
Authors:	Shailesh Tripathi, Sonja Strasser and Herbert Jodlbauer
Abstract:	This paper presents a network-based procedure for selecting representative materials using routings of materials as features and applies this procedure to a sheet metal processing case study, which is used for parameterizing discrete event simulation models for production planning and control (PPC). The discrete event simulation model (simgen) is a generic and scalable model that is commonly used to deal with optimization problems in production planning and control, such as manufacturing resource planning. The preparatory steps of discrete event simulations for production planning and control are data preprocessing, parameterization, and experimental design. Given the complexity of the manufacturing environment, discrete event simulation models must incorporate appropriate model details for parameterization and a practical approach to experimental design to ensure efficient execution of simulation models in a reasonable time. The parameterization for discrete event simulation is not trivial; it requires optimizing parameter settings for different materials depending on routing, bill of materials complexity, and other production process-related features. For a suitable parameterization that completes the execution of discrete event simulation in an expected time, we must reduce variant diversity to an optimized level that removes redundant materials and reflects the validity of the overall production scenario. We employ a network-based approach, constructing a bipartite graph and using the Jaccard index with an overlap threshold to group similar materials by routing features and to identify representative materials and manufacturing subnetworks, thus reducing the complexity of products and manufacturing routes.
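The grouping step described in this abstract can be sketched roughly as follows: each material is represented by the set of routing steps it is linked to in the bipartite graph, pairs of materials whose Jaccard index meets a threshold are connected, and each connected group yields one representative. The material names, routing steps, the threshold of 0.7, the use of connected components, and the choice of the first member as representative are all assumptions made for this sketch; the abstract does not specify these details.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_materials(routings: dict, threshold: float) -> list:
    """Group materials whose routing sets overlap with Jaccard >= threshold.

    Groups are the connected components of the similarity graph, found
    with a small union-find structure.
    """
    names = sorted(routings)
    parent = {m: m for m in names}

    def find(m):
        # Follow parent links to the root, compressing the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(routings[a], routings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for m in names:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())


# Hypothetical materials and routing steps, for illustration only.
routings = {
    "M1": {"cut", "bend", "weld"},
    "M2": {"cut", "bend", "weld", "paint"},
    "M3": {"cut", "drill"},
    "M4": {"cut", "drill"},
}

groups = group_materials(routings, threshold=0.7)
representatives = [group[0] for group in groups]
print(groups, representatives)
```

Raising the threshold produces more, smaller groups (less reduction in variant diversity), while lowering it merges more materials under one representative, so the threshold controls the trade-off between simulation effort and fidelity to the production scenario.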

Paper Nr:	41
Title:	Applied Feature-oriented Project Life Cycle Classification
Authors:	Oliver Böhme and Tobias Meisen
Abstract:	The increasing complexity in automotive product development is forcing traditional manufacturers to fundamentally rethink their approach. As a result, many companies are already investing in the development of methods to increase the controllability of their development processes. The use of data-driven approaches is a promising way to provide an early prediction of potential problems in the course of a project by learning from the past. In vehicle development, projects can be divided into two basic categories: new vehicle launches and model enhancement projects. The course of projects in these two categories may depend on different influencing factors. To verify this hypothesis and to determine the extent of the differences in the data, we carry out a data-driven classification of the project category. In contrast to the recognition of other time-dependent data (e.g., univariate sensor data courses), we use multivariate project information from the automotive industry. With this paper, which is of an application nature, we show that a multivariate classification of automotive projects can be realized based on the underlying project’s progression.

Paper Nr:	44
Title:	Data Driven Hybrid Approach for Health Monitoring and Fault Detection in Military Ground Vehicles
Authors:	Indu Shukla, Antoinette Silas, Haley Dozier, Brandon E. Hansen and W. Glenn Bond
Abstract:	This paper presents a data-driven hybrid approach for Prognostics and Health Management (PHM) of military ground vehicles to mitigate unexpected failures, enabling intelligent decision-making for improved performance, safety, reliability, and maintainability. For military ground vehicles, the Controller Area Network (CAN) bus provides sensor data for collection and analysis. In this study, we used collected operational time-series data to generate future operational time-series data for military ground vehicles. Because our sensor data share stochastic trends across more than one time-dependent variable, we developed Vector AutoRegression (VAR) models suitable for forecasting operational data. We have developed Long Short-Term Memory (LSTM) fault detection models which ingest VAR-forecasted data to detect faults. Our experimental results show that our hybrid approach provides promising fault diagnosis performance. Root mean squared error, mean absolute percentage error, and mean absolute error have been used as the evaluation criteria.
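The three evaluation criteria named in this abstract have standard definitions, sketched below for a pair of equal-length series. The sample values are invented for illustration and do not come from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Invented example values: observed readings versus forecasts.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(rmse(actual, predicted))
print(mae(actual, predicted))
print(mape(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses the error relative to the magnitude of the observed value.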

Paper Nr:	45
Title:	Impact of Duplicating Small Training Data on GANs
Authors:	Yuki Eizuka, Kazuo Hara and Ikumi Suzuki
Abstract:	Emoticons such as (^_^) are face-shaped symbol sequences that are used to express emotions in text. However, the number of emoticons is minuscule. To increase the number of emoticons, we created emoticons using SeqGANs, which are generative adversarial networks for generating sequences. However, the small number of emoticons means that few emoticons can be used as training data for SeqGANs. This is concerning because SeqGANs underfit small training data, which makes generating emoticons with SeqGANs difficult. To address this problem, we duplicate the training data. We observed that emoticons can be generated when the duplication magnification is of an appropriate value. However, as a trade-off, it was also observed that SeqGANs overfit the training data, i.e., they produce emoticons that are exactly the same as the training data.

Paper Nr:	48
Title:	Toward a Multimodal Multitask Model for Neurodegenerative Diseases Diagnosis and Progression Prediction
Authors:	Sofia Lahrichi, Maryem Rhanoui, Mounia Mikram and Bouchra El Asri
Abstract:	Recent studies on modelling the progression of Alzheimer’s disease use a single modality for their predictions while ignoring the time dimension. However, patient data are heterogeneous and time dependent, which requires models that account for these factors in order to achieve a reliable diagnosis and to make it possible to track and detect changes in the progression of a patient’s condition at an early stage. This article reviews various categories of models used for Alzheimer’s disease prediction, together with their respective learning methods, through a comparative study of early prediction and detection of Alzheimer’s disease progression. Finally, a robust and precise detection model is proposed.

Paper Nr:	3
Title:	Archival and Museum Information as a Component of the Common Digital Space of Scientific Knowledge
Authors:	N. Kalenov, I. Sobolevskaya and A. Sotnikov
Abstract:	The Common Digital Space of Scientific Knowledge (CDSSK), in its modern interpretation, is a fundamentally new information environment that accumulates knowledge from various fields of science and is the basis for solving a wide range of problems, from artificial intelligence to science popularization. One of the prototypes of the CDSSK model is the digital library "Scientific Heritage of Russia" (DL SHR), within which methods and means of integrating heterogeneous digital information (including archival and museum information) related to Russian scientific achievements are being developed. For several years, the Archive of the Russian Academy of Sciences (ARAN) and the V. I. Vernadsky State Geological Museum of the Russian Academy of Sciences (GGM RAS) have participated in the development of the DL SHR and in filling it with digital content. The paper discusses the metadata profiles adopted for displaying archival and museum objects in the CDSSK and provides examples of search and visualization.

Paper Nr:	9
Title:	Motif-based Classification using Enhanced Sub-Sequence-Based Dynamic Time Warping
Authors:	Mohammed Alshehri, Frans Coenen and Keith Dures
Abstract:	In time series analysis, Dynamic Time Warping (DTW) coupled with k Nearest Neighbour classification, where k = 1, is the most commonly used classification model. Even though DTW has quadratic complexity, it outperforms other similarity measurements in terms of accuracy, hence its popularity. This paper presents two motif-based mechanisms directed at speeding up the DTW process in such a way that accuracy is not adversely affected: (i) the Differential Sub-Sequence Motifs (DSSM) mechanism and (ii) the Matrix Profile Sub-Sequence Motifs (MPSSM) mechanism. Both mechanisms are fully described and evaluated. The evaluation indicates that both DSSM and MPSSM can speed up the DTW process while producing better, or at least comparable, accuracy in 90% of cases.
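For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of plain DTW combined with 1-nearest-neighbour classification. It shows only the standard quadratic-time dynamic program, not the DSSM or MPSSM mechanisms proposed in the paper; the training series and labels are invented for illustration.

```python
def dtw_distance(s, t):
    """Classic DTW distance between two numeric sequences.

    Fills the cost matrix row by row, so the running time is
    O(len(s) * len(t)), which is the quadratic complexity mentioned above.
    """
    n, m = len(s), len(t)
    inf = float("inf")
    prev = [0.0] + [inf] * m  # row 0 of the cumulative cost matrix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def classify_1nn(query, training):
    """Return the label of the training series closest to `query` under DTW."""
    _, label = min(training, key=lambda item: dtw_distance(query, item[0]))
    return label


# Invented training data: (series, label) pairs.
training = [
    ([0, 1, 2, 3, 4], "rise"),
    ([2, 2, 2, 2, 2], "flat"),
]
query = [0, 0, 1, 2, 3, 4, 4]  # a stretched version of the "rise" series
print(classify_1nn(query, training))
```

Because DTW allows repeated points to align with a single point, the stretched query matches the "rise" series at zero cost even though the two series differ in length.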

Paper Nr:	28
Title:	Predicting Shopping Intent of e-Commerce Users using LSTM Recurrent Neural Networks
Authors:	Konstantinos Diamantaras, Michail Salampasis, Alkiviadis Katsalis and Konstantinos Christantonis
Abstract:	An e-commerce web site is effective if it turns visitors into buyers, achieving a high conversion rate. To this end, it is useful to predict each user’s purchase intent and understand their navigation behavior. Such predictions may be utilized to improve web design and to personalize the shopper’s experience, hopefully leading to increased conversion rates. Additionally, if such predictions can be made in real time, during the ongoing navigation of an e-commerce user, the e-commerce application can take proactive actions, such as offering incentives, to increase the probability that the user will finally make a purchase. This paper presents a method for predicting in real time the shopping intent of e-commerce users using LSTM recurrent neural networks. We test several variants of our method on a dataset created by processing the Web server logs of an industry e-commerce web application, dividing user sessions into three different classes: browsing, cart abandonment, and purchase. The best classifier achieves a predictive accuracy of almost 98%. This result is competitive with other state-of-the-art methods, which affirms that accurate and scalable purchasing intention prediction for e-commerce, using only session-based data, is feasible without any intense feature engineering.

Paper Nr:	46
Title:	Making Data Big for a Deep-learning Analysis: Aggregation of Public COVID-19 Datasets of Lung Computed Tomography Scans
Authors:	Francesca Lizzi, Francesca Brero, Raffaella Fiamma Cabini, Maria Evelina Fantacci, Stefano Piffer, Ian Postuma, Lisa Rinaldi and Alessandra Retico
Abstract:	Lung Computed Tomography (CT) is an imaging technique useful to assess the severity of COVID-19 infection in symptomatic patients and to monitor its evolution over time. Lung CT can be analysed with the support of deep learning methods for both aforementioned tasks. We have developed a U-net based algorithm to segment the COVID-19 lesions. Unfortunately, public datasets populated with a large number of labelled CT scans of patients affected by COVID-19 are not available. In this work, we first review all the currently available public datasets of COVID-19 CT scans, presenting an extensive description of their characteristics. Then, we describe the design of the U-net we developed for the automated identification of COVID-19 lung lesions. Finally, we discuss the results obtained by using the different publicly available datasets. In particular, we trained the U-net on the dataset made available within the COVID-19 Lung CT Lesion Segmentation Challenge 2020, and we tested it on data from the MosMed and the COVID-19-CT-Seg datasets to explore the transferability of the model and to assess whether the image annotation process affects the detection performance. We evaluated the performance of the system in lesion segmentation in terms of the Dice index, which measures the overlap between the ground truth and the predicted masks. The proposed U-net segmentation model reaches a Dice index equal to 0.67, 0.42 and 0.58 on the independent validation sets of the COVID-19 Lung CT Lesion Segmentation Challenge 2020, the MosMed and the COVID-19-CT-Seg datasets, respectively. This work, which focuses on lesion segmentation, is a preliminary step towards a more accurate analysis of COVID-19 lesions, based for example on the extraction and analysis of radiomic features.
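The Dice index used as the evaluation measure in this abstract can be computed as in the minimal sketch below, which treats masks as flat sequences of 0/1 values. The example masks are invented, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_index(truth, prediction):
    """Dice index 2|A & B| / (|A| + |B|) of two flat binary masks.

    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(truth) != len(prediction):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for t, p in zip(truth, prediction) if t and p)
    total = sum(1 for t in truth if t) + sum(1 for p in prediction if p)
    return 2.0 * intersection / total if total else 1.0


# Invented example: a ground-truth mask and a prediction shifted by one voxel.
truth = [0, 1, 1, 1, 0, 0]
prediction = [0, 0, 1, 1, 1, 0]
print(dice_index(truth, prediction))
```

A value of 1 means the predicted mask coincides exactly with the ground truth, and 0 means the two masks do not overlap at all.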

Area 4 - Data Management and Quality

Full Papers

Paper Nr:	30
Title:	An Efficient Representation of Enriched Temporal Trajectories
Authors:	Nieves R. Brisaboa, Antonio Fariña, Diego Otero-González and Tirso V. Rodeiro
Abstract:	We present a novel representation of enriched trajectories of a mobile workforce management system. In this system, employees are tracked during their working day and both their routes and the tasks performed at each time instant are recorded. Our proposal tackles the representation of this information paying special attention to the space footprint without neglecting query time. We performed experiments using real and synthetic datasets to evaluate both compression effectiveness and query-time efficiency. The experiments show that our proposal yields promising results in terms of the space needed to represent both users’ locations and activities, while answering access queries on the original data within microseconds.

Short Papers

Paper Nr:	7
Title:	DERM: A Reference Model for Data Engineering
Authors:	Daniel Tebernum, Marcel Altendeitering and Falk Howar
Abstract:	Data forms an essential organizational asset and is a potential source of competitive advantage. To exploit this advantage, the engineering of data-intensive applications is becoming increasingly important. Yet, the professional development of such applications is still in its infancy, and a practical engineering approach is necessary to reach the next maturity level. Therefore, resources and frameworks that bridge the gaps between theory and practice are required. In this study, we developed a data engineering reference model (DERM), which outlines the important building blocks for handling data along the data lifecycle. For the creation of the model, we conducted a systematic literature review on data lifecycles to find commonalities between these models and derive an abstract meta-model. We successfully validated our model by matching it with established data engineering topics. Using the model, we derived six research gaps that need further attention for establishing a practically grounded engineering process. Our model will furthermore contribute to a more profound development process within organizations and create a common ground for communication.

Paper Nr:	18
Title:	DQ-MeeRKat: Automating Data Quality Monitoring with a Reference-Data-Profile-Annotated Knowledge Graph
Authors:	Lisa Ehrlinger, Alexander Gindlhumer, Lisa-Marie Huber and Wolfram Wöß
Abstract:	High data quality (e.g., completeness, accuracy, non-redundancy) is essential to ensure the trustworthiness of AI applications. In such applications, huge amounts of data are integrated from different heterogeneous sources, and complete, global domain knowledge is often not available. This scenario has a number of negative effects; in particular, it is difficult to monitor data quality centrally, and manual data curation is not feasible. To overcome these problems, we developed DQ-MeeRKat, a data quality tool that implements a new method to automate data quality monitoring. DQ-MeeRKat uses a knowledge graph to represent a global, homogenized view of local data sources. This knowledge graph is annotated with reference data profiles, which serve as a quasi-gold standard to automatically verify the quality of modified data. We evaluated DQ-MeeRKat on six real-world data streams with qualitative feedback from the data owners. In contrast to existing data quality tools, DQ-MeeRKat does not require domain experts to define rules, but can be fully automated.

Paper Nr:	42
Title:	Semantic Enrichment of Vital Sign Streams through Ontology-based Context Modeling using Linked Data Approach
Authors:	Sachiko Lim, Rahim Rahmani and Paul Johannesson
Abstract:	The Internet of Things (IoT) creates an ecosystem that connects people and objects through the internet. IoT-enabled healthcare has revolutionized healthcare delivery by moving toward a more pervasive, patient-centered, and preventive care model. In the ongoing COVID-19 pandemic, it has also shown great potential for effective remote patient health monitoring and management, which helps prevent strain on the healthcare system. Nevertheless, due to the heterogeneity of data sources and technologies, IoT-enabled healthcare systems often operate in vertical silos, hampering interoperability across different systems. Consequently, such sensory data are rarely shared or integrated, which can undermine the full potential of IoT-enabled healthcare. Applying semantic technologies to IoT is a promising approach for fulfilling heterogeneity, contextualization, and situation-awareness requirements for real-time healthcare solutions. However, the enrichment of sensor streams has been under-explored in the existing literature. There is also a need for an ontology that enables effective patient health monitoring and management during infectious disease outbreaks. This study, therefore, aims to extend the existing ontology to allow patient health monitoring for the prevention, early detection, and mitigation of patient deterioration. We evaluated the extended ontology using competency questions and illustrated a proof-of-concept of ontology-based semantic representation of vital sign streams.

Paper Nr:	21
Title:	WFDU-net: A Workflow Notation for Sovereign Data Exchange
Authors:	Heinrich Pettenpohl, Daniel Tebernum and Boris Otto
Abstract:	Data is the main driver of the digital economy. Accordingly, companies are interested in maintaining technical control over the usage of their data at any given time. The International Data Spaces initiative addresses exactly this aspect of data sovereignty with usage control enforcement. In this paper, we introduce the so-called Workflow with Data and Usage control network (WFDU-net) model. The data consumer can visually define his or her workflow using the WFDU-net model and annotate the data operations and context. With model checking, we validate that the WFDU-net follows the usage policies defined by the data owner. Afterwards, the compliant WFDU-net can be executed by exporting it in the Petri Net Markup Language (PNML). We evaluated our approach by using our example WFDU-net in a data analytics use case.

Paper Nr:	52
Title:	Knowledge Graph Analysis of Russian Trolls
Authors:	Chih-yuan Li, Soon Ae Chun and James Geller
Abstract:	Social media, such as Twitter, were exploited by trolls to manipulate political discourse and spread disinformation during the 2016 US Presidential Election. Trolls are users of social media accounts created to influence public opinion by posting or reposting messages containing misleading or inflammatory information with malicious intent. Previous research has focused on troll detection using Machine Learning approaches, and on troll understanding using visualizations, such as word clouds. In this paper, we focus on the content analysis of troll tweets to identify the major entities mentioned and the relationships among these entities, in order to understand the events and statements mentioned in Russian Troll tweets coming from the Internet Research Agency (IRA), a troll factory allegedly financed by the Russian government. We applied several NLP techniques to develop Knowledge Graphs (KGs) to understand the relationships of entities, which are often mentioned by dispersed trolls and thus hard to uncover. The integrated KG helped to understand the substance of Russian Trolls’ influence in the election. We identified three clusters of troll tweet content: one consisted of information supporting Donald Trump, the second of content exposing and attacking Hillary Clinton and her family, and the third of other inflammatory content. We present the observed sentiment polarization using sentiment analysis for each cluster and derive the concern index for each cluster, which shows a measurable difference between the presidential candidates that seems to have been reflected in the election results.

Area 5 - Databases and Data Security

Full Papers

Paper Nr:	27
Title:	Database Recovery from Malicious Transactions: A Use of Provenance Information
Authors:	Theppatorn Rhujittawiwat, John Ravan, Ahmed Saaudi, Shankar Banik and Csilla Farkas
Abstract:	In this paper, we propose a solution to recover a database from the effects of malicious transactions. The traditional approach for recovery is to execute all non-malicious transactions from a consistent rollback point. However, this approach is inefficient. First, the database will be unavailable until the restoration is finished. Second, all non-malicious transactions that committed after the rollback state need to be re-executed. The intuition for our approach is to re-execute partial transactions, i.e., only the operations that were affected by the malicious transactions. We develop algorithms to reduce the downtime of the database during the recovery process. We show that our solution is (1) complete, i.e., all the effects of the malicious transactions are removed, (2) sound, i.e., all the effects of non-malicious transactions are preserved, and (3) minimal, i.e., only affected data items are modified. We also show that our algorithms preserve conflict serializability of the transaction execution history.

Short Papers

Paper Nr:	36
Title:	Invers Natural Number System to Maintain User-defined Sequence of Data Records
Authors:	Seyfettin Öztürk
Abstract:	The objective of this paper is to present a method to insert, edit, and delete database records without affecting the sequence of existing data. Typically, databases include an integer data field, called the sequence number in this paper, that determines the user-defined sequence of data records. Inserting new data records or editing the sequence number of data records might cause a resequencing of the existing data records. This resequencing can be avoided by using a numbering system that decreases the value of a number when a digit is added to its end. Such a numbering system allows an unlimited number of additional sequence numbers to be inserted between two sequence numbers, even if their difference is 1.

Paper Nr:	59
Title:	Tailoring Taint Analysis for Database Applications in the K Framework
Authors:	Md. Imran Alam and Raju Halder
Abstract:	Maintaining the integrity of the underlying databases of any information system is a challenge. Integrity may be compromised either by coding flaws or by improper flow of information from source to sink in the associated database applications. Such a compromise may lead either to disclosure of sensitive information to attackers or to illegitimate modification of private data stored in the databases. Taint analysis is a widely used program analysis technique that aims at averting malicious inputs from corrupting data values in critical computations of programs. In this paper, we propose K-DBTaint, a rewriting logic-based executable semantics for taint analysis of database applications in the K framework. We specify the semantics for a subset of SQL statements along with host imperative program statements. Our K semantics can be seen as a sound approximation of program semantics in the corresponding security type domain. Compared with existing methods, K-DBTaint supports context- and flow-sensitive analysis, reduces false alarms, and provides a scalable solution. Experimental evaluation on several PL/SQL benchmark codes demonstrates encouraging results in the form of improved analysis precision.

Paper Nr:	57
Title:	Evo-Path: Querying Data Evolution through Complex Changes
Authors:	Theodora Galani, Yannis Stavrakas, George Papastefanatos and Yannis Vassiliou
Abstract:	Evo-graph is a model for data evolution that captures data versions and treats changes as first-class citizens. A change in evo-graph can be compound, comprising disparate changes, and is associated with the data items it affects. In previous work, we specified how an evo-graph can be reduced to a snapshot holding at a specific time instance, presented an XML representation of evo-graph called evoXML, defined how evo-graph is constructed as the current snapshot evolves, and presented and evaluated the C2D framework, which implements these concepts using XML technologies. In this paper, we formally define evo-path, an XPath extension for querying the data history and change structure in a uniform way over evo-graph. We specify the evo-path syntax, semantics and implementation, and present several query categories.