DATA 2022 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 19
Title:

Preprocessing of Terrain Data in the Cloud using a Workflow Management System

Authors:

Marvin Kaster and Hendrik M. Würz

Abstract: The preparation of terrain data for web visualization is time-consuming. We model the problem as a scientific workflow that can be executed by a workflow management system (WMS) in a cloud-based environment. Such a workflow management system provides easy access to the almost unlimited resources of cloud infrastructure while still allowing a lot of freedom in the implementation of tasks. We take advantage of this and optimize the computation of individual tiles in the created level of detail (LOD) structure, as well as the scheduling of tasks in the scientific workflow. This enables us to utilize the allocated resources very efficiently and to reduce computation time. In the evaluation, we analyze the impact of different storage endpoints, the number of threads, and the number of tasks on the run time. We show that our approach scales well and considerably outperforms our previous work based on the framework GeoTrellis (Krämer et al., 2020).

Short Papers
Paper Nr: 71
Title:

Automated Neoclassical Vertical Canon Validation in Human Faces with Machine Learning

Authors:

Ashwinee Mehta, Maged Abdelaal, Moamen Sheba and Nic Herndon

Abstract: The proportions defined by the neoclassical canons for face evaluation were developed by artists and anatomists in the 17th and 18th centuries. These proportions are used as a reference for planning facial or dental reconstruction treatments. However, the assumption that the face is divided vertically into three equal thirds, adopted a long time ago, has not been verified yet. We used photos freely available online, annotated them with anthropometric landmarks using machine learning, and verified this hypothesis. Our results indicate that the vertical dimensions of the face are not always divided equally into thirds. Thus, this vertical canon should be used with caution in cosmetic, plastic, or dental surgeries and reconstruction procedures.

Paper Nr: 72
Title:

Towards Programmable Memory Controller for Tensor Decomposition

Authors:

Sasindu Wijeratne, Ta-Yang Wang, Rajgopal Kannan and Viktor Prasanna

Abstract: Tensor decomposition has become an essential tool in many data science applications. Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the pivotal kernel in tensor decomposition algorithms that decompose higher-order real-world large tensors into multiple matrices. Accelerating MTTKRP can speed up the tensor decomposition process immensely. Sparse MTTKRP is a challenging kernel to accelerate due to its irregular memory access characteristics. Implementing accelerators on Field Programmable Gate Array (FPGA) for kernels such as MTTKRP is attractive due to the energy efficiency and the inherent parallelism of FPGA. This paper explores the opportunities, key challenges, and an approach for designing a custom memory controller on FPGA for MTTKRP while exploring the parameter space of such a custom memory controller.
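For readers unfamiliar with the kernel, the sketch below shows a minimal dense version of mode-1 MTTKRP in numpy. It is a hypothetical illustration of the operation only; the paper targets the sparse variant mapped to a custom FPGA memory controller, which this does not represent, and the tensor sizes are invented.

```python
import numpy as np

# Mode-1 MTTKRP for a 3-way tensor X with factor matrices B and C:
# M[i, r] = sum_{j, k} X[i, j, k] * B[j, r] * C[k, r]
def mttkrp_mode1(X, B, C):
    return np.einsum("ijk,jr,kr->ir", X, B, C)

rng = np.random.default_rng(0)
I, J, K, R = 8, 6, 5, 3            # illustrative tensor sizes and rank
X = rng.random((I, J, K))          # dense stand-in for a sparse tensor
B, C = rng.random((J, R)), rng.random((K, R))
M = mttkrp_mode1(X, B, C)          # shape (I, R)
print(M.shape)
```

The irregular memory accesses the paper addresses arise when X is sparse: the B and C rows touched per nonzero are scattered, which is what a custom memory controller can exploit.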

Paper Nr: 88
Title:

ADiBA Big Data Adoption Framework: Accelerating Big Data Revolution 5.0

Authors:

Norhayati Daut, Naomie Salim, Chan W. Howe, Anazida Zainal, Sharin H. Huspi, Masitah Ghazali and Fatimah S. Ahmad

Abstract: Researchers have formulated the Big Data revolution into several stages, from stage 1, which uses raw data, to stage 5, in which operational intelligence and advanced analytics provide wisdom. However, for organisations to reap the value of big data adoption and implementation, they must embrace Big Data Revolution 5.0: digital acceleration. At this stage, Big Data Analytics (BDA) becomes an asset from which businesses can gain new insights and aid value creation, resulting in increased profits. BDA will play a large part in extending an organisation’s presence, enticing possible investors and hastening global growth. In this paper, we propose a framework that aids organisations toward big data adoption and implementation in a way that creates the best value for them. It covers the whole value chain of big data adoption and implementation: the enculturation of big data in the organisation, business understanding, data management and governance, big data project planning, data understanding, data preparation, procurement, analytics modeling, data product development, evaluation of the model, data product deployment, maintenance, and upgrades, through to the inculturation of data analytics into the business. The framework has been successfully used in several Malaysian organisations across the government, semi-government, and private sectors.

Area 2 - Business Analytics

Full Papers
Paper Nr: 17
Title:

Different Metrics Results in Text Summarization Approaches

Authors:

Marcello Barbella, Michele Risi, Genoveffa Tortora and Alessia Auriemma Citarella

Abstract: Automatic Text Summarization is the result of more than 50 years of research. Several methods for creating a summary from a single document or a group of related documents have been proposed over time, many of which have shown very good results. Artificial intelligence has enabled progress in generating summaries that contain words other than those in the original text. The open issue, instead, is how a summary may be judged ideal with respect to a reference summary, a research question that is still open to new answers. How can the outcomes of the numerous new algorithms that appear year after year be assessed? This research aims to see whether the ROUGE metric, widely used in the literature to evaluate the results of Text Summarization algorithms, helps deal with these new issues, mainly when the original reference dataset is limited to a small field of interest. Furthermore, an in-depth experiment is conducted by comparing the results of the ROUGE metric with other metrics. In conclusion, determining an appropriate metric to evaluate the summaries produced by a machine is still a long way off.
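As background for the discussion, the sketch below computes ROUGE-1 from scratch via clipped unigram overlap. It is an illustrative simplification (whitespace tokenisation, no stemming) and not the evaluation setup used in the paper.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    """Unigram-overlap ROUGE-1 precision, recall and F1 (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```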

Paper Nr: 68
Title:

A Formal Model to Support Discourse Semantic Landscape Analysis

Authors:

Isabelle Linden, Bruno Dumas and Anne Wallemacq

Abstract: Discourse Analysis is a widely used methodology in the human and social sciences. The Evoq Software (Clarinval et al., 2018) has been developed to offer advanced support for deep semantic analysis of discourse by providing intermediary transposition tools that allow the semantic landscape underlying a discourse to be explored from multiple angles. This paper presents the formal knowledge model that supports the development of these functionalities and ensures strong coherence between the multiple visualisations, seen as complementary intermediary transpositions of the same object.

Paper Nr: 69
Title:

A Model based on 2-tuple Linguistic Model and CRITIC Method for Hotel Classification

Authors:

Ziwei Shu, Ramón C. González, Javier P. García-Miguel and Manuel Sánchez-Montañés

Abstract: Hotel classification is critical for both customers and hotel managers. It can help hotel managers better understand their customers’ needs and improve various aspects of their hotels by implementing relevant strategies. Moreover, it can assist customers in recognizing different hotel aspects and making a more informed decision. This paper categorizes hotels on TripAdvisor based on six of their aspects. The 2-tuple linguistic model is applied to solve the problem of information loss in linguistic information fusion. The CRiteria Importance Through Intercriteria Correlation (CRITIC) approach is employed to generate objective weights for calculating the overall score of each hotel, as this method does not require any human participation in the weighting computation. Finally, various hotel segments are obtained with Weighted K-means clustering. This proposal has been evaluated in a use case with more than fifty million TripAdvisor hotel reviews. The results demonstrate that the proposed model can increase the linguistic interpretability of clustering results and provide customers with a more understandable objective overall hotel score, which can assist them in selecting a better hotel. Moreover, these classification results aid hotel managers in designing more effective tactics for acquiring a new competitive advantage or enhancing those aspects that require improvement.
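For illustration, a minimal numpy sketch of the CRITIC weighting step is given below; the score matrix is a made-up stand-in for the aggregated hotel aspect ratings used in the paper.

```python
import numpy as np

def critic_weights(X):
    """CRITIC objective weights for an (alternatives x criteria) score matrix."""
    # min-max normalise each criterion column
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    sigma = Z.std(axis=0, ddof=1)               # contrast intensity per criterion
    rho = np.corrcoef(Z, rowvar=False)          # pairwise criterion correlations
    info = sigma * (1.0 - rho).sum(axis=0)      # conflict-weighted information
    return info / info.sum()

scores = np.array([[4.1, 3.8, 4.5],            # illustrative hotels x aspects
                   [3.9, 4.2, 4.0],
                   [4.4, 3.5, 4.2],
                   [3.7, 4.0, 4.4]])
w = critic_weights(scores)                      # objective weight per aspect
print(w.round(3), (scores * w).sum(axis=1).round(3))   # weighted overall scores
```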

Paper Nr: 81
Title:

Tackling Model Drifts in Industrial Model-driven Software Product Lines by Means of a Graph Database

Authors:

Christof Tinnes, Uwe Hohenstein, Wolfgang Rössler and Andreas Biesdorf

Abstract: This paper reports on our experience of using a graph database to efficiently compare very large models in an industrial model-driven engineering project. The need for a comparison results from the fact that architectural models are reused. They conform to a common domain-specific language but diverge as they belong to different products managed in separate branches of a repository, in the sense of a clone-and-own approach. In the presented industry project, huge models are developed and reside in the commercial tool MAGICDRAW. In fact, unlike many other tools, MAGICDRAW turned out to be capable of handling those huge models in industrial environments. In this context, there is a strong necessity to detect and judge relevant differences between models in different branches in order to avoid a model drift and losing reuse opportunities across the products. Indeed, MAGICDRAW has a built-in difference tool, which however exposes an excessive number of differences, only a fraction of which are really relevant for certain tasks. We show that the capabilities of the graph database NEO4J can be leveraged to reduce the differences to the relevant ones. Both the expressiveness and the performance of NEO4J turned out to be sufficient.
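As a hint of how such a branch comparison might look, the sketch below runs a hypothetical Cypher query through the official Neo4j Python driver. The node label, properties, branch names, and connection details are assumptions for illustration, not the schema used in the project.

```python
from neo4j import GraphDatabase

# Hypothetical schema: one (:Element {uid, name, branch}) node per model element,
# imported from each repository branch. The query lists elements present in the
# left branch that have no counterpart (same uid) in the right branch.
QUERY = """
MATCH (a:Element {branch: $left})
WHERE NOT EXISTS {
    MATCH (b:Element {branch: $right, uid: a.uid})
}
RETURN a.uid AS uid, a.name AS name
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    for record in session.run(QUERY, left="productA", right="productB"):
        print(record["uid"], record["name"])
driver.close()
```

Filtering to task-relevant differences would then be a matter of adding WHERE clauses over element types and properties, which is where a graph query language pays off compared to a generic diff tool.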

Paper Nr: 90
Title:

Enhancing Biomedical Scientific Reviews Summarization with Graph-based Factual Evidence Extracted from Papers

Authors:

Giacomo Frisoni, Paolo Italiani, Francesco Boschi and Gianluca Moro

Abstract: Combining structured knowledge and neural language models to tackle natural language processing tasks is a recent research trend that catalyzes community attention. This integration holds a lot of potential in document summarization, especially in the biomedical domain, where the jargon and the complex facts make the overarching information truly hard to interpret. In this context, graph construction via semantic parsing plays a crucial role in unambiguously capturing the most relevant parts of a document. However, current works are limited to extracting open-domain triples, failing to model real-world n-ary and nested biomedical interactions accurately. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization enhanced by event graph extraction (i.e., graphical representations of medical evidence learned from scientific text), relying on dual text-graph encoders. Extensive evaluations on the CDSR dataset corroborate the importance of explicit event structures, with better or comparable performance than previous state-of-the-art systems. Finally, we offer some hints to guide future research in the field.

Short Papers
Paper Nr: 9
Title:

From Digital Footprints to Facts: Mining Marketing Policies of the Greek Community on Instagram and Youtube

Authors:

Dimitris Linarakis, Stefanos Vlachos and Paraskevi Raftopoulou

Abstract: Social media have permeated the widest spectrum of our lives. Companies, following the phenomena of the new digital era, are giving up traditional practices and using new policies of diffusion for advertising products and engaging potential customers. In the context of ENIRISST+ (https://enirisst-plus.gr), we started investigating whether transportation businesses (i.e., ferry companies, airlines, etc.) use new-era practices for promoting their products and services and then extended our research to all kinds of businesses. The work presented in this paper studies this shift for the Greek Instagram and YouTube community, records and analyses the activity of prominent Greek companies on social media, and measures the social and commercial impact of the emerging COVID-19 pandemic during 2020 on Greek users’ digital behaviour. Subsequently, we use the acquired data and analysis (i) to draw conclusions about the digital behaviour and preferences of the Greek social media scene and (ii) to compare the results in marketing policies and behavioural patterns on two inherently different social media platforms. This is the first study in the literature to analyse the behaviour of the Greek community on different social media.

Paper Nr: 75
Title:

Line-up Optimization Model of Basketball Players and the Simulation Evaluation

Authors:

Wang Yichen and Haruka Yamashita

Abstract: This study aims to maximize the offensive capability of a basketball team by optimizing the line-up of players at an arbitrary time. We construct a highly accurate model that predicts scoring when the members are changed, taking the game situation into account, and then propose a model to determine the optimal line-up. A Recursive Neural Network model analyzes the time series data and is combined with a Neural Network model that incorporates player combinations and game conditions. The model enables an analysis of past scores and game conditions, the construction of a predictive model of scores that takes the line-up into account, and the determination of the optimal line-up by predicting offensive capability under changed line-ups. Furthermore, to demonstrate the validity of the proposed model, this study evaluates the accuracy of the score prediction using data accumulated from actual basketball games. Moreover, because it is difficult to use this method in actual games, we applied the proposed model to the play data of a basketball simulation game. We conducted a simulation experiment where members were successively optimized and showed that the score was better than in the experiment without optimization.

Paper Nr: 80
Title:

Clustered Majority Judgement

Authors:

Emanuele d’Ajello, Davide Formica, Elio Masciari, Gaia Mattia, Arianna Anniciello, Cristina Moscariello, Stefano Quintarelli and Davide Zaccarella

Abstract: The literature contains a large body of work on different judgement methodologies and their intrinsic limitations, aimed at overcoming classical methods of judgement. One of the most relevant modern models for dealing with voting system dynamics is Majority Judgement. It was created with the aim of reducing the polarization of the electorate in modern democracies and of not alienating minorities, thanks to its use of a highest-median rule, producing more informative results than the existing alternatives. Nonetheless, as shown in the literature, in the case of multiwinner elections it can lead to scenarios in which minorities, albeit numerous, are not adequately represented. For this reason, our aim is to implement a clustered version of this algorithm in order to mitigate these disadvantages: it creates clusters taking into account the similarity between the expressed judgements and then, for each of these groups, applies the Majority Judgement rule to return a ranking over the set of candidates. These traits make the algorithm suitable for applications in different areas of interest in which a decisional process is involved.
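For illustration, a minimal sketch of the highest-median (Majority Judgement) rule applied per candidate is shown below. The grade scale and ballots are assumptions, and the full MJ tie-breaking procedure (iteratively removing median grades) is omitted; the clustered variant in the paper would apply this rule within each cluster of similar ballots.

```python
GRADES = ["reject", "poor", "fair", "good", "excellent"]   # illustrative ordinal scale

def majority_grade(ballots):
    """Majority (lower median) grade of the ballots cast for one candidate."""
    ranks = sorted(GRADES.index(g) for g in ballots)
    return GRADES[ranks[(len(ranks) - 1) // 2]]

def mj_ranking(votes):
    """Rank candidates by their majority grade (highest median first)."""
    return sorted(votes, key=lambda c: GRADES.index(majority_grade(votes[c])),
                  reverse=True)

votes = {"A": ["good", "fair", "excellent", "poor", "good"],
         "B": ["excellent", "reject", "fair", "fair", "good"]}
print({c: majority_grade(v) for c, v in votes.items()})   # A: 'good', B: 'fair'
print(mj_ranking(votes))                                   # ['A', 'B']
```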

Paper Nr: 5
Title:

Position Paper: Quality Assurance in Deep Learning Systems

Authors:

Domingos F. Oliveira and Miguel A. Brito

Abstract: The use of deep learning (DL) as a driving force for new and next-generation technological innovation plays a vital role in the success of organisations. Its penetration into almost all domains requires improving the quality of such systems using quality assurance models. Quality assurance has been widely explored in data mining (DM) and software development (SD) projects, hence the reliance on methodologies like KDD, SEMMA and CRISP-DM. The reuse of such standards and methods to guarantee the quality of DL systems therefore presents itself as an opportunity. This position paper has the fundamental objective of outlining a structure that facilitates the application of quality assurance to DL systems. Creating a framework that enables quality assurance of DL systems involves adjusting the development process of traditional methods, since the challenge lies in the different programming paradigms and the logical representation of DL software.

Paper Nr: 6
Title:

Supporting Trainset Annotation for Text Classification of Incoming Enterprise Documents

Authors:

Juris Rats and Inguna Pede

Abstract: The volumes of documents that organisations receive on a daily basis increase constantly, forcing organisations to hire more people to index and route them properly. A machine learning based model aimed at automating the indexing of incoming documents is proposed in this article. The overall automation process is described, and two methods for supporting trainset annotation are analysed and compared. Experts are supported during the annotation process by grouping the stream of documents into clusters of similar documents. It is expected that this may improve both the process of topic selection and that of document annotation. Grouping of the document stream is performed firstly via clustering of documents and selecting the next document from the same cluster, and secondly by searching for the next document via an Elasticsearch More Like This (MLT) query. Results of the experiments show that the MLT query outperforms the clustering.
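As an illustration of the second method, a More Like This query through the official Elasticsearch Python client could look like the sketch below; the index name, field, document id, and frequency thresholds are assumptions, not the authors' setup.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical index "incoming-docs" with a full-text "body" field: fetch the
# documents most similar to an already annotated document, so the expert can
# label near-duplicates in one pass.
response = es.search(
    index="incoming-docs",
    query={
        "more_like_this": {
            "fields": ["body"],
            "like": [{"_index": "incoming-docs", "_id": "doc-42"}],
            "min_term_freq": 1,
            "min_doc_freq": 2,
        }
    },
    size=5,
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```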

Paper Nr: 18
Title:

The Role of Fake Review Detection in Managing Online Corporate Reputation

Authors:

R. E. Loke and Z. Kisoen

Abstract: In a recent official statement, Google highlighted the negative effects of fake reviews on review websites and specifically requested companies not to buy, and users not to accept payments to provide, fake reviews (Google, 2019). Governmental authorities have also started acting against organisations shown to have a high number of fake reviews on their apps (DigitalTrends, 2018; Gov UK, 2020; ACM, 2017). However, while the phenomenon of fake reviews is well known in industries such as online journalism and business and travel portals, it remains a difficult challenge in software engineering (Martens & Maalej, 2019). Fake reviews threaten the reputation of an organisation and devalue reviews as a source for determining public opinion about brands. Negative fake reviews can lead to confusion for customers and a loss of sales. Positive fake reviews might likewise lead to wrong insights about real users’ needs and requirements. Although fake reviews have been studied for a while now, there are only a limited number of spam detection models available for companies to protect their corporate reputation. Especially in times like the coronavirus pandemic, organisations need to put extra focus on their online presence and limit the amount of negative input that affects their competitive position, which can even lead to business loss. Given state-of-the-art features that can be engineered from review texts, a spam detector based on supervised machine learning is derived in an experiment and performs quite well on the well-known Amazon Mechanical Turk dataset.

Paper Nr: 48
Title:

Towards Indoor Radon Analytics: An OLAP-based Multidimensional Approach

Authors:

Rolando Azevedo, Joaquim P. Silva, Nuno Lopes, António Curado, Leonel R. Nunes and Sérgio I. Lopes

Abstract: Indoor radon represents a known hazard to public health, notably through its relationship with lung cancer. The adoption of data analytics tools for assessing the risk of human exposure to indoor radon is crucial for building management decision-making and is a fundamental requirement for the implementation of remediation measures. This work presents the implementation of a data warehouse and an OLAP cube as components of a more comprehensive IoT-based system, which has been developed for continuous indoor radon gas management in public buildings. The proposed data warehouse consists of a three-tier data storage structure that stores historical measurements. Although the adopted approach has been tested with a small number of IoT sensors, the operation of the data warehouse and OLAP server shows that the system is viable and highly scalable. Increasing the number of active IoT sensors deployed in new buildings, cities, and districts will increase the richness of the data, which will help to foster even better models.

Paper Nr: 53
Title:

Analyzing Cross-impact Matrices for Managerial Decision-making Problems with the DEMATEL Approach

Authors:

Shailesh Tripathi, Nadine Bachmann, Manuel Brunner and Herbert Jodlbauer

Abstract: Cross-impact matrices define pairwise direct impacts between variables representing the complexity of various social, economic and technological systems. Business and management-related research primarily utilizes the row and column sums of direct impact matrices to identify critical, influential, dependent, neuter, and inert variables. However, the impact of drivers and outcomes in complex systems is usually difficult to interpret accurately without considering the indirect impact of variables. This paper considers all impacts of direct and indirect impact paths (known as the total impacts) between variables using the decision-making trial and evaluation laboratory (DEMATEL) approach for direct impact matrices in which the rank order remains stable (i.e., a stable equilibrium state exists). Numerical experiments show that the rank order of variables and their role (influence or dependence) can change significantly when considering total impacts between variables compared with when considering direct impacts only. This analysis can be used to support management in strategic planning and decision-making, e.g., in an international business environment: Management should attempt to obtain the total impacts matrix defining all direct and indirect impacts that determine the rank order on which informed decisions are subsequently based. The results presented in this paper indicate that impact paths between variables should be incorporated into the system with an in-depth domain understanding. This enables the realistic capture of impacts and the establishment of a stable state for obtaining an unbiased understanding of the roles of variables.
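For concreteness, a minimal numpy sketch of the standard DEMATEL total-impact computation is given below; the direct impact matrix is illustrative, not one of the paper's cases.

```python
import numpy as np

def dematel_total(D):
    """Total (direct + indirect) impact matrix T = N (I - N)^-1 from direct matrix D."""
    s = max(D.sum(axis=1).max(), D.sum(axis=0).max())   # normalisation constant
    N = D / s
    return N @ np.linalg.inv(np.eye(len(D)) - N)        # sums all impact paths

D = np.array([[0, 3, 2, 1],    # illustrative 4-variable cross-impact matrix
              [1, 0, 2, 2],
              [2, 1, 0, 3],
              [1, 2, 1, 0]], dtype=float)
T = dematel_total(D)
prominence = T.sum(axis=1) + T.sum(axis=0)   # how strongly a variable is involved
relation = T.sum(axis=1) - T.sum(axis=0)     # >0: net driver, <0: net outcome
print(prominence.round(2), relation.round(2))
```

Comparing the rank order of `prominence` and `relation` against the plain row and column sums of D reproduces the paper's core observation: the roles of variables can shift once indirect paths are accounted for.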

Paper Nr: 62
Title:

Exploring Corporate Reputation based on Sentiment Polarities That Are Related to Opinions in Dutch Online Reviews

Authors:

R. E. Loke and J. Vergeer

Abstract: This research demonstrates the power and robustness of the vocabulary method by Hernández-Rubio et al. (2019) for aspect extraction from online review data. We show that this algorithm not only works on the English language using the CoreNLP toolkit, but can also be extended to the Dutch language, specifically with the aid of the Frog toolkit. Results on sampled datasets for three different retailers show that it can be used to extract fine-grained aspects that are relevant for acquiring corporate reputation insights.

Area 3 - Data Science

Full Papers
Paper Nr: 22
Title:

Detecting Bots in Social-networks using Node and Structural Embeddings

Authors:

Ashkan Dehghan, Kinga Siuta, Agata Skorupka, Akshat Dubey, Andrei Betlen, David Miller, Wei Xu, Bogumił Kamiński and Paweł Prałat

Abstract: Users on social networks such as Twitter interact with and are influenced by each other without much knowledge of the identity behind each user. This anonymity has created a perfect environment for bot and hostile accounts to influence the network by mimicking real-user behaviour. To combat this, research into designing algorithms and datasets for identifying bot users has gained significant attention. In this work, we highlight various techniques for classifying bots, focusing on the use of node and structural embedding algorithms. We show that embeddings can be used as unsupervised techniques for building features with predictive power for identifying bots. By comparing features extracted from embeddings to other techniques such as NLP, user profile and node features, we demonstrate that embeddings can be used as a unique source of predictive information. Finally, we study the stability of features extracted using embeddings for tasks such as bot classification by artificially introducing noise into the network. The degradation of classification accuracy is comparable to that of models trained on carefully designed node features, hinting at the stability of embeddings.

Paper Nr: 23
Title:

Highly Automated Corner Cases Extraction: Using Gradient Boost Quantile Regression for AI Quality Assurance

Authors:

Niels Heller and Namrata Gurung

Abstract: This work introduces a method for the quality assurance of Artificial Intelligence (AI) systems which identifies and characterizes “corner cases”. Here, corner cases are intuitively defined as “inputs yielding an unexpectedly bad AI performance”. While relying on automated methods for corner case selection, the method also relies on human work. Specifically, the method structures the work of data scientists in an iterative process that formalizes the expectations towards an AI under test. The method is applied in an autonomous driving use case, and validation experiments, which point to a general effectiveness of the method, are reported on. Besides allowing insights into the AI under test, the method seems particularly suited to structuring a constructive critique of the quality of a test dataset. As this work reports on a first application of the method, a special focus lies on its limitations and possible extensions.
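As a rough illustration of the technique named in the title, the sketch below uses scikit-learn's gradient boosting quantile regression to flag samples whose observed error exceeds the predicted upper quantile of the expected error. The data and the 90% level are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.random((500, 4))                        # illustrative inputs of the AI under test
err = X[:, 0] + 0.1 * rng.standard_normal(500)  # illustrative per-sample error of the AI

# Model the 90th percentile of the expected error given the input features.
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, err)

# Corner cases: observed error beyond the expected 90th percentile, i.e.
# "unexpectedly bad" performance relative to inputs of the same kind.
corner_mask = err > q90.predict(X)
print(f"{corner_mask.sum()} candidate corner cases out of {len(X)}")
```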

Paper Nr: 39
Title:

Development of a Text Classification Framework using Transformer-based Embeddings

Authors:

Sumona Yeasmin, Nazia Afrin, Kashfia Saif and Mohammad R. Huq

Abstract: Traditional text document classification methods represent documents with non-contextualized word embeddings and vector space models. Recent techniques for text classification often rely on word embeddings as a transfer learning component. We first explored the existing text document classification methodologies and then evaluated their strengths and limitations, starting with models based on Bag-of-Words and shifting towards transformer-based architectures. We conclude that transformer-based embeddings are necessary to capture contextual meaning. BERT, one of the transformer-based embedding architectures, produces robust word embeddings, analyzing text from left to right and right to left and capturing the proper context. This research introduces a novel text classification framework based on BERT embeddings of text documents. Several classification algorithms have been applied to the word embeddings of the pre-trained state-of-the-art BERT model. Experiments show that the random forest classifier obtains higher accuracy than the decision tree and k-nearest neighbor (KNN) algorithms. Furthermore, the obtained results have been compared with existing work and show up to 50% improvement in accuracy. In the future, this work can be extended by building a hybrid recommender system, combining content-based documents with similar features and user-centric interests. This study shows promising results and validates the proposed methodology as viable for text classification.
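A minimal sketch of such a pipeline is shown below, assuming mean pooling over BERT token embeddings (the abstract does not prescribe this exact pooling) and a toy three-document corpus.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT token embeddings as fixed-size document vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state           # (docs, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()  # ignore padding tokens

docs = ["stock markets rallied today", "the team won the final",
        "parliament passed the bill"]
labels = ["business", "sport", "politics"]
clf = RandomForestClassifier(n_estimators=100).fit(embed(docs), labels)
print(clf.predict(embed(["new tax law approved by senate"])))
```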

Paper Nr: 43
Title:

Exploratory Analysis of Chat-based Black Market Profiles with Natural Language Processing

Authors:

André Büsgen, Lars Klöser, Philipp Kohl, Oliver Schmidts, Bodo Kraft and Albert Zündorf

Abstract: Messenger apps like WhatsApp or Telegram are an integral part of daily communication. Besides the various positive effects, those services extend the operating range of criminals. Open trading groups with many thousand participants emerged on Telegram. Law enforcement agencies monitor suspicious users in such chat rooms. This research shows that text analysis, based on natural language processing, facilitates this through a meaningful domain overview and detailed investigations. We crawled a corpus from such self-proclaimed black markets and annotated five attribute types: products, money, payment methods, user names, and locations. Based on each message a user sends, we extract and group these attributes to build profiles. Then, we build features to cluster the profiles. Pretrained word vectors yield better unsupervised clustering results than current state-of-the-art transformer models. The result is a semantically meaningful high-level overview of the user landscape of black market chatrooms. Additionally, the extracted structured information serves as a foundation for further data exploration, for example, the most active users or preferred payment methods.

Paper Nr: 58
Title:

Inferring #MeToo Experience Tweets using Classic and Neural Models

Authors:

Julianne Zech, Lisa Singh, Kornraphop Kawintiranon, Naomi Mezey and Jamillah B. Williams

Abstract: The #MeToo movement is one of several calls for social change to gain traction on Twitter in the past decade. The movement went viral after prominent individuals shared their experiences, and much of its power continues to be derived from experience sharing. Because millions of #MeToo tweets are published every year, it is important to accurately identify experience-related tweets. Therefore, we propose a new learning task and compare the effectiveness of classic machine learning models, ensemble models, and a neural network model that incorporates a pre-trained language model to reduce the impact of feature sparsity. We find that even with limited training data, the neural network model outperforms the classic and ensemble classifiers. Finally, we analyze the experience-related conversation in English during the first year of the #MeToo movement and determine that experience tweets represent a sizable minority of the conversation and are moderately correlated to major events.

Paper Nr: 77
Title:

A2P: Attention-based Memory Access Prediction for Graph Analytics

Authors:

Pengmiao Zhang, Rajgopal Kannan, Anant V. Nori and Viktor K. Prasanna

Abstract: Graphs are widely used to represent real-life applications including social networks, web search engines, bioinformatics, etc. With the rise of Big Data, graph analytics offers significant potential in exploring challenging problems on relational data. Graph analytics is typically memory-bound. One way to hide the memory access latency is through data prefetching, which relies on accurate memory access prediction. Traditional prefetchers with pre-defined rules cannot adapt to complex graph analytics memory patterns. Recently, Machine Learning (ML) models, especially Long Short-Term Memory (LSTM), have shown improved performance for memory access prediction. However, existing models have shortcomings, including unstable LSTM models, interleaved patterns in labels using consecutive deltas (differences between addresses), and large output dimensions. We propose A2P, a novel attention-based memory access prediction model for graph analytics. We apply multi-head attention to extract features, which are easier to train than LSTM. We design a novel bitmap labeling method, which collects future deltas within a spatial range and makes the patterns easier to learn. By constraining the prediction range, bitmap labeling provides up to 5K× compression of the model output dimension. We further introduce a novel concept of super pages, which allows the model prediction to break the constraint of a physical page. For the widely used GAP benchmark, our results show that for the top three predictions, A2P outperforms the widely used state-of-the-art LSTM-based model by 23.1% w.r.t. precision, 21.2% w.r.t. recall, and 10.4% w.r.t. coverage.

Paper Nr: 84
Title:

Big Data Analysis of Ionosphere Disturbances using Deep Autoencoder and Dense Network

Authors:

Rayan Abri, Harun Artuner, Sara Abri and Salih Cetin

Abstract: The ionosphere plays a critical role in the functioning of the atmosphere and the planet. Fluctuations and anomalies in the ionosphere occur as a result of solar flares caused by coronal mass ejections, seismic motions, and geomagnetic activity. The total electron content (TEC) of the ionosphere is the most important metric for studying its morphology. The purpose of this article is to examine the relationships that exist between earthquakes and TEC data. To accomplish this, we present a classification method for the ionosphere’s TEC data that is based on earthquakes. Deep autoencoder techniques are used for feature extraction from the TEC data. The obtained features are fed into dense neural networks, which perform the classification. To assess the suggested classification model, its results are compared to those of an LDA (Linear Discriminant Analysis) classifier model. The research results show that the suggested model achieves an accuracy of around 0.94 in differentiating earthquakes, making it a useful tool for identifying ionospheric disturbances related to earthquakes.

Short Papers
Paper Nr: 4
Title:

Using the Silhouette Coefficient for Representative Search of Team Tactics in Noisy Data

Authors:

Friedemann Schwenkreis

Abstract: Automatically recognizing team tactics based on spatiotemporal data is challenging. Deep learning approaches have been proposed in this area but require a tremendous amount of manual work to create training and test data. This paper presents a clustering approach that reduces the needed manual effort significantly. A method is described to transform the spatiotemporal data into a canonical form that allows clustering techniques to be applied efficiently. Since noise cannot be avoided in the given application context, the silhouette coefficient is applied to filter clusters considered to be noisy in a way that is independent of the clustering technique. Then, a variant of the silhouette coefficient is introduced as an indicator of the overall cluster model quality, which allows selecting the optimal clustering technique as well as the optimal set of technique parameters for the given application context.
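For illustration, the sketch below filters noisy clusters by their mean per-sample silhouette value using scikit-learn; the synthetic data, the k-means choice, and the 0.3 threshold are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

sil = silhouette_samples(X, labels)              # per-sample silhouette coefficient
for c in np.unique(labels):
    mean_sil = sil[labels == c].mean()           # per-cluster average as a noise filter
    status = "keep" if mean_sil >= 0.3 else "noisy"   # threshold is an assumption
    print(f"cluster {c}: mean silhouette {mean_sil:.2f} -> {status}")
print("overall model quality:", silhouette_score(X, labels).round(2))
```

The overall score in the last line plays the role of the model-quality indicator: computed per clustering technique and parameter set, it lets the best combination be selected.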

Paper Nr: 25
Title:

Sensitivity Analysis using Regression Models: To Determine the Impact of Meta-level Features on the Youtube Views

Authors:

Vaishnavi Borwankar, Catherine Chris, Hitesh Kumar and Sophia Rahaman

Abstract: The popularity of social media has led to a paradigm shift, with YouTube emerging as a ubiquitous platform for networking and content sharing. YouTube, with over a million content creators, has become the most preferred destination for watching videos online. Meta-level features like the title, tags, number of views, likes, dislikes, etc. are significant for determining the sensitivity of videos. This study aims to determine how these meta-level features can be better utilized to increase the popularity of videos. The study specifically analyzes the impact that the number of likes, dislikes and the comment count have on the number of views. The number of likes, dislikes and the comment count are the independent variables, while the view count is the dependent variable. The dataset used for this research is the daily Trending YouTube Video Statistics for the years 2017-2019 from Kaggle, which spans the US region with over forty thousand videos from sixty-plus channels, released by YouTube for public use. In this paper, we use the Ordinary Least Squares regression algorithm and the Stochastic Gradient Descent algorithm to perform the sensitivity analysis. The analysis is performed on two categories: Media and Sports. The accuracy of both models is compared by evaluating the mean absolute error (MAE) and the relative absolute error (RAE) obtained from the results of the experiment. The results showed a significant impact of meta-level features on the popularity of the videos, along with their percentage dependency.

Paper Nr: 31
Title:

A Hybrid Approach for Product Classification based on Image and Text Matching

Authors:

Sebastian Bast, Christoph Brosch and Rolf Krieger

Abstract: The classification of products demands a high effort from retail companies because products must in many cases be classified manually. To optimize the product data creation process, methods for automating product classification are necessary. An important component of product data records are digital product images. Due to the latest developments in pattern recognition, these images can be used for product classification. Artificial neural networks are already capable of classifying digital images with lower error rates than humans. However, the enormous variety of products and frequent changes in the product assortment are big challenges for current methods of classifying product images automatically. In this paper, we present a system that automatically classifies products according to the Global Product Classification (GPC) standard, based on their images and on textual descriptions extracted from those images, using machine learning methods to find similarities in image and text datasets. Our experiments show that the manual effort required to classify product data can be significantly reduced by machine learning techniques.

Paper Nr: 37
Title:

A Comparison of Automatic Labelling Approaches for Sentiment Analysis

Authors:

Sumana Biswas, Karen Young and Josephine Griffith

Abstract: Labelling a large quantity of social media data for the task of supervised machine learning is not only time-consuming but also difficult and expensive. On the other hand, the accuracy of supervised machine learning models is strongly related to the quality of the labelled data on which they train, and automatic sentiment labelling techniques could reduce the time and cost of human labelling. We have compared three automatic sentiment labelling techniques: TextBlob, Vader, and Afinn to assign sentiments to tweets without any human assistance. We compare three scenarios: one uses training and testing datasets with existing ground truth labels; the second experiment uses automatic labels as training and testing datasets; and the third experiment uses three automatic labelling techniques to label the training dataset and uses the ground truth labels for testing. The experiments were evaluated on two Twitter datasets: SemEval-2013 (DS-1) and SemEval-2016 (DS-2). Results show that the Afinn labelling technique obtains the highest accuracy of 80.17% (DS-1) and 80.05% (DS-2) using a BiLSTM deep learning model. These findings imply that automatic text labelling could provide significant benefits, and suggest a feasible alternative to the time and cost of human labelling efforts.
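A minimal sketch of how the three labelling tools can be queried side by side is given below, using the textblob, vaderSentiment, and afinn packages; the single zero threshold is a simplification of the per-tool cut-offs typically used in practice (e.g. Vader's ±0.05 on the compound score).

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from afinn import Afinn

vader, afinn = SentimentIntensityAnalyzer(), Afinn()

def auto_labels(tweet, threshold=0.0):
    """Positive/negative/neutral label from each tool's polarity score."""
    scores = {
        "textblob": TextBlob(tweet).sentiment.polarity,       # in [-1, 1]
        "vader": vader.polarity_scores(tweet)["compound"],    # in [-1, 1]
        "afinn": afinn.score(tweet),                          # unbounded word sum
    }
    return {tool: "pos" if s > threshold else "neg" if s < threshold else "neu"
            for tool, s in scores.items()}

print(auto_labels("I absolutely love this new phone!"))
```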

Paper Nr: 55
Title:

Estimating the Optimal Number of Clusters from Subsets of Ensembles

Authors:

Afees A. Odebode, Allan Tucker, Mahir Arzoky and Stephen Swift

Abstract: This research estimates the optimal number of clusters in a dataset using a novel ensemble technique, a preferred alternative to relying on the output of a single clustering. Combining clusterings from different algorithms can lead to a more stable and robust solution, often unattainable by any single clustering solution. Technically, we create subsets of ensembles as possible estimates and evaluate them using a quality metric to obtain the best subset. We tested our method on publicly available datasets of varying types, sources and clustering difficulty to establish the accuracy and performance of our approach against eight standard methods. Our method outperforms all these techniques in the number of clusters estimated correctly. Due to the exhaustive nature of the initial algorithm, it becomes slow as the number of ensembles or the solution space increases; hence, we provide an updated version, based on the single-digit difference of Gray codes, that runs in linear time in terms of the subset size.
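To illustrate the Gray-code idea, the sketch below enumerates all subsets of an ensemble so that consecutive subsets differ by exactly one member, which is what makes an incremental metric update possible; the member names are placeholders.

```python
def gray_code_subsets(items):
    """Enumerate all subsets so that consecutive subsets differ by one element."""
    subset = set()
    yield frozenset(subset)
    for i in range(1, 2 ** len(items)):
        gray, prev = i ^ (i >> 1), (i - 1) ^ ((i - 1) >> 1)
        bit = (gray ^ prev).bit_length() - 1   # position of the single flipped bit
        subset ^= {items[bit]}                 # add or remove one clustering
        yield frozenset(subset)

# Incremental walk over subsets of four ensemble members: a quality metric
# could be updated in O(1) per step instead of recomputed from scratch.
for s in gray_code_subsets(["km", "ward", "dbscan", "spectral"]):
    print(sorted(s))
```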

Paper Nr: 57
Title:

Prediction of Sulfur in the Hot Metal based on Data Mining and Artificial Neural Networks

Authors:

Wandercleiton Cardoso and Renzo di Felice

Abstract: In recent years, interest in artificial intelligence and the integration of Industry 4.0 technologies to improve and monitor steel production conditions has increased. In the current scenario of the world economy, where the prices of energy and the inputs used in industrial processes are increasingly volatile, strict control of all stages of the production process is of paramount importance. For the steel production process, the temperature of the metal in the liquid state is one of the most important parameters to be evaluated, since a lack of control negatively affects the final quality of the product. Every day, several models are proposed to simulate industrial processes. In this sense, data mining and the use of artificial neural networks are competitive alternatives for this task. In this context, the objective of this work was to perform data mining on a big data set with more than 300,000 records, processing it using an artificial neural network and probabilistic reasoning. It is concluded that data mining and neural networks can be used in practice as tools for predicting and controlling impurities during the production of hot metal in a blast furnace.

Paper Nr: 63
Title:

Improved Boosted Classification to Mitigate the Ethnicity and Age Group Unfairness

Authors:

Ivona Colakovic and Sašo Karakatič

Abstract: This paper deals with the group fairness issue that arises when classifying data containing socially induced biases for age and ethnicity. To tackle the unfair focus on certain age and ethnicity groups, we propose an adaptive boosting method that balances the fair treatment of all groups. The proposed approach builds upon the AdaBoost method but supplements it with a factor of fairness between the sensitive groups. The results show that the proposed method focuses more on the age and ethnicity groups that are given less focus by traditional classification techniques. Thus, the resulting classification model is more balanced, treating all of the sensitive groups more equally without sacrificing the overall quality of the classification.

Paper Nr: 87
Title:

Slide-recommendation System: A Strategy for Integrating Instructional Feedback into Online Exercise Sessions

Authors:

Victor Obionwu, Vincent Toulouse, David Broneske and Gunter Saake

Abstract: A structured learning behavior requires an understanding of the learning environment, the feedback it returns, and comprehension of the task requirements. However, as observed in the activity logs of our SQLValidator, students spend most of their time doing trial-and-error until they come to the correct answer. While most students resort to consulting their colleagues, few eventually acquire a comprehension of the rules of the SQL language. With instructional feedback in the form of a recommendation, however, we could reduce the time penalty of ineffective engagement. To this end, we have extended our SQLValidator with a recommendation subsystem that provides automatic instructional feedback during online exercise sessions. We show that a mapping between SQL exercises and lecture slides based on their cosine similarity can be used to provide useful recommendations. The performance of our prototype reaches a precision of 0.767 and an Fβ=0.5 value of 0.505, which justifies our strategy of aiding students with lecture slide recommendations.
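A minimal sketch of such a cosine-similarity mapping between an exercise and lecture slides, using TF-IDF features, is shown below; the slide texts and exercise are invented examples, and the paper's exact feature construction may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

slides = [                                   # illustrative lecture-slide texts
    "SQL SELECT syntax, projection and the WHERE clause",
    "Joining tables with INNER JOIN and OUTER JOIN",
    "Grouping rows with GROUP BY and HAVING",
]
exercise = "Find all customers per city using GROUP BY"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(slides + [exercise])
sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()   # exercise vs. each slide
best = sims.argmax()
print(f"recommend slide {best}: {slides[best]!r} (similarity {sims[best]:.2f})")
```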

Paper Nr: 11
Title:

User Profiling: On the Road from URLs to Semantic Features

Authors:

Claudio Barros and Perrine Moreau

Abstract: Text data is undoubtedly one of the richest and most peculiar sources of information there is. It can come in many forms, and each requires specific treatment based on its nature in order to create meaningful features that can subsequently be used in predictive modelling. URLs in particular are quite specific and require adaptations in terms of processing compared to usual corpora of texts. In this paper, we review different ways we have used URLs to create meaningful features, both by exploiting the URL itself and by scraping its page content. We additionally attempt to measure the impact of adding the different groups of created features in a predictive modelling use case.

Paper Nr: 40
Title:

Feature Extraction and Failure Detection Pipeline Applied to Log-based and Production Data

Authors:

Rosaria Rossini, Nicolò Bertozzi, Eliseu Pereira, Claudio Pastrone and Gil Gonçalves

Abstract: Machines can generate an enormous amount of data which, complemented with production, alert, failure, and maintenance data, enables the generation of solid datasets through a feature engineering process. Modern machines incorporate sensors and data processing modules from the factory, but in older equipment these devices must be installed with the machine already in production, or in some cases it is not possible to install all required sensors. To overcome this issue and quickly start analyzing machine behavior, this paper describes a two-step approach that applies feature engineering to log and production data from an industrial dataset. In particular, by aggregating production and log data, the proposed two-step analysis can be applied to predict (i) whether an error will occur in the machine in the near future, and (ii) the severity of that error, i.e., a general evaluation of whether the issue is a candidate failure or a scheduled stop. The proposed approach has been tested in a real scenario with data collected from a woodworking drilling machine.

Paper Nr: 45
Title:

Unsupervised Electrodermal Data Analysis Comparison between Biopac and Empatica E4 Data Collection Platforms

Authors:

Kassy Raymond and Andrew Hamilton-Wright

Abstract: Unsupervised learning algorithms are valuable for exploring a variety of data domains. In this paper we compare the efficacy of the k-means and DBSCAN algorithms in the context of discerning structure in electrodermal data obtained using two different collection modalities for simultaneously collected data: the “gold standard” Biopac data platform and the wearable Empatica E4. Insights into the structure of the data from each system are provided, as is an analysis of the performance of each clustering algorithm at identifying interesting structure within the data.

Paper Nr: 60
Title:

Using Machine Learning Methods and the Influenza Simulation System to Explore the Similarities of Taiwan’s Administrative Regions

Authors:

Zong-Kai Lai, Yi-Ting Chiang, Tsan-sheng Hsu and Hung-Jui Chang

Abstract: When designing public health policies to prevent the spread of disease, it is crucial to consider the differences between administrative regions. Residents’ daily and inter-region activities are essential when epidemic diseases are spreading, yet most of the statistical data in traditional public health systems cannot capture these behaviors. In a disease-transmission simulation system, standard statistical data and disease transmission behaviors are combined and considered together. According to the data from the simulation system, the administrative regions in Taiwan are separated into one urban and three non-urban areas by a clustering algorithm. We then use decision tree algorithms to determine the main factors for deciding whether an area is rural or urban. The experimental results show that the percentage of elders and the road infrastructure are the main features for determining the type of an area.

Paper Nr: 67
Title:

Political Analytics on Election Candidates and Their Parties in Context of the US Presidential Elections 2020

Authors:

Kalpdrum Passi and Rakshit Sorathiya

Abstract: The growing availability of internet services in the United States and the rest of the world has contributed to increasing traction on social network platforms like Facebook, Twitter, YouTube, and many more. This has made it possible for individuals to freely speak and express their sentiments and emotions towards society. In 2020, the United States Presidential Elections saw around 1.5 million tweets on Twitter specifically about the Democratic and Republican parties and their candidates, Joe Biden and Donald Trump. The tweets express people’s sentiments and opinions towards the two political leaders and their parties. The study of beliefs, sentiments, perceptions, views, and feelings conveyed in text is known as sentiment analysis, and political parties have used this technique to run their campaigns and understand the opinions of the public. In this work, we conducted text mining on approximately 1.5 million tweets posted between 15th October and 8th November 2020, during the voting period of the United States elections, that address the two mainstream political parties. We examined how Twitter users perceived both political parties and their candidates using VADER, a lexicon- and rule-based sentiment analysis tool tailored to discovering social media emotions. The results of the research pointed to the Democratic Party’s Joe Biden, regardless of the sentiments and opinions on Twitter showing Donald Trump could win.

Paper Nr: 78
Title:

Using Convolutional Neural Networks for Detecting Acrylamide in Biscuit Manufacturing Process

Authors:

Dilruba Topcuoglu, Berat U. Mentes, Nur Askin, Ayse D. Sengul, Zeynep D. Cankut, Talip Akdemir, Murat Ayvaz, Elif Kurt, Ozge Erdohan, Tumay Temiz and Murat Ceylan

Abstract: According to research from 2002 (Ozkaynak & Ova, 2006), the substance acrylamide is formed when excessive heat treatment (e.g. frying, grilling, baking) is applied to starch-containing products. This substance poses carcinogenic and neurotoxicological risks for human health. Acrylamide levels are controlled by random laboratory sampling. These control processes, executed by humans, are prolonged and error-prone. In this study, we offer a Convolutional Neural Network (CNN) model which provides acceptable precision and recall rates for detecting acrylamide in the biscuit manufacturing process.

Paper Nr: 79
Title:

Auctions and Estimates: Evidence from Indian Art Market

Authors:

Shailendra Gurjar and Usha Ananthakumar

Abstract: We examine whether presale estimates of paintings by Indian artists are unbiased predictors of the hammer price. Our analysis includes both sold and unsold artworks. The unbiasedness of estimates is tested by estimating a two-stage Heckit model on 5,077 artworks auctioned between 2000 and 2018. The results of our study show that presale estimates are upward biased for expensive artworks and downward biased for others. We also find that, in the market for Indian paintings, the characteristics of the auction, artist, and artwork determine the biasedness of the estimates.
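For illustration, a minimal two-stage Heckit sketch on synthetic data is given below: a probit selection equation for the sale decision, the inverse Mills ratio, and a selection-corrected OLS stage. The covariates and coefficients are invented, not the paper's specification (and in this toy the two error terms are independent, so the selection term will be near zero).

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 1000
z = sm.add_constant(rng.random((n, 2)))           # selection covariates (auction traits)
x = sm.add_constant(rng.random((n, 2)))           # outcome covariates (artwork traits)
sold = (z @ [0.2, 1.0, -0.5] + rng.standard_normal(n)) > 0
price = x @ [8.0, 0.7, 0.4] + rng.standard_normal(n)   # latent log hammer price

# Stage 1: probit of the sale decision, then the inverse Mills ratio.
gamma = sm.Probit(sold.astype(int), z).fit(disp=0).params
mills = norm.pdf(z @ gamma) / norm.cdf(z @ gamma)

# Stage 2: OLS on sold lots only, with the Mills ratio correcting selection bias.
X2 = np.column_stack([x[sold], mills[sold]])
ols = sm.OLS(price[sold], X2).fit()
print(ols.params.round(3))                        # last coefficient: selection term
```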

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 10
Title:

Development of a Semantic Database Model to Facilitate Data Analytics in Battery Cell Manufacturing

Authors:

Ozan Yesilyurt, David Brandt, Julian J. Grimm, Kamal Husseini, Aleksandra Naumann, Julia Meiners and David Becker-Koch

Abstract: The demand for batteries is increasing worldwide. To cover this high battery demand, optimizing manufacturing productivity and improving the quality of battery cells are necessary. Digitalization promises great potential to address these challenges. Through data collection along the manufacturing processes, hidden correlations can be identified. However, data in battery cell manufacturing is highly diverse, complicating data analysis. Semantic data storage can increase the understanding of the relationships between datasets, facilitating the identification of the causes of defects in manufacturing processes. To structure heterogeneous data in a semantically understandable and analyzable form, this paper presents the development of a semantic database model. The realization of this model enables structuring various datasets for simplified access and usage, increasing productivity and battery cell quality in battery cell manufacturing.

Paper Nr: 49
Title:

Domain-independent Data-to-Text Generation for Open Data

Authors:

Andreas Burgdorf, Micaela Barkmann, André Pomp and Tobias Meisen

Abstract: As a result of the efforts of the Open Data movement, the number of Open Data portals and the amount of data published in them are steadily increasing. An aspect that increases the usability of data enormously but is nevertheless often neglected is the enrichment of data with textual documentation. However, the creation of descriptions of sufficient quality is time-consuming and thus cost-intensive. One approach to solving this problem is data-to-text generation, which creates descriptions from raw data. In the past, promising results were achieved on data from Wikipedia. Based on a seq2seq model developed for such purposes, we investigate whether this technique can also be applied in the Open Data domain and examine the associated challenges. In three studies, we reproduce the results obtained in previous work and apply the model to additional datasets with new challenges in terms of data nature and data volume. We conclude that previous methods are not suitable for application in the Open Data sector without further modification, but the results still exceed our expectations and show the potential for applicability.

Paper Nr: 92
Title:

Modelling of Efficient Graph-aware Data Storage using DNA

Authors:

Asad Usmani and Lena Wiese

Abstract: The global demand for massive data archival, with a yearly exponential growth rate, is close to outpacing the capacity of the world’s conventional storage media. Fortunately, DNA (deoxyribonucleic acid) storage has made a substantial breakthrough for archiving such vast data for a long time. Though many scientists have made remarkable efforts to use DNA storage as a promising emergent solution for archiving raw data, no one has exploited it to store graph-aware encoded data. Graph-aware data archiving has notable advantages over raw data: it supports data portability and significantly reduces the data size relevant for DNA storage in terms of nucleotides, thereby reducing database operational costs. We present a theoretical model for efficient DNA storage of simple graph-based scientific data. Furthermore, some simple graph-based datasets, particularly from the biological domain, have been used for experimental results and analysis, revealing compression ratios between 1.18 and 1.53.
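To give a flavour of the encoding side, the sketch below maps bytes to nucleotides at the maximal direct density of two bits per base. Real DNA storage codecs add constraints (GC balance, homopolymer limits, error correction) that this illustration omits, and the graph-aware encoding in the paper sits on top of such a layer; the payload here is invented.

```python
# Two bits per nucleotide: the densest direct binary-to-base mapping.
TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
FROM_BASE = {base: bits for bits, base in TO_BASE.items()}

def encode(data: bytes) -> str:
    """Encode bytes as a nucleotide string, 4 bases per byte."""
    return "".join(TO_BASE[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))

def decode(strand: str) -> bytes:
    """Invert encode(): pack every 4 bases back into one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | FROM_BASE[base]
        out.append(byte)
    return bytes(out)

payload = b"(a)-[:edge]->(b)"     # illustrative graph-encoded fragment
strand = encode(payload)
assert decode(strand) == payload
print(strand)
```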

Short Papers
Paper Nr: 30
Title:

Framework for Public Health Policy Indicators Governance and Metadata Quality Flags to Promote Data Literacy

Authors:

Wesley L. Barbosa, Jacqueline D. Planas, Maritza C. Francisco, Solange N. Alves-Souza, Lucia L. Filgueiras, Leandro R. Velloso and Luiz S. de-Souza

Abstract: Public Health Policy Indicators (PHPI) are tools for monitoring the performance of policies and enable data-driven decision-making. For PHPI to be useful for different stakeholders, they must be characterized in a way that promotes an unequivocal understanding of their meaning. PHPI are consolidated from data assets, which must be managed to yield reliable information that supports the decision-making process. However, in the public sector, aspects related to data and indicator governance tend to be neglected. Thus, we propose a metadata-oriented framework for health indicator governance that incorporates aspects of the agile philosophy and allows a fast-start governance program to be implemented. Furthermore, a flag-based system is proposed to promote data literacy in the context of health indicators. From a case study, we obtained results that show the feasibility of implementing a governance program under budget and time constraints while guaranteeing fast value delivery. The quality flags proved to be an adequate strategy to classify the indicator metadata in a simplified way and encourage improvement actions. Therefore, working towards more detailed descriptions of the indicators that highlight the usefulness of the information promotes a better understanding of their meaning and use, encouraging data literacy, generating value, and positively impacting the management of health policies.
Download
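
A minimal sketch of what a flag-based classification of indicator metadata could look like; the flag names and rules below are assumptions, not the framework's actual definitions.

    # Illustrative sketch only: hypothetical quality flags for indicator metadata.
    from dataclasses import dataclass

    @dataclass
    class IndicatorMetadata:
        name: str
        description: str
        source: str
        update_frequency: str

    def quality_flags(meta: IndicatorMetadata) -> dict:
        # Each flag signals, in a simplified way, whether a metadata field is usable.
        return {
            "has_description": len(meta.description) >= 30,
            "has_source": bool(meta.source),
            "has_update_frequency": bool(meta.update_frequency),
        }

    meta = IndicatorMetadata("infant_mortality",
                             "Deaths under one year per 1000 live births.",
                             "Ministry of Health", "yearly")
    print(quality_flags(meta))  # {'has_description': True, ...}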

Paper Nr: 36
Title:

Compiling Open Datasets in Context of Large Organizations while Protecting User Privacy and Guaranteeing Plausible Deniability

Authors:

Igor Jakovljevic, Christian Gütl, Andreas Wagner and Alexander Nussbaumer

Abstract: Open data and open science are becoming ever more popular. The information generated in large organizations holds great potential for the organizations themselves, future research, innovation, and more. Currently, there is a wide range of similar guidelines for publishing organizational data that focus on data anonymization but contain conflicting ideas and steps. These guidelines usually do not cover the whole process of assessing risks, evaluating, and distributing data. In this paper, the relevant tasks from different open data frameworks have been identified, adapted, and synthesized into a six-step framework for transforming organizational data into open data while offering privacy protection to organisational users. As part of the research, the framework was applied to a CERN dataset, and expert interviews were conducted to evaluate the results and the framework. Drawbacks of the framework were identified and suggested as improvements for future work.
Download

Paper Nr: 46
Title:

Analysis of Data Quality in Digital Smart Cities: The Cases of Nantes, Hamburg and Helsinki

Authors:

José L. Hernández, Ana Quijano, Rubén García, Pierre Nouaille, Lukas Risch, Mikko Virtanen and Ignacio de Miguel

Abstract: The Smart City concept is supported by the use of Information and Communication Technologies (ICT), which enable the digitalisation of city assets. Cities are therefore nowadays driven by data, with a clear dependency on the data collection approach. Decisions and criteria for urban transformation consequently rely on data and Key Performance Indicators. However, one question remains: the reliability and credibility of the data that guide decision-making processes. Many efforts go into defining data quality methodologies, but not into analysing the real state of data collection in smart cities. This paper applies a methodology to quantitatively analyse the actual quality of the datasets in the cities of Nantes, Hamburg and Helsinki. This work is carried out under the umbrella of the mySMARTLife project (GA #731297). The main lesson learnt is the need for more appropriate methods to increase data quality, rather than for defining new methodologies. Data quality requires improvements in order to make better-informed decisions and obtain more credible Key Performance Indicators.
Download

Paper Nr: 64
Title:

Working Efficiently with Large Geodata Files using Ad-hoc Queries

Authors:

Pascal Bormann, Michel Krämer and Hendrik M. Würz

Abstract: Working with large geospatial data such as building models or point clouds typically requires an index structure to enable fast queries. Creating such an index is a time-consuming process. Especially in single-user explorative scenarios, as they are often found in the scientific community, creating an index or importing the data into a database management system (DBMS) might be unnecessary. In this position paper, we show through a series of experiments that modern commodity hardware is fast enough to perform many query types ad hoc on unindexed building model and point cloud data. We show how searching in unindexed data can be sped up using simple techniques and trivial data layout adjustments. Our experiments show that ad-hoc queries can often be answered in interactive or near-interactive time without an index, sometimes even outperforming the DBMS. We believe our results provide valuable input and open up possibilities for future research.
Download
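
To illustrate the kind of ad-hoc query the paper discusses, here is a sketch of a bounding-box scan over unindexed points. Storing coordinates in a columnar layout (one array per axis) is an example of a trivial layout adjustment that makes full scans fast; the paper's actual experiments cover more query types and real file formats, and the data below is random.

    # Illustrative sketch only: vectorized ad-hoc bounding-box query, no index.
    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(0, 1000, 10_000_000)  # columnar layout: one array per axis
    ys = rng.uniform(0, 1000, 10_000_000)

    def bbox_query(xs, ys, xmin, xmax, ymin, ymax):
        # One vectorized pass over the columns; no index structure needed.
        mask = (xs >= xmin) & (xs <= xmax) & (ys >= ymin) & (ys <= ymax)
        return np.flatnonzero(mask)

    hits = bbox_query(xs, ys, 100, 110, 200, 210)
    print(len(hits), "points in the query window")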

Paper Nr: 65
Title:

Interoperability-oriented Quality Assessment for Czech Open Data

Authors:

Dasa Kusnirakova, Mouzhi Ge, Leonard Walletzky and Barbora Buhnova

Abstract: With the rapid increase in published open datasets, it is crucial to support open data progress in smart cities while considering open data quality. In the Czech Republic and its National Open Data Catalogue (NODC), open datasets are usually evaluated based on their metadata only, leaving the content and the adherence to the recommended data structure to the sole responsibility of the data providers. The interoperability of open datasets remains unknown. This paper therefore proposes a novel content-aware quality evaluation framework that assesses the quality of open datasets along five data quality dimensions. With the proposed framework, we provide a fundamental view of the interoperability-oriented data quality of Czech open datasets published in the NODC. Our evaluations find that domain-specific open data quality assessments are able to detect data quality issues beyond the traditional heuristics used for determining Czech open data quality, increase the datasets' interoperability, and thus increase their potential to bring value to society. The findings of this research are beneficial not only for the Czech Republic but also for other countries that intend to enhance their open data quality evaluation processes.
Download

Paper Nr: 73
Title:

Evaluation of Architectures for FAIR Data Management in a Research Data Management Use Case

Authors:

Benedikt Heinrichs, Marius Politze and M. A. Yazdi

Abstract: Research data management systems are mostly designed to manage data according to the FAIR Guiding Principles. For the systems themselves to fulfil this promise and to improve the possibility of networking between decentralized systems, they should incorporate standardized interfaces for the exchange of data and metadata. For this purpose, several standards have emerged over the last couple of years that try to fill this gap and define data structures and APIs. This paper evaluates these standards by defining the requirements of a research data management system called Coscine as a use case and examining whether the current standards meet the defined needs. The evaluation shows that no single standard covers every requirement, but that the standards can complement each other to fulfill the goal of a standardized research data management system.
Download

Paper Nr: 86
Title:

Efficient Subgraph Indexing for Biochemical Graphs

Authors:

Chimi Wangmo and Lena Wiese

Abstract: The dynamic nature of graph-structured data demands fast subgraph query processing to solve real-world problems such as identifying spammers in social networks, detecting fraud in financial systems, and finding motifs in biological networks. The need for efficient subgraph search has motivated the study of filtering candidate graphs using the filter-then-verify framework with a minimal index size. This paper presents an efficient in-memory index structure for indexing the paths in a transaction graph database. Our radix-tree-based index structure addresses the high memory consumption of tries when representing biochemical datasets. Furthermore, we contrast various containers used in the radix nodes. We demonstrate empirically the benefits of compressing common prefixes in the paths, achieving a 20% reduction in index size compared to a trie-based implementation.
Download
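
As a sketch of the underlying idea, the following enumerates label paths from a small molecule graph and inserts them into a plain trie. The paper replaces such a trie with a radix tree that merges single-child chains, which is where the reported size reduction comes from; the radix compression itself is omitted here and the toy fragment is invented.

    # Illustrative sketch only: path enumeration plus a plain trie index.
    from collections import defaultdict

    def trie():
        return defaultdict(trie)

    def index_paths(adj, labels, max_len, root):
        # adj: node -> neighbors, labels: node -> atom label
        def dfs(node, path, visited):
            t = root
            for lab in path:
                t = t[lab]
            t["$"]  # mark the end of an indexed path
            if len(path) == max_len:
                return
            for nxt in adj[node]:
                if nxt not in visited:
                    dfs(nxt, path + [labels[nxt]], visited | {nxt})
        for n in adj:
            dfs(n, [labels[n]], {n})

    adj = {0: [1], 1: [0, 2], 2: [1]}    # a C-O-C fragment
    labels = {0: "C", 1: "O", 2: "C"}
    root = trie()
    index_paths(adj, labels, max_len=3, root=root)
    print(sorted(root.keys()))  # ['C', 'O']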

Paper Nr: 12
Title:

Maneuver-based Visualization of Similarities between Recorded Traffic Scenarios

Authors:

Thilo Braun, Lennart Ries, Moritz Hesche, Stefan Otten and Eric Sax

Abstract: Since automated driving functions are safety-critical systems, extensive validation and verification are necessary. Scenario-based testing is a promising approach to this challenge. For the selection of relevant scenarios, collected data and knowledge models are potential sources. In this paper we introduce a concept that uses recorded trajectory and map data, abstracted to maneuvers, to describe scenarios and visualize them intuitively. This enables a data-driven scenario-mining process to find relevant scenarios for testing automated driving functions. To compare the scenarios, a similarity measure based on the maneuvers is designed, and the scenarios and their similarities are represented as a graph. Graph-visualization methods, already successfully applied in other domains, structure the collected data for further analysis. The concept is applied, as an example, to an urban traffic dataset.
Download
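
A minimal sketch of the general idea: comparing recorded scenarios by their maneuver sets with a Jaccard similarity and linking sufficiently similar scenarios as graph edges. The paper's measure and visualization are more elaborate; the maneuver names and threshold here are invented.

    # Illustrative sketch only: maneuver-set similarity and a threshold graph.
    from itertools import combinations

    scenarios = {
        "s1": {"lane_change_left", "follow", "turn_right"},
        "s2": {"lane_change_left", "follow"},
        "s3": {"overtake", "follow"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    edges = [(u, v, jaccard(scenarios[u], scenarios[v]))
             for u, v in combinations(scenarios, 2)]
    similar = [(u, v, s) for u, v, s in edges if s >= 0.5]  # graph edges
    print(similar)  # e.g. [('s1', 's2', 0.666...)]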

Paper Nr: 28
Title:

Evaluation by Simulation of the Diffusion Methods in the Cloud-based Network Architecture for Digital Open Universities

Authors:

Boukar A. Nicolas, Mahamadou I. Tiado, Nassirou A. Hassane and Ibrahim G. Noura

Abstract: The interconnection between the Internet and telecommunication networks has led to the advent of a new generation of digital open universities (DOUNG). This recent model has been improved through many additional works, including the extension of its architecture from the Local Area Network (LAN) to the Internet and to the GSM (Global System for Mobile communications) environment. This hybrid architecture leads to several connections, with the goal of achieving a good level of Quality of Service (QoS). One solution is the use of clouds, which raises the issue of choosing a diffusion method adapted to this new context. In this paper, a comparative study of flow distribution methods is conducted through a dissemination analysis and simulations. We extend that work with an assessment of the cloud's contribution, including an evaluation of scaling. Results for vertical and horizontal scaling and for the unicast and multicast methods are produced and discussed.
Download

Paper Nr: 41
Title:

Multitudinous Data Platform for Community Big Data

Authors:

S. Junrat, J. Nopparat, M. Manopiroonporn, W. Suntiamorntut and S. Charoenpanyasak

Abstract: A city’s data is diverse. Many types of data are available, including time-series data from various sensors and event data from human input (text, numbers, dates, locations, photos, and so on). These data need a platform that can collect a city’s data comprehensively and support convenient querying for analysis. We designed and developed a platform that supports the automatic construction of a variety of data structures by designing a collection of questions that can be used to flexibly gather various categories of data sets in each city. It can be applied to solve problems and to develop a city in a targeted way.
Download

Paper Nr: 42
Title:

Towards Semantic Interoperability of Core Registers in Croatia

Authors:

Kornelije Rabuzin, Darko Gulija, Leo Mršić and Nikola Modrušan

Abstract: Digital government assumes sharing and use of government data without restrictions. However, different reports and indicators presented in this paper show that in Croatia, core register data could and should be used and shared more extensively. In this way, better services to citizens and companies could be offered. The first step to accomplish this goal is to examine core registers in Croatia, in order to detect possible issues and problems which hinder data use, sharing and exchange. For that purpose, a project was started whose goal is to analyse basic register data in Croatia. Findings from the first phase of the project, including the first set of registers, are presented in this paper.
Download

Paper Nr: 56
Title:

End-to-End Data Quality: Insights from Two Case Studies

Authors:

M. Redwan Hasan and Christine Legner

Abstract: Maintaining high data quality in organizations has become indispensable. In the past, companies largely concentrated their data quality efforts on a single point in the information supply chain, focusing either on master data quality or on information products. As they start repurposing data and leveraging it for more advanced and complex use cases, they need to manage data quality proactively in an end-to-end approach. Leveraging insights from two case studies, this paper analyses two different yet complementary approaches to end-to-end data quality management, namely the first-time-right approach and the use-case-driven approach. The findings highlight that end-to-end data quality management relies on common principles but can start from either side of the information supply chain: through a use case or at the data entry point at the source.
Download

Paper Nr: 59
Title:

Students or Mechanical Turk: Who Are the More Reliable Social Media Data Labelers?

Authors:

Lisa Singh, Rebecca Vanarsdall, Yanchen Wang and Carole R. Gresenz

Abstract: For social-media-related machine learning tasks, having reliable data labelers is important. However, it is unclear whether students or Mechanical Turk workers are the better data labelers for these noisy, short posts. This paper compares the reliability of students and Mechanical Turk workers on a series of social media data labeling tasks. In general, we find that for most tasks, the Mechanical Turk workers have stronger agreement than the student workers. When we group labeling tasks by difficulty, we find more consistency across labeling tasks for Mechanical Turk workers than for student workers. Both findings suggest that using Mechanical Turk workers to label social media posts leads to more reliable labels than using college students.
Download
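
For context on how labeler agreement is typically quantified, here is a sketch of Cohen's kappa, a standard inter-annotator agreement measure; the labels below are invented and the paper's exact statistics are not reproduced here.

    # Illustrative sketch only: Cohen's kappa between two label sequences.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        ca, cb = Counter(labels_a), Counter(labels_b)
        expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
        return (observed - expected) / (1 - expected)

    turk =    ["pos", "neg", "pos", "neg", "pos", "pos"]
    student = ["pos", "neg", "neg", "neg", "pos", "neg"]
    print(round(cohens_kappa(turk, student), 3))  # 0.4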

Paper Nr: 74
Title:

A Two-tier Approach for Organization Name Entity Resolution

Authors:

Almuth Müller and Achim Kuwertz

Abstract: This paper presents a concept for a two-tier semi-automated approach to business data entity resolution. Resolving entity names is generally relevant, e.g., in business intelligence. In practice, several difficulties have to be considered, such as name deviations for an organization. Two types of deviations can be distinguished: first, names can differ due to typos, native special characters, or transformation errors; second, an organization name can change due to outdated designations or being given in another language. A further aspect is data sovereignty. Analyzed data sources can be under direct control, e.g., in an organization's own data storage systems, and thus be kept clean; yet other sources of relevant data may only be publicly available. It is generally not recommended to copy such data, due to, e.g., its volume and data duplication issues. The proposed two-tier approach to entity resolution thus considers not only different kinds of name deviations, but also data sovereignty issues. While still work in progress, it has the potential to reduce the effort required compared to manual approaches and can possibly be applied in different areas where there is a significant need for harmonized data and externally curated systems are not feasible.
Download
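
A minimal sketch of a two-tier matching flow, mirroring the two kinds of name deviations the paper distinguishes: character-level normalization first, then fuzzy comparison. The difflib matcher and the threshold are stand-in assumptions, not the paper's method.

    # Illustrative sketch only: normalize, then fuzzy-match organization names.
    import difflib
    import unicodedata

    def normalize(name: str) -> str:
        # Tier 1: fold special characters, case, and punctuation (typo-level noise).
        folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
        return "".join(c for c in folded.lower() if c.isalnum() or c == " ").strip()

    def match(name: str, candidates: list[str], threshold: float = 0.85):
        # Tier 2: fuzzy similarity on the normalized forms.
        norm = normalize(name)
        scored = [(c, difflib.SequenceMatcher(None, norm, normalize(c)).ratio())
                  for c in candidates]
        best, score = max(scored, key=lambda p: p[1])
        return best if score >= threshold else None

    print(match("Müller GmbH.", ["Muller GmbH", "Mueller AG", "Other Corp"]))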

Paper Nr: 83
Title:

Functional Component Descriptions for Electrical Circuits based on Semantic Technology Reasoning

Authors:

Johannes Bayer, Mina K. Zadeh, Markus Schröder and Andreas Dengel

Abstract: Circuit diagrams have been used in electrical engineering for decades to describe the wiring of devices and facilities. They depict electrical components in a symbolic and graph-based manner. While circuit design is usually performed electronically, there are still legacy paper-based diagrams that require digitization in order to be used in CAE systems. Moreover, knowledge about specific circuits may be lost between engineering projects, making it hard for domain novices to understand a given circuit design. The graph-based nature of these documents can be exploited by semantic-technology-based reasoning to generate human-understandable descriptions of their functional principles. More precisely, each electrical component of a circuit (e.g. a diode) may be assigned a high-level function label which describes its purpose within the device (e.g. flyback diode for reverse voltage protection). In this paper, forward chaining rules are used for this generation. The described approach is applicable both to CAE-based circuits and to raw circuits yielded by an image understanding pipeline. The viability of the approach is demonstrated by applying it to an existing set of circuits.
Download
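
A tiny sketch of a forward-chaining step that assigns a function label to a diode wired across an inductive load, echoing the flyback-diode example above. Real systems use semantic-web rule engines over RDF graphs; the facts and the single rule here are hand-written assumptions.

    # Illustrative sketch only: one forward-chaining rule over triple facts.
    facts = {
        ("D1", "type", "diode"),
        ("L1", "type", "relay_coil"),
        ("D1", "parallel_to", "L1"),
    }

    def forward_chain(facts):
        derived = set(facts)
        changed = True
        while changed:  # iterate until no rule fires a new fact
            changed = False
            for (d, rel, l) in list(derived):
                if (rel == "parallel_to"
                        and (d, "type", "diode") in derived
                        and (l, "type", "relay_coil") in derived
                        and (d, "function", "flyback_diode") not in derived):
                    derived.add((d, "function", "flyback_diode"))
                    changed = True
        return derived

    print([f for f in forward_chain(facts) if f[1] == "function"])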

Area 5 - Databases and Data Security

Full Papers
Paper Nr: 14
Title:

Startable: Multidimensional Modelling for Column-Oriented NoSQL

Authors:

Leandro M. Ferreira, Solange N. Alves-Souza and Luciana Maria da Silva

Abstract: NoSQL Database Management Systems (DBMS) can be an alternative for analytical systems and have been used in this regard. As analytical systems often use multidimensional data modelling, which was created for relational databases, a new logical data modelling approach is necessary for column-oriented NoSQL databases. We developed such a logical modelling approach (Startable) and applied it to column-family-oriented NoSQL. Furthermore, this paper presents a logical model applying the Startable modelling, together with a benchmark demonstrating the performance of the proposed modelling compared to traditional approaches to multidimensional modelling.
Download

Short Papers
Paper Nr: 21
Title:

Towards Decentralized Parameter Servers for Secure Federated Learning

Authors:

Muhammad El-Hindi, Zheguang Zhao and Carsten Binnig

Abstract: Federated learning aims to protect the privacy of data owners in a collaborative machine learning setup, since training data does not need to be revealed to any other participant in the training process. This is achieved by requiring participants to share only locally computed model updates (i.e., gradients), instead of the training data, with a centralized parameter server. However, recent papers have shown privacy attacks that allow this server to reconstruct the training data of individual data owners from the received gradients alone. To mitigate this attack, we propose a new federated learning framework that decentralizes the parameter server. As part of this contribution, we investigate the configuration space of such a decentralized federated learning framework. Moreover, we propose three promising privacy-preserving techniques, namely model sharding, asynchronous updates, and polling intervals for stale parameters. In our evaluation, we observe on different data sets that these techniques can effectively thwart gradient-based reconstruction attacks on deep learning models, from both the client side and the server side, by reducing the attack's output to something close to random noise.
Download
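
A minimal sketch of model sharding, one of the three techniques named above: each client's gradient vector is split across several parameter servers so that no single server receives enough of it to attempt reconstruction. Shard boundaries, server count, and the random gradients are arbitrary assumptions.

    # Illustrative sketch only: shard gradients across parameter servers.
    import numpy as np

    def shard_gradient(grad: np.ndarray, n_servers: int) -> list[np.ndarray]:
        return np.array_split(grad, n_servers)  # each server gets one slice

    def aggregate(shards_per_client: list[list[np.ndarray]]) -> np.ndarray:
        # Each server averages only its own slice; slices are then concatenated.
        n_servers = len(shards_per_client[0])
        averaged = [np.mean([c[s] for c in shards_per_client], axis=0)
                    for s in range(n_servers)]
        return np.concatenate(averaged)

    clients = [shard_gradient(np.random.randn(10), n_servers=3) for _ in range(4)]
    print(aggregate(clients).shape)  # (10,) -- full averaged gradient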

Paper Nr: 82
Title:

Local Personal Data Processing with Third Party Code and Bounded Leakage

Authors:

Robin Carpentier, Iulian Sandu Popa and Nicolas Anciaux

Abstract: Personal Data Management Systems (PDMSs) provide individuals with appropriate tools to collect, manage, and share their personal data under their control. A founding principle of PDMSs is to move the computation code to the user's data, not the other way around. This opens up new uses for personal data, wherein an individual's entire personal database is operated within their local environment and never exposed outside; only aggregated computed results are externalized. Yet, whenever arbitrary aggregation function code, provided by a third-party service or application, is evaluated on large datasets, as envisioned for typical PDMS use cases, can the potential leakage of the user's personal information through the legitimate results of that function be bounded and kept small? This paper aims to provide a positive answer to this question, which is essential to demonstrate the rationale of the PDMS paradigm. We resort to an architecture for PDMSs based on Trusted Execution Environments to evaluate any classical user-defined aggregate PDMS function. We show that an upper bound on leakage exists and we sketch remaining research issues.
Download

Paper Nr: 89
Title:

NoSQL Document Databases Assessment: Couchbase, CouchDB, and MongoDB

Authors:

Inês Carvalho, Filipe Sá and Jorge Bernardino

Abstract: NoSQL document databases emerged as an alternative to relational databases for dealing with large volumes of data. In this paper, we assess the top three free and open-source NoSQL document databases: Couchbase, CouchDB, and MongoDB. Through this analysis, we identify the main characteristics of each database. The OSSpal methodology, which combines quantitative and qualitative measures to assess open-source software, was used. This methodology defines seven categories: functionality, operational software characteristics, software technology attributes, documentation, support and service, community and adoption, and development process. In the end, a score is obtained that identifies the best NoSQL document database.
Download
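
A minimal sketch of a weighted scoring over the seven OSSpal categories listed above; the weights and ratings below are invented placeholders, not the paper's actual assessment values.

    # Illustrative sketch only: weighted OSSpal-style category scoring.
    CATEGORIES = ["functionality", "operational software characteristics",
                  "software technology attributes", "documentation",
                  "support and service", "community and adoption",
                  "development process"]

    def osspal_score(ratings: dict, weights: dict) -> float:
        assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
        return sum(ratings[c] * weights[c] for c in CATEGORIES)

    weights = {c: 1 / len(CATEGORIES) for c in CATEGORIES}  # equal weighting
    mongodb = dict(zip(CATEGORIES, [4.5, 4, 4, 4.5, 4, 5, 4]))  # placeholder ratings
    print(round(osspal_score(mongodb, weights), 2))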