DATA 2023 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 68
Title:

Disease Prediction with Heterogeneous Graph of Electronic Health Records and Toxicogenomics Data

Authors:

Ji-Hyeong Park, Hyun-Soo Choi, Sunhwa Jo and Jinho Kim

Abstract: Disease prediction is an important technology in the field of medicine. Several studies have been conducted on disease prediction using electronic health records (EHR). However, existing methods have several limitations, such as predicting only a single disease and utilizing limited data sources of textual or drug-related data; thus, they cannot capture the relationship between a patient and a disease, or among diseases. Furthermore, they suffer from the problem that additional information other than EHR exists only for a limited set of diseases and cannot be used for a wide range of diseases. To mitigate these problems, we utilize Toxicogenomics Data (TD), which contains extensive information about most diseases, and analyze this complicated data using a heterogeneous graph embedding technique. We utilize metapaths and graph neural networks for graph embedding of the heterogeneous relationships in EHR-TD, and then develop a novel disease prediction framework. To achieve this goal, we first present a process for the collection and processing of EHR and TD data to improve their reliability. Second, we propose a method for efficiently constructing heterogeneous EHR-TD graphs, and present an embedding model that can be used effectively. Finally, we propose a metapath interaction encoder that can address the problems of RNN-based encoders in previous models. Thereafter, we validate the effectiveness of the proposed framework and modules with extensive evaluations of various designs for disease prediction using EHR and TD data.

Paper Nr: 86
Title:

A Comparison Study for Disaster Tweet Classification Using Deep Learning Models

Authors:

Soudabeh T. Dinani and Doina Caragea

Abstract: Effectively filtering and categorizing the large volume of user-generated content on social media during disaster events can help emergency management and disaster response prioritize their resources. Deep learning approaches, including recurrent neural networks and transformer-based models, have been previously used for this purpose. Capsule Neural Networks (CapsNets), initially proposed for image classification, have been proven to be useful for text analysis as well. However, to the best of our knowledge, CapsNets have not been used for classifying crisis-related messages, and have not been extensively compared with state-of-the-art transformer-based models, such as BERT. Therefore, in this study, we performed a thorough comparison between CapsNet models, state-of-the-art BERT models and two popular recurrent neural network models that have been successfully used for tweet classification, specifically, LSTM and Bi-LSTM models, on the task of classifying crisis tweets both in terms of their informativeness (binary classification), as well as their humanitarian content (multi-class classification). For this purpose, we used several benchmark datasets for crisis tweet classification, namely CrisisBench, CrisisNLP and CrisisLex. Experimental results show that the performance of the CapsNet models is on a par with that of LSTM and Bi-LSTM models for all metrics considered, while the performance obtained with BERT models has surpassed that of the other three models across different datasets and classes for both classification tasks; thus, BERT could be considered the best overall model for classifying crisis tweets.

Short Papers
Paper Nr: 38
Title:

IntrusionHunter: Detection of Cyber Threats in Big Data

Authors:

Hashem Mohamed, Alia El Bolock and Caroline Sabty

Abstract: The rise of cyber-attacks has become a serious problem due to our growing reliance on technology, making it essential for both individuals and businesses to use efficient cybersecurity solutions. This work builds on previous work to improve the accuracy of intrusion detection systems by employing advanced classification techniques and an up-to-date dataset. In this work, we propose IntrusionHunter, an anomaly-based intrusion detection system operating on the CSE-CICIDS2018 dataset. IntrusionHunter classifies intrusions based on three models, each catering to different purposes: binary classification (2C), multiclass classification with 7 classes (7C), and multiclass classification with 15 classes (15C). Four main classification models were used: Random Forest, Extreme Gradient Boosting, Convolutional Neural Networks, and Deep Neural Networks. The results show that the Random Forest and XGBoost algorithms outperformed state-of-the-art intrusion detection systems in binary and multiclass classification (15 classes). The findings also show that the dataset imbalance needs to be addressed to improve the performance of deep learning techniques.

Paper Nr: 39
Title:

Automatic Classification of Quantitative Data from DNS Cache Servers into Stationary and Non-Stationary States Based on Clustering

Authors:

Hikofumi Suzuki and Katsumi Wasaki

Abstract: In this study, quantitative traffic data from DNS cache servers are classified as stationary or non-stationary. Then, unsupervised machine learning is performed using the classified traffic data. Among the 17 types of DNS traffic data examined, the A Record, MX, SOA Record, and AD Flag are considered. The correlation between A Record and AD Flag is difficult to detect using conventional clustering methods because they form zonal clusters under stationary-state conditions. Therefore, the number of clusters is calculated using the clustering algorithms Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Mean Shift, and the variational Bayesian Gaussian mixture model (VBGMM). The possibility of automatic classification is investigated.
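For reference, a minimal sketch of how the three named algorithms can be compared on two-dimensional traffic features with scikit-learn; the data below are synthetic placeholders, not the paper's DNS measurements.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler

# Illustrative 2-D feature matrix, e.g., A-record counts vs. AD-flag
# counts per time window (synthetic placeholder data).
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 2)))

# DBSCAN: density-based, discovers the number of clusters itself.
db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

# Mean Shift: mode-seeking, also infers the cluster count.
ms_labels = MeanShift().fit_predict(X)

# VBGMM: variational Bayes prunes unused components from an upper bound.
vb_labels = BayesianGaussianMixture(n_components=10, random_state=0).fit(X).predict(X)

print(len(set(db_labels) - {-1}), len(set(ms_labels)), len(set(vb_labels)))
```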

Paper Nr: 59
Title:

Distributed Edge Computing System for Vehicle Communication

Authors:

Rinith Pakala, Niket Kathiriya, Hossein Haeri, Satya P. Maddipatla, Kshitij Jerath, Craig Beal, Sean Brennan and Cindy Chen

Abstract: The development of communication technologies in edge computing has fostered progress across various applications, particularly those involving vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication. Enhanced infrastructure has improved data transmission network availability, promoting better connectivity and data collection from IoT devices. A notable IoT application is the Intelligent Transportation System (ITS). IoT technology integration enables ITS to access a variety of data sources, including those pertaining to weather and road conditions. Real-time data on factors like temperature, humidity, precipitation, and friction contribute to improved decision-making models. Traditionally, these models are trained at the cloud level, which can lead to communication and computational delays. However, substantial advancements in cloud-to-edge computing have decreased communication relays and increased computational distribution, resulting in faster response times. Despite these benefits, the developments still largely depend on central cloud sources for computation due to restrictions in computational and storage capacity at the edge. This reliance leads to duplicated data transfers between edge servers and cloud application servers. Additionally, edge computing is further complicated by data models predominantly based on data heuristics. In this paper, we propose a system that streamlines edge computing by allowing computation at the edge, thus reducing latency in responding to requests across distributed networks. Our system is also designed to facilitate quick updates of predictions, ensuring vehicles receive more pertinent safety-critical model predictions. We will demonstrate the construction of our system for V2V and V2I applications, incorporating cloud-ware, middleware, and vehicle-ware levels.

Paper Nr: 66
Title:

QFLS: A Cloud-Based Framework for Supporting Big Healthcare Data Management and Analytics from Big Data Lakes: Definitions, Requirements, Models and Techniques

Authors:

Alfredo Cuzzocrea and Selim Soufargi

Abstract: This paper introduces the definitions, requirements, models and techniques of the QUALITOP Federated Big Data Analytics Learning System (QFLS), a Cloud-based framework for supporting big healthcare data management and analytics from big data lakes. The QFLS anatomy and main functionalities are described, along with the main software solutions proposed with the framework.

Paper Nr: 119
Title:

GreenCC: A Hybrid Approach to Sustainably Validate Manufacturing Data in Industry 4.0 Environments

Authors:

Simon Paasche and Sven Groppe

Abstract: The era of big data streams forces companies to rethink their business models to gain competitive advantages. To make full use of the collected information, data must be of high quality. With big data, the impact of information and communications technology (ICT) is also increasing. The extended use of ICT leads to an increase in energy consumption and thus also in the CO2 footprint, both of which in turn result in high costs. A tradeoff between making use of the data and reducing the resources required for data acquisition and validation arises. Our work investigates how data validation in smart manufacturing environments can be implemented in an energy-efficient and resource-saving way. Therefore, we present a combination of a light consistency checker (LightCC) and a full consistency checker (FullCC), the latter of which can be activated in periods with a high probability of defects. Our LightCC uses heuristics to predict missing messages and identifies time frames with an increased likelihood of further inconsistencies. In these periods, our FullCC can be activated to perform an accurate validation. We call our developed system green consistency checker (GreenCC).

Area 2 - Business Analytics

Full Papers
Paper Nr: 14
Title:

Look at the Horizon: Evaluation of a Software Solution Against Cyber Sickness in Virtual Reality Applications

Authors:

Jonathan Harth, Christian-Norbert Zimmer and Michaela Zupanic

Abstract: Cyber sickness (CS), a condition that occurs in 30-80% of users when using virtual environments, is still considered an obstacle to the spread of virtual reality (VR). The aim of this study is to investigate whether symptoms of CS can be minimised by software adaptation. A prototype of a layer based on the "Seetroën Glasses" was used for this purpose. 80 students participated in the study with virtual roller coaster rides. The results show that the group with the layer was able to increase its performance and that the layer was able to delay the exit of the participants by about 2 laps. The layer does not provide immunity to CS, but it does delay the onset of symptoms. The study shows that the virtual test environment is suitable for investigating CS and that the prototype of the layer may be promising for reducing symptoms of CS.

Paper Nr: 17
Title:

Benchmarking Automated Machine Learning Methods for Price Forecasting Applications

Authors:

Horst Stühler, Marc-André Zöller, Dennis Klau, Alexandre Beiderwellen-Bedrikow and Christian Tutschku

Abstract: Price forecasting for used construction equipment is a challenging task due to spatial and temporal price fluctuations. It is thus of high interest to automate the forecasting process based on current market data. Even though applying machine learning (ML) to these data represents a promising approach to predicting the residual value of certain tools, it is hard to implement for small and medium-sized enterprises due to their insufficient ML expertise. To this end, we demonstrate the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions, which automatically generate the underlying pipelines. We combine AutoML methods with the domain knowledge of the companies. Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and a non-machine learning part. To take all complex industrial requirements into account and to demonstrate the applicability of our new approach, we designed a novel metric named method evaluation score, which incorporates the most important technical and non-technical metrics for quality and usability. Based on this metric, we show in a case study on the industrial use case of price forecasting that domain knowledge combined with AutoML can weaken the dependence on ML experts for innovative small and medium-sized enterprises that are interested in adopting such solutions.

Paper Nr: 34
Title:

Temporal Multidimensional Model for Evolving Graph-Based Data Warehouses

Authors:

Redha Benhissen, Fadila Bentayeb and Omar Boussaid

Abstract: Nowadays, companies are focusing on overhauling their data architecture, consolidating data and discarding legacy systems. Big data has a great impact on businesses since it helps companies to efficiently manage and analyse large volumes of data. In business intelligence and especially decision-making, data warehouses support OLAP technology, and they have been very useful for the efficient analysis of structured data. A data warehouse is built by collecting data from several data sources. However, big data refers to large sets of unstructured, semi-structured or structured data obtained from numerous sources. Many changes in the content and structure of these sources can occur. Therefore, these changes have to be reflected in the data warehouse using the bi-temporal approach for the data and versioning for the schema. In this paper, we propose a temporal multidimensional model using a graph formalism for multi-version data warehouses that is able to integrate the changes that occur in the data sources. The approach is based on multi-version evolution for schema changes and the bi-temporal labelling of the entities, as well as the relationships between them, for data evolution. Our proposal provides flexibility to the evolution of a data warehouse by increasing the analysis possibilities for users with the decision support system, and it allows flexible temporal queries to provide consistent results. We will present the overall approach, with a focus on the evolutionary treatment of the data, including dimensional changes. We validate our approach with a case study that illustrates temporal queries, and we carry out runtime performance tests for graph data warehouses.

Paper Nr: 72
Title:

A Proactive Approach for the Sustainable Management of Water Distribution Systems

Authors:

Sarah Di Grande, Mariaelena Berlotti, Salvatore Cavalieri and Roberto Gueli

Abstract: Today, water distribution systems need to supply water to consumers in a sustainable way. This is connected to the concept of Watergy, which means the satisfaction of user demand with the least possible use of water and energy resources. Thanks to modern technologies, the forecasting of water and energy demand can help achieve this goal. In particular, water demand forecasting allows water distribution companies to know in advance how water resources will be allocated, it can help identify any anomalies in water consumption, and it is essential for pump scheduling. On the other hand, energy consumption forecasting has other important roles, such as energy optimization, identification of anomalous consumption, and planning of energy load. The present paper aims to develop short-term water demand and energy forecasting models through innovative machine learning-based methodologies for the water distribution sector: global forecasting models, the N-Beats machine learning algorithm, and transfer learning approaches. These tools demonstrated very good performance in the creation of the aforementioned models.

Short Papers
Paper Nr: 23
Title:

Behavioral Recommender System for Process Automation Steps

Authors:

Mohammadreza Fani Sani, Fatemeh Nikraftar, Michal Sroka and Andrea Burattin

Abstract: Process automation is used to increase the performance of processes. One of the leading process automation tools is Microsoft Process Advisor. This tool requires users to select the corresponding connectors for the automation of different tasks, which can be a challenging endeavor for users who have limited business knowledge, as various connectors and templates exist. To overcome this challenge, we present a process-aware recommender system for connectors that eases the labeling task for end users. The results of applying this method to real event logs indicate that it can recommend relevant connectors and, therefore, the usage of the same mechanism might be generalized to broader contexts.

Paper Nr: 26
Title:

Exploring Functional Patterns of Driving Records by Interacting with Major Classes and Territory Using Generalized Additive Models

Authors:

Shengkun Xie, Anna T. Lawniczak and Clare Chua-Chow

Abstract: Studying the safe driver index, such as Driving Records (DR), is essential to auto insurance regulation. Part of the auto insurance regulation aims to estimate the relativity of major risk factors, including DR, to provide some benchmark values for auto insurance companies. The risk relativity of DR is often estimated either through an assessment via empirical loss cost or through a statistical modelling approach such as generalized linear models. However, these methods are only able to give an estimate on an integer level of DR. This work proposes a novel approach to estimating the risk relativity of DR via generalized additive models (GAM). This method makes the integer level of DR continuous, making it more flexible and practical. Extending the generalized linear model to GAM is critical, as investigating this new method could enhance applications of advanced statistical methods to actuarial practice, thus making the proposed methodology for analyzing the safe driver index more statistically sound. Furthermore, exploring functional patterns by interacting with major classes or territories allows us to find statistical evidence to justify the existence of correlations between risk factors. This may help address the issue of potential double penalties in insurance pricing and call for a solution to overcome this problem from a statistical perspective.
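For context, a schematic GAM specification with a smooth driving-record effect and a territory interaction; this is a generic form, not necessarily the paper's exact model.

```latex
% Schematic GAM with a smooth, now-continuous DR effect and a
% territory interaction (the paper's specification may differ):
g\big(\mathbb{E}[Y_i]\big) \;=\; \beta_0
  \;+\; f(\mathrm{DR}_i)
  \;+\; \sum_{k} \gamma_k\, \mathbf{1}[\mathrm{terr}_i = k]
  \;+\; \sum_{k} f_k(\mathrm{DR}_i)\, \mathbf{1}[\mathrm{terr}_i = k]
```

Here g is the link function, f a smooth spline over the continuous DR level, and the f_k let the DR curve vary by territory (or major class). Such models can be fit with standard GAM tooling, e.g., R's mgcv or Python's pygam.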

Paper Nr: 46
Title:

GUIDO: A Hybrid Approach to Guideline Discovery & Ordering from Natural Language Texts

Authors:

Nils Freyer, Dustin Thewes and Matthias Meinecke

Abstract: Extracting workflow nets from textual descriptions can be used to simplify guidelines or formalize textual descriptions of formal processes like business processes and algorithms. The task of manually extracting processes, however, requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive. Therefore, there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant and irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to the process model extraction task that first, classifies sentences regarding their relevance to the process model, using a BERT-based sentence classifier, and second, extracts a process model from the sentences classified as relevant, using dependency parsing. The presented approach achieves significantly better results than a pure rule-based approach. GUIDO achieves an average behavioral similarity score of 0.93. Still, in comparison to purely machine-learning-based approaches, the annotation costs stay low.

Paper Nr: 56
Title:

A Measure Data Catalog for Dashboard Management and Validation

Authors:

Bruno Oliveira, Ana Henriques, Óscar Oliveira, Ana Duarte, Vasco Santos, António Antunes and Elsa Cardoso

Abstract: The amount and diversity of data that organizations have to deal with intensify the importance and challenges associated with data management. In this context, data catalogs play a significant role, as they are used as a tool to find and understand data. With the emergence of new approaches to storing and exploring data, such as the Data Lake or Data Lakehouse, the requirements associated with building and maintaining the data catalog have evolved and represent an opportunity for organizations to develop their decision-making processes. This work explores a metric data catalog for analytical systems to support building, validating, and maintaining dashboards in a Business Intelligence system.

Paper Nr: 73
Title:

Detection and Prediction of Leakages in Water Distribution Networks

Authors:

Mariaelena Berlotti, Sarah Di Grande, Salvatore Cavalieri and Roberto Gueli

Abstract: Leakages are one of the main causes of water loss in a water distribution system (WDS). In recent years, the increasing amount of streaming data coming from sensors installed in the water network has allowed the health status of each asset of the WDS to be monitored. In this paper, a preliminary data-driven approach for leakage detection and prediction is proposed. Starting from the characteristics of a real water distribution network, a realistic leakage dataset has been created. Using this dataset, unsupervised rule-based time series algorithms have been trained for the detection and prediction of leakages.
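As an illustration, a minimal rule-based detector in the spirit described (rolling statistics over a flow series); the specific threshold rule is hypothetical, since the abstract does not state the paper's actual rules.

```python
import pandas as pd

def flag_leak_candidates(flow: pd.Series, window: str = "7D", k: float = 3.0) -> pd.Series:
    """Flag readings exceeding the rolling median by k rolling standard
    deviations (a hypothetical rule for illustration). `flow` must have
    a DatetimeIndex for time-based windows."""
    med = flow.rolling(window).median()
    std = flow.rolling(window).std()
    return flow > med + k * std

# Toy usage: constant flow with a sudden jump at the end.
idx = pd.date_range("2023-01-01", periods=96, freq="h")
flow = pd.Series([10.0] * 95 + [40.0], index=idx)
print(flag_leak_candidates(flow, window="24h").iloc[-1])  # True
```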

Paper Nr: 75
Title:

Clustering Object-Centric Event Logs

Authors:

Anahita F. Ghahfarokhi, Fatemeh Akoochekian, Fareed Zandkarimi and Wil P. van der Aalst

Abstract: Process mining provides various algorithms to analyze process executions based on event data. Process discovery, the most prominent category of process mining techniques, aims to discover process models from event logs. However, it leads to spaghetti models when working with real-life data. To reduce the complexity of process models, several clustering techniques have been proposed on top of event logs with a single case notion. However, in real-life processes often multiple objects are involved in a process. Recently, Object-Centric Event Logs (OCELs) have been introduced to capture the information of such processes, and several process discovery techniques have been developed on top of OCELs. Yet, the output of the discovery techniques leads to complex models. In this paper, we propose a clustering-based approach to cluster similar objects in OCELs to simplify the obtained process models. Using a case study of a real Business-to-Business (B2B) process, we demonstrate that our approach reduces the complexity of the models and generates coherent subsets of objects which help the end-users gain insights into the process.

Paper Nr: 97
Title:

Detecting Anomalies on Cryptocurrency Markets Using Graph Algorithms

Authors:

Agata Skorupka

Abstract: The low level of regulation of the cryptocurrency market, as well as the crucial role of trust and the specificity of the digital market, makes it a good environment for anonymous transactions without identity verification, and therefore for fraudulent activities. Examples of such anomalies include failing to fulfil a transaction, as well as different forms of market manipulation. As cryptocurrencies are incorporated into more and more investment portfolios, with big companies accepting payment by this means, anomalies on cryptocurrency markets may pose significant systemic risk. Therefore, there is a need to detect fraudulent users in a computationally efficient way. This paper presents the usage of graph algorithms for that purpose. While most of the literature is focused on using structural and classical embeddings, this research proposes utilizing node statistics to build an accurate model with less engineering overhead and less computational time involved.
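For illustration, node statistics can be computed with networkx and fed to an off-the-shelf classifier; the graph and feature set below are placeholders, not the paper's pipeline.

```python
import networkx as nx
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# A transaction graph (edges = transfers between user addresses);
# a built-in graph stands in for real data here.
G = nx.karate_club_graph()

# Cheap per-node statistics instead of learned embeddings.
features = pd.DataFrame({
    "degree": dict(G.degree()),
    "pagerank": nx.pagerank(G),
    "clustering": nx.clustering(G),
    "core_number": nx.core_number(G),
})

# labels = ...  # 1 = fraudulent user, 0 = legitimate (from ground truth)
# clf = GradientBoostingClassifier().fit(features.values, labels)
```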

Paper Nr: 102
Title:

A Multi-Factor Approach to Measure User Preference Similarity in Neighbor-Based Recommender Systems

Authors:

Ho H. Vy, Tiet G. Hong, Vu M. Hang, Cuong Pham-Nguyen and Le H. Nam

Abstract: Neighbor-based Collaborative filtering is one of the commonly applied techniques in recommender systems. It is highly appreciated for its interpretability and ease of implementation. The effectiveness of neighbor-based collaborative filtering depends on the selection of a user preference similarity measure to identify neighbor users. In this paper, we propose a user preference similarity measure named Multi-Factor Preference Similarity (MFPS). The distinctive feature of our proposed method is its efficient combination of the four key factors in determining user preference similarity: rating commodity, rating usefulness, rating details, and rating time. Our experiments have demonstrated that the combination of these factors in our proposed method has achieved good results on both experimental datasets: Movielens 100K and Personality-2018.
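As a hedged sketch, a weighted combination of per-factor similarities conveys the idea; the factor functions and weights below are illustrative placeholders using three simplified factors, while the actual four-factor MFPS formula is defined in the paper.

```python
def mfps(u, v, w=(0.4, 0.4, 0.2)):
    """Hypothetical multi-factor preference similarity: a weighted sum of
    (1) overlap of rated items, (2) closeness of rating values, and
    (3) proximity of rating times. Users are dicts: item -> (stars 1-5, day)."""
    common = u.keys() & v.keys()
    if not common:
        return 0.0
    overlap = len(common) / len(u.keys() | v.keys())
    detail = 1 - sum(abs(u[i][0] - v[i][0]) for i in common) / (4 * len(common))
    time_prox = 1 / (1 + sum(abs(u[i][1] - v[i][1]) for i in common) / len(common))
    return w[0] * overlap + w[1] * detail + w[2] * time_prox

alice = {"m1": (5, 10), "m2": (3, 40)}
bob = {"m1": (4, 12), "m3": (2, 50)}
print(mfps(alice, bob))  # a score in [0, 1]
```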

Paper Nr: 118
Title:

Optimized Fusion Chain with Web3 Storage

Authors:

Tahiya T. Lisa and Chaiyachet Saivichit

Abstract: The Internet of Things (IoT) has become increasingly popular over the past two decades. It has a significant impact on many aspects of daily life, including smart cities, intelligent transportation, manufacturing, and several other industries. The processing and networking abilities of the Internet of Things are critical in today’s highly technological environment. These systems are also reasonably priced when used on embedded platforms and consume little power (Ismail and Materwala, 2019). Nearly all IoT devices have limited storage and random access memory with 8-bit or 16-bit microcontrollers (Bosamia and Patel, 2020). Emerging technologies, such as blockchains, can be adapted to accommodate IoT networks’ and devices’ limitations and requirements. Nowadays, blockchain has become the most secure medium of data transmission. The main goal of this paper is the usage of web3 storage in the fusion chain platform, along with the fusion chain’s lightweight block structure, which can be used by IoT devices with minimal CPU power consumption. Moreover, the development of two consensus algorithms for the performance evaluation of the fusion chain blockchain to balance the computational power with the Internet of Things is proposed.

Paper Nr: 13
Title:

A Visual Analysis of Hazardous Events in Contract Risk Management

Authors:

Georgios Stathis, Giulia Biagioni, Athanasios Trantas, Jaap van den Herik and Bart Custers

Abstract: This article proposes a new visual analysis method of hazardous events to be used in contract risk management. We present our research work for the creation of an ontological extension of the Onassis Ontology to manage, analyse and visualise risk data. Onassis is an openly accessible ontology that we designed earlier to structure contract automation data. Onassis and its extension for risk management contribute to the development of trustworthy Intelligent Contracts (iContracts). They allow for the creation of explicit data out of usually implicit contractual information and legal processes, on which it is possible to perform cross-referencing analysis with other collections of data. The ontological model that resulted from our study additionally contributes to the disambiguation of the bow-tie method structure, the primary method for analysing and visualising hazardous events. To achieve this, we use the following methodology. We visualise the bow-tie method in an ontology and then investigate the presence of taxonomic ambiguities or even errors in its structure. The results present an enriched version of the bow-tie conceptualisation, in which entities and relationships are translated into openly-accessible and ready-to-use ontological terms, whereas risk analysis becomes visible.

Paper Nr: 92
Title:

An Ontology-Based Collaborative Business Intelligence Framework

Authors:

Muhammad Fahad and Jérôme Darmont

Abstract: Business Intelligence constitutes a set of methodologies and tools aiming at querying, reporting, on-line analytic processing (OLAP), generating alerts, performing business analytics, etc. When these tasks need to be performed collectively by different collaborators, a Collaborative Business Intelligence (CBI) platform is needed. CBI plays a significant role in targeting a common goal among various companies, but it requires them to connect, organize and coordinate with each other to share opportunities, respecting their own autonomy and heterogeneity. This paper presents a CBI platform that democratizes data by allowing BI users to easily connect, share and visualize data among collaborators, obtain actionable answers by collaborative analysis, investigate and make collaborative decisions, and also store the analyses along with graphical diagrams and charts in a collaborative ontology knowledge base. Our CBI platform builds a dashboard to persist collaborative analysis, supports an interactive interface for tracking collaborative session data and also provides customizable features to edit, update and build new ones from existing graphs, diagrams and charts. Our CBI framework supports and assists information sharing, collaborative decision-making and annotation management beyond the boundaries of individuals and enterprises.

Paper Nr: 100
Title:

Dense Information Retrieval on a Latin Digital Library via LaBSE and LatinBERT Embeddings

Authors:

Federico A. Galatolo, Gabriele Martino, Mario A. Cimino and Chiara O. Tommasi

Abstract: Dense Information Retrieval (DIR) has recently gained attention due to the advances in deep learning-based word embedding. In particular, for historical languages such as Latin, a DIR task is appropriate although challenging, due to: (i) the complexity of managing searches using traditional Natural Language Processing (NLP); (ii) the availability of fewer resources with respect to modern languages; (iii) the large variation in usage among different eras. In this research, pre-trained transformer models are used as feature extractors, to carry out a search on a Latin Digital Library. The system computes embeddings of sentences using state-of-the-art models, i.e., Latin BERT and LaBSE, and uses cosine distance to retrieve the most similar sentences. The paper delineates the system development and summarizes an evaluation of its performance using a quantitative metric based on experts’ per-query document rankings. The proposed design is suitable for other historical languages. Early results show the higher potential of the LaBSE model, encouraging further comparative research. To foster further development, the data and source code have been publicly released.
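For illustration, dense retrieval with LaBSE embeddings and cosine similarity can be sketched with the sentence-transformers library (Latin BERT is loaded differently and is omitted here); the sentences are toy examples, not the library's corpus.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

corpus = ["Gallia est omnis divisa in partes tres.",
          "Arma virumque cano, Troiae qui primus ab oris."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# LaBSE is multilingual, so the query need not be in Latin.
query_emb = model.encode("the division of Gaul", convert_to_tensor=True)
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```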

Paper Nr: 106
Title:

Decision Support System for Corporate Reputation Based Social Media Listening Using a Cross-Source Sentiment Analysis Engine

Authors:

R. E. Loke and S. Pathak

Abstract: This paper presents a Decision Support System (DSS) that helps companies with corporate reputation (CR) estimates of their respective brands by collecting feedback provided on their products and services and deriving state-of-the-art key performance indicators. A Sentiment Analysis Engine (SAE) is at the core of the proposed DSS that makes it possible to monitor, estimate, and classify clients’ sentiments in terms of polarity, as expressed in public comments on social media (SM) company channels. The SAE is built on machine learning (ML) text classification models that are cross-source trained and validated with real data streams from a platform like Trustpilot that specializes in user reviews and tested on unseen comments gathered from a collection of public company pages and channels on a social networking platform like Facebook. Such cross-source opinion analysis remains a challenge and is highly relevant in the disciplines of research and engineering in which a sentiment classifier for an unlabeled destination domain is assisted by a tagged source task (Singh and Jaiswal, 2022). The best performance in terms of F1 score was obtained with a multinomial naive Bayes model: 0.87 for validation and 0.74 for testing.
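A minimal sketch of a multinomial naive Bayes text classifier of the kind evaluated, using scikit-learn with toy data; the paper's exact features and preprocessing are not given in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins: train on review-site comments, then apply to comments
# from a different source (the cross-source setting).
train_texts = ["great service and fast delivery",
               "awful support, never again",
               "okay experience overall"]
train_labels = ["positive", "negative", "neutral"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["the support team was great"]))
```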

Paper Nr: 107
Title:

EREO: An Effective Rule Evaluation Framework for Discovering Interesting Patterns in US Birth Data and Beyond

Authors:

Abhilash C. B. and Kavi Mahesh

Abstract: Birth data holds immense importance in healthcare for several reasons. It offers a comprehensive and representative sample of the population, enabling the identification of patterns and trends that can significantly impact public health policies and interventions. However, extracting interesting patterns from the vast birth data attributes poses a domain-specific and challenging problem. We can derive intriguing patterns by utilizing rare rules for identifying interesting associations. The level of interestingness depends on various factors, including the user, data, and domain. To address this, we propose the Effective Rule Evaluation using Ontology (EREO) framework, which incorporates two modes of rule evaluation. Firstly, the Integrated Rule Information Content (IRIC) measure is employed to quantify the level of interestingness. Secondly, the interesting rules are assessed by domain experts. The combined approach of these two modes of evaluation confirms the level of interestingness of the derived rules. The study demonstrates a significant relationship between these two modes of assessment, providing evidence of the convergence between expert evaluations and the ontology-based association rule measurements. This connection adds further value to the field by contributing to the understanding and measurement of interestingness within the context of ontology-based association rules.

Paper Nr: 112
Title:

A Design Framework for a Blockchain-Based Open Market Platform of Enriched Card-Based Transactional Data for Big Data Analytics and Open Banking

Authors:

Trevor Toy and Josef Langerman

Abstract: Around a quarter of the world’s data is generated by financial institutions. The Capgemini 2022 World Payments Report predicts a 28% increase in transaction volumes from 2021 to 2026, to an estimated total of 2.122 trillion global non-cash transactions. There is a growing demand for accessible transactional data for analytical purposes and to support the rapid global adoption of Open Banking. Open banking is a collaborative business model involving customer-authorised transactional data sharing with other unaffiliated parties to allow for enhanced service and product offerings to the marketplace. This research explores utilising distributed ledger technology to facilitate the market mechanism of securely sharing data through an integrated and decentralised platform that conforms to the expected regulatory and compliance standards of the financial industry from which the data is generated. Scalable and accessible access is a core requirement of a marketplace platform for its data consumers and producers. To enable customer-authorised transactional data sharing, an incentive mechanism is proposed, which includes the data subject in the process to empower them to control access and earn money from the related transactional data that they generate. A proposed framework is defined for the development of a marketplace platform that can ultimately support the growth, prosperity and development of economies, businesses, communities and individuals, by providing accessible and relevant transactional data for big data analytics and open banking.

Paper Nr: 114
Title:

AutoImpute: An Autonomous Web Tool for Data Imputation Based on Extremely Randomized Trees

Authors:

Mustafa Alabadla, Fatimah Sidi, Iskandar Ishak, Hamidah Ibrahim, Hazlina Hamdan, Shahril I. Amir and Appak Y. Nurlankyzy

Abstract: Missing values are one of the main causes of performance degradation, among other issues. An inaccurate prediction might result from incorrect imputation of missing variables. A critical step in the study of healthcare information is the imputation of uncertain or missing data. As a result, there has been a significant increase in the development of software tools designed to assist machine learning users in completing their data sets prior to entering them into training algorithms. This study fills the gap by proposing an autonomous imputation application that uses the Extremely Randomised Trees Imputation method to impute mixed-type missing data. The proposed imputation tool provides public users the option to remotely impute their data sets using either of two modes: standard or autonomous. As pointed out in the experimental part, the proposed imputation tool performs better than traditional methods for imputation of missing data on various missing ratios and achieved accurate results for autonomous imputation.
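For reference, one standard way to realize Extremely-Randomized-Trees-based imputation is scikit-learn's IterativeImputer with an ExtraTreesRegressor as the per-column estimator; whether AutoImpute is implemented exactly this way is an assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

X = np.array([[7.0, 2.0], [4.0, np.nan], [10.0, 5.0], [np.nan, 3.0]])

# Iterative (round-robin) imputation: each column with missing values is
# regressed on the others using Extremely Randomized Trees.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=100, random_state=0),
    random_state=0,
)
print(imputer.fit_transform(X))
```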

Area 3 - Data Science

Full Papers
Paper Nr: 52
Title:

Predicting Academic Performance of Low-Income Students in Public Ecuadorian Online Universities: An Educational Data Mining Approach

Authors:

Jorge Rodas-Silva and Jorge Parraga-Alava

Abstract: The success of higher education institutions in the online learning environment can be measured by the performance of students. Identifying backgrounds or factors that increase the academic success rate of online students is especially helpful for educational decision-makers to adequately plan actions to promote successful outcomes in this digital landscape. In this paper, we identify the factors that contribute to the academic success of students in public Ecuadorian online universities and develop a predictive model to aid in improving their performance. Our approach involved five stages. Data collection and description involved gathering data from universities, including social, demographic, and academic features. In the preprocessing stage, the data were cleaned and transformed to prepare them for analysis. Modeling involved applying machine learning algorithms to identify patterns and key factors to predict student outcomes. This was validated in the next stage, where the performance of the feature selection and predictive models was assessed. In the last stage, the results of the analysis were interpreted with respect to the factors that contribute to the academic success of low-income students in online universities in Ecuador. The results suggest that the grade in the leveling course, the family income, and the age of the student mainly influence their academic performance. The best performances were achieved with Boruta + Random Forest and LVQ + SVM, reaching an accuracy of 75.24% and 68.63% for binary (Pass/Fail) and multiclass (Average/Good/Excellent) academic performance prediction, respectively.
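As an illustration of the best-performing combination (Boruta feature selection followed by a Random Forest), a minimal sketch with the boruta package; the data below are synthetic placeholders, not the study's student records.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# X: numeric student features (leveling-course grade, family income, age, ...)
# y: binary outcome (Pass / Fail); toy shapes for illustration.
X = np.random.RandomState(0).normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
selector = BorutaPy(rf, n_estimators="auto", random_state=0)
selector.fit(X, y)                      # Boruta expects numpy arrays

X_selected = selector.transform(X)      # keep only confirmed features
clf = RandomForestClassifier(random_state=0).fit(X_selected, y)
```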

Paper Nr: 60
Title:

Identifying High-Quality Training Data for Misinformation Detection

Authors:

Jaren Haber, Kornraphop Kawintiranon, Lisa Singh, Alexander Chen, Aidan Pizzo, Anna Pogrebivsky and Joyce Yang

Abstract: Misinformation spread through social media poses a grave threat to public health, interfering with the best scientific evidence available. This spread was particularly visible during the COVID-19 pandemic. To track and curb misinformation, an essential first step is to detect it. One component of misinformation detection is finding examples of misinformation posts that can serve as training data for misinformation detection algorithms. In this paper, we focus on the challenge of collecting high-quality training data in misinformation detection applications. To that end, we demonstrate the effectiveness of a simple methodology and show its viability on five myths related to COVID-19. Our methodology incorporates both dictionary-based sampling and predictions from weak learners to identify a reasonable number of myth examples for data labeling. To aid researchers in adjusting this methodology for specific use cases, we use word usage entropy to describe when fewer iterations of sampling and training will be needed to obtain high-quality samples. Finally, we present a case study that shows the prevalence of three of our myths on Twitter at the beginning of the pandemic.
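For reference, a minimal sketch of word usage entropy as the Shannon entropy of a word distribution over a sample of posts; the paper's exact definition may differ.

```python
import math
from collections import Counter

def word_usage_entropy(texts):
    """Shannon entropy (bits) of the word distribution across posts; lower
    values suggest a more concentrated vocabulary, which is one plausible
    reading of when fewer sampling/training iterations are needed."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(word_usage_entropy(["5g towers spread covid", "covid spread by 5g"]))
```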

Paper Nr: 63
Title:

Hawkes Processes on Social and Mass Media: A Causal Study of the #BlackLivesMatter Movement in the Summer of 2020

Authors:

Alfred Lindström, Simon Lindgren and Raazesh Sainudiin

Abstract: In this work we study interactions in social media and the reports in mass media during the Black Lives Matter (BLM) protests following the death of George Floyd. We implement open-source pipelines to process the data at scale and employ the self-exciting counting process known as the Hawkes process to address our main question: is there a causal relation between interactions in social media and reports of street protests in mass media? Specifically, we use distributed label propagation to identify such interactions on Twitter that supported the BLM movement, and compare the timing of these interactions to that of news reports of street protests mentioning George Floyd, via the Global Database of Events, Language, and Tone (GDELT) Project. The comparison was made through a bivariate Hawkes process model for a formal hypothesis test of Granger-causality. We show that interactions in social media that supported the BLM movement, at the beginning of nationwide protests, caused the global mass media reports of street protests in solidarity with the movement. This suggests that BLM activists have harnessed social media to mobilise street protests across the planet.
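For context, the standard bivariate Hawkes intensity with exponential kernels; the paper's exact parameterization may differ.

```latex
% Stream 1: social-media interactions, stream 2: mass-media protest
% reports. Each event in stream j at time t_k^j excites stream i:
\lambda_i(t) \;=\; \mu_i \;+\; \sum_{j=1}^{2} \alpha_{ij}
    \sum_{t_k^{j} < t} \beta_{ij}\, e^{-\beta_{ij}\,(t - t_k^{j})},
\qquad i \in \{1,2\}
```

In this form, Granger non-causality from stream j to stream i corresponds to the null hypothesis that the cross-excitation coefficient α_ij is zero, which is one standard way to cast the formal test the abstract mentions.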

Paper Nr: 65
Title:

Conv-LSTM for Real Time Monitoring of the Mineral Grades in the Flotation Froth

Authors:

Ahmed Bendaouia, El H. Abdelwahed, Sara Qassimi, Abdelmalek Boussetta, Intissar Benzakour, Oumkeltoum Amar, François Bourzeix, Achraf Soulala and Oussama Hasidi

Abstract: Accurate monitoring of the mineral grades in the flotation froth is crucial for efficient minerals separation in the mining industry. In this study, we propose the use of ConvLSTM, a type of neural network that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to create a model that can extract spatial and temporal patterns from flotation froth video data. Our model enables the analysis of both spatial and temporal patterns, making it useful for understanding the dynamic behavior of the froth surface in the flotation processes. Using ConvLSTM, we developed a more accurate and reliable model for monitoring and controlling the flotation froth quality. Our results demonstrate the effectiveness of our approach, with a mean absolute error (MAE) of 2.59 at a lead, copper, and zinc differential flotation site. Our findings suggest that artificial intelligence can be an effective tool for improving flotation monitoring and control, with potential applications in other areas of the mining industry.
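As an illustration, a minimal ConvLSTM regression model in Keras; shapes and layer sizes are placeholders, since the abstract does not give the architecture details.

```python
import tensorflow as tf

# Clips of froth video -> per-clip grade estimates (e.g., Pb, Cu, Zn).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 64, 64, 3)),    # 16 frames, 64x64 RGB
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3),   # spatio-temporal features
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3),                        # three mineral grades
])
model.compile(optimizer="adam", loss="mae")          # matches the MAE metric
```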

Paper Nr: 69
Title:

Heterogeneous Ensemble Learning for Modelling Species Distribution: A Case Study of Redstarts Habitat Suitability

Authors:

Omar El Alaoui and Ali Idri

Abstract: Habitat protection is a critical aspect of species conservation, as restoring a habitat to its former state after it has been destroyed can be difficult. Species Distribution Models (SDMs), also known as habitat suitability models, are commonly used to address this issue. These models yield ecological and evolutionary insights by linking species occurrence records to environmental data. Machine learning (ML) algorithms have recently been used to predict the distribution of species. Yet, a single ML algorithm may not always yield accurate predictions for a given dataset, making it challenging to develop a highly accurate model using a single algorithm. Therefore, this study proposes a novel approach to assess the habitat suitability of three redstarts species based on ensemble learning techniques. Initially, eight machine learning algorithms, including MultiLayer Perceptron (MLP), Support Vector Machine (SVM), K-nearest neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GB), Random Forest (RF), AdaBoost (AB), and Quadratic Discriminant Analysis (QDA), were trained as base-learners. Subsequently, based on the performance of these base-learners, seven heterogeneous ensembles of two to eight models were constructed for each species dataset. The performance of the proposed approach was evaluated using five performance criteria (accuracy, sensitivity, specificity, AUC, and Kappa), the Scott Knott (SK) test to statistically compare the performance of the presented models, and the Borda Count voting method to rank the best performing models based on multiple performance criteria. The findings revealed that the heterogeneous ensembles outperformed their single base-learners on all three species datasets, underscoring the efficacy of the proposed approach in modelling species distribution.
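For illustration, a heterogeneous soft-voting ensemble over several of the named base-learners can be assembled with scikit-learn; this is a sketch of the general technique, not the paper's exact construction.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Four of the eight base-learners combined into one heterogeneous
# ensemble; the paper builds ensembles of 2-8 models per species.
ensemble = VotingClassifier(
    estimators=[("mlp", MLPClassifier(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier())],
    voting="soft",   # average the predicted class probabilities
)
# ensemble.fit(X_train, y_train)  # X: environmental covariates,
#                                 # y: species presence / absence
```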

Paper Nr: 74
Title:

Rigor in Applied Data Science Research Based on DSR: A Literature Review

Authors:

Daniel Szafarski, Tobias Schmieg, Laslo Welz and Thomas Schäffer

Abstract: Design Science Research (DSR) enjoys increasing popularity in the field of information systems due to its practical relevance and focus on design. A study from 2012 shows that DSR publications in general exhibit weak rigor in connection with the selection and use of research methods. At the same time, there has also been a recent increase in Data Science publications based on the paradigm of DSR. Therefore, this study analyzes the rigor and the specific characteristics of the application of DSR based on 62 publications from this field. Major deficits are observed in a large part of the sample regarding the rigorous documentation of the scientific process as well as the selection and citation of adequate research methods. Overall, 77.4% of the analyzed publications were therefore characterized as weak in regard to their rigor. One explanation is the novel combination of DSR and Data Science, together with the speed at which new findings are obtained and published.

Paper Nr: 78
Title:

Extracting Frequent Gradual Patterns Based on SAT

Authors:

Jerry Lonlac, Imen O. Dlala, Said Jabbour, Engelbert M. Nguifo, Badran Raddaoui and Lakhdar Sais

Abstract: This paper proposes a constraint-based modeling approach for mining frequent gradual patterns from numerical data. Our declarative approach provides a principled way to take advantage of recent advancements in satisfiability testing and several features of modern SAT solvers to enumerate gradual patterns. Interestingly, our approach can easily be extended with extra requirements, such as temporal constraints used to extract more specific patterns in a broad range of gradual pattern mining applications. An empirical evaluation on two real-world datasets shows the efficiency of our approach.
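For reference, one common formalization of gradual patterns; the paper's definitions may differ in detail.

```latex
% A gradual pattern pairs attributes with variations:
g = \{(A,\uparrow),\ (B,\downarrow)\}
\qquad \text{``the higher $A$, the lower $B$''}
% An ordered object pair $(o, o')$ respects $g$ iff
a(o) < a(o') \ \ \forall (a,\uparrow) \in g
\qquad \text{and} \qquad
a(o) > a(o') \ \ \forall (a,\downarrow) \in g
```

Under a common support definition, a pattern is frequent when a long enough sequence of objects can be ordered consistently with it; a SAT encoding can then enumerate the patterns whose support clears the threshold.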

Paper Nr: 85
Title:

An Advanced BERT LayerSum Model for Sentiment Classification of COVID-19 Tweets

Authors:

Areeba Umair and Elio Masciari

Abstract: The new coronavirus that triggered the global pandemic COVID-19 has had a profound effect on attitudes among people all around the world. People and communities have experienced a wide range of feelings and attitudes as a result of the pandemic. There was a great deal of apprehension following the original COVID-19 epidemic. People were worried about getting the infection or spreading it to their loved ones. These worries were heightened by the disease’s unknown nature and quick dissemination. This paper proposes a novel model for sentiment analysis of tweets related to the COVID-19 pandemic. The proposed model leverages BERT as a base model and improves the last four layers of BERT for the sentiment analysis task. The embeddings of the last four layers of BERT are stacked and then summed, and the obtained embeddings are concatenated with the classification token [CLS]. The goal of the study is twofold: we categorize tweets into positive, negative, and neutral sentiments and we classify the user sentiment. The paper highlights the importance of sentiment analysis in tracking public opinion and sentiment towards the COVID-19 pandemic and demonstrates the effectiveness of the proposed model in accurately classifying the sentiment of tweets related to COVID-19. The proposed model is evaluated and compared with four widely used models: KNN, SVM, Naïve Bayes, and BERT, on a dataset of tweets labeled as positive, negative, or neutral. The results show that our proposed model achieved the highest accuracy, precision, and recall for negative sentiment classification compared to other models, indicating its effectiveness in sentiment analysis. The proposed model can be used for analyzing sentiment in order to provide valuable insights for decision-making processes.
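A hedged sketch of one plausible reading of the LayerSum feature construction with Hugging Face Transformers; the exact wiring of the features and classification head is defined in the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased",
                                 output_hidden_states=True)

enc = tokenizer("vaccines are rolling out", return_tensors="pt")
out = bert(**enc)

# Stack and sum the last four encoder layers, then concatenate the
# summed [CLS] embedding with the final-layer [CLS] token (assumption:
# one of several possible readings of the abstract).
last_four = torch.stack(out.hidden_states[-4:]).sum(dim=0)  # (B, T, 768)
summed_cls = last_four[:, 0]                                # (B, 768)
final_cls = out.last_hidden_state[:, 0]                     # (B, 768)
features = torch.cat([summed_cls, final_cls], dim=-1)       # (B, 1536)
# logits = torch.nn.Linear(1536, 3)(features)  # pos / neg / neutral head
```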

Paper Nr: 89
Title:

Prediction of QT Prolongation in Advanced Breast Cancer Patients Using Survival Modelling Algorithms

Authors:

Asmir Vodenčarević, Julia Kreuzeder, Achim Wöckel and Peter A. Fasching

Abstract: Advanced breast cancer includes locally advanced disease and metastatic breast cancer with distant metastasis in other organs like lung, liver, brain and bone. While it cannot be cured, its progression can be controlled by modern treatments including targeted therapies. However, these therapies as well as certain risk factors like advanced age can facilitate toxicities such as prolongation of the time interval between the start of the Q wave and the end of the T wave in a patient’s electrocardiogram. This could lead to serious life-threatening issues like cardiac arrhythmia. In this paper, we addressed the issue of individual, patient-level prediction of QT prolongation in advanced breast cancer patients treated with the CDK4/6-inhibitor ribociclib. By formulating the prediction task as a survival analysis problem, we were able to apply five conventional statistical and machine learning survival modelling algorithms to both clinical trial and real-world data in order to train and externally validate prediction models. The Cox proportional hazards model regularized by elastic net reached an external, cross-study validation performance (c-index based on inverse probability of censoring weights) of 0.63 on the real-world data and 0.71 on the clinical trial data. The most important predictive factors included baseline electrocardiogram features and patient quality of life.
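For illustration, an elastic-net-regularized Cox model can be fit with scikit-survival; the data below are synthetic stand-ins, as the study's clinical data are not public.

```python
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in data: rows = patients, columns = baseline features
# (e.g., ECG measurements, age, quality-of-life scores).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
time = rng.exponential(scale=np.exp(-X[:, 0]), size=100) + 0.1
event = rng.random(100) < 0.7            # True = QT prolongation observed
y = Surv.from_arrays(event=event, time=time)

# Cox proportional hazards with an elastic-net penalty, as in the paper.
model = CoxnetSurvivalAnalysis(l1_ratio=0.5).fit(X, y)
risk_scores = model.predict(X[:5])       # higher score = higher hazard
```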

Paper Nr: 93
Title:

Anomaly Detection of Medical IoT Traffic Using Machine Learning

Authors:

Lerina Aversano, Mario L. Bernardi, Marta Cimitile, Debora Montano, Riccardo Pecori and Luca Veltri

Abstract: Although Internet traffic detection and categorization have been extensively researched over the last decades, it remains a hot issue in the Internet of Things (IoT) context, mainly when traffic is generated in medical structures. Theoretically, it is possible to apply classical methods for IoT traffic categorization and to detect traffic addressed to intelligent devices present in hospital rooms. The problem is always obtaining a proper medical IoT traffic dataset. In this work, we have created a synthetic dataset of IoT traffic generated by different smart devices placed in different hospital rooms. For creating the medical IoT traffic, we have exploited IoT-Flock, an open-source tool for IoT traffic generation supporting CoAP and MQTT, the most used IoT protocols. We have performed, for the first time, a multinomial classification of IoT-Flock-generated traffic considering both normal traffic and packets of different attacks. The classification has been performed by comparing both traditional machine learning techniques and deep learning network models composed of several hidden layers. The obtained results are very encouraging and can confirm the usability of IoT-Flock data to test and train machine and deep learning models to detect abnormal IoT traffic in a medical scenario.

Paper Nr: 104
Title:

Analyzing Cyber-Physical Systems in Cars: A Case Study

Authors:

Harry H. Beyel, Omar Makke, Fangbo Yuan, Oleg Gusikhin and Wil M. van der Aalst

Abstract: Cyber-physical systems connected to the internet are generating unprecedented volumes of data. Understanding cyber-physical systems’ behavior using collected data is becoming increasingly important. Process-mining techniques consider sequences of events and thus can be used to check and verify how such cyber-physical systems operate. The data captured by cyber-physical systems are typically noisy and are not readily suitable for process mining. In this work, we present how a stream of connected-vehicle data can be transformed into an event log suitable for process mining. By applying different process-discovery techniques, we discover de-facto models that capture the behavior of an assistance system embedded in cars. We apply conformance-checking techniques and consult domain experts to find the best de-facto model. In addition, we apply conformance-checking methods to a preexisting, de-jure model that we transformed into a Petri net. We compare both models and point out differences. In this process, we show how we overcome challenges and highlight why applying process-mining techniques in the cyber-physical systems domain is valuable.
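For context, discovering a de-facto model and checking its conformance can be sketched with the pm4py library; the log file name and case notion are illustrative, and the paper describes its own data preparation.

```python
import pm4py

# Event log derived from the connected-vehicle data stream: one case per
# drive/session, one event per assistance-system message (illustrative).
log = pm4py.read_xes("vehicle_assistance_events.xes")

# Discover a de-facto Petri net and check how well it replays the log.
net, im, fm = pm4py.discover_petri_net_inductive(log)
fitness = pm4py.fitness_token_based_replay(log, net, im, fm)
print(fitness)

# A de-jure model (documented system behavior) converted to a Petri net
# could be checked against the same log in the same way.
```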

Short Papers
Paper Nr: 11
Title:

Fundus Unimodal and Late Fusion Multimodal Diabetic Retinopathy Grading

Authors:

Sara El-Ateif and Ali Idri

Abstract: Diabetic Retinopathy (DR) is an eye disease whose complications, if left untreated, grow and are split into four grades: mild, moderate, severe, and proliferative. We propose to (1) compare and evaluate three different recently used deep learning models: EfficientNet-B5, Swin Transformer, and Hybrid-EfficientNetB0-SwinTF (HES) on the APTOS 2019 dataset’s fundus and early fused (EF) weighted gaussian blur fundus images; (2) evaluate three fine-tuning and pre-processing schemes on the best model; and (3) choose the best model-scheme per modality and perform late fusion on them to get the final DR grade. Results show that our best method, the late fusion HES model, achieves an F1-score of 81.21%, an accuracy of 81.83%, and an AUC of 96.30%. We propose using the late fusion HES model in population-wide diagnosis to assist doctors in Morocco and reduce the DR burden.
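As a sketch, late fusion by averaging the two unimodal models' per-grade probabilities; the paper's exact fusion rule may be weighted differently.

```python
import numpy as np

def late_fuse(prob_fundus: np.ndarray, prob_ef: np.ndarray) -> np.ndarray:
    """Average the per-grade softmax outputs of the two unimodal models
    and take the argmax (simple averaging fusion for illustration)."""
    fused = (prob_fundus + prob_ef) / 2.0
    return fused.argmax(axis=1)

# Five classes: no DR / mild / moderate / severe / proliferative.
p_fundus = np.array([[0.10, 0.60, 0.20, 0.05, 0.05]])
p_ef = np.array([[0.20, 0.30, 0.40, 0.05, 0.05]])
print(late_fuse(p_fundus, p_ef))  # -> [1], i.e., "mild"
```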

Paper Nr: 27
Title:

Astronomical Images Quality Assessment with Automated Machine Learning

Authors:

Olivier Parisot, Pierrick Bruneau and Patrik Hitzelberger

Abstract: Electronically Assisted Astronomy consists in capturing deep sky images with a digital camera coupled to a telescope to display views of celestial objects that would have been invisible through direct observation. This practice generates a large quantity of data, which may then be enhanced with dedicated image editing software after observation sessions. In this study, we show how Image Quality Assessment can be useful for automatically rating astronomical images, and we also develop a dedicated model by using Automated Machine Learning.

Paper Nr: 40
Title:

Characterizing Speed Performance of Multi-Agent Reinforcement Learning

Authors:

Samuel Wiggins, Yuan Meng, Rajgopal Kannan and Viktor Prasanna

Abstract: Multi-Agent Reinforcement Learning (MARL) has achieved significant success in large-scale AI systems and big-data applications such as smart grids, surveillance, etc. Existing advancements in MARL algorithms focus on improving the rewards obtained by introducing various mechanisms for inter-agent cooperation. However, these optimizations are usually compute- and memory-intensive, thus leading to suboptimal speed performance in end-to-end training time. In this work, we analyze the speed performance (i.e., latency-bounded throughput) as the key metric in MARL implementations. Specifically, we first introduce a taxonomy of MARL algorithms from an acceleration perspective categorized by (1) training scheme and (2) communication method. Using our taxonomy, we identify three state-of-the-art MARL algorithms - Multi-Agent Deep Deterministic Policy Gradient (MADDPG), Target-oriented Multi-agent Communication and Cooperation (ToM2C), and Networked Multi-agent RL (NeurComm) - as target benchmark algorithms, and provide a systematic analysis of their performance bottlenecks on a homogeneous multi-core CPU platform. We justify the need for MARL latency-bounded throughput to be a key performance metric in future literature while also addressing opportunities for parallelization and acceleration.

Paper Nr: 48
Title:

Does Categorical Encoding Affect the Interpretability of a Multilayer Perceptron for Breast Cancer Classification?

Authors:

Hajar Hakkoum, Ali Idri, Ibtissam Abnane and José L. Fernades-Aleman

Abstract: The lack of transparency in machine learning black-box models continues to be an impediment to their adoption in critical domains such as medicine, in which human lives are involved. Historical medical datasets often contain categorical attributes that are used to represent the categories or progression levels of a parameter or disease. The literature has shown that the manner in which these categorical attributes are handled in the preprocessing phase can affect accuracy, but little attention has been paid to interpretability. The objective of this study was to empirically evaluate a simple multilayer perceptron network when trained to diagnose breast cancer with ordinal and one-hot categorical encoding, and interpreted using a decision tree global surrogate and the Shapley Additive exPlanations (SHAP). The results obtained on the basis of Spearman fidelity show the poor performance of MLP with both encodings, but a slight preference for one-hot. Further evaluations are required with more datasets and categorical encodings to analyse their impact on model interpretability.
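For reference, the two categorical encodings compared in the study can be illustrated with scikit-learn (a recent version is assumed for the sparse_output argument).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A categorical attribute with an inherent order, e.g., a progression level.
grades = np.array([["low"], ["medium"], ["high"], ["medium"]])

# Ordinal: one column of ordered integers.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ordinal.fit_transform(grades).ravel())   # [0. 1. 2. 1.]

# One-hot: one binary column per category, no order implied.
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(grades))
```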
Download
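
The interpretation pipeline described above, a decision tree global surrogate evaluated with Spearman fidelity, can be sketched as follows; the dataset, network size, and surrogate depth are illustrative stand-ins, not the paper's exact setup:

import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
mlp.fit(X_tr, y_tr)

# Global surrogate: a decision tree trained to mimic the black-box outputs.
black_box_probs = mlp.predict_proba(X_tr)[:, 1]
surrogate = DecisionTreeRegressor(max_depth=4, random_state=0)
surrogate.fit(X_tr, black_box_probs)

# Spearman fidelity: rank correlation between surrogate and MLP predictions.
rho, _ = spearmanr(surrogate.predict(X_te), mlp.predict_proba(X_te)[:, 1])
print(f"Spearman fidelity: {rho:.3f}")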

Paper Nr: 51
Title:

Enriching Relation Extraction with OpenIE

Authors:

Alessandro Temperoni, Maria Biryukov and Martin Theobald

Abstract: Relation extraction (RE) is a sub-discipline of information extraction (IE) which focuses on the prediction of a relational predicate from a natural-language input unit. Together with named-entity recognition (NER) and disambiguation (NED), RE forms the basis for many advanced IE tasks such as knowledge-base (KB) population and verification. In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE by encoding structured information about the sentences’ principal units, such as subjects, objects, verbal phrases, and adverbials, into various forms of vectorized (and hence unstructured) representations of the sentences. Our main conjecture is that the decomposition of long and possibly convoluted sentences into multiple smaller clauses via OpenIE even helps to fine-tune context-sensitive language models such as BERT (and its plethora of variants) for RE. Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models compared to existing RE approaches. Our best results reach F1 scores of 92% and 71% on KnowledgeNet and FewRel, respectively, proving the effectiveness of our approach on competitive benchmarks.
Download

Paper Nr: 54
Title:

CryptonDL: Encrypted Image Classification Using Deep Learning Models

Authors:

Adham Helbawy, Mahmoud Bahaa and Alia El Bolock

Abstract: Deep Neural Networks (DNNs) have surpassed traditional machine learning algorithms due to their superior performance in big data analysis in various applications. Fully homomorphic encryption (FHE) contributes to machine learning classification, as it supports homomorphic operations over encrypted data without decryption. In this paper, we propose a deep learning model, CryptonDL, that utilizes TenSEAL’s CKKS scheme to encrypt three image datasets and then classify each encrypted image. The model is first trained on the unencrypted image datasets using a PyTorch convolutional neural network; the trained network’s weights are then reused in an encrypted convolutional neural network, so that each image is encrypted and only the prediction results are decrypted. Our model follows TenSEAL’s reference implementation but is optimized to achieve higher accuracy than TenSEAL’s original model. CryptonDL achieved an encrypted image classification accuracy of 98.32 percent and an F1 score of 0.9832 on the MNIST dataset, an accuracy of 88 percent and an F1 score of 0.8811 on Fashion MNIST, and an accuracy of 92 percent and an F1 score of 0.9207 on Kuzushiji MNIST. CryptonDL shows that encrypted image classification can be achieved with high accuracy without using pre-trained models.
Download
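
A minimal sketch of the encryption side, assuming TenSEAL's CKKS vectors as named in the abstract; the context parameters and the toy linear step are illustrative, not CryptonDL's actual configuration:

import tenseal as ts

# CKKS context; these modulus sizes and scale are common illustrative values.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()  # needed for rotations in conv/pooling layers

pixels = [0.0, 0.5, 1.0, 0.25]           # a toy flattened, normalized image
enc_image = ts.ckks_vector(context, pixels)

# Homomorphic ops run directly on the ciphertext, e.g. one linear-layer step:
enc_out = enc_image.dot([0.1, 0.2, 0.3, 0.4])
print(enc_out.decrypt())  # only the prediction result is decrypted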

Paper Nr: 55
Title:

A Data Analysis Pipeline for Automating Apple Trait Analysis and Prediction

Authors:

Kyle Ranslam and Ramon Lawrence

Abstract: Feeding the world’s growing population requires research and development of fruit varieties that can be sustainably grown with high yields and quality and require low inputs of water and fertilizer. The process of developing new fruit varieties is data-intensive and traditionally uses manual processes that do not scale. The contribution of this work is a data analysis pipeline that automates the extraction of fruit characteristics from images and integrates multiple data sources (images, field measurements, human evaluation) to help direct the research to the most promising candidates and reduce the amount of manual time required for data collection and analysis. Initial results demonstrate that the image analysis is accurate and can be done at scale in a real-world environment.
Download
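
As a hedged sketch of what automated trait extraction from studio images can look like (the file name, thresholding choice, and the two traits are illustrative, not the paper's pipeline):

import cv2
import numpy as np

# Load a studio photo of a single apple (hypothetical path).
img = cv2.imread("apple_001.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# The largest external contour is assumed to be the fruit.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
apple = max(contours, key=cv2.contourArea)

area_px = cv2.contourArea(apple)                 # size proxy in pixels
(x, y), radius = cv2.minEnclosingCircle(apple)
roundness = area_px / (np.pi * radius ** 2)      # 1.0 = perfect circle
print(f"area={area_px:.0f}px, roundness={roundness:.2f}")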

Paper Nr: 64
Title:

Detection of Drowsy Driving Using Wearable Sensors

Authors:

Duarte Pereira, Brigida M. Faria and Luis P. Reis

Abstract: Drowsy driving is one of the leading causes of traffic accidents. Some solutions provide feedback when the driver is drowsy; however, few tackle the issue in a way that allows for portability and early prediction. This study focuses on drowsiness detection during driving. Wearable sensors are used to provide a low-cost, portable, automated, and non-intrusive solution. The wearable sensors chosen for biosignal acquisition are Empatica’s E4 wristband for heart activity and the Brainlink Pro for brain activity. Features were mainly in the time and time-frequency domains, and algorithms such as Nearest Neighbours, Radial Basis Function, Support Vector Machine, Decision Tree, Random Forest, Multi-layer Perceptron, Naive Bayes, and Logistic Regression were trained and validated on a database developed for this study (11 adults with normal last-night sleep and 2 without any last-night sleep). Participants answered the Pittsburgh and the Satisfaction, Alertness, Timing, Efficiency and Duration questionnaires, after which photoplethysmography and electroencephalography signals were acquired during driving in a simulation environment. The practice-run discrimination and individual classification had comparable results, both slightly above average (70 to 80%). The evaluation metrics showed that the discrimination of sleep-deprived exams yielded significantly better results. This suggests that the proposed methodology is capable of classifying sleep deprivation and surpasses existing approaches in portability.
Download

Paper Nr: 71
Title:

Effects of Environmental Conditions on Historic Buildings: Interpretable Versus Accurate Exploratory Data Analysis

Authors:

Marco Parola, Hajar Dirrhami, Mario A. Cimino and Nunziante Squeglia

Abstract: The goal of structural health monitoring is to continuously assess the structural integrity and performance of a building or structure over time. This is achieved by collecting data on various structural parameters and using this data to identify potential areas of concern or damage. A critical challenge is that some properties are severely affected by recurrent variations of external factors. These variations in environmental and operational conditions (such as humidity, temperature, and traffic) can mask the variability in structural behavior caused by structural damage and make it difficult to identify the damage of interest. In this paper, we present a study on how regression analysis and deep learning can be used to measure the influence of environmental factors on the structural behavior of the Leaning Tower of Pisa. Transparent linear regressors offer the benefit of being simple to understand and interpret. They can provide insights about the relationship between input and target variables, as well as the relative importance of each input in forecasting the outcome. Deep learning models, on the other hand, are capable of learning nonlinear relationships between input and target variables. Finally, this work discusses the accuracy-interpretability trade-off for structural health monitoring.
Download

Paper Nr: 95
Title:

Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction

Authors:

Christoph Brosch, Alexander Bouwens, Sebastian Bast, Swen Haab and Rolf Krieger

Abstract: The enormous progress in the field of artificial intelligence (AI) enables retail companies to automate their processes and thus to save costs. Thereby, many AI-based automation approaches are based on machine learning and computer vision. The realization of such approaches requires high-quality training data. In this paper, we describe the creation process of an annotated dataset that contains 1,034 images of single food products, taken under studio conditions, annotated with 5 class labels and 30 object detection labels, which can be used for product recognition and classification tasks. We based all images and labels on standards presented by GS1, a global non-profit organisation. The objective of our work is to support the development of machine learning models in the retail domain and to provide a reference process for creating the necessary training data.
Download

Paper Nr: 101
Title:

Performance Evaluation and Comparison of a New Regression Algorithm

Authors:

Sabina Gooljar, Kris Manohar and Patrick Hosein

Abstract: In recent years, Machine Learning algorithms, in particular supervised learning techniques, have been shown to be very effective in solving regression problems. We compare the performance of a newly proposed regression algorithm against four conventional machine learning algorithms, namely Decision Trees, Random Forest, k-Nearest Neighbours, and XGBoost. The proposed algorithm was presented in detail in a previous paper, but detailed comparisons were not included. We perform an in-depth comparison, using the Mean Absolute Error (MAE) as the performance metric, on a diverse set of datasets to illustrate the great potential and robustness of the proposed approach. The reader is free to replicate our results, since we have provided the source code in a GitHub repository and the datasets are publicly available.
Download
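
The comparison protocol can be reproduced along these lines; the proposed algorithm itself lives in the authors' repository, so only the four conventional baselines and the MAE metric named in the abstract are sketched here, on a stand-in public dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baselines = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "k-NN": KNeighborsRegressor(),
    "XGBoost": XGBRegressor(random_state=0),
}
for name, model in baselines.items():
    mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: MAE = {mae:.3f}")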

Paper Nr: 103
Title:

Union k-Fold Feature Selection on Microarray Data

Authors:

Artur J. Ferreira and Mário T. Figueiredo

Abstract: Cancer detection from microarray data is an important problem to be handled by machine learning techniques. This type of data poses many challenges to machine learning techniques, namely because it usually has a large number of features (genes) and a small number of instances (patients). Moreover, it is important to characterize which genes are the most important for a given classification task, providing explainability for the classification. In this paper, we propose a feature selection approach for microarray data, which is an extension of the recently proposed k-fold feature selection algorithm. We propose performing the union of the feature subspaces found independently by two feature selection filters, each of which has individually been proven adequate for this type of data. The experimental results show that the union of the subsets of features found by each filter, in some cases, produces better results than the use of either individual filter, yielding human-manageable subsets of features.
Download
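
A minimal sketch of the union step, with two generic sklearn filters standing in for the specific filters used in the paper; the data shapes mimic the many-genes/few-patients setting:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Toy microarray-like data: many features (genes), few instances (patients).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)

def selected(score_func, k=50):
    # Indices of the k top-scoring features according to one filter.
    return set(SelectKBest(score_func, k=k).fit(X, y).get_support(indices=True))

# Union of the feature subspaces found independently by the two filters.
union = sorted(selected(f_classif) | selected(mutual_info_classif))
X_reduced = X[:, union]
print(len(union), "genes kept out of", X.shape[1])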

Short Papers
Paper Nr: 20
Title:

Pump and Dump Cryptocurrency Detection Using Social Media

Authors:

Domenico Alfano, Roberto Abbruzzese and Domenico Parente

Abstract: The economic implications behind the fluctuation of cryptocurrency prices and, more importantly, the complexity of the variables involved in the process have made price forecasting a very popular topic among researchers, especially around detecting Pump & Dump events, in which investors try to manipulate cryptocurrency owners into buying or selling so as to profit from them. Over the last decade, research has progressed by proposing new metrics (financial and non-financial) capable of influencing and tracking the reasons for price fluctuations. Thanks to the advent of social media, major investment communities can be analysed through social channels to create new metrics. With developments in the field of Natural Language Processing, these social channels are used to extract the opinions and moods of expert investors and cryptocurrency owners. We propose to apply these innovative ways of creating metrics and to demonstrate that taking the generated metrics into account can significantly outperform existing Pump & Dump detection methods. Moreover, to measure how each created metric contributes to the detection, we use SHapley Additive exPlanations (SHAP), a game-theoretic attribution approach, and LIME, a method that explains each prediction of any black-box machine learning model using a local, interpretable surrogate.
Download
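
The two attribution methods named in the abstract can be applied roughly as follows; the features and the detector model here are placeholders, not the paper's actual metrics or classifier:

import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-ins for the social-media metrics
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for Pump & Dump labels
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global, per-feature attributions (classic SHAP API; result shapes vary by version).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Local explanation of a single prediction via LIME.
lime = LimeTabularExplainer(X, feature_names=["m1", "m2", "m3", "m4"])
explanation = lime.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())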

Paper Nr: 22
Title:

Embedding-Enhanced Similarity Metrics for Next POI Recommendation

Authors:

Sara Jarrad, Hubert Naacke, Stephane Gancarski and Modou Gueye

Abstract: Social media platforms allow users to share information, including photos and tags, and connect with their peers. This data can be used for innovative research, such as proposing personalized travel destination recommendations based on user-generated traces. This study aims to demonstrate the value of using embeddings, which are dense real-valued vectors representing each visited location, in generating recommendations for the next Point of Interest (POI) to visit based on the last POI visited. The Word2Vec language model is used to generate these embeddings by considering POIs as words and sequences of POIs as sentences. This model captures contextual information and identifies similar contexts based on the proximity of numerical vectors. Empirical experiments conducted on a real dataset show that embedding-based methods outperform conventional methods in predicting the next POI to visit.
Download
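
A minimal sketch of the embedding step described above, treating POIs as words and visit sequences as sentences; the POI names and hyperparameters are illustrative:

from gensim.models import Word2Vec

# Each user's visit history is a "sentence" of POI identifiers ("words").
visit_sequences = [
    ["louvre", "tuileries", "orsay", "eiffel"],
    ["louvre", "orsay", "eiffel", "trocadero"],
    ["tuileries", "louvre", "orsay"],
]

model = Word2Vec(visit_sequences, vector_size=32, window=2, min_count=1, sg=1)

# Candidates for the next POI: those whose embeddings are closest to the
# last visited POI, which is the basis of the next-POI ranking.
print(model.wv.most_similar("orsay", topn=2))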

Paper Nr: 25
Title:

Predicting How Much a Consumer Is Willing to Pay for a Bottle of Wine: Dealing With Data Imbalance

Authors:

Hugo Alonso and Teresa Candeias

Abstract: The wine industry has become increasingly important worldwide and is one of the most significant industries in Portugal. In a previous paper, the problem of predicting how much a Portuguese consumer is willing to pay for a bottle of wine was considered for the first time. The problem was treated as a multi-class ordinal classification task. Although we achieved globally good prediction results, it was difficult to identify the rare cases of consumers who are willing to pay for more expensive wines. We found that this was a direct consequence of data imbalance. Therefore, here, we present a first attempt to deal with this issue, based on the use of re-sampling strategies to balance the training data, namely random under-sampling, random over-sampling with replacement, and the synthetic minority over-sampling technique (SMOTE). We consider several learning methods and develop various predictive models. A comparative study is carried out, and its results highlight the importance of a careful choice of the re-sampling strategy and the learning method in order to obtain the best possible prediction results.
Download
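
The three re-sampling strategies named in the abstract map directly onto the imbalanced-learn library; a minimal sketch on toy binary data (the paper's task is multi-class ordinal, so this is illustrative only):

import numpy as np
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the wine-consumer price classes.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

for name, sampler in [
    ("random under-sampling", RandomUnderSampler(random_state=0)),
    ("random over-sampling", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
]:
    X_res, y_res = sampler.fit_resample(X, y)
    classes, counts = np.unique(y_res, return_counts=True)
    print(name, dict(zip(classes, counts)))  # balanced class counts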

Paper Nr: 33
Title:

Building a Dataset for Trip Style Assessment Based on Real Trip Data

Authors:

Luís P. Loureiro, Artur J. Ferreira and André R. Lourenço

Abstract: In most countries, to be permitted to drive a vehicle on public roads one must have insurance against civil liability. In many cases, the insurance fee depends on the age of the driver, the number of years one has held a driving license, and the driving history. The usual assumption taken by insurance companies, that younger drivers are always riskier than others, is not always correct and penalizes good young drivers. In this paper, we follow a pay-as-you-drive approach based on trip behavior data of different drivers. First, we build a dataset from real trip data. Then, we apply a two-stage clustering approach to the dataset to identify trip profiles. The experimental results show that we can cluster and identify distinct trip profiles, in which many trips have a non-aggressive style, some have an aggressive style, and only a few are risky-style trips. Our solution finds application in fair insurance fee calculation and fleet management tasks, for instance.
Download

Paper Nr: 37
Title:

Decomposition Heuristic for the Aircraft Sequencing Problem: Impact on Mathematical and Constraint Programming

Authors:

Joana Leite, Rafael Guedes and Diogo Queirós

Abstract: In this paper, we revisit the Aircraft Sequencing Problem (ASP), which consists of scheduling aircraft landings respecting pre-determined time windows and separation criteria. The ASP has several versions, with the static single-runway version being the one with the longest solving times on the benchmark instances, for both mixed integer programming (MIP) and constraint programming (CP) implementations. We considered this version of the problem and addressed the possibility of using parallel processing to solve it. For this purpose, we developed a heuristic for splitting the instances, which always guarantees a feasible solution and yields the optimal solution if a set of conditions is satisfied. The splitting allows for parallel processing and opens the possibility of using the best method to solve each subset of the partition obtained. To explore this feature, we also analysed the performance of the MIP and CP implementations and constructed a measure that points to the faster one. For the benchmark instances, the results show a time reduction of over 50% in cases where the optimal solution is known, and an improvement of over 12% in the value of the best-known feasible solution in cases where the optimal solution is not known and running time has to be limited.
Download
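
For readers unfamiliar with the MIP side, a toy single-runway ASP with time windows and a uniform separation can be written as follows (PuLP with the bundled CBC solver; the instance, separation, and big-M value are illustrative):

from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

# Toy instance: (earliest, target, latest) landing time per aircraft.
windows = [(0, 10, 30), (5, 12, 40), (8, 25, 50)]
sep = 4      # minimum separation between consecutive landings
M = 1000     # big-M constant for the ordering disjunction

prob = LpProblem("toy_asp", LpMinimize)
t = [LpVariable(f"t{i}", lowBound=lo, upBound=hi)
     for i, (lo, _, hi) in enumerate(windows)]
dev = [LpVariable(f"dev{i}", lowBound=0) for i in range(len(windows))]
order = {(i, j): LpVariable(f"before_{i}_{j}", cat=LpBinary)
         for i in range(len(windows)) for j in range(len(windows)) if i < j}

prob += lpSum(dev)  # minimize total deviation from target times
for i, (_, target, _) in enumerate(windows):
    prob += dev[i] >= t[i] - target
    prob += dev[i] >= target - t[i]
for (i, j), before in order.items():
    prob += t[j] >= t[i] + sep - M * (1 - before)  # if before=1, i lands first
    prob += t[i] >= t[j] + sep - M * before        # otherwise j lands first

prob.solve(PULP_CBC_CMD(msg=0))
print([v.value() for v in t])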

Paper Nr: 47
Title:

Integrating Autoencoder-Based Hybrid Models into Cervical Carcinoma Prediction from Liquid-Based Cytology

Authors:

Ferdaous Idlahcen, Ali Idri and Hasnae Zerouaoui

Abstract: Artificial intelligence (AI)-assisted cervical cytology is poised to enhance sensitivity whilst lessening bias, labor, and time expenses. It typically involves image processing and deep learning to automatically recognize pre-cancerous lesions on a given whole-slide image (WSI) prior to lethal invasive cancer development. Here, we introduce autoencoder (AE)-based hybrid models for cervical carcinoma prediction on the Mendeley liquid-based cytology dataset. These are built on fourteen combinations of an AE, DenseNet-201, and six state-of-the-art classifiers: adaptive boosting (AdaBoost), support vector machine (SVM), multilayer perceptron (MLP), decision tree (DT), k-nearest neighbors (k-NN), and random forest (RF). For empirical evaluation, four performance metrics, the Scott-Knott (SK) test, and the Borda count voting scheme were used. The AE-based hybrid models integrating AdaBoost, MLP, and RF as classifiers are among the top-ranked architectures, with respective accuracy values of 99.30%, 99.20%, and 98.48%. Yet, DenseNet-201 remains a solid option when adopting an end-to-end training strategy.
Download
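
One way to read the "AE-based hybrid" design is: deep features are compressed by an autoencoder, and a classical classifier is trained on the compressed codes. A minimal sketch under that assumption, with random arrays standing in for DenseNet-201 features of the cytology images (all sizes are illustrative):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from tensorflow import keras

X = np.random.rand(200, 1024).astype("float32")   # stand-in deep features
y = np.random.randint(0, 2, size=200)             # stand-in labels

# A small dense autoencoder with a 128-dimensional bottleneck.
inp = keras.Input(shape=(1024,))
code = keras.layers.Dense(128, activation="relu")(inp)
out = keras.layers.Dense(1024, activation="sigmoid")(code)
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# Hybrid step: a classical classifier is fitted on the compressed codes.
encoder = keras.Model(inp, code)
Z = encoder.predict(X, verbose=0)
clf = AdaBoostClassifier(random_state=0).fit(Z, y)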

Paper Nr: 80
Title:

The Imbalance Data Handling of XGBoost in Insurance Fraud Detection

Authors:

Nathanael T. Averro, Hendri Murfi and Gianinna Ardaneswari

Abstract: Insurance fraud is an emerging problem threatening the insurance industry because of its potentially severe losses. Many conventional efforts have been implemented to detect fraud, such as releasing blacklists and investigating every claim more deeply, but these efforts tend to consume considerable financial resources. Because of that, machine learning is proposed as a decision support system to detect potential insurance fraud. Insurance fraud detection problems often involve data with an imbalanced class distribution. This paper examines the imbalanced-class handling of XGBoost in predicting insurance fraud. Our simulation shows that the weighted-XGBoost outperforms the other approaches in handling the imbalanced class problem. The imbalance-XGBoost models are quite reliable in improving the base models: they can reach up to a 28% improvement in the recall score on the minority class compared to the basic XGBoost model. The precision score of both imbalance-XGBoost models decreases, while the weighted-XGBoost model improves precision and recall simultaneously.
Download
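
A minimal sketch of the weighted-XGBoost idea on toy data: the minority (fraud) class is up-weighted via scale_pos_weight, a standard XGBoost parameter; the dataset and split are illustrative:

from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy imbalanced claims data (1 = fraud) standing in for the real dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Up-weight positives (fraud) by the negative/positive ratio of the training set.
ratio = float((y_tr == 0).sum()) / (y_tr == 1).sum()
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss", random_state=0)
clf.fit(X_tr, y_tr)
print("minority-class recall:", recall_score(y_te, clf.predict(X_te)))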

Paper Nr: 81
Title:

BERT-Based Hybrid Deep Learning with Text Augmentation for Sentiment Analysis of Indonesian Hotel Reviews

Authors:

Maxwell Thomson, Hendri Murfi and Gianinna Ardaneswari

Abstract: Indonesia’s tourism industry plays a significant role in the country’s economic growth. Despite being impacted by COVID-19, the occupancy rate of hotels in June 2022 reached 50.28%, surpassing the previous record of 49.17% in January 2020. As hotel occupancy rates rise, it becomes increasingly important to analyze customer reviews of hotels through sentiment analysis to categorize the emotions expressed in the reviews. While a BERT-based hybrid deep learning model has been shown to perform well in sentiment analysis, class imbalance is often a problem. To address this, text augmentation provides a way to increase the amount of minority-class training data from the existing data. This paper evaluates five word-level text augmentation methods for the BERT-based hybrid model on classifying sentiments in Indonesian hotel reviews. Our simulations show that the text augmentation methods can improve model performance for all datasets and evaluation measures. Moreover, the random swap method achieves the highest precision and specificity on two of the three datasets.
Download
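
The random swap augmentation highlighted in the abstract is simple to implement at the word level; a minimal sketch, with an illustrative sample review:

import random

def random_swap(text, n_swaps=2, seed=None):
    # Word-level random swap: exchange the positions of n random word pairs.
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

review = "kamar bersih dan pelayanan sangat ramah"  # "clean room and very friendly service"
print(random_swap(review, n_swaps=2, seed=42))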

Paper Nr: 91
Title:

Optimization of Surgery Scheduling Problems Based on Prescriptive Analytics

Authors:

João Lopes, Gonçalo Vieira, Rita Veloso, Susana Ferreira, Maria Salazar and Manuel F. Santos

Abstract: Surgery scheduling plays a crucial role in modern healthcare systems, ensuring efficient use of resources, minimising patient waiting times and improving organisations’ operational performance. Additionally, healthcare faces enormous challenges, with a general modernisation of all clinical and administrative processes expected, requiring organisations to keep up with the latest advances in Information Technology. The scheduling of surgeries is crucial to the good functioning of hospitals, and the management of waiting lists is directly related to this process; the COVID-19 pandemic caused a significant increase in waiting times in some specialities. Surgery scheduling is considered a highly complex problem, influenced by numerous factors such as resource availability, operating shifts, patient priorities and scheduling restrictions, posing significant challenges to healthcare providers. In this research, in collaboration with one of the leading hospitals in Portugal, the Centro Hospitalar Universitário de Santo António (CHUdSA), we propose an approach based on Prescriptive Analytics, using optimisation algorithms and evaluating their performance in the management of the operating room. The results demonstrate the feasibility of this approach, taking into account the number of surgeries to be scheduled and the surgical spaces available over time, while respecting the priority of each surgery on the waiting list.
Download

Paper Nr: 113
Title:

The Application of Affective Measures in Text-Based Emotion Aware Recommender Systems

Authors:

John K. Leung, Igor Griva, William G. Kennedy, Jason M. Kinser, Sohyun Park and Seo Y. Lee

Abstract: This paper presents an innovative approach to two problems researchers face in Emotion Aware Recommender Systems (EARS): the difficulty and cost of collecting large volumes of good-quality emotion-tagged datasets, and the lack of an effective way to protect users’ emotional data privacy. Without enough good-quality emotion-tagged datasets, researchers cannot conduct repeatable affective computing research in EARS that generates personalized recommendations based on users’ emotional preferences. Similarly, if we fail to fully protect users’ emotional data privacy, users could resist engaging with EARS services. This paper introduces a method that detects affective features in subjective passages using Generative Pre-trained Transformer technology, forming the basis of the affective index and Affective Index Indicator (AII) and eliminating the need for users to build an affective feature detection mechanism. The paper advocates a Separation of Responsibility approach in which users protect their emotional profile data while EARS service providers refrain from retaining or storing it. Service providers can update users’ affective indices in memory without saving their private data, providing affective-aware recommendations without compromising user privacy. This paper offers a solution to the subjectivity and variability of emotions, data privacy concerns, and evaluation metrics and benchmarks, paving the way for future EARS research.
Download

Paper Nr: 120
Title:

A Successive Quadratic Approximation Approach for Tuning Parameters in a Previously Proposed Regression Algorithm

Authors:

Patrick Hosein, Kris Manohar and Ken Manohar

Abstract: We investigate a previously proposed regression algorithm that provides excellent performance but requires significant computing resources for parameter optimization. We summarize this previously proposed algorithm and introduce an efficient approach for parameter tuning. The speedup provided by this optimization approach is illustrated over a wide range of examples. This speedup in parameter tuning increases the practicability of the proposed regression algorithm.
Download

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 15
Title:

Towards a Low-Code Tool for Developing Data Quality Rules

Authors:

Timon S. Klann, Marcel Altendeitering and Falk Howar

Abstract: High-quality data sets are vital for organizations as they promote business innovation and the creation of data-driven products and services. Data quality rules are a common approach to assess and enforce compliance with business domain knowledge and to ensure the correctness of data sets. These data sets are usually subject to numerous data quality rules to allow for varying requirements and to represent the needs of different stakeholders. However, established data quality tools have a rather technical user interface and lack support for inexperienced users, thus hindering them from specifying their data quality requirements. In this study, we present a tool for the user-friendly and collaborative development of data quality rules. Conceptually, our tool realizes a domain-specific language, which enables the graphical creation of rules using common data quality constraints. For the implementation, we relied on CINCO, a tool for creating domain-specific visual modeling solutions, and Great Expectations, an open-source data validation framework. The evaluation of our prototype was two-fold, comprising expert interviews and a focus group discussion. Overall, our solution was well received and can contribute to making data quality tools more accessible.
Download
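
For illustration, data quality rules of the kind the tool generates can be expressed with Great Expectations roughly as follows, using its classic pandas-backed API (the API surface has changed across releases, so treat this as a version-dependent sketch; the columns and rules are made up):

import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.de", "b@x.de", None],
    "age": [34, 29, 51],
}))

# Data quality rules expressed as expectations.
df.expect_column_values_to_not_be_null("email")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_unique("customer_id")

# Validate the whole suite; fails here because one email is missing.
print(df.validate().success)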

Paper Nr: 94
Title:

Semantic, Technical and Legal Interoperability of European Company Open Data in Practice: The STIRData Approach

Authors:

Jakub Klímek, Alexandros Chortaras, Jakub Míšek, Jim J. Yang, Steinar Skagemo and Vassilis Tzouvaras

Abstract: As part of the Open Data Directive, the European Commission has published a list of high-value datasets (HVDs) that public sector bodies must make available as open data. The list also contains specific data items that must be included in these datasets. However, it does not prescribe any technical means of how the data should be published, severely hindering the interoperability of the datasets once they are published. One of the HVD topics is company data. In this practice report paper, we present results of STIRData, a project co-financed by the Connecting Europe Facility Programme of the European Union, focusing on technical, semantic, and legal interoperability of open data from business registries, covering the company data HVDs topic. The results include a data architecture and a data specification to make the published data technically and semantically interoperable, and legal interoperability guidelines to ensure legal interoperability of the published data. Moreover, proof-of-concept transformations of data from selected European business registries are shown using open source tools and according to the specification. Finally, a user-orientated platform for browsing and analysing the data is presented as an example of the possibilities of using the data published in an interoperable way.
Download

Short Papers
Paper Nr: 6
Title:

Structuring the End of the Data Life Cycle

Authors:

Daniel Tebernum and Falk Howar

Abstract: Data is an important asset and managing it effectively and appropriately can give companies a competitive advantage. Therefore, it should be assumed that data engineering considers and improves all phases of the data life cycle. However, data deletion does not seem to be prominent in theory and practice. We believe this is for two reasons. First, the added value in deleting data is not always immediately apparent or has a noticeable effect. Second, to the best of our knowledge, there is a lack of structured elaboration on the topic of data deletion that provides a more holistic perspective on the issue and makes the topic approachable to a greater audience. In this paper, an extensive systematic literature review is conducted to explore the topic of data deletion. Based on this, we present a data deletion taxonomy to organize the subject area and to further professionalize data deletion as part of data engineering. The results are expected to help both researchers and practitioners to address the end of the data life cycle in a more structured way.
Download

Paper Nr: 21
Title:

Insights from Big Data Economy of Finnish Company Trust, Consumer Confidence and Data Exchange: An Empirical Evidence of Structural Equation Modelling

Authors:

Sunday A. Olaleye

Abstract: In the emerging big data economy, data exchange under management information systems is increasingly relevant to society and organizations. Recent studies have contributed to the literature on data exchange by investigating security, privacy, errors, and risk issues. However, less attention has been paid to the impact of company trust and consumer confidence on data exchange. Interoperability is a panacea for data exchange: it refers to the capacity of information systems to share data and information, reducing the likelihood of information gaps and blind spots. This study fills the gaps in the literature on data exchange and the big data economy by giving a deeper understanding of data exchange through the influence of company trust and consumer confidence, using the lens of structural equation modelling, thus consolidating the big data economy and management information systems. The results of the tested model indicate direct relationships, through the path coefficients, between company trust, consumer confidence and data exchange. The study emphasizes its theoretical contribution and managerial implications and gives directions for future research.
Download

Paper Nr: 96
Title:

Geo-Semantic Event-POI Matching of Large Mobility Datasets

Authors:

Ndiouma Bame, Ibrahima Gueye and Hubert Naacke

Abstract: Users often share data about their daily activities through social networks. These event data are very useful for a variety of use cases, such as point of interest (POI) recommendation. However, event data often lack information about POIs; thus, enriching event data with POI information is of the utmost importance. This requires knowing the POI at which an event took place before the data can be completed. We face the problem of aligning two types of data sources, event data and POI data, which is difficult because they have neither a common identifier nor the same descriptive attributes. This work proposes and implements a complete methodology for enriching a large dataset of geolocated user events with POIs, using both geographical and semantic properties. This methodology for matching POIs with geolocated events comprises four steps: (i) first, we cross-reference the data using spatial proximity to define the geographical neighborhood of each event; (ii) second, we define the semantic neighborhood of each event based on a threshold on semantic similarity, where the semantic similarity exploits event data such as the contextual description and tags, crossing them with those of the POIs; (iii) these two types of similarity are then combined, for each POI in the event’s semantic neighborhood, to compute a geo-semantic similarity score; (iv) finally, each event is matched with the POI of its semantic neighborhood that maximizes the geo-semantic similarity score. We propose a robust modeling of our methodology and evaluate the effectiveness of our approach.
Download
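
A minimal sketch of step (iii), the geo-semantic similarity score: spatial proximity (haversine distance) is combined with a tag-overlap similarity, where Jaccard overlap stands in for the paper's semantic similarity; the weighting, radius, and sample POIs are illustrative:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def geo_semantic_score(event, poi, alpha=0.5, radius_km=1.0):
    # Combine spatial proximity and tag overlap (Jaccard) into one score.
    d = haversine_km(event["lat"], event["lon"], poi["lat"], poi["lon"])
    geo = max(0.0, 1.0 - d / radius_km)               # 1 at the POI, 0 beyond radius
    tags_e, tags_p = set(event["tags"]), set(poi["tags"])
    sem = len(tags_e & tags_p) / len(tags_e | tags_p)  # semantic similarity proxy
    return alpha * geo + (1 - alpha) * sem

event = {"lat": 48.8606, "lon": 2.3376, "tags": {"museum", "art", "paris"}}
pois = [
    {"name": "Louvre", "lat": 48.8611, "lon": 2.3364, "tags": {"museum", "art"}},
    {"name": "Tuileries", "lat": 48.8635, "lon": 2.3275, "tags": {"garden", "paris"}},
]
# Step (iv): match the event with the POI maximizing the score.
best = max(pois, key=lambda p: geo_semantic_score(event, p))
print(best["name"])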

Paper Nr: 105
Title:

Towards Data Ecosystems in Smart Cities: Success Factors of Urban Data Spaces

Authors:

Josh Haberkern and Thomas Schäffer

Abstract: In planning and designing public urban services, cities are increasingly relying on digital systems and data. Urban Data Spaces represent the data ecosystem of a city or region, bringing together municipalities, municipal companies, citizens, and businesses. They enable the development and management of data-driven services and aim to combat siloed data storage and usage. The main goal of this research paper is to examine the success factors for public sector stakeholders in creating and managing Urban Data Spaces. Using a multi-method approach (literature analysis, expert interviews, focus groups, and survey), we identified, validated, and quantified 23 success factors. The success factors were categorized into five dimensions: Platform Design, Platform Governance, Technical Platform Design, Platform Management Capabilities, and Stakeholder Involvement. Key findings are: A shared vision of an open and interoperable Urban Data Space, supported by a Life Cycle Management enables public management to benefit from data-driven services and become more sustainable. In addition, a cross-organizational data governance and strategy with a focus on the development of data competence and data quality management form the foundation of those Data Ecosystems. Based on the identified success factors, this article presents recommendations for scientists and practitioners.
Download

Paper Nr: 58
Title:

A Semantic Web Approach for Military Operation Scenarios Development for Simulation

Authors:

André M. Demori, Julio C. Tesolin, David C. Moura, João C. Gomes, Gabriel Pedroso, Leonardo B. Silva de Carvalho, Edison Pignaton de Freitas and Maria R. Cavalcanti

Abstract: Simulating reality by computational means allows decision-makers to analyze and propose the best strategies to be adopted in a real environment. However, the scenario, sometimes heterogeneous as in the case of military operations, requires formalization to capture domain knowledge and allow a more faithful reproduction of reality. In the case of military operation scenarios that address tactical, operational, and strategic elements and the use of communications, formalization can help organize knowledge, data sharing, and decision-making. This article proposes (i) the use of conceptual modeling based on concepts arising from a foundational ontology named UFO (Unified Foundational Ontology), (ii) the use of the Web Ontology Language (OWL), and (iii) the use of rule definitions expressed in the Semantic Web Rule Language (SWRL). Through this approach, this article describes the process of formalizing the domain knowledge as a reference ontology and its corresponding operational ontology by identifying entities, relationships, rules, and all the categorizations made in the ontology for execution and decision-making in a battlefield simulator that is still under development. The application of this ontology is illustrated in representative real-world examples, showing promising results for the proposed approach.
Download

Paper Nr: 99
Title:

Behavioral Modeling of Real Dynamic Processes in an Industry 4.0-Oriented Context

Authors:

Dylan Molinié, Kurosh Madani and Véronique Amarger

Abstract: With Industry 4.0, new ways of thinking about industry emerge: production units are now orchestrated from decentralized places and collaborate to improve efficiency, save time and resources, and reduce costs. To that end, Artificial Intelligence is expected to help manage units, prevent disruptions, predict failures, etc. One way to do so consists in modeling the temporal evolution of the processes in order to track, predict and prevent future failures. Such modeling can be performed using the full dataset at once, but it may be more accurate to isolate the regions of the feature space where there is little variation in the data, model these local regions separately, and finally connect all of them to build the final model of the system. This paper proposes to identify the compact regions of the feature space with unsupervised clustering, and then to model them with data-driven regression. The proposed methodology is tested on real industrial data, obtained in the scope of an Industry 4.0-oriented European project, and its accuracy is compared to that of a global model; the results show that local modeling achieves better accuracy in both the learning and testing stages.
Download
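
A minimal sketch of the cluster-then-model idea on a toy piecewise process: unsupervised clustering isolates compact regions, a regressor is fitted per region, and prediction routes each sample to its region's model (the data and model choices are illustrative, not the paper's exact setup):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))  # a toy process variable
y = np.where(X[:, 0] < 5, 2 * X[:, 0], 20 - 1.5 * X[:, 0]) \
    + rng.normal(0, 0.3, 300)          # two local regimes plus noise

# 1) Unsupervised clustering isolates compact regions of the feature space.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 2) A separate regressor models each local region.
local_models = {c: LinearRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
                for c in np.unique(km.labels_)}

# 3) Prediction routes each new sample to its region's model.
def predict(x_new):
    c = km.predict(x_new)[0]
    return local_models[c].predict(x_new)[0]

print(predict(np.array([[7.5]])))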

Paper Nr: 116
Title:

On Data-Preprocessing for Effective Predictive Maintenance on Multi-Purpose Machines

Authors:

Lukas Meitz, Michael Heider, Thorsten Schöler and Jörg Hähner

Abstract: Maintenance of complex machinery is time- and resource-intensive. Therefore, decreasing maintenance cycles by employing Predictive Maintenance (PdM) is sought after by many machine manufacturers and can be a valuable selling point. However, PdM is currently a hard problem that gets increasingly harder with the complexity of the maintained system. One challenge is to adequately prepare data for model training and analysis. In this paper, we propose the use of expert-knowledge-based preprocessing techniques to extend the standard data science workflow. We define complex multi-purpose machinery as the application domain and test our proposed techniques on real-world data generated by numerous machines deployed in the wild. We find that our techniques enable and enhance model training.
Download

Area 5 - Databases and Data Security

Short Papers
Paper Nr: 36
Title:

QTrail-DB: A Query Processing Engine for Imperfect Databases with Evolving Qualities

Authors:

Maha Asiri and Mohamed Y. Eltabakh

Abstract: Imperfect databases are very common in many applications, for reasons ranging from data-entry errors, transmission errors, and wrong instrument readings to faulty experimental setups leading to incorrect results. The management and query processing of imperfect databases is a very challenging problem that requires incorporating the data’s qualities within the database engine. Even more challenging, the qualities are not static and may evolve over time. Unfortunately, most state-of-the-art techniques treat the data quality problem as an offline task. In this paper, we propose the “QTrail-DB” system, which introduces a new quality model based on the new concept of “Quality Trails”, capturing the evolution of the data’s qualities over time. QTrail-DB extends the relational data model to incorporate the quality trails within the database system. We propose a new query algebra, called “QTrail Algebra”, that enables transparent propagation and derivation of the data’s qualities within a query pipeline. QTrail-DB is developed within PostgreSQL and experimentally evaluated using real-world datasets to demonstrate its efficiency and practicality.
Download

Paper Nr: 76
Title:

Design and Implementation of a Document Encryption Convergence Program Selecting Encryption Methods, and Integrating the Program into the Existing Office System

Authors:

Hong-Jin Ryu and Samuel S. Lee

Abstract: This article explores the limitations of current encryption methods and proposes a new encryption system that combines ancient and modern cryptography through software engineering. The article also discusses the history of cryptography, from simple substitution ciphers to more complex mathematical algorithms. The proposed encryption system fuses several ancient ciphers with AES-256 encryption, creating a stronger, more secure cipher that is difficult to break. The paper suggests that the proposed cryptographic algorithm is useful and reliable in fields such as virtual currency and blockchain, and that it could easily be used by the general public on electronic devices.
Download
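
The abstract does not specify which ancient ciphers are fused, so purely for illustration, here is a sketch that layers a Caesar substitution (an ancient cipher) under AES-256-GCM (a modern one), using the Python cryptography library:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def caesar(text, shift):
    # Ancient substitution cipher applied as a pre-encryption layer.
    return "".join(
        chr((ord(c) - 97 + shift) % 26 + 97) if c.isalpha() else c
        for c in text.lower()
    )

plaintext = "attack at dawn"
pre_encrypted = caesar(plaintext, shift=3)          # ancient layer

key = AESGCM.generate_key(bit_length=256)           # modern layer: AES-256-GCM
nonce = os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, pre_encrypted.encode(), None)

# Decryption reverses both layers.
recovered = AESGCM(key).decrypt(nonce, ciphertext, None).decode()
assert caesar(recovered, shift=-3) == plaintext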

Paper Nr: 61
Title:

Comparing Data Store Performance for Full-Text Search: To SQL or to NoSQL?

Authors:

George Fotopoulos, Paris Koloveas, Paraskevi Raftopoulou and Christos Tryfonopoulos

Abstract: The amount of textual data produced nowadays is constantly increasing as the number and variety of both new and reproduced textual information created by humans and (lately) also by bots is unprecedented. Storing, handling and querying such high volumes of textual data have become more challenging than ever and both research and industry have been using various alternatives, ranging from typical Relational Database Management Systems to specialised text engines and NoSQL databases, in an effort to cope with the volume. However, all these decisions are, largely, based on experience or personal preference for one system over another, since there is no performance comparison study that compares the available solutions regarding full-text search and retrieval. In this work, we fill this gap in the literature by systematically comparing four popular databases in full-text search scenarios and reporting their performance across different datasets, full-text search operators and parameters. To the best of our knowledge, our study is the first to go beyond the comparison of characteristics, like expressiveness of the query language or popularity, and actually compare popular relational, NoSQL, and textual data stores in terms of retrieval efficiency for full-text search. Moreover, our findings quantify the differences in full-text search performance between the examined solutions and reveal both anticipated and less anticipated results.
Download

Paper Nr: 110
Title:

Sharding and Master-Slave Replication of NoSQL Databases: Comparison of MongoDB and Redis

Authors:

Anikó Vágner and Mustafa Al-Zaidi

Abstract: One of the reasons that NoSQL databases were born is that they can be used in clusters, namely, computers work together, share data and from the client side, the clustered computers look as if there is only one computer. In this paper, the distribution models of two NoSQL databases are introduced. We chose the databases from the database ranking website (noa, 2023a), exactly the two first NoSQL databases: MongoDB and Redis. However they belong to two different NoSQL database categories, they use similar key-value pairs which is the main basis of the clustering. Additionally, the distribution models do not depend on the categories of the databases, both database management systems know sharding and master-slave replication, and can use these two distribution models together. These two database management systems do not know peer-to-peer replication. Our goal was to get to know whether there are similarities between the structures of clustered computers of each database management system. If we consider the theory, the answer should be yes, they are similar to each other: for the sharding 2 computers are enough, similarly for the replication 2 computers are also enough, and if both of the techniques are used, 4 computers should be enough.
Download