DATA 2024 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 61
Title:

Privacy-Preserving Big Hierarchical Data Analytics via Co-Occurrence Analysis

Authors:

Alfredo Cuzzocrea and Selim Soufargi

Abstract: Nowadays, Big Data Analytics is gaining momentum in both the academic and industrial research communities. In this context, the issue of performing such a critical process under tight privacy-preservation constraints plays the key role of “enabling technology”. Aligning with this paradigm, this paper introduces and experimentally assesses Drill-CODA, an innovative framework that combines drill-across multidimensional big data analytics with co-occurrence analysis to achieve privacy preservation during the analytical phase.

Paper Nr: 123
Title:

Expert Agent Guided Learning with Transformers and Knowledge Graphs

Authors:

Chukwuka V. Obionwu, Bhavya C. Valappil, Minu Genty, Maria Jomy, Visakh Padmanabhan, Aishwarya Suresh, Sumat S. Bedi, David Broneske and Gunter Saake

Abstract: The interaction between students and instructors can be likened to an interaction with a conversational agent model that understands the context of the interaction and the questions the student poses. Large language models have exhibited remarkable aptitude for facilitating learning and educational procedures. However, they occasionally exhibit hallucinations, which can result in the spread of inaccurate or false information. This issue requires attention to ensure the overall reliability of the information system. Knowledge graphs provide a methodical technique for describing entities and their interconnections. This facilitates a comprehensive and interconnected understanding of the knowledge in a specific field. Therefore, to make the interactions with our conversational agent more human-like and to deal with hallucinations, we employ a retrieval-focused generation strategy that utilizes existing knowledge and creates responses based on contextually relevant information. Our system relies on a knowledge graph, an intent classifier, and a response generator that compares and evaluates question embeddings to ensure accurate and contextually appropriate replies. We further evaluate our implementation based on relevant metrics and compare it to state-of-the-art task-specific retrieve-and-extract architectures. For language generation tasks, we find that the RCG models generate more specific, diverse, and factual information than state-of-the-art baseline models.

Short Papers
Paper Nr: 43
Title:

A Comparison of the Efficiencies of Various Structured and Semi-Structured Data Formats in Data Analysis and Big Data Analytic Development

Authors:

Heather E. Graham and Taoxin Peng

Abstract: As data volumes grow, so too does our need and ability to analyse them. Cloud computing technologies offer a wide variety of options for analysing big data and make this ability available to anyone. However, the monetary implications of doing this in an inefficient fashion could surprise those who are used to an on-premises solution to big data analysis, as they move from a model where storage is limited and processing power has little cost implication, to a model where storage is cheap but compute is expensive. This paper investigates the efficiencies gained or lost by using each of five data formats, CSV, JSON, Parquet, ORC and Avro, on Amazon Athena, which uses SQL as a query language over data at rest in Amazon S3, and on Amazon EMR, using the Pig language over a distributed Hadoop architecture. Experiment results suggest that ORC is the most efficient data format to use on the platforms tested, based on time and monetary costs.
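
To make the comparison above concrete, here is a minimal local sketch in the same spirit, run with pandas rather than on Amazon Athena or EMR: the same synthetic table is written to CSV and Parquet and a simple aggregate query over each is timed. The schema, file names, and row count are placeholders, not the paper's benchmark setup.

    # Local sketch of a storage-format efficiency comparison (CSV vs. Parquet);
    # illustrative only, not the Athena/EMR pipeline used in the paper.
    import time
    import numpy as np
    import pandas as pd

    rows = 1_000_000
    df = pd.DataFrame({
        "id": np.arange(rows),
        "category": np.random.choice(list("ABCD"), size=rows),
        "value": np.random.rand(rows),
    })
    df.to_csv("sample.csv", index=False)
    df.to_parquet("sample.parquet", index=False)  # needs pyarrow or fastparquet

    def timed_mean(reader, path):
        start = time.perf_counter()
        result = reader(path).groupby("category")["value"].mean()
        return result, time.perf_counter() - start

    _, t_csv = timed_mean(pd.read_csv, "sample.csv")
    _, t_parquet = timed_mean(pd.read_parquet, "sample.parquet")
    print(f"CSV: {t_csv:.2f}s  Parquet: {t_parquet:.2f}s")

A full comparison along the paper's lines would repeat this for JSON, ORC and Avro and would also record storage size and query cost on the cloud platforms themselves.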

Paper Nr: 92
Title:

It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots

Authors:

Yanchen Wang and Lisa Singh

Abstract: The launch of ChatGPT in November 2022 marked the beginning of a new era in AI, the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities, from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm on, or increase inequity for, different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content include child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community-established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.

Paper Nr: 98
Title:

Twitter Metrical Data Analysis Using R: Twiplomacy in the Outbreak of the War in Ukraine

Authors:

Dimitrios Vagianos and Thomas Papatsas

Abstract: Social media are commonly used to inform and sway public opinion during contemporary conflicts. This study focuses on how Twitter (now known as X) was used with regard to the Russian-Ukrainian war during the first three months following Russia’s invasion of Ukraine on February 24, 2022. Thirty accounts in total—fifteen from each opposing side—were used to mine the data. The information released by these accounts throughout this monitoring period, along with the frequency of their postings, was collected and investigated in order to highlight the diverse approaches in this kind of cyberspace confrontation. In order to emphasize the key components of each party’s strategy and its efficacy, the interactivity networks of the accounts under discussion were constructed and visually analysed. Overall, this research exploits a combination of effective data analysis approaches, including word frequency investigation and interactivity network analysis based on the modularity community detection algorithm. By exclusively using Open Source software, the results visually highlight the degree of coordination and intensity of Twitter use by the Ukrainian side, a fact that is in full accordance with the comparatively greater influence Ukraine achieved during this time frame, as widely reported by the media.

Paper Nr: 47
Title:

Towards Computational Performance Engineering for Unsupervised Concept Drift Detection: Complexities, Benchmarking, Performance Analysis

Authors:

Elias Werner, Nishant Kumar, Matthias Lieber, Sunna Torge, Stefan Gumhold and Wolfgang E. Nagel

Abstract: Concept drift detection is crucial for many AI systems to ensure the system’s reliability. These systems often have to deal with large amounts of data or react in real-time. Thus, drift detectors must meet computational requirements or constraints, which calls for a comprehensive performance evaluation. However, so far, the focus of developing drift detectors has been on inference quality, e.g. accuracy, but not on computational performance, such as runtime. Many previous works consider computational performance only as a secondary objective and do not have a benchmark for such evaluation. Hence, we propose and explain performance engineering for unsupervised concept drift detection that reflects on computational complexities, benchmarking, and performance analysis. We provide the computational complexities of existing unsupervised drift detectors and discuss why further computational performance investigations are required. On this basis, we state and substantiate the aspects of a benchmark for unsupervised drift detection reflecting on inference quality and computational performance. Furthermore, we demonstrate performance analysis practices that have proven their effectiveness in High-Performance Computing, by tracing two drift detectors and displaying their performance data.
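
As a hedged illustration of the kind of measurement the authors argue for, the sketch below runs a simple two-window Kolmogorov-Smirnov drift check (a generic detector, not one of those benchmarked in the paper) and records its runtime alongside the detection decisions; the window size and significance threshold are arbitrary choices.

    # Toy unsupervised drift check: compare a reference window with successive
    # windows via the two-sample KS test, and time the detector as well.
    import time
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    stream = np.concatenate([rng.normal(0, 1, 5000), rng.normal(0.8, 1, 5000)])

    window, alpha = 500, 0.01
    reference = stream[:window]

    start = time.perf_counter()
    for pos in range(window, len(stream) - window, window):
        current = stream[pos:pos + window]
        stat, p_value = ks_2samp(reference, current)
        if p_value < alpha:
            print(f"drift flagged at index {pos} (p={p_value:.3g})")
            reference = current  # reset the reference after a detection
    print(f"total detection runtime: {time.perf_counter() - start:.4f}s")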

Area 2 - Business Analytics

Full Papers
Paper Nr: 32
Title:

Artificial Intelligence-Powered Large Language Transformer Models for Opioid Abuse and Social Determinants of Health Detection for the Underserved Population

Authors:

Don Roosan, Yanting Wu, Jay Chok, Christopher P. Sanine, Tiffany Khou, Yawen Li and Hasiba M. Khan

Abstract: The rise of big data in healthcare, particularly within electronic health records (EHRs), presents both challenges and opportunities for addressing complex public health issues such as opioid use disorder (OUD) and social determinants of health (SDoH). Traditional data analysis methods are often limited by their reliance on structured data, overlooking the wealth of valuable insights embedded within unstructured clinical narratives. Leveraging advancements in artificial intelligence (AI), Large Language Models (LLMs) and natural language processing (NLP), this study proposes a novel approach to detect OUD by analyzing unstructured data within EHRs. Specifically, a Bidirectional Encoder Representations from Transformers (BERT)-based NLP method is developed and applied to clinical progress notes extracted from the EHR system of Emanate Health System. The study created a data analytics platform utilizing user-centered design to improve clinical decisions. This study contributes to the ongoing effort to combat the opioid crisis by bridging the gap between technology-driven analytics and clinical practice, ultimately striving for improved patient wellbeing and equitable healthcare delivery.

Paper Nr: 52
Title:

Reputation, Sentiment, Time Series and Prediction

Authors:

Peter Mitic

Abstract: A formal formulation of reputation is presented as a time series of daily sentiment assessments. Projections of reputation time series are made using three methods that replicate the distributional and auto-correlation properties of the data: ARIMA, a Copula fit, and Cholesky decomposition. Each projection is tested for goodness-of-fit with respect to observed data using a bespoke auto-correlation test. Numerical results show that Cholesky decomposition provides optimal goodness-of-fit success, but overestimates the projection volatility. Expressing reputation as a time series and deriving predictions from it has significant advantages in corporate risk control and decision making.
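
A minimal sketch of the Cholesky-based projection idea, assuming a stationary series: estimate an autocovariance matrix from the observed series, take its Cholesky factor, and use it to colour white noise so that the simulated path shares the data's autocorrelation structure. This is an illustration, not the authors' calibrated reputation model.

    # Project a series by colouring white noise with the Cholesky factor of a
    # Toeplitz autocovariance matrix estimated from observed data (toy example).
    import numpy as np

    rng = np.random.default_rng(1)
    observed = np.cumsum(rng.normal(0, 1, 250)) * 0.1 + rng.normal(0, 0.5, 250)

    horizon = 60
    x = observed - observed.mean()
    acov = np.array([x[:len(x) - k] @ x[k:] / len(x) for k in range(horizon)])
    cov = np.array([[acov[abs(i - j)] for j in range(horizon)] for i in range(horizon)])
    cov += 1e-8 * np.eye(horizon)          # keep the matrix positive definite
    L = np.linalg.cholesky(cov)

    projection = observed.mean() + L @ rng.standard_normal(horizon)
    print(projection[:5])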

Paper Nr: 53
Title:

A Cascade of Consequences: Improving an Accident Analysis Method by Learning from a Real Life Telecommunications Accident

Authors:

Hans A. Wienen, Faiza A. Bukhsh, Eelco Vriezekolk and Luís Ferreira Pires

Abstract: Telecommunications networks are vital enablers of modern society. Large accidents in these networks that cause their unavailability can therefore have a severe impact on the functioning of society. Learning from these accidents can help prevent them and thus make our society more resilient. In this paper, we present an accident analysis method (TRAM), which we developed by extending the AcciMap method, and we report on its application to the analysis of a severe accident in a telecommunications network. We validate a notation for representing and breaking positive feedback loops in a network breakdown, and we suggest a method to enhance the prioritisation of recommendations derived from our analysis. Furthermore, our research reveals that splitting the analysis based on the expertise of the method’s participants negatively impacts the efficiency of the overall process.

Paper Nr: 54
Title:

Is Positive Sentiment Missing in Corporate Reputation?

Authors:

Peter Mitic

Abstract: The value of a perceived negative bias is quantified in the context of corporate reputation time series, derived by exhaustive data mining and automated natural language processing. Two methods of analysis are proposed: a State-Space approach using a Kalman filter for time series with a Normal distribution profile, and Forward Filtering Backward Sampling for those without. Normality tests indicate that approximately 92% of corporate reputation time series do fit the Normal profile. The results indicate that observed positive reputation profiles should be boosted by approximately 4% to account for negative bias. Examination of the observed balance between negative and positive sentiment in reputation time series indicates dependence on the sentiment calculation method and region. Positive sentiment predominates in the US, Japan and parts of Western Europe, but not in the UK or in Hong Kong/China.

Paper Nr: 58
Title:

A Deep Dive into GPT-4's Data Mining Capabilities for Free-Text Spine Radiology Reports

Authors:

Klaudia Szabó Ledenyi, András Kicsi and László Vidács

Abstract: The significant growth of large language models has revolutionized the field of natural language processing. Recent advancements in large language models, particularly generative pretrained transformer (GPT) models, have shown advanced capabilities in natural language understanding and reasoning. These models typically interact with users through prompts rather than requiring training data or fine-tuning, which can save a significant amount of time and resources. This paper presents a study evaluating GPT-4’s performance in data mining from free-text spine radiology reports using a single prompt. The evaluation includes sentence classification, sentence-level sentiment analysis and two representative biomedical information extraction tasks: named entity recognition and relation extraction. Our research findings indicate that GPT-4 performs effectively in few-shot information extraction from radiology text, even without specific training for the clinical domain. This approach shows potential for more effective information extraction from free-text radiology reports compared to manual annotation.

Paper Nr: 62
Title:

Risk-Stratified Multi-Objective Resource Allocation for Optimal Aviation Security

Authors:

Eva K. Lee, Taylor J. Leonard and Jerry C. Booker

Abstract: This study aims to establish a quantitative construct for enterprise risk assessment and optimal portfolio investment to achieve the best aviation security. We first analyze and model various aviation transportation risks and establish their interdependencies via a topological overlap network. Next, a multi-objective portfolio investment model is formulated to optimally allocate security measures. The portfolio risk model determines the best security capabilities and resource allocation under a given budget. The computational framework allows for marginal cost analysis, which determines how best to invest any additional resources for the best overall risk protection and return on investment. Our analysis involves cascading and inter-dependency modeling of the multi-tier risk taxonomy and overlaying security measures. The model incorporates three objectives: (1) maximize the risk posture (ability to mitigate risks) in aviation security, (2) minimize the probability of false clears, and (3) maximize the probability of threat detection. This work presents the first comprehensive model that links all resources across the 440 federally funded airports in the United States. We experimented with several computational strategies, including Dantzig-Wolfe decomposition, column generation, particle swarm optimization, and a greedy heuristic, to solve the resulting intractable instances. Contrasting the current baseline performance with some of the near-optimal solutions obtained by our system, our solutions offer improved risk posture, a lower false-clear rate, and higher threat detection across all the airports, indicating a better enterprise risk strategy and decision process under our system. The risk assessment and optimal portfolio investment construct are generalizable and can be readily applied to other risk and security problems.
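
The full portfolio model is beyond an abstract, but the greedy heuristic mentioned above can be illustrated with a tiny budgeted selection sketch: measures are ranked by detection gain per unit cost and picked until the budget runs out. All names and numbers below are invented.

    # Toy greedy budget allocation: choose security measures by benefit per
    # unit cost until the budget is exhausted (illustrative values only).
    budget = 100.0
    measures = [  # (name, cost, detection gain)
        ("body scanner upgrade", 60.0, 0.30),
        ("extra canine team",    25.0, 0.12),
        ("screening software",   15.0, 0.10),
        ("staff training",       10.0, 0.05),
    ]

    chosen, remaining = [], budget
    for name, cost, gain in sorted(measures, key=lambda m: m[2] / m[1], reverse=True):
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost

    print("selected:", chosen, "unused budget:", remaining)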

Paper Nr: 100
Title:

Intelligent Sampling System for Connected Vehicle Big Data

Authors:

Omar Makke, Syam Chand, Vamsee K. Batchu, Oleg Gusikhin and Vicky Svidenko

Abstract: The impact of connected vehicle big data on the automotive industry is significant. Big data offers data scientists the opportunity to explore and analyze vehicle features and their usage thoroughly to assist in optimizing existing designs or offer new features. However, the downside of big data is its associated cost. While storage tends to be cheap, data transmission and computational resources are not. Specifically, for connected vehicle data, even when unstructured data is excluded, the data size can still increase by several terabytes a day if one is not careful about what data to collect. Therefore, it is advisable to apply methods which help avoid collecting redundant data, to reduce the computation cost. Furthermore, some data scientists may be tempted to calculate “exact” metrics when the data is available, partly because applying statistical methods can be tedious, which can exhaust the computational resources. In this paper, we argue that intelligent sampling systems which centralize the sampling methods and domain knowledge are required for connected vehicle big data. We also present our system, which assists interested parties in performing analytics, and provide two case studies to demonstrate the benefits of the system.
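
The abstract does not detail the sampling system itself; as a generic illustration of sampling instead of exhaustive collection, the sketch below keeps a fixed-size uniform reservoir sample of a signal stream (Algorithm R). The fake stream and the sample size are placeholders.

    # Reservoir sampling (Algorithm R): keep a uniform, fixed-size sample of a
    # stream without storing the whole stream.
    import random

    def reservoir_sample(stream, k, seed=42):
        rng = random.Random(seed)
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = rng.randint(0, i)       # inclusive upper bound
                if j < k:
                    sample[j] = item
        return sample

    speed_readings = (40 + (i % 17) * 1.5 for i in range(1_000_000))  # fake signal
    print(reservoir_sample(speed_readings, k=10))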

Paper Nr: 104
Title:

An Exploratory Analysis of Malaria and Climatic Factors in India

Authors:

Sachin Y. Bodke and Usha Ananthakumar

Abstract: Malaria remains a significant health challenge in India, prompting a thorough analysis of cases in recent years from 2020 to 2022. This study focuses on understanding the spread of malaria over time and across different states, specifically emphasizing the impact of climatic factors such as rainfall and temperature. India’s diverse climatic conditions, ranging from hot summers to cold winters, contribute to the complexity of malaria dynamics. Variations in malaria prevalence were observed with changes in rainfall and temperature, particularly during the months of July to October. Our findings reveal a notable increase in malaria cases during a period characterized by significant rainfall and temperature. The study identifies a significant prevalence of malaria cases in India’s West, East, and North East regions with peak transmission occurring in the rainy season months. Considering the intricate interplay between climatic factors and disease transmission, this study contributes valuable insights for tailored malaria control strategies during heightened transmission periods.

Short Papers
Paper Nr: 24
Title:

Dynamic Price Prediction for Revenue Management System in Hospitality Sector

Authors:

Susanna Saitta, Vito D’Amico and Giovanni M. Farinella

Abstract: Dynamic pricing prediction is widely adopted in many different sectors. In receptive structures, the price of services (e.g. room price) is usually set dynamically by the Revenue Manager (RM), who continuously monitors the Key Performance Indicators (KPIs) recorded over time, together with market conditions and other external factors. The prices of services are dynamically adjusted by the RM to maximize the revenue of the receptive structure. This manual adjustment of prices performed by the RM is costly and time-consuming. In this work we study the problem of automatic dynamic pricing. To this aim, we collect and exploit a dataset related to real receptive structures. The dataset is annotated by revenue management experts and takes into account static, dynamic and engineered features. We benchmark different machine learning models to automatically predict the price that an RM would dynamically set for an entry-level room, forecasting the price over the next 90 days. The compared approaches have been tested and evaluated on three different hotels and could be easily adapted to other room types. To the best of our knowledge, the problem addressed in this paper is understudied, and the results obtained in our study can help further research in the field.

Paper Nr: 82
Title:

Failure Prediction Using Multimodal Classification of PCB Images

Authors:

Pedro M. Goncalves, Miguel A. Brito and Jose C. Moreira

Abstract: In the era of Industry 4.0, where digital technologies revolutionize manufacturing, a wealth of data drives optimization efforts. Despite the opportunities, managing these vast datasets poses significant challenges. Printed Circuit Boards (PCBs) are pivotal in modern industry, yet their complex manufacturing process demands robust fault detection mechanisms to ensure quality and safety. Traditional classification models have limitations, exacerbated by imbalanced datasets and the sheer volume of data. Addressing these challenges, our research pioneers a multimodal classification approach, integrating PCB images and structured data to enhance fault prediction. Leveraging diverse data modalities, our methodology promises superior accuracy with reduced data requirements. Crucially, this work is conducted in collaboration with Bosch Car Multimedia, ensuring its relevance to industry needs. Our goals encompass crafting sophisticated models, curbing production costs, and establishing efficient data pipelines for real-time processing. This research marks a pivotal step towards efficient fault prediction in PCB manufacturing within the Industry 4.0 framework.

Paper Nr: 96
Title:

A Web-Based Hate Speech Detection System for Dialectal Arabic

Authors:

Anis Charfi, Andria Atalla, Raghda Akasheh, Mabrouka Bessghaier and Wajdi Zaghouani

Abstract: A significant issue in today’s global society is hate speech, which is defined as any kind of expression that attempts to degrade an individual or a society based on attributes such as race, color, nationality, gender, or religion (Schmidt and Wiegand, 2017). In this paper, we present a Web-based hate speech detection system that focuses on the Arabic language and supports its various dialects. The system is designed to detect hate speech within a given sentence or within a file containing multiple sentences. Behind the scenes, our system makes use of the AraBERT model trained on our ADHAR hate speech corpus, which we developed in previous work. The output of our system discerns the presence of hate speech within the provided sentence by categorizing it into one of two categories: “Hate” or “Not hate”. Our system also detects different categories of hate speech such as race-based hate speech and religion-based hate speech. We experimented with various machine learning models, and our system achieved the highest accuracy, along with an F1-score of 0.94, when using AraBERT. Furthermore, we have extended the functionality of our tool to support inputting a file in CSV format and to visualize the output as polarization pie charts, enabling the analysis of large datasets.

Paper Nr: 102
Title:

Prediction of Academic Success in a University and Improvement Using Lean Tools

Authors:

Kléber Sánchez and Diego Vallejo-Huanga

Abstract: The COVID-19 pandemic caused several significant challenges for humanity. In the educational sector, mechanisms had to be quickly implemented to migrate in-person activities to a fully virtual format. Academic institutions and society faced a paradigm shift, since modifying the conditions of the teaching-learning system produced changes in the quality of education and student approval rates. This scientific article evaluates three classification models built by collecting data from a public Higher Education Institution to predict course approval based on different exogenous variables. The results show that the highest performance was obtained with the Random Forest algorithm, which has an accuracy of 61.3% and allows us to identify students whose initial conditions generate a high probability of failing a virtual course before it starts. In addition, this research collected information to detect opportunities for improving the prediction model, including restructuring the questions in the surveys and including new variables. The results suggest that the leading cause of course failure is the lack of elementary knowledge and skills students should have acquired during their secondary education. Finally, to mitigate the problem, a readjustment of the study program is proposed along with lean support tools to measure the results of these modifications.

Paper Nr: 115
Title:

Design Features for Data Trustee Selection in Data Spaces

Authors:

Michael Steinert, Daniel Tebernum and Marius Hupperz

Abstract: As the world becomes increasingly digital, data is becoming a critical resource. When used effectively, it can lead to more accurate forecasts, process optimization, and the creation of innovative business models. The necessary data is often distributed across multiple organizations, and its full value can only be realized through shared collaboration. Data spaces provide organizations with a platform for sovereign and secure data sharing. To enable legally secure data sharing and ensure compliance with regulations, data trustees play a critical role as trusted intermediaries. However, choosing a suitable data trustee that meets the needs of the participants who want to share data with each other is difficult. Our study seeks to elucidate the process by which participants of a data space can choose an appropriate data trustee. To this purpose, we have implemented a whitelist approach. We report on the results of our design science research project, which includes design features to facilitate the integration of our whitelist approach into different data space instantiations. Potential shortcomings were identified and addressed during an expert workshop. By providing verified design knowledge, we help practitioners in the data space community to incorporate the concept of how to select the most appropriate data trustee.

Paper Nr: 125
Title:

Presence of Corporate Reputation Cues in Company Vacancy Texts Boosts Vacancy Attractiveness as Perceived by Employees

Authors:

R. E. Loke and F. Betten

Abstract: Attracting the best candidates online for job vacancies has become a challenging task for companies. One factor that could influence the attractiveness of organisations for employees is their reputation, which is an essential component of marketing research and plays a crucial role in customer and employee acquisition and retention. Prior research has shown the importance for companies of improving their corporate reputation (CR) for its effect on attracting the best candidates for job vacancies. Company ratings and vacancy advertisements are nowadays a massive, rich, and valuable online data source for forming opinions regarding corporations. This study focuses on the effect of CR cues that are present in the description of online vacancies on vacancy attractiveness. Our findings show that departments that are responsible for writing vacancy descriptions are recommended to include the CR themes citizenship, leadership, innovation, and governance and to exclude performance. This will increase vacancies’ attractiveness, which helps prevent labour shortages.

Paper Nr: 77
Title:

Strategic Placement of Data Centers for Economic Analysis: An Online Algorithm Approach

Authors:

Christine Markarian and Claude Fachkha

Abstract: Governments worldwide have increasingly recognized the transformative potential of data analytics in economics, leading to the establishment of specialized research centers dedicated to economic analysis. These centers serve as hubs for experts to dissect economic indicators, inform policymaking, and foster sustainable growth. With data analytics playing a pivotal role in understanding economic trends and formulating policy responses, the strategic placement of data centers becomes crucial. In this paper, we address the strategic placement of data centers in urbanized environments within the framework of online algorithms. Online algorithms are designed to make sequential decisions without complete information about future inputs, making them suitable for dynamic urban environments. Specifically, we formulate the problem as the Online Data Center Placement problem (ODCP) and design a novel online algorithm for it. To gauge our algorithm’s effectiveness, we use competitive analysis, a standard method for assessing online algorithms. This method compares our algorithm’s solutions with those of the optimal offline solution. Our study aims to provide a systematic approach for informed decision-making, optimizing resource usage, and fostering economic growth.

Paper Nr: 105
Title:

Machine Learning for KPI Development in Public Administration

Authors:

Simona Fioretto, Elio Masciari and Enea V. Napolitano

Abstract: Efficient and effective service delivery to citizens in Public Administrations (PA) requires the use of key performance indicators (KPIs) for performance evaluation and measurement. This paper proposes an innovative framework for constructing KPIs in performance evaluation systems using Random Forest and variable importance analysis. Our approach aims to identify the variables that have a strong impact on the performance of PAs. This identification enables a deeper understanding of the factors that are critical for organizational performance. By analyzing the importance of variables and consulting domain experts, relevant KPIs can be developed. This ensures improvement strategies focus on critical aspects linked to performance. The framework provides a continuous monitoring flow for KPIs and a set of phases for adapting KPIs in response to changing administrative dynamics. The objective of this study is to enhance the performance of PAs by applying machine learning techniques to achieve more agile and results-oriented PAs.
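
A minimal sketch of the Random Forest / variable-importance step described above, on synthetic stand-in data (the PA dataset and its variables are not public): fit a forest, rank features by importance, and treat the top-ranked ones as candidate KPI variables for expert review. The feature names are invented.

    # Rank candidate performance drivers with Random Forest importances.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "staff_per_case": rng.random(500),
        "digital_requests_share": rng.random(500),
        "avg_processing_days": rng.random(500),
        "budget_per_citizen": rng.random(500),
    })
    y = 2.0 * X["avg_processing_days"] - X["digital_requests_share"] + rng.normal(0, 0.1, 500)

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(ranking)  # top-ranked variables become candidate KPIs for expert review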

Area 3 - Data Science

Full Papers
Paper Nr: 76
Title:

Cross-Lingual Low-Resources Speech Emotion Recognition with Domain Adaptive Transfer Learning

Authors:

Imen Baklouti, Olfa Ben Ahmed and Christine Fernandez-Maloigne

Abstract: Speech Emotion Recognition (SER) plays an important role in several human-computer interaction-based applications. During the last decade, SER systems in a single language have achieved great progress through Deep Learning (DL) approaches. However, SER is still a challenge in real-world applications, especially with low-resource languages. Indeed, SER suffers from the limited availability of labeled training data in the speech corpora to train an efficient prediction model from scratch. Yet, due to the domain shift between source and target data distributions, traditional transfer learning methods often fail to transfer emotional knowledge from one language (source) to another (target). In this paper, we propose a simple yet effective approach for Cross-Lingual speech emotion recognition using supervised domain adaptation. The proposed method is based on 2D Mel-Spectrogram images as features for model training from source data. Then, a transfer learning method with domain adaptation is proposed in order to reduce the domain shift between source and target data in the latent space during model fine-tuning. We conduct experiments through different tasks on three different SER datasets. The proposed method has been evaluated on different transfer learning tasks, namely for low-resource scenarios, using the IEMOCAP, RAVDESS and EmoDB datasets. Obtained results demonstrate that the proposed method achieved competitive classification performance in comparison with the classical transfer learning method and with recent state-of-the-art SER-based domain adaptation works.

Paper Nr: 80
Title:

HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification

Authors:

Carmelo F. Longo, Misael Mongiovì, Luana Bulla and Giusy G. Tuccari

Abstract: Hierarchical text classification is a challenging task, in particular when complex taxonomies, characterized by multi-level labeling structures, need to be handled. A critical aspect of the task lies in the scarcity of labeled data capable of representing the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on LLama2. LLama2 generates coherent, contextually relevant text samples across hierarchical levels, faithfully emulating the intricate patterns of real-world text data. HTC-GEN obviates the need for labor-intensive human annotation required to build data for training supervised models. The proposed methodology effectively handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or missing real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the significant potential of synthetic-data-driven solutions to effectively address the intricate challenges of hierarchical text classification.

Paper Nr: 85
Title:

Course Recommendation System for Company Job Placement Using Collaborative Filtering and Hybrid Model

Authors:

Jaeheon Park, Suan Lee, Woncheol Lee and Jinho Kim

Abstract: This study introduces a novel recommendation system aimed at enhancing university career counseling by adapting it to more accurately align with students’ interests and career trajectories. Recognizing the challenges students face in selecting courses that complement their career goals, our research explores the efficacy of employing both collaborative filtering and a hybrid model approach in the development of this system. Uniquely, this system utilizes a company-course recommendation method, diverging from the traditional student-course paradigm, to generalize company-course relationships, thereby enhancing the system’s recommendation precision. Through meticulous feature engineering, we improved the performance of the NeuMF model. Our experiments demonstrate that the proposed method outperforms other models by 10% to 79% based on the mAP metric, suggesting that the proposed model can effectively recommend courses for employment.

Paper Nr: 120
Title:

Intelligent Transportation Systems: A Survey on Data Engineering

Authors:

Safa Batita, Achraf Makni and Ikram Amous

Abstract: This paper presents an examination of data engineering within Intelligent Transportation Systems (ITS), focusing on integrating advanced technologies such as Real-Time Databases (RT-DBs), Graph Databases (GDBs), and Artificial Intelligence (AI) to improve ITS capabilities. The decision to focus on database systems and AI in this paper is based on their crucial roles in shaping modern transportation systems and offers a comprehensive view of the technological framework influencing ITS. Through an extensive review of existing literature, the paper explores how these solutions synergistically contribute to data collection, organization, processing, and extraction of value from various ITS data. The paper analyzes the transformative impact of real-time data management in connected vehicle systems and the efficacy of GDBs in capturing complex relationships within intelligent transportation networks. Additionally, it assesses the adaptability of AI in various ITS applications, including traffic prediction, driver assistance, and accident analysis. Despite their benefits, the paper discusses persistent challenges related to system complexity, interoperability, data management, and model accuracy, which impact the widespread deployment of ITS. Furthermore, the paper presents recommendations for addressing these challenges and emphasizes research directions that require further exploration, underscoring the importance of intelligent and efficient transportation worldwide.

Paper Nr: 124
Title:

A Max Flow Min Cut View of Social Media Posts

Authors:

James Abello, Timothy R. Tangherlini and Haoyang Zhang

Abstract: Viewing social media posts as a collection of directed triples { ⟨ Entity, Verb, Entity ⟩ } provides a frequency labeled graph with vertices comprising the set of entities, and each edge encoding the frequency of co-occurrence of the pair of entities labeled by its linking verb or verb phrase. The set of edges of the underlying topology can be partitioned into maximal subgraphs, called fixed points, each consisting of a sequence of vertex-disjoint layers. We exploit this view to observe how information spreads on social media platforms. This is achieved via traces of label propagation across a Max Flow Min Cut decomposition of each fixed point. These traces generate a weighted label set system with an underlying label distribution, from which we derive a barycentric coordinatization of the collection of minimum cuts of each fixed point. This is a novel graph decomposition that incorporates information flow with a multi-layered summary of noisy social media forums, providing a comprehensible yet fine-grained summary of social media conversations.
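
A small, hedged illustration of the graph view described above (not the authors' fixed-point decomposition): triples are loaded into a directed graph whose edge capacities are co-occurrence frequencies, and networkx computes a max-flow/min-cut between two entities. The triples are invented.

    # Build a frequency-weighted directed graph from <Entity, Verb, Entity>
    # triples and compute a max-flow / min-cut between two entities.
    import networkx as nx

    triples = [
        ("group_a", "claims", "event_x"),
        ("group_a", "blames", "group_b"),
        ("group_b", "denies", "event_x"),
        ("event_x", "spreads_to", "forum_1"),
        ("group_b", "posts_on", "forum_1"),
    ]

    G = nx.DiGraph()
    for src, verb, dst in triples:
        if G.has_edge(src, dst):
            G[src][dst]["capacity"] += 1
        else:
            G.add_edge(src, dst, capacity=1, verb=verb)

    cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "group_a", "forum_1")
    print("min cut value:", cut_value)
    print("source side:", reachable, "sink side:", non_reachable)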

Short Papers
Paper Nr: 41
Title:

Students’ Performance in Learning Management System: An Approach to Key Attributes Identification and Predictive Algorithm Design

Authors:

Dynil Duch, Madeth May and Sébastien George

Abstract: The study we present in this paper explores the use of learning analytics to predict students’ performance in Moodle, an online Learning Management System (LMS). Student performance, in our research context, refers to the measurable outcomes of a student’s academic progress and achievement. Our research effort aims to help teachers spot and solve problems early on to increase student productivity and success rates. To achieve this main goal, our study first conducts a literature review to identify a broad range of attributes for predicting students’ performance. Then, based on the identified attributes, we use an authentic learning situation, lasting a year and involving 160 students from CADT (Cambodia Academy of Digital Technology), to collect and analyze data from student engagement activities in Moodle. The collected data include attendance, interaction logs, submitted quizzes, undertaken tasks, assignments, time spent on courses, and the outcome score. The collected data are then used to train different classifiers, allowing us to determine the Random Forest classifier as the most effective in predicting students’ outcomes. We also propose a predictive algorithm that utilizes the coefficient values from the classifier to make predictions about students’ performance. Finally, to assess the efficiency of our algorithm, we analyze the correlation between previously identified attributes and their impact on the prediction accuracy.

Paper Nr: 45
Title:

Forgetting in Knowledge Graph Based Recommender Systems

Authors:

Xu Wang and Christopher Brewster

Abstract: Recommender systems need to contend with continuous changes in both search spaces and user profiles. The set of items in the search space is usually treated as continuously expanding; however, users also purchase items or change their requirements. This raises the issue of how to “forget” an item after purchase or consumption. This paper addresses the issue of “forgetting” in knowledge graph-based recommender systems. We propose an innovative method for identifying and removing unnecessary or irrelevant triples from the graph itself. Using this approach, we simplify the knowledge graph while maintaining the quality of the recommendations. We also introduce several metrics to assess the impact of forgetting in knowledge graph-based recommender systems. Our experiments demonstrate that incorporating consideration of impact in the forgetting process can enhance the efficiency of the recommender system without compromising the quality of its recommendations.

Paper Nr: 48
Title:

Influential Factors on Drivetrain Consumption in Electric City Buses and Assessing the Optimization Potentials

Authors:

Sunilkumar Raghuraman, Daniel Baumann, Marc Schindewolf and Eric Sax

Abstract: In response to the growing need for sustainable mobility amidst global challenges like climate change and urbanization, ensuring energy-efficient operation of Electric City Buses (ECBs) is crucial. This study initially utilizes techniques associated with explainable artificial intelligence, such as SHapley Additive ExPlanations (SHAP), to determine the impact of various factors, such as vehicle speed, acceleration, and braking, on drivetrain consumption. The data is categorized into distinct scenarios, such as acceleration, starting, curves, uphill and downhill, for this analysis. In driving scenarios such as curves, uphill, or downhill, the brake pedal position, the accelerator pedal position, and vehicle speed were identified as significant factors affecting drivetrain consumption. Secondly, the study delves into analyzing driving behavior during bus stop entries, employing methods like Deep Autoencoder-based Clustering (DAC) and Self-Organizing Map (SOM). In the results of the DAC and SOM analysis, it was found that Cluster 2, identified through the DAC model, exhibited substantial energy consumption, characterized by higher acceleration and less brake pedal usage. Conversely, the SOM analysis showed that the orange and blue clusters have greater energy efficiency, with a higher distance covered and lower energy consumption, contrasting with other clusters that consumed more energy for reaching the bus stop.
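
A minimal SHAP sketch in the spirit of the analysis above, on synthetic stand-in data (the fleet data and the exact model are not given in the abstract): fit a gradient-boosted regressor on driving features and inspect mean absolute SHAP attributions for predicted consumption. The feature names mirror the abstract; everything else is a placeholder.

    # SHAP attribution of drivetrain-consumption predictions on synthetic data.
    import numpy as np
    import pandas as pd
    import shap
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "vehicle_speed": rng.uniform(0, 60, 1000),
        "accel_pedal": rng.uniform(0, 1, 1000),
        "brake_pedal": rng.uniform(0, 1, 1000),
        "road_grade": rng.uniform(-0.05, 0.05, 1000),
    })
    y = 0.02 * X["vehicle_speed"] + 1.5 * X["accel_pedal"] \
        + 0.8 * np.abs(X["road_grade"]) + rng.normal(0, 0.05, 1000)

    model = GradientBoostingRegressor().fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    mean_abs = np.abs(shap_values).mean(axis=0)
    print(dict(zip(X.columns, mean_abs.round(3))))  # average impact per feature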

Paper Nr: 49
Title:

Yet Another Miner Utility Unveiling a Dataset: CodeGrain

Authors:

Dániel Horváth and László Vidács

Abstract: Automated program repair (APR) has gained increasing attention over the years, from both an academic and an industrial point of view. The overall goal of APR is to reduce the cost of development and maintenance by automatically finding and fixing common bugs, typos, or errors in code. A successful, and highly researched, approach is to use deep-learning (DL) techniques to accomplish this task. DL methods are known to be very data-hungry, but despite this, data that is readily available online is hard to find, which poses a challenge to the development of such solutions. In this paper, we address this issue by providing a new dataset consisting of 371,483 code examples on bug-fixing, while also introducing a method that other researchers could use as a feature in their mining software. We extracted code from 5,273 different repositories and 250,090 different commits. Our work contributes to related research by providing a publicly accessible dataset on which DL models could be trained or fine-tuned, and a method that easily integrates with almost any code mining tool as a language-independent feature that gives more granular choices when extracting code parts from a specific bugfix commit. The dataset also includes the summary and message of the commits in the training data, which consists of multiple programming languages, including C, C++, Java, JavaScript, and Python.

Paper Nr: 55
Title:

Study of Impact of Gender on Engagement and Performance of Engineering Students

Authors:

M. E. Sousa-Vieira, J. C. López-Ardao and M. Fernández-Veiga

Abstract: The gender gap in STEM disciplines manifests itself in several axes. One of these, not covered in depth in the literature, is the likelihood of different academic achievements by men and women in some technology-oriented courses. We address this question in this work, providing a cross-sectional study conducted on an entry-level course on computer networking at the college level. Our findings suggest that, while we do not observe a statistically significant difference in the final grades, women perform slightly better in some specific individual tasks and tend to participate more intensely in social learning activities. Interestingly, our results do not confirm the hypothesis that men tend to show higher variance in their achievements than women.

Paper Nr: 56
Title:

Random Neural Network Ensemble for Very High Dimensional Datasets

Authors:

Jesus S. Aguilar–Ruiz and Matteo Fratini

Abstract: This paper introduces a machine learning method, Neural Network Ensemble (NNE), which combines ensemble learning principles with neural networks for classification tasks, particularly in the context of gene expression analysis. While the concept of weak learnability equalling strong learnability has been previously discussed, NNE’s unique features, such as addressing high dimensionality and blending Random Forest principles with experimental parameters, distinguish it within the ensemble landscape. The study evaluates NNE’s performance across five very high dimensional datasets, demonstrating competitive results compared to benchmark methods. Further analysis of the ensemble configuration, with respect to using variable-size neural network units and guiding the selection of input variables, would improve the classification performance of NNE-based architectures.

Paper Nr: 59
Title:

Federated Road Surface Anomaly Detection Using Smartphone Accelerometer Data

Authors:

Oussama Mazari Abdessameud and Walid Cherifi

Abstract: Road surface conditions significantly impact traffic flow, vehicle integrity, and driver safety. This importance is magnified in the context of service vehicles, where speed is often the only recourse for saving lives. Detecting road surface anomalies, such as potholes, cracks, and speed bumps, is crucial for ensuring smooth and safe driving experiences. Taking advantage of the widespread use of smartphones, this paper introduces a turn-by-turn navigation system that utilizes machine learning to detect road surface anomalies using accelerometer data and promptly alerts drivers. The detection model is personalized for individual drivers and continuously enhanced through federated learning, ensuring both local and global model improvements without compromising user privacy. Experimental results showcase the detection performance of our model, which continually improves with cumulative user contributions.
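
The federated step can be illustrated with a plain FedAvg weight average, independent of any particular federated learning framework. The sketch assumes each client reports its model weights as NumPy arrays together with its local sample count; the clients and the tiny two-layer model are invented.

    # Federated averaging (FedAvg) of client model weights, weighted by the
    # number of local samples; a generic sketch, not the paper's system.
    import numpy as np

    def fed_avg(client_weights, client_sizes):
        """client_weights: one list of ndarrays (layer weights) per client."""
        total = float(sum(client_sizes))
        n_layers = len(client_weights[0])
        return [
            sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
            for i in range(n_layers)
        ]

    client_a = [np.ones((3, 2)), np.zeros(2)]
    client_b = [np.full((3, 2), 3.0), np.ones(2)]
    global_weights = fed_avg([client_a, client_b], client_sizes=[100, 300])
    print(global_weights[0])  # pulled toward client_b, which has more samples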

Paper Nr: 66
Title:

Comparative Analysis of Hate Speech Detection Models on Brazilian Portuguese Data: Modified BERT vs. BERT vs. Standard Machine Learning Algorithms

Authors:

Thiago M. Chu, Leila Weitzel and Paulo Quaresma

Abstract: The Internet has become a platform for debates and the expression of personal opinions on various subjects. Social media have assumed an important role as a tool for interaction and communication between people. To understand this phenomenon, it is indispensable to detect and assess what characterizes hate speech and how harmful it can be to society. In this paper we present a comprehensive evaluation of Portuguese-BR hate speech identification based on a BERT model, with ML models as baselines. The BERT model achieves higher scores compared to the machine learning algorithms, indicating better overall performance in distinguishing between classes.

Paper Nr: 73
Title:

Machine Learning Classification in Cardiology: A Systematic Mapping Study

Authors:

Khadija Anejjar, Fatima A. Amazal and Ali Idri

Abstract: Heart disease, a widespread and potentially life-threatening condition affecting millions globally, demands early detection and precise prediction for effective prevention and timely intervention. Recently, there has been a growing interest in leveraging machine learning classification techniques to enhance accuracy and efficiency in the diagnosis, prognosis, screening, treatment, monitoring, and management of heart disease. This paper aims to contribute through a comprehensive systematic mapping study to the current body of knowledge, covering 715 selected studies spanning from 1997 to December 2023. The studies were meticulously classified based on eight criteria: year of publication, type of contribution, empirical study design, type of medical data used, machine learning techniques employed, medical task focused on, heart pathology assessed, and classification type.

Paper Nr: 75
Title:

Impact of Satellites Streaks for Observational Astronomy: A Study on Data Captured During One Year from Luxembourg Greater Region

Authors:

Olivier Parisot and Mahmoud Jaziri

Abstract: The visible and significant presence of satellites in the night sky has an impact on astronomy and astrophotography activities for both amateurs and professionals, perturbing observation sessions with undesired streaks in captured images, and the number of spacecraft orbiting the Earth is expected to increase steadily in the coming years. In this article, we test an existing method and propose a dedicated approach based on eXplainable Artificial Intelligence to detect streaks in astronomical data captured between March 2022 and February 2023 with a smart telescope in the Greater Luxembourg Region. To speed up the calculation, we also propose a detection approach based on Generative Adversarial Networks.

Paper Nr: 107
Title:

Visualization and Interpretation of Mel-Frequency Cepstral Coefficients for UAV Drone Audio Data

Authors:

Mia Y. Wang, Zhiwei Chu, Conner Entzminger, Yi Ding and Qian Zhang

Abstract: Unmanned Aerial Vehicles (UAVs) have become a focal point in various fields, prompting the need for effective detection and classification methodologies. This paper presents a thorough investigation into UAV audio signatures using Mel-Frequency Cepstral Coefficients (MFCCs). We meticulously explore the influence of varying MFCC quantities on classification accuracy across diverse UAV categories. Our analysis demonstrates that employing 30 MFCCs produces promising outcomes, characterized by reduced variance and heightened discriminatory capability compared to alternative configurations. Moreover, we introduce a novel image-based dataset derived from our existing audio dataset, encompassing waveform, spectrogram, Mel filter bank, and MFCC plots for 26 UAV categories, each comprising 100 audio files. This dataset facilitates comprehensive analysis and the development of multimodal UAV detection systems. Our research highlights the significance of leveraging diverse datasets and identifies future paths for UAV detection and classification research.
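
A short sketch of the MFCC extraction discussed above, using librosa with 30 coefficients (the configuration the authors report as promising). The audio file path is a placeholder, and taking per-coefficient means and standard deviations is just one simple way to obtain a fixed-length feature vector for a classifier.

    # Extract 30 MFCCs from a UAV audio clip and build a fixed-length feature
    # vector; "drone_clip.wav" is a placeholder file name.
    import numpy as np
    import librosa

    y, sr = librosa.load("drone_clip.wav", sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=30)   # shape: (30, n_frames)

    features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    print(features.shape)  # (60,) -- ready for a downstream classifier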

Paper Nr: 108
Title:

Spatial and Spatio-Temporal Modelling of Auto Insurance Claim Frequencies During Pre-and Post-COVID-19 Pandemic

Authors:

Jin Zhang, Shengkun Xie, Anna T. Lawniczak and Clare Chua-Chow

Abstract: This study explores the dynamics of automobile insurance claim frequencies, shedding light on spatial patterns indicative of regional diversity. By examining data from urban, rural, and suburban areas, we discern disparate claim frequencies across both geographical and temporal dimensions, offering pivotal insights for insurers and regulators seeking to enhance risk assessment and pricing methodologies. Our analysis of auto insurance loss data from Ontario, Canada, unveils a marked divergence in relative claim frequencies between the expansive northern regions and the densely populated south. Furthermore, by scrutinizing various accident years, including those influenced by the COVID-19 pandemic, distinct temporal trends emerge. Applying sophisticated spatio-temporal models facilitates precise predictions, equipping insurers with the tools necessary for adept navigation of the ever-evolving landscape of uncertainties. This research enhances our comprehension of the dynamic nature of territory risk within spatio-temporal contexts. These insights provide valuable assistance to insurance companies and auto insurance regulators in effectively managing territorial risk.

Paper Nr: 118
Title:

Visualizing OWL and RDF: Advancing Ontology Representation for Enhanced Semantic Clarity and Communication

Authors:

Giulia Biagioni

Abstract: This position paper advocates for the development of advanced visualization tools specifically designed to represent the full range of expressiveness conveyed by the vocabulary terms of OWL and RDF. It highlights the urgent need for a standardized way to visually refer to these vocabulary terms, addressing current challenges in ontology visualization. The paper outlines the significant benefits that such tools and initiatives will deliver, including enhanced clarity, improved communication among stakeholders, and more efficient management of semantic data. By standardizing visual representations, the paper argues for a more intuitive and accessible approach to interacting with complex semantic structures, ultimately facilitating better understanding and broader adoption of semantic web technologies.

Paper Nr: 119
Title:

Navigating the AI Timeline: From 1995 to Today

Authors:

Vincenza Carchiolo and Michele Malgeri

Abstract: In recent years, the exponential growth of Artificial Intelligence (AI) has transcended disciplinary boundaries, expanding into diverse fields beyond computer science. This study analyzes AI’s distribution across disciplines using a large dataset of scientific publications. Contrary to expectations, substantial AI research extends into medicine, engineering, social sciences, and humanities. This interdisciplinary presence heralds new possibilities for collaborative innovation to tackle contemporary challenges. The analysis identifies emerging trends, contributing to a deeper understanding of AI’s evolving role in society.

Paper Nr: 16
Title:

Brain Stroke Prediction Using Visual Geometry Group Model

Authors:

V. Narayanan, A. Reddy, V. Venkatesh, S. Tutun, P. Norouzzadeh, E. Snir, S. Mahmoud and B. Rahmani

Abstract: Stroke has become the leading cause of high mortality and disability rates in the modern era. Early detection and prediction of stroke can significantly improve patient outcomes. In this study, we propose a deep learning approach using the Visual Geometry Group (VGG-16) model, a type of Convolutional Neural Network (CNN) that is one of the best computer vision models to date, to predict the occurrence of a stroke in the brain. We used a dataset consisting of Magnetic resonance imaging (MRI) images of patients with and without stroke. The VGG-16 model was pre-trained on the ImageNet dataset and fine-tuned on our dataset to predict the occurrence of a stroke. Our experimental results demonstrated that the proposed approach achieves high accuracy and can effectively predict stroke occurrence. We have also conducted an extensive analysis of the model’s performance and provided insights into important features used by the model to predict stroke occurrence. The proposed approach has the potential to be used in clinical settings to aid in the early detection and prevention of stroke.
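
A compact transfer-learning sketch in the spirit of the abstract, assuming Keras and an image folder with "stroke"/"no_stroke" subdirectories; the directory layout, preprocessing, and hyperparameters are placeholders rather than the authors' setup.

    # Fine-tune an ImageNet-pretrained VGG-16 for binary stroke/no-stroke MRI
    # classification; all paths and parameters are illustrative.
    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "mri_images/train", image_size=(224, 224), batch_size=16)

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the convolutional base, train the new head

    model = models.Sequential([
        layers.Rescaling(1.0 / 255),   # simplified preprocessing for the sketch
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=5)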

Paper Nr: 20
Title:

Stochastic Simulation Agent for Unknown Inventory Demands in Healthcare Supply Management

Authors:

Rafael Marin Machado de Souza, Leandro Nunes de Castro, Marcio Biczyk, Marcos dos Santos and Eder C. Cassettari

Abstract: The acquisition of innovative items, or items without historical demand data, considerably increases the complexity of the daily routine of buyers, whose challenges include keeping stocks up to date with quantities that provide maximum profitability or maximum use of the purchased items. Seeking to provide a tool to assist with these goals, this study implements a Python-based software agent employing the Monte Carlo method for stochastic simulation and proposes a solution for uncertain inventory demands, providing a decision-making tool in the absence of historical data, thereby optimizing inventory levels and maximizing profitability. Experiments conducted across both local and cloud server configurations, with a comparative analysis of CPU and GPU performance, demonstrate the agent’s capacity to generate random scenarios with a statistical tolerance margin of 1% from 10,000 simulations. Scalability tests underscore the agent’s adaptability to diverse scenarios, effectively harnessing GPU capabilities for processing extensive data.
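
A minimal Monte Carlo sketch of the idea, assuming the unknown demand is described only by expert-provided bounds: simulate many demand scenarios from an assumed distribution and pick the order quantity with the best expected profit. All prices, bounds and the triangular assumption are invented for illustration.

    # Monte Carlo order-quantity search for an item with no demand history.
    import numpy as np

    rng = np.random.default_rng(7)
    n_scenarios = 10_000
    unit_cost, unit_price, salvage = 4.0, 10.0, 1.0          # invented economics
    demand = rng.triangular(left=50, mode=120, right=300, size=n_scenarios)

    def expected_profit(order_qty):
        sold = np.minimum(demand, order_qty)
        leftover = order_qty - sold
        profit = unit_price * sold + salvage * leftover - unit_cost * order_qty
        return profit.mean()

    candidates = np.arange(50, 301, 10)
    best = max(candidates, key=expected_profit)
    print(f"best order quantity: {best}, expected profit: {expected_profit(best):.2f}")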
Download

Paper Nr: 25
Title:

An Effective Prediction of Events in Social Networks Using Influence Score of Communities

Authors:

B. S. A. S. Rajita, Yaganti B. Vikas, Pritish P. Moharir, Deepa Kumari and Subhrakanta Panda

Abstract: In real-life social networks (SN), dynamic community evolution changes the structure of the network. Hence, a comprehensive framework is imperative for predicting community evolution, which this research refers to as an 'event'. This research studies how the influence of peer nodes in a social network often triggers community evolution. Therefore, this paper proposes calculating a new derived feature of communities, called the Influence Score (IS), to predict their events, and studies its suitability for accurately predicting events using Machine Learning (ML) models. The experimental results show that derived features together with community features are more effective in predicting community events. The implementation and significance of the presented approach on the dataset show that IS, as an added feature, improved the accuracy of the ML models by approximately 6.6%. It also considerably improved other parameters, including F-measure, recall, and precision. This paper further presents a comparative analysis with other derived features, showing accuracy improvements of approximately 1.5% and 0.8%. The results also indicate that the IS score improved the accuracy of logistic regression by 2.53% compared to an existing similar approach. Thus, this paper infers that IS as a derived feature is considerably effective in improving the accuracy of ML models in predicting events in SN communities.
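
The abstract does not give the exact formula for the Influence Score, so the following is only a hypothetical sketch of the general pattern: aggregate a node-level influence measure (degree centrality is used here purely as a stand-in) into one derived feature per community and append it to the community feature vector fed to the ML models.

```python
import networkx as nx
import pandas as pd

def community_influence(graph: nx.Graph, communities: dict[int, set]) -> pd.DataFrame:
    """Aggregate a node-level influence measure into one score per community.
    Degree centrality is only a placeholder for the paper's Influence Score."""
    centrality = nx.degree_centrality(graph)
    rows = []
    for cid, members in communities.items():
        rows.append({"community": cid,
                     "size": len(members),
                     "influence_score": sum(centrality[n] for n in members) / len(members)})
    return pd.DataFrame(rows)

# The resulting "influence_score" column would be concatenated with the other
# community features before training the event-prediction classifiers.
```
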
Download

Paper Nr: 26
Title:

Logical Rule Set to Data Acquisition and Database Semantics

Authors:

Susumu Yamasaki and Mariko Sasakura

Abstract: This paper is concerned with website page references, including streams from on-line seminars, as reference data for website organization. The motivation for processing website-page data comes from our observation that website page references contain a structure of data acquisition and of a logical database framework. As a formalism for this data processing, we treat logical expressions in intuitionistic propositional logic. By evaluating the linkage of website pages, as well as balked and suspended negatives of links, we analyse the structure contained in processes of using website page references, for the purposes of data acquisition and database semantics. As logical expressions, we make use of logical rules, from the viewpoint of a structural analysis of linkage consistency. By means of query derivation over the logical rule set, we obtain a 3-valued domain model theory for logical rule sets. The query derivation of this paper is a newly designed method for data acquisition. With respect to abstracting the notion of state from computing environments, a logical database is formulated as a state-constraint rule set with data acquisition capability, causing state transitions. The behavioural meaning of logical databases is captured by a modal operator, so that a modal logic can be applied to describing the meaning of logical databases. Modal logic is thus presented as a logical framework for database meaning at a different level: apart from the level of the logical database in intuitionistic propositional logic, the modal operator may be related to database semantics. By means of a fixed point of a function denoting a relation between states for the modal operator, computing-environment states are specified in relation to the logical databases.
Download

Paper Nr: 27
Title:

A Data-Driven Approach for Predictive Maintenance of Impellers in Flexible Impeller Pumps Using Prophet

Authors:

Efe C. Demir and Sencer Sultanoğlu

Abstract: This article presents a data-driven approach aimed at improving the efficiency of fabric dyeing operations in the textile industry. It specifically focuses on the predictive maintenance of flexible impeller pumps (FIP) and the application of the Prophet algorithm. The study extensively explores the potential of machine learning and data analytics to increase operational efficiency and enable early failure detection. By using the Facebook Prophet model and time series data for early detection of wear and tear, it offers an approach to maintain pump efficiency without installing new hardware, relying solely on data.
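
A minimal sketch of fitting Prophet to a pump-related time series for early-warning purposes, as described above. The input file, column names, forecast horizon, and wear threshold are assumptions for illustration, not the authors' configuration.

```python
import pandas as pd
from prophet import Prophet

# Assumed input: one row per timestamp with an efficiency-related sensor reading
df = pd.read_csv("pump_flow.csv")                     # hypothetical file
df = df.rename(columns={"timestamp": "ds", "flow_rate": "y"})

model = Prophet(daily_seasonality=True)
model.fit(df)

future = model.make_future_dataframe(periods=14, freq="D")
forecast = model.predict(future)

# Flag upcoming days whose predicted flow drops below an assumed wear threshold
alerts = forecast[forecast["yhat"] < 0.8 * df["y"].mean()][["ds", "yhat"]]
print(alerts)
```
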
Download

Paper Nr: 31
Title:

Objective Evaluation of Sleep Disturbances in Older Adults with Cognitive Impairment Using a Bed Sensor System and Self-Organizing Map Analysis

Authors:

Tomoko Kamimura, Risa Otsuka, Asaka Domoto, Hikofumi Suzuki and Mamino Tokita

Abstract: Bed sensor systems are useful for measuring sleep states in cognitively impaired older adults because they can measure unrestrained individuals. However, there are no criteria for identifying sleep abnormalities using them. We developed a method to determine sleep abnormalities by analysing data collected by a bed sensor system using a self-organizing map (SOM). In this study, the sleep states were measured in two cognitively impaired care-facility residents. These recordings were used to calculate total nocturnal sleep time, wake time after sleep onset, frequency of leaving the bed, and frequency of awakening in the bed for each day. The data from these four variables were used to draw an SOM for each individual’s sleep state to identify normal or abnormal sleep days. We visually determined whether a main cluster was formed in the SOM. If a main cluster was formed, the days included in the main cluster were defined as the individual’s normal days, while other days were defined as the individual’s abnormal days. The above parameters were independently compared between the two groups, as determined by the SOM. The characteristics of abnormal sleep days identified by SOM could be explained using these four variables, suggesting the effectiveness of identifying abnormal days by SOM.
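
A minimal sketch of the SOM step described above, using the MiniSom library on the four per-day sleep variables; the grid size and normalisation are assumptions, and the visual identification of a main cluster is left to the analyst, as in the paper.

```python
import numpy as np
from minisom import MiniSom

# Assumed array: one row per night with the four variables from the abstract
# [total sleep time, wake after sleep onset, times leaving bed, awakenings in bed]
X = np.loadtxt("sleep_days.csv", delimiter=",")       # hypothetical file
X = (X - X.mean(axis=0)) / X.std(axis=0)              # z-score normalisation

som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=5000)

# Map each night to its best-matching unit; densely occupied units would
# correspond to the "main cluster" of normal sleep days
winners = [som.winner(x) for x in X]
```
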
Download

Paper Nr: 33
Title:

BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments

Authors:

Andrea Colombo and Francesco Invernici

Abstract: Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing, so the process must rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, introducing a bottleneck in the data integration activity and a burden on actual users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data flow in daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, i.e., business registries, containing information about companies and corporations. While being domain-specific limits its usability in other contexts, BeRTo reaches a new frontier in terms of both precision and scalability, as it has been built on Spark. Integrating the pre-processing and cleaning steps in the same tool creates a user-friendly end-to-end pipeline that only requires users to input the raw data and set their preferred configuration, allowing them to focus on recall or precision.
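
BeRTo's internals are not spelled out in the abstract, so the following is only an illustrative PySpark sketch of string-based blocking and fuzzy matching on company names; the input files, column names, and similarity threshold are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("registry-linkage-sketch").getOrCreate()

left = spark.read.parquet("registry_a.parquet")       # hypothetical inputs
right = spark.read.parquet("registry_b.parquet")

# Lower-case and strip punctuation, then block on the first token of the name
norm = lambda c: F.regexp_replace(F.lower(F.col(c)), r"[^a-z0-9 ]", "")
left = left.withColumn("name_n", norm("company_name")).withColumn("block", F.split("name_n", " ")[0])
right = right.withColumn("name_n", norm("company_name")).withColumn("block", F.split("name_n", " ")[0])

# Candidate pairs within a block, scored with Levenshtein edit distance
pairs = (left.alias("l").join(right.alias("r"), on="block")
              .withColumn("dist", F.levenshtein("l.name_n", "r.name_n"))
              .filter(F.col("dist") <= 3))
pairs.select("l.company_name", "r.company_name", "dist").show()
```
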
Download

Paper Nr: 34
Title:

Recommendations of Research Articles by Experts: Visualizing Relationships and Expertise

Authors:

Peiling Wang, Scott Shumate, Pinghao Ye and Chad Mitchell

Abstract: This paper applies data analytics and network visualization to show the potential of employing Faculty Opinions beyond literature recommendations by domain experts. Based on a set of highly recommended articles, each endorsed by at least four experts with a sum of 10 or more stars (a recommended article is assigned a score of one to three stars by the recommender), this study tests new ideas and methods for identifying and visualizing relationships between scientific papers, experts, and categories. Although the dataset available for this study is small, the findings show that a platform designed for recommending and retrieving publications has potential as a knowledge base for seeking experts. The results are indicative rather than conclusive; further study should apply AI methodology to include multiple data sources, corroborate the findings, and enhance the applicability of data visualization towards knowledge graphs.
Download

Paper Nr: 42
Title:

Dataset Balancing in Disease Prediction

Authors:

Vincenza Carchiolo and Michele Malgeri

Abstract: The utilization of machine learning in the prevention of serious diseases such as cancer or heart disease is increasingly crucial. Various studies have demonstrated that enhanced forecasting performance can significantly extend patients’ life expectancy. Naturally, having sufficient datasets is vital for employing techniques to classify the clinical situation of patients, facilitating predictions regarding disease onset. However, available datasets often exhibit imbalances, with more records featuring positive metrics than negative ones. Hence, data preprocessing assumes a pivotal role. In this paper, we aim to assess the impact of machine learning and SMOTE (Synthetic Minority Over-sampling Technique) methods on prediction performance using a given set of examples. Furthermore, we will illustrate how the selection of an appropriate SMOTE process significantly enhances performance, as evidenced by several metrics. Nonetheless, in certain instances, the effect of SMOTE is scarcely noticeable, contingent upon the dataset and machine learning methods employed.
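
A minimal sketch of the SMOTE-based preprocessing the paper evaluates, using imbalanced-learn; the synthetic dataset, class ratio, and downstream classifier are placeholders rather than the study's actual data and models.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for an imbalanced clinical dataset (9:1 class ratio)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

print("before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("after: ", Counter(y_res))      # minority class synthetically oversampled

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```
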
Download

Paper Nr: 60
Title:

A Comparative Study on the Impact of Categorical Encoding on Black Box Model Interpretability

Authors:

Hajar Hakkoum and Ali Idri

Abstract: This study explores the challenge of opaque machine learning models in medicine, focusing on Support Vector Machines (SVMs) and comparing their performance and interpretability with Multilayer Perceptrons (MLPs). Using two medical datasets (breast cancer and lymphography) and three encoding methods (ordinal, one-hot, and dummy), we assessed model accuracy and interpretability through a decision tree surrogate and SHAP Kernel explainer. Our findings highlight a preference for ordinal encoding for accuracy, while one-hot encoding excels in interpretability. Surprisingly, dummy encoding effectively balanced the accuracy-interpretability trade-off.
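
A minimal sketch contrasting ordinal and one-hot encoding before fitting an SVM and explaining it with SHAP's KernelExplainer, in the spirit of the comparison described above; the toy data and background-sample size are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

# Toy categorical data standing in for the medical datasets
df = pd.DataFrame({"grade": ["low", "mid", "high", "mid", "low", "high"] * 20,
                   "site":  ["a", "b", "a", "c", "b", "c"] * 20})
y = np.random.default_rng(0).integers(0, 2, len(df))

for name, enc in [("ordinal", OrdinalEncoder()), ("one-hot", OneHotEncoder(sparse_output=False))]:
    X = enc.fit_transform(df)
    svm = SVC(probability=True).fit(X, y)
    background = shap.sample(X, 20)                    # small background set for KernelExplainer
    explainer = shap.KernelExplainer(svm.predict_proba, background)
    shap_values = explainer.shap_values(X[:5])         # explain a few instances
    print(name, np.shape(shap_values))
```
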
Download

Paper Nr: 78
Title:

Discretization Strategies for Improved Health State Labeling in Multivariable Predictive Maintenance Systems

Authors:

Jean-Victor Autran, Véronique Kuhn, Jean-Philippe Diguet, Matthias Dubois and Cédric Buche

Abstract: In machine learning, effective data preprocessing, particularly in the context of predictive maintenance, is a key to achieving accurate predictions. Predictive maintenance datasets commonly exhibit binary health states, offering limited insights into transitional phases between optimal and failure states. This work introduces an approach to label data derived from intricate electronic systems based on unsupervised discretization techniques. The proposed method uses data distribution patterns and predefined failure thresholds to discern the overall health of a system. By adopting this approach, the model achieves a nuanced classification that not only distinguishes between healthy and failure states but also incorporates multiple transitional states. These states act as intermediary phases in the system’s progression toward potential failure, enhancing the granularity of predictive maintenance assessments. The primary objective of this methodology is to increase anomaly detection capabilities within electronic systems. Through the utilization of unsupervised discretization, the model ensures a data-driven approach to system monitoring and health evaluation. The inclusion of multiple transitional states in the labeling process facilitates a more precise predictive maintenance framework, enabling informed decision-making in maintenance strategies. This article contributes to advancing the effectiveness of predictive maintenance applications by addressing the limitations associated with binary labeling, ultimately encouraging a more nuanced and accurate understanding of system health.
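
The specific discretization scheme is not detailed in the abstract, but the following sketch illustrates the general idea of turning a continuous degradation-related signal into multiple health states with an unsupervised discretizer; the signal, bin count, and failure threshold are assumptions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(1)
# Assumed degradation indicator drifting upward toward a failure threshold
signal = np.cumsum(rng.normal(0.05, 1.0, size=500)).reshape(-1, 1)
failure_threshold = signal.max()

# Unsupervised binning into 5 states: healthy, 3 transitional states, failing
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
states = disc.fit_transform(signal).astype(int).ravel()

# Override the worst state wherever the predefined failure threshold is approached
states[signal.ravel() >= 0.95 * failure_threshold] = 4
print(np.bincount(states))   # samples per health state, used as labels downstream
```
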
Download

Paper Nr: 83
Title:

A Few-Shot Learning-Focused Survey on Recent Named Entity Recognition and Relation Classification Models

Authors:

Sakher K. Alqaaidi, Elika Bozorgi, Afsaneh Shams and Krzysztof J. Kochut

Abstract: Named Entity Recognition (NER) and Relation Classification (RC) are important tasks for extracting information from unstructured text and transforming it into a machine-readable format. Recently, the field of few-shot learning has gained increased interest due to its ability to enable models to generalize across multiple domains using minimal labeled data. However, no studies have addressed the recent achievements in the NER and RC fields within the few-shot learning paradigm. In this work, we aim to fill this gap by presenting a survey on recent few-shot learning models in the fields of NER and RC. Our survey provides a thorough introduction to these tasks, along with a summary of the latest approaches and achievements. We conclude with our observations on the current state of research in these domains.
Download

Paper Nr: 121
Title:

Random Forest Classification of Cognitive Impairment Using Digital Tree Drawing Test (dTDT) Data

Authors:

Sebastian Unger, Zafer Bayram, Laura Anderle and Thomas Ostermann

Abstract: Early detection and diagnosis of dementia is a major challenge for medical research and practice. Hence, in the last decade, digital drawing tests became popular, showing sometimes even better performance than their paper-and-pencil versions. Combined with machine learning algorithms, these tests are used to differentiate between healthy people and people with mild cognitive impairment (MCI) or early Alzheimer’s disease (eAD), commonly using data from the Clock Drawing Test (CDT). In this investigation, a Random Forest Classification (RF) algorithm is trained on digital Tree Drawing Test (dTDT) data, containing socio-medical information and process data of 86 healthy people, 97 people with MCI, and 74 people with eAD. The results indicate that the binary classification works well for homogeneous groups, as demonstrated by a sensitivity of 0.85 and a specificity of 0.9 (AUC of 0.94). In contrast, the performance of both binary and multiclass classification degrades for groups with heterogeneous characteristics, which is reflected in a sensitivity of 0.91 and 0.29 and a specificity of 0.44 and 0.36 (AUC of 0.74 and 0.65), respectively. Nevertheless, as the early detection of cognitive impairment becomes increasingly important in healthcare, the results could be useful for models that aim for automatic identification.
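
A minimal sketch of the binary classification and evaluation pipeline described above; the synthetic feature matrix stands in for the socio-medical and drawing-process variables, and the sample and feature counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix

# Placeholder features standing in for dTDT process data (e.g. healthy vs. MCI)
X, y = make_classification(n_samples=183, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_te, rf.predict(X_te)).ravel()
print("sensitivity:", tp / (tp + fn),
      "specificity:", tn / (tn + fp),
      "AUC:", roc_auc_score(y_te, proba))
```
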
Download

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 36
Title:

Online Machine Learning for Adaptive Ballast Water Management

Authors:

Nadeem Iftikhar, Yi-Chen Lin, Xiufeng Liu and Finn E. Nordbjerg

Abstract: This paper proposes an innovative solution that employs online machine learning to continuously train and update models using sensor data from ships and ports. The proposed solution enhances the efficiency of ballast water management systems (BWMS), automated systems that use ultraviolet light and filters to purify and disinfect the ballast water that ships carry to maintain their stability and balance. Online learning allows the solution to grasp the complex and evolving patterns of ballast water quality and flow rate, as well as the diverse conditions of ships and ports. The solution also offers probabilistic forecasts that account for the uncertainty of future events that could impact the performance of ballast water management systems. An online machine learning architecture is proposed that can accommodate probabilistic machine learning models and algorithms designed for specific training objectives and strategies. Three training methodologies are introduced: continuous training, scheduled training, and threshold-triggered training. The effectiveness and reliability of the solution are demonstrated using actual ship and port performance data. The results are visualized using time-based line charts and maps.
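
A minimal sketch of the continuous-training pattern using the River online-learning library; the feature names, target, and model choice are illustrative assumptions, not the proposed architecture.

```python
from river import linear_model, preprocessing, metrics

# Online pipeline: scale streaming sensor features, then regress on an assumed target
model = preprocessing.StandardScaler() | linear_model.LinearRegression()
mae = metrics.MAE()

def stream():
    # Placeholder for the ship/port sensor stream described in the paper
    yield {"flow_rate": 420.0, "turbidity": 3.1, "port_salinity": 29.0}, 71.5

for x, y in stream():
    y_pred = model.predict_one(x)       # forecast before the true value arrives
    mae.update(y, y_pred)
    model.learn_one(x, y)               # continuous training on every new sample
print(mae)
```

Scheduled or threshold-triggered training would wrap the same learn_one call in a timer or drift-metric check instead of updating on every sample.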
Download

Short Papers
Paper Nr: 95
Title:

Towards FAIR Data Workflows for Multidisciplinary Science: Ongoing Endeavors and Future Perspectives in Plasma Technology

Authors:

Robert Wagner, Dagmar Waltemath, Kristina Yordanova and Markus M. Becker

Abstract: This paper focuses on the ongoing process of establishing a FAIR (Findable, Accessible, Interoperable and Reusable) data workflow for multidisciplinary research and development in applied plasma science. The presented workflow aims to support researchers in handling their project data while also fulfilling the requirements of modern digital research data management. The centerpiece of the workflow is a graph database (utilizing Neo4J) that connects structured data and metadata from multiple sources across the involved disciplines. The resulting workflow intends to enhance the FAIR compliance of the data, thereby supporting data integration and automated processing as well as providing new possibilities for user-friendly data exploration and reuse.
Download

Paper Nr: 97
Title:

Interoperable Open Data Platforms: A Prototype for Sharing CKAN Data Sources

Authors:

Sebastian Becker and Marcel Altendeitering

Abstract: Open data promotes transparency, accountability, and innovation in organizations and represents a central element of modern data management, supporting informed decision-making. CKAN is the world’s leading open-source data portal, widely used on national and local open-data platforms. However, CKAN installations are usually operated insularly, with limited interoperability and interaction between multiple instances. The induced separation is based on incompatible data models, leading to complex searches that include several open data platforms. In this paper, we describe a prototype that realizes an interoperability layer between CKAN instances and connects them in a data space. To create the intended solution, we relied on the Eclipse Dataspace Components (EDC) open-source project and present details on our architectural approach and implementation. For evaluation, we conducted a series of focus group discussions with stakeholders of our prototype. We received mostly positive feedback on our developments, and the participants agreed that our solution could lead to an improved interoperability of open-data platforms.
Download

Paper Nr: 99
Title:

Anomaly Detection in Industrial Production Products Using OPC-UA and Deep Learning

Authors:

Henry O. Velesaca, Doménica Carrasco, Dario Carpio, Juan A. Holgado-Terriza, Jose M. Gutierrez-Guerrero, Tonny Toscano and Angel D. Sappa

Abstract: In the realm of industrial manufacturing, detecting defects in products is critical for maintaining quality. Traditional methods relying on human inspection are often error-prone and time-consuming. However, advancements in automation and computer vision have led to smarter industrial control systems. This paper explores a novel approach to identifying defects in industrial processes by integrating OPC-UA and YOLO v8. OPC-UA provides a secure communication standard, enabling seamless data exchange between devices, while YOLO v8 provides accurate object detection. By combining these technologies, manufacturers can monitor production lines in near real-time, analyze defects promptly, and take corrective actions. As a result, product quality and operational efficiency are improved. A case study involving tinplate lid defect detection demonstrates the effectiveness of the proposed approach. The system architecture, including PLC integration, image acquisition, and YOLO v8 implementation, is detailed, followed by the performance evaluation of the OPC-UA server and YOLO v8 model integration. Results indicate efficient communication with low Round Trip Times and End-to-End delay, highlighting the potential of this approach for defect detection. The code is available at GitHub: https://github.com/hvelesaca/OPC-UA-YOLOv8-Lid-Anomaly-Detection, facilitating further research.
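
A minimal sketch of the pattern described above: poll a PLC value over OPC UA and run YOLOv8 inference on the corresponding image. The endpoint URL, node id, image path, and generic checkpoint are illustrative assumptions; the actual integration is in the linked repository.

```python
from opcua import Client                 # python-opcua client
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # generic YOLOv8 checkpoint, not the trained lid model

client = Client("opc.tcp://plc.local:4840")    # hypothetical OPC-UA endpoint
client.connect()
try:
    trigger = client.get_node("ns=2;i=1001")   # hypothetical node signalling "lid in position"
    if trigger.get_value():
        results = model("last_capture.jpg")    # image grabbed by the acquisition step
        for box in results[0].boxes:
            print("defect class:", int(box.cls), "confidence:", float(box.conf))
finally:
    client.disconnect()
```
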
Download

Paper Nr: 111
Title:

Towards Semantic Data Management Plans for Efficient Review Processing and Automation

Authors:

Jana Martínková and Marek Suchánek

Abstract: In recent times, Data Management Planning has become increasingly crucial. Effective practices in data management ensure more precise data collection, secure storage, proper handling, and utilization beyond the primary project. However, existing DMPs often suffer from complex structures that impede accessibility for humans and machines. This project aims to address these challenges by converting DMPs into formats that are both machine-actionable and human-readable. Leveraging established DMP templates and relevant ontologies, our methodology involves analyzing diverse approaches to achieve this dual functionality. We assess machine-actionability through comparative evaluations using AI and NLP tools. Furthermore, we identify gaps in ontologies, laying the groundwork for future enhancements in this critical area of research.
Download

Paper Nr: 112
Title:

Integration and Optimization of XNAT-Based Platforms for the Management of Heterogeneous and Multicenter Data in Biomedical Research

Authors:

Camilla Scapicchio, Silvia Arezzini, Maria E. Fantacci, Antonino Formuso, Aafke C. Kraan, Enrico Mazzoni, Sara Saponaro, Maria I. Tenerani and Alessandra Retico

Abstract: The rise of data-driven analysis methods in biomedical research has led to the need for proper data management. Organizing large datasets of heterogeneous biomedical data can be challenging, especially in multi-centric studies, with the need to ensure data integrity, quality, and privacy compliance with laws. In this work, we report and discuss two solutions that we are starting to implement: a platform for collecting Computed Tomography imaging data of phantoms and associated metadata in a multi-centric study focused on radiomics, and a platform for gathering, sharing, and analyzing diverse data acquired in a project focused on FLASH radiotherapy. Both platforms will be built on top of the XNAT technology. Our goal is to establish a secure and collaborative medical research environment that promotes data sharing, customized workflow analysis, and stores data and results for subsequent studies. The key innovation is the creation of a personalized platform system that currently does not exist. This is essential from a scientific point of view to enable advanced statistical analysis and reveal non-trivial relationships among heterogeneous data. This cannot be achieved with disorganized data collection. The platforms will also integrate analysis tools and quality control pipelines executable directly from the platform on stored data.
Download

Paper Nr: 38
Title:

Data-Driven Model Categorization: Advancing Physical Systems Analysis Through Graph Neural Networks

Authors:

Andrija Grbavac, Martin Angerbauer, Michael Grill and André C. Kulzer

Abstract: Efficiently categorizing physical system models is crucial for data science applications in scientific and engineering realms, facilitating insightful analysis, control, and optimization. While current methods, often relying on Convolutional Neural Networks (CNNs), effectively handle spatial dependencies in image data, they struggle with intricate relationships inherent in physical system models. Our research introduces a novel approach employing Graph Neural Networks (GNNs) to enhance categorization. GNNs excel in modeling complex relational structures, making them apt for analyzing interconnected components within physical systems represented as graphs. Leveraging GNNs, our methodology treats entities as system components and edges as their arrangements, effectively learning and exploiting inherent dependencies and interactions. The proposed GNN-based approach outperforms CNN-based methods across a dataset of 55 physical system models, eliminating limitations observed in CNN approaches. The results underscore GNNs’ ability to discern subtle interdependencies and capture non-local patterns, enhancing the accuracy and robustness of model categorization in a data science framework. This research contributes to advancing model categorization, emphasizing the application of data science for understanding and controlling complex physical systems. The innovative use of GNNs opens new avenues for revolutionizing the categorization of intricate physical system models in scientific and engineering domains.
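
A minimal sketch of graph-level classification with PyTorch Geometric, mirroring the idea of treating system components as nodes and their arrangement as edges; the layer sizes and category count are assumptions, not the authors' architecture.

```python
import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv, global_mean_pool

class ModelCategorizer(torch.nn.Module):
    """Graph classifier: node features -> message passing -> pooled graph embedding -> category."""
    def __init__(self, in_dim: int, hidden: int = 64, num_categories: int = 5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = Linear(hidden, num_categories)

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x = global_mean_pool(x, batch)       # one embedding per physical-system graph
        return self.head(x)

# Training would iterate over a torch_geometric DataLoader of the 55 system graphs,
# optimizing cross-entropy between predicted and true model categories.
```
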
Download

Paper Nr: 50
Title:

Use of Semantic Artefacts in Agricultural Data-Driven Service Development

Authors:

Silke Cuno and Philipp Lämmel

Abstract: The paper surveys the resources and efforts in the field of semantic interoperability in the agricultural domain, and describes challenges and solutions for building federated digital agricultural integration platforms that provide service components based on shared and reused data. It also reviews the state of the art of public semantic artefacts and their potential contribution to achieving semantic interoperability within digital agricultural integration platforms, by combining their main standards into common agricultural information models that can be used by developers and that fit a wide range of agricultural use cases.
Download

Paper Nr: 72
Title:

Real-Time Equipment Health Monitoring Using Unsupervised Learning Techniques

Authors:

Nadeem Iftikhar and Finn E. Nordbjerg

Abstract: Reducing unplanned downtime requires monitoring of equipment health. This may not be possible in many cases, as traditional health monitoring systems often rely on historical data and maintenance information that is not always available, especially for small and medium-sized enterprises. This paper presents a practical approach that uses sensor data for real-time equipment health indication. The proposed methodology consists of a set of steps. It starts with feature engineering, which may include feature extraction to transform raw sensor data into a format more suitable for analysis. Anomaly detection follows, where various techniques are employed to find deviations in the engineered features indicating potential equipment deterioration or abrupt failures. Then come the most important stages: equipment health indication and alert generation. These stages provide timely information about the equipment's condition and any necessary interventions. Together, these steps make the approach effective even when little or no historical data is available. The applicability of the approach is validated through a lab-based case study.
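
A minimal sketch of the feature-engineering and anomaly-detection stages described above, using an unsupervised model on engineered sensor features. Isolation Forest is used here as one plausible technique, and the rolling-window features, health index, and alert rule are assumptions, not the paper's exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Assumed raw stream: one vibration reading per second for an hour
raw = pd.Series(np.random.default_rng(0).normal(0, 1, 3600))

# Feature engineering: rolling statistics over 60-second windows
feats = pd.DataFrame({"mean": raw.rolling(60).mean(),
                      "std": raw.rolling(60).std(),
                      "max": raw.rolling(60).max()}).dropna()

# Unsupervised anomaly detection on the engineered features
iso = IsolationForest(contamination=0.01, random_state=0).fit(feats)
scores = iso.decision_function(feats)        # lower = more anomalous

# Health indication and alerting: map scores to a 0-100 index and flag the worst windows
health = 100 * (scores - scores.min()) / (scores.max() - scores.min())
alerts = feats.index[health < 5]
print(f"{len(alerts)} windows flagged for intervention")
```
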
Download

Area 5 - Databases and Data Security

Full Papers
Paper Nr: 44
Title:

Efficient and Secure Multiparty Querying over Federated Graph Databases

Authors:

Nouf Aljuaid, Alexei Lisitsa and Sven Schewe

Abstract: We present a system for efficient privacy-preserving multi-party querying (PPMQ) over federated graph databases. This framework offers a customisable and adaptable approach to privacy preservation using two different security protocols. The first protocol utilises standard secure multiparty computation (SMPC) protocols on the client side, enabling computations to be conducted on data without exposing the data itself. The second protocol is implemented on the server side using a combination of an SMPC protocol to prevent exposing the data to the clients and the use of encrypted hashing to prevent exposing the data to the server. We have conducted experiments to compare the efficiency of our PPMQ system with Neo4j Fabric, the off-the-shelf solution for querying federated graph databases, and with two previous systems, SMPQ and Conclave for secure multiparty querying. The results demonstrated that the execution times and overheads of PPMQ are comparable to those using Neo4j Fabric. Notably, our results reveal that the execution times and overheads of PPMQ outperform both SMPQ and Conclave, showcasing the better efficiency of our approach in preserving privacy within federated graph databases.
Download