DATA 2025 Abstracts


Area 1 - Data Analytics & Visualization

Full Papers
Paper Nr: 34
Title:

CCR-Logistic Based Variable Importance Visualization: Differentiating Prime and Suppressor Variables in Logit Models

Authors:

Ana Perišić and Ivan Sever

Abstract: Logistic regression typically involves assessing variable importance. This task becomes considerably more challenging in the presence of correlated variables (predictors) and suppression. We present a procedure for determining variable importance in multiple logistic regression models that can distinguish between suppressor variables and prime predictors. We propose a simple visualization tool for representing variable importance that can help practitioners determine important prime and suppressor variables when building multiple logistic regression models. The methodology relies on the extension of the Correlated Component Regression approach to logistic regression (CCR-Logit), which utilizes linear combinations of predictors instead of the original predictors and can easily be generalized to various regression models. The CCR-Logit methodology can handle a large number of predictors and is especially useful when dealing with correlated predictors. Variable importance is quantified by observing standardized regression coefficients from univariate models and higher-order component models: univariate models capture the direct effect on the outcome, while higher-order component models capture the suppressor effects. The proposed methodology is demonstrated on a real-world dataset from the field of tourism.
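
To make the diagnostic concrete, the sketch below contrasts standardized coefficients from univariate and full multivariable logistic fits, the comparison the abstract describes; it is not the authors' CCR-Logit implementation, and the helper names and the tol threshold are illustrative assumptions.

    # Minimal sketch (not CCR-Logit itself): flag suppressors by comparing
    # standardized coefficients from univariate vs. full logistic models.
    import numpy as np
    import statsmodels.api as sm

    def standardized_logit_coefs(X, y):
        """Fit a logit model on z-scored predictors; return slope coefficients."""
        Xz = (X - X.mean(axis=0)) / X.std(axis=0)
        model = sm.Logit(y, sm.add_constant(Xz)).fit(disp=0)
        return model.params[1:]  # drop the intercept

    def classify_predictors(X, y, names, tol=0.1):  # tol is an assumed cutoff
        full = standardized_logit_coefs(X, y)
        for j, name in enumerate(names):
            uni = standardized_logit_coefs(X[:, [j]], y)[0]
            if abs(uni) >= tol:            # sizable direct effect on the outcome
                role = "prime"
            elif abs(full[j]) >= tol:      # effect emerges only alongside others
                role = "suppressor"
            else:
                role = "negligible"
            print(f"{name}: univariate={uni:+.2f}, full={full[j]:+.2f} -> {role}")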

Paper Nr: 78
Title:

Predictors of Freshmen Attrition: A Case Study of Bayesian Methods and Probabilistic Programming

Authors:

Eitel J. M. Lauría

Abstract: This study explores the use of Bayesian hierarchical linear models to make inferences about predictors of freshmen student attrition, using student data from nine academic years and six schools at Marist University. We formulate a hierarchical generalized (Bernoulli) linear model and implement it on a probabilistic programming platform using Markov chain Monte Carlo (MCMC) techniques. Model fitness, parameter convergence, and the significance of regression estimates are assessed, and the Bayesian model is compared to a frequentist generalized linear mixed model (GLMM). We identify college academic performance, financial need, gender, tutoring, and work-study program participation as significant factors affecting the log-odds of freshmen attrition. Additionally, the study reveals fluctuations across time and schools; this variation in attrition rates highlights the need for targeted retention initiatives, as some schools appear more vulnerable to higher attrition. The study provides valuable insights for stakeholders, administrators, and decision-makers, offering findings applicable to other institutions and a detailed guideline on analyzing educational data using Bayesian methods.
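
For readers unfamiliar with the setup, here is a minimal sketch of a hierarchical Bernoulli (logit) model in a probabilistic programming platform (PyMC here), with school-level random intercepts sampled via MCMC; the variable names and priors are illustrative assumptions, not the paper's exact specification.

    # Hedged sketch of a hierarchical Bernoulli model with partial pooling.
    import pymc as pm

    def build_attrition_model(X, y, school_idx, n_schools):
        with pm.Model():
            beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])
            mu_a = pm.Normal("mu_a", 0.0, 1.0)
            sigma_a = pm.HalfNormal("sigma_a", 1.0)
            # school-level random intercepts (partial pooling across schools)
            a_school = pm.Normal("a_school", mu=mu_a, sigma=sigma_a, shape=n_schools)
            logit_p = a_school[school_idx] + pm.math.dot(X, beta)
            pm.Bernoulli("attrition", logit_p=logit_p, observed=y)
            idata = pm.sample(1000, tune=1000, target_accept=0.9)  # NUTS MCMC
        return idata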

Short Papers
Paper Nr: 47
Title:

Visual Methods for Network Analytics of Echo Chamber: A Case Study of Thailand’s General Election 2023

Authors:

Isariyaporn Sukcharoenchaikul and Puripant Ruchikachorn

Abstract: This research develops visual methods to study the echo chamber effect through a case study on Thailand's 2023 General Election. Using visualization techniques like node-link diagrams, t-SNE projections, and heatmaps, it examines homophilic relationships, clustering, and polarization in online communities. To minimize inaccuracies and biases, network graphs are created from contextual analysis of user-generated content, rather than relying on predefined relationships like friendships or followers. The study applies the Echo Chamber Score (ECS) with visualizations to explore variations in analytical methods and how they capture different aspects of echo chambers. Additionally, it illustrates how political events shape online discourse and community dynamics by linking ECS with key political milestones.

Paper Nr: 119
Title:

IDAT: An Interactive Data Exploration Tool

Authors:

Nir Regev, Asaf Shabtai and Lior Rokach

Abstract: In the current landscape of data analytics, data scientists predominantly utilize in-memory processing tools such as Python's pandas or big data frameworks like Spark to conduct exploratory data analysis (EDA). These methods, while powerful, often entail substantial trade-offs, including significant consumption of time, memory, and storage, alongside elevated data scanning costs. Considering these limitations, we developed iDAT, a cost-effective interactive data exploration method. Our method uses a deep neural network (NN) to learn the relationship between queries and their results, providing a rapid inference layer that predicts query results. To validate the method, we let 20 data scientists run EDA queries using the system underlying this method, and we show that it reduces the need to scan data during inference (query calculation). We evaluated the method on 12 datasets and compared it to recent query approximation engines (VerdictDB, BlinkDB) in terms of query latency, model weight, and accuracy. Our results indicate that iDAT predicted query results with a WMAPE (weighted mean absolute percentage error) ranging from approximately 1% to 4%, which, for most of our datasets, was better than the results of the compared benchmarks.
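
The core idea, learning the query-to-result mapping so that inference replaces data scans, can be sketched as follows; the query featurization (encoding a range filter's bounds) and the small MLP are illustrative assumptions, not iDAT's actual architecture.

    # Toy surrogate: predict AVG() over a filtered column without scanning it.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    data = rng.normal(50, 10, 100_000)           # the column we aggregate over

    def run_query(lo, hi):                       # ground truth: AVG with a range filter
        sel = data[(data >= lo) & (data <= hi)]
        return sel.mean() if sel.size else 0.0

    # training set: (lo, hi) query features -> exact query result
    queries = rng.uniform(20, 80, size=(2000, 2))
    queries.sort(axis=1)
    answers = np.array([run_query(lo, hi) for lo, hi in queries])

    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    surrogate.fit(queries, answers)
    # fast inference vs. a full scan:
    print(surrogate.predict([[40.0, 60.0]]), run_query(40.0, 60.0))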

Paper Nr: 64
Title:

DataPulse: An Interactive Dashboard for Statistical and Exploratory Analysis of Multimodal Healthcare Data in Shiny

Authors:

Adam Urban, James Connolly, Gauri Vaidya, Krishn Kumar Gupt and Meghana Kshirsagar

Abstract: The extraction of actionable insights from data-driven analyses is crucial for efficiently profiling, analyzing, and visualizing intricate medical datasets. A robust, generic data profiling tool is essential to uncover and understand relationships within medical data. In this context, building a Shiny app offers numerous advantages: interactive, user-friendly, and dynamic dashboards, and the capacity to deploy scalable, web-based solutions. In this article, we introduce DataPulse, a versatile data profiling tool designed to analyze multimodal healthcare data. By leveraging advanced statistical methodologies, DataPulse uncovers complex relationships in the data and displays them through comprehensive dashboards. For instance, in radiogenomics, sequential imaging visualizations can highlight dynamic changes in disease progression over time. The article discusses two use cases of DataPulse: one focusing on the analysis of hip fracture patient pathways in the emergency department, and the other offering a detailed exploration of cancer through multimodal datasets to derive insights on drug outcomes and disease progression over time. In conclusion, DataPulse exemplifies how robust, interactive data profiling can transform complex medical data into actionable insights.

Paper Nr: 98
Title:

Data Storytelling in Learning Analytics: AI-Driven Competence Assessment

Authors:

Ainhoa Álvarez, Aitor Renobales-Irusta and Mikel Villamañe

Abstract: Learning dashboards have become very popular, but the information shown on them is often difficult for users to interpret. Different authors have worked to improve dashboards by including narratives or data storytelling techniques. However, creating these narratives is a complex process. Several studies have begun to analyse the use of GenAI tools to generate these narratives in a scalable way, but this area of study is still at an early stage. In this paper, we present a proposal and a study aimed at generating narratives using GenAI, extending previous work by aligning the generated narratives with the curriculum design of the course. We first present a proposal for generating the narratives and then a study to evaluate their adequacy.

Paper Nr: 113
Title:

Towards a KD4LA Framework to Support Learning Analytics in Higher Education

Authors:

Thi My Hang Vu

Abstract: Learning analytics (LA) involves the process of collecting, organizing, and generating insights from educational data, such as learner assessments, learner profiles, or learner interactions with the educational environment, to support educators and learners in decision-making. The topic has attracted sustained attention from the research community. Nowadays, with advancements in data mining and the availability of large amounts of data from various educational environments, learning analytics presents both opportunities and challenges. Especially in higher education, where data is more complex and data analytics is closely integrated with pedagogical activities and objectives, a consolidated framework is crucial to support both educators and learners in their tasks. This paper proposes a comprehensive framework, named KD4LA (Knowledge Discovery for Learning Analytics), which clarifies the essential components of common learning analytics tasks in higher education. These tasks include generating statistical insights on student assessments, segmenting students based on their acquired knowledge, and evaluating their proficiency in relation to learning objectives. The proposed framework is validated through several real-world case studies that demonstrate its practical applicability.

Paper Nr: 126
Title:

Data Visualization Framework for Identifying Optimal Locations for Links in the Index Page of a Website

Authors:

Haider A. Ramadhan

Abstract: A data visualization framework for analyzing the efficiency of the design and structure of index pages in websites is presented. The framework uses multiple visualization views to identify the relationship between a link's location in the page structure and its popularity in terms of user access patterns. In other words, the framework identifies the relationship between the popularity of links and their functional locations in the index page, and then automatically recommends a more efficient redesign of the page. A second experimental analysis of the framework is presented; the evaluation follows the same methodology as the first experiment, in which index pages of four real, large websites were used. Both experimental studies seem to support the usefulness of the framework.

Paper Nr: 140
Title:

Effect of Data Visualization on Users’ Running Performance on Treadmill

Authors:

Thanaphon Amattayakul and Puripant Ruchikachorn

Abstract: This paper examined how real-time data visualization influences treadmill users' performance and experience. Traditional treadmill displays often present information as plain text, limiting user engagement and motivation. By applying visualization techniques that align with human cognitive processing, such as line charts and progress indicators, we proposed data visualization designs that represent running performance metrics more meaningfully. The study applied three display conditions: a traditional display and two improved visualization displays. Through a within-subjects experiment with 18 participants, metrics such as time to exhaustion, heart rate, distance, and calories expended were collected along with subjective feedback. Statistical analysis showed that both visualization displays significantly improved running performance and satisfaction. The results show that real-time feedback with data visualization design can positively influence users' understanding of and psychological connection to their fitness data. These findings highlight the potential of data visualization to elevate perception and user experience in exercise interfaces.

Area 2 - Data Engineering, Infrastructure and Business Applications

Full Papers
Paper Nr: 40
Title:

Data Quality Threat Mitigation Strategies for Multi-Sourced Linked Data

Authors:

Ali Obaidi and Adrienne Chen-Young

Abstract: Federal agencies link data from multiple sources to generate statistical data products essential to informing policy and decision making (National Academies, 2017). The ability to integrate and link data is accompanied by the challenge of harmonizing heterogeneous data, disambiguating similar data, and ensuring that the quality of data from all sources can be reconciled at levels that provide value and utility commensurate with the integration effort. Given the significant resources and effort needed to consistently maintain high quality, multi-sourced, linked data in a government ecosystem, this paper proposes steps that can be taken to mitigate threats to data quality at the earliest stage of the statistical analysis data lifecycle: data collection. This paper examines the threats to data quality that are identified in the Federal Committee on Statistical Methodology's (FCSM) Data Quality Framework (Dworak-Fisher, 2020), utilizes the U.S. Geological Survey's (USGS) Science Data Lifecycle Model (SDLM) (Faundeen, 2013) to isolate data quality threats that occur before integration processing, and presents mitigation strategies that can be taken to safeguard the utility, objectivity, and integrity of multi-sourced statistical data products.

Paper Nr: 139
Title:

TACO: A Lightweight Tree-Based Approximate Compression Method for Time Series

Authors:

André Bauer

Abstract: The rapid expansion of time series data necessitates efficient compression techniques to mitigate storage and transmission challenges. Traditional compression methods offer trade-offs between exact reconstruction, compression efficiency, and computational overhead. However, many existing approaches rely on strong statistical assumptions or require computationally intensive training, limiting their practicality for large-scale applications. In this work, we introduce TACO, a lightweight tree-based approximate compression method for time series. TACO eliminates the need for training, operates without restrictive data distribution assumptions, and enables selective decompression of individual values. We evaluate TACO on five diverse datasets comprising over 170,000 time series and compare it against two state-of-the-art methods. Experimental results demonstrate that TACO achieves compression rates of up to 92%, with average compression ratios ranging from 7.55 to 20.86, while maintaining reconstruction errors as low as 10⁻⁶, outperforming state-of-the-art approaches in three of the five datasets.

Short Papers
Paper Nr: 33
Title:

Data Quality Scoring: A Conceptual Model and Prototypical Implementation

Authors:

Mario Köbis-Riedel, Marcel Altendeitering and Christian Beecks

Abstract: A high level of data quality is crucial for organizations, as it supports efficient processes, corporate decision-making, and innovation. However, collaborating on data across organizational borders and sharing data with business partners is often impaired by a lack of data quality information and differing interpretations of the data quality concept. This information asymmetry between data provider and consumer leads to lower usability of data sets. In this paper, we present the conceptual model and prototypical implementation of a Data Quality Scoring (DQS) solution. Our solution automatically assesses the quality of a data set and allocates a data quality label similar to the Nutri-Score label for food. This way, we can communicate the data quality score in a structured and user-friendly way. For evaluation, we tested our approach using exemplary data sets and assessed the general functionality and runtime complexity. Overall, we found that our proposed DQS system is capable of automatically allocating data quality labels and can support communicating data quality information.
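
A minimal sketch of such a Nutri-Score-style labeling step is shown below; the quality dimensions, weights, and thresholds are illustrative assumptions, not the calibration used in the paper.

    # Toy data quality score mapped to an A-E label, Nutri-Score style.
    import pandas as pd

    def quality_label(df: pd.DataFrame) -> tuple[float, str]:
        completeness = 1.0 - df.isna().mean().mean()   # share of non-missing cells
        uniqueness = 1.0 - df.duplicated().mean()      # share of non-duplicate rows
        consistency = (df.dtypes != object).mean()     # crude typed-column proxy
        # assumed weights; a real DQS system would calibrate these
        score = 0.5 * completeness + 0.3 * uniqueness + 0.2 * consistency
        for threshold, label in [(0.9, "A"), (0.75, "B"), (0.6, "C"), (0.45, "D")]:
            if score >= threshold:
                return score, label
        return score, "E"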

Paper Nr: 63
Title:

A Decision Framework for AI/MLOps Toolchain Selection in Manufacturing

Authors:

Martin Bischof and Florian Wahl

Abstract: This paper addresses the growing challenge of implementing and selecting appropriate Machine Learning Operations (MLOps) toolchains in manufacturing environments, where computer vision applications are becoming increasingly prevalent. We introduce a comprehensive framework that uniquely combines MLOps platform evaluation criteria with a practical workflow methodology tailored for manufacturing settings. To validate our framework, we conducted experiments using the MVTec Anomaly Detection dataset, achieving 77.78% accuracy in granular defect-type classification when deployed through a commercial MLOps platform. Our framework effectively bridges the gap between theoretical principles and real-world manufacturing constraints by emphasizing both technical requirements and workflow considerations. This research advances industrial AI implementation by providing a systematic methodology that transcends conventional data mining approaches while specifically addressing manufacturing-sector challenges. Our findings demonstrate that successful MLOps toolchain selection necessitates a balanced evaluation of both functional capabilities and implementation workflows.

Paper Nr: 105
Title:

Selecting a Data Warehouse Provider: A Daunting Task

Authors:

João Ferreira, Nuno Lourenço and João R. Campos

Abstract: In the contemporary landscape of rapid data accumulation, organizations increasingly rely on data warehouses to process and store vast datasets efficiently. Although the most challenging task is appropriately designing a data warehouse, selecting a provider is far from the trivial task it should be. Each provider offers a distinct array of services, each with its own pricing model, which requires significant effort to analyze in order to determine which configuration meets the specific needs of the organization. In this paper, we highlight the inherent challenges of making fair comparisons among data warehouse solutions, using a start-up in the space traffic management industry as a case study. We defined several critical attributes for corporate decision-making: cost, processing capabilities, and data storage capacity. We systematically compare four leading technologies: Google BigQuery, AWS Redshift, Azure Synapse, and Snowflake. Our methodology employs a set of metrics designed to assess warehouse solutions, encompassing storage pricing, processing capabilities, scalability, and the integration of ETL tools. The process and the results highlight the challenges of this evaluation and underscore the need for a standard approach to characterizing provided service specifications and pricing, so as to allow a fair and systematic assessment and comparison of alternative solutions.

Paper Nr: 136
Title:

An Advanced Entity Resolution in Data Lakes: First Steps

Authors:

Lamisse F. Bouabdelli, Fatma Abdelhedi, Slimane Hammoudi and Allel Hadjali

Abstract: Entity Resolution (ER) is a critical challenge for maintaining data quality in data lakes, aiming to identify different descriptions that refer to the same real-world entity. We address the problem of entity resolution in data lakes, where a schema-less architecture and heterogeneous data sources often lead to entity duplication, inconsistency, and ambiguity, causing serious data quality issues. Although ER has been well studied in both academic research and industry, many state-of-the-art ER solutions face significant drawbacks. Existing ER solutions typically compare two entities based on attribute similarity, without taking into account that some attributes contribute more than others to distinguishing entities. In addition, traditional validation methods that rely on human experts are often error-prone, time-consuming, and costly. We propose an efficient ER approach that leverages deep learning, knowledge graphs (KGs), and large language models (LLMs) to automate and enhance entity disambiguation. Furthermore, the matching task incorporates attribute weights, thereby improving accuracy. By integrating LLMs for automated validation, this approach significantly reduces the reliance on manual expert verification while maintaining high accuracy.

Area 3 - Data in Industry and Emerging Trends in Data

Full Papers
Paper Nr: 114
Title:

Assessing Registration and Screening Technologies for Efficient Mass Vaccination and Public Health Monitoring

Authors:

Eva K. Lee and Kevin Yifan Liu

Abstract: Vaccine data collection during mass vaccination campaigns is a difficult task due to the lack of a unified system; yet accurate and timely documentation is essential for monitoring efficacy and adverse effects. In this study, we evaluate five electronic registration and screening technologies to test how quickly immunizations could be delivered and recorded given the different physical and cyber requirements of each technology. Using time-motion studies and service data analysis from influenza vaccination campaigns, we demonstrate operational and tracking efficiency with throughput improvements of 16% to 45%. Based on these findings, we propose a prototypical unified system for dispensing, monitoring, and assessment that is interoperable with existing immunization and electronic medical record systems. This paper highlights the potential of electronic technologies to significantly enhance vaccine administration and data management processes. Given the resource-constrained public health setting, the design emphasizes minimally enhanced technology requirements to achieve seamless data and process management and improved operational efficiency. The system is flexible, scalable, and adaptable to different types of medical countermeasures.

Short Papers
Paper Nr: 86
Title:

Driving Innovation in Fleet Management: An Integrated Data-Driven Framework for Operational Excellence and Sustainability

Authors:

Suryakant Kaushik

Abstract: This paper presents a comprehensive framework for leveraging advanced data analytics, artificial intelligence, and Internet of Things (IoT) technologies to revolutionize fleet management systems across various transportation sectors. Fleet operations globally face significant challenges including operational inefficiencies, increasing fuel costs, environmental compliance requirements, and safety concerns. The proposed integrated data-driven framework addresses these challenges by combining operational research techniques with AI-powered analytics and IoT-enabled sensor networks to optimize routing, reduce fuel consumption, enhance predictive maintenance capabilities, and improve driver safety protocols. Through analysis of real-world implementations across commercial and municipal fleets, we demonstrate how this framework has achieved fuel consumption reductions of up to 15%, decreased unplanned maintenance downtime by 30%, and significantly improved safety metrics. Our research provides empirical evidence of return on investment across various fleet sizes and compositions, including successful retrofitting strategies for legacy vehicles.

Paper Nr: 90
Title:

Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction

Authors:

Elias Sandner, Luca Fontana, Kavita Kothari, Andre Henriques, Igor Jakovljevic, Alice Simniceanu, Andreas Wagner and Christian Gütl

Abstract: Systematic reviews provide high-quality evidence but require extensive manual screening, making them time-consuming and costly. Recent advancements in general-purpose large language models (LLMs) have shown potential for automating this process. Unlike traditional machine learning, LLMs can classify studies based on natural language instructions without task-specific training data. This systematic review examines existing approaches that apply LLMs to automate the screening phase. Models used, prompting strategies, and evaluation datasets are analyzed, and the reported performance is compared in terms of sensitivity and workload reduction. While several approaches achieve sensitivity above 95%, none consistently reach the 99% threshold required for replacing human screening. The most effective models use ensemble strategies, calibration techniques, or advanced prompting rather than relying solely on the latest LLMs. However, generalizability remains uncertain due to dataset limitations and the absence of standardized benchmarking. Key challenges in optimizing sensitivity are discussed, and the need for a comprehensive benchmark to enable direct comparison is emphasized. This review provides an overview of LLM-based screening automation, identifying gaps and outlining future directions for improving reliability and applicability in evidence synthesis.
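
The two headline quantities can be made concrete with a small worked example; the definitions below follow common usage in screening automation (sensitivity as recall on relevant studies, workload reduction as work saved over sampling), and the toy numbers are ours.

    # Sensitivity and WSS (work saved over sampling) for a screening classifier.
    def screening_metrics(y_true, y_pred):
        tp = sum(t and p for t, p in zip(y_true, y_pred))
        fn = sum(t and not p for t, p in zip(y_true, y_pred))
        tn = sum(not t and not p for t, p in zip(y_true, y_pred))
        n = len(y_true)
        sensitivity = tp / (tp + fn)
        # WSS: fraction of studies a human no longer screens, minus the miss rate
        wss = (tn + fn) / n - (1.0 - sensitivity)
        return sensitivity, wss

    # 1,000 abstracts, 50 relevant; the model excludes 701 and misses 1 relevant
    y_true = [1] * 50 + [0] * 950
    y_pred = [1] * 49 + [0] * 1 + [1] * 250 + [0] * 700
    print(screening_metrics(y_true, y_pred))  # sensitivity 0.98, WSS ~0.68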

Paper Nr: 150
Title:

AI-Ready Open Data Ecosystem

Authors:

Wenwey Hseush, Shou-Chung Wang, Yong-Yueh Lee and Anthony Ma

Abstract: AI-ready data sharing plays a pivotal role in the data economy, where data consumption and value creation by AI agents are quietly becoming the new norm in our daily lives. This paper aims to establish an AI-ready, open-ended data infrastructure that spans the Internet and readily serves live, ubiquitous data to AI agents without repetitive data acquisition and preparation by agent developers. To achieve this, we propose a standardized framework for evaluating AI-ready data provisioning services, defining the criteria a data ecosystem and its provisioned data must meet to serve the real-world needs of AI agents.

Paper Nr: 164
Title:

A Microservice-Based Architecture for Real-Time Credit Card Fraud Detection with Observability

Authors:

Robson S. Santos, Robesvânia Araújo, Paulo A. L. Rego, José M. da S. M. Filho, Jarélio G. da S. Filho, José D. C. Neto, Nicksson C. A. de Freitas, Emanuel B. Rodrigues, Francisco Gomes and Fernando Trinta

Abstract: The growth of real-time financial transactions has increased the demand for scalable and transparent fraud detection systems. This paper presents a microservice-based architecture designed to detect credit card fraud in real time, integrating machine learning models with observability tools to monitor operational behavior. Built on OpenTelemetry (OTel), the architecture enables detailed tracking of performance metrics, resource usage, and system bottlenecks. Experiments conducted in a cloud-based environment demonstrate the scalability and efficiency of the solution under different workloads. Among the tested models, XGBoost outperformed Random Forest in throughput and latency, handling over 25,000 concurrent requests with response times under 50 ms. Compared to previous work focused solely on model accuracy, this study advances toward real-world applicability by combining fraud detection with runtime observability and elastic deployment. The solution is open-source and reproducible, and it contributes to the development of robust data-driven systems in the financial domain.

Paper Nr: 134
Title:

A Review on the Use of Large Language Models in the Context of Open Government Data

Authors:

Daniel Staegemann, Christian Haertel, Matthias Pohl and Klaus Turowski

Abstract: Since ChatGPT was released to the public in 2022, large language models (LLMs) have drawn enormous interest from academia and industry alike. Their ability to create complex texts based on provided inputs positions them as a valuable tool in many domains. Moreover, for some time, many governments have sought to increase transparency and enable new services by making their data freely available. However, these efforts towards Open Government Data (OGD) face various challenges, many of which relate to the question of how the data can be made easily findable and accessible. To address this issue, the use of LLMs appears to be a promising solution. To provide an overview of the corresponding research, this work presents the results of a structured literature review on the use of LLMs in the context of OGD. Numerous application areas as well as challenges are identified and described, providing researchers and practitioners alike with a synoptic overview of the domain.

Area 4 - Data Privacy & Security, Ethics & Governance

Short Papers
Paper Nr: 32
Title:

Towards Interoperable Data Spaces: Comparative Analysis of Data Space Implementations Between Japan and Europe

Authors:

Shun Ishihara and Taka Matsutsuka

Abstract: Data spaces are evolving rapidly. In Europe, the concept of data spaces, which emphasises the importance of trust, sovereignty, and interoperability, is being implemented in platforms such as Catena-X. Meanwhile, Japan has been developing its own approach to data sharing, in line with global trends but also addressing unique domestic challenges, resulting in platforms such as DATA-EX. Achieving interoperability between European and Japanese data spaces remains a critical challenge due to the differences created by these parallel advances. Although interoperability between data spaces has several aspects, compatibility of trust in the participating entities and the data exchanged is a significant one due to its influence on business. This paper undertakes a comparative analysis of DATA-EX and Catena-X, focusing on the aspect of trust, to explore the challenges and opportunities for achieving interoperability between Japanese and European data spaces. By examining common data exchange processes, key objects such as datasets, and specific evaluation criteria, the study identifies gaps and challenges and proposes actionable solutions such as an inter-exchangeable topology. Through this analysis, the paper aims to contribute to the ongoing discourse on global data interoperability.

Paper Nr: 35
Title:

Enhancing Trust in Inter-Organisational Data Sharing: Levels of Assurance for Data Trustworthiness

Authors:

Florian Zimmer, Janosch Haber and Mayuko Kaneko

Abstract: With data increasingly acknowledged as a valuable asset, much effort has been put into investigating inter-organisational data sharing to unlock the value of previously unused data. Research has identified mutual trust between actors as an essential prerequisite for successful data sharing activities. However, existing research often focuses on trust from a data provider perspective only. Our work, therefore, highlights this unbalanced view of trust and addresses trust barriers from a data consumer perspective. Investigating trust on the data level, i.e. the assessment and assurance of data trustworthiness, we found that existing solutions focused on data trustworthiness do not meet the domain requirements of inter-organisational data sharing. This paper addresses this shortcoming by proposing a new artifact called Levels of Assurance for Data Trustworthiness (Data LoA), based on a design science research approach. Data LoA provides an overarching, standardised framework for assuring data trustworthiness in inter-organisational data sharing. Our research suggests that the adoption of this artifact would increase data consumer trust. Still, as a first-iteration artifact, Data LoA requires further design effort before it can be deployed.

Paper Nr: 96
Title:

Facilitating Data Usage Control Through IPv6 Extension Headers

Authors:

Haydar Qarawlus, Malte Hellmeier and Falk Howar

Abstract: Data ownership and privacy control are gaining increasing attention and relevance. More and more aspects of enforcement methods for data ownership, data usage control, and data access control are being researched to find suitable solutions and practical applications. This includes technical methods, standards, and reference architectures for controlling data after it has been shared, which are frequently discussed under the term data sovereignty. However, enforcing data sovereignty is still challenging, and most solutions only work in trusted environments or are focused on the application layer. In this paper, we focus on policy enforcement mechanisms to facilitate data sovereignty on a lower network layer. We explore the use of steganography concepts and IPv6 extension headers in data usage control as an additional usage control measure. We propose the use of IPv6 Hop-by-Hop Options and Destination Options Extension Headers, including possible implementation methods and a prototype setup. We tested our work in multiple scenarios to examine performance and applicability. The results highlight the numerous benefits of our proposal with minimal drawbacks.
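
As a rough illustration of the underlying mechanism (not the authors' prototype), the scapy snippet below attaches a small usage-control marker to an IPv6 Hop-by-Hop Options extension header; the option type and payload are our assumptions.

    # Sketch: carry a usage-control marker in an IPv6 Hop-by-Hop Options header.
    from scapy.all import IPv6, IPv6ExtHdrHopByHop, HBHOptUnknown, UDP

    # Option type 0x1e (experimental range; top bits 00 = "skip if unknown",
    # so routers that do not understand the option simply ignore it).
    policy_tag = HBHOptUnknown(otype=0x1e, optdata=b"usage:no-forward")

    pkt = (IPv6(dst="2001:db8::1")               # documentation-prefix address
           / IPv6ExtHdrHopByHop(options=[policy_tag])
           / UDP(dport=9000) / b"payload")
    pkt.show()   # inspect the header chain; transmit with scapy's send(pkt)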

Paper Nr: 108
Title:

Towards Consistent Policy Enforcement in Dataspaces

Authors:

Julia Pampus and Maritta Heisel

Abstract: Data sovereignty refers to the autonomy and self-determination of organisations when it comes to sharing data. The focus, thereby, is on the data usage conditions that are expressed as policies. Current research explores the structure of these policies, the processes related to data offerings and policy negotiations, and their enforcement using access and usage control methods. However, there is still a lack of a consistent and comprehensive understanding of data sovereignty among data-sharing participants across various system landscapes. First, we discuss the reasons for this issue and its significance in the context of dataspaces, then take a position. We present a model-based design framework encompassing different environments for describing sovereign data sharing. To conclude our contribution, we outline an approach for systematically eliciting and analysing data usage requirements, thus strengthening interoperability and trust.

Paper Nr: 15
Title:

Fair Client Selection in Federated Learning: Enhancing Fairness in Collaborative AI Systems

Authors:

Ranim Bouzamoucha, Farah Barika Ktata and Sami Zhioua

Abstract: Fairness in machine learning (ML) is essential, especially in sensitive domains like healthcare and recruitment. Federated Learning (FL) preserves data privacy but poses fairness challenges due to non-IID data. This study addresses these issues by proposing a client selection strategy that improves both demographic and participation fairness while maintaining model performance. By analyzing the impact of selecting clients based on local fairness metrics, we developed a lightweight algorithm that balances fairness and accuracy through a Multi-Armed Bandit framework. This approach prioritizes equitable client participation, ensuring the global model is free of biases against any group. Our algorithm is computationally simple, making it suitable for constrained environments, and promotes exploration to include underrepresented clients. Experimental results show reduced biases and slight accuracy improvements, demonstrating the feasibility of fairness-driven FL. This work has practical implications for applications in recruitment, clinical decision-making, and other fields requiring equitable, high-performing ML models.
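
A minimal sketch of a Multi-Armed Bandit client selector of this kind follows; the UCB scoring and the accuracy/fairness reward mix are illustrative assumptions rather than the paper's exact algorithm.

    # UCB-style bandit that picks k FL clients, rewarding accuracy and fairness.
    import math

    class FairClientSelector:
        def __init__(self, n_clients, k, lam=0.5, c=1.0):  # lam weights fairness
            self.counts = [0] * n_clients     # times each client was selected
            self.values = [0.0] * n_clients   # running mean reward per client
            self.k, self.lam, self.c, self.t = k, lam, c, 0

        def select(self):
            self.t += 1
            def ucb(i):
                if self.counts[i] == 0:       # force exploration of unseen clients
                    return float("inf")
                bonus = self.c * math.sqrt(math.log(self.t) / self.counts[i])
                return self.values[i] + bonus
            return sorted(range(len(self.counts)), key=ucb, reverse=True)[: self.k]

        def update(self, client, accuracy, fairness):
            reward = (1 - self.lam) * accuracy + self.lam * fairness
            self.counts[client] += 1
            n = self.counts[client]
            self.values[client] += (reward - self.values[client]) / n  # running mean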

Paper Nr: 112
Title:

Prevalence of Security Vulnerabilities in C++ Projects

Authors:

Thiago Gadelha, Wallisson Freitas, Eduardo Rodrigues, José Maria Monteiro and Javam Machado

Abstract: One of the most critical tasks for organizations nowadays is to maintain the security of their software products. Common software vulnerabilities can result in severe security breaches, financial losses, and reputation deterioration. A software security vulnerability can be defined as a flaw in the source code that can be exploited by an attacker to gain unauthorized access to the software, thereby compromising its behavior and functionality. Thus, detecting and fixing security vulnerabilities in the source code of software systems is one of the most significant challenges in the field of information security. Static Application Security Testing (SAST) tools can statically analyze source code, without executing it, to identify security vulnerabilities, bugs, and code smells during the coding phase, when it is relatively inexpensive to detect and resolve security issues. In this context, this paper proposes an exploratory study of security vulnerabilities in C++ code from very large projects. We analyzed twenty-six worldwide C++ projects and empirically studied the prevalence of security vulnerabilities. Our results showed that some vulnerabilities occur together and that some vulnerabilities are more frequent than others. Based on these findings, this paper has the potential to help software developers avoid future problems during the development of a C++ project.

Area 5 - Data Science & Machine Learning

Full Papers
Paper Nr: 19
Title:

A Multi-Scale Feature Fusion Network for Detecting and Classifying Apple Leaf Diseases

Authors:

Assad S. Doutoum, Recep Eryigit and Bulent Tugrul

Abstract: Early detection and identification of leaf diseases reduce expenses and increase profits. Thus, it is essential for producers to be aware of the symptoms and indications of these leaf diseases and take the necessary preventative measures. Early diagnosis and treatment can also help prevent the disease from spreading to healthy plants. For successful disease control, regular inspections of orchards are essential. Besides being costly and time-consuming, traditional methods require a great deal of labor, whereas modern technologies and methods such as computer vision can both increase success rates and reduce costs. Deep learning methods can be used to detect and classify diseases, as well as predict the likelihood of their occurrence. However, a particular CNN architecture may focus on a subset of features, while another may discover additional features that the first does not extract from the dataset. Robust classification models should perform consistently well when environmental factors such as light, angle, background, and noise vary. To address these challenges, this study proposes a multi-scale feature fusion network (MFFN) that combines features from different scales or levels of detail in an image to improve the performance and robustness of classification models. The proposed method is evaluated on a publicly available dataset and is shown to improve on the performance of the original models. Four branches applied to CNN architectures were trained simultaneously and merged to accurately classify and predict infected apple leaves. The merged model detected infected leaves with a high degree of accuracy, significantly exceeding the individual models: it predicted unhealthy apple leaves with 99.36% training accuracy, 98.90% validation accuracy, and 98.28% test accuracy. The results show that combining the models is an effective way to increase the accuracy of predictions under volatile conditions.
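
A minimal Keras sketch of the multi-branch fusion idea is given below; the branch count, kernel sizes, and classifier head are illustrative assumptions, not the paper's exact MFFN.

    # Multi-scale feature fusion: parallel conv branches with different kernel
    # sizes, concatenated before a shared classification head.
    from tensorflow.keras import layers, models

    def build_mffn(input_shape=(224, 224, 3), n_classes=4):
        inp = layers.Input(shape=input_shape)
        branches = []
        for k in (1, 3, 5, 7):                       # one branch per kernel scale
            x = layers.Conv2D(32, k, padding="same", activation="relu")(inp)
            x = layers.MaxPooling2D(4)(x)
            x = layers.Conv2D(64, k, padding="same", activation="relu")(x)
            branches.append(layers.GlobalAveragePooling2D()(x))
        fused = layers.Concatenate()(branches)       # fuse features across scales
        hidden = layers.Dense(128, activation="relu")(fused)
        out = layers.Dense(n_classes, activation="softmax")(hidden)
        return models.Model(inp, out)

    model = build_mffn()
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])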

Paper Nr: 26
Title:

3D Convolutional Neural Network to Predict the Energy Consumption of Milling Processes

Authors:

Christoph Wald, Thomas Jung and Frank Schirmeier

Abstract: With increasingly fluctuating energy prices, the energy-flexible operation of electrical consumers, including machine tools, has recently gained attention. This study aims to predict the energy consumption of milling from the shape of the volume removed during a time step, generated using a time-discrete simulation environment. A 3D residual network is used to analyze the voxel representation of these "removed volumes". In total, 48 unique combinations of cutting depth and feed rate are recorded on a three-axis mill to evaluate the proposed model. The results indicate that energy consumption prediction using these shapes is possible.

Paper Nr: 31
Title:

Domain Adaption of a Heterogeneous Textual Dataset for Semantic Similarity Clustering

Authors:

Erik Nikulski, Julius Gonsior, Claudio Hartmann and Wolfgang Lehner

Abstract: Industrial textual datasets can be very domain-specific, containing abbreviations, terms, and identifiers that are only understandable with in-domain knowledge. In this work, we introduce guidelines for developing a domain-specific topic modeling approach that includes an extensive domain-specific preprocessing pipeline along with the domain adaption of a semantic document embedding model. While preprocessing is generally assumed to be a trivial step, for real-world datasets, it is often a cumbersome and complex task requiring lots of human effort. In the presented approach, preprocessing is an essential step in representing domain-specific information more explicitly. To further enhance the domain adaption process, we introduce a partially automated labeling scheme to create a set of in-domain labeled data. We demonstrate a 22% performance increase in the semantic embedding model compared to zero-shot performance on an industrial, domain-specific dataset. As a result, the topic model improves its ability to generate relevant topics and extract representative keywords and documents.

Paper Nr: 36
Title:

Development of the Novel FCF-SIWEC-RBNAR Hybrid Method for Financial Performance Evaluation

Authors:

Hamide Özyürek, Galip Cihan Yalçın and Karahan Kara

Abstract: Financial performance analyses are fundamental tools that provide insights into companies' financial conditions. The primary aim of this study is to develop a financial performance analysis method as a decision support system. In this context, the FCF-SIWEC-RBNAR (Fermatean Cubic Fuzzy-Simple Weight Calculation-Reference-Based Normalization Alternative Ranking) hybrid method was developed. In this method, expert weights are determined using FCF sets, while the weights of criteria are calculated using the FCF-SIWEC approach based on expert evaluations. Companies are then ranked according to their financial performance using the RBNAR method. To demonstrate the applicability of the proposed hybrid method, four case studies were conducted using data from 50 companies listed on Borsa Istanbul for the years 2020, 2021, 2022, and 2023. The "Debt-to-Equity Ratio" was identified as the most significant financial criterion, and the financial performance rankings of the companies were determined for each year. These findings support the FCF-SIWEC-RBNAR hybrid method as a robust and applicable approach for financial performance evaluation.

Paper Nr: 37
Title:

4,500 Seconds: Small Data Training Approaches for Deep UAV Audio Classification

Authors:

Andrew P. Berg, Qian Zhang and Mia Y. Wang

Abstract: Unmanned aerial vehicle (UAV) usage is expected to surge in the coming decade, raising the need for heightened security measures to prevent airspace violations and security threats. This study investigates deep learning approaches to UAV classification, focusing on the key issue of data scarcity. To investigate this, we trained the models on a total of 4,500 seconds of audio samples, evenly distributed across a 9-class dataset, and leveraged parameter-efficient fine-tuning (PEFT) and data augmentations to mitigate the data scarcity. This paper implements and compares convolutional neural networks (CNNs) and attention-based transformers. Our results show that CNNs outperform transformers by 1-2% in accuracy while being more computationally efficient. These early findings, however, point to the potential of transformer models, suggesting that with more data and further optimization they could outperform CNNs. Future work aims to upscale the dataset to better understand the trade-offs between these approaches.

Paper Nr: 68
Title:

Robots Performance Monitoring in Autonomous Manufacturing Operations Using Machine Learning and Big Data

Authors:

Ahmed Bendaouia, Salma Messaoudi, El Hassan Abdelwahed and Jianzhi Li

Abstract: Additive manufacturing has revolutionized industrial automation by enabling flexible and precise production processes. Ensuring the reliability of robotic systems remains a critical challenge. In this study, data-driven approaches are employed to automatically detect faults in the UR5 robot with six joints using Artificial Intelligence. By analyzing sensor data across different combinations of payload, speed, and temperature, this work applies feature engineering and anomaly detection techniques to enhance fault prediction. New features are generated, including binarized anomaly indicators using the interquartile range method and a difference-based time feature to account for the sequential and irregular nature of robot time data. These engineered features allow the use of neural networks (including LSTM), Random Forest, KNN, and GBM models to classify anomalies in position, velocity, and current. A key objective is to evaluate which anomaly type is the most sensitive by analyzing error metrics such as MAE and RMSE, providing insights into the most critical factors affecting robot performance. The experimental findings highlight the superiority of Gradient Boosting Machine and Random Forest in balancing accuracy and computational efficiency, achieving over 99% test accuracy while maintaining short training times. These two models outperform the others, which show a noticeable gap either in training time or test accuracy, demonstrating their effectiveness in improving fault detection and performance monitoring strategies in autonomous experimentation.

Paper Nr: 75
Title:

Enhancing Biosecurity in Tamper-Resistant Large Language Models with Quantum Gradient Descent

Authors:

Fahmida Hai, Saif Nirzhor, Rubayat Khan and Don Roosan

Abstract: This paper introduces a tamper-resistant framework for large language models (LLMs) in medical applications, utilizing quantum gradient descent (QGD) to detect malicious parameter modifications in real time. Integrated into a LLaMA-based model, QGD monitors weight amplitude distributions, identifying adversarial fine-tuning anomalies. Tests on the MIMIC and eICU datasets show minimal performance impact (accuracy on MIMIC drops only from 89.1 to 88.3) while robustly detecting tampering, and PubMedQA evaluations confirm preserved biomedical question-answering capabilities. Compared to baselines like selective unlearning and cryptographic fingerprinting, QGD offers superior sensitivity to subtle weight changes. This quantum-inspired approach ensures secure, reliable medical AI and is extensible to other high-stakes domains.

Paper Nr: 77
Title:

An Approach for the Automatic Detection of Prejudice in Instant Messaging Applications

Authors:

Melissa Sousa, Fernanda Nascimento, Gustavo Martins, José Maria Monteiro and Javam Machado

Abstract: Instant messaging applications have revolutionized communication, making it more accessible and efficient. However, they have also facilitated the widespread dissemination of prejudiced media content. In this context, the rapid and effective detection of prejudice in texts shared via messaging apps is crucial for promoting a healthy, diverse, and tolerant communicative environment. Few prejudice detection methods have been specifically developed for instant messaging platforms. Moreover, the development of effective methods requires labeled datasets containing prejudiced messages disseminated on these platforms, as user expressions differ significantly from those on other social networks like Facebook, Instagram, and X. However, we have not found any datasets containing prejudiced messages extracted from WhatsApp or Telegram. This work presents two publicly available labeled datasets, named PrejudiceWhatsApp.Br and PrejudiceTelegram.Br, consisting of Brazilian Portuguese (PT-BR) messages collected from public groups on WhatsApp and Telegram, respectively. Additionally, we developed a dictionary of prejudiced words for Brazilian Portuguese, named PrejudicePT-br, comprising 842 words organized into nine categories. Finally, we built a dictionary-based machine learning model to automatically detect prejudice in WhatsApp and Telegram messages. We conducted a series of text classification experiments, combining two feature extraction methods, three distinct token generation strategies, two preprocessing approaches, and nine classification algorithms to classify texts into two categories: prejudiced and non-prejudiced. Our best results achieved an F1-score of 0.86 for both datasets, demonstrating the feasibility of the proposed approach.

Paper Nr: 101
Title:

A Big Data Analytics System for Predicting Suicidal Ideation in Real-Time Based on Social Media Streaming Data

Authors:

Mohamed A. Allayla and Serkan Ayvaz

Abstract: Online social media platforms have recently become integral to our society and daily routines. Every day, users worldwide spend hours on such platforms, expressing their sentiments and emotional states and contacting each other. Analyzing the huge amounts of data from these platforms can provide clear insight into public sentiment and help detect users' mental state. Early identification of these health risks may assist in preventing or reducing suicidal ideation and potentially save lives. Traditional techniques have become ineffective at processing such streams and large-scale datasets. Therefore, this paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The proposed approach analyzes social media data in two phases: batch processing and real-time streaming prediction. The batch dataset was collected from the Reddit forum and used for model building and training, while streaming big data was extracted using the Twitter streaming API and used for real-time prediction. After the raw data was preprocessed, the extracted features were fed to multiple Apache Spark ML classifiers: NB, LR, LinearSVC, DT, RF, and MLP. We conducted various experiments using various feature-extraction techniques under different testing scenarios. The experimental results of the batch processing phase showed that the (Unigram + Bigram) + CV-IDF features with the MLP classifier provided high performance for classifying suicidal ideation, with an accuracy of 93.47%; this configuration was then applied in the real-time streaming prediction phase.
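
A hedged PySpark sketch of the kind of pipeline described, with unigram+bigram counts, IDF weighting, and an MLP classifier, is shown below; column names, vocabulary sizes, and layer sizes are our assumptions.

    # Spark ML pipeline: text -> unigrams+bigrams -> CountVectorizer -> IDF -> MLP.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (Tokenizer, NGram, CountVectorizer, IDF,
                                    VectorAssembler)
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    tokenizer = Tokenizer(inputCol="text", outputCol="unigrams")
    bigrams = NGram(n=2, inputCol="unigrams", outputCol="bigrams")
    cv_uni = CountVectorizer(inputCol="unigrams", outputCol="tf_uni", vocabSize=5000)
    cv_bi = CountVectorizer(inputCol="bigrams", outputCol="tf_bi", vocabSize=5000)
    assemble = VectorAssembler(inputCols=["tf_uni", "tf_bi"], outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    # first layer size must equal the assembled feature dimension (here 10000)
    mlp = MultilayerPerceptronClassifier(layers=[10000, 64, 2], labelCol="label")

    pipeline = Pipeline(stages=[tokenizer, bigrams, cv_uni, cv_bi, assemble, idf, mlp])
    # model = pipeline.fit(train_df); predictions = model.transform(stream_df)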

Paper Nr: 111
Title:

A Comparative Study of ML Approaches for Detecting AI-Generated Essays

Authors:

Mihai Nechita and Madalina Raschip

Abstract: Recent advancements in generative AI have introduced a significant challenge to academic credibility and integrity. This paper presents a comprehensive study of traditional machine learning methods and complex neural network models, such as recurrent neural networks and Transformer-based models, for detecting AI-generated essays. A two-step training of the Transformer-based model is proposed, where the aim of the pre-training step is to move the general language model closer to our problem. The models obtain good AUC scores for classification, outperforming state-of-the-art zero-shot detection approaches. The results show that Transformer architectures not only outperform the other methods on the validation datasets but also exhibit increased robustness across different sampling parameters. Generalization to new datasets, as well as model performance at small false positive rates (FPR), was also evaluated. To enhance transparency, the explainability of the proposed models was explored through the LIME and SHAP approaches.

Paper Nr: 131
Title:

An Extreme Gradient Boosting (XGBoost) Trees Approach to Detect and Identify Unlawful Insider Trading (UIT) Transactions

Authors:

Krishna Neupane and Igor Griva

Abstract: Corporate insiders have control of material non-public preferential information (MNPI). Occasionally, insiders strategically bypass legal and regulatory safeguards to exploit MNPI when executing securities trades. Due to the large volume of transactions, detecting unlawful insider trading and identifying the underlying patterns in insider behavior is an arduous task for humans. On the other hand, innovative machine learning architectures have shown promising results in analyzing large-scale, complex data with hidden patterns. One such popular technique is eXtreme Gradient Boosting (XGBoost), a state-of-the-art supervised classifier. We therefore apply XGBoost to the identification and detection of unlawful activities. The results demonstrate that XGBoost can identify unlawful transactions with a high accuracy of 97 percent and can provide a ranking of the features that play the most important role in detecting fraudulent activities.
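
A minimal sketch of this setup, an XGBoost classifier plus a gain-based feature ranking, follows; the feature preparation and hyperparameters are illustrative assumptions, not the paper's configuration.

    # Binary fraud classifier with a gain-based feature importance ranking.
    # Assumes X (pandas DataFrame of transaction features) and y (0/1 labels).
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=42)
    clf = xgb.XGBClassifier(n_estimators=400, max_depth=6, learning_rate=0.05,
                            eval_metric="logloss")
    clf.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

    # rank features by average gain, the usual basis for such rankings
    ranking = clf.get_booster().get_score(importance_type="gain")
    for name, gain in sorted(ranking.items(), key=lambda kv: -kv[1])[:10]:
        print(name, round(gain, 3))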

Paper Nr: 142
Title:

Integrated Sentiment and Emotion Analysis of the Ukraine-Russia Conflict Using Machine Learning and Transformer Models

Authors:

Mohammad Hossein Amirhosseini, Nabeela Berardinelli, Kunal Gaikwad, Christian Eze Iwuchukwu and Mahmud Ahmed

Abstract: The Russia-Ukraine war has been a significant international conflict, generating a wide range of public sentiments. With escalating geopolitical tensions, determining whether public discourse supports or condemns the invasion has become increasingly important. This study investigates public attitudes through large-scale sentiment analysis of 1,426,310 tweets collected during the early phase of the conflict. Sentiment classification was performed using machine learning models, including XGBoost, Random Forest, Naïve Bayes, Support Vector Machine, and a Feedforward Deep Learning model, combined with Count Vectorizer and TF-IDF. The deep learning model with Count Vectorizer achieved the highest accuracy at 89.58%, outperforming all others. To go beyond polarity classification, emotion prediction was also conducted using a lexicon-based method (NRC Emotion Lexicon) and a transformer-based model (DistilRoBERTa), each used to classify tweets into eight emotions: joy, trust, surprise, fear, anger, sadness, disgust, and anticipation. A comparative evaluation showed that the transformer model significantly outperformed the lexicon-based model across all metrics, including accuracy, precision, recall, F1 score, and Hamming loss. Fear and anger emerged as the most dominant emotions, highlighting widespread public anxiety and distress. This analysis provides a nuanced understanding of online discourse during conflict and offers insights for researchers, policymakers, and communicators responding to global crises.

Paper Nr: 143
Title:

From What-If Scenarios to Event Associations: A Novel Approach to Social Media Event Analysis

Authors:

Aigerim Mussina, Sanzhar Aubakirov, Paulo Trigo and Madina Mansurova

Abstract: This paper introduces a novel approach to event prediction in social media by applying association rules to generate counterfactual what-if scenarios. Using the Events2012 dataset as a foundation, we developed the EventsAssociation2012 dataset to systematically identify patterns within event sequences and assess the predictive power of what-if scenarios. Employing a Large Language Model (LLM) to generate event embeddings, similarity scores, and conditional probabilities, we mapped real-world scenarios to intra-event and inter-event associations, thereby creating a robust framework for understanding the interconnected nature of social media discussions. Our methodology leverages association rule mining to model causal relationships between events, enabling predictions of plausible future outcomes based on hypothetical scenarios. The results demonstrate the potential for applying what-if scenarios to new event datasets, revealing challenges and opportunities for refining this approach. The study further discusses areas for improvement, such as expanding the identification of intra-event scenarios, exploring multi-event associations, and enhancing topic embedding techniques. Overall, this work advances counterfactual analysis in event prediction, providing a more accurate and comprehensive method for modeling event associations in the dynamic landscape of social media.

Paper Nr: 144
Title:

Optimized and Explainable Feature Selection for Soil Moisture Prediction Across Sites

Authors:

Bamory Ahmed Toru Koné, Rima Grati, Bassem Bouaziz, Khouloud Boukadi and Massimo Mecella

Abstract: Accurate soil moisture prediction is critical for improving agricultural practices and managing water supplies. While feature selection techniques have proven useful in enhancing machine learning models’ performance in predicting soil moisture, their adaptation to different soil conditions remains limited. To address this gap, this study presents a novel multisite feature selection framework that draws on meteorological and soil data from three distinct locations with mineral, calcareous, and organic soils. The framework identifies soil-specific features through targeted selection processes and then uses SHAP, an explainable AI technique, to assess their global importance and influence. Furthermore, cross-site validation is performed to assess the transferability and generalizability of selected features, giving insight into their resilience across different environments. The proposed approach, which combines explainable AI and cross-site validation, provides a complete approach to understanding and improving feature relevance for soil moisture prediction. Overall, this study establishes the foundation for building more generalizable and robust predictive models, which will improve their applicability in a variety of agricultural and environmental scenarios.

Paper Nr: 151
Title:

Scalable Traffic Flow Estimation on Sensorless Roads Using LSTM and Floating Car Data

Authors:

Thamires de Souza Oliveira, David Pagano, Salvatore Cavalieri, Vincenza Torrisi and Giovanni Calabró

Abstract: Urban traffic monitoring is crucial for mobility, but the implementation of fixed sensors is costly and leads to restricted coverage. Floating Car Data (FCD) is emerging as an option, but its low penetration makes accurate traffic flow estimation difficult. This research proposes a Long Short-Term Memory (LSTM) model to scale FCD-based traffic estimates by learning flow patterns from routes with existing sensors. The model is trained with data from the most correlated sensors, but never the same one used for testing. The model identifies flow patterns from the available sensors and applies them to related paths. The findings indicate that the strategy is effective on routes with consistent flow but has limitations in regions with high traffic variability. This work contributes to the advancement of FCD scalability methods, expanding the coverage of urban traffic estimation without the need for new infrastructure.
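A minimal sketch of the sliding-window LSTM setup such a model implies is given below; the synthetic sensor series, window length, and network size are illustrative assumptions, not the paper's FCD configuration.

    # Toy sliding-window LSTM: predict the next flow reading from the
    # previous 12 (synthetic series; not the paper's sensor or FCD data).
    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(1)
    series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * rng.normal(size=2000)
    window = 12
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
    y = series[window:]

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=2, batch_size=64, verbose=0)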

Paper Nr: 157
Title:

Leveraging Liquid Time-Constant Neural Networks for ECG Classification: A Focus on Pre-Processing Techniques

Authors:

Lisa-Maria Beneke, Michell Boerger, Philipp Lämmel, Helene Knof, Andrei Aleksandrov and Nikolay Tcholtchev

Abstract: Neural networks have become pivotal in time series classification due to their ability to capture complex temporal relationships. This paper presents an evaluation of Liquid Time-Constant Neural Networks (LTCs), a novel approach inspired by recurrent neural networks (RNNs) that introduces a unique mechanism to adaptively manage temporal dynamics through time-constant parameters. Specifically, we explore the applicability and effectiveness of LTCs in the context of classifying myocardial infarctions in electrocardiogram data by benchmarking the performance of LTCs against RNN and LSTM models utilizing the PTB-XL dataset. Moreover, our study focuses on analyzing the impact of various pre-processing methods, including baseline wander removal, Fourier transformation, Butterworth filtering, and a novel x-scaling method, on the efficacy of these models. The findings provide insights into the strengths and limitations of LTCs, enhancing the understanding of their applicability in multivariate time series classification tasks.
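As one concrete example of the pre-processing steps compared, the sketch below removes baseline wander with a high-pass Butterworth filter; the 500 Hz sampling rate matches one of PTB-XL's released versions, while the filter order and cutoff are illustrative choices.

    # Baseline wander removal with a zero-phase high-pass Butterworth filter
    # (order and cutoff illustrative; PTB-XL offers 500 Hz and 100 Hz data).
    import numpy as np
    from scipy.signal import butter, filtfilt

    fs = 500.0                      # sampling rate in Hz
    b, a = butter(4, 0.5, btype="highpass", fs=fs)  # remove drift below 0.5 Hz
    ecg = np.random.randn(5000)     # stand-in for one ECG lead
    filtered = filtfilt(b, a, ecg)  # filtfilt avoids phase distortion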

Paper Nr: 162
Title:

PPMI-Benchmark: A Dual Evaluation Framework for Imputation and Synthetic Data Generation in Longitudinal Parkinson's Disease Research

Authors:

Moad Hani, Nacim Betrouni, Saïd Mahmoudi and Mohammed Benjelloun

Abstract: Longitudinal datasets like the Parkinson’s Progression Markers Initiative (PPMI) face critical challenges from missing data and privacy constraints. This paper introduces PPMI-Benchmark, the first comprehensive framework evaluating 12 imputation methods and 6 synthetic data generation techniques across clinical, demographic, and biomarker variables in Parkinson’s disease research. We implement advanced methods including HyperImpute (ensemble optimization), VaDER (variational deep embedding), and conditional tabular GANs (CTGAN), evaluating them through novel metrics integrating sliced Wasserstein distance (dSW = 0.039 ± 0.012), temporal consistency analysis, and clinical validity constraints. Our results demonstrate HyperImpute’s superiority in imputation accuracy (MAE=5.16 vs. 5.19–5.57 for baselines), while CTGAN achieves optimal distribution fidelity (SWD=0.039 vs. 0.062–0.146). Crucially, we reveal persistent demographic biases in cognitive scores, with age-related imputation errors increasing by 23% for patients over 70, and propose mitigation strategies. The framework provides actionable guidelines for selecting data completion strategies based on missingness patterns (MCAR/MAR/MNAR), computational constraints, and clinical objectives, advancing reproducibility and fairness in neurodegenerative disease research. Validated on 1,483 PPMI participants, our work addresses emerging needs in healthcare AI governance and synthetic data interoperability for multi-center collaborations.
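The sliced Wasserstein distance reported above can be estimated with random one-dimensional projections; the sketch below is a minimal NumPy re-implementation under the assumption of equal sample sizes, and the paper's exact estimator and settings may differ.

    # Monte Carlo estimate of the sliced Wasserstein-1 distance between two
    # samples of shape (n, d); assumes equal sample sizes for simplicity.
    import numpy as np

    def sliced_wasserstein(x, y, n_projections=100, seed=0):
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_projections):
            theta = rng.normal(size=x.shape[1])
            theta /= np.linalg.norm(theta)          # random unit direction
            px, py = np.sort(x @ theta), np.sort(y @ theta)
            total += np.abs(px - py).mean()         # 1-D Wasserstein-1 distance
        return total / n_projections

    rng = np.random.default_rng(1)
    print(sliced_wasserstein(rng.normal(size=(500, 8)),
                             rng.normal(loc=0.5, size=(500, 8))))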

Short Papers
Paper Nr: 20
Title:

Neuro-Symbolic Methods in Natural Language Processing: A Review

Authors:

Mst Shapna Akter, Md Fahim Sultan and Alfredo Cuzzocrea

Abstract: Neuro-Symbolic (NeSy) techniques in Natural Language Processing (NLP) combine the strengths of neural network-based learning with the clear interpretability of symbolic methods. This review paper explores recent advancements in neuro-symbolic NLP methods. We carefully highlight the benefits and drawbacks of different approaches in various NLP tasks. Additionally, we support our evaluations with explanations based on theory and real-world evidence. Based on our review, we suggest several potential research directions. Our study contributes in three main ways: (1) We present a detailed, complete taxonomy of Neuro-Symbolic methods in the NLP field; (2) We provide theoretical insights and comparative analysis of Neuro-Symbolic methods; (3) We propose future research directions to explore.

Paper Nr: 25
Title:

Classifying Hotspots Mutations for Biosimulation with Quantum Neural Networks and Variational Quantum Eigensolver

Authors:

Don Roosan, Rubayat Khan, Saif Nirzhor, Tiffany Khou and Fahmida Hai

Abstract: The rapid expansion of biomolecular datasets presents significant challenges for computational biology. Quantum computing emerges as a promising solution to address these complexities. This study introduces a novel quantum framework for analyzing TART-T and TART-C gene data by integrating genomic and structural information. Leveraging a Quantum Neural Network (QNN), we classify hotspot mutations, utilizing quantum superposition to uncover intricate relationships within the data. Additionally, a Variational Quantum Eigensolver (VQE) is employed to estimate molecular ground-state energies through a hybrid classical-quantum approach, overcoming the limitations of traditional computational methods. Implemented using IBM Qiskit, our framework demonstrates high accuracy in both mutation classification and energy estimation on current Noisy Intermediate-Scale Quantum (NISQ) devices. These results underscore the potential of quantum computing to advance the understanding of gene function and protein structure. Furthermore, this research serves as a foundational blueprint for extending quantum computational methods to other genes and biological systems, highlighting their synergy with classical approaches and paving the way for breakthroughs in drug discovery and personalized medicine.

Paper Nr: 29
Title:

Exploring LLM Capabilities in Extracting DCAT-Compatible Metadata for Data Cataloging

Authors:

Lennart Busch, Daniel Tebernum and Gissel Velarde

Abstract: Efficient data exploration is crucial as data becomes increasingly important for accelerating processes, improving forecasts and developing new business models. Data consumers often spend 25-98% of their time searching for suitable data due to the exponential growth, heterogeneity and distribution of data. Data catalogs can support and accelerate data exploration by using metadata to answer user queries. However, as metadata creation and maintenance is often a manual process, it is time-consuming and requires expertise. This study investigates whether LLMs can automate metadata maintenance of text-based data and generate high-quality DCAT-compatible metadata. We tested zero-shot and few-shot prompting strategies with LLMs from different vendors for generating metadata such as titles and keywords, along with a fine-tuned model for classification. Our results show that LLMs can generate metadata comparable to human-created content, particularly on tasks that require advanced semantic understanding. Larger models outperformed smaller ones, and fine-tuning significantly improves classification accuracy, while few-shot prompting yields better results in most cases. Although LLMs offer a faster and more reliable way to create metadata, a successful application requires careful consideration of task-specific criteria and domain context.
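A hedged sketch of the zero-shot variant of such a pipeline is shown below; call_llm is a hypothetical stand-in for whatever chat-completion client is used, and the DCAT field list is abbreviated for illustration.

    # Zero-shot DCAT-style metadata extraction (sketch; `call_llm` is a
    # hypothetical client function, and the schema shown is abbreviated).
    import json

    def build_metadata_prompt(document_text: str) -> str:
        return (
            "Extract metadata for the following dataset description. Answer "
            "with JSON containing the DCAT fields 'title', 'description' and "
            "'keyword' (a list of strings).\n\n" + document_text
        )

    def extract_metadata(document_text: str, call_llm) -> dict:
        # call_llm: any callable mapping a prompt string to a completion string.
        raw = call_llm(build_metadata_prompt(document_text))
        return json.loads(raw)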

Paper Nr: 30
Title:

Explainable Assessment Model for Digital Transformation Maturity

Authors:

Jihen Hlel, Nesrine Ben Yahia and Narjès Bellamine Ben Saoud

Abstract: Digital transformation has become a critical factor for organizational success in the modern business landscape. However, effectively and automatically assessing the maturity of this transformation remains a significant challenge. In this paper, we address the need for a unified and explainable digital maturity model to guide organizations in their transformation journey. Our primary research questions focus on the development of a core digital maturity model, the automatic validation of its effectiveness, and its explainability. To this end, we propose a core model composed of seven key dimensions (Technology, Strategy, Skills, Culture, Organization, Data, and Leadership) derived from an extensive literature review. Each dimension is assessed across five maturity levels (Basic, Discovery, Developed, Integrated, and Leadership). We then validate the proposed model by leveraging machine learning techniques to assess its applicability within organizations. Finally, we introduce an ensemble learning approach that combines unsupervised and supervised learning methods to enhance the explainability of the proposed digital maturity model. This approach aims not only to assess but also to elucidate the impact of different dimensions on digital maturity.

Paper Nr: 38
Title:

Proximal Policy Optimization with Graph Neural Networks for Optimal Power Flow

Authors:

Ángela López-Cardona, Guillermo Bernárdez, Pere Barlet-Rose and Albert Cabellos-Aparicio

Abstract: Optimal Power Flow (OPF) is a key research area within the power systems field that seeks the optimal operation point of electric power plants, and which needs to be solved every few minutes in real-world scenarios. However, due to the non-convex nature of power generation systems, there is not yet a fast, robust solution for the full Alternating Current Optimal Power Flow (ACOPF). In recent decades, power grids have evolved into the dynamic, non-linear and large-scale control system known as the power system, so searching for better and faster ACOPF solutions is becoming crucial. The appearance of Graph Neural Networks (GNN) has allowed the use of Machine Learning (ML) algorithms on graph data, such as power networks. On the other hand, Deep Reinforcement Learning (DRL) is known for its proven ability to solve complex decision-making problems. Although solutions that use these two methods separately are beginning to appear in the literature, none has yet combined the advantages of both. We propose a novel architecture based on the Proximal Policy Optimization (PPO) algorithm with Graph Neural Networks to solve the Optimal Power Flow. The objective is to design an architecture that learns how to solve the optimization problem and, at the same time, is able to generalize to unseen scenarios. We compare our solution with the Direct Current Optimal Power Flow approximation (DCOPF) in terms of cost. We first trained our DRL agent on the IEEE 30-bus system and then used it to compute the OPF on that base network under topology changes.

Paper Nr: 45
Title:

Arbitrary Shaped Clustering Validation on the Test Bench

Authors:

Georg Stefan Schlake and Christian Beecks

Abstract: Clustering is a highly important as well as highly subjective task in the field of data analytics. Selecting a suitable clustering method and a good clustering result is anything but trivial and needs insight not only into the field of clustering, but also into the application scenario in which the clustering is utilized. Evaluating a single clustering is hard, especially as there exists a wide variety of indices to evaluate the quality of a clustering, both for simple convex and for arbitrary shaped clusterings. In this paper, we investigate the ability of 11 state-of-the-art Clustering Validation Indices (CVIs) to evaluate arbitrary shaped clusterings. To this end, we provide a survey of the intuitive workings of these CVIs and an extensive benchmark on newly generated datasets. Furthermore, we evaluate both the Euclidean distance and the density-based DC-distance to quantify the quality of arbitrary shaped clusters. We use the generation of novel datasets to evaluate the influence of a number of metafeatures on the CVIs.
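The sketch below illustrates the paper's premise on a classic arbitrary-shaped example: a correct DBSCAN partition of two half-moons can still score modestly under Euclidean, convexity-oriented indices; the dataset and parameters are toy stand-ins, and the density-based DC-distance variant is not shown.

    # Euclidean CVIs on an arbitrary-shaped (two half-moons) clustering.
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.metrics import silhouette_score, davies_bouldin_score

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
    labels = DBSCAN(eps=0.2).fit_predict(X)
    print("silhouette:", silhouette_score(X, labels))
    print("Davies-Bouldin:", davies_bouldin_score(X, labels))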

Paper Nr: 49
Title:

Innovative Sentence Classification in Scientific Literature: A Two-Phase Approach with Time Mixing Attention and Mixture of Experts

Authors:

Meng Wang, Mengting Zhang, Hanyu Li, Jing Xie, Zhixiong Zhang, Yang Li and Gaihong Yu

Abstract: Accurately classifying innovative sentences in scientific literature is essential for understanding research contributions. This paper proposes a two-phase classification framework that integrates a Time Mixing Attention (TMA) mechanism and a Mixture of Experts (MoE) system to enhance multi-class innovation classification. In the first phase, TMA improves long-range dependency modeling through temporal shift padding and sequence slice reorganization. The second phase employs an MoE-based approach to classify theoretical, methodological, and applied innovations. To mitigate class imbalance, a generative semantic data augmentation method is introduced, improving model performance across different innovation categories. Experimental results demonstrate that the proposed two-phase SciBERT+TMA model achieves the highest performance, with a macro-averaged F1-score of 90.8%, including 95.1% for theoretical innovation, 90.8% for methodological innovation, and 86.6% for applied innovation. Compared to the one-phase SciBERT+TMA model, the two-phase approach significantly improves precision and recall, highlighting the benefits of progressive classification refinement. In contrast, the best-performing LLM baseline, Ministral-8B-Instruct, achieves a macro-averaged F1-score of 85.2%, demonstrating the limitations of prompt-based inference in structured classification tasks. The results underscore the advantage of a domain-adapted approach in capturing fine-grained distinctions in innovation classification. The proposed framework provides a scalable solution for multi-class sentence classification and can be extended to broader academic classification tasks. Model weights and details are available at https://huggingface.co/wmsr22/Research Value Generation/tree/main.

Paper Nr: 51
Title:

Data Breaches: What Happened over the Last 20 Years?

Authors:

Faheem Ullah, Liwei Wang, Uswa Fatima and Muhammad Imran Taj

Abstract: With the rapid development of information technology, commercial software has been inadequate in protecting personal data, resulting in multiple data breaches across industries. However, comprehensive research on data breaches remains limited. This study investigates their yearly trend, associated costs, impacted industries, types of compromised data, primary causes, affected regions, and tools used. Using web crawling, we collect reports from news headlines and extract insights from the data using natural language processing. Our results indicate a consistent upward trend in the number of data breaches over the years, with an average cost of $2.7 million per incident. The IT industry is the main target of data breaches, while emails are the most common type of data breached. Hacking is the major cause of data breaches, with North America being the most targeted region. SSH, RDP, FTP, Intruder, and Metasploit emerge as the top five tools used to breach data. Our findings show how things have changed over the past two decades in relation to data breaches and highlight the urgent need for enhanced security measures to mitigate evolving data losses, particularly in high-risk industries.

Paper Nr: 54
Title:

PreXP: Enhancing Trust in Data Preprocessing Through Explainability

Authors:

Sandra Samuel and Nada Sharaf

Abstract: Data preprocessing is a crucial yet often opaque stage in machine learning workflows. Manual preprocessing is time-consuming and inconsistent, while automated pipelines efficiently transform data but lack explainability, making it difficult to track modifications and understand preprocessing decisions. This lack of transparency can lead to uncertainty and reduced confidence in data preparation. PreXP (Preprocessing with Explainability) addresses this gap by enhancing transparency in preprocessing workflows. Rather than modifying data, PreXP provides interpretability by documenting and clarifying preprocessing steps, ensuring that users remain informed about how their data has been prepared. Initial evaluations suggest that increasing visibility into preprocessing decisions improves trust and interpretability, reinforcing the need for explainability in data-driven systems and supporting the development of more accountable machine learning workflows.

Paper Nr: 57
Title:

Framework for Data-Driven Spirulina Cultivation and Recommendations

Authors:

Aakaash Kurunth, Adithya S. Gurikar, B. Tejas, Sean Sougaijam and Kamatchi Priya L.

Abstract: Spirulina platensis, a microalgae widely recognized for its exceptional nutritional profile and sustainability, has garnered considerable attention in the fields of food, pharmaceuticals, and bioenergy. Its growth is strongly influenced by environmental factors such as temperature, irradiance, pH levels, and nutrient availability. However, identifying the ideal conditions for its cultivation is a challenging task due to the complex interactions between these factors. To address these challenges, we propose a novel approach that integrates predictive analytics and an intelligent recommendation engine to optimize Spirulina cultivation. We begin by evaluating multiple regression models, including Stacking, XGBoost, CatBoost, Gradient Boosting Machine (GBM), Support Vector Regressor (SVR), and Neural Networks, on our dataset to determine the most accurate predictor of Spirulina optical density. The best performing model is then leveraged to power a hybrid recommendation engine that combines content-based filtering and rule-based logic. This system helps us understand the optimal ranges for the critical features of Spirulina cultivation, thereby offering precise recommendations to farmers and researchers.

Paper Nr: 62
Title:

Improved Alzheimer’s Detection from Brain MRI via Transfer Learning on Pre-Trained Convolutional Deep Models

Authors:

Malek Jallali, Raouia Mokni and Boudour Ammar

Abstract: Alzheimer’s Disease (AD) presents a major challenge in modern healthcare due to its complex diagnosis and management. Early and accurate detection is essential for improving patient care and enabling timely therapeutic interventions. Research suggests that neurodegenerative changes associated with AD may appear years before clinical symptoms, highlighting the need for advanced diagnostic techniques. This study explores deep learning models for classifying AD stages using MRI scans. Specifically, we propose a Modified Convolutional Neural Network (MCNN) and a fine-tuned VGGNet19 (FT-VGGNet19) architecture. Both models were evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, leveraging data augmentation to enhance generalization and mitigate dataset limitations. Experimental results show that data augmentation significantly improves classification performance. The FT-VGGNet19 model achieved the highest accuracy, reaching 90% on the original dataset and 92% with augmented data. This study highlights the strengths of each model for clinical applications, emphasizing the role of optimized deep-learning frameworks in early AD detection. The findings contribute to developing robust and scalable diagnostic systems, offering promising advancements in neurodegenerative disease management.

Paper Nr: 66
Title:

Quantum Gradient Optimized Drug Repurposing Prototype for Omics Data

Authors:

Don Roosan, Saif Nirzhor, Rubayat Khan and Fahmida Hai

Abstract: This paper presents a novel quantum-enhanced prototype for drug repurposing and addresses the challenge of managing massive genomics data in precision medicine. Leveraging cutting-edge quantum server architectures, we integrated quantum-inspired feature extraction with large language model (LLM)-based analytics and unified high-dimensional omics datasets and textual corpora for faster and more accurate therapeutic insights. Applying the Synthetic Minority Over-sampling Technique (SMOTE) to balance underrepresented cancer subtypes and multi-omics sources such as TCGA and LINCS, the pipeline generated refined embeddings through quantum principal component analysis (QPCA). These embeddings drove an LLM trained on biomedical texts and clinical notes, generating drug recommendations with improved predicted efficacy and safety profiles. Combining quantum computing with LLMs outperformed classical PCA-based approaches in accuracy, F1 score, and area under the ROC curve. Our prototype highlights the potential of harnessing quantum computing and next-generation servers for scalable, explainable, and timely drug repurposing in modern healthcare.
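The SMOTE balancing step mentioned above follows the standard imbalanced-learn usage sketched here; the toy data stands in for the underrepresented cancer subtypes, and the QPCA and LLM stages are not reproduced.

    # Rebalancing a toy imbalanced dataset with SMOTE (stand-in for the
    # underrepresented cancer subtypes; QPCA/LLM stages omitted).
    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print("before:", Counter(y))
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after:", Counter(y_res))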

Paper Nr: 67
Title:

Quantum Approximate Optimization Algorithm for Spatiotemporal Forecasting of HIV Clusters

Authors:

Don Roosan, Saif Nirzhor, Rubayat Khan, Fahmida Hai and Mohammad Rifat Haidar

Abstract: HIV epidemiological data is increasingly complex, requiring advanced computation for accurate cluster detection and forecasting. We employed quantum-accelerated machine learning to analyze HIV prevalence at the ZIP-code level using AIDSVu and synthetic social determinants of health (SDoH) data for 2022. Our approach compared classical clustering (DBSCAN, HDBSCAN) with a quantum approximate optimization algorithm (QAOA), developed a hybrid quantum-classical neural network for HIV prevalence forecasting, and used quantum Bayesian networks to explore causal links between SDoH factors and HIV incidence. The QAOA-based method achieved 92% accuracy in cluster detection within 1.6 seconds, outperforming classical algorithms. Meanwhile, the hybrid quantum-classical neural network predicted HIV prevalence with 94% accuracy, surpassing a purely classical counterpart. Quantum Bayesian analysis identified housing instability as a key driver of HIV cluster emergence and expansion, with stigma exerting a geographically variable influence. These quantum-enhanced methods deliver greater precision and efficiency in HIV surveillance while illuminating critical causal pathways. This work can guide targeted interventions, optimize resource allocation for PrEP, and address structural inequities fueling HIV transmission.

Paper Nr: 81
Title:

ClustSize: An Algorithmic Framework for Size-Constrained Clustering

Authors:

Diego Vallejo-Huanga, Cèsar Ferri and Fernando Martínez-Plumed

Abstract: Size-constrained clustering addresses a fundamental need in many real-world applications by ensuring that clusters adhere to user-specified size limits, whether to balance groups or to satisfy domain-specific requirements. In this paper, we present ClustSize, an interactive web platform that implements two advanced algorithms, K-MedoidsSC and CSCLP, to perform real-time clustering of tabular data under strict size constraints. Developed in R Studio using the Shiny framework and deployed on Shinyapps.io, ClustSize not only enforces precise cluster cardinalities, but also facilitates dynamic parameter tuning and visualisation for enhanced user exploration. We comprehensively validate its performance through benchmarking covering runtime, RAM usage, and load and stress conditions, and we gather usability insights via user surveys. Post-deployment evaluations confirm that both algorithms consistently produce clusters that exactly meet the specified size limits, and that the system reliably supports up to 50 concurrent users and maintains functionality under stress, processing approximately 90 requests in 5 seconds. These results highlight the potential of integrating advanced size-constrained clustering into interactive web platforms for practical data analysis.

Paper Nr: 84
Title:

Bridging Competency Gaps in Data Science: Evaluating the Role of Automation Frameworks Across the DASC-PM Lifecycle

Authors:

Maike Holtkemper and Christian Beecks

Abstract: Successful data science projects require a balanced mix of competencies. However, a shortage of skilled professionals disrupts this balance, fragmenting expertise across the data science pipeline. This fragmentation causes inefficiencies, delays, and project failures. Automation frameworks can help to mitigate these issues by handling repetitive tasks and integrating specialized skills. These frameworks improve workflow efficiency across project phases but remain limited in critical areas like project initiation and deployment. This pre-study identifies tasks in each project phase using the DASC-PM model. The model structures the assessment of automation potential and maps tasks to the EDISON Data Science Framework (EDSF), determining which competencies automation can support. The findings indicate that automation enhances efficiency in early phases, such as Data Provision and Analysis, contrasting with challenges in Project Order and Deployment, where human expertise remains essential. Addressing these gaps can improve collaboration and create a more integrated data science workflow.

Paper Nr: 100
Title:

Utilizing Generative Adversarial Networks for Preserving Privacy in Developing Machine Learning Models for the Healthcare Industry

Authors:

Shahnawaz Khan, Bharavi Mishra, Sultan Alamri and Philippe Pringuet

Abstract: Privacy preservation is a critical challenge while developing machine learning models utilizing medical data. This research investigates the application of Generative Adversarial Networks (GANs) for generating synthetic medical datasets while preserving the properties of the real-world data. It investigates both tabular and image-based medical datasets. This research employs Conditional Tabular GANs (CTGANs) for tabular data (Heart Disease Cleveland dataset) and Deep Convolutional GANs (DCGANs) for image data (chest X-ray dataset). The primary purpose is to synthesize datasets within the healthcare domain that closely mimic the statistical properties and diagnostic relevance of their real-world counterparts while safeguarding patient privacy. The proposed research focuses on training GANs to learn complex patterns and dependencies within the data, thus enabling them to generate realistic synthetic samples that can be used for training machine learning models. The generated datasets have been evaluated using various classifiers. The results demonstrate that models trained on synthetic data achieve performance comparable to those trained on real data, confirming the efficacy of our approach in balancing data utility and privacy. Furthermore, this research explores techniques for privacy enhancement, including parameter tuning, differential privacy, and layer-wise perturbation, to further strengthen privacy preservation. The findings suggest that GAN-based synthetic data generation offers a robust and versatile solution for privacy-preserving machine learning in medical applications.
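For the tabular branch, the open-source ctgan package exposes the kind of workflow described; the sketch below uses a toy frame with hypothetical clinical columns, and the hyperparameters are illustrative rather than the study's settings.

    # Fitting a CTGAN on a toy tabular "clinical" frame and sampling
    # synthetic rows (columns and hyperparameters are assumptions).
    import pandas as pd
    from ctgan import CTGAN

    df = pd.DataFrame({
        "age":  [63, 41, 58, 70, 49] * 20,
        "chol": [233, 180, 250, 210, 195] * 20,
        "sex":  ["M", "F", "F", "M", "M"] * 20,
    })
    model = CTGAN(epochs=5)
    model.fit(df, discrete_columns=["sex"])
    synthetic = model.sample(100)   # privacy-preserving stand-in for real rows
    print(synthetic.head())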

Paper Nr: 115
Title:

Model Card Metadata Collection from Hugging Face to Foster Multidisciplinary AI Research: A Dataset

Authors:

Muhammad Asif Suryani, Saurav Karmakar, Brigitte Mathiak and Philipp Mayr

Abstract: Metadata features generally exhibit valuable meta information which may facilitate researchers in their tasks. Several studies have incorporated scholarly metadata, highlighting its usefulness at various levels of granularity for numerous research tasks. The emergence of Large Language Models (LLMs) has brought an exciting change in the field of Artificial Intelligence (AI) and Machine Learning (ML), which is equally supported by the Open Science initiative and FAIR principles. One of the prominent platforms ensuring the availability of these models to research communities is Hugging Face. It provides democratized access to models while experiencing rapid growth as a repository. As of March 2025, Hugging Face hosts more than 1.4 million models, up from approximately 0.5 million in February 2024. In this dataset paper, we provide information on a large fraction of Hugging Face model cards. Our dataset comprises a wide range of metadata features that showcase the meta information about each model card. In this work, we aim to provide democratized access to a collection of diverse metadata features from Hugging Face model cards and present an insightful overview of these cards, leveraging the metadata to support research communities and facilitate model adoption.
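The kind of harvesting behind such a dataset can be sketched with the huggingface_hub client, as below; this requires network access, samples only a handful of repositories, and prints a single card field for illustration.

    # Sampling model repos and reading one metadata field from their cards
    # (requires network access; the field shown is illustrative).
    from huggingface_hub import HfApi, ModelCard

    api = HfApi()
    for info in api.list_models(limit=5):
        try:
            card = ModelCard.load(info.id)
            print(info.id, card.data.to_dict().get("license"))
        except Exception:
            print(info.id, "no readable model card")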

Paper Nr: 117
Title:

Detecting AI-Generated Reviews for Corporate Reputation Management

Authors:

R. E. Loke and M. GC

Abstract: In this paper, we provide a concrete case study to better steer corporate reputation management in light of the recent proliferation of AI-generated product reviews. Firstly, we provide a systematic methodology for generating high-quality AI-generated text reviews using pre-existing human-written reviews and present the GPTARD dataset for training AI-generated review detection systems. We also present a separate evaluation dataset called ARED that contains a sample of product reviews from Amazon along with their predicted authenticity from incumbent tools in the industry, enabling comparative benchmarking of AI-generated review detection systems. Secondly, we provide a concise overview of current approaches in fake review detection and propose to apply an overall, integrated group of four predictive features in our machine learning systems. We demonstrate the efficacy of these features by providing a comparative study among four different types of classifiers, in which our machine learning based AI-generated review detection system in the form of a random forest prevails with an accuracy of 98.50% and a precision of 99.34% on ChatGPT-generated reviews. Our highly performant system can in practice be used as a reliable tool in managing corporate reputation against AI-generated fake reviews. Finally, we provide an estimate of the share of AI-generated reviews in a sample of products on Amazon.com, which turns out to be almost 10%. To validate this estimate, we also provide a comparison with the existing tools Fakespot and ReviewMeta. Such a high prevalence of AI-generated reviews motivates future work on helping corporate reputation management by effectively fighting spam product reviews.

Paper Nr: 132
Title:

Anomaly Detection in IoT Networks: A Performance Comparison of Transformer, 1D-CNN, and GrowNet Models on the Bot-IoT Dataset

Authors:

Aurelia Kusumastuti, Denis Rangelov, Philipp Lämmel, Michell Boerger, Andrei Aleksandrov and Nikolay Tcholtchev

Abstract: This paper presents an exploratory analysis of deep learning techniques for intrusion detection in IoT networks. Specifically, we investigate three innovative intrusion detection systems based on transformer, 1D-CNN and GrowNet architectures, comparing their performance against random forest and three-layer perceptron models as baselines. For each model, we study the multiclass classification performance using the publicly available IoT network traffic dataset Bot-IoT. We use the most important performance indicators, namely, accuracy, F1-score, and ROC, but also training and inference time to gauge the utility and efficacy of the models. In contrast to earlier studies where random forests were the dominant method for ML-based intrusion detection, our findings indicate that the transformer architecture outperforms all other methods in our approach.

Paper Nr: 133
Title:

Toward a Public Dataset of Wide-Field Astronomical Images Captured with Smartphones

Authors:

Olivier Parisot and Diogo Ramalho Fernandes

Abstract: Smartphone photography has improved considerably over the years, particularly in low-light conditions. Thanks to better sensors, advanced noise reduction and night modes, smartphones can now capture detailed wide-field astronomical images, from the Moon to deep-sky objects. These innovations have transformed them into pocket telescopes, making astrophotography more accessible than ever. In this paper, we present AstroSmartphoneDataset, a dataset of wide-field astronomical images collected since 2021 with various smartphones, and we show how we have used these images to train Deep Learning models for trail detection.

Paper Nr: 146
Title:

Data-Driven Prediction of High-Risk Situations for Cyclists Through Spatiotemporal Patterns and Environmental Conditions

Authors:

Sarah Di Grande, Mariaelena Berlotti, Salvatore Cavalieri and Daniel G. Costa

Abstract: Enhancing cycling is an increasingly important challenge, especially as it is promoted for its economic, environmental, and health benefits. However, ensuring the safety of cyclists is crucial to support this shift in mobility. In this context, machine learning offers promising avenues. This study proposes a novel approach to identifying high-risk locations by dynamically incorporating spatiotemporal patterns and environmental conditions. The method was tested using comprehensive data from Germany, and its design suggests strong potential for generalization to other countries. This work can support urban planners, policymakers, and navigation systems in improving road safety and informing smarter mobility decisions.

Paper Nr: 147
Title:

Detecting Misinformation Virality on WhatsApp

Authors:

Fernanda Nascimento, Melissa Sousa, Gustavo Martins, José Maria Monteiro and Javam Machado

Abstract: In recent years, the large-scale dissemination of misinformation through social media has become a critical issue, undermining public health, social stability, and even democracy. In many developing countries, such as Brazil, India, and Mexico, the WhatsApp messaging app is one of the primary sources of misinformation. Recently, automatic misinformation detection using machine learning methods has emerged as an active research topic. However, not all misinformation is equally significant. For instance, widely spread misinformation can have a greater impact and tends to be more persuasive to users. In this context, we address the early detection of misinformation virality by analyzing the textual content of WhatsApp messages. First, we introduced a large-scale, labeled, and publicly available dataset of WhatsApp messages, called FakeViralWhatsApp.Br, which contains 52,080 labeled messages along with their respective frequencies. Next, we conducted a series of binary classification experiments, combining three feature extraction methods, three distinct token generation strategies, two preprocessing approaches, and six classification algorithms to classify messages into viral and non-viral categories. Our best results achieved an F1-score of 0.98 for general posts and 1.0 for misinformation messages, demonstrating the feasibility of the proposed approach.

Paper Nr: 149
Title:

Next-Event Prediction in Cybercrime Complaint Narratives Using Temporal Event Scene Graphs

Authors:

Mohammad Saad Rafeeq, Narendra Bijarniya and Chandramani Chaudhary

Abstract: Cybercrime complaint narratives encompass complex sequences of criminal activities that challenge conventional sequence modeling techniques. This work introduces a framework that employs dynamic temporal event scene graphs to represent each narrative as an evolving, structured network of entities and events. Our approach converts complaint texts into temporal event scene graphs in which nodes symbolize key entities and edges capture interactions, annotated with their sequential order. This structured representation provides a richer and more intuitive understanding of how cybercrime incidents unfold over time. To forecast missing or forthcoming events, we fine-tune a pre-trained BART model using a masked sequence-to-sequence paradigm. Our experiments are performed on a dataset comprising thousands of real-world cybercrime reports, containing roughly 76,000 distinct event descriptions, a scale that introduces significant sparsity and generalization challenges. Our results demonstrate that while models such as GPT-2 and T5 struggle to capture robust patterns in this diverse domain, the BART-based approach achieves modest yet promising improvements.

Paper Nr: 152
Title:

Evaluation of LLM-Based Strategies for the Extraction of Food Product Information from Online Shops

Authors:

Christoph Brosch, Sian Brumm, Rolf Krieger and Jonas Scheffler

Abstract: Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48%, −2.27% compared to direct extraction), it reduces the number of required LLM calls by 95.82%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.

Paper Nr: 153
Title:

Bitcoin Fraud Detection: A Study with Dimensionality Reduction and Machine Learning Techniques

Authors:

Nuno Gomes and Artur J. Ferreira

Abstract: The rise of cryptocurrencies marks a remarkable moment in global financial markets. The nature of cryptocurrency transactions, conducted between cryptographic addresses, poses many challenges for identifying fraudulent activities, since malicious transactions may appear legitimate. From transaction data, one may train machine learning models to identify the fraudulent ones. Transaction datasets are typically imbalanced, holding only a few illicit examples, which makes it challenging for machine learning techniques to identify fraudulent transactions. In this paper, we investigate the use of a machine learning pipeline with dimensionality reduction techniques over Bitcoin transaction data. The experimental results show that XGBoost is the best performing method among a large set of competitors. The dimensionality reduction techniques identify compact feature subsets suitable for explaining the classification decision.
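A minimal sketch of such a pipeline on toy imbalanced data is given below; PCA stands in for the dimensionality reduction step and the class ratio is illustrative, since the paper's exact techniques and data are not reproduced here.

    # Dimensionality reduction + XGBoost on toy imbalanced "transactions"
    # (PCA and the 1% illicit ratio are illustrative assumptions).
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, n_features=50,
                               weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = make_pipeline(
        PCA(n_components=10),                # dimensionality reduction step
        XGBClassifier(scale_pos_weight=99),  # upweight the rare illicit class
    )
    clf.fit(X_tr, y_tr)
    print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))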

Paper Nr: 155
Title:

Process Chain for Artificial Intelligence-Based Demand Forecasting and Procurement Scheduling

Authors:

Maximilian Hohn and Philipp Maximilian Sieberg

Abstract: This position paper introduces a conceptual general process chain for leveraging artificial intelligence (AI) in demand prediction and procurement scheduling for small and medium-sized enterprises (SME). While AI offers significant advantages, such as reducing inventory costs, improving delivery reliability, and optimizing logistics, its adoption in SME is hindered by limited expertise, restricted access to AI tools, and psychological barriers like trust and acceptance. The proposed framework integrates probabilistic modeling, clustering algorithms, feature extraction methods and temperature scaling to enhance prediction accuracy and efficiency. By aggregating demand forecasts, the system enables risk-adjusted and cashflow-optimized scheduling. A preliminary result is presented, demonstrating robust predictions within confidence intervals. While the findings are preliminary, this paper highlights the transformative potential of AI in SME scheduling and outlines future research directions, including model optimization and the integration of explainable AI methods to further enhance traceability and user acceptance.

Paper Nr: 163
Title:

Prediction of Daily Sales of Individual Products in a Medium-Sized Brazilian Supermarket Using Recurrent Neural Networks Models

Authors:

Jociano Perin, Lucas Dias Hiera Sampaio, Marlon Marcon and André Roberto Ortoncelli

Abstract: Accurately predicting daily sales of products in supermarkets is crucial for inventory management, demand forecasting, and optimizing supply chain operations. Many studies focus on predicting the total sales of large stores and supermarkets. This study focuses on forecasting daily sales of individual products across various categories. In the experiments, we used Linear Regression and two types of Recurrent Neural Networks: Long Short-Term Memory and Gated Recurrent Unit. One of the contributions of the work is the database used, which is made available for public access and contains daily sales records (between January 2019 and December 2024) of 250 products in a medium-sized supermarket in Brazil. The results show that the predictors’ performance varies significantly from product to product. For one semester, the average of the best 25% of products resulted in a Root Mean Squared Error (RMSE) of 1.55 and a Mean Absolute Percentage Error (MAPE) of 17.20, while the average over all products gave a best RMSE of 2.12 and a best MAPE of 43.94. We observed similar performance variations for all analyzed semesters. The results presented make it possible to understand the performance of the predictors across ten semesters.

Paper Nr: 27
Title:

Harnessing Diet and Gene Expression Insights Through a Centralized Nutrigenomics Database to Improve Public Health

Authors:

Shriya Samudrala, Ijeoma Ezengwa, Fahmida Hai, Rubayat Khan, Saif Nirzhor and Don Roosan

Abstract: Nutrigenomics is an emerging field that explores the intricate interaction between genes and diet. This study aimed to develop a comprehensive database to help clinicians and patients understand the connections between genetic disorders, associated genes, and tailored nutritional recommendations. The database was built through an extensive review of primary journal articles and includes detailed information on gene characteristics, such as gene expression, location, descriptions, and their interactions with diseases and nutrition. The data suggest that a patient's food intake can either increase or decrease the expression of genes related to specific diseases. These findings underscore the potential of nutrition to modify gene expression and reduce the risk of chronic diseases. The study highlights the transformative role nutrigenomics could play in medicine by enabling clinicians to offer personalized dietary recommendations based on a patient’s genetic profile. Future research should focus on validating the database in clinical counselling to further refine its practical applications.

Paper Nr: 39
Title:

Quantifying the Effects of Image Degradation on LVLM Benchmark Results Systematically

Authors:

Rupert Urbanski and Ralf Peters

Abstract: Degraded image quality, along with the underlying issues of text-to-text neural networks, can compromise the performance of LVLMs. This paper quantifies the impact of blurry, noisy, and warped images and evaluates the robustness of LVLMs to these common forms of real-world image degradation, using a specifically developed benchmark dataset of 15,840 systematically degraded text images hand-crafted from standardised university admission exams.
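One illustrative instance of the degradations such a benchmark varies is sketched below, combining Gaussian blur with additive Gaussian noise; the radius and noise level are arbitrary examples, not the benchmark's systematic grid.

    # One example degradation: Gaussian blur followed by additive Gaussian
    # noise (parameters arbitrary; the benchmark varies them systematically).
    import numpy as np
    from PIL import Image, ImageFilter

    def degrade(img: Image.Image, blur_radius: float = 2.0,
                noise_sigma: float = 10.0) -> Image.Image:
        blurred = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
        arr = np.asarray(blurred).astype(np.float32)
        arr += np.random.normal(scale=noise_sigma, size=arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))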

Paper Nr: 50
Title:

Comparison of Tree-Based Learning Methods for Fraud Detection in Motor Insurance

Authors:

David Paul Suda, Mark Anthony Caruana and Lorin Grima

Abstract: Fraud detection in motor insurance is investigated with the implementation and comparison of various tree-based learning methods subject to different data balancing approaches. A dataset obtained from the insurance industry is used. The focus is on decision trees, random forests, gradient boosting machines, light gradient boosting machines and XGBoost. Due to the highly imbalanced nature of our dataset, synthetic minority oversampling and cost-sensitive learning approaches are used to address this issue. A study comparing these two data-balancing approaches is novel in the literature, and we conclude that cost-sensitive learning is overall superior for this application. The light gradient boosting machine using cost-sensitive learning is the most effective method, achieving a balanced accuracy of 81% and successfully identifying 83% of fraudulent cases. For the most successful approach, the primary insights into the most important features are provided. The findings provide a useful evaluation of the suitability of tree-based learners for insurance fraud detection and contribute to the ongoing development of tools for correct classification and for identifying the important features to address.
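Cost-sensitive learning of the kind favoured here typically amounts to reweighting misclassification costs rather than resampling; a minimal LightGBM sketch on toy imbalanced data follows, with the class weights and fraud ratio as illustrative assumptions.

    # Cost-sensitive LightGBM on toy imbalanced "claims" data (weights and
    # the ~2% fraud ratio are illustrative assumptions).
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LGBMClassifier(class_weight={0: 1, 1: 49})  # penalize missed fraud
    clf.fit(X_tr, y_tr)
    print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))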

Paper Nr: 56
Title:

Expanding the Singular Channel - MODALINK: A Generalized Automated Multimodal Dataset Generation Workflow for Emotion Recognition

Authors:

Amany H. AbouEl-Naga, May Hussien, Wolfgang Minker, Mohammed A.-M. Salem and Nada Sharaf

Abstract: Human communication relies deeply on the emotional states of the individuals involved. The process of identifying and processing emotions in the human brain is inherently multimodal. With recent advancements in artificial intelligence and deep learning, fields like affective computing and human-computer interaction have witnessed tremendous progress. This has shifted the focus from unimodal emotion recognition systems to multimodal systems that comprehend and analyze emotions across multiple channels, such as facial expressions, speech, text, and physiological signals, to enhance emotion classification accuracy. Despite these advancements, the availability of datasets combining two or more modalities remains limited. Furthermore, very few datasets have been introduced for the Arabic language, despite its widespread use (Safwat et al., 2023; Akila et al., 2015). In this paper, MODALINK, an automated workflow to generate the first novel Egyptian-Arabic dialect dataset integrating visual, audio, and text modalities is proposed. Preliminary testing phases of the proposed workflow demonstrate its ability to generate synchronized modalities efficiently and in a timely manner.

Paper Nr: 65
Title:

Brain Tumor Classification with Hybrid Deep Learning Models from MRI Images

Authors:

Dhouha Boubdellaha, Raouia Mokni and Boudour Ammar

Abstract: Brain tumor classification using MRI images plays a crucial role in medical diagnostics, enabling early detection and improving treatment planning. Traditional diagnostic methods are often subjective and time-intensive, emphasizing the need for automated and precise solutions. This study explores hybrid deep learning models alongside fine-tuned pre-trained architectures for classifying brain tumors into four categories: glioma, meningioma, pituitary tumor, and healthy brain tissue. The proposed approach incorporates both hybrid ensemble models (Ens-VGG16-FT-InceptionV3, Ens-ViT-FT-InceptionV3, and Ens-CNN-ViT) and fine-tuned architectures (FT-VGG16, FT-VGG19, and FT-InceptionV3). To enhance model robustness and generalization, data augmentation techniques such as rotation and scaling were applied. Among these models, the hybrid ensemble Ens-VGG16-FT-InceptionV3 achieved the highest accuracy and F1-score of 99%, outperforming both standalone models and other hybrid configurations. These findings demonstrate the effectiveness of integrating complementary architectures for improved brain tumor classification. Ultimately, this study highlights the potential of hybrid ensemble learning to advance brain tumor diagnostics, providing more accurate, reliable, and scalable medical imaging solutions.

Paper Nr: 127
Title:

Exploring Camouflaged Object Detection Techniques for Invasive Vegetation Monitoring

Authors:

Henry O. Velesaca, Hector Villegas and Angel D. Sappa

Abstract: This paper presents a novel approach to weed detection by leveraging state-of-the-art camouflaged object detection techniques. The work evaluates six camouflaged object detection architectures on agricultural datasets to identify weeds naturally blending with crops through similar physical characteristics. The proposed approach shows excellent results in detecting weeds using Unmanned Aerial Vehicle images. This work establishes a new framework for the challenging task of weed detection in agricultural settings using camouflaged object detection approaches, contributing to more efficient and sustainable farming practices.

Paper Nr: 130
Title:

A BERT-Based Model for Detecting Depression in Diabetes-Related Social Media Posts

Authors:

Rdouan Faizi, Bouchaib Bounabat and Mahmoud El Hamlaoui

Abstract: This paper introduces a BERT-based model for detecting depression in diabetes-related social media posts. Built on transformer-based language models, the proposed approach is specifically designed to capture the linguistic patterns indicative of depressive symptoms. The model was trained on a dataset of comments retrieved from diabetes-related YouTube channels, which were manually annotated as either ‘Depression’ or ‘Well-being’. Through extensive experimentation, the model achieved a high classification accuracy of 93% on the test set. These findings highlight its potential as an effective tool for automated mental health monitoring in at-risk populations, particularly those coping with chronic health conditions such as diabetes.

Area 6 - Databases & Data Management

Short Papers
Paper Nr: 94
Title:

Cypher by Example: A Visual Query Language for Graph Databases

Authors:

Kornelije Rabuzin, Maja Cerjan and Martina Šestak

Abstract: Considering the increasing connectivity in data exchange between applications and devices, modern IT systems need a storage solution capable of handling connections and patterns between entities. Graph databases emerged as a potential solution in the past decade. Since graph database query language standardization is ongoing, users interact with graph databases using query languages like Cypher or Gremlin, supported by modern Graph Database Management Systems (GDBMSs). Despite well-documented syntax, users with little knowledge of graph databases face a steep learning curve before writing queries on their own data. This limits interest in implementing graph databases due to the lack of a visual tool for maintenance. To address this, the paper introduces Cypher by Example, a visual graph query language with an interface and query patterns for interacting with the database. It presents the basic elements of this query language and demonstrates its usefulness in two use cases.
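To ground the discussion, the snippet below shows the kind of textual Cypher a visual "by example" pattern abstracts away, issued through the official Neo4j Python driver; the connection details and the Person/Company graph schema are placeholders.

    # Textual Cypher that a visual query pattern would generate (connection
    # details and the Person/Company schema are placeholders).
    from neo4j import GraphDatabase

    query = """
    MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: $company})
    RETURN p.name AS employee
    """
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    with driver.session() as session:
        for record in session.run(query, company="Acme"):
            print(record["employee"])
    driver.close()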

Paper Nr: 158
Title:

Democratizing the Access to Geospatial Data: The Performance Bottleneck of Unified Data Interfaces

Authors:

Matthias Pohl, Arne Osterthun, Joshua Reibert, Dennis Gehrmann, Christian Haertel, Daniel Staegemann and Klaus Turowski

Abstract: The exponential growth of geodata collected through satellites, weather stations, and other measuring systems presents significant challenges for efficient data management and analysis. High-resolution datasets from Earth observation missions like ESA Sentinel are vital for climate research, weather forecasting, and environmental monitoring, yet their multidimensional nature and temporal depth increasingly strain traditional spatial data models. This research evaluates the performance characteristics of various geospatial data access and processing technologies through systematic benchmarking. The study compares file formats (NetCDF, Zarr) and interface standards (OpenEO, OGC WCS) on a Rasdaman database instance to determine optimal configurations for interactive data exploration and analysis workflows. Performance tests reveal a clear hierarchy in processing efficiency. Direct RasQL queries consistently outperform both OGC WCS and OpenEO API interfaces across all test scenarios. Array file formats demonstrate superior query processing speeds, likely attributable to reduced database technology overhead. The findings provide a robust foundation for selecting appropriate geodata processing technologies based on specific use case requirements, data volumes, and performance needs. This research contributes to enhancing the efficiency of multidimensional geospatial data handling, particularly for time-critical applications that require interactive visualization and analysis capabilities.
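As a minimal illustration of the array-format side of this comparison, the sketch below opens the same cube from Zarr and NetCDF with xarray and runs a simple spatial-mean query; the paths, variable name, and dimension labels are placeholders.

    # Opening one geodata cube from Zarr and NetCDF and running a toy query
    # (paths, variable name, and dimension labels are placeholders).
    import xarray as xr

    ds_zarr = xr.open_zarr("sentinel_cube.zarr")    # chunked, lazy access
    ds_nc = xr.open_dataset("sentinel_cube.nc")     # NetCDF default engine
    monthly_mean = ds_zarr["ndvi"].sel(time="2024-06").mean(dim=("x", "y"))
    print(monthly_mean.compute())                   # trigger the computation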