Abstracts Track 2025

Area 1 - Data Engineering, Infrastructure and Business Applications

Nr:	58
Title:	Leveraging Data Lakes for Smart Building Management: Enhancing the Smart Readiness Indicator
Authors:	José L. Hernández, Dimitrios Rovas, Ignacio de Miguel and Fredy Vélez
Abstract:	This study investigates the transformative potential of data lakes in enhancing the Smart Readiness Indicator for smart buildings. By integrating both static and dynamic data—ranging from real-time sensor readings and energy system performance to external data such as weather forecasts—data lakes provide a unified repository for comprehensive data analysis. This approach enables advanced analytics and facilitates proactive decision-making in critical Smart Readiness Indicator domains such as heating, cooling, ventilation, monitoring, and electricity management. Data lakes enable a transition from reactive to proactive building management strategies, leading to improved operational efficiency, enhanced occupant comfort, and reduced energy costs. Pilot studies conducted across nine European buildings demonstrate substantial Smart Readiness Indicator score improvements, particularly in less-digitized structures. Despite their benefits, challenges remain, including ensuring interoperability among heterogeneous data sources, standardizing semantic data structures, and addressing limitations for specific climatic zones. However, the integration of data lakes aligns with broader goals of digital transformation and sustainability, facilitating smarter, more adaptive buildings. The study highlights the pivotal role of robust data management strategies in optimizing smart building operations and achieving higher Smart Readiness Indicator scores.

Nr:	166
Title:	A Multimodal hEalth Datawarehouse for Italy
Authors:	Maria Paola Priola
Abstract:	Healthcare Data Warehouses (DWs) are designed to transform fragmented data into actionable insights and support decision-making (Inmon, 2005). Yet, conventional DWs often lack interactivity and face integration challenges (Kimball and Caserta, 2004) due to inconsistent formats and semantic heterogeneity, especially in public health, where data originates from diverse administrative sources. Web-based systems have supported healthcare planning (Schuurman et al., 2008; Richard et al., 2005) and pandemic monitoring (Chosy et al., 2015; Agapito et al., 2020), but often remain siloed or analytically limited. Globally, the WHO’s Health for All (Mahler, 1981) and the European Health Data Space (Gavrilov et al., 2020) aim to integrate health data, but lack real-time analytics or conversational tools. In Italy, platforms like ISTAT and DatiOpen offer fragmented access without unified analytics layers (Gruendner et al., 2021). To address these gaps, this abstract presents MEDITA (Multimodal hEalth Datawarehouse for ITAly), a prototype platform that integrates, harmonizes, and models both structured and unstructured Italian public health data using an Extract-Transform-Load-Model-Deploy (ETLMD) pipeline. MEDITA combines over half a billion statistical observations with 130,000+ Italian-language health news articles, preprocessed through a modular transformation layer that resolves structural fragmentation and semantic variability.
MEDITA’s transformation phase introduces four key steps: (1) Variable Structure Mapping, which classifies variables based on the pairing of data objects and internal tables into schema types—Single-Single (SS), Single-Multi (SM), Multi-Single (MS), Multi-Multi (MM); (2) Fuzzy Attribute Matching, which applies Levenshtein distance (Levenshtein et al., 1966) to resolve naming inconsistencies across datasets; (3) Records Harmonization, aligning geographic codes to NUTS standards and enforcing uniqueness through superkey detection; and (4) Data Standardization, which uses adaptive parsing to produce clean, pipe-delimited tables with consistent encoding and format. Structured data is then prepared for querying, aggregation, and visualization. In the model phase, a key feature is HANK (Health Assistant for Needs and Knowledge), a Retrieval-Augmented Generation (RAG) chatbot (Lewis et al., 2020) that enables Italian-language interaction with the health corpus. Queries are embedded, matched to relevant content, and answered by a transformer model grounded in validated sources, reducing hallucinations (Guerreiro et al., 2023). MEDITA is a comprehensive prototype DW featuring a streamlined web application composed of dedicated modules that guide users through the analytical workflow. The Infographics module provides interactive visual summaries of aggregated health data by year, topic, or region to support initial exploration. The Data Explorer enables deeper dataset inspection through filtering and comparative visualizations. Users can interact with HANK to submit natural language queries and receive contextualized answers grounded in trusted sources. The Forecast Generator supports time-series modeling for regression and classification tasks, offering built-in diagnostic tools—including checks for heteroskedasticity and model validity—to ensure robustness of statistical procedures. 
Adhering to FAIR principles (Wilkinson et al., 2016), its modular architecture allows future integration of multimodal sources.
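
The Fuzzy Attribute Matching step described above can be illustrated with a minimal sketch. It is not MEDITA's implementation: the function names, the canonical attribute list, and the 0.3 distance threshold are all assumptions chosen for illustration. It computes the Levenshtein distance with the standard two-row dynamic programme and maps an incoming attribute name to its closest canonical name when the distance is small relative to the name length.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def match_attribute(name: str, canonical: Sequence[str],
                    max_ratio: float = 0.3) -> Optional[str]:
    """Return the closest canonical name, or None if nothing is close enough.

    max_ratio is a hypothetical threshold: the largest accepted distance,
    expressed as a fraction of the longer of the two names.
    """
    key = name.strip().lower()
    best = min(canonical, key=lambda c: levenshtein(key, c.lower()))
    distance = levenshtein(key, best.lower())
    limit = max_ratio * max(len(key), len(best))
    return best if distance <= limit else None


# Hypothetical canonical attribute names, for illustration only.
CANONICAL = ["region", "year", "population"]
print(match_attribute("Regione", CANONICAL))
print(match_attribute("xyz", CANONICAL))
```

A relative threshold, rather than a fixed distance, keeps short names such as "year" from matching unrelated short strings while still tolerating small spelling variants in longer names.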

Area 2 - Data Science & Machine Learning

Nr:	129
Title:	A Deep Learning Approach for Single-Line Programming Language Identification
Authors:	Oscar Rodriguez-Prieto, Alejandro Pato and Francisco Ortin
Abstract:	Identifying the programming language of a source code fragment is an active research field in software development. Its applications include syntax highlighting, search engine indexing, repository management, and automated code analysis. While traditional language detection methods rely on file extensions or entire code files, they struggle when only a single line of code is available, such as in embedded snippets found in documentation, chat messages, or Q&A platforms. Existing machine learning-based approaches typically require multiple lines of source code or even the entire file contents to achieve high accuracy. To address this limitation, we propose PLangRec, a deep-learning model designed to predict the programming language of a source code excerpt from a single line of code. Our approach is built upon a large-scale dataset containing 434.18 million labeled instances across 21 programming languages—three orders of magnitude larger than previous datasets. We train a char-level deep bidirectional recurrent neural network, enabling the model to learn language-specific patterns even in extremely short code fragments. The resulting model achieves a 95.07% accuracy and a macro-F1 score of 95.07% for single-line classification, outperforming existing state-of-the-art methods. This ability to accurately classify a programming language from minimal input has significant benefits for applications where complete source code is unavailable. To further enhance classification for multi-line snippets and entire source files, we develop a stacking ensemble meta-model that aggregates predictions from the single-line classifier. This meta-model efficiently combines multiple line-based predictions, enabling it to outperform existing methods in recognizing programming languages for code snippets of varying lengths. 
Our evaluation demonstrates that PLangRec achieves superior performance not only for isolated lines but also for five-line and ten-line snippets, as well as full source code files. To facilitate adoption and practical use, we release PLangRec as an open-source tool available in multiple formats, including a web application, a web API, and a Python desktop program. This versatility makes it suitable for integration into development environments, search engines, and automated code analysis tools. By providing an accurate and efficient programming language detection system, PLangRec addresses a long-standing challenge in source code classification and opens new possibilities for intelligent code processing and analysis.
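
The idea of combining line-level predictions into a snippet-level prediction can be sketched as follows. This is a deliberately simplified stand-in: it averages per-line class probabilities (soft voting) and is not the stacking ensemble meta-model the abstract describes, which learns how to combine the predictions. The function name and the probability values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_line_predictions(
    line_probs: List[Dict[str, float]],
) -> Tuple[str, float]:
    """Average per-line class probabilities; return (label, mean probability).

    Each element of line_probs maps a language label to the probability a
    single-line classifier assigned to it for one line of the snippet.
    """
    if not line_probs:
        raise ValueError("need at least one line prediction")
    totals: Dict[str, float] = defaultdict(float)
    for probs in line_probs:
        for label, p in probs.items():
            totals[label] += p
    n = len(line_probs)
    means = {label: total / n for label, total in totals.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Hypothetical outputs of a single-line classifier for a three-line snippet.
snippet = [
    {"python": 0.6, "ruby": 0.4},
    {"python": 0.3, "ruby": 0.7},
    {"python": 0.8, "ruby": 0.2},
]
label, score = aggregate_line_predictions(snippet)
print(label, round(score, 3))
```

Averaging lets confident lines outweigh an ambiguous one, as with the second line here; a trained meta-model could additionally learn which lines or classes deserve more weight.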

Nr:	168
Title:	Toward Multimodal Machine Learning Models for Work-Related Stress Detection Among Nurses
Authors:	Souhir Ben Souissi and Christoph Golz
Abstract:	Healthcare systems worldwide are facing a critical shortage of professionals, with nurses being among the most affected. Alarmingly, nearly half of all nurses leave the profession prematurely, and one of the main drivers behind this trend is high work-related stress. While previous research has primarily relied on surveys to measure and monitor stress, this method is increasingly ineffective—response rates are declining, especially among highly stressed populations, and survey fatigue is rising due to the growing number of ongoing studies. To address this challenge, alternative, non-intrusive, and data-driven approaches are urgently needed to continuously assess stress without overburdening staff. In our earlier work, we explored the potential of unstructured textual data from daily clinical documentation in electronic health records (EHRs) to detect ward-level stress signals using natural language processing (NLP). However, due to the lack of labeled data and the focus on aggregated documentation, it was not possible to directly identify stress patterns at the individual level. Building on this foundation, our current research aims to develop a multimodal machine learning framework for detecting and predicting work-related stress among nurses. The project involves two main objectives: (1) Collecting and integrating data from diverse sources including wearable devices, unstructured routine clinical texts, daily self-reports, and shift schedules; (2) Designing and evaluating multimodal models that can capture complex relationships between these data modalities and stress indicators. We adopt a longitudinal study design involving 24 nurses over an extended period, with repeated measurements to enable dynamic analysis. 
The data collection pipeline includes: (a) Physiological data from wearable smartwatches (e.g., heart rate variability, sleep, and activity patterns); (b) Textual data from the clinical information system, specifically authored by the participants; (c) Daily self-reported short text inputs reflecting individual perceptions and emotional states; (d) Shift schedule data, capturing workload patterns, work-rest ratios, and time-of-day effects. All data sources are securely linked via anonymized participant IDs and are de-identified after collection to protect privacy while maintaining the necessary granularity for analysis. The resulting dataset will be used to train a multimodal machine learning model capable of synthesizing information across these channels. The model will explore patterns between physiological biomarkers, linguistic signals, self-reported perceptions, and scheduling features, with the goal of predicting elevated stress levels and identifying potential early warning signs of burnout or attrition. Ultimately, this approach aims to contribute to the development of scalable, low-burden stress monitoring systems that support healthcare institutions in identifying risk early and implementing timely interventions—without adding further administrative or psychological load to nursing staff.

Nr:	17
Title:	Enhancing Data Repositories by Leveraging Ontologies
Authors:	Sarah Rudder
Abstract:	Data are defined as digitized information (Veldkamp, 2023) at the center of an organization’s operations. Information is a valuable asset of modern businesses, representing new ways that organizations create value for their customers and stakeholders, and it requires the development of solutions to process, manage, and monitor data (Collins & Lanz, 2019). Managing data involves acquiring data, placing it into datasets, and maintaining the relevant links to ensure real-time accuracy (Veldkamp, 2023). A data lake is a massive collection of datasets that are typically hosted in different storage systems, are of varying formats, and have the potential to change independently of one another (Nargesian et al., 2019). Sawadogo & Darmont (2021) define a data lake as a scalable storage and analysis data system mainly used for knowledge extraction. Any application that involves multiple ontologies must establish semantic mappings to ensure interoperability between them (Doan et al., 2004). Using semantic ontologies to capture domain knowledge allows incremental construction of a data schema without affecting the existing information (Murphy & Moreland, 2022). An ontology is an explicit specification of a common language described by a set of representational terms (Gruber, 1993). Ontologies are viewed as the interface between the knowledge base and reality that guides information shareability, acquisition, and organization (Kang et al., 2010). Ontological models define natural language in a machine-readable format (Mejhed Mkhinini et al., 2020). Although there are several tools meant to integrate data lakes, there is insufficient evidence that the result is any more than a massive repository that fails to consider equivalent relationships between classes, subclasses, object properties, data properties, or individuals. There is a need to improve the capture of data from disparate sources without causing additional inconsistency and confusion.
This study aims to leverage domain-specific ontologies to form an authoritative source of truth (ASoT) among repositories that will improve information management across data lakes by establishing a common vocabulary. Although this research is still in its infancy, the potential to reduce duplication, maintain traceability, and provide a foundation for an ASoT is promising and will contribute to the general field of data analytics. A short presentation or scholarly poster discussing the problem domain and path forward for an innovative solution is requested.

Nr:	148
Title:	Localization of Generative AI Models: Aligning Technology with Sovereign AI Policies
Authors:	Mandana Omidbakhsh and Khashayar Khorasani
Abstract:	The rapid advancement of generative AI technologies poses significant challenges and opportunities across global markets. As countries and regions develop their own AI policies to promote ethical standards, security, and cultural relevance, the need for localization of these models becomes critical. We explored the methodologies for localizing generative AI models and the strategies to comply with sovereign AI policies while addressing local cultural, legal, and ethical considerations. A comprehensive review of current sovereign AI policies shows diverse norms, necessitating a tailored approach to localization. We identified the challenges in developing a measurement framework for adapting generative AI models to local regulations and cultural norms, and we propose four key metrics, namely: Cultural Relevance Index (CRI), Legal Compliance Score (LCS), User Acceptance Rate (UAR), and Feedback Loop Mechanism (FLM). We further add metrics for translation quality, including Error Rate, Linguistic Accuracy, and Consistency, which ensure grammatical precision and preservation of original meaning. Furthermore, we consider Cultural Relevance and User Feedback/Satisfaction as indicators of Cultural Adaptation, and Percentage of Content Localized and Time to Market efficiency as indicators of Localization Coverage. To evaluate our proposed measurements, we conduct a case study assessing their effectiveness and applicability across different Large Language Models (LLMs) such as GPT 3.5, GPT4, GPT-4o mini, Llama, and Mistral A. The results of this case study highlight the varying degrees of adaptability of these models to localized requirements, demonstrating the strengths and weaknesses of each in addressing cultural, legal, and linguistic factors. Ultimately, this study underscores the importance of refining AI localization frameworks to create more inclusive, context-aware models that are legally compliant with sovereign AI policies.