Abstract: |
Healthcare Data Warehouses (DWs) are designed to transform fragmented data into actionable insights and support decision-making (Inmon, 2005). Yet, conventional DWs often lack interactivity and face integration challenges (Kimball and Caserta, 2004) due to inconsistent formats and semantic heterogeneity—especially in
public health, where data originates from diverse administrative sources. Web-based systems have supported healthcare planning (Schuurman et al., 2008; Richard et al., 2005) and pandemic monitoring (Chosy et al., 2015; Agapito et al., 2020), but often remain siloed or analytically limited. Globally, the WHO’s Health for All (Mahler, 1981) and the European Health Data Space (Gavrilov et al., 2020) aim to integrate health data, but lack real-time analytics or conversational tools. In Italy, platforms like ISTAT and DatiOpen offer fragmented access without unified analytics layers (Gruendner et al., 2021). To address these gaps, this abstract presents MEDITA (Multimodal hEalth Datawarehouse for ITAly), a prototype platform that integrates, harmonizes, and models both structured and unstructured Italian public health data using an Extract-Transform-Load-Model-Deploy (ETLMD) pipeline. MEDITA combines over half a billion statistical observations with
130,000+ Italian-language health news articles, preprocessed through a modular transformation layer that resolves structural fragmentation and semantic variability. MEDITA’s transformation phase introduces four key steps: (1) Variable Structure Mapping, which classifies variables based on the pairing of data objects and internal tables into schema types: Single-Single (SS), Single-Multi (SM), Multi-Single (MS), and Multi-Multi (MM); (2) Fuzzy Attribute Matching, which applies Levenshtein distance (Levenshtein et al., 1966) to resolve naming inconsistencies across datasets; (3) Records Harmonization, which aligns geographic codes to
NUTS standards and enforces uniqueness through superkey detection; and (4) Data Standardization, which uses adaptive parsing to produce clean, pipe-delimited tables with consistent encoding and format. Structured data is then prepared for querying, aggregation, and visualization. In the model phase, a key feature is HANK
(Health Assistant for Needs and Knowledge), a Retrieval-Augmented Generation (RAG) chatbot (Lewis et al., 2020) that enables Italian-language interaction with the health corpus. Queries are embedded, matched to relevant content, and answered by a transformer model grounded in validated sources, reducing hallucinations
(Guerreiro et al., 2023). MEDITA is a comprehensive prototype DW featuring a streamlined web application composed of dedicated modules that guide users through the analytical workflow. The Infographics module provides interactive visual summaries of aggregated health data by year, topic, or region to support initial exploration. The Data Explorer enables deeper dataset inspection through filtering and comparative visualizations. Users can interact with HANK to submit natural language queries and receive contextualized answers grounded in trusted sources. The Forecast Generator supports time-series modeling for regression and classification tasks, offering built-in diagnostic tools, including checks for heteroskedasticity and model validity, to ensure the robustness of statistical procedures. Adhering to FAIR principles (Wilkinson et al., 2016), its modular architecture allows future integration of multimodal sources. |
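
The Fuzzy Attribute Matching step named in the transformation phase can be illustrated with a minimal sketch. This is not MEDITA's implementation: the column names, the `normalize` helper, and the `max_ratio` threshold are hypothetical and chosen only for illustration. The sketch maps a raw column name to the closest canonical attribute by Levenshtein distance and rejects matches that are too far apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Hypothetical normalization: lowercase and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_attribute(name, canonical, max_ratio=0.34):
    """Return the closest canonical name, or None if nothing is close enough.

    max_ratio is an assumed cutoff on distance divided by the longer length.
    """
    key = normalize(name)
    best, best_dist = None, None
    for candidate in canonical:
        dist = levenshtein(key, normalize(candidate))
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    if best is None:
        return None
    longest = max(len(key), len(normalize(best)), 1)
    return best if best_dist / longest <= max_ratio else None


# Hypothetical canonical attribute names for illustration only.
canonical_names = ["regione", "anno", "comune"]
print(match_attribute("Regioni", canonical_names))
```

Normalizing the distance by string length keeps short names from matching unrelated attributes; a production pipeline would likely tune the cutoff per dataset and log rejected names for manual review.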