Abstracts Track 2022

Area 1 - Business Analytics

Nr: 16

ENIRISST: The First Data Infrastructure for Shipping and Transport in Greece


Christos Tryfonopoulos and Amalia Polydoropoulou

Abstract: In recent years, the transportation sector has been shaped by the many advances of the big data era. These have increasingly attracted the attention of both scientists and practitioners in the private and public sectors, and several studies and applications have been developed in this field. Despite these advances, several challenges must still be addressed to harness the full potential of big data technologies. This abstract presents the Intelligent Research Infrastructure for Shipping, Supply chain, Transport and Logistics (ENIRISST), a unique, novel and fully extendable Research Infrastructure established in 2019 in Greece. ENIRISST provides a virtual infrastructure realized in a fully extendable way to accommodate current and future extensions. Its goal is to become a widely accepted, multi-purpose data analytics platform that unifies a wide variety of open data sources and enhances collaboration among different disciplines in the transportation sector. The Infrastructure relies on the NIST Big Data Reference Architecture, supporting data cataloging, virtualization, analytics and visualization through open source tools, and is designed to become a unified data lake that satisfies several important yet conflicting principles, including usability, efficiency, security, expandability, reliability, and maintainability. The ENIRISST platform combines traditional and state-of-the-art solutions, including relational databases, NoSQL solutions for big data analysis, machine learning suites, data lake engines for live interactive queries and analytics, and popular visualization components. The overall architecture is flexible, allowing the integration of new sub-systems, whether software components or processes.
The data platform is not intended to be a replica of existing sources; rather, it provides a data lake approach able to combine, correlate, analyze and distill the collected data into useful insights. The data lake functionality may be abstracted into three main tasks: (a) real-time/on-premises movement of data, (b) analysis of data, and (c) visualization of data/results. This data lake approach is necessary to accommodate not only the various types of stored data (both historical and ‘live’ data from different and disparate information sources), but also the various types of services from a wide variety of fields, including maritime, land and freight travel, environment, energy, tourism, and finance. As the project reaches its final year, new services are constantly being integrated into the deployed platform; these services (indicatively) include: national transport model simulations, pollution monitoring e-infrastructures, a traffic safety training facility and data repository, maritime activity recording and marine/cruise tourism logging, crowdsourced infrastructure registers, multimodal travel advisors, decision support tools for shipping, electronic logistics observatories for the freight, logistics and supply chain industries, and intermodal transport operation monitoring tools. By incorporating a large number of novel services at the national and international levels and adopting a flexible data infrastructure, ENIRISST has created a de facto center of excellence and the first intelligent research infrastructure in the fields of shipping and transport in Greece.

Area 2 - Data Science

Nr: 7

Imprecise Regression: Interval Uncertainty in the Dependent Variable


Krasymyr Tretiak, Georg Schollmeyer and Scott Ferson

Abstract: A new iterative method is described that fits an imprecise regression model to interval-valued data. We consider the explanatory variables to be precise point values, while the dependent variables are presumed to have uncertainty that is defined purely by interval bounds and carries no probabilistic information. The proposed method is based on machine learning algorithms and captures the relationship between the explanatory variables and a dependent variable by fitting an imprecise regression model that is linear with respect to the unknown parameters. The iterative approach searches for the optimal model parameters that reduce the mean squared error between the actual and predicted interval values of the dependent variable. The method incorporates a first-order gradient-based optimization algorithm to minimize this loss function, which is natural but not the only loss function possible. We exploit interval arithmetic operations to model the uncertainty in the data. These operations provide results that are rigorous because they bound all possible answers corresponding to simple regression lines based on points in the respective intervals. Because the results cannot be any tighter without excluding some alternative responses, they are also best-possible. Computing enclosures of functions using interval arithmetic often leads to overestimation when interval variables appear more than once in a mathematical expression. This drawback, which can lead to significant inflation of the interval bounds, is called the dependency problem. To overcome the effect of repeated variables at each iteration of the iterative approach, we use an interval-valued extension of real functions known as the mean value form. It is a second-order, inclusion-monotone approximation of an interval-valued extension of the real function.
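The dependency problem and the mean value form can be illustrated with a minimal Python sketch; the Interval class, the example function f(x) = x(1 − x), and the chosen interval are illustrative assumptions rather than details taken from the method itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    @property
    def width(self):
        return self.hi - self.lo

def _as_interval(x):
    return x if isinstance(x, Interval) else Interval(x, x)

# f(x) = x * (1 - x): the naive interval evaluation repeats X twice,
# so the dependency problem inflates the enclosure.
def f_naive(X):
    return X * (Interval(1, 1) - X)

# Mean value form: f(m) + F'(X) * (X - m), with m the interval midpoint
# and F'(X) an interval enclosure of the derivative f'(x) = 1 - 2x.
def f_mean_value(X):
    m = (X.lo + X.hi) / 2.0
    fm = m * (1 - m)                          # point evaluation at the midpoint
    dF = Interval(1, 1) - Interval(2, 2) * X  # enclosure of f'
    return _as_interval(fm) + dF * (X - _as_interval(m))

X = Interval(0.4, 0.5)
print(f_naive(X).width)       # ≈ 0.1  (inflated by the repeated variable)
print(f_mean_value(X).width)  # ≈ 0.02 (tighter on this narrow interval)
```

On narrow intervals the mean value form yields a noticeably tighter enclosure, consistent with its second-order behavior; on wide intervals the naive evaluation can win, which is why it is an approximation rather than a universal improvement.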
The proposed iterative method estimates the lower and upper bounds of the expectation region, which depicts the envelope of all possible precise regression lines obtained by ordinary regression analysis based on any configuration of bivariate scalar points drawn from the x-values and the respective intervals. The expectation region is the analogue of the point prediction from a simple linear regression, which others have variously called the 'identification region' and 'upper and lower bounds on predictions'. The proposed method is easy to implement and to adapt to different bases, such as polynomial or trigonometric functions, that are independent of the response variables. We have also extended this concept to neural networks, which are more flexible tools in regression analysis. Several examples will be presented. We have verified the method with probabilistic Monte Carlo simulations, and we have compared it to another approach that computes identification regions in regression analysis with interval data.
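The flavor of the iterative, gradient-based fit can be conveyed with a simplified sketch. Everything here is an assumption for illustration: the synthetic data, the constant half-width model [b0 + b1·x − w, b0 + b1·x + w], and the plain gradient descent; the actual method works with general interval bounds and mean-value-form enclosures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: precise x, interval-valued responses [y_lo, y_hi].
x = np.linspace(0.0, 1.0, 50)
center = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, x.size)
half = 0.5 + rng.uniform(0.0, 0.2, x.size)
y_lo, y_hi = center - half, center + half

# Hypothetical model: an interval-valued line with constant half-width w >= 0.
b0, b1, w = 0.0, 0.0, 0.1
lr = 0.1
for _ in range(2000):
    p_lo = b0 + b1 * x - w
    p_hi = b0 + b1 * x + w
    e_lo, e_hi = p_lo - y_lo, p_hi - y_hi
    # Gradients of 0.5 * mean(e_lo**2 + e_hi**2) w.r.t. each parameter.
    b0 -= lr * np.mean(e_lo + e_hi)
    b1 -= lr * np.mean((e_lo + e_hi) * x)
    w = max(0.0, w - lr * np.mean(e_hi - e_lo))

# Roughly recovers the generating parameters:
# intercept ≈ 2, slope ≈ 3, half-width ≈ 0.6.
print(b0, b1, w)
```

Because the half-width gradient depends only on the interval widths, the center line and the width decouple in this toy setup; the real method faces a harder problem precisely because interval arithmetic couples the bounds, motivating the mean value form above.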

Area 3 - Data Management and Quality

Nr: 8

A Data Quality Model for Machine Learning Driven Remote Health Care Services


Mandana Omidbakhsh and Khashayar Khorasani

Abstract: With the rise of IoT devices combined with the application of machine learning techniques, the healthcare sector can benefit greatly from autonomous, trustworthy, reliable and timely services for the diagnosis, prognosis and monitoring of patients in remote communities. IoT sensors and medical devices generally generate large data sets, which nevertheless must be validated before they are used to train machine learning models on distributed edge-cloud infrastructures for application in remote communities. Machine learning driven remote healthcare systems rely heavily on the quality of data; any malfunction or failure in the measurement system can jeopardize the quality of the sensing data and consequently result in poor critical decision-making. However, there is no exhaustive data quality model to ensure the delivery of effective healthcare management, reliable and safe communication, and efficient resource management for remote services. In this work, we first review the quality characteristics of big data for machine learning algorithms on edge-cloud computing systems, which provide a remarkable degree of data security and privacy. Then, we propose a data quality model that considers both objective and subjective characteristics appropriately associated with health care services for remote communities. The former include effectiveness, reliability and autonomy; the latter refer to ease of use, understandability and discretionary. Finally, we will evaluate our proposed data quality model for machine learning on edge infrastructures through several case studies.
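The split into objective and subjective characteristics could be sketched as a simple scoring structure. The characteristic names follow the abstract, but the [0, 1] scoring, the weighting scheme, and all identifiers are hypothetical; the abstract does not specify how characteristics are quantified or combined.

```python
from dataclasses import dataclass

# Hypothetical sketch: each characteristic is scored on [0, 1] and the two
# groups are combined with an illustrative weight (not from the abstract).
@dataclass
class ObjectiveQuality:
    effectiveness: float
    reliability: float
    autonomy: float

@dataclass
class SubjectiveQuality:
    ease_of_use: float
    understandability: float
    discretionary: float

def overall_score(obj: ObjectiveQuality, subj: SubjectiveQuality,
                  obj_weight: float = 0.6) -> float:
    """Weighted average of the two characteristic groups (weights assumed)."""
    obj_mean = (obj.effectiveness + obj.reliability + obj.autonomy) / 3
    subj_mean = (subj.ease_of_use + subj.understandability
                 + subj.discretionary) / 3
    return obj_weight * obj_mean + (1 - obj_weight) * subj_mean

score = overall_score(ObjectiveQuality(0.9, 0.8, 0.7),
                      SubjectiveQuality(0.6, 0.7, 0.8))
print(score)  # ≈ 0.76 with these illustrative scores
```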