DATA 2017 Abstracts

Area 1 - Big Data

Full Papers

Paper Nr:	18
Title:	Adaptive Resource Management for Distributed Data Analytics based on Container-level Cluster Monitoring
Authors:	Thomas Renner, Lauritz Thamsen and Odej Kao
Abstract:	Many distributed data analysis jobs are executed repeatedly in production clusters. Examples include daily executed batch jobs and iterative programs. These jobs present an opportunity to learn workload characteristics through continuous fine-grained cluster monitoring. Therefore, based on detailed profiles of resource utilization, data placement, and job runtimes, resource management can in fact adapt to actual workloads. In this paper, we present a system architecture that contains four mechanisms for an adaptive resource management, encompassing data placement, resource allocation, and container as well as job scheduling. In particular, we extended Apache Hadoop's scheduling and data placement to improve resource utilization and job runtimes for recurring analytics jobs. Furthermore, we developed a Hadoop submission tool that allows users to reserve resources for specific target runtimes and which uses historical data available from cluster monitoring for predictions.

Short Papers

Paper Nr:	23
Title:	The Benefit of Thinking Small in Big Data
Authors:	Kurt Englmeier and Hernán Astudillo Rojas
Abstract:	No doubt, big data technology can be a key enabler for data-driven decision making. However, there are caveats. Processing technology for unstructured and structured data alone, with or without Artificial Intelligence, will not suffice to live up to the promises made by big data pundits. This article argues that we should be level-headed about what we can achieve with big data. We can fulfil many of these promises if we also manage to get our interests and requirements better reflected in the design or adaptation of big data technology. Economies of scale urge providers of big data technology to address mainstream requirements, that is, the analytic requirements of a broad clientele. Our analytical problems, however, are rather individual and mainstream only to a certain extent. We will see many technology add-ons for specific requirements, with more emphasis on human interaction too, that will be essential for success in big data. In this article, we take machine translation as an example and a prototypical translation memory as an add-on technology that helps users turn a faulty automatic translation into a useful one.

Paper Nr:	43
Title:	Towards a Scalable Architecture for Flight Data Management
Authors:	Iván García, Miguel A. Martínez-Prieto, Anibal Bregón, Pedro C. Álvarez and Fernando Díaz
Abstract:	The dramatic growth in air traffic levels witnessed during the last two decades has increased the interest in optimizing Air Traffic Management (ATM) systems. The main objective is to cope with sustained air traffic growth under safe, economic, efficient and environmentally friendly working conditions. The ADS-B (Automatic Dependent Surveillance - Broadcast) system is part of the new air traffic control systems, since it allows the secondary radar to be replaced with cheaper ground stations that, at the same time, provide more accurate real-time positioning information. However, this system generates a large volume of data that, when combined with other flight-related data, such as flight plans or weather reports, faces scalability issues. This paper introduces an (on-going) Data Lake based architecture which allows the full ADS-B data life-cycle to be supported in a scalable and cost-effective way using technologies from the Apache Hadoop ecosystem.

Paper Nr:	17
Title:	Identification of Opinion Leaders and Followers in Social Media
Authors:	Chun-Che Huang, Li-Ching Lien, Po-An Chen, Tzu-Liang Tseng and Shian-Hua Lin
Abstract:	In recent years, with the development of Web 2.0, opinion leaders on the Web have taken the stage and lead the will of the people. Many times, governments, private companies and even traditional news media need to understand the opinion leaders’ ideas on the Web. Identifying opinion leaders and followers has therefore become an important field of study. To study the characteristics of opinion leaders and their impact on followers, our research evaluates whether each speaker in social media satisfies the characteristics of an opinion leader. The characteristics of opinion leaders and the relationship between opinion leaders and followers are studied. By observing a relational matrix, the interactions between users in social media are analysed and opinion leaders and followers are identified.

Paper Nr:	35
Title:	Index Clustering: A Map-reduce Clustering Approach using Numba
Authors:	Xinyu Chen and Trilce Estrada
Abstract:	Clustering high-dimensional data is often a crucial step of many applications. However, the so-called "curse of dimensionality" is a challenge for most clustering algorithms. In such high-dimensional spaces, distances between points tend to be less meaningful and the spaces become sparse. Such sparsity requires more data points to characterize the similarities, so more distance comparisons are computed. Many approaches have been proposed for dimensionality reduction, such as sub-space clustering, random projection clustering, and feature selection techniques. However, approaches like these become infeasible in scenarios where data is geographically distributed or cannot be openly used across sites. To deal with the location and privacy issues as well as mitigate the expensive distance computation, we propose an index-based clustering algorithm that generates a spatial key for each data point across all dimensions without needing explicit knowledge of the other data points. Then it performs a conceptual Map-Reduce procedure in the index space to form a final clustering assignment. Our results show that this algorithm is linear and can be parallelized and executed independently across points and dimensions. We present a Numba implementation and preliminary study of this algorithm's capabilities and limitations.
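The idea of keying each point independently and then grouping in index space can be illustrated with a minimal sketch. The abstract does not specify how the spatial key is built, so the uniform grid cell used here, the function names `spatial_key` and `index_cluster`, and the `cell_size` parameter are all assumptions made for illustration; the sketch is plain Python and does not use Numba.

```python
from collections import defaultdict
from math import floor


def spatial_key(point, cell_size):
    """Hypothetical key: the grid cell of a point, computed from its own
    coordinates only (no knowledge of other points is needed)."""
    return tuple(floor(x / cell_size) for x in point)


def index_cluster(points, cell_size):
    """Group point indices by spatial key in a map step and a reduce step."""
    # Map step: each point is keyed independently of all other points.
    mapped = [(spatial_key(p, cell_size), i) for i, p in enumerate(points)]
    # Reduce step: points that share a key end up in the same cluster.
    clusters = defaultdict(list)
    for key, i in mapped:
        clusters[key].append(i)
    return dict(clusters)


# Toy data, invented for this example.
points = [(0.1, 0.2), (0.3, 0.4), (5.1, 5.2), (5.4, 5.9), (9.0, 0.5)]
clusters = index_cluster(points, cell_size=1.0)
```

Because the map step touches each point once and never compares two points, the cost of this sketch grows linearly with the number of points, which mirrors the linearity claim in the abstract.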

Area 2 - Business Analytics

Full Papers

Paper Nr:	10
Title:	Asymmetric Heterogeneous Transfer Learning: A Survey
Authors:	Magda Friedjungová and Marcel Jiřina
Abstract:	One of the main prerequisites in most machine learning and data mining tasks is that all available data originates from the same domain. In practice, we often can’t meet this requirement due to poor quality, unavailable data or missing data attributes (new task, e.g. cold-start problem). A possible solution can be the combination of data from different domains represented by different feature spaces, which relate to the same task. We can also transfer the knowledge from a different but related task that has been learned already. Such a solution is called transfer learning and it is very helpful in cases where collecting data is expensive, difficult or impossible. This overview focuses on the current progress in the new and unique area of transfer learning - asymmetric heterogeneous transfer learning. This type of transfer learning considers the same task solved using data from different feature spaces. Through suitable mappings between these different feature spaces we can get more data for solving data mining tasks. We discuss approaches and methods for solving this type of transfer learning tasks. Furthermore, we mention the most used metrics and the possibility of using metric or similarity learning.

Paper Nr:	12
Title:	The Role of Big Data Analytics in Corporate Decision-making
Authors:	Darlan Arruda and Nazim H. Madhavji
Abstract:	Big Data Analytics results can play a major role in corporate decision-making allowing companies to achieve competitive advantage and make improved decisions. This paper describes a systematic literature review (SLR) on the role of the results of Big Data Analytics in corporate decisions. Initially, 1652 papers were identified from various sources. Filtering through the 5-step process, 20 relevant studies were selected for analysis in this SLR. The findings of this study are fourfold in the area of: (a) usage of the results of Big Data Analytics in corporate decision-making; (b) the types of business functions where analytics has been fruitfully utilised; (c) the impact of analytics on decision-making; and (d) the impediments to using Big Data Analytics in corporate decision-making. Also, on the management front, two important issues identified are: (i) aligning data-driven decision-making with business strategy and (ii) collaboration across business functions for effective flow of Big Data and information. On the technical front, big data present some challenges due to the lack of tools to process such properties of Big Data as variety, veracity, volume, and velocity. We observe from this analysis that, thus far, little scientific research has focused on understanding how to address the analytics results in corporate decision-making. This paper ends with some recommendations for further research in this area.

Paper Nr:	26
Title:	Churn Prediction for Mobile Prepaid Subscribers
Authors:	Zehra Can and Erinç Albey
Abstract:	In telecommunication, mobile operators prefer to acquire postpaid subscribers and increase their incoming revenue based on the usage of postpaid lines. However, subscribers tend to buy and use prepaid mobile lines because of their simplicity of use and because they offer greater control over the cost of the line than postpaid lines. Moreover, prepaid lines involve less paperwork between the operator and the subscriber. Mobile subscribers can end their contract whenever they want, without making any contact with the operator. After reaching the end of the defined period, the subscriber will disappear, which is defined as “involuntary churn”. In this work, prepaid subscribers’ behavior is described by their RFM data and some additional features, such as usage, call center and refill transactions. We model the churn behavior using the Pareto/NBD model and two benchmark models: a logistic regression model based on RFM data, and a logistic regression model based on the additional features. The Pareto/NBD model is a crucial step in calculating customer lifetime value (CLV) and the aliveness of customers. If the Pareto/NBD model proves to be a valid approach, then a mobile operator can identify valuable prepaid subscribers using it and decide on actions for these customers, such as suggesting customized offers.
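RFM stands for recency, frequency and monetary value. A minimal sketch of how such features might be derived from refill transactions follows; the record layout `(subscriber_id, day, amount)`, the function name `rfm`, and the sample values are assumptions for illustration only. Recency here is the common marketing definition (days since the last transaction); Pareto/NBD implementations often define it differently, relative to the first transaction.

```python
from collections import defaultdict


def rfm(transactions, today):
    """Compute recency, frequency and monetary value per subscriber.

    transactions: iterable of (subscriber_id, day, amount), days as integers.
    """
    by_subscriber = defaultdict(list)
    for subscriber, day, amount in transactions:
        by_subscriber[subscriber].append((day, amount))

    table = {}
    for subscriber, rows in by_subscriber.items():
        last_day = max(day for day, _ in rows)
        table[subscriber] = {
            "recency": today - last_day,  # days since last transaction
            "frequency": len(rows),       # number of transactions
            "monetary": sum(amount for _, amount in rows),  # total spend
        }
    return table


# Invented refill records, not data from the paper.
refills = [("s1", 1, 10.0), ("s1", 20, 5.0), ("s2", 3, 20.0)]
table = rfm(refills, today=30)
```

A table of this shape could serve as input to a logistic regression benchmark of the kind the abstract mentions.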

Short Papers

Paper Nr:	2
Title:	Models for Predicting the Development of Insulin Resistance
Authors:	Thomas Forstner, Christiane Dienhart, Ludmilla Kedenko, Gernot Wolkersdörfer and Bernhard Paulweber
Abstract:	Insulin resistance is the leading cause of type 2 diabetes. Early determination of insulin resistance, and thereby of impending type 2 diabetes, could help to establish preventive measures or even therapies sooner. However, an optimal predictive model for developing insulin resistance has not been established yet. Based on the data of an Austrian cohort study (SAPHIR study), various predictive models were calculated and compared to each other. Logistic regression models were used to develop the predictive models, and an ROC approach was used to find an optimal cut-off value. Based on various biochemical parameters, an overall rate of around 82% correct classifications could be achieved.
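One common way to pick a cut-off from an ROC analysis is to maximize Youden's J statistic (sensitivity + specificity - 1). The abstract does not say which criterion was used, so the choice of Youden's J, the function name `youden_cutoff`, and the toy scores below are assumptions; the values are invented and unrelated to the SAPHIR data.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its score is >= cutoff.
    Labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy predicted probabilities and true labels, invented for this example.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(scores, labels)
```

In practice the scores would be the predicted probabilities of a fitted logistic regression model.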

Paper Nr:	7
Title:	B-kNN to Improve the Efficiency of kNN
Authors:	Dhrgam AL Kafaf, Dae-Kyoo Kim and Lunjin Lu
Abstract:	The kNN algorithm typically relies on the exhaustive use of training datasets, which degrades efficiency on large datasets. In this paper, we present the B-kNN algorithm to improve the efficiency of kNN using a two-fold preprocessing scheme built upon the notion of minimum and maximum points and boundary subsets. For a given training dataset, B-kNN first identifies classes and, for each class, it further identifies the minimum and maximum points (MMP) of the class. A given testing object is evaluated against the MMP of each class. If the object belongs to the MMP, the object is predicted as belonging to the class. If not, a boundary subset (BS) is defined for each class. Then, BSs are fed into kNN for determining the class of the object. As BSs are significantly smaller in size than their classes, the efficiency of kNN improves. We present two case studies to evaluate B-kNN. The results show an average of 97% improvement in efficiency over kNN using the entire training dataset, while making little sacrifice of the accuracy compared to kNN.
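The minimum and maximum point shortcut can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "belongs to the MMP" is interpreted here as lying inside the per-class coordinate-wise min/max box of exactly one class, and because the abstract does not define how boundary subsets are built, the fallback below simply runs plain kNN over the full training set as a placeholder. All function names and data are hypothetical.

```python
from collections import Counter
from math import dist


def class_boxes(X, y):
    """Per class, the coordinate-wise minimum and maximum points."""
    boxes = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        lo = tuple(min(col) for col in zip(*members))
        hi = tuple(max(col) for col in zip(*members))
        boxes[label] = (lo, hi)
    return boxes


def inside(x, box):
    """True if x lies within the min/max box in every dimension."""
    lo, hi = box
    return all(l <= v <= h for v, l, h in zip(x, lo, hi))


def knn_predict(X, y, query, k):
    """Plain majority-vote kNN with Euclidean distance."""
    nearest = sorted(zip(X, y), key=lambda pair: dist(pair[0], query))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]


def predict(X, y, boxes, query, k=3):
    """Use the box shortcut when unambiguous, otherwise fall back to kNN."""
    hits = [label for label, box in boxes.items() if inside(query, box)]
    if len(hits) == 1:
        return hits[0], "mmp"
    # Placeholder for the boundary-subset step, which is not specified here.
    return knn_predict(X, y, query, k), "knn"


# Toy training data, invented for this example.
X = [(0, 0), (1, 1), (0, 1), (4, 4), (5, 5), (4, 5)]
y = ["a", "a", "a", "b", "b", "b"]
boxes = class_boxes(X, y)
inside_result = predict(X, y, boxes, (0.5, 0.5))
outside_result = predict(X, y, boxes, (3, 3))
```

The efficiency gain in this sketch comes from the shortcut path, which needs only one box test per class instead of a distance computation per training point.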

Paper Nr:	8
Title:	Data Scientist - Manager of the Discovery Lifecycle
Authors:	Kurt Englmeier and Fionn Murtagh
Abstract:	Data Scientists are the masters of Big Data. Analyzing masses of versatile data leads to insights that, in turn, may connect to successful business strategies, crime prevention, or better health care just to name a few. Big Data is primarily approached as mathematical and technical challenge. This may lead to technology design that enables useful insights from Big Data. However, this technology-driven approach does not meet completely and consistently enough the variety of information consumer requirements. To catch up with the versatility of user needs, the technology aspect should probably be secondary. If we adopt a user-driven approach, we are more in the position to cope with the individual expectations and exigencies of information consumers. This article takes information discovery as the overarching paradigm in data science and explains how this perspective change may impact the view on the profession of the data scientist and, resulting from that, the curriculum for the education in data science. It reflects the result from discussions with companies participating in our student project cooperation program. These results are groundwork for the development of a curriculum framework for Applied Data Science.

Paper Nr:	14
Title:	Using Visualisation Techniques to Acquire a Better Understanding of Storytelling for Cultural Heritage
Authors:	Paulo Carvalho, Olivier Parisot and Thomas Tamisier
Abstract:	Historical information plays an important role in cultural heritage. It is used to interpret events that occurred in the past and also to understand the present. Storytelling, when applied to the narration of true events and drawing on different personal views and anecdotal stories, acts as an important source of historical information. In this paper, we discuss the problems we encounter in the field of historical information storytelling and we present a software architecture to facilitate the comprehension of stories. More precisely, the proposed solution helps to analyse a story, examine its composition by identifying existing entity classes and computing possible relations with other stories, and finally build a visual representation of these stories.

Paper Nr:	31
Title:	Demand Prediction using Machine Learning Methods and Stacked Generalization
Authors:	Resul Tugay and Şule Gündüz Öğüdücü
Abstract:	Supply and demand are two fundamental concepts for sellers and customers. Predicting demand accurately is critical for organizations in order to be able to make plans. In this paper, we propose a new approach for demand prediction on an e-commerce web site. The proposed model differs from earlier models in several ways. The business model used by the e-commerce web site for which the model is implemented is a marketplace model that includes many sellers selling the same product at the same time at different prices. The demand prediction for such a model should consider the price of the same product sold by competing sellers along with the features of these sellers. In this study we first applied different regression algorithms to a specific set of products of one department of a company that is one of the most popular e-commerce companies in Turkey. Then we used stacked generalization, also known as stacking ensemble learning, to predict demand. Finally, all the approaches are evaluated on a real-world data set obtained from the e-commerce company. The experimental results show that some of the machine learning methods produce results almost as good as those of the stacked generalization method.

Paper Nr:	34
Title:	Table Interpretation and Extraction of Semantic Relationships to Synthesize Digital Documents
Authors:	Martha O. Perez-Arriaga, Trilce Estrada and Soraya Abad-Mota
Abstract:	The large number of scientific publications produced today prevents researchers from analyzing them rapidly. Automated analysis methods are needed to locate relevant facts in a large volume of information. Though publishers establish standards for scientific documents, the variety of topics, layouts, and writing styles impedes the prompt analysis of publications. A single standard across scientific fields is infeasible, but common elements, namely tables and text, exist by which to analyze publications from any domain. Tables offer an additional dimension describing direct or quantitative relationships among concepts. However, extracting table information and unambiguously linking it to the corresponding text to form accurate semantic relationships are non-trivial tasks. We present a comprehensive framework to conceptually represent a document by extracting its semantic relationships and context. Given a document, our framework uses its text and the content and structure of its tables to identify relevant concepts and relationships. Additionally, we use the Web and ontologies to perform disambiguation, establish a context, annotate relationships, and preserve provenance. Finally, our framework provides an augmented synthesis for each document in a domain-independent format. Our results show that by using information from tables we are able to increase the number of highly ranked semantic relationships by a whole order of magnitude.

Paper Nr:	36
Title:	Data Transformation Methodologies between Heterogeneous Data Stores - A Comparative Study
Authors:	Arnab Chakrabarti and Manasi Jayapal
Abstract:	With the advent of NoSQL data stores, solutions for data migration from traditional relational databases to NoSQL databases have been gaining impetus in recent times. This is also due to the fact that data generated in recent times are more heterogeneous in nature. In the currently available literature and surveys we find that in-depth studies have already been conducted on the tools and platforms used in handling structured, unstructured and semi-structured data; however, there is no guide that compares the methodologies for transforming and transferring data between these data stores. In this paper we present an extensive comparative study in which we compare and evaluate data transformation methodologies between varied data sources and discuss the associated challenges and opportunities.

Paper Nr:	38
Title:	Mining and Linguistically Interpreting Data from Questionnaires - Influence of Financial Literacy to Behaviour
Authors:	Miroslav Hudec and Zuzana Brokešová
Abstract:	This paper focuses on mining and interpreting information about the effect of financial literacy on individuals’ behavior from collected data using a soft computing approach. Fuzzy sets and fuzzy logic allow us to formalize linguistic terms such as most of, high literacy and the like, and to interpret mined knowledge by short quantified sentences of natural language. This approach is able to cover semantic uncertainty in data and concepts. The preliminary results in this position paper show that for the majority of people with low financial literacy, angst and other threats represent serious issues, whereas about half of the people with high literacy do not consider these threats significant. Finally, the influence of literacy on anchoring questions is mined and interpreted. The paper closes by emphasising the need for further data analysis and comparison.
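A quantified sentence such as "most of the respondents with low literacy report high angst" can be given a truth value in the Zadeh style: the proportion of the restriction R that also satisfies the summarizer S is passed through a membership function for the quantifier. The sketch below is a generic illustration of that construction, not the paper's exact formulation; the breakpoints 0.5 and 0.85 for "most of", the use of the minimum as conjunction, and the membership degrees are all assumptions.

```python
def most_of(p, lower=0.5, upper=0.85):
    """Hypothetical membership function of the quantifier 'most of'."""
    if p <= lower:
        return 0.0
    if p >= upper:
        return 1.0
    return (p - lower) / (upper - lower)


def validity(mu_r, mu_s, quantifier=most_of):
    """Truth value of 'Q entities that are R are also S'.

    mu_r, mu_s: membership degrees in [0, 1] for each entity.
    """
    matched = sum(min(r, s) for r, s in zip(mu_r, mu_s))
    total = sum(mu_r)
    return quantifier(matched / total) if total > 0 else 0.0


# Invented membership degrees: R = 'low literacy', S = 'high angst'.
mu_low_literacy = [1.0, 1.0, 0.5, 0.0]
mu_high_angst = [1.0, 0.5, 0.5, 1.0]
v = validity(mu_low_literacy, mu_high_angst)
```

Here the matched proportion is 2.0 / 2.5 = 0.8, which falls between the two breakpoints, so the sentence is partially true rather than simply true or false.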

Paper Nr:	42
Title:	A Data-driven Framework on Mining Relationships between Air Quality and Cancer Diseases
Authors:	Wei Yuan Chang, En Tzu Wang and Arbee L. P. Chen
Abstract:	According to the report on global health risks published by the World Health Organization, environmental issues urgently need to be dealt with worldwide. In particular, air pollution causes great damage to human health. In this work, we build a framework for finding the correlations between air pollution and cancer diseases. This framework consists of a data access flow and a data analytics flow. The data access flow is designed to process raw data and to make the data accessible through APIs. The cancer statistics are then mapped to air pollution data through temporal and spatial information. The analytics flow is used to find insights based on data exploration and data classification methods. The data exploration methods use statistics, clustering, and a series of mining techniques to interpret data. Then, the data mining methods are applied to find the relationships between air quality and cancer diseases by viewing air pollution indicators and cancer statistics as features and labels, respectively. The experimental results show that the NO and NO2 air pollutants have a significant influence on breast cancer, and that lung cancer is significantly influenced by NO2, NO, PM10 and O3, which is consistent with the findings of traditional statistical methods. Moreover, our results also cover the research results from several other studies. The proposed framework is flexible and can be applied to other applications with spatiotemporal data.

Paper Nr:	9
Title:	Reducing Variant Diversity by Clustering - Data Pre-processing for Discrete Event Simulation Models
Authors:	Sonja Strasser and Andreas Peirleitner
Abstract:	Building discrete event simulation models for studying questions in production planning and control requires reasonable calculation time. Two main causes of increased calculation time are the level of model detail and the experimental design. However, if the objective is to optimize parameters to investigate the parameter settings for materials, they have to be modelled in detail. As a consequence, model details such as the number of simulated materials or work stations in a production system have to be reduced. The challenge in real-world applications with a high variant diversity of products is to select representative materials from the huge number of existing materials for building a simulation model, on the condition that the simulation results remain valid. Data mining methods, especially clustering, can be used to perform this selection automatically. In this paper, a procedure for data preparation and clustering of materials with different routings is shown and applied in a case study from sheet metal processing.

Paper Nr:	15
Title:	Clink - A Novel Record Linkage Methodology based on Graph Interactions
Authors:	Mahmoud Boghdady and Neamat El-Tazi
Abstract:	With the advent of the big-data era and the rapid growth of the amount of data, companies are faced with more opportunities and challenges to outperform their peers, innovate, compete, and capture value from big-data platforms such as social networks. Utilizing the full benefit of social media requires companies to identify their own customers against customers as a whole by linking their local data against data from social media, applying record-linkage techniques that range from simple to complex. For large sources that have huge data and fewer constraints over data, the linking process produces low-quality results and requires a lot of pairwise comparisons. We propose a study on how to calculate a similarity score based not only on string similarity techniques or topological graph similarity, but also on graph interactions between nodes, to effectively achieve better linkage results.

Area 3 - Data Management and Quality

Full Papers

Paper Nr:	22
Title:	Storing and Processing Personal Narratives in the Context of Cultural Legacy Preservation
Authors:	Pierrick Bruneau, Olivier Parisot and Thomas Tamisier
Abstract:	An important, yet underestimated, aspect of cultural heritage preservation is the analysis of personal narratives told by citizens. In this paper, we present a data model and implementation towards facilitating narratives storage and sharing. The proposed solution aims at collecting textual narratives in raw form, processing them to extract and store structured content, and then exposing results through a RESTful interface. We apply it to a corpus related to the time of the European construction in Luxembourg. We disclose details about our conceptual model and implementation, as well as evidence supporting the interest of our approach.

Paper Nr:	32
Title:	A Multi-criteria Approach for Large-object Cloud Storage
Authors:	Uwe Hohenstein, Michael C. Jaeger and Spyridon V. Gogouvitis
Abstract:	In the area of storage, various services and products are available from several providers. Each product possesses particular advantages of its own. For example, some systems are offered as cloud services, while others can be installed on premises, some store redundantly to achieve high reliability while others are less reliable but cheaper. In order to benefit from the offerings at a broader scale, e.g., to use specific features in some cases while trying to reduce costs in others, a federation is beneficial to use several storage tools with their individual virtues in parallel in applications. The major task of a federation in this context is to handle the heterogeneity of involved systems. This work focuses on storing large objects, i.e., storage systems for videos, database archives, virtual machine images etc. A metadata-based approach is proposed that uses the metadata associated with objects and containers as a fundamental concept to set up and manage a federation and to control storage locations. The overall goal is to relieve applications from the burden to find appropriate storage systems. Here a multi-criteria approach comes into play. We show how to extend the object storage developed by the VISION Cloud project to support federation of various storage systems in the discussed sense.

Paper Nr:	48
Title:	Success of the Functionalities of a Learning Management System
Authors:	Floriana Meluso, Paolo Avogadro, Silvia Calegari and Matteo Dominoni
Abstract:	The goal of this research is to define and implement indicators for a Learning Management System (LMS). In particular, we focus on estimating patterns on the utilization of the message system by defining two quantities: the specific utilization and popularity. The idea is to take into account the perspective of academic institution managers and the administrators of the LMS, for example to understand if a particular department fails at providing a useful LMS service, or in order to allocate the correct amount of resources. These indicators have been tested on the LMS employed by the “Università degli Studi di Milano-Bicocca” (Milan, Italy), and in general provided a picture of poor utilization of the message system, where the usage follows a pattern similar to the Zipf law. This feature, correlated with the principle of least effort, suggests that LMSs should join forces with existing social networking systems to create strong online learning communities.
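A usage pattern that follows Zipf's law has frequencies roughly proportional to 1 / rank^s, so it appears as a straight line with slope -s on a log-log rank-frequency plot. A minimal sketch of estimating s by least squares is given below; the function name `zipf_exponent` and the synthetic counts are assumptions for illustration and are not the indicators or data from the paper.

```python
from math import log


def zipf_exponent(counts):
    """Estimate the Zipf exponent s from usage counts.

    Fits log(frequency) = c - s * log(rank) by ordinary least squares.
    Needs at least two positive counts.
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic message counts that follow an ideal Zipf law with s = 1.
counts = [1000 / rank for rank in range(1, 51)]
s = zipf_exponent(counts)
```

For real usage logs the fit is only approximate, and an estimate near 1 would indicate that a few users or courses account for most of the message traffic.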

Short Papers

Paper Nr:	16
Title:	Using Signifiers for Data Integration in Rail Automation
Authors:	Alexander Wurl, Andreas Falkner, Alois Haselböck and Alexandra Mazak
Abstract:	In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous data sources. As a consequence, data quality of integrated data is crucial for the optimal utilization of the production capacity. Unfortunately, current integration approaches mostly neglect uncertainties and inconsistencies in the integration process in terms of railway specific data. To tackle these restrictions, we propose a semi-automatic process for data import, where the user resolves ambiguous data classifications. The task of finding the correct data warehouse classification of source values in a proprietary, often semi-structured format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In a case study from the domain of asset management in Rail Automation we evaluate that this approach facilitates high-quality data integration while minimizing user interaction.

Paper Nr:	45
Title:	Data Preprocessing of eSport Game Records - Counter-Strike: Global Offensive
Authors:	David Bednárek, Martin Krulis, Jakub Yaghob and Filip Zavoral
Abstract:	Electronic sports, or pro gaming, have become very popular in this millennium, and the increased value of this new industry is attracting investors with various interests. One of these interests is game betting, which requires player and team rating, game result predictions, and fraud detection techniques. In our work, we focus on preprocessing data from the game Counter-Strike: Global Offensive in order to employ subsequent data analysis methods for quantifying player performance. The data preprocessing is difficult since the data format is complex and undocumented, the data quality of available sources is low, and there is no direct way to match players from the recorded files with players listed on public boards such as the HLTV website. We summarize our experience from the data preprocessing and provide a way to establish player matching based on player metadata.

Paper Nr:	46
Title:	Implementation and Empirical Evaluation of a Case-based, Interactive e-Learning Module with X-ray Tooth Prognosis
Authors:	Thomas Ostermann, Hedwig Ihlhoff-Goulioumius, Martin R. Fischer, Jan P. Ehlers and Michaela Zupanic
Abstract:	The prognosis estimation of teeth based on radiographs is a subordinate but relevant target in many dental medicine curricula in Germany. Empirical data on the integration of e-learning material into dental curricula are rare. We aimed at developing and implementing a radiological pillar diagnostics online course in the dental curriculum at the University of Witten/Herdecke. This online course was developed on the CASUS web-based learning platform and implemented in a blended learning approach. Results showed an easy creation of learning cases (virtual patients), higher utilization for the intervention group regarding the number of cases revised, time-on-task, and student acceptance. Dental students experienced improved learning efficacy, higher long-term knowledge retention and significantly better results in case-based assessment. The usability of the CASUS learning platform can therefore be regarded as high, and further studies using this e-learning approach are recommended.

Paper Nr:	11
Title:	Modeling and Qualitative Evaluation of a Management Canvas for Big Data Applications
Authors:	Michael Kaufmann, Tobias Eljasik-Swoboda, Christian Nawroth, Kevin Berwind, Marco Bornschlegl and Matthias Hemmje
Abstract:	A reference model for big data management is proposed, together with a methodology for business enterprises to bootstrap big data projects. Similar to the business model canvas for marketing management, the big data management (BDM) canvas is a template for developing new (or mapping existing) big data applications, strategies and projects. It subdivides this task into meaningful fields of action. The BDM canvas provides a visual chart that can be used in workshops iteratively to develop strategies for generating value from data. It can also be used for project planning and project progress reporting. The canvas instantiates a big data reference meta-model, the BDM cube, which provides its meta-structure. In addition to developing and theorizing the proposed data management model, two case studies on pilot applications in companies in Switzerland and Austria provide a qualitative evaluation of our approach. Using the insights from expert feedback, we provide an outlook for further research.

Paper Nr:	27
Title:	Validating ETL Patterns Feasability using Alloy
Authors:	Bruno Oliveira and Orlando Belo
Abstract:	ETL processes can be seen as typical data-oriented workflows composed of dozens of granular tasks that are responsible for the integration of data coming from different data sources. They are one of the most important components of a data warehousing system, strongly influenced by the complexity of business requirements and by their change and evolution. To facilitate ETL planning and implementation, a set of patterns specially designed to map standard ETL procedures is presented. They provide a simpler, conceptual perspective that can be enriched to enable the generation of execution primitives. Generic models can be built, simplifying process views and providing methods for carrying over acquired expertise to new applications using well-proven practices. This work demonstrates the fundamentals of a pattern-based approach for ETL development, and its configuration and validation through a set of Alloy specifications used to express its structural constraints and behaviour.

Area 4 - Databases and Data Security

Full Papers

Paper Nr:	24
Title:	SPDC: Secure Proxied Database Connectivity
Authors:	Diogo Domingues Regateiro, Óscar Mortágua Pereira and Rui L. Aguiar
Abstract:	In the business world, database applications are a predominant tool where data is generally the most important asset of a company. Companies use database applications to access, explore and modify their data in order to provide a wide variety of services. When these applications connect directly to the database and run in semi-public locations, such as a reception area of a company, or are connected to the internet, they can become the target of attacks by malicious users and have the hard-coded database credentials stolen. To prevent unauthorized access to a database, solutions such as virtual private networks (VPNs) are used. However, VPNs can be bypassed using internal attacks, and the stolen credentials can then be used to gain access to the database. In this paper, Secure Proxied Database Connectivity (SPDC) is proposed, a new methodology to enhance the protection of database access. It pushes the credentials to a proxy server and separates the information required to access the database between a proxy server and an authentication server. This solution is compared to a VPN using various attack scenarios and we show, with a proof-of-concept, that this proposal can also be completely transparent to the user.

Paper Nr:	39
Title:	ChronoGraph - Versioning Support for OLTP TinkerPop Graphs
Authors:	Martin Haeusler, Emmanuel Nowakowski, Matthias Farwick, Ruth Breu, Johannes Kessler and Thomas Trojer
Abstract:	In recent years, techniques for system-time versioning of database content have become more sophisticated and powerful, due to the demands of business-critical applications that require traceability of changes, auditing capabilities or historical data analysis. The essence of these techniques was standardized in 2011 when it was introduced as a part of the SQL standard. However, in NoSQL databases and in particular in the emerging graph technologies, these aspects have so far been neglected by database providers. In this paper, we present ChronoGraph, the first TinkerPop graph database implementation that offers comprehensive support for content versioning and analysis, designed for Online Transaction Processing (OLTP). This paper offers two key contributions: the addition of our novel versioning concepts to the state of the art in graph databases, as well as their implementation as an open-source project. We demonstrate the feasibility of our proposed solution through controlled experiments.

Paper Nr:	49
Title:	Managing Distributed Queries under Personalized Anonymity Constraints
Authors:	Axel Michel, Benjamin Nguyen and Philippe Pucheral
Abstract:	The benefits of performing big data computations over individuals' microdata are manifold, in the medical, energy or transportation fields to cite only a few, and this interest is growing with the emergence of smart disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, explaining the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. This paper proposes a novel approach to push personalized privacy guarantees into the processing of database queries so that individuals can disclose different amounts of information (i.e. data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure based on secure hardware enforcing these personalized privacy guarantees all along the query execution process. A performance analysis conducted on a real platform shows the effectiveness of the approach.

Short Papers

Paper Nr:	25
Title:	Supporting Pre-shared Keys in Closed Implementations of TLS
Authors:	Diogo Domingues Regateiro, Óscar Mortágua Pereira and Rui L. Aguiar
Abstract:	In the business world, data is generally the most important asset of a company and must be protected. However, it must be made available to provide a wide variety of services, and so it can become the target of attacks by malicious users. Such attacks can involve eavesdropping on the network or gaining unauthorized access, allowing an attacker to access sensitive information. Secure protocols, such as Transport Layer Security (TLS), are usually used to mitigate these attacks. Unfortunately, most implementations force applications to use digital certificates, which may not always be desirable due to trust or monetary issues. Furthermore, implementations are usually closed and cannot be extended to support other authentication methods. In this article, a methodology is proposed to slightly modify closed implementations of the TLS protocol that only support digital certificates, so that pre-shared keys are used instead to protect the communication between two entities. A performance assessment is carried out on a proof-of-concept to demonstrate its feasibility and performance.

Paper Nr:	30
Title:	Governance and Privacy in a Provincial Data Repository - A Cross-sectional Analysis of Longitudinal Birth Cohort Parent Participants’ Perspectives on Sharing Adult Vs. Child Research Data
Authors:	Shawn X. Dodd, Kiran Pohar Manhas, Stacey Page, Nicole Letourneau, Xinjie Cui and Suzanne Tough
Abstract:	Research data abound and are increasingly shared through a variety of platforms, such as biobanks for precision health and data repositories for reuse of research and administrative data. Data sharing presents great opportunities as well as significant ethical and legal concerns, such as privacy, consent, governance, access, and communication. Respectful data governance calls for stakeholder engagement during platform development. This stakeholder-engagement study used a web-based survey to capture the views of research participants about governance strategies for secondary data use. Survey response rate was 60.8% (n = 346). Parents’ primary concern was ensuring appropriate re-use of data, even over privacy. Appropriate re-use included project-specific access and limiting access to researchers with more-trusted affiliations like academia. Other affiliations (e.g. industry, government and not-for-profit) were less palatable. Parents considered pediatric data more sensitive than adult data and expressed more reluctance towards sharing child identifiers compared to their own (p-value<0.001). This study stresses the importance of repository governance strategies to sustain long-term access to valuable data assets via large-scale repositories.