DATA 2014 Abstracts


Area 1 - Business Analytics

Short Papers
Paper Nr: 39
Title:

Open Data Integration - Visualization as an Asset

Authors:

Paulo Carvalho, Patrik Hitzelberger, Benoît Otjacques, Fatma Bouali and Gilles Venturini

Abstract: For several years, and even decades, data integration has been a major problem in computer science. When information from different data sources must be processed, several problems may appear, making the integration process more difficult. Nowadays, more and more information is sent, received and made available on the Web, and data integration is becoming even more important. This is especially the case in the emerging trend of Open Data (OD). Integrating data from public entities can be a difficult process: large quantities of datasets are made available, but an important level of heterogeneity may also exist, as datasets come in different formats, forms and shapes. While it is important to be able to access this information, it would also be completely useless if we were not able to interpret it. Information Visualization may be an important tool to help the OD integration process. This paper presents problems and barriers which can be encountered in the data integration process and, more specifically, in the OD integration process. The paper also describes how Information Visualization can be used to facilitate the integration of OD and make the procedure more effective, friendlier, and faster.

Paper Nr: 45
Title:

Complexity of Rule Sets Induced from Incomplete Data with Attribute-concept Values and "Do Not Care" Conditions

Authors:

Patrick G. Clark and Jerzy W. Grzymala-Busse

Abstract: In this paper we study the complexity of rule sets induced from incomplete data sets with two interpretations of missing attribute values: attribute-concept values and “do not care” conditions. Experiments are conducted on 176 data sets, using three kinds of probabilistic approximations (lower, middle and upper) and the MLEM2 rule induction system. The goal of our research is to determine the interpretation and approximation that produce the least complex rule sets. In our experimental results, the size of the rule set is smaller for attribute-concept values for 12 combinations of the type of data set and approximation, for one combination the size of the rule sets is smaller for “do not care” conditions, and for the remaining 11 combinations the difference in performance is statistically insignificant (5% significance level). The total number of conditions is smaller for attribute-concept values for ten combinations, for two combinations the total number of conditions is smaller for “do not care” conditions, while for the remaining 12 combinations the difference in performance is statistically insignificant. Thus, we may claim that attribute-concept values are better than “do not care” conditions in terms of rule complexity.
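
The abstract's significance claims suggest a paired test over per-dataset complexity measures. Below is a minimal sketch of that kind of comparison, assuming a Wilcoxon signed-rank test at the 5% level; the rule-set sizes are illustrative placeholders, not the paper's data.

    # Paired comparison of rule-set sizes under the two interpretations of
    # missing attribute values (numbers invented for illustration).
    from scipy.stats import wilcoxon

    sizes_attribute_concept = [14, 22, 9, 31, 17, 25, 12, 28]
    sizes_do_not_care       = [18, 27, 10, 35, 21, 24, 15, 33]

    stat, p = wilcoxon(sizes_attribute_concept, sizes_do_not_care)
    print("p = %.4f -> %s at the 5%% level"
          % (p, "significant" if p < 0.05 else "not significant"))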

Paper Nr: 58
Title:

Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction

Authors:

Been-Chian Chien

Abstract: Data reduction is an important research topic for analyzing massive data efficiently and effectively in the era of big data. The task of dimension reduction is usually accomplished by techniques of feature selection, feature clustering or algebraic transformation. A novel approach for reducing high-dimensional data is proposed in this paper. The main idea of the proposed scheme is to incorporate data clustering and feature selection to transform high-dimensional data into lower dimensions. The incremental clustering algorithm in the scheme is used to control the number of dimensions, and the relative discriminant variable is designed for selecting significant features. Finally, a simple inner product operation is applied to transform the original high-dimensional data into low-dimensional data. Evaluations are conducted by testing the reduction approach on the problem of document categorization. The experimental results show that the reduced data retain high classification accuracy for most of the datasets. For some special datasets, the reduced data achieve higher classification accuracy than the original data.
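
A minimal sketch of the general transform described: cluster the features, then map each document into cluster space via an inner product. It substitutes scikit-learn's KMeans for the paper's incremental clustering and omits the relative discriminant variable, so it is only the skeleton of the idea.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 500)          # 100 documents, 500 features (toy data)
    k = 20                                 # target dimensionality

    # Cluster the *features* (columns) by their profiles across documents.
    feature_clusters = KMeans(n_clusters=k, n_init=10).fit_predict(X.T)

    # Membership matrix M (500 x 20): M[j, c] = 1 if feature j is in cluster c.
    M = np.zeros((X.shape[1], k))
    M[np.arange(X.shape[1]), feature_clusters] = 1.0

    # The inner product maps each document to a k-dimensional vector.
    X_reduced = X @ M                      # shape (100, 20)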

Paper Nr: 3
Title:

Matching Knowledge Users with Knowledge Creators using Text Mining Techniques

Authors:

Abdulrahman Al-Haimi

Abstract: Matching knowledge users with knowledge creators from multiple data sources that share very little similarity in content and data structure is a key problem. Solving this problem is expected to noticeably improve the research commercialization rate. In this paper, we discuss and evaluate the effectiveness of a comprehensive methodology that automates classic text mining techniques to match knowledge users with knowledge creators. We also present a prototype application that is considered one of the first attempts to match knowledge users with knowledge creators by analyzing records from Linkedin.com and BASE-search.net. The matching procedure is performed using supervised and unsupervised models. Surprisingly, the experimental results show that the K-NN classifier slightly outperforms its competitors when evaluated in a similar context. After identifying the best-suited methodology, the system architecture is designed. One of the main contributions of this research is the introduction and analysis of a novel prototype application that attempts to bridge the gap between research performed in industry and academia.
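
A minimal sketch of the kind of supervised matching model mentioned, assuming TF-IDF features and scikit-learn's k-NN classifier; the training texts and labels are invented placeholders, not the paper's Linkedin.com/BASE-search.net records.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Toy corpus standing in for creator/user records (invented).
    train_texts = ["machine learning for sensor data", "polymer chemistry lab",
                   "deep neural network models", "organic synthesis methods"]
    train_labels = ["computing", "chemistry", "computing", "chemistry"]

    model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    model.fit(train_texts, train_labels)
    print(model.predict(["reinforcement learning agents"]))  # -> ['computing']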

Paper Nr: 15
Title:

Clustering Users’ Requirements Schemas

Authors:

Nouha Arfaoui and Jalel Akaichi

Abstract: Data Mining proposes different techniques to deal with data. In our work, we suggest the use of the clustering technique, since we want to group the schemas into clusters according to their similarity. This technique can be applied to various types of variables; we focus on categorical data. Many algorithms have been proposed, but none of them takes the semantic aspect into consideration. For this reason, and in order to ensure a good clustering of the schemas of the users’ requirements, we extend the k-modes algorithm by modifying its dissimilarity measure. The schemas within each cluster are then merged to construct the schemas of the data mart.
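
A minimal sketch of the core idea, assuming a hand-made semantic similarity table: a k-modes-style assignment step whose usual 0/1 mismatch dissimilarity is replaced by a semantic dissimilarity between categorical values (the paper's actual measure is not reproduced here).

    SEMANTIC_SIM = {("client", "customer"): 0.9}   # hypothetical similarity table

    def sem_dissim(a, b):
        # 0 for identical values; otherwise 1 minus their semantic similarity.
        if a == b:
            return 0.0
        s = SEMANTIC_SIM.get((a, b)) or SEMANTIC_SIM.get((b, a)) or 0.0
        return 1.0 - s

    def schema_dissim(s1, s2):
        # Schemas as equal-length tuples of categorical attribute values.
        return sum(sem_dissim(a, b) for a, b in zip(s1, s2))

    def assign(schemas, modes):
        # One k-modes assignment step: each schema joins its nearest mode.
        return [min(range(len(modes)), key=lambda c: schema_dissim(s, modes[c]))
                for s in schemas]

    schemas = [("client", "sale", "date"), ("customer", "sale", "date"),
               ("patient", "visit", "date")]
    modes = [("client", "sale", "date"), ("patient", "visit", "date")]
    print(assign(schemas, modes))   # -> [0, 0, 1]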

Paper Nr: 17
Title:

Enterprise Competitive Analysis and Consumer Sentiments on Social Media - Insights from Telecommunication Companies

Authors:

Eric Afful-Dadzie, Stephen Nabareseh, Zuzana Komínková Oplatková and Petr Klímek

Abstract: The utilization of social media tools in business enterprises has increased tremendously, with a growing number of users and a corresponding upsurge in time spent online. Online social media services such as Facebook and Twitter are used by companies to introduce new products and services, provide various kinds of support and interact with customers on a daily basis. This regular interaction between businesses and consumers results in a huge amount of customer-generated content, which is becoming a source of insight for analysing often erratic consumer behaviour. For companies to harness the business potential of social media to increase competitive advantage, the sentiments behind the textual data of both their customers and those of their competitors must be keenly monitored and analysed. This paper demonstrates how companies, especially those in the telecommunication industry, can seize the opportunity presented by social media to mine textual data and gain advantage over competitors by cumulatively understanding consumer opinions, frustrations and satisfaction. Using the Facebook and Twitter sites of the top three telecommunication companies in Ghana (MTN, Vodafone and Tigo), the paper reveals insights from the unstructured texts of the customers of these three companies. The results show (1) the exponential growth of social media users in Ghana, (2) the impact and numbers behind active social media participation in the telecommunication industry, (3) the power of social media opinion mining for competitive analysis, (4) how business value can be extracted from the huge amount of unstructured textual data available on social media, and (5) which company is more responsive to customer concerns.
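
A minimal lexicon-based sketch of the kind of cross-competitor sentiment tally involved; the paper does not prescribe this particular algorithm, and the operators and posts below are hypothetical.

    # Tally positive/negative words in customer posts per operator.
    POSITIVE = {"good", "great", "fast", "love", "thanks"}
    NEGATIVE = {"bad", "slow", "drop", "frustrated", "worst"}

    posts = {
        "OperatorA": ["great service, fast data", "thanks for the quick fix"],
        "OperatorB": ["calls drop all day", "worst network, so slow"],
    }

    for operator, texts in posts.items():
        words = " ".join(texts).lower().split()
        pos = sum(w.strip(",.") in POSITIVE for w in words)
        neg = sum(w.strip(",.") in NEGATIVE for w in words)
        print(operator, "sentiment score:", pos - neg)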

Paper Nr: 34
Title:

Evaluating the Unification of Multiple Information Retrieval Techniques into a News Indexing Service

Authors:

Christos Bouras and Vassilis Tsogkas

Abstract: As online information sources rapidly increase in number, so does the daily available online news content. Several approaches have been proposed for organizing this immense amount of data. In this work we explore the integration of multiple information retrieval techniques, such as text preprocessing, n-grams expansion, summarization, categorization and item/user clustering, into a single mechanism designed to consolidate and index news articles from major news portals around the web. Our goal is to allow users to seamlessly and quickly get the news of the day that appeals to them via our system. We show how the application of each of the proposed techniques gradually improves the precision results in terms of the suggested news articles for a number of registered system users, and how, in aggregate, these techniques provide a unified solution to the recommendation problem.

Paper Nr: 41
Title:

Decision Trees and Data Preprocessing to Help Clustering Interpretation

Authors:

Olivier Parisot, Mohammad Ghoniem and Benoît Otjacques

Abstract: Clustering is a popular technique for data mining, knowledge discovery and visual analytics. Unfortunately, cluster assignments can be difficult to interpret by a human analyst. This difficulty has often been overcome by using decision trees to explain cluster assignments. The success of this approach is however subject to the legibility of the obtained decision trees. In this work, we propose an evolutionary algorithm to cleverly preprocess the data before clustering in order to obtain clusters that are simpler to interpret with decision trees. A prototype has been implemented and tested to show the benefits of the approach.

Paper Nr: 46
Title:

Evidential-Link-based Approach for Re-ranking XML Retrieval Results

Authors:

M'hamed Mataoui, Mohamed Mezghiche, Faouzi Sebbak and Farid Benhammadi

Abstract: In this paper, we propose a new evidential link-based approach for re-ranking XML retrieval results. The approach, based on the Dempster-Shafer theory of evidence, combines, for each retrieved XML element, content relevance evidence and computed link evidence (score and rank). The use of the Dempster-Shafer theory is motivated by the need to improve retrieval accuracy by incorporating the uncertain nature of both bodies of evidence (content and link relevance). The link score is computed according to a new link analysis algorithm based on weighted links, where relevance is propagated through the two types of links, i.e., hierarchical and navigational. The propagation, i.e., the amount of relevance score received by each retrieved XML element, depends on the link weight, which is defined according to two parameters: link type and link length. To evaluate our proposal we carried out a set of experiments based on the INEX data collection.
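
For illustration, a minimal sketch of Dempster's rule of combination on a two-element frame {R, N} (relevant / not relevant), the combination step such an approach builds on; mass on the whole frame "RN" expresses ignorance, and the numbers are invented, not the paper's.

    def combine(m1, m2):
        keys = ["R", "N", "RN"]
        def inter(a, b):
            # Intersection of focal elements; None means empty (conflict).
            if a == "RN": return b
            if b == "RN": return a
            return a if a == b else None
        combined = {k: 0.0 for k in keys}
        conflict = 0.0
        for a in keys:
            for b in keys:
                c = inter(a, b)
                if c is None:
                    conflict += m1[a] * m2[b]
                else:
                    combined[c] += m1[a] * m2[b]
        # Normalize by the non-conflicting mass (Dempster's rule).
        return {k: v / (1.0 - conflict) for k, v in combined.items()}

    content_evidence = {"R": 0.6, "N": 0.1, "RN": 0.3}
    link_evidence    = {"R": 0.4, "N": 0.2, "RN": 0.4}
    print(combine(content_evidence, link_evidence))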

Paper Nr: 63
Title:

ConceptMix - Self-Service Analytical Data Integration based on the Concept-Oriented Model

Authors:

Alexandr Savinov

Abstract: Data integration as well as other data wrangling tasks account for a great deal of the difficulties in data analysis and frequently constitute the most tedious part of the overall analysis process. We describe a new system, ConceptMix, which radically simplifies analytical data integration for a broad range of non-IT users who do not possess deep knowledge in mathematics or statistics. ConceptMix relies on a novel unified data model, called the concept-oriented model (COM), which provides formal background for its functionality.

Paper Nr: 67
Title:

Development of a Practical Tool for Exploring the Map of Technology

Authors:

So Young Kim, June Young Lee, Hyesung Yoon and Hyuck Jai Lee

Abstract: This study suggests a way to utilize the map of technology as a guide for finding new technology components. Recent studies of mapping knowledge have mainly focused on analyzing the map as a result of technological innovation rather than utilizing the map for exploring the world of technological innovation. The preliminary result of a case study suggests that a firm can find possible technology components that can be combined with its own technology components. The map of technology comprises nodes of International Patent Classification (IPC) main groups and links representing the co-assignment relationship between the IPC main groups.

Area 2 - Data Management and Quality

Full Papers
Paper Nr: 22
Title:

Complex Pattern Processing in Spatio-temporal Databases

Authors:

Yang Zheng, Annies Ductan, Devin Thomas and Mohamed Y. Eltabakh

Abstract: The increasing complexity of spatio-temporal applications has caused the underlying queries to be more sophisticated and to usually carry complex semantics. As a result, the traditional spatio-temporal query types, e.g., range, kNN, and aggregation queries, have become just building blocks in more complex query plans. In this paper, we present the STEPQ system, which is an extensible spatio-temporal query engine for complex pattern processing over spatio-temporal data. STEPQ enables full-fledged and optimized integration between spatio-temporal queries and complex event processing (CEP). This integration enables expressing complex queries that execute the desired application semantics without the need for middleware or application-level support. The system is implemented using the TerraLib module on top of the PostgreSQL DBMS. The experimental evaluation demonstrates the feasibility and practicality of the STEPQ system, and the efficiency of the proposed optimizations.
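
A minimal sketch of the pattern-over-stream idea (not STEPQ's engine): detecting a simple spatio-temporal pattern, "object dwells in a region for at least N ticks", over a stream of position events. The region, threshold and data are hypothetical.

    from collections import defaultdict

    REGION = (0, 0, 10, 10)               # xmin, ymin, xmax, ymax
    MIN_DWELL = 3                          # consecutive in-region events

    def in_region(x, y):
        xmin, ymin, xmax, ymax = REGION
        return xmin <= x <= xmax and ymin <= y <= ymax

    dwell = defaultdict(int)
    def on_event(obj, t, x, y):
        # CEP-style stateful handler: track consecutive in-region events.
        if in_region(x, y):
            dwell[obj] += 1
            if dwell[obj] == MIN_DWELL:
                print("pattern matched:", obj, "dwelling in region at t =", t)
        else:
            dwell[obj] = 0

    for t, (x, y) in enumerate([(12, 3), (5, 5), (6, 6), (7, 5), (20, 20)]):
        on_event("truck1", t, x, y)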

Paper Nr: 30
Title:

Large-Scale Assessment and Visualization of the Energy Performance of Buildings with Ecomaps - Project SUNSHINE: Smart Urban Services for Higher Energy Efficiency

Authors:

Luca Giovannini, Stefano Pezzi, Umberto di Staso, Federico Prandi and Raffaele de Amicis

Abstract: This paper illustrates the preliminary results of a research project focused on the development of a Web 2.0 system designed to compute and visualize large-scale building energy performance maps, so-called "ecomaps", using: emerging platform-independent technologies such as WebGL for data presentation, an extended version of the EU-funded TABULA/EPISCOPE project for automatic calculation of building energy parameters, and the CityGML OGC standard as data container. The proposed architecture will allow citizens, public administrations and government agencies to perform city-wide analyses of the energy performance of building stocks.

Paper Nr: 35
Title:

A Glimpse into the State and Future of (Big) Data Analytics in Austria - Results from an Online Survey

Authors:

Ralf Bierig, Allan Hanbury, Martina Haas, Florina Piroi, Helmut Berger, Mihai Lupu and Michael Dittenbach

Abstract: We present results from questionnaire data that were collected from leading data analytics experts in Austria. The online survey addresses very current and pressing questions in the area of (big) data analysis. Our findings provide valuable insights about what top Austrian data scientists think about data analytics, what they consider as important application areas that can benefit from big data and data processing, the challenges of the future and how soon these challenges will become important, and the potential research topics of tomorrow. We visualize results, summarize our findings and suggest a possible roadmap for future decision making.

Paper Nr: 49
Title:

Improving Data Cleansing Accuracy - A Model-based Approach

Authors:

Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini and Fabio Mercorio

Abstract: Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information systems generate a lot of data useful for studying the dynamics of subjects’ behaviours or phenomena over time, making the quality of data a crucial aspect for guaranteeing the believability of the overall knowledge discovery process. In such a scenario, data cleansing techniques, i.e., automatic methods to cleanse a dirty dataset, are paramount. However, when multiple cleansing alternatives are available, a policy is required for choosing between them. The policy design task still relies on the experience of domain experts, and this makes the automatic identification of accurate policies a significant issue. This paper extends the Universal Cleaning Process, enabling the automatic generation of an accurate cleansing policy derived from the dataset to be analysed. The proposed approach has been implemented and tested on an online benchmark dataset, a real-world instance of the labour market domain. Our preliminary results show that our approach represents a contribution towards the generation of data-driven policies, significantly reducing the domain experts' intervention in policy specification. Finally, the generated results have been made publicly available for download.

Short Papers
Paper Nr: 20
Title:

A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data

Authors:

Stefan Pröll and Andreas Rauber

Abstract: Sharing research data is becoming increasingly important as it enables peers to validate and reproduce data-driven experiments. Exchanging data also allows scientists to reuse data in different contexts and gather new knowledge from available sources. But with increasing volumes of data, researchers need to reference exact versions of datasets. Until now, access to research data has often been based on single archives of data files where versioning and subsetting support is limited. In this paper we introduce a mechanism that allows researchers to create versioned subsets of research data which can be cited and shared in a lightweight manner. We demonstrate a prototype that supports researchers in creating subsets based on filtering and sorting source data. These subsets can be cited for later reference and reuse. The system produces evidence that allows users to verify the correctness and completeness of a subset based on cryptographic hashing. We describe a replication scenario for enabling scalable data citation in dynamic contexts.
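
A minimal sketch of the verification idea, assuming CSV source data: a citable subset is defined by a filter and a sort order, and a SHA-256 fingerprint over the resulting rows lets users later check the subset's correctness and completeness. File name and columns are hypothetical.

    import csv, hashlib

    def subset_hash(path, predicate, sort_key):
        # Materialize the subset deterministically, then hash its rows.
        with open(path, newline="") as f:
            rows = [r for r in csv.DictReader(f) if predicate(r)]
        rows.sort(key=sort_key)
        h = hashlib.sha256()
        for r in rows:
            h.update(repr(sorted(r.items())).encode("utf-8"))
        return h.hexdigest()

    # Hypothetical usage: cite all 2014 records ordered by station id.
    # fingerprint = subset_hash("observations.csv",
    #                           lambda r: r["year"] == "2014",
    #                           lambda r: r["station"])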

Paper Nr: 64
Title:

A Component-based Approach to Realize Order Placement and Processing in MSMEs

Authors:

M. Saravanan and J. Venkatesh

Abstract: Micro, Small and Medium scale Enterprises (MSMEs) hold an unfailing distinction of being pillars of equitable economic growth. Lack of proper business platforms and knowledge of marketing strategies renders MSMEs vulnerable to exploitation by middlemen. In view of the advancements and customer growth in the telecommunications field, we utilize the mobile platform to offer trading solutions to MSMEs. In this paper, we propose mobile phone-based order placement and processing components for MSMEs that achieve disintermediation, developed as an Android application integrated with cloud services to provide easy access - anytime, anywhere. Our proposed component-based framework encompasses essential trading operations and extends 24x7 support to MSMEs. The economic order calculator and order parallelizer sub-components help limited-budget MSMEs with small warehouses to survive the market by efficiently managing the warehouse, scheduling payments and parallelizing orders depending on their requirements. The other two sub-components, the custom-specific negotiator and the effective order tracker, help in customizing the product and keeping track of the parallelized order, respectively, thus assisting buyers in tracking their orders to give an end-to-end solution. The envisioned framework will boost MSME margins, build healthy business ties and transform MSMEs into self-sufficient establishments equipped with full-fledged trading systems that operate in a mobile distributed environment.

Paper Nr: 65
Title:

Pricing Schemes for Metropolitan Traffic Data Markets

Authors:

Negin Golrezaei and Hamid Nazerzadeh

Abstract: Data marketplaces provide platforms for the management of large data sets. Data markets are rapidly growing, yet the pricing strategies for data and data analytics are not yet well understood. In this paper, we explore some of the pricing schemes applicable to data marketplaces in the context of transportation traffic data. This includes historical and real-time freeway and arterial congestion data. We investigate pricing raw sensor data vs. processed information (e.g., prediction of traffic patterns or route planning services) and show that, under natural assumptions, the raw data should be priced higher than the processed information.

Paper Nr: 71
Title:

Widget-based Exploration of Linked Statistical Data Spaces

Authors:

Ba-Lam Do, Tuan-Dat Trinh, Peter Wetz, Amin Anjomshoaa, Elmar Kiesling and A. Min Tjoa

Abstract: Today, public statistical data plays an increasingly important role both in public policy formation and as a facilitator for informed decision-making in the private sector. In line with the increasing adoption of open data policies, the amount of data published by governments and organizations on the web is growing rapidly. To increase the value of such data, the W3C recommends the RDF Data Cube Vocabulary to facilitate the publication of data in a more structured and interlinked manner. Although important first steps toward building a web of statistical Linked Datasets have been made, providing adequate facilities for end users to interactively explore and make use of the published data remains an unresolved challenge. This paper presents a widget-based approach to deal with this issue. In particular, we introduce a mashup platform that allows users lacking advanced skills and knowledge of Semantic Web technologies to interactively analyze datasets through widget compositions and visualizations. Furthermore, we provide mechanisms for the interconnection of datasets to support sophisticated knowledge extraction.

Paper Nr: 76
Title:

Knowledge Spring Process - Towards Discovering and Reusing Knowledge within Linked Open Data Foundations

Authors:

Roberto Espinosa, Larisa Garriga, Jose Jacobo Zubcoff and Jose-Norberto Mazon

Abstract: Data is everywhere, and non-expert users must be able to exploit it in order to extract knowledge, get insights and make well-informed decisions. The discovered knowledge could be of even greater value if it were available for later consumption and reuse. In this paper, we present the first version of the Knowledge Spring Process, an infrastructure that allows non-expert users to (i) apply user-friendly data mining techniques on open data sources, and (ii) share results as Linked Open Data (LOD). The main contribution of this paper is the concept of reusing the knowledge gained from data mining processes after it has been semantically annotated as LOD, thus obtaining Linked Open Knowledge. Our Knowledge Spring Process is based on a model-driven viewpoint in order to more easily deal with the wide diversity of open data formats.

Paper Nr: 11
Title:

Integrated Measurement for Pre-Fetching in Mobile Environment

Authors:

Roziyah Darus, Hamidah Ibrahim, Mohamed Othman and Lilly Suryani Affendey

Abstract: Pre-fetching is used to predict the next query of data items before any problems occur due to network congestion, delays, and latency. Lately, pre-fetching strategies have become more complicated in order to support new types of applications, especially for mobile devices. Sometimes the pre-fetched data items are of no interest to the users. Due to this complication, an intelligent technique is introduced, in which an integrated measurement using data mining with a Bayesian approach is proposed to improve query performance. In a previous study, the pre-fetched data items were filtered using a data-driven measurement. The data was generated based on data frequency metrics, whereby the structure of the query pattern is quantified using statistical methods. That measurement is not good enough to solve sequence queries in a mobile environment. In this paper, a new technique is proposed to generate a new and potential pre-fetching set for the users. A subjective measurement is used to determine the pre-fetching set based on user interestingness. The integrated measurement generates strong and weak association rules based on the data and user interestingness criteria. The results show that the performance is significantly improved, whereby the technique manages to quantify the uncertainty of users' expectations of the next possible query.

Paper Nr: 12
Title:

Instance Based Schema Matching Framework Utilizing Google Similarity and Regular Expression

Authors:

Osama A. Mehdi, Hamidah Ibrahim and Lilly Suriani Affendey

Abstract: Schema matching is the task of identifying correspondences between schema attributes that exist in different schemas. A variety of approaches have been proposed to achieve the main goal of high-quality match results with respect to precision (P) and recall (R). However, these approaches are unable to achieve high-quality match results, as most of them treat the instances as strings regardless of their data types. As a consequence, this causes unidentified matches, especially for attributes with numeric instances, which further reduces the quality of the match results. Therefore, effort is still needed to further improve the quality of the match results. In this paper, we propose a framework for addressing the problem of finding matches between schemas of semantically and syntactically related data. Since we fully exploit only the instances of the schemas for this task, we rely on strategies that combine the strength of Google as a source of web semantics and regular expressions for pattern recognition. To demonstrate the accuracy of our framework, we conducted an experimental evaluation using real-world data sets. The results show that our framework is able to find 1-1 schema matches with high accuracy, in the range of 93% - 99% in terms of precision (P), recall (R), and F-measure (F).
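
The "Google similarity" strategy is commonly realized with the Normalized Google Distance (NGD) over search hit counts; the paper does not spell out its exact variant, so the sketch below, with an assumed total-page constant and invented counts, is only illustrative.

    import math

    N = 50e9   # assumed total number of indexed pages (rough constant)

    def ngd(fx, fy, fxy):
        # fx, fy: hit counts for terms x and y alone; fxy: hits for both together.
        lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
        return (max(lx, ly) - lxy) / (math.log(N) - min(lx, ly))

    # Smaller NGD means the two attribute names/instances are more related.
    print(ngd(2.2e9, 1.5e9, 1.1e9))   # illustrative counts, not real measurements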

Paper Nr: 29
Title:

Towards Efficient Reorganisation Algorithms of Hybrid Index Structures Supporting Multimedia Search Conditions

Authors:

Carsten Kropf

Abstract: This paper presents the optimization of the reorganisation algorithms of hybrid index structures supporting multimedia search conditions. Multimedia in this case refers to, on the one hand, the support of high-dimensional feature spaces and, on the other, the mix of data of multiple types. We use an approach typically found in geographic information retrieval (GIR) systems, combining two-dimensional geographical points with textual data; yet the dimensionality of the points may be arbitrarily set. Currently, most of these access methods implemented for use in database-centric application domains are validated regarding their retrieval efficiency in simulation-based environments. Most of the structures and experiments only use synthetic validation in an artificial setup. Additionally, the focus of these tests is to validate retrieval efficiency. We implemented such an indexing method in a realistic database management system and noticed an unacceptable runtime behaviour of the reorganisation algorithms. Hence, a structured and iterative optimization procedure is set up to make hybrid index structures suitable for use in real-world application scenarios. The final outcome is a set of algorithms providing efficient approaches for the reorganisation of access methods for hybrid data spaces.

Paper Nr: 47
Title:

Automatic and Graceful Repairing of Data Inconsistencies Resulting from Retroactive Updates in Temporal XML Databases

Authors:

Hind Hamrouni, Zouhaier Brahmia and Rafik Bouaziz

Abstract: In temporal XML databases, a retroactive update (i.e., modifying or deleting a past element) due to a detected error means that the database has included erroneous information during some period and, therefore, its consistency should be restored by correcting all errors and inconsistencies that have occurred in the past. Indeed, all processing that was carried out during the inconsistency period and used erroneous information has normally produced erroneous information. In this paper, we propose an approach which preserves data consistency in temporal XML databases. More precisely, after any retroactive update, the proposed approach allows (i) detecting and analyzing the periods of database inconsistency which result from that update, and (ii) repairing all inconsistencies and recovering from all side effects.

Paper Nr: 62
Title:

Explorative Analysis of Heterogeneous, Unstructured, and Uncertain Data - A Computer Science Perspective on Biodiversity Research

Authors:

C. Beckstein, S. Böcker, M. Bogdan, H. Bruehlheide, H. M. Bücker, J. Denzler, P. Dittrich, I. Grosse, A. Hinneburg, B. König-Ries, F. Löffler, M. Marz, M. Müller-Hannemann, M. Winter and W. Zimmermann

Abstract: We outline a blueprint for the development of new computer science approaches for the management and analysis of big data problems in biodiversity science. Such problems are characterized by a combination of different data sources, each of which has at least one of the typical characteristics of big data (volume, variety, velocity, or veracity). For these problems, we envision a solution that covers different aspects of integrating data sources and algorithms for their analysis on one of the following three layers: at the data layer, there are various data archives of heterogeneous, unstructured, and uncertain data; at the functional layer, the data are analyzed for each archive individually; at the meta-layer, multiple functional archives are combined for complex analyses.

Paper Nr: 70
Title:

Towards a Data Model of End-User Programming of Applications

Authors:

Marko Palviainen, Jarkko Kuusijärvi, Timo Tuomisto and Eila Ovaska

Abstract: End-user programming produces applications that can produce and/or consume data. An end-user can be a software enthusiast or a non-programmer. In this paper, end-users are understood to be non-programmers who are interested in creating applications for their personal needs and daily tasks. An interesting research question is how the input and output data of end-users’ applications should be represented. What kind of data model is needed for this data? And how can this input and output data be utilised? Firstly, the data model should be designed for end-users so that it is easy for non-programmers to comprehend and utilise. Secondly, the data model should be suitable for software professionals who make functionalities available for end-user programming. Thirdly, the data model should be designed so that it is possible to provide reusable processing components for the input/output data represented via this model. This paper discusses these three research questions and outlines a data model, called Tiles4Data, that is designed for the above requirements.

Paper Nr: 79
Title:

Development of an Open Data Portal for a University - Experience from the University of Alicante

Authors:

Jose Vicente Carcel, Andrés Fuster, Irene Garrigós, Francisco Maciá, Jose-Norberto Mazón, Llorenç Vaquer and Jose Jacobo Zubcoff

Abstract: The University of Alicante (UA), in Spain, is aligned with an Open Government strategy. Within this strategy, UA is carrying out the OpenData4U (Open Data for Universities) project, which aims to provide mechanisms for opening data from universities and to find out how open data contributes to open government in universities. This project encourages reusing open data, not only for the sake of transparency, but also as a basis for novel data-intensive business models that universities can foster. This paper describes one of the outputs of the project: an approach for opening data from universities that keeps data quality criteria in mind and is not tailored to any specific technological scenario. This approach allowed UA to launch its open data portal http://datos.ua.es, which is also reviewed in this paper. Finally, some research challenges related to university open data are enumerated.

Paper Nr: 80
Title:

Cloud Computing and Technological Lock-In - Literature Review

Authors:

Robert Viseur, Etienne Charlier and Michael Van de Borne

Abstract: The increasing use of cloud computing services results in an increased risk of lock-in, which is a source of anxiety for users facing the risk of having their data hosted online without the possibility of migrating it to their own IT resources or to competitors' platforms. In this preliminary research, we deal with the problem of the management of lock-in when using cloud computing services. We aim to answer six questions: (1) What is lock-in? (2) Is lock-in perceived as a major problem? (3) What are the causes of lock-in? (4) What is the impact of lock-in on users? (5) How can users avoid lock-in? and (6) Is the general public concerned with the problem of lock-in? Our paper is organized in three sections. The first section presents the methodology used for this study. The second section details the results; in particular, it identifies six mechanisms to reduce the risk of lock-in. The third section discusses the results and suggests further work.

Area 3 - Ontologies and the Semantic Web

Full Papers
Paper Nr: 23
Title:

A Visual Approach to the Empirical Analysis of Social Influence

Authors:

Chiara Francalanci and Ajaz Hussain

Abstract: This paper starts from the observation that social networks follow a power-law degree distribution of nodes, with a few hub nodes and a long tail of peripheral nodes. While there exist consolidated approaches supporting the identification and characterization of hub nodes, research on the analysis of the multi-layered distribution of peripheral nodes is limited. In social media, hub nodes represent social influencers. However, the literature provides evidence of the multi-layered structure of influence networks, emphasizing the distinction between influencers and influence. The latter seems to spread following multi-hop paths across nodes in peripheral network layers. This paper proposes a visual approach to the graphical representation and exploration of peripheral layers and clusters, exploiting the underlying concept of k-shell decomposition analysis. The core concept of our approach is to partition the node set of a graph into hub and peripheral nodes. Then, a power-law-based modified force-directed method is applied to clearly display local multi-layered neighbourhood clusters around hub nodes. Our approach is tested on a large sample of tweets from the tourism domain. Empirical results indicate that peripheral nodes have a greater probability of being retweeted and, thus, play a critical role in determining the influence of content. Our visualization technique helps us highlight peripheral nodes and, thus, seems an interesting tool for the visual analysis of social influence.
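
A minimal sketch of the partitioning step using networkx: k-shell (core-number) decomposition splits the node set into an innermost hub shell and surrounding peripheral layers. The toy graph and the hub criterion (top shell only) are assumptions, not the paper's exact procedure.

    import networkx as nx

    G = nx.karate_club_graph()              # small toy network
    shell = nx.core_number(G)               # k-shell index of every node

    k_max = max(shell.values())
    hubs = [n for n, k in shell.items() if k == k_max]
    layers = {k: [n for n, s in shell.items() if s == k] for k in range(1, k_max)}
    print("hub shell k =", k_max, "->", len(hubs), "hubs;",
          {k: len(v) for k, v in layers.items()}, "peripheral layer sizes")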

Paper Nr: 37
Title:

Mining User Behavior in a Social Bookmarking System - A Delicious Friend Recommender System

Authors:

Matteo Manca, Ludovico Boratto and Salvatore Carta

Abstract: The growth of the Web 2.0 has led to a widespread use of social media systems. In particular, social bookmarking systems are a form of social media system that allows users to tag bookmarks of interest and to share them. The increasing popularity of these systems leads to an increasing number of active users, which implies that each user interacts with too many other users ("social interaction overload"). In order to overcome this problem, we present a friend recommender system in the social bookmarking domain. Recommendations are produced by mining user behavior in a tagging system, analyzing the bookmarks tagged by a user and the frequency of each used tag. Experimental results highlight that, by analyzing both the tagging and bookmarking behavior of a user, our approach is able to mine preferences in a more accurate way with respect to state-of-the-art approaches that consider only tags.
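
A minimal sketch of tag-profile-based friend recommendation (not the paper's exact model): cosine similarity between users' tag-frequency profiles, with invented profiles.

    import math

    def cosine(u, v):
        # Cosine similarity between two sparse tag -> frequency profiles.
        common = set(u) & set(v)
        dot = sum(u[t] * v[t] for t in common)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    profiles = {                       # hypothetical tag-frequency profiles
        "alice": {"python": 5, "data": 3, "ml": 2},
        "bob":   {"python": 4, "ml": 1},
        "carol": {"cooking": 7, "travel": 2},
    }

    def recommend(user, k=1):
        others = [(cosine(profiles[user], p), name)
                  for name, p in profiles.items() if name != user]
        return [name for _, name in sorted(others, reverse=True)[:k]]

    print(recommend("alice"))          # -> ['bob']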

Short Papers
Paper Nr: 26
Title:

Aerospace Information System based on Semantic Technologies and Ontology Management - A Web Portal for Semantic Search and Document Categorization

Authors:

F. Gargiulo, G. Zazzaro, G. Romano, G. Gigante, A. Raggioli and R. Fusco

Abstract: This paper describes a semantic search tool based on our experience in using a new lexical domain ontology for aerospace, integrated with an open-source general-purpose ontology, to support aerospace engineers in the timely semantic retrieval of knowledge. The semantic search module is an integrated tool dedicated to the semantic search, extraction and classification of information and knowledge in the aerospace domain. The paper describes the implementation of a disambiguation algorithm based upon these ontologies, and a new graphical user interface for semantic searches is presented. Furthermore, next to the domain ontology, a taxonomy for classifying aerospace documents is also proposed. The document classification algorithm that leverages the deep integration between the proposed lexical domain ontology and the taxonomy is also described. Finally, some considerations about the usage of the semantic search module by domain experts, semantic experts or common users are reported.

Paper Nr: 78
Title:

A Hybrid Approach to Developing a Cyber Security Ontology

Authors:

James Geller, Soon Ae Chun and Arwa Wali

Abstract: The process of developing an ontology cannot be fully automated at the current state of the art. However, leaving the tedious, time-consuming and error-prone task of ontology development entirely to humans has problems of its own, including limited staff budgets and semantic disagreements between experts. Thus, a hybrid computer/expert approach is advocated. The research challenge is how to minimize and optimally organize the task of the expert(s) while maximally leveraging the power of the computer and of existing computer-readable documents. The purpose of this paper is two-fold. First, we present such a hybrid approach by describing a knowledge acquisition tool that we have developed. This tool makes use of an existing Bootstrap Ontology and proposes likely locations of concepts and semantic relationships, based on a text book, to a domain expert who can decide on them. The tool attempts to minimize the number of interactions. Secondly, we propose the notion of an augmented ontology specifically for pedagogical use. The application domain of this work is cyber-security education, but the ontology development methods are applicable to any educational topic.

Paper Nr: 32
Title:

A Holistic, Semantics-aware Approach to Spatial Data Infrastructures

Authors:

Cristiano Fugazza, Monica Pepe, Alessandro Oggioni, Fabio Pavesi and Paola Carrara

Abstract: We present a novel approach to the management of Spatial Data Infrastructures that leverages semantics-aware context information to model the distinct aspects involved in the management of geospatial data. RDF-based schemata are employed for encoding information about the user community, the terminologies in use in a specific research domain, gazetteer information representing the physical landscape underpinning the data and, last but not least, resource metadata. The data structures are then interconnected to enable seamless exploitation for metadata creation and resource discovery, which we demonstrate through a worked-out example of a SPARQL query on RDF graph data. The methodology is being applied by the National Research Council of Italy (CNR) to support the creation of a distributed infrastructure for marine data in the context of the RITMARE Flagship Project.
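
In the spirit of the worked-out example mentioned, a minimal rdflib sketch of a SPARQL query over RDF resource metadata; the namespace, triples and query below are hypothetical placeholders, not the project's actual schemata.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/sdi#")
    g = Graph()
    g.add((EX.dataset1, DCTERMS.subject, Literal("sea surface temperature")))
    g.add((EX.dataset1, DCTERMS.title, Literal("Adriatic SST 2013")))

    # Discover resources whose subject mentions "temperature".
    results = g.query("""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?d ?title WHERE {
            ?d dcterms:subject ?s ; dcterms:title ?title .
            FILTER regex(str(?s), "temperature", "i")
        }""")
    for row in results:
        print(row.d, row.title)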

Paper Nr: 48
Title:

Integrating Semi-structured Information using Semantic Technologies - An Evaluation of Tools and a Case Study on University Rankings Data

Authors:

Alejandra Casas-Bayona and Hector G. Ceballos

Abstract: Information integration is not a trivial activity. Information managers face problems such as heterogeneity (in data, schemas, syntax and platforms), distribution and duplicity. In this paper we: 1) analyze ontology-based methodologies that provide mediation frameworks for integrating and reconciling information from structured data sources, and 2) propose the use of available semantic technologies for replicating such functionality. Our aim is to provide an agile method for integrating and reconciling information from semi-structured data (spreadsheets) and to determine to what extent available semantic technologies minimize the need for ontological expertise in information integration. We present our findings and lessons learned from a case study on university rankings data.

Paper Nr: 68
Title:

Assessment of Online Bank GUI based on User Experience Evaluation - A Case Study

Authors:

Malgorzata Plechawska-Wojcik and Kamil Kolodziejczyk

Abstract: The paper presents a case study of User Centered Design (UCD) assessment. The case study is aimed at designing and testing the GUI of an online banking application. The procedure is multistep, based on UCD phases. The case study applied methods such as contextual analysis, heuristic evaluation, prototyping and an extended iterative user test. The paper contains a description of the applied methods and the results, including a survey summary and users' recommendations. The adjusted version of the GUI is also presented.

Paper Nr: 77
Title:

Identifying Semantic Classes within Student’s Data Using Clustering Technique

Authors:

Marek Jaszuk, Teresa Mroczek and Barbara Fryc

Abstract: The paper discusses the problem of discovering semantic classes, which are the basic building block of any semantic model. A method based on clustering techniques is proposed, which leads to discovering related data coming from survey questions and other sources of information. We explain how the questions can be interpreted as belonging to the same semantic class. Discovering semantic classes is assumed to be the foundation for the construction of a knowledge model (ontology) describing the objects that are the subjects of the survey. The ultimate goal of the research is to develop a methodology for the automatic building of semantic models from data. In our case the surveys refer to different socio-economic factors describing a student's situation. Thus the particular goal of the work is the construction of a knowledge model which would allow for predicting the possible outcomes of the educational process. The research is, however, more general, and its results could be used for analyzing collections of objects for which we have data coming from surveys, and possibly some additional sources of information.

Area 4 - Databases and Data Security

Short Papers
Paper Nr: 59
Title:

Security in Large-Scale Data Management and Distributed Data Acquisition

Authors:

Alexander Kramer, Wilfried Jakob, Heiko Maaß and Wolfgang Süß

Abstract: The internet is about to change from a pure network of computers to a network of more or less intelligent devices, the computer being just one of them. Examples of this change are the concepts of smart applications like smart homes, smart traffic control and guidance systems, smart power grids, or smart buildings. These systems require, among others, a high degree of robustness, reliability, scalability, safety, and security. In this paper, we concentrate on the data exchange and management aspect and introduce a security concept for scalable and easy-to-use Generic Data Services, called SeGDS. It covers application scenarios from embedded field devices for data acquisition to large-scale generic data applications and data management. The concept is based largely on proven standard enterprise hardware and standard solutions. As a first application, we report on the transport and management of mass data originating from high-resolution electrical data devices, which measure parameters of the electrical grid at a high sample rate. The presented solution is intended to be a contribution to concepts for a secure, flexible, but comparably inexpensive management of large amounts of data coming from modern smart power grids or other comparable smart applications.

Paper Nr: 66
Title:

XACML Policy Inconsistency Analysis and Resolution

Authors:

Teo Poh Kuang, Hamidah Ibrahim, Nur Izura Udzir and Fatimah Sidi

Abstract: Modality inconsistency is one of the security policy evaluation challenges, which arises because of the existence of both positive and negative authorizations for a given subject-object pair. An inconsistency analysis model is needed to discover inconsistency based on the inheritance relationship between concepts and to resolve it using predefined resolution rules. Previous studies handle modality inconsistency by providing the hierarchy of subjects and objects and simple condition evaluation, such as string equality matching. They do not identify modality inconsistency when a concept inherits conflicting decisions from its superclasses on the basis of the partially ordered structures obtained from the subject hierarchy, object hierarchy, and spatial hierarchy. An inconsistency analysis model is proposed in this paper to detect and resolve inconsistent policies during security policy evaluation. Our inconsistency analysis model analyzes all possible violations that might exist among security policies based on the role hierarchy, object hierarchy, and spatial hierarchy. In addition, comparison with previous works shows that our model is more effective in detecting inconsistency.
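
A minimal sketch of modality-inconsistency detection under inheritance: a subject inherits decisions from its superclasses, and a permit/deny pair for the same subject-object pair flags an inconsistency. The hierarchy and policies below are invented, and real XACML evaluation is far richer.

    SUPERCLASSES = {"intern": ["staff"], "staff": ["employee"], "employee": []}

    POLICIES = [  # (subject, object, effect)
        ("employee", "report", "Permit"),
        ("intern",   "report", "Deny"),
    ]

    def ancestors(subject):
        # The subject itself plus everything it inherits from.
        seen, stack = set(), [subject]
        while stack:
            s = stack.pop()
            if s not in seen:
                seen.add(s)
                stack.extend(SUPERCLASSES.get(s, []))
        return seen

    def inconsistencies(subject, obj):
        effects = {e for s, o, e in POLICIES
                   if o == obj and s in ancestors(subject)}
        return effects if len(effects) > 1 else set()

    print(inconsistencies("intern", "report"))   # -> {'Permit', 'Deny'}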

Paper Nr: 72
Title:

Method of a Structure-Independent Databases Design in Configurable Information Systems

Authors:

Yuri Rogozov, Alexander Sviridov and Alexander Belikov

Abstract: We propose a method of developing structure-independent databases using relational technology. Depending on the motivation of the developer, the method allows various structure-independent databases to be obtained. The requirements that a structure-independent database must satisfy have been formulated, and alternatives for realizing the method were considered.

Paper Nr: 21
Title:

FunctionGuard - A Query Engine for Expensive Scientific Functions in Relational Databases

Authors:

Anh Pham and Mohamed Eltabakh

Abstract: Expensive user-defined functions impose unique challenges on database management systems at query time. This is mostly due to the black-box nature of these functions, the inability to optimize their internals, and the potential inefficiency of common optimization heuristics, e.g., “selection push-down”. Moreover, the increasing diversity of modern scientific applications that depend on DBMSs and, at the same time, extensively use expensive UDFs mandates the design and development of efficient techniques to support these expensive functions. In this paper, we propose the “FunctionGuard” system, which leverages disk-based persistent caching in novel ways to achieve across-query optimizations for expensive UDFs. The unique features of FunctionGuard include: (1) dynamic extraction of dependencies between the UDFs and the data sources and identification of the potentially cacheable functions, (2) cache-aware query optimization through newly introduced query operators, (3) proactive cache refreshing that partially migrates the cost of the expensive calls from query time to idle and under-utilized times, and (4) integration with state-of-the-art techniques that generate efficient query plans in the presence of expensive functions. The system is implemented within the PostgreSQL DBMS, and the results show the effectiveness of the proposed algorithms and optimizations.
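
A minimal sketch of the underlying caching idea (not FunctionGuard itself): a disk-backed persistent cache so repeated calls to an expensive function across sessions reuse prior results instead of recomputing them. The cache path and the toy function are assumptions.

    import shelve, functools, hashlib, pickle

    def persistent_cache(path):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args):
                # Key on the function name and arguments.
                key = hashlib.sha256(pickle.dumps((fn.__name__, args))).hexdigest()
                with shelve.open(path) as db:
                    if key not in db:
                        db[key] = fn(*args)    # expensive call happens only once
                    return db[key]
            return wrapper
        return decorator

    @persistent_cache("/tmp/udf_cache")
    def expensive_udf(x):
        # Stand-in for a costly scientific computation.
        return sum(i * i for i in range(x))

    print(expensive_udf(10**6))   # computed
    print(expensive_udf(10**6))   # served from the on-disk cache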

Paper Nr: 27
Title:

Column-oriented Database Systems and XML Compression

Authors:

Tyler Corbin, Tomasz Müldner and Jan Krzysztof Miziołek

Abstract: The verbose nature of XML requires data compression, which makes it more difficult to implement querying efficiently. At the same time, the renewed industrial and academic interest in column-oriented DBMSs (column-stores) has resulted in improved efficiency of queries in these DBMSs. Nevertheless, there has been no research on the relationship between XML compression and column-stores. This paper describes an existing XML compressor and shows the inherent similarities between its compression technique and column-stores. The efficiency of compression is tested using specially designed benchmark data.
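
A minimal sketch of the similarity the paper discusses: decomposing an XML document into per-path "columns" of values and compressing each column separately, as a column-store would. The document and zlib back-end are assumptions, not the compressor the paper studies.

    import zlib
    import xml.etree.ElementTree as ET

    xml = "<db><row><name>ada</name><age>36</age></row>" \
          "<row><name>alan</name><age>41</age></row></db>"

    columns = {}
    def walk(elem, path):
        # Group text values by their root-to-element path.
        p = path + "/" + elem.tag
        if elem.text and elem.text.strip():
            columns.setdefault(p, []).append(elem.text.strip())
        for child in elem:
            walk(child, p)

    walk(ET.fromstring(xml), "")
    for path, values in columns.items():
        blob = "\n".join(values).encode("utf-8")
        print(path, len(blob), "->", len(zlib.compress(blob)), "bytes")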

Paper Nr: 40
Title:

Data-Centric Workflow Approach to Lifecycle Data Management

Authors:

Marko Junkkari and Antti Sirkka

Abstract: Data-centric workflows focus on how data is transferred between processes and how it is logically stored. In addition to traditional workflow analysis, they can be applied to monitoring, tracing, and analyzing data in processes and their mutual relationships. In many applications, e.g. manufacturing, the tracing of products through their entire lifecycle is becoming more and more important. In the present paper we define the traceability graph, which involves a framework for data that adapts to different levels of precision of tracing. Advanced analysis requires modeling of data in processes and methods for accumulating resources and emissions throughout the lifecycle of products. This, in turn, requires explicit modeling and presentation of how objects are divided and/or composed and how information is accumulated via these tasks. The traceability graph focuses on these issues. It is formally defined by set theory, which is an established and exact specification method.
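
A minimal sketch of lifecycle accumulation on a traceability graph: per-process emissions are accumulated along upstream paths to a product. The graph and figures are invented, and the paper's set-theoretic definitions (including splitting/composition of objects) are not reproduced here.

    EDGES = {"felling": ["sawmill"], "sawmill": ["furniture"], "furniture": []}
    EMISSIONS = {"felling": 5.0, "sawmill": 12.0, "furniture": 3.0}

    def accumulated(node, graph=EDGES):
        # Total emissions of `node` plus everything upstream of it.
        upstream = [u for u, vs in graph.items() if node in vs]
        return EMISSIONS[node] + sum(accumulated(u, graph) for u in upstream)

    print(accumulated("furniture"))   # -> 20.0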

Paper Nr: 73
Title:

The Formal Model of Structure-Independent Databases

Authors:

Sergey Kucherov, Alexander Sviridov and Svetlana A. Belousova

Abstract: In this paper we propose a formal model of structure-independent databases (SIDBs). This formal model allows describing and implementing not only different SIDBs based on relational technology, but also the tools for working with them. The model contains a set of basic relations and operations, which can be supplemented depending on the characteristics and requirements of the implementation. To provide flexibility, structure-independent databases presuppose the manipulation of both data and metadata structures.