Abstracts Track 2024


Area 1 - Business Analytics

Nr: 129
Title:

Intention-Based Deep Learning Approach for Detecting Online Fake News

Authors:

Kyuhan Lee and Sudha Ram

Abstract: The dissemination of fake news has a substantial detrimental impact on the various operations of businesses within multiple sectors, including healthcare, retail, and finance. One effective approach to combat fake news is automated content moderation using computational approaches. However, current approaches have overlooked the heterogeneous deceptive intention behind posting fake news. In this study, following the design science approach, we propose a novel deep-learning framework for detecting online fake news by incorporating theories of deceptive intention. Specifically, we first develop a transfer-learning model that identifies deceptive intention reflected in texts using semantic and deceptive cues derived from multiple deep learning and natural language processing techniques. We then apply this model to distinguish two subclasses of fake news: deceptive and non-deceptive fake news. These two subclasses of fake news, along with the observed class of non-fake news (i.e., true news), are used to train deep bidirectional transformers whose goal is to determine news veracity. When benchmarked against cutting-edge deep learning methods, our design presented significant performance improvement in detecting fake news. Moreover, our additional, controlled experiment revealed that, when assisting humans in the loop of the fake news detection process, the proposed design demonstrated synergistic effects such as improved task performance and enhanced decision-making confidence.

Nr: 128
Title:

Predictive Analytics for Alumni Donor and Volunteer Engagement in Higher Education

Authors:

Muza Furin-Carraux and Jim McNulty

Abstract: In higher education institutions, the ability to identify potential donors and volunteers among alumni plays an important role in resource allocation and strategic decision-making. Existing approaches in universities often make little use of predictive analytics, relying mostly on manual selection of alumni prospects. The goal of this research is to develop a model that accurately predicts donor and volunteer behavior. To manage this multi-faceted project, we followed the CRISP-DM methodology. We began with a thorough understanding of the organizational problem. Drawing on conversations with university teams, we identified the main variables and data sources necessary for model development. Logistic regression was chosen as the statistical model, given its suitability for binary outcome prediction. We used data from a large alumni pool to build the model, with Python as the primary implementation language. The most significant milestone in the project was the integration of the model into Salesforce. Custom objects within the CRM were built and connected to hold the predictive scores, allowing the scores to fit into existing workflows. Donor and volunteer scores were calculated from a range of variables. Evaluation of model performance revealed high accuracy rates in alumni prospect identification. Event attendance, marital status, certain degrees, and online engagement emerged as the stronger predictors of alumni donor behavior. Future research may be needed to refine the model parameters and explore additional predictors to improve accuracy.

Nr: 132
Title:

ER-Diagram Conceptual Modeling Framework for Addressing KPI Apportioning in Dimensional Models

Authors:

Jelena Hadina and Boris Jukic

Abstract: The Entity-Relationship (ER) model was developed to address the logical view of data (Chen, 1976), and to this day ER modeling is used as a basic tool in database design due to its simplicity, intuitive appeal, and ability to capture useful semantics. As the model is used across different domains, researchers are encouraged to add and develop constructs for the ER model in a careful manner (Badia, 2004). According to Moody and Kortink, entities in the ER model can be classified into three categories with respect to their mapping into a dimensional model: transaction, component, and classification entities. Transaction entities record details of business events; component entities are related to a transaction entity by a one-to-many relationship; and classification entities are related to a component entity by a chain of one-to-many relationships in the standard dimensional modeling approach (Moody & Kortink, 2003). Conversely, entities labeled as component and classification entities contain fields that are mapped into the dimensions. In analytical systems, we often encounter many-to-many relationships between component/classification and transaction entities that exhibit different levels of cardinality but are nevertheless currently modeled in the same fashion. Our proposed data modeling notation supports better data warehouse (DW) design, with a specific impact on the quality of the business intelligence extracted from a DW designed in this fashion. The current standard cardinality notations either do not specify the exact size of the range of instances of one entity as it relates to the instances of another, or specify the minimum and maximum of the range as specific integers. However, these notations do not state the probability distribution over these ranges, and thus miss the opportunity to investigate one of the key factors that determine how the resulting dimensional model of the data warehouse is designed and implemented.
Our notation acknowledges the existence of probability distributions and suggests different dimensional modeling strategies based on the category of distribution represented in a particular relationship. We propose a new notation that identifies the different cases leading to different mapping strategies when an ER model is mapped into a dimensional model. To illustrate our approach, we provide examples of scenarios to which the new notation could be applied. The two broad categories of cases we present lead to different approaches to the apportioning of the KPI metrics of interest. We also propose additional classifications of KPI metrics, characterized by the method of their apportioning. Our discussion shows how our suggested additions to the standard ER diagram can improve the efficiency and accuracy of the modeling stage, resulting in better database design solutions that are more suited for effective analytics.

References:

Badia, A. (2004). Entity-Relationship modeling revisited. SIGMOD Record, 33, 77-82. 10.1145/974121.974135.
Chen, P. (1976). The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 1, 9-36.
Moody, D. L., & Kortink, M. A. R. (2003). From ER models to dimensional models: bridging the gap between OLTP and OLAP design, Part I. Business Intelligence Journal, 8, 7-24.
Roy-Hubara, N., & Sturm, A. (2020). Design methods for the new database era: a systematic literature review. Software & Systems Modeling, 19.

Area 2 - Data Science

Nr: 131
Title:

Mining High-Cardinality Covariates with GPUs Using NVIDIA RAPIDS

Authors:

Robert W. Fox

Abstract: I hope to develop an interpretable (“white-box”) method for understanding the relationships between high-cardinality categorical variables in any given data set. In discovering knowledge from data, categorical features with many levels pose a distinct challenge as they increase the dimensionality of our data sets. Finding interactions between categorical features with high cardinality can quickly become intractable as the combinatorial possibilities grow. I developed a novel method to calculate Jensen-Shannon distance on GPUs using NVIDIA RAPIDS. Using modern GPU hardware, we can efficiently compare categorical distributions of feature A for every level of categorical features B, C, D, etc. Moreover, I developed a “JSD Score” that factors in a minimum frequency count of the level being tested (giving weight to larger categories). Using NVIDIA RAPIDS, my code to calculate Jensen-Shannon distances (JSD) over 11 categorical variables (with over 1300 combined categorical levels) ran in 33.7 seconds. Each individual comparison has a “measure feature” and a “split feature”. In one example, “Loan Grade” is the measure feature, and “Loan Purpose” is the split feature. The distribution of “Loan Grade” is compared for every distinct pair of “Loan Purpose” values. In this manner we can detect categorical interactions. All combinations across all 11 categorical variables are effectively brute-force calculated. The resulting data frame included a row for each of the 3,682,690 comparisons with both the JSD value and my “JSD Score” value.

JSD Score

Some categorical splits have low row counts (or may even result in unique, single observations). To ascertain both distributional differences and the general magnitude of differences, I invented a “JSD Score” where I incorporate the minimum row count between the split features as a simple multiplier. In the example above, we multiply the JSD of 0.185 by 473 loans from the small business split (the smaller of the two splits). By using the minimum value, we emphasize splits that collectively cover more of the overall population. These “JSD Scores” are relative to the total number of observations. This metric allows us to zoom in on categorical interactions that have larger support.

The GPU Algorithm

My method for calculating JSD Scores en masse takes advantage of the fact that for a single categorical feature (e.g. “Loan Grade”) there is one fixed matrix dimension of all possible values (7 values, “A” through “G”). Crosstab frequencies for all categorical levels (across all categorical features!) can be appended horizontally. The analysis kernel uses column offsets as a parameter input for calculation on GPU cores. Frequency counts for each measure feature are loaded into GPU memory once, and all calculations for that measure feature are completed in that trip (11 trips in our test data). Using GPU resources to perform data mining on categorical variables represents an advancement in tackling both the time and energy required to perform analysis of highly dimensional data. My novel approach is brute-force and verbose, but it preserves all information value in knowledge discovery. This is not necessarily true of other techniques commonly used to tackle highly dimensional data (e.g. PCA, correspondence analysis, target encoding, etc.). I believe this work is a solid start towards developing efficient “white box” knowledge discovery from data sets that contain high-cardinality categorical features.
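
As a rough illustration of the quantities described in this abstract, the sketch below computes a Jensen-Shannon distance between two count vectors and multiplies it by the smaller row count to obtain a "JSD Score". It is a CPU-only NumPy sketch, not the author's RAPIDS GPU kernel. The function names `js_distance` and `jsd_score`, the base-2 logarithm, and the grade counts are assumptions made for illustration.

```python
import numpy as np

def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two frequency-count vectors.

    Assumption: base-2 logarithm, so the distance lies in [0, 1].
    """
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p = p / p.sum()          # turn counts into probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability cells of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

def jsd_score(p_counts, q_counts):
    """JSD multiplied by the smaller of the two split row counts."""
    support = min(np.sum(p_counts), np.sum(q_counts))
    return js_distance(p_counts, q_counts) * float(support)

# Hypothetical "Loan Grade" counts (grades A..G) for two "Loan Purpose" splits.
purpose_x = [300, 420, 380, 210, 90, 40, 10]
purpose_y = [20, 60, 110, 130, 90, 45, 18]

print("JSD:", js_distance(purpose_x, purpose_y))
print("JSD Score:", jsd_score(purpose_x, purpose_y))
```

Multiplying by the minimum row count means that a large distance between two tiny splits ranks below a moderate distance between two well-populated splits, which matches the stated aim of favoring interactions with larger support.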

Nr: 134
Title:

Understanding the Relative Importance Measures Based on Orthonormality Transformation

Authors:

Tien En Chang and Argon Chen

Abstract: Assessing the relative importance of explanatory variables is critical to help build more interpretable, efficient, and reliable supervised learning models. However, beyond simple correlations and regression coefficients, researchers and practitioners generally agree that the General Dominance Index (GD) is the most plausible measure of relative importance, despite its computationally intractable nature. A class of relative importance measures based on orthonormality transformation, which we refer to as OTMs, has been found to approximate the GD efficiently. In particular, Johnson's Relative Weight (RW) has been deemed the most successful OTM in the literature over the past 20 years. Nevertheless, the theoretical foundation of the OTMs remains unclear. To further understand the OTMs, we provide a generalized framework that includes orthogonalization and weight reallocation techniques. Through comprehensive experiments, we conclude that Johnson's minimal transformation to orthonormality is the best choice compared with other popular orthogonalization methods. We also find that the weight reallocation method used by RW could have the “dilution problem” when a strong principal component is present. Conversely, the weight reallocation method based on regression coefficients is found to have some advantages but also suffers from the “fairness issue” when there is a notably large Variance Inflation Factor (VIF). These problems can lead to poor performance in terms of approximating the GD. With the comprehensive Monte Carlo simulation cases, we attempt to offer guidelines for practitioners to choose the right OTM approach given the correlation structure among the explanatory variables.
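
For readers unfamiliar with Johnson's Relative Weight, the following is a minimal sketch of the commonly described computation: the predictors are replaced by their closest orthonormal counterparts via the symmetric square root of the predictor correlation matrix, and the squared coefficients are then reallocated to the original variables. The function name `relative_weights` and the simulated data are assumptions for illustration; this is not the generalized framework proposed in the abstract.

```python
import numpy as np

def relative_weights(X, y):
    """Johnson-style relative weights; they sum to the model R-squared."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Standardize so that cross-products become correlations.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()

    Rxx = Xs.T @ Xs / n      # predictor correlation matrix
    rxy = Xs.T @ ys / n      # predictor-outcome correlations

    # Symmetric square root of Rxx: correlations between the original
    # predictors and their orthonormal counterparts.
    evals, evecs = np.linalg.eigh(Rxx)
    lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

    # Coefficients of y regressed on the orthonormal variables.
    beta = np.linalg.solve(lam, rxy)

    # Reallocate each squared coefficient back to the original predictors.
    return (lam ** 2) @ (beta ** 2)

# Simulated data with two correlated predictors (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
X_demo[:, 1] += 0.8 * X_demo[:, 0]
y_demo = X_demo @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=500)

weights = relative_weights(X_demo, y_demo)
print("relative weights:", weights)
print("sum of weights:", weights.sum())
```

Because the correlation matrix has a unit diagonal, the weights are non-negative and add up to the R-squared of the full regression, which is what makes them readable as shares of explained variance.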

Area 3 - Data Management and Quality

Nr: 46
Title:

Spatio-Temporal High-Dimensional Matrix Autoregressive Models via Tensor Decomposition

Authors:

Rukayya Ibrahim, S. Yaser Samadi and Tharindu DeAlwis

Abstract: Massive, highly interactive datasets such as time-dependent big data, and especially spatio-temporal data, are constantly arising in many modern applications. These datasets can be found in fields such as econometrics, geospatial technologies, and medicine. The high dimensionality of these datasets requires methods that can efficiently handle their high-order complexity. Tensor decomposition techniques are commonly employed to analyze, predict, and forecast these datasets, as they offer latent structure and information extraction, data imputation, and complexity control. In this talk, we propose a method for modeling and analyzing matrix-valued spatio-temporal data using matrix autoregression expressed as a tensor regression model. This approach allows for the incorporation of the matrix structure of both the response and predictors while achieving dimension reduction through a low-rank tensor structure. We compare our model to other existing models in the literature used for such data and find that our model is able to handle high-dimensional data more efficiently. Additionally, we propose two estimation methods for the transition tensor, one for low-dimensional and one for high-dimensional settings. Furthermore, we derive asymptotic and non-asymptotic properties of the proposed estimators. Simulation studies and real data analysis are performed to illustrate the advantages of our model over existing methods.
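
The abstract does not spell out the form of the model, so the sketch below is only one possible reading. It assumes a first-order bilinear matrix autoregression, X_t = A X_{t-1} B^T + E_t, and shows that its noise-free step equals contracting X_{t-1} with a fourth-order transition tensor built from A and B. The dimensions, the names `A`, `B`, and `T`, and the bilinear structure itself are assumptions for illustration, not the authors' stated decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                        # rows and columns of each matrix observation

A = rng.normal(size=(m, m))        # assumed row-wise transition matrix
B = rng.normal(size=(n, n))        # assumed column-wise transition matrix
X_prev = rng.normal(size=(m, n))   # matrix-valued observation at time t-1

# Fourth-order transition tensor with entries T[i, j, k, l] = A[i, k] * B[j, l].
T = np.einsum("ik,jl->ijkl", A, B)

# Tensor-regression form: contract the last two modes of T with X_prev.
X_next_tensor = np.einsum("ijkl,kl->ij", T, X_prev)

# Bilinear matrix-autoregression form (noise term omitted).
X_next_bilinear = A @ X_prev @ B.T

print("forms agree:", np.allclose(X_next_tensor, X_next_bilinear))

# Parameter counts: unrestricted tensor versus the structured form.
full_params = T.size                    # (m * n) ** 2
structured_params = A.size + B.size     # m ** 2 + n ** 2
print(full_params, "vs", structured_params)
```

The gap between the two parameter counts grows quickly with the matrix dimensions, which is the kind of dimension reduction that a structured or low-rank transition tensor is meant to deliver.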