Healthcare costs attributable to unplanned readmissions are staggeringly high, and readmissions negatively impact the health and wellness of patients. In the United States, hospital systems and care providers have strong financial motivations to reduce readmissions in accordance with several government guidelines. One of the critical steps toward reducing readmissions is to recognize the factors that lead to readmission and, correspondingly, identify at-risk patients based on these factors. The availability of large volumes of electronic health care records makes it possible to develop and deploy automated machine learning models that can predict unplanned readmissions and pinpoint the most important factors of readmission risk. While hospital readmission is an undesirable outcome for any patient, it is more so for medically frail patients. Here, we develop and compare four machine learning models (Random Forest, XGBoost, CatBoost, and Logistic Regression) for predicting 30-day unplanned readmission for patients deemed frail (Age ≥ 50). Variables indicating frailty, comorbidities, high-risk medication use, and demographic, hospital, and insurance characteristics were incorporated in the models to predict unplanned 30-day readmission. Our findings indicate that CatBoost outperforms the other three models (AUC 0.80) as well as prior work in this area. We find that constructs of frailty, certain categories of high-risk medications, and comorbidity are all strong predictors of readmission for elderly patients.
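As a concrete illustration of how such models can be compared on the AUC metric reported above, AUC can be computed directly from predicted risk scores with the rank-based (Mann-Whitney) formulation. This is a minimal sketch with invented labels and scores, not the study's actual data or pipeline:

```python
# Minimal sketch: comparing readmission classifiers by AUC using the
# rank-based (Mann-Whitney) formulation. Labels and scores are made up.

def auc(labels, scores):
    """AUC = P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted readmission-risk scores from two hypothetical models.
labels        = [1, 1, 1, 0, 0, 0, 0, 1]
catboost_like = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4, 0.1, 0.6]
logreg_like   = [0.7, 0.4, 0.6, 0.5, 0.2, 0.3, 0.1, 0.8]

print(auc(labels, catboost_like))  # perfectly separated toy scores -> 1.0
print(auc(labels, logreg_like))
```

On real data, a higher AUC indicates that the model more consistently ranks readmitted patients above non-readmitted ones, independent of any single decision threshold.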
With the growing number of botnet attacks, computer networks are constantly under threat from attacks that cripple cyber-infrastructure. Detecting these attacks in real time is a difficult and resource-intensive task. One pertinent method for detecting such attacks is signature-based detection using machine learning models. This paper explores the efficacy of these models at detecting botnet attacks, using data captured from large-scale network attacks. Our study provides a comprehensive overview of the performance characteristics of two machine learning models, Random Forest and Multi-Layer Perceptron (deep learning), in such attack scenarios. Using Big Data analytics, the study explores the advantages, limitations, model/feature parameters, and overall performance of using machine learning to detect botnet attacks/communication. With insights gained from the analysis, this work recommends algorithms/models for specific botnet attacks instead of a generalized model.
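Models such as Random Forest and Multi-Layer Perceptron typically consume per-flow feature vectors derived from captured traffic. The sketch below illustrates one common way to aggregate packet records into such features; the field names and packet values are illustrative assumptions, not the feature set used in the paper:

```python
# Hypothetical sketch: turning raw packet records into per-flow feature
# vectors of the kind an ML-based botnet detector consumes.
# Field names (src, dst, proto, ts, size) are illustrative only.

from collections import defaultdict

def flow_features(packets):
    flows = defaultdict(list)
    for p in packets:
        # Group packets into flows keyed by (source, destination, protocol).
        flows[(p["src"], p["dst"], p["proto"])].append(p)
    feats = {}
    for key, pkts in flows.items():
        sizes = [p["size"] for p in pkts]
        times = [p["ts"] for p in pkts]
        feats[key] = {
            "n_packets": len(pkts),
            "total_bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
            "duration": max(times) - min(times),
        }
    return feats

packets = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "tcp", "ts": 0.0, "size": 60},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "tcp", "ts": 0.5, "size": 1500},
]
print(flow_features(packets))
```

In practice, such per-flow vectors (often with many more statistics) form the rows of the training matrix fed to the classifier.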
Information propagation in online social networks has drawn a lot of attention from researchers in different fields. While prior works have studied the impact and speed of information propagation in various networks, we focus on the potential interactions of two hypothetically opposite pieces of information, negative and positive. We experimentally measure the amount of time available for the positive information to be distributed with wide enough impact after the negative information appears, under different selection strategies for positive source nodes. Our results enable the selection of a set of users, based on a limited operating budget, to start the spread of positive information as a measure to counteract the spread of negative information. Among different methods, we identify that both eigenvector and betweenness centrality are effective selection metrics. Furthermore, we quantitatively demonstrate that choosing a larger set of nodes for the spread of positive information allows for a wider window of time to respond in order to limit the propagation of negative information to a certain threshold.
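One of the selection strategies named above, eigenvector centrality, can be computed with plain power iteration on the adjacency structure. The following is a minimal sketch on an invented toy graph, not the networks or budgets studied in the paper:

```python
# Hypothetical sketch: choosing positive-information seed nodes by
# eigenvector centrality, computed via power iteration on a toy graph.

def eigenvector_centrality(adj, iters=100):
    nodes = sorted(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Each node's new score is the sum of its neighbors' scores,
        # renormalized to keep values bounded.
        nxt = {v: sum(x[u] for u in adj[v]) for v in nodes}
        norm = max(nxt.values()) or 1.0
        x = {v: s / norm for v, s in nxt.items()}
    return x

def top_k_seeds(adj, k):
    c = eigenvector_centrality(adj)
    return sorted(c, key=c.get, reverse=True)[:k]

# Toy network: node "b" bridges two triangles, so it should rank highest.
adj = {
    "a": ["b", "c"], "c": ["a", "b"],
    "b": ["a", "c", "d", "e"],
    "d": ["b", "e"], "e": ["b", "d"],
}
print(top_k_seeds(adj, 2))
```

With a budget of k seed users, the top-k nodes by such a centrality score would be chosen to initiate the positive spread.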
Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning for ontology-based curation is a relatively new area and prior work has focused on a limited set of models.
Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRUs) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model’s performance. We also compare our model to seven models from prior work. We use four metrics: Precision, Recall, F1 score, and a semantic similarity metric (Jaccard similarity) to compare our model’s output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, 83% F1, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models as compared to word-only inputs.
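The set-based Jaccard similarity used as the fourth metric can be stated in a few lines. This sketch compares a predicted annotation set against a gold set; the term IDs are illustrative examples, not taken from CRAFT:

```python
# Minimal sketch of set-based Jaccard similarity between predicted and
# Gold Standard ontology annotations. Term IDs below are illustrative.

def jaccard(predicted, gold):
    """|intersection| / |union|; two empty sets are treated as identical."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

gold      = {"GO:0008150", "GO:0003674", "GO:0005575"}
predicted = {"GO:0008150", "GO:0003674", "GO:0009987"}
print(jaccard(predicted, gold))  # 2 shared / 4 total -> 0.5
```

Unlike exact-match Precision/Recall, this metric gives a single overlap score per document, which is why it complements the other three.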
These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.
Predicting human mobility within cities is an important task in urban and transportation planning. With the vast amount of digital traces available through social media platforms, we investigate the potential application of such data in predicting commuter trip distribution at a small spatial scale. We develop back-propagation (BP) neural network models and gravity models using both traditional and Twitter data in New York City to explore their performance and compare the results. Our results suggest the potential of using social media data in transportation modeling to improve prediction accuracy. Adding Twitter data to both models improved performance, with a slight decrease in root mean square error (RMSE) and an increase in the R-squared (R2) value. The findings indicate that the traditional gravity models outperform neural networks in terms of having lower RMSE. However, the R2 results show higher values for neural networks, suggesting a better fit between the real and predicted outputs. Given the complex nature of transportation networks and the various reasons for the limited performance of neural networks with the data, we conclude that more research is needed to explore the performance of such models with additional inputs.
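The gravity model family mentioned above predicts trips between zones as proportional to origin and destination masses and inversely proportional to a power of distance. This is a minimal sketch of the unconstrained form T_ij = G * P_i * A_j / d_ij^beta; the zone names, masses, distances, and parameter values are invented for illustration:

```python
# Hypothetical sketch of an unconstrained gravity model for commuter trip
# distribution: T_ij = G * P_i * A_j / d_ij**beta. All values are made up.

def gravity_trips(P, A, d, G=1.0, beta=2.0):
    """Predict trips between every origin i and destination j (i != j)."""
    return {(i, j): G * P[i] * A[j] / d[(i, j)] ** beta
            for i in P for j in A if i != j}

P = {"manhattan": 1.6, "brooklyn": 2.6}        # origin populations (millions)
A = {"manhattan": 2.0, "brooklyn": 1.0}        # destination attractiveness
d = {("manhattan", "brooklyn"): 10.0,
     ("brooklyn", "manhattan"): 10.0}          # distance (km)

trips = gravity_trips(P, A, d)
print(trips[("brooklyn", "manhattan")])  # 1.0 * 2.6 * 2.0 / 100 = 0.052
```

In a calibrated study, G and beta would be fitted to observed trip data, and social media activity could augment the mass terms P and A.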
Student loans constitute a significant portion of the federal budget, as well as the largest financial burden, in terms of debt, for graduates. This paper explores data-driven approaches to understanding the repayment of such loans. Using statistical and machine learning models on the College Scorecard Data, this research focuses on extracting and identifying key factors affecting the repayment of a student loan. These factors can be used to develop models that predict repayment rate, detect irregularities/non-repayment, and help understand the intricacies of student loans.
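A simple stand-in for the statistical screening step is ranking candidate factors by their absolute Pearson correlation with the repayment rate. The feature names and numbers below are invented for illustration and are not Scorecard data:

```python
# Hypothetical sketch: ranking candidate factors by |Pearson correlation|
# with repayment rate. Feature names and values are invented.

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

repayment = [0.9, 0.7, 0.5, 0.3]
features = {
    "median_earnings": [60, 50, 40, 30],   # perfectly aligned -> |r| = 1
    "admission_rate":  [0.2, 0.9, 0.1, 0.8],
}
ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], repayment)),
                reverse=True)
print(ranked)
```

Factors that survive such a screen would then feed the downstream predictive models.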
Scientists exploring a new area of research are interested in knowing the “hot” topics in that area in order to make informed choices. With the exponential growth in scientific literature, identifying such trends manually is not easy. Topic modeling has emerged as an effective approach to analyzing large volumes of text. While this approach has been applied to literature in other scientific areas, there has been no formal analysis of bioinformatics literature.
Here, we conduct keyword- and topic model-based analyses of bioinformatics literature from 1998 to 2016. We identify the top keywords and topics per year and explore temporal popularity trends of those keywords/areas. We also conducted network analysis to identify clusters of sub-areas/topics in bioinformatics. We found that “big-data”, “next generation sequencing”, and “cancer” all experienced an exponential increase in popularity over the years. On the other hand, interest in drug discovery has plateaued since the early 2000s.
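The keyword-trend step can be reduced to counting keyword occurrences in paper titles per year. This minimal sketch uses invented titles, not the actual bioinformatics corpus:

```python
# Minimal sketch of the keyword-trend step: count how often a keyword
# appears in paper titles per year. Titles below are invented examples.

from collections import Counter

def keyword_trend(papers, keyword):
    """papers: iterable of (year, title); returns Counter of year -> hits."""
    return Counter(year for year, title in papers
                   if keyword in title.lower())

papers = [
    (2012, "Big-data pipelines for genome assembly"),
    (2015, "Scaling analysis pipelines in the cloud"),
    (2015, "Cancer genomics with next generation sequencing"),
    (2016, "Big-data approaches to cancer subtyping"),
]
print(keyword_trend(papers, "big-data"))
```

Plotting such per-year counts over the full corpus is what reveals rising, plateauing, or declining interest in a sub-area.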
The assurances provided by an assurance protocol for any information system (IS) extend only as far as the integrity of the assurance protocol itself. The integrity of the assurance protocol is negatively influenced by a) the complexity of the assurance protocol, and b) the complexity of the platform on which the assurance protocol is executed. This paper outlines a holistic Mirror Network (MN) framework for assuring information systems that seeks to minimize both complexities. The MN framework is illustrated using a generic cloud file storage system as an example IS.
Several applications fall under the broad umbrella of data dissemination systems (DDS), where providers and consumers of information rely on untrusted, or even unknown, middle-men to disseminate and acquire data. This paper proposes a security architecture for a generic DDS by identifying a minimal trusted computing base (TCB) for middle-men and leveraging the TCB to provide useful assurances regarding the operation of the DDS. A precise characterization of the TCB is provided as a set of simple functions that can be executed even inside a severely resource-limited trustworthy boundary. A core feature of the proposed approach is the ability of even resource-limited modules to maintain an index ordered merkle tree (IOMT).
We consider the security requirements for a broad class of content distribution systems where the content distribution infrastructure is required to strictly abide by access control policies prescribed by owners of content. We propose a security solution that identifies a minimal trusted computing base (TCB) for a content distribution infrastructure, and leverages the TCB to provide all desired assurances regarding the operation of the infrastructure. It is assumed that the contents and access control policies associated with contents are dynamic.
In applications such as remote file storage systems, an essential component of cloud computing systems, users are required to rely on untrustworthy servers. We outline an approach to secure such file storage systems by relying only on a resource-limited trusted module available at the server, and more specifically, without the need to trust any component of the server or its operator(s). The proposed approach to realize a trusted file storage system (TFSS) addresses some shortcomings of a prior effort (Sarmenta et al., 2006) which employs a merkle hash tree to guarantee freshness. We argue that these shortcomings stem from the inability to verify non-existence. The TFSS described in this paper relies on index ordered merkle trees (IOMT) to gain the ability to verify non-existence.
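To illustrate the IOMT idea at a high level: each leaf stores (index, next_index, value), so a single enclosing leaf can prove that an index is absent from the tree. The sketch below builds a toy Merkle root over such leaves and checks the enclosure condition; the hash layout and wrap-around convention are simplifications for illustration, not the paper's actual construction:

```python
# Toy sketch of the IOMT concept. A real IOMT verifies authentication paths
# inside a resource-limited trusted module; here we only show leaf hashing,
# root computation, and the non-existence (enclosure) check.

from hashlib import sha256

def h(*parts):
    return sha256("|".join(str(p) for p in parts).encode()).hexdigest()

def merkle_root(leaves):
    level = [h(i, nxt, v) for i, nxt, v in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(a, b) for a, b in zip(level[::2], level[1::2])]
    return level[0]

def proves_absence(leaf, index):
    """A leaf (i, next, v) proves absence of `index` if it encloses it."""
    i, nxt, _ = leaf
    return (i < index < nxt) or (nxt < i and (index > i or index < nxt))

# Leaves sorted by index; next pointers wrap around (circular order).
leaves = [(2, 5, "a"), (5, 9, "b"), (9, 2, "c")]
root = merkle_root(leaves)
print(proves_absence((5, 9, "b"), 7))   # 7 falls between 5 and 9 -> True
print(proves_absence((9, 2, "c"), 11))  # wrap-around leaf covers 11 -> True
```

A plain Merkle tree can prove that a presented leaf is in the tree, but not that no leaf with a given index exists; the next-pointer in each IOMT leaf is what closes that gap.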