The Flint Water Crisis is a profound humanitarian disaster for the citizens of Flint, Michigan. It is also an event that has captured the attention of people throughout the United States and indeed the world through extensive media coverage. It is unthinkable to many that a developed country could have a city whose water supply is poisoning its citizens and a government that failed to respond to the contamination in an appropriate, timely manner. Given the increasing use of internet-based communication, this technological crisis generated a high volume of human communication in digital news and social media. Humans are evidently using social media as a new form of adaptation for dealing with extreme events and their challenges, such as the Flint water crisis (Bernabé-Moreno et al. 2014, Hossmann 2011, Saleem et al. 2014). To explore the possibilities and pitfalls of online communication during critical events, this chapter discusses the collective ability of social media users to communicate, reach out to others for collective action, and organize in response to the negative consequences of the Flint disaster through the lens of Twitter. Rather than focusing on the technical aspects of data collection and analysis, our goal is to reach a wide variety of audiences with a key message: social media has the capacity to transform the way the public and private sectors and civil society manage critical events in general and technological disasters in particular. The chapter starts by describing the event as observed on Twitter, followed by inferences from the data that build on prior theoretical and empirical work on social media and disasters.
The purpose of this retrospective study is to determine whether frailty is predictive of 30-day readmission in adults 50 years of age and older who were admitted with a psychiatric diagnosis to a behavioral health hospital between 2013 and 2017. A total of 1,063 patients were included, and there were 114 readmissions. A 26-item frailty risk score (FRS-26-ICD10) was constructed from electronic health record (EHR) data. Cox regression modeling of demographic characteristics, emergent admission, comorbidity, and the FRS-26-ICD showed modest prediction of time to readmission (iAUC = 0.671). The FRS-26-ICD was a significant predictor of readmission alone and in models with demographics and emergent admission; however, only the comorbidity index (ECI) remained significantly related to the hazard of readmission after adjusting for other factors (adj. HR = 1.26, 95% CI [1.17, 1.37], p < 0.001), while the FRS-26-ICD became non-significant. Frailty is a relevant syndrome in behavioral health that should be further studied for risk prediction and incorporated into care planning to prevent readmissions.
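The following is a minimal sketch of the kind of Cox proportional hazards modeling described above, using the lifelines library. The file name and column names (days_to_readmission, readmitted, frs_26_icd, eci, emergent_admission) are hypothetical placeholders, not the study's actual EHR fields.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical EHR extract with one row per index admission
df = pd.read_csv("behavioral_health_cohort.csv")
cols = ["days_to_readmission", "readmitted", "age", "sex_male",
        "emergent_admission", "eci", "frs_26_icd"]

cph = CoxPHFitter()
cph.fit(df[cols], duration_col="days_to_readmission", event_col="readmitted")
cph.print_summary()  # hazard ratios with 95% CIs for each covariate
```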
The purpose of the current study was to investigate the predictive properties of five definitions of a frailty risk score (FRS) and three comorbidity indices, using data from electronic health records (EHRs) of hospitalized adults aged ≥50 years, for 3-day, 7-day, and 30-day readmission, and to identify an optimal combined FRS and comorbidity model. Retrospective analysis of the EHR dataset was performed, and multivariable logistic regression and area under the curve (AUC) were used to examine readmission as a function of frailty and comorbidity. The sample (N = 55,778) was mostly female (53%), non-Hispanic White (73%), married (53%), and on Medicare (55%). Mean FRSs ranged from 1.3 (SD = 1.5) to 4.3 (SD = 2.1). FRS and comorbidity were independently associated with readmission. Predictive accuracy for FRS and comorbidity combinations ranged from an AUC of 0.75 to 0.77 for 30-day readmission to 0.84 to 0.85 for 3-day readmission. FRS and comorbidity combinations performed similarly well, whereas comorbidity was always independently associated with readmission. FRS measures were more strongly associated with 30-day readmission than with 7-day and 3-day readmission.
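A minimal sketch of the logistic-regression and AUC workflow used for the three readmission windows is shown below. The file name, feature columns, and outcome columns are illustrative, not the study's actual EHR fields.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

cohort = pd.read_csv("ehr_cohort.csv")  # assumed EHR extract
features = ["frs", "comorbidity_index", "age", "female", "medicare"]

# One model per readmission window, evaluated by AUC on a held-out split
for outcome in ["readmit_3d", "readmit_7d", "readmit_30d"]:
    X, y = cohort[features], cohort[outcome]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(outcome, round(auc, 3))
```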
Streaming social media provides a real-time glimpse of extreme weather impacts. However, the volume of streaming data makes mining information a challenge for emergency managers, policy makers, and disciplinary scientists. Here we explore the effectiveness of data-driven approaches to mine and filter information from streaming social media data from Hurricane Irma’s landfall in Florida, USA. We use 54,383 Twitter messages (out of 784,000 geolocated messages) from 16,598 users from Sept. 10–12, 2017 to develop four independent models to filter data for relevance: 1) a geospatial model based on forcing conditions at the place and time of each tweet, 2) an image classification model for tweets that include images, 3) a user model to predict the reliability of the tweeter, and 4) a text model to determine whether the text is related to Hurricane Irma. All four models are independently tested and can be combined to quickly filter and visualize tweets based on user-defined thresholds for each submodel. We envision that this type of filtering and visualization routine can be useful as a base model for data capture from noisy sources such as Twitter. The data can then be used by policy makers, environmental managers, emergency managers, and domain scientists interested in finding tweets with specific attributes to use during different stages of the disaster (e.g., preparedness, response, and recovery), or for detailed research.
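The sketch below illustrates the threshold-based combination step: each tweet receives four relevance scores (geospatial, image, user, text) from the independent submodels, and a tweet is kept only if every score meets a user-defined threshold. Field names, threshold values, and the scored_tweets input are illustrative assumptions, not the paper's implementation.

```python
def filter_tweets(tweets, thresholds):
    """tweets: iterable of dicts with 'geo', 'image', 'user', 'text' scores in [0, 1]."""
    keep = []
    for t in tweets:
        # Keep only tweets that pass every submodel threshold
        if all(t[k] >= thresholds[k] for k in ("geo", "image", "user", "text")):
            keep.append(t)
    return keep

# Hypothetical usage: scored_tweets is the output of the four submodels
relevant = filter_tweets(scored_tweets,
                         thresholds={"geo": 0.5, "image": 0.6, "user": 0.4, "text": 0.7})
```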
psi-collect is a command line tool for collecting post-storm imagery from the Remote Sensing Division of the National Geodetic Survey (NGS), part of the US National Oceanic and Atmospheric Administration (NOAA). The tool enables reproducible computational workflows in downstream learning and labeling tasks and uses parallel processing to capture over 100,000 images, each with an average size of 7.7 MB, from several different sources.
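As an illustration of the parallel-download pattern such a tool relies on, a sketch using a thread pool is given below. The URL manifest and output directory are placeholders and do not reflect psi-collect's actual interface.

```python
import pathlib
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url, out_dir=pathlib.Path("imagery")):
    """Download one image to out_dir and return its local path."""
    out_dir.mkdir(exist_ok=True)
    dest = out_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(dest))
    return dest

# Hypothetical manifest of post-storm image URLs
urls = open("ngs_image_urls.txt").read().split()
with ThreadPoolExecutor(max_workers=16) as pool:
    paths = list(pool.map(fetch, urls))
```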
Representing scientific knowledge using ontologies enables data integration, consistent machine-readable data representation, and large-scale computational analyses. Text mining approaches that can automatically process and annotate scientific literature with ontology concepts are necessary to keep up with the rapid pace of scientific publishing. Here, we present deep learning models (Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM)) combined with different input encoding formats for automated Named Entity Recognition (NER) of ontology concepts from text. The Colorado Richly Annotated Full Text (CRAFT) gold standard corpus was used to train and test our models. Precision, Recall, F1, and Jaccard semantic similarity were used to evaluate the performance of the models. We found that GRU-based models outperform LSTM models across all evaluation metrics. Surprisingly, considering the top two probabilistic predictions of the model for each instance instead of the top one resulted in a substantial increase in accuracy. Inclusion of ontology semantics via subsumption reasoning yielded a modest performance improvement.
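A minimal sketch of a bidirectional-GRU sequence tagger of the kind compared above is shown below, in Keras. Vocabulary size, tag count, sequence length, and layer widths are assumed placeholders, not the paper's settings.

```python
import tensorflow as tf

vocab_size, n_tags, max_len = 20000, 25, 100  # assumed sizes

# Token-level tagger: embed tokens, run a bidirectional GRU, predict a tag per token
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.build(input_shape=(None, max_len))
model.summary()
```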
We present an active learning pipeline to identify hurricane impacts on coastal landscapes. Previously unlabeled post-storm images are used in a three-component workflow: first, an online interface is used to crowd-source labels for imagery; second, a convolutional neural network is trained using the labeled images; third, model predictions are displayed on an interactive map. Both the labeler and the interactive map allow coastal scientists to provide additional labels that will be used to develop a large labeled dataset, a refined model, and improved hurricane impact assessments.
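The sketch below shows one pass of the retraining step in such a loop: newly crowd-sourced labels are read from a directory and a convolutional classifier is fine-tuned on them. The directory layout, backbone choice, number of impact classes, and epochs are assumptions for illustration, not the pipeline's actual configuration.

```python
import tensorflow as tf

# Crowd-sourced labels, one subdirectory per impact class (assumed layout)
train = tf.keras.utils.image_dataset_from_directory(
    "labeled_patches", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.Dense(4, activation="softmax"),     # 4 impact classes (assumed)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train, epochs=3)  # predictions would then be written out for the interactive map
```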
Healthcare costs attributable to unplanned readmissions are staggeringly high and negatively impact the health and wellness of patients. In the United States, hospital systems and care providers have strong financial motivations to reduce readmissions in accordance with several government guidelines. One of the critical steps toward reducing readmissions is to recognize the factors that lead to readmission and, correspondingly, to identify at-risk patients based on those factors. The availability of large volumes of electronic health records makes it possible to develop and deploy automated machine learning models that can predict unplanned readmissions and pinpoint the most important factors of readmission risk. While hospital readmission is an undesirable outcome for any patient, it is even more so for medically frail patients. Here, we develop and compare four machine learning models (Random Forest, XGBoost, CatBoost, and Logistic Regression) for predicting 30-day unplanned readmission in patients aged 50 and older deemed frail. Variables indicating frailty, comorbidities, high-risk medication use, and demographic, hospital, and insurance factors were incorporated in the models to predict unplanned 30-day readmission. Our findings indicate that CatBoost outperforms the other three models (AUC 0.80) as well as prior work in this area. We find that constructs of frailty, certain categories of high-risk medications, and comorbidity are all strong predictors of readmission for elderly patients.
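A minimal sketch of the CatBoost comparison is given below. The file name, outcome column, and categorical feature names are placeholders, not the study's variables, and no hyperparameter tuning is shown.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

cohort = pd.read_csv("frail_cohort.csv")  # assumed EHR extract for frail patients
X, y = cohort.drop(columns="readmit_30d"), cohort["readmit_30d"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=500, verbose=False,
                           cat_features=["insurance", "admission_type"])  # assumed categoricals
model.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```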
This study investigates Twitter usage during Hurricane Sandy, combining a survey of the general population with an exploration of communication dynamics on Twitter through different modalities. The results suggest that Twitter is a highly valuable source of disaster-related information, particularly during power outages. With a substantial increase in the number of tweets and unique users during Hurricane Sandy, a large number of posts contained firsthand information about the hurricane, showing the intensity of the event in real time. More specifically, many images of damage and flooding were shared on Twitter, from which researchers and emergency managers can retrieve valuable information to help identify storm damage and plan relief efforts. The social media analysis revealed the most important information that can be derived from Twitter during disasters so that authorities can successfully utilize such data. The findings provide insights into the choice of keywords and sentiments and into identifying the influential actors at different stages of disasters. A number of key influencers and their followers from different domains, including political, news, weather, and relief organizations, participated in Twitter-based discussions related to Hurricane Sandy. The connectivity of the influencers and their followers on Twitter plays a vital role in information sharing and dissemination throughout a hurricane. These connections can provide an effective vehicle for emergency managers to establish better bi-directional communication during disasters. However, while government agencies were among the prominent Twitter users during Hurricane Sandy, they primarily relied on one-way communication rather than engaging with their audiences, a challenge that needs to be addressed in future research.
Background: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients using survey data (and laboratory results), and identify key variables within the data contributing to these diseases among the patients.
Methods: Our research explores data-driven approaches that use supervised machine learning models to identify patients with these diseases. Using the National Health and Nutrition Examination Survey (NHANES) dataset, we conduct an exhaustive search of all available feature variables within the data to develop models for cardiovascular disease, prediabetes, and diabetes detection. Using different time frames and feature sets (with and without laboratory data), multiple machine learning models (logistic regression, support vector machines, random forest, and gradient boosting) were evaluated on their classification performance. The models were then combined to develop a weighted ensemble model capable of leveraging the performance of the disparate models to improve detection accuracy. Information gain of tree-based models was used to identify the key variables within the patient data that contributed to the detection of at-risk patients in each of the disease classes by the data-learned models.
Results: The developed ensemble model for cardiovascular disease (based on 131 variables) achieved an Area Under the Receiver Operating Characteristic curve (AU-ROC) score of 83.1% without laboratory results and 83.9% with laboratory results. In diabetes classification (based on 123 variables), the eXtreme Gradient Boosting (XGBoost) model achieved an AU-ROC score of 86.2% (without laboratory data) and 95.7% (with laboratory data). For prediabetic patients, the ensemble model had the top AU-ROC score of 73.7% (without laboratory data), and with laboratory-based data XGBoost performed best at 84.4%. The top five predictors for diabetes patients were 1) waist size, 2) age, 3) self-reported weight, 4) leg length, and 5) sodium intake. For cardiovascular disease, the models identified 1) age, 2) systolic blood pressure, 3) self-reported weight, 4) occurrence of chest pain, and 5) diastolic blood pressure as key contributors.
Conclusion: We conclude that machine-learned models based on survey questionnaires can provide an automated identification mechanism for patients at risk of diabetes and cardiovascular disease. We also identify key contributors to the prediction, which can be further explored for their implications on electronic health records.
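A sketch of the weighted soft-voting ensemble described in the Methods is shown below. The synthetic feature matrix stands in for the NHANES-derived variables, and the weights are arbitrary illustrations rather than the study's fitted values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for the NHANES feature matrix (imbalanced binary outcome)
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier()),
                ("gb", GradientBoostingClassifier())],
    voting="soft",
    weights=[1, 1, 2, 2],  # assumed weights; the study tunes these per disease class
)
ensemble.fit(X_tr, y_tr)
print("AU-ROC:", roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]))
```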
With the growing number of botnet attacks, computer networks are constantly under threat from attacks that cripple cyber-infrastructure. Detecting these attacks in real time is a difficult and resource-intensive task. One of the pertinent methods for detecting such attacks is signature-based detection using machine learning models. This paper explores the efficacy of these models at detecting botnet attacks, using data captured from large-scale network attacks. Our study provides a comprehensive overview of the performance characteristics of two machine learning models, Random Forest and Multi-Layer Perceptron (deep learning), in such attack scenarios. Using big data analytics, the study explores the advantages, limitations, model/feature parameters, and overall performance of applying machine learning to botnet attacks and communication. With insights gained from the analysis, this work recommends algorithms/models for specific botnet attacks instead of a generalized model.
Information propagation in online social networks has drawn considerable attention from researchers in different fields. While prior works have studied the impact and speed of information propagation in various networks, we focus on the potential interactions of two hypothetically opposite pieces of information, negative and positive. We examine the amount of time available for the positive information to be distributed with wide enough impact after the negative information, under different selection strategies for positive source nodes. Our results enable the selection of a set of users, based on a limited operating budget, to start the spread of positive information as a measure to counteract the spread of negative information. Among different methods, we identify that both eigenvector and betweenness centrality are effective selection metrics. Furthermore, we quantitatively demonstrate that choosing a larger set of nodes for the spread of positive information allows a wider window of time to respond in order to limit the propagation of negative information to a certain threshold.
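The seed-selection step can be sketched as follows: rank nodes by eigenvector or betweenness centrality and take the top k as positive-information sources, where k reflects the operating budget. The random graph and budget below are stand-ins for the networks studied.

```python
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3)  # placeholder social graph
k = 20                                 # operating budget (assumed)

eig = nx.eigenvector_centrality(G, max_iter=1000)
btw = nx.betweenness_centrality(G)

# Top-k nodes under each selection metric
seeds_eig = sorted(eig, key=eig.get, reverse=True)[:k]
seeds_btw = sorted(btw, key=btw.get, reverse=True)[:k]
```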
Text mining approaches for automated ontology-based curation of biological and biomedical literature have largely focused on syntactic and lexical analysis along with machine learning. Recent advances in deep learning have shown increased accuracy for textual data annotation. However, the application of deep learning for ontology-based curation is a relatively new area and prior work has focused on a limited set of models.
Here, we introduce a new deep learning model/architecture based on combining multiple Gated Recurrent Units (GRU) with a character+word based input. We use data from five ontologies in the CRAFT corpus as a Gold Standard to evaluate our model's performance. We also compare our model to seven models from prior work. We use four metrics (Precision, Recall, F1 score, and a semantic similarity metric, Jaccard similarity) to compare our model's output to the Gold Standard. Our model achieved 84% Precision, 84% Recall, 83% F1, and 84% Jaccard similarity. Results show that our GRU-based model outperforms prior models across all five ontologies. We also observed that character+word inputs result in higher performance across models compared to word-only inputs.
These findings indicate that deep learning algorithms are a promising avenue to be explored for automated ontology-based curation of data. This study also serves as a formal comparison and guideline for building and selecting deep learning models and architectures for ontology-based curation.
As Internet-based communications have expanded, online debating has become a significant form of political participation. This work examines online discussions around health care in the United States by analysing tweets about Obamacare and then assessing the degrees of polarisation in social media. The results indicate that highly influential entities in social media have an important capacity to polarise the public. Another relevant finding is that ideology is a powerful mechanism for framing online discussions, relegating policy arguments to the margins of online debates. Finally, this work shows that social media can easily promote negative sentiments towards ‘the other’, confirming group homogeneity in online communities.
Predicting human mobility within cities is an important task in urban and transportation planning. With the vast amount of digital traces available through social media platforms, we investigate the potential application of such data in predicting commuter trip distribution at small spatial scale. We develop back propagation (BP) neural network and gravity models using both traditional and Twitter data in New York City to explore their performance and compare the results. Our results suggest the potential of using social media data in transportation modeling to improve the prediction accuracy. Adding Twitter data to both models improved the performance with a slight decrease in root mean square error (RMSE) and an increase in R-squared (R2) value. The findings indicate that the traditional gravity models outperform neural networks in terms of having lower RMSE. However, the R2 results show higher values for neural networks suggesting a better fit between the real and predicted outputs. Given the complex nature of transportation networks and different reasons for limited performance of neural networks with the data, we conclude that more research is needed to explore the performance of such models with additional inputs.
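For readers unfamiliar with the baseline, the following worked form of an unconstrained gravity model makes the comparison concrete: predicted trips are T_ij = k * P_i * A_j / d_ij^beta, with k and beta calibrated to observed trips. The zone counts, distances, and parameter values below are illustrative, not the New York City estimates.

```python
import numpy as np

P = np.array([1200.0, 800.0, 500.0])   # trip production per origin zone (assumed)
A = np.array([900.0, 600.0, 1000.0])   # trip attraction per destination zone (assumed)
d = np.array([[1.0, 4.0, 7.0],         # zone-to-zone distances, e.g. km (assumed)
              [4.0, 1.0, 3.0],
              [7.0, 3.0, 1.0]])

k, beta = 0.01, 2.0                    # assumed calibrated parameters
T = k * np.outer(P, A) / d**beta       # predicted origin-destination trip matrix
print(T.round(1))
```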
Student loans occupy a significant portion of the federal budget and represent the largest debt burden for graduates. This paper explores data-driven approaches to understanding the repayment of such loans. Using statistical and machine learning models on the College Scorecard Data, this research focuses on extracting and identifying key factors affecting the repayment of a student loan. These factors can be used to develop models that provide predictive capability for repayment rates, detect irregularities and non-repayment, and help in understanding the intricacies of student loans.
Scientists exploring a new area of research want to know the “hot” topics in that area in order to make informed choices. With the exponential growth of scientific literature, identifying such trends manually is not easy. Topic modeling has emerged as an effective approach to analyzing large volumes of text. While this approach has been applied to literature in other scientific areas, there has been no formal analysis of the bioinformatics literature.
Here, we conduct keyword- and topic model-based analyses of the bioinformatics literature from 1998 to 2016. We identify top keywords and topics per year and explore temporal popularity trends of those keywords/areas. Network analysis was conducted to identify clusters of sub-areas/topics in bioinformatics. We found that “big-data”, “next generation sequencing”, and “cancer” all experienced exponential increases in popularity over the years. On the other hand, interest in drug discovery has plateaued since the early 2000s.
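A minimal sketch of the topic-model step is shown below: fit latent Dirichlet allocation (LDA) to a corpus of abstracts and print the top words per topic. The corpus file, vocabulary thresholds, and topic count are placeholders, not the study's settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: one abstract per line
abstracts = open("bioinformatics_abstracts.txt").read().splitlines()

vec = CountVectorizer(stop_words="english", max_df=0.9, min_df=5)
X = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-8:][::-1]]  # eight highest-weight words
    print(f"topic {i}: {', '.join(top)}")
```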
We introduce a family of authenticated data structures, Ordered Merkle Trees (OMT), and illustrate their utility in security kernels for a wide variety of sub-systems. Specifically, two types of OMTs, a) the index ordered Merkle tree (IOMT) and b) the range ordered Merkle tree (ROMT), are investigated for their suitability in security kernels for various sub-systems of the Border Gateway Protocol (BGP), the Internet’s inter-autonomous system routing infrastructure. We outline simple generic security kernel functions to maintain OMTs, and sub-system specific security kernel functionality for BGP sub-systems (such as registries, autonomous system owners, and BGP speakers/routers) that take advantage of OMTs.
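To make the underlying primitive concrete, the sketch below computes a plain Merkle tree root over a set of leaves. It shows only the basic hash-tree construction that the OMT variants build on; the index and range ordering, and the non-existence proofs they enable, are not shown.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest used for both leaves and internal nodes."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of byte-string leaves."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"record-1", b"record-2", b"record-3"])
print(root.hex())
```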
South African students across numerous university campuses joined together in the second half of 2015 to protest the rising cost of higher education. In addition to on-campus protesting, activists used Twitter to mobilize and communicate with each other, and, as the protests drew national attention, the hashtag #FeesMustFall began trending on Twitter. What began as a localized movement against tuition increases became a global issue when a South African court granted an interdict against the use of the #FeesMustFall hashtag. This paper traces the global spread of the #FeesMustFall hashtag on Twitter as a response to this extraordinary attempt to limit online free speech. We analyze the global flow and geographic spread of the hashtag, and our evidence supports the argument that the attempt to censor and curtail the protestors’ right to organize and share the hashtag in fact propelled the #FeesMustFall movement onto the international stage.
The assurances provided by an assurance protocol for any information system (IS) extend only as far as the integrity of the assurance protocol itself. The integrity of the assurance protocol is negatively influenced by a) the complexity of the assurance protocol and b) the complexity of the platform on which the assurance protocol is executed. This paper outlines a holistic Mirror Network (MN) framework for assuring information systems that seeks to minimize both complexities. The MN framework is illustrated using a generic cloud file storage system as an example IS.
A cloud storage assurance architecture (CSAA) for providing integrity, privacy, and availability assurances regarding any cloud storage service is presented. CSAA is motivated by the fact that the complexity of the components (software, hardware, and personnel) that compose such a service, and the lack of transparency regarding the policies followed by the service, make conventional security mechanisms insufficient to provide convincing assurances to users. As it is impractical to rule out hidden undesired functionality in every component of the service, CSAA bootstraps all desired assurances from simple transformation procedures executed inside a low-complexity trustworthy module; no component of the cloud storage service is trusted.
As social media tools become more popular at all levels of government, more research is needed to determine how the platforms can be used to create meaningful citizen–government collaboration. Many entities use the tools in one-way, push manners. The aim of this research is to determine whether sentiment (tone) can positively influence citizen participation with government via social media. Using a systematic random sample of 125 U.S. cities, we found that positive sentiment is more likely to engender digital participation, although this was not a perfect one-to-one relationship. Some cities that had an overall positive sentiment score and displayed a participatory style of social media use did not have positive citizen sentiment scores. We argue that positive tone is only one part of a successful social media interaction plan, and encourage social media managers to actively manage platforms and use activities that spur participation.
Devices participating in mobile ad hoc networks (MANETs) are expected to strictly adhere to a uniform routing protocol to route data packets among themselves. Unfortunately, MANET devices, composed of untrustworthy software and hardware components, expose a large attack surface. This can be exploited by attackers to gain control over one or more devices and wreak havoc on the MANET subnet. The approach presented in this paper to secure MANETs restricts the attack surface to a single module in MANET devices: a trusted MANET module (TMM). TMMs are deliberately constrained to demand only modest memory and computational resources in the interest of further reducing the attack surface. The specific contribution of this paper is a precise characterization of simple TMM functionality suitable for any distance-vector-based routing protocol, to realize the broad assurance that “any node that fails to abide by the routing protocol will not be able to participate in the MANET”.
Increasing complexity and inter-dependency of information systems (IS), and the lack of transparency regarding system components and policies, have rendered traditional security mechanisms (applied at different OSI levels) inadequate to provide convincing confidentiality-integrity-availability (CIA) assurances regarding any IS. We present an architecture for a generic, trustworthy assurance-as-a-service IS, which can actively monitor the integrity of any IS and provide convincing system-specific CIA assurances to users of that IS. More importantly, no component of the monitored IS itself is trusted in order to provide assurances regarding the monitored IS.
Several applications fall under the broad umbrella of data dissemination systems (DDS), where providers and consumers of information rely on untrusted, or even unknown, middle-men to disseminate and acquire data. This paper proposes a security architecture for a generic DDS by identifying a minimal trusted computing base (TCB) for middle-men and leveraging the TCB to provide useful assurances regarding the operation of the DDS. A precise characterization of the TCB is provided as a set of simple functions that can be executed even inside a severely resource-limited trustworthy boundary. A core feature of the proposed approach is the ability of even resource-limited modules to maintain an index ordered Merkle tree (IOMT).
We consider the security requirements for a broad class of content distribution systems where the content distribution infrastructure is required to strictly abide by access control policies prescribed by owners of content. We propose a security solution that identifies a minimal trusted computing base (TCB) for a content distribution infrastructure, and leverages the TCB to provide all desired assurances regarding the operation of the infrastructure. It is assumed that the contents and access control policies associated with contents are dynamic.
In applications such as remote file storage systems, an essential component of cloud computing systems, users are required to rely on untrustworthy servers. We outline an approach to secure such file storage systems by relying only on a resource-limited trusted module available at the server, and more specifically, without the need to trust any component of the server or its operator(s). The proposed approach to realize a trusted file storage system (TFSS) addresses some shortcomings of a prior effort (Sarmenta et al., 2006), which employs a Merkle hash tree to guarantee freshness. We argue that the shortcomings stem from the inability to verify non-existence. The TFSS described in this paper relies on index ordered Merkle trees (IOMT) to gain the ability to verify non-existence.