Research

My research focuses on data science and knowledge extraction from complex, large-scale datasets. As humans and machines generate ever-increasing volumes of data, computing systems are undergoing a fundamental shift towards a data-driven, data-intensive model of operation. This creates an unprecedented need for novel approaches in algorithms, computation, and theory that can infer critical and insightful patterns from such data. My research aims to meet this need by developing computational techniques and solving challenges that originate in interdisciplinary and cross-disciplinary fields. Specifically, the primary goal of my research is to discover dynamic patterns (temporal or event-based) in sequential data using machine learning and data mining algorithms together with novel data structures. My research spans two main algorithmic areas: 1) explainable machine learning for understanding human interactions and complex processes, and 2) deep learning architectures for written language and multimodal data. A key component of my research is data science for social good: the algorithms in these areas apply to a wide range of domains, including health informatics, geospatial intelligence, social sciences, and psychology.

Explainable AI
Deep Learning
Big Data
Data Science
Cyber-Security
Trustworthy Computing

Research Projects

Funding Agencies

Knowledge Extraction from Natural Language
Deep Learning; Ontologies; Semantic Representations; Multi-Source Information

Annotating free text with ontology concepts enables knowledge extraction that supports higher-level queries. Currently, the majority of ontology-based data annotation is performed via manual curation of scientific literature: reading the text and annotating parts of it with one or more ontology concepts. Manual curation is tedious and time-consuming, and it does not scale to the growing body of scientific literature. Our research explores algorithms that can automate this process, which is a very challenging problem. Specifically, we are developing neural models that use the sequences of information in the text, along with other modalities of scientific literature, to perform such annotation.
Data Driven Approaches Towards Understanding Patient Risk
Machine Learning; Deep Learning; Health Care Informatics; Big Data; High Performance Computing

While the development and curation of electronic health records (EHR) have enabled significant gains in the knowledge base for analyzing patient health risk, they provide only a coarse-grained view of a patient's overall state. Furthermore, current approaches to predictive models of patient risk are narrow in focus: they select only certain variables from the EHR data to predict outcomes such as mortality, hospital readmission, or disease diagnosis. A key hurdle in developing an overall risk score for a patient is the volume and complexity of EHR data, where structured and unstructured data present statistical, machine-learning, and computational challenges. To address this problem, the proposed research aims to develop a fusion architecture capable of integrating multiple predictive models to provide an overall assessment of patient health.
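As a rough illustration of what integrating multiple predictive models can look like, the sketch below combines per-model risk probabilities by weighted average (a simple form of late fusion). It is not the proposed fusion architecture; the function name `fuse_risk_scores`, the model names, the scores, and the weights are all hypothetical.

```python
from typing import Dict


def fuse_risk_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-model risk probabilities into one score by weighted average.

    Assumption: each model emits a probability in [0, 1] and has a
    non-negative weight reflecting how much it is trusted.
    """
    if not scores:
        raise ValueError("at least one model score is required")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical outputs of three separately trained models for one patient.
model_scores = {"mortality": 0.10, "readmission": 0.40, "diagnosis": 0.25}
# Hypothetical trust weights; in practice these might be learned or tuned.
model_weights = {"mortality": 2.0, "readmission": 1.0, "diagnosis": 1.0}

overall = fuse_risk_scores(model_scores, model_weights)
print(round(overall, 4))  # 0.2125
```

A weighted average is only the simplest choice; a learned meta-model over the individual predictions would be another way to fuse them.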
Anomaly Prediction in High-Velocity Streaming Data
Big Data; High Performance Computing; Machine Learning; Cyber-Security

With the ever-growing need for smarter threat detection systems in cyber security, Big Data approaches to the precognition of events and anomalies can provide time-critical reactivity to counter complex, real-time threats. Towards this goal, my research explores the development of parallelizable machine-learning models capable of detecting trends in high-velocity data streams. Using a data-driven approach, we are working with internet-scale data and designing pipelines that extract, filter, and normalize the data and predict the outcome of a trend in a real-time environment. Using both supervised and unsupervised algorithms for categorization and classification of temporal patterns, we are trying to understand the predictive capabilities for a cyber-system under attack. The research is also exploring models that can help mitigate such attacks using defensive strategies.
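To make the idea of flagging anomalies in a stream concrete, here is a minimal sketch that uses a rolling z-score over a fixed window. This is a stand-in technique chosen for brevity, not the models developed in this research; the function name `detect_anomalies`, the window size, the threshold, and the sample values are assumptions.

```python
from collections import deque
from math import sqrt
from typing import Iterable, Iterator, Tuple


def detect_anomalies(
    stream: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for points far from the recent rolling mean.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` values.
    """
    recent = deque(maxlen=window)  # only the last `window` values are kept
    for index, value in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((x - mean) ** 2 for x in recent) / window
            std = sqrt(variance)
            if std > 0 and abs(value - mean) / std > threshold:
                yield index, value
        recent.append(value)


# Hypothetical event counts per second with one obvious spike.
events_per_second = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
print(list(detect_anomalies(events_per_second)))  # [(6, 50)]
```

Because the detector keeps only a bounded window of state, it processes each value in constant memory, which is the property that matters for high-velocity streams.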
Social Media Tracking and Analysis System (SMTAS)
Big Data; Social Media Analytics; Cloud Computing; Scalable Architectures; Machine Learning

My first foray into Big Data analysis was part of a research challenge to design and develop a web-based application capable of analyzing the Twitter firehose data stream. The result of the effort was the Social Media Tracking and Analysis System (SMTAS). SMTAS is built on a scalable architecture utilizing cloud-based storage and computing nodes. As a web application, SMTAS allows researchers to create buckets of data filtered by search terms such as words, hashtags, phrases, geo-locations, and languages. SMTAS has real-time analysis modules for volume statistics, temporal analysis, trend detection, sentiment analysis (based on supervised machine learning), and spatial modeling. The application was the first of its kind as a collaborative tool for researchers in the social sciences and computational sciences to study human behavior based on Big Data analysis.
Minimal - Trusted Computing Base (M-TCB)
Trustworthy Computing; Cyber-Security

Hidden undesired functionality is an unavoidable reality in any complex hardware or software component. Undesired functionality, whether deliberately introduced Trojan horses or accidentally introduced bugs, in any component of a system can be exploited by attackers to exert control over the system. To mitigate such vulnerabilities, we investigated the utilization of a minimal Trusted Computing Base (M-TCB) as the security kernel of the system. The architecture leverages the efficiency, versatility, and reusability of Ordered Merkle Trees as the data structure for the kernel. The research applied the architecture to the design and implementation of various application systems, such as: 1) a remote file storage system; 2) a generic content distribution system; 3) generic look-up servers; 4) mobile ad-hoc networks; and 5) the Internet’s routing infrastructure based on the border gateway protocol (BGP).
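For readers unfamiliar with Merkle trees, the sketch below builds a plain binary Merkle tree over sorted leaves and verifies a membership proof against the root. It is a simplified illustration, not the Ordered Merkle Tree construction used in the M-TCB work; the function names, the leaf values, and the choice to duplicate the last node on odd-sized levels are assumptions made for brevity.

```python
import hashlib
from typing import List, Tuple


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return every tree level, from hashed sorted leaves up to the root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Prefix bytes separate leaf hashes (0x00) from inner-node hashes (0x01).
    level = [_h(b"\x00" + leaf) for leaf in sorted(leaves)]
    levels = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # simplification: pad odd levels
        levels.append(level)
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    levels.append(level)  # final level holds only the root
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a leaf and its proof, then compare."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + sibling + node) if sibling_is_left else _h(b"\x01" + node + sibling)
    return node == root


# Hypothetical records; only the 32-byte root needs trusted storage.
records = [b"file-c", b"file-a", b"file-b"]
levels = build_levels(records)
root = levels[-1][0]
proof = prove(levels, sorted(records).index(b"file-c"))
print(verify(b"file-c", proof, root))  # True
print(verify(b"file-x", proof, root))  # False
```

The appeal for a small trusted kernel is that the verifier holds only the root hash, while untrusted storage supplies the records and proofs.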
Organic Social-Media During Severe Weather Events
Social Media; Real-Time Analytics; Big Data; Natural Disaster

In research funded by the Coastal Storm Awareness Program of the National Oceanic and Atmospheric Administration (CSAP-NOAA), we wanted to understand the utility of Twitter as a source of information during physical events such as hurricanes. Using the SMTAS platform, the research team collected approximately 4.8M geo-coded and 8.2M search-term-based tweets related to Hurricane Sandy. The computational goal of the team was to use Natural Language Processing (NLP) based machine-learning models to filter the noise (non-weather-related messages) out of the dataset. The signal (weather-related) tweets were used to create a real-time web application called Social Crowd based Visual Monitor (SoC-VM). Emergency managers in the field can use SoC-VM to gather real-time weather information from the pictures shared in the signal tweets. The system uses the machine-learning models to filter tweets in real time and is built on a Distributed Data Protocol-based web framework and a non-relational database. I also analyzed the network connectivity between influencers and followers and tracked the spread of information over time during the hurricane.
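The signal/noise filtering step can be pictured with a toy text classifier. The sketch below is a small multinomial Naive Bayes with add-one smoothing, written with the standard library only; it is not the model used in the project, and the function names and the four training messages are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train(examples: List[Tuple[str, str]]) -> Tuple[Dict[str, Counter], Counter]:
    """Count words per label and messages per label."""
    word_counts: Dict[str, Counter] = {}
    label_counts: Counter = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text: str, word_counts: Dict[str, Counter], label_counts: Counter) -> str:
    """Return the label with the highest smoothed log-probability."""
    vocabulary = set().union(*word_counts.values())
    total_messages = sum(label_counts.values())
    best_label, best_score = "", -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_messages)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training messages, labeled "signal" (weather) or "noise".
training = [
    ("flooding on main street", "signal"),
    ("power lines down after storm", "signal"),
    ("great pizza tonight", "noise"),
    ("watching the game tonight", "noise"),
]
word_counts, label_counts = train(training)
print(classify("storm flooding downtown", word_counts, label_counts))  # signal
print(classify("pizza game tonight", word_counts, label_counts))       # noise
```

A production filter would need far more labeled data and better tokenization, but the structure (train once, then score each incoming message) matches the real-time filtering described above.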
Networks of Research (NoR)
Graph Analytics; Research Productivity; Big Data; Text Mining

One of my areas of interest is studying the overall scientific output of a research organization through the publication and grant activity of its researchers; the Networks of Research (NoR) project is the result of this effort. Using data from publication data sources (Elsevier, Scopus, Microsoft Academic Database) and internal university data from Sponsored Programs Administration (SPA), a large graph database was created for all researchers at the university. The goal of the project was to understand the collaboration between researchers and the topics of research conducted at the university. Using graph-theory methods and unsupervised topic modeling (to capture the topics of published abstracts), inter-disciplinary and intra-disciplinary collaboration can be visualized across researcher networks. Topic trends can also be analyzed over time, and emerging topics of interest can be derived from the grant activity.
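A small sketch of the collaboration-graph idea: build weighted co-authorship edges from author lists, then split them into intra- and inter-disciplinary collaboration by department. This is only an illustration of the approach, not the NoR implementation; the function names, researcher names, and departments are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def coauthor_edges(papers: Iterable[List[str]]) -> Counter:
    """Count co-authored papers for each unordered pair of researchers."""
    edges: Counter = Counter()
    for authors in papers:
        # Sorting gives every pair one canonical (a, b) key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def split_by_discipline(edges: Counter, department: Dict[str, str]) -> Tuple[int, int]:
    """Return (intra, inter) collaboration counts, weighted by shared papers."""
    intra = inter = 0
    for (a, b), weight in edges.items():
        if department[a] == department[b]:
            intra += weight
        else:
            inter += weight
    return intra, inter


# Hypothetical author lists and department assignments.
papers = [["ana", "bo"], ["ana", "cy"], ["ana", "bo", "cy"]]
department = {"ana": "CS", "bo": "CS", "cy": "Biology"}

edges = coauthor_edges(papers)
print(edges[("ana", "bo")])                    # 2
print(split_by_discipline(edges, department))  # (2, 3)
```

The same edge list could be loaded into a graph database or graph library for centrality and community analysis; the dictionary-based version here just keeps the example self-contained.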