My primary research interests are the development of new methods, algorithms, and systems for mining, analyzing, and visualizing large-scale complex datasets (“Big Data”). As humans and machines generate ever more data, compute systems are undergoing a fundamental shift towards a more distributed, data-intensive model of operation. Such systems apply to disparate areas (such as AI, cybersecurity, engineering, and the social sciences), where they aim to extract new insights from data. This broad set of application areas presents a unique opportunity for interdisciplinary research and poses new challenges in computation, storage, and security.

  • Big Data
  • Data Science
  • Machine Learning
  • Deep Learning
  • Cyber-Security
  • Trustworthy Computing

Research Projects

Funding Agencies

  • Data Driven Approaches Towards Understanding Patient Risk

    Machine Learning; Deep Learning; Health Care Informatics; Big Data; High Performance Computing

    While the development and curation of electronic health records (EHR) have significantly expanded the knowledge base for analyzing patient health risk, they provide only a coarse-grained view of a patient's overall state. Furthermore, current approaches to predictive models of patient risk are narrow in focus: they select only certain variables from the EHR data to predict mortality, hospital readmission, disease diagnosis, and similar outcomes. A key hurdle in developing an overall risk score for a patient is the underlying volume and complexity of EHR data, where structured and unstructured data present statistical, machine-learning, and computational challenges. To address this problem, the proposed research aims to develop a fusion architecture capable of integrating multiple predictive models to provide an overall assessment of patient health.
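
    The idea of fusing several predictive models into one overall score can be illustrated with a minimal late-fusion sketch. It assumes each model returns a risk probability in [0, 1] and combines them by a weighted average. The model names, record fields, and weights are hypothetical and are not taken from the project itself.

```python
from typing import Callable, Dict, Mapping

# Hypothetical interface: a model maps a patient record to a risk in [0, 1].
RiskModel = Callable[[Mapping[str, float]], float]


def fuse_risk(
    record: Mapping[str, float],
    models: Dict[str, RiskModel],
    weights: Dict[str, float],
) -> float:
    """Combine per-model risk scores into one score by weighted average."""
    total_weight = sum(weights[name] for name in models)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * models[name](record) for name in models)
    return weighted_sum / total_weight


# Toy stand-ins for trained models (assumptions for illustration only).
toy_models: Dict[str, RiskModel] = {
    "mortality": lambda r: min(1.0, r["age"] / 100.0),
    "readmission": lambda r: 0.5 if r["prior_admissions"] > 2 else 0.1,
}
toy_weights = {"mortality": 2.0, "readmission": 1.0}

overall = fuse_risk({"age": 50, "prior_admissions": 3}, toy_models, toy_weights)
```

    A weighted average is only one possible fusion rule; a learned meta-model over the individual scores is a common alternative.
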
  • Anomaly Prediction in High-Velocity Streaming Data

    Big Data; High Performance Computing; Machine Learning; Cyber-Security

    With the growing need for smarter threat detection systems in cyber security, Big Data based approaches to the precognition of events and anomalies can provide time-critical reactivity to counter complex, real-time threats. Towards this goal, my research explores the development of parallelizable machine-learning models capable of detecting trends in high-velocity data streams. Using a data-driven approach, we are working with internet-scale data and designing pipelines that extract, filter, and normalize the data and predict the outcome of a trend in a real-time environment. Using both supervised and unsupervised algorithms for the categorization and classification of temporal patterns, we are trying to understand the predictive capabilities of a cyber-system under attack. The research is also exploring models that can help mitigate such attacks using defensive strategies.
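
    As a rough illustration of unsupervised detection on a stream, the sketch below flags values that deviate strongly from a sliding window of recent history (a rolling z-score). This is a generic textbook technique chosen for illustration, not the project's models; the class name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag stream values far from the mean of a sliding window."""

    def __init__(self, window: int = 50, threshold: float = 3.0,
                 min_history: int = 10) -> None:
        self.history = deque(maxlen=window)  # most recent normal values
        self.threshold = threshold
        self.min_history = min_history

    def update(self, value: float) -> bool:
        """Process one value and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Anomalous values are kept out of the window so that a burst
        # does not inflate the baseline it is compared against.
        if not is_anomaly:
            self.history.append(value)
        return is_anomaly


detector = RollingZScoreDetector(window=50, threshold=3.0, min_history=10)
stream = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 100, 10]
flags = [detector.update(v) for v in stream]
```

    Because each detector holds only a bounded window, one instance per stream key (for example, per host) can run independently, which is one simple way to parallelize this kind of check.
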

  • Social Media Tracking and Analysis System (SMTAS)

    Big Data; Social Media Analytics; Cloud Computing; Scalable Architectures; Machine Learning

    My first foray into Big Data analysis was part of a research challenge to design and develop a web-based application capable of analyzing the Twitter firehose data stream. The result of the effort was the Social Media Tracking and Analysis System (SMTAS). SMTAS is built on a scalable architecture that uses cloud-based storage and computing nodes. As a web application, SMTAS allows researchers to create buckets of data filtered by search terms such as words, hashtags, phrases, geo-locations, and languages. SMTAS has real-time analysis modules for volume statistics, temporal analysis, trend detection, sentiment analysis (based on supervised machine learning), and spatial modeling. The application was the first of its kind as a collaborative tool for researchers in the social and computational sciences to study human behavior based on Big Data analysis.
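
    The notion of a bucket filtered by search terms can be sketched as follows. This is a simplified illustration, not SMTAS code: the `Bucket` class, the `route` function, and the tweet fields `text` and `lang` are hypothetical, and only keyword, hashtag, and language filters are shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Bucket:
    """A named collection of tweets matching a researcher's filters."""
    name: str
    keywords: Set[str] = field(default_factory=set)   # lowercase words/hashtags
    languages: Set[str] = field(default_factory=set)  # empty set = any language
    tweets: List[Dict[str, str]] = field(default_factory=list)

    def matches(self, tweet: Dict[str, str]) -> bool:
        if self.languages and tweet.get("lang") not in self.languages:
            return False
        # Lowercase and strip surrounding punctuation, keeping '#' on hashtags.
        tokens = {t.strip(".,!?;:\"'") for t in tweet.get("text", "").lower().split()}
        return bool(self.keywords & tokens)


def route(tweet: Dict[str, str], buckets: List[Bucket]) -> List[str]:
    """Store the tweet in every matching bucket and return their names."""
    matched = []
    for bucket in buckets:
        if bucket.matches(tweet):
            bucket.tweets.append(tweet)
            matched.append(bucket.name)
    return matched


storm = Bucket(name="storm", keywords={"#sandy", "hurricane"}, languages={"en"})
hits = route({"text": "Hurricane warning issued!", "lang": "en"}, [storm])
```
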

  • Minimal - Trusted Computing Base (M-TCB)

    Trustworthy Computing; Cyber-Security

    Hidden undesired functionality is an unavoidable reality in any complex hardware or software component. Undesired functionality in any component of a system, whether a deliberately introduced Trojan horse or an accidentally introduced bug, can be exploited by attackers to exert control over that system. To mitigate such vulnerabilities, we investigated the use of a minimal Trusted Computing Base (M-TCB) as the security kernel of the system. The architecture leverages the efficiency, versatility, and reusability of Ordered Merkle Trees as the data structure for the kernel. The research applied the architecture to the design and implementation of various application systems, such as: 1) a remote file storage system; 2) a generic content distribution system; 3) generic look-up servers; 4) mobile ad-hoc networks; and 5) the Internet’s routing infrastructure based on the Border Gateway Protocol (BGP).
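
    For readers unfamiliar with Merkle trees, the sketch below shows the basic mechanism: a small trusted component needs to hold only the root hash to check that an item belongs to a large untrusted store. It implements a plain binary Merkle tree with SHA-256, not the Ordered Merkle Tree used in this work, and all function names are illustrative assumptions.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from hashed leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        # An odd node is paired with itself.
        padded = level + [level[-1]] if len(level) % 2 else level
        level = [h(padded[i] + padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        sibling_hash = level[sibling] if sibling < len(level) else level[index]
        proof.append((sibling_hash, index % 2 == 1))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a leaf and its proof; needs only the root."""
    node = h(leaf)
    for sibling_hash, sibling_is_left in proof:
        node = h(sibling_hash + node) if sibling_is_left else h(node + sibling_hash)
    return node == root


levels = build_levels([b"a", b"b", b"c"])
root = levels[-1][0]
proof_c = inclusion_proof(levels, 2)
```

    The verifier stores a single hash and does work logarithmic in the number of leaves, which is why such trees suit a very small trusted kernel.
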

  • Organic Social-Media During Severe Weather Events

    Social Media; Real-Time Analytics; Big Data; Natural Disaster

    In research funded by the Coastal Storm Awareness Program, National Oceanic and Atmospheric Administration (CSAP-NOAA), we wanted to understand the utility of Twitter as a source of information during physical events such as hurricanes. Using the SMTAS platform, the research team collected approximately 4.8M geo-coded and 8.2M search-term based tweets related to Hurricane Sandy. The computational goal of the team was to use Natural Language Processing (NLP) based machine-learning models to filter the noise (non-weather-related messages) out of the dataset. The signal (weather-related) tweets were used to create a real-time web application called Social Crowd based Visual Monitor (SoC-VM). Emergency managers in the field can use SoC-VM to gather real-time weather information from the pictures shared in the signal tweets. The system uses the machine-learning models to filter tweets in real time and is built on a Distributed Data Protocol based web framework and a non-relational database. I also analyzed the connectivity of influencer-follower networks and tracked the spread of information over time during the hurricane.
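
    The signal/noise filtering step can be illustrated with a small multinomial Naive Bayes text classifier using add-one smoothing. The source does not say which models were used, so this is a stand-in technique; the class name and the toy training tweets are invented for illustration and are not project data.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                score += math.log((self.word_counts[label][token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy examples, labeled "signal" (weather-related) or "noise".
clf = NaiveBayesFilter()
clf.train([
    ("flooding on main street", "signal"),
    ("strong wind and rain tonight", "signal"),
    ("storm surge flooding the pier", "signal"),
    ("great pizza tonight", "noise"),
    ("watching the game with friends", "noise"),
    ("new phone is great", "noise"),
])
```
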

  • Networks of Research (NoR)

    Graph Analytics; Research Productivity; Big Data; Text Mining

    One of my areas of interest is studying the overall scientific output of a research organization through the publication and grant activity of its researchers; the Networks of Research (NoR) project is the result of this effort. Using data from publication sources (Elsevier, Scopus, Microsoft Academic Database) and internal university data from Sponsored Programs Administration (SPA), a large graph database was created for all researchers at the university. The goal of the project was to understand the collaboration between researchers and the topics of research conducted at the university. Using graph-theory methods and unsupervised topic modeling (to capture the topics of published abstracts), interdisciplinary and intra-disciplinary collaboration can be visualized across researcher networks. Topic trends can also be analyzed over time, and emerging topics of interest can be derived from the grant activity.
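
    A minimal sketch of the graph side of this idea: build a weighted co-authorship graph from author lists, then count collaborations within and across departments. The function names, researcher names, and department labels are hypothetical; the actual project used a graph database and topic models that are not reproduced here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def coauthor_edges(publications: List[List[str]]) -> Counter:
    """Count joint publications for every pair of co-authors."""
    edges: Counter = Counter()
    for authors in publications:
        # Sorting gives each undirected pair one canonical key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def collaboration_summary(edges: Counter, department: Dict[str, str]) -> Dict[str, int]:
    """Split joint-publication counts into intra- and inter-department totals."""
    summary = {"intra": 0, "inter": 0}
    for (a, b), weight in edges.items():
        kind = "intra" if department[a] == department[b] else "inter"
        summary[kind] += weight
    return summary


# Hypothetical author lists and department assignments.
pubs = [["ana", "bo"], ["ana", "bo", "cy"], ["cy", "di"]]
dept = {"ana": "CS", "bo": "CS", "cy": "Bio", "di": "Bio"}

edges = coauthor_edges(pubs)
summary = collaboration_summary(edges, dept)
```
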

Students and Researchers

David Santana

Undergraduate Researcher - Machine Learning, Computer Vision, and Data Analytics

Shraddha Dafare

Graduate Researcher - Data Mining, Machine Learning, and Big Data

Awantika Mahar

Graduate Researcher - Data Mining, Machine Learning, and Data Munging

Dharani Sethuram

Graduate Researcher - Data Analytics and Data Mining