Efficient Techniques for Big Data Processing in Data Integration and Motif Search

Metadata

Handle

http://hdl.handle.net/11134/20002:860639982

Persons

Creator (cre): Mi, Tian

Major Advisor (mja): Rajasekaran, Sanguthevar

Associate Advisor (asa): Mandoiu, Ion

Associate Advisor (asa): Wu, Yufeng

Associate Advisor (asa): Ammar, Reda

Title

Efficient Techniques for Big Data Processing in Data Integration and Motif Search

Origin Information

Event Place	Storrs, CT
Date Created	2013
Publisher	University of Connecticut

Parent Item

Dissertations

Resource Type

Text

Digital Origin

born digital

Description

The rapid growth of data in bioinformatics and biomedical informatics brings new challenges to these areas. In this thesis, we present efficient computational algorithms for big data processing in data integration and motif search. Data integration, or record linkage, is the problem of identifying records that pertain to the same entity but exist in different data sources, in the absence of a global identifier. For instance, the same individual may have multiple records held by different healthcare providers. Several algorithms proposed in the literature are adept at integrating records from two datasets. However, their limitations become apparent with multiple (more than two) data sources, and in practice we often have to deal with many more than two datasets. We propose efficient algorithms based on hierarchical clustering to handle massive data from multiple sources.
In motif prediction, minimotifs (also called Short Linear Motifs) are short contiguous peptide pieces of proteins that have a known biological function. Minimotif Miner (MnM) (http://mnm.engr.uconn.edu) is a computational minimotif prediction tool that analyzes protein queries for the presence of minimotifs. The basic algorithm employs sequence matching to check whether any of the experimentally validated motifs can be located in the query. It then uses a series of methods (known as filters) to eliminate possible false-positive predictions. Since the initial version of MnM, the MnM database has grown rapidly, and the number of minimotifs has increased from 462 to 294,933. This growth has also resulted in more false positives in our predictions. In our work, we have developed novel filters that address this problem using knowledge of cellular function and molecular function. Combining these with other filters based on protein-protein interaction, frequency score, and surface prediction score, we have developed computational combinations of individual filters that significantly increase the accuracy of minimotif prediction.
In addition, we studied a crucial fundamental operation in bioinformatics and biomedical informatics: the external, or out-of-core, selection problem. The selection problem is to find the i-th smallest element among a given set of input elements. ‘Out-of-core’ refers to the case in which the number of input elements is far larger than what core memory can hold. Applications include noise reduction (e.g., median filters) in signal or image processing, high-breakdown regression in robust statistics, clustering, neural networks, and data mining. These applications play an important role in computational biological science. We propose a novel algorithm that takes no more than (2+epsilon) passes over the data (epsilon being a very small fraction) and compare it with some of the best existing algorithms.
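To make the out-of-core selection problem concrete, the following minimal sketch finds the i-th smallest element of a data source that can be re-read sequentially but does not fit in memory. It is an illustration only and is not the algorithm proposed in the thesis: it uses a simple sample-then-count scheme and makes no attempt to reach the (2+epsilon)-pass bound. The names `select_ith`, `read_pass`, and `memory_limit` are hypothetical, and the input is assumed to consist of finite numbers.

```python
import random
from typing import Callable, Iterable


def select_ith(read_pass: Callable[[], Iterable[float]], i: int,
               memory_limit: int, seed: int = 0) -> float:
    """Return the i-th smallest (1-indexed) value of a re-readable stream.

    read_pass() starts a fresh sequential pass over the data. At most
    memory_limit values are held in memory at any time.
    """
    rng = random.Random(seed)
    lo, hi = float("-inf"), float("inf")  # answer lies strictly inside (lo, hi)
    below = 0                             # number of values <= lo
    while True:
        # Sampling pass: reservoir-sample the values that fall inside (lo, hi).
        sample, count = [], 0
        for x in read_pass():
            if lo < x < hi:
                count += 1
                if len(sample) < memory_limit:
                    sample.append(x)
                else:
                    j = rng.randrange(count)
                    if j < memory_limit:
                        sample[j] = x
        target = i - below                # rank of the answer inside (lo, hi)
        if not 1 <= target <= count:
            raise IndexError("i is out of range")
        sample.sort()
        if count <= memory_limit:
            # Every remaining candidate fits in memory: finish in core.
            return sample[target - 1]
        # Pick the sampled value whose rank best approximates the target.
        pivot = sample[(target - 1) * memory_limit // count]
        # Counting pass: find the exact rank of the pivot inside (lo, hi).
        less = equal = 0
        for x in read_pass():
            if lo < x < pivot:
                less += 1
            elif x == pivot:
                equal += 1
        if target <= less:
            hi = pivot                    # answer is smaller than the pivot
        elif target <= less + equal:
            return pivot                  # the pivot itself is the answer
        else:
            lo = pivot                    # answer is larger than the pivot
            below += less + equal
```

In this sketch, `read_pass` could be a function that reopens a file and yields one parsed number per line; each round narrows the candidate interval until the remaining values fit within `memory_limit`.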

Genre

doctoral dissertations

Organizations

Degree granting institution (dgg): University of Connecticut

Held By

Archives & Special Collections, University of Connecticut Library

Rights Statement

IN COPYRIGHT

Use and Reproduction

These materials are provided for educational and research purposes only.

Local Identifier

OC_d_168

OCLC Number

858863594
