Efficient Techniques for Big Data Processing in Data Integration and Motif Search
Handle
http://hdl.handle.net/11134/20002:860639982
Persons
Creator (cre): Mi, Tian
Major Advisor (mja): Rajasekaran, Sanguthevar
Associate Advisor (asa): Mandoiu, Ion
Associate Advisor (asa): Wu, Yufeng
Associate Advisor (asa): Ammar, Reda
Title
Efficient Techniques for Big Data Processing in Data Integration and Motif Search
Digital Origin
born digital
Description
The rapid growth of data in bioinformatics and biomedical informatics brings new challenges to these areas. In this thesis, we present efficient computational algorithms for big data processing in data integration and motif search.

Data integration, or record linkage, is the problem of identifying information that pertains to the same entity but exists in different data sources, in the absence of a global identifier. For instance, there could be multiple records for the same individual with different healthcare providers. Several algorithms proposed in the literature are adept at integrating records from two different datasets. However, their limitations show up when they face multiple (more than two) data sources, and in practice we often have to deal with many more than two datasets. We propose efficient algorithms based on hierarchical clustering to handle massive data from multiple sources.

In motif prediction, minimotifs (also called Short Linear Motifs) are short contiguous peptide pieces of proteins that have a known biological function. Minimotif Miner (MnM) (http://mnm.engr.uconn.edu) is a computational minimotif prediction tool that analyzes protein queries for the presence of minimotifs. The basic algorithm employs sequence matching and checks whether any of the experimentally validated motifs can be located in the query. It then uses a series of methods (known as filters) to eliminate possible false-positive predictions. Since the initial version of MnM, the MnM database has grown rapidly and the number of minimotifs has increased from 462 to 294,933. This growth has also resulted in more false positives in our predictions. In our work, we have developed novel filters that address this problem using knowledge of cellular function and molecular function. Together with other filters based on protein-protein interaction, frequency score, and surface prediction score, we have developed a computational combination of individual filters that significantly increases the accuracy of minimotif prediction.

In addition, we study a crucial fundamental operation in bioinformatics and biomedical informatics: the external, or out-of-core, selection problem. The selection problem is to find the i-th smallest element among a given set of input elements. "Out-of-core" refers to the case where the number of input elements is much larger than what the core memory can hold. Applications include noise reduction (e.g., median filters) in signal or image processing, high-breakdown regression in robust statistics, clustering, neural networks, and data mining. These applications play an important role in computational biological science. We propose a novel algorithm that takes no more than (2 + ε) passes over the data (ε being a very small fraction) and compare our algorithm with some of the best existing algorithms.
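To make the out-of-core selection problem concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm proposed in the thesis; it is a generic randomized sampling approach, shown only to illustrate how an i-th smallest element can usually be found in two sequential passes when only a bounded number of elements fit in memory. The names `select_two_pass`, `scan`, and `memory`, and the choice of bracket width, are assumptions made for this example.

```python
import math
import random
from typing import Callable, Iterable


def select_two_pass(scan: Callable[[], Iterable[int]], i: int, memory: int,
                    seed: int = 0, max_tries: int = 20) -> int:
    """Return the i-th smallest element (1-based) of the data behind `scan`.

    `scan()` is a hypothetical stand-in for one sequential pass over data on
    disk. At most about `memory` elements are held in memory at any time.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        # Pass 1: reservoir-sample up to `memory` elements and count the input.
        sample, n = [], 0
        for x in scan():
            n += 1
            if len(sample) < memory:
                sample.append(x)
            else:
                j = rng.randrange(n)
                if j < memory:
                    sample[j] = x
        if not 1 <= i <= n:
            raise IndexError("rank i is out of range")
        sample.sort()
        s = len(sample)
        if s == n:
            return sample[i - 1]  # the whole input fit in memory

        # Pick two sample elements that should bracket the target rank.
        center = i * s / n
        slack = 2 * math.sqrt(s)  # assumed bracket width for this sketch
        lo_idx = math.floor(center - slack)
        hi_idx = math.ceil(center + slack)
        lo = sample[lo_idx] if lo_idx >= 0 else None  # None means no lower bound
        hi = sample[hi_idx] if hi_idx < s else None   # None means no upper bound

        # Pass 2: count elements below the bracket and keep those inside it.
        below, bucket, overflow = 0, [], False
        for x in scan():
            if lo is not None and x < lo:
                below += 1
            elif hi is None or x <= hi:
                bucket.append(x)
                if len(bucket) > memory:
                    overflow = True
                    break
        if not overflow and below < i <= below + len(bucket):
            bucket.sort()
            return bucket[i - below - 1]
        # Otherwise the bracket missed or was too large: resample and retry.
    raise RuntimeError("selection did not succeed within max_tries")
```

For example, with `data` held in a list that simulates the on-disk input, `select_two_pass(lambda: iter(data), len(data) // 2, memory=3000)` returns the lower median. When the bracket misses the target or overflows the memory budget, the sketch simply retries with a fresh sample, which costs two additional passes per retry.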
Organizations
Degree granting institution (dgg): University of Connecticut
Use and Reproduction
These materials are provided for educational and research purposes only.
Local Identifier
OC_d_168
OCLC Number
858863594