About Me

I am Assistant Professor in the Department of Computer Science at Kalamazoo College. I obtained my M.S. in 2012 and Ph.D. in 2017, both from Western Michigan University (WMU).

I have a passion for high performance computing (HPC) and the use of parallel algorithms to solve big data problems in the realms of DNA and genomics. I'm also interested in computational biology and the many ways that computer science can, through the development of specialized algorithms, assist in the handling, processing, and analysis of different biological systems.

For my dissertation I worked with supercomputer clusters from XSEDE to develop hybrid parallel algorithms that can accelerate the compression of DNA sequence data, allowing scientists and specialists to work with the data in its compressed state, without the need for decompression.

I was recognized by the CS Department at WMU with the Graduate Research Publication Award in 2017 for my publication titled "A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets" in the journal IEEE Transactions on Parallel and Distributed Systems. In addition, I was invited to present my research at the 13th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA-15) in Helsinki, Finland.

Regarding long-term goals, I plan to cultivate and promote the problem-solving nature of our species, making students and other researchers aware of the possibilities of computational power for the betterment of the human condition.

A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and in compresso Data Processing

Due to the rapid development of high-throughput low cost Next-Generation Sequencing, genomic file transmission and storage is now one of the many Big Data challenges in computer science. Highly specialized compression techniques have been devised to tackle this issue, but sequential data compression has become increasingly inefficient and existing parallel algorithms suffer from poor scalability. Even the best available solutions can take hours to compress gigabytes of data, making the use of these techniques for large-scale genomics prohibitively expensive in terms of time and space complexity.

This dissertation responds to the aforementioned problem by presenting a novel hybrid parallel approach to speed up the compression of big genomic datasets by combining the features of both distributed and shared memory architectures and parallel programing models. The algorithm that the approach relies on has been developed with several goals in mind: to balance the work load among processes and threads, to alleviate memory latency by exploiting locality, and to accelerate I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, an innovative timestamp-based file structure was designed. It allows the compressed data to be written in a distributed and non-deterministic fashion while retaining the capability of decompressing the dataset back to its original state. In an effort to lessen the dependency on decompression, the proposed file structure facilitates the handling of DNA sequences in their compressed state by using fine-grained decompression in a technique that is identified as in compresso data manipulation.

Theoretical analysis and experimental results from this research suggest strong scalability, with many datasets yielding super-linear speedups and constant efficiency. Performance measurements were executed to test the limitations imposed by Amdahl's law when doubling the number of processing units. The compression of a FASTQ file of size 1 terabyte took less than 3.5 minutes, reporting a 90% time decrease against compression algorithms with multithreaded parallelism and more than 98% time decrease for those running sequential code-i.e., 10 to 50 times faster, respectively-improving the execution time proportionally to the added system resources. In addition, the proposed approach achieved better compression ratios by employing an entropy-based encoder optimized to work close to its Shannon entropy and a dictionary-based encoder with variable-length codes. Consequently, in compresso data manipulation was accelerated for FASTQ to FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming. Findings of this research provide evidence that hybrid parallelism can also be implemented using CPU+GPU models, potentially increasing the computational power and portability of the approach.


Fahad Saeed, Ph.D. Chair
Ajay Gupta, Ph.D.
Todd Barkman, Ph.D.