HOME

About Me

I am an Associate Professor in the Department of Computer Science at Kalamazoo College. I obtained my M.S. in 2012 and my Ph.D. in 2017, both from Western Michigan University (WMU).

My research and scholarly activity center on the study and application of parallel computing to several problems in computer science, such as data compression for massive files, the acceleration of computations through task- and data-parallelism schemes, and the design and analysis of parallel algorithms for bioinformatics. I also explore parallel computing education and the design of college curricula that integrate recent technological advances in this area. Because I place great value on mentoring, I encourage my students to conduct research with me and to contribute to collaborative scholarship; research with my students is fundamental to my work.

As an educator, my teaching philosophy and practices center on creating an environment where students can become competent, informed, and accountable computer scientists. Driven by a commitment to progressive education, I infuse my courses with opportunities for students to acquire rigorous knowledge and essential and practical skills while fostering an awareness of the importance of computer science to our world. My course design and curriculum embody my commitment to diversity, equity, inclusion, and accessibility, consistently addressing issues of representation and the erasure of individuals whose identities and contributions to the development of computer science have been marginalized and made invisible.

Regarding long-term goals, I plan to cultivate and promote the problem-solving nature of our species, making students and other researchers aware of the possibilities of computational power for the betterment of the human condition.

A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and in compresso Data Processing
Abstract

Due to the rapid development of high-throughput, low-cost Next-Generation Sequencing, genomic file transmission and storage is now one of the many Big Data challenges in computer science. Highly specialized compression techniques have been devised to tackle this issue, but sequential data compression has become increasingly inefficient, and existing parallel algorithms suffer from poor scalability. Even the best available solutions can take hours to compress gigabytes of data, making these techniques prohibitively expensive in time and space for large-scale genomics.

This dissertation responds to the aforementioned problem by presenting a novel hybrid parallel approach that speeds up the compression of big genomic datasets by combining the features of both distributed- and shared-memory architectures and parallel programming models. The algorithm underlying the approach was developed with several goals in mind: to balance the workload among processes and threads, to alleviate memory latency by exploiting locality, and to accelerate I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, an innovative timestamp-based file structure was designed. It allows the compressed data to be written in a distributed and non-deterministic fashion while retaining the capability of decompressing the dataset back to its original state. To lessen the dependency on decompression, the proposed file structure also facilitates the handling of DNA sequences in their compressed state through fine-grained decompression, a technique identified as in compresso data manipulation.
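The timestamp-based file structure itself is specific to the dissertation; as a rough illustration of the underlying idea (records tagged so they can be written in whatever order workers finish, yet still be reassembled exactly), here is a minimal Python sketch that uses zlib as a stand-in compressor and a per-record block index in place of the actual timestamp header:

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor, as_completed

HEADER = struct.Struct("<QQ")  # (original block index, compressed payload length)

def _compress_block(index: int, block: bytes):
    return index, zlib.compress(block)

def compress_out_of_order(data: bytes, out_path: str, block_size: int = 1 << 16) -> None:
    """Compress fixed-size blocks concurrently and write each as it finishes.

    Every record carries its original block index, so the on-disk record
    order is irrelevant -- a simplified analogue of the timestamp idea.
    """
    blocks = [data[off:off + block_size] for off in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool, open(out_path, "wb") as out:
        futures = [pool.submit(_compress_block, i, b) for i, b in enumerate(blocks)]
        for fut in as_completed(futures):  # completion order, not block order
            index, payload = fut.result()
            out.write(HEADER.pack(index, len(payload)))
            out.write(payload)

def decompress(in_path: str) -> bytes:
    """Scan records in file order, then reassemble the blocks by index."""
    parts = {}
    with open(in_path, "rb") as f:
        while header := f.read(HEADER.size):
            index, length = HEADER.unpack(header)
            parts[index] = zlib.decompress(f.read(length))
    return b"".join(parts[i] for i in sorted(parts))
```

Because each record is self-describing, decompression never depends on the order in which distributed workers happened to write, which is what makes non-deterministic parallel output safe.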

Theoretical analysis and experimental results from this research suggest strong scalability, with many datasets yielding super-linear speedups and constant efficiency. Performance measurements were conducted to test the limitations imposed by Amdahl's law when doubling the number of processing units. The compression of a FASTQ file of size 1 terabyte took less than 3.5 minutes, a 90% time decrease against compression algorithms with multithreaded parallelism and more than a 98% time decrease against those running sequential code (i.e., 10 to 50 times faster, respectively), with execution time improving proportionally to the added system resources. In addition, the proposed approach achieved better compression ratios by employing an entropy-based encoder optimized to work close to its Shannon entropy and a dictionary-based encoder with variable-length codes. Consequently, in compresso data manipulation was accelerated for FASTQ to FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming. Findings of this research provide evidence that hybrid parallelism can also be implemented using CPU+GPU models, potentially increasing the computational power and portability of the approach.
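Two quantities referenced above can be made concrete with small helpers (illustrative only; these are the standard textbook formulas, not code from the dissertation): the Amdahl's-law upper bound on speedup for a given parallel fraction, and the Shannon entropy that bounds how few bits per symbol an entropy coder can reach. Super-linear speedups, as reported here, exceed the Amdahl bound, typically because added nodes also add cache and memory capacity.

```python
from collections import Counter
from math import log2

def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Amdahl's-law upper bound on speedup with n_units processing units."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol: the lower bound an entropy
    coder approaches (2 bits/symbol for uniformly distributed A/C/G/T)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

For example, a program that is 95% parallel tops out near 15x speedup on 64 units, while a DNA string using all four bases equally has entropy of exactly 2 bits per symbol.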

Access full dissertation →

Committee

Fahad Saeed, Ph.D. (Chair)
Ajay Gupta, Ph.D.
Todd Barkman, Ph.D.