SCHOLARLY WORK

Journals

  • Vargas-Pérez, S., and Saeed, F. "A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets", IEEE Transactions on Parallel and Distributed Systems, Vol. 28, pp. 2760-2769, October 2017.
    DOI: 10.1109/TPDS.2017.2692782

Workshops & Conferences

  • Vargas-Pérez, S. "Teaching Performance Metrics in Parallel Computing Courses", 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 385-390, San Francisco, CA, USA, 2024.
    DOI: 10.1109/IPDPSW63119.2024.00086
  • Vargas-Pérez, S. "Designing an Independent Study to Create HPC Learning Experiences for Undergraduates", 2022 IEEE 29th International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW), pp. 6-11, Bengaluru, India, December 2022.
    DOI: 10.1109/HiPCW57629.2022.00006
  • Vargas-Pérez, S., and Saeed, F. "Scalable Data Structure to Compress Next-Generation Sequencing Files and its Application to Compressive Genomics", 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1923-1928, November 2017.
    DOI: 10.1109/BIBM.2017.8217953
  • Vargas-Pérez, S., and Saeed, F. "A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets", Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15), Vol. 3, pp. 196-201, August 2015.
    DOI: 10.1109/Trustcom.2015.632

Senior Integrated Projects

The Senior Integrated Project (SIP) is a capstone experience akin to a senior thesis required of all students at Kalamazoo College.
  • Benjamin Saalberg, "Building and Parallelizing a SAR Image Environment Viewing Engine (SIEVE)," 2023.
    Honors
  • Chaeyoun Myoung, "Exploring Quantum Machine Learning for Differential Privacy: A Classification Approach on 3D Labeled Data," 2023.
  • Aleksandr Molchagin, "The Problem of Misinformation and Disinformation, and How Computer Science Can Combat It," 2022.
    Honors
  • Juleon Brodie, "Parallel Algorithm Design for Compression of DNA Data Files," 2021.
  • Ronan Wolfe, "Analysis of CUDA Approach to Next-Generation Sequencing Parallelization Using in compresso Application," 2021.
    Honors
  • Nathan Silverman, "Parallel Computing Made Easy," 2020.
  • Abdullah Qureshi, "Analysis of Complex Genomics," 2019.

Supervised Independent Studies

  • Sarah Jaimes Santos, "Minorities in Computing," 2024.
  • Usaid Bin Shafqat, "Configurations and Guide to Access Jigwé for Parallel Computing Classes," 2023.
  • Dahwi Kim, "Writing Bash Script for Linux/Unix Environments," 2020.
  • Nicolas McCabe, "Parallelizing Ray Tracing Program Using CUDA and GPUs," 2019.
  • Tim Rutledge, "Using GPUs to solve Sum Sudoku," 2019.

Research

  • Parallel in compresso Data Manipulation for Compressed Genomic Data
    PI and Co-PI: Sandino Vargas-Pérez and Cole Koryto.
    Possible Submission: 27th Workshop on Advances in Parallel and Distributed Computational Models.

    FASTQ files store DNA sequence reads and their associated quality scores. These files typically require processing to filter unnecessary information: low-quality bases are removed to cope with lower-quality data, along with adapter sequences, PCR primers, and other artifacts. Because of their massive size, FASTQ files are often distributed in compressed form. To lessen the dependency on decompression, we propose in compresso, a technique that facilitates handling DNA sequences in their compressed state through fine-grained decompression and data manipulation. Using hybrid parallelism with MPI and OpenMP, in compresso data manipulation has shown positive speedups for FASTQ-to-FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming, when compared to sequentially processed, uncompressed datasets.
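    As a much-simplified illustration of one of the operations mentioned above (not the actual in compresso implementation, which avoids decompressing the full file), the sketch below streams a gzip-compressed FASTQ input and emits FASTA records without ever materializing the uncompressed file on disk; the function name and record handling are illustrative assumptions.

```python
import gzip
import io

def fastq_to_fasta_stream(compressed_bytes):
    """Convert a gzip-compressed FASTQ stream to FASTA text.

    Streams the compressed input record by record, so the whole
    uncompressed file is never written out -- a coarse stand-in for
    the fine-grained manipulation described in the project abstract.
    """
    records = []
    with gzip.open(io.BytesIO(compressed_bytes), "rt") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()  # '+' separator line, discarded
            fh.readline()  # quality scores, not kept in FASTA
            records.append(">" + header[1:].strip() + "\n" + seq)
    return "\n".join(records)
```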
  • High-Performance Parallel Compression for Genomic Data: Leveraging Hybrid Parallelism and in compresso Analytics.
    PI: Sandino Vargas-Pérez.

    The field of genomics is generating massive amounts of data due to advancements in next-generation sequencing (NGS) technologies. Managing, storing, and analyzing these vast datasets poses significant computational and storage challenges. Efficient data compression is critical for reducing storage costs, speeding up data transfers, and enabling large-scale analyses. Traditional compression methods often fall short in handling the complexity and scale of genomic data due to repetitive sequences, high dimensionality, and the need for rapid decompression during analysis. Hybrid parallelism combines multiple parallel computing paradigms (e.g., data parallelism, task parallelism) to optimize performance. in compresso techniques integrate compression and analysis, allowing computational tasks to operate on compressed data directly, thereby saving time and computational resources.
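    One simplified way to let analyses touch compressed data selectively (not the project's actual codec, and with illustrative names and block size) is to compress records in independently decodable blocks, so a query decompresses only the block it needs:

```python
import zlib

def compress_blocks(records, block_size=2):
    """Compress records in independently decodable zlib blocks.

    Independent blocks let later queries decompress only the data
    they touch, instead of the entire dataset.
    """
    blocks = []
    for i in range(0, len(records), block_size):
        chunk = "\n".join(records[i:i + block_size]).encode()
        blocks.append(zlib.compress(chunk))
    return blocks

def get_record(blocks, index, block_size=2):
    """Fetch one record by decompressing only its containing block."""
    block = zlib.decompress(blocks[index // block_size]).decode()
    return block.split("\n")[index % block_size]
```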

    Existing genomic compression tools (e.g., Gzip, Bzip2, CRAM) often struggle to balance compression ratio, speed, and scalability, particularly in high-performance computing (HPC) environments, so there is a growing need for specialized solutions that can leverage parallel architectures efficiently.

    Objectives and Goals

    • To develop a high-performance, parallel compression library tailored for genomic datasets that leverages hybrid parallelism and in compresso techniques to enhance storage efficiency and computational performance.
    • To design a modular compression framework that can be easily integrated into existing genomic pipelines.
    • To implement hybrid parallelism techniques to optimize the compression and decompression processes for multi-core and distributed computing environments.
    • To implement in compresso algorithms that enable genomic analyses directly on compressed data, reducing the need for full decompression.
    • To evaluate the performance of the library against existing tools in terms of compression ratio, speed, scalability, and resource usage.
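    The parallel compression and decompression objective can be sketched in miniature with shared-memory workers standing in for the MPI+OpenMP hybrid (in the actual design, MPI ranks would own file partitions and OpenMP threads the per-rank chunks); the names and chunk size below are assumptions for illustration:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def parallel_compress(data, n_workers=4, chunk_size=1 << 16):
    """Compress fixed-size chunks concurrently, returned in order.

    zlib releases the GIL while compressing, so threads give real
    overlap here; an MPI+OpenMP version would distribute the same
    chunk list across ranks and threads.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(zlib.compress, chunks))

def parallel_decompress(blocks):
    """Reassemble the original bytes from independently compressed chunks."""
    return b"".join(zlib.decompress(b) for b in blocks)
```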