
GSoC 2024

Nezar Abdennur edited this page Mar 5, 2024 · 6 revisions

Open2C Google Summer of Code 2024 Project Idea List

GSoC 2024 Mentors

Nezar Abdennur, Ilya Flyamer, Geoff Fudenberg, Anton Goloborodko, Thomas Reimonn, Yao Xiao

Information for Students

Below you can find the potential projects available to GSoC 2024 contributors. Interested applicants are welcome to contact us on our GSoC2024 channel on Slack, or to send us an email, to brainstorm ideas before submitting an application.

We develop software for the analysis of the spatial organization of genomes, mostly leveraging the family of molecular technologies known as chromosome conformation capture (3C), primarily its high-throughput derivative called Hi-C and its many closely related techniques, which we’ll collectively refer to below as 3C+. We also develop tools for genomic and multi-omic data analysis more broadly within the Python data science ecosystem. We like our tools to be easy to use; flexible, to facilitate active development of novel analytical approaches; and scalable, to make use of the latest and largest datasets. We welcome Google Summer of Code 2024 contributors with proposals focusing on one of the topics below.

  • Informational slides from March 4th presentation

Project Ideas

Bioframe provides a framework for genomic data analysis using Pandas DataFrames, including genomic interval arithmetic.

GSoC 2024 applicants can contribute to Bioframe by:

  • extending operations to 2D genomic intervals (https://github.com/open2c/bioframe/issues/25);
  • implementing operations on binned genomes and their intervals (https://github.com/open2c/bioframe/issues/116);
  • enabling out-of-core genomic interval arithmetic (difficulty: hard);
  • creating an API for accessing genome assembly metadata (difficulty: easy);
  • developing example tutorials, such as adapting the muon tutorial to use bioframe (difficulty: easy).
  • Skills and Requirements: Python, data science with Python (numpy/pandas), background in math, 350 hours, Medium.
  • Mentors: Geoff Fudenberg, Anton Goloborodko, Nezar Abdennur
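To make the first two ideas concrete, here is a minimal sketch in plain Python of what an overlap test on 2D genomic intervals and a mapping from intervals to fixed-size bins could look like. It assumes half-open `[start, end)` coordinates. The names `Interval2D`, `overlaps_1d`, `overlaps_2d`, and `bin_range` are hypothetical and are not part of the bioframe API; a real implementation would more likely operate on whole DataFrames.

```python
from typing import NamedTuple


class Interval2D(NamedTuple):
    """Hypothetical 2D interval: two anchors, each half-open [start, end)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def overlaps_1d(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """Return True if two half-open 1D intervals share at least one base."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def overlaps_2d(a: Interval2D, b: Interval2D) -> bool:
    """Two 2D intervals overlap only if both pairs of anchors overlap."""
    return (
        overlaps_1d(a.chrom1, a.start1, a.end1, b.chrom1, b.start1, b.end1)
        and overlaps_1d(a.chrom2, a.start2, a.end2, b.chrom2, b.start2, b.end2)
    )


def bin_range(start: int, end: int, binsize: int) -> range:
    """Indices of the fixed-size bins touched by the half-open interval [start, end)."""
    return range(start // binsize, (end - 1) // binsize + 1)


loop = Interval2D("chr1", 1_000, 2_000, "chr1", 50_000, 51_000)
domain = Interval2D("chr1", 1_500, 3_000, "chr1", 50_500, 52_000)
elsewhere = Interval2D("chr1", 1_500, 3_000, "chr2", 50_500, 52_000)

print(overlaps_2d(loop, domain))           # True: both anchors overlap
print(overlaps_2d(loop, elsewhere))        # False: second anchors are on different chromosomes
print(list(bin_range(1_500, 3_000, 1_000)))  # [1, 2]
```

Treating each anchor independently keeps the 2D test a simple conjunction of two 1D tests, which is one possible starting point for generalizing existing 1D operations.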

Cooler is a standard storage format and Python package for Hi-C and 3C+ data. Based on HDF5, it is designed for the storage and manipulation of extremely large Hi-C datasets at any resolution, but is not limited to Hi-C data in any way. These massive heatmaps can be explored using a multiscale genome browser such as HiGlass and analyzed with a growing array of downstream analysis software, including cooltools.

GSoC 2024 applicants can contribute to cooler by:

  • Implementing the powerful and flexible Zarr storage system as an alternative and cloud-friendly backend for cooler.
  • Providing solutions for an Xarray-based API for genomic heatmaps via cooler.
  • Challenges: Harmonizing differences between Zarr and HDF5 APIs.
  • Skills and Requirements: Python programming, numpy, minimal familiarity with at least one of HDF5, Zarr or Xarray, 350 hours commitment, Medium
  • Mentors: Nezar Abdennur, Thomas Reimonn
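One possible way to approach the challenge of harmonizing Zarr and HDF5 is to have analysis code depend on a small storage interface rather than on either library directly. The sketch below only illustrates that idea: `ArrayStore`, `InMemoryStore`, and `total_count` are hypothetical names, the `"pixels/count"` path is used purely for illustration, and no h5py or zarr calls are made. A real backend would wrap an HDF5 file or a Zarr group behind the same two methods.

```python
from typing import Dict, List, Protocol


class ArrayStore(Protocol):
    """Hypothetical minimal interface that each storage backend would implement."""

    def write(self, path: str, values: List[int]) -> None: ...

    def read(self, path: str, start: int, stop: int) -> List[int]: ...


class InMemoryStore:
    """Stand-in backend that keeps arrays in a dict, keyed by path."""

    def __init__(self) -> None:
        self._arrays: Dict[str, List[int]] = {}

    def write(self, path: str, values: List[int]) -> None:
        self._arrays[path] = list(values)

    def read(self, path: str, start: int, stop: int) -> List[int]:
        # Only the requested slice is returned, mimicking partial reads from disk.
        return self._arrays[path][start:stop]


def total_count(store: ArrayStore, start: int, stop: int) -> int:
    """Sum contact counts over a slice of pixels, independent of the backend."""
    return sum(store.read("pixels/count", start, stop))


store = InMemoryStore()
store.write("pixels/count", [3, 1, 4, 1, 5])
print(total_count(store, 1, 4))  # 6
```

Because `total_count` only sees the interface, swapping the backend would not require changing the analysis code, which is the property a Zarr backend would need to preserve.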

cooltools provides a suite of computational tools for downstream analysis of genomic contact maps stored in cooler files. Individual datasets are typically much larger than what fits in memory at once, demanding an out-of-core data processing approach. The unified CLI + Python API design facilitates creating workflows on high-performance computing clusters as well as in custom data analysis notebooks or simple scripts. To support interpreting and extracting biological insights from Hi-C and other 3C+ datasets, Open2C also maintains a collection of detailed educational tutorials on key concepts in 3C+ data analysis, using interactive notebooks based largely on cooltools; see open2c_examples.

GSoC 2024 applicants can contribute to cooltools by:

  • Migrating log-smoothing code to a mini repository (https://github.com/open2c/cooltools/issues/505).
  • Implementing and optimizing scalable, sparse eigen-decomposition and other matrix factorization methods for Hi-C data.
  • Challenges: parallelization and optimization of the process.
  • Skills and Requirements: Python data science stack (numpy/pandas), a math background can be useful (for linear algebra), 350 hours, Hard
  • Mentors: Geoff Fudenberg, Ilya Flyamer, Yao Xiao
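As a rough illustration of what "sparse eigen-decomposition" means here, the sketch below estimates the leading eigenvalue and eigenvector of a symmetric matrix stored only as upper-triangle `(i, j, value)` triples, using plain power iteration. Power iteration is chosen for brevity and is not claimed to be the method cooltools uses. The function names and the tiny example matrix are hypothetical.

```python
import math


def sparse_symmetric_matvec(pixels, vector):
    """Multiply a symmetric matrix, given as upper-triangle (i, j, value) triples, by a vector."""
    result = [0.0] * len(vector)
    for i, j, value in pixels:
        result[i] += value * vector[j]
        if i != j:  # mirror the off-diagonal entry into the lower triangle
            result[j] += value * vector[i]
    return result


def leading_eigenpair(pixels, n, n_iter=200):
    """Estimate the dominant eigenvalue and eigenvector by power iteration.

    Assumes the uniform start vector is not orthogonal to the leading eigenvector.
    """
    vector = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(n_iter):
        product = sparse_symmetric_matvec(pixels, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
        # Rayleigh quotient of the normalized vector approximates the eigenvalue.
        eigenvalue = sum(
            v * w for v, w in zip(vector, sparse_symmetric_matvec(pixels, vector))
        )
    return eigenvalue, vector


# Upper triangle of [[2, 1, 0], [1, 2, 0], [0, 0, 1]]; its eigenvalues are 3, 1, 1.
pixels = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 2.0), (2, 2, 1.0)]
eigenvalue, eigenvector = leading_eigenpair(pixels, n=3)
print(round(eigenvalue, 6))  # 3.0
```

The matrix is never densified: each iteration is a single pass over the stored pixels, so the same loop could in principle be run chunk by chunk over a file that does not fit in memory.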

Pairtools is a simple and fast command-line framework for low-level, stream-based processing of sequencing data from a 3C+ experiment. It fulfills the fundamental step of 3C+ data processing: detecting genomic contacts from experimental sequencing data. It also provides tools to sort, manipulate, filter, and classify these pairs, to design feature-rich pipelines for specialized experimental protocols or studies, and to perform quality assessment of the billions of contacts detected in a given experiment.

GSoC 2024 applicants can contribute to pairtools by:

  • Turning the pairtools CLI into a domain-specific language (DSL) allowing on-demand pipeline construction.
  • Developing a binary pairs format using Apache Parquet.
  • Implementing cheaper I/O using technologies like Apache Arrow or Dask.
  • Challenges: Parallel implementation of parsing and other crucial steps of Hi-C data processing.
  • Skills and Requirements: Python programming, numpy and pandas, design of CLI tools following Unix style guidelines, 350 hours, Medium
  • Mentors: Anton Goloborodko
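To give a feel for stream-based processing of pairs, here is a minimal sketch of a filter that reads a pairs stream line by line and keeps only same-chromosome contacts above a minimum separation. It assumes tab-separated records with the column order `readID chrom1 pos1 chrom2 pos2 strand1 strand2` and header lines starting with `#`. The function `filter_cis_pairs` and the example records are hypothetical and do not use or reproduce the pairtools API.

```python
import io


def filter_cis_pairs(lines, min_distance=0):
    """Yield header lines unchanged, plus same-chromosome pairs at least min_distance apart."""
    for line in lines:
        if line.startswith("#"):
            yield line  # pass header lines through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        chrom1, pos1 = fields[1], int(fields[2])
        chrom2, pos2 = fields[3], int(fields[4])
        if chrom1 == chrom2 and abs(pos2 - pos1) >= min_distance:
            yield line


# Made-up example records, assuming the column order stated in the header.
example = "".join(
    [
        "## pairs format v1.0\n",
        "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n",
        "r1\tchr1\t100\tchr1\t5000\t+\t-\n",
        "r2\tchr1\t200\tchr2\t300\t+\t+\n",
        "r3\tchr2\t1000\tchr2\t1200\t-\t+\n",
    ]
)

kept = list(filter_cis_pairs(io.StringIO(example), min_distance=1000))
for line in kept:
    print(line, end="")
```

Because the filter is a generator that holds one record at a time, several such steps can be chained without loading the whole file, which is the property any DSL or columnar (Parquet/Arrow) redesign would want to keep.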