Novosibirsk Bioinformaticians Increase Genome Data Analysis Efficiency

Bioinformaticians from Novosibirsk State University (NSU), the Institute of Cytology and Genetics SB RAS (ICG), and Martin Luther University (Germany) have created a unique software package to analyze genome-wide data hidden in DNA. Their software improves the analysis of expensive genomic experiments and significantly reduces their cost. The results of this work were published in the journal “Nucleic Acids Research”.

According to the researchers, the software searches for common DNA motifs where proteins that control the reading of information encoded in the DNA molecule “sit”. Adjacent motifs usually function together, so identifying these pairs allows scientists to predict protein interactions at the stage of DNA sequence analysis and to investigate the role of these interactions in physiological processes.

Researchers spent two years creating the software package, with scientists from NSU and ICG making a significant contribution. The project was developed, and continues to be developed, by Victor Levitsky and Elena Zemlyanskaya from the Computational Transcriptomics and Evolutionary Bioinformatics Laboratory (CTEBL) at the NSU Natural Sciences Department and ICG, and by ICG researcher Dmitry Oshchepkov. Tatyana Merkulova, Associate Professor in the NSU Information Biology Section; Victoria Mironova, CTEBL researcher; and Ivo Grosse (Germany), Scientific Director of CTEBL, were also involved in the project.

Victor Levitsky described the significance of this work:

“We were able to obtain more detailed information about the regulatory role of a protein based on the mass sequencing of its binding sites. Previous approaches for identifying the partners of a specific protein that regulates gene expression were reliable for 3-5 protein partners. Our method identifies 10-15 protein partners.”

Millions of cells in the body synthesize proteins that work continuously to carry oxygen, protect against invasion by foreign agents, contract and relax muscle fibers, and perform many other functions. Information about where and when these actions should be performed is encrypted in the DNA molecule, and the information is recorded using only four nucleotide "letters". Nucleotides are combined into "words", or genes, and each gene carries information about a protein that can be synthesized from it. The structure and function of a cell are determined by its unique combination of proteins, and regulatory elements of DNA “decide” what that combination will be. The structural units of these elements, short sequences of nucleotide “letters” called motifs, are recognized by regulatory proteins (transcription factors). This recognition either initiates or, conversely, blocks the reading of genetic information.

An expensive experiment called ChIP-seq is used to find all the motifs of a particular regulatory protein in the genome. However, regulatory proteins never work alone. Numerous partner regulatory proteins modulate the activity and specificity of each protein, and the effect of a motif is often determined by these interactions. The search for potential partners usually involves additional ChIP-seq experiments, which greatly increases the cost. The new software package solves this problem.

Victor Levitsky provided an analogy for their work:

“Think of regulatory proteins as a small population of two thousand people. You know that a small number of specific people (10-20) work together in one room, and you, as a researcher, need to determine the composition of this working group using only sound. You can identify a few hundred people by voice, but the problem is that there are often people in the population who work very quietly, so you don’t hear them. Therefore, using sound without additional visual data, most of the working group remains unknown to you. Our development adds video to the audio information. The audio is the location of DNA motifs that do not overlap, and the video is the location of motifs that do overlap. Prior to our development, an analysis of one ChIP-seq experiment could only identify pairs of motifs that do not overlap in DNA. Thus, we have added a new dimension to describe the functionality of the object being studied.”
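The difference between motif pairs that overlap and pairs separated by a spacer can be illustrated with a minimal Python sketch. This is not the researchers' algorithm: it matches simple consensus strings exactly, whereas real motif tools typically score sites with position weight matrices and test whether pairs are statistically enriched in ChIP-seq data. The function names, the `max_spacer` parameter, and the two example consensus strings are assumptions chosen only for illustration.

```python
import re
from typing import List, Tuple

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(consensus: str, seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals for every consensus match.

    A lookahead is used so that matches overlapping each other are all found.
    """
    body = "".join(IUPAC[c] for c in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")
    width = len(consensus)
    return [(m.start(), m.start() + width) for m in pattern.finditer(seq.upper())]


def classify_pairs(
    sites_a: List[Tuple[int, int]],
    sites_b: List[Tuple[int, int]],
    max_spacer: int = 10,  # hypothetical cutoff for calling two sites "adjacent"
) -> List[Tuple[int, int, str, int]]:
    """Label each nearby site pair as 'overlap' or 'spacer'.

    Each result is (start_a, start_b, label, size), where size is the number
    of shared positions for an overlap or the gap length for a spacer.
    """
    pairs = []
    for a_start, a_end in sites_a:
        for b_start, b_end in sites_b:
            # Negative gap means the two intervals share positions.
            gap = max(a_start, b_start) - min(a_end, b_end)
            if gap < 0:
                pairs.append((a_start, b_start, "overlap", -gap))
            elif gap <= max_spacer:
                pairs.append((a_start, b_start, "spacer", gap))
    return pairs


if __name__ == "__main__":
    motif_a = "CACGTG"    # illustrative consensus only
    motif_b = "TGACGTCA"  # illustrative consensus only
    for seq in ("AACACGTGACGTCAAA", "CACGTGAAATGACGTCA"):
        pairs = classify_pairs(find_sites(motif_a, seq), find_sites(motif_b, seq))
        print(seq, pairs)
```

In the first toy sequence the two sites share two positions, which corresponds to the "video" case in the analogy; in the second they are separated by three nucleotides, the "audio" case that earlier single-experiment analyses could already detect.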

The Novosibirsk scientists have received a patent for their program, and it is ready for practical applications. In the past few years, open databases have appeared and continue to expand, now holding several tens of thousands of ChIP-seq experiments for various types of tissues, cells, and regulatory proteins. The Siberian scientists’ algorithm can be used to search for new partners of well-known regulatory proteins that are key to important physiological functions of the body, for example, the immune response.

This research was supported by the Russian Foundation for Basic Research, Project No. 18-29-13040, and by State Budget Project No. 0324-2019-0040.