NSU Scientists Develop One of the Largest Genetic Association Databases

NSU scientists have developed one of the largest genetic association databases in the world. The database contains billions of associations between genome variants and human traits, identified in hundreds of studies conducted by the international scientific community. Information about these associations increases our understanding of human genetics and biology and can contribute to the diagnosis, prevention, and treatment of diseases. The results of this work were published in the open-access journal “Nucleic Acids Research”.

Genome-wide association studies (GWAS) are the primary tool for identifying genetic factors that influence quantitative traits and the risk of developing common human diseases. Information about the associations identified in a GWAS helps researchers study the etiology of human diseases and develop risk prediction models. It can also be useful in the search for candidate biomarkers, therapeutic interventions, and targets for these interventions. The number of genetic associations studied by the scientific community is growing rapidly, but the use of this data is limited by its large volume and by the lack of uniform standards for format and quality.

For many years, scientists from the Theoretical and Applied Functional Genomics Laboratory of the NSU Natural Sciences Department, in collaboration with colleagues from PolyKnomics (the Netherlands), collected information on associations and developed a computing infrastructure and computational methods for unification, quality control, and analysis. As a result of collecting and processing tens of terabytes of raw data, the researchers obtained one of the world's largest databases of genetic associations. Tatyana Shashkova, Junior Researcher at the Laboratory, commented on the result:

“We hope the database we developed can be used to solve a wide range of problems, from fundamental research of human genetics to the development of predictive models and the search for candidate therapeutic effects”.

The database contains complete results of association studies for more than 7,000 traits, including quantitative traits, common diseases, and levels of metabolites, proteins, and glycans, as well as the results of several large-scale studies of gene transcription control. Overall, the database contains data on more than 75 billion genetic associations. To provide access to the database, the PheLiGe web interface was developed. The team also created the GWAS-MAP system, which provides access to the database and a wide range of analyses through a command line interface.

PolyKnomics CEO Lennart Karssen added, “The technological solution we developed with NSU is multipurpose. For example, it can be scaled to store and process information about millions of genomes. Such big data emerges in the context of national biobanking programs or genomic breeding programs”.


The diagram above illustrates the data processing model. The integration module is responsible for converting the summary statistics of genome-wide association studies into a universal format and for data quality control. The reference table is used to check and filter allelic variants. If summary statistics pass quality control, they are uploaded together with metadata to the databases (DB module). Finally, the data is made available to an external user through a web interface.
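
The integration step described above can be illustrated with a minimal Python sketch. It is not the actual GWAS-MAP code: the field names (variant_id, effect_allele, other_allele, beta, se, p_value), the structure of the reference table, and the specific quality checks are all assumptions chosen for illustration. The sketch converts raw rows into one record format, aligns alleles to a reference table, and keeps only the rows that pass the checks.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reference table: variant ID -> (reference allele, alternative allele).
REFERENCE = {
    "rs1": ("A", "G"),
    "rs2": ("C", "T"),
}


@dataclass
class Association:
    """One harmonized association; the effect allele is always the alternative allele."""
    variant_id: str
    effect_allele: str
    other_allele: str
    beta: float
    se: float
    p_value: float


def harmonize(row: dict, reference: dict) -> Optional[Association]:
    """Return a record aligned to the reference alleles, or None if a check fails."""
    ref = reference.get(row["variant_id"])
    if ref is None:
        return None  # variant is absent from the reference table

    effect = row["effect_allele"].upper()
    other = row["other_allele"].upper()
    beta, se, p = float(row["beta"]), float(row["se"]), float(row["p_value"])

    # Assumed basic sanity checks on the statistics.
    if se <= 0 or not (0 < p <= 1):
        return None

    ref_allele, alt_allele = ref
    if (effect, other) == (ref_allele, alt_allele):
        # Effect was reported for the reference allele: swap alleles, flip the sign.
        effect, other, beta = other, effect, -beta
    elif (effect, other) != (alt_allele, ref_allele):
        return None  # alleles do not match the reference table

    return Association(row["variant_id"], effect, other, beta, se, p)


def integrate(rows, reference):
    """Keep only the rows that pass harmonization and quality control."""
    harmonized = (harmonize(row, reference) for row in rows)
    return [record for record in harmonized if record is not None]
```

In this sketch, a row whose effect is reported for the reference allele is kept with its alleles swapped and the sign of beta reversed, while rows with unknown variants or mismatching alleles are dropped. A real pipeline would then write the accepted records, together with study metadata, to the database.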