Omics Data Mining

Overview of our analysis pipeline.

Active since 2000

Research rationale

In the late 1990s, new techniques emerged for measuring gene expression at the scale of entire genomes. Combined with biological knowledge, these quantitative measurements paved the way for deciphering the activity of genes, their interactions and their involvement in various biological processes. However, analysing this mass of data remained a manual task, for two reasons: the many sources storing biological data were completely independent of each other, and tools to analyse these data automatically did not yet exist. Solutions for integrating heterogeneous data and automating their analysis were needed more than ever.

Results

A methodology to ease data integration using Semantic Web technologies

Our research started with the idea that Semantic Web technologies, which provide a common framework allowing data to be shared and reused between applications, might be applied to the management of scattered biological data. We studied and reported the specific features of biological data that make applying these technologies to the life sciences a real challenge. We then proposed a methodology to facilitate data integration using Semantic Web technologies, whose precepts were very close to the rules that currently govern the Web of Data (Pasquier 2008, 2011). We implemented our ideas in AllOnto, a Knowledge Base System capable of storing and querying large sets of RDF/OWL specifications, including reified statements. The software was designed to handle the provenance of information and included reasoning capabilities dealing with type inference, transitivity and built-in OWL constructs like owl:sameAs and owl:inverseOf.
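
To make the reasoning capabilities mentioned above more concrete, here is a minimal sketch of forward-chaining inference over a set of triples. It is not AllOnto's implementation: the `infer` function, the `ex:` and `db:` identifiers and the choice of `ex:partOf` as a transitive property are all assumptions made for illustration, and type inference is left out for brevity.

```python
from itertools import product

# Hypothetical vocabulary: predicate and entity names are illustrative only.
SAME_AS = "owl:sameAs"
INVERSE_OF = "owl:inverseOf"
TRANSITIVE = {"ex:partOf"}  # properties we assume are declared transitive


def infer(triples):
    """Return the closure of `triples` under three simple rules:
    owl:inverseOf, transitivity, and owl:sameAs substitution."""
    facts = set(triples)
    while True:
        new = set()
        # Collect declared inverse properties and identity links (both symmetric).
        inverses = {(s, o) for s, p, o in facts if p == INVERSE_OF}
        inverses |= {(o, s) for s, o in inverses}
        same = {(s, o) for s, p, o in facts if p == SAME_AS}
        same |= {(o, s) for s, o in same}
        for s, p, o in facts:
            for p1, p2 in inverses:
                if p == p1:
                    new.add((o, p2, s))  # inverse statement
            for a, b in same:
                if s == a:
                    new.add((b, p, o))  # substitute an identical subject
                if o == a:
                    new.add((s, p, b))  # substitute an identical object
        for (s1, p1, o1), (s2, p2, o2) in product(facts, repeat=2):
            if p1 == p2 and p1 in TRANSITIVE and o1 == s2:
                new.add((s1, p1, o2))  # transitive chaining
        if new <= facts:  # fixpoint reached: nothing new was derived
            return facts
        facts |= new


triples = {
    ("ex:encodes", INVERSE_OF, "ex:encodedBy"),
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:nucleolus", "ex:partOf", "ex:nucleus"),
    ("ex:nucleus", "ex:partOf", "ex:cell"),
    ("ex:geneA", SAME_AS, "db:G0001"),
}
closure = infer(triples)
```

In this toy example, the closure contains the inverse statement linking `ex:proteinA` back to `ex:geneA`, the chained `ex:partOf` statement from `ex:nucleolus` to `ex:cell`, and a copy of the gene's statements under its second identifier `db:G0001`. The quadratic loop is only suitable for small examples; a real knowledge base would rely on indexing.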

AllOnto was used to collect and integrate the data for the three data mining approaches we investigated.

Automatic gene annotation through a data-driven approach

The data-driven approach involves first identifying groups of genes whose expression shows similar variation and then integrating knowledge about those genes. The research on this topic took the form of an integrated system called THEA (Tools for High-throughput Experiments Analysis). The software integrates several data mining algorithms to automatically annotate groups of genes sharing similar expression profiles with biological information (various ontologies, chromosomal localization, links with diseases). Experiments show that THEA not only makes it quick and easy to reproduce all the results previously highlighted manually, but also pinpoints new findings (Pasquier and Christen 2004; Pasquier et al. 2004).
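
As an illustration of the second step of a data-driven analysis, the sketch below scores how over-represented each annotation term is within a group of co-expressed genes. The hypergeometric test used here is one common choice and is an assumption of this example; the text above does not state which statistics THEA applies. The function names and the toy data are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, group, hits):
    """P(X >= hits) when drawing `group` genes without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, group)
    total = comb(population, group)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, group - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


def enriched_terms(cluster, annotations, alpha=0.05):
    """Return (term, p-value) pairs for terms over-represented in `cluster`.

    `annotations` maps every gene of the background to its set of terms.
    No multiple-testing correction is applied in this sketch.
    """
    population = len(annotations)
    terms = set().union(*(annotations[g] for g in cluster))
    results = []
    for term in terms:
        annotated = sum(term in t for t in annotations.values())
        hits = sum(term in annotations[g] for g in cluster)
        p = hypergeom_pvalue(population, annotated, len(cluster), hits)
        if p <= alpha:
            results.append((term, p))
    return sorted(results, key=lambda r: r[1])


# Toy background of ten genes with made-up annotations.
annotations = {f"g{i}": set() for i in range(1, 11)}
for i in range(1, 5):
    annotations[f"g{i}"].add("cell cycle")
for i in (1, 5, 6, 7, 8, 9):
    annotations[f"g{i}"].add("transport")

cluster = ["g1", "g2", "g3", "g4"]  # assumed to come from a clustering step
result = enriched_terms(cluster, annotations)
```

Here all four genes of the cluster carry the "cell cycle" term, which only four of the ten background genes have, so that term passes the threshold while "transport" does not.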

Automatic gene annotation through a knowledge-driven approach

The knowledge-driven approach consists of first finding groups of co-annotated genes and then, in a second step, integrating data on expression profiles. The CGGA method (Co-expressed Gene Groups Analysis), which we developed, follows this approach. Our tests show that the functional annotations provided by CGGA reduce the complexity of the data analysis problem by integrating various types of information about genes. The experimental results demonstrated the value of the approach and made it possible to identify relevant information on the biological processes studied (Martinez et al. 2005, 2006a, 2006b, 2006c, 2008c, 2009; Pasquier et al. 2008).
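
The two steps of a knowledge-driven analysis can be sketched as follows: genes are first grouped by shared annotation, and each group is then scored by how similar its expression profiles are. This is a generic illustration, not the CGGA algorithm; the mean pairwise Pearson correlation used as a coherence score, the function names and the toy data are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def co_annotated_groups(annotations, min_size=2):
    """Step 1: group genes that share an annotation term."""
    groups = {}
    for gene, terms in annotations.items():
        for term in terms:
            groups.setdefault(term, set()).add(gene)
    return {t: g for t, g in groups.items() if len(g) >= min_size}


def coherence(genes, expression):
    """Step 2: mean pairwise correlation of the group's expression profiles."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Made-up annotations and expression profiles over four conditions.
annotations = {
    "g1": {"apoptosis"},
    "g2": {"apoptosis"},
    "g3": {"transport"},
    "g4": {"transport"},
}
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [1.0, 2.0, 3.0, 4.0],
    "g4": [4.0, 3.0, 2.0, 1.0],
}

scores = {
    term: coherence(genes, expression)
    for term, genes in co_annotated_groups(annotations).items()
}
```

In this toy data the "apoptosis" genes vary together while the "transport" genes vary in opposite directions, so only the first group would be retained as both co-annotated and co-expressed.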

Extraction of association rules from a heterogeneous set of gene data

We proposed Association Rule Discovery (ARD) as a method capable of identifying rules that link any pieces of biological data without imposing any order on the use of data sources. We developed an application called GenMiner to fully exploit the capabilities of ARD in the context of biological data mining. GenMiner allows the joint use of knowledge about genes and their level of expression under certain conditions in order to discover the relationships between a priori knowledge and experimental measurements. Our method includes a new algorithm, called NorDI (Normal DIscretization algorithm), to discretize gene expression measurements and generate expression profiles. The experiments we conducted confirmed the advantages of GenMiner over known approaches. GenMiner makes it possible to search for association rules using a much smaller minimum support than is possible with traditional approaches. In addition, GenMiner significantly reduces the number of extracted rules, making them much easier for the end user to explore and interpret (Martinez et al. 2007, 2008a, 2008b).
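
The sketch below illustrates the two ingredients described above: expression measurements are discretized into items, and rules mixing annotations with those items are scored by support and confidence. The z-score cutoff is a simplified stand-in and not the NorDI algorithm, and the code only evaluates a given rule; it does not reproduce GenMiner's extraction of minimal non-redundant rules. All names and data are hypothetical.

```python
from statistics import mean, pstdev


def discretize(values, z=1.0):
    """Label each measurement 'up', 'down' or None using a z-score cutoff.

    Simplified stand-in for a discretization step; assumes the values
    are not all identical.
    """
    mu, sigma = mean(values), pstdev(values)
    labels = []
    for v in values:
        score = (v - mu) / sigma
        labels.append("up" if score >= z else "down" if score <= -z else None)
    return labels


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up data: one annotation per gene and one expression value per gene
# under a single hypothetical condition named "heat".
genes = ["g1", "g2", "g3", "g4", "g5"]
annotation = ["transport", "stress response", "stress response",
              "transport", "stress response"]
heat = [0.1, 2.5, 2.4, -2.0, 0.0]

labels = discretize(heat)

# One transaction per gene: its annotation plus its discretized expression.
transactions = []
for term, label in zip(annotation, labels):
    items = {term}
    if label is not None:
        items.add(f"heat:{label}")
    transactions.append(items)

support, confidence = rule_metrics(
    transactions, {"stress response"}, {"heat:up"}
)
```

Because each transaction mixes both kinds of items, a rule can go from an annotation to an expression level, as here, or in the opposite direction, which is what it means not to impose an order on the data sources.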

Alongside these activities, research was also carried out on the parallelization of the BLAST algorithm (Anand et al. 2004). Collaborations with biologists are ongoing; in this joint research we are primarily responsible for data analysis (Pasquier et al. 2014).

Funding

Program: Inter-EPST Program on bioinformatics
Year: 2002-2004
Funder: CNRS, INSERM, INRA, INRIA, Ministry of Research
Grant name: The use of a knowledge base system to analyze microarray data
Project coordinator: Claude Pasquier

Program: CNRS Bio-STIC-LR
Year: 2005-2007
Funder: CNRS, INSERM, INRA, INRIA, Ministry of Research
Grant name: Towards an editor for the subdivision of trees into sub-tree collections: formal and functional criteria for the subdivision process, intra- and inter-collection tree comparisons
Project coordinator: François Chevenet

Program: CNRS post-doctoral grant
Year: 2008-2010
Funder: CNRS
Grant name: Transcriptome mass data use and interpretation using the Massively Parallel Signature Sequencing (MPSS) technologies
Grant recipient: Ronnie Alves
Project coordinator: Claude Pasquier

Program: ANR Methylclonome
Year: 2013-2015
Funder: ANR
Grant name: Analyse de l’héritabilité des traces épigénétiques dans la reproduction clonale (Analysis of the heritability of epigenetic marks in clonal reproduction)
Grant id: ANR-12-BSV6-0006
Project coordinator: Alain Robichon

Program: Gliosplice
Year: 2017-2020
Funder: Institut National du Cancer (INCa)
Grant name: Characterization of alternative splicing networks coordinating brain tumor heterogeneity and treatment resistance commitment
Project coordinator: Mathieu Gabut

Software

  • AllOnto: Knowledge Base System to store and query RDF/OWL specifications
  • CGGA: Extraction of bi-clusters of genes
  • GenMiner: Mining equivalence classes and minimal non-redundant association rules from gene expression data
  • NorDI: Discretization of gene expression data according to the distribution of the dataset
  • THEA: Integrated information processing system dedicated to the annotation of transcriptomic results
  • Thea-Interact: Analysis of the interaction network of Drosophila genes
  • Thea-Online: Web portal using Semantic Web technologies to integrate, query and display information from multiple sources

References

Anand, S., Christen, R., and Pasquier, C. (2004), “Distributed BLAST with ProActive,” in 1st Grid Plugtest, ETSI Headquarters, ProActive User Group, Sophia Antipolis.

Martinez, R., Christen, R., Pasquier, C., and Pasquier, N. (2005), “Exploratory Analysis of Cancer SAGE Data,” in 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’05), Discovery Challenge, Porto, Portugal.

Martinez, R., Pasquier, C., and Pasquier, N. (2007), “GenMiner: Mining Informative Association Rules from Genomic Data,” in IEEE International Conference on Bioinformatics and Biomedicine (BIBM’07), Fremont, Silicon Valley, CA: IEEE, pp. 15–22. https://doi.org/10.1109/BIBM.2007.49.

Martinez, R., Pasquier, N., and Pasquier, C. (2008a), “Mining Association Rule Bases from Integrated Genomic Data and Annotations,” in 5th International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB’08), Vietri sul Mare, Salerno, Italy: Springer Berlin, pp. 33–43.

Martinez, R., Pasquier, N., and Pasquier, C. (2008b), “GenMiner: mining non-redundant association rules from integrated gene expression data and annotations.” Bioinformatics (Oxford, England), Oxford Academic, 24, 2643–4. https://doi.org/10.1093/bioinformatics/btn490.

Martinez, R., Pasquier, N., and Pasquier, C. (2009), “Mining Association Rule Bases from Integrated Genomic Data and Annotations (extended version),” Lecture Notes in Bioinformatics, Springer Berlin Heidelberg, 5488, 78–90. https://doi.org/10.1007/978-3-642-02504-4_7.

Martinez, R., Pasquier, N., Collard, M., Pasquier, C., and Lopez-Perez, L. (2006a), “Co-expressed gene groups analysis (CGGA): An automatic tool for the interpretation of microarray experiments,” Journal of Integrative Bioinformatics, De Gruyter, 3, 1–12. https://doi.org/10.2390/biecoll-jib-2006-37.

Martinez, R., Pasquier, N., Pasquier, C., and Lopez-Perez, L. (2006b), “Interpreting microarray experiments via co-expressed gene groups analysis,” in 9th International Conference of Discovery Science (ICDS’06), Lecture Notes in Computer Science, Barcelona: Springer Berlin Heidelberg, pp. 316–320. https://doi.org/10.1007/11893318_34.

Martinez, R., Pasquier, N., Pasquier, C., Collard, M., and Lopez-Perez, L. (2006c), “Analyse des groupes de gènes co-exprimés (AGGC): un outil automatique pour l’interprétation des expériences de biopuces,” in 13ème Rencontres de la Société Francophone de Classification (SFC’06), Metz: Cépaduès Editions, pp. 267–276.

Martinez, R., Pasquier, N., Pasquier, C., Collard, M., and Lopez-Perez, L. (2008c), “Analyse des groupes de gènes co-exprimés: un outil automatique pour l’interprétation des expériences de biopuces (version étendue),” Revue des Nouvelles Technologies de l’Information (RNTI-C-2), Classification : points de vue croisés, RNTI, 831, 263–74.

Pasquier, C. (2008), “Biological data integration using Semantic Web technologies.” Biochimie, Elsevier, 90, 584–94. https://doi.org/10.1016/j.biochi.2008.02.007.

Pasquier, C. (2011), “Applying Semantic Web technologies to biological data integration and visualization,” in Data Management in Semantic Web, eds. H. Jin and L. Zehua, Nova Science Publishers, Inc., pp. 131–151.

Pasquier, C., and Christen, R. (2004), “Analysis of microarray data with THEA,” in Lyon’s International Multidisciplinary Meeting on Post-Genomics: Integrative Post-Genomics (IPG’04), La Doua, Lyon.

Pasquier, C., Clément, M., Dombrovsky, A., Penaud, S., Da Rocha, M., Rancurel, C., Ledger, N., Capovilla, M., and Robichon, A. (2014), “Environmentally selected aphid variants in clonality context display differential patterns of methylation in the genome,” PLOS ONE, Public Library of Science, 9, e115022. https://doi.org/10.1371/journal.pone.0115022.

Pasquier, C., Girardot, F., Jevardat de Fombelle, K., and Christen, R. (2004), “THEA: ontology-driven analysis of microarray data.” Bioinformatics (Oxford, England), Oxford Academic, 20, 2636–43. https://doi.org/10.1093/bioinformatics/bth295.

Pasquier, N., Pasquier, C., Brisson, L., and Collard, M. (2008), “Mining Gene Expression Data using Domain Knowledge,” International Journal of Software and Informatics (IJSI), Chinese Academy of Sciences, 2, 215–231.

Claude Pasquier
Researcher in Computer Science / Computational Biology

Université Côte d’Azur, CNRS, I3S Laboratory, Sophia Antipolis
