About

A single omics analysis does not always provide enough information to understand the behavior of a cellular system. Integrating multiple omics analyses may instead give a better understanding of a biological system as a whole. Developing such integrative statistical methodologies raises major research challenges: (a) assessing the quality of the data, (b) giving a comprehensive overview of the system under study by modeling the data according to the biological question, (c) extracting significant information, and (d) coping with the high dimensionality of the data.

mixOmics was first developed as an R package dedicated to the exploration and integration of high-dimensional data sets. A strong focus is given to graphical representation, to better understand the relationships between the different types of data and to better visualize the correlation structure between the measured entities.

Introduction to Statistics

What does "data exploration" mean?

  • to extract non-trivial and potentially useful information from high-dimensional data
  • to discover interesting aspects of the data which should be analyzed further
  • to pinpoint artefacts in the data (e.g. batch effects, experimental errors)
In summary: to obtain a quantified description/model of cellular metabolism at a genome scale that can serve as a foundation for further hypothesis-driven investigation.

What does "statistical integration" mean?

Given the complexity of the data, using direct correlation measures is not sufficient to model the correspondence between two different molecular entities and to understand the functional principles and dynamics of total cellular systems. We would like to find what is in common and what is different between two data sets.

mixOmics proposes multivariate statistical approaches to identify similarities between two heterogeneous data sets. These approaches focus on dimension reduction to:

  • summarize the information in a smaller data set
  • highlight the biological entities that are of potential relevance
Such approaches include Principal Component Analysis, Canonical Correlation Analysis, Partial Least Squares regression, and the variants we have developed to deal with high-dimensional data (sparse PCA, regularized CCA, sparse PLS, sPLS-DA).

What should I know before using this interface?

  • mixOmics does not build or infer relationships between the biological entities based on prior biological knowledge. It is an entirely data-driven, holistic approach. It indicates which entities are likely to be related to each other (in a statistical sense) and would therefore merit further investigation. It also indicates the similarities and differences between the samples.
  • the two data sets must come from the same biological material (i.e. the same biological samples). The samples/individuals should be displayed in rows and the biological entities or variables in columns, in two separate files. Check that the samples match on each row (see figure below).
  • this web interface provides a user-friendly front end to the mixOmics R package. More flexible options are available when using the original R package (also refer to our tutorial)
  • if you are not sure which biological questions you would like to answer through this interface, please refer to: Which method to use? or examples
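The requirement that samples match row by row across the two files can be checked before uploading. The sketch below is only an illustration in Python, not part of the web interface or of the R package. It assumes that each file has a header row and that the first column holds the sample identifiers; the function names and the toy data are hypothetical.

```python
import csv
import io


def read_sample_ids(handle):
    """Return the sample identifiers found in the first column, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row holding the variable names
    return [row[0].strip() for row in reader if row]


def check_matching_samples(x_handle, y_handle):
    """Return (row_number, x_id, y_id) for every sample row where the two files disagree."""
    x_ids = read_sample_ids(x_handle)
    y_ids = read_sample_ids(y_handle)
    if len(x_ids) != len(y_ids):
        raise ValueError(f"X has {len(x_ids)} samples but Y has {len(y_ids)}")
    return [
        (i + 1, x_id, y_id)
        for i, (x_id, y_id) in enumerate(zip(x_ids, y_ids))
        if x_id != y_id
    ]


# Toy data standing in for two uploaded files (hypothetical contents).
X_CSV = "sample,gene1,gene2\nmouse1,0.5,1.2\nmouse2,0.7,0.9\nmouse3,0.1,1.5\n"
Y_CSV = "sample,lipidA\nmouse1,12.0\nmouse3,9.5\nmouse2,11.1\n"

mismatches = check_matching_samples(io.StringIO(X_CSV), io.StringIO(Y_CSV))
for row_number, x_id, y_id in mismatches:
    print(f"sample row {row_number}: X has {x_id!r}, Y has {y_id!r}")
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. An empty result means the sample order agrees in both files.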

Which method to use?

Depending on your biological question, one or several methodologies can be applied. Below are listed some typical analysis frameworks.

We call variables the expression or abundance of the entities that are measured (genes, metabolites, proteins, SNPs, ...), and samples, observations or units the individual, patient or cell on which the experiment is performed.

In mixOmics, the data should be displayed with samples in rows and variables in columns (see figure above).

  • I would like to identify the trends or patterns in my data, detect experimental bias, or see whether my samples 'naturally' cluster according to the biological conditions: Principal Component Analysis (PCA)
  • In addition to the above, I would like to select the variables that contribute the most to the variance (the information) in the data set: sparse Principal Component Analysis (sPCA)
  • I would like to know if I can extract common information from the two data sets (or highlight the correlation between the two data sets), i.e. modeling a bi-directional relationship between the two data sets.
    • The total number of variables (p+q) is less than the number of samples: Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS) canonical mode
    • The total number of variables (p+q) is greater than the number of samples: regularized Canonical Correlation Analysis (rCCA) or Partial Least Squares (PLS) canonical mode

  • I would like to model a 'uni-directional' relationship between the two data sets, i.e. I would like to predict the expression of the metabolites (Y) given the expression of transcripts (X)
    • Partial Least Squares (PLS), classic or regression mode

  • In addition to the above, I would like to select the variables from both data sets that covary (i.e. 'change together') across the different conditions: sparse Partial Least Squares (sPLS) with appropriate mode

Here X = expression data and Y = vector indicating the classes of the samples

  • I would like to know how informative my data are for correctly classifying my samples and for predicting the class of new samples: PLS-Discriminant Analysis (PLS-DA)
  • In addition to the above, I would like to select the variables that help classify the samples: sparse PLS-DA (sPLS-DA)

Here X = expression data and Y = response vector

  • I would like to model a causal relationship between my data and the response vector and assess how informative my data are to predict such response: PLS-regression mode
  • In addition to the above, I would like to select the variables that best predict the response: sparse PLS-regression mode
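The decision rules listed in this section can be summarised as a small lookup. The Python sketch below is only a reading aid. The function name `suggest_method` and its arguments are hypothetical and do not belong to the mixOmics package. The case where p+q equals the number of samples is not covered by the text, so treating it as the regularized case is an assumption.

```python
def suggest_method(n_samples, p, q=None, *, y_is_class=False,
                   bidirectional=True, select_variables=False):
    """Map the questions listed above to the name of a suggested analysis.

    n_samples        number of samples (rows)
    p, q             number of variables in X and in Y (q=None: one data set only)
    y_is_class       True when Y is a vector of sample classes
    bidirectional    True for a symmetric X <-> Y relationship, False to predict Y from X
    select_variables True when variable selection (a sparse method) is wanted
    """
    if y_is_class:
        return "sPLS-DA" if select_variables else "PLS-DA"
    if q is None:
        return "sPCA" if select_variables else "PCA"
    mode = "canonical" if bidirectional else "regression"
    if select_variables:
        return f"sPLS ({mode} mode)"
    if not bidirectional:
        return "PLS (regression mode)"
    # Assumption: p + q == n_samples is handled like the high-dimensional case.
    if p + q < n_samples:
        return "CCA or PLS (canonical mode)"
    return "rCCA or PLS (canonical mode)"


# Dimensions taken from the case studies described on this page.
print(suggest_method(40, 120, 21))                            # nutrigenomic data
print(suggest_method(60, 1517, 1375, select_variables=True))  # NCI60 panel data
print(suggest_method(43, 3391, 10, bidirectional=False, select_variables=True))  # wine yeast data
```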

Previous Case studies analysed using mixOmics

Nutrigenomic data: rCCA

The expression of 120 genes in liver cells (dedicated microarray) and the concentrations of 21 hepatic fatty acids (gas chromatography) were measured on 40 mice classified into 2 genotypes and 5 diets. In this study, we do not know a priori whether changes in gene expression imply changes in fatty acid concentrations or the reverse. The aim is to generate interesting hypotheses on the transcription factor pathways potentially linking hepatic fatty acids and gene expression.

Results: rCCA illustrates the effects of the genotype and the diets on each data set and further highlights interesting correlations between specific variables from the two data sets.

More details

Wine yeast data: sPLS regression mode

Five industrial yeast strains were sampled at 3 time points of fermentation (n = 43 in total); 3391 genes and 10 exometabolites (aroma compounds) were measured. The aim is to mine and interpret gene expression data in the context of aroma compound production, as the different yeast strains produce wines with highly divergent aroma profiles.

Results: sPLS selected genes that provided good coverage of key reactions and major branches of the aroma production pathways.

More details

Drought tolerance of sunflower: sPLS regression mode

The expression of 32423 transcripts was measured in 8 sunflower genotypes undergoing water deprivation in a controlled environment. The plants were harvested at a fixed date (genotype-dependent stress levels) or at a fixed stress level (genotype-dependent stress duration). Eight morpho-physiological variables were also collected from the same plants (96 individuals in total). The aim is not only to identify transcripts involved in drought tolerance in sunflower, but also to unravel genotype-dependent responses and the transcripts underpinning those distinct responses. Are sunflower responses to water deprivation genotype-dependent, and which transcripts are behind those differences?

Results are currently being analysed.

NCI60 panel data: sPLS canonical mode

The two data sets were generated using Affymetrix (1,517 genes) and spotted cDNA (1,375 genes) platforms from 60 human tumor cell lines (9 different types of tumour). The aim is to assess their degree of complementarity by screening different gene sets representing common pathways or biological functions. The two data sets play a symmetric role, which is why the "canonical mode" is appropriate here.

Results: sPLS selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines.

More details

Brain tumor data: sPLS-DA

The Brain data set compares 5 embryonic tumors using the expression of 5597 genes. The aim is to find discriminative genes that can help separate the different classes of tumours.

Results: selected genes were found to be involved in biological functions such as cell differentiation, cellular developmental process and central nervous system development.

More details: "Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems", in press.

Dataset examples:

This demonstration proposes to analyse 5 different datasets with 5 different types of analysis.

First, please download the following input files:

  • Breast tumor Affymetrix dataset 1 subsequently called X (right-click, save-as). The expression of 22,215 genes on 89 early-stage breast tumors (stage I and II) aggressively treated with either tamoxifen or adjuvant chemotherapy.
  • Breast tumor Affymetrix dataset 2 subsequently called Xprim (right-click, save-as). The expression of 20,681 genes on 89 early-stage breast tumors (stage I and II). Xprim is a subset of X that does not contain genes located on chromosome 8.
  • Breast tumor CGH dataset subsequently called Yprim (right-click, save-as). The data set contains 138 unique probes on chromosome 8 on the same matching tumors.
  • Breast tumor class dataset (right-click, save-as). A vector containing the estrogen receptor status of the 89 patients.

Sparse PLS analysis: the aim of sPLS is to select correlated or jointly expressed variables from the two data sets. In this analysis we would like to identify genes that contribute to breast cancer pathophysiologies when deregulated by recurrent aberrations. This .pdf file gives a full illustration of the results obtained on these data sets. To go directly to the results page for this analysis, go here.

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (s)PLS
  • Please choose your parameters
    • Approach | Sparse PLS
    • Number of components | 2
    • Please choose the mode of computation | Canonical
    • How many variables to keep on dimension 1 for X: | 100
    • How many variables to keep on dimension 2 for X: | 50
    • How many variables to keep on dimension 1-2 for Y: | 138
  • Please choose your graphical outputs parameters
    • What should be the Correlation Threshold? | 0.7
    • Display the X variables label? | Unchecked
    • Display the samples label? | Checked
    • Add colors to the graphical output? | Checked
    • Display two-dimensional visualizations of the correlation matrices within and between two data sets? | Unchecked
  • Please choose your export parameters
    • Save a Cytoscape-compatible file of the networks representation? | Checked
    • Save a GSEA-compatible file for the genes selected with the analysis from the X dataset? | Checked
    • Save a GSEA-compatible file of the class dataset? | Checked
    • Retrieve the gene information from the iHOP database for the genes selected with the analysis from the X dataset? | Checked
    • Source of accession numbers: | Online Mendelian Inheritance in Man
    • Type of information desired: | Genes interaction
  • Please upload your datasets
    • X dataset: | X_affy.csv
    • Xprim dataset: | Xprim_affy.csv
    • Y dataset: | Yprim_cgh.csv
    • Class vector: | ER_status.csv

First, please download the following input files:

  • Multidrug ABC transporter dataset (right-click, save-as). The expression of the 48 human ABC transporters measured by real-time quantitative RT-PCR for each cell line. There is one missing value in this data set.
  • Multidrug class dataset (right-click, save-as). A vector containing the phenotypes of the 60 cell lines.

PCA analysis: the aim of PCA is to discover interesting aspects of the data and to reduce the dimension of the data to a smaller number of variables (the "principal components") which summarize most of the information in the data set. It also helps identify possible artifacts. In this analysis we would like to understand the relationships between the expression levels of the ABC transporters and how they relate to the different cell line types.
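To make the idea of "principal components" concrete, here is a minimal Python sketch of PCA on centered and scaled data. It is not the algorithm used by the mixOmics package: it uses a simple power iteration with deflation, assumes there are no missing values, and runs on a small made-up table (6 samples, 3 variables). All names are illustrative.

```python
import math


def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered data."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_component(matrix, iterations=500):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a symmetric matrix."""
    p = len(matrix)
    vector = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    eigenvalue = sum(
        vector[i] * sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, vector


def pca(rows, n_components=2):
    """Return one (explained variance proportion, loadings, sample scores) tuple per component."""
    data = standardize(rows)
    matrix = covariance(data)
    p = len(matrix)
    total_variance = sum(matrix[i][i] for i in range(p))
    components = []
    for _ in range(n_components):
        eigenvalue, loading = leading_component(matrix)
        scores = [sum(v * w for v, w in zip(row, loading)) for row in data]
        components.append((eigenvalue / total_variance, loading, scores))
        # Deflation: remove the variance already explained by this component.
        matrix = [
            [matrix[i][j] - eigenvalue * loading[i] * loading[j] for j in range(p)]
            for i in range(p)
        ]
    return components


# Made-up data: samples in rows, variables in columns (the first two columns are strongly correlated).
toy = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.7],
    [6.0, 12.0, 1.2],
    [7.0, 13.8, 0.9],
]

result = pca(toy, n_components=2)
for index, (proportion, loading, _scores) in enumerate(result, start=1):
    print(f"component {index}: {proportion:.1%} of the variance")
```

The loadings show how much each variable contributes to a component, and the scores give the coordinates of the samples on it. Plotting the scores of the first two components is how sample clusters or artifacts are usually inspected.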

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (s)PCA
  • Please choose your parameters
    • Approach | PCA
    • Number of components | 2
    • Center the data | Checked
    • Scale the data | Checked
  • Please choose your graphical outputs parameters
    • Display the X variables label? | Checked
    • Display the samples label? | Unchecked
    • Add colors to the graphical output? | Checked
  • Please choose your export parameters
    • None
  • Please upload your datasets
    • X dataset: | multi_abc_trans.csv
    • Class vector: | multi_colour.csv

First, please download the following input files:

  • Nutrimouse gene dataset (right-click, save-as). Expression of 120 genes measured in liver cells for 40 mice, selected (among about 30,000) as potentially relevant in the context of the nutrition study.
  • Nutrimouse lipid dataset (right-click, save-as). Concentrations (in percentages) of 21 hepatic fatty acids measured by gas chromatography in the same mice.
  • Nutrimouse class dataset (right-click, save-as). A 5-level factor. Oils used for experimental diet preparation were corn and colza oils (50/50) for a reference diet (REF), hydrogenated coconut oil for a saturated fatty acid diet (COC), sunflower oil for an Omega6 fatty acid-rich diet (SUN), linseed oil for an Omega3-rich diet (LIN) and corn/colza/enriched fish oils for the FISH diet (43/43/14).

rCCA analysis: the aim of rCCA is to highlight the correlations between the fatty acids and the genes and see how these variables illustrate the effects of the genotypes and the diets of the mice.

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (r)CCA
  • Please choose your parameters
    • Number of components | 3
    • Lambda 1 value: | 0.064
    • Lambda 2 value: | 0.008096
  • Please choose your graphical outputs parameters
    • What should be the Correlation Threshold? | 0.6
    • Display the X variables label? | Checked
    • Display the samples label? | Unchecked
    • Add colours to the graphical output? | Checked
    • Display two-dimensional visualizations of the correlation matrices within and between two data sets? | Unchecked
    • Display a color-coded Clustered Image Maps (CIMs) ("heat maps")? | Checked
  • Please choose your export parameters
    • Save a Cytoscape-compatible file of the networks representation? | Checked
  • Please upload your datasets
    • X dataset: | nutri_gene.csv
    • Y dataset: | nutri_lipid.csv
    • Class vector: | nutri_colour.csv

First, please download the following input files:

sPLS analysis: the aim of sPLS is to select correlated or jointly expressed variables from the two data sets. In this specific example, we want to know whether there is a subset of clinical variables and a subset of transcripts that can give more insight into paracetamol toxicity in the liver.

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (s)PLS
  • Please choose your parameters
    • Approach | Sparse PLS
    • Number of components | 3
    • Please choose the mode of computation | Regression
    • How many variables to keep on dimension 1-3 for X: | 50
    • How many variables to keep on dimension 1-3 for Y: | 5
  • Please choose your graphical outputs parameters
    • What should be the Correlation Threshold? | 0.7
    • Display the X variables label? | Checked
    • Display the samples label? | Unchecked
    • Add colors to the graphical output? | Checked
    • Display two-dimensional visualizations of the correlation matrices within and between two data sets? | Unchecked
    • Display a color-coded Clustered Image Maps (CIMs) (heatmap of the variables)? | Unchecked
  • Please choose your export parameters
    • Save a Cytoscape-compatible file of the networks representation? | Checked
    • Save a GSEA-compatible file for the genes selected with the analysis from the X dataset? | Checked
    • Save a GSEA-compatible file of the class dataset? | Checked
    • Retrieve the gene information from the iHOP database for the genes selected with the analysis from the X dataset? | Checked
    • Source of accession numbers: | NCBI Gene ID
    • Type of information desired: | Genes interaction
  • Please upload your datasets
    • X dataset: | liver_gene.csv
    • Y dataset: | liver_clinic.csv
    • Class vector: | liver_colour.csv

First, please download the following input files:

sIPCA analysis: IPCA can be used as an alternative to PCA and sometimes yields more "meaningful" components. The aim is to discover interesting aspects of the data and to reduce the dimension of the data to a smaller number of variables (the "independent principal components") which summarize most of the information in the data set. In this analysis we would like to see whether the gene expression can help cluster the samples according to the biological conditions. The sparse version additionally selects the relevant genes.

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (s)IPCA
  • Please choose your parameters
    • Approach | Sparse IPCA
    • Number of components | 2
    • Please choose the mode of computation | Deflation
    • How many variables to keep on dimension 1-2 for X: | 50
  • Please choose your graphical outputs parameters
    • Display the X variables label? | Checked
    • Display the samples label? | Unchecked
    • Add colors to the graphical output? | Checked
  • Please choose your export parameters
    • Save a GSEA-compatible file for the genes selected with the analysis from the X dataset? | Unchecked
    • Save a GSEA-compatible file of the classes dataset? | Unchecked
    • Retrieve the gene information from the iHOP database for the genes selected with the analysis from the X dataset? | Checked
    • Source of accession numbers: | NCBI Gene ID
    • Type of information desired: | Genes interaction
  • Please upload your datasets
    • X dataset: | liver_gene_id.csv
    • Class vector: | liver_toxicity_timeanddose.csv

First, please download the following input files:

sPLS-DA analysis: the aim of the analysis is to select the discriminative genes that can help separate the different classes of tumours. Based on this subset of potential biomarkers extracted from our training data, we can then assess whether we can correctly predict the classes of a new test set of samples.

You can now start the wizard by clicking on the "next" button at the bottom of the page.

The following pages present several options to be chosen by the user. The parameters we recommend are listed below in the following format:

Name of the option | Recommended parameter
(Other options can be chosen but will not be covered in this demonstration).
  • Please choose your methodology | (s)PLS-DA
  • Please choose your parameters
    • Approach | Sparse PLS-DA
    • Predict the class of samples from a new test set? | Checked
    • Number of components | 3
    • How many variables to keep on dimension 1-3: | 50
  • Please choose your graphical outputs parameters
    • Display the X variables label? | Checked
    • Display the samples label? | Unchecked
  • Please choose your export parameters
    • Save a Cytoscape-compatible file of the networks representation? | Checked
    • Save a GSEA-compatible file for the genes selected with the analysis from the X dataset? | Checked
    • Save a GSEA-compatible file of the class dataset? | Checked
    • Retrieve the gene information from the iHOP database for the genes selected with the analysis from the X dataset? | Unchecked
  • Please upload your datasets
    • Train dataset X (size n x p): | srbct_gene_train.csv
    • Train class vector Y (length n): | srbc_class_train.csv
    • Test dataset X' (size n' x p): | srbct_gene_test.csv

Contact

We are well aware that close interaction between statisticians and the biologist/bioinformatician community is crucial when attempting to model such complex relationships between heterogeneous data.

Do not hesitate to ask questions, give feedback and comments to us: mixomics[at]math.univ-toulouse.fr

Start New Project

Project Name:

Please choose your methodology

Guide me through my options

Parameter selection

Please choose your parameters

Approach:

Parameters:

(Note that the result of the analysis will provide a histogram of the proportion of explained variance for a larger number of components.)

Please choose the mode of computation

Lambda parameters are mandatory for Regularised CCA. If you do not know what to enter, you can find documentation on how to estimate them on this page.

This requires the upload of the test data.

Graphical output parameters

Please choose your graphical outputs parameters

Correlation network outputs

Variable representation outputs

Samples representation outputs

This will require the upload of a vector indicating the class of each sample.

Export Parameters

Please choose your export parameters

Please review the chosen parameters

Methodology:

Number of components
Data Upload

Please upload your datasets (50MB maximum)

Formats accepted are .txt / .csv

Use the following example (right-click, save-as) to see how your dataset should be organised.
Numbers must be in the US format, i.e. decimals separated by a dot, not a comma.
It is recommended to have no or very few missing values in your datasets.
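The formatting rules above can be checked locally before uploading. The following Python sketch is only an illustration and is not part of the web interface. It assumes a header row, sample names in the first column, and that empty cells, "NA" or "NaN" denote missing values; the function name `audit_dataset` and the toy file contents are hypothetical.

```python
import csv
import io
import re

COMMA_DECIMAL = re.compile(r"^[+-]?\d+,\d+$")
MISSING_MARKERS = {"", "NA", "NaN"}  # assumption: how missing values are encoded


def audit_dataset(handle, delimiter=","):
    """Report cells that break the upload rules, as (file line number, column name) pairs."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    report = {"comma_decimals": [], "missing": [], "non_numeric": [], "ragged_rows": []}
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            # A different number of cells often means a stray delimiter or a comma decimal.
            report["ragged_rows"].append(line_number)
        for column, cell in zip(header[1:], row[1:]):  # first column: sample names
            value = cell.strip()
            if value in MISSING_MARKERS:
                report["missing"].append((line_number, column))
            elif COMMA_DECIMAL.match(value):
                report["comma_decimals"].append((line_number, column))
            else:
                try:
                    float(value)
                except ValueError:
                    report["non_numeric"].append((line_number, column))
    return report


# Toy tab-separated file with one comma decimal and one missing value.
SAMPLE_TXT = "sample\tgene1\tgene2\nA\t0.52\t1,37\nB\t\t2.10\nC\t0.98\t1.75\n"

report = audit_dataset(io.StringIO(SAMPLE_TXT), delimiter="\t")
for problem, cells in report.items():
    print(problem, cells)
```

Cells listed under `comma_decimals` should be rewritten with a dot before the upload, and the `missing` list gives a quick count of the missing values in the data set.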