zMAP - Proteomics Analysis Platform

 

What is zMAP?

Isobaric labeling-based mass spectrometry (ILMS) has been widely used to quantify, on a proteome-wide scale, the relative protein abundance in different biological conditions. Large-scale proteomic studies based on ILMS, however, typically involve multiple runs of mass spectrometry (MS), bringing great computational difficulty to the simultaneous comparison and other integration analyses of all ILMS samples. We present zMAP, a toolset that makes ILMS intensities comparable across MS runs by modeling the associated mean-variance dependence and accordingly applying a variance stabilizing z-transformation. Specific efforts have been made to render the model fitting procedure resistant to underlying hypervariable proteins. Two case studies demonstrate the effectiveness of zMAP in handling large-scale ILMS data sets. Specifically, the transformed z-statistics, as a new kind of measurements of protein abundance, have effectively unlocked a broad range of integration analyses that cannot otherwise be applied on original ILMS intensities.

zMAP

The design of zMAP aims to simultaneously compare protein profiles of multiple samples and integrate samples from different MS runs for identifying hypervariable proteins across samples. zMAP models the mean-variance dependence associated with ILMS intensities and accordingly applies a variance-stabilizing z-transformation(z-statistic), which dramatically increases the comparability between samples generated by different MS runs. The z-statistic can be widely applied in downstream analyses, including PCA, clustering analysis, GSVA, and so on. To facilitate biologists' usage, we offer a user-friendly and straightforward web service. zMAP requires that all the involved MS runs are associated with the same biological design, such that the average intensities across all samples in each run are biologically identical.

Parameter

Description

Protein intensity file

A tab-delimited file containing raw gene-level protein intensity with samples in columns, and gene symbols in rows. Note: 1. The protein intensity matrix does not require normalization. 2. Sample names can only consist of letters, numbers, and underscores.

Sample information file

Sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run.

Window size

Number of proteins within each sliding window. The default is 400.

Step size

The step size of the sliding window. The default is 100 proteins.

Proportion of proteins used for linear regression(%)

To avoid the influence of differentially expressed proteins across samples, which can lead to an overestimated variance, only a certain proportion (30% by default) of the proteins with the smallest observed variances are used for the quantile regression.

MVC fitting method

The method used for fitting mean-variance curve : exponential model or natural cubic spline model

Input Files

1. Protein intensity file (example)

A tab-delimited file containing raw gene-level protein intensity with samples in columns, and gene symbols in rows.

Note:

1. The protein intensity matrix does not require normalization.

2. Sample names can only consist of letters, numbers, and underscores.

2. Sample information file (example)

Sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run.

Note:

Sample_id

Sample_id can only consist of letters, numbers, and underscores.

Sample_condition

Sample_condition must consist only of letters, numbers, and underscores.

MS_run

The MS run in which the proteins of the sample were quantified.

Output Files

1. z-statistic table.

z_statistic_table.txt

z-statistic, as a new kind of measurements of protein abundance, is crucial for downstream analysis and visualization.

2. The table containing chi-square statistics and corresponding statistical p-values.

zmap_chi_square_pvalue.txt

This file will be utilized to identify hypervariable proteins across samples.

3. Results directory for each MS run's model construction.

The results directory encompasses graphics of normalized data, linear regression, nonlinear regression outcomes, and so on. Users can consult these resources to evaluate data quality and model fitting effectiveness.

reverse-zMAP

 

reverse-zMAP module is primarily designed for handling MS runs with relatively large sample sizes. The major concern is that, with the increase of the number of samples, the odds that outlier measurements are involved for each specific protein increase, giving rise to excessively large variances. Consequently, in the sliding-window process of the zMAP module, the proportion of proteins that are suitable to use for the quantile regression in each window can be very small. Besides, fitting a single mean-variance curve (MVC) for a large number of samples may not be flexible enough to allow for the variation of mean-variance trend across samples. In practice, large-scale proteomic studies have frequently applied the strategy of adding a biologically identical reference sample to each individual MS run. For example, in cancer studies, a mixed sample is typically generated by pooling tumor samples and/or normal adjacent tissues (NATs) from several related patients in equal protein amounts. The proteome of this mixed sample is then profiled in every MS run separately. reverse-zMAP module alleviates the influence of outliers by repeatedly making pairwise comparisons, which in the meanwhile allows the modeling of sample-specific mean-variance trend, but it requires a biologically identical reference sample in each MS run for a subsequent integration across MS runs.

 

Parameter

Description

Protein intensity file

A tab-delimited file containing raw gene-level protein intensity with samples in columns, and gene symbols in rows. Note: 1. The protein intensity matrix does not require normalization. 2. Sample names can only consist of letters, numbers, and underscores.

Sample information file

The sample information file is a four-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, MS_run, Sample_condition, and internal_ref.

Window size

Number of proteins within each sliding window. The default is 400.

Step size

The step size of the sliding window. The default is 100 proteins.

Proportion of proteins used for linear regression(%)

To avoid the influence of differential proteins, only a certain proportion (50% by default) of the proteins with the middle M-value are used for quantile regression.

MVC fitting method

The method used for fitting mean-variance curve : exponential model or natural cubic spline model.

 

Input Files

1. Protein intensity file (example)

A tab-delimited file containing raw gene-level protein intensity with samples in columns, and gene symbols in rows.

Note:

1. The protein intensity matrice does not require normalization.

2. Sample names can only consist of letters, numbers, and underscores.

2. Sample information file (example)

Sample information file is a four-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id,MS_run,Sample_condition, and internal_ref.

Note:

To avoid code conflicts, the column headers are:

Sample_id

Sample_id can only consist of letters, numbers, and underscores.

MS_run

The MS run in which the proteins of the sample were quantified.

Sample_condition

Sample_condition consists only of letters, numbers, and underscores.

internal_ref

Yes or No. Indicate whether the sample is a reference sample.

Output Files

1. z-statistic table.

z_statistic_table.txt

It is crucial for downstream analysis and visualization.

2. Chi-square statistics and corresponding statistical p-values.

reverse_zmap_chi_square_pvalue.txt

This file contains 5 columns: protein, number of samples where this protein was not detected, chi-square statistic, p-value, BH-corrected p-value, and Bonferroni-corrected p-value. This file will be utilized to identify hypervariable proteins (HVPs) across samples.

3. Summary of proteomic datasets at different levels.

qc_boxplot.pdf

qc_table.pdf

4. Goodness of fit for the application of reverse-zMAP to your data set.

4.1 Box plot of the R2 values of all quantile regressions.

r2_linear_model_boxplot.pdf

4.2 The distribution of R2 values from the fitting of M-A curves. The associated three quartiles are marked.

r2_distribution_of_nonlinear_model_fitting_for_estimated_variance.pdf

4.3 The distribution of R2 values from the fitting of MVCs.

r2_distribution_of_nonlinear_fitting_for_intercept_u.pdf

5. The comparison of cumulative distributions of m values and z-statistics, with each line representing a sample.

cumulative_distribution.pdf

Downstream analysis

Sample quality control

Performing hierarchical clustering and principal component analysis on the z-statistic matrix of samples provides a concise visualization of the overall impact of sample conditions and MS runs.

Name

Description

z-statistic file

Output file z_statistic_table.txt from zMAP or reverse-zMAP

Sample information file

The sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run. Note: To avoid code conflicts, Sample_condition consists only of letters, numbers, and underscores.

Input Files

1. z statistic file (example)

Output file z_statistic_table.txt from zMAP or reverse-zMAP as input.

2. Sample information file (example)

Sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run.

Note:

To avoid code conflicts:

Sample_id

Sample_id must consist of letters, numbers, and underscores.

Sample_condition

Sample_condition must consist only of letters, numbers, and underscores.

MS_run

The MS run in which the proteins of the sample were quantified.

Output Files

1. Table of PCC between each pair of samples based on z-statistic.

pearsonr_correlation_coefficient_of_z_statistic.txt

2. Sample-based hierarchical cluster.

pearsonr_correlation_coefficient_of_z_statistic.pdf

3. Principal components matrix.

pca_df.txt

4. Principal component plot.

A principal component plot displaying samples in a 2D plane defined by their first two principal components. This plot provides a concise visualization of the overall impact of sample conditions and MS runs.

PC1_PC2_scatterplot.pdf

HVPs identification and clustering

The chi-square statistics derived by zMAP along with the associated numbers of degrees of freedom were summed across MS runs for each protein, giving rise to a p-value that assessed the overall expression variability of the protein. This functionality is designed to select hypervariable proteins (HVPs) from the output files of zMAP and cluster them to obtain multiple expression signatures. Then, pathway enrichment analysis was conducted on these signatures to reveal biological insights.

 

Parameter

Description

Chi-square and p-value file

Output file zmap_chi_square_pvalue.txt from zMAP.

z-statistic file

Output file z_statistic_table.txt from zMAP.

Sample information file

The sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run.

FDR

BH-corrected p-value cutoff for identifying HVPs. The default is 0.05.

Number of HVP clusters

HVPs are hierarchically clustered into multiple clusters revealing diverse expression signatures across different sample conditions.

Minimum proteins within each cluster

If the number of proteins within certain clusters falls below the specified minimum count, these clusters will be merged into a single cluster labeled as cluster_0. As a result, the final number of clusters may be fewer than what the user initially specified.

Number of top-ranked DEPs

The z-statistics associated with each protein are separately averaged within each sample condition. The standard deviation of these average z-statistics for each protein is then calculated. Subsequently, proteins are ranked according to these standard deviations, and the top-ranked differentially expressed proteins (DEPs) are selected for heatmap visualization.

Number of DEP clusters

Top-ranked DEPs are clustered into multiple clusters based on K-means.

Input Files

1. Chi-square and p-value file (example)

Output file zmap_chi_square_pvalue.txt from zMAP as input.

2. z statistic file (example)

Output file z_statistic_table.txt from zMAP as input.

3. Sample information file (example)

As above

Output Files

1. Heatmap showing the z-statistics of hypervariable proteins.

These proteins were clustered into multiple clusters.

hypervariable_proteins_cluster_heatmap_bh_cutoff_0.05.pdf

2. The list of proteins belonging to each cluster is within the dictionary.

./clusters

3. Pathway enrichment results.

Enriched biological processes and KEGG pathways for protein clusters are integrated within the dictionary.

./enrichment_results

4. Heatmap showing the z-statistics of top-ranked DEPs.

These proteins were clustered into multiple clusters.

top_100_differentially_expressed_proteins_heatmap.pdf

Sample subgrouping

For delineating molecular subtypes at the protein level, this function is designed to select a subset of highly variable proteins across samples. It then undertakes unsupervised clustering on the samples, offering quantitative evidence to ascertain both the number and composition of potential clusters within the dataset. For in-depth algorithmic insights, please consult the details provided in Consensus Clustering.1,2

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from reverse-zMAP

Sample information file

The sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id,Sample_condition, and MS_run

Sample condition

Perform clustering on samples under specific conditions.

Top N(number) hypervariable proteins for subgrouping

Proteins were ranked based on their variance across samples, and the z-statistic matrices of the top_n proteins were used for clustering. By default, the top_n is set to 3000.

Input Files

1.z statistic file (example)

Output file z_statistic_table.txt from reverse-zMAP as input.

2. Sample information file (example)

As above.

Output File

1. Sample clustering results:

cluster_XX.csv (XX is the number of clusters)

2. Graphic output:

consensus.pdf

We invoked the ConsensusClusterPlus R package at the underlying level. For interpretation of the output graphics, please refer to https://bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf

 

Association with clinical features

In the prior step, samples underwent grouping using the z-statistic matrix derived from HVPs. With the intention of correlating proteomic subgroups to clinical and molecular features,We then employed either the Chi-square test or Fisher's exact test to examine the association between proteomic subgroups and discrete sample features. For continuous features, we utilized Student's t-test or ANOVA.

Subsequently, ANOVA was further applied to pinpoint differentially expressed proteins (DEPs) among sample groups. Hierarchical clustering of proteins ensued, driven by the z-statistic matrix, revealing distinctive expression signatures. Following this, pathway enrichment analysis was conducted for each set of proteins.

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from reverse-zMAP.

Sample cluster file

A comma-separated file with the first column containing sample names and the second column indicating the respective sample groups.

Clinical information file

A tab-separated file with the first column representing sample names, and the subsequent columns containing clinical and molecular features. Please note that column names can only include letters, numbers, and underscores. Allows NaN values. When plotting, NaN values are represented by white color by default.

Clinical features(discrete variable name)

This is a string where discrete features are comma-separated. For example, 'gender, stage'.

Clinical features(continuous variable name)

This is a string where continuous features are comma-separated. For example, 'age, tumor_size'.

Color file

Tab-delimited file with no column names. The first column represents discrete sample features, and the second column corresponds to the hexadecimal color code for each feature.

Colorbar file

Tab-delimited file with no column names. The first column represents discrete sample features, and the second column corresponds to the colormaps in Matplotlib for each feature.

FDR (BH-adjusted p-value cutoff for ANOVA)

FDR cutoff used for identifying DEPs. The default FDR is 0.05.

Protein clusters

Cluster number for DEPs.

Input Files

1.z statistic file (example)

Output file z_statistic_table.txt from reverse-zMAP as input.

2. Sample cluster file (example)

A comma-separated file with the first column containing sample names and the second column indicating the respective sample groups.

3. Clinical information file (example)

A tab-separated file with the first column representing sample names, and the subsequent columns containing clinical and molecular features. Please note that column names can only include letters, numbers, and underscores. Allows NaN values. When plotting,NaN values are represented by white color by default.

4. Color file (example)

Tab-delimited file with no column names. The first column represents discrete sample features, and the second column corresponds to the hexadecimal color code (such as #808080) for each feature.

5. Colorbar file (example)

Tab-delimited file with no column names. The first column represents discrete sample features, and the second column corresponds to the colormaps in Matplotlib for each feature.

Colormaps: https://matplotlib.org/stable/users/explain/colors/colormaps.html

Output File

1. Graphic output:

sample_clustering_association_with_clinical_and_molecule_feature.pdf

2. Pathway enrichment results:

The directory contains pathway enrichment results for each expression signature.

./enrichment_results

Survival analysis

Plots the Kaplan-Meier survival curve and assesses the significance of survival differences among subgroups.

Parameter

Description

Survival and group file

The sample information file is a four-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, survival_time,death_or_not, and group.

Input Files

1. Survival and group file (example)

Sample information file is a four-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, survival_time, death_or_not and group.

Output File

1. Graphic output:

survival_analysis.pdf

Association with survival data

Cox proportional hazard regression model is used to identify prognostic markers based on z-statistic.3

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from reverse-zMAP

Survival file

The sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, survival_time, and death_or_not.

Input Files

1. z-statistic file (example)

Output file z_statistic_table.txt from reverse-zMAP.

2. Survival file (example)

Sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, survival_time, and death_or_not.

Output File

1. Cox-regression results file.

Proteins are categorized into three groups: favorable, unfavorable, and not prognostic.

results_cox_regression.txt

Association with mutation data

To pick out the proteins that were associated with non-silent somatic mutations, QTL analysis is used. Somatic mutations in at least 10 samples (Users determine based on the number of samples.). Then, we pick out the proteins whose expression was detected in at least half of the tumor samples. Finally, for each candidate gene-protein pair, we performed a regression of the protein expression against the mutation status of the gene using MatrixEQTL R package4, by fitting the following linear model:Y = α + βX + γS + δA+ε. Here, Y was a vector of the z-statistics of the protein in tumor samples; X was the non-silent somatic mutation indicators of the gene; S and A referred to the sex and age variables, respectively, as well as other control variables; ε was a vector of independent and identically distributed noise variables; α, β, γ, and δ were unknown parameters. This model was fitted by applying the least squares method, and the null hypothesis β=0 was then tested by applying the t-test.

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from reverse-zMAP

Mutation file

Tab-delimited file where rows represent genes, columns represent samples, and values are coded as 1 for non-silent mutations and 0 for no non-silent mutations.

Covariates file

The file includes additional covariates. Columns represent samples, while rows encompass various additional covariates, including age, sex, and others. The values inside must be numerical.

Gene location file

The file containing transcription start site information for all genes in the genome. It contains three columns with the column names being gene, chromosome, and tss respectively.

Chromosome length file

Tab-delimited file contains two columns, chromosome and length.

FDR(BH-adjusted p-value cutoff)

The default FDR is 0.05.

Input Files

1. z-statistic file (example)

Output file z_statistic_table.txt from reverse-zMAP.

2. Mutation file (example)

Tab-delimited file where rows represent genes, columns represent samples, and values are coded as 1 for non-silent mutations and 0 for no non-silent mutations.

3. Covariates file (example)

The file includes additional covariates. Columns represent samples, while rows encompass various additional covariates, including age, sex, and others. The values inside must be numerical.

4. Gene location file (example)

The file containing transcription start site information for all genes in the genome. It contains three columns with the column names being gene, chromosome, and tss respectively.

5. Chromosome length file (example)

Tab-delimited file contains two columns, chromosome and length.

Output File

1. Graphic output:

pQTL_results_FDR_XX.pdf (XX is FDR cutoff)

Two-dimensional plot displaying the significant gene-protein associations, with Y and X axes representing the locations of proteins and mutated genes in the genome, respectively. The total number of proteins associated with each mutated gene is also displayed. For a specific gene-protein association, Beta>0 suggests the protein has increased expression with the mutation of the gene, and vice versa.

2. Proteins associated with mutated genes.

results_association_z_statistic_with_mutation.txt

3. Number of proteins associated with mutated genes.

mutation_associated_protein_count.txt

 

Gene set variation analysis

GSVA calculates gene set enrichment scores (GSVA scores) for each sample5 using the z-statistic matrix. Differential expression analysis is then conducted on these GSVA scores using limma6, aiming to identify changes in pathway activities across multiple sample groups. Finally, the differential pathway activities across sample groups are visualized using a heatmap.

 

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from zMAP or reverse-zMAP as input.

Sample information file

Sample information file is a three-column, tab-delimited file with the first line identifying the columns. The column names are Sample_id, Sample_condition, and MS_run.

Top N(number) pathways

Extract a table of the top-ranked differentially regulated biological processes and KEGG pathways across sample groups.

FDR(BH-adjusted pvalue cutoff for significance)

Adjusted pvalue cutoff for the differentially regulated biological processes and pathways across sample groups. The default is 0.05.

Input Files

1.z statistic file (example)

Output file z_statistic_table.txt from zMAP or reverse-zMAP as input.

2.Sample information file (example)

As above

Output Files

1. Tables displaying GSVA scores with samples in columns and pathways in rows.

kegg_gsva_kcdf_Gaussian.csv

go_gsva_kcdf_Gaussian.csv

2. Table listing the top-ranked differentially regulated pathways across sample groups.

kegg_gsva_differential_pathway_activity_top_XX.csv (XX is the number of top-ranked differentially regulated pathways)

go_gsva_differential_pathway_activity_top_XX.csv (XX is the number of top-ranked differentially regulated pathways)

3. Heatmap of differentially regulated pathways.

kegg_gsva_differential_pathway_activity_top_XX_adj.P.Val_XXX.pdf

go_gsva_differential_pathway_activity_top_XX_adj.P.Val_XXX.pdf

(both PDF and PNG, XX is the number of top-ranked differentially active gene setsXXX is the BH-adjusted P-value cutoff for differential activity pathway)

Co-expression network

A protein co-expression network is constructed based on the z-statistics. Proteins detected in more than half of the samples are included in the analysis. We first used the KNN method to impute missing values in the z-statistic matrix. Then, we obtained the PTCCs for all pairs of the proteins by applying a previously developed method for covariance matrix estimation7. Finally, the statistical significance of each PTCC was assessed based on a mixture model8. When constructing the co-expression network, an edge was added to link a protein pair if and only if the corresponding PTCC was positive and significant.

Parameter

Description

z-statistic file

Output file z_statistic_table.txt from zMAP or reverse-zMAP as input.

Input Files

1.z statistic file (example)

Output file z_statistic_table.txt from zMAP or reverse-zMAP as input.

Output Files

1. protein-protein interaction pairs

 

Reference

1          Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52, 91-118, doi:Doi 10.1023/A:1023949509487 (2003).

2          Wilkerson, M. D. & Hayes, D. N. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26, 1572-1573, doi:10.1093/bioinformatics/btq170 (2010).

3          Cox, D. R. Regression models and lifetables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187-202 (1972).

4          Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353-1358 (2012).

5          Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics 14, 1-15 (2013).

6          Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research 43, e47, doi:10.1093/nar/gkv007 (2015).

7          Schäfer, J. & Strimmer, K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology 4, Article32, doi:10.2202/1544-6115.1175 (2005).

8          Efron, B. J. J. o. t. A. S. A. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis.  99, 96-104 (2004).