Blood Protein - Methods summary

Summary

The Blood Proteins section contains information regarding the proteins present in blood. External and in-house generated data are integrated to explore human plasma protein profiles in healthy individuals. Plasma protein levels are presented based on both antibody-based immunoassays and mass spectrometry-based proteomics.

Key publications

Uhlén M et al. (2019) "The human secretome" Sci Signal

What can you learn from the Blood Protein section?

Learn about

  • concentration of human plasma proteins based on immune assays and MS
  • expression of plasma proteins in healthy individuals based on PEA

Data overview

Data type Count Data Coverage (nr genes)
Protein concentration Protein concentrations across 453 genes measured with immunoassays 453
Protein concentration Protein concentrations across 4294 genes measured with mass spectrometry 4294

How has the data been generated?

The plasma proteome levels from healthy individuals were measured using proximity extension assay (PEA). The healthy individuals were followed longitudinally for two years, and the plasma proteome levels were measured every three months. Data generated by PEA were normalized within and between plates, transformed using a predetermined correction factor, and provided in the arbitrary unit Normalized Protein eXpression (NPX). To analyze the longitudinal healthy datasets while accounting for both within-subject and inter-individual variability, linear mixed-effects models were applied. The models include random intercepts for individual subjects and fixed effects for age and sex. Proteins with more than 80% of samples below the limit of detection were excluded.

The Blood Atlas also contains information on proteins detected by mass spectrometry-based proteomics, based on publicly available data in the Peptide Atlas. The mass spectrometry-based data were filtered to include only the minimal, non-redundant list of proteins derived from the set of identified peptides and to exclude entries labelled as contaminants. In addition, the concentration of actively secreted proteins was annotated using publicly available literature.
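The detection-limit filter described above for the PEA data can be sketched as follows. This is a minimal illustration, not the pipeline used for the atlas: the function names, the per-protein limit-of-detection dictionary, and the choice to count a value as undetected only when it is strictly below the limit are all assumptions.

```python
from typing import Dict, List


def fraction_below_lod(values: List[float], lod: float) -> float:
    """Return the fraction of measurements strictly below the limit of detection."""
    if not values:
        raise ValueError("at least one measurement is required")
    return sum(v < lod for v in values) / len(values)


def filter_proteins(
    npx: Dict[str, List[float]],
    lod: Dict[str, float],
    max_fraction_below: float = 0.80,
) -> Dict[str, List[float]]:
    """Keep proteins unless MORE than `max_fraction_below` of samples fall below LOD.

    `npx` maps a protein name to its NPX values across samples and `lod` maps a
    protein name to its limit of detection (both hypothetical structures).
    """
    kept = {}
    for protein, values in npx.items():
        if fraction_below_lod(values, lod[protein]) <= max_fraction_below:
            kept[protein] = values
    return kept
```

With this reading of the rule, a protein with exactly 80% of samples below the limit is retained, and one with 90% is dropped.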


What is presented in the section?

On the gene summary page of the Blood Proteins section, the protein levels from the PEA-based assays are shown. The longitudinal variation in the level of a protein in plasma from healthy individuals is displayed in two line plots, one for each sex.



In addition, the gene summary page includes the blood concentrations from mass spectrometry studies when available in the Peptide Atlas. The predicted concentration in blood (plasma or serum) is shown.


Furthermore, the blood concentration of proteins annotated as secreted to blood is shown when available. The reported concentrations in blood are shown for some representative studies, with references to the literature.


Human Disease Blood Atlas - Method Summary

Summary

A comprehensive characterization of the blood proteome profiles in patients with various diseases can contribute to a better understanding of disease etiology, resulting in earlier diagnosis, risk stratification and better monitoring of disease progression. Connecting the dynamics of the plasma proteome to functionality across conditions could provide a window into the biology and mechanisms of these conditions and broaden the horizon for new treatments. Precision Medicine thus aims to allow for an individualized diagnosis, treatment and monitoring of patients, including the use of molecular tools such as genomics, proteomics and metabolomics. Technologies such as Proximity Extension Assay and Targeted Mass Spectrometry are well suited to this kind of profiling. In the first version of the Human Disease Blood Atlas, a pan-cancer study covering 12 major cancer types was reported. In the current version, protein profiles for 59 diseases are presented.

Key publications

Álvez MB et al. (2023) "Next generation pan-cancer blood proteome profiling using proximity extension assay" Nat Commun 14, 4308.

Kotol D et al. (2023) "Absolute quantification of pan-cancer plasma proteomes reveals unique signature in multiple myeloma" Cancers 15(19), 4764.

What can you learn from the Disease Blood Atlas?

Learn about

  • comprehensive and precise protein levels in blood covering 59 diseases
  • proteins associated with each of the analyzed diseases

Data overview

Data type Count Data Coverage (nr genes)
Protein expression 66 Differential expression analysis across 66 diseases 1165

How was the Proximity Extension Assay data generated?

Next Generation Blood Profiling was performed by combining antibody-based proximity extension assay with next generation sequencing (Wik L et al. (2021)). This method enables the multiplex exploration of protein concentrations in blood from patients with different diseases. Plasma profiles of 1165 proteins from more than 6000 patients representing altogether 59 diseases (Figure 1) were measured in minute amounts of blood plasma collected at the time of diagnosis and before treatment. The diseases in this study belong to different classes, including cardiovascular, metabolic, cancer, psychiatric, autoimmune, infectious, and pediatric diseases.


Figure 1. Overview of pan-disease blood proteome profiling study.

Differential abundance analyses

To investigate disease-specific proteome profiles, differential abundance analyses were conducted with the following comparisons:

  1. Disease vs. Healthy samples: comparing each disease to healthy controls.
  2. Disease vs. Diseases from the same class: comparing each disease to others within the same disease class.
  3. Disease vs. All other diseases: comparing each disease to all other diseases in the study.

The models were generated using the limma R package (Ritchie ME et al. (2015)), with the following model covariates:

  • Age and sex adjustments: For general diseases, both age and sex were included as covariates in the model to control for their potential effects on protein expression.
  • Sex-specific diseases: For diseases identified as sex-specific, comparisons were only made between samples of the same sex. Sex was not included as a covariate in these analyses to focus on the differential expression related to the disease itself.
  • Pediatric diseases: In cases where pediatric diseases were compared to healthy controls, age was not included as a covariate due to the perfect correlation between age and disease status (e.g., all pediatric cases were very young and healthy controls were older). Including age as a covariate in this scenario would make the effects of age and disease impossible to separate, as age alone determines the disease classification.

This approach ensures that our analyses account for relevant biological variables while addressing specific issues related to data correlations and sample characteristics. Additionally, control samples were matched to the number of cases based on sex and age to ensure a balanced comparison and reduce potential biases in the analysis.
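The covariate rules listed above can be summarized in a short sketch. The actual models were fitted with limma in R; the Python below only illustrates the selection logic, and the function names, the `group` term, and the formula string are hypothetical.

```python
from typing import List


def model_covariates(sex_specific: bool, pediatric_vs_healthy: bool) -> List[str]:
    """Pick adjustment covariates following the rules described in the text.

    - general diseases: adjust for age and sex
    - sex-specific diseases: same-sex comparison, so sex is dropped
    - pediatric disease vs. healthy controls: age is dropped
    """
    covariates = []
    if not pediatric_vs_healthy:
        covariates.append("age")
    if not sex_specific:
        covariates.append("sex")
    return covariates


def design_formula(covariates: List[str], group_term: str = "group") -> str:
    """Build an R-style formula string; `group` stands for the case/control label."""
    return "~ " + " + ".join([group_term] + list(covariates))
```

For a general disease, `design_formula(model_covariates(False, False))` gives `~ group + age + sex`; for a sex-specific disease the `sex` term is omitted.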

The up- and down-regulated proteins in each disease are summarized in the volcano plots displayed in the sections for the different diseases, which highlight the most significantly differentially expressed proteins. The results for all diseased patients for each protein target are presented on the individual gene pages.

Machine learning analyses

Additionally, a machine learning approach was applied to investigate the plasma proteome in diseases with a sufficient sample size. Regularized logistic regression (lasso) classification models were developed in three settings:

  1. Disease vs. Healthy samples: binary classification models trained to distinguish each disease from healthy controls.
  2. Disease vs. Diseases from the same class: multiclass models differentiating diseases belonging to the same disease class.
  3. Disease vs. All other diseases: a single multiclass model trained to classify all diseases against each other.

For all comparisons, the data were split into 70% training and 30% testing sets. Models were trained on 100 splits to account for variability in data partitioning. In the Human Disease Blood Atlas, we report the top-ranking features for classification of each disease across all settings, including the average importance score and standard deviation, calculated from the absolute model estimates and normalized to a 1-100 scale.
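The splitting and importance summary described above can be sketched as follows. This is not the atlas implementation: the text does not state how the absolute estimates are mapped to the 1-100 scale, so min-max scaling is assumed here, and all function names are hypothetical. Model fitting itself is omitted.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Tuple


def split_indices(n_samples: int, train_fraction: float = 0.7,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(n_samples * train_fraction))
    return indices[:n_train], indices[n_train:]


def scale_importance(coefs: Dict[str, float]) -> Dict[str, float]:
    """Map absolute model estimates to a 1-100 scale (min-max scaling is assumed)."""
    abs_vals = {name: abs(value) for name, value in coefs.items()}
    lo, hi = min(abs_vals.values()), max(abs_vals.values())
    if hi == lo:  # all features equally important
        return {name: 100.0 for name in abs_vals}
    return {name: 1 + 99 * (v - lo) / (hi - lo) for name, v in abs_vals.items()}


def summarize(per_split: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Average importance and standard deviation per feature across splits."""
    return {
        name: (mean(s[name] for s in per_split), stdev(s[name] for s in per_split))
        for name in per_split[0]
    }
```

In use, one would call `split_indices` with 100 different seeds, fit a lasso model on each training set, pass each model's coefficients through `scale_importance`, and feed the resulting list to `summarize`.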

How was the Targeted Proteomics data generated?

Targeted proteomics is a bottom-up approach where proteases, most commonly trypsin, are used to digest proteins into peptides that can be measured by liquid chromatography-tandem mass spectrometry (LC-MS/MS). This strategy is an excellent tool for performing measurements with high reproducibility and precision, making it appropriate for quantifying proteins in cells, tissues, and blood.

Targeted proteomics, as opposed to the widely used data-dependent acquisition (DDA), also known as shotgun proteomics, works with a defined collection of peptides and builds on prior knowledge about the analytes. Generally, peptide quantification can be either relative or absolute. Relative quantification is a method for describing the amount of an analyte in proportion to another measurement of the same analyte across several biological samples or across two groups, as in case-control studies. On the other hand, absolute concentrations can be obtained by spiking samples with known amounts of heavy-labelled standards during the sample preparation workflow. Using isotope-labeled peptides or protein standards can also considerably increase consistency and precision and it can be done at a large scale.

A quantitative strategy based on heavy isotope-labeled PrESTs was originally developed as a collaborative effort between Professor Matthias Mann and Professor Mathias Uhlén (Zeiler M et al. (2012)). They introduced the multiplex PrEST-SILAC quantitative approach. This quantitative workflow was based on shotgun proteomics and had the benefit of being relatively simple to execute and straightforward to work with. Stable isotope labeled (SIS) PrESTs, combined with a mass spectrometry readout, can be used in almost any MS setup and analysis mode, including both targeted (SRM, MRM, PRM, DIA) and untargeted (DDA) modes of operation. The standards are added to the sample at the initial stage of the proteomics workflow and can therefore account for potential digestion biases, as they generate the same prototypic peptides (Figure 3) and mimic the exact amino acid repertoire of the endogenous protein. Digestion bias arises when protein standards are not cleaved together with the endogenous proteins from the sample; it is a common source of error that affects almost every LC-MS/MS sample preparation workflow and can be very hard to control for.


Figure 3. The standard's N-terminal sequence enables affinity purification and measurement. The C-terminal portion contains 50–150 human amino acids. Each standard contains numerous tryptic peptides that can be used to measure an unknown sample's target protein.

Each SIS-PrEST standard is fully labeled with 13C- and 15N-enriched arginine and lysine, and the protein sequence used for quantification spans a shorter amino acid sequence (50-150 aa) representative of the target protein of interest (Figure 4).

In the Disease Atlas, 273 SIS-PrESTs were spiked at known concentrations directly into undepleted human blood plasma from 1,469 cancer patients. The spiked amounts were tuned to be as close to a 1:1 ratio with the endogenous proteins as possible. This increases the analytical precision during a one-point calibration-based quantification of the endogenous proteins. The quantitative peptides were selected using the lowest coefficient of variation and highest frequency of detection as selection criteria, and the single best-performing peptide per protein was used.
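The one-point calibration and peptide selection described above can be sketched as follows. This is an illustration under stated assumptions: the endogenous concentration is taken as the light-to-heavy signal ratio multiplied by the known spiked concentration, and the two selection criteria are combined by ranking on detection frequency first and coefficient of variation second, which the text does not specify. All names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional


def endogenous_concentration(light_signal: float, heavy_signal: float,
                             spiked_concentration: float) -> float:
    """One-point calibration: scale the known spike by the light/heavy ratio.

    The result is in the same unit as `spiked_concentration`.
    """
    if heavy_signal <= 0:
        raise ValueError("heavy standard signal must be positive")
    return (light_signal / heavy_signal) * spiked_concentration


def coefficient_of_variation(values: List[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def best_peptide(ratios: Dict[str, List[Optional[float]]]) -> str:
    """Pick one peptide per protein; None marks a sample where it was not detected.

    Assumed ranking: highest detection frequency, then lowest CV.
    """
    def score(name: str):
        observed = ratios[name]
        detected = [r for r in observed if r is not None]
        frequency = len(detected) / len(observed)
        cv = coefficient_of_variation(detected) if len(detected) >= 2 else float("inf")
        return (-frequency, cv)

    return min(ratios, key=score)
```

A spike tuned close to a 1:1 ratio keeps the light and heavy signals in a similar intensity range, which is why the text notes that it improves precision for a single-point calibration.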


Figure 4. Targeted Proteomics workflow using SIS-PrESTs. Production of Standards: PrESTs from the Human Protein Atlas are labeled in high-throughput with heavy Arginine (Arg10) and Lysine (Lys8) amino acid residues. Each PrEST fragment can be individually quantified by the common Q-Tag sequence (also used for purification). Assay Generation: Heavy peptides originating from the PrEST sequence are used to establish targeted assays. The quantitative range is defined, and the protein level in healthy plasma is determined in a pool of healthy volunteers. Targeted Proteomics: SIS-PrESTs are spiked directly into non-depleted human plasma collected from cancer patients and act as internal standards throughout the workflow. Quantitative Mass Spectrometry: Endogenous peptides from each patient are measured together with the spiked internal standard. The known amount of spiked standard is used to calculate the absolute concentration of each protein analyte.

What is presented in the section?

The protein levels for all cancer patients for each protein target, together with information on whether the target is upregulated in any of the diseases and/or is included in any disease prediction model, are presented on the individual gene summary pages in the Human Protein Atlas.