The PCAPAM50 Approach

Pipeline Overview

The PCAPAM50 pipeline consists of two steps. The first creates a gene expression-guided ER-balanced subset and uses it to make intermediate subtype calls. The second uses these intermediate calls to perform a refined intrinsic subtyping, called PCAPAM50. This page focuses on the PCAPAM50 approach. For instructions on the Conventional PAM50 approach, please visit its respective page.


1. makeCalls.PC1ihc – Intermediate Intrinsic Subtype Calls

This function processes clinical IHC subtyping data and PAM50 gene expression data to form a gene expression-guided ER-balanced set. The set is created by combining the IHC classification with principal component 1 (PC1) of the expression data, which guides the separation. The function computes the median of each gene in this ER-balanced set, updates a calibration file, and runs the subtype prediction algorithms to generate intermediate intrinsic subtype calls based on the PAM50 method. It returns the subtyping results together with several diagnostics.
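The idea of using PC1 to guide the separation can be illustrated on simulated data. The following is a minimal sketch, not the package's implementation: the simulated matrix, the cutoff scan, and all variable names are assumptions made for illustration. It projects samples onto PC1 and scans candidate cutoffs for the one that minimizes disagreement with the known ER status.

```r
set.seed(1)
n_genes <- 50; n_pos <- 60; n_neg <- 20

# Simulated genes x samples matrix (illustrative only):
# ER-positive samples are shifted upward in the first 25 genes.
mat <- matrix(rnorm(n_genes * (n_pos + n_neg)), nrow = n_genes)
mat[1:25, 1:n_pos] <- mat[1:25, 1:n_pos] + 2
er <- rep(c("pos", "neg"), c(n_pos, n_neg))

# PCA with samples as observations; keep the PC1 scores.
pca <- prcomp(t(mat), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# The sign of a principal component is arbitrary, so orient PC1
# such that ER-positive samples have the higher mean score.
if (mean(pc1[er == "pos"]) < mean(pc1[er == "neg"])) pc1 <- -pc1

# Scan every observed PC1 value as a candidate cutoff and record
# the fraction of samples whose side disagrees with the ER label.
cutoffs <- sort(unique(pc1))
miscl <- vapply(cutoffs, function(cut) {
  predicted <- ifelse(pc1 >= cut, "pos", "neg")
  mean(predicted != er)
}, numeric(1))

best_cut <- cutoffs[which.min(miscl)]
```

Plotting `miscl` against `cutoffs` gives a curve of the same kind as the misclassification plot the function produces, although the package may define its cutoff differently.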


1) Load the Test Data:
The test data is derived from the TCGA breast cancer dataset. The test matrix is an upper-quartile (UQ) normalized log2(x+1) transformed dataset of PAM50 gene expression from RNA-Seq data. It is recommended to perform UQ normalization and log2 transformation on your input matrix to closely align with the scale of PAM50 centroids.

data_path <- system.file("extdata", "Sample_IHC_PAM-Mat.Rdat", package = "PCAPAM50")

load(data_path) # Loads Test.ihc and Test.matrix
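If your own data start as raw or expected counts, the recommended UQ normalization and log2 transformation might look like the sketch below. The helper name `uq_log2`, the scaling constant of 1000, and the choice to compute the upper quartile over non-zero values are assumptions, not part of the package. The upper quartile is typically computed across all genes, before subsetting to the PAM50 genes.

```r
# Hypothetical helper: upper-quartile normalize each sample (column),
# then apply a log2(x + 1) transform.
uq_log2 <- function(counts, scale = 1000) {
  stopifnot(is.matrix(counts), is.numeric(counts))
  # 75th percentile of the non-zero values in each sample.
  uq <- apply(counts, 2, function(x) {
    nz <- x[x > 0]
    if (length(nz) == 0) NA_real_ else unname(quantile(nz, 0.75))
  })
  # Divide every column by its upper quartile and rescale.
  normalized <- sweep(counts, 2, uq, "/") * scale
  log2(normalized + 1)
}

# Toy example: sample S2 has exactly half the depth of sample S1,
# so the two samples become identical after normalization.
counts <- matrix(c(0, 10, 20, 40,
                   0,  5, 10, 20),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("S1", "S2")))
expr <- uq_log2(counts)
```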

2) Prepare the Data:
Ensure the clinical subtype data frame has a column "PatientID" whose values match the column names of the matrix. The IHC subtype column should be named "IHC", with ER-positive subtypes starting with "L" (for luminals) and ER-negative subtypes not starting with "L". In the test data, ER-positive cases are labeled "LA", "LB1", and "LB2", and ER-negative cases are labeled "TN" and "Her2+".
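These requirements can be checked before running the pipeline. The sketch below uses a hypothetical helper, `check_clinical_input`, which is not part of the package, and toy data invented for illustration.

```r
# Hypothetical helper: validate the clinical data frame against the matrix
# and report how many cases are ER-positive and ER-negative.
check_clinical_input <- function(ihc, mat) {
  stopifnot(is.data.frame(ihc), all(c("PatientID", "IHC") %in% names(ihc)))
  ids <- as.character(ihc$PatientID)
  if (anyDuplicated(ids) > 0) stop("Duplicated PatientID values.")
  missing_ids <- setdiff(ids, colnames(mat))
  if (length(missing_ids) > 0) {
    stop("PatientID values not found in matrix columns: ",
         paste(missing_ids, collapse = ", "))
  }
  # ER-positive IHC labels start with "L"; all others count as ER-negative.
  er_pos <- grepl("^L", as.character(ihc$IHC))
  c(pos = sum(er_pos), neg = sum(!er_pos))
}

# Toy data (illustrative only).
toy_ihc <- data.frame(PatientID = c("P1", "P2", "P3", "P4"),
                      IHC = c("LA", "TN", "LB1", "Her2+"),
                      stringsAsFactors = FALSE)
toy_mat <- matrix(rnorm(8), nrow = 2,
                  dimnames = list(c("ESR1", "ERBB2"), toy_ihc$PatientID))

er_counts <- check_clinical_input(toy_ihc, toy_mat)
```

With the test data, the equivalent call would be `check_clinical_input(Test.ihc, Test.matrix)`.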

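Derive the ER status from the IHC labels, then sort the data frame so that ER-positive cases come first:

# ER-positive IHC labels start with "L"; everything else is ER-negative.
# grepl() is used because negative grep() indexing selects nothing when there are no matches.
Test.ihc$ER_status <- ifelse(grepl("^L", Test.ihc$IHC), "pos", "neg")

Test.ihc <- Test.ihc[order(Test.ihc$ER_status, decreasing = TRUE),]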

Convert both columns to factors and tabulate the sorted data:

Test.ihc$ER_status <- factor(Test.ihc$ER_status, levels = c("pos", "neg"))
Test.ihc$IHC <- factor(Test.ihc$IHC, levels = c("TN", "Her2+", "LA", "LB1", "LB2"))

table(Test.ihc$ER_status, Test.ihc$IHC)
#      TN Her2+ LA LB1 LB2
#  pos  0     0 19  65  27
#  neg 23     7  0   0   0 

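Now examine the matrix. First, reorder the columns of the test matrix to follow the sorted IHC data frame:

# as.character() guards against PatientID being a factor, which would index by integer code.
Test.matrix <- Test.matrix[, as.character(Test.ihc$PatientID)]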

Next, check the dimensions of the Test.matrix:

dim(Test.matrix)
#[1]  50 141

This matrix contains expression values for the 50 PAM50 genes (rows) across 141 samples (columns).
Important note: Ensure that the gene names of your input matrix match the 50 gene names provided in the test matrix.
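One way to enforce this is to align your matrix to the reference gene names and fail loudly if any are missing. The sketch below is illustrative: `align_to_reference_genes` is a hypothetical helper, and the three-gene reference vector stands in for `rownames(Test.matrix)`.

```r
# Hypothetical helper: keep only the reference genes, in reference order,
# and stop if any reference gene is absent from the input matrix.
align_to_reference_genes <- function(mat, ref_genes) {
  missing_genes <- setdiff(ref_genes, rownames(mat))
  if (length(missing_genes) > 0) {
    stop("Genes missing from input matrix: ",
         paste(missing_genes, collapse = ", "))
  }
  mat[ref_genes, , drop = FALSE]
}

# Stand-in for rownames(Test.matrix); the real reference has 50 genes.
ref_genes <- c("ESR1", "ERBB2", "MKI67")

# Toy input with an extra gene and a different row order.
my_mat <- matrix(1:8, nrow = 4,
                 dimnames = list(c("MKI67", "GAPDH", "ERBB2", "ESR1"),
                                 c("S1", "S2")))

aligned <- align_to_reference_genes(my_mat, ref_genes)
```

Gene symbols can differ between annotation versions, so a missing-gene error may point to an alias rather than a truly absent gene.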


3) Create the Clinical Subtype Data Frame:
Create the clinical subtype data frame from the test data. The inputDir variable sets the folder to which output files are written.

df.cln <- data.frame(PatientID = Test.ihc$PatientID, IHC = Test.ihc$IHC, stringsAsFactors = FALSE)

inputDir <- "Call.PC1"

4) Call the Function:
Run the makeCalls.PC1ihc function. Refer to the manual for detailed documentation on usage and arguments. Example run on test data:

res.PC1 <- makeCalls.PC1ihc(df.cln = df.cln, seed = 118, mat = Test.matrix, inputDir = inputDir)

The function returns a list containing:

- Int.sbs - Data frame with integrated subtype and clinical data.
- score.fl - Data frame with scores from subtype predictions.
- mdns.fl - Data frame with median values for each gene in the ER-balanced set.
- SBS.colr - Colors associated with each subtype from the prediction results.
- outList - Detailed results from subtype prediction functions.
- PC1cutoff - Cutoff values for PC1 used in subsetting.
- DF.PC1 - Data frame of initial PCA results merged with clinical data.

The function also writes a plot to the inputDir folder showing the percentage of misclassified IHC cases along the PC1 axis, with a vertical line marking the selected cutoff.

PC1_misclassified_cases.png

A heatmap is also generated within the inputDir folder.

PC1ihc.Mdns_PAM50_normalized_heatmap.pdf
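A quick way to review the intermediate calls is to cross-tabulate them against the IHC labels. The sketch below runs on a mock data frame that stands in for `res.PC1$Int.sbs`. The rows are invented for illustration, and the subtype labels other than Basal and LumA are assumed, so the actual output may use different labels.

```r
# Mock stand-in for res.PC1$Int.sbs (values invented for illustration).
mock_int_sbs <- data.frame(
  PatientID = paste0("P", 1:6),
  IHC = c("LA", "LB1", "TN", "Her2+", "LA", "TN"),
  Int.SBS.Mdns.PC1ihc = c("LumA", "LumB", "Basal", "Her2", "LumA", "Basal"),
  stringsAsFactors = FALSE
)

# Rows: clinical IHC subtype; columns: intermediate intrinsic subtype call.
cross_tab <- table(IHC = mock_int_sbs$IHC,
                   Intermediate = mock_int_sbs$Int.SBS.Mdns.PC1ihc)
print(cross_tab)
```

With real results, the same table would be built from `res.PC1$Int.sbs$IHC` and `res.PC1$Int.sbs$Int.SBS.Mdns.PC1ihc`.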


2. makeCalls.v1PAM – PCAPAM50 Calls

This function uses the intermediate intrinsic subtype calls to create an ER-balanced set. It internally selects an equal number of Basal and LumA cases to form this subset.
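The balancing step can be pictured as follows. This is a minimal sketch under stated assumptions, not the package's internal code: it samples equal numbers of Basal and LumA cases from simulated calls, computes gene medians on that subset, and centers the full matrix on those medians. All data and variable names are illustrative.

```r
set.seed(118)

# Simulated intermediate calls and a small genes x samples matrix.
calls <- rep(c("Basal", "LumA", "LumB"), c(5, 12, 8))
mat <- matrix(rnorm(3 * length(calls)), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              paste0("P", seq_along(calls))))

basal_idx <- which(calls == "Basal")
luma_idx  <- which(calls == "LumA")

# Use the size of the smaller group so both groups contribute equally.
n_each <- min(length(basal_idx), length(luma_idx))

# sample.int() on the index positions avoids the sample() pitfall
# when a group contains a single element.
balanced <- c(basal_idx[sample.int(length(basal_idx), n_each)],
              luma_idx[sample.int(length(luma_idx), n_each)])

# Gene medians from the balanced subset, then center every sample on them.
gene_medians <- apply(mat[, balanced, drop = FALSE], 1, median)
centered <- sweep(mat, 1, gene_medians, "-")
```

Because the subset is drawn at random, fixing the seed (as the seed argument does in the function calls on this page) keeps the result reproducible.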


1) Call the Function:
Call the function makeCalls.v1PAM() on test data. Refer to the manual for detailed documentation on usage and arguments.

df.pc1pam <- data.frame(PatientID = res.PC1$Int.sbs$PatientID,
                        PAM50 = res.PC1$Int.sbs$Int.SBS.Mdns.PC1ihc,
                        IHC = res.PC1$Int.sbs$IHC, # the IHC column is optional
                        stringsAsFactors = FALSE)
  
inputDir <- "Calls.PCAPAM50" 

res.PCAPAM50 <- makeCalls.v1PAM(df.pam = df.pc1pam, seed = 118, mat = Test.matrix, inputDir=inputDir)

The function returns a list containing:

- Int.sbs - Data frame with integrated subtype and clinical data.
- score.fl - Data frame with scores from subtype predictions.
- mdns.fl - Data frame with median values for each gene in the ER-balanced set.
- SBS.colr - Colors associated with each subtype from the prediction results.
- outList - Detailed results from subtype prediction functions.

A heatmap is generated within the inputDir folder.

PCAPAM50.Mdns_PAM50_normalized_heatmap.pdf