This section demonstrates the improvement in the accuracy of PCAPAM50 subtyping over PAM50 subtyping.
For the test data, we compare the accuracy of PCAPAM50 subtyping over PAM50 subtyping by comparing the intrinsic subtype calls to the IHC calls.
Below is the comparison of test data PAM50 calls and IHC subtype calls agreement:
res.PAM50$Int.sbs$IHC = factor(res.PAM50$Int.sbs$IHC, levels = c("TN", "HER2+", "LA", "LB1", "LB2")) addmargins(table(res.PAM50$Int.sbs$Int.SBS.Mdns.PAM50, res.PAM50$Int.sbs$IHC)) # TN HER2+ LA LB1 LB2 Sum # Basal 20 3 0 4 0 27 # Her2 2 3 0 0 4 9 # LumA 0 0 17 33 14 64 # LumB 1 0 2 26 9 38 # Normal 0 1 0 2 0 3 # Sum 23 7 19 65 27 141
Agreements:
Basal.TN agreement = 20/27 Her2.HER2+ agreement = 3/9 LumA.LA agreement = 17/64 LumB.LB1/LB2 agreement = 35/38
Here is the comparison of test data PCAPAM50 calls and IHC subtype calls agreement:
res.PCAPAM50$Int.sbs$IHC=toupper(res.PCAPAM50$Int.sbs$IHC) res.PCAPAM50$Int.sbs$IHC = factor(res.PCAPAM50$Int.sbs$IHC,levels = c("TN", "HER2+", "LA", "LB1", "LB2")) addmargins(table(res.PCAPAM50$Int.sbs$Int.SBS.Mdns.PCAPAM50, res.PCAPAM50$Int.sbs$IHC)) # TN HER2+ LA LB1 LB2 Sum # Basal 20 3 0 4 0 27 # Her2 2 4 0 0 4 10 # LumA 0 0 17 27 11 55 # LumB 1 0 2 33 12 48 # Normal 0 0 0 1 0 1 # Sum 23 7 19 65 27 141
Agreements:
Basal.TN agreement = 20/27 Her2.HER2+ agreement = 4/10 LumA.LA agreement = 17/55 LumB.LB1/LB2 = 45/48
Upon comparison, we can see that PCAPAM50 has an overall agreement of 86/141 = 61%, which is an improvement of 7% over the PAM50 calls. If you closely examine the results, you will notice that the consistency of IHC LB subtype with LumB increased by 28.5%(35 to 45) with PCAPAM50.
In order to demonstrate improvements and replicate the results from the original paper, we obtained the 712 TCGA breast cancer cases from the original study where we derived IHC calls. This data is provided with the PCAPAM50 package. Follow the examples below to run the analysis. The conventional PAM50 calls obtained in the paper are included in the dataset, so only running PCAPAM50 is necessary to show the improvement.
1) Load TCGA data:
tcga_data_path <- system.file("extdata", "TCGA.712BC_IHC_PAM-Mat.Rdat", package = "PCAPAM50") load(tcga_data_path) # Loads "TCGA.712BC.IHC" "TCGA.712BC.matrix"
2) Prepare the data:
TCGA.712BC.IHC$ER_status <- rep("NA", length(TCGA.712BC.IHC$PatientID)) TCGA.712BC.IHC$ER_status[grep("^L",TCGA.712BC.IHC$IHC)] = "pos" TCGA.712BC.IHC$ER_status[-grep("^L",TCGA.712BC.IHC$IHC)] = "neg" TCGA.712BC.IHC <- TCGA.712BC.IHC[order(TCGA.712BC.IHC$ER_status, decreasing = TRUE),]
Display the sorted data:
TCGA.712BC.IHC$ER_status=factor(TCGA.712BC.IHC$ER_status, levels=c("pos", "neg")) TCGA.712BC.IHC$IHC=factor(TCGA.712BC.IHC$IHC, levels=c("TN", "Her2+", "LA", "LB1", "LB2")) table(TCGA.712BC.IHC$ER_status, TCGA.712BC.IHC$IHC) # TN Her2+ LA LB1 LB2 # pos 0 0 111 325 123 # neg 116 37 0 0 0
Let’s examine the matrix. First, sort the test matrix using the IHC dataframe:
TCGA.712BC.matrix <- TCGA.712BC.matrix[, TCGA.712BC.IHC$PatientID] dim(TCGA.712BC.matrix) #[1] 50 712
This matrix has the 50 PAM50 genes and 712 sample expression values. This is already an upper-quartile (UQ) normalized log2(x+1) transformed dataset of PAM50 gene expression from RNA-Seq data.
3) Create the Clinical Subtype Data Frame:
df.tcga.cln <- data.frame(PatientID = TCGA.712BC.IHC$PatientID, IHC = TCGA.712BC.IHC$IHC, stringsAsFactors = FALSE) inputDir <- "Call.PC1.TCGA"
4) Call the Function:
res.PC1 <- makeCalls.PC1ihc(df.cln = df.tcga.cln, seed = 118, mat = TCGA.712BC.matrix, inputDir = inputDir)
5) makeCalls.v1PAM:
df.pc1pam = data.frame(PatientID=res.PC1$Int.sbs$PatientID, PAM50=res.PC1$Int.sbs$Int.SBS.Mdns.PC1ihc, IHC=res.PC1$Int.sbs$IHC, stringsAsFactors=F) ### IHC column is optional inputDir <- "Calls.PCAPAM50.TCGA" TCGA.712BC.matrix = TCGA.712BC.matrix[,df.pc1pam$PatientID] res.PCAPAM50 <- makeCalls.v1PAM(df.pam = df.pc1pam, seed = 118, mat = TCGA.712BC.matrix, inputDir=inputDir)
6) PCAPAM50 vs PAM50:
PAM50 vs IHC:
addmargins(table(TCGA.712BC.IHC$PAM50_Given.Mdns, TCGA.712BC.IHC$IHC)) # TN Her2+ LA LB1 LB2 Sum # Basal 100 7 1 12 1 121 # Her2 9 29 0 4 19 61 # LumA 2 0 100 200 64 366 # LumB 1 0 5 100 36 142 # Normal 4 1 5 9 3 22 # Sum 116 37 111 325 123 712
Agreements:
Basal.TN agreement = 100/121 Her2.HER2+ agreement = 29/61 LumA.LA agreement = 100/366 LumB.LB1/LB2 agreement = 136/142
PCAPAM50 vs IHC:
res.PCAPAM50$Int.sbs$IHC=toupper(res.PCAPAM50$Int.sbs$IHC) res.PCAPAM50$Int.sbs$IHC = factor(res.PCAPAM50$Int.sbs$IHC,levels=c("TN", "HER2+", "LA", "LB1", "LB2")) addmargins(table(res.PCAPAM50$Int.sbs$Int.SBS.Mdns.PCAPAM50, res.PCAPAM50$Int.sbs$IHC)) # TN HER2+ LA LB1 LB2 Sum # Basal 99 7 0 12 1 119 # Her2 10 30 0 7 24 71 # LumA 3 0 103 168 43 317 # LumB 2 0 6 135 55 198 # Normal 2 0 2 3 0 7 # Sum 116 37 111 325 123 712
Agreements:
Basal.TN agreement = 99/119 Her2.HER2+ agreement = 30/71 LumA.LA agreement = 103/317 LumB.LB1/LB2 agreement = 190/198
Upon comparison, we can see that PCAPAM50 has an overall agreement of 422/712 = 59.3%, which is an improvement of 8% over the PAM50 calls. If you closely examine the results, you will notice that the consistency of IHC LB subtype with LumB increased by 39.7%(136 to 190) with PCAPAM50.