The goal of this project is to estimate survival rates for lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) using data from The Cancer Genome Atlas (TCGA) program. Specifically, we will investigate the impact of demographic and pathological features on survival. Estimating the survival rate and identifying the features that affect it will give clinicians a better baseline from which to tailor therapies or to estimate how likely a treatment is to succeed.
Lung cancer is the second most common form of cancer in both men and women, accounting for 2.3 million of the estimated 17 million total cases. It is also the leading cause of cancer death, making up almost 25% of all cancer deaths. There are two types of lung cancer: small cell and non-small cell. Non-small cell cancers account for 85-90% of all lung cancers. Two of their subtypes, lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD), account for 25-30% and 40% of cases respectively. LUSC is associated with smoking and is usually found in the middle of the lungs. LUAD, on the other hand, is found on the periphery of the lungs; it may be associated with smoking in some cases but is also the most common cancer type among non-smokers. Most people diagnosed with lung cancer are 65 or older, whereas very few are younger than 45.
Are patient survival rates different for LUSC and LUAD?
Are patient survival rates different for sub-groups within the dataset, for example, male vs. female?
How do individual demographic factors affect survival? Does smoking history affect survival rates specifically in LUAD patients?
OK, now that we have framed our questions, let's proceed!
Number of records before and after eliminating negative times.
Cohort | Before | After |
---|---|---|
LUSC | 494 | 478 |
LUAD | 503 | 484 |
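
The filtering step summarized in the table above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the record layout, the `time` key, and the decision to also drop records with a missing time are assumptions.

```python
from typing import Any, Dict, List, Optional


def drop_negative_times(
    records: List[Dict[str, Any]], time_key: str = "time"
) -> List[Dict[str, Any]]:
    """Keep only records whose follow-up time is present and non-negative.

    Assumption: records with a missing time are dropped as well, since they
    cannot be used in a survival analysis.
    """
    kept = []
    for record in records:
        value: Optional[float] = record.get(time_key)
        if value is not None and value >= 0:
            kept.append(record)
    return kept


# Toy records for illustration only (not TCGA data).
toy_records = [
    {"id": "a", "time": 1.2},
    {"id": "b", "time": -0.3},   # negative follow-up time, removed
    {"id": "c", "time": 0.0},    # zero is non-negative, kept
    {"id": "d", "time": None},   # missing time, removed (assumption)
]
cleaned = drop_negative_times(toy_records)
print(len(toy_records), "records before,", len(cleaned), "after")
```

Comparing the list lengths before and after the call yields the kind of before/after counts reported in the table.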
Let’s explore the RACE and ETHNICITY variables
Ethnicity
The ethnicity variable has 2 unique values: hispanic or latino, and not hispanic or latino. The non hispanic or latino category has a negligible frequency in both LUSC and LUAD, so we will drop this variable.
Race
Race (LUSC) | Freq |
---|---|
asian | 7 |
black or african american | 30 |
white | 337 |
Race (LUAD) | Freq |
---|---|
american indian or alaska native | 1 |
asian | 8 |
black or african american | 52 |
white | 379 |
The race variable has 3 and 4 unique values in LUSC and LUAD respectively; however, only the white and Black or African American categories are represented with appreciable frequency. We will therefore retain this feature but combine all minority categories into one, leaving two categories in this column: white and other.
Based on the results above, we will do the following to clean the data:
* Drop ethnicity
* Delete the missing values from the stage_m, stage_n, pathologic_stage, race, and smoking history columns
* Combine Asian, American Indian, and Black or African American into one category called “other” in the race column
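The recoding of the race column described in the list above can be sketched as below. The function name `collapse_race` is hypothetical, and the counts are the LUAD frequencies from the race table earlier in this section, taken before the rows with missing values in other columns were removed.

```python
from collections import Counter
from typing import Optional


def collapse_race(value: Optional[str]) -> Optional[str]:
    """Map every non-white race label to 'other'; leave missing values as None."""
    if value is None:
        return None
    return "white" if value.strip().lower() == "white" else "other"


# LUAD race frequencies as reported in the table above.
luad_race_counts = {
    "american indian or alaska native": 1,
    "asian": 8,
    "black or african american": 52,
    "white": 379,
}

# Re-aggregate the counts under the collapsed labels.
collapsed = Counter()
for label, count in luad_race_counts.items():
    collapsed[collapse_race(label)] += count

print(dict(collapsed))
```

Rows with a missing race are left as `None` here so that they can be removed in the separate missing-value step.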
Now that we have clean data, let’s explore each variable
Gender (LUSC) | 0 | 1 |
---|---|---|
female | 73 | 31 |
male | 168 | 85 |
Gender (LUAD) | 0 | 1 |
---|---|---|
female | 177 | 55 |
male | 147 | 42 |
Kaplan-Meier estimates do not handle continuous variables, so we are going to convert age into a categorical variable with two categories:
* For LUSC: above 69 and below 69
* For LUAD: above 65 and below 65
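A minimal sketch of this dichotomization is shown below. The text does not say which category a patient exactly at the cutoff falls into; placing them in the "below" group is an assumption made for this example, and the function name is hypothetical.

```python
def age_category(age: float, cutoff: int) -> str:
    """Return 'above<cutoff>' or 'below<cutoff>' for a given age.

    Assumption: an age exactly equal to the cutoff is labeled 'below'.
    """
    return f"above{cutoff}" if age > cutoff else f"below{cutoff}"


# Cutoffs as stated in the text: 69 for LUSC and 65 for LUAD.
CUTOFFS = {"lusc": 69, "luad": 65}

# Toy ages for illustration only.
for cohort, age in [("lusc", 72), ("lusc", 69), ("luad", 58), ("luad", 66)]:
    print(cohort, age, age_category(age, CUTOFFS[cohort]))
```

The resulting labels match the `agecat` levels that appear in the tables that follow (for example `above69` and `below65`).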
Age category (LUSC) | 0 | 1 |
---|---|---|
above69 | 150 | 83 |
below69 | 91 | 33 |
Age category (LUAD) | 0 | 1 |
---|---|---|
above65 | 167 | 60 |
below65 | 157 | 37 |
Smoking history (LUSC) | 0 | 1 |
---|---|---|
current reformed smoker for < or = 15 years | 117 | 61 |
current reformed smoker for > 15 years | 40 | 18 |
current reformed smoker, duration not specified | 4 | 0 |
current smoker | 71 | 35 |
lifelong non-smoker | 9 | 2 |
Smoking history (LUAD) | 0 | 1 |
---|---|---|
current reformed smoker for < or = 15 years | 106 | 40 |
current reformed smoker for > 15 years | 82 | 25 |
current reformed smoker, duration not specified | 3 | 1 |
current smoker | 82 | 19 |
lifelong non-smoker | 51 | 12 |
Before we proceed to survival analysis, let's summarize the data cleaning and preparation so far:
* Eliminated the ethnicity column
* Transformed all sparsely populated subcategories (Black or African American, Asian, American Indian or Alaska Native) to “other” in the race column
* Converted age to a categorical variable
* Transformed subcategories to main categories in the pathologic stage and T, N, M stage columns
* The final data set consists of 350 and 412 records for LUSC and LUAD respectively
Group | median | 0.95LCL | 0.95UCL |
---|---|---|---|
admin.disease_code=luad | 3.542466 | 3.142466 | 4.443836 |
admin.disease_code=lusc | 3.906849 | 3.032877 | 5.238356 |
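
Median survival figures such as those in the table above are read off a Kaplan-Meier curve. The sketch below shows the product-limit estimator on toy data. It is a simplified illustration and not the routine that produced the table: it omits confidence limits and takes the median as the first event time at which the estimated survival drops to 0.5 or below.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(
    times: Sequence[float], events: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (time, survival) pairs at each distinct event time.

    events[i] is 1 if the event (death) was observed at times[i], 0 if censored.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk  # product-limit update
        curve.append((t, survival))
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First event time with survival <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data for illustration only (not TCGA data).
toy_times = [1, 2, 3, 4, 5]
toy_events = [1, 0, 1, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
print("median:", median_survival(toy_curve))
```

When censoring is heavy the curve may never fall to 0.5, in which case this sketch returns `None` for the median.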
category | chisq | df | p.value |
---|---|---|---|
gender | 0.5975156 | 1 | 0.4395275 |
agecat | 1.4792898 | 1 | 0.2238857 |
race | 0.7991014 | 1 | 0.3713622 |
smoking_history | 4.5758077 | 2 | 0.1014790 |
pathologic_stage | 3.7000930 | 2 | 0.1572299 |
stage_t | 5.4293237 | 3 | 0.1429275 |
stage_n | 1.8526311 | 2 | 0.3960101 |
stage_m | 0.6845194 | 1 | 0.4080348 |
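
The chi-square statistics and p-values in the table above come from log-rank tests. A minimal two-group version is sketched below under stated assumptions: groups are coded 0 and 1, and the p-value uses the chi-square distribution with one degree of freedom, which covers the two-level variables (gender, agecat, race). Variables with more levels need the multi-group form of the test, which is not shown.

```python
import math
from typing import Sequence, Tuple


def chi2_pvalue_df1(chisq: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))


def logrank_two_groups(
    times: Sequence[float], events: Sequence[int], groups: Sequence[int]
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u in times if u >= t)  # at risk overall
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(
            1 for u, e, g in zip(times, events, groups) if u == t and e and g == 1
        )
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 event count
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")
    chisq = observed_minus_expected ** 2 / variance
    return chisq, chi2_pvalue_df1(chisq)


# Toy data for illustration only: group 1 fails early, group 0 fails late.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_groups = [1, 1, 1, 0, 0, 0]
print(logrank_two_groups(toy_times, toy_events, toy_groups))

# The gender row of the table: chi-square 0.5975156 on 1 df.
print(round(chi2_pvalue_df1(0.5975156), 2))
```

The last line applies only the p-value conversion to the gender statistic from the table; it does not recompute the statistic from patient data.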
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
gender=female | 4.54 | 3.03 | 6.60 |
gender=male | 2.95 | 2.46 | 5.24 |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
agecat=above69 | 3.18 | 2.64 | 5.08 |
agecat=below69 | 4.54 | 2.90 | NA |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
race=other | 2.61 | 0.9 | NA |
race=white | 3.90 | 2.9 | 5.35 |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
smoking_history=current smoker | 3.18 | 2.26 | 8.63 |
smoking_history=reformed smoker | 3.90 | 2.90 | 5.41 |
smoking_history=lifelong non-smoker | 1.73 | 0.23 | NA |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
pathologic_stage=stage i | 4.60 | 3.69 | 5.72 |
pathologic_stage=stage ii | 2.92 | 2.29 | NA |
pathologic_stage=stage iii | 2.41 | 1.42 | NA |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_t=t1 | 5.08 | 3.69 | NA |
stage_t=t2 | 3.18 | 2.64 | 5.72 |
stage_t=t3 | 2.64 | 1.71 | NA |
stage_t=t4 | 2.41 | 1.06 | NA |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_n=n0 | 3.91 | 2.95 | 5.24 |
stage_n=n1 | 3.16 | 2.41 | NA |
stage_n=n2 | 2.03 | 1.42 | NA |
Group (LUSC) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_m=m0 | 3.90 | 2.90 | 5.35 |
stage_m=mx | 3.05 | 1.06 | NA |
For each variable, the proportional hazards assumption was checked using a statistical test and Schoenfeld residual plots. The cox.zph function in R tests the independence between the Schoenfeld residuals and time. The proportional hazards assumption is supported by a non-significant relationship between residuals and time.
## chisq df p
## gendermale 0.627 1 0.43
## GLOBAL 0.627 1 0.43
## chisq df p
## agecatbelow69 0.239 1 0.62
## GLOBAL 0.239 1 0.62
## chisq df p
## pathologic_stagestage ii 0.5000 1 0.48
## pathologic_stagestage iii 0.0692 1 0.79
## GLOBAL 0.5025 2 0.78
## chisq df p
## stage_tt2 3.306 1 0.069
## stage_tt3 0.189 1 0.663
## stage_tt4 0.225 1 0.636
## GLOBAL 3.490 3 0.322
## chisq df p
## stage_nn1 0.227 1 0.63
## stage_nn2 0.232 1 0.63
## GLOBAL 0.576 2 0.75
## chisq df p
## stage_mmx 0.172 1 0.68
## GLOBAL 0.172 1 0.68
Variable (LUSC) | beta | HR | 95% CI (lower) | 95% CI (upper) | p-value | wald.test | wald.p.value |
---|---|---|---|---|---|---|---|
gender | 0.16 | 1.18 | 0.78 | 1.79 | 0.44 | 0.60 | 0.44 |
agecat | -0.25 | 0.78 | 0.51 | 1.17 | 0.23 | 1.47 | 0.23 |
race | -0.25 | 0.78 | 0.45 | 1.35 | 0.37 | 0.79 | 0.37 |
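
The columns of the table above are linked by simple relations: the hazard ratio is exp(beta), the 95% confidence limits are exp(beta ± 1.96·SE), and the Wald statistic for a single coefficient is (beta/SE)². The sketch below illustrates these relations. The standard error is not reported in the table, so it is backed out from the confidence limits here, which is an assumption of this example; because the table values are rounded, the results agree only approximately.

```python
import math
from typing import Dict

Z_95 = 1.959964  # two-sided 95% normal quantile


def se_from_ci(lower: float, upper: float, z_crit: float = Z_95) -> float:
    """Recover the standard error of beta from a CI given on the HR scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z_crit)


def hazard_ratio_summary(
    beta: float, se: float, z_crit: float = Z_95
) -> Dict[str, float]:
    """Hazard ratio, confidence limits, Wald statistic and p-value for one beta."""
    z = beta / se
    return {
        "hr": math.exp(beta),
        "lower": math.exp(beta - z_crit * se),
        "upper": math.exp(beta + z_crit * se),
        "wald": z ** 2,
        # chi-square with 1 df, equivalent to a two-sided normal test
        "p": math.erfc(abs(z) / math.sqrt(2.0)),
    }


# Gender row of the table: beta = 0.16, 95% CI 0.78 to 1.79 (rounded values).
gender = hazard_ratio_summary(0.16, se_from_ci(0.78, 1.79))
for name, value in gender.items():
    print(name, round(value, 2))
```

A hazard ratio whose confidence interval contains 1, as for all three rows in this table, corresponds to a p-value above 0.05.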
Level (LUSC) | beta | HR | 95% CI (lower) | 95% CI (upper) | p-value |
---|---|---|---|---|---|
reformed smoker | -0.23 | 0.79 | 0.53 | 1.18 | 0.26 |
lifelong nonsmoker | 1.06 | 2.89 | 0.68 | 12.20 | 0.15 |
pathologic_stage2 | 0.28 | 1.32 | 0.86 | 2.04 | 0.21 |
pathologic_stage3 | 0.44 | 1.56 | 0.95 | 2.56 | 0.08 |
staget2 | 0.21 | 1.23 | 0.78 | 1.94 | 0.36 |
staget3 | 0.39 | 1.47 | 0.74 | 2.93 | 0.27 |
staget4 | 1.00 | 2.72 | 1.11 | 6.63 | 0.03 |
stagen1 | 0.09 | 1.09 | 0.70 | 1.71 | 0.70 |
stagen2 | 0.40 | 1.49 | 0.84 | 2.64 | 0.18 |
stagemx | 0.22 | 1.25 | 0.74 | 2.10 | 0.41 |
Variable (LUSC) | wald.test | wald.p.value |
---|---|---|
smoking_history | 4.17 | 0.12 |
pathologic_stage | 3.65 | 0.16 |
stage_t | 5.19 | 0.16 |
stage_n | 1.83 | 0.40 |
stage_m | 0.68 | 0.41 |
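Based on the results from the KM estimates and the Cox model, we can conclude that, for LUSC, none of the variables is statistically significant for overall survival at the 0.05 level. The T4 category alone has p = 0.03 relative to T1, but the overall test for stage_t is not significant (p = 0.16).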
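For readers who want to see where a Cox coefficient comes from, the sketch below fits a single-covariate Cox model by Newton-Raphson on the partial likelihood. It is a simplified illustration on toy data, with hypothetical function names and the Breslow treatment of tied event times assumed; it is not the routine used to produce the tables in this report.

```python
import math
from typing import List, Sequence, Tuple


def score_and_information(
    beta: float, data: List[Tuple[float, int, float]]
) -> Tuple[float, float]:
    """First derivative and negative second derivative of the log partial likelihood."""
    score = 0.0
    information = 0.0
    for t_i, event_i, x_i in data:
        if not event_i:
            continue  # censored records contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, _, x_j in data:
            if t_j >= t_i:  # subject j is still at risk at time t_i
                weight = math.exp(beta * x_j)
                s0 += weight
                s1 += weight * x_j
                s2 += weight * x_j * x_j
        mean = s1 / s0
        score += x_i - mean
        information += s2 / s0 - mean * mean
    return score, information


def fit_cox_single(
    times: Sequence[float],
    events: Sequence[int],
    x: Sequence[float],
    max_iter: int = 50,
    tol: float = 1e-10,
) -> Tuple[float, float]:
    """Return (beta, standard error) for a one-covariate Cox model."""
    data = list(zip(times, events, x))
    beta = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(beta, data)
        if information == 0:
            raise ValueError("no information about beta in these data")
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    _, information = score_and_information(beta, data)
    return beta, 1.0 / math.sqrt(information)


# Toy data for illustration only: x = 1 subjects tend to fail earlier.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_x = [1, 0, 1, 0, 1, 0]
toy_beta, toy_se = fit_cox_single(toy_times, toy_events, toy_x)
print("beta:", toy_beta, "HR:", math.exp(toy_beta), "SE:", toy_se)
```

A positive beta (hazard ratio above 1) means the group coded 1 has the higher estimated hazard; dividing beta by its standard error gives the Wald z statistic.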
category | chisq | df | p.value |
---|---|---|---|
gender | 0.0619247 | 1 | 0.8034793 |
agecat | 2.1215024 | 1 | 0.1452440 |
race | 2.4764949 | 1 | 0.1155595 |
smoking_history | 1.4695368 | 2 | 0.4796165 |
pathologic_stage | 30.9366702 | 3 | 0.0000009 |
stage_t | 11.9727101 | 3 | 0.0074772 |
stage_n | 26.8069607 | 2 | 0.0000015 |
stage_m | 2.9879501 | 2 | 0.2244786 |
The p-values from the log-rank test for pathologic stage, stage t, and stage n are below 0.05, indicating that the groups within these variables differ significantly in survival. Each variable is interpreted individually below. Since each of these variables consists of more than two groups, a pairwise comparison was performed; the results are shown below.
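A pairwise comparison runs one log-rank test per pair of groups and then adjusts the resulting p-values for multiple testing. The sketch below shows the bookkeeping only. The text does not say which adjustment was applied, so the Benjamini-Hochberg method is an assumption here, the raw p-values are made up for illustration, and the star symbols follow the legend printed with the pairwise output.

```python
from itertools import combinations
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significance_symbol(p: float) -> str:
    """Map a p-value to the symbols in the legend of the pairwise output."""
    for cutoff, symbol in [(1e-4, "****"), (0.001, "***"), (0.01, "**"),
                           (0.05, "*"), (0.1, "+")]:
        if p <= cutoff:
            return symbol
    return " "


# All unordered pairs of pathologic stages.
stages = ["stage i", "stage ii", "stage iii", "stage iv"]
pairs = list(combinations(stages, 2))

# Hypothetical raw p-values, one per pair, for illustration only.
raw_p = [0.004, 0.00002, 0.03, 0.2, 0.6, 0.5]
for pair, p_adj in zip(pairs, benjamini_hochberg(raw_p)):
    print(pair, round(p_adj, 4), significance_symbol(p_adj))
```

With four stages there are six pairs to test, which is why some adjustment of the p-values is needed before reading the stars.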
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
gender=female | 3.47 | 2.86 | 4.73 |
gender=male | 3.54 | 2.96 | 7.35 |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
agecat=above65 | 3.45 | 2.74 | 4.9 |
agecat=below65 | 3.54 | 3.11 | NA |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
race=other | 3.72 | 2.45 | NA |
race=white | 3.47 | 2.96 | 4.38 |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
smoking_history=current smoker | 3.98 | 2.60 | NA |
smoking_history=reformed smoker | 3.45 | 3.11 | 4.53 |
smoking_history=lifelong non-smoker | 3.89 | 2.73 | NA |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
pathologic_stage=stage i | 4.73 | 3.89 | NA |
pathologic_stage=stage ii | 2.55 | 1.85 | 4.87 |
pathologic_stage=stage iii | 1.72 | 1.13 | 3.78 |
pathologic_stage=stage iv | 2.86 | 2.67 | NA |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_t=t1 | 3.78 | 3.20 | NA |
stage_t=t2 | 3.72 | 2.94 | 4.73 |
stage_t=t3 | 2.37 | 1.05 | NA |
stage_t=t4 | 2.26 | 0.49 | NA |
## stage i stage ii stage iii
## stage ii **
## stage iii ****
## stage iv *
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## NA: ''
## t1 t2 t3
## t2
## t3 **
## t4 **
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## NA: ''
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_n=n0 | 4.53 | 3.54 | NA |
stage_n=n1 | 2.67 | 1.85 | 4.11 |
stage_n=n2 | 1.72 | 1.13 | 3.78 |
Group (LUAD) | median | 0.95LCL | 0.95UCL |
---|---|---|---|
stage_m=m0 | 3.53 | 2.94 | 4.53 |
stage_m=m1 | 2.86 | 2.26 | NA |
stage_m=mx | 4.90 | 3.11 | NA |
## n0 n1
## n1 ***
## n2 ****
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## NA: ''
The p-values from the proportional hazards tests are above 0.05 for all variables, so there is no evidence that the proportional hazards assumption is violated for any of them.
## chisq df p
## gendermale 0.507 1 0.48
## GLOBAL 0.507 1 0.48
## chisq df p
## agecatbelow65 0.0923 1 0.76
## GLOBAL 0.0923 1 0.76
## chisq df p
## pathologic_stagestage ii 0.00266 1 0.96
## pathologic_stagestage iii 2.66184 1 0.10
## pathologic_stagestage iv 0.77412 1 0.38
## GLOBAL 3.31842 3 0.35
## chisq df p
## stage_tt2 3.006384 1 0.083
## stage_tt3 2.785220 1 0.095
## stage_tt4 0.000474 1 0.983
## GLOBAL 4.915003 3 0.178
## chisq df p
## stage_nn1 0.957 1 0.33
## stage_nn2 2.323 1 0.13
## GLOBAL 2.558 2 0.28
## chisq df p
## stage_mm1 1.262 1 0.26
## stage_mmx 0.507 1 0.48
## GLOBAL 1.562 2 0.46
Variable (LUAD) | beta | HR | 95% CI (lower) | 95% CI (upper) | p-value | wald.test | wald.p.value |
---|---|---|---|---|---|---|---|
gender | -0.05 | 0.95 | 0.63 | 1.42 | 0.80 | 0.06 | 0.80 |
agecat | -0.31 | 0.74 | 0.49 | 1.11 | 0.15 | 2.10 | 0.15 |
race | 0.51 | 1.66 | 0.88 | 3.13 | 0.12 | 2.42 | 0.12 |
Level (LUAD) | beta | HR | 95% CI (lower) | 95% CI (upper) | p-value |
---|---|---|---|---|---|
reformed_smoker | 0.31 | 1.36 | 0.81 | 2.29 | 0.24 |
non-smoker | 0.33 | 1.40 | 0.67 | 2.89 | 0.37 |
pathologic_stage2 | 0.89 | 2.43 | 1.46 | 4.07 | 0.00 |
pathologic_stage3 | 1.36 | 3.88 | 2.30 | 6.54 | 0.00 |
pathologic_stage4 | 0.93 | 2.52 | 1.16 | 5.48 | 0.02 |
staget2 | 0.33 | 1.40 | 0.86 | 2.26 | 0.18 |
staget3 | 1.15 | 3.17 | 1.39 | 7.26 | 0.01 |
staget4 | 1.24 | 3.46 | 1.30 | 9.24 | 0.01 |
stagen1 | 0.83 | 2.30 | 1.40 | 3.77 | 0.00 |
stagen2 | 1.18 | 3.27 | 1.99 | 5.38 | 0.00 |
stagem1 | 0.34 | 1.40 | 0.67 | 2.92 | 0.37 |
stagemx | -0.35 | 0.70 | 0.42 | 1.19 | 0.19 |
Variable (LUAD) | wald.test | wald.p.value |
---|---|---|
smoking_history | 1.46 | 0.48 |
pathologic_stage | 27.81 | 0.00 |
stage_t | 11.11 | 0.01 |
stage_n | 24.48 | 0.00 |
stage_m | 2.94 | 0.23 |
For LUAD, the univariate analysis allowed us to conclude that pathologic stage, stage t, and stage n are the only variables that are statistically significant for overall survival. However, the overall pathologic stage is always assigned based on stage t, stage n and stage m. Therefore, in any linear modeling technique, we expect to see correlation between tumor stage and pathologic stage, so it would not be useful to model them together. Since no other variables are statistically significant, multivariate analysis adds little to the analysis.
We can, however, perform a multivariate analysis using two models to see whether the conclusions from the univariate analysis still hold when other covariates are taken into account:
1) Age, gender, pathologic_stage, smoking_history
2) Age, gender, stage_t, stage_n, stage_m, smoking_history