Goal and Impact

The goal of this project is to estimate survival rates for lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) using data from The Cancer Genome Atlas (TCGA) program. Specifically, we will investigate the impact of demographic and pathological features on survival. Estimating the survival rate and identifying the features that affect it will give clinicians a better baseline from which to tailor therapies or estimate how likely a treatment is to succeed.

Background

Lung cancer is the second most common form of cancer in both men and women, accounting for 2.3 million of the estimated 17 million total cases. It is also the leading cause of cancer death, making up almost 25% of all cancer deaths. There are two types of lung cancer: small cell and non-small cell. Non-small cell cancers account for 85-90% of all lung cancers. Two of their subtypes, lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD), account for 25-30% and 40% of cases, respectively. LUSC is associated with smoking and is usually found in the middle of the lungs. LUAD, on the other hand, is found on the periphery of the lungs; it may be associated with smoking in some cases but is the most common type among non-smokers. Most people diagnosed with lung cancer are 65 or older, whereas very few are younger than 45.

Specific Questions

  1. Are patient survival rates different for LUSC and LUAD?

  2. Are patient survival rates different for sub-groups within the dataset, for example, male vs. female?

  3. How do individual demographic factors affect survival? Does smoking history affect survival rates specifically in LUAD patients?

Now that we have framed our questions, let's proceed!

Data Extraction

  • The dataset was obtained from The Cancer Genome Atlas program (TCGA)
  • RTCGA and RTCGA.clinical packages were used to extract data
  • LUSC and LUAD clinical data sets contain 504 and 522 patient records respectively.
  • For each patient, I extracted the following information:
  • Gender
  • Age = Age of the patient at initial diagnosis
  • Ethnicity
  • Race
  • Pathologic stage
  • Pathologic substages (t, n, m)
  • Smoking history
  • Vital status (event, in this case death, or censored)
  • times = either the time to death (in days from initial diagnosis) or the time to last follow-up (in days from initial diagnosis), whichever applies to each patient

Data Wrangling

Step1 - Eliminate negative values from the times column

  • Negative values indicate error in data collection
  • Filter (times >= 0) and then convert the time from days to years (times/365).

Number of records before and after eliminating negative times.

        Before  After
LUSC    494     478
LUAD    503     484
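
The filter and unit conversion described in Step 1 can be sketched as follows. This is a minimal Python illustration, not the R code used for the analysis; the record layout, the patient IDs, and the function name `clean_times` are hypothetical.

```python
# Hypothetical records: (patient_id, times_in_days, event). Not TCGA data.
records = [
    ("p1", 365, 1),
    ("p2", -10, 0),   # negative follow-up time, treated as a collection error
    ("p3", 730, 0),
    ("p4", 0, 1),
]

DAYS_PER_YEAR = 365  # the text converts with times/365


def clean_times(rows):
    """Keep rows with times >= 0 and convert times from days to years."""
    return [
        (pid, days / DAYS_PER_YEAR, event)
        for pid, days, event in rows
        if days >= 0
    ]


cleaned = clean_times(records)
print(len(records), "records before,", len(cleaned), "after")
```

Note that a time of exactly 0 is kept, matching the `times >= 0` filter in the text.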

Step2 - Check for missing data

  • 6 of the 9 variables have missing data
  • 4 variables (stage_n, stage_m, pathologic_stage, smoking_history) have < 3% missing values
  • For the race and ethnicity variables, LUSC has 10% and 20% missing values, while LUAD has 20% and 34% missing values, respectively
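
A per-column missing-value check like the one summarized above could look like this. It is a minimal Python sketch under the assumption that missing values are stored as `None`; the sample rows and the helper name `percent_missing` are hypothetical.

```python
def percent_missing(rows, columns):
    """Return {column: percent of rows in which the value is None}."""
    n = len(rows)
    if n == 0:
        return {c: 0.0 for c in columns}
    return {
        c: 100.0 * sum(r.get(c) is None for r in rows) / n
        for c in columns
    }


# Hypothetical rows, not TCGA data.
rows = [
    {"gender": "male", "race": "white", "ethnicity": None},
    {"gender": "female", "race": None, "ethnicity": None},
    {"gender": "male", "race": "asian", "ethnicity": "not hispanic or latino"},
    {"gender": "female", "race": "white", "ethnicity": "not hispanic or latino"},
]

summary = percent_missing(rows, ["gender", "race", "ethnicity"])
print(summary)
```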

 
Let’s explore the RACE and ETHNICITY variables  

Ethnicity

The ethnicity variable has 2 unique values: hispanic or latino and not hispanic or latino. The non hispanic or latino category has insignificant frequency in both LUSC and LUAD. Therefore, we will drop this variable.

 

Race  

LUSC
race                               Freq
asian                              7
black or african american          30
white                              337

LUAD
race                               Freq
american indian or alaska native   1
asian                              8
black or african american          52
white                              379

The race variable has 3 and 4 unique values in LUSC and LUAD, respectively; however, only the white and Black or African American categories are represented with appreciable frequency. We will therefore retain this feature but combine all minority categories into one, leaving two categories in this column: white and other.

 

Step3 - Data cleaning

Based on the results above, we will do the following to clean the data:
  • Drop ethnicity
  • Delete records with missing values in the stage_m, stage_n, pathologic_stage, race, and smoking history columns
  • Combine Asian, American Indian, and Black or African American into one category called “other” in the race column
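
The three cleaning steps can be sketched together as below. This is a minimal Python illustration rather than the R code used for the analysis; it assumes each record is a dict with `None` for missing values, and the function names are hypothetical.

```python
# Columns in which a missing value causes the record to be dropped.
REQUIRED = ("stage_m", "stage_n", "pathologic_stage", "race", "smoking_history")


def clean_record(rec):
    """Return a cleaned copy of rec, or None if a required value is missing."""
    if any(rec.get(col) is None for col in REQUIRED):
        return None
    out = {k: v for k, v in rec.items() if k != "ethnicity"}  # drop ethnicity
    out["race"] = "white" if rec["race"] == "white" else "other"
    return out


def clean(records):
    """Apply clean_record to every record and keep the survivors."""
    return [c for c in map(clean_record, records) if c is not None]


# Hypothetical records, not TCGA data.
base = {"stage_m": "m0", "stage_n": "n0", "pathologic_stage": "stage i",
        "smoking_history": "current smoker", "ethnicity": None}
records = [
    {**base, "race": "asian"},
    {**base, "race": None},      # dropped: race is missing
    {**base, "race": "white"},
]
cleaned = clean(records)
print(cleaned)
```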

 

Now that we have clean data, let’s explore each variable
 

Data Exploration

Gender

LUSC
        0    1
female  73   31
male    168  85

LUAD
        0    1
female  177  55
male    147  42

  • LUSC has more male subjects
  • LUAD has a more balanced sample of males and females
  • Males and females experience similar event rates regardless of the type of lung cancer

Age

  • LUSC seems to have a slightly left-skewed age distribution, while LUAD has a roughly symmetric distribution with two peaks (bimodal)
  • Mean and median age for LUSC are 67.4 and 69
  • Mean and median age for LUAD are 65.1 and 65.1

Kaplan-Meier estimates do not handle continuous variables, so we convert age into a categorical variable with two categories:
  • For LUSC: above 69 and below 69
  • For LUAD: above 65 and below 65

LUSC
         0    1
above69  150  83
below69  91   33

LUAD
         0    1
above65  167  60
below65  157  37

Pathologic stage

  • There are more patients with stage 1 cancer than with other stages in both LUSC and LUAD
  • Patients with stage 3 have a much higher event rate in both LUSC and LUAD
  • In LUSC, there are very few patients with stage 4. This might cause convergence problems in Cox regression. Therefore, we will delete these records before the survival analysis.
  • The disease stage subcategories, such as 1a and 1b, describe slight variations in disease progression (e.g., tumor size), and modeling might not benefit from separating them. We will re-class the subcategories under their main stage

Stage T,N,M

LUSC

  • Most of the patients have tumor type t2 which means the tumor is larger than 3cm and is partially clogging the airways.
  • The majority of the patients have n0 which means the cancer has not spread to lymph nodes
  • There are some patients with n1 which means the cancer has spread to lymph nodes within the lungs
  • The majority of the patients have m0, which means the cancer has not spread to other parts of the body
  • mx means metastasis cannot be assessed
  • There are very few patients with m1 or nx, which might also cause convergence problems; therefore, m1 and nx records were deleted.

LUAD

  • LUAD has the same patterns observed in LUSC. We will also transform the subcategories to the main category in the cancer stage and stages t,n, and m.
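
Collapsing subcategories into their main stage can be sketched with two small helpers. This is a minimal Python illustration, not the code used for the analysis; it assumes labels of the form "stage iia" for the pathologic stage and "t1a", "n2", "m1b", "mx" for the TNM columns, and both function names are hypothetical.

```python
import re


def main_pathologic_stage(value):
    """Map e.g. 'stage iib' to 'stage ii'; return other values unchanged."""
    m = re.fullmatch(r"(stage (?:iv|iii|ii|i))[ab]?", value)
    return m.group(1) if m else value


def main_tnm_stage(value):
    """Map e.g. 't1a' to 't1' and 'm1b' to 'm1'; return other values unchanged."""
    m = re.fullmatch(r"([tnm](?:\d|x))[a-c]?", value)
    return m.group(1) if m else value


for raw in ["stage ia", "stage iib", "stage iiia", "stage iv"]:
    print(raw, "->", main_pathologic_stage(raw))
for raw in ["t1a", "t2", "n2", "m1b", "mx"]:
    print(raw, "->", main_tnm_stage(raw))
```

Values that do not match the assumed patterns pass through untouched, so unexpected labels stay visible instead of being silently merged.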

Smoking History

LUSC
                                                  0    1
current reformed smoker for < or = 15 years       117  61
current reformed smoker for > 15 years            40   18
current reformed smoker, duration not specified   4    0
current smoker                                    71   35
lifelong non-smoker                               9    2

LUAD
                                                  0    1
current reformed smoker for < or = 15 years       106  40
current reformed smoker for > 15 years            82   25
current reformed smoker, duration not specified   3    1
current smoker                                    82   19
lifelong non-smoker                               51   12

  • LUAD has more non-smokers. This is expected as this is the most common lung cancer in non-smokers.

 

Before we proceed to survival analysis, let's summarize the data cleaning and preparation so far:
  • Eliminated the ethnicity column
  • Merged all sparsely populated categories (Black or African American, Asian, American Indian or Alaska Native) into “other” in the race column
  • Converted age to a categorical variable
  • Collapsed subcategories into main categories in the pathologic stage and t, n, m stage columns
  • The final data set consists of 350 and 412 records for LUSC and LUAD, respectively

Survival Analysis

  • We will use the survminer and survival packages in R for this analysis
  • First, we will estimate survival rates for LUSC and LUAD using the Kaplan-Meier (KM) method.
  • The KM method generates plots of survival probability versus time and summaries of the data, including median survival times and confidence intervals (CI)
  • Survival curves will be compared using the log-rank test with a significance level of 0.05
  • Next, we will perform a univariate analysis (one independent variable at a time) for LUSC and LUAD using the KM and Cox proportional hazards methods.
  • This will allow us to check the impact of each factor on survival and whether groups within each variable have different survival curves. For example, do survival rates differ between men and women under the gender variable?
  • For variables that show statistically significant differences and have more than 2 groups, we will also do a pairwise comparison
  • Before fitting a model using the Cox proportional hazards method, we will verify the proportional hazards assumption using Schoenfeld residuals and the p-value of the corresponding chi-squared test
  • Based on the results of the univariate analysis, we will choose variables for the multivariate analysis and fit a Cox proportional hazards regression model
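
To make the KM step concrete, here is a minimal Python sketch of the product-limit estimator. It is an illustration only (the analysis itself relies on the R survival package), the toy data are made up, and the median rule used here (first time the curve reaches 0.5 or below) is a simplification.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (1 = death, 0 = censored)."""
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)   # subjects leave the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve


def survival_at(curve, t):
    """Read the step function: S(t) is the last value at or before t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


def median_survival(curve):
    """First time the curve drops to 0.5 or below; None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data in years, not TCGA data.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)
print(survival_at(curve, 2.5), median_survival(curve))
```

Reading the curve at 1, 3, and 5 years with `survival_at` corresponds to the one-, three-, and five-year survival rates reported in the results.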

Survival rates for LUSC and LUAD

median 0.95LCL 0.95UCL
admin.disease_code=luad 3.542466 3.142466 4.443836
admin.disease_code=lusc 3.906849 3.032877 5.238356
  • The one-, three-, and five-year survival rates for LUSC and LUAD are 81.1% versus 85.5%, 55.7% versus 58.2%, and 43.6% versus 33.8%, respectively.
  • The log-rank test statistic of chisq = 0.3 with a p-value of 0.56 means there is not enough statistical evidence to reject the null hypothesis; the survival curves for LUSC and LUAD are not significantly different
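
For readers who want to see what the log-rank comparison computes, the following is a minimal two-group sketch in Python. It is not the implementation used by the R survival package; the function names and toy data are hypothetical, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def chisq1_pvalue(chisq):
    """Upper-tail probability of a chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(chisq / 2))


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Return (chisq, p) comparing observed and expected deaths in group A."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                  # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)     # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)       # deaths at t
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        obs_a += d_a
        exp_a += d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        raise ValueError("variance is zero; the groups cannot be compared")
    chisq = (obs_a - exp_a) ** 2 / var
    return chisq, chisq1_pvalue(chisq)


# Toy data: group A dies early, group B dies late (all events observed).
chisq, p = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(round(chisq, 3), round(p, 3))
```

As a consistency check on the p-value helper, `chisq1_pvalue(0.5975156)` reproduces the p-value reported for gender in the LUSC table of this report (0.4395275).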

Univariate analysis (LUSC)

KM survival curves

Chisq and p-values for LUSC from Kaplan Meier estimates
category chisq df p.value
gender 0.5975156 1 0.4395275
agecat 1.4792898 1 0.2238857
race 0.7991014 1 0.3713622
smoking_history 4.5758077 2 0.1014790
pathologic_stage 3.7000930 2 0.1572299
stage_t 5.4293237 3 0.1429275
stage_n 1.8526311 2 0.3960101
stage_m 0.6845194 1 0.4080348
  • The chi-squared values and p-values of the log-rank test from the univariate KM analysis show no statistically significant differences in overall survival between the groups within any variable
  • The survival curves for all categories (see below) show a steep decline in the initial years, indicating poor prognosis from the disease.
median 0.95LCL 0.95UCL
gender=female 4.54 3.03 6.60
gender=male 2.95 2.46 5.24
median 0.95LCL 0.95UCL
agecat=above69 3.18 2.64 5.08
agecat=below69 4.54 2.90 NA
median 0.95LCL 0.95UCL
race=other 2.61 0.9 NA
race=white 3.90 2.9 5.35
median 0.95LCL 0.95UCL
smoking_history=current smoker 3.18 2.26 8.63
smoking_history=reformed smoker 3.90 2.90 5.41
smoking_history=lifelong non-smoker 1.73 0.23 NA
median 0.95LCL 0.95UCL
pathologic_stage=stage i 4.60 3.69 5.72
pathologic_stage=stage ii 2.92 2.29 NA
pathologic_stage=stage iii 2.41 1.42 NA
median 0.95LCL 0.95UCL
stage_t=t1 5.08 3.69 NA
stage_t=t2 3.18 2.64 5.72
stage_t=t3 2.64 1.71 NA
stage_t=t4 2.41 1.06 NA
median 0.95LCL 0.95UCL
stage_n=n0 3.91 2.95 5.24
stage_n=n1 3.16 2.41 NA
stage_n=n2 2.03 1.42 NA
median 0.95LCL 0.95UCL
stage_m=m0 3.90 2.90 5.35
stage_m=mx 3.05 1.06 NA

   

Cox proportional hazards model

Testing proportional Hazard assumption

For each variable, the proportional hazards assumption was checked using a statistical test and Schoenfeld residual plots. The cox.zph function in R tests the independence between Schoenfeld residuals and time. The proportional hazards assumption is supported by a non-significant relationship between residuals and time.
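
As a reminder of what is being assumed, the Cox model for a single covariate can be written as follows. The notation (baseline hazard h_0, coefficient β, binary covariate x) is introduced here for illustration and does not appear in the original output.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta x),
\qquad
\mathrm{HR} = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = \exp(\beta)
```

The hazard ratio does not depend on t, which is exactly the proportional hazards assumption: if the effect of x changed over time, the Schoenfeld residuals would show a trend against time.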

 

  • The p-values are above 0.05 for all variables except race, so there is no evidence against the proportional hazards assumption for those variables. For race, the assumption does not appear to hold, so Cox regression might not be a good fit for modeling this variable. We will not attempt to correct this, since the majority of records share a single value for this variable and it is unlikely to be informative.
  • The Schoenfeld residual plots for all variables show a random scatter around a mean of 0, supporting the proportional hazards assumption

 

##            chisq df    p
## gendermale 0.627  1 0.43
## GLOBAL     0.627  1 0.43

##               chisq df    p
## agecatbelow69 0.239  1 0.62
## GLOBAL        0.239  1 0.62

##                            chisq df    p
## pathologic_stagestage ii  0.5000  1 0.48
## pathologic_stagestage iii 0.0692  1 0.79
## GLOBAL                    0.5025  2 0.78

##           chisq df     p
## stage_tt2 3.306  1 0.069
## stage_tt3 0.189  1 0.663
## stage_tt4 0.225  1 0.636
## GLOBAL    3.490  3 0.322

##           chisq df    p
## stage_nn1 0.227  1 0.63
## stage_nn2 0.232  1 0.63
## GLOBAL    0.576  2 0.75

##           chisq df    p
## stage_mmx 0.172  1 0.68
## GLOBAL    0.172  1 0.68

 

Cox Model LUSC (Univariate)

Table: Results from Univariate analysis using cox method
beta HR 95%CI(lower) 95%CI(upper) p-value wald.test wald.p.value
gender 0.16 1.18 0.78 1.79 0.44 0.60 0.44
agecat -0.25 0.78 0.51 1.17 0.23 1.47 0.23
race -0.25 0.78 0.45 1.35 0.37 0.79 0.37
beta HR 95%CI (L) 95%CI (U) p value
reformed smoker -0.23 0.79 0.53 1.18 0.26
lifelong nonsmoker 1.06 2.89 0.68 12.20 0.15
pathologic_stage2 0.28 1.32 0.86 2.04 0.21
pathologic_stage3 0.44 1.56 0.95 2.56 0.08
staget2 0.21 1.23 0.78 1.94 0.36
staget3 0.39 1.47 0.74 2.93 0.27
staget4 1.00 2.72 1.11 6.63 0.03
stagen1 0.09 1.09 0.70 1.71 0.70
stagen2 0.40 1.49 0.84 2.64 0.18
stagemx 0.22 1.25 0.74 2.10 0.41
wald.test p.value
smoking_history 4.17 0.12
pathologic_stage 3.65 0.16
stage_t 5.19 0.16
stage_n 1.83 0.40
stage_m 0.68 0.41

 

 

 

 

 

 

 

  • The table above shows the regression beta coefficients, hazard ratios, confidence intervals for the hazard ratios, and statistical significance (Wald test and p-value) of each variable in relation to overall survival. Each variable was assessed independently via a separate Cox regression.
  • The p-values from the Wald test are similar to those obtained from the KM estimates, suggesting that:
    • none of the variables is statistically significant for overall survival
    • groups within a variable don't have significantly different survival rates
    • nearly all hazard ratio confidence intervals include the null value of 1, consistent with non-significance; the one exception is t4 versus t1 (HR 2.72, 95% CI 1.11 to 6.63, p = 0.03), although the overall Wald test for stage_t is not significant (p = 0.16)
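
The relationship between the columns of the table (beta, HR, confidence interval, p-value) can be sketched as follows. This is a minimal Python illustration that uses the rounded staget4 row as input; because the table does not report standard errors, the sketch recovers an approximate one from the reported interval, which is an assumption of this example. Function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Approximate the standard error of beta from a 95% CI on the hazard ratio."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def wald_summary(beta, se):
    """Return (HR, CI lower, CI upper, two-sided Wald p-value) for one coefficient."""
    hr = math.exp(beta)
    lower = math.exp(beta - Z_95 * se)
    upper = math.exp(beta + Z_95 * se)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return hr, lower, upper, p


# Rounded values from the staget4 row: beta = 1.00, 95% CI 1.11 to 6.63.
se = se_from_ci(1.11, 6.63)
hr, lower, upper, p = wald_summary(1.00, se)
print(hr, lower, upper, p)
```

A confidence interval that excludes 1 corresponds to a Wald p-value below 0.05, which is why the interval and the p-value always lead to the same conclusion for a single coefficient. Small discrepancies from the table are expected because the inputs are rounded.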

 

Based on the results from KM estimates and the Cox model, we can conclude that, for LUSC, none of the variables are statistically significant for overall survival

Univariate Analysis LUAD

KM method

category chisq df p.value
gender 0.0619247 1 0.8034793
agecat 2.1215024 1 0.1452440
race 2.4764949 1 0.1155595
smoking_history 1.4695368 2 0.4796165
pathologic_stage 30.9366702 3 0.0000009
stage_t 11.9727101 3 0.0074772
stage_n 26.8069607 2 0.0000015
stage_m 2.9879501 2 0.2244786
  • The p-values from the log-rank test for pathologic stage, stage t, and stage n are below 0.05, indicating that the groups within these variables differ significantly in survival. Each variable is discussed individually below. Since each of these variables consists of more than two groups, pairwise comparisons were performed; the results are shown below.

  • The survival curves (see below) for all variables also show a steep decline in the first five years, similar to what was observed in LUSC, indicating poor prognosis from the disease

 

median 0.95LCL 0.95UCL
gender=female 3.47 2.86 4.73
gender=male 3.54 2.96 7.35
median 0.95LCL 0.95UCL
agecat=above65 3.45 2.74 4.9
agecat=below65 3.54 3.11 NA
median 0.95LCL 0.95UCL
race=other 3.72 2.45 NA
race=white 3.47 2.96 4.38
median 0.95LCL 0.95UCL
smoking_history=current smoker 3.98 2.60 NA
smoking_history=reformed smoker 3.45 3.11 4.53
smoking_history=lifelong non-smoker 3.89 2.73 NA

Pathologic stage

  • The survival curves show that the survival rate decreases with increasing pathologic stage.
  • Patients with stage 1 have a median survival time of ~4.7 years, whereas patients with stage 2 and 3 have 2.5 and 1.7 years, respectively
  • Patients with stage 4 cancer have a higher median survival time (~2.9 years) than patients with stage 2 or 3, but it should be noted that there are very few patients in this group
  • Pairwise comparisons show that stage 2 and stage 3 are significantly different from stage 1. However, there is no significant difference in survival between stages 2 and 3.

 

Stage T

  • The survival curves show that the survival rate decreases rapidly for t3 and t4 in comparison to t1 and t2. Their median survival times show the same trend.
  • Pairwise comparison shows that
    • t1 is not significantly different from t2, while t3 and t4 are significantly different from t1
    • No significant differences are observed between t2, t3, and t4.
median 0.95LCL 0.95UCL
pathologic_stage=stage i 4.73 3.89 NA
pathologic_stage=stage ii 2.55 1.85 4.87
pathologic_stage=stage iii 1.72 1.13 3.78
pathologic_stage=stage iv 2.86 2.67 NA
median 0.95LCL 0.95UCL
stage_t=t1 3.78 3.20 NA
stage_t=t2 3.72 2.94 4.73
stage_t=t3 2.37 1.05 NA
stage_t=t4 2.26 0.49 NA
##           stage i stage ii stageiii
## stage ii  **                       
## stage iii ****                     
## stage iv  *                        
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''
##    t1 t2 t3
## t2         
## t3 **      
## t4 **      
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''

Stage N

  • The survival curves show that the survival rate is better for patients with n0 where the cancer has not spread to the lymph nodes while n1 and n2 curves show a steep decline
  • The median survival times for n0, n1, and n2 are 4.5, 2.7, and 1.7 years, respectively
  • Pairwise comparison indicates the same trend:
    • n0 is significantly different from n1 and n2
    • there is no significant difference between n1 and n2
median 0.95LCL 0.95UCL
stage_n=n0 4.53 3.54 NA
stage_n=n1 2.67 1.85 4.11
stage_n=n2 1.72 1.13 3.78
median 0.95LCL 0.95UCL
stage_m=m0 3.53 2.94 4.53
stage_m=m1 2.86 2.26 NA
stage_m=mx 4.90 3.11 NA
##    n0   n1
## n1 ***    
## n2 ****   
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''

Cox proportional hazards model

Testing for Cox proportional hazard assumption

  • The p-values are above 0.05 for all variables, so there is no evidence against the proportional hazards assumption for any of them.

  • The Schoenfeld residual plots for all variables show a random scatter around a mean of 0

##            chisq df    p
## gendermale 0.507  1 0.48
## GLOBAL     0.507  1 0.48

##                chisq df    p
## agecatbelow65 0.0923  1 0.76
## GLOBAL        0.0923  1 0.76

##                             chisq df    p
## pathologic_stagestage ii  0.00266  1 0.96
## pathologic_stagestage iii 2.66184  1 0.10
## pathologic_stagestage iv  0.77412  1 0.38
## GLOBAL                    3.31842  3 0.35

##              chisq df     p
## stage_tt2 3.006384  1 0.083
## stage_tt3 2.785220  1 0.095
## stage_tt4 0.000474  1 0.983
## GLOBAL    4.915003  3 0.178

##           chisq df    p
## stage_nn1 0.957  1 0.33
## stage_nn2 2.323  1 0.13
## GLOBAL    2.558  2 0.28

##           chisq df    p
## stage_mm1 1.262  1 0.26
## stage_mmx 0.507  1 0.48
## GLOBAL    1.562  2 0.46

Cox Model (Univariate)

Univariate analysis using Cox proportional hazards
beta HR 95%_CI_lower 95%_CI_upper p value wald.test wald.p.value
gender -0.05 0.95 0.63 1.42 0.80 0.06 0.80
agecat -0.31 0.74 0.49 1.11 0.15 2.10 0.15
race 0.51 1.66 0.88 3.13 0.12 2.42 0.12
beta HR 0.95LCL 0.95UCL P-value
reformed_smoker 0.31 1.36 0.81 2.29 0.24
non-smoker 0.33 1.40 0.67 2.89 0.37
pathologic_stage2 0.89 2.43 1.46 4.07 0.00
pathologic_stage3 1.36 3.88 2.30 6.54 0.00
pathologic_stage4 0.93 2.52 1.16 5.48 0.02
staget2 0.33 1.40 0.86 2.26 0.18
staget3 1.15 3.17 1.39 7.26 0.01
staget4 1.24 3.46 1.30 9.24 0.01
stagen1 0.83 2.30 1.40 3.77 0.00
stagen2 1.18 3.27 1.99 5.38 0.00
stagem1 0.34 1.40 0.67 2.92 0.37
stagemx -0.35 0.70 0.42 1.19 0.19
wald.test wald.p.value
smoking_history 1.46 0.48
pathologic_stage 27.81 0.00
stage_t 11.11 0.01
stage_n 24.48 0.00
stage_m 2.94 0.23

 

 

 

 

 

 

 

  • Pathologic stage (overall cancer stage), stage_t (tumor size), and stage_n (spread to lymph nodes) have highly statistically significant coefficients.
  • All of the statistically significant coefficients are positive, which means these groups are associated with poorer survival than the reference group
  • The hazard ratios for the stage t and stage n groups indicate that the hazard increases with tumor size and with spread to lymph nodes.
  • For the pathologic stage groups, this trend holds for stages 2 and 3. However, stage 4 has an HR similar to that of stage 2, likely due to the limited sample size.

Multivariate analysis for LUAD

The univariate analysis showed that pathologic stage, stage t, and stage n are the only variables with a statistically significant association with overall survival. However, the overall pathologic stage is assigned based on stage t, stage n, and stage m. We therefore expect pathologic stage to be correlated with the TNM stages in any linear modeling technique, so it is not useful to model them together. Since no other variables are statistically significant, a multivariate analysis is not expected to add much to the analysis.

We can, however, fit two multivariate models to see whether the findings from the univariate analysis still hold when other covariates are taken into account:
  1. Age, gender, pathologic_stage, smoking_history
  2. Age, gender, stage_t, stage_n, stage_m, smoking_history

Model 1

Testing the proportional hazards assumption

  • Both the p-values and the Schoenfeld residual plots support the proportional hazards assumption for all variables

  • The p-value from the Wald test indicates that the model is significant.
  • In this model, pathologic stages 2, 3, and 4 remain statistically significant (p < 0.05) after taking all the other variables into account
  • There is also no substantial change in the effect sizes (beta coefficients) and hazard ratios for cancer stage compared with the univariate analysis
  • The p-values, beta coefficients, and hazard ratios together indicate a strong relationship between cancer stage and increased risk of death after accounting for other covariates. Therefore, we can conclude that advanced cancer stage is associated with poor prognosis.
  • Interestingly, both age and the reformed smoker group within smoking history yield borderline significance, with p-values of 0.06 and 0.05, respectively. However, the CIs for both include 1, so these variables only trend toward significance after adjusting for other covariates.
  • It would be useful to model interactions between age, smoking history, and cancer stage

Model 2

  • Includes age, gender, TNM stages (t, n, and m), and smoking history
  • The Wald test p-value indicates that the model is significant.
  • Stages n1 and n2 remain statistically significant (p < 0.05) after taking all other variables into account.
  • Stage m, gender, and smoking history remain non-significant
  • However, t3 and t4 are no longer statistically significant; their effect is diminished after taking all other variables into account.
  • Age becomes borderline significant

Conclusions

  • We estimated the survival rates for LUSC and LUAD and found there is no statistically significant difference between their survival rates
  • The one, three, and five year survival rates for LUSC and LUAD are 81.1% versus 85.5%, 55.7% versus 58.2%, and 43.6% versus 33.8%, respectively.
  • In LUSC, both KM and Cox univariate models indicated that none of the variables have a statistically significant association with overall survival
  • In LUAD, the overall cancer stage and spread to lymph nodes have a significant association with overall survival in both univariate and multivariate analysis
  • Based on the results from the multivariate analysis, age and cancer stage might interact
  • As a follow-up, it would be interesting to fit Cox models with interactions between age and overall cancer stage, smoking history and cancer stage, and age and stage t.