Goal and Impact

The goal of this project is to estimate survival rates for lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) using data from The Cancer Genome Atlas (TCGA) program. Specifically, we will investigate the impact of demographic and pathological features on survival. Estimating the survival rate and identifying the features that affect it will give clinicians a better baseline from which to tailor therapies or to estimate how likely a treatment is to succeed.

Background

Lung cancer is the second most common form of cancer in both men and women, accounting for 2.3 million of the 17 million total estimated cases. It is also the leading cause of cancer death, making up almost 25% of all cancer deaths. There are two main types of lung cancer: small cell and non-small cell. Non-small cell cancers account for 85-90% of all lung cancers. Two of their subtypes, lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD), account for 25-30% and 40% of cases respectively. LUSC is associated with smoking and is usually found in the middle of the lungs. LUAD, on the other hand, is found on the periphery of the lungs; it may be associated with smoking in some cases but is also the most common type of lung cancer among non-smokers. Most people diagnosed with lung cancer are 65 or older, and very few are younger than 45.

Specific Questions

Are patient survival rates different for LUSC and LUAD?
Are patient survival rates different for sub-groups within the dataset, for example, male vs. female?
How do individual demographic factors affect survival? Does smoking history affect survival rates specifically in LUAD patients?

OK, now that we have framed our questions, let's proceed!

Data Extraction

The dataset was obtained from The Cancer Genome Atlas (TCGA) program.
The RTCGA and RTCGA.clinical packages were used to extract the data.
The LUSC and LUAD clinical data sets contain 504 and 522 patient records respectively.
For each patient, I extracted the following information:
Gender
Age = Age of the patient at initial diagnosis
Ethnicity
Race
Pathologic stage
Pathologic substages (t, n, m)
Smoking history
Vital status (event, in this case death, or censored)
times = either the time to death (days from initial diagnosis to death) or the time to last follow-up (days from initial diagnosis to last follow-up), whichever applies to the patient

Data Wrangling

Step1 - Eliminate negative values from the times column

Negative values indicate an error in data collection.
Keep records with times >= 0, then convert the time from days to years (times/365).

Number of records before and after eliminating negative times.

	Before	After
LUSC	494	478
LUAD	503	484
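
The filtering step can be sketched as follows. This is a small Python illustration, not the R code used for the analysis; the `times` and `status` field names follow the description above, and the sample records are made up.

```python
# Toy records: `times` is in days; a negative value marks a data-collection error.
records = [
    {"patient": "A", "times": 730, "status": 1},
    {"patient": "B", "times": -12, "status": 0},
    {"patient": "C", "times": 0, "status": 0},
    {"patient": "D", "times": 1095, "status": 1},
]

def clean_times(rows):
    """Keep rows with times >= 0 and express the time in years (days / 365)."""
    return [{**row, "times": row["times"] / 365} for row in rows if row["times"] >= 0]

cleaned = clean_times(records)
print(len(records), "->", len(cleaned))  # record counts before and after filtering
```

Dividing by 365 mirrors the conversion stated above; it ignores leap years, which is a negligible approximation at this time scale.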

Step2 - Check for missing data

Six of the nine variables have missing data.
Four variables (stage_n, stage_m, pathologic_stage, smoking_history) have < 3% missing values.
For the race and ethnicity variables, LUSC has 10% and 20% missing values, while LUAD has 20% and 34% missing values respectively.

Let’s explore the RACE and ETHNICITY variables

Ethnicity

The ethnicity variable has two unique values: "hispanic or latino" and "not hispanic or latino". The "not hispanic or latino" category has a negligible frequency in both LUSC and LUAD. Therefore, we will drop this variable.

Race

Race counts, LUSC
Race	Freq
asian	7
black or african american	30
white	337

Race counts, LUAD
Race	Freq
american indian or alaska native	1
asian	8
black or african american	52
white	379

The race variable has 3 and 4 unique values in LUSC and LUAD respectively; however, only the white and Black or African American categories occur with appreciable frequency. We will therefore retain this feature but combine all minority categories into one, leaving two categories in this column: white and other.

Step3 - Data cleaning

Based on the results above, we will do the following to clean the data:
* Drop ethnicity
* Delete records with missing values in the stage_m, stage_n, pathologic_stage, race, and smoking_history columns
* Combine Asian, American Indian or Alaska Native, and Black or African American into one category called “other” in the race column
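
The three cleaning rules above can be sketched in a few lines of Python. This is an illustration only, not the original R code; representing missing values as `None` and the example rows are assumptions.

```python
# Columns that must be present for a record to be kept (ethnicity is dropped entirely).
REQUIRED = ["stage_m", "stage_n", "pathologic_stage", "race", "smoking_history"]

def clean_record(row):
    """Return a cleaned copy of `row`, or None if a required value is missing."""
    if any(row.get(col) is None for col in REQUIRED):
        return None
    cleaned = {k: v for k, v in row.items() if k != "ethnicity"}
    # Collapse every non-white category into a single "other" level.
    cleaned["race"] = "white" if row["race"] == "white" else "other"
    return cleaned

# Hypothetical rows: the first is complete, the second has a missing race value.
rows = [
    {"race": "asian", "ethnicity": None, "stage_m": "m0", "stage_n": "n0",
     "pathologic_stage": "stage ia", "smoking_history": "current smoker"},
    {"race": None, "ethnicity": "not hispanic or latino", "stage_m": "m0",
     "stage_n": "n1", "pathologic_stage": "stage iib", "smoking_history": "current smoker"},
]
cleaned_rows = [r for r in map(clean_record, rows) if r is not None]
```

Note that a missing ethnicity does not cause a row to be dropped, because that column is removed rather than filtered on.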

Now that we have clean data, let’s explore each variable

Data Exploration

Gender

Gender by vital status, LUSC (0 = censored, 1 = event)
	0	1
female	73	31
male	168	85

Gender by vital status, LUAD (0 = censored, 1 = event)
	0	1
female	177	55
male	147	42

LUSC has more male subjects.
LUAD has a more balanced sample of males and females.
Males and females experience similar event rates regardless of the type of lung cancer.

Age

LUSC seems to have a slightly left-skewed age distribution, while LUAD has a roughly symmetric distribution with two peaks (bimodal).
Mean and median age for LUSC are 67.4 and 69.
Mean and median age for LUAD are both reported as 65.1.

Kaplan-Meier estimates compare groups and do not handle continuous variables directly. So, we are going to convert age into a categorical variable with two categories:
* For LUSC: above 69 and below 69
* For LUAD: above 65 and below 65

Age category by vital status, LUSC (0 = censored, 1 = event)
	0	1
above69	150	83
below69	91	33

Age category by vital status, LUAD (0 = censored, 1 = event)
	0	1
above65	167	60
below65	157	37
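
As a rough illustration of this binning, the Python sketch below splits ages at the sample median and cross-tabulates the result against vital status. The ages are invented, and placing patients whose age equals the cutoff in the "below" group is an assumption, since the text does not say where the boundary falls.

```python
from collections import Counter
from statistics import median

def age_category(age, cutoff):
    """Label an age relative to the cutoff (age == cutoff goes to 'below', by assumption)."""
    return f"above{cutoff}" if age > cutoff else f"below{cutoff}"

# Toy (age, status) pairs; status 1 = event, 0 = censored.
patients = [(58, 0), (64, 1), (69, 0), (72, 1), (75, 1), (81, 0), (66, 0)]
cutoff = int(median(age for age, _ in patients))

# Counts of (category, status) pairs, i.e. the cells of the table above.
table = Counter((age_category(age, cutoff), status) for age, status in patients)
```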

Pathologic stage

There are more patients with stage 1 cancer than with other stages in both LUSC and LUAD.
Patients with stage 3 have a much higher event rate in both LUSC and LUAD.
In LUSC, there are very few patients with stage 4. This might cause a non-convergence problem with Cox regression. Therefore, we will delete these records before the survival analysis.
The disease sub-stages, such as 1a and 1b, describe slight variations in disease progression (i.e., tumor size), and modeling might not benefit from separating them. We will re-class each sub-stage to its main stage.
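
One way to re-class sub-stages is to strip the trailing sub-stage letter. The Python sketch below is illustrative only; it assumes labels such as "stage ia" or "t2b", where the sub-stage is a single trailing "a" or "b", and the same helper would apply to the t, n, and m columns.

```python
import re

def main_stage(label):
    """Strip a trailing sub-stage letter, e.g. 'stage ia' -> 'stage i', 't2b' -> 't2'.

    Assumes the sub-stage is a single trailing 'a' or 'b' that follows a roman
    numeral character (i, v) or a digit.
    """
    return re.sub(r"(?<=[iv\d])[ab]$", "", label.strip().lower())

examples = ["stage ia", "stage iib", "stage iv", "t1a", "t2", "n0", "mx"]
collapsed = [main_stage(label) for label in examples]
```

Labels without a sub-stage letter, such as "stage iv", "n0", or "mx", pass through unchanged.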

Stage T,N,M

LUSC

Most of the patients have tumor type t2, which means the tumor is larger than 3 cm and is partially clogging the airways.
The majority of the patients have n0, which means the cancer has not spread to the lymph nodes.
There are some patients with n1, which means the cancer has spread to lymph nodes within the lungs.
The majority of the patients have m0, which means the cancer has not spread to other parts of the body.
mx means metastasis cannot be assessed.
There are very few patients with m1 or nx, which might also cause a non-convergence problem; therefore, the m1 and nx records were deleted.

LUAD

LUAD shows the same patterns observed in LUSC. We will also re-class the sub categories to the main category in the pathologic stage and in stages t, n, and m.

Smoking History

Smoking history by vital status, LUSC (0 = censored, 1 = event)
	0	1
current reformed smoker for < or = 15 years	117	61
current reformed smoker for > 15 years	40	18
current reformed smoker, duration not specified	4	0
current smoker	71	35
lifelong non-smoker	9	2

Smoking history by vital status, LUAD (0 = censored, 1 = event)
	0	1
current reformed smoker for < or = 15 years	106	40
current reformed smoker for > 15 years	82	25
current reformed smoker, duration not specified	3	1
current smoker	82	19
lifelong non-smoker	51	12

LUAD has more non-smokers. This is expected as this is the most common lung cancer in non-smokers.

Before we proceed to survival analysis, let's summarize the data cleaning and preparation so far:
* Eliminated the ethnicity column
* Merged all sparsely populated categories (Black or African American, Asian, American Indian or Alaska Native) into “other” in the race column
* Converted age to a categorical variable
* Collapsed sub categories into main categories in the pathologic stage and t, n, m stage columns
* The final data set consists of 350 and 412 records for LUSC and LUAD respectively

Survival Analysis

We will use the survminer and survival packages in R for this analysis.
First, we will estimate survival rates for LUSC and LUAD using the Kaplan-Meier (KM) method.
The KM method generates plots of survival probability versus time and summaries of the data, including median survival times and confidence intervals (CI).
Survival curves will be compared using the log-rank test with a significance level of 0.05.
Next, we will perform a univariate analysis (one independent variable at a time) for LUSC and LUAD using the KM and Cox proportional hazards methods.
This will allow us to check the impact of each factor on survival and whether groups within each variable have different survival curves. For example, do survival rates differ between men and women under the gender variable?
For variables that show statistically significant differences and have more than two groups, we will also do a pairwise comparison.
Before fitting a model with the Cox proportional hazards method, we will verify the proportional hazards assumption using Schoenfeld residuals and the p-value of the corresponding chi-squared test.
Based on the results from the univariate analysis, we will choose variables for the multivariate analysis and fit a Cox proportional hazards regression model.
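
To make the KM method concrete, here is a minimal pure-Python sketch of the estimator, together with helpers for reading off a survival probability at a given time and the median survival time. It is an illustration on made-up data, not the survival package implementation, and it omits confidence intervals.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def survival_at(curve, t):
    """Step-function lookup: S(t) is the last estimate at or before t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s

def median_survival(curve):
    """First time at which S(t) drops to 0.5 or below; None if never reached."""
    return next((t for t, s in curve if s <= 0.5), None)

# Toy follow-up times in years (made up for illustration).
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the estimate only changes at observed event times.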

Survival rates for LUSC and LUAD

	median	0.95LCL	0.95UCL
admin.disease_code=luad	3.542466	3.142466	4.443836
admin.disease_code=lusc	3.906849	3.032877	5.238356

The one-, three-, and five-year survival rates for LUSC and LUAD are 81.1% versus 85.5%, 55.7% versus 58.2%, and 43.6% versus 33.8%, respectively.
The log-rank test statistic of chisq = 0.3 with a p-value of 0.56 tells us that there is not enough statistical evidence to reject the null hypothesis, so the survival curves for LUSC and LUAD are not significantly different.
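
For readers unfamiliar with the log-rank test, the following Python sketch computes the two-group statistic from scratch: at each event time it compares the observed deaths in one group with the number expected if both groups shared the same hazard. The data are invented, and this is not the R routine used to obtain the values above.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square, p-value) with 1 degree of freedom."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    obs_minus_exp = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)     # deaths at t
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A death count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chisq = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chisq / 2))
    return chisq, p_value

# Toy example: two groups of three patients, all events observed.
chisq, p_value = logrank_test([1, 3, 5], [1, 1, 1], [2, 4, 6], [1, 1, 1])
```

A p-value above the chosen significance level of 0.05, as in this toy example, means the null hypothesis of identical survival curves is not rejected.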

Univariate analysis (LUSC)

KM survival curves

Chisq and p-values for LUSC from Kaplan Meier estimates
category	chisq	df	p.value
gender	0.5975156	1	0.4395275
agecat	1.4792898	1	0.2238857
race	0.7991014	1	0.3713622
smoking_history	4.5758077	2	0.1014790
pathologic_stage	3.7000930	2	0.1572299
stage_t	5.4293237	3	0.1429275
stage_n	1.8526311	2	0.3960101
stage_m	0.6845194	1	0.4080348

The chi-squared values and p-values of the log-rank test from the univariate KM analysis show no statistically significant differences in overall survival between the groups within any variable.
The survival curves for all categories (see below) show a steep decline in the initial years, indicating poor prognosis from the disease.
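
The p-values in the table above follow directly from the chi-squared statistic and its degrees of freedom. As a cross-check, the Python sketch below evaluates the upper tail of the chi-squared distribution using closed forms that exist for 1, 2, and 3 degrees of freedom; it is a verification aid, not part of the original analysis.

```python
import math

def chisq_upper_tail(chisq, df):
    """P(X > chisq) for a chi-square variable with 1, 2, or 3 degrees of freedom."""
    half = chisq / 2
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df == 2:
        return math.exp(-half)
    if df == 3:
        return math.erfc(math.sqrt(half)) + 2 * math.sqrt(half / math.pi) * math.exp(-half)
    raise ValueError("closed form implemented only for df = 1, 2, 3")

# (chisq, df) pairs copied from the LUSC table above.
lusc = {
    "gender": (0.5975156, 1),
    "smoking_history": (4.5758077, 2),
    "stage_t": (5.4293237, 3),
}
p_values = {name: chisq_upper_tail(c, df) for name, (c, df) in lusc.items()}
```

The recomputed values agree with the reported p.value column to the precision shown there.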

	median	0.95LCL	0.95UCL
gender=female	4.54	3.03	6.60
gender=male	2.95	2.46	5.24

	median	0.95LCL	0.95UCL
agecat=above69	3.18	2.64	5.08
agecat=below69	4.54	2.90	NA

	median	0.95LCL	0.95UCL
race=other	2.61	0.9	NA
race=white	3.90	2.9	5.35

	median	0.95LCL	0.95UCL
smoking_history=current smoker	3.18	2.26	8.63
smoking_history=reformed smoker	3.90	2.90	5.41
smoking_history=lifelong non-smoker	1.73	0.23	NA

	median	0.95LCL	0.95UCL
pathologic_stage=stage i	4.60	3.69	5.72
pathologic_stage=stage ii	2.92	2.29	NA
pathologic_stage=stage iii	2.41	1.42	NA

	median	0.95LCL	0.95UCL
stage_t=t1	5.08	3.69	NA
stage_t=t2	3.18	2.64	5.72
stage_t=t3	2.64	1.71	NA
stage_t=t4	2.41	1.06	NA

	median	0.95LCL	0.95UCL
stage_n=n0	3.91	2.95	5.24
stage_n=n1	3.16	2.41	NA
stage_n=n2	2.03	1.42	NA

	median	0.95LCL	0.95UCL
stage_m=m0	3.90	2.90	5.35
stage_m=mx	3.05	1.06	NA

Cox proportional hazards model

Testing proportional Hazard assumption

For each variable, the proportional hazards assumption was checked using a statistical test and Schoenfeld residual plots. The cox.zph function in R tests the independence between Schoenfeld residuals and time. The proportional hazards assumption is supported by a non-significant relationship between residuals and time.

The p-values are high for all variables except race, indicating that the test is not statistically significant for them. This means that the proportional hazards assumption holds for all variables except race, so Cox regression might not be a good fit for modeling this variable. We will not attempt to correct for this, since the majority of records share a single value for race and the variable is unlikely to be informative.
The Schoenfeld residual plots for all variables show a random distribution around the mean of 0, supporting the proportional hazards assumption.

##            chisq df    p
## gendermale 0.627  1 0.43
## GLOBAL     0.627  1 0.43

##               chisq df    p
## agecatbelow69 0.239  1 0.62
## GLOBAL        0.239  1 0.62

##                            chisq df    p
## pathologic_stagestage ii  0.5000  1 0.48
## pathologic_stagestage iii 0.0692  1 0.79
## GLOBAL                    0.5025  2 0.78

##           chisq df     p
## stage_tt2 3.306  1 0.069
## stage_tt3 0.189  1 0.663
## stage_tt4 0.225  1 0.636
## GLOBAL    3.490  3 0.322

##           chisq df    p
## stage_nn1 0.227  1 0.63
## stage_nn2 0.232  1 0.63
## GLOBAL    0.576  2 0.75

##           chisq df    p
## stage_mmx 0.172  1 0.68
## GLOBAL    0.172  1 0.68

Cox Model LUSC (Univariate)

Table: Results from Univariate analysis using cox method
	beta	HR	95%CI(lower)	95%CI(upper)	p-value	wald.test	wald.p.value
gender	0.16	1.18	0.78	1.79	0.44	0.60	0.44
agecat	-0.25	0.78	0.51	1.17	0.23	1.47	0.23
race	-0.25	0.78	0.45	1.35	0.37	0.79	0.37

	beta	HR	95%CI (L)	95%CI (U)	p value
reformed smoker	-0.23	0.79	0.53	1.18	0.26
lifelong nonsmoker	1.06	2.89	0.68	12.20	0.15
pathologic_stage2	0.28	1.32	0.86	2.04	0.21
pathologic_stage3	0.44	1.56	0.95	2.56	0.08
staget2	0.21	1.23	0.78	1.94	0.36
staget3	0.39	1.47	0.74	2.93	0.27
staget4	1.00	2.72	1.11	6.63	0.03
stagen1	0.09	1.09	0.70	1.71	0.70
stagen2	0.40	1.49	0.84	2.64	0.18
stagemx	0.22	1.25	0.74	2.10	0.41

	wald.test	p.value
smoking_history	4.17	0.12
pathologic_stage	3.65	0.16
stage_t	5.19	0.16
stage_n	1.83	0.40
stage_m	0.68	0.41

The tables above show the regression beta coefficients, hazard ratios, confidence intervals for the hazard ratios, and statistical significance (Wald test and p-value) of each variable in relation to overall survival. Each variable has been assessed independently via a separate Cox regression.
From the output above, we can see that the p-values from the Wald test are similar to those obtained from the KM estimates, suggesting:
none of the variables are statistically significant for overall survival
groups within a variable don't have significantly different survival rates
almost all confidence intervals for the hazard ratios include the null value of 1, which also indicates a lack of statistical significance; the one exception is t4 (HR 2.72, 95% CI 1.11 to 6.63, p = 0.03), although the overall Wald test for stage_t is not significant (p = 0.16)

Based on the results from KM estimates and the Cox model, we can conclude that, for LUSC, none of the variables are statistically significant for overall survival
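
The columns of the Cox tables are linked by simple formulas: the hazard ratio is exp(beta), the 95% confidence interval is exp(beta ± 1.96 × SE), and the Wald p-value comes from z = beta / SE. The Python sketch below illustrates these relationships for the t4 row. The standard error is not reported in the tables, so it is backed out of the published confidence interval here; that step and the helper names are assumptions made for illustration.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

def hazard_ratio_summary(beta, se):
    """Hazard ratio, 95% CI, and two-sided Wald p-value from a Cox coefficient."""
    hr = math.exp(beta)
    lower = math.exp(beta - Z_975 * se)
    upper = math.exp(beta + Z_975 * se)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return hr, lower, upper, p_value

def se_from_ci(lower, upper):
    """Back out the standard error of beta from a reported 95% CI on the HR."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_975)

# t4 versus the reference t1 in LUSC, using the rounded values from the table above.
se = se_from_ci(1.11, 6.63)
hr, lower, upper, p = hazard_ratio_summary(1.00, se)
```

With these rounded inputs the recovered p-value rounds to the 0.03 shown in the table, and the interval stays above 1.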

Univariate Analysis LUAD

KM method

category	chisq	df	p.value
gender	0.0619247	1	0.8034793
agecat	2.1215024	1	0.1452440
race	2.4764949	1	0.1155595
smoking_history	1.4695368	2	0.4796165
pathologic_stage	30.9366702	3	0.0000009
stage_t	11.9727101	3	0.0074772
stage_n	26.8069607	2	0.0000015
stage_m	2.9879501	2	0.2244786

The p-values from the log-rank test for pathologic stage, stage t, and stage n are statistically significant, indicating that the groups within these variables differ significantly in survival. Each variable is discussed individually below. Since each of these variables consists of more than two groups, a pairwise comparison was performed, and the results are shown below.
The survival curves (see below) for all variables also show a steep decline in the first five years, similar to what was observed in LUSC, indicating poor prognosis from the disease.

	median	0.95LCL	0.95UCL
gender=female	3.47	2.86	4.73
gender=male	3.54	2.96	7.35

	median	0.95LCL	0.95UCL
agecat=above65	3.45	2.74	4.9
agecat=below65	3.54	3.11	NA

	median	0.95LCL	0.95UCL
race=other	3.72	2.45	NA
race=white	3.47	2.96	4.38

	median	0.95LCL	0.95UCL
smoking_history=current smoker	3.98	2.60	NA
smoking_history=reformed smoker	3.45	3.11	4.53
smoking_history=lifelong non-smoker	3.89	2.73	NA

Pathologic stage

The survival curves show that the survival rate decreases with increasing pathologic stage.
Patients with stage 1 have a median survival time of 4.73 years, whereas patients with stage 2 and stage 3 have 2.55 and 1.72 years respectively.
Patients with stage 4 cancer have a higher median survival time (2.86 years) than patients with stage 2 or 3, but it should be noted that there are very few stage 4 patients in this sample.
Pairwise comparisons show that stage 2 and stage 3 are significantly different from stage 1. However, there is little difference in survival time between stages 2 and 3.

Stage T

The survival curves show that the survival rate decreases more rapidly for t3 and t4 than for t1 and t2. Their median survival times show the same trend.
Pairwise comparison shows that
- t1 is not significantly different from t2, while t3 and t4 are significantly different from t1
- No significant differences are observed between t2, t3, and t4.

	median	0.95LCL	0.95UCL
pathologic_stage=stage i	4.73	3.89	NA
pathologic_stage=stage ii	2.55	1.85	4.87
pathologic_stage=stage iii	1.72	1.13	3.78
pathologic_stage=stage iv	2.86	2.67	NA

	median	0.95LCL	0.95UCL
stage_t=t1	3.78	3.20	NA
stage_t=t2	3.72	2.94	4.73
stage_t=t3	2.37	1.05	NA
stage_t=t4	2.26	0.49	NA

##           stage i stage ii stage iii
## stage ii  **                       
## stage iii ****                     
## stage iv  *                        
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''

##    t1 t2 t3
## t2         
## t3 **      
## t4 **      
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''

Stage N

The survival curves show that the survival rate is better for patients with n0, where the cancer has not spread to the lymph nodes, while the n1 and n2 curves show a steep decline.
The median survival times for n0, n1, and n2 are 4.53, 2.67, and 1.72 years respectively.
Pairwise comparison also indicates the same trend:
- n0 is significantly different from n1 and n2
- there is no significant difference between n1 and n2

	median	0.95LCL	0.95UCL
stage_n=n0	4.53	3.54	NA
stage_n=n1	2.67	1.85	4.11
stage_n=n2	1.72	1.13	3.78

	median	0.95LCL	0.95UCL
stage_m=m0	3.53	2.94	4.53
stage_m=m1	2.86	2.26	NA
stage_m=mx	4.90	3.11	NA

##    n0   n1
## n1 ***    
## n2 ****   
## attr(,"legend")
## [1] 0 '****' 1e-04 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1 \t    ## NA: ''
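
The legend printed under each pairwise table maps p-value ranges to symbols. The short Python helper below encodes that mapping so a table cell can be read back as a range; treating each upper bound as inclusive is an assumption about how the boundaries are handled.

```python
def significance_symbol(p):
    """Map a p-value to the symbol used in the legend of the pairwise tables."""
    cutpoints = [(1e-4, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, "+")]
    for upper, symbol in cutpoints:
        if p <= upper:  # upper bound treated as inclusive (assumption)
            return symbol
    return " "

# A blank cell therefore means p > 0.1, i.e. no evidence of a difference.
symbols = [significance_symbol(p) for p in (0.00005, 0.0005, 0.005, 0.03, 0.08, 0.5)]
```

For example, the "****" entry for n2 versus n0 corresponds to a p-value of at most 1e-04, while the empty cell for n2 versus n1 corresponds to a p-value above 0.1.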

Cox proportional hazards model

Testing for Cox proportional hazard assumption

The p-values are high for all variables, indicating that the test is not statistically significant for any of them. This means that the proportional hazards assumption is valid for all variables.
The Schoenfeld residual plots for all variables show a random distribution around the mean of 0.

##            chisq df    p
## gendermale 0.507  1 0.48
## GLOBAL     0.507  1 0.48

##                chisq df    p
## agecatbelow65 0.0923  1 0.76
## GLOBAL        0.0923  1 0.76

##                             chisq df    p
## pathologic_stagestage ii  0.00266  1 0.96
## pathologic_stagestage iii 2.66184  1 0.10
## pathologic_stagestage iv  0.77412  1 0.38
## GLOBAL                    3.31842  3 0.35

##              chisq df     p
## stage_tt2 3.006384  1 0.083
## stage_tt3 2.785220  1 0.095
## stage_tt4 0.000474  1 0.983
## GLOBAL    4.915003  3 0.178

##           chisq df    p
## stage_nn1 0.957  1 0.33
## stage_nn2 2.323  1 0.13
## GLOBAL    2.558  2 0.28

##           chisq df    p
## stage_mm1 1.262  1 0.26
## stage_mmx 0.507  1 0.48
## GLOBAL    1.562  2 0.46

Cox Model (Univariate)

Univariate analysis using Cox proportional hazards
	beta	HR	95%_CI_lower	95%_CI_upper	p value	wald.test	wald.p.value
gender	-0.05	0.95	0.63	1.42	0.80	0.06	0.80
agecat	-0.31	0.74	0.49	1.11	0.15	2.10	0.15
race	0.51	1.66	0.88	3.13	0.12	2.42	0.12

	beta	HR	0.95LCL	0.95UCL	P-value
reformed_smoker	0.31	1.36	0.81	2.29	0.24
non-smoker	0.33	1.40	0.67	2.89	0.37
pathologic_stage2	0.89	2.43	1.46	4.07	0.00
pathologic_stage3	1.36	3.88	2.30	6.54	0.00
pathologic_stage4	0.93	2.52	1.16	5.48	0.02
staget2	0.33	1.40	0.86	2.26	0.18
staget3	1.15	3.17	1.39	7.26	0.01
staget4	1.24	3.46	1.30	9.24	0.01
stagen1	0.83	2.30	1.40	3.77	0.00
stagen2	1.18	3.27	1.99	5.38	0.00
stagem1	0.34	1.40	0.67	2.92	0.37
stagemx	-0.35	0.70	0.42	1.19	0.19

	wald.test	wald.p.value
smoking_history	1.46	0.48
pathologic_stage	27.81	0.00
stage_t	11.11	0.01
stage_n	24.48	0.00
stage_m	2.94	0.23

Pathologic stage (overall cancer stage), stage_t (tumor size), and stage_n (spread to lymph nodes) are statistically significant (Wald test p-values of 0.00, 0.01, and 0.00 as reported above).
All groups within these three variables have positive beta coefficients, which means they are associated with poorer survival than the reference group.
The hazard ratios for the stage t and stage n groups indicate that the hazard rate increases with tumor size and with spread to lymph nodes.
For pathologic stage groups, this trend holds for stages 2 and 3. However, stage 4 has an HR similar to that of stage 2. This is likely due to the limited sample size.
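
To show where a beta coefficient comes from, the Python sketch below fits a Cox model with a single binary covariate by Newton-Raphson on the partial likelihood. It is a teaching sketch on invented data, not the R routine behind the tables above; ties are handled with the Breslow approximation, which may differ from the tie handling used in the original analysis.

```python
import math

def fit_cox_single(times, events, x, iterations=25):
    """Fit beta for one covariate by Newton-Raphson on the Cox partial likelihood.

    Ties use the Breslow approximation. Returns (beta, standard error).
    """
    beta = 0.0
    for _ in range(iterations):
        score = info = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients only contribute through risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [xj for tj, xj in zip(times, x) if tj >= t_i]
            s0 = sum(math.exp(beta * xj) for xj in risk)
            s1 = sum(xj * math.exp(beta * xj) for xj in risk)
            s2 = sum(xj * xj * math.exp(beta * xj) for xj in risk)
            score += x_i - s1 / s0              # first derivative of the log-likelihood
            info += s2 / s0 - (s1 / s0) ** 2    # negative second derivative
        beta += score / info
    return beta, 1 / math.sqrt(info)

# Toy data: x = 1 marks a hypothetical advanced-stage group with earlier events.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 0, 1, 1, 0, 1]
x = [1, 1, 0, 1, 0, 1, 0, 0]
beta, se = fit_cox_single(times, events, x)
hazard_ratio = math.exp(beta)
```

A positive beta gives a hazard ratio above 1, meaning the x = 1 group has a higher hazard of death than the reference group, which is the reading applied to the stage coefficients above.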

Multivariate analysis for LUAD

The univariate analysis allowed us to conclude that pathologic stage, stage t, and stage n are the only variables that are statistically significant for overall survival. However, the overall pathologic stage is assigned based on stage t, stage n, and stage m. Therefore, in any linear modeling technique, we expect pathologic stage to be correlated with the t, n, and m stages, so it is not useful to model them together. Since no other variables are statistically significant, a multivariate analysis adds little to the analysis.

We can, however, perform a multivariate analysis using two models to see whether the findings from the univariate analysis still hold when other covariates are taken into account:
1) Age, gender, pathologic_stage, smoking_history
2) Age, gender, stage_t, stage_n, stage_m, smoking_history

Model 1

Testing the proportional hazards assumption

Both the p-values and the Schoenfeld residual plots support the proportional hazards assumption for all variables.

The p-value from the Wald test indicates that the model is significant.
In this model, pathologic stages 2, 3, and 4 remain statistically significant (p < 0.05) after taking into account all the other variables.
There is also no significant change in effect sizes (beta coefficients) and hazard ratios from the univariate analysis for cancer stage.
The p-values together with the beta coefficients and hazard ratios indicate a strong relationship between patient cancer stage and increased risk of death after accounting for other covariates. Therefore, we can conclude that advanced cancer stage is associated with poor prognosis.
Interestingly, both age and the reformed smoker group within smoking history yield borderline significance, with p-values of 0.06 and 0.05 respectively. However, the CIs for both include 1, which means that they make smaller contributions to the difference in HR after adjusting for other covariates and only trend toward significance.
It would be useful to model interactions between age, smoking history, and cancer stage.

Model 2

Includes age, gender, stages t, n, and m, and smoking history

The Wald test p-values indicate that the model is significant.
Stages n1 and n2 remain statistically significant (p < 0.05) after taking all other variables into account.
Stage m, gender, and smoking history continue to be insignificant.
However, neither t3 nor t4 is statistically significant; their effect is diminished after taking all other variables into account.
Age becomes borderline significant.

Conclusions

We estimated the survival rates for LUSC and LUAD and found no statistically significant difference between them.
The one-, three-, and five-year survival rates for LUSC and LUAD are 81.1% versus 85.5%, 55.7% versus 58.2%, and 43.6% versus 33.8%, respectively.
In LUSC, both the KM and Cox univariate models indicated that none of the variables has a statistically significant association with overall survival.
In LUAD, the overall cancer stage and spread to lymph nodes have a significant association with overall survival in both univariate and multivariate analysis.
Based on the results from the multivariate analysis, age and cancer stage might interact.
As a follow-up, it would be interesting to use Cox models with interactions between age and overall cancer stage, smoking history and cancer stage, and age and stage t.

Survival Analysis on Non-small cell lung cancer

Spandana Makeneni
