Logistic Regression

Analysis

Loading the data

First make sure working directory is set correctly. Then read the ﬁle into the variable called my_data.

my_data <- read.table("Data_all.15k.patients.txt")

head(my_data)

##          dateofbirth maritalstatus race ageatdiagnosis alivestatus survivaltime
## pid00001  28/01/1932             5    2             61           0          110
## pid00002  07/05/1934             2    3             60           0          100
## pid00003  14/04/1921             5    3             76           0           70
## pid00004  08/11/1931             3    3             61           1           31
## pid00005  08/01/1922             2    3             74           1           47
## pid00006  03/12/1939             2    3             55           0          100
##          grade nodesexam nodespos extent nodalstatus size pgr er
## pid00001     3        32        3     10           6   60   2  2
## pid00002     2        13        1     10           6   15   1  1
## pid00003     3         8        0     10           0    8   1  1
## pid00004     3        20        0     10           0   10   2  2
## pid00005     2        16        8     10           6   15   2  1
## pid00006     3        19        0     10           0   48   1  1

str(my_data)

## 'data.frame':    15000 obs. of  14 variables:
##  $ dateofbirth   : chr  "28/01/1932" "07/05/1934" "14/04/1921" "08/11/1931" ...
##  $ maritalstatus : int  5 2 5 3 2 2 2 4 4 2 ...
##  $ race          : int  2 3 3 3 3 3 1 2 2 3 ...
##  $ ageatdiagnosis: int  61 60 76 61 74 55 84 81 70 78 ...
##  $ alivestatus   : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ survivaltime  : int  110 100 70 31 47 100 87 86 138 86 ...
##  $ grade         : int  3 2 3 3 2 3 3 2 3 2 ...
##  $ nodesexam     : int  32 13 8 20 16 19 3 13 21 15 ...
##  $ nodespos      : int  3 1 0 0 8 0 0 0 0 0 ...
##  $ extent        : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ nodalstatus   : int  6 6 0 0 6 0 0 0 0 0 ...
##  $ size          : int  60 15 8 10 15 48 32 15 22 6 ...
##  $ pgr           : int  2 1 1 2 2 1 2 1 1 2 ...
##  $ er            : int  2 1 1 2 1 1 2 1 1 1 ...

Performing Logistic Regression

Creating a logistic regression model is very similar to creating a linear regression model. Setting the parameter “family” to “binomial” tells the glm() function to create a logistic regression model.

my_lr <- glm(alivestatus ~ size + nodespos + grade,
             data = my_data,
             family = "binomial")

summary(my_lr)

## 
## Call:
## glm(formula = alivestatus ~ size + nodespos + grade, family = "binomial", 
##     data = my_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0612  -0.6022  -0.4451  -0.3164   2.5375  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.899557   0.095006  -41.05   <2e-16 ***
## size         0.023213   0.001391   16.69   <2e-16 ***
## nodespos     0.149229   0.005960   25.04   <2e-16 ***
## grade        0.628004   0.035757   17.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14307  on 14999  degrees of freedom
## Residual deviance: 12094  on 14996  degrees of freedom
## AIC: 12102
## 
## Number of Fisher Scoring iterations: 5

Note that the summary output of logistic regression model is very similar to that of linear regression.

The predict() function can be used to determine the probability of getting the value 1. The parameter “type” speciﬁes which results to return. In order to retrieve the predicted probability of the outcome “1” ( 1 in the output variable alivestatus means death. ) use “response”.

predicted_alivestatus <- predict(my_lr,
                                 newdata = my_data,
                                 type = "response")

my_data$predicted_alivestatus <- predicted_alivestatus

head(my_data)

##          dateofbirth maritalstatus race ageatdiagnosis alivestatus survivaltime
## pid00001  28/01/1932             5    2             61           0          110
## pid00002  07/05/1934             2    3             60           0          100
## pid00003  14/04/1921             5    3             76           0           70
## pid00004  08/11/1931             3    3             61           1           31
## pid00005  08/01/1922             2    3             74           1           47
## pid00006  03/12/1939             2    3             55           0          100
##          grade nodesexam nodespos extent nodalstatus size pgr er
## pid00001     3        32        3     10           6   60   2  2
## pid00002     2        13        1     10           6   15   1  1
## pid00003     3         8        0     10           0    8   1  1
## pid00004     3        20        0     10           0   10   2  2
## pid00005     2        16        8     10           6   15   2  1
## pid00006     3        19        0     10           0   48   1  1
##          predicted_alivestatus
## pid00001             0.4563383
## pid00002             0.1046935
## pid00003             0.1382568
## pid00004             0.1438815
## pid00005             0.2494546
## pid00006             0.2887764

From the prediction results, a probability greater than 0.5 will most likely be dead (have a value of 1). Using this cutoff, a predicted result vector can be created which can be compared to the actual result. In order to compare the predicted with actual results there are certain terminologies that are important to understand:

TP: True positive: Predictions of TRUE event and it is actually TRUE
TN: True negative: Prediction of FALSE event and it is actually FALSE.
FP: False positive: Prediction of TRUE even, but it is actually FALSE.
FN: False negative: Prediction of FALSE even and it is TRUE

These numbers in combination can provide useful statistics to judge the quality of the model.

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
sensitivity = recall = true positive rate = TP / (TP + FN)
specificity = true negative rate = TN / (TN + FP)

my_data$predicted_alivestatus_binary <- ifelse(test = predicted_alivestatus >= 0.5,
                                               yes = 1,
                                               no = 0)

confmat <- table(my_data[,c("alivestatus","predicted_alivestatus_binary")]) 
confmat

##            predicted_alivestatus_binary
## alivestatus     0     1
##           0 11928   317
##           1  2203   552

TP <- confmat[2,2]
TN <- confmat[1,1]
FP <- confmat[1,2]
FN <- confmat[2,1]

accuracy <- (TP + TN) / (TP + TN + FP + FN)
accuracy

## [1] 0.832

precision <- TP / (TP + FP)
precision

## [1] 0.6352129

sensitivity <- TP / (TP + FN)
sensitivity

## [1] 0.200363

specificity <- TN / (TN + FP)
specificity

## [1] 0.9741119

Recall is also often referred to as the measurement of how speciﬁc the method is and Precision is the measurement of sensitivity. We can use the AUC package to see how the sensitive and specity change as the cutoﬀ is changed. Several performance characteristics plotted against each other is called receiver operating characteristic curve, or ROC curve.

ROC curve

my_roc <- roc(data = my_data,
              response = alivestatus, 
              predictor = predicted_alivestatus)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(x = my_roc$thresholds,
     y = my_roc$sensitivities,
     col = "blue",
     type = "l")

points(x = my_roc$thresholds,
       y = my_roc$specificities,
       col = "red",
       type = "l")

Logistic Regression

XDASI Fall 2021

Answer KEY

Introduction

Analysis

Loading the data

Performing Logistic Regression

ROC curve