Last updated: 2018-03-13
Code version: fe4c128
library(dplyr)
library(data.table)
library(magrittr)
library(purrr)
library(here) # for tracking working directory
library(ggplot2)
library(epistats)
library(broom)
Analyze the throat dataset (throat.txt or throat.sav) in R, SPSS, or both.
Examine the relation between sore throat and duration of surgery in three ways: 1) make a scatterplot of sore throat by duration of surgery; 2) make histograms of duration split by sore throat; and 3) make a plot of proportion having sore throat by duration of surgery (you’ll need to categorize duration).
In a survey of 35 patients having surgery with a general anesthetic, patients were asked whether or not they experienced a sore throat (throat=0 for no, throat=1 for yes). The duration of the surgery in minutes was also recorded, and the type of device used to secure the airway (0 = laryngeal mask airway; 1=tracheal tube)
throat <- read.table(here("data", "throat.txt"), sep = ";", header = T)
str(throat)
'data.frame': 35 obs. of 4 variables:
$ Patient: int 1 2 3 4 5 6 7 8 9 10 ...
$ D : int 45 15 40 83 90 25 35 65 95 35 ...
$ T : int 0 0 0 1 1 1 0 0 0 0 ...
$ Y : int 0 0 1 1 1 1 1 1 1 1 ...
Rename variables for easier interpretation
throat %<>%
transmute(
patient = Patient,
duration = D,
sore_throat = as.logical(`T`),
tracheal_tube = as.logical(Y))
Scatterplot
throat %>%
ggplot(aes(x = duration, y = sore_throat)) +
geom_point()
Histograms
throat %>%
ggplot(aes(x = duration, fill = sore_throat)) +
geom_histogram()
Proportion sore throats by category of duration.
Let’s divide duration in 5 categories
Calculate mean duration by quantile for proper plotting
throat %<>%
mutate(duration_group = quant(duration, n.tiles = 5))
throat_grouped <-throat %>%
group_by(duration_group) %>%
summarize(mean_duration = mean(duration),
prop_throat = mean(sore_throat))
Plot them
throat_grouped %>%
ggplot(aes(x = mean_duration, y = prop_throat)) +
geom_point() +
lims(y = c(0,1))
Now improved with errorbars
In epistats
ther is a function that returns a proportion with a confidence interval for logical variables (without using n = length(x), x = sum(x)
)
Group the data by duration group using nest
from tidyr
Perform the confidence interval estimation in each subset, pull out the proportions and confidence intervals.
require(tidyr)
throat %<>%
mutate(duration_group = quant(duration, n.tiles = 5))
throat_nested <- throat %>%
group_by(duration_group) %>%
nest() %>%
mutate(
mean_duration = map(data, function(data) mean(data$duration)),
prop = map(data, function(data) binom.confint_logical(data$sore_throat))) %>%
unnest(prop, mean_duration)
throat_nested %>%
ggplot(aes(x = mean_duration, y = mean)) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
lims(y = c(0,1)) +
theme_minimal() + labs(y = "proportion sore throat")
Fit a logistic regression model to explain the probability of sore throat as a function of duration of surgery. SPSS users: do this once using the Regression, Binary Logistic menu option, and a second time using the Generalized Linear Model menu option. Compare the results from both “methods” for estimating a logistic regression; are there differences in the parameter estimates, standard errors or maximum likelihood?
fit <- glm(sore_throat ~ duration, data = throat, family = binomial(link = "logit"))
summary(fit)
Call:
glm(formula = sore_throat ~ duration, family = binomial(link = "logit"),
data = throat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.189 -1.149 -1.131 1.204 1.225
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.136898 0.658830 -0.208 0.835
duration 0.001734 0.012292 0.141 0.888
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 48.492 on 34 degrees of freedom
Residual deviance: 48.472 on 33 degrees of freedom
AIC: 52.472
Number of Fisher Scoring iterations: 3
Add the fitted logistic curve from (d) to one of the graphs in (a) (SPSS users: save the predicted probabilities from the model; use either a multiple line graph or an overlay scatterplot).
throat_nested %<>%
mutate(pred = predict(fit, newdata = data.frame(duration = mean_duration),
type = "response"))
throat_nested %>%
ggplot(aes(x = mean_duration, y = mean)) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
geom_line(aes(y = pred), lty = 2) +
lims(y = c(0,1)) +
theme_minimal() + labs(y = "proportion sore throat")
Continue with the analysis of the throat dataset in R, SPSS, or both (SPSS users: use the GLM menu for the modelling):
Examine the relation between sore throat and device, and between device and duration.
throat %>%
select(sore_throat, tracheal_tube, duration) %>%
mutate_if(is.logical, as.factor) %>%
GGally::ggpairs()
Patients without a tracheal tube seem to have a sore throat more often.
Patients with a tracheal tube seem to have a longer duration
Fit a logistic regression model to explain the probability of sore throat as a function of type of device. Get the Wald and profile likelihood
fit2 <- glm(sore_throat ~ tracheal_tube, data = throat, family = binomial(link = "logit"))
summary(fit2)
Call:
glm(formula = sore_throat ~ tracheal_tube, family = binomial(link = "logit"),
data = throat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5353 -0.9508 -0.9508 0.8576 1.4224
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8109 0.6009 1.349 0.1772
tracheal_tubeTRUE -1.3705 0.7467 -1.836 0.0664 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 48.492 on 34 degrees of freedom
Residual deviance: 44.889 on 33 degrees of freedom
AIC: 48.889
Number of Fisher Scoring iterations: 4
confint(fit2)
2.5 % 97.5 %
(Intercept) -0.310979 2.11680249
tracheal_tubeTRUE -2.929278 0.04379487
Both Wald test and profile likelihood agree that tracheal tube is not significant.
Fit the remaining models from the lecture and interpret the output.
fit2 <- glm(sore_throat ~ tracheal_tube + duration, data = throat, family = binomial(link = "logit"))
fit3 <- glm(sore_throat ~ duration * tracheal_tube, data = throat, family = binomial(link = "logit"))
summary(fit3)
Call:
glm(formula = sore_throat ~ duration * tracheal_tube, family = binomial(link = "logit"),
data = throat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9459 -0.7546 -0.5331 0.7840 2.0107
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.66116 1.49224 1.783 0.07453 .
duration -0.06208 0.04295 -1.445 0.14833
tracheal_tubeTRUE -5.52029 2.01592 -2.738 0.00617 **
duration:tracheal_tubeTRUE 0.10127 0.04791 2.114 0.03454 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 48.492 on 34 degrees of freedom
Residual deviance: 37.908 on 31 degrees of freedom
AIC: 45.908
Number of Fisher Scoring iterations: 4
AIC(fit2, fit3)
df AIC
fit2 3 49.10924
fit3 4 45.90770
There seems to be a significant interaction between tracheal tube and duration, according to both the p-value from the fit summary and the AIC.
Without the tracheal tube, the duration of the procedure does not seem to matter. The tracheal tube itselve is protective for sore throat. For patients with a tracheal tube, increased duration is associated with higher odds of sore throat
Optional in R: continue with the analysis of the throat dataset, and repeat the model diagnostics. For saving, “binning” and plotting the binned residuals, see the script provided.
plot(fit3)
Use binning (adapted from provided script)
throat %<>%
mutate(
residuals = residuals(fit3),
linpred = predict(fit3, type = "link")
)
throat %>%
mutate(pred_cat = quant(linpred, n.tiles = 10)) %>%
group_by(pred_cat) %>%
summarize(residuals = mean(residuals),
linpred = mean(linpred)) %>%
ggplot(aes(x = linpred, y = residuals)) +
geom_point()
This should be unstructured.
The R script GLM computer lab1 analyses.R contains the beginnings of an analysis of the ICU dataset (ICU.RData, ICU.sav). Use and expand on this script in R (or use SPSS) to answer the following questions. a. (SPSS users note: this is not entirely possible, see answers to 1a&c for compromise.) Examine the relation between status and systolic blood pressure (SYS). Try to make a graph of this relation, similar to the graph of status vs. age in the lecture notes (bottom left graph on slide 60). Compare this plot to the systolic blood pressure plot on slide 61 of the lecture notes. What is different in how this new plot has been made? b. Fit a logistic regression model using only systolic blood pressure (SYS) to predict status. What is the odds ratio for dying (status = dead) for a 10 unit increase in systolic blood pressure? c. What assumption does the model in 3(b) make about dying and SYS? Do you think that assumption is met? d. Examine the relation between status and level of consciousness (LOC) in a contingency table. What do you notice? What problems might this give in the model fitting? e. Fit a logistic regression model using LOC to predict status. Comment on the estimates and standard errors for the regression coefficients.
This dataset will be considered for further analysis on days 2 and 3.
# fit_all <- ...
# stepAIC(fit_all)
# check assumptions (binned deviance, by variable)
This assignment was skipped due to time-limitations and it being discussed during the lecture.
Can we predict birth weight using gestational age? Is the prediction the same for boys and girls? a. Use the dataset bwt_gestage.csv to answer these questions. Note: sex = 1 are the boys, sex = 2 are the girls. b. Interpret your findings. c. Fit the final model from (a) using both the lm() and glm() functions in R. Compare the parameter estimates and standard errors from the two methods of fitting a linear model. d. Repeat (c) using both the Regression, Linear and the Generalized Linear Models procedures in SPSS. Compare the parameter estimates and standard errors from the two methods of fitting a linear model. What do you notice?
We wish to find factors that influence the probability that a low birth weight infant (<1500 g) will experience a germinal matrix hemorrhage. A sample of 100 low birth weight newborns was retrospectively collected in a hospital in Boston, MA. Factors possibly indicative of a germinal matrix hemorrhage were extracted from a chart review and included sex, head circumference, systolic blood pressure and gestational age of the infant, and whether the mother suffered from toxemia during the pregnancy. Use the dataset lowbwt.txt to predict the probability of hemorrhage. (Note: the dichotomous variables are defined as follows: sex=1 is a male, tox=1 is toxemia, grmhem=1 hemorrhage.) a. Start by describing the data, get a sense of the relations between the potential explanatory variables and the outcome. b. Construct a model to predict occurrence of germinal matrix hemorrhage. c. Interpret your findings.
The dataset epilepsy in the R library HSAUR (or the SPSS file epilepsy.sav) contains data from a clinical trial on 59 patients suffering from epilepsy. Patients were randomized to groups receiving either an anti-epileptic drug or a placebo, in addition to standard chemotherapy. N.B. If you’re working in R and you’ve loaded the faraway library, unload it now! Otherwise you may end up with the wrong epilepsy data frame (both faraway and HSAUR contain data frames with different structures but the same name): detach(“package:faraway”). Get some information about this data frame by loading the HSAUR library and using help(epilepsy). This is data from a longitudinal trial; we will use only the data from the last two-week period. We are interested in whether the probability of seizure is higher in the treatment or control group in this period. a. Start by making a selection of the data for period 4 and making a new variable for seizure yes/no. b. Get some descriptive statistics for the data, get a sense of the relations between the potential explanatory variables and the dichotomous outcome seizure. c. What type of variable is seizure.rate? d. Give two reasons why a logistic regression model is not the most appropriate way to analyze this data.
The R script ICU analyses day 2.R contains the beginnings of an analysis of the ICU dataset (ICU.RData, ICU.sav). Use and expand on this script in R (or use SPSS) to answer the following questions.
Setup
# Analyses of throat and ICU datasets for day 2 GLM course
library(gmodels)
library(splines)
library(HSAUR)
load(here("data", "ICU.RData"))
str(ICU)
'data.frame': 200 obs. of 22 variables:
$ ID : num 8 12 14 28 32 38 40 41 42 50 ...
$ STA : Factor w/ 2 levels "Alive","Dead": 1 1 1 1 1 1 1 1 1 1 ...
$ AGE : num 27 59 77 54 87 69 63 30 35 70 ...
$ AGECAT: num 30 60 80 50 90 70 60 30 40 70 ...
$ SEX : Factor w/ 2 levels "Male","Female": 2 1 1 1 2 1 1 2 1 2 ...
$ RACE : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 2 1 ...
$ SER : Factor w/ 2 levels "Medical","Surgical": 1 1 2 1 2 1 2 1 1 2 ...
$ CAN : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
$ CRN : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ INF : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 1 1 1 1 ...
$ CPR : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ SYS : num 142 112 100 142 110 110 104 144 108 138 ...
$ HRA : num 88 80 70 103 154 132 66 110 60 103 ...
$ PRE : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
$ TYP : Factor w/ 2 levels "Elective","Emergency": 2 2 1 2 2 2 1 2 2 1 ...
$ FRA : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
$ PO2 : Factor w/ 2 levels "> 60","<= 60": 1 1 1 1 1 2 1 1 1 1 ...
$ PH : Factor w/ 2 levels ">= 7.25","< 7.25": 1 1 1 1 1 1 1 1 1 1 ...
$ PCO : Factor w/ 2 levels "<= 45","> 45": 1 1 1 1 1 1 1 1 1 1 ...
$ BIC : Factor w/ 2 levels ">= 18","< 18": 1 1 1 1 1 2 1 1 1 1 ...
$ CRE : Factor w/ 2 levels "<= 2.0","> 2.0": 1 1 1 1 1 1 1 1 1 1 ...
$ LOC : Factor w/ 3 levels "No Coma or Stupor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "variable.labels")= Named chr "Identification Code" "Status" "Age in Years" "Age Class" ...
..- attr(*, "names")= chr "ID" "STA" "AGE" "AGECAT" ...
- attr(*, "codepage")= int 1252
Examine the relation between status (STA) and type of admission (TYP).
CrossTable(ICU$TYP, ICU$STA, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 200
| ICU$STA
ICU$TYP | Alive | Dead | Row Total |
-------------|-----------|-----------|-----------|
Elective | 52 | 2 | 54 |
| 0.963 | 0.037 | 0.270 |
-------------|-----------|-----------|-----------|
Emergency | 108 | 38 | 146 |
| 0.740 | 0.260 | 0.730 |
-------------|-----------|-----------|-----------|
Column Total | 160 | 40 | 200 |
-------------|-----------|-----------|-----------|
Fit a logistic regression model using TYP to predict status. Express the results as OR and 95% CI.
icu.m1 <- glm(formula = STA ~ TYP, family = binomial(link = "logit"), data=ICU)
summary(icu.m1)
Call:
glm(formula = STA ~ TYP, family = binomial(link = "logit"), data = ICU)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7765 -0.7765 -0.7765 -0.2747 2.5674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.2581 0.7203 -4.523 6.08e-06 ***
TYPEmergency 2.2136 0.7446 2.973 0.00295 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 184.52 on 198 degrees of freedom
AIC: 188.52
Number of Fisher Scoring iterations: 5
exp(coefficients(icu.m1))
(Intercept) TYPEmergency
0.03846155 9.14814450
exp(confint(icu.m1))
2.5 % 97.5 %
(Intercept) 0.006293049 0.1235554
TYPEmergency 2.658796276 57.5739571
Fit two binomial models producing the Risk Ratio and Risk Difference, with their 95% CI.
# "relative risk" regression: link=log, family = binomial
icu.m2 <- glm(formula = STA ~ TYP, family = binomial(link = "log"), data=ICU)
summary(icu.m2)
Call:
glm(formula = STA ~ TYP, family = binomial(link = "log"), data = ICU)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7765 -0.7765 -0.7765 -0.2747 2.5674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.2958 0.6938 -4.750 2.03e-06 ***
TYPEmergency 1.9498 0.7077 2.755 0.00587 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 184.52 on 198 degrees of freedom
AIC: 188.52
Number of Fisher Scoring iterations: 6
exp(coefficients(icu.m2))
(Intercept) TYPEmergency
0.03703704 7.02739702
exp(confint(icu.m2))
2.5 % 97.5 %
(Intercept) 0.006254798 0.1099454
TYPEmergency 2.265999444 42.2886756
# "risk difference" regression: link=identity, family = binomial
# Note: do not exponentiate coeff & CI
icu.m3 <- glm(formula = STA ~ TYP, family = binomial(link = "identity"), data=ICU)
summary(icu.m3)
Call:
glm(formula = STA ~ TYP, family = binomial(link = "identity"),
data = ICU)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7765 -0.7765 -0.7765 -0.2747 2.5674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.03704 0.02570 1.441 0.15
TYPEmergency 0.22324 0.04449 5.018 5.22e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 184.52 on 198 degrees of freedom
AIC: 188.52
Number of Fisher Scoring iterations: 2
# confint(icu.m3)
confint.default(icu.m3)
2.5 % 97.5 %
(Intercept) -0.01333321 0.08740729
TYPEmergency 0.13604217 0.31043170
For the identity link, likihood profiling for confidence intervals does not work. Possibly due to inadmissable values in the range of the profile.
Fit two further binomial models using the probit link and the cloglog link.
# probit link
icu.m4 <- glm(formula = STA ~ TYP, family = binomial(link = "probit"), data=ICU)
summary(icu.m4)
Call:
glm(formula = STA ~ TYP, family = binomial(link = "probit"),
data = ICU)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7765 -0.7765 -0.7765 -0.2747 2.5674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7862 0.3175 -5.625 1.85e-08 ***
TYPEmergency 1.1437 0.3367 3.397 0.000681 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 184.52 on 198 degrees of freedom
AIC: 188.52
Number of Fisher Scoring iterations: 5
# cloglog link
icu.m5 <- glm(formula = STA ~ TYP, family = binomial(link = "cloglog"), data=ICU)
summary(icu.m5)
Call:
glm(formula = STA ~ TYP, family = binomial(link = "cloglog"),
data = ICU)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7765 -0.7765 -0.7765 -0.2747 2.5674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.2770 0.7071 -4.634 3.58e-06 ***
TYPEmergency 2.0780 0.7257 2.864 0.00419 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 184.52 on 198 degrees of freedom
AIC: 188.52
Number of Fisher Scoring iterations: 6
Decide which link gives the best fitting model, based on deviance or AIC
AIC(icu.m1, icu.m2, icu.m3, icu.m4, icu.m5)
df AIC
icu.m1 2 188.5246
icu.m2 2 188.5246
icu.m3 2 188.5246
icu.m4 2 188.5246
icu.m5 2 188.5246
All have equivalent AIC.
list(icu.m1, icu.m2, icu.m3, icu.m4, icu.m5) %>%
map_dbl(deviance)
[1] 184.5246 184.5246 184.5246 184.5246 184.5246
All have equivalent deviance
Repeat the analyses 1a-d for the continuous variable SYS, being systolic blood pressure.
The identity link required starting values to work
try(fit_id <- glm(STA ~ SYS, data = ICU, family = binomial(link = "identity"),
start = c(mean(ICU$STA == "Dead"), 0)))
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: step size truncated due to divergence
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: algorithm stopped at boundary value
The rest we can do in one line
links = list("logit", "log", "probit", "cloglog")
fits <- links %>%
map(function(link) glm(STA ~ SYS, data = ICU, family = binomial(link = link)))
links <- c(links, "identity")
fits[[length(fits)+1]] <- fit_id
data.frame(link = unlist(links),
aic = map_dbl(fits, AIC),
deviance = map_dbl(fits, AIC))
link aic deviance
1 logit 195.3351 195.3351
2 log 193.5661 193.5661
3 probit 196.2581 196.2581
4 cloglog 194.5699 194.5699
5 identity 199.5667 199.5667
Decide which link now gives the best fitting model, based on deviance or AIC
The log link gives the lowest AIC, so this is the best fit.
Discuss the difference with 1e
With this continous predictor, there is a difference in AICs between the models.
In the previous case with only a single binary predictor, there are only two possible values for the predicted probability of survival, regardless of the link function: \(\hat{p}_0\) for unexposed, \(\hat{p}_1\) for exposed. Each set of values for \(p_0\) and \(p_1\) gives rise to a single value of the resulting likelihood. Since all glm models are fitted according to the same criterion of maximum likelihood, they will all find the same values for \(p_0\) and \(p_1\).
We can check this by looking at the values of the predictions
list(icu.m1, icu.m2, icu.m3, icu.m4, icu.m5) %>%
map("fitted.values") %>%
map(unique)
[[1]]
[1] 0.26027397 0.03703705
[[2]]
[1] 0.26027397 0.03703704
[[3]]
[1] 0.26027397 0.03703704
[[4]]
[1] 0.26027397 0.03703704
[[5]]
[1] 0.26027397 0.03703704
In the case of the continous predictor, the predicted probability takes on more values, and now the different link functions start to matter for the model fit.
The logistic model may be improved by introducing a non-linear effect of SYS. One possible way of achieving this is to add a quadratic term to the model. In R, you may also use a flexible natural spline model (use function ns from library splines in the model specification). Check whether model fit improves.
require(splines)
fit_logit1 <- glm(STA ~ SYS + I(SYS^2), data = ICU, family = binomial(link = "logit"))
fit_logit2 <- glm(STA ~ ns(SYS, df = 2), data = ICU, family = binomial(link = "logit"))
fit_logit3 <- glm(STA ~ ns(SYS, df = 3), data = ICU, family = binomial(link = "logit"))
AIC(fit_logit1, fit_logit2, fit_logit3)
df AIC
fit_logit1 3 189.6764
fit_logit2 3 190.4331
fit_logit3 4 191.9052
The spline seems to increase model fit, however including more complicated splines does not.
The dataset epilepsy.RData (or the SPSS file epilepsy.sav) contains data from a clinical trial on 59 patients suffering from epilepsy. Patients were randomized to groups receiving either an anti-epileptic drug or a placebo, in addition to standard chemotherapy. This is data from a longitudinal trial; we will use only the data from the last two-week period. We are interested in whether the seizure rate is higher in the treatment or control group in this period. This dataset has already been used in yesterday’s computer lab and today’s lecture.
load(here("data", "epilepsy.RData"))
str(epilepsy)
'data.frame': 236 obs. of 6 variables:
$ treatmnt: Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
$ base : num 11 11 11 11 11 11 11 11 6 6 ...
$ age : num 31 31 31 31 30 30 30 30 25 25 ...
$ seizr.rt: num 5 3 3 3 3 5 3 3 2 4 ...
$ period : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ subject : Factor w/ 59 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
- attr(*, "variable.labels")= Named chr "treatment" "base" "age" "seizure.rate" ...
..- attr(*, "names")= chr "treatmnt" "base" "age" "seizr.rt" ...
- attr(*, "codepage")= int 1252
Start by making a selection of the data for period 4.
epi4 <- epilepsy %>% filter(period == "4")
str(epi4)
'data.frame': 59 obs. of 6 variables:
$ treatmnt: Factor w/ 2 levels "placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
$ base : num 11 11 6 8 66 27 12 52 23 10 ...
$ age : num 31 30 25 36 22 29 31 42 37 28 ...
$ seizr.rt: num 3 3 5 4 21 7 2 12 5 0 ...
$ period : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
$ subject : Factor w/ 59 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "variable.labels")= Named chr "treatment" "base" "age" "seizure.rate" ...
..- attr(*, "names")= chr "treatmnt" "base" "age" "seizr.rt" ...
- attr(*, "codepage")= int 1252
Repeat the analysis from today’s lecture. Comment on the impact of adjustment for the baseline seizure rate. Pay attention to the fact that this is a randomized trial.
First some marginal distributions for treatment groups
epi4 %>%
as.data.table() %>%
melt.data.table(id.vars = c("subject", "treatmnt"),
measure.vars = c("base", "age", "seizr.rt")) %>%
ggplot(aes(x = treatmnt, y = value)) +
geom_boxplot() +
facet_wrap(~variable)
Looks like the covariates base and age are equally distributed among treatment groups, and seizure rate seems a bit lower in the treatment group.
Now for the covariate-outcome distributions:
epi4 %>%
ggplot(aes(x = base, y = seizr.rt)) +
geom_point()
epi4 %>%
ggplot(aes(x = age, y = seizr.rt)) +
geom_point()
Clear correlation between base rate and follow-up seizure rate. No clear marginal correlation between age and seizure rate at follow-up
Model with poisson
fit1 <- glm(seizr.rt ~ treatmnt, family = poisson, data = epi4)
summary(fit1)
Call:
glm(formula = seizr.rt ~ treatmnt, family = poisson, data = epi4)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.9911 -2.0175 -1.1319 0.4214 13.0233
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.07497 0.06696 30.986 <2e-16 ***
treatmntProgabide -0.17142 0.09640 -1.778 0.0754 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 476.25 on 58 degrees of freedom
Residual deviance: 473.08 on 57 degrees of freedom
AIC: 664.85
Number of Fisher Scoring iterations: 6
Adjusted for base rate
fit2 <- glm(seizr.rt ~ treatmnt + log(base), family = poisson, data = epi4)
summary(fit2)
Call:
glm(formula = seizr.rt ~ treatmnt + log(base), family = poisson,
data = epi4)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6907 -1.2423 -0.0527 0.8498 3.5874
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.91906 0.25818 -7.433 1.06e-13 ***
treatmntProgabide -0.19920 0.09640 -2.066 0.0388 *
log(base) 1.15056 0.06554 17.556 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 476.25 on 58 degrees of freedom
Residual deviance: 147.75 on 56 degrees of freedom
AIC: 341.51
Number of Fisher Scoring iterations: 5
Effect seems somewhat stronger after adjustment for base-rate
Examine the effects of further adjustment.
fit3 <- glm(seizr.rt ~ treatmnt + log(base) + age, family = poisson, data = epi4)
summary(fit3)
Call:
glm(formula = seizr.rt ~ treatmnt + log(base) + age, family = poisson,
data = epi4)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5962 -1.1318 0.1552 0.8062 3.6635
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.33017 0.40583 -5.742 9.37e-09 ***
treatmntProgabide -0.15726 0.10144 -1.550 0.121
log(base) 1.17365 0.06819 17.211 < 2e-16 ***
age 0.01100 0.00823 1.337 0.181
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 476.25 on 58 degrees of freedom
Residual deviance: 145.98 on 55 degrees of freedom
AIC: 341.74
Number of Fisher Scoring iterations: 5
After adjusting for age, treatment is no longer significant.
AIC(fit1, fit2, fit3)
df AIC
fit1 2 664.8524
fit2 3 341.5144
fit3 4 341.7444
Adding age to the model results in a slightly worse fit according to the AIC.
However, model fit is not the primary goal for causal research.
Let’s look at model diagnostics of the fits
plot(fit2, which = 5)
plot(fit3, which = 5)
Observation 49 seems to have a strong effect on the model
Covered in class
library(HSAUR)
data(epilepsy, package = "HSAUR")
epi <- epilepsy[epilepsy$period==4,]
summary(epi)
treatment base age seizure.rate period
placebo :28 Min. : 6.00 Min. :18.00 Min. : 0.000 1: 0
Progabide:31 1st Qu.: 12.00 1st Qu.:23.00 1st Qu.: 3.000 2: 0
Median : 22.00 Median :28.00 Median : 4.000 3: 0
Mean : 31.22 Mean :28.34 Mean : 7.305 4:59
3rd Qu.: 41.00 3rd Qu.:32.00 3rd Qu.: 8.000
Max. :151.00 Max. :42.00 Max. :63.000
subject
1 : 1
2 : 1
3 : 1
4 : 1
5 : 1
6 : 1
(Other):53
Plot marginal distributions
epi %>%
select(-period) %>%
gather(-treatment, -subject, -seizure.rate, key = "variable", value = "value") %>%
ggplot(aes(x = value, y = seizure.rate)) +
geom_point() + geom_smooth() +
facet_wrap(~variable, scales = "free_x") +
theme_minimal()
Look at extreme cases
epi %>%
filter(seizure.rate == max(seizure.rate))
treatment base age seizure.rate period subject
1 Progabide 151 22 63 4 49
fit_lm <- lm(seizure.rate ~ age + base +treatment, data = epi)
plot(fit_lm)
epi %>%
mutate(log_base = log(base + 0.5)) %>%
select(-period) %>%
gather(-treatment, -subject, -seizure.rate, key = "variable", value = "xvalue") %>%
mutate(log_sr = log(seizure.rate + 0.5)) %>%
gather(seizure.rate, log_sr, key = "outcome", value = "yvalue") %>%
ggplot(aes(x = xvalue, y = yvalue)) +
geom_point() + geom_smooth() +
facet_grid(outcome~variable, scales = "free") +
theme_minimal()
fit_lm_log <- lm(log(seizure.rate + 0.5) ~ age + log(base + 0.5) + treatment, data = epi)
plot(fit_lm_log)
Linear relationship between residuals of a model without the variable
Create a function for partial plots
partial_residuals.lm <- function(fit, term, resid_type = "response") {
formula0 = formula(fit)
all_vars = all.vars(formula0)
response = all_vars[1]
all_terms = all_vars[-1]
new_terms = setdiff(all_terms, term)
fit_resp <- lm(reformulate(new_terms, response), data = fit$model)
fit_term <- lm(reformulate(new_terms, term), data = fit$model)
return(
data.frame(resid_response = resid(fit_resp, type = resid_type),
resid_term = resid(fit_term, type = resid_type))
)
}
partial_plots.lm <- function(fit, terms = NULL) {
formula0 = formula(fit)
all_vars = all.vars(formula0)
response = all_vars[1]
all_terms = all_vars[-1]
terms = if (!is.null(terms)) {terms} else {all_terms}
resid_data = pmap_df(list(terms), function(term) {
data.frame(term = term,
partial_residuals.lm(fit, term), stringsAsFactors = F)
})
p = ggplot(resid_data, aes(x = resid_term, y = resid_response)) +
geom_point() + geom_smooth(method = "lm", alpha = 0.15) +
facet_wrap(~term, scales = "free_x") +
theme_minimal() +
labs(x = "Residual of term ~ .",
y = paste0("Residual of ", response, " ~ ."))
print(p)
return(resid_data)
}
partial_plots.lm(lm(mpg ~ wt+disp + hp + qsec, data = mtcars))
term resid_response resid_term
1 wt -2.91033390 0.28777098
2 wt -2.69441993 0.42981158
3 wt -2.35064824 -0.03572619
4 wt 1.42854760 -0.38033047
5 wt 2.95098614 -0.58149474
6 wt -2.68338334 -0.05241944
7 wt 0.52094375 -0.45609741
8 wt -0.18688165 0.39991943
9 wt 0.30645642 -0.29957405
10 wt -3.33419973 0.64022140
11 wt -4.50286334 0.51919346
12 wt -1.42633806 0.52330013
13 wt -0.44922593 0.14295749
14 wt -2.39500167 0.11227220
15 wt -0.75115512 0.17431181
16 wt -0.80804953 0.42698606
17 wt 3.28798221 0.51178167
18 wt 5.81148158 -0.03769260
19 wt 2.87480969 -0.36227500
20 wt 7.22627053 -0.43462476
21 wt -2.62782794 -0.26872519
22 wt -2.36856655 -0.10087364
23 wt -2.90133415 -0.17803946
24 wt -0.92953210 -0.03181011
25 wt 4.60129205 -0.45274861
26 wt 0.50025255 -0.18974258
27 wt 0.39380977 0.09339092
28 wt 4.91563374 -0.47997751
29 wt 1.90626853 -0.59086130
30 wt -2.75578861 0.50744764
31 wt 2.18096163 -0.11936986
32 wt -2.83014640 0.28301816
33 disp -1.63835092 -20.39759345
34 disp -0.81292451 -37.33786821
35 disp -2.54726114 -11.98100187
36 disp -0.16715100 58.98939900
37 disp 0.50824357 89.04711565
38 disp -2.88680732 14.32012656
39 disp -1.47477003 39.93919934
40 disp 1.58234316 -27.77228688
41 disp -1.08529147 -4.11574242
42 disp -0.57238169 -70.89658841
43 disp -2.27888190 -63.39743693
44 disp 0.84928036 -51.13053654
45 disp 0.16512257 -16.71150896
46 disp -1.92127105 -16.40609088
47 disp 0.14193430 33.62730900
48 disp 1.16032112 0.12733836
49 disp 5.58764369 -22.20308092
50 disp 5.60916488 -10.72102401
51 disp 1.29504873 33.75633803
52 disp 5.28072314 21.63618845
53 disp -3.85911364 2.73836300
54 disp -2.71198439 45.57445282
55 disp -3.60214065 44.92867232
56 disp -1.07823629 -0.78288023
57 disp 2.75823142 91.40054170
58 disp -0.35474117 7.33306814
59 disp 0.80820317 -6.02183820
60 disp 2.76516056 23.17779768
61 disp -0.69514860 45.72968550
62 disp -0.63568334 -82.05091718
63 disp 1.46266820 -63.04423071
64 disp -1.65194975 -47.35496976
65 hp -0.81958713 -40.94967968
66 hp -0.05473623 -35.28459667
67 hp -2.52885651 0.72547575
68 hp -0.17299425 -8.11348301
69 hp 0.51528449 -13.09741107
70 hp -2.96657085 2.22754955
71 hp -2.13679614 29.76132907
72 hp 2.16798709 -27.40734429
73 hp -2.32142667 66.81110082
74 hp -0.19543375 -10.06670053
75 hp -2.15142329 2.22789202
76 hp 1.02406705 -2.05990702
77 hp 0.02703790 9.78478387
78 hp -2.19190131 16.84198870
79 hp 0.45230719 -21.43112877
80 hp 1.47502360 -16.87767078
81 hp 5.74540700 -5.28014878
82 hp 5.66634863 -1.53201356
83 hp 1.60115203 -21.22082692
84 hp 4.92935790 15.73291825
85 hp -4.39462848 28.29783888
86 hp -2.14833698 -36.70637085
87 hp -3.27651153 -23.86282529
88 hp -1.38032531 16.29557557
89 hp 3.03153680 -27.69811510
90 hp -0.23954151 -7.21907835
91 hp 1.53642291 -38.15253365
92 hp 2.59129518 6.00354461
93 hp -1.41000112 31.76422576
94 hp -0.47676085 3.20689902
95 hp -0.29529697 103.18479418
96 hp -1.60209889 4.09391828
97 qsec -2.57002990 -1.81209054
98 qsec -1.60080280 -1.63083769
99 qsec -2.48868291 0.04894094
100 qsec 0.18332689 0.93312402
101 qsec 0.45927800 0.34635461
102 qsec -2.37215897 1.01593588
103 qsec -1.26564767 0.58000827
104 qsec 1.48850107 -0.30853956
105 qsec 0.75910327 3.36926520
106 qsec -0.84114319 -0.84130055
107 qsec -2.24114319 -0.24130055
108 qsec 0.63072566 -0.65218051
109 qsec 0.23842286 0.05281569
110 qsec -1.67153261 0.37855154
111 qsec 0.07853150 0.04826170
112 qsec 1.04020786 -0.22010742
113 qsec 5.48854558 -0.29090893
114 qsec 5.78652898 0.27340698
115 qsec 1.12400525 -0.14891664
116 qsec 5.86092610 1.17225448
117 qsec -3.10158976 1.40551509
118 qsec -3.25491891 -0.77442872
119 qsec -3.89111273 -0.31088794
120 qsec -1.24877730 -0.31723832
121 qsec 2.53611905 0.03969579
122 qsec -0.32042592 0.09899357
123 qsec -0.02363110 -1.55816416
124 qsec 2.65504199 -0.08879107
125 qsec -0.70246252 0.21063854
126 qsec -1.28877564 -1.60224000
127 qsec 2.18315837 1.01511828
128 qsec -1.62958731 -0.19094800
A case-control study was performed to determine whether induced (and spontaneous) abortions could increase the risk of secondary infertility. Obstetric and gynaecologic histories were obtained from 100 women with secondary infertility admitted to a department of obstetrics and gynecology its division of fertility and sterility. For every patient, an attempt was made to find two healthy control subjects from the same hospital with matching for age, parity, and level of education. Two control subjects each were found for 83 of the index patients. The data can be found in infertility.csv. The numbers of (spontaneous or induced) abortions are coded as 0=0, 1=1, 2=2 or more.
What type of study is this? What type of analysis do you prefer for this study design?
Matched case control, can be analyzed with conditional logistic regression.
Perform the analysis and interpret the results. Do previous (spontaneous or induced) abortions affect the risk of secondary infertility?
Read in data
infert <- read.csv(here("data", "infertility.csv"), sep = ";")
str(infert)
'data.frame': 248 obs. of 8 variables:
$ education : Factor w/ 3 levels "0-5yrs","12+ yrs",..: 1 1 1 1 3 3 3 3 3 3 ...
$ age : int 26 42 39 34 35 36 23 32 21 28 ...
$ parity : int 6 1 6 4 3 4 1 2 1 2 ...
$ induced : int 1 1 2 2 1 2 0 0 0 0 ...
$ case : int 1 1 1 1 1 1 1 1 1 1 ...
$ spontaneous : int 2 0 0 0 1 1 0 0 1 0 ...
$ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
$ pooled.stratum: int 3 1 4 2 32 36 6 22 5 19 ...
summary(infert)
education age parity induced
0-5yrs : 12 Min. :21.00 Min. :1.000 Min. :0.0000
12+ yrs:116 1st Qu.:28.00 1st Qu.:1.000 1st Qu.:0.0000
6-11yrs:120 Median :31.00 Median :2.000 Median :0.0000
Mean :31.50 Mean :2.093 Mean :0.5726
3rd Qu.:35.25 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :44.00 Max. :6.000 Max. :2.0000
case spontaneous stratum pooled.stratum
Min. :0.0000 Min. :0.0000 Min. : 1.00 Min. : 1.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:21.00 1st Qu.:19.00
Median :0.0000 Median :0.0000 Median :42.00 Median :36.00
Mean :0.3347 Mean :0.5766 Mean :41.87 Mean :33.58
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:62.25 3rd Qu.:48.25
Max. :1.0000 Max. :2.0000 Max. :83.00 Max. :63.00
Check stratum variable, which tells us the matced pairs
table(infert$stratum)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3
76 77 78 79 80 81 82 83
3 3 3 3 3 3 3 3
Perform analysis
require(survival)
fit1 <- clogit(case ~ induced + spontaneous + strata(stratum),
data = infert, method = "exact")
summary(fit1)
Call:
coxph(formula = Surv(rep(1, 248L), case) ~ induced + spontaneous +
strata(stratum), data = infert, method = "exact")
n= 248, number of events= 83
coef exp(coef) se(coef) z Pr(>|z|)
induced 1.4090 4.0919 0.3607 3.906 9.38e-05 ***
spontaneous 1.9859 7.2854 0.3524 5.635 1.75e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
induced 4.092 0.2444 2.018 8.298
spontaneous 7.285 0.1373 3.651 14.536
Rsquare= 0.193 (max possible= 0.519 )
Likelihood ratio test= 53.15 on 2 df, p=2.869e-12
Wald test = 31.84 on 2 df, p=1.221e-07
Score (logrank) test = 48.44 on 2 df, p=3.032e-11
Both previous spontanious and previous induced abortions increase the risk of infertility.
The odds ratio of induced abortions is 4.09, the odds ratio of spontaneous abortions is 7.29
How do we control for the potential confounding effects of age, parity and education?
They are accounted for by the matching, so no need to add them as covariates to the model
Madsen (1976) investigated satisfaction with housing conditions in Copenhagen. Residents in selected areas living in rented homes built between 1960 and 1968 were questioned about their satisfaction and their degree of contact with other residents. The data can be found in the file housinglong.csv. The variable satisfaction is coded 1=low, 2=medium, 3=high; contact is coded 1=low and 2=high. Save your script/SPSS code, we will use this data again in Exercise 4.
Summarize the data using appropriate tables and percentages to show the associations between levels of satisfaction and contact with other residents, levels of satisfaction and type of housing, and contact and type of housing.
housinglong <- read.csv(here("data", "housinglong.csv"), sep = ";")
str(housinglong)
'data.frame': 1681 obs. of 3 variables:
$ type : Factor w/ 3 levels "apartment","house",..: 3 3 3 3 3 3 3 3 3 3 ...
$ satisfaction: int 1 1 1 1 1 1 1 1 1 1 ...
$ contact : int 1 1 1 1 1 1 1 1 1 1 ...
There are three categorical variables, when can get all pairwise contingengy tables with 3 tables
require(gmodels)
CrossTable(housinglong$type, housinglong$satisfaction,
prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 1681
| housinglong$satisfaction
housinglong$type | 1 | 2 | 3 | Row Total |
-----------------|-----------|-----------|-----------|-----------|
apartment | 271 | 192 | 302 | 765 |
| 0.354 | 0.251 | 0.395 | 0.455 |
-----------------|-----------|-----------|-----------|-----------|
house | 197 | 153 | 166 | 516 |
| 0.382 | 0.297 | 0.322 | 0.307 |
-----------------|-----------|-----------|-----------|-----------|
tower block | 99 | 101 | 200 | 400 |
| 0.247 | 0.253 | 0.500 | 0.238 |
-----------------|-----------|-----------|-----------|-----------|
Column Total | 567 | 446 | 668 | 1681 |
-----------------|-----------|-----------|-----------|-----------|
Seems like people from tower block are most often in the highest satisfaction categorty
CrossTable(housinglong$type, housinglong$contact,
prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 1681
| housinglong$contact
housinglong$type | 1 | 2 | Row Total |
-----------------|-----------|-----------|-----------|
apartment | 317 | 448 | 765 |
| 0.414 | 0.586 | 0.455 |
-----------------|-----------|-----------|-----------|
house | 177 | 339 | 516 |
| 0.343 | 0.657 | 0.307 |
-----------------|-----------|-----------|-----------|
tower block | 219 | 181 | 400 |
| 0.547 | 0.453 | 0.238 |
-----------------|-----------|-----------|-----------|
Column Total | 713 | 968 | 1681 |
-----------------|-----------|-----------|-----------|
People living in ’house’s have most contact with neighbours
CrossTable(housinglong$contact, housinglong$satisfaction,
prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 1681
| housinglong$satisfaction
housinglong$contact | 1 | 2 | 3 | Row Total |
--------------------|-----------|-----------|-----------|-----------|
1 | 262 | 178 | 273 | 713 |
| 0.367 | 0.250 | 0.383 | 0.424 |
--------------------|-----------|-----------|-----------|-----------|
2 | 305 | 268 | 395 | 968 |
| 0.315 | 0.277 | 0.408 | 0.576 |
--------------------|-----------|-----------|-----------|-----------|
Column Total | 567 | 446 | 668 | 1681 |
--------------------|-----------|-----------|-----------|-----------|
People with more contact are more often in the highest category of satisfaction.
Use nominal (multinomial) logistic regression to model associations between level of satisfaction and the other two variables. Use the likelihood ratio test (LRT) to delete non-significant variables in order to obtain a parsimonious model that summarizes the patterns of the data.
Drop1 trew an error (“Error in if (trace) { : argument is not interpretable as logical”), so we will make the models ourselves and perform LRtests
require(nnet)
fit0 <- multinom(satisfaction ~ 1, data = housinglong)
# weights: 6 (2 variable)
initial value 1846.767257
final value 1824.438811
converged
fit1 <- multinom(satisfaction ~ type, data = housinglong)
# weights: 12 (6 variable)
initial value 1846.767257
iter 10 value 1807.174032
iter 10 value 1807.174031
iter 10 value 1807.174031
final value 1807.174031
converged
fit2 <- multinom(satisfaction ~ contact, data = housinglong)
# weights: 9 (4 variable)
initial value 1846.767257
final value 1821.875901
converged
fit3 <- multinom(satisfaction ~ type + contact, data = housinglong)
# weights: 15 (8 variable)
initial value 1846.767257
iter 10 value 1802.924087
final value 1802.740161
converged
anova(fit3, fit2, test = "Chisq")
Likelihood ratio tests of Multinomial Models
Response: satisfaction
Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 contact 3358 3643.752
2 type + contact 3354 3605.480 1 vs 2 4 38.27148 9.849653e-08
anova(fit3, fit1, test = "Chisq")
Likelihood ratio tests of Multinomial Models
Response: satisfaction
Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 type 3356 3614.348
2 type + contact 3354 3605.480 1 vs 2 2 8.867742 0.01186846
Dropping either contact or type will lead to a significant decrease of the likelihood, although the effect of dropping type is greater.
Interpret the coefficients from the model in (b).
summary(fit3)
Call:
multinom(formula = satisfaction ~ type + contact, data = housinglong)
Coefficients:
(Intercept) typehouse typetower block contact
2 -0.8099988 0.06967834 0.4067607 0.2959811
3 -0.4090469 -0.30402045 0.6415939 0.3282252
Std. Errors:
(Intercept) typehouse typetower block contact
2 0.2260285 0.1437749 0.1713009 0.1301046
3 0.2041219 0.1351693 0.1500773 0.1181870
Residual Deviance: 3605.48
AIC: 3621.48
exp(coef(fit3))
(Intercept) typehouse typetower block contact
2 0.4448586 1.0721633 1.501945 1.344445
3 0.6642831 0.7378458 1.899506 1.388502
The odds of being in satisfactory class 2 is 1.07 times higher than the odds of being in class 1, for people in type house, compared to type appartment, given the same level of contact.
The odds of being in satisfactory class 3 is 0.73 times the odds of being in class 1, for people in type house, compared to type appartment, given the same level of contact.
Possible extensions:
In SPSS is it not possible to get residuals for a multinomial or ordinal logistic regression. In R we can make our own “deviance residuals” as follows (assuming the dataset is called house, and the multinomial model from which we wish to estimate deviance residuals is called house.mlr.2): likl <- numeric(1681) for (i in 1:1681) likl[i] <- fitted(house.mlr.2)[i,house$satisfaction[i]] Plot these deviance residuals against case number: plot(1:1681, -2*log(likl))
likl <- numeric(1681)
for (i in 1:1681) likl[i] <- fitted(fit3)[i,housinglong$satisfaction[i]]
plot(1:1681, -2*log(likl))
We will start the afternoon session with a theoretical (non-computer) question. Discuss the following with a few of your neighbors. We will then discuss the question in the group before proceeding to the computer lab questions.
A researcher wishes to gain insight into the independent management and use of polypharmacy (≥ 5 medications) by elderly home healthcare clients in relation to their cognitive and self-management skills. Three measurement tests were assessed: the Clock-Drawing test (CDT), the Self-Management Ability Scale (SMAS-30) and the independent Medication Management Score (iMMS). The iMMS instrument consists of 17 “yes/no” questions regarding independent medication management, where a “no” indicates lack of management ability in a particular area of medication management. The iMMS equals the number of questions that were answered with “no”. The Clock-Drawing test (“Can you draw a clock and put the hands on 10 past 11?”) purports to measure the cognitive abilities of the individual, and is scored on a scale from 1 to 5 (5 being best). The Self-Management Ability Scale (SMAS-30) consists of 30 questions on general self-management issues, and is scaled from 0 to 100.
The researcher wishes to predict the iMMS from the CDT and SMAS-30, as well as age and sex of the individual with a generalized linear model, using one of three probability distributions: the Gaussian, binomial or Poisson distribution.
Give at least one advantage and one disadvantage for each of these three approaches.
iMMS is a bounded discrete variable, ranging from 0 to 17;
Gaussian: since the range of 0-17 with steps of 1 is not very small, it may be approximated as a continous variable. Using the Guassian distribution with GLM gives the regular linear regression, which has the easiest interpretation. A downside is that the response is actually discrete and bounded
binomial: this seems like a logical choice of distribution for this problem. iMMS can be seen as the result of 17 bernoulli trials (yes/no), however, it may be that the trials are not independent of each other (so scoring ‘no’ on a certain question will increase the probability of scoring ‘no’ on another question), and that the probability of ‘success’ on each question is not the same. This violates two of the assumptions of the binomial distribution. Also, the resulting coefficients are not always easy interpretable depending on the chosen link function.
Poisson: the Poisson distribution is suitable for discrete (count) variables, bounded by 0 as is the case here. A downside is that there exists an upper bound here.
Which specific graphs from the initial data analysis step and/or the model checking step would you need to help you choose among the three approaches?
Plotting the marginal distribution of the outcome may be benificial. When the mean of the distribution is in the center of the range and the distribution is approximately bell-shaped, the Gaussian approximation may be ok. However, this ultimately depends on the error distribution of the models.
Plotting marginal distributions will not be very helpful, as each distribution has assumptions on the error distribution and not the marginal distribution, so probably model checking is best.
– Calculate deviance, check which model has lowest deviance – Look for influential observations based on the Cook’s distance
– Create a fit with both explanatory variables, plot – QQ-plot of residuals – y vs pred(y) to assess homoscedasticity – make partial plots to assess linearity of the relationship
Binomial
Poisson
– calculate dispersion parameter
We will revisit the Madsen dataset on satisfaction with housing conditions (housinglong.csv). Recall: the variable satisfaction is coded 1=low, 2=medium, 3=high; contact is coded 1=low and 2=high.
In question 2 we analyzed the data using a multinomial logistic regression. Why was this not the most appropriate model for examining associations between levels of satisfaction and the other variables? Fit a more suitable model and compare the results with those from (b). See if you can reduce this model, again using the LRT.
Multinomial regression assumes nominal outcome, so no order in the levels of the outcome. In this case, we have that class 1 < class 2 < class 3, so we have order. This information was disregarded by the multinomial model.
Better may be proportional odds model
housinglong <- read.csv(here("data", "housinglong.csv"), sep = ";")
require(MASS)
# fit_po1 <- polr(factor(satisfaction) ~ contact, data = housinglong)
# fit_po2 <- polr(factor(satisfaction) ~ type, data = housinglong)
fit_po3 <- polr(factor(satisfaction) ~ type + contact, data = housinglong)
drop1(fit_po3, test = "Chisq")
Single term deletions
Model:
factor(satisfaction) ~ type + contact
Df AIC LRT Pr(>Chi)
<none> 3620.3
type 2 3651.6 35.322 2.138e-08 ***
contact 1 3625.7 7.376 0.006611 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We cannot drop any of the explanatory variables.
Compare with the multinomial regression models
fit_mn3 <- multinom(satisfaction ~ type + contact, data = housinglong)
# weights: 15 (8 variable)
initial value 1846.767257
iter 10 value 1802.924087
final value 1802.740161
converged
fits <- list(multinomial = fit_mn3, proportional_odds = fit_po3)
fits %>%
map_df(function(fit) data.frame(deviance = fit$deviance,
df = fit$edf,
AIC = AIC(fit)), .id = "model")
model deviance df AIC
1 multinomial 3605.480 8 3621.480
2 proportional_odds 3610.286 5 3620.286
Comment on the AIC and residual deviance of the models from Exercises 2b and 4a.
The deviance of the multinomial model is lower. However, this model has more estimated parameters, as it estimates coefficients for the explanatory variables for 2 of the 3 levels of the response variable.
The proportional odds model assumes the same coefficients for differences between class 1 and 2 and class 2 and 3, and therefore has more residual degrees of freedom.
This results in a lower AIC, and arguably a better fit. However, the difference is small.
R users: as in Exercise 2, try to save and plot the deviance residuals from the model in (a).
nobs = fit_po3$nobs
likl_po <- numeric(nobs)
for (i in 1:nobs) likl_po[i] <- fit_po3$fitted.values[i, housinglong$satisfaction[i]]
plot(1:nobs, -2*log(likl_po))
Try to compare models
bind_rows(list(multinomial = data.frame(likl = likl),
proportional_odds = data.frame(likl = likl_po)), .id = "model") %>%
mutate(loglik = -2*log(likl)) %>%
group_by(model) %>%
mutate(index = 1:n(),
satisfaction = housinglong$satisfaction) %>%
ggplot(aes(x = index, y = loglik, col = model)) +
geom_point() + facet_wrap(~satisfaction, labeller = "label_both")
In both SPSS and R, it is at least possible to get fitted probabilities. From the best model you obtained in (a), get the fitted probabilities per combination of type and contact. Use these to calculate the fitted frequencies (counts). Hint: in R, use the predict() function (and the option type=“probs”) on a new data frame containing all combinations of type and contact; in SPSS, try aggregating the predicted values over the categories of type & contact). Use the predicted probabilities to get predicted counts and find where the largest discrepancies are between observed frequencies and expected frequencies estimated from the model.
Make a grid of the possible covariate combinations and predict probabilities.
cov_grid <- expand.grid(
contact = unique(housinglong$contact),
type = sort(unique(housinglong$type))
)
grid_pred <- predict(fit_po3, newdata = cov_grid, type = "probs")
row.names(grid_pred) <- paste0(cov_grid$type, "_contact", cov_grid$contact)
grid_pred
1 2 3
apartment_contact1 0.3783899 0.2709460 0.3506642
apartment_contact2 0.3210770 0.2688545 0.4100685
house_contact1 0.4350948 0.2657691 0.2991361
house_contact2 0.3743657 0.2710563 0.3545779
tower block_contact1 0.2694631 0.2592953 0.4712416
tower block_contact2 0.2227370 0.2429973 0.5342657
Get the observed probabilities
obs_probs <- prop.table(ftable(table(
housinglong$type,
housinglong$contact,
housinglong$satisfaction),
col.vars = 3), margin = 1)
obs_probs
1 2 3
apartment 1 0.4100946 0.2397476 0.3501577
2 0.3147321 0.2589286 0.4263393
house 1 0.3785311 0.2711864 0.3502825
2 0.3834808 0.3097345 0.3067847
tower block 1 0.2968037 0.2465753 0.4566210
2 0.1878453 0.2596685 0.5524862
Subtract to view differences
diff_grid <- obs_probs - grid_pred
diff_grid
1 2 3
apartment 1 0.0317047855 -0.0311983320 -0.0005064535
2 -0.0063448212 -0.0099259609 0.0162707821
house 1 -0.0565637406 0.0054173313 0.0511464094
2 0.0091150916 0.0386781672 -0.0477932588
tower block 1 0.0273405734 -0.0127199828 -0.0146205906
2 -0.0348916749 0.0166712129 0.0182204620
Visualize (take absolute difference)
image(t(abs(diff_grid)), xaxt = "n", yaxt = "n")
axis(1, at = seq(0, 1, length.out = 3), labels = 1:3, las = 1)
axis(2, at = seq(0, 1, length.out = 6), labels = rownames(grid_pred), las = 1)
For the more advanced R user: included in the R solutions for today is code to check the proportional odds assumption for the model in part (d). To do this, we need to calculate the ln(odds) of satisfaction=1 vs 2&3 and of of satisfaction=1&2 vs 3 for each level of housing type and contact and graph these.
The advanced SPSS user can look at this link for code on checking the proportional odds assumption in SPSS http://www.ats.ucla.edu/stat/spss/dae/ologit.htm .
A retrospective case-control study of 200 male cases of esophageal cancer and 778 population controls was carried out in Ille-et-Vilaine (France). Interest is in the relation between tobacco consumption (tobhigh: 1 = 20+ g/day, 0 = less than 20 g/day) and esophageal cancer (case: 1 = case, 0 = control), while considering the possible confounding or effect- modifying effects of alcohol consumption (alchigh: 1 = 80+ g /day, 0 = < 80 g/day). The data can be found in bd1.sav or bd1.csv (separator = “,”)
What type of study is this? What type of analysis would you prefer for this study design?
Case-control
Logistic regression (the outcome is binary)
Use this data to answer the research question, paying attention to the role (confounding?/effect modification?) of alcohol use.
Do some model checking.
Deviance,
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tools splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] MASS_7.3-48 nnet_7.3-12 survival_2.41-3
[4] HSAUR_1.3-9 gmodels_2.16.2 tidyr_0.8.0
[7] bindrcpp_0.2 broom_0.4.3 epistats_0.1.0
[10] ggplot2_2.2.1 here_0.1 purrr_0.2.4
[13] magrittr_1.5 data.table_1.10.4-3 dplyr_0.7.4
loaded via a namespace (and not attached):
[1] gtools_3.5.0 tidyselect_0.2.3 reshape2_1.4.3
[4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
[7] yaml_2.1.16 rlang_0.1.6 pillar_1.1.0
[10] foreign_0.8-69 glue_1.2.0 RColorBrewer_1.1-2
[13] binom_1.1-1 bindr_0.1 plyr_1.8.4
[16] stringr_1.2.0 munsell_0.4.3 gtable_0.2.0
[19] psych_1.7.8 evaluate_0.10.1 labeling_0.3
[22] knitr_1.19 GGally_1.3.2 parallel_3.4.3
[25] Rcpp_0.12.15 scales_0.5.0 backports_1.1.2
[28] gdata_2.18.0 mnormt_1.5-5 digest_0.6.15
[31] stringi_1.1.6 grid_3.4.3 rprojroot_1.3-2
[34] lazyeval_0.2.1 tibble_1.4.2 pkgconfig_2.0.1
[37] Matrix_1.2-12 assertthat_0.2.0 rmarkdown_1.8
[40] reshape_0.8.7 R6_2.2.2 nlme_3.1-131
[43] git2r_0.21.0 compiler_3.4.3
This R Markdown site was created with workflowr