-
Notifications
You must be signed in to change notification settings - Fork 9
/
EX_8_Discrete_frailty.Rmd
225 lines (162 loc) · 11.5 KB
/
EX_8_Discrete_frailty.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "Example 8 - Discrete Time Model with frailty"
author: "coreysparks"
date: "March 25, 2015"
output: html_document
---
This example will illustrate how to fit the discrete time hazard model with group-level frailty to continuous duration data (i.e. person-level data) and a discrete-time (longitudinal) data set. In this example, I will use the event of a child dying before age 5 in Haiti. The data for this example come from the Haitian [Demographic and Health Survey for 2012](http://dhsprogram.com/data/dataset/Haiti_Standard-DHS_2012.cfm?flag=0) birth recode file. This file contains information for all live births to women sampled in the survey.
The longitudinal data example uses data from the [ECLS-K ](http://nces.ed.gov/ecls/kinderdatainformation.asp). Specifically, we will examine the transition into poverty between kindergarten and 8th grade.
```{r load, message=FALSE}
#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
library(lme4)
library(arm)
#load the data
haiti<-read.dta("/Users/ozd504/Google Drive/dem7223/data/HTBR61FL.DTA", convert.factors = F)
```
```{r extract_data}
#We form a subset of variables
sub<-data.frame(CASEID=haiti$caseid,kidid=paste(haiti$caseid, haiti$bidx, sep="-"),v008=haiti$v008,bord=haiti$bidx,csex=haiti$b4,b2=haiti$b2, b3=haiti$b3, b5=haiti$b5, b7=haiti$b7, ibint=haiti$b11, rural=haiti$v025, educ=haiti$v106,age=haiti$v012,partneredu=haiti$v701,partnerage=haiti$v730, hhses=haiti$v190, weight=haiti$v005/1000000, psu=haiti$v021, strata=haiti$v022, region=haiti$v023)
sub$death.age<-ifelse(sub$b5==1,
((((sub$v008))+1900)-(((sub$b3))+1900))
,sub$b7)
#censoring indicator for death by age 5, in months (<=60 months)
sub$d.event<-ifelse(is.na(sub$b7)==T|sub$b7>60,0,1)
sub$d.eventfac<-factor(sub$d.event); levels(sub$d.eventfac)<-c("Alive at Age 5", "Dead by Age 5")
table(sub$d.eventfac)
#recodes
sub$male<-ifelse(sub$csex==1,1,0)
sub$educ.high<-ifelse(sub$educ %in% c(2,3), 1, 0)
sub$age2<-sub$age^2
sub$partnerhiedu<-ifelse(sub$partneredu<3,0,ifelse(sub$partneredu%in%c(8,9),NA,1 ))
sub$hises<-ifelse(sub$hhses>3, 1,0)
```
###Create the person-period file
The distinction between the way we have been doing things and the discrete time model, is that we treat time discretely, versus continuously. This means that we transform the data from the case-duration data format to the person-period format. For this example, a natural choice would be year, since we have 5 intervals of equal length (12 months each). R provides a useful function called `survSplit()` in the `survival` library that will split a continuous duration into discrete periods.
```{r}
#make person period file
pp<-survSplit(sub, cut=seq(0,60,12), start="start", end="death.age", event="d.event", episode="year")
pp<-pp[order(pp$kidid, pp$year),]
head(pp[, c("kidid", "death.age", "d.event", "start", "year", "male", "hises")], n=20)
library(maptools)
htshp<-readShapePoly("~/Google Drive/dem7223/data/sdr_subnational_data_2015-03-18/shps/sdr_subnational_data_dhs_2012.shp")
plot(htshp)
text(getSpPPolygonsLabptSlots(htshp), labels=as.character(htshp$DHSREGFR), cex=0.6)
#Since the region variable in the shapefile doesn't include two of the areas, I need to make them an ID number:
htshp$reg_merge<-ifelse(htshp$DHSREGFR=="Aire Métropolitaine", 1, ifelse(htshp$DHSREGFR=="Reste-Ouest",2, htshp$REGCODE+1))
htshp$reg_merge
htshp$z_net<-scale(htshp$iC6D4E13)
htshp$z_edu<-scale(htshp$iA999FA)
pp2<-merge(pp, htshp@data, by.x="region", by.y="reg_merge")
```
We see that each child is not in the data for multiple "risk periods", until they experience the event (death) or age out of the risk set (year 6 in this case).
###Discrete time model with shared frailty
For the discrete time model, if the logit link is used, then we are effectively fitting a multilevel model for our outcome. The model with only a group frailty component will have the exact same form as the multilevel logit model with a random intecept at the group level:
Fit the basic random intercept model :
$logit\left ( \frac{\pi_{ij}}{1-\pi_{ij}} \right ) = \beta_{0j} +x'\beta +Z\gamma'$
with
$\beta_{0j} = \beta_0 + Z\gamma'+ u_j$
and
$u_j\sim N(0, \sigma^2)$
Where the intercepts ($u_j$) for each group vary randomly around the overall mean ($\beta_0$).
The individual level predictors are incorporated into $x$, while the group level predictors (if any are measured) are included in $Z$. If only a random intercept is specified, then $Z$ is a vector of 1's.
#Shared frailty at the regional level
Here, we use the DHS to fit a few different multilevel models. Our outcome is child mortality before age 5. We consider the general random intecept model, a model with a higher level predictor, and a cross-level interaction model, where an individual level predictor interacts with a higher level predictor. Our higher level data come from the Haitian shapefile we got from the [DHS](http://spatialdata.dhsprogram.com/), which contains estimates of numerous population characteristics at the same regional level as the `v023` variable provides in the DHS recode files.
```{r, fig.height=6, fig.width=9}
#how many kids in each region?
table(sub$region)
#how many total regions?
length(table(sub$region))
#generate survey design
des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=pp)
#now we fit the regular discrete time model without frailty, purely for comparison
fit.1<-svyglm(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural,design=des , family="binomial")
display(fit.1, detail=T)
#basic random intercept model
fit.region<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+(1|region),pp2, family="binomial", weights = weight)
fit.region<-refit(fit.region)
display(fit.region, detail=T)
#plot the region-level predictor, z-score of % of Women with secondary or higher education
spplot(htshp, c("z_net", "z_edu"))
#model with a higher level predictor - z-score of % of Women with secondary or higher education
fit.region2<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural+z_edu+(1|region),pp2, family="binomial", weights = weight)
display(fit.region2, detail=T)
#relgrad <- with(fit.region@optinfo$derivs,solve(Hessian,gradient))
#max(abs(relgrad))
anova(fit.region, fit.region2)
#cross level interaction model, interaction between mom's education and regional women's education
fit.region3<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural+z_edu*I(educ>=2)+(1|region),pp2, family="binomial")
display(fit.region3, detail=T)
anova(fit.region2, fit.region3)
```
The only model that shows anything is the random intercept model, the model with the higher level predictor shows a marginally significant (p=.08) effect of education level of women at the regional level on under 5 mortality, and the cross level interaction model shows nothing in terms of the interaction between maternal education at the individual level and region-level women's education.
#Using Longitudinal Data
As in the other examples, I illustrate fitting these models to data that are longitudinal, instead of person-duration. In this example, we will examine how to fit the discrete time hazard model with frailty to a longitudinally collected data set.
First we load our data and do some recodes and variable construction
```{r load_longdata}
load("~/Google Drive/dem7903_App_Hier/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id", "c1_5fp0", "c15fpstr", "c15fppsu")
eclsk<-eclsk[,myvars]
eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)
#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA")
```
Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. **NOTE** I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing **this particular** transition.
```{r createevents}
eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1&is.na(eclsk$c15fpstr)==F)
eclsk$povtran1<-ifelse(eclsk$pov1==0&eclsk$pov2==0, 0,1)
eclsk$povtran2<-ifelse(eclsk$povtran1==1, NA,ifelse(eclsk$pov2==0&eclsk$pov3==0,0,1))
```
We can rescale the survey weights for use in multilevel models follwing the presentation in [Carle, 2009](http://www.biomedcentral.com/1471-2288/9/49).
```{r reshape}
#make an id that is the combination of strata and psu
eclsk$sampleid<-paste(eclsk$c15fpstr, eclsk$c15fppsu)
#within each sampling unit, sum the weights
wts<-tapply(eclsk$c1_5fp0,eclsk$sampleid,sum)
#make a data frame from this
wts<-data.frame(id=names(unlist(wts)), wt=unlist(wts))
#get the unique sampling location ids'
t1<-as.data.frame(table(eclsk$sampleid))
#put all of this into a data set
wts2<-data.frame(ids=wts$id, sumwt=wts$wt, jn=t1$Freq)
#merge all of this back to the original data file
eclsk2<-merge(eclsk, wts2, by.x="sampleid", by.y="ids", all.x=T)
#In the new data set, multiply the original weight by the fraction of the
#sampling unit total population each person represents
eclsk2$swts<-eclsk2$c1_5fp0*(eclsk2$jn/eclsk2$sumwt)
```
Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current "wide" format (repeated measures in columns) to a "long" format (repeated observations in rows). The `reshape()` function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.
```{r}
e.long<-reshape(eclsk2, idvar="childid", varying=list(age=c("age1","age2"), age2=c("age2", "age3"), povtran=c("povtran1", "povtran2")), times=1:2, direction="long" , drop = names(eclsk)[4:19])
e.long<-e.long[order(e.long$childid, e.long$time),]
#find which kids failed in the first time period and remove them from the second risk period risk set
failed1<-which(is.na(e.long$povtran1)==T)
e.long<-e.long[-failed1,]
e.long$age1r<-round(e.long$age1, 0)
e.long$age2r<-round(e.long$age2, 0)
head(e.long, n=10)
```
Now we fit the discrete time model and doing fraily by the school identifier. I use the weights calculated above, standardized to the within cluster sample size.
```{r fitmodel}
#Fit the model
fitl1<-glmer(povtran1~time+mlths+mgths+black+hisp+other+nahn+(1|s2_id),family=binomial,weights=swts, e.long )
display(fitl1, detail=T)
```