Parkinson’s disease is a neurological disorder characterized by tremors, slow, labored walking, and balance disruptions. There are no specific diagnostic criteria for Parkinson’s disease; a diagnosis is at the discretion of a physician based on symptomatology, health history, and usually lab tests to rule out other diseases (1). The goal of this project is to build a model that discriminates individuals with Parkinson’s disease from healthy controls based on a walking gait test on an instrumented walkway. Such a model could serve as an additional screening tool.
While this project is not hypothesis-driven, I do anticipate that gait speed and some measure of gait instability will be important in the model, given that slow, labored walking and balance disruptions are key symptoms.
The first thing I need to do is load all the data. The data come from a demographics xlsx file and a series of txt files that hold the force-plate data from the walking trials; a minimal loading sketch follows the file-format description below.
Each line of a force file contains 19 columns:
- Column 1: time (in seconds)
- Columns 2-9: vertical ground reaction force (VGRF, in Newtons) on each of the 8 sensors located under the left foot
- Columns 10-17: VGRF on each of the 8 sensors located under the right foot
- Column 18: total force under the left foot
- Column 19: total force under the right foot
The force files follow a common naming convention, e.g., GaCo01_02.txt or JuPt03_06.txt, where:
- Ga, Ju, or Si indicates the study from which the data originated: Ga = Yogev et al. (dual tasking in PD; Eur J Neurosci, 2005), Ju = Hausdorff et al. (RAS in PD; Eur J Neurosci, 2007), Si = Frenkel-Toledo et al. (treadmill walking in PD; Mov Disord, 2005)
- Co or Pt indicates a control subject or a PD patient
- 01 is the subject number within the group
- The two digits after the underscore give the walk number: 01 refers to a usual, normal walk, while 10 (for the “Ga” study) indicates dual-task walking, where the subject performed serial-7 subtraction while walking
- .txt is the file-name extension
The sampling rate was 100 Hz.
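Below is a minimal loading sketch; the demographics file name, the data folder, and the column selection are assumptions (the force files are reduced to time plus the two total-force columns, which is what the cadence code later uses):
library(readxl)
# Assumed locations; adjust to wherever the files actually live
demo <- read_excel("demographics.xlsx")
# Read each walking-trial file, keeping time (column 1) and the total force
# under the left and right foot (columns 18 and 19)
files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)
Data  <- lapply(files, function(f) read.table(f)[, c(1, 18, 19)])
names(Data) <- basename(files)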
## ID Study Group Subjnum Gender Age Height Weight HoehnYahr UPDRS UPDRSM
## 1 GaPt03 Ga 1 3 2 82 1.45 50 3.0 20 10
## 2 GaPt04 Ga 1 4 1 68 1.71 NaN 2.5 25 8
## 3 GaPt05 Ga 1 5 2 82 1.53 51 2.5 24 5
## 4 GaPt06 Ga 1 6 1 72 1.70 82 2.0 16 13
## 5 GaPt07 Ga 1 7 2 53 1.67 54 3.0 44 22
## 6 GaPt08 Ga 1 8 2 68 1.63 57 2.0 15 8
## TUAG Speed_01 Speed_02 Speed_03 Speed_04 Speed_05 Speed_06 Speed_07 Speed_10
## 1 36.34 NaN NaN NaN NaN NaN NaN NaN 0.778
## 2 11.00 0.642 NaN NaN NaN NaN NaN NaN 0.818
## 3 14.50 0.908 NaN NaN NaN NaN NaN NaN 0.614
## 4 10.47 0.848 NaN NaN NaN NaN NaN NaN 0.937
## 5 18.34 0.677 0.957 NaN NaN NaN NaN NaN 0.579
## 6 10.11 1.046 1.132 NaN NaN NaN NaN NaN 0.228
## ID Group Gender Age Height Weight Speed_01
## 11 GaPt15 1 1 81 NaN 80 0.948
## 15 GaPt19 1 2 76 NaN 55 1.124
## 17 GaPt21 1 1 80 NaN 63 1.166
## 2 GaPt04 1 1 68 1.71 NaN 0.642
## 19 GaPt23 1 2 71 1.60 NaN 0.360
## 153 SiCo17 2 2 64 1.65 NaN 1.202
## 1 GaPt03 1 2 82 1.45 50 NaN
I want to create variables for cadence, step asymmetry (essentially cadence computed for each leg), and some measure of variability. I will make empty vectors for each variable and, in a loop, calculate the variables for each subject. My main research interest is cadence, so there will be several cadence variables. I will use a findpeaks function to find the time points of peak force, measure how far apart the peak forces are (assuming they occur at the same point in the gait cycle), and use that spacing to calculate cadence step by step. I will do this for both the left and right legs. I will compute the mean and standard deviation of cadence, the coefficient of variation (the standard deviation divided by the mean), the mean cadence across limbs and its standard deviation, and a limb symmetry index.
CadenceL<-rep(NA,length(Data))
CadenceR<-rep(NA,length(Data))
CadenceLSD<-rep(NA,length(Data))
CadenceRSD<-rep(NA,length(Data))
CoefVL<-rep(NA,length(Data))
CoefVR<-rep(NA,length(Data))
CadenceTotal<-rep(NA,length(Data))
CadenceSD<-rep(NA,length(Data))
CadenceLSI<- rep(NA,length(Data))
# Data is a large list; each element holds one trial's time, total left force, and total right force
#findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)#dataframe with all L peaks
#2nd column is time
#Get Time Stamps
#findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]
# Order them
#sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2])
# How far apart are they?
#diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))
# What is that in seconds?
samplingrate=100 # sampling rate in Hz
fs=1/samplingrate # sampling period in seconds (0.01 s between samples)
#fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))#Time between steps
# Cadence
#60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2])))
#What is the average?
#mean(60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
# What is the Standard deviation?
#sd(60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
#Iterate
for (i in 1:length(Data)){
  # Per-step cadence (steps/min) from the peak-to-peak spacing in the left and right force signals
  CadenceL[i]<-mean(60/(fs*diff(sort(findpeaks(Data[[i]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceR[i]<-mean(60/(fs*diff(sort(findpeaks(Data[[i]][,3],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceLSD[i]<-sd(60/(fs*diff(sort(findpeaks(Data[[i]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceRSD[i]<-sd(60/(fs*diff(sort(findpeaks(Data[[i]][,3],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CoefVL[i]<-CadenceLSD[i]/CadenceL[i]
  CoefVR[i]<-CadenceRSD[i]/CadenceR[i]
  # Wrap the two limb values in c(): mean() averages a single vector, and a bare
  # second argument would be silently treated as the trim parameter
  CadenceTotal[i]<-mean(c(CadenceL[i],CadenceR[i]))
  CadenceSD[i]<-mean(c(CadenceLSD[i],CadenceRSD[i]))
  # Limb symmetry index (%): left-right difference relative to the sum
  CadenceLSI[i]<-100*((abs(CadenceL[i]-CadenceR[i]))/(abs(CadenceL[i]+CadenceR[i])))
}
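The same findpeaks-to-cadence pipeline is repeated four times inside the loop. As an aside, a small helper could remove that duplication; a sketch, assuming findpeaks() is pracma::findpeaks (which the argument names above suggest):
library(pracma)
# Hypothetical helper: per-step cadence (steps/min) from one total-force channel
cadence_steps <- function(force, period = fs,   # fs = sampling period defined above
                          minpeakdistance = 50, minpeakheight = 500) {
  peaks <- findpeaks(force, minpeakdistance = minpeakdistance,
                     minpeakheight = minpeakheight)
  idx <- sort(peaks[, 2])     # column 2 of the findpeaks output holds the sample index
  60 / (period * diff(idx))   # convert peak-to-peak spacing to steps per minute
}
# e.g. stepsL <- cadence_steps(Data[[i]][, 2]); CadenceL[i] <- mean(stepsL)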
Next I add everything into the master data frame, combining the variables computed above with the IDs and disease status. I will also turn some key variables into high/low dichotomous factors: variability (LSI), speed, and asymmetry. A rough sketch of that step follows.
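The code that assembles MasterFrame and creates the high/low factors is not shown here; as a rough illustration only, assuming a median split for the cut points (the actual thresholds may differ):
# Illustrative dichotomization sketch; the cut points are an assumption
MasterFrame$Asy2 <- factor(ifelse(MasterFrame$Asy > median(MasterFrame$Asy, na.rm = TRUE),
                                  "High Asymmetry", "Low Asymmetry"))
MasterFrame$BiSpeed <- factor(ifelse(MasterFrame$Speed > median(MasterFrame$Speed, na.rm = TRUE),
                                     "high speed", "low speed"))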
## ID Ill CadenceL CadenceLSD CadenceR CadenceRSD CoefVL
## 1 GaPt05_01.txt Parkinsons 56.12406 7.571144 57.02525 9.994528 0.13490014
## 2 GaPt06_01.txt Parkinsons 52.19618 4.491964 52.46820 6.127748 0.08605925
## 3 GaPt07_01.txt Parkinsons 44.14557 11.279012 50.31398 21.886947 0.25549590
## 4 GaPt08_01.txt Parkinsons 52.25546 3.715477 52.13493 2.928872 0.07110217
## 5 GaPt09_01.txt Parkinsons 61.29577 6.210452 61.59904 7.127122 0.10131942
## 6 GaPt12_01.txt Parkinsons 60.61333 5.248713 61.18153 7.497181 0.08659337
## 7 GaPt13_01.txt Parkinsons 63.61374 8.348241 64.17520 10.554711 0.13123330
## 8 GaPt14_01.txt Parkinsons 61.43195 9.416112 60.27928 7.782839 0.15327713
## 9 GaPt16_01.txt Parkinsons 56.11055 5.671999 57.37937 6.204913 0.10108615
## 10 GaPt17_01.txt Parkinsons 51.57313 3.680048 51.67745 4.786804 0.07135592
## 11 GaPt18_01.txt Parkinsons 52.50641 2.688679 52.52434 3.465680 0.05120668
## 12 GaPt20_01.txt Parkinsons 54.13482 5.731212 54.64315 5.849568 0.10586925
## 13 GaPt22_01.txt Parkinsons 48.34174 4.486529 48.88147 6.912966 0.09280860
## 14 GaPt24_01.txt Parkinsons 58.08486 5.777725 59.80871 9.854811 0.09947041
## 15 GaPt25_01.txt Parkinsons 62.01274 6.181150 61.48682 4.655699 0.09967549
## 16 GaPt26_01.txt Parkinsons 54.18645 8.004002 58.90728 15.989033 0.14771223
## 17 GaPt27_01.txt Parkinsons 54.82749 7.665312 55.90048 11.249408 0.13980783
## 18 GaPt28_01.txt Parkinsons 56.03653 10.926376 54.41714 5.191370 0.19498668
## 19 GaPt29_01.txt Parkinsons 60.07655 5.575530 60.10143 5.534039 0.09280709
## 20 GaPt30_01.txt Parkinsons 54.92416 3.250632 54.85622 3.628274 0.05918402
## CoefVR CadenceTotal CadenceSD Speed CadenceLSI Var
## 1 0.17526496 56.12406 7.571144 0.908 0.79645902 High Variability
## 2 0.11678976 52.19618 4.491964 0.848 0.25989510 Low Variability
## 3 0.43500729 44.14557 11.279012 0.677 6.53020999 High Variability
## 4 0.05617869 52.25546 3.715477 1.046 0.11546323 Low Variability
## 5 0.11570185 61.29577 6.210452 0.894 0.24676981 Low Variability
## 6 0.12253995 60.61333 5.248713 1.261 0.46651965 Low Variability
## 7 0.16446713 63.61374 8.348241 0.726 0.43936469 Low Variability
## 8 0.12911300 61.43195 9.416112 1.369 0.94705156 High Variability
## 9 0.10813840 56.11055 5.671999 1.048 1.11800355 High Variability
## 10 0.09262849 51.57313 3.680048 0.731 0.10103301 Low Variability
## 11 0.06598237 52.50641 2.688679 0.889 0.01707731 Low Variability
## 12 0.10705035 54.13482 5.731212 0.722 0.46731313 Low Variability
## 13 0.14142303 48.34174 4.486529 0.802 0.55514818 High Variability
## 14 0.16477216 58.08486 5.777725 1.255 1.46220865 High Variability
## 15 0.07571864 62.01274 6.181150 1.128 0.42584795 Low Variability
## 16 0.27142713 54.18645 8.004002 1.244 4.17425709 High Variability
## 17 0.20123992 54.82749 7.665312 1.423 0.96903497 High Variability
## 18 0.09539954 56.03653 10.926376 0.987 1.46612524 High Variability
## 19 0.09207831 60.07655 5.575530 1.092 0.02070283 Low Variability
## 20 0.06614152 54.92416 3.250632 1.064 0.06189097 Low Variability
## Asy Asy2 LSI2 BiSpeed
## 1 0.90118793 High Asymmetry -0.22757960 low speed
## 2 0.27201759 Low Asymmetry -1.34747720 low speed
## 3 6.16840679 High Asymmetry 1.87643910 low speed
## 4 0.12053252 Low Asymmetry -2.15880319 low speed
## 5 0.30326727 Low Asymmetry -1.39929933 low speed
## 6 0.56819695 Low Asymmetry -0.76245514 high speed
## 7 0.56145949 Low Asymmetry -0.82242547 low speed
## 8 1.15266805 High Asymmetry -0.05440174 high speed
## 9 1.26882126 High Asymmetry 0.11154455 low speed
## 10 0.10431718 Low Asymmetry -2.29230798 low speed
## 11 0.01793643 Low Asymmetry -4.07000439 low speed
## 12 0.50833372 Low Asymmetry -0.76075574 low speed
## 13 0.53973286 Low Asymmetry -0.58852022 low speed
## 14 1.72385002 High Asymmetry 0.37994807 high speed
## 15 0.52592033 Low Asymmetry -0.85367293 low speed
## 16 4.72082308 High Asymmetry 1.42893640 high speed
## 17 1.07299272 High Asymmetry -0.03145458 high speed
## 18 1.61938905 High Asymmetry 0.38262303 low speed
## 19 0.02488024 Low Asymmetry -3.87748507 low speed
## 20 0.06794414 Low Asymmetry -2.78238093 low speed
I’m going to look at a pairs panel of all the variables and make boxplots for the most correlated ones. I’m also curious how the dichotomous variables I made interact with Ill, so I will count them and visualize them with interaction plots and geom_count.
pairs.panels(MasterFrame[,c(2:14,16)])# highest corr with speed, cadencetotal, & symidx
# Boxplot of Illness & Speed
boxplot(Speed~Ill,data = MasterFrame,main="Gait Speed By Health Status",ylab='Speed (m/s)',xlab='')# Solid overlap but healthy is higher
boxplot(MasterFrame$CadenceLSI~MasterFrame$Ill)# Parkinsons has higher outliers
boxplot(MasterFrame$CadenceTotal~MasterFrame$Ill)# Very Similar
ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Var))
ggplot(data=MasterFrame)+geom_boxplot(aes(x=Ill,y=Asy))
#MasterFrame %>% count(Ill,Var)
#MasterFrame %>% count(Ill,Asy2)
#MasterFrame %>% count(Ill,BiSpeed)
ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Asy2))
ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=BiSpeed))
ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Var))
xtabs(~MasterFrame$Ill+MasterFrame$Asy2)
## MasterFrame$Asy2
## MasterFrame$Ill High Asymmetry Low Asymmetry
## Parkinsons 41 46
## Healthy 38 33
xtabs(~MasterFrame$Ill+MasterFrame$Var)
## MasterFrame$Var
## MasterFrame$Ill High Variability Low Variability
## Parkinsons 42 45
## Healthy 37 34
xtabs(~MasterFrame$Ill+MasterFrame$BiSpeed)
## MasterFrame$BiSpeed
## MasterFrame$Ill low speed high speed
## Parkinsons 57 30
## Healthy 22 49
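Of the three tables, the dichotomized speed variable shows the clearest imbalance between groups. A quick formal check of that association with a standard chi-squared test (not part of the original output):
# Association between disease status and the dichotomized speed variable
chisq.test(xtabs(~ Ill + BiSpeed, data = MasterFrame))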
#PCA
#ggbiplot(princomp(MasterFrame[,c(3:12,16)]))
# Interaction plots of Speed by disease status and each dichotomous factor
with(MasterFrame, interaction.plot(Ill, Asy2, Speed))
with(MasterFrame, interaction.plot(Ill, BiSpeed, Speed))
with(MasterFrame, interaction.plot(Ill, Var, Speed))
I chose to make a separate data frame, MasterFrame2, in which the factors are coded as 0 and 1 instead of the words they represent, so they are easier to include in models. I also removed the ID column (mainly so I could fit glm(Ill ~ .)). I then split the data manually into a 70-30 train-test split.
# Make a data frame with the factors coded as numbers
MasterFrame2<- MasterFrame
MasterFrame2$Ill<-Ill2 # Ill2 is the 0/1 recoding of Ill created earlier (not shown above)
levels(MasterFrame2$Var)<-c(0:1)
levels(MasterFrame2$Asy2)<-c(0:1)
levels(MasterFrame2$BiSpeed)<-c(0:1)
MasterFrame2<-MasterFrame2[,-1] # drop the ID column
# Make the training set 70% & Testing Set 30%
# 1-87 are PD, 88-158 Co
TrainMaster<- MasterFrame2[c(1:61,88:137),]
length(TrainMaster$Ill)#111
## [1] 111
TestMaster<- MasterFrame2[-c(1:61,88:137),]
length(TestMaster$Ill)#47
## [1] 47
TestMaster2<- MasterFrame[-c(1:61,88:137),]
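As an aside, an alternative to the manual index-based split would be a seeded random split, so train/test membership does not depend on how the rows happen to be ordered. A sketch, not used for any of the results below:
set.seed(1)   # arbitrary seed chosen for illustration
train_idx <- sample(nrow(MasterFrame2), size = round(0.7 * nrow(MasterFrame2)))
TrainAlt  <- MasterFrame2[train_idx, ]
TestAlt   <- MasterFrame2[-train_idx, ]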
#(Speed+CadenceTotal+CadenceLSI+CoefVL+CoefVR+CadenceLSD+CadenceR+CadenceRSD+Asy+Asy2+Var+LSI2+BiSpeed)^2
# 2 false negatives - over 80%
# All variables plus their pairwise interactions (note: fit on the full MasterFrame2,
# which includes the rows later used as TestMaster, so test accuracy will be optimistic)
kitchensink<-glm(Ill~.^2,MasterFrame2,family = binomial)
#summary(kitchensink)
check1<- predict(kitchensink,type="response",newdata=TestMaster)
score.table(check1, TestMaster$Ill,.5) #100% Accuracy
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 26 0
## 1 0 21
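score.table() here appears to be a helper from the course materials that prints the actual-vs-predicted table above. A minimal stand-in, written as a sketch with a hypothetical name:
# Hypothetical equivalent of score.table(): cross-tabulate true 0/1 labels
# against predicted probabilities thresholded at `cutoff`
score_table_sketch <- function(pred, truth, cutoff = 0.5) {
  table(Actual = truth, Pred = pred > cutoff)
}
# e.g. score_table_sketch(check1, TestMaster$Ill, 0.5)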
# Stepwise selection on the interaction model (step() only needs the fitted model object)
step.kitchen <- step(kitchensink, trace = 0)
# AIC of ~110; main effects retained: CadenceL, CadenceLSD, CadenceR, CadenceRSD,
# CoefVL, CoefVR, Speed, CadenceLSI, Var, Asy, LSI2, BiSpeed (plus many interactions)
#summary(step.kitchen)
stepcheck<-predict(step.kitchen,type = "response",newdata = TestMaster)
score.table(stepcheck,TestMaster$Ill,.5)#100% Accuracy
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 26 0
## 1 0 21
# Make a model without the interactions but with all variables
#Test all variables
lifevest<-glm(Ill~.,MasterFrame2,family = binomial)
#summary(lifevest)
checkinf<- predict(lifevest,type = "response",newdata = TestMaster)
score.table(checkinf,TestMaster$Ill,.5)
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 17 9
## 1 5 16
#step model
allstep<-step(lifevest,trace = 0)
summary(allstep)# AIC of 170.8; retained variables: CadenceL, CadenceLSD, CadenceR, CadenceRSD, Speed, CadenceLSI, Var
##
## Call:
## glm(formula = Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD +
## Speed + CadenceLSI + Var, family = binomial, data = MasterFrame2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7457 2.6130 -1.051 0.29335
## CadenceL -0.5078 0.2563 -1.982 0.04753 *
## CadenceLSD 0.3579 0.1113 3.214 0.00131 **
## CadenceR 0.3882 0.2342 1.658 0.09733 .
## CadenceRSD -0.1784 0.1193 -1.496 0.13477
## Speed 8.1420 1.5387 5.291 1.21e-07 ***
## CadenceLSI -0.8918 0.3033 -2.940 0.00328 **
## Var1 -0.9676 0.5715 -1.693 0.09045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.41 on 157 degrees of freedom
## Residual deviance: 154.76 on 150 degrees of freedom
## AIC: 170.76
##
## Number of Fisher Scoring iterations: 6
length(allstep$coefficients)-1# 7 variables
## [1] 7
checkall<-predict(allstep,type="response",newdata=TestMaster)
score.table(checkall,TestMaster$Ill,.5) #72% Accuracy & 4 False Negatives
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 17 9
## 1 4 17
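To break the 72% down by class (which of 0/1 is the Parkinson’s level depends on how Ill2 was coded earlier), the confusion counts can be converted to rates:
conf <- table(Actual = TestMaster$Ill, Pred = checkall > 0.5)
prop.table(conf, margin = 1)   # row-wise rates for each true class
sum(diag(conf)) / sum(conf)    # overall accuracy, about 0.72 here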
#PC model
allvar.pca <- princomp(MasterFrame2[,c(2:11,13,15)], cor = TRUE)
biplot(allvar.pca)
glm90 <- pc.glm(allvar.pca, 99, MasterFrame2[,1]) # keep components explaining up to 99% of the variance
summary(glm90)
##
## Call:
## glm(formula = r ~ ., family = binomial, data = data.frame(pc.obj$scores[,
## 1:k], r = response))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.345775 0.200274 -1.727 0.08426 .
## Comp.1 -0.009934 0.107487 -0.092 0.92636
## Comp.2 0.470288 0.146393 3.212 0.00132 **
## Comp.3 -0.216637 0.200202 -1.082 0.27921
## Comp.4 -0.606129 0.337309 -1.797 0.07234 .
## Comp.5 -1.555214 0.288554 -5.390 7.06e-08 ***
## Comp.6 0.211718 0.348779 0.607 0.54383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.41 on 157 degrees of freedom
## Residual deviance: 161.98 on 151 degrees of freedom
## AIC: 175.98
##
## Number of Fisher Scoring iterations: 5
checkglm<-predict.pc.glm(glm90,allvar.pca,TestMaster2[,c(3:12,14,16)])
score.table(checkglm,TestMaster$Ill,.5)
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 18 8
## 1 7 14
# 90, 95, 98, and 99% retained variance all give 32/47 correct
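Since 90-99% retained variance all give the same 32/47, it may be worth looking at the variance explained by each component directly (assuming, as the comment above implies, that the second argument to pc.glm() is the percent of variance to retain):
# Proportion of variance explained by each principal component, and cumulatively
var_explained <- allvar.pca$sdev^2 / sum(allvar.pca$sdev^2)
round(cumsum(var_explained), 3)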
# Let's make a model with just speed
speedmodel<- glm(Ill~Speed,MasterFrame,family=binomial)
summary(speedmodel)
##
## Call:
## glm(formula = Ill ~ Speed, family = binomial, data = MasterFrame)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.554 1.413 -5.345 9.05e-08 ***
## Speed 6.402 1.207 5.302 1.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.41 on 157 degrees of freedom
## Residual deviance: 175.73 on 156 degrees of freedom
## AIC: 179.73
##
## Number of Fisher Scoring iterations: 4
check3<- predict(speedmodel,type="response",newdata=TestMaster)
score.table(check3,TestMaster$Ill,.5)# Does a lot of work by itself! 63.8% Accuracy
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 18 8
## 1 9 12
# Model with just speed & Cadence
model4<- glm(Ill~(Speed+CadenceTotal),MasterFrame,family=binomial)
summary(model4)
##
## Call:
## glm(formula = Ill ~ (Speed + CadenceTotal), family = binomial,
## data = MasterFrame)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.84719 2.15215 -2.717 0.00659 **
## Speed 6.89202 1.31848 5.227 1.72e-07 ***
## CadenceTotal -0.04021 0.03902 -1.030 0.30281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.41 on 157 degrees of freedom
## Residual deviance: 174.61 on 155 degrees of freedom
## AIC: 180.61
##
## Number of Fisher Scoring iterations: 4
check4<- predict(model4,type="response",newdata = TestMaster)
score.table(check4,TestMaster$Ill,.5)# Exactly the same as the just speed model
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 18 8
## 1 9 12
# Model with speed, cadence, cadencesd, Asy & interaction
model5<- glm(Ill~(Speed+CadenceR+CadenceRSD+Asy)^2,MasterFrame,family=binomial)
check5<-predict(model5,type="response",newdata=TestMaster)
score.table(check5,TestMaster$Ill,.5)
## Actual vs. Predicted
## Pred
## Actual FALSE TRUE
## 0 15 11
## 1 6 15
#compare models
anova(step.kitchen,allstep,speedmodel,model4,model5,test="Chi")
## Analysis of Deviance Table
##
## Model 1: Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD + CoefVL +
## CoefVR + Speed + CadenceLSI + Var + Asy + LSI2 + BiSpeed +
## CadenceL:CadenceLSD + CadenceL:CadenceRSD + CadenceL:CoefVR +
## CadenceL:Speed + CadenceL:Asy + CadenceL:LSI2 + CadenceL:BiSpeed +
## CadenceLSD:CadenceR + CadenceLSD:CadenceRSD + CadenceLSD:Speed +
## CadenceLSD:CadenceLSI + CadenceLSD:Asy + CadenceLSD:BiSpeed +
## CadenceR:CadenceRSD + CadenceR:CoefVL + CadenceR:Var + CadenceRSD:CoefVL +
## CadenceRSD:Speed + CadenceRSD:Var + CadenceRSD:Asy + CadenceRSD:LSI2 +
## CadenceRSD:BiSpeed + CoefVL:CoefVR + CoefVL:Speed + CoefVL:Var +
## CoefVL:Asy + CoefVL:LSI2 + CoefVL:BiSpeed + CoefVR:Speed +
## CoefVR:CadenceLSI + CoefVR:Asy + CoefVR:LSI2 + Speed:CadenceLSI +
## Speed:Var + Speed:Asy + CadenceLSI:Asy + CadenceLSI:LSI2 +
## Var:Asy + LSI2:BiSpeed
## Model 2: Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD + Speed +
## CadenceLSI + Var
## Model 3: Ill ~ Speed
## Model 4: Ill ~ (Speed + CadenceTotal)
## Model 5: Ill ~ (Speed + CadenceR + CadenceRSD + Asy)^2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 106 0.00
## 2 150 154.75 -44 -154.755 3.039e-14 ***
## 3 156 175.73 -6 -20.973 0.001855 **
## 4 155 174.61 1 1.116 0.290819
## 5 147 156.72 8 17.892 0.022054 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Diagnostic Plots
#autoplot(allstep, which = 4, nrow=1, ncol=1)
# Main Effects Model
autoplot(allstep, which = 1:4, nrow=2, ncol=2)
# Interaction Model
autoplot(step.kitchen, which = 1:4, nrow=2, ncol=2)
#plot(allstep$residuals)#Pretty close
#plot(step.kitchen$residuals)
#influential points
plot(cooks.distance(allstep))
which(cooks.distance(allstep) > 0.5)#none
## named integer(0)
plot(cooks.distance(kitchensink))
which(cooks.distance(kitchensink) > 0.5)#4 values
## 46 55 72 91
## 46 55 72 91
# Last Visual Check
ggplot()+geom_point(aes(x=checkall,y=TestMaster2$Ill))+xlab("Fitted Values")+ylab("True Values")
ggplot()+geom_point(aes(x=stepcheck,y=TestMaster2$Ill))+xlab("Fitted Values")+ylab("True Values")
I was able to create a model that correctly classified every subject in my test dataset. The model that used all variables and their interactions was very effective, as was the step model built from it; both reached 100% accuracy. However, this model appears to be overfit, and I doubt how well it would generalize: Cook’s distance flagged several influential points, and the residuals in the residuals-vs-fitted plot were essentially zero.
A step model that included all variables without their interactions produced a reasonable 72% accuracy and the lowest number of false negatives among the smaller models. This model also uses only 7 predictors, which I felt was manageable. I chose it as the most effective model even though others reached similar accuracy, because I want the screening tool to catch as many individuals with Parkinson’s as possible, and I consider the burden of a healthy individual being flagged and seen by a doctor to be minimal. This model also had a lower residual deviance than the simpler speed-based models. Its residuals were fairly normal with some deviation at the tails, and all Cook’s distances were below 0.5.
Gait speed turned out to be the most predictive single variable, with a model based on gait speed alone reaching 63.8% accuracy on the test dataset. I wasn’t surprised by this, and it supports my original hypothesis. The gait variability metrics, while important, were nowhere near as predictive on their own, although there were clear interactions between them and other variables. I was surprised that variables like variability weren’t more important in the models. I built roughly ten additional models that I did not include in this final document because they were essentially as accurate as the ones shown. This speaks to the difficulty of diagnosing this disease and the need for complementary measures. I am curious how a model that combined this data with stability metrics from a standing balance task would fare, along with the health history information used in clinical diagnosis.
In conclusion, I was able to build a model that predicted Parkinson’s disease status with 72% accuracy. I am comfortable with this performance given the difficulty of the task, and with the model’s quality based on the diagnostics. This was a fun challenge that brings together a good portion of my current primary research interest with what I have been learning in coursework.
Mayo Foundation for Medical Education and Research. (2022, December 6). Parkinson’s disease. Mayo Clinic. Retrieved December 16, 2022, from https://www.mayoclinic.org/diseases-conditions/parkinsons-disease/
Frenkel-Toledo S, Giladi N, Peretz C, Herman T, Gruendlinger L, Hausdorff JM. Effect of gait speed on gait rhythmicity in Parkinson’s disease: variability of stride time and swing time respond differently. Journal of NeuroEngineering and Rehabilitation 2005: 2:23.
Frenkel-Toledo, S, Giladi N, Peretz C, Herman T, Gruendlinger L, Hausdorff JM. Treadmill walking as a pacemaker to improve gait rhythm and stability in Parkinson’s disease. Movement Disorders 2005; 20(9):1109-1114.
Hausdorff JM, Lowenthal J, Herman T, Gruendlinger L, Peretz C, Giladi N. Rhythmic auditory stimulation modulates gait variability in Parkinson’s disease. Eur J Neuroscience 2007; 26: 2369-2375.
Yogev G, Giladi N, Peretz C, Springer S, Simon ES, Hausdorff JM. Dual tasking, gait rhythmicity, and Parkinson’s disease: Which aspects of gait are attention demanding? Eur J Neuroscience 2005; 22:1248-1256.