Dante Goss II Final Project

# Discrimination of Health Status via Gait Analysis

## Introduction

Parkinson’s disease is a neurological disorder characterized by tremors, slow, labored walking, and balance disruptions. There are no specific diagnostic criteria for Parkinson’s disease; a diagnosis is at the discretion of a physician based on symptomatology, health history, and generally other lab tests to rule out other diseases (1). The goal of this project is to build a model that discriminates individuals with Parkinson’s disease from healthy controls based on a walking gait test on an instrumented walkway. Such a model could serve as an additional screening tool.

While the goal of this project is not hypothesis driven, I do anticipate gait speed and some measure of gait instability to be very important in the model, given that slow, labored walking and balance disruptions are key symptoms.

## Analysis

### Load Data

The first thing I need to do is load all the data. The data come from a demographic xlsx file and a series of txt files that hold the force-plate data from the walking trials.

Data format:

Each line contains 19 columns:

Column 1: Time (in seconds)

Columns 2-9: Vertical ground reaction force (VGRF, in newtons) on each of the 8 sensors located under the left foot

Columns 10-17: VGRF on each of the 8 sensors located under the right foot

Column 18: Total force under the left foot

Column 19: Total force under the right foot

Data file names:

These follow a common convention, e.g., GaCo01_02.txt or JuPt03_06.txt, where

Ga, Ju or Si – indicate the study from which the data originated:

Ga - Galit Yogev et al (dual tasking in PD; Eur J Neuro, 2005)

Ju - Hausdorff et al (RAS in PD; Eur J Neuro, 2007)

Si - Silvi Frenkel-Toledo et al (treadmill walking in PD; Mov Disorders, 2005)

Co or Pt: Control subject or a PD Patient

01: Subject number in the group

A walk number of 01 refers to a usual, normal walk.

A walk number of 10 (for the “Ga” study) indicates dual-task walking, where the subject was engaged in serial-7 subtraction while walking.

.txt: file name extension

The sampling rate was 100 Hz.
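The loading code itself isn’t shown in this document; a minimal sketch of reading one force-plate file into a data frame might look like the following. Whitespace separation and the helper name `read_gait_file` are my assumptions, based on the format described above.

```r
# Hedged sketch: read one 19-column force-plate file and name the columns
# according to the format described above. Whitespace separation is assumed.
read_gait_file <- function(path) {
  df <- read.table(path, header = FALSE)
  names(df) <- c("Time",                 # seconds
                 paste0("L", 1:8),       # VGRF on the 8 left-foot sensors (N)
                 paste0("R", 1:8),       # VGRF on the 8 right-foot sensors (N)
                 "TotalL", "TotalR")     # total force under each foot (N)
  df
}
```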

##       ID Study Group Subjnum Gender Age Height Weight HoehnYahr UPDRS UPDRSM
## 1 GaPt03    Ga     1       3      2  82   1.45     50       3.0    20     10
## 2 GaPt04    Ga     1       4      1  68   1.71    NaN       2.5    25      8
## 3 GaPt05    Ga     1       5      2  82   1.53     51       2.5    24      5
## 4 GaPt06    Ga     1       6      1  72   1.70     82       2.0    16     13
## 5 GaPt07    Ga     1       7      2  53   1.67     54       3.0    44     22
## 6 GaPt08    Ga     1       8      2  68   1.63     57       2.0    15      8
##    TUAG Speed_01 Speed_02 Speed_03 Speed_04 Speed_05 Speed_06 Speed_07 Speed_10
## 1 36.34      NaN      NaN      NaN      NaN      NaN      NaN      NaN    0.778
## 2 11.00    0.642      NaN      NaN      NaN      NaN      NaN      NaN    0.818
## 3 14.50    0.908      NaN      NaN      NaN      NaN      NaN      NaN    0.614
## 4 10.47    0.848      NaN      NaN      NaN      NaN      NaN      NaN    0.937
## 5 18.34    0.677    0.957      NaN      NaN      NaN      NaN      NaN    0.579
## 6 10.11    1.046    1.132      NaN      NaN      NaN      NaN      NaN    0.228
##         ID Group Gender Age Height Weight Speed_01
## 11  GaPt15     1      1  81    NaN     80    0.948
## 15  GaPt19     1      2  76    NaN     55    1.124
## 17  GaPt21     1      1  80    NaN     63    1.166
## 2   GaPt04     1      1  68   1.71    NaN    0.642
## 19  GaPt23     1      2  71   1.60    NaN    0.360
## 153 SiCo17     2      2  64   1.65    NaN    1.202
## 1   GaPt03     1      2  82   1.45     50      NaN

### Making new variables

I want to make variables for cadence, step asymmetry (likely just cadence compared between legs), and some measure of variability. I’m going to make empty vectors for each variable and, in a loop, calculate the variables for each subject. My main research interest has to do with cadence, so there will be a lot of cadence variables. I will use a findpeaks function to locate the time points of peak force, measure how far apart the peaks are (assuming they occur at the same point in the gait cycle), and use that spacing to calculate cadence step by step, for both the left and right legs. For each leg I will compute the mean and standard deviation of cadence and the coefficient of variation (defined as the standard deviation over the mean); across limbs I will compute the mean cadence, its standard deviation, and a limb symmetry index.

CadenceL<-rep(NA,length(Data))
CadenceR<-rep(NA,length(Data))
CadenceLSD<-rep(NA,length(Data))
CadenceRSD<-rep(NA,length(Data))
CoefVL<-rep(NA,length(Data))
CoefVR<-rep(NA,length(Data))
CadenceTotal<-rep(NA,length(Data))
CadenceSD<-rep(NA,length(Data))
CadenceLSI<- rep(NA,length(Data))
# Data is a large list that holds all the base data, time, left force, and right force

#findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)#dataframe with all L peaks
#2nd column is time

#Get Time Stamps 
#findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]
# Order them
#sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2])
# How far apart are they?
#diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))
# What is that in seconds?
samplingrate=100 # Hz
fs=1/samplingrate # seconds per sample (sampling period, 0.01 s)
#fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))#Time between steps
# Cadence
#60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2])))
#What is the average?
#mean(60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
# What is the Standard deviation?
#sd(60/(fs*diff(sort(findpeaks(Data[[1]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))

#Iterate
for (i in 1:length(Data)){
  CadenceL[i]<-mean(60/(fs*diff(sort(findpeaks(Data[[i]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceR[i]<-mean(60/(fs*diff(sort(findpeaks(Data[[i]][,3],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceLSD[i]<-sd(60/(fs*diff(sort(findpeaks(Data[[i]][,2],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CadenceRSD[i]<-sd(60/(fs*diff(sort(findpeaks(Data[[i]][,3],minpeakdistance = 50,minpeakheight = 500)[,2]))))
  CoefVL[i]<-CadenceLSD[i]/CadenceL[i]
  CoefVR[i]<-CadenceRSD[i]/CadenceR[i]
  CadenceTotal[i]<-mean(c(CadenceL[i],CadenceR[i]))# c() needed: mean(a,b) would ignore b
  CadenceSD[i]<-mean(c(CadenceLSD[i],CadenceRSD[i]))
  CadenceLSI[i]<-100*((abs(CadenceL[i]-CadenceR[i]))/(abs(CadenceL[i]+CadenceR[i])))
}
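As a sanity check, the cadence arithmetic above can be verified on a synthetic signal with a known step rate. `find_local_maxima` below is a base-R stand-in for the `findpeaks` function used in the loop, so this sketch runs without any extra packages.

```r
# Synthetic force trace: one 600 N "step" per second, sampled at 100 Hz,
# so the recovered cadence should be 60 steps per minute.
find_local_maxima <- function(x, minheight) {
  idx <- which(diff(sign(diff(x))) == -2) + 1   # strict local maxima
  idx[x[idx] >= minheight]
}
samplingrate <- 100
fs <- 1 / samplingrate                          # seconds per sample
t <- seq(0, 10, by = fs)
force <- 600 * pmax(sin(2 * pi * t), 0)         # half-wave "heel strikes"
peaks <- find_local_maxima(force, minheight = 500)
cadence <- 60 / (fs * diff(sort(peaks)))        # steps per minute
mean(cadence)                                   # ~60 steps/min
```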

### Building the master dataframe

I need to add everything to the master dataframe. I’m going to combine everything from before with the IDs and disease status. I’m also going to dichotomize some key variables into high/low categories: variability (LSI), speed, and asymmetry.

##               ID        Ill CadenceL CadenceLSD CadenceR CadenceRSD     CoefVL
## 1  GaPt05_01.txt Parkinsons 56.12406   7.571144 57.02525   9.994528 0.13490014
## 2  GaPt06_01.txt Parkinsons 52.19618   4.491964 52.46820   6.127748 0.08605925
## 3  GaPt07_01.txt Parkinsons 44.14557  11.279012 50.31398  21.886947 0.25549590
## 4  GaPt08_01.txt Parkinsons 52.25546   3.715477 52.13493   2.928872 0.07110217
## 5  GaPt09_01.txt Parkinsons 61.29577   6.210452 61.59904   7.127122 0.10131942
## 6  GaPt12_01.txt Parkinsons 60.61333   5.248713 61.18153   7.497181 0.08659337
## 7  GaPt13_01.txt Parkinsons 63.61374   8.348241 64.17520  10.554711 0.13123330
## 8  GaPt14_01.txt Parkinsons 61.43195   9.416112 60.27928   7.782839 0.15327713
## 9  GaPt16_01.txt Parkinsons 56.11055   5.671999 57.37937   6.204913 0.10108615
## 10 GaPt17_01.txt Parkinsons 51.57313   3.680048 51.67745   4.786804 0.07135592
## 11 GaPt18_01.txt Parkinsons 52.50641   2.688679 52.52434   3.465680 0.05120668
## 12 GaPt20_01.txt Parkinsons 54.13482   5.731212 54.64315   5.849568 0.10586925
## 13 GaPt22_01.txt Parkinsons 48.34174   4.486529 48.88147   6.912966 0.09280860
## 14 GaPt24_01.txt Parkinsons 58.08486   5.777725 59.80871   9.854811 0.09947041
## 15 GaPt25_01.txt Parkinsons 62.01274   6.181150 61.48682   4.655699 0.09967549
## 16 GaPt26_01.txt Parkinsons 54.18645   8.004002 58.90728  15.989033 0.14771223
## 17 GaPt27_01.txt Parkinsons 54.82749   7.665312 55.90048  11.249408 0.13980783
## 18 GaPt28_01.txt Parkinsons 56.03653  10.926376 54.41714   5.191370 0.19498668
## 19 GaPt29_01.txt Parkinsons 60.07655   5.575530 60.10143   5.534039 0.09280709
## 20 GaPt30_01.txt Parkinsons 54.92416   3.250632 54.85622   3.628274 0.05918402
##        CoefVR CadenceTotal CadenceSD Speed CadenceLSI              Var
## 1  0.17526496     56.12406  7.571144 0.908 0.79645902 High Variability
## 2  0.11678976     52.19618  4.491964 0.848 0.25989510  Low Variability
## 3  0.43500729     44.14557 11.279012 0.677 6.53020999 High Variability
## 4  0.05617869     52.25546  3.715477 1.046 0.11546323  Low Variability
## 5  0.11570185     61.29577  6.210452 0.894 0.24676981  Low Variability
## 6  0.12253995     60.61333  5.248713 1.261 0.46651965  Low Variability
## 7  0.16446713     63.61374  8.348241 0.726 0.43936469  Low Variability
## 8  0.12911300     61.43195  9.416112 1.369 0.94705156 High Variability
## 9  0.10813840     56.11055  5.671999 1.048 1.11800355 High Variability
## 10 0.09262849     51.57313  3.680048 0.731 0.10103301  Low Variability
## 11 0.06598237     52.50641  2.688679 0.889 0.01707731  Low Variability
## 12 0.10705035     54.13482  5.731212 0.722 0.46731313  Low Variability
## 13 0.14142303     48.34174  4.486529 0.802 0.55514818 High Variability
## 14 0.16477216     58.08486  5.777725 1.255 1.46220865 High Variability
## 15 0.07571864     62.01274  6.181150 1.128 0.42584795  Low Variability
## 16 0.27142713     54.18645  8.004002 1.244 4.17425709 High Variability
## 17 0.20123992     54.82749  7.665312 1.423 0.96903497 High Variability
## 18 0.09539954     56.03653 10.926376 0.987 1.46612524 High Variability
## 19 0.09207831     60.07655  5.575530 1.092 0.02070283  Low Variability
## 20 0.06614152     54.92416  3.250632 1.064 0.06189097  Low Variability
##           Asy           Asy2        LSI2    BiSpeed
## 1  0.90118793 High Asymmetry -0.22757960  low speed
## 2  0.27201759  Low Asymmetry -1.34747720  low speed
## 3  6.16840679 High Asymmetry  1.87643910  low speed
## 4  0.12053252  Low Asymmetry -2.15880319  low speed
## 5  0.30326727  Low Asymmetry -1.39929933  low speed
## 6  0.56819695  Low Asymmetry -0.76245514 high speed
## 7  0.56145949  Low Asymmetry -0.82242547  low speed
## 8  1.15266805 High Asymmetry -0.05440174 high speed
## 9  1.26882126 High Asymmetry  0.11154455  low speed
## 10 0.10431718  Low Asymmetry -2.29230798  low speed
## 11 0.01793643  Low Asymmetry -4.07000439  low speed
## 12 0.50833372  Low Asymmetry -0.76075574  low speed
## 13 0.53973286  Low Asymmetry -0.58852022  low speed
## 14 1.72385002 High Asymmetry  0.37994807 high speed
## 15 0.52592033  Low Asymmetry -0.85367293  low speed
## 16 4.72082308 High Asymmetry  1.42893640 high speed
## 17 1.07299272 High Asymmetry -0.03145458 high speed
## 18 1.61938905 High Asymmetry  0.38262303  low speed
## 19 0.02488024  Low Asymmetry -3.87748507  low speed
## 20 0.06794414  Low Asymmetry -2.78238093  low speed
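The code that builds the high/low dichotomous variables isn’t shown, and the cutpoints aren’t stated; a sketch of the idea, assuming a median split purely for illustration (the actual cutpoints above may differ), could be:

```r
# Hedged sketch of the high/low dichotomization; median cutpoints are assumed.
lsi   <- c(0.796, 0.260, 6.530, 0.115, 0.247)   # example CadenceLSI values
speed <- c(0.908, 0.848, 0.677, 1.046, 0.894)   # example speeds (m/s)
Var     <- ifelse(lsi   > median(lsi),   "High Variability", "Low Variability")
BiSpeed <- ifelse(speed > median(speed), "high speed", "low speed")
```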

### Data Visualization

I’m going to do a pairs panel to see all the variables, and several boxplots for the highly correlated variables. I’m curious how the dichotomous variables I made interact with Ill, so I’m going to count them and visualize them in an interaction plot and in a geom_count plot.

pairs.panels(MasterFrame[,c(2:14,16)])# highest corr with speed, cadencetotal, & symidx

# Boxplot of Illness & Speed
boxplot(Speed~Ill,data = MasterFrame,main="Gait Speed By Health Status",ylab='Speed (m/s)',xlab='')# Solid overlap but healthy is higher

boxplot(MasterFrame$CadenceLSI~MasterFrame$Ill)# Parkinsons has higher outliers

boxplot(MasterFrame$CadenceTotal~MasterFrame$Ill)# Very Similar

ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Var))

ggplot(data=MasterFrame)+geom_boxplot(aes(x=Ill,y=Asy))# data= and bare aes() names, not Data= / MasterFrame$

#MasterFrame %>% count(Ill,Var)
#MasterFrame %>% count(Ill,Asy2)
#MasterFrame %>% count(Ill,BiSpeed)
ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Asy2))

ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=BiSpeed))

ggplot(data=MasterFrame)+geom_count(aes(x=Ill,y=Var))

xtabs(~MasterFrame$Ill+MasterFrame$Asy2)
##                MasterFrame$Asy2
## MasterFrame$Ill High Asymmetry Low Asymmetry
##      Parkinsons             41            46
##      Healthy                38            33
xtabs(~MasterFrame$Ill+MasterFrame$Var)
##                MasterFrame$Var
## MasterFrame$Ill High Variability Low Variability
##      Parkinsons               42              45
##      Healthy                  37              34
xtabs(~MasterFrame$Ill+MasterFrame$BiSpeed)
##                MasterFrame$BiSpeed
## MasterFrame$Ill low speed high speed
##      Parkinsons        57         30
##      Healthy           22         49
#PCA
#ggbiplot(princomp(MasterFrame[,c(3:12,16)]))

Asy2<-MasterFrame$Asy2
interaction.plot(Ill,Asy2,Speed)

interaction.plot(Ill,BiSpeed,Speed)

Var<-MasterFrame$Var
interaction.plot(Ill,Var,Speed)

### Split the data into testing & training sets

I chose to make a separate dataframe, MasterFrame2, where I coded the factors as 0 and 1 instead of the words they represent so I could include them more easily in models. I also removed the ID column (mainly so I could do glm(Ill~.)).

I also chose to split the data manually, opting for a 70-30 split.

# make a data frame with the factors put as numbers
MasterFrame2<- MasterFrame
MasterFrame2$Ill<-Ill2
levels(MasterFrame2$Var)<-c(0:1)
levels(MasterFrame2$Asy2)<-c(0:1)
levels(MasterFrame2$BiSpeed)<-c(0:1)
MasterFrame2<-MasterFrame2[,-1]
# Make the training set 70% & Testing Set 30%
# 1-87 are PD, 88-158 Co
TrainMaster<- MasterFrame2[c(1:61,88:137),]
length(TrainMaster$Ill)#111
## [1] 111
TestMaster<- MasterFrame2[-c(1:61,88:137),]
length(TestMaster$Ill)#47
## [1] 47
TestMaster2<- MasterFrame[-c(1:61,88:137),]
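For comparison, a randomized 70-30 split (rather than the index-based split above) could be sketched as follows; n = 158 matches the row count reported in the model summaries.

```r
# Hedged sketch of a randomized 70-30 train/test split (not what was done above).
set.seed(1)                                # for reproducibility
n <- 158                                   # rows in MasterFrame2
train_idx <- sample(n, size = round(0.7 * n))
length(train_idx)                          # 111 training rows; 47 remain for testing
```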

### Build several GLMs & test their predictions

#(Speed+CadenceTotal+CadenceLSI+CoefVL+CoefVR+CadenceLSD+CadenceR+CadenceRSD+Asy+Asy2+Var+LSI2+BiSpeed)^2
# 2 false negatives - over 80%
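score.table() comes from course materials and isn’t defined in this document; a minimal sketch consistent with how it’s called below (predicted probabilities, actual labels, threshold) and with the printed output might be:

```r
# Hedged reconstruction of score.table(): a thresholded confusion table.
score.table <- function(pred, truth, threshold) {
  cat("Actual vs. Predicted\n")
  tab <- table(Actual = truth, Pred = pred > threshold)  # cut direction is assumed
  print(tab)
  invisible(tab)
}
```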


# All variables + interactions
# (note: fit on all of MasterFrame2, so the TestMaster rows are also in the training data)
kitchensink<-glm(Ill~.^2,MasterFrame2,family = binomial)
#summary(kitchensink)
check1<- predict(kitchensink,type="response",newdata=TestMaster)
score.table(check1, TestMaster$Ill,.5) #100% Accuracy
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    26    0
##      1     0   21
# Step of that model
step.kitchen <- step(kitchensink, data = TestMaster, family = binomial,trace = 0)
# AIC of 110; keeps CadenceL, CadenceLSD, CadenceR, CadenceRSD, CoefVL, CoefVR,
# Speed, CadenceLSI, Var, Asy, LSI2, BiSpeed plus many interactions
#summary(step.kitchen)
stepcheck<-predict(step.kitchen,type = "response",newdata = TestMaster)
score.table(stepcheck,TestMaster$Ill,.5)#100% Accuracy
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    26    0
##      1     0   21
# Make a model without the interactions but with all variables
#Test all variables
lifevest<-glm(Ill~.,MasterFrame2,family = binomial)
#summary(lifevest)
checkinf<- predict(lifevest,type = "response",newdata = TestMaster)
score.table(checkinf,TestMaster$Ill,.5)
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    17    9
##      1     5   16
#step model
allstep<-step(lifevest,trace = 0)
summary(allstep)# AIC of 170.76: CadenceL, CadenceLSD, CadenceR, CadenceRSD, Speed, CadenceLSI, Var
## 
## Call:
## glm(formula = Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD + 
##     Speed + CadenceLSI + Var, family = binomial, data = MasterFrame2)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7457     2.6130  -1.051  0.29335    
## CadenceL     -0.5078     0.2563  -1.982  0.04753 *  
## CadenceLSD    0.3579     0.1113   3.214  0.00131 ** 
## CadenceR      0.3882     0.2342   1.658  0.09733 .  
## CadenceRSD   -0.1784     0.1193  -1.496  0.13477    
## Speed         8.1420     1.5387   5.291 1.21e-07 ***
## CadenceLSI   -0.8918     0.3033  -2.940  0.00328 ** 
## Var1         -0.9676     0.5715  -1.693  0.09045 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.41  on 157  degrees of freedom
## Residual deviance: 154.76  on 150  degrees of freedom
## AIC: 170.76
## 
## Number of Fisher Scoring iterations: 6
length(allstep$coefficients)-1# 7 variables
## [1] 7
checkall<-predict(allstep,type="response",newdata=TestMaster)
score.table(checkall,TestMaster$Ill,.5) #72% Accuracy & 4 False Negatives
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    17    9
##      1     4   17
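The 72% figure in the comment above comes straight from the confusion counts: (17 + 17) correct out of 47 test subjects.

```r
# Accuracy and false negatives from the confusion table above.
tn <- 17; fp <- 9; fn <- 4; tp <- 17
accuracy <- (tn + tp) / (tn + fp + fn + tp)
round(100 * accuracy, 1)   # 72.3
```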
#PC model
allvar.pca <- princomp(MasterFrame2[,c(2:11,13,15)], cor = TRUE)
biplot(allvar.pca)

glm90 <- pc.glm(allvar.pca, 99, MasterFrame2[,1])# 99% of variance retained (name kept from the 90% run)
summary(glm90)
## 
## Call:
## glm(formula = r ~ ., family = binomial, data = data.frame(pc.obj$scores[, 
##     1:k], r = response))
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.345775   0.200274  -1.727  0.08426 .  
## Comp.1      -0.009934   0.107487  -0.092  0.92636    
## Comp.2       0.470288   0.146393   3.212  0.00132 ** 
## Comp.3      -0.216637   0.200202  -1.082  0.27921    
## Comp.4      -0.606129   0.337309  -1.797  0.07234 .  
## Comp.5      -1.555214   0.288554  -5.390 7.06e-08 ***
## Comp.6       0.211718   0.348779   0.607  0.54383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.41  on 157  degrees of freedom
## Residual deviance: 161.98  on 151  degrees of freedom
## AIC: 175.98
## 
## Number of Fisher Scoring iterations: 5
checkglm<-predict.pc.glm(glm90,allvar.pca,TestMaster2[,c(3:12,14,16)])
score.table(checkglm,TestMaster$Ill,.5)
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    18    8
##      1     7   14
#90,95,98,99 all gets 32/47
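pc.glm() and predict.pc.glm() are course helpers not defined here. The underlying idea, logistic regression on the leading principal-component scores, can be sketched in base R; the data below are simulated stand-ins, not the gait data.

```r
# Hedged sketch of principal-component logistic regression in base R.
set.seed(2)
X <- matrix(rnorm(158 * 5), ncol = 5)            # stand-in predictor matrix
y <- rbinom(158, 1, plogis(X[, 1]))              # stand-in 0/1 outcome
pca <- prcomp(X, scale. = TRUE)
varexp <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative variance explained
k <- which(varexp >= 0.90)[1]                    # components covering 90% of variance
fit <- glm(y ~ pca$x[, 1:k], family = binomial)  # logistic fit on the PC scores
```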

# lets make a model with just speed
speedmodel<- glm(Ill~Speed,MasterFrame,family=binomial)
summary(speedmodel)
## 
## Call:
## glm(formula = Ill ~ Speed, family = binomial, data = MasterFrame)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -7.554      1.413  -5.345 9.05e-08 ***
## Speed          6.402      1.207   5.302 1.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.41  on 157  degrees of freedom
## Residual deviance: 175.73  on 156  degrees of freedom
## AIC: 179.73
## 
## Number of Fisher Scoring iterations: 4
check3<- predict(speedmodel,type="response",newdata=TestMaster)
score.table(check3,TestMaster$Ill,.5)# Does a lot of work by itself! 63.8% Accuracy
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    18    8
##      1     9   12
# Model with just speed & Cadence
model4<- glm(Ill~(Speed+CadenceTotal),MasterFrame,family=binomial)
summary(model4)
## 
## Call:
## glm(formula = Ill ~ (Speed + CadenceTotal), family = binomial, 
##     data = MasterFrame)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.84719    2.15215  -2.717  0.00659 ** 
## Speed         6.89202    1.31848   5.227 1.72e-07 ***
## CadenceTotal -0.04021    0.03902  -1.030  0.30281    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.41  on 157  degrees of freedom
## Residual deviance: 174.61  on 155  degrees of freedom
## AIC: 180.61
## 
## Number of Fisher Scoring iterations: 4
check4<- predict(model4,type="response",newdata = TestMaster)
score.table(check4,TestMaster$Ill,.5)# Exactly the same as the just speed model
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    18    8
##      1     9   12
# Model with speed, cadence, cadencesd, Asy & interaction
model5<- glm(Ill~(Speed+CadenceR+CadenceRSD+Asy)^2,MasterFrame,family=binomial)
check5<-predict(model5,type="response",newdata=TestMaster)
score.table(check5,TestMaster$Ill,.5)
## Actual vs. Predicted
##       Pred
## Actual FALSE TRUE
##      0    15   11
##      1     6   15
#compare models
anova(step.kitchen,allstep,speedmodel,model4,model5,test="Chi")
## Analysis of Deviance Table
## 
## Model 1: Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD + CoefVL + 
##     CoefVR + Speed + CadenceLSI + Var + Asy + LSI2 + BiSpeed + 
##     CadenceL:CadenceLSD + CadenceL:CadenceRSD + CadenceL:CoefVR + 
##     CadenceL:Speed + CadenceL:Asy + CadenceL:LSI2 + CadenceL:BiSpeed + 
##     CadenceLSD:CadenceR + CadenceLSD:CadenceRSD + CadenceLSD:Speed + 
##     CadenceLSD:CadenceLSI + CadenceLSD:Asy + CadenceLSD:BiSpeed + 
##     CadenceR:CadenceRSD + CadenceR:CoefVL + CadenceR:Var + CadenceRSD:CoefVL + 
##     CadenceRSD:Speed + CadenceRSD:Var + CadenceRSD:Asy + CadenceRSD:LSI2 + 
##     CadenceRSD:BiSpeed + CoefVL:CoefVR + CoefVL:Speed + CoefVL:Var + 
##     CoefVL:Asy + CoefVL:LSI2 + CoefVL:BiSpeed + CoefVR:Speed + 
##     CoefVR:CadenceLSI + CoefVR:Asy + CoefVR:LSI2 + Speed:CadenceLSI + 
##     Speed:Var + Speed:Asy + CadenceLSI:Asy + CadenceLSI:LSI2 + 
##     Var:Asy + LSI2:BiSpeed
## Model 2: Ill ~ CadenceL + CadenceLSD + CadenceR + CadenceRSD + Speed + 
##     CadenceLSI + Var
## Model 3: Ill ~ Speed
## Model 4: Ill ~ (Speed + CadenceTotal)
## Model 5: Ill ~ (Speed + CadenceR + CadenceRSD + Asy)^2
##   Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)    
## 1       106       0.00                           
## 2       150     154.75 -44 -154.755 3.039e-14 ***
## 3       156     175.73  -6  -20.973  0.001855 ** 
## 4       155     174.61   1    1.116  0.290819    
## 5       147     156.72   8   17.892  0.022054 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Diagnostic Plots
#autoplot(allstep, which = 4, nrow=1, ncol=1)

# Main Effects Model
autoplot(allstep, which = 1:4, nrow=2, ncol=2)

# Interaction Model
autoplot(step.kitchen, which = 1:4, nrow=2, ncol=2)

#plot(allstep$residuals)#Pretty close
#plot(step.kitchen$residuals)

#influential points
plot(cooks.distance(allstep))

which(cooks.distance(allstep) > 0.5)#none
## named integer(0)
plot(cooks.distance(kitchensink))

which(cooks.distance(kitchensink) > 0.5)#4 values
## 46 55 72 91 
## 46 55 72 91
# Last Visual Check
ggplot()+geom_point(aes(x=checkall,y=TestMaster2$Ill))+xlab("Fitted Values")+ylab("True Values")

ggplot()+geom_point(aes(x=stepcheck,y=TestMaster2$Ill))+xlab("Fitted Values")+ylab("True Values")

## Conclusions

I was able to create a model that correctly identified every subject in my test dataset. The model that used all variables and their interactions was very effective, as was the step model built from it; both reached 100% accuracy. However, this model appears to be overfit, and I am skeptical of how well it would generalize: it had several influential outliers by Cook’s distance, and its residuals vs. fitted plot showed the fit matching the data almost perfectly.

A step model that included all variables without their interactions produced a reasonable 72% accuracy and the fewest false negatives of the smaller models. With only 7 predictors, it also felt manageable. I have chosen this model as the most effective even though others have the same accuracy, because I want the screening tool to catch as many individuals with Parkinson’s as possible, and I consider the burden of a healthy individual being flagged and seen by a doctor to be minimal. This model also had lower deviance than the simpler models. Its residuals were fairly normal with a bit of error at the tails, and its Cook’s distances were all below 0.5.

Gait speed was found to be the most predictive variable, with a model based only on gait speed achieving 63.8% accuracy on the test dataset. I wasn’t surprised that this was the case, and it supported my original hypothesis. While important, the gait variability metrics were nowhere near as predictive on their own, though there were clear interactions between them and some other variables. I was surprised that variables like variability weren’t more important in the models. I made about 10 models that I didn’t include in this final document because they were essentially as accurate as the ones I’ve included. This speaks to the difficulty of diagnosing this disease and the need for additional measures. I am curious how a model combining this data with stability metrics from a standing balance task would fare, in addition to the health history information used in clinical diagnosis.

In conclusion, I was able to build a model that predicted Parkinson’s disease status with 72% accuracy. I feel comfortable with the performance of this model given the difficulty of the task, and with its quality based on the diagnostics. This was a fun challenge for me that brings together a good portion of my current primary research interest with what I’ve been learning in coursework.

## References

  1. Mayo Foundation for Medical Education and Research. (2022, December 6). Parkinson’s disease. Mayo Clinic. Retrieved December 16, 2022, from https://www.mayoclinic.org/diseases-conditions/parkinsons-disease/

  2. Frenkel-Toledo S, Giladi N, Peretz C, Herman T, Gruendlinger L, Hausdorff JM. Effect of gait speed on gait rhythmicity in Parkinson’s disease: variability of stride time and swing time respond differently. Journal of NeuroEngineering and Rehabilitation 2005: 2:23.

  3. Frenkel-Toledo, S, Giladi N, Peretz C, Herman T, Gruendlinger L, Hausdorff JM. Treadmill walking as a pacemaker to improve gait rhythm and stability in Parkinson’s disease. Movement Disorders 2005; 20(9):1109-1114.

  4. Hausdorff JM, Lowenthal J, Herman T, Gruendlinger L, Peretz C, Giladi N. Rhythmic auditory stimulation modulates gait variability in Parkinson’s disease. Eur J Neuroscience 2007; 26: 2369-2375.

  5. Yogev G, Giladi N, Peretz C, Springer S, Simon ES, Hausdorff JM. Dual tasking, gait rhythmicity, and Parkinson’s disease: Which aspects of gait are attention demanding? Eur J Neuroscience 2005; 22:1248-1256.