We have about 1,300 meal records from 60 people in the spreadsheet below. Each row represents one meal, with features such as the person's profile, the number of dishes, and detailed nutrients (energy, carbohydrate, fat, protein, and so on). The score of each meal is in the last column, “R”. Scores range from 1 to 4, where 1 is the worst meal and 4 is the best. Choose whichever features you like, and create new features from the original ones if needed, to build the classifier.
The table below contains some missing values, so we check for missing values first.
MissingValueCheck function: computes the rate of missing values in each column. If any rate is above 0.05, the function returns the names of those columns and we need further preprocessing; otherwise the check passes.
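library(dplyr)  # provides %>% and filter()

MissingValueCheck <- function(DF) {
  # Keep only the rows where the NA. column is not missing
  df1 <- DF[!is.na(DF$NA.), ]
  # Missing count and ratio per column; the column name is stored explicitly
  # so the result does not depend on row names surviving filter()
  df2 <- data.frame(column = names(df1),
                    all = nrow(df1),
                    missing = colSums(is.na(df1)),
                    ratio = colSums(is.na(df1)) / nrow(df1)) %>%
    filter(ratio > 0.05)
  if (nrow(df2) == 0) {
    check <- "Pass"
  } else {
    check <- as.character(df2$column)
  }
  return(check)
}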
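As a quick cross-check of the idea behind MissingValueCheck, the per-column missing rate can also be computed in base R with `colMeans(is.na(...))`. The sketch below uses a small made-up data frame (the name `toy` and its values are purely illustrative, not the meal data) and the same 0.05 threshold.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  age    = c(25, NA, 31, 40, 52, 28, 33, 45, 38, 29),
  weight = c(60, 72, 55, 80, 68, 59, 77, 64, 70, 66),
  height = c(165, 170, NA, NA, 172, 160, 181, 168, 175, 169)
)

# Fraction of missing values in each column
missing_ratio <- colMeans(is.na(toy))

# Columns whose missing rate exceeds the threshold
flagged <- names(missing_ratio)[missing_ratio > 0.05]

if (length(flagged) == 0) {
  check <- "Pass"
} else {
  check <- flagged
}
print(check)
```

Here `age` and `height` are flagged because more than 5% of their values are missing, while `weight` has none.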
The data passed the check, so we move on to calculating and choosing features. I chose [Type, gender, age, height, weight, EER.kcal] from the raw data and calculated [avg.E, avg.P, avg.F, avg.C, avg.Salt, avg.Vegetables], which are the nutrient columns divided by the number of dishes.
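The per-dish averages could be derived roughly as follows. This is only a sketch: the raw column names (`dishes`, `E`, `P`, `F`, `C`, `Salt`, `Vegetables`) and the values are assumptions made for illustration, since the spreadsheet's actual column names are not shown here.

```r
# Hypothetical raw columns: nutrient totals per meal plus the number of dishes
meals <- data.frame(
  dishes     = c(2, 4, 1),
  E          = c(600, 800, 450),
  P          = c(20, 36, 15),
  F          = c(18, 24, 9),
  C          = c(80, 100, 70),
  Salt       = c(3, 4, 2),
  Vegetables = c(120, 200, 50)
)

nutrients <- c("E", "P", "F", "C", "Salt", "Vegetables")

# Divide every nutrient column by the number of dishes in that meal
avg <- meals[nutrients] / meals$dishes
names(avg) <- paste0("avg.", nutrients)

# Attach the new avg.* features to the original columns
features <- cbind(meals, avg)
print(features[, c("dishes", "avg.E", "avg.Vegetables")])
```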
The data is split into a training set (70%) and a validation set (30%), and the ‘caret’ package is used to build a random forest model.
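library(caret)  # provides train() and trainControl()

set.seed(36)
# Random 70/30 split of the row indices
train <- sample(nrow(df1), 0.7 * nrow(df1), replace = FALSE)
TrainSet <- df1[train, ]
ValidSet <- df1[-train, ]
# Random forest tuned with 10-fold cross-validation, repeated 10 times
model_rf <- caret::train(Score ~ .,
                         data = TrainSet,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 10,
                                                  verboseIter = FALSE))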
Below is the prediction result on ValidSet (rows are predicted scores, columns are actual scores). Accuracy is calculated as [correct predictions]/[all predictions], which gives [0.74].
##
## predValid   1   2   3   4
##         1  72  14   0   0
##         2  24 199  28   2
##         3   2  17  20  15
##         4   1   0   0   1
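The accuracy figure can be reproduced directly from the confusion matrix above. A minimal sketch, with the matrix typed in by hand (rows are predicted scores, columns are actual scores):

```r
# Confusion matrix from the validation output:
# rows = predicted score, columns = actual score
conf <- matrix(c(72,  14,  0,  0,
                 24, 199, 28,  2,
                  2,  17, 20, 15,
                  1,   0,  0,  1),
               nrow = 4, byrow = TRUE,
               dimnames = list(predicted = as.character(1:4),
                               actual = as.character(1:4)))

correct  <- sum(diag(conf))   # predictions on the diagonal are correct
total    <- sum(conf)         # all validation records
accuracy <- correct / total

print(c(correct = correct, total = total))
print(round(accuracy, 2))
```

The diagonal sums to 292 out of 395 validation records, which rounds to 0.74.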
In the raw data, [score 2] has 757 records, far more than [score 4], which has only 37. Because of this imbalance, our model rarely predicts [score 4].
For this class imbalance problem, we can try resampling techniques to improve the results. I used the ‘down’ and ‘smote’ resampling methods.
Down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model).
SMOTE (Synthetic Minority Oversampling TEchnique): synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class and computing the k-nearest neighbors of that point. The synthetic points are added between the chosen point and its neighbors.
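To make the two ideas concrete, here is a base R sketch of both. It is an illustration of the techniques, not the code used for the results in this report: the helper names `down_sample` and `smote_point` and the toy data are hypothetical. With caret, resampling can instead be requested through the `sampling` argument of `trainControl()` (for example `sampling = "down"` or `sampling = "smote"`); whether the models here were fitted that way is an assumption.

```r
# Down-sampling: keep as many rows of every class as the rarest class has
down_sample <- function(df, class_col) {
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]], drop = TRUE)
  n_min <- min(lengths(idx_by_class))
  keep <- lapply(idx_by_class, function(idx) idx[sample.int(length(idx), n_min)])
  df[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

# SMOTE idea: create ONE synthetic point between a random minority point
# and one of its k nearest minority neighbours
smote_point <- function(X, k = 5) {
  i <- sample.int(nrow(X), 1)                  # pick a minority point
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))    # Euclidean distance to it
  d[i] <- Inf                                  # exclude the point itself
  nn <- order(d)[seq_len(min(k, nrow(X) - 1))] # indices of nearest neighbours
  j <- nn[sample.int(length(nn), 1)]           # pick one neighbour
  gap <- runif(1)                              # position along the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

# Toy data (made up): 80% of rows in class "2", 20% in class "4"
set.seed(36)
toy <- data.frame(
  x1 = rnorm(20),
  x2 = rnorm(20),
  Score = factor(rep(c("2", "4"), times = c(16, 4)))
)

balanced <- down_sample(toy, "Score")
print(table(balanced$Score))
print(nrow(balanced) / nrow(toy))   # share of the training set that is kept

minority <- as.matrix(toy[toy$Score == "4", c("x1", "x2")])
synthetic <- smote_point(minority, k = 3)
print(synthetic)
```

The toy data mirrors the 80%/20% example in the text: after down-sampling, both classes have 4 rows, so 8 of the 20 rows (40%) remain.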
Down-sampling result:
##
## predValid   1   2   3   4
##         1  62  29   0   0
##         2  29 110  12   1
##         3   4  70  24   7
##         4   4  21  12  10
SMOTE result:
##
## predValid   1   2   3   4
##         1  21   3   0   0
##         2  75 184  25   3
##         3   2  13   8   3
##         4   1  30  15  12
Although both resampling methods increase the number of [class 4] predictions, the accuracy is 0.52 for ‘down’ and 0.57 for ‘SMOTE’, both lower than that of the original model. Resampling may solve the minority-class prediction problem, but we still need to weigh that requirement against overall accuracy.
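One way to put numbers on this trade-off is to compare overall accuracy with the recall of [class 4] (the share of actual score-4 meals that are predicted as 4) for the three confusion matrices reported above. In this sketch the matrices are typed in by hand, and the labels `down` and `smote` are assigned by matching each matrix to the accuracy stated in the text; `make_conf` and `summarise_conf` are hypothetical helper names.

```r
# Build a 4x4 confusion matrix: rows = predicted score, columns = actual score
make_conf <- function(values) {
  matrix(values, nrow = 4, byrow = TRUE,
         dimnames = list(predicted = as.character(1:4),
                         actual = as.character(1:4)))
}

original <- make_conf(c(72, 14, 0, 0,  24, 199, 28, 2,  2, 17, 20, 15,  1, 0, 0, 1))
down     <- make_conf(c(62, 29, 0, 0,  29, 110, 12, 1,  4, 70, 24, 7,   4, 21, 12, 10))
smote    <- make_conf(c(21, 3, 0, 0,   75, 184, 25, 3,  2, 13, 8, 3,    1, 30, 15, 12))

# Overall accuracy and recall for actual score 4
summarise_conf <- function(conf) {
  c(accuracy      = sum(diag(conf)) / sum(conf),
    recall_score4 = conf["4", "4"] / sum(conf[, "4"]))
}

results <- round(rbind(original = summarise_conf(original),
                       down     = summarise_conf(down),
                       smote    = summarise_conf(smote)), 2)
print(results)
```

The validation set contains 18 actual score-4 meals. The original model finds 1 of them, while the resampled models find 10 and 12, at the cost of roughly 0.2 in overall accuracy.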
## rf variable importance
##
## Overall
## avg.Vegetables 100.000
## avg.F 57.425
## avg.Salt 50.628
## avg.P 47.163
## avg.C 44.678
## avg.E 35.488
## EER.kcal. 18.627
## age 18.200
## weight 17.475
## height 14.305
## Typelunch 2.323
## Typedinner 1.212
## gendermale 0.000
From the importance table of the original model, we choose the features whose importance is over 30, [avg.Vegetables, avg.F, avg.Salt, avg.P, avg.C, avg.E], and then calculate their averages.
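The selection step can be sketched as below, with the importance values copied from the table. The second part assumes the averages are taken per Score, and it runs on a tiny made-up data frame (`toy`, illustrative values only, two features shown) rather than the meal data.

```r
# Importance values copied from the rf variable importance table
importance <- c(avg.Vegetables = 100.000, avg.F = 57.425, avg.Salt = 50.628,
                avg.P = 47.163, avg.C = 44.678, avg.E = 35.488,
                EER.kcal. = 18.627, age = 18.200, weight = 17.475,
                height = 14.305, Typelunch = 2.323, Typedinner = 1.212,
                gendermale = 0.000)

# Keep the features whose importance is over 30
selected <- names(importance)[importance > 30]
print(selected)

# Assumed next step: mean of each selected feature per Score.
# Made-up values, for illustration only.
toy <- data.frame(Score          = c(1, 1, 4, 4),
                  avg.Vegetables = c(40, 60, 150, 170),
                  avg.F          = c(12, 10, 3, 5))
per_score <- aggregate(cbind(avg.Vegetables, avg.F) ~ Score, data = toy, FUN = mean)
print(per_score)
```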
Train set:
Predict result:
These charts give us some insights:
1. How does a meal get a high score? More vegetables and less of everything else. The distinction between [Score3] and [Score4] needs particular attention: their vegetable amounts are almost the same, but the other nutrients are lower for [Score4] than for [Score3].
2. The random forest predictions sharpen the separation in vegetables between the scores. For the other nutrients, the pattern is the same as in the training set: the lower the value, the higher the score.
In one case, the Score in the raw data is 1, but our prediction was 4. Checking the data, we find that avg.Vegetables is 181.2 and avg.F is 0.52, which matches our model's criteria for a high score. That is why the model predicted Score4 for this case.
The importance table shows that the personal information features ([age, weight, height, gender]) have lower importance. We may ask why personal information does not affect the Score. Alternatively, we may need to create more features that combine the personal information with the food component data.
## rf variable importance
##
## Overall
## avg.Vegetables 100.000
## avg.F 57.425
## avg.Salt 50.628
## avg.P 47.163
## avg.C 44.678
## avg.E 35.488
## EER.kcal. 18.627
## age 18.200
## weight 17.475
## height 14.305
## Typelunch 2.323
## Typedinner 1.212
## gendermale 0.000