Hacarus_CodingTest

Data & Questions

We have about 1,300 meal records from 60 people in the spreadsheet below. Each row is one meal, with features such as the person's profile, the number of dishes, and detailed nutrients such as energy, carbohydrate, fat, and protein. The score of each meal is in the last column, “R”. Scores range from 1 to 4, where 1 is the worst meal and 4 is the best. You may choose whichever features you like, and you may create new features from the original ones, to build the classifier.

Data preprocessing

The table below contains some missing values, so we check for missing values first.

MissingValueCheck function: computes the ratio of missing values in each column. If no column has a ratio > 0.05, the function returns 'Pass'; otherwise it returns the names of the columns that need further preprocessing.

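MissingValueCheck <- function(DF) {
  # Drop rows whose `NA.` column is missing
  df1 <- DF[!is.na(DF$NA.), ]
  # Ratio of missing values per column
  ratio <- colSums(is.na(df1)) / nrow(df1)
  # Names of the columns whose ratio exceeds 0.05 (base R subsetting, so the
  # result does not depend on how dplyr::filter treats row names)
  flagged <- names(ratio)[ratio > 0.05]
  if (length(flagged) == 0) {
    check <- 'Pass'
  } else {
    check <- flagged
  }
  return(check)
}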
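The check above boils down to per-column missing ratios compared against a threshold. A minimal sketch on a toy data frame (the column names and values are hypothetical, not taken from the meal spreadsheet):

```r
# Toy data frame (hypothetical values, not from the meal spreadsheet)
toy <- data.frame(
  id     = 1:20,
  energy = c(NA, NA, rep(500, 18)),  # 2 of 20 values missing -> ratio 0.10
  salt   = rep(2.5, 20)              # no missing values      -> ratio 0.00
)

# Share of missing values per column
missing_ratio <- colSums(is.na(toy)) / nrow(toy)

# Columns above the 5% threshold would need extra preprocessing
flagged <- names(missing_ratio)[missing_ratio > 0.05]

print(missing_ratio)
print(flagged)
```

Here only `energy` is flagged, so a check of this kind would return that column name rather than 'Pass'.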

Modeling - randomforest

After the data passed the check, I selected and computed the features. I chose [Type, gender, age, height, weight, EER.kcal] from the raw data and computed [avg.E, avg.P, avg.F, avg.C, avg.Salt, avg.Vegetables], where each nutrient column is divided by the number of dishes.

The data was split into a training set (70%) and a validation set (30%), and the ‘caret’ package was used to build a random forest model.
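The per-dish averages could be computed roughly as follows. This is only a sketch: the raw column names (`dishes`, `E`, `P`, `F`, `C`, `Salt`, `Vegetables`) and all values are assumptions made for illustration, since the spreadsheet's actual headers are not shown here.

```r
# Toy meal records; the raw column names and values are hypothetical
meals <- data.frame(
  dishes     = c(2, 4, 1),
  E          = c(600, 800, 450),   # energy
  P          = c(20, 36, 15),      # protein
  F          = c(18, 28, 12),      # fat
  C          = c(80, 100, 60),     # carbohydrate
  Salt       = c(3.0, 4.0, 2.5),
  Vegetables = c(120, 200, 0)
)

nutrients <- c("E", "P", "F", "C", "Salt", "Vegetables")

# Divide every nutrient column by the number of dishes in that meal
avg <- meals[nutrients] / meals$dishes
names(avg) <- paste0("avg.", nutrients)

meals <- cbind(meals, avg)
print(meals[c("dishes", "avg.E", "avg.Vegetables")])
```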

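library(caret)  # provides train() and trainControl()

set.seed(36)
# Randomly pick 70% of the rows for training; the rest is kept for validation
train <- sample(nrow(df1), floor(0.7 * nrow(df1)), replace = FALSE)
TrainSet <- df1[train,]
ValidSet <- df1[-train,]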

model_rf <- caret::train(Score ~ .,
                         data = TrainSet,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  verboseIter = FALSE))

Validation result

The table below shows the predictions on ValidSet (rows are the predicted scores). The accuracy, computed as [correct predictions]/[all predictions], is 0.74.

##          
## predValid   1   2   3   4
##         1  72  14   0   0
##         2  24 199  28   2
##         3   2  17  20  15
##         4   1   0   0   1
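The accuracy can be recomputed from the confusion matrix above. The sketch below copies the printed counts and assumes that rows are predicted scores and columns are the recorded scores.

```r
# Confusion matrix copied from the output above
# (rows: predicted score; columns: assumed to be the recorded score)
cm <- matrix(
  c(72,  14,  0,  0,
    24, 199, 28,  2,
     2,  17, 20, 15,
     1,   0,  0,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = as.character(1:4), actual = as.character(1:4))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 2))
```

With these counts, 292 of 395 validation meals are on the diagonal, which rounds to 0.74.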

Result insight and discussion 1 - Class imbalance problem

In the raw data, [score 2] has 757 records, far more than [score 4], which has only 37. As a result, our model rarely predicts [score 4].

To address the class imbalance, we can try resampling techniques to improve the results. I used the ‘down’ and ‘smote’ resampling methods.

Down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model).

SMOTE (Synthetic Minority Oversampling TEchnique) synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class and computing the k-nearest neighbors of this point. The synthetic points are added between the chosen point and its neighbors.
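The two ideas can be sketched in base R on toy data. This is an illustration of the techniques, not the code used for the results in this report. In caret, such methods are usually selected through the `sampling` argument of `trainControl` (for example `sampling = "down"`), but whether that was done here is an assumption. All data below is synthetic.

```r
set.seed(36)

# Toy imbalanced data: 80 rows of class "A", 20 rows of class "B" (synthetic)
toy <- data.frame(
  x1    = c(rnorm(80, mean = 0), rnorm(20, mean = 3)),
  x2    = c(rnorm(80, mean = 0), rnorm(20, mean = 3)),
  class = factor(rep(c("A", "B"), times = c(80, 20)))
)

# Down-sampling: keep as many rows per class as the rarest class has
n_min    <- min(table(toy$class))
keep     <- unlist(lapply(split(seq_len(nrow(toy)), toy$class),
                          function(idx) sample(idx, n_min)))
toy_down <- toy[keep, ]
print(table(toy_down$class))   # 20 rows per class, i.e. 40% of the toy data

# SMOTE-style synthesis of a single point:
# pick a minority point, find its k nearest minority neighbours,
# and place a new point on the segment towards one of them
minority  <- as.matrix(toy[toy$class == "B", c("x1", "x2")])
k         <- 5
i         <- sample(nrow(minority), 1)
d         <- sqrt(rowSums(sweep(minority, 2, minority[i, ])^2))
nn        <- order(d)[2:(k + 1)]   # position 1 is the point itself
j         <- sample(nn, 1)
gap       <- runif(1)              # position along the segment, in [0, 1]
synthetic <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
print(synthetic)
```

A full SMOTE implementation repeats the synthesis step many times until the minority class reaches the desired size.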

##          
## predValid   1   2   3   4
##         1  62  29   0   0
##         2  29 110  12   1
##         3   4  70  24   7
##         4   4  21  12  10

##          
## predValid   1   2   3   4
##         1  21   3   0   0
##         2  75 184  25   3
##         3   2  13   8   3
##         4   1  30  15  12

Although the two resampling methods increase the number of [class 4] predictions, the accuracy is 0.52 for ‘down’ (first table above) and 0.57 for ‘SMOTE’ (second table above). Both are lower than the 0.74 of the original model. Resampling may solve the problem of the model almost never predicting [score 4], but we still need to discuss how to balance that prediction requirement against overall accuracy.
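The trade-off can be made concrete by putting accuracy next to the share of [score 4] meals that each model finds. The sketch below copies the three printed confusion matrices and assumes that rows are predicted scores and columns are recorded scores. Which resampled table belongs to which method is inferred from the reported accuracies (0.52 and 0.57).

```r
# Helper that turns 16 printed counts into a labelled 4x4 matrix
to_cm <- function(v) {
  matrix(v, nrow = 4, byrow = TRUE,
         dimnames = list(predicted = as.character(1:4),
                         actual    = as.character(1:4)))
}

# Counts copied from the outputs above
cms <- list(
  original = to_cm(c(72, 14, 0, 0,   24, 199, 28, 2,   2, 17, 20, 15,   1, 0, 0, 1)),
  down     = to_cm(c(62, 29, 0, 0,   29, 110, 12, 1,   4, 70, 24, 7,    4, 21, 12, 10)),
  smote    = to_cm(c(21, 3, 0, 0,    75, 184, 25, 3,   2, 13, 8, 3,     1, 30, 15, 12))
)

summary_tbl <- data.frame(
  # Share of all validation meals predicted correctly
  accuracy      = sapply(cms, function(cm) sum(diag(cm)) / sum(cm)),
  # Share of meals in the "4" column that were also predicted as 4
  recall_score4 = sapply(cms, function(cm) cm["4", "4"] / sum(cm[, "4"]))
)
print(round(summary_tbl, 2))
```

Under these assumptions the original model finds 1 of the 18 [score 4] meals in the validation set, while ‘down’ finds 10 and ‘SMOTE’ finds 12, at the cost of lower overall accuracy.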

Result insight and discussion 2 - feature importance

## rf variable importance
## 
##                Overall
## avg.Vegetables 100.000
## avg.F           57.425
## avg.Salt        50.628
## avg.P           47.163
## avg.C           44.678
## avg.E           35.488
## EER.kcal.       18.627
## age             18.200
## weight          17.475
## height          14.305
## Typelunch        2.323
## Typedinner       1.212
## gendermale       0.000

From the importance table of the original model, we choose the features whose importance is over 30, [avg.Vegetables, avg.F, avg.Salt, avg.P, avg.C, avg.E], and then calculate their averages for each score.
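The selection and averaging steps can be sketched as follows. The importance values are copied from the table above; the meal records are hypothetical stand-ins for the training set, so the resulting means are illustrative only.

```r
# Importance values copied from the table above
importance <- c(avg.Vegetables = 100.000, avg.F = 57.425, avg.Salt = 50.628,
                avg.P = 47.163, avg.C = 44.678, avg.E = 35.488,
                EER.kcal. = 18.627, age = 18.200, weight = 17.475,
                height = 14.305, Typelunch = 2.323, Typedinner = 1.212,
                gendermale = 0.000)

# Keep the features whose importance is over 30
selected <- names(importance)[importance > 30]
print(selected)

# Toy records standing in for the training set (hypothetical values)
toy <- data.frame(
  Score          = factor(c(1, 1, 2, 2, 3, 4)),
  avg.Vegetables = c(10, 20, 40, 60, 90, 100),
  avg.F          = c(30, 26, 20, 18, 15, 8),
  avg.Salt       = c(3.0, 2.6, 2.2, 2.0, 1.8, 1.2),
  avg.P          = c(25, 23, 20, 18, 17, 12),
  avg.C          = c(70, 66, 60, 56, 50, 40),
  avg.E          = c(500, 480, 420, 400, 380, 300)
)

# Mean of each selected feature within each score
means_by_score <- aggregate(toy[selected], by = list(Score = toy$Score), FUN = mean)
print(means_by_score)
```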

Train set:

Predict result: We can get some insights from the charts below:

1. How to get a high score? More vegetables, and less of everything else. This matters most for distinguishing [Score3] from [Score4]: their Vegetables values are almost the same, but the other features are lower for [Score4] than for [Score3].
2. The random forest predictions sharpen the difference in Vegetables between the scores. For the other features the pattern is the same as in the training set: the lower the value, the higher the score.

Result insight and discussion 3 - Special Case

In this case, the score in the raw data is 1, but our prediction was 4. Checking the data, we find that avg.Vegetables is 181.2 and avg.F is 0.52, which matches our model’s criteria for a high score. That is why the model predicted Score4 for this case.
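Cases like this one can be located by comparing the recorded and predicted scores. A minimal sketch on toy validation results: only the avg.Vegetables and avg.F values of the special case come from the text above, and everything else, including the column layout, is hypothetical.

```r
# Toy validation results; row 5 mimics the special case described above,
# all other values are hypothetical
valid <- data.frame(
  Score          = c(1, 2, 2, 3, 1, 4),       # recorded score
  predValid      = c(1, 2, 3, 3, 4, 4),       # predicted score
  avg.Vegetables = c(20, 55, 80, 95, 181.2, 110),
  avg.F          = c(25, 18, 14, 12, 0.52, 6)
)

# Rows where prediction and recorded score are as far apart as possible.
# With factor scores, convert via as.integer(as.character(x)) first.
gap     <- abs(valid$predValid - valid$Score)
special <- valid[gap >= 3, ]
print(special)
```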

What we can do next?

The importance table shows that the personal information features ([age, weight, height, gender]) have low importance. We could ask why personal information does not affect the score. Alternatively, we may need to create more features that combine the personal information with the food component data.
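One possible direction, shown only as a sketch: express nutrients relative to the person, for example meal energy as a share of the energy requirement, or protein per unit of body weight. The columns `E` and `P`, the new feature names, the units, and all values below are assumptions for illustration; `weight` and `EER.kcal.` are the feature names from the importance table.

```r
# Toy records (hypothetical values and units)
toy <- data.frame(
  weight    = c(55, 70, 82),          # body weight, assumed to be in kg
  EER.kcal. = c(1900, 2400, 2700),    # energy requirement, assumed kcal per day
  E         = c(650, 700, 1100),      # meal energy (hypothetical column)
  P         = c(22, 30, 35)           # meal protein (hypothetical column)
)

# Combined features: food components scaled by personal information
toy$E.per.EER    <- toy$E / toy$EER.kcal.   # share of the requirement in one meal
toy$P.per.weight <- toy$P / toy$weight      # protein per unit of body weight

print(round(toy, 3))
```

Features of this kind would let the model see whether a meal is large or small for that particular person, rather than in absolute terms.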
