Tidymodels & palmer-penguins

Oddly enough, I’ve been hearing a lot about tidymodels this week, from reddit to youtube. Plus, this week’s tidytuesday dataset looks fun. I figure I’ll spend a tidytuesday trying to figure out this new and still-developing machine-learning package.

For the newcomer, tidymodels is the planned replacement for the caret package. Complicated machine learning (ml) models will continue to rely on caret for a while, as tidymodels is very new to the field. I’m taking the word of others more experienced in the matter that it does not yet have all the abilities of caret. That says nothing about tidymodels’ potential, though: it has already tied a large number of ml models together under one roof. Its tidy implementation also means it could make ml an easier field to break into for the non-computer scientist like me (I’m the classic biologist turned programmer).

This post is my exploration of tidymodels: some of my notes and my own implementation. I’ve referred to a number of sites to get a broad idea and have put those references at the very bottom. It should be useful to those new to the package or to machine learning.

the tidymodels packages

tidymodels isn’t just one package. It’s a family of packages, each serving a distinct purpose. In this post I focus on getting familiar with the essentials: rsample, parsnip, and yardstick. A minimal sketch of how the pieces chain together follows the list.

  • rsample : splits and resamples data (train/test splits, cross-validation folds) to prepare it for models.

  • recipes : used to create ‘recipes’ of preprocessing steps for models. Can be used to design preprocessing/design matrices, and is useful for reproducible research.

  • parsnip : for running the models. It attempts to provide a single interface to multiple modelling packages. If you haven’t realized, the name is also a pun on the original caret package!

  • tune : for fine-tuning the models. Can be helpful in getting a model the way you need it; only used lightly here (via fit_resamples).

  • yardstick : helps in quantifying model performance.
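
To see how those pieces chain together before diving into the penguins, here is a minimal sketch using the built-in iris dataset (my own toy example, not part of the tidytuesday material; it assumes the ranger package is installed): rsample splits, parsnip fits, yardstick scores.

# minimal preview of the workflow: rsample -> parsnip -> yardstick
library(tidymodels)   # loaded again further down with pacman

iris.split <- rsample::initial_split(iris, prop = 0.75)

iris.fit <- parsnip::rand_forest(mode = "classification") %>%
    parsnip::set_engine("ranger") %>%
    fit(Species ~ ., data = rsample::training(iris.split))

iris.fit %>%
    predict(rsample::testing(iris.split)) %>%
    dplyr::bind_cols(rsample::testing(iris.split)) %>%
    yardstick::metrics(truth = Species, estimate = .pred_class)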

I began my machine learning journey with caret, which is wonderfully well built; an advanced user can develop sophisticated scripts with it. Tidymodels has one advantage over caret that I appreciate as someone still learning ml: it is refreshingly easy to change models, metrics, or other details during experimentation.
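
As a small illustration of that flexibility, swapping the model is mostly a one-line change in the specification; everything downstream (fit, predict, metrics) stays the same. A hedged sketch, using multinomial regression with the nnet engine as a stand-in alternative:

# two interchangeable model specifications
rf.spec <- parsnip::rand_forest(mode = "classification") %>%
    parsnip::set_engine("ranger")

mn.spec <- parsnip::multinom_reg(mode = "classification") %>%
    parsnip::set_engine("nnet")

# either spec drops into the same fit()/predict() pipeline used later in this post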

palmer penguins

When I came across tidymodels, this week’s tidytuesday dataset was the palmer penguins. It has been proposed as an alternative to the classic iris dataset.

Below, I show a simple ml model to classify penguins based on body features. I hope it is of help to someone.

data exploration

# load (and install if missing) the packages used in this post
pacman::p_load(magrittr,
               tidyverse,
               ggplot2,
               tidymodels
               )

I’m grabbing the data directly from GitHub, but you can also get it from the palmerpenguins package on CRAN.

There are two files: the smaller penguins dataset and the larger penguins_raw one. I’ll use the smaller penguins file to keep the number of variables down while playing with tidymodels.
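
If you would rather not download the CSV, the same data ships with the palmerpenguins package (assuming it is installed). Note that its penguins object already stores species, island, and sex as factors, unlike the raw CSV.

# alternative: load the dataset from the palmerpenguins package
# install.packages("palmerpenguins")
penguins     <- palmerpenguins::penguins        # the smaller, cleaned dataset
penguins_raw <- palmerpenguins::penguins_raw    # the larger, raw dataset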

penguins     <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character(),
##   year = col_double()
## )
penguins %>% dim
## [1] 344   8
summary(penguins)
##    species             island          bill_length_mm  bill_depth_mm
##  Length:344         Length:344         Min.   :32.10   Min.   :13.10
##  Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60
##  Mode  :character   Mode  :character   Median :44.45   Median :17.30
##                                        Mean   :43.92   Mean   :17.15
##                                        3rd Qu.:48.50   3rd Qu.:18.70
##                                        Max.   :59.60   Max.   :21.50
##                                        NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex                 year
##  Min.   :172.0     Min.   :2700   Length:344         Min.   :2007
##  1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007
##  Median :197.0     Median :4050   Mode  :character   Median :2008
##  Mean   :200.9     Mean   :4202                      Mean   :2008
##  3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009
##  Max.   :231.0     Max.   :6300                      Max.   :2009
##  NA's   :2         NA's   :2
penguins %>%
    skimr::skim()
Data summary

Name                          Piped data
Number of rows                344
Number of columns             8
Column type frequency:
  character                   3
  numeric                     5
Group variables               None

Variable type: character

skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
species                 0            1.00     6     9       0          3            0
island                  0            1.00     5     9       0          3            0
sex                    11            0.97     4     6       0          2            0

Variable type: numeric

skim_variable       n_missing   complete_rate      mean       sd       p0      p25      p50      p75     p100   hist
bill_length_mm              2            0.99     43.92     5.46     32.1    39.23    44.45     48.5     59.6   ▃▇▇▆▁
bill_depth_mm               2            0.99     17.15     1.97     13.1    15.60    17.30     18.7     21.5   ▅▅▇▇▂
flipper_length_mm           2            0.99    200.92    14.06    172.0   190.00   197.00    213.0    231.0   ▂▇▃▅▂
body_mass_g                 2            0.99   4201.75   801.95   2700.0  3550.00  4050.00   4750.0   6300.0   ▃▇▆▃▂
year                        0            1.00   2008.03     0.82   2007.0  2007.00  2008.00   2009.0   2009.0   ▇▁▇▁▇

I need species as a factor, and rows with NAs in the body measurements will be removed to enable downstream analysis.

penguins$species <- penguins$species %>%
    as.factor()
penguins <- penguins[-(penguins$bill_length_mm %>% is.na() %>% which()),]
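
The same cleanup can be written with tidyverse verbs; a sketch of an equivalent (it drops the same two rows, since the four body measurements are missing together):

# equivalent cleanup with dplyr/tidyr:
# species to factor, drop rows missing any body measurement
penguins <- penguins %>%
    dplyr::mutate(species = as.factor(species)) %>%
    tidyr::drop_na(bill_length_mm:body_mass_g)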

Some rapid-fire exploration. The graphs below suggest that each of the metrics can help identify a specific species. All four body measurements have multi-modal distributions, and in every case one peak largely corresponds to a single species: bill_depth_mm can identify Gentoo, bill_length_mm can identify Adelie, while Chinstrap overlaps strongly with one of the other species in every metric. I suspect this overlap will not interfere with identification.

penguins.pivot <- penguins %>%
    pivot_longer(cols = bill_length_mm:body_mass_g,
                 names_to = "metrics",
                 values_to = "values")

penguins.pivot %>%
    ggplot(aes(values, fill = species)) +
    geom_histogram(bins = 20) +
    facet_wrap(~ metrics, scales = "free_x")
Figure: metrics vs species

I don’t make use of the pivot functions often enough. They can be incredibly handy.

machine learning

The following chunk uses rsample to split the dataset into train/test sets. As this is a small dataset, let’s split 65/35. The vfold_cv() folds are needed downstream when resampling and tuning the model.

penguins.split <- penguins %>%
    rsample::initial_split(prop = 0.65, strata = species)

penguins.train <- penguins.split %>%
    rsample::training()

penguins.test <- penguins.split %>%
    rsample::testing()

penguins.cv <- penguins.train %>%
    rsample::vfold_cv(v = 10)

building a model

In one of my first real introductions to tidymodels (which also used the palmer penguins data set), the author made a large effort to use logistic regression. That is the wrong way to go here, because logistic regression expects a two-level response, and we have three species. A model open to a wider range of response variables is needed, so multi-class classification is done here with a random forest via the ranger engine, a faster reimplementation of randomForest.

The set_engine() call is not strictly necessary here: depending on the parsnip model chosen, a default engine is set automatically (ranger for rand_forest). I imagine that for more complicated models it will be necessary to specify the engine explicitly.
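
If you want to check which engines a model type supports, newer versions of parsnip provide a helper for that; I believe something like the following lists them:

# list the available engines (and modes) for rand_forest
parsnip::show_engines("rand_forest")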

The tune package brings a handy fit_resamples() function. As the name suggests, it extends fit() with resampling capabilities: it measures metrics over multiple resamples (folds), which is useful when tuning the model.

model.fit.prep <- parsnip::rand_forest(mode = "classification") %>%
    parsnip::set_engine("ranger") %>%
    tune::fit_resamples(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                        resamples = penguins.cv)

model.fit.prep
## # Resampling results
## # 10-fold cross-validation
## # A tibble: 10 x 4
##    splits           id     .metrics         .notes
##    <list>           <chr>  <list>           <list>
##  1 <split [201/23]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
##  2 <split [201/23]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
##  3 <split [201/23]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
##  4 <split [201/23]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
##  5 <split [202/22]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
##  6 <split [202/22]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
##  7 <split [202/22]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
##  8 <split [202/22]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
##  9 <split [202/22]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
## 10 <split [202/22]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
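
The per-fold metrics live in the .metrics list column; tune::collect_metrics() averages them across the folds:

# summarise the resampling metrics across the 10 folds
model.fit.prep %>%
    tune::collect_metrics()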

Now that we have a basic model, let’s fit it on the full training set.

model.fit <- parsnip::rand_forest(mode = "classification") %>%
    parsnip::set_engine("ranger") %>%
    fit(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                        data = penguins.train)

model.fit
## parsnip model object
##
## Fit time:  43ms
## Ranger result
##
## Call:
##  ranger::ranger(formula = species ~ bill_length_mm + bill_depth_mm +      flipper_length_mm + body_mass_g, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type:                             Probability estimation
## Number of trees:                  500
## Sample size:                      224
## Number of independent variables:  4
## Mtry:                             2
## Target node size:                 10
## Variable importance mode:         none
## Splitrule:                        gini
## OOB prediction error (Brier s.):  0.03363747

evaluating the classification model

Here, yardstick comes into play.

penguins.fit.pred <- model.fit %>%
    predict(penguins.test) %>%
    bind_cols(penguins.test)

Looking at the metrics, the model is acceptable. Other metrics could be computed through metric_set(), which I won’t elaborate on beyond the short example below. I’m satisfied by looking at kappa and accuracy.

penguins.fit.pred %>%
    metrics(species, estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.975
## 2 kap      multiclass     0.960
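
For completeness, here is a hedged sketch of metric_set(), bundling a few extra class metrics with the two above (sens and spec are macro-averaged over the three species):

# a custom bundle of classification metrics
penguin.metrics <- yardstick::metric_set(accuracy, kap, sens, spec)

penguins.fit.pred %>%
    penguin.metrics(truth = species, estimate = .pred_class)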

Receiver operating characteristic (ROC) curves are a popular measure of success: the larger the area under the curve, the better the model distinguishes the species. For this dataset the curves suggest a very high ability to identify species with few type 1 or type 2 errors. Interestingly, I thought that Chinstrap would be difficult to identify, as it always overlapped with one of the other species, but that does not seem to have hurt the classification.
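
If you want that as a single number, yardstick::roc_auc() computes a multiclass AUC (Hand-Till by default) directly from the class probabilities; a short sketch reusing model.fit and penguins.test from above:

# multiclass ROC AUC from the predicted class probabilities
model.fit %>%
    predict(penguins.test, type = "prob") %>%
    bind_cols(penguins.test) %>%
    yardstick::roc_auc(species, .pred_Adelie, .pred_Chinstrap, .pred_Gentoo)

The plot below uses yardstick’s gain_curve() instead, a closely related visualization: curves that climb quickly toward the upper boundary indicate that the model ranks the correct species highly.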

penguins.met <- model.fit %>%
    predict(penguins.test, type = "prob") %>%
    bind_cols(penguins.test) %>%
    yardstick::gain_curve(species,
                          .pred_Adelie,
                          .pred_Chinstrap,
                          .pred_Gentoo)

penguins.met %>%
    autoplot() +
    ggpubr::theme_pubclean() +
    theme(strip.background = element_blank())
Figure: per-species gain curves

summary

As mentioned at the beginning, it is easy to modify the ml approach on the fly. I have enjoyed working with tidymodels and plan to continue with it for a while. The only regret I have about tidymodels is the near lack of documentation: the website is definitely rich in articles, but I had a tough time finding the minutiae of function details. That is to be expected for the time being, while the package develops further. For those who are not happy with tidymodels, there is the alternative ml wrapper mlr3. It is as capable as tidymodels and does not follow the tidy approach, which some may find attractive. Plus, I’ve noticed after writing this article that mlr3 has a bigger following than I thought. I plan to eventually post about my experiences with mlr3 vs tidymodels.