Tidymodels & palmer-penguins
Oddly enough, I’ve been hearing a lot about tidymodels this week, from reddit to youtube. Plus, the tidytuesday dataset this week looks fun. I figure I’ll spend a tidytuesday trying to figure out this new and in-development machine-learning package.
For the newcomer, tidymodels is the planned replacement for the caret package. Complicated machine learning (ml) models will continue using caret for a while, as tidymodels is very new to the field; I’m taking the word of others more experienced in the matter, but it does not yet match all of caret’s abilities. That says nothing against tidymodels’ potential, for it has already successfully tied a large number of ml models together under one roof. Its tidy implementation also means it could make ml an easier field to break into for the non-computer scientist like me (I’m the classic biologist turned programmer).
This is my exploration into the models, some of my notes, and my implementation of tidymodels. I’ve referred to a number of sites to get a broad idea; those references are at the very bottom. This will be useful to those new to the package or to machine learning.
the tidymodels packages
tidymodels isn’t just one package. It’s a family of packages, each serving a distinct purpose. In this post I focus on getting familiar with the essentials: rsample, parsnip, and yardstick.
- rsample: for splitting and resampling data (train/test splits, cross-validation folds).
- recipes: used to create ‘recipes’ for models; handy for designing preprocessing matrices, and useful for reproducible research.
- parsnip: for running the models; attempts to provide a single interface to multiple packages. If you haven’t realized, the name is also a pun on the original caret package!
- tune: for fine-tuning the models; can be helpful in getting a model the way you need it. Not attempted here.
- yardstick: helps in quantifying model performance.
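To make those roles concrete, here is a minimal sketch of how the pieces chain together. It is not run here; the calls mirror what I do later in the post, and the recipe step is purely illustrative.
penguins.split <- rsample::initial_split(penguins)              # rsample: split the data
penguins.recipe <- recipes::recipe(species ~ .,                 # recipes: preprocessing
                                   data = rsample::training(penguins.split)) %>%
  recipes::step_normalize(recipes::all_numeric())
model.spec <- parsnip::rand_forest(mode = "classification") %>% # parsnip: model spec
  parsnip::set_engine("ranger")
# tune: resampling/tuning helpers such as tune::fit_resamples()
# yardstick: metrics such as yardstick::metrics() or yardstick::roc_auc()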
I began my machine learning journey with caret, which is wonderfully well-built; any advanced user could easily develop sophisticated scripts. Still, tidymodels has an advantage over caret that I appreciate as someone still learning ml: it is refreshingly easy to change models, metrics, or other details during experimentation.
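As a small illustration of that flexibility (a sketch of mine, not from the analysis below): swapping the random forest for, say, a decision tree changes only the model specification, while the rest of the pipeline stays untouched.
model.spec <- parsnip::rand_forest(mode = "classification") %>%
  parsnip::set_engine("ranger")
# to experiment with a different model, only the specification changes:
model.spec <- parsnip::decision_tree(mode = "classification") %>%
  parsnip::set_engine("rpart")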
palmer penguins
When I came across tidymodels, this week’s tidytuesday dataset was the palmer penguins, which is proposed as an alternative to the classic iris dataset.
Below, I show a simple ml model to classify penguins based on body features. I hope it is of help to someone.
data exploration
pacman::p_load(magrittr,
tidyverse,
ggplot2,
tidymodels
)
I’m grabbing the data directly from github, but you can also get it from the palmerpenguins package on CRAN. There are two files: the smaller, curated penguins dataset and the larger raw one. I’ll make use of the smaller penguins to minimize the variables while playing with tidymodels.
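For reference, the CRAN route looks like this (assuming the palmerpenguins package is installed; note that its built-in copy already stores species and sex as factors):
# install.packages("palmerpenguins")
penguins <- palmerpenguins::penguins
Here, though, I read the csv straight from github: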
penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
## Parsed with column specification:
## cols(
## species = col_character(),
## island = col_character(),
## bill_length_mm = col_double(),
## bill_depth_mm = col_double(),
## flipper_length_mm = col_double(),
## body_mass_g = col_double(),
## sex = col_character(),
## year = col_double()
## )
penguins %>% dim
## [1] 344 8
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Length:344 Length:344 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
## Mode :character Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 Length:344 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197.0 Median :4050 Mode :character Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
penguins %>%
skimr::skim()
| Data summary | |
|---|---|
| Name | Piped data |
| Number of rows | 344 |
| Number of columns | 8 |
| Column type frequency: character | 3 |
| Column type frequency: numeric | 5 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| species | 0 | 1.00 | 6 | 9 | 0 | 3 | 0 |
| island | 0 | 1.00 | 5 | 9 | 0 | 3 | 0 |
| sex | 11 | 0.97 | 4 | 6 | 0 | 2 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
I need species as a factor, and all rows with NAs must be removed to enable the downstream analysis.
penguins$species <- penguins$species %>%
  as.factor()
# drop the rows where bill_length_mm is NA (the two rows missing every body measurement)
penguins <- penguins[-(penguins$bill_length_mm %>% is.na() %>% which()),]
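An equivalent, arguably tidier way to write that filter is tidyr::drop_na():
# same effect as the indexing above
penguins <- penguins %>%
  tidyr::drop_na(bill_length_mm)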
Some rapid-fire exploration. The graphs below suggest that each of the metrics can help identify a specific penguin. All four body measurements have a multi-modal distribution, and in every case a peak is representative of one species. For example, bill_depth_mm can identify Gentoo and bill_length_mm Adelie, while Chinstrap has a strong overlap with one of the other species in every metric. I imagine that this overlap will not interfere with the identification.
penguins.pivot <- penguins %>%
pivot_longer(cols = bill_length_mm:body_mass_g,
names_to = "metrics",
values_to = "values")
penguins.pivot %>%
ggplot(aes(values, fill = species)) +
geom_histogram(bins = 20) +
facet_wrap(~ metrics, scales = "free_x")
I don’t make use of the pivot functions often enough. They can be incredibly handy.
machine learning
The following chunk makes use of rsample to split the dataset into train and test sets. As this is a small dataset, let’s split 65/35. The vfold_cv() folds are necessary downstream when tuning the model.
penguins.split <- penguins %>%
rsample::initial_split(prop = 0.65, strata = species)
penguins.train <- penguins.split %>%
rsample::training()
penguins.test <- penguins.split %>%
rsample::testing()
penguins.cv <- penguins.train %>%
rsample::vfold_cv(v = 10)
building a model
In one of my first real introductions to tidymodels (which was also with the palmer penguins data set), the author made a large effort to use logistic regression. That is the wrong way to go, because logistic regression expects a two-level response, and species has three levels. Multi-class classification is done here with a random forest through the engine ranger, which is improved over randomForest.
The set_engine() call is not a necessity here: depending on the parsnip model chosen, the engine is set automatically, and rand_forest() defaults to ranger. I imagine that for the complicated models, it will be necessary to specify the engine.
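To check which engines a model supports (and confirm the default), parsnip has a helper for this, assuming a reasonably recent version of the package:
# list the engines registered for rand_forest; ranger is the default
parsnip::show_engines("rand_forest")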
The package tune brings a handy fit_resamples() function. As the name suggests, it extends the fit() function with resampling capabilities: specifically, it measures metrics over multiple resamples (folds) for tuning the model.
model.fit.prep <- parsnip::rand_forest(mode = "classification") %>%
parsnip::set_engine("ranger") %>%
tune::fit_resamples(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
resamples = penguins.cv)
model.fit.prep
## # Resampling results
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [201/23]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
## 2 <split [201/23]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
## 3 <split [201/23]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
## 4 <split [201/23]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
## 5 <split [202/22]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
## 6 <split [202/22]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
## 7 <split [202/22]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
## 8 <split [202/22]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
## 9 <split [202/22]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
## 10 <split [202/22]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
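The per-fold results sit in the .metrics list column; tune::collect_metrics() averages them across folds (for classification, fit_resamples() tracks accuracy and roc_auc by default):
# summarize the resampling metrics over the ten folds
model.fit.prep %>%
  tune::collect_metrics()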
Now that we have a basic model, let’s fit it on the training set.
model.fit <- parsnip::rand_forest(mode = "classification") %>%
parsnip::set_engine("ranger") %>%
fit(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = penguins.train)
model.fit
## parsnip model object
##
## Fit time: 43ms
## Ranger result
##
## Call:
## ranger::ranger(formula = species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 224
## Number of independent variables: 4
## Mtry: 2
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.03363747
evaluating the classification model
Here, yardstick comes into play.
penguins.fit.pred <- model.fit %>%
predict(penguins.test) %>%
bind_cols(penguins.test)
Looking at the metrics, the model is acceptable. Other metrics could be computed through metric_set(); a quick sketch of that follows the output below. I’m satisfied by looking at kappa and accuracy.
penguins.fit.pred %>%
metrics(species, estimate = .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.975
## 2 kap multiclass 0.960
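Here is that metric_set() sketch: bundle the metrics you care about into a single function and call it like metrics(). Including f_meas is my own choice for illustration, not part of the original metric list.
multi.metrics <- yardstick::metric_set(yardstick::accuracy,
                                       yardstick::kap,
                                       yardstick::f_meas)
penguins.fit.pred %>%
  multi.metrics(truth = species, estimate = .pred_class)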
Receiver operating characteristic (ROC) curves are a popular measure of success: the larger the area under the curve, the greater the ability to identify species. The closely related gain curves plotted below suggest a very high ability to identify species with few type 1 or type 2 errors. Interestingly, I thought that Chinstrap would be difficult to identify, as it always overlapped with one of the other species. This seems to have been a benefit in identification.
penguins.met <- model.fit %>%
predict(penguins.test, type = "prob") %>%
bind_cols(penguins.test) %>%
yardstick::gain_curve(species,
.pred_Adelie,
.pred_Chinstrap,
.pred_Gentoo)
penguins.met %>%
autoplot() +
ggpubr::theme_pubclean() +
theme(strip.background = element_blank())
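Since the prose above is about ROC, here is the matching one-vs-all ROC plot and the multiclass area under the curve; yardstick takes the same truth/probability columns as gain_curve():
penguins.prob <- model.fit %>%
  predict(penguins.test, type = "prob") %>%
  bind_cols(penguins.test)
# one-vs-all ROC curve per species
penguins.prob %>%
  yardstick::roc_curve(species, .pred_Adelie, .pred_Chinstrap, .pred_Gentoo) %>%
  autoplot()
# multiclass AUC (Hand-Till generalization by default)
penguins.prob %>%
  yardstick::roc_auc(species, .pred_Adelie, .pred_Chinstrap, .pred_Gentoo)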
summary
As mentioned in the beginning, it is easy to modify the ml approach on the fly. I have enjoyed working with tidymodels, and plan to continue with it for a while. The only regret I have about tidymodels is the near lack of documentation. The website is definitely rich in articles, but I am having a tough time finding the minutiae of function details. That is to be expected for the time being while the package develops further. For those who are not happy with tidymodels, there is the alternative ml wrapper mlr3. It is as capable as tidymodels and does not follow the tidy approach, which some may find attractive. Plus, I’ve noticed after writing this article that mlr3 has a bigger following than I thought. I plan to eventually post about my experiences with mlr3 vs tidymodels.