This function trains a machine learning model on the training data.

train.model(siamcat, method = "lasso", measure = "classif.acc",
    param.set = NULL, grid.size = 11, min.nonzero = 5, perform.fs = FALSE,
    param.fs = list(no_features = 100, method = "AUC", direction = "absolute"),
    feature.type = "normalized", verbose = 1)



siamcat
object of class siamcat-class

method
string, specifies the type of model to be trained, may be one of these: c('lasso', 'enet', 'ridge', 'lasso_ll', 'ridge_ll', 'randomForest')

measure
character, specifies the model selection criterion during internal cross-validation, see mlr_measures for more details, defaults to 'classif.acc'

param.set
list, set of extra parameters for mlr3, see below for details, defaults to NULL

grid.size
integer, grid size for internal tuning (needed for some machine learning methods, for example lasso_ll), defaults to 11

min.nonzero
integer, minimum number of nonzero coefficients that should be present in the model (only for 'lasso', 'ridge', and 'enet'), defaults to 5

perform.fs
boolean, should feature selection be performed? Defaults to FALSE

param.fs
list, parameters for the feature selection, see Details, defaults to list(no_features=100, method="AUC", direction="absolute")

feature.type
string, on which type of features should the function work? Can be either "original", "filtered", or "normalized". Please only change this parameter if you know what you are doing!

verbose
integer, control output: 0 for no output at all, 1 for only information about progress and success, 2 for normal level of information and 3 for full debug information, defaults to 1


object of class siamcat-class with added model_list

Machine learning methods

This function trains the machine learning model and serves as an interface to the mlr3 package.

The function expects a siamcat-class object with a prepared cross-validation (see the data_split slot of the object). It then trains a model for each fold of the data split.
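As a minimal sketch of that workflow, the example below prepares a data split and then trains one model per fold. It assumes the SIAMCAT package is installed, uses the bundled siamcat_example data set, and relies on create.data.split() from the same package to fill the data_split slot; the fold and repeat counts are illustrative only.

```r
library(SIAMCAT)
data(siamcat_example)

# Prepare the cross-validation scheme: 5 folds, repeated twice
# (values chosen for illustration, not as a recommendation)
sc <- create.data.split(siamcat_example, num.folds = 5, num.resample = 2)

# train.model() then fits one model for each fold of each repeat
sc <- train.model(sc, method = "lasso")
```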

The different machine learning methods are implemented as Learners from the mlr3learners package:

  • 'lasso', 'enet', and 'ridge' use the 'classif.cv_glmnet' or 'regr.cv_glmnet' Learners, which interface to the glmnet package,

  • 'lasso_ll' and 'ridge_ll' use a custom Learner, which is only available for classification tasks. The underlying package is the LiblineaR package.

  • 'randomForest' is implemented via the 'classif.ranger' or 'regr.ranger' Learners, which are available through the ranger package.

Hyperparameter tuning

You can exert additional control over the machine learning procedure through the param.set parameter of the function. We encourage you to check out the excellent mlr3 documentation for more in-depth information.

Here is a short overview of which parameters you can supply and in which form:

  • enet - The alpha parameter describes the mixture between the lasso and ridge penalties and is, by default, determined using internal cross-validation (the default is equivalent to param.set=list('alpha'=c(0,1))). You can supply either the limits of the hyperparameter exploration (e.g. with limits 0.2 and 0.8: param.set=list('alpha'=c(0.2,0.8))) or a fixed alpha value (param.set=list('alpha'=0.5)).

  • lasso_ll/ridge_ll - You can supply both class.weights and the cost parameter (cost of constraint violation, see LiblineaR for more info). The default values are equivalent to param.set=list('class.weights'=c(5, 1), 'cost'=c(-2, 3)).

  • randomForest - You can supply the two parameters num.trees (number of trees to grow) and mtry (number of variables randomly sampled as candidates at each split). See also ranger for more info. The default values correspond to param.set=list('num.trees'=c(100, 1000), 'mtry'= c(round(sqrt.mdim / 2), round(sqrt.mdim), round(sqrt.mdim * 2))) with sqrt.mdim=sqrt(nrow(data)).
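The two documented forms of the alpha setting for enet can be sketched as follows. This assumes the SIAMCAT package is installed and uses the bundled siamcat_example data set; the object names enet_range and enet_fixed are placeholders chosen for this example.

```r
library(SIAMCAT)
data(siamcat_example)

# Elastic net: restrict the alpha search to the limits 0.2 and 0.8
enet_range <- train.model(siamcat_example, method = "enet",
    param.set = list(alpha = c(0.2, 0.8)))

# Elastic net: fix alpha at a single value instead of tuning it
enet_fixed <- train.model(siamcat_example, method = "enet",
    param.set = list(alpha = 0.5))
```

Supplying limits keeps alpha under the control of the internal cross-validation, whereas a fixed value removes it from tuning altogether.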

Feature selection

If feature selection should be performed (for example for functional data with a large number of features), the param.fs list should contain:

  • no_features - Number of features to be retained after feature selection,

  • method - method for the feature selection, may be AUC, gFC, or Wilcoxon for binary classification problems or spearman, pearson, or MI (mutual information) for regression problems

  • direction - indicates whether the feature selection should be restricted to a single direction of association. Can be one of

    • absolute - select the top associated features (independent of the sign of enrichment),

    • positive - select the top positively associated features (enriched in the case group for binary classification or enriched in higher values for regression),

    • negative - select the top negatively associated features (inverse of positive)

    Direction will be ignored for Wilcoxon and MI.
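A minimal sketch of a call with feature selection enabled is shown below. It assumes the SIAMCAT package is installed and that the bundled siamcat_example data set retains more than 50 normalized features; the value 50 and the name fs_params are illustrative choices, not defaults.

```r
library(SIAMCAT)
data(siamcat_example)

# Keep the 50 features with the strongest association by AUC,
# regardless of the sign of enrichment
fs_params <- list(no_features = 50, method = "AUC", direction = "absolute")

siamcat_fs <- train.model(siamcat_example, method = "lasso",
    perform.fs = TRUE, param.fs = fs_params)
```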



# simple working example
library(SIAMCAT)
data(siamcat_example)
siamcat_example <- train.model(siamcat_example, method='lasso')
#> Trained lasso models successfully.