train.model.Rd
Description:

This function trains a machine learning model on the training data.

Usage:
train.model(siamcat, method = "lasso", measure = "classif.acc",
    param.set = NULL, grid.size = 11, min.nonzero = 5, perform.fs = FALSE,
    param.fs = list(no_features = 100, method = "AUC", direction = "absolute"),
    feature.type = "normalized", verbose = 1)
Arguments:

siamcat: object of class siamcat-class

method: string, specifies the type of model to be trained; may be one of
    c('lasso', 'enet', 'ridge', 'lasso_ll', 'ridge_ll', 'randomForest')

measure: character, specifies the model selection criterion during internal
    cross-validation, see mlr_measures for more details; defaults to
    'classif.acc'

param.set: list, set of extra parameters for mlr3, see Details; defaults
    to NULL

grid.size: integer, grid size for internal tuning (needed for some machine
    learning methods, for example 'lasso_ll'); defaults to 11

min.nonzero: integer, minimum number of nonzero coefficients that should be
    present in the model (only for 'lasso', 'ridge', and 'enet'); defaults
    to 5

perform.fs: boolean, should feature selection be performed? Defaults to
    FALSE

param.fs: list, parameters for the feature selection, see Details; defaults
    to list(no_features=100, method="AUC", direction="absolute")

feature.type: string, on which type of features should the function work?
    Can be either "original", "filtered", or "normalized". Please only
    change this parameter if you know what you are doing!

verbose: integer, control output: 0 for no output at all, 1 for information
    about progress and success only, 2 for normal level of information, and
    3 for full debug information; defaults to 1
Value:

object of class siamcat-class with added model_list
Details:

This function performs the training of the machine learning model and
functions as an interface to the mlr3 package.

The function expects a siamcat-class object with a prepared cross-validation
(see create.data.split) in the data_split slot of the object. It then trains
a model for each fold of the data split, as sketched below.
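A minimal sketch of the expected workflow (the values for num.folds and
num.resample here are purely illustrative):

data(siamcat_example)
siamcat_example <- create.data.split(siamcat_example, num.folds = 5,
    num.resample = 2)
siamcat_example <- train.model(siamcat_example, method = 'lasso')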
The different machine learning methods are implemented as Learners from the
mlr3learners package:

'lasso', 'enet', and 'ridge' use the 'classif.cv_glmnet' or 'regr.cv_glmnet'
Learners, which interface to the glmnet package,

'lasso_ll' and 'ridge_ll' use a custom Learner, which is only available for
classification tasks. The underlying package is the LiblineaR package.

'randomForest' is implemented via the 'classif.ranger' or 'regr.ranger'
Learners available through the ranger package.
There is additional control over the machine learning procedure by supplying
information through the param.set parameter of the function. We encourage
you to check out the excellent mlr3 documentation for more in-depth
information. Here is a short overview of which parameters you can supply in
which form:
enet: The alpha parameter describes the mixture between lasso and ridge
penalty and is, by default, determined via internal cross-validation (the
default is equivalent to param.set=list('alpha'=c(0,1))). You can supply
either the limits of the hyperparameter exploration (e.g. with limits 0.2
and 0.8: param.set=list('alpha'=c(0.2,0.8))) or a fixed alpha value
(param.set=list('alpha'=0.5)).
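For instance, to fix alpha to a single value (0.5 is purely illustrative):

siamcat_example <- train.model(siamcat_example, method = 'enet',
    param.set = list('alpha' = 0.5))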
lasso_ll/ridge_ll: You can supply both class.weights and the cost parameter
(cost of the constraints violation, see LiblineaR for more info). The
default values are equivalent to param.set=list('class.weights'=c(5, 1),
'cost'=c(-2, 3)).
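As a sketch, supplying a narrower cost range (values illustrative and
intended on the same scale as the default):

siamcat_example <- train.model(siamcat_example, method = 'lasso_ll',
    param.set = list('class.weights' = c(5, 1), 'cost' = c(-1, 2)))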
randomForest: You can supply the two parameters num.trees (number of trees
to grow) and mtry (number of variables randomly sampled as candidates at
each split). See also ranger for more info. The default values correspond to
param.set=list('num.trees'=c(100, 1000), 'mtry'=
c(round(sqrt.mdim / 2), round(sqrt.mdim), round(sqrt.mdim * 2))) with
sqrt.mdim=sqrt(nrow(data)).
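As a sketch, supplying a fixed tuning grid (values illustrative):

siamcat_example <- train.model(siamcat_example, method = 'randomForest',
    param.set = list('num.trees' = c(100, 500), 'mtry' = c(4, 8, 16)))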
If feature selection should be performed (for example for functional data
with a large number of features), the param.fs list should contain (see also
the sketch after this list):

no_features - number of features to be retained after feature selection,

method - method for the feature selection; may be AUC, gFC, or Wilcoxon for
binary classification problems, or spearman, pearson, or MI (mutual
information) for regression problems,

direction - indicates if the feature selection should be performed in a
single direction only. Can be either:

absolute - select the top associated features (independent of the sign of
enrichment),

positive - select the top positively associated features (enriched in the
case group for binary classification or enriched in higher values for
regression),

negative - select the top negatively associated features (inverse of
positive).

direction will be ignored for Wilcoxon and MI.
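A sketch of a call with feature selection enabled (the number of retained
features is illustrative):

siamcat_example <- train.model(siamcat_example, method = 'lasso',
    perform.fs = TRUE,
    param.fs = list(no_features = 50, method = 'AUC', direction = 'absolute'))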
Examples:

data(siamcat_example)
# simple working example
siamcat_example <- train.model(siamcat_example, method='lasso')
#> Trained lasso models successfully.