Supervised machine learning: a theoretical study with applications

NUÑEZ GONZALEZ, JOSE DAVID

Supervised by:

Rosario Delgado de la Torre Director

Defence university: Universitat Autònoma de Barcelona

Defence date: 09 November 2022

Committee:

Xavier Bardina Simorra Chair
David Moriña Soler Secretary
Raquel Iniesta Benedicto Committee member

Type: Thesis

Teseo: 822751

Abstract

This thesis falls within the field of Supervised Machine Learning and presents a theoretical study with applications. Specifically, it makes contributions at different moments of the Machine Learning life cycle from an integral point of view, focusing on the three fundamental stages of the cycle: preprocessing of the dataset, construction of the predictive model (classifier), and validation of the model using performance metrics.
The first work focuses on the preprocessing phase. We propose a novel oversampling method that uses a Bayesian network as the probabilistic model of the dependence relationships between the features in the minority class, in order to generate artificial instances of that class for datasets with categorical and/or continuous variables. It relies on the fact that the likelihood is a measure of the goodness of fit of a model to a set of instances. This paradigm differs from the one on which existing oversampling methods are based, namely the idea of distance between the features, which turns out to be a weakness when applied to datasets with non-continuous variables.
The second work concerns the construction of a predictive model, specifically a classifier. We have implemented an expert system based on an ensemble of Bayesian classifiers to support decision making in the Intensive Care Unit (ICU) of the Hospital of Mataró. The system predicts the vital outcome of a patient admitted to the ICU (live/die), as well as the destination upon discharge from the ICU if the prediction is "live", or the cause of death if it is "die". The combination rule that derives the ensemble's prediction from the predictions of the base classifiers is a Weighted Average, compatible with the MAP criterion, whose weights are based on the Area Under the Precision-Recall curve (AUPR), a metric suitable for dealing with unbalanced datasets.
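The weighted-average combination rule described above can be illustrated with a minimal sketch. The abstract does not give the exact weight formula, so the code assumes, purely for illustration, that each base classifier's weight is its AUPR divided by the sum of all AUPRs. The function name `weighted_average_ensemble` and the input values are hypothetical.

```python
def weighted_average_ensemble(posteriors, auprs):
    """Combine class posteriors from several base classifiers.

    posteriors: one list of class probabilities per base classifier.
    auprs: one AUPR score per base classifier (assumed already computed
           on validation data; the normalisation below is an assumption,
           not the thesis's stated formula).
    Returns the combined posterior and the index of the predicted class.
    """
    total = sum(auprs)
    weights = [a / total for a in auprs]  # weights sum to 1
    n_classes = len(posteriors[0])
    # Convex combination of the base posteriors, class by class.
    combined = [
        sum(w * p[c] for w, p in zip(weights, posteriors))
        for c in range(n_classes)
    ]
    # MAP-style decision: choose the class with the highest combined posterior.
    predicted = max(range(n_classes), key=combined.__getitem__)
    return combined, predicted


# Hypothetical posteriors for classes ["live", "die"] from three base classifiers.
combined, predicted = weighted_average_ensemble(
    posteriors=[[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]],
    auprs=[0.9, 0.5, 0.6],
)
```

Because the weights are non-negative and sum to one, the combined vector is itself a probability distribution over the classes, which is what allows the final decision to be read as a maximum a posteriori choice.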
The last contribution addresses the validation phase. We introduce an improvement to the original definition of the Confusion ENtropy (CEN) metric, which is based on Shannon's entropy from the field of Information Theory and measures the uncertainty entailed by the result of a classification process. This modification avoids the undesired behaviour shown by CEN, which in some cases is "out of range" and in others lacks monotonicity as the situation goes monotonically from perfect to completely wrong classification.
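The out-of-range behaviour of the original CEN can be reproduced with a short sketch. It assumes the commonly cited original definition of CEN, which this abstract does not reproduce: misclassification probabilities are normalised by all instances whose true or predicted class is j, and logarithms are taken in base 2(N-1) for N classes. It does not implement the modified metric proposed in the thesis, and the function name `confusion_entropy` is hypothetical.

```python
import math


def confusion_entropy(cm):
    """Original CEN of a square confusion matrix with at least two classes.

    cm[i][j] is the number of instances of true class i predicted as class j.
    Assumes the commonly cited definition; the matrix must not be empty.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    base = 2 * (n - 1)  # logarithm base used by the assumed definition
    cen = 0.0
    for j in range(n):
        # Instances whose true or predicted class is j (diagonal counted twice).
        denom = sum(cm[j][k] + cm[k][j] for k in range(n))
        if denom == 0:
            continue
        cen_j = 0.0
        for k in range(n):
            if k == j:
                continue
            # Both directions of confusion between classes j and k.
            for count in (cm[j][k], cm[k][j]):
                p = count / denom
                if p > 0:  # 0 * log(0) is taken as 0
                    cen_j -= p * math.log(p, base)
        # Weight each class by its share of the doubled total.
        cen += denom / (2 * total) * cen_j
    return cen


perfect = confusion_entropy([[5, 0], [0, 5]])  # no errors
mixed = confusion_entropy([[1, 2], [2, 1]])    # binary case above 1
```

Under this definition the perfect classifier scores 0, while the binary matrix `[[1, 2], [2, 1]]` scores about 1.057, which lies outside the interval [0, 1] expected of the metric.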