On the design of distributed and scalable feature selection algorithms

Author:
  1. Palma Mendoza, Raúl José
Supervised by:
  1. Luis de Marcos Ortega (Director)
  2. Daniel Rodríguez García (Co-director)

Defence university: Universidad de Alcalá

Date of defence: 03 October 2019

Committee:
  1. José Javier Dolado Cosín (Chair)
  2. Ana Castillo Martínez (Secretary)
  3. Verónica Bolón-Canedo (Committee member)

Type: Thesis

Teseo: 150887 (DIALNET, TESEO)

Abstract

Feature selection is an important stage in the pre-processing of data prior to the training of a data mining model, and it is part of many data analysis processes. The objective of feature selection is to detect, within a set of features, which are the most relevant and which are redundant according to some established metric. This makes it possible to build more efficient and interpretable data mining models; in addition, by reducing the number of features, future data collection costs can be lowered. Currently, as a consequence of the phenomenon widely known as “big data”, the datasets available for analysis are growing in size. As a result, many existing data mining algorithms become unable to process them completely and, depending on their size, even feature selection algorithms themselves become unable to process them directly. Given that this trend towards ever larger datasets is not expected to cease, scalable feature selection algorithms, capable of increasing their processing capacity by taking advantage of the resources of computer clusters, become very important. This doctoral dissertation presents the redesign of two widely known feature selection algorithms, ReliefF and CFS; both were redesigned to be scalable and capable of processing large volumes of data. This is demonstrated by an extensive comparison of both proposals with their original versions, as well as with other scalable versions designed for similar purposes. All comparisons were made using large publicly available datasets. The implementations were developed with Apache Spark, which has nowadays become a reference framework in the “big data” field. The source code has been made available through a public GitHub repository.
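
As background for the CFS proposal mentioned above: CFS (Correlation-based Feature Selection) scores candidate feature subsets with a merit heuristic that rewards correlation with the class while penalizing redundancy among the selected features. The sketch below, written in Scala (the language of the Spark ecosystem), shows only that well-known scoring formula under the assumption that the mean correlations have already been computed; the object and function names are illustrative and are not taken from the dissertation's actual implementation.

```scala
// A minimal sketch of the CFS merit heuristic; names are illustrative,
// not the dissertation's API.
object CfsMerit {

  /** CFS subset merit:
    *
    *   merit = (k * rcf) / sqrt(k + k * (k - 1) * rff)
    *
    * where k is the number of features in the subset, rcf the mean
    * feature-class correlation and rff the mean feature-feature
    * intercorrelation. Higher merit means features that predict the
    * class well while remaining mutually non-redundant.
    */
  def merit(k: Int, rcf: Double, rff: Double): Double = {
    require(k > 0, "the subset must contain at least one feature")
    (k * rcf) / math.sqrt(k + k * (k - 1) * rff)
  }

  def main(args: Array[String]): Unit = {
    // Three features that correlate well with the class (0.6 on average)
    // and weakly with each other (0.2) score higher ...
    println(f"low redundancy:  ${merit(3, 0.6, 0.2)}%.3f") // ~0.878
    // ... than the same features with high mutual redundancy (0.8).
    println(f"high redundancy: ${merit(3, 0.6, 0.8)}%.3f") // ~0.644
  }
}
```

The search over subsets (best-first search in the original CFS) evaluates this merit repeatedly, so distributing the computation of the underlying correlations across a cluster is the natural route to scalability that a Spark-based redesign can exploit.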