Boosting methods for imbalanced classification with categorical predictors: an application to predicting university dropout
Publisher
Pontificia Universidad Católica del Perú
Full-text access restricted to the PUCP community
Abstract
This research studies the performance of boosting methods on imbalanced classification problems
with categorical covariates and applies them to a dataset on university dropout. The problem is
characterized by a response variable with a low proportion of positive cases (students who drop
out) and by a predominance of categorical variables.
A simulation study was conducted to evaluate the performance of XGBoost and CatBoost against
logistic regression under different sample sizes and imbalance levels. The evaluation compared
variable-generation methods (logistic-based and quartile-based) and model-assessment schemes
(train-test split and k-fold cross-validation). It also used metrics suited to imbalanced data,
such as G-Mean, Kappa, and MCC. These metrics give a better picture of performance than plain
accuracy, especially when assessing the model's ability to correctly identify dropouts. The study
also analyzed the cutoff point, adjusting the decision threshold according to the G-Mean metric
so as to prioritize correct classifications by the predictive methods.
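As a rough illustration of the metrics and the cutoff analysis mentioned above, the sketch below computes G-Mean, Kappa, and MCC from a binary confusion matrix and picks the threshold that maximizes G-Mean. It assumes the usual textbook definitions (G-Mean as the square root of sensitivity times specificity, Kappa as Cohen's kappa, MCC as the Matthews correlation coefficient) and a simple grid search; the thesis may define or tune these differently. All function names and the toy data are hypothetical and are not taken from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = dropout."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

def cohen_kappa(tp, fp, tn, fn):
    """Agreement beyond chance between predictions and labels."""
    n = tp + fp + tn + fn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient (0.0 when undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, scores, grid=None):
    """Grid-search the cutoff that maximizes G-Mean (assumed procedure)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_g = 0.5, -1.0
    for t in grid:
        y_pred = [1 if s >= t else 0 for s in scores]
        g = g_mean(*confusion_counts(y_true, y_pred))
        if g > best_g:  # keep the first threshold reaching the maximum
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical toy data: 2 dropouts among 10 students, with low predicted scores.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.35, 0.20, 0.10, 0.15, 0.05, 0.25, 0.12, 0.08]

default_counts = confusion_counts(y_true, [1 if s >= 0.5 else 0 for s in scores])
tuned_t, tuned_g = best_threshold(y_true, scores)
print("G-Mean at cutoff 0.5:", g_mean(*default_counts))
print("Tuned cutoff and G-Mean:", tuned_t, tuned_g)
```

On this toy sample the default cutoff of 0.5 labels every student as a non-dropout: accuracy is 0.8, yet G-Mean, Kappa, and MCC are all zero, which is the kind of failure these metrics are meant to expose.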
Finally, the models were applied to real data from university students. The most influential
variables were parental education level, family income, and prior work experience. CatBoost
showed the best performance on key metrics and was the most robust to imbalance and to the
categorical nature of the data. The results support the use of boosting methods, especially
CatBoost, in educational settings where students at risk of dropping out need to be identified.
Keywords
Machine learning (Artificial intelligence), University students--Dropouts, Statistics--Forecasting, Statistics--Mathematical models, Regression analysis
Creative Commons License
Except where otherwise noted, this item's license is described as info:eu-repo/semantics/embargoedAccess
