Default Prediction for Real Estate Companies with Imbalanced Dataset
Yuan-Xiang Dong (Chongqing University); Zhi Xiao (Chongqing University); Xue Xiao (National University of Singapore)
Vol. 10, No. 2, pp. 314-333
Abstract
When predicting default among real estate companies, the number of non-defaulted cases greatly exceeds the number of defaulted ones, which creates a two-class imbalance problem. This imbalance lowers the ability of prediction models to distinguish the default sample. To avoid this sample selection bias and to improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. Logistic regression, support vector machine (SVM) classification, and neural network (NN) classification trained on the imbalanced dataset serve as benchmarks. They are compared with single prediction models trained on a balanced dataset corrected by the minority sample generation approach. Instead of prediction-oriented tests and overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models on the imbalanced dataset. We describe an empirical experiment on a sample of 14 defaulted and 315 non-defaulted listed real estate companies in China and report that most single prediction models produced better results with the balanced dataset than with the imbalanced dataset.
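The following minimal sketch illustrates why the abstract prefers TPR, TNR, G-mean, and F-score over overall accuracy. It assumes the common definitions (G-mean as the geometric mean of TPR and TNR, F-score as the balanced F1 measure, defaulted companies as the positive class); the paper itself may define them differently. The function name `imbalance_metrics` and the trivial classifier are hypothetical illustrations, not results from the paper. Only the class counts (14 and 315) come from the abstract.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute accuracy, TPR, TNR, G-mean, and F-score from confusion-matrix counts.

    Assumption: defaulted companies are the positive class, G-mean is
    sqrt(TPR * TNR), and F-score is the balanced F1 measure.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # share of defaults detected
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # share of non-defaults detected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(tpr * tnr)
    if precision + tpr:
        f_score = 2 * precision * tpr / (precision + tpr)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": g_mean,
        "f_score": f_score,
    }


# Hypothetical classifier that labels all 329 companies as non-default:
# every one of the 14 defaults is missed, every one of the 315 non-defaults is correct.
trivial = imbalance_metrics(tp=0, fn=14, tn=315, fp=0)
for name, value in trivial.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the trivial classifier reaches an overall accuracy of 315/329 (about 95.7%) while its TPR, G-mean, and F-score are all zero, which is the kind of misleading result that the class-sensitive measures are meant to expose.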
- Publisher:
- Korea Information Processing Society
- Classification:
- Other Computer Science