Journal article · 금융정보연구 · Published February 2025

SMEs Default Prediction Using CT-GAN

Original title: CT-GAN을 활용한 중소기업 채무불이행 예측 연구

임창민 (Sungkyunkwan University); 임병화 (Sungkyunkwan University)

Vol. 14, No. 1, pp. 1-32

Abstract

This paper studies machine learning-based prediction of SME defaults using large-scale SME data from Bank A in Korea. Using SME loan information for the five years from 2017 to 2022, a total of 616,007 observations were used to train the machine learning models. In general, SME datasets for default prediction are highly imbalanced, with significantly fewer defaulting firms than non-defaulting firms. To compensate for the limited number of observations and the class imbalance, we generated synthetic data using CT-GAN (Conditional Tabular GAN), a generative AI technique specialized for tabular data. Training machine learning models on datasets that included the synthetic data improved predictive performance. Among the machine learning models, tree-based models such as XGBoost performed best, while artificial neural network models performed relatively poorly. We also used SHAP, a representative explainable AI (XAI) method, to identify the important variables in the models trained with synthetic data; regardless of the synthetic data generation method used, the interest coverage ratio was the most important variable. The study offers important implications for the use of synthetic data and for XAI-based interpretation when applying machine learning models to SME credit evaluation in Korea.

Publisher: 한국금융정보학회
DOI: http://dx.doi.org/10.35214/rfis.14.1.202502.001
Classification: Financial (Monetary) Economics
