Academic paper, 기업경영리뷰 (Business Management Review), published November 2024

Construction and Performance Analysis of an Automatic Classification Model for Domestic Academic Papers Using BERT and Llama2

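강광선 (프로텐 Technology Research Institute); 이염남 (Graduate School of Technology Management, Kyung Hee University); 홍아름 (Graduate School of Technology Management, Kyung Hee University)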

Vol. 15, No. 4, pp. 103-128

Abstract

Foundation large language models are being released at a competitive pace. Among them, Meta’s Llama2, released in July 2023, was opened to the research community and has shown strong, validated performance even under limited computing resources. Llama2 delivers performance comparable to ChatGPT 3.5 while remaining commercially usable by small and medium-sized enterprises. This study compares the readily accessible BERT and Llama2 models in the context of building an automatic classification model for academic papers. The training data are approximately 160,000 records from AI-HUB’s “Paper Summary” (논문자료 요약) dataset, covering 1995 to 2020. The target labels are the eight categories defined by the National Research Foundation of Korea’s research field classification. The experiments indicate that BERT is more effective for short text inputs, while Llama2 is more effective for medium-length and longer inputs. By category, BERT performed better in social sciences, engineering, and agriculture/fisheries/marine sciences, whereas Llama2 performed better in humanities, natural sciences, medicine and pharmacy, arts and sports, and interdisciplinary studies. Companies and organizations choosing a model for automatic document classification or other natural language processing tasks should therefore weigh the characteristics of the input data, the sensitivity of the classification task, and their goals.

Publisher: KNU 기업경영연구소 (Institute of Business Management)
Category: General Business Administration
