LLM-Based Semantic Noise Filtering Method for Patent Text Data
임진성 (Department of Technology Management, Graduate School, Gyeongsang National University); 송지훈 (Gyeongsang National University)
Vol. 28, No. 5, pp. 1379-1389
Abstract
Patent data are essential for tracking technological progress, assessing competitiveness, and forecasting future developments. However, the rapid evolution of technology and the rise of convergent fields make filtering irrelevant data a persistent challenge. Traditional statistical models and manual preprocessing by researchers require substantial time and effort, prompting continuous research on efficient information structuring. In particular, filtering methods based on statistics or keywords have limitations in fully capturing subtle technical nuances and complex contexts. To address these limitations, this study proposes a semantic noise filtering methodology for patent data leveraging the contextual understanding capabilities of large language models (LLMs). The approach integrates LLM-based classification, statistical stability analysis, and cross-LLM review procedures to enhance the consistency and reliability of the filtering results. Applied to 1,930 domestic patents in the bio-artificial organ domain from 2000 to 2024, the method identified 55.4% as noise. The results demonstrate the method’s potential as an effective tool for technology policy formulation and strategic decision-making support.
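The three stages named in the abstract (LLM-based classification, statistical stability analysis, and cross-LLM review) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the two labels, the five repeated runs, and the 0.8 agreement threshold are all assumptions made for illustration, and the two classifiers are offline keyword stubs standing in for real LLM calls.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a classifier maps patent text to "relevant" or "noise".
Classifier = Callable[[str], str]


def stable_label(text: str, classify: Classifier,
                 runs: int = 5, threshold: float = 0.8) -> Tuple[str, float, bool]:
    """Classify the same text several times and measure agreement.

    Returns the majority label, the share of runs that agreed with it,
    and whether that share reaches the (assumed) stability threshold.
    """
    votes = Counter(classify(text) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    agreement = count / runs
    return label, agreement, agreement >= threshold


def filter_patents(texts: Sequence[str], primary: Classifier, reviewer: Classifier,
                   runs: int = 5, threshold: float = 0.8) -> List[Dict[str, object]]:
    """Label each patent and flag unstable or disputed cases for manual review."""
    results = []
    for text in texts:
        label, agreement, stable = stable_label(text, primary, runs, threshold)
        # Cross-model review: a second model must confirm a stable label.
        confirmed = stable and reviewer(text) == label
        results.append({
            "text": text,
            "label": label,
            "agreement": agreement,
            "needs_review": not confirmed,
        })
    return results


# Offline stand-ins for two different LLMs (keyword rules, for demonstration only).
def fake_primary(text: str) -> str:
    return "relevant" if "organ" in text else "noise"


def fake_reviewer(text: str) -> str:
    return "relevant" if "artificial" in text else "noise"


sample = [
    "bio-artificial organ scaffold",
    "organ donation ledger",
    "battery electrode",
]
results = filter_patents(sample, fake_primary, fake_reviewer)
noise_share = sum(r["label"] == "noise" for r in results) / len(results)
```

With real models, `classify` would wrap an API call and repeated runs could disagree, which is the case the agreement ratio is meant to expose. Items whose labels are unstable or disputed between models are flagged rather than silently dropped.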
- Publisher: The Korean Society of Industry Convergence (한국산업융합학회)
- Classification: Other General Engineering