[Paper] Web Crawling System based on Tag Path and Text Appearance Frequency

Korean title: 태그 경로 및 텍스트 출현 빈도 기반 웹 크롤링 시스템

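Summary

Web crawlers that extract body text from collection channels such as SNS, news sites, and blogs currently rely mainly on web page structure, such as tags and style attributes. With this approach, the extraction logic must be changed every time the page structure changes, and separate extraction logic must be implemented for each collection channel, which makes it very inefficient in today's web environment, where page structures are complex and change often. To address these drawbacks, recent methods work on the DOM tree of a web page and collect body text using word/link density, the types of adjacent tags, visual features of the page, or the order in which text appears, but their extraction accuracy differs greatly between collection channels. This paper therefore proposes a method that reduces the performance gap between collection channels and collects body text with higher accuracy. The method formalizes a web page using the tag path, which lists in order the tags encountered from the root node of the DOM tree down to a text node, and the text appearance frequency, which is the number of times a text appears within a specific collection channel, and then collects the body text. A comparison with other collection methods confirmed that the proposed method achieves a superior F1-Score, computed as the harmonic mean of recall and precision. Based on the proposed method, we implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that anyone can use easily. Because WCTT collects body text with the same logic on every collection channel, it is easy to maintain and extend. To facilitate later work such as keyword network analysis, it removes unnecessary text from the collected body text, extracts nouns, and then generates and provides to the user a TF-IDF (Term Frequency-Inverse Document Frequency) table and a co-occurrence matrix, from which the importance of the words in the text and the associations between words can be determined.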
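
The two features named above, the tag path and the text appearance frequency, can be illustrated with a short Python sketch. The text does not state how the features are combined, so the rule used here is an assumption: a text node whose (tag path, text) pair recurs on more than half of a channel's pages is treated as boilerplate, and the remaining nodes are kept as body text. The names `TagPathParser`, `text_nodes`, `extract_body`, and the `max_ratio` threshold are hypothetical and are not taken from WCTT.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is never body text.
SKIP_TAGS = {"script", "style"}


class TagPathParser(HTMLParser):
    """Collects a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (tag_path, text) pairs of one page."""
    parser = TagPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_body(pages, max_ratio=0.5):
    """Keep text whose (tag_path, text) pair occurs on at most
    max_ratio of the channel's pages (assumed boilerplate rule)."""
    parsed = [text_nodes(page) for page in pages]
    # Appearance frequency: number of pages on which each pair occurs.
    freq = Counter(pair for nodes in parsed for pair in set(nodes))
    limit = max_ratio * len(pages)
    return [[text for path, text in nodes if freq[(path, text)] <= limit]
            for nodes in parsed]


if __name__ == "__main__":
    template = ("<html><body><div><a>Home</a></div>"
                "<div><p>{}</p></div><p>Copyright</p></body></html>")
    pages = [template.format(s) for s in
             ("Storm hits coast.", "Election results announced.",
              "New phone released.")]
    print(text_nodes(pages[0]))
    print(extract_body(pages))
```

In this toy channel, "Home" and "Copyright" share a tag path and text on every page and are dropped, while each page's article sentence occurs once and is kept. A real channel would need a threshold tuned on sample pages.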

ABSTRACT

To collect body text from diverse web channels such as SNS, news sites, and blogs, crawlers that extract text based on web page structure, such as tags and style attributes, are mainly used. However, this approach is very inefficient in today's web environment, where page structures are complex and change frequently: the extraction logic must be revised whenever a page's structure changes, and separate extraction logic must be implemented for each collection channel. To compensate for this shortcoming, methods based on the DOM tree of a web page have recently been proposed that collect text using word/link density, the types of adjacent tags, the visual characteristics of the page, or the order in which text appears. Their extraction accuracy, however, varies greatly from channel to channel. In this paper, to reduce the performance difference between collection channels while collecting the body text of web pages with higher accuracy, we propose a method that formalizes a web page using two features and then collects the text: the tag path, which lists in order the tags that appear from the root node of the DOM tree to a text node, and the text appearance frequency, which is the number of times a text appears in a specific collection channel. A performance comparison with other collection methods confirmed that the proposed method achieves a superior F1-Score, calculated as the harmonic mean of recall and precision. Based on the proposed method, we implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that anyone can use easily. Because WCTT collects text with the same logic for every collection channel, it is easy to maintain and extend. To support later analyses such as keyword network analysis, WCTT also removes unnecessary text from the collected body text, extracts nouns, and generates a TF-IDF (Term Frequency-Inverse Document Frequency) table, which indicates the importance of words, and a co-occurrence matrix, which identifies associations between words.
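
As an illustration of the two outputs named at the end of the abstract, the following Python sketch builds a TF-IDF table and a co-occurrence matrix from documents that are already reduced to lists of nouns. The abstract does not give WCTT's formulas, so the weighting here is an assumption: tf is the term count divided by document length, idf is ln(N / df), and two words co-occur when they appear in the same document. The function names `tfidf_table` and `cooccurrence_matrix` are hypothetical, and noun extraction itself is outside this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_table(docs):
    """Return one {term: tf-idf} dict per document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    table = []
    for doc in docs:
        counts = Counter(doc)
        table.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                      for term, count in counts.items()})
    return table


def cooccurrence_matrix(docs):
    """Count, for each unordered word pair, the documents containing both.

    Returns the sorted vocabulary and a symmetric matrix with a zero diagonal.
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return vocab, matrix


if __name__ == "__main__":
    # Toy documents, assumed to be noun lists produced by an earlier step.
    docs = [["storm", "coast", "storm"],
            ["storm", "election"],
            ["election", "result"]]
    for row in tfidf_table(docs):
        print(row)
    vocab, matrix = cooccurrence_matrix(docs)
    print(vocab)
    for row in matrix:
        print(row)
```

Under this weighting, a word that appears in every document gets a score of zero, which is the usual reason to pair TF-IDF with a co-occurrence matrix: the first ranks words by how distinctive they are, the second records which words appear together.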