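(1. College of Computer, National University of Defense Technology, Changsha, Hunan 410000; 2. College of Computer, National University of Defense Technology, Changsha, Hunan 410000)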
Keywords: Heritrix; corpus; focused crawler; APHash algorithm; Tika
CLC number: TP393.08  Document code: A  Article ID:
Design and Implementation of a Specific Data Acquisition System Based on Heritrix and a Focused Crawler
HE Yang1, PAN Guangqiang2
(1. National University of Defense Technology, Changsha 410000, China;
2. National University of Defense Technology, Changsha 410000, China)
Abstract: Corpora play an important role in current research, yet existing data collection methods cannot meet the demand. This paper presents a data acquisition method that captures domain-specific corpora quickly and accurately. By modifying components of the open-source Heritrix crawler, we introduce the APHash algorithm to distribute URLs evenly across the crawler queues, which raises the acquisition speed, and we add URL filtering conditions so that only the domain-specific corpus is collected. The collected content is then parsed with the open-source Tika toolkit, completing the acquisition of the specific data.
Key words: Heritrix; corpus; APHash; focused crawler; Tika
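HE Yang, male, born in April 1982 in Jinzhou, Liaoning; Master of Engineering in Computer Science and Technology, College of Computer, National University of Defense Technology. His main research interests are big data mining and web crawlers.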