
[1] GENG Yun, JIANG Yan-bing, GUO Yan, et al. Research of Information Extraction Algorithm Based on Compositional Verification[J]. Journal of Jiangxi Normal University: Natural Science Edition, 2013, (02): 142-147.

Research of Information Extraction Algorithm Based on Compositional Verification


Journal of Jiangxi Normal University (Natural Science Edition) [ISSN:1006-6977/CN:61-1281/TN]

Issue: 2013, No. 02

Pages: 142-147


Publication date: 2013-03-01

Article Info

Title: Research of Information Extraction Algorithm Based on Compositional Verification

Author(s): GENG Yun; JIANG Yan-bing; GUO Yan; LIU Yue; YU Jun; CHENG Xue-qi

Affiliations: School of Software and Microelectronics, Peking University, Beijing 100190; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100101

Keywords: information extraction; cross validation; threshold value; multi-algorithm

CLC number: TP391

Document code: A

Abstract: By studying the nature of Web information extraction algorithms and the relationships among them, this paper analyzes how such algorithms complement one another and proposes a multi-algorithm cross-validation mechanism. The mechanism can detect errors made by an individual extraction algorithm and, combined with dynamic adjustment of each algorithm's threshold value, improves extraction accuracy.
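The abstract does not describe the paper's actual procedure, so the following is only a minimal sketch of how multi-algorithm cross-validation with dynamic thresholds might look. Everything named here is an illustrative assumption and not taken from the paper: the `Member` class, the `combined_extract` function, the token-level Jaccard agreement measure, and the `agree` and `step` parameters. Each extractor is assumed to return its extracted text together with a confidence score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: page HTML in, (extracted text, confidence) out.
Extractor = Callable[[str], Tuple[str, float]]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two extracted texts."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class Member:
    name: str
    extract: Extractor
    threshold: float  # minimum confidence needed to trust this extractor


def combined_extract(
    page: str,
    members: List[Member],
    agree: float = 0.6,   # assumed similarity needed for two results to "agree"
    step: float = 0.05,   # assumed size of each threshold adjustment
) -> Tuple[Optional[str], Dict[str, bool]]:
    # 1. Run every extractor; keep results whose confidence clears its threshold.
    candidates: Dict[str, str] = {}
    for m in members:
        text, score = m.extract(page)
        if score >= m.threshold:
            candidates[m.name] = text

    # 2. Cross-validate: a candidate is verified if at least one other
    #    candidate is sufficiently similar to it.
    verified: Dict[str, bool] = {}
    for m in members:
        if m.name not in candidates:
            continue
        ok = any(
            jaccard(candidates[m.name], other) >= agree
            for name, other in candidates.items()
            if name != m.name
        )
        verified[m.name] = ok
        # 3. Dynamic threshold: relax it for extractors that were confirmed,
        #    tighten it for extractors whose output nobody else supports.
        if ok:
            m.threshold = max(0.0, m.threshold - step)
        else:
            m.threshold = min(1.0, m.threshold + step)

    # Return the first verified result in member order, or None if none agree.
    result = next((candidates[n] for n, ok in verified.items() if ok), None)
    return result, verified
```

In this sketch an extractor whose output is unsupported by every other extractor is treated as having failed on the page, and its threshold is raised so that it must be more confident before its next result is considered. Requiring agreement from a majority instead of a single peer would be an equally plausible design choice.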

