[1] GENG Yun,JIANG Yan-bing,GUO Yan,et al.Research of Information Extraction Algorithm Based on Compositional Verification[J].Journal of Jiangxi Normal University (Natural Science Edition),2013,(02):142-147.
Research of Information Extraction Algorithm Based on Compositional Verification

Journal of Jiangxi Normal University (Natural Science Edition) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2013, No. 02
Pages:
142-147
Column:
Publication date:
2013-03-01

Article Information

Title:
Research of Information Extraction Algorithm Based on Compositional Verification
Author(s):
GENG Yun (耿耘); JIANG Yan-bing (蒋严冰); GUO Yan (郭岩); LIU Yue (刘悦); YU Jun (余钧); CHENG Xue-qi (程学旗)
Affiliations:
School of Software and Microelectronics, Peking University, Beijing 100190; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100101
Keywords:
information extraction; cross validation; threshold value; multi-algorithm
CLC number:
TP391
Document code:
A
Abstract:
By studying the nature of extraction algorithms and the relationships among them, this paper analyzes how extraction algorithms complement one another and proposes a multi-algorithm combined verification mechanism. The mechanism can detect errors made by an extraction algorithm and, combined with dynamic threshold adjustment, improves extraction accuracy.

References:

[1] W3C.W3C document object model [EB/OL].[2012-10-11].http://www.w3.org/DOM.
[2] Fabio Fumarola,Tim Weninger,Rick Barber,et al.HyLiEn: a hybrid approach to general list extraction on the Web [EB/OL].[2012-09-16].http://www.cs.uiuc.edu/~hanj/pdf/www11_ffumarola.pdf.
[3] Gupta S,Kaiser G,Stolfo S.Extracting context to improve accuracy for html content extraction [EB/OL].[2012-09-16].http://wwwconference.org/www2005/cdrom/docs/p1114.pdf.
[4] Gottron T.Combining content extraction heuristics:the CombinE system [EB/OL].[2012-09-22].http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.3709.
[5] Sun Fei,Song Dandan,Liao Lejian.DOM based content extraction via text density [EB/OL].[2012-09-22].http://disnet.cs.bit.edu.cn/DOM%20Based%20Content%20Extraction%20via%20Text%20Density.pdf.
[6] Liu Wei,Yan Hualiang,Xiao Jianguo,et al.Solution for automatic Web review extraction [J].Journal of Software,2010,21(12):3220-3236.
[7] Nitin Jindal,Liu Bing.A generalized tree matching algorithm considering nested lists for web data extraction [EB/OL].[2012-09-26].https://www.siam.org/proceedings/datamining/2010/dm10_081_jindaln.pdf.
[8] Xia Yingju,Yu Hao,Zhang Shu.Automatic web data extraction using tree alignment [EB/OL].[2012-09-27].http://dl.acm.org/citation.cfm?id=1646194.
[9] Tim W,William H H,Han Jiawei.CETR-content extraction via tag ratios [EB/OL].[2012-09-29].http://web.engr.illinois.edu/~weninge1/pubs/WHH_WWW10.pdf.
[10] Valter C,Giansalvatore M,Paolo M.Roadrunner:towards automatic data extraction from large web sites [EB/OL].[2012-09-29].http://www.vldb.org/conf/2001/P109.pdf.
[11] Thomas Gottron.Content code blurring: a new approach to content extraction [EB/OL].[2012-09-30].http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.5138&rep=rep1&type=pdf.
[12] Luo Ping,Fan Jian,Sam Liu,et al.Web article extraction for Web printing: a DOM+Visual based approach [EB/OL].[2012-09-30].http://www.hpl.hp.com/techreports/2009/HPL-2009-185.html.

Similar articles:

[1] WANG Yan-hua,YANG Zhi-hao,LI Yan-peng,et al.Protein-Protein Interaction Extraction Based on the Combination of Supervised and Semi-Supervised Learning Method[J].Journal of Jiangxi Normal University (Natural Science Edition),2013,(04):392.
