An integrated approach for information extraction
- Resource Type
- Conference
- Authors
- Xia, YingJu; Yang, YuHang; Ge, Fujiang; Zhang, Shu; Yu, Hao
- Source
- The 5th International Conference on New Trends in Information Science and Service Science (NISS), 2011. 1:122-127 Oct, 2011
- Subject
- Computing and Processing
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Web pages
Blogs
Maintenance engineering
Data mining
Estimation
Accuracy
Linear regression
- Language
This paper proposes an integrated approach to automatic information extraction from forum, blog, and news web sites using wrappers. A tree alignment and transfer learning method is presented to generate the wrapper. The tree alignment algorithm finds the best-matching structure among the input web pages, and a form of linear regression estimates the weights of the different tag matchings. For wrapper maintenance, the paper presents a method that applies a log-likelihood ratio test to detect change points in the similarity series computed between the wrapper and the input web pages. Experimental results show that the method achieves high accuracy and steady performance.
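
The wrapper-maintenance idea in the abstract can be illustrated with a minimal sketch of a log-likelihood ratio test for a single change point. This is not the authors' implementation: the abstract does not give the model, so the sketch assumes the similarity scores are Gaussian with a common variance and that a layout change shows up as a single shift in the mean. The function name `llr_change_point`, the `min_seg` parameter, and the sample series are all hypothetical.

```python
import math


def llr_change_point(series, min_seg=2):
    """Find the most likely single mean shift in `series`.

    Assumption (not from the paper): values are Gaussian with a common,
    unknown variance. Under that model the log-likelihood ratio statistic
    for a split at index k is n * log(SSE_null / SSE_split).
    Returns (k, statistic); k is None if no split improves the fit.
    """
    n = len(series)
    if n < 2 * min_seg:
        raise ValueError("series is too short for the requested min_seg")

    eps = 1e-12  # avoids division by zero on perfectly flat segments

    # Null hypothesis: one mean for the whole series.
    mean_all = sum(series) / n
    sse_null = sum((x - mean_all) ** 2 for x in series)

    best_k, best_stat = None, 0.0
    for k in range(min_seg, n - min_seg + 1):
        left, right = series[:k], series[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # Alternative hypothesis: separate means before and after index k.
        sse_split = (sum((x - mean_left) ** 2 for x in left)
                     + sum((x - mean_right) ** 2 for x in right))
        stat = n * math.log((sse_null + eps) / (sse_split + eps))
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Hypothetical similarity scores between a wrapper and successive pages:
# high while the site layout is stable, lower after a redesign.
similarities = [0.95, 0.96, 0.94, 0.95, 0.97, 0.60, 0.58, 0.61, 0.59, 0.60]
change_index, statistic = llr_change_point(similarities)
```

In practice the statistic would be compared against a threshold before flagging the wrapper for repair; the choice of threshold is left open here because the abstract does not state one.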