Clustering Web Pages Based on Structure and Style Similarity (Application Paper)
- Resource Type
- Conference
- Authors
- Gowda, Thamme; Mattmann, Chris A.
- Source
- 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), pp. 175-180, Jul 2016
- Subject
- Computing and Processing
Engineering Profession
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Web pages
Vegetation
HTML
Cascading style sheets
World Wide Web
Data models
Videos
jaccard
similarity
tika
metadata
clustering
documents
- Language
We consider the task of cluster analysis on web pages and the techniques used to group them. While some applications require grouping web pages by the semantic meaning of their content, we focus on clustering by page structure and style, for applications such as categorization, cleaning, schema detection, and automatic extraction. This paper describes the application of similarity measures and a clustering technique to group web pages into clusters. The structural similarity of HTML pages is measured using tree edit distance on DOM trees. The stylistic similarity is measured using Jaccard similarity on CSS class names. An aggregated similarity measure is computed by combining the structural and stylistic measures. A clustering method is then applied to this aggregated measure to group the documents.
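As an illustration of the stylistic measure and the aggregation step, the following is a minimal sketch, not the authors' implementation. It uses only the Python standard library. The names `ClassCollector`, `css_classes`, `jaccard`, and `aggregate` are hypothetical, and the linear weighting with a `weight` parameter is an assumption, since the abstract does not say how the two measures are combined. The structural score is taken as a given input here; computing tree edit distance on DOM trees is not shown.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name used in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names appearing in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        # Assumption: two pages with no classes count as stylistically identical.
        return 1.0
    return len(a & b) / len(a | b)


def aggregate(structural, stylistic, weight=0.5):
    """Assumed linear combination of the two similarity scores in [0, 1]."""
    return weight * structural + (1.0 - weight) * stylistic


page_a = '<div class="nav top"><p class="body">A</p></div>'
page_b = '<div class="nav"><p class="body lead">B</p></div>'

style_sim = jaccard(css_classes(page_a), css_classes(page_b))
# 0.8 stands in for a structural similarity derived from tree edit distance.
combined = aggregate(0.8, style_sim, weight=0.5)
print(style_sim, combined)
```

In this example the two pages share two of four distinct class names, so the stylistic similarity is 0.5. A pairwise matrix of such aggregated scores could then be handed to any clustering method that accepts precomputed similarities.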