VIPS: A VIsion-based Page Segmentation Algorithm


Introduction

The VIsion-based Page Segmentation (VIPS) algorithm aims to extract the semantic structure of a web page from its visual presentation. This semantic structure is a tree in which each node corresponds to a block. Each node is assigned a value, the Degree of Coherence (DoC), that indicates how coherent the content of the block is in terms of visual perception: the larger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all suitable blocks from the HTML DOM tree and then finds the separators between these blocks. Here, separators are the horizontal or vertical lines in a web page that visually cross no blocks. Based on these separators, the semantic tree of the web page is constructed. A web page can thus be represented as a set of blocks (the leaf nodes of the semantic tree). Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can be easily removed because it is often placed in certain positions of a page. Content on different topics is distinguished as separate blocks.
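
The separator idea described above can be illustrated with a small Python sketch. It is only a simplified illustration, not the code of the VIPS DLL: the `Block` type, the `horizontal_separators` function, and the `page_height` parameter are hypothetical names introduced here. It assumes the blocks have already been extracted as rectangles in pixel coordinates, and it handles horizontal separators only, ignoring DoC values and vertical separators.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Hypothetical rectangle for an extracted block (pixel coordinates)."""
    left: int
    top: int
    right: int
    bottom: int


def horizontal_separators(blocks: List[Block], page_height: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) spans of horizontal strips that cross no block.

    Assumption: a strip counts as a separator when no block's vertical
    extent overlaps it, regardless of the block's horizontal position.
    """
    # Sort the vertical extents of all blocks by their top edge.
    spans = sorted((b.top, b.bottom) for b in blocks)
    separators = []
    cursor = 0  # lowest y coordinate covered by the blocks seen so far
    for top, bottom in spans:
        if top > cursor:
            # Nothing covers the strip between cursor and this block's top.
            separators.append((cursor, top))
        cursor = max(cursor, bottom)
    if cursor < page_height:
        # Empty strip between the last block and the bottom of the page.
        separators.append((cursor, page_height))
    return separators


# Example layout: a header, two side-by-side columns, and a footer.
page = [
    Block(0, 0, 800, 100),      # header
    Block(0, 120, 200, 600),    # left column (e.g. navigation)
    Block(220, 120, 800, 600),  # main content
    Block(0, 650, 800, 700),    # footer
]
print(horizontal_separators(page, page_height=700))
```

In this layout the two columns share the same vertical extent, so no horizontal separator falls between them; splitting them would require the analogous projection onto the horizontal axis to find a vertical separator.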


Paper List

Original Paper

Applications using VIPS


Demo

Copyright Notice: All of these programs may be used for research purposes only.

VIPS dll (The VIPS DLL is always under development. All versions are downloadable here.)

How to use the VIPS DLL.

Notice: We are currently working to enhance the VIPS algorithm. Any suggestions or problems can be sent to dengcai2 AT cs DOT uiuc DOT edu.




Last modified: March 23 2005