Abstract:
A citation record extraction system is provided for extracting citation records from publication list pages having different layouts and contents. An HTML rendering engine receives a publication list web page, parses the publication list web page to obtain layout information of the web page. A web page sequence builder generates a web page characteristic sequence for the web page according to the layout information. A web page repeated pattern analyzer analyzes repeated patterns presented in the web page characteristic sequence, screens out non-citation records therefrom, and obtains a citation record of the publication list web page.
Abstract:
A citation record extraction system is provided. An HTML rendering engine receives a publication list web page, parses the publication list web page to obtain layout information of the web page. A web page sequence builder generates a web page characteristic sequence for the web page according to the layout information. A web page repeated pattern analyzer analyzes repeated pattern presented in the web page characteristic sequence, screens out non-citation record therefrom, and obtains a citation record of the publication list web page.