EXTRACTING PRINCIPAL CONTENT FROM WEB PAGES
    Document type: Invention publication
    Status: Under examination (published)

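    Publication (announcement) number: EP2776945A4

    Publication (announcement) date: 2015-05-27

    Application number: EP12847034

    Filing date: 2012-11-07

    Applicant: EVERNOTE CORP

    IPC classification: G06F17/30

    CPC classification: G06F17/3089 G06F17/30707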

    Abstract: Extracting principal content from Web pages includes identifying and classifying items on the Web page, building a list of candidates, calculating candidate scores, selecting a top score candidate, performing clean up processing for the top score candidate, and performing final page processing for the top score candidate. Candidate scores may vary according to a number of paragraphs and images grouped according to size. A word length of CJK (Chinese-Japanese-Korean) text may be determined according to punctuation therein. Candidate scores may be modified according to a number of containers and pieces, where a container is a Web page element that is associated with the tags ‘body’, ‘div’, ‘td’, ‘li’, ‘article/section’, and pieces are candidates that do not include other candidates. Candidate scores may also be modified according to a number of ratios corresponding to text and link density.
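
    The candidate-scoring idea in the abstract can be illustrated with a minimal sketch. This is not the method claimed in the patent: the Element class, the weights (10 points per paragraph, 1 point per 100 characters), and the linear link-density penalty are all assumptions chosen for illustration, and ‘article/section’ is read here as the two tags 'article' and 'section'. Images, clean up, and final page processing are left out.

```python
from dataclasses import dataclass, field

# Container tags named in the abstract ('article/section' read as two tags).
CONTAINER_TAGS = {"body", "div", "td", "li", "article", "section"}


@dataclass
class Element:
    """Hypothetical, simplified stand-in for a parsed page element."""
    tag: str
    text_chars: int = 0   # characters of text held directly by this element
    link_chars: int = 0   # how many of those characters sit inside links
    paragraphs: int = 0   # paragraphs held directly by this element
    children: list = field(default_factory=list)


def total(el, attr):
    """Sum an attribute over the element and all of its descendants."""
    return getattr(el, attr) + sum(total(c, attr) for c in el.children)


def link_density(el):
    """Share of the element's text that is link text (0.0 if no text)."""
    text = total(el, "text_chars")
    return total(el, "link_chars") / text if text else 0.0


def candidates(root):
    """Yield every container element in the tree, depth-first."""
    if root.tag in CONTAINER_TAGS:
        yield root
    for child in root.children:
        yield from candidates(child)


def score(el):
    """Assumed scoring: reward paragraphs and text, penalize link-heavy blocks."""
    base = 10 * total(el, "paragraphs") + total(el, "text_chars") / 100
    return base * (1.0 - link_density(el))


def pick_principal(root):
    """Select the top score candidate."""
    return max(candidates(root), key=score)


# Toy page: a link-only navigation block next to a text-heavy article.
nav = Element("div", text_chars=200, link_chars=200)
article = Element("article", text_chars=3000, link_chars=100, paragraphs=6)
page = Element("body", children=[nav, article])

print(pick_principal(page).tag)
```

    In this toy page the navigation block consists entirely of link text, so its score drops to zero, and the enclosing 'body' is penalized for inheriting those links. A real extractor would build the element tree from parsed HTML and tune the weights empirically.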