Publication No.: US09773182B1
Publication date: 2017-09-26
Application No.: US13614858
Filing date: 2012-09-13
CPC classification: G06K9/3208, G06F17/2745
Abstract: A method and system for classifying document data is described. An exemplary method includes identifying a markup language document having a plurality of portions, determining a set of substantive content metrics and a set of noise metrics for each of the plurality of portions, calculating a noise-to-content ratio for each of the plurality of portions based on a corresponding set of substantive content metrics and a corresponding set of noise metrics, and removing noise from the markup language document using the noise-to-content ratio.
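The steps in the abstract can be illustrated with a minimal sketch. The abstract does not say which metrics are used, how they are combined, or how the ratio drives removal, so everything specific here is an assumption made for illustration only: words outside links stand in for the substantive content metrics, link count and link-text words stand in for the noise metrics, the ratio is their simple quotient, and a portion is dropped when the ratio exceeds a hypothetical `threshold`. The names `PortionStats`, `noise_to_content_ratio`, and `remove_noise` are likewise invented for this example and do not come from the patent.

```python
from html.parser import HTMLParser


class PortionStats(HTMLParser):
    """Collects illustrative (assumed) metrics for one portion of markup."""

    def __init__(self):
        super().__init__()
        self.text_words = 0  # assumed content metric: words outside links
        self.link_words = 0  # assumed noise metric: words inside <a> elements
        self.link_count = 0  # assumed noise metric: number of <a> elements
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        if self._link_depth:
            self.link_words += words
        else:
            self.text_words += words


def noise_to_content_ratio(portion):
    """Return noise / content for one portion (assumed formula)."""
    stats = PortionStats()
    stats.feed(portion)
    stats.close()
    noise = stats.link_words + stats.link_count
    content = stats.text_words
    # Guard against division by zero for portions with no plain text.
    return noise / max(content, 1)


def remove_noise(portions, threshold=1.0):
    """Keep only portions whose ratio is at or below the assumed threshold."""
    return [p for p in portions if noise_to_content_ratio(p) <= threshold]


# Two hypothetical portions: a navigation bar and an article paragraph.
nav = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
article = (
    '<p>This paragraph holds the actual article text with one '
    '<a href="#">link</a>.</p>'
)
cleaned = remove_noise([nav, article])
```

In this sketch the link-heavy navigation portion scores a high ratio and is removed, while the text-heavy paragraph is kept. A real implementation would need the metric definitions and removal rule from the patent's claims and description, which this record does not include.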