- Patent title: Document data classification using a noise-to-content ratio
- Application number: US13614858
- Filing date: 2012-09-13
- Publication number: US09773182B1
- Publication date: 2017-09-26
- Inventors: Bernhard Wolkerstorfer, Lei Li, Narendra S. Parihar
- Applicants: Bernhard Wolkerstorfer, Lei Li, Narendra S. Parihar
- Applicant address: Reno, NV, US
- Assignee: Amazon Technologies, Inc.
- Current assignee: Amazon Technologies, Inc.
- Current assignee address: Reno, NV, US
- Agency: Lowenstein Sandler LLP
- Main classification: G06K9/00
- IPC classification: G06K9/00; G06K9/32
Abstract:
A method and system for classifying document data is described. An exemplary method includes identifying a markup language document having a plurality of portions, determining a set of substantive content metrics and a set of noise metrics for each of the plurality of portions, calculating a noise-to-content ratio for each of the plurality of portions based on a corresponding set of substantive content metrics and a corresponding set of noise metrics, and removing noise from the markup language document using the noise-to-content ratio.
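The steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which metrics are used, so everything specific here is an assumption made for illustration: words outside links stand in for the substantive content metrics, words inside links plus the number of start tags stand in for the noise metrics, and the names `PortionMetrics`, `noise_to_content_ratio`, `remove_noise`, and the `threshold` value are hypothetical. The document is also assumed to be already split into portions.

```python
from html.parser import HTMLParser


class PortionMetrics(HTMLParser):
    """Collects illustrative content and noise counts for one portion."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0      # > 0 while inside an <a> element
        self.content_words = 0   # assumed content metric: words outside links
        self.link_words = 0      # assumed noise metric: words inside links
        self.tag_count = 0       # assumed noise metric: number of start tags

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        if self.link_depth > 0:
            self.link_words += words
        else:
            self.content_words += words


def noise_to_content_ratio(portion_html):
    """Return noise / content for one portion of a markup document."""
    parser = PortionMetrics()
    parser.feed(portion_html)
    parser.close()
    noise = parser.link_words + parser.tag_count
    # Add 1 to the denominator so portions without content do not divide by zero.
    return noise / (parser.content_words + 1)


def remove_noise(portions, threshold=1.0):
    """Keep only portions whose ratio does not exceed the (assumed) threshold."""
    return [p for p in portions if noise_to_content_ratio(p) <= threshold]


if __name__ == "__main__":
    nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">About us</a></li></ul>'
    article = "<p>This paragraph carries the actual text of the page and has no links.</p>"
    for portion in (nav, article):
        print(round(noise_to_content_ratio(portion), 3), portion[:30])
    print(remove_noise([nav, article]))
```

In this sketch a link-heavy navigation list scores far above the threshold and is dropped, while a plain text paragraph scores well below it and is kept. A real system would likely use richer metrics and a tuned threshold.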