-
Publication (announcement) number: US20250021602A1
Publication (announcement) date: 2025-01-16
Application number: US18900105
Application date: 2024-09-27
Applicant: Amazon Technologies, Inc.
Inventor: Shrikant G Nayak , Sathya Prakash Podila Venkata Subramanya , Divya Nalam , Vijay Daniel Manason , Valluri Subbanna Chowdary
IPC: G06F16/84 , G06F18/214 , G06F40/154 , G06F40/16 , G06N5/02 , G06N20/20
Abstract: Results of applying a set of voting rules to a target corpus of documents are used to obtain a set of derived probabilistic labels indicating the probabilities of the presence of a particular attribute within the documents' constituent objects. A machine learning model is trained to identify a candidate portion of a document from which a value of the attribute is to be extracted. The training data for the model includes learned representations obtained from paths of constituent objects, and the corresponding derived labels. A proposed value for the attribute, obtained based on an assigned attribute value presence probability score for an individual constituent object from a selected candidate portion of a document, is provided.
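The first step of the abstract above, turning the votes of several rules into a derived probabilistic label per constituent object, can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Node` layout, the three example rules, and the aggregation by fraction of non-abstaining "present" votes are assumptions made for illustration, not the method claimed in the application.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical constituent object: its path within the document and its text.
Node = Dict[str, str]
# A voting rule returns True (attribute present), False (absent) or None (abstain).
VotingRule = Callable[[Node], Optional[bool]]


def rule_path_mentions_price(node: Node) -> Optional[bool]:
    # Votes "present" when the path hints at the attribute, otherwise abstains.
    return True if "price" in node["path"].lower() else None


def rule_text_has_currency(node: Node) -> Optional[bool]:
    # Always votes: "present" if a currency symbol occurs in the text.
    return any(symbol in node["text"] for symbol in "$€£")


def rule_text_too_long(node: Node) -> Optional[bool]:
    # Votes "absent" for long text blocks, otherwise abstains.
    return False if len(node["text"]) > 40 else None


def derive_probabilistic_label(node: Node, rules: List[VotingRule]) -> float:
    """Assumed aggregation: share of 'present' votes among non-abstaining rules."""
    votes = [vote for vote in (rule(node) for rule in rules) if vote is not None]
    if not votes:
        return 0.5  # no rule had an opinion, so stay maximally uncertain
    return sum(votes) / len(votes)


RULES: List[VotingRule] = [
    rule_path_mentions_price,
    rule_text_has_currency,
    rule_text_too_long,
]
```

Labels derived this way could then be paired with learned representations of each object's path to form training data, as the abstract describes; that training step is not sketched here.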
-
Publication (announcement) number: US11714954B1
Publication (announcement) date: 2023-08-01
Application number: US17119465
Application date: 2020-12-11
Applicant: AMAZON TECHNOLOGIES, INC.
Inventor: Vijay Daniel Manason , Sathya Prakash Podila Venkata Subramanya , Ansar Pasha , Meghana Agrawal , Mandar Subhashrao Joshi , Shrikant G Nayak , Sandeep Bhaskar , Antonisamy Arokiasamy , Navin Anand
IPC: G06F40/137 , G06F40/143 , G06F16/901 , G06F16/958 , G06F40/30
CPC classification number: G06F40/137 , G06F16/9024 , G06F16/986 , G06F40/143 , G06F40/30
Abstract: A webpage containing information to be extracted may undergo changes to a layout of elements that present the information. These changes could result in an inability to retrieve the information later. A first graph is determined that represents elements of a first version of a webpage at a first time. An element in the first graph for which information is being acquired is specified. A relevant portion of the first graph is designated that includes the element and immediate neighbors in the first graph. Later, a second version of the webpage is retrieved, and a second graph of that second version is determined. The relevant portion of the first graph is compared to the second graph. If a match is found, the information of interest is extracted from the specified element of the second graph. This allows extraction of information to proceed even if the layout of elements changes.
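The comparison described in this abstract, matching an element together with its immediate neighbors against a graph of a later page version, can be sketched as follows. This is only an illustration under stated assumptions: the `PageGraph` class, the use of tag-and-class strings as node labels, and the exact-match criterion on neighbor labels are all hypothetical and are not taken from the patent.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PageGraph:
    """Hypothetical graph of page elements: node id -> label and node id -> text."""
    labels: Dict[str, str]
    texts: Dict[str, str]
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_edge(self, a: str, b: str) -> None:
        # Undirected adjacency between two elements (for example parent and child).
        self.edges.setdefault(a, []).append(b)
        self.edges.setdefault(b, []).append(a)


def relevant_portion(graph: PageGraph, node_id: str) -> Tuple[str, Counter]:
    """Signature of a node: its own label plus the labels of its immediate neighbors."""
    neighbors = graph.edges.get(node_id, [])
    return graph.labels[node_id], Counter(graph.labels[n] for n in neighbors)


def extract_from_new_version(
    old: PageGraph, target: str, new: PageGraph
) -> Optional[str]:
    """Return the text of the first node in `new` whose signature matches the target's.

    Assumed criterion: identical label and identical multiset of neighbor labels.
    Returns None when no node in the new graph matches.
    """
    signature = relevant_portion(old, target)
    for node_id in new.labels:
        if relevant_portion(new, node_id) == signature:
            return new.texts[node_id]
    return None
```

Because the signature depends only on the element and its immediate neighbors, a change elsewhere in the layout (such as a new wrapper element higher in the page) does not prevent the match in this sketch.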
-
Publication (announcement) number: US12130863B1
Publication (announcement) date: 2024-10-29
Application number: US17107633
Application date: 2020-11-30
Applicant: Amazon Technologies, Inc.
Inventor: Shrikant G Nayak , Sathya Prakash Podila Venkata Subramanya , Divya Nalam , Vijay Daniel Manason , Valluri Subbanna Chowdary
IPC: G06F16/80 , G06F16/84 , G06F18/214 , G06F40/154 , G06F40/16 , G06N5/02 , G06N20/20
CPC classification number: G06F16/86 , G06F18/2148 , G06F40/154 , G06F40/16 , G06N5/02 , G06N20/20
Abstract: Results of applying a set of voting rules to a target corpus of documents are used to obtain a set of derived probabilistic labels indicating the probabilities of the presence of a particular attribute within the documents' constituent objects. A machine learning model is trained to identify a candidate portion of a document from which a value of the attribute is to be extracted. The training data for the model includes learned representations obtained from paths of constituent objects, and the corresponding derived labels. A proposed value for the attribute, obtained based on an assigned attribute value presence probability score for an individual constituent object from a selected candidate portion of a document, is provided.
-