Publication No.: US12045288B1
Publication Date: 2024-07-23
Application No.: US17031062
Filing Date: 2020-09-24
Applicant: Amazon Technologies, Inc.
Inventor: Ahmet Emre Barut, Chengwei Su, Weitong Ruan, Wael Hamza
IPC: G06F16/30, G06F16/532, G06F16/583, G06F16/9032, G06V20/20, G06N20/00
CPC classification number: G06F16/90332, G06F16/532, G06F16/583, G06V20/20, G06N20/00
Abstract: Devices and techniques are generally described for selection of objects in image data using natural language input. In various examples, first image data representing at least a first object and first natural language data may be received. In some examples, first embedding data representing the first natural language data may be generated. Second embedding data representing the first image data may be generated. Relative location data indicating a location of the first object in the first image data relative to at least one other object may be generated. The first embedding data, the second embedding data, and the relative location data may be input into a multi-modal transformer model. The multi-modal transformer model may determine that the first natural language data relates to the first object.
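The abstract does not say how the relative location data is encoded. As a rough illustration only, the Python sketch below assumes each detected object comes with an axis-aligned bounding box and expresses an object's location as the offset of its center from the center of every other object, normalized by image size. The names `Box`, `relative_location`, and `relative_location_features` are hypothetical and do not come from the patent. The embedding steps and the multi-modal transformer are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> Tuple[float, float]:
        # Midpoint of the box along each axis.
        return ((self.x_min + self.x_max) / 2.0,
                (self.y_min + self.y_max) / 2.0)


def relative_location(target: Box, other: Box,
                      width: float, height: float) -> Tuple[float, float]:
    """Offset of target's center from other's center, scaled by image size.

    A negative dx means the target lies to the left of the other object;
    a negative dy means it lies above it (image y grows downward).
    """
    tx, ty = target.center
    ox, oy = other.center
    return ((tx - ox) / width, (ty - oy) / height)


def relative_location_features(
    boxes: List[Box], width: float, height: float
) -> List[List[Tuple[float, float]]]:
    """For each box, collect its offsets relative to every other box."""
    return [
        [relative_location(b, o, width, height)
         for j, o in enumerate(boxes) if j != i]
        for i, b in enumerate(boxes)
    ]


# Two objects in a 100 x 50 image: one near the left edge, one near the right.
left_obj = Box(0, 0, 10, 10)
right_obj = Box(90, 0, 100, 10)
features = relative_location_features([left_obj, right_obj], 100.0, 50.0)
```

Under these assumptions, a phrase such as "the one on the left" could be matched against the sign of the horizontal offset. In the described system, such relative location data would be supplied to the multi-modal transformer together with the text and image embeddings.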