-
Publication No.: US20230017072A1
Publication date: 2023-01-19
Application No.: US17370522
Filing date: 2021-07-08
Applicant: Google LLC
Inventor: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
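The token-extraction step in this abstract can be illustrated with a minimal sketch. The abstract does not say how the tokens are formed; the sketch assumes non-overlapping spatiotemporal blocks ("tubelets") that are each flattened into one vector. The function name `extract_tubelet_tokens`, the tubelet sizes, and the toy clip are all hypothetical, and the transformer encoder itself is not shown.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as video[t][y][x], single channel


def extract_tubelet_tokens(video: Video, t: int, h: int, w: int) -> List[List[float]]:
    """Split a video into non-overlapping t x h x w tubelets, one flat token each.

    Assumption: tubelet tokenization is only one possible way to obtain
    tokens that carry spatiotemporal information.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % t or height % h or width % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    tokens = []
    for t0 in range(0, frames, t):
        for y0 in range(0, height, h):
            for x0 in range(0, width, w):
                # Flatten every pixel in this spatiotemporal block into one vector.
                token = [
                    video[ti][yi][xi]
                    for ti in range(t0, t0 + t)
                    for yi in range(y0, y0 + h)
                    for xi in range(x0, x0 + w)
                ]
                tokens.append(token)
    return tokens


# Toy clip: 4 frames of 4x4 pixels, where each pixel stores its frame index.
clip = [[[float(f)] * 4 for _ in range(4)] for f in range(4)]
tokens = extract_tubelet_tokens(clip, t=2, h=2, w=2)
print(len(tokens), len(tokens[0]))
```

Because each token spans two frames, it mixes pixel values from adjacent time steps, which is what lets a downstream encoder see motion as well as appearance.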
-
Publication No.: US09971746B2
Publication date: 2018-05-15
Application No.: US14168649
Filing date: 2014-01-30
Applicant: Google LLC
CPC classification number: G06F17/2247, G06F17/248, G06F17/30864
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining summary content for resources in a domain. In one aspect, a method includes accessing a first resource belonging to a particular domain, selecting an anchor in the first resource linking to a second resource belonging to the particular domain, identifying particular text content in the first resource that is subordinate to the anchor, determining that the second resource includes the particular text content that is subordinate to the anchor, based on determining that the second resource includes the particular text content that is subordinate to the anchor, generating a domain template for the particular domain, the domain template specifying a location of the particular text content in the second resource, and determining, for each respective resource belonging to the particular domain having a structure matching the domain template, respective text content for the respective resource.
-
Publication No.: US12056197B2
Publication date: 2024-08-06
Application No.: US18150739
Filing date: 2023-01-05
Applicant: Google LLC
IPC: G06F16/951, G06F40/143, G06F40/186
CPC classification number: G06F16/951, G06F40/143, G06F40/186
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining summary content for resources in a domain. In one aspect, a method includes accessing a first resource belonging to a particular domain, selecting an anchor in the first resource linking to a second resource belonging to the particular domain, identifying particular text content in the first resource that is subordinate to the anchor, determining that the second resource includes the particular text content that is subordinate to the anchor, based on determining that the second resource includes the particular text content that is subordinate to the anchor, generating a domain template for the particular domain, the domain template specifying a location of the particular text content in the second resource, and determining, for each respective resource belonging to the particular domain having a structure matching the domain template, respective text content for the respective resource.
-
Publication No.: US20230177384A1
Publication date: 2023-06-08
Application No.: US17545526
Filing date: 2021-12-08
Applicant: Google LLC
Inventor: Arsha Nagrani, Shan Yang, Anurag Arnab, Chen Sun, Cordelia Luise Schmid
Abstract: Example embodiments according to aspects of the present disclosure provide an example computer-implemented method for multimodal data processing with improved cross-modal attention. The example method includes inputting a multimodal sequence to an example machine-learned model. The example model includes a first modal processing stream receiving a first modal portion of the multimodal sequence and a second modal processing stream receiving a second modal portion of the multimodal sequence. The example model includes fusing the first modal processing stream and the second modal processing stream across one or more fusion layers of the machine-learned model through a plurality of cross-modal context encodings. The example method includes outputting an inference based at least in part on the plurality of cross-modal context encodings.
-
Publication No.: US20210264203A1
Publication date: 2021-08-26
Application No.: US17046313
Filing date: 2019-11-18
Applicant: Google LLC
Inventor: Ariel Fuxman, Aleksei Timofeev, Zhen Li, Chun-Ta Lu, Manan Shah, Chen Sun, Krishnamurthy Viswanathan, Chao Jia
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for realizing a multimodal image classifier. In an aspect, a method includes, for each image of a plurality of images: processing the image by a textual generator model to obtain a set of phrases that are descriptive of the content of the image, wherein each phrase is one or more terms, processing the set of phrases by a textual embedding model to obtain an embedding of predicted text for the image, and processing the image using an image embedding model to obtain an embedding of image pixels of the image. Then a multimodal image classifier is trained on the embeddings of predicted text for the images and the embeddings of image pixels for the images to produce, as output, labels of an output taxonomy to classify an image based on the image as input.
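The training idea in this abstract, a classifier fitted on both a predicted-text embedding and an image-pixel embedding per image, can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the abstract does not say how the two embeddings are combined or which classifier is used, so the sketch concatenates them and swaps in a simple nearest-centroid classifier. The helper names (`fuse`, `fit_centroids`, `classify`), the labels, and the embedding values are all made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def fuse(text_emb: Sequence[float], image_emb: Sequence[float]) -> Vector:
    """Join the two per-image embeddings into one vector (assumed fusion: concatenation)."""
    return list(text_emb) + list(image_emb)


def fit_centroids(examples: Sequence[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Average the fused vectors of each label to get one centroid per label."""
    grouped: Dict[str, List[Vector]] = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(features: Vector, centroids: Dict[str, Vector]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=dist)


# Made-up 2-d text and image embeddings for four training images.
train = [
    (fuse([1.0, 0.0], [0.9, 0.1]), "dog"),
    (fuse([0.8, 0.2], [1.0, 0.0]), "dog"),
    (fuse([0.0, 1.0], [0.1, 0.9]), "cat"),
    (fuse([0.2, 0.8], [0.0, 1.0]), "cat"),
]
centroids = fit_centroids(train)
print(classify(fuse([0.9, 0.1], [0.8, 0.2]), centroids))
```

In a real system the text embedding would come from running the textual generator and textual embedding models on the image, so the classifier still needs only the image as input at prediction time.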
-
Publication No.: US20240143700A1
Publication date: 2024-05-02
Application No.: US18409411
Filing date: 2024-01-10
Applicant: Google LLC
Inventor: Ariel Fuxman, Aleksei Timofeev, Zhen Li, Chun-Ta Lu, Manan Shah, Chen Sun, Krishnamurthy Viswanathan, Chao Jia
IPC: G06F18/24, G06F18/214, G06F18/2413
CPC classification number: G06F18/24, G06F18/214, G06F18/24147
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for realizing a multimodal image classifier. In an aspect, a method includes, for each image of a plurality of images: processing the image by a textual generator model to obtain a set of phrases that are descriptive of the content of the image, wherein each phrase is one or more terms, processing the set of phrases by a textual embedding model to obtain an embedding of predicted text for the image, and processing the image using an image embedding model to obtain an embedding of image pixels of the image. Then a multimodal image classifier is trained on the embeddings of predicted text for the images and the embeddings of image pixels for the images to produce, as output, labels of an output taxonomy to classify an image based on the image as input.
-
Publication No.: US20210166009A1
Publication date: 2021-06-03
Application No.: US16637960
Filing date: 2019-08-06
Applicant: Google LLC
Inventor: Chen Sun, Abhinav Shrivastava, Cordelia Luise Schmid, Rahul Sukthankar, Kevin Patrick Murphy, Carl Martin Vondrick
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing action localization. In one aspect, a system comprises a data processing apparatus; a memory in data communication with the data processing apparatus and storing instructions that cause the data processing apparatus to perform operations comprising: receiving an input comprising an image depicting a person; identifying a plurality of context positions from the image; determining respective feature representations of each of the context positions; providing a feature representation of the person and the feature representations of each of the context positions to a context neural network to obtain relational features, wherein the relational features represent relationships between the person and the context positions; and determining an action performed by the person using the feature representation of the person and the relational features.
-
Publication No.: US20210019470A1
Publication date: 2021-01-21
Application No.: US17065256
Filing date: 2020-10-07
Applicant: Google LLC
IPC: G06F40/14, G06F16/951, G06F40/186
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining summary content for resources in a domain. In one aspect, a method includes accessing a first resource belonging to a particular domain, selecting an anchor in the first resource linking to a second resource belonging to the particular domain, identifying particular text content in the first resource that is subordinate to the anchor, determining that the second resource includes the particular text content that is subordinate to the anchor, based on determining that the second resource includes the particular text content that is subordinate to the anchor, generating a domain template for the particular domain, the domain template specifying a location of the particular text content in the second resource, and determining, for each respective resource belonging to the particular domain having a structure matching the domain template, respective text content for the respective resource.