SYSTEMS AND METHODS FOR MULTI-MODAL LANGUAGE MODELS

    Publication number: US20240370718A1

    Publication date: 2024-11-07

    Application number: US18400477

    Application date: 2023-12-29

    Abstract: Embodiments described herein provide a method of generating a multi-modal task output to a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises receiving, via a data interface, a first input of a first modality, a second input of a second modality and the text instruction relating to the first and the second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input of the first modality into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input of the second modality into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
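The data flow in this abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function `build_lm_input`, the `Encoder` signature, the toy encoders, and the choice to combine by simple concatenation are all assumptions made for the example. The language model itself is left out.

```python
from typing import Callable, Dict, List

Vector = List[float]
# Hypothetical signature: an encoder maps (raw input, instruction) to a list of vectors.
Encoder = Callable[[object, str], List[Vector]]


def build_lm_input(
    inputs: Dict[str, object],
    instruction: str,
    encoders: Dict[str, Encoder],
    embed_text: Callable[[str], List[Vector]],
) -> List[Vector]:
    """Encode each modality conditioned on the instruction, then concatenate
    the encoded representations with the embedded instruction (assumed scheme)."""
    sequence: List[Vector] = []
    for modality, raw in inputs.items():
        # Each modality has its own encoder; the instruction is passed to all of them.
        sequence.extend(encoders[modality](raw, instruction))
    sequence.extend(embed_text(instruction))
    return sequence


def toy_encoder(tag: float) -> Encoder:
    # Toy stand-in: one vector holding a modality tag and the instruction length.
    return lambda raw, instruction: [[tag, float(len(instruction))]]


def toy_embed(text: str) -> List[Vector]:
    # Toy stand-in: one vector per whitespace-separated word.
    return [[0.0, float(len(word))] for word in text.split()]


lm_input = build_lm_input(
    {"audio": "audio-clip", "video": "video-clip"},
    "describe both",
    {"audio": toy_encoder(1.0), "video": toy_encoder(2.0)},
    toy_embed,
)
```

In a real system, `lm_input` would be passed to the neural network based language model to produce the multi-modal task output.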

    SYSTEMS AND METHODS FOR SUBJECT-DRIVEN IMAGE GENERATION

    Publication number: US20240161369A1

    Publication date: 2024-05-16

    Application number: US18498768

    Application date: 2023-10-31

    CPC classification number: G06T11/60 G06T9/00 G06V10/761 G06V10/82

    Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
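The sequence of steps in this abstract can be wired together roughly as shown below. This is only a sketch under stated assumptions: `subject_driven_generate` and every component passed to it are hypothetical stand-ins, and appending the subject vector to the encoded prompt is one assumed way of "combining" the two; the abstract does not specify the mechanism.

```python
from typing import Callable, List

Vector = List[float]


def subject_driven_generate(
    image: object,
    description: str,
    prompt: str,
    image_encoder: Callable[[object], Vector],
    text_encoder: Callable[[str], Vector],
    multimodal_encoder: Callable[[Vector, Vector], Vector],
    prompt_encoder: Callable[[str], List[Vector]],
    generator: Callable[[List[Vector]], object],
) -> object:
    """Follow the steps in the abstract with caller-supplied (hypothetical) components."""
    image_features = image_encoder(image)        # image -> image feature vector
    text_features = text_encoder(description)    # description -> text feature vector
    subject_vector = multimodal_encoder(image_features, text_features)
    # Assumed combination: append the subject vector to the encoded prompt.
    conditioning = prompt_encoder(prompt) + [subject_vector]
    return generator(conditioning)


# Toy stand-ins so the wiring can be exercised without any model.
result = subject_driven_generate(
    image="dog.png",
    description="a dog",
    prompt="dog on beach",
    image_encoder=lambda img: [float(len(img))],
    text_encoder=lambda text: [float(len(text))],
    multimodal_encoder=lambda img_vec, txt_vec: img_vec + txt_vec,
    prompt_encoder=lambda text: [[float(len(word))] for word in text.split()],
    generator=lambda conditioning: {"conditioning": conditioning},
)
```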

    SYSTEMS AND METHODS FOR VISION-LANGUAGE MODEL INSTRUCTION TUNING

    Publication number: US20240160858A1

    Publication date: 2024-05-16

    Application number: US18505982

    Application date: 2023-11-09

    CPC classification number: G06F40/40 G06V10/774 G06V10/82 G06V20/70

    Abstract: Embodiments described herein provide a method of generating a vision-language task output to a text instruction relating to an input image, the method comprising receiving, via a data interface, the input image and the text instruction comprising an instruction relating to the image. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural network based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
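The cross-attention step mentioned in this abstract can be illustrated with plain scaled dot-product attention. This is a minimal sketch, not the claimed method: it omits learned projections and multiple heads, and it assumes the instruction embeddings act as queries while the first image representation supplies the keys and values. The abstract does not state which side plays which role, and the toy vectors are made up for the example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output row is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return outputs


# Toy data: three image patch features and two instruction token embeddings.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
instruction_embeddings = [[1.0, 0.0], [0.0, 1.0]]

# Assumed roles: instruction tokens query the image features.
second_image_representation = cross_attend(instruction_embeddings, image_features, image_features)

# Assumed combination: concatenate along the sequence axis before the language model.
lm_input = second_image_representation + instruction_embeddings
```

Because each output row is a weighted average of image features, the instruction determines which parts of the image dominate the second image representation.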
