Pre-training for scene text detection

    Publication (Announcement) Number: US12254707B2

    Publication (Announcement) Date: 2025-03-18

    Application Number: US17955285

    Application Date: 2022-09-28

    Abstract: Embodiments of the present disclosure relate to a method, device and computer readable storage medium for scene text detection. In the method, a first visual representation of a first image is generated with an image encoding process. A first textual representation of a first text unit in the first image is generated with a text encoding process based on a first plurality of symbols obtained by masking a first symbol of a plurality of symbols in the first text unit. A first prediction of the masked first symbol is determined with a decoding process based on the first visual and textual representations. At least the image encoding process is updated according to at least a first training objective to increase at least a similarity between the first prediction and the masked first symbol.
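
    As a reading aid, below is a minimal sketch of the masked-symbol pre-training step the abstract describes, assuming cross-entropy as the training objective that increases similarity between the prediction and the masked symbol. All module and variable names are hypothetical placeholders, not components named in the patent.

```python
# Minimal sketch of the masked-symbol pre-training step, assuming
# cross-entropy as the (first) training objective. Module names are
# hypothetical placeholders, not the patent's components.
import torch
import torch.nn as nn

class MaskedSymbolPretrainer(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 decoder: nn.Module, hidden: int, vocab_size: int):
        super().__init__()
        self.image_encoder = image_encoder   # image encoding process
        self.text_encoder = text_encoder     # text encoding process
        self.decoder = decoder               # decoding process
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, image, masked_symbols, mask_pos):
        visual = self.image_encoder(image)            # first visual representation
        textual = self.text_encoder(masked_symbols)   # first textual representation
        fused = self.decoder(visual, textual)         # [batch, seq, hidden]
        # logits for the symbol at the masked position of each example
        return self.head(fused[torch.arange(fused.size(0)), mask_pos])

def pretraining_step(model, batch, optimizer):
    # Cross-entropy over the vocabulary stands in for "increase the
    # similarity between the prediction and the masked symbol".
    logits = model(batch["image"], batch["masked_symbols"], batch["mask_pos"])
    loss = nn.functional.cross_entropy(logits, batch["target_symbol"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # updates at least the image encoding process
    return loss.item()
```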

    MULTI-DIMENSIONAL GENERATIVE FRAMEWORK FOR VIDEO GENERATION

    Publication (Announcement) Number: US20240193412A1

    Publication (Announcement) Date: 2024-06-13

    Application Number: US18063843

    Application Date: 2022-12-09

    Applicant: Lemon Inc.

    CPC classification number: G06N3/08 G06T2207/20081

    Abstract: Generating a multi-dimensional video using a multi-dimensional video generative model for applications including, but not limited to, at least one of static portrait animation, video reconstruction, or motion editing. The method includes providing data to the multi-dimensionally aware generator of the multi-dimensional video generative model, and generating the multi-dimensional video from the data by the multi-dimensionally aware generator. The generating of the multi-dimensional video includes inverting the data into a latent space of the multi-dimensionally aware generator, synthesizing content of the multi-dimensional video using an appearance component of the multi-dimensionally aware generator and a corresponding camera pose and formulating an intermediate appearance code, developing a synthesis layer for encoding a motion component of the multi-dimensionally aware generator at a plurality of timesteps and formulating an intermediate motion code, introducing temporal dynamics into the intermediate appearance code and the intermediate motion code, and generating multi-dimensionally aware spatio-temporal representations of the data.
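
    The following toy sketch traces the generation flow named in the abstract (a latent code, an appearance code conditioned on camera pose, a motion code over timesteps, and their combination into per-frame representations). The layer choices, dimensions, and names are illustrative assumptions, not the patented architecture.

```python
# Hedged sketch of the generation flow in the abstract; all networks,
# dimensions, and names here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiDimGenerator(nn.Module):
    def __init__(self, latent_dim=512, code_dim=256):
        super().__init__()
        # appearance component: latent code + flattened 4x4 camera pose
        self.appearance_net = nn.Linear(latent_dim + 16, code_dim)
        # synthesis layer encoding the motion component over timesteps
        self.motion_net = nn.GRU(code_dim, code_dim, batch_first=True)
        # toy decoder turning combined codes into 64x64 RGB frames
        self.render = nn.Linear(2 * code_dim, 3 * 64 * 64)

    def forward(self, latent, camera_pose, timesteps):
        # intermediate appearance code from latent code and camera pose
        appearance = self.appearance_net(
            torch.cat([latent, camera_pose.flatten(1)], dim=-1))
        # intermediate motion code encoded at a plurality of timesteps
        motion_in = appearance.unsqueeze(1).repeat(1, timesteps, 1)
        motion, _ = self.motion_net(motion_in)
        # temporal dynamics: combine appearance and motion codes per timestep
        combined = torch.cat(
            [appearance.unsqueeze(1).expand_as(motion), motion], dim=-1)
        frames = self.render(combined).view(-1, timesteps, 3, 64, 64)
        return frames  # spatio-temporal representation rendered as frames

gen = MultiDimGenerator()
video = gen(torch.randn(2, 512),
            torch.eye(4).unsqueeze(0).repeat(2, 1, 1), timesteps=8)
print(video.shape)  # torch.Size([2, 8, 3, 64, 64])
```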

    DEBIASING TEXT-TO-IMAGE DIFFUSION MODELS

    Publication (Announcement) Number: US20250139846A1

    Publication (Announcement) Date: 2025-05-01

    Application Number: US19009706

    Application Date: 2025-01-03

    Abstract: There are provided methods, devices, and computer program products for image generation, particularly for debiasing text-to-image diffusion models. In a method, a plurality of images are obtained by an image generating model based on a prompt. The plurality of images comprise a plurality of instances of an object, respectively, and the object is specified by the prompt. A plurality of attributes of the plurality of instances of the object are determined respectively. The image generating model is updated based on the plurality of attributes and a predetermined distribution of a plurality of predetermined attributes related to the object. With the above method, the images generated by the updated image generating model may follow the predetermined distribution, and the updated image generating model may output debiased results.
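
    Below is a hedged sketch of the debiasing loop the abstract outlines: generate several images for a prompt, estimate an attribute for each instance of the prompted object, and push the empirical attribute distribution toward a predetermined one. The attribute classifier, the KL-based matching loss, and the generate() interface are assumptions for illustration only.

```python
# Illustrative sketch of the debiasing loop described in the abstract.
# The classifier, loss, and generate() interface are assumptions.
import torch

def debias_step(image_model, attribute_classifier, prompt,
                target_distribution, optimizer, num_images=16):
    # 1) obtain a plurality of images from the prompt
    images = image_model.generate(prompt, num_images=num_images)
    # 2) determine an attribute for each instance of the prompted object
    attr_logits = attribute_classifier(images)      # [N, num_attributes]
    attr_probs = attr_logits.softmax(dim=-1)
    # 3) compare the empirical attribute distribution with the
    #    predetermined distribution (soft histogram vs. target)
    empirical = attr_probs.mean(dim=0)
    loss = torch.nn.functional.kl_div(empirical.log(), target_distribution,
                                      reduction="sum")
    # 4) update the image generating model toward the target distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```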

    MULTIMODAL DATA PROCESSING
    Invention Publication

    Publication (Announcement) Number: US20240144664A1

    Publication (Announcement) Date: 2024-05-02

    Application Number: US18393238

    Application Date: 2023-12-21

    CPC classification number: G06V10/82 G06V10/467

    Abstract: Embodiments of the present disclosure provide a solution for multimodal data processing. A method comprises: obtaining image data and text data; and extracting a target visual feature of the image data and a target textual feature of the text data using a feature extraction model. The feature extraction model comprises alternately deployed cross-modal encoding parts and visual encoding parts. The extracting comprises: performing, using a first cross-modal encoding part of the feature extraction model, cross-modal feature encoding on a first intermediate visual feature of the image data and a first intermediate textual feature of the text data, to obtain a second intermediate visual feature and a second intermediate textual feature; and performing, using a first visual encoding part of the feature extraction model, visual modal feature encoding on the second intermediate visual feature, to obtain a third intermediate visual feature.
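
    The sketch below illustrates the alternating layout the abstract describes, with cross-modal encoding parts interleaved with visual-only encoding parts. The specific attention layers, dimensions, and names are assumptions rather than the patent's actual design.

```python
# Sketch of alternating cross-modal and visual encoding parts.
# Layer choices (MultiheadAttention, TransformerEncoderLayer) are assumptions.
import torch
import torch.nn as nn

class AlternatingMultimodalEncoder(nn.Module):
    def __init__(self, dim=256, num_blocks=2, heads=4):
        super().__init__()
        # one cross-modal part followed by one visual part, repeated
        self.cross_modal_parts = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(num_blocks)])
        self.visual_parts = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(num_blocks)])

    def forward(self, visual_feat, text_feat):
        for cross, visual_only in zip(self.cross_modal_parts, self.visual_parts):
            # cross-modal feature encoding: each modality attends to the other
            v2t, _ = cross(visual_feat, text_feat, text_feat)
            t2v, _ = cross(text_feat, visual_feat, visual_feat)
            visual_feat, text_feat = visual_feat + v2t, text_feat + t2v
            # visual modal feature encoding applied to the visual stream only
            visual_feat = visual_only(visual_feat)
        return visual_feat, text_feat  # target visual and textual features

enc = AlternatingMultimodalEncoder()
v, t = enc(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
print(v.shape, t.shape)  # torch.Size([2, 49, 256]) torch.Size([2, 12, 256])
```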
