REAL-TIME SCENE TEXT AREA DETECTION

    Publication No.: WO2022098488A1

    Publication Date: 2022-05-12

    Application No.: PCT/US2021/055157

    Application Date: 2021-10-15

    Abstract: This application is directed to identifying text areas in an image. A computer system obtains the image including one or more text areas, and generates a sequence of feature maps from the image based on a downsampling rate. Each feature map has a first dimension and a second dimension, and the feature maps include a first feature map and a second feature map. Each of the first and second dimensions of the first feature map has a respective size that is reduced to that of a respective dimension of the second feature map by the downsampling rate. The second feature map is upsampled by an upsampling rate using a local context-aware upsampling network. The upsampled second feature map is aggregated with the first feature map to generate an aggregated first feature map. The one or more text areas are identified in the image based on the aggregated first feature map.
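The upsample-and-aggregate step described above can be sketched roughly as follows. This is a minimal NumPy illustration under assumed shapes; the patent's local context-aware upsampling network is a learned module, for which plain nearest-neighbor upsampling stands in here.

```python
import numpy as np

def upsample_nearest(fmap, rate):
    """Upsample an (H, W, C) feature map by `rate` along both spatial axes.
    Placeholder for the learned local context-aware upsampling network."""
    return fmap.repeat(rate, axis=0).repeat(rate, axis=1)

def aggregate(first, second, rate=2):
    """Aggregate the upsampled second (coarser) feature map with the
    first (finer) feature map by element-wise addition."""
    return first + upsample_nearest(second, rate)

first = np.ones((8, 8, 16))    # finer-resolution feature map
second = np.ones((4, 4, 16))   # each spatial dimension reduced by rate 2
agg = aggregate(first, second, rate=2)
print(agg.shape)  # (8, 8, 16)
```

Element-wise addition is one common aggregation choice (as in feature pyramid networks); the patent text does not specify the exact operator.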

    HARMONICS BASED TARGET SPEECH EXTRACTION NETWORK

    Publication No.: WO2022204612A1

    Publication Date: 2022-09-29

    Application No.: PCT/US2022/025476

    Application Date: 2022-04-20

    Inventors: ZHANG, Yi; LIN, Yuan

    Abstract: An apparatus may include a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a weighting vector based on a feature map of reference audio of a speaker, generate a speaker embedding based on the feature map and the weighting vector, wherein the speaker embedding is configured to filter speech components of a voice of the speaker, wherein the speech components include one or more harmonic frequencies of the voice of the speaker, and extract, from a speech mixture, audio of the speaker based on the speaker embedding, wherein the speech mixture is a mixed audio signal that includes the voice of the speaker and other sounds.
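The weighting-vector-to-embedding step can be pictured as attention-style pooling over reference-audio frames. In this sketch the per-frame score is a simple frame norm standing in for the learned scorer; the shapes and scoring function are assumptions, not the patent's specification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def speaker_embedding(feature_map):
    """feature_map: (T, D) frames of reference-audio features.
    The weighting vector is a softmax over per-frame scores (frame norm
    stands in for the learned scorer); the speaker embedding is the
    weight-pooled sum of frames."""
    scores = np.linalg.norm(feature_map, axis=1)  # per-frame scores, (T,)
    weights = softmax(scores)                     # weighting vector, (T,)
    return weights @ feature_map                  # speaker embedding, (D,)

emb = speaker_embedding(np.random.default_rng(0).normal(size=(50, 8)))
print(emb.shape)  # (8,)
```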

    METHOD AND SYSTEM FOR MULTI-LANGUAGE TEXT RECOGNITION MODEL WITH AUTONOMOUS LANGUAGE CLASSIFICATION

    Publication No.: WO2021237227A1

    Publication Date: 2021-11-25

    Application No.: PCT/US2021/040137

    Application Date: 2021-07-01

    Abstract: Systems and methods are provided for implementing multi-language scene text recognition. Particularly, the system and method can improve automated text recognition applications by autonomously recognizing characters in text, and a language of origin for the text. Additionally, a multi-language text recognition model is employed, which applies deep learning algorithms to accurately detect multiple languages using one model. Therefore, the system and method can achieve an efficient, accurate, and seamless integration of autonomous language detection and character recognition for multiple languages using a single model. A method can involve extracting visual features corresponding to textual content of an input image, where the input image comprises textual content and non-textual content. The extracted features can be encoded to map each visual feature with a character to recognize the textual content. Further, a language for the recognized text can be autonomously recognized based on index values corresponding to the characters.
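One plausible reading of "language recognized from index values" is that each language's characters occupy a known block of indices in a shared dictionary, so the decoded indices themselves vote for a language. The ranges below are entirely hypothetical, for illustration only.

```python
# Hypothetical layout: each language's characters occupy a contiguous
# block of indices within one shared multi-language dictionary.
LANG_RANGES = {"latin": range(0, 100), "cjk": range(100, 5000)}

def infer_language(char_indices):
    """Vote for the language whose index range covers the most
    recognized character indices."""
    votes = {lang: 0 for lang in LANG_RANGES}
    for idx in char_indices:
        for lang, rng in LANG_RANGES.items():
            if idx in rng:
                votes[lang] += 1
    return max(votes, key=votes.get)

print(infer_language([3, 17, 42]))       # latin
print(infer_language([120, 512, 4096]))  # cjk
```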

    SYSTEMS, METHODS, AND DEVICES FOR AUDIO-VISUAL SPEECH PURIFICATION USING RESIDUAL NEURAL NETWORKS

    Publication No.: WO2022197296A1

    Publication Date: 2022-09-22

    Application No.: PCT/US2021/022823

    Application Date: 2021-03-17

    Abstract: This application is directed to audio purification. An electronic device obtains image data corresponding to a sequence of image frames that focus on lip movement of a person. The electronic device also obtains audio data that is synchronous with the lip movement in the sequence of image frames and modifies the audio data using the image data, thereby reducing background noise in the audio data. In some embodiments, the audio data is separated into first audio magnitude data and first audio phase data corresponding to distinct audio frequencies. The first audio magnitude data are modified to second audio magnitude data based on the image data. The first audio phase data are updated to second audio phase data based on the second audio magnitude data. The audio data is modified by recovering it from the second audio magnitude data and the second audio phase data.
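The magnitude/phase split and recovery described above can be sketched on a complex spectrogram. In this sketch a real-valued mask stands in for the image-conditioned network that maps lip movement to a magnitude update, and the patent's magnitude-driven phase refinement is replaced by an identity placeholder; both are assumptions.

```python
import numpy as np

def purify(stft, visual_mask):
    """stft: complex (F, T) spectrogram of the noisy audio.
    visual_mask: real (F, T) gains in [0, 1], standing in for the
    image-conditioned network that updates magnitudes from lip movement."""
    mag, phase = np.abs(stft), np.angle(stft)  # first magnitude / phase data
    mag2 = mag * visual_mask                   # second magnitude data
    phase2 = phase                             # placeholder phase update
    return mag2 * np.exp(1j * phase2)          # recovered complex spectrogram

stft = np.full((4, 3), 2 + 0j)
clean = purify(stft, np.full((4, 3), 0.5))
print(clean[0, 0])  # (1+0j)
```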

    TRANSFORMER-BASED SCENE TEXT DETECTION

    Publication No.: WO2022099325A1

    Publication Date: 2022-05-12

    Application No.: PCT/US2022/011790

    Application Date: 2022-01-10

    Abstract: A system may include a backbone network configured to generate feature maps from an image, a transformer network coupled to the backbone network, and a scene text detection subsystem, the scene text detection subsystem comprising a processor, and a non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a plurality of image tokens from one or more feature maps of an input image, generate, via an encoder of the transformer network, a set of token queries, wherein the set of token queries quantifies an attention of a respective textual feature of a respective image token of the plurality of image tokens relative to all other respective textual features of all other image tokens, and generate, via a decoder of the transformer network, a set of predicted text boxes of the input image.
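The "attention of each image token relative to all other image tokens" is the standard scaled dot-product self-attention pattern; a minimal, unlearned sketch (no projection matrices, assumed shapes) looks like this:

```python
import numpy as np

def token_attention(tokens):
    """tokens: (N, D) image tokens. Returns an (N, N) matrix whose row i
    quantifies token i's attention to every token, via scaled dot-product
    self-attention (learned query/key projections omitted for brevity)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # rows sum to 1

attn = token_attention(np.random.default_rng(1).normal(size=(6, 4)))
print(attn.shape)  # (6, 6)
```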

    SCENE TEXT RECOGNITION MODEL WITH TEXT ORIENTATION OR ANGLE DETECTION

    Publication No.: WO2022046486A1

    Publication Date: 2022-03-03

    Application No.: PCT/US2021/046490

    Application Date: 2021-08-18

    Abstract: Novel tools and techniques are provided for implementing scene text recognition model with text orientation detection or text angle detection. In various embodiments, a computing system may perform feature extraction on an input image, containing text, using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, and may perform orientation or angle determination of the text in the input image, using a first dense layer of the CNN. If the image of the text is determined to be in the normal orientation or in response to the input image having been rotated to the normal orientation, the computing system may perform feature encoding on values in the feature map, using a sequence layer of the CNN to produce an encoded feature map. The computing system may use a second dense layer of the CNN to process each encoded feature to produce a classification of text.
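The rotate-to-normal-orientation step preceding feature encoding might be sketched as below. The four-way angle encoding and the rotation direction are hypothetical conventions, not specified by the abstract.

```python
import numpy as np

def normalize_orientation(image, angle_class):
    """angle_class in {0, 1, 2, 3} encoding 0/90/180/270 degrees, as the
    CNN's first dense layer might output (hypothetical encoding); rotate
    the text crop back to the normal orientation before encoding."""
    return np.rot90(image, k=-angle_class) if angle_class else image

img = np.arange(6).reshape(2, 3)
print(normalize_orientation(img, 1).shape)  # (3, 2)
```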

    MULTI-HEAD TEXT RECOGNITION MODEL FOR MULTI-LINGUAL OPTICAL CHARACTER RECOGNITION

    Publication No.: WO2021081562A2

    Publication Date: 2021-04-29

    Application No.: PCT/US2021/014171

    Application Date: 2021-01-20

    Abstract: This application is directed to performing optical character recognition (OCR) using deep learning techniques. An electronic device receives an image and a language indicator that indicates that the textual content in the image corresponds to a first language. The electronic device processes the image using a multilingual text recognition model applicable to a plurality of languages. The electronic device generates a feature sequence including a plurality of probability values corresponding to the textual content of the image. The feature sequence includes a plurality of feature subsets that correspond to the plurality of languages. For each feature subset, each probability value indicates a probability that a respective textual content corresponds to a respective character in a dictionary of the corresponding language. The electronic device constructs a sparse mask based on the first language and combines the feature sequence and the sparse mask to determine the textual content.
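The sparse-mask step can be illustrated with a union dictionary that concatenates per-language character subsets: the mask zeroes every subset except the indicated language's, so decoding can only select that language's characters. The dictionary layout and slice offsets below are hypothetical.

```python
import numpy as np

# Hypothetical layout: the union dictionary concatenates per-language
# subsets, here 26 Latin characters followed by 100 CJK characters.
LANG_SLICES = {"en": slice(0, 26), "zh": slice(26, 126)}

def recognize(feature_seq, language):
    """feature_seq: (T, V) probability values over the union dictionary.
    The sparse mask keeps only the indicated language's subset, so the
    per-step argmax is confined to that language's characters."""
    mask = np.zeros(feature_seq.shape[1])
    mask[LANG_SLICES[language]] = 1.0
    return (feature_seq * mask).argmax(axis=1)

probs = np.random.default_rng(2).random((5, 126))
idx = recognize(probs, "en")
print(idx.shape)  # (5,) -- every index falls inside the "en" slice
```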
