TEXT-TO-VECTORIZED REPRESENTATION TRANSFORMATION

    Publication No.: US20220027557A1

    Publication date: 2022-01-27

    Application No.: US16934220

    Application date: 2020-07-21

    Abstract: An approach is disclosed for a fast and accurate word embedding model, “desc2vec,” that handles out-of-dictionary (OOD) words by learning from the dictionary descriptions of those words. Upon determining that a target text element is not in a set of reference text elements, information describing the target text element is obtained. The information comprises a set of descriptive text elements. A set of vectorized representations for the set of descriptive text elements is determined. A target vectorized representation for the target text element is then determined from the set of vectorized representations using a machine learning model. The machine learning model is trained to represent a predetermined association between the set of vectorized representations for the descriptive text elements describing the target text element and the target vectorized representation.
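
The flow in this abstract (check the dictionary, fall back to the description, embed the description words, map them to one target vector) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy embeddings, the `embed` function, and the `mean_pool` stand-in are hypothetical, and plain averaging replaces the trained machine learning model, whose architecture the abstract does not specify.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]

# Hypothetical toy embeddings for in-dictionary words (illustrative values only).
REFERENCE_EMBEDDINGS: Dict[str, Vector] = {
    "small": [1.0, 0.0, 0.0],
    "domestic": [0.0, 1.0, 0.0],
    "animal": [0.0, 0.0, 1.0],
}


def mean_pool(vectors: List[Vector]) -> Vector:
    """Assumed stand-in for the trained model: average the description vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed(
    word: str,
    description: List[str],
    embeddings: Dict[str, Vector],
    model: Callable[[List[Vector]], Vector] = mean_pool,
) -> Optional[Vector]:
    """Return a vector for `word`, falling back to its description if it is OOD."""
    # Target text element is in the reference set: use its stored vector.
    if word in embeddings:
        return embeddings[word]
    # Otherwise vectorize the descriptive text elements that are known.
    desc_vectors = [embeddings[t] for t in description if t in embeddings]
    if not desc_vectors:
        return None  # nothing usable in the description
    # Map the set of description vectors to one target vector.
    return model(desc_vectors)


vector = embed("kitteh", ["small", "domestic", "animal"], REFERENCE_EMBEDDINGS)
print(vector)
```

A learned model would be passed in through the `model` parameter in place of `mean_pool`; the lookup and fallback logic around it stays the same.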

SELF-ADAPTIVE WEB CRAWLING AND TEXT EXTRACTION

    Publication No.: US20190303501A1

    Publication date: 2019-10-03

    Application No.: US15936666

    Application date: 2018-03-27

    Abstract: A method, a computer system, and a computer program product for crawling and extracting main content from a web page are provided. The present invention may include retrieving an HTML document associated with a web page. The present invention may then include identifying at least one entry point located in the retrieved HTML document by utilizing a self-adaptive entry point locator. The present invention may also include extracting a main content article associated with the retrieved HTML document based on the identified at least one entry point. The present invention may further include presenting the extracted main content associated with the retrieved HTML document to a user.
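
The locate-then-extract steps in this abstract can be illustrated with a small sketch using Python's standard `html.parser`. The abstract does not say how the self-adaptive entry point locator works, so the heuristic here (treat the container element holding the most paragraph text as the entry point) is purely an assumption, and `ParagraphCollector`, `extract_main_content`, and the sample page are hypothetical names and data.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Assumed candidate entry-point elements; the patent does not list these.
CONTAINER_TAGS = {"article", "main", "section", "div"}


class ParagraphCollector(HTMLParser):
    """Credits each <p> text run to its nearest enclosing container."""

    def __init__(self) -> None:
        super().__init__()
        self.open_containers: List[int] = []
        self.paragraphs: Dict[int, List[str]] = {}
        self.next_id = 0
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.open_containers.append(self.next_id)
            self.paragraphs[self.next_id] = []
            self.next_id += 1
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.open_containers:
            self.open_containers.pop()
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_paragraph and self.open_containers:
            self.paragraphs[self.open_containers[-1]].append(text)


def extract_main_content(html: str) -> str:
    """Pick the container with the most paragraph text and return that text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.paragraphs:
        return ""
    best = max(parser.paragraphs.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n".join(best)


SAMPLE_PAGE = """
<html><body>
<div id="nav"><p>Home</p><p>About</p></div>
<div id="content"><p>First paragraph of the article body.</p>
<p>Second paragraph with more detail.</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>
"""

print(extract_main_content(SAMPLE_PAGE))
```

On the sample page the navigation and footer containers hold only a few characters of text, so the content container is selected. A real crawler would also need to fetch the page and handle malformed markup, which this sketch leaves out.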
