GENERATING FAKE DOCUMENTS USING WORD EMBEDDINGS TO DETER INTELLECTUAL PROPERTY THEFT

    Publication No.: WO2022146921A1

    Publication Date: 2022-07-07

    Application No.: PCT/US2021/065209

    Filing Date: 2021-12-27

    IPC Classification: G07D7/20 G06K9/62

    Abstract: A computer-implemented method, system, and computer program product for generating fake documents. A corpus of domain-specific documents is built, and a word embedding is identified for each word in those documents as an embedding vector. Concepts in the corpus are then clustered by clustering the embedding vectors, and a feasible candidate replacement set is generated for each concept from the clustered concepts in the corpus. After these pre-processing steps, concepts are extracted from a document and a concept importance value is computed for each extracted concept; the extracted concepts are then clustered into bins based on these values. A joint optimization problem is solved to identify both the concepts in the document to be replaced, using the clustered concepts in the bins, and the corresponding replacement concepts, obtained from the clustered concepts in the corpus. These replacements are made to generate a fake document.
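
    The pipeline in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the application: the toy two-dimensional embeddings, the greedy cosine-similarity clustering, term frequency as the concept importance value, and all function names. In particular, the joint optimization problem is replaced by a simple greedy heuristic (replace the most important concept in each bin with the least similar candidate from its own cluster), so this shows the data flow only, not the claimed method.

```python
import math
from collections import Counter

# Toy 2-D embedding vectors; illustrative values only, not from the source.
EMBEDDINGS = {
    "copper": (0.9, 0.1), "nickel": (0.85, 0.2), "zinc": (0.8, 0.15),
    "anneal": (0.1, 0.9), "quench": (0.2, 0.85), "temper": (0.15, 0.95),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def cluster_concepts(embeddings, threshold=0.95):
    """Greedy clustering: a word joins the first cluster whose seed is similar enough."""
    clusters = []
    for word in sorted(embeddings):
        for cluster in clusters:
            if cosine(embeddings[word], embeddings[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

def candidate_replacements(clusters):
    """Candidate set for a concept: the other concepts in its cluster."""
    return {w: [c for c in cluster if c != w] for cluster in clusters for w in cluster}

def importance_bins(tokens, vocabulary, n_bins=2):
    """Rank extracted concepts by term frequency (assumed importance) and split into bins."""
    counts = Counter(t for t in tokens if t in vocabulary)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    size = math.ceil(len(ranked) / n_bins)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

def make_fake(text, embeddings, per_bin=1):
    """Greedy stand-in for the joint optimization: replace the top concept of each bin."""
    tokens = text.split()
    candidates = candidate_replacements(cluster_concepts(embeddings))
    mapping = {}
    for concepts in importance_bins(tokens, embeddings):
        for concept in concepts[:per_bin]:
            options = candidates[concept]
            if options:
                # Least similar candidate that is still in the same cluster.
                mapping[concept] = min(
                    options, key=lambda c: (cosine(embeddings[concept], embeddings[c]), c)
                )
    return " ".join(mapping.get(t, t) for t in tokens), mapping

original = "anneal the copper then quench the copper"
fake, mapping = make_fake(original, EMBEDDINGS)
print(fake)
print(mapping)
```

    Restricting replacements to concepts in the same cluster keeps the fake document plausible within the domain, while replacing concepts from every importance bin changes the content enough to mislead a reader. A closer approximation of the abstract would select the replaced concepts and their replacements together with an optimization solver rather than a per-bin greedy choice.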