SYSTEM AND METHOD FOR GENERATING A SYNTHETIC DATASET FROM AN ORIGINAL DATASET

    Publication No.: US20230060848A1

    Publication Date: 2023-03-02

    Application No.: US17407181

    Filing Date: 2021-08-19

    IPC Classification: G06K9/62

    Abstract: A method for generating a synthetic dataset from an original dataset includes encoding categorical features of the original dataset, embedding the encoded dataset in a low-dimensional space, selecting a seed record from the embedded dataset, identifying a plurality of nearest neighbor records to the seed record, generating a new record by randomly selecting features from the plurality of nearest neighbor records, and concatenating the new record into the synthetic dataset. For a synthetic dataset that contains N records, where N may be the same as or different from the number of records in the original dataset, the selecting, identifying, generating, and concatenating operations are performed a total of N times on the records in the embedded dataset.
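
The steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the abstract does not say which encoding, embedding, or distance measure is used, so one-hot encoding, PCA computed via SVD, and Euclidean distance are stand-ins chosen here. The function names (`one_hot_encode`, `embed`, `synthesize`) and the parameters `k` and `n_components` are hypothetical. The sketch also assumes that neighbors are located in the embedded space while feature values are copied from the corresponding original rows, so that the output stays in the original units.

```python
import numpy as np


def one_hot_encode(records, categorical_cols):
    """Encode categorical columns as 0/1 indicators; numeric columns pass through.

    One-hot encoding is an assumption; the abstract only says "encoding".
    """
    n_cols = len(records[0])
    blocks = []
    for j in range(n_cols):
        col = [row[j] for row in records]
        if j in categorical_cols:
            levels = sorted(set(col))
            blocks.append(np.array([[float(v == lv) for lv in levels] for v in col]))
        else:
            blocks.append(np.array(col, dtype=float).reshape(-1, 1))
    return np.hstack(blocks)


def embed(encoded, n_components=2):
    """Project onto the leading principal components (PCA via SVD, an assumption)."""
    centered = encoded - encoded.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def synthesize(records, categorical_cols, n_records, k=3, seed=0):
    """Build n_records synthetic rows by mixing features of nearby original rows."""
    rng = np.random.default_rng(seed)
    embedded = embed(one_hot_encode(records, categorical_cols))
    n_cols = len(records[0])
    synthetic = []
    for _ in range(n_records):  # select, identify, generate, concatenate: N times
        seed_idx = rng.integers(len(records))
        # Euclidean distance from the seed to every row in the embedded space.
        dists = np.linalg.norm(embedded - embedded[seed_idx], axis=1)
        # The k closest rows; the seed itself is at distance 0.
        neighbors = np.argsort(dists)[:k]
        # For each feature, pick one neighbor at random to donate its value.
        donors = rng.choice(neighbors, size=n_cols)
        synthetic.append(tuple(records[d][j] for j, d in enumerate(donors)))
    return synthetic


if __name__ == "__main__":
    # Toy data for illustration only: (age, city, plan).
    original = [
        (34, "Austin", "basic"),
        (36, "Austin", "premium"),
        (51, "Boston", "basic"),
        (49, "Boston", "premium"),
        (23, "Denver", "basic"),
    ]
    for row in synthesize(original, categorical_cols={1, 2}, n_records=8):
        print(row)
```

Because `n_records` is independent of `len(records)`, the synthetic dataset can be smaller or larger than the original, matching the abstract's statement about N. The numeric column is not rescaled in this sketch, so it dominates the distance computation; a practical version would likely standardize the features before embedding.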