Computerized data compression and analysis using potentially non-adjacent pairs

    Publication No.: US20210157818A1

    Publication date: 2021-05-27

    Application No.: US16951954

    Filing date: 2020-11-18

    Applicant: Takashi Suzuki

    Inventor: Takashi Suzuki

    IPC classification: G06F16/25 G06F16/23 H03M7/30

    Abstract: A computerized method of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, includes: (i) automatically identifying a plurality of sequential and non-sequential symbol pairs in an input document; (ii) counting the number of appearances of each unique symbol pair; and (iii) producing a compressed document that includes a replacement symbol at each position associated with one of the plurality of symbol pairs, at least one of which corresponds to a non-sequential symbol pair. For each non-sequential pair, the compressed document includes corresponding indicia indicating the distance between the locations of the pair's symbols in the input document. In some instances, the plurality of symbol pairs includes only those non-sequential pairs for which that distance is less than a numeric distance cap.
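    Steps (i) to (iii) of this abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names `compress` and `decompress`, the parameters `distance_cap` and `min_count`, the greedy left-to-right choice of partner, and the `("R", n)` form of the replacement symbols are all assumptions made for illustration. Input symbols are assumed to be strings. Each replaced pair is emitted as a (replacement symbol, distance) token, so adjacent pairs carry distance 1 and non-adjacent pairs carry a larger distance.

```python
from collections import Counter


def compress(symbols, distance_cap=4, min_count=2):
    """Replace frequent, possibly non-adjacent pairs (illustrative sketch only)."""
    symbols = list(symbols)
    n = len(symbols)
    # (i) + (ii): count every (left, right) pair whose distance is below the cap.
    counts = Counter(
        (symbols[i], symbols[j])
        for i in range(n)
        for j in range(i + 1, min(i + distance_cap, n))
    )
    table = {}               # (left, right) -> replacement symbol
    consumed = [False] * n   # positions already absorbed as the right half of a pair
    out = []
    # (iii): greedy left-to-right pass (an assumed strategy, not from the source).
    for i in range(n):
        if consumed[i]:
            continue
        best = None  # (count, partner position)
        for j in range(i + 1, min(i + distance_cap, n)):
            if consumed[j]:
                continue
            c = counts[(symbols[i], symbols[j])]
            if c >= min_count and (best is None or c > best[0]):
                best = (c, j)
        if best is None:
            out.append(symbols[i])          # literal symbol, left unchanged
        else:
            j = best[1]
            pair = (symbols[i], symbols[j])
            rep = table.setdefault(pair, ("R", len(table)))
            consumed[j] = True
            out.append((rep, j - i))        # replacement symbol plus distance indicia
    return out, table


def decompress(tokens, table):
    """Invert compress() by writing each pair's halves back at their positions."""
    pairs = {rep: pair for pair, rep in table.items()}
    filled = {}  # position -> symbol
    pos = 0
    for tok in tokens:
        while pos in filled:     # skip slots already filled by an earlier pair
            pos += 1
        if isinstance(tok, tuple):
            rep, dist = tok
            left, right = pairs[rep]
            filled[pos] = left
            filled[pos + dist] = right
        else:
            filled[pos] = tok
        pos += 1
    return [filled[k] for k in range(len(filled))]
```

    For example, calling `compress("red big apple red small apple red apple".split())` pairs each of the first two occurrences of "red" with the "apple" two positions later, and `decompress` restores the original word sequence from the tokens and the pair table. A real encoder would also have to account for the storage cost of the pair table and the distance values.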

Computerized methods of data compression and analysis

    Publication No.: US10387377B2

    Publication date: 2019-08-20

    Application No.: US15600495

    Filing date: 2017-05-19

    Applicant: Takashi Suzuki

    Inventor: Takashi Suzuki

    Abstract: A computerized method and apparatus compress symbolic information, such as text. Symbolic information is compressed by recursively identifying pairs of symbols (e.g., pairs of words or characters) and replacing each pair with a respective replacement symbol. The number of times each symbol pair appears in the uncompressed text is counted, and a pair is replaced only if it appears more than a threshold number of times. In recursive passes, each replaced pair can include a previously substituted replacement symbol. The method and apparatus can achieve high compression, especially for large datasets. Metadata generated during compression of the documents, such as the number of times each pair appears, can be used to analyze the documents and find similarities between two documents.
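    The recursive pair replacement described in this abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented method: the names `compress` and `decompress`, the parameters `threshold` and `max_passes`, and the `("R", n)` replacement symbols are hypothetical, and pair counts here include overlapping occurrences. Input symbols are assumed to be strings.

```python
from collections import Counter


def compress(symbols, threshold=1, max_passes=10):
    """Repeatedly replace adjacent pairs seen more than `threshold` times."""
    symbols = list(symbols)
    dictionary = {}   # replacement symbol -> (left, right)
    pair_to_sym = {}  # (left, right) -> replacement symbol
    for _ in range(max_passes):
        # Count every adjacent pair in the current (possibly already reduced) text.
        counts = Counter(zip(symbols, symbols[1:]))
        out, i, replaced = [], 0, False
        while i < len(symbols):
            pair = tuple(symbols[i:i + 2])
            if len(pair) == 2 and counts[pair] > threshold:
                if pair not in pair_to_sym:
                    sym = ("R", len(dictionary))
                    dictionary[sym] = pair
                    pair_to_sym[pair] = sym
                out.append(pair_to_sym[pair])
                i += 2
                replaced = True
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        if not replaced:
            break  # no pair cleared the threshold, so further passes change nothing
    return symbols, dictionary


def decompress(symbols, dictionary):
    """Expand replacement symbols, including nested ones, back to the input."""
    out = []
    stack = list(reversed(symbols))
    while stack:
        s = stack.pop()
        if s in dictionary:
            left, right = dictionary[s]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(s)
    return out
```

    Because later passes count pairs over the output of earlier passes, a replacement symbol can itself become half of a new pair, which is how longer repeated phrases collapse into a single symbol. The `counts` computed in each pass are the kind of metadata the abstract mentions as usable for comparing documents; this sketch discards them after each pass.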