Cleaning noise words from transaction descriptions
Abstract:
A method, system, and non-transitory computer-readable medium for removing noise ngrams from transaction records. The method may include obtaining noise ngrams; ordering the noise ngrams based on frequency of occurrence; discarding a portion of the noise ngrams below a frequency threshold to obtain a higher frequency subset of the noise ngrams; obtaining a transaction record of interest; and identifying a portion of the higher frequency subset within the transaction record of interest. Identifying the portion of the higher frequency subset may include constructing a regular expression based on the higher frequency subset; constructing a finite state machine based on the regular expression; providing the transaction record of interest as an input to the finite state machine; and executing the finite state machine. The method may also include removing, based on the identification, the portion of the higher frequency subset from the transaction record of interest.
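The steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the sample counts, the threshold of 100, and the sample description are all hypothetical, and Python's `re` module is used as a stand-in for the finite state machine (it compiles the regular expression into its own internal matcher, which is not necessarily a pure finite automaton).

```python
import re
from collections import Counter


def top_noise_ngrams(noise_counts, min_count):
    """Order noise ngrams by frequency and keep those at or above min_count."""
    ordered = sorted(noise_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [ngram for ngram, count in ordered if count >= min_count]


def build_noise_pattern(ngrams):
    """Build one case-insensitive regular expression matching any noise ngram."""
    if not ngrams:
        # Empty negative lookahead: a pattern that can never match.
        return re.compile(r"(?!)")
    # Try longer ngrams first so "debit card" is preferred over a shorter
    # alternative that happens to share a prefix.
    alternatives = sorted(ngrams, key=len, reverse=True)
    body = "|".join(re.escape(ngram) for ngram in alternatives)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


def clean_description(description, pattern):
    """Remove every matched noise ngram and collapse leftover whitespace."""
    stripped = pattern.sub(" ", description)
    return " ".join(stripped.split())


# Hypothetical noise ngram counts (assumed to come from an earlier analysis).
noise_counts = Counter({"pos": 950, "debit card": 720, "purchase": 640, "ref": 12})

# Keep only the higher frequency subset (threshold chosen for illustration).
subset = top_noise_ngrams(noise_counts, min_count=100)
pattern = build_noise_pattern(subset)

# Hypothetical transaction record of interest.
cleaned = clean_description("POS DEBIT CARD PURCHASE STARBUCKS 0423 SEATTLE", pattern)
print(cleaned)
```

Compiling a single alternation once and reusing it for every record keeps the per-record cost to one pass over the description, which is the practical motivation for building a matcher from the whole higher frequency subset rather than testing each ngram separately.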