OPTIMAL PII-SAFE TRAINING SET GENERATION FOR SPEECH RECOGNITION MODEL

    Publication No.: EP3813059A1

    Publication date: 2021-04-28

    Application No.: EP19218256.6

    Filing date: 2019-12-19

    Abstract: A method comprising: receiving, as input, one or more audio files; applying a trained speech recognition algorithm to said one or more audio files, to obtain textual output corresponding to each of said one or more audio files; extracting, based on said textual output, from each of said one or more audio files, one or more portions having a specified syntactic pattern; selecting a subset of said portions based on at least one of: (i) a content of said textual output associated with each of said portions, (ii) a duration of each of said portions, and (iii) a confidence score assigned by said trained speech recognition algorithm to said obtained textual output; receiving, as input, transcriptions of each of said portions; generating a re-training set comprising: (iv) said portions in said subset, and (v) said transcriptions; and re-training said trained speech recognition algorithm on said re-training set.
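
The selection and set-generation steps of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Portion` record, the function names, the threshold values, and the digit-based PII heuristic are all hypothetical and introduced here only for illustration. The abstract also does not say whether high- or low-confidence portions are preferred; this sketch assumes that portions the recognizer was less sure about are the ones kept for re-training.

```python
import re
from dataclasses import dataclass

# Hypothetical content rule: treat any digit in the recognized text as
# possible PII (card numbers, phone numbers). A real system would need a
# far more careful detector.
PII_LIKE = re.compile(r"\d")


@dataclass
class Portion:
    """One extracted audio portion plus the recognizer's output for it."""
    audio_id: str
    start_s: float
    end_s: float
    text: str          # textual output of the trained recognizer
    confidence: float  # recognizer confidence in [0, 1]

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def select_portions(portions, min_dur_s=1.0, max_dur_s=10.0, max_confidence=0.9):
    """Keep portions that pass the content, duration, and confidence criteria.

    All thresholds are illustrative assumptions, not values from the source.
    """
    selected = []
    for p in portions:
        if PII_LIKE.search(p.text):                        # (i) content
            continue
        if not (min_dur_s <= p.duration_s <= max_dur_s):   # (ii) duration
            continue
        if p.confidence > max_confidence:                  # (iii) confidence
            continue
        selected.append(p)
    return selected


def build_retraining_set(selected, transcriptions):
    """Pair each selected portion with its received transcription.

    `transcriptions` maps (audio_id, start_s) to the transcript text;
    portions without a transcription are skipped.
    """
    pairs = []
    for p in selected:
        key = (p.audio_id, p.start_s)
        if key in transcriptions:
            pairs.append((p, transcriptions[key]))
    return pairs
```

In this sketch all three criteria are applied together, although the abstract requires only at least one of them. The resulting list of (portion, transcription) pairs would then be passed to whatever re-training routine the speech recognition model provides.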