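Patent application
- Title: METHOD FOR AUTOMATICALLY IDENTIFYING SENTENCE BOUNDARIES IN NOISY CONVERSATIONAL DATA
- Title (translated from Chinese): Method for automatically identifying sentence boundaries in spoken conversational data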
- Application number: US11845462
- Filing date: 2007-08-27
- Publication number: US20090063150A1
- Publication date: 2009-03-05
- Inventors: Tetsuya Nasukawa, Diwakar Punjani, Shourya Roy, L. Venkata Subramaniam, Hironori Takeuchi
- Applicants: Tetsuya Nasukawa, Diwakar Punjani, Shourya Roy, L. Venkata Subramaniam, Hironori Takeuchi
- Applicant address: Armonk, NY, US
- Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current assignee address: Armonk, NY, US
- Main classification: G10L15/04
- IPC classification: G10L15/04
Abstract:
Sentence boundaries in noisy conversational transcription data are automatically identified. Noise and transcription symbols are removed, and a training set is formed with sentence boundaries marked based on long silences or on manual markings in the transcribed data. Frequencies of head and tail n-grams that occur at the beginning and ending of sentences are determined from the training set. N-grams that occur a significant number of times in the middle of sentences in relation to their occurrences at the beginning or ending of sentences are filtered out. A boundary is marked before every head n-gram and after every tail n-gram occurring in the conversational data and remaining after filtering. Turns are identified. A boundary is marked after each turn, unless the turn ends with an impermissible tail word or is an incomplete turn. The marked boundaries in the conversational data identify sentence boundaries.
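The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the function names, the `min_ratio` and `min_count` thresholds, and the `bad_tail_words` parameter are all assumptions introduced here, and the abstract does not say how "a significant number of times" is measured. Noise removal, silence-based training marks, and detection of incomplete turns are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def learn_boundary_ngrams(sentences, n=1, min_ratio=0.8, min_count=2):
    """Learn head and tail n-grams from sentences with known boundaries.

    An n-gram is kept only if it starts (or ends) a sentence at least
    `min_count` times and at least `min_ratio` of all its occurrences are
    in that position. Both thresholds are hypothetical stand-ins for the
    filtering step described in the abstract.
    """
    head, tail, total = Counter(), Counter(), Counter()
    for sent in sentences:
        grams = ngrams(sent, n)
        if not grams:
            continue
        head[grams[0]] += 1
        tail[grams[-1]] += 1
        total.update(grams)

    def keep(counts):
        return {g for g, c in counts.items()
                if c >= min_count and c / total[g] >= min_ratio}

    return keep(head), keep(tail)


def segment_turn(tokens, heads, tails, n=1, bad_tail_words=frozenset()):
    """Split one speaker turn into sentences.

    Returns (sentences, turn_end_is_boundary). A boundary is placed before
    every head n-gram and after every tail n-gram. The end of the turn
    counts as a boundary unless the last word is in `bad_tail_words`
    (an assumed list of impermissible tail words).
    """
    cuts = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in heads:
            cuts.add(i)        # boundary before a head n-gram
        if gram in tails:
            cuts.add(i + n)    # boundary after a tail n-gram
    # Positions 0 and len(tokens) are the turn edges, not internal cuts.
    cuts.discard(0)
    cuts.discard(len(tokens))

    sentences, start = [], 0
    for cut in sorted(cuts):
        sentences.append(tokens[start:cut])
        start = cut
    if start < len(tokens):
        sentences.append(tokens[start:])

    turn_end_is_boundary = bool(tokens) and tokens[-1] not in bad_tail_words
    return sentences, turn_end_is_boundary
```

Measuring the filter as a ratio of boundary occurrences to all occurrences is one simple reading of the abstract; a count-based or statistical significance test would fit the same description.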