Publication No.: US20240119075A1
Publication Date: 2024-04-11
Application No.: US18479646
Filing Date: 2023-10-02
Inventors: PRABIR MALLICK, SAMIRAN PAL, AVINASH KUMAR SINGH, ANUMITA DASGUPTA, SOHAM DATTA, KAAMRAAN KHAN, TAPAS NAYAK, INDRAJIT BHATTACHARYA, GIRISH KESHAV PALSHIKAR
IPC Classification: G06F16/332, G06F16/33, G06F40/186, G06F40/284, G06F40/289, G06F40/30, G06F40/40
CPC Classification: G06F16/3329, G06F16/3344, G06F40/186, G06F40/284, G06F40/289, G06F40/30, G06F40/40
Abstract: Conventional Question and Answer (QA) datasets are created for factoid questions only, whereas the present disclosure generates a longform technical QA dataset from textbooks. Initially, the system receives a technical textbook document and extracts a plurality of contexts from it. A first plurality of questions is then generated based on the plurality of contexts. A plurality of answerable questions is further generated from the plurality of contexts using an unsupervised template-based matching technique. The first plurality of questions and the plurality of answerable questions are then combined into a combined plurality of questions. Answers to the combined plurality of questions are generated using an autoregressive language model, and a mapping score is computed. A plurality of optimal answers is then selected based on the corresponding mapping scores. Finally, a longform technical question and answer dataset is generated based on the combined plurality of questions and the optimal answers.
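
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The templates in `TEMPLATES`, the token-overlap function `overlap_score` (a simple stand-in for the mapping score, whose actual definition the abstract does not give), the `threshold` parameter, and the callables `generate_questions` and `generate_answers` (placeholders for the question generator and the autoregressive language model) are all assumptions introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str
    score: float


# Hypothetical templates: a regex over the context and a question format.
TEMPLATES = [
    (re.compile(r"(\w[\w\- ]*?) is defined as"), "What is {0}?"),
    (re.compile(r"advantages of ([\w\- ]+?) (?:are|include)\b", re.I),
     "What are the advantages of {0}?"),
]


def template_questions(context: str) -> List[str]:
    """Unsupervised template matching: emit a question per template hit."""
    questions = []
    for pattern, fmt in TEMPLATES:
        for match in pattern.finditer(context):
            questions.append(fmt.format(match.group(1).strip()))
    return questions


def overlap_score(answer: str, context: str) -> float:
    """Stand-in mapping score (assumption): share of answer tokens
    that also occur in the context."""
    answer_tokens = re.findall(r"\w+", answer.lower())
    context_tokens = set(re.findall(r"\w+", context.lower()))
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


def build_dataset(
    contexts: List[str],
    generate_questions: Callable[[str], List[str]],
    generate_answers: Callable[[str, str], List[str]],
    threshold: float = 0.5,
) -> List[QAPair]:
    """Combine both question sets, score candidate answers, keep the best."""
    dataset = []
    for context in contexts:
        # Combined plurality of questions, de-duplicated in original order.
        combined = list(dict.fromkeys(
            generate_questions(context) + template_questions(context)))
        for question in combined:
            candidates = generate_answers(question, context)
            if not candidates:
                continue
            scored = [(overlap_score(a, context), a) for a in candidates]
            best_score, best_answer = max(scored, key=lambda pair: pair[0])
            if best_score >= threshold:
                dataset.append(QAPair(question, best_answer, best_score))
    return dataset
```

In practice the two callables would wrap trained models, and the selection step would use whatever mapping score the full specification defines. The sketch only shows how the stages connect.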