-
Publication No.: US20160307033A1
Publication date: 2016-10-20
Application No.: US15193058
Filing date: 2016-06-26
CPC classification: G06K9/00456, G06F17/2223, G06F17/275, G06F17/2775, G06K9/18, G06K9/3208, G06K9/6821, G06K2209/011
Abstract: Disclosed are systems, computer-readable mediums, and methods for determining that text contains Chinese, Japanese, or Korean characters. One method includes determining a language hypothesis for each text fragment in a plurality of text fragments identified from connected components in a document image. The method further includes selecting a first subset of text fragments from the plurality of text fragments based on ratings for the language hypothesis of each text fragment in the plurality of text fragments. The method further includes verifying, by a processor, the language hypothesis of one or more text fragments in the first subset of text fragments based on optical character recognition of the one or more text fragments. The method further includes determining, by the processor, that Chinese, Japanese, or Korean (CJK) characters are present in the document image based on the verification of the language hypothesis of each of the one or more text fragments.
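The selection and verification steps described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Fragment` type, the `"CJK"` hypothesis label, the parameters `k` and `min_confirmed`, and the `verify` callback (a stand-in for an OCR-based check) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fragment:
    """Hypothetical text fragment with a language hypothesis and its rating."""
    fragment_id: int
    hypothesis: str   # assumed labels: "CJK" or "non-CJK"
    rating: float     # higher means more confidence in the hypothesis


def select_top_fragments(fragments: List[Fragment], k: int) -> List[Fragment]:
    """Return the k fragments whose hypotheses carry the highest ratings."""
    return sorted(fragments, key=lambda f: f.rating, reverse=True)[:k]


def document_has_cjk(
    fragments: List[Fragment],
    verify: Callable[[Fragment], bool],
    k: int = 3,
    min_confirmed: int = 1,
) -> bool:
    """Decide whether CJK text is present.

    Assumption: only fragments hypothesized as CJK are candidates, and
    `verify` stands in for an OCR pass that confirms the hypothesis.
    """
    candidates = [f for f in fragments if f.hypothesis == "CJK"]
    subset = select_top_fragments(candidates, k)
    confirmed = sum(1 for f in subset if verify(f))
    return confirmed >= min_confirmed
```

Verifying only the highest-rated subset keeps the costly OCR step to a few fragments; how many fragments to check and how many confirmations to require are tuning choices the abstract does not specify.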
-
Publication No.: US09811726B2
Publication date: 2017-11-07
Application No.: US15193058
Filing date: 2016-06-26
CPC classification: G06K9/00456, G06F17/2223, G06F17/275, G06F17/2775, G06K9/18, G06K9/3208, G06K9/6821, G06K2209/011
Abstract: Disclosed are systems, computer-readable mediums, and methods for determining that text contains Chinese, Japanese, or Korean characters. One method includes determining a language hypothesis for each text fragment in a plurality of text fragments identified from connected components in a document image. The method further includes selecting a first subset of text fragments from the plurality of text fragments based on ratings for the language hypothesis of each text fragment in the plurality of text fragments. The method further includes verifying, by a processor, the language hypothesis of one or more text fragments in the first subset of text fragments based on optical character recognition of the one or more text fragments. The method further includes determining, by the processor, that Chinese, Japanese, or Korean (CJK) characters are present in the document image based on the verification of the language hypothesis of each of the one or more text fragments.
-
Publication No.: US09378414B2
Publication date: 2016-06-28
Application No.: US14561851
Filing date: 2014-12-05
CPC classification: G06K9/00456, G06F17/2223, G06F17/275, G06F17/2775, G06K9/18, G06K9/3208, G06K9/6821, G06K2209/011
Abstract: Disclosed are systems, computer-readable mediums, and methods for determining that a text contains Chinese, Japanese, or Korean characters. A document image is received and binarized. The binarized document image is searched for connected components. A plurality of fragments is identified based on the connected components. A language hypothesis for each fragment of the plurality of fragments is determined. The language hypothesis has a probability rating. A subset of fragments from the plurality of fragments having the highest probability ratings is selected. The language hypothesis of each fragment in the subset of fragments is verified. A determination of the presence of Chinese, Japanese, or Korean characters is made based at least on the verification of the language hypothesis of the subset of fragments.
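The first two steps named in this abstract, binarization and the search for connected components, can be sketched in plain Python as follows. This is an illustrative sketch only: the fixed threshold of 128, the 4-connectivity rule, and the bounding-box output format are assumptions, not details taken from the patent, and the grouping of components into fragments is not shown.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[bool]]:
    """Mark pixels darker than the (assumed) fixed threshold as ink."""
    return [[px < threshold for px in row] for row in gray]


def connected_components(binary: List[List[bool]]) -> List[Box]:
    """Return one bounding box per 4-connected group of ink pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from the first unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width \
                            and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

The resulting boxes would then be grouped into fragments (for example, by proximity along a line) before any language hypothesis is formed; that grouping rule is left open here because the abstract does not describe it.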
-
Publication No.: US20150178559A1
Publication date: 2015-06-25
Application No.: US14561851
Filing date: 2014-12-05
CPC classification: G06K9/00456, G06F17/2223, G06F17/275, G06F17/2775, G06K9/18, G06K9/3208, G06K9/6821, G06K2209/011
Abstract: Disclosed are systems, computer-readable mediums, and methods for determining that a text contains Chinese, Japanese, or Korean characters. A document image is received and binarized. The binarized document image is searched for connected components. A plurality of fragments is identified based on the connected components. A language hypothesis for each fragment of the plurality of fragments is determined. The language hypothesis has a probability rating. A subset of fragments from the plurality of fragments having the highest probability ratings is selected. The language hypothesis of each fragment in the subset of fragments is verified. A determination of the presence of Chinese, Japanese, or Korean characters is made based at least on the verification of the language hypothesis of the subset of fragments.
-