Method and apparatus for automatic character script determination
    1.
    Granted invention patent
    Status: expired

    Publication No.: US5444797A

    Publication date: 1995-08-22

    Application No.: US47515

    Filing date: 1993-04-19

    CPC classification: G06K9/6807

    Abstract: An automatic script determining apparatus automatically determines the gross script-type of the text image of a document. A connected component generating means generates connected components from the pixels comprising the text image. A bounding box generating means generates a bounding box surrounding each connected component. A centroid determining means determines a centroid for each bounding box. A script feature determining means determines the locations, relative to the centroid, of one or more predetermined types of features, for each bounding box. A script determining means determines a distribution of the located script features for the entire text image, and compares the determined spatial distribution to a predetermined distribution for at least one script-type to determine the script type of the text image.
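
    The first three stages named in this abstract (connected components, bounding boxes, box centroids) can be sketched in a few lines. This is only an illustration under stated assumptions: the image is a small list of 0/1 rows, components are 8-connected, and the function names are hypothetical. The abstract does not say which script features are located relative to the centroid, so that stage is left out.

```python
from collections import deque

def connected_components(image):
    """Group 8-connected black (1) pixels of a binary image into components."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (top, left, bottom, right), all inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def box_centroid(box):
    """Return the (row, column) center of a bounding box."""
    top, left, bottom, right = box
    return (top + bottom) / 2, (left + right) / 2

image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
boxes = [bounding_box(p) for p in connected_components(image)]
centroids = [box_centroid(b) for b in boxes]
```

    Here the centroid is taken as the center of the bounding box, as the abstract words it, rather than the center of mass of the black pixels.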

    Method and apparatus for automatic determination of text line, word and character cell spatial features
    2.
    Granted invention patent
    Status: expired

    Publication No.: US5384864A

    Publication date: 1995-01-24

    Application No.: US47514

    Filing date: 1993-04-19

    Applicant: A. Lawrence Spitz

    Inventor: A. Lawrence Spitz

    IPC classification: G06K9/20 G06K9/34

    CPC classification: G06K9/348 G06K2209/01

    Abstract: An automatic character cell determining apparatus automatically determines the character cells within the text image of a document. A connected component generating means generates connected components from the pixels comprising the text image. A bounding box generating means generates a bounding box surrounding each connected component. A character cell determining means for locating character cells comprising one or more connected components comprises a vertical splaying means and a horizontal splaying means for ensuring white spaces between lines and connected components, a vertical profile means for determining the vertical positions of a line, means for splitting ligatures of two or more connected components and means for generating character cells grouping together one or more connected components.
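
    The "vertical profile" step can be illustrated with a row-wise count of black pixels, where each run of non-empty rows is read as one text line. This is a minimal sketch, assuming a binary image stored as 0/1 rows and lines already separated by at least one white row (which is what the splaying step is said to ensure). The function names are hypothetical, and splaying and ligature splitting are not shown.

```python
def vertical_profile(image):
    """Count black (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]

def text_line_spans(profile):
    """Return (first_row, last_row) for each run of rows that contain black pixels."""
    spans, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            spans.append((start, y - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans

image = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
profile = vertical_profile(image)
lines = text_line_spans(profile)
```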

    Encoding-format-desensitized methods and means for interchanging electronic document as appearances
    3.
    Granted invention patent
    Status: expired

    Publication No.: US5210824A

    Publication date: 1993-05-11

    Application No.: US680592

    Filing date: 1991-03-28

    IPC classification: G06F17/21 G06F17/22 G06F17/30

    Abstract: A database system is provided for interchanging visually faithful renderings of fully formatted electronic documents among computers having different hardware configurations and different software operating environments for representing such documents by different encoding formats and for transferring such documents utilizing different file transfer protocols. All format conversions and other activities that are involved in transferring such documents among such computers essentially are transparent to their users and require no a priori knowledge on the part of any of the users with respect to the computing and/or network environments of any of the other users. All database operations are initiated and have their progress checked by means of a remote procedure call protocol which enables client applications to obtain partial results from them relatively quickly, without having to wait for such operations to complete their work. These database operations are forked as child processes by a main database server program, so the functionality of the database system may be extended easily by adding further database operation programs to it.

    Character and phoneme recognition based on probability clustering
    4.
    Granted invention patent
    Status: expired

    Publication No.: US5075896A

    Publication date: 1991-12-24

    Application No.: US427148

    Filing date: 1989-10-25

    CPC classification: G10L15/187

    Abstract: Prior to character or phoneme recognition, a classifier provides a respective probability list for each of a sequence of sample characters or phonemes, each probability list indicating the respective sample's probability for each character or phoneme type. These probability lists are clustered in character or phoneme probability space, in which each dimension corresponds to the probability that a character or phoneme candidate is an instance of a specific character or phoneme type. For each resulting cluster, data is stored indicating its cluster ID and a probability list indicating the probability of each type at the cluster's center. Then, during recognition, a probability cluster identifier compares the probability list for each candidate with the probability list for each cluster to find the nearest cluster. The cluster identifier then provides the nearest cluster's cluster ID to a constraint satisfier that attempts to recognize the candidate based on rules, patterns, or a combination of rules and patterns. If necessary, the constraint satisfier uses the cluster ID to retrieve the stored probability list of the cluster to assist it in recognition.
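
    The recognition-time lookup described here, comparing a candidate's probability list with each stored cluster center and returning the nearest cluster ID, can be sketched as follows. The cluster table, the three character types and the choice of Euclidean distance are assumptions made for illustration; the abstract does not name a distance measure.

```python
import math

# Hypothetical stored clusters: cluster ID -> probability list at the cluster center,
# here over three character types ("c", "e", "o").
CLUSTERS = {
    0: [0.8, 0.1, 0.1],
    1: [0.1, 0.8, 0.1],
    2: [0.3, 0.3, 0.4],
}

def nearest_cluster(prob_list, clusters):
    """Return the ID of the cluster whose center is closest to prob_list."""
    return min(clusters, key=lambda cid: math.dist(prob_list, clusters[cid]))

candidate = [0.7, 0.2, 0.1]
cluster_id = nearest_cluster(candidate, CLUSTERS)
# A downstream stage can use the ID alone, or look up the stored list when needed.
stored_list = CLUSTERS[cluster_id]
```

    Passing only the cluster ID downstream keeps the interface to the constraint satisfier small, while the full probability list remains available by lookup.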

    Method and apparatus for enhanced automatic determination of text line dependent parameters
    5.
    Granted invention patent
    Status: expired

    Publication No.: US5513304A

    Publication date: 1996-04-30

    Application No.: US191895

    Filing date: 1994-02-04

    IPC classification: G06K9/20 G06K9/34

    CPC classification: G06K9/348 G06K2209/01

    Abstract: An automatic character cell determining apparatus automatically determines the character cells within the text image of a document. A connected component generator means generates connected components from the pixels comprising the text image. An aligning device aligns skewed and warped lines to the proper image axes. A bounding box generator generates a bounding box surrounding each connected component. A character cell determining device for locating character cells including one or more connected components has a vertical splaying device and a horizontal splaying device for ensuring white spaces between lines and connected components, a vertical profile device for determining the vertical positions of a line, a splitting device for splitting ligatures of two or more connected components and a character cell generator for generating character cells grouping together one or more connected components.

    Method and apparatus for automatic language determination of Asian language documents
    6.
    Granted invention patent
    Status: expired

    Publication No.: US5425110A

    Publication date: 1995-06-13

    Application No.: US47673

    Filing date: 1993-04-19

    Applicant: A. Lawrence Spitz

    Inventor: A. Lawrence Spitz

    Abstract: An automatic language determining apparatus automatically determines the particular Asian language of the text image of a document when the gross script-type is known to be, or is determined to be, an Asian script-type. A connected component generating means generates connected components from the pixels comprising the text image. A character cell generating means generates a character cell surrounding at least one connected component. An optical density determining means determines the optical density, in absolute numbers or percentage of pixels, of the pixels within each character cell. A script feature determining means first generates a histogram, then converts, by linear discriminant analysis, the histogram to a point in a new coordinate space. A language determining means compares the determined point of the text portion in the new coordinate space to predetermined regimes in the new coordinate space corresponding to at least one Asian language to determine the particular Asian language of the text image.
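
    The density-to-histogram-to-point pipeline in this abstract can be sketched roughly as below. The cell format, the four equal-width bins and the projection weights are all assumptions: in a real system the weights would come from a linear discriminant analysis fitted on labelled samples, which is not reproduced here.

```python
def optical_density(image, cell):
    """Fraction of black (1) pixels inside cell = (top, left, bottom, right), inclusive."""
    top, left, bottom, right = cell
    black = sum(image[y][x]
                for y in range(top, bottom + 1)
                for x in range(left, right + 1))
    area = (bottom - top + 1) * (right - left + 1)
    return black / area

def density_histogram(densities, bins=4):
    """Count cell densities falling into each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for d in densities:
        counts[min(int(d * bins), bins - 1)] += 1
    return counts

def project(histogram, weights):
    """Map a histogram to a point; each row of `weights` yields one coordinate."""
    return [sum(w * h for w, h in zip(row, histogram)) for row in weights]

image = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
cells = [(0, 0, 1, 1), (0, 2, 1, 3)]
densities = [optical_density(image, c) for c in cells]
histogram = density_histogram(densities)
# Placeholder weights, not a fitted discriminant.
point = project(histogram, [[1, 0, 0, 0], [0, 0, 0, 1]])
```

    The resulting point would then be compared against per-language regions of the new coordinate space, a step the abstract describes but does not specify in enough detail to sketch.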

    Method and apparatus for classifying documents
    7.
    Granted invention patent
    Status: expired

    Publication No.: US5414781A

    Publication date: 1995-05-09

    Application No.: US158831

    Filing date: 1993-11-24

    CPC classification: G06K9/00

    Abstract: A method and apparatus for identifying documents and classes of documents. The documents are provided with distinctive logotypes which are preferably at the top of each document. The coding of the logotypes is by the use of distinctive angular alignments in the logotype. The logotype is scanned at different angles in order to determine angular "signatures" for comparison with a predetermined power distribution.

    Method for matching text images and documents using character shape codes
    8.
    Granted invention patent
    Status: expired

    Publication No.: US5438628A

    Publication date: 1995-08-01

    Application No.: US220926

    Filing date: 1994-03-31

    IPC classification: G06K9/62 G06K9/68 G06K9/00

    CPC classification: G06K9/6807

    Abstract: A first method for exact and inexact matching of documents stored in a document database includes the step of converting the documents in the database to a compacted tokenized form. A search string or search document is then converted to the compact tokenized form and compared to determine if the test string occurs in the documents of the database or whether the documents in the database correspond to the test document. A second method for inexact matching of a test document to the documents in the database includes generating sets of one or more floating point values for each document in the database and for the test document. The sets of floating point numbers for the database are then compared to the set for the test document to determine a degree of matching. A threshold value is established and each document in the database which generates a matching value closer to the test document than the threshold is considered to be an inexact match of the test document.
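
    The second method, threshold-based inexact matching on sets of floating point values, might look like the sketch below. The per-document values are made up, and Euclidean distance stands in for the unspecified "degree of matching"; the abstract does not say how the values are derived from a document.

```python
import math

def inexact_matches(test_values, database, threshold):
    """Return names of documents whose values lie within `threshold` of the test document."""
    return [name for name, values in database.items()
            if math.dist(test_values, values) < threshold]

# Hypothetical per-document feature values.
database = {
    "doc_a": [0.10, 0.50, 0.40],
    "doc_b": [0.12, 0.48, 0.40],
    "doc_c": [0.60, 0.20, 0.20],
}
matches = inexact_matches([0.11, 0.49, 0.40], database, threshold=0.05)
```

    Raising the threshold admits looser matches, so it trades recall against precision.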

    Method and apparatus for automatic character type classification of European script documents
    9.
    Granted invention patent
    Status: expired

    Publication No.: US5375176A

    Publication date: 1994-12-20

    Application No.: US47540

    Filing date: 1993-04-19

    Applicant: A. Lawrence Spitz

    Inventor: A. Lawrence Spitz

    IPC classification: G06K9/62 G06K9/68

    CPC classification: G06K9/6807

    Abstract: An automatic abstract character coding system automatically generates abstract coded characters from the text image of a document when the gross script-type is known to be, or is determined to be, a European type script. A connected component generating means generates connected components from the pixels comprising the text image. A spatial feature determining means generates a character cell surrounding one or more aligned connected components. A character-type classifying means converts the character cell to one of a plurality of abstract character codes.
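
    One plausible way to turn a character cell into an abstract code is to look only at its vertical extent relative to the text line. The sketch below is an assumption throughout: the abstract does not list the codes or the features used, so the four codes ("A", "x", "g", "j"), the baseline and x-height inputs and the tolerance are hypothetical.

```python
def classify_cell(top, bottom, baseline, x_height_line, tolerance=1):
    """Map a character cell to a coarse shape code from its vertical extent.

    Row numbers grow downward, so a smaller `top` means a taller character.
    """
    rises = top < x_height_line - tolerance     # extends above the x-height line
    descends = bottom > baseline + tolerance    # extends below the baseline
    if rises and descends:
        return "j"  # hypothetical code: ascender and descender
    if rises:
        return "A"  # hypothetical code: ascender or capital
    if descends:
        return "g"  # hypothetical code: descender
    return "x"      # hypothetical code: x-height character

# Cells as (top, bottom) rows for the word "dog"; baseline at row 20, x-height line at row 10.
cells = [(2, 20), (10, 20), (10, 27)]
codes = "".join(classify_cell(t, b, baseline=20, x_height_line=10) for t, b in cells)
```

    Coding by shape class rather than by character identity avoids full recognition while keeping enough structure to distinguish many words.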

    Determination of image skew angle from data including data in compressed form
    10.
    Granted invention patent
    Status: expired

    Publication No.: US5245676A

    Publication date: 1993-09-14

    Application No.: US454339

    Filing date: 1989-12-21

    Applicant: A. Lawrence Spitz

    Inventor: A. Lawrence Spitz

    Abstract: Skew angle of an image is determined based on determination of location of fiducial points on the image. Fiducial points may be located through a comparison of the scanning of a first line with scanning of a subsequent line. These fiducial points may be defined in terms of pixel color transitions located on a first scan line without a corresponding transition on the succeeding scan line. Skew angle may be determined from image data in uncompressed form or in compressed form. Where skew angle is determined from image data in compressed form, the 2-dimensional CCITT facsimile recommendations may be used. In such cases, the locations of the fiducial points may be taken as the locations of the pass codes of the compressed image data. Specifically, pass codes indicating a pass of white pixels are used.
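
    For the uncompressed case, one reading of "a transition on a first scan line without a corresponding transition on the succeeding scan line" is a black run that has only white pixels beneath it, that is, the bottom edge of a stroke. The sketch below uses that reading and then fits a least-squares line through the points to get an angle. Both choices are assumptions: the abstract does not state how the angle is derived from the fiducial points, and the compressed-data path via CCITT pass codes is not shown.

```python
import math

def black_runs(row):
    """Yield (start, end) column spans of consecutive black (1) pixels."""
    start = None
    for x, value in enumerate(row):
        if value == 1 and start is None:
            start = x
        elif value != 1 and start is not None:
            yield start, x - 1
            start = None
    if start is not None:
        yield start, len(row) - 1

def fiducial_points(image):
    """Return (x, y) for black runs that have only white pixels on the next scan line."""
    points = []
    for y in range(len(image) - 1):
        for start, end in black_runs(image[y]):
            if not any(image[y + 1][start:end + 1]):
                points.append(((start + end) / 2, y))
    return points

def skew_angle_degrees(points):
    """Least-squares slope of y against x, expressed as an angle in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))

image = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
points = fiducial_points(image)
angle = skew_angle_degrees(points)
```

    A single fitted line only makes sense when the points come from one text line; a page with several lines would need the points grouped first.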
