摘要:
A method of determining Unicode values corresponding to the text in digital documents includes: providing a digital document containing information related to the text in the document, the information including at least one set of data selected from the group consisting of: the numerical character code comprised by a single byte value or a sequence of multiple bytes, the glyph name corresponding to the character code for simple fonts, the code-to-Unicode mapping provided by a ToUnicode CMap, and font outline data embedded in the document; obtaining the information related to the text from the document; and determining the Unicode values corresponding to a specific code of a specific font on a per-glyph basis by executing a cascade of determination steps for each code separately, the cascade being executed in a predetermined sequence using different sources of information.