Abstract:
A system and method provide the ability to construct lexical analyzers on the fly in an efficient and pervasive manner. The system and method split the table describing the automaton into two distinct tables and split the lexical analyzer into two phases, one for each table. The two phases consist of a single transition algorithm and a range transition algorithm, both of which are table driven and permit dynamic modification of those tables during operation. A third ‘entry point’ table may also be used to speed up the process of finding the first table element from state 0 for any given input character.
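The two-table split can be illustrated with a minimal Python sketch. The table layouts, the `DynamicLexer` class, and all method and token names are hypothetical, since the abstract does not specify them. The sketch only shows a single-transition lookup followed by a range-transition lookup, with an entry-point shortcut for state 0 and tables that can be extended while the lexer is in use.

```python
class DynamicLexer:
    """Illustrative two-phase, table-driven lexer (layout is an assumption)."""

    def __init__(self):
        self.single = {}     # phase 1 table: (state, char) -> next state
        self.ranges = []     # phase 2 table: (state, low, high, next state)
        self.entry = {}      # entry-point table: first char -> state reached from state 0
        self.accepting = {}  # state -> token name
        self._next = 1       # state 0 is the start state

    def new_state(self):
        state = self._next
        self._next += 1
        return state

    def add_literal(self, text, token):
        """Extend the single-transition table at run time with an exact string."""
        state = 0
        for ch in text:
            key = (state, ch)
            if key not in self.single:
                self.single[key] = self.new_state()
                if state == 0:
                    self.entry[ch] = self.single[key]
            state = self.single[key]
        self.accepting[state] = token

    def add_range(self, state, low, high, target, token=None):
        """Extend the range-transition table at run time."""
        self.ranges.append((state, low, high, target))
        if token is not None:
            self.accepting[target] = token

    def step(self, state, ch):
        if state == 0 and ch in self.entry:      # entry-point shortcut
            return self.entry[ch]
        nxt = self.single.get((state, ch))       # phase 1: single transitions
        if nxt is not None:
            return nxt
        for s, low, high, target in self.ranges:  # phase 2: range transitions
            if s == state and low <= ch <= high:
                return target
        return None

    def match(self, text, pos=0):
        """Return (token, end) for the longest match starting at pos, or None."""
        state, best, i = 0, None, pos
        while i < len(text):
            state = self.step(state, text[i])
            if state is None:
                break
            i += 1
            if state in self.accepting:
                best = (self.accepting[state], i)
        return best


lexer = DynamicLexer()
lexer.add_literal("if", "KW_IF")
number = lexer.new_state()
lexer.add_range(0, "0", "9", number, "NUMBER")
lexer.add_range(number, "0", "9", number)
first = lexer.match("if x")
digits = lexer.match("123+")
lexer.add_literal("in", "KW_IN")   # tables modified after the lexer has been used
later = lexer.match("in")
```

In this sketch single transitions are tried before ranges, so an exact keyword takes priority over a range that covers the same character.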
Abstract:
A dynamically extensible approach to parsing textual input consisting of a predictive parser and associated predictive parser generator is provided. The combination, together with a plug-in/resolver architecture, provides the ability to handle a set of languages that is vastly larger than that conventionally handled by predictive parsing techniques. The generator accepts extended BNF language specifications containing embedded reverse polish plug-in call specifications giving the plug-in number to be called as well as an arbitrary textual parameter to be passed to the plug-in. The parser supports the ability to register a ‘resolver’ function as well as one or more custom reverse-polish plug-in handlers, which are passed the textual parameter(s) specified in the extended BNF and have full control over the parsing and evaluation stacks. The ‘resolver’ is called with a ‘no action’ parameter when the parser first encounters a token in the input stream and may modify the token as necessary. The resolver is also called when the parser must evaluate or assign an entry on the evaluation stack, at which time it can implement additional behaviors depending on the language or environment. Finally, the ‘resolver’ is called when the parse terminates. The ‘resolver’ is the primary mechanism whereby more complex languages can be handled and is also a key part of connecting to external systems or storage when the parser is used in an interpreted context. The reverse polish plug-in functions are provided with an API to allow full control over and access to the parser stacks and can rapidly be configured to implement almost any language constructs.
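The interplay of parse table, plug-ins, and resolver can be sketched in Python as follows. Everything here is an illustrative assumption rather than the patented design: the hand-written `TABLE` stands in for the generator's output, plug-in calls are encoded as `("@", number, parameter)` tuples, and the resolver action names `"no_action"`, `"evaluate"`, and `"terminate"` are invented for the example.

```python
PLUGIN = "@"

# Hypothetical predictive table for sums and differences of numbers.
# Right-hand sides embed plug-in calls in reverse-polish position.
TABLE = {
    ("E", "NUM"): ["T", "E2"],
    ("E2", "+"): ["+", "T", (PLUGIN, 1, "add"), "E2"],
    ("E2", "-"): ["-", "T", (PLUGIN, 1, "sub"), "E2"],
    ("E2", "$"): [],
    ("T", "NUM"): ["NUM"],
}

events = []  # records resolver calls so the flow can be inspected


def resolver(action, token):
    events.append(action)
    if action == "no_action":      # first sight of a token: may rewrite it
        return ("NUM", token) if token.isdigit() else (token, token)
    if action == "evaluate":       # turn a token into an evaluation-stack value
        return int(token[1])
    return None                    # "terminate": nothing to do in this sketch


def arithmetic(param, parse_stack, eval_stack):
    """Plug-in 1: receives the textual parameter and both stacks."""
    right, left = eval_stack.pop(), eval_stack.pop()
    eval_stack.append(left + right if param == "add" else left - right)


def parse(raw_tokens, table, plugins, resolve, start="E"):
    tokens = [resolve("no_action", t) for t in raw_tokens] + [("$", None)]
    parse_stack, eval_stack, pos = ["$", start], [], 0
    while parse_stack:
        top = parse_stack.pop()
        kind = tokens[pos][0]
        if isinstance(top, tuple):                 # embedded plug-in call
            _, number, param = top
            plugins[number](param, parse_stack, eval_stack)
        elif (top, kind) in table:                 # nonterminal: predict
            parse_stack.extend(reversed(table[(top, kind)]))
        elif top == kind:                          # terminal: match
            if kind == "NUM":
                eval_stack.append(resolve("evaluate", tokens[pos]))
            pos += 1
        else:
            raise SyntaxError(f"unexpected {kind!r} while expecting {top!r}")
    resolve("terminate", None)
    return eval_stack


result = parse(["1", "+", "2", "-", "4"], TABLE, {1: arithmetic}, resolver)
```

Because the plug-in receives both stacks, a different handler registered under the same number could change the language's semantics without touching the table.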
Abstract:
A strongly-typed, distributed, run-time system capable of describing and manipulating arbitrarily complex, non-flat, binary data derived from type descriptions in a standard (or slightly extended) programming language, including handling of type inheritance. The system is composed of four primary components. First, a plurality of databases having binary type and field descriptions. Second, a run-time modifiable type compiler that is capable of generating type databases either via explicit API calls or by compilation of unmodified header files or individual type definitions in a standard programming language. Third, a complete API suite for access to type information as well as full support for reading and writing types, type relationships and inheritance, and type fields, given knowledge of the unique numeric type ID and the field name/path. Finally, a hashing process for converting type names to unique type IDs (which may also incorporate a number of logical flags relating to the nature of the type). Further extensions and improvements are also provided as described herein.
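The name-to-ID hashing step might look like the following Python sketch. The abstract does not name a hash function or a bit layout, so the use of 64-bit FNV-1a, the 8/56 split between flag bits and hash bits, and the flag names are all assumptions made for illustration. A hash alone cannot guarantee uniqueness, so a real system would need some collision check that is not shown here.

```python
HASH_BITS = 56                      # assumed: low 56 bits hold the name hash
HASH_MASK = (1 << HASH_BITS) - 1

# Hypothetical logical flags carried in the top 8 bits of the ID.
FLAG_STRUCT = 0x01
FLAG_POINTER = 0x02


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a hash (a stand-in, not the patent's hash)."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def type_id(name: str, flags: int = 0) -> int:
    """Combine a hash of the type name with logical flags into one numeric ID."""
    return (flags << HASH_BITS) | (fnv1a_64(name.encode("utf-8")) & HASH_MASK)


def flags_of(tid: int) -> int:
    return tid >> HASH_BITS


def name_hash_of(tid: int) -> int:
    return tid & HASH_MASK


person = type_id("Person", FLAG_STRUCT)
person_ptr = type_id("Person", FLAG_POINTER)
```

Here `person` and `person_ptr` share the same name hash and differ only in their flag bits, which is one way an ID could encode the nature of a type alongside its identity.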
Abstract:
The present invention enables the creation, management, retrieval, and distribution of massively large collections of information that can be shared across a distributed network without building absolute references or even requiring pre-existing knowledge of the data and data structures being stored in such an environment. The system includes the following components: (1) a ‘flat’ data model wherein arbitrarily complex structures can be instantiated within a single memory allocation (including both the aggregation arrangements and the data itself, as well as any cross references between them via ‘relative’ references); (2) a run-time type system capable of defining and accessing binary strongly-typed data; (3) a set of ‘containers’ within which information encoded according to the system can be physically stored, preferably including a memory resident form, a file-based form, and a server-based form; (4) a client-server environment that is tied to the type system and capable of interpreting and executing all necessary collection manipulations remotely; (5) a basic aggregation structure providing at a minimum ‘parent’, ‘nextChild’, ‘previousChild’, ‘firstChild’, and ‘lastChild’ links or equivalents; and (6) a data attachment structure (whose size may vary) to which strongly typed data can be attached and which is associated in some manner with (and possibly identical to) a containing aggregation node in the collection. Additional extensions and modifications to the system are also specified herein.
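Components (1), (5), and (6) can be illustrated together with a small Python sketch that keeps a whole tree inside one `bytearray`. The record layout, the `FlatTree` class, and its method names are assumptions for illustration only. Each node stores its five links as offsets relative to the node itself, followed by a variable-size payload, so a byte-for-byte copy of the buffer remains a valid collection.

```python
import struct

# Assumed record layout: five relative links plus the payload length.
NODE = struct.Struct("<5iI")
PARENT, NEXT_CHILD, PREV_CHILD, FIRST_CHILD, LAST_CHILD = range(5)


class FlatTree:
    def __init__(self, data=b""):
        self.buf = bytearray(data)   # the single allocation holding everything

    def _fields(self, off):
        return list(NODE.unpack_from(self.buf, off))

    def _set_link(self, off, which, target):
        fields = self._fields(off)
        fields[which] = target - off          # relative reference; 0 means "no link"
        NODE.pack_into(self.buf, off, *fields)

    def follow(self, off, which):
        rel = self._fields(off)[which]
        return off + rel if rel else None

    def add(self, payload, parent=None):
        """Append a node with attached data; returns its offset in the buffer."""
        off = len(self.buf)
        self.buf += NODE.pack(0, 0, 0, 0, 0, len(payload)) + payload
        if parent is not None:
            self._set_link(off, PARENT, parent)
            last = self.follow(parent, LAST_CHILD)
            if last is None:
                self._set_link(parent, FIRST_CHILD, off)
            else:
                self._set_link(last, NEXT_CHILD, off)
                self._set_link(off, PREV_CHILD, last)
            self._set_link(parent, LAST_CHILD, off)
        return off

    def payload(self, off):
        length = self._fields(off)[5]
        start = off + NODE.size
        return bytes(self.buf[start:start + length])

    def children(self, off):
        child = self.follow(off, FIRST_CHILD)
        while child is not None:
            yield child
            child = self.follow(child, NEXT_CHILD)


tree = FlatTree()
root = tree.add(b"root")
a = tree.add(b"a", root)
b = tree.add(b"b", root)

# A raw copy of the bytes needs no pointer fix-ups before it can be traversed.
copy = FlatTree(bytes(tree.buf))
names = [copy.payload(c) for c in copy.children(root)]
```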
Abstract:
A stemming framework for combining stemming algorithms together in a multilingual environment to obtain improved stemming behavior over any individual stemming algorithm, together with a new language independent stemming algorithm based on shortest path techniques. The stemmer essentially treats the stemming problem as a simple instance of the shortest path problem where the cost for each path can be computed from its word component and its number of characters. The goal of the stemmer is to find the shortest path to construct the entire word. The stemmer uses dynamic dictionaries constructed as lexical analyzer state transition tables to recognize the various allowable word parts for any given language in order to obtain maximum speed. The stemming framework provides the necessary logic to combine multiple stemmers in parallel and to merge their results to obtain the best behavior. Mapping dictionaries handle irregular plurals, tense, phrase mapping and proper name recognition.
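The shortest-path view of stemming can be sketched in Python as a search over character positions, where each edge is a recognized word part. The dictionaries, the cost function, and the fallback cost for unknown characters below are all hypothetical; the abstract only says that cost depends on the word component and its number of characters. Plain Python sets stand in for the lexical analyzer tables the abstract describes.

```python
# Hypothetical word-part dictionaries for a toy English-like language.
PARTS = {
    "prefix": {"un", "re"},
    "root": {"happy", "happi", "do", "teach"},
    "suffix": {"ness", "er", "ing", "s"},
}
BASE_COST = {"prefix": 2.0, "root": 1.0, "suffix": 2.0}   # assumed weights
UNKNOWN_COST = 10.0   # assumed penalty for a character no dictionary covers


def part_cost(kind, text):
    """Assumed cost: depends on the component type and its length."""
    return BASE_COST[kind] + 1.0 / len(text)


def segment(word):
    """Cheapest way to build the whole word from parts (shortest path on a DAG)."""
    n = len(word)
    best = [(float("inf"), None) for _ in range(n + 1)]   # (cost, back-pointer)
    best[0] = (0.0, None)
    for i in range(n):
        cost_i = best[i][0]
        edges = [(UNKNOWN_COST, "unknown", word[i])]
        for j in range(i + 1, n + 1):
            piece = word[i:j]
            for kind, words in PARTS.items():
                if piece in words:
                    edges.append((part_cost(kind, piece), kind, piece))
        for cost, kind, piece in edges:
            j = i + len(piece)
            if cost_i + cost < best[j][0]:
                best[j] = (cost_i + cost, (i, kind, piece))
    parts, j = [], n
    while j > 0:                       # walk the back-pointers from the end
        i, kind, piece = best[j][1]
        parts.append((kind, piece))
        j = i
    return parts[::-1]


def stem(word):
    roots = [piece for kind, piece in segment(word) if kind == "root"]
    return roots[0] if roots else word


path = segment("unhappiness")
```

Since every edge moves forward in the word, positions can be processed left to right and no priority queue is needed.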
Abstract:
A system and method for implementing a data-flow based system includes three basic components: a data-flow based scheduling environment that balances the needs of data initiated program execution as a result of flows with other practical considerations such as user responsiveness, event driven invocation, user interface considerations, and the need to also support control-flow based paradigms where required; a visual programming language, based on the flow of strongly-typed run-time accessible data and data collections between small control-flow based locally and network distributed functional building-blocks, known as widgets; and a formalized pin-based interface to allow access to data-flow contents from the executing code within the widgets. The pins on the widgets include both pins used to control execution of a widget as well as pins used to receive data input from a data flow. The system and method further include a debugging environment that enables visual debugging of one or more widgets (or collections of widgets). Data control techniques include the concepts of “OR” and “AND” consumption thereby permitting either consumption immediately or only after all widget inputs have received the token. Additional extensions to this framework will also be described that relate to the environment, the programming language and the interface.
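One possible reading of “OR” and “AND” consumption is sketched below in Python. It assumes a flow that fans out to several widget input pins: under “OR” the token is consumed as soon as any one pin reads it, and under “AND” it stays on the flow until every connected pin has received it. The `Flow` class and its methods are hypothetical, and the abstract may intend a different mechanism.

```python
class Flow:
    """A single data flow delivering one token to several input pins (sketch)."""

    def __init__(self, pins, mode="AND"):
        if mode not in ("AND", "OR"):
            raise ValueError("mode must be 'AND' or 'OR'")
        self.pins = set(pins)
        self.mode = mode
        self.has_token = False
        self.token = None
        self.received = set()

    def write(self, token):
        self.token, self.has_token, self.received = token, True, set()

    def read(self, pin):
        """Return the token for this pin, or None if nothing is available to it."""
        if not self.has_token or pin in self.received:
            return None
        value = self.token
        self.received.add(pin)
        # OR: the first reader consumes; AND: only the last outstanding reader does.
        if self.mode == "OR" or self.received == self.pins:
            self.has_token, self.token = False, None
        return value


or_flow = Flow(["a", "b"], mode="OR")
or_flow.write(42)
or_results = (or_flow.read("a"), or_flow.read("b"))

and_flow = Flow(["a", "b"], mode="AND")
and_flow.write(42)
first_read = and_flow.read("a")
still_pending = and_flow.has_token
second_read = and_flow.read("b")
```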
Abstract:
A system and method for extracting data, hereinafter referred to as MitoMine, that produces a strongly-typed ontology defined collection referencing (and cross referencing) all extracted records. The input to the mining process can be any data source, such as a text file delimited into a set of possibly dissimilar records. MitoMine contains parser routines and post processing functions, known as ‘munchers’. The parser routines can be accessed either via a batch mining process or as part of a running server process connected to a live source. Munchers can be registered on a per data-source basis in order to process the records produced, possibly writing them to an external database and/or a set of servers. The present invention also embeds an interpreted ontology based language within a compiler/interpreter (for the source format) such that the statements of the embedded language are executed as a result of the source compiler ‘recognizing’ a given construct within the source and extracting the corresponding source content. In this way, the execution of the statements in the embedded program will occur in a sequence that is dictated wholly by the source content. This system and method therefore make it possible to bulk extract free-form data from such sources as CD-ROMs, the web etc. and have the resultant structured data loaded into an ontology based system.
Abstract:
A new memory tuple is described that creates both a handle as well as a reference to an item within the handle. The reference is created using an offset value that defines the physical offset of the data within the memory block. Thereafter, if references are passed in terms of their offset value, this value will be the same in any copy of the handle regardless of the machine. In a distributed computing environment, equivalence between handles is established in a single transaction between two communicating machines. Thereafter, the two machines can communicate about specific handle contents simply by using offsets.
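The offset-based reference can be sketched in Python as follows. The `Handle` class, its length-prefixed item layout, and the method names are assumptions for illustration. The point being shown is only that an offset obtained on one machine remains valid against any byte-for-byte copy of the same block.

```python
import struct

LENGTH = struct.Struct("<I")   # assumed: each item is stored as length + bytes


class Handle:
    """A relocatable memory block; items are addressed by offset, not pointer."""

    def __init__(self, data=b""):
        self.block = bytearray(data)

    def append(self, payload):
        """Store an item and return its offset, which serves as the reference."""
        offset = len(self.block)
        self.block += LENGTH.pack(len(payload)) + payload
        return offset

    def read(self, offset):
        (length,) = LENGTH.unpack_from(self.block, offset)
        start = offset + LENGTH.size
        return bytes(self.block[start:start + length])


# Machine A builds the handle and keeps (handle, offset) pairs as references.
local = Handle()
ref_alpha = local.append(b"alpha")
ref_beta = local.append(b"beta")

# One transfer of the raw bytes establishes equivalence on machine B ...
remote = Handle(bytes(local.block))

# ... after which the two sides can refer to contents by offset alone.
remote_value = remote.read(ref_beta)
```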