Abstract:
A method and apparatus for managing cache injection in a multiprocessor system reduces processing time associated with direct memory access (DMA) transfers in a symmetrical multiprocessor (SMP) or a non-uniform memory access (NUMA) multiprocessor environment. The method and apparatus either detect the target processor for DMA completion or direct processing of the DMA completion to a particular processor, thereby enabling cache injection into a cache that is coupled with the processor that executes the DMA completion routine and processes the data injected into the cache. The target processor may be identified by determining which processor handles the interrupt that occurs on completion of the DMA transfer. Alternatively, or in conjunction with target processor identification, an interrupt handler may queue a deferred procedure call to the target processor to process the transferred data. In NUMA multiprocessor systems, the completing processor and target memory are chosen so that the target memory is accessible to that processor and its associated cache.
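As a rough illustration (not the patented implementation), the C sketch below models the interrupt-driven targeting described above: the DMA-completion interrupt handler records which CPU it ran on, steers cache injection to that CPU, and queues the deferred completion work to the same CPU. The helper names (current_cpu_id, dma_set_injection_target, queue_dpc_on_cpu) are hypothetical placeholders for platform services, not a real API.

```c
#include <stdint.h>

struct dma_completion {
    void    *buffer;      /* data transferred by the DMA engine           */
    uint32_t length;
    int      target_cpu;  /* CPU whose cache should receive the injection */
};

/* Assumed platform helpers (declarations only; hypothetical, not a real API). */
int  current_cpu_id(void);
void dma_set_injection_target(int cpu);
void queue_dpc_on_cpu(int cpu, void (*fn)(struct dma_completion *),
                      struct dma_completion *arg);

static void dma_completion_dpc(struct dma_completion *c)
{
    /* Runs on c->target_cpu; the injected lines are already in its cache. */
    (void)c;  /* ... process c->buffer here ... */
}

void dma_completion_isr(struct dma_completion *c)
{
    /* The completion interrupt was steered to this CPU, so make it the
     * target for both cache injection and the deferred processing.        */
    c->target_cpu = current_cpu_id();
    dma_set_injection_target(c->target_cpu);
    queue_dpc_on_cpu(c->target_cpu, dma_completion_dpc, c);
}
```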
Abstract:
Caches are associated with processors, such that multiple caches may be associated with multiple processors. This association may be different for different main memory address ranges. The techniques of the invention are flexible, as a system designer can choose how the caches are associated with processors and main memory banks, and the association between caches, processors, and main memory banks may be changed while the multiprocessor system is operating. Cache coherence may or may not be maintained. An effective address in an illustrative embodiment comprises an interest group and an associated address. The interest group is an index into a cache vector table, and the indexed cache vector table entry, together with the associated address, is used to select one of the caches. This selection can be pseudo-random. Alternatively, in some applications, the cache vector table may be eliminated, with the interest group directly encoding the subset of caches to use.
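A minimal C sketch of that selection step, under assumptions the abstract does not fix (16 caches, a 64-entry table, and a bitmask entry format): the interest group indexes the cache vector table, and bits of the associated address drive a pseudo-random choice among the caches named by that entry.

```c
#include <stdint.h>

#define NUM_CACHES        16   /* assumed cache count                     */
#define VECTOR_TABLE_SIZE 64   /* assumed table size                      */

/* Each entry is a bitmask: bit i set means cache i belongs to the group. */
static uint16_t cache_vector_table[VECTOR_TABLE_SIZE];

static int select_cache(uint32_t interest_group, uint64_t assoc_addr)
{
    uint16_t mask = cache_vector_table[interest_group % VECTOR_TABLE_SIZE];
    int candidates[NUM_CACHES], n = 0;

    for (int i = 0; i < NUM_CACHES; i++)
        if (mask & (1u << i))
            candidates[n++] = i;

    if (n == 0)
        return -1;                          /* no cache in this group     */

    /* Pseudo-random but repeatable: driven by bits of the address.       */
    return candidates[(assoc_addr >> 6) % n];
}
```

In the table-less variant mentioned above, the interest group value itself would play the role of the bitmask.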
Abstract:
A central queue based packet switch, illustratively an eight-way router, that advantageously avoids deadlock, and an accompanying method for use therein. Each packet switch (25₁) contains input port circuits (310) and output port circuits (380) interconnected through two parallel paths: a multi-slot central queue (350) and a low-latency bypass, the latter being a cross-point switching matrix (360). The central queue has one slot dedicated to each output port to store a message portion ("chunk") destined for only that output port, with the remaining slots being shared by all the output ports and dynamically allocated among them as the need arises. Only those chunks which are contending for the same output port are stored in the central queue; otherwise, chunks are routed to the appropriate output ports through the cross-point switching matrix.
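The C sketch below models only the slot-accounting policy (one dedicated slot per output port plus a dynamically shared pool); the port and slot counts are assumptions, and the arbitration and bypass hardware are not represented.

```c
#include <stdbool.h>

#define NUM_PORTS    8                       /* eight-way router           */
#define TOTAL_SLOTS  128                     /* assumed queue depth        */
#define SHARED_SLOTS (TOTAL_SLOTS - NUM_PORTS)

struct central_queue {
    bool dedicated_in_use[NUM_PORTS];        /* one reserved slot per port */
    int  shared_free;                        /* remaining shared slots     */
};

/* Accept a contending chunk into the central queue if space permits.      */
static bool enqueue_chunk(struct central_queue *q, int out_port)
{
    if (!q->dedicated_in_use[out_port]) {
        q->dedicated_in_use[out_port] = true;    /* guaranteed slot         */
        return true;
    }
    if (q->shared_free > 0) {
        q->shared_free--;                        /* borrow from shared pool */
        return true;
    }
    return false;   /* no space for this port; chunk must wait at input    */
}
```

Because each output port always retains at least its dedicated slot, no single congested port can monopolize the whole queue, which is consistent with the deadlock-avoidance claim above.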
Abstract:
There is broadly contemplated herein an arrangement whereby each event source feeds a small dedicated “pre-counter” while an actual count is kept in a 64-bit wide RAM. Such an implementation preferably may involve a state machine that simply sweeps through the pre-counters, in a predetermined fixed order. Preferably, the state machine will access each pre-counter, add the value from the pre-counter to a corresponding RAM location, and then clear the pre-counter. Accordingly, the pre-counters merely have to be wide enough such that even at a maximal event rate, the pre-counter will not be able to wrap (i.e., reach capacity or overflow) before the “sweeper” state machine accesses the pre-counter.
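A software analogue of the sweep, assuming hypothetical widths (8-bit pre-counters, 256 event sources): visit every pre-counter in a fixed order, add its value to the corresponding 64-bit RAM word, and clear what was read.

```c
#include <stdint.h>

#define NUM_EVENTS 256                           /* assumed event count   */

static volatile uint8_t pre_counter[NUM_EVENTS]; /* small, fast-filling   */
static uint64_t         counter_ram[NUM_EVENTS]; /* full-width counts     */

/* One pass of the "sweeper": fixed order, read-accumulate-clear.         */
static void sweep_once(void)
{
    for (int i = 0; i < NUM_EVENTS; i++) {
        uint8_t v = pre_counter[i];   /* read the narrow pre-counter      */
        counter_ram[i] += v;          /* accumulate into 64-bit RAM       */
        pre_counter[i] -= v;          /* clear only what was read         */
    }
}
```

Clearing by subtracting the value that was read, rather than writing zero, keeps any events that arrive between the read and the clear; the sizing rule above then only requires that a pre-counter cannot wrap within one full sweep period at the maximal event rate.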
Abstract:
A parallel multiprocessor data processing system having a plurality of nodes for processing data and a switch connected to each of said nodes for switching messages between the nodes, each node having a node processor for defining messages under program control to be sent to another node. Each of the nodes has an I/O processor for controlling the sending of messages to another node via the switch, and a shared memory which can be accessed by both the node processor and the I/O processor. Instructions for the messages to be sent by the I/O processor are stored in mailboxes in the shared memory by the node processor. A comparing circuit compares addresses on the bus to the contents of a plurality of address registers and sets the corresponding bit in a results register for each match. The adapter processor reads the contents of the results register so that, with a single bus access, it can determine all mailboxes that have been accessed by the node processor.
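A C sketch of the compare-and-notify behavior, with an assumed mailbox count and a 32-bit results register: the comparing circuit sets one bit per matching address register, and the adapter processor retrieves every pending notification with a single read.

```c
#include <stdint.h>

#define NUM_MAILBOXES 32                     /* assumed mailbox count      */

struct mailbox_monitor {
    uint64_t addr_reg[NUM_MAILBOXES];        /* one watched address each   */
    uint32_t results;                        /* bit i => mailbox i touched */
};

/* Modeled behavior of the comparing circuit on each bus access.           */
static void on_bus_access(struct mailbox_monitor *m, uint64_t bus_addr)
{
    for (int i = 0; i < NUM_MAILBOXES; i++)
        if (bus_addr == m->addr_reg[i])
            m->results |= 1u << i;
}

/* Single access by the adapter (I/O) processor: read and clear results.   */
static uint32_t poll_mailboxes(struct mailbox_monitor *m)
{
    uint32_t hits = m->results;
    m->results = 0;
    return hits;
}
```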
Abstract:
The geometric processing of an ordered sequence of graphics commands is distributed over a set of processors by the following steps. The sequence of graphics commands is partitioned into an ordered set of N subsequences S0 . . . SN−1, and an ordered set of N state vectors V0 . . . VN−1 is associated with said ordered set of subsequences S0 . . . SN−1. A first phase of processing is performed on the set of processors whereby, for each given subsequence Sj in the set of subsequences S0 . . . SN−2, state vector Vj+1 is updated to represent state as if the graphics commands in subsequence Sj had been executed in sequential order. A second phase of the processing is performed whereby the components of each given state vector Vk in the set of state vectors V1 . . . VN−1 generated in the first phase are merged with corresponding components in the preceding state vectors V0 . . . Vk−1 such that the state vector Vk represents state as if the graphics commands in subsequences S0 . . . Sk−1 had been executed in sequential order. Finally, a third phase of processing is performed on the set of processors whereby, for each subsequence Sm in the set of subsequences S1 . . . SN−1, geometry operations for subsequence Sm are performed using the state vector Vm generated in the second phase. In addition, in the third phase, geometry operations for subsequence S0 are performed using the state vector V0. Advantageously, the present invention provides a mechanism that allows a large number of processors to work in parallel on the geometry operations of the three-dimensional rendering pipeline. Moreover, this high degree of parallelism is achieved with very little synchronization (one processor waiting for another) required, which results in increased performance over prior art graphics processing techniques.
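A simplified C sketch of the three phases, using an assumed fixed-size state vector with per-component "written" flags. The phase 2 merge is shown as a plain sequential prefix pass (the text permits other schedules), and the command scanning and geometry calls are left as placeholder comments.

```c
#include <stdbool.h>

#define N          8      /* number of subsequences / processors          */
#define STATE_DIMS 32     /* assumed number of independent state fields   */

struct state_vector {
    float value[STATE_DIMS];
    bool  set[STATE_DIMS];   /* true once a subsequence writes this field */
};

static struct state_vector V[N];   /* V[0] holds the initial state        */

/* Phase 1 (one processor per subsequence): record into V[j+1] the state
 * changes that the commands of S[j] would make.                           */
static void phase1(int j)
{
    (void)j;  /* scan_commands(S[j], &V[j + 1]);  -- placeholder           */
}

/* Phase 2: fill each component that V[k] did not set from the nearest
 * earlier vector, so V[k] ends up reflecting S[0..k-1] executed in order.  */
static void phase2(void)
{
    for (int k = 1; k < N; k++)
        for (int d = 0; d < STATE_DIMS; d++)
            if (!V[k].set[d]) {
                V[k].value[d] = V[k - 1].value[d];
                V[k].set[d]   = V[k - 1].set[d];
            }
}

/* Phase 3 (one processor per subsequence): run the geometry work of S[m]
 * against its now-complete state vector V[m].                              */
static void phase3(int m)
{
    (void)m;  /* run_geometry(S[m], &V[m]);  -- placeholder                 */
}
```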