Abstract:
In one embodiment, an apparatus comprises a microcontroller unit to store instructions into an execution queue. The apparatus also comprises an execution queue unit to generate a widely decoded functional execution instruction based on at least one instruction stored in the execution queue. Additionally, the apparatus comprises a functional unit to execute the widely decoded functional execution instruction asynchronously with respect to its generation.
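A minimal C sketch of the decoupling the abstract describes: a queue is drained by a decode step that widens instructions, and a separate consumer loop stands in for the functional unit. All structure and field names (exec_queue, wide_op_t, the 16-bit encoding) are illustrative assumptions, not taken from the patent:

```c
/* Illustrative sketch: a ring-buffer execution queue, a decode step
 * that widens queued instructions, and a consumer loop standing in
 * for the functional unit. The 16-bit encoding is hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 8

typedef struct { uint16_t raw; } insn_t;     /* compact encoded instruction */
typedef struct { int opcode, src_a, src_b, dst; } wide_op_t; /* decoded     */

typedef struct {
    insn_t   slots[QUEUE_DEPTH];
    unsigned head, tail;                     /* free-running ring indices   */
} exec_queue;

static int queue_push(exec_queue *q, insn_t i)
{
    if (q->tail - q->head == QUEUE_DEPTH) return 0;  /* queue full          */
    q->slots[q->tail++ % QUEUE_DEPTH] = i;
    return 1;
}

/* Decode stage: expand one queued instruction into a wide op. */
static int decode_one(exec_queue *q, wide_op_t *out)
{
    if (q->head == q->tail) return 0;                /* queue empty         */
    insn_t i = q->slots[q->head++ % QUEUE_DEPTH];
    out->opcode = i.raw >> 12;
    out->src_a  = (i.raw >> 8) & 0xF;
    out->src_b  = (i.raw >> 4) & 0xF;
    out->dst    =  i.raw       & 0xF;
    return 1;
}

int main(void)
{
    exec_queue q = {0};
    queue_push(&q, (insn_t){0x1234});        /* MCU stores an instruction   */
    wide_op_t op;
    while (decode_one(&q, &op))              /* queue unit widens it; the   */
        printf("execute op=%d a=%d b=%d d=%d\n",  /* FU consumes at its own */
               op.opcode, op.src_a, op.src_b, op.dst); /* rate              */
    return 0;
}
```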
Abstract:
A system and method are disclosed to maintain the coherence of shared data in the caches and memories contained in the nodes of a multiprocessing computer system. The distributed multiprocessing computer system contains a number of processors, each connected to main memory. A processor in the distributed multiprocessing computer system is identified as the Home processor for a memory block if its main memory holds the original memory block and a coherence directory for that block. An Owner processor is another processor in the multiprocessing computer system that holds a copy of the Home processor's memory block in a cache connected to its main memory. Whenever an Owner processor exists for a memory block, it is the only processor in the distributed multiprocessing computer system to hold a copy of that block. Eviction of a memory block copy held by an Owner processor in its cache requires a write of the copy back to its Home processor and an update of the corresponding coherence directory; no read of the Home processor directory and no modification of any other processor's cache or main memory is required. The coherence controller in each processor is able to send and receive messages out of order while maintaining the coherence of the shared data in cache and main memory. If an out-of-order message causes an incorrect next program state, the coherence controller is able to restore the prior correct saved program state and resume execution.
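A minimal C sketch of the Owner-eviction path described above: the dirty copy is written back to the Home node and the directory entry is blindly overwritten, with no directory read and no other node touched. The structures (home_block_t, dir_entry_t) are hypothetical:

```c
/* Illustrative sketch of Owner eviction: write the dirty copy back to
 * the Home node's memory and overwrite the directory entry in one
 * shot; nothing is read and no other node is touched. */
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64
#define NO_OWNER    (-1)

typedef struct { int owner; } dir_entry_t;   /* single Owner, or NO_OWNER  */

typedef struct {
    uint8_t     mem[BLOCK_BYTES];            /* original block at the Home */
    dir_entry_t dir;                         /* its coherence directory    */
} home_block_t;

static void evict_owner_copy(home_block_t *home,
                             const uint8_t cached_copy[BLOCK_BYTES])
{
    memcpy(home->mem, cached_copy, BLOCK_BYTES); /* write copy to Home     */
    home->dir.owner = NO_OWNER;                  /* update, never read     */
}
```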
Abstract:
A system and method are disclosed to efficiently translate virtual-to-physical addresses of large pages of data by eliminating one level of a multilevel page table. A computer system containing a processor includes a translation lookaside buffer ("TLB") in the processor. The processor is connected to a system memory that contains a page table with multiple levels. The page table translates the virtual address of a page of data stored in system memory into the corresponding physical address of the page. If the size of the page is above a certain threshold value, the page is translated with one or more levels of the page table eliminated. The threshold value preferably is 512 Megabytes. The multilevel page table is used for translation only if a lookup of the TLB for the virtual address of the page results in a miss. The TLB contains entries from the final level of the page table (i.e., physical addresses of pages of data), each corresponding to a subfield of bits from the virtual address of the page. Virtual-to-physical address translation using the multilevel page table is not required if the TLB already contains the physical address of the page corresponding to that subfield of bits.
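One way to read the level-elision idea is that a large page's physical base lives directly in an upper-level entry, so the walk never touches the final level. A sketch under that assumption, with illustrative field widths (512 MB first-level granules, 8 KB small pages) chosen only to make the arithmetic concrete:

```c
/* Illustrative sketch: a first-level entry flagged LARGE holds the
 * physical base itself, so level 2 is never walked. 512 MB first-level
 * granules and 8 KB small pages are assumed sizes, not the patent's. */
#include <stdint.h>

#define LARGE_PAGE_FLAG 0x1

typedef struct pte {
    uint64_t     pa_base;      /* physical base when this entry is a leaf */
    struct pte  *next_level;   /* next table when it is not               */
    unsigned     flags;
} pte_t;

uint64_t translate(const pte_t *l1, uint64_t va)
{
    const pte_t *e = &l1[(va >> 29) & 0x1FF];          /* VA[37:29]        */
    if (e->flags & LARGE_PAGE_FLAG)                    /* large page: one  */
        return e->pa_base + (va & ((1ull << 29) - 1)); /* level eliminated */
    e = &e->next_level[(va >> 13) & 0xFFFF];           /* normal walk to   */
    return e->pa_base + (va & 0x1FFF);                 /* an 8 KB page     */
}
```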
Abstract:
A system is disclosed in which an on-chip logic analyzer (OCLA) includes loop detector logic that receives incoming program counter (PC) data and detects when software loops exist. When a software loop is detected, the loop detector may be configured to store only the first iteration of the loop in memory, while all subsequent iterations are discarded, thus saving memory that would otherwise be consumed. The loop detector comprises a content addressable memory (CAM) that is enabled by a user-programmed signal. The CAM may be configured with a programmable mask to determine which bits of the incoming PC data to compare with the CAM entries. The depth of the CAM also is programmable, permitting the CAM to be adjusted to cover the number of instructions in a loop.
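A small C sketch of how a masked CAM can act as the loop filter: a PC that hits in the CAM is treated as a repeat iteration and is not traced again, while a miss both learns the PC and stores it. The sizes, round-robin insertion policy, and function names are assumptions:

```c
/* Illustrative sketch of the masked-CAM loop filter: hit = repeat
 * iteration, drop it; miss = learn the PC and trace it. */
#include <stdint.h>
#include <stdbool.h>

#define CAM_MAX 64

typedef struct {
    uint64_t entry[CAM_MAX];
    uint64_t mask;       /* programmable: which PC bits participate     */
    unsigned depth;      /* programmable: entries in use (loop length)  */
    unsigned next;       /* round-robin insertion cursor                */
    bool     enabled;    /* user-programmed enable                      */
} loop_cam_t;

/* Returns true if this PC should be stored in trace memory. */
bool trace_filter(loop_cam_t *c, uint64_t pc)
{
    if (!c->enabled || c->depth == 0)
        return true;                          /* filter off: store all  */
    for (unsigned i = 0; i < c->depth; i++)
        if (((c->entry[i] ^ pc) & c->mask) == 0)
            return false;                     /* loop repeat: drop      */
    c->entry[c->next] = pc;                   /* first sighting: learn  */
    c->next = (c->next + 1) % c->depth;
    return true;                              /* and store it           */
}
```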
Abstract:
A multi-processor system in which each processor receives a message from another processor in the system. The message may contain data that was corrupted during transmission from the preceding processor. Upon receiving the message, the processor detects that a portion of the message contains corrupted data. The processor then replaces the corrupted portion with a predetermined bit pattern known to, or otherwise programmed into, all other processors in the system. The predetermined bit pattern indicates that the associated portion of data was corrupted. The processor that detects the error in the message preferably alerts the system that an error has been detected. The message, now containing the predetermined bit pattern in place of the corrupted data, is retransmitted to another processor. The predetermined bit pattern indicates that an error in the message was already detected by a previous processor; in response, a processor detecting the predetermined bit pattern preferably does not alert the system to the existence of an error. The same message with the predetermined bit pattern can be retransmitted to other processors, which likewise will detect the presence of the predetermined bit pattern and in response not alert the system. As such, because only the first processor to detect an error alerts the system and because messages containing uncorrectable errors are still transmitted through the system, fault isolation is improved and the system is less likely to fall into a deadlock condition.
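The forwarding rule reduces to a small per-hop decision, sketched below in C; the 64-bit poison value and the function shape are illustrative, not the patent's encoding:

```c
/* Illustrative per-hop rule: only the first detector alerts; a word
 * already equal to the poison pattern is forwarded silently. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define POISON 0xDEADBEEFDEADBEEFull  /* pattern known to every node */

uint64_t on_receive(uint64_t word, bool uncorrectable_error)
{
    if (word == POISON)
        return word;             /* poisoned upstream: forward, no alert */
    if (uncorrectable_error) {
        fprintf(stderr, "first detection: alerting system\n");
        return POISON;           /* replace corrupt data, keep it moving */
    }
    return word;                 /* clean data passes through            */
}
```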
Abstract:
A computer system includes multiple processors, each of which includes an associated memory. Each of the processors is capable of accessing the memory of all other processors. Memory can be stored and accessed using different addressing schemes. Data that will be used only by the local processor is stored using processor-contiguous addressing, so that it resides in local memory. Data that may be accessed by multiple processors is stored using striping among a local processor set. A stripe control register in the memory controller of each memory comprises a mask that indicates which memory blocks should be accessed using processor-contiguous addressing and which should be accessed using striped addressing. For both striped and contiguous addressing, the address space includes a processor identification field to identify the processor where the associated memory resides, together with an offset indicating where in that memory the address is located. The processor identification field for striped addressing includes two bits located in low-order address space identifying a four-processor local stripe set. The other processor identification bits define which four processors make up each stripe set.
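A sketch of how the two decodings might look; every bit position here (a 6-bit processor id at bits [38:33], 64-byte blocks, a per-1-GB stripe-mask granule) is an assumption made only to show the mechanics of pulling two id bits from low-order address space:

```c
/* Illustrative address decode for contiguous vs. striped modes. All
 * field positions and widths are assumptions, not the patent's. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { unsigned cpu; uint64_t offset; } mem_target_t;

/* stripe_mask: one bit per 1 GB region (assumed granularity), set
 * when that region uses striped addressing. */
mem_target_t decode(uint64_t addr, uint64_t stripe_mask)
{
    mem_target_t t;
    bool striped = (stripe_mask >> ((addr >> 30) & 63)) & 1;
    if (!striped) {                            /* processor-contiguous   */
        t.cpu    = (addr >> 33) & 0x3F;        /* full 6-bit id field    */
        t.offset = addr & ((1ull << 33) - 1);  /* offset in local memory */
    } else {                                   /* striped over 4 CPUs    */
        unsigned set = (addr >> 35) & 0xF;     /* upper id bits: the set */
        unsigned low = (addr >> 6)  & 0x3;     /* 2 low-order id bits    */
        t.cpu    = (set << 2) | low;
        t.offset = (((addr >> 8) & ((1ull << 27) - 1)) << 6)
                 | (addr & 0x3F);              /* squeeze out the two    */
    }                                          /* interleave bits        */
    return t;
}
```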
Abstract:
A computer method and apparatus cause the load-store instruction grouping in a microprocessor instruction pipeline to be disrupted at appropriate times. The method and apparatus employ a memory access member that periodically stalls the issuance of store instructions when prior store instructions are pending in the store queue. The periodic stalls bias the issue stage toward issuing load instruction groups and store instruction groups separately. While store issuance is stalled, the store queue is free to update the data cache with the data from previous store instructions. Thus, the memory access member biases the issuance of store instructions in a manner that prevents the store queue from becoming full, and as such enables the store queue to write to the data cache before it becomes full.
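A sketch of the biasing heuristic as a per-cycle predicate; the period and the exact condition are illustrative assumptions:

```c
/* Illustrative per-cycle predicate: every Nth cycle, if stores are
 * pending, hold store issue so the store queue can drain. */
#include <stdbool.h>

#define STALL_PERIOD 16   /* assumed period */

bool may_issue_stores(unsigned cycle, unsigned stores_pending)
{
    if ((cycle % STALL_PERIOD) == 0 && stores_pending > 0)
        return false;     /* stall: issue a load group instead, while
                             the store queue writes the data cache   */
    return true;
}
```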
Abstract:
Method and apparatus for a filtered stream buffer coupled to a memory and a processor, and operating to prefetch data from the memory. The filtered stream buffer includes a cache block storage area and a filter controller. The filter controller determines whether a pattern of references has a predetermined relationship and, if so, prefetches stream data into the cache block storage area. Such stream data prefetches are particularly useful in vector processing computers, where once the processor starts to fetch a vector, the addresses of future fetches can be predicted based on the pattern of past fetches. According to various aspects of the present invention, the filtered stream buffer further includes a history table and a validity indicator that is associated with the cache block storage area and indicates which cache blocks, if any, are valid. According to another aspect of the present invention, the filtered stream buffer controls random access memory (RAM) chips to stream a plurality of consecutive cache blocks from the RAM into the cache block storage area. According to yet another aspect of the present invention, the stream data includes data for a plurality of strided cache blocks, wherein each of these strided cache blocks corresponds to an address determined by adding to the first address an integer multiple of the difference between the second address and the first address. According to yet another aspect of the present invention, the processor generates three addresses of data words in the memory, and the filter controller determines whether a predetermined relationship exists among the three addresses and, if so, prefetches strided stream data into the cache block storage area.
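The last aspect reduces to a concrete check, sketched here in C: given three reference addresses, prefetch only when they are equally spaced, with future blocks at integer multiples of the common stride. The depth and names are assumptions:

```c
/* Illustrative stride check over three reference addresses: prefetch
 * only when they are equally spaced; future blocks sit at integer
 * multiples of the common stride. */
#include <stdint.h>
#include <stdio.h>

#define PREFETCH_DEPTH 4  /* assumed number of blocks to stream */

void filter_check(uint64_t a1, uint64_t a2, uint64_t a3)
{
    int64_t s1 = (int64_t)(a2 - a1);
    int64_t s2 = (int64_t)(a3 - a2);
    if (s1 != s2 || s1 == 0)
        return;                               /* no stream detected     */
    for (int k = 1; k <= PREFETCH_DEPTH; k++) /* stream the next blocks */
        printf("prefetch block at 0x%llx\n",  /* into the storage area  */
               (unsigned long long)(a3 + (int64_t)k * s1));
}
```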
Abstract:
Work submitted to a co-processor enters through one of multiple input queues, which are used to provide various quality-of-service levels. In-memory linked lists store work to be performed by a network services processor when processing resources in the network services processor are lacking. The work is moved back from the in-memory linked lists to the network services processor when processing resources become available.
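A sketch of the spill/refill discipline, assuming a simple singly linked list and a counter of on-chip work slots; all names are hypothetical:

```c
/* Illustrative spill/refill discipline: work overflows on-chip slots
 * into an in-memory singly linked list and returns as slots free up. */
#include <stddef.h>

typedef struct work {
    int          payload;
    struct work *next;
} work_t;

typedef struct {
    work_t *head, *tail;   /* in-memory linked list (spill area)   */
    int     onchip_used;
    int     onchip_cap;    /* hardware work slots available        */
} work_queue_t;

void submit(work_queue_t *q, work_t *w)
{
    if (q->onchip_used < q->onchip_cap) {     /* room on chip: take it */
        q->onchip_used++;
        return;
    }
    w->next = NULL;                           /* no room: spill to the */
    if (q->tail) q->tail->next = w;           /* in-memory list        */
    else         q->head = w;
    q->tail = w;
}

void on_slot_free(work_queue_t *q)            /* a job just completed  */
{
    q->onchip_used--;
    if (q->head) {                            /* refill from memory    */
        work_t *w = q->head;
        q->head = w->next;
        if (!q->head) q->tail = NULL;
        q->onchip_used++;                     /* w is live on chip now */
        (void)w;
    }
}
```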
Abstract:
A network processor includes multiple processor cores for processing packet data. In order to provide the processor cores with access to a memory subsystem, an interconnect circuit directs communications between the processor cores and the L2 Cache and other memory devices. The processor cores are divided into several groups, each group sharing an individual bus, and the L2 Cache is divided into a number of banks, each bank having access to a separate bus. The interconnect circuit processes requests to store and retrieve data from the processor cores across multiple buses, and processes responses to return data from the cache banks. As a result, the network processor provides high-bandwidth memory access for multiple processor cores.
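A sketch of one plausible bank-steering rule consistent with the abstract: low-order block-address bits select the bank, so consecutive blocks spread across separate buses. The bank count and block size are assumptions; the abstract does not specify the actual mapping:

```c
/* Illustrative bank steering: low-order block-address bits pick the
 * L2 bank, spreading consecutive blocks across separate buses. */
#include <stdint.h>

#define N_BANKS     8    /* assumed bank count        */
#define BLOCK_SHIFT 7    /* assumed 128-byte blocks   */

unsigned l2_bank_for(uint64_t phys_addr)
{
    return (phys_addr >> BLOCK_SHIFT) & (N_BANKS - 1);
}
```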