Abstract:
A system and method for transferring a data stream between devices having different clock domains. The method initiates a serial data stream between a transmitter and a receiver. The transmitter operates according to a first clock having a first clock rate, and the receiver operates according to a second clock having a second clock rate. A ratio between the second clock rate and the first clock rate is an integer number greater than or equal to one. A first state is provided over a serial line between the transmitter and the receiver. One or more start bits are provided over the serial line. The start bits indicate a second state different from the first state. One or more ratio bits are provided over the serial line after the start bits. The ratio bits indicate the ratio between the second clock rate and the first clock rate. The start bits are received. Using the transition between the first state and the second state evident in receiving each of the start bits, the ratio bits are received. The remainder of the serial data stream is received at appropriate intervals of the second clock rate.
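As a concreteness check, the framing and ratio recovery can be simulated. The sketch below assumes an idle-high line, two alternating start bits (low then high) whose transition spacing reveals the ratio, and a 2-bit ratio field; none of those specifics come from the abstract, and all names here are illustrative.

```python
# Minimal simulation of the start-bit/ratio-bit framing (assumptions
# noted above; encode/stretch/decode are hypothetical names).

RATIO_FIELD = 2  # assumed width of the ratio field, in bits

def encode(payload, ratio):
    """Transmitter side: one list entry per transmitter clock tick."""
    frame = [1, 1]                    # first state: idle high
    frame += [0, 1]                   # start bits: second state, then back
    frame += [(ratio >> i) & 1 for i in reversed(range(RATIO_FIELD))]
    frame += payload
    return frame

def stretch(frame, ratio):
    """The receiver sees each transmitter bit for `ratio` of its ticks."""
    return [b for b in frame for _ in range(ratio)]

def decode(samples, n_payload):
    """Receiver side: measure the ratio from the start-bit transitions,
    then sample at intervals of `ratio` receiver ticks."""
    fall = samples.index(0)           # idle -> start transition
    rise = fall
    while samples[rise] == 0:         # start bit -> second start bit
        rise += 1
    ratio = rise - fall               # receiver ticks per transmitter bit
    first = rise + ratio + ratio // 2 # sample mid-bit at the ratio field
    bits = [samples[first + k * ratio]
            for k in range(RATIO_FIELD + n_payload)]
    ratio_field = int("".join(map(str, bits[:RATIO_FIELD])), 2)
    assert ratio_field == ratio       # cross-check against the field
    return bits[RATIO_FIELD:]

payload = [1, 0, 1, 1]
assert decode(stretch(encode(payload, 3), 3), len(payload)) == payload
```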
Abstract:
A data caching system and method includes a data store for caching data from a main memory, a primary tag array for holding tags associated with data cached in the data store, and a duplicate tag array which holds copies of the tags held in the primary tag array. The duplicate tag array is accessible by functions, such as external memory cache probes, such that the primary tag array remains available to the processor core. An address translator maps virtual page addresses to physical page addresses. In order to allow a data caching system which is larger than a page size, a portion of the virtual page address is used to index the tag arrays and data store. However, because of the virtual to physical mapping, the data may reside in any of a number of physical locations. During an internally-generated memory access, the virtual address is used to index the cache. If there is a miss, other combinations of values are substituted for the virtual bits of the tag array index. For external probes which provide physical addresses to the duplicate tag array, combinations of values are appended to the index portion of the physical address. Tag array lookups can be performed either sequentially or in parallel.
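The aliased lookup is easiest to see with concrete sizes. The sketch below assumes 4 KB pages, 64-byte lines, and 256 sets (a 16 KB direct-mapped array), so the top two index bits lie above the page offset and are "virtual"; the tag arrays are modeled as dicts from set index to physical page number. All sizes and names are assumptions, not taken from the abstract.

```python
# Illustrative alias enumeration for a virtually-indexed,
# physically-tagged array larger than a page (assumed geometry).

PAGE_BITS, LINE_BITS, INDEX_BITS = 12, 6, 8
ALIAS_BITS = INDEX_BITS + LINE_BITS - PAGE_BITS   # 2 bits above the page
LOW_BITS = INDEX_BITS - ALIAS_BITS                # known from page offset

def candidate_sets(addr):
    """All set indices the line may occupy: the low index bits come
    from the page offset; every combination of the alias bits above
    them must be tried (sequentially or in parallel)."""
    low = (addr >> LINE_BITS) & ((1 << LOW_BITS) - 1)
    return [(combo << LOW_BITS) | low for combo in range(1 << ALIAS_BITS)]

def external_probe(duplicate_tags, phys_addr):
    """Probes consult the duplicate tag array, leaving the primary
    tag array free for the processor core."""
    tag = phys_addr >> PAGE_BITS                  # physical page number
    for index in candidate_sets(phys_addr):
        if duplicate_tags.get(index) == tag:
            return index                          # hit under some alias
    return None

dup = {0b11010101: 0x7ABC}   # line cached with alias bits '11'
hit = external_probe(dup, (0x7ABC << PAGE_BITS) | (0b010101 << LINE_BITS))
assert hit is not None
```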
Abstract:
A floating point unit capable of executing multiple instructions in a single clock cycle using a central window and a register map is disclosed. The floating point unit comprises: a plurality of translation units, a future file, a central window, a plurality of functional units, a result queue, and a plurality of physical registers. The floating point unit receives speculative instructions, decodes them, and then stores them in the central window. Speculative top-of-stack values are generated for each instruction during decoding. Top-of-stack-relative operands are mapped to physical registers using a register map. Register stack exchange operations are performed during decoding. Instructions are then stored in the central window, which selects the oldest stored instructions and issues them to each functional pipeline. Conversion units convert the instruction's operands to an internal format, and normalization units detect and normalize any denormal operands. Finally, the functional pipelines execute the instructions.
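The decode-time renaming of stack-relative operands can be sketched briefly. The code below assumes an 8-entry x87-style register stack, a speculative top-of-stack (TOS) kept by the decoder, and a map from stack slots to physical registers; this interface (StackRenamer, resolve, fxch) is invented for illustration and is not defined by the abstract.

```python
# Stack-relative renaming with a register map (assumed interface).

class StackRenamer:
    def __init__(self):
        self.tos = 0                    # speculative top of stack
        self.map = list(range(8))       # stack slot -> physical register

    def resolve(self, st_i):
        """Turn a TOS-relative operand ST(i) into a physical register."""
        return self.map[(self.tos + st_i) % 8]

    def push(self, phys_reg):
        """FLD-style push, tracked speculatively during decode."""
        self.tos = (self.tos - 1) % 8
        self.map[self.tos] = phys_reg

    def fxch(self, st_i):
        """Stack exchange performed entirely at decode: swapping two
        map entries means no functional unit ever sees the exchange."""
        a, b = self.tos, (self.tos + st_i) % 8
        self.map[a], self.map[b] = self.map[b], self.map[a]

r = StackRenamer()
r.push(42)                 # ST(0) now names physical register 42
r.fxch(3)                  # exchange ST(0) and ST(3) by remapping
assert r.resolve(3) == 42  # the value followed the exchange
```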
Abstract:
Use of an internal processor data bus is maximized in a system where external transactions may occur at a rate which is fractionally slower than the rate of the internal transactions. The technique inserts a selectable delay element in the signal path during an external operation such as a cache fill operation. The one cycle delay provides a time slot in which an internal operation, such as a load from an internal cache, may be performed. This technique therefore permits full use of the time slots on the internal data bus. It can, for example, allow load operations to begin at a much earlier time than would otherwise be possible in architectures where fill operations can consume multiple bus time slots.
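A timeline makes the freed slot visible. The following scheduling sketch assumes external fill data arrives every third internal bus cycle; the one-cycle selectable delay shifts each fill beat one slot later, and the slot it vacates is given to an internal cache load. The ratio, names, and schedule are purely illustrative.

```python
# Bus-slot timeline with a one-cycle delay on external fill beats
# (assumed 3:1 internal-to-external rate; names are hypothetical).

def schedule(fill_beats, loads, cycles, ext_ratio=3):
    """Fill data arrives every `ext_ratio` internal cycles; the delay
    element moves each beat one slot later, leaving the arrival slot
    free for an internal load."""
    bus = ["idle"] * cycles
    for i in range(fill_beats):
        arrival = i * ext_ratio
        bus[arrival + 1] = f"fill[{i}]"   # one-cycle selectable delay
        if loads:
            bus[arrival] = loads.pop(0)   # freed slot: internal load
    return bus

print(schedule(3, ["load A", "load B"], 10))
# -> ['load A', 'fill[0]', 'idle', 'load B', 'fill[1]', 'idle',
#     'idle', 'fill[2]', 'idle', 'idle']
```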
Abstract:
A pipelined CPU executing instructions of variable length, and referencing memory using various data widths. Macroinstruction pipelining is employed (instead of microinstruction pipelining), with queueing between units of the CPU to allow flexibility in instruction execution times. A wide bandwidth is available for memory access, fetching 64-bit data blocks on each cycle. A hierarchical cache arrangement has an improved method of cache set selection, increasing the likelihood of a cache hit. A writeback cache is used (instead of writethrough), and writeback is allowed to proceed even though other accesses are suppressed due to queues being full. A branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table. A floating point processor function is integrated on-chip, with enhanced speed due to a bypass technique: a trial mini-rounding is done on low-order bits of the result, and if correct, the last stage of the floating point processor can be bypassed, saving one cycle of latency. For CALL type instructions, a method for determining which registers need to be saved is executed in a minimum number of cycles, examining groups of register mask bits at a time. Internal processor registers are accessed with short (byte width) addresses instead of the full physical addresses used for memory and I/O references, but off-chip processor registers are memory-mapped and accessed by the same busses using the same controls as the memory and I/O. Upon a non-recoverable error detected by ECC circuits in the cache, an error transition mode is entered wherein the cache operates under limited access rules, allowing the system maximum access to data blocks owned by the cache while minimizing changes to the cache data so that diagnostics may be run. Separate queues are provided for the return data from memory and cache invalidates, yet the order of bus transactions is maintained by a pointer arrangement. The bus protocol used by the CPU to communicate with the system bus is of the pended type, with transactions on the bus identified by an ID field specifying the originator, and arbitration for bus grant going on simultaneously with address/data transactions on the bus.
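Of the mechanisms listed, the branch history table is compact enough to sketch. The code below assumes a four-bit taken/not-taken history per branch address and an invented majority-vote lookup standing in for the "empirical algorithm"; the patent's actual table contents and history length are not given here.

```python
# Branch history table sketch (history length and prediction table
# are assumptions, not the patent's actual algorithm).

HISTORY_LEN = 4
# Hypothetical empirical table: predict taken iff the majority of
# the last four recorded outcomes were taken (ties predict taken).
PREDICT = {h: bin(h).count("1") >= 2 for h in range(2**HISTORY_LEN)}

class BranchHistoryTable:
    def __init__(self):
        self.table = {}                   # branch PC -> history bits

    def predict(self, pc):
        hist = self.table.get(pc, 0)
        return PREDICT[hist]              # empirical pattern lookup

    def update(self, pc, taken):
        hist = self.table.get(pc, 0)
        hist = ((hist << 1) | int(taken)) & (2**HISTORY_LEN - 1)
        self.table[pc] = hist             # record actual outcome

bht = BranchHistoryTable()
for taken in (True, True, False, True):
    bht.update(0x1000, taken)
print(bht.predict(0x1000))                # history 1101 -> True (taken)
```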
Abstract:
A method and mechanism for generating a clock signal with a relatively linear increase or decrease in clock frequency. A first clock signal is generated with a first frequency which is then used to generate a second clock signal with a second frequency. The second frequency is generated by dropping selected pulses of the first clock signal. Particular patterns of bits are stored in a storage element. Bits are then selected and conveyed from the storage element at a frequency determined by the first clock signal. The conveyed bits are used to construct the second clock signal. By selecting the particular pattern of bits conveyed, the frequency of the second clock signal may be determined. Further, by changing the patterns of bits within the storage element at selected times, the frequency of the second clock signal may be made to change in a relatively linear manner.
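The pulse-dropping idea reduces to gating one clock with a cycled bit pattern. In the minimal sketch below, a 1 in the pattern passes the corresponding pulse of the first clock to the second clock and a 0 drops it; the pattern contents and the ramp schedule are invented for illustration.

```python
# Pulse dropping via a cycled bit pattern (pattern values assumed).

def gate_clock(first_clock_pulses, pattern):
    """Second clock = first-clock pulses gated by the stored pattern."""
    n = len(pattern)
    return [p if pattern[i % n] else 0
            for i, p in enumerate(first_clock_pulses)]

first = [1] * 16                         # 16 pulses of the first clock
half = gate_clock(first, [1, 0])         # drop every other pulse -> f/2

# Changing the pattern at selected times ramps the frequency in
# roughly linear steps: f/4, f/2, 3f/4, then f.
ramp = []
for pattern in ([1, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]):
    ramp += gate_clock([1] * 4, pattern)
assert sum(half) == 8 and sum(ramp) == 1 + 2 + 3 + 4
```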
Abstract:
A computer system employs virtual channels and allocates different resources to the virtual channels. Packets which do not have logical/protocol-related conflicts are grouped into a virtual channel. Accordingly, logical conflicts occur only between packets in separate virtual channels. The packets within a virtual channel may share resources (and hence experience resource conflicts), but the packets within different virtual channels may not share resources. Since packets which may experience resource conflicts do not experience logical conflicts, and since packets which may experience logical conflicts do not experience resource conflicts, deadlock-free operation may be achieved. Additionally, each virtual channel may be assigned control packet buffers and data packet buffers. Control packets may be substantially smaller in size, and may occur more frequently than data packets. By providing separate buffers, buffer space may be used efficiently. If a control packet which does not specify a data packet is received, no data packet buffer space is allocated. If a control packet which does specify a data packet is received, both control packet buffer space and data packet buffer space are allocated.
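The buffer-allocation rule at the end of the abstract is simple to state in code. The sketch below assumes three virtual channels with fixed buffer counts; the channel names and sizes are illustrative, but the rule follows the abstract: a control packet always consumes a control buffer, and a data buffer is consumed only if the control packet specifies a data packet.

```python
# Per-virtual-channel buffer accounting (channel names/sizes assumed).

class VirtualChannel:
    def __init__(self, ctl_bufs, data_bufs):
        self.ctl_free = ctl_bufs
        self.data_free = data_bufs

    def can_accept(self, has_data):
        return self.ctl_free > 0 and (not has_data or self.data_free > 0)

    def accept(self, has_data):
        assert self.can_accept(has_data)
        self.ctl_free -= 1               # control buffer: always
        if has_data:
            self.data_free -= 1          # data buffer: only if specified

channels = {name: VirtualChannel(ctl_bufs=4, data_bufs=2)
            for name in ("vc0", "vc1", "vc2")}
channels["vc0"].accept(has_data=True)    # e.g. a write: control + data
channels["vc1"].accept(has_data=False)   # e.g. a request: control only
```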
Abstract:
A method and system for expediting issuance of a second request of a pair of ordered requests into a distributed coherent communication fabric. The first request of the ordered pair is issued into the coherent communication fabric and directed to a first target. Issuance of the second request into the coherent communication fabric is stalled until the first target receives and orders the first request and transmits a response acknowledging the same.
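The stall rule can be modeled with a source thread blocking on the first target's acknowledgment. The toy below is a sketch only: the Target class, message shapes, and timing are assumptions, since the abstract says only that the second request waits until the first is received, ordered, and acknowledged.

```python
# Ordered-pair issuance with a stall on the ordering ack (toy model).
import queue
import threading
import time

class Target:
    """Toy fabric target: receives a request, orders it, then sends
    a response acknowledging that it has been ordered."""
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            req, on_ordered = self.inbox.get()
            time.sleep(0.01)              # receive and order the request
            print(self.name, "ordered", req)
            if on_ordered:
                on_ordered()              # the acknowledging response

def issue_ordered_pair(t1, t2, first, second):
    ordered = queue.Queue(maxsize=1)
    t1.inbox.put((first, lambda: ordered.put(True)))
    ordered.get()                         # stall the second request here
    t2.inbox.put((second, None))          # safe to issue: first is ordered

issue_ordered_pair(Target("T1"), Target("T2"), "req A", "req B")
time.sleep(0.05)                          # let the toy targets drain
```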
Abstract:
A computer system is presented implementing a system and method for properly ordering write operations. The system and method aid in maintaining memory coherency within the computer system. The computer system includes multiple interconnected processing nodes. One or more of the processing nodes includes a central processing unit (CPU) and/or a cache memory, and one or more of the processing nodes includes a memory controller coupled to a memory. The CPU or cache generates a write command to store data within the memory. The memory controller receives the write command and responds to the write command by issuing a target done response to the CPU or cache after the memory controller: (i) properly orders the write command within the memory controller with respect to other commands pending within the memory controller, and (ii) determines that a coherency state with respect to the write command has been established within the computer system.
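The two conditions gating the target done response can be captured in a small state machine. The sketch below assumes a controller that queues commands in arrival order and tracks coherency (e.g., completion of probe responses) per write; the class and method names are illustrative, though the rule itself is the abstract's.

```python
# Target-done gating: (i) ordered among pending commands, and
# (ii) coherency established (controller internals are assumed).
from collections import deque

class MemoryController:
    def __init__(self):
        self.pending = deque()            # commands in arrival order

    def receive_write(self, write):
        self.pending.append(write)        # (i) ordered on entry
        write["coherent"] = False

    def probe_responses_complete(self, write):
        write["coherent"] = True          # (ii) coherency established

    def try_target_done(self, write, send):
        ordered = self.pending and self.pending[0] is write
        if ordered and write["coherent"]:
            self.pending.popleft()
            send("TgtDone", write["source"])  # release the source node

mc = MemoryController()
w = {"source": "CPU0", "data": 123}
mc.receive_write(w)
mc.probe_responses_complete(w)
mc.try_target_done(w, lambda msg, dst: print(msg, "->", dst))
```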
Abstract:
A processor employing a post-cache (LS2) buffer. Loads are stored into the LS2 buffer after probing the data cache. The load/store unit snoops the loads in the LS2 buffer against snoop requests received from an external bus. If a snoop invalidate request hits a load within the LS2 buffer and that load hit in the data cache during its initial probe, the load/store unit scans the LS2 buffer for older loads which are misses. If older load misses are detected, a synchronization indication is set for the load misses. Subsequently, one of the load misses completes and the load/store unit transmits a synchronization signal with the status for the load miss. The processor synchronizes to the instruction corresponding to the load miss, thereby discarding the load hit which was subsequently hit by the snoop. The discarded instructions are refetched and reexecuted, thereby causing the load hit to reexecute subsequent to the earlier load miss. Load hits may generally proceed ahead of load misses and strong memory ordering rules may still be enforced.
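The snoop-resync rule can be sketched over a simple oldest-first load list. The code below models LS2 entries as dicts with hit/miss status; per the abstract, a snoop invalidate that hits a completed load hit marks any older outstanding load misses, and when a marked miss completes, a synchronization is signaled so the younger hit is refetched and reexecuted. The data layout and names are assumptions.

```python
# LS2 snoop-resync sketch (buffer representation is assumed).

class LS2Buffer:
    def __init__(self):
        self.loads = []                   # oldest entries first

    def insert(self, load):
        """Loads enter LS2 after their data cache probe."""
        self.loads.append(load)

    def snoop_invalidate(self, addr):
        for i, ld in enumerate(self.loads):
            if ld["addr"] == addr and ld["hit"]:
                # Scan older entries for misses still outstanding.
                for older in self.loads[:i]:
                    if not older["hit"]:
                        older["resync"] = True   # synchronization mark

    def complete_miss(self, load):
        if load.get("resync"):
            return "synchronize"          # refetch younger instructions
        return "normal"

ls2 = LS2Buffer()
miss = {"addr": 0x40, "hit": False}       # older load, cache miss
hit = {"addr": 0x80, "hit": True}         # younger load, cache hit
ls2.insert(miss)
ls2.insert(hit)
ls2.snoop_invalidate(0x80)                # snoop hits the younger hit
assert ls2.complete_miss(miss) == "synchronize"
```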