METHOD AND APPARATUS FOR EFFICIENT MATRIX ALIGNMENT IN A SYSTOLIC ARRAY

    Publication No.: US20190042262A1

    Publication Date: 2019-02-07

    Application No.: US16147506

    Filing Date: 2018-09-28

    IPC Classes: G06F9/38 G06F15/80 G06F9/30

    Abstract: An apparatus and method for efficient matrix alignment in a systolic array. For example, one embodiment of a processor comprises: a first set of physical tile registers to store first matrix data in rows or columns; a second set of physical tile registers to store second matrix data in rows or columns; a decoder to decode a matrix instruction identifying a first input matrix, a first offset, a second input matrix, and a second offset; and execution circuitry, responsive to the matrix instruction, to read a subset of rows or columns from the first set of physical tile registers in accordance with the first offset, spanning multiple physical tile registers from the first set if indicated by the first offset, to generate the first input matrix; to read a subset of rows or columns from the second set of physical tile registers in accordance with the second offset, spanning multiple physical tile registers from the second set if indicated by the second offset, to generate the second input matrix; and to perform an arithmetic operation on the first and second input matrices in accordance with an opcode of the matrix instruction.
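The offset-based read described in the abstract can be sketched in software. This is a minimal illustrative model, not the patented hardware: tile sizes, names, and the stand-in matrix-multiply opcode are all assumptions for illustration.

```python
TILE_ROWS = 4  # rows held by each physical tile register (assumed size)

def read_rows(tiles, offset, nrows):
    """Return nrows consecutive rows starting at offset, spanning
    multiple physical tiles when the range crosses a tile boundary."""
    flat = [row for tile in tiles for row in tile]  # logical view of the tile set
    return flat[offset:offset + nrows]

def matmul(a, b):
    """Plain matrix multiply standing in for the systolic-array operation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Two 4x4 tiles; reading 4 rows at offset 2 spans both physical tiles.
tiles = [[[t * 16 + r * 4 + c for c in range(4)] for r in range(TILE_ROWS)]
         for t in range(2)]
a = read_rows(tiles, offset=2, nrows=4)  # rows 2-5 of the logical matrix
b = read_rows(tiles, offset=0, nrows=4)  # rows 0-3
result = matmul(a, b)                    # arithmetic op selected by the opcode
```

The point of the sketch is the `flat` view: a single logical matrix can begin partway into one physical tile register and finish in the next.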

    4. APPARATUS AND METHOD FOR MEMORY-HIERARCHY AWARE PRODUCER-CONSUMER INSTRUCTIONS
    Invention Application (pending, published)

    Publication No.: US20140208031A1

    Publication Date: 2014-07-24

    Application No.: US13994724

    Filing Date: 2011-12-21

    IPC Classes: G06F12/08 G06T1/60

    Abstract: An apparatus and method are described for efficiently transferring data from a producer core to a consumer core within a central processing unit (CPU). For example, one embodiment of a method for transferring a chunk of data from the producer core to the consumer core comprises: writing data to a fill buffer within the producer core until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the fill buffer to a cache accessible by both the producer core and the consumer core; and upon the consumer core detecting that data is available in the cache, providing the data to the consumer core from the cache upon receipt of a read signal from the consumer core.

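The producer/consumer flow above can be sketched as a small model: the producer fills a private buffer, a threshold triggers an eviction into a cache visible to both cores, and the consumer reads from that cache. The class names, the threshold, and the list-based cache are assumptions for illustration only.

```python
CHUNK_SIZE = 4  # designated amount of data that triggers eviction (assumed)

class SharedCache:
    """Cache accessible by both the producer and the consumer core."""
    def __init__(self):
        self.lines = []

class Producer:
    def __init__(self, cache):
        self.fill_buffer = []  # private buffer inside the producer core
        self.cache = cache

    def write(self, item):
        self.fill_buffer.append(item)
        if len(self.fill_buffer) >= CHUNK_SIZE:  # designated amount written
            self.evict()

    def evict(self):
        # Eviction cycle: transfer the buffered chunk into the shared cache.
        self.cache.lines.extend(self.fill_buffer)
        self.fill_buffer.clear()

class Consumer:
    def __init__(self, cache):
        self.cache = cache

    def read(self):
        # Data is provided from the cache once it is available there.
        return self.cache.lines.pop(0) if self.cache.lines else None

cache = SharedCache()
p, c = Producer(cache), Consumer(cache)
for i in range(4):
    p.write(i)       # the fourth write triggers the eviction cycle
first = c.read()     # consumer receives data via the shared cache
```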

    6. METHOD AND APPARATUS FOR CUTTING SENIOR STORE LATENCY USING STORE PREFETCHING
    Invention Application (granted, in force)

    Publication No.: US20140223105A1

    Publication Date: 2014-08-07

    Application No.: US13993508

    Filing Date: 2011-12-30

    IPC Classes: G06F9/38 G06F12/08

    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out-of-order processor means that processes out-of-order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out-of-order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate the physical address of the memory to be accessed by the store instruction; and means for executing a prefetch of a cache line, based on the store instruction and the calculated physical address, before the store instruction retires.

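A minimal sketch of the idea above: once address generation and translation have produced the store's physical address, a prefetch is issued for that cache line before the store retires, so the senior store finds the line already resident. The cache model, line size, and identity translation are assumptions for illustration.

```python
LINE_SIZE = 64  # cache line size in bytes (assumed)

class Cache:
    def __init__(self):
        self.resident = set()  # set of resident cache-line indices

    def prefetch(self, phys_addr):
        # Bring the line in early, before the store instruction retires.
        self.resident.add(phys_addr // LINE_SIZE)

    def store(self, phys_addr):
        # At retirement, the senior store hits if the line was prefetched.
        line = phys_addr // LINE_SIZE
        hit = line in self.resident
        self.resident.add(line)
        return hit

def translate(virt_addr):
    # Stand-in for address generation + translation (identity mapping here).
    return virt_addr

cache = Cache()
phys = translate(0x1040)  # physical address known well before retirement
cache.prefetch(phys)      # prefetch issued pre-retirement
hit = cache.store(phys)   # the senior store now hits in the cache
```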

    8. EXTENDING CACHE COHERENCY PROTOCOLS TO SUPPORT LOCALLY BUFFERED DATA
    Invention Application (granted, in force)

    Publication No.: US20100169581A1

    Publication Date: 2010-07-01

    Application No.: US12346543

    Filing Date: 2008-12-30

    IPC Classes: G06F12/08 G06F12/00

    Abstract: A method and apparatus for extending cache coherency to hold buffered data to support transactional execution is described herein. A transactional store operation referencing an address associated with a data item is performed in a buffered manner. Here, the coherency state associated with the cache lines holding the data item is transitioned to a buffered state. In response to local requests for the buffered data item, the data item is provided to ensure internal transactional sequential ordering. However, in response to external access requests, a miss response is provided to ensure the transactionally updated data item is not made globally visible until commit. Upon commit, the buffered lines are transitioned to a modified state to make the data item globally visible.

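The buffered-coherency behavior described above can be sketched as a tiny per-line state machine. The state names and methods here are illustrative, not the actual protocol encoding.

```python
class Line:
    """One cache line under the extended coherency protocol (sketch)."""
    def __init__(self):
        self.state = "INVALID"
        self.value = None

    def tx_store(self, value):
        self.value = value
        self.state = "BUFFERED"  # transactional store buffers the line

    def local_read(self):
        return self.value        # local access sees the buffered data

    def external_read(self):
        if self.state == "BUFFERED":
            return None          # miss: not globally visible before commit
        return self.value

    def commit(self):
        if self.state == "BUFFERED":
            self.state = "MODIFIED"  # commit makes the update globally visible

line = Line()
line.tx_store(42)
local = line.local_read()              # buffered data served locally
external_before = line.external_read() # external access gets a miss
line.commit()
external_after = line.external_read()  # now globally visible
```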

    9. Method and apparatus for cutting senior store latency using store prefetching
    Granted Patent (in force)

    Publication No.: US09405545B2

    Publication Date: 2016-08-02

    Application No.: US13993508

    Filing Date: 2011-12-30

    IPC Classes: G06F12/08 G06F9/38

    Abstract: In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out-of-order processor means that processes out-of-order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out-of-order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate the physical address of the memory to be accessed by the store instruction; and means for executing a prefetch of a cache line, based on the store instruction and the calculated physical address, before the store instruction retires.


    10. Method and system to reduce the power consumption of a memory device
    Granted Patent (in force)

    Publication No.: US08352683B2

    Publication Date: 2013-01-08

    Application No.: US12823047

    Filing Date: 2010-06-24

    Abstract: A method and system to reduce the power consumption of a memory device. In one embodiment of the invention, the memory device is an N-way set-associative level one (L1) cache memory, and logic coupled with the data cache memory facilitates access to only part of the N ways of the N-way set-associative L1 cache memory in response to a load instruction or a store instruction. By reducing the number of ways accessed in the N-way set-associative L1 cache memory for each load or store request, the power requirements of the N-way set-associative L1 cache memory are reduced in one embodiment of the invention. In one embodiment of the invention, when a prediction is made that an access to the cache memory requires only the data arrays of the N-way set-associative L1 cache memory, access to the fill buffers is deactivated or disabled.

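The partial-way access described above can be sketched as follows, with the number of probed ways standing in for dynamic power spent. The toy predictor and the counter are assumptions for illustration, not the patented logic.

```python
N_WAYS = 8  # associativity of the L1 cache (assumed)

class Cache:
    def __init__(self):
        self.ways = [dict() for _ in range(N_WAYS)]
        self.ways_probed = 0  # proxy for dynamic power consumed

    def predict_way(self, addr):
        return addr % N_WAYS  # toy way predictor (assumption)

    def load(self, addr):
        way = self.predict_way(addr)
        self.ways_probed += 1  # only the predicted way is activated
        if addr in self.ways[way]:
            return self.ways[way][addr]
        # Misprediction: fall back to probing the remaining ways.
        self.ways_probed += N_WAYS - 1
        for w in self.ways:
            if addr in w:
                return w[addr]
        return None

cache = Cache()
cache.ways[3][11] = "data"          # 11 % 8 == 3, so the prediction will hit
value = cache.load(11)
probes_for_hit = cache.ways_probed  # 1 way activated instead of all 8
```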