REGISTER-BASED MATRIX MULTIPLICATION

    Publication number: US20220291923A1

    Publication date: 2022-09-15

    Application number: US17678221

    Filing date: 2022-02-23

    Applicant: Arm Limited

    Abstract: Techniques for performing matrix multiplication in a data processing apparatus are disclosed, comprising apparatuses, matrix multiply instructions, methods of operating the apparatuses, and virtual machine implementations. Registers, each register for storing at least four data elements, are referenced by a matrix multiply instruction and in response to the matrix multiply instruction a matrix multiply operation is carried out. First and second matrices of data elements are extracted from first and second source registers, and plural dot product operations, acting on respective rows of the first matrix and respective columns of the second matrix are performed to generate a square matrix of result data elements, which is applied to a destination register. A higher computation density for a given number of register operands is achieved with respect to vector-by-element techniques.
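
    The operation described in this abstract can be sketched in software. The following is a minimal model, not the patented instruction: it assumes each register holds exactly four elements laid out as a row-major 2x2 matrix, and it reads "applied to a destination register" as accumulation into the destination. The function name and both assumptions are illustrative only.

```python
def matrix_multiply_2x2(src1, src2, dest=None):
    """Model two 4-element source registers as row-major 2x2 matrices.

    Assumption: the result is accumulated into the destination register;
    pass dest=None to start from zero.
    """
    if len(src1) != 4 or len(src2) != 4:
        raise ValueError("each source register must hold four elements")
    acc = list(dest) if dest is not None else [0, 0, 0, 0]
    for row in range(2):
        for col in range(2):
            # Dot product of one row of the first matrix
            # with one column of the second matrix.
            dot = sum(src1[row * 2 + k] * src2[k * 2 + col] for k in range(2))
            acc[row * 2 + col] += dot
    return acc
```

    In this 2x2 case, two four-element source registers yield four dot products (eight multiplications) in a single operation.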

    PREFETCH STRATEGY CONTROL
    12.
    Type: Invention application
    Status: Under examination (published)

    Publication number: US20150121038A1

    Publication date: 2015-04-30

    Application number: US14061837

    Filing date: 2013-10-24

    Applicant: ARM LIMITED

    CPC classification number: G06F9/3455 G06F9/383 G06F9/3851 G06F9/3887

    Abstract: A single instruction multiple thread (SIMT) processor 2 includes execution circuitry 6, prefetch circuitry 12 and prefetch strategy selection circuitry 14. The prefetch strategy selection circuitry detects one or more characteristics of the stream of program instructions being executed to identify whether a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategies in dependence upon the detected characteristics.
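
    As a rough illustration of the selection step, the sketch below assumes one possible "characteristic": a backward branch whose span encloses the data access, which suggests the access sits in a loop. The abstract does not name the characteristics or the strategies, so the program encoding and the strategy names "repeat_aware" and "single_shot" are hypothetical.

```python
def will_repeat(program, access_index):
    """Return True if some backward branch encloses the access.

    program is a list of (opcode, operand) tuples; for a "branch" the
    operand is the index of the branch target (an assumed encoding).
    """
    return any(
        opcode == "branch" and target <= access_index <= branch_index
        for branch_index, (opcode, target) in enumerate(program)
    )


def select_prefetch_strategy(program, access_index):
    """Pick between two hypothetical strategies based on the detection."""
    return "repeat_aware" if will_repeat(program, access_index) else "single_shot"


# Example program: a load inside a loop, then a load after the loop.
example_program = [("load", None), ("add", None), ("branch", 0), ("load", None)]
```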

    LOCATING DATA IN STORAGE
    13.
    Type: Invention publication

    Publication number: US20240248753A1

    Publication date: 2024-07-25

    Application number: US18099588

    Filing date: 2023-01-20

    Applicant: Arm Limited

    CPC classification number: G06F9/4881

    Abstract: A processor receives a task to be executed. The task comprises a task-based parameter used to determine the position, within an array of data descriptors, of the particular data descriptor for a particular portion of data to be processed in executing the task. Each of the data descriptors in the array has a predetermined size and is indicative of a location in a storage system of a respective portion of data. The processor derives, based on the task, array location data indicative of a location in the storage system of a predetermined data descriptor, and obtains the particular data descriptor based on the array location data and the task-based parameter. The processor then obtains the particular portion of data based on the particular data descriptor and processes it in executing the task.
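
    Because every descriptor has the same predetermined size, the lookup reduces to base-plus-offset arithmetic. The sketch below assumes the predetermined descriptor is the first one in the array, that the task-based parameter is a plain index, and that a descriptor is two 32-bit little-endian fields (address, length). None of these details come from the abstract; all names are hypothetical.

```python
import struct

DESCRIPTOR_SIZE = 8  # assumed: two 32-bit little-endian fields per descriptor


def read_descriptor(storage, array_base, task_param):
    """Return (address, length) from the descriptor selected by task_param."""
    # array_base locates the predetermined (here: first) descriptor; the
    # fixed descriptor size turns the index into a byte offset.
    offset = array_base + task_param * DESCRIPTOR_SIZE
    return struct.unpack_from("<II", storage, offset)


def fetch_portion(storage, array_base, task_param):
    """Follow the selected descriptor to the portion of data it describes."""
    address, length = read_descriptor(storage, array_base, task_param)
    return bytes(storage[address:address + length])


# Example storage: two descriptors at offset 16, data further along.
storage = bytearray(64)
struct.pack_into("<II", storage, 16, 40, 3)   # descriptor 0 -> 3 bytes at 40
struct.pack_into("<II", storage, 24, 48, 5)   # descriptor 1 -> 5 bytes at 48
storage[40:43] = b"abc"
storage[48:53] = b"hello"
```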

    APPARATUS AND METHOD FOR EXECUTING A PLURALITY OF THREADS
    14.
    Type: Invention application
    Status: Under examination (published)

    Publication number: US20160259668A1

    Publication date: 2016-09-08

    Application number: US15058389

    Filing date: 2016-03-02

    Applicant: ARM LIMITED

    CPC classification number: G06F9/3851 G06F9/46 G06F15/16 G06T1/20

    Abstract: An apparatus and method are provided for executing a plurality of threads. The apparatus has processing circuitry arranged to execute the plurality of threads, with each thread executing a program to perform processing operations on thread data. Each thread has a thread identifier, and the thread data includes a value which is dependent on the thread identifier. Value generator circuitry is provided to perform a computation using the thread identifier of a chosen thread to generate that value for the chosen thread, and to make the value available to the processing circuitry for use when executing the chosen thread. Such an arrangement can give rise to significant performance benefits when executing the plurality of threads on the apparatus.
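
    A small software analogy of the value generator follows. The abstract only says the value is computed from the thread identifier; the affine form base + thread_id * stride used here is an assumed example (typical of per-thread addressing), and the function names are hypothetical.

```python
def generate_thread_value(thread_id, base=0, stride=4):
    """Assumed computation: an affine function of the thread identifier."""
    return base + thread_id * stride


def run_threads(num_threads, program, base=0, stride=4):
    """Run the same program once per thread, supplying the generated value.

    The value is produced outside the program, so the program itself
    does not spend instructions deriving it from the thread identifier.
    """
    results = []
    for thread_id in range(num_threads):
        value = generate_thread_value(thread_id, base, stride)
        results.append(program(value))
    return results
```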

    ADAPTIVE PREFETCHING IN A DATA PROCESSING APPARATUS
    15.
    Type: Invention application
    Status: Under examination (published)

    Publication number: US20150134933A1

    Publication date: 2015-05-14

    Application number: US14080139

    Filing date: 2013-11-14

    Applicant: ARM Limited

    Abstract: A data processing apparatus and method of data processing are disclosed. An instruction execution unit executes a sequence of program instructions, wherein execution of at least some of the program instructions initiates memory access requests to retrieve data values from a memory. A prefetch unit prefetches data values from the memory for storage in a cache unit before they are requested by the instruction execution unit. The prefetch unit is configured to perform a miss response comprising increasing a number of the future data values which it prefetches, when a memory access request specifies a pending data value which is already subject to prefetching but is not yet stored in the cache unit. The prefetch unit is also configured, in response to an inhibition condition being met, to temporarily inhibit the miss response for an inhibition period.
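
    The miss response and its temporary inhibition can be modelled as a small state machine. This sketch makes several assumptions not stated in the abstract: the prefetch count grows by one per miss response, it is capped at a maximum, the inhibition period is counted in memory accesses, and the inhibition condition is signalled externally through a hypothetical inhibit() call.

```python
class AdaptivePrefetcher:
    """Toy model of a prefetch unit with an inhibitable miss response."""

    def __init__(self, distance=1, max_distance=8, inhibition_period=4):
        self.distance = distance                  # future values prefetched
        self.max_distance = max_distance          # assumed upper bound
        self.inhibition_period = inhibition_period
        self.inhibit_remaining = 0

    def inhibit(self):
        """Called when the (unspecified) inhibition condition is met."""
        self.inhibit_remaining = self.inhibition_period

    def on_access(self, targets_pending_line):
        """Handle one memory access and return the current prefetch distance.

        targets_pending_line is True when the requested value is already
        being prefetched but has not yet reached the cache.
        """
        inhibited = self.inhibit_remaining > 0
        if inhibited:
            self.inhibit_remaining -= 1
        if targets_pending_line and not inhibited:
            # Miss response: prefetch further ahead next time.
            self.distance = min(self.distance + 1, self.max_distance)
        return self.distance
```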

    DATA PROCESSING METHOD AND APPARATUS FOR PREFETCHING
    16.
    Type: Invention application
    Status: Granted

    Publication number: US20150121014A1

    Publication date: 2015-04-30

    Application number: US14061842

    Filing date: 2013-10-24

    Applicant: ARM LIMITED

    Abstract: A data processing device includes processing circuitry 20 for executing a first memory access instruction to a first address of a memory device 40 and a second memory access instruction to a second address of the memory device 40, the first address being different from the second address. The data processing device also includes prefetching circuitry 30 for prefetching data from the memory device 40 based on a stride length 70 and instruction analysis circuitry 50 for determining a difference between the first address and the second address. Stride refining circuitry 60 is also provided to refine the stride length based on factors of the stride length and factors of the difference calculated by the instruction analysis circuitry 50.
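
    One plausible reading of refining the stride "based on factors of the stride length and factors of the difference" is to take their greatest common factor. The abstract does not confirm this exact rule, so the sketch below is an assumption, and the function name is hypothetical.

```python
from math import gcd


def refine_stride(stride, first_address, second_address):
    """Refine a stride using the difference between two observed addresses.

    Assumption: the refined stride is the greatest common factor of the
    current stride and the address difference.
    """
    difference = abs(second_address - first_address)
    if difference == 0:
        # The abstract requires distinct addresses; keep the stride as-is.
        return stride
    return gcd(stride, difference)
```

    For example, a current stride of 12 and two accesses 8 bytes apart give a refined stride of 4, which divides both values.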

    DECODING A COMPLEX PROGRAM INSTRUCTION CORRESPONDING TO MULTIPLE MICRO-OPERATIONS
    17.
    Type: Invention application
    Status: Granted

    Publication number: US20150100763A1

    Publication date: 2015-04-09

    Application number: US14466183

    Filing date: 2014-08-22

    Applicant: ARM Limited

    Inventor: Rune HOLM

    Abstract: A data processing apparatus 2 has processing circuitry 4 which can process multiple parallel threads of processing. A shared instruction decoder 30 decodes program instructions to generate micro-operations to be processed by the processing circuitry 4. The instructions include at least one complex instruction which has multiple micro-operations. Multiple fetch units 8 are provided for fetching the micro-operations generated by the decoder 30 for processing by the processing circuitry 4. Each fetch unit 8 is associated with at least one of the threads. The decoder 30 generates the micro-operations of a complex instruction individually in response to separate decode requests 24 triggered by a fetch unit 8, each decode request 24 identifying which micro-operation of the complex instruction is to be generated by the decoder 30 in response to the decode request 24.
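
    The per-request decoding scheme can be illustrated with a stateless shared decoder that returns one micro-operation per request, while each fetch unit tracks its own position. The instruction names, the micro-operation table and the class layout below are hypothetical and serve only to show the request/response pattern.

```python
# Hypothetical instruction set: one complex and one simple instruction.
MICRO_OPS = {
    "LOAD_MULTIPLE": ["load r0", "load r1", "load r2"],
    "ADD": ["add"],
}


def decode(program, pc, uop_index):
    """Shared decoder: produce only the requested micro-operation.

    Returns the micro-operation and whether it is the last one for the
    instruction at pc. The decoder keeps no per-thread state.
    """
    uops = MICRO_OPS[program[pc]]
    return uops[uop_index], uop_index == len(uops) - 1


class FetchUnit:
    """Tracks which micro-operation to request next for its thread(s)."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.uop_index = 0

    def fetch(self):
        # Each decode request identifies the instruction and the micro-op.
        uop, is_last = decode(self.program, self.pc, self.uop_index)
        if is_last:
            self.pc += 1
            self.uop_index = 0
        else:
            self.uop_index += 1
        return uop
```

    Because the decoder holds no state between requests, two fetch units can interleave requests for the same complex instruction without disturbing each other.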
