Automatic tool to eliminate conflict cache misses
    1.
    Granted invention patent
    Status: In force

    Publication number: US07805708B2

    Publication date: 2010-09-28

    Application number: US11382813

    Filing date: 2006-05-11

    IPC classification: G06F9/44 G06F9/445

    Abstract: This invention simulates a program to create a conflict graph of its cache accesses. The conflict graph is used to re-lay out relocatable functions so as to minimize cache conflict misses, which occur when conflicting functions map to the same portion of the cache. The conflict graph includes a vertex for each function and an edge between each pair of conflicting functions, weighted by the amount of conflict. This conflict graph enables a layout of functions that minimizes the number of conflicting items mapped to the same location in the cache, weighted by the degree of conflict encoded by the edges in the graph.
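
As a rough illustration of the idea in this abstract, the Python sketch below scores a layout by summing, over every conflict-graph edge, the edge weight times the number of cache lines the two functions share, and then places functions greedily. The direct-mapped cache model, the greedy ordering, and the names `cache_lines`, `layout_cost`, and `greedy_layout` are assumptions made for illustration; the abstract does not spell out the patent's actual placement procedure.

```python
from itertools import combinations


def cache_lines(start, size, num_lines):
    """Cache line indices covered by a function that starts at cache line
    `start` and spans `size` lines (assumed direct-mapped cache)."""
    return {(start + i) % num_lines for i in range(min(size, num_lines))}


def layout_cost(placement, sizes, weights, num_lines):
    """Sum of (edge weight x number of shared cache lines) over all pairs."""
    cost = 0
    for a, b in combinations(sorted(placement), 2):
        w = weights.get((a, b), 0) + weights.get((b, a), 0)
        if w:
            shared = (cache_lines(placement[a], sizes[a], num_lines)
                      & cache_lines(placement[b], sizes[b], num_lines))
            cost += w * len(shared)
    return cost


def greedy_layout(sizes, weights, num_lines):
    """Place the most heavily conflicting functions first, giving each the
    start line that adds the least weighted overlap (a heuristic only)."""
    def degree(f):
        return sum(w for (a, b), w in weights.items() if f in (a, b))

    placement = {}
    for f in sorted(sizes, key=lambda f: (-degree(f), f)):
        placement[f] = min(
            range(num_lines),
            key=lambda s: layout_cost({**placement, f: s}, sizes, weights, num_lines),
        )
    return placement
```

The start values here are cache-line offsets rather than link addresses; a real tool would still have to translate them into addresses that a linker can honor.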

    Method and apparatus for splitting packets in multithreaded VLIW processor
    2.
    Granted invention patent
    Status: In force

    Publication number: US07096343B1

    Publication date: 2006-08-22

    Application number: US09538755

    Filing date: 2000-03-30

    IPC classification: G06F9/50

    Abstract: A method and apparatus are disclosed for allocating functional units in a multithreaded very long instruction word (VLIW) processor. The present invention combines the techniques of conventional VLIW architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes instruction packet splitting to recover some of the efficiency lost with conventional multithreaded architectures. Instruction packet splitting allows an instruction bundle to be partially issued in one cycle, with the remainder of the bundle issued during a subsequent cycle. The allocation hardware assigns as many instructions from each packet as will fit on the available functional units, rather than allocating all instructions in an instruction packet at one time. Those instructions that cannot be allocated to a functional unit are retained in a ready-to-run register. On subsequent cycles, instruction packets in which all instructions have been issued to functional units are updated from their thread's instruction stream, while instruction packets with instructions that have been held are retained. The functional unit allocation logic can then assign instructions from the newly loaded instruction packets as well as instructions that were not issued from the retained instruction packets.
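
The issue policy described in this abstract can be sketched as a small cycle-by-cycle simulation. The sketch below is only a model under simplifying assumptions: all functional units are treated as identical, threads are served in list order, packets are non-empty, and the function name `simulate_split_issue` is hypothetical.

```python
from collections import deque


def simulate_split_issue(threads, num_units):
    """threads: one instruction stream per thread, each a list of non-empty
    packets (lists of instruction labels). Returns, per cycle, the list of
    (thread_id, instruction) pairs issued in that cycle."""
    streams = [deque(list(p) for p in t) for t in threads]
    # One ready-to-run register per thread, holding the current packet.
    ready = [s.popleft() if s else [] for s in streams]
    schedule = []
    while any(ready):
        free = num_units
        issued = []
        for tid, packet in enumerate(ready):
            take = min(free, len(packet))   # issue as many as fit
            issued += [(tid, ins) for ins in packet[:take]]
            ready[tid] = packet[take:]      # held instructions are retained
            free -= take
        # Packets that were fully issued are refilled from the thread's stream.
        for tid, packet in enumerate(ready):
            if not packet and streams[tid]:
                ready[tid] = streams[tid].popleft()
        schedule.append(issued)
    return schedule
```

With two threads holding packets of three and two instructions and four functional units, the second packet is split: one of its instructions issues alongside the first packet and the other issues in the following cycle.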

    Method and apparatus for identifying splittable packets in a multithreaded VLIW processor
    3.
    Granted invention patent
    Status: In force

    Publication number: US06658551B1

    Publication date: 2003-12-02

    Application number: US09538757

    Filing date: 2000-03-30

    IPC classification: G06F9/00

    Abstract: A method and apparatus are disclosed for allocating functional units in a multithreaded very long instruction word (VLIW) processor. The present invention combines the techniques of conventional VLIW architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes instruction packet splitting to recover some of the efficiency lost with conventional multithreaded architectures. Instruction packet splitting allows an instruction bundle to be partially issued in one cycle, with the remainder of the bundle issued during a subsequent cycle. There are times, however, when instruction packets cannot be split without violating the semantics of the instruction packet assembled by the compiler. A packet split identification bit is disclosed that allows hardware to efficiently determine when it is permissible to split an instruction packet. The split bit informs the hardware when splitting is prohibited. The allocation hardware assigns as many instructions from each packet as will fit on the available functional units, rather than allocating all instructions in an instruction packet at one time, provided the split bit has not been set. Those instructions that cannot be allocated to a functional unit are retained in a ready-to-run register. On subsequent cycles, instruction packets in which all instructions have been issued to functional units are updated from their thread's instruction stream, while instruction packets with instructions that have been held are retained. The functional unit allocation logic can then assign instructions from the newly loaded instruction packets as well as instructions that were not issued from the retained instruction packets.
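
A minimal sketch of how a split bit could gate the allocation step for a single cycle follows. It assumes identical functional units and list-order thread priority; the name `allocate_cycle` and the `(instructions, split_prohibited)` packet representation are illustrative choices, not taken from the patent.

```python
def allocate_cycle(ready, num_units):
    """ready: one (instructions, split_prohibited) pair per thread.
    Returns (issued, retained) for a single cycle."""
    free = num_units
    issued, retained = [], []
    for instrs, split_prohibited in ready:
        if split_prohibited:
            # Split bit set: the packet issues whole or not at all.
            take = len(instrs) if len(instrs) <= free else 0
        else:
            # Split bit clear: issue as many instructions as fit.
            take = min(free, len(instrs))
        issued.extend(instrs[:take])
        retained.append((instrs[take:], split_prohibited))
        free -= take
    return issued, retained
```

A packet whose split bit is set is simply skipped when it does not fit, which leaves the remaining units available to later packets that may be split.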

    Visual program memory hierarchy optimization
    4.
    Granted invention patent
    Status: In force

    Publication number: US06947052B2

    Publication date: 2005-09-20

    Application number: US10191175

    Filing date: 2002-07-09

    Abstract: In general, and in a form of the present invention, a method is provided for reducing the execution time of a program executed on a digital system by improving the hit rate in a cache of the digital system. This is done by determining cache performance during execution of the program over a period of time as a function of address locality, and then identifying occurrences of cache conflict between two program modules. One of the conflicting program modules is then relocated so that the cache conflict is eliminated or at least reduced. In one embodiment of the invention, a 2D plot of cache operation is provided as a function of address versus time for the period of time. A set of cache misses having temporal locality and spatial locality is identified as a horizontally arranged grouping of pixels at a certain address locality having a selected color indicative of a cache miss. Cache conflict is determined by overlaying an interference grid or a shadow grid on the plot, responsive to the address locality, such that a plurality of lines are displayed at other address localities that map to the same region in cache as the first address locality. In order to relocate a program module, a relocation parameter is provided to a linker to cause the program module to be linked at a different address.
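
The interference grid mentioned in this abstract marks the other addresses that map to the same cache region as a chosen address. The sketch below computes such addresses under an assumed set-indexed cache geometry; `line_size`, `num_sets`, and both function names are hypothetical parameters introduced for illustration.

```python
def cache_set(addr, line_size, num_sets):
    """Set index of an address in a cache indexed by line address."""
    return (addr // line_size) % num_sets


def interference_grid(addr, line_size, num_sets, lo, hi):
    """Line-aligned addresses in [lo, hi) that map to the same set as `addr`,
    excluding the line that contains `addr` itself."""
    stride = line_size * num_sets          # distance between aliasing lines
    base = addr - addr % line_size         # start of the line holding addr
    first = base - ((base - lo) // stride) * stride  # first alias >= lo
    return [a for a in range(first, hi, stride) if a != base]
```

A module whose code falls on one of the returned addresses competes for the same cache lines as the module at `addr`, so relocating it away from those addresses removes that particular conflict.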

    Method and apparatus for allocating functional units in a multithreaded VLIW processor
    5.
    Granted invention patent
    Status: Lapsed

    Publication number: US07007153B1

    Publication date: 2006-02-28

    Application number: US09538670

    Filing date: 2000-03-30

    IPC classification: G06F15/00 G06F15/76

    CPC classification: G06F9/3851 G06F9/3853

    Abstract: A method and apparatus are disclosed for allocating functional units in a multithreaded very long instruction word (VLIW) processor. The present invention combines the techniques of conventional VLIW architectures and conventional multithreaded architectures to reduce execution time within an individual program, as well as across a workload. The present invention utilizes a compiler to detect parallelism. The disclosed multithreaded VLIW architecture exploits program parallelism by issuing multiple instructions from a single program sequencer, in a manner similar to single-threaded VLIW processors, and also supports multiple program sequencers, as in simultaneous multithreading. Instructions are allocated to functional units so that multiple VLIW instructions are issued to multiple functional units in the same cycle. The allocation mechanism of the present invention occupies a pipeline stage just before arguments are dispatched to functional units. The allocate stage determines how to group the instructions together to maximize efficiency, by selecting appropriate instructions and assigning the instructions to the functional units (FUs). The criteria for selection are thread priority, resource availability, or both. Under the thread priority criterion, different threads can have different priorities. The allocate stage selects and forwards for execution the packets (or instructions from packets) belonging to the thread with the highest priority, according to the priority policy implemented. Under the resource availability criterion, a packet (having up to K instructions) can be allocated only if the resources (functional units) required by the packet are available for the next cycle. Functional units report their availability to the allocate stage.
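
The two selection criteria in this abstract, thread priority and resource availability, can be combined in a short sketch. The version below is one possible reading, assuming whole-packet allocation, a numeric priority where larger means more urgent, and functional units grouped by type; the name `allocate_packets` and the tuple layout are illustrative.

```python
from collections import Counter


def allocate_packets(packets, available):
    """packets: list of (priority, thread_id, [unit types needed]).
    available: dict mapping unit type to the number of free units.
    Returns the thread ids whose packets are granted for the next cycle."""
    free = Counter(available)
    granted = []
    # Thread priority criterion: consider higher-priority threads first.
    for priority, tid, needs in sorted(packets, key=lambda p: -p[0]):
        need = Counter(needs)
        # Resource availability criterion: every required unit must be free.
        if all(free[unit] >= n for unit, n in need.items()):
            free -= need
            granted.append(tid)
    return granted
```

A lower-priority packet can still be granted in the same cycle when a higher-priority one is blocked on a unit type it does not need.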

    Method and apparatus for releasing functional units in a multithreaded VLIW processor
    6.
    Granted invention patent
    Status: In force

    Publication number: US06665791B1

    Publication date: 2003-12-16

    Application number: US09538669

    Filing date: 2000-03-30

    IPC classification: G06F15/163

    CPC classification: G06F9/3851 G06F9/3853

    Abstract: A method and apparatus are disclosed for releasing functional units in a multithreaded very long instruction word (VLIW) processor. The functional unit release mechanism can recover the capacity lost due to multiple-cycle instructions. The functional unit release mechanism of the present invention permits idle functional units to be reallocated to other threads, thereby improving workload efficiency. Instruction packets are assigned to functional units, which can maintain their state independently of the issue logic. Each functional unit has an associated state machine (SM) that keeps track of the number of cycles for which the functional unit will be occupied by a multiple-cycle instruction. A functional unit does not make itself available for reassignment as long as it is busy. When the instruction is complete, the functional unit can participate in functional unit allocation, even if other functional units assigned to the same thread are still busy. The functional unit release approach of the present invention allows the functional units that are not associated with a multiple-cycle instruction to be allocated to other threads while the blocked thread is waiting, thereby improving throughput of the multithreaded VLIW processor. Since the state is associated with each functional unit separately from the instruction issue unit, the functional units can be assigned to threads independently of the state of any one thread and its constituent instructions.
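
The per-unit state machine described in this abstract can be modeled as a countdown of remaining busy cycles. The class below is a minimal sketch under that assumption; the class name `FunctionalUnit` and its methods are hypothetical and stand in for hardware state.

```python
class FunctionalUnit:
    """One functional unit with its own busy-cycle counter."""

    def __init__(self, name):
        self.name = name
        self.busy_cycles = 0   # cycles left on the current instruction
        self.thread = None     # thread that currently owns the unit

    @property
    def is_free(self):
        return self.busy_cycles == 0

    def assign(self, thread, latency):
        """Give the unit to `thread` for an instruction of `latency` cycles."""
        if not self.is_free:
            raise RuntimeError(f"{self.name} is still busy")
        self.thread = thread
        self.busy_cycles = latency

    def tick(self):
        """Advance one cycle; release the unit when its instruction is done."""
        if self.busy_cycles > 0:
            self.busy_cycles -= 1
            if self.busy_cycles == 0:
                self.thread = None
```

Because each unit tracks its own state, a unit that finishes a one-cycle instruction can be handed to another thread while a sibling unit of the original thread is still working through a multiple-cycle instruction.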

    Two step thread creation with register renaming
    7.
    Granted invention patent
    Status: In force

    Publication number: US06286027B1

    Publication date: 2001-09-04

    Application number: US09201034

    Filing date: 1998-11-30

    IPC classification: G06F9/00

    Abstract: An apparatus and method in digital processing provide a simple and efficient way of communicating parameters from a parent thread to a child thread using two-step thread creation. The method comprises the steps of: allocating a hardware context for the child thread; enabling the parent thread to execute other instructions, wherein parent thread register writes update both the parent and child architectural registers; and spawning the child thread. In essence, the parent thread sends parameters to the child by writing to the parent's registers prior to spawning the child thread.
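
The three steps listed in this abstract can be modeled in software as follows. This is a sketch only: register files are plain lists, the child context is assumed to start zeroed, and the class name `TwoStepSpawner` and its methods are invented for illustration rather than taken from the patent.

```python
class TwoStepSpawner:
    """Models a parent thread that passes parameters through mirrored
    register writes between context allocation and spawn."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self._num_regs = num_regs
        self.child_regs = None             # no child context allocated yet

    def allocate_child(self):
        """Step 1: allocate a hardware context for the child
        (assumed here to start with all registers zero)."""
        self.child_regs = [0] * self._num_regs

    def write(self, reg, value):
        """Step 2: a parent register write also updates the child's
        register while a child context is allocated but not yet spawned."""
        self.regs[reg] = value
        if self.child_regs is not None:
            self.child_regs[reg] = value

    def spawn(self):
        """Step 3: start the child; later parent writes are not mirrored."""
        child, self.child_regs = self.child_regs, None
        return child
```

In this model, any value the parent writes between `allocate_child` and `spawn` reaches the child as a parameter without a separate copy step.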

    History-based prefetch cache including a time queue
    8.
    Granted invention patent
    Status: Lapsed

    Publication number: US5778435A

    Publication date: 1998-07-07

    Application number: US655590

    Filing date: 1996-05-30

    IPC classification: G06F9/38 G06F12/08 G06F9/00

    Abstract: A history-based prefetch cache that includes a time queue is disclosed. The time queue correlates past events with cache misses in a microprocessor. The time queue is set to N cycles, N being a predetermined, arbitrary, or programmable amount. The prefetch cache is a prefetch target buffer which receives inputs from a time queue and a cache and determines if an event is present in the cache. If an address is not present in the cache, it is prefetched based on past events and inserted into the prefetch target buffer so that the microprocessor will not miss on it the next time.
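
One loose reading of this abstract is that a miss is associated with the event seen N steps earlier, so that the address can be prefetched when that event recurs. The sketch below models that reading only; the event representation, the table layout, and the class name `HistoryPrefetcher` are assumptions, and the patent's hardware may work differently.

```python
from collections import deque


class HistoryPrefetcher:
    """Toy model: remember which event preceded each miss by n steps."""

    def __init__(self, n):
        self.queue = deque(maxlen=n)   # time queue of the last n events
        self.history = {}              # event -> address that missed n steps later
        self.prefetch_buffer = set()   # stands in for the prefetch target buffer

    def step(self, event, miss_addr=None):
        """Process one event; `miss_addr` is the address that missed in the
        cache at this step, if any. Returns True when the prefetch buffer
        already holds that address."""
        if event in self.history:
            # A previously seen event recurs: prefetch its associated address.
            self.prefetch_buffer.add(self.history[event])
        covered = miss_addr is not None and miss_addr in self.prefetch_buffer
        if miss_addr is not None and not covered and len(self.queue) == self.queue.maxlen:
            # Correlate this miss with the event observed n steps ago.
            self.history[self.queue[0]] = miss_addr
        self.queue.append(event)
        return covered
```

On a repeated event sequence, the first pass records the correlation and the second pass finds the missed address already waiting in the buffer.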