Abstract:
Methods, systems, and apparatuses are disclosed for implementing fast large-integer arithmetic within an integrated circuit, such as on IA (Intel Architecture) processors, in which such means include receiving a 512-bit value for squaring, the 512-bit value having eight sub-elements each of 64 bits, and performing a 512-bit squaring algorithm by: (i) multiplying every one of the eight sub-elements by itself to yield a square of each of the eight sub-elements, the eight squared sub-elements collectively identified as T1, (ii) multiplying every one of the eight sub-elements by the other remaining seven of the eight sub-elements to yield an asymmetric intermediate result having seven diagonals therein, wherein each of the seven diagonals is of a different length, (iii) reorganizing the asymmetric intermediate result having the seven diagonals therein into a symmetric intermediate result having four diagonals each of 7×1 sub-elements of 64 bits in length arranged across a plurality of columns, (iv) adding all sub-elements within their respective columns, the added sub-elements collectively identified as T2, and (v) yielding a final 512-bit squared result of the 512-bit value by adding the value of T2 twice and the value of T1 once. Other related embodiments are disclosed.
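As a minimal sketch of the T1/T2 decomposition described above, the Python below squares a 512-bit value given as eight 64-bit sub-elements; the limb ordering, helper names, and the use of Python's arbitrary-precision integers in place of hardware carry chains are illustrative assumptions, not the patented implementation.

MASK64 = (1 << 64) - 1

def square_512(limbs):
    """Square a 512-bit value given as eight 64-bit limbs, least significant first.

    T1 collects the limb self-products; T2 collects the cross products.
    The result is T1 + 2*T2, mirroring step (v) of the abstract.
    """
    assert len(limbs) == 8 and all(0 <= x <= MASK64 for x in limbs)
    # T1: each sub-element multiplied by itself, placed at its doubled position.
    t1 = sum(limbs[i] * limbs[i] << (128 * i) for i in range(8))
    # T2: each sub-element multiplied by every higher-indexed sub-element.
    t2 = sum(limbs[i] * limbs[j] << (64 * (i + j))
             for i in range(8) for j in range(i + 1, 8))
    return t1 + 2 * t2  # the 1024-bit square

# Usage: compare against directly squaring the assembled 512-bit integer.
limbs = [(0x0123456789ABCDEF * (i + 1)) & MASK64 for i in range(8)]
value = sum(limb << (64 * i) for i, limb in enumerate(limbs))
assert square_512(limbs) == value * value

The final assertion checks that T1 plus twice T2 equals the direct square of the assembled 512-bit value, which is the identity that step (v) of the abstract relies on.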
Abstract:
An assembler, which can form part of a development/debug system, supports pseudo instructions that enable it to resolve return address ambiguities.
Abstract:
Provided are a method, system, and program for optimizing code. A program comprising a plurality of instructions, including at least one no-operation (NOP) instruction, is accessed. At least one NOP instruction is removed from the program when it is not needed to provide a processing delay that ensures data is available to at least one dependent instruction accessing the data.
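A toy software rendering of such a NOP-elimination pass is sketched below in Python; the instruction model, the latency table, and the hazard test are assumptions made for illustration and are not taken from the disclosed program.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Instr:
    op: str                                   # e.g. "load", "add", "nop"
    writes: Set[str] = field(default_factory=set)
    reads: Set[str] = field(default_factory=set)

# Assumed latency table: cycles before an instruction's result is usable.
LATENCY = {"load": 2, "add": 1, "nop": 1}

def remove_redundant_nops(prog: List[Instr]) -> List[Instr]:
    """Drop NOPs that are not needed to delay a dependent instruction."""
    out: List[Instr] = []
    for idx, ins in enumerate(prog):
        if ins.op != "nop":
            out.append(ins)
            continue
        # The next real instruction that would move up if this NOP is removed.
        nxt = next((p for p in prog[idx + 1:] if p.op != "nop"), None)
        needed = False
        if nxt is not None:
            # Keep the NOP if removal would place a consumer closer to its
            # producer than the producer's latency allows.
            for back, prev in enumerate(reversed(out), start=1):
                if prev.writes & nxt.reads and back < LATENCY.get(prev.op, 1):
                    needed = True
                    break
        if needed:
            out.append(ins)
    return out

# Usage: the NOP covering the load latency survives; the trailing NOP is removed.
prog = [Instr("load", writes={"r1"}), Instr("nop"),
        Instr("add", reads={"r1"}, writes={"r2"}), Instr("nop")]
print([i.op for i in remove_redundant_nops(prog)])   # ['load', 'nop', 'add']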
Abstract:
Apparatus and method for dictionary accelerator compression. For example, one embodiment of an apparatus comprises: a plurality of cores; a compression/decompression accelerator coupled to or integral to one or more of the plurality of cores, the compression/decompression accelerator to perform decompression and compression operations in response to read and write operations, respectively, wherein responsive to notification of a compression job to compress a memory page or a portion thereof, a history buffer associated with the compression/decompression accelerator is to be initialized with pre-configured dictionary data, the compression/decompression accelerator to match portions of the pre-configured dictionary data with portions of the memory page to generate compressed output data.
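The general effect of seeding a compressor's history buffer with pre-configured dictionary data can be illustrated in software with zlib's preset-dictionary support; the dictionary contents and page data below are made up, and this sketch stands in for, rather than reproduces, the described hardware accelerator.

import zlib

# Hypothetical pre-configured dictionary: byte patterns expected to recur in
# the memory pages being compressed (contents are illustrative only).
PRESET_DICT = b"\x00" * 64 + b"struct page_header;" + b"\xff" * 32

def compress_page(page: bytes) -> bytes:
    # Seed the compressor's history with the dictionary before compressing,
    # so matches may reference dictionary bytes as well as page bytes.
    c = zlib.compressobj(level=6, zdict=PRESET_DICT)
    return c.compress(page) + c.flush()

def decompress_page(blob: bytes) -> bytes:
    # The decompressor must be initialized with the same dictionary.
    d = zlib.decompressobj(zdict=PRESET_DICT)
    return d.decompress(blob) + d.flush()

# Usage: a page sharing content with the dictionary compresses noticeably smaller.
page = b"\x00" * 64 + b"struct page_header;" + b"payload " * 100
blob = compress_page(page)
assert decompress_page(blob) == page
print(len(page), "->", len(blob), "bytes")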
Abstract:
An assembler, which can be provided as part of a debugger and/or development system, automatically avoids register allocation errors, such as register bank conflicts and/or insufficient physical registers.
Abstract:
A method and system to provide virtual resource usage information for assembler programs. In one embodiment, a graphical user interface displays virtual resource usage for portions of an assembler program.
Abstract:
A processor is described having an instruction execution pipeline having a functional unit to execute an instruction that compares vector elements against an input value. Each of the vector elements and the input value have a first respective section identifying a location within data and a second respective section having a byte sequence of the data. The functional unit has comparison circuitry to compare respective byte sequences of the input vector elements against the input value's byte sequence to identify a number of matching bytes for each comparison. The functional unit also has difference circuitry to determine respective distances between the input vector's elements' byte sequences and the input value's byte sequence within the data.
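One way to picture the per-element semantics is the software sketch below; the (offset, byte-sequence) element encoding, the prefix-match rule, and the distance definition are assumptions chosen for illustration, not the instruction's specified behavior.

from typing import List, NamedTuple, Tuple

class Element(NamedTuple):
    offset: int      # first section: a location within the data
    seq: bytes       # second section: a byte sequence of the data

def matching_prefix_len(a: bytes, b: bytes) -> int:
    """Count leading bytes that match between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compare_elements(vector: List[Element], value: Element) -> List[Tuple[int, int]]:
    """For each vector element, return (matching byte count, distance to the
    input value's location), echoing the comparison and difference circuitry."""
    results = []
    for elem in vector:
        matches = matching_prefix_len(elem.seq, value.seq)
        distance = value.offset - elem.offset   # assumed distance definition
        results.append((matches, distance))
    return results

# Usage with a small, made-up data window.
data = b"abcabcabxabc"
vector = [Element(0, data[0:4]), Element(3, data[3:7])]
value = Element(9, data[9:12])
print(compare_elements(vector, value))   # [(3, 9), (3, 6)]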
Abstract:
Systems and methods are disclosed for supporting virtual microengines in a multithreaded processor, such as a microengine running on a network processor. In one embodiment, code is written for execution by a plurality of virtual microengines. The code is then compiled and linked for execution on a physical microengine, at which time the physical microengine's threads are assigned to thread groups corresponding to the virtual microengines. Internal next neighbor rings are allocated within the physical microengine to facilitate communication between the thread groups. The code can then be loaded onto the physical microengine and executed, with each thread group executing the code written for its corresponding virtual microengine.
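A rough bookkeeping model of the thread-group assignment and next-neighbor-ring allocation might look like the following Python; the class names, thread counts, and ring depth are illustrative assumptions rather than actual microengine tooling.

from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VirtualMicroengine:
    name: str
    num_threads: int            # threads the virtual microengine's code expects

@dataclass
class PhysicalMicroengine:
    total_threads: int = 8
    thread_groups: Dict[str, List[int]] = field(default_factory=dict)
    # Internal next-neighbor rings between adjacent thread groups.
    nn_rings: Dict[Tuple[str, str], deque] = field(default_factory=dict)

    def assign(self, vmes: List[VirtualMicroengine]) -> None:
        """Assign physical threads to one thread group per virtual microengine
        and allocate a ring between each pair of adjacent groups."""
        if sum(v.num_threads for v in vmes) > self.total_threads:
            raise ValueError("virtual microengines require more threads than exist")
        next_tid = 0
        for vme in vmes:
            self.thread_groups[vme.name] = list(
                range(next_tid, next_tid + vme.num_threads))
            next_tid += vme.num_threads
        for a, b in zip(vmes, vmes[1:]):
            self.nn_rings[(a.name, b.name)] = deque(maxlen=16)  # assumed depth

# Usage: two virtual microengines mapped onto one eight-thread physical microengine.
pme = PhysicalMicroengine()
pme.assign([VirtualMicroengine("rx", 4), VirtualMicroengine("tx", 4)])
print(pme.thread_groups)     # {'rx': [0, 1, 2, 3], 'tx': [4, 5, 6, 7]}
print(list(pme.nn_rings))    # [('rx', 'tx')]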
Abstract:
Methods, software and systems to determine channel ownership and physical block location within the channel in non-uniformly distributed DRAM configurations, and also to detect in-range memory address matches, are presented. A first method, which may also be implemented in software and/or hardware, allocates memory non-uniformly between a number of memory channels, determines a selected memory channel from the memory channels for a program address, and maps the program address to a physical address within the selected memory channel. A second method, which may also be implemented in software and/or hardware, designates a range of memory to perform address matching, monitors memory accesses, and, when a memory access occurs within the designated range, performs a particular function.
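A simplified software model of both methods, non-uniform channel mapping and in-range address matching, is sketched below in Python; the channel sizes, the back-to-back layout policy, and the callback are illustrative assumptions rather than the disclosed implementation.

from bisect import bisect_right
from typing import Callable, List, Tuple

# Illustrative non-uniform channel sizes in bytes (2 GiB, 1 GiB, 4 GiB).
CHANNEL_SIZES: List[int] = [2 << 30, 1 << 30, 4 << 30]
CHANNEL_BASES: List[int] = [0]
for size in CHANNEL_SIZES[:-1]:
    CHANNEL_BASES.append(CHANNEL_BASES[-1] + size)

def map_address(program_addr: int) -> Tuple[int, int]:
    """Return (channel index, physical offset within that channel), assuming
    the channels are laid out back-to-back in program-address order."""
    channel = bisect_right(CHANNEL_BASES, program_addr) - 1
    offset = program_addr - CHANNEL_BASES[channel]
    if channel < 0 or offset >= CHANNEL_SIZES[channel]:
        raise ValueError("address outside installed memory")
    return channel, offset

class RangeMatcher:
    """Monitors memory accesses and performs a function for in-range hits."""
    def __init__(self, lo: int, hi: int, on_match: Callable[[int], None]):
        self.lo, self.hi, self.on_match = lo, hi, on_match

    def access(self, addr: int) -> None:
        if self.lo <= addr < self.hi:
            self.on_match(addr)

# Usage: map an address, then watch a 4 KiB window for accesses.
print(map_address((2 << 30) + 0x1000))               # (1, 4096)
watcher = RangeMatcher(0x4000, 0x5000, lambda a: print(f"hit {a:#x}"))
for addr in (0x3FF0, 0x4000, 0x4FFF, 0x5000):
    watcher.access(addr)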