Abstract:
Systolic array-based systems and methods for performing block matching in motion compensation. A target pixel block is loaded into a systolic array. A matching-sized block of a reference search space is loaded into the array, row by row. A sum of absolute differences (SOAD) is computed for each row and stored. After each row has been loaded, the reference space is incremented to the next column. After the entire reference space has been searched, the displacement to the reference block with the smallest SOAD is taken as the motion vector for the target pixel block.
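The search described above can be illustrated in software. The following is a minimal sketch only, not the systolic hardware: it assumes blocks are plain lists of pixel rows, and the names row_sad, block_sad and best_motion_vector are hypothetical. The returned vector is the (column, row) offset of the best block inside the search space, which is one possible convention.

```python
def row_sad(row_a, row_b):
    """Sum of absolute differences between two pixel rows."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))


def block_sad(target, candidate):
    """Compute one SAD per row, keep them, then add them up."""
    row_sums = [row_sad(t, c) for t, c in zip(target, candidate)]
    return sum(row_sums)


def best_motion_vector(target, reference):
    """Exhaustive search of the reference space, column by column.

    Returns ((x, y), sad) for the candidate block with the smallest SAD,
    where (x, y) is the block's offset inside the search space (assumed
    convention). Ties keep the first candidate found.
    """
    n_rows, n_cols = len(target), len(target[0])
    best = None
    for x in range(len(reference[0]) - n_cols + 1):
        for y in range(len(reference) - n_rows + 1):
            candidate = [row[x:x + n_cols] for row in reference[y:y + n_rows]]
            cost = block_sad(target, candidate)
            if best is None or cost < best[1]:
                best = ((x, y), cost)
    return best


if __name__ == "__main__":
    reference = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 5, 6, 0],
        [0, 7, 8, 0],
    ]
    target = [[5, 6], [7, 8]]
    print(best_motion_vector(target, reference))  # ((1, 2), 0)
```

A systolic array would evaluate the row sums in parallel as rows stream in; the nested loops here only reproduce the order of the search.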
Abstract:
A parameterizable clip instruction for a SIMD microprocessor architecture and a method of performing a clip operation using the same. A single instruction is provided with three input operands: a destination address, a source address and a controlling parameter. The controlling parameter includes a range type and a range specifier. The range type is a multi-bit integer in the operand that is used to index a table of range types. The range specifier plugs into the range type to define a range. The data input at the source address is clipped according to the controlling parameter. The instruction is particularly suited to video encoding/decoding applications in which the result of interpolations or other calculations may lie outside the permitted range, so that the final result has to be clipped to a saturation value, for example the maximum pixel value. Signed and unsigned clipping ranges may be used, and they are not restricted to powers of two.
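A small software model may make the range type / range specifier split easier to follow. This is a sketch under stated assumptions: the four entries of RANGE_TYPES are invented for illustration (the abstract does not list the actual table), and clip is a hypothetical name that operates on a Python list instead of a SIMD register.

```python
# Hypothetical table of range types. Each entry turns a range specifier n
# into a (low, high) pair. The real table is not given in the abstract.
RANGE_TYPES = (
    lambda n: (0, (1 << n) - 1),                      # unsigned, n bits
    lambda n: (-(1 << (n - 1)), (1 << (n - 1)) - 1),  # signed, n bits
    lambda n: (0, n),                                 # unsigned, arbitrary maximum
    lambda n: (-n, n),                                # signed, symmetric
)


def clip(src, range_type, range_spec):
    """Clip every element of src to the range selected by the parameter.

    range_type indexes RANGE_TYPES; range_spec is plugged into the
    selected entry to produce the actual bounds.
    """
    low, high = RANGE_TYPES[range_type](range_spec)
    return [min(max(value, low), high) for value in src]


if __name__ == "__main__":
    print(clip([-5, 100, 300], 0, 8))    # [0, 100, 255]
    print(clip([-7, 3, 250], 2, 235))    # [0, 3, 235]
```

The last two table entries show ranges that are not powers of two, such as a maximum of 235.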
Abstract:
Techniques are disclosed for handling control transfer instructions in pipelined processors. Such instructions may cause the sequence of subsequent instructions to change, and thus may require subsequent instructions to be deleted from the processor's pipeline. Pre-decode means (110) are provided for at least partially decoding control transfer instructions early in the pipeline. Subsequent instructions can then be prevented from progressing through the pipeline. The mechanism required to delete unwanted instructions is thereby simplified.
Abstract:
A 2N-bit right-only barrel shifter for a microprocessor, comprising upper and lower N-bit shifter portions. An N-bit input is placed in the upper portion. An X-bit right shift of the N-bit number yields the right-shifted result in the N-bit upper portion and the result of an (N-X)-bit left shift in the lower portion. The N-bit shifter comprises a log2(N)-stage multiplexer in which each successive stage adds 2^x additional bits, where x increments from 0 to (log2(N)-1).
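The relationship between the upper and lower halves can be checked with a short model. This is a sketch, assuming N is a power of two and the shift amount X is in the range 0 to N-1; barrel_shift_right is a hypothetical name, and each loop iteration stands in for one multiplexer stage that either passes the word through or shifts it right by 2^x.

```python
def barrel_shift_right(value, shift, n=8):
    """Model a 2N-bit right-only shifter built from log2(N) stages.

    Returns (upper, lower): upper is value >> shift, lower holds the bits
    that fell out of the upper half, i.e. value << (n - shift) truncated
    to n bits.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if not 0 <= shift < n:
        raise ValueError("shift must be in the range 0..n-1")
    mask = (1 << n) - 1
    word = (value & mask) << n          # N-bit input placed in the upper half
    stage = 0
    while (1 << stage) < n:             # log2(n) stages in total
        if (shift >> stage) & 1:        # stage x shifts by 2**x when selected
            word >>= 1 << stage
        stage += 1
    return word >> n, word & mask


if __name__ == "__main__":
    upper, lower = barrel_shift_right(0b10110011, 3)
    print(bin(upper), bin(lower))  # 0b10110 0b1100000
```

Because the lower half already contains the left-shifted result, a left shift by K can be read from the lower half of a right shift by N-K, which is why a right-only shifter is sufficient.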
Abstract:
Instructions of a program are stored in compressed form in a program memory (12). In a processor which executes the instructions, a program counter (50) identifies a position in the program memory. An instruction cache (40) has cache blocks, each for storing one or more instructions of the program in decompressed form. A cache loading unit (42) includes a decompression section (44) and performs a cache loading operation in which one or more compressed-form instructions are read from the position in the program memory identified by the program counter and are decompressed and stored in one of the said cache blocks of the instruction cache. A cache pointer (52) identifies a position in the instruction cache of an instruction to be fetched for execution. An instruction fetching unit (46) fetches an instruction to be executed from the position identified by the cache pointer. When a cache miss occurs because the instruction to be fetched is not present in the instruction cache, the cache loading unit performs such a cache loading operation. An updating unit (48) updates the program counter and cache pointer in response to the fetching of instructions so as to ensure that the position identified by the said program counter is maintained consistently at the position in the program memory at which the instruction to be fetched from the instruction cache is stored in compressed form.
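The interplay between the program counter, which addresses compressed memory, and the cache pointer, which addresses decompressed instructions, can be sketched with a toy model. Everything concrete here is an assumption made for illustration: the compression scheme (a NOP becomes one byte, any other 4-byte instruction becomes a marker byte plus its raw bytes), the unbounded cache with no eviction, and the names compress, Processor, fetch and jump.

```python
NOP = b"\x00\x00\x00\x00"


def compress(instructions):
    """Toy scheme: NOP -> 0x00; anything else -> 0x01 followed by 4 raw bytes."""
    out = bytearray()
    for ins in instructions:
        out += b"\x00" if ins == NOP else b"\x01" + ins
    return bytes(out)


class Processor:
    def __init__(self, program_memory, block_size=2):
        self.mem = program_memory
        self.block_size = block_size
        self.blocks = {}          # compressed start address -> [(instruction, compressed size)]
        self.pc = 0               # position in compressed program memory
        self.cache_ptr = (0, 0)   # (block tag, slot) of the next instruction
        self.misses = 0

    def _load_block(self):
        """Cache loading unit: decompress one block starting at the program counter."""
        addr, entries = self.pc, []
        while addr < len(self.mem) and len(entries) < self.block_size:
            if self.mem[addr] == 0:
                entries.append((NOP, 1))
                addr += 1
            else:
                entries.append((self.mem[addr + 1:addr + 5], 5))
                addr += 5
        self.blocks[self.pc] = entries
        self.cache_ptr = (self.pc, 0)
        self.misses += 1

    def jump(self, compressed_addr):
        """Redirect both pointers to an instruction boundary in compressed memory."""
        self.pc = compressed_addr
        self.cache_ptr = (compressed_addr, 0)

    def fetch(self):
        if self.pc >= len(self.mem):
            raise IndexError("fetch past end of program")
        tag, slot = self.cache_ptr
        if tag not in self.blocks:          # cache miss
            self._load_block()
            tag, slot = self.cache_ptr
        ins, size = self.blocks[tag][slot]
        # Updating unit: advance the program counter by the compressed size so
        # that it keeps pointing at the compressed form of the next instruction.
        self.pc += size
        if slot + 1 < len(self.blocks[tag]):
            self.cache_ptr = (tag, slot + 1)
        else:
            self.cache_ptr = (self.pc, 0)
        return ins
```

In this model the program counter always holds the compressed address of whichever instruction the cache pointer identifies, so a miss can be serviced by decompressing directly from the program counter without any address translation.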
Abstract:
A method of performing branch prediction in a microprocessor using variable length instructions is provided. An instruction is fetched from memory based on a specified fetch address and a branch prediction is made based on the address. The prediction is selectively discarded if the look-up was based on a non-sequential fetch to an unaligned instruction address and a branch target alignment cache (BTAC) bit of the instruction is equal to zero. In order to remove the inherent latency of branch prediction, an instruction prior to a branch instruction may be fetched concurrently with a branch prediction unit look-up table entry containing prediction information for a next instruction word. Then, the branch instruction is fetched and a prediction is made on this branch instruction based on information fetched in the previous cycle. The predicted target instruction is fetched on the next clock cycle. If zero-overhead loops are used, a look-up table of a branch prediction unit is updated whenever the zero-overhead loop mechanism is updated. The last fetch address of the last instruction of the loop body of a zero-overhead loop is stored in the branch prediction look-up table. Then, whenever an instruction fetch hits the end of the loop body, the instruction fetch is predictively redirected to the start of the loop body. The last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
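The zero-overhead loop part of the method can be sketched as a tiny look-up table. This is an illustration under assumptions that the abstract does not state: fetches are aligned words of FETCH_BYTES bytes, the last fetch address is obtained by aligning down the address just before the loop end, and LoopPredictor, set_loop and next_fetch are hypothetical names.

```python
FETCH_BYTES = 4  # assumed width of one aligned instruction fetch


class LoopPredictor:
    def __init__(self):
        self.table = {}  # last fetch address of a loop body -> loop start

    def set_loop(self, loop_start, loop_end):
        """Update the table whenever the loop mechanism is (re)programmed.

        loop_end is the address of the first instruction after the loop.
        The last fetch address of the body is derived from it by stepping
        back one byte and aligning down to a fetch boundary (assumption).
        """
        last_fetch = (loop_end - 1) & ~(FETCH_BYTES - 1)
        self.table[last_fetch] = loop_start
        return last_fetch

    def next_fetch(self, fetch_addr):
        """Predict the next fetch: back to the loop start at the end of the body."""
        return self.table.get(fetch_addr, fetch_addr + FETCH_BYTES)


if __name__ == "__main__":
    predictor = LoopPredictor()
    predictor.set_loop(0x100, 0x10A)
    addr = 0x100
    for _ in range(4):
        print(hex(addr))
        addr = predictor.next_fetch(addr)
```

The redirect is only a prediction; in a real design the loop counter would still decide whether the loop is taken again or exited.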
Abstract:
A processor has respective first and second external instruction formats (F1, F2) in which instructions (add, load) are received by the processor. Each instruction has an opcode (e.g. 1011) which specifies an operation to be executed. Each external format has one or more preselected opcode bits (F1: i+1˜i+4; F2:i+1˜i+3) in which the opcode appears. The processor also has an internal instruction format (G1) into which instructions in the external formats are translated prior to execution of the operation. A first operation (add) is specifiable in both the first and second external formats (F1, F2) and a second operation (load) is specifiable in the second external format (F2). The first and second operations have distinct opcodes (101, 011) in the second external format. In each of the preselected opcode bits which the first and second external formats have in common (i+1˜i+3), the opcodes of the first operation (101) in the two external formats are identical. Such “congruent” instruction encodings can enable a translation process, for translating the external-format opcode into a corresponding internal-format opcode, to be carried out simply and quickly without the need to positively identify each individual external-format opcode.
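The congruence property can be stated as a one-line check. The sketch below uses the opcodes quoted in the abstract and assumes that opcode bit strings are written with the shared bit positions (i+1 to i+3) first, so that comparing common prefixes compares the common bits; congruent is a hypothetical helper name.

```python
def congruent(opcode_a, opcode_b):
    """True if two opcode bit strings agree in every bit position they share.

    Assumes both strings start at the same bit position (i+1), so the
    shared positions are the common-length prefix.
    """
    common = min(len(opcode_a), len(opcode_b))
    return opcode_a[:common] == opcode_b[:common]


if __name__ == "__main__":
    ADD_F1, ADD_F2, LOAD_F2 = "1011", "101", "011"
    print(congruent(ADD_F1, ADD_F2))   # True: add matches in bits i+1 to i+3
    print(congruent(ADD_F1, LOAD_F2))  # False: load has a distinct opcode
```

When every operation shared by both formats passes this check, a translator can handle the common bits in the same way for both formats without first identifying the individual opcode.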