Abstract:
In some embodiments, a motion estimation search window cache is adaptively re-organized according to frame properties including a frame width and a number of reference frames corresponding to the current frame to be encoded/decoded. The cache reorganization may include an adaptive mapping of reference frame locations to search window cache allocation units (addresses). In some embodiments, a search window is shaped as a quasi-rectangle with truncated upper left and lower right corners, having a full-frame horizontal extent. A search range is defined in a central region of the search window, and is laterally bounded by the truncated corners.
Abstract:
In some embodiments, control and data messages are transmitted non-contentiously over corresponding control and data channels of inter-processor links in a matrix of mesh-interconnected matrix processors. A data stream instruction executed by a user thread of an instruction processing pipeline of a matrix processor may initiate a data stream transfer by a hardware data switch of the matrix processor over multiple consecutive cycles over a data channel. While the data stream is being transferred, the corresponding control channel may transfer control messages non-contentiously with respect to the data stream. The control messages may be messages received from other matrix processors and/or control messages initiated by a kernel thread of the current matrix processor.
Abstract:
A method for processing a plurality of sub-blocks in a block of video is disclosed. The method generally includes the steps of (A) intra predicting a first group of the sub-blocks in a first quadrant of the block, (B) intra predicting a second group of the sub-blocks in a second quadrant of the block after starting the intra predicting of the first group and (C) intra predicting a third group of the sub-blocks in the first quadrant after starting the intra predicting of the second group, wherein the first group and the third group together account for all of the sub-blocks in the first quadrant.
Abstract:
Described systems and methods allow a reduction in the memory bandwidth required in video coding (decoding/encoding) applications. According to a first aspect, the data assigned to each memory word is chosen to correspond to a 2D subarray of a larger array such as a macroblock. An array memory word organization allows reducing both the average and worst-case bandwidth required to retrieve predictions from memory in video coding applications, particularly for memory word sizes (memory bus widths) larger than the size of typical predictions. According to a second aspect, two or more 2D subarrays such as video predictions are retrieved from memory simultaneously as part of a larger 2D array, if retrieving the larger array requires fewer clock cycles than retrieving the subarrays individually. Allowing the combination of multiple predictions in one memory access operation can lead to a reduction in the average bandwidth required to retrieve predictions from memory.
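As a rough illustration of both aspects, the sketch below counts how many memory words a prediction block overlaps under two word layouts, and checks whether fetching two predictions as one bounding array touches fewer words than fetching them separately. The 16-pixel word size, the 16x1 and 4x4 layouts, the one-word-per-cycle cost model, and all function names are assumptions made for the example; none of them come from the abstract.

```python
def words_touched(x, y, w, h, word_w, word_h):
    """Count memory words overlapped by a w x h block with top-left pixel (x, y).

    Assumes each memory word holds a word_w x word_h tile of pixels and that
    tiles are aligned to multiples of (word_w, word_h).
    """
    cols = (x + w - 1) // word_w - x // word_w + 1
    rows = (y + h - 1) // word_h - y // word_h + 1
    return cols * rows


def sweep(w, h, word_w, word_h, period=16):
    """Average and worst-case word count over all block alignments."""
    counts = [words_touched(x, y, w, h, word_w, word_h)
              for y in range(period) for x in range(period)]
    return sum(counts) / len(counts), max(counts)


def fetch_together(a, b, word_w, word_h):
    """Return True if the bounding array of blocks a and b touches fewer words
    than the two blocks fetched individually. Blocks are (x, y, w, h) tuples.
    Assumes cost is one clock cycle per memory word."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    combined = words_touched(x0, y0, x1 - x0, y1 - y0, word_w, word_h)
    separate = (words_touched(*a, word_w, word_h)
                + words_touched(*b, word_w, word_h))
    return combined < separate


# 4x4 predictions with 16-pixel words: linear (16x1) versus 2D (4x4) layout.
linear_avg, linear_worst = sweep(4, 4, word_w=16, word_h=1)
tiled_avg, tiled_worst = sweep(4, 4, word_w=4, word_h=4)

# Two overlapping predictions are cheaper to fetch as one array; distant ones are not.
near = fetch_together((1, 1, 4, 4), (3, 2, 4, 4), 4, 4)
far = fetch_together((0, 0, 4, 4), (12, 12, 4, 4), 4, 4)
```

Under these assumptions the 2D layout lowers both the average and the worst-case word count for a 4x4 prediction, and combining is only worthwhile when the predictions lie close together.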
Abstract:
According to some embodiments, an integrated circuit comprises a microprocessor matrix of mesh-interconnected matrix processors. Each processor comprises a data switch including a data switch link register and matrix routing logic. The data switch link register includes one or more matrix link-enable register fields specifying a link enable status (e.g. a message-independent, p-to-p, and/or broadcast link enable status) for each inter-processor matrix link of the processor. The matrix routing logic routes inter-processor messages according to the matrix link-enable register field(s). A particular link may be selected by a current matrix processor by selecting an ordered list of matrix links according to a relationship between ΔH and ΔV, and choosing the first enabled link in the selected list for routing. ΔH is the horizontal matrix position difference between the current (sender) processor and a destination processor, and ΔV is the vertical matrix position difference between the current and destination processors.
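The link selection described above can be sketched as follows. The abstract does not give the actual ordered lists or the sign convention, so the rule used here (try the axis with the larger remaining distance first), the convention ΔH = destination column minus current column, and the link names "N", "S", "E", "W" are all assumptions for illustration.

```python
def candidate_links(dh, dv):
    """Return an ordered list of links to try, chosen from the relationship
    between dh and dv. Assumed rule: prefer the axis with the larger distance."""
    horiz = "E" if dh > 0 else "W" if dh < 0 else None
    vert = "S" if dv > 0 else "N" if dv < 0 else None
    order = [horiz, vert] if abs(dh) >= abs(dv) else [vert, horiz]
    return [link for link in order if link is not None]


def select_link(cur, dest, link_enable):
    """Pick the first enabled link from the ordered list.

    cur and dest are (column, row) positions; link_enable maps a link name to
    its enable status, standing in for the link-enable register field.
    Returns None if the message is already at its destination or no
    candidate link is enabled.
    """
    dh = dest[0] - cur[0]
    dv = dest[1] - cur[1]
    for link in candidate_links(dh, dv):
        if link_enable.get(link, False):
            return link
    return None


all_on = {"N": True, "S": True, "E": True, "W": True}
east_off = dict(all_on, E=False)

first_choice = select_link((0, 0), (3, 1), all_on)    # horizontal distance dominates
fallback = select_link((0, 0), (3, 1), east_off)      # east link disabled
```

Disabling a link in the register model changes the route without changing the routing rule itself, which is the point of separating the ordered list from the enable status.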
Abstract:
In some embodiments, an integrated circuit comprises a microprocessor matrix including a plurality of mesh-interconnected matrix processors, wherein each matrix processor comprises a data switch configured to direct inter-processor communications within the matrix, and a mapping unit configured to generate a data switch functionality map for a plurality of data switches in the microprocessor matrix. The data switch functionality map is generated by sending a first message through the matrix and, upon receiving a reply to the first message from a first data switch through the matrix, setting a first functionality status designation for the first data switch in the data switch functionality map.
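A minimal sketch of the map-building idea, assuming a hypothetical `probe` callback that sends a message to one data switch and reports whether a reply came back. The status labels "functional" and "no-reply" and the simulated fault are illustrative; the abstract only states that a status designation is set when a reply is received.

```python
def build_functionality_map(switch_ids, probe):
    """Build a status map for the given data switches.

    probe(switch_id) is assumed to return True if a reply to the message
    was received from that switch through the matrix.
    """
    fmap = {}
    for sid in switch_ids:
        fmap[sid] = "functional" if probe(sid) else "no-reply"
    return fmap


# Simulated 2x2 matrix in which switch (1, 0) never replies (hypothetical fault).
faulty = {(1, 0)}
switches = [(r, c) for r in range(2) for c in range(2)]
fmap = build_functionality_map(switches, lambda sid: sid not in faulty)
```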
Abstract:
A memory controller for a multi-bank random access memory (RAM) such as SDRAM includes a transaction slicer for slicing complex client transactions into simple slices, and a command scheduler for re-ordering preparatory memory commands such as activate and precharge in an order that can be different from the order of the corresponding client transactions. The command scheduler may also re-order memory access commands such as read and write. The slicing and out-of-order command scheduling allow a reduction in memory latency. The data transfer to and from clients can be kept in order.
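To make the slicing step concrete, here is a sketch that splits a linear client transaction into slices that each stay within one bank and row and do not exceed a burst limit, then lists the (bank, row) pairs that would need an activate. The address mapping (column bits lowest, then bank, then row), the parameter values, and the names `Slice`, `slice_transaction`, and `activates_needed` are assumptions for the example, not details from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    bank: int
    row: int
    col: int
    length: int


def slice_transaction(addr, length, cols_per_row=8, banks=4, max_burst=4):
    """Split a transaction into simple slices.

    Each slice stays inside a single (bank, row) and is at most max_burst
    columns long. Assumed mapping: column bits lowest, then bank, then row.
    """
    slices = []
    while length > 0:
        col = addr % cols_per_row
        bank = (addr // cols_per_row) % banks
        row = addr // (cols_per_row * banks)
        n = min(length, max_burst, cols_per_row - col)
        slices.append(Slice(bank, row, col, n))
        addr += n
        length -= n
    return slices


def activates_needed(slices, open_rows):
    """List (bank, row) pairs that are not already open, in first-use order.

    A scheduler could issue these preparatory commands ahead of the
    corresponding reads or writes. open_rows maps bank -> currently open row.
    """
    needed = []
    for s in slices:
        if open_rows.get(s.bank) != s.row and (s.bank, s.row) not in needed:
            needed.append((s.bank, s.row))
    return needed


slices = slice_transaction(addr=6, length=7)
early = activates_needed(slices, open_rows={0: 0})
```

Because each slice names a single bank and row, the preparatory commands can be looked up and scheduled independently of the order in which data is returned to the client.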
Abstract:
An integrated circuit is designed by interconnecting pre-designed data-driven cores (intellectual property, functional blocks). Hardware description language (e.g. Verilog or VHDL) and software language (e.g. C or C++) code for interconnecting the cores is automatically generated by software tools from a central circuit specification. The central specification recites pre-designed hardware cores (intellectual property) and the interconnections between the cores. HDL and software-language test benches, as well as timing constraints, are also automatically generated from the central specification. The automatic generation of code simplifies the interconnection of pre-existing cores for the design of complex integrated circuits.
Abstract:
Pre-designed and verified data-driven hardware cores (intellectual property, functional blocks) are assembled to generate large systems on a single chip. Token transfer between cores is achieved upon synchronous assertion, over dedicated connections, of a one-bit ready signal by the transmitter and a one-bit request signal by the receiver. The ready-request signal handshake is necessary and sufficient for token transfer. There are no combinational paths through the cores, and no latches or master controller are used. The architecture and interface allow a significant simplification in the design and verification of large systems integrated on a single chip.
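The ready-request handshake can be modeled cycle by cycle: a token moves only on a clock cycle in which the transmitter asserts ready and the receiver asserts request. The sketch below is a behavioral model only; the function name, the per-cycle signal lists, and the rule that ready is deasserted when no token is pending are assumptions made for the example.

```python
def simulate(tokens, ready_pattern, request_pattern):
    """Simulate token transfer over a ready/request interface.

    ready_pattern[i] is the transmitter's willingness to send on cycle i;
    request_pattern[i] is the receiver's request signal on cycle i.
    Returns a list of (cycle, token) pairs for each completed transfer.
    """
    pending = list(tokens)
    received = []
    for cycle, (rdy, req) in enumerate(zip(ready_pattern, request_pattern)):
        # Ready is only meaningful while the transmitter holds a token.
        ready = bool(rdy) and bool(pending)
        if ready and req:
            received.append((cycle, pending.pop(0)))
    return received


transfers = simulate(["a", "b"], ready_pattern=[1, 1, 0, 1],
                     request_pattern=[0, 1, 1, 1])
```

In this run no token moves on cycle 0 (request low) or cycle 2 (ready low); transfers occur only on cycles 1 and 3, where both signals are asserted together.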