Abstract:
Block processing pipeline methods and apparatus in which reference data are stored to a memory according to tile formats to reduce memory accesses when fetching the data from the memory. When the pipeline stores reference data from a current frame being processed to memory as a reference frame, the reference samples are stored in macroblock sequential order. Each macroblock sample set is stored as a tile. Reference data may be stored in tile formats for luma and chroma. Chroma reference data may be stored in tile formats for chroma 4:2:0, 4:2:2, and/or 4:4:4 formats. A stage of the pipeline may write luma and chroma reference data for macroblocks to memory according to one or more of the macroblock tile formats in a modified knight's order. The stage may delay writing the reference data from the macroblocks until the macroblocks have been fully processed by the pipeline.
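As a rough illustration of why a tile layout reduces memory accesses, the sketch below (not the patented format; 16x16 luma tiles, 8-bit samples, and the function name are assumptions) computes the offset of a luma sample when reference samples are stored one macroblock tile at a time in macroblock-sequential order:

    #include <stdint.h>
    #include <stddef.h>

    #define MB_DIM 16u    /* assumed macroblock width/height, in samples */

    /* Offset of a luma sample when the reference frame is stored as 16x16
     * macroblock tiles in macroblock-sequential order, so a whole tile can be
     * fetched in one contiguous burst instead of 16 separate raster-row reads. */
    size_t tiled_luma_offset(uint32_t x, uint32_t y, uint32_t frame_width_mbs)
    {
        uint32_t mb_x = x / MB_DIM, mb_y = y / MB_DIM;    /* which tile           */
        uint32_t in_x = x % MB_DIM, in_y = y % MB_DIM;    /* position inside tile */
        size_t tile_index = (size_t)mb_y * frame_width_mbs + mb_x;
        return tile_index * (MB_DIM * MB_DIM) + in_y * MB_DIM + in_x;
    }

Because each tile occupies one contiguous span, fetching a reference macroblock touches far fewer memory bursts than gathering sixteen rows from a raster-order frame.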
Abstract:
In the video encoders described herein, blocks of pixels from a video frame may be encoded (e.g., using CAVLC encoding) in a block processing pipeline using wavefront ordering (e.g., in knight's order). Each of the encoded blocks may be written to a particular one of multiple DMA buffers such that the encoded blocks written to each of the buffers represent consecutive blocks of the video frame in scan order. A transcode pipeline may operate in parallel with (or at least overlapping with) the operation of the block processing pipeline. The transcode pipeline may read encoded blocks from the buffers in scan order and merge them into a single, scan-order bit stream. A transcoder core of the transcode pipeline may decode the encoded blocks and re-encode them using a different encoding process (e.g., CABAC). In some cases, the transcoder may be bypassed.
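One way to obtain the per-buffer scan-order property, sketched below with an assumed buffer count and assumed names, is to select the DMA buffer by block row: knight's order still visits each row's blocks from left to right, so a buffer that receives only one row's blocks receives them as consecutive blocks in scan order.

    #include <stdint.h>

    #define NUM_DMA_BUFFERS 4u    /* assumed buffer count, for illustration */

    /* Select the DMA buffer for an encoded block by its row: each buffer then
     * holds consecutive blocks of one frame row in scan order, even though the
     * pipeline emits blocks from several rows interleaved in knight's order. */
    static uint32_t dma_buffer_for_block(uint32_t block_row)
    {
        return block_row % NUM_DMA_BUFFERS;   /* rows map to buffers round-robin */
    }

The transcode pipeline can then drain the buffers row by row to produce the merged, scan-order bit stream.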
Abstract:
Systems, apparatuses, and methods for implementing a low power decimator. A decimator may receive a plurality of input samples from a digital microphone. The decimator may include one or more coefficient tables for storing values that combine two or more filter coefficients for filtering the received samples. The decimator may utilize a concatenation of multiple samples to perform a lookup of a corresponding coefficient table. The coefficient tables may store only the necessary non-redundant values for all coefficient combinations that can be applied to the multiple samples. The result of the coefficient table lookup may have its sign inverted or be zeroed based on the values of the multiple samples.
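A minimal sketch of the table lookup, assuming two-level (1-bit) microphone samples and a pair of FIR coefficients c0 and c1; the names and table layout are illustrative. With samples restricted to +1/-1, only the magnitudes c0+c1 and c0-c1 need to be stored, and the sign (or a zero, for sample encodings that include a zero level) is derived from the sample bits rather than stored in the table:

    #include <stdint.h>

    /* One tap pair of a PDM decimator: two 1-bit samples (bit 1 -> +1,
     * bit 0 -> -1) select an entry from a table that stores only the
     * non-redundant values c0+c1 and c0-c1, halving the table size. */
    static int32_t tap_pair(uint8_t s0, uint8_t s1, const int32_t table[2])
    {
        /* table[0] = c0 + c1, table[1] = c0 - c1 (precomputed) */
        int32_t magnitude = table[s0 ^ s1];    /* equal bits -> sum, else difference */
        return s0 ? magnitude : -magnitude;    /* sign decided by the first sample   */
    }

For example, bits (0,1) yield -(c0-c1), which equals -c0+c1, the same result as multiplying each sample by its coefficient individually.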
Abstract:
Memory latency tolerance methods and apparatus for maintaining an overall level of performance in block processing pipelines that prefetch reference data into a search window. In a general memory latency tolerance method, search window processing in the pipeline may be monitored. If the status of search window processing changes in a way that affects pipeline throughput, then pipeline processing may be modified. The modification may be performed according to no-stall methods, stall-recovery methods, and/or stall-prevention methods. In no-stall methods, a block may be processed using the data present in the search window without waiting for reference data that has not yet arrived. In stall-recovery methods, the pipeline is allowed to stall, and processing is modified for subsequent blocks to speed up the pipeline and recover the lost throughput. In stall-prevention methods, processing is adjusted in advance of the pipeline encountering a stall condition.
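The sketch below models the monitoring decision with assumed structure and field names; it only illustrates how the three policy families named above could alter processing (here, by shrinking the motion search range):

    #include <stdbool.h>

    enum latency_policy { NO_STALL, STALL_RECOVERY, STALL_PREVENTION };

    struct search_ctl {
        bool prefetch_complete;   /* has the reference column for this block arrived? */
        bool stall_predicted;     /* e.g., DMA queue depth suggests a future miss     */
        int  search_range;        /* current motion search range, in pixels           */
        int  reduced_range;       /* smaller range used to speed the pipeline up      */
    };

    /* Decide how to process the next block given search-window status. */
    static void adjust_processing(struct search_ctl *c, enum latency_policy p)
    {
        switch (p) {
        case NO_STALL:
            /* Search only the data already in the window; never wait. */
            break;
        case STALL_RECOVERY:
            if (!c->prefetch_complete)                 /* a stall happened: catch up */
                c->search_range = c->reduced_range;    /* on the following blocks    */
            break;
        case STALL_PREVENTION:
            if (c->stall_predicted)                    /* act before the stall occurs */
                c->search_range = c->reduced_range;
            break;
        }
    }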
Abstract:
A block processing pipeline that includes a software pipeline and a hardware pipeline that run in parallel. The software pipeline runs at least one block ahead of the hardware pipeline. The stages of the pipeline may each include a hardware pipeline component that performs one or more operations on a current block at the stage. At least one stage of the pipeline may also include a software pipeline component that determines a configuration for that stage's hardware component for processing a next block while the hardware component is processing the current block. The software pipeline component may determine the configuration according to information related to the next block obtained from an upstream stage of the pipeline. The software pipeline component may also obtain and use information related to a block that was previously processed at the stage.
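A rough sketch of one such stage follows; the structure and function names are placeholders, and the configuration derivation is invented purely for illustration:

    /* One stage with both components: the software component prepares the
     * configuration for block n+1 while the hardware component processes
     * block n, keeping software one block ahead of hardware. */
    struct block_info   { int row, col; };            /* info pushed from upstream     */
    struct stage_config { int search_range; int quant_param; };

    static void hw_process_block(const struct stage_config *cfg, struct block_info b)
    { (void)cfg; (void)b; /* stand-in for the hardware component */ }

    static struct block_info upstream_next_block_info(void)
    { struct block_info b = {0, 0}; return b; /* stand-in for the upstream interface */ }

    static void run_stage(struct block_info current, const struct stage_config *current_cfg,
                          struct stage_config *next_cfg)
    {
        /* Software component: derive the next block's configuration from
         * upstream information (and, if kept, results of the block previously
         * processed at this stage). */
        struct block_info next = upstream_next_block_info();
        next_cfg->search_range = 64;                   /* placeholder derivation */
        next_cfg->quant_param  = next.row + next.col;  /* placeholder derivation */

        /* Hardware component: process the current block with the configuration
         * prepared by the software component during the previous iteration. */
        hw_process_block(current_cfg, current);
    }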
Abstract:
A knight's order processing method for block processing pipelines in which the next block input to the pipeline is taken from the row below and one or more columns to the left in the frame. The knight's order method may provide spacing between adjacent blocks in the pipeline to facilitate feedback of data from a downstream stage to an upstream stage. The rows of blocks in the input frame may be divided into sets of rows that constrain the knight's order method to maintain locality of neighbor block data. Invalid blocks may be input to the pipeline at the left of the first set of rows and at the right of the last set of rows, and the sets of rows may be treated as if they are horizontally arranged rather than vertically arranged, to maintain continuity of the knight's order algorithm.
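The sketch below illustrates one consistent choice of step within the abstract's "one or more columns to the left": row sets of four rows, a step of one row down and two columns left, and a wrap at the bottom of a set chosen so that every block is visited exactly once. The set size and step width are assumptions; negative columns correspond to the invalid blocks padded onto the left of the first set of rows.

    #define ROWS_PER_SET 4    /* assumed set size; the abstract only requires >= 1 */

    struct block_pos { int row, col; };

    /* Coordinates of the next block to input to the pipeline in knight's order. */
    static struct block_pos knights_order_next(struct block_pos cur)
    {
        struct block_pos next;
        if ((cur.row % ROWS_PER_SET) == ROWS_PER_SET - 1) {
            /* Bottom row of the set: wrap to its top row, far enough to the
             * right that the sequence still covers every block exactly once. */
            next.row = cur.row - (ROWS_PER_SET - 1);
            next.col = cur.col + 2 * (ROWS_PER_SET - 1) + 1;
        } else {
            next.row = cur.row + 1;   /* one row down            */
            next.col = cur.col - 2;   /* two columns to the left */
        }
        return next;
    }

With these parameters, consecutive blocks from the same row are seven pipeline positions apart, which is the spacing that allows a downstream stage to feed results back to an upstream stage before the neighboring block needs them.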
Abstract:
Block processing pipeline methods and apparatus in which pixel data from a reference frame is prefetched into a search window memory. The search window may include two or more overlapping regions of pixels from the reference frame corresponding to blocks from the rows in the input frame that are currently being processed in the pipeline. Thus, the pipeline may process blocks from multiple rows of an input frame using one set of pixel data from a reference frame that is stored in a shared search window memory. The search window may be advanced by one column of blocks by initiating a prefetch for a next column of reference data from a memory. The pipeline may also include a reference data cache that may be used to cache a portion of a reference frame and from which at least a portion of a prefetch for the search window may be satisfied.
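As a sketch of the column-by-column advance (the buffer organization, names, and prefetch interface are assumptions, not the claimed design):

    #include <stdint.h>
    #include <stdbool.h>

    /* The search window as a circular buffer of reference block columns shared
     * by all rows currently in flight.  Advancing it by one column retires the
     * oldest column and starts a prefetch for the next column, which may be
     * satisfied in part from the reference data cache. */
    struct search_window {
        uint32_t first_col;     /* leftmost reference block column resident  */
        uint32_t num_cols;      /* window width, in block columns            */
        uint32_t write_slot;    /* circular-buffer slot the prefetch fills   */
    };

    static bool prefetch_reference_column(uint32_t ref_col, uint32_t slot)
    { (void)ref_col; (void)slot; return true; /* stand-in for the DMA/cache interface */ }

    static void advance_search_window(struct search_window *w)
    {
        uint32_t next_col = w->first_col + w->num_cols;      /* column entering window */
        prefetch_reference_column(next_col, w->write_slot);
        w->first_col += 1;                                    /* oldest column retired  */
        w->write_slot = (w->write_slot + 1) % w->num_cols;
    }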
Abstract:
Methods and apparatus for caching reference data in a block processing pipeline. A cache may be implemented into which reference data corresponding to motion vectors for blocks being processed in the pipeline may be prefetched from memory. Prefetches of the reference data indicated by the motion vectors may be initiated one or more stages prior to the processing stage that uses the data. Cache tags for the cache may be defined by the motion vectors. When a motion vector is received, the tags can be checked to determine whether the cache holds cache block(s) corresponding to the vector (cache hits). Upon a cache miss, a cache block in the cache is selected according to a replacement policy, the respective tag is updated, and a prefetch (e.g., via DMA) for the respective reference data is issued.
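A minimal sketch of the lookup-and-prefetch path, assuming the tag is the reference-frame block address that a motion vector resolves to and a simple round-robin replacement policy; the cache size and all names are placeholders:

    #include <stdint.h>
    #include <stdbool.h>

    #define CACHE_BLOCKS 64u    /* assumed cache size, in reference blocks */

    struct ref_cache {
        uint32_t tag[CACHE_BLOCKS];      /* tag derived from the motion vector   */
        bool     valid[CACHE_BLOCKS];
        uint32_t next_victim;            /* round-robin replacement pointer      */
    };

    static void issue_prefetch(uint32_t block_addr, uint32_t slot)  /* stand-in for DMA */
    { (void)block_addr; (void)slot; }

    /* Check the cache for the reference block a motion vector points to; on a
     * miss, claim a slot, update its tag, and start the prefetch so the data
     * arrives before the downstream stage needs it. */
    static bool lookup_or_prefetch(struct ref_cache *c, uint32_t block_addr)
    {
        for (uint32_t i = 0; i < CACHE_BLOCKS; i++)
            if (c->valid[i] && c->tag[i] == block_addr)
                return true;                        /* cache hit  */

        uint32_t slot = c->next_victim;             /* replacement policy */
        c->next_victim = (slot + 1) % CACHE_BLOCKS;
        c->tag[slot] = block_addr;
        c->valid[slot] = true;
        issue_prefetch(block_addr, slot);
        return false;                               /* cache miss */
    }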
Abstract:
An apparatus may include a counter circuit and an execution unit. The execution unit may be configured to receive and execute a first instruction. The first instruction may include a first number corresponding to how many of a plurality of instructions are to be executed, a second number corresponding to the number of times to execute a subset of the plurality of instructions, and a third number corresponding to the number of instructions in the subset. The execution unit may be further configured to initialize a first count value in the counter circuit to the second number in response to the execution of the first instruction, to execute the first number of the plurality of instructions, and to execute the subset of the plurality of instructions. The counter circuit may be configured to modify the first count value in response to determining that a last instruction of the subset has been retired.
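A software model of one possible reading of those three fields follows; the relationship between the first group of instructions and the repeated subset is not fixed by the abstract, so here the subset is taken to follow the first group, and all names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    struct instr { const char *mnemonic; };

    static void execute(const struct instr *i) { printf("exec %s\n", i->mnemonic); }

    /* Interpret the loop-setup instruction: run `first` instructions once, then
     * repeat the following `subset_len` instructions `repeat` times, with the
     * counter updated each time the last instruction of the subset retires. */
    static void run_loop_instr(const struct instr *prog,
                               uint32_t first, uint32_t repeat, uint32_t subset_len)
    {
        uint32_t counter = repeat;                   /* counter circuit initialized     */

        for (uint32_t i = 0; i < first; i++)         /* the first group of instructions */
            execute(&prog[i]);

        while (counter != 0) {                       /* the repeated subset             */
            for (uint32_t i = 0; i < subset_len; i++)
                execute(&prog[first + i]);
            counter--;                               /* last subset instruction retired */
        }
    }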