Abstract:
A cache for storing data elements is disclosed. The cache includes a cache memory having one or more lines and one or more cache line counters, each associated with a line of the cache memory. In operation, a cache line counter of the one or more cache line counters is incremented when a request is received to prefetch a data element into the cache memory and is decremented when the data element is consumed. Optionally, one or more reference queues may be used to store the locations of data elements in the cache memory. In one embodiment, data cannot be evicted from cache lines unless the associated cache line counters indicate that the prefetched data has been consumed.
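As a rough illustration of the counting scheme described above, the following Python sketch models a cache whose per-line counters are incremented on prefetch, decremented on consumption, and consulted before eviction. The class and method names are illustrative, not taken from the patent, and the optional reference queues are omitted.

# Minimal sketch, assuming one counter per cache line: prefetch increments
# the counter, consumption decrements it, and a line may only be evicted
# once its counter has returned to zero. Names are illustrative.

class CountedCache:
    def __init__(self, num_lines):
        self.lines = [None] * num_lines      # cached data elements
        self.counters = [0] * num_lines      # one counter per cache line

    def prefetch(self, line, element):
        """Load an element into a line and record an outstanding consumer."""
        self.lines[line] = element
        self.counters[line] += 1

    def consume(self, line):
        """Read the element and mark one outstanding reference as used."""
        if self.counters[line] == 0:
            raise RuntimeError("no outstanding prefetch for this line")
        self.counters[line] -= 1
        return self.lines[line]

    def evict(self, line):
        """Eviction is refused while prefetched data is still unconsumed."""
        if self.counters[line] > 0:
            return False
        self.lines[line] = None
        return True


cache = CountedCache(num_lines=4)
cache.prefetch(0, "element")
assert cache.evict(0) is False       # blocked: data not yet consumed
cache.consume(0)
assert cache.evict(0) is True        # allowed once the counter reaches zero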
Abstract:
A system and method for calculating memory addresses in a partitioned memory in a processing system having a processing unit, input and output units, a program sequencer and an external interface. An address calculator includes a set of storage elements, such as registers, and an arithmetic unit for calculating the memory address of a vector element dependent upon values stored in the storage elements and the address of a previous vector element. The storage elements hold STRIDE, SKIP and SPAN values and optionally a TYPE value, relating to the spacing between elements in the same partition, the spacing between elements in consecutive partitions, the number of elements in a partition and the size of a vector element, respectively.
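A minimal Python sketch of one way such an address calculator could step through a partitioned vector. It assumes STRIDE and SKIP are expressed in elements, TYPE is the element size in bytes, and SPAN counts elements per partition; the exact update rule and the function name are assumptions made for illustration, not the patented design.

def element_addresses(base, count, stride, skip, span, type_size):
    """Yield byte addresses of `count` vector elements, deriving each
    address from the previous one using STRIDE within a partition and
    SKIP when crossing into the next partition (assumed interpretation)."""
    addr = base
    for i in range(count):
        yield addr
        if (i + 1) % span == 0:
            addr += skip * type_size     # step into the next partition
        else:
            addr += stride * type_size   # next element in the same partition


# Partitions of 4 elements, unit stride, 4-byte elements, and a skip of
# 8 element positions between consecutive partitions.
print(list(element_addresses(base=0, count=8, stride=1, skip=8,
                             span=4, type_size=4)))
# [0, 4, 8, 12, 44, 48, 52, 56]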
Abstract:
A method and apparatus for the elimination of prolog and epilog instructions in a vector processor. To eliminate the prolog, a functional unit of the vector processor has at least one input for receiving an input data value tagged with a data validity tag and an output for outputting an intermediate result tagged with a data validity tag. The data validity tags indicate the validity of the data. Before a loop is executed, the data validity tags are set to indicate that the associated data values are invalid. During execution of the loop body, a functional unit checks the validity of its input data. If all of the input data values are valid, the functional operation is performed and the corresponding data validity tag is set to indicate that the result is valid. If any of the input data values is invalid, the data validity tag of the result is set to indicate that the result is invalid. To eliminate the epilog, an iteration counter is associated with each sink unit of the vector processor. When a specified number of data values have been produced by a particular sink, no more data values are produced by that sink. The instructions for the pipelined loop body may be repeated, without alteration, to eliminate prolog and epilog instructions.
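The tagging idea can be mimicked in software. The Python sketch below performs an operation only when all tagged inputs are valid and lets a sink stop after a fixed iteration count, so the same loop body runs unchanged with no separate prolog or epilog; it is a behavioural analogy with illustrative names and a single one-stage pipeline, not the hardware described.

from dataclasses import dataclass

@dataclass
class Tagged:
    value: int
    valid: bool        # data validity tag

def functional_unit(op, *inputs):
    """Apply `op` only if every tagged input is valid; otherwise
    propagate an invalid result."""
    if all(t.valid for t in inputs):
        return Tagged(op(*(t.value for t in inputs)), True)
    return Tagged(0, False)

class Sink:
    """Accepts at most `limit` valid results, mimicking the iteration counter."""
    def __init__(self, limit):
        self.limit = limit
        self.out = []

    def accept(self, tagged):
        if tagged.valid and len(self.out) < self.limit:
            self.out.append(tagged.value)

# The pipeline register starts invalid, so the first iteration produces an
# invalid result instead of needing prolog code; extra iterations after the
# data runs out are discarded instead of needing epilog code.
pipeline_reg = Tagged(0, False)
sink = Sink(limit=4)
stream = [1, 2, 3, 4]
for i in range(6):                           # more iterations than data values
    src = Tagged(stream[i], True) if i < len(stream) else Tagged(0, False)
    sink.accept(functional_unit(lambda a: a * 2, pipeline_reg))
    pipeline_reg = src
print(sink.out)                              # [2, 4, 6, 8]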
Abstract:
A re-configurable, streaming vector processor (100) is provided which includes a number of function units (102), each having one or more inputs for receiving data values and an output for providing a data value, a re-configurable interconnection switch (104) and a micro-sequencer (118). The re-configurable interconnection switch (104) includes one or more links, each link operable to couple an output of a function unit (102) to an input of a function unit (102) as directed by the micro-sequencer (118). The vector processor may also include one or more input-stream units (122) for retrieving data from memory. Each input-stream unit is directed by a host processor and has a defined interface (116) to the host processor. The vector processor also includes one or more output-stream units (124) for writing data to memory or to the host processor. The defined interface of the input-stream and output-stream units forms a first part of the programming model. The instructions stored in a memory, in the sequence that directs the re-configurable interconnection switch, form a second part of the programming model.
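The two-part programming model can be sketched in Python as follows, with assumed instruction and unit names: a micro-sequencer steps through stored instructions, each telling the interconnection which function-unit output feeds which function-unit input. This is an interpretive software analogy of the architecture described, not its implementation.

def add(a, b): return a + b
def mul(a, b): return a * b

function_units = {"fu0": add, "fu1": mul}

# Each assumed "instruction" names a function unit, the sources of its
# operands (stream inputs or earlier results) and where to put the result.
program = [
    ("fu0", ("in0", "in1"), "t0"),      # t0 = in0 + in1
    ("fu1", ("t0", "in2"), "out0"),     # out0 = t0 * in2
]

def run(program, inputs):
    values = dict(inputs)                         # values from input-stream units
    for unit, sources, dest in program:           # micro-sequencer loop
        operands = [values[s] for s in sources]   # interconnection routing
        values[dest] = function_units[unit](*operands)
    return values

print(run(program, {"in0": 2, "in1": 3, "in2": 4})["out0"])    # 20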
Abstract:
Various load and store instructions may be used to transfer multiple vector elements between registers in a register file and memory. A cnt parameter may be used to indicate a total number of elements to be transferred to or from memory, and an rcnt parameter may be used to indicate a maximum number of vector elements that may be transferred to or from a single register within a register file. Also, the instructions may use a variety of different addressing modes. The memory element size may be specified independently from the register element size such that source and destination sizes may differ within an instruction. With some instructions, a vector stream may be initiated and conditionally enqueued or dequeued. Truncation or rounding fields may be provided such that source data elements may be truncated or rounded when transferred. Also, source data elements may be sign-extended or zero-extended when transferred.
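A small Python sketch of the cnt/rcnt idea under assumed semantics: cnt elements are read sequentially from memory, at most rcnt of them are placed in any one register, and narrower memory elements are optionally sign-extended into wider register elements. The function and parameter names are illustrative, not the instruction encoding.

def sign_extend(value, bits):
    """Interpret an unsigned `bits`-wide value as a signed integer."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def vector_load(memory, start, cnt, rcnt, mem_bits=8, signed=True):
    """Return a register file: a list of registers, each holding up to
    `rcnt` elements, filled with `cnt` elements loaded from `memory`."""
    elements = memory[start:start + cnt]
    if signed:
        elements = [sign_extend(e, mem_bits) for e in elements]
    return [elements[i:i + rcnt] for i in range(0, len(elements), rcnt)]

mem = [0x01, 0xFF, 0x7F, 0x80, 0x10]             # 8-bit memory elements
print(vector_load(mem, start=0, cnt=5, rcnt=2))  # [[1, -1], [127, -128], [16]]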
Abstract:
A method for producing a formatted description of a computation representable by a data-flow graph, and a computer for performing a computation so described. A source instruction is generated for each input of the data-flow graph, a computational instruction is generated for each node of the data-flow graph, and a sink instruction is generated for each output of the data-flow graph. The computational instruction for a node includes a descriptor of an operation performed at the node and a descriptor of each instruction that produces an input to the node. The formatted description is a sequential instruction list comprising source instructions, computational instructions and sink instructions. Each instruction has an instruction identifier, and the descriptor of an instruction that produces an input to a node is that instruction's identifier. The computer is directed by a program of instructions to implement a computation representable by a data-flow graph.
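The flattening described above can be sketched directly in Python, using an assumed graph representation: a source instruction per input, one computational instruction per node whose operand descriptors are the identifiers of the producing instructions, and a sink instruction per output.

def flatten(inputs, nodes, outputs):
    """Return a sequential instruction list for a small data-flow graph;
    `nodes` maps node names to (operation, operand names) and must be
    given in topological order."""
    program, ids = [], {}
    for name in inputs:                           # source instructions
        ids[name] = len(program)
        program.append(("source", name))
    for name, (op, operands) in nodes.items():    # computational instructions
        ids[name] = len(program)
        program.append((op, tuple(ids[o] for o in operands)))
    for name in outputs:                          # sink instructions
        program.append(("sink", ids[name]))
    return program

# y = (a + b) * c
nodes = {"sum": ("add", ["a", "b"]), "prod": ("mul", ["sum", "c"])}
for ident, instr in enumerate(flatten(["a", "b", "c"], nodes, ["prod"])):
    print(ident, instr)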
Abstract:
A method for scheduling a computation for execution on a computer with a number of interconnected functional units. The computation is representable by a data-flow graph with a number of nodes connected by edges. A loop-period of the computation is calculated and the nodes are scheduled for throughput by assigning an execution cycle and a functional unit to each node of the data-flow graph. The scheduling of flexible nodes is adjusted to minimize the number of interconnections required in each execution cycle. The edges of the data-flow graph are allocated to one or more of the interconnections between functional units. The scheduling method may be used, for example, to optimize the interconnection fabric design for an ASIC or as part of a compiler for a re-configurable streaming vector processor.
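The sketch below illustrates only the flavour of such scheduling, under strong simplifying assumptions (unit-latency operations, no loop-carried recurrences, nodes given in topological order): a resource-based lower bound stands in for the loop-period, and each node is greedily given an execution cycle and a functional-unit instance modulo that period. It is generic modulo scheduling written for illustration, not the patented method, and it omits the interconnection-minimization and edge-allocation steps.

import math
from collections import Counter

def loop_period(nodes, units):
    """Resource-constrained lower bound on the initiation interval."""
    demand = Counter(op for op, _ in nodes.values())
    return max(math.ceil(demand[op] / units[op]) for op in demand)

def schedule(nodes, units):
    """Greedily assign each node an execution cycle and a unit instance."""
    period = loop_period(nodes, units)
    busy = set()                      # occupied (op, cycle % period, instance) slots
    placed = {}
    for name, (op, deps) in nodes.items():
        cycle = 1 + max((placed[d][0] for d in deps), default=-1)
        while True:
            free = [i for i in range(units[op])
                    if (op, cycle % period, i) not in busy]
            if free:
                busy.add((op, cycle % period, free[0]))
                placed[name] = (cycle, f"{op}{free[0]}")
                break
            cycle += 1
    return period, placed

# Two multiplications compete for a single multiplier, so the period is 2.
nodes = {"a": ("mul", []), "b": ("mul", []), "c": ("add", ["a", "b"])}
print(schedule(nodes, {"mul": 1, "add": 1}))
# (2, {'a': (0, 'mul0'), 'b': (1, 'mul0'), 'c': (2, 'add0')})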
Abstract:
An interconnection device (300) with a number of links (306, 308, 310, 312 and 314), each link having a number of link input ports (302), link output ports (304) and storage registers (316). An input selection switch (402) is coupled to a selected link input port to receive an input data token. The storage registers (316) may be used to store input data tokens. A storage access switch (404) is coupled to the input selection switch (402) and to the storage registers (316) and may be used to select the current input data token or a token from the storage registers as an output data token. An output selection switch (406) receives the output data token and provides it to a selected link output port. The interconnection device may, for example, be used to connect the inputs and outputs of the processing elements of a vector processor or digital signal processor.
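One link's data path, as described above, can be mimicked with the following Python sketch (the class, method and argument names are assumptions): an input selection switch picks a link input port, a storage access switch chooses between the fresh token and a stored one, and an output selection switch drives the chosen link output port.

class Link:
    def __init__(self, num_registers, num_outputs):
        self.registers = [None] * num_registers      # storage registers
        self.outputs = [None] * num_outputs          # link output ports

    def transfer(self, input_ports, in_sel, store_to=None,
                 use_stored=None, out_sel=0):
        token = input_ports[in_sel]                  # input selection switch
        if store_to is not None:
            self.registers[store_to] = token         # keep the token for later
        if use_stored is not None:
            token = self.registers[use_stored]       # storage access switch
        self.outputs[out_sel] = token                # output selection switch
        return token


link = Link(num_registers=2, num_outputs=2)
link.transfer(input_ports=[10, 20, 30], in_sel=1, store_to=0)   # forwards 20
print(link.transfer(input_ports=[40, 50], in_sel=0,
                    use_stored=0, out_sel=1))        # replays the stored 20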
Abstract:
A memory interface device (100) providing a fractional address interface between a data processor (104) and a memory system (102) and a method for retrieving intermediate data values from a memory system using fractional addressing. The device includes an address generator (108) for generating first and second memory addresses, the first memory address being less than or equal to a specified fractional address, the second memory address being greater than or equal to the fractional address. The device also includes a memory access unit (110) coupled to the address generator (108) for retrieving first and second data values from the memory system (102) at the first and second memory addresses, respectively. The device also includes a data access unit (112) for interpolating between the first and second data values and passing the interpolated value to the data processor (104). The memory interface has application in a variety of data processing systems, including digital signal processors and streaming vector processors.
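In software terms the device's behaviour can be approximated as below; linear interpolation is assumed, since the abstract does not fix the interpolation method, and a Python list stands in for the memory system.

import math

def fractional_read(memory, frac_addr):
    """Return an interpolated value for a non-integer address."""
    lo = math.floor(frac_addr)          # first address:  <= frac_addr
    hi = math.ceil(frac_addr)           # second address: >= frac_addr
    if lo == hi:
        return memory[lo]
    weight = frac_addr - lo
    return (1.0 - weight) * memory[lo] + weight * memory[hi]

table = [0.0, 10.0, 20.0, 40.0]
print(fractional_read(table, 1.25))     # 12.5, between memory[1] and memory[2]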
Abstract:
A method and apparatus that allocates bandwidth among wireless sensor nodes in wireless sensor groups in a wireless sensor network (WSN) is disclosed. The method may include forming a plurality of wireless sensor node groups from a plurality of wireless sensor nodes based on battery levels of the wireless sensor nodes, allocating transmission time slots for the wireless sensor nodes in each of the wireless sensor node groups based on at least one channel quality metric, determining average battery levels for each of the wireless sensor node groups and the average battery level of all of the wireless sensor nodes, and determining differences between the average battery levels of each of the wireless sensor node groups and the average battery level of all of the wireless sensor nodes, wherein if any difference in the average battery levels is above a predetermined threshold, regrouping the plurality of wireless sensor nodes according to their battery levels to minimize any variance in average battery level across all of the wireless sensor node groups.
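The grouping criterion can be sketched in Python as follows; the data structures and the round-robin re-deal used to balance group averages are assumptions made for illustration, not necessarily the patented regrouping strategy, and time-slot allocation is omitted.

def averages(groups):
    """Per-group average battery levels and the overall average."""
    all_nodes = [b for g in groups for b in g]
    return [sum(g) / len(g) for g in groups], sum(all_nodes) / len(all_nodes)

def regroup_if_needed(groups, threshold):
    group_avgs, overall_avg = averages(groups)
    if all(abs(a - overall_avg) <= threshold for a in group_avgs):
        return groups                    # every group is close to the overall average
    # Re-deal the nodes round-robin in battery order to even out the averages.
    nodes = sorted((b for g in groups for b in g), reverse=True)
    balanced = [[] for _ in groups]
    for i, battery in enumerate(nodes):
        balanced[i % len(groups)].append(battery)
    return balanced

groups = [[90, 85, 80], [30, 25, 20]]                    # battery percentages
print(regroup_if_needed(groups, threshold=10))
# [[90, 80, 25], [85, 30, 20]] -> group averages 65 and 45 vs. overall 55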