Abstract:
Methods, devices, and non-transitory processor-readable storage media for compacting data within cache lines of a cache. An aspect method may include identifying, by a processor of a computing device, a base address (e.g., a physical or virtual cache address) for a first data segment, identifying a data size (e.g., based on a compression ratio) for the first data segment, obtaining a base offset based on the identified data size and the base address of the first data segment, and calculating an offset address by offsetting the base address with the obtained base offset, wherein the calculated offset address is associated with a second data segment. In some aspects, the method may include identifying a parity value for the first data segment based on the base address and obtaining the base offset by performing a lookup on a stored table using the identified data size and identified parity value.
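A minimal C sketch of the offset computation described above, assuming 64-byte cache lines, four data-size classes, and a parity bit taken from the line index of the base address; the table contents and the names `base_offset_table`, `parity_of`, and `companion_address` are illustrative assumptions, not taken from the disclosure.

```c
#include <stdint.h>

/* Hypothetical lookup table indexed by [size_class][parity]; the byte
 * offsets here are placeholders chosen only to make the sketch concrete. */
static const uint32_t base_offset_table[4][2] = {
    { 16, 16 },   /* size class 0: e.g., 1/4-compressed segments */
    { 32, 32 },   /* size class 1: e.g., 1/2-compressed segments */
    { 48, 16 },   /* size class 2 */
    { 64, 64 },   /* size class 3: uncompressed */
};

/* Parity value derived from the base address (64-byte lines assumed). */
static inline unsigned parity_of(uintptr_t base_addr) {
    return (unsigned)((base_addr >> 6) & 1u);
}

/* Calculate the offset address associated with the second data segment
 * by offsetting the base address with the looked-up base offset. */
uintptr_t companion_address(uintptr_t base_addr, unsigned size_class) {
    uint32_t base_offset = base_offset_table[size_class][parity_of(base_addr)];
    return base_addr + base_offset;
}
```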
Abstract:
Aspects include computing devices, systems, and methods for implementing cache memory access requests for compressed data using cache bank spreading. In an aspect, cache bank spreading may include determining whether the compressed data of a cache memory access request fits on a single cache bank. In response to determining that the compressed data fits on a single cache bank, a cache bank spreading value may be calculated to reinstate bank selection bits of the physical address of the cache memory access request that may have been cleared during data compression. A cache bank spreading address in the physical space of the cache memory may include the physical address of the cache memory access request plus the reinstated bank selection bits. The cache bank spreading address may be used to read compressed data from, or write compressed data to, the cache memory device.
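A C sketch of the address fix-up described above, assuming four cache banks, bank selection bits at bit position 6, and a spreading value derived from higher-order address bits; the bit positions, the derivation of the spreading value, and the function name are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define BANK_SHIFT 6                         /* assumed bank-bit position */
#define BANK_MASK  (0x3u << BANK_SHIFT)      /* assume 4 cache banks */

/* If the compressed data fits on a single bank, reinstate the bank
 * selection bits (cleared during compression) with a spreading value,
 * so consecutive compressed lines spread across banks. Because the bank
 * bits are cleared, ORing them in is the same as adding them. */
uintptr_t bank_spreading_address(uintptr_t phys_addr,
                                 size_t compressed_size,
                                 size_t bank_width) {
    if (compressed_size > bank_width)
        return phys_addr;                    /* spans banks: use as-is */
    uintptr_t spread = (phys_addr >> 12) & 0x3u;  /* illustrative value */
    return (phys_addr & ~(uintptr_t)BANK_MASK) | (spread << BANK_SHIFT);
}
```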
Abstract:
Various embodiments of methods and systems for thermally aware scheduling of workloads in a portable computing device that contains a heterogeneous, multi-processor system on a chip (“SoC”) are disclosed. Because individual processing components in a heterogeneous, multi-processor SoC may exhibit different processing efficiencies at a given temperature, and because more than one of the processing components may be capable of processing a given block of code, thermally aware workload scheduling techniques that compare performance curves of the individual processing components at their measured operating temperatures can be leveraged to optimize quality of service (“QoS”) by allocating workloads in real time, or near real time, to the processing components best positioned to efficiently process the block of code.
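A C sketch of the comparison the abstract describes: each capable processing component's performance curve is evaluated at its measured operating temperature, and the workload goes to the component best positioned to process it. The `component_t` layout, the `perf_at` curve callback, and the selection rule are all illustrative assumptions.

```c
#include <stddef.h>

/* Illustrative model: each component exposes a performance curve as a
 * function of temperature, plus its measured operating temperature. */
typedef struct {
    int id;
    double (*perf_at)(double temp_c);  /* performance curve */
    double temp_c;                     /* measured operating temperature */
    int can_run;                       /* capable of this block of code? */
} component_t;

/* Thermally aware scheduling: pick the capable component whose curve
 * is highest at its current temperature. Returns -1 if none qualify. */
int schedule_block(const component_t *c, size_t n) {
    int best = -1;
    double best_perf = -1.0;
    for (size_t i = 0; i < n; ++i) {
        if (!c[i].can_run) continue;
        double p = c[i].perf_at(c[i].temp_c);
        if (p > best_perf) { best_perf = p; best = c[i].id; }
    }
    return best;
}
```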
Abstract:
A method includes storing, with a first programmable processor, shared variable data to cache lines of a first cache of the first programmable processor. The method further includes executing, with the first programmable processor, a store-with-release operation, executing, with a second programmable processor, a load-with-acquire operation, and loading, with the second programmable processor, the value of the shared variable data from a cache of the second programmable processor.
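The store-with-release/load-with-acquire pairing described above maps directly onto C11 atomics; a minimal sketch follows, with the flag name `ready` and the stored value 42 as illustrative placeholders.

```c
#include <stdatomic.h>

int shared_data;          /* shared variable data */
atomic_int ready = 0;     /* release/acquire synchronization flag */

/* First processor: write the shared data, then store-with-release. */
void producer(void) {
    shared_data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Second processor: load-with-acquire, then read the shared data.
 * The acquire pairs with the release, so the load below is guaranteed
 * to observe the value published before the releasing store. */
int consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until the flag is published */
    return shared_data;
}
```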
Abstract:
Certain aspects of the present disclosure provide techniques for improved hardware utilization. An input data tensor is divided into a first plurality of sub-tensors, and a plurality of logical sub-arrays in a physical multiply-and-accumulate (MAC) array is identified. For each respective sub-tensor of the first plurality of sub-tensors, the respective sub-tensor is mapped to a respective logical sub-array of the plurality of logical sub-arrays, and the respective sub-tensor is processed using the respective logical sub-array.
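A C sketch of the divide-and-map step described above, assuming a two-dimensional input, fixed 8x8 sub-tensors, and round-robin assignment to four logical sub-arrays; the tile size, sub-array count, and round-robin policy are illustrative assumptions.

```c
#include <stddef.h>

#define TILE 8             /* assumed sub-tensor edge length */
#define NUM_SUBARRAYS 4    /* assumed logical sub-arrays in the MAC array */

typedef struct { size_t row0, col0, rows, cols; } subtensor_t;

/* Divide an H x W input tensor into sub-tensors and map each one to a
 * logical sub-array, invoking `process` to run it on that sub-array. */
void map_subtensors(size_t H, size_t W,
                    void (*process)(subtensor_t t, int subarray)) {
    int next = 0;
    for (size_t r = 0; r < H; r += TILE) {
        for (size_t c = 0; c < W; c += TILE) {
            subtensor_t t = { r, c,
                              (H - r < TILE) ? H - r : TILE,
                              (W - c < TILE) ? W - c : TILE };
            process(t, next);                 /* run on this sub-array */
            next = (next + 1) % NUM_SUBARRAYS;
        }
    }
}
```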
Abstract:
Providing content-aware cache replacement and insertion policies in processor-based devices is disclosed. In some aspects, a processor-based device comprises a cache memory device and a cache controller circuit of the cache memory device. The cache controller circuit is configured to determine a plurality of content costs for each of a plurality of cached data values in the cache memory device, based on a plurality of bit values of each of the plurality of cached data values. The cache controller circuit is configured to identify, based on the plurality of content costs, a cached data value of the plurality of cached data values associated with a lowest content cost as a target cached data value. The cache controller circuit is also configured to evict the target cached data value from the cache memory device.
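A C sketch of the eviction choice described above. The content cost here is simply the number of set bits, on the assumption that mostly-zero values are cheap to handle; the actual cost function in the disclosure may differ, and both function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative content cost based on the bit values of a cached data
 * value: count of set bits (placeholder for the disclosed cost model). */
static unsigned content_cost(uint64_t value) {
    unsigned cost = 0;
    while (value) { cost += (unsigned)(value & 1u); value >>= 1; }
    return cost;
}

/* Identify the cached data value with the lowest content cost as the
 * target to evict; returns its index. */
size_t select_victim(const uint64_t *lines, size_t n) {
    size_t victim = 0;
    unsigned best = content_cost(lines[0]);
    for (size_t i = 1; i < n; ++i) {
        unsigned c = content_cost(lines[i]);
        if (c < best) { best = c; victim = i; }
    }
    return victim;
}
```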
Abstract:
Various embodiments include methods and devices for virtual cache coherency. Embodiments may include receiving a snoop for a physical address from a coherent processing device, determining whether an entry for the physical address corresponding to a virtual address in a virtual cache exists in a snoop filter, and sending a cache coherency operation to the virtual cache in response to determining that the entry exists in the snoop filter.
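A C sketch of the snoop-filter check described above: the filter maps physical addresses to the virtual addresses under which lines are cached, and a coherency operation is sent to the virtual cache only on a hit. The entry layout, filter size, and linear search are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative snoop-filter entry: physical address and the virtual
 * address corresponding to it in the virtual cache. */
typedef struct { uintptr_t pa, va; bool valid; } sf_entry_t;

#define SF_SIZE 256

/* On a snoop for a physical address from a coherent processing device:
 * determine whether an entry exists in the snoop filter and, if so,
 * send a cache coherency operation to the virtual cache. */
void handle_snoop(sf_entry_t filter[SF_SIZE], uintptr_t pa,
                  void (*coherency_op)(uintptr_t va)) {
    for (size_t i = 0; i < SF_SIZE; ++i) {
        if (filter[i].valid && filter[i].pa == pa) {
            coherency_op(filter[i].va);   /* forward to the virtual cache */
            return;
        }
    }
    /* miss: no entry, so no operation is sent to the virtual cache */
}
```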
Abstract:
Aspects include computing devices, apparatus, and methods implemented by the apparatus for implementing dynamic input/output (I/O) coherent workload processing on a computing device. Aspect methods may include offloading, by a processing device, a workload to a hardware accelerator for execution using an I/O coherent mode, detecting a dynamic trigger for switching from the I/O coherent mode to a non-I/O coherent mode while the workload is executed by the hardware accelerator, and switching from the I/O coherent mode to the non-I/O coherent mode while the workload is executed by the hardware accelerator.
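A C sketch of the mode switch described above as a simple monitoring loop. The hooks `dynamic_trigger_fired`, `set_accelerator_mode`, and `workload_done` are hypothetical placeholders for whatever interface a real driver would expose.

```c
#include <stdbool.h>

typedef enum { IO_COHERENT, NON_IO_COHERENT } coherency_mode_t;

/* Hypothetical driver hooks, not a real API. */
extern bool dynamic_trigger_fired(void);   /* e.g., an elapsed-time threshold */
extern void set_accelerator_mode(coherency_mode_t m);
extern bool workload_done(void);

/* Offload starts in I/O coherent mode; on the dynamic trigger, switch
 * to non-I/O coherent mode while the accelerator keeps executing. */
void run_offloaded_workload(void) {
    coherency_mode_t mode = IO_COHERENT;
    set_accelerator_mode(mode);
    while (!workload_done()) {
        if (mode == IO_COHERENT && dynamic_trigger_fired()) {
            mode = NON_IO_COHERENT;        /* switch without stopping work */
            set_accelerator_mode(mode);
        }
    }
    /* completion in non-I/O coherent mode would typically be followed
     * by explicit cache maintenance on the processing device side */
}
```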
Abstract:
Various aspects are described herein. In some aspects, the present disclosure provides a method of communicating data between an electronic unit of a system-on-chip (SoC) and a dynamic random access memory (DRAM). The method includes initiating a memory transaction corresponding to first data. The method includes determining a non-unique first signature and a unique second signature associated with the first data based on content of the first data. The method includes determining if the non-unique first signature is stored in at least one of a local buffer on the SoC separate from the DRAM or the DRAM. The method includes determining if the unique second signature is stored in at least one of the local buffer or the DRAM based on determining the non-unique first signature is stored. The method includes eliminating the memory transaction with respect to the DRAM based on determining the unique second signature is stored.
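A C sketch of the two-stage check described above: a short non-unique signature screens candidates cheaply, and a wider unique signature is consulted only on a first-stage hit to confirm that the transaction can be eliminated. All function names and signature widths are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical signature and lookup primitives; lookups may consult a
 * local buffer on the SoC, the DRAM, or both. */
extern uint16_t short_signature(const void *data, size_t len);  /* non-unique */
extern uint64_t long_signature(const void *data, size_t len);   /* unique */
extern bool lookup_short(uint16_t s);
extern bool lookup_long(uint64_t s);

/* Returns true if the DRAM transaction for this data can be eliminated
 * because matching signatures indicate the content is already stored. */
bool try_eliminate_transaction(const void *data, size_t len) {
    if (!lookup_short(short_signature(data, len)))
        return false;      /* first filter misses: transaction proceeds */
    return lookup_long(long_signature(data, len));  /* confirm duplicate */
}
```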
Abstract:
Providing memory management unit (MMU) partitioned translation caches, and related apparatuses, methods, and computer-readable media. In this regard, an apparatus comprising an MMU is provided. The MMU comprises a translation cache providing a plurality of translation cache entries defining address translation mappings. The MMU further comprises a partition descriptor table providing a plurality of partition descriptors defining a corresponding plurality of partitions each comprising one or more translation cache entries of the plurality of translation cache entries. The MMU also comprises a partition translation circuit configured to receive a memory access request from a requestor. The partition translation circuit is further configured to determine a translation cache partition identifier (TCPID) of the memory access request, identify one or more partitions of the plurality of partitions based on the TCPID, and perform the memory access request on a translation cache entry of the one or more partitions.
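A C sketch of the partitioned lookup described above, with partition descriptors modeled as contiguous ranges of translation cache entries and the TCPID selecting a single partition; the structure layouts, the range-based descriptors, and the TCPID-to-partition mapping are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative translation cache entry and partition descriptor. */
typedef struct { uintptr_t va, pa; bool valid; } tc_entry_t;
typedef struct { size_t first, count; } partition_desc_t;  /* entry range */

#define NUM_PARTITIONS 4
extern tc_entry_t translation_cache[];
extern partition_desc_t partition_table[NUM_PARTITIONS];

/* Perform the lookup only within the partition identified by the
 * requestor's TCPID, so one requestor cannot evict or probe another
 * requestor's address translation mappings. */
bool partitioned_lookup(uint8_t tcpid, uintptr_t va, uintptr_t *pa_out) {
    partition_desc_t p = partition_table[tcpid % NUM_PARTITIONS];
    for (size_t i = p.first; i < p.first + p.count; ++i) {
        if (translation_cache[i].valid && translation_cache[i].va == va) {
            *pa_out = translation_cache[i].pa;
            return true;                /* hit within own partition */
        }
    }
    return false;                       /* miss: walk tables, fill here */
}
```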