Abstract:
A processor maintains an access log indicating a stream of cache misses at a cache of the processor. In response to each of at least a subset of cache misses at the cache, the processor records a corresponding entry in the access log, indicating a physical memory address of the memory access request that resulted in the corresponding miss. In addition, the processor maintains an address translation log that indicates a mapping of physical memory addresses to virtual memory addresses. In response to an address translation (e.g., a page walk) that translates a virtual address to a physical address, the processor stores a mapping of the physical address to the corresponding virtual address at an entry of the address translation log. Software executing at the processor can use the two logs for memory management.
Abstract:
Existing multiprocessor computing systems often have insufficient memory coherency and, consequently, are unable to efficiently utilize separate memory systems. Specifically, a CPU cannot effectively write to a block of memory and then have a GPU access that memory unless there is explicit synchronization. In addition, because the GPU is forced to statically split memory locations between itself and the CPU, existing multiprocessor computing systems are unable to efficiently utilize the separate memory systems. Embodiments described herein overcome these deficiencies by receiving a notification within the GPU that the CPU has finished processing data that is stored in coherent memory, and invalidating data in the CPU caches that the GPU has finished processing from the coherent memory. Embodiments described herein also include dynamically partitioning a GPU memory into coherent memory and local memory through use of a probe filter.
Abstract:
A method and system for allocating memory to a memory operation executed by a processor in a computer arrangement having a first processor configured for unified operation with a second processor. The method includes receiving a memory operation from a processor and mapping the memory operation to one of a plurality of memory heaps. The mapping produces a mapping result. The method also includes providing the mapping result to the processor.
Abstract:
The present invention provides for page table access and dirty bit management in hardware via a new atomic test[0] and OR and Mask. The present invention also provides for a gasket that enables ACE to CCI translations. This gasket further provides request translation between ACE and CCI, deadlock avoidance for victim and probe collision, ARM barrier handling, and power management interactions. The present invention also provides a solution for ARM victim/probe collision handling which deadlocks the unified northbridge. These solutions includes a dedicated writeback virtual channel, probes for IO requests using 4-hop protocol, and a WrBack Reorder Ability in MCT where victims update older requests with data as they pass the requests.
Abstract:
A processing system including a parallel processing unit selectively allocating pages of memory for interleaving across configurable subsets of channels based on a mode of allocation. In some embodiments, in a first mode, a page of memory is allocated to and interleaved across a plurality of channels, and in a second mode, a page of memory is allocated to and interleaved across a subset of the plurality of channels.
Abstract:
The disclosed computer-implemented method can include reaching, by a chiplet involved in carrying out an operation for a process, a synchronization barrier. The method can additionally include receiving, by the chiplet, dedicated control messages pushed to the chiplet by other chiplets involved in carrying out the operation for the process, wherein the dedicated control messages are pushed over a control network by the other chiplets. The method can also include advancing, by the chiplet, the synchronization barrier in response to receipt of the dedicated control messages. Various other methods, systems, and computer-readable media are also disclosed.
Abstract:
A method and apparatus of managing memory includes storing a first memory page at a shared memory location in response to the first memory page including data shared between a first virtual machine and a second virtual machine. A second memory page is stored at a memory location unique to the first virtual machine in response to the second memory page including data unique to the first virtual machine. The first memory page is accessed by the first virtual machine and the second virtual machine, and the second memory page is accessed by the first virtual machine and not the second virtual machine.
Abstract:
A method and system for allocating memory to a memory operation executed by a processor in a computer arrangement having a plurality of processors. The method includes receiving a memory operation from a processor that references an address in a shared memory, mapping the received memory operation to at least one virtual memory pool to produce a mapping result, and providing the mapping result to the processor.
Abstract:
Systems and methods for dynamic repartitioning of physical memory address mapping involve relocating data stored at one or more physical memory locations of one or more memory devices to another memory device or mass storage device, repartitioning one or more corresponding physical memory maps to include new mappings between physical memory addresses and physical memory locations of the one or more memory devices, then loading the relocated data back onto the one or more memory devices at physical memory locations determined by the new physical address mapping. Such dynamic repartitioning of the physical memory address mapping does not require a processing system to be rebooted and has various applications in connection with interleaving reconfiguration and error correcting code (ECC) reconfiguration of the processing system.
Abstract:
A technique for recovering from a hang in a virtualized accelerated processing device (“APD”) is provided. In the virtualization scheme, different virtual machines are assigned different “time-slices” in which to use the APD. When a time-slice expires, the APD stops operations for a current VM and starts operations for another VM. To stop operations on the APD, a virtualization scheduler sends a request to idle the APD. The APD responds by completing work and idling. If one or more portions of the APD do not complete this idling process before a timeout expires, then a hang occurs. In response to the hang, the virtualization scheduler informs the hypervisor that a hang has occurred. The hypervisor performs a function level reset on the APD and informs the VM that the hang has occurred. The VM responds by stopping command issue to the APD and re-initializing the APD for the function.