摘要:
Processors and methods implementing a machine instruction to perform cache line demotion on multiple cache lines to enable efficient sharing of cache lines between processor cores. One general aspect includes a processor comprising: a plurality of hardware processor cores, where each of the hardware processor cores to include a first cache. The processor also includes a second cache, communicatively coupled to and shared by the plurality of hardware processor cores. The processor to support a first machine instruction, the first machine instruction to include a vector register operand identifying a vector register which contains a plurality of data elements each used to identify a cache line. An execution of the first machine instruction by one of the plurality of hardware processor cores to cause a plurality of identified cache lines to be demoted, such that the demoted cache lines are moved from the first cache to the second cache.
摘要:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including and L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture system, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosure for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
摘要:
Apparatus, method, and system for implementing a software-transparent hardware predictor for core-to-core data communication optimization are described herein. An embodiment of the apparatus includes a plurality of hardware processor cores each including a private cache; a shared cache that is communicatively coupled to and shared by the plurality of hardware processor cores; and a predictor circuit. The predictor circuit is to track activities relating to a plurality of monitored cache lines in the private cache of a producer hardware processor core (producer core) and to enable a cache line push operation upon determining a target hardware processor core (target core) based on the tracked activities. An execution of the cache line push operation is to cause a plurality of unmonitored cache lines in the private cache of the producer core to be moved to the private cache of the target core.
摘要:
Apparatus and methods implementing a hardware queue management device for reducing inter-core data transfer overhead by offloading request management and data coherency tasks from the CPU cores. The apparatus include multi-core processors, a shared L3 or last-level cache (“LLC”), and a hardware queue management device to receive, store, and process inter-core data transfer requests. The hardware queue management device further comprises a resource management system to control the rate in which the cores may submit requests to reduce core stalls and dropped requests. Additionally, software instructions are introduced to optimize communication between the cores and the queue management device.
摘要:
Technologies for identifying a cache line of a network packet for eviction from an on-processor cache of a network device communicatively coupled to a network controller. The network device is configured to determine whether a cache line of the cache corresponding to the network packet is to be evicted from the cache based on a determination that the network packet is not needed subsequent to processing the network packet, and provide an indication that the cache line is to be evicted from the cache based on an eviction policy received from the network controller.
摘要:
Technologies for managing network flow lookups of a network device include a network controller and a target device, each communicatively coupled to the network device. The network device includes a cache for a processor of the network device and a main memory. The network device additionally includes a multi-level hash table having a first-level hash table stored in the cache of the network device and a second-level hash table stored in the main memory of the network device. The network device is configured to determine whether to store a network flow hash corresponding to a network flow indicating the target device in the first-level or second-level hash table based on a priority of the network flow provided to the network device by the network controller.
摘要:
Methods and apparatus to support multiple-writer/multiple-reader concurrency for software flow/packet classification on general purpose multi-core systems. A flow table with rows mapped to respective hash buckets with multiple entry slots is implemented in memory of a host platform with multiple cores, with each bucket being associated with a version counter. Multiple writer and reader threads are run on the cores, with writers providing updates to the flow table data. In connection with inserting new key data, a determination is made to which buckets will be changed, and access rights to those buckets are acquired prior to making any changes. For example, under a flow table employing cuckoo hashing, access rights are acquired to buckets along a full cuckoo path. Once the access rights are obtained, a writer is enabled to update data in the applicable buckets to effect entry of the new key data, while other writer threads are prevented from changing any of these buckets, but may concurrently insert or modify key data in other buckets.
摘要:
Methods and apparatus to support multiple-writer/multiple-reader concurrency for software flow/packet classification on general purpose multi-core systems. A flow table with rows mapped to respective hash buckets with multiple entry slots is implemented in memory of a host platform with multiple cores, with each bucket being associated with a version counter. Multiple writer and reader threads are run on the cores, with writers providing updates to the flow table data. In connection with inserting new key data, a determination is made to which buckets will be changed, and access rights to those buckets are acquired prior to making any changes. For example, under a flow table employing cuckoo hashing, access rights are acquired to buckets along a full cuckoo path. Once the access rights are obtained, a writer is enabled to update data in the applicable buckets to effect entry of the new key data, while other writer threads are prevented from changing any of these buckets, but may concurrently insert or modify key data in other buckets.
摘要:
Methods, apparatus and systems for improved performance and energy efficiency of software-based routers. A software router running on a host computer system employing multiple Network Interface Controllers (NICs) maintains a routing table wherein packet flows are classified as managed flows (MFs) under which packets are received at and forwarded from the same NIC and unmanaged flows UFs under which packets are received at and forwarded from different NICs. Forwarding table data is employed by a NIC to facilitate packet identification and flow classification operations under which the NIC determines whether a received packet is an MF, UF, or an unclassified flow. Under various schemes, packet forwarding for MFs is handled by the software router architecture such that either only the packet header is copied into memory in the host or the entire packet forwarding is handled by the NIC.
摘要:
Technologies for supporting concurrency of a flow lookup table at a network device. The flow lookup table includes a plurality of candidate buckets that each includes one or more entries. The network device includes a flow lookup table write module configured to perform a displacement operation of a key/value pair to move the key/value pair from one bucket to another bucket via an atomic instruction and increment a version counter associated with the buckets affected by the displacement operation. The network device additionally includes a flow lookup table read module to check the version counters during a lookup operation on the flow lookup table to determine whether a displacement operation is affecting the presently read value of the buckets. Other embodiments are described herein and claimed.