Abstract:
Systems described herein operate to improve network performance in a multi-tenant cloud computing environment. Systems can include communication circuitry and processing circuitry to generate a phase sequence matrix that indicates the identity and number of phases of a workload by measuring resources of a compute node during execution of the workload throughout a lifetime of the workload. The processing circuitry can generate a workload fingerprint that includes the phase sequence matrix and a phase residency matrix. The phase residency matrix can indicate the fraction of execution time of the workload spent in each phase identified in the phase sequence matrix. A cloud controller can access the workload fingerprint for multiple workloads operating on multiple compute nodes in the cloud cluster to adjust workload allocations based at least on these workload fingerprints and on whether service level objectives (SLOs) are being met.
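As a rough illustration of the fingerprinting idea, the sketch below builds a phase sequence (a simple list standing in for the phase sequence matrix) and derives per-phase residency fractions from it. The class name, the toy phase classifier, and the resource metrics are assumptions for illustration, not the patented mechanism.

```python
from collections import Counter

class WorkloadFingerprint:
    """Toy fingerprint: a phase sequence plus per-phase residency fractions."""
    def __init__(self):
        self.phase_sequence = []          # one phase id per resource sample

    def record_sample(self, cpu_util, mem_bw):
        # Hypothetical phase classifier: bucket samples by coarse resource mix.
        phase = "memory-bound" if mem_bw > cpu_util else "compute-bound"
        self.phase_sequence.append(phase)

    def residency(self):
        # Fraction of execution time spent in each identified phase.
        counts = Counter(self.phase_sequence)
        total = len(self.phase_sequence)
        return {phase: n / total for phase, n in counts.items()}

fp = WorkloadFingerprint()
for cpu, bw in [(0.9, 0.2), (0.9, 0.3), (0.1, 0.8)]:
    fp.record_sample(cpu, bw)
print(fp.residency())  # {'compute-bound': 0.666..., 'memory-bound': 0.333...}
```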
Abstract:
Systems, apparatuses and methods may provide for technology that identifies a first set of compute nodes and a second set of compute nodes, wherein the first set of compute nodes execute more slowly than the second set of compute nodes. The technology may also automatically determine a compute node configuration that results in a relatively low difference in completion time between the first set of compute nodes and the second set of compute nodes with respect to a neural network workload. In an example, the technology applies the compute node configuration to an execution of the neural network workload on one or more nodes in the first set of compute nodes and one or more nodes in the second set of compute nodes.
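One way to picture the configuration search is the sketch below: given throughput estimates for the slower and faster node sets, it scans candidate per-node batch splits for the one that minimizes the completion-time gap. The linear cost model and all names are illustrative assumptions, not the claimed determination method.

```python
def estimate_step_time(batch_size, throughput):
    # Hypothetical cost model: seconds per step at a given samples/sec rate.
    return batch_size / throughput

def balance_configuration(slow_tput, fast_tput, total_batch):
    # Try every split of the global batch between slow and fast node sets,
    # keeping the one with the smallest completion-time difference.
    best = None
    for slow_share in range(1, total_batch):
        fast_share = total_batch - slow_share
        gap = abs(estimate_step_time(slow_share, slow_tput)
                  - estimate_step_time(fast_share, fast_tput))
        if best is None or gap < best[0]:
            best = (gap, slow_share, fast_share)
    return best

gap, slow_bs, fast_bs = balance_configuration(slow_tput=100, fast_tput=300,
                                              total_batch=256)
print(slow_bs, fast_bs)  # -> 64 192 (slower nodes get 1/4 of the batch)
```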
Abstract:
Technologies for adapting a communication protocol (e.g., TCP/IP, UDP, etc.) to network communications between endpoints (e.g., accelerated kernels configured within accelerator devices) include a sled having a compute engine. The compute engine monitors telemetry data associated with one or more network communications between a given kernel and another kernel. The network communications are established via a given communication protocol. The compute engine determines, as a function of the monitored telemetry data, that a condition to change the network communications from the communication protocol to another communication protocol is triggered. The compute engine shifts the network communications to the other communication protocol.
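The trigger logic might be pictured as follows; the telemetry fields, thresholds, and the specific TCP-to-UDP switch are illustrative assumptions rather than the claimed condition.

```python
def choose_protocol(current, telemetry,
                    loss_threshold=0.02, latency_threshold_ms=5.0):
    # Shift the kernel-to-kernel channel when monitored telemetry crosses
    # a threshold (the "condition to change the network communications").
    if current == "TCP" and telemetry["retransmit_rate"] > loss_threshold:
        return "UDP"
    if current == "UDP" and telemetry["latency_ms"] > latency_threshold_ms:
        return "TCP"
    return current

proto = choose_protocol("TCP", {"retransmit_rate": 0.05, "latency_ms": 1.2})
print(proto)  # -> "UDP": loss condition triggered, communications shifted
```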
Abstract:
Technologies for providing adaptive platform quality of service include a compute device. The compute device is to obtain class of service data for an application to be executed and to execute the application. As the application is executed, the compute device determines a present phase of the application as a function of one or more resource utilizations of the application, and sets a present class of service for the application as a function of the determined phase, wherein the present class of service is within a range associated with the determined phase. The compute device further determines whether a present performance metric of the application satisfies a target performance metric and, in response to a determination that the present performance metric does not satisfy the target performance metric, increments the present class of service to a higher class of service in the range. Other embodiments are also described and claimed.
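A minimal control-loop sketch of this behavior follows, with invented phase names, class-of-service ranges, and performance metrics:

```python
# Hypothetical per-phase class-of-service (CoS) ranges.
PHASE_COS_RANGES = {"compute": (2, 4), "memory": (4, 7)}

def detect_phase(cpu_util, mem_bw):
    # Toy phase detection from resource utilizations.
    return "memory" if mem_bw > cpu_util else "compute"

def adjust_cos(cos, phase, perf, target):
    lo, hi = PHASE_COS_RANGES[phase]
    cos = max(lo, min(cos, hi))      # keep CoS within the phase's range
    if perf < target and cos < hi:   # target not met: raise the CoS
        cos += 1
    return cos

cos = 2
cos = adjust_cos(cos, detect_phase(0.3, 0.8), perf=0.7, target=0.9)
print(cos)  # -> 5 (clamped into the memory phase's range, then incremented)
```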
Abstract:
An embodiment of network coordinator apparatus may include a node provisioner to provision each of a plurality of low power nodes, a node associater to create a first association for each of the plurality of low power nodes, and a node coordinator communicatively coupled to the node provisioner and the node associater to coordinate the plurality of low power nodes.
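A structural sketch of the three roles, with class and method names invented for illustration:

```python
class NetworkCoordinator:
    def __init__(self):
        self.nodes = {}

    def provision(self, node_id, config):    # node provisioner role
        self.nodes[node_id] = {"config": config, "association": None}

    def associate(self, node_id, group):     # node associater role
        self.nodes[node_id]["association"] = group  # first association

    def coordinate(self):                    # node coordinator role
        by_group = {}
        for node_id, info in self.nodes.items():
            by_group.setdefault(info["association"], []).append(node_id)
        return by_group

nc = NetworkCoordinator()
nc.provision("sensor-1", {"duty_cycle": 0.1})
nc.associate("sensor-1", "group-a")
print(nc.coordinate())  # {'group-a': ['sensor-1']}
```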
Abstract:
Embodiments detailed herein include an apparatus that includes a reliability assessment engine (RAE) stored in non-volatile memory and processing circuitry to execute the RAE to: receive data of at least one physical condition from a plurality of intra-die variation monitoring circuits, apply the received data to at least one reliability physics model, and calculate at least one of an estimated amount of lifetime consumed and an estimated amount of lifetime remaining.
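To make the flow concrete, here is a toy sketch assuming an Arrhenius-style thermal aging model as the reliability physics model; the constants, the model choice, and the rated lifetime are all assumptions, not the patent's.

```python
import math

BOLTZMANN_EV = 8.617e-5          # Boltzmann constant, eV/K
ACTIVATION_ENERGY_EV = 0.7       # assumed activation energy
T_REF_K = 358.0                  # assumed reference junction temp (85 C)
RATED_LIFETIME_HOURS = 87600.0   # assumed 10-year rating at T_REF

def aging_acceleration(temp_k):
    # Arrhenius acceleration factor relative to the reference temperature.
    return math.exp((ACTIVATION_ENERGY_EV / BOLTZMANN_EV)
                    * (1.0 / T_REF_K - 1.0 / temp_k))

def lifetime_estimate(samples):
    # samples: (temperature_K, hours) pairs from intra-die monitors.
    consumed = sum(hours * aging_acceleration(t) for t, hours in samples)
    return consumed, max(RATED_LIFETIME_HOURS - consumed, 0.0)

used, left = lifetime_estimate([(368.0, 1000.0), (348.0, 5000.0)])
print(round(used), round(left))  # effective hours consumed / remaining
```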
Abstract:
Examples may include techniques to schedule a workload to one or more computing resources of a data center. A class is determined for the workload based on a workload type or profile for the workload. Predicted operating values for at least one of the one or more computing resources are determined based on the class, and the predicted operating values are used as inputs in at least one scoring model to evaluate the workload being supported by the at least one of the one or more computing resources. The workload is then scheduled to the at least one of the one or more computing resources based on the evaluation.
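The sketch below follows that flow end to end: classify the workload, look up predicted operating values for the class, score each candidate resource, and schedule to the best-scoring one. The class table, scoring weights, and node attributes are invented for illustration.

```python
PREDICTED = {  # class -> predicted operating values (toy table)
    "cpu-bound": {"power_w": 180, "temp_c": 72},
    "io-bound":  {"power_w": 120, "temp_c": 58},
}

def classify(profile):
    # Determine a class from the workload profile.
    return "cpu-bound" if profile["cpu_util"] > profile["io_util"] else "io-bound"

def score(resource, predicted):
    # Toy scoring model: favor power headroom, penalize thermal overshoot.
    headroom = resource["power_cap_w"] - predicted["power_w"]
    return headroom - 2.0 * max(predicted["temp_c"] - resource["temp_limit_c"], 0)

def schedule(workload, resources):
    predicted = PREDICTED[classify(workload)]
    return max(resources, key=lambda r: score(r, predicted))

nodes = [{"name": "n1", "power_cap_w": 200, "temp_limit_c": 80},
         {"name": "n2", "power_cap_w": 250, "temp_limit_c": 75}]
print(schedule({"cpu_util": 0.9, "io_util": 0.2}, nodes)["name"])  # -> n2
```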
Abstract:
Technologies for distributed durable data replication include a computing device having persistent memory that stores a memory state and an update log. The computing device isolates a host partition from a closure partition. The computing device may sequester one or more processor cores for use by the closure partition. The host partition writes transaction records to the update log prior to writing state changes to persistent memory. A replication service asynchronously transmits log records to a remote computing device, which establishes a replica update log in persistent memory. If the host partition fails, the closure partition transmits remaining log records from the update log to the remote computing device. The update log may be quickly replayed when recovering the computing device from failure. The remote computing device may also replay the replica update log to update a remote copy of the state data. Other embodiments are described and claimed.
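A minimal write-ahead-log sketch of the host-partition flow: append a transaction record to the update log before mutating state, and let a replica replay shipped records. Persistence, the closure partition, and asynchronous transport are out of scope of this illustration, and all names are hypothetical.

```python
class DurableStore:
    def __init__(self):
        self.update_log = []   # stands in for the persistent-memory log
        self.state = {}

    def put(self, key, value):
        self.update_log.append(("put", key, value))  # log record first...
        self.state[key] = value                      # ...then the state change

    def replay(self, log):
        # Recovery, or a remote replica, replays log records in order.
        for op, key, value in log:
            if op == "put":
                self.state[key] = value

primary, replica = DurableStore(), DurableStore()
primary.put("x", 1)
replica.replay(primary.update_log)   # shipped asynchronously in practice
print(replica.state)                 # {'x': 1}
```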
Abstract:
An embodiment includes determining a first power metric (e.g., memory module temperature) corresponding to a group of computing nodes that includes first and second computing nodes; and distributing a computing task to a third computing node (e.g., load balancing) in response to the determined first power metric; wherein the third computing node is located remotely from the first and second computing nodes. The first power metric may be specific to the group of computing nodes and is not specific to either of the first and second computing nodes. Such an embodiment may leverage knowledge of computing node group behavior, such as power consumption, to more efficiently manage power consumption in computing node groups. This “power tuning” may rely on data taken at the “silicon level” (e.g., an individual computing node such as a server) and/or a large group level (e.g., data center). Other embodiments are described herein.
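An illustrative sketch of that decision: if a group-level power metric (here, average memory module temperature across the group holding the first and second nodes) exceeds a budget, route the next task to the remote third node. The threshold and node attributes are invented.

```python
def pick_node(task, local_group, remote_node, temp_budget_c=70.0):
    # Group-specific metric: averaged over the group, not per node.
    group_temp = sum(n["mem_temp_c"] for n in local_group) / len(local_group)
    if group_temp > temp_budget_c:
        return remote_node   # load-balance away from the hot group
    return min(local_group, key=lambda n: n["load"])

group = [{"name": "n1", "mem_temp_c": 74, "load": 0.5},
         {"name": "n2", "mem_temp_c": 71, "load": 0.3}]
print(pick_node("job-42", group, {"name": "n3-remote"})["name"])  # -> n3-remote
```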
Abstract:
In an embodiment, a processor includes a fuzzy thermoelectric cooling (TEC) controller to: obtain a current TEC level associated with the processor; obtain a current fan power level associated with the processor; fuzzify the current TEC level to obtain a first fuzzy TEC level; fuzzify the current fan power level to obtain a second fuzzy fan level; determine a new TEC power level based at least in part on the first fuzzy TEC level, the second fuzzy fan level, and a plurality of fuzzy rules; and provide the new TEC power level to a TEC device associated with the processor, where the TEC device is to transfer heat from the processor to a heat sink. Other embodiments are described and claimed.
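A toy fuzzy-control sketch of the described loop: fuzzify the current TEC and fan levels into LOW/HIGH memberships, fire a small rule table, and defuzzify to a new TEC power level. The membership functions, rules, and weighted-average defuzzification are invented stand-ins for the claimed plurality of fuzzy rules.

```python
def fuzzify(level):
    # Map a normalized level in [0, 1] to LOW/HIGH membership degrees.
    return {"LOW": max(1.0 - level, 0.0), "HIGH": min(level, 1.0)}

RULES = {  # (TEC term, fan term) -> suggested TEC output level
    ("LOW", "LOW"): 0.2, ("LOW", "HIGH"): 0.5,
    ("HIGH", "LOW"): 0.6, ("HIGH", "HIGH"): 0.9,
}

def new_tec_level(tec_level, fan_level):
    tec_fz, fan_fz = fuzzify(tec_level), fuzzify(fan_level)
    num = den = 0.0
    for (t, f), out in RULES.items():
        w = min(tec_fz[t], fan_fz[f])   # rule firing strength
        num += w * out
        den += w
    return num / den if den else tec_level  # weighted-average defuzzification

print(round(new_tec_level(0.3, 0.8), 3))  # -> 0.557
```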