Abstract:
A computer-implemented method includes: detecting, by one or more processors, an indication that suggests a node has crashed, wherein the node is included in a distributed computing environment; in response to the detecting, confirming by the one or more processors whether the node has crashed by performing a set of probes on the node; and in response to the confirming that the node has crashed, initiating by the one or more processors a remediation of the node.
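The detect, confirm, remediate sequence described above can be illustrated with a minimal sketch. The abstract does not say how probe results are combined; the rule used here (the node counts as crashed only when at least one probe ran and none of them got a response) is an assumption, and the names `confirm_crash` and `handle_indication` are hypothetical.

```python
from typing import Callable, Iterable


def confirm_crash(probes: Iterable[Callable[[], bool]]) -> bool:
    """Return True only if at least one probe ran and none reached the node.

    Each probe is a callable returning True when the node responded.
    The "all probes must fail" rule is an assumption, not from the abstract.
    """
    results = [probe() for probe in probes]
    return bool(results) and not any(results)


def handle_indication(
    probes: Iterable[Callable[[], bool]],
    remediate: Callable[[], None],
) -> bool:
    """Called when a crash indication is detected.

    Remediation is initiated only after the probes confirm the crash.
    Returns True if remediation was initiated.
    """
    if confirm_crash(probes):
        remediate()
        return True
    return False
```

Separating the cheap indication from the probe-based confirmation keeps a single missed heartbeat from triggering remediation on a healthy node.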
Abstract:
Component power consumption is collected from each of a plurality of controllers of a node having a plurality of components. The component power consumption is provided to each of the plurality of controllers. A power differential is determined as a difference between a power cap for an apparatus and a total power consumption for the apparatus based, at least in part, on the component power consumption. A proportion of the total power consumption corresponding to at least one component associated with at least one component controller is determined. A local power budget is computed for the at least one component based, at least in part, on the power differential and the proportion of the total power consumption corresponding to the at least one component. A failure associated with the at least one component controller or the at least one component is determined.
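As a rough illustration of the budget computation, the sketch below derives the power differential and each component's proportion of total consumption. The abstract only says that the local budget is based on these two quantities; the specific rule used here (current consumption plus a proportional share of the differential) is an assumption, as is the function name `local_power_budgets`.

```python
from typing import Dict


def local_power_budgets(consumption: Dict[str, float], power_cap: float) -> Dict[str, float]:
    """Compute a per-component power budget in watts.

    consumption maps a component name to its measured power draw.
    Assumed rule: budget = consumption + differential * proportion.
    """
    total = sum(consumption.values())
    if total <= 0:
        raise ValueError("total power consumption must be positive")

    # Positive when there is headroom under the cap, negative when over it.
    differential = power_cap - total

    budgets = {}
    for name, watts in consumption.items():
        proportion = watts / total
        budgets[name] = watts + differential * proportion
    return budgets
```

Under this assumed rule the budgets always sum to the power cap, so headroom is handed out, and overage is clawed back, in proportion to what each component currently draws.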
Abstract:
Virtual servers are monitored in real-time. A group of virtual servers from virtual server events occurring within a time window is identified by a computer system in real-time. A metric is determined for the group of virtual servers by the computer system in real-time using the virtual server events occurring within the time window for the group of virtual servers. A set of actions is performed by the computer system using the metric.
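A minimal sketch of the windowed grouping follows. The event layout `(timestamp, server_id, group_key, value)`, the use of a key carried in the event to identify the group, and the choice of the mean as the metric are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[float, str, str, float]  # (timestamp, server_id, group_key, value)


def group_metrics(events: Iterable[Event], window_start: float, window_end: float) -> Dict[str, float]:
    """Return the mean event value per group for events inside the window.

    The window is half-open: window_start <= timestamp < window_end.
    """
    values = defaultdict(list)
    for timestamp, _server_id, group_key, value in events:
        if window_start <= timestamp < window_end:
            values[group_key].append(value)
    return {group: sum(vals) / len(vals) for group, vals in values.items()}


def groups_over_threshold(metrics: Dict[str, float], threshold: float) -> list:
    """Pick the groups whose metric exceeds a (hypothetical) action threshold."""
    return sorted(group for group, metric in metrics.items() if metric > threshold)
```

The second function stands in for the "set of actions" step: it only selects the groups an action would apply to, leaving the action itself open.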
Abstract:
Controlling server power usage in a data center is provided. Power usage among a plurality of server racks in active mode processing a set of workloads in the data center is managed. It is detected that a new server rack in standby mode is being added to the plurality of server racks. It is ensured that the new server rack in the standby mode is properly controlled and monitored prior to transitioning the new server rack to the active mode. It is determined whether power safety criteria are met to safely join the new server rack to the plurality of server racks prior to transitioning the new server rack from the standby mode to the active mode. The new server rack is transitioned to the active mode without exceeding a power budget for the plurality of server racks in response to determining that the power safety criteria are met.
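The gating decision can be sketched as a single check. The abstract does not enumerate the power safety criteria; the two used here (the standby rack is already under control and monitoring, and its expected draw fits within the remaining budget) are assumptions, and `can_activate` is a hypothetical name.

```python
from typing import Iterable


def can_activate(
    active_rack_watts: Iterable[float],
    new_rack_expected_watts: float,
    power_budget: float,
    controlled_and_monitored: bool,
) -> bool:
    """Decide whether a standby rack may transition to active mode.

    Assumed criteria: the rack must already be controlled and monitored,
    and adding its expected draw must not exceed the shared power budget.
    """
    if not controlled_and_monitored:
        return False
    projected = sum(active_rack_watts) + new_rack_expected_watts
    return projected <= power_budget
```

For example, with active racks drawing 4000 W and 3500 W under a 10000 W budget, a rack expected to draw 2000 W passes the check and one expected to draw 3000 W does not.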
Abstract:
Technology for computing the number of active servers needed over time in a cloud/compute cluster includes the following operations (not necessarily in the following order): (i) determining the capacity of each VCE provisioned on the cloud against the resource guaranteed to that VCE; (ii) forecasting the resource needs over time using historical requests for each VCE flavor; and (iii) using the forecasted resource needs to determine the required number of future servers at some future time. Some embodiments of the present invention use a formula that accounts for the interplay among various parameter values of the VCE flavors and also the mapping of the needs of VCEs of various flavors to the capabilities of physical resources.
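Operations (ii) and (iii) can be illustrated with a deliberately simplified sketch. The abstract does not give the formula it refers to, so this is not that formula: the moving-average forecast, the reduction to a single resource (vCPUs), and the names `forecast_count` and `servers_needed` are all assumptions.

```python
import math
from typing import Dict, List


def forecast_count(history: List[float], window: int = 3) -> float:
    """Forecast the next request count as the mean of the last `window` values."""
    if not history:
        raise ValueError("history must not be empty")
    recent = history[-window:]
    return sum(recent) / len(recent)


def servers_needed(
    history_by_flavor: Dict[str, List[float]],
    vcpus_by_flavor: Dict[str, int],
    server_vcpus: int,
) -> int:
    """Estimate how many servers cover the forecasted vCPU demand.

    Demand is the forecasted count of each flavor times its vCPU size,
    summed over flavors and rounded up to whole servers.
    """
    demand = sum(
        forecast_count(history) * vcpus_by_flavor[flavor]
        for flavor, history in history_by_flavor.items()
    )
    return math.ceil(demand / server_vcpus)
```

A real sizing formula would also need to handle several resource dimensions at once and the packing of flavors onto physical servers, which is the interplay the abstract mentions.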
Abstract:
An apparatus includes a plurality of components and a plurality of component controllers. Each of the plurality of component controllers is associated with at least one component of the plurality of components. Each component controller is configured to compute a local power budget for the at least one component based, at least in part, on the power differential and the proportion of the total power consumption corresponding to the at least one component. A service processor is configured to determine a failure associated with at least one component controller of the plurality of component controllers or the at least one component associated with the at least one component controller. The service processor is configured to, in response to a reset threshold not being exceeded, reset the at least one component controller without interrupting operations of any components of the at least one component that have not failed.
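The reset-threshold behavior of the service processor might look like the sketch below. The abstract does not say what the threshold counts or what happens once it is exceeded; counting resets per controller and returning a hypothetical `"escalate"` outcome are assumptions, as is the class name.

```python
from typing import Dict


class ServiceProcessorSketch:
    """Tracks controller resets and applies an assumed per-controller threshold."""

    def __init__(self, reset_threshold: int) -> None:
        self.reset_threshold = reset_threshold
        self.reset_counts: Dict[str, int] = {}

    def handle_failure(self, controller_id: str) -> str:
        """Return "reset" while under the threshold, otherwise "escalate".

        A reset here affects only the named controller; other controllers
        and their components are not touched.
        """
        count = self.reset_counts.get(controller_id, 0)
        if count >= self.reset_threshold:
            return "escalate"  # hypothetical fallback, not from the abstract
        self.reset_counts[controller_id] = count + 1
        return "reset"
```

Keeping the counter per controller means one repeatedly failing controller cannot use up the reset allowance of the others.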
Abstract:
Systems and techniques that facilitate automated validation of power topology are provided. In various embodiments, a control component can transmit a transition command to a power-distribution node of a data center, wherein the transition command can cause an outlet of the power-distribution node to transition between power states. In various aspects, a verification component can verify that a power-consumption node of the data center is connected to the outlet by comparing a pre-transition power characteristic of the power-consumption node with a post-transition power characteristic of the power-consumption node.
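The verification step can be sketched as a comparison of power readings taken before and after the outlet transition. Using watts as the power characteristic, the `min_delta` noise margin, and the function name `verify_connection` are assumptions for illustration.

```python
def verify_connection(
    pre_watts: float,
    post_watts: float,
    outlet_turned_off: bool,
    min_delta: float = 1.0,
) -> bool:
    """Infer whether a power-consumption node is fed by the toggled outlet.

    If the outlet was switched off, a connected node's draw should drop by
    at least min_delta; if it was switched on, the draw should rise by at
    least min_delta. min_delta is an assumed margin for measurement noise.
    """
    delta = post_watts - pre_watts
    if outlet_turned_off:
        return delta <= -min_delta
    return delta >= min_delta
```

A node whose reading barely changes after the transition is treated as not connected to that outlet, which is how a mislabeled cable would show up.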