-
Publication No.: US20130326038A1
Publication date: 2013-12-05
Application No.: US13489207
Filing date: 2012-06-05
Applicants: Peter Bodik, Ishai Menache, Peter Winkler, Gregory M. Foxman, N. M. Mosharaf Kabir Chowdhury
Inventors: Peter Bodik, Ishai Menache, Peter Winkler, Gregory M. Foxman, N. M. Mosharaf Kabir Chowdhury
IPC classification: G06F15/173
CPC classification: G06F9/4856, H04L41/0668
Abstract: Upon receiving a request to improve one or more conditions of a datacenter network, a fault management system may analyze information of the datacenter network, including communication patterns among services provided in the network. The fault management system determines one or more logical machines associated with one or more services to be moved from one or more devices to one or more other devices of the network. The fault management system may select these one or more logical machines for migration based on a cost function including factors for fault tolerance, bandwidth usage, number of moves and/or response time latency. The fault management system may improve the fault tolerance of the network without significantly affecting the bandwidth usage of the network.
-
Publication No.: US20100241903A1
Publication date: 2010-09-23
Application No.: US12408570
Filing date: 2009-03-20
CPC classification: G06F11/079, G06F11/0709, G06F11/0748, H04L41/0681, H04L43/0823, H04L43/0888
Abstract: The present invention extends to methods, systems, and computer program products for automatically generating and refining health models. Embodiments of the invention use machine learning tools to analyze historical telemetry data from a server deployment. The tools output fingerprints, for example, small groupings of specific metrics-plus-behavioral parameters, that uniquely identify and describe past problem events mined from the historical data. Embodiments automatically translate the fingerprints into health models that can be directly applied to monitoring the running system. Fully automated feedback loops for identifying past problems and giving advance notice as those problems emerge in the future are facilitated without any operator intervention. In some embodiments, a single portion of expert knowledge, for example, Key Performance Indicator (KPI) data, initiates health model generation. Once initiated, the feedback loop can be fully automated to access further telemetry and refine health models based on the further telemetry.
-
Publication No.: US07962797B2
Publication date: 2011-06-14
Application No.: US12408570
Filing date: 2009-03-20
IPC classification: G06F11/00
CPC classification: G06F11/079, G06F11/0709, G06F11/0748, H04L41/0681, H04L43/0823, H04L43/0888
Abstract: The present invention extends to methods, systems, and computer program products for automatically generating and refining health models. Embodiments of the invention use machine learning tools to analyze historical telemetry data from a server deployment. The tools output fingerprints, for example, small groupings of specific metrics-plus-behavioral parameters, that uniquely identify and describe past problem events mined from the historical data. Embodiments automatically translate the fingerprints into health models that can be directly applied to monitoring the running system. Fully automated feedback loops for identifying past problems and giving advance notice as those problems emerge in the future are facilitated without any operator intervention. In some embodiments, a single portion of expert knowledge, for example, Key Performance Indicator (KPI) data, initiates health model generation. Once initiated, the feedback loop can be fully automated to access further telemetry and refine health models based on the further telemetry.
-
Publication No.: US20100306597A1
Publication date: 2010-12-02
Application No.: US12473900
Filing date: 2009-05-28
Applicants: Moises Goldszmidt, Peter Bodik
Inventors: Moises Goldszmidt, Peter Bodik
CPC classification: G06F11/079, G06F11/0709, G06F11/3409, G06F2201/81, H04L41/0681
Abstract: Methods for automatically identifying and classifying a crisis state occurring in a system having a plurality of computer resources. Signals are received from a device that collects the signals from each computer resource in the system. For each epoch, an epoch fingerprint is generated. Upon detecting a performance crisis within the system, a crisis fingerprint is generated consisting of at least one epoch fingerprint. The technology is able to identify that a performance crisis has previously occurred within the datacenter if a generated crisis fingerprint favorably matches any of the model crisis fingerprints stored in a database. The technology may also predict that a crisis is about to occur.
-
Publication No.: US09262216B2
Publication date: 2016-02-16
Application No.: US13372717
Filing date: 2012-02-14
Applicants: Peter Bodik, Andrew D. Ferguson, Srikanth Kandula, Eric Boutin
Inventors: Peter Bodik, Andrew D. Ferguson, Srikanth Kandula, Eric Boutin
IPC classification: G06F9/48
CPC classification: G06F9/4887
Abstract: A computing cluster operated according to a resource allocation policy based on a predictive model of completion time. The predictive model may be applied in a resource control loop that iteratively updates resources assigned to an executing job. At each iteration, the amount of resources allocated to the job may be updated based on the predictive model so that the job will be scheduled to complete execution at a target completion time. The target completion time may be derived from a utility function determined for the job. The utility function, in turn, may be derived from a service level agreement with service guarantees and penalties for late completion of a job. Allocating resources in this way may maximize utility for an operator of the computing cluster while minimizing disruption to other jobs that may be concurrently executing.
-
Publication No.: US20130212277A1
Publication date: 2013-08-15
Application No.: US13372717
Filing date: 2012-02-14
Applicants: Peter Bodik, Andrew D. Ferguson, Srikanth Kandula, Eric Boutin
Inventors: Peter Bodik, Andrew D. Ferguson, Srikanth Kandula, Eric Boutin
IPC classification: G06F15/173
CPC classification: G06F9/4887
Abstract: A computing cluster operated according to a resource allocation policy based on a predictive model of completion time. The predictive model may be applied in a resource control loop that iteratively updates resources assigned to an executing job. At each iteration, the amount of resources allocated to the job may be updated based on the predictive model so that the job will be scheduled to complete execution at a target completion time. The target completion time may be derived from a utility function determined for the job. The utility function, in turn, may be derived from a service level agreement with service guarantees and penalties for late completion of a job. Allocating resources in this way may maximize utility for an operator of the computing cluster while minimizing disruption to other jobs that may be concurrently executing.
-
Publication No.: US08078913B2
Publication date: 2011-12-13
Application No.: US12473900
Filing date: 2009-05-28
Applicants: Moises Goldszmidt, Peter Bodik
Inventors: Moises Goldszmidt, Peter Bodik
IPC classification: G06F11/00
CPC classification: G06F11/079, G06F11/0709, G06F11/3409, G06F2201/81, H04L41/0681
Abstract: Methods for automatically identifying and classifying a crisis state occurring in a system having a plurality of computer resources. Signals are received from a device that collects the signals from each computer resource in the system. For each epoch, an epoch fingerprint is generated. Upon detecting a performance crisis within the system, a crisis fingerprint is generated consisting of at least one epoch fingerprint. The technology is able to identify that a performance crisis has previously occurred within the datacenter if a generated crisis fingerprint favorably matches any of the model crisis fingerprints stored in a database. The technology may also predict that a crisis is about to occur.