HARDWARE RELIABILITY DIAGNOSTICS AND FAILURE DETECTION VIA PARALLEL SOFTWARE COMPUTATION AND COMPARE

    Publication No.: US20210165730A1

    Publication date: 2021-06-03

    Application No.: US17175488

    Filing date: 2021-02-12

    Abstract: Methods, apparatus, and software for hardware reliability diagnostics and failure detection via parallel software computation and compare. Parallel testing is performed on hardware resources such as processor cores, accelerators, and Other Processing Units (XPUs) using test algorithms such as encryption/decryption. The results of the testing (the algorithm outputs) are compared to detect errant hardware. Comparison may be across cores (via execution of software-based algorithms), across accelerators/XPUs (via algorithms implemented in hardware), or between cores and accelerators/XPUs. Techniques are disclosed to enable all cores to be tested while a platform is performing a workload, such as in a data center environment, wherein unused cores are used for testing and workloads are migrated among cores between tests.
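The compute-and-compare idea in this abstract can be illustrated with a small sketch. It is not the patented method: the hardware units are simulated as plain Python callables run on a thread pool, SHA-256 stands in for the encryption/decryption test algorithm, and the majority-vote rule for picking the reference output is an assumption (the abstract only says the outputs are compared). The names `reference_algorithm`, `faulty_algorithm`, and `find_errant_units` are hypothetical.

```python
import hashlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def reference_algorithm(payload: bytes) -> bytes:
    """Deterministic test algorithm (SHA-256 as a stand-in for encrypt/decrypt)."""
    return hashlib.sha256(payload).digest()


def faulty_algorithm(payload: bytes) -> bytes:
    """Simulates errant hardware by flipping one bit of the correct output."""
    good = reference_algorithm(payload)
    return bytes([good[0] ^ 0x01]) + good[1:]


def find_errant_units(
    units: Dict[str, Callable[[bytes], bytes]], payload: bytes
) -> List[str]:
    """Run the same test on every unit and return the units whose output
    differs from the majority output (assumed comparison rule)."""
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        futures = {name: pool.submit(fn, payload) for name, fn in units.items()}
        outputs = {name: fut.result() for name, fut in futures.items()}
    majority_output, _ = Counter(outputs.values()).most_common(1)[0]
    return sorted(name for name, out in outputs.items() if out != majority_output)


# Hypothetical platform: three cores and one accelerator, one core is errant.
units = {
    "core0": reference_algorithm,
    "core1": reference_algorithm,
    "core2": faulty_algorithm,
    "accel0": reference_algorithm,
}
print(find_errant_units(units, b"test vector"))  # prints ['core2']
```

A real implementation would pin each test to a specific core or dispatch it to an accelerator; the sketch only shows the comparison step.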

    METHOD AND APPARATUS TO PROACTIVELY SCREEN HARDWARE ERRORS OF A COMPUTER PROCESSING SYSTEM

    Publication No.: US20250004896A1

    Publication date: 2025-01-02

    Application No.: US18217245

    Filing date: 2023-06-30

    Abstract: Methods and apparatus to implement proactive hardware error screening are disclosed. In one embodiment, a computer processing system includes a plurality of computational units to execute tasks for one or more applications; a plurality of sensors to collect measurement data of the plurality of computational units; a data structure, stored in a storage, indicating hardware health statuses of the plurality of computational units determined based on the measurement data; and the plurality of computational units is scheduled to perform task execution on the computer processing system for the one or more applications based on the hardware health statuses indicated in the data structure, wherein a first computational unit is excluded from the task execution when a corresponding first hardware health status of the first computational unit indicates an impending hardware failure.
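The health-status-driven scheduling described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the measurement fields (`temperature_c`, `corrected_errors`), the thresholds, the two-state `Health` enum, and the round-robin assignment are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Health(Enum):
    HEALTHY = "healthy"
    IMPENDING_FAILURE = "impending_failure"


@dataclass
class Measurement:
    temperature_c: float   # hypothetical sensor reading
    corrected_errors: int  # hypothetical corrected-error count


# Hypothetical thresholds; the abstract does not specify any.
MAX_TEMP_C = 95.0
MAX_CORRECTED_ERRORS = 100


def build_health_table(measurements: Dict[str, Measurement]) -> Dict[str, Health]:
    """Derive the per-unit health data structure from sensor measurements."""
    table = {}
    for unit, m in measurements.items():
        at_risk = (m.temperature_c > MAX_TEMP_C
                   or m.corrected_errors > MAX_CORRECTED_ERRORS)
        table[unit] = Health.IMPENDING_FAILURE if at_risk else Health.HEALTHY
    return table


def schedule(tasks: List[str], health: Dict[str, Health]) -> Dict[str, str]:
    """Assign tasks round-robin to healthy units, excluding at-risk units."""
    eligible = sorted(u for u, h in health.items() if h is Health.HEALTHY)
    if not eligible:
        raise RuntimeError("no healthy computational unit available")
    return {task: eligible[i % len(eligible)] for i, task in enumerate(tasks)}


measurements = {
    "unit0": Measurement(temperature_c=60.0, corrected_errors=2),
    "unit1": Measurement(temperature_c=70.0, corrected_errors=450),  # at risk
    "unit2": Measurement(temperature_c=65.0, corrected_errors=0),
}
health_table = build_health_table(measurements)
assignments = schedule(["t0", "t1", "t2"], health_table)
print(assignments)  # unit1 never appears in the assignments
```

Keeping the health table separate from the scheduler mirrors the abstract's split between the stored data structure and the scheduling decision that consults it.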
