Abstract:
A system and method for enabling operations for virtual computing instances with physical passthru devices include moving an input-output memory management unit (IOMMU) domain from a source virtual computing instance having a physical passthru device to a destination virtual computing instance, where guest operations are performed in the source virtual computing instance. After the destination virtual computing instance is powered on, any interrupt notifications from the physical passthru device are buffered. After memory data is transferred from the source virtual computing instance to the destination virtual computing instance, posting of interrupt notifications from the physical passthru device is resumed and any buffered interrupt notifications from the physical passthru device are posted. Guest operations are then performed in the destination virtual computing instance.
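The buffer-then-replay handling of interrupt notifications described above can be illustrated with a minimal sketch. The class `InterruptRelay`, its method names, and the use of integer interrupt vectors are all hypothetical and chosen only for illustration; the abstract does not specify how buffering is implemented.

```python
from collections import deque
from typing import Callable, Deque


class InterruptRelay:
    """Hypothetical relay between a passthru device and a guest.

    While buffering is active, notifications are queued instead of posted.
    When posting resumes, queued notifications are posted in arrival order.
    """

    def __init__(self, post: Callable[[int], None]) -> None:
        self._post = post                      # callback that posts to the guest
        self._buffer: Deque[int] = deque()     # pending interrupt vectors
        self._buffering = False

    def start_buffering(self) -> None:
        # Assumed to be called once the destination instance is powered on.
        self._buffering = True

    def notify(self, vector: int) -> None:
        # Called for each interrupt notification from the device.
        if self._buffering:
            self._buffer.append(vector)
        else:
            self._post(vector)

    def resume_posting(self) -> None:
        # Assumed to be called after memory data has been transferred.
        self._buffering = False
        while self._buffer:
            self._post(self._buffer.popleft())


delivered = []
relay = InterruptRelay(delivered.append)
relay.notify(1)            # posted immediately
relay.start_buffering()
relay.notify(2)            # buffered
relay.notify(3)            # buffered
relay.resume_posting()     # 2 and 3 are posted in order
relay.notify(4)            # posted immediately again
```

A FIFO queue is used here so that buffered notifications reach the guest in the order the device raised them; whether the described system preserves ordering in this way is an assumption.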
Abstract:
An automated end-to-end analysis of customer service requests is disclosed. A core dump is received, wherein the core dump corresponds to a customer service request regarding a crash of a computer system. The core dump is automatically analyzed with a processor to generate analysis results. A graphical representation for display on a graphical user interface of a computer is generated, wherein the graphical representation corresponds to the analysis results for the core dump.
Abstract:
Disclosed are various examples of host and data processing unit (DPU) coordination for DPU maintenance events. A host device can have a DPU device connected to it. A DPU maintenance process executed by the host device can quiesce applications or virtual machines of the host device, and call a DPU isolation interface that isolates the DPU device to prevent host panic. A kernel process of the host device unloads a driver of the DPU device from the host device and removes the DPU device from a device manager of the host device. A DPU maintenance action is performed once the DPU device is isolated.
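The ordering of the maintenance steps above can be sketched as follows. Every name here (`Host`, `quiesce_workloads`, `isolate_dpu`, `run_dpu_maintenance`) is hypothetical, and the sketch assumes that the driver unload and device-manager removal happen as part of isolation, which the abstract does not state explicitly.

```python
from typing import Callable, List


class Host:
    """Hypothetical host model that records each step in a log."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.dpu_isolated = False

    def quiesce_workloads(self) -> None:
        # Stand-in for quiescing applications or virtual machines.
        self.log.append("quiesce")

    def isolate_dpu(self) -> None:
        # Stand-in for the DPU isolation interface. The two kernel-side
        # steps are assumed to run as part of isolation.
        self.log.append("unload_driver")
        self.log.append("remove_from_device_manager")
        self.dpu_isolated = True


def run_dpu_maintenance(host: Host, action: Callable[[], None]) -> None:
    """Run a maintenance action only after the DPU has been isolated."""
    host.quiesce_workloads()
    host.isolate_dpu()
    if not host.dpu_isolated:
        raise RuntimeError("DPU is not isolated; refusing to run maintenance")
    action()
    host.log.append("maintenance")


host = Host()
performed = []
run_dpu_maintenance(host, lambda: performed.append("firmware_update"))
```

The explicit isolation check before `action()` reflects the stated requirement that maintenance is performed once the DPU device is isolated.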
Abstract:
Techniques for migrating virtual machines (VMs) in the presence of uncorrectable memory errors are provided. According to one set of embodiments, a source host hypervisor of a source host system can determine, for each guest memory page of a VM to be migrated from the source host system to a destination host system, whether the guest memory page is impacted by an uncorrectable memory error in a byte-addressable memory of the source host system. If the source host hypervisor determines that the guest memory page is impacted, the source host hypervisor can transmit a data packet to a destination host hypervisor of the destination host system that includes error metadata identifying the guest memory page as being corrupted. Alternatively, if the source host hypervisor determines that the guest memory page is not impacted, the source host hypervisor can attempt to read the guest memory page from the byte-addressable memory in a memory exception-safe manner.
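The per-page decision described above can be sketched as follows. The names `PagePacket`, `safe_read`, and `build_packets` are hypothetical, guest memory is modeled as a plain dictionary, and a missing dictionary key stands in for a memory exception raised during the read. Marking a page as corrupted when the exception-safe read fails is an assumption; the abstract only says that the read is attempted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class PagePacket:
    """Hypothetical packet sent from the source to the destination hypervisor."""
    page_number: int
    corrupted: bool             # error metadata: page is marked as corrupted
    data: Optional[bytes]       # page contents, or None for a corrupted page


class MemoryReadError(Exception):
    """Stand-in for a memory exception raised while reading a page."""


def safe_read(memory: Dict[int, bytes], page_number: int) -> bytes:
    # Exception-safe read: a failure surfaces as MemoryReadError instead of
    # crashing the caller. A missing key simulates the hardware error.
    try:
        return memory[page_number]
    except KeyError as exc:
        raise MemoryReadError(page_number) from exc


def build_packets(
    memory: Dict[int, bytes],
    known_bad_pages: Set[int],
    page_numbers: Iterable[int],
) -> Iterator[PagePacket]:
    for page_number in page_numbers:
        if page_number in known_bad_pages:
            # Page is impacted by a known uncorrectable error: send metadata only.
            yield PagePacket(page_number, corrupted=True, data=None)
            continue
        try:
            data = safe_read(memory, page_number)
        except MemoryReadError:
            # Assumption: a failed safe read is also reported as corrupted.
            yield PagePacket(page_number, corrupted=True, data=None)
        else:
            yield PagePacket(page_number, corrupted=False, data=data)


memory = {0: b"page-0", 2: b"page-2"}
packets = list(build_packets(memory, known_bad_pages={1}, page_numbers=[0, 1, 2, 3]))
```

In this example, page 1 is reported as corrupted from the known-error set, and page 3 is reported as corrupted because the simulated read fails.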
Abstract:
A computer-implemented method for assessing the risk of a future crash occurring on a computer system is disclosed. Crash results are received from a crash analysis system. The crash results are analyzed, at a processor, to determine the likelihood of the future crash occurring on the computer system. Information regarding the likelihood of the future crash occurring on the computer system is provided to a user of the computer system.
Abstract:
Discovering a hardware failure in a processor is disclosed. When an operating system or application fails, a function containing the instruction that failed along with the register set of the CPU at the failure is recorded. The function is analyzed into its basic blocks. The failing instruction, the failing basic block, the definitions that reach the failing instruction, and the CPU register set at the failure provide information to determine whether the failure was caused by hardware or software. If, after a complete search of the definitions reaching the failing instruction, the search discovers a first definition defining the failing instruction and a second definition defining the first definition such that the second definition reaches the failing instruction and the first definition assigns a register value that does not match a register value in the failing instruction, then a hardware failure is the cause of the crash.
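The step of analyzing a function into its basic blocks can be illustrated with the standard leader-based partitioning technique. This is a generic sketch, not the method of the disclosure: the instruction encoding (an opcode plus an optional branch-target index), the opcode set, and the function names are all assumptions, and the hardware-versus-software determination itself is not modeled.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical instruction form: (opcode, index of branch target or None).
Instr = Tuple[str, Optional[int]]

# Assumed set of opcodes that end a basic block.
BRANCH_OPCODES = {"jmp", "jz", "jnz", "ret"}


def basic_blocks(instrs: Sequence[Instr]) -> List[List[int]]:
    """Partition instruction indices into basic blocks.

    Leaders are the first instruction, every branch target, and every
    instruction that follows a branch. Targets are assumed to be valid indices.
    """
    if not instrs:
        return []
    leaders = {0}
    for i, (opcode, target) in enumerate(instrs):
        if opcode in BRANCH_OPCODES:
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(i + 1)
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


def block_containing(blocks: List[List[int]], index: int) -> List[int]:
    """Return the basic block that holds the instruction at the given index."""
    for block in blocks:
        if index in block:
            return block
    raise ValueError(f"instruction {index} is not in any block")


function = [
    ("mov", None),  # 0
    ("jz", 3),      # 1: conditional branch to instruction 3
    ("add", None),  # 2
    ("mov", None),  # 3
    ("ret", None),  # 4
]
blocks = basic_blocks(function)
failing_block = block_containing(blocks, 4)  # block of a failing instruction at index 4
```

Once the failing basic block is known, a reaching-definitions analysis over the blocks could identify which earlier assignments can supply the register values used by the failing instruction.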
Abstract:
In a crash analysis system, a method for analyzing a core dump corresponding to a crash of a computer system is disclosed. A core dump is received, wherein the core dump corresponds to a crash of a computer system. A culprit module responsible for the crash of the computer system is determined. A signature back trace, which pertains to a symptom of the crash of the computer system, is generated.
Abstract:
An automated end-to-end analysis of customer service requests is disclosed. A core dump is received, wherein the core dump corresponds to a customer service request regarding a crash of a computer system. A processor automatically analyzes the core dump to determine whether a pcpu lockup of the computer system is due to a software issue. If the pcpu lockup of the computer system is due to the software issue, the processor determines which thread is a culprit thread responsible for the pcpu lockup of the computer system.