Abstract:
A method and apparatus for automated performance verification of integrated circuit designs are described herein. The method includes a test preparation stage and an automated verification stage. The test preparation stage generates design feature-specific performance tests, using optimization approaches, to meet expected performance goals under certain workloads and for different design configurations. The automated verification stage is implemented by integrating automated functional modules into a verification infrastructure. These modules include a register transfer level (RTL) simulation module, a performance evaluation module, and a performance publish module. The RTL simulation module schedules performance testing jobs, runs a series of performance tests on simulation logic simultaneously, and generates performance counters for each functional unit. The performance evaluation module comprises three sub-functions: a functional comparison between actual results and a reference file containing the expected results; performance measurements of throughput, execution time, and latency; and performance analysis. The performance publish module publishes performance results and analysis reports.
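The three sub-functions of the performance evaluation module can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the abstract: the `PerfCounters` fields, the metric definitions (items per cycle, average cycles per item), and the goal keys `min_throughput` and `max_latency`.

```python
from dataclasses import dataclass


@dataclass
class PerfCounters:
    """Hypothetical per-unit counters produced by one RTL simulation run."""
    unit: str
    cycles: int           # total simulated cycles for the test
    items_completed: int  # work items retired by the functional unit
    total_latency: int    # sum of per-item latencies, in cycles


def functional_match(actual: list[str], expected: list[str]) -> bool:
    """Sub-function 1: compare actual results with the reference results."""
    return actual == expected


def measure(counters: PerfCounters, clock_hz: float) -> dict[str, float]:
    """Sub-function 2: derive throughput, execution time, and latency."""
    return {
        "throughput": counters.items_completed / counters.cycles,     # items per cycle
        "execution_time": counters.cycles / clock_hz,                 # seconds
        "latency": counters.total_latency / counters.items_completed, # cycles per item
    }


def analyze(metrics: dict[str, float], goals: dict[str, float]) -> dict[str, bool]:
    """Sub-function 3: check each measured metric against its assumed goal."""
    return {
        "throughput": metrics["throughput"] >= goals["min_throughput"],
        "latency": metrics["latency"] <= goals["max_latency"],
    }
```

A caller would typically run `functional_match` first and compute the metrics only for tests whose output is functionally correct, since performance numbers from a failing test are not meaningful.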
Abstract:
Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing the number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read access requests to a memory subsystem for the shared data, a single access request is generated. The single access request includes information identifying the multiple compute units that receive the shared data when it is broadcast.
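The idea of replacing several reads of the same shared data with one request can be sketched as follows. This is an illustrative model only: the abstract says the request carries information identifying the receiving compute units but does not say how, so the bitmask encoding, the `BroadcastRead` type, and the `coalesce_reads` function are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadcastRead:
    """Hypothetical single access request for data shared by several compute units."""
    address: int
    cu_mask: int  # assumed encoding: bit i set means compute unit i receives the data


def coalesce_reads(requests: list[tuple[int, int]]) -> list[BroadcastRead]:
    """Merge (compute_unit_id, address) reads into one request per address."""
    masks: dict[int, int] = defaultdict(int)
    for cu_id, address in requests:
        masks[address] |= 1 << cu_id  # record this compute unit as a receiver
    # One request per distinct address, ordered by address for determinism.
    return [BroadcastRead(addr, mask) for addr, mask in sorted(masks.items())]
```

For example, if compute units 0, 1, and 2 all read address `0x100`, the sketch issues one request with mask `0b0111` instead of three separate reads.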
Abstract:
A system, method, and computer program product are provided for tessellation using shaders. New graphics pipeline stages implemented by shaders are introduced, including an inner ring shader, an outer edge shader, and a topologic shader, which work together with a domain shader and a geometry shader to provide tessellated points and primitives. A hull shader is modified to compute values that the new shaders use to perform tessellation algorithms. This approach adds parallelism and customizability to the presently static tessellation engine implementation.