Abstract:
A two level branch history table (TLBHT) is substantially improved by providing a mechanism to prefetch entries from the very large second level branch history table (L2 BHT) into the active (very fast) first level branch history table (L1 BHT) before the processor uses them in the branch prediction process, and at the same time to prefetch lines that would otherwise miss into the instruction cache. A TLBHT is successful because it can prefetch branch entries into the L1 BHT sufficiently ahead of the time the entry is needed. This feature of the TLBHT is also used to prefetch instructions into the cache ahead of their use. In fact, the timeliness of the prefetches produced by the TLBHT can be used to remove most of the cycle time penalty incurred by cache misses.
Abstract:
An IEEE compliant floating point unit mechanism allows variability in the execution of floating point operations according to the IEEE 754 standard, and allows variations of the standard to co-exist in hardware or in a combination of hardware and millicode. The FPU has a detector of special conditions which dynamically detects an event indicating that the hardware execution of an IEEE compliant Binary Floating Point instruction will require millicode emulation. The complete set of events which millicode may emulate is predetermined early in the design process of the hardware. An exception handling unit assists millicode emulation by trapping the result of an exceptional condition without invoking the trap handler. When an exceptional condition is detected during execution, the IEEE 754 standard requires two different actions under control of a mask bit. If the mask bit is on, the result is written into an FPR and the trap handler is invoked. Otherwise, a default value is written, a flag is set, and the program continues execution. This allows a variation to the IEEE 754 standard. Two different versions of the function of the Multiply-then-Subtract instruction are implemented for two different IEEE 754 compliant architectures.
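The mask-bit behavior described above can be illustrated with a minimal Python sketch. The `FpuState` class, its fields, and the `handle_exception` function are hypothetical names introduced only for illustration; the abstract does not specify any data structures, and the trap handler is modeled here as a simple list of recorded conditions.

```python
from dataclasses import dataclass, field


@dataclass
class FpuState:
    """Hypothetical model of the architected state touched by an exception."""
    fpr: dict = field(default_factory=dict)      # floating point registers
    flags: set = field(default_factory=set)      # sticky exception flags
    trapped: list = field(default_factory=list)  # stands in for trap handler calls


def handle_exception(state, reg, condition, mask_on, result, default):
    """Apply the two mask-controlled actions described in the abstract."""
    if mask_on:
        # Mask bit on: write the result to the FPR and invoke the trap handler.
        state.fpr[reg] = result
        state.trapped.append(condition)
    else:
        # Mask bit off: write a default value, set a flag, and continue.
        state.fpr[reg] = default
        state.flags.add(condition)
    return state
```

A call such as `handle_exception(FpuState(), 0, "overflow", False, 1e308, float("inf"))` leaves the default value in register 0 and records the flag without trapping.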
Abstract:
An adder which takes advantage of the early arriving bits of a time skewed operand to provide a result to an add or subtract operation without additional latency. Possible partial results are calculated and then selectively combined according to the late arriving data as the late arriving data becomes available. In an embodiment of the present invention, a first operand is partitioned into groups according to the arrival time of the skewed data, and possible partial results for each group are calculated for the full range of partial inputs that affect it. In addition, the high order groups are calculated with and without a borrow (carry) which is propagated from a low order group. Once the delayed partial operands are known and the borrows (carries) are determined, the partial results are gated through multiplexers according to the borrows and partial results, and thus the result is provided with a delay similar to the delay in arrival of the skewed operand.
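The precompute-then-select idea can be sketched in Python as follows. This is a software illustration only, assuming 4-bit groups and an addition (not subtraction); the names `GROUP_BITS`, `precompute`, and `select` are hypothetical and do not come from the abstract. Each group of the early operand is tabulated against every possible late-operand value and carry-in, and the late operand then only drives table lookups, which play the role of the multiplexers.

```python
GROUP_BITS = 4                      # assumed group width, chosen for illustration
MASK = (1 << GROUP_BITS) - 1


def precompute(a, n_groups):
    """Tabulate (sum, carry_out) per group of the early operand `a`
    for every possible late-operand group value and carry-in."""
    tables = []
    for g in range(n_groups):
        a_g = (a >> (g * GROUP_BITS)) & MASK
        table = {}
        for b_g in range(1 << GROUP_BITS):
            for carry_in in (0, 1):
                total = a_g + b_g + carry_in
                table[(b_g, carry_in)] = (total & MASK, total >> GROUP_BITS)
        tables.append(table)
    return tables


def select(tables, b):
    """Once the late operand `b` is known, pick the matching partial
    results group by group, passing the carry from low to high order."""
    result, carry = 0, 0
    for g, table in enumerate(tables):
        b_g = (b >> (g * GROUP_BITS)) & MASK
        partial, carry = table[(b_g, carry)]
        result |= partial << (g * GROUP_BITS)
    return result, carry
```

For 16-bit operands, `select(precompute(a, 4), b)` returns the low 16 bits of `a + b` together with the carry out of the top group.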
Abstract:
There is a unique partitioning problem in determining how to execute the floating point multiply instruction defined by the IEEE 754 standard for the quad word format on an S/390 processor. Several manufacturers including IBM and HP define the binary quad word format to have a 113 bit significand. The IBM S/390 hexadecimal long floating point format has a 56 bit significand, and most S/390 floating point units only contain a long format multiplier. Quad word format multiplication must be executed as a series of several long precision multiplications and extended precision or long precision additions. The S/390 hexadecimal quad word format is easier to implement than the binary format since it has a 112 bit significand and can easily be partitioned into two 56 bit parts. But a 113 bit significand would just exceed two partitions and require a third. For extended precision multiplies each partition of one operand is multiplied by each partition of the other, so if there are two partitions only four multiplies are required, but for three partitions this increases to nine multiplies. Methods for partitioning are disclosed here.
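The multiply counts quoted above (four for two partitions, nine for three) follow from forming every pairwise partial product. The Python sketch below shows this straightforward scheme under the assumption of 56 bit partitions taken from the low order end; it is not the partitioning method the patent discloses, and the names `PART_BITS`, `split`, and `partitioned_multiply` are hypothetical.

```python
PART_BITS = 56  # width of the long format multiplier, per the abstract


def split(significand, parts):
    """Split a significand into `parts` chunks of PART_BITS, low order first."""
    mask = (1 << PART_BITS) - 1
    return [(significand >> (i * PART_BITS)) & mask for i in range(parts)]


def partitioned_multiply(a, b, parts):
    """Multiply two significands using only PART_BITS x PART_BITS multiplies.
    Returns the full product and the number of multiplies performed."""
    product, multiplies = 0, 0
    for i, a_part in enumerate(split(a, parts)):
        for j, b_part in enumerate(split(b, parts)):
            # Each partial product is weighted by the position of its inputs.
            product += (a_part * b_part) << ((i + j) * PART_BITS)
            multiplies += 1
    return product, multiplies
```

A 112 bit significand fits in two partitions and needs four multiplies, while a 113 bit significand spills into a third partition and needs nine under this naive scheme.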
Abstract:
A system implementing a methodology for determining the exponent in parallel with determining the fractional shift during normalization by partitioning the exponent into partial exponent groups according to the fractional shift data flow, determining all possible partial exponent values for each partial exponent group according to the fractional data flow, and providing the exponent by selectively combining possible partial exponents from each partial exponent group according to the fractional data flow. There is also provided a system implementing a methodology for generating the sticky bit during normalization. Sticky bit information is precalculated and multiplexed according to the fractional dataflow. In an embodiment of the invention, group sticky signals are calculated in tree form, each group sticky having a number of possible sticky bits corresponding to the shift increment amount of the multiplexing. The group sticky bits are further multiplexed according to subsequent shift amounts in the fractional dataflow to provide an output sticky bit at substantially the same time as when the final fractional shift amount is available, and thereby at substantially the same time as the normalized fraction.
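A simplified Python sketch of the sticky bit precalculation follows. It flattens the tree of group sticky signals into one table with a candidate per possible shift amount, so it illustrates only the principle that sticky candidates do not depend on the shift and can be selected once the shift is known. The function names and the `width`/`keep` parameters are hypothetical, and the fraction is assumed to be nonzero and to fit in `width` bits.

```python
def candidate_stickies(frac, width, keep):
    """Precalculate one sticky candidate per possible left shift.
    After shifting left by s, the low (width - keep - s) bits of the
    original fraction fall below the kept result and feed the sticky bit."""
    candidates = []
    for shift in range(width - keep + 1):
        low_bits = width - keep - shift
        candidates.append(int((frac & ((1 << low_bits) - 1)) != 0))
    return candidates


def normalize_with_sticky(frac, width, keep):
    """Normalize a nonzero `width`-bit fraction, keep the top `keep` bits,
    and select the sticky bit from the precalculated candidates."""
    candidates = candidate_stickies(frac, width, keep)  # independent of shift
    shift = width - frac.bit_length()                   # leading zero count
    kept = (frac << shift) >> (width - keep)
    # Larger shifts move every original bit into the kept field.
    sticky = candidates[shift] if shift < len(candidates) else 0
    return kept, shift, sticky
```

For example, `normalize_with_sticky(0b0001011000000001, 16, 8)` shifts left by 3 and keeps `0b10110000`, with the sticky bit set by the trailing 1.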
Abstract:
A six cycle pipeline system for a computer processor's floating point processor, in which instruction text is fetched prior to the first cycle; during the first cycle the fetched instruction is decoded and the base (B) and index (X) register values are read for use in address generation. RXE instructions are of the RX type but are extended by placing the extension of the operation code beyond the first four bytes of the instruction format and by assigning the operation codes in such a way that the machine may determine the exact format from the first 8 bits of the operation code alone. The ESA/390 SS, RR, RX, S, RRE, and RI instructions and the new RXE instructions have a format which can be used for fixed point processing as well as floating point processing. Instructions of the RXE format have their R1, X2, B2, and D2 fields in the identical positions in said instruction register as in the RX format, enabling the processor to determine from the first 8 bits of the operation code alone that an instruction being decoded is an RXE format instruction and to identify the register indexed extensions of the RXE format instruction, after which it gates the correct information to said X-B-D adder. During the second cycle the address add of B+X+Displacement is performed and sent to the cache; during the third and fourth cycles the cache is accessed and data is returned, respectively; and during a fifth cycle execution of the fetched instruction occurs, with the result put away in a sixth cycle.
Abstract:
An IEEE 754 standard floating point multiply instruction for binary extended precision format can be executed with a quad word format on an S/390 processor. The multiplication calculation multiplies each partition by each other. In the multiplication dataflow, if either operand is a denormalized number, the operands are normalized at a stage which creates an expanded exponent range of one more bit, and the calculation continues to a parallel path multiplexor stage. If neither operand is denormalized, the exponent of the number is expanded and the calculation splits into four parallel paths, wherein the two operands' sign bits are processed in a sign calculation block stage, the operands' two 16 bit binary exponents are processed by an exponent conversion block stage, and a partition multiplicand significand block stage receives a 113 bit multiplicand significand input for a fourth path. In this calculation the third and fourth paths converge with a calculation which provides partial products and intermediate sums and finally a final product as a calculation block stage output. This output, the exponent from said second path, and the sign bit from said first path merge to provide a product which is represented in hexadecimal internal format, converted back to binary format in a calculation block stage, and rounded.
Abstract:
Provides a program translation and execution method which stores target routines (for execution by a target processor) corresponding to incompatible instructions, interruptions and authorizations of an incompatible program written for execution on another computer system built to a computer architecture incompatible with the architecture of the target processor's computer system. The disclosed process allows the target processor to emulate incompatible acts expected in the operation of an incompatible program when the target processor itself is incapable of performing the emulated acts. Each of the instructions, interruptions and authorizations found in the incompatible programs has one or more corresponding target routines, any of which may need to be preprocessed before it can precisely emulate the execution results required by the incompatible architecture. Target routines (corresponding to the incompatible instruction instances in an incompatible program being emulated) are accessed, patched where necessary, and executed by a target processor to enable the target processor to precisely obtain the execution results of the emulated incompatible program. Before preprocessing, each target routine may not be able to provide identical execution results as required by the incompatible architecture, and the preprocessing may patch one or more of its target instructions to enable the target routine to perform the identical emulation execution of the corresponding incompatible instruction. The patching and other modifications to a target routine are done by one or more preprocessing instructions stored in the target routine.
Abstract:
A method and system which provides exactly rounded division and square root results for a designated rounding mode independently of a remainder, or equivalent calculation of the relationship between the remainder and zero, for predetermined combinations of the rounding mode and the guard digit of an estimate that has several more bits of precision than the exactly rounded result, and has an error tolerance magnitude less than the weight of the least significant bit of the estimate. The estimate is generated in accordance with a quadratically converging division or square root algorithm. The method and system are described in connection with IEEE 754-1985 and IBM S/390 binary floating point architectures.
Abstract:
A method and system which provides exactly rounded division and square root results for a designated rounding mode independently of a remainder, or equivalent calculation of the relationship between the remainder and zero, for predetermined combinations of the rounding mode and the least significant bit of an estimate that has one more bit of precision than the exactly rounded result, and has an error tolerance magnitude less than the weight of the least significant bit of the estimate. The estimate is generated in accordance with a quadratically converging division or square root algorithm. The method and system are described in connection with IEEE 754-1985 and IBM S/390 binary floating point architectures.