Abstract:
1,039,580. Recognising spoken words. STANDARD TELEPHONES & CABLES Ltd. Dec. 20, 1963 [Dec. 31, 1962], No. 50401/63. Heading G4R. In apparatus for recognizing spoken words, a library of power traces is provided, one for each possible word. A light image of the power trace of the unknown word is formed and compared with each trace in the library, the size of the trace being repeatedly changed during comparison; when the image agrees with one of the library traces an indication is produced, by which the word is identified. To construct the library traces, sonograms, Fig. 1a (not shown), are made of certain words spoken by different speakers. These show the power in different frequency channels. For any particular channel the power curve, Fig. 2 (not shown), may vary during the speaking of a word. In a simple mask, Fig. 5 (not shown), the area 10 is transparent and the rest opaque; a power trace may be expected to lie in the area 10 in most cases. In the form of Fig. 6 (not shown), the mask has a transparent area 11 and fringe areas 12, 13, 14 and 15 having an opacity inversely proportional to the probability that the trace should fall on them. The area 15, for example, represents a deviant trace having a low probability of occurrence. To avoid using areas of the mask which represent redundant information, parts may be obscured, Fig. 7 (not shown), the strips 17 and 18 being designed, in effect, to weight samples of the trace in accordance with their significance for word discrimination. The strips are also designed to normalize the outputs, i.e. to ensure that the maximum output of each mask is the same. The speech signal is recorded on a magnetic tape 38, Fig. 12 (not shown), and later read from the tape at twice the recording speed, the tape having loops 44, 49 to allow this. Two seconds of speech are thereby compressed into one second, leaving one second for comparing the trace with the library of masks.
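The graded mask of Fig. 6 can be pictured as a grid of transmission values, the detector output being the light from the trace that gets through. A minimal Python sketch under that reading follows. The grid size, the transmission values, and the names `light_passed`, `max_output` and `normalised_output` are illustrative assumptions, and the division used for normalisation merely stands in for the obscuring strips 17 and 18, which the abstract describes only in outline.

```python
# Hypothetical model: a mask is a grid of transmission values in [0, 1]
# (1.0 = transparent, 0.0 = opaque). Rows are amplitude levels and
# columns are time samples. A trace gives one row index per time sample.

def light_passed(mask, trace):
    """Sum the transmission the mask offers at each dot of the trace."""
    return sum(mask[row][col] for col, row in enumerate(trace))

def max_output(mask):
    """Largest output any trace could produce: the best row in every column."""
    n_rows, n_cols = len(mask), len(mask[0])
    return sum(max(mask[r][c] for r in range(n_rows)) for c in range(n_cols))

def normalised_output(mask, trace):
    """Scale the output so a best-fitting trace scores 1.0 for any mask."""
    return light_passed(mask, trace) / max_output(mask)

# Assumed example: fully transparent core (1.0) with a fringe of 0.2.
mask = [
    [0.0, 0.0, 0.2, 0.0],
    [0.2, 1.0, 1.0, 0.2],
    [1.0, 0.2, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
typical_trace = [2, 1, 1, 2]   # stays inside the transparent core
deviant_trace = [1, 1, 0, 1]   # strays into the fringe for three samples
```

In this toy case the typical trace reaches the maximum output, while the deviant trace passes much less light, which is the behaviour a threshold on the detector output would exploit.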
The signal read from the tape is applied to a bank of ten filters 58, each passing a band, the centre frequency of which is as indicated. The outputs are rectified at 59, and low-pass filters 60 exclude frequencies above 50 cycles per second so that only the envelope is passed. The ten envelopes are each sampled 200 times in the second by a sampler 61 driven by a counter 63. A staircase generator, also controlled by the counter, gives a series of ten-step waveforms. The samples of the envelopes are applied via a logarithmic amplifier 64 to an adder 71, which also receives the staircase voltage, so that the sample signals relating to the different envelopes are each given a different bias. The output of the adder is applied to the vertical deflection circuits 73, Fig. 13 (not shown), of an iatron 74 having a long persistence time. Each sample appears as a dot, the vertical position of which depends on the amplitude of the sample. Each envelope is plotted as a series of 200 dots, the traces of the ten envelopes being separated vertically by the bias voltages so as to lie one above the other. At the end of the tracing operation, which takes one second, the comparison operation begins and occupies the next second. The traces on the screen of the tube 74 are projected via a mirror 75 and an anamorphic lens 76, which distorts the image so as to alter its magnification by ±15%. This compensates for the different rates of speaking that might be expected. The lens is driven by a motor 79, which also adjusts the diaphragm of a following lens 77 to keep the brightness of the image constant as its magnification is varied. The image is transmitted through an image deflector tube 78 controlled, via the deflection generator 82, from a position transducer 81 on the anamorphic lens. The tube 78 causes the image to scan over the library of masks 84, which contains 1000 word masks in an array of 32 x 32, ten such scans being made in one second. The magnification changes by 3% between successive scans.
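The way the staircase bias stacks the ten envelopes on the screen can be illustrated with a small Python sketch. The channel and sample counts come from the text. The interleaving order of the samples, the bias step, the logarithm floor, the trace height, and the function name `deflection_points` are assumptions chosen for illustration, not details given in the abstract.

```python
import math

N_CHANNELS = 10   # filter channels (from the text)
N_SAMPLES = 200   # samples per envelope in the one-second trace (from the text)
BIAS_STEP = 1.0   # assumed vertical spacing between traces, arbitrary units
FLOOR = 1e-3      # assumed lower limit so that the logarithm stays finite

def deflection_points(envelopes, trace_height=0.8):
    """Return (sample_index, vertical_deflection) dots for all channels.

    envelopes[ch][t] is the rectified, low-pass-filtered output of
    channel ch at sample t, assumed to be scaled to the range 0..1.
    """
    points = []
    for t in range(N_SAMPLES):          # assumed: one ten-step staircase per sample time
        for ch in range(N_CHANNELS):    # each staircase step selects one channel
            level = max(envelopes[ch][t], FLOOR)
            # Logarithmic amplifier: map FLOOR..1 onto 0..trace_height.
            log_level = trace_height * (1 - math.log(level) / math.log(FLOOR))
            # Adder: the staircase bias lifts each channel onto its own band.
            points.append((t, ch * BIAS_STEP + log_level))
    return points
```

Keeping `trace_height` below `BIAS_STEP` is what keeps the ten traces from overlapping in this sketch; the abstract says only that the bias voltages separate them vertically.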
Light passed by a mask is received in a photomultiplier 86, and the signal generated is amplified and passed to a threshold discriminator 88, which detects a match. The generator 82 also generates X and Y staircase voltages having 32 steps, in synchronism with the mask scanning operation of tube 78. These voltages are quantized and used to enable corresponding ones of 1000 gates 91 arranged in a 32 x 32 array. A signal from discriminator 88 passes through the gate corresponding to the mask giving the match signal and selects the corresponding word for a display 92. The word may also be printed, punched or otherwise recorded.
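The gate selection can be sketched in Python as an index computed from the quantized X and Y staircase steps. The row-major numbering, the first-match rule, and the names `gate_index`, `recognise` and `photo_output` are assumptions made for illustration; the abstract does not say how the 1000 gates are assigned to the 32 x 32 positions.

```python
GRID = 32        # staircase steps per axis (from the text)

def gate_index(x_step, y_step):
    """Map quantized staircase steps to a gate number, assuming row-major order."""
    if not (0 <= x_step < GRID and 0 <= y_step < GRID):
        raise ValueError("staircase step out of range")
    return y_step * GRID + x_step

def recognise(photo_output, words, threshold):
    """Scan every mask position and return the first word that matches.

    photo_output(x_step, y_step) is a hypothetical callable standing in
    for the amplified photomultiplier signal at that mask position.
    Returns None when no position reaches the threshold.
    """
    for y_step in range(GRID):
        for x_step in range(GRID):
            index = gate_index(x_step, y_step)
            if index >= len(words):
                continue  # 32 x 32 gives 1024 positions; only len(words) are used
            if photo_output(x_step, y_step) >= threshold:
                return words[index]  # the enabled gate passes the match signal
    return None
```

A single call to `recognise` corresponds to one scan of the library; in the apparatus described, ten such scans are made per second, one at each magnification.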