Abstract:
In a speech-based system, a wake word or other trigger expression is used to preface user speech that is intended as a command. The system receives multiple directional audio signals, each of which emphasizes sound from a different direction. The signals are monitored and analyzed to detect the directions of interfering audio sources such as televisions or other electronic audio players. The directional signal having the strongest presence of speech is selected to be monitored for the trigger expression. If that signal corresponds to the direction of an interfering audio source, a stricter standard is used to detect the trigger expression. In addition, the directional audio signal having the second strongest presence of speech may also be monitored for the trigger expression.
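A minimal sketch of this selection logic, assuming each beam already carries a speech-presence score from a voice-activity detector; the class, thresholds, and angular-tolerance test are illustrative assumptions rather than the patented implementation, and angle wraparound is ignored for brevity:

```python
from dataclasses import dataclass

@dataclass
class Beam:
    direction_deg: float   # look direction of this beamformed signal
    speech_score: float    # speech presence, e.g. from a voice-activity detector

BASE_THRESHOLD = 0.50      # normal wake-word confidence threshold
STRICT_THRESHOLD = 0.80    # stricter threshold for beams aimed at an interferer
INTERFERER_DIRS = [90.0]   # detected directions of TVs or other audio players

def trigger_threshold(beam: Beam, tolerance_deg: float = 15.0) -> float:
    """Apply a stricter wake-word standard if the beam points at an interferer."""
    near_interferer = any(
        abs(beam.direction_deg - d) <= tolerance_deg for d in INTERFERER_DIRS
    )
    return STRICT_THRESHOLD if near_interferer else BASE_THRESHOLD

def beams_to_monitor(beams: list[Beam]) -> list[Beam]:
    """Monitor the beams with the strongest and second-strongest speech presence."""
    return sorted(beams, key=lambda b: b.speech_score, reverse=True)[:2]

beams = [Beam(0.0, 0.4), Beam(90.0, 0.9), Beam(180.0, 0.7)]
for beam in beams_to_monitor(beams):
    print(beam.direction_deg, trigger_threshold(beam))
```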
Abstract:
A system to select video frames for optical character recognition (OCR) based on feature metrics associated with blur and sharpness. A device captures a video frame including text characters. An edge detection filter is applied to the frame to determine gradient features in perpendicular directions. An “edge map” is created from the gradient features, and points along edges in the edge map are identified. An edge transition width is determined at each edge point based on the local intensity minimum and maximum on opposite sides of that point in the frame. Sharper edges have smaller edge transition widths than edges in blurry images. Statistics are determined from the edge transition widths, and the statistics are processed by a trained classifier to determine whether the frame is sufficiently sharp for text processing.
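As a rough sketch of the edge-transition-width computation, assuming a grayscale frame as a 2-D NumPy array; the gradient threshold is arbitrary, and the search for intensity extrema runs only along image rows for simplicity:

```python
import numpy as np

def edge_transition_widths(gray: np.ndarray, grad_thresh: float = 30.0) -> np.ndarray:
    """Edge transition width at each edge point, scanned along image rows."""
    img = gray.astype(float)
    gy, gx = np.gradient(img)                  # gradients in perpendicular directions
    edge_map = np.hypot(gx, gy) > grad_thresh  # simple edge map from gradient magnitude
    widths = []
    for r, c in zip(*np.nonzero(edge_map)):
        row = img[r]
        sign = 1.0 if gx[r, c] >= 0 else -1.0
        left = c
        while left > 0 and sign * (row[left] - row[left - 1]) > 0:
            left -= 1                          # walk to the local intensity minimum
        right = c
        while right < len(row) - 1 and sign * (row[right + 1] - row[right]) > 0:
            right += 1                         # walk to the local intensity maximum
        widths.append(right - left)
    return np.asarray(widths)

# Sharp step edge: transition widths stay small; blur would widen them.
frame = np.zeros((32, 32))
frame[:, 16:] = 255.0
w = edge_transition_widths(frame)
print(w.mean(), np.percentile(w, 90))          # statistics fed to the classifier
```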
Abstract:
A process for training and optimizing a system to select video frames for optical character recognition (OCR) based on feature metrics associated with blur and sharpness. A set of image frames is subjectively labelled based on a comparison of each frame before and after binarization, to determine to what degree text is recognizable in the binary image. A plurality of different sharpness feature metrics are generated from the original frame. A classifier is then trained using the feature metrics and the subjective labels. The feature metrics are then tested for accuracy and/or correlation with the subjective labelling data. The set of feature metrics may be refined based on which metrics produce the best results.
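A hedged sketch of this training-and-refinement loop, with synthetic feature metrics and labels standing in for real data and a scikit-learn logistic regression standing in for whatever classifier the system actually uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_frames = 200
metric_names = ["mean_etw", "p90_etw", "edge_density"]   # hypothetical metrics
X = rng.normal(size=(n_frames, len(metric_names)))       # sharpness feature metrics
y = X[:, 0] + 0.5 * rng.normal(size=n_frames) > 0        # subjective labels (synthetic)

clf = LogisticRegression().fit(X, y)
print("cv accuracy:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())

# Correlation of each metric with the subjective labels, used to prune
# weak metrics and refine the feature set.
for i, name in enumerate(metric_names):
    print(name, np.corrcoef(X[:, i], y.astype(float))[0, 1])
```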
Abstract:
A multi-orientation text detection method and associated system are disclosed that utilize orientation-variant glyph features to determine a text line in an image regardless of the orientation of the text line. Glyph features are determined for each glyph in an image with respect to a neighboring glyph. The glyph features are provided to a learned classifier that outputs a glyph pair score for each neighboring glyph pair. Each glyph pair score indicates the likelihood that the corresponding pair of neighboring glyphs form part of the same text line. The glyph pair scores are used to identify candidate text lines, which are then ranked to select a final set of text lines in the image.
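The following sketch illustrates the pairwise scoring idea with hand-tuned weights in place of the learned classifier; the features and coefficients are illustrative assumptions:

```python
import math
from dataclasses import dataclass

@dataclass
class Glyph:
    x: float   # center x
    y: float   # center y
    h: float   # glyph height

def pair_features(a: Glyph, b: Glyph) -> tuple[float, float, float]:
    """Orientation-variant features for a pair of neighboring glyphs."""
    return (
        math.hypot(b.x - a.x, b.y - a.y) / a.h,  # distance, normalized by height
        abs(b.h - a.h) / a.h,                    # relative height difference
        math.atan2(b.y - a.y, b.x - a.x),        # orientation of the pair
    )

def pair_score(a: Glyph, b: Glyph) -> float:
    """Stand-in for the learned classifier's glyph-pair score."""
    dist, dh, _ = pair_features(a, b)
    return max(0.0, 1.0 - 0.2 * dist - 1.0 * dh)

glyphs = [Glyph(0, 0, 10), Glyph(12, 1, 10), Glyph(25, 2, 11), Glyph(5, 80, 10)]
line = [glyphs[0]]
for g in glyphs[1:]:
    if pair_score(line[-1], g) > 0.5:   # chain high-scoring pairs into a line
        line.append(g)
print(len(line), "glyphs in the candidate text line")
```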
Abstract:
Embodiments of the subject technology provide for determining a region of a first acquired image based at least on a viewing mode and a set of respective positions of graphical elements to decrease the pre-processing time and perceived latency for the first image. One or more regions of text in the first image are detected, and a set of regions of text that overlap with the region of the image is determined and pre-processed. The subject technology may then pre-process an entirety of a subsequent image (e.g., to pick up missing text from the region of the first image). Thus, additional OCR results may be provided to the user by using the subsequent image(s) and merging subsequent results with previous results from the first image.
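A minimal sketch of the region-overlap test, assuming axis-aligned bounding boxes; the ROI values, region contents, and set-based merge are hypothetical simplifications:

```python
from typing import NamedTuple

class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

roi = Box(100, 100, 400, 300)   # region from viewing mode + UI element positions
text_regions = [Box(120, 110, 200, 140), Box(500, 50, 600, 90)]

first_pass = [r for r in text_regions if r.overlaps(roi)]
print("pre-process now:", first_pass)   # only overlapping regions, for low latency

def merge_results(previous: set[str], subsequent: set[str]) -> set[str]:
    """Merge OCR output from a full pass over a later frame with earlier results."""
    return previous | subsequent

print(merge_results({"HELLO"}, {"HELLO", "WORLD"}))
```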
Abstract:
A system that identifies and recognizes text while reducing the computational complexity of processing complex images. Widths of scan line segments within candidate text regions are determined, with the shortest segments selected as representative of stroke width. Statistical features of the stroke widths are used as part of the process to classify each region as containing or not containing a text character or glyph.
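The stroke-width idea might be sketched as follows for a binary candidate region (True = ink), using run lengths along horizontal scan lines; the specific statistics and acceptance test are assumptions:

```python
import numpy as np

def scan_segment_widths(region: np.ndarray) -> list[int]:
    """Lengths of contiguous ink runs along each row of a binary region."""
    widths = []
    for row in region:
        padded = np.concatenate(([0], row.astype(int), [0]))
        edges = np.flatnonzero(np.diff(padded))   # run starts and ends
        starts, ends = edges[::2], edges[1::2]
        widths.extend((ends - starts).tolist())
    return widths

def looks_like_text(region: np.ndarray) -> bool:
    widths = np.asarray(scan_segment_widths(region))
    if widths.size == 0:
        return False
    stroke = np.percentile(widths, 25)            # shortest runs ~ stroke width
    # text strokes tend to be thin relative to the region and consistent
    return stroke <= region.shape[1] * 0.2 and widths.std() <= stroke

glyph = np.zeros((8, 16), dtype=bool)
glyph[:, 7:9] = True                              # a vertical 2-px-wide stroke
print(looks_like_text(glyph))
```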
Abstract:
A system may use multiple speech interface devices to interact with a user by speech. All or a portion of the speech interface devices may detect a user utterance and may initiate speech processing to determine a meaning or intent of the utterance. Within the speech processing, arbitration is employed to select one of the multiple speech interface devices to respond to the user utterance. Arbitration may be based in part on metadata that directly or indirectly indicates the proximity of the user to the devices, and the device that is deemed to be nearest the user may be selected to respond to the user utterance.
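A hedged sketch of the arbitration step, assuming each device reports proximity-related metadata such as wake-word SNR and input amplitude; the scoring weights are illustrative, not the actual arbitration policy:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    device_id: str
    snr_db: float      # signal-to-noise ratio of the detected utterance
    amplitude: float   # normalized input level at the device

def proximity_score(c: Candidate) -> float:
    """Combine metadata that indirectly indicates how close the user is."""
    return 0.7 * c.snr_db + 0.3 * (c.amplitude * 100.0)

def arbitrate(candidates: list[Candidate]) -> Candidate:
    """Select the device deemed nearest the user to respond."""
    return max(candidates, key=proximity_score)

winner = arbitrate([
    Candidate("kitchen", snr_db=18.0, amplitude=0.6),
    Candidate("living-room", snr_db=25.0, amplitude=0.8),
])
print(winner.device_id)   # -> living-room
```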
Abstract:
An audio device may be configured to work in conjunction with a handheld remote controller to receive voice commands from a user. The audio device may have multiple local microphones that are used for sound source localization, to determine the position of the user. A remote audio signal may be received from the remote controller and used in conjunction with local microphone signals generated by the local microphones to aid in determining the position of the user. The last known position of the user may be recorded whenever the user speaks into the remote controller. When the user is unable to find the remote controller, the audio device may direct the user toward that last known position.
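A minimal sketch of the last-known-position bookkeeping, with the sound-source-localization result reduced to a single bearing; the class and method names are hypothetical:

```python
from typing import Optional

class RemoteFinder:
    def __init__(self) -> None:
        self.last_position: Optional[float] = None   # bearing in degrees

    def on_remote_speech(self, localized_bearing_deg: float) -> None:
        """Called whenever the remote's audio signal carries user speech;
        the bearing comes from sound source localization on the local mics."""
        self.last_position = localized_bearing_deg

    def find_remote(self) -> str:
        """Direct the user toward the position recorded at the last use."""
        if self.last_position is None:
            return "No recorded position for the remote."
        return f"The remote was last used at about {self.last_position:.0f} degrees."

finder = RemoteFinder()
finder.on_remote_speech(135.0)
print(finder.find_remote())
```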
Abstract:
Disclosed are techniques for recognizing text from one or more frames of image data using contextual information. In some implementations, image data including a captured textual item is processed to identify an entity in the image data. A context can be selected using the entity, where the context corresponds to a dictionary. Text in the captured textual item can be identified using the dictionary. The identified text can be output to a display device.
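The context-selection idea could be sketched as follows, with a toy entity-to-dictionary map and difflib standing in for a real OCR correction model; the entity names and word lists are invented for illustration:

```python
import difflib

# Hypothetical contexts: a detected entity selects the matching dictionary.
DICTIONARIES = {
    "restaurant_menu": ["espresso", "latte", "croissant"],
    "street_sign": ["stop", "yield", "merge"],
}

def recognize_text(ocr_candidate: str, entity: str) -> str:
    """Snap a raw OCR string to the nearest word in the entity's dictionary."""
    words = DICTIONARIES.get(entity, [])
    match = difflib.get_close_matches(ocr_candidate.lower(), words, n=1)
    return match[0] if match else ocr_candidate

print(recognize_text("expresso", "restaurant_menu"))   # -> espresso
```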