Abstract:
A system and method for forming a segmented speech signal from an input speech signal having a plurality of frames. The input speech signal is converted from a time domain signal to a frequency domain signal having a plurality of speech frames, wherein each speech frame in the frequency domain signal is represented by at least one spectral value associated with the speech frame. A spectral difference value is then determined for each pair of adjacent frames in the frequency domain signal, wherein the spectral difference value for each pair of adjacent frames is representative of a difference between the at least one spectral value associated with each frame in the pair of adjacent frames. An initial cluster boundary is set between each pair of adjacent frames in the frequency domain signal, and a variance value is assigned to each cluster in the frequency domain signal, wherein the variance value for each cluster is equal to one of the determined spectral difference values. Next, a plurality of cluster merge parameters is calculated, wherein each of the cluster merge parameters is associated with a pair of adjacent clusters in the frequency domain signal. A minimum cluster merge parameter is selected from the plurality of cluster merge parameters. A merged cluster is then formed by canceling a cluster boundary between the clusters associated with the minimum merge parameter and assigning a merged variance value to the merged cluster, wherein the merged variance value is representative of the variance values assigned to the clusters associated with the minimum merge parameter. The process is repeated in order to form a plurality of merged clusters, and the segmented speech signal is formed in accordance with the plurality of merged clusters.
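As a rough illustration of the merge loop described above, the Python sketch below seeds one cluster per frame, repeatedly merges the adjacent pair with the smallest merge parameter, and stops at an assumed target cluster count. The merge-parameter formula (sum of variances), the variance seeding, and the name segment_speech are assumptions; the abstract does not fix them.

    import numpy as np

    def segment_speech(spectra, target_clusters=10):
        # spectra: (num_frames, num_bins) array of spectral values, one
        # row per frame of the frequency domain signal.
        # Spectral difference value for each pair of adjacent frames.
        diffs = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
        # Initial cluster boundary between every pair of adjacent frames:
        # one cluster per frame, its variance seeded from a neighboring
        # spectral difference (the abstract leaves the exact choice open).
        clusters = [{"start": i, "end": i + 1,
                     "var": float(diffs[min(i, len(diffs) - 1)])}
                    for i in range(len(spectra))]
        while len(clusters) > target_clusters:
            # Cluster merge parameter per adjacent pair; the sum of the
            # two variances is an assumed formula.
            params = [clusters[i]["var"] + clusters[i + 1]["var"]
                      for i in range(len(clusters) - 1)]
            i = int(np.argmin(params))
            a, b = clusters.pop(i), clusters.pop(i)
            # Cancel the shared boundary; the merged variance represents
            # the variances of both constituent clusters.
            clusters.insert(i, {"start": a["start"], "end": b["end"],
                                "var": a["var"] + b["var"]})
        # Segment boundaries (frame index ranges) of the merged clusters.
        return [(c["start"], c["end"]) for c in clusters]

    print(segment_speech(np.random.rand(50, 32)))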
Abstract:
A system and method of video processing are disclosed. In a particular implementation, a device includes a processor configured to generate index data for video content. The index data includes a summary frame and metadata. The summary frame is associated with a portion of the video content and illustrates multiple representations of an object included in the portion of the video content. The metadata includes marker data that indicates a playback position of the video content. The playback position is associated with the summary frame. The device also includes a memory configured to store the index data.
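A minimal sketch of how the index data described above might be laid out, assuming Python dataclasses; the field names (summary_frame, playback_position_ms) are illustrative, not from the disclosure.

    from dataclasses import dataclass

    @dataclass
    class Metadata:
        # Marker data indicating the playback position of the video
        # content that the summary frame corresponds to.
        playback_position_ms: int

    @dataclass
    class IndexData:
        # Composite image showing multiple representations of an object
        # from one portion of the video, stored here as encoded bytes.
        summary_frame: bytes
        metadata: Metadata

    # Selecting this entry in an index view could seek playback to 93 s.
    entry = IndexData(summary_frame=b"...",
                      metadata=Metadata(playback_position_ms=93_000))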
Abstract:
An apparatus and method for displaying content is disclosed. A particular method includes determining a viewing orientation of a user relative to a display and providing a portion of content to the display based on the viewing orientation. The portion includes at least a first viewable element of the content and does not include at least one second viewable element of the content. The method also includes determining an updated viewing orientation of the user and updating the portion of the content based on the updated viewing orientation. The updated portion includes at least the second viewable element. A display difference between the portion and the updated portion is non-linearly related to an orientation difference between the viewing orientation and the updated viewing orientation.
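The sketch below illustrates one way the non-linear relation could look: a cubic mapping from viewing angle to content offset, so small orientation changes produce proportionally smaller display changes than large ones. The cubic curve and all parameter names are assumptions; the abstract requires only that the relation be non-linear.

    def pan_offset(view_angle_deg, max_angle_deg=30.0, max_offset_px=400):
        # Clamp the viewing angle, then apply a cubic curve: the display
        # difference grows non-linearly with the orientation difference.
        t = max(-1.0, min(1.0, view_angle_deg / max_angle_deg))
        return int(max_offset_px * t ** 3)

    # A 5-degree turn barely pans the content; a 15-degree turn pans it
    # 50 px, far more than three times as much as the 5-degree turn.
    print(pan_offset(5.0), pan_offset(15.0))  # -> 1 50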
Abstract:
This disclosure describes techniques for modifying application program interface (API) calls in a manner that can cause a device to render native three dimensional (3D) graphics content in stereoscopic 3D. The techniques can be implemented so that the API calls themselves are modified while the API itself and the GPU hardware remain unmodified. The techniques of the present disclosure include using the same viewing frustum defined by the original content to generate a left-eye image and a right-eye image, and shifting the viewport offsets of the left-eye image and the right-eye image.
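A small sketch of the viewport-shifting idea, in Python for uniformity with the other examples: both eye images keep the original frustum, and only the viewport origin is offset horizontally. The function name and the pixel offset are illustrative.

    def stereo_viewports(x, y, width, height, eye_offset_px=8):
        # Both eye images reuse the viewing frustum defined by the
        # original mono content; only the viewport origin is shifted
        # horizontally to create the stereo disparity.
        left = (x - eye_offset_px, y, width, height)
        right = (x + eye_offset_px, y, width, height)
        return left, right

    # An intercepted call equivalent to glViewport(0, 0, 1280, 720)
    # could be replaced by two draws, one per shifted viewport.
    left_vp, right_vp = stereo_viewports(0, 0, 1280, 720)
    print(left_vp, right_vp)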
Abstract:
The present disclosure provides systems, methods and apparatus, including computer programs encoded on computer storage media, for providing virtual keyboards. In one aspect, a system includes a camera, a display, a video feature extraction module and a gesture pattern matching module. The camera captures a sequence of images containing a finger of a user, and the display displays each image combined with a virtual keyboard having a plurality of virtual keys. The video feature extraction module detects motion of the finger in the sequence of images relative to virtual sensors of the virtual keys, and determines sensor actuation data based on the detected motion relative to the virtual sensors. The gesture pattern matching module uses the sensor actuation data to recognize a gesture.
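The sketch below suggests how detected fingertip motion might be turned into sensor actuation data: each virtual key carries a virtual sensor, and a sensor is recorded as actuated when the fingertip passes near it. The proximity test, radius, and all names are assumptions.

    def actuated_sensors(finger_positions, sensors, radius=12):
        # finger_positions: (x, y) fingertip locations detected across
        # the image sequence. sensors: sensor ID -> (x, y) position on
        # the virtual keyboard overlay. A sensor is recorded as actuated
        # when the fingertip passes within `radius` pixels of it.
        hits = []
        for fx, fy in finger_positions:
            for sensor_id, (sx, sy) in sensors.items():
                near = (fx - sx) ** 2 + (fy - sy) ** 2 <= radius ** 2
                if near and (not hits or hits[-1] != sensor_id):
                    hits.append(sensor_id)
        return hits  # sensor actuation data for gesture pattern matching

    keys = {"Q": (20, 40), "W": (60, 40), "E": (100, 40)}
    print(actuated_sensors([(18, 42), (58, 44), (102, 39)], keys))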
Abstract:
The example techniques of this disclosure are directed to generating a stereoscopic view from an application designed to generate a mono view. For example, the techniques may modify source code of a vertex shader to cause the modified vertex shader, when executed, to generate graphics content for the images of the stereoscopic view. As another example, the techniques may modify a command that defines a viewport for the mono view into commands that define the viewports for the images of the stereoscopic view.
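As a speculative sketch of the shader-modification idea, the Python function below rewrites a simple GLSL vertex shader so that a hypothetical uniform u_eye_offset shifts clip-space x, letting the same shader render the left-eye and right-eye images with different offsets. Real shaders vary widely, so a textual rewrite like this is only illustrative.

    def add_stereo_offset(vertex_shader_src):
        # Declare a hypothetical uniform that the renderer sets to +d
        # for one eye and -d for the other between the two draws.
        src = vertex_shader_src.replace(
            "void main()", "uniform float u_eye_offset;\nvoid main()", 1)
        # Shift the clip-space x coordinate of every vertex by the
        # per-eye offset, leaving the rest of the shader untouched.
        return src.replace(
            "gl_Position =",
            "gl_Position = vec4(u_eye_offset, 0.0, 0.0, 0.0) +", 1)

    mono = "void main() { gl_Position = uMVP * aPosition; }"
    print(add_stereo_offset(mono))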
Abstract:
In a particular illustrative embodiment, a method of determining a viewpoint of a person based on skin color area and face area is disclosed. The method includes receiving image data corresponding to an image captured by a camera, the image including at least one object to be displayed at a device coupled to the camera. The method further includes determining a viewpoint of the person relative to a display of the device coupled to the camera. The viewpoint of the person may be determined by determining a face area of the person based on a determined skin color area of the person and tracking a face location of the person based on the face area. One or more objects displayed at the display may be moved in response to the determined viewpoint of the person.
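A toy version of the skin-color step, assuming NumPy: segment skin-colored pixels with fixed RGB thresholds and track the face location as the centroid of the result. The thresholds are generic heuristics, not values from the disclosure, and a full implementation would isolate the face area within the skin color area before tracking.

    import numpy as np

    def skin_mask(rgb):
        # Crude fixed-threshold skin segmentation in RGB; the thresholds
        # are generic heuristics, not values from the disclosure.
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)

    def face_location(rgb):
        # Track the face location as the centroid of the skin-colored
        # area; a fuller system would first isolate the face area
        # within the skin color area.
        ys, xs = np.nonzero(skin_mask(rgb))
        if xs.size == 0:
            return None
        return float(xs.mean()), float(ys.mean())

    frame = np.zeros((120, 160, 3), dtype=np.uint8)
    frame[40:80, 60:100] = (200, 140, 120)  # a skin-toned patch
    print(face_location(frame))  # roughly the patch center: (79.5, 59.5)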
Abstract:
A method and system that combine multiple voice recognition engines and resolve differences between their individual results. A speaker independent (SI) Hidden Markov Model (HMM) engine, a speaker independent Dynamic Time Warping (DTW-SI) engine, and a speaker dependent Dynamic Time Warping (DTW-SD) engine are combined. Combining and resolving the results of these engines yields a system with better recognition accuracy and a lower rejection rate than a system that relies on the results of only one engine.
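One plausible way to combine and resolve the three engines' outputs is weighted voting with a rejection threshold, sketched below; the scores, weights, and threshold are assumptions, not the disclosed resolution logic.

    def combine_hypotheses(results, reject_below=0.5):
        # results: (word, score, weight) per engine, e.g. from the
        # SI-HMM, DTW-SI and DTW-SD engines; scores are assumed to be
        # comparable confidences in [0, 1]. Weighted voting resolves
        # disagreements, and a weak overall winner is rejected (None).
        votes = {}
        for word, score, weight in results:
            votes[word] = votes.get(word, 0.0) + score * weight
        word, total = max(votes.items(), key=lambda kv: kv[1])
        return word if total >= reject_below else None

    print(combine_hypotheses([("yes", 0.8, 0.4),    # SI-HMM
                              ("yes", 0.6, 0.3),    # DTW-SI
                              ("six", 0.7, 0.3)]))  # DTW-SD -> "yes"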
Abstract:
Embodiments include methods and systems that determine pixel displacement between frames based on a respective weighting value for each pixel or group of pixels. The weighting values indicate which pixels are more pertinent to optical flow computations. Computational resources and effort can then be focused on pixels with higher weights, which are generally the pixels most pertinent to the optical flow determination.
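A compact sketch of the weighting idea, assuming gradient magnitude as the per-pixel weight (one plausible choice; the embodiments may define weights differently): only pixels above a weight percentile are kept for the displacement computation.

    import numpy as np

    def flow_weights(frame, keep_percentile=80):
        # Weight each pixel by its gradient magnitude: textured pixels
        # constrain the displacement estimate, flat ones barely do.
        gy, gx = np.gradient(frame.astype(float))
        weights = np.hypot(gx, gy)
        # Mask of high-weight pixels on which to spend the optical flow
        # computation; the percentile cut-off is an assumed policy.
        return weights, weights >= np.percentile(weights, keep_percentile)

    frame = np.random.rand(240, 320)
    weights, mask = flow_weights(frame)
    print(f"{mask.mean():.0%} of pixels kept for flow computation")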
Abstract:
In general, this disclosure describes techniques for providing a gesture-based user interface. For example, according to some aspects of the disclosure, a user interface generally includes a camera and a computing device that identifies and tracks the motion of one or more fingertips of a user. In some examples, the user interface is configured to identify predefined gestures (e.g., patterns of motion) associated with certain motions of the user's fingertips. In another example, the user interface is configured to identify hand postures (e.g., patterns of which fingertips are visible). Accordingly, the user can interact with the computing device by performing the gestures.
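The toy matcher below shows the flavor of gesture identification from a tracked fingertip trajectory; it distinguishes only two swipes by net horizontal motion, whereas the described interface matches richer predefined motion patterns and hand postures. All thresholds and names are invented for illustration.

    def classify_gesture(points, min_travel_px=40):
        # points: tracked (x, y) fingertip positions over time. Net
        # horizontal travel decides between two swipe gestures;
        # anything else is left unrecognized.
        if len(points) < 2:
            return None
        dx = points[-1][0] - points[0][0]
        if dx > min_travel_px:
            return "swipe_right"
        if dx < -min_travel_px:
            return "swipe_left"
        return None

    print(classify_gesture([(10, 50), (30, 52), (90, 55)]))  # swipe_right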