Abstract:
Methods and systems are described for performing video retrieval together with video grounding. A word-based query for a video is received and encoded into a query representation using a trained query encoder. One or more video representations that are similar to the query representation are identified from a plurality of video representations. Each similar video representation represents a respective relevant video. A grounding is generated for each relevant video by forward propagating each respective similar video representation together with the query representation through a trained grounding module. The relevant videos or identifiers of the relevant videos are output together with the grounding generated for each relevant video.
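The retrieve-then-ground flow described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed method: the vectors are hand-written stand-ins for the outputs of the trained encoders, cosine similarity is an assumed similarity measure, and `ground` is a hypothetical placeholder for the trained grounding module that simply picks the best-matching clip.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, video_vecs, top_k=2):
    """Return ids of the top_k video representations most similar to the query."""
    ranked = sorted(video_vecs,
                    key=lambda vid: cosine(query_vec, video_vecs[vid]),
                    reverse=True)
    return ranked[:top_k]

def ground(query_vec, clip_vecs):
    """Hypothetical stand-in for the grounding module: index of the best clip."""
    scores = [cosine(query_vec, clip) for clip in clip_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Assumed, hand-written representations (a real system would use trained encoders).
query = [1.0, 0.0, 0.5]
videos = {
    "vid_a": [0.9, 0.1, 0.4],
    "vid_b": [0.0, 1.0, 0.0],
    "vid_c": [0.5, 0.5, 0.5],
}
clips = {
    "vid_a": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.6]],
    "vid_c": [[1.0, 0.1, 0.4], [0.0, 0.2, 1.0]],
}

relevant = retrieve(query, videos)
# Output each relevant video identifier together with its grounding.
results = [(vid, ground(query, clips[vid])) for vid in relevant]
print(results)
```

Retrieval and grounding are kept as separate functions here so that the grounding step runs only on the videos that retrieval returns, which mirrors the order of operations in the abstract.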
Abstract:
A method and a system for retrieving video temporal segments are provided. In the method, a video is analyzed to obtain frame feature information of the video; the frame feature information is input into an encoder to output first data relating to temporal information of the video; the first data and a retrieval description for retrieving video temporal segments of the video are input into a decoder to output second data; attention computation training is conducted according to the first data and the second data; video temporal segments of the video corresponding to the retrieval description are determined according to the attention computation training.
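One way to picture how attention between the encoder output (the first data) and the decoder output (the second data) could point to a temporal segment is sketched below. It covers only the scoring step, not the training; dot-product attention, the thresholding rule, and all values are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(decoder_state, encoder_states):
    """Dot-product attention of one decoder state over per-frame encoder states."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    return softmax(scores)

def segment_from_weights(weights, threshold):
    """Span from the first to the last frame whose weight exceeds the threshold."""
    hits = [i for i, w in enumerate(weights) if w > threshold]
    return (hits[0], hits[-1]) if hits else None

# Assumed per-frame encoder outputs and one decoder state for the description.
encoder_states = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [2.2, 1.4], [0.0, 0.3]]
decoder_state = [1.0, 1.0]

weights = attention_weights(decoder_state, encoder_states)
# Frames weighted above the uniform level 1/N are treated as the segment.
segment = segment_from_weights(weights, threshold=1.0 / len(weights))
print(segment)
```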
Abstract:
An information processing terminal connectable to a WWW (World Wide Web) server via a public network includes a storage unit that stores content data including image information or sound information with identification information of the content data, an acquiring unit that acquires identification information of content data from the WWW server, a retrieving unit that retrieves content data corresponding to the identification information acquired by the acquiring unit from the storage unit, and a presenting unit that presents the content data retrieved by the retrieving unit.
Abstract:
A query system for structured multimedia content retrieval comprises a query language based on a logic formalism for content retrieval. The language includes query constructs and formalisms for specifying different aspects of XML documents, and these constructs and formalisms are particularly adapted for spatial, temporal and visual datatypes. Certain critical specification issues in MPEG-7 XML queries are identified. An XML query language with multimedia query constructs is described which is based on a logic formalism called path predicate calculus. In this path predicate calculus, the atomic logic formulas are element predicates rather than the relation predicates of relational calculus. Queries in this calculus are equivalent to finding all proofs of the existential closure of logical assertions, in the form of path predicates, that the document tree elements must satisfy. Spatial, temporal and visual datatypes and relationships can also be described in this formalism for content retrieval.
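The idea of a query as a predicate that document tree elements must satisfy can be illustrated loosely with Python's standard XML library. This is not the path predicate calculus itself: the element names (`Video`, `Segment`, `Start`, `End`, `Label`) are hypothetical rather than taken from the MPEG-7 schema, and the helper functions are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are illustrative, not MPEG-7.
doc = ET.fromstring("""
<Video>
  <Segment id="s1"><Start>0</Start><End>12</End><Label>goal</Label></Segment>
  <Segment id="s2"><Start>12</Start><End>30</End><Label>replay</Label></Segment>
  <Segment id="s3"><Start>30</Start><End>41</End><Label>goal</Label></Segment>
</Video>
""")

def satisfying(root, path, predicate):
    """Return every element reached by path for which the predicate holds."""
    return [el for el in root.findall(path) if predicate(el)]

def is_goal_ending_by(limit):
    """Element predicate with a temporal constraint: a 'goal' ending by limit."""
    def predicate(segment):
        return (segment.findtext("Label") == "goal"
                and int(segment.findtext("End")) <= limit)
    return predicate

matches = [s.get("id") for s in satisfying(doc, "./Segment", is_goal_ending_by(20))]
print(matches)
```

Each returned element plays the role of one witness for the assertion "there exists a segment on this path that satisfies the predicate".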
Abstract:
A method of real-time video event detection includes: obtaining, based on a natural language query, a query vector; performing multimodal feature extraction on a video stream to obtain a video vector; obtaining a similarity score by comparing the query vector to the video vector; comparing the similarity score to a predetermined threshold; and activating, based on the similarity score being above the predetermined threshold, an action trigger. The multimodal feature extraction is performed using a plurality of overlapping windows that include sequential frames of the video stream.
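The windowed detection loop in this abstract might look like the sketch below. The feature extractor is a hypothetical placeholder (a mean over per-frame vectors) rather than real multimodal extraction, cosine similarity is an assumed scoring function, and the frame data, window size, stride, and threshold are made up.

```python
import math

def overlapping_windows(frames, size, stride):
    """Yield (start_index, window) pairs; windows overlap when stride < size."""
    for start in range(0, len(frames) - size + 1, stride):
        yield start, frames[start:start + size]

def window_vector(window):
    """Hypothetical feature extractor: mean of the per-frame feature vectors."""
    n = len(window)
    return [sum(column) / n for column in zip(*window)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def detect(frames, query_vec, size, stride, threshold):
    """Return start indices of windows whose similarity score exceeds the threshold."""
    triggered = []
    for start, window in overlapping_windows(frames, size, stride):
        if cosine(query_vec, window_vector(window)) > threshold:
            triggered.append(start)  # a real system would fire the action trigger here
    return triggered

# Assumed per-frame features; the queried event occupies frames 2 to 4.
frames = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 0], [0, 1]]
query = [1, 0]
starts = detect(frames, query, size=3, stride=1, threshold=0.8)
print(starts)
```

With a stride smaller than the window size, an event that straddles a window boundary still falls mostly inside at least one window, which is the usual motivation for overlapping windows.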
Abstract:
Methods, systems, and products index digital scenes in digital media. A uniform resource locator is assigned to each different digital scene within the digital media. The uniform resource locator uniquely identifies a resource from which each different digital scene may be retrieved. Individual scenes may thus be retrieved on their own, conserving bandwidth and memory.
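A scene index of this kind could be built as in the following sketch. The base URL, the media identifier, and the choice to encode scene boundaries as `start` and `end` query parameters are all assumptions for illustration; the abstract does not say how the locators are formed.

```python
from urllib.parse import urlencode

def index_scenes(base_url, media_id, scenes):
    """Map each scene number to a URL that identifies only that scene."""
    index = {}
    for number, (start, end) in enumerate(scenes, start=1):
        query = urlencode({"media": media_id, "start": start, "end": end})
        index[number] = f"{base_url}?{query}"
    return index

# Hypothetical host and scene boundaries in seconds.
scene_index = index_scenes(
    "https://media.example.com/scene",
    "movie42",
    [(0, 95), (95, 240)],
)
print(scene_index[2])
```

A client holding this index can request scene 2 alone instead of downloading the whole media file.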
Abstract:
Presenting natural-language-understanding (NLU) results can include redundancies and awkward sentence structures. In an embodiment of the present invention, a method includes, responsive to receiving a result of an NLU query, loading a matching template of a plurality of templates stored in a memory. Each template has mask fields associated with at least one property. The method compares the properties of the mask fields of each of the templates to properties of the query and properties of the result, and selects the matching template. The method further completes the matching template by inserting fields of the result into corresponding mask fields of the matching template. The method may further suppress certain mask fields of the matching template to increase brevity and improve the naturalness of the response when appropriate based on the results of the NLU query. The method further presents the completed matching template to a user via a display.
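Template selection, completion, and mask-field suppression can be sketched as below. The template structure, the matching rule (choose the template that uses the most fields available in the result), and the weather example are all hypothetical simplifications; the abstract compares richer properties of the query and the result.

```python
# Hypothetical templates: each segment pairs some text with the mask field it uses.
templates = [
    {"name": "weather_full",
     "segments": [("It is {temp} degrees", "temp"),
                  ("in {city}", "city"),
                  ("on {day}", "day")]},
    {"name": "weather_short",
     "segments": [("It is {temp} degrees", "temp")]},
]

def fields(template):
    """Set of mask-field names a template needs."""
    return {field for _, field in template["segments"]}

def select_template(templates, result):
    """Pick the template whose fields are all in the result, preferring more fields."""
    candidates = [t for t in templates if fields(t) <= set(result)]
    return max(candidates, key=lambda t: len(fields(t)), default=None)

def complete(template, result, suppress=()):
    """Fill mask fields from the result, skipping segments whose field is suppressed."""
    parts = [text.format(**result)
             for text, field in template["segments"]
             if field not in suppress]
    return " ".join(parts) + "."

result = {"temp": 21, "city": "Boston", "day": "Monday"}
chosen = select_template(templates, result)
# Suppress the city, for example when the user's query already named it.
sentence = complete(chosen, result, suppress={"city"})
print(sentence)
```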
Abstract:
Disclosed are systems and methods that convert digital video data, such as two-dimensional digital video data, into a natural language text description describing the subject matter represented in the video. For example, the disclosed implementations may process video data in real-time, near real-time, or after the video data is created and generate a text-based video narrative describing the subject matter of the video. In addition, the disclosed implementations may also support a question and answer session in which a user may submit queries about the subject matter of one or more videos and the disclosed implementations will present natural language responses based on the subject matter of the video and any corresponding context.