Abstract:
A scalable per-title encoding technique may include detecting scene cuts in an input video received by an encoding network or system, generating segments of the input video, performing per-title encoding of a segment of the input video, training a deep neural network (DNN) for each representation of the segment, thereby generating a trained DNN, compressing the trained DNN, thereby generating a compressed trained DNN, and generating an enhanced bitrate ladder including metadata comprising the compressed trained DNN. In some embodiments, the method also may include generating a base layer bitrate ladder for CPU devices, and providing the enhanced bitrate ladder for GPU-available devices.
Abstract:
Methods are provided for reducing the size of a transpose buffer used for computation of a two-dimensional (2D) separable transform. Scaling factors and clip bit widths determined for a particular transpose buffer size and the expected transform sizes are used to reduce the size of the intermediate results of applying the 2D separable transform. The reduced bit widths of the intermediate results may vary across the intermediate results. In some embodiments, the scaling factors and associated clip bit widths may be adapted during encoding.
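The idea of scaling and clipping first-stage results before they enter the transpose buffer can be sketched as follows. This is a minimal illustration, not the claimed method: the function names, the single global `shift` and `clip_bits`, and the tiny Hadamard-style matrix in the usage lines are all assumptions chosen for brevity. The abstract allows bit widths that vary across the intermediate results, which this sketch does not model.

```python
def clip_signed(value, bits):
    """Clamp value to the range of a signed integer of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def round_shift(value, shift):
    """Right shift with rounding; a shift of 0 returns the value unchanged."""
    if shift == 0:
        return value
    return (value + (1 << (shift - 1))) >> shift


def separable_transform_2d(block, matrix, shift, clip_bits):
    """Apply a 2D separable transform to a square block.

    First-stage (row) results are scaled by `shift` and clipped to
    `clip_bits` before being written, transposed, into the buffer, so
    each buffer entry needs only `clip_bits` bits of storage.
    """
    n = len(block)
    transpose_buffer = [[0] * n for _ in range(n)]
    for r in range(n):
        for k in range(n):
            acc = sum(matrix[k][c] * block[r][c] for c in range(n))
            transpose_buffer[k][r] = clip_signed(round_shift(acc, shift), clip_bits)

    # Second stage: transform each buffer row, i.e. each column of the
    # first-stage output.
    out = [[0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            out[j][k] = sum(matrix[j][r] * transpose_buffer[k][r] for r in range(n))
    return out, transpose_buffer


# Usage with a hypothetical 2x2 Hadamard-style matrix.
H2 = [[1, 1], [1, -1]]
coeffs, buf = separable_transform_2d([[100, 100], [100, 100]], H2, shift=1, clip_bits=8)
```

With `clip_bits=8`, every buffer entry fits in one signed byte regardless of the input magnitude, at the cost of saturation for large inputs.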
Abstract:
There is included a method and apparatus comprising computer code configured to cause a processor or processors to perform obtaining video data, detecting at least one face from at least one frame of the video data, determining a set of facial landmark features of the at least one face from the at least one frame of the video data, and coding the video data at least partly by a neural network based on the determined set of facial landmark features.
Abstract:
A method for decoding an image according to the present invention comprises the steps of: receiving and parsing a parameter set including indication information that indicates the presence of reserved information to be used in the future; receiving and parsing a slice header including the reserved information when the indication information indicates its presence; and decoding the image according to the semantics and value corresponding to the reserved information. As a result, a method and an apparatus are provided for describing an additional extension information indication in a bitstream supporting a hierarchical image.
Abstract:
A video decoder is configured to decode a bitstream that comprises an encoded representation of video data. As part of decoding the bitstream, the video decoder obtains, from the bitstream, one or more syntax elements indicating one or more partitioning schemes. For each respective partitioning scheme of the one or more partitioning schemes, the respective partitioning scheme specifies a respective set of disjoint partitions whose union forms an output layer set. Each respective partition of the respective set of disjoint partitions contains one or more of the layers. The video decoder is further configured to decode each of the partitions of a particular partitioning scheme using different processing cores in a plurality of hardware cores, the particular partitioning scheme being one of the one or more partitioning schemes.
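The constraint that a partitioning scheme consists of disjoint partitions whose union forms the output layer set can be checked in a few lines. The sketch below assumes layers are identified by integer IDs and that each partition is handed to its own core. The function names and the one-partition-per-core mapping are illustrative assumptions, not syntax or behavior taken from the abstract.

```python
def is_valid_partitioning(partitions, output_layer_set):
    """Return True if the partitions are non-empty, pairwise disjoint,
    and their union equals the output layer set."""
    seen = set()
    for part in partitions:
        part = set(part)
        if not part or seen & part:
            return False  # empty partition, or a layer appears twice
        seen |= part
    return seen == set(output_layer_set)


def assign_partitions_to_cores(partitions, num_cores):
    """Map each partition to a distinct core index (assumed policy)."""
    if len(partitions) > num_cores:
        raise ValueError("not enough cores for this partitioning scheme")
    return {core: sorted(part) for core, part in enumerate(partitions)}


# Usage: three layers split across two hypothetical cores.
scheme = [{0, 1}, {2}]
if is_valid_partitioning(scheme, {0, 1, 2}):
    plan = assign_partitions_to_cores(scheme, num_cores=2)
```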
Abstract:
A video processing device for producing a frame of a merged digital video sequence is provided. A memory stores a first and a second digital video sequence depicting the same scene, the first digital video sequence having a higher pixel density than the second. A scaler generates an up-scaled version of the second digital video sequence having the same pixel density as the first digital video sequence. A decoder decodes a frame of the first digital video sequence, and a skip block identifier identifies the positions of a skip block and a non-skip block in that frame. A block extractor extracts a block of pixels from the frame of the second digital video sequence based on the skip block, and a block of pixels from the frame of the first digital video sequence based on the non-skip block. A merging unit merges both extracted blocks to produce the frame of the merged video sequence.
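The merging step can be illustrated with a small sketch that takes pixels from the up-scaled low-density frame wherever the high-density frame has a skip block, and from the high-density frame elsewhere. Frames are plain nested lists of pixel values. The nearest-neighbour scaler, the boolean `skip_map` layout, and all function names are assumptions made for illustration.

```python
def upscale_nearest(frame, factor):
    """Nearest-neighbour up-scaling by an integer factor (assumed scaler)."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in frame
        for _ in range(factor)
    ]


def merge_frame(hi_frame, upscaled_lo_frame, skip_map, block):
    """Build a merged frame block by block.

    skip_map[by][bx] is True where the block at block coordinates
    (bx, by) of the high-density frame is a skip block.
    """
    height, width = len(hi_frame), len(hi_frame[0])
    merged = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            is_skip = skip_map[y // block][x // block]
            source = upscaled_lo_frame if is_skip else hi_frame
            merged[y][x] = source[y][x]
    return merged


# Usage: a 2x4 high-density frame and a 1x2 low-density frame.
hi = [[1, 1, 1, 1], [1, 1, 1, 1]]
lo_up = upscale_nearest([[9, 8]], factor=2)
merged = merge_frame(hi, lo_up, skip_map=[[True, False]], block=2)
```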
Abstract:
A multi-layer video decoder is configured to determine, based on a list of triplet entries, whether the multi-layer video decoder is capable of decoding a bitstream that comprises an encoded representation of the multi-layer video data. The number of triplet entries in the list is equal to a number of single-layer decoders in the multi-layer video decoder. Each respective triplet entry in the list of triplet entries indicates a profile, a tier, and a level for a respective single-layer decoder in the multi-layer video decoder. The multi-layer video decoder is configured such that, based on the multi-layer video decoder being capable of decoding the bitstream, the multi-layer video decoder decodes the bitstream.
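A capability check based on a list of (profile, tier, level) triplets might look like the sketch below. The abstract does not state the matching rule, so the rule used here is an assumption: requirement i is handled by single-layer decoder i, and a decoder qualifies when its profile matches and its tier and level are at least those required. The `PTL` type, the integer encodings of tier and level, and the profile names in the usage lines are hypothetical.

```python
from typing import NamedTuple


class PTL(NamedTuple):
    """One profile/tier/level triplet."""
    profile: str
    tier: int   # assumed encoding: 0 = lower tier, 1 = higher tier
    level: int  # assumed encoding: larger value = higher level


def decoder_supports(decoder, required):
    """Assumed rule: same profile, tier and level not below the requirement."""
    return (
        decoder.profile == required.profile
        and decoder.tier >= required.tier
        and decoder.level >= required.level
    )


def can_decode(decoder_triplets, requirements):
    """True if every requirement is met by the decoder at the same index."""
    if len(requirements) > len(decoder_triplets):
        return False
    return all(
        decoder_supports(dec, req)
        for dec, req in zip(decoder_triplets, requirements)
    )


# Usage: a multi-layer decoder built from two single-layer decoders.
decoders = [PTL("main", 1, 41), PTL("main", 0, 31)]
needed = [PTL("main", 0, 40), PTL("main", 0, 30)]
ok = can_decode(decoders, needed)
```

The list length equals the number of single-layer decoders, as in the abstract, which is why a bitstream with more requirements than entries is rejected outright.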
Abstract:
Systems, methods, and devices for coding multilayer video data are disclosed that may include encoding, decoding, transmitting, or receiving multilayer video data. The systems, methods, and devices may receive or transmit a non-entropy coded representation format within a video parameter set (VPS). The systems, methods, and devices may code (encode or decode) video data based on the non-entropy coded representation format within the VPS, wherein the representation format includes one or more of chroma format, whether different color planes are separately coded, picture width, picture height, luma bit depth, and chroma bit depth.
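A non-entropy coded representation format can be read with plain fixed-width bit fields, with no variable-length decoding. The sketch below packs and unpacks the six listed parameters as fixed-length fields. The field names, their order, and the bit widths are illustrative assumptions, not syntax taken from the abstract or from any standard.

```python
# (name, width in bits): hypothetical layout for illustration only.
FIELDS = [
    ("chroma_format", 2),
    ("separate_colour_planes", 1),
    ("pic_width", 16),
    ("pic_height", 16),
    ("luma_bit_depth_minus8", 4),
    ("chroma_bit_depth_minus8", 4),
]


def pack_rep_format(values):
    """Pack the fields into one integer, first field in the top bits."""
    bits = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits = (bits << width) | value
    return bits


def unpack_rep_format(bits):
    """Inverse of pack_rep_format: peel fields off from the low bits."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = bits & ((1 << width) - 1)
        bits >>= width
    return values


# Usage: 4:2:0, 1920x1080, 8-bit luma and chroma.
rep = {
    "chroma_format": 1,
    "separate_colour_planes": 0,
    "pic_width": 1920,
    "pic_height": 1080,
    "luma_bit_depth_minus8": 0,
    "chroma_bit_depth_minus8": 0,
}
packed = pack_rep_format(rep)
```

Because every field has a known width, a receiver can locate any parameter by offset alone.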
Abstract:
The present invention relates to a system and method for efficiently generating images and videos as an array of objects of interest (e.g., faces, hands, plates) at a desired resolution for vision tasks such as face recognition, facial expression analysis, and hand gesture detection. The composition of such images and videos takes into account the similarity of objects in the same category to encode them more effectively, saving transmission time and storage. Shorter transmission time makes such a system more efficient, while reduced storage needs allow lower-cost storage means to be used.
Abstract:
Embodiments of the present invention provide techniques for efficiently coding/decoding video data during circumstances where a decoder only requires or utilizes a portion of coded frames. A coder may exchange signaling with a decoder to identify unused areas of frames and prediction modes for the unused areas. An input frame may be parsed into a used area and an unused area based on the exchanged signaling. If motion vectors of the input frame are not limited to the used areas of the reference frames, the unused area of the input frame may be coded using low complexity. If the motion vectors of the input frame are limited to the used areas of the reference frames, the pixel blocks in the unused area of the input frame may not be coded, or the unused area of the input frame may be filled with gray, white, or black pixel blocks.
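The decision described above reduces to a small branch on whether the motion vectors stay inside the used areas of the reference frames. The sketch below is one possible reading: the rectangle representation `(left, top, right, bottom)`, the square block size, the returned mode strings, and the function names are all hypothetical.

```python
def reference_block_inside(x, y, mv, block, used):
    """True if the block at (x, y), displaced by motion vector mv,
    lies entirely inside the used rectangle of the reference frame."""
    left, top, right, bottom = used
    rx, ry = x + mv[0], y + mv[1]
    return left <= rx and top <= ry and rx + block <= right and ry + block <= bottom


def motion_limited_to_used_area(blocks, block, used):
    """blocks is a list of (x, y, (mvx, mvy)) entries for the input frame."""
    return all(reference_block_inside(x, y, mv, block, used) for x, y, mv in blocks)


def choose_unused_area_coding(limited, fill=None):
    """Pick a coding mode for the unused area of the input frame."""
    if not limited:
        # References may reach into unused areas, so those areas must
        # still be coded, but cheaply.
        return "low_complexity"
    if fill is None:
        return "not_coded"
    if fill not in ("gray", "white", "black"):
        raise ValueError("fill must be gray, white, or black")
    return f"fill_{fill}"


# Usage: a 64x64 used area with two 16x16 blocks.
used_area = (0, 0, 64, 64)
blocks = [(0, 0, (4, 4)), (32, 32, (8, -8))]
mode = choose_unused_area_coding(motion_limited_to_used_area(blocks, 16, used_area))
```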