Stealing video editors' jobs? New Microsoft Research technology automatically titles videos

Professors Lin Jiawen and Sun Min of the Department of Electronic Engineering at National Tsing Hua University in Taiwan recently announced that, in collaboration with Dr. Tao Mei of Microsoft Research Asia, they have developed computer vision technology that automatically tags and titles video content.

Dr. Tao Mei is reported to have taken part in the research and development of Microsoft COCO, a new dataset for image recognition, classification, and description that is designed for recognizing multiple objects in a scene. It is best known in the industry for the Microsoft COCO image captioning challenge, in which participants use their own image recognition systems together with Microsoft COCO to generate text descriptions of images. The results are evaluated on accuracy, level of detail, and similarity to descriptions written by humans.
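As a rough illustration of what "similarity to descriptions written by humans" can mean in practice, the sketch below computes a clipped unigram precision between a machine caption and a set of human reference captions. This is only a simplified stand-in, not the challenge's actual scoring procedure; the function name `unigram_precision` and the sample captions are hypothetical.

```python
from collections import Counter


def unigram_precision(candidate: str, references: list[str]) -> float:
    """Fraction of candidate words that also appear in the human references.

    Illustrative only: a simplified word-overlap score, not the official
    challenge metric. Each word's credit is clipped to the highest count
    seen in any single reference, so repeating a word does not inflate it.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0

    cand_counts = Counter(cand_tokens)

    # For every word, remember the largest count found in any one reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    matched = sum(
        min(count, max_ref_counts[word]) for word, count in cand_counts.items()
    )
    return matched / len(cand_tokens)


if __name__ == "__main__":
    human_captions = [
        "a man is riding a skateboard",
        "a skateboarder does a trick",
    ]
    print(unigram_precision("a man rides a skateboard", human_captions))
```

A score near 1 means most of the machine caption's words also occur in the human captions. Real evaluations combine several automatic metrics with human judgment, because word overlap alone cannot tell whether a caption is accurate or detailed.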

Microsoft stated that the two professors from National Tsing Hua University in Taiwan used the Microsoft COCO dataset to build a system that applies computer vision to identify the main content of a video and give it a title.

Microsoft wrote in a blog post:

Professor Sun used deep learning to automatically find special moments or important content in a video, and created a new video title generation method that produces accurate and interesting titles based on that content. Meanwhile, Professor Lin developed a method that automatically detects faces in video and gives users more comprehensive information and suggestions for sharing their videos. Combined, their algorithms can detect and describe important content while generating tags and titles.
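The two-step idea described above, finding the important moments first and then titling the video from them, can be pictured with a toy sketch. It assumes that some upstream models have already assigned each video segment a highlight score and a caption; the `Segment` class, its fields, and `pick_titles` are hypothetical names for illustration and are not the researchers' actual system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float          # segment start time in seconds
    end_s: float            # segment end time in seconds
    highlight_score: float  # assumed output of a highlight detector (higher = more important)
    caption: str            # assumed output of a captioning model for this segment


def pick_titles(segments: list[Segment], top_k: int = 1) -> list[str]:
    """Return captions of the top_k highest-scoring segments as title candidates."""
    ranked = sorted(segments, key=lambda s: s.highlight_score, reverse=True)
    return [s.caption for s in ranked[:top_k]]


if __name__ == "__main__":
    video = [
        Segment(0.0, 5.0, 0.10, "people standing in a park"),
        Segment(5.0, 9.0, 0.92, "a man lands a skateboard trick"),
        Segment(9.0, 15.0, 0.35, "a man walks away with his skateboard"),
    ]
    print(pick_titles(video))
```

In this sketch the title simply reuses the caption of the highest-scoring segment. The point is only that the title is driven by the highlight rather than by the video as a whole.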

Professor Sun Min and his students also improved the system by taking part in the VideoToText challenge. They are reported to be presenting their latest research results at the European Conference on Computer Vision (ECCV).

Explaining and describing the content of a video or picture requires understanding not only what is in the frame, but also how the objects in it relate to one another. Using algorithms to identify video content and then produce titles or tags is comparatively difficult and computationally intensive. Recognizing image content to generate tags or descriptive text, by contrast, is already more mature.

Last month, Google released its latest machine learning system. By identifying the content of an image and matching it with corresponding text, the algorithm now describes images with an accuracy as high as 93.9%.

Thanks to COCO, Microsoft has also built up some expertise in image description, which is widely used in the album sorting feature of OneDrive. This feature lets users classify and display photos effectively and recognizes text in images. Most importantly, it can also identify and analyze image features and tag them automatically.

In addition to Microsoft and Google, Facebook also released a similar system this year that can understand what is happening in a photo and describe it in natural language. Facebook showed a picture of a man skateboarding. The algorithm breaks down the photo content into "a skateboard, a man, a trick, and his skateboard." It thinks what may have happened is "do it, skateboard, do it." Users who reach the iPhone version of Facebook through a VPN can also have the text descriptions read aloud with the iPhone's built-in VoiceOver feature.

At the consumer level, image and video description not only helps users manage their photo albums and video collections automatically, but can also help blind users understand the contents of photos and videos through speech.