Utilized existing state-of-the-art image captioning models to classify videos into different categories based on the actions performed in them.
- First, we split the video into multiple frames and create frame-description pairs by running the NeuralTalk model on each frame (see the first sketch after this list).
- We train a machine learning model on these frame-description pairs to classify videos into different categories (see the second sketch after this list).
- The model achieved accuracy of around 36%.
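
A minimal sketch of the frame-extraction step, assuming OpenCV for video decoding. `caption_frame` is a hypothetical placeholder for the NeuralTalk captioning model; the actual project would invoke the NeuralTalk code directly.

```python
import cv2

def caption_frame(frame):
    """Hypothetical placeholder: run NeuralTalk on one frame, return a caption string."""
    raise NotImplementedError("wrap the NeuralTalk captioning model here")

def video_to_caption_pairs(video_path, every_n_frames=30):
    """Sample every Nth frame of a video and pair it with a generated caption."""
    pairs = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            pairs.append((idx, caption_frame(frame)))
        idx += 1
    cap.release()
    return pairs
```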
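
A minimal sketch of the classification step. The source does not specify which model was used, so this assumes a simple TF-IDF + logistic regression pipeline from scikit-learn, with each video's frame captions concatenated into one text document; the caption strings and labels below are illustrative placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each entry: all frame captions for one video, joined into a single string.
caption_docs = [
    "a man is sitting on a chair a man is sitting at a table",
    "a person is standing in a room a person is walking",
]
labels = ["sitting", "walking"]  # illustrative action categories

# Vectorize caption text and fit a linear classifier over the categories.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(caption_docs, labels)
print(clf.predict(["a woman is sitting on a couch"]))
```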
Error Analysis:
- The model fails to detect actions like a person sitting down or standing up, because the image captioning model cannot capture these subtle temporal differences when it handles one frame at a time.
- Also, the model learns spurious signals, e.g. “when it sees a chair, it always predicts sitting”.