Momenta™ AI, large multimodal transformers for video understanding

Momenta™ is a large multimodal pre-trained transformer built for precise video understanding.

Next-generation multimodal AI designed to be video first

Temporal attention across sequences

Temporal attention encodes features across multiple frames to build object and sequence understanding.
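To make the idea concrete, here is a minimal sketch of attention over the time axis, assuming each frame has already been reduced to one feature vector. It is an illustration only, not Momenta's implementation: the function names are hypothetical, and the learned query/key/value projections of a real transformer are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(frames):
    """Scaled dot-product self-attention across frames.

    frames: list of T feature vectors (one per frame), each of length D.
    Returns T context vectors; each one is a weighted mix of all frames.
    Assumption: no learned projections, so queries = keys = values = frames.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    contexts = []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        # Weighted sum of frame features along the time axis.
        contexts.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return contexts

# Three frames with 2-dimensional features (made-up values).
clip = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contexts = temporal_attention(clip)
```

Each output vector leans toward the frames most similar to its own, which is how information about an object can carry from one frame to the next.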

Fusion of text and video features

Effective fusion of visual and text tokens in a cross-modality decoder enables open-set object detection on novel categories.
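The open-set part can be sketched with a much simpler stand-in than a cross-modality decoder: score each region feature against text embeddings of the category names and keep the best match above a threshold. Everything here is an assumption for illustration, including the function names, the embeddings, and the threshold. It assumes region and text features already live in a shared embedding space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def label_regions(region_feats, text_embeds, threshold=0.5):
    """Assign each region the best-matching text category, or None.

    region_feats: list of region feature vectors (hypothetical values).
    text_embeds:  dict mapping a category name to its text embedding.
    threshold:    minimum similarity for a match (illustrative value).
    """
    labels = []
    for feat in region_feats:
        best_name, best_score = None, threshold
        for name, emb in text_embeds.items():
            score = cosine(feat, emb)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels

# Categories are supplied as text at query time, so new ones need no retraining.
categories = {"forklift": [1.0, 0.1], "pallet": [0.0, 1.0]}
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
print(label_regions(regions, categories))  # ['forklift', 'pallet', None]
```

Because categories arrive as text rather than fixed class indices, adding a novel category only means adding another entry to the dictionary.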

Active learning for state-of-the-art performance

A semi-supervised active learning regime is designed to bring Momenta to state-of-the-art (SOTA) performance in new scenes and edge conditions.

Semantic temporal tracking

Traditional tracking algorithms often prioritize the spatial dimension, aiming to localize an object in each frame of a video. However, these methods may not capture the "semantics" or meaningful attributes of the objects being tracked (e.g., whether they are throwing or catching something). In contrast, Momenta's semantic temporal tracking involves monitoring these meaningful attributes over time to achieve accurate video understanding.
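As a small illustration of tracking attributes over time, the sketch below collapses the per-frame attribute of one tracked object into time segments. The function name and the attribute labels are hypothetical, and the per-frame labels are assumed to come from an upstream model.

```python
def attribute_segments(per_frame_attrs):
    """Collapse per-frame attribute labels for one track into segments.

    per_frame_attrs: list where index i is the attribute observed in frame i.
    Returns a list of (attribute, first_frame, last_frame) tuples.
    """
    segments = []
    for frame, attr in enumerate(per_frame_attrs):
        if segments and segments[-1][0] == attr:
            # Same attribute as the previous frame: extend the open segment.
            name, start, _ = segments[-1]
            segments[-1] = (name, start, frame)
        else:
            # Attribute changed: start a new segment at this frame.
            segments.append((attr, frame, frame))
    return segments

track = ["holding", "holding", "throwing", "throwing", "throwing", "idle"]
print(attribute_segments(track))
# [('holding', 0, 1), ('throwing', 2, 4), ('idle', 5, 5)]
```

The output describes what the object was doing and when, not only where it was in each frame.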

Video event understanding

Unlike simple object detection, which may identify object states or attributes of people or cars, video event understanding aims to capture higher-level activities and interactions among these elements. Momenta excels in understanding complex scenes and events in real time.

Structured object localization

Simple object localization, which involves merely identifying the bounding boxes or regions where specific objects are located in each video frame, has limited use. In structured object localization, additional contextual or structural information—such as orientation, parts, or interaction with other objects—is considered to achieve a more nuanced understanding of the object's presence and status in the video.
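One way to picture the difference is in the shape of the result. The sketch below is a hypothetical record layout, not Momenta's output schema: a plain bounding box extended with orientation, named parts, and interactions with other objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Box:
    """Axis-aligned bounding box in pixels (top-left corner plus size)."""
    x: float
    y: float
    w: float
    h: float

@dataclass
class StructuredObject:
    """A localized object plus structural context (all fields illustrative)."""
    label: str
    box: Box
    orientation_deg: Optional[float] = None
    parts: Dict[str, Box] = field(default_factory=dict)
    # Each interaction is (verb, label of the other object).
    interactions: List[Tuple[str, str]] = field(default_factory=list)

    def interacts_with(self, other_label: str) -> bool:
        """True if any recorded interaction involves the given object label."""
        return any(target == other_label for _, target in self.interactions)

worker = StructuredObject(
    label="person",
    box=Box(120, 40, 60, 180),
    orientation_deg=90.0,
    parts={"hand": Box(170, 110, 12, 12)},
    interactions=[("carrying", "box")],
)
print(worker.interacts_with("box"))  # True
```

Simple localization would stop at `label` and `box`; the remaining fields carry the orientation, part, and interaction context described above.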

Active self-supervised learning

Momenta constantly improves by learning from millions of actively selected images and videos. This enables it to achieve SOTA performance in specific tasks and develop a broader zero-shot capability for new scenes. Momenta is trained on one of the largest video and image datasets from diverse scenes.
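"Actively selected" can mean many things. One common approach, used here purely as an assumed example and not as a description of Momenta's selection strategy, is uncertainty sampling: send the clips the current model is least sure about for labeling first. The function names and clip IDs are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Pick the `budget` most uncertain samples.

    predictions: dict mapping a sample ID to the model's class probabilities.
    Returns sample IDs ordered from most to least uncertain.
    """
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]

# Made-up model outputs for three unlabeled clips.
unlabeled = {
    "clip_a": [0.98, 0.01, 0.01],  # confident prediction
    "clip_b": [0.40, 0.30, 0.30],  # very uncertain
    "clip_c": [0.70, 0.20, 0.10],  # somewhat uncertain
}
print(select_for_labeling(unlabeled, budget=2))  # ['clip_b', 'clip_c']
```

Spending the labeling budget on uncertain samples concentrates new training data on the scenes and edge conditions where the model currently performs worst.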
