SAN FRANCISCO — Twelve Labs, which builds video search technology, announced that it has raised $50 million in Series A funding to fuel the ongoing development of its industry-leading foundation models dedicated to all aspects of video. The round was co-led by new investor New Enterprise Associates (NEA) and NVentures, NVIDIA’s venture capital arm, which recently participated in Twelve Labs’ strategic round. Previous investors, including Index Ventures, Radical Ventures, WndrCo, and Korea Investment Partners, also joined the round. In addition to R&D, the funds will be used to nearly double headcount: Twelve Labs plans to add more than 50 employees by the end of the year.
Twelve Labs has integrated a number of NVIDIA technologies within its platform, including hardware such as the NVIDIA H100 Tensor Core GPU and NVIDIA L40S GPU, as well as inference frameworks such as NVIDIA Triton Inference Server and NVIDIA TensorRT. These technologies have enabled Twelve Labs to develop first-of-their-kind foundation models for multimodal video understanding. Twelve Labs is also exploring product and research collaborations with NVIDIA to bring best-in-class multimodal foundation models and enabling frameworks to market.
“We believe the future is multimodal, and Twelve Labs is leading the charge in terms of figuring out how to make multimodal AI efficient, purposeful and enterprise grade,” said Tiffany Luck, Partner at NEA and new Twelve Labs board member. “The company has brought together an incredibly talented team from across the globe to solve one of the most complex and exciting problems in AI. Twelve Labs is building the future of video understanding and multimodal AI, and we’re excited to support them as they execute their vision and impact our world in a positive way.”
“As a core component of generative AI, multimodal video understanding is a key to delivering more robust LLMs across industries,” said Mohamed “Sid” Siddeek, corporate vice president and head of NVentures. “The world-class team at Twelve Labs is leveraging NVIDIA accelerated computing together with their incredible capacity for video understanding, leading to new ways for enterprise customers to take advantage of generative AI.”
Pushing Video Forward
Understanding across modalities can’t just be bolted onto existing LLMs as a feature. Foundation models have to be multimodal from inception. Other approaches try to shoehorn video understanding into an LLM paradigm by analyzing transcripts, running traditional computer vision separately, and then gluing the results together in an attempt at video understanding. Unlike other foundation model providers, Twelve Labs was created specifically for multimodal video understanding.
The company’s Marengo-2.6 release, a state-of-the-art multimodal embedding model, is unlike anything currently available to companies. Marengo-2.6 offers a pioneering approach to multimodal representation tasks, covering not just video but also image and audio, and performs any-to-any search tasks, including Text-To-Video, Text-To-Image, Text-To-Audio, Audio-To-Video, Image-To-Video, and more. Marengo-2.6’s unique architecture is based on the concept of “Gated Modality Experts,” in which multimodal inputs are processed by specialized encoders before being combined into a comprehensive multimodal representation. The model represents a significant leap in video understanding technology, enabling more intuitive and comprehensive search across various media types.
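To make the idea concrete, the following is a minimal, illustrative sketch of how gated fusion and any-to-any search could fit together. It is not Twelve Labs’ implementation, and Marengo-2.6’s internals are not described here. The function names (`gated_fusion`, `search`), the softmax gating, the three-dimensional toy vectors, and the gate scores are all assumptions made for illustration. In a real system the per-modality vectors would come from learned encoders and the gates would be learned as well.

```python
import math

def softmax(scores):
    """Turn raw gate scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def gated_fusion(modality_embeddings, gate_scores):
    """Hypothetical fusion step: weight each modality's embedding by a
    softmax gate, sum the results, and normalize the fused vector."""
    names = sorted(modality_embeddings)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(modality_embeddings[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(modality_embeddings[name]):
            fused[i] += weight * value
    return l2_normalize(fused)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, index, top_k=1):
    """Rank indexed items by similarity to the query. Because every item
    lives in one shared space, the query can come from any modality."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy data: made-up per-modality embeddings for two clips.
gates = {"visual": 1.0, "audio": 0.0, "speech": 0.5}  # assumed gate scores
clip_a = gated_fusion(
    {"visual": [1.0, 0.0, 0.0], "audio": [0.8, 0.2, 0.0], "speech": [0.9, 0.1, 0.0]},
    gates,
)
clip_b = gated_fusion(
    {"visual": [0.0, 1.0, 0.0], "audio": [0.0, 0.9, 0.1], "speech": [0.1, 0.8, 0.1]},
    gates,
)
index = {"clip_a": clip_a, "clip_b": clip_b}

# Stand-in for an encoded text query (Text-To-Video search).
text_query = [0.9, 0.1, 0.0]
best_match = search(text_query, index)
```

In this toy setup the text query lands closest to `clip_a`. The point of the sketch is only that once every modality is mapped into a single embedding space, Text-To-Video, Audio-To-Video and Image-To-Video search all reduce to the same nearest-neighbor lookup.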
Twelve Labs also opened its beta of Pegasus-1, which sets a new standard in video-language modeling. Pegasus-1 is designed to understand and articulate complex video content, transforming how we interact with and analyze multimedia. It can process and generate language from video input with exceptional accuracy and detail. Continuously refined since the initial closed beta release in October, the open beta is faster and more accessible, with enhanced performance. To get there, the Twelve Labs team drastically reduced the model’s size, from 80 billion parameters to 17 billion, with three components trained jointly: a video encoder, a video-language alignment model, and a language decoder. Twelve Labs will release additional flagship Pegasus models in the coming months for organizations that can support larger models.
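For a rough sense of what that size reduction means in practice, the back-of-the-envelope sketch below estimates the memory needed just to hold the model weights. The 16-bit (2 bytes per parameter) precision is an assumption, since the precision Pegasus-1 is served at is not stated, and the helper name `weight_memory_gb` is hypothetical. Activations, caches and other runtime overhead are ignored.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes.
    bytes_per_param=2 assumes 16-bit weights (an assumption, not a stated fact)."""
    return num_params * bytes_per_param / 1e9

closed_beta_params = 80e9  # 80 billion parameters (from the announcement)
open_beta_params = 17e9    # 17 billion parameters (from the announcement)

closed_beta_gb = weight_memory_gb(closed_beta_params)
open_beta_gb = weight_memory_gb(open_beta_params)

# Fraction of parameters removed between the two versions.
param_reduction = 1 - open_beta_params / closed_beta_params

print(f"Closed beta weights: ~{closed_beta_gb:.0f} GB")
print(f"Open beta weights:   ~{open_beta_gb:.0f} GB")
print(f"Parameter reduction: ~{param_reduction:.0%}")
```

Under that assumption, the weights shrink from roughly 160 GB to roughly 34 GB, a reduction of about 79 percent in parameter count, which is consistent with the open beta being described as faster and more accessible.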