Meta Introduces the First Self-Supervised Algorithm for Speech, Vision and Text

Meta is announcing data2vec, the first high-performance self-supervised algorithm that learns the same way in multiple modalities, including speech, vision and text. Most machines learn exclusively from labeled data. However, through self-supervised learning, machines are able to learn about the world just by observing it and then figuring out the structure of images, speech or text. This is a more scalable and efficient approach for machines to tackle new complex tasks, such as understanding text in more spoken languages.

Self-supervised learning algorithms for images, speech, text or other modalities function in very different ways, which has limited how broadly researchers can apply them. Because an algorithm designed for understanding images can’t be directly applied to reading text, it’s difficult to push several modalities ahead at the same rate. With data2vec, Meta developed a unified way for models to predict their own representations of the input data, regardless of whether it’s speech, vision or text. By focusing on these representations, a single algorithm can work with completely different types of input.
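To make the idea of a model predicting its own representations more concrete, here is a minimal Python sketch. It is not Meta's implementation. It assumes a heavily simplified student/teacher setup: a "student" encoder sees a partially masked input and is trained to match the representation that a slowly updated "teacher" copy of itself produces for the full input. The linear encoder, the toy feature vectors, and every name in the code (encode, mask, training_step, make_encoder) are hypothetical and chosen only for illustration.

```python
import random

def encode(weights, x):
    """Toy linear encoder: one representation value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mask(x, masked_idx):
    """Hide part of the input by zeroing the masked positions."""
    return [0.0 if i in masked_idx else v for i, v in enumerate(x)]

def training_step(student, teacher, x, masked_idx, lr=0.05, tau=0.99):
    """Student (masked input) regresses the teacher's representation of the full input."""
    target = encode(teacher, x)          # teacher sees the complete input
    x_masked = mask(x, masked_idx)
    pred = encode(student, x_masked)     # student sees the masked input
    errors = [p - t for p, t in zip(pred, target)]
    loss = sum(e * e for e in errors) / len(errors)
    # Gradient descent on the mean squared error with respect to the student weights.
    for r, e in enumerate(errors):
        for c, v in enumerate(x_masked):
            student[r][c] -= lr * 2 * e * v / len(errors)
    # Assumption: the teacher is an exponential moving average of the student.
    for r in range(len(teacher)):
        for c in range(len(teacher[r])):
            teacher[r][c] = tau * teacher[r][c] + (1 - tau) * student[r][c]
    return loss

def make_encoder(n_out, n_in, rng):
    """Randomly initialised weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

# Hypothetical stand-ins for already-featurised inputs from three modalities.
toy_inputs = {
    "speech": [0.1, -0.3, 0.7, 0.2, -0.5],
    "vision": [0.9, 0.4, 0.0, 0.6],
    "text": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
}

rng = random.Random(0)
losses = {}
for name, x in toy_inputs.items():
    student = make_encoder(3, len(x), rng)
    teacher = [row[:] for row in student]  # teacher starts as a copy of the student
    # The same training step is reused unchanged for every modality.
    losses[name] = [training_step(student, teacher, x, masked_idx={0}) for _ in range(50)]

for name, history in losses.items():
    print(f"{name}: first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

The point of the sketch is that training_step never inspects what kind of data it receives. The prediction target is a representation rather than a pixel, a word or a sound sample, which is why one procedure can be reused across input types.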

With data2vec, Meta is closer to building machines that learn about different aspects of the world around them without having to rely on labeled data. This paves the way for more general self-supervised learning and brings Meta closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread. Data2vec will also enable Meta to develop more adaptable AI, which we believe will be able to perform tasks beyond what’s possible today.

Learn more about data2vec.