Meta’s Data2vec 2.0: Second time is faster



[Image: photos of a big black dog overlaid with the words "I drink black tea"]



Meta’s Data2vec is an example of a general neural network that can use the same exact code to crunch samples of data from different modalities — in this case, speech, text, and images — and make predictions about that data.

Baevski et al.

What do you do after proving your point in neural networks?

One answer is to do it fast.

On Tuesday, Meta Properties, owner of Facebook, Instagram, and WhatsApp, unveiled Data2vec 2.0, a sequel to the neural network it introduced earlier this year. Data2vec behaves as a kind of generalist, taking on tasks involving text, image, and speech data with the same basic approach to all three.

The second time around, Meta’s scientists made the program faster and, in a few cases, more accurate on benchmark machine learning tasks.

“Data2vec 2.0 shows that the training speed of self-supervised learning can be substantially improved with no loss in downstream task accuracy,” write Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli, authors of the original Data2vec paper, in the new work, “Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language,” posted on arXiv.


The signal achievement of this second Data2vec is cutting the time it takes to train the program. Training a neural net is usually measured in “epochs,” meaning the number of times the neural net is shown the training examples. It can also be measured in wall-clock time: the hours, minutes, and days counted from start to finish.

“Experiments show that Data2vec 2.0 can reach the same accuracy as many popular existing algorithms in 2-16x the training speed,” they write.

The name Data2vec is a play on the name of a program for language “embeddings” developed at Google in 2013 called Word2vec. That program predicted how words cluster together, so Word2vec is representative of a neural network designed for one specific type of data.

For Data2vec, Baevski and colleagues took a neural network called the Transformer, developed by Ashish Vaswani and colleagues at Google in 2017, and extended it to work on multiple data types. A single neural network structure can thus be trained on all three: images, speech, and text.

Baevski and colleagues apply the Transformer to what’s called self-supervised learning. In self-supervised learning, a neural network is trained in several stages whose results are compared with one another.

First, the network compresses a sample of data, known as building a representation of the input data. Then, in a second version of the network, some of the input data items are “masked,” left unrevealed. That second version has to reconstruct the representation the first version created, which forces the second network to build a better model of how the data fits together by filling in the blanks.
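The masked target-prediction loop described above can be sketched in a few lines. This is a toy illustration with made-up dimensions and a stand-in linear encoder, not Meta's actual code: the first pass encodes the full sample, a second pass encodes a masked copy, and the loss is taken only at the hidden positions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Stand-in encoder: a single linear projection with tanh."""
    return np.tanh(x @ w)

seq_len, dim, hidden = 8, 4, 6
w = rng.normal(size=(dim, hidden))          # shared encoder weights
x = rng.normal(size=(seq_len, dim))         # one training sample

target = encode(x, w)                       # first network sees everything

mask = np.array([True, False] * 4)          # hide half the positions
x_masked = np.where(mask[:, None], 0.0, x)  # zero out the hidden items

student = encode(x_masked, w)               # second network sees the masked copy

# Score only where the input was hidden, so the second network
# must infer the missing parts from the surrounding context.
loss = float(np.mean((student[mask] - target[mask]) ** 2))
print(f"masked-position loss: {loss:.4f}")
```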


The two networks — the one that builds a compressed representation of the full, unmasked input data and the one that tries to complete an incomplete version — are called, sensibly enough, teacher and student, respectively. The student network tries to develop its sense of the data by reconstructing, albeit indirectly, what the teacher has already achieved.

The authors made two key changes to Data2vec this time around to make it faster: simplifying how the student decodes the teacher network’s abstract representations, and amortizing the cost of computing the teacher’s target representations.

For the first change, the student network that is supposed to predict the teacher’s representations no longer uses the part of the Transformer called the decoder.

That is the standard approach to decompressing the teacher network’s abstract representations. Instead, the authors use so-called convolutional neural networks, a foundational tool for representing data samples in compressed form and one much older than the Transformer. It’s a great example of how older technology can stick around in programming.

“Instead of using a transformer-based decoder, we use a compact convolutional decoder, which is easier and faster to train,” they write.
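As an illustration of what a compact convolutional decoder does, the toy sketch below runs a small 1-D convolution over the student's hidden states to map them back to the dimensionality of the teacher's targets. The dimensions and kernel here are invented for the example; this is a sketch of the concept, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(h, kernel):
    """Simple 1-D convolution with 'same' padding.
    h: (seq_len, in_dim), kernel: (k, in_dim, out_dim)."""
    k, _, out_dim = kernel.shape
    pad = k // 2
    h_pad = np.pad(h, ((pad, pad), (0, 0)))
    out = np.zeros((h.shape[0], out_dim))
    for t in range(h.shape[0]):
        # Each output position mixes a small local window of inputs,
        # far cheaper than a Transformer decoder's global attention.
        window = h_pad[t:t + k]                    # (k, in_dim)
        out[t] = np.einsum('ki,kio->o', window, kernel)
    return out

seq_len, hidden, target_dim = 8, 6, 6
h_student = rng.normal(size=(seq_len, hidden))     # student encoder output
kernel = rng.normal(size=(3, hidden, target_dim))  # 3-wide conv filter

decoded = conv1d(h_student, kernel)                # maps back to target space
print(decoded.shape)
```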

For the second change, instead of repeatedly generating the teacher network’s compressed representation, the new Data2vec generates that representation only once and reuses it as the target to be guessed for each masked version of a data sample.

As the authors put it, “We amortize the cost of the teacher model computation by reusing the teacher representation for multiple masked versions of the training sample.

“Specifically, we consider M different masked versions of the training sample and compute the loss with respect to the same target representation.”
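The amortization scheme can be sketched as follows: the teacher encodes a sample once, and M differently masked student passes are all scored against that single target. The encoder and dimensions below are stand-ins for illustration, not Meta's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def encode(x, w):
    """Stand-in encoder: a single linear projection with tanh."""
    return np.tanh(x @ w)

seq_len, dim, hidden, M = 8, 4, 6, 4
w = rng.normal(size=(dim, hidden))
x = rng.normal(size=(seq_len, dim))

target = encode(x, w)          # teacher runs ONCE per training sample

losses = []
for _ in range(M):             # M masked student passes share that target
    mask = rng.random(seq_len) < 0.5
    if not mask.any():         # ensure at least one masked position
        mask[0] = True
    student = encode(np.where(mask[:, None], 0.0, x), w)
    losses.append(np.mean((student[mask] - target[mask]) ** 2))

# One teacher pass is amortized over M student losses.
loss = float(np.mean(losses))
print(f"average loss over {M} masks: {loss:.4f}")
```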

Data2vec 2.0 diagram



Framework of Data2vec 2.0. This time Meta has replaced the second part of the program, which was a transformer-based decoder, with a decoder based on the older technology of convolutional neural networks. They reused abstracted representations of the “teacher” network as a single target for multiple masked instances of the “student” network’s data.

Baevski et al. 2022

In the results section of the paper, Baevski and team describe how they reduced training time and improved accuracy across all three domains: image recognition, speech recognition, and natural language processing.

For image processing, the authors used Data2vec as the pre-training basis for fine-tuning ViT, the “vision Transformer,” a neural network designed specifically for vision tasks that was introduced last year by Alexey Dosovitskiy and colleagues at Google. Data2vec serves as a pre-trained foundation on top of which ViT is fine-tuned, following the literature.

Compared to January’s results, Data2vec-backed ViT again topped other neural networks used as the basis for ViT in terms of accuracy on ImageNet.

But in addition to accuracy, the new Data2vec took significantly fewer training epochs. Where the previous Data2vec took 800 epochs, this time that was cut to 150. And against the competing self-supervised network, the masked auto-encoder, or MAE, another Meta invention, training dropped from MAE’s 1,600 epochs to 100, even as the new Data2vec’s accuracy tops MAE’s. In wall-clock terms, the fastest training regime takes just 66 hours for Data2vec 2.0 versus 113.6 hours for MAE.


In speech recognition, the task is to fill in the missing parts of a snippet of audio of a spoken phrase. The new Data2vec went up against several competing neural networks for speech, including the original Data2vec and programs called Wav2vec, HuBERT, and WavLM. Data2vec 2.0 didn’t beat those networks outright in every case, but it “achieved higher accuracy than other models with faster training times.” For example, Data2vec 2.0 takes 43 hours of training to reach the accuracy the original Data2vec needed 57 hours to achieve.

In the third arena, natural language processing, Data2vec 2.0 was tested across a spectrum of challenges in GLUE, the general language understanding evaluation framework developed in 2019 at NYU’s Courant Institute of Mathematical Sciences.

In one task, the network must predict whether one sentence follows from another, known as logical entailment, while another representative task challenges the network to label a phrase as grammatically correct or not.

Tested against the original Data2vec and two Transformer-based programs, Google’s BERT and a modified version of it called RoBERTa, introduced in 2019 by Meta and the Paul Allen School of Computer Science at the University of Washington, Data2vec 2.0 achieves comparable GLUE results while training faster.

The overall average accuracy score across all GLUE tasks for the new version is 82.6, a fraction below the original Data2vec’s 82.7, but higher than BERT’s 81.2 and RoBERTa’s 82.5. And Data2vec 2.0 takes only 28.2 hours to reach that score, less than half the 69 hours the original Data2vec needed and far less than RoBERTa’s 50.5 hours.


Baevski and team write that in the future Data2vec could expand beyond speech, image, and text to other data formats, raising the prospect of an even more general program.

One limitation is likely to remain, however. As with the original Data2vec, version 2.0 handles each data type differently when it is first fed into the network during training. That means Data2vec has not yet developed a completely generic way of handling data types.

Image, speech, and text are each prepared by pre-processing particular to that data type. In that way, the multi-modal aspect of the network still relies on clues about the data, what the team refers to as “small modality-specific input encoders.”

Furthermore, abstract encodings from the teacher network are generated separately for each of the three data types. There is not yet the ability to create a kind of “super-encoding” that combines all data types into one representation at once.

So, as with Data2vec 1.0, a neural network that is truly one network to rule them all remains a technology of the future.

As with the original Data2vec, Meta has posted the code for Data2vec 2.0 on GitHub.


