
Grass, Ontocord and LAION launch VALID

Article December 5, 2024

We're excited to share the release of VALID (Video-Audio Large Interleaved Dataset) by the world-renowned teams at Ontocord and LAION. VALID was built using the Grass Video Repository.

This dataset comprises 30 million audio snippets interleaved with images and text, making it the FIRST EVER video-audio interleaved dataset.

Discover VALID!

VALID (Video-Audio Large Interleaved Dataset)

Overview

VALID (Video-Audio Large Interleaved Dataset) is a multimodal dataset comprising approximately 720,000 Creative Commons-licensed videos crawled from YouTube and processed into audio-video-text data records for machine learning research. The dataset provides a unique opportunity for training models to understand relationships between modalities such as video frames, audio clips, and multilingual textual data, making it suitable for applications like multimodal representation learning.
* Please note that the current version is a PREVIEW; we are still in the process of uploading. Please be patient.

Features

· Audio-Video-Text Format: A combination of:
<video>
  <caption><image> the caption </caption>
  <caption><image> the caption </caption>
  <caption><image> the caption </caption>
</video>
<transcript> <audio> multilingual transcript </transcript>
English text


· The non-text multimodal portion begins each data item and can include multiple media elements. Some records contain more than one audio snippet or more than one video; others pair only images/videos, or only audio, with English text. Each video contains multiple frames stored as images, with a text caption for each frame, and standalone images can be interleaved as well. Although each audio/video snippet is no more than 10 seconds long, a data record may span more than 10 seconds (e.g., if a data item has two 10-second videos, the corresponding English text covers roughly 20 seconds of video). The intent of this format is to teach a model to associate multiple modalities with each other and to understand multiple audio-video elements in an interleaved fashion.
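
As a rough illustration, a record in this format could be walked with a few lines of Python. The parsing helper and field names below are our own invention for demonstration, not part of the dataset's tooling; the tag layout simply follows the template above.

import re
from dataclasses import dataclass, field

@dataclass
class InterleavedRecord:
    # One VALID-style data item: frame captions, audio transcripts, English text.
    captions: list = field(default_factory=list)
    transcripts: list = field(default_factory=list)
    text: str = ""

def parse_record(raw: str) -> InterleavedRecord:
    rec = InterleavedRecord()
    # Frame captions: <caption><image> the caption </caption>
    rec.captions = re.findall(r"<caption>\s*<image>\s*(.*?)\s*</caption>", raw, re.S)
    # Audio transcripts: <transcript> <audio> ... </transcript>
    rec.transcripts = re.findall(r"<transcript>\s*<audio>\s*(.*?)\s*</transcript>", raw, re.S)
    # Whatever follows the last closing tag is the trailing English text.
    rec.text = re.split(r"</transcript>|</video>", raw)[-1].strip()
    return rec

example = (
    "<video>\n"
    "  <caption><image> a dog catching a frisbee </caption>\n"
    "</video>\n"
    "<transcript> <audio> good catch! </transcript>\n"
    "The dog leaps and catches the frisbee."
)
print(parse_record(example))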

Data Components

· Images: PNG format, phashed to ensure variability, with 0-10 images per audio snippet. Each image includes a caption created with Florence-2.

· Audio: OGG format, multilingual, ~10 seconds per snippet, with shorter sound or music snippets (1-3 seconds) to minimize copyright issues. Each audio snippet is transcribed with Whisper for non-English audio or with the original YouTube ASR for English (a transcription sketch follows this list).

· Text: Apart from the captions and transcripts, the “text” portion is a concatenation of YouTube’s original English transcripts associated with the above media, around 1-40 words per data record.
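
To make the transcription step concrete, here is a minimal sketch using the open-source openai-whisper package (the dataset card names Whisper-large for non-English audio). The snippet file name is a hypothetical placeholder, and the actual decoding settings used to build VALID are not documented here.

import whisper  # pip install openai-whisper

# The dataset card mentions Whisper-large for non-English snippets;
# the exact model variant and decoding options are assumptions here.
model = whisper.load_model("large")

# "snippet.ogg" is a hypothetical ~10-second audio snippet from one record.
result = model.transcribe("snippet.ogg")
print(result["language"], "->", result["text"])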

Dataset Size

· About 7,000,000 records.

· About 15,000,000 images, each captioned with Florence-2.

· About 30,000,000 audio snippets, about half of which are transcribed with Whisper-large and half with YouTube ASR.

· Divided into about 12K shards of about 600 records each, with each shard stored as a Parquet file plus a corresponding .tar.gz file for the media (a loading sketch follows this list).

· About 14TB in total.
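
To give a feel for the shard layout, the sketch below opens one Parquet shard together with its companion media tarball. The file names are hypothetical placeholders, since the exact naming scheme is defined on the Hugging Face repository rather than here.

import tarfile
import pandas as pd

# Hypothetical shard names; the real naming scheme may differ.
PARQUET_PATH = "valid-shard-00000.parquet"
MEDIA_PATH = "valid-shard-00000.tar.gz"

# Each Parquet shard holds roughly 600 records of captions, transcripts, and text.
df = pd.read_parquet(PARQUET_PATH)
print(len(df), "records, columns:", list(df.columns))

# The companion tarball carries the PNG frames and OGG audio for those records.
with tarfile.open(MEDIA_PATH, "r:gz") as tar:
    names = tar.getnames()
    print(len(names), "media files, e.g.", names[:3])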

File Organization

· Each data entry follows the <video><image(s)><audio><text> structure as described above.

· Metadata includes alignment between modalities and the implicit ordering of audio/visual elements.

Multimodal Details

· Audio-Video Alignment: Snippets allow learning temporal relationships between audio and visual elements.

· Text Annotations: Text descriptions, including captions and YouTube ASR English translations, provide linguistic alignment.

Preprocessing

· Phashing for Images: Perceptual hashing (phash) filters out near-duplicate frames, so the images within a record are visually varied rather than static (see the sketch after this list).

· Audio Snippet Lengths: Music and sound effects are clipped to 1-3 seconds to minimize copyright concerns under fair use principles.
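
As a rough sketch of both steps, the snippet below uses the imagehash and pydub libraries. The distance threshold, file names, and clip length are assumed values for illustration, not the parameters actually used to build the dataset.

from PIL import Image
import imagehash
from pydub import AudioSegment

def dedupe_frames(paths, max_distance=8):
    # Keep a frame only if its perceptual hash differs enough from the
    # previously kept frame; the threshold here is an assumed value.
    kept, last_hash = [], None
    for path in paths:
        h = imagehash.phash(Image.open(path))
        if last_hash is None or (h - last_hash) > max_distance:
            kept.append(path)
            last_hash = h
    return kept

def clip_music(path, seconds=3, out_path="clipped.ogg"):
    # Trim a music/sound-effect snippet to its first few seconds,
    # mirroring the 1-3 second clipping described above.
    audio = AudioSegment.from_ogg(path)
    audio[: seconds * 1000].export(out_path, format="ogg")

kept = dedupe_frames(["frame_0.png", "frame_1.png", "frame_2.png"])
clip_music("music_snippet.ogg")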

Read more about VALID, the Video-Audio Large Interleaved Dataset: https://huggingface.co

Source: getgrass.io