Data Marketplace

Use verified, licensed data with confidence. You can download a dataset right away or submit an inquiry to review the data.

A total of 321 datasets

Pre-training Data | Audio
English-Vietnamese Parallel Speech Dataset
A parallel speech dataset consisting of sentence-level aligned English and Vietnamese utterances, designed for training Speech-to-Speech translation models.
Domain: Ad and Marketing
Language: English | Vietnamese

Pre-training Data | Audio
English-Indonesian Parallel Speech Dataset
A parallel speech dataset consisting of sentence-level aligned English and Indonesian utterances, designed for training Speech-to-Speech translation models.
Domain: Ad and Marketing
Language: English | Indonesian

Pre-training Data | Audio
English-Korean Parallel Speech Dataset
A parallel speech dataset consisting of sentence-level aligned English and Korean utterances, designed for training Speech-to-Speech translation models.
Domain: Ad and Marketing
Language: English | Korean

Frontier Data | Text
Multilingual Chain-of-Thought Reasoning Text Dataset
A multilingual chain-of-thought reasoning dataset built from complex problems requiring step-by-step decomposition and coherent answer generation, with AI-generated drafts reviewed by expert-level annotators.
Domain: Humanities and Social
Language: Korean | Hindi | Indonesian | Arabic (U.A.E.) | Thai | Bengali | Arabic (Egypt) | Japanese

Frontier Data | Text
Expert CoT Text Dataset
An expert chain-of-thought text dataset built from expert verbal reasoning to support LLM training for step-by-step reasoning.
Domain: Ad and Marketing | Sports, Arts and Culture | Games | Humanities and Social | IT and Tech | Law and Public | Medical | Management, Economic and Finance
Language: Korean

Frontier Data | Text
Doctoral Exam Questions and Solutions Text Dataset
A high-difficulty text dataset built from doctoral-level exam questions and solutions to support LLM training for expert reasoning and problem solving.
Domain: Science and Engineering
Language: English

Frontier Data | Text
Domain-Specific Benchmark Dataset
A multi-turn benchmark dataset modeled on BFCL (the Berkeley Function-Calling Leaderboard) to evaluate agent action performance across the finance, legal, medical, manufacturing, and defense domains.
Domain: Humanities and Social | IT and Tech | Medical | Management, Economic and Finance | Ad and Marketing | Science and Engineering | Law and Public
Language: English

Frontier Data | Text
Safety Response Multi-turn Dataset
A multi-turn conversational dataset designed to evaluate model response capabilities against major safety risk categories and attack patterns.
Domain: Humanities and Social | IT and Tech | Medical | Bio, Environment and Energy | Education | Management, Economic and Finance | Ad and Marketing | Sports, Arts and Culture | Science and Engineering | Law and Public | Lifestyle
Language: English

Pre-training Data | Video
Physical AI: Human-Object Interaction Video Dataset
A video dataset collected for training Physical AI models in manufacturing environments. It includes human-object manipulation footage along with structured annotations such as trajectory and mesh data.
Domain: IT and Tech
Language: -

Pre-training Data | Video
AI-Generated Video with Frame-level Caption Dataset
A dataset of AI-generated videos sampled at 1 fps, with a scene description caption for each sampled frame. It is suitable for video understanding and multimodal model training.
Domain: Lifestyle
Language: Korean

Check out the details of Snowflakes, Flitto's core dataset.