
OpenDataLab

OpenDataLab provides access to numerous significant open-source datasets.

English 🌎 | 简体中文 (Simplified Chinese) 🀄

Note

📚 In 2025, we open-sourced a high-quality multilingual dataset, WanJuan 3.0 (WanJuan Silu).

🧾 January 2025: Initial Release of the Multilingual Pre-training Corpus: Primarily text-based data, collected from publicly available web content, literature, patents, and more across 5 countries/regions. The total data size exceeds 1.2TB, with 300 billion tokens, a scale the team describes as internationally leading. The initial release includes Thai, Russian, Arabic, Korean, and Vietnamese sub-corpora, each exceeding 150GB. Using the "InternLM" Intelligent Tagging System, the research team categorized each sub-corpus into 7 major classes and 32 sub-classes, covering topics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedias, and professional knowledge, to ensure localized linguistic and cultural relevance. The corpus is designed so that researchers can easily retrieve data for diverse needs.
Download Links: Russian, Arabic, Korean, Vietnamese, Thai.


🌏 March 2025: Second Release of the Multilingual Multimodal Corpus: The corpus comprises over 1.2TB of native-language text from five countries. Each subset includes seven major categories and 34 subcategories, covering a wide range of local characteristics such as history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge, and professional expertise. Download links are listed below, and everyone is welcome to download and use the data.

The release comprises 4 data types:

  • Image-Text: Over 2 million images (raw size: 362.174GB).
  • Audio-Text: 200 hours of ultra-high-precision annotated audio per language.
  • Video-Text: Over 8 million video clips (raw duration: 28,000+ hours; refined to 16,000+ hours of high-quality content).
  • Localized SFT (Supervised Fine-Tuning): 184,000 SFT entries covering local culture, daily conversations, code, mathematics, and science. Each language has 23,000 entries: 3,000 culturally unique Q&A pairs designed by local residents and 20,000 translated entries filtered through a quality-check pipeline combining rules and model scoring.

In total, the release covers 8 languages across 4 modalities with 11.5 million entries, refined to industrial-grade quality for "ready-to-use" applications.
Download Links: 5 languages (Arabic, Russian, Korean, Vietnamese, Thai); 3 languages (Serbian, Hungarian, Czech).
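
As a quick sanity check, the Localized SFT figures quoted above are consistent with one another. The short Python sketch below only recomputes the totals from the per-language numbers; the variable names are illustrative and not part of any OpenDataLab tool.

```python
# Figures quoted in the Localized SFT description above.
LANGUAGES = 8
UNIQUE_QA_PER_LANGUAGE = 3_000      # Q&A pairs designed by local residents
TRANSLATED_PER_LANGUAGE = 20_000    # translated entries kept after quality checks

# Entries per language, then across all languages.
per_language = UNIQUE_QA_PER_LANGUAGE + TRANSLATED_PER_LANGUAGE
total_sft = per_language * LANGUAGES

print(per_language)  # 23000
print(total_sft)     # 184000
```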

🔥🔥🔥 OpenDataLab provides an ecosystem of high-quality datasets for the community. It offers:

● A fast and simple way to access open datasets
● 7,700+ large-scale, high-quality open datasets for large models
● 1,200+ open datasets for computer vision
● 200+ open datasets from CVPR
● Categorized datasets for hot topics

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting various kinds of tasks
● Open-source intelligent toolbox for labeling

● Format standardization
● DSDL: Dataset Description Language
● Define a CV dataset by DSDL
● OpenDataLab Standardized 100+ CV Datasets

Check out our tutorial videos (in Chinese) to get started.


📣 We have launched an upgraded feature that lets authors upload datasets themselves. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can access, obtain, and use your datasets.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into obstacles, please feel free to contact us at OpenDataLab@pjlab.org.cn.

Popular repositories

  1. MinerU Public

    Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

    Python 58.8k 4.9k

  2. PDF-Extract-Kit Public

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

    Python 9.6k 719

  3. DocLayout-YOLO Public

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Python 2.1k 155

  4. OmniDocBench Public

    [CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation

    Python 1.6k 165

  5. labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python 1.5k 168

  6. LabelLLM Public

    The Open-Source Data Annotation Platform

    TypeScript 1.2k 123

Repositories

Showing 10 of 61 repositories
  • MinerU-Ecosystem Public
    Python 48 Apache-2.0 2 0 2 Updated Apr 8, 2026
  • mineru-vl-utils Public

    A Python package for interacting with the MinerU Vision-Language Model.

    Python 109 AGPL-3.0 30 1 0 Updated Apr 8, 2026
  • OmniDocBench Public

    [CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation

    Python 1,642 Apache-2.0 165 120 8 Updated Apr 8, 2026
  • MinerU Public

    Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

    Python 58,750 AGPL-3.0 4,861 157 2 Updated Apr 7, 2026
  • MinerU-Document-Explorer Public

    Agent-native knowledge engine with MCP tools for document indexing, wiki organization, fast retrieval and deep reading across PDF/DOCX/PPTX/Markdown

    TypeScript 161 MIT 12 1 0 Updated Apr 7, 2026
  • WebMainBench Public

    WebMainBench is a high-precision benchmark for evaluating web main content extraction.

    Python 15 Apache-2.0 10 1 0 Updated Apr 3, 2026
  • Earth-Agent Public

    [ICLR 2026] The official implementation of the paper “Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents”

    Python 118 MIT 15 10 0 Updated Apr 2, 2026
  • MinerU-Diffusion Public

    A diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

    Python 438 MIT 21 4 0 Updated Mar 31, 2026
  • MinerU-HTML Public

    MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

    Python 229 Apache-2.0 24 1 0 Updated Mar 27, 2026
  • MinerU-Webkit Public
    HTML 5 Apache-2.0 1 0 0 Updated Mar 25, 2026
