R&D Amplifier
February 10, 2026

Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

This article is AI-generated from a scientific publication. We recommend verifying information in the original source.

Data Science and Technology Towards AGI Part I: Tiered Data Management

In Brief

This research introduces a new way of managing data for training advanced artificial intelligence (AI) systems, called a tiered data management framework. Instead of just using more data, the system uses AI itself to organize and improve data across five levels—called L0 to L4—each with different quality and structure. This helps AI models learn faster and better, especially as we aim to build human-level AI (AGI).

The Problem

Right now, most large AI models (like those behind chatbots and search engines) rely on massive amounts of text data, often scraped from the internet. But this approach is running into limits: it’s expensive to collect and clean data, and more data doesn’t always mean better performance. These bottlenecks are slowing progress toward building truly intelligent machines—AI that can reason, learn, and adapt like humans (called AGI). The researchers report that current methods are stuck in a cycle of simply scaling up data size, which isn’t sustainable or efficient.

The Solution

To solve this, the researchers propose a five-tiered data management system—L0 to L4—where data is organized from raw, unstructured sources (L0) to verified, high-quality knowledge (L4). Each tier has specific characteristics: L0 is raw text from websites or books; L1 is filtered and cleaned; L2 is structured and labeled; L3 is fact-checked and consistent; L4 is expert-verified and suitable for advanced reasoning. Crucially, large language models (LLMs) are used to help manage and improve data at every level—scoring quality, fixing errors, and organizing content. This creates a feedback loop where better data leads to smarter models, and smarter models help create even better data.
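The five tiers can be pictured as a promotion ladder, where an LLM-assigned quality score decides whether a sample moves up a level. The following Python sketch is illustrative only: the tier names, the `promote` helper, and the scoring threshold are hypothetical, not taken from the paper.

```python
from enum import IntEnum

class DataTier(IntEnum):
    """Hypothetical encoding of the paper's five data tiers (L0-L4)."""
    L0_RAW = 0         # raw text from websites or books
    L1_FILTERED = 1    # filtered and cleaned
    L2_STRUCTURED = 2  # structured and labeled
    L3_VERIFIED = 3    # fact-checked and consistent
    L4_EXPERT = 4      # expert-verified, suitable for advanced reasoning

def promote(sample: dict, quality_score: float, threshold: float = 0.8) -> dict:
    """Move a sample up one tier if an LLM-assigned quality score passes
    a threshold. Both the score and the threshold are illustrative."""
    tier = DataTier(sample["tier"])
    if quality_score >= threshold and tier < DataTier.L4_EXPERT:
        sample = {**sample, "tier": DataTier(tier + 1)}
    return sample

sample = {"text": "The Earth orbits the Sun.", "tier": DataTier.L0_RAW}
promoted = promote(sample, quality_score=0.93)  # passes threshold, moves to L1
```

In a real pipeline the quality score would come from an LLM judge rather than being passed in by hand, but the tier-promotion logic would look much the same.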

The process follows a cyclical pattern: assess the data, manage it across tiers, and evaluate how well the model performs. This loop is known as data-model co-evolution.
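One cycle of that assess-manage-evaluate loop can be sketched in a few lines. Everything here is a toy stand-in: the `ToyModel` heuristics and the `keep` cutoff are invented for illustration, not the authors' implementation.

```python
class ToyModel:
    """Stand-in for an LLM curator; the method names are hypothetical."""
    def assess(self, text: str) -> float:
        # Crude quality proxy: longer, properly terminated text scores higher.
        return min(1.0, len(text) / 40) * (1.0 if text.endswith(".") else 0.5)

    def refine(self, text: str) -> str:
        # Minimal "content editing": strip stray whitespace.
        return text.strip()

    def evaluate(self, dataset: list) -> float:
        # Proxy for model evaluation: mean quality of the curated set.
        return sum(self.assess(t) for t in dataset) / max(len(dataset), 1)

def co_evolution_step(dataset, model, keep=0.5):
    """One assess -> manage -> evaluate cycle over a text dataset."""
    kept = [model.refine(t) for t in dataset if model.assess(t) >= keep]
    return kept, model.evaluate(kept)

data = ["  The Earth orbits the Sun once per year.  ", "lol"]
curated, score = co_evolution_step(data, ToyModel())
```

The point of the loop is that the evaluation result can feed back into the next round of assessment, so better data and better models reinforce each other over successive cycles.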

Figure 6: The historical progression of machine learning, culminating in a new paradigm called Data-Model Co-evolution, a cyclical process of data assessment, management, and model evaluation.

The framework ensures data is matched to the right training phase: low-tier data (L0–L2) for early pre-training, higher-tier data (L3–L4) for later alignment and fine-tuning. This smart allocation balances cost, quality, and learning impact.
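A simple way to express this allocation is a lookup from training phase to permitted tiers. The mapping below follows the article's description (low tiers early, high tiers late), but the exact tier sets per phase, especially for mid-training, are an assumption for illustration.

```python
# Hypothetical tier-to-phase allocation; the phase names mirror the article,
# the specific tier sets are illustrative.
PHASE_TIERS = {
    "pre-training": {0, 1, 2},   # L0-L2: bulk, lower-cost data
    "mid-training": {2, 3},      # assumed: structured and fact-checked data
    "alignment":    {3, 4},      # L3-L4: verified, reasoning-grade data
}

def select_for_phase(corpus, phase):
    """Return only the samples whose tier is permitted in this phase."""
    allowed = PHASE_TIERS[phase]
    return [s for s in corpus if s["tier"] in allowed]

corpus = [{"id": 1, "tier": 0}, {"id": 2, "tier": 3}, {"id": 3, "tier": 4}]
```

With this kind of table, changing the cost-quality trade-off is a one-line edit to the allocation rather than a change to the training code itself.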

Figure 6 shows how machine learning has evolved over time, from early symbolic reasoning to today's data-driven models, and how the new data-model co-evolution phase closes the loop between data and model development. The figure illustrates that the future isn't just about more data, but smarter data handling.

Key Findings

  • Tier-aware data usage improved training efficiency and model performance across multiple training phases, though exact metrics are not specified in the abstract.
  • The framework enables strategic allocation of data across pre-training, mid-training, and alignment stages, optimizing cost and benefit.
  • LLMs were used throughout the data management process, including quality scoring and content editing, to refine data across all tiers.

The researchers report that using tiered data significantly improves how models learn, but they do not provide specific numerical comparisons in the abstract.

Why It Matters

This approach could make building advanced AI faster, cheaper, and more sustainable. By using AI to help organize and improve its own training data, researchers could reduce reliance on massive, expensive data collections. This system might help develop AI that understands facts more accurately, avoids hallucinations (making up information), and learns from less data—key steps toward real human-level intelligence.

Limitations

  • The researchers report that the framework’s effectiveness depends on the quality of the LLMs used for data assessment and editing, which may introduce bias or error.
  • No details are given on how the framework performs across different domains (e.g., medicine, law, science), so generalizability is unclear.
  • The abstract does not explain how the tiered data was generated at scale, so real-world implementation challenges remain.