I think that was one of those cases where there really wasn’t a clean, library-level workaround other than using IterableDataset…
It’s a common headache with HF libraries: they rely on caching mechanisms that bring performance benefits but are hard to just turn off…
just in case, @lhoestq
Why load_dataset("parquet", data_dir=...) creates another ~500 GB copy
Your local Parquet files are not being downloaded again. They are being used as source files to build a normal Hugging Face Dataset.
The key distinction is:
- /path_to_data/*.parquet = your original storage/source format
- /path_to_cache/datasets/parquet/... = Hugging Face Datasets' prepared Arrow cache
A regular non-streaming Hugging Face Dataset is Arrow-backed. The loading path prepares the dataset as Arrow files in the Datasets cache, while streaming=True takes a different path and avoids that full preparation step. Hugging Face’s loading docs describe the builder as able to “download and prepare the dataset as Arrow files in the cache” or “get a streaming dataset without downloading or caching anything.” (Hugging Face)
So this:
from datasets import load_dataset
ds = load_dataset("parquet", data_dir="/path_to_data")
roughly means:
local Parquet shards
↓ read/decode
Arrow cache files
↓ memory-mapped / accessed by Dataset
regular map-style Dataset
It does not mean:
regular Dataset directly points at compressed Parquet files in-place
That is why your disk usage doubles.
Why Parquet is not used directly as the regular Dataset backend
Parquet and Arrow solve different problems.
| Format | Main role | Why it matters here |
| --- | --- | --- |
| Parquet | Compressed storage / interchange / data lake format | Efficient on disk, but must be decoded before use |
| Arrow | Runtime table / memory-oriented columnar format | Better suited to fast indexing, slicing, mapping, and memory mapping |
Apache Arrow’s docs explain the lower-level reason: Parquet data must be decoded from the Parquet encoding and compression, so it cannot be memory-mapped directly from disk the way an Arrow IPC-style file can; memory_map=True may help on some systems, but it does not remove the decoding or resident-memory cost. (Apache Arrow)
Apache Arrow’s FAQ makes the same conceptual split: Parquet is optimized for storage efficiency with compression and encoding, while Arrow is laid out for direct and efficient computation. (Apache Arrow)
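To make the split concrete, here is a minimal PyArrow sketch (the file names data.parquet and data.arrow are hypothetical): reading Parquet always pays a decode step, while an Arrow IPC file can be memory-mapped almost zero-copy.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Parquet: bytes must be decompressed and decoded into Arrow arrays.
# memory_map=True only changes how the raw file bytes are read.
table = pq.read_table("data.parquet", memory_map=True)

# Arrow IPC file: the on-disk layout matches the in-memory layout,
# so the table can be backed by the OS page cache with almost no copying.
with pa.memory_map("data.arrow", "r") as source:
    mapped_table = ipc.open_file(source).read_all()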
That is the core cause of the “duplicate”:
Parquet is your compact source format.
Arrow is the normal Hugging Face Dataset runtime format.
The Datasets cache is therefore not just a download cache. Hugging Face’s cache docs distinguish the Hub cache, which stores downloaded Hub files, from the Datasets cache, which stores datasets converted into Arrow format. (Hugging Face)
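In practice the two caches are controlled by separate environment variables, so they can be placed independently (the paths are examples):
export HF_HUB_CACHE="/mnt/big_disk/hf_hub_cache"            # downloaded Hub files
export HF_DATASETS_CACHE="/mnt/big_disk/hf_datasets_cache"  # datasets converted to Arrow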
What will not solve it
cache_dir=None
This does not mean “no cache.” Passing None simply leaves the cache in its default location; the Arrow preparation still happens.
Use cache_dir to move the Arrow cache:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
But this still creates the Arrow cache.
datasets.disable_caching()
This is a common trap. It does not stop the initial non-streaming load_dataset() preparation cache. A related Hugging Face forum discussion confirms the distinction: disabling caching affects transform-style cache behavior such as .map() / .filter() intermediates, but load_dataset() still writes the original prepared dataset cache. (Hugging Face Forums)
So this is not enough:
import datasets
from datasets import load_dataset
datasets.disable_caching()
ds = load_dataset("parquet", data_dir="/path_to_data")
keep_in_memory=True
For ~500 GB, this usually makes the problem worse. It shifts pressure from disk to RAM. Unless you have unusually large memory and a narrow one-off workload, avoid it.
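If you do use it, restrict it to a subset that comfortably fits in RAM. The parameter itself looks like this (the subset path is hypothetical):
from datasets import load_dataset

# Loads the prepared table fully into RAM instead of memory-mapping it.
ds = load_dataset(
    "parquet",
    data_dir="/path_to_small_subset",
    split="train",
    keep_in_memory=True,
)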
Real solution categories
There is no single flag that gives all of these at once:
regular Dataset
+ random access
+ no Arrow cache
+ direct compressed-Parquet backing
You need to pick one of the four strategies below.
Strategy 1 — Avoid the Arrow copy with streaming
This is the cleanest answer if disk space is the main constraint.
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
)
This returns an IterableDataset, not a regular Dataset.
Use this when you are:
- training over examples,
- tokenizing,
- scanning,
- filtering,
- computing statistics,
- converting to another format,
- doing mostly sequential reads.
Hugging Face’s streaming docs explicitly describe streaming local files without conversion, including cases where Arrow conversion would take too long or exceed available disk. (Hugging Face)
The tradeoff is API behavior. Hugging Face’s map-style-vs-iterable guide says IterableDataset is ideal for very large datasets, including hundreds of GB, while regular Dataset is better when you need normal indexed behavior. (Hugging Face)
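The difference shows up immediately in the API. A quick sketch of what each form supports:
# Regular Dataset (map-style): random access and length.
#   ds[12345], ds[0:10], len(ds)

# IterableDataset (streaming): sequential operations only.
first_five = list(ds.take(5))  # read the first five examples
rest = ds.skip(1_000)          # skip ahead without indexing
# ds[12345] and len(ds) are not supported here.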
Add column selection immediately
For Parquet, this is especially important:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["text", "label"],
)
Parquet is columnar, and the streaming docs note that columns and filters can be used to stream only selected columns and apply filtering. (Hugging Face)
Add filters when useful
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["text", "label", "quality_score"],
filters=[("quality_score", ">=", 0.8)],
)
This works best if your Parquet files have useful row-group statistics or partitioning. If every row group contains mixed values, filtering may still require substantial scanning.
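You can check whether your shards carry usable min/max statistics with PyArrow before relying on filters (the shard name is hypothetical):
import pyarrow.parquet as pq

pf = pq.ParquetFile("/path_to_data/part-00000.parquet")
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max)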
Shuffle carefully
ds = ds.shuffle(seed=42, buffer_size=100_000)
This is a buffer shuffle, not a perfect global shuffle. For training it is often acceptable, but if your shards are sorted by source, time, label, project, or language, also randomize shard order or re-shard the data.
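Randomizing shard order can be done at load time by passing an explicitly shuffled file list (a sketch):
import glob
import random
from datasets import load_dataset

files = sorted(glob.glob("/path_to_data/*.parquet"))
random.Random(42).shuffle(files)  # seeded shard-order shuffle

ds = load_dataset("parquet", data_files=files, split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=100_000)  # buffer shuffle on top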
Strategy 2 — Avoid streaming=True, but reduce what gets materialized
If you need a normal map-style Dataset, assume the Arrow cache is unavoidable. Your job is then to make it smaller.
Load only required columns
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label"],
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
This still writes Arrow cache, but only for selected columns.
Avoid this pattern:
ds = load_dataset("parquet", data_dir="/path_to_data", split="train")
ds = ds.remove_columns(["huge_unused_column"])
By the time remove_columns() runs, the huge column may already have been materialized into the Arrow cache.
Use filters during load, not after full load
Prefer:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label", "quality_score"],
filters=[("quality_score", ">=", 0.8)],
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
Avoid:
ds = load_dataset("parquet", data_dir="/path_to_data", split="train")
ds = ds.filter(lambda x: x["quality_score"] >= 0.8)
The second version may first materialize the full dataset, then filter it.
Pre-reduce the Parquet before Hugging Face Datasets
This is often the best non-streaming workaround.
Use DuckDB or PyArrow to create a smaller Parquet dataset first:
import duckdb
duckdb.sql("""
COPY (
SELECT
id,
text,
label
FROM read_parquet('/path_to_data/*.parquet')
WHERE text IS NOT NULL
AND quality_score >= 0.8
) TO '/path_to_reduced_data/reduced.parquet'
(FORMAT PARQUET)
""")
Then load the reduced dataset normally:
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_reduced_data",
split="train",
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
DuckDB’s Parquet docs describe projection and filter pushdown for Parquet scans, which is exactly what you want before handing the data to a map-style Dataset. (DuckDB)
This changes the storage picture from:
500 GB raw Parquet
+ ~500 GB Arrow cache
to something more like:
500 GB raw Parquet
+ smaller reduced Parquet
+ smaller Arrow cache
Strategy 3 — Accept Arrow, but manage it deliberately
If you need full Dataset behavior, accept the Arrow copy and make it intentional.
Put the cache on a large disk
export HF_DATASETS_CACHE="/mnt/big_nvme/hf_datasets_cache"
or:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
cache_dir="/mnt/big_nvme/hf_datasets_cache",
)
This does not save storage. It prevents accidental cache growth under your home directory or system disk.
Convert once, then save a named prepared dataset
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label"],
cache_dir="/mnt/scratch/hf_build_cache",
)
ds.save_to_disk(
"/mnt/datasets/my_dataset_arrow_v1",
max_shard_size="2GB",
)
Future runs should use:
from datasets import load_from_disk
ds = load_from_disk("/mnt/datasets/my_dataset_arrow_v1")
This is cleaner than repeatedly rebuilding from raw Parquet. The Datasets docs cover saving and reloading prepared datasets via save_to_disk() / load_from_disk(). (Hugging Face)
Be careful: during conversion you may temporarily have three large things:
1. raw Parquet source
2. temporary HF Arrow cache
3. saved Arrow dataset artifact
After validating the saved artifact, remove the temporary build cache if it is no longer needed.
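With the layout used above, that cleanup is just removing the scratch directory once load_from_disk() on the saved artifact checks out:
rm -rf /mnt/scratch/hf_build_cache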
Strategy 4 — Stream raw data into final processed shards
If your real goal is tokenization or preprocessing, do not materialize the raw dataset as Arrow first.
Better pipeline:
raw Parquet
-> streaming read
-> tokenize/process
-> write final processed Parquet or Arrow shards
Example skeleton:
from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
raw = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["id", "text"],
)
out_dir = Path("/path_to_tokenized_parquet")
out_dir.mkdir(parents=True, exist_ok=True)
buffer = []
shard_id = 0
rows_per_shard = 100_000
def tokenize_text(text):
# Replace with your tokenizer.
return {
"input_ids": [1, 2, 3],
"attention_mask": [1, 1, 1],
}
for row in raw:
encoded = tokenize_text(row["text"])
buffer.append({
"id": row["id"],
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"],
})
if len(buffer) >= rows_per_shard:
table = pa.Table.from_pylist(buffer)
pq.write_table(table, out_dir / f"part-{shard_id:05d}.parquet")
buffer.clear()
shard_id += 1
if buffer:
table = pa.Table.from_pylist(buffer)
pq.write_table(table, out_dir / f"part-{shard_id:05d}.parquet")
This avoids:
raw Parquet
-> full raw Arrow cache
-> tokenized Arrow cache
-> final saved copy
and instead creates only the processed artifact you actually need.
How I would decide in your case
If disk is tight
Use:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["needed_col_1", "needed_col_2"],
)
This is the closest to “reference my local Parquet without another 500 GB copy.”
If you need a regular Dataset
Use:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["needed_col_1", "needed_col_2"],
filters=[("quality_score", ">=", 0.8)], # if applicable
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
This still creates Arrow cache, but reduces and relocates it.
If you repeatedly use the same dataset
Convert once:
ds.save_to_disk("/mnt/datasets/my_dataset_arrow_v1")
Then reuse:
ds = load_from_disk("/mnt/datasets/my_dataset_arrow_v1")
Do not rebuild the Arrow cache from raw Parquet for every experiment.
If your task is preprocessing/tokenization
Prefer:
stream raw Parquet -> write final processed shards
rather than:
raw Parquet -> full raw Arrow cache -> processed cache -> final output
Quick diagnostic checks
Check what backing files were created
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
)
print(ds.cache_files[:5])
If you see .arrow files under the Datasets cache, that is the normal prepared dataset backend.
Test streaming behavior safely
from itertools import islice
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
)
print(type(ds))
for row in islice(ds, 5):
print(row.keys())
You should get an iterable dataset and no full ~500 GB Arrow cache.
Estimate non-streaming expansion on a small subset
Do not test the full 500 GB first. Use 1–5% of shards:
ds = load_dataset(
"parquet",
data_dir="/path_to_small_subset",
split="train",
columns=["id", "text", "label"],
cache_dir="/tmp/hf_cache_test",
)
print(ds.cache_files[:3])
Then compare the small Parquet size with the small Arrow cache size, and use that ratio to estimate the full run.
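A small helper makes the comparison concrete (the paths reuse the test above):
import os

def dir_size_bytes(path):
    # Sum file sizes recursively under path.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

parquet_bytes = dir_size_bytes("/path_to_small_subset")
arrow_bytes = dir_size_bytes("/tmp/hf_cache_test")
print(f"expansion ratio: {arrow_bytes / parquet_bytes:.2f}x")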
Bottom line
The extra ~500 GB appears because non-streaming load_dataset("parquet", data_dir=...) builds a regular Arrow-backed Hugging Face Dataset. Your local Parquet files are the source, not the final runtime backing store.
Your realistic choices are:
- Avoid the copy: use streaming=True to get an IterableDataset.
- Reduce the copy: use columns, filters, or pre-reduce with DuckDB/PyArrow.
- Accept the copy deliberately: move HF_DATASETS_CACHE, save a named prepared dataset with save_to_disk(), and clean temporary build caches.
Compact summary
- The cache is not a duplicate download; it is the Arrow runtime representation.
- Parquet is compressed source storage; Arrow is the regular Dataset backend.
- cache_dir moves the cache; it does not remove it.
- disable_caching() does not stop the initial load_dataset() preparation.
- streaming=True is the cleanest no-extra-copy path.
- If avoiding streaming, select columns and filter before Arrow materialization.
- For repeated use, convert once with save_to_disk(), then load_from_disk().
- If preprocessing, stream raw Parquet into final processed shards instead of building a full raw Arrow cache first.