I think that was one of those cases where there really wasn’t a clean, library-level workaround other than using IterableDataset…
It’s a common headache with HF libraries: they rely on caching mechanisms that bring performance benefits but are hard to just turn off…
just in case, @lhoestq
Why load_dataset("parquet", data_dir=...) creates another ~500 GB copy
Your local Parquet files are not being downloaded again. They are being used as source files to build a normal Hugging Face Dataset.
The key distinction is:
- /path_to_data/*.parquet = your original storage/source format
- /path_to_cache/datasets/parquet/... = Hugging Face Datasets' prepared Arrow cache
A regular non-streaming Hugging Face Dataset is Arrow-backed. The loading path prepares the dataset as Arrow files in the Datasets cache, while streaming=True takes a different path and avoids that full preparation step. Hugging Face’s loading docs describe the builder as able to “download and prepare the dataset as Arrow files in the cache” or “get a streaming dataset without downloading or caching anything.” (Hugging Face)
So this:
from datasets import load_dataset
ds = load_dataset("parquet", data_dir="/path_to_data")
roughly means:
local Parquet shards
↓ read/decode
Arrow cache files
↓ memory-mapped / accessed by Dataset
regular map-style Dataset
It does not mean:
regular Dataset directly points at compressed Parquet files in-place
That is why your disk usage doubles.
Why Parquet is not used directly as the regular Dataset backend
Parquet and Arrow solve different problems.
| Format | Main role | Why it matters here |
| --- | --- | --- |
| Parquet | Compressed storage / interchange / data lake format | Efficient on disk, but must be decoded before use |
| Arrow | Runtime table / memory-oriented columnar format | Better suited to fast indexing, slicing, mapping, and memory mapping |
Apache Arrow’s docs explain the lower-level reason: Parquet data must be decoded from the Parquet encoding and compression, so it cannot be memory-mapped directly from disk the way an Arrow IPC-style file can; memory_map=True may help on some systems, but it does not remove the decoding or resident-memory cost. (Apache Arrow)
Apache Arrow’s FAQ makes the same conceptual split: Parquet is optimized for storage efficiency with compression and encoding, while Arrow is laid out for direct and efficient computation. (Apache Arrow)
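To make the split concrete, here is a minimal PyArrow sketch (the file names data.parquet and data.arrow are hypothetical): reading Parquet always pays a decode step, while an Arrow IPC file can be memory-mapped almost zero-copy.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Parquet: bytes must be decompressed and decoded into Arrow arrays.
# memory_map=True only changes how the raw file bytes are read.
table = pq.read_table("data.parquet", memory_map=True)

# Arrow IPC file: the on-disk layout matches the in-memory layout,
# so the table can be backed by the OS page cache with almost no copying.
with pa.memory_map("data.arrow", "r") as source:
    mapped_table = ipc.open_file(source).read_all()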
That is the core cause of the “duplicate”:
Parquet is your compact source format.
Arrow is the normal Hugging Face Dataset runtime format.
The Datasets cache is therefore not just a download cache. Hugging Face’s cache docs distinguish the Hub cache, which stores downloaded Hub files, from the Datasets cache, which stores datasets converted into Arrow format. (Hugging Face)
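In practice the two caches are controlled by separate environment variables, so they can be placed independently (the paths are examples):
export HF_HUB_CACHE="/mnt/big_disk/hf_hub_cache"            # downloaded Hub files
export HF_DATASETS_CACHE="/mnt/big_disk/hf_datasets_cache"  # datasets converted to Arrow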
What will not solve it
cache_dir=None
This does not mean “no cache.” Passing None simply leaves the cache in its default location; the Arrow preparation still happens.
Use cache_dir to move the Arrow cache:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
But this still creates the Arrow cache.
datasets.disable_caching()
This is a common trap. It does not stop the initial non-streaming load_dataset() preparation cache. A related Hugging Face forum discussion confirms the distinction: disabling caching affects transform-style cache behavior such as .map() / .filter() intermediates, but load_dataset() still writes the original prepared dataset cache. (Hugging Face Forums)
So this is not enough:
import datasets
from datasets import load_dataset
datasets.disable_caching()
ds = load_dataset("parquet", data_dir="/path_to_data")
keep_in_memory=True
For ~500 GB, this usually makes the problem worse. It shifts pressure from disk to RAM. Unless you have unusually large memory and a narrow one-off workload, avoid it.
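If you do use it, restrict it to a subset that comfortably fits in RAM. The parameter itself looks like this (the subset path is hypothetical):
from datasets import load_dataset

# Loads the prepared table fully into RAM instead of memory-mapping it.
ds = load_dataset(
    "parquet",
    data_dir="/path_to_small_subset",
    split="train",
    keep_in_memory=True,
)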
Real solution categories
There is no single flag that gives all of these at once:
regular Dataset
+ random access
+ no Arrow cache
+ direct compressed-Parquet backing
You need to pick one of the four strategies below.
Strategy 1 — Avoid the Arrow copy with streaming
This is the cleanest answer if disk space is the main constraint.
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
)
This returns an IterableDataset, not a regular Dataset.
Use this when you are:
- training over examples,
- tokenizing,
- scanning,
- filtering,
- computing statistics,
- converting to another format,
- doing mostly sequential reads.
Hugging Face’s streaming docs explicitly describe streaming local files without conversion, including cases where Arrow conversion would take too long or exceed available disk. (Hugging Face)
The tradeoff is API behavior. Hugging Face’s map-style-vs-iterable guide says IterableDataset is ideal for very large datasets, including hundreds of GB, while regular Dataset is better when you need normal indexed behavior. (Hugging Face)
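The difference shows up immediately in the API. A quick sketch of what each form supports:
# Regular Dataset (map-style): random access and length.
#   ds[12345], ds[0:10], len(ds)

# IterableDataset (streaming): sequential operations only.
first_five = list(ds.take(5))  # read the first five examples
rest = ds.skip(1_000)          # skip ahead without indexing
# ds[12345] and len(ds) are not supported here.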
Add column selection immediately
For Parquet, this is especially important:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["text", "label"],
)
Parquet is columnar, and the streaming docs note that columns and filters can be used to stream only selected columns and apply filtering. (Hugging Face)
Add filters when useful
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["text", "label", "quality_score"],
filters=[("quality_score", ">=", 0.8)],
)
This works best if your Parquet files have useful row-group statistics or partitioning. If every row group contains mixed values, filtering may still require substantial scanning.
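You can check whether your shards carry usable min/max statistics with PyArrow before relying on filters (the shard name is hypothetical):
import pyarrow.parquet as pq

pf = pq.ParquetFile("/path_to_data/part-00000.parquet")
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max)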
Shuffle carefully
ds = ds.shuffle(seed=42, buffer_size=100_000)
This is a buffer shuffle, not a perfect global shuffle. For training it is often acceptable, but if your shards are sorted by source, time, label, project, or language, also randomize shard order or re-shard the data.
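Randomizing shard order can be done at load time by passing an explicitly shuffled file list (a sketch):
import glob
import random
from datasets import load_dataset

files = sorted(glob.glob("/path_to_data/*.parquet"))
random.Random(42).shuffle(files)  # seeded shard-order shuffle

ds = load_dataset("parquet", data_files=files, split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=100_000)  # buffer shuffle on top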
Strategy 2 — Avoid streaming=True, but reduce what gets materialized
If you need a normal map-style Dataset, assume the Arrow cache is unavoidable. Your job is then to make it smaller.
Load only required columns
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label"],
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
This still writes Arrow cache, but only for selected columns.
Avoid this pattern:
ds = load_dataset("parquet", data_dir="/path_to_data", split="train")
ds = ds.remove_columns(["huge_unused_column"])
By the time remove_columns() runs, the huge column may already have been materialized into the Arrow cache.
Use filters during load, not after full load
Prefer:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label", "quality_score"],
filters=[("quality_score", ">=", 0.8)],
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
Avoid:
ds = load_dataset("parquet", data_dir="/path_to_data", split="train")
ds = ds.filter(lambda x: x["quality_score"] >= 0.8)
The second version may first materialize the full dataset, then filter it.
Pre-reduce the Parquet before Hugging Face Datasets
This is often the best non-streaming workaround.
Use DuckDB or PyArrow to create a smaller Parquet dataset first:
import duckdb
duckdb.sql("""
COPY (
SELECT
id,
text,
label
FROM read_parquet('/path_to_data/*.parquet')
WHERE text IS NOT NULL
AND quality_score >= 0.8
) TO '/path_to_reduced_data/reduced.parquet'
(FORMAT PARQUET)
""")
Then load the reduced dataset normally:
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_reduced_data",
split="train",
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
DuckDB’s Parquet docs describe projection and filter pushdown for Parquet scans, which is exactly what you want before handing the data to a map-style Dataset. (DuckDB)
This changes the storage picture from:
500 GB raw Parquet
+ ~500 GB Arrow cache
to something more like:
500 GB raw Parquet
+ smaller reduced Parquet
+ smaller Arrow cache
Strategy 3 — Accept Arrow, but manage it deliberately
If you need full Dataset behavior, accept the Arrow copy and make it intentional.
Put the cache on a large disk
export HF_DATASETS_CACHE="/mnt/big_nvme/hf_datasets_cache"
or:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
cache_dir="/mnt/big_nvme/hf_datasets_cache",
)
This does not save storage. It prevents accidental cache growth under your home directory or system disk.
Convert once, then save a named prepared dataset
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["id", "text", "label"],
cache_dir="/mnt/scratch/hf_build_cache",
)
ds.save_to_disk(
"/mnt/datasets/my_dataset_arrow_v1",
max_shard_size="2GB",
)
Future runs should use:
from datasets import load_from_disk
ds = load_from_disk("/mnt/datasets/my_dataset_arrow_v1")
This is cleaner than repeatedly rebuilding from raw Parquet. The Datasets docs cover saving and reloading prepared datasets via save_to_disk() / load_from_disk(). (Hugging Face)
Be careful: during conversion you may temporarily have three large things:
1. raw Parquet source
2. temporary HF Arrow cache
3. saved Arrow dataset artifact
After validating the saved artifact, remove the temporary build cache if it is no longer needed.
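With the layout used above, that cleanup is just removing the scratch directory once load_from_disk() on the saved artifact checks out:
rm -rf /mnt/scratch/hf_build_cache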
Strategy 4 — Stream raw data into final processed shards
If your real goal is tokenization or preprocessing, do not materialize the raw dataset as Arrow first.
Better pipeline:
raw Parquet
-> streaming read
-> tokenize/process
-> write final processed Parquet or Arrow shards
Example skeleton:
from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
raw = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["id", "text"],
)
out_dir = Path("/path_to_tokenized_parquet")
out_dir.mkdir(parents=True, exist_ok=True)
buffer = []
shard_id = 0
rows_per_shard = 100_000
def tokenize_text(text):
# Replace with your tokenizer.
return {
"input_ids": [1, 2, 3],
"attention_mask": [1, 1, 1],
}
for row in raw:
encoded = tokenize_text(row["text"])
buffer.append({
"id": row["id"],
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"],
})
if len(buffer) >= rows_per_shard:
table = pa.Table.from_pylist(buffer)
pq.write_table(table, out_dir / f"part-{shard_id:05d}.parquet")
buffer.clear()
shard_id += 1
if buffer:
table = pa.Table.from_pylist(buffer)
pq.write_table(table, out_dir / f"part-{shard_id:05d}.parquet")
This avoids:
raw Parquet
-> full raw Arrow cache
-> tokenized Arrow cache
-> final saved copy
and instead creates only the processed artifact you actually need.
How I would decide in your case
If disk is tight
Use:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
columns=["needed_col_1", "needed_col_2"],
)
This is the closest to “reference my local Parquet without another 500 GB copy.”
If you need a regular Dataset
Use:
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
columns=["needed_col_1", "needed_col_2"],
filters=[("quality_score", ">=", 0.8)], # if applicable
cache_dir="/mnt/big_disk/hf_datasets_cache",
)
This still creates Arrow cache, but reduces and relocates it.
If you repeatedly use the same dataset
Convert once:
ds.save_to_disk("/mnt/datasets/my_dataset_arrow_v1")
Then reuse:
ds = load_from_disk("/mnt/datasets/my_dataset_arrow_v1")
Do not rebuild the Arrow cache from raw Parquet for every experiment.
If your task is preprocessing/tokenization
Prefer:
stream raw Parquet -> write final processed shards
rather than:
raw Parquet -> full raw Arrow cache -> processed cache -> final output
Quick diagnostic checks
Check what backing files were created
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
)
print(ds.cache_files[:5])
If you see .arrow files under the Datasets cache, that is the normal prepared dataset backend.
Test streaming behavior safely
from itertools import islice
from datasets import load_dataset
ds = load_dataset(
"parquet",
data_dir="/path_to_data",
split="train",
streaming=True,
)
print(type(ds))
for row in islice(ds, 5):
print(row.keys())
You should get an iterable dataset and no full ~500 GB Arrow cache.
Estimate non-streaming expansion on a small subset
Do not test the full 500 GB first. Use 1–5% of shards:
ds = load_dataset(
"parquet",
data_dir="/path_to_small_subset",
split="train",
columns=["id", "text", "label"],
cache_dir="/tmp/hf_cache_test",
)
print(ds.cache_files[:3])
Then compare the small Parquet size with the small Arrow cache size, and use that ratio to estimate the full run.
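A small helper makes the comparison concrete (the paths reuse the test above):
import os

def dir_size_bytes(path):
    # Sum file sizes recursively under path.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

parquet_bytes = dir_size_bytes("/path_to_small_subset")
arrow_bytes = dir_size_bytes("/tmp/hf_cache_test")
print(f"expansion ratio: {arrow_bytes / parquet_bytes:.2f}x")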
Bottom line
The extra ~500 GB appears because non-streaming load_dataset("parquet", data_dir=...) builds a regular Arrow-backed Hugging Face Dataset. Your local Parquet files are the source, not the final runtime backing store.
Your realistic choices are:
- Avoid the copy: use streaming=True to get an IterableDataset.
- Reduce the copy: use columns, filters, or pre-reduce with DuckDB/PyArrow.
- Accept the copy deliberately: move HF_DATASETS_CACHE, save a named prepared dataset with save_to_disk(), and clean temporary build caches.
Compact summary
- The cache is not a duplicate download; it is the Arrow runtime representation.
- Parquet is compressed source storage; Arrow is the regular Dataset backend.
- cache_dir moves the cache; it does not remove it.
- disable_caching() does not stop the initial load_dataset() preparation.
- streaming=True is the cleanest no-extra-copy path.
- If avoiding streaming, select columns and filter before Arrow materialization.
- For repeated use, convert once with save_to_disk(), then load_from_disk().
- If preprocessing, stream raw Parquet into final processed shards instead of building a full raw Arrow cache first.