Upload a large folder from S3 to a dataset

Hi everyone, I’m trying to upload a Zarr image from S3 to Hugging Face. I have read the Cloud storage guide, but I cannot upload files one by one because the Zarr structure has many files for a single Zarr screen, and the Hugging Face upload aborts at the 128 files/minute rate limit.

I’d like to upload the whole folder from S3 to HF:

```python
import s3fs
from huggingface_hub import upload_large_folder

s3 = s3fs.S3FileSystem()
destination_dataset = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"

# really a directory in Zarr format
data_file = "bia-integrator-data/S-BIAD845/009bd3ab-eb79-4cf4-8a11-ad028b827c03/009bd3ab-eb79-4cf4-8a11-ad028b827c03.zarr"

with s3.open(data_file) as zarr_path:
    path_in_repo = data_file[len(data_dir) - 5:]  # data_dir defined elsewhere
    upload_large_folder(
        folder_path=zarr_path,
        repo_id=destination_dataset,
        repo_type="dataset",
    )
    print(f"Uploaded {data_file} to {path_in_repo}")
```

However, I get the following error:

```
TypeError: expected str, bytes or os.PathLike object, not S3File
```

What could be a possible workaround?


The first workaround seems straightforward…


Root cause: upload_large_folder expects a local filesystem path (str/PathLike), but you passed an s3fs.S3File, hence the TypeError. The helper does not traverse remote file objects; it only walks a local directory tree. (Hugging Face)

Workable paths, from least change to most change:

1) Mount S3 so it looks local, then call upload_large_folder

Effect: No code changes beyond the path. The Hub sees a normal folder.

  • Mount options

    • s3fs-fuse (Linux/macOS/BSD): FUSE mount of an S3 bucket. (GitHub)
    • rclone mount: stable, good VFS cache controls. Use --vfs-cache-mode=full for POSIX-like behavior. (Rclone)
  • Example

```bash
# s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse)
s3fs ${BUCKET} /mnt/s3 -o iam_role=auto,use_path_request_style

# rclone (https://rclone.org/commands/rclone_mount/)
rclone mount s3:${BUCKET} /mnt/s3 --vfs-cache-mode full
```

```python
# docs: https://huggingface.co/docs/huggingface_hub/guides/upload
from huggingface_hub import upload_large_folder

upload_large_folder(
    folder_path="/mnt/s3/bia-integrator-data/S-BIAD845/.../....zarr",
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",  # chunked into multiple commits and resumable by design
)
```

Notes: upload_large_folder splits large trees into several commits and can resume after interruption; this behavior is built in, so no extra flag is needed. Mounts with VFS write caching avoid odd POSIX edge cases. (Hugging Face)

2) Stream from S3 in batches using the commit API (no local copy)

Effect: Push 50–100 files per commit. Avoids the 128 files/min symptom and reduces 429s.

  • Why it works: CommitOperationAdd accepts a path or a file-like object. You can hand it s3fs file handles directly and commit in batches. (Hugging Face)

  • Minimal script

```python
# refs:
# - HF upload guide: https://huggingface.co/docs/huggingface_hub/guides/upload
# - HfApi.create_commit: https://huggingface.co/docs/huggingface_hub/package_reference/hf_api
import posixpath

import s3fs
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
fs = s3fs.S3FileSystem()

prefix = "s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/009bd3.../009bd3....zarr"
prefix_key = prefix.removeprefix("s3://")  # fs.find() returns keys without the scheme
repo_id = "stefanches/idr0012-fuchs-cellmorph-S-BIAD845"
root_in_repo = "009bd3....zarr"

ops, open_fhs = [], []
for key in fs.find(prefix):
    if key.endswith("/"):  # skip pseudo-dirs
        continue
    rel = key[len(prefix_key):].lstrip("/")
    fh = fs.open(key, "rb")          # S3 file-like
    open_fhs.append(fh)
    ops.append(CommitOperationAdd(
        path_in_repo=posixpath.join(root_in_repo, rel),
        path_or_fileobj=fh,
    ))
    if len(ops) >= 80:               # 50–100 per commit helps avoid 429s
        api.create_commit(repo_id=repo_id, repo_type="dataset",
                          operations=ops, commit_message="batch")
        for h in open_fhs:
            h.close()
        ops, open_fhs = [], []

if ops:
    api.create_commit(repo_id=repo_id, repo_type="dataset",
                      operations=ops, commit_message="final")
    for h in open_fhs:
        h.close()
```

Tip: if you still hit HTTP 429 with many small files, reduce the batch size or sleep between commits; large, flat trees of small files are known to trigger rate limiting, and issues and forum reports from 2024–2025 confirm this. (GitHub)
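
A minimal retry helper around create_commit, as a sketch (commit_with_retry is a hypothetical name, not a huggingface_hub API):

```python
import time

from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

def commit_with_retry(api: HfApi, repo_id: str, ops, message: str, max_tries: int = 5):
    """Hypothetical helper: retry create_commit with exponential backoff on HTTP 429."""
    for attempt in range(max_tries):
        try:
            return api.create_commit(repo_id=repo_id, repo_type="dataset",
                                     operations=ops, commit_message=message)
        except HfHubHTTPError as e:
            retryable = e.response is not None and e.response.status_code == 429
            if retryable and attempt < max_tries - 1:
                time.sleep(10 * 2 ** attempt)  # back off before retrying
            else:
                raise
```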

3) Collapse the Zarr into a single .zarr.zip, then upload one file

Effect: Replace thousands of tiny files with one LFS object. Best for read-heavy, write-rare assets.

  • Why it works: Zarr’s ZipStore stores an entire hierarchy in a single ZIP. Zarr v2 and v3 document ZipStore. Clients can open via ZipStore directly. (zarr.readthedocs.io)

  • Make the zip and upload

```python
# refs:
# - ZipStore v3 guide: https://zarr.readthedocs.io/en/v3.1.0/user-guide/storage.html
# - ZipStore v2 API:   https://zarr.readthedocs.io/en/v2.15.0/api/storage.html
# - HF upload_file:    https://huggingface.co/docs/huggingface_hub/guides/upload
import s3fs
import zarr  # zarr-python 2.x API shown; v3 reorganized the storage classes
from huggingface_hub import HfApi

fs = s3fs.S3FileSystem()
api = HfApi()

# Read from the directory-like Zarr on S3 (FSMap is a MutableMapping store)
src = fs.get_mapper("s3://YOUR_BUCKET/bia-integrator-data/S-BIAD845/.../.zarr")

# Write a single zip locally; ZipStore wants a path, and a local file also
# sidesteps the file-object type checks on upload
dst = zarr.storage.ZipStore("local.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()

# Upload the one file to the Hub from the local path
api.upload_file(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    path_in_repo="....zarr.zip",
    path_or_fileobj="local.zarr.zip",
)
```

Trade-offs: ZIP is immutable. Random writes require re-zipping. For read-only public datasets this is fine and common. (zarr.readthedocs.io)
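
Reading it back is symmetric. A sketch using the zarr-python 2.x API (the elided filename is whatever was uploaded above):

```python
import zarr
from huggingface_hub import hf_hub_download

# Fetch the single LFS object, then open the hierarchy straight from the zip
local_zip = hf_hub_download(
    repo_id="stefanches/idr0012-fuchs-cellmorph-S-BIAD845",
    repo_type="dataset",
    filename="....zarr.zip",
)
store = zarr.storage.ZipStore(local_zip, mode="r")
root = zarr.open_group(store, mode="r")
```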

Background and constraints, stated explicitly:

  • upload_large_folder is designed for huge trees and is resumable, but it walks a local path, not a file-like object. Use a mount or switch to commit-level APIs. Docs updated Jul 22 2024. (Hugging Face)
  • Rate limiting appears with many small files. Community issues mention 429/503 during big uploads. Batching and retries are the practical mitigations. Dates: 2024-10-01 and 2025-05-12. (GitHub)

Pitfalls and checks:

  • Preserve Zarr layout. When mounting, confirm that directory entries like .zgroup, .zarray, zarr.json are visible under the mount.
  • rclone VFS cache can use disk. Cap it with --vfs-cache-max-size if needed. (Reddit)
  • If you need pure read-only publishing and minimal file count, the .zarr.zip route is simplest for the Hub. Zarr docs show direct ZipStore reads. (zarr.readthedocs.io)

Suggested choice matrix:

  • Need minimal code change and you trust a FUSE mount → Option 1.
  • Need full control, want to stay pure-Python without mounts → Option 2.
  • Want to avoid small-file problems entirely and the dataset is static → Option 3.

Curated references (updated dates shown):

  • Hugging Face upload guide, commit API and file-like support. Updated Jul 22 2024. (Hugging Face)
  • multi_commits for large trees. Docs pages across versions. 2023–2024. (Hugging Face)
  • 429 symptoms with many files. GitHub issue 2024-10-01 and forum 2025-04-23. (GitHub)
  • Zarr ZipStore docs (v3.1.0 and v2.15.0). 2024–2025. (zarr.readthedocs.io)
  • s3fs-fuse README and rclone mount docs. Current. (GitHub)

Thanks @John6666, super informative. Do you actually use some LM for these answers, or collect the points somewhere? (I just adore the answer speed!)


Do you actually use some LM for these answers,

Just GPT-5 Thinking in a web browser. :sweat_smile:


I’m trying to make the second option work (not for zarr but for lance datasets). See: Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub · Issue #7863 · huggingface/datasets · GitHub

However, path_or_fileobj=fh isn’t supported unless fh is one of str, bytes, or io.BufferedIOBase:

```
ValueError: path_or_fileobj must be either an instance of str, bytes or io.BufferedIOBase. If you passed a file-like object, make sure it is in binary mode.
```

I’m not 100% sure, but I think this effectively leads to downloading and then uploading instead of direct streaming, which can be a dealbreaker for large datasets (100 GB+).

Another issue is that Xet doesn’t support binary IO buffers, so it falls back to HTTP upload:

```
Uploading files as a binary IO buffer is not supported by Xet Storage. Falling back to HTTP upload
```

Any pointers where to look? I’ve already looked at the HfFileSystem docs: Interact with the Hub through the Filesystem API


The second option might not be compatible with Xet (now the HF standard)…


You are running into two separate but tightly related constraints in the current Hub upload stack:

  1. The Hub “commit API” only accepts a very specific notion of “file-like”.
  2. The Xet fast path largely expects filesystem paths, not arbitrary file-like streams.

Those two facts together make “true S3 → Hub streaming” (no local staging, still fast, still Xet) hard today.


1) What your two errors actually mean

Error A: path_or_fileobj must be ... str, bytes, io.BufferedIOBase

This comes from huggingface_hub’s commit/upload plumbing (CommitOperationAdd, and anything layered on top of it).

  • The public docs for HfApi explicitly say path_or_fileobj must be:

    • local file path (str/Path)
    • bytes
    • or a “file object” that is a subclass of io.BufferedIOBase and supports seek() + tell() (typical open(..., "rb")). (Hugging Face)
  • The underlying implementation enforces exactly that isinstance(..., io.BufferedIOBase) check and raises the ValueError you saw if it’s not true. (Hugging Face)

Why this bites you with S3/fsspec
Many cloud FS libraries (including fsspec-based ones) return objects that behave like buffered files but are not subclasses of io.BufferedIOBase. So they fail the strict type check even if they have .read()/.seek()/.tell().
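
You can see the mismatch directly. A minimal check (the key is a placeholder):

```python
import io

import s3fs

fs = s3fs.S3FileSystem()
fh = fs.open("s3://YOUR_BUCKET/some/key", "rb")  # placeholder key

# fsspec file objects subclass io.IOBase, not io.BufferedIOBase,
# so this prints False and the commit API's type check rejects them
print(isinstance(fh, io.BufferedIOBase))
```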

Error B: Uploading files as a binary IO buffer is not supported by Xet Storage. Falling back to HTTP upload

This is expected behavior when Xet integration is not being used for the upload payload you provide.

Hugging Face has a tracking issue that spells it out: uploads using bytes or binary IO objects do not go through Xet, and older or mismatched library versions can trigger fallback behavior. The fix there is “upgrade to a Xet-enabled combination” (datasets>=3.6.0 and huggingface_hub>=0.31.0). (GitHub)
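
A quick sanity check of what you are actually running (version floor taken from that tracking issue):

```python
import datasets
import huggingface_hub

# The tracking issue calls out datasets>=3.6.0 and huggingface_hub>=0.31.0
print("datasets:", datasets.__version__)
print("huggingface_hub:", huggingface_hub.__version__)
```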

Separately, the Hub environment variables doc makes clear the platform has moved to Xet as the main transfer backend and points you to Xet performance settings (and notes hf_transfer is deprecated in this world). (Hugging Face)

Practical translation
Even if you do manage to pass a “valid” file-like object, you can still end up on the HTTP upload path rather than the Xet path depending on how you provide the data.


2) Why “direct streaming S3 → Hub” is not straightforward (even if it sounds like it should be)

A Hub upload is not “send bytes once and done” in many cases:

  • The commit pipeline frequently needs the size and a hash of each file, and sometimes re-reads or seeks during upload planning.
  • The commit API explicitly requires seek() + tell() for file objects. (Hugging Face)
  • Xet is chunk-based and can do parallelism/dedup, but that tends to be implemented around filesystem-backed inputs and local caching knobs (see Xet cache and high-performance knobs). (Hugging Face)

So a pure streaming object that is not seekable, or not recognized as a “real buffered file”, breaks assumptions.


3) The “docs mismatch” you implicitly discovered

Hugging Face Datasets’ “Cloud storage” guide shows an example that does:

```python
with fs.open(data_file) as fileobj:
    upload_file(path_or_fileobj=fileobj, ...)
```

(Hugging Face)

That example assumes fileobj is accepted. In practice, with many fsspec implementations, fileobj will not be an io.BufferedIOBase subclass, so you hit your ValueError. This is a real footgun: the example demonstrates the intended workflow, but the concrete Python type constraints can block it.
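
One narrow workaround for small files only: materialize the bytes, which the type check does accept. A sketch (bucket, key, and repo are placeholders; the whole file is held in RAM):

```python
import s3fs
from huggingface_hub import HfApi

fs = s3fs.S3FileSystem()
api = HfApi()

with fs.open("s3://YOUR_BUCKET/metadata.json", "rb") as fileobj:  # placeholder key
    api.upload_file(
        repo_id="your-user/your-dataset",  # placeholder repo
        repo_type="dataset",
        path_in_repo="metadata.json",
        path_or_fileobj=fileobj.read(),    # bytes pass the check, but live in memory
    )
```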


4) What you can do today (realistic options, with tradeoffs)

Option 1: “Make it work and make it fast” (recommended if Xet performance matters)

Stage to a filesystem path (local disk or mounted FS), then use upload_large_folder.

  • The Hub upload guide positions upload_large_folder as the robust way to push large folders: it hashes, resumes, uses multiple workers, and is built for “streaming large amounts of data is challenging.” (Hugging Face)
  • There is also an hf upload-large-folder ... CLI entry point. (Hugging Face)

This is not “zero-copy”, but it is:

  • predictable
  • resumable
  • most likely to use the best-supported backend path

If your real dealbreaker is “no local disk”, you can still do bounded staging (download one shard/file at a time to ephemeral disk, upload, delete). You still pay S3 egress either way, but you control peak disk usage.
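
A bounded-staging loop might look like this (a sketch; bucket, prefix, and repo names are placeholders):

```python
import os
import tempfile

import s3fs
from huggingface_hub import HfApi

fs = s3fs.S3FileSystem()
api = HfApi()

prefix = "YOUR_BUCKET/your-lance-dataset"  # placeholder S3 prefix
repo_id = "your-user/your-dataset"         # placeholder repo

for key in fs.find(prefix):
    rel = key[len(prefix):].lstrip("/")
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()
    try:
        fs.get(key, tmp.name)              # stage one object to ephemeral disk
        api.upload_file(
            repo_id=repo_id,
            repo_type="dataset",
            path_in_repo=rel,
            path_or_fileobj=tmp.name,      # a real path: passes type checks, Xet-eligible
        )
    finally:
        os.unlink(tmp.name)                # peak disk usage stays at one file
```

For many small files you would batch these into create_commit operations instead (see Option 2 earlier in the thread) to avoid per-file rate limits.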

Option 2: “No big staging, but accept risk” (mount S3 as a filesystem path)

Mount the S3 prefix via something like s3fs-fuse or rclone mount, then upload from that mount path as if it’s local.

This can satisfy the “needs a path” constraints, but there are pitfalls:

  • Mount stability and large-file behavior can be flaky depending on tool and flags (this is a known ecosystem problem, not HF-specific). (rclone forum)
  • Random reads for hashing or chunking can amplify latency over a FUSE mount.

This is sometimes good enough, but it is not as deterministic as real local disk.

Option 3: “Pure S3 streaming, but you will likely lose Xet” (advanced and usually disappointing)

You can try to wrap your S3 object in something that:

  • is an io.BufferedIOBase subclass
  • correctly implements read, seek, tell

That can get you past the ValueError; a wrapper sketch follows at the end of this option. (Hugging Face)
But:

  • You still need to implement seek() meaningfully. For S3 that typically means Range GET and buffering.
  • You should expect extra reads (hashing + upload), so you may pull the same data twice unless you cache.
  • And you may still see Xet fall back to HTTP depending on how the upload path decides eligibility. (GitHub)

Net: you save some disk, but you do not magically avoid “download then upload” at the network level. The bytes still traverse your machine unless Hugging Face provides a server-side ingest path (not documented in the current upload guides, which focus on local files / file objects). (Hugging Face)
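
For completeness, a minimal adapter along the lines of Option 3 (untested sketch; BufferedS3Reader is a hypothetical name, and the key is a placeholder):

```python
import io

import s3fs

class BufferedS3Reader(io.BufferedIOBase):
    """Hypothetical adapter: delegate to an s3fs file handle so isinstance checks pass."""

    def __init__(self, fh):
        self._fh = fh

    def read(self, size=-1):
        return self._fh.read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._fh.seek(offset, whence)

    def tell(self):
        return self._fh.tell()

    def readable(self):
        return True

    def seekable(self):
        return True

    def close(self):
        self._fh.close()
        super().close()

fs = s3fs.S3FileSystem()
wrapped = BufferedS3Reader(fs.open("s3://YOUR_BUCKET/data/part-0.lance", "rb"))  # placeholder key
```

Even with this, the Xet-eligibility caveat above still applies.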


5) For your specific “Lance on Hub” goal: separate upload vs access

You are trying to make “host Lance datasets on the Hub” practical. That has two halves:

A) Uploading the Lance directory structure

This is mostly an “upload lots of files, reliably” problem today. upload_large_folder exists specifically for that. (Hugging Face)
The earlier community reports about per-minute file throughput and large-folder pain are consistent with why upload_large_folder was introduced. (Hugging Face Forums)

B) Reading Lance efficiently from the Hub with random access

That is what the Lance/Vortex/Iceberg/Zarr “hosting” issue is about. (GitHub)
You already found the key downstream tracking issues:

  • Lance repo discussion about exposing Hugging Face filesystem access via OpenDAL. (GitHub)
  • Vortex tracking issue that references the same HF datasets issue. (GitHub)

Also note: efficient random access often depends on HTTP Range behavior and storage proxy quirks. There is active work around Range support on the Xet bridge host for partial reads (different layer than your upload problem, but it signals this area is evolving quickly). (GitHub)


6) Where to look next (high-signal references and why they matter)

Type constraints and why your S3 handle fails

  • HfApi docs for path_or_fileobj types and the io.BufferedIOBase requirement. (Hugging Face)
  • The CommitOperationAdd validation logic that hard-checks (io.BufferedIOBase, bytes). (Hugging Face)

Xet fallback behavior and the “upgrade matrix”

  • “Faster downloads/uploads with Xet storage” issue (explains the warning and pins the “works well with Xet” version combo). (GitHub)
  • Hub environment variables, especially HF_XET_HIGH_PERFORMANCE, HF_XET_CACHE, and the note that hf_transfer is deprecated under Xet. (Hugging Face)

Uploading large folders robustly

  • Hub upload guide section on upload_large_folder and its limitations and operational tips. (Hugging Face)
  • Community pain-point issue about large folders failing in practice. (GitHub)

The Lance/Vortex “host on Hub” roadmap

  • The core tracking issue: “Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub”. (GitHub)
  • Lance-side integration direction via OpenDAL. (GitHub)
  • Vortex-side tracking issue. (GitHub)

HfFileSystem is not a magic fix for uploads

  • The HfFileSystem docs explicitly warn it adds overhead and recommend using HfApi for performance and reliability. (Hugging Face)
    So even if hf:// works great for reading, it’s not necessarily the best path for pushing 100 GB+.

7) Concrete “next move” for your case

If the goal is “100 GB+ Lance dataset on Hub” and you want the best chance of:

  • not hitting file-count / rate pitfalls
  • not falling back to slow paths unexpectedly
  • not fighting type checks

Then the highest-probability path is:

  1. Ensure Xet-capable versions (at least the combo called out in the Xet tracking issue). (GitHub)
  2. Use upload_large_folder (Python or CLI) from a real filesystem directory. (Hugging Face)
  3. If local disk is the blocker, use bounded staging or a mount approach, understanding the mount tradeoffs. (rclone forum)
  4. Track the Lance/Vortex integration issues for the read path after upload. (GitHub)

If you want, share:

  • which library versions you are on (datasets, huggingface_hub)
  • whether the repo is on Xet storage
  • how many files are in the Lance directory (order of magnitude)

And the decision between “bounded staging” vs “mount” vs “custom wrapper” can be made more mechanically.


Summary

  • The ValueError is a strict type check: only str/Path, bytes, or io.BufferedIOBase are accepted. (Hugging Face)
  • Xet often does not support uploads provided as bytes/buffered file-like objects, so it falls back to HTTP unless you’re on the right versions and using the supported path. (GitHub)
  • For 100 GB+ directory-style datasets (like Lance), upload_large_folder from a filesystem path is the most reliable current approach. (Hugging Face)
  • Hosting “Lance on Hub” is actively tracked, but upload and efficient random-access reads are separate problems. (GitHub)