The second option might not be compatible with XET (now the HF standard)…
You are running into two separate but tightly related constraints in the current Hub upload stack:
- The Hub “commit API” only accepts a very specific notion of “file-like”.
- The Xet fast path largely expects filesystem paths, not arbitrary file-like streams.
Those two facts together make “true S3 → Hub streaming” (no local staging, still fast, still Xet) hard today.
1) What your two errors actually mean
Error A: path_or_fileobj must be ... str, bytes, io.BufferedIOBase
This comes from huggingface_hub’s commit/upload plumbing (CommitOperationAdd, and anything layered on top of it).
- The public docs for HfApi explicitly say path_or_fileobj must be:
  - a local file path (str/Path)
  - bytes
  - or a "file object" that is a subclass of io.BufferedIOBase and supports seek() + tell() (typical open(..., "rb")). (Hugging Face)
- The underlying implementation enforces exactly that isinstance(..., io.BufferedIOBase) check and raises the ValueError you saw if it is not satisfied. (Hugging Face)
Why this bites you with S3/fsspec
Many cloud FS libraries (including fsspec-based ones) return objects that behave like buffered files but are not subclasses of io.BufferedIOBase. So they fail the strict type check even if they have .read()/.seek()/.tell().
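For instance, a handle from s3fs (one common fsspec implementation) has the right methods but the wrong type. A quick check, assuming s3fs is installed and the bucket/key below are placeholders:

```python
# s3fs file handles expose read/seek/tell but are not io.BufferedIOBase
# subclasses, so the commit API's isinstance check rejects them.
import io
import s3fs

fs = s3fs.S3FileSystem()
with fs.open("my-bucket/data/shard-0000.lance", "rb") as f:
    print(type(f))                                 # s3fs.core.S3File
    print(isinstance(f, io.BufferedIOBase))        # False -> the ValueError you saw
    print(hasattr(f, "seek"), hasattr(f, "tell"))  # True True
```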
Error B: Uploading files as a binary IO buffer is not supported by Xet Storage. Falling back to HTTP upload
This is expected behavior when Xet integration is not being used for the upload payload you provide.
Hugging Face has a tracking issue that spells it out: uploads using bytes or binary IO objects do not go through Xet, and older or mismatched library versions can trigger fallback behavior. The fix there is “upgrade to a Xet-enabled combination” (datasets>=3.6.0 and huggingface_hub>=0.31.0). (GitHub)
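If you want to rule out a version mismatch first, a trivial check (the pins in the comment are the combination cited in that issue):

```python
# The tracking issue points at datasets>=3.6.0 and huggingface_hub>=0.31.0
# as the Xet-enabled combination.
import datasets
import huggingface_hub

print("huggingface_hub:", huggingface_hub.__version__)
print("datasets:", datasets.__version__)
```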
Separately, the Hub environment variables doc makes clear the platform has moved to Xet as the main transfer backend and points you to Xet performance settings (and notes hf_transfer is deprecated in this world). (Hugging Face)
Practical translation
Even if you do manage to pass a “valid” file-like object, you can still end up on the HTTP upload path rather than the Xet path depending on how you provide the data.
2) Why “direct streaming S3 → Hub” is not straightforward (even if it sounds like it should be)
A Hub upload is not “send bytes once and done” in many cases:
- The commit pipeline frequently needs size + hashing / identity and sometimes re-reads or seeks during upload planning.
- The commit API explicitly requires seek() + tell() for file objects. (Hugging Face)
- Xet is chunk-based and can do parallelism/dedup, but that tends to be implemented around filesystem-backed inputs and local caching knobs (see Xet cache and high-performance knobs). (Hugging Face)
So a pure streaming object that is not seekable, or not recognized as a “real buffered file”, breaks assumptions.
3) The “docs mismatch” you implicitly discovered
Hugging Face Datasets’ “Cloud storage” guide shows an example that does:
```python
with fs.open(data_file) as fileobj:
    upload_file(path_or_fileobj=fileobj, ...)
```
(Hugging Face)
That example assumes fileobj is accepted. In practice, with many fsspec implementations, fileobj will not be an io.BufferedIOBase subclass, so you hit your ValueError. This is a real footgun: the example demonstrates the intended workflow, but the concrete Python type constraints can block it.
4) What you can do today (realistic options, with tradeoffs)
Option 1: “Make it work and make it fast” (recommended if Xet performance matters)
Stage to a filesystem path (local disk or mounted FS), then use upload_large_folder.
- The Hub upload guide positions upload_large_folder as the robust way to push large folders: it hashes, resumes, uses multiple workers, and is built for cases where "streaming large amounts of data is challenging." (Hugging Face)
- There is also an hf upload-large-folder ... CLI entry point. (Hugging Face)
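A minimal sketch of the Python path, assuming the Lance directory has already been staged to local disk at /data/staged_lance and that user/lance-demo is a dataset repo you can write to (both names are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(
    repo_id="user/lance-demo",
    repo_type="dataset",           # repo_type is required for upload_large_folder
    folder_path="/data/staged_lance",
    # num_workers=8,               # optional: tune parallelism for your machine
)
```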
This is not “zero-copy”, but it is:
- predictable
- resumable
- most likely to use the best-supported backend path
If your real dealbreaker is “no local disk”, you can still do bounded staging (download one shard/file at a time to ephemeral disk, upload, delete). You still pay S3 egress either way, but you control peak disk usage.
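A rough sketch of that bounded-staging loop, assuming boto3 plus huggingface_hub and treating my-bucket, lance-dataset/, and user/lance-demo as placeholders:

```python
import os
import tempfile

import boto3
from huggingface_hub import HfApi

BUCKET = "my-bucket"         # placeholder
PREFIX = "lance-dataset/"    # placeholder
REPO_ID = "user/lance-demo"  # placeholder dataset repo

s3 = boto3.client("s3")
api = HfApi()

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        with tempfile.TemporaryDirectory() as tmp:
            local_path = os.path.join(tmp, os.path.basename(key))
            s3.download_file(BUCKET, key, local_path)  # stage exactly one object
            api.upload_file(
                path_or_fileobj=local_path,            # a real path satisfies the type check
                path_in_repo=key[len(PREFIX):],
                repo_id=REPO_ID,
                repo_type="dataset",
            )
        # the temporary directory is deleted here, bounding peak disk usage
```

Note this creates one commit per file; for Lance directories with many small files, staging a whole shard directory at a time and pushing it with upload_folder (or batching operations into a single create_commit) keeps commit overhead down.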
Option 2: “No big staging, but accept risk” (mount S3 as a filesystem path)
Mount the S3 prefix via something like s3fs-fuse or rclone mount, then upload from that mount path as if it’s local.
This can satisfy the “needs a path” constraints, but there are pitfalls:
- Mount stability and large-file behavior can be flaky depending on tool and flags (this is a known ecosystem problem, not HF-specific). (rclone forum)
- Random reads for hashing or chunking can amplify latency over a FUSE mount.
This is sometimes good enough, but it is not as deterministic as real local disk.
Option 3: “Pure S3 streaming, but you will likely lose Xet” (advanced and usually disappointing)
You can try to wrap your S3 object in something that:
- is an io.BufferedIOBase subclass
- correctly implements read, seek, tell
That can get you past the ValueError. (Hugging Face)
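A rough sketch of what such a wrapper could look like, using boto3 Range GETs (bucket/key are placeholders, there is no read-ahead buffering, and nothing here guarantees the Xet path is taken):

```python
import io
import boto3


class S3RangeReader(io.BufferedIOBase):
    """Seekable, read-only view over an S3 object via ranged GETs (illustrative only)."""

    def __init__(self, bucket: str, key: str):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._key = key
        self._size = self._s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, size=-1):
        if self._pos >= self._size:
            return b""
        # Inclusive byte range; every read() is a separate Range GET to S3.
        end = self._size - 1 if size is None or size < 0 else min(self._pos + size, self._size) - 1
        resp = self._s3.get_object(
            Bucket=self._bucket,
            Key=self._key,
            Range=f"bytes={self._pos}-{end}",
        )
        data = resp["Body"].read()
        self._pos += len(data)
        return data
```

Passing an instance as path_or_fileobj should clear the type check, but the caveats below still apply.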
But:
- You still need to implement seek() meaningfully. For S3 that typically means Range GETs and buffering.
- You should expect extra reads (hashing + upload), so you may pull the same data twice unless you cache.
- And you may still see Xet fall back to HTTP depending on how the upload path decides eligibility. (GitHub)
Net: you save some disk, but you do not magically avoid “download then upload” at the network level. The bytes still traverse your machine unless Hugging Face provides a server-side ingest path (not documented in the current upload guides, which focus on local files / file objects). (Hugging Face)
5) For your specific “Lance on Hub” goal: separate upload vs access
You are trying to make “host Lance datasets on the Hub” practical. That has two halves:
A) Uploading the Lance directory structure
This is mostly an “upload lots of files, reliably” problem today. upload_large_folder exists specifically for that. (Hugging Face)
The earlier community reports about per-minute file throughput and large-folder pain are consistent with why upload_large_folder was introduced. (Hugging Face Forums)
B) Reading Lance efficiently from the Hub with random access
That is what the Lance/Vortex/Iceberg/Zarr “hosting” issue is about. (GitHub)
You already found the key downstream tracking issues:
- Lance repo discussion about exposing Hugging Face filesystem access via OpenDAL. (GitHub)
- Vortex tracking issue that references the same HF datasets issue. (GitHub)
Also note: efficient random access often depends on HTTP Range behavior and storage proxy quirks. There is active work around Range support on the Xet bridge host for partial reads (different layer than your upload problem, but it signals this area is evolving quickly). (GitHub)
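For a feel of what the read side looks like today, a hedged sketch using HfFileSystem (repo and file names are placeholders; whether partial reads are efficient for your repo depends on the Range behavior discussed above):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
with fs.open("datasets/user/lance-demo/data/shard-0000.lance", "rb") as f:
    f.seek(4096)          # jump into the file instead of downloading it all
    chunk = f.read(1024)  # typically served via ranged HTTP requests under the hood
print(len(chunk))
```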
6) Where to look next (high-signal references and why they matter)
Type constraints and why your S3 handle fails
- HfApi docs for path_or_fileobj types and the io.BufferedIOBase requirement. (Hugging Face)
- The CommitOperationAdd validation logic that hard-checks (io.BufferedIOBase, bytes). (Hugging Face)
Xet fallback behavior and the “upgrade matrix”
- “Faster downloads/uploads with Xet storage” issue (explains the warning and pins the “works well with Xet” version combo). (GitHub)
- Hub environment variables, especially HF_XET_HIGH_PERFORMANCE, HF_XET_CACHE, and the note that hf_transfer is deprecated under Xet. (Hugging Face)
Uploading large folders robustly
- Hub upload guide section on upload_large_folder, its limitations, and operational tips. (Hugging Face)
- Community pain-point issue about large folders failing in practice. (GitHub)
The Lance/Vortex “host on Hub” roadmap
- The core tracking issue: “Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub”. (GitHub)
- Lance-side integration direction via OpenDAL. (GitHub)
- Vortex-side tracking issue. (GitHub)
HfFileSystem is not a magic fix for uploads
- The HfFileSystem docs explicitly warn that it adds overhead and recommend using HfApi for performance and reliability. (Hugging Face)
So even if hf:// works great for reading, it’s not necessarily the best path for pushing 100 GB+.
7) Concrete “next move” for your case
If the goal is “100 GB+ Lance dataset on Hub” and you want the best chance of:
- not hitting file-count / rate pitfalls
- not falling back to slow paths unexpectedly
- not fighting type checks
Then the highest-probability path is:
- Ensure Xet-capable versions (at least the combo called out in the Xet tracking issue). (GitHub)
- Use upload_large_folder (Python or CLI) from a real filesystem directory. (Hugging Face)
- If local disk is the blocker, use bounded staging or a mount approach, understanding the mount tradeoffs. (rclone forum)
- Track the Lance/Vortex integration issues for the read path after upload. (GitHub)
If you want, share:
- which library versions you are on (datasets, huggingface_hub)
- whether the repo is on Xet storage
- how many files are in the Lance directory (order of magnitude)
And the decision between “bounded staging” vs “mount” vs “custom wrapper” can be made more mechanically.
Summary
- The ValueError is a strict type check: only str/Path, bytes, or io.BufferedIOBase are accepted. (Hugging Face)
- Xet often does not support uploads provided as bytes/buffered file-like objects, so it falls back to HTTP unless you’re on the right versions and using the supported path. (GitHub)
- For 100 GB+ directory-style datasets (like Lance), upload_large_folder from a filesystem path is the most reliable current approach. (Hugging Face)
- Hosting “Lance on Hub” is actively tracked, but upload and efficient random-access reads are separate problems. (GitHub)