xorbits.datasets.Dataset.export

Dataset.export(path: Union[str, os.PathLike], storage_options: Optional[dict] = None, create_if_not_exists: Optional[bool] = True, max_chunk_rows: Optional[int] = None, column_groups: Optional[dict] = None, num_threads: Optional[int] = None, version: Optional[str] = None, overwrite: Optional[bool] = True)[source]

Export the dataset to storage.

The storage can be local or remote, e.g. local disk or S3.

Parameters
  • path (str or os.PathLike) – The export path; can be a local path or a remote URL. Please refer to fsspec for supported URLs.

  • storage_options (dict, optional) – Key/value pairs to be passed on to the caching file-system backend, if any.

  • create_if_not_exists (bool) – Whether to create the export path if it does not exist. Defaults to True.

  • max_chunk_rows (int) – The maximum number of rows per chunk file. Defaults to 100.

  • column_groups (dict) – A dict mapping each group name to a list of column indices or names; see the Examples section below.

  • num_threads (int) – The number of threads used to export each chunk.

  • version (str) – The version string of the exported dataset. Defaults to "0.0.0".

  • overwrite (bool) – Whether to overwrite the dataset version if it already exists. Defaults to True.

Returns

A dict of export info.

Return type

dict

Examples

Export to local disk.

>>> import xorbits.datasets as xdatasets
>>> ds = xdatasets.from_huggingface("cifar10", split="train")
>>> ds.export("./export_dir")
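
Export with a custom chunk size and column groups. A minimal sketch: "img" and "label" are the column names of the Hugging Face cifar10 dataset, and the group names "image" and "meta" are arbitrary examples; substitute your own dataset's schema.

>>> import xorbits.datasets as xdatasets
>>> ds = xdatasets.from_huggingface("cifar10", split="train")
>>> ds.export(
...     "./export_dir",
...     max_chunk_rows=1000,  # cap each chunk file at 1000 rows
...     column_groups={"image": ["img"], "meta": ["label"]},  # group name -> column names
... )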

Export to remote storage.

>>> import xorbits.datasets as xdatasets
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}
>>> ds = xdatasets.from_huggingface("cifar10", split="train")
>>> ds.export("s3://bucket/export_dir", storage_options=storage_options)
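
Export a new version without overwriting an existing one. A minimal sketch using the documented version and overwrite parameters; the version string "1.0.0" is only an example.

>>> import xorbits.datasets as xdatasets
>>> ds = xdatasets.from_huggingface("cifar10", split="train")
>>> ds.export("./export_dir", version="1.0.0", overwrite=False)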