xorbits.pandas.read_parquet

xorbits.pandas.read_parquet(path, engine: str = 'auto', columns: list = None, groups_as_chunks: bool = False, use_arrow_dtype: bool = None, incremental_index: bool = False, storage_options: dict = None, memory_scale: int = None, merge_small_files: bool = True, merge_small_file_options: dict = None, gpu: bool = None, **kwargs)

Load a parquet object from the file path, returning a DataFrame.

Parameters
  • path (str, path object or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') –

    Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

    When using the 'pyarrow' engine and no storage options are provided and a filesystem is implemented by both pyarrow.fs and fsspec (e.g. “s3://”), then the pyarrow.fs filesystem is attempted first. Use the filesystem keyword with an instantiated fsspec filesystem if you wish to use its implementation.

  • columns (list, default=None) – If not None, only these columns will be read from the file.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see the fsspec and urllib documentation for more details and examples on storage options. A sketch that combines engine and storage_options follows this parameter list.

    New in version 1.3.0(pandas).

  • use_nullable_dtypes (bool, default False (Not supported yet)) –

    If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes are added that support pd.NA in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional supported dtypes) may change without notice.

    Deprecated since version 2.0(pandas).

  • dtype_backend ({'numpy_nullable', 'pyarrow'}, default 'numpy_nullable' (Not supported yet)) –

    Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

    • "numpy_nullable": returns nullable-dtype-backed DataFrame (default).

    • "pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

    New in version 2.0(pandas).

  • filesystem (fsspec or pyarrow filesystem, default None (Not supported yet)) –

    Filesystem object to use when reading the parquet file. Only implemented for engine="pyarrow".

    New in version 2.1.0(pandas).

  • filters (List[Tuple] or List[List[Tuple]], default None (Not supported yet)) –

    To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, =, >, >=, <, <=, !=, in, not in]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer list combines these sets of filters through an OR operation. A single list of tuples can also be used, meaning that no OR operation between sets of filters is to be conducted.

    Using this argument will NOT result in row-wise filtering of the final partitions unless engine="pyarrow" is also specified. For other engines, filtering is only performed at the partition level, that is, to prevent the loading of some row-groups and/or files.

    New in version 2.1.0(pandas).

  • **kwargs – Any additional kwargs are passed to the engine.
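
As a concrete illustration of engine and storage_options, here is a minimal sketch of reading a remote parquet directory through fsspec; the bucket name and object path are hypothetical placeholders, and the anon flag simply requests anonymous S3 access via s3fs.

>>> import xorbits.pandas as pd
>>> df = pd.read_parquet(
...     "s3://my-bucket/partition_dir",     # hypothetical S3 directory
...     engine="pyarrow",                   # choose pyarrow explicitly instead of 'auto'
...     columns=["foo", "bar"],             # read only the columns that are needed
...     storage_options={"anon": True},     # forwarded to fsspec for "s3://" URLs
... )

Because storage_options is provided, the fsspec implementation is used for the "s3://" scheme rather than pyarrow.fs, as described above.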

Return type

DataFrame

See also

DataFrame.to_parquet

Create a parquet object that serializes a DataFrame.

Examples

>>> original_df = pd.DataFrame(  
...     {"foo": range(5), "bar": range(5, 10)}
...    )
>>> original_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> df_parquet_bytes = original_df.to_parquet()  
>>> from io import BytesIO  
>>> restored_df = pd.read_parquet(BytesIO(df_parquet_bytes))  
>>> restored_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> restored_df.equals(original_df)  
True
>>> restored_bar = pd.read_parquet(BytesIO(df_parquet_bytes), columns=["bar"])  
>>> restored_bar  
    bar
0    5
1    6
2    7
3    8
4    9
>>> restored_bar.equals(original_df[['bar']])  
True

The function uses kwargs that are passed directly to the engine. In the following example, we use the filters argument of the pyarrow engine to filter the rows of the DataFrame.

Since pyarrow is the default engine, we can omit the engine argument. Note that the filters argument is implemented by the pyarrow engine, which can benefit from multithreading and also potentially be more economical in terms of memory.

>>> sel = [("foo", ">", 2)]  
>>> restored_part = pd.read_parquet(BytesIO(df_parquet_bytes), filters=sel)  
>>> restored_part  
    foo  bar
0    3    8
1    4    9
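
The outer level of a list-of-lists filter expresses an OR between groups of AND-ed tuples. The following is a minimal sketch continuing the pandas example above; the expected output assumes the same original_df.

>>> sel_or = [[("foo", "<", 1)], [("bar", ">", 8)]]
>>> restored_or = pd.read_parquet(BytesIO(df_parquet_bytes), filters=sel_or)
>>> restored_or
   foo  bar
0    0    5
1    4    9
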
Extra Parameters
----------------
groups_as_chunks : bool, default False
    If True, each row group corresponds to a chunk;
    if False, each file corresponds to a chunk.
    Only available for the 'pyarrow' engine.
incremental_index : bool, default False
    If index_col is not specified, ensure the resulting range index is
    incremental; setting this to False gives slightly better performance.
use_arrow_dtype : bool, default None
    If True, use Arrow dtypes to store columns.
    Enabled by default when pandas >= 2.1.
memory_scale : int, optional
    Ratio of the actual memory occupation to the raw file size.
merge_small_files : bool, default True
    If True, merge small files.
merge_small_file_options : dict
    Options for merging small files.
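
A minimal sketch of how these xorbits-specific options might be passed, reusing the example directory path from the path description above; the path is a placeholder, and calling xorbits.init() is only needed when a local cluster has not already been started.

>>> import xorbits
>>> import xorbits.pandas as pd
>>> xorbits.init()
>>> df = pd.read_parquet(
...     "file://localhost/path/to/tables",   # placeholder directory of parquet files
...     groups_as_chunks=True,               # one chunk per row group ('pyarrow' engine only)
...     use_arrow_dtype=True,                # back columns with Arrow dtypes
...     merge_small_files=True,              # combine small files
... )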

This docstring was copied from pandas.