Command Line Interface Guide: root_to_parquet

Instructions for the function hepconvert.root_to_parquet

Command:

hepconvert root-to-parquet [options] [OUT_FILE] [IN_FILE]

Examples:

hepconvert root-to-parquet -f --progress-bar --tree 'tree1' out_file.parquet in_file.root

Options:

--tree (str) If there are multiple TTrees in the input file, specify the name of the TTree to copy.
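
For example, to copy a single TTree from a file that contains several (the tree and file names here are placeholders):

hepconvert root-to-parquet --tree 'tree1' out_file.parquet in_file.root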

--drop-branches, -db, and --keep-branches, -kb (str, list of str, or dict) Specify branch names to drop from, or keep in, the output file. Either a str, a list of str (for multiple branches), or a dict of the form {‘tree’: ‘branches’} to drop or keep branches of specific TTrees. Wildcarding accepted.
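
For example, to drop every branch matching a wildcard pattern, or to keep only one named branch (branch names here are placeholders):

hepconvert root-to-parquet -f --drop-branches 'jet_*' out_file.parquet in_file.root
hepconvert root-to-parquet -f --keep-branches 'branch1' out_file.parquet in_file.root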

--cut (None or str) For branch skimming, passed to uproot.iterate. If not None, this expression filters all of the expressions.

--expressions (None, str, or list of str) For branch skimming, passed to uproot.iterate. Names of TBranches or aliases to convert to arrays, or mathematical expressions of them. If None, all TBranches selected by the filters are included.
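
For example, a skim that writes out a single branch for the entries passing a cut (the branch name 'pt' and the threshold are placeholders):

hepconvert root-to-parquet -f --cut 'pt > 20' --expressions 'pt' out_file.parquet in_file.root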

--force or -f Use this flag to overwrite the output file if it already exists.

--step-size (int or str) Size of batches of data to read and write. If an integer, the maximum number of entries to include in each iteration step; if a string, the maximum memory size to include. The string must be a number followed by a memory unit, such as “100 MB”. Default is “100 MB”.
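
For example, to read and write in batches of at most 500 MB or at most 10000 entries (both values are illustrative):

hepconvert root-to-parquet -f --step-size '500 MB' out_file.parquet in_file.root
hepconvert root-to-parquet -f --step-size 10000 out_file.parquet in_file.root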

--compression or -c (str) Compression type. Options are “lzma”, “zlib”, “lz4”, and “zstd”. Default is “zlib”.

--compression-level (int) Level of compression set by an integer. Default is 1.
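
For example, to compress the output with zstd at level 5 (an illustrative choice of codec and level):

hepconvert root-to-parquet -f --compression 'zstd' --compression-level 5 out_file.parquet in_file.root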

Options passed to ak.to_parquet:

--list-to32 (bool) If True, convert Awkward lists into 32-bit Arrow lists if they’re small enough, even if it means an extra conversion. Otherwise, signed 32-bit ak.types.ListType maps to Arrow ListType, signed 64-bit ak.types.ListType maps to Arrow LargeListType, and unsigned 32-bit ak.types.ListType picks whichever Arrow type its values fit into.

--string-to32 (bool) Same as the above for Arrow string and large_string.

--bytestring-to32 (bool) Same as the above for Arrow binary and large_binary.

--emptyarray-to (None or dtype) If None, ak.types.UnknownType maps to Arrow’s null type; otherwise, it is converted to a given numeric dtype.

--categorical-as-dictionary (bool) If True, ak.contents.IndexedArray and ak.contents.IndexedOptionArray labeled with __array__ = “categorical” are mapped to Arrow DictionaryArray; otherwise, the projection is evaluated before conversion (always the case without __array__ = “categorical”).

--extensionarray (bool) If True, this function returns extended Arrow arrays (at all levels of nesting), which preserve metadata so that Awkward to Arrow to Awkward preserves the array’s ak.types.Type (though not the ak.forms.Form). If False, this function returns generic Arrow arrays that might be needed for third-party tools that don’t recognize Arrow’s extensions. Even with extensionarray=False, the values produced by Arrow’s to_pylist method are the same as the values produced by Awkward’s ak.to_list.

--count-nulls (bool) If True, count the number of missing values at each level and include these in the resulting Arrow array, which makes some downstream applications faster. If False, skip the up-front cost of counting them.

-c or --compression (None, str, or dict) Compression algorithm name, passed to pyarrow.parquet.ParquetWriter. Parquet supports {“NONE”, “SNAPPY”, “GZIP”, “BROTLI”, “LZ4”, “ZSTD”} (where “GZIP” is also known as “zlib” or “deflate”). If a dict, the keys are column names (the same column names that ak.forms.Form.columns returns and ak.forms.Form.select_columns accepts) and the values are compression algorithm names, to compress each column differently.

--compression-level (None, int, or dict) Compression level, passed to pyarrow.parquet.ParquetWriter. Compression levels have different meanings for different compression algorithms: GZIP ranges from 1 to 9, but ZSTD ranges from -7 to 22, for example. Generally, higher numbers provide slower but smaller compression.

--row-group-size (int or None) Will be overwritten by --step-size.

--data-page-size (None or int) Number of bytes in each data page, passed to pyarrow.parquet.ParquetWriter. If None, the Parquet default of 1 MiB is used.

--parquet-flavor (None or “spark”) If None, the output Parquet file will follow Arrow conventions; if “spark”, it will follow Spark conventions. Some systems, such as Spark and Google BigQuery, might need Spark conventions, while others might need Arrow conventions. Passed to pyarrow.parquet.ParquetWriter as flavor.
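
For example, to write a file following Spark conventions, for downstream readers such as Spark or BigQuery (file names are placeholders):

hepconvert root-to-parquet -f --parquet-flavor 'spark' out_file.parquet in_file.root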

--parquet-version (“1.0”, “2.4”, or “2.6”) Parquet file format version. Passed to pyarrow.parquet.ParquetWriter as version.

--parquet-page-version (“1.0” or “2.0”) Parquet page format version. Passed to pyarrow.parquet.ParquetWriter as data_page_version.

--parquet-metadata-statistics (bool or dict) If True, include summary statistics for each data page in the Parquet metadata, which lets some applications search for data more quickly (by skipping pages). If a dict mapping column names to bool, include summary statistics on only the specified columns. Passed to pyarrow.parquet.ParquetWriter as write_statistics.

--parquet-dictionary-encoding (bool or dict) If True, allow Parquet to pre-compress with dictionary encoding. If a dict mapping column names to bool, only use dictionary encoding on the specified columns. Passed to pyarrow.parquet.ParquetWriter as use_dictionary.

--parquet-byte-stream-split (bool or dict) If True, pre-compress floating point fields (float32 or float64) with byte stream splitting, which collects all mantissas in one part of the stream and exponents in another. Passed to pyarrow.parquet.ParquetWriter (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) as use_byte_stream_split.

--parquet-coerce-timestamps (None, “ms”, or “us”) If None, any timestamps (datetime64 data) are coerced to a given resolution depending on parquet_version: versions “1.0” and “2.4” are coerced to microseconds, but later versions use the datetime64’s own units. If “ms” is explicitly specified, timestamps are coerced to milliseconds; if “us”, microseconds. Passed to pyarrow.parquet.ParquetWriter as coerce_timestamps.
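
For example, to coerce any timestamps to millisecond resolution (this assumes the input contains datetime64 data; file names are placeholders):

hepconvert root-to-parquet -f --parquet-coerce-timestamps 'ms' out_file.parquet in_file.root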

--parquet-old-int96-timestamps (None or bool) If True, use Parquet’s INT96 format for any timestamps (datetime64 data), taking priority over parquet_coerce_timestamps. If None, let the parquet_flavor decide. Passed to pyarrow.parquet.ParquetWriter as use_deprecated_int96_timestamps.

--parquet-compliant-nested (bool) If True, use the Spark/BigQuery/Parquet convention for nested lists, in which each list is a one-field record with field name “element”; otherwise, use the Arrow convention, in which the field name is “item”. Passed to pyarrow.parquet.ParquetWriter as use_compliant_nested_type.

--parquet-extra-options (None or dict) Any additional options to pass to pyarrow.parquet.ParquetWriter.

--storage-options (None or dict) Any additional options to pass to fsspec.core.url_to_fs to open a remote file for writing.
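
The same conversion can also be run from Python via hepconvert.root_to_parquet. A minimal sketch, assuming the keyword arguments are the snake_case equivalents of the CLI options above and that the file arguments are named in_file and out_file (check the API reference for the exact signature); all file and tree names are placeholders:

import hepconvert

# Convert one TTree from a ROOT file to a Parquet file,
# overwriting the output if it already exists.
hepconvert.root_to_parquet(
    in_file="in_file.root",       # input ROOT file (placeholder name)
    out_file="out_file.parquet",  # output Parquet file (placeholder name)
    tree="tree1",                 # which TTree to copy (placeholder name)
    step_size="100 MB",           # batch size for reading and writing
    force=True,                   # overwrite an existing output file
)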