PyTables has a powerful capability to deal with native HDF5 files created with other tools. However, there are situations where you may want to create truly native PyTables files with those tools while retaining full compatibility with the PyTables format. That is perfectly possible, and this appendix presents the format that your own generated files should follow in order to be fully PyTables compatible.
We are going to describe version 2.0 of the PyTables file format (introduced in PyTables version 2.0). As time goes by, some changes might be introduced (and documented here) in order to cope with new needs. However, any change will be carefully weighed so as to preserve backward compatibility whenever possible.
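As a quick check of which format version a given PyTables installation writes, the sketch below reads the format_version attribute of a File object. The file name is arbitrary, and the in-memory core driver is used here only so that nothing is written to disk; neither is required by the format.

```python
import tables

# In-memory HDF5 file (core driver, no backing store): nothing touches the disk.
# The file name is an arbitrary placeholder.
with tables.open_file("format_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # Version of the PyTables file format used for this file.
    format_version = h5.format_version

print("PyTables file format version:", format_version)
```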
A PyTables file is composed of an arbitrarily large number of HDF5 groups (Groups in the PyTables naming scheme) and datasets (Leaves in the PyTables naming scheme). For groups, the only requirement is that they must have some system attributes available. By convention, system attributes in PyTables are written in upper case and user attributes in lower case, but this is not enforced by the software. In the case of datasets, besides the mandatory system attributes, some further conditions apply to their storage layout and to the datatypes they use, as we will see shortly.
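The naming convention for attributes can be observed from PyTables itself. The following is a minimal sketch, assuming a group called "detector" and a user attribute called "operator" (both names are made up for illustration); it lists the system and user attributes of the group separately.

```python
import tables

with tables.open_file("attrs_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # Hypothetical group name and title, chosen only for this example.
    group = h5.create_group(h5.root, "detector", title="Detector data")

    # A user attribute: lower case by convention, not enforced by PyTables.
    group._v_attrs.operator = "alice"

    # System attributes are maintained by PyTables; user attributes are ours.
    system_names = group._v_attrs._f_list("sys")
    user_names = group._v_attrs._f_list("user")

print("system attributes:", system_names)
print("user attributes:", user_names)
```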
As a final remark, you can use any filter you want to create a PyTables file, provided that the filter is a standard one in HDF5, like zlib, shuffle or szip. Although szip cannot be used from within PyTables to create a new file, datasets compressed with szip can be read, because it is the HDF5 library that does the decompression transparently.
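For comparison, this is roughly how the zlib and shuffle filters are requested when the file is written from PyTables itself. It is only a sketch: the dataset name, shape and compression level are arbitrary choices, not values required by the format.

```python
import numpy as np
import tables

# zlib compression plus the shuffle filter; level 5 is an arbitrary choice.
filters = tables.Filters(complevel=5, complib="zlib", shuffle=True)

with tables.open_file("filters_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # Filters need a chunked layout, so a CArray is used here.
    carray = h5.create_carray(h5.root, "signal", atom=tables.Float64Atom(),
                              shape=(1000,), filters=filters)
    data = np.arange(1000, dtype=np.float64)
    carray[:] = data

    stored_complib = carray.filters.complib
    stored_shuffle = bool(carray.filters.shuffle)
    # Decompression is transparent when reading the data back.
    roundtrip_ok = bool(np.array_equal(carray[:], data))

print(stored_complib, stored_shuffle, roundtrip_ok)
```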
The File object is, in fact, a special HDF5 group structure that is the root for the rest of the objects in the object tree. The following attributes are mandatory for the HDF5 root group structure in PyTables files:
The following attributes are mandatory for group structures:
The following attributes are optional for group structures:
This depends on the kind of Leaf. The format for each type follows.
The following attributes are mandatory for table structures:
A Table has a dataspace with a 1-dimensional chunked layout.
The datatype of the elements (rows) of Table must be the H5T_COMPOUND compound data type, and each of these compound components must be built only from the following HDF5 data type classes:
H5T_BITFIELD: This class is used to represent the Bool type. Such a type must be built using an H5T_NATIVE_B8 datatype, followed by an HDF5 H5Tset_precision call to set its precision to just 1 bit.
H5T_STRING: The datatype used to describe strings in PyTables is H5T_C_S1 (i.e. a C string type), followed by a call to the HDF5 H5Tset_size() function to set their length.
H5T_ARRAY: This allows the construction of homogeneous, multidimensional arrays, so that you can include such objects in compound records. The types supported as elements of H5T_ARRAY data types are the ones described above. Currently, PyTables does not support nested H5T_ARRAY types.
H5T_COMPOUND: This allows support for datatypes that are compounds of compounds (also known as nested types throughout this manual).
This support can also be used for defining complex numbers. Their format is described below:
The H5T_COMPOUND type class contains two members. Both members must have the H5T_FLOAT atomic datatype class. The name of the first member should be “r” and represents the real part. The name of the second member should be “i” and represents the imaginary part. The precision property of both of the H5T_FLOAT members must be either 32 significant bits (e.g. H5T_NATIVE_FLOAT) or 64 significant bits (e.g. H5T_NATIVE_DOUBLE). They represent the Complex32 and Complex64 types, respectively. These names count the bits of each component; NumPy calls the same types complex64 and complex128.
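To see what such a compound row type looks like in practice, the sketch below lets PyTables build a table from a NumPy structured dtype containing a string, a boolean, a small array, a complex number and a nested record. All field names are invented for illustration, and the example shows only the PyTables side, not the HDF5 C calls another tool would need.

```python
import numpy as np
import tables

# Hypothetical row layout; every field name here is made up for the example.
row_dtype = np.dtype([
    ("name", "S16"),                  # fixed-length string
    ("active", np.bool_),             # boolean
    ("position", np.float64, (3,)),   # homogeneous array inside the record
    ("impedance", np.complex128),     # complex number (two 64-bit floats)
    ("info", [("run", np.int32)]),    # nested compound
])

with tables.open_file("table_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    table = h5.create_table(h5.root, "particles", description=row_dtype)

    rows = np.zeros(2, dtype=row_dtype)
    rows["name"] = [b"alpha", b"beta"]
    rows["active"] = [True, False]
    rows["impedance"] = [1 + 2j, 3 - 4j]
    rows["info"]["run"] = [7, 8]
    table.append(rows)
    table.flush()

    chunk_rank = len(table.chunkshape)        # tables are chunked along one dimension
    column_types = dict(table.coltypes)       # PyTables type name for each column
    first_impedance = complex(table[0]["impedance"])

print(chunk_rank, column_types, first_impedance)
```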
The following attributes are mandatory for array structures:
An Array has a dataspace with an N-dimensional contiguous layout (if you prefer a chunked layout, see EArray below).
The elements of Array must have either HDF5 atomic data types or a compound data type representing a complex number. The atomic data types can currently be one of the following HDF5 data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. The H5T_TIME class is also supported for reading existing Array objects, but not for creating them. See the description in the Table format section for more info about these types.
In addition to the HDF5 atomic data types, the Array format supports complex numbers with the H5T_COMPOUND data type class. See the description in the Table format section for more info about this special type.
You should note that H5T_ARRAY class datatypes are not allowed in Array objects.
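The contiguous layout of an Array can be confirmed from PyTables, where a non-chunked leaf reports no chunk shape. This is a small sketch with an arbitrary dataset name and contents.

```python
import numpy as np
import tables

with tables.open_file("array_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # A 3x4 block of 32-bit integers (an atomic type allowed in Array).
    grid = h5.create_array(h5.root, "grid",
                           np.arange(12, dtype=np.int32).reshape(3, 4))

    array_shape = tuple(int(n) for n in grid.shape)
    # Contiguous (non-chunked) leaves have no chunk shape.
    array_chunkshape = grid.chunkshape

print(array_shape, array_chunkshape)
```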
The following attributes are mandatory for CArray structures:
A CArray has a dataspace with an N-dimensional chunked layout.
The elements of CArray must have either HDF5 atomic data types or a compound data type representing a complex number. The atomic data types can currently be one of the following HDF5 data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. The H5T_TIME class is also supported for reading existing CArray objects, but not for creating them. See the description in the Table format section for more info about these types.
In addition to the HDF5 atomic data types, the CArray format supports complex numbers with the H5T_COMPOUND data type class. See the description in the Table format section for more info about this special type.
You should note that H5T_ARRAY class datatypes are not yet allowed in CArray objects.
The following attributes are mandatory for EArray structures:
An EArray has a dataspace with an N-dimensional chunked layout.
The elements of EArray are allowed to have the same data types as the elements in the Array format. They can be one of the HDF5 atomic data type classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT, H5T_TIME or H5T_STRING. See the description in the Table format section for more info about these types. They can also be an H5T_COMPOUND datatype representing a complex number, as described in the Table format section.
You should note that H5T_ARRAY class data types are not allowed in EArray objects.
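The chunked layout is what makes an EArray enlargeable. The sketch below, with an arbitrary dataset name and shape, creates an empty EArray from PyTables, appends two blocks of rows and inspects the resulting shape, the rank of the chunk shape and the extendable dimension.

```python
import numpy as np
import tables

with tables.open_file("earray_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # The dimension of length 0 is the one that can grow.
    samples = h5.create_earray(h5.root, "samples",
                               atom=tables.Float32Atom(), shape=(0, 3))
    samples.append(np.ones((2, 3), dtype=np.float32))
    samples.append(np.zeros((5, 3), dtype=np.float32))

    earray_shape = tuple(int(n) for n in samples.shape)
    earray_chunk_rank = len(samples.chunkshape)   # one chunk length per dimension
    extendable_dim = int(samples.extdim)

print(earray_shape, earray_chunk_rank, extendable_dim)
```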
The following attributes are mandatory for VLArray structures:
A VLArray has a dataspace with a 1-dimensional chunked layout.
The data type of the elements (rows) of VLArray objects must be the H5T_VLEN variable-length (or VL for short) datatype, and the base datatype specified for the VL datatype can be any atomic HDF5 datatype listed in the description in the Table format section. That includes the classes:
They can also be an H5T_COMPOUND data type representing a complex number; see the Table format section for a detailed description.
You should note that this does not include another VL datatype, or a compound datatype that does not fit the description of a complex number. Note as well that, for object and vlstring pseudo-atoms, the base for the VL datatype is always an H5T_NATIVE_UCHAR (H5T_NATIVE_UINT for vlunicode). That means that the complete row entry in the dataset has to be used in order to fully serialize the object or the variable-length string.
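The relationship between pseudo-atoms and their base types can be inspected from PyTables. This sketch, with made-up dataset names and contents, stores ragged integer rows and one serialized Python object, then looks at the base atom of the object and vlunicode pseudo-atoms.

```python
import tables

with tables.open_file("vlarray_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    # Rows of different lengths sharing one atomic base type.
    ragged = h5.create_vlarray(h5.root, "ragged", atom=tables.Int32Atom())
    ragged.append([1, 2, 3])
    ragged.append([4])
    row_lengths = [len(row) for row in ragged.read()]

    # An object pseudo-atom: each row holds one serialized Python object.
    objects = h5.create_vlarray(h5.root, "objects", atom=tables.ObjectAtom())
    objects.append({"run": 7, "tags": ["a", "b"]})
    restored = objects[0]

# Base atoms behind the pseudo-atoms (unsigned bytes versus unsigned 32-bit ints).
object_base = tables.ObjectAtom().base.type
unicode_base = tables.VLUnicodeAtom().base.type

print(row_lengths, restored, object_base, unicode_base)
```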
The following attributes are optional for leaves:
FLAVOR: This is meant to provide information about the kind of object kept in the Leaf, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the following string values:
- “numpy”: Read data (structured arrays, arrays, records, scalars) will be returned as NumPy objects.
- “python”: Read data will be returned as Python lists, tuples, or scalars.
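The effect of the flavor can be tried out from PyTables by changing the flavor property of a leaf, which in turn updates the FLAVOR attribute. The following is a minimal sketch with an arbitrary dataset name and contents.

```python
import numpy as np
import tables

with tables.open_file("flavor_sketch.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    values = h5.create_array(h5.root, "values",
                             np.array([1, 2, 3], dtype=np.int64))
    default_flavor = values.flavor      # flavor of a leaf created from NumPy data

    # Ask for plain Python objects on subsequent reads.
    values.flavor = "python"
    as_python = values.read()
    stored_flavor = str(values.attrs.FLAVOR)

print(default_flavor, as_python, stored_flavor)
```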