Overview: HDF5 is a specification and file format for storing very large datasets hierarchically. In HDF5, the data is organized in a file, and the file object acts as the / (root) group of the hierarchy. As in a UNIX file system, the datasets and their groups are organized as an inverted tree, and several groups can be created under the / (root) group. The h5py package is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk as if they were real NumPy arrays.
This tutorial shows how to use h5py, a Python package for storing big data efficiently. It focuses mainly on creating and reading HDF5 files.
1. What is H5PY?
h5py is a package that interfaces Python with the HDF5 binary data format, enabling you to store large amounts of numeric data and manipulate it from NumPy.
2. Importance of H5PY
h5py lets you store and manipulate large amounts of numerical data. Imagine that you need to store large amounts of data and access it quickly; plain text files will not work well. Scientists, for example, run cosmological simulations that generate large quantities of data. To analyze the results, they need to reach exactly the dataset they want quickly and painlessly. h5py works well in such cases.
HDF5 is a powerful, fast binary format with no upper limit on file size. It supports parallel I/O and carries many low-level optimizations that make queries faster and reduce memory requirements.
Multi-terabyte datasets can be sliced as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, then categorized and tagged however we want. h5py uses familiar NumPy and Python metaphors directly, such as NumPy array syntax and dictionaries. For example, you can iterate over the datasets in a file, or check dataset attributes such as .dtype or .shape.
Although h5py is an easy-to-use high-level interface, it is built on an object-oriented Cython wrapping of the HDF5 C API, so almost anything you can do from C in HDF5 you can also do from h5py. In addition, the files are written in a widely used standard binary format and can be exchanged with people who use other tools such as MATLAB and IDL. Finally, installing HDF5 directly can be a pain, whereas installing h5py through your favorite package manager is comparatively simple.
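As a rough illustration of the dictionary-style and NumPy-style access described above, here is a minimal sketch. The file, group, dataset, and attribute names are hypothetical, and a tiny array stands in for a large dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, and dataset names, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Groups behave like directories under the root group "/".
    grp = f.create_group("simulation")
    dset = grp.create_dataset("density", data=np.arange(1000.0).reshape(100, 10))
    # Attributes attach small pieces of metadata (tags) to a dataset.
    dset.attrs["units"] = "g/cm^3"

with h5py.File(path, "r") as f:
    # Dictionary-style access: list the names stored in a group.
    names = list(f["simulation"].keys())
    dset = f["simulation/density"]
    shape, dtype = dset.shape, dset.dtype
    # NumPy-style slicing reads just the selected rows and columns.
    block = dset[10:12, :3]
    units = dset.attrs["units"]

print(names, shape, dtype, units)
print(block)
```

The slice returns an ordinary NumPy array, so the rest of an analysis can proceed without any HDF5-specific code.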
3. Installation of H5PY
A pre-built installation is the recommended way to install h5py. It is available through Python distributions, h5py wheels, or OS-specific package managers.
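Whichever installation route you choose, a quick way to check that it worked is to import the package and print its version. This is only a sanity check, not part of the installation itself.

```python
import h5py

# Version of the h5py package and of the HDF5 library it was built against.
print("h5py:", h5py.__version__)
print("HDF5:", h5py.version.hdf5_version)
```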
4. Write HDF5 files
Next, we show how to write HDF5 files using Python. First we import the libraries we need.
Then we create two NumPy arrays of random numbers: the first with dimensions 100 x 100 and the second with dimensions 200 x 200.
Because the datasets are NumPy arrays, we can confirm their dimensions with the .shape attribute.
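The steps so far (importing the libraries, creating the two arrays, and checking their dimensions) might look like the sketch below. The variable names d1 and d2 are placeholders chosen for this example.

```python
import numpy as np
import h5py

# Two arrays of random numbers; the names d1 and d2 are placeholders.
d1 = np.random.random(size=(100, 100))
d2 = np.random.random(size=(200, 200))

# Confirm the dimensions of each array.
print(d1.shape, d2.shape)
```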
Finally, now that the datasets have been created, we can use the h5py library to store the data in the HDF5 format.
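A minimal sketch of the write step follows. The file name data.h5 and the dataset names dataset_1 and dataset_2 are assumptions made for this example, and the file is placed in a temporary directory so that nothing existing is overwritten.

```python
import os
import tempfile

import h5py
import numpy as np

d1 = np.random.random(size=(100, 100))
d2 = np.random.random(size=(200, 200))

# Hypothetical output path; a temporary directory avoids overwriting real files.
path = os.path.join(tempfile.mkdtemp(), "data.h5")

# Mode "w" creates the file, truncating it if it already exists.
hf = h5py.File(path, "w")
hf.create_dataset("dataset_1", data=d1)
hf.create_dataset("dataset_2", data=d2)
hf.close()
```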
If you want to compress the data stored in the HDF5 file, pass a compression option when creating each dataset.
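The text above does not say which parameter was meant. One option h5py does provide is the compression keyword of create_dataset, optionally combined with compression_opts to set the gzip level. The sketch below assumes gzip at level 9; the path and dataset name are again hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

d1 = np.random.random(size=(100, 100))

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data_compressed.h5")

with h5py.File(path, "w") as hf:
    # Assumption: gzip compression at level 9 (levels range from 0 to 9).
    dset = hf.create_dataset(
        "dataset_1", data=d1, compression="gzip", compression_opts=9
    )
    # The dataset records which filter and settings were applied.
    used, level = dset.compression, dset.compression_opts

print(used, level)
```

Compression is transparent when reading: the data is decompressed automatically when the dataset is sliced. How much space it saves depends on the data, and random numbers like these compress poorly.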
5. Reading HDF5 files
Last but not least, now that we have written some data to the HDF5 file, we want to read it. This can be done as follows:
Read HDF5 file
Print dataset names in the HDF5 file
Access the data in a dataset
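The reading steps listed above can be sketched as follows. To keep the example runnable on its own, it first writes a small file; the path and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small file first so the example runs on its own.
# The path and the dataset names are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "data.h5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("dataset_1", data=np.random.random(size=(100, 100)))
    hf.create_dataset("dataset_2", data=np.random.random(size=(200, 200)))

# Read the HDF5 file.
hf = h5py.File(path, "r")

# Print the dataset names in the HDF5 file.
names = list(hf.keys())
print(names)

# Access the data in a dataset: slicing with [:] copies it into a NumPy array.
n1 = hf["dataset_1"][:]
print(type(n1), n1.shape)

hf.close()
```

Because n1 is a copy held in memory, it remains usable after the file has been closed.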
Don’t forget to close the file object when done.
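One way to avoid forgetting is to open the file in a with block, which closes it automatically when the block ends, even if an error occurs inside it. The sketch below uses a hypothetical file and dataset name.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with a single small dataset.
path = os.path.join(tempfile.mkdtemp(), "data.h5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("dataset_1", data=np.arange(10.0))

# The file is closed automatically when the with block exits.
with h5py.File(path, "r") as hf:
    total = hf["dataset_1"][:].sum()

# A closed h5py file object evaluates as False.
is_open = bool(hf)
print(total, is_open)
```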