Xhistogram Tutorial

Histograms are the foundation of many forms of data analysis. The goal of xhistogram is to make it easy to calculate weighted histograms in multiple dimensions over n-dimensional arrays, with control over the axes. Xhistogram builds on top of xarray, for automatic coordiantes and labels, and dask, for parallel scalability.

Toy Data

We start by showing an example with toy data. First we use xarray to create some random, normally distributed data.

1D Histogram

[1]:
import xarray as xr
import numpy as np
%matplotlib inline

nt, nx = 100, 30
da = xr.DataArray(np.random.randn(nt, nx), dims=['time', 'x'],
                  name='foo') # all inputs need a name
display(da)
da.plot()
<xarray.DataArray 'foo' (time: 100, x: 30)>
array([[ 1.5593684 ,  0.40145489, -0.22479739, ..., -0.3437886 ,
        -1.23356416,  0.03077635],
       [-0.2228109 ,  0.38783241,  0.68898411, ..., -0.16182289,
         1.23321392, -1.59096389],
       [ 1.92575725, -1.25776679, -1.30329336, ..., -0.51350828,
        -1.49556751,  0.51878971],
       ...,
       [-0.77886827,  0.10927865, -0.02355692, ..., -0.0774592 ,
         0.67645978,  0.58709225],
       [ 1.01767561,  1.07267062,  0.13102546, ..., -0.42234066,
         1.48014365, -0.0097595 ],
       [-0.05547442, -1.26186425,  0.10945666, ...,  0.37304738,
         1.46653078, -1.2716321 ]])
Dimensions without coordinates: time, x
[1]:
<matplotlib.collections.QuadMesh at 0x7f1a8fc81a00>
_images/tutorial_2_2.png

By default xhistogram operates on all dimensions of an array, just like numpy. However, it operates on xarray DataArrays, taking labels into account.

[2]:
from xhistogram.xarray import histogram

bins = np.linspace(-4, 4, 20)
h = histogram(da, bins=[bins])
display(h)
h.plot()
<xarray.DataArray 'histogram_foo' (foo_bin: 19)>
array([  1,   3,  10,  21,  62, 125, 233, 372, 423, 520, 463, 342, 242,
       109,  44,  24,   4,   1,   1])
Coordinates:
  * foo_bin  (foo_bin) float64 -3.789 -3.368 -2.947 -2.526 ... 2.947 3.368 3.789
[2]:
[<matplotlib.lines.Line2D at 0x7f1a8fb525e0>]
_images/tutorial_4_2.png

TODO: - Bins needs to be a list; this is annoying, would be good to accept single items - The foo_bin coordinate is the estimated bin center, not the bounds. We need to add the bounds to the coordinates, but we can as long as we are returning DataArray and not Dataset.

Both of the above need GitHub Issues

Histogram over a single axis

[3]:
h_x = histogram(da, bins=[bins], dim=['time'])
h_x.plot()
[3]:
<matplotlib.collections.QuadMesh at 0x7f1a8faf4610>
_images/tutorial_6_1.png

TODO: - Relax / explain requirement that dims is always a list

[4]:
h_x.mean(dim='x').plot()
[4]:
[<matplotlib.lines.Line2D at 0x7f1a8fa28b50>]
_images/tutorial_8_1.png

Weighted Histogram

Weights can be the same shape as the input:

[5]:
weights = 0.4 * xr.ones_like(da)
histogram(da, bins=[bins], weights=weights)
[5]:
<xarray.DataArray 'histogram_foo' (foo_bin: 19)>
array([  0.4,   1.2,   4. ,   8.4,  24.8,  50. ,  93.2, 148.8, 169.2,
       208. , 185.2, 136.8,  96.8,  43.6,  17.6,   9.6,   1.6,   0.4,
         0.4])
Coordinates:
  * foo_bin  (foo_bin) float64 -3.789 -3.368 -2.947 -2.526 ... 2.947 3.368 3.789

Or can use Xarray broadcasting:

[6]:
weights = 0.2 * xr.ones_like(da.x)
histogram(da, bins=[bins], weights=weights)
[6]:
<xarray.DataArray 'histogram_foo' (foo_bin: 19)>
array([  0.2,   0.6,   2. ,   4.2,  12.4,  25. ,  46.6,  74.4,  84.6,
       104. ,  92.6,  68.4,  48.4,  21.8,   8.8,   4.8,   0.8,   0.2,
         0.2])
Coordinates:
  * foo_bin  (foo_bin) float64 -3.789 -3.368 -2.947 -2.526 ... 2.947 3.368 3.789

2D Histogram

Now let’s say we have multiple input arrays. We can calculate their joint distribution:

[7]:
db = xr.DataArray(np.random.randn(nt, nx), dims=['time', 'x'],
                  name='bar') - 2

histogram(da, db, bins=[bins, bins]).plot()
[7]:
<matplotlib.collections.QuadMesh at 0x7f1a8f9a7b20>
_images/tutorial_14_1.png

Real Data Example

Ocean Volume Census in TS Space

Here we show how to use xhistogram to do a volume census of the ocean in Temperature-Salinity Space

First we open the World Ocean Atlas dataset from the opendap dataset (http://apdrc.soest.hawaii.edu/dods/public_data/WOA/WOA13/1_deg/annual).

Here we read the annual mean Temparature, Salinity and Oxygen on a 5 degree grid.

[8]:
# Read WOA using opendap
Temp_url = 'http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA13/5_deg/annual/temp'
Salt_url = 'http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA13/5_deg/annual/salt'
Oxy_url = 'http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA13/5_deg/annual/doxy'

ds = xr.merge([
    xr.open_dataset(Temp_url).tmn.load(),
    xr.open_dataset(Salt_url).smn.load(),
    xr.open_dataset(Oxy_url).omn.load()])
ds
/home/docs/checkouts/readthedocs.org/user_builds/xhistogram/conda/latest/lib/python3.9/site-packages/xarray/coding/times.py:119: SerializationWarning: Ambiguous reference date string: 1-1-1 00:00:0.0. The first value is assumed to be the year hence will be padded with zeros to remove the ambiguity (the padded reference date string is: 0001-1-1 00:00:0.0). To remove this message, remove the ambiguity by padding your reference date strings with zeros.
  warnings.warn(warning_msg, SerializationWarning)
/home/docs/checkouts/readthedocs.org/user_builds/xhistogram/conda/latest/lib/python3.9/site-packages/xarray/coding/times.py:527: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
  dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
/home/docs/checkouts/readthedocs.org/user_builds/xhistogram/conda/latest/lib/python3.9/site-packages/numpy/core/_asarray.py:102: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
  return array(a, dtype, copy=False, order=order)
[8]:
<xarray.Dataset>
Dimensions:  (lat: 36, lev: 102, lon: 72, time: 1)
Coordinates:
  * time     (time) object -001-01-15 00:00:00
  * lev      (lev) float64 0.0 5.0 10.0 15.0 ... 5.2e+03 5.3e+03 5.4e+03 5.5e+03
  * lat      (lat) float64 -87.5 -82.5 -77.5 -72.5 -67.5 ... 72.5 77.5 82.5 87.5
  * lon      (lon) float64 -177.5 -172.5 -167.5 -162.5 ... 167.5 172.5 177.5
Data variables:
    tmn      (time, lev, lat, lon) float32 nan nan nan nan ... nan nan nan nan
    smn      (time, lev, lat, lon) float32 nan nan nan nan ... nan nan nan nan
    omn      (time, lev, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes:
    long_name:  statistical mean sea water temperature [degc] 

Use histogram to bin data points. Use canonical ocean T/S ranges, and bin size of \(0.1^0C\), and \(0.025psu\). Similar ranges and bin size as this review paper on Mode Waters: https://doi.org/10.1016/B978-0-12-391851-2.00009-X .

[9]:
sbins = np.arange(31,38, 0.025)
tbins = np.arange(-2, 32, 0.1)
[10]:
# histogram of number of data points
# histogram of number of data points
hTS = histogram(ds.smn, ds.tmn, bins=[sbins, tbins])
np.log10(hTS.T).plot(levels=31)
/home/docs/checkouts/readthedocs.org/user_builds/xhistogram/conda/latest/lib/python3.9/site-packages/xarray/core/computation.py:739: RuntimeWarning: divide by zero encountered in log10
  result_data = func(*input_data)
[10]:
<matplotlib.collections.QuadMesh at 0x7f1a8f650310>
_images/tutorial_20_2.png

However, we would like to do a volume census, which requires the data points to be weighted by volume of the grid box.

\begin{equation} dV = dz*dx*dy \end{equation}

[11]:
# histogram of number of data points weighted by volume resolution
# Note that depth is a non-uniform axis

# Create a dz variable
dz = np.diff(ds.lev)
dz =np.insert(dz, 0, dz[0])
dz = xr.DataArray(dz, coords= {'lev':ds.lev}, dims='lev')

# weight by volume of grid cell (resolution = 5degree, 1degree=110km)
dVol = dz * (5*110e3) * (5*110e3*np.cos(ds.lat*np.pi/180))

# Note: The weights are automatically broadcast to the right size
hTSw = histogram(ds.smn, ds.tmn, bins=[sbins, tbins], weights=dVol)
np.log10(hTSw.T).plot(levels=31, vmin=11.5, vmax=16, cmap='brg')
[11]:
<matplotlib.collections.QuadMesh at 0x7f1a8f59a250>
_images/tutorial_22_1.png

The ridges of this above plot are indicative of T/S classes with a lot of volume, and some of them are indicative of Mode Waters (example Eighteen Degree water with T\(\sim18^oC\), and S\(\sim36.5psu\).

Averaging a variable

Next we calculate the mean oxygen value in each TS bin.

\begin{equation} \overline{A} (m,n) = \frac{\sum_{T(x,y,z)=m, S(x,y,z)=n} (A(x,y,z) dV)}{\sum_{T(x,y,z)=m, S(x,y,z)=n}dV}. \end{equation}

[12]:
hTSO2 = (histogram(ds.smn.where(~np.isnan(ds.omn)),
                   ds.tmn.where(~np.isnan(ds.omn)),
                   bins=[sbins, tbins],
                   weights=ds.omn.where(~np.isnan(ds.omn))*dVol)/
                histogram(ds.smn.where(~np.isnan(ds.omn)),
                          ds.tmn.where(~np.isnan(ds.omn)),
                          bins=[sbins, tbins],
                          weights=dVol))

(hTSO2.T).plot(vmin=1, vmax=8)
[12]:
<matplotlib.collections.QuadMesh at 0x7f1a8f4df130>
_images/tutorial_25_1.png

Some interesting patterns in average oxygen emerge. Convectively ventilated cold water have the highest oxygen and mode waters have relatively high oxygen. Oxygen minimum zones are interspersed in the middle of volumetic ridges (high volume waters).

NOTE: NaNs in weights will make the weighted sum as nan. To avoid this, call .fillna(0.) on your weights input data before calling histogram().

Dask Integration

Should just work, but need examples.