A random walk through NetCDF

or, Maybe RSEs just really like a spec

Jack Atkinson

Principal Research Software Engineer
ICCS - University of Cambridge

2026-04-23

Precursors

Slides and Materials

To access links or follow along on your own device, these slides can be found at:
jackatkinson.net/slides

Code used in demonstrations is available at:
/jatkinson1000/NetCDF-examples

Licensing

Except where otherwise noted, these presentation materials are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Vectors and icons by SVG Repo under CC0(1.0) or FontAwesome under SIL OFL 1.1

FAIR

FAIR data and software is increasingly important:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

BSBR by the SSI under fair use

What is NetCDF?

  • A way to store scientific data.
  • A file format to store array-oriented data.
  • A self-describing file format for (scientific) data.
  • A binary file format for storing structured data in a portable machine-independent way.
  • An open international standard for encoding (geospatial) data.
  • Network Common Data Form.

Other notes:

  • Developed by UCAR (University Corporation for Atmospheric Research)

What is in a NetCDF?

A file containing:

  • Dimensions
  • Variables
  • Attributes

UML Block diagram from the NetCDF Documentation

What is in a NetCDF?

A file

netcdf basic_dataset {
}

What is in a NetCDF?

A file containing:

  • Dimensions
netcdf basic_dataset {

dimensions:
    lon = 2 ;
    lat = 5 ;

}

What is in a NetCDF?

A file containing:

  • Dimensions
  • Variables
netcdf basic_dataset {

dimensions:
    lon = 2 ;
    lat = 5 ;

variables:
    float temperature(lon, lat) ;

}

What is in a NetCDF?

A file containing:

  • Dimensions
  • Variables
    • With data
    • Row-Major (last dimension fastest)!
    • Auto-filled
netcdf basic_dataset {

dimensions:
    lon = 2 ;
    lat = 5 ;

variables:
    float temperature(lon, lat) ;

data:

 temperature =
  1.1, 2.2, 3.3, 4.4, 5.5,
  6.6, 7.7, 8.8, 9.9, 10 ;

}
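
A minimal Python sketch of what row-major means for the data above: the flat list in the data section is indexed as i_lon * n_lat + i_lat, so lat varies fastest. The helper name temperature_at is illustrative and not part of any NetCDF API.

```python
# Values exactly as listed in the CDL data section, in storage order.
flat = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.0]
n_lon, n_lat = 2, 5  # dimension lengths from the CDL


def temperature_at(i_lon, i_lat):
    """Look up temperature(lon, lat) in row-major (last dimension fastest) storage."""
    return flat[i_lon * n_lat + i_lat]


# Reshape the flat list into one row per lon index.
rows = [flat[i * n_lat:(i + 1) * n_lat] for i in range(n_lon)]

print(temperature_at(1, 0))
print(rows)
```

Each line of the CDL data block corresponds to one value of lon, which is why the second line starts at 6.6.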

What is in a NetCDF?

A file containing:

  • Dimensions
  • Variables
  • Attributes
    • type can be inferred
    • can be global or variable scoped
netcdf basic_dataset {

:title = "A simple example NetCDF dataset" ;

dimensions:
    lon = 2 ;
    lat = 5 ;

variables:
    float temperature(lon, lat) ;
        temperature:standard_name = "air_temperature" ;
        temperature:units = "K" ;

data:

 temperature =
  1.1, 2.2, 3.3, 4.4, 5.5,
  6.6, 7.7, 8.8, 9.9, 10 ;

}

What is in a NetCDF?

A file containing:

  • Dimensions
  • Variables
  • Attributes
    • Can have a length
  • Coordinates??
    • Convention, not spec!
    • A common source of confusion.
netcdf basic_dataset {

:title = "A simple example NetCDF dataset" ;

dimensions:
    lon = 2 ;
    lat = 5 ;

variables:
    float temperature(lon, lat) ;
        temperature:standard_name = "air_temperature" ;
        temperature:units = "K" ;

    float lon(lon) ;
        lon:standard_name = "longitude" ;
        lon:units = "degree_east" ;
        float lon:valid_range = 0, 360 ;

data:

  temperature =
    1.1, 2.2, 3.3, 4.4, 5.5,
    6.6, 7.7, 8.8, 9.9, 10 ;

  lon = 0, 5 ;

}

What is in a NetCDF?

A file containing:

  • Dimensions
    • UNLIMITED
    • Allows growth of the dataset
    • First dimension
  • Variables
  • Attributes
  • Coordinates??
netcdf basic_dataset {

:title = "A simple example NetCDF dataset" ;

dimensions:
    time = UNLIMITED ;
    lon = 2 ;
    lat = 5 ;

variables:
    float temperature(time, lon, lat) ;
        temperature:standard_name = "air_temperature" ;
        temperature:units = "K" ;

    float lon(lon) ;
        lon:standard_name = "longitude" ;
        lon:units = "degree_east" ;
        float lon:valid_range = 0, 360 ;

data:

  temperature =
    1.1, 2.2, 3.3, 4.4, 5.5,
    6.6, 7.7, 8.8, 9.9, 10 ;

  lon = 0, 5 ;
}

What is in a NetCDF?

Data Types:

  • double – IEEE 64-bit float
  • float – IEEE 32-bit float, also real
  • int – 32-bit signed integer, also long
  • short – 16-bit signed integer
  • byte – 8-bit signed integer
  • char – Characters
netcdf netcdf_types {

dimensions:
    x = 5 ;

variables:
    float var(x) ;
    double var:var_double = 1.0 ;
    float var:var_float = 10.0 ;
    int var:var_int = -10 ;
    byte var:var_byte = 0, 127, 128, 255, 256 ;
    char var:var_char = "a" ;
}
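
A rough Python sketch of the sizes and ranges behind these types, using the standard struct module. The mapping from NetCDF type names to struct format codes is an assumption made for illustration. It only shows which of the var_byte values fit in a signed 8-bit integer, not what ncgen does with the ones that do not.

```python
import struct

# Assumed correspondence between classic NetCDF types and struct codes.
# ">" selects big-endian with standard sizes.
NETCDF_TYPES = {
    "double": ">d",
    "float": ">f",
    "int": ">i",
    "short": ">h",
    "byte": ">b",
    "char": ">c",
}

# Size in bytes of each type.
sizes = {name: struct.calcsize(code) for name, code in NETCDF_TYPES.items()}


def fits_in_byte(value):
    """Return True if value is representable as a signed 8-bit integer."""
    try:
        struct.pack(">b", value)
    except struct.error:
        return False
    return True


# The values given to var_byte in the CDL above.
in_range = [fits_in_byte(v) for v in (0, 127, 128, 255, 256)]

print(sizes)
print(in_range)
```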

Interrogation of NetCDF

ncview

A useful and reasonably powerful utility to quickly visualise netcdf datasets.


Installable from source or various package managers (apt, brew, spack).

Usable from remote machines with X-forwarding.


On CSD3 (Thank you to Kacper):

module purge
module load rhel8/cclake/base
module load ncview
ncview mynetcdffile.nc


**demo**

Inspecting NetCDF Files

We can take a look at the binary file data as an octal dump using od to get some idea of how it is packaged.

od myfile.nc


Hints can be gleaned by adding flags:

  • -b for bytes
  • -s for shorts
  • -c for chars
netcdf tiny {
dimensions:
        dim = 5;
variables:
        short var(dim);
data:
        var = 3, 1, 4, 1, 5 ;
}
0000000   103 104 106 001 000 000 000 000 000 000 000 012 000 000 000 001
           C   D   F 001  \0  \0  \0  \0  \0  \0  \0  \n  \0  \0  \0 001
            17475     326       0       0       0    2560       0     256

0000020   000 000 000 003 144 151 155 000 000 000 000 005 000 000 000 000
          \0  \0  \0 003   d   i   m  \0  \0  \0  \0 005  \0  \0  \0  \0
                0     768   26980     109       0    1280       0       0

0000040   000 000 000 000 000 000 000 013 000 000 000 001 000 000 000 003
          \0  \0  \0  \0  \0  \0  \0  \v  \0  \0  \0 001  \0  \0  \0 003
                0       0       0    2816       0     256       0     768

0000060   166 141 162 000 000 000 000 001 000 000 000 000 000 000 000 000
           v   a   r  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0  \0
            24950     114       0     256       0       0       0       0

0000100   000 000 000 000 000 000 000 003 000 000 000 014 000 000 000 120
          \0  \0  \0  \0  \0  \0  \0 003  \0  \0  \0  \f  \0  \0  \0   P
                0       0       0     768       0    3072       0   20480

0000120   000 003 000 001 000 004 000 001 000 005 200 001
          \0 003  \0 001  \0 004  \0 001  \0 005 200 001
              768     256    1024     256    1280     384

0000134
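
The dump above can be decoded by hand. A minimal Python sketch, assuming the classic header layout (magic, record count, dimension list, global attribute list, variable list) and with byte offsets read off this particular file rather than parsed generally:

```python
import struct

# Octal bytes copied from the `od -b` rows above (addresses stripped).
OCTAL_DUMP = """
103 104 106 001 000 000 000 000 000 000 000 012 000 000 000 001
000 000 000 003 144 151 155 000 000 000 000 005 000 000 000 000
000 000 000 000 000 000 000 013 000 000 000 001 000 000 000 003
166 141 162 000 000 000 000 001 000 000 000 000 000 000 000 000
000 000 000 000 000 000 000 003 000 000 000 014 000 000 000 120
000 003 000 001 000 004 000 001 000 005 200 001
"""
raw = bytes(int(token, 8) for token in OCTAL_DUMP.split())

# First four bytes: "CDF" followed by the format version.
magic, version = raw[:3], raw[3]

# Offsets below are specific to this tiny file, found by reading the dump.
(dim_length,) = struct.unpack(">i", raw[24:28])  # length of "dim"
(vsize,) = struct.unpack(">i", raw[72:76])       # bytes of data for "var"
(begin,) = struct.unpack(">i", raw[76:80])       # file offset of the data

# The data are big-endian 16-bit integers, padded to a 4-byte boundary.
values = struct.unpack(">6h", raw[begin:begin + vsize])

# Reading the same bytes as little-endian shorts reproduces the od -s row.
as_od_shorts = struct.unpack("<6H", raw[begin:begin + vsize])

print(magic, version, dim_length, begin, vsize)
print(values)
print(as_od_shorts)
```

The five values 3, 1, 4, 1, 5 are followed by one padding value, -32767, which is the auto-fill mentioned earlier. The odd numbers in the od -s rows arise if od ran on a little-endian machine, since the file itself is big-endian.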

Inspecting NetCDF Files

Perhaps the most useful tool when working with NetCDF files is ncdump, which ships with NetCDF.


It allows us to “dump” out a representation of the data in the file for inspection.

ncdump myfile.nc

# Headers only
ncdump -h myfile.nc

# Specific variable(s)
ncdump -v temperature,pressure myfile.nc

# Annotate data with brief index comments (C or Fortran ordering)
ncdump -b [c|f] myfile.nc

# Set display precision 
ncdump -p 2 myfile.nc

CDL Syntax

The natural question to ask is, is this reversible?

Yes (mostly)

The “dump” representation of NetCDF data is a language itself:
CDL - The NetCDF Common Data Language


Comes complete with its own language specification


The basis of much knowledge, and a constant reference in preparing this talk!


The “mostly” comes from CDL comments // that are not preserved.

Building netcdf files

So how do we actually go the other direction?

ncgen

Another useful utility that comes bundled with NetCDF


This can be particularly useful for simple version-controlled test files.

# Create a binary
ncgen -b mycdlfile.cdl

# Create a named output file
ncgen -o myncfile.nc mycdlfile.cdl

NetCDF Extended

The NetCDF Extended model

Everything so far has been what is called the NetCDF Classic model.


Here, packaging data into a binary representation is handled entirely by NetCDF. The Classic model runs up to (and including) NetCDF 3.


NetCDF 4 introduced a new underlying format building on HDF5.

A NetCDF4 file is a valid HDF5 file that uses a subset of HDF5's features. We can verify this with:

h5dump mynetcdf4file.nc

This allows:

  • Larger file sizes
  • Multiple unlimited dimensions
  • New data types
  • New data structures

The NetCDF Extended model

UML Block diagram from the NetCDF Documentation

string

Variable-length array of UTF-8 encoded Unicode characters

Does what unicode did for everyone everywhere.

netcdf languages {

dimensions:
    n_lang = 5 ;

variables:
    string languages(n_lang) ;
    string phrase(n_lang) ;

data:

  languages = "English", "Ogham", "Welsh", "Anglo-Saxon", "Braille" ;

  phrase = 
    "Hello World!",
    "/ ᚛᚛ᚉᚑᚅᚔᚉᚉᚔᚋ ᚔᚈᚔ ᚍᚂᚐᚅᚑ ᚅᚔᚋᚌᚓᚅᚐ᚜",
    "Dw i'n gallu bwyta gwydr, 'dyw e ddim yn gwneud dolur i mi.",
    "/ ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬",
    "/ ⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑" ;
}

enum

Store integer values that can be converted to a string:

netcdf clouds {
types:
  ubyte enum cloud_class_t {Clear = 0, Cumulonimbus = 1, Stratus = 2, 
      Stratocumulus = 3, Cumulus = 4, Altostratus = 5, Nimbostratus = 6, 
      Altocumulus = 7, Cirrostratus = 8, Cirrocumulus = 9, Cirrus = 10, 
      Missing = 255} ;

dimensions:
    station = 5 ;

variables:
    cloud_class_t primary_cloud(station) ;

data:

 primary_cloud = Clear, Stratus, Clear, Cumulonimbus, Cirrus ;
}
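
A small Python analogue of the idea, using the standard enum module: only the integers are stored, and the names are recovered through the type definition. The class name CloudClass is illustrative, and the stored bytes are what the data section above would be expected to hold given the enum values.

```python
from enum import IntEnum


class CloudClass(IntEnum):
    """Python mirror of the cloud_class_t enum from the CDL."""
    Clear = 0
    Cumulonimbus = 1
    Stratus = 2
    Stratocumulus = 3
    Cumulus = 4
    Altostratus = 5
    Nimbostratus = 6
    Altocumulus = 7
    Cirrostratus = 8
    Cirrocumulus = 9
    Cirrus = 10
    Missing = 255


# One unsigned byte per station: Clear, Stratus, Clear, Cumulonimbus, Cirrus.
stored = bytes([0, 2, 0, 1, 10])

# Convert the stored integers back to their names.
names = [CloudClass(value).name for value in stored]
print(names)
```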

enum

Store integer values that can be converted to a string:

netcdf irish_rover {
types:
  uint64 enum cargo {bags\ of\ the\ best\ Sligo\ rags = 1000000,
      barrels\ of\ bones = 2000000,
      bales\ of\ old\ nanny\ goats\'\ tails = 3000000,
      barrels\ of\ stones = 4000000, dogs = 5000000, hogs = 6000000,
      barrels\ of\ porter = 7000000,
      sides\ of\ old\ blind\ horses\ hides = 8000000} ;
dimensions:
        dim = 4 ;
variables:
        cargo in_the_hold_of_the_Irish_Rover(dim) ;
data:

 in_the_hold_of_the_Irish_Rover = bags\ of\ the\ best\ Sligo\ rags ,
        barrels\ of\ bones, hogs, dogs ;
}

vlen

Arrays of a type with variable length allowed.

Denoted by enclosing {}

netcdf vlens {
types:
  int(*) collection_of_ints ;

dimensions:
    m = 4 ;

variables:
    collection_of_ints ragged_array(m) ;

data:

 ragged_array = {10, 11, 12, 13, 14}, {20, 21, 22, 23}, {30, 31, 32}, 
    {40, 41} ;
}

Compounds

Like C structs.

Combination of other types.

netcdf compounds {

types:
  compound observation_t {
    int day ;
    char mnth(3) ;
    string bearing ;
    float miles ;
  };

dimensions:
    n = 3 ;

variables:
    observation_t obs(n) ;

data:

 obs = {12, {"jan"}, "NW", 12.2}, 
    {14, {"jan"}, "N", 14}, 
    {15, {"mar"}, "S", 2} ;
}

opaque

Raw data represented as a hex string (0x....)

The integer length in the type declaration gives the number of bytes per blob (two hex digits each). Recall our experiments with od.

Let’s not go to opaque, it’s a silly place.

netcdf opaque_data {

types:
  opaque(11) raw_data_t ;

dimensions:
    time = 4 ;

variables:
    raw_data_t raw_obs(time) ;

data:

 raw_obs = 0X0102030405060708090A0B, 0XAABBCCDDEEFFEEDDCCBBAA, 
    0XFFFFFFFFFFFFFFFFFFFFFF, 0XCF0DEFACED0CAFE0FACADE ;
}
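
A quick Python check of the length rule, as a sketch: each 0X literal holds two hex digits per byte, so opaque(11) needs 22 digits. The helper name blob_bytes is illustrative.

```python
# Three of the literals from the data section above.
literals = [
    "0X0102030405060708090A0B",
    "0XAABBCCDDEEFFEEDDCCBBAA",
    "0XCF0DEFACED0CAFE0FACADE",
]


def blob_bytes(literal):
    """Convert a CDL-style 0X... literal into raw bytes."""
    return bytes.fromhex(literal[2:])  # drop the leading "0X"


lengths = [len(blob_bytes(lit)) for lit in literals]
first = list(blob_bytes(literals[0]))

print(lengths)
print(first)
```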

groups

Organise your data like a Unix filesystem with the power of recursion!


Refer to the contents of other groups!


try:

ncdump -v /grp1/var groups.nc
netcdf groups {

dimensions:
    dim = 4 ;
variables:
    float var(dim) ;

data:
  var = 1, 2, 3, 4 ;

group: grp1 {
  dimensions:
    dim = 2 ;
  variables:
    float var(dim) ;

  data:
    var = -1, -2 ;
  } // group grp1

group: grp2 {
  dimensions:
    dim = 2 ;
  variables:
    float var(/grp1/dim, /dim) ;

  data:
     var = 5, 6, 7, 8,
           -1, -2, -3, -4 ;
  } // group grp2
}
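
To see how the absolute paths resolve, here is a toy Python model of the group tree above. It uses plain dictionaries, not a NetCDF library, and the function name resolve_dim is illustrative.

```python
# Dimensions of the root group and its two child groups, as in the CDL.
root = {
    "dims": {"dim": 4},
    "groups": {
        "grp1": {"dims": {"dim": 2}, "groups": {}},
        "grp2": {"dims": {"dim": 2}, "groups": {}},
    },
}


def resolve_dim(tree, path):
    """Resolve an absolute path such as /grp1/dim to a dimension length."""
    *group_names, dim_name = path.strip("/").split("/")
    node = tree
    for name in group_names:
        node = node["groups"][name]
    return node["dims"][dim_name]


# Shape of var in grp2, declared as var(/grp1/dim, /dim).
shape = (resolve_dim(root, "/grp1/dim"), resolve_dim(root, "/dim"))
print(shape)
```

The resulting shape of 2 by 4 matches the eight values given for var in grp2.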

ncdump and ncgen revisited

ncdump:

# What kind of NetCDF file are we dealing with?
ncdump -k myfile.nc

# File information and metadata (for NetCDF4)
ncdump -s myfile.nc

ncgen:

# Generate a classic NetCDF file
ncgen -b myfile.cdl
ncgen -k1 myfile.cdl

# Generate a NetCDF4 file
ncgen -k3 myfile.cdl

My contribution to NetCDF

And what I love about RSE.

  • Issue
    • The spec allows unsigned suffix in any order: 4us or 4su
    • ncgen only parses u first e.g. 4us
  • Pull Request
  • A lot learnt about parsing (yacc) and lexers (lex)
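
The gist of the issue as a Python sketch. This is not how ncgen's lexer is written, only an illustration of accepting both suffix orders for an unsigned short. The names LITERAL and parse_ushort are made up for this example.

```python
import re

# Digits followed by the suffix letters u and s in either order.
LITERAL = re.compile(r"^(?P<digits>\d+)(?P<suffix>us|su)$", re.IGNORECASE)


def parse_ushort(token):
    """Parse an unsigned short literal such as 4us or 4su."""
    match = LITERAL.match(token)
    if match is None:
        raise ValueError(f"not an unsigned short literal: {token!r}")
    value = int(match["digits"])
    if not 0 <= value <= 65535:
        raise ValueError(f"out of range for an unsigned short: {value}")
    return value


print(parse_ushort("4us"), parse_ushort("4su"))
```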

The CF-Conventions

The CF-Conventions

  • cfconventions.org/
  • Latest Specification
  • Metadata conventions for NetCDF datasets
    • Facilitates interoperability
    • Aids downstream tooling
  • Mostly through NetCDF attributes
    • not invasive to the data
  • Also sets out standard formats for more unusual data
    • e.g. Trajectories

The CF-Conventions

Figure I.1 from the CF-Conventions

Units and Standard Names

If you use nothing else, use these!

  • Add a standard_name
    • So everyone can agree what it is
  • Add the units of the quantity
    • I don’t need to explain this to you
    • Tied to standard_name (must be convertible to its canonical units), not free to choose!
  • Consult the standard name table
  • long_name is optional and can be descriptive
    • E.g. plot labels
netcdf units_and_names {

...

variables:

    float psl(lat,lon) ;
        psl:standard_name = "air_pressure_at_sea_level" ;
        psl:units = "hPa" ;
        psl:long_name = "mean sea level pressure" ;

...

}

Calendar bonus

“Timezones are hard” - C. Edsall (2023)

  • Always include units conforming to UDUNITS
  • Advisable to include calendar
    • Without it, intercomparison becomes tricky
    • 360 or 365 days?
    • leap years?
netcdf time {

...

variables:

    double time(time) ;
        time:standard_name = "time" ;
        time:units = "days since 1990-1-1 0:0:0" ;
        time:calendar = "proleptic_gregorian" ;

...

}
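
Why the calendar matters, as a small Python sketch. The same time value is decoded against the units above in two calendars. Python's datetime follows the proleptic Gregorian calendar. The 360-day decoder is a simplified illustration for non-negative offsets only, and both function names are made up for this example.

```python
from datetime import datetime, timedelta

# Reference date from the units attribute: "days since 1990-1-1 0:0:0".
EPOCH = (1990, 1, 1)


def decode_gregorian(days):
    """Decode a time value using the proleptic Gregorian calendar."""
    return datetime(*EPOCH) + timedelta(days=days)


def decode_360_day(days):
    """Decode a non-negative time value using 12 months of 30 days."""
    whole = int(days)
    year = EPOCH[0] + whole // 360
    month = (whole % 360) // 30 + 1
    day = whole % 30 + 1
    hours = (days - whole) * 24
    return (year, month, day, hours)


print(decode_gregorian(59.5))
print(decode_360_day(59.5))
```

The value 59.5 lands on 1 March in one calendar and on 30 February in the other, a date that does not exist in the Gregorian calendar.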

Ancillary Data

  • A variable providing information about another variable
  • E.g.
    • observation quality
    • Error measurement
netcdf ancillary {

...

variables:

    float u(time, z);
        u:standard_name = "wind_speed";
        u:units = "m s-1";
        u:long_name = "Windspeed measured during radiosonde ascent";
        u:ancillary_variables = "u_qc";

    int u_qc(time, z);
        u_qc:standard_name = "quality_flag";
        u_qc:long_name = "Windspeed observation quality flag";

data:

    u = 12.0, 12.3, 12.2, 1000.0, 12.4 ... ;
    
    u_qc = 1, 1, 1, 0, 1 ... ;

}
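
How downstream code might use the flag, sketched in Python with only the five values shown above. The meaning of the flag (1 for good, 0 for suspect) is an assumption inferred from the 1000.0 outlier and is not stated in the file.

```python
# The first five values of each variable from the data section.
u = [12.0, 12.3, 12.2, 1000.0, 12.4]
u_qc = [1, 1, 1, 0, 1]

GOOD = 1  # assumed meaning of the quality flag

# Keep only the observations flagged as good.
good = [speed for speed, flag in zip(u, u_qc) if flag == GOOD]

mean_all = sum(u) / len(u)
mean_good = sum(good) / len(good)

print(good)
print(round(mean_all, 2), round(mean_good, 2))
```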

Auxiliary coordinates

  • When there is a second (or more) set of coordinates not matching the dimension coordinates
  • Indicated through the coordinates attribute
  • Canonical examples:
    • vertical coordinate: pressure level, geopotential, sigma, …
    • Geodetic to cartesian to grid
netcdf auxiliary {

dimensions:
    plev = 2 , time = UNLIMITED ;

...

variables:

    float u(time, plev) ;
        u:coordinates = "sigma z" ;
        ...

    int plev(plev) ;
        plev:standard_name = "model_level_number" ;
        plev:units = "1" ;
        plev:long_name = "model level from top of atmosphere" ;
        plev:positive = "down" ;

    float sigma(plev) ;
        sigma:standard_name = "atmosphere_sigma_coordinate" ;
        sigma:units = "1" ;
        sigma:positive = "down" ;

    float z(plev) ;
        z:standard_name = "geopotential_height" ;
        z:units = "m" ;
        z:positive = "up" ;

...

}

Closing

ncoldump

But Jack, how did you end up this deep in the netcdf-spec in the first place?


I present to you: Tree Sitter CDL


The result of a past learning and development week:
“how hard could it be to write a tree-sitter grammar?”.


Motivated by the difficulty reading files, as seen in this talk, we can now run ncoldump!

Also usable in text-editors.

Beyond today

  • Parallel NetCDF
  • Interfaces (Python, Fortran, C)
  • cf-python

Thanks for Listening

References

Barker, Michelle, Neil P Chue Hong, Daniel S Katz, Anna-Lena Lamprecht, Carlos Martinez-Ortiz, Fotis Psomopoulos, Jennifer Harrow, et al. 2022. “Introducing the FAIR Principles for Research Software.” Scientific Data 9 (1): 622. https://doi.org/10.1038/s41597-022-01710-x.
Wilkinson, Mark D, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 1–9. https://doi.org/10.1038/sdata.2016.18.