Powering up python

Software skills to enhance research code.

Jack Atkinson

Senior Research Software Engineer
ICCS - University of Cambridge

2024-04-05

Precursors

Slides and Materials

To access links or follow on your own device these slides can be found at:
https://jackatkinson.net/slides


All materials are available at:

Licensing

Except where otherwise noted, these presentation materials are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Vectors and icons by SVG Repo used under CC0(1.0)

Precursors

  • Be nice (Python code of conduct)
  • Please ask questions whenever they arise.
    • Someone else is probably wondering the same thing.
  • I will make mistakes.
    • Not all of them will be intentional.

whoami

Research background in fluid mechanics and atmosphere:

  • Numerics and fluid mechanics in Engineering,
  • Cloud microphysics & volcanic plumes in Geography,
  • Radiation belts and satellite data at BAS.

Now a Research Software Engineer (RSE) at the Institute of Computing for Climate Science (ICCS) working with various groups and projects.
I have a particular interest in climate model design and parameterisation.

This talk can be summarised as “things I wish I’d known sooner.”

What is Research Software?

Major Computational Programs

Data processing

Experiment support

Bathymetry by NOAA under public domain
CTD Bottles by WHOI under public domain
Keeling Curve by Scripps under public domain
Climate simulation by Los Alamos National Laboratory under CC BY-NC-ND
Dawn HPC by Joe Bishop with permission

Why does this matter?

More widely than publishing papers, code is used in control and decision making:


  • Weather forecasting
  • Climate policy
  • Disease modelling (e.g. Covid)
  • Satellites and spacecraft
  • Medical Equipment


Your code (or its derivatives) may well move from research to operational one day.

Margaret Hamilton and the Apollo XI by NASA under public domain

Why does this matter?

def calc_p(n,t):
    return n*1.380649e-23*t
data = np.genfromtxt("mydata.csv")
p = calc_p(data[0,:],data[1,:]+273.15)
print(np.sum(p)/len(p))

What does this code do?

import numpy as np

# Boltzmann constant [J K-1] and 0 degrees Celsius expressed in kelvin [K]
Kb = 1.380649e-23
T0 = 273.15

def calc_pres(n, t):
    """
    Calculate pressure using ideal gas law p = nkT

    Parameters:
        n : array of number densities of molecules [N m-3]
        t : array of temperatures in [K]
    Returns:
         array of pressures [Pa]
    """
    return n * Kb * t


# Read in data from file and convert T from [oC] to [K]
data = np.genfromtxt("mydata.csv")
n = data[0, :]
temp = data[1, :] + T0

# Calculate pressure, average, and print
pres = calc_pres(n, temp)
pres_av = np.sum(pres) / len(pres)
print(pres_av)

Virtual Environments

What?

  • A self-contained python environment
  • Packages installed in a local folder
  • Advised to use on a per-project basis

Why?

  • Avoid system pollution
  • Allow different versions
  • Reproducibility - set versions
$ python3 -m venv myvenv
$ source myvenv/bin/activate
(myvenv) $ pip install <packagename>
(myvenv) $ deactivate
$ 
PS> python -m venv myvenv
PS> myvenv\Scripts\Activate.ps1
(myvenv) PS> pip install <packagename>
(myvenv) PS> deactivate
PS> 


For more information see the Real Python article on environments.
For those using conda it also has environments, set up in a slightly different way.
Also consider uv.
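
With several environments around it is easy to lose track of which interpreter is running. A minimal sketch of a check from inside Python, relying on the fact that in an environment created with venv `sys.prefix` differs from `sys.base_prefix` (the helper name `in_virtualenv` is hypothetical, and conda environments may behave differently):

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtualenv()}")
```

Running this before and after `source myvenv/bin/activate` should show the interpreter path change.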

Exercise 1

Scenario: you have just finished some simulations with a climate model that should improve precipitation modelling and have the output data as a netCDF file.

You know that your colleague has produced relevant figures and analysis before, so you ask them for a copy of their code (yay, reuse!).

Exercise 1

Go to exercise 1 (exercises/01_base_code/) and:

  • Examine the code in precipitation_climatology.py
  • Set up a virtual environment
  • Install the necessary dependencies
    • Hint: There is a requirements.txt file in the root of the repo.
  • Run the code
    • does it do what you thought it would?

Code Formatting (PEP8)

Python PEPs

Python Enhancement Proposals

  • Technical documentation for the python community
  • Guidelines, standards, and best-practice

Relevant to us today are:

PEP8 & Formatting

By ensuring code aligns with PEP8 we:

  • standardise style,
  • conform to best-practices, and
  • improve code readability to
  • make code easier to share, and
  • reduce misinterpretation.



“Readability counts”
    - Tim Peters in the Zen of Python



“But I don’t have time to read and memorise all of this…”

PEP8 & Formatting - Black

Black (Langa 2020) - black.readthedocs.io

  • a PEP 8 compliant formatter
    • Strict subset of PEP8
    • “Opinionated so you don’t have to be.”
  • For full details see Black style
  • Try online
(myvenv) $ pip install black
(myvenv) $ black myfile.py
(myvenv) $ black mydirectory/
(myvenv) PS> pip install black
(myvenv) PS> black myfile.py
(myvenv) PS> black mydirectory/

PEP8 & Formatting - Black - Example

def long_func(x, param_one, param_two=[], param_three=24, param_four=None,
        param_five="Empty Report", param_six=123456):


    val = 12*16 +(24) -10*param_one +  param_six

    if x > 5:
        
        print("x is greater than 5")


    else:
        print("x is less than or equal to 5")


    if param_four:
        print(param_five)



    print('You have called long_func.')
    print("This function has several params.")

    param_2.append(x*val)
    return param_2
def long_func(
    x,
    param_one,
    param_two=[],
    param_three=24,
    param_four=None,
    param_five="Empty Report",
    param_six=123456,
):
    val = 12 * 16 + (24) - 10 * param_one + param_six

    if x > 5:
        print("x is greater than 5")

    else:
        print("x is less than or equal to 5")

    if param_four:
        print(param_five)

    print("You have called long_func.")
    print("This function has several params.")

    param_2.append(x * val)
    return param_2

PEP8 & Formatting - Black

  • I suggest incorporating into your projects now
    • Well-suited to incorporation into continuous integration or git hooks.
    • “write and run”
    • Gradually you’ll find yourself writing in the black style
    • Widely-used standard

Other notes:

  • An extension for jupyter notebooks exists:
    • pip install "black[jupyter]"
  • black rules are configurable if you prefer something slightly different.
  • An alternative tool is ruff

Exercise 2

Go to exercise 2 (exercises/02_formatting/) and:

  • install black
  • run black on precipitation_climatology.py
  • examine the output
    • Is it more readable?
    • Is there any aspect of the formatting style you find unintuitive?

PEP8 & Formatting - PyLint

Static Analysis

  • Check the code without running it
  • Catch issues before you run any code
  • Improve code quality

There are various tools available:

  • pycodestyle
  • flake8
  • Pylint
  • ruff
(myvenv) $ pip install pylint
(myvenv) $ pylint myfile.py
(myvenv) $ pylint mydirectory/
(myvenv) PS> pip install pylint
(myvenv) PS> pylint myfile.py
(myvenv) PS> pylint mydirectory/

PEP8 & Formatting - PyLint - Example

def long_func(
    x,
    param_one,
    param_two=[],
    param_three=24,
    param_four=None,
    param_five="Empty Report",
    param_six=123456,
):
    val = 12 * 16 + (24) - 10 * param_one + param_six

    if x > 5:
        print("x is greater than 5")

    else:
        print("x is less than or equal to 5")

    if param_four:
        print(param_five)

    print("You have called long_func.")
    print("This function has several params.")

    param_2.append(x * val)
    return param_2
(myvenv) $ pylint long_func.py
************* Module long_func
long_func.py:1:0: C0116: Missing function or method docstring (missing-function-docstring)
long_func.py:1:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
long_func.py:1:0: R0913: Too many arguments (7/5) (too-many-arguments)
long_func.py:24:4: E0602: Undefined variable 'param_2' (undefined-variable)
long_func.py:25:11: E0602: Undefined variable 'param_2' (undefined-variable)
long_func.py:4:4: W0613: Unused argument 'param_two' (unused-argument)
long_func.py:5:4: W0613: Unused argument 'param_three' (unused-argument)

------------------------------------------------------------------
Your code has been rated at 0.00/10

(myvenv) $

PEP8 & Formatting - PyLint - Example

def long_func(
    x,
    param_one,
    param_two=[],
    param_four=None,
    param_five="Empty Report",
    param_six=123456,
):
    val = 12 * 16 + (24) - 10 * param_one + param_six

    if x > 5:
        print("x is greater than 5")

    else:
        print("x is less than or equal to 5")

    if param_four:
        print(param_five)

    print("You have called long_func.")
    print("This function has several params.")

    param_two.append(x * val)
    return param_two
(myvenv) $ pylint long_func.py
************* Module long_func
long_func.py:1:0: C0114: Missing module docstring (missing-module-docstring)
long_func.py:1:0: C0116: Missing function or method docstring (missing-function-docstring)
long_func.py:1:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
long_func.py:1:0: R0913: Too many arguments (6/5) (too-many-arguments)

------------------------------------------------------------------
Your code has been rated at 6.36/10 (previous run: 0.00/10, +6.36)

(myvenv) $


Search the error code to understand the issue:
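
As an illustration of one of the messages above, W0102 (dangerous-default-value) flags `param_two=[]` because Python evaluates a default value once, when the function is defined, so the same list is shared between calls. A minimal sketch of the problem and the usual `None` fix, using hypothetical function names:

```python
def append_shared(value, items=[]):  # W0102: one default list for every call
    """Append value to items, reusing the same default list across calls."""
    items.append(value)
    return items


def append_fresh(value, items=None):
    """Append value to items, creating a new list when none is given."""
    if items is None:
        items = []
    items.append(value)
    return items


# Copy the results so later calls cannot alter what we recorded
first_shared = list(append_shared(1))
second_shared = list(append_shared(2))  # the default list still holds the 1

first_fresh = append_fresh(1)
second_fresh = append_fresh(2)  # a new list each time

print(first_shared, second_shared)
print(first_fresh, second_fresh)
```

The second call to `append_shared` returns both values, whereas `append_fresh` starts from an empty list each time.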

PEP8 & Formatting - PyLint - IDE Integration

Other notes:

  • Well-suited to incorporation into continuous integration or git hooks.
  • You can suppress warnings in code with: # pylint: disable=rule-name
  • Project-wide custom configuration is also possible.
  • ruff, mentioned before, also does linting.
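
A minimal sketch of an in-code suppression, assuming you have decided that the physics convention of an upper-case `B` for magnetic field strength is worth keeping despite Pylint's naming check (C0103, invalid-name). The function and constant are hypothetical examples; a trailing comment like this one applies only to the line it sits on:

```python
MU_0 = 1.25663706212e-6  # Vacuum permeability [N A-2]


def magnetic_pressure(B):  # pylint: disable=invalid-name
    """Return the magnetic pressure [Pa] for a field strength B [T]."""
    return B**2 / (2.0 * MU_0)


print(f"{magnetic_pressure(1.0e-5):.3e} Pa")
```

Suppressions are best kept narrow and rare, so that the remaining warnings stay meaningful.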

PEP8 & Formatting - PyLint - IDE Integration

  • Catch issues before running PyLint
  • Gradually coerces you to become a better programmer
  • Available on all good text editors and emacs:

Exercise 3

Go to exercise 3 (exercises/03_linting/) and:

  • install pylint
  • run pylint on precipitation_climatology.py
  • examine the report and try and address some of the issues.
    • Ignore missing docstrings and f-strings for now - we’ll come to them later.
    • Try and deal with: W0611 Unused imports, C0412 Ungrouped imports, W0102 Dangerous default
    • If you feel like it you could try and fix: W0621 Redefining name, W1514 Unexplicit open
    • Unless you are really keen don’t worry about: R0913 Too many arguments, C0103 Unconforming naming style.

Exercise 3

Extensions:

  • try and add linting to your preferred text editor or IDE
  • explore the option to suppress pylint warnings
  • explore the configuration options for pylint

Comments and Docstrings

Comments

Comments are tricky, and very much to taste.

Some thoughts:

“Programs must be written for people to read and […] machines to execute.”
  - Hal Abelson

“A bad comment is worse than no comment at all.”

“A comment is a lie waiting to happen.”

=> Comments have to be maintained, just like the code, and there is no way to check them!

Cat code comment image by 35_equal_W

Comments to avoid

  • Dead code e.g.

    # plt.plot(time, velocity, "ro")
    plt.plot(time, velocity, "kx")
    # plt.plot(time, acceleration, "kx")
    # plt.ylabel("acceleration")
    plt.ylabel("velocity")
  • Variable definitions e.g.

    # Set Force
    f = m * a
  • Redundant comments e.g. i += 1 # Increment i
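
For the variable-definition case, a descriptive name often removes the need for the comment altogether. A small sketch with made-up values:

```python
mass = 2.0  # [kg]
acceleration = 9.81  # [m s-2]

# Before: the comment has to explain what `f` is.
# Set Force
f = mass * acceleration

# After: the name carries the meaning, so no comment is needed.
force = mass * acceleration

print(f"force = {force:.2f} N")
```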

Comments - some thoughts

  • Comments should not duplicate the code.
  • Good comments do not excuse unclear code.
    • Comments should dispel confusion, not cause it.
    • If you can’t write a clear comment, there may be a problem with the code.
  • Explain unidiomatic code in comments.
  • Provide links to:
    • the original source of copied code.
    • external references where they will be most helpful.
  • Use comments to mark incomplete implementations.
  • Comments are not [user] documentation.
    • Read by developers, user documentation is for…

Docstrings

These are what make your code reusable (by you and others).

  • In python docstrings are designated at the start of ‘things’ using triple quotes: """...""".
  • PEP257 (Goodger and Rossum 2001) tells us what docstrings should say.
    Specific conventions tell us how they should say it.
  • Where comments describe how it works, docstrings describe how to use it.
    Unlike comments, docstrings follow a set format.

Various formatting options exist: numpy, Google, reST, etc.
We will use numpydoc as it is readable and widely used in scientific code.
Full guidance for numpydoc is available.

Docstrings

Key components:

  • A description of what the thing is.
  • A description of any inputs (Parameters).
  • A description of any outputs (Returns).

Consider also:

  • Extended summary
  • Errors raised
  • Usage examples
  • Key references
"""
Short one-line description.

Parameters
----------
name : type
    description of parameter

Returns
-------
name : type
    description of return value
"""

Docstrings

def calculate_gyroradius(mass, v_perp, charge, B, gamma=None):
    """
    Calculates the gyroradius of a charged particle in a magnetic field

    Parameters
    ----------
    mass : float
        The mass of the particle [kg]
    v_perp : float
        velocity perpendicular to magnetic field [m/s]
    charge : float
        particle charge [coulombs]
    B : float
        Magnetic field strength [teslas]
    gamma : float, optional
        Lorentz factor for relativistic case. default=None for non-relativistic case.

    Returns
    -------
    r_g : float
        Gyroradius of particle [m]

    Notes
    -----
    .. [1]  Walt, M, "Introduction to Geomagnetically Trapped Radiation,"
       Cambridge Atmospheric and Space Science Series, equation (2.4), 2005.
    """

    r_g = mass * v_perp / (abs(charge) * B)

    if gamma:
        r_g = r_g * gamma

    return r_g

Docstrings - pydocstyle

pydocstyle is a tool we can use to help ensure the quality of our docstrings.

(myvenv) $ pip install pydocstyle
(myvenv) $ pydocstyle myfile.py
(myvenv) $ pydocstyle mydirectory/
(myvenv) $
(myvenv) $
(myvenv) $ pydocstyle gyroradius.py
gyroradius.py:2 in public function `calculate_gyroradius`:
        D202: No blank lines allowed after function docstring (found 1)
gyroradius.py:2 in public function `calculate_gyroradius`:
        D400: First line should end with a period (not 'd')
gyroradius.py:2 in public function `calculate_gyroradius`:
        D401: First line should be in imperative mood (perhaps 'Calculate', not 'Calculates')
(myvenv) $

Note: pydocstyle does not catch missing variables in docstrings. This can be done with Pylint’s docparams and docstyle extensions but is left as an exercise to the reader.
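
A sketch of how the docstring could be revised to address the three messages above: the summary line uses the imperative mood (D401), ends with a period (D400), and no blank line follows the docstring (D202). The function body is unchanged, and the values in the final call are illustrative only.

```python
def calculate_gyroradius(mass, v_perp, charge, B, gamma=None):
    """
    Calculate the gyroradius of a charged particle in a magnetic field.

    Parameters
    ----------
    mass : float
        The mass of the particle [kg]
    v_perp : float
        velocity perpendicular to magnetic field [m/s]
    charge : float
        particle charge [coulombs]
    B : float
        Magnetic field strength [teslas]
    gamma : float, optional
        Lorentz factor for relativistic case. default=None for non-relativistic case.

    Returns
    -------
    r_g : float
        Gyroradius of particle [m]
    """
    r_g = mass * v_perp / (abs(charge) * B)

    if gamma:
        r_g = r_g * gamma

    return r_g


# Illustrative values: an electron moving at 1e6 m/s in a 1e-5 T field
print(calculate_gyroradius(9.1093837015e-31, 1.0e6, -1.602176634e-19, 1.0e-5))
```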

Exercise 4

Go to exercise 4 (exercises/04_docstrings_and_comments/) and examine the comments:

  • Is there any dead code?
    • How is it best to handle it?
  • Are comments used sensibly?
    • Are any redundant and better off being removed?
    • Is there anywhere that would benefit from a comment?

Docstrings:

  • Work through the file adding docstrings where they are missing.

Exercise 4

Extensions:

  • Install pydocstyle and use it to check the docstrings you have written.

Writing better (Python) code

f-strings

A better way to format strings since Python 3.6
Not catching on because of self-teaching from old code.

Strings are prepended with an f allowing variables to be used in-place:

name = "electron"
mass = 9.1093837015E-31

# modulo
print("The mass of an %s is %.3e kg." % (name, mass))

# format
print("The mass of an {} is {:.3e} kg.".format(name, mass))

# f-string
print(f"The mass of an {name} is {mass:.3e} kg.")

f-strings can take expressions:

print(f"a={a} and b={b}. Their product is {a * b}, sum is {a + b}, and a/b is {a / b}.")

See Real Python for more information.
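
A short sketch of a few more f-string features that tend to be useful when writing data to file or debugging. The values of `a` and `b` are made up for illustration:

```python
name = "electron"
mass = 9.1093837015e-31
a, b = 3, 4

# Format specifiers follow a colon, just as with str.format()
sci = f"{mass:.3e}"  # scientific notation with 3 decimal places
ratio = f"{a / b:.2f}"  # fixed point with 2 decimal places
padded = f"{name:>12}"  # right-aligned in a field 12 characters wide

# Since Python 3.8 a trailing "=" shows the expression alongside its value
debug = f"{a=}, {b=}"

print(sci, ratio, padded, debug, sep="\n")
```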

Remove Magic Numbers

Numbers in code that are not immediately obvious.

  • Hard to read
  • Hard to maintain
  • Hard to adapt

Instead:

  • Name a variable conveying meaning
  • Set to a constant
  • Use a comment to explain

numberwang by Mitchell and Webb under fair use

Remove Magic Numbers

"""Module implementing pendulum equations."""
import numpy as np

def get_period(l):
    """..."""
    return 2.0 * np.pi * np.sqrt(l / 9.81)

def max_height(l, theta):
    """..."""
    return l * np.cos(theta)

def max_speed(l, theta):
    """..."""
    return np.sqrt(2.0 * 9.81 * max_height(l, theta))

def energy(m, l, theta):
    """..."""
    return m * 9.81 * max_height(l, theta)

def check_small_angle(theta):
    """..."""
    if theta <= np.pi / 1800.0:
        return True
    return False

def bpm(l):
    """..."""
    return 60.0 / get_period(l)


"""Module implementing pendulum equations."""
import numpy as np

GRAV = 9.81

def get_period(l):
    """..."""
    return 2.0 * np.pi * np.sqrt(l / GRAV)

def max_height(l, theta):
    """..."""
    return l * np.cos(theta)

def max_speed(l, theta):
    """..."""
    return np.sqrt(2.0 * GRAV * max_height(l, theta))

def energy(m, l, theta):
    """..."""
    return m * GRAV * max_height(l, theta)

def check_small_angle(theta, small_ang=np.pi/1800.0):
    """..."""
    if theta <= small_ang:
        return True
    return False

def bpm(l):
    """..."""
    # Divide 60 seconds by period [s] for beats per minute
    return 60.0 / get_period(l)
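
The remaining `np.pi / 1800.0` is still hard to read at a glance. One option, sketched here with the standard library and a hypothetical constant name `SMALL_ANGLE`, is to state the threshold in degrees and convert it:

```python
import math

# Small-angle threshold of 0.1 degrees, stored in radians (equal to pi / 1800)
SMALL_ANGLE = math.radians(0.1)


def check_small_angle(theta, small_ang=SMALL_ANGLE):
    """Return True if theta [rad] is within the small-angle threshold."""
    return theta <= small_ang


print(check_small_angle(0.001), check_small_angle(0.1))
```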

Put config in a config file

  • Ideally we shouldn’t have to hop in and out of the code (and recompile, in compiled languages) every time we change a runtime setting.
  • Settings hard-coded in the source leave no easy record of what each run used.

Instead:

  • It’s easy to read a JSON file into Python as a dictionary. Handle it as you wish: create a class, read to variables, etc.
  • Could even make config filename a command line argument
{
  "config_name": "June 2022 m01 n19 run",
  "start_date": "2022-05-28 00:00:00",
  "end_date": "2022-06-12 23:59:59",
  "satellites": ["m01", "n19"],
  "noise_floor": [3.0, 3.0, 3.0],
  "check_SNR": true,
  "L_lim": [1.5, 8.0],
  "telescopes": [90],
  "n_bins": 27
}
import json


with open("config.json", encoding="utf-8") as json_file:
    config = json.load(json_file)

print(config)
{'config_name': 'June 2022 m01 n19 run', 'start_date': '2022-05-28 00:00:00', 'end_date': '2022-06-12 23:59:59', 'satellites': ['m01', 'n19'], 'noise_floor': [3.0, 3.0, 3.0], 'check_SNR': True, 'L_lim': [1.5, 8.0], 'telescopes': [90], 'n_bins': 27}
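
Taking the config filename from the command line could look something like the sketch below, which uses `argparse` from the standard library. The function names `parse_args` and `load_config` are hypothetical, and the demonstration writes a cut-down config to a temporary folder so that the example runs on its own:

```python
import argparse
import json
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line, expecting the path to a JSON config file."""
    parser = argparse.ArgumentParser(description="Run an analysis from a config file.")
    parser.add_argument("config_file", type=Path, help="path to a JSON configuration file")
    return parser.parse_args(argv)


def load_config(path):
    """Read a JSON config file into a dictionary."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


# Demonstration: write a small config to a temporary folder, then read it
# back as if its path had been given on the command line.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text('{"satellites": ["m01", "n19"], "n_bins": 27}', encoding="utf-8")

    args = parse_args([str(demo_path)])
    config = load_config(args.config_file)

print(config["satellites"], config["n_bins"])
```

In a real script, calling `parse_args()` with no argument reads `sys.argv`, so each run records which config file it used, e.g. `python myscript.py config.json` (with `myscript.py` standing in for your own script).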

Exercise 5

Magic Numbers

  • Look through the code and identify any magic numbers.
  • Implement what you feel is the best approach in each case

f-strings

  • Look for any string handling (currently using the .format() approach) and update it to use f-strings.
    • Is the intent clearer?
    • Is the layout of the data written to file easier to understand?

Configuration settings

  • There is helpfully a list of configurable inputs at the end of the file under "__main__".
    We can improve on this, however, by placing them in a configuration file.
  • Create an appropriate json file to be read in as a dictionary and passed to the main function.

Other things

Beyond the scope of today are a few other honourable mentions:

  • Functions and modules
  • Packaging
    • Breaking projects into modules and __init__.py
    • Distributing projects with pyproject.toml
  • Documentation
    • Auto-generation from docstrings with sphinx or mkdocs
  • Type hinting
    • Adding type hinting to python code - how and why?
    • Type checking with mypy

These lessons will be added to the course content in future.

Honourable Mentions

  • ruff
    This is a recent tool that allows you to do formatting and linting in one go.
    Setup is slightly harder than for the tools discussed here, however.
  • pint
    This library allows you to store the units as part of a value, and convert between units. This can be useful, especially at in/output, but adds to “overheads”.
  • mypy
    This is a package for type checking - helping to ensure our code behaves as we might expect. Similarities to Fortran/C/C++.
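
As a taster of type hinting, a minimal sketch annotating a scalar version of the ideal gas law function from earlier. The annotations are not enforced when the code runs; a checker such as mypy reads them statically. The names used here are illustrative:

```python
K_B: float = 1.380649e-23  # Boltzmann constant [J K-1]


def calc_pres(n: float, t: float) -> float:
    """Calculate pressure [Pa] from number density [m-3] and temperature [K]."""
    return n * K_B * t


pressure = calc_pres(2.5e25, 300.0)
print(f"{pressure:.3e} Pa")
```

With the hints in place, a call such as `calc_pres("2.5e25", 300.0)` is the kind of mistake a type checker can report before the code is run.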

Closing

Where can I get help?

ICCS runs Climate Code Clinics that can be booked by any researcher in climate science or related fields at any time.

Apply online for a 1hr slot where 2 ICCS RSEs will sit down to take a look at your code, answer your questions, and help you improve it.

Recent topics have included:

  • Adding documentation to code
  • Packaging and distributing code for easy installation
  • Opening projects for collaboration and project management
  • Structuring ML projects
  • Linking machine learning to Fortran
  • Adding MPI and OpenMP to code

Where can I learn more?

References

The code in this workshop is based on a script from Irving (2019).

Cannon, B, D Ingram, P Ganssle, P Gedam, S Eustace, T Kluyver, and T Chung. 2020. “PEP 621 – Storing project metadata in pyproject.toml.” https://peps.python.org/pep-0621/.
Goodger, D, and G van Rossum. 2001. “PEP 257 – Docstring Conventions.” https://peps.python.org/pep-0257/.
Irving, Damien. 2019. “Python for Atmosphere and Ocean Scientists.” Journal of Open Source Education 2 (16): 37. https://doi.org/10.21105/jose.00037.
Langa, Ł. 2020. “Black: The uncompromising Python code formatter.” https://github.com/psf/black. https://black.readthedocs.io/en/stable/.
Murphy, N. 2023. “Writing Clean Scientific Software.” Presented at the HPC Best Practices Webinar Series. https://www.youtube.com/watch?v=Q6Ksu_uX3bc.
Rossum, G van, B Warsaw, and A Coghlan. 2001, 2013. “PEP 8 – Style Guide for Python Code.” https://peps.python.org/pep-0008/.
Spertus, E. 2021. “stackoverflow - Best practices for writing code comments.” https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments/.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Computational Biology 13 (6): e1005510.