Setting up a Data Pipeline#

Outline#

Questions:#

  • How can I set up a pipeline to generate, map, and reduce data from a point cloud?

  • What are some common reducers dupin provides?

Objectives:#

  • Define what a generator is and the expected output.

  • Demonstrate the builder syntax for the creation of pipelines.

  • Show how to use multiple maps or reducers through teeing.

Imports#

[1]:
import freud

import dupin as du

The data module#

The data generation portion of dupin (generate to aggregate) can be found in the dupin.data submodule.

Generators#

The base of the data generation portion of dupin (generate to aggregate) is the generator. Generators are simply registered callables which, when called, return a dictionary of features. These dictionaries have feature names as keys and either floats or NumPy arrays as values.

@du.data.CustomGenerator
def eg_generator():
    return {"feat-1": 1.2, "feat-2": 0.0}

In this tutorial we will use a built-in generator class from dupin that wraps freud, a Python package for analyzing molecular trajectories. The point cloud, or trajectory, we are using comes from a molecular dynamics simulation of thermostatted Lennard-Jones particles in a fixed-volume periodic box (NVT) run using hoomd-blue.

Below we define our generator, which uses Steinhardt order parameters. While not necessary for understanding, we use the spherical harmonic numbers \(l \in \{2,4,6,8,10,12\}\). This requires that we specify multiple feature names in the attrs keyword argument below. attrs maps the attribute name in the freud compute object to feature names in dupin. For two-dimensional array quantities such as we have here, we map the attribute name particle_order to multiple names, one for each \(l\) value.

[2]:
ls = (2, 4, 6, 8, 10, 12)
steinhardt = freud.order.Steinhardt(l=ls)
generator = du.data.freud.FreudDescriptor(
    compute=steinhardt, attrs={"particle_order": [f"$Q_{{{l}}}$" for l in ls]}
)
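For reference, the list comprehension above yields the following feature names, one per \(l\) value:

print([f"$Q_{{{l}}}$" for l in ls])
# ['$Q_{2}$', '$Q_{4}$', '$Q_{6}$', '$Q_{8}$', '$Q_{10}$', '$Q_{12}$']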

Builder syntax#

dupin has two ways of attaching steps to a data generation pipeline for mapping or reducing: the builder syntax and the decorator syntax. This tutorial covers only the builder syntax; for the decorator syntax, see the API documentation.

The builder syntax involves calling special methods on an existing pipeline (generators and all objects derived from them are pipelines): pipe, map, and reduce.

  • pipe: Adds a new layer to the pipeline for either a map or a reduce step. Objects passed to pipe must be known reducers or mappers. When piping two operations, they are executed from left to right, with the output of the first used as the input to the second.

  • map: Adds a map layer to the pipeline. Objects passed to map can be either known mappers or a custom map function. Mappers transform a vector-like (usually per-particle) quantity into another vector-like per-particle quantity that describes the property of interest better than the original.

  • reduce: Adds a reduce layer to the pipeline. Objects passed to reduce can be either known reducers or a custom reduction function. Reducers take a vector-like (usually per-particle) quantity and reduce it to one or more scalars that can be used in detection.

The builder syntax produces a pipeline whose steps read from left to right; that is, A.pipe(B).map(C).reduce(D) goes from A->B->C->D. Below we showcase the builder syntax; a sketch using a custom reduction function follows the example. Don’t worry if you don’t understand the specific mappers or reducers here. The rest of the tutorial will go over commonly used ones.

[3]:
pipeline = generator.pipe(  # a map step
    du.data.spatial.NeighborAveraging(
        expected_kwarg="neighbors", remove_kwarg=False
    )
).reduce(du.data.reduce.NthGreatest((-1, 1, 10, -10)))
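As noted above, reduce also accepts a custom reduction function. The following is a minimal sketch of that path, assuming such a function receives a NumPy array and returns a dictionary mapping reduction names to scalar values; the min_max function and the custom_pipeline name are purely illustrative.

import numpy as np


def min_max(distribution):
    """Illustrative custom reducer: keep only the extrema of a distribution."""
    return {"min": np.min(distribution), "max": np.max(distribution)}


custom_pipeline = generator.map(  # the same neighbor-averaging map step as above
    du.data.spatial.NeighborAveraging(
        expected_kwarg="neighbors", remove_kwarg=False
    )
).reduce(min_max)  # assumed: a plain callable is accepted as a custom reducer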

Reducers#

We are going to skip over maps here as they are less commonly useful than reducers. Feel free to look at the documentation for dupin.data.spatial.NeighborAveraging, used above.

Reducers take an array and return one or more features associated with it. For the purposes of event detection, features which focus on the extrema or limits of a distribution tend to outperform others, as they can signal the transition earlier than other features. dupin has two classes which perform this function: NthGreatest and Percentile.

  • NthGreatest takes the specified nth greatest values, or nth least values when indicated by negative indices.

  • Percentile takes the specified percentiles of the distribution.

The two classes perform similar functions, and the choice between them is mostly a matter of taste. If you prefer to specify the exact indices to take, use NthGreatest; if you would rather think in terms of percentages, use Percentile. For this tutorial we will use NthGreatest. Below we create the final pipeline for this section of the tutorial, which will be used in the next section.

[4]:
pipeline = generator.pipe(du.data.reduce.NthGreatest((-1, 1, 10, -10)))
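For comparison, a Percentile-based reduction of the same generator might look like the sketch below. The percentiles keyword (percentages between 0 and 100) and the percentile_pipeline name are assumptions here; consult the dupin.data.reduce.Percentile documentation for the exact signature.

# Hypothetical alternative that keeps the tails of each feature's distribution
# via percentiles rather than exact ranked indices.
percentile_pipeline = generator.pipe(
    du.data.reduce.Percentile(percentiles=(0, 1, 99, 100))
)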