Feature Selection#

Outline#

Questions#

  • How can I reduce the dimensionality of my system?

  • What are the built-in methods for feature selection in dupin?

Objectives#

  • Explain why feature selection can be useful before event detection.

  • Show how to use the MeanShift feature selection method.

Import#

[1]:
import pandas as pd

import dupin as du

FILENAME = "lj-data.h5"
[2]:
def display_dataframe(df):
    """Display the head of a DataFrame with a dark color scheme."""
    style = df.head().style
    style.set_table_styles(
        [
            {
                "selector": "th",
                "props": "background-color: #666666; color: #ffffff; border: 1px solid #222222;",
            },
            {
                "selector": "td",
                "props": "background-color: #666666; color: #ffffff; border: 1px solid #222222;",
            },
        ]
    )
    display(style)

Load the Data#

Below we load the data from the HDF5 file created in the previous section.

[3]:
# The simulation was started in a simple cubic crystal.
# We skip the first three frames since the melting of that crystal
# is the biggest signal in the trajectory.
data = pd.read_hdf(FILENAME, key="data").iloc[3:]

Transforming the Signal#

In dupin, before detecting the change points in a signal, we can optionally modify the signal through the transform step. This step corresponds to the familiar paradigms of signal processing, feature selection, and dimensionality reduction. In this tutorial we focus on using the transform step for feature selection.
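
To make this concrete, a transform in this sense is just a callable that maps a DataFrame of features to a (usually smaller) DataFrame of features. As a minimal plain-pandas sketch (the drop_constant helper below is hypothetical, not part of dupin's API):

def drop_constant(df, threshold=1e-8):
    # Remove features whose variance is (near) zero; a flat signal
    # cannot contain a change point.
    return df.loc[:, df.var() > threshold]


transformed = drop_constant(data)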

Why Feature Selection#

Given the reduce step, data generation in dupin can easily end up with hundreds or thousands of features. This high dimensionality leads to three problems for point cloud data from molecular systems.

  1. Given thermal noise, as the number of features \(N_s \to \infty\), the probability that a spurious event is found goes to 1 (see the sketch after this list).

  2. High dimensionality also washes out true events that occur in only a few directions. As \(N_s \to \infty\), a change confined to any finite number of features will be missed by many change point detection methods.

  3. The computational cost of event detection tends to scale at least linearly in the number of features, so minimizing the number of features in our signal can significantly speed up detection.
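
The first point is easy to demonstrate with pure noise: as we add features that contain no event at all, the largest apparent mean shift between the two halves of the signal keeps growing. The following is a plain NumPy illustration, not dupin code:

import numpy as np

rng = np.random.default_rng(42)
n_frames = 200

for n_features in (10, 100, 1000):
    noise = rng.normal(size=(n_frames, n_features))
    # Apparent shift between the two halves of each pure-noise feature.
    shifts = np.abs(
        noise[: n_frames // 2].mean(axis=0) - noise[n_frames // 2 :].mean(axis=0)
    )
    print(f"{n_features:>5} features: largest spurious shift = {shifts.max():.3f}")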

Generally, given good feature selection, we also do not need to worry about removing information: when a signal has numerous features, many will carry the same information about events, and many may show no events at all.

Mean Shift#

The simplest and most useful feature selection tool is dupin.preprocessing.filter.MeanShift. The class assumes that each book-end (the beginning and the end) of the signal is Gaussian distributed. It then compares the mean of each end to the distribution fitted to the other. A feature is kept if the probability that one end's mean was sampled from the other end's distribution is less than sensitivity. Thus, features whose mean has not shifted over the length of the trajectory are removed.
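
To sketch that test in plain SciPy (the book-end fraction and the exact statistic below are illustrative assumptions, not dupin's implementation):

from scipy import stats


def passes_mean_shift(feature, sensitivity=1e-6, end_fraction=0.25):
    # Treat the first and last quarters of the trajectory as book-ends.
    n = int(len(feature) * end_fraction)
    start, end = feature[:n], feature[-n:]
    # Two-sided probability of each end's mean under a Gaussian
    # fit to the opposite end.
    p_start = 2 * stats.norm.sf(abs(start.mean() - end.mean()), scale=end.std())
    p_end = 2 * stats.norm.sf(abs(end.mean() - start.mean()), scale=start.std())
    # Keep the feature only if the shift is unlikely to be noise.
    return min(p_start, p_end) < sensitivity

MeanShift applies this style of test column by column, dropping the features that fail it.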

[4]:
mean_shift = du.preprocessing.filter.MeanShift(sensitivity=1e-6)
filtered_data = mean_shift(data)
display_dataframe(filtered_data)
|   | 10th_greatest_$Q_{2}$ | 1st_greatest_$Q_{2}$ | 10th_least_$Q_{2}$ | 10th_greatest_$Q_{4}$ | 1st_greatest_$Q_{4}$ | 10th_least_$Q_{4}$ | 10th_greatest_$Q_{6}$ | 1st_greatest_$Q_{6}$ | 1st_least_$Q_{6}$ | 10th_least_$Q_{6}$ | 10th_greatest_$Q_{8}$ | 1st_least_$Q_{8}$ | 10th_least_$Q_{8}$ | 10th_greatest_$Q_{10}$ | 1st_greatest_$Q_{10}$ | 1st_least_$Q_{10}$ | 10th_least_$Q_{10}$ | 10th_greatest_$Q_{12}$ | 1st_greatest_$Q_{12}$ | 1st_least_$Q_{12}$ | 10th_least_$Q_{12}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.154826 | 0.187591 | 0.035020 | 0.209281 | 0.226303 | 0.077595 | 0.543408 | 0.573807 | 0.246732 | 0.284698 | 0.374615 | 0.137892 | 0.166574 | 0.308446 | 0.343068 | 0.121403 | 0.157507 | 0.396944 | 0.433905 | 0.209838 | 0.227965 |
| 4 | 0.145845 | 0.169722 | 0.032802 | 0.216655 | 0.285643 | 0.072493 | 0.548409 | 0.578489 | 0.214945 | 0.278768 | 0.371742 | 0.139842 | 0.162539 | 0.309451 | 0.350324 | 0.118400 | 0.158816 | 0.405271 | 0.443005 | 0.194769 | 0.227046 |
| 5 | 0.159110 | 0.187771 | 0.035439 | 0.213706 | 0.233688 | 0.079383 | 0.537238 | 0.598261 | 0.243953 | 0.282362 | 0.368857 | 0.124305 | 0.165234 | 0.311252 | 0.332301 | 0.113480 | 0.157218 | 0.398565 | 0.447089 | 0.174709 | 0.227156 |
| 6 | 0.153914 | 0.171392 | 0.033056 | 0.207366 | 0.226134 | 0.074429 | 0.543205 | 0.569693 | 0.222771 | 0.286757 | 0.371920 | 0.134696 | 0.166330 | 0.304221 | 0.345650 | 0.122641 | 0.151312 | 0.404369 | 0.484257 | 0.156539 | 0.224638 |
| 7 | 0.143113 | 0.166033 | 0.028085 | 0.207654 | 0.240179 | 0.072758 | 0.541446 | 0.597715 | 0.234455 | 0.280158 | 0.367076 | 0.143325 | 0.179890 | 0.308270 | 0.353579 | 0.134088 | 0.159848 | 0.416040 | 0.438517 | 0.194086 | 0.226607 |
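
One quick way to see exactly what the filter removed is to compare the column sets before and after:

dropped = set(data.columns) - set(filtered_data.columns)
print(f"{data.shape[1]} features before, {filtered_data.shape[1]} after")
print("Dropped:", sorted(dropped))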

In this particular case, the number of features remains roughly the same because most of our features underwent a mean shift through the nucleation process. Below we save the filtered DataFrame to disk for the next and final section of the tutorial.

[5]:
filtered_data.to_hdf("lj-filtered-data.h5", key="data")