{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Feature Selection\n", "\n", "## Outline\n", "### Questions\n", "- How can I reduce the dimensionality of my system?\n", "- What are the built-in methods for feature selection in **dupin**?\n", "\n", "### Objectives\n", "- Explain why feature selection can be useful before detection.\n", "- Show how to use the `MeanShift` feature selection method.\n", "\n", "## Import" ] },
{ "cell_type": "code", "execution_count": 1, "id": "1", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import dupin as du\n", "\n", "FILENAME = \"lj-data.h5\"" ] },
{ "cell_type": "code", "execution_count": 2, "id": "2", "metadata": { "tags": [] }, "outputs": [], "source": [
"def display_dataframe(df):\n",
"    # Show only the first five rows, styled with a dark table theme.\n",
"    style = df.head().style\n",
"    style.set_table_styles(\n",
"        [\n",
"            {\n",
"                \"selector\": \"th\",\n",
"                \"props\": \"background-color: #666666; color: #ffffff; border: 1px solid #222222;\",\n",
"            },\n",
"            {\n",
"                \"selector\": \"td\",\n",
"                \"props\": \"background-color: #666666; color: #ffffff; border: 1px solid #222222;\",\n",
"            },\n",
"        ]\n",
"    )\n",
"    display(style)" ] },
{ "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Load the Data\n", "\n", "Below we load the data from the HDF5 file created in the previous section." ] },
{ "cell_type": "code", "execution_count": 3, "id": "4", "metadata": { "tags": [] }, "outputs": [], "source": [
"# The simulation was started in a simple cubic crystal.\n",
"# We skip the first three frames since the melting of that crystal is the biggest signal in the trajectory.\n",
"data = pd.read_hdf(FILENAME, key=\"data\").iloc[3:]" ] },
{ "cell_type": "markdown", "id": "5", "metadata": {}, "source": [
"## Transforming the Signal\n",
"\n",
"In **dupin**, before detecting the change points of a signal, we can optionally modify the signal through the transform step.\n",
"For those familiar with them, this step can be thought of in terms of signal processing, feature selection, and dimensionality reduction.\n",
"In this tutorial we focus on using the transform step for feature selection.\n",
"\n",
"### Why Feature Selection\n",
"\n",
"Because of the reduce step, data generation in **dupin** can easily produce hundreds or thousands of features.\n",
"This high dimensionality leads to three problems for molecular system point cloud data.\n",
"\n",
"1. Given thermal noise, as the number of features $N_s \\to \\infty$, the probability that a spurious event is found goes to 1.\n",
"2. Large dimensionality also washes out true events that occur in only a few directions.\n",
"   In the limit $N_s \\to \\infty$, many change point detection methods will fail to detect a change that occurs in only a finite number of features.\n",
"3. The computational cost of event detection tends to scale at least linearly in the number of features.\n",
"   Thus, minimizing the number of features in our signal can significantly speed up detection.\n",
"\n",
"Generally, given good feature selection, we also do not need to worry about removing information.\n",
"When there are numerous features, many will give the same information regarding events, and many may not *detect* any events at all.\n",
"\n",
"## Mean Shift\n",
"\n",
"The most useful and simplest feature selection tool is `dupin.preprocessing.filter.MeanShift`.\n",
"The class assumes that the values at each end (book-end) of a feature's signal follow a Gaussian distribution.\n",
"It then compares the mean of each end to the distribution at the other end.\n",
"A feature is kept if the probability that the mean of one end was sampled from the other end's distribution is less than `sensitivity`.\n",
"Thus, features which have not *changed* over the length of the trajectory are removed." 
] }, { "cell_type": "code", "execution_count": 4, "id": "6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>10th_greatest_$Q_{2}$</th><th>1st_greatest_$Q_{2}$</th><th>10th_least_$Q_{2}$</th><th>10th_greatest_$Q_{4}$</th><th>1st_greatest_$Q_{4}$</th><th>10th_least_$Q_{4}$</th><th>10th_greatest_$Q_{6}$</th><th>1st_greatest_$Q_{6}$</th><th>1st_least_$Q_{6}$</th><th>10th_least_$Q_{6}$</th><th>10th_greatest_$Q_{8}$</th><th>1st_least_$Q_{8}$</th><th>10th_least_$Q_{8}$</th><th>10th_greatest_$Q_{10}$</th><th>1st_greatest_$Q_{10}$</th><th>1st_least_$Q_{10}$</th><th>10th_least_$Q_{10}$</th><th>10th_greatest_$Q_{12}$</th><th>1st_greatest_$Q_{12}$</th><th>1st_least_$Q_{12}$</th><th>10th_least_$Q_{12}$</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3</th><td>0.154826</td><td>0.187591</td><td>0.035020</td><td>0.209281</td><td>0.226303</td><td>0.077595</td><td>0.543408</td><td>0.573807</td><td>0.246732</td><td>0.284698</td><td>0.374615</td><td>0.137892</td><td>0.166574</td><td>0.308446</td><td>0.343068</td><td>0.121403</td><td>0.157507</td><td>0.396944</td><td>0.433905</td><td>0.209838</td><td>0.227965</td></tr>\n",
"    <tr><th>4</th><td>0.145845</td><td>0.169722</td><td>0.032802</td><td>0.216655</td><td>0.285643</td><td>0.072493</td><td>0.548409</td><td>0.578489</td><td>0.214945</td><td>0.278768</td><td>0.371742</td><td>0.139842</td><td>0.162539</td><td>0.309451</td><td>0.350324</td><td>0.118400</td><td>0.158816</td><td>0.405271</td><td>0.443005</td><td>0.194769</td><td>0.227046</td></tr>\n",
"    <tr><th>5</th><td>0.159110</td><td>0.187771</td><td>0.035439</td><td>0.213706</td><td>0.233688</td><td>0.079383</td><td>0.537238</td><td>0.598261</td><td>0.243953</td><td>0.282362</td><td>0.368857</td><td>0.124305</td><td>0.165234</td><td>0.311252</td><td>0.332301</td><td>0.113480</td><td>0.157218</td><td>0.398565</td><td>0.447089</td><td>0.174709</td><td>0.227156</td></tr>\n",
"    <tr><th>6</th><td>0.153914</td><td>0.171392</td><td>0.033056</td><td>0.207366</td><td>0.226134</td><td>0.074429</td><td>0.543205</td><td>0.569693</td><td>0.222771</td><td>0.286757</td><td>0.371920</td><td>0.134696</td><td>0.166330</td><td>0.304221</td><td>0.345650</td><td>0.122641</td><td>0.151312</td><td>0.404369</td><td>0.484257</td><td>0.156539</td><td>0.224638</td></tr>\n",
"    <tr><th>7</th><td>0.143113</td><td>0.166033</td><td>0.028085</td><td>0.207654</td><td>0.240179</td><td>0.072758</td><td>0.541446</td><td>0.597715</td><td>0.234455</td><td>0.280158</td><td>0.367076</td><td>0.143325</td><td>0.179890</td><td>0.308270</td><td>0.353579</td><td>0.134088</td><td>0.159848</td><td>0.416040</td><td>0.438517</td><td>0.194086</td><td>0.226607</td></tr>\n",
"  </tbody>\n",
"</table>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mean_shift = du.preprocessing.filter.MeanShift(sensitivity=1e-6)\n", "filtered_data = mean_shift(data)\n", "display_dataframe(filtered_data)" ] },
{ "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "In this particular case, the number of features remains roughly the same, since most of our features underwent a mean shift during the nucleation process.\n", "Below we save the filtered DataFrame to disk for the next and final section of the tutorial." ] },
{ "cell_type": "code", "execution_count": 5, "id": "8", "metadata": { "tags": [] }, "outputs": [], "source": [ "filtered_data.to_hdf(\"lj-filtered-data.h5\", key=\"data\")" ] },
{ "cell_type": "markdown", "id": "9", "metadata": { "nbsphinx": "hidden", "tags": [] }, "source": [ "[Previous section](03-collecting-data.ipynb) [Next section](05-detect-events.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python (dupin)", "language": "python", "name": "dupin" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }