{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Feature Selection\n",
    "\n",
    "## Outline\n",
    "### Questions\n",
    "- How can I reduce the dimensionality of my system?\n",
    "- What are the builtin methods for feature selection in **dupin**?\n",
    "\n",
    "### Objectives\n",
    "- Explain why feature selection can be useful before detecting.\n",
    "- Show how to use the `MeanShift` feature selection method.\n",
    "\n",
    "## Import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import dupin as du\n",
    "\n",
    "FILENAME = \"lj-data.h5\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def display_dataframe(df):\n",
    "    style = df.head().style\n",
    "    style.set_table_styles(\n",
    "        [\n",
    "            {\n",
    "                \"selector\": \"th\",\n",
    "                \"props\": \"background-color: #666666; color: #ffffff; border: 1px solid #222222;\",\n",
    "            },\n",
    "            {\n",
    "                \"selector\": \"td\",\n",
    "                \"props\": \"background-color: #666666; color: #ffffff; border: 1px solid #222222;\",\n",
    "            },\n",
    "        ]\n",
    "    )\n",
    "    display(style)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "## Load the Data\n",
    "\n",
    "Below we go ahead and upload the data from the HDF5 file created in the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# The simulation was started in a simple cubic crystal.\n",
    "# We don't use these frames since the melting of that crystal is the biggest signal in the trajectory.\n",
    "data = pd.read_hdf(FILENAME, key=\"data\").iloc[3:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## Transforming the Signal\n",
    "\n",
    "In **dupin** before detecting the change points of a signal, we can optionally modify the signal through the transform step. \n",
    "For those familiar, this can be thought of in the paradigms of signal processing, feature selection, and dimensionality reduction.\n",
    "We will focus in this tutorial on the use of transforming for feature selection.\n",
    "\n",
    "### Why Feature Selection\n",
    "\n",
    "Given the reduce step, **dupin** in data generation can easily end up with 100s or 1000s of features.\n",
    "This high dimensionality leads to 3 problems for molecular system point cloud data.\n",
    "\n",
    "1. Given thermal noise, as $N_s \\to \\infty$ the probability a spurious event is found goes to 1.\n",
    "2. Large dimensionality also washes out true events in only a few directions.\n",
    "   Given $N_s = \\infty$, a change in any finite number of features, many methods of change point detection will lead to no event detected.\n",
    "3. The computational cost of event detection tends to scale at least linearly in the number of features.\n",
    "   Thus minimizing the number of features in our signal can significantly speed up detection.\n",
    "   \n",
    "Generally, given good feature selection, we also do not need to worry about removing information.\n",
    "When there are numerous features many will give the same information regarding events, and many may not *detect* any events at all.\n",
    "\n",
    "## Mean Shift\n",
    "\n",
    "The most useful and simplest feature selection tool is `dupin.preprocessing.filter.MeanShift`.\n",
    "The class assumes each book-end of the distribution is a Gaussian distribution.\n",
    "It then compares the mean of each side to the distribution on the other.\n",
    "A feature is kept if the mean from one end is less than `sensitivity` likely to have been sampled from the other.\n",
    "Thus, features which have not *changed* over the length of the trajectory are removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_22415 th {\n",
       "  background-color: #666666;\n",
       "  color: #ffffff;\n",
       "  border: 1px solid #222222;\n",
       "}\n",
       "#T_22415 td {\n",
       "  background-color: #666666;\n",
       "  color: #ffffff;\n",
       "  border: 1px solid #222222;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_22415\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_22415_level0_col0\" class=\"col_heading level0 col0\" >10th_greatest_$Q_{2}$</th>\n",
       "      <th id=\"T_22415_level0_col1\" class=\"col_heading level0 col1\" >1st_greatest_$Q_{2}$</th>\n",
       "      <th id=\"T_22415_level0_col2\" class=\"col_heading level0 col2\" >10th_least_$Q_{2}$</th>\n",
       "      <th id=\"T_22415_level0_col3\" class=\"col_heading level0 col3\" >10th_greatest_$Q_{4}$</th>\n",
       "      <th id=\"T_22415_level0_col4\" class=\"col_heading level0 col4\" >1st_greatest_$Q_{4}$</th>\n",
       "      <th id=\"T_22415_level0_col5\" class=\"col_heading level0 col5\" >10th_least_$Q_{4}$</th>\n",
       "      <th id=\"T_22415_level0_col6\" class=\"col_heading level0 col6\" >10th_greatest_$Q_{6}$</th>\n",
       "      <th id=\"T_22415_level0_col7\" class=\"col_heading level0 col7\" >1st_greatest_$Q_{6}$</th>\n",
       "      <th id=\"T_22415_level0_col8\" class=\"col_heading level0 col8\" >1st_least_$Q_{6}$</th>\n",
       "      <th id=\"T_22415_level0_col9\" class=\"col_heading level0 col9\" >10th_least_$Q_{6}$</th>\n",
       "      <th id=\"T_22415_level0_col10\" class=\"col_heading level0 col10\" >10th_greatest_$Q_{8}$</th>\n",
       "      <th id=\"T_22415_level0_col11\" class=\"col_heading level0 col11\" >1st_least_$Q_{8}$</th>\n",
       "      <th id=\"T_22415_level0_col12\" class=\"col_heading level0 col12\" >10th_least_$Q_{8}$</th>\n",
       "      <th id=\"T_22415_level0_col13\" class=\"col_heading level0 col13\" >10th_greatest_$Q_{10}$</th>\n",
       "      <th id=\"T_22415_level0_col14\" class=\"col_heading level0 col14\" >1st_greatest_$Q_{10}$</th>\n",
       "      <th id=\"T_22415_level0_col15\" class=\"col_heading level0 col15\" >1st_least_$Q_{10}$</th>\n",
       "      <th id=\"T_22415_level0_col16\" class=\"col_heading level0 col16\" >10th_least_$Q_{10}$</th>\n",
       "      <th id=\"T_22415_level0_col17\" class=\"col_heading level0 col17\" >10th_greatest_$Q_{12}$</th>\n",
       "      <th id=\"T_22415_level0_col18\" class=\"col_heading level0 col18\" >1st_greatest_$Q_{12}$</th>\n",
       "      <th id=\"T_22415_level0_col19\" class=\"col_heading level0 col19\" >1st_least_$Q_{12}$</th>\n",
       "      <th id=\"T_22415_level0_col20\" class=\"col_heading level0 col20\" >10th_least_$Q_{12}$</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_22415_level0_row0\" class=\"row_heading level0 row0\" >3</th>\n",
       "      <td id=\"T_22415_row0_col0\" class=\"data row0 col0\" >0.154826</td>\n",
       "      <td id=\"T_22415_row0_col1\" class=\"data row0 col1\" >0.187591</td>\n",
       "      <td id=\"T_22415_row0_col2\" class=\"data row0 col2\" >0.035020</td>\n",
       "      <td id=\"T_22415_row0_col3\" class=\"data row0 col3\" >0.209281</td>\n",
       "      <td id=\"T_22415_row0_col4\" class=\"data row0 col4\" >0.226303</td>\n",
       "      <td id=\"T_22415_row0_col5\" class=\"data row0 col5\" >0.077595</td>\n",
       "      <td id=\"T_22415_row0_col6\" class=\"data row0 col6\" >0.543408</td>\n",
       "      <td id=\"T_22415_row0_col7\" class=\"data row0 col7\" >0.573807</td>\n",
       "      <td id=\"T_22415_row0_col8\" class=\"data row0 col8\" >0.246732</td>\n",
       "      <td id=\"T_22415_row0_col9\" class=\"data row0 col9\" >0.284698</td>\n",
       "      <td id=\"T_22415_row0_col10\" class=\"data row0 col10\" >0.374615</td>\n",
       "      <td id=\"T_22415_row0_col11\" class=\"data row0 col11\" >0.137892</td>\n",
       "      <td id=\"T_22415_row0_col12\" class=\"data row0 col12\" >0.166574</td>\n",
       "      <td id=\"T_22415_row0_col13\" class=\"data row0 col13\" >0.308446</td>\n",
       "      <td id=\"T_22415_row0_col14\" class=\"data row0 col14\" >0.343068</td>\n",
       "      <td id=\"T_22415_row0_col15\" class=\"data row0 col15\" >0.121403</td>\n",
       "      <td id=\"T_22415_row0_col16\" class=\"data row0 col16\" >0.157507</td>\n",
       "      <td id=\"T_22415_row0_col17\" class=\"data row0 col17\" >0.396944</td>\n",
       "      <td id=\"T_22415_row0_col18\" class=\"data row0 col18\" >0.433905</td>\n",
       "      <td id=\"T_22415_row0_col19\" class=\"data row0 col19\" >0.209838</td>\n",
       "      <td id=\"T_22415_row0_col20\" class=\"data row0 col20\" >0.227965</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_22415_level0_row1\" class=\"row_heading level0 row1\" >4</th>\n",
       "      <td id=\"T_22415_row1_col0\" class=\"data row1 col0\" >0.145845</td>\n",
       "      <td id=\"T_22415_row1_col1\" class=\"data row1 col1\" >0.169722</td>\n",
       "      <td id=\"T_22415_row1_col2\" class=\"data row1 col2\" >0.032802</td>\n",
       "      <td id=\"T_22415_row1_col3\" class=\"data row1 col3\" >0.216655</td>\n",
       "      <td id=\"T_22415_row1_col4\" class=\"data row1 col4\" >0.285643</td>\n",
       "      <td id=\"T_22415_row1_col5\" class=\"data row1 col5\" >0.072493</td>\n",
       "      <td id=\"T_22415_row1_col6\" class=\"data row1 col6\" >0.548409</td>\n",
       "      <td id=\"T_22415_row1_col7\" class=\"data row1 col7\" >0.578489</td>\n",
       "      <td id=\"T_22415_row1_col8\" class=\"data row1 col8\" >0.214945</td>\n",
       "      <td id=\"T_22415_row1_col9\" class=\"data row1 col9\" >0.278768</td>\n",
       "      <td id=\"T_22415_row1_col10\" class=\"data row1 col10\" >0.371742</td>\n",
       "      <td id=\"T_22415_row1_col11\" class=\"data row1 col11\" >0.139842</td>\n",
       "      <td id=\"T_22415_row1_col12\" class=\"data row1 col12\" >0.162539</td>\n",
       "      <td id=\"T_22415_row1_col13\" class=\"data row1 col13\" >0.309451</td>\n",
       "      <td id=\"T_22415_row1_col14\" class=\"data row1 col14\" >0.350324</td>\n",
       "      <td id=\"T_22415_row1_col15\" class=\"data row1 col15\" >0.118400</td>\n",
       "      <td id=\"T_22415_row1_col16\" class=\"data row1 col16\" >0.158816</td>\n",
       "      <td id=\"T_22415_row1_col17\" class=\"data row1 col17\" >0.405271</td>\n",
       "      <td id=\"T_22415_row1_col18\" class=\"data row1 col18\" >0.443005</td>\n",
       "      <td id=\"T_22415_row1_col19\" class=\"data row1 col19\" >0.194769</td>\n",
       "      <td id=\"T_22415_row1_col20\" class=\"data row1 col20\" >0.227046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_22415_level0_row2\" class=\"row_heading level0 row2\" >5</th>\n",
       "      <td id=\"T_22415_row2_col0\" class=\"data row2 col0\" >0.159110</td>\n",
       "      <td id=\"T_22415_row2_col1\" class=\"data row2 col1\" >0.187771</td>\n",
       "      <td id=\"T_22415_row2_col2\" class=\"data row2 col2\" >0.035439</td>\n",
       "      <td id=\"T_22415_row2_col3\" class=\"data row2 col3\" >0.213706</td>\n",
       "      <td id=\"T_22415_row2_col4\" class=\"data row2 col4\" >0.233688</td>\n",
       "      <td id=\"T_22415_row2_col5\" class=\"data row2 col5\" >0.079383</td>\n",
       "      <td id=\"T_22415_row2_col6\" class=\"data row2 col6\" >0.537238</td>\n",
       "      <td id=\"T_22415_row2_col7\" class=\"data row2 col7\" >0.598261</td>\n",
       "      <td id=\"T_22415_row2_col8\" class=\"data row2 col8\" >0.243953</td>\n",
       "      <td id=\"T_22415_row2_col9\" class=\"data row2 col9\" >0.282362</td>\n",
       "      <td id=\"T_22415_row2_col10\" class=\"data row2 col10\" >0.368857</td>\n",
       "      <td id=\"T_22415_row2_col11\" class=\"data row2 col11\" >0.124305</td>\n",
       "      <td id=\"T_22415_row2_col12\" class=\"data row2 col12\" >0.165234</td>\n",
       "      <td id=\"T_22415_row2_col13\" class=\"data row2 col13\" >0.311252</td>\n",
       "      <td id=\"T_22415_row2_col14\" class=\"data row2 col14\" >0.332301</td>\n",
       "      <td id=\"T_22415_row2_col15\" class=\"data row2 col15\" >0.113480</td>\n",
       "      <td id=\"T_22415_row2_col16\" class=\"data row2 col16\" >0.157218</td>\n",
       "      <td id=\"T_22415_row2_col17\" class=\"data row2 col17\" >0.398565</td>\n",
       "      <td id=\"T_22415_row2_col18\" class=\"data row2 col18\" >0.447089</td>\n",
       "      <td id=\"T_22415_row2_col19\" class=\"data row2 col19\" >0.174709</td>\n",
       "      <td id=\"T_22415_row2_col20\" class=\"data row2 col20\" >0.227156</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_22415_level0_row3\" class=\"row_heading level0 row3\" >6</th>\n",
       "      <td id=\"T_22415_row3_col0\" class=\"data row3 col0\" >0.153914</td>\n",
       "      <td id=\"T_22415_row3_col1\" class=\"data row3 col1\" >0.171392</td>\n",
       "      <td id=\"T_22415_row3_col2\" class=\"data row3 col2\" >0.033056</td>\n",
       "      <td id=\"T_22415_row3_col3\" class=\"data row3 col3\" >0.207366</td>\n",
       "      <td id=\"T_22415_row3_col4\" class=\"data row3 col4\" >0.226134</td>\n",
       "      <td id=\"T_22415_row3_col5\" class=\"data row3 col5\" >0.074429</td>\n",
       "      <td id=\"T_22415_row3_col6\" class=\"data row3 col6\" >0.543205</td>\n",
       "      <td id=\"T_22415_row3_col7\" class=\"data row3 col7\" >0.569693</td>\n",
       "      <td id=\"T_22415_row3_col8\" class=\"data row3 col8\" >0.222771</td>\n",
       "      <td id=\"T_22415_row3_col9\" class=\"data row3 col9\" >0.286757</td>\n",
       "      <td id=\"T_22415_row3_col10\" class=\"data row3 col10\" >0.371920</td>\n",
       "      <td id=\"T_22415_row3_col11\" class=\"data row3 col11\" >0.134696</td>\n",
       "      <td id=\"T_22415_row3_col12\" class=\"data row3 col12\" >0.166330</td>\n",
       "      <td id=\"T_22415_row3_col13\" class=\"data row3 col13\" >0.304221</td>\n",
       "      <td id=\"T_22415_row3_col14\" class=\"data row3 col14\" >0.345650</td>\n",
       "      <td id=\"T_22415_row3_col15\" class=\"data row3 col15\" >0.122641</td>\n",
       "      <td id=\"T_22415_row3_col16\" class=\"data row3 col16\" >0.151312</td>\n",
       "      <td id=\"T_22415_row3_col17\" class=\"data row3 col17\" >0.404369</td>\n",
       "      <td id=\"T_22415_row3_col18\" class=\"data row3 col18\" >0.484257</td>\n",
       "      <td id=\"T_22415_row3_col19\" class=\"data row3 col19\" >0.156539</td>\n",
       "      <td id=\"T_22415_row3_col20\" class=\"data row3 col20\" >0.224638</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_22415_level0_row4\" class=\"row_heading level0 row4\" >7</th>\n",
       "      <td id=\"T_22415_row4_col0\" class=\"data row4 col0\" >0.143113</td>\n",
       "      <td id=\"T_22415_row4_col1\" class=\"data row4 col1\" >0.166033</td>\n",
       "      <td id=\"T_22415_row4_col2\" class=\"data row4 col2\" >0.028085</td>\n",
       "      <td id=\"T_22415_row4_col3\" class=\"data row4 col3\" >0.207654</td>\n",
       "      <td id=\"T_22415_row4_col4\" class=\"data row4 col4\" >0.240179</td>\n",
       "      <td id=\"T_22415_row4_col5\" class=\"data row4 col5\" >0.072758</td>\n",
       "      <td id=\"T_22415_row4_col6\" class=\"data row4 col6\" >0.541446</td>\n",
       "      <td id=\"T_22415_row4_col7\" class=\"data row4 col7\" >0.597715</td>\n",
       "      <td id=\"T_22415_row4_col8\" class=\"data row4 col8\" >0.234455</td>\n",
       "      <td id=\"T_22415_row4_col9\" class=\"data row4 col9\" >0.280158</td>\n",
       "      <td id=\"T_22415_row4_col10\" class=\"data row4 col10\" >0.367076</td>\n",
       "      <td id=\"T_22415_row4_col11\" class=\"data row4 col11\" >0.143325</td>\n",
       "      <td id=\"T_22415_row4_col12\" class=\"data row4 col12\" >0.179890</td>\n",
       "      <td id=\"T_22415_row4_col13\" class=\"data row4 col13\" >0.308270</td>\n",
       "      <td id=\"T_22415_row4_col14\" class=\"data row4 col14\" >0.353579</td>\n",
       "      <td id=\"T_22415_row4_col15\" class=\"data row4 col15\" >0.134088</td>\n",
       "      <td id=\"T_22415_row4_col16\" class=\"data row4 col16\" >0.159848</td>\n",
       "      <td id=\"T_22415_row4_col17\" class=\"data row4 col17\" >0.416040</td>\n",
       "      <td id=\"T_22415_row4_col18\" class=\"data row4 col18\" >0.438517</td>\n",
       "      <td id=\"T_22415_row4_col19\" class=\"data row4 col19\" >0.194086</td>\n",
       "      <td id=\"T_22415_row4_col20\" class=\"data row4 col20\" >0.226607</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x7f90fc6c79a0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "mean_shift = du.preprocessing.filter.MeanShift(sensitivity=1e-6)\n",
    "filtered_data = mean_shift(data)\n",
    "display_dataframe(filtered_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "In this particular case the number of features remains roughly the same as most of our features underwent a mean shift through the nucleation process.\n",
    "Below we go ahead and save the filtered DataFrame to disk for the next and final section of the tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "filtered_data.to_hdf(\"lj-filtered-data.h5\", \"data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {
    "nbsphinx": "hidden",
    "tags": []
   },
   "source": [
    "[Previous section](03-collecting-data.ipynb) [Next section](05-detect-events.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (dupin)",
   "language": "python",
   "name": "dupin"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}