dupin.preprocessing.filter#

• Correlated – Filter out dimensions that are highly correlated with each other.

• MeanShift – Filter out dimensions that don't undergo a significant shift in mean.

• jump_size_importance – Rank features based on the size of the relative difference between ends.

• local_smoothness_importance – Rank features based on how well a spaced LSQ spline fits the feature.

• mean_shift_importance – Rank features based on how strong a mean shift they have.

• noise_importance – Rank features based on how the standard deviation compares to the mean.

Details

Feature selection schemes.

This module provides feature selection schemes distinct from those in packages like scikit-learn. Those packages can also be used for feature selection alongside dupin.
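
A minimal end-to-end sketch of how these filters and importance functions might be combined; the synthetic signal, random seed, and parameter values below are illustrative assumptions rather than recommendations from the package.

```python
import numpy as np
from dupin.preprocessing.filter import Correlated, MeanShift, mean_shift_importance

rng = np.random.default_rng(0)
n_samples = 200
base = rng.normal(0.0, 0.1, size=(n_samples, 1))
# Six synthetic features: columns 0-2 and 3-4 form two correlated groups,
# column 5 is independent noise.
signal = np.hstack(
    [
        base + 0.05 * rng.normal(size=(n_samples, 3)),
        -base + 0.05 * rng.normal(size=(n_samples, 2)),
        rng.normal(size=(n_samples, 1)),
    ]
)
signal[n_samples // 2 :, :5] += 1.0  # mean shift in the first five features

# 1. Remove features without a statistically significant mean shift.
mean_shift = MeanShift(sensitivity=0.01)
shifted = mean_shift(signal, sample_size=0.1)

# 2. Rank the surviving features by how unlikely their shift is to be noise.
importance = mean_shift_importance(mean_shift.likelihoods_[mean_shift.filter_])

# 3. Keep one representative feature per correlated cluster.
correlated = Correlated(max_clusters=3)
reduced = correlated(shifted, features_per_cluster=1, feature_importance=importance)
print(reduced.shape)
```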

class dupin.preprocessing.filter.Correlated(method='spectral', correlation='pearson', max_clusters=10, method_args=(), method_kwargs=None)[source]#

Filter out dimensions that are highly correlated with each other.

The filter computes the chosen feature correlation matrix and clusters the features based on the distance or similarity matrix, depending on the specified clustering method. The number of clusters is determined by the minimum average silhouette score across the numbers of clusters tested. A set number of features from each cluster is then chosen, either by the provided feature importance or at random.

Parameters:
  • method (str, optional) – The method to use. Current options are “spectral”. Defaults to “spectral”.

  • correlation (str, optional) – The correlation type to use for computing similarity and distance matrices. Currently supported options are “pearson”. Defaults to “pearson”.

  • max_clusters (int, optional) – The maximum number of clusters to try. Defaults to 10.

  • method_args (tuple, optional) – Any positional arguments to pass to the selected method’s construction.

  • method_kwargs (dict [str, any ], optional) – Any keyword arguments to pass to the selected method’s construction.

method#

The method to use. Current options are “spectral”.

Type:

str

correlation#

The correlation type to use for computing similarity and distance matrices. Currently supported options are “pearson”.

Type:

str

max_clusters#

The maximum number of clusters to try. Defaults to 10.

Type:

int

n_clusters_#

The determined optimal number of clusters.

Type:

int

labels_#

The cluster labels for the best performing number of clusters.

Type:

\((N_{features},)\) numpy.ndarray of int

scores_#

The average silhouette scores for each number of clusters tried, starting at two clusters.

Type:

\((N - 2,)\) numpy.ndarray of float

filter_#

The array of features selected.

Type:

\((N_{features})\) numpy.ndarray of bool

__call__(signal, features_per_cluster=1, return_filter=False, feature_importance=None)[source]#

Filter out correlated features.

Parameters:
  • signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The signal to filter dimensions from.

  • features_per_cluster (int, optional) – The number of features to keep per cluster. Defaults to 1.

  • return_filter (bool, optional) – Whether to return the features selected or not. Defaults to False.

  • feature_importance (\((N_{features},)\) numpy.ndarray of float , optional) – The importances of each feature. This determines which feature(s) from each cluster are chosen. If not provided, random importances are used.

Returns:

By default returns the filtered data with features deemed insignificant removed. If return_filter is True, the Boolean array filtering features is returned.

Return type:

\((N_{samples}, N_{filtered})\) numpy.ndarray of float or \((N_{features})\) numpy.ndarray of bool

__init__(method='spectral', correlation='pearson', max_clusters=10, method_args=(), method_kwargs=None)[source]#
__weakref__#

list of weak references to the object
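
A minimal usage sketch for Correlated; the two-group synthetic signal, seed, and max_clusters value are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import Correlated

rng = np.random.default_rng(42)
n_samples = 100
base_a = rng.normal(size=(n_samples, 1))
base_b = rng.normal(size=(n_samples, 1))
# Features 0-1 track base_a, features 2-3 track base_b, feature 4 is independent.
signal = np.hstack(
    [
        base_a + 0.01 * rng.normal(size=(n_samples, 2)),
        base_b + 0.01 * rng.normal(size=(n_samples, 2)),
        rng.normal(size=(n_samples, 1)),
    ]
)

correlated = Correlated(max_clusters=4)
# Ask for the Boolean filter instead of the filtered data.
mask = correlated(signal, features_per_cluster=1, return_filter=True)
filtered = signal[:, mask]
print(correlated.n_clusters_, mask, filtered.shape)
```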

class dupin.preprocessing.filter.MeanShift(sensitivity)[source]#

Filter out dimensions that don’t undergo a significant shift in mean.

The filter computes the mean and standard deviation of both ends of the signal and determines whether the difference in means between the two ends is statistically significant, as judged by sensitivity. The filter assumes Gaussian noise.

Parameters:

sensitivity (float, optional) – The threshold likelihood that one end's mean was drawn from the Gaussian approximation of the other end. In other words, the lower the value, the more confident the filter must be that the difference in means is not random. Defaults to 0.01.

sensitivity#

The threshold likelihood that one end's mean was drawn from the Gaussian approximation of the other end. In other words, the lower the value, the more confident the filter must be that the difference in means is not random.

Type:

float

mean_shifts_#

The maximum number of standard deviations between the means of the two ends of the last computed signal.

Type:

\((N_{features},)\) numpy.ndarray of float

likelihoods_#

The likelihood that such a mean shift would be observed in a Gaussian with the given mean and standard deviation. The likelihood discounts signal length.

Type:

\((N_{features},)\) numpy.ndarray of float

filter_#

The array of features selected.

Type:

\((N_{features})\) numpy.ndarray of bool

__call__(signal, sample_size=0.1, return_filter=False)[source]#

Filter dimensions without a detected mean shift.

Parameters:
  • signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The signal to filter dimensions from.

  • sample_size (float or int, optional) – Either the fraction of the overall signal used to evaluate the statistics of each end of the signal, or the number of data points to use on each end. Defaults to 0.1. If this would result in fewer than three data points, three are used.

  • return_filter (bool, optional) – Whether to return the Boolean array filter rather than the filtered data. Defaults to False.

Returns:

By default returns the filtered data with features deemed insignificant removed. If return_filter is True, the Boolean array filtering features is returned.

Return type:

\((N_{samples}, N_{filtered})\) numpy.ndarray of float or \((N_{features})\) numpy.ndarray of bool

__init__(sensitivity)[source]#
__weakref__#

list of weak references to the object
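
A minimal usage sketch for MeanShift; the synthetic step signal and the sensitivity and sample_size values are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import MeanShift

rng = np.random.default_rng(7)
n_samples = 200
# Feature 0 shifts its mean halfway through; feature 1 is stationary noise.
shifting = np.concatenate(
    [rng.normal(0.0, 0.1, n_samples // 2), rng.normal(1.0, 0.1, n_samples // 2)]
)
stationary = rng.normal(0.0, 0.1, n_samples)
signal = np.column_stack([shifting, stationary])

mean_shift = MeanShift(sensitivity=0.01)
filtered = mean_shift(signal, sample_size=0.1)
print(filtered.shape)           # only the shifting feature should remain
print(mean_shift.likelihoods_)  # per-feature likelihood that the shift is random
```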

dupin.preprocessing.filter.jump_size_importance(signal, n_end=3)[source]#

Rank features based on the size of the relative difference between ends.

Parameters:
  • signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.

  • n_end (int, optional) – The number of samples to average on each end of the signal when computing the jump from one end to the other. Defaults to 3.

Returns:

feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates a larger relative jump or drop between the ends of the signal.

Return type:

\((N_{features})\) numpy.ndarray of float
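
A minimal usage sketch for jump_size_importance; the synthetic ramp features and seed are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import jump_size_importance

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
# Feature 0 changes strongly between its ends; feature 1 barely changes.
signal = np.column_stack(
    [1.0 + 10 * x + rng.normal(0, 0.1, 100), 1.0 + 0.1 * x + rng.normal(0, 0.1, 100)]
)
importance = jump_size_importance(signal, n_end=3)
print(importance)  # feature 0 should rank higher than feature 1
```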

dupin.preprocessing.filter.local_smoothness_importance(signal, dim=1, spacing=None)[source]#

Rank features based on how well a spaced LSQ spline fits the feature.

Uses the negative MSE projected to a range of \([0, 1]\).

Parameters:
  • signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.

  • dim (int, optional) – The dimension of spline to use, defaults to 1.

  • spacing (int, optional) – The number of additional spaces beyond the spline dimension to use when spacing knots, defaults to None. When None, \(\lceil d / 2 \rceil\) is used.

Returns:

feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates that the fit was better.

Return type:

\((N_{features})\) numpy.ndarray of float
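
A minimal usage sketch for local_smoothness_importance; the smooth versus noisy synthetic features are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import local_smoothness_importance

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 200)
# Feature 0 is a smooth curve; feature 1 is dominated by noise.
signal = np.column_stack([np.sin(x), rng.normal(size=200)])
importance = local_smoothness_importance(signal, dim=1)
print(importance)  # the smooth feature should score closer to 1
```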

dupin.preprocessing.filter.mean_shift_importance(likelihoods)[source]#

Rank features based on how strong a mean shift they have.

Parameters:

likelihoods (\((N_{features})\) numpy.ndarray of float) – The likelihoods computed by a MeanShift object, i.e. the likelihood that each feature's apparent mean shift happened by chance.

Returns:

feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates that the likelihood was lower.

Return type:

\((N_{features})\) numpy.ndarray of float
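
A minimal sketch showing how mean_shift_importance can consume the likelihoods computed by a MeanShift instance; the synthetic signal and parameter values are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import MeanShift, mean_shift_importance

rng = np.random.default_rng(3)
half = 100
signal = np.column_stack(
    [
        np.concatenate([rng.normal(0, 0.1, half), rng.normal(1, 0.1, half)]),  # shifts
        rng.normal(0, 0.1, 2 * half),                                          # stationary
    ]
)

mean_shift = MeanShift(sensitivity=0.01)
mean_shift(signal, sample_size=0.1)  # populates likelihoods_
importance = mean_shift_importance(mean_shift.likelihoods_)
print(importance)  # the shifting feature should rank higher
```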

dupin.preprocessing.filter.noise_importance(signal, window_size)[source]#

Rank features based on how the standard deviation compares to the mean.

Uses the rolling standard deviation divided by the rolling mean, ignoring windows where the mean is zero.

Parameters:
  • signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.

  • window_size (int) – The size of rolling window to use.

Returns:

feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates the standard deviation relative to the mean is low across the feature.

Return type:

\((N_{features})\) numpy.ndarray of float
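
A minimal usage sketch for noise_importance; the low-noise versus high-noise synthetic features and the window size are illustrative assumptions.

```python
import numpy as np
from dupin.preprocessing.filter import noise_importance

rng = np.random.default_rng(5)
x = np.linspace(1, 2, 150)
# Feature 0 fluctuates little about its mean; feature 1 is much noisier.
signal = np.column_stack(
    [x + rng.normal(0, 0.01, 150), x + rng.normal(0, 0.5, 150)]
)
importance = noise_importance(signal, window_size=10)
print(importance)  # the low-noise feature should rank higher
```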