dupin.preprocessing.filter#
Filter out dimensions that are highly correlated with each other.
Filter out dimensions that don't undergo a significant shift in mean.
Rank features based on the size of the relative difference between ends.
Rank features based on how well a spaced LSQ spline fits the feature.
Rank features based on how strong of a mean shift they have.
Rank features based on how standard deviation compares to the mean.
Details
Feature selection schemes.
This provides feature selection schemes distinct from packages like scikit-learn. These packages can easily be used as well for feature selection.
Filter out dimensions that are highly correlated with each other.
The filter computes the chosen feature correlation matrix, and clusters the features based on the distance or similarity matrix depending on the specified clustering method. The number of clusters is determined by the minimum average silhouette score for each number of clusters tested. Then a set number of features from each cluster is chosen through provided feature importance or randomly.
- Parameters:
method (str, optional) – The method to use. Current options are “spectral”. Defaults to “spectral”.
correlation (str, optional) – The correlation type to use for computing similarity and distance matrices. Currently supported options are “pearson”. Defaults to “pearson”.
max_clusters (int, optional) – The maximum number of clusters to try. Defaults to 10.
method_args (tuple, optional) – Any positional arguments to pass to the selected method’s construction.
method_kwargs (dict[str, any], optional) – Any keyword arguments to pass to the selected method’s construction.
The method to use. Current options are “spectral”.
- Type:
The correlation type to use for computing similarity and distance matrices. Currently supported options are “pearson”.
- Type:
The maximum number of clusters to try. Defaults to 10.
- Type:
The determined optimal number of clusters.
- Type:
The cluster labels for the best performing number of clusters.
- Type:
\((N_{features},)\) numpy.ndarray of int
The scores for each number of clusters tried. Starts at 2.
- Type:
\((N - 2,)\) numpy.ndarray of float
The array of features selected.
- Type:
\((N_{features})\) numpy.ndarray of bool
Filter out correlated features.
- Parameters:
signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The signal to filter dimensions from.
features_per_cluster (int, optional) – The number of features to keep per cluster. Defaults to 1.
return_filter (bool, optional) – Whether to return the features selected or not. Defaults to False.
feature_importance (\((N_{features},)\) numpy.ndarray of float, optional) – The importances of each feature. This determines which feature(s) from each cluster are chosen. If not provided, random importances are used.
- Returns:
By default returns the filtered data with features deemed insignificant removed. If return_filter is True, the Boolean array filtering features is returned.
- Return type:
\((N_{samples}, N_{filtered})\) numpy.ndarray of float or \((N_{features})\) numpy.ndarray of bool
list of weak references to the object
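The last step described above, choosing a set number of features from each cluster by importance or at random, can be sketched with NumPy alone. This is an illustration under stated assumptions, not dupin's implementation: the function name `select_per_cluster` and its `seed` argument are hypothetical, and the cluster labels are taken as given rather than computed by spectral clustering.

```python
import numpy as np


def select_per_cluster(labels, feature_importance=None,
                       features_per_cluster=1, seed=None):
    """Pick the most important features from each cluster (hypothetical helper)."""
    labels = np.asarray(labels)
    if feature_importance is None:
        # Assumption: with no importances given, draw random ones.
        feature_importance = np.random.default_rng(seed).random(labels.size)
    importance = np.asarray(feature_importance, dtype=float)

    keep = np.zeros(labels.size, dtype=bool)
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Sort this cluster's members from most to least important.
        order = np.argsort(importance[members])[::-1]
        keep[members[order[:features_per_cluster]]] = True
    return keep


labels = np.array([0, 0, 1, 1, 1])
importance = np.array([0.1, 0.9, 0.5, 0.2, 0.7])
mask = select_per_cluster(labels, importance)
```

The resulting Boolean mask plays the same role as the array returned when return_filter is True: indexing the signal's columns with it keeps one feature per cluster.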
- class dupin.preprocessing.filter.MeanShift(sensitivity)[source]#
Filter out dimensions that don’t undergo a significant shift in mean.
The filter computes the mean and standard deviation of both ends of the signal, and determines whether the mean of either end differs from the other by a statistically significant amount (judged by sensitivity). The filter assumes Gaussian noise.
- Parameters:
sensitivity (float, optional) – The minimum likelihood to require that the mean of one end of the signal is drawn from the Gaussian approximation of the other end. In other words, the lower the number, the higher the probability that the difference in means is not random. Defaults to 0.01.
- sensitivity#
The minimum likelihood to require that the mean of one end of the signal is drawn from the Gaussian approximation of the other end. In other words, the lower the number, the higher the probability that the difference in means is not random.
- Type:
- mean_shifts_#
The maximum number of standard deviations between the means of the two ends of the last computed signal.
- Type:
\((N_{features},)\) numpy.ndarray of float
- likelihoods_#
The likelihood that such a mean shift would be observed in a Gaussian with the given mean and standard deviation. The likelihood discounts signal length.
- Type:
\((N_{features},)\) numpy.ndarray of float
- filter_#
The array of features selected.
- Type:
\((N_{features})\) numpy.ndarray of bool
- __call__(signal, sample_size=0.1, return_filter=False)[source]#
Filter dimensions without a detected mean shift.
- Parameters:
signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The signal to filter dimensions from.
sample_size (float or int, optional) – Either the fraction of the overall signal to use to evaluate the statistics of each end of the signal, or the number of data points to use on each end of the signal for statistics. Defaults to 0.1. If this would result in fewer than three data points, three will be used.
return_filter (bool, optional) – Whether to return the Boolean array filter rather than the filtered data. Defaults to False.
- Returns:
By default returns the filtered data with features deemed insignificant removed. If return_filter is True, the Boolean array filtering features is returned.
- Return type:
\((N_{samples}, N_{filtered})\) numpy.ndarray of float or \((N_{features})\) numpy.ndarray of bool
- __weakref__#
list of weak references to the object
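The end-to-end comparison described for MeanShift can be illustrated with a small NumPy sketch. It is not dupin's implementation: the name `mean_shift_sketch`, the use of a two-sided Gaussian tail probability, and the handling of zero standard deviations are all assumptions made for illustration.

```python
import math

import numpy as np


def mean_shift_sketch(signal, sample_size=0.1, sensitivity=0.01):
    """Illustrative end-to-end mean comparison (hypothetical, not dupin's code)."""
    signal = np.asarray(signal, dtype=float)
    n_samples = signal.shape[0]
    # A float is read as a fraction of the signal, an int as a point count;
    # never use fewer than three points per end.
    if isinstance(sample_size, float):
        n_end = int(sample_size * n_samples)
    else:
        n_end = int(sample_size)
    n_end = max(n_end, 3)

    start, end = signal[:n_end], signal[-n_end:]
    diff = np.abs(start.mean(axis=0) - end.mean(axis=0))

    # Distance between the end means in units of each end's standard
    # deviation; keep the larger of the two values per feature.
    with np.errstate(divide="ignore", invalid="ignore"):
        shifts = np.maximum(diff / start.std(axis=0), diff / end.std(axis=0))
    shifts = np.nan_to_num(shifts, nan=0.0, posinf=np.inf)

    # Assumed likelihood: two-sided Gaussian tail probability of the shift.
    likelihoods = np.array([math.erfc(s / math.sqrt(2.0)) for s in shifts])
    keep = likelihoods <= sensitivity
    return signal[:, keep], keep, likelihoods


n = 200
wiggle = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
flat = wiggle                                            # no mean shift
step = np.where(np.arange(n) < n // 2, 0.0, 5.0) + 0.1 * wiggle
filtered, keep, likelihoods = mean_shift_sketch(np.column_stack([flat, step]))
```

In this toy signal only the second feature jumps between its two ends, so it is the only column that survives the filter.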
- dupin.preprocessing.filter.jump_size_importance(signal, n_end=3)[source]#
Rank features based on the size of the relative difference between ends.
- Parameters:
signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.
n_end (int, optional) – The number of indices to take on either end to compute the mean to determine the jump from one end to the other.
- Returns:
feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates the relative magnitude of the jump or drop between signal ends is larger.
- Return type:
\((N_{features})\) numpy.ndarray of float
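As a rough illustration of ranking by relative end-to-end jump, the sketch below averages n_end points at each end and scales the absolute difference by the larger end magnitude. The exact normalization dupin uses is not stated here, so the scaling and the name `jump_size_sketch` are assumptions.

```python
import numpy as np


def jump_size_sketch(signal, n_end=3):
    """Hypothetical ranking by relative jump between signal ends."""
    signal = np.asarray(signal, dtype=float)
    start = signal[:n_end].mean(axis=0)
    end = signal[-n_end:].mean(axis=0)

    # Relative jump: absolute difference over the larger end magnitude,
    # with features that are zero at both ends given a jump of zero.
    scale = np.maximum(np.abs(start), np.abs(end))
    safe_scale = np.where(scale > 0, scale, 1.0)
    jump = np.where(scale > 0, np.abs(end - start) / safe_scale, 0.0)

    # Rescale so the largest relative jump maps to 1.
    top = jump.max()
    return jump / top if top > 0 else jump


doubling = np.concatenate([np.ones(10), 2.0 * np.ones(10)])
constant = np.ones(20)
tenfold = np.concatenate([np.ones(10), 10.0 * np.ones(10)])
importance = jump_size_sketch(np.column_stack([doubling, constant, tenfold]))
```

The constant feature receives the lowest ranking and the tenfold jump the highest, matching the ordering described in the Returns entry above.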
- dupin.preprocessing.filter.local_smoothness_importance(signal, dim=1, spacing=None)[source]#
Rank features based on how well a spaced LSQ spline fits the feature.
Uses the negative MSE projected to a range of \([0, 1]\).
- Parameters:
signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.
dim (int, optional) – The dimension of spline to use, defaults to 1.
spacing (int, optional) – The number of spaces beyond the dimension to space knots, defaults to None. When None, the behavior is \(\lceil d / 2 \rceil\).
- Returns:
feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates that the fit was better.
- Return type:
\((N_{features})\) numpy.ndarray of float
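The idea of scoring a feature by how well a least-squares spline fits it can be sketched for the degree-1 case with NumPy only. This is a stand-in, not dupin's implementation: it uses a fixed number of evenly spaced knots (the hypothetical `n_knots` argument) rather than the spacing rule above, and a simple min-max projection of the negative MSE onto \([0, 1]\).

```python
import numpy as np


def linear_spline_mse(y, n_knots):
    """MSE of a least-squares piecewise-linear fit with evenly spaced knots."""
    x = np.arange(len(y), dtype=float)
    knots = np.linspace(x[0], x[-1], n_knots)
    width = knots[1] - knots[0]
    # Hat (degree-1 B-spline) basis: each column peaks at one knot.
    basis = np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / width, 0.0, None)
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return float(np.mean((basis @ coef - y) ** 2))


def smoothness_sketch(signal, n_knots=5):
    """Hypothetical ranking: better spline fit gives higher importance."""
    signal = np.asarray(signal, dtype=float)
    score = -np.array([linear_spline_mse(col, n_knots) for col in signal.T])
    span = score.max() - score.min()
    if span == 0:
        return np.ones_like(score)
    # Min-max projection of the negative MSE onto [0, 1].
    return (score - score.min()) / span


n = 50
ramp = np.linspace(0.0, 1.0, n)                          # fits a line exactly
zigzag = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)      # fits poorly
importance = smoothness_sketch(np.column_stack([ramp, zigzag]))
```

The smooth ramp ranks above the rapidly alternating feature, since a coarse spline cannot follow sample-to-sample oscillation.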
- dupin.preprocessing.filter.mean_shift_importance(likelihoods)[source]#
Rank features based on how strong of a mean shift they have.
- Parameters:
likelihoods (\((N_{features})\) numpy.ndarray of float) – The likelihoods given from a MeanShift object or the likelihood that the given feature’s signal happened by chance.
- Returns:
feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates that the likelihood was lower.
- Return type:
\((N_{features})\) numpy.ndarray of float
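One way to turn likelihoods into a 0-to-1 ranking where lower likelihoods score higher is to min-max scale the negative log-likelihood. The mapping dupin actually applies is not specified here, so the log scaling, the `floor` argument, and the name `likelihood_ranking_sketch` are assumptions for illustration.

```python
import numpy as np


def likelihood_ranking_sketch(likelihoods, floor=1e-300):
    """Hypothetical ranking: lower likelihood gives higher importance."""
    # Clip to avoid log(0) for likelihoods that underflow to zero.
    lik = np.clip(np.asarray(likelihoods, dtype=float), floor, 1.0)
    score = -np.log(lik)
    span = score.max() - score.min()
    if span == 0:
        return np.ones_like(score)
    return (score - score.min()) / span


importance = likelihood_ranking_sketch([0.5, 1e-6, 0.01])
```

Here the feature with likelihood 1e-6 ranks highest and the one with likelihood 0.5 ranks lowest.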
- dupin.preprocessing.filter.noise_importance(signal, window_size)[source]#
Rank features based on how standard deviation compares to the mean.
Uses the rolling standard deviation over mean ignoring a mean of zero.
- Parameters:
signal (\((N_{samples}, N_{features})\) numpy.ndarray of float) – The potentially multidimensional signal.
window_size (int) – The size of rolling window to use.
- Returns:
feature_importance – Feature rankings from 0 to 1 (higher is more important), for all features. A higher ranking indicates the standard deviation relative to the mean is low across the feature.
- Return type:
\((N_{features})\) numpy.ndarray of float
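The rolling standard deviation over mean can be sketched with NumPy's sliding-window view. This is an illustration rather than dupin's implementation: averaging the ratio over windows and the min-max inversion into a ranking are assumptions, as is the name `noise_sketch`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def noise_sketch(signal, window_size):
    """Hypothetical ranking: lower relative noise gives higher importance."""
    signal = np.asarray(signal, dtype=float)
    # Shape (N_samples - window_size + 1, N_features, window_size).
    windows = sliding_window_view(signal, window_size, axis=0)
    std = windows.std(axis=-1)
    mean = windows.mean(axis=-1)

    # Rolling std / |mean|, skipping windows whose mean is exactly zero.
    valid = mean != 0
    ratio = np.zeros_like(std)
    np.divide(std, np.abs(mean), out=ratio, where=valid)
    noise = ratio.sum(axis=0) / np.maximum(valid.sum(axis=0), 1)

    span = noise.max() - noise.min()
    if span == 0:
        return np.ones_like(noise)
    # Invert so the quietest feature maps to 1 and the noisiest to 0.
    return (noise.max() - noise) / span


n = 40
wiggle = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
quiet = 10.0 + 0.01 * wiggle
noisy = 10.0 + 5.0 * wiggle
importance = noise_sketch(np.column_stack([quiet, noisy]), window_size=4)
```

The feature whose fluctuations are small relative to its mean receives the higher ranking.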