Principal Component Analysis: 7 Methods for Dimension Reduction in Scikit-Learn
One of the main challenges of machine learning on large datasets is the high dimensionality of the feature vectors, which makes dimensionality reduction methods highly relevant. The best-known of these is Principal Component Analysis (PCA), whose essence is to reduce the dimension of a dataset while retaining as much "variability", i.e. statistical information, as possible.
PCA is a statistical method that converts high-dimensional data into low-dimensional data by constructing new features that capture as much information about the dataset as possible. Components are ranked by the variance they explain: the direction that accounts for the most variance is the first principal component, the direction responsible for the second-largest variance is the second principal component, and so on. Importantly, the principal components are uncorrelated with each other. In addition to speeding up ML algorithms, PCA makes it possible to visualize data by projecting it into a lower dimension for display in 2D or 3D space.
The popular Python library Scikit-learn includes the sklearn.decomposition.PCA class, implemented as a transformer object that learns a number of components in its fit() method and can then be applied to new data to project it onto those components.
Using the PCA method in the Scikit-Learn library takes two steps:
1. Initialize the PCA class by passing the required number of components to the constructor;
2. Call the fit() method and then transform(), passing them the feature set; transform() returns the data projected onto the requested number of principal components (a sketch follows below).
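Below is a minimal sketch of these two steps on random toy data (the array shape and n_components=3 are arbitrary values chosen only for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Illustrative toy data: 100 samples with 10 features
rng = np.random.RandomState(0)
X = rng.rand(100, 10)

# Step 1: initialize PCA with the desired number of components
pca = PCA(n_components=3)

# Step 2: fit on the feature set, then transform it;
# fit_transform() combines both calls
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance share of each component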
Scikit-learn supports several variations of the PCA method; a brief usage sketch of each follows the list:
• Kernel Principal Component Analysis (KPCA) is a nonlinear dimensionality reduction method that uses a kernel. Kernel PCA was developed to help classify data whose decision boundary is described by a nonlinear function. The idea is to map the data into a higher-dimensional space in which the decision boundary becomes linear. The KernelPCA class in sklearn.decomposition supports different kernels: linear, polynomial (poly), Gaussian radial basis function (rbf), sigmoid, etc. The default is linear, which is suitable if the data is linearly separable.
• Sparse PCA is a sparse version of PCA whose goal is to extract the set of sparse components that best reconstructs the data. Components extracted by ordinary PCA are typically dense, i.e. they are linear combinations of the original variables with mostly nonzero coefficients, which makes the results hard to interpret. In practice, real principal components can often be represented more naturally as sparse vectors; in face recognition, for example, they can correspond to parts of faces.
• Incremental Principal Component Analysis (IPCA) is used when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation of the input data using an amount of memory that does not depend on the number of input samples. It still depends on the number of input features, but changing the batch size allows memory usage to be controlled.
• Fast Independent Component Analysis (FastICA) is used to estimate independent sources from noisy mixed measurements and to reconstruct them, something classical PCA cannot do because it does not work with non-Gaussian processes.
• Linear Discriminant Analysis (LDA), like classical PCA, is a linear transformation method. But PCA is unsupervised, i.e. it ignores class labels, whereas LDA is a supervised method used to separate two or more classes. LDA performs supervised dimensionality reduction by projecting the input data onto a linear subspace of the directions that maximize separation between classes; the output dimension is necessarily less than the number of classes. In scikit-learn, LDA is implemented by LinearDiscriminantAnalysis, where the n_components parameter specifies the number of components to return.
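A minimal KernelPCA sketch on the classic concentric-circles example (the gamma value here is an arbitrary illustration, not a tuned setting):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel maps the data to a space where the classes separate linearly
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)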
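A minimal SparsePCA sketch on the built-in digits dataset (n_components and alpha are arbitrary illustrative values):

from sklearn.datasets import load_digits
from sklearn.decomposition import SparsePCA

X, _ = load_digits(return_X_y=True)

# alpha controls sparsity: larger values force more coefficients to zero
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_spca = spca.fit_transform(X)

# Most loadings are exactly zero, making the components easier to interpret
print((spca.components_ == 0).mean())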
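A minimal IncrementalPCA sketch; the random batches below stand in for chunks of a dataset read from disk, and the batch and component sizes are arbitrary:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=3, batch_size=50)

rng = np.random.RandomState(0)
for _ in range(10):
    batch = rng.rand(50, 10)   # pretend each batch is loaded from disk
    ipca.partial_fit(batch)    # only one batch is in memory at a time

print(ipca.transform(rng.rand(5, 10)).shape)  # (5, 3)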
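A minimal FastICA sketch that mixes two synthetic non-Gaussian sources (a sine and a sawtooth; the mixing matrix and noise level are made up for illustration) and recovers them:

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.mod(t, 1)]  # independent non-Gaussian sources
S += 0.02 * rng.normal(size=S.shape)    # measurement noise
A = np.array([[1.0, 0.5], [0.5, 2.0]])  # mixing matrix
X = S @ A.T                             # observed mixed signals

# FastICA estimates statistically independent sources from the mixture
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)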
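A minimal LinearDiscriminantAnalysis sketch on the built-in iris dataset (note that the class lives in sklearn.discriminant_analysis, not sklearn.decomposition):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, at most 3 - 1 = 2 discriminant components are available
lda = LinearDiscriminantAnalysis(n_components=2)

# Unlike PCA, fit() uses the class labels y (supervised reduction)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)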