Mean centering in the PCA method
There are several views on what dimensionality reduction is for. Andrew Ng (Wu Enda) says in his video course that dimensionality reduction is used for data compression, which reduces noise, speeds up computation, and lowers memory use. When the data is reduced to 2 or 3 dimensions, it can also be visualized for data analysis. He advises against using dimensionality reduction to prevent overfitting, because it can easily discard important features that are related to the labels.
Besides memory use, is there another reason to compress data? Yes, the "curse of dimensionality": the higher the dimension, the sparser the data becomes along each feature dimension, which is close to disastrous for machine learning algorithms. In the end each sample may look unique, so no shared pattern emerges that separates positive cases from negative cases. There is another case: when there are more features than samples, some classification algorithms (such as SVM) stop working well, which follows from how those algorithms work.
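The sparsity argument can be made concrete with a small calculation. The sketch below assumes data spread uniformly over a unit hypercube (an assumption made here for illustration, not something the text states) and asks how long the edge of a sub-cube must be to contain 10% of the volume. The function name `edge_needed` is hypothetical.

```python
def edge_needed(fraction: float, dims: int) -> float:
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    # Volume of a cube with edge e in `dims` dimensions is e ** dims,
    # so e = fraction ** (1 / dims).
    return fraction ** (1.0 / dims)


for dims in (1, 2, 10, 100):
    print(dims, round(edge_needed(0.10, dims), 3))
```

In 1 dimension the sub-cube spans 10% of the axis, but in 10 dimensions it has to span roughly 79% of every axis to capture the same share of the data, so a "local" neighborhood is no longer local.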
Data dimensionality reduction methods:
Linear dimensionality reduction methods:
principal component analysis (PCA) and linear discriminant analysis (LDA)
Understanding of PCA:
1. PCA can be understood as projecting high-dimensional data onto a lower-dimensional space so that the projection error is minimized. It is an unsupervised method.
2. It can also be understood as a translation and rotation of the coordinate system (corresponding to centering and coordinate transformation), so that data in n-dimensional space can be analyzed in n-1 dimensions by dropping the directions with small variance (small variance means small uncertainty and little information).
3. Derivation of PCA
4. Connection between PCA and SVD
(understanding PCA from the perspective of matrix decomposition)
5. Application of PCA dimensionality reduction
6. Disadvantages of PCA:
(1) PCA is a linear dimensionality reduction method. Sometimes the nonlinear relationships in the data are very important, and in that case PCA gives very poor results. Kernel PCA, which applies the kernel method to PCA, is aimed at this case.
(2) Principal component analysis is most effective when the sample points follow a Gaussian distribution.
(3) Standard PCA does not account for class imbalance. Cost-sensitive PCA (CSPCA) can be used to reduce the dimension of imbalanced data.
(4) The size of an eigenvalue determines how much information we treat as interesting. In other words, small eigenvalues are usually assumed to represent noise, but the projections onto the directions with small eigenvalues may also contain the data we are interested in.
(5) The eigenvector directions are forced to be orthogonal, which makes PCA vulnerable to outliers.
(6) The results are hard to interpret, for example when building a linear regression model to analyze what drives the dependent variable, because each principal component mixes several original features.
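The connection between PCA and SVD mentioned in item 4 can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set invented for illustration: the eigenvalues of the covariance matrix should equal the squared singular values of the centered data divided by n - 1, and the eigenvectors should match the right singular vectors up to sign.

```python
import numpy as np

# Hypothetical data: 50 samples, 3 correlated attributes (columns).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0],
                   [0.0, 0.5, 0.5]])
X = rng.normal(size=(50, 3)) @ mixing

Xc = X - X.mean(axis=0)          # center each column
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
eigvals = eigvals[::-1]          # eigh returns ascending order; flip to descending
eigvecs = eigvecs[:, ::-1]

# Route 2: SVD of the centered data matrix itself.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(S**2 / (n - 1), eigvals))        # same variances
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))  # same directions up to sign
```

The SVD route never forms the covariance matrix explicitly, which is one reason it is often preferred in practice.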
Data centering
That is, set the mean of each attribute to 0. (Muyang gives his own source code below. In Muyang's data, columns represent attributes, so this step sets the mean of each column to 0.)
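Independently of the author's own code, the centering step can be sketched in a few lines of NumPy. The helper name `center_columns` and the toy matrix are assumptions made for illustration; columns are treated as attributes, as described above.

```python
import numpy as np


def center_columns(X):
    """Subtract each column's mean so that every attribute has mean 0."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)    # axis=0 averages over rows, one mean per column


# Toy data: 3 samples (rows), 2 attributes (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Xc = center_columns(X)
print(Xc)
print(Xc.mean(axis=0))           # each column mean is now 0 (up to rounding)
```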
Calculate the covariance matrix from the centered matrix
A covariance value falls into one of three cases. A value of 0 means the two attributes are linearly uncorrelated (this alone does not guarantee that they are independent).
A positive value means the attributes are positively correlated. If attribute A and attribute B are positively correlated, then B tends to increase when A increases and to decrease when A decreases.
A negative value means the attributes are negatively correlated. If attribute C and attribute D are negatively correlated, then D tends to decrease when C increases and to increase when C decreases.
Therefore the covariance matrix can be read much like a correlation coefficient matrix: it indicates how strongly the attributes are related, although covariance values depend on each attribute's scale while correlation coefficients are normalized to [-1, 1].
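The sign cases above can be seen on a small example. This is a minimal sketch with made-up data, assuming NumPy: attribute b rises with a, attribute c falls as a rises, and the covariance matrix is rescaled into a correlation matrix to show how the two are related.

```python
import numpy as np

# Hypothetical toy data: columns are attributes a, b, c.
X = np.array([[1.0, 2.0, 9.0],
              [2.0, 4.1, 7.0],
              [3.0, 5.9, 5.2],
              [4.0, 8.0, 3.0]])

Xc = X - X.mean(axis=0)                  # centered data
n = Xc.shape[0]
cov = Xc.T @ Xc / (n - 1)                # sample covariance matrix (3 x 3)

# Correlation is covariance rescaled by the standard deviations.
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print(cov[0, 1] > 0)   # a and b move together
print(cov[0, 2] < 0)   # a and c move in opposite directions
print(np.round(corr, 3))
```

The result agrees with `np.cov(X, rowvar=False)` and `np.corrcoef(X, rowvar=False)`, which compute the same quantities directly.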
Calculate the eigenvalue matrix from the covariance matrix
The eigenvalue matrix is diagonal: only its diagonal elements are nonzero, and all elements above and below the diagonal are 0.
Calculate the eigenvector corresponding to each eigenvalue
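In NumPy both quantities come out of a single call. The sketch below uses a hypothetical 2 x 2 covariance matrix and `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order; it then checks the defining relation between the covariance matrix, the eigenvectors, and the diagonal eigenvalue matrix.

```python
import numpy as np

# Hypothetical covariance matrix (symmetric by construction).
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are eigenvectors
Lambda = np.diag(eigvals)                # diagonal eigenvalue matrix

print(eigvals)
print(Lambda)
# Defining relation: cov @ V = V @ Lambda
print(np.allclose(cov @ eigvecs, eigvecs @ Lambda))
```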
Sort the eigenvalues in descending order and set a threshold. If the sum of the first i eigenvalues is >= the threshold, there are i principal components, and the corresponding eigenvectors are selected to form the principal component vector matrix.
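The selection rule can be sketched as follows. This version assumes the threshold is expressed as a fraction of the total variance (for example 0.95), which is one common reading of the rule and not necessarily the author's; the function name `choose_num_components` is hypothetical.

```python
import numpy as np


def choose_num_components(eigvals, threshold=0.95):
    """Smallest i such that the top-i eigenvalues reach `threshold` of the total."""
    ordered = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    cumulative = np.cumsum(ordered) / ordered.sum()             # running share
    # First position where the running share is >= threshold.
    idx = int(np.searchsorted(cumulative, threshold))
    return min(idx + 1, len(ordered))


eigvals = [0.5, 4.0, 0.3, 0.2]            # hypothetical eigenvalues, unsorted
print(choose_num_components(eigvals, threshold=0.95))
print(choose_num_components(eigvals, threshold=0.75))
```

With these values the running shares are 0.8, 0.9, 0.96, and 1.0, so a 0.95 threshold keeps three components and a 0.75 threshold keeps one.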
The dimension-reduced matrix is obtained by multiplying the original matrix by the transpose of the principal component vector matrix.
For example, if the original data is a 150 * 4 matrix and two principal components are obtained in the selection step above, then the principal component matrix is a 2 * 4 matrix.
The 150 * 4 matrix multiplied by the 4 * 2 transpose gives a 150 * 2 matrix, which is the dimension-reduced result.
(This data set with few attributes was chosen to make things easier for beginners. In practical projects there are far more than four attributes, but the dimension reduction method is the same.)
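Putting the steps together, the whole procedure fits in one short function. This is a minimal sketch and not the author's code: it assumes NumPy, uses random data as a stand-in for the 150 * 4 data set, fixes the number of components at 2 instead of applying a threshold, and multiplies the centered matrix (the output of the first step) by the transposed component matrix. The name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, k):
    """Project X (n samples x d attributes) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # 1. center each column
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)          # 2. covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3-4. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1][:k]        # 5. indices of the k largest
    components = eigvecs[:, order].T             #    component matrix (k x d)
    return Xc @ components.T, components         # 6. reduced data (n x k)


rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))                    # synthetic stand-in data
Z, components = pca_reduce(X, 2)
print(components.shape)                          # (2, 4)
print(Z.shape)                                   # (150, 2)
```

The shapes match the example: a 2 * 4 component matrix, and a 150 * 2 result after multiplying by its 4 * 2 transpose.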
