Friday, April 17, 2020

________________ PCA - SVD and Machine Learning II ________________


"There is nothing more deceptive than an obvious fact."

       Sherlock Holmes, Sir Arthur Conan Doyle


Part II:


A cluster of data is an important region of interest (ROI) in machine learning. It has a location, an orientation, and individual points. To better describe these characteristics and perform mathematical manipulation, a cluster is usually converted to a matrix.

In 2D, the location is the centroid (X-mean, Y-mean) together with all the points (Xi, Yi) of the cluster. From its center, the eigenvectors largely define the orientation of the data. Every eigenvector can be scaled to a unit vector. This also implies an angle associated with a vector of unit length 1. The angle is not often pointed out in many texts; the fact is so obvious that it is sometimes neglected and not mentioned at all. A vector should be associated with an angle to complete its description in a coordinate system. The vector angle is the important missing link for a better understanding of eigenvectors. Arcsine, arccosine, and arctangent are the common formulas for finding this angle from the (x, y) components of the vector.

                  Figure-5

The unit vector r=1 of figure-5 is the eigenvector starting at center (u, v) with its tip at (x, y), and the parametric equations are x = u + r*cos(theta), y = v + r*sin(theta).

It also illustrates the Pythagorean identity cos(theta)**2 + sin(theta)**2 = 1. The same applies to an N-dimensional eigenvector (more features or variables): the norm of an eigenvector is always equal to 1.

sin(theta) and cos(theta) are considered the loadings in PCA. Think of them as how much each feature/variable (x and y, 2D in this case) contributes, at the angle theta of the unit vector, to the principal component axes (the biplot illustrates this point more in part III).
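
As a quick check, here is a minimal sketch with a made-up unit vector (not part of the python session below): the angle can be recovered from the (x, y) components with arctangent, and the components themselves behave as the cos/sin loadings.

# hypothetical 2-D unit eigenvector; recover its angle and verify the loadings
import numpy as np

v = np.array([0.731, 0.6824])               # example eigenvector components
v = v / np.linalg.norm(v)                   # force unit length (norm = 1)
theta = np.arctan2(v[1], v[0])              # angle from (x, y) components
print(np.degrees(theta))                    # ~43 degrees
print(np.cos(theta), np.sin(theta))         # equal to the vector components
print(np.cos(theta)**2 + np.sin(theta)**2)  # 1.0, the Pythagorean identity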



----------------------------  <(0_0)>  -------------------------


The information age comes with massive amounts of data and features to consider. Typically, the number of input variables is too large for useful analysis. PCA is a technique used to reduce high-dimensional datasets to a manageable number of dimensions while still preserving the original structure and relationships. Information loss from the original data is kept to a minimum.

Most of the time the ROI is an oval shape. We will define a random data set and lay out the process of analyzing PCA using python. Let’s go over some matrix properties:

       Q, U, V : orthogonal matrices      A_mxn : rectangular matrix
       S, D    : diagonal matrices        I     : identity matrix
       S = S^T              D = D^T
       Q^-1 = Q^T       Q Q^T = Q^T Q = I       U^T U = I_mxm       V^T V = I_nxn

       A A^T / (n-1)  -  covariance in matrix form (A zero-mean)

       A_mxn = U_mxm S_mxn V^T_nxn   (SVD) - singular value decomposition

Some properties of SVD:

A = U S V^T   ,   A^T = V S U^T      <==   (U S V^T)^T = (V^T)^T (U S)^T = V S^T U^T = V S U^T
A^T A = V S U^T U S V^T
A^T A = V S^2 V^T
A^T A V = V S^2 V^T V = V S^2        ==>   A^T A v_i = σ_i^2 v_i
A A^T U = U S V^T V S U^T U = U S^2  ==>   A A^T u_i = σ_i^2 u_i
A V = U S V^T V = U S                ==>   A v_i = σ_i u_i


A^T A v_i = σ_i^2 v_i and A A^T u_i = σ_i^2 u_i (the Ax = λx form) come from the equations above, and A V = U S V^T V = U S is equivalent to A v_i = σ_i u_i. Calculating the SVD consists of finding the eigenvalues or variances (λ_i) and eigenvectors (u_i, v_i) of A A^T and A^T A. u_i is called a left singular vector (a unit eigenvector of A A^T); v_i is called a right singular vector (a unit eigenvector of A^T A). The singular values (sigma, or standard deviations) of A_mxn are the square roots of the eigenvalues (σ_i = sqrt(λ_i)) of A A^T or A^T A. The sigma values are the diagonal entries of the S or D matrix, arranged in descending order.
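
These relations are easy to verify numerically. Below is a minimal sketch on random zero-mean data (separate from the python session later in this post), assuming features in columns:

# verify σ_i^2 = λ_i and A v_i = σ_i u_i on random data
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
A = A - A.mean(axis=0)                          # zero-mean columns (features)

U, S, VT = np.linalg.svd(A, full_matrices=False)
eigVal, eigVec = np.linalg.eig(A.T @ A)
order = np.argsort(eigVal)[::-1]                # descending, same order as S

print(np.allclose(S**2, eigVal[order]))                     # True: σ_i^2 = λ_i
print(np.allclose(np.abs(VT.T), np.abs(eigVec[:, order])))  # True, up to sign
print(np.allclose(A @ VT.T, U * S))                         # True: A v_i = σ_i u_i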

When a data matrix can be decomposed into eigenvectors, the location (centroid) and orientation (angle) of the data cluster are defined. The largest eigenvalue and its corresponding eigenvector define the 1st principal component axis. The 2nd principal component axis is defined by the 2nd largest eigenvalue and its corresponding eigenvector, and so on.

The singular value, or square root of the eigenvalue/variance, is the standard deviation or one sigma in the corresponding eigenvector direction. We can map out the three-sigma region (ROI - Region Of Interest) to further analyze the data cluster.

Any (Xi, Yi) cluster data point multiplied by the unit vectors or eigenvectors (dot product) becomes the new (Xi, Yi)’ of the principal component axes. Basically, the unit vectors are used to project the data onto the 1st and 2nd principal axes. In 2D, cos(theta) and sin(theta) are the loadings or weight factors for the new eigenvector directions or principal component axes.

Another way of looking at it: when the data cluster is rotated by negative theta, we arrive at the principal component axes. The frame of reference changes from the existing (Xi, Yi) to the new (Xi, Yi)’ of the principal component axes.
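
In 2D this can be seen directly: the rotation matrix for -theta is exactly the transpose of the eigenvector matrix whose first column is [cos(theta), sin(theta)]. A minimal sketch with a hypothetical angle:

# rotating by -theta equals multiplying by the transposed eigenvector matrix
import numpy as np

theta = np.radians(43.0)                                # hypothetical PC1 angle
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])         # column eigenvectors
R_neg = np.array([[np.cos(-theta), -np.sin(-theta)],
                  [np.sin(-theta),  np.cos(-theta)]])   # rotation by -theta
print(np.allclose(R_neg, V.T))                          # True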

For a 3D (Xi, Yi, Zi) dataset, let Ud be any column eigenvector:

  U1 = [C1, C2, C3], U2 = [C4, C5, C6]   # some scalars/constants
 
  Xi’ = [Xi, Yi, Zi] @ U1  =  Xi * C1 + Yi * C2 + Zi * C3
  Yi’ = [Xi, Yi, Zi] @ U2  =  Xi * C4 + Yi * C5 + Zi * C6

  Xi’ is on 1st principal axis PC1, Yi’ is on 2nd principal axis PC2.

  ( d = 1, 2, 3, … k   and  i = 1, 2, 3, … n    k <<< n ).
  ( @ is the dot product or matrix multiplication in python 3.x ).
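
A minimal numeric sketch of the 3D case above, where C1..C6 are hypothetical values chosen so that U1 and U2 are orthogonal unit vectors:

import numpy as np

U1 = np.array([0.6, 0.6, 0.52915])          # hypothetical unit eigenvector
U2 = np.array([0.70711, -0.70711, 0.0])     # hypothetical unit eigenvector
point = np.array([2.0, 1.0, 3.0])           # one (Xi, Yi, Zi) data point

Xi_new = point @ U1                         # Xi' on PC1
Yi_new = point @ U2                         # Yi' on PC2
# same value written out as the linear combination Xi*C1 + Yi*C2 + Zi*C3
print(Xi_new, point[0]*U1[0] + point[1]*U1[1] + point[2]*U1[2])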

In general, the data transformation is a linear combination of weights/loadings (the eigenvector values of Ud) and the data matrix Amxn. Depending on the python library and the data layout, you may need to transpose a matrix to make the shapes match. Applying the dot product, we have:

    A’_nxd = A_nxm @ U_mxd        or        A’_dxn = U_dxm @ A_mxn
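
The two layouts give the same result up to a transpose. A minimal sketch with random 3-feature data (separate from the session below):

# features-in-columns (n x m) versus features-in-rows (m x n) projection
import numpy as np

rng = np.random.default_rng(1)
A_nxm = rng.normal(size=(10, 3))            # 10 observations, 3 features in columns
A_nxm = A_nxm - A_nxm.mean(axis=0)          # zero-mean

_, _, VT = np.linalg.svd(A_nxm, full_matrices=False)
U_mxd = VT.T[:, :2]                         # keep d = 2 column eigenvectors

A1 = A_nxm @ U_mxd                          # A'_nxd = A_nxm @ U_mxd
A2 = U_mxd.T @ A_nxm.T                      # A'_dxn = U_dxm @ A_mxn
print(np.allclose(A1, A2.T))                # True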



---------------------------  (*^_^*)’  ---------------------------   


When using python, the data matrix columns and rows should be laid out correctly for the parameters of the module, which means the data may have to be transposed sometimes; we have to be aware of how each library is used. In this case (m) is the number of feature variables, (n) is the number of observations, and (d) is the reduced dimension. The data should be centered at (0, 0), or zero-mean. Python subtracts the mean from the data matrix by default when using the covariance matrix or PCA.

    Principal component or eigenvector:

    PC_mxd  =  (V^T)^T  =  V                 from                 A_nxm = U_nxn S_nxm V^T_mxm

    Feature reduction formula: the new feature matrix is the dot
    product of the data matrix and the eigenvector (unit vector)
    matrix. Later, the scree plot (part III) will help decide how
    many eigenvectors to include. The transformed cluster data are
    the projection of the existing (Xi, Yi, ...) points onto the
    principal component axes:

                       F_nxd = A_nxm  @  PC_mxd
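
For example, keeping only d = 1 of the m = 2 components (a minimal sketch reusing the x1c, y1c and PC variables from the finished python session below):

>>> F1 = np.array([x1c, y1c]).T @ PC[:, :1]  # n x 1, coordinate on PC1 only
>>> F1[:3]

    array([[-4.4831],
           [ 1.6845],
           [-0.9238]])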

# ----------------------------------------------------------------


                    PYTHON SESSION BELOW EXPLAINED:

    Initially, a simple random data cluster of x-y variables using
    the covariance matrix is compared with PCA from the sklearn
    library. This shows the matrix transformation by eigenvector and
    rotation matrix in action. After that, the iris dataset will be
    used to demonstrate PCA in production (explained in part III).


                Figure-6                     Figure-7


Figure-7 shows the final data sample positions after the matrix transformation from Figure-6 (translation - derotation). The steps are reversible (a major application in data compression). The ellipse is mapped out at 3 sigma (a-b, major-minor axes) to further analyze the cluster's ROI.

Before shearing, scaling, or rotating a cluster matrix, the data center should be moved to the origin (zero-mean) to prevent unwanted distortion. In python, the numpy covariance and sklearn PCA methods both have built-in removal of the mean.

After getting Figure-7 from the FINISHED PYTHON SESSION below, we can verify the manual process of using the covariance and rotation matrices (translation-derotation) against sklearn PCA. The projected data on the principal component axes should be the same in both cases.

Same covariance result with/without zero-mean:
 
>>> np.cov(np.array([x1,y1])) # as is, no mean removal

    array([[37.6662, 30.2462],
           [30.2462, 33.5037]])

>>> np.cov(np.array([x1c,y1c])) # zero-mean matrix sample


    array([[37.6662, 30.2462],
           [30.2462, 33.5037]])

Same PCA-transform or projected data result with/without zero-mean:

>>> pca = PCA() # instantiate PCA class

>>> x1y1_pca = pca.fit_transform(np.array([x1,y1]).T) # PC1_2 data

>>> x1y1_mean_pca = pca.fit_transform(np.array([x1c,y1c]).T)

>>> x1y1_pca[:3] # no mean removal of matrix sample, first 3 rows

    array([[-4.4831, 2.4537],
           [ 1.6845, 1.5764],
           [-0.9238, 0.0164]])

>>> x1y1_mean_pca[:3] # zero-mean matrix sample

    array([[-4.4831, 2.4537],
           [ 1.6845, 1.5764],
           [-0.9238, 0.0164]])

Now, comparing PCA() projected data with matrix translation-derotation result:

>>> (rot_M @ (x1c, y1c))[:,:3].T

    array([[-4.4831, -2.4537],
           [ 1.6845, -1.5764],
           [-0.9238, -0.0164]])

They are the same except for occasional sign flips, because an eigenvector can point in the opposite direction (180 degrees out), yet it still lies on the same principal component axis line.


# ---------------------------------------------------------------

One way of getting eigenvectors is from numpy.linalg.svd() or scipy.linalg.svd(). For PCA, pass zero-mean data with the variables/features in columns.

# sample matrix factorized into U, S, VT - features in column

>>> U, S, VT = np.linalg.svd(np.array([x1c,y1c]).T)
>>> PC = VT.T # transpose to compare with eigVec (column format)
>>> PC

    array([[ 0.731 , -0.6824],
           [ 0.6824,  0.731 ]])

>>> eigVec # from covariance matrix factorization, np.linalg.eig()

    array([[ 0.731 , -0.6824],
           [ 0.6824,  0.731 ]])

>>> pca.fit(np.array([x1,y1]).T)
>>> pca.components_.T # column eigenvectors

    array([[ 0.731 ,  0.6824],
           [ 0.6824, -0.731 ]]) 

We can compute the projected data using either VT.T, eigVec, or pca.components_.T:

>>> F = (np.array([x1c,y1c]).T) @ PC # same values as derotation
>>> F[:3]

    array([[-4.4831, -2.4537],
           [ 1.6845, -1.5764],
           [-0.9238, -0.0164]])


PCA() from scikit-learn and svd() from the numpy/scipy libraries are fast ways to get the result. It is amazing that with a few lines of python code we can get the answer so quickly; that is not as easy in some other programming languages.

# ------------------------------------------------------------------

Most of the time we can solve for one unknown from the other variables in an equation. Independent and response variables can also be switched around (a reversible process). This generally holds true in matrix equations. Consider the equations:

  A’_nxp = A_nxm @ U_mxp   -->   A’_nxp @ (U_mxp)^T = A_nxm @ U_mxp @ (U_mxp)^T

  ==>  A_nxm = A’_nxp @ U_pxm

  A’_pxn = U_pxm @ A_mxn   -->   (U_pxm)^T @ A’_pxn = (U_pxm)^T U_pxm @ A_mxn

  ==>  A_mxn = U_mxp @ A’_pxn

That is how data compression works. The more eigenvectors or principal components we use, the better the quality of the retrieved data. We can find the balance, or “good choice”, of how many eigenvectors to use with a scree plot (part III).
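
As a minimal sketch of the lossy case (reusing x1c, y1c and PC from the session above, and keeping only the 1st eigenvector), the reconstruction from a reduced number of components only approximates the zero-mean data:

>>> A0 = np.array([x1c, y1c]).T        # n x 2 zero-mean data
>>> Ud = PC[:, :1]                     # keep only d = 1 principal component
>>> A_back = (A0 @ Ud) @ Ud.T          # lossy n x 2 reconstruction
>>> err = np.abs(A0 - A_back).mean()   # small but nonzero average error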

By switching rows/columns and moving the eigenvector matrix around, you can achieve some amazing results with matrix multiplication.

Let’s try out with existing data:

>>> A = x1y1_pca @ eigVec # transformed matrix multiply eigenvector

>>> A[:3]                 # list first three rows of zero-mean data

    array([[-1.6026,  4.8529],
           [ 2.3071,  0.0028],
           [-0.6641,  0.6423]])

Verify:

>>> (xyDF-xyDF.mean(axis=0))[:3] # original data minus mean

          x1       y1
 0 -1.602601 -4.85292
 1  2.307078 -0.00283
 2 -0.664081 -0.64234

Try the x1y1_mean_pca and pca.components_.T matrices:

>>> ( x1y1_mean_pca @ pca.components_.T )[:3]

      array([[-1.6026, -4.8529], 
             [ 2.3071, -0.0028],  
             [-0.6641, -0.6423]])


Because most methods remove the mean, we have to subtract the mean from the original data to compare. Again, the data change sign due to the eigenvector flipping 180° out.

When working with matrices, it helps to start thinking of vectors as the columns or rows of a matrix. Multiplication between matrices is the dot product of row vectors with column vectors. Features or variables must be aligned with the python library method, and they should match up correctly with the other matrix during multiplication. Each eigenvector is composed of individual weight values for the variables and must be paired accordingly with the other matrix's features when using the dot product.


# ----- FIGURE-5-6-7 DISPLAY AFTER FINISHED PYTHON SESSION ------



# continuation from Part-I with same libraries/modules
plt.figure(5)
#fig, ax = plt.subplots() # figure and axis
plt.xlim(-.5, 5) # set x-axis limit
plt.ylim(-.5, 5)
#ax.set_aspect('equal') # set aspect ratio of axis
plt.axis('equal')

# initialize center(u, v) values and others
u, v, a, b, rot, r = 2.5, 2.5, 1, .5, np.pi/6, 2
# default 50 points or angles in radian
theta = np.linspace(0, 2*np.pi)

plt.axhline(ls='--', color='grey') # line thru origin
plt.axvline(ls='--', color='grey')

# draw a magenta circle at (u, v) of r=2
c3 = plt.Circle([u,v], 2, color='m', fill=False, lw=3)

plt.gca().add_artist(c3) # add circle c3 to figure
plt.plot(np.array([0, 0.5*2, .5*2, 0]) + u,
         np.array([0, .866*2, 0, 0]) + v, lw=3) # triangle

plt.plot([u, u], [0, 5], color='g') # vertical line thru center
plt.plot([0, 5], [v, v], color='g') # horizontal line thru center

# text formatting of figure 5
plt.text(u + .5*2.1, v + .866*2.5, 'Eigenvector', size=16)
plt.text(u + .5*2.1, v + .866*2.1, '(x, y)', size=16)
plt.text(u + .1, v -.4, r'cos($\theta$)', size=18)
plt.text(u + .5*2.1, v + .5, r'sin($\theta$)', size=18)
plt.text(1.6 , v + .1, '(u, v)', size=20, color='r')

plt.text(u + .25, v + 1, '1', size=20, color='r')
plt.text(u + .15, v + .1, r'$\theta$', size=20, color='r')

plt.figure(6)
np.random.seed(15) # reproducible, to get same sample

# x1 mean at 20 with spread of 6 std of 100 points, multiply
# by tan to get y1, add 3 std for small variation in y1
x1 = np.random.normal(20, 6, 100)
y1 = x1*np.tan(np.pi/5) + np.random.normal(0, 3, 100)

# convert x1 & y1 to pandas dataframe, use xyDF.x1.values ,
# xyDF.y1.values for calculation - less likely to cause error later
xyDF = pd.DataFrame({'x1': x1, 'y1': y1}) # view labeled data easier

# covariance of x1 & y1, notice transpose to match method
covXY = np.cov(xyDF.T) # method default row as variable

# unpacking covariance into eigenvalues and eigenvectors
eigVal, eigVec = np.linalg.eig(covXY)
sig1, sig2 = eigVal**0.5 # singular value or sigma

# initialize new center(u, v) and 3 sigma values of a & b
u, v, a, b = x1.mean(), y1.mean(), 3*sig1, 3*sig2

# default 50 points for parametric equations
theta = np.linspace(0, 2*np.pi)

# angle of 1st eigenvector or principal axis
angle = np.degrees(np.arccos(eigVec[0,0]))

# define an ellipse using parametric eq. at new center(u, v)
# with zero angle as reference, draw as green ellipse later
ell_M  = np.array([u + a*np.cos(theta), v + b*np.sin(theta)])

# red ellipse of a & b at three sigma value from the
# data center, rotated at angle of 1st eigenvector
e1 = mp.Ellipse([u,v], 2*a, 2*b, angle=angle, lw=3, fill=False, color='r')

plt.axhline(ls='--', color='grey') # grey line thru origin
plt.axvline(ls='--', color='grey')

plt.scatter(x1, y1, 100, 'm', edgecolor='y') # plot sample data
plt.plot(ell_M[0,:],ell_M[1,:], 'g--', lw=3) # draw green ellipse
plt.gca().add_artist(e1) # add red ellipse to figure

# draw three sigma lines of data sample (ROI - red ellipse)
sigma3 = np.array([[u,v]]).T + (3*eigVal**0.5) * eigVec
plt.plot([u, sigma3[0,0]], [v, sigma3[1,0]], lw=3, color='b')
plt.plot([u, sigma3[0,1]], [v, sigma3[1,1]], lw=3, color='b')

# move sample center to origin and derotation of red ellipse
plt.figure(7)
plt.axhline(ls='-', color='b', lw=2)
plt.axvline(ls='-', color='b', lw=2)

x1c = x1-x1.mean() # removed mean from x1 matrix sample
y1c = y1-y1.mean() # removed mean from y1 matrix sample

# negative rotation angle of eigenvector
rotNeg = - np.arccos(eigVec[0,0])
# negative rotation matrix
rot_M = np.array([[np.cos(rotNeg), -np.sin(rotNeg)],
                  [np.sin(rotNeg),  np.cos(rotNeg)]])

# transformation of data sample by rotation matrix
el1, el2 = rot_M @ (x1c, y1c)
# plot data sample at new center (0, 0)
plt.scatter(el1, el2, 100, 'm', edgecolor='y')

# green ellipse at origin, after translation & derotation
# of red ellipse
e2  = mp.Ellipse([0,0], 2*a, 2*b, angle=0, lw=3,
                             fill=False, color='g')
plt.gca().add_artist(e2) # add green ellipse to figure 7

plt.text(1, 19, 'PC_2', size=18, color='purple') # label x-y axes
plt.text(23.5, 3, 'PC_1', size=18, color='purple')

#plt.xlabel('PC_1', color='purple', fontsize=18) 
#plt.ylabel('PC_2', color='purple', fontsize=18)
 
plt.axis('equal') # set aspect=1 for xy-axis
plt.show()


================================================================