Saturday, March 28, 2020

_______________ *** PCA - SVD and Machine Learning I *** ________________



"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

               Sherlock Holmes - Sir Arthur Conan Doyle  _______


Part I:

In the modern day, Sherlock Holmes would be more concerned with how to use data effectively. We live in a society overloaded with information. In business and manufacturing, data are abundant. The problem becomes how to develop useful tools to deal with massive amounts of data. One method gaining momentum is machine learning: building different models and studying patterns so that they make accurate predictions when fed data.

For a while, machine learning was considered just a novelty, used mainly in academia and research centers. But things have changed drastically: ML has grown by leaps and bounds.

It is becoming an important emerging technology, especially for analyzing big data. Using different statistical algorithms, it can learn from past experience or historical data to answer business questions, detect and analyze trends, and help solve problems.

Some common machine learning tasks, such as:

    *       Classification
    *       Clustering
    *       Feature selection
    *       Regression
    *       Dimension reduction

have been helping to solve problems in image recognition (automated optical inspection - AOI), product recommendation, medical diagnosis, financial analysis, spam detection, predictive maintenance, etc.

Nowadays Python is the de facto tool for data analysis. With the help of this general-purpose, open-source programming language, we will explore some math operations, such as PCA (Principal Component Analysis) and SVD (Singular Value Decomposition), that are useful in a machine learning pipeline.

Before diving into PCA or SVD, we start with the circle, the ellipse, and some matrix manipulation to gain a better understanding of the mathematical components and their graphical representations.

The Python interactive console, combined with running files in batch mode, is an excellent tool for learning math, studying models, testing patterns, and analyzing problems.

You can use Python online (repl.it, jdoodle.com, onlinegdb.com, etc.). In a Python IDE (Eclipse-PyDev, Jupyter, Thonny, NetBeans ...), you can hover the cursor over the dot after a module name, or over the left parenthesis after a method name, to get more information about it.

    * Circle equation and its parametric formulas:

      (x - u)^2 + (y - v)^2 = r^2          (u, v): center point

       x = u + r cosθ              y = v + r sinθ

    * Ellipse equation and its parametric formulas:

      ( (x - u)^2 / a^2 ) + ( (y - v)^2 / b^2 ) = 1          (a, b): semi-axes

      ( x, y ) = ( u + a cosθ ,  v + b sinθ )

 
    * Rotation matrix using Python's numpy (np):

      R = np.array( [[np.cos(θ), -np.sin(θ)],
                     [np.sin(θ),  np.cos(θ)]] )
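
As a quick sanity check of the formulas above, the following minimal sketch confirms numerically that points generated by the parametric equations satisfy the circle and ellipse equations, and that a counter-clockwise rotation matrix turns the point (1, 0) by 90 degrees into (0, 1). The values of u, v, r, a and b mirror those in the script that follows; the helper name rotation is a hypothetical one chosen for illustration.

```python
import numpy as np

# Illustrative values: center (u, v), radius r, ellipse semi-axes a and b
u, v, r = 2.5, 2.5, 2.0
a, b = 1.0, 0.5
theta = np.linspace(0, 2 * np.pi, 50)

# Points from the parametric equations
cx, cy = u + r * np.cos(theta), v + r * np.sin(theta)
ex, ey = u + a * np.cos(theta), v + b * np.sin(theta)

# Every point should satisfy the corresponding implicit equation
circle_ok = np.allclose((cx - u) ** 2 + (cy - v) ** 2, r ** 2)
ellipse_ok = np.allclose(((ex - u) / a) ** 2 + ((ey - v) / b) ** 2, 1.0)


def rotation(angle):
    """Counter-clockwise 2-D rotation matrix for an angle in radians."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])


R = rotation(np.pi / 2)
rotated = R @ np.array([1.0, 0.0])  # the point (1, 0) turned by 90 degrees

print("circle ok:", circle_ok, "| ellipse ok:", ellipse_ok)
print("rotated point:", np.round(rotated, 4))
print("R is orthogonal:", np.allclose(R.T @ R, np.eye(2)))
```

A rotation matrix is orthogonal with determinant 1, which is why it changes the orientation of a shape but not its size.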



# ------ FOUR FIGURES ARE DISPLAYED WHEN THE SCRIPT FINISHES ------

# import modules for later use
import numpy as np, pandas as pd, scipy as sp, seaborn as sb
import matplotlib.pyplot as plt, matplotlib.patches as mp
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler, scale
from sympy import pprint, init_printing, symbols
import sympy as sym

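# style/theme to display graphics; matplotlib 3.6+ renamed the
# seaborn styles, so pick whichever name this version provides
white = ('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available
         else 'seaborn-white')
plt.style.use(['classic', white])
# display floats with four decimals, without scientific notation
np.set_printoptions(suppress=True, precision=4)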

fig, ax = plt.subplots() # figure and axis
plt.xlim(-.5, 5) # set x-axis limit
plt.ylim(-.5, 5)
ax.set_aspect('equal') # set aspect ratio of axis

plt.axhline(ls='--', color='grey') # line thru origin
plt.axvline(ls='--', color='grey')

# variables for circle and ellipse
# u, v  : center | a, b : semi-major & semi-minor ellipse axes
# r, rot: circle radius & rotation angle in radians
u, v, a, b, rot, r = 2.5, 2.5, 1, .5, np.pi/6, 2
# linspace defaults to 50 points: the angles in radians
theta = np.linspace(0, 2*np.pi)

# define circles using matplotlib patches
c1 = plt.Circle([u,v], 1.5, color='m') # one way to define a circle
c2 =  mp.Circle([u,v], 1, color='g')   # green filled circle

ax.add_artist(c1) # or ax.add_patch(c1): add circle c1 to the figure
plt.gca().add_artist(c2) # or plt.gca().add_patch(c2): add c2

# plt.show()
# In interactive mode, call this once the graphic objects are
# set up, to display them. In batch mode it goes at the end.
# Combine it with plt.figure(2), plt.figure(3), ... to get
# multiple display windows of graphic objects.


# circle parametric equation
rx, ry = u + r*np.cos(theta), v + r*np.sin(theta)

plt.plot(rx, ry, 'm-') # display circle with radius "r"

# plt.show()
# Some IDEs display the figure right away. Otherwise use
# the command above to show it. Typically this call is
# put at the end of the program in batch mode.

# create another graphic window; you may have
# to close the previous one to display the others

plt.figure(2)

# define ellipse using a matplotlib patch (width and height are
# full axis lengths, angle is in degrees)
ell   = mp.Ellipse([u, v], 2*a, 2*b, angle=rot*180/np.pi,
                   fill=True, color='b', lw=3)

# ellipse parametric equation
ex, ey = u + a*np.cos(theta), v + b*np.sin(theta)

plt.ylim(1.5, 3.5)        # try to center graphic
plt.plot(ex, ey, 'r-')    # to display ellipse from formula
plt.gca().add_artist(ell) # add ellipse to figure(2)


plt.figure(3)

# convert circle and ellipse formulas to array for matrix
# manipulation, to gain better understanding of PCA & SVD later.
cir_M = np.array([rx, ry])
ell_M = np.array([a*np.cos(theta), b*np.sin(theta)])

# define rotation matrix
rot_M = np.array([[np.cos(rot), -np.sin(rot)],
                  [np.sin(rot),  np.cos(rot)]])

# an arbitrary scale matrix
scale_M = np.array([[1.1, 0 ],
                    [0 , 0.9]])

# an arbitrary shear matrix
shear_M = np.array([[1, 0.2],
                    [0.0, 1]])

# matrix product (@) of a 2x2 matrix and a 2xN array of points
cir_scale  = scale_M @ cir_M # circle stretched in x, squeezed in y
cir_shear  = shear_M @ cir_M # circle sheared along the x direction
ell_rotate = rot_M   @ ell_M # ellipse rotated by the angle rot

# magenta is the original. Dashed lines are the curves
# transformed by the scale, shear, or rotation matrix.
plt.ylim(0, 5)
plt.plot(cir_M[0], cir_M[1], 'm-', lw=3) # original circle matrix
plt.plot(cir_scale [0], cir_scale [1], 'g--', lw=3)
plt.plot(cir_shear [0], cir_shear [1], 'r--', lw=3)

plt.figure(4)
# separate figures make it easy to resize each display window.
# plt.subplots(2,2) axes can be useful too.

plt.plot(ell_rotate[0], ell_rotate[1], 'b--', lw=3)
plt.plot(ell_M[0], ell_M[1], 'm-', lw=3) # original ellipse matrix

plt.axhline(ls='--', color='grey') # line thru origin
plt.axvline(ls='--', color='grey')

plt.axis('equal')
plt.show()


--------------------------------------------------------------------
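
The script above opens one window per figure. Its comments mention plt.subplots(2, 2) as an alternative; a minimal sketch of that layout, assuming the same arrays and matrices as the script, might look like this. The panel titles and the figsize value are illustrative choices, not part of the original.

```python
import numpy as np
import matplotlib.pyplot as plt

# same values as in the script above
u, v, a, b, rot, r = 2.5, 2.5, 1, .5, np.pi/6, 2
theta = np.linspace(0, 2*np.pi)

cir_M = np.array([u + r*np.cos(theta), v + r*np.sin(theta)])
ell_M = np.array([a*np.cos(theta), b*np.sin(theta)])

rot_M   = np.array([[np.cos(rot), -np.sin(rot)],
                    [np.sin(rot),  np.cos(rot)]])
scale_M = np.array([[1.1, 0], [0, 0.9]])
shear_M = np.array([[1, 0.2], [0, 1]])

# (title, original curve, transformed curve or None) for each panel
panels = [('circle',          cir_M, None),
          ('scaled circle',   cir_M, scale_M @ cir_M),
          ('sheared circle',  cir_M, shear_M @ cir_M),
          ('rotated ellipse', ell_M, rot_M @ ell_M)]

fig, axes = plt.subplots(2, 2, figsize=(8, 8))  # one window, four axes
for ax, (title, original, moved) in zip(axes.flat, panels):
    ax.plot(original[0], original[1], 'm-', lw=2)   # original in magenta
    if moved is not None:
        ax.plot(moved[0], moved[1], 'b--', lw=2)    # transformed, dashed
    ax.set_title(title)
    ax.set_aspect('equal')

fig.tight_layout()
# plt.show()  # uncomment when running as a script
```

With a single figure there is only one window to resize, at the cost of smaller individual panels.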



             Figure 1                            Figure 2
   


·         Figure 1 shows three circles. The unfilled outer magenta circle uses the parametric equation to create an array for plotting (most computer visual effects and data analysis work with arrays or matrices for further processing). The two inner circles use matplotlib patches: one is filled green, the other is filled magenta.

·         Figure 2 displays a rotated blue ellipse created with the matplotlib Ellipse patch, and an unfilled red ellipse that comes from the parametric equation.



            Figure 3                             Figure 4
    



·         Figure 3 shows the original magenta circle. The dashed green oval comes from the matrix product of the scale matrix and the circle array. The dashed red oval is the result of multiplying the shear matrix by the circle array (converted from the circle parametric equation).

·         Figure 4 displays the original magenta ellipse. The dashed blue ellipse is the result of multiplying the rotation matrix by the ellipse array (converted from the ellipse parametric equation).
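
To see in numbers what the three matrices do, the sketch below applies them to a unit circle centered at the origin and compares the determinant of each matrix with the change in enclosed area. It also shows where each matrix sends the point (2.5, 2.5), the circle center used in the script, since a matrix applied to an off-origin shape moves its center as well. The helper polygon_area (shoelace formula) and the 400-point sampling are assumptions made for illustration.

```python
import numpy as np

rot = np.pi / 6
rot_M = np.array([[np.cos(rot), -np.sin(rot)],
                  [np.sin(rot),  np.cos(rot)]])
scale_M = np.array([[1.1, 0.0],
                    [0.0, 0.9]])
shear_M = np.array([[1.0, 0.2],
                    [0.0, 1.0]])


def polygon_area(points):
    """Shoelace formula for a closed polygon given as a 2 x N array."""
    x, y = points
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
unit_circle = np.array([np.cos(theta), np.sin(theta)])
base_area = polygon_area(unit_circle)
center = np.array([2.5, 2.5])

results = {}
for name, M in [("scale", scale_M), ("shear", shear_M), ("rotate", rot_M)]:
    det = np.linalg.det(M)
    area_ratio = polygon_area(M @ unit_circle) / base_area
    results[name] = (det, area_ratio, M @ center)
    print(f"{name:6s} det = {det:.4f}  area ratio = {area_ratio:.4f}  "
          f"center -> {np.round(M @ center, 4)}")
```

The area ratio matches the absolute value of the determinant in each case: the shear and the rotation keep the area unchanged, while this particular scale matrix shrinks it slightly (1.1 x 0.9 = 0.99).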


Shape creation and transformation come mainly from the matplotlib module/library or from parametric equations. Later on, we will use these techniques to map out a region of interest (ROI). For a cluster of data points, this means finding certain important information, such as:

    *       Centroid location
    *       Correlation between variables or features
    *       Direction and angle of eigenvectors (unit vectors)
    *       Three sigma region (ROI) or data variance values


These values, together with PCA and SVD, play a key role in gaining more insight into the nature of the data and its ROI. The upcoming Part II will explore PCA and SVD in detail. Part III uses a biplot to display the loading vectors of each variable or feature.
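
As a preview of the quantities listed above, here is a small sketch on synthetic data. It is not the method of Part II, only one common way to obtain these numbers with numpy: the centroid is the column mean, the correlation comes from np.corrcoef, and the eigenvectors of the covariance matrix give the directions of the axes of a three-sigma ellipse. The data set, the seed, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic 2-D cluster, purely illustrative (rows = samples)
rng = np.random.default_rng(0)
true_cov = np.array([[3.0, 1.2],
                     [1.2, 1.0]])
X = rng.multivariate_normal(mean=[2.5, 2.5], cov=true_cov, size=500)

centroid = X.mean(axis=0)                    # centroid location
cov = np.cov(X, rowvar=False)                # 2 x 2 sample covariance
corr = np.corrcoef(X, rowvar=False)[0, 1]    # correlation between features

# eigh suits symmetric matrices: eigenvalues in ascending order,
# unit-length eigenvectors stored as columns
eigvals, eigvecs = np.linalg.eigh(cov)
major = eigvecs[:, -1]                       # direction of largest variance

# The sign of an eigenvector is arbitrary, so the angle is only
# defined up to 180 degrees
angle_deg = np.degrees(np.arctan2(major[1], major[0]))

# Semi-axes of the three-sigma ellipse, major axis first
half_axes_3sigma = 3 * np.sqrt(eigvals[::-1])

print("centroid       :", np.round(centroid, 4))
print("correlation    :", round(corr, 4))
print("major direction:", np.round(major, 4))
print("angle (deg)    :", round(angle_deg, 2))
print("3-sigma axes   :", np.round(half_axes_3sigma, 4))
```

The resulting centroid, angle and semi-axes are exactly the arguments a matplotlib Ellipse patch needs (with width and height set to twice the semi-axes) to draw the region around the cluster.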




--------------------------------------------------------------------