Nuts and Bolts Engineering Corner
Practical Engineering and Applications
Tuesday, November 2, 2021
Production Metrics & Process Time Calculation
Production Metrics
- Takt Time (TT) = Net Production Time (PT) / Units of Customer Demand (UD)
- Cycle Time (CT) = Net Production Time (PT) / Units Produced (UP)
- Lead Time (LT) = time from order to dispatch
- Crew Size (CS) = Cycle Time (CT) / Takt Time (TT)
- CT <= TT ===> enough resources. If CT > TT, add resources or improve the process (see the worked example below).
- Takt time is external, the target: it is demand based.
- Cycle time is internal, the actual pace: it is process based and can only be reduced by improving the process.
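The relationships are easier to see with numbers plugged in. Below is a minimal Python sketch with illustrative figures (a 450-minute shift, demand of 900 units, 750 units actually produced); the values are assumptions for the example, not data from a real line.
# worked example of the production metrics above (illustrative numbers)
net_production_time = 450 * 60       # one 450-minute shift, in seconds
customer_demand = 900                # units the customer needs per shift
units_produced = 750                 # units actually produced per shift
takt_time = net_production_time / customer_demand    # 30 s/unit (target, demand based)
cycle_time = net_production_time / units_produced    # 36 s/unit (actual, process based)
crew_size = cycle_time / takt_time                   # 1.2 ==> round up to 2 operators
# cycle_time > takt_time here, so add resources or improve the process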
Tuesday, November 10, 2020
____________________ PCA - SVD and Machine Learning III _______________________
"It is a capital mistake to theorize before one has data."
Sherlock Holmes — Sir Arthur Conan Doyle _________
Part III:
So far we have explained PCA and SVD through some math and Python code. Briefly, PCA is an unsupervised learning algorithm usually applied as a preprocessing step in machine learning to reduce high-dimensional datasets. The transformed dataset, with fewer dimensions, is more computationally efficient to work with and study. SVD is useful when dealing with matrix data (deriving eigenvectors, eigenvalues, ...).
The Iris dataset is used to demonstrate how PCA, the scree plot, and the biplot work in practice. After that, machine learning and neural networks will be explored.
The dataset has four features, one extra dimension over the usual x-y-z axis system we are accustomed to. It contains 150 observations with 5 attributes: petal length, petal width, sepal length, sepal width, and species (3 classes: Setosa, Versicolour, and Virginica).
After running the PYTHON SESSION at the bottom, three graphs display: the pair plot, the scree graph, and the scatter plot of projected data combined with loading vectors (the biplot).
The pair plot above shows how the data look for each pair of features. Overall, ‘Setosa’ separates well, while ‘Versicolour’ and ‘Virginica’ overlap somewhat. There are no abnormalities such as data lumped together in one blob (low predictive value, the classes may not be separable) or too many outliers (we might want to remove those). This is well-behaved, labeled data that we can use with machine learning or artificial neural network (ANN) models later.
# ------------------------------------------------------------------
>>> pca.explained_variance_ratio_
array([0.9246, 0.0531, 0.0171, 0.0052])
The 1st principal component, PC_1, explains more than 92% of the variance, and PC_2 about 5.3%. The bend, “knee,” or elbow of the SCREE PLOT appears right there at PC_2. So instead of using the 4-feature dataset we can work with the transformed (projected) data of PC_1 and PC_2 for further analysis. As a rule of thumb, the principal components kept should together describe more than 80% of the variance.
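The cumulative sum of the explained-variance ratios makes that 80% check explicit. A quick sketch (assuming the pca object and numpy import from the PYTHON SESSION at the bottom):
>>> np.cumsum(pca.explained_variance_ratio_).round(4) # ==> roughly [0.9246, 0.9777, 0.9948, 1.0]
So PC_1 and PC_2 together cover about 97.8% of the variance, comfortably above the 80% guideline.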
Take a look at some data:
>>> iris_X[:3] # list 3 rows of 4 features
sepal_L sepal_W petal_L petal_W
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
>>> eigVect # pca.components_.T (unit vectors, values as loadings / weights)
PC_1 PC_2 PC_3 PC_4
sepal_L 0.361387 0.656589 -0.582030 -0.315487
sepal_W -0.084523 0.730161 0.597911 0.319723
petal_L 0.856671 -0.173373 0.076236 0.479839
petal_W 0.358289 -0.075481 0.545831 -0.753657
>>> PC_i.iloc[:3,:2] # 1st & 2nd columns of principal components
PC_1 PC_2
0 -2.684126 0.319397
1 -2.714142 -0.177001
2 -2.888991 -0.144949
Recall from Part_II that we can compute the transformed or projected data (called scores elsewhere) by subtracting the mean from iris_X and then multiplying by the eigenvector matrix (pca.components_.T, or eigVect). The result is the same as PC_i.
>>> ((iris_X - iris_X.mean(axis=0)) @ eigVect)[:3] # same as PC_i
PC_1 PC_2 PC_3 PC_4
0 -2.684126 0.319397 -0.027915 -0.002262
1 -2.714142 -0.177001 -0.210464 -0.099027
2 -2.888991 -0.144949 0.017900 -0.019968
# -------------------------------------------------------------
We can graph PC_1 & PC_2 together with the loading vectors of each feature as a biplot:
In the biplot, the top and right axes read the loadings/weights of the features. The bottom and left axes read the scores (PC_1, PC_2), i.e. the transformed (projected) values of the original dataset variables.
If two loading vectors are close to 90 degrees apart, the features have little correlation. If they are close to a 180° angle, they are strongly negatively correlated. The longer vector (larger magnitude) has more influence on the principal component axis it lies closest to.
If two loading vectors overlap or form a very small angle, the features are positively correlated and roughly proportional, so the one with the smaller vector may be removed from further analysis. In this case the less significant “petal_W” can be dropped, leaving only three features to deal with.
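A quick sanity check of that biplot reading (a small sketch, assuming iris_X from the PYTHON SESSION at the bottom): features whose loading vectors nearly overlap should also show a strong correlation in the raw data.
>>> iris_X.corr().round(2) # petal_L vs petal_W correlation comes out around 0.96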
Now there are several options for the analysis. We can use the original 4-feature data, the 3-feature data (petal_W column removed), or the two principal components from the original dataset. New principal components can also be computed from the 3-feature dataset. These are some of the scenarios to consider when working with big data.
If the dataset has too many features, we may want to reduce the number of variables, either by looking at the biplot or by using some other feature-extraction technique (e.g. SelectKBest, RandomForestClassifier, etc.), where a threshold can be set to select a subset of features.
You can try several feature-extraction methods and take a good intersection of the resulting subsets of variables, as sketched below. It is easier to work with fewer features than more, and the result may even come out better (noise reduction).
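A minimal sketch of two of the feature-extraction checks named above; k=3 and the default scoring are illustrative choices, not values from this post, and iris_X / iris_y come from the PYTHON SESSION at the bottom.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
# univariate selection: keep the 3 features with the highest ANOVA F-scores
skb = SelectKBest(f_classif, k=3).fit(iris_X, iris_y)
print(iris_X.columns[skb.get_support()])
# tree-based importances: rank features by how much they reduce impurity
rf = RandomForestClassifier(random_state=1).fit(iris_X, iris_y)
print(dict(zip(iris_X.columns, rf.feature_importances_.round(3))))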
PCA is commonly used and can be combined with other methods to create a new dataset of principal components. The bottom line is to get a good dataset that yields the highest prediction or accuracy score in the least amount of time. The same expectation applies to machine learning and neural network models.
We will compare the 3- and 4-feature Iris datasets and the two principal components (PC_1, PC_2) using an SVC (support vector classifier) model. The general steps to follow in machine learning are:
1. Import python libraries
2. Get data and organize to a useful format
3. Define models
4. Fit data to models
5. Predict and cross-validate baseline scores
6. PCA & feature extraction combined?
7. Compare steps 5 & 6 and build a new dataset?
8. Fine-tune the best model/dataset with pipeline & grid search
9. Get the final accuracy score and prediction
------------------- [._.]’ -------------------
>>> from sklearn.svm import SVC # support vector classifier
>>> from sklearn.pipeline import Pipeline, make_pipeline
>>> from sklearn.preprocessing import StandardScaler, scale
>>> from sklearn.model_selection import train_test_split , cross_val_score, GridSearchCV
# randomize data and split 75/25 with equal label ratio in train and test (1/3 each)
>>> X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, random_state=1, stratify=iris_y )
# using 3 features - dropped petal_W column
>>> X_train2, X_test2, y_train2, y_test2 = train_test_split(iris_X.iloc[:, :3], iris_y, random_state=1, stratify=iris_y )
# using PC_1 & PC_2 data
>>> X_train3, X_test3, y_train3, y_test3 = train_test_split(iris_PC[:,:2], iris_y, random_state=1, stratify=iris_y )
>>> svc = SVC() # instantiate model object with default settings
# evaluate SVC() by cross-validation with different datasets (full, 3 features , PCs)
>>> cross_val_score(svc, X_train, y_train).mean().round(4) # ==> 0.9383
>>> cross_val_score(svc, X_train2, y_train2).mean().round(4) # ==> 0.9379
>>> cross_val_score(svc, X_train3, y_train3).mean().round(4) # ==> 0.9466
The average score using PC_1 & PC_2 comes out a little better. PCA tends to spread the data along the eigenvector directions (orthogonality <===> decorrelation), giving more separation between classes.
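That decorrelation can be checked directly (a quick sketch, assuming PC_i from the PYTHON SESSION at the bottom): the principal component scores are essentially uncorrelated with each other.
>>> PC_i.corr().round(2) # off-diagonal entries come out at roughly 0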
>>> svc.fit(X_train3, y_train3) # fit train data to SVC model
# predict outcome and compare with test data to check accuracy
>>> (svc.predict(X_test3) == y_test3).mean() # ==> 0.9474
-------------------- (*._.*)’ -------------------
We can also use pipeline & grid-search with principal components and SVC to fine tune the model.
# define pipeline and parameters to use in grid-search
# fit, predict, score and other methods can be used with ‘pipe’
>>> pipe = make_pipeline(SVC())
>>> param_grid = {'svc__C': [0.001, 0.01, 0.1, 1, 10, 100],
'svc__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
# define grid with pipeline, parameters and 5-fold
# fit, predict, score and other methods can be used with ‘grid’
>>> grid = GridSearchCV(pipe, param_grid, cv=5)
>>> grid.fit(X_train3, y_train3) # fit PCs train data to model
>>> grid.best_params_ # best parameters
{'svc__C': 100, 'svc__gamma': 1}
>>> grid.best_score_ # best validation score
0.9644
# never mixed test data with train or validation data
>>> grid.score(X_test3, y_test3) # test data is used only once
1.0
The model shows a better score with GridSearchCV (the best parameters help here). We can expand the scope by trying different models and comparing their accuracy with a for-loop, as sketched below. Then you can fine-tune the best-scoring model with grid search & pipeline to get the best test result. Test data should never be mixed into the training phase of the model; it acts as fresh incoming data for prediction and should be used only once, for final validation.
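A minimal sketch of that model-comparison loop (the candidate models here are illustrative choices, not from this post; X_train3 / y_train3 are the principal component data from above):
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
models = {'SVC': SVC(),
          'LogReg': LogisticRegression(max_iter=1000),
          'KNN': KNeighborsClassifier(),
          'Tree': DecisionTreeClassifier(random_state=1)}
for name, model in models.items():
    # mean 5-fold cross-validation accuracy on the training data only
    score = cross_val_score(model, X_train3, y_train3, cv=5).mean()
    print(f'{name:8s} {score:.4f}')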
------------------ (*.__.*)’ ------------------
Perceptron and Artificial Neural Networks:
Consider the linear-combination equation we used in previous posts: the mold experiment (the multiple linear regression function of the response surface model) and fuzzy logic (the first-order linear equation of the TSK fuzzy model). Again, it is an important mathematical building block. Without exception, the same slope-intercept formula shows up in ML (perceptron & ANN). The function goes by different names, but they all refer to the same thing, i.e. a polynomial function of degree 1 (first order).
A simple perceptron is y = ∑ w_i x_i + b = wᵀx + b = ( ∑ weight_i * input_i ) + bias, and together with an activation function of the output y [e.g. f = tanh(y), sigmoid(y), ...] it is the fundamental building block that makes machine learning and neural networks possible (made famous by the universal approximation theorem). If the intervals are small enough, the linear equation plus its activation function can approximate any complex curve, the same approximation idea as with derivatives, integrals, or numerical problems.
Breaking complex things down into simple elements, then building those basic elements back up into useful compound objects, is the ever-continuing engineering process (nature does much the same thing most of the time!). We will briefly explore neural networks and deep learning with simple examples; a tiny numeric sketch of the perceptron equation comes first.
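A minimal numeric sketch of the perceptron equation above, done by hand with numpy; the weights and bias are made-up illustration values, not trained ones.
import numpy as np
x = np.array([5.1, 3.5, 1.4, 0.2])   # one iris sample (4 inputs)
w = np.array([0.4, -0.2, 0.3, 0.1])  # example weights (illustrative, not trained)
b = 0.5                              # example bias
y = w @ x + b                        # linear combination w.T x + b
out = np.tanh(y)                     # activation squashes y into (-1, 1)
print(y.round(3), out.round(3))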
>>> from sklearn.linear_model import Perceptron
# define simple perceptron with default parameters
# predict and other methods can be used after instantiation & fit
>>> perc_model = Perceptron()
>>> perc_model.fit(X_train3, y_train3)
>>> perc_model.score(X_test3, y_test3)
0.8421
>>> from sklearn.neural_network import MLPClassifier # multi-layer perceptron
# define simple multi-layer perceptron with one setting
# predict and other methods can be used after instantiation & fit
>>> mlp_model = MLPClassifier(max_iter=1000)
>>> mlp_model.fit(X_train3, y_train3)
>>> mlp_model.score(X_test3, y_test3)
0.9473
------------------- ([._.])’ --------------------
Deep Learning:
As data get bigger and numerical problems become more complex, there is always a need for more speed, and the GPU is utilized for its processing power. Scikit-learn does not work with the graphics processing unit so far. To answer the growing demand, some libraries such as h2o4gpu from h2o.ai can be used as sklearn drop-ins; others such as tensorflow and theano already have built-in GPU support. A CUDA-capable GPU is the workhorse for neural networks when the number of nodes (neurons) gets large in deep learning models (more layers). If the PC is set up properly with CUDA, the GPU-supported libraries will use it (otherwise they fall back to the CPU by default). We will use the tensorflow and keras libraries in the example below.
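A one-line check of whether TensorFlow actually sees a CUDA-capable GPU (it silently falls back to the CPU if the list is empty):
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU') # empty list ==> running on CPU only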
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import SGD, Adam
import pandas as pd, numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
>>> iris = load_iris() # load iris dataset
# bring data to pandas DataFrame with column features (remember use ‘*.values’ outside pandas)
>>> iris_X = pd.DataFrame(iris.data, columns=['sepal_L', 'sepal_W', 'petal_L', 'petal_W'])
>>> iris_y = iris.target # assign target labels to a variable
# encode categorical labels for deep learning – tensorflow library
>>> y_cat = pd.get_dummies(iris_y) # convert classes to three columns
# randomize data and split 80/20 with equal label ratio in train and test (1/3 each)
# remember to use y_test4.values later, because a DataFrame may cause problems outside pandas
>>> X_train4, X_test4, y_train4, y_test4 = train_test_split(iris_X, y_cat, test_size=0.2, random_state=0, stratify=iris_y)
>>> iris_seq = Sequential() # define model for Deep Learning
# four inputs (4 features) and three outputs (3 classes)
# multi-class classification uses softmax with categorical cross-entropy
>>> iris_seq.add(Dense(units=3, input_shape=(4,),
activation='softmax', name='layer1'))
>>> iris_seq.compile(optimizer=Adam(learning_rate=.1),
loss='categorical_crossentropy',
metrics=['accuracy'])
>>> iris_seq.fit(X_train4, y_train4,
validation_split=0.2,
epochs=50, verbose=0);
>>> iris_seq.evaluate(X_test4, y_test4)
loss: 0.1261 - accuracy: 1.0000 # may not be exactly the same due to randomness
>>> y_pred_stats = iris_seq.predict(X_test4) # probability values of prediction
>>> y_pred_label = np.argmax(y_pred_stats , axis=-1) # probability to number label
>>> y_test4_label = np.argmax(y_test4.values, axis=-1) # indices of max to number label
# score = (y_test4_label == y_pred_label).mean()
# accuracy = accuracy_score(y_test4_label, y_pred_label)
>>> print(14 * ' ' + '_____ Iris Classification Report: _____\n')
>>> print(classification_report(y_test4_label, y_pred_label))
_____ Iris Classification Report: _____
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
Let’s check out with more hidden layers between input and output:
>>> iris_seq = Sequential()
>>> iris_seq.add(Dense(40, input_dim=4, activation='relu'))
>>> iris_seq.add(Dense(40, activation='relu'))
>>> iris_seq.add(Dense(3, activation='softmax'))
>>> iris_seq.compile(optimizer=Adam(learning_rate=.1), loss='categorical_crossentropy', metrics=['accuracy'])
>>> iris_seq.fit(X_train4, y_train4, epochs=50, verbose=0)
>>> iris_seq.evaluate(X_test4, y_test4)
loss: 0.0426 - accuracy: 1.0000 # loss is better than before
>>> iris_pred = iris_seq.predict(X_test4).round()
>>> accuracy_score(y_test4, iris_pred) # ==> 1.0
>>> print(classification_report(y_test4, iris_pred)) # same as above
Your result may be a little different due to randomness. You can also try the SGD optimizer with learning_rate=0.5, momentum=0.5 to see how the report looks. There are many other datasets you can try with machine learning and neural networks (a quick loading example follows the list):
sklearn.datasets , seaborn.load_dataset() , plotly.data,
plotly.data._get_dataset, bokeh.sampledata
https://archive.ics.uci.edu/ml/index.php ,
https://www.kaggle.com/datasets
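For example, two of the sample datasets above can be loaded like this (a small sketch; seaborn downloads its sample data the first time, so an internet connection may be needed):
>>> import seaborn as sb
>>> penguins = sb.load_dataset('penguins') # another small classification dataset
>>> from sklearn.datasets import load_wine
>>> wine = load_wine(as_frame=True).frame # wine dataset as a pandas DataFrame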
# ---------------------- FINAL THOUGHTS ----------------------------
Like any other good engineering tool, machine learning and neural networks can be used to predict outcomes, find the best parameters, and fine-tune processes in production.
Data acquisition should be an available option in any capital equipment, to capture key variables for analysis or to monitor a critical feature threshold.
A data management system may use machine learning and neural networks to gain better insight from the available data. Technical personnel can then make more informed decisions about processes and engineering systems, especially in the new age of information overload. ANN and machine learning are useful for analyzing big data.
When dealing with big data, more layers are sandwiched between inputs and outputs in deep learning. This may help get good results on large datasets with many features. In numerical problems, smaller steps (more weights and biases) usually give a better approximation, though there is always a point of diminishing returns to be aware of. In reality there is no hard and fast rule about the number of layers and parameters for fine-tuning; much of the time it comes down to trial and error.
A fast PC with a CUDA graphics card and plenty of RAM helps during the model-fitting step (training mode, the most time-consuming step, where the weights and biases are derived). The trained model should be saved in *.h5 format for later use in testing or for transferring the model, for example:
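A short sketch of saving and re-loading the trained Keras model in HDF5 format (the file name is an illustrative choice):
>>> iris_seq.save('iris_model.h5') # architecture + weights written to *.h5
>>> from tensorflow.keras.models import load_model
>>> restored = load_model('iris_model.h5') # ready for predict/evaluate later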
# --- THREE FIGURES (Pairplot, Scree Graph, Biplot) DISPLAY AFTER FINISHING THE PYTHON SESSION ---
# import python libraries
import numpy as np, pandas as pd, scipy as sp, seaborn as sb
import matplotlib.pyplot as plt, matplotlib.patches as mp
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler, scale
#from sympy import pprint, init_printing, symbols
#import sympy as sym
#plt.style.use(['classic', 'seaborn-white']) # style/theme to display graphic
plt.style.use('default') # scree graph look better
np.set_printoptions(suppress=True, precision=4) # suppress scientific notation, show four decimals
# Part_III
iris = load_iris() # load iris dataset
# bring data to pandas DataFrame with column features
iris_X = pd.DataFrame(iris.data, columns=['sepal_L', 'sepal_W', 'petal_L', 'petal_W'])
iris_y = iris.target # assign target labels to a variable
#plt.figure(1) # pair plot
sb.pairplot(sb.load_dataset('iris'), hue='species', palette=['r','g','b'], height=1.7,
plot_kws=dict(s=80, edgecolor="y", lw=1, alpha=.8))
pca = PCA() # instantiate PCA object with all principal components are kept (default)
pca.fit(iris_X); # fit data to PCA model
plt.figure(2) # scree plot
plt.plot(['PC_1','PC_2','PC_3','PC_4'], pca.explained_variance_ratio_, 'bo-', lw=3, ms=12, mew=2, mec='y', mfc='m')
plt.xlabel('Principal Components', color='navy', fontsize=16)
plt.ylabel('Explained Variance Ratios', color='navy', fontsize=16)
plt.title('SCREE PLOT', color='purple', fontsize=20)
#plt.figure(3) # biplot
fig, ax1 = plt.subplots(figsize=(10,8))
ax1.set_xlim(-4,4)
ax1.set_ylim(-2,2)
ax1.tick_params(colors='navy')
ax1.set_xlabel("1st Principal Component ( PC_1 or X' )", color='navy', fontsize=16)
ax1.set_ylabel("2nd Principal Component ( PC_2 or Y' )", color='navy', fontsize=16)
plt.title('Projected Data & Loading Vectors', color='purple', fontsize=22)
ax1.hlines(0,-4,4, ls='--', colors='grey', lw=2)
ax1.vlines(0,-3,3, ls='--', colors='grey', lw=2)
# projected data onto principal component axes
iris_PC = pca.transform(iris_X)
# create principal components DataFrame
PC_i = pd.DataFrame(iris_PC, columns=['PC_1','PC_2','PC_3','PC_4'])
# scatter plot of transformed data on PC_1 & PC_2 axes
sb.scatterplot(x=iris_PC[:,0], y=iris_PC[:,1], hue=iris.target_names[iris_y], style=iris.target_names[iris_y], s=100, palette=['deeppink','g','c'], markers=['o', '^', 's'], edgecolor='y')
# using a second xy-axis for plotting Principal Component loading vectors.
ax2 = ax1.twinx().twiny() # top and right axes
ax2.tick_params(colors='b')
#ax2.set_xlabel('Projected Data & Loading Vectors', color='maroon', fontsize=20)
ax2.set_ylim(-1,1) # maximum limits of loading vectors of feature variables
ax2.set_xlim(-1,1)
# create eigenvectors DataFrame
eigVect = pd.DataFrame(pca.components_.T, index=['sepal_L', 'sepal_W', 'petal_L', 'petal_W'],
columns=['PC_1','PC_2','PC_3','PC_4'])
#for i in eigVect.index:
# ax2.annotate(i, (eigVect.PC_1.loc[i]*a, eigVect.PC_2.loc[i]*a), color='r', ha='left', fontsize=18)
# ax2.arrow(0,0, eigVect.PC_1.loc[i], eigVect.PC_2.loc[i], color='b', lw=2, head_width=0.03)
a = 1.08 # value separates arrow tip and text.
for i in range(len(eigVect.index)):
    # draw the loading vector (magnitude) of each feature
    plt.arrow(0, 0, eigVect.PC_1.iloc[i], eigVect.PC_2.iloc[i],
              color='b', lw=2, head_width=0.03)
    # feature name next to the arrow tip
    plt.text(eigVect.PC_1.iloc[i]*a, eigVect.PC_2.iloc[i]*a,
             iris_X.columns[i], color='r', ha='left', fontsize=18)
leg = ax1.legend(loc=4, markerscale=1.5, frameon=True, edgecolor='y', fancybox=True, shadow=True)
for handle, text in zip(leg.legendHandles, leg.get_texts()):
#text.set_color('navy')
text.set_color(handle.get_facecolor()[0])
plt.show()
#
------------------------------------------------------------------