
Principal Component Analysis with R

Date post:08-Jan-2016
Description:
PCA with R
Transcript:

Principal Component Analysis (Theory)

The goal of PCA is to reduce a set of mutually correlated variables to a smaller set of new, uncorrelated variables while retaining as much of the variance of the original data as possible (rule of thumb: 80%).

Suppose there are 1000 variables. What are the drawbacks of working with all of them?
1. The analysis becomes too complicated.
2. The results are hard to interpret.
The data therefore need to be reduced. The prerequisite is that the variables are strongly correlated.

Steps of PCA:

1. Test a hypothesis about the correlation matrix to see whether the variables are closely correlated, using Bartlett's test:
   H0: $\rho = I_p$ (all off-diagonal elements are 0, meaning the variables are uncorrelated)
   H1: $\rho \neq I_p$ (some off-diagonal elements are not 0, meaning there is correlation among the variables)
   Bartlett's test statistic:
   $\chi^2 = -\left[(n - 1) - \frac{2p + 5}{6}\right] \ln |R|$
   where n = number of observations; p = number of variables; R = the (estimated) correlation matrix; $|R|$ = the determinant of the correlation matrix. The statistic is compared against a chi-square distribution with $p(p-1)/2$ degrees of freedom.
   Reject H0 if $\chi^2_{\text{calculated}} > \chi^2_{\text{table}}$. Because we intend to use PCA, we hope to reject H0: that means the original variables are correlated, so reducing the dimension of the data is achievable.
2. Find the eigenvalues of the covariance matrix (S) or the correlation matrix (R). If the variables share the same units, use the covariance matrix; if the units differ, use the correlation matrix.
3. Sort the eigenvalues from largest to smallest ($\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$).
4. Create the new variables (principal components), which are linear combinations of the original variables. Use the normalized (orthonormal) eigenvector that corresponds to each eigenvalue:
   $Y_1 = e_1'X = e_{11}x_1 + \dots + e_{1p}x_p$
   $Y_2 = e_2'X = e_{21}x_1 + \dots + e_{2p}x_p$
   ...
   $Y_p = e_p'X = e_{p1}x_1 + \dots + e_{pp}x_p$
   where $X = [x_1, \dots, x_p]'$. No reduction has taken place yet at this point.
   Properties of the new variables: they are uncorrelated with each other and ordered by importance, from $Y_1$ (most important) down to $Y_p$.
5. Reduce the principal components (PCs) that were formed. There are three ways:
   a. Proportion of variance (each eigenvalue divided by the sum of all eigenvalues)
   b. Eigenvalues > 1
   c. Scree plot
   Example using the proportion of variance: suppose $Y_1$ explains 76% and $Y_2$ explains 23%. The first new variable alone does not reach the 80% rule of thumb, so the second is added (76% + 23% = 99%). The number of PCs retained is therefore 2.
   Eigenvalue rule: the number of PCs is the number of eigenvalues greater than 1.
   Scree plot: look at where the curve changes from steep to flat, and at the size of the eigenvalues. (A scree plot shows the component number against its eigenvalue.)
6. Name the PCs that are kept after the reduction. There are two ways:
   a. Use the correlations between each PC and the original variables. The variables with large correlations characterize the PC.
   b. Look at the weights. In $Y_1 = e_1'X = e_{11}x_1 + \dots + e_{1p}x_p$ the weights are the $e$ values. The variable with the largest weight characterizes the PC. If several weights are nearly equal, the PC is characterized by all of those variables.

5 functions to do Principal Components Analysis in R

Posted on June 17, 2012

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data.

From a data analysis standpoint, PCA is used for studying one table of observations and variables. The main idea is to transform the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a complex data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables.

PCA in R

In R, there are several functions from different packages that allow us to perform PCA. In this post I'll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):

prcomp() (stats)

princomp() (stats)

PCA() (FactoMineR)

dudi.pca() (ade4)

acp() (amap)

Brief note: It is no coincidence that the three external packages ("FactoMineR", "ade4", and "amap") were developed by French data analysts, who have a long tradition of, and preference for, PCA and other related exploratory techniques.

No matter which function you decide to use, the typical PCA results should consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) give you a sense of the relationships between variables, as well as their associations with the extracted PCs.
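To make these three outputs concrete, here is a minimal sketch that computes them "by hand" with base R on the built-in USArrests data. It is only an illustration under the assumption that the PCA is done on the correlation matrix. The variable names (`eigenvalues`, `loadings`, `scores`, `var_pc_cor`) are chosen here for readability and are not part of any package.

```r
# PCA "by hand" on the correlation matrix of USArrests (illustrative sketch)
X <- scale(USArrests)          # standardize: mean 0, variance 1 (divisor n - 1)
R <- cor(USArrests)            # correlation matrix of the original variables

dec <- eigen(R)                # eigen() returns eigenvalues in decreasing order
eigenvalues <- dec$values
loadings    <- dec$vectors     # orthonormal eigenvectors, one per column
rownames(loadings) <- colnames(USArrests)
colnames(loadings) <- paste0("PC", seq_along(eigenvalues))

# scores: coordinates of the observations on the principal components
scores <- X %*% loadings

# proportion of variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)

# correlations between variables and PCs: loading times sqrt(eigenvalue)
var_pc_cor <- sweep(loadings, 2, sqrt(eigenvalues), `*`)

round(eigenvalues, 4)
round(cumsum(prop_var), 4)
```

The signs of the columns of `loadings` (and hence of `scores`) are arbitrary, so they may be flipped relative to what a given PCA function prints.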

The Data

To make things easier, we'll use the dataset USArrests that already comes with R. It's a data frame with 50 rows (US states) and 4 columns containing information about violent crime rates by US state. Since variables are usually measured on different scales, the PCA should be performed with standardized data (mean = 0, variance = 1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis must be applied to standardized data.
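The theory notes at the top of this document recommend checking that the variables are actually correlated before running a PCA, using Bartlett's test. As a rough sketch, the test can be computed for USArrests with base R alone. This uses the standard form of the statistic. The 5% significance level is an assumption made for the example, and the variable names are illustrative.

```r
# Bartlett's test of sphericity (H0: the correlation matrix is the identity)
n <- nrow(USArrests)           # number of observations
p <- ncol(USArrests)           # number of variables
R <- cor(USArrests)            # estimated correlation matrix

chi2_stat <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df        <- p * (p - 1) / 2
chi2_crit <- qchisq(0.95, df = df)                        # assumed 5% level
p_value   <- pchisq(chi2_stat, df = df, lower.tail = FALSE)

# TRUE means H0 is rejected, i.e. the variables are correlated
chi2_stat > chi2_crit
```

For USArrests the statistic is far above the critical value, so H0 is rejected and a PCA is worthwhile.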

Option 1: using prcomp()

The function prcomp() comes with the default "stats" package, which means that you don't have to install anything. It is perhaps the quickest way to do a PCA if you don't want to install other packages.

# PCA with function prcomp
pca1 = prcomp(USArrests, scale. = TRUE)

# sqrt of eigenvalues
pca1$sdev
## [1] 1.5749 0.9949 0.5971 0.4164

# loadings
head(pca1$rotation)
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# PCs (aka scores)
head(pca1$x)
##                PC1     PC2      PC3      PC4
## Alabama    -0.9757  1.1220 -0.43980  0.15470
## Alaska     -1.9305  1.0624  2.01950 -0.43418
## Arizona    -1.7454 -0.7385  0.05423 -0.82626
## Arkansas    0.1400  1.1085  0.11342 -0.18097
## California -2.4986 -1.5274  0.59254 -0.33856
## Colorado   -1.4993 -0.9776  1.08400  0.00145

Option 2: using princomp()

The function princomp() also comes with the default "stats" package, and it is very similar to its cousin prcomp(). What I don't like about princomp() is that sometimes it won't display all the values for the loadings, but this is a minor detail.
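The two functions are similar but not identical. With standardized data, prcomp() scales each variable using the usual standard deviation (divisor n - 1), whereas princomp() uses divisor n. The loadings and the reported standard deviations agree, but the scores differ by a constant factor of sqrt(n / (n - 1)). The following sketch checks this on USArrests. The names `p1`, `p2` and `ratio` are just for illustration.

```r
n  <- nrow(USArrests)
p1 <- prcomp(USArrests, scale. = TRUE)   # standardizes with divisor n - 1
p2 <- princomp(USArrests, cor = TRUE)    # standardizes with divisor n

# ratio of the absolute scores on the first component (abs() ignores sign flips)
ratio <- abs(p2$scores[, 1]) / abs(p1$x[, 1])

range(ratio)            # constant across all states
sqrt(n / (n - 1))       # the expected scaling factor
```

This is why the princomp() scores for Alabama (-0.9856) are slightly larger in magnitude than the prcomp() scores (-0.9757).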

# PCA with function princomp
pca2 = princomp(USArrests, cor = TRUE)

# sqrt of eigenvalues
pca2$sdev
## Comp.1 Comp.2 Comp.3 Comp.4
## 1.5749 0.9949 0.5971 0.4164

# loadings
unclass(pca2$loadings)
##           Comp.1  Comp.2  Comp.3   Comp.4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# PCs (aka scores)
head(pca2$scores)
##             Comp.1  Comp.2   Comp.3    Comp.4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465

Option 3: using PCA()

A highly recommended option, especially if you want more detailed results and assessment tools, is the PCA() function from the package "FactoMineR". In my view it is by far the best PCA function in R, and it comes with a number of parameters that allow you to tweak the analysis in a very nice way.

# PCA with function PCA
library(FactoMineR)

# apply PCA
pca3 = PCA(USArrests, graph = FALSE)

# matrix with eigenvalues
pca3$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1     2.4802                 62.006                             62.01
## comp 2     0.9898                 24.744                             86.75
## comp 3     0.3566                  8.914                             95.66
## comp 4     0.1734                  4.336                            100.00

# correlations between variables and PCs
pca3$var$coord
##           Dim.1   Dim.2   Dim.3    Dim.4
## Murder   0.8440 -0.4160  0.2038  0.27037
## Assault  0.9184 -0.1870  0.1601 -0.30959
## UrbanPop 0.4381  0.8683  0.2257  0.05575
## Rape     0.8558  0.1665 -0.4883  0.03707

# PCs (aka scores)
head(pca3$ind$coord)
##              Dim.1   Dim.2    Dim.3     Dim.4
## Alabama     0.9856 -1.1334  0.44427  0.156267
## Alaska      1.9501 -1.0732 -2.04000 -0.438583
## Arizona     1.7632  0.7460 -0.05478 -0.834653
## Arkansas   -0.1414 -1.1198 -0.11457 -0.182811
## California  2.5240  1.5429 -0.59856 -0.341996
## Colorado    1.5146  0.9876 -1.09501  0.001465

Option 4: using dudi.pca()

Another option is to use the dudi.pca() function from the package "ade4", which has a huge number of other methods as well as some interesting graphics.

# PCA with function dudi.pca
library(ade4)

# apply PCA
pca4 = dudi.pca(USArrests, nf = 5, scannf = FALSE)

# eigenvalues
pca4$eig
## [1] 2.4802 0.9898 0.3566 0.1734

# loadings
pca4$c1
##              CS1     CS2     CS3      CS4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# correlations between variables and PCs
pca4$co
##            Comp1   Comp2   Comp3   Comp4
## Murder   -0.8440  0.4160 -0.2038 0.27037
## Assault  -0.9184  0.1870 -0.1601 -0.30959
## UrbanPop -0.4381 -0.8683 -0.2257 0.05575
## Rape     -0.8558 -0.1665  0.4883 0.03707

# PCs
head(pca4$li)
##              Axis1   Axis2    Axis3     Axis4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465

Option 5: using acp()

A fifth possibility is the acp() function from the package "amap".

# PCA with function acp
library(amap)

# apply PCA
pca5 = acp(USArrests)

# sqrt of eigenvalues
pca5$sdev
## Comp 1 Comp 2 Comp 3 Comp 4
## 1.5749 0.9949 0.5971 0.4164

# loadings
pca5$loadings
##          Comp 1  Comp 2  Comp 3   Comp 4
## Murder   0.5359  0.4182 -0.3412  0.64923
## Assault  0.5832  0.1880 -0.2681 -0.74341
## UrbanPop 0.2782 -0.8728 -0.3780  0.13388
## Rape     0.5434 -0.1673  0.8178  0.08902

# scores
head(pca5$scores)
##             Comp 1  Comp 2   Comp 3   Comp 4
## Alabama     0.9757  1.1220 -0.43980  0.15470
## Alaska      1.9305  1.0624  2.01950 -0.43418
## Arizona     1.7454 -0.7385  0.05423 -0.82626
## Arkansas   -0.1400  1.1085  0.11342 -0.18097
## California  2.4986 -1.5274  0.59254 -0.33856
## Colorado    1.4993 -0.9776  1.08400  0.00145

Of course, these are not the only options for doing a PCA, but I'll leave the other approaches for another post.
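Comparing the outputs above, some functions return components with opposite signs (for example, PC1 for Alabama is -0.9757 with prcomp() but 0.9757 with acp()). This is expected: each principal component is only defined up to its sign. The sketch below illustrates the point with prcomp(). The helper `align_signs()` is hypothetical, and the convention it applies (largest-magnitude loading positive) is just one possible choice, not a standard.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# flip the first component: negate both its loadings and its scores
flipped_rot <- pca$rotation
flipped_x   <- pca$x
flipped_rot[, 1] <- -flipped_rot[, 1]
flipped_x[, 1]   <- -flipped_x[, 1]

# the reconstructed (standardized) data are identical either way
recon_original <- pca$x %*% t(pca$rotation)
recon_flipped  <- flipped_x %*% t(flipped_rot)
all.equal(recon_original, recon_flipped)

# hypothetical helper: make the largest-magnitude loading of each PC positive
align_signs <- function(rotation) {
  s <- apply(rotation, 2, function(v) sign(v[which.max(abs(v))]))
  sweep(rotation, 2, s, `*`)
}
aligned <- align_signs(pca$rotation)
```

If loadings are re-signed like this for comparison across functions, the corresponding score columns need the same sign change.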

PCA plots

PCA is very often used to visualize data, and most of the functions discussed come with their own plot functions. But you can also make use of the great graphical displays of "ggplot2". Just to show you a couple of plots, let's take the basic results from prcomp().
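The eigenvalues themselves are also worth looking at, since they drive the choice of how many components to keep. The sketch below applies the three rules mentioned in the theory notes (cumulative proportion of variance with the 80% rule of thumb, eigenvalues greater than 1, and a scree plot) to the prcomp() results. The names `k_prop` and `k_kaiser` are illustrative, and the scree plot is only drawn in an interactive session.

```r
pca <- prcomp(USArrests, scale. = TRUE)

eigenvalues <- pca$sdev^2
cum_prop    <- cumsum(eigenvalues) / sum(eigenvalues)

# rule 1: smallest number of components reaching 80% of the variance
k_prop <- which(cum_prop >= 0.80)[1]

# rule 2: number of eigenvalues greater than 1
k_kaiser <- sum(eigenvalues > 1)

# rule 3: scree plot (component number against eigenvalue)
if (interactive()) {
  screeplot(pca, type = "lines", main = "Scree plot")
}

c(k_prop = k_prop, k_kaiser = k_kaiser)
```

For USArrests the rules disagree: the 80% rule keeps two components, while the eigenvalue rule keeps only one, because the second eigenvalue (0.9898) falls just below 1. This is a reminder that these criteria are guidelines rather than hard rules.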

Plot of observations

# load ggplot2
library(ggplot2)

# create data frame with scores
scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot of USA States - Crime Rates")

Circle of correlations

# function to create a circle
circle
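The listing above breaks off at this point. As a stand-in, here is a minimal sketch of how a circle of correlations could be built. It is not the author's code: the helper `circle_points()` and the variable names are hypothetical, and base graphics are used instead of "ggplot2" to keep the example free of extra dependencies. The plot is only drawn in an interactive session.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# hypothetical helper: points on a circle of radius 1
circle_points <- function(center = c(0, 0), npoints = 100) {
  theta <- seq(0, 2 * pi, length.out = npoints)
  data.frame(x = center[1] + cos(theta), y = center[2] + sin(theta))
}
corcir <- circle_points()

# correlations between the original variables and the first two PCs
correlations <- as.data.frame(cor(USArrests, pca$x[, 1:2]))

if (interactive()) {
  plot(corcir$x, corcir$y, type = "l", asp = 1, col = "gray65",
       xlab = "PC1", ylab = "PC2", main = "Circle of correlations")
  abline(h = 0, v = 0, col = "gray65")
  arrows(0, 0, correlations$PC1, correlations$PC2, length = 0.1, col = "gray40")
  text(correlations$PC1, correlations$PC2,
       labels = rownames(correlations), col = "tomato", pos = 3)
}
```

Each variable is drawn as an arrow whose coordinates are its correlations with PC1 and PC2. Arrows that come close to the unit circle are well represented by the first two components.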
