Monday, July 24, 2017

Factor Analysis

https://blog.dominodatalab.com/how-to-do-factor-analysis/


Factor Analysis and Its Difference from Principal Component Analysis (PCA)

Factor analysis aims to give insight into the latent variables that lie behind people's behavior and the choices they make. PCA, on the other hand, is all about finding the most compact representation of a dataset by picking the dimensions that capture the most variance. The distinction can be subtle, but one notable difference is that PCA assumes there is no measurement error or noise in the data: all of the 'noise' gets folded into the variance being captured.

Another important difference is that the number of researcher degrees of freedom, i.e. the choices one has to make, is much greater in factor analysis than in PCA. Not only does one have to choose the number of factors to extract (there are roughly ten theoretical criteria, which rarely converge on the same answer), but one must also decide on the method of extraction (there are roughly seven), the type of rotation (another seven or so), whether to use a correlation or covariance matrix, and so on.
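To make those decision points concrete, here is a minimal sketch in Python using the open-source factor_analyzer package (this is illustrative code, not code from the original post); the DataFrame df is a synthetic placeholder with a planted two-factor structure, and 'minres' and 'varimax' stand in for the many extraction and rotation options mentioned above:

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Synthetic data with two latent factors (items 1-3 vs. items 4-6).
rng = np.random.default_rng(0)
F = rng.normal(size=(500, 2))
W = np.array([[0.9, 0], [0.8, 0], [0.7, 0],
              [0, 0.9], [0, 0.8], [0, 0.7]])
df = pd.DataFrame(F @ W.T + rng.normal(scale=0.4, size=(500, 6)),
                  columns=[f"item{k}" for k in range(1, 7)])

# Choice 1: how many factors? One common (and contested) criterion is
# Kaiser's rule -- keep factors whose eigenvalues exceed 1.
fa = FactorAnalyzer(rotation=None)
fa.fit(df)
eigenvalues, _ = fa.get_eigenvalues()
n_factors = int((eigenvalues > 1).sum())

# Choices 2 and 3: extraction method and rotation type.
fa = FactorAnalyzer(n_factors=n_factors, method="minres", rotation="varimax")
fa.fit(df)
print(fa.loadings_)  # items x factors matrix of loadings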

More technically, running a factor analysis is the mathematical equivalent of asking a statistically savvy oracle the following: "Suppose there are N latent variables influencing people's choices. Tell me how much each latent variable influences the responses for each item that I see, assuming that there is measurement error on everything." Often, the 'behavior' or responses being analyzed come in the form of how people answer questions on surveys.


Mathematically speaking, for person i, item j, and observed behavior Y_ij, factor analysis seeks to determine the following:

Y_ij = W_j1 * F_i1 + W_j2 * F_i2 + … + W_jN * F_iN + U_ij

where the W's are the factor weights (or loadings), the F's are the factors, and U_ij is the measurement error, i.e. the variance that can't be accounted for by the other terms in the equation. The insight of the people who created factor analysis was that this equation is actually a matrix reduction problem.
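One way to see the equation in action is to simulate data from it and check that a fitted model recovers the structure. The sketch below uses scikit-learn's FactorAnalysis on synthetic data with two known factors; it is an illustration of the model, not the original post's code. Note that the estimates match W only up to sign and rotation, which is one reason the rotation choice matters in practice.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n_people, n_items, n_factors = 1000, 6, 2

# Known loadings W: items 1-3 load on factor 1, items 4-6 on factor 2.
W = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
              [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
F = rng.normal(size=(n_people, n_factors))           # latent factors F_ik
U = rng.normal(scale=0.3, size=(n_people, n_items))  # measurement error U_ij

# The model from the equation above: Y_ij = sum_k W_jk * F_ik + U_ij
Y = F @ W.T + U

fa = FactorAnalysis(n_components=n_factors)
fa.fit(Y)
print(fa.components_.T)  # estimated loadings; should resemble W up to sign/rotation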

Types of Factor Analysis

Both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are used to understand the shared variance of measured variables that is believed to be attributable to a factor or latent construct. The goal of EFA is to identify factors from the data and to maximize the amount of variance explained. The researcher is not required to have any specific hypotheses about how many factors will emerge or which items or variables those factors will comprise; if such hypotheses exist, they are not incorporated into, and do not affect, the results of the statistical analysis.

By contrast, CFA evaluates a priori hypotheses and is largely driven by theory. CFA requires the researcher to hypothesize, in advance, the number of factors, whether or not those factors are correlated, and which items/measures load onto and reflect which factors. As such, in contrast to exploratory factor analysis, where all loadings are free to vary, CFA allows certain loadings to be explicitly constrained to zero.
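A hedged sketch of what those zero constraints look like in code, assuming the confirmatory module that ships with the factor_analyzer package is available (the two-factor structure and item names below are invented for illustration). The model dictionary encodes the a priori hypothesis: items 1-3 reflect F1, items 4-6 reflect F2, and all cross-loadings are fixed at zero.

import numpy as np
import pandas as pd
from factor_analyzer import (ConfirmatoryFactorAnalyzer,
                             ModelSpecificationParser)

# Same kind of synthetic two-factor data as in the earlier sketches.
rng = np.random.default_rng(1)
F = rng.normal(size=(500, 2))
W = np.array([[0.9, 0], [0.8, 0], [0.7, 0],
              [0, 0.9], [0, 0.8], [0, 0.7]])
df = pd.DataFrame(F @ W.T + rng.normal(scale=0.4, size=(500, 6)),
                  columns=[f"item{k}" for k in range(1, 7)])

# The a priori hypothesis: which items load on which factors.
model_dict = {"F1": ["item1", "item2", "item3"],
              "F2": ["item4", "item5", "item6"]}
spec = ModelSpecificationParser.parse_model_specification_from_dict(df, model_dict)

cfa = ConfirmatoryFactorAnalyzer(spec, disp=False)
cfa.fit(df.values)
print(cfa.loadings_)  # cross-loadings stay at zero by construction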



By checking all the possible correlations of a variable with the others in the set, you can discover that a variable's variance comes in two types:

Unique variance: Some variance is unique to the variable under examination. It cannot be associated with what happens in any other variable.

Shared variance: Some variance is shared with one or more other variables, creating redundancy in the data. Redundancy implies that you can find the same information, with slightly different values, in various features and across many observations.
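In factor-analysis terms, the shared portion of a variable's variance is its communality and the unique portion is its uniqueness. A small sketch of the decomposition, again assuming the factor_analyzer package and synthetic two-factor data as in the earlier examples:

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(2)
F = rng.normal(size=(500, 2))
W = np.array([[0.9, 0], [0.8, 0], [0.7, 0],
              [0, 0.9], [0, 0.8], [0, 0.7]])
df = pd.DataFrame(F @ W.T + rng.normal(scale=0.4, size=(500, 6)),
                  columns=[f"item{k}" for k in range(1, 7)])

fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(df)

shared = fa.get_communalities()  # shared variance per item
unique = fa.get_uniquenesses()   # unique variance per item
print(shared + unique)           # each pair sums to ~1 for standardized items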