Introduction

Empirical research usually involves large data sets due to the number of subjects involved. Each subject is measured individually in order to explore the interdependence between subjects. The lack of prior information on which subjects are the most influential is compensated by collecting as much data as possible, in the hope that no key subjects are missing. However, this gives rise to difficulty in interpreting the data itself. Dimension reduction techniques are then considered to overcome the problem, and principal component analysis (PCA) is the most widely used. PCA reduces the dimension of multivariate data by replacing a number of subjects with a smaller number of derived subjects. These so-called principal components are obtained as linear combinations of the original subjects which are uncorrelated and have the largest variance. Thus, it is possible to select k subjects from p subjects in total, where k < p. However, it is not guaranteed that a simple interpretation can be drawn (Jolliffe 2002). In this framework, subject selection contributes by reducing the number of subjects which are irrelevant to the study or have only a minor impact on the data variation. Pioneering work on subject selection can be found in Beale et al. (1967), who propose removing or reducing insignificant subjects in regression analysis. Jolliffe (1972) describes subject selection methods based on the correlation coefficient, PCA, and cluster analysis. King and Jackson (1999) implement PCA-based subject selection methods and recommend the B4 method in an ecological study. George (2000) discusses subject selection as a special case of model selection in multivariate regression. Al Kandari and Jolliffe (2001, 2005) explain some criteria for subject selection based on the covariance of the principal components, along with their effects on data variation.

On the other side, procrustes analysis (PA) is a set of mathematical least-squares techniques for directly estimating and performing simultaneous similarity transformations among model point coordinate matrices up to their maximal agreement. PA was introduced by Hurley and Cattell (1962) to solve a kind of multivariate regression problem. PA employs data scaling and configuration scaling in calculating the matching measure. Their aim is to eliminate possible incommensurability of subjects within the individual data sets (data scaling) and size differences between data sets (configuration scaling); see Gower and Dijksterhuis (2004). Basically, translation, rotation, and dilation, performed in that order, are the kinds of transformations that may be deemed desirable before embarking on the actual procrustes matching (Digby and Kempton 1987; Al Kandari and Jolliffe 2001; Bakhtiar and Siswadi 2011). PA can also be utilized to determine the goodness of fit between a data matrix and its approximation (Siswadi and Bakhtiar 2011). In this work we exploit PA to measure the best matching between the original data matrix and the reduced matrix resulting from subject selection.

The aim of this paper is to implement PCA and PA in examining educational data. As is well known, educational data commonly comprise very large data sets due to the number of registered students as well as the number of subjects offered. We study and compare four subject selection methods based on PCA and PA to identify the courses with a dominant effect on the quality standard of the educational process.

Materials and method

In this work we exploit a large set of educational data and implement four different approaches to perform subject selection, namely the B2, B4, PCA-PA, and PA methods. The first two methods rely exclusively on PCA, the third is a combination of PCA and PA, and the last is solely a PA method.

Database

This study involves the academic records of 857 Higher Secondary students in 11 subjects taken during the academic year 2013–2014. Each academic record is marked by the numbers 0, 1, 2, 3, 4, where 4 represents the best achievement. The original data are stored in a matrix of size 857 \(\times\) 11, where each row represents an individual observation and each column corresponds to a subject. We codify the 11 subjects as follows: Economics (ECON), Geography (GEGR), History (HIST), Political Science (POLS), Sanskrit (SNSK), Accountancy (ACCT), Business Studies (BSTD), Biological Sciences (BIOS), Chemistry (CHEM), Mathematics (MATH), and Physics (PHYS).

Descriptive statistics of the data, as depicted in Table 1, show that ECON, BSTD, CHEM, and SNSK are the subjects whose averages are high, i.e., the proportion of 4-marks is higher than that of other marks, while POLS and MATH are the subjects with the lowest averages. The table also shows that ECON, GEGR, HIST, POLS, and SNSK are the subjects with the largest variances, and MATH and PHYS those with the lowest. Further description by box-plot shows that ECON, ACCT, BIOS, and CHEM have symmetric distributions, HIST, POLS, and PHYS are positively skewed, and the remaining subjects are negatively skewed. Moreover, calculation of the Pearson correlation matrix indicates that almost all subject pairs have p values below 1%, which shows significant correlation between subjects. In particular, POLS and MATH are the most correlated pair of subjects, while BSTD and PHYS are the least correlated. The former fact is not surprising, since MATH is a prerequisite for enrolling in ECON.

Table 1 Statistical characteristics of location

Method

Principal component analysis

For given subjects \(X_1, X_2, \ldots, X_p\), a principal component is a linear combination of the subjects which maximizes the variation of the data. Suppose that all subjects are collected in X; then the first principal component is given by

$$w_{1}^{T}X=w_{11}X_{1}+\cdots+w_{1p}X_{p}$$
(1)

where the weight coefficient vector \(w_1\) should be determined such that it maximizes the variance. The second principal component \(w_2^T X\) should be constructed such that it is uncorrelated with the first principal component and has the second largest variance, and so on. The standard Lagrange multiplier technique reveals that the optimal weight \(w_i\) is the eigenvector of the covariance matrix of X corresponding to the i-th largest eigenvalue \(\lambda_i\).

In general, the transformation from the original subject matrix X to the principal components Y can be written as \(Y = WX\), where W denotes the weighting matrix constructed from the eigenvectors of the covariance matrix of X. The position of each object in the principal component coordinate system, i.e., the score, is provided by \(Z = XW^T\). The proportion of the total variance explained by the first k principal components, \(V_k\), is then given by

$$V_{k}=\frac{\sum\nolimits_{i=1}^{k}\lambda_{i}}{\sum\nolimits_{i=1}^{p}\lambda_{i}}.$$
(2)
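
To make the computation concrete, the following Python sketch (using synthetic placeholder marks, not the study's records) derives the weights as eigenvectors of the covariance matrix, forms the score matrix, and evaluates \(V_k\) of Eq. (2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(857, 11)).astype(float)  # placeholder marks 0-4, not the study's data

C = np.cov(X, rowvar=False)              # 11 x 11 covariance matrix of the subjects
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder by decreasing eigenvalue
eigvals = eigvals[order]
W = eigvecs[:, order].T                  # rows of W are the weight vectors w_i

Z = (X - X.mean(axis=0)) @ W.T           # scores Z = X W^T, computed on centred data

V = np.cumsum(eigvals) / eigvals.sum()   # V_k of Eq. (2) for k = 1, ..., p
k = int(np.searchsorted(V, 0.80)) + 1    # smallest k with V_k >= 80 %
print(k, V[k - 1])
```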

In our subsequent analysis we shall also denote by X the data matrix instead of the subject matrix. Jolliffe (2002) and Gower and Dijksterhuis (2004) describe several criteria for determining the number of principal components that should be employed to represent the variation of the data matrix X. A cumulative percentage of the total variation in the range of 70–90% will preserve most of the information contained in X. The variance of a principal component can also be considered as a criterion, where a principal component whose variance is less than one, i.e., \(\lambda_k < 1\), is considered less informative and hence might be excluded. Another way to determine the number of principal components is cross-validation, where it is suggested to compute the strength of prediction when the k-th principal component is added; the point prediction raised by this method is based on the singular value decomposition. Jolliffe (1972) introduces methods for selecting the best subset of subjects, in the sense of the degree of data variation preserved, based on PCA. They are the B1, B2, B3, and B4 methods. In this work we shall exploit the B2 and B4 methods for subject selection.

Procrustes analysis

Suppose Y is a configuration of n points in a q-dimensional Euclidean space with coordinates given by an n × q matrix \(Y = (y_{ij})\). This configuration needs to be optimally matched to another configuration of n points in a p-dimensional Euclidean space with coordinate matrix \(X = (x_{ij})\). It is assumed that the r-th point in the first configuration is in one-to-one correspondence with the r-th point in the second configuration. If p > q, then p − q columns of zeros are appended to the end of matrix Y so that both configurations lie in a space of the same dimension. Henceforth, it is assumed without loss of generality that p = q. To measure the difference between two n-point configurations, PA exploits the sum of the squared distances E between the points in the Y space and the corresponding points in the X space. This measure is also known as the procrustes distance, which is given by

$$E(X,Y)=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{p}(x_{ij}-y_{ij})^{2}.$$
(3)

A series of transformations, namely translation, rotation, and dilation, precedes the calculation of the distance. Optimal translation is achieved by making the centroids of both configuration matrices coincide at the origin. The matrices after the translation process are denoted by \(X_T\) and \(Y_T\). PA performs the rotation of \(Y_T\) over \(X_T\) by post-multiplying \(Y_T\) by an orthogonal matrix Q. The rotation is sought such that it minimizes \(E(X_T, Y_T Q)\). It can be proved that the optimal rotation matrix is given by \(Q^* = VU^T\), where \(USV^T\) is the complete singular value decomposition of \(X_T^T Y_T\) (Sibson 1978). As the last adjustment, dilation is undertaken by multiplying the configuration \(Y_T Q^*\) by a scalar c. The scalar should be selected such that it minimizes the procrustes distance \(E(X_T, cY_T Q^*)\). Overall, subject to the optimal translation-rotation-dilation adjustment, the lowest possible procrustes distance \(E^*\) is provided by

$$E^{*}(X,Y)=tr(X_{T}X_{T}^{T})-\frac{tr^{2}(X_{T}Q^{*T}Y_{T}^{T})}{tr(Y_{T}Y_{T}^{T})}.$$
(4)

Goodness of fit measure GF based on PA can then be formulated as

$$GF(X,Y)=1-\frac{E^{*}(X,Y)}{tr(XX^{T})},$$
(5)

which lies in the range 0–1. This measure shall be utilized in subject selection, where a reduced matrix which provides a smaller goodness-of-fit coefficient is considered less significant.
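
The whole translation-rotation-dilation matching of Eqs. (3)–(5) condenses into a few lines. The sketch below is a minimal NumPy rendering, assuming both configurations already have the same number of columns (pad Y with zero columns beforehand otherwise):

```python
import numpy as np

def procrustes_gf(X, Y):
    """Goodness of fit GF(X, Y) of Eq. (5) after optimal translation,
    rotation, and dilation of Y onto X; X and Y must have equal shape."""
    XT = X - X.mean(axis=0)                  # translation: centroids to the origin
    YT = Y - Y.mean(axis=0)
    S = np.linalg.svd(XT.T @ YT, compute_uv=False)   # singular values of X_T^T Y_T
    # For the optimal rotation Q* = V U^T, tr(X_T Q*^T Y_T^T) equals sum(S),
    # so E* of Eq. (4) needs only the singular values
    E = np.trace(XT @ XT.T) - S.sum() ** 2 / np.trace(YT @ YT.T)
    return 1.0 - E / np.trace(X @ X.T)       # Eq. (5)
```

The explicit rotation matrix \(Q^* = VU^T\) is never formed, since the trace in Eq. (4) reduces to the sum of the singular values of \(X_T^T Y_T\).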

B2 method

The procedure in the B2 method is a simplification of that in the B1 method, where the analysis based on principal components is performed only once. The procedure begins by performing PCA on the n × p data matrix. If we decide to retain q subjects, then the weight coefficients \(w_{ij}\) with the largest magnitude are selected from the last p − q principal components and linked to the corresponding subjects. These p − q subjects are then removed, starting from the last component.

B4 method

Similar to the B2 method, the B4 method needs only a single PCA step, but the procedure now works in the opposite direction. We start by performing PCA on the n × p data matrix. The selection process is performed by choosing the coefficients with the largest magnitude from the first q principal components, compared with each other starting from the first component.
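
Under the reading given above, both one-pass rules admit a short sketch; `W` is assumed to hold the loadings with rows ordered by decreasing eigenvalue, as in the earlier PCA sketch, and the functions return subject indices for illustration only:

```python
import numpy as np

def b2_excluded(W, q):
    """B2: from the last p - q components, working backwards from the last,
    discard the subject carrying the largest-magnitude loading."""
    p = W.shape[1]
    removed = []
    for i in range(p - 1, q - 1, -1):        # last p - q components
        loads = np.abs(W[i]).copy()
        loads[removed] = -np.inf             # skip subjects already discarded
        removed.append(int(np.argmax(loads)))
    return removed                           # indices of excluded subjects

def b4_selected(W, q):
    """B4: from the first q components, working forwards from the first,
    keep the subject carrying the largest-magnitude loading."""
    kept = []
    for i in range(q):                       # first q components
        loads = np.abs(W[i]).copy()
        loads[kept] = -np.inf                # skip subjects already selected
        kept.append(int(np.argmax(loads)))
    return kept                              # indices of selected subjects
```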

PCA-PA method

After performing PCA on the n × p data matrix X, we construct a score matrix Z from the first k principal components, which represents the data structure. The matrix Z constitutes the base configuration for comparison with other configurations. Next, we remove one column of X at a time and perform PCA on the reduced data matrix to produce Y(i), where Y(i) denotes the n × k score matrix obtained from PCA after removing the i-th column of X. We then compare Y(i) with the base configuration Z by using PA to provide a goodness-of-fit measure. The subject corresponding to the i-th column with the smallest goodness-of-fit coefficient is excluded. We rerun the procedure until q subjects remain. These q selected subjects represent all p subjects of the data.
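
A minimal sketch of one elimination step, assuming a goodness-of-fit function such as the `procrustes_gf` sketched earlier is passed in as `gf`:

```python
import numpy as np

def pca_scores(X, k):
    """Score matrix of the first k principal components of X."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # p x k loading matrix
    return Xc @ W                                    # n x k score matrix

def pca_pa_exclude_one(X, k, gf):
    """One elimination step of the PCA-PA method: drop the column whose
    removal yields the smallest goodness of fit against the base scores
    (exclusion rule as stated in the text)."""
    Z = pca_scores(X, k)                             # base configuration
    fits = [gf(Z, pca_scores(np.delete(X, i, axis=1), k))
            for i in range(X.shape[1])]
    return int(np.argmin(fits))                      # index of the excluded column
```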

PA method

In this method we apply procrustes analysis directly to select subjects. Obviously, this is simpler than the previous methods. We first replace one column of X at a time by a column of zeros. We then match this new matrix to the original matrix X. The subject that provides the smallest goodness-of-fit coefficient is excluded. We repeat the process until q subjects remain.
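
One possible rendering of this loop, again with the goodness-of-fit function passed in as `gf`; the treatment of already-excluded columns (kept as zeros in later rounds) is our assumption, as the text leaves it open:

```python
import numpy as np

def pa_select(X, q, gf):
    """PA method: zero out one remaining column at a time, match the result
    to X, and exclude the subject with the smallest goodness of fit,
    repeating until q subjects remain (one reading of the text)."""
    X = X.astype(float).copy()
    active = list(range(X.shape[1]))
    while len(active) > q:
        fits = {}
        for i in active:
            Xi = X.copy()
            Xi[:, i] = 0.0                  # replace column i by zeros
            fits[i] = gf(X, Xi)             # match the zeroed matrix to X
        worst = min(fits, key=fits.get)     # smallest GF coefficient
        active.remove(worst)
        X[:, worst] = 0.0                   # keep the excluded column zeroed
    return active                           # indices of the q selected subjects
```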

Efficiency score

An efficiency measure is then needed to justify whether a certain method is more efficient than others in representing the original data. Al Kandari and Jolliffe (2001, 2005) and Westad et al. (2003) suggest an efficiency measure based on the total percentage of variation which can be explained by the first k principal components constructed from the selected q subjects, whose expression is provided in the previous section. In this study, the efficiency score is measured according to the procrustes distance spanned by the matrices. Suppose that X is the original data matrix and \(X^q\) is a configuration obtained by keeping q subjects of X. We denote by Y and \(Y^q\) the PCA score matrices related to X and \(X^q\), respectively. We assume here that Y is the best approximation of X. Then, the efficiency score \(R^2\) is calculated according to the following formula

$$R^{2}=\left[1-\frac{E^{*}(Y,Y^{q})}{tr(YY^{T})}\right]\times 100\%$$
(6)

The efficiency score \(R^2\) varies between 0 and 100%. A higher score reveals greater efficiency and thus a closer similarity between configurations. Note that \(R^2\) is precisely the goodness-of-fit measure of Eq. (5) applied to the score matrices, expressed as a percentage.
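
Since Eq. (6) is just the measure of Eq. (5) applied to the score matrices and scaled to a percentage, it can be computed directly:

```python
import numpy as np

def efficiency_score(Y, Yq):
    """Efficiency score R^2 of Eq. (6): the GF measure of Eq. (5)
    applied to the PCA score matrices, expressed as a percentage."""
    YT, YqT = Y - Y.mean(axis=0), Yq - Yq.mean(axis=0)
    S = np.linalg.svd(YT.T @ YqT, compute_uv=False)          # singular values
    E = np.trace(YT @ YT.T) - S.sum() ** 2 / np.trace(YqT @ YqT.T)  # E*(Y, Y^q)
    return (1.0 - E / np.trace(Y @ Y.T)) * 100.0
```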

Results and discussion

Based on the data exploration, the number of selected subjects q is not determined by a certain eigenvalue; rather, we follow a criterion proposed by Jolliffe (1972), where q is selected such that the subjects can explain at least 80% of the variation of the data. This means we keep 7 of the 11 subjects, which coincides with the number of departments offering the subjects. For the PCA-PA based methods, we use the first two principal components, i.e., k = 2, for the analysis, since they can explain up to 80% of the variation of the data.

Table 2 gives the results of subject selection using the four methods in terms of selected and excluded subjects. All the methods show an almost consistent outcome: GEGR, MATH, HIST, and BIOS are always selected by all four methods, whereas CHEM and BSTD are recommended by three methods. GEGR, HIST, BIOS, and MATH are four subjects with the highest variances and thus contribute more to the variation. GEGR, especially, is selected by all methods as the first priority. On the other side, POLS is always excluded by all the methods. The excluded subjects, except for MATH, have higher averages and lower variances than the others, and hence contribute considerably less to the variation of the data. Another fact confirmed by the results relates to GEGR and BIOS. Except with the PCA-PA method, these two subjects show reverse behavior: if MATH is included then GEGR is excluded, and vice versa. This can be understood, since these subjects have similar characteristics due to a high correlation, and one is a prerequisite for the other.

Table 2 Results of subject selection by the four methods

In particular, 6 of the 8 subjects selected by the B2 method are also selected by the B4 method, meaning that the B2 and B4 methods share 74.38% similarity. The PCA-PA and PA methods show a similar agreement even though PA is much simpler: they endorse seven mutually selected subjects, equivalent to 88.8% similarity. Given its straightforwardness, the PA method is preferable. From the efficiency point of view, all the methods are efficient and show insignificant differences, since all provide high and similar scores of more than 99%.

Conclusion

We have implemented a series of subject selection methods based on principal component and procrustes analyses. The methods have been applied to the assessment of educational data. It has been shown that all the methods provide consistent results, with only minor differences among them. The outcome of this research can benefit school education management in decision making, particularly in course mapping and student clustering.