Clustering Optimization Using K-Means with Principal Component Analysis and Mean-Median Based Centroid Initialization

Authors

  • Aisyah Rahma Putri, Mathematics Study Program, Faculty of Mathematics and Natural Sciences, State University of Malang
  • Lita Wulandari Aeli, Mathematics Study Program, Faculty of Mathematics and Natural Sciences, State University of Malang, https://orcid.org/0009-0003-9432-4291
  • Ramdhan Fazrianto Suwarman, Mathematics Study Program, Faculty of Mathematics and Natural Sciences, State University of Malang
  • Andi Daniah Pahrany, Mathematics Study Program, Faculty of Mathematics and Natural Sciences, State University of Malang

DOI:

https://doi.org/10.31851/sainmatika.v22i2.20168

Keywords:

Initial centroid, K-Means, Mean, Median, PCA, Welfare

Abstract

East Java is the second most populous province in Indonesia, which poses major challenges in addressing poverty and unemployment. Clustering its districts/cities helps identify areas that need more attention, particularly in specific aspects of welfare. This study applies the K-Means clustering algorithm, combined with principal component analysis (PCA) and mean-median centroid initialization, to data on community welfare indicators in East Java Province. The method produced six clusters, whose main priorities are increasing economic growth, decreasing unemployment, and improving quality of life, which in turn can reduce poverty and improve overall welfare. The results show that combining PCA with K-Means improves the quality of the clustering, while initializing centroids with mean-median values accelerates the convergence of K-Means. PCA yielded two principal components with a cumulative explained variance of 58%. The clustering evaluation gave a Silhouette Coefficient of 0.505 and a Davies-Bouldin Index (DBI) of 0.601, with convergence reached in 7 iterations.
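The pipeline described in the abstract (reduce the indicators with PCA, seed K-Means with mean-median centroids, then iterate to convergence) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the matrix size is arbitrary, and `mean_median_init` is a hypothetical reading of "mean-median initialization" (sort points along the first component, split them into k groups, and average each group's mean and median), since the abstract does not spell out the procedure. All function names are illustrative.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project standardized data onto its leading principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:n_components].T, explained[:n_components].sum()

def mean_median_init(points, k):
    """ASSUMPTION: one possible mean-median seeding, not the paper's exact rule.

    Sort points by their first coordinate, split them into k contiguous
    groups, and use the average of each group's mean and median as a centroid.
    """
    order = np.argsort(points[:, 0])
    groups = np.array_split(points[order], k)
    return np.array([(g.mean(axis=0) + np.median(g, axis=0)) / 2 for g in groups])

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations from the given centroids; returns the iteration count."""
    centroids = centroids.copy()
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, iteration

# Synthetic stand-in for a regions-by-indicators matrix (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 8))

scores, cumulative_variance = pca_scores(X, n_components=2)
initial_centroids = mean_median_init(scores, k=6)  # six clusters, as in the abstract
labels, centroids, n_iter = kmeans(scores, initial_centroids)

print(f"cumulative explained variance: {cumulative_variance:.2f}")
print(f"converged after {n_iter} iterations")
```

Because the seeds are deterministic, repeated runs on the same data give the same clusters and iteration count, which makes it straightforward to compare convergence speed against random initialization. Evaluation metrics such as the Silhouette Coefficient and DBI are left out of the sketch to keep it short.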

Published

2025-12-10

How to Cite

Clustering Optimization Using K-Means with Principal Component Analysis and Mean-Median Based Centroid Initialization. (2025). Sainmatika: Jurnal Ilmiah Matematika Dan Ilmu Pengetahuan Alam, 22(2), 108-128. https://doi.org/10.31851/sainmatika.v22i2.20168