Generalized Variable Conversion Using K-means Clustering and Web Scraping

  • Kourosh Modarresi
  • Abdurrahman Munir
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10861)


AI and machine learning are built on data and on learning from data, so that the resulting insights can be used for analysis and prediction. Almost all data sets contain mixed variable types: variables may be quantitative (numerical) or qualitative (categorical). The problem arises because many machine learning methods, such as multiple regression, logistic regression, k-means clustering, and support vector machines, are designed to handle numerical data only. Since the data to be analyzed and learned from are almost always of mixed type, a standardization step must be applied to these data sets. This standardization involves converting the qualitative (categorical) variables into numerical ones.
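The abstract does not spell out the algorithmic details, but the conversion step it describes can be illustrated with a standard baseline. The sketch below (not the paper's method, which additionally uses k-means clustering and web scraping) one-hot encodes a categorical column into numeric vectors that regression, k-means, or SVMs can consume; the function name and sample data are illustrative.

```python
# Illustrative baseline for categorical-to-numerical conversion:
# one-hot encoding maps each distinct category to a unit basis vector.

def one_hot_encode(values):
    """Return (encoded rows, sorted category list) for a categorical column."""
    categories = sorted(set(values))                  # deterministic ordering
    index = {c: i for i, c in enumerate(categories)}  # category -> column
    encoded = [[1 if index[v] == j else 0 for j in range(len(categories))]
               for v in values]
    return encoded, categories

colors = ["red", "green", "blue", "green"]
encoded, cats = one_hot_encode(colors)
print(cats)        # ['blue', 'green', 'red']
print(encoded[0])  # 'red' -> [0, 0, 1]
```

One known weakness of this baseline, which motivates approaches like the paper's, is that one-hot vectors treat all categories as equally dissimilar; clustering categories by external (e.g. web-scraped) context can instead place semantically similar categories near one another.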


Keywords: Mixed variable types · NLP · K-means clustering



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Adobe Inc., San Jose, USA
