Standardization of Featureless Variables for Machine Learning Models Using Natural Language Processing

  • Kourosh Modarresi
  • Abdurrahman Munir
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10861)


AI and machine learning are mathematical modeling methods for learning from data and producing intelligent models based on this learning. The data these models must handle is normally of mixed type, containing both numerical (continuous) variables and categorical (non-numerical) variables. Most models in AI and machine learning accept only numerical data as input, so standardization of mixed data into numerical data is a critical step when applying machine learning models. Getting data into the standard shape and format that models require is often a time-consuming, yet very significant, step of the process.
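The conversion the abstract describes can be illustrated with a minimal sketch (pure Python, hypothetical column names; not the authors' method): numerical columns are z-scored and categorical columns are one-hot encoded, so the model receives only numbers.

```python
def standardize_mixed(rows, numeric_cols, categorical_cols):
    """Convert mixed-type records into purely numerical feature vectors."""
    # Compute mean and standard deviation for each numeric column.
    stats = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[col] = (mean, var ** 0.5 or 1.0)  # guard against zero std
    # Collect the observed categories for each categorical column.
    categories = {col: sorted({row[col] for row in rows})
                  for col in categorical_cols}
    # Emit one numeric vector per record: z-scores, then one-hot indicators.
    encoded = []
    for row in rows:
        vec = [(row[col] - stats[col][0]) / stats[col][1]
               for col in numeric_cols]
        for col in categorical_cols:
            vec.extend(1.0 if row[col] == c else 0.0
                       for c in categories[col])
        encoded.append(vec)
    return encoded

# Example with hypothetical fields "age" (numeric) and "country" (categorical):
rows = [{"age": 25, "country": "US"}, {"age": 35, "country": "DE"}]
print(standardize_mixed(rows, ["age"], ["country"]))
# → [[-1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

In practice the category vocabulary must be learned from training data and reused unchanged on new records, otherwise the feature positions shift between training and inference.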


Keywords: Machine learning · Natural Language Processing · Mixed-type variables



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Adobe Inc., San Jose, USA
