Abstract
We present and study an agent-based model of T-cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation (Carneiro et al. in Immunol Rev 216(1):48–68, 2007) that was used to study the self-organizing dynamics of a single population of T-cells in interaction with an idealized antigen presenting cell capable of presenting a single antigen. With agent-based modeling we are able to study the self-organizing dynamics of multiple populations of distinct T-cells which interact via antigen presenting cells that present hundreds of distinct antigens. Moreover, we show that such self-organizing dynamics can be guided to produce an effective binary classification of antigens, which is competitive with existing machine learning methods when applied to biomedical text classification. Specifically, we test our model on a dataset of publicly available full-text biomedical articles provided by the BioCreative challenge (Krallinger in The BioCreative II.5 challenge overview, p 19, 2009). We study the robustness of our model's parameter configurations, and show that the model leads to encouraging results comparable to state-of-the-art classifiers. Our results help us understand both T-cell cross-regulation as a general principle of guided self-organization and its applicability to document classification. They indicate that our bio-inspired algorithm is a promising novel method for biomedical article classification and for binary document classification in general.
Notes
We use the terminology of self/nonself discrimination, though a more accurate description may be the classification of harmless vs. harmful substances; harmless can also include antigens from bacteria that are necessary for vertebrate bodies, and harmful can also include the body's own tumor cells.
A good, though already somewhat dated, overview of the vertebrate immune system for the artificial life community is Hofmeyr's [12].
The simplification of proliferation to mere duplication adopted in the canonical CRM model is maintained in our agent-based model to minimize the number of parameters (excluding proliferation rates) and the parameter search space.
Every \(E_f\) or \(R_f\) has equal probability of binding to the APC that presents feature \(f\).
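As a minimal sketch of the equal-probability rule in this note, a single binding event can be modeled as a uniform draw over all cells specific to feature f. The function name `pick_binding_cell`, the pool composition, and the restriction to one binding event per call are assumptions made for illustration; they are not taken from the model's implementation.

```python
import random
from collections import Counter

def pick_binding_cell(cells, rng):
    """Return one cell drawn uniformly at random, or None if the pool is empty.

    `cells` is a hypothetical list holding every effector ("E") and
    regulatory ("R") cell specific to the presented feature f.
    """
    return rng.choice(cells) if cells else None

rng = random.Random(0)            # fixed seed so the sketch is repeatable
pool = ["E"] * 3 + ["R"] * 1      # assumed pool: 3 effectors, 1 regulator
draws = Counter(pick_binding_cell(pool, rng) for _ in range(10_000))
share_effector = draws["E"] / 10_000
```

Because each cell is equally likely to bind, the share of bindings won by each population tracks its relative size, so `share_effector` should be close to 3/4 for this pool.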
The list of common (stop) words includes 33 of the most common English words, from which we manually excluded the word “with”, as we know it to be of importance to PPI.
For feature extraction we used the training data of both BioCreative 2.5 and BioCreative 2, as described in [11]; all classifiers used the exact same feature set.
TF.IDF is a common text weighting measure to evaluate the importance of a feature/word in a document in a corpus. TF stands for term frequency in a document and IDF for inverse document frequency in the corpus.
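The note does not state which TF.IDF variant was used, so the following is only a sketch of one common form: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc length) * ln(N / df),
    where N is the number of documents and df is the number of
    documents that contain the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy corpus of tokenized documents.
toy_docs = [
    ["protein", "binds", "protein"],
    ["protein", "expression"],
    ["gene", "expression"],
]
weights = tf_idf(toy_docs)
```

Under this variant a word that occurs in every document receives a weight of zero, which is why very common words carry little weight as features.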
Notice that this parameter search on the provided labeled training data uses only the information available to the teams participating in the BioCreative 2.5 challenge, and none of the test data whose labels were revealed post-challenge.
\(\hbox{F-score} ={\frac{2 \cdot \hbox{Precision} \cdot \hbox{Recall}}{\hbox{Precision} + \hbox{Recall}}}\) where \(\hbox{Precision} = {\frac{\hbox{TP}}{\hbox{TP} + \hbox{FP}}}\) and \(\hbox{Recall} ={\frac{\hbox{TP}}{\hbox{TP} + \hbox{FN}}}\). True Positives (TP) and False Positives (FP) are the documents that the classifier correctly and incorrectly predicts to be relevant, while True Negatives (TN) and False Negatives (FN) are the documents that it correctly and incorrectly predicts to be irrelevant.
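The formulas in this note can be checked with a short worked example. The function name `precision_recall_f` and the counts below are made up for illustration, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the source.

```python
def precision_recall_f(tp, fp, fn):
    """Compute (precision, recall, F-score) from confusion-matrix counts.

    A measure is reported as 0.0 when its denominator is zero
    (an assumed convention for the degenerate cases).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Hypothetical counts: 40 relevant documents found, 10 false alarms, 20 missed.
p, r, f = precision_recall_f(tp=40, fp=10, fn=20)
# p = 40/50 = 0.8, r = 40/60 (about 0.667), f = 80/110 (about 0.727)
```

True Negatives do not enter any of the three measures, so the F-score is unaffected by how many irrelevant documents are correctly rejected.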
References
Carneiro J, Leon K, Caramalho I, van den Dool C, Gardner R, Oliveira V, Bergman ML, Sepúlveda N, Paixão T, Faro J, Demengeot J (2007) When three is not a crowd: a crossregulation model of the dynamics and repertoire selection of regulatory cd4 t cells. Immunol Rev 216(1):48–68
Krallinger M (2009) The BioCreative II.5 challenge overview, p 19
Hunter L, Cohen KB (2006) Biomedical language processing: what’s beyond pubmed? Mol Cell 21(5):589–594
Jensen L, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7(2):119–129. doi:10.1038/nrg1768
Shatkay H, Feldman R (2003) Mining the biomedical literature in the genomic era: an overview. J Comput Biol 10(6):821–856
Hersh W, Bhupatiraju RT, Corley S (2004) Enhancing access to the bibliome: the trec genomics track. Medinfo 11(Pt 2):773–777
Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of biocreative: critical assessment of information extraction for biology. BMC Bioinform 6(Suppl 1):S1
Krallinger M, Valencia A (2007) Evaluating the detection and ranking of protein interaction relevant articles: the biocreative challenge interaction article sub-task (ias). In: Proceedings of the 2nd biocreative challenge evaluation workshop
Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
Abi-Haidar A, Kaur J, Maguitman A, Radivojac P, Retchsteiner A, Verspoor K, Wang Z, Rocha LM (2008) Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks. p 9(Suppl 2):S11
Kolchinsky A, Abi-Haidar A, Kaur J, Hamed AA, Rocha LM (2010) Classification of protein-protein interaction full-text documents using text and citation network features. IEEE/ACM Trans Comput Biol Bioinform 7(3):400–411. doi:10.1109/TCBB.2010.55. URL http://www.computer.org/portal/web/csdl/doi/10.1109/TCBB.2010.55
Hofmeyr SA (2001) An interpretative introduction to the immune system. Design principles for the immune system and other distributed autonomous systems
Segel LA, Cohen I (2001) Design principles for the immune system and other distributed autonomous systems. Oxford University Press, Oxford
Mitchell M (2006) Complex systems: network thinking. Artif Intell 170(18):1194–1212
Peak D, West JD, Messinger SM, Mott KA (2004) Evidence for complex, collective dynamics and distributed emergent computation in plants. PNAS 101(4):918–922
Helikar T, Konvalina J, Heidel J, Rogers JA (2008) Emergent decision-making in biological signal transduction networks. Proc Natl Acad Sci USA 105(6):1913–1918. doi:10.1073/pnas.0705088105
Walters M, Sperandio V (2006) Quorum sensing in escherichia coli and salmonella. Int J Med Microbiol 296(2–3):125–131. doi:10.1016/j.ijmm.2006.01.041
Pratt SC (2005) Quorum sensing by encounter rates in the ant temnothorax albipennis. Behav Ecol 16(2):488–496. doi:10.1093/beheco/ari020
Crutchfield J, Mitchell M (1995) The evolution of emergent computation. PNAS 92(23)
Rocha LM, Hordijk W (2005) Material representations: from the genetic code to the evolution of cellular automata. Artif Life 11(1–2):189–214
Shalizi C, Haslinger R, Rouquier J-B, Klinkner K, Moore C (2006) Automatic filters for the detection of coherent structure in spatiotemporal systems. Phys Rev E 73
Timmis J (2007) Artificial immune systems today and tomorrow. Nat Comput 6(1):1–18
Twycross J, Cayzer S (2002) An immune system approach to document classification. Master’s thesis, COGS, University of Sussex, UK
Dasgupta D, Nino F (2008) Immunological computation: theory and applications. AUERBACH
Garrett SM (2003) A paratope is not an epitope: implications for immune networks and clonal selection. pp 217–228
Abi-Haidar A, Rocha LM (2008) Artificial immune systems (Proc. ICARIS), pp 36–47
Abi-Haidar A, Rocha LM (2008) Artificial life XI: 11th international conference on the simulation and synthesis of living systems. MIT Press, Cambridge, pp 1–9
Tsymbal A (2004) The problem of concept drift: definitions and related work. Comput Sci Dep Trinity Coll Dublin 4(C):200415
Paul WE, Technologies IO (1993) Fundamental immunology. Raven Press, New York
Burnet SFM (1959) The clonal selection theory of acquired immunity. Vanderbilt University Press, Nashville
De Castro LN, Timmis J (2002) Artificial immune systems: a new computational intelligence approach. Springer, Berlin
Sepulveda NH (2009) How is the t-cell repertoire shaped. Ph.D. thesis, Instituto Gulbenkian de Ciencia
Abi-Haidar A, Rocha LM (2010) In: ICARIS 2010: Proceedings of the 9th international conference on artificial immune systems, pp 237–249
Abi-Haidar A, Rocha LM (2010) In: Artificial life XII: twelfth international conference on the simulation and synthesis of living systems, pp 706–713
Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with Naive Bayes–Which Naive Bayes? In: Third Conference on Email and Anti-Spam (CEAS)
Joachims T (2002) Learning to classify text using support vector machines: methods, theory, and algorithms. Kluwer, Dordrecht
Porter MF (1980) An algorithm for suffix stripping. Program 13(3):130–137
Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation, pp 1015–1021
Acknowledgments
This work was partially supported by a grant from the FLAD Computational Biology Collaboratorium at the Instituto Gulbenkian de Ciencia in Portugal. We also thank the ICARIS2010 committee board for encouraging this work. We acknowledge the computational resources provided by Indiana University used to conduct the simulations we report.
About this article
Cite this article
Abi-Haidar, A., Rocha, L.M. Collective classification of textual documents by guided self-organization in T-Cell cross-regulation dynamics. Evol. Intel. 4, 69–80 (2011). https://doi.org/10.1007/s12065-011-0052-5
DOI: https://doi.org/10.1007/s12065-011-0052-5