Abstract
Massive transaction streams present a number of opportunities for data mining techniques. Transactions might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, e.g., credit-card numbers or IP addresses, use the associated services. For over six years, we have computed evolving profiles (called signatures) of the transactors in several large data streams. The signature for each transactor captures the salient features of his or her transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because signature programs often sacrificed readability for performance, they were difficult to verify and maintain. Hancock is a domain-specific language created to express computationally efficient signature programs cleanly. In this chapter, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.
C. Cortes, K. Fisher, D. Pregibon, A. Rogers and F. Smith, ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 26 Issue 2, March 2004, Pages 301–338. DOI: 10.1145/973097.973100, © 2004 ACM, Reprinted with permission.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Cortes, C., Fisher, K., Pregibon, D., Rogers, A., Smith, F. (2016). Hancock: A Language for Analyzing Transactional Data Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_19
DOI: https://doi.org/10.1007/978-3-540-28608-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer Science, Computer Science (R0)