Abstract
An ad hoc data source is any semi-structured, non-standard data source. The format of such data sources is often evolving and frequently lacking documentation. Consequently, off-the-shelf tools for processing such data often do not exist, forcing analysts to develop their own tools, a costly and time-consuming process. In this paper, we present an incremental algorithm that automatically infers the format of large-scale data sources. From the resulting format descriptions, we can generate a suite of data processing tools automatically. The system can handle large-scale or streaming data sources whose formats evolve over time. Furthermore, it allows analysts to modify inferred descriptions as desired and incorporates those changes in future revisions.
Keywords
- Edit Distance
- Initial Description
- Dependent Pair
- Membership Query
- Grammatical Inference
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, K.Q., Fisher, K., Walker, D. (2012). LearnPADS++: Incremental Inference of Ad Hoc Data Formats. In: Russo, C., Zhou, NF. (eds) Practical Aspects of Declarative Languages. PADL 2012. Lecture Notes in Computer Science, vol 7149. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27694-1_13
DOI: https://doi.org/10.1007/978-3-642-27694-1_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27693-4
Online ISBN: 978-3-642-27694-1
eBook Packages: Computer Science, Computer Science (R0)
