Austrian Online Archive Processing: Analyzing Archives of the World Wide Web
With the popularity of the World Wide Web and the recognition that it is worth archiving, numerous projects aim at creating large-scale repositories containing excerpts and snapshots of Web data. Interfaces are being created that allow users to surf through time, analyze the evolution of Web pages, or retrieve information using search interfaces. Yet, with the timeline and metadata available in such a Web archive, additional analyses that go beyond mere information exploration become possible. In this paper we present the AOLAP project, which builds a Data Warehouse from such a Web archive, allowing its analysis and exploration from different points of view using OLAP technologies. Specifically, technological aspects such as the operating systems and Web servers used, geographic location, and Web technology such as the use of file types, forms, or scripting languages can be used to infer, for example, technology maturation or impact.
Keywords: Web Archiving; Data Warehouse (DWH); On-Line Analytical Processing (OLAP); Technology Evaluation; Digital Cultural Heritage
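As a rough illustration of the kind of analysis the abstract describes, the following minimal sketch treats each archived document as a fact row and counts documents along chosen dimensions, in the style of an OLAP roll-up. The `Fact` fields, the `roll_up` helper, and the sample rows are all hypothetical and invented for this example; they are not the AOLAP schema or data.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Fact(NamedTuple):
    """One archived document; every field is a hypothetical dimension."""
    crawl_year: int
    domain: str
    web_server: str
    file_type: str


def roll_up(facts: Iterable[Fact], *dimensions: str) -> Counter:
    """Count documents grouped by the chosen dimensions (an OLAP-style roll-up)."""
    return Counter(tuple(getattr(f, d) for d in dimensions) for f in facts)


# Invented sample rows, for illustration only.
FACTS = [
    Fact(2001, "ac.at", "Apache", "html"),
    Fact(2001, "ac.at", "Apache", "pdf"),
    Fact(2001, "co.at", "IIS", "html"),
    Fact(2002, "co.at", "Apache", "html"),
    Fact(2002, "gv.at", "Apache", "html"),
]

# Coarse view: documents per Web server across the whole archive.
by_server = roll_up(FACTS, "web_server")

# Finer view (drill-down): the same measure split by crawl year as well.
by_year_and_server = roll_up(FACTS, "crawl_year", "web_server")

for key, count in sorted(by_year_and_server.items()):
    print(key, count)
```

Passing fewer dimension names gives a coarser view and passing more gives a finer one, which is the basic idea behind looking at the same archive "from different points of view". A real data warehouse would hold these facts in a database with separate dimension tables rather than in an in-memory list.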