Abstracting and Structuring Web Contents for Supporting Personal Web Experiences
This paper presents a novel approach for supporting abstraction and structuring mechanisms of Web contents. The goal of this approach is to enable users to create or extract Web contents in the form of objects that they can manipulate to create Personal Web experiences. We present an architecture that not only allows user interaction with individual objects but also supports the integration of many objects found in diverse Web sites. We claim that once Web contents have been organized as objects it is possible to create many types of Personal Web interactions. The approach involves end-users and developers and it is fully supported by dedicated tools. We show how end-users can use our tools to identify contents and transform them into objects stored in our platform. We show how developers can use these objects to create Personal Web applications.
Keywords: Personal web, Web augmentation, Mashups
Current Web personalization approaches usually suffer from a boundary problem, since most, if not all, work on a per-application basis. When a user needs to deal with two or more applications to perform a particular task, he faces a different personalization approach in each of them (if any). Another drawback is that personalization mechanisms, being specified by the application’s developers, cannot necessarily foresee the requirements of every single application user. These problems have been the basis for the Personal Web, defined in  as a “collection of technologies that confer the ability to reorganize, configure and manage online content rather than just viewing it”. This generic definition might be realized in different ways, such as: (1) PIMs and object manipulation, which allow users to collect information objects and make them available for performing operations , e.g. to collect the titles of scientific works relevant to a researcher in order to perform further tasks. (2) Mashups, which integrate and combine information objects from different resources into a specialized application [11, 12, 12], e.g. to combine multimedia search results from different resources in a single view. (3) Web augmentation, where users are able to enrich information objects in-situ, i.e. in the same Web page where they appear , e.g. to add information to each movie in the IMDB’s Top250 list. (4) Reactive Web, which allows users to obtain reactive feedback from information objects under certain events that these objects are able to detect automatically , e.g. to inform the user that a new movie has been released. (5) Creation of specific applications, for example client-side applications that use existing information objects to build a domain-specific application such as a personal agenda, e.g. a personal application for managing scientific literature.
These approaches, taken together, provide users with the possibility of interacting with Web objects (information items from the existing Web) in different ways. However, all the approaches work in isolation, with specific and dedicated information models. This makes it very complicated to have a complete Personal Web experience supporting arbitrary combinations of these kinds of interactions, given that several different and specialized tools must be developed and maintained, which moreover hinders reuse (of contents and behaviors).
In this paper we present a platform for supporting the abstraction of domain objects from Web sites with the goal of creating applications (Mashups, Web augmentation, independent applications, etc.) that provide a fully interactive Personal Web experience and support the reuse of structure definitions and behavior. The main contributions of our approach are that (1) it supports all the kinds of interactions mentioned above and new ones that could be envisioned in the future; (2) it achieves this goal by using a uniform and rich underlying object-oriented model; and (3) the possible combinations of application types (e.g. mashups and augmentations) make the overall result much richer than the mere sum of these individual approaches.
The paper is organized as follows. Section 2 presents the motivation and an overview of the approach. Section 3 presents the related works. Section 4 introduces our approach. Section 5 describes the tool support, and Sect. 6 explains several case studies that illustrate the approach. Finally, Sect. 8 concludes and presents the future works.
2 Motivation and Approach Overview
Imagine a journalist who must be constantly informed. With this aim, a large amount of content might be gathered from different Web sources, and several different kinds of interaction could be needed to achieve a real Personal experience: (I1) Interact with information objects directly from an object representation space: if the user needs to store preferred news items that he wants to follow, then a PIM holding those items could be a good starting point for any interaction with them. (I2) Merge objects from different sources into specialized apps: in this case he would need a mashup application that integrates the daily news from his preferred media Web sites, allowing him to browse several sources at one time. (I3) Interact with objects when they are presented in the visited pages: when visiting a media Web site, the user could take advantage of Web augmentation capabilities and augment the news in specific Web sites with behavior to obtain related news and multimedia resources, look for reactions on social networks, etc. (I4) Get reactive interaction from the Web: when there is a hot topic, the user could be interested in some kind of reactive experience, for instance immediate notifications informing him of the latest news. (I5) Domain-specific application experience: this user may appreciate a specialized news and journalist tracker application that allows him to follow specific journalists, recommend news, etc. This is similar to mashups, but the underlying application behavior is specific to the news domain.
To provide a full Personal Web experience like this, we must use several applications: a Web augmentation script, a Mashup tool, a PIM, an environment for client-side domain-specific applications, etc. In this context, all the approaches listed before tackle only a “portion” of the Personal Web. This situation presents some challenges to end-users, who may need to deal with different kinds of artifacts at different moments and in different circumstances. Obtaining all these tools for a single domain could be time consuming for end-users, or simply not possible because of the required programming skills.
3 Related Works
The idea of object extraction is similar to Web Scraping . Web scraping is the process of extracting non-structured (or weakly structured) data, usually by emulating the Web browsing activity. Normally, it is used to automate data extraction in order to obtain more complex information, which means that end-users are not usually involved in determining what information to look for, and still less in deciding what to do with the abstracted objects. When Web content has no underlying structure, Web scraping is a good option for retrieving information from Web sites.
Some Web sites already tag their contents allowing other software artifacts (for instance a Web Browser plugin) to process those annotations and improve interaction with that structured content. A well-known approach for giving some meaning to Web data is Microformats . Some approaches leverage the underlying meaning given by Microformats, detecting those objects present on the Web page and allowing users to interact with them in new ways . A very similar approach is Microdata . Considering Semantic Web approaches, and an aim similar to our proposal,  presents an approach for mashups based on semantic information; however, it depends too heavily on the original application owners, something that is not always viable.
However, when analyzing the Web, we see that the vast majority of Web sites do not provide structured data. According to , only 5.64% of 40.6 million Web sites provide some kind of structured data (Microformats, Microdata, RDFa , etc.). This reality raises the importance of empowering users to add semantic structure when it is not available. Several approaches let users add structure to existing contents to ease the management of relevant information objects. For instance, HayStack  offers an extraction tool that allows users to populate a semantic-structured. Atomate it!  offers a reactive platform that can be applied to the collected objects by means of rule definitions. Then the user can be informed when something interesting (such as a new movie, or record) happens.  allows the creation of domain-specific applications that work over the objects defined in a PIM.
Web augmentation is a popular approach that lets end-users improve Web applications by altering original Web pages with new content or functionality not originally contemplated by developers. Nowadays, users may specify their own augmentations by using end-user programming tools. Very interesting tools have emerged , to manipulate DOM (Document Object Model) objects in order to specify the adaptation. However, the costs associated with specifying similar functionality in different Web applications sharing the same underlying domain may be high. Reuse in Web Augmentation has been confined to reusing scripts. For example, Scripting Interface  is oriented to support better reuse by generating a conceptual layer over the DOM, specifically for GreaseMonkey scripts. Since the same Scripting Interface can be defined for two distinct Web sites, the augmentation artifacts written in terms of that interface can be reused.
Another well-known approach for integrating content and services is mashups. Very popular tools such as Yahoo Pipes!  allowed users to combine different resources and present a specific result. Yahoo Pipes! is strongly based on the existence of APIs, but other approaches propose in-situ composition, i.e. without generating a new independent application . Although MashMaker allows users to abstract widgets with their properties, the way in which the widgets are used is always the same, and extending their use implies modifying the application.
It must be noted that each of the interactions mentioned before (I1-I5) may be supported individually by one of the mentioned approaches. Nevertheless, no approach supports all these interactions together; therefore, the Personal Web experience might be restricted. Moreover, most of the approaches do not take into account how future kinds of interactions could be supported. The main reason for this is that the underlying data models seem to be specifically defined for supporting a particular kind of interaction.
4 Our Approach
Our approach proposes using a reusable object layer to build any kind of Personal Web application. This is achieved by giving end-users the possibility to structure CO from existing content, to import them into the WOA and to interact with their instances either from our WOA viewer, using in-situ Web Augmenters, or in domain-specific applications. Applications and decorators are created by developers, who profit from a reusable layer of specifications of CO and their behavior.
4.1 Abstracting and Collecting Objects
Class identification (1) consists of identifying relevant DOM elements in the context of any Web site, yielding either a single occurrence or a list of occurrences of the same element (such as a list of resulting products in Amazon). Implementation details are presented in Sect. 5, but it is worth mentioning that users can select DOM elements and decide between extracting only the selected element or collecting all similar occurrences. For instance, in Fig. 4, although the collection task is performed on the actor Carrie Fisher, users may choose to collect all similar detected objects, such as Harrison Ford. Whether a single instance or a collection of them is collected, the WOA can manage both the Actor CO and its individual IO.
Finally, all the abstracted objects are stored in the WOA and the instance extraction and materialization (3) step takes place, so they can “live” as materialized objects; i.e. besides maintaining their properties in their internal state, they can also respond to messages, as in object-oriented approaches. With the same philosophy, once objects are collected, the WOA may manage both IO and their corresponding CO, and they may be enhanced via decorators, as we explain later. In summary, there are two types of objects available in the WOA. IOs represent a concrete instance of a concept abstracted from a Web site; they have the responsibility of maintaining values for their internal state and responding to messages. COs let end-users manage all the corresponding IO together. A CO has the responsibility of being aware of all its instances and, when possible, of providing some mechanism for retrieving instances that are not already collected. This is achieved by defining an Object Search Engine for those sites where there are instances of the concept; this is explained in detail in Sect. 5.
Based on the generated CO specifications, extraction is the process in which the concrete information about the specified DOM elements is obtained. The CO may contain the specification for extracting a single object from a Web page (for instance the main news from a media portal), or all the news from the same site. For each object to be extracted, the CO contains, at least, the corresponding URL and XPath. In this way, the extraction step includes obtaining a DOM (from that URL), parsing it and retrieving each piece of information required for setting the instance’s internal state (e.g. the title of the latest news). Materialization, in turn, implies creating an IO, setting its internal state and wrapping it with some behavior.
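The extraction and materialization steps just described can be sketched as follows. This is a minimal illustration in Python, not the tool's actual implementation: the names `ConceptSpec`, `InstanceObject` and `extract_and_materialize`, the sample page and the example URL are all assumptions, and the page is an inline string rather than a DOM fetched from the stored URL.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, well-formed page content standing in for the DOM
# that a real extractor would obtain from the URL stored in the CO.
PAGE = """<html><body>
  <div class="news"><h2>Markets rally</h2><p class="author">A. Perez</p></div>
  <div class="news"><h2>Local team wins</h2><p class="author">B. Gomez</p></div>
</body></html>"""


@dataclass
class ConceptSpec:
    """Hypothetical CO specification: a URL plus the XPaths to extract."""
    name: str
    url: str
    instance_xpath: str               # selects one element per instance
    property_xpaths: Dict[str, str]   # property name -> XPath relative to it


@dataclass
class InstanceObject:
    """Hypothetical IO: keeps an internal state and answers messages."""
    concept: str
    source_url: str
    state: Dict[str, str] = field(default_factory=dict)

    def get_property_by_name(self, name: str) -> str:
        return self.state[name]


def extract_and_materialize(spec: ConceptSpec, document: str) -> List[InstanceObject]:
    """Parse the document, read each property and create one IO per match."""
    root = ET.fromstring(document)
    instances = []
    for element in root.findall(spec.instance_xpath):
        state = {}
        for prop, xpath in spec.property_xpaths.items():
            node = element.find(xpath)
            state[prop] = (node.text or "").strip() if node is not None else ""
        instances.append(InstanceObject(spec.name, spec.url, state))
    return instances


NEWS_SPEC = ConceptSpec(
    name="News",
    url="https://example.org/economy",  # placeholder URL
    instance_xpath=".//div[@class='news']",
    property_xpaths={"title": "./h2", "author": "./p[@class='author']"},
)
news = extract_and_materialize(NEWS_SPEC, PAGE)
```

The sketch relies on the limited XPath subset of the Python standard library, which only handles well-formed markup; real Web pages would need a tolerant HTML parser.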
4.2 Enhancing Objects
In the WOA, users may deal with CO or IO. Both have some basic behavior, which is automatically inherited. For example, any CO responds to messages such as getInstances(), removeInstance(), etc. An IO automatically inherits behavior such as showInContext(), getDOMImage(), getPropertyByName(), etc. Besides this default behavior, an object can be enhanced either with behavior for the specific object type or with behavior that can be applied over any kind of object (i.e. behavior independent of the application domain). These enhancements are called decorators, inspired by the Decorator design pattern  and are developed by advanced users. For instance, if a journalist has collected News objects in the WOA, then an instance decorator could add getRelatedMultimedia(), getRelatedTweets(), etc. For the News CO, a domain-specific class decorator could add getCurrentEconomyNews(), etc. A decorator adds new messages that can be sent to the object from different contexts (from a WOA viewer, augmentation scripts, etc.). Decorators may be generic, or domain-specific when they are defined for a type of object from an ontology in DBpedia. When a new IO is obtained, the available decorators may be automatically applied. Since decorators specify meta-information about the type of objects to which they can be applied and about the properties they need to work properly, the WOA may discard those decorators that do not fit with an OMS.
End-users may add existing decorators to their browsers and then decide which decorators to apply over the WOA objects (see Sect. 5). Decorating an object requires identifying the desired decorator and choosing the target objects. This can be done from the WOA Viewer, which helps end-users in this task by filtering decorators and CO according to their compatibility. Decorators must specify (a) the needed object structure: to which kind of objects the decorator may be applied. When the decorator is domain-specific, the target objects may be a particular CO or its IO. When the decorator is generic, the target objects may be any CO or any IO. (b) The messages with which target objects will be enhanced: decorators must define the messages that enhance the objects, including whether or not each message has a UI effect that the end-user may perceive.
Decorators may use (a) WOA objects (CO and IO): although the behavior is going to be added to particular objects, decorators may consume any other objects existing in the WOA to accomplish that behavior; (b) any Web content: decorators may need to consume other content besides WOA objects. This can be done in two ways. First, decorators may consume any Web content via the use of APIs or ad-hoc DOM parsing. Second, decorators may reuse other OMS and obtain objects from different Web sites, without requiring that these objects already exist in the WOA. For instance, the getRelatedNews() decorator could parse a media Web site by applying (on the fly) an OMS to GoogleNews, etc., in order to obtain other objects.
Section 4 presents further technical details, but it is important to note at this point that separating the development of a decorator from the underlying object to which it is going to be applied implies that these behaviors are intrinsically reusable among Web sites sharing the same domain model, in different contexts or applications.
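A minimal Python sketch of this decorator contract is shown below. It is illustrative only: the class names, the `target_concept` and `required_properties` fields, and the messages `getRelatedQuery` and `getSummary` are assumptions introduced for the example, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class InstanceObject:
    """Hypothetical IO whose set of messages can grow at run time."""
    concept: str
    state: Dict[str, str]
    messages: Dict[str, Callable[["InstanceObject"], object]] = field(default_factory=dict)

    def send(self, message: str) -> object:
        return self.messages[message](self)


@dataclass
class Decorator:
    """Hypothetical decorator: meta-information plus the messages it adds."""
    name: str
    target_concept: Optional[str]   # None marks a generic decorator
    required_properties: Set[str]   # properties the messages rely on
    messages: Dict[str, Callable[[InstanceObject], object]]

    def is_compatible(self, obj: InstanceObject) -> bool:
        if self.target_concept is not None and self.target_concept != obj.concept:
            return False
        return self.required_properties.issubset(obj.state)

    def apply(self, obj: InstanceObject) -> None:
        if not self.is_compatible(obj):
            raise ValueError(f"{self.name} cannot decorate a {obj.concept} object")
        obj.messages.update(self.messages)


def compatible_decorators(decorators: List[Decorator], obj: InstanceObject) -> List[Decorator]:
    """Filter step a viewer could use before offering decorators to the user."""
    return [d for d in decorators if d.is_compatible(obj)]


related = Decorator(
    name="RelatedQuery",
    target_concept="News",            # domain-specific
    required_properties={"title"},
    messages={"getRelatedQuery": lambda o: "related:" + o.state["title"]},
)
summary = Decorator(
    name="Summary",
    target_concept=None,              # generic
    required_properties=set(),
    messages={"getSummary": lambda o: f"{o.concept} with {len(o.state)} properties"},
)

news = InstanceObject("News", {"title": "Markets rally"})
actor = InstanceObject("Actor", {"name": "Carrie Fisher"})
for decorator in compatible_decorators([related, summary], news):
    decorator.apply(news)
```

In this sketch compatibility is checked twice, once when filtering and once when applying, so a decorator that slipped past the filter still cannot be attached to an object lacking the properties it needs.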
4.3 Interacting with Objects
By interacting directly with objects, end-users may send them messages (provided by their chosen decorators). For instance, a journalist may send the getRelatedMultimedia() message whenever he wants; this message may return a list of YouTube videos and Google Images, while other similar decorators may consume content from other sources. All the messages shown in the menu are dynamic, because this behavior is implemented by decorators, as explained in Sect. 4.4.
Besides interacting directly with objects, end-users may install further WOA applications (created by developers), which may provide different ways of interacting with objects. For instance, if the journalist wants to be informed about news related to a particular topic (economy, sports, etc.), he could use a WOA application for Reactive News, which alerts the user when a news item appears. Other kinds of applications, such as one for integrating news with a map, are also possible, as Sect. 6 shows.
4.4 Programming WOA Objects and Applications
5 WOA Supporting Tools
The complete tool is deployed as a Firefox browser extension, including the WOA, the WOA application runner, and the Object Collectors. More Object Collectors, WOA applications and decorators may be added in a plug-in-like style.
5.1 Tool Support for Collecting and Structuring Objects
First, we added a toolbar button with two options: opening the WOA, and enabling the concept selection. After clicking the second option (step 1 in the picture), every DOM element is highlighted on a mouse-over event, so the user can clearly see which element is the current target to collect. Then, as shown in step 2, he can access, via a context menu, the options for extracting an element in the current DOM. Options are dynamically loaded according to the selected target element. This behavior is provided by a set of ObjectCollectors, explained below. Once one of the options is clicked, a sidebar is opened for completing the remaining data required for the abstraction and structuring stage. The contextual menu is populated with those ObjectCollectors that match the selected element. This is carried out by asking the set of collectors to analyze the target DOM element, and rendering only the ones whose extraction technique is applicable to it. Our tool currently supports collecting elements from Microformats, DOM element selection and text highlighting. New collectors can be incorporated by extending the framework. Each collector must be capable of analyzing a target HTML element and, if applicable, rendering a context-menu item with its description and associating some behavior with it, in order to return the created object.
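The way the context menu is populated can be sketched as below. This is a simplified Python model under stated assumptions: `TargetElement` stands in for a real DOM element, the collector class names and menu labels are invented for the example, and the Microformat check only looks for a few well-known root class names (`vcard`, `vevent`, `hreview`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TargetElement:
    """Hypothetical stand-in for the DOM element under the mouse."""
    tag: str
    css_classes: List[str] = field(default_factory=list)
    selected_text: str = ""


class ObjectCollector:
    """Base contract: analyze a target and offer a menu item if applicable."""
    description = ""

    def analyze(self, target: TargetElement) -> bool:
        raise NotImplementedError

    def menu_item(self, target: TargetElement) -> Optional[str]:
        return self.description if self.analyze(target) else None


class DomElementCollector(ObjectCollector):
    description = "Collect DOM element"

    def analyze(self, target: TargetElement) -> bool:
        return True  # any element can be collected as-is


class MicroformatCollector(ObjectCollector):
    description = "Collect Microformat object"
    ROOT_CLASSES = {"vcard", "vevent", "hreview"}

    def analyze(self, target: TargetElement) -> bool:
        return any(name in self.ROOT_CLASSES for name in target.css_classes)


class TextHighlightCollector(ObjectCollector):
    description = "Collect highlighted text"

    def analyze(self, target: TargetElement) -> bool:
        return bool(target.selected_text.strip())


def build_context_menu(collectors: List[ObjectCollector], target: TargetElement) -> List[str]:
    """Ask every collector and keep only the items that apply to the target."""
    items = []
    for collector in collectors:
        item = collector.menu_item(target)
        if item is not None:
            items.append(item)
    return items


COLLECTORS = [DomElementCollector(), MicroformatCollector(), TextHighlightCollector()]
card_menu = build_context_menu(COLLECTORS, TargetElement("div", ["vcard"]))
text_menu = build_context_menu(COLLECTORS, TargetElement("p", [], selected_text="Carrie Fisher"))
```

Adding a new extraction technique in this model only requires another subclass appended to the collector list, which mirrors the plug-in style extension described above.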
5.2 WOA Viewer
Once saved into the WOA, the CO and IO can be seen in the WOA viewer, as shown at the right of Fig. 9. We show the view of a CO, whose contextual menu allows the user to manage the properties, edit the CO, wrap it with some behavior and define an Object Search Engine for retrieving IO that may not be present in the current DOM. If there are class decorators enabled, then the messages that can be sent directly by the user are shown under the submenu “Available Messages”.
5.3 Decorating Objects
5.4 Object Search Engines
To support different ways of searching for objects, we take advantage of the search engines of the original Web applications, allowing end-users to abstract the search engine UI in a way similar to that in which they abstract content into objects. Each of these ObjectSearchEngine acts as a search API and contains the search URL, the form field where the user would enter the text to search for, and the button for performing the action. Search modifiers (such as filter or ordering options) and pagination managers are also supported. Then, for example, a decorator may easily search for news in Google News given a particular news title from an object extracted from DiarioRegistrado.com and materialized into the WOA, assuming that an Object Search Engine for Google News was defined. Finally, a CO that was added into the WOA may have several associated ObjectSearchEngine defined in different Web sites. For the sake of space we omit further explanation on creating custom search engines, which can be found in an online documentation site (see the WOA Website footnote).
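As an illustration, an Object Search Engine definition could be modeled as a small record of UI locators that is turned into a sequence of interaction steps. The Python sketch below is hypothetical throughout: the field names, the `plan` method, the step format, the XPaths and the example URL are assumptions, and no real site is driven.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

Step = Tuple[str, str, str]  # (action, target, value)


@dataclass
class ObjectSearchEngine:
    """Hypothetical description of a site's search UI for one concept."""
    concept: str
    search_url: str
    input_xpath: str                     # field where the text is typed
    submit_xpath: str                    # button that performs the search
    modifier_xpaths: Dict[str, str] = field(default_factory=dict)
    next_page_xpath: Optional[str] = None

    def plan(self, text: str, modifiers: Sequence[str] = (), pages: int = 1) -> List[Step]:
        """Return the UI steps needed to search, filter and paginate."""
        steps: List[Step] = [
            ("open", self.search_url, ""),
            ("type", self.input_xpath, text),
            ("click", self.submit_xpath, ""),
        ]
        for name in modifiers:
            # An unknown modifier name raises KeyError on purpose.
            steps.append(("click", self.modifier_xpaths[name], ""))
        if pages > 1:
            if self.next_page_xpath is None:
                raise ValueError("this engine defines no pagination control")
            steps.extend([("click", self.next_page_xpath, "")] * (pages - 1))
        return steps


news_engine = ObjectSearchEngine(
    concept="News",
    search_url="https://news.example.org/search",  # placeholder URL
    input_xpath="//form//input[@name='q']",
    submit_xpath="//form//button[@type='submit']",
    modifier_xpaths={"newest": "//a[@id='sort-date']"},
    next_page_xpath="//a[@rel='next']",
)
steps = news_engine.plan("Markets rally", modifiers=["newest"], pages=2)
```

A decorator could hand such a plan to whatever component actually drives the page and then apply the concept's extraction rules to each result page; that part is left out of the sketch.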
6 Case Studies
In this section we present some case studies demonstrating the power of the approach. Here several examples show how CO and IO materialized into the WOA may be enriched with decorators and then used in different contexts.
6.1 A Web Augmentation Approach Based on Domain-Specific Models
Another possible scenario for using directly materialized concepts is in-situ Web Augmentation. When the concept has been wrapped with a decorator with Web Augmentation capabilities, every DOM element related to an IO is enhanced with a floating-menu in its original context. Such menu is placed at the top-right corner of the element and makes it possible to interact with the decorator messages, in the original context of the structured data.
Up to this point, creating a personal solution does not require any programming skills. However, if the needed functionality is not covered by any of the existing decorators, a developer must implement it. Developers can create not only decorators but also applications. In both cases, the WOA library is also accessible for querying instances of existing templates, concepts and decorators.
6.2 A Personal Dashboard Based on Composition of Abstracted Objects
6.3 Using Decorators with Reactive Web Capabilities from WOPs
Finally, consider that the news consumed in the previous example was retrieved from a certain subsection of both Web portals, e.g. economics. Both portals have other sections that, under certain circumstances, may contain news of interest to the journalist, either because the subject is directly related to his interests or because the news has reached a high level of popularity. Generally, the main entry of a news portal has such qualities, so a valuable feature for the journalist’s application could be tracking changes in such main entries. This can be implemented through reactive programming, making elements capable of propagating their changes. As WOA decorators are instantiated in a highly privileged context (our browser extension’s main code), it is possible to retrieve and manipulate external documents to achieve this goal.
We have performed an expert-oriented evaluation to measure the power of the approach. Based on the motivational examples presented in Sect. 2, we identified the dimensions or aspects that an approach must support for letting users obtain such a Personal Web experience. We found more than 10 dimensions of interest in the evaluation, namely: Consumes static data, Consumes dynamic data (Web services or extractors), Consumes structured data, Consumes unstructured data, No technical skills needed, Content authoring, Reusable information objects, Individual information objects shareability, Tracks changes in the original Web content, Allows augmenting existing Web content, Integrates content from multiple sources, Integrates and displays services from multiple sources, Objects can live in background.
We used these dimensions to compare how each type of approach (e.g. mashup) and each individual approach (for instance Marmite) supports personal experiences. For reasons of space, we cannot include the full comparison table here, but it can be read in the WOA documentation Web site (see the comparison table footnote). As a result, we found that none of these approaches supports all the experiences at the same time. In some cases, the problem is data structuring. In other cases, the changes in the Web pages (from which an object was abstracted) are not tracked, and consequently some interactions, such as reactive ones, are not possible. Others do not support the enhancement of objects when they are visualized in their corresponding Web page. Our approach, in contrast, supports all the interaction kinds listed in Sect. 2 together (and further ones, such as client-side recommender systems). In our opinion, an underlying object-oriented data model is the best way to implement such a layer, since its intrinsic properties, such as reuse and extensibility, make the approach application-agnostic. Over these models, applications may run in different scopes but always use the same client-side Web technologies.
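The change-propagation idea can be sketched with a small observer-style object. The Python example below is only an assumption-laden illustration: `TrackedProperty`, `subscribe` and `refresh` are invented names, and the successive values are passed in by hand instead of being re-extracted from a portal's main entry.

```python
from typing import Callable, List

Listener = Callable[[str, str], None]  # receives (old_value, new_value)


class TrackedProperty:
    """Hypothetical reactive value that propagates its changes to listeners."""

    def __init__(self, initial: str) -> None:
        self._value = initial
        self._listeners: List[Listener] = []

    @property
    def value(self) -> str:
        return self._value

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def refresh(self, new_value: str) -> bool:
        """Store a freshly extracted value; notify only when it changed."""
        if new_value == self._value:
            return False
        old_value, self._value = self._value, new_value
        for listener in self._listeners:
            listener(old_value, new_value)
        return True


notifications: List[str] = []
main_entry_title = TrackedProperty("Markets rally")
main_entry_title.subscribe(lambda old, new: notifications.append(f"New main entry: {new}"))

# Each call stands for one periodic re-extraction of the portal's main entry.
main_entry_title.refresh("Markets rally")       # unchanged, no notification
main_entry_title.refresh("Central bank cuts rates")
```

In a real decorator the listener would raise a user-visible notification, and the refresh would be driven by a timer that re-applies the extraction rules to the external document.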
8 Conclusions and Future Works
The constant evolution of the Web and its users has shown the need for more personal Web applications. Web Mashups, Web Augmentation and other approaches have emerged to reach this goal; however, these approaches are usually not integrated and their underlying domain models are not easy to reuse. We believe that, to reach a more Personal Web, the kinds of interaction experiences supported by these approaches should be composable, in such a way that information object models and their behavior can be reused. In this paper we presented an approach for adding an object-oriented layer over Web contents, which serves as a platform for the development of third-party software. Solutions can be created from existing contents, with a focus on the reusability of existing content and of decorators (and therefore of behavior). We presented our tools and several case studies that demonstrate the power of the approach. We are currently developing a repository of WOA applications and decorators. In this way, we are increasingly covering the functionality needs of diverse end-users in the process of decorating the objects they materialize. The same repository is being designed to support collaboration in the creation of OMS and also to serve as a communication platform for sharing them. We are also developing an end-user tool for creating WOA applications, such as the dashboard presented in the case studies section. Finally, we plan to perform experiments with end-users to further validate our approach.
WOA Website Comparison table: https://sites.google.com/site/webobjectambient/comparison.
- 2. Díaz, O., Arellano, C., Aldalur, I., Medina, H., Firmenich, S.: End-user browser-side modification of web pages. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds.) WISE 2014, Part I. LNCS, vol. 8786, pp. 293–307. Springer, Heidelberg (2014)
- 5. Khare, R., Çelik, T.: Microformats: a pragmatic path to the semantic web. In: Proceedings of the 15th International Conference on WWW, pp. 865–866. ACM, May 2006
- 6. Operator Firefox Extension. https://addons.mozilla.org/es/firefox/addon/operator/?src=search
- 7. Microdata. http://www.w3.org/TR/microdata/
- 10. Díaz, O., Arellano, C., Iturrioz, J.: Interfaces for Scripting: Making Greasemonkey Scripts Resilient to Website Upgrades, pp. 233–247. Springer, Heidelberg (2010)
- 11. Pruett, M.: Yahoo! Pipes. O’Reilly, California (2007)
- 12. Ennals, R., Garofalakis, M.: Mashmaker: mashups for the masses (demo paper). In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD 2007) (2007)
- 13. RDFa. https://rdfa.info
- 14. van Kleek, M., Moore, B., Karger, D.R., André, P.: Atomate it! end-user context-sensitive automation using heterogeneous information sources on the web. In: Proceedings of the 19th International Conference on World Wide Web, pp. 951–960. ACM, April 2010
- 15. van Kleek, M., Smith, D.A., Shadbolt, N.: A decentralized architecture for consolidating personal information ecosystems: The WebBox (2012)
- 16. Karger, D.R., Bakshi, K., Huynh, D., Quan, D., Sinha, V.: Haystack: a customizable general-purpose information management tool for end users of semistructured data. In: Proceedings of the CIDR Conference, January 2005