Developing Next Generation Dynamic Web Services for Scientific Portals
By Niraj Kumar, Software Developer
Kolkata - 700026, West Bengal, India
E-mail: nirajkumariitkgp@gmail.com
© 2005 Niraj Kumar. All rights reserved.
Objective:
The main objective of this whitepaper is to develop a framework for next
generation web services for scientific portals. The first step is to integrate
the various databases and web resources of a specific domain available on the
WWW and bring these heterogeneous sources of data onto a common platform, or
into the form required by the user. The second is to develop a domain-specific
wrapper (CHEMWRAP) that extracts useful information from HTML web pages and the
hidden Web and converts it into a directly usable form. The next stage is to
develop a fully automatic, dynamic and generic web-based service for the portal
which, based on a user-specific query, is able to crawl the Web, return the most
relevant data from multiple sources and convert these data to the user's
requirements. Our goal is to simplify access to chemistry data by providing a
single access point to a large number of sources.
Introduction:
The Web can be considered the world's biggest data source; its textual data
alone amounts to at least hundreds of terabytes. The Web is also growing
dramatically, with its size doubling every two years. At the same time, its
content changes very fast: many past links and resources become dead while many
new ones are added every day. Aside from these newly created pages, existing
pages are continuously updated. For example, a study at Stanford University of
over half a million pages over 4 months found that about 23% of pages changed
daily. In the .com domain 40% of the pages changed daily, and the half-life of
pages was about 10 days (in 10 days half of the pages are gone, i.e., their URLs
are no longer valid) [Arasu et al.].
Apart from this, a tremendous amount of content on the Web is dynamic. According to
one estimate, close to 80% of the content of the Web is dynamically generated, and this
number is continuously increasing. This dynamism takes a number of different forms, such
as temporal dynamism (time-sensitive dynamic content), client-based dynamism
(customized web pages) and input dynamism (pages whose content depends on the input
received from the user), and it further complicates the integration of web resources
[Raghavan et al.]. However, little of this dynamic content is currently crawled and
indexed by even the most popular search engines; they usually index only static web pages
by following hyperlinks, ignoring search forms and pages that require authorization or
prior registration.
Crawling the hidden Web is a very challenging problem for two fundamental reasons.
First is the issue of scale: a recent study estimates that the content available
through such searchable online databases is about 400 to 500 times larger than the
"static Web". Second, access to these databases is provided only through restricted
search interfaces intended for use by humans. Nevertheless, domain-specific scientific
portals need to crawl and integrate these hidden Web databases to meet the task-specific
requirements of a particular user and application.
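As a minimal sketch of what querying a hidden Web source through its search form involves, the Python fragment below builds the GET request URL that a form submission would produce. The endpoint URL and the field names (`formula`, `format`) are hypothetical placeholders, not the interface of any real database, and Python is used only for brevity.

```python
from urllib.parse import urlencode

def build_form_query(action_url, fields):
    """Return the GET URL a browser would request when the form is submitted."""
    # urlencode escapes reserved characters in the field values
    return action_url + "?" + urlencode(fields)

# Hypothetical search form of a chemistry database (placeholder URL and fields).
url = build_form_query(
    "http://example.org/search",
    {"formula": "C6H6", "format": "html"},
)
print(url)  # http://example.org/search?formula=C6H6&format=html
```

A crawler for hidden databases would additionally have to discover the form, choose meaningful field values and handle POST forms; none of that is shown here.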
Most modern scientific and technological advances have been driven by the
latest developments in the sciences and in information technology, and web-based
systems in particular are going to play a very important role. With the
technical development of the applied sciences currently facing many limitations
and ever-increasing problem complexity, the increasing use of computers to
solve problems is only natural. Today, the domain of scientific and research
portals covers wide areas, from physical, organic and inorganic chemistry to
biochemistry, molecular modeling, biology, drug design, geosciences, applied
engineering, and many more. The scientific community is globally distributed,
with a culture of sharing and rapid dissemination of information. Each separate
area of science generates its own data and information sources.
A large amount of scientific data is distributed over the Internet (for example: The
Cambridge Crystallographic Data Center, NIST, The Protein Data Bank).
Typically, this information is accessible only through custom web-based query
interfaces. Each of these sources and databases has a different structure,
content and query language, and returns data in a different format. Furthermore,
they are prone to having their interfaces and formats updated without warning
[Buttler et al.]. To assist scientists, students and industrial users, a large
number of specialist interrogation, modeling and analysis software tools are
available (for example: GAMESS, Gaussian, GAP, Ghemical, COLUMBUS,
COSMO-RS etc.).
When users of scientific resources require information from multiple sources, they
must pose the appropriate queries at each source individually and then explicitly
integrate the results. This solution may be acceptable for a small number of
sources, but it quickly becomes an overwhelming burden for users as the number
of sources grows [Buttler et al.]. Currently there are thousands of scientific
resources and databases, making it infeasible to manually gather the required and
relevant data from these sources. Our proposed system aims to provide a user
interface where users can enter their query; the system should then be able to
perform the following query formulation and execution tasks:
(a) identify sources and their locations, both static and dynamic
(b) identify the content/function of each source and its type
(c) cluster Web pages based on their structure and attributes
(d) develop a generic wrapper, CHEMWRAP, which is able to filter the required
information from HTML pages as well as hidden databases from heterogeneous sources
(e) transform data into the format required by the user
(f) merge results from different sources
(g) optimize the whole system to give the most efficient, secure and low-cost solution
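Tasks (e) and (f) can be illustrated with a small Python sketch. It assumes that each source returns records as dictionaries with source-specific field names; the source names, field maps and sample records below are invented for illustration and do not describe any real database.

```python
def to_common_format(source_name, record, field_map):
    """Rename source-specific fields to common field names (task e)."""
    common = {target: record[src] for src, target in field_map.items() if src in record}
    common["sources"] = [source_name]
    return common

def merge_results(records, key="formula"):
    """Merge records that describe the same compound into one entry (task f)."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(rec[key], {"sources": []})
        for field, value in rec.items():
            if field == "sources":
                entry["sources"].extend(value)
            else:
                # keep the first value seen; later sources only fill gaps
                entry.setdefault(field, value)
    return list(merged.values())

# Invented sample records from two hypothetical sources, A and B.
rec_a = to_common_format("A", {"Formula": "H2O", "MolWt": 18.015},
                         {"Formula": "formula", "MolWt": "mol_weight"})
rec_b = to_common_format("B", {"formula_str": "H2O", "label": "water"},
                         {"formula_str": "formula", "label": "name"})
merged = merge_results([rec_a, rec_b])
print(merged)
```

Keeping the first value seen for a field is only one possible conflict policy; a real integrator would need an explicit rule for sources that disagree.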
The challenges in developing future generation Web services can be broadly classified as
the integration of a number of interrelated problems. The first is developing a system
which is able, in real time, to identify the most relevant static and dynamic sources
(essentially a problem of developing advanced search engine and crawling technologies
with high precision and recall). The second is addressing the heterogeneity of these
resources, i.e. developing a multi-database system (a problem of portability and platform
independence, from the data sources to the hardware and software used). The third is
extracting only the relevant information from these sources (i.e. developing the
wrapper/filter methodology and technology, which includes the larger problem of the
semantics of the Web). The last is developing customized user interfaces and integrating
all software, networking and hardware sub-systems. Over the last 30 years a number of
efforts have been made in each of these areas separately, and in the last 10 years more
focused attempts at integrating these techniques and methodologies have been made.
However, my literature review revealed that every present-day system falls far below the
level of the challenges posed by the requirements of such a system. Before going into the
methodology of our approach, I would like to give a brief overview of the attempts
already made in this direction, particularly with reference to providing future
generation Web services from heterogeneous sources.
Overview of Related work
Many researchers have tackled problems related to information extraction and integration
from the Web. These efforts range from developing toolkits that aid in building wrappers
manually, through wrapper induction, to the extraction of relational data from large
collections of web documents and the extraction of symbolic knowledge. Some wrapper
construction methods are manual, while others are semi-automatic or automatic. However,
manual coding of wrappers has become infeasible in current scenarios. The methodologies
employed to develop wrappers vary widely, from finding patterns in HTML pages using
tree structures, to finite-state approaches, to learning and training approaches based on
fuzzy sets, artificial intelligence and neural networks. Some of the well-known research
groups and products in these areas are: ANDES, WysiWyg Web Wrapper Factory (W4F),
Ariadne, Garlic, TSIMMIS, XWRAP, Mostrare Project, STALKER, TAMBIS,
SoftMealy, FASTUS, HLRT Wrappers, Jedi etc.
XWRAP [Liu et al.]: XWRAP is a semi-automatic XML-enabled wrapper construction
system for Web sources. The architecture of XWRAP consists of four components:
syntactical structure normalization, information extraction, code generation, and program
testing and packaging. XWRAP was developed in Java. XML-enabled means that
the wrapper programs generated by XWRAP can transform an HTML document into an
XML document and deliver the extracted data content in XML format with a DTD.
STALKER [Muslea et al.]: STALKER, developed at the University of Southern California,
is a wrapper induction algorithm that generates extraction rules for semi-structured
Web-based information sources using landmark automata. Based on just a few training
examples, STALKER learns extraction rules for documents with multiple levels of
embedding.
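To give a feel for landmark-based extraction, the Python sketch below applies a hand-written rule that skips to a sequence of landmarks and then reads up to an end landmark. It is a simplified illustration under stated assumptions: the rule format and the function names `skip_to` and `extract` are hypothetical, the sample page is invented, and the learning of rules from training examples, which is what STALKER actually contributes, is not shown.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in order; return the index just past the last one, or -1."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return -1
        pos = found + len(landmark)
    return pos

def extract(text, start_rule, end_landmark):
    """Return the text between the end of the start rule and the end landmark."""
    begin = skip_to(text, start_rule)
    if begin == -1:
        return None
    end = text.find(end_landmark, begin)
    if end == -1:
        return None
    return text[begin:end].strip()

# Invented sample page.
page = "<p>Name: <b>Benzene</b></p><p>Formula: <b>C6H6</b></p>"
formula = extract(page, ["Formula:", "<b>"], "</b>")
name = extract(page, ["Name:", "<b>"], "</b>")
print(name, formula)  # Benzene C6H6
```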
FASTUS [Hobbs et al.]: FASTUS is a five-stage system for extracting information
from natural language text. It works essentially as a cascaded, non-deterministic
finite-state automaton. Decomposing language processing in this way enables the
system to do exactly the right amount of domain-independent syntax, so that
domain-dependent semantic and pragmatic processing can be applied to the right
larger-scale structures. Some blind experiments have demonstrated that it is
very efficient.
WysiWyg Web Wrapper Factory [Sahuguet et al.]: W4F, developed by the Penn Database
Research Group, is a toolkit that allows the fast generation of Web wrappers. Wrapper
generation consists of retrieving an HTML page via the GET or POST method, followed
by constructing an HTML parse tree according to the HTML hierarchy. Information can
then be extracted declaratively using a set of rules applied to the parse tree. A nested
string list (NSL) data structure is used as the datatype to represent the extracted
information internally.
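The idea of applying declarative rules to an HTML parse tree can be sketched in Python as follows. The dotted path notation used here is only loosely modeled on path expressions and is not W4F's actual rule syntax; the `Node`, `TreeBuilder` and `evaluate` names are hypothetical, the sample page is invented, and the parser assumes well-formed HTML with every tag closed.

```python
from html.parser import HTMLParser
import re

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""

class TreeBuilder(HTMLParser):
    """Build a simple parse tree from well-formed HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def evaluate(root, path):
    """Follow a dotted path such as 'html.body.table[0].tr[1].td[1]'."""
    node = root
    for step in path.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", step)
        tag, index = m.group(1), int(m.group(2) or 0)
        # select the index-th child with the requested tag name
        node = [c for c in node.children if c.tag == tag][index]
    return node.text

# Invented sample page.
html = ("<html><body><table>"
        "<tr><td>Name</td><td>Formula</td></tr>"
        "<tr><td>Benzene</td><td>C6H6</td></tr>"
        "</table></body></html>")
builder = TreeBuilder()
builder.feed(html)
formula = evaluate(builder.root, "html.body.table[0].tr[1].td[1]")
print(formula)  # C6H6
```

Because the rule is data rather than code, a change in page layout only requires a new path string, which is the main appeal of the declarative approach.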
InfoSleuth [Bayardo et al.]: The InfoSleuth project at MCC exploits and synthesizes
new technologies into a unified system that retrieves and processes information
in an ever-changing network of information resources. The system is scalable and
portable, which is accomplished through the use of collaborative agents, and it
uses Java as a common wrapper agent.
TAMBIS [Baker et al.]: The TAMBIS project at the University of Manchester, UK, is
a three-layer mediator/wrapper architecture which aims to provide transparent
access to various disparate biological databases and analysis tools. The use of
a knowledge base and wrapped resources removes the need for users to know
which resources are appropriate and how to access them. It greatly reduces the
time users take to analyze their data. TAMBIS aims to use CORBA-wrapped services.
Garlic: Developed at IBM Almaden Research Center, Garlic is a middleware
system that provides an integrated view of heterogeneous legacy data without
changing how or where data is stored. It provides a unified schema and common
interface for new applications without disturbing existing applications. This relies
on wrappers that encapsulate the underlying data and mediate between data
source and middleware.
Other projects which specifically aim at diverse and heterogeneous databases
are SINGAPORE, TSIMMIS, DISCO etc.
Problems related to crawling the hidden Web and developing search engines were
addressed by Raghavan et al. and Brin et al., among others.
METHODOLOGY
The methodology adopted for this study is to develop a web crawler specific to
scientific areas which is able to crawl the Web for available database resources. To
start with, we do not propose to crawl all Web resources but to restrict ourselves to
four or five sources. Because most of the available databases are hidden and have
their own data retrieval mechanisms and user interfaces, the crawler must be developed
taking all these factors into account. Based on these sources, we then try to cluster
the information according to structural similarity. Each of these databases has its
own format, but the formats are closely related, since each holds data about the
molecular structure of chemical compounds. We therefore propose a generic wrapper,
CHEMWRAP, which is able to filter the required information from HTML pages, convert
it into a common format (say XML), extract the required information and convert it
into a format supported by some scientific software program. We will develop our
whole system in Java, XML, COM/CORBA and other Java-based Web technologies.
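One simple way to cluster pages by structural similarity is to compare the sets of tag paths they contain. The Python sketch below is an illustration only, not the clustering method of this proposal: the Jaccard measure, the greedy assignment and the threshold of 0.6 are assumptions chosen for the example, the sample pages are invented, and the parser assumes well-formed HTML.

```python
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Collect the set of root-to-element tag paths in a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(pages, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative tag paths, member page names)
    for name, html in pages.items():
        paths = tag_paths(html)
        for rep, members in clusters:
            if jaccard(rep, paths) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((paths, [name]))
    return [members for _, members in clusters]

# Invented sample pages: two result tables and one unrelated list page.
pages = {
    "a": "<html><body><table><tr><td>Benzene</td></tr></table></body></html>",
    "b": "<html><body><table><tr><td>Toluene</td></tr><tr><td>C7H8</td></tr></table></body></html>",
    "c": "<html><body><ul><li>News</li></ul></body></html>",
}
groups = cluster(pages)
print(groups)  # [['a', 'b'], ['c']]
```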
Decomposition of the Web information extraction task: The wrapper generation
process is so complex that its construction cannot be treated as a single step.
For this reason we have partitioned the CHEMWRAP construction process into six
phases (Figure 2). Interaction and information exchange need to take place between
any two of the phases. After the preprocessing of sources, information extraction
starts. The main task of the information extraction component is to explore and
specify the structure of the retrieved document (page object) in a declarative
extraction rule language. For an HTML document, the information extraction phase
takes as input a parse tree generated by the syntactical normalizer. It first
interacts with the user to identify the semantic tokens (groups of syntactic tokens
that logically belong together) and the important hierarchical structure. Then it
annotates the tree nodes with semantic tokens in comma-delimited format and with the
nesting hierarchy in a context-free grammar. More concretely, the information
extraction process involves three steps; each step generates a set of extraction
rules to be used by the code generation phase to generate the wrapper program code.
Step 1: Identifying the region of interest on the page
Step 2: Identifying the semantic tokens of interest on the page
Step 3: Determining the nesting hierarchy for the content presentation of the page
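The three steps can be made concrete with a small Python sketch that turns one invented HTML results page into XML. In the process described above the steps are interactive and produce extraction rules; here they are hard-coded for a single page layout, and the regular expressions, the function name `extract_to_xml` and the element names `compounds` and `compound` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

def extract_to_xml(html):
    # Step 1: region of interest; here, the first table on the page.
    region = re.search(r"<table>(.*?)</table>", html, re.S).group(1)
    # Step 2: semantic tokens; header cells name the fields, data cells hold the values.
    rows = re.findall(r"<tr>(.*?)</tr>", region, re.S)
    names = [n.strip().lower() for n in re.findall(r"<th>(.*?)</th>", rows[0], re.S)]
    # Step 3: nesting hierarchy; one <compound> per data row, one child element per field.
    root = ET.Element("compounds")
    for row in rows[1:]:
        compound = ET.SubElement(root, "compound")
        values = re.findall(r"<td>(.*?)</td>", row, re.S)
        for name, value in zip(names, values):
            ET.SubElement(compound, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")

# Invented sample page.
page = ("<html><body><h1>Results</h1><table>"
        "<tr><th>Name</th><th>Formula</th></tr>"
        "<tr><td>Benzene</td><td>C6H6</td></tr>"
        "</table></body></html>")
xml_text = extract_to_xml(page)
print(xml_text)
```

Regular expressions are used only to keep the sketch short; they are not robust against attributes or nested tables, which is why the text above works from a parse tree.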
Proposed System Architecture (Figure 1)
[Figure 1. Components shown, from top to bottom: Scientific Portal; Client 1, Client 2, Client 3 (HTTP request); Cambridge databank, NIST databank, Protein databank, each behind its own wrapper; CHEMWRAP; extracted data; Data Integrator; data in XML form; data in the format of some scientific software; result calculation by the scientific software.]
Decomposition of Web Information Extraction Task (Figure 2)
(CHEMWRAP Architecture)
[Figure 2. Phases shown: HTTP query building (enter a set of URL(s)); fetching and repairing the source document; clustering pages of the same structure, if needed; generating a parse tree; information extraction; XML-enabled wrapper code generator; code testing and integration with ChemCraft. Labels on the connections: required source document, extraction rule, wrapper code.]
References
• Raghavan Sriram, Garcia-Molina Hector, "Crawling the Hidden Web", Computer Science Department, Stanford University, Stanford, USA, 2000, PP-25.
• Buttler David, Critchlow Terence, "Using meta-data to automatically wrap bioinformatics sources", Information and Software Technology, No-44, 2002, PP 237-239.
• Baker Patricia G. et al., "TAMBIS - Transparent Access to Multiple Bioinformatics Information Sources", School of Biological Sciences, University of Manchester, UK.
• Arasu Arvind, Cho Junghoo et al., "Searching the Web", Computer Science Department, Stanford University, 2000, PP-42.
• Habegger Benjamin, Quafafou Mohamad, "Multi-pattern wrappers for relation extraction from the Web", IRIN, University of Nantes, France, 2003, PP-5.
• Myllymaki Jussi, "Effective Web data extraction with standard XML technologies", Computer Networks, No-39, 2002, PP 635-644.
• Liu Ling, Pu Calton, Han Wei, "XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources", Georgia Institute of Technology, Atlanta, PP-11.
• Muslea Ion, Minton Steve, Knoblock Craig, "STALKER: Learning extraction rules for semistructured Web-based information sources", IMSC, University of Southern California, USA, PP-8.
• Hobbs Jerry R., Appelt Douglas et al., "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text", Artificial Intelligence Center, SRI International, California, 1997, PP-25.
• A. Sahuguet, F. Azavant. W4F, 1998. http://db.cis.upenn.edu/W4F.
• Bayardo R. J., Bohrer W. et al., "InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments", Microelectronics and Computer Technology Corporation, Austin, Texas, 1997, PP-12.
• Roth Mary Tork, Schwarz Peter, "A Wrapper Architecture for Legacy Data Sources", IBM Almaden Research Center.
• Brin Sergey, Page Lawrence, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Science Department, Stanford University, PP-26.
• Brin Sergey, "Extracting Patterns and Relations from the World Wide Web", Computer Science Department, Stanford University, PP-12.
• Liu Ling, Pu Calton, Han Wei, "An XML-enabled data extraction toolkit for web sources", Information Systems, No-26, 2001, PP 563-583.
• Habegger Benjamin, Quafafou Mohamad, "Web Services for Information Extraction from the Web", Proceedings of the IEEE International Conference on Web Services, 2004, PP-8.