NAME

FEAR::API - Web Scraping Zen

SYNOPSIS

 FEAR

 = ∑( WWW Crawler, Data Extractor, Data Munger, (X|HT)ML Parser, ...... , Yucky Overloading )

 = ∞

 = ☯

 = 禪

DESCRIPTION

FEAR::API is a tool that helps you reduce the time spent writing site scraping scripts, and helps you do it in a much more elegant way. FEAR::API combines many strong and powerful features from various CPAN modules, such as LWP::UserAgent, WWW::Mechanize, Template::Extract, Encode, HTML::Parser, etc., and digests them into a deeper Zen.

However, this module probably violates every single rule of any Perl coding standard. Please stop here if you don't want to see the yucky code.

This module originated from a short-term project. I was asked to extract data from several commercial websites. During development, I found a lot of redundant code, and I attempted to reduce the code size and create something specialized for this job: site scraping (or web scraping, screen scraping). Before creating this module, I surveyed some site scrapers and information extraction tools, and none of them could really satisfy my needs. I meditated on what my ideal tool should look like, and the ideas gradually solidified in my mind.

Then I created FEAR::API.

It is a highly specialized module with a domain-specific syntax. Maybe you are used to creating browser emulators with WWW::Mechanize, but you need to write some extra code to parse the content. Sometimes, after you have extracted data from documents, you also need to write extra code to store it in databases or plain text files. That may be very easy for you, but it is not always done quickly. That's why FEAR::API is here. FEAR::API encapsulates the components needed in any site scraping flow, trying to help you speed up the whole process.

THE FIVE ELEMENTS

There are five essential elements in this module.

 FEAR::API::Agent
 FEAR::API::Document
 FEAR::API::Extract
 FEAR::API::Filter
 FEAR::API

FEAR::API::Agent is the crawler component. It fetches web pages, and passes contents to FEAR::API::Document.

FEAR::API::Document stores fetched documents.

FEAR::API::Extract performs data extraction on documents.

FEAR::API::Filter does pre-processing on documents and post-processing on extracted results. This component lets you clean up fetched pages and refine extracted results.

FEAR::API is the public interface, and everything is handled and coordinated internally in it. Generally, you interact only with this package, and it is supposed to solve most of your problems.

The architecture is not complicated. I guess the most bewildering thing may be the over-simplified syntax. Some users who have tried the example code report that they still have no idea what is really going on in this module.

After adding parallel prefetching based on Larbin, I decided to start writing this documentation. (And I started to regret a little bit that I created this module.)

USAGE

The first line

    use FEAR::API -base;

To -base, or not to -base. That is no question.

Using FEAR::API with -base means your current package becomes a subclass of FEAR::API, and $_ is automatically initialized as a FEAR::API object.
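
For example, a minimal sketch of the -base style (the style used in the rest of this document):

    use FEAR::API -base;
    url("google.com");            # method calls operate on the implicit $_ object
    fetch();
    print document->as_string;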

Using it without -base is like using any other OO Perl module. You need to instantiate the object yourself and call methods on it explicitly.

    use strict;
    use FEAR::API;
    my $f = fear();
    $f->url("blah");
    # blah, blah, blah.....

Fetch a page

    url("google.com");
    fetch();

FEAR::API maintains a URL queue internally. Every time you call url(), it pushes your arguments onto the queue, and when you call fetch(), the URL at the front is popped and requested. If the request is successful, the fetched document is stored in a FEAR::API::Document.

fetch() not only pops the front element of the queue; it also takes arguments. If you pass a URL to fetch(), FEAR::API fetches the one you specify and temporarily ignores the URL queue.
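
A short sketch of the queue behavior just described (the URLs are arbitrary examples):

    url("google.com");            # queue: google.com
    url("search.cpan.org");       # queue: google.com, search.cpan.org
    fetch();                      # pops and fetches google.com
    fetch("perl.org");            # fetches perl.org directly; the queue is left alone
    fetch();                      # pops and fetches search.cpan.org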

Fetch a page and store it in a scalar

    fetch("google.com") > my $content;

    my $content = fetch("google.com")->document->as_string;

Fetch a page and print to STDOUT

    getprint("google.com");

    print fetch("google.com")->document->as_string;

    fetch("google.com");
    print $$_;    

    fetch("google.com") | _print;

Fetch a page and save it to a file

    getstore("google.com", 'google.html');

    url("google.com")->() | _save_as("google.html");
    
    fetch("google.com") | io('google.html');

Once you have fetched a page, you will probably need to process the links in it. FEAR::API provides a method, dispatch_links() (or report_links()), designed for this job.

dispatch_links() takes a list of (regular expression => action) pairs. For each link in the page, if it matches a certain regular expression (or, say, rule), the corresponding action is taken.

You can also set fallthrough_report(1) to test all the rules against each link instead of stopping at the first match.

>> is overloaded. It is equivalent to the method dispatch_links() or report_links(). fallthrough_report() is automatically set to 1 if >> is followed by an array ref [], and 0 if >> is followed by a hash ref {}.

In the following code examples, the constant _self is used as an action, which means that links matching the corresponding rule will all be pushed back onto the URL queue.

Verbose

    fetch("http://google.com")
    ->report_links(
                   qr(^http:) => _self,
                   qr(google) => \my @l,
                   qr(google) => sub {  print ">>>".$_[0]->[0],$/ }
                  );
    fetch while has_more_urls;
    print Dumper \@l;

Minimal

    url("google.com")->()
      >> [
          qr(^http:) => _self,
          qr(google) => \my @l,
          qr(google) => sub {  print ">>>".$_[0]->[0],$/ }
         ];
    $_->() while $_;
    print Dumper \@l;

Equivalent Code

    url("tw.yahoo.com")->();
    my @l;
    foreach my $link (links){
      $link->[0] =~ /^http:/ and url($link) and next;
      $link->[0] =~ /tw.yahoo/ and push @l, $link and next;
      $link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/ and next;
    }
    fetch while has_more_links;
    print Dumper \@l;

Verbose

    fetch("http://google.com")
    ->fallthrough_report(1)
    ->report_links(
                   qr(^http:) => _self,
                   qr(google) => \my @l,
                   qr(google) => sub {  print ">>>".$_[0]->[0],$/ }
                  );
    fetch while has_more_urls;
    print Dumper \@l;

Minimal

    url("google.com")->()
      >> {
          qr(^http:) => _self,
          qr(google) => \my @l,
          qr(google) => sub {  print ">>>".$_[0]->[0],$/ }
         };
    $_->() while $_;
    print Dumper \@l;

Equivalent Code

    url("tw.yahoo.com")->();
    my @l;
    foreach my $link (links){
      $link->[0] =~ /^http:/ and url($link);
      $link->[0] =~ /tw.yahoo/ and push @l, $link;
      $link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/;
    }
    fetch while has_more_links;
    print Dumper \@l;
    url("google.com")->() >> _self;
    &$_ while $_;
    url("google.com")->() >> _self | _save_as_tree("./root");
    $_->() | _save_as_tree("./root") while $_;

Recursively get web pages from Google

    url("google.com");
    &$_ >> _self while $_;

In English: line 1 sets the initial URL. Line 2 says that, while there are more URLs in the queue, FEAR::API keeps fetching them and feeding the extracted links back to itself.

Recursively get web pages from Google

    url("google.com");
    &$_ >> _self | _save_as_tree("./root") while $_;

In English: line 1 sets the initial URL. Line 2 says that, while there are more URLs in the queue, FEAR::API keeps fetching them, feeding the extracted links back to itself, and saving the current document in a tree structure rooted at "root" on the file system. And guess what? It is a minimal web spider written in Perl. (Well, at least, I am not aware of any other pure-Perl implementation this minimal.)

Crawling with domain constraints

    allow_domains( qr(google),    qr(blahblah) );

    deny_domains( qr(microsoft), qr(bazzbazz) );
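
Combined with the recursive example above, a hedged sketch of a crawl that stays inside one domain:

    allow_domains( qr(google\.com) );     # links outside this domain are assumed to be skipped
    url("google.com");
    &$_ >> _self while $_;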

Mechanize fans?

FEAR::API borrows (or steals) some useful methods from WWW::Mechanize.

    url("google.com")->()->follow_link(n => 2);
    print Dumper fetch("google.com")->links;

Submit a query to Google

    url("google.com")->();
    submit_form(
                form_number => 1,
                fields => { q => "Kill Bush" }
                );

If you have used curl before, you may have embedded multiple URLs in one command. FEAR::API provides similar functionality based on the Template Toolkit. In the following code, the initial URLs are http://some.site/a, http://some.site/b, ..., http://some.site/z.

    url("[% FOREACH i = ['a'..'z'] %]
         http://some.site/[% i %]
         [% END %]");
    &$_ while $_;

Extraction

Use template() to set up the template for extraction. Note that FEAR::API will add [% FOREACH rec %] and [% END %] to your template if your extraction method is set to Template::Extract.

preproc() (or doc_filter()) can help you clean up the document before you apply your template. postproc() (or result_filter()) is called after you perform extraction. The argument can be of two types: a string containing Perl code, which will be evaluated, or a named filter. Named filters are documented in FEAR::API::Filters.

Extract data from CPAN

    url("http://search.cpan.org/recent")->();
    submit_form(
            form_name => "f",
            fields => {
                       query => "perl"
                      });
    template("<!--item-->[% p %]<!--end item-->"); # [% FOREACH rec %]<!--item-->[% p %]<!--end item-->[% END %], actually.
    extract;
    print Dumper extresult;

Extract data from CPAN after some HTML cleanup

    url("http://search.cpan.org/recent")->();
    submit_form(
            form_name => "f",
            fields => {
                       query => "perl"
                      });
    # Only the section between <!--results--> and <!--end results--> is wanted.
    preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
    print document->as_string;    # print content to STDOUT
    template("<!--item-->[% p %]<!--end item-->");
    extract;
    print Dumper extresult;

HTML cleanup, extract data, and refine results

    url("http://search.cpan.org/recent")->();
    submit_form(
            form_name => "f",
            fields => {
                       query => "perl"
                      });
    preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
    template("<!--item-->[% rec %]<!--end item-->");
    extract;
    postproc(q($_->{rec} =~ s/<.+?>//g));     # Strip HTML tags brutally
    print Dumper extresult;

Use filtering syntax

    fetch("http://search.cpan.org/recent");
    submit_form(
                form_name => "f",
                fields => {
                           query => "perl"
                })
       | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
       | _template("<!--item-->[% rec %]<!--end item-->")
       | _result_filter(q($_->{rec} =~ s/<.+?>//g));
    print Dumper \@$_;

This is like piping in a shell. Site scraping is really just a flow of data; it is a process that turns data into information. People usually pipe sort, wc, uniq, head, etc. in a shell to extract what they need. In FEAR::API, site scraping is equivalent to data munging: every piece of data goes through multiple filters before the wanted information comes out.

Invoke handler for extracted results

When you have results extracted, you can write handlers to process the data. invoke_handler() can take arguments like "Data::Dumper", "YAML", a subref, an object-relational mapper class, etc., and the supported argument types are expected to grow.

    fetch("http://search.cpan.org/recent");
    submit_form(
                form_name => "f",
                fields => {
                           query => "perl"
                })
       | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
       | "<!--item-->[% rec %]<!--end item-->"
       | _result_filter(q($_->{rec} =~ s/<.+?>//g));
    invoke_handler('Data::Dumper');
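
invoke_handler() also accepts a subref. A hedged sketch, under the assumption that the subref simply receives the extracted results as its arguments:

    # Assumption: the handler subref is called with the extracted records
    invoke_handler(sub {
        my @records = @_;
        print scalar(@records), " record(s) passed to the handler\n";
    });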

Named Filters

Here are examples of using named filters provided by FEAR::API itself.

Preprocess document

    url("google.com")->()
    | _preproc(use => "html_to_null")
    | _preproc(use => "decode_entities")
    | _print;

Postprocess extraction results

    fetch("http://search.cpan.org/recent");
    submit_form(
                form_name => "f",
                fields => {
                           query => "perl"
                })
       | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
       | _template("<!--item-->[% rec %]<!--end item-->")
       | _result_filter(use => "html_to_null",    qw(rec))
       | _result_filter(use => "decode_entities", qw(rec));
    print Dumper \@$_;

ORMs

FEAR::API makes it very easy to transfer your extracted data straight into databases. All you need to do is set up an ORM and invoke the mapper once you have new results extracted; a sketch of such a mapper class follows the snippet below. (Though I still think this is not quick enough. Ideally you would not have to create any ORM classes at all; FEAR::API should secretly build them for you.)

    template($template);
    extract;
    invoke_handler('Some::Module::based::on::Class::DBI');
    # or
    invoke_handler('Some::Module::based::on::DBIx::Class::CDBICompat');
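
A hedged sketch of what such a mapper might look like, using a hypothetical My::Scraped::Record class based on Class::DBI (the database, table, and column names are assumptions; the rec column matches the field name used in the templates above):

    package My::Scraped::Record;
    use base 'Class::DBI';

    # Hypothetical SQLite database and table layout
    __PACKAGE__->connection('dbi:SQLite:dbname=scraped.db', '', '');
    __PACKAGE__->table('records');
    __PACKAGE__->columns(All => qw(id rec));

    1;

With this in place, you would pass 'My::Scraped::Record' to invoke_handler() as above, which presumably hands each extracted record to the mapper.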

Scraping a file

It is possible to use FEAR::API to extract data from local files. This means you can use other web crawlers to fetch pages and use FEAR::API only for the scraping job.

    file('some_file');

    url('file:///the/path/to/your/file');

Then you need to tell FEAR::API what the content type is, because the document is loaded from your local file system. By default, FEAR::API assumes files are plain text.

    force_content_type('text/html');
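
Putting the pieces together, a sketch of scraping a saved HTML page (the file name and template are placeholders):

    file('saved_page.html');                        # a local copy fetched by some other crawler
    force_content_type('text/html');                # tell FEAR::API it is HTML, not plain text
    template("<!--item-->[% p %]<!--end item-->");
    extract;
    print Dumper extresult;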

THE XXX FILES

FEAR::API empowers you to select sub-documents using XPath. If your document is not in XML, you have to upgrade it first.

Upgrade HTML to XHTML

    print fetch("google.com")->document->html_to_xhtml->as_string;

    fetch("google.com") | _to_xhtml;
    print $$_;

Do XPathing

    print fetch("google.com")->document->html_to_xhtml->xpath('/html/body/*/form')->as_string;

    fetch("google.com") | _to_xhtml | _xpath('/html/body/*/form');
    print $$_;

Make your site scraping script a subroutine

It is possible to decompose your scripts or modules into several different components using an SST (Site Scraping Template).

    load_sst('fetch("google.com") >> _self; $_->() while $_');
    run_sst;

    load_sst('fetch("[% initial_link %]") >> _self; $_->() while $_');
    run_sst({ initial_link => 'google.com'});

    # Load from a file
    load_sst_file("MY_SST");
    run_sst({ initial_link => 'google.com'});

Tabbed scraping

I don't really know what this is good for. I added it because I saw that some scrapers could do this fancy stuff.

    fetch("google.com");        # Default tab is 0
    tab 1;                             # Create a new tab, and switch to it.
    fetch("search.cpan.org");  # Fetch page in tab 1
    tab 0;                             # Switch back to tab 0
    template($template);       # Continue processing in tab 0
    extract();

    keep_tab 1;                    # Keep tab 1 only and close others
    close_tab 1;                    # Close tab 1

RSS

You can create RSS feeds easily with FEAR::API.

    use FEAR::API -base, -rss;
    my $url = "http://google.com";
    url($url)->();
    rss_new( $url, "Google", "Google Search Engine" );
    rss_language( 'en' );
    rss_webmaster( 'xxxxx@yourdomain.com' );
    rss_twice_daily();
    rss_item(@$_) for map{ [ $_->url(), $_->text() ] } links;
    die "No items have been added." unless rss_item_count;
    rss_save('google.rss');

See also XML::RSS::SimpleGen

Prefetching and document caching

Here I provide two options for prefetching and document caching. One is written purely in Perl, and the other uses a C++ web crawling engine. The Perl solution is simple and easy to install, but not very efficient. The C++ crawler is extremely fast; it claims to fetch 100 million pages on a home PC with a good network connection. However, it is much more complex than the simple pure-Perl prefetching.

Native perl prefetching based on fork()

    use FEAR::API -base, -prefetching;

Simple, but not efficient.

C++ parallel crawling based on pthread

    use FEAR::API -base, -larbin;

Larbin is required. Amazingly fast. See also http://larbin.sourceforge.net/index-eng.html and larbin/README.

The default document repository is at /tmp/fear-api/pf. (This is not configurable for now.)

ONE-LINERS

    fearperl -e 'fetch("google.com")'

    perl -M'FEAR::API -base' -e 'fetch("google.com")'
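
Another sketch along the same lines, piping the fetched page to _print:

    perl -M'FEAR::API -base' -e 'fetch("google.com") | _print'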

ARTICLE

There is also an article about this module. Please see http://www.perl.com/pub/a/2006/06/01/fear-api.html.

DEBATE

This module has been heavily criticized on Perlmonks. Please go to http://perlmonks.org/?node_id=537504 for details.

EXAMPLES

There are some example scrapers available with this module. Please go to examples/.

SEE ALSO

WWW::Mechanize, LWP::UserAgent, LWP::Simple, perlrequick, perlretut, perlre, perlreref, Regexp::Bind, Template::Extract, Template, IO::All, XML::Parser, XML::XPath, XML::RSS, XML::RSS::SimpleGen, Data::Dumper, YAML, Class::DBI, DBIx::Class

Larbin http://larbin.sourceforge.net/index-eng.html

FEAR::Web, a web interface based on FEAR::API. http://rt.openfoundry.org/Foundry/Project/?Queue=609 (But it needs much work.)

AUTHOR & COPYRIGHT

Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
