Page processors
  • 17 May 2024

What should I know about page processors?

Page processors determine how your crawler behaves. Without page processors, your crawler does nothing more than visit every page on the website, and no information is extracted. A crawler can have as many page processors as your requirements dictate.

Each page processor is made up of two sections: conditions and actions.

Conditions

Conditions determine whether a given page processor is executed for a page. Every page processor is checked against every page the crawler visits; when a processor's conditions match, its configured actions are applied.

You can add any number of conditions to a page processor so that it matches only the pages you intend. All conditions defined for a page processor must be true for its actions to be applied to a given page.

  • Match every page: All pages are affected by the configured actions

  • Contains element: Accepts a jQuery selector

  • Does not contain element: Accepts a jQuery selector

  • Does not match URL: Accepts a regular expression that is matched against each web page URL

  • Matches URL: Accepts a regular expression that is matched against each web page URL
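
The "all conditions must be true" rule can be illustrated with a minimal Python sketch. This is not the product's implementation: the names `Page`, `PageProcessor`, `matches_url`, `does_not_match_url`, and `match_every_page` are hypothetical, the element-based conditions are left out because they need a selector engine, and the sketch assumes the regular expression may match anywhere in the URL.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a page the crawler has visited."""
    url: str


# A condition is modelled as a function that takes a page and returns True or False.
Condition = Callable[[Page], bool]


def match_every_page() -> Condition:
    # "Match every page": always true.
    return lambda page: True


def matches_url(pattern: str) -> Condition:
    # "Matches URL": assumed here to match anywhere in the URL.
    regex = re.compile(pattern)
    return lambda page: regex.search(page.url) is not None


def does_not_match_url(pattern: str) -> Condition:
    # "Does not match URL": the negation of the condition above.
    regex = re.compile(pattern)
    return lambda page: regex.search(page.url) is None


@dataclass
class PageProcessor:
    name: str
    conditions: List[Condition] = field(default_factory=list)

    def applies_to(self, page: Page) -> bool:
        # Every condition must hold for the processor's actions to run.
        return all(condition(page) for condition in self.conditions)


# Usage: product pages, excluding paginated listings.
product_pages = PageProcessor(
    "products",
    [matches_url(r"/products/"), does_not_match_url(r"\?page=")],
)
pages = [
    Page("https://example.com/products/42"),
    Page("https://example.com/products/?page=2"),
    Page("https://example.com/about"),
]
matched = [p.url for p in pages if product_pages.applies_to(p)]
print(matched)
```

Combining a "Matches URL" condition with a "Does not match URL" condition, as in the usage example, is one way to narrow a processor to a specific group of pages.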

Actions

Actions direct the crawler's activities upon visiting a page that matches the associated conditions. Actions allow the extraction of data, the addition of URLs to the crawl list, and more.

  • Extract element contents: Accepts a jQuery selector, an output field, and regular expressions to match several values or to format the output

  • Extract element attribute: Accepts a jQuery selector, an attribute name, an output field, and regular expressions to match several values or to format the output

  • Extract URL referring to page: Accepts an output field

  • Extract page title: Accepts an output field

  • Extract page URL: Accepts an output field

  • Add URL attribute to crawl list: Accepts a jQuery selector, an attribute name, and regular expressions to match several values or to format the output

  • Remove elements: Accepts a jQuery selector

  • Don't follow any links on page: Crawler will not follow links from the page
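
To make the effect of a few of these actions concrete, here is a rough Python sketch using only the standard library. It is an illustration under stated assumptions, not how the crawler works internally: `run_actions` and `_Collector` are hypothetical names, the jQuery selector is replaced by a fixed "every `a` element" stand-in, and the regular expression is used only to filter which `href` values are added.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _Collector(HTMLParser):
    """Collects the <title> text and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run_actions(url, html, link_pattern=r".*"):
    """Apply three illustrative actions to one page.

    Returns the extracted output fields and the URLs to add to the crawl list.
    """
    parser = _Collector()
    parser.feed(html)
    record = {
        "title": parser.title.strip(),  # "Extract page title"
        "url": url,                     # "Extract page URL"
    }
    # "Add URL attribute to crawl list": keep hrefs that match the pattern
    # (an assumed use of the regular expression) and resolve them against the page URL.
    regex = re.compile(link_pattern)
    crawl_list = [urljoin(url, href) for href in parser.hrefs if regex.search(href)]
    return record, crawl_list


# Usage with a small hypothetical page.
html = (
    "<html><head><title> Widget 42 </title></head><body>"
    '<a href="/products/43">Next</a><a href="/about">About</a>'
    "</body></html>"
)
record, crawl_list = run_actions("https://example.com/products/42", html, r"^/products/")
print(record)
print(crawl_list)
```

In this sketch the output field names `title` and `url` are fixed; in the crawler, each extraction action takes the output field as a setting.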
