Page processors

Updated on 17 May 2024

What should I know about page processors?

Page processors determine how your crawler behaves. Without page processors, a crawler only visits every page on the website and extracts no information. A crawler can have as many page processors as your requirements dictate.

Each page processor is made up of two sections: conditions and actions.

Conditions

Conditions determine whether a given page processor is executed for a page. Every page processor is checked against every page the crawler visits, and if its conditions match, its configured actions are applied.

You can add any number of conditions to a page processor to ensure it matches only the exact pages you want. All conditions defined for a page processor must be true for its actions to be applied to a given page.

Match every page: The configured actions are applied to all pages
Contains element: Accepts a jQuery selector
Does not contain element: Accepts a jQuery selector
Does not match URL: Accepts a regular expression that is matched against each web page URL
Matches URL: Accepts a regular expression that is matched against each web page URL
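
The "all conditions must be true" rule can be illustrated with a minimal Python sketch. This is not the product's implementation: the names `Page`, `matches_url`, `does_not_match_url`, `match_every_page`, and `processor_applies` are hypothetical, only the URL-based conditions are modelled, and the sketch assumes the regular expression may match anywhere in the URL.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a page visited by the crawler."""
    url: str


# A condition is any function that takes a page and returns True or False.
Condition = Callable[[Page], bool]


def match_every_page() -> Condition:
    # Always true, so the processor applies to every visited page.
    return lambda page: True


def matches_url(pattern: str) -> Condition:
    regex = re.compile(pattern)
    # Assumption: the pattern may match anywhere in the URL (re.search).
    return lambda page: regex.search(page.url) is not None


def does_not_match_url(pattern: str) -> Condition:
    regex = re.compile(pattern)
    return lambda page: regex.search(page.url) is None


def processor_applies(conditions: List[Condition], page: Page) -> bool:
    # Every condition must hold for the actions to be applied.
    return all(condition(page) for condition in conditions)


# Example: product pages only, excluding paginated listings.
conditions = [matches_url(r"/products/"), does_not_match_url(r"[?&]page=")]
product = Page("https://example.com/products/42")
paginated = Page("https://example.com/products/?page=2")
```

Here `processor_applies(conditions, product)` is true because both conditions hold, while `processor_applies(conditions, paginated)` is false because the second condition fails.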

Actions

Actions direct what the crawler does when it visits a page that matches the associated conditions. Actions can extract data, add URLs to the crawl list, and more.

Extract element contents: Accepts a jQuery selector, an output field, and regular expressions to match several values or to format the output
Extract element attribute: Accepts a jQuery selector, an attribute name, an output field, and regular expressions to match several values or to format the output
Extract URL referring to page: Accepts an output field
Extract page title: Accepts an output field
Extract page URL: Accepts an output field
Add URL attribute to crawl list: Accepts a jQuery selector, an attribute name, and regular expressions to match several values or to format the output
Remove elements: Accepts a jQuery selector
Don't follow any links on page: Crawler will not follow links from the page
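
As a rough illustration of what some of these actions amount to, the Python sketch below extracts the page title and page URL into output fields and adds matching link URLs to a crawl list. It is a simplified sketch under stated assumptions, not how the product works internally: `run_actions` and `_PageScanner` are hypothetical names, the jQuery selector is replaced by a fixed "every `a` element, `href` attribute" rule, and the `follow_links` flag stands in for "Don't follow any links on page".

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Tuple
from urllib.parse import urljoin


class _PageScanner(HTMLParser):
    """Collects the <title> text and every <a href> value from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.hrefs: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run_actions(
    url: str, html: str, link_pattern: str, follow_links: bool = True
) -> Tuple[Dict[str, str], List[str]]:
    """Return the extracted record and the URLs to add to the crawl list."""
    scanner = _PageScanner()
    scanner.feed(html)

    # "Extract page title" and "Extract page URL" into output fields.
    record = {"title": scanner.title.strip(), "url": url}

    # "Add URL attribute to crawl list", filtered by a regular expression.
    to_crawl: List[str] = []
    if follow_links:
        regex = re.compile(link_pattern)
        for href in scanner.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if regex.search(absolute):
                to_crawl.append(absolute)
    return record, to_crawl


html = (
    "<html><head><title> Shop </title></head><body>"
    '<a href="/products/1">One</a><a href="/about">About</a>'
    "</body></html>"
)
record, to_crawl = run_actions("https://example.com/", html, r"/products/\d+")
```

With this sample page, `record` holds the title `Shop` and the page URL, and `to_crawl` contains only `https://example.com/products/1`, because the `/about` link does not match the pattern.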

What's Next

Build a Crawler robot