What should I know about page processors?
Page processors determine how your crawler behaves. Without page processors, your crawler only visits every page on the website and extracts no information. A crawler can have as many page processors as your requirements dictate.
Each page processor is made up of two sections: conditions and actions.
Conditions
Conditions determine whether a given page processor is executed for a page. Every page processor is checked against every page the crawler visits, and when a processor's conditions match, its configured actions are applied.
You can add any number of conditions to a page processor to ensure it matches only the exact pages you want it to. All defined conditions for a page processor must be true for the actions to be applied to the given page.
Match every page: All pages are affected by the configured actions
Contains element: Accepts a jQuery selector
Does not contain element: Accepts a jQuery selector
Does not match URL: Accepts a regular expression that is matched against each web page URL
Matches URL: Accepts a regular expression that is matched against each web page URL
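The rule that every condition on a page processor must be true can be sketched roughly as follows. This is an illustrative model only, not the crawler's actual implementation: the `PageProcessor` class, the helper functions, and the example URLs are all hypothetical, and the sketch assumes a URL regular expression matches when it is found anywhere in the URL. Only the URL-based conditions are modeled; element conditions are left out.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical model: a condition is a predicate over the page URL.
Condition = Callable[[str], bool]


def match_every_page() -> Condition:
    """'Match every page': always true."""
    return lambda url: True


def matches_url(pattern: str) -> Condition:
    """'Matches URL': assumed true when the regex is found anywhere in the URL."""
    compiled = re.compile(pattern)
    return lambda url: compiled.search(url) is not None


def does_not_match_url(pattern: str) -> Condition:
    """'Does not match URL': assumed true when the regex is not found in the URL."""
    compiled = re.compile(pattern)
    return lambda url: compiled.search(url) is None


@dataclass
class PageProcessor:
    name: str
    conditions: List[Condition] = field(default_factory=list)

    def applies_to(self, url: str) -> bool:
        # All conditions must be true for the actions to be applied.
        return all(condition(url) for condition in self.conditions)


def processors_for(url: str, processors: List[PageProcessor]) -> List[str]:
    """Check every processor against the page and return the names that match."""
    return [p.name for p in processors if p.applies_to(url)]


processors = [
    PageProcessor("all-pages", [match_every_page()]),
    PageProcessor(
        "products",
        [matches_url(r"/products/"), does_not_match_url(r"\?page=")],
    ),
]

print(processors_for("https://example.com/products/42", processors))
print(processors_for("https://example.com/products/?page=2", processors))
```

In this sketch the second URL satisfies "Matches URL" but fails "Does not match URL", so only the catch-all processor applies to it.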
Actions
Actions direct the crawler's activities upon visiting a page matching the associated conditions. Actions allow extraction of data, the addition of URLs to the crawl list, and more.
Extract element contents: Accepts jQuery selector, output field, and regular expressions to match several values or to format the output
Extract element attribute: Accepts jQuery selector, attribute name, output field, and regular expressions to match several values or to format the output
Extract URL referring to page: Accepts output field
Extract page title: Accepts output field
Extract page URL: Accepts output field
Add URL attribute to crawl list: Accepts jQuery selector, attribute name, and regular expressions to match several values or to format the output
Remove elements: Accepts jQuery selector
Don't follow any links on page: Crawler will not follow links from the page
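As a rough illustration of how a regular expression can pull several values out of one extracted element, consider the sketch below. It is a sketch under assumptions rather than the product's behavior: the function names, the `prices` output field, and the sample text are hypothetical, the element text is assumed to have been selected already by the jQuery selector, and returning the first capture group when one is present is a choice made for this example.

```python
import re
from typing import Dict, List


def extract_values(text: str, pattern: str) -> List[str]:
    """Return every match of the pattern in the text.

    Assumption for this sketch: if the pattern has a capture group,
    the first group is the value; otherwise the whole match is.
    """
    compiled = re.compile(pattern)
    if compiled.groups:
        return [m.group(1) for m in compiled.finditer(text)]
    return [m.group(0) for m in compiled.finditer(text)]


def extract_element_contents(
    record: Dict[str, List[str]],
    output_field: str,
    element_text: str,
    pattern: str,
) -> Dict[str, List[str]]:
    """Hypothetical 'Extract element contents' action: store matches in an output field."""
    record[output_field] = extract_values(element_text, pattern)
    return record


# Text assumed to come from the element chosen by the jQuery selector.
element_text = "Was $19.99, now $14.50"
record = extract_element_contents({}, "prices", element_text, r"\$(\d+\.\d{2})")
print(record)
```

Here the capture group strips the currency symbol, which shows both uses mentioned above: matching several values and formatting the output.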