Web Data Extraction Basics
Extracting data from most simple web pages can be done via the Extractor robot editor without any technical knowledge. Using the point-and-click interface, you simply point to the information you want to extract, specify any required formatting, and choose the output field where the information should be saved.
Some web pages, however, require knowledge of web technologies such as HTML, CSS selectors, and JavaScript.
Read more about how to build Extractor robots:
- How to build an extractor?
- What should I know about the extractor editor?
- What should I know about site navigation?
Element paths (CSS Selectors)
On some web pages, you need some technical know-how to navigate the HTML structure and find the element (also called a tag) that holds the information you want.
The typical way to navigate the HTML is by using CSS selectors, called element paths in Dexi.
For example, to extract a price from a very simple HTML page like the one shown below, you could use the element path: div > p
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<div>
<h1>Price</h1>
<p>$9.99</p>
</div>
</body>
</html>
In the Extractor robot editor, this can be used in an Extract value step to extract the price.
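To check what an element path such as div > p matches before using it in a robot, one option is to try it as a plain CSS selector in a browser's developer console. The sketch below assumes the element path behaves like a standard CSS selector passed to document.querySelector. The helper name extractText is hypothetical and not part of Dexi.

```javascript
// Hypothetical helper (not a Dexi API): returns the trimmed text of the first
// element matching a CSS selector, or null when nothing matches.
function extractText(doc, selector) {
  const element = doc.querySelector(selector);
  return element ? element.textContent.trim() : null;
}

// `document` exists in a browser console but not in environments such as Node.js,
// so only run the lookup when a page is actually available.
if (typeof document !== "undefined") {
  console.log(extractText(document, "div > p"));
}
```

Run against the sample page above, the selector should pick the <p> element directly inside the <div>, which holds the price text.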
For other ways to find elements, see What should I know about elements, paths, and scopes?
For general information on HTML, the DOM, and CSS selectors, there are many articles and tutorials available online. Some useful resources are:
- W3Schools - HTML Element Reference
- W3Schools - HTML Global Attributes
- JavaScript.info - DOM Tree
- MDN - CSS Selectors
- W3Schools - CSS Selectors Reference
- jsoup - online CSS selector tester
Dexi uses CSS version 3.
Robust element paths
Writing a good element path takes some consideration. The more general you make it, the more robust it is to changes on the web page, but the less likely it is to match exactly the information you want to extract.
As an example, see the following HTML snippet:
<div>
<span>
<div>
<input type="text" name="username" id="username-1298172391617">
<input type="text" name="password" id="password-891291767394">
</div>
</span>
</div>
An example of a robust element path would be:
input[name="username"]
This is because it:
- Points to an element that most likely will continue to exist on the page.
- Does not depend on changes to the structure of the page.
An example of a not-so-robust element path would be:
div > span > div > input#username-1298172391617
This is because:
- The id looks like a dynamic number that could easily change.
- It depends heavily on the exact current structure of the page: if one of the <div> elements changes to a <span>, the element path is no longer valid.
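The difference between the two element paths can be illustrated with a small, self-contained sketch. It uses a toy selector matcher written only for this example: it understands just tag names, #id, a single [attr="value"] test, and the > child combinator, and it is not how Dexi or a browser evaluates selectors. The el, parseCompound, and select helpers, the "changed" page, and its regenerated id are all hypothetical.

```javascript
// Toy DOM node for illustration only: { tag, attrs, children }.
function el(tag, attrs = {}, children = []) {
  return { tag, attrs, children };
}

// Parse one selector part such as `div`, `input#some-id` or `input[name="x"]`.
function parseCompound(text) {
  const m = /^([a-z0-9]+)(?:#([\w-]+))?(?:\[([\w-]+)="([^"]*)"\])?$/i.exec(text.trim());
  if (!m) throw new Error(`Unsupported selector part: ${text}`);
  const [, tag, id, attrName, attrValue] = m;
  return { tag, id, attrName, attrValue };
}

function matchesCompound(node, part) {
  if (node.tag !== part.tag) return false;
  if (part.id !== undefined && node.attrs.id !== part.id) return false;
  if (part.attrName !== undefined && node.attrs[part.attrName] !== part.attrValue) return false;
  return true;
}

// Return all nodes matched by parts joined with ">" (direct-child chains only).
function select(root, selector) {
  const parts = selector.split(">").map(parseCompound);
  const results = [];
  function visit(node, ancestors) {
    const chain = [...ancestors, node];
    if (chain.length >= parts.length) {
      // The node and its nearest ancestors must match the parts in order.
      const tail = chain.slice(chain.length - parts.length);
      if (tail.every((n, i) => matchesCompound(n, parts[i]))) results.push(node);
    }
    for (const child of node.children) visit(child, chain);
  }
  visit(root, []);
  return results;
}

// The snippet from the text above.
const original = el("div", {}, [
  el("span", {}, [
    el("div", {}, [
      el("input", { type: "text", name: "username", id: "username-1298172391617" }),
      el("input", { type: "text", name: "password", id: "password-891291767394" }),
    ]),
  ]),
]);

// Hypothetical later version: the inner <div> became a <span> and the id changed.
const changed = el("div", {}, [
  el("span", {}, [
    el("span", {}, [
      el("input", { type: "text", name: "username", id: "username-0000000000000" }),
      el("input", { type: "text", name: "password", id: "password-0000000000000" }),
    ]),
  ]),
]);

const robust = 'input[name="username"]';
const brittle = "div > span > div > input#username-1298172391617";

console.log(select(original, robust).length);  // 1
console.log(select(original, brittle).length); // 1
console.log(select(changed, robust).length);   // 1
console.log(select(changed, brittle).length);  // 0
```

Both paths find the username field on the original page, but only the attribute-based path still finds it after the structure and the id change.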
When you select an element in the Extractor editor, the element path is generated automatically. These auto-generated element paths can be too general. You can update the element path using the approach above to create a more precise element path.
JavaScript
Most modern web pages use JavaScript (JS) to some extent, and some use it heavily. JavaScript makes pages interactive and dynamic: users can interact with the page, and the page can load content, change its appearance, and so on automatically.
Dexi fully supports JavaScript and is able to load complex pages, such as pages with calendars, menus, and much more using technologies such as React and AngularJS (specific Extractor step types for extracting from the AngularJS model are available).
It is also possible to execute arbitrary JS code, for example to encode or decode data, work with dates, or use mathematical functions, to name just a few uses.
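As a rough illustration of the kinds of tasks mentioned above, the following sketch uses only standard JavaScript built-ins. The sample values and variable names are made up for the example, and how such code is attached to a robot is not covered here.

```javascript
// Encoding and decoding: make a search term safe to place in a URL, then reverse it.
const encoded = encodeURIComponent("price & stock"); // "price%20%26%20stock"
const decoded = decodeURIComponent(encoded);         // "price & stock"

// Dates: build a date in UTC and format it as an ISO 8601 calendar date.
const isoDate = new Date(Date.UTC(2024, 0, 31)).toISOString().slice(0, 10); // "2024-01-31"

// Math: strip the currency symbol from an extracted price and convert it to cents.
const price = Number("$9.99".replace(/[^0-9.]/g, "")); // 9.99
const priceInCents = Math.round(price * 100);          // 999

console.log(encoded, decoded, isoDate, price, priceInCents);
```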
“Just get me the data, please”
If you are not technically inclined or don’t have the time to learn web technologies, we offer the Build my robot feature. Simply tell us what information you want from which web page(s), and we will build the necessary robot(s).
Sign in to the platform and select the Build my robot button.
If you need any other help, please email us at support@dexi.io.