Dexi is the ultimate platform for web scraping, browser automation and a whole lot more! In this article, we briefly explain how you can get started using the different parts of the platform.
To extract data from a website, create an Extractor robot:
- First, identify the starting page you wish to extract data from. Your imagination is the limit!
When extracting product information, for example, the starting page is often either a list page, listing multiple items, or a details page, presenting the details of a single item.
- Browse to app.dexi.io and sign up (or sign in, if you already have an account).
Select the activation link in the email you receive.
In the dialog that displays, select the item that most closely matches what you want to use Dexi for.
Some options will take you through a tour showing you how to use Dexi.
- If you didn't choose one of these options, select the Projects link in the left-hand menu.
- Select the green New... button and then select Create new robot.
- Ensure Extractor is selected in the dialog that displays.
- Paste the URL into the URL text box and provide a name for the robot.
- Select Create new robot.
- The requested URL will now load in an editor that allows you to point and select the elements you want the robot to interact with, e.g., extract data from.
Extractor robots provide a powerful way to interact with web pages. It is not only possible to extract data; you can do pretty much anything you can do in your browser, e.g., select buttons, select elements in lists and much more.
If you would like to see an example of an existing robot and do some experiments, select the Create an example robot button.
For details on how to build Extractor robots, search our knowledge base, to which we continuously add information.
For background information on a couple of web technologies (e.g., HTML) that you might need to build some Extractor robots, please see Web Data Extraction Basics.
Once your Extractor robot (or any other robot type) is working as intended in the editor, you must execute it to get actual results.
Robots can be executed with different configurations (called runs in parts of the platform), most importantly with multiple input values, which effectively executes the robot multiple times, e.g., with different search values or dates.
To get the results of your robot:
On the Projects page:
- Select the robot and select the New run button.
- Select the Open button to open the configuration.
- Change any settings you want to change, e.g., set a schedule, add any integrations or, if the robot takes any inputs, add/import inputs to the configuration.
- Execute the robot, or rather the configuration, by selecting the Execute now button (the latest saved version of the robot is executed).
- On the Executions tab, select View to view the execution. Depending on system load, it can take from a few seconds to a few minutes for the execution to start.
On the Results tab of the execution, results display as they are extracted. When the execution completes, results can be downloaded in various formats, such as CSV, XLS, and JSON, sent to and stored in a number of different places, such as Google Drive, Google Sheets, Amazon S3 or your own custom webhooks, or retrieved via the API.
Results are permanently removed 3 weeks after completion of the execution, so make sure you send the results to one or more external locations for permanent storage using integrations.
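As a sketch of the custom-webhook option: if you point an integration at your own HTTP endpoint, a small handler like the following could persist incoming result rows to a CSV file. The payload shape (a JSON array of flat row objects) is an assumption for illustration; inspect an actual delivery from your integration to confirm the real format.

```python
import csv
import json
from pathlib import Path


def store_webhook_payload(body: str, csv_path: str) -> int:
    """Parse a JSON webhook body of result rows and append them to a CSV file.

    Assumes the body is a JSON array of flat objects sharing the same keys;
    this is an illustrative guess at the payload, not a documented format.
    """
    rows = json.loads(body)
    if not rows:
        return 0
    path = Path(csv_path)
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example: two extracted product rows as a webhook might deliver them.
body = json.dumps([
    {"name": "Widget A", "price": "9.99"},
    {"name": "Widget B", "price": "14.50"},
])
# store_webhook_payload(body, "results.csv") would append both rows.
```

Wrapping such a function in any web framework's POST handler gives you a durable copy of every execution's results, independent of the 3-week retention window.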
The actual task of executing the robot is performed by what is called a worker. The number of workers on your account determines your capacity, i.e., how much work can be done concurrently.
Robot configurations with multiple inputs can be set to use multiple workers for faster execution. This is controlled by the Concurrent executions setting on the Configuration tab.
For example, if your subscription includes three workers, you can concurrently execute:
- Three robot configurations with one or no inputs.
- One robot configuration with multiple inputs and Concurrent executions set to 2, plus one configuration with one or no inputs.
That is, some combination that adds up to 3.
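The capacity arithmetic can be written out as a small sketch (this is just the bookkeeping described above, not part of the platform): each configuration consumes one worker per concurrent execution, and the total must not exceed the workers on your account.

```python
def workers_needed(concurrent_executions_per_config):
    """Total workers consumed by a set of concurrently running configurations.

    Each configuration uses one worker per concurrent execution; a
    configuration with one or no inputs counts as 1.
    """
    return sum(concurrent_executions_per_config)


def fits(concurrent_executions_per_config, workers_on_account):
    """True if the combination can run concurrently on the account."""
    return workers_needed(concurrent_executions_per_config) <= workers_on_account


# Three configurations with one or no inputs on a three-worker subscription:
print(fits([1, 1, 1], 3))  # True
# One configuration with Concurrent executions = 2, plus one single config:
print(fits([2, 1], 3))     # True
# Two configurations each set to 2 would need four workers:
print(fits([2, 2], 3))     # False
```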
Visit our price plans to see how many workers are included in each price plan.
Worker Utilization (avoid short-lived robots)
Executing many short-lived robots concurrently, i.e., robots that complete in less than ~60 seconds, can cause unpredictable utilization of the workers on your account, e.g., it can cause your workers to be under-utilized. Short-lived robots can be either different robots or the same robot executed for different inputs.
The reason for this limitation is that the administrative overhead, albeit insignificant for a small number of robots, adds up when running many robots or many inputs of the same robot.
Hence, to ensure optimal worker utilization, make sure that any robot you run many instances of concurrently runs for more than ~60 seconds. For example, instead of an Extractor robot getting the details of only one product, make sure it loops over a list of products, thereby increasing the runtime of the robot.
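One way to apply this advice is to group single-item inputs into batches before handing them to a robot, so each execution processes several items instead of one. A minimal sketch of that batching step (the product URLs are hypothetical):

```python
def batch_inputs(inputs, batch_size):
    """Group single-item inputs into batches so one robot execution
    processes several items, keeping its runtime above the ~60-second mark."""
    return [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]


# Ten hypothetical product URLs become three executions instead of ten:
product_urls = [f"https://example.com/products/{n}" for n in range(1, 11)]
for batch in batch_inputs(product_urls, 4):
    print(batch)
```

Each batch then becomes one input row for the robot, which loops over the URLs it contains.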
Where do I go next?
We understand that there can be a bit of a learning curve in using the platform efficiently, e.g., there are a number of new concepts to learn. The glossary provides a concise description of all concepts used in the dexi.io universe.
To explore more features of the platform, read on and follow the links below:
Pipes robots make it possible to automate data processing and transformation (ETL), performing arbitrarily complex business logic. For example, a Pipes robot could execute an Extractor robot, iterate over its results, call an external web service for each result, do some custom formatting of the web service result and save the enriched results in an external SQL database or a Dexi data set (see below).
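The kind of logic such a pipeline performs can be sketched in plain code. In Dexi you assemble the equivalent steps visually; the function names and the stand-in enrichment service below are hypothetical, purely to illustrate the extract-enrich-save flow:

```python
def run_pipeline(extractor_results, enrich, save):
    """Sketch of an ETL flow: iterate over extracted rows, enrich each one
    via an external service, then save the enriched rows to a destination."""
    enriched = []
    for row in extractor_results:
        extra = enrich(row)      # e.g., a call to an external web service
        row = {**row, **extra}   # custom formatting: merge service output in
        enriched.append(row)
    save(enriched)               # e.g., an SQL database or a Dexi data set
    return enriched


# Stand-in extractor results, enrichment and storage for illustration:
results = [{"name": "Widget A"}, {"name": "Widget B"}]
saved = []
out = run_pipeline(results, lambda r: {"name_length": len(r["name"])}, saved.extend)
print(out)
```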
Crawler robots allow you to quickly collect a large number of URLs and other basic information from a website, e.g., identify product pages on a website and save the URL and page title for each page. For example, a Pipes robot could execute a Crawler that gathers product pages on a website and sends each URL to an Extractor that extracts the required information.
AutoBot robots allow you to normalize/standardize (the fields of) results extracted from a number of different websites, e.g., extract and save product ID, name and description from three different webshops.
Data sets make it possible to work with large amounts of data (even images and files!) similar to a NoSQL collection or SQL table. Advanced deduplication and record linkage can be performed using, e.g., fuzzy matching.
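To illustrate the idea behind fuzzy matching for deduplication (this uses Python's standard difflib as a stand-in; it is not Dexi's own matching implementation):

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe(records, key, threshold=0.9):
    """Keep the first of any group of records whose key fields are
    near-identical under the similarity ratio."""
    kept = []
    for rec in records:
        if all(similarity(rec[key], k[key]) < threshold for k in kept):
            kept.append(rec)
    return kept


rows = [
    {"name": "Samsung Galaxy S10"},
    {"name": "Samsung Gallaxy S10"},  # misspelled near-duplicate
    {"name": "Samsung Galaxy Note"},
]
print(dedupe(rows, "name"))
```

The misspelled row scores well above the threshold against the first row and is dropped, while the genuinely different product survives.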
Dictionaries map keys to values and can be used, e.g., to correct misspellings like Galaxy vs Gallaxy.
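In plain-code terms, a dictionary behaves like a simple key-to-value lookup used to normalize values before further processing (a sketch, not Dexi's internal representation):

```python
# Known misspellings mapped to their canonical forms (illustrative entries).
corrections = {"Gallaxy": "Galaxy", "Iphone": "iPhone"}


def normalize(word):
    """Replace a known misspelling with its canonical form; pass through otherwise."""
    return corrections.get(word, word)


print(normalize("Gallaxy"))  # Galaxy
print(normalize("Pixel"))    # Pixel
```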
Addons add functionality to the platform in various ways. For example, integration addons allow you to send data to third-party services, e.g., Amazon S3, Box or Google Sheets. Other examples include CAPTCHA-solving services, Google Maps (geocoding) and machine learning/text analysis services. More addons are continuously implemented.
Triggers perform actions when events occur. For example, when an execution of a robot completes, results could be added to a data set. More events and actions are continuously implemented.
The API allows you to programmatically talk to dexi.io, e.g., get the results of an execution or start an execution of a robot configuration.
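A request for an execution's results might be assembled along these lines. The base URL, path and header names below are illustrative assumptions only; the API documentation is the authority on the real endpoints and authentication, so verify everything there before use.

```python
import urllib.request

API_BASE = "https://api.dexi.io"  # assumption: confirm in the API documentation


def build_results_request(execution_id, access_token, account_id):
    """Build (but do not send) a GET request for an execution's results.

    The path and header names are illustrative assumptions, not taken from
    the API documentation.
    """
    url = f"{API_BASE}/executions/{execution_id}/result"  # assumed path
    return urllib.request.Request(url, headers={
        "X-DexiIO-Access": access_token,  # assumed auth header
        "X-DexiIO-Account": account_id,   # assumed account header
        "Accept": "application/json",
    })


req = build_results_request("abc123", "YOUR-TOKEN", "YOUR-ACCOUNT-ID")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` (or any HTTP client) would then return the results as JSON, assuming the endpoint and credentials are correct.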
“Just get me the data, please”
If you are not technically inclined, or perhaps you don’t have the time to learn a new platform, we offer to build the robot for you. Simply tell us which information you want from which web page(s) and we will build the necessary robot(s).
To request a robot build, please see our Robot Building page.
You can also sign in to the platform and select the Build my robot button in the bottom left corner.
If you need any other help, please write us at firstname.lastname@example.org.
Thank you for reading and enjoy dexi.io!