Today we are releasing version 0.2 of Open Crawler, which has also been promoted to beta!
Open Crawler was initially released (version 0.1) in June 2024 as a tech preview. Since then, we've been iterating on the product and have added several new features.
To get access to these changes you can use the latest Docker artifact, or download the source code directly from the GitHub repository. Follow the setup instructions in our documentation to get started.
What's new?
A list of every change can be found in our changelog in the Open Crawler repository. In this blog we will cover only the new features and the configuration format changes.
Features
Feature | Description |
---|---|
Extraction rules | Allows for the extraction of HTML content using CSS and XPath selectors, and URL content using regex. |
Binary content extraction | Allows for the extraction of binary content from file types supported by the Apache Tika project. |
Crawl rules | Used to enable or disable certain URL patterns from being crawled and ingested. |
Purge crawls | Deletes outdated documents from the index at the end of a crawl job. |
Scheduling | Recurrent crawls can be scheduled based on a cron expression. |
Configuration changes
Alongside the new features, the config.yml file format has changed for a few fields, so existing 0.1 configuration files will not work with 0.2. Notably, the configuration field domain_allowlist has been renamed to domains, and seed_urls is now nested under each entry in domains instead of being a top-level field. This change was made so new features like extraction rules and crawl rules could be applied to specific domains, while still allowing a single crawler to configure multiple domains for a crawl job.
Make sure to reference the updated config.yml.example file to fix your configuration. Here is an example of migrating a 0.1 configuration to 0.2:
# 0.1 config file example - OUTDATED
domain_allowlist:
  - https://elastic.ac.cn
seed_urls:
  - https://elastic.ac.cn/
  - https://elastic.ac.cn/blog/

# The same config, in 0.2 format
domains:
  - url: https://elastic.ac.cn
    seed_urls:
      - https://elastic.ac.cn/
      - https://elastic.ac.cn/blog/

# Another 0.2 config for multiple domains
domains:
  - url: https://elastic.ac.cn
    seed_urls:
      - https://elastic.ac.cn/
      - https://elastic.ac.cn/blog/
  - url: https://parksaustralia.gov.au
    seed_urls:
      - https://parksaustralia.gov.au/
      - https://parksaustralia.gov.au/news/
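If you have many 0.1 configs to convert, the mapping can be sketched in a few lines of Python. This is only an illustration working on plain dictionaries: the helper name migrate_config is made up for this post, and it assumes that each seed URL belongs to the domain that shares its scheme and host.

```python
from urllib.parse import urlsplit


def migrate_config(old: dict) -> dict:
    """Convert a 0.1-style config dict into the 0.2 `domains` layout.

    Assumption: a seed URL belongs to the domain with the same scheme and host.
    Seed URLs that match no allowlisted domain are silently dropped.
    """
    domains = [{"url": url.rstrip("/")} for url in old.get("domain_allowlist", [])]
    for seed in old.get("seed_urls", []):
        parts = urlsplit(seed)
        origin = f"{parts.scheme}://{parts.netloc}"
        for domain in domains:
            if domain["url"] == origin:
                domain.setdefault("seed_urls", []).append(seed)
                break
    # Keep every other top-level field untouched.
    new = {k: v for k, v in old.items() if k not in ("domain_allowlist", "seed_urls")}
    new["domains"] = domains
    return new


old_config = {
    "domain_allowlist": ["https://elastic.ac.cn"],
    "seed_urls": ["https://elastic.ac.cn/", "https://elastic.ac.cn/blog/"],
}
new_config = migrate_config(old_config)
print(new_config)
```

The result mirrors the 0.2 example above: one entry in domains with its own seed_urls list.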
Showcase 1: crawl rules
We're very excited to bring the crawl rules feature to Open Crawler. This is an existing feature in the Elastic Crawler. The biggest difference for crawl rules between Open Crawler and Elastic Crawler is the way these rules are configured. Elastic Crawler is configured using Kibana, while Open Crawler has crawl rules defined for a domain in the crawler.yml config file.
Crawling only specific endpoints
When determining if a URL is crawlable, Open Crawler will execute crawl rules in order from top to bottom.
In the example below, we want to crawl only the content of https://elastic.ac.cn/search-labs. Because this page links to other URLs within the https://elastic.ac.cn domain, it's not enough to limit the seed_urls to this entry point. We also need two crawl rules:
- An allow rule for everything under (and including) the /search-labs URL pattern
- A deny-everything rule to catch all other URLs
In this example we are using a regex pattern for the deny rule.
domains:
  - url: https://elastic.ac.cn
    seed_urls:
      - https://elastic.ac.cn/search-labs
    crawl_rules:
      - policy: allow
        type: begins
        pattern: /search-labs
      - policy: deny
        type: regex
        pattern: .*
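To make the top-to-bottom evaluation concrete, here is a minimal Python sketch of how such a rule list could be applied. It is not Open Crawler's implementation: the function is_allowed is hypothetical, it assumes the first matching rule decides the outcome, that patterns are checked against the URL path, and that a URL matching no rule is crawled.

```python
import re
from urllib.parse import urlsplit


def is_allowed(url: str, crawl_rules: list) -> bool:
    """Return True if the URL would be crawled under the given rules.

    Assumptions: rules are checked in order, the first match wins,
    and a URL that matches no rule is allowed.
    """
    path = urlsplit(url).path or "/"
    for rule in crawl_rules:
        if rule["type"] == "begins":
            matched = path.startswith(rule["pattern"])
        elif rule["type"] == "regex":
            matched = re.match(rule["pattern"], path) is not None
        else:
            raise ValueError(f"rule type not covered by this sketch: {rule['type']}")
        if matched:
            return rule["policy"] == "allow"
    return True


rules = [
    {"policy": "allow", "type": "begins", "pattern": "/search-labs"},
    {"policy": "deny", "type": "regex", "pattern": ".*"},
]

print(is_allowed("https://elastic.ac.cn/search-labs/blog/example", rules))
print(is_allowed("https://elastic.ac.cn/contact", rules))
```

Under these assumptions the order matters: if the deny rule came first, it would match every URL and the allow rule would never be reached.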
If I want to add another URL path to this configuration (for example, /security-labs), I need to:
- Add it as a seed_url
- Add it to the crawl_rules above the deny-all rule
domains:
  - url: https://elastic.ac.cn
    seed_urls:
      - https://elastic.ac.cn/search-labs
      - https://elastic.ac.cn/security-labs
    crawl_rules:
      - policy: allow
        type: begins
        pattern: /search-labs
      - policy: allow
        type: begins
        pattern: /security-labs
      - policy: deny
        type: regex
        pattern: .*
With this style of configuration, you can be very specific about which webpages Open Crawler will ingest.
If you have debug logs enabled, each denied URL will show up in the logs like this:
[crawl:<crawl_id>] [primary] The URL <denied URL> will not be crawled due to configured crawl rule: policy: deny, url_pattern: <deny rule explanation>
Here's an actual example from my crawl results:
[crawl:66d87958c78889f0053ceaf6] [primary] The URL https://elastic.ac.cn/contact will not be crawled due to configured crawl rule: policy: deny, url_pattern: (?-mix:\Ahttps:\/\/www\.elastic\.co.*)
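If you want to collect the denied URLs from a long debug log, the line format above is regular enough to parse. The following Python sketch is based only on the two log lines shown in this post; the names DENIED_LINE and parse_denied are made up for illustration, and real logs may carry extra prefixes such as timestamps, which is why the pattern is searched for rather than anchored.

```python
import re

# Built from the log format shown above; not an official log schema.
DENIED_LINE = re.compile(
    r"\[crawl:(?P<crawl_id>[^\]]+)\] \[primary\] The URL (?P<url>\S+) "
    r"will not be crawled due to configured crawl rule: "
    r"policy: deny, url_pattern: (?P<url_pattern>.+)$"
)


def parse_denied(line: str):
    """Return the crawl id, URL and rule pattern of a denied-URL log line, or None."""
    match = DENIED_LINE.search(line)
    return match.groupdict() if match else None


sample = (
    "[crawl:66d87958c78889f0053ceaf6] [primary] The URL https://elastic.ac.cn/contact "
    "will not be crawled due to configured crawl rule: policy: deny, "
    r"url_pattern: (?-mix:\Ahttps:\/\/www\.elastic\.co.*)"
)

print(parse_denied(sample))
```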
Crawling everything except a specific endpoint
This pattern is much easier to implement, as Crawler will crawl everything by default. All that is needed is to add a deny rule for the URL pattern that you want to exclude.
In this example, I want to crawl the entire https://elastic.ac.cn website, except for anything under /search-labs. Because I want to crawl everything, seed_urls is not needed for this configuration.
domains:
  - url: https://elastic.ac.cn
    crawl_rules:
      - policy: deny
        type: begins
        pattern: /search-labs
Now if I run a crawl, Crawler will not ingest webpages with URLs that begin with /search-labs.
Showcase 2: extraction rules
Extraction rules are another much-requested feature for Open Crawler. Like crawl rules, extraction rules function almost the same as they do for Elastic Crawler, except for how they are configured. Extraction rules are configured under extraction_rulesets, which belongs to a single item in domains.
domains:
  - url: https://elastic.ac.cn
    extraction_rulesets:
      # rules go here
Getting the CSS selector
For this example, I want to extract the authors' names for each blog article in /search-labs and assign them to the field author. Without extraction rules, each blog's Elasticsearch document will have the author names buried in the body field.
Using my browser developer tools (in my case, the Firefox dev tools), I can visit the webpage and use the selector tool to find what CSS selectors an HTML element has. I can now see that the authors are stored in a <p> element with a few different classes, but most eye-catching is the class .author-name.
Now, to test that using the selector .author-name is enough to fetch only the author name from this field, I can use the dev tools HTML search feature. Unfortunately, I can see that using only this class name returns 11 results for this blog post.
After some investigation, I found that this is because the "Recommended articles" section at the bottom of a page also uses the .author-name class. To remedy this, we need a more restrictive selector.
Examining the HTML code directly, I can see that the author name I want to extract is nested a few levels under an element with the class .sticky, which is the sidebar.
We can combine these into a single selector, .sticky .author-name, which matches only elements with the .author-name class that are nested within an element with the .sticky class. We can then test this in the same HTML search bar as before, and ta-da!
Only one hit -- we've found our CSS selector!
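The same check can be reproduced outside the browser. The sketch below uses only Python's standard html.parser and a made-up HTML snippet (the markup, class layout and author names are hypothetical, not the real page) to show why the bare class matches too much while the descendant selector narrows it down. It assumes well-formed HTML and only handles the "class nested under class" case, not CSS selectors in general.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassCounter(HTMLParser):
    """Counts elements carrying `target`, optionally only under an `ancestor` class."""

    def __init__(self, target, ancestor=None):
        super().__init__()
        self.target = target
        self.ancestor = ancestor
        self.open_classes = []  # class sets of the currently open elements
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        nested = self.ancestor is None or any(
            self.ancestor in parent for parent in self.open_classes
        )
        if self.target in classes and nested:
            self.matches += 1
        if tag not in VOID_TAGS:
            self.open_classes.append(classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_classes:
            self.open_classes.pop()


def count_matches(html, target, ancestor=None):
    parser = ClassCounter(target, ancestor)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical markup that imitates the structure described above.
SAMPLE_HTML = """
<div class="sticky">
  <div><p class="author-name text-sm">Jane Doe</p></div>
</div>
<section class="recommended">
  <p class="author-name">Someone Else</p>
  <p class="author-name">Another Author</p>
</section>
"""

print(count_matches(SAMPLE_HTML, "author-name"))
print(count_matches(SAMPLE_HTML, "author-name", ancestor="sticky"))
```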
Configuring the extraction rules
Now we can add the CSS selector from the previous step.
We also need to define the url_filters for this rule. This will determine which endpoints the extraction rule is executed against. All Search Labs articles follow the URL format https://elastic.ac.cn/search-labs/blog/<slug>, so this can be achieved with a simple regex pattern: /search-labs/blog/.+$
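- /search-labs/blog/ matches the start of the URL path
- .+ matches one or more characters (any character except line breaks)
- $ marks the end of the string
  - This is meant to stop sub-URLs like https://elastic.ac.cn/search-labs/blog/<slug>/<something-else> from having this extraction rule. Note that .+ also matches /, so a stricter pattern such as /search-labs/blog/[^/]+$ is needed if such sub-URLs must be excluded.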
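Before running a full crawl, it can be worth trying a url_filters pattern against a few sample paths. Here is a quick Python sketch; it assumes the pattern is searched for anywhere in the URL path, which may differ from how Open Crawler applies it internally, and the sample paths are invented. Under Python's regex semantics, .+ also matches slashes, so the second, stricter pattern is shown for comparison.

```python
import re

# The pattern used in this post, plus a stricter variant for comparison.
URL_FILTER = re.compile(r"/search-labs/blog/.+$")
STRICT_FILTER = re.compile(r"/search-labs/blog/[^/]+$")


def matches_filter(path: str, pattern=URL_FILTER) -> bool:
    """Return True if the pattern is found in the path (unanchored search)."""
    return pattern.search(path) is not None


samples = [
    "/search-labs",
    "/search-labs/blog/",
    "/search-labs/blog/some-article",
    "/search-labs/blog/some-article/extra",
]

for path in samples:
    print(path, matches_filter(path), matches_filter(path, STRICT_FILTER))
```

Both patterns accept the article path and reject the blog index; only the stricter one rejects the path with an extra segment.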
In this example we will also utilize crawl rules, to avoid crawling the entire https://elastic.ac.cn website.
domains:
  - url: https://elastic.ac.cn
    seed_urls:
      - https://elastic.ac.cn/search-labs
    crawl_rules:
      - policy: allow
        type: begins
        pattern: /search-labs
      - policy: deny
        type: regex
        pattern: .*
    extraction_rulesets:
      - url_filters:
          - type: regex
            pattern: "/search-labs/blog/.+$"
        rules:
          - action: extract
            field_name: author
            selector: ".sticky .author-name"
            join_as: string
            source: html
After completing a crawl with the above configuration, I can check for the new author field in the ingested documents.
I can do this using a _search query to find articles written by the author Sebastien Guilloux.
POST /crawler-beta-blog/_search
{
  "_source": {
    "includes": ["author", "title"]
  },
  "query": {
    "match": {
      "author": "Sebastien Guilloux"
    }
  }
}
And we have a single hit!
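If you would rather script this check, the request body is easy to build programmatically. The Python sketch below only constructs the body and reads titles out of a response; the helper names are made up, the sample response is a hypothetical, trimmed-down stand-in for a real _search response, and sending the request is left to whichever client you already use.

```python
import json


def build_author_query(author: str) -> dict:
    """Build the same request body as the _search example for a given author."""
    return {
        "_source": {"includes": ["author", "title"]},
        "query": {"match": {"author": author}},
    }


def titles_from_response(response: dict) -> list:
    """Collect the title of every hit in a _search response body."""
    return [hit["_source"].get("title") for hit in response["hits"]["hits"]]


# Hypothetical response used only to exercise the helper.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"author": "Sebastien Guilloux", "title": "Example article"}}
        ]
    }
}

print(json.dumps(build_author_query("Sebastien Guilloux"), indent=2))
print(titles_from_response(sample_response))
```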
Showcase 3: combining it all with Semantic Text
Jeff Vestal wrote a fantastic article combining Open Crawler with Semantic Text search, among other cool RAG things. Read up on that here.
Comparing with Elastic Crawler
We now maintain a feature comparison table on the Open Crawler repository to compare the features available for Open Crawler vs Elastic Crawler.
Next steps
The next release will bring Open Crawler to version 1.0, and will also promote it to GA (generally available). We don't have a release date planned for this version yet. We do have a general idea of some features we want to include:
- Extraction using data attributes and meta tags
- Full HTML extraction
- Send event logs to Elasticsearch
This list is not exhaustive, and depending on user feedback we will include other features in the 1.0 GA release. If there are other features you would like to see included, feel free to create an enhancement issue directly on the Open Crawler repository. Feedback like this will help us prioritize what to include in the next release.