Installation
elasticsearch-hadoop binaries can be obtained either by downloading them from the elastic.co site as a ZIP (containing project jars, sources and documentation) or by using any Maven-compatible tool with the following dependency:
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch-hadoop</artifactId> <version>8.14.3</version> </dependency>
The jar above contains all the features of elasticsearch-hadoop and does not require any other dependencies at runtime; in other words it can be used as is.
The elasticsearch-hadoop binary is suitable for Hadoop 2.x (also known as YARN) environments. Support for Hadoop 1.x environments is deprecated as of 5.5 and is no longer tested against from 6.0 onward.
Minimalistic binaries
In addition to the uber jar, elasticsearch-hadoop provides minimalistic jars for each integration, tailored for those who use just one module (in all other situations the uber jar is recommended); the jars are smaller in size and use a dedicated pom, covering only the needed dependencies.
These are available under the same groupId, using an artifactId with the pattern elasticsearch-hadoop-{integration}:
Map/Reduce.
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch-hadoop-mr</artifactId> <version>8.14.3</version> </dependency>
Apache Hive.
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch-hadoop-hive</artifactId> <version>8.14.3</version> </dependency>
Apache Spark.
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch-spark-30_2.12</artifactId> <version>8.14.3</version> </dependency>
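Note the suffix of the Spark artifact ID: it encodes both the Spark version and the Scala version the artifact is built for, as listed in the compatibility matrix below.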
The Spark connector framework is the most sensitive to version incompatibilities. For your convenience, a version compatibility matrix has been provided below:
Spark Version | Scala Version | ES-Hadoop Artifact ID |
---|---|---|
1.0 - 2.x | 2.10 | <unsupported> |
1.0 - 1.6 | 2.11 | <unsupported> |
2.x | 2.11 | elasticsearch-spark-20_2.11 |
2.x | 2.12 | elasticsearch-spark-20_2.12 |
3.0+ | 2.12 | elasticsearch-spark-30_2.12 |
3.2+ | 2.13 | elasticsearch-spark-30_2.13 |
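The matrix can also be expressed as a small lookup, which may be handy in build scripts that need to pick the right artifact automatically. The following is only a sketch: the function names `spark_artifact_id` and `_major_minor` are hypothetical and not part of elasticsearch-hadoop, and it assumes that "3.0+" and "3.2+" refer to Spark 3.x releases only.

```python
from typing import Optional, Tuple


def _major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '3.4.1' or '2.12'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def spark_artifact_id(spark_version: str, scala_version: str) -> Optional[str]:
    """Look up the artifact ID for a Spark/Scala pair.

    Returns None for combinations the matrix marks as unsupported
    or does not list at all.
    """
    spark_major, spark_minor = _major_minor(spark_version)
    scala = _major_minor(scala_version)

    # Spark 2.x is covered by the "20" artifacts for Scala 2.11 and 2.12.
    if spark_major == 2 and scala in ((2, 11), (2, 12)):
        return f"elasticsearch-spark-20_{scala[0]}.{scala[1]}"

    # Assumption: "3.0+" and "3.2+" are read as Spark 3.x only.
    if spark_major == 3:
        if scala == (2, 12):
            return "elasticsearch-spark-30_2.12"
        if scala == (2, 13) and spark_minor >= 2:
            return "elasticsearch-spark-30_2.13"

    # Spark 1.x and any other combination: unsupported per the matrix.
    return None
```

For example, `spark_artifact_id("3.4.1", "2.12")` selects the same artifact ID used in the Apache Spark dependency shown earlier.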
Development Builds
Development (nightly or snapshot) builds are published daily to the sonatype-oss repository (see below). Make sure to use snapshot versioning:
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch-hadoop</artifactId> <version>{version}-SNAPSHOT</version> </dependency>
You must also enable the dedicated snapshots repository in your build configuration.
Upgrading Your Stack
Elasticsearch for Apache Hadoop is a client library for Elasticsearch, albeit one with extended functionality for supporting operations on Hadoop/Spark. When upgrading Hadoop/Spark versions, verify that your new versions are supported by the connector, and upgrade your elasticsearch-hadoop version as appropriate.
Elasticsearch for Apache Hadoop maintains backwards compatibility with the most recent minor version of Elasticsearch’s previous major release (5.X supports back to 2.4.X, 6.X supports back to 5.6.X, etc…). When you are upgrading your version of Elasticsearch, it is best to upgrade elasticsearch-hadoop to the new version (or higher) first. The new elasticsearch-hadoop version should continue to work for your previous Elasticsearch version, allowing you to upgrade as normal.
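The compatibility rule above can be sketched as a pre-upgrade check. This is an illustration under stated assumptions, not a tool shipped with the connector: `connector_supports` and `OLDEST_SUPPORTED` are hypothetical names, the table contains only the two examples given in the text, and within the same major version the sketch conservatively requires the connector to be at or ahead of the cluster, following the advice to upgrade elasticsearch-hadoop first.

```python
from typing import Dict, Tuple

# Connector major -> (major, minor) of the previous Elasticsearch major's
# most recent minor release. Only the two examples from the text are
# listed; extend this table for other versions.
OLDEST_SUPPORTED: Dict[int, Tuple[int, int]] = {
    5: (2, 4),
    6: (5, 6),
}


def _major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '5.6.16'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def connector_supports(connector_version: str, es_version: str) -> bool:
    """Check whether a connector version should work with an Elasticsearch version."""
    connector = _major_minor(connector_version)
    es = _major_minor(es_version)

    if es[0] == connector[0]:
        # Same major. Assumption: the connector should not lag behind the
        # cluster, since the text recommends upgrading the connector first.
        return connector >= es

    # Different major: only the last minor of the previous major qualifies.
    return OLDEST_SUPPORTED.get(connector[0]) == es
```

With this sketch, a 6.x connector is accepted against a 5.6.x cluster but rejected against 5.5.x, which matches the upgrade order described above: move the connector first, then the cluster.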
Elasticsearch for Apache Hadoop does not support rolling upgrades well. During a rolling upgrade, the nodes that elasticsearch-hadoop communicates with will regularly disappear and come back online. Because of the constant connection failures that elasticsearch-hadoop will experience during a rolling upgrade, there is a high probability that your jobs will fail. It is therefore recommended that you disable any elasticsearch-hadoop based write or read jobs against Elasticsearch during your rolling upgrade process.
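One way to enforce this in a job scheduler is to gate job submission on the state of the cluster. The sketch below is a heuristic, not a connector feature: it inspects an already-parsed response from the Elasticsearch nodes info API (`GET /_nodes`), where each entry under `nodes` carries a `version` field, and treats mixed versions or missing nodes as a sign that an upgrade may be in progress. The function names and the `expected_node_count` parameter are hypothetical.

```python
from typing import Any, Dict, Set


def node_versions(nodes_info: Dict[str, Any]) -> Set[str]:
    """Collect the distinct Elasticsearch versions reported by the nodes."""
    return {node["version"] for node in nodes_info.get("nodes", {}).values()}


def safe_to_run_job(nodes_info: Dict[str, Any], expected_node_count: int) -> bool:
    """Return True only if the cluster looks stable enough to start a job.

    Heuristic (an assumption of this sketch): every expected node is
    present and all nodes report the same version.
    """
    nodes = nodes_info.get("nodes", {})
    if len(nodes) != expected_node_count:
        # A node is missing or extra: it may be restarting mid-upgrade.
        return False
    return len(node_versions(nodes_info)) == 1


# Example with a hand-written response fragment (node IDs are made up).
mid_upgrade = {
    "nodes": {
        "node-a": {"version": "8.13.4"},
        "node-b": {"version": "8.14.3"},
    }
}
if not safe_to_run_job(mid_upgrade, expected_node_count=2):
    print("Cluster appears to be mid-upgrade; postponing job.")
```

A check like this only reduces the chance of starting a job at a bad moment; it does not protect jobs that are already running when an upgrade begins, so pausing scheduled jobs for the duration of the upgrade remains the safer option.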