Resolving Spark Dependency Conflicts on AWS Glue
Process and approaches for handling conflicting dependencies
If you are a Spark developer, you have likely run into a dependency conflict at some point. While troubleshooting a recent issue with multiple conflicts, I was inspired to share my process along with some tips and approaches to resolving dependency conflicts during development with Apache Spark and AWS Glue.
The problem started with a failure in an AWS Glue 3.0 job written in Scala. A new change was incorporated into the code to perform custom conformance and validation of JSON structures using the Spark Dataset API and the Scala JSON library Circe. The self-contained unit tests all passed in the CI pipeline, but the job failed when running a test on the Glue platform with the error:
java.lang.NoClassDefFoundError: Could not initialize class io.circe.Decoder$
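For context, the failing change was doing something along these lines (a minimal sketch only; the case class, field names, and conformance rules here are made up):

import io.circe.generic.auto._
import io.circe.parser.decode
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record shape; the real job's schema and rules are more involved
case class InputRecord(id: String, amount: Double)

object CirceValidation {
  // Keep only the rows that decode cleanly into the expected structure.
  // Decoding each line with Circe is the kind of call that hit the
  // NoClassDefFoundError once the job ran on Glue.
  def validRecords(spark: SparkSession, rawJson: Dataset[String]): Dataset[InputRecord] = {
    import spark.implicits._
    rawJson.flatMap(line => decode[InputRecord](line).toOption.toList)
  }
}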
My first thought was that it sounded like a dependency conflict. As described in the Java documentation, a NoClassDefFoundError means the class definition existed when the class was compiled but can no longer be found at runtime. The next step was to figure out which dependency was causing the issue. Spark doesn't use Circe, so the culprit was likely a transitive dependency of Circe. I decided to create the simplest self-contained Spark job using Circe that I could, both to eliminate any other variables and to surface additional hints about the cause. I ran it as a Scala script in Glue Studio and still got the same NoClassDefFoundError, but luckily something new showed up in the error logs: cats appeared to be part of the issue.
My first inclination was to check my uber-jar dependencies. For this work the team is using Maven, so the command to use is:
mvn dependency:tree -Ddetail=true
With this we can see that cats-kernel is a transitive dependency of circe-core (via cats-core):
[INFO] my.package.data:circe-spark:jar:1.0.0
[INFO] +- io.circe:circe-core_2.12:jar:0.15.0-M1:compile
[INFO] |  +- org.scala-lang:scala-library:jar:2.12.14:compile
[INFO] |  +- io.circe:circe-numbers_2.12:jar:0.15.0-M1:compile
[INFO] |  \- org.typelevel:cats-core_2.12:jar:2.6.1:compile
[INFO] |     +- org.typelevel:cats-kernel_2.12:jar:2.6.1:compile
[INFO] |     \- org.typelevel:simulacrum-scalafix-annotations_2.12:jar:0.5.4:compile
[INFO] +- io.circe:circe-parser_2.12:jar:0.15.0-M1:compile
[INFO] |  \- io.circe:circe-jawn_2.12:jar:0.15.0-M1:compile
[INFO] |     \- org.typelevel:jawn-parser_2.12:jar:1.2.0:compile
[INFO] +- io.circe:circe-generic_2.12:jar:0.15.0-M1:compile
[INFO] |  \- com.chuusai:shapeless_2.12:jar:2.3.7:compile
But it isn't a dependency of spark-sql, which explains why the local tests all passed without issue: locally, the only cats-kernel on the classpath was the 2.6.1 version pulled in by Circe. The next step was to look at the jars that get added to the Spark classpath during spark-submit.
For this task we can use the environment variable SPARK_PRINT_LAUNCH_COMMAND=true, which prints the complete Spark command, including the classpath, to standard out when running spark-submit. The spark-submit command can be run from a local install of Spark or from a Spark/Glue container; I ran it from a container similar to the one described in the AWS Glue Developer Guide. This lets us find the directory (or directories) of jars getting added to the Spark classpath.
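For example, from inside the container it can be run against the uber-jar along these lines (the class name here is a placeholder):

SPARK_PRINT_LAUNCH_COMMAND=true spark-submit \
  --class my.package.data.CirceSparkJob \
  target/circe-spark-1.0.0.jar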
...
breeze_2.12-1.0.jar
breeze-macros_2.12-1.0.jar
cats-kernel_2.12-2.0.0-M4.jar
chill_2.12-0.9.5.jar
chill-java-0.9.5.jar
...
From this we can see that Spark 3.1 has a dependency on cats-kernel 2.0.0-M4, while Circe depends on a higher minor version, cats-kernel 2.6.1, so we need to resolve the conflict.
Spark has two configurations, spark.driver.userClassPathFirst and spark.executor.userClassPathFirst, that specify whether user-added jars should take precedence over Spark's own jars. This is intended to mitigate dependency conflicts. The feature is marked experimental and only works in cluster mode, so we can't test it locally. It is a good place to start, but in my scenario setting these configurations when creating the GlueContext did not resolve the conflict. Also, use of the --conf parameter is marked internal to AWS Glue, and the AWS documentation directs us not to use it.
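For reference, this is roughly how one might try those settings when creating the GlueContext (a sketch only; as noted, it did not resolve the conflict in my case):

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}

// Experimental flags asking Spark to prefer user-supplied jars over its own
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")

val sparkContext = new SparkContext(conf)
val glueContext = new GlueContext(sparkContext)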
Maven and other build tools allow you to exclude transitive dependencies. Since cats-kernel is causing the problem, we can exclude it from Circe. We should keep the cats dependencies in sync, so we exclude both cats-kernel and cats-core, and then add back versions that match what Spark provides.
<dependency>
  <groupId>io.circe</groupId>
  <artifactId>circe-core_2.12</artifactId>
  <version>0.15.0-M1</version>
  <exclusions>
    <!-- Need to exclude cats to avoid conflict with Spark 3.1
         spark-submit uses cats-kernel_2.12-2.0.0-M4.jar -->
    <exclusion>
      <groupId>org.typelevel</groupId>
      <artifactId>cats-kernel_2.12</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.typelevel</groupId>
      <artifactId>cats-core_2.12</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Re-add cats-core since it should match the version of cats-kernel,
     but is not provided by spark-submit -->
<dependency>
  <groupId>org.typelevel</groupId>
  <artifactId>cats-core_2.12</artifactId>
  <version>2.0.0-M4</version>
</dependency>

<!-- Re-add cats-kernel but it is provided by spark-submit -->
<dependency>
  <groupId>org.typelevel</groupId>
  <artifactId>cats-kernel_2.12</artifactId>
  <version>2.0.0-M4</version>
  <scope>provided</scope>
</dependency>
So that definitely looks overly complicated, but it should work. Let’s give it a try. Wait, what? A new exception??
java.lang.NoSuchMethodError: 'shapeless.DefaultSymbolicLabelling shapeless.DefaultSymbolicLabelling$.instance(shapeless.HList)'
There must be another conflict, this time with shapeless. From the dependency tree above, we see that Circe uses shapeless 2.3.7 (circe-generic relies on it for generic derivation of codecs). Going back to the spark-submit classpath, we see Spark ships shapeless 2.3.3:
...
scala-reflect-2.12.10.jar
scala-xml_2.12-1.2.0.jar
shapeless_2.12-2.3.3.jar
shims-0.9.0.jar
slf4j-api-1.7.30.jar
...
We can add more to the POM to handle shapeless the same way:
<dependency>
  <groupId>io.circe</groupId>
  <artifactId>circe-generic_2.12</artifactId>
  <version>0.15.0-M1</version>
  <exclusions>
    <!-- Need to exclude shapeless to avoid conflict with Spark 3.1
         spark-submit uses shapeless_2.12-2.3.3.jar -->
    <exclusion>
      <groupId>com.chuusai</groupId>
      <artifactId>shapeless_2.12</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Re-add shapeless with version provided by spark-submit -->
<dependency>
  <groupId>com.chuusai</groupId>
  <artifactId>shapeless_2.12</artifactId>
  <version>2.3.3</version>
  <scope>provided</scope>
</dependency>
After all that, Spark and the app peacefully co-exist. There is some risk here, though: swapping transitive dependencies, even across only minor versions, could result in unintended behavior.
Another approach is to shade, or rename, the conflicting dependencies. We are already using the Maven Shade Plugin to create the uber-jar, and we can also use it to relocate the problem classes. With shading, the classes are moved to a new package and a private copy of their bytecode is created; in the uber-jar, the imports and references to those classes are rewritten in the affected bytecode.
Before we try that, let's look at the contents of the uber-jar built before the dependency exclusions, using the following command to list the jar's contents.
jar -tf target/circe-spark-1.0.0.jar
...
cats/kernel/
cats/kernel/Band$$anon$1.class
cats/kernel/Band$.class
...
shapeless/
shapeless/$colon$colon$.class
shapeless/$colon$colon.class
...
Adding the relocations to the Maven Shade Plugin configuration in the POM:
<relocations>
  <relocation>
    <pattern>cats.kernel</pattern>
    <shadedPattern>shaded.cats.kernel</shadedPattern>
  </relocation>
  <relocation>
    <pattern>shapeless</pattern>
    <shadedPattern>shaded.shapeless</shadedPattern>
  </relocation>
</relocations>
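For context, these relocations live inside the shade plugin's configuration in the POM, roughly like this (the plugin version and execution binding are placeholders for whatever your build already uses):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- cats.kernel and shapeless relocations from above -->
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>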
Rebuilding the uber-jar and listing its contents again shows the relocated classes:
...
shaded/cats/kernel/
shaded/cats/kernel/Band$$anon$1.class
shaded/cats/kernel/Band$.class
...
shaded/shapeless/
shaded/shapeless/$colon$colon$.class
shaded/shapeless/$colon$colon.class
This approach also lets Spark and the app peacefully co-exist, and it is simpler and safer than excluding the dependencies.
Dependency conflicts tend to come with the territory of developing a Scala-based Spark application since Spark itself has so many dependencies. For my situation, I ultimately used the dependency shading approach to resolve the issue. It is helpful to know what tools and techniques are available to help you quickly identify and resolve the issues so you can get back to solving your core business problem.