Resolving Spark Dependency Conflicts on AWS Glue

Process and approaches for handling conflicting dependencies

Nick Christopulos - Staff Technology Engineer


If you are a Spark developer, you have likely run into a dependency conflict at some point. While troubleshooting a recent issue involving multiple conflicts, I was inspired to share my process, along with some tips and approaches for resolving dependency conflicts when developing with Apache Spark and AWS Glue.

The Problem

The problem started with a failure in an AWS Glue 3.0 job written in Scala. A new change was incorporated into the code to perform custom conformance and validation of JSON structures using the Spark Dataset API and the Scala JSON library Circe. The self-contained unit tests all passed in the CI pipeline, but the job failed when running a test on the Glue platform with the error:

java.lang.NoClassDefFoundError: Could not initialize class io.circe.Decoder$

The Cause

My first thought was that it sounded like a dependency conflict. As described in the Java documentation, a NoClassDefFoundError means the class definition existed when the class was compiled but can no longer be found at runtime. The next step was to figure out which dependency was causing the issue. Spark doesn’t use Circe, so it was likely some transitive dependency of Circe. I decided to create the simplest self-contained Spark job using Circe that I could (a sketch appears after the error below). That way I could eliminate any other variables and also see whether there were any other hints as to the cause. I ran this as a Glue Studio Scala script. I still got the same NoClassDefFoundError, but luckily something new showed up in the error logs:

java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)
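
For reference, the stripped-down job looked roughly like the sketch below (the object name, case class, and JSON literals are illustrative, not the production code):

import io.circe.generic.auto._
import io.circe.parser.decode
import org.apache.spark.sql.SparkSession

object CirceSparkRepro {

  // Minimal case class to exercise Circe's automatically derived Decoder.
  case class Event(id: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("circe-spark-repro").getOrCreate()
    import spark.implicits._

    // Decode a couple of JSON strings with Circe inside a Dataset transformation,
    // keeping only the values that parse successfully.
    val events = Seq("""{"id":"a","amount":1.0}""", """{"id":"b","amount":2.5}""")
      .toDS()
      .flatMap(json => decode[Event](json).toOption)

    events.show(truncate = false)
    spark.stop()
  }
}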

So cats appeared to be part of the issue. My first inclination was to check my uber-jar dependencies. For this work the team is using Maven, so the command to use is:

mvn dependency:tree

With this we can see that cats-kernel is a transitive dependency of circe-core_2.12, pulled in via cats-core.

[INFO] my.package.data:circe-spark:jar:1.0.0
[INFO] +- io.circe:circe-core_2.12:jar:0.15.0-M1:compile
[INFO] |  +- org.scala-lang:scala-library:jar:2.12.14:compile
[INFO] |  +- io.circe:circe-numbers_2.12:jar:0.15.0-M1:compile
[INFO] |  \- org.typelevel:cats-core_2.12:jar:2.6.1:compile
[INFO] |     +- org.typelevel:cats-kernel_2.12:jar:2.6.1:compile
[INFO] |     \- org.typelevel:simulacrum-scalafix-annotations_2.12:jar:0.5.4:compile
[INFO] +- io.circe:circe-parser_2.12:jar:0.15.0-M1:compile
[INFO] |  \- io.circe:circe-jawn_2.12:jar:0.15.0-M1:compile
[INFO] |     \- org.typelevel:jawn-parser_2.12:jar:1.2.0:compile
[INFO] +- io.circe:circe-generic_2.12:jar:0.15.0-M1:compile
[INFO] |  \- com.chuusai:shapeless_2.12:jar:2.3.7:compile

But cats-kernel isn’t a dependency of spark-core or spark-sql, which explains why the local tests all passed: on the test classpath the only cats-kernel present was the 2.6.1 version pulled in by Circe. The next step was to look at the jars that get added to the Spark classpath during spark-submit.

For this task we can use the environment variable SPARK_PRINT_LAUNCH_COMMAND=true, which prints the complete Spark command, including the classpath, to standard out when running spark-submit. The spark-submit command can be run from a local install of Spark or from a Spark/Glue container; I ran it from a container similar to the one described in the AWS Glue Developer Guide. This lets us find the directories of jars getting added to the Spark classpath.
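
For example, a run like the following prints the launch command (the main class name here is a placeholder; the jar is the uber-jar built by the project):

SPARK_PRINT_LAUNCH_COMMAND=true spark-submit --class my.package.data.CirceSparkJob target/circe-spark-1.0.0.jar

Listing the jars directory referenced in the printed classpath shows, in part: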

...
breeze_2.12-1.0.jar
breeze-macros_2.12-1.0.jar
cats-kernel_2.12-2.0.0-M4.jar
chill_2.12-0.9.5.jar
chill-java-0.9.5.jar
...

From this we can see that Spark 3.1 ships with cats-kernel 2.0.0-M4, while Circe depends on the newer cats-kernel 2.6.1, so we need to resolve the conflict.

Solution Options

userClassPathFirst

Spark has two configurations, spark.driver.userClassPathFirst and spark.executor.userClassPathFirst, that specify whether user-added jars should take precedence over Spark’s own jars. This is intended to mitigate dependency conflicts. The feature is marked experimental and only works in cluster mode, so we can’t test it locally. It is a good place to start, but in my scenario setting these configurations when creating the SparkContext and GlueContext did not resolve the conflict. Also, the --conf parameter is reserved for internal use by AWS Glue, and the AWS documentation directs us not to set it.
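
For reference, this is roughly how one would apply those settings when creating the contexts in a Glue Scala job (a sketch only; as noted above, it did not resolve the conflict in my case):

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}

// Ask Spark to prefer the user-supplied (uber-jar) classes over its own.
// Both settings are marked experimental in the Spark documentation.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")

val sparkContext = new SparkContext(conf)
val glueContext = new GlueContext(sparkContext)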

Dependency Exclusions

Maven and other build tools allow you to exclude transitive dependencies. Since cats-kernel is causing a problem, we can exclude it from Circe. Because cats-kernel and cats-core should stay in sync, we exclude both, and then add back versions that match what Spark provides.

<dependency>
    <groupId>io.circe</groupId>
    <artifactId>circe-core_2.12</artifactId>
    <version>0.15.0-M1</version>
    <exclusions>
        <!-- Need to exclude cats to avoid conflict with Spark 3.1
             spark-submit uses cats-kernel_2.12-2.0.0-M4.jar   -->
        <exclusion>
            <groupId>org.typelevel</groupId>
            <artifactId>cats-kernel_2.12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.typelevel</groupId>
            <artifactId>cats-core_2.12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Re-add cats-core since it should match the
     version of cats-kernel, but is not provided by spark-submit -->
<dependency>
    <groupId>org.typelevel</groupId>
    <artifactId>cats-core_2.12</artifactId>
    <version>2.0.0-M4</version>
</dependency>
<!-- Re-add cats-kernel but it is provided by spark-submit -->
<dependency>
    <groupId>org.typelevel</groupId>
    <artifactId>cats-kernel_2.12</artifactId>
    <version>2.0.0-M4</version>
    <scope>provided</scope>
</dependency>

So that definitely looks overly complicated, but it should work. Let’s give it a try. Wait, what? A new exception??

java.lang.NoSuchMethodError: 'shapeless.DefaultSymbolicLabelling shapeless.DefaultSymbolicLabelling$.instance(shapeless.HList)'

There must be another conflict, this time with shapeless. From the dependency tree above we see that Circe uses shapeless 2.3.7. Going back to the spark-submit classpath, we see Spark uses shapeless 2.3.3:

...
scala-reflect-2.12.10.jar
scala-xml_2.12-1.2.0.jar
shapeless_2.12-2.3.3.jar
shims-0.9.0.jar
slf4j-api-1.7.30.jar
...

We can add more to the POM for the shapeless conflict.

<dependency>
    <groupId>io.circe</groupId>
    <artifactId>circe-generic_2.12</artifactId>
    <version>0.15.0-M1</version>
    <exclusions>
        <!-- Need to exclude shapeless to avoid conflict with Spark 3.1
             spark-submit uses shapeless_2.12-2.3.3.jar -->
        <exclusion>
            <groupId>com.chuusai</groupId>
            <artifactId>shapeless_2.12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Re-add shapeless with version provided by spark-submit -->
<dependency>
    <groupId>com.chuusai</groupId>
    <artifactId>shapeless_2.12</artifactId>
    <version>2.3.3</version>
    <scope>provided</scope>
</dependency>

After all that, Spark and the app peacefully coexist. There is some risk here, though: swapping transitive dependency versions, even across minor versions, could result in unintended behavior.

Dependency Shading

Another approach is to shade, or rename, the conflicting dependencies. We are already using the Maven Shade Plugin to create the uber-jar, and we can also use it to relocate the problem classes. With shading, a private copy of the classes’ bytecode is created under a new package in the uber-jar, and the imports and references to those classes are rewritten in the affected bytecode to point at the relocated copies.

Before we try that, let’s look at the uber-jar built without the dependency exclusions, using the following command to list the contents of the jar:

jar -tf target/circe-spark-1.0.0.jar

...
cats/kernel/
cats/kernel/Band$$anon$1.class
cats/kernel/Band$.class
...
shapeless/
shapeless/$colon$colon$.class
shapeless/$colon$colon.class
...

Adding the relocations to the Maven Shade Plugin configuration in the pom.xml:

<relocations>
    <relocation>
        <pattern>cats.kernel</pattern>
        <shadedPattern>shaded.cats.kernel</shadedPattern>
    </relocation>
    <relocation>
        <pattern>shapeless</pattern>
        <shadedPattern>shaded.shapeless</shadedPattern>
    </relocation>
</relocations>

Rebuilding the uber-jar and listing its contents again shows the relocated classes:

...
shaded/cats/kernel/
shaded/cats/kernel/Band$$anon$1.class
shaded/cats/kernel/Band$.class
...
shaded/shapeless/
shaded/shapeless/$colon$colon$.class
shaded/shapeless/$colon$colon.class

This approach also allows Spark and the app to coexist peacefully, and it is simpler and safer than excluding the dependencies.

Wrap-Up

Dependency conflicts tend to come with the territory when developing a Scala-based Spark application, since Spark itself has so many dependencies. For my situation, I ultimately used the dependency shading approach to resolve the issue. It helps to know what tools and techniques are available so you can quickly identify and resolve these issues and get back to solving your core business problem.