L...A...T...E...N...C...Y

We can't change the speed of light.

Ron Keane
Lead Software Engineer
Note: Some content in this article could potentially be outdated due to the age of the submission. Please refer to newer articles and technical documentation to validate whether the content listed below is still current and/or best practice.

Transitional architecture can be a challenge for companies as they plan their cloud journey. Migrating to the cloud typically means a piecemeal approach to moving applications and services. Applications and services that once existed in the same data center are now hosted hundreds of miles from each other. That added distance results in increased network latency.

Network latency is the time it takes for a data packet to be transferred from its source to the destination. In contrast, response time is the total time for a client to receive a response from a server (round trip). Latency impacts response time. While user interfaces may be able to absorb some latency before the user gets frustrated, some services may be hyper-sensitive to latency even with sub-second response times. This can be especially true for critical services with high volumes of requests. Understanding the sensitivity of your service to latency is an important step in deciding when to migrate applications and services to the cloud.
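Before weighing a migration, it helps to measure what users actually observe today. Here is a minimal sketch in Python (standard library only; the endpoint is hypothetical) that times the full round trip a client sees:

```python
import time
import urllib.request

SERVICE_URL = "https://example.internal/api/quote"  # hypothetical endpoint

def average_response_time_ms(url: str, samples: int = 10) -> float:
    """Average client-observed round-trip response time, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()  # include the time to receive the full payload
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    print(f"average response time: {average_response_time_ms(SERVICE_URL):.1f} ms")
```

Network latency is only one component of what this measures; server processing time and payload size contribute as well, which is exactly why a latency-sensitive service needs a baseline before it moves.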

Latency Factors

Latency is impacted by the following factors:

  • Propagation delay
  • Routing and switching
  • Queuing and buffering

Propagation delay = distance / speed of light

Once, when I asked a network team about response times, the reply was: “There is nothing we can do about the speed of light.” I didn’t understand what that meant initially, but it is right there in the equation. The speed of light is constant. As the distance increases, the propagation delay increases. Period. When a service is moved from an on-premise data center to the cloud, distance is added for those users who remain on-premise. You WILL add latency to your service for your on-premise users by virtue of migrating a service to the cloud.
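A quick back-of-envelope calculation makes the point concrete. The numbers below assume light in fiber travels at roughly two-thirds of its vacuum speed (about 200,000 km/s) and that the path is a straight line, which real fiber routes never are:

```python
SPEED_OF_LIGHT_KM_S = 300_000  # vacuum
FIBER_SPEED_KM_S = 200_000     # rough rule of thumb for optical fiber

def propagation_delay_ms(distance_km: float, speed_km_s: float) -> float:
    """One-way propagation delay in milliseconds: distance / speed."""
    return distance_km / speed_km_s * 1000

distance_km = 1_600  # roughly 1,000 miles between data center and cloud region
print(f"one way (vacuum):   {propagation_delay_ms(distance_km, SPEED_OF_LIGHT_KM_S):.1f} ms")
print(f"one way (fiber):    {propagation_delay_ms(distance_km, FIBER_SPEED_KM_S):.1f} ms")
print(f"round trip (fiber): {2 * propagation_delay_ms(distance_km, FIBER_SPEED_KM_S):.1f} ms")
```

Even in the best case, a 1,000-mile separation costs several milliseconds each way before routers, switches, or the application itself add anything.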

Sending one 1,500-byte packet from a State Farm data center to the cloud generally takes about 35 milliseconds round trip. If a payload is greater than 1,500 bytes, then more packets need to be sent, which increases latency.
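Here is a rough sketch of how payload size translates into packet count, assuming about 1,460 bytes of application data fit in each 1,500-byte packet after IP and TCP headers. TCP keeps many packets in flight at once, so latency does not grow linearly with packet count, but larger payloads do add transfer time and more opportunities for loss and retransmission:

```python
import math

# Assumption: a 1,500-byte MTU leaves roughly 1,460 bytes of application
# payload per packet after IP and TCP headers.
PAYLOAD_PER_PACKET_BYTES = 1460

def packets_needed(payload_bytes: int) -> int:
    """Rough count of packets required to carry a payload of the given size."""
    return math.ceil(payload_bytes / PAYLOAD_PER_PACKET_BYTES)

for size in (1_000, 10_000, 100_000):
    print(f"{size:>7}-byte payload -> {packets_needed(size):>3} packets")
```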

Traffic between data centers and the cloud passes through a variety of network infrastructure. Routers and switches impact the time it takes for a packet to reach its destination.

Depending on the service requirements, queuing and buffering may help as services are migrated. Queuing implies asynchronous processing: users fire off an event and continue without waiting for a response. Buffering may help by holding copies of responses in a local buffer or cache, so users can check the local cache before calling a service on the cloud. Buffering can also benefit services on the cloud that depend on responses from on-premise data stores and resources. Recently, I participated in a proof of concept (POC) where the user was authenticated by calling the on-premise authentication service. This took between 700 milliseconds and 1.2 seconds. Caching the authentication response had a significant impact on subsequent response times.
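A minimal sketch of the caching idea from the POC is shown below. The authenticate() function, the cache, and the time-to-live are all illustrative stand-ins, not the actual POC implementation:

```python
import time

AUTH_CACHE_TTL_SECONDS = 300  # assumed time-to-live for a cached response
_auth_cache: dict[str, tuple[float, dict]] = {}

def authenticate(user_id: str) -> dict:
    """Placeholder for the slow on-premise authentication call (700 ms - 1.2 s in the POC)."""
    time.sleep(1.0)  # simulate the cross-platform round trip
    return {"user": user_id, "token": "example-token"}

def authenticate_cached(user_id: str) -> dict:
    """Check the local cache before calling the on-premise service."""
    now = time.monotonic()
    cached = _auth_cache.get(user_id)
    if cached and now - cached[0] < AUTH_CACHE_TTL_SECONDS:
        return cached[1]                    # fast path: no cross-platform trip
    result = authenticate(user_id)          # slow path: pay the latency once
    _auth_cache[user_id] = (now, result)
    return result
```

A production version would also need invalidation on credential changes and thread safety, but even this sketch shows why a cache hit avoids the cross-platform round trip entirely.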

Considerations

Consider the following latency factors when determining whether to migrate a service to the cloud:

  • What are the dependencies on on-premise resources such as data stores and other services?
  • How sensitive are users to latency?
  • What is the cost?

At first glance, the simplest migration roadmap would seemingly entail migrating leaf node services first. Leaf node services have few, if any, dependencies on on-premise resources. Migrating them first allows a service library to be built on the cloud. Services that have multiple dependencies back to on-premise resources may incur excess chattiness. Each interaction with on-premise resources has an additive latency effect unless processing is done in parallel. Once providers and users are on the same platform, latency becomes a minor concern again.
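To see why chattiness is additive, here is a hedged sketch using Python's asyncio with placeholder on-premise calls, each assumed to take about 35 milliseconds. Issued sequentially, three calls cost roughly 105 milliseconds of latency; issued in parallel, they cost roughly the duration of the slowest one:

```python
import asyncio

async def call_on_premise(name: str) -> str:
    """Placeholder for one cross-platform dependency (~35 ms round trip assumed)."""
    await asyncio.sleep(0.035)
    return f"{name}: ok"

async def sequential() -> list[str]:
    # Each call waits for the previous one: latency adds up (~105 ms total).
    return [await call_on_premise(n) for n in ("auth", "policy", "billing")]

async def parallel() -> list[str]:
    # Independent calls overlap: total latency is roughly the slowest call (~35 ms).
    return await asyncio.gather(*(call_on_premise(n) for n in ("auth", "policy", "billing")))

print(asyncio.run(sequential()))
print(asyncio.run(parallel()))
```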

The API migrated in our POC met the leaf node test. Although it had a dependency on an on-premise authentication service, caching the response reduced the impact of that call. For the on-premise version of the service, the average response time was 42 milliseconds. On the cloud, the average response time was around 250 milliseconds with authentication cached. Response times were between 500 milliseconds and 1.4 seconds without authentication cached. Response times dropped to around 90 to 100 milliseconds when tested with “keep-alive” enabled. This reduction is due to the elimination of the Secure Sockets Layer (SSL) handshake on each request. Latency affects the SSL handshake time as well.
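The keep-alive numbers come down to connection reuse: the TCP and TLS (SSL) handshakes are paid once per connection instead of once per request. A small sketch using the widely used requests library (the endpoint is hypothetical):

```python
import requests

URL = "https://example.cloud/api/resource"  # hypothetical endpoint

# Without keep-alive: each call opens a new connection, so every request
# repeats the TCP and TLS handshakes (several extra round trips each time).
for _ in range(3):
    requests.get(URL)

# With keep-alive: a Session reuses the underlying connection, so the
# handshakes are paid once and subsequent requests skip them.
with requests.Session() as session:
    for _ in range(3):
        session.get(URL)
```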

Can users absorb the response times of the least optimized requests? Knowing those response times helps when considering sensitivity to higher latency. If users will error out on the least optimized requests, this may not be a good service to migrate even if it is a leaf node service. One way to determine the latency threshold for existing consumers is through chaos testing. Chaos testing, which tests a system’s resiliency against a variety of factors, is becoming more prevalent in the industry. You can gradually increase the latency on the service while running a performance test on the consuming application until it starts erroring out. Services with many users and high volumes of requests require extra scrutiny. It is important to performance test with users prior to deploying to production.
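A sketch of that latency-injection loop might look like the following. The add_latency() and run_performance_test() functions are placeholders for your chaos tooling (for example, a fault-injection rule) and your load-test harness:

```python
import random

ERROR_RATE_THRESHOLD = 0.01  # assumed acceptance limit: 1% of requests erroring out

def add_latency(milliseconds: int) -> None:
    """Placeholder: configure the chaos tooling to inject this much latency."""
    print(f"injecting {milliseconds} ms of latency")

def run_performance_test() -> float:
    """Placeholder: drive the consumer's workload and return its error rate."""
    return random.random() * 0.02  # stand-in result for illustration only

def find_latency_threshold(step_ms: int = 25, max_ms: int = 1000) -> int:
    """Increase injected latency until the consuming application starts erroring out."""
    for injected_ms in range(0, max_ms + 1, step_ms):
        add_latency(injected_ms)
        if run_performance_test() > ERROR_RATE_THRESHOLD:
            return injected_ms  # the consumer's apparent latency tolerance
    return max_ms

print(f"latency threshold: {find_latency_threshold()} ms")
```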

Providers and users will be split across platforms during a transition to the cloud. If response times for off-platform users are unacceptable, dual maintenance for a service may be an option to meet user needs. This means having an on-premise version of the service and a cloud version of the service. Although this requires more work for the provider, it offers low-latency responses. However, for services that require a common data store across platforms, dual maintenance may not be an option. Users may experience slower response times during the transitional period.

Finally, providers should consider the cost associated with their design. One design may perform slightly better than an alternative but incur substantially more cost per month. In that case, the more cost-effective solution may be the better choice. By looking at the performance of each solution and estimating the associated costs, you will be in a better position to choose the best solution for your needs.

Conclusion

Latency creates a chicken-and-egg scenario as you determine which services to migrate to the cloud first. While there is a desire to be on the cloud with your users to ensure the fastest response times, users will experience slower response times due to higher latencies if you move to the cloud before them. Considering the impact of latency can help with the creation of an optimal migration roadmap. Awareness can also impact design decisions for a service.

Latency is one of the challenges of transitional architecture. To make progress, we must continuously learn and adapt to these challenges as we migrate our services from on-premise to the cloud.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.