
Cloud native offers a set of best practices and recommendations for building, maintaining, and deploying microservices in a way that allows you to take full advantage of the cloud computing paradigm. In this blog post I’ll cover what Cloud Native is, how it works, and why you might want to adopt it. Lastly, I’ll talk about the different vendor specific and open source cloud native platforms, and give some examples of what it’s like to design an application for these platforms.
Just a quick note before I get into it… There are so many players in this arena that this post could go on forever. So here I’m covering my favorites, or solutions that make sense to talk about at an overview level: AWS, Google Cloud, Kubernetes, and Hashicorp.
What is Cloud Native?
Cloud Native is a microservices-based software architecture pattern that decouples the business logic from the IT operations. It creates a clean contract between development and operations by providing a consistent solution for middleware problems like scalability, availability, security, metrics, logging, and tracing. These guidelines apply to any public cloud vendor, as well as private cloud solutions you can run in your own data center.
How Does Cloud Native Work?
Cloud Native works by combining a set of design principles: single responsibility, declarative configuration, exhaustive testing, comprehensive measurement, and heavy automation. With these efforts working together, organizations are able to achieve higher velocity and improved operational efficiency.
Single Responsibility
If there is one understanding you take away from this blog post, I hope it’s this one. To me, this is the most important component of an effective cloud native design.
The cloud computing paradigm is built on microservices. Look at any cloud migration case study and you’ll see a strong correlation between successful cloud migrations and microservices. You’ll also see a strong correlation between catastrophic cloud migrations and monoliths running on the cloud’s version of a fleet of virtual machines. Some people call these forklifts, as in: you’re just taking everything in your data center and loading it into the cloud provider via a forklift. I know people who are still waiting for the ROI from the latter type of migration.
Microservices should have a singular purpose and should reflect a domain driven design, meaning each microservice should be concerned with a specific aspect of the business logic, and should be the only thing concerned with that particular problem domain. This can get a bit complicated in large organizations, but if you break it down, it’s manageable.
Where it gets really gnarly is in the operations and middleware. While each microservice is doing its own thing with the business logic, you want them all to agree somewhat on the operations and tooling.
This is where cloud native design comes to the rescue. It provides a clean infrastructure API for developers to use so that they can focus 99% on business logic without having to worry about servers, retries, authn/z, monitoring, logs, or whatever other operations wheels they might be tempted to reinvent… you get the idea. The only code that should be in a cloud native microservice is a main bootstrap, business logic, and a few interfaces for transport and storage.
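To make that concrete, here’s a minimal sketch (in Go, with hypothetical names) of everything a cloud native microservice needs to contain. Note what’s missing: there’s no code for TLS, retries, authn/z, metrics, or logging, because the platform provides all of that.

```go
// A minimal sketch of a cloud native microservice in Go. All names
// here (priceOrder, /price) are hypothetical illustrations.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Business logic: the single problem domain this service owns.
func priceOrder(items int) int {
	return items * 499 // price in cents; placeholder pricing rule
}

func main() {
	// Transport interface: a thin HTTP layer over the business logic.
	http.HandleFunc("/price", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]int{"cents": priceOrder(3)})
	})

	// Bootstrap. Retries, TLS, authn/z, metrics, and logs are
	// deliberately absent: that's the platform's job.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```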
This infrastructure API further reinforces the single responsibility principle by allowing your operations team to focus solely on keeping the infrastructure running without having to worry about applications. It’s a win-win for both sides of the proverbial fence.
Declarative Configuration
In a cloud native system everything should be stored as a declarative resource.
In the previous paradigm you’d have a script that does things to get the system to the state you want. “Go install a server”. “Go install an application.” “Go load some data into a database.” If you wanted to install a second application, you’d go insert the application install steps into the appropriate place in the deployment script.
Declarative configurations look more like build specs. “Server has Nginx.” “Server has Application 1.” “Server has a database with some data loaded in it.” An orchestration tool is responsible for getting your system to match your declared state. It can look at your configuration, and look at the server, figure out the gaps, and take the necessary steps to get your system in sync with the spec.
Declarative configurations resemble code, and can actually be treated like code. They can be stored in a git repo, versioned, changes can be peer reviewed, commits can be rolled back, etc. This is why it’s often called “infrastructure as code.”
The true power of this will become clearer as I walk through an example.
Let’s say you deploy an application to three servers using an imperative script. Your script is written to deploy one instance of the app on Servers 1, 2, and 3. Later on, let’s say the app gets deleted from Server 2. So you re-run your script. Now you have two instances of the app on Server 1, one instance on Server 2, and two instances on Server 3. If you notice the problem (you did notice the problem, right?) you can go back and update your script to handle this condition, and clean up the mess on your servers. Hopefully nothing bad happens when you run two instances of your app on the same server. I mean, what could go wrong if I just re-run the script a bunch of times until the app starts working again?
By contrast, let’s say you had a tool that could read a declarative config and dynamically create the set of instructions needed to bridge the gap between the config and the current state of the system. Now you can re-run that tool as many times as you want, and you’ll always end up with the same result; this property is called idempotency. It’s also going to be much faster because it doesn’t have to redo work that’s already been done. It’s also much more testable. To be fair, you could still write a script that does the same thing. But the logic would be complex, and you’d have to own and maintain all of that complexity. If you have a tool that does it for you, you only need to be concerned with managing a document that describes the desired state of the system.
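As a concrete sketch, here’s roughly what the declared state for the example above could look like as a Kubernetes Deployment manifest (the app name and image are placeholders). If an instance disappears from one node, the controller notices the drift from the declared three replicas and schedules a replacement, and re-applying the file is always safe.

```yaml
# Declared state: "three replicas of my-app are running."
# The orchestrator computes the steps needed to make this true.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0 # placeholder image
```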
Configuration management tools like Puppet and Ansible pointed us in the right direction. Cloud native tools took this to the next level.
Exhaustive Testing
One common trap developers fall into is spending a bunch of time writing a challenging feature, then not having a way to test it. Sometimes figuring out how to test their code is so daunting, they skip critical portions of it. Writing the tests first puts guard rails on your code structure and ensures your code is always easily testable.
Sometimes when working on a feature it’s easy to end up in the weeds writing code you didn’t actually have to write. Test driven development helps you get features done faster because you’re just focused on getting the tests to pass, resulting in a simple, minimal code solution.
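As a sketch of what test-first looks like in practice, here’s a test for the hypothetical priceOrder function from the earlier microservice sketch. It would be written before the implementation, and it defines exactly when the feature is done.

```go
// main_test.go: written before the feature exists. The test pins down
// the contract; implementation work stops as soon as these cases pass.
package main

import "testing"

func TestPriceOrder(t *testing.T) {
	cases := []struct{ items, want int }{
		{0, 0},
		{1, 499},
		{3, 1497},
	}
	for _, c := range cases {
		if got := priceOrder(c.items); got != c.want {
			t.Errorf("priceOrder(%d) = %d, want %d", c.items, got, c.want)
		}
	}
}
```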
When you have confidence in your test coverage, it’s much less scary to automatically deploy to prod because bugs are highly unlikely and easily detectable.
Metrics and Monitoring
Cloud native systems are known for implementing vast and deep metrics collection and monitoring. It’s how you objectively measure the performance of your system.
What exactly should you be measuring? Anything you might want to improve. At a minimum you should measure what the book “Site Reliability Engineering” calls “The Four Golden Signals”: Latency, traffic, errors, and saturation.
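As a sketch of what that instrumentation can look like, here’s one way to cover two of the four signals (traffic and latency) in a Go service using the Prometheus client library. The metric names, and the simplification that every request succeeds, are mine.

```go
// A sketch of golden-signal instrumentation with the Prometheus Go
// client. Metric names are illustrative, not a standard.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Traffic (and errors, via the code label).
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Requests served, by path and status code.",
	}, []string{"path", "code"})

	// Latency.
	latency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request duration in seconds.",
		Buckets: prometheus.DefBuckets,
	})
)

// instrument wraps a handler with traffic and latency measurements.
func instrument(path string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		latency.Observe(time.Since(start).Seconds())
		requests.WithLabelValues(path, "200").Inc() // simplified: assumes success
	}
}

func init() {
	// Prometheus polls this endpoint to collect the measurements.
	http.Handle("/metrics", promhttp.Handler())
}
```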
Comprehensive monitoring gives you baselines for determining if a problem has occurred, which otherwise may have been difficult to spot, further decreasing the risk of each newly deployed code change.
Heavy Automation
Investing in declarative configuration systems, exhaustive testing, and comprehensive metrics collection and monitoring lays the groundwork for the last piece of the cloud native design.
In a cloud native system, the entire release process, including tests, linting, and deployment, should be automated. On the operations side, provisioning of infrastructure, managing capacity, and rolling back problematic changes are automated as well.
It’s easy to see how a manual software pipeline can cause a simple two hour task to drag on for days. Imagine manually building an artifact, copying it to a shared repo, sending a request to have someone deploy it to staging, waiting for them to get the request, waiting for them to complete the deployment, manual testing with questionable coverage, any back and forth troubleshooting of issues, and coordinating with multiple people who are all multitasking: them responding to you when you’re busy, and you responding to them when they’re busy. Let’s say you have to repeat the process three times to get all the issues resolved, and your total time from story to release is three days.
Now instead, let’s say the entire pipeline is automated. You spend two hours writing good test coverage for both unit and acceptance tests. You spend the same two hours coding the feature. Your artifacts are built and deployed automatically in seconds. Any needed infrastructure was already provisioned using a declarative orchestration tool. For simplicity we’ll say that in the manual pipeline example above, all three issues were on your end and not ops issues, so you’ll have to fix those too. But your automated unit tests take less than one minute to run, and your acceptance tests take five minutes to run. After all tests pass, you send a message to your teammates asking for a peer review. Since your change is small, the peer review is quick, and your code is tagged and deployed to prod. The total time from story to release is five hours.
Now imagine this efficiency over multiple developers across multiple teams. The results are culturally transformative.
What are the Benefits of Cloud Native?
I don’t want to be in the digital snake oil business. Before I started down this path I gathered up as many case studies as I could find on various cloud migrations. I started with around 30 and threw out a lot that were heavy on rhetoric and light on data, which left me with a final count of 11 case studies. While these case studies may be somewhat biased in that they’re not necessarily going to talk about any of the challenges, some patterns emerged in terms of benefits. Now that I’ve outlined the fundamental building blocks of cloud native, I’ll talk about the benefits we get from these design decisions and the numbers that back up these claims.
Continuous Releases
Every hour spent coding is an investment by the company. Since the ROI doesn’t start until the code is released, you want to release your code as quickly as possible. The combination of comprehensive test coverage, robust monitoring, and heavy automation is what makes this possible. Across the cloud native migration case studies I looked at, organizations reduced their deployment times by 92% on average. That’s not a typo. 92%.
Better Server Efficiency
If you’re running your server at anything less than 80% utilization, you’re effectively paying for unused capacity. Cloud native comes to the rescue again with tooling for intelligent placement of workloads, resulting in improved density and better saturation of compute resources. In other words, it allows you to run the same workload on a smaller hardware footprint. Using cloud native architecture, organizations increased saturation by 540% on average. While a flat 80% saturation may not be a realistically achievable goal, most of these organizations saw improvements along the lines of going from 8% saturation to 45%.
Better Operational Efficiency
Heavy use of automation frees your site reliability engineers from having to do repetitive and mundane work. This results in much higher productivity per engineer, allowing your operations to scale in output while maintaining a small team size.
More productivity from engineers combined with better use of compute hardware greatly reduces operations costs. If done correctly, it can lead to incredibly fast ROIs. Averaging out the case studies I looked at, organizations were able to reduce infrastructure costs by 55% by adopting cloud native architecture.
Scalable Infrastructure
Declarative automated configuration tools and infrastructure as code provide a foundation that enables operations teams to rapidly scale infrastructure. In this case study, Christopher Berner talks about how OpenAI was able to rapidly scale their ML projects using cloud native infrastructure and Kubernetes. In one specific example, an engineer was able to scale an ML experiment from a few GPUs to a few hundred in a matter of days. On their legacy architecture, this would have taken months.
Scalable Teams
If your microservices are already specialized partitions of business logic, these easily map to specialized development teams in a one-to-one or many-to-one relationship. This allows your organization to easily scale your development teams and maintain the optimal “two pizza” team size of around 6 to 10 developers.
It also reduces the cognitive load on your staff by allowing them to specialize in a few niche areas of the business. This results in less context switching, higher productivity, and shorter ramp-up times. A bit anecdotal, but I recently talked to an engineering manager who has achieved rapid onboarding with cloud native architecture, resulting in new hires being able to deploy to production on day one.
Vendor Specific Cloud Native Solutions
There are many provider and tooling options for building a cloud native system, ranging from proprietary vendor-specific solutions to vendor-neutral open source alternatives. In the following illustrations I’ll start with the front end and work my way back to the persistence layers.
AWS
No overview of cloud native can be taken seriously without mentioning the company that started it all. AWS has the oldest, most mature, and most comprehensive set of cloud infrastructure services. Here I’ll drill down into what a typical cloud native application might look like on Amazon.
Front End
For an AWS native application, you can build your front end using HTML, CSS, and Javascript. A popular option is React, the client-side Javascript framework. Your assets will be stored on Amazon S3, with the static website hosting option turned on. To improve performance, you can cache your front end assets on CloudFront, their CDN solution. For the average web application of moderate usage, the cost of this is trivial.
For mobile, React Native, or any other mobile application will work in place of a web app. Cloud native is really about the backend microservices, so the main takeaway here is that you don’t want stateful server-side rendered pages.
Back End
Your front end Javascript actions will make HTTP requests back to Amazon API Gateway, and API Gateway will forward those requests on to AWS Lambda functions. Lambda functions are basically small functions of business logic that are uploaded to AWS. They’re written to comply with an interface that AWS specifies, and AWS takes the code and deploys it as a callable function on their system. This is often referred to as “serverless” or “functions as a service”. There are no servers for you to manage. Your code just sits up there, answering requests. These are stateless, event-driven microservices. Pricing is in the pennies per month for about one million requests. For lightly used applications, it’s practically free hosting.
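Here’s a sketch of what that interface looks like in Go, assuming the official aws-lambda-go library and an API Gateway proxy integration:

```go
// A sketch of the Lambda contract in Go, assuming aws-lambda-go and
// an API Gateway proxy integration.
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// handler holds the business logic; AWS owns the server lifecycle.
func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	return events.APIGatewayProxyResponse{
		StatusCode: 200,
		Body:       `{"path":"` + req.Path + `"}`,
	}, nil
}

func main() {
	// lambda.Start registers the handler with the Lambda runtime.
	lambda.Start(handler)
}
```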
Databases
For simple database functionality, the Lambda functions can store data in a DynamoDB table. With no servers to manage, DynamoDB lets you store hierarchical style objects, like JSON documents with primary and secondary keys. It’s a NoSQL database, so you won’t be able to do the same types of complex joins with full ACID compliance as you can in relational databases. But for simple data structures, it’s serverless, cheap, and fast.
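As a sketch, here’s how the hypothetical handler above might persist a document to DynamoDB, assuming the v1 aws-sdk-go. The table name and attributes are placeholders.

```go
// store.go, alongside the handler above: a sketch of writing a
// document to DynamoDB with the v1 aws-sdk-go.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

var db = dynamodb.New(session.Must(session.NewSession()))

// saveOrder persists one item; provisioning, replication, and
// scaling of the table are all someone else's problem.
func saveOrder(id, payload string) error {
	_, err := db.PutItem(&dynamodb.PutItemInput{
		TableName: aws.String("orders"), // placeholder table
		Item: map[string]*dynamodb.AttributeValue{
			"id":      {S: aws.String(id)},      // partition key
			"payload": {S: aws.String(payload)}, // document body
		},
	})
	return err
}
```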
If you do need a more traditional ACID RDBMS, you can provision Amazon RDS instances. These are managed SQL servers that run on your choice of relational database engine, including MySQL, Microsoft SQL Server, and Amazon’s own Aurora. These are more expensive because you’re paying for the underlying EC2 instance (virtual machine), plus an hourly management fee. But the system is fully managed, so you don’t have to worry about OS and database server maintenance. And while RDS doesn’t scale horizontally, a cloud native microservices architecture effectively shards your data across business domains by design, and RDS offers the vertical and read-replica scaling you’d expect from most relational database servers.
Storage
If you just need to store blobs of data, you can once again take advantage of Amazon S3, which stores files at URL-based locations and gives you a variety of options for configuring how much durability and availability you want for your various storage buckets. The more redundancy and performance you need for your storage, the more it costs. It scales really well as long as the pricing makes sense for your business. For text files and web images, the cost is fairly trivial. For multimedia or similar large format data types, you’ll want to get familiar with the different pricing options and plan carefully.
DevOps
AWS has a service called CodeCommit, a hosted git repository. For CI/CD, they offer CodeBuild, CodePipeline, and CodeDeploy. CloudFormation can assist with infrastructure-as-code resource provisioning.
Lambda functions can be versioned, and you can split traffic between versions for canary deployments. You can use Amazon CloudWatch for metrics and logs, and AWS X-Ray for tracing to help ensure problem-free deployments.
Machine Learning
If you need a machine learning workflow for your app, AWS has a managed service for building ML workflows called SageMaker. It helps you prepare your data, build an architecture, train and tune a model, and deploy the model for inference. Similar to RDS, you do have to pay for the underlying EC2 instances, plus an additional hourly fee for the management and the API. But you don’t have to actually manage the servers yourself. For teams that need sporadic GPU workloads, this can be a real time and money saver. If you have a workload that can keep a GPU cluster busy 24/7, you might want to compare the cost to building your own solution using one of the open source options I’ll talk about below.
GCP
Google Cloud Platform is brought to you by one of the most respected engineering companies in the world. It’s my personal favorite for a few reasons.
First, their data centers run on 100% renewable energy; something that’s very important to me on a personal level.
Second, their Cloud Spanner database is an extremely impressive engineering accomplishment. It’s an ACID compliant relational database that uses GPS and atomic clocks to achieve external consistency, allowing it to grow to practically unlimited size and replicate across any geographic region.
Third, they have some really powerful tools for Big Data that provide an awesome developer experience and integrate well with their other products, including things outside of Google Cloud like Google Ads.
Some of what follows will be a bit repetitive with what I wrote above about AWS. I didn’t want to assume that everyone read the AWS section, so I’m covering all of the same points again. Let’s look at how a typical cloud native application runs on GCP.
Front End
You can build your front end as static HTML, CSS, and Javascript. For robust front end tooling, consider the React client-side Javascript framework. You’ll store your static assets on Google Cloud Storage, provision a load balancer, and point it at the bucket that contains your assets. To reduce latency for your end users, consider caching your assets on Google Cloud CDN.
Back End
Your client-side Javascript code will add additional functionality by calling a backend API terminated on Google’s HTTP routing service: API Gateway. From here you have a few options for running the backend code.
Google Cloud Functions is a functions-as-a-service solution. You write your code according to the GCF spec, and they take your code and run it on their infrastructure with exposure through an HTTP endpoint. This is the typical back end for an event driven serverless cloud native app. Billing is measured down to the nearest 100 milliseconds and for lightly used applications, it’s nearly free.
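The GCF spec for Go is pleasantly small: an exported function with the standard net/http handler signature, in a non-main package. A sketch, with placeholder names:

```go
// A sketch of a Go Cloud Function. Package and function names are
// placeholders; GCP maps an HTTP endpoint onto the exported function.
package hello

import (
	"fmt"
	"net/http"
)

// HelloHTTP is the deployed entry point, e.g.:
//   gcloud functions deploy hello --entry-point HelloHTTP --runtime go116 --trigger-http
func HelloHTTP(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "Hello, %s", r.URL.Query().Get("name"))
}
```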
Google App Engine is similar to Cloud Functions but allows you to run a whole application instead of just one function. It provides a managed runtime environment for a subset of supported programming languages, including some older server-side languages that aren’t supported on Cloud Functions. It also adds some functionality around versioning and traffic splitting between versions. It’s an alternative way of building a similar stateless, serverless cloud native back end.
Google Cloud Run is a service that accepts a container you provide and runs it on a managed container runtime. It’s similar to Cloud Functions, with the same granular billing, but it gives developers more control over the runtime environment. You can package up binaries and dependencies and define the runtime environment for the app, as long as it’s listening on an HTTP interface. Alternatively, you can configure your Cloud Run app with Google Kubernetes Engine backing for some additional functionality provided by Kubernetes, which I’ll talk about further down.
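The Cloud Run contract is even looser: any container that serves HTTP on the port named in the PORT environment variable will run. A minimal sketch in Go:

```go
// A minimal Cloud Run service: listen on $PORT, serve HTTP.
package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello from a container"))
	})
	port := os.Getenv("PORT") // injected by Cloud Run, typically 8080
	if port == "" {
		port = "8080" // fallback for local runs
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```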
Databases
The cloud ecosystems move really fast, and sometimes it’s hard to keep up. Now in 2021, Google’s solution for a simple serverless NoSQL database is called Google Firestore. It’s cheap, fast, and highly scalable. It provides strong global consistency for key/value data or JSON documents with some options for tuning based on primary and secondary keys.
If you need a more traditional ACID compliant relational database, you’ll want to start by looking at Google Cloud SQL. This is your typical managed SQL database server which can be provisioned for MySQL, PostgreSQL, or Microsoft SQL Server.
If you’re worried about the constraints of running on a traditional SQL database engine, Google also has a fairly unique product called Cloud Spanner, a large-scale geo-replicated RDBMS engine. Coming in at a premium price point, it’s able to solve scalability issues that have plagued other SQL implementations for decades.
Storage
As you probably guessed by the no-nonsense naming, Google Cloud Storage is GCP’s object storage solution. It can store any file type, and every stored object can be accessed via an HTTP URL. There are different pricing options depending on what level of durability and availability are needed, as well as how frequently the data will be accessed. For typical web data like website content, videos, or mobile app data, you’d probably want to choose Standard storage at around $0.02 per GB per month. For large amounts of less-frequently accessed data, like long tail video content, you could use Nearline storage for $0.01 per GB per month, with additional egress costs when the data is accessed. See Google for up to date pricing.
DevOps
Google Cloud Source Repositories is their offering for hosting git repos. Cloud Build offers a GitOps style CI/CD solution that covers both application deployment as well as infrastructure-as-code.
Google Cloud has some really strong operations and SRE tools in the Operations Suite (formerly Stackdriver). Cloud Logging and Cloud Monitoring do exactly what you would think. They have Cloud Trace for distributed tracing and Cloud Profiler for application performance monitoring (APM) and analysis.
Machine Learning
If you’re planning to integrate machine learning into your Google hosted cloud native app, a good place to start is probably Google Vertex AI. It’s an ML platform that provides the tooling around preparing data, building ML architectures, training models, and deploying models for inference. Because training a model can require datasets with hundreds of thousands of rows, big data tools can be really useful for building those datasets. Vertex AI integrates really easily with Google BigQuery to give data scientists a lot of capabilities in this regard.
Vendor Agnostic Cloud Solutions
The turn-key serverless infrastructure APIs I talked about above are a great way to get started with cloud native, but you may reach a point where you start to outgrow them.
Private Cloud Native
Vendor specific cloud native designs can be extremely advantageous for small applications and startups because of the granular billing, paying down to 100ms increments of CPU time. The cost savings you get from cloud computing comes from only paying for what you use. But as soon as you have enough workload to keep a few clusters busy, that cost savings falls off sharply.
For example, with Google Cloud Functions, CPU time costs $0.0000029 per 100ms for a 2.4GHz processor with 2GB of RAM. That’s about the same compute power as a Raspberry Pi, and your bill to run that continuously on Google Cloud Functions for 5 years would be roughly $4500. A Raspberry Pi costs around $75 with a case. Also, for comparison, you can get a Lenovo ThinkSystem server with four 3.2GHz cores, 8GB of RAM, and a top-tier 5 year warranty for $2,110. Let’s say you only plan to saturate the server at 40%. That’s still a 350% markup on the compute with Cloud Functions, and other vendors are going to be in the same ballpark.
Now there are all kinds of other costs involved with running a server. Data center real estate, rack hardware, power, internet, firewalls, load balancers, switches, installation, and maintenance. So even if you can keep one of these low end servers busy, it’s still probably better to run your app in the cloud. After all, $4500 to run your backend in one of the world’s best data centers for 5 years is trivial for a for-profit organization.
But that markup will only scale to a certain point. Each organization is unique, but as a general rule of thumb you probably want to consider a private cloud once you can saturate two clusters. And by cluster, I mean a single rack of hyper-converged nodes (storage and compute). And by saturate, I mean 40% because you want some room to move workloads around in a failure scenario.
How to Host a Private Cloud for Cloud Native Apps
Above I covered how cloud native works, so you already know what systems you need for an effective cloud native design. So now it’s just a matter of choosing the right tools.
First, you should consider that you’re building microservices. So whether you’re doing green field or a refactor, you want to choose a set of languages and tools that have good support for microservices stuff like HTTP, gRPC, service discovery, good stability for long running processes, good packaging, and testability. If this is customer-facing software, I’d also look at performance considerations like non-blocking patterns and concurrency.
You’re also going to need an application orchestration tool. Kubernetes has emerged as the de-facto standard platform, and similar to Linux, it comes in a variety of flavors. Hashicorp also has a great alternative, often referred to as the “Hashistack”. These tools provide that cloud native separation between development and operations, and are the foundation of your private cloud deployment. Hashicorp provides a simple, opinionated, turn-key solution for basic microservice orchestration. Kubernetes is like the open source cloud operating system at the heart of a vast and deep landscape of specialized cloud building blocks that you can use to build a platform tailored to your organization. Kubernetes offers more power at the cost of more complexity.
Front End
Same as above, your front end is HTML, CSS, and Javascript. For front end hosting, use the Caddy web server for easy configuration and some really useful features like automatic TLS through the ACME protocol, or choose Nginx for raw performance. Both deploy easily onto whatever private cloud solution you choose.
The one thing you probably won’t want to build yourself is a CDN, just because of the logistics. The good news is that you can still use Google Cloud CDN or Amazon CloudFront pointed at your private cloud origin. There are also a ton of other options, including Cloudflare, Akamai, and Fastly, so you can pick the best-of-breed solution for your use case.
Back End
Here you’ll just package up your application as a container and deploy it. With the cloud native platforms I mentioned earlier, containers are a first class citizen. A lot of people just call these “container orchestration tools” because, in an overly-simplified sense, that’s pretty much what they do. So each backend microservice will listen on a transport, you’ll package it as a container, and the underlying orchestration will handle the rest. Kubernetes has options for running Functions-as-a-Service to provide a similar experience to Google Cloud Functions or AWS Lambda if that’s what your developers prefer.
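Packaging is usually a short, boring Dockerfile, which is a feature. Here’s a sketch for a Go service using a multi-stage build; the base images and paths are illustrative.

```dockerfile
# Build stage: compile a static binary.
FROM golang:1.16 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Final stage: just the binary, no OS packages to patch.
FROM gcr.io/distroless/static
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```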
Databases
Traditional MySQL server deployments will be a bottleneck for your cloud native app. Instead, you can choose any database that has horizontal scalability built into the database server. CockroachDB is a horizontally scalable, ACID compliant RDBMS based on the PostgreSQL standard. Cassandra is a popular cloud native NoSQL database. Elasticsearch and InfluxDB are horizontally scalable stores for logs and metrics respectively. Kubernetes adds some additional options through community-provided operators that can manage the sharding and replication of databases that don’t horizontally scale out of the box. For example, Vitess is a modified distribution of MySQL that scales horizontally via a Kubernetes operator. The company PingCAP has open sourced TiDB and TiKV, which are Kubernetes cloud native RDBMS and key/value databases respectively.
Once you deploy a database service in your cloud orchestrator, you’ll get some kind of service discovery endpoint to which you can connect from your backend microservices. The orchestrator works with the cloud native database to handle scaling and replication, so your developers don’t need to be concerned about any operations issues inside your app. Remember, we only want business logic running in your services!
Storage
Persistence for stateful applications adds some complexity to privately hosted cloud native apps. For general use cases, I’d recommend using a hyper-converged infrastructure where nodes are dual purpose for both storage and compute. Some people prefer to split their clusters between stateful and stateless, in which case you could use strictly compute nodes for the stateless cluster. The most popular cloud native storage solution is probably Ceph, and it has options that provide an Amazon S3 compatible object storage API. You can also expose the volumes to your orchestration tool as persistent storage resources. Kubernetes has a storage operator called Rook that can help you manage your Ceph topology.
DevOps
I’m sure you already know what GitHub is (it’s a hosted git repo), but I’ll mention it here to be thorough. For CI/CD, GitHub Actions, CircleCI, and TravisCI are all great. GitLab has both hosted and on-prem options that combine a git repository with integrated CI/CD. Argo CD is a really powerful GitOps solution for Kubernetes.
Prometheus is a really popular polling-based metrics and monitoring tool. InfluxDB is a great alternative if you prefer agent-based monitoring. OpenTelemetry allows you to easily add vendor neutral APM profiling and tracing to your microservices, and Jaeger allows you to visualize that data to locate errors and performance issues in your application. There are several building blocks for handling logs, popular ones being Fluentd, Elasticsearch, and Graylog. Metrics and monitoring can feel a bit like building your own solution from Lego-like components when you run your own clusters. The good news is that it’s abstracted in the cloud native design, which makes it easy to standardize and implement once you get up and running.
Machine Learning
ML workloads are an aspect that can really tip the scales when deciding between public or private cloud. GPU time in the cloud is very expensive, but so are underutilized GPU clusters. Because ML training workloads are usually batch processes, you want to think of them as 100% saturation workloads with no need for failover capacity. If you think you can keep a cluster busy with training for a few years, it’s probably worth it to build one. If you expect to have infrequent spikes of large scale GPU training, public cloud is probably the way to go.
If you do decide to build a private cloud for cloud native ML workloads, the most mature solution for your platform in 2021 is going to be Kubeflow. Kubeflow is a distribution of Kubernetes built for handling machine learning. It gives you a framework for data preprocessing, a Jupyter notebook interface, hooks into various ML libraries, a powerful TensorFlow operator, and tooling for managing and serving models over a serverless framework.
Hybrid Cloud and Multi-Cloud
Building on what you know about cloud native applications hosted both publicly and privately, I’ll talk about how to extend this design into multi-cloud or hybrid cloud topologies along with some common reasons for doing this.
I covered hosting your own on-prem and colo clusters running Kubernetes, but the major cloud and web hosting providers will also offer you managed Kubernetes clusters. These are awesome for solving both hybrid cloud and multi-cloud.
Hashicorp has a fairly new cloud service where they manage their tools for you, although it doesn’t seem to support their container scheduler or on-prem clusters. But if you’re managing an on-prem Hashistack, it shouldn’t be a problem to manage another one on public cloud infrastructure.
So why would you want to do this?
Burst Workloads
Above I mentioned how a baseline saturation can be used to justify building a few privately hosted clusters, but what if your workloads are unpredictable? In this case you can extend your private cloud into the public cloud to create a hybrid cloud. The private cloud will save you money on your baseline workload, while you leverage the cost savings of granular pay-for-what-you-use billing offered by public cloud providers.
This is done by mimicking your private cloud platform on the public cloud and treating it as just another private cloud cluster. So if you’re running Hashistack in a private cloud, you’ll build a Hashistack cluster in the public cloud for burst workloads. You’ll want to provision virtual machines from your public cloud provider using their autoscaling features. Ideally the solution you choose will give you the option to scale to zero so that you don’t need to pay for idle nodes when you’re not bursting requests.
Also, remember when I mentioned running your clusters at 40% saturation to keep reserve capacity for failover? You probably see where this is going. If you have a burstable or failover cluster running in the public cloud, you can get away with running your on prem clusters closer to 80% saturation, further reducing your operating costs.
Nordstrom uses this exact model which they talk about in a Kubernetes case study: “With cluster federation, we can have our on-premise as the primary cluster and the cloud as a secondary burstable cluster[…] So, when there is an anniversary sale or Black Friday sale, and we need more containers - we can go to the cloud.”
Increased Fault Tolerance
The main reason I see organizations doing multi-cloud is to hedge against provider-wide outages. It’s rare, but when a major cloud vendor experiences an outage, it’s front page news. That is, if you can get to the news site. Usually when this happens, huge chunks of the internet are down.
By hosting your own cloud platform, you can streamline the tooling across multiple cloud providers, making it much easier to host your application in multiple clouds. If you’re running Kubernetes as your cloud orchestrator, you may be able to take advantage of managed Kubernetes-as-a-Service solutions available from most cloud providers, saving you from the overhead of managing virtual machines.
It also levels the playing field among cloud providers. When building vendor specific cloud native apps, the maturity of the vendor’s API can be a huge differentiating factor when choosing who you’re going to build on. But if you’re running Kubernetes across them, you can compare apples-to-apples, turning them into commodities that can just be shopped on price. With Kubernetes or Hashicorp, any hosting provider is a cloud vendor, including DigitalOcean and Linode who are very competitive on price and customer service.
Of course microservices make multi-cloud feasible, even if you want to use the vendor specific cloud solutions. Your business logic should be provided by a package, and the cloud API is just an interface to that. So you’ll just have to manage additional interfaces for your application. For example, you can have your “IsAHotdog” microservice, and you’ll have separate interfaces for Google Cloud Functions and AWS Lambda. This strategy is probably more suited for smaller applications leveraging functions-as-a-service billing, where you’d only need to write multiple interfaces for a handful of microservices.
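Here’s a sketch of that layout in Go, with hypothetical names and module paths: the hotdog package owns the business logic, and each vendor gets a thin adapter file.

```go
// hotdog/hotdog.go: vendor-neutral business logic.
package hotdog

// IsAHotdog is the entire problem domain of this microservice.
func IsAHotdog(label string) bool { return label == "hotdog" }
```

```go
// lambda/main.go: AWS Lambda adapter (assumes aws-lambda-go).
package main

import (
	"context"
	"strconv"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"

	"example.com/hotdog" // placeholder module path
)

func main() {
	lambda.Start(func(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
		ok := hotdog.IsAHotdog(req.QueryStringParameters["label"])
		return events.APIGatewayProxyResponse{StatusCode: 200, Body: strconv.FormatBool(ok)}, nil
	})
}
```

```go
// gcf/fn.go: Google Cloud Functions adapter.
package fn

import (
	"fmt"
	"net/http"

	"example.com/hotdog" // placeholder module path
)

func IsAHotdogHTTP(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, hotdog.IsAHotdog(r.URL.Query().Get("label")))
}
```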
How We Help Businesses Build Cloud Native Applications
Hopefully this gives you a good intuition about cloud native to get you started in the right direction. Many of our customers are enterprises that are dealing with end-of-life hardware in a colo, and want to avoid the pitfalls of a poorly executed cloud migration.
Cloud native architecture often means a huge ramp-up learning the right technical skills. Between your existing responsibilities and deciding among the endless options for cloud solutions, it’s easy for projects to stall out in analysis paralysis.
BitLightning is here to help. We offer architecture, proofs of concept, and consulting for startups and enterprises who want to make the most of the cloud. Let us show you how much easier it can be with a partner who’s done it before.
Click the link below to schedule a risk-free consultation with a principal engineer. We can start small and grow with you as you begin to see a return on your cloud native investment.
Start your cloud native transformation today!
Book a free consultation and learn how to increase velocity and scale your software while eliminating technical debt.
If you enjoyed this article, you can sign up for my newsletter to be notified about new posts I write on cloud native, DevOps, and machine learning.