What’s AWS Well-Architected Framework? | by Jakub Kapuscik | Jul, 2022

How to balance your software architecture?

Photo by Christophe Hautier on Unsplash

Software architecture is the set of significant decisions that have been made about a system. Designing good, long-lived software that you would be happy to work with is a complex and hard task. Very, very hard. We have to make tough trade-offs while things around us are changing and breaking. How can we make sure our system is well thought out and well designed? Are there any guidelines that could help? This is where the AWS Well-Architected Framework comes into play. Let’s briefly explore some key takeaways from this framework.

Disclaimer! The AWS Well-Architected Framework focuses on the AWS cloud (no surprise here). Nevertheless, I believe its way of thinking and making decisions can be very helpful outside AWS, and even outside the cloud.

The AWS Well-Architected Framework grew out of years of experience of AWS engineers and is applied by numerous companies worldwide. It is always better to learn from someone else’s mistakes and experiences than from our own.

There are six interconnected pillars of the AWS Well-Architected Framework: operational excellence, security, cost optimization, reliability, performance efficiency, and sustainability. It is crucial to be aware of all of them to make informed decisions. Each pillar describes an important aspect of a system. Depending on the functional requirements and drivers of each project, we need to make educated trade-offs between them.

There are some general principles that should be applied:

  • use only the resources you really need. There are multiple tools that can scale your capacity up and down depending on current needs. There is no need to keep idle resources. Pay only for what you need.
  • make data-driven decisions. You can and should gather data that will allow you to evaluate your architecture’s performance and help improve it.
  • automate everything. Automation makes it easier to adapt and test your architecture.
  • let your architecture evolve. A software system is never finished, only released. We should always adapt and iteratively improve in response to changing requirements and demands.
  • practice what you designed. We must know how the processes we designed work in real environments, especially if they were designed as a response to incidents.

Operational excellence starts with people. Great software alone does not make a system achieve its business goals; that depends primarily on the people who work with it. It is crucial that teams in the organization understand their roles and the value they bring to customers. They need a shared goal and a clear picture of their responsibilities and areas of ownership. Clarity about what we are doing and why is crucial at all levels.

We have to be aware of the risks that can occur and be prepared to respond to them. Not only in theory: we also need to make sure that our procedures really work in practice. We have to know who and what is responsible for each process. It is much better to practice in a test environment than to do it for the first time in the middle of the night when production is burning. Planning for failure will make our system much more reliable.

We have to know what the normal and abnormal states of our system look like. That cannot be done without a lot of telemetry data. Without it, we could very easily be surprised by the behavior of our system when it is already too late. Making data-driven decisions is essential for a successful organization.
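
To make "normal" and "abnormal" a bit more concrete, here is a minimal sketch that compares a new measurement against a baseline collected during normal operation. The function name, the latency numbers, and the three-standard-deviations rule are all assumptions made for illustration; a real system would more likely rely on its monitoring tool's own anomaly detection.

```python
from statistics import mean, stdev


def is_abnormal(baseline, value, threshold=3.0):
    """Return True if `value` is more than `threshold` standard
    deviations away from the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation counts as abnormal.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical request latencies (ms) gathered while the system was healthy.
normal_latencies_ms = [102, 98, 105, 99, 101, 97, 103, 100]

print(is_abnormal(normal_latencies_ms, 104))  # close to the usual range
print(is_abnormal(normal_latencies_ms, 250))  # far outside the usual range
```

The point is not the statistics but the prerequisite: without the baseline data, there is nothing to compare against.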

Each deployment of software is a potential risk of failure. The more changes at a time, the bigger the risk. The updates should be frequent, small, and reversible. We should always learn from our mistakes and evolve our technical and organizational processes to minimize such cases in the future.

Sample question: How do you determine what your priorities are?

Security is one of the things that are easy to overlook until it is too late. The lack of a proper security model has brought down a number of companies. Fortunately, there are some practices that can help us avoid such a fate.

AWS takes some of the security burden off our shoulders. The AWS Shared Responsibility Model divides responsibility for security between the customer and the cloud provider. In summary, AWS makes sure that the infrastructure behind its services is secure, but we need to take care of the rest.

No one should have access to things they do not need. This is the basic idea of the principle of least privilege, which is the foundation of any secure system. When privileges are granted too loosely, it can very easily backfire. By default, access to any part of the system should be denied and explicitly allowed only when a given role needs it. Identities should rely on a single, centralized provider. This makes them easier to manage and review.
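
The deny-by-default idea can be sketched in a few lines. This is a toy model written for this article, not how AWS IAM evaluates policies; the `Role` class, the role name, and the action and resource strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    name: str
    # Explicitly allowed (action, resource) pairs; everything else is denied.
    allowed: frozenset = field(default_factory=frozenset)


def is_allowed(role, action, resource):
    """Deny by default: access is granted only if it was explicitly allowed."""
    return (action, resource) in role.allowed


# Hypothetical role that may only read reports.
report_reader = Role(
    name="report-reader",
    allowed=frozenset({("read", "reports")}),
)

print(is_allowed(report_reader, "read", "reports"))    # explicitly allowed
print(is_allowed(report_reader, "write", "reports"))   # never granted
print(is_allowed(report_reader, "read", "payments"))   # never granted
```

Note that there is no "deny" list at all. Anything that was not granted simply does not work, so forgetting to add a rule fails safe.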

Data should be encrypted both at rest and in transit. AWS provides numerous tools to make it relatively simple. In the case of many services, it is as easy as selecting a single checkbox. The level of security precautions should depend on the classification of data we store. Of course, cat images will be treated differently than credit card data.

Logs should be treated as first-class citizens. Without a proper level of traceability, we will not be able to detect abnormal behavior or security breaches in our system. We should also have processes prepared that limit the blast radius of potential security issues. Incident response should be practiced regularly to keep it reliable and up to date.

Sample question: How do you manage identities for people and machines?

AWS Shared Responsibility Model. Source: Security Pillar, AWS Well-Architected Framework, 2022

Cost optimization is all about getting the best bang for the buck. The only way to achieve it is to understand what we are paying for and why. This pillar is not about minimizing spending at all costs but about getting the best price for the tools required to meet our functional requirements.

Pricing for AWS services (and for other cloud providers too) can be quite complex, with multiple pricing configurations for a single service. We are also charged for different aspects of our systems’ workload. We have to fully understand how we will be charged for a given service and choose the best pricing option for our usage patterns.
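
As a small worked example of matching a pricing option to a usage pattern, the sketch below compares a pay-per-hour option with a flat monthly commitment. The prices are made up and expressed in cents to keep the arithmetic exact; real AWS pricing has more dimensions than this.

```python
def break_even_hours(hourly_rate_cents, committed_monthly_fee_cents):
    """Hours per month above which the flat commitment becomes cheaper."""
    return committed_monthly_fee_cents / hourly_rate_cents


def cheaper_option(hours_used, hourly_rate_cents, committed_monthly_fee_cents):
    """Pick the cheaper of the two hypothetical pricing options."""
    on_demand_cents = hours_used * hourly_rate_cents
    if on_demand_cents <= committed_monthly_fee_cents:
        return "on-demand"
    return "commitment"


# Hypothetical prices: 10 cents per hour on demand, or 5000 cents per month flat.
HOURLY_RATE_CENTS = 10
MONTHLY_FEE_CENTS = 5000

print(break_even_hours(HOURLY_RATE_CENTS, MONTHLY_FEE_CENTS))
print(cheaper_option(200, HOURLY_RATE_CENTS, MONTHLY_FEE_CENTS))  # part-time workload
print(cheaper_option(730, HOURLY_RATE_CENTS, MONTHLY_FEE_CENTS))  # runs all month
```

With these numbers, a workload that runs only a few hours a day is better off paying per hour, while one that runs around the clock is better off with the commitment.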

There are a lot of horror stories about enormous AWS bills that shocked unaware users at the end of the month. If we are not monitoring our costs, it is very easy to get surprised, especially if we do not fully understand the pricing models of the services we use. Each cloud provider offers a variety of tools to monitor costs. Budgeting and setting up notifications can save us a lot of trouble.
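
Cloud providers ship their own budgeting tools, so we would not normally write this ourselves, but the underlying logic is simple enough to sketch. The function names, thresholds, and amounts below are assumptions chosen for illustration.

```python
def budget_alerts(spend_to_date, monthly_budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that current spending has already crossed."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = spend_to_date / monthly_budget
    return [t for t in thresholds if used >= t]


def projected_month_end_spend(spend_to_date, day_of_month, days_in_month):
    """Naive linear projection: assume the daily spending rate stays the same."""
    return spend_to_date / day_of_month * days_in_month


# Hypothetical situation: 400 spent by day 10 of a 30-day month, budget of 1000.
print(budget_alerts(400, 1000))
print(projected_month_end_spend(400, 10, 30))
```

In this made-up case no threshold has been crossed yet, but the projection already exceeds the budget, which is exactly the kind of early warning a notification should give us.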

Waste is a crime against humanity. We should never allow ourselves to pay for things we do not need, and we should actively track such cases. The teams working on a system should always be aware of the costs of each component they are using. They should also actively work to optimize them.

Some generic workloads can, and most of the time should, be delegated to the cloud provider by using managed services. This allows us to focus on the core domain that makes our business successful. Let others do the heavy lifting related to infrastructure and administration.

AWS is changing at a very fast pace. New services are added, existing ones updated, and pricing changed. We need to adapt and get the best of those changes. Keeping up to date with the new tooling and changes is not an easy job, but can really pay off and make our system more efficient (and much cooler).

Sample question: How do you monitor usage and cost?

Nobody likes systems that are unreliable and keep breaking. The reliability pillar asks us to always ensure that our system has enough resources to perform. It needs to scale automatically to meet demand. Each AWS service has limits and quotas per account or region that we have to be aware of. Some of those are flexible and can be raised on request. We need to actively monitor resource usage and check whether we are about to hit such limits.
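
Checking how close we are to a quota boils down to comparing usage with the limit and warning early. The sketch below uses made-up resource names and numbers and an assumed 80% warning level; in practice the figures would come from the provider's quota and monitoring tools.

```python
def quotas_near_limit(usage, quotas, warn_ratio=0.8):
    """Return the names of resources whose usage is at or above
    `warn_ratio` of their quota, sorted alphabetically."""
    near = []
    for name, limit in quotas.items():
        used = usage.get(name, 0)
        if limit > 0 and used / limit >= warn_ratio:
            near.append(name)
    return sorted(near)


# Illustrative values only, not actual AWS quota figures.
quotas = {"running_instances": 20, "elastic_ips": 5, "vpcs": 5}
usage = {"running_instances": 17, "elastic_ips": 2, "vpcs": 5}

print(quotas_near_limit(usage, quotas))
```

Warning at 80% rather than 100% leaves time to request an increase before autoscaling runs into a wall.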

Every system will break at some point. We have to plan for failure and have procedures in place to react. We should have playbooks that describe how to behave in such cases. Moreover, the response to an incident should preferably be automated, which makes it much faster and more reliable. We should make sure that the blast radius of the incident is as small as possible and that our business does not suffer. If a single component of the system goes down, it should have no effect on other components.
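
One common way to keep a failing component from dragging others down is a circuit breaker. The framework itself does not prescribe this exact code; the following is a minimal sketch, and the class, its parameters, and the `flaky_recommendations` function are hypothetical names chosen for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Circuit is open: do not touch the dependency at all.
                return fallback()
            # Cool-down elapsed: close the circuit, but let a single
            # failure reopen it immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def flaky_recommendations():
    # Hypothetical dependency that is currently down.
    raise RuntimeError("recommendation service unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0)
for _ in range(3):
    # The page still renders, just with an empty recommendation list.
    print(breaker.call(flaky_recommendations, fallback=lambda: []))
```

After two failures the breaker stops calling the broken service and answers from the fallback, so the rest of the system keeps working with reduced functionality.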

There are two kinds of people: those who back up, and those who have never lost all their data. Backups are crucial if our data has any value. Moreover, we have to be sure that the backup and restoration process works in practice. Unfortunately, this is something that is often overlooked and rarely tested. The consequences can be tragic when backups do not work as expected.
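
Testing a restore can be as simple as restoring into a scratch location and comparing checksums with the original. In the sketch below a plain file copy stands in for the real restore procedure, and the function names are made up for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_restores_correctly(original, backup):
    """Restore the backup into a scratch directory and check that the
    restored file matches the original byte for byte."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copyfile(backup, restored)  # stand-in for a real restore step
        return sha256_of(restored) == sha256_of(original)


# Small self-contained demonstration with throwaway files.
with tempfile.TemporaryDirectory() as workdir:
    data = Path(workdir) / "data.db"
    backup = Path(workdir) / "data.db.bak"
    data.write_bytes(b"important records")
    shutil.copyfile(data, backup)
    print(backup_restores_correctly(data, backup))
```

Running a check like this on a schedule turns "we think we have backups" into something we have actually verified.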

Sample question: How do you design interactions in a distributed system to prevent failures?

For performance efficiency, there are four core resource types: compute, storage, database, and networking. Each of them has numerous cloud services available, each with a number of configurations. Selecting just the right option for the job is a very hard task. We definitely cannot do it without fully understanding our needs first. Performance expectations depend on the specific workload and will usually be very different for, say, a customer-facing website and a back-office reporting tool. Moreover, the cloud environment is constantly evolving, so choosing the right tools has to be a continuous process. We need to frequently reevaluate our selection and leverage the latest offerings.

There are a lot of very complex challenges that can be shifted to the cloud provider. It is not only about the infrastructure but also about using managed services. The more we can focus on the core domain that differentiates our business, the better. AWS is investing a lot of effort into the optimization of managed services. Therefore, those would most likely be much more performant than any similar self-hosted solution that we manage on our own.

Our system should be closely monitored to make sure it performs as expected. Monitoring should alert us when abnormal behavior is detected and let us verify the performance of our software. The monitoring rules should be tailored to our needs and frequently reevaluated.
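
A monitoring rule tailored to our needs might, for example, fire only when a metric stays above a threshold for several consecutive measurements, so that a single spike does not wake anyone up. The sketch below shows that idea with made-up latency values; the function name and the default of three datapoints are assumptions.

```python
def should_alarm(datapoints, threshold, breaches_required=3):
    """Return True if the most recent `breaches_required` datapoints
    all exceed the threshold."""
    if breaches_required <= 0:
        raise ValueError("breaches_required must be positive")
    recent = datapoints[-breaches_required:]
    return len(recent) == breaches_required and all(v > threshold for v in recent)


# Hypothetical 95th percentile latencies in milliseconds, one per minute.
single_spike = [120, 510, 130, 140]
sustained = [120, 510, 530, 620]

print(should_alarm(single_spike, threshold=500))  # one spike, then recovery
print(should_alarm(sustained, threshold=500))     # three breaches in a row
```

Both the threshold and the number of required breaches are the kind of rules that should be revisited as the workload changes.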

Sample question: How do you monitor your resources to ensure they are performing?

Sustainability is the newest pillar, added in 2021. It is definitely a sign of a changed perspective on how important the consequences of our actions are. Every decision we make has an impact on the environment, and we should understand how big it will be. Fully utilizing the resources we already have instead of purchasing additional ones is one example of minimizing our carbon footprint. We could also use Spot Instances, which run on spare EC2 capacity that would otherwise sit unused.

Some of the AWS regions use renewable sources of energy. This is worth considering when selecting a region. Even a small change, when made by many users and companies, can have a huge impact.

Sample question: How do you select regions to support your sustainability goals?
