I was reading the section on the tragedy of the commons in the book “Thinking in Systems” and thought some of its recommendations might apply to software teams trying to manage their cloud infrastructure spending. When companies are young, they may optimize for development speed and flexibility rather than financial efficiency. Left unchecked, this prioritization can lead to significant negative business impacts. We can use the frameworks that systems thinking applies to tragedy of the commons scenarios to identify practical solutions for managing cloud infrastructure expenses.
I’m most familiar with robotic systems, which often require an array of high-resolution sensors to perceive their environment and operate within it. These systems save and upload raw sensor data to cloud storage. Access to this data enables a wide variety of debugging, testing, and feature development work. These activities rely on cloud computing systems that produce even more data, much of which is also retained, driving cloud infrastructure costs higher still. In high-growth development environments, individual teams or engineers receive liberal access to the tools and libraries required to use cloud storage and compute resources. Left unchecked, the compounding growth of cloud infrastructure costs can significantly impact a company’s financial health.
The book defines a “commons” as a resource that is commonly shared. For the system to be subject to tragedy, the resource must not only be limited, but erodable when overused. A commons also needs users of the resource to increase at a rate that is not influenced by the condition of the commons.
The tragedy of the commons arises from missing (or too long delayed) feedback from the resource to the growth of the users of that resource. The structure of a commons system makes selfish behavior much more convenient and profitable than behavior that is responsible to the whole community and to the future.
The nature of cloud computing means that engineers can easily scale their demands on the infrastructure as needed. Storage and compute resources are effectively limitless, and cloud providers are incentivized to reduce any friction associated with accessing more of them. Without constraints in place, developers can keep scaling their demands on the infrastructure, justifying each increase as necessary to meet their team’s deliverables and deadlines while inadvertently hurting the company’s finances. At some point, the revenue generated by the development activities can’t keep up with the growth of the costs associated with them.
According to the book, there are three ways to avoid the tragedy of the commons.
- Educate and exhort. Help people to see the consequences of unrestrained use of the commons.
  - There’s a range of possibilities here. When we decentralize cloud infrastructure access to individual teams and developers in order to optimize for speed and flexibility, it’s often a challenge to understand who or what process is responsible for specific expenses. Fixing this usually requires tagging resources and processes so that expenses can be associated with teams or individuals. More generally, we want to bring more financial awareness to software engineers so they understand the basic financial impact of the work they are doing. When a developer runs a CLI command, how do they know whether it costs $1 to execute or thousands of dollars?
- Privatize the commons. Divide it up so that each person reaps the consequences of their actions.
  - Assuming we have some basic traceability between expenses and teams, we can start to assign individual teams their own budgets. We can provide dashboards and simple forecasting tools to make it clear which teams are staying within their budgets. We could even implement policies that limit or restrict a team’s access to the infrastructure when it exceeds its budget. The budgeting exercise also forces teams to account more explicitly for financial impact when prioritizing their work. This can drive conversations about what’s important to maintain or expand and what functionality can be left stagnant, reduced, or even removed completely.
- Regulate the commons. Regulation must be enforced by policing and penalties.
  - Another option is to develop policies that enforce fiscal responsibility. We can establish and enforce strict TTLs/SLAs that ensure data is regularly deleted unless it’s explicitly selected to be retained. We can review what the fleet offloads and determine whether it could upload only a subset of data types or time periods of interest instead of the entire log. We can analyze the access patterns of the cloud data to determine whether certain categories of data are rarely used. Processes can be set up to migrate data from hot to cold storage to reduce costs. Compute-intensive services can be scheduled to run at off-peak times or rely on spot instances. There’s a wide variety of policies a team can set up and enforce through tooling to help manage costs.
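
As a rough illustration of how the three approaches above could show up in tooling, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `StoredObject` record, the `team` tag, the flat price per GB-month, the 90-day TTL, and the budget figures are all hypothetical and do not reflect any real provider’s pricing or API. A real setup would more likely read tags and billing data from the cloud provider’s own cost reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredObject:
    """One stored object plus the metadata needed for attribution and expiry."""
    key: str
    team: str             # hypothetical ownership tag
    size_gb: float
    created: datetime
    retain: bool = False  # explicitly selected to be kept past the TTL


# Assumed illustrative values, not real provider pricing or policy.
PRICE_PER_GB_MONTH = 0.02
TTL = timedelta(days=90)


def monthly_cost_by_team(objects):
    """Educate: attribute monthly storage cost to the team tag on each object."""
    costs = {}
    for obj in objects:
        costs[obj.team] = costs.get(obj.team, 0.0) + obj.size_gb * PRICE_PER_GB_MONTH
    return costs


def teams_over_budget(costs, budgets):
    """Privatize: list teams whose attributed cost exceeds their budget.

    A team with no budget entry is treated as having a budget of zero.
    """
    return sorted(team for team, cost in costs.items() if cost > budgets.get(team, 0.0))


def expired(objects, now):
    """Regulate: objects older than the TTL that nobody chose to retain."""
    return [obj for obj in objects if not obj.retain and now - obj.created > TTL]


# Made-up example data.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
objects = [
    StoredObject("logs/run-001.bag", "perception", 5000.0, now - timedelta(days=120)),
    StoredObject("logs/run-002.bag", "perception", 2000.0, now - timedelta(days=10)),
    StoredObject("sim/golden-set", "planning", 1000.0, now - timedelta(days=400), retain=True),
]
budgets = {"perception": 100.0, "planning": 50.0}

costs = monthly_cost_by_team(objects)
print("Monthly cost by team:", costs)
print("Over budget:", teams_over_budget(costs, budgets))
print("Eligible for deletion:", [obj.key for obj in expired(objects, now)])
```

The point of the sketch is that all three checks depend on the same metadata: once each object carries an owner and a creation time, cost reports, budget alerts, and retention enforcement become simple queries over it.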
Ultimately, a budgeting exercise has multiple benefits. It forces you to look at your system, identify cost centers, and put processes in place that save money on cloud infrastructure. It can also help you identify where you are spending time and money collecting data that isn’t being used to generate any business value. Closing these gaps can further reduce costs and improve productivity.
By following some general guidance from systems theory on the tragedy of the commons, we can strengthen the feedback loop between the users of cloud infrastructure and the costs of that use.