Great Questions on Scalability

The definition

Scalability is one of the biggest architectural concerns in modern software development. In technical terms, scalability is a system's ability to respond gracefully to the demands placed upon it. Storage I/O, database access, CPU utilization, memory utilization, app server farms, and network utilization are the most common areas that require scalability attention.

The challenges

In my experience, when designing or even developing a scalable solution, it is difficult to predict the demand on the future system and the areas that will need optimization. Those insights come from real experience, once the system is up and running in production and is being used and assessed by users.

That is arguable, however.

As the architects responsible for scalable designs and solutions, we must plan for scalability as part of the development and delivery cycle. This can be achieved through chunking, testing, and detailed monitoring to validate the system's behavior.

The options

The two most common options for scalability are scale-up and scale-out. Scale-up means buying bigger hardware. Scale-out means having multiple sets of hardware that can respond to the same requests.

Early in my career, scale-up was often the favorite choice because it provided full control and ownership of the hardware and, most importantly, it was usually budgeted. The concept of virtualization and VMs was not even in use yet, since the technology was not so popular at the time. Then, after the cloud was introduced in late 2008, momentum shifted to the scale-out option, which is more cost-effective. Why? It simply allows you to start small and add system resources as the demand for the system's capability increases over time.

The questions

Now we come to the most interesting part of this article: the areas to consider when designing and implementing scalable solutions. I like to ask questions because I often get different answers, and those answers keep me interested and encourage me to ask more. So, here they are.

1. How many users (online and batch) will concurrently access the system?

2. How much data will the system be able to manage?

3. How many read/write operations per second does the data store need to handle?

4. What is the peak concurrent access to the system?

5. How much data can be cached to minimize how deep into the system requests must travel before they are answered?

  • Can data be cached outside the system in a content distribution network (CDN) to help keep traffic away from the site?
  • Is it worth caching?

6. Is data replication required for the system? How long is it acceptable for data synchronization to take?

7. How much logging and event capture does the system require to support its operational needs, both now and for future performance analysis?

8. Are there areas of data contention?

9. Are there CPU-intensive operations?

10. How do you plan to measure usage of the system?

11. Do you plan to meter services to throttle excessive usage?

12. Do you have the ability to auto-provision additional servers to meet demand?

13. Can you schedule batch operations to occur at non-peak times?

I leave it to you, the reader, to decide which questions are most important to you.
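As one illustration of question 11, metering a service to throttle excessive usage is often done with a token bucket. What follows is a minimal sketch under stated assumptions, not a production limiter: the class name `TokenBucket` and its parameters `rate` and `capacity` are invented for this example, and the clock is injectable so that the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: allows bursts up to `capacity`,
    then a steady `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable time source
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client or API key and reject or queue a request whenever `allow()` returns False.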

The practice

For me, it is important to set up rules and alerts so that key personnel are notified when certain system performance thresholds are reached. For example, an operational warning for the operations team is triggered when a system resource reaches 80% utilization. If utilization goes over 90%, an urgent notification is needed, and action must be taken to resolve the problem. I love the idea of auto-provisioning based on system usage: it is fully automated and greatly improves system performance. Certainly, a rule for decommissioning underutilized VMs or instances should be set as well.
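The alerting and provisioning rules described here could be sketched roughly as follows. The 80% and 90% thresholds come from the text; the 30% scale-in threshold, the instance limits, and the function names `alert_level` and `desired_instances` are assumptions made only for illustration.

```python
WARNING_THRESHOLD = 80.0   # percent, from the text
URGENT_THRESHOLD = 90.0    # percent, from the text
SCALE_IN_THRESHOLD = 30.0  # percent, assumed for illustration


def alert_level(utilization):
    """Map a utilization percentage to 'ok', 'warning', or 'urgent'."""
    if utilization > URGENT_THRESHOLD:
        return "urgent"
    if utilization >= WARNING_THRESHOLD:
        return "warning"
    return "ok"


def desired_instances(current, utilization, min_instances=1, max_instances=10):
    """Add one instance at the warning threshold; remove one when underutilized."""
    if utilization >= WARNING_THRESHOLD:
        return min(current + 1, max_instances)
    if utilization < SCALE_IN_THRESHOLD:
        return max(current - 1, min_instances)
    return current
```

Keeping a wide gap between the scale-out and scale-in thresholds is a deliberate choice in this sketch: it avoids instances being added and removed repeatedly when utilization hovers around a single value.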

The takeaway

The key to scalability is to test and validate our assumptions about system behavior. That means driving the system past its limits to the breaking point so that we can find out how it fails under load.
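The idea of pushing a system until it breaks can be sketched as a simple step-load ramp. This is only an outline under assumptions: `run_at_load` is a hypothetical callable that you would back with a real load-testing tool, and `toy_system` is a stand-in model used here so the example runs on its own.

```python
def find_breaking_point(run_at_load, start=100, step=100, max_load=10_000,
                        max_error_rate=0.01):
    """Increase the load step by step until the error rate exceeds the limit.

    `run_at_load(load)` is a hypothetical callable that drives the system at
    `load` requests per second and returns the observed error rate (0.0 to 1.0).
    Returns the first load level that fails, or None if none was found.
    """
    load = start
    while load <= max_load:
        if run_at_load(load) > max_error_rate:
            return load
        load += step
    return None


def toy_system(load, capacity=750):
    """Stand-in for a real load test: errors appear once load exceeds capacity."""
    if load <= capacity:
        return 0.0
    return (load - capacity) / load


print(find_breaking_point(toy_system))  # prints 800
```

In a real test, the interesting part is less the number itself than what the system does at that point: whether it slows down, rejects requests cleanly, or fails outright.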