Quotes

Wednesday, October 31, 2018

Distributed cloud apps architecture



Feature         Past                            Present
-------         ----                            -------
Clients         Enterprise / intranet           Public / internet
Demand          Stable (small)                  Dynamic (small -> massive)
Datacenter      Single tenant                   Multi-tenant (noisy neighbors)
Operations      People (expensive)              Automation (cheap)
Scale           Up, via a few reliable          Out, via lots of cheap
                (expensive) PCs                 commodity PCs
Failure         Unlikely but possible           Very likely
Machine loss    Catastrophic                    Normal (no big deal)
Exceptions      Catch, swallow & keep running   Crash & restart
Communication   In order, exactly once          Out of order; clients must retry
                                                & servers must be idempotent
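
A minimal Python sketch of the last row above (clients must retry and servers must be idempotent): the client attaches a unique request ID, and the server remembers which IDs it has already handled, so a retried request does not repeat the side effect. The request ID, the in-memory store, and the charge_card function are illustrative assumptions; a real service would keep processed IDs in a durable store.

processed = {}  # request_id -> result (illustrative in-memory store)

def charge_card(amount):
    # Stand-in for the real side effect (e.g. taking a payment).
    return {"charged": amount}

def handle_request(request_id, amount):
    if request_id in processed:           # duplicate delivery or client retry
        return processed[request_id]      # return the original result, don't repeat the work
    result = charge_card(amount)
    processed[request_id] = result
    return result

# The client can now retry safely: both calls result in exactly one charge.
print(handle_request("req-42", 100))
print(handle_request("req-42", 100))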


Failure Mode Analysis (FMA)

How will the application detect this type of failure?
How will the application respond to this type of failure?
How will you log and monitor this type of failure?

Design self-healing apps:

Detect failures, respond to failures gracefully, and log and monitor failures to give operational insight.
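
A minimal Python sketch of that loop: detect a failure, respond gracefully with reduced functionality, and log it for operational insight. The load_recommendations function and the empty-list fallback are illustrative assumptions.

import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("self-healing")

def load_recommendations(user_id):
    # Stand-in for a remote call that can fail.
    raise TimeoutError("recommendation service timed out")

def get_recommendations(user_id):
    try:
        return load_recommendations(user_id)         # normal path
    except TimeoutError as exc:
        # Detect the failure, respond gracefully with reduced functionality,
        # and log it so operations can monitor how often it happens.
        log.warning("recommendations unavailable for %s: %s", user_id, exc)
        return []                                    # degraded but usable result

print(get_recommendations("user-1"))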


Recommendations (minimal Python sketches for several of these follow the list):

1) Retry failed operations. Transient failures might occur due to momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient failures.


2) Protect failing remote services (circuit breaker design pattern). It's advisable to retry after a transient failure, but if the failure persists, you can end up with too many callers hitting a failing service. This can lead to cascading failures as requests back up. Use the circuit breaker design pattern to fail fast (without making the remote call) when an operation is likely to fail.

3) Isolate critical resources (bulkhead pattern). Failures in one subsystem can sometimes cascade. This can happen if a failure prevents resources, such as threads or sockets, from being freed in a timely manner, leading to resource exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring down the entire system.

4) Perform load leveling. Applications may experience sudden spikes in traffic that can overwhelm services on the backend. To avoid this, use the queue-based load leveling pattern to queue work items to run asynchronously. The queue acts as a buffer that evens out peaks in the load.

5) Fail over. If an instance can't be reached, fail over to another instance. For things that are stateless, like a web server, put several instances behind a load balancer or traffic manager. For things that store state, like a database, use replicas and fail over. Depending on the data store and how it replicates, this might require the application to deal with eventual consistency.

6) Compensate for failed transactions. In general, avoid distributed transactions because they require coordination across services and resources. Instead, use compensating transactions to undo any step that already completed.

7) Use checkpoints on long-running transactions. Checkpoints can provide resiliency if a long-running operation fails. When the operation restarts (for example, it is picked up by another virtual machine), it can be resumed from the last checkpoint.

8) Degrade gracefully. Sometimes you can't work around a problem, but you can provide reduced functionality that is still useful. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application. For example, in an e-commerce site, showing product recommendations is probably less critical than processing orders.

9) Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application's availability for other users. In this situation, throttle the client for a certain period of time. See the throttling pattern for more information.

10) Block bad actors. Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means that the client exceeded its service quota. But if a client consistently exceeds their quota or otherwise behaves badly, you might block them. Define an out-of-band process for the user to request getting unblocked.

11) Use leader election. When you need to coordinate a task, use leader election to select a coordinator. That way, the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement a leader election algorithm from scratch, consider an off-the-shelf solution such as Apache ZooKeeper.

12) Test with fault injection. All too often, the success path is well tested but not the failure path. A system could run in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the system to failures, either by triggering actual failures or by simulating them.

13) Embrace chaos engineering. Chaos engineering extends the notion of fault injection by randomly injecting failures or abnormal conditions into production instances.
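
Code sketches:

Recommendation 1 (retry failed operations): a minimal sketch using exponential backoff with jitter. The TransientError class and call_service function are illustrative assumptions; real code would catch whatever transient exceptions its own client library raises.

import random
import time

class TransientError(Exception):
    """Illustrative stand-in for a momentary network, database, or timeout failure."""

def retry(operation, attempts=4, base_delay=0.5):
    """Call operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise                                  # out of attempts: give up
            # Back off exponentially, with jitter so retries do not arrive in lockstep.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

def call_service():
    # Illustrative operation that fails transiently about half the time.
    if random.random() < 0.5:
        raise TransientError("service busy")
    return "ok"

try:
    print(retry(call_service))
except TransientError:
    print("gave up after repeated transient failures")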
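
Recommendation 2 (circuit breaker): a minimal sketch of the pattern's closed / open / half-open behaviour. The thresholds are illustrative assumptions; production code would normally use an existing resilience library rather than a hand-rolled class.

import time

class CircuitBreaker:
    """Fail fast once a remote call has failed repeatedly; probe it again later."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None                  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast without making the remote call at all.
                raise RuntimeError("circuit open; failing fast")
            # Half-open: fall through and allow one trial call to probe the service.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        self.failures = 0                      # success: close the circuit again
        self.opened_at = None
        return result

# Usage (illustrative): share one breaker per remote service and wrap each call,
# e.g. breaker.call(lambda: remote_client.get(url)).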
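
Recommendation 3 (bulkhead): a minimal sketch that gives each downstream dependency its own bounded thread pool, so a hanging dependency can exhaust only its own partition. The pool names and sizes are illustrative assumptions.

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One bounded pool per downstream dependency: a slow or hanging payments service
# can tie up at most 4 threads and cannot starve calls to the catalog service.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments"),
    "catalog": ThreadPoolExecutor(max_workers=8, thread_name_prefix="catalog"),
}

def call_dependency(name, operation, timeout=2.0):
    future = pools[name].submit(operation)
    return future.result(timeout=timeout)      # also bound how long the caller waits

# Demo: a simulated hung payments call times out for the caller, while the
# catalog pool remains untouched and available for other work.
try:
    call_dependency("payments", lambda: time.sleep(2), timeout=0.5)
except FutureTimeout:
    print("payments call timed out; catalog pool unaffected")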
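
Recommendation 4 (queue-based load leveling): a minimal in-process sketch with a bounded queue and a single worker. In a real system the buffer would be a durable message queue rather than queue.Queue; process_order is an illustrative assumption.

import queue
import threading
import time

work = queue.Queue(maxsize=1000)          # the buffer that evens out traffic spikes

def process_order(order):
    time.sleep(0.05)                      # stand-in for slow backend work

def worker():
    while True:
        order = work.get()
        process_order(order)
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

# A burst of requests is absorbed by the queue instead of hitting the backend
# all at once; the worker drains it at its own steady pace.
for i in range(20):
    work.put({"order_id": i})
work.join()
print("burst of 20 orders processed without overwhelming the backend")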
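
Recommendation 5 (fail over): a minimal sketch for the stateless case, trying a list of replicas in order and using the first one that answers. The replica URLs and fetch function are illustrative assumptions; in practice a load balancer or traffic manager usually does this for you.

REPLICAS = [
    "https://primary.example.com",        # illustrative endpoints
    "https://secondary.example.com",
]

def fetch(url):
    # Stand-in for a real HTTP call; here the primary happens to be down.
    if "primary" in url:
        raise ConnectionError(f"{url} unreachable")
    return f"200 OK from {url}"

def fetch_with_failover(path):
    last_error = None
    for base in REPLICAS:                 # try instances in preference order
        try:
            return fetch(base + path)
        except ConnectionError as exc:
            last_error = exc              # remember the failure, try the next replica
    raise RuntimeError("all replicas failed") from last_error

print(fetch_with_failover("/health"))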
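
Recommendation 6 (compensating transactions): a minimal sketch that runs a sequence of steps and, if one fails, runs the compensations for the completed steps in reverse order. The step names are illustrative assumptions.

def run_with_compensation(steps):
    """steps is a list of (action, compensation) pairs of callables."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        # Undo the steps that already completed, in reverse order, instead of
        # relying on a distributed transaction to roll everything back atomically.
        for compensation in reversed(done):
            compensation()
        raise

def reserve_seat():  print("seat reserved")
def release_seat():  print("seat released (compensation)")
def charge_card():   raise RuntimeError("payment declined")
def refund_card():   print("card refunded (compensation)")

try:
    run_with_compensation([(reserve_seat, release_seat), (charge_card, refund_card)])
except RuntimeError as exc:
    print("booking failed:", exc)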
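
Recommendation 7 (checkpoints): a minimal sketch that persists progress after each item so a restarted worker resumes from the last checkpoint rather than from the beginning. The checkpoint file and process_item function are illustrative assumptions; a real job would checkpoint to durable shared storage.

import json
import os

CHECKPOINT_FILE = "job.checkpoint"        # illustrative; use durable shared storage

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_index": next_index}, f)

def process_item(item):
    print("processed", item)

def run_job(items):
    start = load_checkpoint()             # resume where the previous run stopped
    for i in range(start, len(items)):
        process_item(items[i])
        save_checkpoint(i + 1)            # record progress after every item

run_job([f"item-{n}" for n in range(10)])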
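
Recommendation 9 (throttle clients): a minimal sketch using a fixed time window per client. The window length and quota are illustrative assumptions; the throttling pattern also allows other strategies such as token buckets.

import time
from collections import defaultdict

WINDOW_SECONDS = 60                        # illustrative quota: 100 requests per minute
MAX_REQUESTS_PER_WINDOW = 100

windows = defaultdict(lambda: (0.0, 0))    # client_id -> (window start, request count)

def allow_request(client_id):
    now = time.monotonic()
    start, count = windows[client_id]
    if now - start >= WINDOW_SECONDS:      # the previous window has expired
        windows[client_id] = (now, 1)
        return True
    if count < MAX_REQUESTS_PER_WINDOW:
        windows[client_id] = (start, count + 1)
        return True
    return False                           # over quota: caller should return HTTP 429

# Example: the 101st request inside one minute is throttled for this client only.
results = [allow_request("client-a") for _ in range(101)]
print(results.count(True), "allowed,", results.count(False), "throttled")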
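
Recommendation 10 (block bad actors): a minimal sketch that builds on the throttling sketch above. The violation limit is an illustrative assumption; unblocking is left to an out-of-band process, as the text suggests.

from collections import defaultdict

VIOLATION_LIMIT = 5           # illustrative: block after 5 throttled requests

violations = defaultdict(int)
blocked = set()               # cleared only by an out-of-band support process

def record_throttle(client_id):
    """Call this whenever allow_request() above returns False for a client."""
    violations[client_id] += 1
    if violations[client_id] >= VIOLATION_LIMIT:
        blocked.add(client_id)     # persistent offender: stop serving them entirely

def is_blocked(client_id):
    return client_id in blocked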
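
Recommendation 11 (leader election): a minimal sketch assuming the kazoo Python client for Apache ZooKeeper and its Election recipe. The connection string, the election path, the identifier, and the coordinate function are illustrative assumptions, and it only does useful work against a reachable ZooKeeper ensemble.

from kazoo.client import KazooClient      # assumes the kazoo package is installed

def coordinate():
    # Work that only the current leader should perform, e.g. assigning shards.
    print("this instance is the leader; coordinating work")

zk = KazooClient(hosts="127.0.0.1:2181")  # illustrative connection string
zk.start()

# Every instance runs this; ZooKeeper lets exactly one contender win at a time.
# run() blocks until this instance is elected, then calls coordinate(). If the
# leader dies, another contender is elected and takes over.
election = zk.Election("/myapp/election", "instance-1")
election.run(coordinate)

zk.stop()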
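
Recommendation 12 (fault injection): a minimal sketch that wraps a dependency in a proxy which fails a configurable fraction of calls, so tests can exercise retry, circuit-breaker, and fallback paths. The failure rate and the wrapped function are illustrative assumptions.

import random

def with_fault_injection(operation, failure_rate=0.3, exception=ConnectionError):
    """Wrap operation so a fraction of calls raise instead of running for real."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise exception("injected fault")          # simulated failure
        return operation(*args, **kwargs)
    return wrapper

def get_inventory(sku):
    return {"sku": sku, "count": 7}

# In a resiliency test, point the code under test at the flaky version and assert
# that it still retries, fails fast, or degrades gracefully as designed.
flaky_get_inventory = with_fault_injection(get_inventory, failure_rate=0.5)
for _ in range(5):
    try:
        print(flaky_get_inventory("sku-1"))
    except ConnectionError as exc:
        print("caller saw:", exc)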
