Pascal Brisset explains (on erlang-questions some time ago) a scenario where Erlang's selective receive can make a lightly loaded server fall behind and never recover...
The system is dimensioned so that the CPU load is low (say 10 %). Now at some point in time, the backend service takes one second longer than usual to process one particular request. You'd expect that some requests will be delayed (by no more than one second) and that quality of service will return to normal within two seconds, since there is so much spare capacity. Instead, the following can happen: during the one-second outage, requests accumulate in the message queue of the server process. Subsequent gen_server calls take more CPU time than usual, because they have to scan the whole message queue to extract replies. As a result, more messages accumulate, and so on.
snowball.erl (attached) simulates all this. It slowly increases the CPU load to 10 %. Then it pauses the backend for one second, and you can see the load rise to 100 % and remain there, although the throughput has fallen dramatically.
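To make the mechanism concrete, here is a simplified sketch of the selective receive at the heart of gen_server:call/3. This is an illustration of the idea, not the actual OTP code (which lives in gen.erl). In Brisset's scenario the caller is the server itself, calling its backend, so the mailbox being scanned is the one filling up with client requests:

%% Simplified sketch of gen_server:call/3 (illustrative only; the real
%% implementation is in OTP's gen.erl).
call(Server, Request, Timeout) ->
    Ref = erlang:monitor(process, Server),
    Server ! {'$gen_call', {self(), Ref}, Request},
    receive
        %% Selective receive: only messages tagged with Ref match, so
        %% every older message in the mailbox is scanned and skipped
        %% first. With N requests queued up, each call costs O(N).
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, _Pid, Reason} ->
            exit(Reason)
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        exit(timeout)
    end.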
Here are several ways to avoid this scenario...
...Add a proxy process dedicated to buffering requests from clients and making sure the message queue of the server remains small. This was suggested to me at the erlounge. It is probably the best solution, but it complicates process naming and supervision. And programmers just shouldn't have to wonder whether each server needs a proxy or not.
I'm not sure it really complicates naming and supervision so much; I think it is the best solution. The problem is not in selective receive per se, which has benefits that outweigh this specific scenario, and it would be especially wrong to gum up the Erlang language and its simple message-passing mechanisms just for this.
The real problem in this scenario is *coupling* the asynchronous selective receive too closely to a synchronous backend service. This is not an uncommon situation in all kinds of "service-oriented architectures", and the solution, generally, should be the one quoted above.
A programmer should legitimately wonder whether some kind of proxy is needed when they see this kind of combination.
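Here is a minimal sketch of such a buffering proxy. The module name and details are mine, not code from the thread: clients call the proxy, the proxy keeps at most one request outstanding at the real server and parks the rest in an explicit queue, so the server's mailbox stays short and its own selective receives stay cheap.

-module(reqproxy).
%% Hypothetical sketch of the buffering-proxy idea (not from the thread).
-export([start/1, call/2]).

start(Server) ->
    spawn(fun() -> idle(Server, queue:new()) end).

call(Proxy, Request) ->
    Ref = erlang:monitor(process, Proxy),
    Proxy ! {request, {self(), Ref}, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, _Pid, Reason} ->
            exit({proxy_down, Reason})
    end.

idle(Server, Q) ->
    receive
        {request, From, Req} ->
            busy(Server, Q, dispatch(Server, Req), From)
    end.

busy(Server, Q, Worker, From) ->
    receive
        {request, From2, Req2} ->
            %% Park new arrivals in an explicit queue instead of
            %% letting them pile up in the server's mailbox.
            busy(Server, queue:in({From2, Req2}, Q), Worker, From);
        {Worker, Reply} ->
            reply(From, Reply),
            case queue:out(Q) of
                {{value, {From2, Req2}}, Q2} ->
                    busy(Server, Q2, dispatch(Server, Req2), From2);
                {empty, Q2} ->
                    idle(Server, Q2)
            end
    end.

dispatch(Server, Req) ->
    Proxy = self(),
    spawn(fun() ->
                  %% The blocking call happens in a throwaway worker,
                  %% so the proxy itself never blocks.
                  Proxy ! {self(), gen_server:call(Server, Req, infinity)}
          end).

reply({Pid, Ref}, Reply) ->
    Pid ! {Ref, Reply}.

The design point is that every message the proxy can receive matches some clause immediately, so the proxy drains its mailbox in FIFO order and never pays the quadratic scanning cost; the backlog lives in the explicit queue instead of the mailbox.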
This is related to the blog posts that were going round not so long ago among fuzzy, Bill de hÓra, Dan Creswell, and others.
2 comments:
I've seen this actually happen... if you write a gen_server that uses synchronous calls (e.g. to another gen_server) then it'll definitely snowball if you hit it too hard.
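The shape being described looks roughly like this (a sketch with hypothetical module names):

-module(frontend).
%% Sketch of the snowball-prone shape: while handle_call/3 is blocked
%% on the backend, new client calls pile up in this process's mailbox,
%% and the selective receive inside gen_server:call/3 below has to skip
%% past all of them to find the backend's reply.
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2]).

init(Backend) ->
    {ok, Backend}.

handle_call(Req, _From, Backend) ->
    %% Synchronous call to another gen_server: nothing else is served
    %% until this returns.
    Reply = gen_server:call(Backend, Req, infinity),
    {reply, Reply, Backend}.

handle_cast(_Msg, Backend) ->
    {noreply, Backend}.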
The solution we used was just to use more processes/nodes and monitor the load (and our latency) to make sure it stays at reasonable levels.
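For the monitoring part, one simple probe is the mailbox length itself, which is exactly the quantity that snowballs (a sketch; the function name and threshold are arbitrary):

%% Sample a server's mailbox length and alarm past a threshold.
check_mailbox(Pid, Limit) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, N} when N > Limit -> {alarm, N};
        {message_queue_len, N}                -> {ok, N};
        undefined                             -> dead  % process has exited
    end.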
So Verisign separated out the duties of queueing and processing. They have a bunch of nodes that scoop up batches of DNS requests and then submit those batches to chunky back-end servers for resolution:
Google Video
Dan.
http://www.dancres.org