CLUES: cluster energy saving (for HPC and cloud computing)

what is CLUES?

CLUES is a tool that powers off cluster nodes that are not being used and powers them back on, in real time, when they are needed. CLUES integrates with the existing local resource manager and carries out its activities transparently to the end user.

why CLUES?

Many computers stay powered on even when they are not being used: file servers, print servers, domain controllers, application servers, user computers left on to run update processes, etc. The same problem occurs, on a much larger scale, in infrastructures such as clusters or cloud deployments, and it is aggravated by additional factors such as the air conditioning needed to keep the temperature within a suitable range. In some cases, energy is wasted during more than 70% of the operating time (considering working hours as the usage time for a computer).

where can CLUES be used?

CLUES can be used in any cluster controlled by a Local Resource Management System (LRMS), whether a batch-queuing system such as OpenPBS/Torque or Oracle Grid Engine, or a virtual machine manager such as OpenNebula. Connectors have already been developed for some of these systems; if the connector for the system you are using is not available, you can contact us or even develop your own connector.

In the latter case, we would be grateful if you could report back to us, so that we can incorporate your development into the pool of connectors available to other users. We try to expand the pool of connectors based on the requests we receive from users, and to enhance the connectors already available (whether developed by us or by external users).

how does CLUES work?

CLUES monitors the usage of the local resource managers through a series of integration plug-ins. When CLUES detects that a node has not been used for a period of time, that node becomes a candidate to be powered off. If none of the systems integrated with CLUES reports recent use of the node, it is switched off or put into hibernation or stand-by mode.
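The idle check described above can be illustrated with a minimal Python sketch. It is not the CLUES implementation: the `Node` class, the plug-in names used as dictionary keys and the 1800-second idle threshold are all assumptions made for the example. The point it shows is that a node becomes a power-off candidate only when the most recent use reported by any integrated system is older than the idle period.

```python
from dataclasses import dataclass, field

IDLE_THRESHOLD_S = 1800  # assumed idle period in seconds; not a CLUES default


@dataclass
class Node:
    name: str
    powered_on: bool = True
    # Last time (in seconds) each integrated system reported use of the node,
    # keyed by a hypothetical plug-in name such as "torque" or "opennebula".
    last_used: dict = field(default_factory=dict)


def power_off_candidates(nodes, now, idle_threshold=IDLE_THRESHOLD_S):
    """Return names of powered-on nodes that no plug-in reports as recently used."""
    candidates = []
    for node in nodes:
        if not node.powered_on:
            continue  # already off, nothing to decide
        # A node with no reports at all is treated as idle since time 0.
        most_recent = max(node.last_used.values(), default=0.0)
        if now - most_recent >= idle_threshold:
            candidates.append(node.name)
    return candidates


nodes = [
    Node("wn1", last_used={"torque": 1000.0, "opennebula": 4000.0}),
    Node("wn2", last_used={"torque": 1200.0, "opennebula": 900.0}),
    Node("wn3", powered_on=False, last_used={"torque": 100.0}),
]
print(power_off_candidates(nodes, now=5000.0))  # ['wn2']
```

In this example wn1 is kept on because one of the two systems used it recently, even though the other did not.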

When a user requests the execution of a job (by sending it to a batch queue) or requests a virtual machine, CLUES checks whether the currently available nodes can process the request. If the available resources are not enough, CLUES tries to power on the nodes needed to execute the task.
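As a rough illustration of this decision, the following Python sketch picks which powered-off nodes to wake for a request. The function name, the notion of "slots" and the largest-node-first choice are assumptions for the example, not the CLUES scheduling policy, and the sketch simplifies by treating slots as freely combinable across nodes.

```python
def nodes_to_power_on(requested_slots, free_slots_on, slots_off):
    """Pick powered-off nodes to wake so that a request can be served.

    requested_slots: slots the job or virtual machine needs.
    free_slots_on:   free slots per powered-on node, e.g. {"wn1": 2}.
    slots_off:       total slots per powered-off node, e.g. {"wn3": 8}.
    Returns the list of nodes to power on ([] if none are needed), or None
    if the request cannot be served even with every node powered on.
    """
    missing = requested_slots - sum(free_slots_on.values())
    if missing <= 0:
        return []  # the powered-on nodes already suffice
    chosen = []
    # Wake the largest nodes first so that fewer nodes are powered on;
    # ties are broken by node name to keep the result deterministic.
    for name, slots in sorted(slots_off.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.append(name)
        missing -= slots
        if missing <= 0:
            return chosen
    return None


# 10 slots requested, 3 free on running nodes: waking wn4 (8 slots) is enough.
print(nodes_to_power_on(10, {"wn1": 2, "wn2": 1}, {"wn3": 4, "wn4": 8}))  # ['wn4']
```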

is the user affected by the activity of CLUES?

One of the design objectives of CLUES is to interfere as little as possible with the user's interaction with the cluster, always trying to offer the appearance of a completely powered-on cluster. Thus, a user who submits a job is affected only if there are not enough powered-on nodes to process the job or to launch the virtual machine. In that case, the job has to wait for additional nodes to be powered on, although the waiting time is usually reasonably short for the internal nodes of a cluster. Moreover, CLUES tries to keep a small number of extra nodes powered on, so that they are ready for future requests and waiting time is reduced.
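The idea of keeping a few extra nodes ready can be sketched in Python as follows. The function name and the `spare_nodes` parameter are hypothetical and only illustrate the principle: out of the nodes that are currently idle, a small reserve stays powered on and only the remainder may be switched off.

```python
def nodes_allowed_to_power_off(idle_nodes, spare_nodes=1):
    """Return the idle nodes that may be powered off, keeping a reserve on.

    idle_nodes:  names of powered-on nodes that are currently idle.
    spare_nodes: how many idle nodes to keep on for future requests
                 (an assumed setting, not a documented CLUES option).
    """
    keep = max(spare_nodes, 0)
    ordered = sorted(idle_nodes)  # deterministic choice of which nodes stay on
    return ordered[keep:]  # empty when the reserve covers every idle node


# With a reserve of one node, wn1 stays on and the other two may be powered off.
print(nodes_allowed_to_power_off(["wn3", "wn1", "wn2"], spare_nodes=1))  # ['wn2', 'wn3']
```

A larger reserve shortens waiting time for new requests at the cost of a smaller energy saving.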

what is the difference between CLUES and other systems?

Some batch-queuing systems such as SLURM or Oracle Grid Engine (formerly Sun Grid Engine) can, according to their product documentation, provide similar green-computing functionality. However, CLUES can not only integrate with virtually any Local Resource Management System, but also perform power-on/off scheduling for computing platforms that are shared by different control middleware.

Thus, CLUES can integrate with SLURM and SGE (OGE), but also with other batch-queuing systems such as LSF, OpenPBS/Torque, etc. CLUES can also integrate with cloud-computing resource management systems (particularly those providing IaaS) such as OpenNebula, or with emerging PaaS cloud systems.

Moreover, CLUES can integrate simultaneously with different subsystems coexisting in a cluster. Thus, it is possible to have shared clusters that are managed by a batch-queuing system such as Torque and a Virtual Infrastructure Manager such as OpenNebula, and to perform a coordinated energy saving policy for the whole cluster.

© GRyCAP - UPV, Edificio 8B - Universidad Politécnica de Valencia - 46022, Valencia.
Contact: +34963877023, Fax: +34963877274