Thursday, August 07, 2008

RHQ - tip of the day: Agent waiting at startup

Sometimes when you start the RHQ agent from the command line, it does not proceed to the sending> prompt, but sits there and waits for something. This post covers some of the possible causes.

The server has rejected the agent registration request...



Well, this message from the agent actually goes on:


Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent asking for registration is trying to register the same address/port [172.31.7.7:16163] that is already registered under a different name [snert]; if this new agent is actually the same as the original, then re-register with the same name]


This means that the server already knows the connecting agent (same address and port) under the name 'snert', but the agent passed a different name on this start.

To solve this, start the agent with the --clean option and give it the correct name.

The agent will now wait until it has registered with the server...



By default (well, you had to answer the setup questions about it), the ports for server-agent communication are as follows (yes, two unidirectional connections):

  • Agent to server: server is listening on port 7080

  • Server to agent: the agent is listening on port 16163



... and hangs there



This is an agent state where the server cannot be reached (perhaps because it is down or because a firewall blocks the traffic).
So make sure port 7080 on the server machine is reachable from the agent's machine. You can simply check this with a web browser like Safari or lynx, or with wget.
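
If no browser is at hand on the agent machine, a plain TCP connect answers the same question. The following is a minimal sketch and not part of RHQ: `can_connect` is a made-up helper name, and the host name in the usage comment is a placeholder for your own server. Note that a successful connect only shows that something is listening on the port, not that it is the RHQ server.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and connects;
        # the socket is closed again when the with-block ends.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (placeholder host name, run on the agent machine):
#   can_connect("rhq-server.example.com", 7080)
```

Run from the agent machine against port 7080 on the server, a result of False points to a server that is down or a firewall in between.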

... and shows an additional error


Here, after a short while, the agent will show a message like this:

The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:Server cannot ping the agent's endpoint. The agent's endpoint is probably invalid or there is a firewall preventing the server from connecting to the agent. Endpoint: socket://172.31.7.3:12345/....

This means that the agent was able to talk to the server (so this communication channel is ok), but the other direction is failing. In the example above, the server was trying to reach an agent on IP 172.31.7.3 and TCP port 12345, which was probably blocked by a firewall.
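
To find out which address and port the server is actually trying to reach, it helps to pull the endpoint out of the error message. Here is a small sketch that does this with a regular expression; `parse_endpoint` is a hypothetical helper, and the pattern assumes the `Endpoint: socket://host:port/` form shown in the message above.

```python
import re

# Matches the "Endpoint: socket://<host>:<port>" part of the agent message.
ENDPOINT_RE = re.compile(r"Endpoint:\s*socket://([^:/\s]+):(\d+)")


def parse_endpoint(message: str):
    """Return (host, port) from the error message, or None if not found."""
    match = ENDPOINT_RE.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

The resulting host and port are what needs to be reachable from the server machine, so that is the pair to test against the firewall rules.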

The agent does not have plugins - it will now wait for them to be downloaded...



This usually means that the server has a different security token than the one the agent was sending.
This can happen when the Java preferences entry got mangled, e.g. by testing with different agent versions or VMs or ...

You will see this message only on initial agent startup, when the agent does not have any plugins yet.
If plugins got downloaded in a previous run, you will probably run into the situation shown below.

If you see this on the agent, you should also see messages like this on the server side:

11:40:48,454 WARN [CommandProcessor] {CommandProcessor.failed-authentication}Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.security-token=1217855913569-109582636-403140853869881172, rhq.send-throttle=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService, invocation=NameBasedInvocation[getLatestPlugins]}]

To solve this, start the agent interactively with the --clean option.
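
When several agents talk to the same server, it can be handy to see which security tokens the server is rejecting. The sketch below counts the failed-authentication messages per token. It is only an illustration: `rejected_tokens` is a made-up name, and it assumes that each message sits on one log line in the format shown above.

```python
import re
from collections import Counter

# Extracts the value of rhq.security-token from a log line.
TOKEN_RE = re.compile(r"rhq\.security-token=([0-9-]+)")


def rejected_tokens(log_lines):
    """Count failed-authentication log lines per security token."""
    counts = Counter()
    for line in log_lines:
        if "CommandProcessor.failed-authentication" not in line:
            continue
        match = TOKEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the server log (for example an open file object) returns a Counter keyed by token, which can be compared with the token the agent reports in its own error messages.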

Agent startup is ok, but ping command fails



Here, the agent starts successfully, but you will, for example, not see any new metric data coming in from this agent. When you issue the ping command on the agent command line, you will see something like:

sending> ping

Pinging...

Failed to execute prompt command [ping]. Cause: org.rhq.enterprise.communications.command.server.AuthenticationException:Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.security-token=1214208960346-102975580-7334156733284942657, rhq.send-throttle=true}]; params=[{targetInterfaceName=org.rhq.enterprise.communications.Ping, invocation=NameBasedInvocation[ping]}]


This is basically the same as above. Your server log should also be full of those CommandProcessor.failed-authentication messages. The solution is the same as in the previous section: start the agent interactively with the --clean option.



5 comments:

k-blog said...

I get the message of "The agent does not have plugins - it will now wait for them to be downloaded..." after using the --clean option for ./rhq-agent. Do you know of any other reasons this could occur? I have other machines that connect just fine, but this VM doesn't...

Heiko said...

Kevin,

it may be that this agent machine is not fully resolvable back and forth in DNS or /etc/hosts. With RHQ 1.1 we introduced high availability, and this means that the server tries to reverse-lookup the agent's IP and then does a forward lookup again. If this does not result in the same IP, the agent will hang.


Btw.: the forum at http://www.jboss.com/index.html?module=bb&op=viewforum&f=292 would have been a better place to ask :)

k-blog said...

You were right. I added entries to /etc/hosts and it worked right away. Thanks! I was already pinging the forum for information on how to reverse engineer the database to do reporting. :) It slipped my mind to not post there.

Heidrich said...

I see the middle one... the strange thing is that the server cannot reach my agent on an IP:port that isn't even defined in the config.
I configured some wrong data when I first ran the agent, and since then it is no use editing the config, deleting the agent directory, and/or reinstalling. The server always wants to connect to the agent on the primary IP, which is inaccessible; this is why I set up the config following the NAT recommendations, but the agent seems to reject the config at startup.

Unknown said...

Thank u!!
If starting the agent with the --clean option doesn't work, you need to be logged in as root.
I spent a couple of hours with this detail