
CPU maxing out, killing PBX


Jon Heese


Today, around 3:00pm, our internal PBXnSIP install (on a Windows 2003 server) came screeching to a halt. It was maxing out the CPU and was unresponsive to everything: web, SIP, service control, etc. I had to kill the pbxctrl.exe process to get the service to stop, but as soon as I started it back up again, it would go back to max CPU and become unresponsive again within 10-20 seconds, never to return.

 

A reboot of the server made no difference, a restore from a known-good tar backup made no difference, and upgrading pbxctrl.exe to 3.3.1.3177 made no difference. However, I noticed that a "factory reset" (if I was quick enough to get into the web interface and do it before it locked up) did cause it to behave normally, until I tried restoring from the backup, at which point it went back to maxing out the CPU.

 

I noticed that if I manually remove the license key from pbx.xml, everything starts up fine. So, by my rough estimation, it's either the licensing component itself that's causing the problem, or it's something that runs *because* of the license (e.g. trunk registration) that's causing it. Either way, we are dead in the water until we can work out what's maxing out the CPU and killing the PBX.
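
For anyone who wants to script that workaround, here's a rough sketch of what I did by hand, assuming the key lives in a <license> element directly under the root of pbx.xml (I haven't verified the exact schema, so check your own file and adjust the element name and path accordingly):

    # strip_license.py - remove the license element from pbx.xml (element name is an assumption)
    import shutil
    import xml.etree.ElementTree as ET

    PBX_XML = r"C:\Program Files\pbxnsip\pbx.xml"   # adjust to your install path

    shutil.copy(PBX_XML, PBX_XML + ".bak")          # keep a backup before touching the file
    tree = ET.parse(PBX_XML)
    root = tree.getroot()
    for lic in root.findall("license"):             # hypothetical element name
        root.remove(lic)
    tree.write(PBX_XML)

Stop the pbxctrl service first, run the script, then start the service again.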

 

Any suggestions?

 

Regards,

Jon Heese


Hi Jon,

 

(your name looks awfully familiar?)

 

anyway, to test if it's a license issue, how about just typing in a 3-minute key? if it works, you know what it is.

 

also, what happens if you install on another pc? takes but a minute to try...

 

just some ideas--hopefully helpful,

matt

 

 

We had a similar complaint some time back, and an instance of PAC running against the PBX was causing the issue. After they closed and restarted the PAC, things were fine.


We had a similar complaint some time back, and an instance of PAC running against the PBX was causing the issue. After they closed and restarted the PAC, things were fine.

 

Same here yesterday, around 2 PM on the 28th.

Rebooting, killing the app, removing the license, trying the 3-minute key... nothing, absolutely nothing worked...

 

I manually disabled pbxctrl and manually uninstalled the .NET Framework 3.5 and 3.0 upgrades, rebooted the machine and restarted the server, and everything seems to be fine since...

 

I suspect that something is going wrong with pbxctrl, since it is targeting a specific processor core, and when pbxctrl is stopped that core goes back to normal.


Hi Jon,

 

(your name looks awfully familiar?)

 

anyway, to test if it's a license issue, how about just typing in a 3-minute key? if it works, you know what it is.

 

also, what happens if you install on another pc? takes but a minute to try...

 

just some ideas--hopefully helpful,

matt

Yes, it seems that we apparently followed you over here to PBXnSIP, Matt. ;)

 

I just tried a couple of different licenses on this machine, both commercial and free, with the same results. This license doesn't work on another machine (presumably because it's inherently linked to the MAC address of this server's NIC), but that machine just says "No License"; it doesn't lock up like this. So it seems to be something particular to this server, not the license itself (which had worked fine for at least a month before today). Thanks for the suggestions.

 

We had a similar complaint some time back, and an instance of PAC running against the PBX was causing the issue. After they closed and restarted the PAC, things were fine.

I can't say for sure that someone didn't leave a PAC open somewhere, but I changed the web port to a different number and restarted the service (which should have disconnected any running PAC client), with the same results. So I don't think it's the PAC. But thanks anyway.

 

Same here yesterday, around 2 PM on the 28th.

Rebooting, killing the app, removing the license, trying the 3-minute key... nothing, absolutely nothing worked...

 

I manually disabled pbxctrl and manually uninstalled the .NET Framework 3.5 and 3.0 upgrades, rebooted the machine and restarted the server, and everything seems to be fine since...

Now there's an interesting suggestion...

 

I do know that some Microsoft Updates had been performed on this machine recently (although not since last Friday), and sure enough, Add/Remove Programs does show .NET 3.0 and .NET 3.5 with a handful of updates on both.

 

However, after uninstalling both .NET 3.0 and .NET 3.5 entirely, and rebooting the server, the service still pegs out when started.

 

I suspect that something is going wrong with pbxctrl, since it is targeting a specific processor core, and when pbxctrl is stopped that core goes back to normal.

I'm not sure I understand what you're saying here, but interestingly, as you seem to imply, only one of the two processor cores is actually pegging out. Does this shed any light on what might be happening?

 

Thanks again for all the great suggestions, and please keep 'em coming! We are dead in the water!

 

Regards,

Jon Heese


Get a new 3-minute key and see if it will work on another computer...

 

Sounds like it is NOT a key issue... now let's see if it is a PC issue... ;-)

 

matt

Easier said than done, apparently...

 

I tried requesting a 3-minute trial key (http://www.pbxnsip.com/sales/trial.php) once last night and once again this morning and haven't gotten anything yet... Is this supposed to be an automated process, or am I waiting for a human being to see my request and provide a key?

 

I can just use any 3-minute key, right? Anyone have one handy they can send to me for some quick testing?

 

This is getting kind of desperate; is there any quicker way to have this addressed by support staff?

 

Incidentally, I don't think the key itself is the problem, since we have two different NFR keys for this machine, and both of them cause this issue. I suspect that if/when I get a 3-minute key and try to insert it, it will do the same thing. I think it's something on this machine that is causing the licensing mechanism to freak out when a key is entered.

 

Regards,

Jon Heese


there is a 3-minute key in your inbox.

 

did you try installing on a NEW machine? does that work?

 

I would exclude that it has to do with the 3-minute key. If you have a dual core, the PBX will, worst case, block only one core, which leaves the other core available to log in and check what is going on.

 

If you cannot log in, or the server does not respond to ping any more, then there is something very serious going on, and I would say there is a low probability that it has something to do with the application (the PBX).

 

If you can log in, take a look at the Task Manager. If the PBX consumes the CPU or the memory, then obviously the one responsible has been found and we need to dig deeper into why this happens. The PAC example was typical of what the problem can be, and then it is relatively easy to fix.

 

We have also seen other cases where a device driver freaked out and killed the machine. The fact that a piece of software is a device driver does not imply that it is of the best quality...


there is a 3-minute key in your inbox.

 

did you try installing on a NEW machine? does that work?

 

matt

Thanks for the key, Matt. It does the same thing.

 

Yes, I tried installing on a new machine, and of course it runs fine with the 3-minute license, but I can't insert my NFR licenses into it because they are keyed to the MAC address of this server's NIC.

 

I would exclude that it has to do with the 3-minute key. If you have a dual core, the PBX will, worst case, block only one core, which leaves the other core available to log in and check what is going on.

 

If you cannot log in, or the server does not respond to ping any more, then there is something very serious going on, and I would say there is a low probability that it has something to do with the application (the PBX).

 

If you can log in, take a look at the Task Manager. If the PBX consumes the CPU or the memory, then obviously the one responsible has been found and we need to dig deeper into why this happens. The PAC example was typical of what the problem can be, and then it is relatively easy to fix.

It is a dual-core server, so I can log in and watch it. The server responds to ping and other network traffic perfectly fine; the only problem is PBXnSIP, which maxes out one core of the CPU and totally stops responding to SIP, web, and service control requests. After killing the process, removing the license key from the pbx.xml file, and starting the service back up again, it responds fine (if I don't remove the license from pbx.xml, it spikes immediately upon starting the service). I'd say there is a 100% chance that it's the application. ;)

 

We have also seen other cases where a device driver freaked out and killed the machine. The fact that a piece of software is a device driver does not imply that it is of the best quality...

I'm not really sure what you mean by that last sentence, but Pradeep mentioned a possible bad USB driver in an e-mail to me a little while ago. There are no devices plugged into the server (USB or otherwise), and virtually nothing extraneous connected to it at all. It's just a Dell PowerEdge SC440 server with a network cable and a power cable plugged into it. No expansion cards at all, just a bare, stable machine (installed and running for 2-3 years) with PBXnSIP on it for the last month or two.

 

Regards,

Jon Heese


Okay, well, after looking at a Wireshark capture from this box, I saw a bunch of packets from an IP in Amsterdam, so I tried disabling the WAN adapter on that server, and voila, the CPU usage went back down to normal and everything was fine.

 

So, I blocked that IP address in the "Access" tab of the PBXnSIP config, re-enabled the WAN adapter and now everything's back up and running.

 

It appears that someone was hammering the PBX with approximately 250 SIP "REGISTER" packets per second, thus crippling the process with a bunch of failed login attempts. Of course, when the system had no license installed, it summarily dismissed all incoming SIP registrations, so the problem didn't rear its ugly head until we inserted a license key and it had to process each failed login.
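
If anyone wants to do the same check from a capture file rather than eyeballing it in Wireshark, here's a rough Python/scapy sketch that tallies REGISTER packets per source IP (the capture file name is just an example; it assumes SIP over UDP port 5060):

    # count_registers.py - tally SIP REGISTER packets per source IP in a pcap
    from collections import Counter
    from scapy.all import rdpcap, IP, UDP, Raw

    counts = Counter()
    for pkt in rdpcap("capture.pcap"):        # example file name
        if pkt.haslayer(IP) and pkt.haslayer(UDP) and pkt.haslayer(Raw):
            if pkt[UDP].dport == 5060 and bytes(pkt[Raw].load).startswith(b"REGISTER"):
                counts[pkt[IP].src] += 1

    for src, n in counts.most_common(10):
        print(src, n)

In our case, the Amsterdam address was obviously at the top of that list. (I believe the equivalent Wireshark display filter is sip.Method == "REGISTER", if you'd rather stay in the GUI.)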

 

Is this not a common occurrence? It seemed to be quite easy for this attacker to bring the system to its knees. Is the prescribed fix for this to do what we did, and block the attacking IP address with the "Access" feature?

 

What if we have this problem on one of the PBXnSIP appliances, like the cs410? We can't do a Wireshark capture on the device, so would we have to depend on the logging or hooking up a test computer with a hub to detect these malicious SIP packets? What would your suggestion be in that case?

 

Thanks to all who helped troubleshoot this and gave ideas.

 

Regards,

Jon Heese


Is this not a common occurrence? It seemed to be quite easy for this attacker to bring the system to its knees. Is the prescribed fix for this to do what we did, and block the attacking IP address with the "Access" feature?

 

What if we have this problem on one of the PBXnSIP appliances, like the cs410? We can't do a Wireshark capture on the device, so would we have to depend on the logging or hooking up a test computer with a hub to detect these malicious SIP packets? What would your suggestion be in that case?

 

So far this is not very common. But I am afraid it will become more common.

 

The IP address whitelist and blacklist were done for a reason, and yes, once you found the culprit they saved your day. What we can learn from this case is that we need to handle this more automatically. IMHO it is not enough to use iptables in Linux to address this problem. What we need in the next version is automatic blacklisting of addresses that fail to authenticate too many times.
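
Something along these lines, roughly speaking (just an illustration of the idea in Python, not what will actually ship; all the thresholds are made up):

    # sketch of failure-based auto-blacklisting - illustrative only
    import time
    from collections import defaultdict, deque

    MAX_FAILURES = 10        # allowed auth failures ...
    WINDOW = 60.0            # ... within this many seconds
    BAN_TIME = 3600.0        # how long an offender stays blacklisted

    failures = defaultdict(deque)   # ip -> timestamps of recent failures
    blacklist = {}                  # ip -> time the ban expires

    def auth_failed(ip):
        now = time.time()
        q = failures[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()                      # drop failures outside the window
        if len(q) >= MAX_FAILURES:
            blacklist[ip] = now + BAN_TIME

    def is_blocked(ip):
        expires = blacklist.get(ip)
        if expires is None:
            return False
        if time.time() > expires:
            del blacklist[ip]                # ban has expired
            return False
        return True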

 

Today the biggest limitation is usually the speed of the link to the Internet (for example, the typical CS410 case will be like that). However, as people move to hosted environments with very high-speed links, that "natural" protection will not be there any more.


So far this is not very common. But I am afraid it will become more common.

 

The IP address whitelist and blacklist were done for a reason, and yes, once you found the culprit they saved your day. What we can learn from this case is that we need to handle this more automatically. IMHO it is not enough to use iptables in Linux to address this problem. What we need in the next version is automatic blacklisting of addresses that fail to authenticate too many times.

 

Today the biggest limitation is usually the speed of the link to the Internet (for example, the typical CS410 case will be like that). However, as people move to hosted environments with very high-speed links, that "natural" protection will not be there any more.

I agree wholeheartedly. We were down for almost a full 24 hours chasing down .NET updates, license key problems, trunk issues, etc. If this had been a client's system instead of our own internal PBX, we would probably have lost the client, or at the very least had a lot of explaining to do as to why neither we nor the system were equipped to handle this situation more quickly.

 

I don't blame the PBX or its developers, though. A DoS is a DoS, and I probably should have recognized it earlier than I did. However, if the PBX were made smarter than me ;) (i.e. with a flow-control or auto-blacklisting function), then it could have saved my butt even if I didn't know what a UDP packet was...

 

I talked with Pradeep on the phone a little while ago, and he agrees that it should be possible to implement a more automatic way to mitigate this problem in future versions, so here's to hoping that makes it down the chain and into the software soon!

 

He also mentioned something I hadn't thought of yet with regard to the CS410: the possibility of running tcpdump on it via SSH to determine whether a similar symptom is being caused by a packet flood like this. Of course, considering that the CS410 undoubtedly runs on a single-core CPU, that may be easier said than done. An enterprise-level router/switch would be able to analyze the packets coming into the device, but most of our clients are quite small businesses (1-15 employees), so in that kind of situation we would probably just pull out a hub and a laptop and run Wireshark there.
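
For the record, a rough way to do that tally from a workstation, assuming the CS410 accepts SSH as root, its LAN interface is eth0, and tcpdump is on the box (all of which are assumptions on my part), would be something like:

    # flood_check.py - tally who is sending SIP traffic at the CS410, parsed from tcpdump over SSH
    import subprocess
    from collections import Counter

    # host name, login, interface and packet count are examples - adjust for your box
    cmd = ["ssh", "root@cs410", "tcpdump", "-n", "-l", "-i", "eth0",
           "-c", "2000", "udp", "dst", "port", "5060"]

    counts = Counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        fields = line.split()
        if len(fields) > 2 and fields[1] == "IP":
            src = fields[2].rsplit(".", 1)[0]   # "1.2.3.4.5060" -> "1.2.3.4"
            counts[src] += 1
    proc.wait()

    for src, n in counts.most_common(10):
        print(src, n)

That keeps the heavy lifting on the workstation; the CS410 only has to capture and print.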

 

Regards,

Jon Heese


I agree wholeheartedly. We were down for almost a full 24 hours chasing down .NET updates, license key problems, trunk issues, etc. If this had been a client's system instead of our own internal PBX, we would probably have lost the client, or at the very least had a lot of explaining to do as to why neither we nor the system were equipped to handle this situation more quickly.

 

I don't blame the PBX or its developers, though. A DoS is a DoS, and I probably should have recognized it earlier than I did. However, if the PBX were made smarter than me :) (i.e. with a flow-control or auto-blacklisting function), then it could have saved my butt even if I didn't know what a UDP packet was...

 

I talked with Pradeep on the phone a little while ago, and he agrees that it should be possible to implement a more automatic way to mitigate this problem in future versions, so here's to hoping that makes it down the chain and into the software soon!

 

He also mentioned something I hadn't thought of yet with regard to the CS410: the possibility of running tcpdump on it via SSH to determine whether a similar symptom is being caused by a packet flood like this. Of course, considering that the CS410 undoubtedly runs on a single-core CPU, that may be easier said than done. An enterprise-level router/switch would be able to analyze the packets coming into the device, but most of our clients are quite small businesses (1-15 employees), so in that kind of situation we would probably just pull out a hub and a laptop and run Wireshark there.

 

Not to scare you, but...

 

There are also other ways to DoS. If you are on the LAN, try ping -f, maybe from more than one computer. Okay, that's easy.

 

The other nasty thing is spraying packets at the PBX just to consume bandwidth. Especially for systems that don't have too much of it (e.g. sitting on a cable modem) and that don't have a QoS mechanism, it is easy to hurt those installations badly. The experience is that the audio quality will suffer significantly. You will be chasing this for more than just days.

 

We had a case some time ago where the other party accidentally crashed their application, but the media part was still alive (of course, that other party was not running the pbxnsip PBX ;) ). However, it was serious, because that media server kept sending well-formed RTP at the PBX constantly. It was simply consuming so much bandwidth that we could not make phone calls over that line without mmm-aaa-jjj-ooo-rrr quality problems. Just before we were ready to call our service provider and beg them to blacklist this IP address, the media server had mercy and rebooted. Even blacklisting that address on the PBX would not have changed anything.

 

Those who believe that ENUM and peer-to-peer will be the future should think about this scenario.

 

Check out MPLS. It is not so stupid.


Not to scare you, but...

 

There are also other ways to DoS. If you are on the LAN, try ping -f, maybe from more than one computer. Okay, that's easy.

 

The other nasty thing is spraying packets at the PBX just to consume bandwidth. Especially for systems that don't have too much of it (e.g. sitting on a cable modem) and that don't have a QoS mechanism, it is easy to hurt those installations badly. The experience is that the audio quality will suffer significantly. You will be chasing this for more than just days.

 

We had a case some time ago where the other party accidentally crashed their application, but the media part was still alive (of course, that other party was not running the pbxnsip PBX ;) ). However, it was serious, because that media server kept sending well-formed RTP at the PBX constantly. It was simply consuming so much bandwidth that we could not make phone calls over that line without mmm-aaa-jjj-ooo-rrr quality problems. Just before we were ready to call our service provider and beg them to blacklist this IP address, the media server had mercy and rebooted. Even blacklisting that address on the PBX would not have changed anything.

 

Those who believe that ENUM and peer-to-peer will be the future should think about this scenario.

 

Check out MPLS. It is not so stupid.

Right.

 

In a sense, we are lucky that this attack was "lightweight" enough that it wasn't using up all of our T1's bandwidth, and all it did was crash the PBX. We are probably going to ask our T1 provider to add a null route for this IP at the last hop on their side, just to make sure those packets aren't needlessly using up a portion of our bandwidth.

 

Thanks for the insight.

 

Regards,

Jon Heese


I had the same symptoms when someone decided to hook up a remote SIP phone through a router that was not SIP-aware. It crushed the PBX, and almost no bandwidth was used. I feel that we really do need some dynamic blacklisting for addresses that are causing a certain number of failures per minute.

