Jump to content
Vodia PBX forum
ndemou

Service down by too many open files?

Recommended Posts

We had problems with calls not connecting or getting no audio today and when I examined the logs  I show many serious errors like these:

[1] ... Could not open cdr....bin for writing
[1] ... Port ...: Could not allocate sockets!
[0] ... UDP(IPv4): Could not open socket (EMFILE)

lsof | grep ^pbxctrl | wc -l showed more that 8160 lines. Looking at the output of lsof I noticed that 99% of the lines are like this:

COMMAND     PID  TID    USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
pbxctrl    1280 1307    root 1005u     IPv6           10107481       0t0        UDP *:49689

I wonder if so many open files(UDP sockets) is normal. It doesn't seem to be so I restarted pbxctrl and the problems went away (after a painful 5 minutes).

I'm monitoring the output of `lsof | grep ^pbxctrl | wc -l` and it shows me anywhere from 1200 to 1500 open files (with moderate traffic).

I've raised my soft & hard limit of open files from 1024 & 4096 to 4000 & 8000 with prlimit -n -p`pidof pbxctrl` --nofile=4000:8000

Some basic facts about my system:

 
Software-Version: 60.0.3 (CentOS32)
Build Date: May 1 2018 04:42:47
uname -a
Linux ... x86_64 x86_64 x86_64 GNU/Linux

Questions:

Have we hit a bug or is such an amount of open sockets normal? If it's normal what's the recommended configuration for CentOS?

Is the 5 minutes to restart normal? How could we cut it down?

Is it OK that I'm running 64bit CentOS but pbxctrl reports CentOS32?

 

Share this post


Link to post
Share on other sites

There were a couple of problems in that area, especially for very busy systems. If you can move to 61.0 or maybe even 61.1 then those problems should be fixed.

Share this post


Link to post
Share on other sites

Thanks! Upgrading a production system is always tough but "a man's got to do...". Any ideas about my other 2 questions:

Is the 5 minutes to restart normal? How could we cut it down?

Is it OK that I'm running 64bit CentOS but pbxctrl reports CentOS32? 

Share this post


Link to post
Share on other sites

Hi,

Maybe this helps for one of your questions.

If the open files are the issue then the restart time will not come down unless this gets fixed, but if you had a restart time of 5 minutes even before this issue occurred then you will have to upgrade your cores on your server running the PBX. Increasing the RAM, and / or space on the server should also help if you have a lot of CDR's or PCAPs or recordings that might be pulling the processing speed down.

 

Share this post


Link to post
Share on other sites

When the PBX restarts, it needs to read the configuration files. Especially if there are a lot of CDR this can take its time. If you have a lot, well that can take time (the newer versions keep an index for the CDR tables which make this a lot faster). 

Share this post


Link to post
Share on other sites

Well keeping the number of CDRs left on the PBX low is definitely something we can accomplish. Is there a way to have the PBX clean up old recordings (e.g. remove recordings older than 48 hours). I already copy them to an ftp server anyway so I don't mind.

Share this post


Link to post
Share on other sites

Hi,

On the latest version of the PBX 61.0 and above you can set the recordings to get cleaned up after specified interval of times from this page: /reg_recording.htm

If you're not on the latest then you can manually delete them from the recordings folder under the PBX directory for now.

Share this post


Link to post
Share on other sites

Some Linux distributions keep the PBX from binding to port 80/443 for some time, which could be 5 minutes. When you restart the PBX check top if it CPU is actually busy.

Share this post


Link to post
Share on other sites

I've upgraded to 63.0.1 about 12 days ago and the output of lsof -n | grep pbxctrl has grown from about ~2000 lines to about ~3300 lines (IPv4 and IPv6 connections). I've set a new warning level at 4000 lines and will let you know if/when I'll hit it

Share this post


Link to post
Share on other sites

Well 2000 s quite a big number - but possible if you have a lot of registrations over TCP or TLS. UDP should be around the number of active calls x 4.

Share this post


Link to post
Share on other sites

Unfortunately within a week 'lsof -n | grep pbxctrl' did give more than 4000 lines of output. Usage of the system is stable so I don't like that this number keeps climbing up week by week and I'm afraid of the system crashing again.

Looking at today's graphs I see these PEAKS:

~770 Registrations+Subscriptions, 
  52 call legs
 112 http(s) 112.
~ 10 for TFTP & LDAP together

In case it helps:

Examining the output of ngrep on port 5060 I see that I'm sending packets to 310 IP:port couples.

The output of lsof | grep pbxctrl is along the lines of this:

pbxctrl    1486 1536    root  442u     IPv4           18445564       0t0        UDP *:49460 
pbxctrl    1486 1536    root  443u     IPv4           18437010       0t0        TCP 1.2.3.4:https->9.10.11.12:53054 (ESTABLISHED)
pbxctrl    1486 1536    root  444u     IPv4           18445565       0t0        UDP *:49461 
pbxctrl    1486 1536    root  452u     IPv6           18472707       0t0        UDP *:49869 
pbxctrl    1486 1536    root  453u     IPv4           18506734       0t0        UDP *:62612 
pbxctrl    1486 1536    root  456u     IPv6           18445566       0t0        UDP *:49460 
pbxctrl    1486 1536    root  457u     IPv6           18445567       0t0        UDP *:49461 
pbxctrl    1486 1536    root  458u     IPv4           18472815       0t0        TCP 1.2.3.4:https->5.6.7.8:34917 (ESTABLISHED)
pbxctrl    1486 1536    root  461u     IPv4           18506735       0t0        UDP *:62613 
pbxctrl    1486 1536    root  472u     IPv4           18505204       0t0        UDP *:52692 
pbxctrl    1486 1536    root  473u     IPv4           18505205       0t0        UDP *:52693 
pbxctrl    1486 1536    root  478u     IPv4           18505208       0t0        UDP *:52192 
pbxctrl    1486 1536    root  479u     IPv4           18505209       0t0        UDP *:52193 
pbxctrl    1486 1536    root  482u     IPv4           18505038       0t0        UDP *:58870 
pbxctrl    1486 1536    root  483u     IPv4           18505039       0t0        UDP *:58871 
pbxctrl    1486 1536    root  484u     IPv4           18505042       0t0        UDP *:61862 
pbxctrl    1486 1536    root  485u     IPv4           18505043       0t0        UDP *:61863 

What's your opinion? Would you like more data?

Share this post


Link to post
Share on other sites

Well that does look like there are a lot of dangling calls. Are you doing anything unusual? Like multicast page a lot? Or is the system under attack all the time? Seems there are also IPv6 ports open which is a sign that the call gets allocated but is not used later.

Share this post


Link to post
Share on other sites

We have a few hundred extensions in small companies. Nothing unusual I would guess (some inter-domain, plenty external calls and a conference call from time to time). Surely no multicast paging.

Regarding attacks, I've run for a few minutes ngrep (a packet sniffer) looking for SIP replies that are not "SIP 200 OK" and I see nothing unusual.

Regarding IPv6 ports in lsof's output they account for1352 out of 3300 lines. But I should note that they contain only 169 unique port numbers. E.g.:

lsof -n| grep ^pbxctrl | grep -i ipv6 |grep 49460
pbxctrl    1486         root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1530    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1531    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1532    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1533    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1534    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1535    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1536    root  456u     IPv6           18445566       0t0        UDP *:49460

Should I send you log files or anything else?

Share this post


Link to post
Share on other sites
Quote

Looks like those port all belong to the same process

They certainly belong to the same process: pbxctrl with PID 1486. But that's to be expected since I'm looking specifically for pbxctrl.

Quote

what is the "TID" in the lsof output?

Not sure what you're asking here so I'll provide anything I can think of: "man lsof" reads: "TID is the task (thread) IDentification number". In the example output above the 3rd column is the TID column (empty for the first line and 1530 to 1536 for the next 7 lines). B.T.W. I see the same pattern for almost all other ports: each port appears 8 times, once for the main process and another 7 times for threads.

I've restarted pbxctrl and now I'm running lsof -nP | grep pbxctrl every hour and logging the full output on a series of timestamped files. Just in case it helps.

Share this post


Link to post
Share on other sites

I was just wondering if there is a misinterpretation of the data - sockets are available on all threads and maybe it just shows those sockets multiple times. Maybe we need to filter them by the process ID and count only unique sockets.

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

×
×
  • Create New...