Service down by too many open files?

ndemou · October 15, 2018

We had problems with calls not connecting or getting no audio today and when I examined the logs I show many serious errors like these:

[1] ... Could not open cdr....bin for writing
[1] ... Port ...: Could not allocate sockets!
[0] ... UDP(IPv4): Could not open socket (EMFILE)

lsof | grep ^pbxctrl | wc -l showed more that 8160 lines. Looking at the output of lsof I noticed that 99% of the lines are like this:

COMMAND     PID  TID    USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
pbxctrl    1280 1307    root 1005u     IPv6           10107481       0t0        UDP *:49689

I wonder if so many open files(UDP sockets) is normal. It doesn't seem to be so I restarted pbxctrl and the problems went away (after a painful 5 minutes).

I'm monitoring the output of `lsof | grep ^pbxctrl | wc -l` and it shows me anywhere from 1200 to 1500 open files (with moderate traffic).

I've raised my soft & hard limit of open files from 1024 & 4096 to 4000 & 8000 with prlimit -n -p`pidof pbxctrl` --nofile=4000:8000

Some basic facts about my system:

Software-Version: 60.0.3 (CentOS32)
Build Date: May 1 2018 04:42:47
uname -a
Linux ... x86_64 x86_64 x86_64 GNU/Linux

Questions:

Have we hit a bug or is such an amount of open sockets normal? If it's normal what's the recommended configuration for CentOS?

Is the 5 minutes to restart normal? How could we cut it down?

Is it OK that I'm running 64bit CentOS but pbxctrl reports CentOS32?

Vodia PBX · October 15, 2018

There were a couple of problems in that area, especially for very busy systems. If you can move to 61.0 or maybe even 61.1 then those problems should be fixed.

ndemou · October 15, 2018

Thanks! Upgrading a production system is always tough but "a man's got to do...". Any ideas about my other 2 questions:

Is the 5 minutes to restart normal? How could we cut it down?

Is it OK that I'm running 64bit CentOS but pbxctrl reports CentOS32?

Support · October 15, 2018

Hi,

Maybe this helps for one of your questions.

If the open files are the issue then the restart time will not come down unless this gets fixed, but if you had a restart time of 5 minutes even before this issue occurred then you will have to upgrade your cores on your server running the PBX. Increasing the RAM, and / or space on the server should also help if you have a lot of CDR's or PCAPs or recordings that might be pulling the processing speed down.

Vodia PBX · October 15, 2018

When the PBX restarts, it needs to read the configuration files. Especially if there are a lot of CDR this can take its time. If you have a lot, well that can take time (the newer versions keep an index for the CDR tables which make this a lot faster).

ndemou · October 15, 2018

Well keeping the number of CDRs left on the PBX low is definitely something we can accomplish. Is there a way to have the PBX clean up old recordings (e.g. remove recordings older than 48 hours). I already copy them to an ftp server anyway so I don't mind.

Support · October 15, 2018

Hi,

On the latest version of the PBX 61.0 and above you can set the recordings to get cleaned up after specified interval of times from this page: /reg_recording.htm

If you're not on the latest then you can manually delete them from the recordings folder under the PBX directory for now.

Vodia PBX · October 15, 2018

Some Linux distributions keep the PBX from binding to port 80/443 for some time, which could be 5 minutes. When you restart the PBX check top if it CPU is actually busy.

ndemou · October 15, 2018

Thanks everybody for the advice.

ndemou · June 7, 2019

I've upgraded to 63.0.1 about 12 days ago and the output of lsof -n | grep pbxctrl has grown from about ~2000 lines to about ~3300 lines (IPv4 and IPv6 connections). I've set a new warning level at 4000 lines and will let you know if/when I'll hit it

Vodia PBX · June 9, 2019

Well 2000 s quite a big number - but possible if you have a lot of registrations over TCP or TLS. UDP should be around the number of active calls x 4.

ndemou · June 11, 2019

Unfortunately within a week 'lsof -n | grep pbxctrl' did give more than 4000 lines of output. Usage of the system is stable so I don't like that this number keeps climbing up week by week and I'm afraid of the system crashing again.

Looking at today's graphs I see these PEAKS:

~770 Registrations+Subscriptions, 
  52 call legs
 112 http(s) 112.
~ 10 for TFTP & LDAP together

In case it helps:

Examining the output of ngrep on port 5060 I see that I'm sending packets to 310 IP:port couples.

The output of lsof | grep pbxctrl is along the lines of this:

pbxctrl    1486 1536    root  442u     IPv4           18445564       0t0        UDP *:49460 
pbxctrl    1486 1536    root  443u     IPv4           18437010       0t0        TCP 1.2.3.4:https->9.10.11.12:53054 (ESTABLISHED)
pbxctrl    1486 1536    root  444u     IPv4           18445565       0t0        UDP *:49461 
pbxctrl    1486 1536    root  452u     IPv6           18472707       0t0        UDP *:49869 
pbxctrl    1486 1536    root  453u     IPv4           18506734       0t0        UDP *:62612 
pbxctrl    1486 1536    root  456u     IPv6           18445566       0t0        UDP *:49460 
pbxctrl    1486 1536    root  457u     IPv6           18445567       0t0        UDP *:49461 
pbxctrl    1486 1536    root  458u     IPv4           18472815       0t0        TCP 1.2.3.4:https->5.6.7.8:34917 (ESTABLISHED)
pbxctrl    1486 1536    root  461u     IPv4           18506735       0t0        UDP *:62613 
pbxctrl    1486 1536    root  472u     IPv4           18505204       0t0        UDP *:52692 
pbxctrl    1486 1536    root  473u     IPv4           18505205       0t0        UDP *:52693 
pbxctrl    1486 1536    root  478u     IPv4           18505208       0t0        UDP *:52192 
pbxctrl    1486 1536    root  479u     IPv4           18505209       0t0        UDP *:52193 
pbxctrl    1486 1536    root  482u     IPv4           18505038       0t0        UDP *:58870 
pbxctrl    1486 1536    root  483u     IPv4           18505039       0t0        UDP *:58871 
pbxctrl    1486 1536    root  484u     IPv4           18505042       0t0        UDP *:61862 
pbxctrl    1486 1536    root  485u     IPv4           18505043       0t0        UDP *:61863

What's your opinion? Would you like more data?

Vodia PBX · June 11, 2019

Well that does look like there are a lot of dangling calls. Are you doing anything unusual? Like multicast page a lot? Or is the system under attack all the time? Seems there are also IPv6 ports open which is a sign that the call gets allocated but is not used later.

ndemou · June 11, 2019

We have a few hundred extensions in small companies. Nothing unusual I would guess (some inter-domain, plenty external calls and a conference call from time to time). Surely no multicast paging.

Regarding attacks, I've run for a few minutes ngrep (a packet sniffer) looking for SIP replies that are not "SIP 200 OK" and I see nothing unusual.

Regarding IPv6 ports in lsof's output they account for1352 out of 3300 lines. But I should note that they contain only 169 unique port numbers. E.g.:

lsof -n| grep ^pbxctrl | grep -i ipv6 |grep 49460
pbxctrl    1486         root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1530    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1531    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1532    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1533    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1534    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1535    root 456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1536    root 456u     IPv6           18445566       0t0        UDP *:49460

Should I send you log files or anything else?

Vodia PBX · June 11, 2019

Looks like those port all belong to the same process - what is the "TID" in the lsof output?

ndemou · June 12, 2019

Quote

Looks like those port all belong to the same process

They certainly belong to the same process: pbxctrl with PID 1486. But that's to be expected since I'm looking specifically for pbxctrl.

Quote

what is the "TID" in the lsof output?

Not sure what you're asking here so I'll provide anything I can think of: "man lsof" reads: "TID is the task (thread) IDentification number". In the example output above the 3rd column is the TID column (empty for the first line and 1530 to 1536 for the next 7 lines). B.T.W. I see the same pattern for almost all other ports: each port appears 8 times, once for the main process and another 7 times for threads.

I've restarted pbxctrl and now I'm running lsof -nP | grep pbxctrl every hour and logging the full output on a series of timestamped files. Just in case it helps.

Vodia PBX · June 12, 2019

I was just wondering if there is a misinterpretation of the data - sockets are available on all threads and maybe it just shows those sockets multiple times. Maybe we need to filter them by the process ID and count only unique sockets.

ndemou · June 12, 2019

Hmmm good thinking. I'll have to read a bit more on lsof and linux internals. Will let you know

ndemou · July 7, 2019

I'm logging the output of lsof for the pbxctrl process every hour for about a month. It is clear that the number of lines of output grows by about 20% per week. I've just restarted pbxctrl as a precaution.

I'm attaching a graph. The y-axis is the number of lines of the output of the command: lsof | grep pbxctrl . Weekends and holidays are painted in green color. Most of my customers are businesses so I have many calls with a peak at about midday during working days and very few calls during weekends and holidays.

I confess I haven't dived into the details of the output of lsof. However I'm almost sure that no matter the exact meaning it doesn't seem right for the number of these lines to grow by ~4% every working day. Most likely a bug in pbxctrl and less likely one in lsof. I need your help for further debugging though. I have kept all the log files and the full output of lsof for this period. Anything else I can do?

Vodia PBX · July 8, 2019

I don't think there is a bug in lsof. In the previous post it looked like there were multiple listings for the same port - which is of course raising questions. What would make sense is to check /proc/xxx/fd if that matches the output of lsof in terms of open sockets.

ndemou · July 8, 2019

First of all many thanks for the help
Now, regarding the correlation between the output of lsof and the contents of /proc/`pidof pbxctrl`/fd: No they don't match and the difference is clearly on the IP sockets (see [3]).The reason for the difference is that lsof gives 7 lines of output for each socket. So if I count only unique socket ids then both outputs match:

$ lsof -nP | grep 'pbxctrl' | grep 'IPv' | awk -F' ' '{print $7 }' | sort -n | uniq | wc -l
116
$ ls -l /proc/`pidof pbxctrl`/fd | grep 'socket' | wc -l
116

A quick check[1] on my saved lsof outputs during the past few weeks reveals that this factor (x7) is constant. And another check [2] reveals that:

The unique socket ids at the output of lsof (that match the number of sockets in /proc/.../fd) are increasing with the pattern mentioned in my previous post (i.e. ~4% increase during working days and no increase during weekends and holidays). By the way: when you exclude sockets the number of "files" left is more-or-less constant during these weeks that I'm testing (97 open "files" with a rare peak to at most 137 now and then).

Here's a plot of unique socket ids per hour for two weeks. Weekends are colored green (it's of-course the same as above but scaled down by a factor of 7 but I like pictures :-)

1050600787_Screenshotfrom2019-07-0819-54-26.png.50b5bae9daebdab1aefb5af056590624.png

______________________________

[1] ls lsof.2019-0*|sort | while read FILE; do cat $FILE | grep 'IPv' | awk -F' ' '{print $7 }' | grep -v '0t0' | sort -n | uniq -c | awk -F' ' '{print $1 }' | uniq -c | sort -nr | head -1; done

[2] ls lsof.2019-0*|sort | while read FILE; do echo -n "$FILE "; cat $FILE | grep 'IPv' | awk -F' ' '{print $7 }' | grep -v '0t0' | sort -n | uniq | wc -l; done

[3]

ndemou · July 10, 2019

Any ideas at all?

Vodia PBX · July 10, 2019

I do trust in /proc/xxx/fd. There must be some kind of operation that is causing this - e.g. some problem with a FAX or some problem with streaming MoH that is causing this. Other services do not seem to have this problem. Is there any clue in this direction? Do you have "expired" files in the working directors? The ports seem to be on RTP ports, so this must be something media related.

Service down by too many open files?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation