Jump to content

Service down by too many open files?


ndemou

Recommended Posts

We had problems with calls not connecting or getting no audio today and when I examined the logs  I show many serious errors like these:

[1] ... Could not open cdr....bin for writing
[1] ... Port ...: Could not allocate sockets!
[0] ... UDP(IPv4): Could not open socket (EMFILE)

lsof | grep ^pbxctrl | wc -l showed more that 8160 lines. Looking at the output of lsof I noticed that 99% of the lines are like this:

COMMAND     PID  TID    USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
pbxctrl    1280 1307    root 1005u     IPv6           10107481       0t0        UDP *:49689

I wonder if so many open files(UDP sockets) is normal. It doesn't seem to be so I restarted pbxctrl and the problems went away (after a painful 5 minutes).

I'm monitoring the output of `lsof | grep ^pbxctrl | wc -l` and it shows me anywhere from 1200 to 1500 open files (with moderate traffic).

I've raised my soft & hard limit of open files from 1024 & 4096 to 4000 & 8000 with prlimit -n -p`pidof pbxctrl` --nofile=4000:8000

Some basic facts about my system:

 
Software-Version: 60.0.3 (CentOS32)
Build Date: May 1 2018 04:42:47
uname -a
Linux ... x86_64 x86_64 x86_64 GNU/Linux

Questions:

Have we hit a bug or is such an amount of open sockets normal? If it's normal what's the recommended configuration for CentOS?

Is the 5 minutes to restart normal? How could we cut it down?

Is it OK that I'm running 64bit CentOS but pbxctrl reports CentOS32?

 
Link to comment
Share on other sites

Hi,

Maybe this helps for one of your questions.

If the open files are the issue then the restart time will not come down unless this gets fixed, but if you had a restart time of 5 minutes even before this issue occurred then you will have to upgrade your cores on your server running the PBX. Increasing the RAM, and / or space on the server should also help if you have a lot of CDR's or PCAPs or recordings that might be pulling the processing speed down.

 

Link to comment
Share on other sites

Hi,

On the latest version of the PBX 61.0 and above you can set the recordings to get cleaned up after specified interval of times from this page: /reg_recording.htm

If you're not on the latest then you can manually delete them from the recordings folder under the PBX directory for now.

Link to comment
Share on other sites

  • 7 months later...

I've upgraded to 63.0.1 about 12 days ago and the output of lsof -n | grep pbxctrl has grown from about ~2000 lines to about ~3300 lines (IPv4 and IPv6 connections). I've set a new warning level at 4000 lines and will let you know if/when I'll hit it

Link to comment
Share on other sites

Unfortunately within a week 'lsof -n | grep pbxctrl' did give more than 4000 lines of output. Usage of the system is stable so I don't like that this number keeps climbing up week by week and I'm afraid of the system crashing again.

Looking at today's graphs I see these PEAKS:

~770 Registrations+Subscriptions, 
  52 call legs
 112 http(s) 112.
~ 10 for TFTP & LDAP together

In case it helps:

Examining the output of ngrep on port 5060 I see that I'm sending packets to 310 IP:port couples.

The output of lsof | grep pbxctrl is along the lines of this:

pbxctrl    1486 1536    root  442u     IPv4           18445564       0t0        UDP *:49460 
pbxctrl    1486 1536    root  443u     IPv4           18437010       0t0        TCP 1.2.3.4:https->9.10.11.12:53054 (ESTABLISHED)
pbxctrl    1486 1536    root  444u     IPv4           18445565       0t0        UDP *:49461 
pbxctrl    1486 1536    root  452u     IPv6           18472707       0t0        UDP *:49869 
pbxctrl    1486 1536    root  453u     IPv4           18506734       0t0        UDP *:62612 
pbxctrl    1486 1536    root  456u     IPv6           18445566       0t0        UDP *:49460 
pbxctrl    1486 1536    root  457u     IPv6           18445567       0t0        UDP *:49461 
pbxctrl    1486 1536    root  458u     IPv4           18472815       0t0        TCP 1.2.3.4:https->5.6.7.8:34917 (ESTABLISHED)
pbxctrl    1486 1536    root  461u     IPv4           18506735       0t0        UDP *:62613 
pbxctrl    1486 1536    root  472u     IPv4           18505204       0t0        UDP *:52692 
pbxctrl    1486 1536    root  473u     IPv4           18505205       0t0        UDP *:52693 
pbxctrl    1486 1536    root  478u     IPv4           18505208       0t0        UDP *:52192 
pbxctrl    1486 1536    root  479u     IPv4           18505209       0t0        UDP *:52193 
pbxctrl    1486 1536    root  482u     IPv4           18505038       0t0        UDP *:58870 
pbxctrl    1486 1536    root  483u     IPv4           18505039       0t0        UDP *:58871 
pbxctrl    1486 1536    root  484u     IPv4           18505042       0t0        UDP *:61862 
pbxctrl    1486 1536    root  485u     IPv4           18505043       0t0        UDP *:61863 

What's your opinion? Would you like more data?

Link to comment
Share on other sites

We have a few hundred extensions in small companies. Nothing unusual I would guess (some inter-domain, plenty external calls and a conference call from time to time). Surely no multicast paging.

Regarding attacks, I've run for a few minutes ngrep (a packet sniffer) looking for SIP replies that are not "SIP 200 OK" and I see nothing unusual.

Regarding IPv6 ports in lsof's output they account for1352 out of 3300 lines. But I should note that they contain only 169 unique port numbers. E.g.:

lsof -n| grep ^pbxctrl | grep -i ipv6 |grep 49460
pbxctrl    1486         root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1530    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1531    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1532    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1533    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1534    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1535    root  456u     IPv6           18445566       0t0        UDP *:49460
pbxctrl    1486 1536    root  456u     IPv6           18445566       0t0        UDP *:49460

Should I send you log files or anything else?

Link to comment
Share on other sites

Quote

Looks like those port all belong to the same process

They certainly belong to the same process: pbxctrl with PID 1486. But that's to be expected since I'm looking specifically for pbxctrl.

Quote

what is the "TID" in the lsof output?

Not sure what you're asking here so I'll provide anything I can think of: "man lsof" reads: "TID is the task (thread) IDentification number". In the example output above the 3rd column is the TID column (empty for the first line and 1530 to 1536 for the next 7 lines). B.T.W. I see the same pattern for almost all other ports: each port appears 8 times, once for the main process and another 7 times for threads.

I've restarted pbxctrl and now I'm running lsof -nP | grep pbxctrl every hour and logging the full output on a series of timestamped files. Just in case it helps.

Link to comment
Share on other sites

  • 4 weeks later...

I'm logging the output of lsof for the pbxctrl process every hour for about a month. It is clear that the number of lines of output grows by about 20% per week. I've just restarted pbxctrl as a precaution.

I'm attaching a graph. The y-axis is the number of lines of the output of the command: lsof | grep pbxctrl . Weekends and holidays are painted in green color. Most of my customers are businesses so I have many calls with a peak at about midday during working days and very few calls during weekends and holidays.

I confess I haven't dived into the details of the output of lsof. However I'm almost sure that no matter the exact meaning it doesn't seem right for the number of these lines to grow by ~4% every working day. Most likely a bug in pbxctrl and less likely one in lsof. I need your help for further debugging though. I have kept all the log files and the full output of lsof for this period. Anything else I can do?

lsof.thumb.png.a91c215298a2c114ffcbca6dbf248203.png

Link to comment
Share on other sites

I don't think there is a bug in lsof. In the previous post it looked like there were multiple listings for the same port - which is of course raising questions. What would make sense is to check /proc/xxx/fd if that matches the output of lsof in terms of open sockets.

Link to comment
Share on other sites

First of all many thanks for the help 😀
Now, regarding the correlation between the output of lsof and the contents of /proc/`pidof pbxctrl`/fd: No they don't match and the difference is clearly on the IP sockets (see [3]).The reason for the difference is that lsof gives 7 lines of output for each socket. So if I count only unique socket ids then both outputs match:

$ lsof -nP | grep 'pbxctrl' | grep 'IPv' | awk -F' ' '{print $7 }' | sort -n | uniq | wc -l
116
$ ls -l /proc/`pidof pbxctrl`/fd | grep 'socket' | wc -l
116

A quick check[1] on my saved lsof outputs during the past few weeks reveals that this factor (x7) is constant.  And another check [2] reveals that:

The unique socket ids at the output of lsof (that match the number of sockets in /proc/.../fd) are increasing with the pattern mentioned in my previous post (i.e. ~4% increase during working days and no increase during weekends and holidays). By the way: when you exclude sockets the number of "files" left is more-or-less constant during these  weeks that I'm testing (97 open "files" with a rare peak to at most 137 now and then).

Here's a plot of unique socket ids per hour for two weeks. Weekends are colored green (it's of-course the same as above but scaled down by a factor of 7 but I like pictures :-)

1050600787_Screenshotfrom2019-07-0819-54-26.png.50b5bae9daebdab1aefb5af056590624.png

______________________________

[1] ls lsof.2019-0*|sort | while read FILE; do cat $FILE | grep 'IPv' | awk -F' ' '{print $7 }' | grep -v '0t0' | sort -n | uniq -c | awk -F' ' '{print $1 }' | uniq -c | sort -nr | head -1; done

[2] ls lsof.2019-0*|sort | while read FILE; do echo -n "$FILE "; cat $FILE | grep 'IPv' | awk -F' ' '{print $7 }' | grep -v '0t0' | sort -n | uniq | wc -l; done

[3]

$ ls -l /proc/`pidof pbxctrl`/fd | grep 'socket'|wc -l
125
$ lsof -nP | grep 'pbxctrl' | grep 'IPv'|wc -l
992

Link to comment
Share on other sites

I do trust in /proc/xxx/fd. There must be some kind of operation that is causing this - e.g. some problem with a FAX or some problem with streaming MoH that is causing this. Other services do not seem to have this problem. Is there any clue in this direction? Do you have "expired" files in the working directors? The ports seem to be on RTP ports, so this must be something media related.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...