
Service down


ndemou


Today the service went down. The ps output showed the process still running, and netstat -anp showed it bound to the network interface. However, we could not place any calls and could not open the web interface. The logs had hundreds of messages like these:

 

[5] 20150920121742: Did not receive ACK, disconnecting call 2684293046@192.168.1.23
[5] 20150920121742: Did not receive ACK, disconnecting call 2e3d18e0-15eb03@127.0.0.1
[4] 20150920121742: 11 registration messages pending
[4] 20150920121748: Last message repeated 13 times

A simple /etc/init.d/pbx restart was enough to restore the service. We're happy it did but we feel kind of scared.
This was version 5.2.4 on CentOS 32bit. We need your help to find the root cause.
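
For anyone triaging a similar log, a rough tally of message kinds makes it easier to see what dominates. The following is a minimal Python sketch, assuming the line layout `[level] timestamp: message` inferred from the excerpt above. The function name `tally` and the `<call-id>` placeholder are illustrative and not part of the PBX.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt: "[level] YYYYMMDDhhmmss: message"
LINE_RE = re.compile(r"^\[(\d+)\] (\d{14}): (.*)$")

def tally(lines):
    """Count messages by kind, collapsing call IDs so similar lines group together."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        message = m.group(3)
        # Replace the call ID so every ACK timeout counts as the same kind.
        kind = re.sub(r"(disconnecting call) \S+", r"\1 <call-id>", message)
        counts[kind] += 1
    return counts

sample = [
    "[5] 20150920121742: Did not receive ACK, disconnecting call 2684293046@192.168.1.23",
    "[5] 20150920121742: Did not receive ACK, disconnecting call 2e3d18e0-15eb03@127.0.0.1",
    "[4] 20150920121742: 11 registration messages pending",
    "[4] 20150920121748: Last message repeated 13 times",
]

for kind, n in tally(sample).most_common():
    print(n, kind)
```

Run against the full log file instead of `sample`, this would show how many ACK timeouts there were relative to the pending-registration messages.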


No problem with a full file system (we've plenty of space available). Note that the only thing we did to solve the issue was to restart the service, and it has kept working since.

 

What about the log lines I've pasted? Do they reveal something? Would the full log help you, and if so, can I share it privately?

 

Do you want me to open a ticket for this?

 

_______________________

System Status Overview

Please use the information shown on this web page when you request help from the support team.
Software-Version: 5.2.4 (CentOS32)
Build Date: Aug 22 2014 06:11:46
License Status: Vodia PBX Hosted 5AY-T9L-MWT-13A
License Duration: Active subscription


Ahhhh okay sorry I now got it. You are right, this has nothing to do with the restart script; it seems that the PBX simply became unresponsive. It can happen for a short time when the non-realtime thread is crunching numbers (or doing a large table lookup); however, it should resume operations after a few seconds at most. If that is not the case, then this would be a lot more serious. If it should happen again, it would be great if you could generate a core dump, so that we can see what the problem is.

 

Also, there is a reason why we are building 64-bit versions. Maybe you have just exhausted the memory size limit. The problem is mainly that each thread takes up a lot of virtual memory space (not even physical), so memory allocation eventually fails and then things go south pretty quickly, with all sorts of effects. You can check with ps how many threads you have and how much virtual memory has been taken already, and depending on how it looks, upgrade to 64-bit.
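
The check suggested above can be done with ps itself (procps accepts output fields such as nlwp, vsz and rss). As an alternative, here is a minimal Python sketch that reads the same figures from /proc/<pid>/status on Linux. The helper names `parse_status` and `read_status` are hypothetical, and the sketch inspects its own process only as a stand-in for the PBX process.

```python
import os

WANTED = ("Threads", "VmSize", "VmRSS")

def parse_status(text):
    """Extract thread count and memory sizes (kB) from /proc/<pid>/status text."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in WANTED and value.split():
            # Memory fields look like "  609400 kB"; Threads is a bare integer.
            info[key] = int(value.split()[0])
    return info

def read_status(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())

if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # This script's own PID is only a stand-in; substitute the PID of the PBX process.
    stats = read_status(os.getpid())
    print(f"threads={stats['Threads']} "
          f"virtual={stats['VmSize'] // 1024}MB resident={stats['VmRSS'] // 1024}MB")
```

On a 32-bit build the number to watch is the virtual size rather than the resident size, since the limit described above applies to address space, not to physical RAM.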


Thanks for the detailed explanation. I only have 1GB of RAM so I've plenty of room to grow before moving to 64bit. I've installed atop with the default logging options and I'll keep an eye on memory usage. It already seems that memory is tight:

 

One observation of the system under moderate load shows that pbxctrl has allocated 610M of virtual memory, 480M of which is resident in real memory. Adding everything else (the kernel slab being considerable), I only have about 250MB of free memory. Unfortunately I didn't have atop or anything similar running before, so there are no logs to check, and the default graphs the system sends don't include RAM usage -- you could consider that a nice-to-have addition.

 

What's your opinion based on this preliminary data? (I'm attaching the output of atop and free below)

$ atop

PRC |  sys    0.41s |  user   1.02s  | #proc    100  |  #tslpi   109 |  #tslpu     0 |  #zombie    0  | #exit      0  |
CPU |  sys       4% |  user     11%  | irq       0%  |  idle     82% |  wait      2% |  curf 3.09GHz  | curscal   ?%  |
CPL |  avg1    0.46 |  avg5    0.38  | avg15   0.15  |  csw    24571 |  intr   13795 |                | numcpu     1  |
MEM |  tot     1.0G |  free   31.2M  | cache 129.1M  |  dirty   0.8M |  buff   90.0M |  slab  250.9M  |               |
SWP |  tot     1.0G |  free  991.7M  |               |               |               |  vmcom 367.1M  | vmlim   1.5G  |
LVM |  vgpbx-lvpbx1 |  busy      2%  | read       0  |  write    114 |  MBr/s   0.00 |  MBw/s   0.04  | avio 1.46 ms  |
LVM |  pbxrec-lvrec |  busy      1%  | read       0  |  write    251 |  MBr/s   0.00 |  MBw/s   0.10  | avio 0.54 ms  |
LVM |  xcdrt-lvcdrt |  busy      1%  | read       0  |  write     19 |  MBr/s   0.00 |  MBw/s   0.01  | avio 4.53 ms  |
LVM |  roup-lv_root |  busy      0%  | read       0  |  write     29 |  MBr/s   0.00 |  MBw/s   0.01  | avio 1.24 ms  |
LVM |  pbxlog-lvlog |  busy      0%  | read       0  |  write      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 6.25 ms  |
DSK |           vdb |  busy      2%  | read       0  |  write     99 |  MBr/s   0.00 |  MBw/s   0.04  | avio 1.69 ms  |
DSK |           vde |  busy      1%  | read       0  |  write     11 |  MBr/s   0.00 |  MBw/s   0.10  | avio 12.3 ms  |
DSK |           vdd |  busy      1%  | read       0  |  write      9 |  MBr/s   0.00 |  MBw/s   0.01  | avio 9.56 ms  |
DSK |           vda |  busy      0%  | read       0  |  write     26 |  MBr/s   0.00 |  MBw/s   0.01  | avio 1.38 ms  |
DSK |           vdc |  busy      0%  | read       0  |  write      4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 6.25 ms  |
NET |  transport    |  tcpi     150  | tcpo     113  |  udpi   12333 |  udpo   12347 |  tcpao      2  | tcppo      0  |
NET |  network      |  ipi    12504  | ipo    12465  |  ipfrw      0 |  deliv  12488 |  icmpi      2  | icmpo      3  |
NET |  eth0    ---- |  pcki   12406  | pcko   11962  |  si 1173 Kbps |  so  986 Kbps |  erri       0  | erro       0  |
NET |  eth1    ---- |  pcki     100  | pcko     504  |  si   15 Kbps |  so   85 Kbps |  erri       0  | erro       0  |

  PID   TID MINFLT MAJFLT VSTEXT VSLIBS  VDATA VSTACK   VSIZE  RSIZE  VGROW  RGROW SWAPSZ RUID     AMEM CMD         1/1
11463     -      0      0  6316K  3008K 600.1M    88K  609.4M 478.4M     0K     0K 25852K root      48% pbxctrl
26864     -    465      0   184K  2236K  2392K    88K   4980K  4972K     0K     0K     0K root       0% atop


$ free -m
             total       used       free     shared    buffers     cached
Mem:          1006        987         18          0         90        142
-/+ buffers/cache:        754        251
Swap:         1023         34        989
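
The figure of about 250MB quoted earlier can be reproduced from the output above by counting buffers and page cache as reclaimable. Here is a small Python sketch of that arithmetic, with the numbers copied by hand from the free -m and atop output (the variable names are illustrative):

```python
# Figures from the "Mem:" line of `free -m` above, in MB.
total, used, free, buffers, cached = 1006, 987, 18, 90, 142

# Buffers and page cache can be reclaimed by the kernel, so count them as available.
available = free + buffers + cached
in_use = used - buffers - cached

# The same estimate from the atop MEM line: free + cache + buff, in MB.
atop_available = round(31.2 + 129.1 + 90.0, 1)

print(available, in_use, atop_available)
```

The results land within 1MB of the 251 and 754 shown on the "-/+ buffers/cache" line, presumably because free rounds each column separately. The 250.9M of kernel slab reported by atop is not part of this sum.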