Lsof (list open files) is a really useful tool for troubleshooting open file descriptors that prevent a deleted file from being released or a shared memory segment from being removed.
Here’s a little situation on Linux where an Oracle shared memory segment was not released as someone was still using it.
$ ipcs -ma

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 393216     oracle     640        289406976  1          dest
0xbfb94e30 425985     oracle     640        289406976  18
0x3cf13430 557058     oracle     660        423624704  22

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0xe2260ff0 1409024    oracle     640        154
0x9df96b74 1671169    oracle     660        154

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
The segment with shmid 393216 (status "dest", meaning it is marked for destruction once the last process detaches) should have disappeared after instance shutdown, but it didn’t. The "nattch" column (number of attached processes) shows that one process was still attached to the shared memory segment. So the segment was not released, and even the ipcrm command did not remove it (just as with normal files that someone still has open).
So I needed to identify which process was still using the memory segment. Had it been a normal existing file, I could have used the /sbin/fuser command to see which process still held it open, but that only works for files that still have a directory entry.
However, for deleted files, sockets and shared memory segments you can use the lsof command (it is normally installed by default on Linux, but on other Unixes you need to download and install it separately).
The SHM ID of that segment was 393216, as ipcs -ma showed, so I simply ran lsof to show all open file descriptors and grepped for that SHM ID:
$ lsof | egrep "393216|COMMAND"
COMMAND   PID   USER  FD   TYPE DEVICE SIZE   NODE NAME
python  18811 oracle DEL    REG    0,8      393216 /SYSVbfb94e30
See how the NODE column corresponds to SHM ID in ipcs output.
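Grepping the whole lsof output for the number can also match unrelated rows where 393216 happens to be a PID or a size. A slightly stricter filter is sketched below. It assumes the lsof layout shown above, where a System V segment has a NAME starting with /SYSV and the shmid sits in the column just before it. The function name pids_for_shmid is made up for this illustration.

```sh
# Print the PIDs of processes that lsof reports as attached to a given
# System V shared memory segment. Reads lsof output on stdin.
# Assumption: the segment row has a NAME beginning with /SYSV and the
# shmid in the field immediately before it, as in the output above.
pids_for_shmid() {
    awk -v id="$1" '
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^\/SYSV/ && $(i - 1) == id) { print $2; break }
        }' | sort -u
}

# Usage (needs lsof, and enough privileges to see other users' processes):
#   lsof | pids_for_shmid 393216
```

Reading from stdin keeps the filter independent of how lsof is invoked, so it can also be run against saved output.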
So I killed PID 18811, which was still attached to the SHM segment:
$ kill 18811
$ ipcs -ma

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0xbfb94e30 425985     oracle     640        289406976  18
0x3cf13430 557058     oracle     660        423624704  25

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0xe2260ff0 1409024    oracle     640        154
0x9df96b74 1671169    oracle     660        154

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
Now the shared memory segment is gone and its memory released.
Note that lsof is very useful for many other tasks as well. For example, it can list open sockets by network protocol, IP address, port and so on, so you can determine at the OS level which client a server process is talking to:
$ lsof -i:1521
COMMAND   PID   USER   FD   TYPE DEVICE SIZE NODE NAME
tnslsnr  6212 oracle  11u  IPv4  49486       TCP *:1521 (LISTEN)
tnslsnr  6212 oracle  13u  IPv4 276708       TCP linux03:1521->linux03:37277 (ESTABLISHED)
tnslsnr  6212 oracle  14u  IPv4 264894       TCP linux03:1521->linux03:41122 (ESTABLISHED)
oracle  22687 oracle  20u  IPv4 264893       TCP linux03:41122->linux03:1521 (ESTABLISHED)
oracle  25250 oracle  15u  IPv4 276707       TCP linux03:37277->linux03:1521 (ESTABLISHED)
oracle  25530 oracle  15u  IPv4 279910       TCP linux03:1521->192.168.247.1:nimsh (ESTABLISHED)
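Each connection between two local processes shows up twice in that listing, once per endpoint. To keep only the server side and print the remote peers, the output can be filtered as in the rough sketch below. The function name peers_of_port is hypothetical, and the filter assumes the column layout shown above, with the connection state as the last field and local->remote as the field before it.

```sh
# Print "PID COMMAND REMOTE" for every established connection whose
# local endpoint is the given port. Reads lsof -i output on stdin.
# Assumption: last field is the state, e.g. (ESTABLISHED), and the
# field before it is local->remote, as in the listing above.
peers_of_port() {
    awk -v port="$1" '
        $NF == "(ESTABLISHED)" {
            split($(NF - 1), ends, "->")
            if (ends[1] ~ (":" port "$")) print $2, $1, ends[2]
        }'
}

# Usage: -n and -P stop lsof from translating addresses and ports into
# names, so the local port stays numeric and the match is reliable.
#   lsof -nP -i TCP:1521 | peers_of_port 1521
```

Without -P, lsof may print a service name instead of the port number, in which case the port argument would have to be that name.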
Unfortunately lsof is not installed by default on classic Unixes, although in some shops the sysadmins have chosen to install it. Even then, it may not work for regular users, as lsof requires access to kernel memory structures through /dev/kmem or similar. If you can’t get access to lsof, other tools may be able to do some of the same tricks. For example, on Solaris there’s a useful command, pfiles, which shows the open files of a process, and since Solaris 9 (I think) it can also report the TCP connection endpoints of network sockets.
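On Linux specifically, a rough fallback when lsof is missing is to scan /proc/<pid>/maps, where an attached System V segment appears as a /SYSV... mapping with the shmid in the inode column. The sketch below assumes that layout. The function name shm_attached_pids and its optional second argument (an alternative proc root, handy for testing) are made up for this illustration. Without root privileges it only sees processes whose maps file is readable.

```sh
# Print the PIDs of processes that have a given System V shared memory
# segment mapped, by scanning /proc/<pid>/maps (Linux only).
# Assumption: a maps line looks like
#   address perms offset dev inode /SYSV<key> [(deleted)]
# and the inode column holds the shmid.
shm_attached_pids() {
    shmid=$1
    procroot=${2:-/proc}
    for maps in "$procroot"/[0-9]*/maps; do
        [ -r "$maps" ] || continue
        if awk -v id="$shmid" '
               $5 == id && $6 ~ /^\/SYSV/ { found = 1 }
               END { exit !found }' "$maps" 2>/dev/null
        then
            pid=${maps%/maps}
            echo "${pid##*/}"
        fi
    done
}

# Usage (run as root to see every process):
#   shm_attached_pids 393216
```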
Very cool demo. What was python doing in your SGA?
Thanks Chen :)
Python was reading some interesting stuff out of there ;)
But I’ve had cases where an Oracle server process fails to die during shutdown and keeps being attached to SGA…
Hi,
I am using Solaris 9. When I ran the ipcs -ma command there were several output lines, but Solaris doesn’t show a STATUS column. How can I investigate in this situation?
Hi Tanel,
How do you do… One of the background processes, SMON, wrote a HUGE trace file that left just 5% free space. And the rookie DBA rm-ed the file… instead of doing something like echo "" > bigtracefile.
This is in the first node of Exadata X2.
Oracle Support says no other option other than to kill SMON to free up the filesystem !
I will be hanged if I request a bounce…as we just moved from V1 to X2 2 months back.
Any ‘unsupported’ tricks you guys at Enkitec know of?!!
Thanks
P.S.:
lsof -p 16927 | egrep '^COMMAND|trc|trm'
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
oracle 16927 oracle 44w REG 253,2 48599035904 12880302 /u01/app/oracle/diag/rdbms/edaprd/edaprd1/trace/edaprd1_smon_16927.trc (deleted)
oracle 16927 oracle 45w REG 253,2 3661829175 12880303 /u01/app/oracle/diag/rdbms/edaprd/edaprd1/trace/edaprd1_smon_16927.trm (deleted)
@jc nars
What you can also do is to identify SMON’s spid (16927) and then:
1) ORADEBUG SETOSPID 16927
2) ORADEBUG CLOSE_TRACE
As this is SMON, I’m not too comfortable sending oradebug commands to it (wouldn’t want to crash it by any chance! :) but you can use this as your last option instead of restarting the instance (But do this at your own risk! :)
@jc nars
But yes, the next time I’d just truncate the file with “> filename.trc” … or make sure such traces aren’t dumped at all…
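A minimal sketch of that truncate-instead-of-rm approach follows. The function name truncate_in_place is made up for this illustration. Because the file keeps its directory entry and inode, the space is freed immediately even while another process still has it open. One caveat: if the writer did not open the file in append mode, it keeps writing at its old offset and the file becomes sparse.

```sh
# Empty files in place instead of unlinking them, so the space is
# released even if another process still has them open.
truncate_in_place() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            # Arithmetic expansion strips any padding wc may add.
            size=$(($(wc -c < "$f")))
            : > "$f"
            echo "truncated $f (was $size bytes)"
        else
            echo "skipped $f (not a regular file)" >&2
        fi
    done
}

# Usage:
#   truncate_in_place filename.trc
```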
Hi Tanel,
Thanks much. Following the note “Retrieve deleted files on Unix / Linux using File Descriptors [ID 444749.1]” we did a:
head -100 /proc/16927/fd/44 > /tmp/file1
Basically we had wanted to upload to Support the first few lines of the huge file…but after a few mins, the space was back in the filesystem !
I thank you again for taking time to respond to the blog comment.
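The /proc/<pid>/fd trick from that note can also find such files without lsof. The sketch below is Linux only and assumes the usual behaviour that a descriptor's symlink target ends in " (deleted)" once the file has been unlinked. The function name deleted_fds is hypothetical.

```sh
# List the file descriptors of a process that point at deleted files.
# Linux only: reads the symlinks under /proc/<pid>/fd.
# Prints "FD TARGET" for each descriptor whose target is marked deleted.
deleted_fds() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') echo "${fd##*/} $target" ;;
        esac
    done
}

# Usage, with the PID from the comment above:
#   deleted_fds 16927
#   head -100 /proc/16927/fd/44 > /tmp/file1   # copy out the first lines
```

Redirecting into the descriptor path (for example : > /proc/16927/fd/44) should truncate the deleted file in place on Linux, but that is untested against an Oracle background process here, so treat it as an at-your-own-risk option.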
Hi Tanel,
I ran into the same problem. Following your note, I found I can delete the shmid, but when I delete one, another shmid appears. What can I do?
Looking forward to your reply.