Linux ate my RAM

I know this looks like a duplicate of many other posts here, but I beg to differ:
to my not very experienced eyes, this doesn’t seem to be a caching problem.

We have a server with 256G of RAM whose usage is always at 95% to 99%, with no apparent process using it.
At the same time, when anything memory-heavy runs, the swap starts to fill up and the server quickly becomes unresponsive.
Even after a clean reboot, the memory is immediately full.
When booting into recovery mode, the memory usage is normal.

To give some clues about the issue, here is the output of cat /proc/meminfo:

MemTotal:       263702068 kB
MemFree:          655500 kB
MemAvailable:          0 kB
Buffers:            3248 kB
Cached:            70244 kB
SwapCached:         3108 kB
Active:            22584 kB
Inactive:          49740 kB
Active(anon):       8596 kB
Inactive(anon):    19848 kB
Active(file):      13988 kB
Inactive(file):    29892 kB
Unevictable:      134096 kB
Mlocked:          126596 kB
SwapTotal:       2097148 kB
SwapFree:         488472 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        131292 kB
Mapped:            61476 kB
Shmem:              7532 kB
KReclaimable:     122832 kB
Slab:            1580428 kB
SReclaimable:     122832 kB
SUnreclaim:      1457596 kB
KernelStack:       21600 kB
PageTables:        11700 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6021908 kB
Committed_AS:    2924960 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      502876 kB
VmallocChunk:          0 kB
Percpu:           142912 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:     244
HugePages_Free:      244
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        255852544 kB
DirectMap4k:      967040 kB
DirectMap2M:     8079360 kB
DirectMap1G:    261095424 kB

Here is the output of free -h:

              total        used        free      shared  buff/cache   available
Mem:          251Gi       250Gi       650Mi       7,0Mi       177Mi       614Mi
Swap:         2,0Gi       1,9Gi        82Mi

And here is the output of slabtop -s c:

 Active / Total Objects (% used)    : 2816030 / 2856667 (98,6%)
 Active / Total Slabs (% used)      : 62599 / 62599 (100,0%)
 Active / Total Caches (% used)     : 123 / 183 (67,2%)
 Active / Total Size (% used)       : 618859,66K / 630011,16K (98,2%)
 Minimum / Average / Maximum Object : 0,01K / 0,22K / 12,00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 12120  12120 100%    8,00K   3030        4     96960K kmalloc-8k
513440 513440 100%    0,12K  16045       32     64180K scsi_sense_cache
112192 104705  93%    0,50K   3506       32     56096K kmalloc-512
 65484  64946  99%    0,62K   1284       51     41088K inode_cache
 37800  37705  99%    0,57K    675       56     21600K radix_tree_node
  1998   1985  99%    8,12K    666        3     21312K task_struct
  4848   4812  99%    4,00K    606        8     19392K kmalloc-4k
150272 150272 100%    0,12K   4696       32     18784K kernfs_node_cache
 96306  84047  87%    0,19K   2293       42     18344K dentry
  7872   7466  94%    2,00K    492       16     15744K kmalloc-2k
 22274  21817  97%    0,70K    486       46     15552K proc_inode_cache
 15232  15029  98%    1,00K    476       32     15232K kmalloc-1k
152838 152838 100%    0,09K   3639       42     14556K kmalloc-96
 68308  68160  99%    0,20K   1752       39     14016K vm_area_struct
 12558  12248  97%    0,81K    322       39     10304K sock_inode_cache
130704 130704 100%    0,07K   2334       56      9336K Acpi-Operand
  2320   2280  98%    4,00K    290        8      9280K biovec-max
  3855   3855 100%    2,06K    257       15      8224K sighand_cache
 31648  30708  97%    0,25K    989       32      7912K filp
125952 123874  98%    0,06K   1968       64      7872K kmalloc-64
  6129   5643  92%    1,15K    227       27      7264K ext4_inode_cache
135150 133752  98%    0,05K   1590       85      6360K ftrace_event_field
 12352  12192  98%    0,50K    386       32      6176K skbuff_fclone_cache
  5348   5348 100%    1,12K    191       28      6112K signal_cache
 92800  92800 100%    0,06K   1450       64      5800K anon_vma_chain
 49452  49273  99%    0,10K   1268       39      5072K anon_vma
157184 156203  99%    0,03K   1228      128      4912K kmalloc-32
 19424  17992  92%    0,25K    607       32      4856K kmalloc-256
  6106   6106 100%    0,74K    142       43      4544K shmem_inode_cache
  4260   4260 100%    1,00K    134       32      4288K kmalloc-cg-1k
  3780   3780 100%    1,06K    126       30      4032K mm_struct
  5566   5566 100%    0,69K    121       46      3872K files_cache
   944    944 100%    4,00K    118        8      3776K kmalloc-cg-4k
 14656  14656 100%    0,25K    458       32      3664K pool_workqueue
  2912   2912 100%    1,19K    112       26      3584K perf_event
   896    896 100%    4,00K    112        8      3584K names_cache
  3520   3520 100%    1,00K    110       32      3520K biovec-64
  1744   1744 100%    2,00K    109       16      3488K kmalloc-cg-2k
  1728   1728 100%    2,00K    108       16      3456K biovec-128
  3240   3240 100%    1,06K    108       30      3456K UNIX
  2808   2626  93%    1,19K    108       26      3456K RAWv6
 17388  17209  98%    0,19K    414       42      3312K kmalloc-192
  4743   4743 100%    0,62K     93       51      2976K task_group
 28743  28229  98%    0,10K    737       39      2948K buffer_head
  2944   2688  91%    1,00K     92       32      2944K RAW
 10208  10089  98%    0,25K    319       32      2552K skbuff_head_cache
 39808  36968  92%    0,06K    622       64      2488K vmap_area
  1008   1008 100%    2,19K     72       14      2304K TCP
  3840   3840 100%    0,50K    120       32      1920K kmalloc-cg-512

Any help is much appreciated.

Edit 1: Including output of smem -tw

Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory     262358340     199108  262159232
userspace memory             322752     173264     149488
free memory                 1020956    1020956          0
----------------------------------------------------------
                          263702048    1393328  262308720

Edit 2: The output of top -o VIRT -b -n 1 is in the link below because the post became too long. (Maybe it gives a clearer clue as to how to diagnose this problem.)

Top.txt on Dropbox

Edit 3: Added output of ipcs -ma

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
Asked By: Forrest

||

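You have allocated 244 huge pages (HugePages_Total in /proc/meminfo) with a size of 1 GB per page (Hugepagesize), which amounts to 244 GB (Hugetlb) of the roughly 251 GB reported as MemTotal. These pages are reserved up front and excluded from normal memory allocation, so they count as used even though none of them is in use (HugePages_Free is also 244). Check /proc/sys/vm/nr_hugepages or the hugepages kernel boot command line parameter. See the kernel documentation on HugeTLB pages for details.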
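
The arithmetic above can be checked with a short script. This is only a sketch: the function names parse_meminfo and hugepage_reservation are made up for illustration, and SAMPLE is a handful of lines copied from the /proc/meminfo output in the question. It multiplies HugePages_Total by Hugepagesize and compares the result with MemTotal.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Values are in kB where the file gives a unit, and plain counts
    otherwise (for example HugePages_Total).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip lines that are not "Name: value" pairs
        fields[name.strip()] = int(rest.split()[0])
    return fields


def hugepage_reservation(fields):
    """Return (reserved kB, share of MemTotal) for the default huge page pool."""
    reserved_kb = fields["HugePages_Total"] * fields["Hugepagesize"]
    return reserved_kb, reserved_kb / fields["MemTotal"]


# Lines copied from the question's /proc/meminfo output.
SAMPLE = """\
MemTotal:       263702068 kB
HugePages_Total:     244
HugePages_Free:      244
Hugepagesize:    1048576 kB
Hugetlb:        255852544 kB
"""

fields = parse_meminfo(SAMPLE)
reserved_kb, share = hugepage_reservation(fields)
# prints: 244 GiB reserved, 97.0% of MemTotal
print(f"{reserved_kb / 1024**2:.0f} GiB reserved, {share:.1%} of MemTotal")
```

To run the same check on a live system, the contents of /proc/meminfo can be passed to parse_meminfo instead of SAMPLE. A result close to 100% with HugePages_Free equal to HugePages_Total would point to the same situation as in the question.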

Answered By: AlexD