Why does the Linux split program behave strangely with large files (>20 GB)?

I’m running the following command on my Ubuntu machine:

split --number=l/5 /pathToSource.csv /pathToOutputDirectory
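For context, --number=l/5 asks split for five chunks while keeping every line whole, which is why the pieces need not be equal in size. A minimal illustration on a throwaway file (demo.txt and the part. prefix are just example names):

seq 1000 > demo.txt
split --number=l/5 demo.txt part.
wc -l part.*    # five pieces, each made of complete lines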

If I do an “ls”:

myUser@serverNAme:/pathToOutputDirectory> ls -la

total 21467452 
drwxr-xr-x 2 myUser group        4096 Jun 23 08:51 .
drwxrwxrwx 4 myUser group        4096 Jun 23 08:44 ..
-rw-r--r-- 1 myUser group 10353843231 Jun 23 08:48 aa
-rw-r--r-- 1 myUser group           0 Jun 23 08:48 ab
-rw-r--r-- 1 myUser group 11376663825 Jun 23 08:51 ac
-rw-r--r-- 1 myUser group           0 Jun 23 08:51 ad
-rw-r--r-- 1 myUser group   252141913 Jun 23 08:51 ae

If I do a “du” on the ab and ad files:

$ du -h ab ad
0   ab
0   ad

As you can see, split divided the file very unevenly. Does anyone know what’s going on? Could some unprintable character be tripping up split? Thank you. Best regards, Francisco.


Answer

While this is unusual data, with an average line length of 114137, I’m not sure that fully explains the issue. You have 21982648969 bytes of data, so each bucket that split tries to fill is 4396529793 bytes. That’s larger than 2^32, so I wondered whether we have a 32-bit overflow (the arithmetic is checked below); are you on a 32-bit or a 64-bit platform? Looking at the code, though, I don’t see an overflow issue. Note that you could anonymize and compress the data and provide the following file for download somewhere:

tr -c '\n' . < /pathToSource.csv | xz > /pathToSource.csv.xz
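For reference, the bucket arithmetic above can be checked with plain shell arithmetic; the inputs are the file sizes from the ls listing in the question:

echo $((10353843231 + 11376663825 + 252141913))   # 21982648969 bytes in total
echo $((21982648969 / 5))                         # 4396529793 bytes per l/5 bucket
echo $((2 ** 32))                                  # 4294967296, so each bucket exceeds 2^32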

It’s also worth specifying your coreutils version, since the implementation changed a bit between v8.8 and v8.13.
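To check which version you’re running, something like the following works (the printed string is just an example):

split --version | head -n1    # prints e.g. "split (GNU coreutils) 8.13"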
