Grouping and deleting Files

I have to come up with a solution to delete all but the newest 2 files in a directory structure of our ownCloud. To be exact, it's the file versioning folder. The files in each folder are named with the following structure:

Filename.Ext.v[random_Number]

The hard part is that one folder holds versions of several different files, and I need to keep some versions of each.

For example, the contents of folder A:

HelloWorld.txt.v123
HelloWorld.txt.v555
HelloWorld.txt.v666
OtherFile.pdf.v143
OtherFile.pdf.v1453
OtherFile.pdf.v123
OtherFile.pdf.v14345
YetOtherFile.docx.v11113

In this case we have 3 “basefiles”, and I would have to keep the newest 2 files of each “basefile”.

I tried Python 3 with os.walk and a regex to filter out the basename. I tried built-in Linux tools like find with -ctime. I could also use bash.

But my real problem is more the logic. How would you approach this task?

EDIT 2: Here is my progress:

import os
from itertools import groupby

# Raw string, so the backslashes are not treated as escape sequences.
directory = r'C:\Users\x41\Desktop\Test'


def sorted_ls(directory):
    # List the directory entries ordered by modification time, oldest first.
    mtime = lambda f: os.stat(os.path.join(directory, f)).st_mtime
    return sorted(os.listdir(directory), key=mtime)

print(sorted_ls(directory))

# Group by base name and print everything except the last two entries per group.
for basename, group in groupby(sorted_ls(directory), lambda x: x.rsplit('.')[0]):
    finallist = list(group)
    print(finallist[:-2])

I am almost there. The function sorts the files in the directory by their mtime value. The suggested groupby() call then works on the output of my custom sort function.

The problem is that I had to drop the plain sort before groupby(), because it would undo my mtime ordering. But now groupby() returns more groups than expected.

If my sorted list looks like this:

['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']

I would get 3 groups. A, B, and A again. Any suggestions?

FINAL RESULT

Here is my final version, which also recurses into subdirectories:

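import os
from itertools import groupby

directory = r'C:\Users\x41\Desktop\Test'

for dirpath, dirs, files in os.walk(directory):
    output = []
    # Sort by name first so that groupby() sees each base name as one run.
    for basename, group in groupby(sorted(files), lambda x: x.rsplit('.')[0]):
        # Within a group, sort by mtime and collect all but the newest two.
        output.extend(sorted(group, key=lambda x: os.stat(os.path.join(dirpath, x)).st_mtime)[:-2])

    # Delete once per directory, after all groups have been collected.
    for file in output:
        os.remove(os.path.join(dirpath, file))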

Answer

You need to sort the file names alphabetically first, because groupby() only groups consecutive items that share the same key.
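A small sketch of that behaviour, using the example list from the question. The helper name `base` is only for illustration; it applies the same key as the question's lambda.

```python
from itertools import groupby

names = ['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']

def base(name):
    # Same key as in the question: everything before the first dot.
    return name.rsplit('.')[0]

# groupby() starts a new group whenever the key changes between neighbours.
unsorted_keys = [key for key, _ in groupby(names, base)]
sorted_keys = [key for key, _ in groupby(sorted(names), base)]

print(unsorted_keys)  # ['A', 'B', 'A']
print(sorted_keys)    # ['A', 'B']
```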

With each of the resulting file groups, you can then sort using your os.stat key as follows:

import os
from itertools import groupby

directory = r'C:\Users\x41\Desktop\Test'
output = []

# Sort by name so each base name forms one consecutive run for groupby().
for basename, group in groupby(sorted(os.listdir(directory)), lambda x: x.rsplit('.')[0]):
    # Sort each group by mtime and keep the two newest entries.
    output.extend(sorted(group, key=lambda x: os.stat(os.path.join(directory, x)).st_mtime)[-2:])

print(output)

This will produce a single list containing the latest two files from each group.
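If you want the opposite list (the files to delete), it may be worth collecting the candidates first and checking them before removing anything. The following is only a sketch under some assumptions: `stale_versions` and its `keep` parameter are hypothetical names, it groups with a dictionary instead of groupby() so no name sort is needed, and it strips only the trailing `.v<number>` part so that `HelloWorld.txt` and a possible `HelloWorld.pdf` are not merged into one group.

```python
import os
from collections import defaultdict

def stale_versions(directory, keep=2):
    """Return paths of all but the `keep` newest versions per base file."""
    groups = defaultdict(list)
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Split off only the last dot-separated part (the version suffix).
        base = name.rsplit('.', 1)[0]
        groups[base].append(path)

    stale = []
    for paths in groups.values():
        paths.sort(key=os.path.getmtime)  # oldest first
        # Everything except the last `keep` entries is considered stale.
        stale.extend(paths[:max(len(paths) - keep, 0)])
    return stale
```

Printing the returned list serves as a dry run; once it looks right, each path can be passed to os.remove().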
