I have to come up with a solution to delete all but the newest 2 files in a directory structure of our ownCloud. To be exact, it's the file versioning folder. There are files in one folder with the following structure:
Filename.Ext.v[random_Number]
The hard part is that there are different files in one folder I need to keep.
For example, the content of folder A:
- HelloWorld.txt.v123
- HelloWorld.txt.v555
- HelloWorld.txt.v666
- OtherFile.pdf.v143
- OtherFile.pdf.v1453
- OtherFile.pdf.v123
- OtherFile.pdf.v14345
- YetOtherFile.docx.v11113
In this case we have 3 “basefiles”. And I would have to keep the newest 2 files of each “basefile”.
I tried Python 3 with os.walk and regex to filter out the basename. I tried built-in Linux tools like find with -ctime. I could also use bash.
But my real problem is more the logic. How would you approach this task?
EDIT 2: Here is my progress:
    import os
    from itertools import groupby

    # Raw string, so the backslashes are not read as escape sequences
    directory = r'C:\Users\x41\Desktop\Test'

    def sorted_ls(directory):
        # Sort the directory listing by modification time, oldest first
        mtime = lambda f: os.stat(os.path.join(directory, f)).st_mtime
        return list(sorted(os.listdir(directory), key=mtime))

    print(sorted_ls(directory))

    for basename, group in groupby(sorted_ls(directory), lambda x: x.rsplit('.')[0]):
        for i in basename:
            finallist = []
            for a in group:
                finallist.append(a)
            print(finallist[:-2])
I am almost there. The function sorts the files in the directory by their mtime value. The suggested groupby() function is then called on the output of my custom sort function.
The problem is that I have to drop the sort() before the groupby(), because it would reset my custom sort order. But now groupby() also returns more groups than anticipated.
If my sorted list looks like this:
['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']
I would get 3 groups. A, B, and A again. Any suggestions?
FINAL RESULT
Here is my final version with added recursion:
    import os
    from itertools import groupby

    directory = r'C:\Users\x41\Desktop\Test'

    for dirpath, dirs, files in os.walk(directory):
        output = []
        # Sort by name first so that groupby() sees each basename only once
        for basename, group in groupby(sorted(files), lambda x: x.rsplit('.')[0]):
            # Within a group, sort by mtime and select everything but the newest 2
            output.extend(sorted(group, key=lambda x: os.stat(os.path.join(dirpath, x)).st_mtime)[:-2])
        for file in output:
            os.remove(os.path.join(dirpath, file))
Answer
You need to do a simple sort on the file names first, so that they are in alphabetical order and the groupby function can work correctly.
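To illustrate why the initial sort matters: groupby only merges items that sit next to each other, so an unsorted list can yield the same key more than once. The small sketch below uses the example list from the question; the helper name base is made up for illustration.

```python
from itertools import groupby

names = ['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']

def base(name):
    # Same grouping key as in the question: everything before the first dot
    return name.rsplit('.')[0]

# Unsorted input: 'A' appears twice because its entries are not adjacent
unsorted_keys = [key for key, _ in groupby(names, base)]

# Sorted input: every base name forms exactly one group
sorted_keys = [key for key, _ in groupby(sorted(names), base)]

print(unsorted_keys)  # ['A', 'B', 'A']
print(sorted_keys)    # ['A', 'B']
```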
With each of the resulting file groups, you can then sort using your os.stat key as follows:
    import os
    from itertools import groupby

    directory = r'C:\Users\x41\Desktop\Test'
    output = []

    # Sort by name so that groupby() puts all versions of a file in one group
    for basename, group in groupby(sorted(os.listdir(directory)), lambda x: x.rsplit('.')[0]):
        # Sort each group by mtime and keep the newest 2 entries
        output.extend(sorted(group, key=lambda x: os.stat(os.path.join(directory, x)).st_mtime)[-2:])

    print(output)
This will produce a single list containing the latest two files from each group.
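If you want the opposite list (the files to delete) for a whole directory tree, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the names base_name, stale_versions and keep are made up here, and it assumes every version file ends in ".v<digits>" as in the listing in the question. Unlike rsplit('.')[0], the regex only strips that version suffix, so file names containing extra dots should still group correctly.

```python
import os
import re
from itertools import groupby

# Assumption: version files end in ".v<digits>", as in the question's listing
VERSION_SUFFIX = re.compile(r'\.v\d+$')

def base_name(filename):
    """Strip the trailing ".v<digits>", so 'a.b.txt.v12' becomes 'a.b.txt'."""
    return VERSION_SUFFIX.sub('', filename)

def stale_versions(root, keep=2):
    """Return the paths of all versions except the `keep` newest per base file."""
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        # Ignore anything that does not look like a version file
        versions = [f for f in files if VERSION_SUFFIX.search(f)]
        # groupby only merges adjacent items, so sort by the grouping key first
        for _base, group in groupby(sorted(versions, key=base_name), key=base_name):
            paths = [os.path.join(dirpath, name) for name in group]
            paths.sort(key=os.path.getmtime)  # oldest first
            # Everything before the last `keep` entries is considered stale
            stale.extend(paths[:max(len(paths) - keep, 0)])
    return stale
```

The function only returns paths and does not delete anything, so you can print the result and check it first, then pass each path to os.remove once the list looks right.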