I am trying to list all files in a Google Cloud Storage bucket, together with their md5sums, as CSV.
Condition: run it from bash and use only Linux commands.
When I run this:
gsutil ls -L -r gs://some-bucket/subfolder/**
It returns yaml as a stream:
gs://sombucket/subfolder/filename1.jpg:
    Creation time:          Wed, 09 Feb 2022 16:44:55 GMT
    Update time:            Wed, 09 Feb 2022 16:44:55 GMT
    Storage class:          STANDARD
    Content-Length:         11466
    Content-Type:           image/jpeg
    Hash (crc32c):          waea9g==
    Hash (md5):             HGTN2JFXASB0bfSH14hJGQ==
    ETag:                   CLq0mO2I8/UCEAE=
    Generation:             1644425095027258
    Metageneration:         1
    ACL:                    []
What I’d like to see is this:
gs://sombucket/subfolder/filename1.jpg,HGTN2JFXASB0bfSH14hJGQ== ... (and other files)
Answer
With Docker:
docker run --name gcloud-gsutil-vladimir --rm --volumes-from gcloud-config -i google/cloud-sdk:latest gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*\.jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','
Or use gsutil directly if it is installed:
gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*\.jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','
Steps:
- Run gsutil and pipe it to egrep to keep only the lines with the filename and the md5sum
- Remove all the newline characters from the stream with tr -d '\n'
- Rely on the md5sum's trailing "==" to put a newline back (the one we need) with tr -s '=' '\n'
- Optionally remove other things like "Hash (md5):"
- Use sed to restore the removed "==" at the end of each line with sed 's/$/==/g'
- Remove ": " (the colon and space after ".jpg") with sed 's/: //'
- Finally replace each run of spaces with a single comma with tr -s ' ' ','
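To try the steps without touching a real bucket, here is a small sketch that runs the same chain of filters on a canned listing. The function names sample_listing and md5_csv_pipeline are made up for this illustration, the sample is the single object from the question, and the newline escapes are written out as '\n'. It assumes file names contain no spaces.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for `gsutil ls -L` output (one object, taken from the question).
sample_listing() {
  cat <<'EOF'
gs://sombucket/subfolder/filename1.jpg:
    Creation time:          Wed, 09 Feb 2022 16:44:55 GMT
    Hash (crc32c):          waea9g==
    Hash (md5):             HGTN2JFXASB0bfSH14hJGQ==
    ETag:                   CLq0mO2I8/UCEAE=
EOF
}

# Hypothetical wrapper around the filter chain described in the steps.
md5_csv_pipeline() {
  grep -E 'gs://.*\.jpg:|md5' |   # keep only the object header and md5 lines
    tr -d '\n' |                  # join everything into one long line
    tr -s '=' '\n' |              # the trailing "==" becomes a single newline
    sed 's/Hash (md5)://' |       # drop the label
    sed 's/$/==/' |               # put the "==" padding back
    sed 's/: //' |                # drop the colon and space after the file name
    tr -s ' ' ','                 # squeeze the remaining spaces into one comma
}

sample_listing | md5_csv_pipeline
# prints: gs://sombucket/subfolder/filename1.jpg,HGTN2JFXASB0bfSH14hJGQ==
```

The trick only works because a base64-encoded md5 always ends in "==", and because the crc32c and ETag lines (which also contain "=") are filtered out before the newlines are removed.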
This is the one-liner that I've been looking for. It works, but it could probably be achieved with fewer steps and fewer tools.
I know this can be achieved with Python, Perl and whatnot, but I would be happy to see other "one-liner" approaches.
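One possible shorter variant, as a sketch rather than a tested replacement: a single awk program can remember the most recent object line and print it when it reaches the md5 line. The function name gsutil_md5_csv is made up here, and the sketch assumes the listing format shown in the question, that is, an unindented "gs://...:" line per object followed by an indented "Hash (md5):" line.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `gsutil ls -L` output on stdin, print "url,md5" lines.
gsutil_md5_csv() {
  awk '
    # Object header: an unindented line such as "gs://bucket/path/file.jpg:".
    /^gs:\/\/.*:$/ { url = $0; sub(/:$/, "", url); next }
    # The indented md5 line carries the base64 digest as its last field.
    /^[[:space:]]+Hash \(md5\):/ { print url "," $NF }
  '
}

# Usage (assumes gsutil is installed and authenticated):
#   gsutil ls -L -r 'gs://some-bucket/subfolder/**' | gsutil_md5_csv
```

Compared with the tr/sed chain, this does not depend on the "==" padding, is not limited to .jpg files, and keeps spaces in object names, since the URL is taken from the whole header line.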