I am trying to list all files in Google Cloud Storage together with their md5sum, as CSV.
Condition: it must run from bash and use only Linux commands.
When I run this:
gsutil ls -L -r gs://some-bucket/subfolder/**
It returns YAML-like output as a stream:
gs://sombucket/subfolder/filename1.jpg:
Creation time: Wed, 09 Feb 2022 16:44:55 GMT
Update time: Wed, 09 Feb 2022 16:44:55 GMT
Storage class: STANDARD
Content-Length: 11466
Content-Type: image/jpeg
Hash (crc32c): waea9g==
Hash (md5): HGTN2JFXASB0bfSH14hJGQ==
ETag: CLq0mO2I8/UCEAE=
Generation: 1644425095027258
Metageneration: 1
ACL: []
What I’d like to see is this:
gs://sombucket/subfolder/filename1.jpg,HGTN2JFXASB0bfSH14hJGQ==
(and likewise for the other files)
Answer
With Docker:
docker run --name gcloud-gsutil-vladimir --rm --volumes-from gcloud-config -i google/cloud-sdk:latest gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*@@.*jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','
Or use gsutil directly if installed:
gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*@@.*jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','
Steps:
- Run gsutil and pipe it to egrep to keep only the lines with the filename and the md5sum
- Remove all newline characters from the stream with tr -d '\n'
- Rely on the md5sum ending in “==” to put back the newlines we do need, with tr -s '=' '\n'
- Optionally remove other things like “Hash (md5):”
- Use sed to restore the removed “==” at the end of each line: sed 's/$/==/g'
- Remove “: ” (the colon and space after “.jpg”)
- Finally, replace all spaces with a comma with tr -s ' ' ','
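To see what each stage does, the same chain can be run offline against a small sample instead of a live bucket. This is only a sketch under a few assumptions: the function name md5_csv_steps is hypothetical, the second object in the sample is made up, the sample uses the indented layout that gsutil normally prints, and the egrep filter is simplified to '^gs://|md5' so that it matches the sample names (the pattern in the answer is specific to the author's filenames).

```shell
#!/usr/bin/env bash
# Sample listing in the shape printed by `gsutil ls -L`.
# First object comes from the question; the second one is invented for illustration.
sample='gs://sombucket/subfolder/filename1.jpg:
    Content-Type:           image/jpeg
    Hash (crc32c):          waea9g==
    Hash (md5):             HGTN2JFXASB0bfSH14hJGQ==
gs://sombucket/subfolder/filename2.jpg:
    Hash (md5):             AAAAAAAAAAAAAAAAAAAAAA=='

# Hypothetical helper: the steps from the answer, one per line.
md5_csv_steps() {
  grep -E '^gs://|md5' |      # keep object-name and md5 lines (simplified filter, an assumption)
    tr -d '\n' |              # join everything into a single line
    tr -s '=' '\n' |          # every run of "=" becomes exactly one newline
    sed 's/Hash (md5)://' |   # drop the label
    sed 's/$/==/' |           # restore the "==" padding at the end of each line
    sed 's/: //' |            # drop the colon after the object name
    tr -s ' ' ','             # squeeze the remaining spaces into one comma
}

printf '%s\n' "$sample" | md5_csv_steps
```

The trick depends on "=" appearing only as the md5 padding, so an object name that contains "=" or a space would probably break the output.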
This is the one-liner that I’ve been looking for. It works, but it could probably be achieved with fewer steps and fewer tools.
I know this can be achieved with python, perl and whatnot, but I would be happy to see other “one-liner” approaches.
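One possible shorter variant, offered as a sketch rather than a tested replacement, is a single awk program that remembers the most recent object name and prints it when the md5 line arrives. The function name md5_csv is hypothetical, and the sketch assumes that object lines start with gs:// and end with a colon, as in the listing from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `gsutil ls -L` style output on stdin
# and prints "object,md5" pairs.
md5_csv() {
  awk '
    /^gs:\/\//      { sub(/:$/, ""); name = $0; next }  # object line: strip trailing colon, remember it
    /Hash \(md5\):/ { print name "," $NF }              # md5 line: last field is the hash
  '
}

# Real use (assumes gsutil is installed and authenticated):
#   gsutil ls -L -r 'gs://some-bucket/subfolder/**' | md5_csv

# Offline check against the sample listing from the question
md5_csv <<'EOF'
gs://sombucket/subfolder/filename1.jpg:
    Creation time:          Wed, 09 Feb 2022 16:44:55 GMT
    Hash (crc32c):          waea9g==
    Hash (md5):             HGTN2JFXASB0bfSH14hJGQ==
    ETag:                   CLq0mO2I8/UCEAE=
EOF
```

Because the object name is taken as the whole line, names containing spaces should survive, but a name containing a comma would still need CSV quoting.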