How to get filenames and md5sum from google bucket in csv format with gsutil ls

I am trying to list all files in a Google Cloud Storage bucket, together with their md5 hashes, as CSV.

Constraint: it must run from bash and use only standard Linux commands.

When I run this:

gsutil ls -L -r gs://some-bucket/subfolder/**

It returns a YAML-like stream:

gs://sombucket/subfolder/filename1.jpg:
    Creation time:          Wed, 09 Feb 2022 16:44:55 GMT
    Update time:            Wed, 09 Feb 2022 16:44:55 GMT
    Storage class:          STANDARD
    Content-Length:         11466
    Content-Type:           image/jpeg
    Hash (crc32c):          waea9g==
    Hash (md5):             HGTN2JFXASB0bfSH14hJGQ==
    ETag:                   CLq0mO2I8/UCEAE=
    Generation:             1644425095027258
    Metageneration:         1
    ACL:                    []

What I’d like to see is this:

gs://sombucket/subfolder/filename1.jpg,HGTN2JFXASB0bfSH14hJGQ==
... (and other files)

Answer

With Docker:

docker run --name gcloud-gsutil-vladimir --rm --volumes-from gcloud-config -i google/cloud-sdk:latest gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*\.jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','

Or use gsutil directly if installed:

gsutil ls -L -r gs://some-bucket/subfolder/**|egrep "gs.*\.jpg|md5.*"|tr -d '\n'|tr -s '=' '\n'| sed 's/Hash (md5)://'|sed 's/$/==/g'|sed 's/: //'|tr -s ' ' ','

Steps:

  1. Run gsutil and pipe its output to egrep to keep only the lines containing the file name and the md5 hash
  2. Remove all newline characters from the stream with tr -d '\n'
  3. Rely on every md5 hash ending in ‘==’ to split the stream back into one line per file: tr -s '=' '\n' turns each ‘==’ into a single newline
  4. Optionally remove leftover labels such as “Hash (md5):”
  5. Use sed to restore the removed “==” at the end of each line: sed 's/$/==/g'
  6. Remove the ‘: ‘ (colon and space) that follows “.jpg”
  7. Finally, replace each run of spaces with a single comma with tr -s ' ' ','
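
The steps above can be laid out one stage per line, which makes it easier to check each stage against saved sample output without touching a bucket. This is only a sketch: the function name listing_to_csv is made up for illustration, grep -E stands in for egrep, and it assumes .jpg objects whose names contain no spaces and no '=' characters.

```shell
# Hypothetical helper (name is an assumption): reads `gsutil ls -L` output
# on stdin and writes one "url,md5" line per .jpg object.
listing_to_csv() {
  grep -E 'gs.*\.jpg|md5' |   # 1. keep only the object-name and md5 lines
    tr -d '\n' |              # 2. join everything into one long line
    tr -s '=' '\n' |          # 3. each "==" becomes a single newline
    sed 's/Hash (md5)://' |   # 4. drop the "Hash (md5):" label
    sed 's/$/==/' |           # 5. put back the "==" consumed in step 3
    sed 's/: //' |            # 6. drop the ": " after the object name
    tr -s ' ' ','             # 7. squeeze the run of spaces into one comma
}
```

Usage, with the bucket path from the question: gsutil ls -L -r 'gs://some-bucket/subfolder/**' | listing_to_csv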

This is the one-liner I was looking for. It works, but it could probably be done with fewer steps and fewer tools.

I know this can be achieved with python, perl and whatnot, but I would be happy to see other “one-liner” approaches.
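
One possible shorter variant uses a single awk process. It is a sketch, not a tested replacement: the function name listing_to_csv_awk is hypothetical, and it assumes the layout shown in the question, where each object line starts with gs:// and ends with a colon and the md5 value is the last field on its line. Unlike the pipeline, it is not limited to .jpg files. Names containing commas would still need CSV quoting, which this does not do.

```shell
# Hypothetical helper (name is an assumption): reads `gsutil ls -L` output
# on stdin and writes one "url,md5" line per object.
listing_to_csv_awk() {
  awk '
    # Object line: strip the trailing colon and remember the URL.
    /^gs:\/\// { sub(/:$/, ""); name = $0; next }
    # md5 line: the last whitespace-separated field is the base64 digest.
    /Hash \(md5\):/ { print name "," $NF }
  '
}
```

Usage, with the bucket path from the question: gsutil ls -L -r 'gs://some-bucket/subfolder/**' | listing_to_csv_awk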