I am trying to move files in an HDFS directory that are more than 3 days old to an archive folder in HDFS.
AWK Script:
hdfs dfs -ls hdfs://companycluster/data/src/purecloud/current | tail -n+2 | xargs -n 8 | awk '{ DAY_CONV=(60*60*24); X ="date +%s";X | getline ED;printf("") > "X";close("X"); Y="date -d "$6" +%s";Y | getline SD;printf("") > "Y";close("Y"); DIFF=(ED-SD)/DAY_CONV; print " SD=",SD" ED=",ED," DIFF=",DIFF," INPUT=",$6; if ( DIFF -gt 3) cmd="hdfs dfs -ls " $8; system(cmd); }'
Note: the cmd variable will hold a mv command once this script works.
Issues:
- The value of variable X is constant
- The value of variable Y is constant
- Unable to get the difference in days between the two dates; I get a fractional value in DIFF
- The if statement in AWK is failing due to incorrect arguments
Input to AWK:
-rw-r--r-- 3 user hdfs 50687424 2017-02-27 17:06 hdfs://companycluster/data/src/purecloud/current/Conversation.json.240220170000
-rw-r--r-- 3 user hdfs 49967359 2017-02-27 17:06 hdfs://companycluster/data/src/purecloud/current/Conversation.json.250220170000
-rw-r--r-- 3 user hdfs 28647041 2017-02-27 17:00 hdfs://companycluster/data/src/purecloud/current/Conversation.json.260220170000
-rw-r--r-- 3 user hdfs 6728724 2017-03-01 13:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1305
-rw-r--r-- 3 user hdfs 7050854 2017-03-01 13:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1325
-rw-r--r-- 3 user hdfs 6630106 2017-03-01 13:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1345
-rw-r--r-- 3 user hdfs 6766650 2017-03-01 14:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1405
-rw-r--r-- 3 user hdfs 6486095 2017-03-01 14:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1425
-rw-r--r-- 3 user hdfs 6350705 2017-03-01 14:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1445
-rw-r--r-- 3 user hdfs 6082589 2017-03-01 15:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1505
-rw-r--r-- 3 user hdfs 6417281 2017-03-01 15:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1525
-rw-r--r-- 3 user hdfs 6519949 2017-03-01 15:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1545
-rw-r--r-- 3 user hdfs 6988534 2017-03-01 16:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1605
-rw-r--r-- 3 user hdfs 6734459 2017-03-01 16:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1625
-rw-r--r-- 3 user hdfs 6842766 2017-03-01 16:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1645
-rw-r--r-- 3 user hdfs 6575513 2017-03-01 17:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1705
-rw-r--r-- 3 user hdfs 6574050 2017-03-01 17:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1725
-rw-r--r-- 3 user hdfs 50215096 2017-02-27 18:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-27_1801
-rw-r--r-- 3 user hdfs 50985760 2017-02-27 18:18 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-27_1818
-rw-r--r-- 3 user hdfs 58206776 2017-02-28 00:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_0001
-rw-r--r-- 3 user hdfs 58823497 2017-02-28 06:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_0601
-rw-r--r-- 3 user hdfs 61591660 2017-02-28 12:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_1201
-rw-r--r-- 3 user hdfs 59703667 2017-03-01 10:40 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_1801
-rw-r--r-- 3 user hdfs 59160075 2017-03-01 10:47 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_0001
-rw-r--r-- 3 user hdfs 61812121 2017-03-01 10:48 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_0601
-rw-r--r-- 3 user hdfs 63804772 2017-03-01 12:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_1201
Output from AWK (has debugging prints):
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-27
-rw-r--r-- 3 user hdfs 50687424 2017-02-27 17:06 hdfs://companycluster/data/src/purecloud/current/Conversation.json.240220170000
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-27
-rw-r--r-- 3 user hdfs 49967359 2017-02-27 17:06 hdfs://companycluster/data/src/purecloud/current/Conversation.json.250220170000
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-27
-rw-r--r-- 3 user hdfs 28647041 2017-02-27 17:00 hdfs://companycluster/data/src/purecloud/current/Conversation.json.260220170000
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6728724 2017-03-01 13:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1305
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 7050854 2017-03-01 13:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1325
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6630106 2017-03-01 13:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1345
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6766650 2017-03-01 14:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1405
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6486095 2017-03-01 14:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1425
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6350705 2017-03-01 14:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1445
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6082589 2017-03-01 15:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1505
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6417281 2017-03-01 15:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1525
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6519949 2017-03-01 15:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1545
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6988534 2017-03-01 16:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1605
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6734459 2017-03-01 16:25 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1625
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6842766 2017-03-01 16:45 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1645
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 6575513 2017-03-01 17:05 hdfs://companycluster/data/src/purecloud/current/conversation.json.2017-03-01_1705
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-27
-rw-r--r-- 3 user hdfs 50215096 2017-02-27 18:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-27_1801
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-27
-rw-r--r-- 3 user hdfs 50985760 2017-02-27 18:18 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-27_1818
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-28
-rw-r--r-- 3 user hdfs 58206776 2017-02-28 00:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_0001
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-28
-rw-r--r-- 3 user hdfs 58823497 2017-02-28 06:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_0601
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-02-28
-rw-r--r-- 3 user hdfs 61591660 2017-02-28 12:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_1201
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 59703667 2017-03-01 10:40 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-02-28_1801
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 59160075 2017-03-01 10:47 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_0001
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 61812121 2017-03-01 10:48 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_0601
SD= 1488286800 ED= 1488348518 DIFF= 0.714329 INPUT= 2017-03-01
-rw-r--r-- 3 user hdfs 63804772 2017-03-01 12:01 hdfs://companycluster/data/src/purecloud/current/conversation_6hr.json.2017-03-01_1201
Distribution Information:
- Hortonworks
- Hadoop 2.7.1.2.4.0.0-169
- Linux dh01 aaaaaaaaaaaaa.x86_64 #1 SMP Sun Jul 27 15:55:46 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
Any input would be greatly appreciated.
Answer
hdfs dfs -ls hdfs://companycluster/data/src/purecloud/current | tail -n+2 | xargs -n 8 | awk '
BEGIN {
    # take the time reference (3 days before now)
    R = systime() - 3 * 86400
}
# for each line
{
    # format used by mktime: "YYYY MM DD HH MM SS [DST]"
    # create the time in mktime format
    t = $6 " " $7 " 00"; gsub( /[-:]/, " ", t )
    # convert to epoch
    T = mktime( t )
    # if lower than reference time
    if ( T < R ) {
        print "Included line: " $0
        # do what you want as action
        cmd = "hdfs dfs -ls " $8
        system( cmd )
    } else {
        print "Discarded line: " $0
    }
}'
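The step most likely to trip people up is building the string that mktime expects from fields 6 and 7 of the listing. Here is a minimal sketch of just that transformation, so it can be checked without HDFS. The helper name to_mktime_spec is hypothetical and used only for illustration. Note also that systime() and mktime() are GNU awk extensions rather than POSIX awk features, so the script above assumes gawk.

```shell
#!/bin/sh
# to_mktime_spec is a hypothetical helper name used only for this illustration.
# It reads "YYYY-MM-DD HH:MM" on stdin and prints "YYYY MM DD HH MM SS",
# which is the layout gawk's mktime() accepts (seconds are set to 00).
to_mktime_spec() {
    awk '{ t = $1 " " $2 " 00"; gsub(/[-:]/, " ", t); print t }'
}

# Example call:
#   echo "2017-02-27 17:06" | to_mktime_spec
```

This part uses only concatenation and gsub, so it should behave the same in any awk; only the later mktime() call needs gawk.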
Comment:
- the awk script is self-commented
- the input to awk could certainly be optimized (awk can do what tail does very well, and xargs is certainly not mandatory here; I have no HDFS to test against from here)
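As a sketch of the optimization mentioned in the last point, the filter below drops tail and xargs and lets awk skip the "Found N items" header itself. It also swaps in a different technique from the answer: instead of mktime, it compares the "date time" text of fields 6 and 7 as a string against a cutoff in the same format, which sorts chronologically and works in POSIX awk. The function name select_older and the archive path are hypothetical, and the usage lines assume GNU date for the -d option.

```shell
#!/bin/sh
# select_older is a hypothetical helper name, not part of hdfs.
# $1 is a cutoff in "YYYY-MM-DD HH:MM" form. For every listing line on stdin
# with at least 8 fields, print the path (field 8) when the line's
# "date time" (fields 6 and 7) sorts before the cutoff.
# The NF >= 8 check skips the "Found N items" header line.
select_older() {
    awk -v cutoff="$1" 'NF >= 8 && ($6 " " $7) < cutoff { print $8 }'
}

# Usage sketch (assumes GNU date; the archive path is a placeholder):
#   cutoff=$(date -d "3 days ago" "+%Y-%m-%d %H:%M")
#   hdfs dfs -ls hdfs://companycluster/data/src/purecloud/current |
#       select_older "$cutoff" |
#       while read -r f; do
#           echo hdfs dfs -mv "$f" /hypothetical/archive/
#       done
```

The echo in the usage sketch makes it a dry run; remove it once the printed commands look right.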