0. Awk Libraries
1. List only files in a directory ('ls' doesn't have this feature!). Change $3=="f" to $3=="d" to list only directories.

    gawk -lreaddir 'BEGIN { FS = "/" } { if($3=="f") {print $2} }' /
2. Replicate basic 'ls filename' or 'ls dirname' output. Wildcards are not supported, but they and other features could be added in a longer script.

    gawk -lfilefuncs -lreaddir 'BEGIN { FS = "/"; stat(ARGV[1], fd); if(fd["type"] == "file") {print ARGV[1]; exit} } { print $2}' /home
3. Replicate 'cat temp1 temp2 temp3 etc', even if some or all of the files are missing, without generating an error. The NF pattern also drops blank lines; replace it with 1 to print every line as cat does.

    gawk 'BEGINFILE{if (ERRNO) nextfile} NF' temp1 temp2 temp-missing
4. Search a file for a string that contains special characters needing escape, when the string is passed in as a shell variable. The gsub() backslash-escapes the metacharacters so that the string can then be matched as a dynamic regexp.

    gawk -v nw="(Hello? *World*)" 'BEGIN{gsub(/\(|\)|\?|\*/,"\\\\&", nw)} $0 ~ nw' file.txt
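When only a literal match is needed, the escaping can be avoided altogether. The following is a minimal sketch, not part of the original tip: the environment variable name `needle` and the sample data are made up for illustration. index() does a plain substring search, and reading the string from ENVIRON sidesteps the escape-sequence processing that -v applies to its value. It uses only POSIX awk features, so it should run under gawk or another awk.

```shell
# Hypothetical sample data for illustration.
tmp=$(mktemp)
printf '%s\n' 'first line' 'say (Hello? *World*) now' 'Hello World' > "$tmp"

# index() returns the position of the literal substring, or 0 if it is absent,
# so no regexp metacharacter needs escaping.
matches=$(needle='(Hello? *World*)' awk 'index($0, ENVIRON["needle"])' "$tmp")
printf '%s\n' "$matches"

rm -f "$tmp"
```

Only the second sample line contains the literal string, so only that line is printed.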
5. Run an external program and capture its output in a variable.

    function sys2var(com,    line, loc, out) {
        # Uncomment to discard the command's stderr
        # com = com " 2>/dev/null"
        while ( (com | getline line) > 0 ) {
            if ( ++loc == 1 )
                out = line
            else
                out = out "\n" line
        }
        close(com)
        return out
    }

This function lets Gawk stand in for shell scripts in many cases, though not always. Since Gawk is installed by default on most Unix systems, it is more portable than the many shell versions and types.

Examples:

    BEGIN {
        date = sys2var("/usr/bin/date")
        googlebarn = sys2var("wget -q -O- https://www.google.com/#q=barnyard")
        python_files = sys2var("ls | grep py")
        # To see exit status (non-zero means the command failed):
        if ( sys2var("mkdir dirname 2>/dev/null ; echo $?") != 0 )
            print "Unable to make directory"
    }
6. Variations on the equivalent of "grep -c" in awk. The first two are case-sensitive; the others ignore case, like "grep -ci". The "i=0" isn't strictly required, but it ensures that "0" is printed instead of an empty line, the same as grep.

    gawk 'BEGIN{i=0}/Birds feather/{i++}END{print i}' file
    gawk -v i=0 '/Birds feather/{i++}END{print i}' file
    gawk -v i=0 'tolower($0) ~ /birds feather/{i++}END{print i}' file
    gawk -v i=0 'BEGIN{IGNORECASE=1}/Birds feather/{i++}END{print i}' file
    gawk -v i=0 -v s="Birds feather" 'BEGIN{IGNORECASE=1} $0 ~ s {i++}END{print i}' file
    gawk -v i=0 -v s="Birds" 'BEGIN{IGNORECASE=1} $0 ~ s" feather" {i++}END{print i}' file
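Another way to get "0" rather than an empty line is to force numeric context in the END block with i+0, which removes the need to initialize i. A small sketch with made-up sample data, using only POSIX awk features:

```shell
# Hypothetical sample data for illustration.
tmp=$(mktemp)
printf '%s\n' 'Birds feather' 'no match here' 'birds FEATHER' > "$tmp"

# i+0 turns an unset (empty) counter into the number 0.
count=$(awk '/Birds feather/{i++} END{print i+0}' "$tmp")              # case-sensitive
none=$(awk '/no such text/{i++} END{print i+0}' "$tmp")                # no matches at all
icount=$(awk 'tolower($0) ~ /birds feather/{i++} END{print i+0}' "$tmp")  # ignoring case
echo "$count $none $icount"    # prints: 1 0 2

rm -f "$tmp"
```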
7. Example of how to "parse" an XML document, e.g. a Wikipedia source file XML export.

    wget -q -O- http://en.wikipedia.org/wiki/Special:Export/Auk | awk '{RS=("<text xml|</text")} NR==3' | awk 'NR==2 {gsub(/&lt;/,"<");gsub(/&gt;/,">");gsub(/&quot;/,"\"");gsub(/&amp;/,"\\&");print}' RS=">"

An easier way to get a Wikipedia source file:

    https://en.wikipedia.org/wiki/Auk?action=raw
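8. An equivalent of grep -oE/sort/uniq, for example of: grep -oE 'field="[^"]*"' file.txt | sort | uniq

    gawk -lreadfile 'BEGIN{s = readfile("file.txt"); c = patsplit(s, a, /field="[^"]*"/); asort(a); for(i = 1; i <= c; i++) if(i == 1 || a[i] != a[i-1]) print a[i]}'

After asort() the matches are in order, so duplicates are adjacent and can be skipped by comparing each element with the previous one.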
9. How to do a non-greedy extraction of markup pairs (e.g. <ref></ref>) in a long string. It's common when doing web scraping to read a lengthy HTML file into a variable as a single long string. From there, extracting data between markup pairs can be tricky because Gawk only does greedy matches. Greedy means that .* in the ERE <ref>.*</ref> matches the longest string of characters it can find that starts with <ref> and ends with </ref>, so with several pairs it runs from the first <ref> to the last </ref>. There is therefore no effective regular expression solution. This is another way: split the string on the opening tag, then cut each piece at the first closing tag. Limits: it will not work with nested pairs (though it could with more coding).

    @load "readfile"

    BEGIN {
        file = readfile("file.txt")
        c = split(file, b, "<ref[^>]*>")
        i = 1                       # b[1] is the text before the first <ref>, so start at b[2]
        while(i++ < c) {
            k = substr(b[i], 1, match(b[i], "</ref>") - 1)
            if(k ~ "kirjasto[.]sci[.]fi")
                print k
        }
    }
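The same idea can be sketched without split(), by repeatedly searching for the next opening tag and then the first closing tag after it. The function name between() and the sample string are made up for illustration. Unlike the regexp used with split() above, this sketch matches the opening tag literally, so tags with attributes are not handled, and nested pairs are still unsupported. It uses only POSIX awk features.

```shell
refs=$(awk '
# Store in out[] every body found between otag and ctag; return the count.
# index() finds the first closing tag after each opening tag, which is
# what makes the extraction non-greedy.
function between(s, otag, ctag, out,    n, p, q) {
    n = 0
    while ((p = index(s, otag)) > 0) {
        s = substr(s, p + length(otag))    # drop text up to and including the opening tag
        q = index(s, ctag)
        if (q == 0) break                  # unterminated pair: stop
        out[++n] = substr(s, 1, q - 1)
        s = substr(s, q + length(ctag))    # continue after the closing tag
    }
    return n
}
BEGIN {
    str = "a<ref>one</ref>b<ref>two</ref>c"
    split("", found)                       # make sure found is an (empty) array
    n = between(str, "<ref>", "</ref>", found)
    for (i = 1; i <= n; i++) print found[i]
}')
printf '%s\n' "$refs"
```

For the sample string this prints "one" and "two" on separate lines, where the greedy ERE <ref>.*</ref> would have matched everything from the first <ref> to the last </ref>.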
10. Sort a multidimensional associative array by element values. The array aa[] below is sorted in turn by name, by sales and by zip.

    # Copy org[] into new[], indexed first by the value of the given field
    function reindex(org, new, field,    o, o2, newndx) {
        delete new
        for(o in org) {
            newndx = org[o][field]
            for(o2 in org[o])
                new[newndx][o][o2] = org[o][o2]
        }
    }

    BEGIN {
        aa["c456"]["name"] = "tom jones"
        aa["c456"]["sales"] = 1
        aa["c456"]["zip"] = 21005
        aa["c897"]["name"] = "john martin"
        aa["c897"]["sales"] = 2
        aa["c897"]["zip"] = 21004
        aa["c259"]["name"] = "phil lesh"
        aa["c259"]["sales"] = 100
        aa["c259"]["zip"] = 21003
        aa["c109"]["name"] = "mary sue"
        aa["c109"]["sales"] = 9
        aa["c109"]["zip"] = 21006

        fields = "name sales zip"
        c = split(fields, field, " ")
        while(i++ < c) {
            # Names sort as strings; sales and zip sort as numbers
            sort = (field[i] ~ /name/) ? "@ind_str_asc" : "@ind_num_asc"
            print "\nSorted by " field[i] "\n-----------"
            reindex(aa, ba, field[i])
            PROCINFO["sorted_in"] = sort
            for(b in ba) {
                for(bi in ba[b])
                    print ba[b][bi]["name"] "|" ba[b][bi]["sales"] "|" ba[b][bi]["zip"]
            }
        }
    }

Output (the three listings are printed one after another; they are shown side by side here):

    Sorted by name         Sorted by sales        Sorted by zip
    -----------            -----------            -----------
    john martin|2|21004    tom jones|1|21005      phil lesh|100|21003
    mary sue|9|21006       john martin|2|21004    john martin|2|21004
    phil lesh|100|21003    mary sue|9|21006       tom jones|1|21005
    tom jones|1|21005      phil lesh|100|21003    mary sue|9|21006
11. Write to a text file using locks, allowing for concurrent writes. When using GNU Parallel (or another parallelization method) and writing to files, locks may be needed to prevent data loss. awklock() makes a directory as a lock instead of creating a lock file; the advantage is that mkdir tests for and creates the directory in a single atomic step. sys2var() is defined in tip 5 above.

    function awklock(filename, command,    status, count) {
        if(command ~ /lock/) {
            while(1) {
                status = sys2var("/bin/mkdir /tmp/lock." filename " 2>/dev/null ; echo $?")
                if(count > 100) {
                    print "Error in awklock() - stuck lock file /tmp/lock." filename > "/dev/stderr"
                    return 0
                }
                if(status != 0) {
                    sys2var("/bin/sleep 1")
                    count++
                }
                else
                    break
            }
            return 1
        }
        if(command ~ /release/)
            sys2var("/bin/rm -r /tmp/lock." filename " 2>/dev/null")
    }

    BEGIN {
        database = "data.txt"
        awklock(database, "lock")       # create lock
        print "whatever" >> database
        close(database)
        awklock(database, "release")    # release lock
    }
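The reason a directory works as a lock is that mkdir either creates the directory or fails because it already exists, so a second process cannot acquire a lock that is still held. The following is a small sketch of that behaviour, separate from awklock() itself: the lock path is a hypothetical one under a temporary directory, and system() is used instead of sys2var() to keep the example self-contained.

```shell
lockdir="$(mktemp -d)/lock.data.txt"    # hypothetical lock path

result=$(awk -v d="$lockdir" 'BEGIN {
    take    = "mkdir \"" d "\" 2>/dev/null"
    release = "rmdir \"" d "\""

    first  = system(take)      # 0: lock acquired
    second = system(take)      # non-zero: lock already held
    system(release)            # release the lock
    third  = system(take)      # 0 again: lock can be re-acquired
    system(release)

    a = (first == 0); b = (second != 0); c = (third == 0)
    print a, b, c
}')
echo "$result"    # prints: 1 1 1
```

A waiting process, as in awklock(), simply retries the mkdir until it succeeds or a retry limit is reached.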