Gawk Things

How to solve some things with Gawk (GNU Awk 4+). By Stephen Balbach.

0. Awk Libraries

1. List only files in a directory ('ls' doesn't have this feature!). Change $3=="f" to $3=="d" to list only directories.
gawk -lreaddir 'BEGIN { FS = "/" } { if($3=="f") {print $2} }' /

2. Replicate basic 'ls filename' or 'ls dirname' output. Wildcards are not supported, but they and other features could be added in a longer script.
gawk -lfilefuncs -lreaddir 'BEGIN { FS = "/"; stat(ARGV[1], fd); if(fd["type"] == "file") {print ARGV[1]; exit} } { print $2}' /home

3. Replicate 'cat temp1 temp2 temp3 etc', even if some or all of the files are missing (without generating an error). The final "1" prints every line; use "NF" instead to drop blank lines.
gawk 'BEGINFILE{if (ERRNO) nextfile} 1' temp1 temp2 temp-missing

4. Search for a string in a file when the search string contains regex special characters that need escaping and is passed in as a shell variable.
gawk -v nw="(Hello? *World*)" 'BEGIN{gsub(/[][\\.^$*+?(){}|]/,"\\\\&", nw)} $0 ~ nw' file.txt

5. Run an external program and capture output to a variable.
function sys2var(com,        line, loc, out) {

         # Uncomment to suppress stderr
         # com = com " 2>/dev/null"

         # Read the command's output line by line, joining the lines with newlines
         while ( (com | getline line) > 0 ) {
             if ( ++loc == 1 )
                 out = line
             else
                 out = out "\n" line
         }
         close(com)
         return out
}
Examples:
BEGIN {
 date = sys2var("/usr/bin/date")
 googlebarn = sys2var("wget -q -O- https://www.google.com/#q=barnyard")
 python_files = sys2var("ls | grep py") 
 # To check the exit status (0 means success):
  if( sys2var("mkdir dirname 2>/dev/null ; echo $?") != 0)
    print "Unable to make directory"
}
This function lets Gawk replace shell scripts in many cases, though not all. Since Gawk is installed by default on most Unix systems, such a script can be more portable than one written for one of the many shell versions and types.

6. Variations on the equivalent of "grep -c" in awk. The first two are case-sensitive; the rest ignore case, like "grep -ci". The "i=0" isn't strictly required, but it ensures a return of "0" instead of an empty string, the same as grep.
gawk 'BEGIN{i=0}/Birds feather/{i++}END{print i}' file
gawk -v i=0 '/Birds feather/{i++}END{print i}' file
gawk -v i=0 'tolower($0) ~ /birds feather/{i++}END{print i}' file
gawk -v i=0 -v IGNORECASE=1 '/Birds feather/{i++}END{print i}' file
gawk -v i=0 -v IGNORECASE=1 -v s="Birds feather" '$0 ~ s {i++}END{print i}' file
gawk -v i=0 -v IGNORECASE=1 -v s="Birds" '$0 ~ (s " feather") {i++}END{print i}' file

7. Example of how to "parse" an XML document, e.g. a Wikipedia XML export of an article's source.
wget -q -O- http://en.wikipedia.org/wiki/Special:Export/Auk | 
  awk '{RS=("<text xml|</text")} NR==3' | 
  awk 'NR==2 {gsub(/&lt;/,"<");gsub(/&gt;/,">");gsub(/&quot;/,"\"");gsub(/&amp;/,"\\&");print}' RS=">"
An easier way to get a Wikipedia source file:
https://en.wikipedia.org/wiki/Auk?action=raw

8. An equivalent of grep -oE | sort | uniq, for example: grep -oE 'field="[^"]*"' file.txt | sort | uniq
gawk -lreadfile 'BEGIN { s = readfile("file.txt"); n = patsplit(s, a, /field="[^"]*"/); for (i = 1; i <= n; i++) seen[a[i]] = 1; PROCINFO["sorted_in"] = "@ind_str_asc"; for (k in seen) print k }'

9. How to do a non-greedy extraction of markup pairs (e.g. <ref></ref>) from a long string. When web scraping, it's common to read a lengthy HTML file into a variable as a single long string. Extracting the data between markup pairs is then tricky because Gawk regular expressions are always greedy: the .* in the ERE <ref>.*</ref> matches the longest possible run of characters, from the first <ref> to the last </ref>, swallowing every pair in between. A single regular expression therefore can't isolate each pair. The approach below splits the string on the opening tag and cuts each piece at the first closing tag. Limits: it does not handle nested pairs (though it could with more coding).
@load "readfile"    # provides readfile()
BEGIN {
  file = readfile("file.txt")
  c = split(file, b, "<ref[^>]*>")
  i = 1
  while(i++ < c) {
    k = substr(b[i], 1, match(b[i], "</ref>") - 1)
    if(k ~ "kirjasto[.]sci[.]fi") 
      print k
  }
}

10. Sort a multidimensional associative array by element values. The example below prints the array aa[] three times: sorted by name, by sales, and by zip.


function reindex(org, new, field,       o, o2, newndx) {
  delete new
  for(o in org) {
    newndx = org[o][field]
    for(o2 in org[o])
      new[newndx][o][o2] = org[o][o2]
  }
}
BEGIN{
 aa["c456"]["name"] = "tom jones"
 aa["c456"]["sales"] = 1
 aa["c456"]["zip"] = 21005
 aa["c897"]["name"] = "john martin"
 aa["c897"]["sales"] = 2
 aa["c897"]["zip"] = 21004
 aa["c259"]["name"] = "phil lesh"
 aa["c259"]["sales"] = 100
 aa["c259"]["zip"] = 21003
 aa["c109"]["name"] = "mary sue"
 aa["c109"]["sales"] = 9
 aa["c109"]["zip"] = 21006

 fields = "name sales zip"
 c = split(fields, field, " ")
 while(i++ < c) {
   sort = (field[i] ~ /name/) ? "@ind_str_asc" : "@ind_num_asc"
   print "\nSorted by " field[i] "\n-----------"
   reindex(aa, ba, field[i])
   PROCINFO["sorted_in"] = sort
   for (b in ba) {
     for(bi in ba[b])
       print ba[b][bi]["name"] "|" ba[b][bi]["sales"] "|" ba[b][bi]["zip"]
   }
 }
}

Output (the three listings print one after another; they are shown side by side here):


Sorted by name        Sorted by sales       Sorted by zip
-----------           -----------           -----------
john martin|2|21004   tom jones|1|21005     phil lesh|100|21003
mary sue|9|21006      john martin|2|21004   john martin|2|21004
phil lesh|100|21003   mary sue|9|21006      tom jones|1|21005
tom jones|1|21005     phil lesh|100|21003   mary sue|9|21006

11. Write to a text file using locks so that concurrent writes are safe. When using GNU Parallel (or another parallelization method) and writing to files, locks may be needed to prevent data loss.
function awklock(filename, command,    status,count) {
  if(command ~ /lock/) {
    while(1) {       
      status = sys2var( "/bin/mkdir /tmp/lock." filename " 2>/dev/null ; echo $?")
      if(count > 100) {
        print "Error in awklock() - stuck lock file /tmp/lock." filename > "/dev/stderr"    
        return 0
      }
      if(status != 0) { 
        sys2var("/bin/sleep 1")
        count++       
      }
      else                 
        break      
    }
    return 1
  }
  if(command ~ /release/) 
    sys2var("/bin/rm -r /tmp/lock." filename " 2>/dev/null")
}          
BEGIN {
  database = "data.txt"
  if(awklock(database, "lock")) {      # create lock; returns 0 if it could not be obtained
    print "whatever" >> database
    close(database)
    awklock(database, "release")       # release lock
  }
}
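awklock() makes a directory as a lock instead of creating a lock file. The advantage is that mkdir is atomic: it either creates the directory or fails because the directory already exists, so testing for the lock and taking it happen in a single step. sys2var() is defined in item 5 above.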