bash – How to print a field containing a specific substring with awk?

Goal: To print fields from an input file to an output file, but with special regard to specific fields split by semicolons (;).

Example Input (input.txt):

NC_051336.1     Gnomon  gene    40042   56215   .       -       .       gene_id "LOC102552844"; transcript_id ""; db_xref "GeneID:102552844"; db_xref "RGD:7551986"; gbkey "Gene"; gene "LOC102552844"; gene_biotype "pseudogene"; pseudo "true";
NC_051336.1     BestRefSeq%2CGnomon     gene    76909   85762   .       +       .       gene_id "Vom2r3"; transcript_id ""; db_xref "GeneID:502213"; db_xref "RGD:1565892"; description "vomeronasal 2 receptor, 3"; gbkey "Gene"; gene "Vom2r3"; gene_biotype "protein_coding"; gene_synonym "RGD1565892";
NC_051336.1     Gnomon  gene    162525  192445  .       -       .       gene_id "LOC103693496"; transcript_id ""; db_xref "GeneID:103693496"; db_xref "RGD:9090555"; gbkey "Gene"; gene "LOC103693496"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  transcript      162525  167758  .       -       .       gene_id "LOC103693496"; transcript_id "XM_039098304.1"; db_xref "GeneID:103693496"; gbkey "mRNA"; gene "LOC103693496"; model_evidence "Supporting evidence includes similarity to: 4 Proteins, and 90% coverage of the annotated genomic feature by RNAseq alignments"; product "sperm motility kinase X-like, transcript variant X9"; transcript_biotype "mRNA";
NC_051336.1     BestRefSeq      gene    226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id ""; db_xref "GeneID:308248"; db_xref "RGD:1564110"; description "vomeronasal 2 receptor, 4"; gbkey "Gene"; gene "Vom2r4"; gene_biotype "protein_coding"; gene_synonym "RGD1564110";
NC_051336.1     BestRefSeq      transcript      226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id "NM_001099458.1"; db_xref "GeneID:308248"; gbkey "mRNA"; gene "Vom2r4"; product "vomeronasal 2 receptor, 4"; transcript_biotype "mRNA";
NC_051336.1     Curated Genomic gene    267100  275769  .       -       .       gene_id "Vom2r-ps8"; transcript_id ""; db_xref "GeneID:502214"; db_xref "RGD:1563053"; description "vomeronasal 2 receptor, pseudogene 8"; gbkey "Gene"; gene "Vom2r-ps8"; gene_biotype "pseudogene"; gene_synonym "RGD1563053"; pseudo "true";
NC_051336.1     Gnomon  gene    284195  301267  .       -       .       gene_id "LOC102556157"; transcript_id ""; db_xref "GeneID:102556157"; db_xref "RGD:7626961"; gbkey "Gene"; gene "LOC102556157"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  gene    307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id ""; db_xref "GeneID:108352169"; db_xref "RGD:11507047"; gbkey "Gene"; gene "LOC108352169"; gene_biotype "lncRNA";
NC_051336.1     Gnomon  transcript      307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id "XR_005497081.1"; db_xref "GeneID:108352169"; gbkey "ncRNA"; gene "LOC108352169"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC108352169, transcript variant X3"; transcript_biotype "lnc_RNA";

Desired output (output.txt):

LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  Vom2r3          NC_051336.1:76909-85762 NC_051336.1     76909   85762   +       8853    protein_coding
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  Vom2r4          NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   protein_coding
Vom2r-ps8       Vom2r-ps8       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    pseudogene
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA
  • Column 1: “gene_id” substring of $9
  • Column 2: “gene” substring of $9
  • Column 3: $1 “:” $4 “-” $5
  • Column 4: $1
  • Column 5: $4
  • Column 6: $5
  • Column 7: $7
  • Column 8: $5-$4
  • Column 9: “gene_biotype” substring of $9

Attempts:

The general idea is to… (i) cat input.txt | awk provide the input to awk, (ii) 'BEGIN{FS="t"} use tab as the field separator, (iii) {split($9,a,";"); split field 9 into individual fields using semicolon as the field separator, (iv) if($3~"gene") print print indicated fields (to follow), but only if the third field is “gene”, (v) a[1]"t"a[6]"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"a[7]}' | print the indicated fields of each line in a tab-delimited fashion, (vi) sed 's/gene_id "//' | remove any instance of the string “gene_id "“, (vii) sed 's/gene "//' | remove any instance of the string “gene "“, (viii) sed 's/gene_biotype "//' | remove any instance of the string “gene_biotype "“, (ix) sed 's/[" ]//g' remove any instance of the chracter “"” global, then (x) > output.txt send the output to a file.

1.
$ cat input.txt | awk 'BEGIN{FS="t"}{split($9,a,";"); if($3~"gene") print a[1]"t"a[6]"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"a[7]}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt
LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  gbkeyGene       NC_051336.1:76909-85762 NC_051336.1     76909   85762   +       8853    Vom2r3
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  gbkeyGene       NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   Vom2r4
Vom2r-ps8       gbkeyGene       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    Vom2r-ps8
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA

This comes really close to what I want… except that output lines 2,4,5 are incorrect. I want the second output field to be a substring of the “gene "” input field (variable position in the ninth input field when separated by tab) and the ninth output field to be a substring of the “gene_biotype "input field (variable position in the ninth input field when separated by tab).

While printing a[1] to extract the “gene_id "” field does work since it is ALWAYS in the first field of the ninth semicolon-delimited input field, printing a[6] and a[7] to extract the other fields mentioned above will NOT work because their index position changes for some lines (see 2,4,5). The obvious solution is to print a field only when it contains the string of interest rather than a specific index position…

2.
$ cat $ANNOTATION | awk 'BEGIN{FS="t"}{split($9,a,";"); if($3~"gene") print {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}"t"{for(i=1;i<=NF;i++){if($i~/gene "/){print a[$i]}}}"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"{for(i=1;i<=NF;i++){if($i~/gene_biotype "/){print a[$i]}}}}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt

This awk command is riddled with syntax errors (using { or } or ; after print is not acceptable), thus the command is not performed and there is no output. The general idea was to replace the index position extraction bits with awk code that uses regex to identify fields containing specific substrings: {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}. However, this will not work if the syntax is improper, like above.


Note: I am using…

  • Shell version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
  • AWK version: GNU Awk 4.0.2
  • sed version: sed (GNU sed) 4.2.2

Leave a Comment