Goal: To print fields from an input file to an output file, but with special regard to specific fields split by semicolons (;
).
Example Input (input.txt):
NC_051336.1 Gnomon gene 40042 56215 . - . gene_id "LOC102552844"; transcript_id ""; db_xref "GeneID:102552844"; db_xref "RGD:7551986"; gbkey "Gene"; gene "LOC102552844"; gene_biotype "pseudogene"; pseudo "true";
NC_051336.1 BestRefSeq%2CGnomon gene 76909 85762 . + . gene_id "Vom2r3"; transcript_id ""; db_xref "GeneID:502213"; db_xref "RGD:1565892"; description "vomeronasal 2 receptor, 3"; gbkey "Gene"; gene "Vom2r3"; gene_biotype "protein_coding"; gene_synonym "RGD1565892";
NC_051336.1 Gnomon gene 162525 192445 . - . gene_id "LOC103693496"; transcript_id ""; db_xref "GeneID:103693496"; db_xref "RGD:9090555"; gbkey "Gene"; gene "LOC103693496"; gene_biotype "protein_coding";
NC_051336.1 Gnomon transcript 162525 167758 . - . gene_id "LOC103693496"; transcript_id "XM_039098304.1"; db_xref "GeneID:103693496"; gbkey "mRNA"; gene "LOC103693496"; model_evidence "Supporting evidence includes similarity to: 4 Proteins, and 90% coverage of the annotated genomic feature by RNAseq alignments"; product "sperm motility kinase X-like, transcript variant X9"; transcript_biotype "mRNA";
NC_051336.1 BestRefSeq gene 226465 241532 . - . gene_id "Vom2r4"; transcript_id ""; db_xref "GeneID:308248"; db_xref "RGD:1564110"; description "vomeronasal 2 receptor, 4"; gbkey "Gene"; gene "Vom2r4"; gene_biotype "protein_coding"; gene_synonym "RGD1564110";
NC_051336.1 BestRefSeq transcript 226465 241532 . - . gene_id "Vom2r4"; transcript_id "NM_001099458.1"; db_xref "GeneID:308248"; gbkey "mRNA"; gene "Vom2r4"; product "vomeronasal 2 receptor, 4"; transcript_biotype "mRNA";
NC_051336.1 Curated Genomic gene 267100 275769 . - . gene_id "Vom2r-ps8"; transcript_id ""; db_xref "GeneID:502214"; db_xref "RGD:1563053"; description "vomeronasal 2 receptor, pseudogene 8"; gbkey "Gene"; gene "Vom2r-ps8"; gene_biotype "pseudogene"; gene_synonym "RGD1563053"; pseudo "true";
NC_051336.1 Gnomon gene 284195 301267 . - . gene_id "LOC102556157"; transcript_id ""; db_xref "GeneID:102556157"; db_xref "RGD:7626961"; gbkey "Gene"; gene "LOC102556157"; gene_biotype "protein_coding";
NC_051336.1 Gnomon gene 307758 313908 . - . gene_id "LOC108352169"; transcript_id ""; db_xref "GeneID:108352169"; db_xref "RGD:11507047"; gbkey "Gene"; gene "LOC108352169"; gene_biotype "lncRNA";
NC_051336.1 Gnomon transcript 307758 313908 . - . gene_id "LOC108352169"; transcript_id "XR_005497081.1"; db_xref "GeneID:108352169"; gbkey "ncRNA"; gene "LOC108352169"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC108352169, transcript variant X3"; transcript_biotype "lnc_RNA";
Desired output (output.txt):
LOC102552844 LOC102552844 NC_051336.1:40042-56215 NC_051336.1 40042 56215 - 16173 pseudogene
Vom2r3 Vom2r3 NC_051336.1:76909-85762 NC_051336.1 76909 85762 + 8853 protein_coding
LOC103693496 LOC103693496 NC_051336.1:162525-192445 NC_051336.1 162525 192445 - 29920 protein_coding
Vom2r4 Vom2r4 NC_051336.1:226465-241532 NC_051336.1 226465 241532 - 15067 protein_coding
Vom2r-ps8 Vom2r-ps8 NC_051336.1:267100-275769 NC_051336.1 267100 275769 - 8669 pseudogene
LOC102556157 LOC102556157 NC_051336.1:284195-301267 NC_051336.1 284195 301267 - 17072 protein_coding
LOC108352169 LOC108352169 NC_051336.1:307758-313908 NC_051336.1 307758 313908 - 6150 lncRNA
- Column 1: “gene_id” substring of $9
- Column 2: “gene” substring of $9
- Column 3: $1 “:” $4 “-” $5
- Column 4: $1
- Column 5: $4
- Column 6: $5
- Column 7: $7
- Column 8: $5-$4
- Column 9: “gene_biotype” substring of $9
Attempts:
The general idea is to… (i) cat input.txt | awk
provide the input to awk, (ii) 'BEGIN{FS="t"}
use tab as the field separator, (iii) {split($9,a,";");
split field 9 into individual fields using semicolon as the field separator, (iv) if($3~"gene") print
print indicated fields (to follow), but only if the third field is “gene”, (v) a[1]"t"a[6]"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"a[7]}' |
print the indicated fields of each line in a tab-delimited fashion, (vi) sed 's/gene_id "//' |
remove any instance of the string “gene_id "
“, (vii) sed 's/gene "//' |
remove any instance of the string “gene "
“, (viii) sed 's/gene_biotype "//' |
remove any instance of the string “gene_biotype "
“, (ix) sed 's/[" ]//g'
remove any instance of the chracter “"
” global, then (x) > output.txt
send the output to a file.
1.
$ cat input.txt | awk 'BEGIN{FS="t"}{split($9,a,";"); if($3~"gene") print a[1]"t"a[6]"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"a[7]}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt
LOC102552844 LOC102552844 NC_051336.1:40042-56215 NC_051336.1 40042 56215 - 16173 pseudogene
Vom2r3 gbkeyGene NC_051336.1:76909-85762 NC_051336.1 76909 85762 + 8853 Vom2r3
LOC103693496 LOC103693496 NC_051336.1:162525-192445 NC_051336.1 162525 192445 - 29920 protein_coding
Vom2r4 gbkeyGene NC_051336.1:226465-241532 NC_051336.1 226465 241532 - 15067 Vom2r4
Vom2r-ps8 gbkeyGene NC_051336.1:267100-275769 NC_051336.1 267100 275769 - 8669 Vom2r-ps8
LOC102556157 LOC102556157 NC_051336.1:284195-301267 NC_051336.1 284195 301267 - 17072 protein_coding
LOC108352169 LOC108352169 NC_051336.1:307758-313908 NC_051336.1 307758 313908 - 6150 lncRNA
This comes really close to what I want… except that output lines 2,4,5 are incorrect. I want the second output field to be a substring of the “gene "
” input field (variable position in the ninth input field when separated by tab) and the ninth output field to be a substring of the “gene_biotype "
input field (variable position in the ninth input field when separated by tab).
While printing a[1]
to extract the “gene_id "
” field does work since it is ALWAYS in the first field of the ninth semicolon-delimited input field, printing a[6]
and a[7]
to extract the other fields mentioned above will NOT work because their index position changes for some lines (see 2,4,5). The obvious solution is to print a field only when it contains the string of interest rather than a specific index position…
2.
$ cat $ANNOTATION | awk 'BEGIN{FS="t"}{split($9,a,";"); if($3~"gene") print {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}"t"{for(i=1;i<=NF;i++){if($i~/gene "/){print a[$i]}}}"t"$1":"$4"-"$5"t"$1"t"$4"t"$5"t"$7"t"$5-$4"t"{for(i=1;i<=NF;i++){if($i~/gene_biotype "/){print a[$i]}}}}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt
This awk command is riddled with syntax errors (using {
or }
or ;
after print
is not acceptable), thus the command is not performed and there is no output. The general idea was to replace the index position extraction bits with awk code that uses regex to identify fields containing specific substrings: {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}
. However, this will not work if the syntax is improper, like above.
Note: I am using…
- Shell version:
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
- AWK version:
GNU Awk 4.0.2
- sed version:
sed (GNU sed) 4.2.2