Extracting fasta from AUGUSTUS output
Goal: extract extract CDS and protein sequences from first transcript in Augustus output files
count of genes in gff
grep -c “# start gene” Structural_annotation_abintio.gffgene count = 64558
This matches the reported number from the genome (Vidal Dupiol BioRxiv)
1. Use this perl script to extract fasta seq file for AUGUSTUS predicted genes and proteins
perl getAnnoFasta.pl Structural_annotation_abintio.gff
this results in 2 files:
Structural_annotation_abintio.aa
Structural_annotation_abintio.codingseq
2. Files then need to be parsed for the first transcript only (e.g., g1.t1 … gn.t1)
awk '/^>/ {P=index($0,".t1")==0} {if(!P) print} ' Structural_annotation_abintio.aa > Pact_T1_Structural_annotation_abintio.aa
sed 's/\..*$//' Pact_T1_Structural_annotation_abintio.aa > Pact_protein.fa
awk '/^>/ {P=index($0,".t1")==0} {if(!P) print} ' Structural_annotation_abintio.codingseq > Pact_T1_Structural_annotation_abintio.codingseq
	
sed '/\..*\./s/^[^.]*\./>/' Pact_T1_Structural_annotation_abintio.codingseq > Pact_CDS.fas
sed 's/\..*$//' Pact_CDS.fas > Pact_CDS.fa
Output Sequence Files
Pact_protein.fa Pact_CDS.fa
3. Sanity Checks
count of genes in fasta files
grep -c ">" Pact_protein.fa
grep -c ">" Pact_CDS.fa
Pact_CDS.fa = 64558
Pact_protein.fa = 64554
4 protein sequences seem to be missing from protein
Is this because some did not generate predicted proteins?
Protein file headers
grep -e ">"  Pact_protein.fa > protein.headers	 #### CDS file headers
grep -e ">" Pact_CDS.fa > CDS.headers	 #### Check for differences in the headers
diff protein.headers CDS.headers > diffs
1243a1244
> g1244
12721a12723
> g12723
18185a18188
> g18188
48165a48169
> g48169
this adds up to the 4 missing
search these 4 in the augustus ouput
less Structural_annotation_abintio.gff
# start gene g1244
scaffold97cov102        AUGUSTUS        gene    1       570     0.49    +       .       g1244
scaffold97cov102        AUGUSTUS        transcript      1       570     0.49    +       .       g1244.t1
scaffold97cov102        AUGUSTUS        terminal        566     570     0.86    +       2       transcript_id "g1244.t1"; gene_id "g1244";
scaffold97cov102        AUGUSTUS        intron  1       565     0.49    +       .       transcript_id "g1244.t1"; gene_id "g1244";
scaffold97cov102        AUGUSTUS        CDS     566     570     0.86    +       2       transcript_id "g1244.t1"; gene_id "g1244";
scaffold97cov102        AUGUSTUS        stop_codon      568     570     .       +       0       transcript_id "g1244.t1"; gene_id "g1244";
# coding sequence = [aatga]
# protein sequence = []
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 50
# CDS exons: 1/1
#      E:   1
# CDS introns: 0/1
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 3
#      E:   3 (TCONS00004018,TCONS00060351,TCONS00060352)
# end gene g1244`
# start gene g12723
scaffold991cov108       AUGUSTUS        gene    1       1990    0.4     +       .       g12723
scaffold991cov108       AUGUSTUS        transcript      1       1990    0.4     +       .       g12723.t1
scaffold991cov108       AUGUSTUS        terminal        1986    1990    1       +       2       transcript_id "g12723.t1"; gene_id "g12723";
scaffold991cov108       AUGUSTUS        intron  1       1985    0.4     +       .       transcript_id "g12723.t1"; gene_id "g12723";
scaffold991cov108       AUGUSTUS        CDS     1986    1990    1       +       2       	transcript_id "g12723.t1"; gene_id "g12723";
scaffold991cov108       AUGUSTUS        stop_codon      1988    1990    .       +       0       transcript_id "g12723.t1"; gene_id "g12723";
# coding sequence = [tgtag]
# protein sequence = []
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 50
# CDS exons: 1/1
#      E:   1
# CDS introns: 0/1
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 31
#      E:  31 (TCONS00062974,TCONS00010788,TCONS00024921,TCONS00062723,TCONS00046300,TCONS00008149,...)
# end gene g12723
# start gene g18188
scaffold1507cov106      AUGUSTUS        gene    1       979     0.6     +       .       g18188
scaffold1507cov106      AUGUSTUS        transcript      1       979     0.6     +       .       g18188.t1
scaffold1507cov106      AUGUSTUS        terminal        976     979     0.6     +       1       transcript_id "g18188.t1"; gene_id "g18188";
scaffold1507cov106      AUGUSTUS        intron  1       975     0.71    +       .       transcript_id "g18188.t1"; gene_id "g18188";
scaffold1507cov106      AUGUSTUS        CDS     976     979     0.6     +       1       transcript_id "g18188.t1"; gene_id "g18188";
scaffold1507cov106      AUGUSTUS        stop_codon      977     979     .       +       0       transcript_id "g18188.t1"; gene_id "g18188";
# coding sequence = [ctag]
# protein sequence = []
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 50
# CDS exons: 1/1
#      E:   1
# CDS introns: 0/1
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 29
#      E:  29 (TCONS00047370,TCONS00028284,TCONS00036554,TCONS00012242,TCONS00056347,TCONS00050834,...)
# end gene g18188
# start gene g48169
scaffold9876cov105      AUGUSTUS        gene    1       180     0.99    +       .       g48169
scaffold9876cov105      AUGUSTUS        transcript      1       180     0.99    +       .       g48169.t1
scaffold9876cov105      AUGUSTUS        terminal        176     180     0.99    +       2       transcript_id "g48169.t1"; gene_id "g48169";
scaffold9876cov105      AUGUSTUS        intron  1       175     0.99    +       .       transcript_id "g48169.t1"; gene_id "g48169";
scaffold9876cov105      AUGUSTUS        CDS     176     180     0.99    +       2       transcript_id "g48169.t1"; gene_id "g48169";
scaffold9876cov105      AUGUSTUS        stop_codon      178     180     .       +       0       transcript_id "g48169.t1"; gene_id "g48169";
# coding sequence = [aataa]
# protein sequence = []
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 50
# CDS exons: 1/1
#      E:   1
# CDS introns: 0/1
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 25
#      E:  25 (TCONS00029265,TCONS00001086,TCONS00052623,TCONS00019793,TCONS00046668,TCONS00025937,...)
# end gene g48169
Yes, 4 genes did not have predicted proteins
Yes we can move forward with this perl script, followed by cleaning of sequence names to remove extra header information (e.g., .t1)
Compare this to script written by Tejashree Modak
August Extract Script by Tejashree Modak
#!/bin/bash
PROG="${0##*/}"
help() {
echo -e "$PROG [options] <Agustus gff>\n"
echo "Options:"
echo "-c <coding file>    Output file for CDS fasta"
echo "-p <coding file>    Output file for protein fasta"
exit 0
}
while getopts "c:p:h" arg ; do
case "$arg" in
    c)  cfile="$OPTARG" ;;
    p)  pfile="$OPTARG" ;;
    *)  help ;;
esac
done
shift $((OPTIND-1))
# pre-process the file to "join" all coding and protein sequences
# output will be
# >gene1
# coding <coding-sequence>
# protein <protein-sequence>
# coding <coding-sequence>
# protein <protein-sequence>
# ....
# >gene2
# ...
awk '/^# start gene g/ {
print ">" $NF
f = ""
}
/^# (coding|protein) sequence/ {
f = $NF                     # store the start of sequence
t = $2                      # t is either "coding" or "protein"
if ($NF ~ /\]$/) {          # for one line sequences, ] is on the same line, so print it out
    print t, f; f = ""
}
}
length(f) && $1 == "#" && NF == 2 {    # multiline sequences will continue appending to f
f = f $NF                          # first field is '#' second is sequence,
if ($NF ~ /\]$/) {                 # once we get to ']' thats the end of sequence, so print it out
    print t, f; f = ""
}
} ' $1  | tr -d '[]' > preprocessed
# after that Pick the gene line and the first two coding and protein lines
# and write then out
awk -v pfile="$pfile" -v cfile="$cfile" 'BEGIN {
pfile = !length(pfile) ? "protein.fasta" : pfile;
printf "" > pfile
cfile = !length(cfile) ? "coding.fasta" : cfile;
printf "" > cfile
}
NF == 1 && /^>/ {
gene = $1
c = p = 0
total++
}
$1 == "coding" && c == 0 && NF == 2 {
print gene "\n" $NF >> cfile
c++; codings++;
}
$1 == "protein" && p == 0 && NF == 2 {
print  gene "\n" $NF >> pfile
p++; proteins++;
}
$1 == "protein" && (c == 0 || p == 0) {
if (c == 0) {
    missing_codings++;
} else {
    missing_proteins++;
}
print "WARNING: Gene", substr(gene, 2), "has", c, "coding sequences and", p, "protein sequences"
}
END {
print "Statistics:\nTotal Genes:", total, "\nTotal Coding Sequences:", codings, "\nTotal Protein Sequences:", proteins, "\nEmpty Coding Sequences:", missing_codings, "\nEmpty Protein Sequences:", missing_proteins
}' preprocessed
echo "Total genes in $1: $(grep -c '^# start gene' $1)" 
1. Run extraction script
bash get-augustus-fasta.sh Structural_annotation_abintio.gff
2. Output Files
coding.fasta protein.fasta
3. Sanity check against sequence number
Script writes out the following
WARNING: Gene g1244 has 1 coding sequences and 0 protein sequences
WARNING: Gene g12723 has 1 coding sequences and 0 protein sequences
WARNING: Gene g18188 has 1 coding sequences and 0 protein sequences
WARNING: Gene g48169 has 1 coding sequences and 0 protein sequences
Statistics:
Total Genes: 64558
Total Coding Sequences: 64558
Total Protein Sequences: 64554
Empty Coding Sequences:
Empty Protein Sequences: 4
Total genes in Structural_annotation_abintio.gff: 64558
Conclusions:
TM script is faster and fewer steps and gives you the output for a sanity check directly
    Written on June  6, 2020