Downloading all the sequences from a PATRIC family

PATRIC families are organized sets of related proteins. Sometimes we want to get all the protein sequences in a family. Here’s how to do that using the PATRIC command line.

You can get the PATRIC family ID from pretty much anywhere on the PATRIC website, but I usually end up getting it from a feature page. In this example, my PATRIC family ID is PGF_07220970.

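We use Linux pipes to pass that ID along (note that we use p3-echo, not the standard echo) to p3-get-family-features, which gets the features in the family; we only ask for the patric_id attribute. Then p3-get-feature-data fetches the amino acid sequence for each of those IDs. Here $FAMILY is a shell variable holding the family ID (for example, set it first with FAMILY=PGF_07220970), and --ftype global matches the PGF_ (global family) prefix of the ID.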

p3-echo $FAMILY | p3-get-family-features --ftype global --attr patric_id | p3-get-feature-data --attr aa_sequence > $FAMILY.tsv

This gives you a three-column TSV file with the PATRIC family ID, the PATRIC protein ID, and the protein sequence. You can convert that to a FASTA file like this:

perl -F"\t" -lane 'next if (/feature.patric_id/); print ">$F[1] [PATRIC FAMILY $F[0]]\n$F[2]"' PGF_07220970.tsv > PGF_07220970.faa
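If you would rather not use perl, the same conversion can be sketched with awk. This is only a sketch under a few assumptions: the first line of the TSV is a header (so it is skipped by position rather than by name), the columns are in the order described above, and tsv_to_fasta is a helper name I made up, not part of the PATRIC command line. The sample file, including the header name of the first column, the feature IDs, and the sequences, is invented so the example runs without contacting PATRIC.

```shell
#!/bin/sh
# Made-up sample standing in for the p3 output: a header line plus two rows.
# The first header name, the feature IDs, and the sequences are placeholders.
printf 'family_id\tfeature.patric_id\tfeature.aa_sequence\n' > sample.tsv
printf 'PGF_07220970\tfig|0000.1.peg.1\tMKLV\n' >> sample.tsv
printf 'PGF_07220970\tfig|0000.1.peg.2\tMSTA\n' >> sample.tsv

# tsv_to_fasta is a hypothetical helper, not a PATRIC tool.
# It skips the first line (assumed to be the header) and any row with
# fewer than three tab-separated columns, then prints one FASTA record
# per row: ">protein_id [PATRIC FAMILY family_id]" followed by the sequence.
tsv_to_fasta() {
    awk -F'\t' 'NR > 1 && NF >= 3 {
        printf ">%s [PATRIC FAMILY %s]\n%s\n", $2, $1, $3
    }' "$1"
}

tsv_to_fasta sample.tsv > sample.faa
cat sample.faa
```

A quick sanity check afterwards is to compare the number of FASTA headers (grep -c '^>' sample.faa) with the number of data rows in the TSV; if they differ, some rows were probably missing a sequence.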