Splitting genomes to mimic 454 reads
From EdwardsLab
Mimicking 454
To make a genome mimic 454 reads, I started with the reads that we have. There are 16,176,486 reads in the metagenomes used in the Dinsdale et al. analysis, so I counted the distribution of the lengths of those reads. Then I read the DNA sequence of the genome of interest, either from a fasta file or from the SEED database. A contig is chosen at random, with probability proportional to its length relative to the whole genome, and then a start site is chosen at random within that contig. Finally, a length is drawn from the distribution of known lengths from the 454 GS20 reads, and the sequence defined by that contig, start site, and length is extracted.
The analysis is essentially random sampling with replacement (i.e. each sequence could occur more than once). Currently 30,000,000 bp of output are generated, although that can be altered if desired. The contig, start position, and length are saved for each sequence.
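The sampling procedure described above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl script from the tarball: the function name `sample_fragments`, its parameters, and the toy data are all hypothetical. It also assumes that a fragment running past the end of its contig is simply truncated there, which the description above does not specify.

```python
import random
from itertools import accumulate


def sample_fragments(contigs, length_counts, total_bp=30_000_000, seed=None):
    """Yield (contig_id, start, length, sequence) until total_bp bases are emitted.

    contigs:       dict mapping contig id -> DNA string
    length_counts: dict mapping read length (>= 1) -> number of times observed
    """
    rng = random.Random(seed)

    # Cumulative contig lengths: longer contigs are chosen proportionally more often.
    ids = list(contigs)
    contig_cum = list(accumulate(len(contigs[c]) for c in ids))

    # Cumulative read-length counts: lengths are drawn from the observed distribution.
    lengths = list(length_counts)
    length_cum = list(accumulate(length_counts[n] for n in lengths))

    emitted = 0
    while emitted < total_bp:
        cid = rng.choices(ids, cum_weights=contig_cum)[0]
        seq = contigs[cid]
        start = rng.randrange(len(seq))  # 0-based start site within the contig
        length = rng.choices(lengths, cum_weights=length_cum)[0]
        # ASSUMPTION: slicing truncates fragments that overrun the contig end.
        fragment = seq[start:start + length]
        emitted += len(fragment)
        yield cid, start, len(fragment), fragment


if __name__ == "__main__":
    # Toy data for illustration only; real input would be a genome and the read-size table.
    toy_contigs = {"contig1": "ACGT" * 50, "contig2": "GGCC" * 10}
    toy_lengths = {95: 3, 100: 10, 105: 4}
    for cid, start, length, _ in sample_fragments(toy_contigs, toy_lengths, total_bp=500, seed=1):
        print(cid, start, length)
```

Because every draw is independent, the same contig, start, and length can come up more than once, which matches the sampling-with-replacement behaviour described above. The generator stops once the requested number of bases has been produced, so the final total may slightly exceed `total_bp`.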
Distribution of Sequence Lengths
Here is the distribution of sequence lengths for the 16+ million sequence reads.
This is also included in the data in the tarball.
Download analysis
This tarball contains the following files:
- metagenome_read_sizes.txt -- a list of the sequence lengths and the number of times each is seen
- chop_genome_to_mimic_454.pl -- the Perl code that generates the sequence fragments
- 2351472.finished.fsa -- an example fasta file downloaded from the JGI on 1/27/08
- 2351472.fractionated -- an example output generated from that fasta file
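As a rough guide to using the read-size table, the sketch below loads length/count pairs into a dictionary. The layout of metagenome_read_sizes.txt is not documented on this page, so the code assumes two whitespace-separated columns per line (length, then count). The function name `read_length_counts` is hypothetical.

```python
def read_length_counts(lines):
    """Parse 'length count' pairs from an iterable of text lines.

    ASSUMPTION: each data line holds two whitespace-separated integers,
    the read length followed by the number of reads of that length.
    Blank lines, short lines, and lines starting with '#' are skipped.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        length, count = int(fields[0]), int(fields[1])
        counts[length] = counts.get(length, 0) + count
    return counts


# In-memory example with made-up numbers, standing in for the real file.
example = ["# length count", "98\t12", "99\t30", "100\t57", ""]
table = read_length_counts(example)
total_reads = sum(table.values())
mean_length = sum(n * c for n, c in table.items()) / total_reads
print(total_reads, round(mean_length, 2))
```

For the real file, passing an open file handle (for example `open("metagenome_read_sizes.txt")`) in place of the list should work, provided the column layout matches the assumption above.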
