BCEM-UniAndes/chloroplast_markers_database

chloroplast_markers_database

The database in this repository was used in our study, which is currently in the process of publication as:

"Taxonomical evaluation of plants genetic markers by Bayesian Classifier"

*by

This database contains plant sequences of six chloroplast markers (trnL, rpoB, rbcL, matK, psbA-trnH and psbK). We provide Greengenes-like databases derived from NCBI data for scientists who want to use these resources in their own research.

Sequences were downloaded from GenBank in September 2018. The FASTA file was modified to include the taxonomy of each species (downloaded and linked from NCBI Taxonomy). The taxonomy was included in each FASTA header in the following order:

, ID, Taxonomy.

The taxonomy annotation is: p__(phylum),c__(class),o__(order),f__(family),g__(genus),s__(species).
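
As an illustration of the annotation format, the following minimal sketch splits a taxonomy string into named ranks. It assumes the fields are comma-separated and prefixed exactly as listed above; the function name `parse_taxonomy` and the example lineage are hypothetical and are not taken from the database files.

```python
# Map each rank prefix used in the annotation to its rank name.
RANKS = {
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_taxonomy(taxonomy):
    """Split a rank-prefixed taxonomy string into a {rank: name} dict.

    Assumption: fields are comma-separated, e.g. "p__X,c__Y,...".
    Fields with an unknown prefix or no "__" separator are skipped.
    """
    lineage = {}
    for field in taxonomy.split(","):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANKS:
            lineage[RANKS[prefix]] = name
    return lineage


# Hypothetical example string, written for illustration only.
example = "p__Streptophyta,c__Magnoliopsida,o__Poales,f__Poaceae,g__Zea,s__Zea mays"
lineage = parse_taxonomy(example)
print(lineage["family"], "|", lineage["species"])
```

If the files use a different separator (for example, the semicolons used by Greengenes itself), only the `split` argument would need to change.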

This database includes a balanced number of sequences per species. Two different datasets were generated for each molecular marker:

  • The first one with species having a minimum of 2 sequences and a maximum of 20 sequences per species (directory: "2_20_DB"),
  • and the second one having a minimum of 5 sequences and a maximum of 20 sequences per species (directory: "5_20_DB").

In both cases, species with more than 20 sequences were randomly subsampled to 20 sequences. The 5-20 dataset was generated to evaluate the performance of each marker in general classification, whereas the 2-20 dataset was generated to determine dataset bias in marker classification.
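
The filtering and subsampling described above can be sketched as follows. This is not the script used to build the database; it is a minimal illustration under the assumption that each record is a (species, sequence ID) pair. The function name `balance_by_species`, the fixed seed, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def balance_by_species(records, min_seqs, max_seqs=20, seed=0):
    """Keep species with at least min_seqs sequences, capped at max_seqs.

    records: iterable of (species, sequence_id) pairs (assumed layout).
    Species above the cap are randomly subsampled without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_species = defaultdict(list)
    for species, seq_id in records:
        by_species[species].append(seq_id)

    balanced = {}
    for species, ids in by_species.items():
        if len(ids) < min_seqs:
            continue  # too few sequences for this dataset
        if len(ids) > max_seqs:
            ids = rng.sample(ids, max_seqs)
        balanced[species] = ids
    return balanced


# Toy records: 30 sequences for one species, 3 for another, 1 for a third.
records = (
    [("Zea mays", f"zm{i}") for i in range(30)]
    + [("Oryza sativa", f"os{i}") for i in range(3)]
    + [("Coffea arabica", "ca0")]
)

db_2_20 = balance_by_species(records, min_seqs=2)  # analogous to "2_20_DB"
db_5_20 = balance_by_species(records, min_seqs=5)  # analogous to "5_20_DB"
print({sp: len(ids) for sp, ids in db_2_20.items()})
print({sp: len(ids) for sp, ids in db_5_20.items()})
```

With these toy records, the 2-20 setting keeps two species (one capped at 20 sequences) and the 5-20 setting keeps only the species that had 30 sequences.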

We hope this material is useful for you.
