The i5k Workspace@NAL provides data visualization, access, and curation tools for any arthropod genome project. We strive to:

Help improve the quality of arthropod genome assemblies and their annotations through our toolset and services;
Provide tools to improve, curate, and disseminate genome annotations;
Guide our data providers to long-term repositories for their stable datasets;
Improve arthropod genome data accessibility.

Here, we outline our data management policy for arthropod genome projects hosted at the i5k Workspace. Our current focus is on arthropod genome assemblies and all datasets derived from or mapped to them.

Accepted data types
Conditions of accepting data
What we do with your data
Versioning
Long-term storage policy
Datasets that we generate for you

Accepted data types

We currently only accept data resulting from or complementary to arthropod genome projects. Data types that we do not accept will be referred to the Ag Data Commons (https://data.nal.usda.gov) or other appropriate repositories.

Our current policy is to make all submitted data publicly available. Manual annotations in the Apollo application are only accessible by curators of that genome project.

Genome assemblies

Accepted file formats
1. Fasta (https://en.wikipedia.org/wiki/FASTA_format)
2. Agp (https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/)
File provenance
1. We accept only assemblies that are archived in INSDC repositories (ENA, GenBank and DDBJ) or RefSeq. We have found that users benefit from the additional contamination screen that INSDC repositories provide. Using official sequence and assembly identifiers also prevents confusion about sequence content and versioning. See https://metazoa.ensembl.org/info/about/legal/browser_agreement.html for further elaboration.
Acknowledgement of data source
1. We list the genome assembly data source on each i5k Project page (e.g. https://i5k.nal.usda.gov/Cimex_lectularius).
2. We require contact information (Name, valid email address, Affiliation) for the genome submitter (or other primary contact). This information is not currently listed on our pages, but we reserve the right to make this information available to those with questions about the assembly.

Annotations/gene predictions

Accepted file formats
1. Gff3 (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
2. bed (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
Constraints
1. All annotations must be submitted to us on the coordinate system of the assembly that we host (e.g. GenBank). You can find the assembly in our data downloads section, or contact us.
2. If you did not use this assembly, there may be problems displaying the mapped data in the JBrowse genome browser. In this case, we may provide guidance or assistance on how to transfer coordinates or re-map your annotations.
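The coordinate-system constraint above can be checked before submission. The following is a minimal sketch, not the i5k Workspace's own validation: it assumes a plain FASTA assembly and a tab-delimited gff3 file, and the function names `fasta_lengths` and `check_gff3_coordinates` are hypothetical. It reports features whose sequence ID is absent from the assembly or whose coordinates fall outside the sequence. Circular sequences, where an end coordinate may legitimately exceed the sequence length, are not handled.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA file-like object."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_gff3_coordinates(gff3_handle, lengths):
    """Return (line_number, message) pairs for features that do not fit the assembly."""
    problems = []
    for number, line in enumerate(gff3_handle, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        seqid = columns[0]
        try:
            start, end = int(columns[3]), int(columns[4])
        except ValueError:
            problems.append((number, "start/end are not integers"))
            continue
        if seqid not in lengths:
            problems.append((number, f"unknown sequence ID {seqid!r}"))
        elif not (1 <= start <= end <= lengths[seqid]):
            # gff3 coordinates are 1-based and inclusive, with start <= end.
            problems.append(
                (number, f"coordinates {start}-{end} outside 1-{lengths[seqid]}")
            )
    return problems


# Made-up example data: scaffold2 is 4 bp long and scaffold3 does not exist.
assembly = io.StringIO(">scaffold1 example\nACGTACGTAC\n>scaffold2\nACGT\n")
annotations = io.StringIO(
    "##gff-version 3\n"
    "scaffold1\texample\tgene\t2\t9\t.\t+\t.\tID=gene1\n"
    "scaffold2\texample\tgene\t3\t8\t.\t+\t.\tID=gene2\n"
    "scaffold3\texample\tgene\t1\t4\t.\t-\t.\tID=gene3\n"
)
problems = check_gff3_coordinates(annotations, fasta_lengths(assembly))
for number, message in problems:
    print(f"line {number}: {message}")
```

A file that passes this check can still be on the wrong coordinate system (for example, a different assembly version with identical sequence names), so it is a first screen only.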
Acknowledgement of data source
1. We list the annotation data source in the annotation analysis page (e.g. https://i5k.nal.usda.gov/bio_data/836754).
2. The data source is also listed in the ‘About this track’ section in the genome browser track display.
3. We require and display submitter contact information (Name and affiliation).
Annotation Categories
1. Official Gene Sets (OGS)
  1. Definition: Official Gene Sets (OGS) are designated by the project coordinator as the definitive “best” gene set for this species. There are no requirements other than coordinator approval. However, Official Gene Sets are often a synthesis of several different gene prediction programs and manual curation.
  2. Official Gene Sets generated by the NAL result from a pipeline (under development) that performs QC and merges manual curations from Apollo with a single additional gene set. See the GitHub page for more information: https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/Merge-two-GFF3-files.md.
  3. OGS Gene identifier policy
    1. We follow the Sequence Ontology definition of the ID attribute.
    2. We generally do not provide stable gene identifiers. We also do not currently map IDs between assembly or annotation versions.
    3. We aim to maintain the IDs provided to us, provided the gff3 file that they are stored in is compliant with the Sequence Ontology gff3 specification. In cases where IDs may be in conflict with other IDs or are otherwise problematic, we may assign new IDs (after consultation with the file submitter and the project contact).
  4. Additional file modifications
    1. Occasionally, we will need to modify the original files provided to us in order to meet the requirements of our databases and applications. In this case, we will document the changes made and provide them in a readme file in our data downloads section.
2. Primary Gene Sets.
  1. Definition: Primary Gene Sets are designated by the Project Coordinator as the gene set that should be curated in Apollo.
  2. Primary Gene Sets are only visualized in the genome browser - we do not import primary gene sets into our database for longer-term storage. As such, we do not change formatting or content of the file unless there are problems with the JBrowse display, or we anticipate problems during the manual curation effort of the primary gene set. Changes are reported to the file provider, and are listed in a readme file in our ‘data downloads’ section.
3. Additional Gene Sets and Annotation Projects
  1. Definition: Any gene set that is not a primary or official gene set.
  2. Additional Gene Sets and Annotation Projects are only visualized in the genome browser - we do not import these gene sets into our database for longer-term storage. As such, we do not change formatting or content of the file unless there are problems with the JBrowse display, or we anticipate problems during the manual curation effort of the primary gene set. Changes are reported to the file provider, and are listed in a readme file in our ‘data downloads’ section.
4. Manually curated annotations derived from Apollo
  1. Current Policy
    1. Active projects
      1. Active projects are open for curation under an active community curation team.
      2. The NAL creates regular backups of the annotation files. All curators have full access to the curated genes.
    2. Finished projects.
      1. A project is ‘finished’ when it is deemed as such by the project coordinator.
      2. Official Gene Set development at the NAL. We are actively developing a pipeline that performs QC on Apollo gff3 output and performs a merge between the Apollo output and an additional gene set that was designated to be curated. We aim to provide this service when requested, but because the project is still under development, we cannot make any guarantees as to the completion date or quality of the resulting project.
      3. For all finished projects, we
        
        Provide the project coordinator with the final gff3 from Apollo;
        
        Store the final gff3 from Apollo in our records;
        
        Clear out all annotations from the user-created annotations track in Apollo.
    3. Orphaned projects
      1. “Orphaned” projects are Apollo curation projects that are open for curation but that do not have an active project coordinator.
      2. Annotations in Apollo will be maintained by the NAL, but there are no guarantees that the annotations will be quality-controlled, integrated into an Official Gene Set, or deposited into an official repository.

Mapped files

Accepted file formats (these largely depend on what the JBrowse genome browser accepts)
Data type examples (anything is possible if you can map it and your community can benefit)
1. RNA-Seq
2. DNA-Seq
3. Variant Data
Constraints
1. All mapped files must be submitted to us on the coordinate system of the assembly that we host (e.g. GenBank). You can find the assembly in our data downloads section, or contact us.
2. If you did not use this assembly, there may be problems displaying the mapped data in the JBrowse genome browser. In this case, we may provide guidance or assistance on how to transfer coordinates or re-map your data.
File processing services
1. We routinely convert bam files submitted to us to bigwig format for easier display in the genome browser, unless requested otherwise. Generally, we use this pipeline for conversion.
2. We can also reduce bam file coverage for you, on request. We may also recommend this to you if we think your bam file is too dense to display well in the genome browser.
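If you want to reduce bam file coverage yourself before submitting, one common approach is random subsampling with `samtools view -s`. The sketch below is an illustration only and is not the pipeline referred to above. It assumes you already know the approximate mean coverage of your file; the function names and file names are hypothetical. It only builds the command and does not run it.

```python
def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so mean coverage drops to roughly the target."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    return min(1.0, target_coverage / current_coverage)


def samtools_subsample_command(bam_in, bam_out, fraction, seed=42):
    """Build a 'samtools view -s' command that keeps about `fraction` of the reads.

    The -s argument has the form SEED.FRACTION: the integer part seeds the
    random number generator and the decimal part is the fraction to keep.
    """
    rounded = round(fraction, 4)
    if not 0 < rounded < 1:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    digits = f"{rounded:.4f}".split(".")[1]
    return ["samtools", "view", "-b", "-s", f"{seed}.{digits}", "-o", bam_out, bam_in]


# Hypothetical example: bring a 200x file down to about 50x.
fraction = subsample_fraction(current_coverage=200, target_coverage=50)
command = samtools_subsample_command("reads.bam", "reads.subsampled.bam", fraction)
print(" ".join(command))
```

With samtools installed, the resulting list can be passed to `subprocess.run(command, check=True)`. Fixing the seed keeps the subsample reproducible.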

Conditions of accepting data

Your file type is one of the ‘accepted types’ listed above, and follows guidelines for that type.
Approval from the community coordinator, depending on the genome project type.
Receipt of sufficient metadata.
Datasets from Agricultural Research Service authors will usually require an ARIS log number:
1. Published datasets, meaning datasets deposited and released in a public repository with a persistent identifier, now require an ARIS log number. Example: A genome publicly released in GenBank; transposable element predictions described in a peer-reviewed publication.
2. Unpublished datasets that are final research products now require an ARIS log number. Example: A genome assembly that has not yet been publicly released in GenBank.
3. Unpublished datasets that are considered provisional do not require an ARIS log number. Instead, these datasets will be reviewed internally. Example: i5k Workspace@NAL functional annotations.
Datasets from non-ARS authors will be reviewed internally and do not require an ARIS log number.
We do not currently have a size limit for data files, but we reserve the right to reject a dataset if it is excessively large.
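The ARIS log number rules above can be summarized as a small decision function. This is a simplified sketch of the rules as written here, not an official tool: the names `Status` and `review_route` are hypothetical, and because the policy says ARS datasets will "usually" require a log number, real cases may differ.

```python
from enum import Enum


class Status(Enum):
    PUBLISHED = "published"
    UNPUBLISHED_FINAL = "unpublished, final research product"
    UNPUBLISHED_PROVISIONAL = "unpublished, provisional"


def review_route(ars_author: bool, status: Status) -> str:
    """Return the review route implied by the rules listed above."""
    if not ars_author:
        # Datasets from non-ARS authors are reviewed internally.
        return "internal review"
    if status is Status.UNPUBLISHED_PROVISIONAL:
        # Provisional ARS datasets are also reviewed internally.
        return "internal review"
    # Published datasets and unpublished final research products from ARS authors.
    return "ARIS log number"


for ars_author in (True, False):
    for status in Status:
        print(ars_author, status.value, "->", review_route(ars_author, status))
```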

What we do with your data

This depends on the data type.

Genome assemblies:
1. Use as basis of genome browser
2. Upload to BLAST
3. Post in data downloads section
Gene predictions:
1. Official Gene Sets:
  1. Verify integrity based on the gff3 specification.
  2. Load as track in genome browser (unless requested otherwise)
  3. Fasta files derived from gene predictions (all RNA, CDS, peptides) will be uploaded to BLAST (unless requested otherwise)
  4. Post in data downloads section
2. Primary Gene Sets
  1. Load as track in genome browser (unless requested otherwise)
  2. Fasta files derived from gene predictions (all RNA, CDS, peptides) will be uploaded to BLAST (unless requested otherwise)
  3. Post in data downloads section
3. Additional Gene Sets and Annotation Projects
  1. Load as track in genome browser (unless requested otherwise)
  2. We do not regularly post mapped files in our data downloads section. However, we will do so if requested (or approved) by the project coordinator, and if the file size is less than 2 Gb.
4. Manual curation output from Apollo:
  1. If requested and possible, merge finished manual curation project (as determined by project coordinator) with pre-designated gene set to generate an OGS.
Mapped files:
1. Load as track in genome browser (unless requested otherwise)
2. We do not regularly post mapped files in our data downloads section. However, we will do so if requested (or approved) by the project coordinator, and if the file size is less than 2 Gb.
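For Official Gene Sets, the first step listed above is verifying the file against the gff3 specification, and the identifier policy depends on IDs that do not conflict. The sketch below illustrates two such checks, ID reuse and unresolved Parent references. It is not the NAL's QC pipeline, and the function names are hypothetical. As a simplification, it allows an ID to repeat only when the sequence ID and feature type match, which covers multi-line features such as a CDS split across exons.

```python
def parse_attributes(column9):
    """Parse the gff3 attributes column into a dict of tag -> raw value."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def check_ids(gff3_lines):
    """Return (line_number, message) pairs for conflicting IDs and missing Parents."""
    seen = {}      # ID -> (seqid, feature type, line number of first use)
    parents = []   # (line number, referenced parent ID)
    problems = []
    for number, line in enumerate(gff3_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # malformed lines are out of scope for this sketch
        attributes = parse_attributes(columns[8])
        feature_id = attributes.get("ID")
        if feature_id is not None:
            key = (columns[0], columns[2])
            if feature_id in seen and seen[feature_id][:2] != key:
                first = seen[feature_id][2]
                problems.append(
                    (number, f"ID {feature_id!r} already used on line {first}")
                )
            seen.setdefault(feature_id, key + (number,))
        # A feature may list several parents, separated by commas.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                parents.append((number, parent))
    for number, parent in parents:
        if parent not in seen:
            problems.append((number, f"Parent {parent!r} is not defined in this file"))
    return problems


# Made-up example: line 5 reuses the ID gene1 and points to a missing parent.
example = [
    "scaffold1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "scaffold1\t.\tmRNA\t1\t100\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaffold1\t.\tCDS\t1\t40\t.\t+\t0\tID=cds1;Parent=mrna1",
    "scaffold1\t.\tCDS\t60\t100\t.\t+\t2\tID=cds1;Parent=mrna1",
    "scaffold1\t.\tmRNA\t1\t100\t.\t+\t.\tID=gene1;Parent=gene2",
]
id_problems = check_ids(example)
for number, message in id_problems:
    print(f"line {number}: {message}")
```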

Versioning

Many genomic data types undergo versioning (minor and major). If requested, we will update your datasets to the latest version, provided they fit the criteria for each data type (listed above). We will generally only host a single version of your genome assembly, and only allow curation on one version of a genome assembly at a time. Superseded files will still be available under the “Legacy” section of the data downloads for your project.

Note that we do not provide mapping services of existing annotations to a re-assembled genome. We can provide some assistance if the assembly changes are minor (e.g. removal of contaminated scaffolds, breaking individual scaffolds into contigs). Contact us if you have questions about this.

Long-term storage policy

Currently, we cannot guarantee that any of the files that we host are stored at the NAL in perpetuity. We will work with data providers to find long-term storage for their datasets in the appropriate repository if this is desired.

Datasets that we generate for you

The i5k Workspace@NAL now generates two dataset types as part of our regular project setup:

Functional annotation of genome annotations;
Mapped RNA-Seq.

We do not have a policy to maintain or archive these datasets. However, if you need long-term storage, e.g. if you used a dataset in a peer-reviewed publication, we can work with the Ag Data Commons (https://data.nal.usda.gov) to archive it for you. Contact us if this is the case.

I5k Workspace data management policy