Welcome to the MapMan Family of Software Forum



MapMan Bugs


mercator : Error: Sequence Check failed
mercator sequene analysis mapman upload mercator cassfiction mrcator mercator fasta upload problem
11/26/13 5:48 PM
Dear MapMan/Mercator-admin,

I can't upload protein fasta files to Mercator any more. I always get "warning(s)" and "fatal(s)" errors (see below).
I've checked my protein sequences so that they contain only the twenty upper-case one-letter amino acid codes (removing all entries with "X", "U", etc.), changed the headers to contain only ">AccessionNumber", used very short fasta files, changed the line length from 80 to 70 to 60, and even tried fasta files that Mercator had previously processed successfully. I've also downloaded UniProt UniRef100 fasta files, and they don't work either. After removing all entries not consisting purely of the 20 amino acid codes, one file was accepted, but I still got error messages of Type A (see below). I repeated the process with another fasta, but got Type A and Type B errors.
Some fastas work as they are, e.g.
ftp://ftp.jcvi.org/pub/data/m_truncatula/Mt4.0/Annotation/Mt4.0v1/Mt4.0v1_GenesProteinSeq_20130731_1800.fasta

How can I fix this??? PLEASE HELP!

The error messages are not very useful to me, since I don't know where to look for the source of the error. It would be great to get more useful information from the error messages.

I've tried the following to get the fastas to be accepted:
- allow only sequences made up of the 20 standard amino acid letters (no "X", "U", "B", "Z" or "*")
- unique Accession Numbers
- header-character-length reduced to 10 characters
- linelength of protein sequence reduced to 80, 70, 60
- deleting protein entry number XYZ as indicated in Error Type B (though I don't see what is wrong with the protein).
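
For readers who want to reproduce the clean-up steps listed above, here is a minimal Python sketch. It is not part of Mercator and is not the poster's own tooling; the function names (`parse_fasta`, `clean_records`, `to_fasta`) are made up for illustration, and taking the first whitespace-delimited token of the header as the accession number is an assumption.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_records(text):
    """Keep entries made only of the 20 standard residues, one per accession."""
    seen = set()
    for header, seq in parse_fasta(text):
        tokens = header.split()
        accession = tokens[0] if tokens else ""  # assumption: first token is the ID
        if not seq or set(seq) - STANDARD_AA:
            continue  # drops X, U, B, Z, *, lowercase letters, empty entries
        if accession in seen:
            continue  # keep only the first entry for each accession
        seen.add(accession)
        yield accession, seq


def to_fasta(records, width=60):
    """Write records with short headers and a fixed sequence line length."""
    lines = []
    for accession, seq in records:
        lines.append(">" + accession)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Usage would be along the lines of `to_fasta(clean_records(open("input.fasta").read()))`, where `input.fasta` is a placeholder file name.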

What else can I do? Obviously, I'm missing something, since some fastas seem to work.

THANK YOU FOR YOUR TIME AND EFFORT!



ERROR TYPE A
"45 warning(s)":
: error in sequence : Length (28) outside limits (30-20000)

: error in sequence : Length (27) outside limits (30-20000)

: error in sequence : Length (20) outside limits (30-20000)

: error in sequence : Length (19) outside limits (30-20000)
...
...
...

ERROR TYPE B
"1 fatal(s)":
: FATAL ERROR: The input contains both nucleotide and protein sequences.
Please make sure the input only contains data of either type.
Number of sequence types found in input:
Protein : 19984
Nucleotide: 1
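
The Type A warnings do not always name the entry they refer to, so a local length check can help to find the affected sequences before uploading. The sketch below is an illustration only: the limits 30 and 20000 are copied from the warnings quoted above, treating them as inclusive bounds is an assumption, and the function names are hypothetical.

```python
MIN_LEN, MAX_LEN = 30, 20000  # taken from "outside limits (30-20000)" above


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_outliers(text, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return (header, length) for every entry outside [min_len, max_len]."""
    return [
        (header, len(seq))
        for header, seq in parse_fasta(text)
        if not min_len <= len(seq) <= max_len  # assumption: bounds are inclusive
    ]
```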

RE: mercator : Error: Sequence Check failed
12/9/13 1:38 PM as a reply to Igor Branco.
I ran into the same problem! Could anybody help?

RE: mercator : Error: Sequence Check failed
12/9/13 1:49 PM as a reply to Yanli Cai.
Dear Igor and Yanli,

thanks for your feedback. I have been trying to reproduce the error you reported but could only
elicit an error message when the data was really (intentionally) corrupted. When uploading data
please make sure you set the check mark "IS_DNA" in case you are uploading nucleotide sequences.
When uploading protein sequences as input, the box, of course, must not be checked. If the
problem persists, please report again - it would then also be very helpful if you could send me the
exact input file that caused the error, so I can reproduce it and track it down.

cheers,
Marc

RE: mercator : Error: Sequence Check failed
12/9/13 3:40 PM as a reply to Marc Lohse.
Dear Marc/MapMen&Women,

I've only submitted protein fasta files to Mercator, not nucleotide ones. The "is-DNA" check box is obvious, but thank you for the hint. I have not checked it for protein sequences, yet Mercator still lists "Nucleotides" under "Status&Results -> Parameters".

I've solved the problem for my files by applying the following filters:
Filter 1: Remove all entries containing anything but the 20 common amino acids (e.g. "X" and "U" were problems)
Filter 2: Remove all entries containing ONLY "A, C, G, T". If these are not removed, I get an error stating that both protein and nucleotide sequences are present (see Error 1 below). Another protein (see "Entry causing problems" below) caused the same error even though its sequence contains an "R".
Filter 3: Non-redundant protein sequences
Filter 4: Non-redundant Accession Numbers
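
Filters 2 and 3 from the list above could look roughly like this in Python. This is a sketch under assumptions, not the poster's actual filter: records are taken to be `(header, sequence)` tuples that have already been parsed, and the function names are hypothetical. Note that a sequence containing even a single "R" is not ACGT-only, so it passes this filter.

```python
def is_acgt_only(seq):
    """True if the sequence consists solely of the letters A, C, G and T."""
    return bool(seq) and set(seq.upper()) <= set("ACGT")


def drop_acgt_and_duplicates(records):
    """Apply Filter 2 (ACGT-only entries) and Filter 3 (duplicate sequences).

    `records` is an iterable of (header, sequence) tuples; the first
    occurrence of each distinct sequence is kept.
    """
    seen = set()
    kept = []
    for header, seq in records:
        if is_acgt_only(seq) or seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept
```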

The Mercator/MapMan platform is very useful, thus it's frustrating when the fasta bounces back.

It would be very helpful if you could state what Mercator checks to enable a submission to be accepted or rejected (besides the size constraints). Generally, a more verbose error message stating e.g. the accession number or sequence causing the problem (at least the first one) would be great. I've only discovered the "Entry causing problems" by chance (since I've checked for sequence length).

What constitutes a nucleotide sequence and what a protein sequence? For fasta files you refer to the following NCBI website:
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
It lists "X" and "U" as accepted amino acid input. All my fasta files were accepted by "BLASTP" (without applying any filters) but not by Mercator.

What is the length-cutoff for a single entry (meaning minimum number of AminoAcids of one protein)? What is the minimum number of total/cumulative amino acids of the entire fasta to be accepted?

Are Windows/Unix EOL characters checked?
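
Whether Mercator is sensitive to line endings is not answered in this thread. If one wants to rule the question out locally, the line endings can be normalised before uploading. A minimal sketch, with hypothetical function names:

```python
def has_windows_newlines(data: bytes) -> bool:
    """True if the raw file content contains any carriage return."""
    return b"\r" in data


def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

The file would be read and written in binary mode (`open(path, "rb")` and `open(path, "wb")`) so that Python does not translate the line endings itself.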

Simply listing the constraints would be great.

THANK YOU FOR YOUR HARD WORK!!!

Best regards,
I


ERROR 1:
: FATAL ERROR: The input contains both nucleotide and protein sequences.
Please make sure the input only contains data of either type.
Number of sequence types found in input:
Protein : 5520
Nucleotide: null



Entry causing problems:
>BM814132_1 similar to UniRef100_Q3HTK6 Cluster: Pherophorin-C1 protein precursor; n=1; Chlamydomonas reinhardtii|Rep: Pherophorin-C1 protein precursor - Chlamydomonas reinhardtii, partial (14%)
GGGGGGGGGGGGGGGGGGGGGRGGGGGGGGGGAGGGGGGGGGGGGGGGGGGGGGG

RE: mercator : Error: Sequence Check failed
2/4/14 5:17 PM as a reply to Igor Branco.
Dear Marc,

Please help!

I've been trying to process a protein fasta file with Mercator, but am encountering different problems. On the one hand, some sequences are simply not accepted, and on the other hand, files that have been accepted, get into the "job queue" but the processing does not start and thus no results are generated.

Specifically, I'd like to process 2 large protein fasta files (200 MB and 100 MB) that should be processed separately.
Each of them is the product of merging protein fasta files from various sources (e.g. UniProt, six-frame translations, ...).
Based on my experience the last time I used such fastas with Mercator, I've split the file into a "mercatorable" and a "not-mercatorable" part, and submitted only small parts (about 30 MB) at once.

I've applied the 4 filters I've previously described and created 2 more (Filter 5 and 6):
Filter 1: Remove all entries containing anything but the 20 common amino acids (e.g. "X" and "U" were problems)
Filter 2: Remove all entries containing ONLY "A, C, G, T". If these are not removed, I get an error stating that both protein and nucleotide sequences are present (see error 1 below). Another protein (see "Entry causing problems" below) caused the same error even though its sequence contains an "R".
Filter 3: Non-redundant protein sequences
Filter 4: Non-redundant Accession Numbers
Filter 5: Check whether more than a certain percentage of the protein sequence consists of "ATGCU"
Filter 6: check for short repetitive sequences (very very simple version)
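
The post does not give the details of Filters 5 and 6, so the following Python sketch is only one possible reading of them. The function names and the default repeat parameters (`min_unit`, `max_unit`, `min_copies`) are assumptions chosen for illustration, not values from the thread.

```python
def nucleotide_like_fraction(seq, letters="ATGCU"):
    """Fraction of residues that are one of the given nucleotide-like letters."""
    if not seq:
        return 0.0
    return sum(ch in letters for ch in seq.upper()) / len(seq)


def has_short_repeat(seq, min_unit=2, max_unit=12, min_copies=3):
    """True if a unit of min_unit..max_unit residues occurs min_copies times
    back to back anywhere in the sequence (a deliberately simple check)."""
    for unit in range(min_unit, max_unit + 1):
        window = unit * min_copies
        for start in range(len(seq) - window + 1):
            if seq[start:start + window] == seq[start:start + unit] * min_copies:
                return True
    return False
```

A Filter 5 style rule would then drop entries where `nucleotide_like_fraction(seq)` exceeds a chosen cutoff, and a Filter 6 style rule would drop entries where `has_short_repeat(seq)` is true. As the post itself notes, such heuristics can also remove sequences that are not problematic.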

I've tried various settings for filter 5 and 6, but always get the same error message:
"""
Problems found in the sequence file!
0 warning(s)
0 error(s)
1 fatal(s)


FATAL

: FATAL ERROR: The input contains both nucleotide and protein sequences.
Please make sure the input only contains data of either type.
Number of sequence types found in input:
Protein : 41573
Nucleotide: null
"""




In contrast, I get the following error message from the "non-mercatorable" sequences:
"""
Problems found in the sequence file!
848 warning(s)
0 error(s)
24 fatal(s)
WARN

: error in sequence < UniRef100_P82332 Unknown protein from spot 115 of 2D-PAGE of thylakoid (Fragments) n=1 Tax=Pisum sativum RepID=UT115_PEA | P82332| UniRef100_P82332 Unknown protein from spot 115 of 2D-PAGE of thylakoid (Fragments) n=1 Tax=Pisum sativum RepID=UT115_PEA __***__ P82332.1 | gi|75107074|sp|P82332.1|UT115_PEA RecName: Full=Unknown protein from spot 115 of 2D-PAGE of thylakoid> : Length (14) outside limits (30-20000)

: error in sequence : Length (29) outside limits (30-20000)

: error in sequence < UniRef100_K4IS98 Translation elongation factor 1-alpha (Fragment) n=1 Tax=Cercospora sojina RepID=K4IS98_9PEZI | K4IS98| UniRef100_K4IS98 Translation elongation factor 1-alpha (Fragment) n=1 Tax=Cercospora sojina RepID=K4IS98_9PEZI> : Length (17) outside limits (30-20000)

: error in sequence < ID276399 p.sativum_wa1_contig15131_rframe6 __***__ contig15131 | p.sativum_wa1_contig15131| ID276399 p.sativum_wa1_contig15131_rframe6 __***__ contig15131| ID p.sativum_wa1_contig15131_rframe6> : Length (22) outside limits (30-20000)

: error in sequence : Length (25) outside limits (30-20000)

: error in sequence : Length (2) outside limits (30-20000)

: error in sequence < UniRef100_G8YZP4 PsbA peptide (Fragment) n=2 Tax=Papilionoideae RepID=G8YZP4_PEA | G8YZP4| UniRef100_G8YZP4 PsbA peptide (Fragment) n=2 Tax=Papilionoideae RepID=G8YZP4_PEA> : Length (24) outside limits (30-20000)
etc. .......................................................................................

FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
ENVVEIETISTGSLGLBIALGVGGLPRGRIIEIYGPESSGKTTLALQTIAEAQKKGGICAFVXAEHALDPVYARKLGVXLQNLLISQPDTGEQXLZIXDTLVRSGXVBVLVVDSVAALTPRAEIZGEMGDSLPGLQARLMSQALRKLTASISK

: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
MFAPYLDRWSLISDGEPIITHSSRLLPVLWQDRPAMLKVAADIDEUYGALLMQWWDGDGAAYVYAHEGDAVLLERATGKRSLLAMAMNGADDEASRILCRTAARLHAPREKPLPDPISLTRWFRDLEPAAGKHGGTLADCSAIANVLLADQRDLTILHGDIHHDNILDFEARGWLAIDPKRLHGERGFDFANIFANEELPTITDPARFRRQLAVVSAEAKLEPKRLLQWIAAYSGLSAAWFLGDPNIQQAETALTVARIALAELQT

: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
TTNWSEERFNEIVKEVSGFIKKVGYNPKTVAFVPISGWHGDNMLEESTNMTWFKGWTKDIKGGSTKGKTLLZAIDSIEPPTRPTDKPLRLPLQDVYKIGGIGTVPVGRVETGIIKADMVVTFAPVGLTTEVKSVEMHHEQLEQGVP

: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
EFWEVISBEHGIDYTGTYTGDSDLQLERXSVYYXXAAGGKYVPRAVLVDLEPGTMDSVRAGPFGQLFRPDNFVFGQSGAGNNWAKGHYTEGAELVDSVLDVVRKEAESCDCLQGFQITHSLGGGTGAGMGTLLISKIREEYPDRMMCTFSVVPSPKVSDTVVEPYNATLSVHQLVENSDETFCIDNEALYXICFRTLKLSTPTYGDLNHLVSAVMSGIXTCLRFPGQLNADLRKLAVNMVPFPRLHFFMVGFAPLTSRGSQGYRALT

: FATAL ERROR: Data type of sequence could not be determined.
sequence is:
MPAVGGISNGGGKEYPGSLTPFVTVTCIVAAMGGLIFGYDIGISGGVTSMDPFLLKFFPSVFRKKNSDKTVNQYCQYDSQTLTMFTSSLYLAALLSSLVASTVTRRFGRKJSMLFGGLLFLXGALINGFAXHVWXLIVGRILLGFGIRFANQSVPLXLSEMAPYKYRGALNIGFQLSITVGILVANVLNYFFAKIKGGWGWRLSLGGAMVPALIITVGSLVLPDTPNSMIERGDREKAKAQLQRIRGIDNVDEEFNDLVAASESSSQVEHPWRNLLQRKYRPHLTMAVLIPFFQQLTGINVIMFYAPVLFSSIGFKDDAALMSAVITGVVNVVATCVSIYGVDKWGRRALFLEGGVQMLICQAVVAAAIGAKFGTDGNPGDLPKWYAIVVVLFICIYVSAFAWSWGPLGWLVPSEIFPLEIRSAAQSINVSVNMLFTFLIAQVFLTMLCHMKFGLFLFFAFFVLIMTFFVYFFLPETKGIPIEEMGQVWQAHPFWSRFVEHDDYGNGVEMGKGAIKEV
etc. .......................................................................................
"""



Through tedious trial and error (repeatedly splitting the "mercatorable" fasta in half and trying the two halves separately), I've found the following entry, which blocked Mercator:
>A9TBD1 | tr|A9TBD1|A9TBD1_PHYPA Predicted protein (Fragment) OS=Physcomitrella patens subsp. patens GN=PHYPADRAFT_29159 PE=4 SV=1 __***__ Pp1s198_98V2.1 | Pp1s198_98V2.1 Phypa_29159 LOC435370; similar to hypothetical protein PFB0615c __***__ NP13144251_1 | NP13144251_1 GB|XM_001775839.1|XP_001775891.1 hypothetical protein
GSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNW

The sequence contains "short repeats". When I remove this entry (in addition to applying the first 4 filters), Mercator accepts the job but does not start processing anything. Filters 5 and 6 do not seem to be relevant: in certain cases they catch the "problematic sequences", but at the same time they remove other sequences that are not problematic.
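
The halving search described above can be written down as a short routine. The sketch below is an illustration under stated assumptions: `is_accepted` is a hypothetical callback standing in for "upload this subset and see whether the sequence check passes" (in the thread this was done by hand), and the search assumes that a rejection is caused by individual entries rather than by a combination of entries. If several entries are at fault, it finds one of them.

```python
def find_blocking_entry(records, is_accepted):
    """Narrow a rejected list of records down to one offending entry.

    records     -- list of (header, sequence) tuples
    is_accepted -- hypothetical callback: takes a list of records and
                   returns True if that subset passes the check
    Returns the offending record, or None if the full list is accepted.
    """
    candidates = list(records)
    if not candidates or is_accepted(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        # If the first half is rejected, the offender is in it;
        # otherwise it must be in the second half.
        candidates = first if not is_accepted(first) else second
    return candidates[0]
```

Each call to `is_accepted` corresponds to one upload, so a file with N entries needs roughly log2(N) uploads instead of N.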

Do you have any suggestions, tips or tricks?

Could I possibly send you the sequences to process? It would be great if the "trouble" sequences would not raise these errors but "simply" not be annotated or thrown into a separate bin.

Thank you for your time and effort!

Best regards,
Igor

RE: mercator : Error: Sequence Check failed
10/13/15 4:27 PM as a reply to Igor Branco.
Hi there

I'm also getting this error. I have moved my attempt to the plabipd website, seeing as the web interface is offline for now. What is strange is that the first time I ran one proteome through Mercator, it worked. Now I can't upload the same data! I really loved the output of this first run, so I'm not giving up yet!

I don't know how the analyses are linked to the plabipd website, but here is my current run as an example (from the plabipd website).

Mercator329b759471aae790ebc8dc6d6c2e7bdc

The above proteome seems to contain about 4 proteins that are being mistaken for nucleotide sequences, but I have no way of knowing which ones they are. The output doesn't say; it simply states that there is a fatal error: both nucleotides and proteins are present, etc.

I have checked for "U"s. However, I have not removed proteins with "X"s, as there are a number of them. From the BLAST web page it seems "X"s are acceptable.

I'm assuming the issue is with a few repetitive sequences that have a lot of A, C, G or T, but I'm not sure how to identify which proteins to remove.

I would be very grateful if you could help me resolve this issue.

Kind Regards
Jonathan

RE: mercator : Error: Sequence Check failed
10/16/15 6:05 PM as a reply to Jonathan Featherston.
Hi Jonathan,

I can see that Mercator believed that your fasta file contained input which had both nucleotide and protein sequences.
I'm not sure of the scenario that prompted this - but if you want, you can send me your file and I will try and diagnose the problem.
It could be that there are some scenarios that trigger incorrect behavior.

You can send me the data at plabipd(at)gmail(dot)com

Cheers,
Marie

RE: mercator : Error: Sequence Check failed
10/27/15 12:10 PM as a reply to Marie Bolger.
Hi All

Just want to say thank you to Marie who advised me to use a script by Keith Bradnam to identify nucleotide-like proteins.

The script is available here: http://www.acgt.me/blog/2015/8/2/mining-uniprot-the-roy-chaudhuri-quest-to-find-dna-like-protein-sequences

It does require that one use the FAlite.pm perl module as well.

After filtering these proteins it seems to be working.

Thanks

Regards
Jonathan

RE: mercator : Error: Sequence Check failed
10/27/15 2:31 PM as a reply to Jonathan Featherston.
Here is a slightly modified version of the above script (from Keith Bradnam), which will also print the headers of records that contain only valid IUPAC nucleotide characters (these currently cause an issue for Mercator).

I hope to have a fix for this implemented soon, but in the meantime, if you get an error claiming that your fasta file contains both nucleotide and protein sequences, you can run the above script to find the offending sequences.

Note - save the script (protein_parser2.txt) and rename it to protein_parser2.pl
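
For readers without Perl, the same idea can be approximated in Python. This is not Keith Bradnam's script or the modified version attached to the post; it is a minimal sketch with hypothetical function names that reports entries whose sequence uses only IUPAC nucleotide letters (A, C, G, T, U plus the ambiguity codes R, Y, S, W, K, M, B, D, H, V, N). Whether Mercator's own check uses exactly this rule is an assumption. Both problem entries quoted earlier in the thread (the poly-G sequence containing "R" and "A", and the "GSTRWGSNHNWH" repeat) consist only of these letters.

```python
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")  # bases plus ambiguity codes


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def nucleotide_like_headers(text):
    """Return headers of entries made up solely of IUPAC nucleotide letters."""
    return [
        header
        for header, seq in parse_fasta(text)
        if seq and set(seq.upper()) <= IUPAC_NUCLEOTIDE
    ]
```

Entries reported by `nucleotide_like_headers` are candidates for removal before uploading a protein fasta file.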