Dear Marc,
Please help!
I've been trying to process a protein FASTA file with Mercator, but am running into two different problems. On the one hand, some sequences are simply not accepted; on the other hand, files that have been accepted enter the "job queue", but processing never starts, so no results are generated.
Specifically, I'd like to process 2 large protein FASTA files (200MB and 100M

that should be processed separately.
Each of them is the product of merging protein FASTA files from various sources (e.g. UniProt, six-frame translations, ...).
Based on my experience the last time I used such FASTA files with Mercator, I've split each file into a "mercatorable" and a "non-mercatorable" part, and submitted only small parts (about 30MB each) at a time.
I've applied the 4 filters I've previously described and created 2 more (Filter 5 and 6):
Filter 1: Remove all entries containing anything but the 20 common amino acids (e.g. "X" and "U" were problems)
Filter 2: Remove all entries containing ONLY "A, C, G, T". If these are not removed, I get an error stating that both protein and nucleotide sequences are present (see error 1 below). Another protein (see "Entry causing problems" below) caused the same error even though its sequence contains an "R".
Filter 3: Non-redundant protein sequences
Filter 4: Non-redundant Accession Numbers
Filter 5: Check whether the "ATGCU" content of a protein sequence exceeds a certain percentage
Filter 6: Check for short repetitive sequences (a very simple version)
I've tried various settings for filters 5 and 6, but always get the same error message:
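A minimal Python sketch of what filters like these six might look like. The function names, the 0.9 threshold for filter 5, the repeat-unit limits for filter 6, and the choice of the first header token as the accession number are all assumptions made for illustration; they are not the actual scripts or settings used here, and nothing in this sketch reflects how Mercator itself classifies sequences.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 common amino acids
NUCLEOTIDE_LIKE = set("ACGTU")


def parse_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def nucleotide_fraction(seq):
    """Filter 5 helper: fraction of residues that are A, C, G, T or U."""
    return sum(c in NUCLEOTIDE_LIKE for c in seq) / len(seq) if seq else 0.0


def is_short_repeat(seq, max_unit=20, min_copies=3):
    """Filter 6 helper: True if seq is one short unit repeated back to back."""
    n = len(seq)
    for unit_len in range(1, max_unit + 1):
        if n < unit_len * min_copies:
            break
        if all(seq[i] == seq[i - unit_len] for i in range(unit_len, n)):
            return True
    return False


def apply_filters(records, max_nt_fraction=0.9):
    """Split records into kept entries and (header, reason) rejections."""
    kept, rejected = [], []
    seen_seqs, seen_ids = set(), set()
    for header, seq in records:
        tokens = header.split()
        accession = tokens[0] if tokens else ""  # assumed accession position
        if not seq or not set(seq) <= STANDARD_AA:
            reason = "filter 1: non-standard residue"
        elif set(seq) <= set("ACGT"):
            reason = "filter 2: only A, C, G, T"
        elif nucleotide_fraction(seq) > max_nt_fraction:
            reason = "filter 5: high ATGCU content"
        elif is_short_repeat(seq):
            reason = "filter 6: short repeat"
        elif seq in seen_seqs:
            reason = "filter 3: duplicate sequence"
        elif accession in seen_ids:
            reason = "filter 4: duplicate accession"
        else:
            seen_seqs.add(seq)
            seen_ids.add(accession)
            kept.append((header, seq))
            continue
        rejected.append((header, reason))
    return kept, rejected
```

Keeping the rejection reason next to each header makes it easier to see which filter removed an entry, and therefore whether filters 5 and 6 are discarding sequences that would have been fine.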
"""
Problems found in the sequence file!
0 warning(s)
0 error(s)
1 fatal(s)
FATAL
: FATAL ERROR: The input contains both nucleotide and protein sequences.
Please make sure the input only contains data of either type.
Number of sequence types found in input:
Protein : 41573
Nucleotide: null
"""
In contrast, I get the following error message from the "non-mercatorable" sequences:
"""
Problems found in the sequence file!
848 warning(s)
0 error(s)
24 fatal(s)
WARN
: error in sequence < UniRef100_P82332 Unknown protein from spot 115 of 2D-PAGE of thylakoid (Fragments) n=1 Tax=Pisum sativum RepID=UT115_PEA | P82332| UniRef100_P82332 Unknown protein from spot 115 of 2D-PAGE of thylakoid (Fragments) n=1 Tax=Pisum sativum RepID=UT115_PEA __***__ P82332.1 | gi|75107074|sp|P82332.1|UT115_PEA RecName: Full=Unknown protein from spot 115 of 2D-PAGE of thylakoid> : Length (14) outside limits (30-20000)
: error in sequence : Length (29) outside limits (30-20000)
: error in sequence < UniRef100_K4IS98 Translation elongation factor 1-alpha (Fragment) n=1 Tax=Cercospora sojina RepID=K4IS98_9PEZI | K4IS98| UniRef100_K4IS98 Translation elongation factor 1-alpha (Fragment) n=1 Tax=Cercospora sojina RepID=K4IS98_9PEZI> : Length (17) outside limits (30-20000)
: error in sequence < ID276399 p.sativum_wa1_contig15131_rframe6 __***__ contig15131 | p.sativum_wa1_contig15131| ID276399 p.sativum_wa1_contig15131_rframe6 __***__ contig15131| ID p.sativum_wa1_contig15131_rframe6> : Length (22) outside limits (30-20000)
: error in sequence : Length (25) outside limits (30-20000)
: error in sequence : Length (2) outside limits (30-20000)
: error in sequence < UniRef100_G8YZP4 PsbA peptide (Fragment) n=2 Tax=Papilionoideae RepID=G8YZP4_PEA | G8YZP4| UniRef100_G8YZP4 PsbA peptide (Fragment) n=2 Tax=Papilionoideae RepID=G8YZP4_PEA> : Length (24) outside limits (30-20000)
etc. .......................................................................................
FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
ENVVEIETISTGSLGLBIALGVGGLPRGRIIEIYGPESSGKTTLALQTIAEAQKKGGICAFVXAEHALDPVYARKLGVXLQNLLISQPDTGEQXLZIXDTLVRSGXVBVLVVDSVAALTPRAEIZGEMGDSLPGLQARLMSQALRKLTASISK
: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
MFAPYLDRWSLISDGEPIITHSSRLLPVLWQDRPAMLKVAADIDEUYGALLMQWWDGDGAAYVYAHEGDAVLLERATGKRSLLAMAMNGADDEASRILCRTAARLHAPREKPLPDPISLTRWFRDLEPAAGKHGGTLADCSAIANVLLADQRDLTILHGDIHHDNILDFEARGWLAIDPKRLHGERGFDFANIFANEELPTITDPARFRRQLAVVSAEAKLEPKRLLQWIAAYSGLSAAWFLGDPNIQQAETALTVARIALAELQT
: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
TTNWSEERFNEIVKEVSGFIKKVGYNPKTVAFVPISGWHGDNMLEESTNMTWFKGWTKDIKGGSTKGKTLLZAIDSIEPPTRPTDKPLRLPLQDVYKIGGIGTVPVGRVETGIIKADMVVTFAPVGLTTEVKSVEMHHEQLEQGVP
: FATAL ERROR: Data type of sequence <> could not be determined.
sequence is:
EFWEVISBEHGIDYTGTYTGDSDLQLERXSVYYXXAAGGKYVPRAVLVDLEPGTMDSVRAGPFGQLFRPDNFVFGQSGAGNNWAKGHYTEGAELVDSVLDVVRKEAESCDCLQGFQITHSLGGGTGAGMGTLLISKIREEYPDRMMCTFSVVPSPKVSDTVVEPYNATLSVHQLVENSDETFCIDNEALYXICFRTLKLSTPTYGDLNHLVSAVMSGIXTCLRFPGQLNADLRKLAVNMVPFPRLHFFMVGFAPLTSRGSQGYRALT
: FATAL ERROR: Data type of sequence could not be determined.
sequence is:
MPAVGGISNGGGKEYPGSLTPFVTVTCIVAAMGGLIFGYDIGISGGVTSMDPFLLKFFPSVFRKKNSDKTVNQYCQYDSQTLTMFTSSLYLAALLSSLVASTVTRRFGRKJSMLFGGLLFLXGALINGFAXHVWXLIVGRILLGFGIRFANQSVPLXLSEMAPYKYRGALNIGFQLSITVGILVANVLNYFFAKIKGGWGWRLSLGGAMVPALIITVGSLVLPDTPNSMIERGDREKAKAQLQRIRGIDNVDEEFNDLVAASESSSQVEHPWRNLLQRKYRPHLTMAVLIPFFQQLTGINVIMFYAPVLFSSIGFKDDAALMSAVITGVVNVVATCVSIYGVDKWGRRALFLEGGVQMLICQAVVAAAIGAKFGTDGNPGDLPKWYAIVVVLFICIYVSAFAWSWGPLGWLVPSEIFPLEIRSAAQSINVSVNMLFTFLIAQVFLTMLCHMKFGLFLFFAFFVLIMTFFVYFFLPETKGIPIEEMGQVWQAHPFWSRFVEHDDYGNGVEMGKGAIKEV
etc. .......................................................................................
"""
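The warnings above quote length limits of 30-20000, and the sequences in the fatal errors shown contain letters such as B, Z, X, U and J. A small pre-check along those lines could be sketched as follows. Treating the limits as inclusive is an assumption, the helper names are hypothetical, and whether those letters are really what triggers the "data type could not be determined" error is only a guess based on the excerpts above.

```python
MIN_LEN, MAX_LEN = 30, 20000  # limits quoted in the warnings; inclusive bounds assumed
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def split_by_length(records, min_len=MIN_LEN, max_len=MAX_LEN):
    """Separate (header, sequence) records into in-range and out-of-range lists."""
    in_range, out_of_range = [], []
    for header, seq in records:
        if min_len <= len(seq) <= max_len:
            in_range.append((header, seq))
        else:
            out_of_range.append((header, seq))
    return in_range, out_of_range


def nonstandard_residues(seq):
    """Return the sorted letters in seq that are not among the 20 common amino acids."""
    return sorted(set(seq.upper()) - STANDARD_AA)
```

Running both checks before upload would put the short fragments and the ambiguous-residue entries into their own bins locally, instead of having them surface as warnings and fatal errors after submission.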
Through tedious trial and error (repeatedly splitting the "mercatorable" FASTA in half and trying the two halves separately), I found the following entry, which blocked Mercator:
>A9TBD1 | tr|A9TBD1|A9TBD1_PHYPA Predicted protein (Fragment) OS=Physcomitrella patens subsp. patens GN=PHYPADRAFT_29159 PE=4 SV=1 __***__ Pp1s198_98V2.1 | Pp1s198_98V2.1 Phypa_29159 LOC435370; similar to hypothetical protein PFB0615c __***__ NP13144251_1 | NP13144251_1 GB|XM_001775839.1|XP_001775891.1 hypothetical protein
GSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNWHGSTRWGSNHNW
The sequence contains "short repeats". When I remove this entry (in addition to applying the first 4 filters), Mercator accepts the job but does not start processing anything. Filters 5 and 6 do not seem to be relevant in general; in certain cases they do catch the "problematic sequences", but at the same time they remove other sequences that are not problematic.
Do you have any suggestions, tips, or tricks?
Could I possibly send you the sequences to process? It would be great if the "trouble" sequences did not raise these errors but were "simply" left unannotated or thrown into a separate bin.
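The halving search described above can be written down as a short Python sketch. The `is_accepted` callback is hypothetical: it stands for whatever check decides that a set of records passes (here, that was a manual upload through the web form), and the sketch assumes that a single entry is enough to cause the rejection.

```python
def find_blocking_record(records, is_accepted):
    """Return one record that causes rejection, or None if everything is accepted.

    is_accepted is a hypothetical callback taking a list of (header, sequence)
    tuples and returning True when that subset passes validation.
    """
    candidates = list(records)
    if is_accepted(candidates):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        if not is_accepted(first):
            candidates = first
        elif not is_accepted(second):
            candidates = second
        else:
            # Each half passes alone, so the failure needs entries from both halves.
            raise ValueError("rejection depends on a combination of records")
    return candidates[0]
```

Each round halves the candidate set, so even a file with tens of thousands of entries needs only a couple of dozen submissions. If several independent entries block the file, the search has to be repeated after removing each one it finds.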
Thank you for your time and effort!
Best regards,
Igor