Welcome to the MapMan Family of Software Forum

Please do not hesitate to register and post your question.

Don't forget to subscribe to your posted message so you get notified on updates.
Every question you post will help others and/or enhance the software!

Post a question, post a bug!


Using MapMan

Result file only contains unassigned reads

Hello,

I have some trouble with the annotation of chickpea derived sequences via the Mercator tool. Although I tried out different approaches (with/without selection of 'Sequence file contains DNA sequence') and all the reference databases, the tool is not able to annotate a single 100 bp sequence of all the entries I have. The annotation of some of these sequences with blastN from NCBI, however, results in perfect scoring hits.

By setting a BLAST cutoff of 26 bps and selecting 'Sequence file contains DNA sequence' (despite the fact that these sequences should represent mRNAs), I finally managed to get some of the sequences annotated. However, I would rather use the indicated 80 bps. Could you please explain what I can do to blast the 80 bp sequences?

Many thanks and best regards

RE: Result file only contains unassigned reads
Answer
9/26/13 10:12 AM as a reply to Fabian Grunz.
Hi Fabian,

if I understand you correctly, you are trying to annotate 80 bp reads derived from a
chickpea mRNA sample using the Mercator pipeline? There also seems to be a misunderstanding
about the meaning of the BLAST cutoff setting in Mercator. I guess the interface should
be made more informative to clarify this (so thanks for pointing it out). The BLAST cutoff
sets the minimum BLAST bit score an alignment needs to reach to be considered valid enough
to be included in the annotation process. It does not set the minimum length of the input
sequences.
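
To make the meaning of the cutoff concrete, here is a minimal Python sketch of a bit-score filter. It is not Mercator's actual code: the `Hit` record, the `filter_hits` function and all score values are made up for illustration only.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST alignment (hypothetical, simplified record)."""
    query_id: str
    subject_id: str
    bit_score: float


def filter_hits(hits: list[Hit], cutoff: float) -> list[Hit]:
    """Keep only alignments whose bit score reaches the cutoff.

    The cutoff is compared with the alignment's bit score,
    never with the length of the query sequence.
    """
    return [hit for hit in hits if hit.bit_score >= cutoff]


# Made-up example values, purely for illustration.
hits = [
    Hit("read_1", "protein_A", 31.2),
    Hit("read_2", "protein_B", 24.6),
    Hit("transcript_1", "protein_A", 412.0),
]

for cutoff in (80, 26):
    kept = [hit.query_id for hit in filter_hits(hits, cutoff)]
    print(f"cutoff {cutoff}: {kept}")
```

With these invented scores, only the long transcript passes a cutoff of 80, while lowering the cutoff to 26 also lets one of the short reads through.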

When using very short (80 bp, probably next-generation-sequencing-derived) reads as input,
none of the resulting alignments is likely to reach a bit score above 80 (or even 26), and hence Mercator
will assign "not assigned.unknown" to all of the sequences. Mercator was designed to functionally
annotate (ideally) full-length transcript or protein sequences. It will work with partial sequences
to some extent, but the shorter the sequences are, the sparser and less accurate the annotation will be.

Internally, Mercator uses protein sequences as reference. Every DNA/RNA input is translated,
so your 80 bp input nucleotide sequences become peptides of about 26 amino acids. Blasting
these against several whole-proteome and protein-domain databases will, in the best case, result
in a few domain hits. But most of the hits will be very low scoring because the peptide is very short.
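
The arithmetic behind the 26 amino acids can be checked with a short Python sketch. The helper name `peptide_length` is hypothetical, and the calculation only counts complete codons; a stop codon inside the read would shorten the usable peptide further.

```python
def peptide_length(nucleotides: int, frame_offset: int = 0) -> int:
    """Number of complete codons left after skipping frame_offset bases."""
    return (nucleotides - frame_offset) // 3


# An 80 nt read yields at most 26 amino acids in each forward frame.
for offset in range(3):
    print(f"frame offset {offset}: {peptide_length(80, offset)} amino acids")
```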

If you want to generate a functional annotation of your RNA-Seq chickpea reads, you'll have to assemble
them into a preliminary transcriptome first (using a tool such as Trinity) and then submit the
resulting transcript contigs to Mercator.

Mercator was not designed for functionally annotating raw RNA-Seq reads which is a task that is,
for the reasons given above, nearly impossible.

I hope this helps, and thanks for your feedback,
Marc

RE: Result file only contains unassigned reads
Answer
9/26/13 12:56 PM as a reply to Marc Lohse.
Dear Marc,

many thanks for the fast reply and the helpful information! You are perfectly right that I completely misunderstood the blast cutoff parameter.

However, the only reason why I used 80 bp sequences is the huge size of the file containing the full-length sequences. I could reduce it from originally 255 MB down to 15 MB using 7-zip. Is it possible to submit the zipped file to the indicated e-mail address (it exceeds the indicated 30 * 10^6 symbols by far)?

Thanks again for your help! Best wishes,

Fabian

RE: Result file only contains unassigned reads
Answer
9/26/13 2:08 PM as a reply to Fabian Grunz.
Dear Fabian,

I am a bit surprised at the size of the sequence set you want to annotate. Just out of curiosity, may I ask
what kind of organism is the source? Or is it a "metagenomics" data set?

Whatever the source, we can process larger data sets. Just transfer the FASTA file via the cryptshare
system using my email address (lohse_at_mpimp-golm.mpg.de) as recipient: https://cryptshare.mpg.de/

I can manually submit the file to the mercator pipeline and will send you a link via which you can check
the status and download the results.

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
9/26/13 2:31 PM as a reply to Marc Lohse.
Dear Marc,

the file contains the mRNA sequences of the newly published chickpea genome.

I am currently trying to upload the unzipped FASTA file in cryptshare (this will take some time though). Can I send you the respective pw in an extra e-mail?

Thanks again and best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
9/26/13 3:02 PM as a reply to Fabian Grunz.
Hi Fabian,

yes, of course - just use the same address.

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
9/27/13 12:59 PM as a reply to Marc Lohse.
Hi Marc,

I am not sure whether or not the upload was completed successfully. Did you receive the respective link, and could you access the file with the pw I sent you?

Many thanks and cheers,
Fabian

RE: Result file only contains unassigned reads
Answer
9/27/13 2:19 PM as a reply to Fabian Grunz.
Hi Fabian,

no - I only got your email with the password but no data so far... Please try again. Should that fail
again, we will have to find another way of transferring the data.

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
9/27/13 5:11 PM as a reply to Marc Lohse.
Hi again,

I just finished uploading the zipped data and this time everything seems fine. The pw is still the same.

Thanks a lot and best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
9/27/13 5:16 PM as a reply to Fabian Grunz.
Hi Fabian,

yes - this time the transfer worked. I obtained and unzipped the file. Since the transfer was dodgy
at the first attempt, I computed the MD5 checksum of the unzipped file. Could you please do that
too on your file so that we can make sure that the data did not get scrambled during the transfer?

> md5 seqs.fasta
MD5 (seqs.fasta) = 28db82d920ee7ca7696d0a7693034afb
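
On systems without an `md5` command, the same checksum can be computed with Python's standard `hashlib` module. This is a minimal sketch; the function name `md5_of_file` and the chunk size are arbitrary choices, not something prescribed by Mercator.

```python
import hashlib
import sys


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that large FASTA files do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage: python md5sum.py seqs.fasta
    for path in sys.argv[1:]:
        print(f"MD5 ({path}) = {md5_of_file(path)}")
```

Both sides run this on their own copy of the uncompressed file and compare the resulting hex strings.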

Best greetings,
Marc

RE: Result file only contains unassigned reads
Answer
9/27/13 5:48 PM as a reply to Marc Lohse.
Hi Marc,

and thanks again! The checksum looks fine to me:

28db82d920ee7ca7696d0a7693034afb

However, I noticed that the job queue states 'Nucleotides: 3627'. I hope that this is just a minor bug, and not the number of nucleotides the tool is actually going to process.

Best wishes,
Fabian

RE: Result file only contains unassigned reads
Answer
9/28/13 1:02 PM as a reply to Marc Lohse.
Hi again,

I am sorry to inform you that the processing apparently failed...

The sequences should be ok (at least blasting them with NCBI was no problem at all), and the checksum was identical, too. I do not know what went wrong.

Can we retry the processing?

Best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
9/30/13 11:09 AM as a reply to Fabian Grunz.
Hi Fabian,

your input file was so large that it set off an internal bug: a number became too large to fit
into the variable type that I had given it. I have to thank you for this, because these kinds
of bugs go unnoticed until the software faces the particular data set that triggers them. Aside
from this, the job processing is almost done. I will fix the issue and get back to you as soon
as it's done.

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
9/30/13 2:16 PM as a reply to Marc Lohse.
Hi again,

I restarted your job after fixing the problem, so it should now finish without problems. Please
use the same link as before for monitoring the progress.

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
10/3/13 6:02 PM as a reply to Marc Lohse.
Hi,

sorry for the late reply but I had to attend a congress in the meantime.

I checked the link but the status is still the same.

Is the link for monitoring of the progress still correct or did the processing fail again?

Thank you for all your efforts and best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
10/3/13 10:17 PM as a reply to Fabian Grunz.
There still seems to be a problem.
I will have to look into it in more detail on Monday.

Thanks for your patience
Axel

RE: Result file only contains unassigned reads
Answer
10/9/13 12:02 PM as a reply to Axel Nagel.
Hi Axel,

were you able to locate the problem?

Best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
10/9/13 12:30 PM as a reply to Fabian Grunz.
Hi Fabian,

I'm currently having a go at fixing the problem, as it still persists.
I will restart the job in about an hour.

Cheers for your patience
Axel

RE: Result file only contains unassigned reads
Answer
10/9/13 3:50 PM as a reply to Fabian Grunz.
The job has been running again for a while, and no exceptions have been logged so far.
As we have other load as well, your job might take until tomorrow to finish.


Axel

RE: Result file only contains unassigned reads
Answer
10/10/13 2:52 PM as a reply to Fabian Grunz.
Hi Fabian,


the job finished without the problems we had earlier.
You may download the results. However, I suspect some problems with duplicate data, and I will ask Marc to have a look at the results next week.


Have a nice day
Axel

RE: Result file only contains unassigned reads
Answer
10/14/13 5:11 PM as a reply to Axel Nagel.
Hi Fabian,

I just had a closer look at your sequence file and the resulting mapping and want to
point out some things. Looking at the input data file, it turns out that the IDs
are not unique: more than one sequence in the FASTA file is
identified by the same name. There are 79754 sequence entries in the FASTA
file but only 20292 unique IDs. That means that there are sequences with the
same name but potentially different nucleotide sequences, which is problematic since
the annotation would only be computed for the first sequence; the others would be
filtered out as redundant.

Looking at the sequences, I found examples that have the same ID but different
sequences (e.g. Ca_23791 --> 578nt != Ca_23791 --> 2739nt). This is a bad situation
because only the first sequence will be used in the annotation. I tried to obtain the
original chickpea sequence file but couldn't, due to the current government funding situation
in the US, so I could not check whether the redundancy is already present there or
why there would be different sequences under the same name (are they maybe alternative
transcript models?).
Anyhow, I very strongly recommend consolidating the data set first and then resubmitting the
much smaller data set to Mercator to generate a consistent annotation. Alternatively,
the different sequences with the same ID could be renamed to include e.g. a
running number (like Ca_23791_1, Ca_23791_2, ... Ca_23791_n) so that they can be
distinguished. I would strongly discourage using the annotation as
it is now.
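
Both the check (entries versus unique IDs) and the renaming with a running number can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the ID is taken to be the first word after `>`, the function names are hypothetical, and the example sequences are made up.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def split_header(line: str) -> Tuple[str, str]:
    """Split a FASTA header '>ID description' into (ID, description)."""
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


def fasta_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the ID of every header line."""
    for line in lines:
        if line.startswith(">"):
            yield split_header(line)[0]


def rename_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Append _1, _2, ... to every ID that occurs more than once."""
    lines = [line.rstrip("\n") for line in lines]
    totals = Counter(fasta_ids(lines))
    seen: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            seq_id, description = split_header(line)
            if totals[seq_id] > 1:
                seen[seq_id] += 1
                seq_id = f"{seq_id}_{seen[seq_id]}"
            line = f">{seq_id} {description}".rstrip()
        yield line


# Made-up miniature FASTA file, for illustration only.
example = [
    ">Ca_23791", "ATGGCC",
    ">Ca_23791", "ATGTTTAAA",
    ">Ca_00001 some description", "ATGCCC",
]

ids = list(fasta_ids(example))
print(f"{len(ids)} entries, {len(set(ids))} unique IDs")
print("\n".join(rename_duplicates(example)))
```

Note that this sketch does not check whether a newly generated name (such as Ca_23791_1) already exists in the file, and renaming does not replace the consolidation recommended above.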

Best greetings,
Marc

RE: Result file only contains unassigned reads
Answer
10/14/13 10:02 PM as a reply to Marc Lohse.
Dear Marc,

I am really sorry for the inconvenience! I checked the files that were used for extraction of the sequences, and apparently the IDs are listed more than once because all exons are listed separately but under the same ID. Somehow I completely overlooked this. My apologies again.

I am going to consolidate the data as soon as possible (hopefully it will be small enough to submit it directly on your webpage). Thank you for the clue, and many thanks to both of you for getting the job done despite my lack of attention!

Best regards,
Fabian

RE: Result file only contains unassigned reads
Answer
10/15/13 10:56 AM as a reply to Fabian Grunz.
Dear Fabian,

thanks for clarifying this. Since the sequences are exons, I would suggest combining them into
the full transcript, translating the contained ORF and resubmitting the resulting protein sequences
to Mercator. Since Mercator works with protein sequences internally anyway, this would strongly
reduce the computational load and produce results much more quickly (when nucleotide sequences
are submitted, Mercator bluntly does a six-frame translation of every input sequence and runs each of the
translated proteins through the whole search procedure).
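
To show why nucleotide input multiplies the work by six, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is not Mercator's implementation; the function names are hypothetical, and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> list[str]:
    """Translate three forward and three reverse-complement frames."""
    seq = seq.upper().replace("U", "T")
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


# Made-up toy sequence; '*' marks a stop codon.
for number, peptide in enumerate(six_frame_translation("ATGGCCTAA"), start=1):
    print(f"frame {number}: {peptide}")
```

Each nucleotide sequence produces six candidate peptides, whereas a submitted protein sequence is searched only once.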

cheers,
Marc

RE: Result file only contains unassigned reads
Answer
10/15/13 3:14 PM as a reply to Marc Lohse.
Dear Marc,

I combined the sequences into full transcripts and ensured that all IDs are unique.

However, I am already uploading the FASTA file to your server at the moment, and read your reply only now.

Sorry for the additional computational load, but at least the file size is significantly smaller now.

The md5 checksum is:

b28be6d8de7d4ba3a3ad3f9d52090fa1

and the password is still the same as for the previous file.

Many, many thanks and greetings,

Fabian

RE: Result file only contains unassigned reads
Answer
10/15/13 3:39 PM as a reply to Fabian Grunz.
Hi Fabian,

no worries - even with 6-frame translation, a normal-sized transcriptome annotation
will not take longer than 1 day. I checked the sequence file, and the length distribution also
looks way healthier now. The job is started - I'll send you the status link in a separate email.

cheers,
Marc