User:La comadreja/Software/SpaceRemover

Overview
This program is used to format .aln files before constructing hidden Markov models (HMMs).

Description
A ClustalW2 output file in .aln form (align with numbers) is written like this:

Sequence1Name     SpecificSequence Number_of_Nucleotides_or_AAs
Sequence2Name     SpecificSequence Number_of_Nucleotides_or_AAs

Sequence1Name     SpecificSequence Number_of_Nucleotides_or_AAs
Sequence2Name     SpecificSequence Number_of_Nucleotides_or_AAs

etc.

This program takes such an .aln file, removes the spaces between the sequence name and the sequence, replaces the nucleotide or amino acid count at the end of each line with "83", and writes a new .aln file for use with HMMER.
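The per-line transformation described above can be sketched roughly as follows. This is not the original SpaceRemover source: the class name SpaceRemoverSketch, the method reformatLine, and the regular expression are assumptions made for illustration. The sketch also assumes that "removes the spaces" means the name and the sequence are joined directly, and that lines which do not look like "name, spaces, sequence, count" (the header, blank lines, the conservation line) are passed through unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; an illustration only, not the original SpaceRemover code.
final class SpaceRemoverSketch {

    // Assumed line shape: name, run of whitespace, sequence block, trailing count.
    private static final Pattern ALIGNMENT_LINE =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(\\d+)\\s*$");

    // The value the trailing count is replaced with, as stated in the description.
    static final String FIXED_COUNT = "83";

    // Reformats one line; lines that do not match the assumed shape are returned as-is.
    static String reformatLine(String line) {
        Matcher m = ALIGNMENT_LINE.matcher(line);
        if (!m.matches()) {
            return line; // header, blank, and conservation lines pass through
        }
        // Assumption: the name and the sequence are joined with nothing in between.
        return m.group(1) + m.group(2) + " " + FIXED_COUNT;
    }

    // Reads the alignment from standard input and writes the result to standard output.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(reformatLine(line));
        }
    }
}
```

Because the sketch reads standard input and writes standard output, it could be run with the same kind of redirection as the original, e.g. java SpaceRemoverSketch < inputTest.aln > outputTest.aln.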

Instructions
The program reads from standard input and writes to standard output, so supply the file names on the command line with shell redirection, e.g.: $ java SpaceRemover < inputTest.aln > outputTest.aln

Completion Stage

 * November 22, 2008: Files parsed by this program could be used to build hidden Markov models. When the HMM files were calibrated, however, calibration failed with the error message "FATAL: fit failed; --num may be set too small?" The programmer (R.S.) will next convert the files supplied in the HMMER tutorial to ClustalW2 form, construct hidden Markov models from them, and attempt to calibrate those models.


 * November 30, 2008: A new version of SpaceRemover, SpaceRemoverV2.java, is in the works and shows several improvements over its predecessor (e.g., command line arguments are no longer required, and the .aln file no longer needs further reformatting). R.S. was able to use ClustalW2 to convert the .msf file supplied in the HMMER tutorial (globins50.msf) to .aln form. SpaceRemover.java (the original) cannot parse globins50.aln. When SpaceRemoverV2.java was used to reformat globins50.aln, the program ran excessively slowly. The next step for R.S. is to find out why it is so slow, fix it, and make the program available online.


 * December 2, 2008: SpaceRemoverV2.java now appears to parse globins50.aln successfully. However, when an HMM is built from the reformatted globins50.aln, the error message "FATAL: Parse error: sequence myg_escgi-VLSDAEWQLVL: length 2, expected 6 in alignment" is returned. R.S. does not know what this means; the next step is to work out its meaning and fix the problem.


 * January 18, 2009: Regarding the error "length 2, expected 6 in alignment": when the length was increased to 6, the program printed "length 6, expected 18 in alignment." R.S. still does not understand the meaning.
 * HMMs built from ADRA1A1.txt after reformatting with SpaceRemoverV2.java appear to be able to find sequences, e.g. in the FASTA file Artemia.fa, but R.S. does not know whether the sequences found are correct.
 * globins50.phylip contains an excessive number of gaps and cannot be recognized by hmmbuild.
 * The next step is to search for a better HMMER tutorial than the one in the Userguide.

Code
Code for the driver class is here.

Code for StdIn.java, a blueprint class that is used with the program, is here. StdIn.java is supplied for programming assignments by the instructors of Princeton University's introductory computer science class.