Extracting Signal to Noise data from Raw Files #130

Closed
GeorgWa opened this issue Feb 8, 2022 · 17 comments
Labels
enhancement New feature or request

Comments

@GeorgWa
Contributor

GeorgWa commented Feb 8, 2022

Hi,

I have used ThermoRawFileParser for some time now and I'm very happy with the software!
Is there an option to include the signal-to-noise data in the mzML, as defined in the mzML standard?

[Term]
id: MS:1000517
name: signal to noise array
def: "A data array of signal-to-noise values." [PSI:MS]
xref: binary-data-type:MS\:1000521 "32-bit float"
xref: binary-data-type:MS\:1000523 "64-bit float"

Best,
Georg

@caetera
Collaborator

caetera commented Feb 8, 2022

Hi @GeorgWa ,
great to hear that you can find good use for ThermoRawFileParser.
S/N arrays are not extracted currently, but this feature can certainly be added in the future.

@caetera caetera added the enhancement New feature or request label Feb 8, 2022
@GeorgWa
Contributor Author

GeorgWa commented Feb 8, 2022

Thanks a lot for the fast reply!
Do you know of any resources describing the Thermo RawFileReader DLL or the .raw format that could be helpful for implementing this feature?

@GeorgWa
Contributor Author

GeorgWa commented Feb 9, 2022

Just a quick update. I forked the repository and was able to add the raw noise data to the mzML output.
GeorgWa@10bba51

If you are interested I can add this functionality with a pull request. Please let me know:

  • Do you have any tests which I should run before committing the changes?
  • Should I just increase the version number myself?
  • There was no unit associated with the accession for the noise field "MS:1002742". Therefore I left the unit fields empty:
accession = "MS:1002742",
name = "noise array",
cvRef = "MS",
unitCvRef = "",
unitAccession = "",
unitName = "",
value = ""

Is this something I should discuss with the HUPO-PSI team?

Cheers,
Georg

@edeutsch
Collaborator

edeutsch commented Feb 9, 2022

Do you suppose that the noise array should have the same units as the intensity array?

@GeorgWa
Contributor Author

GeorgWa commented Feb 9, 2022

Yes, I expect them to have the same units.

It's interesting though: the term I used, id: MS:1002742 name: noise array, has no unit associated.
The term id: MS:1002744 name: sampled noise intensity array has the right detector count relationship:
relationship: has_units MS:1000131 ! number of detector counts
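
The difference between the two terms can be checked mechanically. The following is a minimal sketch, not part of ThermoRawFileParser: it parses two abbreviated OBO stanzas (reduced to the fields quoted in this thread, so they reflect the CV as described here and not necessarily its current state) and lists the terms without a has_units relationship. The helper names `parse_obo_terms` and `unit_accessions` are hypothetical.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas into a dict keyed by term id."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"relationship": []}
            continue
        if not line or current is None or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key == "id":
            terms[value] = current
        if key == "relationship":
            # A term may carry several relationship lines, so collect them.
            current["relationship"].append(value)
        else:
            current[key] = value
    return terms


def unit_accessions(term):
    """Return the accessions declared via has_units, dropping the '! name' comment."""
    units = []
    for rel in term["relationship"]:
        parts = rel.split()
        if len(parts) >= 2 and parts[0] == "has_units":
            units.append(parts[1])
    return units


# Abbreviated stanzas: only the fields quoted in this thread (an assumption,
# the real CV entries contain more fields).
OBO_SNIPPET = """
[Term]
id: MS:1002742
name: noise array

[Term]
id: MS:1002744
name: sampled noise intensity array
relationship: has_units MS:1000131 ! number of detector counts
"""

terms = parse_obo_terms(OBO_SNIPPET)
missing_units = [tid for tid, term in terms.items() if not unit_accessions(term)]
print(missing_units)
```

Run against the full psi-ms.obo file, the same check would list every term that a writer has to emit with empty unit attributes.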

I am also not sure whether the noise array I use is compatible with the m/z array if no centroiding is performed. I have not found a way to get the noise for uncentroided spectra with the Thermo RawFileReader.
Therefore I might switch to the three terms MS:1002743, MS:1002744 and MS:1002745, and add an option like -N --noiseData.

GeorgWa@4d5710f

Let me know what you would suggest.

@caetera
Collaborator

caetera commented Feb 9, 2022

@GeorgWa Great, thank you for your active participation. You are welcome to submit a PR with the changes you have made; they will be merged into the next release together with the other pending changes. Since there aren't any specific tests for signal-to-noise, you can just make sure the tests in the ThermoRawFileParserTest solution pass. The latter requires setting up NUnit, which is sometimes a bit tricky, so it is up to you whether you want to spend time on that. I will run the tests before the merge.

Noise, baseline, charge, and resolution data are only present for FTMS scans, and only for centroids. These properties are (to the best of my knowledge) pre-calculated on the instrument side and stored directly in the RAW file. For profile data and/or non-FTMS scans it is not possible to retrieve them through the library (again, to the best of my knowledge); those scans only contain m/z and intensity data. There is, however, a general method GenerateNoiseTable on Scan objects that, according to its docstring, is also only relevant for FT data (see below).

// Summary:
//     Generates a "noise and baseline table". This table is only relevant to FT format
//     data. For other data, an empty list is returned. This table is intended for use
//     when exporting processed (averaged, subtracted) scans to a raw file. If this
//     scan is the result of a calculation such as "average of subtract" it may be constructed
//     using an overload which includes a noise and baseline table. If so: that tale
//     is returned. Otherwise, a table is generated by extracting data from the scan.
//
// Returns:
//     The nose and baseline data

I am a bit puzzled about the purpose of MS:1002743 - MS:1002745, i.e. how they differ from MS:1002742 (noise array).

I agree that noise and baseline should have the same dimensions as intensity, for example, detector counts; signal-to-noise, however, should be dimensionless, since it is a ratio.
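
To make the unit argument concrete, here is a minimal sketch of deriving a signal-to-noise array from intensity and noise arrays of equal length. Both inputs are in detector counts, so the ratio carries no unit. The function name and the optional baseline subtraction are assumptions for illustration; the thread does not state how the vendor library defines S/N.

```python
def signal_to_noise(intensities, noises, baselines=None):
    """Return the per-peak ratio (intensity - baseline) / noise.

    All inputs share one unit (e.g. detector counts), so the result is
    dimensionless. Subtracting the baseline is an assumption made here;
    pass baselines=None to get the plain intensity / noise ratio.
    """
    if baselines is None:
        baselines = [0.0] * len(intensities)
    if not (len(intensities) == len(noises) == len(baselines)):
        raise ValueError("intensity, noise and baseline arrays must have equal length")
    ratios = []
    for intensity, noise, baseline in zip(intensities, noises, baselines):
        # A non-positive noise value gives no meaningful ratio; mark it as NaN.
        ratios.append((intensity - baseline) / noise if noise > 0 else float("nan"))
    return ratios


plain = signal_to_noise([100.0, 50.0], [10.0, 5.0])
corrected = signal_to_noise([100.0, 50.0], [10.0, 5.0], baselines=[20.0, 0.0])
print(plain, corrected)
```

Since the ratio can be recomputed this way from the stored arrays, writing the noise and baseline arrays loses no information compared with writing S/N directly.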

Do you think it would be relevant to have an option to write signal-to-noise directly to mzML, rather than an additional noise array? It could substitute for the intensity array; that does not seem to break the mzML specs (http://www.peptideatlas.org/tmp/mzML1.1.0.html), which require at least 2 binaryDataArray elements for every spectrum but do not prescribe which ones.
For example, an option like --snasintensity could swap the representation. What is your opinion?

@GeorgWa
Contributor Author

GeorgWa commented Feb 9, 2022

@caetera thanks for the recommendations regarding the testing, I will check this out.

I compared all ways of accessing the noise data: Scan.GenerateNoiseTable(), Scan.CentroidScan.Noises and Scan.PreferredNoises all returned arrays of the same length as the centroided data. As expected, this was considerably shorter than the m/z and intensity arrays returned by the SegmentedScan. For our application, and in general (I would think), it's sufficient to have access to this centroided noise data.

Currently I get all three arrays from the Scan.PreferredNoises/Masses/Baselines which defaults to the centroided scan. If you have any considerations on using another set of the three properties, I am happy to change this behavior.

I introduced the following parameter which controls the output:
-N, --noiseData Include noise data in mzML output

If this parameter is set, the three binaryDataArrays (Noises/Masses/Baselines) are added to the binaryDataArrayList of every spectrum. Note that MS:1000514 and MS:1002743 are redundant if peak picking is performed; if -p is set, however, they differ. I agree that including the noise instead of the signal-to-noise is the way to go.

If the -N option is set, the mzML output looks like this:

<binaryDataArrayList count="5">
  <binaryDataArray encodedLength="56">
    <cvParam cvRef="MS" accession="MS:1000514" value="" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="56">
    <cvParam cvRef="MS" accession="MS:1000515" value="" name="intensity array" unitAccession="MS:1000131" unitName="number of counts" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="56">
    <cvParam cvRef="MS" accession="MS:1002745" value="" name="sampled noise baseline array" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="52">
    <cvParam cvRef="MS" accession="MS:1002744" value="" name="sampled noise intensity array" unitAccession="MS:1000131" unitName="number of detector counts" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="52">
    <cvParam cvRef="MS" accession="MS:1002743" value="" name="sampled noise m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary></binary>
  </binaryDataArray>
</binaryDataArrayList>
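
For readers who want to consume such output, the following is a minimal sketch of decoding binaryDataArray elements with the Python standard library only. It builds a small two-array fragment itself (the values are made up for illustration) and decodes it again, keyed by accession. Real mzML files declare an XML namespace on every element, which this sketch leaves out for brevity; `encode_array`, `decode_binary_data_array` and `make_array_xml` are hypothetical helper names.

```python
import base64
import struct
import xml.etree.ElementTree as ET
import zlib

FLOAT32 = "MS:1000521"
FLOAT64 = "MS:1000523"
ZLIB = "MS:1000574"
ENCODING_TERMS = {FLOAT32, FLOAT64, ZLIB}


def encode_array(values):
    """Pack floats as little-endian 64-bit, zlib-compress, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_binary_data_array(element):
    """Return (array accession, list of floats) for one binaryDataArray element."""
    accessions = [cv.get("accession") for cv in element.findall("cvParam")]
    payload = base64.b64decode(element.findtext("binary") or "")
    if payload and ZLIB in accessions:
        payload = zlib.decompress(payload)
    if FLOAT64 in accessions:
        code, size = "d", 8
    elif FLOAT32 in accessions:
        code, size = "f", 4
    else:
        raise ValueError("no supported float width declared")
    values = list(struct.unpack("<%d%s" % (len(payload) // size, code), payload))
    # The remaining cvParam names the kind of array (m/z, intensity, noise, ...).
    kind = next(acc for acc in accessions if acc not in ENCODING_TERMS)
    return kind, values


def make_array_xml(accession, name, values):
    """Build one binaryDataArray element (no namespace, illustration only)."""
    return (
        "<binaryDataArray>"
        f'<cvParam cvRef="MS" accession="{accession}" name="{name}" />'
        '<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />'
        '<cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" />'
        f"<binary>{encode_array(values)}</binary>"
        "</binaryDataArray>"
    )


doc = (
    '<binaryDataArrayList count="2">'
    + make_array_xml("MS:1002743", "sampled noise m/z array", [400.0, 800.0])
    + make_array_xml("MS:1002744", "sampled noise intensity array", [12.5, 20.0])
    + "</binaryDataArrayList>"
)
arrays = dict(
    decode_binary_data_array(node)
    for node in ET.fromstring(doc).findall("binaryDataArray")
)
print(arrays)
```

Keying the decoded arrays by accession lets a reader look up MS:1002743 to MS:1002745 directly instead of relying on the order of the elements.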

@caetera
Collaborator

caetera commented Feb 9, 2022

Yes, Scan objects can contain two types of data streams, CentroidStream and SegmentedScan. The first one has centroids, i.e. a set of (m/z, intensity, charge, noise, baseline) vectors; the second one has just (m/z, intensity) vectors. FT scans have both streams, while IT scans contain only the SegmentedScan one. If a particular scan was acquired in profile mode, the profile data end up in SegmentedScan. The Preferred* arrays seem to pick one of the two data streams, depending on the scan type. When the mzML:spectrum object is constructed, we read masses and intensities directly from either CentroidStream or SegmentedScan, depending on the parameters and the availability of the data for a particular scan.

If it is fine for your needs that noise data will always be "centroided", i.e. present only at a certain subset of m/z values even for profile scans, then the current implementation should work fine. I think, though, that it is necessary to describe this peculiarity in the documentation (I can do it before making a release).
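
One way a downstream tool could cope with noise values that exist only at a subset of m/z positions is to interpolate them onto the profile m/z grid. The sketch below uses plain linear interpolation and holds the edge values constant outside the sampled range. This is an illustrative choice, not something ThermoRawFileParser or the vendor library does, and `interpolate_noise` is a hypothetical name.

```python
from bisect import bisect_right


def interpolate_noise(sample_mz, sample_noise, query_mz):
    """Linearly interpolate sampled noise values onto arbitrary m/z positions.

    sample_mz must be sorted in ascending order. Queries outside the sampled
    range receive the nearest edge value.
    """
    if not sample_mz or len(sample_mz) != len(sample_noise):
        raise ValueError("sampled m/z and noise arrays must be non-empty and equally long")
    result = []
    for mz in query_mz:
        if mz <= sample_mz[0]:
            result.append(sample_noise[0])
            continue
        if mz >= sample_mz[-1]:
            result.append(sample_noise[-1])
            continue
        hi = bisect_right(sample_mz, mz)  # first sample strictly above mz
        lo = hi - 1
        fraction = (mz - sample_mz[lo]) / (sample_mz[hi] - sample_mz[lo])
        result.append(sample_noise[lo] + fraction * (sample_noise[hi] - sample_noise[lo]))
    return result


profile_noise = interpolate_noise(
    [400.0, 800.0], [10.0, 30.0], [300.0, 400.0, 600.0, 900.0]
)
print(profile_noise)
```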

If I understand it correctly, MS:1002743 - MS:1002745 are meant exactly for "centroided" noise data, i.e. they indicate that the lengths of the m/z array and the noise array are not guaranteed to match, while MS:1002742 is for the case when these two arrays have the same length. Does it make sense to set MS:1002743 for centroided scans, since it will be an exact copy of MS:1000514, i.e. is it easier for downstream processing to have them both?

Shouldn't sampled noise baseline array (MS:1002745) also have has_units MS:1000131 ! number of detector counts? I guess that needs to be updated in the PSI-MS CV, though. @edeutsch

@edeutsch
Collaborator

edeutsch commented Feb 9, 2022

I wonder if we really know what the units of sampled noise baseline array are? Are they really detector counts? Or some other magical number that falls out of a Fourier transform that doesn't really have any units? I don't know.

But in any case, it is reasonable to add the same set of has_units to these terms that we have for others.

I will get it on the docket.

@GeorgWa
Contributor Author

GeorgWa commented Feb 10, 2022

@edeutsch I'm also not sure about sampled noise baseline array. Should MS:1002742 noise array get the same unit as MS:1002744 sampled noise intensity array ?

@caetera
For our use case, having access to the non-centroided noise would only be marginally better. As I have not found a way to access this data yet, I stuck to the CentroidStream noise data. Scan.GenerateNoiseTable() gave me the same centroided noise data. Sorry if I misunderstood your previous messages. If there is a way that is more reliable across different scan types, I'm happy to implement it.

It's definitely an option to skip MS:1002743 for centroided scans. In this case one has to check whether MS:1000127 centroid spectrum is present and access MS:1000514 instead of MS:1002743. I can change this behavior so that MS:1002743 is only included as a separate array if the SegmentedScan data is written. But that's also something that has to be included in the documentation, as it is then not safe to assume MS:1002743 is present. For our use case, including all three arrays every time would be slightly easier, but I don't have a strong opinion about this. So let me know which solution you prefer.
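
The extra check a reader would need if MS:1002743 were omitted for centroided scans can be sketched as follows. This is a hypothetical downstream helper, assuming the spectrum-level cvParam accessions and the decoded arrays have already been collected into a set and a dict keyed by accession.

```python
CENTROID_SPECTRUM = "MS:1000127"
MZ_ARRAY = "MS:1000514"
SAMPLED_NOISE_MZ = "MS:1002743"
SAMPLED_NOISE_INTENSITY = "MS:1002744"


def noise_mz_values(spectrum_terms, arrays):
    """Return the m/z positions that belong to the sampled noise values.

    spectrum_terms: set of cvParam accessions found on the spectrum.
    arrays: dict mapping array accession to a list of floats.
    Returns None if the spectrum carries no noise data at all.
    """
    if SAMPLED_NOISE_INTENSITY not in arrays:
        return None
    if SAMPLED_NOISE_MZ in arrays:
        return arrays[SAMPLED_NOISE_MZ]
    # Fallback: for centroided spectra the regular m/z array can stand in,
    # provided its length matches the noise array.
    if CENTROID_SPECTRUM in spectrum_terms and MZ_ARRAY in arrays:
        candidate = arrays[MZ_ARRAY]
        if len(candidate) == len(arrays[SAMPLED_NOISE_INTENSITY]):
            return candidate
    raise ValueError("noise values present but no matching m/z array found")


explicit = noise_mz_values(
    {"MS:1000128"},
    {MZ_ARRAY: [1.0, 2.0, 3.0], SAMPLED_NOISE_MZ: [1.5], SAMPLED_NOISE_INTENSITY: [9.0]},
)
fallback = noise_mz_values(
    {CENTROID_SPECTRUM},
    {MZ_ARRAY: [1.0, 2.0], SAMPLED_NOISE_INTENSITY: [9.0, 8.0]},
)
print(explicit, fallback)
```

If MS:1002743 is always written alongside the noise values, the fallback branch is never reached and a reader can drop it.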

@caetera
Collaborator

caetera commented Feb 10, 2022

@edeutsch good, thank you. I believe baseline is a kind of zero-level intensity and thus should have the same dimension, but I am not 100% sure as well.

@GeorgWa I have asked Jim Shofstahl, the developer of RawFileReader at Thermo, whether it is possible to get the noise data for profile and low-resolution scans. Let's see what his response will be.

I think then it is easier to have both MS:1000514 and MS:1002743 in the output - it will be redundant sometimes, but will make parsing easier afterwards.

@caetera
Collaborator

caetera commented Feb 10, 2022

Here is the reply of Jim:

With regards to the noise data and low resolution data, the reason that we don’t include it with those scans is the data doesn’t exist (at least what I was told years ago). With regards to profile data, I can’t remember the exact reason at the moment. I would have to check with some of our firmware people on that question.

@GeorgWa
Contributor Author

GeorgWa commented Feb 10, 2022

@caetera
Great, I will keep both MS:1000514 and MS:1002743 if the -N flag is set.
I will test it with our internal data and create a pull request when I'm ready.

Thanks for forwarding the answer from Jim! It would be great if you could update me if you hear anything else about noise for profile data.

@caetera
Collaborator

caetera commented Mar 29, 2022

Hi @GeorgWa,
just wondering whether you are ready with the PR?

@GeorgWa
Contributor Author

GeorgWa commented Apr 19, 2022

Hi @caetera,
please excuse my late reply.
I have updated the README.md to reflect the new help output and created a pull request: #137

@caetera
Collaborator

caetera commented Apr 20, 2022

Hi @GeorgWa
thank you, I will check and merge it soon.

@caetera
Collaborator

caetera commented Apr 21, 2022

Merged after fixing the sampled noise m/z array (f969329), and thus closing the issue.

@caetera caetera closed this as completed Apr 21, 2022

3 participants