Process journal papers and add content to MTE #45

Open
1 of 7 tasks
wkiri opened this issue Feb 23, 2022 · 28 comments

wkiri commented Feb 23, 2022

The first step is to try parsing the journal documents @stevenlujpl already downloaded.
For some documents, we may need to process them multiple times, once for each mission whose targets are mentioned (see issue #22).

  • Generate initial annotations (with Contains and HasProperty relations from jSRE) using the MTE pipeline
  • Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
  • For each mission, update aliases table (if any updates are needed)
  • For each mission, concatenate LPSC + journal .jsonl files and use them for ingest_sqlite.py
  • For each mission, run update_sqlite.py twice: once with LPSC annotations and once with journal paper annotations
  • Check contents via MTE mission-specific websites
  • Generate MTE bundle v3.0 with MER-B and journal content added

wkiri commented Feb 25, 2022

Note: the MTE schema currently has an abstract column in the documents table. Journal papers do not have an abstract number, so we should decide how to handle this.

We will also need to decide how to generate a doc_id, which is currently year_abstract. Perhaps it should be year_venue_paperid, where venue would be a short form like lpsc or jgr, and paperid would be the same as the abstract number for LPSC publications and something like volume-number-page for journal papers. This paperid could then take the place of the abstract column in documents (we'd want to rename the column accordingly).
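To make the proposal concrete, here is a minimal sketch of the naming scheme described above (the helper function and example values are illustrative only, not part of the MTE codebase):

```python
def make_doc_id(year, venue, paperid):
    """Build a doc_id of the form year_venue_paperid.

    For LPSC, paperid would be the abstract number; for journal papers it
    might be a volume-number-page string (both are illustrative here).
    """
    return "{}_{}_{}".format(year, venue.lower(), paperid)

# Hypothetical examples:
# make_doc_id(2021, "lpsc", "1234")         -> "2021_lpsc_1234"
# make_doc_id(2009, "jgr", "114-E1-E00E07") -> "2009_jgr_114-E1-E00E07"
```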

stevenlujpl commented:

  • 35 journal papers were processed, and it took about 30 minutes to process them using paper_parser.py
  • ADS title search failed for 11 papers
    • ADS doesn't have the papers
    • Grobid failed to extract titles
    • Extracted titles were wrong
  • jSRE relation extraction failed for 6 papers (due to the jSRE "out of memory" error)
  • 18 papers passed initial filter

stevenlujpl commented:

I categorized the 18 papers that passed the initial filtering process by mission to ensure that we have at least one paper for each of the MERA, MERB, MPF, and PHX missions. Some papers may appear in more than one mission list. For example, 2003JE002125.pdf discusses the MER landing sites, so it appears in both the MERA and MERB lists.

MERA:

  • 2003JE002125.pdf
  • 2010JE003633.pdf
  • 2016JE005079.pdf
  • Niles_SpaceSciRev_v174_2013.pdf

MERB:

  • 2003JE002125.pdf
  • 2006JE002728.pdf
  • 2010JE003746.pdf
  • 2014JE004686.pdf
  • 2016JE005079.pdf
  • 1248097.full.pdf

MPF:

  • 2016JE005079.pdf

PHX:

  • carrier_GRL_v42_2015.pdf
  • Cull_GRL_v37_2010.pdf
  • Heet_JGR_v114_2009.pdf
  • Mellon_JGR_v114_2009.pdf
  • Renno_JGR_v114_2009.pdf
  • Toner_GeochimCosmochimActa_v236_2014.pdf

wkiri commented Mar 18, 2022

Thanks, Steven! It looks like there are 14 unique papers here. Are the other 4 that passed the filter worth including?

stevenlujpl commented:

@wkiri The other 4 papers are MSL papers. Sorry that I forgot to mention them.

MSL:

  • jackson-drillholes-16.pdf
  • mezzacappa-dist-corr-16.pdf
  • rapin-sulfate-16.pdf
  • schwenzer-diagenesis-16.pdf

wkiri commented Mar 18, 2022

@stevenlujpl Great, thanks for the clarification!

stevenlujpl commented:

@wkiri I tested the changes I made to the MTE codebase, and they seem to be working fine. The changes are currently checked into the issue45-journal branch; please see the commits above for details. A summary of the changes:

  1. doc_id field: The doc_id field now stores filenames (without extension) for journal papers. It could instead store document indices; if you think indices make more sense than filenames, please let me know and I will update the code accordingly.
  2. abstract field: The abstract field is ignored for journal papers. With the changes I made, the abstract field will be an empty string in the DB, and it will not be included as a column in the exported CSV file of the PDS4 bundle.
  3. doc_url and year fields: The doc_url and year fields are currently handled in the same way as the abstract field.
  4. I added a new CLI argument, venue, to the bundle generation script. The venue argument is used to distinguish LPSC vs. other documents. If the input documents are from LPSC, the documents.csv of the PDS4 bundle will have 8 columns (i.e., all 8 fields from the documents table of the DB are exported to the CSV file). If the input documents are from a venue other than LPSC, then documents.csv will only have 5 columns (the abstract, doc_url, and year fields are skipped); see the sketch after this list.
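
A rough sketch of the kind of venue-dependent column selection described in item 4 (the field list, function name, and CSV layout here are assumptions for illustration, not the actual bundle generation code):

```python
import csv

# Assumed ordering of the documents-table fields; abstract, doc_url, and year
# are the three that would be skipped for non-LPSC venues.
ALL_FIELDS = ["doc_id", "title", "authors", "venue", "year",
              "abstract", "doc_url", "content"]
NON_LPSC_SKIP = {"abstract", "doc_url", "year"}


def export_documents_csv(rows, out_path, venue):
    """Write documents.csv with 8 columns for LPSC and 5 otherwise (sketch)."""
    if venue.lower() == "lpsc":
        fields = ALL_FIELDS
    else:
        fields = [f for f in ALL_FIELDS if f not in NON_LPSC_SKIP]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```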

I tested this approach using the 18 journal papers that passed the initial filtering process. The MPF jsonl, DB, and PDS4 bundle files can be found at the following locations in my /home dir. The bundle validate tool reported 0 errors.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.jsonl
/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia

This approach requires only a few minor changes to the codebase (as shown in the commits above) because it doesn't require changes to the DB schema. The drawback is that the abstract field of the documents table in the current DB schema doesn't really apply to journal papers. I think this may be okay because I consider the DB files to be intermediate products; the PDS4 bundles, as the final delivered products, don't include the abstract field in the CSV file. Please let me know what you think.

The current MTE website code won't work with the DB files generated from journal papers, primarily because the doc_url field is missing from the documents table. I will update the website code if we are planning to use it for journal DBs. Please let me know. Thanks.

wkiri commented Mar 18, 2022

@stevenlujpl This is great progress! Thank you!

Can you place the generated .jsonl files under /proj/mte/? That way I can generate brat review pages for them.

The changes seem fine in general. I have two questions:

  1. I think that it would be good to retain the year field. Is this possible?
  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

It seems I overlooked that "abstract" is not included in the final .csv files that are delivered. I should correct this in the schema diagram and in the README.

I agree that the sqlite DB is an intermediate product so it's ok for it to have more information even if not used later. As you note, however, the website does use the DB directly. It makes sense to prioritize getting the journal paper content into PDS4 bundles first, and if time remains, then update the website (but it's not on the critical path for the time remaining).

stevenlujpl commented:

  1. I think that it would be good to retain the year field. Is this possible?

It should be possible to retain the year field for documents indexed in the ADS database.

  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

It seems from the ADS website search results that the DOI fields are already formatted as URLs. I will double-check the format of the DOIs returned by directly querying the ADS database. These are great suggestions; I will work on them now.

I copied the .jsonl files to the following locations in /proj/mte/. Please let me know if you run into any problems generating the brat review pages.

/proj/mte/data/steven_working_dir/process_journals/mera/mera_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/merb/merb_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/mpf/mpf_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/phx/phx_init_filter.jsonl

stevenlujpl commented:

@wkiri I couldn't test the update_sqlite.py step to update the DB with human-reviewed brat annotations yet, but I don't foresee any problems because the DB schema isn't changed.

stevenlujpl commented:

@wkiri Do you know how to use a DOI to form a URL? The DOIs returned by the ADS database aren't formatted as URLs. For example, the DOI returned for "Analysis of MOLA data for the Mars Exploration Rover landing sites" is 10.1029/2003JE002125. How do I convert the DOI into a URL?

stevenlujpl commented Mar 19, 2022

I just googled, and it seems we can use the pattern https://doi.org/xxxxx (where xxxxx is the DOI) to convert a DOI to a URL.
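
For example, a tiny sketch of that conversion (not the actual MTE code; it just prepends the doi.org resolver when the value isn't already a URL):

```python
def doi_to_url(doi):
    """Return a resolvable URL for a DOI, passing through values that are
    already URLs."""
    if doi.startswith(("http://", "https://")):
        return doi
    return "https://doi.org/" + doi

# DOI mentioned above:
# doi_to_url("10.1029/2003JE002125") -> "https://doi.org/10.1029/2003JE002125"
```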

stevenlujpl added a commit that referenced this issue Mar 19, 2022

stevenlujpl commented:

@wkiri I've added the year and doc_url fields back to the DB and the exported documents.csv files. Please see the following files as examples, and let me know if you find any problems. Thanks.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia/data_mpf/documents.csv

wkiri commented Mar 23, 2022

I created a brat site for the MPF JSONL file here:
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/
As noted above, probably the only document to review for MPF would be this one:
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/2016JE005079
Unfortunately, I do not see any named MPF targets (although there are targets from several other missions mentioned). I will check again to see if the 8 MPF papers Matt previously shared were in the set of 35 that were checked, or if they should be added.

stevenlujpl commented:

@wkiri Do we need to add a few more journal papers for MPF?

wkiri commented Mar 23, 2022

Yes, we should try to add the 4 JGR + 1 Science papers that are referenced in https://github.com/wkiri/MTE/tree/master/ref/MPF#readme
I had two of them handy (bell-mpf-00.pdf, golombek-mpf-99.pdf) and put them in the JGR directory. I also added golombek-mpf-00.pdf, greeley-mpf-00.pdf, landis-mpf-00.pdf, and morris-mpf-00.pdf in case they have useful content. Could you process these with your MPF run? You could generate a separate .jsonl file for this batch - no need to run them all together with the earlier docs. See: /proj/mte/data/corpus-journals/pdf/jgr-planets/

wkiri commented Mar 23, 2022

To make review easier, I have pruned the documents for each mission in the "journals" directory under brat to only include the documents to be reviewed.

wkiri self-assigned this Mar 23, 2022

stevenlujpl commented Mar 24, 2022

@wkiri I've processed the 6 MPF documents you added. 5 documents were successfully processed, and one document (golombek-mpf-99.pdf) failed due to the jSRE out-of-memory error.

I've copied the jsonl file to the following location:

/proj/mte/results/journals/mpf_2nd.jsonl

I also copied the jsonl files from the initial MERA, MERB, MPF, and PHX runs to /proj/mte/results/journals/.

wkiri commented Mar 24, 2022

@stevenlujpl Thank you, that was fast! I'll look at these tomorrow.

wkiri commented Mar 24, 2022

@stevenlujpl These look great! They are now available at
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals

stevenlujpl commented:

@wkiri Great! Thanks for sharing the brat URL. There are targets and relations, which look promising. Please let me know if you need help reviewing them (even after this week).

stevenlujpl commented:

@wkiri I have updated the MTE parser and bundle generation scripts based on what we discussed on Monday. Please see the following steps for generating a PDS4 bundle with both LPSC and journal papers:

  1. Run lpsc_parser.py with LPSC papers to generate lpsc.jsonl
  2. Run paper_parser.py with journal papers to generate journal.jsonl
  3. Manually concatenate lpsc.jsonl and journal.jsonl (i.e., cat lpsc.jsonl > combined.jsonl and then cat journal.jsonl >> combined.jsonl)
  4. Run ingest_sqlite.py with combined.jsonl to generate combined.db with both LPSC and journal papers. Please note that the venue CLI argument has been removed as we discussed. Now, ingest_sqlite.py relies on the parser list field (i.e., the rec['metadata']['mte_parser'] field) to distinguish whether the document being processed is from LPSC or another venue (see the sketch after this list).
  5. Run generate_pds4_bundle.py with the combined.db DB file to generate the MTE PDS4 bundle.
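
To illustrate the check described in step 4, here is a rough sketch of reading the combined .jsonl and branching on the mte_parser field (the record handling and the parser-name test are assumptions, not the actual ingest_sqlite.py logic):

```python
import json


def iter_records(jsonl_path):
    """Yield one JSON record per non-empty line of a .jsonl file."""
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def is_lpsc(rec):
    """Guess whether a record came from the LPSC parser, based on the
    rec['metadata']['mte_parser'] list (the name test is an assumption)."""
    parsers = rec.get("metadata", {}).get("mte_parser", [])
    return any("lpsc" in str(p).lower() for p in parsers)


if __name__ == "__main__":
    for rec in iter_records("combined.jsonl"):
        if is_lpsc(rec):
            pass  # treat as an LPSC abstract (abstract number, LPSC-style doc_url)
        else:
            pass  # treat as a journal paper (DOI-based doc_url, no abstract number)
```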

I tested the scripts with 5 LPSC papers and 1 journal paper, and verified the results manually and with the PDS4 validate tool. I didn't find any problems. I am attaching the jsonl, DB, and bundle files in the following .zip file. Please take a look and let me know if you find any problems. Thanks.

Archive.zip

wkiri commented Mar 25, 2022

@stevenlujpl This sounds great!!! Thanks for pulling it all together.

I haven't looked at the .zip file yet but will try to do so tomorrow.

For the full process, I believe there will be 2 steps between 4 and 5 in which we run update_sqlite.py twice (once using the manually reviewed LPSC docs and once using the manually reviewed journal docs, only because they are in different directories... I guess we could put them in one directory if that makes this easier). Can you take a look and see if you think we need any changes to update_sqlite.py? You could test it with the .ann files I generated for, e.g., MPF, which are not yet reviewed but are in the correct format. However, if you are out of time for this task this week, our "test" can take place when we get to actually merging the two sets of annotations and find out if it works :)

wkiri commented Mar 25, 2022

The per-mission LPSC .jsonl files are:

  • /proj/mte/results/mpf-jsre-all-ads-gaz.jsonl
  • /proj/mte/results/phx-jsre-v2-ads.jsonl
  • /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl
  • /proj/mte/results/mer-b-jsre-v2-ads-gaz.jsonl

See /proj/mte/results/README.txt for details on each file. Note that the MER-A file was generated after we identified the 397 documents with at least one Target. We didn't have that list yet for MER-B, so its file contains the entire set of 1635 candidate documents. However, many should be omitted at the remove-orphans step of update_sqlite.py. (This is the step that I think could be problematic if run twice.)

stevenlujpl commented:

@wkiri I've added the script to insert mte_parser fields into an existing jsonl file (a minimal sketch of this kind of insertion appears at the end of this comment). I also processed the per-mission LPSC .jsonl files to insert mte_parser fields. The updated per-mission LPSC .jsonl files are at the following locations:

  • /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397-parser.jsonl
  • /proj/mte/results/mer-b-jsre-v2-ads-gaz-parser.jsonl
  • /proj/mte/results/mpf-jsre-all-ads-gaz-parser.jsonl
  • /proj/mte/results/phx-jsre-v2-ads-parser.jsonl

Please take a look and let me know if you find any problems. Thanks.
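
For reference, a minimal sketch of what such an insertion script might do (the parser label and the input/output paths are assumptions; this is not the actual MTE script):

```python
import json


def add_mte_parser(in_path, out_path, parser_name):
    """Copy a .jsonl file, adding parser_name to each record's
    metadata.mte_parser list (creating the list if it is missing)."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if not line.strip():
                continue
            rec = json.loads(line)
            rec.setdefault("metadata", {}).setdefault("mte_parser", []).append(parser_name)
            fout.write(json.dumps(rec) + "\n")


# Hypothetical usage for one of the per-mission files listed above:
# add_mte_parser("mpf-jsre-all-ads-gaz.jsonl",
#                "mpf-jsre-all-ads-gaz-parser.jsonl",
#                "lpsc_parser")
```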
