Process journal papers and add content to MTE #45

Open
1 of 7 tasks
wkiri opened this issue Feb 23, 2022 · 28 comments

wkiri commented Feb 23, 2022

The first step is to try parsing the journal documents @stevenlujpl already downloaded.
For some documents, we may need to process them multiple times, once for each mission whose targets are mentioned (see issue #22).

  • Generate initial annotations (with Contains and HasProperty relations from jSRE) using the MTE pipeline
  • Perform expert review of Target, Contains, and HasProperty annotations. Guidelines: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
  • For each mission, update aliases table (if any updates are needed)
  • For each mission, concatenate LPSC + journal .jsonl files and use them for ingest_sqlite.py
  • For each mission, run update_sqlite.py twice: once with LPSC annotations and once with journal paper annotations
  • Check contents via MTE mission-specific websites
  • Generate MTE bundle v3.0 with MER-B and journal content added

wkiri commented Feb 25, 2022

Note: the MTE schema currently has an abstract column in the documents table. Journal papers do not have an abstract number, so we should decide how to handle this.

We will also need to decide how to generate a doc_id, which is currently year_abstract. Perhaps it should be year_venue_paperid, where venue would be a short form like lpsc or jgr, and paperid would be the same as the abstract number for LPSC publications and something like volume-number-page for journal papers. This paperid could then take the place of the abstract column in documents (we'd want to rename the column accordingly).
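To make the proposal concrete, here is a minimal sketch of the naming scheme described above (the helper function and example values are illustrative only, not part of the MTE codebase):

```python
def make_doc_id(year, venue, paperid):
    """Build a doc_id of the form year_venue_paperid.

    For LPSC, paperid would be the abstract number; for journal papers it
    might be a volume-number-page string (both are illustrative here).
    """
    return "{}_{}_{}".format(year, venue.lower(), paperid)

# Hypothetical examples:
# make_doc_id(2021, "lpsc", "1234")         -> "2021_lpsc_1234"
# make_doc_id(2009, "jgr", "114-E1-E00E07") -> "2009_jgr_114-E1-E00E07"
```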

stevenlujpl commented:

  • 35 journal papers were processed, and it took about 30 minutes to process them using paper_parser.py
  • ADS title search failed for 11 papers
    • ADS doesn't have the papers
    • Grobid failed to extract titles
    • Extracted titles were wrong
  • jSRE relation extraction failed for 6 papers (due to the jSRE "out of memory" error)
  • 18 papers passed initial filter

stevenlujpl commented:

I categorized the 18 papers that passed the initial filtering process by mission to ensure that we have at least one paper for each of the MERA, MERB, MPF, and PHX missions. Some papers may appear in more than one mission list. For example, 2003JE002125.pdf discusses the MER landing sites, so it appears in both the MERA and MERB lists.

MERA:

  • 2003JE002125.pdf
  • 2010JE003633.pdf
  • 2016JE005079.pdf
  • Niles_SpaceSciRev_v174_2013.pdf

MERB:

  • 2003JE002125.pdf
  • 2006JE002728.pdf
  • 2010JE003746.pdf
  • 2014JE004686.pdf
  • 2016JE005079.pdf
  • 1248097.full.pdf

MPF:

  • 2016JE005079.pdf

PHX:

  • carrier_GRL_v42_2015.pdf
  • Cull_GRL_v37_2010.pdf
  • Heet_JGR_v114_2009.pdf
  • Mellon_JGR_v114_2009.pdf
  • Renno_JGR_v114_2009.pdf
  • Toner_GeochimCosmochimActa_v236_2014.pdf

wkiri commented Mar 18, 2022

Thanks, Steven! It looks like there are 14 unique papers here. Are the other 4 that passed the filter worth including?

stevenlujpl commented:

@wkiri The other 4 papers are MSL papers. Sorry that I forgot to mention them.

MSL:

  • jackson-drillholes-16.pdf
  • mezzacappa-dist-corr-16.pdf
  • rapin-sulfate-16.pdf
  • schwenzer-diagenesis-16.pdf

wkiri commented Mar 18, 2022

@stevenlujpl Great, thanks for the clarification!

stevenlujpl commented:

@wkiri I tested the changes I made to the MTE codebase, and they seem to be working fine. The changes are currently checked into the issue45-journal branch; please see the commits above for details. A summary of the changes:

  1. doc_id field: The doc_id field now stores filenames (without extension) for journal papers. It could instead store document indices; if you think indices make more sense than filenames, please let me know and I will update the code accordingly.
  2. abstract field: The abstract field is ignored for journal papers. With the changes I made, the abstract field will be an empty string in the DB, and it will not be included as a column in the exported CSV file of the PDS4 bundle.
  3. doc_url and year fields: The doc_url and year fields are currently handled in the same way as the abstract field.
  4. I added a new CLI argument, venue, to the bundle generation script. The venue argument is used to distinguish LPSC vs. other documents. If the input documents are from LPSC, the documents.csv of the PDS4 bundle will have 8 columns (i.e., all 8 fields from the documents table of the DB are exported to the CSV file). If the input documents are from a venue other than LPSC, then documents.csv will only have 5 columns (the abstract, doc_url, and year fields are skipped); see the sketch after this list.
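
A rough sketch of the kind of venue-dependent column selection described in item 4 (the field list, function name, and CSV layout here are assumptions for illustration, not the actual bundle generation code):

```python
import csv

# Assumed ordering of the documents-table fields; abstract, doc_url, and year
# are the three that would be skipped for non-LPSC venues.
ALL_FIELDS = ["doc_id", "title", "authors", "venue", "year",
              "abstract", "doc_url", "content"]
NON_LPSC_SKIP = {"abstract", "doc_url", "year"}


def export_documents_csv(rows, out_path, venue):
    """Write documents.csv with 8 columns for LPSC and 5 otherwise (sketch)."""
    if venue.lower() == "lpsc":
        fields = ALL_FIELDS
    else:
        fields = [f for f in ALL_FIELDS if f not in NON_LPSC_SKIP]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```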

I tested this approach using the 18 journal papers that passed the initial filtering process. The MPF jsonl, DB, and PDS4 bundle files can be found at the following locations in my /home dir. The bundle validate tool reported 0 errors.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.jsonl
/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia

This approach requires only a few minor changes to the codebase (as shown in the commits above) because it doesn't require changes to the DB schema. The drawback is that the abstract field of the documents table in the current DB schema doesn't really apply to journal papers. I think this may be okay because I consider the DB files to be intermediate products; the PDS4 bundles, as the final delivered products, don't include the abstract field in the CSV file. Please let me know what you think.

The current MTE website code won't work with the DB files generated from journal papers, primarily because the doc_url field is missing from the documents table. I will update the website code if we are planning to use it for journal DBs. Please let me know. Thanks.

wkiri commented Mar 18, 2022

@stevenlujpl This is great progress! Thank you!

Can you place the generated .jsonl files under /proj/mte/? That way I can generate brat review pages for them.

The changes seem fine in general. I have two questions:

  1. I think that it would be good to retain the year field. Is this possible?
  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

It seems I overlooked that "abstract" is not included in the final .csv files that are delivered. I should correct this in the schema diagram and in the README.

I agree that the sqlite DB is an intermediate product so it's ok for it to have more information even if not used later. As you note, however, the website does use the DB directly. It makes sense to prioritize getting the journal paper content into PDS4 bundles first, and if time remains, then update the website (but it's not on the critical path for the time remaining).

stevenlujpl commented:

  1. I think that it would be good to retain the year field. Is this possible?

It should be possible to retain the year field for documents indexed in the ADS database.

  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

It seems from the ADS website search results that the DOI fields are already formatted as URLs. I will double-check the format of the DOIs returned by directly querying the ADS database. These are great suggestions; I will work on them now.

I copied the .jsonl files to the following locations in /proj/mte/. Please let me know if you run into any problems generating the brat review pages.

/proj/mte/data/steven_working_dir/process_journals/mera/mera_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/merb/merb_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/mpf/mpf_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/phx/phx_init_filter.jsonl

stevenlujpl commented:

@wkiri I couldn't test the update_sqlite.py step to update the DB with human-reviewed brat annotations yet, but I don't foresee any problems because the DB schema isn't changed.

stevenlujpl commented:

@wkiri Do you know how to use a DOI to form a URL? The DOIs returned by the ADS database aren't formatted as URLs. For example, the DOI returned for "Analysis of MOLA data for the Mars Exploration Rover landing sites" is 10.1029/2003JE002125. How do I convert the DOI into a URL?

stevenlujpl commented Mar 19, 2022

I just googled, and it seems we can use the pattern https://doi.org/xxxxx (where xxxxx is the DOI) to convert a DOI to a URL.
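
For example, a tiny sketch of that conversion (not the actual MTE code; it just prepends the doi.org resolver when the value isn't already a URL):

```python
def doi_to_url(doi):
    """Return a resolvable URL for a DOI, passing through values that are
    already URLs."""
    if doi.startswith(("http://", "https://")):
        return doi
    return "https://doi.org/" + doi

# DOI mentioned above:
# doi_to_url("10.1029/2003JE002125") -> "https://doi.org/10.1029/2003JE002125"
```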

stevenlujpl added a commit that referenced this issue Mar 19, 2022

stevenlujpl commented:

@wkiri I've added the year and doc_url fields back to the DB and the exported documents.csv files. Please see the following files as examples, and let me know if you find any problems. Thanks.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia/data_mpf/documents.csv

wkiri commented Mar 23, 2022

I created a brat site for the MPF JSONL file here:
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/
As noted above, probably the only document to review for MPF would be this one:
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/2016JE005079
Unfortunately, I do not see any named MPF targets (although there are targets from several other missions mentioned). I will check again to see if the 8 MPF papers Matt previously shared were in the set of 35 that were checked, or if they should be added.

stevenlujpl commented:

@wkiri Do we need to add a few more journal papers for MPF?

wkiri commented Mar 23, 2022

Yes, we should try to add the 4 JGR + 1 Science papers that are referenced in https://github.com/wkiri/MTE/tree/master/ref/MPF#readme
I had two of them handy (bell-mpf-00.pdf, golombek-mpf-99.pdf) and put them in the JGR directory. I also added golombek-mpf-00.pdf, greeley-mpf-00.pdf, landis-mpf-00.pdf, and morris-mpf-00.pdf in case they have useful content. Could you process these with your MPF run? You could generate a separate .jsonl file for this batch - no need to run them all together with the earlier docs. See: /proj/mte/data/corpus-journals/pdf/jgr-planets/

wkiri commented Mar 23, 2022

To make review easier, I have pruned the documents for each mission in the "journals" directory under brat to only include the documents to be reviewed.

wkiri self-assigned this Mar 23, 2022

stevenlujpl commented Mar 24, 2022

@wkiri I've processed the 6 MPF documents you added. 5 documents were successfully processed, and one document (golombek-mpf-99.pdf) failed due to the jSRE out-of-memory error.

I've copied the jsonl file to the following location:

/proj/mte/results/journals/mpf_2nd.jsonl

I also copied the jsonl files from the initial MERA, MERB, MPF, and PHX runs to /proj/mte/results/journals/.

wkiri commented Mar 24, 2022

@stevenlujpl Thank you, that was fast! I'll look at these tomorrow.

wkiri commented Mar 24, 2022

@stevenlujpl These look great! They are now available at
https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals

stevenlujpl commented:

@wkiri Great! Thanks for sharing the brat URL. There are targets and relations, which look promising. Please let me know if you need help reviewing them (even after this week).

stevenlujpl commented:

@wkiri I have updated the MTE parser and bundle generation scripts based on what we discussed on Monday. Please see the following steps for generating a PDS4 bundle with both LPSC and journal papers:

  1. Run lpsc_parser.py with LPSC papers to generate lpsc.jsonl
  2. Run paper_parser.py with journal papers to generate journal.jsonl
  3. Manually concatenate lpsc.jsonl and journal.jsonl (i.e., cat lpsc.jsonl > combined.jsonl and then cat journal.jsonl >> combined.jsonl)
  4. Run ingest_sqlite.py with combined.jsonl to generate combined.db with both LPSC and journal papers. Please note that the venue CLI argument has been removed as we discussed. Now, ingest_sqlite.py relies on the parser list field (i.e., the rec['metadata']['mte_parser'] field) to distinguish whether the document being processed is from LPSC or another venue (see the sketch after this list).
  5. Run generate_pds4_bundle.py with the combined.db DB file to generate the MTE PDS4 bundle.
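
To illustrate the check described in step 4, here is a rough sketch of reading the combined .jsonl and branching on the mte_parser field (the record handling and the parser-name test are assumptions, not the actual ingest_sqlite.py logic):

```python
import json


def iter_records(jsonl_path):
    """Yield one JSON record per non-empty line of a .jsonl file."""
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def is_lpsc(rec):
    """Guess whether a record came from the LPSC parser, based on the
    rec['metadata']['mte_parser'] list (the name test is an assumption)."""
    parsers = rec.get("metadata", {}).get("mte_parser", [])
    return any("lpsc" in str(p).lower() for p in parsers)


if __name__ == "__main__":
    for rec in iter_records("combined.jsonl"):
        if is_lpsc(rec):
            pass  # treat as an LPSC abstract (abstract number, LPSC-style doc_url)
        else:
            pass  # treat as a journal paper (DOI-based doc_url, no abstract number)
```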

I tested the scripts with 5 LPSC papers and 1 journal paper, and verified the results manually and with the PDS4 validate tool. I didn't find any problems. I am attaching the jsonl, DB, and bundle files in the following .zip file. Please take a look and let me know if you find any problems. Thanks.

Archive.zip

wkiri commented Mar 25, 2022

@stevenlujpl This sounds great!!! Thanks for pulling it all together.

I haven't looked at the .zip file yet but will try to do so tomorrow.

For the full process, I believe there will be 2 steps between 4 and 5 in which we run update_sqlite.py twice (once using the manually reviewed LPSC docs and once using the manually reviewed journal docs, only because they are in different directories... I guess we could put them in one directory if that makes this easier). Can you take a look and see if you think we need any changes to update_sqlite.py? You could test it with the .ann files I generated for, e.g., MPF, which are not yet reviewed but are in the correct format. However, if you are out of time for this task this week, our "test" can take place when we get to actually merging the two sets of annotations and find out if it works :)

wkiri commented Mar 25, 2022

The per-mission LPSC .jsonl files are:

  • /proj/mte/results/mpf-jsre-all-ads-gaz.jsonl
  • /proj/mte/results/phx-jsre-v2-ads.jsonl
  • /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl
  • /proj/mte/results/mer-b-jsre-v2-ads-gaz.jsonl

See /proj/mte/results/README.txt for details on each file. Note that the MER-A file was generated after we identified the 397 documents with at least one Target. We didn't have that list yet for MER-B, so its file contains the entire set of 1635 candidate documents. However, many should be omitted at the remove-orphans step of update_sqlite.py. (This is the step that I think could be problematic if run twice.)

stevenlujpl commented:

@wkiri I've added the script to insert mte_parser fields into an existing jsonl file (a minimal sketch of this kind of insertion appears at the end of this comment). I also processed the per-mission LPSC .jsonl files to insert mte_parser fields. The updated per-mission LPSC .jsonl files are at the following locations:

  • /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397-parser.jsonl
  • /proj/mte/results/mer-b-jsre-v2-ads-gaz-parser.jsonl
  • /proj/mte/results/mpf-jsre-all-ads-gaz-parser.jsonl
  • /proj/mte/results/phx-jsre-v2-ads-parser.jsonl

Please take a look and let me know if you find any problems. Thanks.
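
For reference, a minimal sketch of what such an insertion script might do (the parser label and the input/output paths are assumptions; this is not the actual MTE script):

```python
import json


def add_mte_parser(in_path, out_path, parser_name):
    """Copy a .jsonl file, adding parser_name to each record's
    metadata.mte_parser list (creating the list if it is missing)."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if not line.strip():
                continue
            rec = json.loads(line)
            rec.setdefault("metadata", {}).setdefault("mte_parser", []).append(parser_name)
            fout.write(json.dumps(rec) + "\n")


# Hypothetical usage for one of the per-mission files listed above:
# add_mte_parser("mpf-jsre-all-ads-gaz.jsonl",
#                "mpf-jsre-all-ads-gaz-parser.jsonl",
#                "lpsc_parser")
```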
