[DS-4301] Added Content Reports section and Filtered Collections report therein #202

Merged
merged 25 commits into from
Feb 28, 2024
Changes from 8 commits
25 commits
177457e
Added a missing link and deleted a duplicate entry
Sep 13, 2022
1d35992
Added the Filtered Collections report spec
Sep 13, 2022
022db3c
Merge branch 'DSpace:main' into main
jeffmorin Nov 30, 2022
7a51061
Filtered Items report
Dec 2, 2022
9e094d3
Merge branch 'main' of github.com:jeffmorin/RestContract
Dec 2, 2022
77bcc5e
JSON and content fixes
Dec 2, 2022
3470e30
JSON fix
Dec 2, 2022
b8a1274
JSON fix
Dec 2, 2022
1527355
Improved API documentation
Jan 9, 2023
8bad9b3
Changed contentreports category to contentreport
Jan 11, 2023
a1a13dd
Merge branch 'DSpace:main' into main
jeffmorin Feb 15, 2023
3cc4598
Merge branch 'DSpace:main' into main
jeffmorin Mar 1, 2023
5455187
Merge branch 'DSpace:main' into main
jeffmorin Apr 19, 2023
41e9ac3
Added GET endpoint to Filtered Items report
Apr 19, 2023
001a342
Merge branch 'DSpace:main' into main
jeffmorin May 1, 2023
b6af0ae
Merge branch 'DSpace:main' into main
jeffmorin May 25, 2023
26648ee
Merge branch 'DSpace:main' into main
jeffmorin Nov 20, 2023
718ca8a
Updated to latest version from main branch
Nov 21, 2023
a552cb8
Merge branch 'DSpace:main' into main
jeffmorin Dec 18, 2023
6386f46
Merge branch 'DSpace:main' into main
jeffmorin Feb 12, 2024
768e541
Merge branch 'DSpace:main' into main
jeffmorin Feb 20, 2024
487f59b
Merge branch 'DSpace:main' into main
jeffmorin Feb 22, 2024
014af1b
Added beta feature warning in both Content Report pages
Feb 22, 2024
9b2bdca
Removed POST endpoints from documentation
Feb 27, 2024
1d4a1ff
Merge branch 'DSpace:main' into main
jeffmorin Feb 27, 2024
149 changes: 149 additions & 0 deletions contentreports-filteredcollections.md
@@ -0,0 +1,149 @@
# Displaying the filtered collections report
[Back to the list of all defined endpoints](endpoints.md)

## Statistics for the whole repository
**GET /api/contentreports/filteredcollections**

**POST /api/contentreports/filteredcollections**

This endpoint provides aggregated statistics about the number of items per collection according to selected filters.

For each collection, the basic report consists of:
* name (label) and handle of the collection
* name (label) and handle of the parent community
* total number of items
* number of items matching all selected filters

In addition, a `summary` element provides the total number of items and the total number of items matching all filters
for the whole repository.

An example JSON response document to `/api/contentreports/filteredcollections`:
```json
{
  "id": "filteredcollections",
  "collections": [
    {
      "label": "Collection 1",
      "handle": "100/1",
      "values": {
        "is_discoverable": 23,
        "has_multiple_originals": 3,
        "has_pdf_original": 14
      },
      "community_label": "Community 1",
      "community_handle": "20.500.11794/1",
      "nb_total_items": 23,
      "all_filters_value": 3
    },
    {
      "label": "Collection 2",
      "handle": "100/2",
      "values": {
        "is_discoverable": 1,
        "has_multiple_originals": 0,
        "has_pdf_original": 0
      },
      "community_label": "Community 1",
      "community_handle": "20.500.11794/1",
      "nb_total_items": 1,
      "all_filters_value": 0
    },
    {
      "label": "Collection 3",
      "handle": "100/3",
      "values": {
        "is_discoverable": 1,
        "has_multiple_originals": 0,
        "has_pdf_original": 1
      },
      "community_label": "Community 1",
      "community_handle": "20.500.11794/1",
      "nb_total_items": 1,
      "all_filters_value": 0
    }
  ],
  "summary": {
    "label": null,
    "handle": null,
    "values": {
      "is_discoverable": 25,
      "has_multiple_originals": 3,
      "has_pdf_original": 15
    },
    "community_label": null,
    "community_handle": null,
    "nb_total_items": 25,
    "all_filters_value": 3
  },
  "type": "filtered-collections",
  "_links": {
    "self": {
      "href": "http://localhost:8080/dspace-server/api/contentreports/filteredcollections"
    }
  }
}
```
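
As a sanity check on the example above, the `summary` values can be recomputed from the per-collection entries. The following is a minimal Python sketch, assuming the field names shown in the example response and assuming each item is counted in exactly one collection. The helper name `summarize` is hypothetical and not part of the REST contract.

```python
from collections import Counter


def summarize(collections):
    """Recompute summary totals from a list of per-collection report entries."""
    values = Counter()
    for entry in collections:
        # Add this collection's per-filter counts to the running totals.
        values.update(entry["values"])
    return {
        "values": dict(values),
        "nb_total_items": sum(e["nb_total_items"] for e in collections),
        "all_filters_value": sum(e["all_filters_value"] for e in collections),
    }


# Counts copied from the three collections in the example response.
example_collections = [
    {"values": {"is_discoverable": 23, "has_multiple_originals": 3, "has_pdf_original": 14},
     "nb_total_items": 23, "all_filters_value": 3},
    {"values": {"is_discoverable": 1, "has_multiple_originals": 0, "has_pdf_original": 0},
     "nb_total_items": 1, "all_filters_value": 0},
    {"values": {"is_discoverable": 1, "has_multiple_originals": 0, "has_pdf_original": 1},
     "nb_total_items": 1, "all_filters_value": 0},
]

summary = summarize(example_collections)
```

With the example data, the recomputed totals agree with the `summary` element of the response.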

The request can be parameterized with a series of filters to add to the basic report.

In GET mode, the filters are passed in a `filters` query parameter whose value is a comma-separated list of
filter names, like the following:
```
?filters=is_discoverable,has_multiple_originals,has_pdf_original
```
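
A client could assemble such a GET URL as in the following Python sketch. The base URL is an assumption taken from the `localhost` address used in the example response, and the function name `filtered_collections_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the server root used in the example response.
BASE_URL = "http://localhost:8080/dspace-server"


def filtered_collections_url(filters, base_url=BASE_URL):
    """Build the GET URL for the Filtered Collections report."""
    endpoint = f"{base_url}/api/contentreports/filteredcollections"
    if not filters:
        # No filters selected: request the basic report.
        return endpoint
    # safe="," keeps the commas literal instead of encoding them as %2C.
    query = urlencode({"filters": ",".join(filters)}, safe=",")
    return f"{endpoint}?{query}"


url = filtered_collections_url(
    ["is_discoverable", "has_multiple_originals", "has_pdf_original"]
)
```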

In POST mode, the filters are defined in a JSON request body like this:
Member

I'd prefer if we describe GET and POST mode separately. I'm having a hard time understanding the way this is documented. When would someone use GET and when would they use POST? It's unclear if everything below this point in the docs is ONLY for POST or if it also applies to GET? Could we give some more examples here as to what the differences are?

Contributor Author

I reorganized report parameterization in the Filtered Collections report documentation. I also fixed a few mistakes and added some information I realised was missing (e.g., the definition of "basic report" in Filtered Items).

About usage of GET vs. POST, please see my other comments below.

```json
{
  "filters": {
    "is_discoverable": true,
    "has_multiple_originals": true,
    "has_pdf_original": true
  }
}
```
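
The POST body shown above can be generated from a plain list of filter names. This is a small Python sketch under that assumption. The helper name `filters_body` is hypothetical, and an actual POST would also need to carry a valid CSRF token.

```python
import json


def filters_body(filter_names):
    """Build the JSON request body that selects the given filters."""
    # Each selected filter is sent as a key mapped to true.
    return json.dumps({"filters": {name: True for name in filter_names}}, indent=2)


body = filters_body(["is_discoverable", "has_multiple_originals", "has_pdf_original"])
```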

The available filters are as follows:

* Item Property Filters
  * `is_item`: Is Item - always true
  * `is_withdrawn`: Withdrawn Items
  * `is_not_withdrawn`: Available Items - Not Withdrawn
  * `is_discoverable`: Discoverable Items - Not Private
  * `is_not_discoverable`: Not Discoverable - Private Item
* Basic Bitstream Filters
  * `has_multiple_originals`: Item has Multiple Original Bitstreams
  * `has_no_originals`: Item has No Original Bitstreams
  * `has_one_original`: Item has One Original Bitstream
* Bitstream Filters by MIME Type
  * `has_doc_original`: Item has a Doc Original Bitstream (PDF, Office, Text, HTML, XML, etc)
  * `has_image_original`: Item has an Image Original Bitstream
  * `has_unsupp_type`: Has Other Bitstream Types (not Doc or Image)
  * `has_mixed_original`: Item has multiple types of Original Bitstreams (Doc, Image, Other)
  * `has_pdf_original`: Item has a PDF Original Bitstream
  * `has_jpg_original`: Item has JPG Original Bitstream
  * `has_small_pdf`: Has unusually small PDF
  * `has_large_pdf`: Has unusually large PDF
  * `has_doc_without_text`: Has document bitstream without TEXT item
* Supported MIME Type Filters
  * `has_only_supp_image_type`: Item Image Bitstreams are Supported
  * `has_unsupp_image_type`: Item has Image Bitstream that is Unsupported
  * `has_only_supp_doc_type`: Item Document Bitstreams are Supported
  * `has_unsupp_doc_type`: Item has Document Bitstream that is Unsupported
* Bitstream Bundle Filters
  * `has_unsupported_bundle`: Has bitstream in an unsupported bundle
  * `has_small_thumbnail`: Has unusually small thumbnail
  * `has_original_without_thumbnail`: Has original bitstream without thumbnail
  * `has_invalid_thumbnail_name`: Has invalid thumbnail name (assumes one thumbnail for each original)
  * `has_non_generated_thumb`: Has non-generated thumbnail
  * `no_license`: Doesn't have a license
  * `has_license_documentation`: Has documentation in the license bundle
* Permission Filters
  * `has_restricted_original`: Item has Restricted Original Bitstream
  * `has_restricted_thumbnail`: Item has Restricted Thumbnail
  * `has_restricted_metadata`: Item has Restricted Metadata
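
A client may want to check filter names against the list above before sending a request. The following Python sketch does that on the client side only. How the server treats an unknown filter name is not specified here, and the names `FILTERS_BY_CATEGORY` and `unknown_filters` are hypothetical.

```python
# Filter names copied from the list above, grouped by category.
FILTERS_BY_CATEGORY = {
    "Item Property Filters": [
        "is_item", "is_withdrawn", "is_not_withdrawn",
        "is_discoverable", "is_not_discoverable",
    ],
    "Basic Bitstream Filters": [
        "has_multiple_originals", "has_no_originals", "has_one_original",
    ],
    "Bitstream Filters by MIME Type": [
        "has_doc_original", "has_image_original", "has_unsupp_type",
        "has_mixed_original", "has_pdf_original", "has_jpg_original",
        "has_small_pdf", "has_large_pdf", "has_doc_without_text",
    ],
    "Supported MIME Type Filters": [
        "has_only_supp_image_type", "has_unsupp_image_type",
        "has_only_supp_doc_type", "has_unsupp_doc_type",
    ],
    "Bitstream Bundle Filters": [
        "has_unsupported_bundle", "has_small_thumbnail",
        "has_original_without_thumbnail", "has_invalid_thumbnail_name",
        "has_non_generated_thumb", "no_license", "has_license_documentation",
    ],
    "Permission Filters": [
        "has_restricted_original", "has_restricted_thumbnail",
        "has_restricted_metadata",
    ],
}

# Flat set of every documented filter name.
ALL_FILTERS = {name for names in FILTERS_BY_CATEGORY.values() for name in names}


def unknown_filters(requested):
    """Return the requested names that are not in the documented list, sorted."""
    return sorted(set(requested) - ALL_FILTERS)


typos = unknown_filters(["is_discoverable", "has_pdf"])
```

Here `typos` contains only `has_pdf`, since `is_discoverable` is a documented filter.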

Possible response status

* 200 OK - The specific report data was found, and the data has been properly returned.
* 403 Forbidden - if a valid CSRF token is missing when issuing a POST request.
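
The two documented statuses could be handled on the client side roughly as follows. This is only a sketch: the function name `check_report_status` is hypothetical, and any status other than the two listed above is treated as unexpected.

```python
def check_report_status(status_code, method="GET"):
    """Return normally on 200 OK; raise a descriptive error otherwise."""
    if status_code == 200:
        return
    if status_code == 403 and method.upper() == "POST":
        # Documented case: the POST request lacked a valid CSRF token.
        raise PermissionError("403 Forbidden: missing or invalid CSRF token")
    raise RuntimeError(f"Unexpected status {status_code} for {method} request")
```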
129 changes: 129 additions & 0 deletions contentreports-filtereditems.md
@@ -0,0 +1,129 @@
# Displaying the filtered items report
[Back to the list of all defined endpoints](endpoints.md)

## Statistics for the whole repository
**POST /api/contentreports/filtereditems**
Member

Why is this a POST instead of a GET? I notice that the "statistics" endpoints always use GET except when they are adding data to the statistics. See https://github.com/DSpace/RestContract/blob/main/statistics-reports.md and https://github.com/DSpace/RestContract/blob/main/statistics-viewevents.md

Could we better describe why we need to use POST for these endpoints? It appears they are readonly, which implies they might be switched to GET.

Contributor Author

@jeffmorin jeffmorin Jan 9, 2023

That's true, I thought about it. My concern with GET is, as you suggested above, the limited length of the parameters passed as part of the URL. There should be no problems with the Filtered Collections report (only Boolean filters).

Parameterization of the Filtered Items report, however, is much more complex and can easily become long enough to exceed any limit enforced by application servers for URL query strings. This is why I implemented this report as a POST endpoint.

For (a bit of) uniformity, I also added POST support to the Filtered Collections report.

Besides, while the HTTP spec clearly states that GET should be used for read-only requests, I saw nothing stating that POST should be used only for data-changing requests.

If you feel that everything should be switched to GET anyway, I can do it. In this case, the Filtered Items report shall be thoroughly tested with lots of parameters selected to make sure that nothing goes wrong due to too long a query string.

Contributor Author

@jeffmorin jeffmorin Jan 9, 2023

About Filtered Items: The parameters could be organized into a GET query string, although such a string might become quite long. Another concern I had while designing the API for this report is the "query predicates" part: these are structured parameters (a query predicate is a (field, operator, value) tuple). This is another reason why I didn't include a GET version of this report.

Contributor

Please check the solution used in https://github.com/DSpace/RestContract/blob/main/search-endpoint.md.

There, a query-string solution is used for filtering results:

f.<:filter-name>=<:filter-value>,<:filter-operator>

where the filter name belongs to a predefined structure that is returned beforehand, for example: f.title=rainbows,notcontains

{
  "query":"my query",
  "scope":"9076bd16-e69a-48d6-9e41-0238cb40d863",
  "appliedFilters": [
      {
        "filter" : "title",
        "operator" : "notcontains",
        "value" : "abcd",
        "label" : "abcd"
      },

Member

I think this discussion stems from a disagreement between HTTP and REST about what POST is for. RFC 9110 says that creating a new resource is only one possible use for POST. "The POST method requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics." The description of POST here is quite a bit narrower. One might say that REST is a reuse of HTTP syntax with different semantics.

So it can be argued that REST is not a very good fit for this operation, but it is what we have.

Would it be a violation of the "spirit of REST" to consider such a POST to be storing a report description, which is consumed in the act of generating the report? Reports may take some time to create. Suppose that one POSTs a document describing the desired report and receives a token in the response. The report generator runs in the background. When finished, the token can be presented (using GET) to receive the report, and the report description is then destroyed.


This endpoint provides a custom query API to select items from existing collections.

An example JSON response document to `/api/contentreports/filtereditems` (metadata removed for brevity):
```json
{
  "id": "filtereditems",
  "items": [
    {
      "id": "07e388ff-f22b-4d4f-8275-acab5c3edacc",
      "uuid": "07e388ff-f22b-4d4f-8275-acab5c3edacc",
      "name": "Enhancing the lubricity of an environmentally friendly Swedish diesel fuel MK1",
      "handle": "20.500.11794/42",
      "metadata": {
        "dc.contributor.author": [
          {
            "value": "Smith, John",
            "language": null,
            "authority": "6eee383a-f126-4705-9ffb-b4aa4832070e",
            "confidence": 600,
            "place": 0
          }
        ],
        "dc.publisher": [
          {
            "value": "Elsevier",
            "language": "fr_CA",
            "authority": null,
            "confidence": -1,
            "place": 0
          }
        ]
      },
      "inArchive": true,
      "discoverable": true,
      "withdrawn": false,
      "lastModified": "2015-11-23T17:30:21.463+00:00",
      "entityType": "Publication",
      "owningCollection": {
        "id": "d98a828c-45c2-43d9-9861-6b9800bf14f5",
        "uuid": "d98a828c-45c2-43d9-9861-6b9800bf14f5",
        "name": "Articles publiés dans des revues avec comité de lecture",
        "handle": "100/1",
        "metadata": {
          "dc.identifier.uri": [
            {
              "value": "http://localhost:4000/handle/100/1",
              "language": null,
              "authority": null,
              "confidence": -1,
              "place": 0
            }
          ],
          "dspace.entity.type": [
            {
              "value": "Publication",
              "language": null,
              "authority": null,
              "confidence": -1,
              "place": 0
            }
          ]
        },
        "type": "collection"
      },
      "type": "item"
    },
    {
      ...
    }
  ],
  "itemCount": 40,
  "type": "filtereditemsreport",
  "_links": {
    "self": {
      "href": "http://localhost:8080/dspace-server/api/contentreports/filtereditems"
    }
  }
}
```
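
To illustrate how a client might read such a response, here is a short Python sketch that lists each item with its owning collection and looks up a single metadata value. It assumes only the field names visible in the example above. The helper names `first_value` and `summarize_items` are hypothetical, and the sample data is trimmed to the fields actually used.

```python
def first_value(dspace_object, field):
    """Return the first value of a metadata field, or None if the field is absent."""
    entries = dspace_object.get("metadata", {}).get(field, [])
    return entries[0]["value"] if entries else None


def summarize_items(report):
    """Return (handle, name, owning collection name) for each item in the report."""
    return [
        (item["handle"], item["name"], item["owningCollection"]["name"])
        for item in report["items"]
    ]


# Trimmed copy of the first item from the example response.
sample_report = {
    "items": [
        {
            "name": "Enhancing the lubricity of an environmentally friendly Swedish diesel fuel MK1",
            "handle": "20.500.11794/42",
            "metadata": {
                "dc.contributor.author": [{"value": "Smith, John"}],
                "dc.publisher": [{"value": "Elsevier"}],
            },
            "owningCollection": {
                "name": "Articles publiés dans des revues avec comité de lecture",
                "handle": "100/1",
            },
        }
    ],
    "itemCount": 40,
}

rows = summarize_items(sample_report)
publisher = first_value(sample_report["items"][0], "dc.publisher")
```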

The request is defined as a JSON document like this:
```json
{
  "collections": [
    ""
  ],
  "presetQuery": "new",
  "queryPredicates": [
    {
      "field": "*",
      "operator": null,
      "value": null
    }
  ],
  "pageLimit": "100",
  "filters": {
    "is_discoverable": true,
    "has_multiple_originals": true,
    "has_pdf_original": true
  },
  "additionalFields": [
    "dc.contributor.advisor"
  ]
}
```

The parameters are specified as follows:

* collections: The collections in which to search for items. If none are provided, the whole repository is searched.
* presetQuery: This parameter is not used on the REST API side. It names a predefined set of query predicates
defined in the Angular layer.
* queryPredicates: Predicates used to filter matching items. They can be predefined (see presetQuery above)
or defined specifically by the user.
* pageLimit: Maximum number of items per page.
* filters: Supplementary filters. These are the same as those available in the Filtered Collections report.
Please see [/api/contentreports/filteredcollections](contentreports-filteredcollections.md) for details.
* additionalFields: Fields to add to the basic report for each item included in the report.
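
The parameters above map naturally onto a small typed structure. The following Python sketch models the request body with dataclasses and serializes it to JSON. The class names `QueryPredicate` and `FilteredItemsRequest` are hypothetical, the attribute names deliberately mirror the JSON keys, and `pageLimit` is kept as a string only because the example request shows it that way.

```python
import dataclasses
import json
from typing import Optional


@dataclasses.dataclass
class QueryPredicate:
    """One (field, operator, value) predicate."""
    field: str
    operator: Optional[str] = None
    value: Optional[str] = None


@dataclasses.dataclass
class FilteredItemsRequest:
    """Request body for the Filtered Items report (attribute names mirror the JSON keys)."""
    collections: list = dataclasses.field(default_factory=list)
    presetQuery: Optional[str] = None
    queryPredicates: list = dataclasses.field(default_factory=list)
    pageLimit: str = "100"
    filters: dict = dataclasses.field(default_factory=dict)
    additionalFields: list = dataclasses.field(default_factory=list)

    def to_json(self):
        # asdict() also converts the nested QueryPredicate instances to dicts.
        return json.dumps(dataclasses.asdict(self), indent=2)


request = FilteredItemsRequest(
    queryPredicates=[QueryPredicate(field="*")],
    filters={"is_discoverable": True, "has_pdf_original": True},
    additionalFields=["dc.contributor.advisor"],
)
payload = request.to_json()
```

Leaving `collections` empty corresponds to searching the whole repository, as described in the parameter list.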

Possible response status

* 200 OK - The specific report data was found, and the data has been properly returned.
* 403 Forbidden - if a valid CSRF token is missing when issuing a POST request.
5 changes: 3 additions & 2 deletions endpoints.md
@@ -2,9 +2,8 @@
[REST Overview Documentation](README.md)

## Available Endpoints
* /api/core/sites
* [/api/core/communities](communities.md)
* /api/core/collections
* [/api/core/collections](collections.md)
* [/api/core/items](items.md)
* [/api/core/itemtemplates](itemtemplates.md)
* [/api/core/bitstreams](bitstreams.md)
@@ -48,6 +47,8 @@
* [/api/authz/features](features.md)
* [/api/statistics](statistics.md)
* [/api/tools/itemrequests](item-requests.md)
* [/api/contentreports/filteredcollections](contentreports-filteredcollections.md)
* [/api/contentreports/filtereditems](contentreports-filtereditems.md)

## Actuator endpoints
The following endpoints are implemented using [Spring Boot Actuator](https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.enabling) and are enabled by default: