Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SFR-2476: Implement DSpace service and update DOAB ingest #531

Merged
merged 7 commits into from
Jan 31, 2025

Conversation

jackiequach
Copy link
Contributor

Description

  • Adds DSpace service that can be used by ingest processes
  • Updates DOAB ingest process to use DSpace service

Testing

python main.py -p DOABProcess -e local -r "20.500.12854/62823"

self.constants = get_constants()

self.base_url = base_url
self.SourceMapping = SourceMapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do source_mapping for both the param and class variable here.

It looks a bit funky when you instantiate an object but it's common practice to use snake case.

'oaire': 'https://raw.githubusercontent.com/rcic/openaire4/master/schemas/4.0/oaire.xsd'
}

def __init__(self, base_url, SourceMapping: XMLMapping):
Copy link
Contributor

@kylevillegas93 kylevillegas93 Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fun cuz I've actually never done this in python so we are learning ha!

I think when we say that the SourceMapping type is XMLMapping, we are expecting an actual object that is an instance of XMLMapping.

Checkout typing for Python: https://docs.python.org/3/library/typing.html

I think we could do type[XMLMapping] or the like here.

Comment on lines 26 to 36
self.s3_manager = S3Manager()
self.s3_manager.createS3Client()
self.s3_bucket = os.environ['FILE_BUCKET']

self.file_queue = os.environ['FILE_QUEUE']
self.file_route = os.environ['FILE_ROUTING_KEY']

self.rabbitmq_manager = RabbitMQManager()
self.rabbitmq_manager.createRabbitConnection()
self.rabbitmq_manager.createOrConnectQueue(
self.file_queue, self.file_route)
Copy link
Contributor

@kylevillegas93 kylevillegas93 Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm are these composed class instances being called? I think we can remove them if not and it makes it cleaner because this class is really only responsible for getting records from DSpace so it's following the single responsibility principle: https://stackify.com/solid-design-principles/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved. This was a good read, thanks!

if self.process == 'daily':
records = self.dSpace_service.get_records(offset=self.offset, limit=self.limit)
elif self.process == 'complete':
records = self.dSpace_service.get_records(full_or_partial=True, offset=self.offset, limit=self.limit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend changing the name full_or_partial to full_import since it's more clear what this variable does and the publisherBacklistProcess uses that variable name too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up you could also just remove the full param!

@jackiequach
Copy link
Contributor Author

I'm working on adding integration tests for the DSpace service

self.saveRecords()
self.commitChanges()

logger.info(f'Ingested {len(self.records) if records else 1} DOAB records')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the result of the else conditional be 0 if the records array is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should still return 0 since the len of the records array would be 0 if it is empty.

Copy link
Contributor

@mitri-slory mitri-slory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are looking good so far Jackie!

@jackiequach jackiequach marked this pull request as ready for review January 28, 2025 16:43

return BytesIO(content)

raise DSpaceError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just raise an Exception here unless downstream we need to handle it in a specific way.

f'Received {response.status_code} status code from {url}')


class DSpaceError(Exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is no longer needed - feel free to remove!

class DOABProcess(CoreProcess):
DOAB_BASE_URL = 'https://directory.doabooks.org/oai/request?'
ROOT_NAMESPACE = {None: 'http://www.openarchives.org/OAI/2.0/'}
DOAB_IDENTIFIER = "oai:directory.doabooks.org"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do single quotes here! we really need a formatter lol


self.constants = get_constants()
self.dSpace_service = DSpaceService(base_url=self.DOAB_BASE_URL, source_mapping=DOABMapping)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommend doing d_space_service or dspace_service so that it's not a mix-match of camel and snake case.

Copy link
Contributor

@kylevillegas93 kylevillegas93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look great! Just a few minor non-blocking comments.

Copy link
Contributor

@mitri-slory mitri-slory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look really good Jackie!

@jackiequach jackiequach merged commit a301d80 into main Jan 31, 2025
1 check passed
@jackiequach jackiequach deleted the SFR-2476/abstract-dspace-integration branch January 31, 2025 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants