feature/mx-1502 wikidata results refactor #39

Merged on May 16, 2024 (38 commits)

Commits
36e63dc
receive only one wikidata org
mr-kamran-ali Mar 25, 2024
e04b09a
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Mar 25, 2024
fecfec3
handle wikidata dagster asset for rki
mr-kamran-ali Mar 25, 2024
0d4e781
Timestamp updates in tests
mr-kamran-ali Mar 25, 2024
c8ddf10
update odk wikidata integration
mr-kamran-ali Mar 25, 2024
d743add
fix odk organization merged id extraction
mr-kamran-ali Mar 26, 2024
d99bbbd
fix temporal entity precision
mr-kamran-ali Mar 26, 2024
306d054
adjust temporalentity artificial data
mr-kamran-ali Mar 26, 2024
39008f6
update changelog and version
mr-kamran-ali Mar 27, 2024
073b1f6
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Mar 27, 2024
1d2bd47
move quotation mark filtering to mex-common
mr-kamran-ali Mar 28, 2024
c20fd94
update changelog
mr-kamran-ali Mar 28, 2024
4850e73
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Mar 28, 2024
29af6bc
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Apr 2, 2024
f4e2cf4
update changelog
mr-kamran-ali Apr 4, 2024
64f740c
adjust variable declaration
mr-kamran-ali Apr 4, 2024
c6c2984
rename get_timestamp to get_temporal_entity
mr-kamran-ali Apr 4, 2024
958cdd0
reverse ff projects temporal entity precision
mr-kamran-ali Apr 4, 2024
53f0753
remove TODO comment
mr-kamran-ali Apr 4, 2024
01c677b
fix datscha_web mocking result
mr-kamran-ali Apr 4, 2024
29f8bc9
reverse ff projects test assert
mr-kamran-ali Apr 4, 2024
bf6dc59
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Apr 5, 2024
3c1fbe4
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Apr 8, 2024
b5dee7c
update mex-common dependency
mr-kamran-ali Apr 8, 2024
ae3a045
temporal entity changes
mr-kamran-ali Apr 16, 2024
ac729f3
move odk wikidata extraction out of transform
mr-kamran-ali Apr 16, 2024
1c2a39c
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
mr-kamran-ali Apr 16, 2024
c821c92
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
cutoffthetop May 13, 2024
75391a1
Update mex/international_projects/extract.py
cutoffthetop May 13, 2024
d54bdfa
Apply new mex-common version
cutoffthetop May 14, 2024
d11f185
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
cutoffthetop May 14, 2024
17cf946
Fix tests
cutoffthetop May 14, 2024
00468ef
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
cutoffthetop May 14, 2024
05d07ef
Merge branch 'feature/mx-1502-wikidata-results-refactor' of https://g…
cutoffthetop May 14, 2024
e91b848
Update versions
cutoffthetop May 14, 2024
d111226
hold back version bump
cutoffthetop May 14, 2024
b032707
Fix tests
cutoffthetop May 16, 2024
fe83682
Merge branch 'main' of https://github.com/robert-koch-institut/mex-ex…
cutoffthetop May 16, 2024
8 changes: 4 additions & 4 deletions .pre-commit-config.yaml
@@ -3,16 +3,16 @@ default_language_version:
 python: python3.11
 repos:
 - repo: https://github.com/psf/black
-rev: 24.3.0
+rev: 24.4.2
 hooks:
 - id: black
 - repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.3.5
+rev: v0.4.4
 hooks:
 - id: ruff
 args: [--fix, --exit-non-zero-on-fix]
 - repo: https://github.com/pre-commit/pre-commit-hooks
-rev: v4.5.0
+rev: v4.6.0
 hooks:
 - id: pretty-format-json
 name: json
@@ -28,7 +28,7 @@ repos:
 - id: fix-byte-order-marker
 name: byte-order
 - repo: https://github.com/pdm-project/pdm
-rev: 2.13.2
+rev: 2.15.2
 hooks:
 - id: pdm-lock-check
 name: pdm
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -23,6 +23,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - make pyodbc a soft dependency (only pipelines that use it may fail)
 - switch from poetry to pdm
 - move MSSQL Server authentication to general settings
+- receive one or None organization from wikidata aux extractor
+- adjust Timestamp usage to TemporalEntity
+- move quotation marks (") filtering to mex-common from requested wikidata label
 - get seq-repo data via mex-drop connector (was: file)

 ### Deprecated
4 changes: 2 additions & 2 deletions mex/artificial/main.py
@@ -6,8 +6,8 @@
 IdentityProvider,
 LinkProvider,
 PatternProvider,
+TemporalEntityProvider,
 TextProvider,
-TimestampProvider,
 )
 from mex.artificial.settings import ArtificialSettings
 from mex.common.cli import entrypoint
@@ -68,7 +68,7 @@ def factories(faker: Faker, identities: IdentityMap) -> Faker:
 factory.add_provider(PatternProvider(factory))
 factory.add_provider(BuilderProvider(factory))
 factory.add_provider(TextProvider(factory))
-factory.add_provider(TimestampProvider(factory))
+factory.add_provider(TemporalEntityProvider(factory))
 return faker


29 changes: 19 additions & 10 deletions mex/artificial/provider.py
@@ -15,14 +15,15 @@
 from mex.common.identity import Identity
 from mex.common.models import ExtractedData
 from mex.common.types import (
-TIMESTAMP_FORMATS_BY_PRECISION,
+TEMPORAL_ENTITY_FORMATS_BY_PRECISION,
 UTC,
 Email,
 Identifier,
 Link,
 LinkLanguage,
+TemporalEntity,
+TemporalEntityPrecision,
 Text,
-Timestamp,
 )


@@ -86,8 +87,10 @@ def field_value(
 factory = self.generator.email
 elif issubclass(inner_type, Text):
 factory = self.generator.text_object
-elif issubclass(inner_type, Timestamp):
-factory = self.generator.timestamp
+elif issubclass(inner_type, TemporalEntity):
+factory = partial(
+self.generator.temporal_entity, inner_type.ALLOWED_PRECISION_LEVELS
+)
 elif issubclass(inner_type, Enum):
 factory = partial(self.random_element, inner_type)
 elif issubclass(inner_type, str):
@@ -160,15 +163,21 @@ def link(self) -> Link:
 return Link(url=self.url(), title=title, language=language)


-class TimestampProvider(PythonFakerProvider):
-"""Faker provider that can return a custom Timestamp with random precision."""
+class TemporalEntityProvider(PythonFakerProvider):
+"""Faker provider that can return a custom TemporalEntity with random precision."""

-def timestamp(self) -> Timestamp:
-"""Return a custom Timestamp with random date, time and precision."""
-return Timestamp(
+def temporal_entity(
+self, allowed_precision_levels: list[TemporalEntityPrecision]
+) -> TemporalEntity:
+"""Return a custom temporal entity with random date, time and precision."""
+return TemporalEntity(
 datetime.fromtimestamp(
 self.pyint(int(8e8), int(datetime.now().timestamp())), tz=UTC
-).strftime(self.random_element(TIMESTAMP_FORMATS_BY_PRECISION.values()))
+).strftime(
+TEMPORAL_ENTITY_FORMATS_BY_PRECISION[
+self.random_element(allowed_precision_levels)
+]
+)
 )
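
The provider change above replaces "pick any known format" with "pick a precision the target type allows, then look up its format". A minimal standalone sketch of that idea follows, using only the standard library. `Precision`, `FORMATS_BY_PRECISION` and `random_temporal_string` are hypothetical stand-ins; the real members of `TemporalEntityPrecision` and the format strings in `TEMPORAL_ENTITY_FORMATS_BY_PRECISION` live in mex-common and are not shown in this diff.

```python
import random
from datetime import datetime, timezone
from enum import Enum


class Precision(Enum):
    """Hypothetical stand-in for TemporalEntityPrecision."""

    YEAR = "year"
    DAY = "day"
    SECOND = "second"


# Hypothetical stand-in for TEMPORAL_ENTITY_FORMATS_BY_PRECISION.
FORMATS_BY_PRECISION = {
    Precision.YEAR: "%Y",
    Precision.DAY: "%Y-%m-%d",
    Precision.SECOND: "%Y-%m-%dT%H:%M:%SZ",
}


def random_temporal_string(allowed: list[Precision], rng: random.Random) -> str:
    """Format a random UTC datetime with a precision drawn from `allowed`."""
    # Same lower bound as in the diff: 8e8 seconds after the epoch (1995).
    seconds = rng.randint(int(8e8), int(datetime.now().timestamp()))
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Draw the precision first, then look up its format string.
    return moment.strftime(FORMATS_BY_PRECISION[rng.choice(allowed)])
```

Passing the allowed levels in as an argument is what lets a field type that only permits, say, year or day precision receive artificial values it can actually validate.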


4 changes: 2 additions & 2 deletions mex/biospecimen/transform.py
@@ -16,8 +16,8 @@
 Identifier,
 Link,
 ResourceTypeGeneral,
+TemporalEntity,
 Theme,
-Timestamp,
 )


@@ -133,7 +133,7 @@ def transform_biospecimen_resource_to_mex_resource(
 rights=resource.rechte,
 sizeOfDataBasis=resource.vorhandene_anzahl_der_proben,
 spatial=resource.raeumlicher_bezug,
-temporal=cast(list[Timestamp | str], resource.zeitlicher_bezug),
+temporal=cast(list[TemporalEntity | str], resource.zeitlicher_bezug),
 theme=theme,
 title=resource.offizieller_titel_der_probensammlung,
 unitInCharge=unit_in_charge,
6 changes: 3 additions & 3 deletions mex/blueant/models/project.py
@@ -1,5 +1,5 @@
 from mex.common.models import BaseModel
-from mex.common.types import Timestamp
+from mex.common.types import TemporalEntity


 class BlueAntClient(BaseModel):
@@ -13,11 +13,11 @@ class BlueAntProject(BaseModel):

 clients: list[BlueAntClient]
 departmentId: int
-end: Timestamp
+end: TemporalEntity
 name: str
 number: str
 projectLeaderId: int
-start: Timestamp
+start: TemporalEntity
 statusId: int
 typeId: int

10 changes: 5 additions & 5 deletions mex/blueant/models/source.py
@@ -1,6 +1,6 @@
 from collections.abc import Sequence

-from mex.common.types import Timestamp
+from mex.common.types import TemporalEntity
 from mex.models import BaseRawData


@@ -9,23 +9,23 @@ class BlueAntSource(BaseRawData):

 client_names: list[str]
 department: str
-end: Timestamp
+end: TemporalEntity
 name: str
 number: str
 projectLeaderEmployeeId: str | None = None
-start: Timestamp
+start: TemporalEntity
 status: str
 type_: str

 def get_partners(self) -> Sequence[str | None]:
 """Return partners from extractor."""
 return []

-def get_start_year(self) -> Timestamp | None:
+def get_start_year(self) -> TemporalEntity | None:
 """Return start year from extractor."""
 return self.start

-def get_end_year(self) -> Timestamp | None:
+def get_end_year(self) -> TemporalEntity | None:
 """Return end year from extractor."""
 return self.end

6 changes: 3 additions & 3 deletions mex/confluence_vvt/models/source.py
@@ -1,7 +1,7 @@
 from collections.abc import Sequence
 from typing import cast

-from mex.common.types import Timestamp
+from mex.common.types import TemporalEntity
 from mex.models import BaseRawData


@@ -35,11 +35,11 @@ def get_partners(self) -> Sequence[str | None]:
 """Return partners from extractor."""
 return []

-def get_start_year(self) -> Timestamp | None:
+def get_start_year(self) -> TemporalEntity | None:
 """Return start year from extractor."""
 return None

-def get_end_year(self) -> Timestamp | None:
+def get_end_year(self) -> TemporalEntity | None:
 """Return end year from extractor."""
 return None

5 changes: 2 additions & 3 deletions mex/datscha_web/extract.py
@@ -68,7 +68,6 @@ def extract_datscha_web_organizations(
 for item in datscha_web_items:
 for partner in item.get_partners():
 if partner and partner != "None":
-organization = list(search_organization_by_label(partner))
-if len(organization) == 1:
-partner_to_org_map[partner] = organization[0]
+if organization := search_organization_by_label(partner):
+partner_to_org_map[partner] = organization
 return partner_to_org_map
6 changes: 3 additions & 3 deletions mex/datscha_web/models/item.py
@@ -2,7 +2,7 @@

 from pydantic import Field

-from mex.common.types import Timestamp
+from mex.common.types import TemporalEntity
 from mex.models import BaseRawData


@@ -108,11 +108,11 @@ def get_partners(self) -> Sequence[str | None]:
 if partner
 ]

-def get_start_year(self) -> Timestamp | None:
+def get_start_year(self) -> TemporalEntity | None:
 """Return start year from extractor."""
 return None

-def get_end_year(self) -> Timestamp | None:
+def get_end_year(self) -> TemporalEntity | None:
 """Return end year from extractor."""
 return None

42 changes: 21 additions & 21 deletions mex/ff_projects/extract.py
@@ -12,7 +12,11 @@
 from mex.common.ldap.transform import analyse_person_string
 from mex.common.logging import watch
 from mex.common.models import ExtractedPrimarySource
-from mex.common.types import MergedOrganizationIdentifier, Timestamp, TimestampPrecision
+from mex.common.types import (
+MergedOrganizationIdentifier,
+TemporalEntity,
+TemporalEntityPrecision,
+)
 from mex.common.wikidata.extract import search_organization_by_label
 from mex.common.wikidata.models.organization import WikidataOrganization
 from mex.ff_projects.models.source import FFProjectsSource
@@ -40,21 +44,21 @@ def extract_ff_projects_sources() -> Generator[FFProjectsSource, None, None]:
 yield source


-def get_timestamp_from_cell(cell_value: Any) -> Timestamp | None:
-"""Try to extract a timestamp from a cell.
+def get_temporal_entity_from_cell(cell_value: Any) -> TemporalEntity | None:
+"""Try to extract a temporal_entity from a cell.

 Args:
 cell_value: Value of a cell, could be int, string or datetime

 Returns:
-Timestamp or None
+TemporalEntity or None
 """
 if isinstance(cell_value, datetime):
-timestamp = Timestamp(cell_value)
-timestamp.precision = (
-TimestampPrecision.SECOND
-) # keeps Timestamp precision in Seconds as standard.
-return timestamp
+temporal_entity = TemporalEntity(cell_value)
+temporal_entity.precision = (
+TemporalEntityPrecision.SECOND
+) # keeps TemporalEntity precision in Seconds as standard.
+return temporal_entity
 return None


@@ -103,11 +107,12 @@ def extract_ff_projects_source(row: "pd.Series[Any]") -> FFProjectsSource | None
 rki_az = get_string_from_cell(row.get("RKI-AZ"))
 laufzeit_von_cell = row.get("Laufzeit:\nvon ")
 laufzeit_bis_cell = row.get("bis")
-laufzeit_cells = get_optional_string_from_cell(
-laufzeit_von_cell
-), get_optional_string_from_cell(laufzeit_bis_cell)
-laufzeit_von = get_timestamp_from_cell(laufzeit_von_cell)
-laufzeit_bis = get_timestamp_from_cell(laufzeit_bis_cell)
+laufzeit_cells = (
+get_optional_string_from_cell(laufzeit_von_cell),
+get_optional_string_from_cell(laufzeit_bis_cell),
+)
+laufzeit_von = get_temporal_entity_from_cell(laufzeit_von_cell)
+laufzeit_bis = get_temporal_entity_from_cell(laufzeit_bis_cell)
 zuwendungs_oder_auftraggeber = str(row.get("Zuwendungs-/ Auftraggeber"))
 lfd_nr = get_string_from_cell(row.get("lfd. Nr."))
 projektleiter = get_string_from_cell(row.get("Projektleiter"))
@@ -205,15 +210,10 @@ def extract_ff_projects_organizations(
 Dict with organization label and WikidataOrganization
 """
 return {
-source.zuwendungs_oder_auftraggeber: orgs[0]
+source.zuwendungs_oder_auftraggeber: org
 for source in ff_projects_sources
 if source.zuwendungs_oder_auftraggeber
-and (
-orgs := list(
-search_organization_by_label(source.zuwendungs_oder_auftraggeber)
-)
-)
-and len(orgs) == 1
+and (org := search_organization_by_label(source.zuwendungs_oder_auftraggeber))
 }


10 changes: 5 additions & 5 deletions mex/ff_projects/models/source.py
@@ -1,6 +1,6 @@
 from collections.abc import Sequence

-from mex.common.types import Timestamp
+from mex.common.types import TemporalEntity
 from mex.models import BaseRawData


@@ -12,8 +12,8 @@ class FFProjectsSource(BaseRawData):
 thema_des_projekts: str
 rki_az: str
 laufzeit_cells: tuple[str | None, str | None]
-laufzeit_bis: Timestamp | None = None
-laufzeit_von: Timestamp | None = None
+laufzeit_bis: TemporalEntity | None = None
+laufzeit_von: TemporalEntity | None = None
 projektleiter: str
 rki_oe: str | None = None
 zuwendungs_oder_auftraggeber: str
@@ -23,11 +23,11 @@ def get_partners(self) -> Sequence[str | None]:
 """Return partners from extractor."""
 return []

-def get_start_year(self) -> Timestamp | None:
+def get_start_year(self) -> TemporalEntity | None:
 """Return start year from extractor."""
 return self.laufzeit_von

-def get_end_year(self) -> Timestamp | None:
+def get_end_year(self) -> TemporalEntity | None:
 """Return end year from extractor."""
 return self.laufzeit_bis
