-
Notifications
You must be signed in to change notification settings - Fork 0
Data models and standards
Name | Description | Activities |
---|---|---|
Abstract records of amounts | Each individual currency amount is a record. | This is the model currently targeted by the PSAM and OpenUp partnership |
Open Contracting Data Standard |
Each record is a row. Fields that have multiple values associated with that amount must contain all the values on one cell, not on subsequent rows. A record contains exactly one buyer name, up to one supplier name, and up to one order amount in Rands (ZAR).
This format is intended to be the least amount of cleaning needed to be able to search, filter and sort the data, while handling the fact that the source data contains merged cells where there are multiple values in the same field corresponding to a single currency amount.
This format does not attempt to normalise different ways of writing the same supplier or director name, or even well-defined fields like CIPC or CSD numbers. We can create a separate repository with the data cleaned to such an extent.
This format does not attempt to represent multiple values, like multiple directors as distinct values associated with a single amount. Another format can do that.
Extracted files should represent the scope of the file.
For files from the OCPO data, use the following structure to name your file:
- National Departments e.g.
National Departments
-
Department name e.g.
National Departments/The Presidency
- Public entity or sub-department without abbreviation e.g.
National Departments/The Presidency/Department of Planning, Monitoring and Evaluation
- Filename e.g. ``National Departments/The Presidency/Department of Planning, Monitoring and Evaluation/PLANNING MONITORING and EVALUATION - Covid -19 Expenditure.csv'
- Public entity or sub-department without abbreviation e.g.
-
Department name e.g.
- Provincial Government e.g.
Provincial Government
- Province e.g.
Provincial Government/Gauteng/
- All departments e.g.
Provincial Government/Gauteng/Departments/GAUTENG PROVINCIAL GOVERNMENT.csv
- Specific department e.g.
Provincial Government/Gauteng/Human Settlements/GAUTENG HUMAN SETTLEMENTS Procurement for COVID-19 Report.csv
- All departments e.g.
- Province e.g.
Pass 1 files should have the suffix - pass 1
.
Pass 2 files should have the suffix - pass 2
.
Final approved files should have no suffix in the name.
Multiple values should ideally be placed on different lines inside the cell. In LibreOffice Calc, a new line can be created using Control+Enter while editing the cell.
This happens e.g. when multiple director names are given for one order with one company; or when multiple units are part of the same order, e.g. 1l and 500ml bottles.
If the data was not split onto multiple lines in the source document, don't split it when extracting from PDF.
All data should be normalised to the following column names. If in doubt, normalise what you can, and ask someone for help.
Standard column name | Common variations in source data | Example values | Description |
---|---|---|---|
buyer_name |
Usually a PDF is about one buyer department or entity. Some provinces have one file with many departments - only then is there already such a column. | Eastern Cape Department of Health | String. Buyer Name details |
supplier_name |
Supplier Name |
||
order_amount_zar |
Amount , Order Price , TOTAL EXP AND COMMITMENTS FOR THE DEPARTMENT
|
Decimal. Amount ZAR. Use this for the total of this record - paid or not - what did the organ of state commit to pay for these items? | |
invoice_amount_zar |
Decimal. Amount ZAR. JD is not sure why this was added and how this is supposed to differ from order_amount_zar right now - the thinking was possible for when it's clear that a record corresponds to a single invoice, while one order might have multiple invoices - be careful using this until its definition is decided. |
||
payment_amount_zar |
TOTAL PRICE (ATTACHED BAS DISBURSEMENT PER PAYEE REPORT) |
1234.56 , 0 , sometimes blank |
Decimal. Amount ZAR. Use this when it is clear this is an amount that has actually been paid to the supplier's bank account. |
cost_per_unit_zar |
UNIT PRICE |
Decimal. Amount ZAR | |
items_description |
String | ||
items_quantity |
String | ||
items_unit |
String | ||
director_names |
Director Name |
String | |
director_surnames |
Director Surname |
String | |
director_names_and_surnames |
String | ||
company_registration_number |
Company Registration |
String | |
central_supplier_database_number |
CSD Supplier Number |
String | |
supplier_numbers_other |
LOGIS & MAAA NUMBER OF THE SERVICE PROVIDER |
MAAA0009035 |
|
implementation_location_province |
Province but not Company Location
|
String. Add this for provincial departments as the province name, e.g. Free State if there is no such column already. |
|
implementation_location_district_municipality |
District |
String | |
implementation_location_local_municipality |
String | ||
implementation_location_facility |
Hospital |
String | |
implementation_location_other |
Locality |
String | |
procurement_method |
Procurement Strategy Sourcing Strategy
|
String | |
state_employee |
String | ||
award_date |
Date letter issued |
Date | |
invoice_date |
Order date |
Date | |
invoice_receipt_date |
Date | ||
invoice_number |
String | ||
order_number |
Purchase order , Contract/Quotation
|
RT24-2017 , UH139010
|
String |
payment_number |
String | ||
payment_period |
String | ||
bbbee_status |
String | ||
reporting_period |
String | ||
implementation_status |
String |
Generally write names in full, and not just the acronym, e.g. Department of Health
and not DOH
. Use commas, hyphens and capitalisation the way they do in their logo.
Combine the province name, the test "Department of", and the department name, unless it reads weirdly. e.g.
Free State Department of Health
or Eastern Cape Department of Social Development
but Free State Provincial Treasury
and Free State Office of the Premier
and not Free State Department of Provincial Treasury
and not Free State Department of the Office of the Premier
.
Take a look at the department's logo if you're unsure - that would even take precedence over the convention described here.
The goal of this convention is that the name should stand on its own - you shouldn't rely on a province field to know who the record is about.
Combine "Department of" and the department name, unless it reads weirdly. e.g.
Department of Higher Education
but not Department of The Presidency
or Department of Parliament
- instead use The Presidency
and Parliament
National, Provincial and Municipal:
Just use their name in full, but if they're commonly known by their acronym, include their acronym in parentheses at the end of their name. If in doubt, use what's on their logo.
Write their name in full with the word "Municipality" at the end. Do not specify Metro, District or Local in the name. This is because they seem to use just name and "Municipality" for official references in gazettes.
Be sure to also set implementation_location_province
for all municipalities, implementation_location_district
for locals and districts, and implementation_location_local_municipality
for local municipalities.
Format the ZAR columns - at least the ones named as above that will get imported as decimal numbers - according to the following rules
- Only decimal numbers or completely blank cells
- No thousand separator (usually space or comma)
- Period (
.
) as decimal separator - Remove any whitespace
- Remove any currency e.g.
R
- Remove any hypens(
-
) or other ways of indicating "Not available" liken/a
If the column name or something else suggests that the column is in thousands, do not just multiply by a thousand - at that point we need to introduce a factor column or something that would accompany the number, so that the number can be stored with the precision it has (thousands of rands) and its factor, and presented as n thousand rands or something like that.
Format date columns - at least the ones named according to the above table - as ISO 8601 dates, e.g. 2020-05-23
.
Do not add a day component like 01
or 31
to partial dates that include only a year and month - rather leave them as they are, or format them as a partial ISO 8601 date, e.g. 2020-05
for May 2020.