Skip to content

Data models and standards

Tristyn Ferreiro edited this page Nov 7, 2020 · 30 revisions
Name Description Activities
Abstract records of amounts Each individual currency amount is a record. This is the model currently targeted by the PSAM and OpenUp partnership
Open Contracting Data Standard

Abstract records of amounts

Each record is a row. Fields that have multiple values associated with that amount must contain all the values on one cell, not on subsequent rows. A record contains exactly one buyer name, up to one supplier name, and up to one order amount in Rands (ZAR).

This format is intended to be the least amount of cleaning needed to be able to search, filter and sort the data, while handling the fact that the source data contains merged cells where there are multiple values in the same field corresponding to a single currency amount.

This format does not attempt to normalise different ways of writing the same supplier or director name, or even well-defined fields like CIPC or CSD numbers. We can create a separate repository with the data cleaned to such an extent.

This format does not attempt to represent multiple values, like multiple directors as distinct values associated with a single amount. Another format can do that.

File locations

Extracted files should represent the scope of the file.

For files from the OCPO data, use the following structure to name your file:

  • National Departments e.g. National Departments
    • Department name e.g. National Departments/The Presidency
      • Public entity or sub-department without abbreviation e.g. National Departments/The Presidency/Department of Planning, Monitoring and Evaluation
        • Filename e.g. ``National Departments/The Presidency/Department of Planning, Monitoring and Evaluation/PLANNING MONITORING and EVALUATION - Covid -19 Expenditure.csv'
  • Provincial Government e.g. Provincial Government
    • Province e.g. Provincial Government/Gauteng/
      • All departments e.g. Provincial Government/Gauteng/Departments/GAUTENG PROVINCIAL GOVERNMENT.csv
      • Specific department e.g. Provincial Government/Gauteng/Human Settlements/GAUTENG HUMAN SETTLEMENTS Procurement for COVID-19 Report.csv

Pass 1/2/final

Pass 1 files should have the suffix - pass 1.

Pass 2 files should have the suffix - pass 2.

Final approved files should have no suffix in the name.

Multiple values in a field

Multiple values should ideally be placed on different lines inside the cell. In LibreOffice Calc, a new line can be created using Control+Enter while editing the cell.

This happens e.g. when multiple director names are given for one order with one company; or when multiple units are part of the same order, e.g. 1l and 500ml bottles.

If the data was not split onto multiple lines in the source document, don't split it when extracting from PDF.

Column names

All data should be normalised to the following column names. If in doubt, normalise what you can, and ask someone for help.

Standard column name Common variations in source data Example values Description
buyer_name Usually a PDF is about one buyer department or entity. Some provinces have one file with many departments - only then is there already such a column. Eastern Cape Department of Health String. Buyer Name details
supplier_name Supplier Name
order_amount_zar Amount, Order Price, TOTAL EXP AND COMMITMENTS FOR THE DEPARTMENT Decimal. Amount ZAR. Use this for the total of this record - paid or not - what did the organ of state commit to pay for these items?
invoice_amount_zar Decimal. Amount ZAR. JD is not sure why this was added and how this is supposed to differ from order_amount_zar right now - the thinking was possible for when it's clear that a record corresponds to a single invoice, while one order might have multiple invoices - be careful using this until its definition is decided.
payment_amount_zar TOTAL PRICE (ATTACHED BAS DISBURSEMENT PER PAYEE REPORT) 1234.56, 0, sometimes blank Decimal. Amount ZAR. Use this when it is clear this is an amount that has actually been paid to the supplier's bank account.
cost_per_unit_zar UNIT PRICE Decimal. Amount ZAR
items_description String
items_quantity String
items_unit String
director_names Director Name String
director_surnames Director Surname String
director_names_and_surnames String
company_registration_number Company Registration String
central_supplier_database_number CSD Supplier Number String
supplier_numbers_other LOGIS & MAAA NUMBER OF THE SERVICE PROVIDER MAAA0009035
implementation_location_province Province but not Company Location String. Add this for provincial departments as the province name, e.g. Free State if there is no such column already.
implementation_location_district_municipality District String
implementation_location_local_municipality String
implementation_location_facility Hospital String
implementation_location_other Locality String
procurement_method Procurement Strategy Sourcing Strategy String
state_employee String
award_date Date letter issued Date
invoice_date Order date Date
invoice_receipt_date Date
invoice_number String
order_number Purchase order, Contract/Quotation RT24-2017, UH139010 String
payment_number String
payment_period String
bbbee_status String
reporting_period String
implementation_status String

Buyer Name

Generally write names in full, and not just the acronym, e.g. Department of Health and not DOH. Use commas, hyphens and capitalisation the way they do in their logo.

Provincial departments:

Combine the province name, the test "Department of", and the department name, unless it reads weirdly. e.g.

Free State Department of Health or Eastern Cape Department of Social Development but Free State Provincial Treasury and Free State Office of the Premier and not Free State Department of Provincial Treasury and not Free State Department of the Office of the Premier.

Take a look at the department's logo if you're unsure - that would even take precedence over the convention described here.

The goal of this convention is that the name should stand on its own - you shouldn't rely on a province field to know who the record is about.

National departments

Combine "Department of" and the department name, unless it reads weirdly. e.g.

Department of Higher Education but not Department of The Presidency or Department of Parliament - instead use The Presidency and Parliament

Public Entities

National, Provincial and Municipal:

Just use their name in full, but if they're commonly known by their acronym, include their acronym in parentheses at the end of their name. If in doubt, use what's on their logo.

Municipalities

Write their name in full with the word "Municipality" at the end. Do not specify Metro, District or Local in the name. This is because they seem to use just name and "Municipality" for official references in gazettes.

Be sure to also set implementation_location_province for all municipalities, implementation_location_district for locals and districts, and implementation_location_local_municipality for local municipalities.

Amount ZAR columns

Format the ZAR columns - at least the ones named as above that will get imported as decimal numbers - according to the following rules

  • Only decimal numbers or completely blank cells
  • No thousand separator (usually space or comma)
  • Period (.) as decimal separator
  • Remove any whitespace
  • Remove any currency e.g. R
  • Remove any hypens(-) or other ways of indicating "Not available" like n/a

If the column name or something else suggests that the column is in thousands, do not just multiply by a thousand - at that point we need to introduce a factor column or something that would accompany the number, so that the number can be stored with the precision it has (thousands of rands) and its factor, and presented as n thousand rands or something like that.

Date columns

Format date columns - at least the ones named according to the above table - as ISO 8601 dates, e.g. 2020-05-23.

Do not add a day component like 01 or 31 to partial dates that include only a year and month - rather leave them as they are, or format them as a partial ISO 8601 date, e.g. 2020-05 for May 2020.