[infra] Add support for Avro and Parquet (cont.) #1145
Conversation
The tests need to be prefixed with the word `test`.
python-package/tests/test_table.py (Lines 227-247)
Congratulations on the work. The only real fix needed was the formatting of the tests. If possible, add a snippet to the PR just showing how you ran the tests.
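A minimal sketch of what such a test snippet could look like, assuming pytest collection (which requires the `test` prefix) and a `Table.create` call that accepts a `source_format` argument; the fixture, sample-data paths, and exact keyword arguments below are illustrative and may differ from the package's real signature:

```python
import pytest
from basedosdados import Table


@pytest.fixture
def table():
    # Throwaway dataset/table used only by the test suite (illustrative IDs).
    return Table(dataset_id="pytest", table_id="pytest")


# pytest only collects functions whose names start with "test".
@pytest.mark.parametrize("source_format", ["csv", "avro", "parquet"])
def test_create_source_format(table, source_format):
    # Assumed call shape: create() receives the local file and its format.
    table.create(
        path=f"tests/sample_data/table/municipio.{source_format}",
        source_format=source_format,
        if_table_exists="replace",
        if_storage_data_exists="replace",
    )
    assert table.table_exists(mode="staging")
```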
* feat(infra): create version 1.6.2
* feat(infra): create version 1.6.2
* feat(infra): create version 1.6.2
* [infra] python-v1.6.2 (#1089)
* [infra] fix dataset_config.yaml folder path (#1067)
* feat(infra) merge master
* [infra] conform Metadata to new metadata changes (#1093)
* [dados-bot] br_ms_vacinacao_covid19 (2022-01-23) (#1086)
  Co-authored-by: terminal_name <github_email>
* [dados] br_bd_diretorios_brasil.etnia_indigena (#1087)
* Upload the etnia_indigena directory
* Update table_config.yaml
* Update table_config.yaml
* feat: conform Metadata's schema to new one
* fix: conform yaml generation to new schema
* fix: delete test_dataset folder
  Co-authored-by: Lucas Moreira <[email protected]>
  Co-authored-by: Gustavo Aires Tiago <[email protected]>
  Co-authored-by: Ricardo Dahis <[email protected]>
  Co-authored-by: Lucas Moreira <[email protected]>
  Co-authored-by: Gustavo Aires Tiago <[email protected]>
* feat(infra): 1.6.2a3 version
* feat(infra): 1.6.2a3 version
* fix(infra): edit partitions and update_locally
* feat(infra): update_columns new fields and accepts local files
* [infra] option to make dataset public (#1020)
* feat(infra): option to make dataset public
* feat(infra): fix None data
* fix(infra): roll back
* fix(infra): fix retry in storage upload
* fix(infra): add option to dataset data location
* feat(infra): make staging dataset not public
* feat(infra): make staging dataset not public
* fix(infra): change bd version in actions
* fix(infra): add toml to install in ci
* fix(infra): remove a forgotten print
* fix(infra): fix location
* fix(infra): fix dataset description
* feat(infra): bump-version
* feat(infra): temporal coverage as list in update_columns
* feat(infra): add new parameters to cli
* feat(infra): fix cli options
* [infra] change download functions to consume CKAN endpoints #1129 (#1130)
* [infra] add function to wrap bd_dataset_search endpoint
* Update download.py
* [infra] modify list_datasets function to consume CKAN endpoint
* [infra] fix list_dataset function to include limit and remove order_by
* [infra] change function list_dataset_tables to use CKAN endpoint
* [infra] apply PEP8 to list_dataset_tables and respective tests
* add get_dataset_description, get_table_description, get_table_columns
* [infra] fix dataset_config.yaml folder path (#1067)
* feat(infra) merge master
* fix files organization to match master
* remove download.py
* remove test_download
* Delete test_download.py
* remove test files
* remove test_download.py
* remove test_download.py
* remove test_download.py
* remove test_download.py
* add tests metadata
* remove test_download.py
* remove unused imports
* [infra] add _safe_fetch and get_table_size functions
  Co-authored-by: lucascr91 <[email protected]>
* fix(infra): add an empty list when not a partition
* [infra] Add support for Avro and Parquet (#1145)
* add support for Avro and Parquet uploads
* Adds test for source formats
* [infra] update tests for avro, parquet, and csv upload
  Co-authored-by: Gabriel Gazola Milan <[email protected]>
  Co-authored-by: Isadora Bugarin <[email protected]>
  Co-authored-by: lucascr91 <[email protected]>
* [infra] Feedback messages in upload methods [issue #1059] (#1085)
* Creating dataclass config
* Success messages - create and update (table.py) using loguru
* feat: improve log level control
* refa: move logger config to Base.__init__
* Improving log level control
* Adjusting log level control function in base.py
* Fixing repeated 'DELETE' messages every time Table is replaced.
* Importing 'dataclass' from 'dataclasses' to make config work.
* Fixing repeated 'UPDATE' messages inside other functions.
* Defining a new script message format.
* Defining standard log messages for 'dataset.py' functions
* Defining standard log messages for 'storage.py' functions
* Defining standard log messages for 'table.py' functions
* Defining standard log messages for 'metadata.py' functions
* Adds standard configuration to billing_project_id in download.py
* Configuring billing_project_id in download.py
* Configuring config_path in base.py
  Co-authored-by: Guilherme Salustiano <[email protected]>
  Co-authored-by: Isadora Bugarin <[email protected]>
* update toml
  Co-authored-by: Ricardo Dahis <[email protected]>
  Co-authored-by: Lucas Moreira <[email protected]>
  Co-authored-by: Gustavo Aires Tiago <[email protected]>
  Co-authored-by: lucascr91 <[email protected]>
  Co-authored-by: Isadora Bugarin <[email protected]>
  Co-authored-by: Gabriel Gazola Milan <[email protected]>
  Co-authored-by: Isadora Bugarin <[email protected]>
  Co-authored-by: Guilherme Salustiano <[email protected]>
  Co-authored-by: Isadora Bugarin <[email protected]>
This PR is a continuation of PR #1100.
The original PR's description is reproduced below.
Motivation
Modifications
* Update google-cloud-bigquery from "1.28.0" to "2.30.1"
* Add pandavro to handle the Pandas <-> Avro interface
* Add source_format to the Table.init call (see the sketch after this list)
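To make the last two bullets concrete, here is a minimal sketch of how a user-facing source_format string might be threaded through to a BigQuery load job, and where pandavro fits when preparing Avro sample data. The `SOURCE_FORMATS` mapping and `load_from_gcs` helper are hypothetical names for illustration, not the package's actual code:

```python
import pandas as pd
import pandavro
from google.cloud import bigquery

# Illustrative mapping from the user-facing string to BigQuery's load-job constants.
SOURCE_FORMATS = {
    "csv": bigquery.SourceFormat.CSV,
    "avro": bigquery.SourceFormat.AVRO,
    "parquet": bigquery.SourceFormat.PARQUET,
}


def load_from_gcs(client: bigquery.Client, uri: str, table_id: str, source_format: str = "csv"):
    """Load files already staged in GCS into a BigQuery table (hypothetical helper)."""
    job_config = bigquery.LoadJobConfig(source_format=SOURCE_FORMATS[source_format])
    if source_format == "csv":
        job_config.skip_leading_rows = 1  # assume CSVs carry a header row
    return client.load_table_from_uri(uri, table_id, job_config=job_config).result()


# pandavro covers the Pandas <-> Avro conversion, e.g. when building test fixtures.
df = pd.DataFrame({"id": [1, 2], "municipio": ["A", "B"]})
pandavro.to_avro("municipio.avro", df)           # DataFrame -> .avro file
df_back = pandavro.from_avro("municipio.avro")   # .avro file -> DataFrame
```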
TODO @isadorabugarin