- Python 3.11
- Poetry 1.8.3
- Docker 26.0.0
Install the project dependencies, the Poe the Poet Poetry plugin, and the pre-commit hooks:

```bash
poetry install
poetry self add 'poethepoet[poetry_plugin]'
pre-commit install
```
We run all the scripts using Poe the Poet. Other than installing it as a Poetry plugin (done above), there is nothing else to set up.
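If you want to double-check the setup, the following optional commands list the plugins registered with Poetry and the Poe tasks available in this project (exact output depends on your Poetry and poethepoet versions):

```bash
# Optional sanity checks: confirm the Poe plugin is registered and list the available tasks.
poetry self show plugins
poetry poe --help
```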
After you have installed all the dependencies, you have to fill in a `.env` file.
First, copy our example:

```bash
cp .env.example .env
```
Now, let's understand how to fill it.
You must download the Selenium Chrome driver to run the data collection pipeline. To proceed, use the links below:
- https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location/
- https://googlechromelabs.github.io/chrome-for-testing/#stable
Warning

For macOS users: after downloading the driver, run the following command to remove the quarantine attribute so the driver is allowed to execute:

```bash
xattr -d com.apple.quarantine /path/to/your/driver/chromedriver
```
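Optionally, confirm the driver runs and matches your installed Chrome version (adjust the path to wherever you placed the binary):

```bash
# Print the driver version to verify it is executable and compatible with your Chrome install.
/path/to/your/driver/chromedriver --version
```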
The last step is to add the path to the downloaded driver to your `.env` file:

```env
SELENIUM_BROWSER_DRIVER_PATH = "str"
```
For crawling LinkedIn, you have to fill in your username and password:

```env
LINKEDIN_USERNAME = "str"
LINKEDIN_PASSWORD = "str"
```
For this to work, you also have to:
- disable 2FA
- disable suspicious activity
We also recommend that you:
- create a dummy profile for crawling
- crawl only your data
You also have to configure the standard `OPENAI_API_KEY`.
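Putting it all together, the relevant part of a filled-in `.env` looks roughly like the sketch below; every value is a placeholder, so substitute your own driver path, LinkedIn credentials, and OpenAI key:

```env
# Illustrative placeholder values only -- replace with your own.
SELENIUM_BROWSER_DRIVER_PATH="/path/to/your/driver/chromedriver"
LINKEDIN_USERNAME="your_linkedin_email"
LINKEDIN_PASSWORD="your_linkedin_password"
OPENAI_API_KEY="sk-..."
```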
Important
Find more configuration options in the settings.py file.
Warning
You need Docker installed.
Start the local infrastructure:

```bash
poetry poe local-infrastructure-up
```
Stop the local infrastructure:

```bash
poetry poe local-infrastructure-down
```
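To confirm everything came up, you can list the running containers and the ports they expose (container names will vary with your setup):

```bash
# Show the running containers, their status, and published ports.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```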
Warning

When running on macOS, before starting the server, export the following environment variable:

```bash
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```

Otherwise, the connection between the local server and the pipeline will break. 🔗 More details in this issue.
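If you don't want to re-export the variable in every new terminal, you can append it to your shell profile (zsh shown below; adapt the file to your shell):

```bash
# Persist the fork-safety override for future macOS shells (zsh example).
echo 'export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES' >> ~/.zshrc
```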
ZenML:

- Web UI: localhost:8237
- Default credentials:
  - username: default
  - password: (leave empty)
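As an optional check, the ZenML CLI can report which server your local client is connected to (exact output depends on your ZenML version):

```bash
# Show the ZenML client/server configuration currently in use.
zenml status
```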
Qdrant:

- REST API: localhost:6333
- Web UI: localhost:6333/dashboard
- gRPC API: localhost:6334
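To verify Qdrant is reachable, you can query its REST API; an empty collections list is expected on a fresh instance:

```bash
# Qdrant health check: list the existing collections.
curl http://localhost:6333/collections
```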
MongoDB:

- database URI: mongodb://decodingml:<password>@localhost:27017 (use the password configured for your local MongoDB instance)
- database name: twin
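If you have mongosh installed, you can ping the database as an optional check (substitute the real password from your local configuration):

```bash
# MongoDB health check; requires mongosh on the host machine.
mongosh "mongodb://decodingml:<password>@localhost:27017" --eval "db.runCommand({ ping: 1 })"
```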
All the pipelines will be orchestrated behind the scenes by ZenML.
To see the pipelines running and their results & metadata:
- go to your ZenML dashboard
- go to the Pipelines section
- click on a specific pipeline (e.g., `feature_engineering`)
- click on a specific run (e.g., `feature_engineering_run_2024_06_20_18_40_24`)
- click on a specific step or artifact to find more details about the run
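If you prefer the terminal, recent ZenML versions also expose the registered pipelines through the CLI (the exact commands and output vary across versions):

```bash
# List the pipelines registered with the ZenML server.
zenml pipeline list
```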
Export ZenML artifacts to JSON:

```bash
poetry poe run-export-artifact-to-json-pipeline
```
Run the data collection ETL:

```bash
poetry poe run-digital-data-etl
```
Important

To add additional links to collect, go to `configs_digital_data_etl_[your_name].yaml` and add them to the `links` field (see the illustrative sketch below). Also, you can create a completely new file and specify it at run time, like this:

```bash
python -m llm_engineering.interfaces.orchestrator.run --run-etl --etl-config-filename configs_digital_data_etl_[your_name].yaml
```
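For illustration only, such a config might look like the sketch below. Only the `links` field is confirmed by the text above; the other field names are assumptions, so mirror the structure of an existing config file rather than this sketch:

```yaml
# configs_digital_data_etl_[your_name].yaml -- illustrative sketch, not the exact schema
parameters:
  user_full_name: "Your Name"   # assumed field name; check an existing config
  links:                        # the field where you add the URLs to crawl
    - https://medium.com/@your-handle/an-article-to-collect
    - https://substack.com/@your-handle/another-post-to-collect
```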
Run the feature engineering pipeline:

```bash
poetry poe run-feature-engineering-pipeline
```
Run the dataset generation pipeline:

```bash
poetry poe run-generate-instruct-datasets-pipeline
```
Run all of the above:

```bash
poetry poe run-preprocessing-pipeline
```
Run the training pipeline:

```bash
poetry poe run-training-pipeline
```
Check and fix your linting issues:

```bash
poetry poe lint-check
poetry poe lint-fix
```
Check and fix your formatting issues:

```bash
poetry poe format-check
poetry poe format-fix
```