- Python 3.11
- Poetry 1.8.3
- Docker 26.0.0
Install the project dependencies, the Poe the Poet Poetry plugin, and the pre-commit hooks:

```bash
poetry install
poetry self add 'poethepoet[poetry_plugin]'
pre-commit install
```
We run all the scripts using Poe the Poet. Other than installing it as a Poetry plugin (done above), there is nothing else to set up.
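If you want to double-check the setup, the following optional commands list the plugins registered with Poetry and the Poe tasks available in this project (exact output depends on your Poetry and poethepoet versions):

```bash
# Optional sanity checks: confirm the Poe plugin is registered and list the available tasks.
poetry self show plugins
poetry poe --help
```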
After you have installed all the dependencies, you have to fill in a `.env` file.
First, copy our example:

```bash
cp .env.example .env
```
Now, let's understand how to fill it.
You must download the Selenium Chrome driver to run the data collection pipeline. To proceed, use the links below:
- https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location/
- https://googlechromelabs.github.io/chrome-for-testing/#stable
Warning

For macOS users: after downloading the driver, run the following command to remove the quarantine attribute so the driver is allowed to execute:

```bash
xattr -d com.apple.quarantine /path/to/your/driver/chromedriver
```
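Optionally, confirm the driver runs and matches your installed Chrome version (adjust the path to wherever you placed the binary):

```bash
# Print the driver version to verify it is executable and compatible with your Chrome install.
/path/to/your/driver/chromedriver --version
```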
The last step is to add the path to the downloaded driver to your `.env` file:

```env
SELENIUM_BROWSER_DRIVER_PATH = "str"
```
For crawling LinkedIn, you have to fill in your username and password:

```env
LINKEDIN_USERNAME = "str"
LINKEDIN_PASSWORD = "str"
```
For this to work, you also have to:
- disable 2FA
- disable suspicious activity
We also recommend that you:
- create a dummy profile for crawling
- crawl only your data
You also have to configure the standard `OPENAI_API_KEY`.
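Putting it all together, the relevant part of a filled-in `.env` looks roughly like the sketch below; every value is a placeholder, so substitute your own driver path, LinkedIn credentials, and OpenAI key:

```env
# Illustrative placeholder values only -- replace with your own.
SELENIUM_BROWSER_DRIVER_PATH="/path/to/your/driver/chromedriver"
LINKEDIN_USERNAME="your_linkedin_email"
LINKEDIN_PASSWORD="your_linkedin_password"
OPENAI_API_KEY="sk-..."
```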
Important
Find more configuration options in the settings.py file.
Warning
You need Docker installed.
Start the local infrastructure:

```bash
poetry poe local-infrastructure-up
```
Stop the local infrastructure:

```bash
poetry poe local-infrastructure-down
```
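To confirm everything came up, you can list the running containers and the ports they expose (container names will vary with your setup):

```bash
# Show the running containers, their status, and published ports.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```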
Warning

When running on macOS, before starting the server, export the following environment variable:

```bash
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```

Otherwise, the connection between the local server and the pipeline will break. 🔗 More details in this issue.
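If you don't want to re-export the variable in every new terminal, you can append it to your shell profile (zsh shown below; adapt the file to your shell):

```bash
# Persist the fork-safety override for future macOS shells (zsh example).
echo 'export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES' >> ~/.zshrc
```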
ZenML:

- Web UI: localhost:8237
- Default credentials:
  - username: default
  - password: (leave empty)
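As an optional check, the ZenML CLI can report which server your local client is connected to (exact output depends on your ZenML version):

```bash
# Show the ZenML client/server configuration currently in use.
zenml status
```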
Qdrant:

- REST API: localhost:6333
- Web UI: localhost:6333/dashboard
- gRPC API: localhost:6334
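To verify Qdrant is reachable, you can query its REST API; an empty collections list is expected on a fresh instance:

```bash
# Qdrant health check: list the existing collections.
curl http://localhost:6333/collections
```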
MongoDB:

- database URI: mongodb://decodingml:<password>@localhost:27017 (use the password configured for your local MongoDB instance)
- database name: twin
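If you have mongosh installed, you can ping the database as an optional check (substitute the real password from your local configuration):

```bash
# MongoDB health check; requires mongosh on the host machine.
mongosh "mongodb://decodingml:<password>@localhost:27017" --eval "db.runCommand({ ping: 1 })"
```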
All the pipelines will be orchestrated behind the scenes by ZenML.
To see the pipelines running and their results & metadata:
- go to your ZenML dashboard
- go to the Pipelines section
- click on a specific pipeline (e.g., `feature_engineering`)
- click on a specific run (e.g., `feature_engineering_run_2024_06_20_18_40_24`)
- click on a specific step or artifact to find more details about the run
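If you prefer the terminal, recent ZenML versions also expose the registered pipelines through the CLI (the exact commands and output vary across versions):

```bash
# List the pipelines registered with the ZenML server.
zenml pipeline list
```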
Export ZenML artifacts to JSON:

```bash
poetry poe run-export-artifact-to-json-pipeline
```
Run the data collection ETL:

```bash
poetry poe run-digital-data-etl
```
Important

To add additional links to collect, go to `configs_digital_data_etl_[your_name].yaml` and add them to the `links` field (see the illustrative sketch below). Also, you can create a completely new file and specify it at run time, like this:

```bash
python -m llm_engineering.interfaces.orchestrator.run --run-etl --etl-config-filename configs_digital_data_etl_[your_name].yaml
```
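For illustration only, such a config might look like the sketch below. Only the `links` field is confirmed by the text above; the other field names are assumptions, so mirror the structure of an existing config file rather than this sketch:

```yaml
# configs_digital_data_etl_[your_name].yaml -- illustrative sketch, not the exact schema
parameters:
  user_full_name: "Your Name"   # assumed field name; check an existing config
  links:                        # the field where you add the URLs to crawl
    - https://medium.com/@your-handle/an-article-to-collect
    - https://substack.com/@your-handle/another-post-to-collect
```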
Run the feature engineering pipeline:

```bash
poetry poe run-feature-engineering-pipeline
```
Run the dataset generation pipeline:

```bash
poetry poe run-generate-instruct-datasets-pipeline
```
Run all of the above:

```bash
poetry poe run-preprocessing-pipeline
```
Run the training pipeline:

```bash
poetry poe run-training-pipeline
```
Check and fix your linting issues:

```bash
poetry poe lint-check
poetry poe lint-fix
```
Check and fix your formatting issues:

```bash
poetry poe format-check
poetry poe format-fix
```