Run CI on Modal, upgrade Bitsandbytes #641

Open · wants to merge 54 commits into base: master

Changes from all commits · 54 commits
fc52696
Run CI on Modal, upgrade Bitsandbytes
mryab Feb 10, 2025
58f3d44
Add docs configuration
mryab Feb 10, 2025
6d36cd1
Fix formatting
mryab Feb 10, 2025
ab714bd
Configure concurrency for Modal tests
mryab Feb 10, 2025
c840ab9
Sort imports
mryab Feb 10, 2025
f717bf6
Set up the timeout
mryab Feb 10, 2025
0dca5a2
Set up concurrency for other actions as well
mryab Feb 10, 2025
11feccf
Remove concurrency limits
mryab Feb 10, 2025
cbf4450
Add concurrency, update bitsandbytes in dependencies
mryab Feb 10, 2025
4f303bd
Add cache, bump CI versions
mryab Feb 10, 2025
6a5ec5e
Skip test_allreduce_protocol for the time being
mryab Feb 10, 2025
ba3e386
Reduce the number of CPUs
mryab Feb 10, 2025
1fb8dec
Decrease the limits in test_dht_connection_successful
mryab Feb 10, 2025
67e040f
Restore the limits in test_dht_connection_successful
mryab Feb 10, 2025
c0af379
Clear the blacklist before attempting store
mryab Feb 10, 2025
6116570
Increase the wait in test_load_state_from_peers
mryab Feb 10, 2025
801bb4f
Parametrize tests by Python version, upload Codecov coverage
mryab Feb 11, 2025
fd69b64
Check out and build a specific version of bitsandbytes
mryab Feb 11, 2025
22739f5
Increase the timeouts to account for image builds
mryab Feb 11, 2025
635879f
Introduce timeouts
mryab Feb 22, 2025
8fbd9dd
Increase the number of CPUs for tests
mryab Feb 22, 2025
d70b4b9
Make tests more robust
mryab Feb 23, 2025
4254468
Make tests more robust
mryab Feb 23, 2025
1753bae
Reformat the code
mryab Feb 23, 2025
4753fef
Mark test_client_disconnect as flaky
mryab Feb 23, 2025
9705318
Build and test p2pd separately
mryab Feb 23, 2025
ae5ed98
Install Go only for a specific image
mryab Feb 23, 2025
11eb277
Don't use uv when building p2pd
mryab Feb 23, 2025
9d37fe9
Mark test_dhtnode_blacklist as flaky
mryab Feb 23, 2025
7abc9f0
Increase timeouts
mryab Feb 23, 2025
5b69835
Make test_averaging_trigger more robust
mryab Feb 23, 2025
9e37679
Download codecov with wget
mryab Feb 23, 2025
aa20215
Skip all training tests for the time being
mryab Feb 23, 2025
a03288e
Skip test_allgather for the time being
mryab Feb 23, 2025
a614a02
Mark test_performance_ema_threadsafe and test_remote_expert_worker_ru…
mryab Feb 23, 2025
2cfc94a
Reduce timeouts, mark test_background_server_identity_path as flaky
mryab Feb 23, 2025
df048db
Mention sponsorship by Prime Intellect
mryab Feb 23, 2025
e388e07
Fix missing import
mryab Feb 23, 2025
98e6a38
Mark flaky tests
mryab Feb 23, 2025
b317b29
Modify the codecov command
mryab Feb 23, 2025
66c9187
Pass extra environment variables to codecov
mryab Feb 23, 2025
93460aa
Remove --dist from codecov run
mryab Feb 23, 2025
75529a1
Pass GITHUB_EVENT_PULL_REQUEST_HEAD_SHA when running the test
mryab Feb 23, 2025
83b53bb
Mark test_fault_tolerance as flaky
mryab Feb 23, 2025
5984bad
Mark test_cli_run_server_identity_path as flaky
mryab Feb 23, 2025
2f67c52
Disable parallel execution for codecov management
mryab Feb 23, 2025
e8efb66
Increase codecov run timeout to 15 minutes
mryab Feb 23, 2025
f8ad2a8
Pass GITHUB_EVENT_PULL_REQUEST_HEAD_SHA to the workflow
mryab Feb 23, 2025
225439e
Pass additional secrets
mryab Feb 23, 2025
3695813
Mark one more test as flaky
mryab Feb 23, 2025
6bac780
Mark another test as flaky
mryab Feb 23, 2025
3228dfd
Pass codecov values explicitly
mryab Feb 23, 2025
0a9347d
Pass --no-use-pep517 to uv pip install
mryab Feb 23, 2025
87f0ece
Change uv pip to pip
mryab Feb 23, 2025
12 changes: 8 additions & 4 deletions .github/workflows/check-style.yml
@@ -5,20 +5,24 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
black:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: psf/black@stable
with:
options: "--check --diff"
version: "22.3.0"
isort:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v3
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: 3.11
- uses: isort/isort-action@master
@@ -28,7 +32,7 @@ jobs:
codespell:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@v1
with:
only_warn: 1
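The concurrency block added to each workflow above keys runs by PR number when one exists, falling back to the pushed ref, so a newer run cancels the stale one for the same PR or branch. A hypothetical resolution of the group key (the workflow name is assumed, since this diff does not show it):

```yaml
# Hypothetical group keys for the expression above, assuming the workflow is named "check-style":
#   pull_request event on PR #641  ->  check-style-641
#   push to master                 ->  check-style-refs/heads/master
# With cancel-in-progress: true, a newly queued run cancels an older run in the same group.
```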
6 changes: 5 additions & 1 deletion .github/workflows/push-docker-image.yml
@@ -8,13 +8,17 @@ on:
pull_request:
branches: [ master ]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Docker meta
id: meta
12 changes: 8 additions & 4 deletions .github/workflows/run-benchmarks.yml
@@ -5,19 +5,23 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_benchmarks:

runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.11-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
@@ -28,7 +32,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
112 changes: 112 additions & 0 deletions .github/workflows/run-tests-on-modal.yml
@@ -0,0 +1,112 @@
name: Modal tests

on:
push:
branches: [master]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
fail-fast: false
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
PYTHON_VERSION: ${{ matrix.python-version }}
timeout-minutes: 15
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Run tests
run: |
modal run modal_ci.py::run_tests

measure_coverage:
runs-on: ubuntu-latest
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
GITHUB_EVENT_NAME: ${{ github.event_name }}
GITHUB_EVENT_NUMBER: ${{ github.event.number }}
GITHUB_EVENT_PULL_REQUEST_HEAD_SHA: ${{ github.event.pull_request.head.sha }}
PYTHON_VERSION: "3.11"
timeout-minutes: 15
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Measure and upload coverage
run: |
modal run modal_ci.py::run_codecov

build_and_test_p2pd:
runs-on: ubuntu-latest
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
PYTHON_VERSION: "3.11"
timeout-minutes: 10
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Run p2pd tests
run: |
modal run modal_ci.py::build_and_test_p2pd
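The three jobs above all invoke entrypoints in modal_ci.py, a file added elsewhere in this PR and not shown on this page. A minimal sketch of what such an entrypoint could look like with the Modal 0.73 App/Image API; the image contents, resource limits, and pytest invocation below are illustrative guesses, not the PR's actual code:

```python
# modal_ci.py (hypothetical sketch, not the file from this PR)
import subprocess

import modal

app = modal.App("hivemind-ci")

# Container image with test dependencies; the real file presumably also installs hivemind itself.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("bitsandbytes==0.45.2")
    .pip_install_from_requirements("requirements-dev.txt")
)

@app.function(image=image, timeout=900, cpu=8)
def run_tests() -> None:
    # Runs inside the Modal container; a nonzero pytest exit code fails `modal run`,
    # which in turn fails the GitHub Actions job that invoked it.
    subprocess.run(["pytest", "tests", "-v"], check=True)
```

`modal run modal_ci.py::run_tests` then executes the function remotely, authenticating with the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET values passed through the workflow env.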
20 changes: 11 additions & 9 deletions .github/workflows/run-tests.yml
@@ -1,9 +1,11 @@
name: Tests

on:
push:
branches: [ master ]
pull_request:
# Tests in GHA only run manually; see run-tests-on-modal.yml for the same tests in CI
on: workflow_dispatch

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:
@@ -15,13 +17,13 @@ jobs:
fail-fast: false
timeout-minutes: 15
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-${{ matrix.python-version }}-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
Expand All @@ -32,7 +34,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
@@ -94,7 +96,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install -e . --no-use-pep517
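With the trigger narrowed to workflow_dispatch, this legacy GHA suite only runs when started by hand, for instance from the Actions tab or with the GitHub CLI (assuming it is installed and authenticated):

```shell
gh workflow run run-tests.yml --ref master
```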
1 change: 1 addition & 0 deletions .readthedocs.yml
@@ -2,6 +2,7 @@ version: 2

sphinx:
fail_on_warning: true
configuration: docs/conf.py

python:
install:
4 changes: 4 additions & 0 deletions README.md
@@ -118,6 +118,10 @@ the [contributing guidelines](https://github.com/learning-at-home/hivemind/blob/
more about other ways to contribute, read
our [guide](https://learning-at-home.readthedocs.io/en/latest/user/contributing.html).

## Collaborators and Sponsorship

* [Prime Intellect](https://www.primeintellect.ai/): sponsoring compute resources on [Modal](https://modal.com/) for CI

## Citation

If you found hivemind or its underlying algorithms useful for your research, please cite the following source:
6 changes: 3 additions & 3 deletions hivemind/compression/base.py
@@ -107,14 +107,14 @@ def extract(self, serialized_tensor: runtime_pb2.Tensor) -> torch.Tensor:
if serialized_tensor.dtype == "bfloat16":
numel = shape.numel()
if numel > 0 and len(serialized_tensor.buffer) // numel == 4:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.float32)
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.float32)
tensor = torch.as_tensor(array, dtype=torch.bfloat16)
else:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.int16)
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.int16)
# reinterpret_cast from an arbitrary 2-byte type supported by numpy
tensor = torch.as_tensor(array).view(torch.bfloat16)
else:
array = np.frombuffer(serialized_tensor.buffer, dtype=np.dtype(serialized_tensor.dtype))
array = np.frombuffer(bytearray(serialized_tensor.buffer), dtype=np.dtype(serialized_tensor.dtype))
tensor = torch.as_tensor(array)
return tensor.reshape(shape)
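For context on the bytearray wrapping above: protobuf bytes fields are immutable, np.frombuffer over them returns a read-only array, and recent PyTorch warns when torch.as_tensor receives non-writable memory. Copying into a bytearray first sidesteps that. A small sketch (values are illustrative):

```python
import numpy as np
import torch

buf = np.arange(4, dtype=np.float32).tobytes()  # immutable bytes, like serialized_tensor.buffer

readonly = np.frombuffer(buf, dtype=np.float32)
assert not readonly.flags.writeable
# torch.as_tensor(readonly) would warn "The given NumPy array is not writable" on recent PyTorch

writable = np.frombuffer(bytearray(buf), dtype=np.float32)  # bytearray copy -> writable array
tensor = torch.as_tensor(writable)  # no warning; the backing memory can be written safely
```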

12 changes: 9 additions & 3 deletions hivemind/compression/quantization.py
@@ -140,8 +140,14 @@ def quantize(
except ImportError:
raise ImportError(BNB_MISSING_MESSAGE)

quantized, (absmax, codebook, *extra_params) = quantize_blockwise(tensor, blocksize=4096, nested=False)
assert tuple(extra_params) == self.EXTRA_PARAMS # blocksize, nested, dtype, offset, state2
assert tensor.dtype == torch.float32
quantized, quant_state = quantize_blockwise(tensor, blocksize=4096, nested=False)
absmax, codebook = quant_state.absmax, quant_state.code
assert quant_state.blocksize == 4096
assert quant_state.nested is False
assert quant_state.dtype == self.EXTRA_PARAMS[2]
assert quant_state.offset == self.EXTRA_PARAMS[3]
assert quant_state.state2 == self.EXTRA_PARAMS[4]
return quantized.numpy(), (absmax.numpy(), codebook.numpy())

def compress(self, tensor: torch.Tensor, info: CompressionInfo, allow_inplace: bool = False) -> runtime_pb2.Tensor:
@@ -187,5 +193,5 @@ def extract(self, serialized_tensor: runtime_pb2.Tensor) -> torch.Tensor:
absmax = torch.as_tensor(absmax)
codebook = torch.as_tensor(codebook)
quantized = torch.as_tensor(quantized).reshape(tuple(serialized_tensor.size))
result = dequantize_blockwise(quantized, (absmax, codebook, *self.EXTRA_PARAMS))
result = dequantize_blockwise(quantized, absmax=absmax, code=codebook, blocksize=4096, nested=False)
return result.to(getattr(torch, serialized_tensor.dtype)).requires_grad_(serialized_tensor.requires_grad)
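The rewrite above tracks the bitsandbytes API change: in the 0.45 series pinned by this PR, quantize_blockwise returns a QuantState object rather than a tuple, and dequantize_blockwise accepts absmax and code as keyword arguments. A hedged round-trip sketch against bitsandbytes 0.45.2:

```python
import torch
from bitsandbytes.functional import dequantize_blockwise, quantize_blockwise

x = torch.randn(1024, dtype=torch.float32)
quantized, quant_state = quantize_blockwise(x, blocksize=4096, nested=False)

# The former tuple entries are now attributes on QuantState:
absmax, codebook = quant_state.absmax, quant_state.code

restored = dequantize_blockwise(quantized, absmax=absmax, code=codebook, blocksize=4096, nested=False)
print((x - restored).abs().max())  # small but nonzero: 8-bit blockwise quantization is lossy
```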
16 changes: 14 additions & 2 deletions hivemind/moe/client/moe.py
@@ -90,9 +90,11 @@
else:
input_for_gating = input

logger.debug("Computing expert scores")

# 1. compute scores and find most appropriate experts with beam search
grid_scores = self.proj(input_for_gating).split_with_sizes(self.beam_search.grid_size, dim=-1)

logger.debug("Finding best experts")

chosen_experts: List[List[RemoteExpert]] = self.beam_search.batch_find_best_experts(
[scores.detach().cpu().numpy() for scores in grid_scores], self.k_best
)
@@ -108,6 +110,7 @@
except P2PDaemonError as e:
logger.warning(f"Failed to get RemoteMixtureOfExperts.output_shape: {e}")

logger.debug(f"Calling experts {chosen_experts}")

expert_mask, *expert_outputs = _RemoteCallMany.apply(
DUMMY,
chosen_experts,
Expand All @@ -123,6 +126,7 @@
)
# ^-- multiple tensors of shape [batch_size, max_experts, ...output_shape]

logger.debug("Computing expert weights")

expert_logits = self.compute_expert_scores(grid_scores, chosen_experts)
masked_logits = torch.full((1,), float("-inf"), device=expert_logits.device, dtype=expert_logits.dtype)
expert_logits = torch.where(expert_mask, expert_logits, masked_logits)
@@ -375,19 +379,26 @@
timeout_total = float("inf") if timeout_total is None else timeout_total
timeout_after_k_min = float("inf") if timeout_after_k_min is None else timeout_after_k_min
num_successful_tasks = [0 for _ in range(num_samples)]
pending_samples = num_samples # samples for which we have less than k_min results

samples_with_tasks = {sample_idx for sample_idx, _ in task_to_indices.values()}
pending_samples = len(samples_with_tasks) # samples for which we have less than k_min results
assert pending_samples <= num_samples

finished_indices, finished_outputs = [], []
t_finish = time.perf_counter() + timeout_total
pending_tasks = set(task_to_indices.keys())
finished_tasks = Queue()

logger.debug(f"Pending tasks: {list(pending_tasks)}")
try:
# the algorithm below is essentially futures.as_completed, but for grpc.Future
for task in pending_tasks:
task.add_done_callback(finished_tasks.put)

for _ in range(len(task_to_indices)):
timeout = max(0.0, t_finish - time.perf_counter()) if t_finish != float("inf") else None
logger.debug(f"Finished tasks: {list(finished_tasks.queue)}")
logger.debug(f"Pending tasks: {list(pending_tasks)}")
task = finished_tasks.get(timeout=timeout)
pending_tasks.discard(task)

@@ -399,6 +410,7 @@
# count how many successes we have for each input sample
sample_index = task_to_indices[task][0]
num_successful_tasks[sample_index] += 1
logger.debug(f"Num successful tasks: {num_successful_tasks}")

if num_successful_tasks[sample_index] == k_min:
pending_samples -= 1
if (
@@ -416,7 +428,7 @@

def _process_dispatched_task(task: Future, detect_anomalies: bool) -> Optional[Tuple[torch.Tensor]]:
if task.exception() or task.cancelled():
logger.warning(f"Task {task} failed: {type(task.exception())}")
logger.warning(f"Task {task} failed: {task.exception()}")

return None

outputs = task.result()
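The pending_samples change earlier in this hunk fixes an overcount: the old code assumed every sample had at least one dispatched task, so samples with no tasks could keep the early-exit condition from ever firing. An illustrative repro with toy values, not taken from the PR:

```python
# task_to_indices maps each dispatched task to (sample_idx, expert_idx)
task_to_indices = {"task_a": (0, 0), "task_b": (0, 1), "task_c": (2, 0)}
num_samples = 4  # samples 1 and 3 got no tasks, e.g. no reachable experts

# old: pending_samples = num_samples -> 4; it could never drop to 0, since only 2 samples have tasks
samples_with_tasks = {sample_idx for sample_idx, _ in task_to_indices.values()}
pending_samples = len(samples_with_tasks)  # 2: only samples that can actually reach k_min
assert pending_samples <= num_samples
```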
1 change: 1 addition & 0 deletions hivemind/moe/server/connection_handler.py
@@ -134,6 +134,7 @@ async def _process_inputs(
async def rpc_forward(self, request: runtime_pb2.ExpertRequest, context: P2PContext) -> runtime_pb2.ExpertResponse:
inputs = [deserialize_torch_tensor(tensor) for tensor in request.tensors]
expert = self.module_backends[request.uid]
logger.debug(f"Processing inputs for expert {request.uid}")
return runtime_pb2.ExpertResponse(
tensors=await self._process_inputs(inputs, expert.forward_pool, expert.outputs_schema)
)