Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: include dependency graph use cases #2684

Merged
merged 5 commits into from
Dec 28, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

Large diffs are not rendered by default.

396 changes: 396 additions & 0 deletions apps/docs/docs/tutorials/dependencies.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,396 @@
---
title: Map the Software Supply Chain
sidebar_position: 3
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Trace the dependencies in a software bill of materials (SBOM) for a given repository and assign weights or other metrics to each node. New to OSO? Check out our [Getting Started guide](../get-started/index.md) to set up your BigQuery or API access.

![Dependency Graph](dependency-graph.png)

## Getting Started

Before running any analysis, you'll need to set up your environment:

<Tabs>
<TabItem value="sql" label="SQL">

If you haven't already, subscribe to OSO public datasets in BigQuery by clicking the "Subscribe" button on our [Datasets page](../integrate/datasets/#oso-production-data-pipeline).

You can run all queries in this guide directly in the [BigQuery console](https://console.cloud.google.com/bigquery).

</TabItem>
<TabItem value="python" label="Python">

Start your Python notebook with the following:

```python
from google.cloud import bigquery
import pandas as pd
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = # PATH TO YOUR CREDENTIALS JSON
GCP_PROJECT = # YOUR GCP PROJECT NAME

client = bigquery.Client(GCP_PROJECT)
```

For more details on setting up Python notebooks, see our guide on [writing Python notebooks](../integrate/python-notebooks.md).

</TabItem>
</Tabs>

## Identify Repositories and Packages

### Repository Metadata

Get metadata and basic stats about a repository using OSO's indexed data:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select *
from `oso_production.repositories_v0`
where artifact_url = 'https://github.com/ethereum/go-ethereum'
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select *
from `oso_production.repositories_v0`
where artifact_url = 'https://github.com/ethereum/go-ethereum'
"""
df = client.query(query).to_dataframe()
```

</TabItem>
</Tabs>

### SBOMs (Package Dependencies)

OSO uses GitHub's Software Bill of Materials (SBOMs) dataset to identify package dependencies. Note that this data doesn't differentiate between direct and indirect dependencies, but provides a good starting point for mapping the software supply chain:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select *
from `oso_production.sboms_v0`
where from_artifact_id = '0mjl8VhWsui_6TEZZnbQzyf8h1A9bOioIlK17p0D5hI='
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select *
from `oso_production.sboms_v0`
where from_artifact_id = '0mjl8VhWsui_6TEZZnbQzyf8h1A9bOioIlK17p0D5hI='
"""
df = client.query(query).to_dataframe()
```

</TabItem>
</Tabs>

### Package Maintainers

OSO leverages [Open Source Insights (deps.dev)](https://deps.dev) data to identify the repo that maintains a given package. This covers approximately 90% of packages based on our testing:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select
package_artifact_source,
package_artifact_name,
package_owner_project_id,
package_owner_artifact_namespace,
package_owner_artifact_name
from `oso_production.package_owners_v0`
where package_artifact_name = '@libp2p/echo'
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select
package_artifact_source,
package_artifact_name,
package_owner_project_id,
package_owner_artifact_namespace,
package_owner_artifact_name
from `oso_production.package_owners_v0`
where package_artifact_name = '@libp2p/echo'
"""
df = client.query(query).to_dataframe()
```

</TabItem>
</Tabs>

### Build a Deep Funding Graph

This example demonstrates how to create a dependency graph for a group of related repositories, such as the one used by [Deep Funding](https://deepfunding.org). The analysis maps relationships between key Ethereum repositories and their package dependencies:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select distinct
sboms.from_artifact_namespace as seed_repo_owner,
sboms.from_artifact_name as seed_repo_name,
sboms.to_package_artifact_name as package_name,
package_owners.package_owner_artifact_namespace as package_repo_owner,
package_owners.package_owner_artifact_name as package_repo_name,
sboms.to_package_artifact_source as package_source
from `oso_production.sboms_v0` sboms
join `oso_production.package_owners_v0` package_owners
on
sboms.to_package_artifact_name = package_owners.package_artifact_name
and sboms.to_package_artifact_source = package_owners.package_artifact_source
where
sboms.to_package_artifact_source in ('NPM','RUST','GO','PIP')
and package_owners.package_owner_artifact_namespace is not null
and concat(sboms.from_artifact_namespace, '/', sboms.from_artifact_name)
in ('prysmaticlabs/prysm','sigp/lighthouse','consensys/teku','status-im/nimbus-eth2',
'chainsafe/lodestar','grandinetech/grandine','ethereum/go-ethereum',
'nethermindeth/nethermind','hyperledger/besu','erigontech/erigon',
'paradigmxyz/reth','ethereum/solidity','ethereum/remix-project',
'vyperlang/vyper','ethereum/web3.py','ethereum/py-evm',
'eth-infinitism/account-abstraction','safe-global/safe-smart-account',
'a16z/helios','web3/web3.js','ethereumjs/ethereumjs-monorepo')
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select distinct
sboms.from_artifact_namespace as seed_repo_owner,
sboms.from_artifact_name as seed_repo_name,
sboms.to_package_artifact_name as package_name,
package_owners.package_owner_artifact_namespace as package_repo_owner,
package_owners.package_owner_artifact_name as package_repo_name,
sboms.to_package_artifact_source as package_source
from `oso_production.sboms_v0` sboms
join `oso_production.package_owners_v0` package_owners
on
sboms.to_package_artifact_name = package_owners.package_artifact_name
and sboms.to_package_artifact_source = package_owners.package_artifact_source
where
sboms.to_package_artifact_source in ('NPM','RUST','GO','PIP')
and package_owners.package_owner_artifact_namespace is not null
and concat(sboms.from_artifact_namespace, '/', sboms.from_artifact_name)
in ('prysmaticlabs/prysm','sigp/lighthouse','consensys/teku','status-im/nimbus-eth2',
'chainsafe/lodestar','grandinetech/grandine','ethereum/go-ethereum',
'nethermindeth/nethermind','hyperledger/besu','erigontech/erigon',
'paradigmxyz/reth','ethereum/solidity','ethereum/remix-project',
'vyperlang/vyper','ethereum/web3.py','ethereum/py-evm',
'eth-infinitism/account-abstraction','safe-global/safe-smart-account',
'a16z/helios','web3/web3.js','ethereumjs/ethereumjs-monorepo')
"""
df = client.query(query).to_dataframe()
```

We can also go further and create a network graph from the data we've just fetched:

```python
import networkx as nx

# turn each node into a GitHub URL
gh = 'https://github.com/'
df['seed_repo_url'] = df.apply(lambda x: f"{gh}{x['seed_repo_owner']}/{x['seed_repo_name']}", axis=1)
df['package_repo_url'] = df.apply(lambda x: f"{gh}{x['package_repo_owner']}/{x['package_repo_name']}", axis=1)

# Store in a Network Graph
G = nx.DiGraph()

for repo_url in df['seed_repo_url'].unique():
G.add_node(repo_url, level=1)

for repo_url in df['package_repo_url'].unique():
if repo_url not in G.nodes:
G.add_node(repo_url, level=2)

for _, row in df.iterrows():
G.add_edge(
row['seed_repo_url'],
row['package_repo_url'],
relation=row['package_source']
)

# Placeholder for adding weights to the graph
global_weight = 0
for u, v in G.edges:
G[u][v]['weight'] = global_weight
```

</TabItem>
</Tabs>

For more examples of dependency analysis, check out the [Deep Funding repo](https://github.com/deepfunding/dependency-graph).

## Weight Nodes and Edges

### Most Used Dependencies

Find the most commonly used dependencies across all projects in OSO. This query joins package ownership data with SBOM data to count how many projects depend on each package:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select
p.project_id,
pkgs.package_artifact_source,
pkgs.package_artifact_name,
count(distinct sboms.from_project_id) as num_dependents
from `oso_production.package_owners_v0` pkgs
join `oso_production.sboms_v0` sboms
on pkgs.package_artifact_name = sboms.to_package_artifact_name
and pkgs.package_artifact_source = sboms.to_package_artifact_source
join `oso_production.projects_v1` p
on pkgs.package_owner_project_id = p.project_id
where pkgs.package_owner_project_id is not null
group by 1,2,3
order by 4 desc
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select
p.project_id,
pkgs.package_artifact_source,
pkgs.package_artifact_name,
count(distinct sboms.from_project_id) as num_dependents
from `oso_production.package_owners_v0` pkgs
join `oso_production.sboms_v0` sboms
on pkgs.package_artifact_name = sboms.to_package_artifact_name
and pkgs.package_artifact_source = sboms.to_package_artifact_source
join `oso_production.projects_v1` p
on pkgs.package_owner_project_id = p.project_id
where pkgs.package_owner_project_id is not null
group by 1,2,3
order by 4 desc
"""
df = client.query(query).to_dataframe()

# Optional: Display top dependencies
print("Top 10 most used dependencies:")
print(df.head(10))
```

</TabItem>
</Tabs>

### Downstream Impact

This is an example of a more advanced analysis that demonstrates how to analyze relationships between onchain projects and their development dependencies:

<Tabs>
<TabItem value="sql" label="SQL">

```sql
select
onchain_projects.project_name as `onchain_builder`,
onchain_metrics.event_source as `network`,
onchain_metrics.address_count_90_days,
onchain_metrics.gas_fees_sum_6_months,
onchain_metrics.transaction_count_6_months as transactions_6_months,
code_metrics.project_name as `dev_tool_maintainer`,
package_owners.package_artifact_source as `package_source`,
code_metrics.active_developer_count_6_months,
code_metrics.contributor_count_6_months,
code_metrics.commit_count_6_months,
code_metrics.opened_issue_count_6_months,
code_metrics.opened_pull_request_count_6_months,
code_metrics.fork_count,
code_metrics.star_count,
code_metrics.last_updated_at_date
from `oso_production.sboms_v0` sboms
join `oso_production.projects_v1` onchain_projects
on sboms.from_project_id = onchain_projects.project_id
join `oso_production.projects_by_collection_v1` projects_by_collection
on onchain_projects.project_id = projects_by_collection.project_id
join `oso_production.onchain_metrics_by_project_v1` onchain_metrics
on onchain_projects.project_id = onchain_metrics.project_id
join `oso_production.package_owners_v0` package_owners
on sboms.to_package_artifact_name = package_owners.package_artifact_name
join `oso_production.code_metrics_by_project_v1` code_metrics
on package_owners.package_owner_project_id = code_metrics.project_id
where
projects_by_collection.collection_name = 'op-retrofunding-4'
and transaction_count_6_months >= 1000
and address_count_90_days >= 420
```

</TabItem>
<TabItem value="python" label="Python">

```python
query = """
select
onchain_projects.project_name as `onchain_builder`,
onchain_metrics.event_source as `network`,
onchain_metrics.address_count_90_days,
onchain_metrics.gas_fees_sum_6_months,
onchain_metrics.transaction_count_6_months as transactions_6_months,
code_metrics.project_name as `dev_tool_maintainer`,
package_owners.package_artifact_source as `package_source`,
code_metrics.active_developer_count_6_months,
code_metrics.contributor_count_6_months,
code_metrics.commit_count_6_months,
code_metrics.opened_issue_count_6_months,
code_metrics.opened_pull_request_count_6_months,
code_metrics.fork_count,
code_metrics.star_count,
code_metrics.last_updated_at_date
from `oso_production.sboms_v0` sboms
join `oso_production.projects_v1` onchain_projects
on sboms.from_project_id = onchain_projects.project_id
join `oso_production.projects_by_collection_v1` projects_by_collection
on onchain_projects.project_id = projects_by_collection.project_id
join `oso_production.onchain_metrics_by_project_v1` onchain_metrics
on onchain_projects.project_id = onchain_metrics.project_id
join `oso_production.package_owners_v0` package_owners
on sboms.to_package_artifact_name = package_owners.package_artifact_name
join `oso_production.code_metrics_by_project_v1` code_metrics
on package_owners.package_owner_project_id = code_metrics.project_id
where
projects_by_collection.collection_name = 'op-retrofunding-4'
and transaction_count_6_months >= 1000
and address_count_90_days >= 420
"""
df = client.query(query).to_dataframe()

# Optional: Add visualization code
import plotly.express as px

# Example visualization
fig = px.scatter(df,
x='address_count_90_days',
y='transactions_6_months',
size='gas_fees_sum_6_months',
hover_data=['onchain_builder', 'dev_tool_maintainer']
)
fig.show()
```

</TabItem>
</Tabs>

You can go even further in your analysis by joining on other OSO datasets. For more examples, check out the [Deep Funding repo](https://github.com/deepfunding/dependency-graph).
Binary file added apps/docs/docs/tutorials/dependency-graph.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading