src/cargo/core/global_cache_tracker.rs

//! Support for tracking the last time files were used to assist with cleaning
//! up those files if they haven't been used in a while.
//!
//! Tracking of cache files is stored in a sqlite database which contains a
//! timestamp of the last time the file was used, as well as the size of the
//! file.
//!
//! While cargo is running, when it detects a use of a cache file, it adds a
//! timestamp to [`DeferredGlobalLastUse`]. This batches up a set of changes
//! that are then flushed to the database all at once (via
//! [`DeferredGlobalLastUse::save`]). Ideally saving would only be done once
//! for performance reasons, but that is not really possible due to the way
//! cargo works, since there are different ways cargo can be used (like `cargo
//! generate-lockfile`, `cargo fetch`, and `cargo build` are all very
//! different ways the code is used).
//!
//! All of the database interaction is done through the [`GlobalCacheTracker`]
//! type.
//!
//! There is a single global [`GlobalCacheTracker`] and
//! [`DeferredGlobalLastUse`] stored in [`GlobalContext`].
//!
//! The high-level interface for performing garbage collection is defined in
//! the [`crate::core::gc`] module. The functions there are responsible for
//! interacting with the [`GlobalCacheTracker`] to handle cleaning of global
//! cache data.
//!
//! ## Automatic gc
//!
//! Some commands (primarily the build commands) will trigger an automatic
//! deletion of files that haven't been used in a while. The high-level
//! interface for this is the [`crate::core::gc::auto_gc`] function.
//!
//! The [`GlobalCacheTracker`] database tracks the last time an automatic gc
//! was performed so that it is only done once per day for performance
//! reasons.
//!
//! ## Manual gc
//!
//! The user can perform a manual garbage collection with the `cargo clean`
//! command. That command has a variety of options to specify what to delete.
//! Manual gc supports deleting based on age or size or both. From a
//! high-level, this is done by the [`crate::core::gc::Gc::gc`] method, which
//! calls into [`GlobalCacheTracker`] to handle all the cleaning.
//!
//! ## Locking
//!
//! Usage of the database requires that the package cache is locked to prevent
//! concurrent access. Although sqlite has built-in locking support, we want
//! to use cargo's locking so that the "Blocking" message gets displayed, and
//! so that locks can block indefinitely for long-running build commands.
//! [`rusqlite`] has a default timeout of 5 seconds, though that is
//! configurable.
//!
//! When garbage collection is being performed, the package cache lock must be
//! in [`CacheLockMode::MutateExclusive`] to ensure no other cargo process is
//! running. See [`crate::util::cache_lock`] for more detail on locking.
//!
//! When performing automatic gc, [`crate::core::gc::auto_gc`] will skip the
//! GC if the package cache lock is already held by anything else. Automatic
//! GC is intended to be opportunistic, and should impose as little disruption
//! to the user as possible.
//!
//! ## Compatibility
//!
//! The database must retain both forwards and backwards compatibility between
//! different versions of cargo. For the most part, this shouldn't be too
//! difficult to maintain. Generally sqlite doesn't change on-disk formats
//! between versions (the introduction of WAL is one of the few examples where
//! version 3 had a format change, but we wouldn't use it anyway since it has
//! shared-memory requirements cargo can't depend on due to things like
//! network mounts).
//!
//! Schema changes must be managed through [`migrations`] by adding new
//! entries that make a change to the database. Changes must not break older
//! versions of cargo. Generally, adding columns should be fine (either with a
//! default value, or NULL). Adding tables should also be fine. Just don't do
//! destructive things like removing a column, or changing the semantics of an
//! existing column.
//!
//! Since users may run older versions of cargo that do not do cache tracking,
//! the [`GlobalCacheTracker::sync_db_with_files`] method helps dealing with
//! keeping the database in sync in the presence of older versions of cargo
//! touching the cache directories.
//!
//! ## Performance
//!
//! A lot of focus on the design of this system is to minimize the performance
//! impact. Every build command needs to save updates which we try to avoid
//! having a noticeable impact on build times. Systems like Windows,
//! particularly with a magnetic hard disk, can experience a fairly large
//! impact of cargo's overhead. Cargo's benchsuite has some benchmarks to help
//! compare different environments, or changes to the code here. Please try to
//! keep performance in mind if making any major changes.
//!
//! Performance of `cargo clean` is not quite as important since it is not
//! expected to be run often. However, it is still courteous to the user to
//! try to not impact it too much. One part that has a performance concern is
//! that the clean command will synchronize the database with whatever is on
//! disk if needed (in case files were added by older versions of cargo that
//! don't do cache tracking, or if the user manually deleted some files). This
//! can potentially be very slow, especially if the two are very out of sync.
//!
//! ## Filesystems
//!
//! Everything here is sensitive to the kind of filesystem it is running on.
//! People tend to run cargo in all sorts of strange environments that have
//! limited capabilities, or on things like read-only mounts. The code here
//! needs to gracefully handle as many situations as possible.
//!
//! See also the information in the [Performance](#performance) and
//! [Locking](#locking) sections when considering different filesystems and
//! their impact on performance and locking.
//!
//! There are checks for read-only filesystems, which is generally ignored.

use crate::core::gc::GcOpts;
use crate::core::Verbosity;
use crate::ops::CleanContext;
use crate::util::cache_lock::CacheLockMode;
use crate::util::interning::InternedString;
use crate::util::sqlite::{self, basic_migration, Migration};
use crate::util::{Filesystem, Progress, ProgressStyle};
use crate::{CargoResult, GlobalContext};
use anyhow::{bail, Context as _};
use cargo_util::paths;
use rusqlite::{params, Connection, ErrorCode};
use std::collections::{hash_map, HashMap};
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};
use tracing::{debug, trace};

/// The filename of the database.
const GLOBAL_CACHE_FILENAME: &str = ".global-cache";

const REGISTRY_INDEX_TABLE: &str = "registry_index";
const REGISTRY_CRATE_TABLE: &str = "registry_crate";
const REGISTRY_SRC_TABLE: &str = "registry_src";
const GIT_DB_TABLE: &str = "git_db";
const GIT_CO_TABLE: &str = "git_checkout";

/// How often timestamps will be updated.
///
/// As an optimization timestamps are not updated unless they are older than
/// the given number of seconds. This helps reduce the amount of disk I/O when
/// running cargo multiple times within a short window.
const UPDATE_RESOLUTION: u64 = 60 * 5;

/// Type for timestamps as stored in the database.
///
/// These are seconds since the Unix epoch.
type Timestamp = u64;

/// The key for a registry index entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistryIndex {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
}

/// The key for a registry `.crate` entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistryCrate {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
    /// The filename of the compressed crate, like `foo-1.2.3.crate`.
    pub crate_filename: InternedString,
    /// The size of the `.crate` file.
    pub size: u64,
}

/// The key for a registry src directory entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct RegistrySrc {
    /// A unique name of the registry source.
    pub encoded_registry_name: InternedString,
    /// The directory name of the extracted source, like `foo-1.2.3`.
    pub package_dir: InternedString,
    /// Total size of the src directory in bytes.
    ///
    /// This can be None when the size is unknown. For example, when the src
    /// directory already exists on disk, and we just want to update the
    /// last-use timestamp. We don't want to take the expense of computing disk
    /// usage unless necessary. [`GlobalCacheTracker::populate_untracked`]
    /// will handle any actual NULL values in the database, which can happen
    /// when the src directory is created by an older version of cargo that
    /// did not track sizes.
    pub size: Option<u64>,
}

/// The key for a git db entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct GitDb {
    /// A unique name of the git database.
    pub encoded_git_name: InternedString,
}

/// The key for a git checkout entry stored in the database.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub struct GitCheckout {
    /// A unique name of the git database.
    pub encoded_git_name: InternedString,
    /// A unique name of the checkout without the database.
    pub short_name: InternedString,
    /// Total size of the checkout directory.
    ///
    /// This can be None when the size is unknown. See [`RegistrySrc::size`]
    /// for an explanation.
    pub size: Option<u64>,
}

/// Filesystem paths in the global cache.
///
/// Accessing these assumes a lock has already been acquired.
struct BasePaths {
    /// Root path to the index caches.
    index: PathBuf,
    /// Root path to the git DBs.
    git_db: PathBuf,
    /// Root path to the git checkouts.
    git_co: PathBuf,
    /// Root path to the `.crate` files.
    crate_dir: PathBuf,
    /// Root path to the `src` directories.
    src: PathBuf,
}

/// Migrations which initialize the database, and can be used to evolve it over time.
///
/// See [`Migration`] for more detail.
///
/// **Be sure to not change the order or entries here!**
fn migrations() -> Vec<Migration> {
    vec![
        // registry_index tracks the overall usage of an index cache, and tracks a
        // numeric ID to refer to that index that is used in other tables.
        basic_migration(
            "CREATE TABLE registry_index (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT UNIQUE NOT NULL,
                timestamp INTEGER NOT NULL
            )",
        ),
        // .crate files
        basic_migration(
            "CREATE TABLE registry_crate (
                registry_id INTEGER NOT NULL,
                name TEXT NOT NULL,
                size INTEGER NOT NULL,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (registry_id, name),
                FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE
             )",
        ),
        // Extracted src directories
        //
        // Note that `size` can be NULL. This will happen when marking a src
        // directory as used that was created by an older version of cargo
        // that didn't do size tracking.
        basic_migration(
            "CREATE TABLE registry_src (
                registry_id INTEGER NOT NULL,
                name TEXT NOT NULL,
                size INTEGER,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (registry_id, name),
                FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE
             )",
        ),
        // Git db directories
        basic_migration(
            "CREATE TABLE git_db (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT UNIQUE NOT NULL,
                timestamp INTEGER NOT NULL
             )",
        ),
        // Git checkout directories
        basic_migration(
            "CREATE TABLE git_checkout (
                git_id INTEGER NOT NULL,
                name TEXT UNIQUE NOT NULL,
                size INTEGER,
                timestamp INTEGER NOT NULL,
                PRIMARY KEY (git_id, name),
                FOREIGN KEY (git_id) REFERENCES git_db (id) ON DELETE CASCADE
             )",
        ),
        // This is a general-purpose single-row table that can store arbitrary
        // data. Feel free to add columns (with ALTER TABLE) if necessary.
        basic_migration(
            "CREATE TABLE global_data (
                last_auto_gc INTEGER NOT NULL
            )",
        ),
        // last_auto_gc tracks the last time auto-gc was run (so that it only
        // runs roughly once a day for performance reasons). Prime it with the
        // current time to establish a baseline.
        Box::new(|conn| {
            conn.execute(
                "INSERT INTO global_data (last_auto_gc) VALUES (?1)",
                [now()],
            )?;
            Ok(())
        }),
    ]
}

/// Type for SQL columns that refer to the primary key of their parent table.
///
/// For example, `registry_crate.registry_id` refers to its parent `registry_index.id`.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ParentId(i64);

impl rusqlite::types::FromSql for ParentId {
    fn column_result(value: rusqlite::types::ValueRef<'_>) -> rusqlite::types::FromSqlResult<Self> {
        let i = i64::column_result(value)?;
        Ok(ParentId(i))
    }
}

impl rusqlite::types::ToSql for ParentId {
    fn to_sql(&self) -> rusqlite::Result<rusqlite::types::ToSqlOutput<'_>> {
        Ok(rusqlite::types::ToSqlOutput::from(self.0))
    }
}

/// Tracking for the global shared cache (registry files, etc.).
///
/// This is the interface to the global cache database, used for tracking and
/// cleaning. See the [`crate::core::global_cache_tracker`] module docs for
/// details.
#[derive(Debug)]
pub struct GlobalCacheTracker {
    /// Connection to the SQLite database.
    conn: Connection,
    /// This is an optimization used to make sure cargo only checks if gc
    /// needs to run once per session. This starts as `false`, and then the
    /// first time it checks if automatic gc needs to run, it will be set to
    /// `true`.
    auto_gc_checked_this_session: bool,
}

impl GlobalCacheTracker {
    /// Creates a new [`GlobalCacheTracker`].
    ///
    /// The caller is responsible for locking the package cache with
    /// [`CacheLockMode::DownloadExclusive`] before calling this.
    pub fn new(gctx: &GlobalContext) -> CargoResult<GlobalCacheTracker> {
        let db_path = Self::db_path(gctx);
        // A package cache lock is required to ensure only one cargo is
        // accessing at the same time. If there is concurrent access, we
        // want to rely on cargo's own "Blocking" system (which can
        // provide user feedback) rather than blocking inside sqlite
        // (which by default has a short timeout).
        let db_path = gctx.assert_package_cache_locked(CacheLockMode::DownloadExclusive, &db_path);
        let mut conn = Connection::open(db_path)?;
        conn.pragma_update(None, "foreign_keys", true)?;
        sqlite::migrate(&mut conn, &migrations())?;
        Ok(GlobalCacheTracker {
            conn,
            auto_gc_checked_this_session: false,
        })
    }

    /// The path to the database.
    pub fn db_path(gctx: &GlobalContext) -> Filesystem {
        gctx.home().join(GLOBAL_CACHE_FILENAME)
    }

    /// Given an encoded registry name, returns its ID.
    ///
    /// Returns None if the given name isn't in the database.
    fn id_from_name(
        conn: &Connection,
        table_name: &str,
        encoded_name: &str,
    ) -> CargoResult<Option<ParentId>> {
        let mut stmt =
            conn.prepare_cached(&format!("SELECT id FROM {table_name} WHERE name = ?"))?;
        match stmt.query_row([encoded_name], |row| row.get(0)) {
            Ok(id) => Ok(Some(id)),
            Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None),
            Err(e) => Err(e.into()),
        }
    }

    /// Returns a map of ID to path for the given ids in the given table.
    ///
    /// For example, given `registry_index` IDs, it returns filenames of the
    /// form "index.crates.io-6f17d22bba15001f".
    fn get_id_map(
        conn: &Connection,
        table_name: &str,
        ids: &[i64],
    ) -> CargoResult<HashMap<i64, PathBuf>> {
        let mut stmt =
            conn.prepare_cached(&format!("SELECT name FROM {table_name} WHERE id = ?1"))?;
        ids.iter()
            .map(|id| {
                let name = stmt.query_row(params![id], |row| {
                    Ok(PathBuf::from(row.get::<_, String>(0)?))
                })?;
                Ok((*id, name))
            })
            .collect()
    }

    /// Returns all index cache timestamps.
    pub fn registry_index_all(&self) -> CargoResult<Vec<(RegistryIndex, Timestamp)>> {
        let mut stmt = self
            .conn
            .prepare_cached("SELECT name, timestamp FROM registry_index")?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let timestamp = row.get_unwrap(1);
                let kind = RegistryIndex {
                    encoded_registry_name,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all registry crate cache timestamps.
    pub fn registry_crate_all(&self) -> CargoResult<Vec<(RegistryCrate, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT registry_index.name, registry_crate.name, registry_crate.size, registry_crate.timestamp
             FROM registry_index, registry_crate
             WHERE registry_crate.registry_id = registry_index.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let crate_filename = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = RegistryCrate {
                    encoded_registry_name,
                    crate_filename,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all registry source cache timestamps.
    pub fn registry_src_all(&self) -> CargoResult<Vec<(RegistrySrc, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT registry_index.name, registry_src.name, registry_src.size, registry_src.timestamp
             FROM registry_index, registry_src
             WHERE registry_src.registry_id = registry_index.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_registry_name = row.get_unwrap(0);
                let package_dir = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = RegistrySrc {
                    encoded_registry_name,
                    package_dir,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all git db timestamps.
    pub fn git_db_all(&self) -> CargoResult<Vec<(GitDb, Timestamp)>> {
        let mut stmt = self
            .conn
            .prepare_cached("SELECT name, timestamp FROM git_db")?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_git_name = row.get_unwrap(0);
                let timestamp = row.get_unwrap(1);
                let kind = GitDb { encoded_git_name };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns all git checkout timestamps.
    pub fn git_checkout_all(&self) -> CargoResult<Vec<(GitCheckout, Timestamp)>> {
        let mut stmt = self.conn.prepare_cached(
            "SELECT git_db.name, git_checkout.name, git_checkout.size, git_checkout.timestamp
             FROM git_db, git_checkout
             WHERE git_checkout.git_id = git_db.id",
        )?;
        let rows = stmt
            .query_map([], |row| {
                let encoded_git_name = row.get_unwrap(0);
                let short_name = row.get_unwrap(1);
                let size = row.get_unwrap(2);
                let timestamp = row.get_unwrap(3);
                let kind = GitCheckout {
                    encoded_git_name,
                    short_name,
                    size,
                };
                Ok((kind, timestamp))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        Ok(rows)
    }

    /// Returns whether or not an auto GC should be performed, compared to the
    /// last time it was recorded in the database.
    pub fn should_run_auto_gc(&mut self, frequency: Duration) -> CargoResult<bool> {
        trace!(target: "gc", "should_run_auto_gc");
        if self.auto_gc_checked_this_session {
            return Ok(false);
        }
        let last_auto_gc: Timestamp =
            self.conn
                .query_row("SELECT last_auto_gc FROM global_data", [], |row| row.get(0))?;
        let should_run = last_auto_gc + frequency.as_secs() < now();
        trace!(target: "gc",
            "last auto gc was {}, {}",
            last_auto_gc,
            if should_run { "running" } else { "skipping" }
        );
        self.auto_gc_checked_this_session = true;
        Ok(should_run)
    }

    /// Writes to the database to indicate that an automatic GC has just been
    /// completed.
    pub fn set_last_auto_gc(&self) -> CargoResult<()> {
        self.conn
            .execute("UPDATE global_data SET last_auto_gc = ?1", [now()])?;
        Ok(())
    }

    /// Deletes files from the global cache based on the given options.
    pub fn clean(&mut self, clean_ctx: &mut CleanContext<'_>, gc_opts: &GcOpts) -> CargoResult<()> {
        self.clean_inner(clean_ctx, gc_opts)
            .with_context(|| "failed to clean entries from the global cache")
    }

    #[tracing::instrument(skip_all)]
    fn clean_inner(
        &mut self,
        clean_ctx: &mut CleanContext<'_>,
        gc_opts: &GcOpts,
    ) -> CargoResult<()> {
        let gctx = clean_ctx.gctx;
        let base = BasePaths {
            index: gctx.registry_index_path().into_path_unlocked(),
            git_db: gctx.git_db_path().into_path_unlocked(),
            git_co: gctx.git_checkouts_path().into_path_unlocked(),
            crate_dir: gctx.registry_cache_path().into_path_unlocked(),
            src: gctx.registry_source_path().into_path_unlocked(),
        };
        let now = now();
        trace!(target: "gc", "cleaning {gc_opts:?}");
        let tx = self.conn.transaction()?;
        let mut delete_paths = Vec::new();
        // This can be an expensive operation, so only perform it if necessary.
        if gc_opts.is_download_cache_opt_set() {
            // TODO: Investigate how slow this might be.
            Self::sync_db_with_files(
                &tx,
                now,
                gctx,
                &base,
                gc_opts.is_download_cache_size_set(),
                &mut delete_paths,
            )
            .with_context(|| "failed to sync tracking database")?
        }
        if let Some(max_age) = gc_opts.max_index_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_index_to_clean(&tx, max_age, &base, &mut delete_paths)?;
        }
        if let Some(max_age) = gc_opts.max_src_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_items_to_clean_age(
                &tx,
                max_age,
                REGISTRY_SRC_TABLE,
                &base.src,
                &mut delete_paths,
            )?;
        }
        if let Some(max_age) = gc_opts.max_crate_age {
            let max_age = now - max_age.as_secs();
            Self::get_registry_items_to_clean_age(
                &tx,
                max_age,
                REGISTRY_CRATE_TABLE,
                &base.crate_dir,
                &mut delete_paths,
            )?;
        }
        if let Some(max_age) = gc_opts.max_git_db_age {
            let max_age = now - max_age.as_secs();
            Self::get_git_db_items_to_clean(&tx, max_age, &base, &mut delete_paths)?;
        }
        if let Some(max_age) = gc_opts.max_git_co_age {
            let max_age = now - max_age.as_secs();
            Self::get_git_co_items_to_clean(&tx, max_age, &base.git_co, &mut delete_paths)?;
        }
        // Size collection must happen after date collection so that dates
        // have precedence, since size constraints are a more blunt
        // instrument.
        //
        // These are also complicated by the `--max-download-size` option
        // overlapping with `--max-crate-size` and `--max-src-size`, which
        // requires some coordination between those options which isn't
        // necessary with the age-based options. An item's age is either older
        // or it isn't, but contrast that with size which is based on the sum
        // of all tracked items. Also, `--max-download-size` is summed against
        // both the crate and src tracking, which requires combining them to
        // compute the size, and then separating them to calculate the correct
        // paths.
        if let Some(max_size) = gc_opts.max_crate_size {
            Self::get_registry_items_to_clean_size(
                &tx,
                max_size,
                REGISTRY_CRATE_TABLE,
                &base.crate_dir,
                &mut delete_paths,
            )?;
        }
        if let Some(max_size) = gc_opts.max_src_size {
            Self::get_registry_items_to_clean_size(
                &tx,
                max_size,
                REGISTRY_SRC_TABLE,
                &base.src,
                &mut delete_paths,
            )?;
        }
        if let Some(max_size) = gc_opts.max_git_size {
            Self::get_git_items_to_clean_size(&tx, max_size, &base, &mut delete_paths)?;
        }
        if let Some(max_size) = gc_opts.max_download_size {
            Self::get_registry_items_to_clean_size_both(&tx, max_size, &base, &mut delete_paths)?;
        }

        clean_ctx.remove_paths(&delete_paths)?;

        if clean_ctx.dry_run {
            tx.rollback()?;
        } else {
            tx.commit()?;
        }
        Ok(())
    }

    /// Returns a list of directory entries in the given path.
    fn names_from(path: &Path) -> CargoResult<Vec<String>> {
        let entries = match path.read_dir() {
            Ok(e) => e,
            Err(e) => {
                if e.kind() == std::io::ErrorKind::NotFound {
                    return Ok(Vec::new());
                } else {
                    return Err(
                        anyhow::Error::new(e).context(format!("failed to read path `{path:?}`"))
                    );
                }
            }
        };
        let names = entries
            .filter_map(|entry| entry.ok()?.file_name().into_string().ok())
            .collect();
        Ok(names)
    }

    /// Synchronizes the database to match the files on disk.
    ///
    /// This performs the following cleanups:
    ///
    /// 1. Remove entries from the database that are missing on disk.
    /// 2. Adds missing entries to the database that are on disk (such as when
    ///    files are added by older versions of cargo).
    /// 3. Fills in the `size` column where it is NULL (such as when something
    ///    is added to disk by an older version of cargo, and one of the mark
    ///    functions marked it without knowing the size).
    ///
    ///    Size computations are only done if `sync_size` is set since it can
    ///    be a very expensive operation. This should only be set if the user
    ///    requested to clean based on the cache size.
    /// 4. Checks for orphaned files. For example, if there are `.crate` files
    ///    associated with an index that does not exist.
    ///
    ///    These orphaned files will be added to `delete_paths` so that the
    ///    caller can delete them.
    #[tracing::instrument(skip(conn, gctx, base, delete_paths))]
    fn sync_db_with_files(
        conn: &Connection,
        now: Timestamp,
        gctx: &GlobalContext,
        base: &BasePaths,
        sync_size: bool,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "starting db sync");
        // For registry_index and git_db, add anything that is missing in the db.
        Self::update_parent_for_missing_from_db(conn, now, REGISTRY_INDEX_TABLE, &base.index)?;
        Self::update_parent_for_missing_from_db(conn, now, GIT_DB_TABLE, &base.git_db)?;

        // For registry_crate, registry_src, and git_checkout, remove anything
        // from the db that isn't on disk.
        Self::update_db_for_removed(
            conn,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_CRATE_TABLE,
            &base.crate_dir,
        )?;
        Self::update_db_for_removed(
            conn,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_SRC_TABLE,
            &base.src,
        )?;
        Self::update_db_for_removed(conn, GIT_DB_TABLE, "git_id", GIT_CO_TABLE, &base.git_co)?;

        // For registry_index and git_db, remove anything from the db that
        // isn't on disk.
        //
        // This also collects paths for any child files that don't have their
        // respective parent on disk.
        Self::update_db_parent_for_removed_from_disk(
            conn,
            REGISTRY_INDEX_TABLE,
            &base.index,
            &[&base.crate_dir, &base.src],
            delete_paths,
        )?;
        Self::update_db_parent_for_removed_from_disk(
            conn,
            GIT_DB_TABLE,
            &base.git_db,
            &[&base.git_co],
            delete_paths,
        )?;

        // For registry_crate, registry_src, and git_checkout, add anything
        // that is missing in the db.
        Self::populate_untracked_crate(conn, now, &base.crate_dir)?;
        Self::populate_untracked(
            conn,
            now,
            gctx,
            REGISTRY_INDEX_TABLE,
            "registry_id",
            REGISTRY_SRC_TABLE,
            &base.src,
            sync_size,
        )?;
        Self::populate_untracked(
            conn,
            now,
            gctx,
            GIT_DB_TABLE,
            "git_id",
            GIT_CO_TABLE,
            &base.git_co,
            sync_size,
        )?;

        // Update any NULL sizes if needed.
        if sync_size {
            Self::update_null_sizes(
                conn,
                gctx,
                REGISTRY_INDEX_TABLE,
                "registry_id",
                REGISTRY_SRC_TABLE,
                &base.src,
            )?;
            Self::update_null_sizes(
                conn,
                gctx,
                GIT_DB_TABLE,
                "git_id",
                GIT_CO_TABLE,
                &base.git_co,
            )?;
        }
        Ok(())
    }

    /// For parent tables, add any entries that are on disk but aren't tracked in the db.
    #[tracing::instrument(skip(conn, now, base_path))]
    fn update_parent_for_missing_from_db(
        conn: &Connection,
        now: Timestamp,
        parent_table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for untracked parent to add to {parent_table_name}");
        let names = Self::names_from(base_path)?;

        let mut stmt = conn.prepare_cached(&format!(
            "INSERT INTO {parent_table_name} (name, timestamp)
                VALUES (?1, ?2)
                ON CONFLICT DO NOTHING",
        ))?;
        for name in names {
            stmt.execute(params![name, now])?;
        }
        Ok(())
    }

    /// Removes database entries for any files that are not on disk for the child tables.
    ///
    /// This could happen for example if the user manually deleted the file or
    /// any such scenario where the filesystem and db are out of sync.
    #[tracing::instrument(skip(conn, base_path))]
    fn update_db_for_removed(
        conn: &Connection,
        parent_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for db entries to remove from {table_name}");
        let mut select_stmt = conn.prepare_cached(&format!(
            "SELECT {table_name}.rowid, {parent_table_name}.name, {table_name}.name
             FROM {parent_table_name}, {table_name}
             WHERE {table_name}.{id_column_name} = {parent_table_name}.id",
        ))?;
        let mut delete_stmt =
            conn.prepare_cached(&format!("DELETE FROM {table_name} WHERE rowid = ?1"))?;
        let mut rows = select_stmt.query([])?;
        while let Some(row) = rows.next()? {
            let rowid: i64 = row.get_unwrap(0);
            let id_name: String = row.get_unwrap(1);
            let name: String = row.get_unwrap(2);
            if !base_path.join(id_name).join(name).exists() {
                delete_stmt.execute([rowid])?;
            }
        }
        Ok(())
    }

    /// Removes database entries for any files that are not on disk for the parent tables.
    #[tracing::instrument(skip(conn, base_path, child_base_paths, delete_paths))]
    fn update_db_parent_for_removed_from_disk(
        conn: &Connection,
        parent_table_name: &str,
        base_path: &Path,
        child_base_paths: &[&Path],
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        trace!(target: "gc", "checking for db entries to remove from {parent_table_name}");
        let mut select_stmt =
            conn.prepare_cached(&format!("SELECT rowid, name FROM {parent_table_name}"))?;
        let mut delete_stmt =
            conn.prepare_cached(&format!("DELETE FROM {parent_table_name} WHERE rowid = ?1"))?;
        let mut rows = select_stmt.query([])?;
        while let Some(row) = rows.next()? {
            let rowid: i64 = row.get_unwrap(0);
            let id_name: String = row.get_unwrap(1);
            if !base_path.join(&id_name).exists() {
                delete_stmt.execute([rowid])?;
                // Make sure any child data is also cleaned up.
                for child_base in child_base_paths {
                    let child_path = child_base.join(&id_name);
                    if child_path.exists() {
                        debug!(target: "gc", "removing orphaned path {child_path:?}");
                        delete_paths.push(child_path);
                    }
                }
            }
        }
        Ok(())
    }

    /// Updates the database to add any `.crate` files that are currently
    /// not tracked (such as when they are downloaded by an older version of
    /// cargo).
    #[tracing::instrument(skip(conn, now, base_path))]
    fn populate_untracked_crate(
        conn: &Connection,
        now: Timestamp,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "populating untracked crate files");
        let mut insert_stmt = conn.prepare_cached(
            "INSERT INTO registry_crate (registry_id, name, size, timestamp)
             VALUES (?1, ?2, ?3, ?4)
             ON CONFLICT DO NOTHING",
        )?;
        let index_names = Self::names_from(&base_path)?;
        for index_name in index_names {
            let Some(id) = Self::id_from_name(conn, REGISTRY_INDEX_TABLE, &index_name)? else {
                // The id is missing from the database. This should be resolved
                // via update_db_parent_for_removed_from_disk.
                continue;
            };
            let index_path = base_path.join(index_name);
            for crate_name in Self::names_from(&index_path)? {
                if crate_name.ends_with(".crate") {
                    // Missing files should have already been taken care of by
                    // update_db_for_removed.
                    let size = paths::metadata(index_path.join(&crate_name))?.len();
                    insert_stmt.execute(params![id, crate_name, size, now])?;
                }
            }
        }
        Ok(())
    }

    /// Updates the database to add any files that are currently not tracked
    /// (such as when they are downloaded by an older version of cargo).
    #[tracing::instrument(skip(conn, now, gctx, base_path, populate_size))]
    fn populate_untracked(
        conn: &Connection,
        now: Timestamp,
        gctx: &GlobalContext,
        id_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
        populate_size: bool,
    ) -> CargoResult<()> {
        trace!(target: "gc", "populating untracked files for {table_name}");
        // Gather names (and make sure they are in the database).
        let id_names = Self::names_from(&base_path)?;

        // This SELECT is used to determine if the directory is already
        // tracked. We don't want to do the expensive size computation unless
        // necessary.
        let mut select_stmt = conn.prepare_cached(&format!(
            "SELECT 1 FROM {table_name}
             WHERE {id_column_name} = ?1 AND name = ?2",
        ))?;
        let mut insert_stmt = conn.prepare_cached(&format!(
            "INSERT INTO {table_name} ({id_column_name}, name, size, timestamp)
             VALUES (?1, ?2, ?3, ?4)
             ON CONFLICT DO NOTHING",
        ))?;
        let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx);
        // Compute the size of any directory not in the database.
        for id_name in id_names {
            let Some(id) = Self::id_from_name(conn, id_table_name, &id_name)? else {
                // The id is missing from the database. This should be resolved
                // via update_db_parent_for_removed_from_disk.
                continue;
            };
            let index_path = base_path.join(id_name);
            let names = Self::names_from(&index_path)?;
            let max = names.len();
            for (i, name) in names.iter().enumerate() {
                if select_stmt.exists(params![id, name])? {
                    continue;
                }
                let dir_path = index_path.join(name);
                if !dir_path.is_dir() {
                    continue;
                }
                progress.tick(i, max, "")?;
                let size = if populate_size {
                    Some(du(&dir_path, table_name)?)
                } else {
                    None
                };
                insert_stmt.execute(params![id, name, size, now])?;
            }
        }
        Ok(())
    }

    /// Fills in the `size` column where it is NULL.
    ///
    /// This can happen when something is added to disk by an older version of
    /// cargo, and one of the mark functions marked it without knowing the
    /// size.
    ///
    /// `update_db_for_removed` should be called before this is called.
    #[tracing::instrument(skip(conn, gctx, base_path))]
    fn update_null_sizes(
        conn: &Connection,
        gctx: &GlobalContext,
        parent_table_name: &str,
        id_column_name: &str,
        table_name: &str,
        base_path: &Path,
    ) -> CargoResult<()> {
        trace!(target: "gc", "updating NULL size information in {table_name}");
        let mut null_stmt = conn.prepare_cached(&format!(
            "SELECT {table_name}.rowid, {table_name}.name, {parent_table_name}.name
             FROM {table_name}, {parent_table_name}
             WHERE {table_name}.size IS NULL AND {table_name}.{id_column_name} = {parent_table_name}.id",
        ))?;
        let mut update_stmt = conn.prepare_cached(&format!(
            "UPDATE {table_name} SET size = ?1 WHERE rowid = ?2"
        ))?;
        let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx);
        let rows: Vec<_> = null_stmt
            .query_map([], |row| {
                Ok((row.get_unwrap(0), row.get_unwrap(1), row.get_unwrap(2)))
            })?
            .collect();
        let max = rows.len();
        for (i, row) in rows.into_iter().enumerate() {
            let (rowid, name, id_name): (i64, String, String) = row?;
            let path = base_path.join(id_name).join(name);
            progress.tick(i, max, "")?;
            // Missing files should have already been taken care of by
            // update_db_for_removed.
            let size = du(&path, table_name)?;
            update_stmt.execute(params![size, rowid])?;
        }
        Ok(())
    }

    /// Adds paths to delete from either registry_crate or registry_src whose
    /// last use is older than the given timestamp.
    fn get_registry_items_to_clean_age(
        conn: &Connection,
        max_age: Timestamp,
        table_name: &str,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning {table_name} since {max_age:?}");
        let mut stmt = conn.prepare_cached(&format!(
            "DELETE FROM {table_name} WHERE timestamp < ?1
                RETURNING registry_id, name"
        ))?;
        let rows = stmt
            .query_map(params![max_age], |row| {
                let registry_id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((registry_id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_registry_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_registry_name).join(name));
        }
        Ok(())
    }

    /// Adds paths to delete from either `registry_crate` or `registry_src` in
    /// order to keep the total size under the given max size.
    fn get_registry_items_to_clean_size(
        conn: &Connection,
        max_size: u64,
        table_name: &str,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning {table_name} till under {max_size:?}");
        let total_size: u64 = conn.query_row(
            &format!("SELECT coalesce(SUM(size), 0) FROM {table_name}"),
            [],
            |row| row.get(0),
        )?;
        if total_size <= max_size {
            return Ok(());
        }
        // This SQL statement selects all of the rows ordered by timestamp,
        // and then uses a window function to keep a running total of the
        // size. It selects all rows until the running total exceeds the
        // threshold of the total number of bytes that we want to delete.
        //
        // The window function essentially computes an aggregate over all
        // previous rows as it goes along. As long as the running size is
        // below the total amount that we need to delete, it keeps picking
        // more rows.
        //
        // The ORDER BY includes `name` mainly for test purposes so that
        // entries with the same timestamp have deterministic behavior.
        //
        // The coalesce helps convert NULL to 0.
        let mut stmt = conn.prepare(&format!(
            "DELETE FROM {table_name} WHERE rowid IN \
                (SELECT x.rowid FROM \
                    (SELECT rowid, size, SUM(size) OVER \
                        (ORDER BY timestamp, name ROWS UNBOUNDED PRECEDING) AS running_amount \
                        FROM {table_name}) x \
                    WHERE coalesce(x.running_amount, 0) - x.size < ?1) \
                RETURNING registry_id, name;"
        ))?;
        let rows = stmt
            .query_map(params![total_size - max_size], |row| {
                let id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        // Convert registry_id to the encoded registry name, and join those.
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_name).join(name));
        }
        Ok(())
    }

    /// Adds paths to delete from both `registry_crate` and `registry_src` in
    /// order to keep the total size under the given max size.
    fn get_registry_items_to_clean_size_both(
        conn: &Connection,
        max_size: u64,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning download till under {max_size:?}");

        // This SQL statement selects from both registry_src and
        // registry_crate so that sorting of timestamps incorporates both of
        // them at the same time. It uses a const value of 1 or 2 as the first
        // column so that the code below can determine which table the value
        // came from.
        let mut stmt = conn.prepare_cached(
            "SELECT 1, registry_src.rowid, registry_src.name AS name, registry_index.name,
                    registry_src.size, registry_src.timestamp AS timestamp
             FROM registry_src, registry_index
             WHERE registry_src.registry_id = registry_index.id AND registry_src.size NOT NULL

             UNION

             SELECT 2, registry_crate.rowid, registry_crate.name AS name, registry_index.name,
                    registry_crate.size, registry_crate.timestamp AS timestamp
             FROM registry_crate, registry_index
             WHERE registry_crate.registry_id = registry_index.id

             ORDER BY timestamp, name",
        )?;
        let mut delete_src_stmt =
            conn.prepare_cached("DELETE FROM registry_src WHERE rowid = ?1")?;
        let mut delete_crate_stmt =
            conn.prepare_cached("DELETE FROM registry_crate WHERE rowid = ?1")?;
        let rows = stmt
            .query_map([], |row| {
                Ok((
                    row.get_unwrap(0),
                    row.get_unwrap(1),
                    row.get_unwrap(2),
                    row.get_unwrap(3),
                    row.get_unwrap(4),
                ))
            })?
            .collect::<Result<Vec<(i64, i64, String, String, u64)>, _>>()?;
        let mut total_size: u64 = rows.iter().map(|r| r.4).sum();
        debug!(target: "gc", "total download cache size appears to be {total_size}");
        for (table, rowid, name, index_name, size) in rows {
            if total_size <= max_size {
                break;
            }
            if table == 1 {
                delete_paths.push(base.src.join(index_name).join(name));
                delete_src_stmt.execute([rowid])?;
            } else {
                delete_paths.push(base.crate_dir.join(index_name).join(name));
                delete_crate_stmt.execute([rowid])?;
            }
            // TODO: If delete crate, ensure src is also deleted.
            total_size -= size;
        }
        Ok(())
    }

    /// Adds paths to delete from the git cache, keeping the total size under
    /// the give value.
    ///
    /// Paths are relative to the `git` directory in the cache directory.
    fn get_git_items_to_clean_size(
        conn: &Connection,
        max_size: u64,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git till under {max_size:?}");

        // Collect all the sizes from git_db and git_checkouts, and then sort them by timestamp.
        let mut stmt = conn.prepare_cached("SELECT rowid, name, timestamp FROM git_db")?;
        let mut git_info = stmt
            .query_map([], |row| {
                let rowid: i64 = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                let timestamp: Timestamp = row.get_unwrap(2);
                // Size is added below so that the error doesn't need to be
                // converted to a rusqlite error.
                Ok((timestamp, rowid, None, name, 0))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        for info in &mut git_info {
            let size = cargo_util::du(&base.git_db.join(&info.3), &[])?;
            info.4 = size;
        }

        let mut stmt = conn.prepare_cached(
            "SELECT git_checkout.rowid, git_db.name, git_checkout.name,
                git_checkout.size, git_checkout.timestamp
                FROM git_checkout, git_db
                WHERE git_checkout.git_id = git_db.id AND git_checkout.size NOT NULL",
        )?;
        let git_co_rows = stmt
            .query_map([], |row| {
                let rowid = row.get_unwrap(0);
                let db_name: String = row.get_unwrap(1);
                let name = row.get_unwrap(2);
                let size = row.get_unwrap(3);
                let timestamp = row.get_unwrap(4);
                Ok((timestamp, rowid, Some(db_name), name, size))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        git_info.extend(git_co_rows);

        // Sort by timestamp, and name. The name is included mostly for test
        // purposes so that entries with the same timestamp have deterministic
        // behavior.
        git_info.sort_by(|a, b| (b.0, &b.3).cmp(&(a.0, &a.3)));

        // Collect paths to delete.
        let mut delete_db_stmt = conn.prepare_cached("DELETE FROM git_db WHERE rowid = ?1")?;
        let mut delete_co_stmt =
            conn.prepare_cached("DELETE FROM git_checkout WHERE rowid = ?1")?;
        let mut total_size: u64 = git_info.iter().map(|r| r.4).sum();
        debug!(target: "gc", "total git cache size appears to be {total_size}");
        while let Some((_timestamp, rowid, db_name, name, size)) = git_info.pop() {
            if total_size <= max_size {
                break;
            }
            if let Some(db_name) = db_name {
                delete_paths.push(base.git_co.join(db_name).join(name));
                delete_co_stmt.execute([rowid])?;
                total_size -= size;
            } else {
                total_size -= size;
                delete_paths.push(base.git_db.join(&name));
                delete_db_stmt.execute([rowid])?;
                // If the db is deleted, then all the checkouts must be deleted.
                let mut i = 0;
                while i < git_info.len() {
                    if git_info[i].2.as_deref() == Some(name.as_ref()) {
                        let (_, rowid, db_name, name, size) = git_info.remove(i);
                        delete_paths.push(base.git_co.join(db_name.unwrap()).join(name));
                        delete_co_stmt.execute([rowid])?;
                        total_size -= size;
                    } else {
                        i += 1;
                    }
                }
            }
        }
        Ok(())
    }

    /// Adds paths to delete from `registry_index` whose last use is older
    /// than the given timestamp.
    fn get_registry_index_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning index since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM registry_index WHERE timestamp < ?1
                RETURNING name",
        )?;
        let mut rows = stmt.query([max_age])?;
        while let Some(row) = rows.next()? {
            let name: String = row.get_unwrap(0);
            delete_paths.push(base.index.join(&name));
            // Also delete .crate and src directories, since by definition
            // they cannot be used without their index.
            delete_paths.push(base.src.join(&name));
            delete_paths.push(base.crate_dir.join(&name));
        }
        Ok(())
    }

    /// Adds paths to delete from `git_checkout` whose last use is
    /// older than the given timestamp.
    fn get_git_co_items_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base_path: &Path,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git co since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM git_checkout WHERE timestamp < ?1
                RETURNING git_id, name",
        )?;
        let rows = stmt
            .query_map(params![max_age], |row| {
                let git_id = row.get_unwrap(0);
                let name: String = row.get_unwrap(1);
                Ok((git_id, name))
            })?
            .collect::<Result<Vec<_>, _>>()?;
        let ids: Vec<_> = rows.iter().map(|r| r.0).collect();
        let id_map = Self::get_id_map(conn, GIT_DB_TABLE, &ids)?;
        for (id, name) in rows {
            let encoded_git_name = &id_map[&id];
            delete_paths.push(base_path.join(encoded_git_name).join(name));
        }
        Ok(())
    }

    /// Adds paths to delete from `git_db` in order to keep the total size
    /// under the given max size.
    fn get_git_db_items_to_clean(
        conn: &Connection,
        max_age: Timestamp,
        base: &BasePaths,
        delete_paths: &mut Vec<PathBuf>,
    ) -> CargoResult<()> {
        debug!(target: "gc", "cleaning git db since {max_age:?}");
        let mut stmt = conn.prepare_cached(
            "DELETE FROM git_db WHERE timestamp < ?1
                RETURNING name",
        )?;
        let mut rows = stmt.query([max_age])?;
        while let Some(row) = rows.next()? {
            let name: String = row.get_unwrap(0);
            delete_paths.push(base.git_db.join(&name));
            // Also delete checkout directories, since by definition they
            // cannot be used without their db.
            delete_paths.push(base.git_co.join(&name));
        }
        Ok(())
    }
}

/// Helper to generate the upsert for the parent tables.
///
/// This handles checking if the row already exists, and only updates the
/// timestamp it if it hasn't been updated recently. This also handles keeping
/// a cached map of the `id` value.
///
/// Unfortunately it is a bit tricky to share this code without a macro.
macro_rules! insert_or_update_parent {
    ($self:expr, $conn:expr, $table_name:expr, $timestamps_field:ident, $keys_field:ident, $encoded_name:ident) => {
        let mut select_stmt = $conn.prepare_cached(concat!(
            "SELECT id, timestamp FROM ",
            $table_name,
            " WHERE name = ?1"
        ))?;
        let mut insert_stmt = $conn.prepare_cached(concat!(
            "INSERT INTO ",
            $table_name,
            " (name, timestamp)
                VALUES (?1, ?2)
                ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                RETURNING id",
        ))?;
        let mut update_stmt = $conn.prepare_cached(concat!(
            "UPDATE ",
            $table_name,
            " SET timestamp = ?1 WHERE id = ?2"
        ))?;
        for (parent, new_timestamp) in std::mem::take(&mut $self.$timestamps_field) {
            trace!(target: "gc",
                concat!("insert ", $table_name, " {:?} {}"),
                parent,
                new_timestamp
            );
            let mut rows = select_stmt.query([parent.$encoded_name])?;
            let id = if let Some(row) = rows.next()? {
                let id: ParentId = row.get_unwrap(0);
                let timestamp: Timestamp = row.get_unwrap(1);
                if timestamp < new_timestamp - UPDATE_RESOLUTION {
                    update_stmt.execute(params![new_timestamp, id])?;
                }
                id
            } else {
                insert_stmt.query_row(params![parent.$encoded_name, new_timestamp], |row| {
                    row.get(0)
                })?
            };
            match $self.$keys_field.entry(parent.$encoded_name) {
                hash_map::Entry::Occupied(o) => {
                    assert_eq!(*o.get(), id);
                }
                hash_map::Entry::Vacant(v) => {
                    v.insert(id);
                }
            }
        }
        return Ok(());
    };
}

/// This is a cache of modifications that will be saved to disk all at once
/// via the [`DeferredGlobalLastUse::save`] method.
///
/// This is here to improve performance.
#[derive(Debug)]
pub struct DeferredGlobalLastUse {
    /// Cache of registry keys, used for faster fetching.
    ///
    /// The key is the registry name (which is its directory name) and the
    /// value is the `id` in the `registry_index` table.
    registry_keys: HashMap<InternedString, ParentId>,
    /// Cache of git keys, used for faster fetching.
    ///
    /// The key is the git db name (which is its directory name) and the value
    /// is the `id` in the `git_db` table.
    git_keys: HashMap<InternedString, ParentId>,

    /// New registry index entries to insert.
    registry_index_timestamps: HashMap<RegistryIndex, Timestamp>,
    /// New registry `.crate` entries to insert.
    registry_crate_timestamps: HashMap<RegistryCrate, Timestamp>,
    /// New registry src directory entries to insert.
    registry_src_timestamps: HashMap<RegistrySrc, Timestamp>,
    /// New git db entries to insert.
    git_db_timestamps: HashMap<GitDb, Timestamp>,
    /// New git checkout entries to insert.
    git_checkout_timestamps: HashMap<GitCheckout, Timestamp>,
    /// This is used so that a warning about failing to update the database is
    /// only displayed once.
    save_err_has_warned: bool,
    /// The current time, used to improve performance to avoid accessing the
    /// clock hundreds of times.
    now: Timestamp,
}

impl DeferredGlobalLastUse {
    pub fn new() -> DeferredGlobalLastUse {
        DeferredGlobalLastUse {
            registry_keys: HashMap::new(),
            git_keys: HashMap::new(),
            registry_index_timestamps: HashMap::new(),
            registry_crate_timestamps: HashMap::new(),
            registry_src_timestamps: HashMap::new(),
            git_db_timestamps: HashMap::new(),
            git_checkout_timestamps: HashMap::new(),
            save_err_has_warned: false,
            now: now(),
        }
    }

    pub fn is_empty(&self) -> bool {
        self.registry_index_timestamps.is_empty()
            && self.registry_crate_timestamps.is_empty()
            && self.registry_src_timestamps.is_empty()
            && self.git_db_timestamps.is_empty()
            && self.git_checkout_timestamps.is_empty()
    }

    fn clear(&mut self) {
        self.registry_index_timestamps.clear();
        self.registry_crate_timestamps.clear();
        self.registry_src_timestamps.clear();
        self.git_db_timestamps.clear();
        self.git_checkout_timestamps.clear();
    }

    /// Indicates the given [`RegistryIndex`] has been used right now.
    pub fn mark_registry_index_used(&mut self, registry_index: RegistryIndex) {
        self.mark_registry_index_used_stamp(registry_index, None);
    }

    /// Indicates the given [`RegistryCrate`] has been used right now.
    ///
    /// Also implicitly marks the index used, too.
    pub fn mark_registry_crate_used(&mut self, registry_crate: RegistryCrate) {
        self.mark_registry_crate_used_stamp(registry_crate, None);
    }

    /// Indicates the given [`RegistrySrc`] has been used right now.
    ///
    /// Also implicitly marks the index used, too.
    pub fn mark_registry_src_used(&mut self, registry_src: RegistrySrc) {
        self.mark_registry_src_used_stamp(registry_src, None);
    }

    /// Indicates the given [`GitCheckout`] has been used right now.
    ///
    /// Also implicitly marks the git db used, too.
    pub fn mark_git_checkout_used(&mut self, git_checkout: GitCheckout) {
        self.mark_git_checkout_used_stamp(git_checkout, None);
    }

    /// Indicates the given [`RegistryIndex`] has been used with the given
    /// time (or "now" if `None`).
    pub fn mark_registry_index_used_stamp(
        &mut self,
        registry_index: RegistryIndex,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        self.registry_index_timestamps
            .insert(registry_index, timestamp);
    }

    /// Indicates the given [`RegistryCrate`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the index used, too.
    pub fn mark_registry_crate_used_stamp(
        &mut self,
        registry_crate: RegistryCrate,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let index = RegistryIndex {
            encoded_registry_name: registry_crate.encoded_registry_name,
        };
        self.registry_index_timestamps.insert(index, timestamp);
        self.registry_crate_timestamps
            .insert(registry_crate, timestamp);
    }

    /// Indicates the given [`RegistrySrc`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the index used, too.
    pub fn mark_registry_src_used_stamp(
        &mut self,
        registry_src: RegistrySrc,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let index = RegistryIndex {
            encoded_registry_name: registry_src.encoded_registry_name,
        };
        self.registry_index_timestamps.insert(index, timestamp);
        self.registry_src_timestamps.insert(registry_src, timestamp);
    }

    /// Indicates the given [`GitCheckout`] has been used with the given
    /// time (or "now" if `None`).
    ///
    /// Also implicitly marks the git db used, too.
    pub fn mark_git_checkout_used_stamp(
        &mut self,
        git_checkout: GitCheckout,
        timestamp: Option<&SystemTime>,
    ) {
        let timestamp = timestamp.map_or(self.now, to_timestamp);
        let db = GitDb {
            encoded_git_name: git_checkout.encoded_git_name,
        };
        self.git_db_timestamps.insert(db, timestamp);
        self.git_checkout_timestamps.insert(git_checkout, timestamp);
    }

    /// Saves all of the deferred information to the database.
    ///
    /// This will also clear the state of `self`.
    #[tracing::instrument(skip_all)]
    pub fn save(&mut self, tracker: &mut GlobalCacheTracker) -> CargoResult<()> {
        trace!(target: "gc", "saving last-use data");
        if self.is_empty() {
            return Ok(());
        }
        let tx = tracker.conn.transaction()?;
        // These must run before the ones that refer to their IDs.
        self.insert_registry_index_from_cache(&tx)?;
        self.insert_git_db_from_cache(&tx)?;
        self.insert_registry_crate_from_cache(&tx)?;
        self.insert_registry_src_from_cache(&tx)?;
        self.insert_git_checkout_from_cache(&tx)?;
        tx.commit()?;
        trace!(target: "gc", "last-use save complete");
        Ok(())
    }

    /// Variant of [`DeferredGlobalLastUse::save`] that does not return an
    /// error.
    ///
    /// This will log or display a warning to the user.
    pub fn save_no_error(&mut self, gctx: &GlobalContext) {
        if let Err(e) = self.save_with_gctx(gctx) {
            // Because there is an assertion in auto-gc that checks if this is
            // empty, be sure to clear it so that assertion doesn't fail.
            self.clear();
            if !self.save_err_has_warned {
                if is_silent_error(&e) && gctx.shell().verbosity() != Verbosity::Verbose {
                    tracing::warn!("failed to save last-use data: {e:?}");
                } else {
                    crate::display_warning_with_error(
                        "failed to save last-use data\n\
                        This may prevent cargo from accurately tracking what is being \
                        used in its global cache. This information is used for \
                        automatically removing unused data in the cache.",
                        &e,
                        &mut gctx.shell(),
                    );
                    self.save_err_has_warned = true;
                }
            }
        }
    }

    fn save_with_gctx(&mut self, gctx: &GlobalContext) -> CargoResult<()> {
        let mut tracker = gctx.global_cache_tracker()?;
        self.save(&mut tracker)
    }

    /// Flushes all of the `registry_index_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_registry_index_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        insert_or_update_parent!(
            self,
            conn,
            "registry_index",
            registry_index_timestamps,
            registry_keys,
            encoded_registry_name
        );
    }

    /// Flushes all of the `git_db_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_git_db_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        insert_or_update_parent!(
            self,
            conn,
            "git_db",
            git_db_timestamps,
            git_keys,
            encoded_git_name
        );
    }

    /// Flushes all of the `registry_crate_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_registry_crate_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let registry_crate_timestamps = std::mem::take(&mut self.registry_crate_timestamps);
        for (registry_crate, timestamp) in registry_crate_timestamps {
            trace!(target: "gc", "insert registry crate {registry_crate:?} {timestamp}");
            let registry_id = self.registry_id(conn, registry_crate.encoded_registry_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO registry_crate (registry_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5
                 ",
            )?;
            stmt.execute(params![
                registry_id,
                registry_crate.crate_filename,
                registry_crate.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }
        Ok(())
    }

    /// Flushes all of the `registry_src_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_registry_src_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let registry_src_timestamps = std::mem::take(&mut self.registry_src_timestamps);
        for (registry_src, timestamp) in registry_src_timestamps {
            trace!(target: "gc", "insert registry src {registry_src:?} {timestamp}");
            let registry_id = self.registry_id(conn, registry_src.encoded_registry_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO registry_src (registry_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5
                 ",
            )?;
            stmt.execute(params![
                registry_id,
                registry_src.package_dir,
                registry_src.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }

        Ok(())
    }

    /// Flushes all of the `git_checkout_timestamps` to the database,
    /// clearing `registry_index_timestamps`.
    fn insert_git_checkout_from_cache(&mut self, conn: &Connection) -> CargoResult<()> {
        let git_checkout_timestamps = std::mem::take(&mut self.git_checkout_timestamps);
        for (git_checkout, timestamp) in git_checkout_timestamps {
            let git_id = self.git_id(conn, git_checkout.encoded_git_name)?;
            let mut stmt = conn.prepare_cached(
                "INSERT INTO git_checkout (git_id, name, size, timestamp)
                 VALUES (?1, ?2, ?3, ?4)
                 ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp
                    WHERE timestamp < ?5",
            )?;
            stmt.execute(params![
                git_id,
                git_checkout.short_name,
                git_checkout.size,
                timestamp,
                timestamp - UPDATE_RESOLUTION
            ])?;
        }

        Ok(())
    }

    /// Returns the numeric ID of the registry, either fetching from the local
    /// cache, or getting it from the database.
    ///
    /// It is an error if the registry does not exist.
    fn registry_id(
        &mut self,
        conn: &Connection,
        encoded_registry_name: InternedString,
    ) -> CargoResult<ParentId> {
        match self.registry_keys.get(&encoded_registry_name) {
            Some(i) => Ok(*i),
            None => {
                let Some(id) = GlobalCacheTracker::id_from_name(
                    conn,
                    REGISTRY_INDEX_TABLE,
                    &encoded_registry_name,
                )?
                else {
                    bail!("expected registry_index {encoded_registry_name} to exist, but wasn't found");
                };
                self.registry_keys.insert(encoded_registry_name, id);
                Ok(id)
            }
        }
    }

    /// Returns the numeric ID of the git db, either fetching from the local
    /// cache, or getting it from the database.
    ///
    /// It is an error if the git db does not exist.
    fn git_id(
        &mut self,
        conn: &Connection,
        encoded_git_name: InternedString,
    ) -> CargoResult<ParentId> {
        match self.git_keys.get(&encoded_git_name) {
            Some(i) => Ok(*i),
            None => {
                let Some(id) =
                    GlobalCacheTracker::id_from_name(conn, GIT_DB_TABLE, &encoded_git_name)?
                else {
                    bail!("expected git_db {encoded_git_name} to exist, but wasn't found")
                };
                self.git_keys.insert(encoded_git_name, id);
                Ok(id)
            }
        }
    }
}

/// Converts a [`SystemTime`] to a [`Timestamp`] which can be stored in the database.
fn to_timestamp(t: &SystemTime) -> Timestamp {
    t.duration_since(SystemTime::UNIX_EPOCH)
        .expect("invalid clock")
        .as_secs()
}

/// Returns the current time.
///
/// This supports pretending that the time is different for testing using an
/// environment variable.
///
/// If possible, try to avoid calling this too often since accessing clocks
/// can be a little slow on some systems.
#[allow(clippy::disallowed_methods)]
fn now() -> Timestamp {
    match std::env::var("__CARGO_TEST_LAST_USE_NOW") {
        Ok(now) => now.parse().unwrap(),
        Err(_) => to_timestamp(&SystemTime::now()),
    }
}

/// Returns whether or not the given error should cause a warning to be
/// displayed to the user.
///
/// In some situations, like a read-only global cache, we don't want to spam
/// the user with a warning. I think once cargo has controllable lints, I
/// think we should consider changing this to always warn, but give the user
/// an option to silence the warning.
pub fn is_silent_error(e: &anyhow::Error) -> bool {
    if let Some(e) = e.downcast_ref::<rusqlite::Error>() {
        if matches!(
            e.sqlite_error_code(),
            Some(ErrorCode::CannotOpen | ErrorCode::ReadOnly)
        ) {
            return true;
        }
    }
    false
}

/// Returns the disk usage for a git checkout directory.
pub fn du_git_checkout(path: &Path) -> CargoResult<u64> {
    // !.git is used because clones typically use hardlinks for the git
    // contents. TODO: Verify behavior on Windows.
    // TODO: Or even better, switch to worktrees, and remove this.
    cargo_util::du(&path, &["!.git"])
}

fn du(path: &Path, table_name: &str) -> CargoResult<u64> {
    if table_name == GIT_CO_TABLE {
        du_git_checkout(path)
    } else {
        cargo_util::du(&path, &[])
    }
}