count_hops

Use pyranges to join the promoter regions with the qbed data and count the number of qbed records that overlap with each promoter.

Only the following keyword arguments are read from kwargs and passed to the join method of the PyRanges object; any others are ignored:

- slack: defaults to 0
- suffix: defaults to "_b". The function filters on the Start_b column, so a different suffix breaks that filter.
- strandedness: defaults to False

:param promoters_pr: a PyRanges of promoter regions. It must include a name column, which is used to group the overlaps.
:type promoters_pr: pr.PyRanges
:param qbed_pr: a PyRanges of qbed data from the experiment.
:type qbed_pr: pr.PyRanges
:param hops_colname: the name of the output column that holds the number of hops for each promoter.
:type hops_colname: str

:return: a pandas DataFrame with a name column and a column named by hops_colname that holds the number of qbed records overlapping each promoter. Promoters with no overlapping records are omitted.
:rtype: pd.DataFrame
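To make the counting semantics concrete, here is a minimal pure-Python sketch of what the function computes. It is not the pyranges implementation: the name count_overlaps_sketch, the tuple layout, and the example coordinates are hypothetical, and the handling of slack (widening each promoter by that many bases on both sides) is a simplifying assumption.

```python
from collections import Counter


def count_overlaps_sketch(promoters, hops, slack=0):
    """Count hops overlapping each promoter (illustrative model only).

    promoters: list of (chrom, start, end, name) tuples, half-open intervals.
    hops: list of (chrom, start, end) tuples, half-open intervals.
    slack: assumed to widen each promoter by this many bases on both sides.
    """
    counts = Counter()
    for p_chrom, p_start, p_end, name in promoters:
        for h_chrom, h_start, h_end in hops:
            if p_chrom != h_chrom:
                continue
            # Two half-open intervals overlap when each starts before
            # the other one ends.
            if p_start - slack < h_end and h_start < p_end + slack:
                counts[name] += 1
    # Promoters without any overlap never enter the counter, mirroring
    # the way count_hops omits them from its result.
    return dict(counts)


promoters = [
    ("chrI", 100, 200, "geneA"),
    ("chrI", 500, 600, "geneB"),
    ("chrII", 100, 200, "geneC"),
]
hops = [
    ("chrI", 150, 151),
    ("chrI", 199, 200),
    ("chrI", 200, 201),  # starts exactly at the promoter end: no overlap
    ("chrII", 50, 51),
]

result = count_overlaps_sketch(promoters, hops)
print(result)  # {'geneA': 2}
```

Only geneA appears in the output, because geneB and geneC have no overlapping hops.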

Source code in callingcardstools/PeakCalling/yeast/call_peaks.py

import pandas as pd
import pyranges as pr


def count_hops(
    promoters_pr: pr.PyRanges,
    qbed_pr: pr.PyRanges,
    hops_colname: str,
    **kwargs,
) -> pd.DataFrame:
    """
    Use pyranges to join the promoter regions with the qbed data and count the
    number of qbed records that overlap with each promoter.

    Only the following keyword arguments are read from kwargs and passed to
    the join method of the PyRanges object; any others are ignored:
      - slack: defaults to 0
      - suffix: defaults to "_b"
      - strandedness: defaults to False

    :param promoters_pr: a PyRanges of promoter regions. It must include a
        `name` column, which is used to group the overlaps.
    :type promoters_pr: pr.PyRanges
    :param qbed_pr: a PyRanges of qbed data from the experiment.
    :type qbed_pr: pr.PyRanges
    :param hops_colname: the name of the output column that holds the number
        of hops for each promoter.
    :type hops_colname: str

    :return: a pandas DataFrame with a `name` column and a column named by
        `hops_colname` that holds the number of qbed records overlapping each
        promoter. Promoters with no overlapping records are omitted.
    :rtype: pd.DataFrame
    """
    overlaps = promoters_pr.join(
        qbed_pr,
        how="left",
        slack=kwargs.get("slack", 0),
        suffix=kwargs.get("suffix", "_b"),
        strandedness=kwargs.get("strandedness", False),
    )

    # Group by 'name' and count the number of records in each group
    # `observed=True` because the grouping key is categorical. pandas 2.0 still
    # defaults to observed=False, which emits a group for every unobserved
    # category; without this set, memory usage skyrockets.
    # Setting "Start_b >= 0" to remove rows where there is no overlap, which are
    # represented by -1 in the _b columns by pyranges.
    overlap_counts = (
        overlaps.df.query("Start_b >= 0")
        .groupby("name", observed=True)
        .size()
        .reset_index(name="Count")
        .rename(columns={"Count": hops_colname})
    )

    return overlap_counts
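
The effect of the Start_b filter and of observed=True can be seen on a small hand-built frame. This is a sketch under stated assumptions: the frame below only imitates the shape of a joined result (a categorical name column plus Start_b, with -1 marking no overlap), and the column name hops and the promoter names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the joined frame: p3 has no overlap (Start_b == -1) and
# p2 is a category that never appears in the rows at all.
joined = pd.DataFrame(
    {
        "name": pd.Categorical(
            ["p1", "p1", "p3"], categories=["p1", "p2", "p3"]
        ),
        "Start_b": [10, 20, -1],
    }
)

overlapping = joined.query("Start_b >= 0")

# observed=True: only categories present in the filtered rows form groups.
counts_observed = (
    overlapping.groupby("name", observed=True).size().reset_index(name="hops")
)

# observed=False: every category gets a group, including empty ones.
counts_all = (
    overlapping.groupby("name", observed=False).size().reset_index(name="hops")
)

print(counts_observed)
print(counts_all)
```

With observed=True the result holds a single row for p1, while observed=False also returns zero-count rows for p2 and p3. With many categorical levels, those empty groups are what drive the memory cost.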