Skip to content

poisson_pval_vectorized

Compute the Poisson p-value for the given hops counts.

:param total_background_hops: a pandas Series (column of a dataframe) of total number of hops in the background. :type total_background_hops: Series[int64] :param total_experiment_hops: a pandas Series (column of a dataframe) of total number of hops in the experiment. :type total_experiment_hops: Series[int64] :param background_hops: a pandas Series (column of a dataframe) of number of hops in the background by promoter region. :type background_hops: Series[int64] :param experiment_hops: a pandas Series (column of a dataframe) of number of hops in the experiment by promoter region. :type experiment_hops: Series[int64] :param pseudocount: , defaults to 1.0 :type pseudocount: float, optional :return: a pandas Series of length equal to the input Series with the Poisson p-value for each row. :rtype: Series[float]

.. note:: This function is vectorized, so it can be applied to pandas Series (columns of dataframes) to compute the Poisson p-value for each row.

:raises ValueError: If any of the input Series contain negative values or the input Series are not all the same length.

:Example:

import pandas as pd total_background_hops = pd.Series([100, 200, 300]) total_experiment_hops = pd.Series([10, 20, 30]) background_hops = pd.Series([5, 10, 15]) experiment_hops = pd.Series([2, 4, 6]) vectorized_poisson_pval( … total_background_hops, … total_experiment_hops, … background_hops, … experiment_hops) array([0.01438768, 0.00365985, 0.00092599])

Source code in callingcardstools/PeakCalling/yeast/poisson_pval_vectorized.py
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
def poisson_pval_vectorized(
    total_background_hops: Series,
    total_experiment_hops: Series,
    background_hops: Series,
    experiment_hops: Series,
    pseudocount: float = 0.1,
    **kwargs
) -> Series:
    """
    Compute the Poisson p-value for the given hops counts.

    :param total_background_hops: a pandas Series (column of a dataframe)
        of total number of hops in the background.
    :type total_background_hops: Series[int64]
    :param total_experiment_hops: a pandas Series (column of a dataframe)
        of total number of hops in the experiment.
    :type total_experiment_hops: Series[int64]
    :param background_hops: a pandas Series (column of a dataframe)
        of number of hops in the background by promoter region.
    :type background_hops: Series[int64]
    :param experiment_hops: a pandas Series (column of a dataframe)
        of number of hops in the experiment by promoter region.
    :type experiment_hops: Series[int64]
    :param pseudocount: , defaults to 1.0
    :type pseudocount: float, optional
    :return: a pandas Series of length equal to the input Series with the
        Poisson p-value for each row.
    :rtype: Series[float]

    .. note:: This function is vectorized, so it can be applied to
        pandas Series (columns of dataframes) to compute the
        Poisson p-value for each row.

    :raises ValueError: If any of the input Series contain negative values or
        the input Series are not all the same length.

    :Example:

    >>> import pandas as pd
    >>> total_background_hops = pd.Series([100, 200, 300])
    >>> total_experiment_hops = pd.Series([10, 20, 30])
    >>> background_hops = pd.Series([5, 10, 15])
    >>> experiment_hops = pd.Series([2, 4, 6])
    >>> vectorized_poisson_pval(
    ...     total_background_hops,
    ...     total_experiment_hops,
    ...     background_hops,
    ...     experiment_hops)
    array([0.01438768, 0.00365985, 0.00092599])
    """
    # check input
    if (
        not len(total_background_hops)
        == len(total_experiment_hops)
        == len(background_hops)
        == len(experiment_hops)
    ):
        raise ValueError("All input Series must be the same length.")
    if total_background_hops.min() < 0 or total_background_hops.dtype != "int64":
        raise ValueError(("total_background_hops must " "be a non-negative integer."))
    if total_experiment_hops.min() < 0 or total_background_hops.dtype != "int64":
        raise ValueError(("total_experiment_hops must " "be a non-negative integer"))

    # cast to `float` b/c of scipy

    # note that read_in_background_data and read_in_experiment_data in
    # read_in_data.py require that there be at least 1 hop in both the background
    # and the experiment. Therefore the total_background_hops and total_experiment_hops
    # is always defined
    hop_ratio = (total_experiment_hops / total_background_hops).astype("float")

    # It is possible that there are promoters with no background hops. Add a small
    # pseudocount to require that mu > 0, per poisson definition
    mu = ((background_hops + pseudocount) * hop_ratio).astype("float")

    # there has been a pseudocount added to experiment hops. Not necessary, removed
    # 20240624
    # The way this is calculated, with pyranges and a sum, this value will always be
    # at minimum 0
    x = experiment_hops.astype("float")

    # 20250306: The p-value is calculated as the probability of observing x or more
    # hops given the expected number of hops. This is equal to 1 - the cumulative
    # distribution function (CDF) of the Poisson distribution at x, plus the
    # probability mass function (PMF) at x. This is a change from Rob's original code
    # and the code in callingCardsTools
    # the resolution in the CDF is very low, so this ends up being the PMF value.
    # However, by inspection, the values after `x` are an order of magnitude or more
    # smaller, so the pvalue is dominated by the first term.
    pval = (1 - poisson.cdf(x, mu)) + poisson.pmf(x, mu)
    # return the pvalue as a pandas Series
    return Series(pval)