label_responsive_genes

Labels genes in a DataFrame as responsive or not based on thresholds for expression effect and p-value. Both comparisons are strict: the absolute expression effect must be greater than abs_expression_effect_threshold, and the p-value must be less than expression_pvalue_threshold.

The function adds a new boolean column ‘responsive’ to the DataFrame. A gene is labeled as responsive if its absolute expression effect is strictly greater than the effect threshold and its p-value is strictly less than the p-value threshold. If normalization is enabled (normalization_cutoff greater than 0), genes are ranked by descending absolute expression effect, with ties broken by ascending p-value, and only genes ranked within the top normalization_cutoff can be labeled as responsive.
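
The strictness of the two comparisons matters at the boundaries: a gene whose value equals a threshold is not labeled responsive. A minimal sketch of the per-gene rule follows, written without pandas. The helper `is_responsive` is hypothetical and not part of callingcardstools; it only mirrors the rule described above.

```python
def is_responsive(effect, pvalue, effect_threshold, pvalue_threshold):
    """Hypothetical scalar helper mirroring the documented rule for one gene."""
    # Both comparisons are strict: a value equal to its threshold fails.
    return abs(effect) > effect_threshold and pvalue < pvalue_threshold


# (effect, p-value) pairs checked against thresholds 0.6 and 0.05
genes = [(-0.7, 0.04), (0.6, 0.01), (0.7, 0.05), (1.2, 0.2)]
labels = [is_responsive(e, p, 0.6, 0.05) for e, p in genes]
print(labels)  # [True, False, False, False]
```

The second gene fails because its effect equals the effect threshold, and the third fails because its p-value equals the p-value threshold.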

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing gene data. Must include ‘expression_effect’ and ‘expression_pvalue’ columns.

required
abs_expression_effect_threshold float

Threshold for the absolute value of the expression effect. Values strictly greater than this threshold are considered responsive if the p-value threshold passes. If None, the effect is not used for filtering.

required
expression_pvalue_threshold float

Threshold for the expression p-value. Values strictly less than this threshold are considered responsive if the effect threshold passes. If None, the p-value is not used for filtering.

required
normalization_cutoff int

Rank cutoff applied after sorting by absolute expression effect: only the top normalization_cutoff genes can be labeled responsive. This serves to normalize rank response across expression data sets. Defaults to -1; any value of 0 or less disables normalization.

-1
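
To illustrate how normalization_cutoff interacts with the two thresholds, here is a dependency-free sketch that uses a list of tuples in place of a DataFrame. The function `label_with_cutoff` and the gene names are hypothetical; the ranking order (largest absolute effect first, then smallest p-value) follows the source shown further down this page.

```python
def label_with_cutoff(genes, effect_threshold, pvalue_threshold,
                      normalization_cutoff=-1):
    """Hypothetical list-based stand-in for the DataFrame logic.

    genes: list of (name, effect, pvalue) tuples.
    Returns (name, responsive) pairs in ranked order.
    """
    # A non-positive cutoff disables normalization: every rank passes.
    rank_cutoff = (
        normalization_cutoff if normalization_cutoff > 0 else len(genes) + 1
    )
    # Rank by largest absolute effect first, then smallest p-value.
    ranked = sorted(genes, key=lambda g: (-abs(g[1]), g[2]))
    return [
        (
            name,
            abs(effect) > effect_threshold
            and pvalue < pvalue_threshold
            and rank <= rank_cutoff,
        )
        for rank, (name, effect, pvalue) in enumerate(ranked, start=1)
    ]


genes = [("g1", 0.5, 0.01), ("g2", -2.0, 0.001),
         ("g3", 1.5, 0.02), ("g4", 0.9, 0.03)]
no_cutoff = label_with_cutoff(genes, 0.6, 0.05)
top_two = label_with_cutoff(genes, 0.6, 0.05, normalization_cutoff=2)
print(no_cutoff)  # g2, g3 and g4 pass; g1 fails the effect threshold
print(top_two)    # only g2 and g3 pass; g4 is ranked third
```

With the cutoff set to 2, g4 still satisfies both thresholds but is excluded because it falls outside the top two ranks.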

Returns:

Type Description

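pd.DataFrame: A copy of the input DataFrame with an added ‘responsive’ column. Rows are sorted by descending absolute expression effect, then ascending p-value, and the index is reset.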

Raises:

Type Description
KeyError

If ‘expression_effect’ or ‘expression_pvalue’ is not in the DataFrame.

Examples:

>>> import pandas as pd
>>> df = pd.DataFrame({'expression_effect': [0.5, 0.7, 1.2],
...                    'expression_pvalue': [0.01, 0.04, 0.2]})
>>> label_responsive_genes(df, 0.6, 0.05).responsive.tolist()
[False, True, False]
Source code in callingcardstools/Analysis/yeast/rank_response.py
def label_responsive_genes(
    df,
    abs_expression_effect_threshold,
    expression_pvalue_threshold,
    normalization_cutoff: int = -1,
):
    """
    Labels genes in a DataFrame as responsive or not based on thresholds for
    expression effect and p-value. Note that the comparisons on the thresholds
    are strictly greater than for the abs_expression_effect_threshold and
    strictly less than for the expression_pvalue_threshold.

    The function adds a new boolean column 'responsive' to the DataFrame, where
    each gene is labeled as responsive if its absolute expression effect is
    strictly greater than a threshold and its p-value is strictly less than
    a specified threshold. If normalization is enabled, genes are ranked by
    descending absolute expression effect (ties broken by ascending p-value)
    and only genes ranked within the top `normalization_cutoff` can be
    labeled as responsive.

    Args:
        df (pd.DataFrame): DataFrame containing gene data. Must include
            'expression_effect' and 'expression_pvalue' columns.
        abs_expression_effect_threshold (float): Threshold for the absolute
            value of the expression effect. Values strictly greater than this
            threshold are considered responsive if the p-value threshold
            passes. If None, the effect is not used for filtering.
        expression_pvalue_threshold (float): Threshold for the expression
            p-value. Values strictly less than this threshold are considered
            responsive if the effect threshold passes. If None, the p-value
            is not used for filtering.
        normalization_cutoff (int, optional): Rank cutoff applied after
            sorting by absolute expression effect: only the top
            `normalization_cutoff` genes can be labeled responsive. This
            serves to normalize rank response across expression data sets.
            Defaults to -1; any value of 0 or less disables normalization.

    Returns:
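        pd.DataFrame: A copy of the input DataFrame with an added
            'responsive' column, sorted by descending absolute expression
            effect, then ascending p-value, with the index reset.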

    Raises:
        KeyError: If 'expression_effect' or 'expression_pvalue' is not in
            the DataFrame.

    Examples:
        >>> import pandas as pd
        >>> df = pd.DataFrame({'expression_effect': [0.5, 0.7, 1.2],
        ...                    'expression_pvalue': [0.01, 0.04, 0.2]})
        >>> label_responsive_genes(df, 0.6, 0.05).responsive.tolist()
        [False, True, False]
    """
    if "expression_effect" not in df.columns:
        raise KeyError("Column 'expression_effect' is not in the data")
    if "expression_pvalue" not in df.columns:
        raise KeyError("Column 'expression_pvalue' is not in the data")

    expression_effect_rank_cutoff = (
        normalization_cutoff if normalization_cutoff > 0 else len(df) + 1
    )

    df_abs = df.assign(abs_expression_effect=df["expression_effect"].abs())

    # if either the effect or p-value threshold is `None`, then set
    # the threshold to the appropriate boundary to prevent filtering on that
    # column
    abs_expression_effect_threshold = (
        abs_expression_effect_threshold
        if abs_expression_effect_threshold is not None
        else min(df_abs["abs_expression_effect"]) - 1
    )

    expression_pvalue_threshold = (
        expression_pvalue_threshold
        if expression_pvalue_threshold is not None
        else max(df_abs["expression_pvalue"]) + 1
    )

    df_ranked = (
        df_abs.sort_values(
            by=["abs_expression_effect", "expression_pvalue"], ascending=[False, True]
        ).reset_index(drop=True)
        # Add 1 to start ranking from 1 instead of 0
        .assign(rank=lambda x: x.index + 1)
    )

    df_ranked["responsive"] = (
        (df_ranked["abs_expression_effect"] > abs_expression_effect_threshold)  # noqa
        & (df_ranked["expression_pvalue"] < expression_pvalue_threshold)
        & (df_ranked["rank"] <= expression_effect_rank_cutoff)
    )

    return df_ranked.drop(columns=["rank", "abs_expression_effect"])
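
The handling of `None` thresholds in the source above can be hard to see at a glance: a missing threshold is replaced by a sentinel that every gene passes, so only the other criterion filters. The following is a rough sketch of that idea on plain lists. The helper `resolve_thresholds` is hypothetical and is not a function in callingcardstools.

```python
def resolve_thresholds(abs_effects, pvalues,
                       effect_threshold=None, pvalue_threshold=None):
    """Hypothetical helper: swap a None threshold for one no gene can fail."""
    if effect_threshold is None:
        # Below the smallest absolute effect, so `abs_effect > threshold`
        # holds for every gene.
        effect_threshold = min(abs_effects) - 1
    if pvalue_threshold is None:
        # Above the largest p-value, so `pvalue < threshold` holds for
        # every gene.
        pvalue_threshold = max(pvalues) + 1
    return effect_threshold, pvalue_threshold


abs_effects = [0.5, 0.7, 1.2]
pvalues = [0.01, 0.04, 0.2]

# Only the effect threshold is supplied; the p-value filter is disabled.
effect_t, pvalue_t = resolve_thresholds(abs_effects, pvalues,
                                        effect_threshold=0.6)
passes = [e > effect_t and p < pvalue_t
          for e, p in zip(abs_effects, pvalues)]
print(passes)  # [False, True, True]
```

The third gene has a p-value of 0.2, which would fail a 0.05 threshold, but it passes here because the p-value threshold was left as `None`.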