Use pyranges to join the promoter regions with the qbed data and count the
number of qbed records that overlap each promoter.
Only the following keyword arguments are read from kwargs and passed to the
join method of the PyRanges object:
- slack: defaults to 0
- suffix: defaults to "_b" (the overlap filter on Start_b assumes this default)
- strandedness: defaults to False
:param promoters_pr: a PyRanges of promoter regions.
:type promoters_pr: pr.PyRanges
:param qbed_pr: a PyRanges of qbed data from the
experiment.
:type qbed_pr: pr.PyRanges
:param hops_colname: the name to give the output column that
contains the number of hops.
:type hops_colname: str
:return: a pandas DataFrame with one row per promoter name that overlaps at
least one qbed record, and a column named hops_colname containing the number
of overlapping qbed records. Promoters with no overlap are omitted.
:rtype: pd.DataFrame
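The counting behaviour described above can be illustrated without pyranges. The following is a minimal pure-Python sketch, not the library's implementation: the toy data and the helper names `left_join` and `count_hops_sketch` are hypothetical, and it assumes 0-based half-open intervals, slack of 0, and no strand matching. It mimics a left join in which promoters without an overlap get a -1 sentinel, then drops those rows and counts the rest per promoter name.

```python
from collections import Counter

# Hypothetical toy data (not from the original documentation).
# Promoters: (chrom, start, end, name); qbed insertions: (chrom, start, end).
# Coordinates are assumed to be 0-based and half-open.
promoters = [
    ("chrI", 100, 200, "geneA"),
    ("chrI", 150, 300, "geneB"),
    ("chrII", 0, 50, "geneC"),
]
qbed = [
    ("chrI", 120, 121),
    ("chrI", 160, 161),
    ("chrI", 250, 251),
    ("chrII", 500, 501),
]


def left_join(promoters, qbed):
    """Emit one row per (promoter, overlapping insertion) pair.

    A promoter with no overlapping insertion still yields one row,
    with Start_b and End_b set to the sentinel value -1.
    """
    rows = []
    for chrom, start, end, name in promoters:
        hits = [
            (s, e)
            for c, s, e in qbed
            if c == chrom and s < end and start < e  # half-open overlap test
        ]
        for s, e in hits or [(-1, -1)]:
            rows.append({"name": name, "Start_b": s, "End_b": e})
    return rows


def count_hops_sketch(promoters, qbed):
    """Drop sentinel rows, then count the remaining rows per promoter name."""
    rows = left_join(promoters, qbed)
    return dict(Counter(r["name"] for r in rows if r["Start_b"] >= 0))


counts = count_hops_sketch(promoters, qbed)
print(counts)
```

In this sketch geneC has no overlapping insertion, so it does not appear in the result at all rather than appearing with a count of zero, which matches the filtering step in the function documented here.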
Source code in callingcardstools/PeakCalling/yeast/call_peaks.py
def count_hops(
    promoters_pr: pr.PyRanges,
    qbed_pr: pr.PyRanges,
    hops_colname: str,
    **kwargs,
) -> pd.DataFrame:
    """
    Use pyranges to join the promoter regions with the qbed data and count the
    number of qbed records that overlap each promoter.

    Only the following keyword arguments are read from kwargs and passed to the
    join method of the PyRanges object:

    - slack: defaults to 0
    - suffix: defaults to "_b" (the overlap filter on Start_b assumes this default)
    - strandedness: defaults to False

    :param promoters_pr: a PyRanges of promoter regions.
    :type promoters_pr: pr.PyRanges
    :param qbed_pr: a PyRanges of qbed data from the
        experiment.
    :type qbed_pr: pr.PyRanges
    :param hops_colname: the name to give the output column that
        contains the number of hops.
    :type hops_colname: str

    :return: a pandas DataFrame with one row per promoter name that overlaps at
        least one qbed record, and a column named hops_colname containing the
        number of overlapping qbed records. Promoters with no overlap are omitted.
    :rtype: pd.DataFrame
    """
    overlaps = promoters_pr.join(
        qbed_pr,
        how="left",
        slack=kwargs.get("slack", 0),
        suffix=kwargs.get("suffix", "_b"),
        strandedness=kwargs.get("strandedness", False),
    )
    # Group by 'name' and count the number of records in each group.
    # `observed=True` because the grouping is over a categorical variable;
    # without it, pandas also builds groups for unobserved categories and
    # memory usage skyrockets. The pandas default is still False in 2.0
    # (deprecated in 2.1), so it is set explicitly here.
    # Filtering on "Start_b >= 0" removes rows where there is no overlap, which
    # pyranges represents by -1 in the _b columns after a left join.
    overlap_counts = (
        overlaps.df.query("Start_b >= 0")
        .groupby("name", observed=True)
        .size()
        .reset_index(name="Count")
        .rename(columns={"Count": hops_colname})
    )
    return overlap_counts