xorbits.experimental.dedup
- xorbits.experimental.dedup(df: pandas.core.frame.DataFrame, col: str, method: str = 'minhash', **kws) → pandas.core.frame.DataFrame
Applies deduplication on a DataFrame based on the chosen method.
This function provides two deduplication methods: exact matching and MinHash-based matching. Exact matching uses md5 hashing, while the MinHash-based method uses MinHash and MinHashLSH to identify and remove duplicates based on Jaccard similarity. The MinHash-based method generates hash values for the specified column of the DataFrame, estimates the similarity between rows from these hash values, and removes the rows determined to be duplicates according to the provided Jaccard similarity threshold.
- Parameters
df (pd.DataFrame) – The DataFrame to deduplicate.
col (str) – The column of the DataFrame on which to calculate hash values.
method (str, default ‘minhash’) – The method for deduplication. Options include ‘exact’ and ‘minhash’.
Additional parameters for the MinHash method (passed as keyword arguments):
threshold (float, default 0.7) – The Jaccard similarity threshold to use in the MinHashLSH.
num_perm (int, default 128) – The number of permutations to use in the MinHash.
min_length (int, default 5) – The minimum number of tokens to use in the MinHash. Texts shorter than this value will be filtered out.
ngrams (int, default 5) – The n-gram size to use in the MinHash.
seed (int, default 42) – The seed for the random number generator.
- Returns
The DataFrame after applying the chosen deduplication method.
- Return type
pd.DataFrame
Notes
The ‘exact’ method performs deduplication by hashing each entry in the specified column with md5 and removing duplicates.
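As a rough illustration of the ‘exact’ approach, the sketch below hashes each string with md5 and keeps only the first entry seen for each digest. It works on a plain Python list rather than a DataFrame, and the helper names `md5_hex` and `dedup_exact` are hypothetical. Keeping the first occurrence is an assumption made for the example, not something this page states about xorbits.

```python
import hashlib


def md5_hex(text):
    """Return the hexadecimal md5 digest of a string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(texts):
    """Keep the first occurrence of each distinct text, compared by md5 digest.

    Assumption: first-occurrence-wins ordering; illustrative only.
    """
    seen = set()
    kept = []
    for text in texts:
        digest = md5_hex(text)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


rows = ["alpha beta", "gamma delta", "alpha beta", "epsilon", "gamma delta"]
unique_rows = dedup_exact(rows)
print(unique_rows)
```

With a pandas DataFrame, the same idea can be expressed by mapping the digest function over the column and filtering with `Series.duplicated`.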
The ‘minhash’ method uses a combination of MinHash and MinHashLSH for efficient calculation of Jaccard similarity and identification of duplicates. This process involves hashing text to a finite set of integers (hash values), and then comparing these hash values to find duplicates.
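To make the idea concrete, here is a minimal, self-contained sketch of MinHash-based Jaccard estimation using only the standard library. It is not the xorbits implementation: the word-level n-grams, the SHA-1 base hash, the affine permutation scheme, and all function names are assumptions chosen for illustration. The `n`, `num_perm`, and `seed` arguments loosely mirror the `ngrams`, `num_perm`, and `seed` parameters described above.

```python
import hashlib
import random

_MERSENNE_PRIME = (1 << 61) - 1
_MAX_HASH = (1 << 32) - 1


def ngram_set(text, n=5):
    """Return the set of word-level n-grams in text (assumed tokenization: split on whitespace)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingles, num_perm=128, seed=42):
    """Build a MinHash signature: one minimum per seeded affine hash function."""
    if not shingles:
        raise ValueError("need at least one n-gram; the text is too short")
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, _MERSENNE_PRIME), rng.randrange(0, _MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    # Map every n-gram to a 32-bit integer once, then re-hash it num_perm times.
    base = [
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:4], "little")
        for s in shingles
    ]
    return [
        min(((a * h + b) % _MERSENNE_PRIME) & _MAX_HASH for h in base)
        for a, b in params
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


tokens = ["w%d" % i for i in range(60)]
text_a = " ".join(tokens[:50])
text_b = " ".join(tokens[10:60])
set_a, set_b = ngram_set(text_a), ngram_set(text_b)
exact = jaccard(set_a, set_b)
approx = estimate_jaccard(
    minhash_signature(set_a, num_perm=256), minhash_signature(set_b, num_perm=256)
)
print(exact, approx)
```

Both signatures must be built with the same `num_perm` and `seed`, otherwise their positions are not comparable. More permutations reduce the variance of the estimate at the cost of more hashing.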
The optimal parameters for the number of bands B and rows R per band are automatically calculated based on the provided similarity threshold and number of permutations, to balance the trade-off between precision and recall.
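One common way to pick the band and row counts, shown here purely as an illustration, is to search every pair with `B * R <= num_perm` and minimize a weighted sum of the false-positive and false-negative areas under the LSH candidate-probability curve `1 - (1 - s**R)**B`. The sketch below does this with a simple midpoint-rule integration. The equal weights, the integration step count, and the function names are assumptions; this page does not confirm that xorbits computes its parameters in exactly this way.

```python
def candidate_probability(s, b, r):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def _integrate(f, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def weighted_error(threshold, b, r, fp_weight=0.5, fn_weight=0.5):
    """Weighted sum of false-positive and false-negative areas for (b, r)."""
    # Below the threshold, any candidate match counts as a false positive.
    fp = _integrate(lambda s: candidate_probability(s, b, r), 0.0, threshold)
    # Above the threshold, any missed match counts as a false negative.
    fn = _integrate(lambda s: 1.0 - candidate_probability(s, b, r), threshold, 1.0)
    return fp_weight * fp + fn_weight * fn


def optimal_bands_rows(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Exhaustively search (bands, rows) with bands * rows <= num_perm."""
    best_pair = (1, 1)
    best_error = float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            error = weighted_error(threshold, b, r, fp_weight, fn_weight)
            if error < best_error:
                best_pair, best_error = (b, r), error
    return best_pair


bands, rows = optimal_bands_rows(threshold=0.7, num_perm=128)
print(bands, rows)
```

Raising `fp_weight` relative to `fn_weight` favors precision (fewer spurious candidates), while the reverse favors recall.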
Examples
>>> import numpy as np
>>> import pandas as pd  # assumed: plain pandas, per the signature above
>>> from xorbits.experimental import dedup
>>> words = list("abcdefghijklmnopqrstuvwxyz")
>>> df = pd.DataFrame(
...     {
...         "text": [
...             " ".join(["".join(np.random.choice(words, 5)) for i in range(50)])
...             for _ in np.arange(10)
...         ]
...         * 2,
...     }
... )
>>> res = dedup(df, col="text", method="exact")  # for 'exact' method
>>> res = dedup(
...     df, col="text", method="minhash",
...     threshold=0.8, num_perm=128, min_length=5, ngrams=5, seed=42,
... )  # for 'minhash' method