xorbits.experimental.dedup#

xorbits.experimental.dedup(df: pandas.core.frame.DataFrame, col: str, method: str = 'minhash', **kws) → pandas.core.frame.DataFrame[source]#

Applies deduplication on a DataFrame based on the chosen method.

This function provides two deduplication methods: exact matching and MinHash-based matching. The exact method uses md5 hashing to identify duplicates, while the MinHash-based method uses MinHash and MinHashLSH to identify and remove duplicates based on Jaccard similarity. The MinHash-based method generates hash values for a specified column of the DataFrame, computes the similarity between these hash values, and then removes the rows determined to be duplicates according to the provided Jaccard similarity threshold.

Parameters
  • df (pd.DataFrame) – The DataFrame to deduplicate.

  • col (str) – The column of the DataFrame on which to calculate hash values.

  • method (str, default 'minhash') – The method for deduplication. Options include ‘exact’ and ‘minhash’.

Additional Parameters for MinHash

  • threshold (float, default 0.7) – The Jaccard similarity threshold to use in the MinHashLSH.

  • num_perm (int, default 128) – The number of permutations to use in the MinHash.

  • min_length (int, default 5) – The minimum number of tokens to use in the MinHash. Texts shorter than this value will be filtered out.

  • ngrams (int, default 5) – The size of ngram to use in the MinHash.

  • seed (int, default 42) – The seed for the random number generator.

Returns

The DataFrame after applying the chosen deduplication method.

Return type

DataFrame

Notes

The ‘exact’ method performs deduplication by hashing each entry in the specified column with md5 and removing duplicates.
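The exact strategy can be sketched in a few lines of pandas. Note that `exact_dedup` below is a hypothetical helper written for illustration, not the library's internal implementation:

```python
# Hypothetical sketch of the 'exact' strategy: hash each entry with md5 and
# keep only the first occurrence of each digest.
import hashlib

import pandas as pd

def exact_dedup(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # Map each cell to its md5 hex digest, then drop rows whose digest
    # has already been seen (duplicated() marks repeats, keeping the first).
    digests = df[col].map(lambda s: hashlib.md5(s.encode("utf-8")).hexdigest())
    return df[~digests.duplicated()]

df = pd.DataFrame({"text": ["foo bar", "baz qux", "foo bar"]})
print(exact_dedup(df, "text"))  # row 2 is an exact duplicate of row 0 and is dropped
```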

The ‘minhash’ method uses a combination of MinHash and MinHashLSH for efficient calculation of Jaccard similarity and identification of duplicates. This process involves hashing text to a finite set of integers (hash values), and then comparing these hash values to find duplicates.
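The core MinHash idea can be illustrated as follows. The helpers `minhash_signature` and `jaccard_estimate` are hypothetical and written for illustration only; production implementations (and xorbits itself) typically use random linear permutations rather than seeded md5, but the principle is the same:

```python
# Illustrative sketch: shingle text into character n-grams, hash each shingle
# under several seeded hash functions, and keep the per-seed minimum. The
# fraction of matching signature slots estimates Jaccard similarity.
import hashlib

def minhash_signature(text, num_perm=32, ngrams=5):
    shingles = {text[i:i + ngrams] for i in range(len(text) - ngrams + 1)}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    # Each slot matches with probability equal to the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical texts share every signature slot, so their estimate is 1.0; texts with no shingles in common almost surely match in none.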

The optimal parameters for the number of bands B and rows R per band are automatically calculated based on the provided similarity threshold and number of permutations, to balance the trade-off between precision and recall.
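A simplified sketch of that search follows. It mirrors the standard approach used by libraries such as datasketch (the exact weighting in xorbits may differ): for each candidate (B, R) with B·R ≤ num_perm, integrate the LSH collision curve P(s) = 1 − (1 − s^R)^B to estimate false-positive and false-negative areas, and keep the pair with the lowest weighted error:

```python
# Sketch of the band/row parameter search, assuming equal weighting of
# false positives and false negatives; not the library's exact code.
def _integrate(f, lo, hi, steps=100):
    # Midpoint-rule numerical integration.
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    best, min_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            def prob(s, b=b, r=r):
                # Probability two items with Jaccard similarity s collide.
                return 1.0 - (1.0 - s ** r) ** b
            fp = _integrate(prob, 0.0, threshold)                      # collisions below threshold
            fn = _integrate(lambda s: 1.0 - prob(s), threshold, 1.0)   # misses above threshold
            error = fp * fp_weight + fn * fn_weight
            if error < min_error:
                best, min_error = (b, r), error
    return best
```

A small B with large R makes the curve steep but shifted right (fewer false positives, more false negatives); a large B does the opposite, which is exactly the trade-off the search balances.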

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from xorbits.experimental import dedup
>>> words = list("abcdefghijklmnopqrstuvwxyz")
>>> df = pd.DataFrame(
...     {
...         "text": [
...             " ".join(["".join(np.random.choice(words, 5)) for i in range(50)])
...             for _ in np.arange(10)
...         ]
...         * 2,
...     }
... )
>>> res = dedup(df, col="text", method="exact") # for 'exact' method
>>> res = dedup(
...     df, col="text", method="minhash", threshold=0.8,
...     num_perm=128, min_length=5, ngrams=5, seed=42,
... ) # for 'minhash' method