[Python] Deduplicate list in original order | 列表去重并且保留原始顺序的最快方法

0. Preparation | 准备工作

Generate a random number sequence has 20th power of 2 integers.


import random
testList = [random.randint(0,10000) for _ in range(2**20)]

1. Basic Method | 基本方法

Loop through the orginal list and add it if it is not in the result list.
This method naturally preserves the order of the original list, but it is extremely slow.


def basic_method(testList):
    result = []
    for i in testList:
        if i not in result:
    return result
t0 = time.time()
rlt1 = basic_method(testList)
print(len(rlt1), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 68.5889s

2. Set Method | 使用集合

set is the built-in function in Python and can be used to remove duplicates.
But after set the original order will be chaotic, so use sort/sorted to reorder the output.
Although this method is much faster than basic one, the list is scanned multiple times.

set是 Python 中的自有方法,可以用于去重。

def set_method(testList):
    return sorted(set(testList), key=testList.index)
t0 = time.time()
rlt2 = set_method(testList)
print(len(rlt2), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 1.4013s

3. Dict Method | 使用字典

This method is the fastest one.
The point is that keys of dict are non-repeating.
Therefore, employing this feature can remove duplicates and retain the original order.

字典dict的 keys 是天然不重复的,利用这种特性可以最快速的去重并且保留原有顺序。

def dict_method(testList):
    rlt5_dict = {}.fromkeys(testList)
    return list(rlt5_dict.keys())
t0 = time.time()
rlt3 = dict_method(testList)
print(len(rlt3), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 0.0735s

Conclusion | 总结

1. Speed: dict_method > set_method > basic_method
2. Give priority to use Python built-in functions (C-based) instead of user-defined ones.
3. Avoid multiple traversal.

1. 速度最快方法是:字典 > 集合+排序 > 循环+条件
2. 尽量优先使用 C 语言底层的 Python 内建函数。
3. 避免多次遍历。

0 0 vote
Article Rating