[Python] Deduplicate list in original order | 列表去重并且保留原始顺序的最快方法

0. Preparation | 准备工作

Generate a random number sequence has 20th power of 2 integers.

生成一个长度为2的20次方的随机数列。

import random
testList = [random.randint(0,10000) for _ in range(2**20)]

1. Basic Method | 基本方法

Loop through the orginal list and add it if it is not in the result list.
This method naturally preserves the order of the original list, but it is extremely slow.

循环查找列表,如果不在结果列表中则添加。
此方法天然保留原列表的顺序,但是速度极慢。

def basic_method(testList):
    result = []
    for i in testList:
        if i not in result:
            result.append(i)
    return result
t0 = time.time()
rlt1 = basic_method(testList)
print(len(rlt1), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 68.5889s

2. Set Method | 使用集合

set is the built-in function in Python and can be used to remove duplicates.
But after set the original order will be chaotic, so use sort/sorted to reorder the output.
Although this method is much faster than basic one, the list is scanned multiple times.

set是 Python 中的自有方法,可以用于去重。
但是它不保留原有顺序,因此需要配合sort/sorted来使用。
虽然这种方法比速度上一种快很多,但是它会对列表进行多次扫描。

def set_method(testList):
    return sorted(set(testList), key=testList.index)
t0 = time.time()
rlt2 = set_method(testList)
print(len(rlt2), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 1.4013s

3. Dict Method | 使用字典

This method is the fastest one.
The point is that keys of dict are non-repeating.
Therefore, employing this feature can remove duplicates and retain the original order.

字典dict的 keys 是天然不重复的,利用这种特性可以最快速的去重并且保留原有顺序。

def dict_method(testList):
    rlt5_dict = {}.fromkeys(testList)
    return list(rlt5_dict.keys())
t0 = time.time()
rlt3 = dict_method(testList)
print(len(rlt3), "| Time Spent:", f"{time.time()-t0:.4f}s")

>> 10001 | Time Spent: 0.0735s

Conclusion | 总结

1. Speed: dict_method > set_method > basic_method
2. Give priority to use Python built-in functions (C-based) instead of user-defined ones.
3. Avoid multiple traversal.

1. 速度最快方法是:字典 > 集合+排序 > 循环+条件
2. 尽量优先使用 C 语言底层的 Python 内建函数。
3. 避免多次遍历。

Published By
Bob

2
说点什么

avatar
1 Comment threads
1 Thread replies
1 Followers
 
Most reacted comment
Hottest comment thread
2 Comment authors
Bobyoyo Recent comment authors
  Subscribe  
最新 最旧 得票最多
提醒
yoyo
游客
yoyo

已阅