10th November, 2011

Extracting a discrete set of values

Filed under: python — admin @ 3:42 pm

Today’s I love Python moment is bought to you by set types.

I have a file, XML naturally, the contains a series of transactions. Each transaction has a reference number, but the reference number may be repeated. I want to pull the distinct set of reference numbers from this file. The way I learnt to build up a discrete set of items (many years ago) was to use a dict and set default.

>>> ref_nos = {}
>>> for record in records:
>>> ref_nos.setdefault(record.key, 1)
>>> ref_nos.keys()

But Python has had a sets module since 2.3 and the set standard data type since 2.6 so my knowledge is woefully out of date. The latest way to get the unique values from a sequence looks something like this;

>>> ref_nos = set([record.key for record in records])

I think I should get bonus points for using a list comprehension as well.

28 Comments

  1. Using setdefault() was over-engineering a bit, even back in the day (perhaps you were excited when they introduced that into the language?). Simply ref_nos[record.key] = None would work.

    I had reason to profile the dictionary-with-empty-value versus sets again recently and, in Python 2.7, for reasonable sized input lists, sets were slightly *slower* than dictionaries (so some slightly non-intuitive Django code gets to live on in its current form). I’ve never gone deeper to work out why, since you’d think this would be hard to achieve internally: sets should at least be no slower than dictionaries. It’s nothing to worry about until you’re doing it hundreds of times, though.

    Comment by Malcolm — 10/11/2011 @ 4:35 pm

  2. Using setdefault() was over-engineering a bit, even back in the day (perhaps you were excited when they introduced that into the language?). Simply ref_nos[record.key] = None would work.

    I had reason to profile the dictionary-with-empty-value versus sets again recently and, in Python 2.7, for reasonable sized input lists, sets were slightly *slower* than dictionaries (so some slightly non-intuitive Django code gets to live on in its current form). I’ve never gone deeper to work out why, since you’d think this would be hard to achieve internally: sets should at least be no slower than dictionaries. It’s nothing to worry about until you’re doing it hundreds of times, though.

    Comment by Malcolm — 10/11/2011 @ 4:35 pm

  3. If you’re on Python 2.7 or later you can actually just write the set comprehension directly:

    >>> {x for x in range(10)}
    set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    Comment by Nick Coghlan — 10/11/2011 @ 4:55 pm

  4. If you’re on Python 2.7 or later you can actually just write the set comprehension directly:

    >>> {x for x in range(10)}
    set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    Comment by Nick Coghlan — 10/11/2011 @ 4:55 pm

  5. Indeed, kudos for the list comprehension. The downside of the list comprehension is that a full list is constructed first, and then passed to the set() function. A more efficient way would be to use a generator expression:

    >>> ref_nos = set(record.key for record in records)

    This works more like your woefully out of date example (you didn’t create a list either ;-)

    Comment by Sybren — 10/11/2011 @ 5:08 pm

  6. Indeed, kudos for the list comprehension. The downside of the list comprehension is that a full list is constructed first, and then passed to the set() function. A more efficient way would be to use a generator expression:

    >>> ref_nos = set(record.key for record in records)

    This works more like your woefully out of date example (you didn’t create a list either ;-)

    Comment by Sybren — 10/11/2011 @ 5:08 pm

  7. Well, you would get even more points for using a generator comprehension, i.e:

    >>> ref_nos = set(record.key for record in records)

    Thus you avoid the step of creating an intermediate list.

    Welcome to the present :)

    Comment by Christos Georgiou — 10/11/2011 @ 5:13 pm

  8. Well, you would get even more points for using a generator comprehension, i.e:

    >>> ref_nos = set(record.key for record in records)

    Thus you avoid the step of creating an intermediate list.

    Welcome to the present :)

    Comment by Christos Georgiou — 10/11/2011 @ 5:13 pm

  9. Drop the []s and you get a generator — same set at the end, no intermediate list eating memory and time.

    Comment by Chris — 10/11/2011 @ 5:16 pm

  10. Drop the []s and you get a generator — same set at the end, no intermediate list eating memory and time.

    Comment by Chris — 10/11/2011 @ 5:16 pm

  11. Or, use a generator expression instead of the list comprehension, and it becomes prettier (IMO):

    >>> ref_nos = set(record.key for record in records)

    Comment by Peter Ward — 10/11/2011 @ 5:25 pm

  12. Or, use a generator expression instead of the list comprehension, and it becomes prettier (IMO):

    >>> ref_nos = set(record.key for record in records)

    Comment by Peter Ward — 10/11/2011 @ 5:25 pm

  13. ref_nos = {record.key for record in records}

    :)

    Comment by Simon — 10/11/2011 @ 5:31 pm

  14. ref_nos = {record.key for record in records}
    :)

    Comment by Simon — 10/11/2011 @ 5:31 pm

  15. Technically, you can drop the [] and just do

    ref_nos = set(record.key for record in records)

    as well.

    Comment by Darren — 10/11/2011 @ 5:44 pm

  16. Technically, you can drop the [] and just do

    ref_nos = set(record.key for record in records)

    as well.

    Comment by Darren — 10/11/2011 @ 5:44 pm

  17. You can even get additional bonus points for using a generator expression! :)
    That would be:

    >>> ref_nos = set(record.key for record in records)

    (no list brackets). This prevents a memory-consuming list to first be constructed.

    Comment by EOL — 10/11/2011 @ 5:45 pm

  18. You can even get additional bonus points for using a generator expression! :)
    That would be:

    >>> ref_nos = set(record.key for record in records)

    (no list brackets). This prevents a memory-consuming list to first be constructed.

    Comment by EOL — 10/11/2011 @ 5:45 pm

  19. … and you could get even more bonus points for using a generator comprehension:

    set(record.key for record in records)

    or a set comprehension (2.7/3.x):

    {record.key for record in records}

    Comment by Tim Golden — 10/11/2011 @ 6:39 pm

  20. … and you could get even more bonus points for using a generator comprehension:

    set(record.key for record in records)

    or a set comprehension (2.7/3.x):

    {record.key for record in records}

    Comment by Tim Golden — 10/11/2011 @ 6:39 pm

  21. You may remove the parenthesis around your list comprehension. It will give you a small performance boost and an important memory boost if you have a lot of duplicates.

    You can use the python3/python2.7 set comprehension syntax:

    >>> ref_nos = {record.key for record in records}

    Comment by Guillaum — 10/11/2011 @ 7:46 pm

  22. You may remove the parenthesis around your list comprehension. It will give you a small performance boost and an important memory boost if you have a lot of duplicates.

    You can use the python3/python2.7 set comprehension syntax:

    >>> ref_nos = {record.key for record in records}

    Comment by Guillaum — 10/11/2011 @ 7:46 pm

  23. FWIW, you can remove the brackets to avoid building a temp list (use a generator). The difference is not huge, but still sensible for big sequences:

    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set(i for i in collection)’
    10000 loops, best of 3: 21.9 usec per loop
    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set([i for i in collection])’
    10000 loops, best of 3: 25.4 usec per loop

    Python 3 also added set comprehensions, which were backported to 2.7:

    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘{i for i in collection}’
    100000 loops, best of 3: 14.3 usec per loop

    but those are *not* available in 2.6

    Comment by Masklinn — 10/11/2011 @ 8:16 pm

  24. FWIW, you can remove the brackets to avoid building a temp list (use a generator). The difference is not huge, but still sensible for big sequences:

    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set(i for i in collection)’
    10000 loops, best of 3: 21.9 usec per loop
    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set([i for i in collection])’
    10000 loops, best of 3: 25.4 usec per loop

    Python 3 also added set comprehensions, which were backported to 2.7:

    > python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘{i for i in collection}’
    100000 loops, best of 3: 14.3 usec per loop

    but those are *not* available in 2.6

    Comment by Masklinn — 10/11/2011 @ 8:16 pm

  25. And you can use set() to remove duplicate comments! ;-)

    Comment by Etienne — 11/11/2011 @ 1:00 am

  26. And you can use set() to remove duplicate comments! ;-)

    Comment by Etienne — 11/11/2011 @ 1:00 am

  27. I think Etienne’s comment wins!

    Comment by Bryce — 12/11/2011 @ 4:30 am

  28. I think Etienne’s comment wins!

    Comment by Bryce — 12/11/2011 @ 4:30 am

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress