Extracting a discrete set of values
Today’s I love Python moment is bought to you by set types.
I have a file, XML naturally, the contains a series of transactions. Each transaction has a reference number, but the reference number may be repeated. I want to pull the distinct set of reference numbers from this file. The way I learnt to build up a discrete set of items (many years ago) was to use a dict and set default.
>>> ref_nos = {}
>>> for record in records:
>>> ref_nos.setdefault(record.key, 1)
>>> ref_nos.keys()
But Python has had a sets module since 2.3 and the set standard data type since 2.6 so my knowledge is woefully out of date. The latest way to get the unique values from a sequence looks something like this;
>>> ref_nos = set([record.key for record in records])
I think I should get bonus points for using a list comprehension as well.
Using setdefault() was over-engineering a bit, even back in the day (perhaps you were excited when they introduced that into the language?). Simply ref_nos[record.key] = None would work.
I had reason to profile the dictionary-with-empty-value versus sets again recently and, in Python 2.7, for reasonable sized input lists, sets were slightly *slower* than dictionaries (so some slightly non-intuitive Django code gets to live on in its current form). I’ve never gone deeper to work out why, since you’d think this would be hard to achieve internally: sets should at least be no slower than dictionaries. It’s nothing to worry about until you’re doing it hundreds of times, though.
Comment by Malcolm — 10/11/2011 @ 4:35 pm
Using setdefault() was over-engineering a bit, even back in the day (perhaps you were excited when they introduced that into the language?). Simply ref_nos[record.key] = None would work.
I had reason to profile the dictionary-with-empty-value versus sets again recently and, in Python 2.7, for reasonable sized input lists, sets were slightly *slower* than dictionaries (so some slightly non-intuitive Django code gets to live on in its current form). I’ve never gone deeper to work out why, since you’d think this would be hard to achieve internally: sets should at least be no slower than dictionaries. It’s nothing to worry about until you’re doing it hundreds of times, though.
Comment by Malcolm — 10/11/2011 @ 4:35 pm
If you’re on Python 2.7 or later you can actually just write the set comprehension directly:
>>> {x for x in range(10)}
set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Comment by Nick Coghlan — 10/11/2011 @ 4:55 pm
If you’re on Python 2.7 or later you can actually just write the set comprehension directly:
>>> {x for x in range(10)}
set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Comment by Nick Coghlan — 10/11/2011 @ 4:55 pm
Indeed, kudos for the list comprehension. The downside of the list comprehension is that a full list is constructed first, and then passed to the set() function. A more efficient way would be to use a generator expression:
>>> ref_nos = set(record.key for record in records)
This works more like your woefully out of date example (you didn’t create a list either
Comment by Sybren — 10/11/2011 @ 5:08 pm
Indeed, kudos for the list comprehension. The downside of the list comprehension is that a full list is constructed first, and then passed to the set() function. A more efficient way would be to use a generator expression:
>>> ref_nos = set(record.key for record in records)
This works more like your woefully out of date example (you didn’t create a list either
Comment by Sybren — 10/11/2011 @ 5:08 pm
Well, you would get even more points for using a generator comprehension, i.e:
>>> ref_nos = set(record.key for record in records)
Thus you avoid the step of creating an intermediate list.
Welcome to the present
Comment by Christos Georgiou — 10/11/2011 @ 5:13 pm
Well, you would get even more points for using a generator comprehension, i.e:
>>> ref_nos = set(record.key for record in records)
Thus you avoid the step of creating an intermediate list.
Welcome to the present
Comment by Christos Georgiou — 10/11/2011 @ 5:13 pm
Drop the []s and you get a generator — same set at the end, no intermediate list eating memory and time.
Comment by Chris — 10/11/2011 @ 5:16 pm
Drop the []s and you get a generator — same set at the end, no intermediate list eating memory and time.
Comment by Chris — 10/11/2011 @ 5:16 pm
Or, use a generator expression instead of the list comprehension, and it becomes prettier (IMO):
>>> ref_nos = set(record.key for record in records)
Comment by Peter Ward — 10/11/2011 @ 5:25 pm
Or, use a generator expression instead of the list comprehension, and it becomes prettier (IMO):
>>> ref_nos = set(record.key for record in records)
Comment by Peter Ward — 10/11/2011 @ 5:25 pm
ref_nos = {record.key for record in records}
Comment by Simon — 10/11/2011 @ 5:31 pm
ref_nos = {record.key for record in records}
Comment by Simon — 10/11/2011 @ 5:31 pm
Technically, you can drop the [] and just do
ref_nos = set(record.key for record in records)
as well.
Comment by Darren — 10/11/2011 @ 5:44 pm
Technically, you can drop the [] and just do
ref_nos = set(record.key for record in records)
as well.
Comment by Darren — 10/11/2011 @ 5:44 pm
You can even get additional bonus points for using a generator expression!
That would be:
>>> ref_nos = set(record.key for record in records)
(no list brackets). This prevents a memory-consuming list to first be constructed.
Comment by EOL — 10/11/2011 @ 5:45 pm
You can even get additional bonus points for using a generator expression!
That would be:
>>> ref_nos = set(record.key for record in records)
(no list brackets). This prevents a memory-consuming list to first be constructed.
Comment by EOL — 10/11/2011 @ 5:45 pm
… and you could get even more bonus points for using a generator comprehension:
set(record.key for record in records)
or a set comprehension (2.7/3.x):
{record.key for record in records}
Comment by Tim Golden — 10/11/2011 @ 6:39 pm
… and you could get even more bonus points for using a generator comprehension:
set(record.key for record in records)
or a set comprehension (2.7/3.x):
{record.key for record in records}
Comment by Tim Golden — 10/11/2011 @ 6:39 pm
You may remove the parenthesis around your list comprehension. It will give you a small performance boost and an important memory boost if you have a lot of duplicates.
You can use the python3/python2.7 set comprehension syntax:
>>> ref_nos = {record.key for record in records}
Comment by Guillaum — 10/11/2011 @ 7:46 pm
You may remove the parenthesis around your list comprehension. It will give you a small performance boost and an important memory boost if you have a lot of duplicates.
You can use the python3/python2.7 set comprehension syntax:
>>> ref_nos = {record.key for record in records}
Comment by Guillaum — 10/11/2011 @ 7:46 pm
FWIW, you can remove the brackets to avoid building a temp list (use a generator). The difference is not huge, but still sensible for big sequences:
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set(i for i in collection)’
10000 loops, best of 3: 21.9 usec per loop
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set([i for i in collection])’
10000 loops, best of 3: 25.4 usec per loop
Python 3 also added set comprehensions, which were backported to 2.7:
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘{i for i in collection}’
100000 loops, best of 3: 14.3 usec per loop
but those are *not* available in 2.6
Comment by Masklinn — 10/11/2011 @ 8:16 pm
FWIW, you can remove the brackets to avoid building a temp list (use a generator). The difference is not huge, but still sensible for big sequences:
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set(i for i in collection)’
10000 loops, best of 3: 21.9 usec per loop
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘set([i for i in collection])’
10000 loops, best of 3: 25.4 usec per loop
Python 3 also added set comprehensions, which were backported to 2.7:
> python2.7 -mtimeit -s ‘collection = range(100) + range(50, 150)’ ‘{i for i in collection}’
100000 loops, best of 3: 14.3 usec per loop
but those are *not* available in 2.6
Comment by Masklinn — 10/11/2011 @ 8:16 pm
And you can use set() to remove duplicate comments!
Comment by Etienne — 11/11/2011 @ 1:00 am
And you can use set() to remove duplicate comments!
Comment by Etienne — 11/11/2011 @ 1:00 am
I think Etienne’s comment wins!
Comment by Bryce — 12/11/2011 @ 4:30 am
I think Etienne’s comment wins!
Comment by Bryce — 12/11/2011 @ 4:30 am