Fetch unique words in a web page (in 7 lines of Python)

I got a chance to enroll in some Python training at work and had a nice time with it. I’ve tried getting into other languages on my own before, and this has been my most refreshing experience so far. The three days of training were enough to show me how insanely easy it is to get productive in Python.

A quick example we saw was to:

  • Get the contents of a web page, and
  • Find all the unique words in it.
This simple example introduces the usage of lists, sets, modules and loops. The code looks like this:

import urllib2
words=[]
for line in urllib2.urlopen('http://slashdot.org'):
    words.extend(line.split(' '))
print 'number of words is', len(words)
uniq=set(words)
print 'number of unique words is', len(uniq)
That’s it! The whole job takes only seven lines. Running it gives output like this:
C:\Python27\mypy>19.py
number of words is 13297
number of unique words is 3672

The first line imports the urllib2 module. 
The next line initializes an empty list.
The third line is where most of the magic happens. A for loop iterates over the file-like object that urllib2.urlopen returns when we ask it to open Slashdot’s home page, which hands us the page one line at a time. For each line, we split the content on spaces into individual words and use extend to add them to our list.
That gives us all the words in the page, and we print how many there are.
Python has a built-in set data structure. Copying the list into a set removes repeated elements automatically, since a set holds each value only once.
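To see the split, extend and set steps without touching the network, here is a small sketch that runs the same logic on a few made-up lines. The sample text and the name sample_lines are assumptions for illustration, not part of the course example. The print calls are written so they behave the same in Python 2.7 and Python 3.

```python
# Hypothetical sample input standing in for the lines of a web page.
sample_lines = ["the quick brown fox", "jumps over the lazy dog", "the end"]

words = []
for line in sample_lines:
    # split(' ') returns a list of words; extend() adds each one to words.
    words.extend(line.split(' '))

# Building a set from the list keeps only one copy of each word.
uniq = set(words)

print("number of words is %d" % len(words))
print("number of unique words is %d" % len(uniq))
```

Had we used append instead of extend, words would hold three lists rather than eleven separate strings.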
And that’s it, we’re done. Of course, there is always more to do (such as not counting HTML tags), but this was probably my favourite example from the course, because it captured how easy it is to get seemingly complicated tasks done in Python.
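For the curious, here is one way the HTML tags could be left out of the count. This is only a rough sketch and not something from the course: it assumes Python 3, where the standard library parser lives in html.parser (in Python 2.7 the module is called HTMLParser). The class name TextOnly and the sample page string are made up for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the words found between tags and ignores the tags themselves."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.words = []

    def handle_data(self, data):
        # Called with the text between tags. split() with no argument
        # splits on any whitespace and drops empty strings.
        self.words.extend(data.split())


# Hypothetical snippet standing in for a downloaded page.
page = "<html><body><h1>Hello world</h1><p>Hello again, <b>world</b></p></body></html>"

parser = TextOnly()
parser.feed(page)
parser.close()  # flush any text still buffered by the parser

print("number of words is %d" % len(parser.words))
print("number of unique words is %d" % len(set(parser.words)))
```

This still counts whatever sits inside script and style elements, and punctuation stays attached to words ("again," is not the same as "again"), so a real page would need a bit more cleanup.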