February 3, 2010 / Nandha kumar

Web Scraping using Python: Gold Rate

Hi all.

This post is about web scraping, that is, extracting the contents of a web page, using Python's BeautifulSoup.

Problem:

http://www.mjdma.com/mjdmaRatechart.aspx

The above link shows the current gold and silver rates in the market every day.

It lists the 22 carat and 24 carat rates of gold and the per-gram rates of silver along with the date.

The task is to find the date on which the gold rate is low, so that the user can purchase gold on that day and get more gold for the money.

Solution:

The problem can be solved using BeautifulSoup, an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work.

More information is available at:

http://www.crummy.com/software/BeautifulSoup/documentation.html

To set up BeautifulSoup, we must first install python-setuptools. This can be done by running the following command in the terminal.

$ sudo apt-get install python-setuptools

After installing it, run

$ sudo easy_install BeautifulSoup

This installs the latest version of BeautifulSoup.

Coding:


# Python 2 script using BeautifulSoup 3

# Import the BeautifulSoup parser
from BeautifulSoup import BeautifulSoup

# Import the regular expression module
import re

# Python module for fetching data across the WWW
# More at: http://docs.python.org/library/urllib.html
import urllib

print("Start reading the website - this may take some time, approx. 2 mins")

# Open the URL and read its contents into a variable
filecontent = urllib.urlopen('http://www.mjdma.com/mjdmaRatechart.aspx').read()

# Convert the contents into a BeautifulSoup parse tree
soupcontent = BeautifulSoup(filecontent)

print("Printing the values which have been parsed")

# Create a list to store the values
gold = []

# Find every element whose id looks like mjdmalist_ctl<N>_Label<N>
# These labels hold the dates and the rates
soup_extract = soupcontent.findAll(id=re.compile("mjdmalist_ctl[0-9]+_Label[0-9]+"))

# Take the contents of the <b> tag inside each label (dropping the
# surrounding HTML) and append it to the gold list
for each_soup_item in soup_extract:
    gold.append(each_soup_item.find("b").contents)

# Every ninth value is a date; the others are rates
for i, item in enumerate(gold):
    value = item[0]
    if i % 9 == 0:
        print(value)
    elif value.isspace():  # a whitespace-only cell means no rate was listed
        print("No Value")
    else:
        # float() replaces eval(), which should not be run on scraped text;
        # this assumes the rates are plain numbers such as 1550.00
        print(float(value))

When run from the console, the above code prints the gold rates date-wise, and blank values are replaced by the string 'No Value'.
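The extract-and-clean steps can also be tried offline, without fetching the live page. The following is only a rough sketch: it uses Python 3's standard-library html.parser as a stand-in for BeautifulSoup, and the SAMPLE_HTML fragment, the span tags, and the helper names (LabelCollector, format_value) are made up for illustration. Only the id pattern is taken from the scraper above; the real page's markup may well differ.

```python
import re
from html.parser import HTMLParser

# Same id pattern as the scraper above.
LABEL_ID = re.compile(r"mjdmalist_ctl[0-9]+_Label[0-9]+")

# Hypothetical fragment; the real page's markup is an assumption here.
SAMPLE_HTML = """
<span id="mjdmalist_ctl01_Label1"><b>01/02/2010</b></span>
<span id="mjdmalist_ctl01_Label2"><b>1550.00</b></span>
<span id="mjdmalist_ctl01_Label3"><b> </b></span>
<span id="unrelated_Label"><b>ignored</b></span>
"""


class LabelCollector(HTMLParser):
    """Collect the text of <b> tags nested in spans whose id matches LABEL_ID."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_label = False
        self._in_bold = False

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id") or ""
        if LABEL_ID.search(element_id):
            self._in_label = True
        elif tag == "b" and self._in_label:
            self._in_bold = True
            self.values.append("")

    def handle_data(self, data):
        if self._in_bold:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False
        elif tag == "span":
            self._in_label = False


def format_value(text):
    """Blank cells become 'No Value', numbers become floats, anything else is kept."""
    if not text.strip():
        return "No Value"
    try:
        return float(text)
    except ValueError:
        return text  # for example a date


collector = LabelCollector()
collector.feed(SAMPLE_HTML)
print([format_value(value) for value in collector.values])
```

Trying float() first and falling back to the raw text is one way to avoid eval() on scraped content; the scraper itself tells dates apart by position instead.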

The code is still under construction. Planned additions:

  • A low rate search for a particular month
  • A histogram of the gold rate in a month, plotted with Matplotlib, that shows the lowest rate days in that month
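As a rough idea of how the low rate search might look, here is a minimal sketch in Python 3. It assumes the scraped values arrive as one flat list with nine fields per date (as in the loop above), that dates are written DD/MM/YYYY, and that the rate of interest sits at index 1 of each row. The date format, the column position, the helper names, and the sample figures are all hypothetical and only there to make the example run.

```python
from datetime import datetime

FIELDS_PER_ROW = 9  # from the scraper: every ninth value is a date


def group_rows(flat_values, width=FIELDS_PER_ROW):
    """Split the flat list of label values into one list per date."""
    return [flat_values[i:i + width] for i in range(0, len(flat_values), width)]


def lowest_rate_day(rows, year, month, rate_index=1):
    """Return (date, rate) for the lowest rate in the given month, or None.

    Assumes DD/MM/YYYY dates in field 0; rate_index is a hypothetical column.
    """
    best = None
    for row in rows:
        if len(row) <= rate_index:
            continue  # incomplete row
        day = datetime.strptime(row[0].strip(), "%d/%m/%Y")
        if (day.year, day.month) != (year, month):
            continue
        raw = row[rate_index]
        if not raw.strip():
            continue  # blank cell, printed as "No Value" by the scraper
        rate = float(raw)
        if best is None or rate < best[1]:
            best = (row[0].strip(), rate)
    return best


def sample_values():
    """Made-up figures: a date, one rate, and seven blank fields per row."""
    pairs = [
        ("01/02/2010", "1550.00"),
        ("02/02/2010", " "),
        ("03/02/2010", "1542.50"),
        ("01/03/2010", "1530.00"),
    ]
    flat = []
    for date, rate in pairs:
        flat += [date, rate] + [" "] * 7
    return flat


rows = group_rows(sample_values())
print(lowest_rate_day(rows, 2010, 2))
```

Grouping the flat list into rows first keeps the search independent of how the values were scraped, so the same rows could later feed the histogram.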

My friend T. Arulalan and I are working together to scrape the gold rate web page.

Arul’s experiences with Linux are at :

http://tuxworld.wordpress.com

Source Code:

http://pastebin.com/f6b3897ad
