Introduction: Graph Instructable Views With Python Screen Scraping

About: For now see me at: http://www.opencircuits.com/User:Russ_hensel

If you want to see how your instructables have done over time, you can look at them with the somewhat flaky stats tab on your “you page” ( sorry, pros only ).

But if you would like more control and the ability to get data from other people's instructables, and you like Python, you might like this Python screen scraping application.

So what is screen scraping? It is an automated process that goes to a website, finds a page, and extracts some piece of information from it.

This instructable uses Python to do the task, and is a reasonable introduction to Python ( a great language ) and some of its graphing packages. Python runs on many operating systems, so whether you are a Mac person, a Windows person, an Ubuntu person, or a Pi person you should be fine.

Step 1: Tools and Materials

I run Windows as my main OS, so these directions are oriented toward it, but they should be adaptable to other environments.

The programs are written in Python 2.7, so you will need to install a compatible version of Python. Version 2.7 is the obvious choice; 2.6 should be fine, and 3.x should not take too much modification.

I like IDEs and would strongly recommend Spyder, which is part of the Anaconda download ( it's free! ). Not only do you get a basic Python installation, but you also get a lot of extensions that are useful for scientific and engineering applications.

Python is a very popular language, and the web is full of help and documentation, including a good number of instructables ( as usual, some good, some not ).

You will need to download the zip file linked to in this instructable. The other non-picture files need not be downloaded; they are here so you can click on them and read without downloading.

Step 2: Overview and Installation

There are a bunch of different files involved in this project. I have zipped together a working set and uploaded it here. To install, download the zip file and unzip everything into one directory. I have also uploaded some files without zipping so that you can click on them and read without downloading. They are all included in the zip file and need not be downloaded separately ( also, the zipped files may be a bit more up to date ).

You will need to install Python if it is not already installed. The web is full of information on this. On the Pi, Python is preinstalled. On the PC I recommend the basic Python for people without much experience in programming and Spyder ( from http://docs.continuum.io/anaconda/ ) for those who are experienced.

To use the programs I highly recommend that you run them first from a development environment, and work towards running them from a file manager or desktop icon later. In these directions I will assume you are working from a development environment like Spyder. Some steps will depend on your OS and other system details. I will only address Windows in this document.

Much of the explanation of the programs is in their comments. I will not try to pull that text into the body of the instructable, so click on the files and read them: this is where much of the instructable content is.

The programs here rely heavily on matplotlib, and may require other Python extensions ( Spyder includes almost all of this by default, so it is a good installation choice ).

How It Works

The Python program scrape_views.py collects the data from instructables and puts it into text files for later graphing. This is controlled by a file called urllist.txt, which has the URLs ( and some other information ) of the pages to be scraped. The program scrape_views.py needs to be run on some regular basis to acquire data for the graphing program, graph_views.py.
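As a rough illustration of the scraping step, the sketch below pulls a view count out of a page's HTML with a regular expression and appends it, with a timestamp, to a text file. It is not the code from scrape_views.py: the HTML snippet, the `view-count` class name, the function names, and the tab-separated file layout are all assumptions made up for the example, and it works on a canned string rather than fetching a live page. It is written to run under Python 3.

```python
import re
import time

# Hypothetical markup: the real Instructables page layout is not shown in
# this instructable and will differ.
SAMPLE_HTML = '<div class="stats"><span class="view-count">12,345</span> views</div>'

# Assumed pattern: digits (with optional commas) inside the made-up span.
VIEW_PATTERN = re.compile(r'class="view-count">([\d,]+)<')


def parse_views(html):
    """Return the view count found in html as an int, or None if absent."""
    match = VIEW_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def append_sample(path, views, timestamp=None):
    """Append one 'timestamp<TAB>views' line to the data file at path."""
    if timestamp is None:
        timestamp = time.time()
    with open(path, "a") as data_file:
        data_file.write("%f\t%d\n" % (timestamp, views))


views = parse_views(SAMPLE_HTML)
print(views)  # 12345
```

Returning None when the pattern is not found, instead of raising, lets a caller log the page as "format changed" and carry on with the next URL.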

The second program, graph_views.py, reads the text files with the data and graphs the data ( and optionally saves the graphs to .png files ). Control of the graphing is again via urllist.txt. Currently there are 3 different styles of graphs; read the files graph_views.py and urllist.txt for more info.
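The graphing step could look roughly like the sketch below. It is only an illustration, not the code in graph_views.py: it assumes each data line holds a timestamp and a view count separated by a tab, and the function names are invented. The matplotlib import is placed inside the plotting function so the parsing part runs even where matplotlib is missing.

```python
def load_series(lines):
    """Turn 'timestamp<TAB>views' lines into two lists: days and views."""
    days, views = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stamp_text, views_text = line.split("\t")
        days.append(float(stamp_text) / 86400.0)  # seconds -> days since epoch
        views.append(int(views_text))
    return days, views


def plot_series(days, views, title, png_path=None):
    """Plot views against elapsed days; optionally save the figure as a .png."""
    import matplotlib.pyplot as plt  # imported here so load_series works without it

    fig, ax = plt.subplots()
    ax.plot([d - days[0] for d in days], views, marker="o")
    ax.set_xlabel("days since first sample")
    ax.set_ylabel("views")
    ax.set_title(title)
    if png_path is not None:
        fig.savefig(png_path)
    plt.show()


# Made-up sample data, one reading per day.
sample_lines = ["86400.0\t100", "172800.0\t150", "", "259200.0\t225"]
days, views = load_series(sample_lines)
print(days, views)  # [1.0, 2.0, 3.0] [100, 150, 225]
```

With matplotlib installed, a call such as plot_series(days, views, "My instructable", "views.png") would draw the graph and save it to a file.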

The programs use the Python console for output, so keep an eye on it.

I have included a set of files that has data from some of Instructables' most viewed topics. You can keep different setups in different directories for scraping different sets of pages.

There is a log file: views.log

For some optional control over the programs, command line arguments may be included. Look at the batch files, at the section of each program that processes the command line ( get_args() ), and at the bottom of each *.py file for more comments.

Step 3: Run It ( and More Installation )

After you have a working Python installation, and have installed the files from the instructable, you should be able to run scrape_views.py. Open it in your development environment, scroll to the bottom of the file, check the comments, make any required modifications, and launch it.

Watch the Python console output; it should describe what is happening. It may work or it may fail. If it fails, it is probably because your Python installation is missing something that the program needs. Expect this, and if there are errors see the next section.

With the setup file included in the instructable, the program should visit some of Instructables' top instructables, collect the data, and close. Watch the console.

Once scrape_views.py is working, try graph_views.py. Follow the directions as with scrape_views.py. When you get the graphs, your Python setup is complete.

Step 4: Dealing With Errors ( and More Installation )

Some errors will be logged ( views.log ), others will just show up on the Python console. Spyder offers the friendliest environment for dealing with errors, so again I strongly suggest using it. When you first run the programs you may have errors due to missing components in your Python environment; Python will complain to the console.

Installation Issues

Python has a huge number of optional components. If needed ones are missing you will get errors and then need to install them.

Look at the Python Package Index ( https://pypi.python.org/pypi )

You can get at these and install them using a program called pip ( normally a part of the standard Python install, but not always on the Pi ). See: http://dubroy.com/blog/so-you-want-to-install-a-py...
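One way to find out what is missing before launching the programs is a small check like the sketch below. It is not part of the downloaded files: the function name is invented, and the module list is an assumption that names only matplotlib ( the one extension this instructable mentions ), so extend it with whatever the import errors on your console report. Note that the name you import and the name you give to pip are usually, but not always, the same.

```python
import importlib

# Assumed list: matplotlib is named in this instructable; add whatever
# else the import errors on your console mention.
REQUIRED_MODULES = ["matplotlib"]


def find_missing(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


for name in find_missing(REQUIRED_MODULES):
    print("Missing module %s, try: pip install %s" % (name, name))
```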

Other Issues

The program has been fairly well tested but is still very vulnerable to errors in its control file, or to getting html that is not what it expects. The program may show cryptic errors if the control file's contents are messed up. The file in this instructable has been tested, but be careful when you modify it.

Not on Windows?

The code has only been tested on Windows, and directory operations may differ on other OS's. If you find a problem on another OS, let me know. In any case you may need to tweak this part of the code.

Step 5: Setup for Scraping Other Pages

The program is structured so you can have one or more sets of pages to scrape in subdirectories of the directory where your *.py programs are ( I will call this pyscrape ). As downloaded, the program works directly out of pyscrape. If you want, you can modify the files there ( most importantly urllist.txt ) to scrape other pages, but if you mess up, the whole thing will stop working.

Instead, set up a subdirectory, say mypages, so that you have pyscrape\mypages. Copy all the .txt files over to that directory. Run the .py programs with the command line argument mypages, and you should start processing out of that directory. Then you can modify the files in mypages for whatever pages you want. Additional sets of pages may be created in additional subdirectories. The .bat files are set up for this sort of processing, but they are written for my directory setup and will need to be modified to work for you.
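The idea of picking the working directory from a command line argument can be sketched as follows. This is a minimal illustration and not the get_args() routine from the downloaded programs: the function name resolve_data_dir is invented, and it assumes the first argument, if present, is simply the name of a subdirectory next to the .py files.

```python
import os
import sys


def resolve_data_dir(argv, base_dir):
    """Return the directory holding urllist.txt and the data files.

    With no argument the program directory itself is used; with one
    argument (for example 'mypages') the matching subdirectory is used.
    """
    if len(argv) > 1:
        return os.path.join(base_dir, argv[1])
    return base_dir


# Directory of the running script, standing in for pyscrape.
base = os.path.dirname(os.path.abspath(sys.argv[0]))
data_dir = resolve_data_dir(sys.argv, base)
print(os.path.join(data_dir, "urllist.txt"))
```

Building paths with os.path.join rather than a hard-coded backslash is one way to keep this part working on operating systems other than Windows.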

Step 6: Notes and Comments

  • Please read the files carefully for more information; I have put a lot of effort into them. This applies to the .py files, the urllist.txt file, and the .bat files.
  • Lots more can be done with the graphing and with other ways of analyzing the data. If you do something interesting ( or find a bug ), let me know.
  • The programs started off as fairly simple ones, but have evolved. If there is interest I will supply simpler, but less versatile, versions for beginners. Let me know.
  • You may want to run the files without using your development environment. On my system the .bat files do that. These let me run from a double click in a file manager, or an icon on the desktop. Other OS's have similar facilities, but you will need to figure out the details. If you wish, publish them here as comments.
  • The date system used here is a bit odd: it uses ts = time.time(), which is a timestamp, an instant in time expressed in seconds since 12:00 am, January 1, 1970 UTC ( the epoch ). I convert this to days for graphing. Conversion to values that look like dates would be a nice enhancement.
  • Instructable view counts are a bit odd. I have noticed that when I run different browsers against Instructables, even at the same time, I will get a slightly different number of views. So you may also find that scrape_views.py may have somewhat different view counts than your browser.
  • All screen scrapers are quite sensitive to the format of the web page they are scraping. If instructables changes their format it may well break the program, but you can usually tweak it back into shape without much work.
  • Adapting for other sites: most of this is done in the subroutine parseit( apage ) in scrape_views.py, which has only about 20 lines of executable code, so this should be an easy rewrite.
  • My Python style is a bit random. I have not yet learned or made up my mind on which conventions I will use, and this may cause a bit of confusion. I have tried to keep to fairly simple Python.
  • There is a module called “Beautiful Soup” that does a lot of web manipulation. Look into it for powerful methods; I have stuck to a simpler way of scraping.
  • There are some ideas about enhancements embedded in the code. Read them.
  • There have also been a few instructables around this topic or similar ones. You may find them of interest:

https://www.instructables.com/id/How-to-get-a-graph...

https://www.instructables.com/id/How-to-graph-the-...

https://www.instructables.com/id/How-to-get-a-graph...

https://www.instructables.com/id/Beginning-web-page...

https://www.instructables.com/id/Getting-Stock-Pric...
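
As a small sketch of the timestamp handling mentioned in the notes above, the code below converts a time.time() style value to days since the epoch and also to a readable date, which is the enhancement suggested there. The function names are invented for the example, and the date is formatted in UTC using only the standard time module; the downloaded programs may do the conversion differently.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def timestamp_to_days(ts):
    """Convert seconds since the epoch to (fractional) days since the epoch."""
    return ts / float(SECONDS_PER_DAY)


def timestamp_to_date_text(ts):
    """Format a timestamp as a UTC calendar date such as '2015-01-01'."""
    return time.strftime("%Y-%m-%d", time.gmtime(ts))


ts = 1420070400  # 2015-01-01 00:00:00 UTC
print(timestamp_to_days(ts))       # 16436.0
print(timestamp_to_date_text(ts))  # 2015-01-01
```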