Hey everyone! I have been teaching myself Python, and as an exercise I have tried writing an image grabber for OneManga.com. You give it the path to the comic page you want to start from, and it grabs every page from there to the end of the comic.
The code is below:
import urllib
from xml.dom import minidom
import os
#Get directory to save comics in
savePath = ''
while not os.path.exists(savePath):
    savePath = raw_input('Save in:')
stillGrabbing = True
#Get initial comic path
comicPath = raw_input('Path to first page:')
print '\nBeginning grab...'
while stillGrabbing:
    #Create page URL, get HTML data and create XML object
    nextURL = 'http://www.onemanga.com%s' % comicPath
    pageHTML = urllib.urlopen(nextURL)
    pageDoc = minidom.parse(pageHTML)
    #Search div elements for the comic
    divElements = pageDoc.getElementsByTagName('div')
    foundImage = 0
    for divTag in divElements:
        try:
            if divTag.attributes['class'].value == 'one-page':
                print '\nGrabbing comic from %s'% nextURL
                #Get image URL, split current comic path into name, chapter and page
                imageURL = divTag.getElementsByTagName('img')[0].attributes['src'].value
                foundImage = 1
                [a, comicNameJoined, comicChapter, comicPage, b] = comicPath.split('/')
                comicName = ' '.join(comicNameJoined.split('_'))
                #Create directory if needed, and download image
                if not os.path.exists('%s/%s/Chapter %s/' % (savePath, comicName, comicChapter)):
                    if not os.path.exists('%s/%s/' % (savePath, comicName)):
                        os.mkdir('%s/%s/' % (savePath, comicName))
                    os.mkdir('%s/%s/Chapter %s/' % (savePath, comicName, comicChapter))
                urllib.urlretrieve(imageURL, '%s/%s/Chapter %s/%s.jpg' % (savePath, comicName, comicChapter, comicPage))
                print
                #Get new comic path
                comicPath = divTag.getElementsByTagName('a')[0].attributes['href'].value
                break
        except KeyError:
            #Ignore div tags with no class attribute
            pass
    if not foundImage:
        print '\nFinished grab...'
        stillGrabbing = False
I have trialled this on my localhost web server and it works fine. The problem is, whenever I run it on pages from OneManga.com, I get the following error:
Traceback (most recent call last):
  File "<string>", line 244, in run_nodebug
  File "G:\Comic Webpages\comicRipper.py", line 22, in <module>
    pageDoc = minidom.parse(pageHTML)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 32, column 2
I *think* this means that either the xml parser is misinterpreting tags, or the webpage has mismatched tags that do not have any effect on web browsers like Firefox, which displays the page correctly.
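To illustrate the second possibility, here is a minimal sketch using a made-up snippet (not the real OneManga markup) in which a tag is never closed. The names `SAMPLE_HTML` and `is_well_formed` are purely illustrative. A browser would render markup like this without complaint, but minidom appears to reject it the same way it rejects the real page:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical markup, not taken from OneManga: the <p> is never closed.
SAMPLE_HTML = '<html><body><p>unclosed paragraph</body></html>'
CLEAN_HTML = '<html><body><p>closed paragraph</p></body></html>'


def is_well_formed(markup):
    """Return True if minidom can parse the markup as XML, else False."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        # expat refuses anything that is not well-formed XML,
        # including tags that are opened but never closed.
        return False
    return True


print(is_well_formed(SAMPLE_HTML))
print(is_well_formed(CLEAN_HTML))
```

If that is what is going on, the page only needs one such tag anywhere for the whole parse to fail.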
My question is: is there a way of getting round this? Or is there another way of grabbing elements from HTML? All I need is the ability to get the following element from the page (actual element shown):
<div class="one-page">
  <a href="/Fairy_Tail/135/19/">
    <img class="manga-page" src="http://image.onemanga.com/010/mangas/00000022/000180942/18.jpg" alt="Loading... image010" />
  </a>
</div>
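To make the goal concrete, here is a rough sketch of the kind of extraction I am after, assuming the standard-library HTMLParser module is a reasonable fit. It reports tags as events and does not check that they are balanced. I have not confirmed that it copes with the real OneManga pages, and the class name `OnePageParser` and its attributes are made up for illustration:

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class OnePageParser(HTMLParser):
    """Collects the first link and image inside <div class="one-page">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.div_depth = 0     # greater than 0 while inside the one-page div
        self.next_path = None  # href of the first <a> in that div
        self.image_url = None  # src of the first <img> in that div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # Start counting at the one-page div, and track nested divs.
            if self.div_depth or attrs.get('class') == 'one-page':
                self.div_depth += 1
        elif self.div_depth:
            if tag == 'a' and self.next_path is None:
                self.next_path = attrs.get('href')
            elif tag == 'img' and self.image_url is None:
                self.image_url = attrs.get('src')

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1


# The element from above, preceded by a deliberately unclosed tag.
SAMPLE = '''<p>never closed
<div class="one-page">
<a href="/Fairy_Tail/135/19/">
<img class="manga-page" src="http://image.onemanga.com/010/mangas/00000022/000180942/18.jpg" alt="Loading... image010" />
</a>
</div>'''

parser = OnePageParser()
parser.feed(SAMPLE)
parser.close()
print(parser.next_path)
print(parser.image_url)
```

Something along these lines would give me both the image URL and the path to the next page, which is all the grabber needs.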
Thanks!
(Hopefully this post isn't too long!)
EDIT: Also, comicPath is set to something like /Fairy_Tail/135/19/.
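For reference, this is how a path of that shape splits into the five parts the script unpacks. It is a small standalone sketch using the example path above:

```python
comicPath = '/Fairy_Tail/135/19/'

# The leading and trailing slashes produce an empty string at each end,
# so the split yields five parts.
parts = comicPath.split('/')
print(parts)  # ['', 'Fairy_Tail', '135', '19', '']

_, comicNameJoined, comicChapter, comicPage, _ = parts
comicName = comicNameJoined.replace('_', ' ')
print(comicName)  # Fairy Tail
```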