Original Url: http://matthewturland.com/2008/03/12/scraping-html-with-dom/
A friend of mine, who shall remain nameless, recently pointed out a post to me on the PHP DZone web site. Noting that the article’s content was misinformed at best and downright ignorant at worst, even when judged purely on the author’s knowledge of PHP as a language, this friend asked that I set the author straight.
I gladly obliged with a comment on the post, having become something of an authority on the topic myself. Unorthodox a practice as web scraping may be, some methodologies for it are clearly better than others. The aforementioned post illustrates many of the ones to avoid, and my comment gives my arguments against them.
Later, I randomly encountered a post on the blog at xml.lt on the topic of web scraping using the DOM extension. This article showcases recommended practices and reasoned arguments against bad (and unfortunately common) alternatives. The author comes across as being significantly more informed on both the language and the application in the article’s content and code examples.
If you’re looking for references on the topic of web scraping with PHP, there’s always the article I wrote for the December 2007 issue of php|architect magazine, of which you can still purchase an electronic copy in PDF format. At some point, I also hope to write a short book on the subject. Until then, if you have related questions, you can generally reach me in the #phpc channel on Freenode, under the nick Elazar. I’m always glad to give out advice on web scraping and PHP, as I’m sure my good friend Jared Folkins (who is also my “Little Sis” from the PHPWomen
Original Url: http://www.xml.lt/Blog/2008/03/11/Scraping+html+with+DOM
Scraping HTML with DOM
2008-03-11 11:28:45 by Martynas Jusevičius
HTML scraping is used to extract structured data from webpages intended for human readers. It is a common way to work around old-school websites that do not provide data feeds in machine-readable formats such as RDF or at least custom XML.
Scraping can be implemented in several ways. Regular expressions (regexps) are probably the most widely used technique. They employ specific syntax rules to define patterns that have to be matched in a string.
However, the regexp approach has several problems. Because HTML source is complex, pattern strings soon become lengthy. The pattern syntax is not trivial and may differ between systems. That leads to complicated and non-intuitive code, where the program logic (such as conditional cases) is hard-coded in the pattern and therefore not obvious.
Moreover, regular expressions operate on a generic string level and have no knowledge of HTML tags or the tree-shaped document model that they form. This means that special care has to be taken with insignificant whitespace (such as line breaks), control characters have to be escaped, and so on. In situations where tree-like structures need to be scraped (for example, nested comment or e-mail threads), sequential matching makes it difficult to figure out an item’s position in the tree, i.e. to relate the item to its parent, if there is one. Recursion would be an appropriate technique in this case; however, regular expressions aren’t really meant for recursive solutions.
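To make the whitespace and attribute problem concrete, here is a minimal sketch. The two markup strings and both patterns are hypothetical, made up for illustration; they only show how a pattern written against one layout of a table row stops matching once the markup is reformatted.

```php
<?php
// Hypothetical markup: the same table row written two slightly different ways.
$compact = '<tr><td>Alice</td><td>alice@example.org</td></tr>';
$messy   = "<tr>\n  <td class=\"name\">Alice</td>\n  <td>alice@example.org</td>\n</tr>";

// A pattern written against the compact form only.
$pattern = '/<tr><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>/';

var_dump(preg_match($pattern, $compact)); // matches
var_dump(preg_match($pattern, $messy));   // fails: line breaks and the class attribute get in the way

// A more tolerant pattern has to spell out every place where whitespace
// or attributes may appear, and it grows harder to read with each one.
$tolerant = '/<tr[^>]*>\s*<td[^>]*>(.*?)<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/s';
preg_match($tolerant, $messy, $m);
print $m[1] . " / " . $m[2] . "\n";
```

The tolerant pattern works for this row, but it is already noticeably longer than the data it extracts, and it still knows nothing about nesting.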
Another approach to scraping is to use the Document Object Model (DOM). It is a standard object model for representing HTML or XML. DOM is usually associated with client-side scripting, but it can be used equally well on the server side: support is provided on many platforms, including Java libraries and a PHP extension. Fortunately, PHP’s DOM extension is able to parse even invalid HTML, which is what a scraper encounters most often.
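As a rough illustration of both points, tolerant parsing and recursion over a tree, the sketch below loads a deliberately sloppy nested list and walks it recursively. The markup and the collectItems() helper are hypothetical, invented for this example; libxml_use_internal_errors() is used only to keep parser warnings about the broken markup out of the output.

```php
<?php
// Hypothetical thread markup: no <html>/<body>, and the last <li> and <ul> are never closed.
$html = '<ul><li>First post<ul><li>Reply A</li><li>Reply B</li></ul></li><li>Second post';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Recursively walk nested lists, indenting each item by its depth in the tree.
function collectItems(DOMElement $list, int $depth = 0): array
{
    $items = [];
    foreach ($list->childNodes as $li) {
        if (!($li instanceof DOMElement) || $li->nodeName !== 'li') {
            continue; // skip text nodes and anything that is not an <li>
        }
        $text = '';
        foreach ($li->childNodes as $child) {
            if ($child instanceof DOMText) {
                $text .= $child->nodeValue; // only the item's own text, not its replies
            }
        }
        $items[] = str_repeat('  ', $depth) . trim($text);
        foreach ($li->childNodes as $child) {
            if ($child instanceof DOMElement && $child->nodeName === 'ul') {
                $items = array_merge($items, collectItems($child, $depth + 1));
            }
        }
    }
    return $items;
}

$items = collectItems($doc->getElementsByTagName('ul')->item(0));
print implode("\n", $items) . "\n";
```

Because each recursive call receives the parent list and its depth, the relationship between an item and its parent is available at every step, which is exactly what sequential pattern matching makes awkward.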
Here is a simple PHP scraper which turns a table with student information into FOAF data:
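class StudentScraper
{
    private $doc = null;

    public function __construct($url)
    {
        $htmlString = file_get_contents($url); // load HTML file as a string
        if ($htmlString === false)
        {
            throw new RuntimeException("Could not load " . $url);
        }
        libxml_use_internal_errors(true); // keep warnings about invalid HTML out of the output
        $this->doc = new DOMDocument();
        $this->doc->loadHTML($htmlString); // load string into DOM document object
        libxml_clear_errors();
    }

    public function process() // return RDF/XML string
    {
        // The markup inside the string literals was lost in this copy of the post;
        // the rdf:RDF, foaf:Person, foaf:name and foaf:mbox tags are a reconstruction.
        $xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n";
        $table = $this->doc->getElementsByTagName("table")->item(0); // table element
        foreach ($table->getElementsByTagName("tr") as $tr) // iterate rows (childNodes would also yield text nodes)
        {
            $cells = $tr->getElementsByTagName("td");
            if ($cells->length < 2)
            {
                continue; // skip header rows and rows without both cells
            }
            $name = trim($cells->item(0)->textContent); // content of first cell
            $eMail = trim($cells->item(1)->textContent); // content of second cell
            $xml .= " <foaf:Person>\n"; // construct RDF/XML
            $xml .= "  <foaf:name>".htmlspecialchars($name)."</foaf:name>\n";
            $xml .= "  <foaf:mbox rdf:resource=\"mailto:".htmlspecialchars($eMail)."\"/>\n";
            $xml .= " </foaf:Person>\n";
        }
        $xml .= "</rdf:RDF>";
        return $xml;
    }
}

$scraper = new StudentScraper("http://some.university.com/students.htm"); // run scraper
print $scraper->process();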
Here we load the contents of the HTML file into a DOM document. Student data is found in a table, which contains a row for each student. Each row contains two cells: one for the name and one for the e-mail address. We iterate through the rows and create one foaf:Person entry per student. When run, the scraper returns an RDF/XML string with FOAF data.
Constructing XML as a plain string is an error-prone practice and should also be done using DOM, but we leave it here for simplicity and clarity.
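For comparison, here is a minimal sketch of how the output side might look when built with DOM instead of string concatenation. The buildFoaf() function, the sample row, and the choice of foaf:name and foaf:mbox for the two cells are assumptions made for this example, not part of the scraper above. The namespace URIs are the standard ones for RDF and FOAF.

```php
<?php
const RDF_NS  = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
const FOAF_NS = 'http://xmlns.com/foaf/0.1/';

// Hypothetical helper: takes [name, e-mail] pairs and returns an RDF/XML string.
function buildFoaf(array $students): string
{
    $out = new DOMDocument('1.0', 'UTF-8');
    $out->formatOutput = true;

    $root = $out->createElementNS(RDF_NS, 'rdf:RDF');
    $out->appendChild($root);
    // Declare the foaf prefix once on the root element.
    $root->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:foaf', FOAF_NS);

    foreach ($students as [$name, $eMail]) {
        $person = $out->createElementNS(FOAF_NS, 'foaf:Person');
        $root->appendChild($person);

        $nameEl = $out->createElementNS(FOAF_NS, 'foaf:name');
        $nameEl->appendChild($out->createTextNode($name)); // escaped by the serializer
        $person->appendChild($nameEl);

        $mbox = $out->createElementNS(FOAF_NS, 'foaf:mbox');
        $mbox->setAttributeNS(RDF_NS, 'rdf:resource', 'mailto:' . $eMail);
        $person->appendChild($mbox);
    }
    return $out->saveXML();
}

// Sample data with a character that would break naive string concatenation.
$xml = buildFoaf([['Tom & Jerry', 'tj@example.org']]);
print $xml;
```

With this approach there is no need to call htmlspecialchars() by hand: escaping and well-formedness are handled when the document is serialized.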