Screen Scraping Your Way Into RSS
by: Dennis Pallett
Introduction
RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into it. However, there are still a lot of websites that do not have RSS feeds.
If you still want to be able to check those websites in your favourite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it’s mostly used to steal content from other websites.
I personally believe that in this case, automatically generating an RSS feed, screen scraping is not a bad thing. Now, on to the code!
Getting the content
For this article, we’ll use PHPit as an example, despite the fact that PHPit already has RSS feeds (http://www.phpit.net/syndication/).
We want to generate an RSS feed from the content listed on the front page (http://www.phpit.net). The first step in screen scraping is getting the complete page. In PHP this can be done very easily by using implode("", file("[the url here]")); if your web host allows it (that is, if allow_url_fopen is enabled). If you can’t use file() you’ll have to use a different method of getting the page, e.g. using the cURL library (http://www.php.net/curl).
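A minimal sketch of that fallback might look like the following. The helper name fetch_page() is hypothetical (it is not part of PHP or of the demo script), and the 10 second timeout is an arbitrary assumption. It tries file_get_contents() first and only falls back to cURL when that fails and the cURL extension is loaded.

```php
<?php
// Hypothetical helper (not from the original script): fetch a page as a string.
// Returns the page contents, or false if every method failed.
function fetch_page($url)
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        $data = @file_get_contents($url);
        if ($data !== false) {
            return $data;
        }
    }

    // Fall back to the cURL extension if it is available.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
        return curl_exec($ch);                          // string on success, false on failure
    }

    return false;
}
```

With a helper like this, the line that loads the page becomes $data = fetch_page($url); followed by a check that $data is not false before any parsing is attempted.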
Now that we have the page available, we can extract the content items from it using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div> elements or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.
For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit, and viewing the source.
Now that we have a pattern, we can get all the content items. The next step is to retrieve the individual information, i.e. url, title, author, text. This can be done by using some more regular expressions and str_replace() on each content item.
By now we have the following code:
<?php
// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));
// Get content items
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);
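To see what preg_match_all() actually returns, here is a small self-contained sketch. The $sample markup is made up for illustration (it only imitates the contentitem pattern described above), and it uses (.*?) with the s modifier as an alternative to the [^`]*? trick, so that the dot also matches newlines.

```php
<?php
// Made-up sample markup that imitates the pattern; not real PHPit output.
$sample = '<div class="contentitem"><h3><a href="/one/">First item</a></h3></div>' . "\n"
        . '<div class="contentitem"><h3><a href="/two/">Second item</a></h3></div>';

// $found[0] holds the full matches (including the <div> tags),
// $found[1] holds only what the first pair of parentheses captured.
$count = preg_match_all('/<div class="contentitem">(.*?)<\/div>/s', $sample, $found);

// Print the index and inner markup of every captured item.
foreach ($found[1] as $i => $inner) {
    echo $i . ': ' . $inner . "\n";
}
```

Note that a non-greedy match stops at the first closing </div> it meets, so a content item that contains nested <div> elements would need a more specific pattern.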
Like I said, the next step is to retrieve the individual information, but first let’s make a start on our feed by setting the appropriate header (text/xml) and printing the channel information.
// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>
<?php
Now it’s time to loop through the items and print their RSS XML. For each item, we first get all the information we need, using more regular expressions and preg_match(). After that, the RSS for the item is printed.
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title (the link text inside the <h3> heading)
    preg_match('/">([^`]*?)<\/a><\/h3>/', $match, $temp);
    $title = $temp[1];
    $title = strip_tags($title);
    $title = trim($title);
    // Second, get url
    preg_match('/<a href="([^`]*?)">/', $match, $temp);
    $url = $temp[1];
    $url = trim($url);
    // Third, get text (everything between <p> and the byline)
    preg_match('/<p>([^`]*?)<span class="byline">/', $match, $temp);
    $text = $temp[1];
    $text = trim($text);
    // Fourth, and finally, get author
    preg_match('/<span class="byline">By ([^`]*?)<\/span>/', $match, $temp);
    $author = $temp[1];
    $author = trim($author);
    // Echo RSS XML
    echo "<item>\n";
    echo " <title>" . strip_tags($title) . "</title>\n";
    echo " <link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo " <description>" . strip_tags($text) . "</description>\n";
    echo " <content:encoded><![CDATA[\n";
    echo $text . "\n";
    echo " ]]></content:encoded>\n";
    echo " <dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "</item>\n";
}
?>
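The loop above assumes that every preg_match() call finds something and that the extracted text is safe to drop straight into XML. A hedged sketch of two small helpers that relax those assumptions follows. The names first_match() and xml_text() are hypothetical, and the $item markup is invented for illustration.

```php
<?php
// Hypothetical helper: return the first captured group, trimmed,
// or an empty string when the pattern does not match.
function first_match($pattern, $subject)
{
    if (preg_match($pattern, $subject, $temp) === 1 && isset($temp[1])) {
        return trim($temp[1]);
    }
    return '';
}

// Hypothetical helper: strip markup, decode any HTML entities, then
// re-escape &, <, > and quotes so the result is safe inside an XML element.
function xml_text($value)
{
    $plain = html_entity_decode(strip_tags($value), ENT_QUOTES, 'ISO-8859-1');
    return htmlspecialchars($plain, ENT_QUOTES, 'ISO-8859-1');
}

// Invented sample item: it has a title link but no byline.
$item = '<h3><a href="/demo/">Tom &amp; <b>Jerry</b></a></h3><p>Some text</p>';

$title  = xml_text(first_match('/">([^`]*?)<\/a><\/h3>/', $item));
$author = first_match('/<span class="byline">By ([^`]*?)<\/span>/', $item);
```

An item with an empty title or url could then simply be skipped inside the loop instead of producing a broken <item> element.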
And finally, the RSS file is closed off.
</channel>
</rss>
That’s all. If you put all the code together, like in the demo script, then you’ll have a perfect RSS feed.
Conclusion
In this tutorial I have shown you how to create an RSS feed for a website that does not yet have one of its own. Though the regular expressions are different for each website, the principle is exactly the same.
One thing I should mention is that you shouldn’t immediately screen scrape a website’s content. E-mail them first about an RSS feed. Who knows, they might set one up themselves, and that would be even better.
Download the sample script at http://www.phpit.net/viewsource.php?url=/demo/screenscrape%20rss/example.php
About The Author
Dennis Pallett is a young tech writer, with much experience in ASP, PHP and other web technologies. He enjoys writing, and has written several articles and tutorials. To find more of his work, look at his websites at http://www.phpit.net, http://www.aspit.net and http://www.ezfaqs.com
Source: High Quality Article Database - 365Articles.com