Help needed with web crawler in C#
joeserhal
03-18-2009, 07:43 AM
Hi there,
I am developing a web crawler in C# that is supposed to look up and visit every available web link within a particular domain. The crawler is retrieving all the links, but I keep getting duplicate links.
How can I avoid that? I probably need to maintain a list of the links that were retrieved and those that were visited, but I don't know exactly how to code that. Any ideas? Existing code would be greatly appreciated!
I can post some code of the crawler if needed.
Thanks for any help provided.
ghell
03-18-2009, 04:37 PM
You can either check if a link exists before adding it to your list, check if you have been there before when you are about to use the link, or remove all duplicates after generating the list.
Removing duplicate entries is fairly simple. The basic way of doing it is to loop through your list and then for each entry (other than itself), check for equality. If they are equal, remove one (or if there is no match, add it to a list).
This will vary in speed depending on the size of your data and how well it is organised. Sorted data can be searched quickly (for duplicates) but takes time to sort, for example. If you only have a few links and don't run the application frequently speed should not be a problem.
You could do something like
foreach (object o in duplicateList)
    if (!uniqueList.Contains(o)) uniqueList.Add(o);
if you don't want to do it in place. The same "only add it if it doesn't already exist in the list" can be used when adding the links in the first place.
The data structure used for unique elements is a Set (in set theory you have operations such as a union, which will union the sets {1,2} and {2,3} into {1,2,3}, not {1,2,2,3}, for example). However, I don't know any class in C# that implements this off the top of my head other than the generic Dictionary<TKey, TValue> class, which uses unique keys.
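For what it's worth, .NET 3.5 and later do ship a set type, HashSet<T>, in System.Collections.Generic. Below is a minimal sketch of using it for the "have I seen this link?" check. The class name VisitedLinks and the helper MarkVisited are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class VisitedLinks
{
    // HashSet<T>.Add returns false when the item is already present,
    // so a single call both checks for and records a link.
    private static readonly HashSet<string> visited = new HashSet<string>();

    // Hypothetical helper: returns true only the first time a URL is seen.
    public static bool MarkVisited(Uri uri)
    {
        return visited.Add(uri.AbsoluteUri);
    }

    public static void Main()
    {
        Console.WriteLine(MarkVisited(new Uri("http://example.com/a.html"))); // True
        Console.WriteLine(MarkVisited(new Uri("http://example.com/a.html"))); // False
    }
}
```

Unlike List<T>.Contains, which scans the whole list, a hash set lookup does not slow down noticeably as the number of links grows.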
joeserhal
03-18-2009, 05:55 PM
Wouldn't it be much more efficient if the crawler checked, prior to visiting a link, whether it has visited it before? I know this will require keeping the list sorted to ease searching through it, but if we don't do this (and simply remove duplicates at the end), the crawler might visit each link more than once; so if there are, say, 300 links, it will spend most of its time looping around, right?
Thanks for the reply by the way! :)
joeserhal
03-19-2009, 06:44 AM
Well, what I've done is create a dictionary that stores the visited web links. Every time I encounter a new link, I check whether it exists in the Dictionary or not. If it's not in it, then I visit that new link and grab all other new links there.
That part is now working.
However, I have a new question :)
I need to grab only webpages that are .htm or .html, i.e., how can I exclude pdf, jpeg, php... pages?
My code for checking links in the dictionary is like this:
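public static List<Uri> GrabLinks(string url, Dictionary<string, int> d)
{
    // Collect the links found on this page.
    List<Uri> links = SqueakTool.FindLinks(new Uri(url));
    foreach (Uri uri in links)
    {
        // Only follow links that have not been recorded yet.
        if (!d.ContainsKey(uri.AbsoluteUri))
        {
            // Record the link before recursing so that cycles terminate.
            d.Add(uri.AbsoluteUri, 0);
            Console.WriteLine(uri.AbsoluteUri);
            // The visited links accumulate in d; the return value is not needed here.
            GrabLinks(uri.AbsoluteUri, d);
        }
    }
    return links;
}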
Any ideas anyone?
ghell
03-19-2009, 02:26 PM
Make an HTTP HEAD request to start with and check the Content-Type header. If it is one of the MIME types you want (such as "text/html"), send the GET request and download the document.
The header also appears in the response to a GET request, so you could just send the GET and cut the stream early if it's not what you want.
Use a System.Net.HttpWebRequest to download the content.
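A rough sketch of the HEAD check, assuming the links are plain http/https URLs. The class ContentTypeCheck and the methods IsHtml and LooksLikeHtmlPage are names invented for this example, and treating every failed request as "not HTML" is a simplification rather than a recommendation.

```csharp
using System;
using System.Net;

public static class ContentTypeCheck
{
    // The header may carry parameters, e.g. "text/html; charset=utf-8",
    // so compare only the part before the first ';'.
    public static bool IsHtml(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return false;
        string mime = contentType.Split(';')[0].Trim();
        return mime.Equals("text/html", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request and inspects the Content-Type of the response.
    // Assumption: any failure (404, timeout, ...) counts as "not HTML".
    public static bool LooksLikeHtmlPage(Uri uri)
    {
        // Skip mailto:, ftp: and other schemes that HttpWebRequest cannot serve.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
            request.Method = "HEAD";
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return IsHtml(response.ContentType);
            }
        }
        catch (WebException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Exercises only the header parsing; no network access needed.
        Console.WriteLine(IsHtml("text/html; charset=utf-8")); // True
        Console.WriteLine(IsHtml("application/pdf"));          // False
    }
}
```

In the crawler, a call such as LooksLikeHtmlPage(uri) could go just before the recursive GrabLinks call, so that PDFs and images are never downloaded. Note that a .php page usually reports "text/html" as well, so it would pass this check.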
A dictionary enforces unique keys, but note that Add throws an ArgumentException if the key already exists, so you can only drop the "contains" check if you assign through the indexer (d[key] = value) instead. In your code the check also decides whether to visit the link, so it is worth keeping. If you don't use the value in the dictionary, switch to a list. Otherwise, you could use the value to store the downloaded content.
SqueakTool seems to already do all of the work for you. I take it the problem with it is that it returns a list of all links, including duplicates. If you can, the best thing to do would be to modify SqueakTool.FindLinks so that it performs the "contains" check first before adding them to its output list then get rid of your GrabLinks method altogether, otherwise just use something like this:
public static List<Uri> RemoveDuplicates(List<Uri> list)
{
    // Keep the first occurrence of each Uri, preserving order.
    List<Uri> unique = new List<Uri>();
    foreach (Uri uri in list)
        if (!unique.Contains(uri))
            unique.Add(uri);
    return unique;
}
which you would then call with just
List<Uri> links = RemoveDuplicates(SqueakTool.FindLinks(new Uri(url)));
Note that you can make that function generic (pretty much just change "Uri" to "T") but because I do not have a compiler in front of me here, I will give you the Uri form rather than risking a mistake in the code I give you.
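For reference, the generic form mentioned above might look like the following sketch. ListUtil is a made-up class name; the method body is the same loop with "Uri" swapped for "T".

```csharp
using System;
using System.Collections.Generic;

public static class ListUtil
{
    // Generic form of RemoveDuplicates: keeps the first occurrence of each
    // element, in order, using the element type's Equals for comparison.
    public static List<T> RemoveDuplicates<T>(List<T> list)
    {
        List<T> unique = new List<T>();
        foreach (T item in list)
            if (!unique.Contains(item))
                unique.Add(item);
        return unique;
    }

    public static void Main()
    {
        List<int> numbers = new List<int> { 1, 2, 2, 3 };
        Console.WriteLine(RemoveDuplicates(numbers).Count); // 3
    }
}
```

The compiler infers T from the argument, so a call with a List<Uri> looks exactly like the non-generic call.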