Conrad Fox




Coarse Programmer's Guide to Scraping: Know your URLs

June 8, 2020

This is the first in what I hope will be a series on "Coarse Programming". The Art of Coarse Acting is a hilarious guide to acting without really being an actor. These guides (if I really do write more than one) are about programming without being a programmer.

If you're here you probably already know what scraping is. I am developing an application called Newsicles that scrapes local news websites and other sources and uses natural language processing to identify entities (people, places, organizations, etc.) and the connections between them. The first step is to scrape newspapers using the Python package Scrapy. I check headlines against a list of keywords and save the articles containing those keywords.
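The keyword check itself is nothing fancy; conceptually it amounts to something like this (the keyword list and function name are mine, purely for illustration):

KEYWORDS = {"secuestro", "diputada", "alcalde"}  # illustrative list

def headline_matches(headline):
    # keep an article if any keyword appears in its headline
    headline = headline.lower()
    return any(keyword in headline for keyword in KEYWORDS)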

When I first started scraping newspapers, I thought all you had to do was give Scrapy a starting URL and wait for it to reel everything in. I knew you'd get a little by-catch in the process, but figured that as long as what you want is reachable from your starting point, you'll eventually get it. For that reason, I didn't bother reading up on the finer points of scraping in the Scrapy documentation. I just threw it a URL and sat back.

After hours of watching the Scrapy logs roll past, I realized that I was doing approximately this:

[Image: scraping_graph.png]

Say an article of interest is ten links deep (equivalent to ten editions of the newspaper back from the most recent). If we assume very conservatively that each page links to three other pages, and I'm checking them all, I will have scraped 29524 pages just to get that article.
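The arithmetic is a geometric series: checking every link means working through every page within nine links of the start before the ten-deep article turns up. A quick sanity check:

# pages at depths 0 through 9, assuming each page links to 3 others
pages_crawled = sum(3 ** depth for depth in range(10))
print(pages_crawled)  # 29524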

From the logs I saw that I was allowing my spider to follow links off-site. Most of the newspapers I am targeting are larded with ads, Facebook links and WhatsApp share buttons, and my spider was trying to parse them all. It was also wasting time following category pages I wasn't interested in: the sports section, social pages, commentaries. Of course, the designers of Scrapy have already anticipated this problem, and let you designate link patterns to ignore and apply constraints like staying on-site... but I didn't discover this until I'd built my own filter.

def remove_ignore_links(links, ignore_links):
	return list(set(links)-set(ignore_links))
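For the record, the built-in approach looks roughly like this: a CrawlSpider with an allowed_domains list to keep the crawl on-site and a LinkExtractor deny pattern for the sections and share buttons you want to skip. The domain and patterns below are placeholders, not my actual targets.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NewsCrawler(CrawlSpider):
    name = "news_crawler"
    allowed_domains = ["dominio.com"]          # off-site requests get dropped
    start_urls = ["https://www.dominio.com/"]

    rules = (
        Rule(
            # skip sections and share links I don't care about
            LinkExtractor(deny=(r"/deportes/", r"/sociales/", r"facebook", r"whatsapp")),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # check the headline against keywords, save the article, etc.
        pass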

While blacklisting links will slow the growth of your trawl, it doesn't change the fact that the crawl is by nature geometric and wants to bloat in every direction possible. I tried limiting my search by adjusting Scrapy's depth parameter. For each page Scrapy parses, it keeps track of how many links it followed to get there. You can tell it to ignore pages whose chain of links exceeds a certain number. I thought I could approximate a date range by starting my crawl from an article with a given date, and use the depth parameter to limit how far back in the past it went. This is what I was getting:

[Image: screen_coverage.png]
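For reference, the depth cap I was adjusting is a single Scrapy setting, set globally or per spider; the value here is only illustrative:

# settings.py
DEPTH_LIMIT = 3  # drop requests more than three links from the start URLs

# or per spider:
#   custom_settings = {"DEPTH_LIMIT": 3}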

Most of my hits were clustering around the start date, but the spider also seemed to be racing toward the other end of the archive. Changing the depth would sometimes limit how far back in time the spider went, but not always, and I was getting hits from up to 10 years ago. Anyone familiar with scraping will understand what is happening, and so did I eventually. Many newspaper sites have archival links that let you skip to the first, second or third page of entries, or even right to the last page. So a link to "/archivo/noticias/page/649" will still be only one link deep from the start page, even if it leads all the way to the first page ever created on the site. Duh.

[Image: screen_pagination.png]

Clearly, my gobble-everything-blindly strategy was not going to work. I needed more finesse in my scraping. I would need to actually visit these sites, find out how their stories are categorized and how their archives are structured. I really did not want to do this. The sites are often messy, they load MBs worth of JavaScript, trackers and ads, and each uses a different template. I didn't want to spend hours browsing each page trying to understand the idiosyncrasies of their structure.

So I built this link browser into the Newsicles suite.

[Image: screen_linkbrowser.png]

This is a simple Django app that takes any URL, parses the page, extracts the links with Beautiful Soup, eliminates duplicates, rewrites relative links in their absolute form, sorts them and displays them in categories. I'm sure many others have created the same service, and in more elegant code, but in the true spirit of the Coarse Programmer I didn't even know this was a problem that needed a solution until I created mine.
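The extraction step itself is bog-standard Beautiful Soup; stripped of the Django plumbing, it's roughly this (function and variable names are mine):

import requests
from bs4 import BeautifulSoup

def extract_links(url):
    # fetch the page and pull out every href, duplicates and all
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

The raw list of hrefs then goes through the categorizing filter below.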

import re
from urllib.parse import urlparse

# Takes a list of urls extracted from a page (links), and the url of the page from which
# they are extracted (base_page). Returns dictionary of categorized urls.
def link_filter(links, base_page):
	domain = urlparse(base_page).scheme + "://" + urlparse(base_page).netloc
	# create an alternate domain to filter against, using either https or http,
	# whichever the original domain isn't
	if domain.startswith("http:"):
		domain_alt = domain.replace("http:", "https:")
	else:
		domain_alt = domain.replace("https:", "http:")
	links = set(links)
	urls = {
		"main": [],
		"non_domain": [],
		"article": [],
		"archive": [],
	}
	for _link in links:
		if not _link.startswith(domain) and not _link.startswith(domain_alt):
			if _link.startswith("http"):
				urls["non_domain"].append(_link)
			else:
				# create absolute url from relative
				_link = domain.rstrip("/") + "/" + _link.lstrip("/")
		if _link.startswith(domain) or _link.startswith(domain_alt):
			# article slugs are long runs of hyphen-joined words
			if len(re.findall(r"(\w-\w)", _link)) > 3:
				urls["article"].append(_link)
			# archive/pagination urls end in /<digits>/
			elif re.findall(r"(/\d+/*$)", _link):
				urls["archive"].append(_link)
			else:
				urls["main"].append(_link)
	# sort each category once, after all links are classified
	for v in urls.values():
		v.sort()
	return urls

The app distinguishes between four categories of link:

  • Articles: These are stories, containing slugs consisting-of-alternating-words-and-hyphens
    • https://www.dominio.com/maria-isabel-martinez-flores-se-integrara-como-diputada-local-la-proxima-semana/
  • Archival: These are the links that take you to the second, third, fourth (and so on) page of any category. I identify them by a trailing / followed by digits
    • https://www.dominio.com/archivo/noticias/page/10/
  • Topic: Any other on-site link. The majority of them lead to categories like "sport", "local", "national" etc.
    • https://www.dominio.com/policiaca
  • Off site: Links that don't begin with the same domain as the starting page
    • https://web.whatsapp.com/send?text=Hallan%20muerto%20a%20comerciante%20reportado%20como%20secuestrado%20en%20Cuichapa%2C%20Veracruz%20http%3A%2F%2F
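Fed a handful of links, the filter buckets them roughly like this (the URLs are abbreviated versions of the examples above):

links = [
    "https://www.dominio.com/maria-isabel-martinez-flores-se-integrara-como-diputada-local-la-proxima-semana/",
    "https://www.dominio.com/archivo/noticias/page/10/",
    "/policiaca",
    "https://web.whatsapp.com/send?text=Hallan%20muerto%20a%20comerciante",
]
categorized = link_filter(links, "https://www.dominio.com/")
# categorized["article"]    -> the hyphenated slug
# categorized["archive"]    -> the /page/10/ link
# categorized["main"]       -> https://www.dominio.com/policiaca (rewritten as absolute)
# categorized["non_domain"] -> the WhatsApp share link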

If I'm only interested in the structure of links and the layout of the site, rather than the actual content, browsing the site with Link Browser is a much more relaxing experience than wading through the cruft of multiple WordPress blogs. I can easily identify which URLs I need and which ones I can add to an ignore list. Best of all, I can identify archival pages.

[Image: screen_linkbrowserlist.png]

The archival pages allow me to specify a list of URLs spanning an approximate date range, based on the page number.

from scrapy import Spider, Request

class ArchiveSpider(Spider):
    name = "archive"

    def start_requests(self):
        base_url = "http://newspaper.com/estatal/"
        urls = [base_url + str(i) for i in range(15, 25)]
        for url in urls:
            yield Request(url, callback=self.parse)

Now, instead of spanning out across the site and beyond, I traverse a defined series of archival pages and scrape what I find there. I only need to set the depth to one, and my crawling now looks more like this:

[Image: scraping_graph2.png]
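Concretely, "setting the depth to one" is just one more attribute on the archive spider from above; a minimal sketch, with the same illustrative names:

from scrapy import Spider

class ArchiveSpider(Spider):
    name = "archive"
    # follow links found on the archive pages, but go no deeper than one hop
    custom_settings = {"DEPTH_LIMIT": 1}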

You're welcome to use Link Browser. Just keep in mind it does all the parsing on the server, so in theory I can see what you're browsing. I've got better things to do, but just letting you know.


Hello. I'm a journalist, radio producer and teacher. I've worked in Latin America and the Caribbean for most of my career. My work has taken me across a minefield, into a gunfight, paddling a dugout canoe and inside the homes of many brave and generous people. I have also produced several major international reporting projects where a large part of my job was recruiting and mentoring local reporters. I love teaching, and besides journalism, I have taught soccer, robotics, anthropology and English.

cwzorro@fastmail.fm






