Building a crawler in Rust: Scraping and Parsing HTML

This post is an excerpt from my course Black Hat Rust

Now that we have a fast concurrent crawler, it's time to actually parse the HTML and extract structured data (remember that this process is called scraping).

The plain HTML website we will crawl is CVE Details: the ultimate security vulnerability datasource.

It provides an easy way to search for vulnerabilities by CVE ID.

We will use https://www.cvedetails.com/vulnerability-list/vulnerabilities.html as the start URL: the pagination block at the bottom of that page links to all the other pages listing vulnerabilities.

Extracting structured data

The first step is to identify the data we want. In this case, it's all the information describing a CVE entry:
ch_05/crawler/src/spiders/cvedetails.rs

#[derive(Debug, Clone)]
pub struct Cve {
    name: String,
    url: String,
    cwe_id: Option<String>,
    cwe_url: Option<String>,
    vulnerability_type: String,
    publish_date: String,
    update_date: String,
    score: f32,
    access: String,
    complexity: String,
    authentication: String,
    confidentiality: String,
    integrity: String,
    availability: String,
}
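
The scrape method below returns a Vec<Self::Item>, so Cve is plugged into the crawler as the spider's associated Item type. The actual Spider trait comes from the previous chapter and isn't reproduced in this excerpt; the following is only a simplified, hypothetical sketch of that relationship (the trait body and the CveDetailsSpider struct are assumptions, not the book's code):

// Simplified, hypothetical sketch: how a spider exposes its item type and its
// start URLs. The real trait in the book's crawler is async and richer.
trait Spider {
    // The structured data this spider produces: Cve for CVE Details.
    type Item;

    // Where the crawl begins; the pagination links found on each page keep it going.
    fn start_urls(&self) -> Vec<String>;
}

struct CveDetailsSpider;

impl Spider for CveDetailsSpider {
    type Item = Cve;

    fn start_urls(&self) -> Vec<String> {
        vec!["https://www.cvedetails.com/vulnerability-list/vulnerabilities.html".to_string()]
    }
}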

Then, with a browser and its developer tools, we inspect the page to find the HTML classes and ids that will let us extract that data:
ch_05/crawler/src/spiders/cvedetails.rs

async fn scrape(&self, url: String) -> Result<(Vec<Self::Item>, Vec<String>), Error> {
    log::info!("visiting: {}", url);

    let http_res = self.http_client.get(url).send().await?.text().await?;
    let mut items = Vec::new();

    let document = Document::from(http_res.as_str());

    // Each vulnerability is a row of the table whose id is "vulnslisttable"
    let rows = document.select(Attr("id", "vulnslisttable").descendant(Class("srrowns")));
    for row in rows {
        let mut columns = row.select(Name("td"));
        let _ = columns.next(); // # column

        let cve_link = columns.next().unwrap().select(Name("a")).next().unwrap();
        let cve_name = cve_link.text().trim().to_string();
        let cve_url = self.normalize_url(cve_link.attr("href").unwrap());

        // The CWE column may be empty, hence the Option
        let cwe = columns.next().unwrap().select(Name("a")).next().map(|cwe_link| {
            (
                cwe_link.text().trim().to_string(),
                self.normalize_url(cwe_link.attr("href").unwrap()),
            )
        });

        let _ = columns.next(); // # of exploits column

        let vulnerability_type = columns.next().unwrap().text().trim().to_string();
        let publish_date = columns.next().unwrap().text().trim().to_string();
        let update_date = columns.next().unwrap().text().trim().to_string();
        // Default to 0.0 if the score cell can't be parsed as a number
        let score: f32 = columns.next().unwrap().text().trim().parse().unwrap_or(0.0);
        let access = columns.next().unwrap().text().trim().to_string();
        let complexity = columns.next().unwrap().text().trim().to_string();
        let authentication = columns.next().unwrap().text().trim().to_string();
        let confidentiality = columns.next().unwrap().text().trim().to_string();
        let integrity = columns.next().unwrap().text().trim().to_string();
        let availability = columns.next().unwrap().text().trim().to_string();

        let cve = Cve {
            name: cve_name,
            url: cve_url,
            cwe_id: cwe.as_ref().map(|cwe| cwe.0.clone()),
            cwe_url: cwe.as_ref().map(|cwe| cwe.1.clone()),
            vulnerability_type,
            publish_date,
            update_date,
            score,
            access,
            complexity,
            authentication,
            confidentiality,
            integrity,
            availability,
        };
        items.push(cve);
    }
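
The method relies on self.http_client, a reqwest::Client shared by all of the spider's requests. Its construction is not part of this excerpt; a minimal sketch could look like the following, where the timeout and user agent values are purely illustrative:

use std::time::Duration;

// Illustrative sketch only: the concrete timeout and user agent are not taken
// from the book's code.
fn build_http_client() -> reqwest::Client {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(10))
        .user_agent("crawler (learning project)")
        .build()
        .expect("spiders/cvedetails: building HTTP client")
}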

Finally, still inside scrape, we extract the links to the next pages from the pagination block at the bottom of the page and return them together with the scraped items:

ch_05/crawler/src/spiders/cvedetails.rs

    // Collect the pagination links so the crawler knows what to visit next
    let next_pages_links = document
        .select(Attr("id", "pagingb").descendant(Name("a")))
        .filter_map(|n| n.attr("href"))
        .map(|url| self.normalize_url(url))
        .collect::<Vec<String>>();

    Ok((items, next_pages_links))
}
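
normalize_url is called on every extracted href but is not shown in this excerpt. CVE Details serves relative links, so the helper has to turn them into absolute URLs. A minimal sketch, written here as a free function for brevity and assuming only protocol-relative and root-relative links need handling (not the book's exact implementation), could look like this:

// Minimal sketch of a URL normalizer for cvedetails.com links.
fn normalize_url(url: &str) -> String {
    let url = url.trim();
    if url.starts_with("http://") || url.starts_with("https://") {
        url.to_string()
    } else if url.starts_with("//") {
        // protocol-relative link: //www.cvedetails.com/...
        format!("https:{}", url)
    } else if url.starts_with('/') {
        // root-relative link: /cve/CVE-...
        format!("https://www.cvedetails.com{}", url)
    } else {
        format!("https://www.cvedetails.com/{}", url)
    }
}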

To run this spider, go to ch_05/crawler in the git repository accompanying the book and run:

$ cargo run -- run --spider cvedetails

Want to learn more? Get my course Black Hat Rust where we build a crawler in Rust to scrape vulnerabilities and gather data about our targets.


Tags: hacking, programming, rust, tutorial
