Need Automated Web Scraping Advice

This forum is for general support of Xbase++
rdonnay
Site Admin
Posts: 4775
Joined: Wed Jan 27, 2010 6:58 pm
Location: Boise, Idaho USA

#1 Post by rdonnay »

I have done a variety of web scraping projects in the past, but they were all fairly easy because I did not have to deal with logins or JavaScript. Most of these were U.S. Government sites that listed potential terrorists for the bank OFAC program. I wrote an automated system for bank customers that went to 12 different government sites, extracted the data, and wrote it to a database. When automating such a process, the objective is to get straight to the page where the data resides and to bypass logins where that is possible.

Now I am being asked to write an automated system that retrieves a list of people with violations in the New York E-ZPass lanes.
The NY website requires an account number and a password. There are a variety of scraping services out there that charge a monthly fee. I don't know how good any of them are, or how much time it would take to build one into an automated system that runs daily.

Does anyone have experience with this? I looked at Selenium and ParseHub. I also wasted $12 on a Web scraping e-book.
The eXpress train is coming - and it has more cars.

Cliff Wiernik
Posts: 605
Joined: Thu Jan 28, 2010 9:11 pm
Location: Stevens Point, Wisconsin USA

Re: Need Automated Web Scraping Advice

#2 Post by Cliff Wiernik »

They don't provide an API using REST or SOAP to connect and retrieve the data, like other information services do? I have worked with those, but not with web scraping.


Re: Need Automated Web Scraping Advice

#3 Post by rdonnay »

"They don't provide an API using REST or SOAP to connect and retrieve the data like other information services"
It's unbelievable that the State of New York can require Bobby's company to comply whenever a taxi is listed in the E-ZPass NY violations, yet doesn't give any support for automating this. Their workers must log in every day and read the names and violations from the screen.

This is why I don't like working with the government.

I worked with the FBI years ago on a project.
The security to get into their computer at night changed every day.
They had to call me with new credentials every day.
Also, it was only available by dial-up modem because they could not be connected to the internet.
They couldn't even use a VPN.

patito
Posts: 121
Joined: Tue Aug 31, 2010 9:01 pm

Re: Need Automated Web Scraping Advice

#4 Post by patito »

Hi Roger

Scraping with cURL
curl is built on libcurl, a library that allows you to connect to servers using many different protocols.
It supports HTTP, HTTPS and other protocols. This is one way of getting data from the web.

Best regards
Hector Pezoa
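
As a rough illustration of the curl-style approach suggested above, here is a minimal Python sketch (standard library only) of a form login followed by a page fetch, with cookies carried between the two requests. The URLs and the form field names `account` and `password` are placeholders, not taken from the E-ZPass NY site; the real ones would have to be read from the login page. It also assumes the login is a plain HTML form post, which may not hold if the site depends on JavaScript.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: the real login URL, report URL and form field names
# must be taken from the actual site; these are placeholders only.
LOGIN_URL = "https://example.com/login"
REPORT_URL = "https://example.com/violations"


def make_session():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(account, password):
    """Build (but do not send) the form POST that submits the credentials."""
    form = urllib.parse.urlencode({"account": account, "password": password})
    return urllib.request.Request(
        LOGIN_URL,
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_report(account, password):
    """Log in, then fetch the data page with the same cookie-carrying opener."""
    opener, _ = make_session()
    with opener.open(build_login_request(account, password), timeout=30):
        pass  # any session cookie set by the response is now stored in the jar
    with opener.open(REPORT_URL, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A daily job would call `fetch_report(...)` and hand the returned HTML to whatever extracts the rows. If the site builds its pages with JavaScript, a browser-driving tool is probably still needed.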

TWolfe
Posts: 60
Joined: Thu Jan 28, 2010 7:34 am

Re: Need Automated Web Scraping Advice

#5 Post by TWolfe »

Roger,

As you know, I have dealt with scraping data from NY State websites for over 20 years. I would be happy to help you and Bobby with this. Call me or send a private email if you still need some help.
Terry


Re: Need Automated Web Scraping Advice

#6 Post by rdonnay »

Terry -

I would like to talk with you about this.
As of now, I have made some good progress using RoboTask.

RoboTask captures keyboard and mouse events in the web browser and writes everything to a file to be played back later.
It actually works very well with Chrome.

Scraping the data out of the captured HTML was rather simple for me.
I had to fix some of the bad HTML first in the Xbase++ program and save it as XML.
Then I read the XML into an object tree.
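
For readers who want to try the same idea outside Xbase++, here is a small Python sketch of reading sloppy HTML into an object tree. It is not the Xbase++ code described above, and it swaps the repair-then-save-as-XML step for a tolerant parser from the standard library that closes unclosed `<td>` and `<tr>` tags while it builds the tree. The `Node`, `TreeBuilder` and `table_rows` names are made up for this example.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the tree: tag name, attributes, text and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant with the given tag, in document order."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree and tolerates missing </td> and </tr> tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def _close(self, tag, stop=None):
        # Walk up to the nearest open `tag` and step out of it.
        # Give up at the root or at a `stop` tag; stray closers are ignored.
        node = self.current
        while node is not self.root and node.tag != stop:
            if node.tag == tag:
                self.current = node.parent
                return
            node = node.parent

    def handle_starttag(self, tag, attrs):
        # A new <td> or <tr> implicitly closes an unclosed one in the same table.
        if tag in ("td", "tr"):
            self._close(tag, stop="table")
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        self._close(tag)

    def handle_data(self, data):
        self.current.text += data


def table_rows(html):
    """Parse the HTML and return each <tr> as a list of stripped <td> texts."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [[cell.text.strip() for cell in row.find_all("td")]
            for row in builder.root.find_all("tr")]
```

For example, with made-up sample data, `table_rows("<table><tr><td>ABC123<td>Lane 4</tr><tr><td>XYZ789<td>Lane 2</table>")` yields one list of cell texts per row even though no `</td>` is present and the second `</tr>` is missing.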
