Google App Engine – On uploading serious bulk data

“The Google App Engine enables you to build web applications on the same scalable systems that power Google applications”, says Google. Sounds like that could come in handy for my route altitude profile. The NASA SRTM dataset has 1.5 billion data points just for Australia and I have no idea how many people will use my application.
In the last couple of days, I’ve worked through the tutorial and some of the documentation and have taken the first steps (on a separate git branch) in making my application run on the App Engine.
The first challenge was – and still is – to insert the data into the data store. The data consists of 1060 tiles of 1 degree longitude by 1 degree latitude. Each tile consists of 1200 x 1200 data points representing the altitude at each location; about every 100 meters. I wanted to insert just 1 tile, which took about a minute with Postgres when using the COPY command.
Google provides a bulk uploader that can be used to upload data. However, I think the author of that script had a different idea of ‘bulk’ than I do.
It is possible to test your application offline, before uploading it; the SDK comes with an offline data store. Of course, the offline data store is not the real thing. After 10 hours, it had only inserted about 10% of 1 tile. I also noticed that after each insert, the next insert would take longer.
So I skipped ahead and tried the real thing online. The good news is that the slowdown effect had disappeared, as I expected. The bad news is that I could only insert about 100 points each time. Any more and the server would kick me. This means I would need to send an HTTP POST request 15.000 times to upload 1 tile. That would take roughly 3 days with my current connection (remember: 1 minute with Postgres). In addition, even though I used small chunks , the connection got broken by the server after a while. That means starting over again, because the bulk upload script does not have a resume function.
Clearly we need a different approach here.
One way would be to improve the bulk upload script. It should be able to resume a failed upload. But that would not speed things up.
Therefore, I think the best solution would be if Google makes it possible to upload a CSV file, so their servers can perform their equivalent of a SQL COPY command.
The next challenge will be probably be data consumption. In Postgres, I store the coordinates in a ‘special way’ as one long int (8 bytes) and the altitude is stored as an small integer (2 bytes). So in theory that requires 10 bytes per data point. In practice we need to index the coordinate and there is probably some other overhead. I estimate (it’s a bit difficult in Postgres) that Postgres uses anywhere between 40 and 87 bytes per data point. Its even harder to estimate it with the Google App data store, but it looks like that uses about 100 bytes per data point. I have no idea if that is good or bad. But at that rate, I would need about 150 GB of storage space and the test version only allows 500 MB. Imagine if I would like to upload the whole world.
I have to run to airport now (or actually to the bus to the airport); tomorrow during the Google Developer Day in Sydney, I will hopefully have the opportunity to discuss these issues with experts. I’ll keep you posted: to be continued.

3 comments

Pingback: クラウドマップ開発ブログ » Google App Engine に挑戦
Ratnesh says:

November 4, 2009 at 12:31 am

Hi,
I am interested in knowing, if you got a solution to this problem.
Though, Resume function you can get through uploading data into memory through url fetch statements or upload file and then schedule tasks to insert the data.

1. Sjors says:
  
  November 4, 2009 at 2:04 pm
  
  Hi Ratnesh,
  Thank you for your comment. Unfortunately it has been a very long time since I last worked with App Engine, so I don’t really know the answer to your question. I wish you good luck with your application!
  Sjors

3 comments

Leave a comment Cancel reply