I recently learned that NCERT provides all of its textbooks (classes 1–12) for download. However, I found their interface hard to browse, so I wrote a crawler in Python to fetch all the books to my webserver.
Along with storing the books, I generated thumbnails for every chapter and navigation pages that make the books easier to browse. You can access this dump here.
The crawler code is here.
These are the steps to run it yourself.
# scrape ncert site to get details about book pdf urls
python get_ncert_books_data.py > ncert_books_data

# create a directory to store all downloaded files
mkdir ncert_books

# download all the pdfs and cover page images
cat ncert_books_data | python download_ncert_books.py ncert_books

# generate thumbnails for each pdf (requires ImageMagick)
find ncert_books/ -iname "*.pdf" > all_pdf_files
sed -i "s/^/0\t/" all_pdf_files
python generate_thumbnails.py

# resize the book cover images
python resize_image_thumbnails.py ncert_books

# generate the navigation pages
python generate_navigation_pages.py ncert_books
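If you would rather not depend on find and sed (for example on a machine without GNU sed), the all_pdf_files list can be built in Python instead. This is only a rough equivalent of those two commands, not part of the crawler: the function name list_pdfs_with_prefix is made up for illustration, the sorting is my addition, and the post does not say what the leading 0 column means, so it is reproduced as-is.

```python
import os


def list_pdfs_with_prefix(root, prefix="0\t"):
    """Return every .pdf path under root (any letter case), each with prefix prepended.

    Hypothetical helper mirroring `find root -iname "*.pdf"` followed by
    `sed "s/^/0\\t/"`. Unlike find, the result is sorted for stable output.
    """
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, like find's -iname
            if name.lower().endswith(".pdf"):
                lines.append(prefix + os.path.join(dirpath, name))
    return sorted(lines)


# Example usage (writes the same file the shell steps produce):
#
#     with open("all_pdf_files", "w") as out:
#         for line in list_pdfs_with_prefix("ncert_books"):
#             out.write(line + "\n")
```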
For the thumbnail generation to work, you will need ImageMagick. Read this for more information.
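To give an idea of what ImageMagick is being asked to do, here is a minimal sketch of rendering the first page of a PDF as a thumbnail. It is not the code from generate_thumbnails.py; the function names, the 200-pixel width, and the PNG output are all assumptions for illustration. It calls ImageMagick's convert command (on ImageMagick 7 the command is magick), and reading PDFs also requires Ghostscript to be installed.

```python
import os
import subprocess


def thumbnail_command(pdf_path, width=200):
    """Build the ImageMagick command for a first-page thumbnail.

    Hypothetical helper: "[0]" selects the first page of the PDF, and
    "-thumbnail 200x" scales it to 200 pixels wide, keeping the aspect ratio.
    The thumbnail is written next to the PDF with a .png extension.
    """
    base, _ext = os.path.splitext(pdf_path)
    return ["convert", f"{pdf_path}[0]", "-thumbnail", f"{width}x", f"{base}.png"]


def make_thumbnail(pdf_path, width=200):
    """Run ImageMagick; raises CalledProcessError if convert fails."""
    subprocess.run(thumbnail_command(pdf_path, width), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to print the command and try it by hand when a particular PDF fails to convert.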