Fetching Blogger Posts With Python
As mentioned in a previous post, I am working through two chapters, 7 and 8, of Matthew Russell’s eminently fun-filled book Mining the Social Web (MTSW). As a first step, I had a go at Blogger posts. Example 8-1, blogs_and_nlp__get_feed.py, gets blog posts from their feeds and saves them in JSON format.
Very unsurprisingly, I started with my own blog. On Blogger, when you scroll down and click on Subscribe to: Posts (Atom), you get the feed URL. In my case it is:
http://mh-journal.blogspot.com/feeds/posts/default
One thing to note is that though my blog domain name is blogspot.in, the feed domain is blogspot.com.
So I started by passing the Atom feed URL as the command-line argument to the program. First hiccup: the wrong links were being copied to the output. So I changed the index, i.e., 'link': e.links[0].href became 'link': e.links[4].href, to get the proper post URL.
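A hard-coded index like 4 works only as long as the feed lists its links in the same order. As a minimal sketch of an alternative, assuming each entry's links behave like the dicts that feedparser exposes (with rel, type and href keys), one could pick the link marked rel="alternate" instead. The function name and the sample links below are made up for illustration.

```python
def pick_post_url(links):
    """Return the href of the HTML link marked rel="alternate", or None."""
    for link in links:
        is_alternate = link.get("rel") == "alternate"
        is_html = link.get("type", "text/html") == "text/html"
        if is_alternate and is_html:
            return link.get("href")
    return None


# Hypothetical links for one entry, shaped like e.links in the program.
sample_links = [
    {"rel": "replies", "type": "application/atom+xml",
     "href": "http://example.blogspot.com/feeds/1/comments/default"},
    {"rel": "self", "type": "application/atom+xml",
     "href": "http://example.blogspot.com/feeds/posts/default/1"},
    {"rel": "alternate", "type": "text/html",
     "href": "http://example.blogspot.com/2012/02/a-post.html"},
]

print(pick_post_url(sample_links))
```

If no such link is present the function returns None, so the caller can decide whether to skip the entry.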
By default, Blogger sends only 25 posts in the feed. If you want more, you need to set the max-results parameter. The hitch is that it takes 500 as its maximum value. The URL to fetch is:
http://mh-journal.blogspot.com/feeds/posts/default?max-results=500
All my blog entries, a grand total of 36, were fetched and saved in JSON format, for the pure pleasure of further analysis and reflection and thought :) But how much can you analyze with 36 posts? Yeh Dil Maange More (this heart wants more).
At this point, I switched to Annie Zaidi (knownturf.blogspot.in); I had met her recently at her book reading session and had also posted a long entry on my blog about my impressions. Once you meet Annie, it’s not easy to get her out of your system.
Now wait, no journey is very smooth, and in this case there were road bumps: her blog feed gives only short versions of the posts. I checked my own Blogger settings. They were: Settings -> Site Feed -> Allow Blog Feed? = Full. So I tweeted to her, asking her to enable the full feed.
But I got no reply. Busy bee, no time for the mahboobs of the world. I thought I would send a reminder. Then I realized there could be other bloggers whose feed settings are Short. It’s better to handle this situation in the code than to keep asking every such blogger in the world to change his or her settings.
The steps are simple: from the feed, fetch the post URLs; go to each URL and fetch the HTML page; then get the blog content. As Scott Meyers wrote, pretty this ain’t, but sometimes a programmer’s just gotta do what a programmer’s gotta do.
So how do we fetch the post URLs? Already done. As I wrote above, they are obtained from e.links[4].href.
So how do we go to each URL and fetch the HTML page? Helpful here is the old Python warhorse urllib2. It takes a URL and fetches you the HTML page.
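The program itself uses urllib2, which exists only in Python 2. As a rough sketch of the same step on Python 3, where urllib.request is the closest standard-library equivalent, the fetch could look like the following. The function name and the 30-second timeout are my own choices, not part of the original program.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Download a page and return it as text.

    Falls back to UTF-8 when the response declares no charset.
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html on a post URL from the feed returns the page source as a string, ready to be handed to an HTML parser.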
So how do we fetch the blog content? It is embedded in the div with the class “post-body entry-content”. You tell BeautifulSoup to find it with: find('div', attrs={'class' : 'post-body entry-content'})
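To show what that lookup does without needing BeautifulSoup installed, here is a small sketch that swaps in html.parser from the Python 3 standard library. It is not the code from the program; the class and function names are hypothetical, and it only collects the text inside the first div whose class contains both post-body and entry-content.

```python
from html.parser import HTMLParser


class PostBodyExtractor(HTMLParser):
    """Collect text inside the first <div class="post-body entry-content">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # div nesting depth, counted from the target div
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the post body
            return
        classes = (dict(attrs).get("class") or "").split()
        if "post-body" in classes and "entry-content" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_post_text(html):
    """Return the post body text with whitespace collapsed ('' if none)."""
    parser = PostBodyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = (
    '<div class="header">My Blog</div>'
    '<div class="post-body entry-content"><p>Hello <b>world</b></p> '
    '<div>inner</div></div><div>footer</div>'
)
print(extract_post_text(page))
```

A post that contains only an image yields an empty string here, which is a case the calling code has to be ready for.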
Steps done, program works. But then I got ambitious. Why fetch only 500 posts? Let’s get all of Annie’s posts. To get the next 500 posts, you have to use the URL:
http://knownturf.blogspot.com/feeds/posts/default?start-index=501&max-results=500
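A URL like this one can be assembled rather than typed by hand. The sketch below uses urllib.parse.urlencode from Python 3; the function name and its defaults are assumptions for illustration, while start-index and max-results are the parameter names shown in the URL above.

```python
from urllib.parse import urlencode


def feed_page_url(feed_url, start_index=1, max_results=500):
    """Build the feed URL for one batch of posts."""
    query = urlencode({"start-index": start_index, "max-results": max_results})
    return f"{feed_url}?{query}"


base = "http://knownturf.blogspot.com/feeds/posts/default"
print(feed_page_url(base, start_index=501))
```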
So how many times do we need to fetch 500 posts? Simple. It is (total no. of posts / 500) + 1, where the division is integer division.
So how do we get the total number of posts? Send the request to http://knownturf.blogspot.com/feeds/posts/default?alt=json and in the JSON that comes back there is a value nested as "feed" -> "openSearch$totalResults" -> "$t", accessed in the code as ["feed"]["openSearch$totalResults"]["$t"]
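Pulling that value out of the response might look like this small sketch. The sample JSON is a made-up, trimmed-down stand-in for the real response; I wrap the value in int() on the assumption that it may arrive as a string.

```python
import json


def total_posts(feed_json_text):
    """Read the total post count from the ?alt=json feed response."""
    data = json.loads(feed_json_text)
    return int(data["feed"]["openSearch$totalResults"]["$t"])


# Hypothetical, heavily trimmed response body.
sample = '{"feed": {"openSearch$totalResults": {"$t": "1234"}}}'
print(total_posts(sample))
```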
One small correction was in order. In a Python for loop, we use the range function, which takes startIndex and endIndex as arguments and loops from startIndex to endIndex - 1. I prefer to start from 1, so what should the endIndex value be? It should be (total number of posts / 500) + 2.
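The arithmetic can be checked with a short sketch. It is written for Python 3, where integer division is spelled //, and the function name is hypothetical; it turns each pass of the loop into the start-index for that batch.

```python
BATCH = 500


def batch_start_indexes(total, batch=BATCH):
    """Return the start-index for each request, looping from 1 to end_index - 1."""
    end_index = total // batch + 2
    return [(page - 1) * batch + 1 for page in range(1, end_index)]


print(batch_start_indexes(1234))
print(batch_start_indexes(36))
```

One thing to keep in mind: when the total is an exact multiple of 500, this formula makes one extra request, which should simply come back with no entries. Rounding the division up instead would avoid that request.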
The remaining part of the program is the same as what Matthew Russell wrote. I thought I was done and ran the program. As it was running, it bombed while fetching the URL:
http://knownturf.blogspot.com/2012/02/image-says-it-all.html
Why? Because this blog post has no text. Uff, ladkiyon ke nakhre (oh, the airs girls put on). Annie had posted only one image in that post.
So the following code
return BeautifulStoneSoup(clean_html(text),
convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
throws an IndexError, as the contents list is empty. If there is no text (non-HTML) content, we need to catch the exception and tell the calling code that there is no non-HTML text.
For this purpose, I enclosed the code in a try block, and in the except block I return the string “NO NON_HTML TEXT FOUND”. In the calling for loop, I skip appending to the blog_posts array with a simple continue statement.
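The pattern can be shown in isolation. In this sketch a plain list stands in for the .contents of the parsed post, and the function name and sample posts are made up; only the sentinel string and the continue come from the description above.

```python
NO_TEXT = "NO NON_HTML TEXT FOUND"


def first_text_chunk(chunks):
    """Return the first chunk of text, or the sentinel when there is none."""
    try:
        return chunks[0]
    except IndexError:
        return NO_TEXT


blog_posts = []
# Hypothetical posts: the second one has only an image, hence no text chunks.
for title, chunks in [("A normal post", ["Some text"]),
                      ("image-says-it-all", [])]:
    content = first_text_chunk(chunks)
    if content == NO_TEXT:
        continue  # skip posts with no text instead of crashing
    blog_posts.append({"title": title, "content": content})

print(blog_posts)
```

Returning a sentinel string keeps the change small, though returning None and testing for it would serve the same purpose.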
The last modification I made was to take the blog id as a command-line argument and build the blogspot.com URLs from it.
Finally, everything done. I downloaded all the posts and saved them on my computer. After having met Annie, interacted with her and written a blog post, I could not get her out of “my” system, so I got her into my “system.”
Full code for the program, which I call mh-get-blogger-posts.py, is available on GitHub and also given below. The next two programs in MTSW are for finding sentences and max words, and for generating summaries. I suspect there will be some adventure there too. If yes, that will be the material for another post.