FBLYZE is a Facebook scraping and analysis system.
Getting started tutorial on Medium.
The goal of this project is to implement a Facebook scraping and extraction engine. This project was originally based on the scraper from minimaxir, which you can find here. However, our project aims to take this one step further and create a continuous scraping and processing system that can easily be deployed into production. Specifically, for our purposes we want to extract information about upcoming paddling meetups, event information, flow info, and other river-related reports. However, this project should be useful for anyone who needs regular scraping of FB pages or groups.
To get the ID of a Facebook group, go here and input the URL of the group you are trying to scrape. For pages, you can just use the name after the slash (e.g. for http://facebook.com/paddlesoft the ID would be paddlesoft).
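As a minimal sketch of the page case, the name after the slash can be pulled out of a URL with the standard library. The helper name page_name_from_url is hypothetical and not part of FBLYZE, and it assumes the page name is the first path segment of the URL.

```python
from urllib.parse import urlparse


def page_name_from_url(url):
    """Return the first path segment of a Facebook page URL (hypothetical helper)."""
    path = urlparse(url).path  # e.g. "/paddlesoft" or "/paddlesoft/"
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("URL has no page name after the slash: " + url)
    return segments[0]


print(page_name_from_url("http://facebook.com/paddlesoft"))  # prints "paddlesoft"
```

This does not work for groups, which still need the numeric ID lookup described above.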
We recommend you use our Docker image, as it contains everything you need. For instructions on how to use our Dockerfile, please see the wiki page. Our Dockerfile is tested regularly on Codefresh, so you can easily see whether the build is passing above.
You will need to have Python 3.5+. If you want to use the examples (located in /data) you will need Jupyter Notebooks and Spark.
from get_posts import scrape_comments_from_last_scrape, scrape_posts_from_last_scrape
group_id = "115285708497149"
scrape_posts_from_last_scrape(group_id)
scrape_comments_from_last_scrape(group_id)
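If you need to scrape more than one group, the two calls above can be wrapped in a loop. The following is only a sketch: scrape_groups is a hypothetical helper, not part of FBLYZE, and it takes the two scrape functions as arguments so that a failure in one group does not stop the others.

```python
def scrape_groups(group_ids, scrape_posts, scrape_comments):
    """Run both scrapers for every group ID and return the IDs that failed.

    Hypothetical helper: scrape_posts and scrape_comments are any callables
    that accept a single group ID.
    """
    failed = []
    for group_id in group_ids:
        try:
            scrape_posts(group_id)
            scrape_comments(group_id)
        except Exception as exc:  # keep going so one bad group does not stop the run
            print("Scrape failed for group", group_id, ":", exc)
            failed.append(group_id)
    return failed
```

With the imports shown above, you would call it as scrape_groups(["115285708497149"], scrape_posts_from_last_scrape, scrape_comments_from_last_scrape).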
Note that our messaging system using Kafka currently only works with the basic JSON data (comparable to the CSV). We are working on adding a new schema for the more complex data; see issue 11. Plans to add authentication for Kafka are in progress.
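To illustrate what "basic JSON data comparable to the CSV" can look like, here is a rough sketch that turns each CSV row into one flat JSON message. It is not FBLYZE's actual serialization code: the function name and the column names (post_id, message) are made up for the example, and no Kafka client is involved. The resulting bytes are simply the kind of value a producer could send.

```python
import csv
import io
import json


def csv_rows_to_json_messages(csv_text):
    """Turn each CSV row into a UTF-8 encoded flat JSON object, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # sort_keys makes the output deterministic regardless of column order
    return [json.dumps(row, sort_keys=True).encode("utf-8") for row in reader]


# Hypothetical columns, for illustration only.
sample = "post_id,message\n1,Flow is up\n"
messages = csv_rows_to_json_messages(sample)
print(messages[0])  # b'{"message": "Flow is up", "post_id": "1"}'
```

Every value stays a string, just as it would in the CSV, which is why nested data such as comment threads needs a richer schema.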
Currently the majority of examples of actual analysis are contained in the Examining data using Spark.ipynb notebook located in the data folder. You can open the notebook and specify the name of your CSV.
ElasticSearch occasionally throws an authentication error when trying to save posts. If you get an authentication error when using ES, please add it to issue 15. Support for connecting to Bonsai and elastic.co is in the works.
There are some other use-case examples on my main GitHub page, which you can look at as well. However, I have omitted them from this repo since they are mainly in Java and require Apache Flink.
We are also working on automating scraping with Apache Airflow. The DAGs we have created so far are in the dags folder. It is recommended that you use the DAGs in conjunction with our Docker image, which avoids directory errors.