Movie Recommendation System using Spark
A web platform for recommending movies to users who log in with their Facebook account. Users rate movies and receive recommendations based on their activity, as well as predicted ratings for particular movies.
Technologies used:
- Apache Spark
- Flask - Python based web framework
- PostgreSQL
- React
- jQuery, Ajax
- Bootstrap
Requirements and installation
It is highly recommended to use virtualenv.
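For example, a minimal setup (assuming virtualenv is already installed; the directory name venv is arbitrary):
$ virtualenv venv
$ source venv/bin/activate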
The required packages are listed in requirements.txt. To install them, run the following command inside the virtualenv:
$ pip install -r requirements.txt
Download Spark v1.6.1 from the Apache Spark downloads page and unpack it:
$ tar -xf <name_of_spark_archive>.tar
Follow the instructions in spark-1.6.1/README.md to build and install.
Set the following environment variables in your ~/.bash_profile (OS X) or ~/.bashrc (Linux):
export SPARK_HOME=~/path/to/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Verify that it was installed successfully by running, from spark-1.6.1/:
$ ./bin/pyspark
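To also check that the PYTHONPATH entries are picked up, a quick sanity test from a plain Python shell (this simply verifies that the pyspark package is importable):
$ python -c "import pyspark"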
Copy the spark-env.sh, spark-defaults.conf, and slaves files from spark_utils/ to path/to/spark/conf/.
- spark-env.sh - settings for the master node, such as the number of workers it manages. Note: edit the SCALA_HOME and SPARK_WORKER_DIR environment variables to match your local setup (see the illustrative snippet after this list).
- spark-defaults.conf - settings for the workers. Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf in the code. Properties set directly on the SparkConf in code take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
- slaves - addresses of the workers.
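For reference, a minimal spark-env.sh for a local standalone setup might look like the following; all paths and counts are illustrative, so adjust them to your machine.
# spark-env.sh - illustrative values for a local standalone cluster
export SCALA_HOME=/usr/local/share/scala      # path to your Scala installation
export SPARK_WORKER_DIR=/tmp/spark/work       # scratch space for worker output
export SPARK_WORKER_INSTANCES=2               # worker processes per node
export SPARK_WORKER_CORES=1                   # cores per worker
export SPARK_WORKER_MEMORY=2g                 # memory per worker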
For the database we use PostgreSQL; download and install it. The paths below assume Postgres.app on OS X.
Add the following to ~/.bash_profile:
export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin
$ which psql
should output something like
/Applications/Postgres.app/Contents/Versions/latest/bin/psql
Create a PostgreSQL database
$ createdb rs
To inspect the created database, run psql in a terminal, then use the following commands in the postgres console:
- \l - list all available databases
- \connect rs - connect to the newly created rs database
- \d - show the tables in the connected database
Note: When dropping a table from the psql console, also drop the sequence mentioned in the models; otherwise the auto-incrementing id will not be reset, as it should be.
DROP TABLE "Users"; DROP SEQUENCE seq_user_id;
DROP TABLE "Movies"; DROP SEQUENCE seq_movie_id;
Install the npm packages needed for React:
$ npm install
Any change to the React components in app/templates/client/components/ or to the app/templates/client/index.js file should be followed by the build command.
To build bundle.js from the npm packages and scripts (the build behavior is defined in package.json):
$ npm run-script build
or, to build bundle.js and also run the webpack dev server:
$ npm run-script run
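For orientation, the relevant scripts section of package.json presumably looks something like this (illustrative; the actual webpack invocation may differ):
{
  "scripts": {
    "build": "webpack",
    "run": "webpack && webpack-dev-server"
  }
}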
To download the dataset that will be used, run the following script
$ python download_dataset.py
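A minimal sketch of what such a script might do, assuming it fetches a MovieLens dataset from GroupLens (the dataset names match the --dataset option described under Usage; the actual script may differ):

import os
import zipfile
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve           # Python 2

DATASET = 'ml-latest'  # or 'ml-latest-small' for quicker experiments
URL = 'http://files.grouplens.org/datasets/movielens/{0}.zip'.format(DATASET)

archive = '{0}.zip'.format(DATASET)
urlretrieve(URL, archive)         # download the zip archive
with zipfile.ZipFile(archive) as zf:
    zf.extractall('datasets')     # unpack to datasets/<DATASET>/
os.remove(archive)                # clean up the archive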
Usage
- [ ] Write full instructions and an example.
Activate the virtualenv created earlier:
$ source name_of_virtualenv_directory/bin/activate
1. Starting the Spark Master and Workers.
From path/to/spark, type
$ sbin/start-master.sh
$ sbin/start-slaves.sh
or
$ sbin/start-all.sh
Similarly, to stop the Master and Workers:
$ sbin/stop-master.sh
$ sbin/stop-slaves.sh
or
$ sbin/stop-all.sh
See Spark stats via the web UI at http://localhost:8080 (the standalone master's default port).
2. Sending the Python sources to Spark and running them
From the project directory, run the following:
$ sh path/to/spark/bin/spark-submit \
--master spark://<server_name/server_ip>:7077 \
--num-executors 2 \
--total-executor-cores 2 \
--executor-memory 2g \
server.py [options] > stdout 2> stderr
where <server_name> is the machine's hostname, e.g. yosemite, ubuntu, or localhost if it's running locally.
The following options are all optional:
--dataset <name>
Specify a dataset, e.g. "ml-latest" or "ml-latest-small".
If omitted, defaults to "ml-latest".
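For example, to run locally against the small dataset:
$ sh path/to/spark/bin/spark-submit \
    --master spark://localhost:7077 \
    --num-executors 2 \
    --total-executor-cores 2 \
    --executor-memory 2g \
    server.py --dataset ml-latest-small > stdout 2> stderr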
Logs can be inspected in the files provided above:
$ tail -f stdout
$ tail -f stderr
By default, as mentioned in server.py, CherryPy uses port 5434. If that port is busy, change it in the same file.
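The port setting in server.py presumably looks something like the following sketch, using CherryPy's standard configuration API (the surrounding code will differ):

import cherrypy

# Global CherryPy config; 5434 matches the port used throughout this README.
cherrypy.config.update({
    'server.socket_host': '0.0.0.0',  # listen on all interfaces
    'server.socket_port': 5434,       # change this if the port is busy
})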
3. Operations on the constructed model
Go to http://0.0.0.0:5434/ and have fun.
Alternatively, interact with the constructed ML model from the terminal using the following commands.
- POSTing new ratings to the model
$ curl --data-binary @user_ratings/user_ratings.file http://0.0.0.0:5434/<user_id>/ratings
where user_id is 0 by default, representing a brand-new user not among those in the dataset.
Description: POSTs user_id's ratings from user_ratings.file, where every line has the form movie_id,rating.
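For example, a user_ratings.file with three ratings (hypothetical movie ids) could look like:
260,4.0
1,3.0
16,2.5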
This triggers some computations and ends with output representing the submitted ratings as a list of lists. In the server output window you will see the actual Spark computation output together with CherryPy's messages about HTTP requests. The output represents each rating as (user_id, movie_id, rating), i.e. the rating awarded by the user in user_ratings.file.
- GETting the best recommendations
$ curl http://0.0.0.0:5434/<user_id>/ratings/top/<num_movies>
or in browser
http://0.0.0.0:5434/<user_id>/ratings/top/<num_movies>
Examples
$ curl http://0.0.0.0:5434/0/ratings/top/10
$ curl http://0.0.0.0:5434/3/ratings/top/10
http://0.0.0.0:5434/0/ratings/top/10
Description: Presents the best num_movies recommendations for the user with id user_id.
- GETting individual ratings
$ curl http://0.0.0.0:5434/<user_id>/ratings/<movie_id>
or in browser
http://0.0.0.0:5434/<user_id>/ratings/<movie_id>
Examples
$ curl http://0.0.0.0:5434/0/ratings/500
$ curl http://0.0.0.0:5434/3/ratings/500
http://0.0.0.0:5434/0/ratings/500
http://0.0.0.0:5434/1/ratings/500
Description: Gets the model's predicted rating of movie_id by user_id.
Tests
TODO
License
TODO