Movie Recommendation System using Spark
A web platform for recommending movies to users who log in with their Facebook account. Users rate movies and receive recommendations based on their activity, as well as predicted ratings for particular movies.
Technologies used:
- Apache Spark
- Flask - Python based web framework
- PostgreSQL
- React
- jQuery, Ajax
- Bootstrap
Requirements and installation
It is highly recommended to use virtualenv.
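For example, a minimal setup (assuming virtualenv is already installed; the directory name venv is arbitrary):
$ virtualenv venv
$ source venv/bin/activate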
The required packages are listed in requirements.txt. To install them, run the following command inside the virtualenv:
$ pip install -r requirements.txt
Download Spark v1.6.1 from the Apache Spark downloads page and unpack it:
$ tar -xf <name_of_spark_archive>.tar
Follow the instructions in spark-1.6.1/README.md to build and install.
Set the following environment variables in your ~/.bash_profile (OS X) or ~/.bashrc (Linux):
export SPARK_HOME=~/path/to/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Verify that it was installed successfully by running, from spark-1.6.1/:
$ ./bin/pyspark
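To also check that the PYTHONPATH entries are picked up, a quick sanity test from a plain Python shell (this simply verifies that the pyspark package is importable):
$ python -c "import pyspark"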
Copy the spark-env.sh, spark-defaults.conf, and slaves files from spark_utils/ to path/to/spark/conf/.
- spark-env.sh - settings for the master node, such as the number of workers it manages. Note: edit the SCALA_HOME and SPARK_WORKER_DIR environment variables to match your local setup (see the illustrative snippet after this list).
- spark-defaults.conf - settings for the workers. Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf in the code. Properties set directly on the SparkConf in code take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
- slaves - addresses of the workers.
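For reference, a minimal spark-env.sh for a local standalone setup might look like the following; all paths and counts are illustrative, so adjust them to your machine.
# spark-env.sh - illustrative values for a local standalone cluster
export SCALA_HOME=/usr/local/share/scala      # path to your Scala installation
export SPARK_WORKER_DIR=/tmp/spark/work       # scratch space for worker output
export SPARK_WORKER_INSTANCES=2               # worker processes per node
export SPARK_WORKER_CORES=1                   # cores per worker
export SPARK_WORKER_MEMORY=2g                 # memory per worker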
For the database we use PostgreSQL; download and install it. The paths below assume Postgres.app on OS X.
Add the following to ~/.bash_profile:
export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin
$ which psql
should output something like
/Applications/Postgres.app/Contents/Versions/latest/bin/psql
Create a PostgreSQL database
$ createdb rs
To inspect the created database, run psql in a terminal, then use the following commands in the postgres console:
- \l - list all available databases
- \connect rs - connect to the newly created rs database
- \d - show the tables in the connected database
Note: When dropping a table from the psql console, also drop the sequence mentioned in the models; otherwise the auto-incrementing id will not be reset, as it should be.
DROP TABLE "Users"; DROP SEQUENCE seq_user_id;
DROP TABLE "Movies"; DROP SEQUENCE seq_movie_id;
Install the npm packages needed for React:
$ npm install
Any change to the React components in app/templates/client/components/ or to the app/templates/client/index.js file should be followed by the build command.
To build bundle.js from the npm packages and scripts (the build behavior is defined in package.json):
$ npm run-script build
or, to build bundle.js and also run the webpack dev server:
$ npm run-script run
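For orientation, the relevant scripts section of package.json presumably looks something like this (illustrative; the actual webpack invocation may differ):
{
  "scripts": {
    "build": "webpack",
    "run": "webpack && webpack-dev-server"
  }
}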
To download the dataset that will be used, run the following script
$ python download_dataset.py
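A minimal sketch of what such a script might do, assuming it fetches a MovieLens dataset from GroupLens (the dataset names match the --dataset option described under Usage; the actual script may differ):

import os
import zipfile
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve           # Python 2

DATASET = 'ml-latest'  # or 'ml-latest-small' for quicker experiments
URL = 'http://files.grouplens.org/datasets/movielens/{0}.zip'.format(DATASET)

archive = '{0}.zip'.format(DATASET)
urlretrieve(URL, archive)         # download the zip archive
with zipfile.ZipFile(archive) as zf:
    zf.extractall('datasets')     # unpack to datasets/<DATASET>/
os.remove(archive)                # clean up the archive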
Usage
- [ ] Write full instructions and an example.
Activate the virtualenv created earlier:
$ source name_of_virtualenv_directory/bin/activate
1. Starting the Spark Master and Workers.
From path/to/spark, type
$ sbin/start-master.sh
$ sbin/start-slaves.sh
or
$ sbin/start-all.sh
Similarly, to stop the Master and Workers:
$ sbin/stop-master.sh
$ sbin/stop-slaves.sh
or
$ sbin/stop-all.sh
See Spark stats via the web UI at http://localhost:8080 (the standalone master's default port).
2. Sending the Python sources to Spark and running them
From the project directory, run the following:
$ sh path/to/spark/bin/spark-submit \
--master spark://<server_name/server_ip>:7077 \
--num-executors 2 \
--total-executor-cores 2 \
--executor-memory 2g \
server.py [options] > stdout 2> stderr
where <server_name> is the machine's hostname, e.g. yosemite, ubuntu, or localhost if it's running locally.
The following options are all optional:
--dataset <name>
Specify a dataset, e.g. "ml-latest" or "ml-latest-small".
If omitted, defaults to "ml-latest".
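For example, to run locally against the small dataset:
$ sh path/to/spark/bin/spark-submit \
    --master spark://localhost:7077 \
    --num-executors 2 \
    --total-executor-cores 2 \
    --executor-memory 2g \
    server.py --dataset ml-latest-small > stdout 2> stderr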
Logs can be inspected in the files provided above:
$ tail -f stdout
$ tail -f stderr
By default, as mentioned in server.py, CherryPy uses port 5434. If that port is busy, change it in the same file.
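The port setting in server.py presumably looks something like the following sketch, using CherryPy's standard configuration API (the surrounding code will differ):

import cherrypy

# Global CherryPy config; 5434 matches the port used throughout this README.
cherrypy.config.update({
    'server.socket_host': '0.0.0.0',  # listen on all interfaces
    'server.socket_port': 5434,       # change this if the port is busy
})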
3. Operations on the constructed model
Go to http://0.0.0.0:5434/ and have fun.
Alternatively, interact with the constructed ML model from the terminal using the following commands.
- POSTing new ratings to the model
$ curl --data-binary @user_ratings/user_ratings.file http://0.0.0.0:5434/<user_id>/ratings
where user_id is 0 by default, representing a brand-new user not among those in the dataset.
Description: POSTs user_id's ratings from user_ratings.file, where every line has the form movie_id,rating.
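For example, a user_ratings.file with three ratings (hypothetical movie ids) could look like:
260,4.0
1,3.0
16,2.5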
This triggers some computations and ends with output representing the submitted ratings as a list of lists. In the server output window you will see the actual Spark computation output together with CherryPy's messages about HTTP requests. The output represents each rating as (user_id, movie_id, rating), i.e. the rating awarded by the user in user_ratings.file.
- GETting the best recommendations
$ curl http://0.0.0.0:5434/<user_id>/ratings/top/<num_movies>
or in browser
http://0.0.0.0:5434/<user_id>/ratings/top/<num_movies>
Examples
$ curl http://0.0.0.0:5434/0/ratings/top/10
$ curl http://0.0.0.0:5434/3/ratings/top/10
http://0.0.0.0:5434/0/ratings/top/10
Description: Presents the best num_movies recommendations for the user with id user_id.
- GETting individual ratings
$ curl http://0.0.0.0:5434/<user_id>/ratings/<movie_id>
or in browser
http://0.0.0.0:5434/<user_id>/ratings/<movie_id>
Examples
$ curl http://0.0.0.0:5434/0/ratings/500
$ curl http://0.0.0.0:5434/3/ratings/500
http://0.0.0.0:5434/0/ratings/500
http://0.0.0.0:5434/1/ratings/500
Description: Gets the model's predicted rating of movie_id by user_id.
Tests
TODO
License
TODO