Mirror site setup notes

Recently we set up an ICGC portal mirror site with a lot of help from people on this forum. I am sharing my notes here in the hope that they are useful to others. Please note this is just what worked for me. If you follow the instructions, you should be able to get a working site, but there is no guarantee that you won't encounter unforeseen issues (most likely you will). In that case, first find out whether they are site-specific issues. You may also need to search this forum for answers.

1. Ubuntu 14.04 hosts which meet the minimum requirements for a data portal site.
2. Enough disk space for applications. 
3. Enough swap space 
4. Internet access 
1. Some applications (elasticsearch, hadoop) require large disk space. If you 
   don't have enough space on the root file system, you can add a second disk. The 
   file system on the second disk must be mounted on a specific directory for the 
   application to use it: 
    1. For elasticsearch nodes, mount on /var/lib/elasticsearch directory. 
       Create the directory if it doesn't exist. 
    2. For HDFS nodes, mount on /dfs directory. Create the directory if 
       it doesn't exist. 
   Make sure you add the mounts to /etc/fstab so they are mounted 
   automatically on reboot. 
2. Set up ssh passwordless login on all servers. 
    1. On the deployment server, run "ssh-keygen" 
    2. ssh-copy-id <username>@<host> 
    3. Verify passwordless login by running command "ssh <username>@<host>" 
3. Set up passwordless sudo on all hosts 
   Add the following to /etc/sudoers (edit it with "sudo visudo"): 
   <username>           ALL=(ALL:ALL) ALL 
   Defaults:<username>  !authenticate 
4. Check ansible version on the host which will be running ansible playbook. 
   If you have 1.x version, first remove it by running command "sudo dpkg -r ansible", 
   then follow the instructions at 
   to install ansible2 on Ubuntu 14.04 deployment server: 
    1. sudo apt-get update && sudo apt-get install software-properties-common 
    2. sudo apt-add-repository ppa:ansible/ansible 
    3. sudo apt-get update && sudo apt-get install ansible 
   If you see the message below, you have 1.x version of ansible: 
     $> ansible-playbook -i config/hosts portal.yml 
     ERROR: become is not a legal parameter in an Ansible task or handler 
   Follow the instructions above to reinstall Ansible 2.x or newer. 
5. Check out the ansible playbook repo from https://github.com/icgc-dcc/dcc-cm/tree/develop/ansible,
   then cd into the ansible repo directory. All the commands below are run from the top 
   of the ansible directory. 
6. Edit vars/main.yml. Make sure ansible_ssh_user and ansible_ssh_private_key_file 
   match current setup. 
7. Edit config/hosts. Make sure the host names/patterns are correct. 
8. Depending on your setup, you may need to run the command below to update 
   package list on all servers: 
    $> sudo apt-get update 
9. Install portal servers: 
   Run command to install portal server automatically: 
   ansible-playbook -i config/hosts portal.yml 
   You may need to add the "-c paramiko" option if installation fails due to an ssh host key check. 
   Ignore the following errors, which appear when installing the Java SDK and when killing a previously running instance: 
   fatal: [lxv-icgc-elastic04]: FAILED! => {"changed": false, "cmd": "java -version", "failed": true, "msg": "[Errno 2] No such file or directory", "rc": 2} 
   TASK [portal : make sure to kill the potential previously running instance] **** 
   fatal: [lxv-icgc-portal02]: FAILED! => {"changed": true, "cmd": "pkill -IO -f WrapperSimpleApp || true", "delta": "0:00:00.060439", "end": "2016-09-12 22:39:48.332188", "failed": true, "rc": -29, "start": "2016-09-12 22:39:48.271749", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []} 
   You may need to update the old knapsack plugin on the elasticsearch master node: 
    /usr/share/elasticsearch/bin/plugin -url http://bit.ly/29A1hsz -install knapsack 
    sudo /etc/init.d/elasticsearch restart 
    After installation of portal servers, you need to change the wrong server name in 
      1. change dcc-elasticsearch to your elasticsearch master. 
      2. change dcc-nginx to your nginx server. 
   Change your elasticsearch config so that the first node is the master and the rest of 
   the nodes are data nodes. 
   On the first elasticsearch node, edit /etc/elasticsearch/elasticsearch.yml and add the following: 
     node.master: true 
     node.data: false 
   On the remaining elasticsearch nodes, edit /etc/elasticsearch/elasticsearch.yml and add 
   the following: 
     node.master: false 
     node.data: true 
   After making the changes above, restart all elasticsearch instances: 
   cd /etc/init.d/; sudo ./elasticsearch restart 
   Check the number of elasticsearch nodes after installation: 
   curl 'http://<elastic_search_master_node_ip>:9200/_cluster/health?pretty=1' 
   If it doesn't show any data nodes, multicast discovery probably doesn't work. 
   You need to enable unicast discovery. Add the following to 
   /etc/elasticsearch/elasticsearch.yml to enable unicast discovery: 
    discovery.zen.ping.multicast.enabled: false 
    discovery.zen.ping.unicast.hosts: [ "<node 2>", "<node 3>",  ... ] 
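   Also increase the elasticsearch java heap space to at least 16 GB on a production 
   system. For a system with 64 GB of memory, use 30 GB of heap space. Edit 
   /etc/default/elasticsearch to make the change (in the Elasticsearch 1.x/2.x Debian 
   packages this is normally the ES_HEAP_SIZE variable, e.g. ES_HEAP_SIZE=30g), 
   and restart the elasticsearch server after making the changes above. 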
   Update the dcc-portal database on the postgres server to the same schema as defined at: 
   Update portal server config in /srv/dcc-portal-server/conf/application.yml 
   (adjust the index name to match the index you will be loading) 
    indexName: icgc22-13 
    repoIndexName: icgc-repository-20160830 
       - host: <dns name of elasticsearch master node> 
         port: 9300 
   Change portal server config to use Postgres: 
     # Datasource 
       driver-class-name: org.postgresql.Driver 
       url: jdbc:postgresql://<postgres server dns name>/dcc-portal 
       username: dcc 
       password: dcc 
       max-active: 10 
       max-idle: 1 
       min-idle: 1 
   Change download server URL in /srv/dcc-portal-server/conf/application.yml to point 
   to your download server: 
   You need to make sure portal server settings match download server: 
      enabled: true 
      serverUrl: "https://<download server name>:443" 
      publicServerUrl: "https://<portal external dns name>:443" 
      sharedSecret: "deadbeefdeadbeefdeadbeefdeadbeef" 
      aesKey: "deadbeefdeadbeef" 
   The "sharedSecret" and "aesKey" must match the ones in the jwt section of 
   download server's application.yml. 
   Update nginx config /etc/nginx/sites-available/dcc_portal to use correct 
   server name, and redirect http to https: 
   # HTTPS 
    server { 
            listen  443; 
            server_name <portal external DNS name>; 
            ssl                 on; 
            ssl_certificate     /etc/ssl/dcc/portal.crt; 
            ssl_certificate_key /etc/ssl/dcc/portal.key; 
            ssl_session_timeout  5m; 
            ssl_protocols  TLSv1 TLSv1.1 TLSv1.2; 
            ssl_ciphers  HIGH:!RC4:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!EXP:+MEDIUM; 
            ssl_prefer_server_ciphers   on; 
            location / { 
                    proxy_pass http://web-cluster; 
            } 
    } 
    # HTTP 
    server { 
            listen 80; 
            server_name <portal external DNS name>; 
            return 301 https://$server_name$request_uri; 
    } 
10. Install HDFS nodes: 
   ansible-playbook -i config/hosts hdfs.yml 
   Check status after install at http://<hdfs-master-node>:50070 
11. Install download server: 
   ansible-playbook -i config/hosts download.yml -c paramiko 
   Use the following command to check communication between portal server and 
   download server: 
   curl -k -v https://<download server name>/srv-info/health 
   Make sure portal and download servers use https to communicate with each other. 
12. Generate a self-signed certificate for https on download server: 
   keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass <password> -validity 3600 -keysize 2048 
   Copy the keystore file keystore.jks to /srv/dcc-download-server/conf/, and edit 
   /srv/dcc-download-server/conf/application.yml to add the following section: 
     port: 443 
       keyStore: "../conf/keystore.jks" 
       keyStorePassword: "<key password>" 
    After the change, restart download server: 
    cd /srv/dcc-download-server/bin; sudo ./dcc-download-server restart 
13. Download data from the ICGC download server 
    1. Get a list of files and URLs to download: 
       wget https://download.icgc.org/exports 
    2. Download each of those files. Some of them can be quite large. 
14. Untar data.open.tar and make the directory available to an HDFS node either 
    locally or via NFS. Run the following commands on the HDFS node. Due to 
    permission settings, you have to run them as user hdfs: 
      $> sudo su hdfs 
      $> hdfs dfs -copyFromLocal release_21 /icgc/input 
15. Load the index into elasticsearch. 
   1. Run the following command from a node (e.g., an HDFS node) with sufficient memory: 
    java -Xmx12g -jar dcc-download-import.jar -i release.tar -es es://<DNS name of elasticsearch master node>:9300 -p LAML-CN 
    Please note you can import either a single project or all projects. 
    Use "-p" to choose which project to load. Without "-p", 
    the command loads all projects (which takes much longer). 
    If the output says "cluster is in red" or "cluster is in yellow", 
    log into the elasticsearch nodes and check that elasticsearch is running with 
    sufficient heap space. 
16. Import the repository index. This must be done on the host with the knapsack plugin installed: 
    curl -XPOST "http://<dns name of elasticsearch master node>:9200/_import?path=<path_to_repository.tar.gz>" 
17. Restart portal servers: 
    sudo /srv/dcc-portal-server/bin/dcc-portal-server restart 
After this long setup, if everything works, you should be able to see the portal at 
https://<DNS of portal external IP> in a browser window. 
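
Before opening the browser, it can help to repeat the cluster health query from the elasticsearch setup in a small script. What follows is only a sketch of how I would do it, not part of the official playbooks: the helper names health_field and check_cluster are made up for illustration, the script assumes curl, tr and awk are installed, and it relies on the "status" and "number_of_data_nodes" fields that _cluster/health returns.

```shell
#!/bin/sh
# Sketch only: health_field and check_cluster are hypothetical helper names.

# Extract one top-level field from the JSON returned by _cluster/health.
# $1 = field name; the JSON document is read from stdin.
health_field() {
    # Strip spaces, newlines and quotes, then put each key:value on its own line.
    tr -d ' \n"' | tr ',{}' '\n\n\n' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Query the master node and fail if the cluster is red or has no data nodes.
# $1 = DNS name or IP of the elasticsearch master node (site specific).
check_cluster() {
    json=$(curl -s "http://$1:9200/_cluster/health") || return 1
    status=$(printf '%s' "$json" | health_field status)
    nodes=$(printf '%s' "$json" | health_field number_of_data_nodes)
    echo "status=$status data_nodes=$nodes"
    [ "$status" != "red" ] && [ "${nodes:-0}" -gt 0 ]
}

# Usage:
#   check_cluster <elastic_search_master_node_ip> || echo "check heap size and discovery settings"
```

The text parsing is deliberately crude so that nothing beyond the base system is needed. It only works for flat JSON like the health response, so use a proper JSON tool for anything more complex.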
1. Command to get the list of indices in the Elasticsearch cluster. Must be run on an elasticsearch node: 
     curl 'http://localhost:9200/_cat/indices?pretty=1' 
2. Delete an index in Elasticsearch: 
     curl -XDELETE lxv-icgc-elastic01:9200/icgc21-0-3 
3. Command to download latest java import utility: 
     wget 'https://artifacts.oicr.on.ca/artifactory/dcc-release/org/icgc/dcc/dcc-download-import/[RELEASE]/dcc-download-import-[RELEASE].jar'
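
Since the delete command above is easy to mistype, I would wrap it in a small guard. This is a sketch under my own assumptions: index_url is a hypothetical helper name, port 9200 is the default HTTP port used throughout these notes, and the check simply refuses names that could match more than one index (empty, _all, wildcards, comma-separated lists).

```shell
#!/bin/sh
# Sketch only: index_url is a hypothetical helper, not part of any ICGC tool.

# Print the URL of a single index, or fail for names that could match
# several indices at once.
# $1 = elasticsearch host, $2 = index name
index_url() {
    host=$1
    index=$2
    case "$index" in
        ''|_all|*'*'*|*,*)
            echo "refusing unsafe index name: '$index'" >&2
            return 1
            ;;
    esac
    printf 'http://%s:9200/%s\n' "$host" "$index"
}

# Usage (host and index name are the examples from the tips above):
#   url=$(index_url lxv-icgc-elastic01 icgc21-0-3) && curl -XDELETE "$url"
```

Because the curl call only runs when index_url succeeds, a typo such as an empty variable stops the command rather than sending a broader delete request.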

Thank you for this, Brady.

We will use your feedback to help improve the process provided by our automation and CM.