Recently we set up an ICGC portal mirror site with a lot of help from people on this forum. I am sharing my notes here in the hope that they are useful to others. Please note this is just what worked for me. If you follow these instructions, you should end up with a working site, but there is no guarantee you won't hit some unforeseen issues (most likely you will). In that case, try to determine whether they are specific to your site, and search this forum for answers.
Prerequisites
1. Ubuntu 14.04 hosts which meet the minimum requirements for a data portal site.
2. Enough disk space for applications.
3. Enough swap space.
4. Internet access.
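Before starting, it can save time to sanity-check each host. A minimal sketch (the 4 GB swap threshold below is my own assumption, not an official requirement; adjust it to your sizing):

```shell
#!/bin/sh
# Sketch: check a host meets assumed minimums before installing.
# MIN_SWAP_MB is an illustrative assumption, not an official figure.
MIN_SWAP_MB=4096

# SwapTotal in /proc/meminfo is reported in kB; convert to MB.
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_mb=$((swap_kb / 1024))

if [ "$swap_mb" -lt "$MIN_SWAP_MB" ]; then
    echo "WARN: only ${swap_mb} MB swap; consider adding more"
else
    echo "OK: ${swap_mb} MB swap"
fi

# Show available space on the root file system, in GB.
df -BG --output=avail / | tail -n1
```

Run this on every host you plan to use before kicking off the playbooks.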
Instructions:
1. Some applications (Elasticsearch, Hadoop) require a lot of disk space. If you
don't have enough space on the root file system, you can add a second disk. The
file system on the second disk must be mounted at a specific directory for the
application to use it:
1. For Elasticsearch nodes, mount it on the /var/lib/elasticsearch directory.
Create the directory if it doesn't exist.
2. For HDFS nodes, mount it on the /dfs directory. Create the directory if
it doesn't exist.
Make sure you add the mounts to /etc/fstab so they are mounted
automatically on reboot.
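For example, assuming the second disk appears as /dev/sdb1 and is formatted ext4 (both assumptions; substitute your actual device and file system), the /etc/fstab entries would look like:

```
# On an Elasticsearch node:
/dev/sdb1  /var/lib/elasticsearch  ext4  defaults  0  2
# On an HDFS node:
/dev/sdb1  /dfs                    ext4  defaults  0  2
```

After editing /etc/fstab, "sudo mount -a" mounts the new entries and confirms they are valid.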
2. Set up ssh passwordless login on all servers.
1. On the deployment server, run "ssh-keygen"
2. Run "ssh-copy-id <username>@<host>" for each host
3. Verify passwordless login by running "ssh <username>@<host>"
3. Set up passwordless sudo on all hosts.
Add the following to /etc/sudoers (use "sudo visudo" to edit it safely):
<username> ALL=(ALL:ALL) ALL
Defaults:<username> !authenticate
4. Check the ansible version on the host which will be running the ansible playbooks.
If you have a 1.x version, first remove it by running "sudo dpkg -r ansible",
then follow the instructions at
https://community.spiceworks.com/how_to/110622-install-ansible-on-64-bit-ubuntu-14-04-lts
to install Ansible 2 on the Ubuntu 14.04 deployment server:
1. sudo apt-get update && sudo apt-get install software-properties-common
2. sudo apt-add-repository ppa:ansible/ansible
3. sudo apt-get update && sudo apt-get install ansible
If you see the message below, you have a 1.x version of ansible:
$> ansible-playbook -i config/hosts portal.yml
ERROR: become is not a legal parameter in an Ansible task or handler
Follow the instructions above to reinstall a 2.x or newer version of Ansible.
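A quick way to confirm which major version you ended up with is to parse the first line of "ansible --version". A sketch (the sample line stands in for the real command output so the logic works even where ansible isn't installed):

```shell
#!/bin/sh
# Sketch: extract the major version from "ansible --version"-style output
# and complain if it is still 1.x.
line="ansible 2.1.1.0"   # in practice: line=$(ansible --version | head -n1)

major=$(echo "$line" | awk '{print $2}' | cut -d. -f1)

if [ "$major" -lt 2 ]; then
    echo "ansible ${major}.x found: upgrade before running the playbooks"
else
    echo "ansible major version $major: OK"
fi
```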
5. Check out the ansible playbook repo from https://github.com/icgc-dcc/dcc-cm/tree/develop/ansible,
then cd into the ansible repo directory. All the commands below are run from the top
of the ansible directory.
6. Edit vars/main.yml. Make sure ansible_ssh_user and ansible_ssh_private_key_file
match your current setup.
7. Edit config/hosts. Make sure the host names/patterns are correct.
8. Depending on your setup, you may need to run the command below on each server
to update its package list:
$> sudo apt-get update
9. Install portal servers:
Run this command to install the portal servers automatically:
ansible-playbook -i config/hosts portal.yml
You may need to add "-c paramiko" option if installation failed due to ssh host key check.
Ignore errors like the following from the Java SDK install and process-cleanup tasks:
fatal: [lxv-icgc-elastic04]: FAILED! => {"changed": false, "cmd": "java -version", "failed": true, "msg": "[Errno 2] No such file or directory", "rc": 2}
TASK [portal : make sure to kill the potential previously running instance] ****
fatal: [lxv-icgc-portal02]: FAILED! => {"changed": true, "cmd": "pkill -IO -f WrapperSimpleApp || true", "delta": "0:00:00.060439", "end": "2016-09-12 22:39:48.332188", "failed": true, "rc": -29, "start": "2016-09-12 22:39:48.271749", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
...ignoring
You may need to update the old knapsack plugin on the Elasticsearch master node:
/usr/share/elasticsearch/bin/plugin -url http://bit.ly/29A1hsz -install knapsack
sudo /etc/init.d/elasticsearch restart
After installation of the portal servers, you need to correct the placeholder server names in
/srv/dcc-portal-server/conf/application.yml:
1. Change dcc-elasticsearch to your Elasticsearch master.
2. Change dcc-nginx to your nginx server.
Make changes to your Elasticsearch config so the first node is the master and the rest
are data nodes.
On the first Elasticsearch node, edit /etc/elasticsearch/elasticsearch.yml and add the
following:
node:
  master: true
  data: false
On the rest of the Elasticsearch nodes, edit /etc/elasticsearch/elasticsearch.yml and add
the following:
node:
  master: false
  data: true
After making the changes above, restart all Elasticsearch instances:
sudo /etc/init.d/elasticsearch restart
Check the number of Elasticsearch nodes after installation:
curl 'http://<elastic_search_master_node_ip>:9200/_cluster/health?pretty=1'
If it doesn't show any data nodes, multicast discovery probably doesn't work.
You need to enable unicast discovery. Add the following to
/etc/elasticsearch/elasticsearch.yml to enable unicast discovery:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "<node 2>", "<node 3>", ... ]
Also increase the Elasticsearch Java heap to at least 16 GB on a production
system. For a 64 GB memory system, use a 30 GB heap. Edit
/etc/default/elasticsearch to make the change:
ES_HEAP_SIZE=30g
Then restart Elasticsearch after making the changes above.
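After restarting, the _cluster/health endpoint can confirm the data nodes joined. A sketch that pulls number_of_data_nodes out of the JSON (the canned response below is illustrative; on a live cluster replace it with the real curl call):

```shell
#!/bin/sh
# Sketch: count data nodes from the _cluster/health response.
# A canned response is used here; on a live cluster you would use:
#   health=$(curl -s "http://<elastic_search_master_node_ip>:9200/_cluster/health?pretty=1")
health='{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 3
}'

data_nodes=$(echo "$health" | grep '"number_of_data_nodes"' | grep -o '[0-9]\+')

if [ "$data_nodes" -eq 0 ]; then
    echo "no data nodes joined: check unicast discovery settings"
else
    echo "cluster has $data_nodes data node(s)"
fi
```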
Update the postgres database dcc-portal to the same schema as defined at:
https://raw.githubusercontent.com/icgc-dcc/dcc-portal/develop/dcc-portal-server/src/main/sql/schema.sql
Update the portal server config in /srv/dcc-portal-server/conf/application.yml
(adjust the index names to match the indexes you will be loading):
indexName: icgc22-13
repoIndexName: icgc-repository-20160830
nodeAddresses:
  - host: <dns name of elasticsearch master node>
    port: 9300
Change portal server config to use Postgres:
# Datasource
spring.datasource:
  driver-class-name: org.postgresql.Driver
  url: jdbc:postgresql://<postgres server dns name>/dcc-portal
  username: dcc
  password: dcc
  max-active: 10
  max-idle: 1
  min-idle: 1
Change the download server URL in /srv/dcc-portal-server/conf/application.yml to point
to your download server. The portal server settings must match the download server:
download:
  enabled: true
  serverUrl: "https://<download server name>:443"
  publicServerUrl: "https://<portal external dns name>:443"
  sharedSecret: "deadbeefdeadbeefdeadbeefdeadbeef"
  aesKey: "deadbeefdeadbeef"
The "sharedSecret" and "aesKey" must match the ones in the jwt section of the
download server's application.yml.
Update the nginx config /etc/nginx/sites-available/dcc_portal to use the correct
server name, and redirect http to https:
# HTTPS
server {
    listen 443;
    server_name <portal external DNS name>;
    ssl on;
    ssl_certificate /etc/ssl/dcc/portal.crt;
    ssl_certificate_key /etc/ssl/dcc/portal.key;
    ssl_session_timeout 5m;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!RC4:!MD5:!aNULL:!eNULL:!NULL:!DH:!EDH:!EXP:+MEDIUM;
    ssl_prefer_server_ciphers on;
    location / {
        proxy_pass http://web-cluster;
    }
}
# HTTP
server {
    listen 80;
    server_name <portal external DNS name>;
    return 301 https://$server_name$request_uri;
}
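Once nginx is reloaded, the http-to-https redirect can be verified from the status line of a plain-http response. A sketch (the canned headers stand in for "curl -sI http://<portal external DNS name>/", and the Location value is a made-up example):

```shell
#!/bin/sh
# Sketch: verify the HTTP listener answers with a 301 redirect.
# Canned response headers; on a live server you would use:
#   headers=$(curl -sI "http://<portal external DNS name>/")
headers='HTTP/1.1 301 Moved Permanently
Server: nginx
Location: https://portal.example.org/'

status=$(echo "$headers" | head -n1 | awk '{print $2}')

if [ "$status" = "301" ]; then
    echo "redirect OK -> $(echo "$headers" | awk '/^Location:/ {print $2}')"
else
    echo "expected 301, got $status"
fi
```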
10. Install HDFS nodes:
ansible-playbook -i config/hosts hdfs.yml
Check status after install at http://<hdfs-master-node>:50070
11. Install the download server:
ansible-playbook -i config/hosts download.yml -c paramiko
Use the following command to check communication between the portal server and the
download server:
curl -k -v https://<download server name>/srv-info/health
Make sure portal and download servers use https to communicate with each other.
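A small sketch of that check, assuming the health endpoint returns a Spring-Boot-style {"status":"UP"} payload (an assumption; adjust the match to whatever your server actually returns):

```shell
#!/bin/sh
# Sketch: check the download server health endpoint from the portal host.
# Canned payload; on a live host you would use:
#   body=$(curl -sk "https://<download server name>/srv-info/health")
body='{"status":"UP"}'

case "$body" in
    *'"UP"'*) echo "download server healthy" ;;
    *)        echo "unexpected health response: $body" ;;
esac
```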
12. Generate a self-signed certificate for https on the download server:
keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass <password> -validity 3600 -keysize 2048
Copy the keystore file keystore.jks to /srv/dcc-download-server/conf/, and edit
/srv/dcc-download-server/conf/application.yml to add the following section (the
keyStore path must match the file you copied):
server:
  port: 443
  ssl:
    keyStore: "../conf/keystore.jks"
    keyStorePassword: "<password used with keytool>"
After the change, restart download server:
cd /srv/dcc-download-server/bin; sudo ./dcc-download-server restart
13. Download data from the ICGC download server
1. Get a list of files and URLs to download:
wget https://download.icgc.org/exports
2. Download each of those files. Some of them can be quite large.
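Assuming you end up with one URL per line in a file called urls.txt (an assumption about the exports format; adjust the parsing if the endpoint returns something else), a resumable download loop might look like this. The two URLs below are illustrative:

```shell
#!/bin/sh
# Sketch: download every URL listed in urls.txt, resuming partial files.
# DRY_RUN=1 only prints what would be fetched, so the loop can be tried
# safely without network access.
DRY_RUN=${DRY_RUN:-1}

cat > urls.txt <<'EOF'
https://download.icgc.org/exports/data.open.tar
https://download.icgc.org/exports/release.tar
EOF

while IFS= read -r url; do
    if [ "$DRY_RUN" = "1" ]; then
        echo "would fetch: $url"
    else
        wget -c "$url"   # -c resumes interrupted downloads of large files
    fi
done < urls.txt
```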
14. Untar data.open.tar and make the directory available to an HDFS node, either
locally or via NFS. Run the following commands on the HDFS node. Due to
permission settings, you have to run them as user hdfs:
$> sudo su hdfs
$> hdfs dfs -copyFromLocal release_21 /icgc/input
15. Load the index into Elasticsearch.
1. Run the following command from a node (e.g., hdfs nodes) with sufficient memory:
java -Xmx12g -jar dcc-download-import.jar -i release.tar -es es://<DNS name of elasticsearch master node>:9300 -p LAML-CN
Please note you can import either a single project or all projects.
Use "-p" to choose which project to load. Without "-p",
the command loads all projects (which takes much longer).
If the output says "cluster is in red" or "cluster is in yellow", log
into the Elasticsearch nodes and check that Elasticsearch is running with
sufficient heap space.
16. Import the repository index. This must be run on the host with the knapsack plugin installed:
curl -XPOST "http://<dns name of elasticsearch master node>:9200/_import?path=<path_to_repository.tar.gz>"
17. Restart the portal servers:
sudo /srv/dcc-portal-server/bin/dcc-portal-server restart
After all this long setup, if everything works, you should be able to see the portal at
https://<DNS of portal external IP> in a browser window.
APPENDIX:
1. Command to get the list of indexes in the Elasticsearch cluster. Must be run on an Elasticsearch node:
curl 'http://localhost:9200/_cat/indices?pretty=1'
2. Delete an index in Elasticsearch:
curl -XDELETE lxv-icgc-elastic01:9200/icgc21-0-3
3. Command to download the latest Java import utility:
wget 'https://artifacts.oicr.on.ca/artifactory/dcc-release/org/icgc/dcc/dcc-download-import/[RELEASE]/dcc-download-import-[RELEASE].jar'