Regression Errors on CentOS 6.4 Install


#1

After compiling and installing on CentOS 6.4, I deployed to 5 hosts and ran the regressions.
The regression report indicates 8 errors.

[root@prdslsldsafht01 scidb-13.12.0.6872]# grep -i fail regression_tests.log
[1][Mon Mar 24 18:37:43 2014]: t.checkin.consume … EXECUTOR_FAILED
[472][Mon Mar 24 18:39:57 2014]: t.checkin.mpi.mpi … EXECUTOR_FAILED
[629][Mon Mar 24 18:43:19 2014]: t.checkin.other.list_1 … FILES_DIFFER
[693][Mon Mar 24 18:47:12 2014]: t.checkin.other.sample_1 … FILES_DIFFER
[777][Mon Mar 24 18:47:42 2014]: t.checkin.scalapack.10_gemm … EXECUTOR_FAILED
[778][Mon Mar 24 18:47:42 2014]: t.checkin.scalapack.19_svd_inferSchema … EXECUTOR_FAILED
[779][Mon Mar 24 18:47:47 2014]: t.checkin.scalapack.30_svd_doSvd_verySmall … FILES_DIFFER
[780][Mon Mar 24 18:47:48 2014]: t.checkin.scalapack.32_svd_driverXS … EXECUTOR_FAILED
testcases_failed = 8

  • Are these errors critical to the usability and accuracy of the system?
  • Is it worth going forward with this system, given these errors?
  • What can I do to correct these errors?

#2

So I’ve found the output of "run.py tests" in "./stage/build/tests/harness/testcases/r/checkin".

less consume.log
2014-03-25 01:44:43,672 INFO EXECUTOR[140561157367552] - 1:double>[i=0:3,4,0,j=0:3,4,0],random()),2,2))
Error during query: load_library('dense_linear_algebra')
SystemException in file: src/util/PluginManager.cpp function: findModule line: 117
Error id: scidb::SCIDB_SE_PLUGIN_MGR::SCIDB_LE_CANT_LOAD_MODULE
Error description: Plugin manager error. Cannot load module '/opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so', dlopen returned '/opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so: undefined symbol: _ZN5scidb11MPIPhysical8setQueryERKN5boost10shared_ptrINS_5QueryEEE'.
Failed query id: 1101259188122
Error during query: consume(gemm(build(attr1:double[i=0:9,32,0,j=0:9,32,0],random()),build(attr1:double[i=0:9,32,0,j=0:9,32,0],random()),build(attr1:double[i=0:9,32,0,j=0:9,32,0],random())))
SystemException in file: src/query/OperatorLibrary.cpp function: createLogicalOperator line: 85
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_LOGICAL_OP_DOESNT_EXIST
Error description: Query processor error. Logical operator 'gemm' does not exist.
Failed query id: 1101259198123
Error during query: consume(gesvd(build(attr1:double[i=0:9,32,0,j=0:9,32,0],random()%1.0),'values'))
SystemException in file: src/query/OperatorLibrary.cpp function: createLogicalOperator line: 85
Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_LOGICAL_OP_DOESNT_EXIST
Error description: Query processor error. Logical operator 'gesvd' does not exist.
Failed query id: 1101259198124
All done.

How do I fix that problem?
Are the sources for the libdense_linear_algebra.so plugin available?
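
For reference, one way to see which symbols the plugin wants but cannot resolve is with the standard binutils tools (a sketch; it assumes the scidb binary lives under the same install root):

nm -D --undefined-only /opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so | grep MPIPhysical
nm -D /opt/scidb/13.12/bin/scidb | grep MPIPhysical    # does the server actually export that symbol?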


#3

Run by themselves, the following tests passed…

./run.py -v tests --test-id checkin.other.list_1 --record
./run.py -v tests --test-id checkin.other.sample_1 --record
./run.py -v tests --test-id checkin.scalapack.30_svd_doSvd_verySmall --record

So perhaps only the following 5 tests are genuinely failing, unless the tests above are meant to depend on prior state and therefore fail as expected when run in sequence.

[1][Mon Mar 24 18:37:43 2014]: t.checkin.consume … EXECUTOR_FAILED
[472][Mon Mar 24 18:39:57 2014]: t.checkin.mpi.mpi … EXECUTOR_FAILED
[777][Mon Mar 24 18:47:42 2014]: t.checkin.scalapack.10_gemm … EXECUTOR_FAILED
[778][Mon Mar 24 18:47:42 2014]: t.checkin.scalapack.19_svd_inferSchema … EXECUTOR_FAILED
[780][Mon Mar 24 18:47:48 2014]: t.checkin.scalapack.32_svd_driverXS … EXECUTOR_FAILED


#4

I found that all of the other errors are associated with an undefined symbol in /opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so…
The following error is the consistent theme…

Error id: scidb::SCIDB_SE_PLUGIN_MGR::SCIDB_LE_CANT_LOAD_MODULE
Error description: Plugin manager error. Cannot load module '/opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so', dlopen returned '/opt/scidb/13.12/lib/scidb/plugins/libdense_linear_algebra.so: undefined symbol: _ZN5scidb11MPIPhysical8setQueryERKN5boost10shared_ptrINS_5QueryEEE'.

I think that all would be fine once I get that shared library corrected.
Does anyone have any pointers to correct it?
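
As a side note, the mangled name decodes with c++filt (part of binutils), which at least shows what the plugin is looking for:

echo '_ZN5scidb11MPIPhysical8setQueryERKN5boost10shared_ptrINS_5QueryEEE' | c++filt
# scidb::MPIPhysical::setQuery(boost::shared_ptr<scidb::Query> const&)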


#5

These are critical problems. I strongly recommend either:

  1. Using the pre-built RPMs for CentOS
    -or-

  2. Carefully following the build scripts deployment/deploy.sh and run.py in the source code directory.

In particular, deploy.sh should prepare your system to include all required software dependencies.
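
For example, the script lists the steps it can automate when asked for usage (a sketch, using the source-tree path mentioned elsewhere in this thread):

cd /builds/src/SciDB/scidb-13.12.0.6872
./deployment/deploy.sh usage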


#6

Also note that we’re on the verge of a 14.3 release, so perhaps you want to wait for that one before re-building?

14.3’s build process is about the same, though. Use the deploy.sh and run.py scripts to help automate the complicated setup and build steps.


#7

@blewis: Thanks, I was able to make use the prebuilt RPMs.
It seems that CMake had an issue detecting the OS version…

[scidb@prdslsldsafht01 mpi]$ cat /etc/centos-release
CentOS release 6.4 (Final)

So I hacked CMakeLists.txt to pick up the MPI libs.

[scidb@prdslsldsafht01 mpi]$ pwd
/builds/src/SciDB/scidb-13.12.0.6872/src/mpi

[scidb@prdslsldsafht01 mpi]$ diff CMakeLists.txt~ CMakeLists.txt

45a46,48
>
> # DGS: Force Distro Name
> set(DISTRO_NAME_VER "CentOS-6")
58c61,62
< set(MPI_LIBRARIES "${LOCAL_MPI_PATH}/lib/libmpichf90.a") # fill in the blanks
---
> # set(MPI_LIBRARIES "${LOCAL_MPI_PATH}/lib/libmpichf90.a") # fill in the blanks
> set(MPI_LIBRARIES "/usr/lib64/mpich2/lib/libmpichf90.a") # fill in the blanks

I don’t get the libdense_linear_algebra.so error during the regression tests anymore.
But the regression tests seem to be hanging in t.checkin.consume…

[scidb@prdslsldsafht01 scidb-13.12.0.6872]$ ./run.py -v tests --record 2>&1 | tee regression_tests.log
./run.py: DBG: cmd=tests
./run.py: DBG: Executing: cwd[/builds/src/SciDB/scidb-13.12.0.6872] ['grep', 'CMAKE_INSTALL_PREFIX', '/builds/src/SciDB/scidb-13.12.0.6872/stage/build/CMakeCache.txt']
./run.py: DBG: CMAKE_INSTALL_PREFIX:PATH=/opt/scidb/scidb-13.12.0.6872
(raw var line)
./run.py: DBG: /opt/scidb/scidb-13.12.0.6872 (parsed var)
./run.py: DBG: Executing: cwd[/builds/src/SciDB/scidb-13.12.0.6872/stage/build] ['/builds/src/SciDB/scidb-13.12.0.6872/deployment/deploy.sh usage | grep "SciDB version:"']
./run.py: DBG: SciDB version: 13.12
(raw)
./run.py: DBG: 13.12
(parsed)
./run.py: DBG: Executing: cwd[/builds/src/SciDB/scidb-13.12.0.6872/stage/build/tests/harness] ['export SCIDB_NAME=mydb ; export SCIDB_HOST=prdslsldsafht01 ; export SCIDB_PORT=1239 ; export SCIDB_BUILD_PATH=/builds/src/SciDB/scidb-13.12.0.6872/stage/build ; export SCIDB_INSTALL_PATH=/opt/scidb/scidb-13.12.0.6872 ; exp]
[1][Tue Mar 25 09:46:33 2014]: t.checkin.consume ______________________________ Executing
…hangs…

[root@prdslsldsafht01 r]# tail -f checkin/consume.log
Time = 00:00:00, sucess: consume(substitute(build(<attr1:double null>[i=0:9,4,0,j=0:9,4,0],iif(i=j,null,1)),build(attr1:double[i=0:0,1,0],0)))
Time = 00:00:00, sucess: consume(sum(build(attr1:int64[i=0:9,4,0,j=0:9,4,0],random()%5)))
Time = 00:00:00, sucess: consume(thin(build(attr1:double[i=0:99,4,0,j=0:99,4,0],random()%10),0,2,0,2))
Time = 00:00:00, sucess: consume(transpose(build(attr1:double[i=0:19,4,0,j=0:29,4,0],random()%10)))
Time = 00:00:00, sucess: consume(unpack(build(attr1:double[i=0:9,4,0,j=0:9,4,0],1),j))
Time = 00:00:00, sucess: consume(var(build(attr1:double[i=0:99,4,0,j=0:99,4,0],random()),attr1))
Time = 00:00:02, sucess: consume(variable_window(build(attr1:double[i=0:99,4,0,j=0:99,4,0],random()),i,2,6,sum(attr1)))
Time = 00:00:00, sucess: consume(versions(array_for_store_test))
Time = 00:00:01, sucess: consume(window(build(attr1:double[i=0:99,4,0,j=0:99,4,0],random()),2,11,4,13,min(attr1)))
Time = 00:00:00, sucess: consume(xgrid(build(<attr
…hangs…

I’m running these regression tests with a 5-node cluster deployment. Is that okay?
Or should these tests only be run on a single-node deployment?
Any setup parameters that I should tweak?
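
In the meantime, I can at least re-run the suspect test on its own with the same --test-id switch used earlier (the log file name below is just illustrative):

./run.py -v tests --test-id checkin.consume --record 2>&1 | tee consume_only.log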


#8

Regression tests continue to have errors.
After installing to a single node…

[scidb@prdslsldsafht01 scidb-13.12.0.6872]$ ./deployment/deploy.sh scidb_prepare scidb "" mydb mydb mydb /ogfs/scidb-001/mydb-DB 2 default 1 default prdslsldsafht01

I seem to be getting network connection errors in the regression test logs…

2014-03-25 13:18:33,690 INFO EXECUTOR[139779383559936] - Exception CAUGHT for SCIDB query…:SystemException in file: src/network/BaseConnection.cpp function: connect line: 262
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR
Error description: Network error. Error #system:111 when connecting to prdslsldsafht01:1239.

Strange.
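
For what it’s worth, error #system:111 is ECONNREFUSED, i.e. nothing was listening on prdslsldsafht01:1239 at that moment. A quick check (sketch):

netstat -tlnp | grep 1239     # is any SciDB instance listening on the base port?
ps -ef | grep [s]cidb         # are the instances actually running?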


#9

I’m thinking that the DISTRO_NAME_VER being incorrect in all the other CMakeLists.txt files is causing a problem with the build.
I’m trying a clean build again, beginning with "run.py set", but this time I’ve placed "CentOS 6.4" as the first line of /etc/issue, which is where cmake/Modules/LinuxDistroVersion.cmake looks to determine the distro.
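
For the record, the /etc/issue tweak was just something like:

sudo sed -i '1i CentOS 6.4' /etc/issue
head -1 /etc/issue    # should now print: CentOS 6.4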

Regressions continue to fail. :frowning:

[scidb@prdslsldsafht01 scidb-13.12.0.6872]$ egrep -ie 'fail|differ' regression_tests.log

[1][Tue Mar 25 15:37:43 2014]: t.checkin.consume … EXECUTOR_FAILED
[472][Tue Mar 25 15:39:52 2014]: t.checkin.mpi.mpi … EXECUTOR_FAILED
[473][Tue Mar 25 15:39:52 2014]: t.checkin.negative.build_1 … EXECUTOR_FAILED
…gets stuck…

Test stalls at…
[1][Tue Mar 25 15:37:43 2014]: t.checkin.consume ______________________________ Executing

  • But once the file is created… on re-run it is bypassed, but with an error (unless we initall the DB).

Then, Test stalls at…
[474][Tue Mar 25 15:39:52 2014]: t.checkin.negative.build_2 ______________________________ Executing

Has anyone ever gotten this beast to compile and pass all regressions on CentOS 6.4?
If not, I’ll move on to other tools.


#10

We usually run nightly tests on a four-node cluster on CentOS and RHEL 6.4, so these should work. I think you’re fine with the five-node setup; sorry for all the trouble you’re running into.

I’ve seen errors like your network connectivity error before. If you can, try using only explicit IP addresses in your SciDB config.ini file. That might mean that you have to become the postgres user and run

/opt/scidb/13.12/bin/scidb.py init_syscat

and then as the usual scidb user run scidb.py initall to re-initialize everything.
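
To illustrate what I mean by explicit IP addresses, the server lines in /opt/scidb/etc/config.ini would look something like this (the addresses are placeholders; keep whatever follows the comma, and every other line, exactly as deploy.sh generated it):

[mydb]
server-0=192.168.1.101,1
server-1=192.168.1.102,1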


#11

@blewis: Thanks for the reply.
I tested with only 1 instance and continued to have issues on 6.4.
I noticed that the SciDB repos only go as far as CentOS 6.3, so I’ve more recently run a build on CentOS 6.3 with better results when running the regressions (only 1 failure: other.list_1 - diff).
So it seems to me that the 6.4 libs may not be available yet, unless the 6.3 libs are also supposed to work with 6.4.


#12

No diff errors when I re-ran the checkin.other test suite…
./run.py tests --suite-id checkin.other --record

testcases_total = 225
testcases_passed = 210
testcases_failed = 0
testcases_skipped = 15
testsuites_skipped = 0

Re-ran all regressions again (this time all pass [single instance cluster])…
./run.py -v tests --record 2>&1 | tee run_tests.log

testcases_total = 834
testcases_passed = 794
testcases_failed = 0
testcases_skipped = 40
testsuites_skipped = 0

I’m also retrying on CentOS 6.4, but using IP addresses instead of host names.
I’ll report the results.


#13

Cool! All regressions are passing on CentOS 6.4.
As you mentioned (blewis), the trick was to use IP addresses in the "./deployment/deploy.sh scidb_install /tmp/packages …" step.
I had already done previous installs, and /opt/scidb/etc/config.ini only changed to IP addresses on the build host (host0).
So I copied /opt/scidb to all the other hosts (including all the other files, just in case), roughly as sketched below.
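
The copy was essentially (the host names are placeholders for the other four machines):

for h in host1 host2 host3 host4; do rsync -a /opt/scidb/ "${h}:/opt/scidb/"; done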

Then, regressions ran fine.
Thanks!