NWXC
====

An implementation of shared density functionals for the Gaussian and
Plane Wave DFT codes.

The revision history is as follows:

r25639 Unified DFT library of pre-existing code from the NWDFT and NWPW modules
r26056 Univariate automatic differentiation implementation of derivatives
r26255 Multivariate automatic differentiation implementation

Before deciding on the path forward a performance evaluation on the three
implementations listed above was performed. For this purpose the QA test
cases nwxc_nwdft_* and nwxc_tddft_* were used. These cases run all implemented
function terms with 1st derivatives and 2nd derivatives respectively. The
test cases were run in serial on Cascade (Intel Xeon E5-2670 processor cores).
The code was compiled with the Intel 14.0.3 compilers and the Intel MPI 
4.1.2.040 libraries. The NWChem BLAS and LAPACK routines were used for the
linear algebra. All calculations were performed with the spin unrestricted
Kohn-Sham formalism.

Below the timing results are reported. The columns labeled "manual" correspond
to the hand written code of the unified library (i.e. r25639), the columns
labeled "univariate" relate to the univariate differentiation approach of
revision r26056, whereas "multivariate" refers to the initial multivariate
differentiation approach. 

=============================================================================
               | CPU time (s)                   | Wall clock time (s)
-----------------------------------------------------------------------------
Job            | Manual  | Uni-    | Multi-     | Manual  | Uni-    | Multi-
               |         | variate | variate    |         | variate | variate
-----------------------------------------------------------------------------
nwxc_nwdft_1he |    14.1 |    84.4 |    43.6    |   223.3 |   295.3 |   250.5
nwxc_nwdft_1ne |    29.8 |    98.8 |    61.1    |   255.9 |   342.4 |   303.5
nwxc_nwdft_1ar |    69.2 |   205.2 |   130.6    |   307.9 |   424.7 |   368.9
nwxc_nwdft_1kr |   259.9 |   428.5 |   337.3    |   491.5 |   628.3 |   572.9
nwxc_nwdft_1xe |   652.8 |   817.6 |   734.3    |   853.0 |  1023.0 |   936.0
nwxc_nwdft_3he |    11.7 |    47.9 |    27.4    |   198.2 |   248.4 |   221.7
nwxc_nwdft_4n  |    35.0 |   117.1 |   100.5    |   303.5 |   395.5 |   346.6
nwxc_nwdft_4p  |    84.9 |   265.1 |   243.9    |   373.9 |   551.7 |   453.1
nwxc_nwdft_4as |   314.9 |   536.1 |   478.3    |   594.4 |   825.1 |   701.7
nwxc_nwdft_4sb |   788.4 |   997.7 |   815.8    |  1046.3 |  1241.9 |  1129.9
-----------------------------------------------------------------------------
nwxc_tddft_1he |    17.9 |   143.5 |    58.7    |   230.1 |   361.8 |   289.7
nwxc_tddft_1ne |    35.9 |   162.5 |    77.3    |   283.5 |   330.3 |   311.1
nwxc_tddft_1ar |    82.6 |   360.3 |   174.6    |   341.1 |   615.4 |   358.0
nwxc_tddft_1kr |   243.5 |   609.3 |   362.5    |   445.7 |   949.1 |   632.0
nwxc_tddft_1xe |   633.7 |  1015.9 |   746.2    |   956.6 |  1428.2 |  1117.7
nwxc_tddft_3he |    10.2 |    52.0 |    25.7    |   153.0 |   182.6 |   159.5
nwxc_tddft_4n  |    57.2 |   183.3 |   100.5    |   321.7 |   375.6 |   355.1
nwxc_tddft_4p  |   142.5 |   425.2 |   243.9    |   402.3 |   646.3 |   452.5
nwxc_tddft_4as |   367.9 |   677.1 |   478.3    |   575.2 |   985.3 |   702.1
nwxc_tddft_4sb |   734.0 |   973.3 |   815.8    |   955.1 |  1282.7 |  1099.1
=============================================================================

The timings show that even the best automatic differentiation approach is up
to 3 times slower than the hand written code in cases where the density 
functional evaluation represents a large chunk of the compute. For heavier
atoms the density and the matrix element evaluation (which scale as N**3) 
dominates and the functional evaluation (which scales as N**1) has less of an
impact.

As the multivariate differentiation is clearly the best bet we will proceed
with that. The next steps are to optimize this code. Opportunities for 
optimization are:

1. In binary operators the sets of input variables need to be merged. This can
   be sped up using bit operations.
2. After 1. the variable set is represented by a single integer and hence it
   is easy to compare them. If in a binary operator the input sets of variables
   are the same then a simpler loop structure can be used.
3. In binary operations the array indeces can be pre-computed and re-used.
4. In some operations quantities can be precalculated and re-used.
5. In the functionals the code has been optimized for the calculation of the
   energy AND the 1st derivatives. With automatic differentiation the code 
   should be optimized for the energy evaluation only. E.g. in the manually
   written code rho^4/3 is typically calculated as rho13 = rho**(1/3);
   rho43 = rho*rho13. This is handy for the energy and gradient evaluation as
   rho13 is needed for the latter anyway. In the automatic differentiation 
   approach this just leads to a superfluous multiplication and assignment.

Optimization steps 1 & 2
========================

At the moment optimization steps 1 and 2 as outlined above have been
implemented. Below the timings are compared. The column "Opt Multi-variate"
refers to the new optimized code. The "Multi-variate" column refers to the
code prior to optimization.

=============================================================================
               | CPU time (s)                   | Wall clock time (s)
-----------------------------------------------------------------------------
Job            | Manual  | Multi-  | Opt Multi- | Manual  | Multi-  | Opt M-
               |         | variate | variate    |         | variate | variate
-----------------------------------------------------------------------------
nwxc_nwdft_1he |    14.1 |    43.6 |    34.7    |   223.3 |   250.5 |   163.0
nwxc_nwdft_1ne |    29.8 |    61.1 |    52.0    |   255.9 |   303.5 |   181.9
nwxc_nwdft_1ar |    69.2 |   130.6 |   112.5    |   307.9 |   368.9 |   270.6
nwxc_nwdft_1kr |   259.9 |   337.3 |   312.6    |   491.5 |   572.9 |   478.4
nwxc_nwdft_1xe |   652.8 |   734.3 |   696.9    |   853.0 |   936.0 |   904.1
nwxc_nwdft_3he |    11.7 |    27.4 |    23.3    |   198.2 |   221.7 |   171.8
nwxc_nwdft_4n  |    35.0 |   100.5 |    60.8    |   303.5 |   346.6 |   242.8
nwxc_nwdft_4p  |    84.9 |   243.9 |   142.1    |   373.9 |   453.1 |   315.7
nwxc_nwdft_4as |   314.9 |   478.3 |   389.3    |   594.4 |   701.7 |   593.4
nwxc_nwdft_4sb |   788.4 |   815.8 |   846.5    |  1046.3 |  1129.9 |  1086.5
-----------------------------------------------------------------------------
nwxc_tddft_1he |    17.9 |    58.7 |    46.8    |   230.1 |   289.7 |   405.5
nwxc_tddft_1ne |    35.9 |    77.3 |    64.0    |   283.5 |   311.1 |   388.2
nwxc_tddft_1ar |    82.6 |   174.6 |   144.7    |   341.1 |   358.0 |   443.3
nwxc_tddft_1kr |   243.5 |   362.5 |   319.9    |   445.7 |   632.0 |   668.7
nwxc_tddft_1xe |   633.7 |   746.2 |   703.6    |   956.6 |  1117.7 |   999.1
nwxc_tddft_3he |    10.2 |    25.7 |    21.5    |   153.0 |   159.5 |   227.4
nwxc_tddft_4n  |    57.2 |   100.5 |    86.2    |   321.7 |   355.1 |   433.7
nwxc_tddft_4p  |   142.5 |   243.9 |   208.1    |   402.3 |   452.5 |   513.0
nwxc_tddft_4as |   367.9 |   478.3 |   436.7    |   575.2 |   702.1 |   779.1
nwxc_tddft_4sb |   734.0 |   815.8 |   776.9    |   955.1 |  1099.1 |  1045.2
-----------------------------------------------------------------------------
total          |  4586.1 |  6056.3 |  5479.1    |         |         |
=============================================================================

The results show that whereas the unoptimized code has an overall overhead of
30% over the hand written code, the optimized code is 10% faster than the 
unoptimized code leaving an overhead of about 20%.

Potential optimization steps 3 and 4 are unlikely to help very much. 
Precomputing the indeces only helps for higher order derivatives such as in the
TDDFT calculations. In the regular energy evaluation we already use stored
indeces and no further gain can be obtained from that approach. Also, the most
costly operators are the multiplication and addition operators. In their 
implementations there are no quantities we can precompute and reuse (this could
be done for the division and exponentiation but they take relatively little
time anyway).

So the only remaining next step to optimize the code for energy evaluations
(step 5) rather than for the original energy+gradient evaluation.

At present this is still VERY MUCH UNDER CONSTRUCTION.
