Lab 1: MC Statistical Analysis

Topics covered in this lab

This lab focuses on the basics of analyzing data from MC calculations. In this lab, participants will use data from VMC calculations of a simple 1-electron system with an analytically soluble system (the ground state of the hydrogen atom) to understand how to interpret an MC situation. Most of these analyses will also carry over to DMC simulations. Topics covered include:

  • Averaging MC variables

  • The statisical error bar of mean values

  • The effects of autocorrelation and variance on the error bar

  • The relationship between MC time step and autocorrelation

  • The use of blocking to reduce autocorrelation

  • The significance of the acceptance ratio

  • The significance of the sample size

  • How to determine whether an MC run was successful

  • The relationship between wavefunction quality and variance

  • Gauging the efficiency of MC runs

  • The cost of scaling up to larger system sizes

Lab directories and files

labs/lab1_qmc_statistics/
│
├── atom                              - H atom VMC calculation
│   ├── H.s000.scalar.dat                - H atom VMC data
│   └── H.xml                            - H atom VMC input file
│
├── autocorrelation                   - varying autocorrelation
│   ├── H.dat                            - data for gnuplot
│   ├── H.plt                            - gnuplot for time step vs. E_L, tau_c
│   ├── H.s000.scalar.dat                - H atom VMC data: time step = 10
│   ├── H.s001.scalar.dat                - H atom VMC data: time step =  5
│   ├── H.s002.scalar.dat                - H atom VMC data: time step =  2
│   ├── H.s003.scalar.dat                - H atom VMC data: time step =  1
│   ├── H.s004.scalar.dat                - H atom VMC data: time step =  0.5
│   ├── H.s005.scalar.dat                - H atom VMC data: time step =  0.2
│   ├── H.s006.scalar.dat                - H atom VMC data: time step =  0.1
│   ├── H.s007.scalar.dat                - H atom VMC data: time step =  0.05
│   ├── H.s008.scalar.dat                - H atom VMC data: time step =  0.02
│   ├── H.s009.scalar.dat                - H atom VMC data: time step =  0.01
│   ├── H.s010.scalar.dat                - H atom VMC data: time step =  0.005
│   ├── H.s011.scalar.dat                - H atom VMC data: time step =  0.002
│   ├── H.s012.scalar.dat                - H atom VMC data: time step =  0.001
│   ├── H.s013.scalar.dat                - H atom VMC data: time step =  0.0005
│   ├── H.s014.scalar.dat                - H atom VMC data: time step =  0.0002
│   ├── H.s015.scalar.dat                - H atom VMC data: time step =  0.0001
│   └── H.xml                            - H atom VMC input file
│
├── average                            - Python scripts for average/std. dev.
│   ├── average.py                         - average five E_L from H atom VMC
│   ├── stddev2.py                         - standard deviation using (E_L)^2
│   └── stddev.py                          - standard deviation around the mean
│
├── basis                              - varying basis set for orbitals
│   ├── H__exact.s000.scalar.dat           - H atom VMC data using STO basis
│   ├── H_STO-2G.s000.scalar.dat           - H atom VMC data using STO-2G basis
│   ├── H_STO-3G.s000.scalar.dat           - H atom VMC data using STO-3G basis
│   └── H_STO-6G.s000.scalar.dat           - H atom VMC data using STO-6G basis
│
├── blocking                           - varying block/step ratio
│   ├── H.dat                              - data for gnuplot
│   ├── H.plt                              - gnuplot for N_block vs. E, tau_c
│   ├── H.s000.scalar.dat                  - H atom VMC data 50000:1 blocks:steps
│   ├── H.s001.scalar.dat                  - "  "    "    "  25000:2 blocks:steps
│   ├── H.s002.scalar.dat                  - "  "    "    "  12500:4 blocks:steps
│   ├── H.s003.scalar.dat                  - "  "    "    "  6250: 8 blocks:steps
│   ├── H.s004.scalar.dat                  - "  "    "    "  3125:16 blocks:steps
│   ├── H.s005.scalar.dat                  - "  "    "    "  2500:20 blocks:steps
│   ├── H.s006.scalar.dat                  - "  "    "    "  1250:40 blocks:steps
│   ├── H.s007.scalar.dat                  - "  "    "    "  1000:50 blocks:steps
│   ├── H.s008.scalar.dat                  - "  "    "    "  500:100 blocks:steps
│   ├── H.s009.scalar.dat                  - "  "    "    "  250:200 blocks:steps
│   ├── H.s010.scalar.dat                  - "  "    "    "  125:400 blocks:steps
│   ├── H.s011.scalar.dat                  - "  "    "    "  100:500 blocks:steps
│   ├── H.s012.scalar.dat                  - "  "    "    "  50:1000 blocks:steps
│   ├── H.s013.scalar.dat                  - "  "    "    "  40:1250 blocks:steps
│   ├── H.s014.scalar.dat                  - "  "    "    "  20:2500 blocks:steps
│   ├── H.s015.scalar.dat                  - "  "    "    "  10:5000 blocks:steps
│   └── H.xml                             - H atom VMC input file
│
├── blocks                             -  varying total number of blocks
│   ├── H.dat                             - data for gnuplot
│   ├── H.plt                             - gnuplot for N_block vs. E
│   ├── H.s000.scalar.dat                 - H atom VMC data    500 blocks
│   ├── H.s001.scalar.dat                 - "  "    "    "    2000 blocks
│   ├── H.s002.scalar.dat                 - "  "    "    "    8000 blocks
│   ├── H.s003.scalar.dat                 - "  "    "    "   32000 blocks
│   ├── H.s004.scalar.dat                 - "  "    "    "  128000 blocks
│   └── H.xml                             - H atom VMC input file
│
├── dimer                          - comparing no and simple Jastrow factor
│   ├── H2_STO___no_jastrow.s000.scalar.dat - H dimer VMC data without Jastrow
│   └── H2_STO_with_jastrow.s000.scalar.dat - H dimer VMC data with Jastrow
│
├──  docs                               - documentation
│   ├──  Lab_1_MC_Analysis.pdf             - this document
│   └──  Lab_1_Slides.pdf                  - slides presented in the lab
│
├── nodes                              - varying number of computing nodes
│   ├──  H.dat                             - data for gnuplot
│   ├──  H.plt                             - gnuplot for N_node vs. E
│   ├──  H.s000.scalar.dat                 - H atom VMC data with  32 nodes
│   ├──  H.s001.scalar.dat                 - H atom VMC data with 128 nodes
│   └──  H.s002.scalar.dat                 - H atom VMC data with 512 nodes
│
├── problematic                        - problematic VMC run
│   └──  H.s000.scalar.dat                 - H atom VMC data with a problem
│
└── size                                - scaling with number of particles
    ├──  01________H.s000.scalar.dat       - H atom VMC data
    ├──  02_______H2.s000.scalar.dat       - H dimer "   "
    ├──  06________C.s000.scalar.dat       - C atom  "   "
    ├──  10______CH4.s000.scalar.dat       - methane "   "
    ├──  12_______C2.s000.scalar.dat       - C dimer "   "
    ├──  16_____C2H4.s000.scalar.dat       - ethene  "
    ├──  18___CH4CH4.s000.scalar.dat       - methane dimer VMC data
    ├──  32_C2H4C2H4.s000.scalar.dat       - ethene dimer   "   "
    ├──  nelectron_tcpu.dat                - data for gnuplot
    └──  Nelectron_tCPU.plt                - gnuplot for N_elec vs. t_CPU

Atomic units

QMCPACK operates in Ha atomic units to reduce the number of factors in the Schrödinger equation. Thus, the unit of length is the bohr (5.291772 \(\times 10^{-11}\) m = 0.529177 Å); the unit of energy is the Ha (4.359744 \(\times 10^{-18}\) J = 27.211385 eV). The energy of the ground state of the hydrogen atom in these units is -0.5 Ha.

Reviewing statistics

We will practice taking the average (mean) and standard deviation of some MC data by hand to review the basic definitions.

Enter Python’s command line by typing python [Enter]. You will see a prompt “>>>.”

The mean of a dataset is given by:

(61)\[\overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\:.\]

To calculate the average of five local energies from an MC calculation of the ground state of an electron in the hydrogen atom, input (truncate at the thousandths place if you cannot copy and paste; script versions are also available in the average directory):

(
(-0.45298911858) +
(-0.45481953564) +
(-0.48066105923) +
(-0.47316713469) +
(-0.46204733302)
)/5.

Then, press [Enter] to get:

>>> ((-0.45298911858) + (-0.45481953564) + (-0.48066105923) +
(-0.47316713469) + (-0.4620473302))/5.
-0.46473683566800006

To understand the significance of the mean, we also need the standard deviation around the mean of the data (also called the error bar), given by:

(62)\[\sigma = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N} ({x_i} - \overline{x})^2}\:.\]

To calculate the standard deviation around the mean (-0.464736835668) of these five data points, put in:

( (1./(5.*(5.-1.))) * (
(-0.45298911858-(-0.464736835668))**2 + \\
(-0.45481953564-(-0.464736835668))**2 +
(-0.48066105923-(-0.464736835668))**2 +
(-0.47316713469-(-0.464736835668))**2 +
(-0.46204733302-(-0.464736835668))**2 )
)**0.5

Then, press [Enter] to get:

>>> ( (1./(5.*(5.-1.))) * ( (-0.45298911858-(-0.464736835668))**2 +
(-0.45481953564-(-0.464736835668))**2 + (-0.48066105923-(-0.464736835668))**2 +
(-0.47316713469-(-0.464736835668))**2 + (-0.46204733302-(-0.464736835668))**2
) )**0.5
0.0053303187464332066

Thus, we might report this data as having a value -0.465 +/- 0.005 Ha. This calculation of the standard deviation assumes that the average for this data is fixed, but we can continually add MC samples to the data, so it is better to use an estimate of the error bar that does not rely on the overall average. Such an estimate is given by:

(63)\[\tilde{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left[{(x^2)}_i - ({x_i})^2\right]}\:.\]

To calculate the standard deviation with this formula, input the following, which includes the square of the local energy calculated with each corresponding local energy:

( (1./(5.-1.)) * (
(0.60984565298-(-0.45298911858)**2) + \\
(0.61641291630-(-0.45481953564)**2) +
(1.35860151160-(-0.48066105923)**2) + \\
(0.78720769003-(-0.47316713469)**2) +
(0.56393677687-(-0.46204733302)**2) )
)**0.5

and press [Enter] to get:

>>> ((1./(5.-1.))*((0.60984565298-(-0.45298911858)**2)+
(0.61641291630-(-0.45481953564)**2)+(1.35860151160-(-0.48066105923)**2)+
(0.78720769003-(-0.47316713469)**2)+(0.56393677687-(-0.46204733302)**2))
)**0.5
0.84491636672906634

This much larger standard deviation, acknowledging that the mean of this small data set is not the average in the limit of infinite sampling, more accurately reports the value of the local energy as -0.5 +/- 0.8 Ha.

Type quit() and press [Enter] to exit the Python command line.

Inspecting MC Data

QMCPACK outputs data from MC calculations into files ending in scalar.dat. Several quantities are calculated and written for each block of MC steps in successive columns to the right of the step index.

Change directories to atom, and open the file ending in scalar.dat with a text editor (e.g., vi *.scalar.dat or emacs *.scalar.dat. If possible, adjust the terminal so that lines do not wrap. The data will begin as follows (broken into three groups to fit on this page):

#   index    LocalEnergy         LocalEnergy_sq      LocalPotential     ...
         0   -4.5298911858e-01    6.0984565298e-01   -1.1708693521e+00
         1   -4.5481953564e-01    6.1641291630e-01   -1.1863425644e+00
         2   -4.8066105923e-01    1.3586015116e+00   -1.1766446209e+00
         3   -4.7316713469e-01    7.8720769003e-01   -1.1799481122e+00
         4   -4.6204733302e-01    5.6393677687e-01   -1.1619244081e+00
         5   -4.4313854290e-01    6.0831516179e-01   -1.2064503041e+00
         6   -4.5064926960e-01    5.9891422196e-01   -1.1521370176e+00
         7   -4.5687452611e-01    5.8139614676e-01   -1.1423627617e+00
         8   -4.5018503739e-01    8.4147849706e-01   -1.1842075439e+00
         9   -4.3862013841e-01    5.5477715836e-01   -1.2080979177e+00

The first line begins with a #, indicating that this line does not contain MC data but rather the labels of the columns. After a blank line, the remaining lines consist of the MC data. The first column, labeled index, is an integer indicating which block of MC data is on that line. The second column contains the quantity usually of greatest interest from the simulation: the local energy. Since this simulation did not use the exact ground state wavefunction, it does not produce -0.5 Ha as the local energy although the value lies within about 10%. The value of the local energy fluctuates from block to block, and the closer the trial wavefunction is to the ground state the smaller these fluctuations will be. The next column contains an important ingredient in estimating the error in the MC average—the square of the local energy—found by evaluating the square of the Hamiltonian.

...   Kinetic             Coulomb             BlockWeight        ...
       7.1788023352e-01   -1.1708693521e+00    1.2800000000e+04
       7.3152302871e-01   -1.1863425644e+00    1.2800000000e+04
       6.9598356165e-01   -1.1766446209e+00    1.2800000000e+04
       7.0678097751e-01   -1.1799481122e+00    1.2800000000e+04
       6.9987707508e-01   -1.1619244081e+00    1.2800000000e+04
       7.6331176120e-01   -1.2064503041e+00    1.2800000000e+04
       7.0148774798e-01   -1.1521370176e+00    1.2800000000e+04
       6.8548823555e-01   -1.1423627617e+00    1.2800000000e+04
       7.3402250655e-01   -1.1842075439e+00    1.2800000000e+04
       7.6947777925e-01   -1.2080979177e+00    1.2800000000e+04

The fourth column from the left consists of the values of the local potential energy. In this simulation, it is identical to the Coulomb potential (contained in the sixth column) because the one electron in the simulation has only the potential energy coming from its interaction with the nucleus. In many-electron simulations, the local potential energy contains contributions from the electron-electron Coulomb interactions and the nuclear potential or pseudopotential. The fifth column contains the local kinetic energy value for each MC block, obtained from the Laplacian of the wavefunction. The sixth column shows the local Coulomb interaction energy. The seventh column displays the weight each line of data has in the average (the weights are identical in this simulation).

...    BlockCPU            AcceptRatio
       6.0178991748e-03    9.8515625000e-01
       5.8323097461e-03    9.8562500000e-01
       5.8213412744e-03    9.8531250000e-01
       5.8330412549e-03    9.8828125000e-01
       5.8108362256e-03    9.8625000000e-01
       5.8254170264e-03    9.8625000000e-01
       5.8314813086e-03    9.8679687500e-01
       5.8258469971e-03    9.8726562500e-01
       5.8158433545e-03    9.8468750000e-01
       5.7959401123e-03    9.8539062500e-01

The eighth column shows the CPU time (in seconds) to calculate the data in that line. The ninth column from the left contains the acceptance ratio (1 being full acceptance) for MC steps in that line’s data. Other than the block weight, all quantities vary from line to line.

Exit the text editor ([Esc] :q! [Enter] in vi, [Ctrl]-x [Ctrl]-c in emacs).

Averaging quantities in the MC data

QMCPACK includes the qmca Python tool to average quantities in the scalar.dat file (and also the dmc.dat file of DMC simulations). Without any flags, qmca will output the average of each column with a quantity in the scalar.dat file as follows.

Execute qmca by qmca *.scalar.dat, which for this data outputs:

H  series 0
LocalEnergy           =          -0.45446 +/-          0.00057
Variance              =             0.529 +/-            0.018
Kinetic               =            0.7366 +/-           0.0020
LocalPotential        =           -1.1910 +/-           0.0016
Coulomb               =           -1.1910 +/-           0.0016
LocalEnergy_sq        =             0.736 +/-            0.018
BlockWeight           =    12800.00000000 +/-       0.00000000
BlockCPU              =        0.00582002 +/-       0.00000067
AcceptRatio           =          0.985508 +/-         0.000048
Efficiency            =        0.00000000 +/-       0.00000000

After one blank, qmca prints the title of the subsequent data, gleaned from the data file name. In this case, H.s000.scalar.dat became “H series 0.” Everything before the first “.s” will be interpreted as the title, and the number between “.s” and the next “.” will be interpreted as the series number.

The first column under the title is the name of each quantity qmca averaged. The column to the right of the equal signs contains the average for the quantity of that line, and the column to the right of the plus-slash-minus is the statistical error bar on the quantity. All quantities calculated from MC simulations have and must be reported with a statistical error bar!

Two new quantities not present in the scalar.dat file are computed by qmca from the data—variance and efficiency. We will look at these later in this lab.

To view only one value, qmca takes the -q (quantity) flag. For example, the output of qmca -q LocalEnergy *.scalar.dat in this directory produces a single line of output:

H  series 0  LocalEnergy = -0.454460 +/- 0.000568

Type qmca –help to see the list of all quantities and their abbreviations.

Evaluating MC simulation quality

There are several aspects of a MC simulation to consider in deciding how well it went. Besides the deviation of the average from an expected value (if there is one), the stability of the simulation in its sampling, the autocorrelation between MC steps, the value of the acceptance ratio (accepted steps over total proposed steps), and the variance in the local energy all indicate the quality of an MC simulation. We will look at these one by one.

Tracing MC quantities

Visualizing the evolution of MC quantities over the course of the simulation by a trace offers a quick picture of whether the random walk had the expected behavior. qmca plots traces with the -t flag.

Type qmca -q e -t H.s000.scalar.dat, which produces a graph of the trace of the local energy:

_images/lab_qmc_statistics_tracing1.png

The solid black line connects the values of the local energy at each MC block (labeled “samples”). The average value is marked with a horizontal, solid red line. One standard deviation above and below the average are marked with horizontal, dashed red lines.

The solid black line connects the values of the local energy at each MC block (labeled “samples”). The average value is marked with a horizontal, solid red line. One standard deviation above and below the average are marked with horizontal, dashed red lines.

The trace of this run is largely centered on the average with no large-scale oscillations or major shifts, indicating a good-quality MC run.

Try tracing the kinetic and potential energies, seeing that their behavior is comparable with the total local energy.

Change to directory problematic and type qmca -q e -t H.s000.scalar.dat to produce this graph:

_images/lab_qmc_statistics_tracing2.png

Here, the local energy samples cluster around the expected -0.5 Ha for the first 150 samples or so and then begin to oscillate more wildly and increase erratically toward 0, indicating a poor-quality MC run.

Again, trace the kinetic and potential energies in this run and see how their behavior compares with the total local energy.

Blocking away autocorrelation

Autocorrelation occurs when a given MC step biases subsequent MC steps, leading to samples that are not statistically independent. We must take this autocorrelation into account to obtain accurate statistics. qmca outputs autocorrelation when given the –sac flag.

Change to directory autocorrelation and type qmca -q e --sac H.s000.scalar.dat.

H  series 0  LocalEnergy = -0.454982 +/- 0.000430    1.0

The value after the error bar on the quantity is the autocorrelation (1.0 in this case).

Proposing too small a step in configuration space, the MC time step, can lead to autocorrelation since the new samples will be in the neighborhood of previous samples. Type grep timestep H.xml to see the varying time step values in this QMCPACK input file (H.xml):

<parameter name="timestep">10</parameter>
<parameter name="timestep">5</parameter>
<parameter name="timestep">2</parameter>
<parameter name="timestep">1</parameter>
<parameter name="timestep">0.5</parameter>
<parameter name="timestep">0.2</parameter>
<parameter name="timestep">0.1</parameter>
<parameter name="timestep">0.05</parameter>
<parameter name="timestep">0.02</parameter>
<parameter name="timestep">0.01</parameter>
<parameter name="timestep">0.005</parameter>
<parameter name="timestep">0.002</parameter>
<parameter name="timestep">0.001</parameter>
<parameter name="timestep">0.0005</parameter>
<parameter name="timestep">0.0002</parameter>
<parameter name="timestep">0.0001</parameter>

Generally, as the time step decreases, the autocorrelation will increase (caveat: very large time steps will also have increasing autocorrelation). To see this, type qmca -q e --sac *.scalar.dat to see the energies and autocorrelation times, then plot with gnuplot by inputting gnuplot H.plt:

_images/lab_qmc_statistics_blocking1.png

The error bar also increases with the autocorrelation.

Press q [Enter] to quit gnuplot.

To get around the bias of autocorrelation, we group the MC steps into blocks, take the average of the data in the steps of each block, and then finally average the averages in all the blocks. QMCPACK outputs the block averages as each line in the scalar.dat file. (For DMC simulations, in addition to the scalar.dat, QMCPACK outputs the quantities at each step to the dmc.dat file, which permits reblocking the data differently from the specification in the input file.)

Change directories to blocking. Here we look at the time step of the last dataset in the autocorrelation directory. Verify this by typing grep timestep H.xml to see that all values are set to 0.001. Now to see how we will vary the blocking, type grep -A1 blocks H.xml. The parameter “steps” indicates the number of steps per block, and the parameter “blocks” gives the number of blocks. For this comparison, the total number of MC steps (equal to the product of “steps” and “blocks”) is fixed at 50,000. Now check the effect of blocking on autocorrelation—type qmca -q e --sac *scalar.dat to see the data and gnuplot H.plt to visualize the data:

_images/lab_qmc_statistics_blocking2.png

The greatest number of steps per block produces the smallest autocorrelation time. The larger number of blocks over which to average at small step-per-block number masks the corresponding increase in error bar with increasing autocorrelation.

Press q [Enter] to quit gnuplot.

Balancing autocorrelation and acceptance ratio

Adjusting the time step value also affects the ratio of accepted steps to proposed steps. Stepping nearby in configuration space implies that the probability distribution is similar and thus more likely to result in an accepted move. Keeping the acceptance ratio high means the algorithm is efficiently exploring configuration space and not sticking at particular configurations. Return to the autocorrelation directory. Refresh your memory on the time steps in this set of simulations by grep timestep H.xml. Then, type qmca -q ar *scalar.dat to see the acceptance ratio as it varies with decreasing time step:

H  series 0  AcceptRatio = 0.047646 +/- 0.000206
H  series 1  AcceptRatio = 0.125361 +/- 0.000308
H  series 2  AcceptRatio = 0.328590 +/- 0.000340
H  series 3  AcceptRatio = 0.535708 +/- 0.000313
H  series 4  AcceptRatio = 0.732537 +/- 0.000234
H  series 5  AcceptRatio = 0.903498 +/- 0.000156
H  series 6  AcceptRatio = 0.961506 +/- 0.000083
H  series 7  AcceptRatio = 0.985499 +/- 0.000051
H  series 8  AcceptRatio = 0.996251 +/- 0.000025
H  series 9  AcceptRatio = 0.998638 +/- 0.000014
H  series 10  AcceptRatio = 0.999515 +/- 0.000009
H  series 11  AcceptRatio = 0.999884 +/- 0.000004
H  series 12  AcceptRatio = 0.999958 +/- 0.000003
H  series 13  AcceptRatio = 0.999986 +/- 0.000002
H  series 14  AcceptRatio = 0.999995 +/- 0.000001
H  series 15  AcceptRatio = 0.999999 +/- 0.000000

By series 8 (time step = 0.02), the acceptance ratio is in excess of 99%.

Considering the increase in autocorrelation and subsequent increase in error bar as time step decreases, it is important to choose a time step that trades off appropriately between acceptance ratio and autocorrelation. In this example, a time step of 0.02 occupies a spot where the acceptance ratio is high (99.6%) and autocorrelation is not appreciably larger than the minimum value (1.4 vs. 1.0).

Considering variance

Besides autocorrelation, the dominant contributor to the error bar is the variance in the local energy. The variance measures the fluctuations around the average local energy, and, as the fluctuations go to zero, the wavefunction reaches an exact eigenstate of the Hamiltonian. qmca calculates this from the local energy and local energy squared columns of the scalar.dat.

Type qmca -q v H.s009.scalar.dat to calculate the variance on the run with time step balancing autocorrelation and acceptance ratio:

H  series 9  Variance = 0.513570 +/- 0.010589

Just as the total energy does not tell us much by itself, neither does the variance. However, comparing the ratio of the variance with the energy indicates how the magnitude of the fluctuations compares with the energy itself. Type qmca -q ev H.s009.scalar.dat to calculate the energy and variance on the run side by side with the ratio:

                    LocalEnergy               Variance        ratio
H  series 0  -0.454460 +/- 0.000568   0.529496 +/- 0.018445   1.1651

The very high ration of 1.1651 indicates the square of the fluctuations is on average larger than the value itself. In the next section, we will approach ways to improve the variance that subsequent labs will build on.

Reducing statistical error bars

Increasing MC sampling

Increasing the number of MC samples in a dataset reduces the error bar as the inverse of the square root of the number of samples. There are two ways to increase the number of MC samples in a simulation: (1) running more samples in parallel and (2) increasing the number of blocks (with fixed number of steps per block, this increases the total number of MC steps).

To see the effect of running more samples in parallel, change to the directory nodes. The series here increases the number of nodes by factors of four from 32 to 128 to 512. Type qmca -q ev *scalar.dat and note the change in the error bar on the local energy as the number of nodes. Visualize this with gnuplot H.plt:

_images/lab_qmc_statistics_nodes.png

Increasing the number of blocks, unlike running in parallel, increases the total CPU time of the simulation.

Press q [Enter] to quit gnuplot.

To see the effect of increasing the block number, change to the directory blocks. To see how we will vary the number of blocks, type grep -A1 blocks H.xml. The number of steps remains fixed, thus increasing the total number of samples. Visualize the tradeoff by inputting gnuplot H.plt:

_images/lab_qmc_statistics_blocks.png

Press q [Enter] to quit gnuplot.

Improving the basis sets

In all of the previous examples, we are using the sum of two Gaussian functions (STO-2G) to approximate what should be a simple decaying exponential for the wavefunction of the ground state of the hydrogen atom. The sum of multiple copies of a function varying each copy’s width and amplitude with coefficients is called a basis set. As we add Gaussians to the basis set, the approximation improves, the variance goes toward zero, and the energy goes to -0.5 Ha. In nearly every other case, the exact function is unknown, and we add basis functions until the total energy does not change within some threshold.

Change to the directory basis and look at the total energy and variance as we change the wavefunction by typing qmca -q ev H_:

                          LocalEnergy               Variance        ratio
H_STO-2G  series 0  -0.454460 +/- 0.000568   0.529496 +/- 0.018445   1.1651
H_STO-3G  series 0  -0.465386 +/- 0.000502   0.410491 +/- 0.010051   0.8820
H_STO-6G  series 0  -0.471332 +/- 0.000491   0.213919 +/- 0.012954   0.4539
H__exact  series 0  -0.500000 +/- 0.000000   0.000000 +/- 0.000000   -0.0000

qmca also puts out the ratio of the variance to the local energy in a column to the right of the variance error bar. A typical high-quality value for this ratio is lower than 0.1 or so—none of these few-Gaussian wavefunctions satisfy that rule of thumb.

Use qmca to plot the trace of the local energy, kinetic energy, and potential energy of H__exact. The total energy is constantly -0.5 Ha even though the kinetic and potential energies fluctuate from configuration to configuration.

Adding a Jastrow factor

Another route to reducing the variance is the introduction of a Jastrow factor to account for electron-electron correlation (not the statistical autocorrelation of MC steps but the physical avoidance that electrons have of one another). To do this, we will switch to the hydrogen dimer with the exact ground state wavefunction of the atom (STO basis)—this will not be exact for the dimer. The ground state energy of the hydrogen dimer is -1.174 Ha.

Change directories to dimer and put in qmca -q ev *scalar.dat to see the result of adding a simple, one-parameter Jastrow to the STO basis for the hydrogen dimer at experimental bond length:

                                LocalEnergy               Variance
H2_STO___no_jastrow  series 0  -0.876548 +/- 0.005313   0.473526 +/- 0.014910
H2_STO_with_jastrow  series 0  -0.912763 +/- 0.004470   0.279651 +/- 0.016405

The energy reduces by 0.044 +/- 0.006 HA and the variance by 0.19 +/- 0.02. This is still 20% above the ground state energy, and subsequent labs will cover how to improve on this with improved forms of the wavefunction that capture more of the physics.

Scaling to larger numbers of electrons

Calculating the efficiency

The inverse of the product of CPU time and the variance measures the efficiency of an MC calculation. Use qmca to calculate efficiency by typing qmca -q eff *scalar.dat to see the efficiency of these two H \(_2\) calculations:

H2_STO___no_jastrow  series 0  Efficiency = 16698.725453 +/- 0.000000
H2_STO_with_jastrow  series 0  Efficiency = 52912.365609 +/- 0.000000

The Jastrow factor increased the efficiency in these calculations by a factor of three, largely through the reduction in variance (check the average block CPU time to verify this claim).

Scaling up

To see how MC scales with increasing particle number, change directories to size. Here are the data from runs of increasing numbers of electrons for H, H\(_2\), C, CH\(_4\), C\(_2\), C\(_2\)H\(_4\), (CH\(_4\))\(_2\), and (C\(_2\)H\(_4\))\(_2\) using the STO-6G basis set for the orbitals of the Slater determinant. The file names begin with the number of electrons simulated for those data.

Use qmca -q bc *scalar.dat to see that the CPU time per block increases with the number of electrons in the simulation; then plot the total CPU time of the simulation by gnuplot Nelectron_tCPU.plt:

_images/lab_qmc_statistics_scaling.png

The green pluses represent the CPU time per block at each electron number. The red line is a quadratic fit to those data. For a fixed basis set size, we expect the time to scale quadratically up to 1,000s of electrons, at which point a cubic scaling term may become dominant. Knowing the scaling allows you to roughly project the calculation time for a larger number of electrons.

Press q [Enter] to quit gnuplot.

This is not the whole story, however. The variance of the energy also increases with a fixed basis set as the number of particles increases at a faster rate than the energy decreases. To see this, type qmca -q ev *scalar.dat:

            LocalEnergy               Variance
01________H  series 0  -0.471352 +/- 0.000493      0.213020 +/- 0.012950
02_______H2  series 0  -0.898875 +/- 0.000998      0.545717 +/- 0.009980
06________C  series 0  -37.608586 +/- 0.020453   184.322000 +/- 45.481193
10______CH4  series 0  -38.821513 +/- 0.022740   169.797871 +/- 24.765674
12_______C2  series 0  -72.302390 +/- 0.037691   491.416711 +/- 106.090103
16_____C2H4  series 0  -75.488701 +/- 0.042919   404.218115 +/- 60.196642
18___CH4CH4  series 0  -58.459857 +/- 0.039309   498.579645 +/- 92.480126
32_C2H4C2H4  series 0  -91.567283 +/- 0.048392   632.114026 +/- 69.637760

The increase in variance is not uniform, but the general trend is upward with a fixed wavefunction form and basis set. Subsequent labs will address how to improve the wavefunction to keep the variance manageable.