u-nix

Technical Report: Performance Analysis of Normal CDF Inverse Implementation – C99 vs. Tcl 9.0

Keywords: Numerical Computing, Tcl 9.0, C99, Acklam’s Algorithm, Parallelism, Benchmark, Intel Core i7.

Abstract

This study evaluates the computational efficiency and numerical precision of Acklam’s algorithm for approximating standard normal distribution quantiles. The implementation was ported from C99 to the newly released Tcl 9.0. Performance was measured in both sequential and multi-threaded modes on a dataset of $10^6$ points. Results show that, by leveraging Tcl 9.0's threading capabilities on a dual-core Intel architecture, the multi-threaded Tcl run finishes in 1.034 s against 0.852 s for the compiled C99 binary. The gap of 0.182 s is about 18% of the Tcl run time (equivalently, Tcl is about 21% slower than C).


1. Introduction

Approximating the inverse of the standard normal cumulative distribution function (CDF) is critical for Monte Carlo simulations. This study utilizes Acklam’s rational approximation, which maintains a relative error $|\epsilon| < 1.15 \times 10^{-9}$ by segmenting the probability domain $p \in (0,1)$ into three distinct regions.

2. Experimental Setup

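All benchmarks were conducted on a MacBook Air with a 2.2 GHz dual-core Intel Core i7 (4 logical threads). The C99 version was compiled with GCC at -O3; the Tcl version ran on Tcl 9.0, using the Thread package (tpool) for the multi-threaded runs.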

3. Numerical Integrity

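Consistency between the C99 and Tcl 9.0 outputs was verified. Both programs print each result to 10 decimal places (a resolution of $10^{-10}$), and a line-by-line text comparison (diff) of the two output files reported zero differing lines across all $10^6$ samples. This is consistent with Tcl 9.0 and C99 both using IEEE 754 double-precision arithmetic on Intel x86_64 architectures. Because the comparison is made on rounded text, it demonstrates agreement to 10 decimal places rather than bit-for-bit identity.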
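
What a text-level comparison at 10 decimal places does and does not establish can be shown with a small C sketch. The helper name `same_at_10_places` is hypothetical; it formats two doubles with the same `%.10f` conversion used by the Tcl worker shown later and compares the resulting strings.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when a and b print identically with "%.10f", else 0.
   The 64-byte buffers are ample for quantile-sized values; snprintf
   truncates safely if a value were ever longer. */
int same_at_10_places(double a, double b)
{
    char sa[64], sb[64];
    snprintf(sa, sizeof sa, "%.10f", a);
    snprintf(sb, sizeof sb, "%.10f", b);
    return strcmp(sa, sb) == 0;
}
```

Two values that differ by 1e-13 pass this check even though they are distinct doubles, while a difference of 1e-9 is detected.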

4. Benchmarking Results

4.1. Single-Threaded Sequential Performance

The C99 binary, executing direct machine instructions, processed the dataset in 0.852s. The sequential Tcl 9.0 implementation required 5.987s, about 7.0 times as long.

4.2. Multi-Threaded Optimization (Tcl Threading)

To exploit the dual-core i7 architecture, the Tcl implementation was refactored using the Thread package and its thread pool (tpool). The workload was decomposed into chunks that are processed as parallel tasks.
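
The report does not state how chunk boundaries were chosen. As an illustration only, the index arithmetic for splitting n items into near-equal chunks can be sketched in C; the type `chunk_t` and the function `chunk_bounds` are hypothetical names, and an even split is an assumption.

```c
#include <stddef.h>

/* Half-open range [start, start + len) of one chunk. */
typedef struct {
    size_t start;
    size_t len;
} chunk_t;

/* Bounds of chunk k (0-based) when n items are split into nchunks parts.
   The first n % nchunks chunks receive one extra item, so chunk sizes
   differ by at most one. Requires nchunks > 0 and k < nchunks. */
chunk_t chunk_bounds(size_t n, size_t nchunks, size_t k)
{
    size_t base  = n / nchunks;
    size_t extra = n % nchunks;
    chunk_t c;
    c.start = k * base + (k < extra ? k : extra);
    c.len   = base + (k < extra ? 1 : 0);
    return c;
}
```

For the dataset of 1,000,000 points and 4 chunks, each chunk holds 250,000 values and the last one starts at index 750,000.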

By saturating both physical cores (utilizing 4 logical threads), Tcl 9.0 completed the run in 1.034s, a 5.79x speedup over its own sequential version (5.987s). The resulting wall-clock time is only 0.18s behind the optimized C binary.
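
The figures quoted here follow directly from the real-time measurements in the appendix. The symbols $S$, $\Delta$ and $T$ are introduced only for this worked arithmetic:

```latex
\begin{align*}
S &= \frac{T_{\text{Tcl,seq}}}{T_{\text{Tcl,par}}} = \frac{5.987}{1.034} \approx 5.79 \\
\Delta &= T_{\text{Tcl,par}} - T_{\text{C}} = 1.034 - 0.852 = 0.182\ \text{s} \\
\frac{\Delta}{T_{\text{C}}} &= \frac{0.182}{0.852} \approx 0.214, \qquad
\frac{\Delta}{T_{\text{Tcl,par}}} = \frac{0.182}{1.034} \approx 0.176
\end{align*}
```

The gap is therefore about 21% when measured against the C run time and about 18% when measured against the multi-threaded Tcl run time.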

5. Conclusion

On a 2.2 GHz Intel Core i7 MacBook Air, Tcl 9.0 demonstrates that it is no longer restricted to "slow" scripting roles. While C99 remains the efficiency ceiling for single-core tasks, the ease of implementing data parallelism in Tcl 9.0 allows it to come within 0.18s of the single-threaded compiled binary on this workload. For systems engineers, this offers a practical trade-off: the safety and flexibility of Tcl with throughput close to that of single-threaded C.


Appendix: Raw Performance Data Table

Implementation   Mode            Real Time (s)   User Time (s)   Sys Time (s)
--------------   -------------   -------------   -------------   ------------
C99 (GCC -O3)    Single-Thread   0.852           0.757           0.046
Tcl 9.0          Single-Thread   5.987           5.873           0.090
Tcl 9.0          Multi-Thread    1.034           0.094*          0.095


*Note: User time in Tcl multi-thread tests reflects only the main thread's CPU usage in certain environments.


Reference Implementation (Tcl 9.0 Parallel)

This snippet illustrates the core logic used to achieve near-native performance through data decomposition.


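```tcl
package require Thread

# Worker definition for the thread pool: each worker evaluates this script once at startup
set worker_code {
    proc normal_quantile {p} {
        # Acklam's algorithm
        if {$p <= 0.0 || $p >= 1.0} { return "NaN" }
        # Tail regions use sqrt(-2*log(x)); the central region uses p - 0.5
        set q [expr {$p < 0.02425 ? sqrt(-2.0*log($p)) : ($p > 0.97575 ? sqrt(-2.0*log(1.0-$p)) : $p-0.5)}]
        # ... (Rational approximation formulas) ...
    }
    proc process_chunk {chunk} {
        # Format each input value and its quantile to 10 decimal places, tab-separated
        set res {}
        foreach v $chunk { lappend res [format "%.10f\t%.10f" $v [normal_quantile $v]] }
        return [join $res "\n"]
    }
}

# Parallel execution logic: a fixed pool of 4 workers
set pool [tpool::create -minworkers 4 -maxworkers 4 -initcmd $worker_code]
# Data is split into chunks and posted to the pool; $data_chunk is one such chunk
set job [tpool::post $pool [list process_chunk $data_chunk]]
# Wait for the job to finish, then retrieve its formatted output
tpool::wait $pool $job
set output [tpool::get $pool $job]
```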



Reproducibility Note

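To ensure numerical consistency across platforms, the input dataset was generated using:
LC_NUMERIC=C awk 'BEGIN {srand(); for (i=1; i<=1000000; i++) print rand()}' > input.txt
Because srand() without an argument seeds from the time of day, each run of this command produces a different dataset; passing a fixed seed (for example srand(42)) makes the input itself reproducible.
Verification was performed via:
diff output_c.txt output_tcl.txt | wc -l (Result: 0)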


neocities.org     -   Copyright Gh. C. - 2026