ML Success Metrics specific to Regression

Machine Learning Metrics Analysis

Understanding RMSE and RMSLE

According to the Official Google Cloud Certified Professional Machine Learning Engineer Study Guide:

The root‐mean‐squared error (RMSE) is the square root of the average squared difference between the target and predicted values. If you are worried that your model might incorrectly predict a very large value and want to penalize the model, you can use this. Ranges from 0 to infinity.

The root‐mean‐squared logarithmic error (RMSLE) metric is similar to RMSE, except that it uses the natural logarithm of the predicted and actual values +1. This is an asymmetric metric, which penalizes underprediction (value predicted is lower than actual) rather than overprediction.

Mona, Mona; Ramamurthy, Pratap. Official Google Cloud Certified Professional Machine Learning Engineer Study Guide (Sybex Study Guide) (p. 12). Wiley. Kindle Edition.

Practical Testing

Upon testing what I thought this meant, here is what I saw:

```
from sklearn.metrics import root_mean_squared_error, root_mean_squared_log_error

y_true = [60, 80, 90, 750]
y_pred = [67, 78, 91, 102]

root_mean_squared_log_error(y_pred=y_true, y_true=y_pred)
# 0.9949158238939428
root_mean_squared_log_error(y_pred=y_pred, y_true=y_true)
# 0.9949158238939428
root_mean_squared_error(y_pred, y_true)
# 324.02083266358045
root_mean_squared_error(y_true, y_pred)
# 324.02083266358045
```
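To see why swapping the arguments changes nothing, RMSLE can be reproduced with only the standard library. This is a sketch of the formula sklearn implements (the square root of the mean squared difference of `log1p`-transformed values); the helper name `rmsle` is mine:

```python
import math

def rmsle(y_true: list, y_pred: list) -> float:
    # sqrt(mean((log(1 + t) - log(1 + p))^2)) -- symmetric in its arguments,
    # since squaring discards the sign of each difference
    sq = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

y_true = [60, 80, 90, 750]
y_pred = [67, 78, 91, 102]
print(rmsle(y_true, y_pred))  # ~0.9949, matching sklearn's value above
print(rmsle(y_pred, y_true))  # identical: swapping the arrays changes nothing
```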

Confusion and Clarification

The results were confusing for a couple of reasons:

  1. If RMSE is for when "your model might incorrectly predict a very large value and want to penalize the model", shouldn't it also be an asymmetric metric, as the authors allege RMSLE is?
  2. If RMSLE really is asymmetric, why does swapping `y_true` and `y_pred` leave both metrics, RMSE and RMSLE alike, completely unchanged?

The answer lies in the logarithmic transformation RMSLE applies. Both metrics are symmetric in the formal sense: the per-element differences are squared, so swapping the prediction and target arrays cannot change the result. What the authors mean by "asymmetric" is something else: for a fixed actual value, underpredicting by a given absolute amount produces a larger log-difference than overpredicting by the same amount, so RMSLE penalizes underprediction more heavily. RMSE, with no log transform, penalizes both directions equally.
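A minimal sketch of that directional asymmetry: the same absolute error of 50 around an actual value of 100 yields a larger log-error when the model underpredicts than when it overpredicts (the helper `log_error` is mine, computing a single point's contribution to RMSLE):

```python
import math

def log_error(actual: float, predicted: float) -> float:
    # a single observation's absolute log1p difference
    return abs(math.log1p(actual) - math.log1p(predicted))

# Same absolute error of 50, opposite directions, actual value 100:
under = log_error(100, 50)    # predicted 50, i.e. 50 under the actual
over = log_error(100, 150)    # predicted 150, i.e. 50 over the actual
print(under, over)  # the underprediction contributes more to RMSLE
```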

Parity of the number of bits == 1

Basic Parity Calculation

The function `parity(n: int) -> int` calculates the parity of the number `n`. Parity is 1 if the number of 1s in the binary representation of `n` is odd, and 0 otherwise. The function iterates over each bit of `n`, using `result ^= n & 1` to flip the result if the current bit is 1, and then right-shifts `n` by one bit.

```
def parity(n: int) -> int:
    result = 0
    while n:
        result ^= n & 1
        n >>= 1
    return result
```

Complexity: The time complexity is O(m), where m is the number of bits in the number. This is because the algorithm checks each bit exactly once.

Kernighan's Algorithm for Parity Calculation

The function `kernigan_parity(n: int) -> int` also calculates the parity of the number `n`. It optimizes the basic approach by jumping directly to the next set bit using the operation `n &= n - 1`, which drops the lowest set bit of `n`. The result is XORed with 1 for each set bit found.

```
def kernigan_parity(n: int) -> int:
    result = 0
    while n:
        result ^= 1
        n &= n - 1  # drop the lowest set bit
    return result
```

Complexity: The time complexity is O(k), where k is the number of bits set to 1 in the number. This improvement over the basic method occurs because it processes only the bits that are 1, skipping all zeros.
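Both implementations should of course agree; as a quick sanity check, here is a sketch that compares them against the obvious oracle `bin(n).count('1') % 2` (the functions are repeated so the snippet runs on its own):

```python
def parity(n: int) -> int:  # O(m): inspects every bit
    result = 0
    while n:
        result ^= n & 1
        n >>= 1
    return result

def kernigan_parity(n: int) -> int:  # O(k): visits only the set bits
    result = 0
    while n:
        result ^= 1
        n &= n - 1  # drop the lowest set bit
    return result

# Compare both against a straightforward oracle over a range of inputs
for n in range(4096):
    assert parity(n) == kernigan_parity(n) == bin(n).count("1") % 2
print("all agree")
```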

I benchmarked it as follows:

First, I had to generate example inputs for the different n,k combinations. I did that like so:

```
from random import Random

@staticmethod
def get_inputs(k: int, m: int) -> list:
    # k is the number of bits that are 1; m is the bit length
    return ParityBenchmarking.gen_random_numbers(k, m, 25)

@staticmethod
def gen_random_number(k, m):
    r = Random()
    randbits = r.getrandbits(m)
    # Randomly set or clear bits until exactly k bits are 1
    while bin(randbits).count('1') != k:
        pos = r.randint(0, m - 1)
        if bin(randbits).count('1') < k:
            randbits |= 1 << pos
        else:
            randbits &= ~(1 << pos)
    return randbits

@staticmethod
def gen_random_numbers(k, m, num_numbers) -> list:
    return [ParityBenchmarking.gen_random_number(k, m) for _ in range(num_numbers)]
```

Benchmarking Code:

```
import datetime
from datetime import timedelta
from typing import Callable, Dict

@staticmethod
def benchmark(my_func: Callable[[int], None], inputs: list = None, n: int = 5000) -> timedelta:
    start_time = datetime.datetime.now()
    for _ in range(n):
        for input_val in inputs:
            my_func(input_val)
    end_time = datetime.datetime.now()  # was missing: capture the stop time
    return end_time - start_time

@staticmethod
def print_benchmark3(k: int, m: int, benchmarks: Dict, my_func: Callable[[int], None], impl: str) -> None:
    inputs = ParityBenchmarking.get_inputs(k, m)
    benchmark_key = (k, m, impl)
    duration = Benchmarker.benchmark(my_func, inputs=inputs)
    # timedelta.microseconds is only the sub-second component; use
    # total_seconds() so runs longer than a second aren't truncated
    micros = int(duration.total_seconds() * 1_000_000)
    print(f"Input: {inputs} ({benchmark_key}), Time taken: {micros} microseconds")
    benchmarks[benchmark_key] = micros
```
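As an alternative to hand-rolled wall-clock timing, the stdlib `timeit` module handles loop overhead and clock selection for you; `repeat()` lets you take the minimum of several runs, which is less noisy than a single pass. A sketch (the sample inputs here are made up for illustration):

```python
import timeit

def kernigan_parity(n: int) -> int:
    result = 0
    while n:
        result ^= 1
        n &= n - 1  # drop the lowest set bit
    return result

# Hypothetical sample inputs of varying bit lengths and popcounts
inputs = [0b1011, 0xDEADBEEF, (1 << 100) | 1]

# Run the whole input batch 5000 times, repeated 5 times; keep the best run
best = min(timeit.repeat(
    lambda: [kernigan_parity(x) for x in inputs], number=5000, repeat=5))
print(f"best of 5 runs: {best:.4f} s")
```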

Results:

The summarized output from the benchmarking data displays the walltime measurements for computing parity using the Kernighan and standard parity algorithms over various configurations of 'k' (the number of set bits) and 'm' (the total number of bits). The walltime generally increases with both higher values of 'k' and 'm'.

For the **Kernighan parity algorithm**, the results show a clear trend where the walltime increases with both the number of set bits 'k' and the total number of bits 'm'. The increase in walltime is more pronounced with an increase in 'k' compared to an increase in 'm'. For example, as 'k' increases from 1 to 32 at a fixed 'm', the walltime shows significant growth. Similarly, for a fixed 'k', increasing 'm' also results in increased walltime, but the effect is generally less dramatic than changes in 'k'.

For the **standard parity calculation** (simply labeled "parity" in the table), the results also show an increase in walltime with increases in 'k' and 'm', but the growth pattern appears less systematic than with the Kernighan method.

Correlations

Correlation between 'k' and walltime, at each fixed 'm':

| m | algorithm | correlation |
|---------|-----------------|---------------|
| 48 | parity | 0.777746 |
| 148 | parity | 0.693694 |
| 248 | parity | -0.590393 |
| 348 | parity | 0.369503 |
| 448 | parity | -0.459705 |
| 548 | parity | 0.388659 |
| 48 | kernigan_parity | 0.999983 |
| 148 | kernigan_parity | 0.999875 |
| 248 | kernigan_parity | 0.999627 |
| 348 | kernigan_parity | 0.999851 |
| 448 | kernigan_parity | 0.999986 |
| 548 | kernigan_parity | 0.999858 |



Correlation between 'm' and walltime, at each fixed 'k':

| k | algorithm | correlation |
|---------|-----------------|---------------|
| 1 | parity | 0.337478 |
| 2 | parity | 0.0487518 |
| 4 | parity | -0.387477 |
| 8 | parity | 0.317987 |
| 16 | parity | 0.348491 |
| 32 | parity | 0.0311046 |
| 1 | kernigan_parity | 0.936622 |
| 2 | kernigan_parity | 0.981157 |
| 4 | kernigan_parity | 0.395702 |
| 8 | kernigan_parity | 0.986231 |
| 16 | kernigan_parity | 0.994522 |
| 32 | kernigan_parity | 0.603758 |

- For **Kernighan's parity**:

  - The correlation between 'k' and walltime is extremely strong (above 0.999 at every fixed 'm'), making 'k' a very reliable predictor of runtime.

  - The relationship between 'm' and walltime is also significant but slightly weaker than 'k', still showing a strong positive correlation.

- For **standard parity**:

  - The correlation between 'k' and walltime is much weaker than for Kernighan's parity, and even changes sign across configurations, suggesting far less predictability.

  - The correlation between 'm' and walltime likewise varies, with coefficients indicating only weak to moderate relationships.

In summary, the Kernighan algorithm’s performance is more consistently influenced by the number of set bits and the total number of bits, showing predictably higher times with increases in these parameters. The standard parity algorithm exhibits increased times with larger 'k' and 'm' but with more variability and generally less efficiency compared to Kernighan’s method.

Statistical Analysis

From the regression and correlation coefficients, it's clear that the correlation between 'k' and walltime is very strong for Kernighan's parity, so 'k' is a reliable predictor of that algorithm's performance; the relationship between 'm' and walltime is significant but slightly weaker. For the standard parity calculation, the coefficients are weak and inconsistent in sign, indicating much less predictability.


Profiling

I used Python's memory-profiler package (`@profile`). Unfortunately, I couldn't use perf, since it isn't available on macOS, and I didn't delve into using dtrace.


Questions


1. How can a bit string, in practical systems, keep growing without running into overflow, considering limited word sizes in CPUs?

   - Some modern programming environments support extended or arbitrary-precision arithmetic that handles integers larger than a CPU's standard word size.
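Python itself is one such environment: its integers are arbitrary precision, so a "bit string" can outgrow any CPU word size without overflowing, and the bit operations used above still work unchanged. A quick sketch:

```python
# A 10,001-bit number with exactly two set bits -- far beyond a 64-bit word
n = (1 << 10_000) | 1
print(n.bit_length())          # 10001
print(bin(n).count("1") % 2)   # 0: an even number of set bits
```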


2. Does the single assembly instruction for decrementing a register/address alter the computational complexity?

   - Both basic and Kernighan's algorithms operate at the bit level. The difference in computational complexity (O(m) vs. O(k)) does manifest in practical performance benefits, as fewer operations are needed in Kernighan's method when fewer bits are set.


Further exploration could include leveraging CPU-specific features to see if they substantially alter the computational complexity or the practical efficiency of these algorithms and understanding how compiler optimizations impact operational speed and efficiency in different computing environments.



Quote


“No amount of experimentation can ever prove me right; a single experiment can prove me wrong.”


Quote



“More business is lost every year through neglect than through any other cause.” 



Quote



“A few observations and much reasoning lead to error; many observations and a little reasoning to truth.”


I/O Latency in Relative Terms

If rotational disk I/O took one to twelve months, a CPU cycle would take one second.



| Event | Latency | Scaled |
|-------|---------|--------|
| 1 CPU cycle | 0.3 ns | 1 second |
| Level 1 cache access | 0.9 ns | 3 seconds |
| Level 2 cache access | 2.8 ns | 9 seconds |
| Level 3 cache access | 12.9 ns | 43 seconds |
| Main memory access (DRAM, from CPU) | 120 ns | 6 min |
| Solid-state disk I/O (flash memory) | 50-150 µs | 2-6 days |
| Rotational disk I/O | 1-10 ms | 1-12 months |
| Internet: San Francisco to New York | 40 ms | 4 years |
| Internet: San Francisco to UK | 81 ms | 8 years |
| Internet: San Francisco to Australia | 183 ms | 19 years |
| TCP packet retransmit | 1-3 s | 105-317 years |
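The "scaled" column is simply each latency divided by the 0.3 ns cycle time (a factor of roughly 3.3 billion); a quick sketch of that arithmetic (the helper name is mine):

```python
CYCLE_NS = 0.3  # one CPU cycle, the table's "1 second" unit

def scaled_seconds(latency_ns: float) -> float:
    # how many CPU-cycle "seconds" the latency corresponds to
    return latency_ns / CYCLE_NS

print(scaled_seconds(0.9))                 # L1 cache: 3 "seconds"
print(scaled_seconds(120))                 # DRAM: 400 "seconds", about 6-7 minutes
print(scaled_seconds(4e7) / 86400 / 365)   # SF->NY at 40 ms: roughly 4 "years"
```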


import this (The Zen of Python, by Tim Peters)

 


```
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
```

Fixing an Incorrectly Resolved Git Rebase Conflict After a Force Push

Scenario: You resolved a rebase conflict, but did it wrong. You’ve already committed and force-pushed the branch to your remote (e.g., origi...