Elizezen/SniffyOtter-7B

@@ -27,8 +27,9 @@ In extensive testing and benchmarks, SniffyOtter has proven to be an exceptional
 | Model                               | average   | eroticism | complexity | contextual maintenance |
 | ----------------------------------- | --------- | --------- | ---------- | ---------------------- |
-| Antler-RP-ja-westlake-chatvector    | **48.88** | 4.65      | **47.1**   | **94.9**               |
-| **SniffyOtter-7B**                  | 48.80     | **5.7**   | 46.2       | 94.5                   |
 | Antler-7B                           | 47.62     | 5.25      | 45.3       | 92.3                   |
 | Nocturn-7B                          | 47.25     | 5.15      | 44.7       | 91.9                   |
 | Sapphire-7B                         | 46.90     | 4.9       | 43.5       | 92.3                   |
@@ -37,9 +38,13 @@ In extensive testing and benchmarks, SniffyOtter has proven to be an exceptional
 | chatntq-ja-7b-v1.0                  | 45.12     | 2.55      | 41.4       | 91.4                   |
 | Calm2-7B-Chat                       | 45.07     | 3.4       | 40.2       | 91.6                   |
 **Benchmark Metrics:**
 - Eroticism: Measures the frequency of erotic words in the generated text. Calculated using a predefined set of words considered erotic.
 - Complexity: Evaluates the model's ability to produce non-repetitive responses. Higher scores indicate more diverse and less repetitive text, calculated using zlib.compress, which I find effective at detecting significantly repetitive texts.
 - Context Maintenance: Assesses how well the model maintains the given topic. Responses that stray from the context result in lower scores. Calculated using japanese-reranker-cross-encoder-large-v1 to measure relevance between the input and the generated response.
-*Note: While the benchmark provides some insights, it is important to consider that the specific set of erotic words and the undisclosed details of the benchmark may introduce biases. Therefore, it is recommended to take this result with a grain of salt for now.*

 | Model                               | average   | eroticism | complexity | contextual maintenance |
 | ----------------------------------- | --------- | --------- | ---------- | ---------------------- |
+| Antler-RP-ja-westlake-chatvector    | 49.17     | 5.5       | 47.1       | 94.9                   |
+| **SniffyOtter-7B**                  | 48.80     | 5.7       | 46.2       | 94.5                   |
+| Sabbath-7B                          | 48.10     | 4.8       | 45.8       | 93.7                   |
 | Antler-7B                           | 47.62     | 5.25      | 45.3       | 92.3                   |
 | Nocturn-7B                          | 47.25     | 5.15      | 44.7       | 91.9                   |
 | Sapphire-7B                         | 46.90     | 4.9       | 43.5       | 92.3                   |
 | chatntq-ja-7b-v1.0                  | 45.12     | 2.55      | 41.4       | 91.4                   |
 | Calm2-7B-Chat                       | 45.07     | 3.4       | 40.2       | 91.6                   |
+Eroticism: Frequency of erotic
+*Tested with the 8-bit version due to insufficient GPU memory.
 **Benchmark Metrics:**
 - Eroticism: Measures the frequency of erotic words in the generated text. Calculated using a predefined set of words considered erotic.
 - Complexity: Evaluates the model's ability to produce non-repetitive responses. Higher scores indicate more diverse and less repetitive text, calculated using zlib.compress, which I find effective at detecting significantly repetitive texts.
 - Context Maintenance: Assesses how well the model maintains the given topic. Responses that stray from the context result in lower scores. Calculated using japanese-reranker-cross-encoder-large-v1 to measure relevance between the input and the generated response.
+The benchmark is a refined version of the one I used for [Sapphire-7B](https://huggingface.co/Elizezen/Sapphire-7B). *While it provides some insight, the specific set of erotic words and the undisclosed details of the benchmark may introduce biases, so take these results with a grain of salt for now.*
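
The exact formulas behind the scores are not disclosed, so the following is only a minimal sketch of how the eroticism and complexity metrics described under **Benchmark Metrics** might be computed. The function names, the percentage scaling, and the per-character normalization are all assumptions made for illustration; only the use of `zlib.compress` and of a predefined word list comes from the README. The context maintenance metric is omitted because it needs the reranker model.

```python
import zlib
from typing import Iterable


def complexity_score(text: str) -> float:
    """Compressed size as a percentage of raw size.

    Assumption: the real benchmark's scaling is undisclosed. Highly
    repetitive text compresses well, so it gets a low score here.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return 100.0 * len(zlib.compress(raw)) / len(raw)


def lexicon_frequency(text: str, lexicon: Iterable[str]) -> float:
    """Occurrences of lexicon entries per 100 characters of text.

    Assumption: both the word list and the normalization are
    hypothetical; the README only says a predefined set is used.
    """
    if not text:
        return 0.0
    hits = sum(text.count(word) for word in lexicon)
    return 100.0 * hits / len(text)


# A heavily repeated phrase versus a sentence with no repetition.
repetitive = "おはよう。" * 50
varied = "昨日は友人と山に登り、頂上で温かいお茶を飲みながら景色を楽しんだ。"

print(complexity_score(repetitive) < complexity_score(varied))
```

Under these assumptions, the repeated phrase scores far lower on complexity than the varied sentence, which matches the README's remark that compression is effective at flagging significantly repetitive text.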