Discussion:
the 'original' string function?
(too old to reply)
Emanuel Berg
2024-07-10 10:00:01 UTC
Permalink
Okay, this is gonna be a challenge to most guys who have been
processing text for a long time.

So, I would like a command, function or script, 'original',
that takes a string STR and a text file TXT and outputs
a score, from 0 to 100, how _original_ STR is, compared to
what is already in TXT.

So if I do

$ original "This isn't just another party" comments.txt

this will score 0 if that exact phrase to the letter already
exists in comments.txt.

But it will score 100 if not a single one of those words exists
in the file! Because that would be 100% original.

Those endpoints are easy. But how to make it score - say - 62%
if some of the words are present, mostly spelled like that and
combined in ways that are not completely different?

Note: The above examples are examples, other definitions of
originality are okay. That is not the important part now - but
can be as interesting a part, later.

I'm not sure what function can solve this but I'm sure it will
be very original!
--
underground experts united
https://dataswamp.org/~incal
Nicolas George
2024-07-10 10:10:01 UTC
Permalink
Post by Emanuel Berg
Okay, this is gonna be a challenge to most guys who have been
processing text for a long time.
So, I would like a command, function or script, 'original',
that takes a string STR and a text file TXT and outputs
a score, from 0 to 100, how _original_ STR is, compared to
what is already in TXT.
So if I do
$ original "This isn't just another party" comments.txt
this will score 0 if that exact phrase to the letter already
exists in comments.txt.
But it will score 100 if not a single of those words exists in
the file! Because that would be 100% original.
Those endpoints are easy. But how to make it score - say - 62%
if some of the words are present, mostly spelled like that and
combined in ways that are not completely different?
Note: The above examples are examples, other definitions of
originality are okay. That is not the important part now - but
can be as interesting a part, later.
You can use that:

https://en.wikipedia.org/wiki/Levenshtein_distance

But you also need to define what you want with more precision:

How do you count the replacement of a word by a synonym?

How do you count a change in the order of the words?

How do you count a transparent spelling mistake?

How do you count a spelling mistake that turns a word into another
existing word?
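
None of those questions are answered by the raw distance, but as
a rough starting point, a minimal sketch in Emacs Lisp (which ships
Levenshtein as `string-distance') could normalise it into a score:

;; Sketch: normalise the Levenshtein distance by the longer length,
;; so an exact match scores 0 and a completely different string
;; scores close to 100.
(defun levenshtein-originality (str other)
  (let ((len (max (length str) (length other) 1)))
    (round (* 100.0 (/ (string-distance str other) (float len))))))

;; (levenshtein-originality "This isn't just another party"
;;                          "This isn't just another party")    ; => 0
;; (levenshtein-originality "This isn't just another party"
;;                          "completely unrelated words here")  ; => high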

Not related to Debian, putting “[OT]” in the subject.

Regards,
--
Nicolas George
t***@tuxteam.de
2024-07-10 10:40:01 UTC
Permalink
Post by jeremy ardley
The modern way would be to use a LLM in API mode and set a context to
achieve your aims.
Post by Nicolas George
All modern languages have some kind of rand() function.
Which is way more CO2 efficient, at that :-)
Post by Nicolas George
I assumed the OP wanted to do something useful with the answers.
ChatGPT is bullshit [1].

Cheers

[1] https://link.springer.com/article/10.1007/s10676-024-09775-5
--
t
Emanuel Berg
2024-07-10 16:10:01 UTC
Permalink
Post by t***@tuxteam.de
ChatGPT is bullshit
Without context then:

The comment "ChatGPT4 is bullshit" is not original as it is
a common expression used to express frustration [...]

Not a bad start!

I agree that comment would score between 0 and 5 as it doesn't
have any originality to it, really.
--
underground experts united
https://dataswamp.org/~incal
t***@tuxteam.de
2024-07-10 16:50:01 UTC
Permalink
Post by Emanuel Berg
Post by t***@tuxteam.de
ChatGPT is bullshit
This is not a comment, but the title of an academic article.
Worth reading, BTW.

Link is in my original post.
Post by Emanuel Berg
The comment "ChatGPT4 is bullshit" is not original as it is
a common expression used to express frustration [...]
Not a bad start!
I'd guess that this has been injected directly into ChatGPT. Perhaps
by an African worker [1], earning less than two dollars an hour.
Post by Emanuel Berg
I agree that comment would score between 0 and 5 as it doesn't
have any originality to it, really.
Uh -- it was a quote. They aren't supposed to be original.

Cheers

[1] https://www.theguardian.com/technology/article/2024/jul/06/mercy-anita-african-workers-ai-artificial-intelligence-exploitation-feeding-machine
--
t
Emanuel Berg
2024-07-10 17:10:01 UTC
Permalink
Post by t***@tuxteam.de
Uh -- it was a quote. They aren't supposed to be original.
Spoiler alert: This post contains higher-order originality.

I mentioned computing the similarity with a bunch of tests.
The result can then be expressed as a string of test results.

And what happens next? You guessed it, you run the same
algorithm again! Only now asking who has the most
original originality test result string!

It is happening. Originality turning hyper-original!

\o/
--
underground experts united
https://dataswamp.org/~incal
Nicolas George
2024-07-10 10:40:01 UTC
Permalink
Post by jeremy ardley
The modern way would be to use a LLM in API mode and set a context to
achieve your aims.
All modern languages have some kind of rand() function. I assumed the OP
wanted to do something useful with the answers.

Regards,
--
Nicolas George
Emanuel Berg
2024-07-10 11:20:01 UTC
Permalink
Post by jeremy ardley
The modern way would be to use a LLM in API mode and set
a context to achieve your aims.
Post by Nicolas George
All modern languages have some kind of rand() function.
I assumed the OP wanted to do something useful with
the answers.
Yes, of course!

/B. A. Original
--
underground experts united
https://dataswamp.org/~incal
jeremy ardley
2024-07-10 10:40:01 UTC
Permalink
Post by Nicolas George
Post by Emanuel Berg
Okay, this is gonna be a challenge to most guys who have been
processing text for a long time.
So, I would like a command, function or script, 'original',
that takes a string STR and a text file TXT and outputs
a score, from 0 to 100, how _original_ STR is, compared to
what is already in TXT.
So if I do
$ original "This isn't just another party" comments.txt
this will score 0 if that exact phrase to the letter already
exists in comments.txt.
But it will score 100 if not a single of those words exists in
the file! Because that would be 100% original.
Those endpoints are easy. But how to make it score - say - 62%
if some of the words are present, mostly spelled like that and
combined in ways that are not completely different?
Note: The above examples are examples, other definitions of
originality are okay. That is not the important part now - but
can be as interesting a part, later.
https://en.wikipedia.org/wiki/Levenshtein_distance
How do you count the replacement of a word by a synonym?
How do you count a change in the order of the words?
How do you count a transparent spelling mistake?
How do you count a spelling mistake that turns a word into another
existing word?
Not related to Debian, putting “[OT]” in the subject.
Regards,
The modern way would be to use a LLM in API mode and set a context to
achieve your aims.

You can do this locally using a LLM hosted on your computer or you can
use a remote API such as ChatGPT.

This is usually scripted in python.

The interesting thing is you can get a good LLM such as GPT4 to help
write a context to be run by a lesser LLM.

You should not expect perfection and may not get 100% repeatable results,
but it'll still be fairly good.
Emanuel Berg
2024-07-10 11:10:01 UTC
Permalink
The modern way would be to use a LLM in API mode and set
a context to achieve your aims.
You can do this locally using a LLM hosted on your computer
or you can use a remote API such as ChatGPT.
Okay, great, I already have this installed and set up from the
console, just with another model.

But what should I do with it? Just ask it how original this is?
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 15:50:01 UTC
Permalink
The modern way would be to use a LLM in API mode and set
a context to achieve your aims.
Here is the command. Turns out, I used the llamafile method
with llava or mistral as LLMs.

In the command, we see '-c 2048'. This, says the man page, is
to "Set the size of the prompt context." If the unit is bytes
(chars), it isn't a lot.

But then note '--prompt-cache-all' so what you do can be saved
and brought back.

While I'm sure this is enough for a lot of use cases, here we
envision a huge backlog file; obviously it cannot be loaded in
chunks of 2048 chars at a time.

I also provide a CLI, but interactive (REPL), version for
people who care to experiment. But neither method will work
for this, at least not in its present form.

llm is a symlink, so you can alternate between llava and
mistral as the LLM :)

taskset -c 0-2 \
./llm \
--cli \
--log-disable \
--prompt-cache-all \
--silent-prompt \
-c 2048 \
-ngl 9999 \
-p "$(cat $src)" > $dst

./llm \
--cli \
--color \
--log-disable \
--silent-prompt \
-cml \
-i \
-ld log \

Versions are:

$ mistral-7b-instruct-v0.2.Q5_K_M.llamafile --version
llamafile v0.8.5 (Apache License 2.0)

llava-v1.5-7b-q4.llamafile

Apache License 2.0 is FOSS, so this is all CLI, all local and
free. If it can be made to work for this, maybe people would
be happy with it, although there isn't an old-school,
deterministic algorithm you can fiddle with until it is just
right, so you miss out on that, unfortunately.
--
underground experts united
https://dataswamp.org/~incal
jeremy ardley
2024-07-11 01:20:01 UTC
Permalink
Post by Emanuel Berg
The modern way would be to use a LLM in API mode and set
a context to achieve your aims.
Here is the command. Turns out, I used the llamafile method
with llava or mistral as LLMs.
In the command, we see '-c 2048'. This, says the man file, is
to "Set the size of the prompt context." If the unit is bytes
(chars) it isn't a lot.
But then note '--prompt-cache-all' so what you do can be saved
and brought back.
While this I'm sure is enough for a lot of use cases, here we
envision a huge backlog file, obviously it cannot be loaded in
chunks of 2048 chars at the time.
I also provide a CLI but REPL version for people who would
care to experiment. But neither method will work for this,
at least not in their present form.
llm is a link so you can alternate between llava and mistral
as LLM :)
taskset -c 0-2 \
./llm \
--cli \
--log-disable \
--prompt-cache-all \
--silent-prompt \
-c 2048 \
-ngl 9999 \
-p "$(cat $src)" > $dst
./llm \
--cli \
--color \
--log-disable \
--silent-prompt \
-cml \
-i \
-ld log \
$ mistral-7b-instruct-v0.2.Q5_K_M.llamafile --version
llamafile v0.8.5 (Apache License 2.0)
llava-v1.5-7b-q4.llamafile
Apache License 2.0 is FOSS so this is all CLI, all local and
free. If it can be made to work for this, maybe people would
be happy about it all tho there isn't an old school algorithm
which is deterministic and you can fiddle with until it is
just right, so you are missing that out unfortunately.
I asked ChatGPT4 about this using your original email as a prompt. It
came back with a solution based on how LLMs are trained but it would
have required some development. The many-dimensional vector comparison
mechanisms used in LLMs do seem quite well matched to what you want to do.
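
For what it's worth, a common way to compare such vectors is cosine
similarity; a minimal Elisp sketch, assuming you already have two
embedding vectors as lists of floats, however you obtained them:

(require 'cl-lib)

;; Sketch only: cosine similarity between two embedding vectors;
;; 1.0 means same direction, values near 0 mean unrelated.
(defun cosine-similarity (v1 v2)
  (let ((dot 0.0) (n1 0.0) (n2 0.0))
    (cl-mapc (lambda (a b)
               (cl-incf dot (* a b))
               (cl-incf n1 (* a a))
               (cl-incf n2 (* b b)))
             v1 v2)
    (/ dot (* (sqrt n1) (sqrt n2)))))

;; (cosine-similarity '(3.0 4.0) '(3.0 4.0)) ; => 1.0

The hard part is producing good embeddings in the first place, which
is what the LLM tooling is for.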

In your case, without fully knowing precisely what your aim is, one
possible approach is to put all the text you want to search into a
GPT4All localdocs directory where it will be indexed on the fly. Then
create a prompt/context with the search text and instructions to
generate a similarity index and report any that meet some threshold.

You will have to get the results in some format such as JSON and
post-process them.

You may want to get ChatGPT 4 to help you craft the general prompt.

For reference, I have GPT4All and am planning to use its localdocs
feature with man page text. That will ensure that answers it gives on
technical questions will at least quote the man pages accurately.
Emanuel Berg
2024-07-13 21:00:01 UTC
Permalink
Post by jeremy ardley
In your case, without fully knowing precisely what your aim
is, one possible approach is to put all the text you want to
search into a GPT4All localdocs directory where it will be
indexed on the fly. Then create a prompt/context with the
search text and instructions to generate a similarity index
and report any that meet some threshold.
I also believe in AI for this.

I would gladly try to do this with AI from my FOSS and local,
CLI and REPL setup with llamafile and then either llava or
mistral - but I don't know how to add the context. I've looked
in the man page many times but still don't know how.

If I have a file with the data, how do I add even that
single file?
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 06:00:01 UTC
Permalink
Then create a prompt/context with the search text and
instructions to generate a similarity index and report any
that meet some threshold.
You will have to get the results in some format such as json
and post process
You may want to get ChatGPT 4 to help you craft the
general prompt.
Okay, context is context and query is the question after
the context.

But then what is the prompt and what is the general prompt?
localdocs
There are no hits for that in the man page, maybe it is called
something else.

Is the idea to have a big context locally in directories
and stuff? That should be supported but maybe it isn't.
--
underground experts united
https://dataswamp.org/~incal
jeremy ardley
2024-07-14 06:10:01 UTC
Permalink
Post by Emanuel Berg
Then create a prompt/context with the search text and
instructions to generate a similarity index and report any
that meet some threshold.
You will have to get the results in some format such as json
and post process
You may want to get ChatGPT 4 to help you craft the
general prompt.
Okay, context is context and query is the question after
the context.
But then what is the prompt and what is the general prompt?
Ask ChatGPT4. Explain what you are trying to do and get it to give you
a suitable context and prompt.
Post by Emanuel Berg
localdocs
There are no hits for that in the man page, maybe it is called
something else.
Is the idea to have a big context locally in directories
and stuff? That should be supported but maybe it isn't.
localdocs contains text you trust that can be used in responses in
preference to something synthesised from the internet.

My suggestion is to use localdocs for the text you want to search, the
context for how to match, and the prompt for the string you want to match.
Emanuel Berg
2024-07-14 06:40:01 UTC
Permalink
Post by jeremy ardley
Ask ChatGPT4 . Explain what you are trying to do and get it
to give you a suitable context and prompt
I don't know what to ask exactly, maybe I can ask ChatGPT4 ...
Post by jeremy ardley
localdocs contains text you trust that can be used in
responses in preference to something synthesised from
the internet.
I tried to google it, but it seems everything is local with
the llamafile, so I don't know; also, if it were included, it
would have been in the man page or would have appeared as a
simple plugin when you googled it.

Anyway, the context is big enough to play around with for
now then.

Or is the context size maybe the same? Just that, when looking
for an answer, the AI first looks in the localdocs, then on
the internet? Or can you throw stuff in there, and it gets
crunched by the AI? But then, I don't have that much stuff, so
I would have to get it from the internet anyway.

Wait, let's first find out how big a file you can use! In my
command, we see that -c or --ctx-size will "Set the size of
the prompt context" (llamafile(1)).

If the unit is tokens, that means I now have not 512 but 2048
of them. There is a lot of token-related stuff in the log,
even when it is disabled, seemingly.

--prompt-cache-all will maybe store some data in a file as you
go or when you exit the application. Then that sweet stuff will
be brought right back in again.

taskset -c 0-2 \
./llm \
--cli \
--log-disable \
--prompt-cache-all \
--silent-prompt \
-c 2048 \
-ngl 9999 \
-p "$(cat $src)" > $dst
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 07:20:02 UTC
Permalink
Post by Emanuel Berg
Anyway, the context is big enough to play around with for
now then.
Yes, this method works, I think?

I used it again with CONTEXT / QUERY and, as context, had the
first 5 parts of this:

http://www.scifiscripts.com/scripts/starship_troopers.txt

then query: What sentence is the most original one?

Answer from llamafile/mistral:

I suppose the most original one is the one that is written in
a way that is not often used or seen before. In this case,
I think the most original sentence is:

Join the mobile infantry and save the world!
Service guarantees citizen-ship.
Would you like to know more?

because it uses a conversational tone and the offer of
a service and a guarantee, which is not something you often
see in the context of a military recruitment message.

I don't know if that is the reason why it is original - and
also that is three sentences, but that's okay - because
I agree 100% it is very original!
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 16:50:02 UTC
Permalink
Here is the AI script!

It is all CLI/TUI, all FOSS, and all local execution/storage
as well.

#! /bin/zsh
#
# Find the most original sentence in a text file.
#
# uses:
# mistral-7b-instruct-v0.2.Q5_K_M.llamafile
# llamafile v0.8.5
#
# usage:
# $ ori input.txt # outputs to input-ori.txt

src=$1
dst=${src:r}-ori.${src:e}

query="What is the most original sentence above?"

prompt="Context: $(cat $src) Query: ${query}"

taskset -c 0-2 \
llm \
--cli \
--log-disable \
--prompt-cache-all \
--silent-prompt \
-c 2048 \
-ngl 9999 \
-p $prompt | tee $dst
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-15 06:10:02 UTC
Permalink
Post by Emanuel Berg
Here is the AI script!
It is all CLI/TUI, all FOSS, and all local execution/storage
as well. [...]
I have stored it here:

https://dataswamp.org/~incal/ai/ori

I'm very happy with this solution and would like to thank
everyone for helping me out and making me aware of new
concepts and terminology.

That script is great and I'm already working on another
version so that one can ask arbitrary, multiple questions to
the AI.

Another idea is to be able to provide several files so one can
compare and stuff.
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 15:50:01 UTC
Permalink
The modern way would be to use a LLM in API mode and set
a context
How do I set the context?

That is the backlog file, right?

This is the step that is missing from my commands.
--
underground experts united
https://dataswamp.org/~incal
Nicolas George
2024-07-10 11:10:01 UTC
Permalink
Post by Emanuel Berg
Indeed, one can have a bunch of such rules and apply them and
award points and stuff.
But maybe one could also have a more general, basic
mechanical/math/stats inspired algorithm?
You cannot have a more general basic mechanical algorithm unless you
know what you want it to do.
Post by Emanuel Berg
One can also see the data as a bunch of entries, say one for
each line. To what extent is the new entry unlike all others?
You still need to define unlike.
Post by Emanuel Berg
See? How can we define what is original in a way that 1.
makes sense and 2. is useful for this application?
Only you know what “this application” is.

Regards,
--
Nicolas George
Emanuel Berg
2024-07-10 16:30:01 UTC
Permalink
Post by Nicolas George
You still need to define unlike.
That's an idea!

You define similarity and the least similar to its most
similar other item is the most original.

Best suggestion yet?

Or did you think of it somewhat, but not exactly, like that?
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 19:00:01 UTC
Permalink
Here is a piece of code, after all the wild but so far uncoded
ideas.

I think I can come up with a lot of those!

They should compare the data, but as closely to the way a human
thinks as possible.

So, the size of something was my initial idea, as that is
a very human first thing to note. Who has the biggest house,
car or army, that's a big player, and I want a Big Mac.

Okay, I felt the need to post some code and not be all talk,
but anyway I'll keep working on this and not bother you with
this idea anymore, aight? :)

;;; -*- lexical-binding: t -*-
;;
;; this file:
;; https://dataswamp.org/~incal/emacs-init/b-a-original.el

(require 'pcase)

(defun length-test (s1 s2)
  (pcase-let*((l1 (length s1))
              (l2 (length s2))
              (`(,n ,d) (if (< l1 l2) `(,l1 ,l2) `(,l2 ,l1))) )
    (round (* 100 (/ n d 1.0))) ))

;; (length-test "b" "a") ; 100
;; (length-test "a" "original") ; 12
;; (length-test "original" "b") ; 12

(provide 'b-a-original)
--
underground experts united
https://dataswamp.org/~incal
t***@tuxteam.de
2024-07-11 04:30:01 UTC
Permalink
Post by Emanuel Berg
Here is a piece of code after all wild ideas, but uncoded.
[...]
Post by Emanuel Berg
;;; -*- lexical-binding: t -*-
[...]

If you are doing this in Emacs Lisp, after all, there /is/
a Levenshtein distance function in there. Finding its name
is left as an exercise to the reader, though...

Cheers

(I always found the standard Levenshtein distance to be just
a tiny step short of a powerful generalisation, namely being
able to assign weights to the substitutions, but that's me)
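
For what it's worth, that generalisation is small enough to sketch in
plain Elisp; the weights are just numbers you pick, and all 1s gives
the ordinary Levenshtein distance (a rough illustration only):

;; Edit distance where each operation carries its own weight.
(defun weighted-edit-distance (s1 s2 &optional ins del sub)
  (let* ((ins (or ins 1)) (del (or del 1)) (sub (or sub 1))
         (n (length s1)) (m (length s2))
         ;; previous row of the usual dynamic-programming table
         (row (mapcar (lambda (j) (* j ins)) (number-sequence 0 m))))
    (dotimes (i n)
      (let ((new (list (* (1+ i) del))))
        (dotimes (j m)
          (push (min (+ (car new) ins)           ; insert s2[j]
                     (+ (nth (1+ j) row) del)    ; delete s1[i]
                     (+ (nth j row)              ; keep or substitute
                        (if (eq (aref s1 i) (aref s2 j)) 0 sub)))
                new))
        (setq row (nreverse new))))
    (car (last row))))

;; (weighted-edit-distance "kitten" "sitting") ; => 3, like string-distance
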
--
t
Emanuel Berg
2024-07-13 21:10:01 UTC
Permalink
Post by t***@tuxteam.de
If you are doing this in Emacs Lisp, after all, there /is/
a Levenshtein distance function in there. Finding its name
is left as an exercise to the reader, though...
(I know of course, `string-distance'.)

I thought I was just going to experiment a bit in Elisp, but now
I've done the whole multi-test idea with it, mostly to show the
idea and because I wanted to realize something practically.

I yank the source last; before that, an example run
and its output. It is not done, but you will get it.

Maybe one should use CL for this if Elisp and/or the
programmer's skills are such that it will be all too slow on
big data. One can start with optimizing the Elisp, but it can
be an uphill battle if it is way too slow. No idea if that is
the case; I have only tried with small data. On the TODO list!

But again, I wanted to show the idea and that it works.
There are only three tests, but one can add more. As for
`string-distance', I was hesitant about whether it should be
added, since it isn't the way a human would think, and the
idea was to make it similar to the human concept of
originality. But now - I don't know, maybe I'll just include
every single test that makes sense and that I can think of?

-*-

Example run. "min" means, it isn't original compared to this.
"max" means, it is. Yes, I only tried with 4 sentences so far :)

TOFT - originality finder tool

String: "duncan is a dork"

min: 30% - In a random string.
max: 64% - What's up what's up wanna party or n...
sum: 250
avg: 50%

Data executed on:

What's up what's up wanna party or not say. Maybe to some
people, it is true. In a random string. I know right?

-*-

Source - not done but works:

;;; -*- lexical-binding: t -*-
;;
;; this file:
;; https://dataswamp.org/~incal/emacs-init/b-a-original.el

(require 'cl-lib)
(require 'pcase)
(require 'thingatpt)

;; useful

(defun string-words (str &optional no-sort)
  (let ((words (split-string (downcase str) "[[:space:]()]+" t "[[:punct:]]+")))
    (if no-sort
        words
      (sort words) )))

(defun buffer-sentences (&optional beg end no-sort)
  (or beg (setq beg (point-min)))
  (or end (setq end (point-max)))
  (save-mark-and-excursion
    (goto-char beg)
    (cl-loop
      with buf-strs
      with next-str
      while (< (point) end) do
        (setq next-str (sentence-at-point t))
        (when next-str
          (push next-str buf-strs) )
        (forward-sentence)
      finally return (if no-sort buf-strs (sort buf-strs)) )))

;; interface

(defun prepare-string (str)
  (format "%s\n"
          (truncate-string-to-width
            (string-trim (replace-regexp-in-string "\n" " " str))
            50 nil nil "...") ))

(defun print-results (res &optional init)
  (let ((buf-dst (get-buffer-create "*original*")))
    (with-current-buffer buf-dst
      (erase-buffer)
      (insert "TOFT - originality finder tool\n")
      (when init
        (insert "\n" (prepare-string (format "String: \"%s\"" init))))
      (insert "\n")
      (insert (prepare-string (format "min: %d%% - %s" (nth 0 res) (nth 1 res))))
      (insert (prepare-string (format "max: %d%% - %s" (nth 2 res) (nth 3 res))))
      (insert (format "sum: %d\n" (nth 4 res)))
      (insert (format "avg: %d%%\n" (nth 5 res)))
      (goto-char (point-min)))
    (pop-to-buffer buf-dst) ))

(defun string-compare-buffer (str &optional beg end)
  (interactive
   (list (read-string "string: ")
         (when (use-region-p) (list (region-beginning) (region-end))) ))
  (or beg (setq beg (point-min)))
  (or end (setq end (point-max)))
  (print-results
   (string-compare-many str (buffer-sentences beg end))
   str) ) ; TODO: fix

;; main test

(defun string-compare-many (str str-all)
  (cl-loop
    with min
    with max
    with min-str
    with max-str
    with sum = 0
    with mean
    with s-score
    for s in str-all do
      (setq s-score (string-compare str s))
      (cl-incf sum s-score)
      (when (or (not min) (< s-score min))
        (setq min s-score)
        (setq min-str s) )
      (when (or (not max) (< max s-score))
        (setq max s-score)
        (setq max-str s) )
    finally (setq mean (floor (/ sum (length str-all) 1.0)))
    finally return (list min min-str max max-str sum mean) ))

(defun string-compare (s1 s2 &optional full)
  (let ((tests (list
                #'length-test
                #'number-of-words-test
                #'same-words-test
                )))
    (cl-loop
      with res
      with score = 0
      with f-score
      for f in tests do
        (setq f-score (apply f (list s1 s2)))
        (push f-score res)
        (cl-incf score f-score)
      finally (setq score (floor (/ score (length tests) 1.0)))
      finally return (if full res score) )))

;; individual tests, helpers

(defun originality-score (a b)
  (when (zerop a) (setq a 1))
  (when (zerop b) (setq b 1))
  (pcase-let*((`(,n ,d) (if (< a b) `(,a ,b) `(,b ,a))))
    (floor (- 100 (* 100 (/ n d 1.0))) )))

;; individual tests

(defun length-test (s1 s2)
  (let ((l1 (length s1))
        (l2 (length s2)) )
    (originality-score l1 l2) ))

(defun number-of-words-test (s1 s2)
  (let ((nw1 (length (string-split s1)))
        (nw2 (length (string-split s2))) )
    (originality-score nw1 nw2) ))

(defun same-words-test (s1 s2)
  (let*((w1 (string-words s1))
        (w2 (string-words s2))
        (len-w1 (length w1))
        (len-w2 (length w2))
        (len-short (if (< len-w1 len-w2) len-w1 len-w2))
        (same (length (cl-intersection w1 w2 :test #'string=))) )
    (originality-score len-short same) ))

(provide 'b-a-original)
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 03:50:01 UTC
Permalink
I yank the source last [...]
I've made several improvements, including adding the
`string-distance-percentage' that was mentioned. But let's
forget about that branch [1] or visit that URL for the latest
source on that.

Here, in this thread, the context thing with respect to AI,
anyone having any luck knowing what to do with that?

It is mentioned 14 times in llamafile(1) but not how to
actually set it up with your own data?

Okay, let's do it like this, here is the file, maybe I'm blind.

[1] https://dataswamp.org/~incal/emacs-init/b-a-original.el

LLAMAFILE(1) BSD General Commands Manual LLAMAFILE(1)

NAME
llamafile — large language model runner

SYNOPSIS
llamafile [--server] [flags...] -m model.gguf [--mmproj vision.gguf]
llamafile [--cli] [flags...] -m model.gguf -p prompt
llamafile [--cli] [flags...] -m model.gguf --mmproj vision.gguf --image
graphic.png -p prompt

DESCRIPTION
llamafile is a large language model tool. It has use cases such as:

- Code completion
- Prose composition
- Chatbot that passes the Turing test
- Text/image summarization and analysis

OPTIONS
The following options are available:

--version
Print version and exit.

-h, --help
Show help message and exit.

--cli Puts program in command line interface mode. This flag is im‐
plied when a prompt is supplied using either the -p or -f flags.

--server
Puts program in server mode. This will launch an HTTP server on
a local port. This server has both a web UI and an OpenAI API
compatible completions endpoint. When the server is run on a
desk system, a tab browser tab will be launched automatically
that displays the web UI. This --server flag is implied if no
prompt is specified, i.e. neither the -p or -f flags are passed.

-m FNAME, --model FNAME
Model path in the GGUF file format.

Default: models/7B/ggml-model-f16.gguf

--mmproj FNAME
Specifies path of the LLaVA vision model in the GGUF file for‐
mat. If this flag is supplied, then the --model and --image
flags should also be supplied.

-s SEED, --seed SEED
Random Number Generator (RNG) seed. A random seed is used if
this is less than zero.

Default: -1

-t N, --threads N
Number of threads to use during generation.

Default: $(nproc)/2

-tb N, --threads-batch N
Set the number of threads to use during batch and prompt pro‐
cessing. In some systems, it is beneficial to use a higher num‐
ber of threads during batch processing than during generation.
If not specified, the number of threads used for batch process‐
ing will be the same as the number of threads used for genera‐
tion.

Default: Same as --threads

-td N, --threads-draft N
Number of threads to use during generation.

Default: Same as --threads

-tbd N, --threads-batch-draft N
Number of threads to use during batch and prompt processing.

Default: Same as --threads-draft

--in-prefix-bos
Prefix BOS to user inputs, preceding the --in-prefix string.

--in-prefix STRING
This flag is used to add a prefix to your input, primarily, this
is used to insert a space after the reverse prompt. Here's an
example of how to use the --in-prefix flag in conjunction with
the --reverse-prompt flag:

./main -r "User:" --in-prefix " "

Default: empty

--in-suffix STRING
This flag is used to add a suffix after your input. This is use‐
ful for adding an "Assistant:" prompt after the user's input.
It's added after the new-line character (\n) that's automati‐
cally added to the end of the user's input. Here's an example of
how to use the --in-suffix flag in conjunction with the
--reverse-prompt flag:

./main -r "User:" --in-prefix " " --in-suffix "Assistant:"

Default: empty

-n N, --n-predict N
Number of tokens to predict.

- -1 = infinity
- -2 = until context filled

Default: -1

-c N, --ctx-size N
Set the size of the prompt context. A larger context size helps
the model to better comprehend and generate responses for longer
input or conversations. The LLaMA models were built with a con‐
text of 2048, which yields the best results on longer input /
inference.

- 0 = loaded automatically from model

Default: 512

-b N, --batch-size N
Batch size for prompt processing.

Default: 512

--top-k N
Top-k sampling.

- 0 = disabled

Default: 40

--top-p N
Top-p sampling.

- 1.0 = disabled

Default: 0.9

--min-p N
Min-p sampling.

- 0.0 = disabled

Default: 0.1

--tfs N
Tail free sampling, parameter z.

- 1.0 = disabled

Default: 1.0

--typical N
Locally typical sampling, parameter p.

- 1.0 = disabled

Default: 1.0

--repeat-last-n N
Last n tokens to consider for penalize.

- 0 = disabled
- -1 = ctx_size

Default: 64

--repeat-penalty N
Penalize repeat sequence of tokens.

- 1.0 = disabled

Default: 1.1

--presence-penalty N
Repeat alpha presence penalty.

- 0.0 = disabled

Default: 0.0

--frequency-penalty N
Repeat alpha frequency penalty.

- 0.0 = disabled

Default: 0.0

--mirostat N
Use Mirostat sampling. Top K, Nucleus, Tail Free and Locally
Typical samplers are ignored if used.

- 0 = disabled
- 1 = Mirostat
- 2 = Mirostat 2.0

Default: 0

--mirostat-lr N
Mirostat learning rate, parameter eta.

Default: 0.1

--mirostat-ent N
Mirostat target entropy, parameter tau.

Default: 5.0

-l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
Modifies the likelihood of token appearing in the completion,
i.e. --logit-bias 15043+1 to increase likelihood of token
' Hello', or --logit-bias 15043-1 to decrease likelihood of to‐
ken ' Hello'.

-md FNAME, --model-draft FNAME
Draft model for speculative decoding.

Default: models/7B/ggml-model-f16.gguf

--cfg-negative-prompt PROMPT
Negative prompt to use for guidance.

Default: empty

--cfg-negative-prompt-file FNAME
Negative prompt file to use for guidance.

Default: empty

--cfg-scale N
Strength of guidance.

- 1.0 = disable

Default: 1.0

--rope-scaling {none,linear,yarn}
RoPE frequency scaling method, defaults to linear unless speci‐
fied by the model

--rope-scale N
RoPE context scaling factor, expands context by a factor of N
where N is the linear scaling factor used by the fine-tuned
model. Some fine-tuned models have extended the context length
by scaling RoPE. For example, if the original pre-trained model
have a context length (max sequence length) of 4096 (4k) and the
fine-tuned model have 32k. That is a scaling factor of 8, and
should work by setting the above --ctx-size to 32768 (32k) and
--rope-scale to 8.

--rope-freq-base N
RoPE base frequency, used by NTK-aware scaling.

Default: loaded from model

--rope-freq-scale N
RoPE frequency scaling factor, expands context by a factor of
1/N

--yarn-orig-ctx N
YaRN: original context size of model.

Default: 0 = model training context size

--yarn-ext-factor N
YaRN: extrapolation mix factor.

- 0.0 = full interpolation

Default: 1.0

--yarn-attn-factor N
YaRN: scale sqrt(t) or attention magnitude.

Default: 1.0

--yarn-beta-slow N
YaRN: high correction dim or alpha.

Default: 1.0

--yarn-beta-fast N
YaRN: low correction dim or beta.

Default: 32.0

--ignore-eos
Ignore end of stream token and continue generating (implies
--logit-bias 2-inf)

--no-penalize-nl
Do not penalize newline token.

--temp N
Temperature.

Default: 0.8

--logits-all
Return logits for all tokens in the batch.

Default: disabled

--hellaswag
Compute HellaSwag score over random tasks from datafile supplied
with -f

--hellaswag-tasks N
Number of tasks to use when computing the HellaSwag score.

Default: 400

--keep N
This flag allows users to retain the original prompt when the
model runs out of context, ensuring a connection to the initial
instruction or conversation topic is maintained, where N is the
number of tokens from the initial prompt to retain when the
model resets its internal context.

- 0 = no tokens are kept from initial prompt
- -1 = retain all tokens from initial prompt

Default: 0

--draft N
Number of tokens to draft for speculative decoding.

Default: 16

--chunks N
Max number of chunks to process.

- -1 = all

Default: -1

-ns N, --sequences N
Number of sequences to decode.

Default: 1

-pa N, --p-accept N
speculative decoding accept probability.

Default: 0.5

-ps N, --p-split N
Speculative decoding split probability.

Default: 0.1

--mlock
Force system to keep model in RAM rather than swapping or com‐
pressing.

--no-mmap
Do not memory-map model (slower load but may reduce pageouts if
not using mlock).

--numa Attempt optimizations that help on some NUMA systems. If run
without this previously, it is recommended to drop the system
page cache before using this. See
https://github.com/ggerganov/llama.cpp/issues/1437.

--recompile
Force GPU support to be recompiled at runtime if possible.

--nocompile
Never compile GPU support at runtime.

If the appropriate DSO file already exists under ~/.llamafile/
then it'll be linked as-is without question. If a prebuilt DSO
is present in the PKZIP content of the executable, then it'll be
extracted and linked if possible. Otherwise, llamafile will skip
any attempt to compile GPU support and simply fall back to using
CPU inference.

--gpu GPU
Specifies which brand of GPU should be used. Valid choices are:

- AUTO: Use any GPU if possible, otherwise fall back to CPU
inference (default)

- APPLE: Use Apple Metal GPU. This is only available on MacOS
ARM64. If Metal could not be used for any reason, then a fa‐
tal error will be raised.

- AMD: Use AMD GPUs. The AMD HIP ROCm SDK should be installed
in which case we assume the HIP_PATH environment variable
has been defined. The set of gfx microarchitectures needed
to run on the host machine is determined automatically based
on the output of the hipInfo command. On Windows, llamafile
release binaries are distributed with a tinyBLAS DLL so
it'll work out of the box without requiring the HIP SDK to
be installed. However, tinyBLAS is slower than rocBLAS for
batch and image processing, so it's recommended that the SDK
be installed anyway. If an AMD GPU could not be used for any
reason, then a fatal error will be raised.

- NVIDIA: Use NVIDIA GPUs. If an NVIDIA GPU could not be used
for any reason, a fatal error will be raised. On Windows,
NVIDIA GPU support will use our tinyBLAS library, since it
works on stock Windows installs. However, tinyBLAS goes
slower for batch and image processing. It's possible to use
NVIDIA's closed-source cuBLAS library instead. To do that,
both MSVC and CUDA need to be installed and the llamafile
command should be run once from the x64 MSVC command prompt
with the --recompile flag passed. The GGML library will then
be compiled and saved to ~/.llamafile/ so the special
process only needs to happen a single time.

- DISABLE: Never use GPU and instead use CPU inference. This
setting is implied by -ngl 0.

-ngl N, --n-gpu-layers N
Number of layers to store in VRAM.

-ngld N, --n-gpu-layers-draft N
Number of layers to store in VRAM for the draft model.

-sm SPLIT_MODE, --split-mode SPLIT_MODE
How to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs

-ts SPLIT, --tensor-split SPLIT
When using multiple GPUs this option controls how large tensors
should be split across all GPUs. SPLIT is a comma-separated
list of non-negative values that assigns the proportion of data
that each GPU should get in order. For example, "3,2" will as‐
sign 60% of the data to GPU 0 and 40% to GPU 1. By default the
data is split in proportion to VRAM but this may not be optimal
for performance. Requires cuBLAS. How to split tensors across
multiple GPUs, comma-separated list of proportions, e.g. 3,1

-mg i, --main-gpu i
The GPU to use for scratch and small tensors.

-nommq, --no-mul-mat-q
Use cuBLAS instead of custom mul_mat_q CUDA kernels. Not recom‐
mended since this is both slower and uses more VRAM.

--verbose-prompt
Print prompt before generation.

--simple-io
Use basic IO for better compatibility in subprocesses and lim‐
ited consoles.

--lora FNAME
Apply LoRA adapter (implies --no-mmap)

--lora-scaled FNAME S
Apply LoRA adapter with user defined scaling S (implies
--no-mmap)

--lora-base FNAME
Optional model to use as a base for the layers modified by the
LoRA adapter

--unsecure
Disables pledge() sandboxing on Linux and OpenBSD.

--samplers
Samplers that will be used for generation in the order, sepa‐
rated by semicolon, for example: top_k;tfs;typi‐
cal;top_p;min_p;temp

--samplers-seq
Simplified sequence for samplers that will be used.

-cml, --chatml
Run in chatml mode (use with ChatML-compatible models)

-dkvc, --dump-kv-cache
Verbose print of the KV cache.

-nkvo, --no-kv-offload
Disable KV offload.

-ctk TYPE, --cache-type-k TYPE
KV cache data type for K.

-ctv TYPE, --cache-type-v TYPE
KV cache data type for V.

-gan N, --grp-attn-n N
Group-attention factor.

Default: 1

-gaw N, --grp-attn-w N
Group-attention width.

Default: 512

-bf FNAME, --binary-file FNAME
Binary file containing multiple choice tasks.

--winogrande
Compute Winogrande score over random tasks from datafile sup‐
plied by the -f flag.

--winogrande-tasks N
Number of tasks to use when computing the Winogrande score.

Default: 0

--multiple-choice
Compute multiple choice score over random tasks from datafile
supplied by the -f flag.

--multiple-choice-tasks N
Number of tasks to use when computing the multiple choice score.

Default: 0

--kl-divergence
Computes KL-divergence to logits provided via the
--kl-divergence-base flag.

--save-all-logits FNAME, --kl-divergence-base FNAME
Save logits to filename.

-ptc N, --print-token-count N
Print token count every N tokens.

Default: -1

--pooling KIND
Specifies pooling type for embeddings. This may be one of:

- none
- mean
- cls

The model default is used if unspecified.

CLI OPTIONS
The following options may be specified when llamafile is running in
--cli mode.

-e, --escape
Process prompt escape sequences (\n, \r, \t, \', \", \\)

-p STRING, --prompt STRING
Prompt to start text generation. Your LLM works by auto-complet‐
ing this text. For example:

llamafile -m model.gguf -p "four score and"

Stands a pretty good chance of printing Lincoln's Gettysburg Ad‐
dress. Prompts can take on a structured format too. Depending
on how your model was trained, it may specify in its docs an in‐
struction notation. With some models that might be:

llamafile -p "[INST]Summarize this: $(cat file)[/INST]"

In most cases, simply colons and newlines will work too:

llamafile -e -p "User: What is best in life?\nAssistant:"

-f FNAME, --file FNAME
Prompt file to start generation.

--grammar GRAMMAR
BNF-like grammar to constrain which tokens may be selected when
generating text. For example, the grammar:

root ::= "yes" | "no"

will force the LLM to only output yes or no before exiting. This
is useful for shell scripts when the --no-display-prompt flag is
also supplied.

--grammar-file FNAME
File to read grammar from.

--fast Put llamafile into fast math mode. This disables algorithms that
reduce floating point rounding, e.g. Kahan summation, and cer‐
tain functions like expf() will be vectorized but handle under‐
flows less gracefully. It's unspecified whether llamafile runs
in fast or precise math mode when neither flag is specified.

--precise
Put llamafile into precise math mode. This enables algorithms
that reduce floating point rounding, e.g. Kahan summation, and
certain functions like expf() will always handle subnormals cor‐
rectly. It's unspecified whether llamafile runs in fast or pre‐
cise math mode when neither flag is specified.

--trap Put llamafile into math trapping mode. When floating point ex‐
ceptions occur, such as NaNs, overflow, and divide by zero, lla‐
mafile will print a warning to the console. This warning will
include a C++ backtrace the first time an exception is trapped.
The op graph will also be dumped to a file, and llamafile will
report the specific op where the exception occurred. This is
useful for troubleshooting when reporting issues. Using this
feature will disable sandboxing. Math trapping is only possible
if your CPU supports it. That is generally the case on AMD64,
however it's less common on ARM64.

--prompt-cache FNAME
File to cache prompt state for faster startup.

Default: none

-fa FNAME, --flash-attn
Enable Flash Attention. This is a mathematical shortcut that can
speed up inference for certain models. This feature is still un‐
der active development.

--prompt-cache-all
If specified, saves user input and generations to cache as well.
Not supported with --interactive or other interactive options.

--prompt-cache-ro
If specified, uses the prompt cache but does not update it.

--random-prompt
Start with a randomized prompt.

--image IMAGE_FILE
Path to an image file. This should be used with multimodal mod‐
els. Alternatively, it's possible to embed an image directly
into the prompt instead; in which case, it must be base64 en‐
coded into an HTML img tag URL with the image/jpeg MIME type.
See also the --mmproj flag for supplying the vision model.

-i, --interactive
Run the program in interactive mode, allowing users to engage in
real-time conversations or provide specific instructions to the
model.

--interactive-first
Run the program in interactive mode and immediately wait for
user input before starting the text generation.

-ins, --instruct
Run the program in instruction mode, which is specifically de‐
signed to work with Alpaca models that excel in completing tasks
based on user instructions.

Technical details: The user's input is internally prefixed with
the reverse prompt (or "### Instruction:" as the default), and
followed by "### Response:" (except if you just press Return
without any input, to keep generating a longer response).

By understanding and utilizing these interaction options, you
can create engaging and dynamic experiences with the LLaMA mod‐
els, tailoring the text generation process to your specific
needs.

-r PROMPT, --reverse-prompt PROMPT
Specify one or multiple reverse prompts to pause text generation
and switch to interactive mode. For example, -r "User:" can be
used to jump back into the conversation whenever it's the user's
turn to speak. This helps create a more interactive and conver‐
sational experience. However, the reverse prompt doesn't work
when it ends with a space. To overcome this limitation, you can
use the --in-prefix flag to add a space or any other characters
after the reverse prompt.

--color
Enable colorized output to visually distinguish
between prompts, user input, and generated text.

--no-display-prompt, --silent-prompt
Don't echo the prompt itself to standard output.

--keep N
Specifies number of tokens to keep from the initial prompt. The
default is -1 which means all tokens.

--multiline-input
Allows you to write or paste multiple lines without ending each
in '\'.

--cont-batching
Enables continuous batching, a.k.a. dynamic batching.

--embedding
In CLI mode, the embedding flag may be used to print embeddings
to standard output. By default, embeddings are computed over a
whole prompt. However the --multiline-input flag may be passed, to
have a separate embeddings array computed for each line of text
in the prompt. In multiline mode, each embedding array will be
printed on its own line to standard output, where individual
floats are separated by space. If both the --multiline-input and
--interactive flags are passed, then a pretty-printed summary of
embeddings along with a cosine similarity matrix will be printed
to the terminal.

SERVER OPTIONS
The following options may be specified when llamafile is running in
--server mode.

--port PORT
Port to listen

Default: 8080

--host IPADDR
IP address to listen.

Default: 127.0.0.1

-to N, --timeout N
Server read/write timeout in seconds.

Default: 600

-np N, --parallel N
Number of slots for process requests.

Default: 1

-cb, --cont-batching
Enable continuous batching (a.k.a dynamic batching).

Default: disabled

-spf FNAME, --system-prompt-file FNAME
Set a file to load a system prompt (initial prompt of all
slots), this is useful for chat applications.

-a ALIAS, --alias ALIAS
Set an alias for the model. This will be added as the model
field in completion responses.

--path PUBLIC_PATH
Path from which to serve static files.

Default: /zip/llama.cpp/server/public

--nobrowser
Do not attempt to open a web browser tab at startup.

-gan N, --grp-attn-n N
Set the group attention factor to extend context size through
self-extend. The default value is 1 which means disabled. This
flag is used together with --grp-attn-w.

-gaw N, --grp-attn-w N
Set the group attention width to extend context size through
self-extend. The default value is 512. This flag is used to‐
gether with --grp-attn-n.

LOG OPTIONS
The following log options are available:

-ld LOGDIR, --logdir LOGDIR
Path under which to save YAML logs (no logging if unset)

--log-test
Run simple logging test

--log-disable
Disable trace logs

--log-enable
Enable trace logs

--log-file
Specify a log filename (without extension)

--log-new
Create a separate new log file on start. Each log file will have
unique name: <name>.<ID>.log

--log-append
Don't truncate the old log file.

EXAMPLES
Here's an example of how to run llama.cpp's built-in HTTP server. This
example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's
recently-added support for image inputs.

llamafile \
-m llava-v1.5-7b-Q8_0.gguf \
--mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \
--host 0.0.0.0

Here's an example of how to generate code for a libc function using the
llama.cpp command line interface, utilizing WizardCoder-Python-13B
weights:

llamafile \
-m wizardcoder-python-13b-v1.0.Q8_0.gguf --temp 0 -r '}\n' -r '```\n' \
-e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'

Here's a similar example that instead utilizes Mistral-7B-Instruct
weights for prose composition:

llamafile \
-m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
-p '[INST]Write a story about llamas[/INST]'

Here's an example of how llamafile can be used as an interactive chatbot
that lets you query knowledge contained in training data:

llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI
assistant Digital Athena which is a large language model trained on the
sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
--keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
--in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'

Here's an example of how you can use llamafile to summarize HTML URLs:

(
echo '[INST]Summarize the following text:'
links -codepage utf-8 \
-force-html \
-width 500 \
-dump https://www.poetryfoundation.org/poems/48860/the-raven |
sed 's/ */ /g'
echo '[/INST]'
) | llamafile \
-m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
-f /dev/stdin \
-c 0 \
--temp 0 \
-n 500 \
--no-display-prompt 2>/dev/null

Here's how you can use llamafile to describe a jpg/png/gif/bmp image:

llamafile --temp 0 \
--image lemurs.jpg \
-m llava-v1.5-7b-Q4_K.gguf \
--mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
-e -p '### User: What do you see?\n### Assistant: ' \
--no-display-prompt 2>/dev/null

If you wanted to write a script to rename all your image files, you
could use the following command to generate a safe filename:

llamafile --temp 0 \
--image ~/Pictures/lemurs.jpg \
-m llava-v1.5-7b-Q4_K.gguf \
--mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
--grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
-e -p '### User: The image has...\n### Assistant: ' \
--no-display-prompt 2>/dev/null |
sed -e's/ /_/g' -e's/$/.jpg/'
three_baby_lemurs_on_the_back_of_an_adult_lemur.jpg

Here's an example of how to make an API request to the OpenAI API com‐
patible completions endpoint when your llamafile is running in the back‐
ground in --server mode.

curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" -d '{
"model": "gpt-3.5-turbo",
"stream": true,
"messages": [
{
"role": "system",
"content": "You are a poetic assistant."
},
{
"role": "user",
"content": "Compose a poem that explains FORTRAN."
}
]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()'

PROTIP
The -ngl 35 flag needs to be passed in order to use GPUs made by NVIDIA
and AMD. It's not enabled by default since it sometimes needs to be
tuned based on the system hardware and model architecture, in order to
achieve optimal performance, and avoid compromising a shared display.

SEE ALSO
llamafile-quantize(1), llamafile-perplexity(1), llava-quantize(1),
zipalign(1), unzip(1)

Mozilla Ocho January 1, 2024 Mozilla Ocho
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 04:00:01 UTC
Permalink
Post by Emanuel Berg
Okay, let's do it like this, here is the file, maybe
I'm blind.
Here is it for download as well if you want to use your own
pager:

https://dataswamp.org/~incal/tmp/llamafile.1
--
underground experts united
https://dataswamp.org/~incal
jeremy ardley
2024-07-14 04:20:01 UTC
Permalink
Post by Emanuel Berg
I've made several improvements, including adding the
`string-distance-percentage' that was mentioned. But let's
forget about that branch [1] or visit that URL for the latest
source on that.
Here, in this thread, the context thing with respect to AI,
anyone having any luck knowing what to do with that?
It is mentioned 14 times in llamafile(1) but not how to
actually set it up with your own data?
One way to set context is via the http api which is openai compatible.

You create the queries using python and include your context with the
query. It may have the ability to remember context so you only need it
once, but running locally you can resend the context each time.

https://platform.openai.com/docs/api-reference/chat/create
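
Any language that can speak HTTP works, of course. As a minimal,
untested Emacs Lisp sketch against the OpenAI-compatible endpoint that
llamafile's --server mode exposes on port 8080 (the model name is just
the placeholder the llamafile(1) example uses):

(require 'url)
(require 'json)

;; Sketch only: POST a context plus a question to a local llamafile
;; running with --server, via its OpenAI-compatible chat endpoint.
(defun llm-ask (context question)
  (let* ((url-request-method "POST")
         (url-request-extra-headers '(("Content-Type" . "application/json")))
         (url-request-data
          (encode-coding-string
           (json-encode
            `(("model" . "gpt-3.5-turbo") ; placeholder name, as in the man page
              ("messages" . [(("role" . "system") ("content" . ,context))
                             (("role" . "user") ("content" . ,question))])))
           'utf-8)))
    (with-current-buffer
        (url-retrieve-synchronously "http://localhost:8080/v1/chat/completions")
      (goto-char (point-min))
      (re-search-forward "\n\n")   ; skip the HTTP response headers
      (json-read))))               ; parsed JSON response as a Lisp object

;; (llm-ask (with-temp-buffer (insert-file-contents "comments.txt")
;;                            (buffer-string))
;;          "How original is: This isn't just another party?")
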
Emanuel Berg
2024-07-14 05:00:01 UTC
Permalink
Post by jeremy ardley
Post by Emanuel Berg
Here, in this thread, the context thing with respect to AI,
anyone having any luck knowing what to do with that? It is
mentioned 14 times in llamafile(1) but not how to actually
set it up with your own data?
One way to set context is via the http api which is openai compatible.
You create the queries using python and include your context
with the query. It may have the ability to remember context
so you only need it once, but running locally you can resend
the context each time.
You can, but how much?

So this is the context? You mean include it in the prompt?

Then it is more easy to find in the llamafile(1) man page, it
is probably this

-c N, --ctx-size N
Set the size of the prompt context. A larger
context size helps the model to better comprehend
and generate responses for longer input or
conversations. The LLaMA models were built with
a context of 2048, which yields the best results
on longer input / inference.

If the unit is bytes that means 2048 bytes.

Okay, that I can try it right now just by inserting all the
data from a file into the query (prompt) and ask. And everyone
can try that, actually.
--
underground experts united
https://dataswamp.org/~incal
jeremy ardley
2024-07-14 05:10:02 UTC
Permalink
Post by Emanuel Berg
You can, but how much?
So this is the context? You mean include it in the prompt?
Then it is more easy to find in the llamafile(1) man page, it
is probably this
-c N, --ctx-size N
Set the size of the prompt context. A larger
context size helps the model to better comprehend
and generate responses for longer input or
conversations. The LLaMA models were built with
a context of 2048, which yields the best results
on longer input / inference.
If the unit is bytes that means 2048 bytes.
Okay, that I can try it right now just by inserting all the
data from a file into the query (prompt) and ask. And everyone
can try that, actually.
the 2048 is tokens which is approximately the number of words in a
prompt, so not character count.

The context explains how you want it to respond and the prompt is the
actual question.

You can massively increase the size of strings to match using localdocs.
I know how to do this in GPT4all. I assume there is a localdocs
equivalent in llamafile?
Emanuel Berg
2024-07-14 05:20:01 UTC
Permalink
Post by jeremy ardley
the 2048 is tokens which is approximately the number of
words in a prompt, so not character count.
Ah, right.
Post by jeremy ardley
The context explains how you want it to respond and the
prompt is the actual question.
See the other mail; I don't know if the labels should look
a certain way or anything?
Post by jeremy ardley
You can massively increase the size of strings to match
using localdocs. I know how to do this in GPT4all. I assume
there is a localdocs equivalent in llamafile?
I don't know, let's check it out!
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 05:20:01 UTC
Permalink
So I used a text file mail.txt and fed it to the AI.

Well, well! What do you say?

I'll let you read the whole file to find out who won -
and why!
Post by jeremy ardley
Post by Emanuel Berg
Here, in this thread, the context thing with respect to AI,
anyone having any luck knowing what to do with that? It is
mentioned 14 times in llamafile(1) but not how to actually
set it up with your own data?
One way to set context is via the http api which is
openai compatible.
You create the queries using python and include your context
with the query. It may have the ability to remember context
so you only need it once, but running locally you can resend
the context each time.
You can, but how much?

So this is the context? You mean include it in the prompt?

Then it is more easy to find in the llamafile(1) man page, it
is probably -c N, --ctx-size N.

If the unit is bytes that means 2048 bytes.

Okay, that I can try it right now just by inserting all the
data from a file into the query (prompt) and ask. And everyone
can try that, actually.

The question:

In the above mail, what sentence is the most original?

Here is what the AI said:

The possible answers:

1. Here, in this thread, the context thing with respect to AI,
anyone having any luck knowing what to do with that?

2. One way to set context is via the http api which is
openai compatible.

3. You create the queries using python and include your
context with the query.

4. It may have the ability to remember context so you only
need it once, but running locally you can resend the
context each time.

5. Then it is more easy to find in the llamafile(1) man page,
it is probably -c N, --ctx-size N.

6. If the unit is bytes that means 2048 bytes.

7. Okay, that I can try it right now just by inserting all the
data from a file into the query (prompt) and ask.

8. And everyone can try that, actually.

The answer is 1. "Here, in this thread, the context thing with
respect to AI, anyone having any luck knowing what to do
with that?"

This sentence is original because it starts the discussion
about context in the thread. It is asking a question that is
specific to the thread and is trying to find out if anyone has
had any success with setting context for AI. It is not
repeating or summarizing information that has already been
stated and is the first mention of the topic in the
thread.
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-14 05:40:01 UTC
Permalink
Post by Emanuel Berg
The answer is 1. "Here, in this thread, the context thing
with respect to AI, anyone having any luck knowing what to
do with that?"
This sentence is original because it starts the discussion
about context in the thread.
Ah, there we have the next project:

The 'origin' string function!
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 16:40:01 UTC
Permalink
Post by Nicolas George
You cannot have a more general basic mechanical algorithm
unless you know what you want it to do.
I think one should build a network of nodes that each have a
computed distance to every other node; the one with the largest
sum of distances is the most isolated, so the most original!

So it is some sort of graph or topology algorithm? Maybe it
already exists in some form. The node most distant from
everything else? Easter Island?

But it is an expensive algorithm, yes, in this unoptimized form.

Anyway, you got me thinking, which is all good. I can start
coding some stuff even now that just compares very simple
units. A weighted sum of such comparisons can then be put
together and the comparison can begin.

See? I know some way. And I know some day :)
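A sketch of that node idea, under the simplifying assumptions that
each non-empty line of a file is a node and that difflib's similarity
ratio is an acceptable stand-in for a real distance measure (the
script name and file name are only examples):

#!/usr/bin/env python3
# Sketch: treat each line as a node, compute all pairwise distances,
# and call the node with the largest total distance the most original.
# O(n^2) comparisons, i.e. the expensive, unoptimized form.
import sys
from difflib import SequenceMatcher

def distance(a, b):
    # 0.0 = identical, 1.0 = nothing in common (1 - similarity ratio)
    return 1.0 - SequenceMatcher(None, a, b).ratio()

with open(sys.argv[1], encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

totals = []
for i, a in enumerate(lines):
    total = sum(distance(a, b) for j, b in enumerate(lines) if j != i)
    totals.append((total, a))

total, line = max(totals)
print(f"Most isolated ({total:.1f}): {line}")

Run it as, e.g., python3 isolated.py mail.txt.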
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 11:10:01 UTC
Permalink
Post by Nicolas George
https://en.wikipedia.org/wiki/Levenshtein_distance
But you also need to define what you want with more precision:
How do you count the replacement of a word by a synonym?
How do you count a change in the order of the words?
How do you count a transparent spelling mistake?
How do you count a spelling mistake that turns a word into
another existing word?
Indeed, one can have a bunch of such rules and apply them and
award points and stuff.

But maybe one could also have a more general, basic,
mechanical/math/stats-inspired algorithm?

Or a combination!

I forgot to say something about the file with the data. One can
either see it as just a bunch of data: how original is the new
data compared to the old?

One can also see the data as a bunch of entries, say one for
each line. To what extent is the new entry unlike all others?

That sounds easier, but not necessarily so, because that view
can be applied generally as well.

"This entry was unlike all others. However bits and pieces of
it appear all over the place."

See? How can we define what is original in a way that 1.
makes sense and 2. is useful for this application?
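Just to make that concrete, here is one possible (and entirely
arbitrary) definition as a sketch: the per-line view of the file, a
weighted mix of two very crude signals (shared words and
character-level similarity), and a score from 0 for something already
in the file to 100 for something unlike anything in it. The weights,
the function name and the example strings are made up:

#!/usr/bin/env python3
# Sketch of one possible originality score: 0 (seen before) .. 100
# (unlike anything in the file). Mixes two crude per-line signals:
# shared words and character-level similarity.
from difflib import SequenceMatcher

W_WORDS, W_CHARS = 0.5, 0.5  # arbitrary weights, tune to taste

def originality(new, path):
    new_words = set(new.lower().split())
    best = 0.0  # highest similarity to any existing line
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            old_words = set(line.lower().split())
            word_sim = len(new_words & old_words) / max(len(new_words), 1)
            char_sim = SequenceMatcher(None, new.lower(),
                                       line.lower()).ratio()
            best = max(best, W_WORDS * word_sim + W_CHARS * char_sim)
    return round(100 * (1.0 - best))

print(originality("A completely different remark", "mail.txt"))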
Post by Nicolas George
Not related to Debian, putting "[OT]" in the subject.
I forgot to say: one is expected to use only software from the
Debian repos or other sources readily available on FOSS
Unix-like systems.
--
underground experts united
https://dataswamp.org/~incal
d***@howorth.org.uk
2024-07-10 12:00:01 UTC
Permalink
Post by Nicolas George
Post by Emanuel Berg
Okay, this is gonna be a challenge to most guys who have been
processing text for a long time.
So, I would like a command, function or script, 'original',
that takes a string STR and a text file TXT and outputs
a score, from 0 to 100, how _original_ STR is, compared to
what is already in TXT.
So if I do
$ original "This isn't just another party" comments.txt
this will score 0 if that exact phrase to the letter already
exists in comments.txt.
But it will score 100 if not a single of those words exists in
the file! Because that would be 100% original.
Those endpoints are easy. But how to make it score - say - 62%
if some of the words are present, mostly spelled like that and
combined in ways that are not completely different?
Note: The above examples are examples, other definitions of
originality are okay. That is not the important part now - but
can be as interesting a part, later.
https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein distance isn't suited to the problem. It compares the
entirety of two strings. Emanuel is interested in comparing one string
against substrings of a potentially much larger string, or even
substrings of the first string in random order against portions of the
second string!
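One way to work around that, sketched here with difflib's ratio
standing in for a normalized edit distance, is to slide a window of
the same number of words over the larger text and keep the best match;
the example string and file name are made up:

#!/usr/bin/env python3
# Sketch: compare a short string against same-sized word windows of a
# larger text instead of against the text as a whole.
from difflib import SequenceMatcher

def best_window_similarity(needle, haystack):
    n_words = needle.split()
    h_words = haystack.split()
    size = len(n_words)
    best = 0.0
    for i in range(max(len(h_words) - size + 1, 1)):
        window = " ".join(h_words[i:i + size])
        best = max(best, SequenceMatcher(None, needle.lower(),
                                         window.lower()).ratio())
    return best  # 1.0 means an (almost) verbatim match somewhere

with open("mail.txt", encoding="utf-8") as f:
    text = f.read()
print(best_window_similarity("context size in tokens", text))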
Nicolas George
2024-07-10 12:50:01 UTC
Permalink
Post by d***@howorth.org.uk
Levenshtein distance isn't suited to the problem.
Either your crystal ball is much better than mine or you are trusting
yours too much, because we have not been told what the problem is.
Post by d***@howorth.org.uk
It compares the
entirety of two strings. Emanuel is interested in comparing one string
against substrings of a potentially much larger string,
This is not what I read in the messages.
Post by d***@howorth.org.uk
or even
substrings of the first string in random order against portions of the
second string!
This is what I am trying to get the OP to make explicit.

Regards,
--
Nicolas George
Greg Wooledge
2024-07-10 13:00:01 UTC
Permalink
Post by Nicolas George
Post by d***@howorth.org.uk
Levenshtein distance isn't suited to the problem.
Either your crystal ball is much better than mine or you are trusting
yours too much, because we have not been told what the problem is.
The problem will be something at the PhD level which the OP somehow
believes is a high school level problem.

I lost all faith when the OP revealed that their initial problem
description was just an *instance* of some overarching question, and
that they really wanted an n-dimensional generalization of that already
bleeding-edge research question.

Whatever the actual issue is, nobody here is going to be able to solve
it. I can guarantee it.
Emanuel Berg
2024-07-10 16:20:01 UTC
Permalink
Post by Greg Wooledge
I lost all faith when the OP revealed that their initial
problem description was just an *instance* of some
overarching question, and that they really wanted an
n-dimensional generalization of that already bleeding-edge
research question.
Oh, no, don't worry, this is the problem!

If one can solve it so that it outputs a score, there are
a lot of interesting applications.

What is the most original sentence, paragraph, or page in some
classic novel?

Who is the most original poster on some IRC channel, with some
bot keeping a rank?

Those are just toy examples I made up just now, but I'm sure
one could do really useful things with it as well. It is more
difficult to shoot from the hip here, though, but it is always
like that: there are always applications, and you never quite
expect where they will show up.

What would be interesting to theorize about, for those
attracted by such things, is the algorithm, but perhaps even
more interesting is how one would define originality, because
the program would be a de facto enforcer of its own definition.

For example, is originality only something that exists now and
did not exist before, i.e., a brand-new thing? Or can
a combination of old things actually be just as original?

Yes, it can - if you define the algorithm that way.
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 16:50:01 UTC
Permalink
Post by Greg Wooledge
The problem will be something at the PhD level which the OP
somehow believes is a high school level problem.
I lost all faith when the OP revealed that their initial
problem description was just an *instance* of some
overarching question, and that they really wanted an
n-dimensional generalization of that already bleeding-edge
research question.
Well, thank you!

A cognitive idea no doubt!

"The General Exception".

PhD dissertation by E. Berg AKA B. A. Original.
--
underground experts united
https://dataswamp.org/~incal
Emanuel Berg
2024-07-10 16:20:01 UTC
Permalink
Post by d***@howorth.org.uk
Post by Nicolas George
https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein distance isn't suited to the problem.
It compares the entirety of two strings. Emanuel is
interested in comparing one string against substrings of
a potentially much larger string, or even substrings of the
first string in random order against portions of the
second string!
Yes, this is how I see it as well.

But maybe I'm wrong and it can be included as part of some
bigger algorithm?
--
underground experts united
https://dataswamp.org/~incal
Thomas Schmitt
2024-07-10 10:20:01 UTC
Permalink
Hi,
Post by Emanuel Berg
how to make it score - say - 62%
if some of the words are present, mostly spelled like that and
combined in ways that are not completely different?
Looks like a popular aspect of artificial intelligence.
Googling "ai distance between texts" brings arbitrarily sophisticated
ways of fuzzy text comparison. Like:
https://medium.com/activewizards-machine-learning-company/comparison-of-the-text-distance-metrics-aed2eadfd1f1
https://en.wikipedia.org/wiki/Levenshtein_distance

The public search engines use such comparisons themselves.
"hunt the wumpis" usually leads to "hunt the wumpus".


Have a nice day :)

Thomas