scicloj.metamorph.ml.text

Large-scale text processing and TF-IDF feature engineering for NLP pipelines.

This namespace provides efficient tools for converting raw text documents into machine learning-ready features using TF-IDF (Term Frequency-Inverse Document Frequency) scoring. It is designed to handle large text corpora with flexible memory management strategies.

Core Functions:

->tidy-text Parses text files or datasets into tidy-text format (one token per row). Line-by-line processing enables handling of files larger than available RAM. Supports custom tokenization and metadata extraction.

Output format: tech.v3.dataset with columns:

  • :document (int): Document/line identifier
  • :token-idx (int): Token as indexed integer (maps to lookup table)
  • :token-pos (int): Position of token within document
  • :meta (optional): Arbitrary metadata from line-split-fn

->tfidf Transforms tidy-text into TF-IDF vector representation for bag-of-words models. Calculates term frequency (TF) and inverse document frequency (IDF) for each token.

Output columns:

  • :document
  • :token-idx
  • :token-count
  • :tf
  • :tfidf

Memory Optimization:

The namespace provides flexible memory control for large texts via options:

Container Types:

  • :jvm-heap (default): Java heap storage (fast, limited by heap)
  • :native-heap: Off-heap native memory via tech.v3
  • :mmap: Memory-mapped files (disk-backed, bypasses heap limits)

Processing Options:

  • container-type: Storage for intermediate results during processing
  • column-container-type: Storage for final output dataset
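  • combine-method: How intermediate blocks are combined, :coalesce-blocks! (default) or :concat-buffers; one or the other might need less RAM in certain scenarios
  • compacting-document-intervall: Number of lines after which data is written into a contiguous block (default 10000)
  • datatype-document, datatype-token-pos, datatype-token-idx: Integer datatype of the corresponding column (:int16 or :int32)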
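
As a rough illustration of how these options might be combined for a corpus that does not fit comfortably on the JVM heap, the sketch below collects them in a plain map. The option names and values are the ones documented on this page; the particular combination, the name large-corpus-opts and the file name in the commented-out call are assumptions for illustration, not a recommended configuration.

```clojure
;; Hypothetical option set for a large corpus (names from this page,
;; combination chosen only for illustration).
(def large-corpus-opts
  {:container-type                :native-heap     ; intermediate results off-heap
   :column-container-type         :mmap            ; final columns in a memory-mapped file
   :combine-method                :concat-buffers
   :compacting-document-intervall 10000            ; spelled as in the function signature
   :datatype-document             :int32           ; more than 32767 lines expected
   :datatype-token-pos            :int16
   :datatype-token-idx            :int32})         ; more than 32767 distinct tokens expected

(comment
  ;; Not evaluated here: needs the library on the classpath and a real file.
  ;; "corpus.txt" is a placeholder; meta is nil, so no :meta column is created.
  (require '[scicloj.metamorph.ml.text :as text]
           '[clojure.java.io :as io]
           '[clojure.string :as str])
  (apply text/->tidy-text
         (io/reader "corpus.txt")
         line-seq
         (fn [line] [line nil])
         (fn [text] (str/split text #" "))
         (mapcat identity large-corpus-opts)))
```

Which combination is fastest or most memory-efficient depends on the corpus, so measuring a few variants on a sample is probably the safest way to choose.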

Token Management:

  • token->index-map: Custom token lookup table (can reuse across runs)
  • new-token-behaviour: :store (default), :fail, or :as-unknown

Performance Characteristics:

  • Typical text requires ~1.5x the original file size in RAM
  • An 8 GB text file typically needs ≥12 GB of total memory
  • Scaling strategy: Use off-heap or mmap for large corpora
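
The rule of thumb above can be turned into a quick back-of-the-envelope check. The helper name estimated-ram-bytes is hypothetical, and the factor 1.5 is only the average reported on this page, so the result is an estimate rather than a guarantee.

```clojure
;; Rough estimate: about 1.5 bytes of RAM per byte of input text.
(defn estimated-ram-bytes
  [file-size-bytes]
  (long (Math/ceil (* 1.5 file-size-bytes))))

(def gib (* 1024 1024 1024))

;; An 8 GiB file comes out at 12 GiB.
(estimated-ram-bytes (* 8 gib))
```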

Typical Workflow:

  1. Use ->tidy-text to create tidy text format from raw documents
  2. Use ->tfidf to create TF-IDF feature vectors
  3. Pass vectors to classification/regression models

See also: scicloj.metamorph.ml.column-metric for evaluation, scicloj.metamorph.ml/train for model training

->tfidf

(->tfidf tidy-text & {:keys [container-type column-container-type combine-method datatype-meta], :or {combine-method :coalesce-blocks!, column-container-type :jvm-heap, container-type :jvm-heap, datatype-meta :object}})

Transforms a dataset in tidy-text format into a bag-of-words representation, including the TF-IDF calculation for the tokens.

tidy-text needs to be a dataset with columns

  • :document
  • :token-idx
  • :token-pos

The following three options can be used to move data off heap during calculations. They can make dramatic differences in performance (faster or slower) and memory usage.

  • container-type decides whether the intermediate results are stored on-heap (:jvm-heap, the default), off-heap (:native-heap), or in a memory-mapped file (:mmap)
  • column-container-type decides the same for the resulting dataset: on-heap (:jvm-heap, the default), off-heap (:native-heap), or in a memory-mapped file (:mmap)
  • combine-method How to combine the intermediate containers, either :coalesce-blocks! (the default) or :concat-buffers

Returns a dataset with columns:

  • :document document id
  • :token-idx The token as id
  • :token-count How often the token appears in a ‘document’
  • :tf :token-count divided by document length
  • :tfidf tfidf value for token
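
To make the columns concrete, here is a minimal plain-Clojure sketch of the calculation on a seq of row maps. It is not the library's implementation. The :tf definition follows the description above; the IDF formula log10(N / document-frequency) is an assumption, chosen because it is consistent with the sample output in the example (for instance 0.00884772 / 0.01265823 ≈ 0.699 ≈ log10(5/1) for a token that occurs in one of five documents). The function name tfidf-rows is hypothetical.

```clojure
;; rows: one map per token occurrence, with :document and :token-idx.
(defn tfidf-rows
  [rows]
  (let [n-docs     (count (distinct (map :document rows)))
        ;; number of tokens per document
        doc-length (frequencies (map :document rows))
        ;; occurrences of each token within each document
        counts     (frequencies (map (juxt :document :token-idx) rows))
        ;; number of documents each token appears in
        doc-freq   (frequencies (map second (keys counts)))]
    (for [[[doc tok] cnt] counts
          :let [tf  (/ (double cnt) (doc-length doc))
                ;; assumed IDF variant: log10(N / df)
                idf (Math/log10 (/ (double n-docs) (doc-freq tok)))]]
      {:document    doc
       :token-idx   tok
       :token-count cnt
       :tf          tf
       :tfidf       (* tf idf)})))

(tfidf-rows [{:document 0 :token-idx 1}
             {:document 0 :token-idx 1}
             {:document 0 :token-idx 2}
             {:document 1 :token-idx 2}])
```

In this toy input, token 2 occurs in every document, so its IDF and therefore its :tfidf value is 0.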

Examples

Convert CSV to tfidf

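;; Assumes the aliases str (clojure.string) and io (clojure.java.io)
(let [;; splits the text of a line into tokens at single spaces
      line-tokenizer-fn (fn [line] (str/split line #" "))
      ;; splits a CSV line into [text meta]; meta is the second field as int, decremented by 1
      parse-review-line-fn
      (fn [line]
        (let [splitted (first (clojure.data.csv/read-csv line))]
          [(first splitted)
           (dec (Integer/parseInt (second splitted)))]))]
  (-> (scicloj.metamorph.ml.text/->tidy-text (io/reader
                                              "test/data/reviews.csv")
                                             line-seq
                                             parse-review-line-fn
                                             line-tokenizer-fn
                                             :max-lines 5
                                             :skip-lines 1)
      :datasets
      first
      (scicloj.metamorph.ml.text/->tfidf)
      str))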
;;=> _unnamed [429 6]:
;;=> 
;;=> | :document |     :tfidf |        :tf | :token-idx | :token-count | :meta |
;;=> |----------:|-----------:|-----------:|-----------:|-------------:|-------|
;;=> |         0 | 0.00884772 | 0.01265823 |         65 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |         62 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |          7 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |         59 |            1 |     3 |
;;=> |         0 | 0.01769544 | 0.02531646 |         20 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |         58 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |         60 |            1 |     3 |
;;=> |         0 | 0.00884772 | 0.01265823 |         27 |            4 |     3 |
;;=> |         0 | 0.00503722 | 0.01265823 |          1 |            1 |     3 |
;;=> |         0 | 0.00245342 | 0.02531646 |         24 |            2 |     3 |
;;=> |       ... |        ... |        ... |        ... |          ... |   ... |
;;=> |         4 | 0.00303902 | 0.01369863 |        140 |            1 |     4 |
;;=> |         4 | 0.00957493 | 0.01369863 |        321 |            1 |     4 |
;;=> |         4 | 0.02872480 | 0.04109589 |        320 |            1 |     4 |
;;=> |         4 | 0.00303902 | 0.01369863 |        173 |            1 |     4 |
;;=> |         4 | 0.00545123 | 0.01369863 |         30 |            1 |     4 |
;;=> |         4 | 0.00957493 | 0.01369863 |        336 |            1 |     4 |
;;=> |         4 | 0.00303902 | 0.01369863 |         10 |            1 |     4 |
;;=> |         4 | 0.00545123 | 0.01369863 |        185 |            1 |     4 |
;;=> |         4 | 0.00957493 | 0.01369863 |        318 |            3 |     4 |
;;=> |         4 | 0.00303902 | 0.01369863 |        161 |            1 |     4 |
;;=> |         4 | 0.00303902 | 0.01369863 |         71 |            1 |     4 |

->tidy-text

(->tidy-text lines-source line-seq-fn line-split-fn line-tokenizer-fn & {:keys [skip-lines max-lines container-type datatype-document datatype-token-pos datatype-meta datatype-token-idx compacting-document-intervall combine-method token->index-map column-container-type new-token-behaviour], :or {datatype-token-idx :int16, max-lines Integer/MAX_VALUE, datatype-document :int16, container-type :jvm-heap, datatype-meta :object, datatype-token-pos :int16, compacting-document-intervall 10000, skip-lines 0, column-container-type :jvm-heap, combine-method :coalesce-blocks!, new-token-behaviour :store, token->index-map (Object2IntOpenHashMap. 10000)}})

Reads, parses and tokenizes a text file or a TMD dataset into a seq of tech.v3.dataset in the tidy-text format, that is, one token per row. Parsing and conversion are strictly line-based, so it should work for large documents.

Initial tests show that each byte of text needs about 1.5 bytes of memory on average, so an 8 GB text file can be loaded successfully with at least 12 GB available.

  • lines-source Either a buffered reader or a TMD dataset
  • line-seq-fn A function which returns a lazy list of lines, given the lines-source
  • line-split-fn A function which separates a single line of input into text and meta. It is supposed to return a seq of size 2, where the first element is the ‘text’ of the line and the second is the ‘meta’, which can be anything non-nil (map, vector, scalar). Its value is returned in the column :meta and is supposed to be processed further later. meta can always be nil, in which case no :meta column is created
  • line-tokenizer-fn A function which is called for each text obtained from line-split-fn. It should split the text at word boundaries and return the obtained tokens as a seq of strings. It can do any text normalisation desired.

Optional options are:

  • skip-lines 0 Lines to skip at beginning
  • max-lines Integer/MAX_VALUE Maximum number of lines to return

The following options can be used to optimize the heap usage for larger texts. They can be tuned depending on how many documents, how many words per document, and how many tokens overall are in the text corpus.

  • datatype-document :int16 Datatype of :document column (:int16 or :int32)
  • datatype-token-pos :int16 Datatype of :token-pos column (:int16 or :int32)
  • datatype-meta :object Datatype of :meta column (anything; needs to match what line-split-fn returns as ‘meta’)
  • datatype-token-idx :int16 Datatype of :token-idx column (:int16 or :int32)
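
A signed 16-bit integer can hold values up to 32767, so :int16 is only sufficient while the number of documents, the longest document (in tokens) and the vocabulary size each stay below that bound. The helper below is a hypothetical sketch for picking a datatype from an expected maximum value; it is not part of the library, and how the library behaves when a value does not fit is not stated on this page.

```clojure
;; Pick the smallest of the two documented integer datatypes
;; that can represent max-value.
(defn smallest-int-datatype
  [max-value]
  (if (<= max-value Short/MAX_VALUE) :int16 :int32))

;; Example: 2 million lines, at most 500 tokens per line, 80000 distinct tokens.
{:datatype-document  (smallest-int-datatype 2000000)
 :datatype-token-pos (smallest-int-datatype 500)
 :datatype-token-idx (smallest-int-datatype 80000)}
```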

The following options can be used to move data off heap during calculations. They can make dramatic differences in performance (faster and slower) and memory usage.

  • column-container-type :jvm-heap Whether the resulting table is created on heap (:jvm-heap), off heap (:native-heap), or in a memory-mapped file (:mmap)
  • container-type :jvm-heap Same as column-container-type, but for intermediate results, per interval
  • compacting-document-intervall 10000 After how many lines the data is written into a contiguous block
  • combine-method :coalesce-blocks! Which method to use to combine blocks (:coalesce-blocks! or :concat-buffers). One or the other might need less RAM in certain scenarios.
  • token->index-map Object2IntOpenHashMap Can be overridden with your own object->int map implementation (maybe off-heap). It can also be a map obtained from a previous run, in order to guarantee the same mappings.
  • new-token-behaviour :store How to react when new tokens appear that are not in token->index-map: either :store (default), :fail (throw an exception) or :as-unknown (use the special token [UNKNOWN])
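
The three values of new-token-behaviour can be pictured with a small pure function over an ordinary Clojure map. This is a sketch of the semantics as described above, not the library's code: the function name token-index is hypothetical, and it assumes that the unknown token is stored as "[UNKNOWN]" at index 0, as in the lookup table printed in the example further down.

```clojure
;; Returns [possibly-updated-lookup index] for a token.
(defn token-index
  [lookup token behaviour]
  (if-let [idx (get lookup token)]
    [lookup idx]                                   ; known token: nothing to decide
    (case behaviour
      :store      (let [idx (count lookup)]        ; add the token with the next free index
                    [(assoc lookup token idx) idx])
      :fail       (throw (ex-info "Token not in lookup table" {:token token}))
      :as-unknown [lookup (get lookup "[UNKNOWN]")])))

(def lookup {"[UNKNOWN]" 0 "this" 1})

(token-index lookup "is" :store)       ; lookup grows by one entry
(token-index lookup "is" :as-unknown)  ; lookup unchanged, index of "[UNKNOWN]"
```

Reusing a lookup table from a training run together with :as-unknown or :fail keeps the token indices of later data aligned with the ones a model was trained on.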

The function returns a map with the keys :datasets and :token-lookup-table.

:datasets is a seq of TMD datasets, each having 4 columns which represent the input text in the tidy-text format:

  • :document The ‘document/line’ a token comes from
  • :token-idx The token/word (as int), which is also present in the returned token->int lookup table
  • :token-pos The position of the token in the document
  • :meta The meta values, if returned by line-split-fn

Assuming that the line-tokenizer-fn does no text normalisation, the table is an exact representation of the input text. It also contains the word order in column :token-pos, so the original text can be recovered by re-sorting the table.
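
The tidy-text layout and its reversibility can be sketched in plain Clojure, without the library. The helpers tidy-rows and rebuild-lines are hypothetical and work on in-memory vectors of maps rather than datasets; reserving index 0 for "[UNKNOWN]" mirrors the lookup table shown in the example below and is otherwise an assumption.

```clojure
(require '[clojure.string :as str])

;; One row per token: document id, position in the document, token index.
(defn tidy-rows
  [lines]
  (let [lookup (atom {"[UNKNOWN]" 0})
        rows   (vec
                (for [[doc line]  (map-indexed vector lines)
                      [pos token] (map-indexed vector (str/split line #" "))]
                  {:document  doc
                   :token-pos pos
                   ;; assign the next free index the first time a token is seen
                   :token-idx (get (swap! lookup
                                          (fn [m]
                                            (if (contains? m token)
                                              m
                                              (assoc m token (count m)))))
                                   token)}))]
    {:rows rows :token-lookup-table @lookup}))

;; Sorting by [:document :token-pos] restores the original word order.
(defn rebuild-lines
  [rows lookup]
  (let [index->token (into {} (map (fn [[token idx]] [idx token])) lookup)]
    (->> rows
         (sort-by (juxt :document :token-pos))
         (partition-by :document)
         (map (fn [doc-rows]
                (str/join " " (map (comp index->token :token-idx) doc-rows)))))))

(let [{:keys [rows token-lookup-table]} (tidy-rows ["this is a a sample"
                                                    "this is another example"])]
  (rebuild-lines (shuffle rows) token-lookup-table))
```

Even after shuffling the rows, sorting by :document and :token-pos yields the two input lines again, which is the property described above.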

Examples

Parse csv file, take second text column and transform to tidy format

(def tidy
  (let [parse-review-line-fn
        (fn [line]
          (let [splitted (first (clojure.data.csv/read-csv line))]
            [(first splitted)
             (dec (Integer/parseInt (second splitted)))]))
        tokenize-fn (fn [s] (str/split s #" "))]
    (->tidy-text (io/reader "test/data/reviews.csv")
                 line-seq
                 parse-review-line-fn
                 tokenize-fn
                 :max-lines 5
                 :skip-lines 1
                 :datatype-meta :int16)))
;;=> #'scicloj.metamorph.ml.text/tidy
(-> tidy
    :datasets
    first
    str)
;;=> _unnamed [596 4]:
;;=> 
;;=> | :token-idx | :token-pos | :document | :meta |
;;=> |-----------:|-----------:|----------:|------:|
;;=> |          1 |          0 |         0 |     3 |
;;=> |          2 |          1 |         0 |     3 |
;;=> |          3 |          2 |         0 |     3 |
;;=> |          4 |          3 |         0 |     3 |
;;=> |          5 |          4 |         0 |     3 |
;;=> |          6 |          5 |         0 |     3 |
;;=> |          4 |          6 |         0 |     3 |
;;=> |          7 |          7 |         0 |     3 |
;;=> |          8 |          8 |         0 |     3 |
;;=> |          9 |          9 |         0 |     3 |
;;=> |        ... |        ... |       ... |   ... |
;;=> |        323 |         62 |         4 |     4 |
;;=> |        337 |         63 |         4 |     4 |
;;=> |        338 |         64 |         4 |     4 |
;;=> |          3 |         65 |         4 |     4 |
;;=> |        339 |         66 |         4 |     4 |
;;=> |        340 |         67 |         4 |     4 |
;;=> |        341 |         68 |         4 |     4 |
;;=> |        342 |         69 |         4 |     4 |
;;=> |        343 |         70 |         4 |     4 |
;;=> |        110 |         71 |         4 |     4 |
;;=> |        344 |         72 |         4 |     4 |
(-> tidy
    :token-lookup-table)
;;=> {larger=>221, three=>92, 50-60=>39, Jules=>106, treats=>134, salsa=>77, highly=>191, charge=>291, texture,=>256, ordering=>220, worse=>49, accessory=>336, Best=>325, goods.=>236, quite=>159, begin=>283, with=>67, eat=>131, />In=>254, adding=>275, They=>194, since=>21, />ron=>344, Yes.=>15, actually=>112, wonderful=>108, cream,=>201, girlfriend's=>11, wheat=>268, looking=>240, recommend=>192, and=>42, happened=>149, around=>38, makes=>120, an=>166, anyway?=>14, highlight=>138, Christmas=>139, each=>122, So,=>29, gifts.=>68, crisps?109, place.170, flavor,=>311, strong=>232, for=>140, Crisps=>153, savory=>248, ISO=>306, could=>47, And=>278, addictive=>189, Extremely=>330, goodies.=>54, box.=>156, ask?).296, put=>282, at=>124, product=>5, very=>323, kitchen.=>96, Sorry,=>245, quantity.=>222, grainy.=>258, store=>148, it.=>28, terms=>255, />I=>111, at-best=>224, it=>2, Once=>69, uses.343, my=>10, with?=>284, difficult=>32, Destrooper=>151, sweetened,=>187, there's=>31, as=>160, hot=>331, packet=>288, Very=>321, has=>43, flour.=>230, more!).319, knew=>99, adequate=>226, still=>305, people=>66, want=>63, shipped=>81, again=>70, than=>50, thoughtful=>19, (its=>315, problem=>301, desserts=>251, terribly=>188, snack=>195, of=>53, Christmas.=>125, they=>135, who=>119, texture.=>270, lightly=>186, staples=>93, it's=>22, barely=>225, solution=>299, after=>341, sweet.=>324, far=>250, imparts=>231, from=>116, around.=>183, every=>95, a=>3, what=>237, baked=>235, substitute.=>307, order=>318, crisps=>115, any=>196, don't=>242, upon=>150, make=>56, (Heaven=>289, me.141, flours=>264, japan.=>79, sure=>57, bean=>233, She=>126, gum=>294, myself=>308, me=>179, sweet=>44, cap=>340, way=>73, almost=>94, I=>74, The=>333, because=>309, It=>16, />Highly=>206, same=>266, your=>35, before=>219, local=>147, get=>76, />110, wife's=>117, brand=>174, but=>176, Destrooper's=>107, receiving=>27, store.=>213, If=>184, All=>214, ingredients=>102, good,=>322, December=>181, say=>215, 
xanthan=>276, aren't=>158, you=>46, Or=>285, eggs=>90, can=>75, While=>157, syrup=>334, found=>208, include=>287, much=>295, rolls=>182, thing=>267, know=>58, dud.253, God=>216, time=>197, person=>26, Is=>1, ahead=>83, too=>257, other=>143, that=>100, ice=>200, product,=>261, GF=>263, Did=>9, is.=>61, gluten=>300, am=>304, recipes=>274, box=>52, more=>132, had=>298, close=>168, treat,=>190, is=>71, chocolate.332, />The=>142, Just=>55, convenient,=>337, enough=>292, section=>211, come=>165, now,=>317, crisps.=>193, food=>210, why=>272, love=>13, didn't=>303, wife=>312, No.=>8, compliment=>199, produce=>105, gives=>128, virtually=>273, least=>286, Sugar,=>88, recommended.=>207, used=>313, seems=>17, Butter=>152, taste=>234, smallest=>217, by=>204, learned=>113, until=>180, years=>40, recipient=>60, parents=>12, fast=>82, enjoy=>185, might=>246, lot=>130, not=>265, did=>314, to=>24, that's=>37, you're=>238, specific=>23, already=>328, flour--would=>293, most=>262, in=>78, if=>30, snacks--it's=>252, offending=>65, always=>127, think=>243, someone=>33, all-purpose=>229, only=>72, day=>144, the=>25, us=>129, flavor-wise?=>241, gum.=>277, grabbed=>155, knows=>290, great=>4, product.=>302, yous=>87, aunt=>118, on=>34, xanthan,=>280, time,=>84, so.=>244, should.=>133, gone=>316, gift=>20, coffee=>202, all=>203, thank=>86, health=>209, just=>281, second=>169, will=>177, immediately=>154, But=>259, dishes,=>249, these=>101, them=>121, regular=>228, This=>223, />=>320, ones=>163, beats=>172, sticky=>339, tooth,=>45, [UNKNOWN]=>0, grocery=>212, perfect=>198, are=>91, hoping=>297, things=>327, or=>6, now=>178, point--since=>279, have=>136, defense=>260, themselves.205, />Homemade=>171, Who=>98, avoids=>338, combine=>103, sweet.329, flour=>269, like=>18, pump=>335, size=>218, such=>104, be=>247, year=>123, walking=>145, added=>326, this=>51, value?=>7, opinion,=>175, Don't=>62, was=>80, good=>161, many=>342, mint=>310, substitute=>227, do=>48, through=>146, homemade=>162, 
really=>239, old=>41, bought=>173, about=>114, become=>137, That's=>271, munchas=>85, how=>59, aunt,=>164, butter=>89, go=>64, extremely=>167, list=>36, =>97}

libsvm->tidy

(libsvm->tidy reader)

Reads LIBSVM format data into a tidy dataset.

reader - Reader (typically from a file) containing LIBSVM formatted text

Returns a dataset with columns:

  • :instance - Document/instance ID (0-indexed)
  • :label - Class label from the LIBSVM file
  • :index - Feature index
  • :value - Feature value

Each line in LIBSVM format is parsed: <label> <index>:<value> ...
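
As an illustration of the line format, the following plain-Clojure sketch turns one LIBSVM line into tidy rows. It is not the library's parser: the function name parse-libsvm-line is hypothetical, the instance id is simply passed in by the caller, and labels are assumed to be integer class labels.

```clojure
(require '[clojure.string :as str])

;; "<label> <index>:<value> ..." -> one row map per <index>:<value> pair
(defn parse-libsvm-line
  [instance line]
  (let [[label & features] (str/split (str/trim line) #"\s+")]
    (for [feature features
          :let [[index value] (str/split feature #":")]]
      {:instance instance
       :label    (Long/parseLong label)       ; assumes integer labels
       :index    (Long/parseLong index)
       :value    (Double/parseDouble value)})))

(parse-libsvm-line 0 "1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667")
```

Features that are absent from a line simply produce no row, which is what makes the format sparse.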

The reader is automatically closed after reading.

See also: tidy->libsvm!, ->tidy-text

Examples

Read and convert file in libsvm format to ‘tidy’ dataset

(-> (scicloj.metamorph.ml.text/libsvm->tidy
     (io/reader "test/data/iris.libsvm.txt"))
    str)
;;=> _unnamed [587 4]:
;;=> 
;;=> | :instance | :index |     :value | :label |
;;=> |----------:|-------:|-----------:|-------:|
;;=> |         0 |      1 | -0.5555560 |      1 |
;;=> |         0 |      2 |  0.2500000 |      1 |
;;=> |         0 |      3 | -0.8644070 |      1 |
;;=> |         0 |      4 | -0.9166670 |      1 |
;;=> |         4 |      1 | -0.6666670 |      1 |
;;=> |         4 |      2 | -0.1666670 |      1 |
;;=> |         4 |      3 | -0.8644070 |      1 |
;;=> |         4 |      4 | -0.9166670 |      1 |
;;=> |         8 |      1 | -0.7777780 |      1 |
;;=> |         8 |      3 | -0.8983050 |      1 |
;;=> |       ... |    ... |        ... |    ... |
;;=> |       575 |      2 | -0.1666670 |      3 |
;;=> |       575 |      3 |  0.4237290 |      3 |
;;=> |       575 |      4 |  0.5833330 |      3 |
;;=> |       579 |      1 |  0.0555554 |      3 |
;;=> |       579 |      2 |  0.1666670 |      3 |
;;=> |       579 |      3 |  0.4915250 |      3 |
;;=> |       579 |      4 |  0.8333330 |      3 |
;;=> |       583 |      1 | -0.1111110 |      3 |
;;=> |       583 |      2 | -0.1666670 |      3 |
;;=> |       583 |      3 |  0.3898300 |      3 |
;;=> |       583 |      4 |  0.4166670 |      3 |

tidy->libsvm!

(tidy->libsvm! tidy-ds writer column)

Writes a tidy dataset to LIBSVM text format.

  • tidy-ds - Dataset with TF-IDF features (usually from ->tidy-text or ->tfidf), including the columns :token-idx, :token-pos and :document
  • writer - BufferedWriter for output
  • column - Column name containing feature values (e.g., :meta or :tfidf)

Writes the dataset in LIBSVM sparse format: <label> <index>:<value> <index>:<value> ... where the label comes from the column given as column. Groups by :document and writes one line per :document, with tokens sorted by :token-idx.
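
The grouping and sorting can be sketched in plain Clojure on a seq of row maps. This is not the library's implementation: the function name tidy->libsvm-lines is hypothetical, it returns strings instead of writing to a writer, and it takes both the label and the feature values from the same column. That choice reproduces the sample output in the example for :meta, but it is an assumption and may not match the library in other cases.

```clojure
(require '[clojure.string :as str])

;; One output line per :document, features sorted by :token-idx.
(defn tidy->libsvm-lines
  [rows column]
  (->> rows
       (group-by :document)
       (sort-by key)
       (map (fn [[_ doc-rows]]
              (let [sorted (sort-by :token-idx doc-rows)]
                (str/join " "
                          (cons (str (get (first sorted) column))   ; label
                                (map #(str (:token-idx %) ":" (get % column))
                                     sorted))))))))

(tidy->libsvm-lines
 (concat (map (fn [t] {:document 0 :token-idx t :meta 0}) [1 2 3 3 4])
         (map (fn [t] {:document 1 :token-idx t :meta 1}) [1 2 5 5 6 6 6]))
 :meta)
```

Repeated tokens produce repeated index:value pairs here, as they do in the sample output (for example 3:0 3:0).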

The writer is automatically closed after writing.

See also: ->tfidf, libsvm->tidy, ->tidy-text

Examples

Usage

;; Assumes the aliases csv (clojure.data.csv), str (clojure.string) and io (clojure.java.io)
(let [;; splits a CSV line into [text meta]; meta is the second field as int, decremented by 1
      parse-review-line (fn [line]
                          (let [splitted (first (csv/read-csv line))]
                            [(first splitted)
                             (dec (Integer/parseInt (second splitted)))]))
      tokenize-fn (fn [s] (str/split s #" "))
      ;; temporary file that receives the LIBSVM output
      f (java.io.File/createTempFile "tidy" ".txt")
      _ (.deleteOnExit f)]
  (-> (java.io.StringReader.
       "this is a a sample,1\nthis is another another example example example,2\nthe world is full of examples,3")
      (io/reader)
      (->tidy-text line-seq parse-review-line tokenize-fn)
      :datasets
      first
      (tidy->libsvm! (io/writer f) :meta))
  (slurp f))
;;=> 0 1:0 2:0 3:0 3:0 4:0
;;=> 1 1:1 2:1 5:1 5:1 6:1 6:1 6:1
;;=> 2 2:2 7:2 8:2 9:2 10:2 11:2