You can subscribe to this list here.

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 |  |  |  |  |  |  |  |  |  | (1) |  |  |
| 2010 |  |  |  |  |  |  |  | (2) |  |  | (1) | (1) |
| 2011 |  |  |  |  |  | (2) | (2) |  |  |  |  |  |
| 2012 |  |  |  | (1) |  |  |  | (3) | (2) | (1) |  |  |
| 2013 |  |  |  |  | (1) |  |  |  |  |  |  | (1) |
| 2014 |  | (1) | (1) |  |  |  |  |  |  |  |  |  |
| 2015 |  |  |  |  | (2) |  |  | (1) |  | (2) |  |  |
| 2016 | (7) |  |  |  |  |  |  |  |  |  |  |  |
| 2017 | (7) |  |  |  |  |  |  |  |  | (1) |  |  |
| 2018 |  |  | (2) |  |  |  | (1) |  |  | (1) |  |  |
| 2019 |  |  |  |  | (1) |  | (5) |  |  |  |  |  |
| 2020 | (1) |  |  |  | (3) |  |  |  |  |  |  |  |
| 2021 |  |  |  |  | (1) |  |  |  |  |  |  |  |
| 2023 |  |  |  |  |  | (1) |  |  |  |  |  |  |
From: Serge H. <sl...@en...> - 2023-06-28 15:40:57
|
Dear all,

We are pleased to announce the availability of TXM version 0.8.3: https://pages.textometrie.org/textometrie/files/software/TXM/0.8.3/en

WHAT IS TXM? Read the TXM leaflet https://txm.gitpages.huma-num.fr/textometrie/files/documentation/TXM%20Leaftlet%20EN.pdf [a bit old, but still a synthetic presentation]. All up-to-date information is at https://www.textometrie.org

NEWS

In addition to fixing many bugs, this new version includes some notable new features:

* Mac compatibility (including M1 and M2 processors)
* Improved
  * interface ergonomics for some commands
  * interface robustness for saving CQP and URS annotations
* Simplified
  * choice of language for corpora imported via the clipboard
  * export/import of corpus word properties
  * utilities management
* New functions
  * export/import of a corpus
  * export/import of a calculus call with its parameters
* General component upgrades
  * R 4.2.2
  * Java 17.0.7
  * Eclipse 2022-09
  * Saxon 12.2
  * Groovy 4.0.3

Extensions

* TreeTagger
  * upgrade to version 3.2.5
* URS (Unit-Relation-Schema) annotation
  * more robust input and save interface

ACKNOWLEDGEMENTS

Thank you to the testers of this version: Flora Badin, Ioana Galleron, Amal Guha, Philippe Guillet, Jean-Louis Janin, Loïc Liégeois, Giancarlo Luxardo, Christophe Parisse, Pierre Ratinaud and Gilles Toubiana.

Good explorations,
Serge Heiden, for the TXM team

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Nubia C. P. <nun...@ya...> - 2021-05-24 10:05:44
|
Hello everyone,

I would like to add some lemmatization annotations in the "frpos" property. The expression "à taille humaine" was annotated as "LOC:ADJ". However, when I look at the number of occurrences of nouns and adjectives, "taille" appears under the value "noms" and "humaine" under the category "adj". How can I make sure that both terms are counted as "LOC:ADJ" and not counted a second time under the values "NOM" and "ADJ"?

Thank you in advance for your advice.

Best regards,
Nubia Ch. P.
|
From: Serge H. <sl...@en...> - 2020-05-27 14:57:42
|
Dear Gabriele,
On 27/05/2020 at 16:13, Gabriele Brandolini wrote:
> Maybe there is a need to change some lines in the script "TokenizerClasses.groovy", as the script looks to be stronger than the options given by the user as Tokenizer Parameters. Is it possible?
It should be possible (the TXM import modules are written as Groovy scripts for precisely that reason), but it is still not easy. We need to provide better access to this and document it.
>
> This solution is even harder when the user (like me) does not know exactly the meaning of the character classes used to specify how to cut punctuation, [\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}], and in
> which way the single apostrophe could be deleted in the list.
The TXM manual section on word segmentation (https://txm.gitpages.huma-num.fr/txm-manual/importer-un-corpus-dans-txm.html#segmentation-en-phrases-et-en-mots) mentions the Unicode category ("catégorie Unicode") of the characters
used in those regexes, which links to https://en.wikipedia.org/wiki/Unicode_character_property#General_Category.
If you look at the documentation of the U+0027 APOSTROPHE ' character (https://codepoints.net/U+0027), you see that it is of category "Other Punctuation" = Po.
So if you remove the "Po" Unicode category from the Punctuation parameter regex, leaving [\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{S}],
your words containing apostrophes should no longer be segmented.
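The effect of dropping \p{Po} can be checked against the Unicode character database, for example in Python. This is a sketch: the two helper functions below merely mimic the character classes discussed in this thread; they are not TXM code.

```python
import unicodedata

# U+0027 APOSTROPHE belongs to the Unicode general category "Po"
# (Other Punctuation), which is why the default punctuation class
# [\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}] cuts words at apostrophes.

def in_default_punct_class(ch):
    """True if ch matches the default punctuation class (with Po)."""
    cat = unicodedata.category(ch)
    return cat in {"Ps", "Pe", "Pi", "Pf", "Po"} or cat.startswith("S")

def in_reduced_punct_class(ch):
    """Same class with the Po category removed, as suggested above."""
    cat = unicodedata.category(ch)
    return cat in {"Ps", "Pe", "Pi", "Pf"} or cat.startswith("S")

print(unicodedata.category("'"))    # Po
print(in_default_punct_class("'"))  # True  -> apostrophe is split off
print(in_reduced_punct_class("'"))  # False -> apostrophe kept inside words
print(in_reduced_punct_class("("))  # True  -> Ps, still treated as punctuation
```

Brackets and symbols remain punctuation in the reduced class; only the Po characters (apostrophe, comma, period, etc.) stop being segmentation points, which is why elision handling then has to rely on the "Caractères d’élision" parameter instead.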
>
> Maybe another option to take into account could be to allow the user to specify her/his own external tokenizer program/script to be used?
It is currently difficult to call a Perl or Python script from TXM to do that job, but not impossible (and not documented).
The easiest option at the moment is to provide TXM with your pre-segmented corpus (produced by any tool you want) in XML format, with <w>...</w> tags around your words, and use the XML/w+CSV
import module.
This corresponds to solution B.2) in my previous answer.
Please let me know if this is not sufficiently clear.
Best,
Serge
--
Dr. Serge Heiden, sl...@en..., http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
|
From: Serge H. <sl...@en...> - 2020-05-26 11:31:24
|
Dear Gabriele,

Currently, the apostrophe character category is managed by TXM for French (grammatical elisions) and English (genitives, contractions) word segmentation. [https://txm.gitpages.huma-num.fr/txm-manual/importer-un-corpus-dans-txm.html#segmentation-en-phrases-et-en-mots] By default, three characters are interpreted as apostrophes: ['‘’]

With the TXT+CSV import module, the simplest way to handle the apostrophe is to tune the "Caractères d’élision" parameter value in the import parameters form. If the Swahili writing system doesn't use the apostrophe for elision, just remove the characters from the "Caractères d’élision" parameter (= <empty>).

If the Swahili writing system uses the same apostrophes for elision and internally in words:

A) for the TXT+CSV import module, rewrite all the "word internal" apostrophes with an apostrophe character that is not listed in the "Caractères d’élision" parameter.

B) for the XML/w+CSV import module, two solutions:

B.1) force the segmentation encoding of all the words having internal apostrophes with the <w> tag: for example, encode the word "ng’ombe" as <w>ng’ombe</w> (with this encoding, TXM will not segment "ng’ombe" further)

B.2) segment your texts with a Swahili-aware external tool, then encode all the words with <w> tags (TXM will use all your words 'as is')

Best,
Serge

On 26/05/2020 at 13:51, Gabriele Brandolini wrote:
> I’ve trouble in obtaining the correct tokenization of Swahili texts when in a word is present the combination of the letters <ng> + the apostrophe <’>, which is the usual and correct way to express
> orthographically in Swahili the sound /ŋ/.
>
> It should be noted that the use of the simple apostrophe is present as normal Swahili punctuation, so the solution couldn’t be simply to remove it from the words in which it appears.
>
> As a rule, during the process of tokenization of a text, the apostrophe – as such – will be separated from the word which precedes it, as well as from that which follows it. BUT the sequence <ng’>
> when present inside any Swahili word – always – should not cause to split the word in two parts at the apostrophe character. I.e. <ng’ombe> (="cow") should be tokenized as it is: ng’ombe (in one
> line); NOT
>
> ng
> ‘
> ombe
>
> in three lines, neither:
>
> ng’
> ombe
>
> or,
>
> ng
> ‘ombe
>
> in two lines.
>
> I will be very grateful if someone could give me any suggestions to solve this question; maybe by suggesting which are the correct and needed import parameters for a TXT (or XML) + CSV Swahili
> corpus in TXM.
>
> Many thanks!
>
> Gabriele Brandolini
>
> _______________________________________________
> TXM-open mailing list
> TXM...@li...
> https://lists.sourceforge.net/lists/listinfo/txm-open

--
Dr. Serge Heiden, sl...@en..., http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
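Solutions B.1 and B.2 both amount to handing TXM pre-segmented input of roughly this shape. This is a sketch only: the surrounding elements and the second word are illustrative; the point is that every token sits inside a <w> element, so the XML/w+CSV import module performs no further segmentation.

```xml
<!-- Hypothetical pre-tokenized input for the XML/w+CSV import module.
     TXM takes each <w> element as one word, 'as is'. -->
<text>
  <p>
    <w>ng’ombe</w>
    <w>wangu</w>
    <w>.</w>
  </p>
</text>
```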
|
From: Gabriele B. <gab...@gm...> - 2020-05-26 10:43:33
|
I’ve trouble in obtaining the correct tokenization of Swahili texts when in a word is present the combination of the letters <ng> + the apostrophe <’>, which is the usual and correct way to express orthographically in Swahili the sound /ŋ/.

It should be noted that the use of the simple apostrophe is present as normal Swahili punctuation, so the solution couldn’t be simply to remove it from the words in which it appears.

As a rule, during the process of tokenization of a text, the apostrophe – as such – will be separated from the word which precedes it, as well as from that which follows it. BUT the sequence <ng’> when present inside any Swahili word – always – should not cause to split the word in two parts at the apostrophe character. I.e. <ng’ombe> (="cow") should be tokenized as it is: ng’ombe (in one line); NOT

ng
‘
ombe

in three lines, neither:

ng’
ombe

or,

ng
‘ombe

in two lines.

I will be very grateful if someone could give me any suggestions to solve this question; maybe by suggesting which are the correct and needed import parameters for a TXT (or XML) + CSV Swahili corpus in TXM.

Many thanks!
Gabriele Brandolini
|
From: Nubia C. P. <nun...@ya...> - 2020-01-01 20:25:54
|
Hello, I would like to subscribe to the TXM group. Thank you for confirming my registration! Best regards, Nubia Ch. P.
|
From: Péter H. <hor...@ya...> - 2019-07-05 13:44:46
|
Thank you, Serge. Putting the properties of structural units into the properties of words is a good idea. I will check this Groovy thing as well. Best, Péter
Serge Heiden <sl...@en...> wrote (Friday, 5 July 2019, 14:48:27 CEST):
TXM UI tools only work on words currently, sorry.
Usually, to be able to extract and count structure properties, projects duplicate the information onto the words underneath, and then use the usual TXM tools on that information. This is typically done during import with the XTZ+CSV import module: in the 3-posttok step, an XSL stylesheet such as txm-posttok-addRef.xsl (from the TXM XSL library <https://sourceforge.net/projects/txm/files/library/xsl>) builds new word properties from element attributes found in their ancestors.
In your case, the XSL should copy, say, l@met to all w@met underneath.
Then, applying tools to words gives access to the properties of the structures above.
Otherwise, you can get direct access to structure properties in Groovy scripts.
For example, the CQi tutorial shows how to display structure attributes in the section "Accéder aux valeurs des propriétés de structures du corpus DISCOURS":
<https://groupes.renater.fr/wiki/txm-info/public/tutoriel_cqi#acceder_aux_valeurs_des_proprietes_de_structures_du_corpus_discours>
A simple macro wrapping all that low-level code would, however, make it easier to access the structure attributes of a corpus or sub-corpus.
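A 3-posttok stylesheet along these lines could look as follows. This is a hedged sketch, not the actual txm-posttok-addRef.xsl from the TXM XSL library: it combines the identity template with one override that copies the ancestor l element's met attribute onto every w element (the TEI namespace and the met attribute are the example from this thread; adapt them to your corpus).

```xml
<?xml version="1.0"?>
<!-- Sketch: copy l@met down to each w as w@met, so word-level TXM
     tools can see the verse-line property. Words outside an l element
     would get an empty met attribute; guard for that if needed. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="#all" version="2.0">

  <!-- identity template: copy everything else unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- override for w: copy its attributes, then add met -->
  <xsl:template match="tei:w">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="met" select="ancestor::tei:l/@met"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```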
Best,
Serge
On 05/07/2019 at 13:51, Péter Horváth wrote:
Hi Serge,
Thank you for your help, but I'm looking for a solution where I just query the rhyme or rhythm patterns, not the words under a defined constraint of a rhyme or a rhythm pattern. I would like to get frequency lists of rhyme and rhythm patterns in the same way as I can get frequency lists of words, lemmas, pos in the Lexicon feature of TXM. Is this solvable somehow?
Best, Péter
Serge Heiden <sl...@en...> wrote (Friday, 5 July 2019, 12:01:19 CEST):
Hi Péter,
Have you tried to use the syntax presented in section 12.5.2 "Using a property on a structure" of the manual?
(http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf)
For example :
[_.l_met="\+--\+-\+-\+-\+-"]
to query words inside specific verse lines
or
<l_met="\+--\+-\+-\+-\+-"> []
to query words at the beginning of specific verse lines
Best,
Serge
On 04/07/2019 at 13:11, Péter Horváth via TXM-open wrote:
Hello,
We are building a poetry corpus, and besides the grammatical features of words, we annotate the rhyme pattern of stanzas and the rhythm of lines. These annotations are the values of the attributes of lg (line group) and l (line) tags in TEI XML. Is it possible to query these attributes in TXM? Because for me it seems that only the attributes of words can be queried in TXM.
Any help is welcome,
Péter
_______________________________________________
TXM-open mailing list
TXM...@li...
https://lists.sourceforge.net/lists/listinfo/txm-open
--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
TXM-open mailing list
TXM...@li...
https://lists.sourceforge.net/lists/listinfo/txm-open
--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
TXM-open mailing list
TXM...@li...
https://lists.sourceforge.net/lists/listinfo/txm-open
|
|
From: Serge H. <sl...@en...> - 2019-07-05 12:48:03
|
TXM UI tools only work on words currently, sorry.

Usually, to be able to extract and count structure properties, projects duplicate the information onto the words underneath, and then use the usual TXM tools on that information. This is typically done during import with the XTZ+CSV import module: in the 3-posttok step, an XSL stylesheet such as txm-posttok-addRef.xsl (from the TXM XSL library <https://sourceforge.net/projects/txm/files/library/xsl>) builds new word properties from element attributes found in their ancestors. In your case, the XSL should copy, say, l@met to all w@met underneath. Then, applying tools to words gives access to the properties of the structures above.

Otherwise, you can get direct access to structure properties in Groovy scripts. For example, the CQi tutorial shows how to display structure attributes in the section "Accéder aux valeurs des propriétés de structures du corpus DISCOURS" <https://groupes.renater.fr/wiki/txm-info/public/tutoriel_cqi#acceder_aux_valeurs_des_proprietes_de_structures_du_corpus_discours>

A simple macro wrapping all that low-level code would, however, make it easier to access the structure attributes of a corpus or sub-corpus.

Best,
Serge

On 05/07/2019 at 13:51, Péter Horváth wrote:
> Hi Serge,
>
> Thank you for your help, but I'm looking for a solution where I just query the rhyme or rhythm patterns, not the words under a defined constraint of a rhyme or a rhythm pattern. I would like to get frequency lists of rhyme and rhythm patterns in the same way as I can get frequency lists of words, lemmas, pos in the Lexicon feature of TXM. Is this solvable somehow?
>
> Best,
> Péter
>
> Serge Heiden <sl...@en...> wrote (Friday, 5 July 2019, 12:01:19 CEST):
>
> Hi Péter,
>
> Have you tried to use the syntax presented in section 12.5.2 "Using a property on a structure" of the manual?
> (http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf)
> For example:
>
> [_.l_met="\+--\+-\+-\+-\+-"]
> to query words inside specific verse lines
>
> or
>
> <l_met="\+--\+-\+-\+-\+-"> []
> to query words at the beginning of specific verse lines
>
> Best,
> Serge
>
> On 04/07/2019 at 13:11, Péter Horváth via TXM-open wrote:
>> Hello,
>>
>> We are building a poetry corpus, and besides the grammatical features of words, we annotate the rhyme pattern of stanzas and the rhythm of lines. These annotations are the values of the attributes of lg (line group) and l (line) tags in TEI XML. Is it possible to query these attributes in TXM? Because for me it seems that only the attributes of words can be queried in TXM.
>> Any help is welcome,
>> Péter
>>
>> _______________________________________________
>> TXM-open mailing list
>> TXM...@li...
>> https://lists.sourceforge.net/lists/listinfo/txm-open
>
> --
> Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
> Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
> _______________________________________________
> TXM-open mailing list
> TXM...@li...
> https://lists.sourceforge.net/lists/listinfo/txm-open

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Péter H. <hor...@ya...> - 2019-07-05 11:54:17
|
Hi Serge,
Thank you for your help, but I'm looking for a solution where I just query the rhyme or rhythm patterns, not the words under a defined constraint of a rhyme or a rhythm pattern. I would like to get frequency lists of rhyme and rhythm patterns in the same way as I can get frequency lists of words, lemmas, pos in the Lexicon feature of TXM. Is this solvable somehow?
Best, Péter
Serge Heiden<sl...@en...> ezt írta (2019. július 5., péntek 12:01:19 CEST):
Hi Péter,
Have you tried to use the syntax presented in section 12.5.2 "Using a property on a structure" of the manual?
(http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf)
For example :
[_.l_met="\+--\+-\+-\+-\+-"]
to query words inside specific verse lines
or
<l_met="\+--\+-\+-\+-\+-"> []
to query words at the beginning of specific verse lines
Best,
Serge
Le 04/07/2019 à 13:11, Péter Horváth via TXM-open a écrit :
Hello,
We are building a poetry corpus, and besides the grammatical features of words, we annotate the rhyme pattern of stanzas and the rhythm of lines. These annotations are the values of the attributes of lg (line group) and l (line) tags in TEI XML. Is it possible to query these attributes in TXM? Because for me it seems that only the attributes of words can be queried in TXM.
Any help is welcome,
Péter
_______________________________________________
TXM-open mailing list
TXM...@li...
https://lists.sourceforge.net/lists/listinfo/txm-open
--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
TXM-open mailing list
TXM...@li...
https://lists.sourceforge.net/lists/listinfo/txm-open
|
|
From: Serge H. <sl...@en...> - 2019-07-05 10:00:56
|
Hi Péter,

Have you tried to use the syntax presented in section 12.5.2 "Using a property on a structure" of the manual?
(http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf)

For example:

[_.l_met="\+--\+-\+-\+-\+-"]

to query words inside specific verse lines, or

<l_met="\+--\+-\+-\+-\+-"> []

to query words at the beginning of specific verse lines.

Best,
Serge

On 04/07/2019 at 13:11, Péter Horváth via TXM-open wrote:
> Hello,
>
> We are building a poetry corpus, and besides the grammatical features of words, we annotate the rhyme pattern of stanzas and the rhythm of lines. These annotations are the values of the attributes of lg (line group) and l (line) tags in TEI XML. Is it possible to query these attributes in TXM? Because for me it seems that only the attributes of words can be queried in TXM.
> Any help is welcome,
> Péter
>
> _______________________________________________
> TXM-open mailing list
> TXM...@li...
> https://lists.sourceforge.net/lists/listinfo/txm-open

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Péter H. <hor...@ya...> - 2019-07-04 11:13:57
|
Hello, We are building a poetry corpus, and besides the grammatical features of words, we annotate the rhyme pattern of stanzas and the rhythm of lines. These annotations are the values of the attributes of lg (line group) and l (line) tags in TEI XML. Is it possible to query these attributes in TXM? Because for me it seems that only the attributes of words can be queried in TXM. Any help is welcome, Péter |
|
From: Serge H. <sl...@en...> - 2019-05-29 15:45:43
|
Dear all,

We are pleased to announce the availability of TXM version 0.8.0: http://textometrie.ens-lyon.fr/files/software/TXM/0.8.0. To install this version, you must download and then run its installation software.

NEWS

In addition to fixing many bugs, this new version includes some notable new elements:

Improved installation
- TXM is now installed in a directory dedicated to its version. This makes it possible to use different versions of TXM at the same time, for example TXM 0.7.9 and TXM 0.8.0;
- it is no longer necessary to painstakingly install TreeTagger yourself and then link it to TXM: two new TXM extensions now allow you to download and install, in a few clicks, both TreeTagger and two default language models (en and fr).

Improved user session
- all calculation results can now be kept between two TXM sessions, either on demand (the default operating mode) or systematically;
- all commands now expose all their parameters in a dedicated area of the results window. This makes it much easier to refine the results. For example, you can easily adjust the queries to use or the structure to display in a Progression, without having to relaunch the command and enter all its parameters.

Consequences:
- for most commands, you must first specify one or two mandatory parameters in the dedicated area before you can obtain a result by pressing the Calculate button in the results window;
- the re-calculation of a result causes the re-calculation of all the results which depend on it (its descendants in the Corpus view): for example, recalculating a lexical table recalculates the tables of specificities or the AFCs built from it;
- the results, tables or graphs, can be updated either as soon as one of the parameters is modified (an operating mode called "electric", deactivated by default), or on demand by pressing the "[re-]calculate" button (in this operating mode, an "asterisk" mark is displayed in the window tab as soon as one of its parameters is modified, to indicate that the display of the results can be refreshed).

New preferences:
- TXM > User > Automatic recalculation when changing a parameter: a result is recalculated as soon as one of the parameters of the command is modified;
- TXM > User > Backup all default results (persistence): all new results are kept between TXM sessions.

Also:
- a result can be cloned, with or without its descendants, for example to compare variants of calculations;
- graphic display variants (font type or font size, background color, etc.) can be adjusted directly from the results window in a dedicated toolbar.

Improved annotation tools
- You can now correct or add simple word properties from a concordance, such as TreeTagger's "enpos" and "enlemma" properties. This annotation mode has been added to the initial annotation modes of word sequences.

Improved import
- the metadata file "metadata.*" can now be read in .ods (LibreOffice Calc), .xlsx (MS Excel) or .csv format;
- the import and loading of corpora are faster because TXM no longer restarts the engines during these treatments.

The new version of the TXM manual for version 0.8 is being finalized. We will announce its availability soon on the list.

Good explorations,
Serge Heiden, for the TXM team

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Gabriele B. <gab...@gm...> - 2018-10-23 05:57:30
|
Buongiorno! Looking for a corpus I had already installed and used in TXM, I can no longer find it. I've seen that its <import.xml> seems to be corrupted, as it ends like this:

---------------------
<page id="13"
      wordid="w_Constantinus_Porphyrogenitus_PG_112_113_De_thematibus_6001"/>
<page id="14"
      wordid="w_Constantinus_Porphyrogenitus_PG_112_113_De_thematibu
------------------

Is it possible to re-create this import.xml file, or in some way to be able to use that corpus again? Thank you very much for your help!

Gabriele Brandolini
|
From: Gabriele B. <gab...@gm...> - 2018-07-17 11:25:40
|
Hello! In order to get a good tokenization of Swahili words containing an apostrophe inside (see ng’ombe; mang’amuzi), how should the punctuation characters in the Tokenizer Parameters be set? I’m trying to import a Swahili corpus using the TXT+CSV import module. Many thanks! Gabriele
|
From: Serge H. <sl...@en...> - 2018-03-14 17:12:05
|
Hello Ciarán,

Your XSL is missing a template to copy all the other tags. The result is no longer XML because it no longer contains any tags at all (at least one is needed around the whole text, but that is not what you want anyway).

For this kind of processing, I suggest you take as a model the example XSL named "filter-out-p.xsl" shipped with TXM in your $HOME/TXM/xsl directory, or retrieve it from Sourceforge: https://sourceforge.net/projects/txm/files/library/xsl. The README.markdown file documents all the available XSL stylesheets.

In the filter-out-p.xsl stylesheet, the template (which looks a lot like yours):

<xsl:template match="//p[@type='ouverture']">
<!--<xsl:copy-of select="."/>-->
</xsl:template>

removes all <p> tags having a "type" attribute with the value "ouverture". The following template:

<xsl:template match="//p">
<!--<xsl:copy-of select="."/>-->
</xsl:template>

removes all <p> tags and their content. However, the following template is absolutely required to copy the other tags:

<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>

Another interesting XSL for filtering certain tags is "filter-keep-only-select.xsl".

To test your XSL stylesheets, I suggest using the ExecXSL macro, documented here: https://groupes.renater.fr/wiki/txm-users/public/macros#execxsl

If you can correspond in French, I suggest you write instead to the "txm...@gr..." mailing list, to which several XSL experts are subscribed and will be able to answer you directly: https://groupes.renater.fr/sympa//info/txm-users. The txm-open list is much less active and rather intended for exchanges in English.

Best regards,
Serge Heiden

On 14/03/2018 at 17:36, Ciarán Ó Duibhín via TXM-open wrote:
> Hello,
> I am trying to create a corpus by importing a file LU006.xml (attached) with XML-XTZ+CSV. I created a file a.xsl in the \xsl\3-posttok directory, to remove from the pivot all the <c>..</c> tags together with their content. Here it is; there are certainly errors:
> <?xml version="1.0"?>
> <xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:tei="http://www.tei-c.org/ns/1.0"
> xmlns:txm="http://textometrie.org/1.0"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> exclude-result-prefixes="#all" version="2.0">
>
> <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="no" indent="no"/>
> <xsl:template match="c"/>
> </xsl:stylesheet>
> TXM gives me the message "Content is not allowed in prolog." See tmx1.jpg (attached).
> I noticed, in the corpora/TOBAR/tokenized directory, a file LU006.xml (see tokenizedLU006.xml, attached) where the <c>..</c> tags and their content have been removed, but all the other tags are missing as well (their content is still there). In place of the <c>..</c> tags there is a line break, which was not my intention.
> What should I do? Thank you for your ideas.
> Ciarán Ó Duibhín.
>
> _______________________________________________
> TXM-open mailing list
> TXM...@li...
> https://lists.sourceforge.net/lists/listinfo/txm-open

--
Dr. Serge Heiden, slh AT ens-lyon.fr, http://textometrie.ens-lyon.fr
Équipe de recherche Cactus, laboratoire IHRIM UMR5317, ENS de Lyon
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Ciarán Ó D. <cod...@bt...> - 2018-03-14 16:36:45
|
Hello, I am trying to create a corpus by importing a file LU006.xml (attached) with XML-XTZ+CSV. I created a file a.xsl in the \xsl\3-posttok directory, to remove from the pivot all the <c>..</c> tags together with their content. Here it is; there are certainly errors:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:txm="http://textometrie.org/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="no" indent="no"/>
<xsl:template match="c"/>
</xsl:stylesheet>

TXM gives me the message "Content is not allowed in prolog." See tmx1.jpg (attached). I noticed, in the corpora/TOBAR/tokenized directory, a file LU006.xml (see tokenizedLU006.xml, attached) where the <c>..</c> tags and their content have been removed, but all the other tags are missing as well (their content is still there). In place of the <c>..</c> tags there is a line break, which was not my intention. What should I do? Thank you for your ideas.

Ciarán Ó Duibhín.
|
From: Antonio R. C. <roj...@gm...> - 2017-10-26 12:45:36
|
Dear friends,

I am trying to install the Spanish and French TreeTagger packages in TXM. I have followed these instructions: http://txm.sourceforge.net/installtreetagger_en.html

After finishing the whole process, I query with a POS and I get these error messages:

- For a Spanish corpus: Index of <[espos=“ADJ”]> with property: [word] in the corpus: TXT org.txm.searchengine.cqp.serverException.CqiClErrorInternal Last CQP error: ``espos’’ is neither a positional/structural attribute nor a label reference
- For a French corpus: Index of <[frpos=“ADJ”]> with property: [word] in the corpus: GRAAL org.txm.searchengine.cqp.serverException.CqiClErrorInternal

The tree-tagger directory is a sibling of the TXM directory. Is that right? I have attached some screenshots of my file directories. Any idea what is the problem? Many thanks in advance! Merci beaucoup!

--
Dr. Antonio Rojas Castro
Researcher, Cologne Center for eHumanities
Communication coordinator, EADH
< http://www.antoniorojascastro.com >
|
From: Serge H. <sl...@en...> - 2017-01-18 16:15:32
|
On 18/01/2017 at 14:32, Peter Boot wrote:
>> Concerning TreeTagger, we will start a development that will work more on getting rid of it than optimizing its usage. If you are interested, here are the different alternative technologies
> I'm not a linguist, but the big advantage of TreeTagger seems to me that it can be used in a uniform way with so many languages and on multiple platforms.

Yes, this is why we use it. But it is not perfect (by definition), and the software and the parameter files are not open source, so we can't make the lemmatization process better or adapt it to a specific sub-language.

>> To come back to the corpus geometry target, if a description of a set of more 'text mining' oriented - than textometry - tools can be defined (with different priorities than the textometry set),
> I don't have a textometry background but am reading up on corpus linguistics, mostly research from the UK. My texts don't have markup, except for the linguistic tagging added during import. I do have about ten metadata fields at text level. What to me is important (but I'm just beginning) is e.g.
> - the CQL support
> - concordancing with integrated metadata
> - collocation analysis
> - looking at specificity of words by partition

Thanks for the feedback.

Best,
Serge

--
Dr. Serge Heiden, sl...@en..., http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
|
From: Serge H. <sl...@en...> - 2017-01-18 16:07:30
|
On 18/01/2017 at 13:49, Peter Boot wrote:
> In fact, I already processed these 600,000 texts using TreeTagger once.

In this case, if you don't mind raw-text editions, you could consider using the CWB import module instead of TXT+CSV. Just give it a directory containing a corpus source file that includes pre-encoded word properties (in the native CQP source corpus format), similar to the sample I attached to this mail. This should speed things up. Best, Serge -- Dr. Serge Heiden, sl...@en..., http://textometrie.ens-lyon.fr ENS de Lyon - IHRIM UMR5317 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883 |
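For readers unfamiliar with the native CQP source format mentioned above: it is a one-token-per-line ("vertical") text format, with tab-separated positional attributes and XML-like structural tags. The attached sample is not reproduced here; the following is an illustrative sketch only, and the attribute names (`pos`, `lemma`) and `id` values are invented for the example:

```
<text id="review_001" author="Boot">
<s>
The	DET	the
book	NOUN	book
is	VERB	be
excellent	ADJ	excellent
.	PUNCT	.
</s>
</text>
```

Each token line carries word, pos and lemma separated by tabs; tags such as `<text>` and `<s>` become CQP structural attributes at encoding time.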
|
From: Peter B. <pet...@hu...> - 2017-01-18 13:32:20
|
Hello Serge, Thanks for your reply.

> That being said, any feedback to speed up the import process is very welcome

My main suggestion would be to buffer calls to TreeTagger, if you don't already do that. But if you want to get rid of it, that's not a solution.

> Concerning TreeTagger, we will start a development that will work more on getting rid of it
> than on optimizing its usage. If you are interested, here are the different alternative technologies

I'm not a linguist, but the big advantage of TreeTagger seems to me that it can be used in a uniform way with so many languages and on multiple platforms.

> To come back to the corpus geometry target, if a description of a set of more 'text mining'
> oriented - than textometry - tools can be defined (with different priorities than the textometry set),

I don't have a textometry background but am reading up on corpus linguistics, mostly research from the UK. My texts don't have markup, except for the linguistic tagging added during import. I do have about ten metadata fields at text level. What is important to me (but I'm just beginning) is e.g.
- the CQL support
- concordancing with integrated metadata
- collocation analysis
- looking at specificity of words by partition

But I should say that at the moment I'm doing my analysis work in Python notebooks running against a relational database, and I'm still wondering which tool fits which functionality best. Best, Peter |
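As an aside, for readers comparing tools: the concordancing Peter mentions is easy to prototype in a Python notebook. A minimal keyword-in-context (KWIC) sketch over a pre-tokenized text (this is not TXM code; the window size and whitespace tokenization are illustrative):

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) triples for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the cat sat on the mat and the cat slept".split()
for left, match, right in kwic(tokens, "cat"):
    print(f"{left:>20} | {match} | {right}")
```

A real tool adds what Peter asks for on top of this loop: metadata columns per hit, CQL instead of string equality, and counts by partition.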
|
From: Peter B. <pet...@hu...> - 2017-01-18 12:49:50
|
Hello Piotr,
> We had a similar issue (outside TXM) that was caused by TT re-loading
> the language model for each new file. The trick to handle it was to keep
> the model in memory for the entire duration of pipeline processing.

Thanks. In fact, I already processed these 600,000 texts using TreeTagger once. What I did was call TreeTagger once for every hundred texts. That sped up processing enough to make the job manageable. Best, Peter |
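Peter's batching trick can be sketched as follows: chunk the file list and hand each chunk to a single TreeTagger invocation, so the model is loaded once per batch rather than once per file. The command name and paths below are placeholders, and real pipelines need per-file separator tokens to split the tagged output again:

```python
import subprocess
from pathlib import Path

def chunked(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def tag_batch(files, tagger_cmd=("tree-tagger-english",)):
    """Concatenate one batch of tokenized files and tag them in one process.

    NOTE: the command name is a placeholder; adapt it to your TreeTagger
    installation, and insert per-file separators to split the output."""
    text = "\n".join(Path(f).read_text(encoding="utf-8") for f in files)
    return subprocess.run(tagger_cmd, input=text, text=True,
                          capture_output=True).stdout

files = [f"reviews/{i}.txt" for i in range(250)]   # hypothetical paths
batches = chunked(files, 100)                      # 3 batches: 100 + 100 + 50
```

The speed-up comes entirely from amortizing the tagger's start-up (model loading) cost over each batch.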
|
From: Piotr B. <ba...@o2...> - 2017-01-18 11:42:16
|
Hello Peter, We had a similar issue (outside TXM) that was caused by TT re-loading the language model for each new file. The trick to handle it was to keep the model in memory for the entire duration of pipeline processing. Obviously, you have to use a pipeline outside of TXM for that, but that seems more or less in accordance with what Serge says about TXM resigning from internally using TT (unless I misunderstood). I can try to find the relevant code (can't recall... probably Python), but it might take a while. Best regards, Piotr On 16/01/17 21:26, Peter Boot wrote: > Hello all, > > I have downloaded TXM and am quite impressed by what it can do. I hope to use it for studying a collection of book reviews. > > I have been testing the TXT + CSV import process, and it works wonderfully. However, it takes a minute per hundred files. For my 600000 files, that would be 6000 minutes, i.e. 100 hours. That's a rather long time. Something more than half of the time is apparently spent in TreeTagger (or perhaps mostly building a connection to TreeTagger). Not building an edition improves the speed only marginally. > > Do I have other options to improve import speed? > > Thanks in advance, > Peter Boot > |
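Piotr's keep-the-model-in-memory approach amounts to starting the tagger process once and streaming texts through its stdin/stdout. A generic sketch follows; it uses a trivial Python echo subprocess as a stand-in for TreeTagger (whose streaming options vary across versions), so only the long-lived-process pattern is real here:

```python
import subprocess
import sys

class PersistentTagger:
    """Keep one long-lived worker process and pipe lines through it,
    instead of paying the start-up (model loading) cost per file."""

    def __init__(self, cmd):
        self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def tag(self, line):
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

# Stand-in worker: uppercases each input line (a real tagger would
# emit word/pos/lemma per token instead).
worker = [sys.executable, "-u", "-c",
          "import sys\n"
          "for line in sys.stdin:\n"
          "    print(line.strip().upper(), flush=True)"]

tagger = PersistentTagger(worker)
results = [tagger.tag(t) for t in ["first text", "second text"]]
tagger.close()
```

Both calls reuse the same process, which is exactly what saves the per-file model reload Piotr describes.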
|
From: Serge H. <sl...@en...> - 2017-01-18 11:23:26
|
Hi Peter, Thank you for your interest in TXM. Currently, the main corpus geometry target for TXM is hundreds of texts of tens to hundreds of thousands of words each, with medium to dense XML-TEI markup, some properties for each word (pos, lemma...) and for each text (metadata). This target follows the textometry corpus analysis methodology that TXM implements. That methodology tries to balance text reading (with scholarly editions, concordances...) with various quantitative synthesis tools. So I'm afraid you will not be able to import 600,000 TXT texts directly today. That being said, any feedback to speed up the import process is very welcome, and we have some development tickets related to that:
- https://forge.cbp.ens-lyon.fr/redmine/issues/1630
- https://forge.cbp.ens-lyon.fr/redmine/issues/1666
I am not sure whether the optimizations already developed are available in TXM 0.7.8 (guys?). Concerning TreeTagger, we will start a development that will work more on getting rid of it than on optimizing its usage. If you are interested, here are the different alternative technologies we are currently contemplating: https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto. To come back to the corpus geometry target, if a description of a set of tools oriented more towards 'text mining' than textometry can be defined (with different priorities than the textometry set), we could try to help develop a response within the TXM platform. For example, a partnership that develops an alternative corpus geometry target like this:
- lots of small to medium texts
- lots of words
- shallow or no text structure
- some properties for each word and each text
All the best, Serge

On 16/01/2017 at 21:26, Peter Boot wrote:
> Hello all,
>
> I have downloaded TXM and am quite impressed by what it can do. I hope to use it for studying a collection of book reviews.
>
> I have been testing the TXT + CSV import process, and it works wonderfully.
> However, it takes a minute per hundred files. For my 600,000 files, that would be 6,000 minutes, i.e. 100 hours. That's a rather long time. Something more than half of the time is apparently spent in TreeTagger (or perhaps mostly building a connection to TreeTagger). Not building an edition improves the speed only marginally.
>
> Do I have other options to improve import speed?
>
> Thanks in advance,
> Peter Boot
-- Dr. Serge Heiden, sl...@en..., http://textometrie.ens-lyon.fr ENS de Lyon - IHRIM UMR5317 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883 |
|
From: Peter B. <pet...@hu...> - 2017-01-16 20:45:51
|
Hello all, I have downloaded TXM and am quite impressed by what it can do. I hope to use it for studying a collection of book reviews. I have been testing the TXT + CSV import process, and it works wonderfully. However, it takes a minute per hundred files. For my 600000 files, that would be 6000 minutes, i.e. 100 hours. That's a rather long time. Something more than half of the time is apparently spent in TreeTagger (or perhaps mostly building a connection to TreeTagger). Not building an edition improves the speed only marginally. Do I have other options to improve import speed? Thanks in advance, Peter Boot |
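The time estimate above is simple arithmetic, sketched here for clarity (the one-minute-per-hundred-files rate is the figure measured in this email):

```python
files = 600_000
files_per_minute = 100          # measured: one minute per hundred files
minutes = files / files_per_minute
hours = minutes / 60
print(minutes, hours)           # 6000.0 minutes, i.e. 100.0 hours
```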
|
From: Jean-Claude <jc....@sk...> - 2016-01-13 14:09:13
|
Hello,
I installed TXM 0.7.7 on a MacBook Air running OS X 10.10.5.
I have a problem: when I request a classification, I get an error message (identical with both graphics engines) and no dendrogram. I also get the same error when I request a density "progression" for any form.
Here is the error message (attached file) and what is written in the console:
Calcul de la progression de CESFEVRIER2016 avec la requête : [[word="syndicat"]], la structure : null, la propriété : null, cumulatif: true
Requête sur CESFEVRIER2016 : Q14 <- [word="syndicat"]
Terminé : [15] positions affichées
Calcul de la progression de CESFEVRIER2016 avec la requête : [[word="syndicat"]], la structure : null, la propriété : null, cumulatif: false
Requête sur CESFEVRIER2016 : Q15 <- [word="syndicat"]
Reval : library(textometry)
Reval : rm(x0)
INT_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@75565551
Reval : rm(positions)
Reval : positions <- list(x0)
Reval : rm(structurepositions)
INT_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@76fb7505
Reval : rm(structurenames)
CHAR_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@76bfd849
Reval : rm(colors)
CHAR_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@2d10dd87
Reval : rm(styles)
INT_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@2e3fe12e
Reval : rm(widths)
INT_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@63d8aaba
Reval : rm(names)
CHAR_VECTOR_ADDED_TO_WORKSPACE[Ljava.lang.Object;@25b0cc8c
Sauvegarde du graphique dans le fichier /Users/jean-claude/TXM/results/progression5156786576870559964.svg
Reval : try(svg("/Users/jean-claude/TXM/results/progression5156786576870559964.svg"))
Dernier 'safeeval'try(svg("/Users/jean-claude/TXM/results/progression5156786576870559964.svg"))
Loading SVG document from file: /Users/jean-claude/TXM/results/progression5156786576870559964.svg
Terminé : [15] positions affichées
|