Generazione di combinazioni di sottostringhe da una stringa

Sto provando a generare tutte le possibili combinazioni di sillabe per una determinata parola. Il processo per identificare cosa è una sillaba non è rilevante qui, ma è la generazione di tutte le combinazioni che mi sta dando un problema. Penso che sia probabilmente possibile fare ricorsivamente in poche righe (anche se ogni altro modo va bene), ma ho difficoltà a farlo funzionare. Qualcuno può aiutare?Generazione di combinazioni di sottostringhe da una stringa

// how to test a syllable, just for the purpose of this example 
    bool IsSyllable(string possibleSyllable) 
    { 
     return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$"); 
    } 

    List<string> BreakIntoSyllables(string word) 
    { 
     // the code here is what I'm trying to write 
     // if 'word' is "misunderstand" , I'd like this to return 
     // => {"mis","und","er","stand"},{ "mis","un","der","stand"} 
     // and for any other combinations to be not included 
    }

fonte

2011-10-06 mikel

Dovresti usare Trie per catalogare le tue sillabe. O puoi usare le soluzioni naiver :-) – xanatos

provare a partire con questo:

var word = "misunderstand"; 

Func<string, bool> isSyllable = 
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$"); 

var query = 
    from i in Enumerable.Range(0, word.Length) 
    from l in Enumerable.Range(1, word.Length - i) 
    let part = word.Substring(i, l) 
    where isSyllable(part) 
    select part;

Ciò restituisce:

misunderstand-results

Se tale aiuto per cominciare, almeno?

EDIT: ho pensato un po 'più su questo problema un avvicinò con questa coppia di query:

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     let e = t.Substring(n) 
     from g in (new [] { new [] { e } }).Concat(splitter(e)) 
     select (new [] { s }).Concat(g).ToArray(); 

var query = 
    from split in (new [] { new [] { word } }).Concat(splitter(word)) 
    where split.All(part => isSyllable(part)) 
    select split;

Ora query ritorno questo:

misunderstanding-results2

Let me know se questo è inchiodato ora.

fonte

2011-10-06 13:27:49 Enigmativity

Grazie, sembra promettente ma al momento mi sta facendo un errore di compilazione - "'System.Collections.Generic.IEnumerable ' non contiene una definizione per 'StartWith' e nessun metodo di estensione ' Iniziare con'" . – mikel

@ mikel - Ah, mi dispiace per quello. Ho usato un metodo di estensione Reactive Framework. Lo aggiusterò quando torno sul mio PC. – Enigmativity

@ mikel - Ho modificato la query per utilizzare operatori standard. – Enigmativity

Normalmente questo tipo di problemi viene risolto utilizzando Tries. Baserò la mia implementazione di un Trie su How to create a trie in c# (ma notate che l'ho riscritto).

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" }); 
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" }); 

var word = "unquestionable"; 

var parts = new List<List<string>>(); 

Split(word, 0, trie, trie.Root, new List<string>(), parts); 

// 

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts) 
{ 
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if). 
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if) 
    if (node.IsTerminal) 
    { 
     // Add the syllable to the current list of syllables 
     currentParts.Add(node.Word); 

     // "covered" the word with syllables 
     if (index == word.Length) 
     { 
      // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time. 
      parts.Add(new List<string>(currentParts)); 
     } 
     else 
     { 
      // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root. 
      Split(word, index, trie, trie.Root, currentParts, parts); 
     } 

     // Remove the syllable from the current list of syllables 
     currentParts.RemoveAt(currentParts.Count - 1); 
    } 

    // We have covered all the word with letters. No more work to do in this subiteration 
    if (index == word.Length) 
    { 
     return; 
    } 

    // Here we try to find the edge corresponding to the current character 

    TrieNode nextNode; 

    if (!node.Edges.TryGetValue(word[index], out nextNode)) 
    { 
     return; 
    } 

    Split(word, index + 1, trie, nextNode, currentParts, parts); 
} 

public class Trie 
{ 
    public readonly TrieNode Root = new TrieNode(); 

    public Trie() 
    { 
    } 

    public Trie(IEnumerable<string> words) 
    { 
     this.AddRange(words); 
    } 

    public void Add(string word) 
    { 
     var currentNode = this.Root; 

     foreach (char ch in word) 
     { 
      TrieNode nextNode; 

      if (!currentNode.Edges.TryGetValue(ch, out nextNode)) 
      { 
       nextNode = new TrieNode(); 
       currentNode.Edges[ch] = nextNode; 
      } 

      currentNode = nextNode; 
     } 

     currentNode.Word = word; 
    } 

    public void AddRange(IEnumerable<string> words) 
    { 
     foreach (var word in words) 
     { 
      this.Add(word); 
     } 
    } 
} 

public class TrieNode 
{ 
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>(); 
    public string Word { get; set; } 

    public bool IsTerminal 
    { 
     get 
     { 
      return this.Word != null; 
     } 
    } 
}

word è la stringa che ti interessa, parts conterrà l'elenco delle liste di possibili sillabe (probabilmente sarebbe più corretto per renderlo un List<string[]>, ma è abbastanza facile da fare. Invece di parts.Add(new List<string>(currentParts)); scrivere parts.Add(currentParts.ToArray()); e cambiare tutto il List<List<string>>-List<string[]>.

aggiungerò una variante di risposta Enigmativity PTA è theretically più veloce del suo, perché scarta sillabe sbagliate immediatamente invece di post-filtraggio in un secondo momento. Se ti piace, si dovrebbe dargli un +1, perché senza la sua idea, questa variante non sarebbe possibile Ma nota che è ancora un hack. La soluzione "giusta" è nell'uso Trie (s) :-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$"); 

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     (
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s) 
     let e = t.Substring(n) 
     let f = splitter(e) 
     from g in f 
     select (new[] { s }).Concat(g).ToArray() 
     ) 
     .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]); 

var parts = splitter(word).ToList();

Una spiegazione:

 from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s)

Calcoliamo tutte le possibili sillabe di una parola, di lunghezza 1 alla lunghezza della parola - 1 e controlla se è una sillaba. Estirpiamo direttamente le non-sillabe. La parola completa come sillaba verrà verificata in seguito.

 let e = t.Substring(n) 
     let f = splitter(e)

Cerchiamo le sillabe della parte restante della stringa

 from g in f 
     select (new[] { s }).Concat(g).ToArray()

E noi catena le sillabe trovati con la sillaba "corrente". Nota che stiamo creando molti matrici inutili. Se accettiamo di avere un IEnumerable<IEnumerable<string>> come risultato, possiamo togliere questo ToArray.

(potremmo riscrivere molte righe insieme, l'eliminazione di molti let, come

 from g in splitter(t.Substring(n)) 
     select (new[] { s }).Concat(g).ToArray()

ma noi non lo faremo per chiarezza)

E noi concatenare la sillaba "corrente" con le sillabe trovati .

 .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

Qui si potrebbe ricostruire la query un poco in modo di non usare questo Concat e creare array vuoti, ma sarebbe un po 'complessa (potremmo riscrivere l'intera funzione lambda come isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction)

In Alla fine, se l'intera parola è una sillaba, aggiungiamo l'intera parola come una sillaba.Altrimenti Concat un array vuoto (quindi non Concat)

fonte

2011-10-06 13:26:29 xanatos

Potresti avere problemi con il ridimensionamento, ma non sono sicuro di quanto sia grande il set di dati, ma una soluzione basata su un semplice 'questa è una sillaba?' dovrai chiamare la tua routine di 'sillaba rilevamento' all'incirca 0 (n * n) per ogni parola dove n = il numero di caratteri nella parola (se questo non ha significato, significa che è probabile che sia lento per i set di dati di grandi dimensioni!) . Ciò non tiene conto della scalabilità dell'algoritmo di rilevamento che potrebbe anche rallentare man mano che si aggiungono altre sillabe. .

So che hai detto che il tuo processo per identificare cosa sia una sillaba o meno non è rilevante, ma diciamo che puoi cambiarlo per farlo funzionare più come completamento automatico, cioè passare all'inizio di una sillaba e dirgli tutto le sillabe che sono possibili da questo punto sarebbero molto più scalabili. Date un'occhiata a sostituirlo con un trie se le prestazioni sfuggono di mano.

fonte

2011-10-06 13:53:39

Generazione di combinazioni di sottostringhe da una stringa

risposta

Problemi correlati