Conteggio delle frequenze di parola Java 8

55

voglio condividere la soluzione che ho trovato, perché in un primo momento mi aspettavo di usare carta-e-ridurre i metodi, ma era un po 'diverso.

Map<String, Long> collect = 
     wordsList.stream().collect(groupingBy(Function.identity(), counting()));

O per valori interi:

Map<String, Integer> collect = 
     wordsList.stream().collect(groupingBy(Function.identity(), summingInt(e -> 1)));

EDIT

aggiungo come ordinare la mappa per valore:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet() 
      .stream() 
      .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder())) 
      .collect(Collectors.toMap(
        Map.Entry::getKey, 
        Map.Entry::getValue, 
        (v1, v2) -> { 
         throw new IllegalStateException(); 
        }, 
        LinkedHashMap::new 
      ));

fonte

2015-03-18 12:43:02 Mouna

+13

Perché 'summingInt (e -> 1) '? Basta usare ['counting()'] (http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#counting--). Il risultato sarà "Map ", ma non dovrebbe ferire. Btw. puoi sostituire 'e-> e' con' Function.identity() '(ma non devi). – Holger

15

(Nota: vedere le modifiche sotto)

In alternativa al Mounas answer, qui è un approccio che conta la parola in parallelo:

import java.util.Arrays; 
import java.util.List; 
import java.util.Map; 
import java.util.stream.Collectors; 

public class ParallelWordCount 
{ 
    public static void main(String[] args) 
    { 
     List<String> list = Arrays.asList(
      "hello", "bye", "ciao", "bye", "ciao"); 
     Map<String, Integer> counts = list.parallelStream(). 
      collect(Collectors.toConcurrentMap(
       w -> w, w -> 1, Integer::sum)); 
     System.out.println(counts); 
    } 
}

EDIT In risposta al commento, ho fatto un piccolo test con JMH, confrontando il toConcurrentMap e l'approccio groupingByConcurrent, con diverse dimensioni di lista di input e parole casuali di diverse lunghezze. Questo test suggeriva che l'approccio toConcurrentMap era più veloce. Quando si considera quanto diversi questi approcci siano "sotto la cappa", è difficile prevedere qualcosa di simile.

Come ulteriore estensione, sulla base di ulteriori commenti, ho esteso il test per coprire tutte e quattro le combinazioni di toMap, groupingBy, seriale e parallelo.

I risultati sono ancora che l'approccio toMap è più veloce, ma inaspettatamente (almeno, per me) le versioni "concorrenti" in entrambi i casi sono più lenti rispetto alle versioni di serie ...:

   (method) (count) (wordLength) Mode Cnt  Score Error Units 
     toConcurrentMap  1000   2 avgt 50 146,636 ± 0,880 us/op 
     toConcurrentMap  1000   5 avgt 50 272,762 ± 1,232 us/op 
     toConcurrentMap  1000   10 avgt 50 271,121 ± 1,125 us/op 
       toMap  1000   2 avgt 50 44,396 ± 0,541 us/op 
       toMap  1000   5 avgt 50 46,938 ± 0,872 us/op 
       toMap  1000   10 avgt 50 46,180 ± 0,557 us/op 
      groupingBy  1000   2 avgt 50 46,797 ± 1,181 us/op 
      groupingBy  1000   5 avgt 50 68,992 ± 1,537 us/op 
      groupingBy  1000   10 avgt 50 68,636 ± 1,349 us/op 
groupingByConcurrent  1000   2 avgt 50 231,458 ± 0,658 us/op 
groupingByConcurrent  1000   5 avgt 50 438,975 ± 1,591 us/op 
groupingByConcurrent  1000   10 avgt 50 437,765 ± 1,139 us/op 
     toConcurrentMap 10000   2 avgt 50 712,113 ± 6,340 us/op 
     toConcurrentMap 10000   5 avgt 50 1809,356 ± 9,344 us/op 
     toConcurrentMap 10000   10 avgt 50 1813,814 ± 16,190 us/op 
       toMap 10000   2 avgt 50 341,004 ± 16,074 us/op 
       toMap 10000   5 avgt 50 535,122 ± 24,674 us/op 
       toMap 10000   10 avgt 50 511,186 ± 3,444 us/op 
      groupingBy 10000   2 avgt 50 340,984 ± 6,235 us/op 
      groupingBy 10000   5 avgt 50 708,553 ± 6,369 us/op 
      groupingBy 10000   10 avgt 50 712,858 ± 10,248 us/op 
groupingByConcurrent 10000   2 avgt 50 901,842 ± 8,685 us/op 
groupingByConcurrent 10000   5 avgt 50 3762,478 ± 21,408 us/op 
groupingByConcurrent 10000   10 avgt 50 3795,530 ± 32,096 us/op

io non sono così esperto con JMH, forse ho fatto qualcosa di sbagliato qui - suggerimenti e correzioni sono benvenute:

import java.util.ArrayList; 
import java.util.List; 
import java.util.Map; 
import java.util.Random; 
import java.util.concurrent.TimeUnit; 
import java.util.function.Function; 
import java.util.stream.Collectors; 

import org.openjdk.jmh.annotations.Benchmark; 
import org.openjdk.jmh.annotations.BenchmarkMode; 
import org.openjdk.jmh.annotations.Mode; 
import org.openjdk.jmh.annotations.OutputTimeUnit; 
import org.openjdk.jmh.annotations.Param; 
import org.openjdk.jmh.annotations.Scope; 
import org.openjdk.jmh.annotations.Setup; 
import org.openjdk.jmh.annotations.State; 
import org.openjdk.jmh.infra.Blackhole; 

@State(Scope.Thread) 
public class ParallelWordCount 
{ 

    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"}) 
    public String method; 

    @Param({"2", "5", "10"}) 
    public int wordLength; 

    @Param({"1000", "10000" }) 
    public int count; 

    private List<String> list; 

    @Setup 
    public void initList() 
    { 
     list = createRandomStrings(count, wordLength, new Random(0)); 
    } 

    @Benchmark 
    @BenchmarkMode(Mode.AverageTime) 
    @OutputTimeUnit(TimeUnit.MICROSECONDS) 
    public void testMethod(Blackhole bh) 
    { 

     if (method.equals("toMap")) 
     { 
      Map<String, Integer> counts = 
       list.stream().collect(
        Collectors.toMap(
         w -> w, w -> 1, Integer::sum)); 
      bh.consume(counts); 
     } 
     else if (method.equals("toConcurrentMap")) 
     { 
      Map<String, Integer> counts = 
       list.parallelStream().collect(
        Collectors.toConcurrentMap(
         w -> w, w -> 1, Integer::sum)); 
      bh.consume(counts); 
     } 
     else if (method.equals("groupingBy")) 
     { 
      Map<String, Long> counts = 
       list.stream().collect(
        Collectors.groupingBy(
         Function.identity(), Collectors.<String>counting())); 
      bh.consume(counts); 
     } 
     else if (method.equals("groupingByConcurrent")) 
     { 
      Map<String, Long> counts = 
       list.parallelStream().collect(
        Collectors.groupingByConcurrent(
         Function.identity(), Collectors.<String> counting())); 
      bh.consume(counts); 
     } 
    } 

    private static String createRandomString(int length, Random random) 
    { 
     StringBuilder sb = new StringBuilder(); 
     for (int i = 0; i < length; i++) 
     { 
      int c = random.nextInt(26); 
      sb.append((char) (c + 'a')); 
     } 
     return sb.toString(); 
    } 

    private static List<String> createRandomStrings(
     int count, int length, Random random) 
    { 
     List<String> list = new ArrayList<String>(count); 
     for (int i = 0; i < count; i++) 
     { 
      list.add(createRandomString(length, random)); 
     } 
     return list; 
    } 
}

I tempi sono simili solo per il caso di serie di una lista con 10000 elementi, e parole di 2 lettere.

Potrebbe essere utile verificare se per le liste di dimensioni ancora più grandi, le versioni concorrenti alla fine superano quelle seriali, ma attualmente non hanno il tempo per un altro benchmark dettagliato eseguito con tutte queste configurazioni.

fonte

2015-03-18 16:18:52 Marco13

+0

Immagino che 'Collectors.groupingByConcurrent (w-> w, Collectors.counting())' sarà più efficiente. – Holger

+0

@Holger Ho aggiunto un EDIT a riguardo. – Marco13

+0

Il tuo dovrebbe anche preoccuparsi del numero di * uguale * parole. La contesa per una singola voce sulla mappa potrebbe avere un impatto significativo. Il conteggio di migliaia di parole distinte potrebbe funzionare senza alcuna contesa in 'ConcurrentMap' di Java 8 anche se non chiamerei la memorizzazione del conteggio del' 1's ". Quindi, contando migliaia di occorrenze della stessa parola potrebbe dare un'immagine diversa ... – Holger

3

Se si utilizza Eclipse Collections, è sufficiente convertire lo List in un Bag.

Bag<String> words = Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag(); 
Assert.assertEquals(2, words.occurrencesOf("ciao")); 
Assert.assertEquals(1, words.occurrencesOf("hello")); 
Assert.assertEquals(2, words.occurrencesOf("bye"));

Questo codice funzionerà con Java 5 - 8.

Nota: Sono un committer per Eclipse Collezioni

fonte

2015-03-19 02:59:54

2

mi presento qui la soluzione che ho fatto (quella con raggruppamento è molto meglio :)).

static private void test0(List<String> input) { 
    Set<String> set = input.stream() 
      .collect(Collectors.toSet()); 
    set.stream() 
      .collect(Collectors.toMap(Function.identity(), 
        str -> Collections.frequency(input, str))); 
}

Solo il mio 0.02 $

fonte

2015-03-19 19:36:57 Eugene

+0

Sapevo di Collections.frequency (input, str). Grazie per il tuo contributo. – Sam

0

Un altro 2 cent di mine, dato un array:

import static java.util.stream.Collectors.*; 

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};  
Map<String, Integer> collected 
= Arrays.stream(str) 
     .collect(groupingBy(Function.identity(), 
        collectingAndThen(counting(), Long::intValue)));

fonte

2017-11-13 21:03:59 Sam

0

Ecco un modo per creare una mappa di frequenza utilizzando le funzioni della mappa.

List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList()); 
Map<String, Integer> frequencyMap = new HashMap<>(); 

words.forEach(word -> 
     frequencyMap.merge(word, 1, (v, newV) -> v + newV) 
); 

System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}

O

words.forEach(word -> 
     frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1) 
);

fonte

2018-02-24 16:45:51 Piyush

0

Trova elemento più frequente nella raccolta, con i generici:

private <V> V findMostFrequentItem(final Collection<V> items) 
{ 
    return items.stream() 
     .filter(Objects::nonNull) 
     .collect(Collectors.groupingBy(Functions.identity(), Collectors.counting())).entrySet().stream() 
     .max(Comparator.comparing(Entry::getValue)) 
     .map(Entry::getKey) 
     .orElse(null); 
}

frequenze

Compute voce:

private <V> Map<V, Long> findFrequencies(final Collection<V> items) 
{ 
    return items.stream() 
     .filter(Objects::nonNull) 
     .collect(Collectors.groupingBy(Functions.identity(), Collectors.counting())); 
}

fonte

2018-02-27 14:33:45 nejckorasa

Conteggio delle frequenze di parola Java 8

risposta

Problemi correlati